This project demonstrates the application of parameter-efficient fine-tuning (PEFT) to adapt the Qwen3-1.7B-Base large language model to scientific article summarization. By combining 4-bit quantization with LoRA (Low-Rank Adaptation), i.e. the QLoRA recipe, we fine-tune the model on a subset of the PubMed summarization dataset while keeping training feasible on a single consumer-grade GPU.
The goal is to improve the model’s ability to generate concise, factual summaries of biomedical research articles — a critical capability for accelerating scientific literature review and knowledge extraction.
We use Qwen3-1.7B-Base, the base (non-instruct) variant of the Qwen3 series developed by Alibaba Cloud. Because the base model is not instruction-tuned, it requires supervised fine-tuning (SFT) to perform domain-specific tasks such as summarization.
We use a curated subset of the PubMed Summarization dataset (originally from ccdv/pubmed-summarization), which contains scientific article–abstract pairs from the biomedical literature.
| Split | Size | Source |
|---|---|---|
| Train | 10,000 | ./dataset/train.parquet |
| Validation | 1,000 | ./dataset/valid.parquet |
| Test | 1,000 | ./dataset/test.parquet |
Each sample contains:

- `article`: full text of a scientific paper
- `abstract`: expert-written summary (the target)

The dataset was preprocessed and saved in Apache Parquet format for fast, memory-efficient loading.
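For reference, the splits can be loaded directly from the Parquet files with the Hugging Face `datasets` library. This is a minimal loading sketch, assuming the paths and column names listed above:

```python
from datasets import load_dataset

# Load the preprocessed Parquet splits (paths as listed in the table above).
data_files = {
    "train": "./dataset/train.parquet",
    "validation": "./dataset/valid.parquet",
    "test": "./dataset/test.parquet",
}
dataset = load_dataset("parquet", data_files=data_files)

print(dataset)                                 # DatasetDict with train/validation/test splits
print(dataset["train"][0]["abstract"][:200])   # peek at one target summary
```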
Before fine-tuning, we evaluated the zero-shot performance of the quantized Qwen3-1.7B-Base model on the test set using greedy decoding (`max_new_tokens=1024`, `do_sample=False`).
ROUGE scores (F1, %) were computed against ground-truth abstracts:
| Metric | Baseline |
|---|---|
| ROUGE-1 | 38.03 |
| ROUGE-2 | 12.26 |
| ROUGE-L | 21.35 |
| ROUGE-Lsum | 31.45 |
The model produced generic or incomplete summaries (e.g., “The article discusses…”), confirming the need for domain adaptation.
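A condensed sketch of how this baseline pass can be reproduced, assuming the Hugging Face model id `Qwen/Qwen3-1.7B-Base` and the `evaluate` package for ROUGE; the prompt wording and input truncation length here are illustrative rather than the project's exact settings:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-1.7B-Base"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

test_split = load_dataset("parquet", data_files={"test": "./dataset/test.parquet"})["test"]

@torch.no_grad()
def summarize(article: str) -> str:
    # Greedy decoding with the baseline settings; prompt wording and the
    # 3072-token input cutoff are assumptions, not the project's exact choices.
    prompt = f"Article:\n{article}\n\nSummary:\n"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=3072).to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

rouge = evaluate.load("rouge")
predictions = [summarize(ex["article"]) for ex in test_split]
scores = rouge.compute(predictions=predictions,
                       references=[ex["abstract"] for ex in test_split])
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum (F1, 0-1 scale)
```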
To assess whether fine-tuning on summarization degrades the model's general commonsense reasoning capabilities, we evaluated it on the HellaSwag benchmark, a multiple-choice dataset that tests physical and social reasoning via sentence completion.
Using a fully offline, custom implementation compatible with quantized models, we computed zero-shot accuracy by ranking candidate endings by log-probability:
| Setting | HellaSwag Accuracy |
|---|---|
| Before QLoRA | 47.04% |
| After QLoRA | 46.36% |
| Δ | –0.68 pp |
🟡 Key insight: Fine-tuning on PubMed led to a minor degradation in commonsense reasoning (–0.68 pp), which is negligible and within typical run-to-run variance. This suggests the model retained strong general language understanding despite domain specialization.
💡 Crucially, the drop is small (<1 pp), indicating no catastrophic forgetting — the model remains broadly competent while gaining domain-specific summarization skills.
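The scoring boils down to comparing the log-probability the model assigns to each candidate ending given its context. A simplified sketch of that scoring function; length-normalized scoring is a common convention and an assumption here, and the actual script also handles HellaSwag preprocessing and offline data loading:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_endings(model, tokenizer, context: str, endings: list[str]) -> int:
    """Return the index of the ending with the highest average token log-prob."""
    scores = []
    for ending in endings:
        # Assumes the context tokenization is a prefix of the full tokenization,
        # which holds for typical BPE tokenizers when the ending starts with a space.
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids.to(model.device)

        logits = model(full_ids).logits                     # (1, seq_len, vocab)
        logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # prob of each next token
        targets = full_ids[:, 1:]
        token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

        # Score only the ending tokens, normalized by ending length.
        ending_lp = token_lp[:, ctx_ids.shape[1] - 1:]
        scores.append(ending_lp.mean().item())
    return int(torch.tensor(scores).argmax())
```

Zero-shot accuracy is then the fraction of examples where the top-ranked ending matches the gold label.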
We employed a full QLoRA pipeline to maximize efficiency and stability:
- 4-bit quantization via `bitsandbytes`
- `bfloat16` compute dtype for numerical stability
- LoRA adapters (`r=8`, `α=16`) applied to the `q_proj` and `v_proj` layers only
- Prompt template: the instruction `You are a helpful assistant who writes concise, factual summaries…`, followed by the article under an `Article:` heading (`{full_text}`) and the target under a `Summary:` heading (`{abstract}`)
Tokenization (see the sketch after the hyperparameter table):

- `add_special_tokens=True` for the prompt, `False` for the summary
- Labels set to `-100` for prompt tokens, so the loss is computed only on the summary

Training hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 (early stopping enabled) |
| Batch size | 1 (gradient accumulation not used) |
| Optimizer | paged_adamw_8bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 250 steps |
| Mixed precision | bf16 |
| Gradient checkpointing | Enabled |
| Eval frequency | Every 200 steps |
| Early stopping | Patience = 10, Δ = 1e-4 |
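A sketch of the prompt construction and label masking described above; `build_example`, `IGNORE_INDEX`, and the `max_len` cutoff are illustrative names and values rather than the project's actual code:

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def build_example(tokenizer, article: str, abstract: str, max_len: int = 4096) -> dict:
    prompt = (
        "You are a helpful assistant who writes concise, factual summaries…\n\n"
        f"Article:\n{article}\n\nSummary:\n"
    )
    # Special tokens only on the prompt side, as described above.
    prompt_ids = tokenizer(prompt, add_special_tokens=True)["input_ids"]
    summary_ids = tokenizer(abstract, add_special_tokens=False)["input_ids"]
    summary_ids = summary_ids + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + summary_ids)[:max_len]
    # Loss is computed only on the summary: prompt positions are masked with -100.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + summary_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```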
A custom collator ensured correct padding and label masking. ROUGE metrics were computed post-hoc to avoid OOM during training.
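Putting the pieces together, a hedged sketch of the training setup with `peft` and the Hugging Face `Trainer`. It reuses `build_example` from the sketch above, uses `DataCollatorForSeq2Seq` as a stand-in for the project's custom collator, and the `lora_dropout` value and output directory name are illustrative:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForSeq2Seq, EarlyStoppingCallback, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-1.7B-Base"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)   # prepares the 4-bit model for PEFT training

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,       # dropout value is illustrative
    target_modules=["q_proj", "v_proj"],         # as in the table above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Tokenize with build_example from the sketch above.
raw = load_dataset("parquet", data_files={"train": "./dataset/train.parquet",
                                          "validation": "./dataset/valid.parquet"})
tokenized = raw.map(lambda ex: build_example(tokenizer, ex["article"], ex["abstract"]),
                    remove_columns=raw["train"].column_names)

# Stand-in for the project's custom collator: pads inputs and -100-pads labels.
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)

args = TrainingArguments(
    output_dir="qwen3-pubmed-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=250,
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10,
                                     early_stopping_threshold=1e-4)],
)
trainer.train()
```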
After fine-tuning, the model shows consistent improvement across all ROUGE metrics:
| Metric | Baseline | After QLoRA | Δ |
|---|---|---|---|
| ROUGE-1 | 38.03 | 39.75 | +1.72 |
| ROUGE-2 | 12.26 | 15.37 | +3.11 |
| ROUGE-L | 21.35 | 22.21 | +0.86 |
| ROUGE-Lsum | 31.45 | 36.53 | +5.08 |
💡 ROUGE-Lsum improved by +5.1 points — indicating significantly better sentence-level summary coverage.

Qualitative inspection of the generated summaries showed corresponding improvements.
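For inference and the post-fine-tuning evaluation, the LoRA adapter is attached to the quantized base model at generation time. A sketch assuming the adapter was saved to the `qwen3-pubmed-qlora` directory used in the training sketch above:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-1.7B-Base"
adapter_dir = "qwen3-pubmed-qlora"   # illustrative path; use wherever the adapter was saved

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)   # attach the fine-tuned LoRA adapter
model.eval()

article_text = "..."   # full text of the paper to summarize
prompt = ("You are a helpful assistant who writes concise, factual summaries…\n\n"
          f"Article:\n{article_text}\n\nSummary:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```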
We successfully demonstrated that QLoRA enables effective domain adaptation of Qwen3-1.7B on consumer hardware, yielding measurable gains in scientific summarization quality with minimal computational cost.
This project serves as a reproducible blueprint for efficient LLM adaptation in low-resource research settings — particularly valuable for biomedical NLP where domain expertise and compute often compete.