We present LLM-Neo, a parameter-efficient knowledge distillation (KD) framework that leverages Low-Rank Adaptation (LoRA) to transfer knowledge from a large “teacher” model into a much smaller “student” model. Specifically, we distill Meta-Llama-3-8B (quantized to 4-bit) into Llama-3.2-1B augmented with LoRA adapters (r=4, α=8) on the Stanford Sentiment Treebank (SST-2) classification task. By training only ∼0.1 % of the student’s parameters via LoRA, LLM-Neo retains over 95 % of the teacher’s performance while achieving a final SST-2 accuracy of 0.9346. This approach reduces GPU memory usage and end-to-end training time, demonstrating that KD+LoRA is an effective, low-cost path to compressing large language models without sacrificing much accuracy.
Large language models (LLMs) such as Meta-Llama-3-8B have demonstrated state-of-the-art performance on many NLP benchmarks but come at a high computational and memory cost. For real-world deployment—especially on resource-constrained devices—it is crucial to compress these models while retaining as much of their original accuracy as possible. Knowledge distillation (KD) [Hinton et al., 2015] is a popular compression technique that trains a smaller “student” model to mimic the outputs of a larger “teacher” model. However, naive KD still requires fine-tuning the majority of the student parameters, which can be expensive for models with hundreds of millions or billions of weights.
Recently, Low-Rank Adaptation (LoRA) [Hu et al., 2022] emerged as a parameter-efficient fine-tuning method: instead of updating all weights, LoRA adds a pair of low-rank matrices into each transformer layer and only trains those. This combination drastically reduces the number of trainable parameters (often to 0.1–1 % of the model) while preserving performance. In this work, we propose LLM-Neo, which unifies KD and LoRA into a single pipeline for compressing large LLMs. We use Meta-Llama-3-8B (4-bit quantized) as the teacher and distill it into a LoRA-adorned Llama-3.2-1B student on the Stanford Sentiment Treebank (SST-2) dataset. Our method draws inspiration from the recent arXiv paper “Parameter Efficient Knowledge Distillation for Large Language Models,” adapting their ideas to the Llama-3 family. We show that by training just ∼0.1 % of parameters via LoRA, LLM-Neo achieves a final test accuracy of 0.9346 on SST-2—over 95 % of the teacher’s accuracy—while drastically reducing memory footprint and training time.
Our goal is to transfer knowledge from a large 8-billion-parameter teacher (Meta-Llama-3-8B) into a 1-billion-parameter student (Llama-3.2-1B) with minimal additional trainable parameters. LLM-Neo achieves this via LoRA-based adapters inserted in each transformer layer and a hybrid KD loss that balances teacher imitation with ground-truth supervision.
bitsandbytes to reduce GPU memory usage during distillation.r = 4, scaling factor α = 8) into all query/key/value projection matrices of each transformer layer. Only these LoRA matrices (∼0.1 % of total parameters) are trainable. The remainder of Llama-3.2-1B remains frozen.We denote:
The combined loss per sample is defined as:
where:
By freezing the majority of weights (both teacher and student) and updating only low-rank adapters, LLM-Neo dramatically reduces memory and compute cost compared to full fine-tuning.
We evaluate LLM-Neo on SST-2 (Stanford Sentiment Treebank) to classify sentences as “positive” or “negative.” Below are detailed steps and dataset statistics.
Meta-Llama-3-Tokenizer. Truncate or pad to max length 128.from transformers import AutoModelForSequenceClassification, AutoTokenizer teacher_tokenizer = AutoTokenizer.from_pretrained("Meta/Llama-3-8B-4bit") teacher_model = AutoModelForSequenceClassification.from_pretrained( "Meta/Llama-3-8B-4bit", load_in_4bit=True, device_map="auto" ) teacher_model.eval() # freeze
from peft import LoraConfig, get_peft_model from transformers import AutoModelForSequenceClassification, AutoTokenizer student_tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-3.2-1B") base_student = AutoModelForSequenceClassification.from_pretrained("NousResearch/Llama-3.2-1B") # Freeze full student for param in base_student.parameters(): param.requires_grad = False # Configure LoRA adapters lora_config = LoraConfig( r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj"], lora_dropout=0.1, bias="none", task_type="SEQ_CLS" ) student_model = get_peft_model(base_student, lora_config)
DataLoader with batch size 16, shuffle train, no augmentation.| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 16 |
| Initial Learning Rate | 5e-5 |
| LoRA Rank (r) | 4 |
| LoRA Alpha (α) | 8 |
| Temperature (KD) | 2.0 |
| Loss Mix Ratio (α_gap) | 0.5 |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Max Sequence Length | 128 |
| Gradient Accumulation | None |
All experiments run on a single-GPU setup with LoRA and 4-bit teacher to minimize memory.
Below are the logged training checkpoints and corresponding validation metrics for LLM-Neo on SST-2:
| Step | Training Loss | Progress (Epoch %) | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1000 | 0.3856 | 0.2375 | 0.3681 | 0.9243 | 0.9273 | 0.9073 | 0.9482 |
| 2000 | 0.3681 | 0.4751 | 0.3634 | 0.9266 | 0.9297 | 0.9077 | 0.9527 |
| 3000 | 0.3648 | 0.7126 | 0.3599 | 0.9346 | 0.9366 | 0.9253 | 0.9482 |
| 4000 | 0.3662 | 0.9501 | 0.3580 | 0.9278 | 0.9310 | 0.9062 | 0.9572 |
These results confirm that LLM-Neo achieves over 95 % of teacher performance while requiring a fraction of GPU memory and training cost. The small drop (∼ 1.5 %) in absolute accuracy is offset by the drastic reduction in trainable parameters and resource usage.
We have introduced LLM-Neo, a simple yet effective framework for parameter-efficient knowledge distillation of large language models. By integrating LoRA into the student and using a hybrid KD+CE loss, we distill Meta-Llama-3-8B into Llama-3.2-1B with only ∼0.1 % of the student’s parameters trainable. On the SST-2 sentiment classification benchmark, LLM-Neo achieves 0.9346 accuracy—over 95 % of the teacher’s performance—while reducing GPU memory footprint by > 60 % and speeding up inference.
Key takeaways:
Future Work may explore:
Dive into the full implementation and all hyperparameter scripts on GitHub:
➡️ LLM NEO
Let’s keep in touch!
We hope LLM-Neo paves the way for more widespread deployment of powerful LLMs in resource-constrained settings. Thanks for your interest—happy building! 🚀