
This project fine-tunes Phi-3 Mini (~3.8B parameters) using LoRA (Low-Rank Adaptation) on the MBPP dataset to build a lightweight, specialized model that automatically detects and repairs common Python code bugs, including syntax errors, indentation mistakes, incorrect loops, and variable misuse, all trainable on a single NVIDIA T4 GPU in under one hour.
Debugging is one of the most time-consuming tasks in a developer's workflow, especially for beginners. While large frontier models like GPT-4 can handle code repair, they require expensive API calls and are not tailored for lightweight, offline, or embedded use cases.
This project explores a targeted question:
Can a compact open-weights LLM, fine-tuned with parameter-efficient methods, reliably fix common Python bugs without the overhead of a massive model?
The answer, as demonstrated in this work, is yes, with meaningful caveats.
Automated Program Repair (APR) has been studied for decades, but LLM-based approaches have opened new possibilities. Rather than symbolic rule-matching, neural models can generalize across syntactic patterns and produce human-readable corrections.
Microsoft's Phi-3 Mini (~3.8B parameters) is a strong choice for this task: it is open-weights, already instruction-tuned, and small enough to fine-tune and serve on a single consumer-grade GPU.
Low-Rank Adaptation (LoRA) enables fine-tuning of large models by injecting small trainable rank-decomposition matrices into selected layers while freezing the original weights. This means far fewer trainable parameters, a much smaller memory footprint, and adapter weights that can be stored and shared separately from the base model.
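To make the parameter savings concrete, here is a back-of-the-envelope calculation for a single toy projection layer (the dimensions are illustrative, not Phi-3's actual shapes):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA replaces the weight update dW with B @ A,
    # where A is (r x d_in) and B is (d_out x r); W itself stays frozen.
    return r * (d_in + d_out)

# Toy example: one 4096 x 4096 projection matrix at rank r=16.
frozen = 4096 * 4096                               # 16,777,216 frozen weights
trainable = lora_trainable_params(4096, 4096, 16)  # 131,072 trainable weights
print(trainable / frozen)                          # well under 1% of the layer
```

Summed over the handful of targeted attention projections, this is why the total trainable fraction stays a tiny slice of the full model.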

The dataset is derived from the MBPP (Mostly Basic Python Problems) benchmark by Google Research, available on HuggingFace:
google-research-datasets/mbpp
MBPP contains ~1,000 beginner-level Python programming problems with reference solutions, making it well-suited for constructing supervised bug-fix pairs.
Each MBPP sample was transformed into an instruction-style bug-fix pair: a common bug is introduced into the reference solution to serve as the input, and the original solution serves as the target fix.
This approach provides clean, verifiable ground truth for every training sample.
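The actual conversion logic lives in the repository's `prepare_dataset.py`; as a hypothetical sketch of one injection rule (the function name and heuristics here are illustrative, not the project's exact code):

```python
def inject_missing_colon(code: str) -> str:
    """Hypothetical bug injector: drop the colon from the first
    compound-statement header (for/def/if/while), if one exists."""
    lines = code.splitlines()
    for i, line in enumerate(lines):
        stripped = line.rstrip()
        head = stripped.lstrip()
        if stripped.endswith(":") and head.startswith(("for ", "def ", "if ", "while ")):
            lines[i] = stripped[:-1]   # removing the colon yields a syntax error
            break
    return "\n".join(lines)

fixed = "for i in range(5):\n    print(i)"
buggy = inject_missing_colon(fixed)    # header loses its colon, body unchanged
```

Because the bug is injected mechanically from a known-good solution, the original code is always available as verifiable ground truth.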
| Split | Samples | Purpose |
|---|---|---|
| Train | ~900 | Model training |
| Validation | ~100 | Hyperparameter tuning |
| Test | ~100 | Final evaluation |
| Total | ~1,100 | - |
| Category | Examples |
|---|---|
| Syntax Errors | Missing colons, unclosed brackets |
| Indentation | Incorrect or missing indentation levels |
| Loop Errors | Wrong range, off-by-one iteration |
| Print Logic | Incorrect output formatting or calls |
| Variable Usage | Wrong variable name or undefined reference |
Each training example follows a structured instruction template:

```text
### Instruction:
Fix the bug in the following Python code.

### Input:
for i in range(5)
print(i)

### Response:
for i in range(5):
    print(i)
```
This format aligns with Phi-3's instruction-tuning style and ensures consistent model behavior at inference time.
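As a sketch, the template above can be produced by a small helper (the function name is illustrative):

```python
def build_example(buggy: str, fixed: str) -> str:
    """Render one training example in the Instruction/Input/Response template."""
    return (
        "### Instruction:\n"
        "Fix the bug in the following Python code.\n\n"
        "### Input:\n" + buggy + "\n\n"
        "### Response:\n" + fixed
    )

example = build_example("for i in range(5)\nprint(i)",
                        "for i in range(5):\n    print(i)")
```

At inference time the same template is used with an empty response section, so the model completes the fix after `### Response:`.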
The dataset was loaded and preprocessed with HuggingFace's `datasets` library.
| Criterion | Value | Rationale |
|---|---|---|
| Model | Phi-3 Mini | Efficient open-weights LLM |
| Architecture | Decoder-only Transformer | Standard for text generation |
| Parameters | ~3.8B | Balanced capability and efficiency |
| Context Length | 4,096 tokens | Sufficient for function-level code |
| Source | microsoft/Phi-3-mini-4k-instruct | HuggingFace Hub |
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # Rank of update matrices
    lora_alpha=32,                        # Scaling factor
    lora_dropout=0.05,                    # Dropout for regularization
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to attention layers
    bias="none",
    task_type="CAUSAL_LM",
)
```
Design Rationale:
- `r=16`: A moderate rank that balances expressiveness and memory efficiency
- `lora_alpha=32` (= 2×r): Standard scaling that provides stable gradient updates
- `target_modules=["q_proj", "v_proj"]`: Attention projection layers capture the most task-relevant patterns; targeting both query and value matrices provides effective adaptation
- `lora_dropout=0.05`: Light regularization to prevent overfitting on the small dataset

| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 3 | Sufficient convergence without overfit |
| Batch Size | 4 | Fits within 16GB VRAM constraint |
| Learning Rate | 2e-4 | Standard LoRA training rate |
| Optimizer | AdamW | Stable convergence with weight decay |
| Max Length | 256 | Covers typical Python function length |
| Precision | FP16 | Faster training, lower memory usage |
| Scheduler | Linear warmup | Smooth learning rate ramp-up |
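The table above maps directly onto HuggingFace `TrainingArguments`; a sketch of that configuration follows (the output path, warmup fraction, and logging cadence are placeholders, not the project's actual values):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="./phi3-debug-lora",   # placeholder output path
    num_train_epochs=3,               # 3 epochs
    per_device_train_batch_size=4,    # batch size 4, fits 16 GB VRAM
    learning_rate=2e-4,               # standard LoRA learning rate
    optim="adamw_torch",              # AdamW optimizer
    fp16=True,                        # FP16 precision
    lr_scheduler_type="linear",       # linear schedule after warmup
    warmup_ratio=0.03,                # assumed warmup fraction
    logging_steps=10,                 # assumed logging cadence
    report_to="wandb",                # W&B experiment tracking
)
```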
| Component | Specification |
|---|---|
| GPU | NVIDIA Tesla T4 |
| VRAM | 16 GB |
| Training Time | ~1 hour |
| Framework | HuggingFace Transformers + PEFT |
| Experiment Tracking | Weights & Biases (W&B) |
```text
1. Load Base Model (Phi-3 Mini, FP16)
   ↓
2. Apply LoRA Adapters (PEFT)
   ↓
3. Load & Preprocess Dataset (MBPP → Bug-Fix pairs)
   ↓
4. Train via HuggingFace Trainer (3 epochs)
   ↓
5. Monitor Loss via W&B Dashboard
   ↓
6. Evaluate on Held-Out Test Set
   ↓
7. Save & Publish LoRA Adapters to HuggingFace Hub
```

The training loss decreased steadily across all three epochs, indicating consistent learning without signs of divergence or severe overfitting on the training set. Validation loss tracked closely with training loss throughout, suggesting reasonable generalization given the dataset size.
Training metrics and loss curves are tracked live on the W&B Dashboard.
Input (Buggy Code):

```python
for i in range(5)
print(i)
```

Base Model Output:

```python
# Often reproduced the bug or added unrelated explanation text
for i in range(5)
print(i)
```

Fine-Tuned Model Output:

```python
for i in range(5):
    print(i)
```
Input (Buggy Code):

```python
def greet(name):
print("Hello, " + name)
```

Fine-Tuned Model Output:

```python
def greet(name):
    print("Hello, " + name)
```
Input (Buggy Code):

```python
total = 0
for num in range(10):
    total += number
print(total)
```

Fine-Tuned Model Output:

```python
total = 0
for num in range(10):
    total += num
print(total)
```

| Metric | Base Model | Fine-Tuned Model |
|---|---|---|
| Syntax Fix Accuracy | Low | Noticeably Higher |
| Indentation Correction | Inconsistent | Reliable |
| Variable Error Fixing | Occasional | Improved |
| Complex Logic Bugs | Limited | Limited (unchanged) |
| Instruction Adherence | Moderate | High |
Note: Quantitative metrics (e.g., exact match accuracy, CodeBLEU) were not computed due to dataset and tooling constraints. This is acknowledged as a limitation โ see Section 7.
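A minimal exact-match metric, of the kind a future quantitative pipeline could start from, is easy to sketch (the normalization choices here are assumptions, not the project's):

```python
def normalize(code: str) -> str:
    # Assumed normalization: strip trailing whitespace and outer blank lines.
    return "\n".join(line.rstrip() for line in code.strip().splitlines())

def exact_match(predictions, references) -> float:
    """Fraction of predictions identical to the reference after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

Exact match is a strict lower bound on correctness, since a semantically valid fix that differs in formatting or identifier choice scores zero; unit-test-based metrics relax this.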
```text
Phi3-debugLLM-LoRA/
├── data/
│   └── prepare_dataset.py   # MBPP → bug-fix pair conversion
├── training/
│   └── train.py             # LoRA fine-tuning script
├── evaluation/
│   └── evaluate.py          # Inference and qualitative testing
├── notebooks/
│   └── demo.ipynb           # End-to-end walkthrough
└── README.md
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the published LoRA adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Sud1212/phi3-debug-llm-lora")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = """### Instruction:
Fix the bug in the following Python code.

### Input:
for i in range(5)
print(i)

### Response:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
All training hyperparameters, dataset preprocessing steps, and model configurations are documented in the GitHub repository. The LoRA adapter weights are publicly available on HuggingFace Hub for direct use or further fine-tuning.
LoRA Efficiency: The parameter-efficient fine-tuning approach worked exactly as intended: the model adapted to the bug-fixing task in ~1 hour on a T4 GPU, with only a small fraction of parameters updated (~0.5% of total model weights).
Instruction Template Consistency: Adopting a structured prompt format (Instruction / Input / Response) significantly improved the model's ability to follow the task correctly at inference time, compared to unstructured prompting.
Syntax-Level Bug Fixing: The model demonstrated reliable performance on syntactically well-defined errors (missing colons, indentation mismatches, and simple variable name errors), the categories most consistently represented in the training data.
Dataset Scale: With ~900 training samples, the model's generalization to unseen bug patterns is inherently limited. Larger and more diverse datasets would improve robustness.
Prompt Sensitivity: Minor changes to the instruction phrasing at inference time occasionally produced inconsistent outputs. This is a known challenge with smaller instruction-tuned models.
Evaluation Gap: Without automated quantitative metrics (exact match, CodeBLEU, pass@k), it's difficult to precisely quantify improvement. Evaluation was primarily qualitative, a recognized limitation of this project.
Complex Logical Bugs: Multi-line logic errors (e.g., incorrect algorithm design, wrong conditional logic) remain largely out of reach for a model trained on this dataset. These require semantic understanding beyond surface-level pattern matching.
| Limitation | Description |
|---|---|
| Small Training Dataset | ~900 samples limits generalization to diverse or novel bug patterns |
| Qualitative Evaluation Only | No automated metrics (CodeBLEU, pass@k) computed; quantitative comparison absent |
| Syntax-Focused Scope | Model excels at surface-level bugs but struggles with complex logical errors |
| No Regression Testing | No checks to verify that "fixed" code is functionally correct beyond visual inspection |
| Single Language | Fine-tuned exclusively on Python; does not generalize to other programming languages |
| Prompt Sensitivity | Outputs can vary with small changes in prompt phrasing |
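The "No Regression Testing" limitation above has a straightforward remedy; a hypothetical regression check (helper name and test format are illustrative) could look like:

```python
def passes_reference_tests(fixed_code: str, test_lines: list) -> bool:
    """Hypothetical regression check: execute the repaired code, then run
    reference assert statements against the definitions it created."""
    namespace = {}
    try:
        exec(fixed_code, namespace)    # define functions from the candidate fix
        for test in test_lines:
            exec(test, namespace)      # each entry is an assert statement
        return True
    except Exception:
        return False                   # syntax error or failed assertion
```

MBPP ships assert-style reference tests with each problem, so a check of this shape could verify functional correctness of every "fixed" output rather than relying on visual inspection.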
Several extensions of this work could meaningfully improve both capability and reliability:
Larger and More Diverse Datasets: Incorporating datasets like BugsInPy, CodeNet, or synthetically augmented MBPP variants would improve generalization across bug categories.
Quantitative Evaluation Pipeline: Implementing automated metrics โ CodeBLEU, exact match accuracy, and pass@k using unit tests โ would provide rigorous, reproducible benchmarks.
Broader Bug Type Coverage: Extending training to cover algorithmic logic errors, API misuse, and type errors would push the model toward more practical utility.
Multi-Language Support: Adapting the pipeline for JavaScript, Java, or C++ bugs with language-specific datasets.
Integration as a VS Code / IDE Extension: Packaging the model as a lightweight local IDE plugin for real-time bug suggestions without API dependency.
RLHF / DPO Alignment: Using human preference feedback or Direct Preference Optimization to further align outputs toward syntactically and semantically correct fixes.
Quantization for Edge Deployment: Applying INT4/INT8 quantization (e.g., via GGUF/llama.cpp) to enable the model to run on CPU-only environments.
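For reference, the unbiased pass@k estimator mentioned above (introduced with the Codex evaluation methodology) fits in a few lines:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (of which c are correct) passes the tests."""
    if n - c < k:
        return 1.0                     # every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n` generations per problem and `c` of them passing the unit tests, averaging `pass_at_k` over problems yields the benchmark score.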
This project demonstrates that compact open-weights LLMs can be efficiently adapted for specialized developer tasks using parameter-efficient fine-tuning, without large compute budgets or proprietary APIs.
The fine-tuned Phi-3 Mini + LoRA model reliably repairs syntax-level Python bugs, follows the instruction template consistently, and runs from publicly available adapter weights on modest hardware.
While limitations around dataset scale and evaluation depth remain, this work establishes a clean, reusable pipeline for task-specific LLM adaptation, one that can be extended with richer data, broader bug coverage, and quantitative benchmarking.
The core insight: You don't need a 70B model to fix a missing colon. With the right training setup, a 3B model can learn to do it reliably.
| Resource | Link |
|---|---|
| GitHub Repository | suddhumaddi/Phi3-debugLLM-LoRA |
| HuggingFace Model | Sud1212/phi3-debug-llm-lora |
| W&B Dashboard | wandb.ai/suddhumaddi-woxsen-university |
| Base Dataset | google-research-datasets/mbpp |
| Tool / Library | Purpose |
|---|---|
| HuggingFace Transformers | Model loading, tokenization, training |
| PEFT (LoRA) | Parameter-efficient fine-tuning |
| PyTorch | Deep learning backend |
| Weights & Biases | Experiment tracking & visualization |
| HuggingFace Hub | Model & adapter hosting |
| Google Colab / Kaggle | GPU compute environment |
Sudarshan Maddi
Student, Woxsen University
GitHub | HuggingFace | W&B
For questions, issues, or collaboration inquiries, please open an issue on the GitHub Repository.
Published on ReadyTensor · Educational Content · Academic Solution Showcase