Author: K.V. Mokshith Rao
Program: LLMED Certification — Module 1 Capstone Project
Institution: International Institute of Information Technology
This project fine-tunes nlpaueb/legal-bert-base-uncased using LoRA
(Low-Rank Adaptation) on the CUAD dataset for multi-class contract
clause classification across 41 legal clause types. The fine-tuned
model achieves 71.46% accuracy and 0.677 weighted F1, compared
to a baseline of 3.28% accuracy — a dramatic improvement demonstrating
the effectiveness of parameter-efficient fine-tuning for legal NLP tasks.
No catastrophic forgetting was detected on the MMLU general benchmark.
Contract clause classification is a high-value legal NLP task. Legal
professionals spend significant time manually reviewing contracts to
identify the type and implications of each clause — a process that
is slow, expensive, and error-prone.
The goal of this project is to fine-tune a domain-specific language
model (LegalBERT) to automatically classify contract clauses into one
of the 41 CUAD clause types, such as Governing Law, Anti-Assignment,
and Termination for Convenience.
Legal NLP is one of the most impactful application areas for LLMs.
The CUAD dataset provides expert-labelled annotations from real
commercial contracts, making it an ideal benchmark for fine-tuning
evaluation. Automating clause classification can make contract review
faster, cheaper, and more consistent.
| Property | Value |
|---|---|
| Name | CUAD (Contract Understanding Atticus Dataset) |
| Source | theatticusproject/cuad-qa |
| Task | Multi-class sequence classification |
| Classes | 41 CUAD clause types |
| License | Creative Commons Attribution 4.0 |
| Language | English |
The raw CUAD dataset was preprocessed into a structured CSV file
(clauses.csv) with two columns: text (the clause text) and
label (the clause type name).
Preprocessing steps:
- Converted string labels to integer IDs with `sklearn.LabelEncoder`
- Saved the label mapping to `label_mapping.json` for reproducibility
- Split train/test with stratification and `random_state=42`

Dataset Statistics:
| Split | Size |
|---|---|
| Train | ~7,930 examples |
| Test | 1,983 examples |
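The preprocessing steps above can be sketched as follows. This is a minimal stdlib-only sketch: the toy rows, the manual label encoding (equivalent to what `sklearn.LabelEncoder` produces), and the plain 80/20 shuffle stand in for the real pipeline, which reads `clauses.csv` and uses sklearn's stratified split.

```python
import json
import random

# Toy stand-in for clauses.csv: (text, label) pairs.
rows = [
    ("This Agreement shall be governed by the laws of ...", "Governing Law"),
    ("Either party may terminate upon 30 days notice.", "Termination for Convenience"),
    ("Licensee shall not sublicense any rights herein.", "Anti-Assignment"),
] * 10

# Encode string labels to integer IDs (same result as sklearn.LabelEncoder:
# classes are sorted alphabetically, then numbered 0..n-1).
classes = sorted({label for _, label in rows})
label2id = {label: i for i, label in enumerate(classes)}

# Persist the mapping for reproducibility, mirroring label_mapping.json.
with open("label_mapping.json", "w") as f:
    json.dump(label2id, f, indent=2)

# Deterministic shuffle + 80/20 split, mirroring random_state=42.
# (The real pipeline uses sklearn's train_test_split with stratify=labels.)
random.seed(42)
random.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
print(len(train), len(test), label2id)
```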
Tokenization:
- Tokenizer: `nlpaueb/legal-bert-base-uncased` (max_length=256, with truncation)

Class Distribution Note:
The CUAD dataset is significantly imbalanced. Some classes have hundreds
of examples (e.g., Renewal Term: 133 test samples) while others have
very few (e.g., Agreement Date: 1 sample). This imbalance directly
impacts per-class performance and is discussed in the Limitations section.
Model: nlpaueb/legal-bert-base-uncased
LegalBERT was chosen because it is pre-trained specifically on legal
text corpora including legal contracts, court decisions, and legislation.
This domain-specific pre-training gives it an advantage over general
BERT models for legal NLP tasks.
Unlike general-purpose models like bert-base-uncased, LegalBERT
understands legal terminology, clause structures, and contract language
patterns natively — making it the ideal base for fine-tuning on CUAD.
Instead of full fine-tuning (which updates all ~110M parameters),
we used LoRA (Low-Rank Adaptation) via the HuggingFace PEFT library.
LoRA injects small trainable matrices into the attention layers,
dramatically reducing the number of trainable parameters while
maintaining strong task performance.
LoRA Configuration:
| Parameter | Value | Reasoning |
|---|---|---|
| Rank (r) | 16 | Good balance of capacity vs efficiency |
| Alpha | 32 | Standard 2x rank scaling |
| Dropout | 0.1 | Regularization to prevent overfitting |
| Target modules | query, value | Standard BERT LoRA targets |
| Task type | SEQ_CLS | Sequence classification |
| Bias | none | Standard setting |
Trainable parameters: Only ~0.5% of total parameters are trained,
making this extremely parameter-efficient compared to full fine-tuning.
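The ~0.5% figure can be checked with a back-of-the-envelope count: BERT-base has 12 layers with hidden size 768, and LoRA with r=16 on the query and value projections adds an A (r×d) and a B (d×r) matrix per targeted module. A quick sketch (the classifier head is ignored, and 110M is an approximate total):

```python
# Back-of-the-envelope LoRA parameter count for bert-base (12 layers, d=768).
d, r, layers = 768, 16, 12
per_module = d * r + r * d          # A (r x d) plus B (d x r)
modules_per_layer = 2               # query and value projections
lora_params = per_module * modules_per_layer * layers
total_params = 110_000_000          # ~110M parameters in bert-base
print(f"LoRA params: {lora_params:,} (~{100 * lora_params / total_params:.2f}% of total)")
```

This lands at roughly 0.59M trainable parameters, about 0.54% of the base model, consistent with the ~0.5% reported above.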
Training Arguments:
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Learning rate | 2e-4 |
| Batch size (train) | 16 |
| Batch size (eval) | 16 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Eval strategy | Per epoch |
| Save strategy | Per epoch |
| Best model loading | True |
| Optimizer | AdamW (default) |
Hardware & Framework:
| Property | Value |
|---|---|
| Hardware | Kaggle GPU (T4) |
| Framework | HuggingFace Transformers + PEFT 0.18.1 |
| Experiment tracking | Weights & Biases |
| Training time | ~15 minutes |
Experiment Tracking:
🔗 W&B Project — contract-intelligence
The base model (nlpaueb/legal-bert-base-uncased) was evaluated on
the test set without any fine-tuning. This establishes the reference
point for measuring improvement.
| Metric | Baseline (Untrained) |
|---|---|
| Accuracy | 3.28% |
| Weighted F1 | 0.0082 |
| Macro F1 | 0.0053 |
The near-random performance (3.28% vs 1/41 = 2.4% random chance)
confirms the base model has no prior knowledge of CUAD clause types.
It essentially guesses, demonstrating that fine-tuning is necessary
for this task.
The model was trained for 5 epochs with evaluation after each epoch.
Loss decreased consistently and accuracy improved steadily, with no
signs of overfitting.
| Epoch | Train Loss | Val Loss | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|---|
| 1 | 5.992 | 4.285 | 43.22% | 0.316 | 0.158 |
| 2 | 2.881 | 2.485 | 65.81% | 0.601 | 0.382 |
| 3 | 2.203 | 2.124 | 69.79% | 0.651 | 0.448 |
| 4 | 1.958 | 2.005 | 71.05% | 0.668 | 0.488 |
| 5 | 1.852 | 1.944 | 71.46% | 0.677 | 0.502 |
Key observations:
- Validation loss decreased every epoch (4.285 → 1.944), with no sign of overfitting
- Accuracy climbed from 43.22% after epoch 1 to 71.46% after epoch 5
- Macro F1 kept improving through the final epoch, suggesting rarer classes were still being learned

After 5 epochs of LoRA fine-tuning:
```
==================================================
MODEL PERFORMANCE COMPARISON
==================================================
Metric          Baseline     Fine-Tuned
--------------------------------------------------
Accuracy        0.0328       0.7146
Weighted F1     0.0082       0.6771
Macro F1        0.0053       0.5016
==================================================
```
| Metric | Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| Accuracy | 3.28% | 71.46% | +68.18 pp |
| Weighted F1 | 0.0082 | 0.6771 | ≈82× |
| Macro F1 | 0.0053 | 0.5016 | ≈94× |
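The improvement multipliers can be reproduced directly from the reported metrics:

```python
# Reported metrics from the baseline vs fine-tuned comparison above.
baseline = {"accuracy": 0.0328, "weighted_f1": 0.0082, "macro_f1": 0.0053}
finetuned = {"accuracy": 0.7146, "weighted_f1": 0.6771, "macro_f1": 0.5016}

for metric in baseline:
    ratio = finetuned[metric] / baseline[metric]
    print(f"{metric}: {ratio:.1f}x improvement")
```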
High-performing classes:
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Non-Disparagement (13) | 1.00 | 1.00 | 1.00 | 89 |
| Termination for Convenience (14) | 0.96 | 0.95 | 0.96 | 109 |
| Expiration Date (4) | 0.92 | 0.97 | 0.95 | 127 |
| Audit Rights (32) | 0.84 | 0.93 | 0.88 | 82 |
| Effective Date (3) | 0.88 | 0.90 | 0.89 | 125 |
| Renewal Term (5) | 0.72 | 0.97 | 0.83 | 133 |
Low-performing classes (insufficient training data):
| Class | F1 | Support | Reason |
|---|---|---|---|
| Agreement Date (2) | 0.00 | 1 | Only 1 test sample |
| Parties (1) | 0.00 | 14 | Very few samples |
| Document Name (0) | 0.00 | 22 | Few samples + ambiguous |
To verify the fine-tuned model retained general language understanding,
it was evaluated on a 100-sample subset of the MMLU Abstract Algebra
benchmark alongside the untrained base model:
| Model | MMLU Abstract Algebra Accuracy |
|---|---|
| Base model (untrained) | 19.00% |
| Fine-tuned model | 24.00% |
No catastrophic forgetting was detected; the fine-tuned model scored
5 percentage points higher on MMLU.
Both models score near random chance (25%) on this reasoning task —
which is expected, as LegalBERT is an encoder classification model,
not a reasoning model. Critically, the fine-tuned model did not
degrade compared to the base model, confirming that domain-specific
LoRA fine-tuning preserved general language ability.
Real predictions from the fine-tuned model on unseen contract clauses:
| Clause | Predicted Type | Confidence | Correct? |
|---|---|---|---|
| "Either party may terminate this Agreement upon 30 days written notice." | Termination for Convenience | 79.50% | ✅ |
| "Licensee shall not transfer or sublicense any rights granted herein." | Anti-Assignment | 61.04% | ✅ |
| "This Agreement shall be governed by the laws of California." | Governing Law | 96.87% | ✅ |
| "The Company shall maintain insurance coverage of at least $1,000,000." | Insurance | 97.44% | ✅ |
| "Neither party shall disclose confidential information to third parties." | Anti-Assignment | 41.98% | ❌ |
4 out of 5 predictions are correct with high confidence. The last
clause (confidentiality) was misclassified as Anti-Assignment —
understandable since CUAD does not have a dedicated "Confidentiality"
label, and the model associated the "shall not disclose" language
with restriction/anti-transfer semantics.
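The confidence values above are softmax probabilities over the model's class logits. A stdlib sketch of that computation (toy 5-class logits stand in for the real 41-dimensional output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 5-class example; the real model outputs 41 logits.
logits = [0.2, 3.1, -1.0, 0.5, 0.9]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted class {pred} with confidence {probs[pred]:.2%}")
```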
LoRA efficiency: Training only ~0.5% of parameters was sufficient
to achieve 71.46% accuracy on a 41-class classification task. The full
training run completed in approximately 15 minutes on a Kaggle T4 GPU —
demonstrating that parameter-efficient fine-tuning is highly practical
for domain adaptation without expensive hardware.
LegalBERT as base: The domain-specific pre-training of LegalBERT
on legal corpora gave it a strong foundation for understanding contract
language. This likely contributed to faster convergence and better
final performance compared to using a general BERT model.
Stratified splitting: Using stratified train/test splits ensured
all 41 classes were represented proportionally in both splits, leading
to more reliable evaluation metrics.
Experiment tracking: Logging all metrics, hyperparameters, and
system stats to Weights & Biases made it easy to monitor training
in real time and reproduce results.
Class imbalance: The CUAD dataset is heavily imbalanced. Some
classes like Renewal Term (133 test samples) have strong performance
while Agreement Date (1 test sample) scores zero. This is a fundamental
dataset limitation, not a model limitation.
Zero-shot classes: Classes 0, 1, 2, 7, 9, 21, 22, 27, 37, 38
scored F1 = 0.00 due to too few training examples. With more balanced
data or oversampling, these classes could be improved.
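One simple mitigation is random oversampling of minority classes, as suggested above. A stdlib-only sketch (illustrative; libraries such as imbalanced-learn offer more principled resampling):

```python
import random
from collections import Counter

def oversample(examples, seed=42):
    """Duplicate minority-class examples until every class matches the majority count."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalance mirroring the dataset: one frequent class, one rare class.
data = [("clause a", "Renewal Term")] * 100 + [("clause b", "Agreement Date")] * 3
balanced = oversample(data)
print(Counter(label for _, label in balanced))
```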
Tokenization truncation: With max_length=256, some longer contract
clauses are truncated. A few misclassifications may be caused by
important context being cut off.
MMLU interpretability: Using MMLU to measure catastrophic forgetting
for a classification model (not a generative model) is imperfect — the
results should be interpreted as a rough proxy check rather than a
definitive benchmark.
| Resource | Link |
|---|---|
| 🤗 HuggingFace Model | Mokshith31/legalbert-contract-clause-classification |
| 💻 GitHub Repository | MokshithRao/legalbert-contract-clause-classification |
| 📊 W&B Experiment Tracking | https://api.wandb.ai/links/mokshithrao87-international-instute-of-information-techn/bgh7n0xt |
| 📂 Kaggle Dataset | kvmokshithrao/contract-clauses-dataset |
| 📓 Kaggle Notebook | kvmokshithrao/finetuning |
```bash
# 1. Clone the repository
git clone https://github.com/MokshithRao/legalbert-contract-clause-classification
cd legalbert-contract-clause-classification

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run inference using the published model
python -c "
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained('nlpaueb/legal-bert-base-uncased')
base = AutoModelForSequenceClassification.from_pretrained(
    'nlpaueb/legal-bert-base-uncased', num_labels=41)
model = PeftModel.from_pretrained(
    base, 'Mokshith31/legalbert-contract-clause-classification')
model.eval()

clause = 'This Agreement shall be governed by the laws of California.'
inputs = tokenizer(clause, return_tensors='pt', truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
pred_id = outputs.logits.argmax(dim=-1).item()
print(f'Predicted class ID: {pred_id}')
"
```
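The quickstart prints only the integer class ID; the human-readable clause type can be recovered from the `label_mapping.json` saved during preprocessing. A sketch, assuming the file stores a name → id mapping (the specific names and IDs below are illustrative, not the real mapping):

```python
import json

# Assumed format of label_mapping.json: {"Governing Law": 17, ...} (name -> id).
# These IDs are hypothetical; the real file is produced during preprocessing.
label2id = {"Governing Law": 17, "Insurance": 24, "Anti-Assignment": 6}
with open("label_mapping.json", "w") as f:
    json.dump(label2id, f)

# Invert the mapping to decode a predicted class ID into a clause type name.
with open("label_mapping.json") as f:
    id2label = {v: k for k, v in json.load(f).items()}

pred_id = 17  # e.g. the ID printed by the quickstart snippet
print(f"Predicted clause type: {id2label[pred_id]}")
```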
To retrain from scratch:
1. Add the Kaggle dataset `kvmokshithrao/contract-clauses-dataset`
2. Run `finetuning.ipynb` on Kaggle
3. Set `WANDB_API_KEY` in Kaggle Secrets

This project successfully demonstrates end-to-end LoRA fine-tuning
of LegalBERT for a challenging 41-class legal classification task.
Key achievements:
- 71.46% accuracy and 0.677 weighted F1, up from a 3.28% baseline
- Only ~0.5% of parameters trained, in ~15 minutes on a single Kaggle T4 GPU
- No catastrophic forgetting detected on the MMLU check
The results confirm that parameter-efficient fine-tuning with LoRA
is a practical and powerful approach for domain-specific legal NLP,
enabling high-quality task adaptation without expensive full fine-tuning.
Hendrycks, D., Burns, C., Chen, A., & Ball, S. (2021). CUAD: An
Expert-Annotated NLP Dataset for Legal Contract Review.
arXiv preprint.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., &
Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out
of Law School. Findings of EMNLP 2020.
Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank
Adaptation of Large Language Models.
arXiv preprint.
Wolf, T., et al. (2020). HuggingFace's Transformers: State-of-the-art
Natural Language Processing. EMNLP 2020.
Hendrycks, D., et al. (2021). Measuring Massive Multitask Language
Understanding. ICLR 2021. (MMLU benchmark)