Author: K.V. Mokshith Rao
Program: LLMED Certification — Module 1 Capstone Project
Institution: International Institute of Information Technology
This project fine-tunes nlpaueb/legal-bert-base-uncased using LoRA
(Low-Rank Adaptation) on the CUAD dataset for multi-class contract
clause classification across 41 legal clause types. The fine-tuned
model achieves 71.46% accuracy and 0.677 weighted F1, compared
to a baseline of 3.28% accuracy — a dramatic improvement demonstrating
the effectiveness of parameter-efficient fine-tuning for legal NLP tasks.
No catastrophic forgetting was detected on the MMLU general benchmark.
Contract clause classification is a high-value legal NLP task. Legal
professionals spend significant time manually reviewing contracts to
identify the type and implications of each clause — a process that
is slow, expensive, and error-prone.
The goal of this project is to fine-tune a domain-specific language
model (LegalBERT) to automatically classify contract clauses into one
of the 41 CUAD clause types, such as Governing Law, Anti-Assignment,
and Termination for Convenience.
Legal NLP is one of the most impactful application areas for LLMs.
The CUAD dataset provides expert-labelled annotations from real
commercial contracts, making it an ideal benchmark for fine-tuning
evaluation. Automating clause classification can make contract review
faster, cheaper, and more consistent.
| Property | Value |
|---|---|
| Name | CUAD (Contract Understanding Atticus Dataset) |
| Source | theatticusproject/cuad-qa |
| Task | Multi-class sequence classification |
| Classes | 41 CUAD clause types |
| License | Creative Commons Attribution 4.0 |
| Language | English |
The raw CUAD dataset was preprocessed into a structured CSV file
(clauses.csv) with two columns: text (the clause text) and
label (the clause type name).
Preprocessing steps:
- Converted string labels to integer IDs with `sklearn.LabelEncoder`
- Saved the label mapping to `label_mapping.json` for reproducibility
- Split train/test with stratification and `random_state=42`

Dataset Statistics:
| Split | Size |
|---|---|
| Train | ~7,930 examples |
| Test | 1,983 examples |
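The preprocessing steps above can be sketched as follows. This is a minimal stdlib-only sketch: the toy rows, the manual label encoding (equivalent to what `sklearn.LabelEncoder` produces), and the plain 80/20 shuffle stand in for the real pipeline, which reads `clauses.csv` and uses sklearn's stratified split.

```python
import json
import random

# Toy stand-in for clauses.csv: (text, label) pairs.
rows = [
    ("This Agreement shall be governed by the laws of ...", "Governing Law"),
    ("Either party may terminate upon 30 days notice.", "Termination for Convenience"),
    ("Licensee shall not sublicense any rights herein.", "Anti-Assignment"),
] * 10

# Encode string labels to integer IDs (same result as sklearn.LabelEncoder:
# classes are sorted alphabetically, then numbered 0..n-1).
classes = sorted({label for _, label in rows})
label2id = {label: i for i, label in enumerate(classes)}

# Persist the mapping for reproducibility, mirroring label_mapping.json.
with open("label_mapping.json", "w") as f:
    json.dump(label2id, f, indent=2)

# Deterministic shuffle + 80/20 split, mirroring random_state=42.
# (The real pipeline uses sklearn's train_test_split with stratify=labels.)
random.seed(42)
random.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
print(len(train), len(test), label2id)
```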
Tokenization:
- Tokenizer: `nlpaueb/legal-bert-base-uncased` (max_length=256, with truncation)

Class Distribution Note:
The CUAD dataset is significantly imbalanced. Some classes have hundreds
of examples (e.g., Renewal Term: 133 test samples) while others have
very few (e.g., Agreement Date: 1 sample). This imbalance directly
impacts per-class performance and is discussed in the Limitations section.
Model: nlpaueb/legal-bert-base-uncased
LegalBERT was chosen because it is pre-trained specifically on legal
text corpora including legal contracts, court decisions, and legislation.
This domain-specific pre-training gives it an advantage over general
BERT models for legal NLP tasks.
Unlike general-purpose models like bert-base-uncased, LegalBERT
understands legal terminology, clause structures, and contract language
patterns natively — making it the ideal base for fine-tuning on CUAD.
Instead of full fine-tuning (which updates all ~110M parameters),
we used LoRA (Low-Rank Adaptation) via the HuggingFace PEFT library.
LoRA injects small trainable matrices into the attention layers,
dramatically reducing the number of trainable parameters while
maintaining strong task performance.
LoRA Configuration:
| Parameter | Value | Reasoning |
|---|---|---|
| Rank (r) | 16 | Good balance of capacity vs efficiency |
| Alpha | 32 | Standard 2x rank scaling |
| Dropout | 0.1 | Regularization to prevent overfitting |
| Target modules | query, value | Standard BERT LoRA targets |
| Task type | SEQ_CLS | Sequence classification |
| Bias | none | Standard setting |
Trainable parameters: Only ~0.5% of total parameters are trained,
making this extremely parameter-efficient compared to full fine-tuning.
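The ~0.5% figure can be checked with a back-of-the-envelope count: BERT-base has 12 layers with hidden size 768, and LoRA with r=16 on the query and value projections adds an A (r×d) and a B (d×r) matrix per targeted module. A quick sketch (the classifier head is ignored, and 110M is an approximate total):

```python
# Back-of-the-envelope LoRA parameter count for bert-base (12 layers, d=768).
d, r, layers = 768, 16, 12
per_module = d * r + r * d          # A (r x d) plus B (d x r)
modules_per_layer = 2               # query and value projections
lora_params = per_module * modules_per_layer * layers
total_params = 110_000_000          # ~110M parameters in bert-base
print(f"LoRA params: {lora_params:,} (~{100 * lora_params / total_params:.2f}% of total)")
```

This lands at roughly 0.59M trainable parameters, about 0.54% of the base model, consistent with the ~0.5% reported above.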
Training Arguments:
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Learning rate | 2e-4 |
| Batch size (train) | 16 |
| Batch size (eval) | 16 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Eval strategy | Per epoch |
| Save strategy | Per epoch |
| Best model loading | True |
| Optimizer | AdamW (default) |
Hardware & Framework:
| Property | Value |
|---|---|
| Hardware | Kaggle GPU (T4) |
| Framework | HuggingFace Transformers + PEFT 0.18.1 |
| Experiment tracking | Weights & Biases |
| Training time | ~15 minutes |
Experiment Tracking:
🔗 W&B Project — contract-intelligence
The base model (nlpaueb/legal-bert-base-uncased) was evaluated on
the test set without any fine-tuning. This establishes the reference
point for measuring improvement.
| Metric | Baseline (Untrained) |
|---|---|
| Accuracy | 3.28% |
| Weighted F1 | 0.0082 |
| Macro F1 | 0.0053 |
The near-random performance (3.28% vs 1/41 = 2.4% random chance)
confirms the base model has no prior knowledge of CUAD clause types.
It essentially guesses, demonstrating that fine-tuning is necessary
for this task.
The model was trained for 5 epochs with evaluation after each epoch.
Loss decreased consistently and accuracy improved steadily, with no
signs of overfitting.
| Epoch | Train Loss | Val Loss | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|---|
| 1 | 5.992 | 4.285 | 43.22% | 0.316 | 0.158 |
| 2 | 2.881 | 2.485 | 65.81% | 0.601 | 0.382 |
| 3 | 2.203 | 2.124 | 69.79% | 0.651 | 0.448 |
| 4 | 1.958 | 2.005 | 71.05% | 0.668 | 0.488 |
| 5 | 1.852 | 1.944 | 71.46% | 0.677 | 0.502 |
Key observations:
- Validation loss decreased every epoch (4.285 → 1.944), with no sign of overfitting
- Accuracy climbed from 43.22% after epoch 1 to 71.46% after epoch 5
- Macro F1 kept improving through the final epoch, suggesting rarer classes were still being learned

After 5 epochs of LoRA fine-tuning:
```
==================================================
MODEL PERFORMANCE COMPARISON
==================================================
Metric          Baseline     Fine-Tuned
--------------------------------------------------
Accuracy        0.0328       0.7146
Weighted F1     0.0082       0.6771
Macro F1        0.0053       0.5016
==================================================
```
| Metric | Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| Accuracy | 3.28% | 71.46% | +68.18 pp |
| Weighted F1 | 0.0082 | 0.6771 | ≈82× |
| Macro F1 | 0.0053 | 0.5016 | ≈94× |
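The improvement multipliers can be reproduced directly from the reported metrics:

```python
# Reported metrics from the baseline vs fine-tuned comparison above.
baseline = {"accuracy": 0.0328, "weighted_f1": 0.0082, "macro_f1": 0.0053}
finetuned = {"accuracy": 0.7146, "weighted_f1": 0.6771, "macro_f1": 0.5016}

for metric in baseline:
    ratio = finetuned[metric] / baseline[metric]
    print(f"{metric}: {ratio:.1f}x improvement")
```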
High-performing classes:
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Non-Disparagement (13) | 1.00 | 1.00 | 1.00 | 89 |
| Termination for Convenience (14) | 0.96 | 0.95 | 0.96 | 109 |
| Expiration Date (4) | 0.92 | 0.97 | 0.95 | 127 |
| Audit Rights (32) | 0.84 | 0.93 | 0.88 | 82 |
| Effective Date (3) | 0.88 | 0.90 | 0.89 | 125 |
| Renewal Term (5) | 0.72 | 0.97 | 0.83 | 133 |
Low-performing classes (insufficient training data):
| Class | F1 | Support | Reason |
|---|---|---|---|
| Agreement Date (2) | 0.00 | 1 | Only 1 test sample |
| Parties (1) | 0.00 | 14 | Very few samples |
| Document Name (0) | 0.00 | 22 | Few samples + ambiguous |
To verify the fine-tuned model retained general language understanding,
it was evaluated on a 100-sample subset of the MMLU Abstract Algebra
benchmark alongside the untrained base model:
| Model | MMLU Abstract Algebra Accuracy |
|---|---|
| Base model (untrained) | 19.00% |
| Fine-tuned model | 24.00% |
No catastrophic forgetting was detected; the fine-tuned model scored
5 percentage points higher on MMLU.
Both models score near random chance (25%) on this reasoning task —
which is expected, as LegalBERT is an encoder classification model,
not a reasoning model. Critically, the fine-tuned model did not
degrade compared to the base model, confirming that domain-specific
LoRA fine-tuning preserved general language ability.
Real predictions from the fine-tuned model on unseen contract clauses:
| Clause | Predicted Type | Confidence | Correct? |
|---|---|---|---|
| "Either party may terminate this Agreement upon 30 days written notice." | Termination for Convenience | 79.50% | ✅ |
| "Licensee shall not transfer or sublicense any rights granted herein." | Anti-Assignment | 61.04% | ✅ |
| "This Agreement shall be governed by the laws of California." | Governing Law | 96.87% | ✅ |
| "The Company shall maintain insurance coverage of at least $1,000,000." | Insurance | 97.44% | ✅ |
| "Neither party shall disclose confidential information to third parties." | Anti-Assignment | 41.98% | ❌ |
4 out of 5 predictions are correct with high confidence. The last
clause (confidentiality) was misclassified as Anti-Assignment —
understandable since CUAD does not have a dedicated "Confidentiality"
label, and the model associated the "shall not disclose" language
with restriction/anti-transfer semantics.
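The confidence values above are softmax probabilities over the model's class logits. A stdlib sketch of that computation (toy 5-class logits stand in for the real 41-dimensional output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 5-class example; the real model outputs 41 logits.
logits = [0.2, 3.1, -1.0, 0.5, 0.9]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted class {pred} with confidence {probs[pred]:.2%}")
```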
LoRA efficiency: Training only ~0.5% of parameters was sufficient
to achieve 71.46% accuracy on a 41-class classification task. The full
training run completed in approximately 15 minutes on a Kaggle T4 GPU —
demonstrating that parameter-efficient fine-tuning is highly practical
for domain adaptation without expensive hardware.
LegalBERT as base: The domain-specific pre-training of LegalBERT
on legal corpora gave it a strong foundation for understanding contract
language. This likely contributed to faster convergence and better
final performance compared to using a general BERT model.
Stratified splitting: Using stratified train/test splits ensured
all 41 classes were represented proportionally in both splits, leading
to more reliable evaluation metrics.
Experiment tracking: Logging all metrics, hyperparameters, and
system stats to Weights & Biases made it easy to monitor training
in real time and reproduce results.
Class imbalance: The CUAD dataset is heavily imbalanced. Some
classes like Renewal Term (133 test samples) have strong performance
while Agreement Date (1 test sample) scores zero. This is a fundamental
dataset limitation, not a model limitation.
Zero-shot classes: Classes 0, 1, 2, 7, 9, 21, 22, 27, 37, 38
scored F1 = 0.00 due to too few training examples. With more balanced
data or oversampling, these classes could be improved.
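One simple mitigation is random oversampling of minority classes, as suggested above. A stdlib-only sketch (illustrative; libraries such as imbalanced-learn offer more principled resampling):

```python
import random
from collections import Counter

def oversample(examples, seed=42):
    """Duplicate minority-class examples until every class matches the majority count."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalance mirroring the dataset: one frequent class, one rare class.
data = [("clause a", "Renewal Term")] * 100 + [("clause b", "Agreement Date")] * 3
balanced = oversample(data)
print(Counter(label for _, label in balanced))
```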
Tokenization truncation: With max_length=256, some longer contract
clauses are truncated. A few misclassifications may be caused by
important context being cut off.
MMLU interpretability: Using MMLU to measure catastrophic forgetting
for a classification model (not a generative model) is imperfect — the
results should be interpreted as a rough proxy check rather than a
definitive benchmark.
| Resource | Link |
|---|---|
| 🤗 HuggingFace Model | Mokshith31/legalbert-contract-clause-classification |
| 💻 GitHub Repository | MokshithRao/legalbert-contract-clause-classification |
| 📊 W&B Experiment Tracking | https://api.wandb.ai/links/mokshithrao87-international-instute-of-information-techn/bgh7n0xt |
| 📂 Kaggle Dataset | kvmokshithrao/contract-clauses-dataset |
| 📓 Kaggle Notebook | kvmokshithrao/finetuning |
```bash
# 1. Clone the repository
git clone https://github.com/MokshithRao/legalbert-contract-clause-classification
cd legalbert-contract-clause-classification

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run inference using the published model
python -c "
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained('nlpaueb/legal-bert-base-uncased')
base = AutoModelForSequenceClassification.from_pretrained(
    'nlpaueb/legal-bert-base-uncased', num_labels=41)
model = PeftModel.from_pretrained(
    base, 'Mokshith31/legalbert-contract-clause-classification')
model.eval()

clause = 'This Agreement shall be governed by the laws of California.'
inputs = tokenizer(clause, return_tensors='pt', truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
pred_id = outputs.logits.argmax(dim=-1).item()
print(f'Predicted class ID: {pred_id}')
"
```
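The quickstart prints only the integer class ID; the human-readable clause type can be recovered from the `label_mapping.json` saved during preprocessing. A sketch, assuming the file stores a name → id mapping (the specific names and IDs below are illustrative, not the real mapping):

```python
import json

# Assumed format of label_mapping.json: {"Governing Law": 17, ...} (name -> id).
# These IDs are hypothetical; the real file is produced during preprocessing.
label2id = {"Governing Law": 17, "Insurance": 24, "Anti-Assignment": 6}
with open("label_mapping.json", "w") as f:
    json.dump(label2id, f)

# Invert the mapping to decode a predicted class ID into a clause type name.
with open("label_mapping.json") as f:
    id2label = {v: k for k, v in json.load(f).items()}

pred_id = 17  # e.g. the ID printed by the quickstart snippet
print(f"Predicted clause type: {id2label[pred_id]}")
```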
To retrain from scratch:
1. Add the Kaggle dataset `kvmokshithrao/contract-clauses-dataset`
2. Run `finetuning.ipynb` on Kaggle
3. Set `WANDB_API_KEY` in Kaggle Secrets

This project successfully demonstrates end-to-end LoRA fine-tuning
of LegalBERT for a challenging 41-class legal classification task.
Key achievements:
- 71.46% accuracy and 0.677 weighted F1, up from a 3.28% baseline
- Only ~0.5% of parameters trained, in ~15 minutes on a single Kaggle T4 GPU
- No catastrophic forgetting detected on the MMLU check
The results confirm that parameter-efficient fine-tuning with LoRA
is a practical and powerful approach for domain-specific legal NLP,
enabling high-quality task adaptation without expensive full fine-tuning.
Hendrycks, D., Burns, C., Chen, A., & Ball, S. (2021). CUAD: An
Expert-Annotated NLP Dataset for Legal Contract Review.
arXiv preprint.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., &
Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out
of Law School. Findings of EMNLP 2020.
Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank
Adaptation of Large Language Models.
arXiv preprint.
Wolf, T., et al. (2020). HuggingFace's Transformers: State-of-the-art
Natural Language Processing. EMNLP 2020.
Hendrycks, D., et al. (2021). Measuring Massive Multitask Language
Understanding. ICLR 2021. (MMLU benchmark)