Building the Future of Legal AI: A Deep Dive into RAG Technology
A review by Sandhya Shiwakoti, Aliza Lama, Sadikshya Nepal, Sonishma Basnet, and Somiya Rai, from Herald College, Kathmandu, Nepal (University of Wolverhampton).
Every day, lawyers and legal professionals sift through mountains of documents. What if an AI could not only find the right information instantly but also understand it, summarize it, and answer complex questions about it? That's the promise of a true "Legal GPT" powered by Retrieval-Augmented Generation (RAG).
But how close are we to this reality?
To find out, our team analyzed five recent and important projects in the legal AI space. We looked at both practical systems being built today and the core academic research that powers them. Here’s what we discovered.

Part 1: The Builders - Two Legal AI Systems in the Wild
First, we looked at two systems that people are building right now. These show the real-world application of RAG technology, along with its current challenges.
1. Nyaya-GPT: The AI That Thinks Before It Acts
Created by Debapriya Das, this system is a chatbot designed to answer questions about specific Indian laws. It's more than just a search engine; it uses a clever framework called ReAct (Reason + Act).
- What's Great About It: Instead of just searching for an answer in one go, Nyaya-GPT breaks down a complex question, "thinks" about the steps needed to answer it, retrieves information, and can even correct its own mistakes if the first attempt isn't good enough.
- The Gaps: Its biggest limitation is its scope. It only works for Indian laws and hasn't been tested with real users to measure its accuracy. It's a fantastic proof-of-concept, but not a general-purpose tool yet.
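The thought-act-observe loop that ReAct describes can be sketched in a few lines. This is a toy stand-in, not Nyaya-GPT's actual code: the `search_statutes` tool, the sample corpus, and the query-broadening retry are all invented for illustration, with a scripted loop standing in for the LLM's reasoning.

```python
# ReAct-style loop: decide on an action ("thought"), call a tool ("act"),
# inspect the result ("observe"), and retry with a revised query when the
# first attempt comes back empty -- the self-correction behavior described
# above. All components here are toy stand-ins.

def search_statutes(query, corpus):
    """Toy retrieval tool: return passages containing every query term."""
    terms = query.lower().split()
    return [p for p in corpus if all(t in p.lower() for t in terms)]

def react_answer(question, corpus, max_steps=3):
    query = question
    for _ in range(max_steps):
        observation = search_statutes(query, corpus)  # Act + Observe
        if observation:                               # Observation is useful
            return observation[0]
        # Self-correction: broaden the query and try again.
        query = " ".join(query.split()[:-1]) or question
    return "No relevant provision found."

corpus = [
    "Section 420 IPC deals with cheating and dishonestly inducing delivery of property.",
    "Section 302 IPC prescribes punishment for murder.",
]
# The over-specific query fails once, gets broadened, then succeeds.
print(react_answer("cheating property delivery extra", corpus))
```

The point of the loop is that retrieval failures become visible to the controller, which can revise its plan instead of returning an empty answer.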
2. The RAG Chatbot for Legal Documents
This project, by Manuel Orejo, is a more general-purpose tool designed to help anyone chat with their legal documents. It uses a modern tech stack (React, FastAPI, ChromaDB) to create a smooth user experience.
- What's Great About It: It has a clean, intuitive chat interface and keeps track of conversation history, making it easy to use. Early user feedback was positive, with users finding it helpful for understanding complex documents.
- The Gaps: The public write-up lacks technical detail and performance metrics. We don't know how accurate its answers are, and the reported user feedback is anecdotal, making its true effectiveness hard to assess.
Part 2: The Science - What the Research Teaches Us
Next, we dove into three academic papers that reveal the core challenges and breakthroughs in building reliable legal AI.
3. LegalBench-RAG: The Need for a Better Ruler
A huge problem in legal AI is knowing if a system is actually any good. Nicholas Pipitone and Ghita Houir Alami tackled this by creating LegalBench-RAG, the first-ever benchmark specifically for testing the retrieval part of a legal RAG system.
- The Key Insight: It's not enough to retrieve the right document; the AI must find the exact snippet of text that answers the question. Their benchmark, built from over 6,800 expert-annotated examples, helps developers measure this "snippet precision."
- The Limitation: This benchmark only tests the retrieval step. It doesn't evaluate the quality of the final generated answer or check if the AI is "hallucinating" facts.
4. Eval-RAG: A Smarter Way to Judge AI Answers
While LegalBench-RAG focuses on finding information, Eval-RAG by Cheol Ryu et al. focuses on judging the final answer. They developed a new way to evaluate an AI's response in Korean legal question-answering.
- The Key Insight: Instead of just asking one LLM to score another's answer, Eval-RAG first finds the relevant legal document and then asks the LLM to judge the answer based on that document. This method was far better at catching factual errors and aligned much more closely with the evaluations of human lawyers.
- The Limitation: The research is focused only on evaluation, not on building a complete question-answering system. It was also limited to the Korean legal domain.
5. Fine-Tuning GPT-3: Proving That Specialization Matters
Finally, a paper by Davide Liga and Livio Robaldo explored whether a general model like GPT-3 could understand specific legal rules. They fine-tuned GPT-3 on GDPR regulations.
- The Key Insight: Even with a small amount of specialized legal training data, a fine-tuned GPT-3 significantly outperformed previous models at classifying legal rules (e.g., distinguishing between an "obligation" and a "permission"). This proves that for high-stakes legal work, domain-specific training is crucial.
- The Limitation: The task was limited to classification, not generating full answers. The work wasn't integrated into a full RAG pipeline, leaving the retrieval aspect unexplored.
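Fine-tuning for rule classification starts with labeled examples in a simple prompt/completion format. The sketch below shows what such training records can look like; the GDPR-style sentences, labels, and prompt wording are invented for illustration and may differ from the paper's actual dataset.

```python
import json

# Build JSONL fine-tuning records for legal rule classification
# (e.g. "obligation" vs "permission"), in the classic prompt/completion style.

examples = [
    ("The controller shall notify the supervisory authority without undue delay.",
     "obligation"),
    ("The data subject may withdraw consent at any time.",
     "permission"),
]

def to_finetune_record(sentence, label):
    """One JSONL training record: the prompt ends with a cue, the completion is the label."""
    return json.dumps({
        "prompt": f"Classify the legal rule: {sentence}\nLabel:",
        "completion": f" {label}",
    })

jsonl = "\n".join(to_finetune_record(s, l) for s, l in examples)
print(jsonl)
```

The takeaway from the paper maps directly onto this format: even a small file of such records, drawn from the target legal domain, moved classification accuracy well past general-purpose baselines.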
What We Learned: The Three Big Takeaways
After reviewing these five projects, a clear picture emerged about the state of legal AI.
- The Field is Fragmented. Amazing progress is happening, but it's happening in isolated pieces. We have smart reasoning engines, powerful evaluation benchmarks, and effective training methods, but no one has put them all together into one comprehensive system yet.
- Evaluation is the Biggest Hurdle. We cannot build trustworthy legal AI if we can't reliably measure its performance. Projects like LegalBench-RAG and Eval-RAG are essential first steps, but much more work is needed.
- Specialization is Non-Negotiable. General-purpose AI is not enough for the law. The best results come from systems that are fine-tuned on legal data and use frameworks designed for legal reasoning.
The Road Ahead
The path to a true, reliable Legal GPT is clear: the next breakthrough will come from integration. We need a system that combines the clever reasoning of Nyaya-GPT with the rigorous evaluation standards of LegalBench-RAG and Eval-RAG, all built on a foundation of domain-specific fine-tuning.
While the perfect legal AI assistant isn't here yet, the building blocks are. The work reviewed here shows a vibrant, innovative field getting closer every day.
Sources and Credits
This review was based on a detailed analysis of the following works:
- Das, D. (2024), Nyaya-GPT: Building Smarter Legal AI with ReAct + RAG.
- Orejo, M. (2024), Legal Document AI Assistant.
- Pipitone, N. and Houir Alami, G. (2024), LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain.
- Ryu, C., et al. (2023), Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA.
- Liga, D. and Robaldo, L. (2023), Fine-tuning GPT-3 for legal rule classification.