
The system is designed to ingest legal text files (statutes, case summaries, contracts), convert them into semantic embeddings, store them in a vector database, retrieve the most relevant segments based on user questions, and generate context-grounded, non-advisory explanations using an LLM.
The assistant emphasizes safe, transparent, document-based reasoning and includes clear boundaries to avoid providing legal advice.
The entire pipeline is implemented using LangChain, ChromaDB, SentenceTransformers, and an LLM provider (OpenAI/Groq/Google Gemini).
This project demonstrates the foundational concepts of RAG, agentic workflows, and vector search using simple, modular code.
Legal documents—such as Acts, case summaries, and contracts—are typically lengthy, complex, and difficult for non-experts to interpret.
Traditional keyword search is limited, as legal understanding often requires semantic similarity, clause extraction, and contextual reasoning.
With the emergence of Retrieval-Augmented Generation (RAG) systems, it is now possible to combine semantic retrieval over domain documents with grounded, context-aware generation.
This project implements a lightweight, domain-focused Legal Document RAG Assistant that enables:
semantic search over fragmented legal texts,
safe interpretation of retrieved context,
AI-generated summaries and explanations without giving legal advice.
This project fulfills all the requirements by implementing:
document loading
chunking
embedding and vector storage
retrieval
RAG prompt construction
LLM pipeline with safe constraints
as outlined in the ReadyTensor Module 1 specification.
The system is built using a modular pipeline consisting of seven main stages:
Legal text files placed in the data/ folder are loaded using a simple loader.
Metadata such as source and doc_type are extracted automatically.
Example types, inferred from filename keywords, include statute_or_act, case_law, contract_or_agreement, and legal_notes_or_other:
```python
from pathlib import Path
from typing import Any, Dict, List

# Folder containing the .txt legal documents
DATA_DIR = Path("data")


def load_documents() -> List[Dict[str, Any]]:
    results: List[Dict[str, Any]] = []
    if not DATA_DIR.exists():
        print(f"[WARN] data directory not found: {DATA_DIR}")
        return results

    for file_path in DATA_DIR.iterdir():
        if not file_path.is_file():
            continue
        # For this project we keep it simple and support only .txt
        if file_path.suffix.lower() != ".txt":
            # You can extend this later to support PDFs, DOCX, etc.
            continue
        try:
            with file_path.open("r", encoding="utf-8") as f:
                text = f.read()
        except Exception as e:  # noqa: BLE001
            print(f"[WARN] Failed to read {file_path.name}: {e}")
            continue

        text = text.strip()
        if not text:
            continue

        # Simple heuristic: detect rough "type" from filename
        name_lower = file_path.stem.lower()
        if "act" in name_lower or "section" in name_lower:
            doc_type = "statute_or_act"
        elif "case" in name_lower or "judgment" in name_lower:
            doc_type = "case_law"
        elif "contract" in name_lower or "agreement" in name_lower:
            doc_type = "contract_or_agreement"
        else:
            doc_type = "legal_notes_or_other"

        results.append(
            {
                "content": text,
                "metadata": {
                    "source": file_path.name,
                    "path": str(file_path),
                    "doc_type": doc_type,
                },
            }
        )
    return results
```
Since legal documents are long, each document is broken into overlapping text chunks.
The overlap ensures that clauses split across a chunk boundary still appear intact in at least one chunk, which improves recall when performing semantic search.
```python
def chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []
    text = text.strip()
    n = len(text)
    if n == 0:
        return []

    chunks: List[str] = []
    start = 0
    # Simple character-based chunking with overlap
    while start < n:
        end = min(start + chunk_size, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        # move the window with overlap
        start = max(0, end - overlap)
    return chunks
```
Chunks are embedded using a SentenceTransformers sentence-embedding model.
The embeddings capture semantic meaning rather than exact keyword matching.
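A minimal sketch of this step, assuming the widely used all-MiniLM-L6-v2 model (the model name is an illustrative assumption, not confirmed by the project):

```python
from sentence_transformers import SentenceTransformer

# Model name is an illustrative assumption; any SentenceTransformers
# model can be dropped in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Section 10: All agreements are contracts if they are made by the free consent of parties...",
    "Section 2(h): An agreement enforceable by law is a contract.",
]
# encode() returns one dense vector per chunk
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2
```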
A persistent ChromaDB instance stores:
embeddings
chunk text
metadata (source, document type, chunk index)
This enables fast top-k semantic similarity search during query time.
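A minimal storage sketch, reusing the chunks and embeddings from the step above (the persist path and collection name are illustrative assumptions):

```python
import chromadb

# Persist location and collection name are illustrative assumptions.
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("legal_chunks")

# `chunks` and `embeddings` come from the embedding sketch above;
# each chunk is stored with its text, vector, and metadata.
collection.add(
    ids=[f"contract_act.txt-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
    metadatas=[
        {"source": "contract_act.txt", "doc_type": "statute_or_act", "chunk_index": i}
        for i in range(len(chunks))
    ],
)
```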
When a user asks a question, the query is embedded with the same model and the top-k most similar chunks are retrieved from ChromaDB.
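Continuing the same sketch, retrieval reduces to a single query call (the top-k value of 4 is an illustrative choice):

```python
question = "When is an agreement enforceable as a contract?"

results = collection.query(
    query_embeddings=model.encode([question]).tolist(),
    n_results=4,  # top-k; 4 is an illustrative choice
)
retrieved_chunks = results["documents"][0]  # chunk texts for the query
retrieved_meta = results["metadatas"][0]    # matching source/doc_type/chunk_index
```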
A specialized legal-safe prompt guides the LLM to answer strictly from the retrieved context, explain clauses in plain language, and refuse to provide legal advice.
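A hedged sketch of what such a prompt might look like; the wording below is illustrative, not the project's verbatim template:

```python
from langchain_core.prompts import PromptTemplate

# Illustrative wording only; the project's actual prompt may differ.
LEGAL_SAFE_PROMPT = PromptTemplate.from_template(
    """You are a legal document assistant. Answer ONLY from the context below.
Summarize and explain clauses in plain language. Do NOT give legal advice.
If the context does not contain the answer, reply exactly:
"I'm not sure based on the provided documents."

Context:
{context}

Question: {question}

Answer:"""
)
```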
The system supports multiple LLM providers via environment variables (see the sketch after this list):
OpenAI (gpt-4o-mini)
Groq (Llama-3.1-8B-Instant)
Google Gemini (1.5-Flash)
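A minimal sketch of how this switching might be wired up, assuming an LLM_PROVIDER environment variable (the variable name and default are assumptions):

```python
import os

def get_llm():
    # LLM_PROVIDER is an assumed variable name; model IDs match the list above.
    provider = os.getenv("LLM_PROVIDER", "openai").lower()
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini")
    if provider == "groq":
        from langchain_groq import ChatGroq
        return ChatGroq(model="llama-3.1-8b-instant")
    if provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
```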
The final answer is created using:
PromptTemplate → LLM → OutputParser
via a LangChain runnable chain.
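Reusing the hypothetical names from the sketches above, the chain composes with LCEL's pipe operator:

```python
from langchain_core.output_parsers import StrOutputParser

# PromptTemplate -> LLM -> OutputParser, composed as one runnable chain
chain = LEGAL_SAFE_PROMPT | get_llm() | StrOutputParser()

answer = chain.invoke(
    {
        "context": "\n\n".join(retrieved_chunks),
        "question": question,
    }
)
print(answer)
```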
The system was tested using a collection of sample legal texts, including:
Indian Contract Act – Key Sections
NDA (Non-Disclosure Agreement) Sample Contract
Fundamental Rights (Constitutional Summary)
Case Summary: Donoghue v. Stevenson
Data Protection Guidelines
Test Procedure:
Documents were placed in the data/ folder as .txt files.
The application was run using:
```bash
python src/app.py
```
Experimental results showed that the Legal Document RAG Assistant:
✔ Correctly retrieved relevant legal clauses
Semantic retrieval successfully matched questions to relevant sections of the Contract Act, NDA, and case summaries.
✔ Produced grounded explanations
Responses were consistent with the retrieved context and avoided hallucinations due to the strict RAG prompt.
✔ Demonstrated strong interpretability
Output included:
clause summaries,
section mentions,
bullet points,
plain-language restatements.
✔ Enforced legal-safety boundaries
The system consistently refused to give legal advice, answering with:
“I’m not sure based on the provided documents.”
when retrieval was insufficient.
✔ Worked across multiple LLM providers
OpenAI, Groq, and Gemini all generated coherent RAG-based answers.

This project demonstrates a complete end-to-end implementation of a RAG-based Legal Document Analyzer, fulfilling all requirements for AAIDC Module 1:
document ingestion
chunking
embedding
vector search
prompt construction
context-grounded LLM reasoning
The assistant is able to:
extract relevant legal information,
summarize complex clauses,
explain concepts in simple terms,
maintain safety by avoiding legal advice,
scale with additional documents and LLM providers.
This foundational architecture serves as a robust template for future modules involving:
agent workflows,
tool integration,
retrieval agents,
multi-step legal QA pipelines,
web UI deployment,
or integration with case/contract management systems.
The project successfully showcases practical use of RAG and agentic principles in a real-world legal domain.