End-to-End RAG App on India Census 2011 Using Groq and LangChain
This project implements a Retrieval-Augmented Generation (RAG) system that answers natural‑language questions over the India Census 2011 PDFs. The application uses a Groq‑hosted Llama 3.1 model for generation, LangChain for orchestration, FAISS for vector search, and Streamlit for the user interface.
Objective and Scope
The primary objective is to build an interactive question‑answering app where a user can ask queries such as “What is the literacy rate in state X?” or “How many districts does state Y have?” and receive answers grounded strictly in the India Census 2011 documents placed in a local ./india_census directory.
Key goals:
Ingest multiple Census 2011 PDF files.
Convert them into manageable text chunks with preserved context.
Index those chunks in a vector store for efficient semantic retrieval.
Use a Groq Llama 3.1 model to generate answers constrained by retrieved context.
Provide a simple, reproducible UI via Streamlit.
System Architecture
2.1 Data Ingestion and Preprocessing
Source data: India Census 2011 PDF reports saved locally inside ./india_census.
Loader: PyPDFDirectoryLoader("./india_census") recursively reads all PDF files from the directory and converts them into LangChain Document objects, preserving basic metadata such as file path and page number.
Text splitting: RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) breaks long pages into overlapping chunks. This overlap mitigates answer fragmentation across chunk boundaries and keeps semantic continuity for the retriever.
Only the first subset of documents (e.g., docs[:20]) is embedded during development, which keeps indexing fast and limits embedding‑API usage, as sketched below.
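A minimal sketch of this ingestion and splitting step, assuming the ./india_census directory from the Objective section and the ~20‑document development subset described later; variable names are illustrative, not necessarily those of the actual script:

```python
# Ingestion sketch: load every Census PDF in the directory and split the pages
# into overlapping chunks suitable for embedding and retrieval.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFDirectoryLoader("./india_census")   # one Document per PDF page, with metadata
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_documents = splitter.split_documents(docs[:20])   # small subset while prototyping
```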
2.2 Embedding and Vector Indexing
Embedding model: GoogleGenerativeAIEmbeddings(model="models/embedding-001") (or equivalently a Gemini embedding model) converts each chunk into a dense vector representation.
Vector store: FAISS.from_documents(final_documents, embeddings) builds an in‑memory FAISS index from these vectors, enabling approximate nearest‑neighbour search over the Census text.
Session caching: Streamlit’s st.session_state stores embeddings, docs, splitter, and vectors. The vector_embedding() function only runs once per session (if "vectors" not in st.session_state), avoiding repeated embedding calls and improving responsiveness and quota usage.
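A minimal sketch of the cached indexing step, reusing final_documents from the ingestion snippet above; it assumes GOOGLE_API_KEY is available in the environment (e.g., loaded from .env), and the real vector_embedding() may store additional objects in session state:

```python
# Build the FAISS index once per Streamlit session and cache it in st.session_state,
# so repeated questions do not trigger repeated embedding calls.
import streamlit as st
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS

def vector_embedding():
    if "vectors" not in st.session_state:          # skip if the index already exists
        st.session_state.embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        st.session_state.final_documents = final_documents   # chunks from the ingestion step
        st.session_state.vectors = FAISS.from_documents(
            st.session_state.final_documents, st.session_state.embeddings
        )
```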
2.3 Retrieval and Generation Pipeline
LLM backend: ChatGroq(model_name="llama-3.1-8b-instant", groq_api_key=...) uses Groq’s hosted Llama 3.1 8B model, selected for its low latency and strong performance on retrieval‑augmented use cases.
Prompt design: A ChatPromptTemplate structures the system instruction and injects both retrieved context and user question:
Instructs the model to “Use only the provided context to answer” and explicitly say “I do not know” if the answer is not present.
Wraps the retrieved text in a clearly delimited context block so that source material is kept separate from the user’s question; a sketch of this setup follows.
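A sketch of the model and prompt setup; the exact prompt wording, the context delimiters, and the GROQ_API_KEY environment‑variable name are illustrative assumptions:

```python
# Instantiate the Groq-hosted model and a prompt that constrains answers to the
# retrieved context ({context}) and the user's question ({input}).
import os
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGroq(
    groq_api_key=os.environ["GROQ_API_KEY"],   # kept in .env, never committed
    model_name="llama-3.1-8b-instant",
)

prompt = ChatPromptTemplate.from_template(
    """Use only the provided context to answer the question.
If the answer is not present in the context, say "I do not know".

<context>
{context}
</context>

Question: {input}"""
)
```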
Chains:
create_stuff_documents_chain(llm, prompt) creates a document‑combining chain that “stuffs” all retrieved chunks plus the question into the prompt for a single LLM call.
create_retrieval_chain(retriever, document_chain) composes the FAISS-backed retriever with the document chain into a single RAG pipeline. The chain is invoked with {"input": user_question} and returns a dictionary that includes answer and context (the list of documents actually used); a sketch of this wiring follows.
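A sketch of the chain wiring and a single query, reusing llm, prompt, and the cached FAISS index from the snippets above; user_question would typically come from a Streamlit text input:

```python
# Compose the retriever and the document-combining chain into one RAG pipeline,
# then query it with the user's question.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

document_chain = create_stuff_documents_chain(llm, prompt)    # "stuff" all chunks into one call
retriever = st.session_state.vectors.as_retriever()           # FAISS-backed semantic search
retrieval_chain = create_retrieval_chain(retriever, document_chain)

# create_retrieval_chain expects the query under the "input" key
response = retrieval_chain.invoke({"input": user_question})
answer = response["answer"]     # grounded answer text
sources = response["context"]   # Document chunks actually passed to the LLM
```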
3.2 Session and Performance Considerations
Session state: Using st.session_state ensures that embeddings and FAISS indices are computed once per session rather than on every question, which is essential when using external embedding APIs with rate limits.
Chunk subset: Limiting to ~20 documents during development reduces embedding load and speeds up prototyping, while still demonstrating full RAG behaviour.
Versioning and reproducibility:
All dependencies (Streamlit, LangChain, langchain-community, langchain-groq, langchain-google-genai, FAISS, etc.) are captured in requirements.txt.
A .gitignore file excludes venv/, __pycache__/, and .env to keep the repository clean and secret‑safe.
Python environment: The app is developed in a dedicated Python 3.12 virtual environment to avoid conflicts between LangChain, Pydantic, and other libraries.
Potential extensions:
Replace Google embeddings with a self‑hosted or GPU‑backed model once the environment supports PyTorch or another backend.
Add state selection and metadata-aware retrieval (e.g., filter by state, district, rural/urban) using document metadata; a hypothetical sketch follows this list.
Integrate evaluation tooling (such as LangSmith) to automatically track and score answer quality over a suite of benchmark questions.
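As an illustration of the metadata-aware retrieval extension, a hypothetical sketch assuming each chunk were tagged with a "state" metadata field at ingestion time (the current loader only provides file path and page number, so this key and the selected_state variable are assumptions):

```python
# Hypothetical metadata-aware retrieval: restrict FAISS search to chunks whose
# (assumed) "state" metadata matches the user's selection.
retriever = st.session_state.vectors.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"state": selected_state},   # selected_state, e.g., from a Streamlit selectbox
    }
)
```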