End-to-End RAG App on India Census 2011 Using Groq and LangChain
This project implements a Retrieval-Augmented Generation (RAG) system that answers natural‑language questions over the India Census 2011 PDFs. The application uses a Groq‑hosted Llama 3.1 model for generation, LangChain for orchestration, FAISS for vector search, and Streamlit for the user interface.
Objective and Scope
The primary objective is to build an interactive question‑answering app where a user can ask queries such as “What is the literacy rate in state X?” or “How many districts does state Y have?” and receive answers grounded strictly in the India Census 2011 documents placed in a local ./india_census directory.
Key goals:
Ingest multiple Census 2011 PDF files.
Convert them into manageable text chunks with preserved context.
Index those chunks in a vector store for efficient semantic retrieval.
Use a Groq Llama 3.1 model to generate answers constrained by retrieved context.
Provide a simple, reproducible UI via Streamlit.
System Architecture
2.1 Data Ingestion and Preprocessing
Source data: India Census 2011 PDF reports saved locally inside ./india_census.
Loader: PyPDFDirectoryLoader("./india_census") recursively reads all PDF files from the directory and converts them into LangChain Document objects, preserving basic metadata such as file path and page number.
Text splitting: RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) breaks long pages into overlapping chunks. This overlap mitigates answer fragmentation across chunk boundaries and keeps semantic continuity for the retriever.
Only the first subset of documents (e.g., docs[:20]) is embedded during development, which keeps indexing fast and limits embedding‑API usage, as sketched below.
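A minimal sketch of this ingestion and splitting step, assuming the ./india_census directory from the Objective section and the ~20‑document development subset described later; variable names are illustrative, not necessarily those of the actual script:

```python
# Ingestion sketch: load every Census PDF in the directory and split the pages
# into overlapping chunks suitable for embedding and retrieval.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFDirectoryLoader("./india_census")   # one Document per PDF page, with metadata
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_documents = splitter.split_documents(docs[:20])   # small subset while prototyping
```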
2.2 Embedding and Vector Indexing
Embedding model: GoogleGenerativeAIEmbeddings(model="models/embedding-001") (or equivalently a Gemini embedding model) converts each chunk into a dense vector representation.
Vector store: FAISS.from_documents(final_documents, embeddings) builds an in‑memory FAISS index from these vectors, enabling approximate nearest‑neighbour search over the Census text.
Session caching: Streamlit’s st.session_state stores embeddings, docs, splitter, and vectors. The vector_embedding() function only runs once per session (if "vectors" not in st.session_state), avoiding repeated embedding calls and improving responsiveness and quota usage.
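A minimal sketch of the cached indexing step, reusing final_documents from the ingestion snippet above; it assumes GOOGLE_API_KEY is available in the environment (e.g., loaded from .env), and the real vector_embedding() may store additional objects in session state:

```python
# Build the FAISS index once per Streamlit session and cache it in st.session_state,
# so repeated questions do not trigger repeated embedding calls.
import streamlit as st
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS

def vector_embedding():
    if "vectors" not in st.session_state:          # skip if the index already exists
        st.session_state.embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        st.session_state.final_documents = final_documents   # chunks from the ingestion step
        st.session_state.vectors = FAISS.from_documents(
            st.session_state.final_documents, st.session_state.embeddings
        )
```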
2.3 Retrieval and Generation Pipeline
LLM backend: ChatGroq(model_name="llama-3.1-8b-instant", groq_api_key=...) uses Groq’s hosted Llama 3.1 8B model, selected for its low latency and strong performance on retrieval‑augmented use cases.
Prompt design: A ChatPromptTemplate structures the system instruction and injects both retrieved context and user question:
Instructs the model to “Use only the provided context to answer” and explicitly say “I do not know” if the answer is not present.
Wraps the retrieved text in a clearly delimited context block so that source material is kept separate from the user’s question; a sketch of this setup follows.
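A sketch of the model and prompt setup; the exact prompt wording, the context delimiters, and the GROQ_API_KEY environment‑variable name are illustrative assumptions:

```python
# Instantiate the Groq-hosted model and a prompt that constrains answers to the
# retrieved context ({context}) and the user's question ({input}).
import os
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGroq(
    groq_api_key=os.environ["GROQ_API_KEY"],   # kept in .env, never committed
    model_name="llama-3.1-8b-instant",
)

prompt = ChatPromptTemplate.from_template(
    """Use only the provided context to answer the question.
If the answer is not present in the context, say "I do not know".

<context>
{context}
</context>

Question: {input}"""
)
```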
Chains:
create_stuff_documents_chain(llm, prompt) creates a document‑combining chain that “stuffs” all retrieved chunks plus the question into the prompt for a single LLM call.
create_retrieval_chain(retriever, document_chain) composes the FAISS-backed retriever with the document chain into a single RAG pipeline. The chain is invoked with {"input": user_question} and returns a dictionary that includes answer and context (the list of documents actually used); a sketch of this wiring follows.
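A sketch of the chain wiring and a single query, reusing llm, prompt, and the cached FAISS index from the snippets above; user_question would typically come from a Streamlit text input:

```python
# Compose the retriever and the document-combining chain into one RAG pipeline,
# then query it with the user's question.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

document_chain = create_stuff_documents_chain(llm, prompt)    # "stuff" all chunks into one call
retriever = st.session_state.vectors.as_retriever()           # FAISS-backed semantic search
retrieval_chain = create_retrieval_chain(retriever, document_chain)

# create_retrieval_chain expects the query under the "input" key
response = retrieval_chain.invoke({"input": user_question})
answer = response["answer"]     # grounded answer text
sources = response["context"]   # Document chunks actually passed to the LLM
```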
3.2 Session and Performance Considerations
Session state: Using st.session_state ensures that embeddings and FAISS indices are computed once per session rather than on every question, which is essential when using external embedding APIs with rate limits.
Chunk subset: Limiting to ~20 documents during development reduces embedding load and speeds up prototyping, while still demonstrating full RAG behaviour.
Versioning and reproducibility:
All dependencies (Streamlit, LangChain, langchain-community, langchain-groq, langchain-google-genai, FAISS, etc.) are captured in requirements.txt.
A .gitignore file excludes venv/, __pycache__/, and .env to keep the repository clean and secret‑safe.
Python environment: The app is developed in a dedicated Python 3.12 virtual environment to avoid conflicts between LangChain, Pydantic, and other libraries.
Potential extensions:
Replace Google embeddings with a self‑hosted or GPU‑backed model once the environment supports PyTorch or another backend.
Add state selection and metadata-aware retrieval (e.g., filter by state, district, rural/urban) using document metadata; a hypothetical sketch follows this list.
Integrate evaluation tooling (such as LangSmith) to automatically track and score answer quality over a suite of benchmark questions.
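As an illustration of the metadata-aware retrieval extension, a hypothetical sketch assuming each chunk were tagged with a "state" metadata field at ingestion time (the current loader only provides file path and page number, so this key and the selected_state variable are assumptions):

```python
# Hypothetical metadata-aware retrieval: restrict FAISS search to chunks whose
# (assumed) "state" metadata matches the user's selection.
retriever = st.session_state.vectors.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"state": selected_state},   # selected_state, e.g., from a Streamlit selectbox
    }
)
```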