Retrieval-Augmented Generation (RAG) for Document Question Answering
Overview
This project implements an end-to-end Retrieval-Augmented Generation (RAG) application that enables users to ask natural-language questions over a collection of PDF documents.
The system retrieves the most relevant document chunks using vector similarity search and generates accurate answers using a Large Language Model (LLM).
The application is built using LangChain, OpenAI GPT-4, FAISS, and Streamlit.
Problem Statement
Large documents such as census reports, research papers, and policy documents are difficult to search and understand manually.
Traditional keyword search often fails to capture semantic meaning.
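This project addresses that problem by retrieving the document chunks that are semantically closest to the question and passing them to an LLM, which generates an answer grounded in the retrieved text.
Solution Architecture
The system follows the standard RAG pipeline: PDFs are loaded and split into overlapping chunks, the chunks are embedded and indexed in FAISS, the chunks most similar to the user's question are retrieved, and the LLM generates an answer from them.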
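The retrieval step of this pipeline can be illustrated without any external service. The sketch below is a toy stand-in, not the project's code: word counts replace the OpenAI embeddings, brute-force cosine similarity replaces FAISS, and the function names and sample chunks are hypothetical placeholders.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Placeholder chunks, invented purely for illustration.
chunks = [
    "Chapter 3 reports the literacy rate for each state.",
    "Chapter 5 describes rainfall in coastal districts.",
    "Chapter 7 covers population density in urban districts.",
]
top = retrieve("What is the literacy rate?", chunks, k=1)
print(top)
```

A real embedding model would also match paraphrases that share no words with the question, which this word-count stand-in cannot do.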

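Technologies Used
Python
LangChain (PDF loading and text splitting)
OpenAI embeddings and GPT-4
FAISS (vector similarity search)
Streamlit (web interface)
Dataset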
The application uses publicly available PDF documents, such as:
India Census Reports (https://censusindia.gov.in)
Open government and policy documents
Only open and non-copyrighted data is used.
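Key Code Snippet (Vector Store Creation)
    # Import paths assume a recent split-package LangChain install; adjust to your version.
    import streamlit as st
    from langchain_community.document_loaders import PyPDFDirectoryLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    @st.cache_resource  # build the index once and reuse it across Streamlit reruns
    def create_vector_store():
        # OPENAI_API_KEY is expected to be defined elsewhere in the app
        embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
        # Load every PDF found in the folder
        loader = PyPDFDirectoryLoader("./india_census")
        docs = loader.load()
        # Split into 1000-character chunks that overlap by 200 characters
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        chunks = splitter.split_documents(docs)
        # Embed the chunks and index them in FAISS
        vectorstore = FAISS.from_documents(chunks, embeddings)
        return vectorstore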
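To see what `chunk_size=1000` and `chunk_overlap=200` mean in practice, here is a simplified sketch of overlapping chunking. It is an assumption-laden illustration: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph breaks, whereas this hypothetical `split_with_overlap` helper just slides a fixed window over the characters.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2500-character text, used only to show the arithmetic.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])
```

With these settings each chunk starts 800 characters after the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.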
Sample Questions
“What is the literacy rate according to the census data?”
“Which state has the highest population?”
“Summarize the key findings from the document.”
How to Run
    pip install -r requirements.txt
    streamlit run app.py
Results
Answers grounded in the retrieved document chunks
Fewer hallucinations, because the model is restricted to the retrieved context
Fast retrieval through FAISS similarity search
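Strict context usage is usually enforced through the prompt. The project's actual prompt is not shown here, so the following is only a sketch of one common pattern: the `build_prompt` helper and its wording are assumptions, and the chunk strings are placeholders for whatever the retriever returns.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that tells the model to answer only from the given chunks."""
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Which state has the highest population?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],
)
print(prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports its answer.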
Future Enhancements
Support for document uploads via UI
Persistent vector storage on disk
Multi-document comparison
Role-based access and chat history
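Persistent vector storage generally follows a load-or-build pattern: reuse the saved index if it exists, otherwise build it and save it. The sketch below shows that pattern with a hypothetical `load_or_build` helper and a JSON file standing in for the index, so that it runs without LangChain or an API key.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(
    path: Path,
    build: Callable[[], Any],
    save: Callable[[Any, Path], None],
    load: Callable[[Path], Any],
) -> Any:
    """Return the saved index if present; otherwise build it, save it, and return it."""
    if path.exists():
        return load(path)
    index = build()
    save(index, path)
    return index


calls = []  # records how often the expensive build step runs


def build_demo() -> dict:
    calls.append("build")
    return {"chunks": 3}  # placeholder for a real vector index


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    save = lambda obj, p: p.write_text(json.dumps(obj))
    load = lambda p: json.loads(p.read_text())
    first = load_or_build(index_path, build_demo, save, load)   # builds and saves
    second = load_or_build(index_path, build_demo, save, load)  # loads from disk
```

With LangChain's FAISS wrapper, `save` could call `vectorstore.save_local(...)` and `load` could call `FAISS.load_local(...)` with the same embeddings object. Recent versions require `allow_dangerous_deserialization=True` to load a pickled index, so check the behavior of the installed version before relying on it.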
Conclusion
This project demonstrates a practical and scalable implementation of Retrieval-Augmented Generation for real-world document understanding.
It highlights how combining vector databases with large language models can significantly improve information retrieval and question answering.