RAG Document Q&A using OpenAI

📄 Retrieval-Augmented Generation (RAG) for Document Question Answering
🔍 Overview

This project implements an end-to-end Retrieval-Augmented Generation (RAG) application that lets users ask natural-language questions over a collection of PDF documents.
The system retrieves the most relevant document chunks using vector similarity search and passes them to a Large Language Model (LLM), which generates an answer grounded in the retrieved text.

The application is built using LangChain, OpenAI GPT-4, FAISS, and Streamlit.

🎯 Problem Statement
Large documents such as census reports, research papers, and policy documents are difficult to search and understand manually.
Traditional keyword search often fails to capture semantic meaning.

This project solves that problem by:

Converting documents into vector embeddings
Retrieving relevant content semantically
Generating context-aware answers using an LLM

🧠 Solution Architecture
The system follows the standard RAG pipeline:

1. Document Loading: PDF documents are loaded from a local directory.
2. Text Chunking: Large documents are split into smaller overlapping chunks, so that a passage spanning a chunk boundary still appears whole in at least one chunk.
3. Embedding Generation: Each chunk is converted into a vector embedding using OpenAI embeddings.
4. Vector Storage: Embeddings are stored in a FAISS vector index for fast similarity search.
5. Retrieval: The question is embedded the same way, and the chunks whose embeddings are closest to it are retrieved.
6. Answer Generation: The retrieved chunks are passed to GPT-4 as context, and the model generates the answer from them.
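
The retrieval step above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the `embed` function here is a toy bag-of-words stand-in for OpenAI embeddings (it matches shared words, not meaning), the brute-force ranking stands in for FAISS, and the sample sentences are made-up placeholders. It only shows the shape of the idea: embed the question, score every chunk by cosine similarity, keep the top k.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (brute force)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Placeholder chunks, invented purely for illustration.
chunks = [
    "The literacy rate improved across most states.",
    "Population density is highest in urban districts.",
    "Rainfall data is recorded by district each year.",
]
top = retrieve("What is the literacy rate?", chunks, k=1)
print(top)
```

In the real pipeline the vectors are dense embeddings, so a chunk can rank highly even when it shares no words with the question, which is the advantage over keyword search described earlier.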

🛠️ Technologies Used

Python

Streamlit – Web UI
LangChain – RAG framework
OpenAI GPT-4 – Answer generation
OpenAI Embeddings – Semantic embeddings
FAISS – Vector database
dotenv – Environment variable management

📂 Dataset
The application uses publicly available PDF documents, such as:

India Census Reports
https://censusindia.gov.in

Open government and policy documents

Only open and non-copyrighted data is used.

💻 Key Code Snippet (Vector Store Creation)

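import os
import streamlit as st
from dotenv import load_dotenv
# Import paths below assume the split LangChain packages; older releases expose them elsewhere.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # assumed setup: the key lives in a local .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

@st.cache_resource  # build the index once, not on every Streamlit rerun
def create_vector_store():
    embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

    # Load every PDF in the directory (one Document per page).
    loader = PyPDFDirectoryLoader("./india_census")
    docs = loader.load()

    # Split into chunks of up to 1000 characters that overlap by up to 200.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = splitter.split_documents(docs)

    # Embed each chunk and build an in-memory FAISS index.
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore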
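
To make the `chunk_size=1000` and `chunk_overlap=200` settings in the snippet above concrete, here is a simplified sketch of overlapping chunking. It is a plain sliding window written for illustration (the helper name `split_with_overlap` is hypothetical); `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding window: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2600-character text made of repeating digits.
text = "".join(str(i % 10) for i in range(2600))
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                           # number of chunks produced
print(chunks[0][-200:] == chunks[1][:200])   # neighbors share the overlap region
```

With these settings the window advances 800 characters at a time, so a sentence that straddles a boundary is still fully contained in one of the two neighboring chunks as long as it is shorter than the overlap.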

💬 Sample Questions
“What is the literacy rate according to the census data?”
“Which state has the highest population?”
“Summarize the key findings from the document.”

▶️ How to Run
pip install -r requirements.txt
streamlit run app.py

📈 Results
Accurate document-grounded answers
Reduced hallucinations, because the model is instructed to answer only from the retrieved context
Fast response time using FAISS similarity search
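
"Strict context usage" is typically enforced in the prompt. The sketch below shows one way such a prompt could be assembled from retrieved chunks. The wording and the `build_prompt` helper are hypothetical and not taken from this project's code; the actual application may phrase its instructions differently.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    # Number the chunks so an answer can point back to its source passage.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Which state has the highest population?",
    ["  first retrieved chunk  ", "second retrieved chunk"],
)
print(prompt)
```

Giving the model an explicit way to decline ("say that you do not know") is what discourages it from filling gaps with content that is not in the documents.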

🔮 Future Enhancements
Support for document uploads via UI
Persistent vector storage on disk
Multi-document comparison
Role-based access and chat history

📌 Conclusion
This project demonstrates a practical and scalable implementation of Retrieval-Augmented Generation for real-world document understanding.
It highlights how combining vector databases with large language models can significantly improve information retrieval and question answering.
