RAG Question-Answering Assistant (Final Submission)
This project implements a Retrieval-Augmented Generation (RAG) assistant that answers questions using a custom local document set. The core deliverable is a working pipeline that performs document ingestion, chunking, embedding, retrieval, and grounded response generation with OpenAI models.
1) Specific Objectives
Ingest local text documents from a custom corpus.
Split documents into retrieval-friendly chunks.
Create vector embeddings for chunks.
Retrieve top-k relevant chunks for each query.
Generate context-grounded answers with an LLM.
Provide a minimal CLI interface for interactive testing.
Return answer sources for transparency.
2) Intended Audience and Use Case
Target Audience
Students learning RAG fundamentals.
Early-stage AI practitioners building document QA systems.
Developers who need a small, understandable baseline before production-scale deployment.
Primary Use Case
Ask domain questions over a curated document collection and get grounded answers with source references.
Prerequisites
Python 3.10+ (tested in this workspace with Python 3.14).
Basic Python and command-line familiarity.
OpenAI API key.
3) Problem Definition
Generic LLM responses can be inaccurate for niche knowledge domains or private documents. This project addresses that limitation by retrieving relevant context from user-provided files before generation, reducing hallucination risk and increasing answer relevance for the selected corpus.
4) Dataset
The corpus consists of the following text files:
artificial_intelligence.txt
biotechnology.txt
climate_science.txt
quantum_computing.txt
sample_documents.txt
space_exploration.txt
sustainable_energy.txt
Basic Dataset Stats
Number of source documents: 7
File format: .txt
Domain coverage: AI, biotech, climate, quantum, space, sustainability
6) Methodology
The implemented pipeline follows a standard RAG flow:
Load documents
Read supported files from the data directory.
Chunk documents
Use RecursiveCharacterTextSplitter with overlap to preserve local context.
Embed chunks
Generate OpenAI embeddings (text-embedding-3-small by default).
Index in memory
Store chunk text, metadata, IDs, and vectors.
Retrieve top-k chunks
Compute cosine similarity between query vector and indexed vectors.
Prompt + generate
Build a grounded prompt with retrieved context and generate an answer with gpt-4o-mini (default).
Return answer + sources
Output answer text, context chunks, and source filenames.
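The prompt-building step above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name `build_grounded_prompt` and the chunk dictionary shape (`text`/`source` keys) are assumptions for the example.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk is assumed to be a dict with 'text' and 'source' keys;
    the real pipeline's data shape may differ.
    """
    context = "\n\n".join(
        f"[{c['source']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    {"text": "Machine learning is a subfield of AI.",
     "source": "artificial_intelligence.txt"},
]
prompt = build_grounded_prompt("What is machine learning?", chunks)
```

Prefixing each chunk with its source filename is what lets the generation step report sources back to the user.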
7) Project Structure
src/vectordb.py
In-memory vector index (chunk_text(), add_documents(), search()).
demo.py
Minimal CLI for sample queries and interactive Q&A.
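An interactive loop of the kind demo.py provides can be sketched as below. This is an illustrative stand-in, not the actual demo.py; `answer_fn` is an assumed callback that wraps the retrieval-plus-generation pipeline.

```python
def repl(answer_fn):
    """Minimal interactive Q&A loop.

    answer_fn(question) is assumed to return (answer_text, source_list);
    an empty input line exits the loop.
    """
    while True:
        q = input("Question (blank to quit): ").strip()
        if not q:
            break
        answer, sources = answer_fn(q)
        print(answer)
        print("Sources:", ", ".join(sources))
```

Keeping the loop decoupled from the pipeline via a callback makes the CLI trivially testable with a stub answer function.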
Tools/Frameworks
LangChain core abstractions (prompting/output parsing)
LangChain OpenAI integrations
OpenAI embeddings and chat model
NumPy for vector math
Python dotenv for environment config
VectorDB Design and Retrieval Logic
This VectorDB is a simple in-memory vector store that enables semantic retrieval for a RAG system. During ingestion, each document is split into smaller chunks using a recursive character-based text splitter. Chunking improves retrieval precision by ensuring embeddings represent focused, semantically coherent text instead of entire large documents. A small chunk overlap is used to preserve context across boundaries so important information isn’t lost between adjacent segments.
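A simplified version of the chunking step can be sketched in plain Python. This is not the project's splitter: the real RecursiveCharacterTextSplitter also prefers paragraph and sentence boundaries, whereas this sketch uses fixed character windows purely to illustrate the overlap mechanism.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size character windows.

    Consecutive chunks share `overlap` characters so information near
    a boundary appears in both neighbors.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment already fully covered by the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

With the defaults, the last 50 characters of each chunk reappear at the start of the next one, which is exactly the boundary-preservation property described above.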
Each chunk is converted into a dense vector using OpenAI embeddings and stored in a NumPy matrix along with its metadata. At query time, the user question is embedded into the same vector space. The system then computes cosine similarity between the query vector and all stored document vectors. Cosine similarity is used because it measures semantic direction similarity independent of vector magnitude, making it well-suited for text embeddings.
The top-k most similar chunks are returned and passed to the LLM as grounded context, enabling accurate Retrieval-Augmented Generation.
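The retrieval step described above amounts to a few lines of NumPy. This sketch assumes embeddings are stored row-wise in a matrix, as described; the function name `top_k` is illustrative, not the VectorDB's actual API.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return indices of the k most similar rows by cosine similarity.

    doc_matrix: (n_chunks, dim) array of chunk embeddings.
    query_vec:  (dim,) array for the embedded question.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                  # one similarity score per chunk
    return np.argsort(-sims)[:k]  # indices, highest similarity first
```

Normalizing both sides first is what makes the score magnitude-independent, matching the rationale for cosine similarity given above.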
8) Evaluation and Verification
Ingestion verification: confirm all files load and chunk successfully.
Retrieval verification: confirm top-k chunks are semantically relevant to test queries.
Grounding verification: confirm generated answers align with retrieved context.
Source transparency: confirm source filenames are returned with responses.
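The answer- and source-level checks above can be automated with a small spot-check helper. This is a sketch: the response dictionary shape (`answer`/`chunks`/`sources` keys) is an assumed interface, not the pipeline's exact return type.

```python
# The seven corpus filenames, used to validate returned sources.
CORPUS = {
    "artificial_intelligence.txt", "biotechnology.txt",
    "climate_science.txt", "quantum_computing.txt",
    "sample_documents.txt", "space_exploration.txt",
    "sustainable_energy.txt",
}

def verify_response(response):
    """Run the transparency spot checks on one pipeline response.

    Returns a dict of named boolean checks so failures are easy to read.
    """
    return {
        "has_answer": bool(response.get("answer", "").strip()),
        "has_context": len(response.get("chunks", [])) > 0,
        "sources_known": set(response.get("sources", [])) <= CORPUS,
    }
```

Grounding itself (answer actually supported by the retrieved text) still needs human review; this helper only catches structural failures such as empty context or unknown source filenames.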
Example Verification Queries
“What is machine learning?”
“How does deep learning work?”
“What are key AI ethics concerns?”
9) Results Summary
Observed behavior in local runs:
Documents are loaded and indexed successfully.
Retrieval returns relevant chunks for sample domain questions.
Answers are coherent and grounded in the corpus.
Source file names are returned for traceability.
10) Limitations and Trade-offs
In-memory index only
Not designed for very large corpora or persistent multi-session serving.
No reranker
Retrieval quality depends on embedding similarity alone.
No quantitative benchmark
Evaluation is currently functional + qualitative, not a full benchmark suite.
API dependency
Requires OpenAI access at runtime.
11) Setup and Usage
Configure
Create a .env file in the project root with:
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
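These variables can be read at startup with sensible fallbacks; a minimal sketch (the constant names are illustrative, not the project's actual config module):

```python
import os
# from dotenv import load_dotenv  # uncomment when python-dotenv is installed
# load_dotenv()                   # reads .env into the process environment

MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
API_KEY = os.getenv("OPENAI_API_KEY")  # required at runtime; no safe default
```

Defaulting the model names but not the API key means a missing key fails loudly at the first API call rather than silently using the wrong credentials.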
Install
Activate the project environment and install dependencies from the requirements file:
pip install -r requirements.txt
Run
From project root:
python demo.py
Then test predefined and interactive questions.