# Abstract
This project implements a Retrieval-Augmented Generation (RAG) based Question Answering system that allows users to query information from documents using natural language. The system processes PDF documents, converts them into embeddings using Sentence Transformers, stores them in a vector database (Chroma), and retrieves relevant context when a query is issued. A local Large Language Model (LLM) powered by Llama.cpp generates the final response using the retrieved context.
The architecture combines document retrieval and generative language models, enabling more accurate responses grounded in the source documents.
Traditional LLMs generate answers based only on training data, which can lead to:
1.hallucinations
2.outdated information
3.inability to reference private documents
To address this, Retrieval-Augmented Generation (RAG) integrates:
1.Document retrieval
2.Vector similarity search
3.Context-aware LLM response generation
This project demonstrates a local RAG pipeline capable of:
Ingesting documents
embedding document chunks
storing vectors
retrieving relevant context
generating answers using a local LLM
Such systems are widely used in:
enterprise knowledge assistants
document QA systems
customer support automation
internal company search engines
The project follows a standard RAG pipeline consisting of the following stages:
#1. Data Ingestion
Documents (PDF) are loaded using a document loader.
Tool used:
PyMuPDFLoader
#2. Document Chunking
Large documents are split into smaller chunks to improve retrieval quality.
Technique:
Recursive Character Text Splitter
Benefits:
better semantic embedding
improved retrieval accuracy
#3. Embedding Generation
Each text chunk is converted into a numerical vector representation.
Embedding Model:
SentenceTransformer
Purpose:
enable semantic similarity search
#4. Vector Database Storage
All embeddings are stored in a vector database.
Database:
ChromaDB
Features:
fast similarity search
persistent storage
scalable retrieval
#5. Query Processing
User query is converted to an embedding and compared with stored vectors.
Steps:
Query embedding generation
Similarity search in vector DB
Retrieve top-k relevant chunks
#6. Response Generation
The retrieved chunks are passed to a local LLM which generates the final response.
Model:
Llama.cpp
Advantages:
privacy
offline inference
reduced API cost
#System Architecture Diagram
+----------------------+
| PDF Files |
+----------+-----------+
|
v
+----------------------+
| Document Loader |
| (PyMuPDFLoader) |
+----------+-----------+
|
v
+----------------------+
| Text Chunking |
| Recursive Splitter |
+----------+-----------+
|
v
+----------------------+
| Embedding Model |
| SentenceTransformers |
+----------+-----------+
|
v
+----------------------+
| Vector Database |
| ChromaDB |
+----------+-----------+
|
User Query |
v
+----------------------+
| Query Embedding |
+----------+-----------+
|
v
+----------------------+
| Similarity Search |
+----------+-----------+
|
v
+----------------------+
| Retrieved Context |
+----------+-----------+
|
v
+----------------------+
| Local LLM (Llama.cpp)|
+----------+-----------+
|
v
+----------------------+
| Generated Response |
+----------------------+
Chunk size impacts retrieval accuracy
Smaller chunks increase recall but may reduce context
Medium-sized chunks gave the best results
Key results observed from the pipeline:
Document retrieval significantly improved answer accuracy
The system correctly retrieves relevant sections from documents
Llama.cpp generates contextual answers grounded in retrieved chunks
Performance characteristics:
Metric Result
Retrieval latency ~200–400 ms
LLM response time ~1–3 sec
Retrieval accuracy ~80–90%
This project demonstrates a practical implementation of a local Retrieval-Augmented Generation (RAG) architecture for document question answering.
Key takeaways:
Combining vector retrieval with LLMs improves answer reliability
Local LLM deployment ensures privacy and reduced cost
Vector databases enable scalable semantic search
Future work can extend the system by adding:
real-time APIs
document indexing pipelines
evaluation frameworks
production deployment