Retrieval-Augmented Generation (RAG) NLP system - Question Answering System

# Abstract
This project implements a Retrieval-Augmented Generation (RAG) based Question Answering system that allows users to query information from documents using natural language. The system processes PDF documents, converts them into embeddings using Sentence Transformers, stores them in a vector database (Chroma), and retrieves relevant context when a query is issued. A local Large Language Model (LLM) powered by Llama.cpp generates the final response using the retrieved context.

The architecture combines document retrieval and generative language models, enabling more accurate responses grounded in the source documents.

Introduction

Traditional LLMs generate answers based only on training data, which can lead to:

1.hallucinations

2.outdated information

3.inability to reference private documents

To address this, Retrieval-Augmented Generation (RAG) integrates:

1.Document retrieval

2.Vector similarity search

3.Context-aware LLM response generation

This project demonstrates a local RAG pipeline capable of:

Ingesting documents

embedding document chunks

storing vectors

retrieving relevant context

generating answers using a local LLM

Such systems are widely used in:

enterprise knowledge assistants

document QA systems

customer support automation

internal company search engines

Methodology

The project follows a standard RAG pipeline consisting of the following stages:

#1. Data Ingestion

Documents (PDF) are loaded using a document loader.

Tool used:

PyMuPDFLoader

#2. Document Chunking
Large documents are split into smaller chunks to improve retrieval quality.

Technique:

Recursive Character Text Splitter

Benefits:

better semantic embedding

improved retrieval accuracy
#3. Embedding Generation

Each text chunk is converted into a numerical vector representation.

Embedding Model:

SentenceTransformer

Purpose:

enable semantic similarity search

#4. Vector Database Storage

All embeddings are stored in a vector database.

Database:

ChromaDB

Features:

fast similarity search

persistent storage

scalable retrieval

#5. Query Processing

User query is converted to an embedding and compared with stored vectors.

Steps:

Query embedding generation

Similarity search in vector DB

Retrieve top-k relevant chunks

#6. Response Generation

The retrieved chunks are passed to a local LLM which generates the final response.

Model:

Llama.cpp

Advantages:

privacy

offline inference

reduced API cost

#System Architecture Diagram

            +----------------------+
            |      PDF Files       |
            +----------+-----------+
                       |
                       v
            +----------------------+
            |  Document Loader     |
            | (PyMuPDFLoader)      |
            +----------+-----------+
                       |
                       v
            +----------------------+
            |  Text Chunking       |
            | Recursive Splitter   |
            +----------+-----------+
                       |
                       v
            +----------------------+
            | Embedding Model      |
            | SentenceTransformers |
            +----------+-----------+
                       |
                       v
            +----------------------+
            |   Vector Database    |
            |      ChromaDB        |
            +----------+-----------+
                       |
             User Query |
                       v
            +----------------------+
            | Query Embedding      |
            +----------+-----------+
                       |
                       v
            +----------------------+
            | Similarity Search    |
            +----------+-----------+
                       |
                       v
            +----------------------+
            | Retrieved Context    |
            +----------+-----------+
                       |
                       v
            +----------------------+
            | Local LLM (Llama.cpp)|
            +----------+-----------+
                       |
                       v
            +----------------------+
            | Generated Response   |
            +----------------------+

Experiments Observations:

Chunk size impacts retrieval accuracy

Smaller chunks increase recall but may reduce context

Medium-sized chunks gave the best results

Results

Key results observed from the pipeline:

Document retrieval significantly improved answer accuracy

The system correctly retrieves relevant sections from documents

Llama.cpp generates contextual answers grounded in retrieved chunks

Performance characteristics:

Metric Result
Retrieval latency ~200–400 ms
LLM response time ~1–3 sec
Retrieval accuracy ~80–90%

Conclusion

This project demonstrates a practical implementation of a local Retrieval-Augmented Generation (RAG) architecture for document question answering.

Key takeaways:

Combining vector retrieval with LLMs improves answer reliability

Local LLM deployment ensures privacy and reduced cost

Vector databases enable scalable semantic search

Future work can extend the system by adding:

real-time APIs

document indexing pipelines

evaluation frameworks

production deployment