A semantic search-powered RAG pipeline that grounds GPT-4 responses in user documents using FAISS, LlamaIndex, and Sentence-BERT.
Introduction
Large Language Models (LLMs) like GPT-4 are incredibly powerful, yet they often generate fluent but factually incorrect answers, especially when external context is needed. This project tackles that limitation through Retrieval-Augmented Generation (RAG), a hybrid framework that grounds generation in relevant data chunks retrieved from user-provided documents.
By combining vector similarity search (FAISS), LlamaIndex for document parsing, and GPT-4 for final generation, this pipeline produces grounded, high-fidelity responses that reduce hallucinations and improve relevance.
Why It Matters
Reduces Hallucination:
LLMs often generate confident but incorrect answers. RAG mitigates this by grounding the model's response in real, retrieved information.
Improves Accuracy & Relevance:
With access to up-to-date or domain-specific data, RAG enables language models to give contextually appropriate answers, even on topics not present during training.
Supports Real-Time Knowledge Access:
RAG systems can be updated instantly via the retrieval database, without needing to retrain the LLM. This is critical for dynamic domains like finance, healthcare, or legal services.
Problem Statement
Traditional LLMs generate responses based solely on their internal training data, making them prone to:
Hallucinations in domain-specific contexts
Inability to access updated, private, or long-form documents
Weak traceability and verifiability of outputs
To address this, the solution retrieves contextually relevant passages from external documents and feeds them into the prompt, improving response accuracy, transparency, and grounding.
Motivation
Limitations of Traditional Language Models
Large language models like GPT are trained on vast amounts of data up to a certain cutoff date. However:
They cannot access new information after training.
They may lack knowledge in specific domains (e.g., proprietary datasets, niche fields).
Their responses can become outdated or inaccurate over time.
How RAG Addresses This
RAG enhances language models by enabling them to retrieve external knowledge dynamically at inference time. It acts like a research assistant that:
Converts the query into a dense vector using Sentence-BERT.
Performs semantic similarity search over a FAISS-based vector index built from uploaded documents.
Feeds the top relevant chunks as context to the language model, which then generates an informed, accurate answer.
Key Benefits
Access to Real-Time Information: Responses reflect facts from the most recent or user-supplied documents without retraining the model.
Customization for Domains: Users can upload their own PDFs, CSVs, or text files to personalize the retrieval corpus.
Higher Answer Quality: Combines the linguistic fluency of LLMs with factual grounding from document chunks retrieved via FAISS.
Modular and Offline-Compatible: Built using open-source components like FAISS and Sentence-BERT, the system runs locally without needing managed APIs.
Architecture Overview
RAG System Components
RAG architecture consists of two key components: the Retriever and the Generator, working in tandem to produce accurate, context-aware responses.
1. Retriever
Function: Converts the user's natural language query into a dense vector embedding using Sentence-BERT (all-MiniLM-L6-v2).
Search: It uses this vector to perform a similarity search in a vector database (FAISS) to find top-k relevant chunks from the uploaded document.
Goal: To bring relevant context from an external corpus that can support or guide the language model's generation.
Think of it as a librarian who finds the best books or paragraphs that match your question.
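A minimal sketch of this retrieval step, assuming the sentence-transformers and faiss-cpu packages; the example chunks and the retrieve helper name are illustrative, not part of the project code:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in the real pipeline these come from the uploaded documents.
chunks = [
    "RAG combines retrieval with generation to ground answers in documents.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Sentence-BERT produces sentence-level embeddings for semantic search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed and L2-normalize so inner-product search behaves like cosine similarity.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(np.asarray(chunk_vecs, dtype="float32"))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    q_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

print(retrieve("What does FAISS do?"))
```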
2. Generator
Function: Takes the original user query and the retrieved documents, then concatenates them into a single prompt.
Model: Sends the combined input to OpenAI's GPT-4 to generate a natural language response.
Goal: To ensure the final output is fluent and coherent, while being grounded in factual information retrieved via FAISS.
It's like a student writing an answer using the exact pages the librarian just handed over.
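A minimal sketch of the generation step, assuming the OpenAI Python SDK (v1-style client) and a retrieve helper like the one sketched above; the prompt wording is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(query: str, context_chunks: list[str]) -> str:
    """Concatenate retrieved chunks with the query and ask GPT-4 for a grounded answer."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: answer = generate("What is RAG?", retrieve("What is RAG?"))
```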
Figure 1: High-level architecture illustrating the interaction between Retriever and Generator components in the RAG-based question-answering pipeline
RAG Workflow
The RAG workflow is a step-by-step pipeline that bridges retrieval-based search with generative language modeling, resulting in high-quality, context-aware answers.
Step-by-Step Process
1. Input Query
The user submits a question or prompt (e.g., "What is Retrieval-Augmented Generation?").
This is treated as the natural language input to the system.
2. Embed the Query
The query is converted into a vector embedding using Sentence-BERT (all-MiniLM-L6-v2).
This embedding captures the semantic meaning of the query.
3. Retrieve Top-k Chunks
The embedding is used to search a FAISS vector index for the most relevant document chunks.
The system retrieves the top-k most similar chunks using cosine similarity.
4. Augment the Prompt
The retrieved chunks are concatenated with the original query into a single, context-rich prompt.
5. Generate the Answer
GPT-4 receives the augmented prompt and produces a fluent answer grounded in the retrieved context.
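Put together, the steps reduce to a small composition of the retrieve and generate helpers sketched in the architecture section above (the names are illustrative):

```python
def answer(query: str, k: int = 3) -> str:
    """Steps 1-5: embed the query, retrieve top-k chunks, augment the prompt, generate."""
    top_chunks = retrieve(query, k=k)   # steps 2-3: embed query + FAISS search
    return generate(query, top_chunks)  # steps 4-5: augment prompt + GPT-4 generation

print(answer("What is Retrieval-Augmented Generation?"))
```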
Figure 2: End-to-end RAG workflow illustrating how PDFs are ingested, semantically indexed, and queried to generate grounded responses
Tools & Frameworks
RAG systems are built using a modular set of tools, allowing you to mix and match components based on use-case requirements, latency tolerance, and cost.
Embeddings
Sentence-BERT (all-MiniLM-L6-v2)
A variation of BERT optimized for generating sentence-level embeddings.
Converts queries and document chunks into semantic vectors.
Fast and effective for semantic similarity tasks.
Ideal for offline and local use with FAISS.
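A small illustration of what the embedding model does on its own, using the cosine-similarity utility bundled with sentence-transformers; the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my password?"
chunks = [
    "To change your password, open Settings and choose Security.",
    "Our office is closed on public holidays.",
]

# Each text becomes a 384-dimensional vector; similar meanings yield high cosine scores.
q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(chunks, convert_to_tensor=True)
print(util.cos_sim(q_emb, c_emb))  # the first chunk should score noticeably higher
```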
Vector Database
FAISS (Facebook AI Similarity Search)
Open-source similarity search library developed by Meta AI.
Used in this project for building a local, in-memory vector index from embedded text chunks.
Enables fast and memory-efficient retrieval based on cosine similarity.
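A sketch of how such an index might be built from a PDF, assuming PyPDF2 for text extraction and a simple fixed-size chunking heuristic; the file name, chunk size, and index path are illustrative:

```python
import faiss
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (a simple chunking heuristic)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# 1. Extract raw text from the uploaded PDF.
reader = PdfReader("my_document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Chunk and embed (normalized so inner product equals cosine similarity).
chunks = chunk_text(text)
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

# 3. Build and persist the in-memory FAISS index.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "chunks.index")
```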
Language Models
GPT-4 (OpenAI)
Generates responses based on retrieved document context.
Strong reasoning and generation capabilities, suitable for fact-grounded output.
HuggingFace Transformers (Sentence-BERT)
Converts text (query and document chunks) into embeddings.
Used in semantic similarity search via FAISS.
Development Platforms
Python
Core language for implementing the entire RAG pipeline.
Used with libraries like faiss, sentence-transformers, PyPDF2, and openai.
Jupyter Notebooks / Google Colab
Used for building, testing, and debugging the step-by-step RAG pipeline.
Enables inline visualization, code annotation, and reproducible experiments.
Makes it easy to demo and iterate on document-based question answering workflows.
Evaluation Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures overlap of n-grams between the generated and reference texts.
BLEU (Bilingual Evaluation Understudy)
Often used for machine translation, but applicable to RAG when a reference answer is available (see the scoring sketch after this list).
Human Evaluation
Essential for qualitative aspects like clarity, fluency, and factual accuracy, especially in high-stakes domains like healthcare or law.
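A minimal scoring sketch, assuming the rouge-score package and NLTK are installed and a hand-written reference answer is available; the example texts are made up:

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "RAG grounds LLM answers in passages retrieved from external documents."
generated = "RAG grounds language model answers in retrieved document passages."

# ROUGE: n-gram overlap between the generated answer and the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))

# BLEU: n-gram precision; smoothing avoids zero scores on short answers.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu:.3f}")
```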
Applications
Chatbots for Customer Support
RAG enables chatbots to search and retrieve answers from internal documentation, FAQs, or support logs.
Using FAISS and Sentence-BERT, chatbots can deliver responses grounded in real company-specific data.
GPT-4 enhances fluency and personalization in responses.
Legal and Healthcare Document Search
Users can query domain-specific documents like contracts, case law, or clinical literature.
The retriever (FAISS + Sentence-BERT) ensures relevant clauses or passages are pulled, reducing manual search effort.
GPT-4 summarizes or explains complex findings in plain language.
Academic Research Assistants
Researchers can upload papers, notes, or textbook excerpts.
FAISS indexes the chunks, and the RAG system retrieves relevant material on demand.
GPT-4 generates summaries, comparisons, or context-aware explanations.
Challenges
Latency in Retrieval
Embedding queries and searching FAISS can introduce slight delays on large datasets.
Real-time performance depends on FAISS index size and embedding model speed.
Optimization (e.g., chunk size tuning, caching) is needed for low-latency use.
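One low-effort optimization is memoizing embeddings for repeated queries, sketched here with functools.lru_cache; the helper name is illustrative, and chunk-size tuning or a different FAISS index type matters more at scale:

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple[float, ...]:
    """Cache embeddings so only previously unseen queries pay the encoding cost."""
    return tuple(model.encode(query, normalize_embeddings=True).tolist())

embed_query("What is RAG?")  # computed once
embed_query("What is RAG?")  # served from the cache
```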
Data Freshness & Index Updates
FAISS indexes are static unless reloaded; new document uploads require re-embedding and re-indexing.
Stale data may result in outdated or irrelevant responses.
Automation of indexing pipeline is key for live deployments.
Cost of API Calls (if applicable)
Using GPT-4 involves API usage costs for each generation.
Local embedding and retrieval (via Sentence-BERT + FAISS) are free, but generation remains billable.
Heavy usage may require batching or cost-optimized alternatives.
Future Improvements
Integrate hybrid retrieval (dense + sparse) for improved precision
Add support for OCR and scanned PDFs using Tesseract or Azure Vision
Experiment with open-source LLMs (e.g., LLaMA-3, Mistral) for offline inference
Deploy as a web app with file upload and real-time chat interface
I'm a software engineer passionate about NLP, applied AI, and building real-world solutions that bridge language models with meaningful context. I built this project to explore how retrieval-based augmentation can enhance both the accuracy and trustworthiness of LLM-generated content.