This project presents a production-ready Retrieval-Augmented Generation (RAG) assistant built with LangChain, OpenAI GPT models, and a FAISS vector store. The system integrates document ingestion, vector-based semantic search, and large language model reasoning to enable accurate, source-grounded question answering. Documents in Markdown or plain-text format are automatically processed, chunked, embedded using Sentence Transformers, and indexed into a FAISS vector store for efficient retrieval. At query time, the assistant retrieves the most relevant document chunks and passes them as context to a GPT model, which generates informed responses with citations.
The project includes a configurable pipeline, interactive command-line interface (CLI), and programmatic API, allowing users to ask questions and receive context-grounded answers. Environment-based configuration enables customizable model selection, chunking strategy, retrieval depth, and vector store paths. The modular design supports easy extension, such as adding new document formats, enhancing retrieval with re-ranking, or integrating memory. Overall, this RAG assistant serves as a robust, extensible foundation for building domain-specific AI applications grounded in trusted documents.
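As an illustration of the environment-based configuration, the sketch below reads the relevant settings from environment variables into a small settings object. The variable names (OPENAI_MODEL, CHUNK_SIZE, CHUNK_OVERLAP, TOP_K, VECTOR_STORE_PATH) and their defaults are assumptions for illustration, not the project's actual keys.

```python
# config.py - minimal sketch of environment-based configuration.
# Variable names and defaults are illustrative assumptions.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model: str              # which OpenAI chat model to use
    chunk_size: int         # characters per document chunk
    chunk_overlap: int      # overlap between consecutive chunks
    top_k: int              # number of chunks retrieved per query
    vector_store_path: str  # where the FAISS index is persisted


def load_settings() -> Settings:
    """Read configuration from the environment, falling back to defaults."""
    return Settings(
        model=os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"),
        chunk_size=int(os.getenv("CHUNK_SIZE", "1000")),
        chunk_overlap=int(os.getenv("CHUNK_OVERLAP", "200")),
        top_k=int(os.getenv("TOP_K", "4")),
        vector_store_path=os.getenv("VECTOR_STORE_PATH", "faiss_index"),
    )
```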
Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external knowledge sources to improve accuracy, reduce hallucinations, and provide source-grounded answers. LLMs on their own often generate plausible-sounding but incorrect information because they lack access to relevant context. By adding a retrieval layer, such as a FAISS vector store, a RAG system can fetch relevant document chunks and use them to inform the LLM’s responses. The assistant presented here is built with LangChain, OpenAI GPT models, and FAISS, and answers user queries over custom document corpora.
The system is composed of three main components:
Document Ingestion and Indexing:
Documents (Markdown or text) are loaded from a documents/ directory.
Text is split into overlapping chunks so that context is preserved across chunk boundaries.
Embeddings are generated using a Sentence Transformers model or OpenAI embeddings.
Embeddings are stored in a FAISS vector store for fast similarity search, as sketched below.
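A minimal sketch of this ingestion and indexing step, assuming the current split LangChain packages (langchain-community, langchain-text-splitters, langchain-huggingface) and an illustrative Sentence Transformers model name; older LangChain versions expose the same classes under different module paths:

```python
# ingest.py - minimal sketch of document ingestion and FAISS indexing.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load Markdown and plain-text files from the documents/ directory.
md_loader = DirectoryLoader("documents/", glob="**/*.md", loader_cls=TextLoader)
txt_loader = DirectoryLoader("documents/", glob="**/*.txt", loader_cls=TextLoader)
docs = md_loader.load() + txt_loader.load()

# Split into overlapping chunks so that context survives chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed the chunks with a Sentence Transformers model (model name is illustrative).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the FAISS index and persist it for later queries.
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")
```

The chunk size and overlap shown here match the values used in the evaluation below.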
Query Processing:
User queries are converted into embeddings.
FAISS retrieves the top-K most semantically similar document chunks.
Retrieved chunks are incorporated into a prompt template that guides the LLM’s response, as sketched below.
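The query-time half can be sketched in the same style. The prompt wording, the k=4 retrieval depth, and the index path are illustrative; only the overall flow (embed the query, retrieve the top-K chunks, fill a prompt template) mirrors the description above.

```python
# query.py - minimal sketch of query-time retrieval and prompt construction.
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Reload the persisted index; the flag acknowledges that FAISS metadata is pickled.
vector_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

query = "Explain RAG and its benefits."
# The store embeds the query and returns the top-K most similar chunks.
top_chunks = vector_store.similarity_search(query, k=4)

# Concatenate retrieved chunks, tagged with their source files, into the context.
context = "\n\n".join(
    f"[{doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
    for doc in top_chunks
)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below and cite sources in brackets.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
messages = prompt.format_messages(context=context, question=query)
```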
Response Generation:
OpenAI GPT models (configurable as GPT-3.5-turbo or GPT-4) generate answers based on retrieved context.
Source attribution is provided by citing the documents from which the chunks were retrieved (see the sketch after this list).
Responses are returned via an interactive CLI or programmatic API.
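Generation itself can then be a small function over the outputs of the previous sketch. The function name and signature are an assumption for illustration; it requires OPENAI_API_KEY to be set and appends a simple source list to the model's answer.

```python
# answer.py - minimal sketch of response generation with source attribution.
from typing import List

from langchain_core.documents import Document
from langchain_core.messages import BaseMessage
from langchain_openai import ChatOpenAI


def generate_answer(messages: List[BaseMessage], retrieved: List[Document]) -> str:
    """Answer from the prompt messages and append citations for the retrieved chunks."""
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # or "gpt-4"
    response = llm.invoke(messages)
    # Attribute sources by listing the documents the retrieved chunks came from.
    sources = sorted({doc.metadata.get("source", "unknown") for doc in retrieved})
    return f"{response.content}\n\nSources: {', '.join(sources)}"
```

With the previous sketch in scope, print(generate_answer(messages, top_chunks)) yields the answer followed by its source list; the same function could back both the CLI and the programmatic API.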
To evaluate the RAG assistant:
Setup:
A set of documents covering Python, machine learning, and LangChain guides was ingested.
The FAISS vector store was built from chunks of 1,000 characters with a 200-character overlap.
OpenAI GPT-3.5-turbo was used for response generation.
Queries:
General knowledge questions: e.g., "What is Python used for?"
Domain-specific questions: e.g., "Explain RAG and its benefits."
Edge cases: e.g., ambiguous queries requiring retrieval from multiple documents.
Evaluation Metrics:
Accuracy and relevance of answers.
Correct attribution to source documents.
Response time (latency per query), measured as sketched below.
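Of these metrics, only latency is purely mechanical to collect. A minimal timing helper, written against whatever callable serves as the assistant's query entry point (none of the project's actual function names are assumed here), might look like this:

```python
# latency.py - minimal sketch for the response-time metric.
import time
from statistics import mean
from typing import Callable, List


def mean_latency(ask: Callable[[str], str], queries: List[str]) -> float:
    """Return the mean seconds per query when running each query through `ask`."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        ask(query)  # the answer itself is ignored; only the elapsed time matters
        timings.append(time.perf_counter() - start)
    return mean(timings)
```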
Results:
The assistant successfully retrieved relevant document chunks for user queries.
Answers were accurate, contextually rich, and included proper source attribution.
The FAISS vector store enabled fast retrieval (<1 second per query for a small corpus).
Ambiguous queries were handled better than baseline LLM-only generation, as retrieved context grounded the responses.
The interactive CLI demonstrated usability, and the same functionality was available seamlessly through the programmatic API.