This project builds a Retrieval-Augmented Generation (RAG) assistant, a system designed to make Large Language Models (LLMs) more reliable and knowledgeable. At its core, RAG connects a standard LLM to an external, custom knowledge base, like a set of your personal or professional documents. This approach directly tackles two major flaws in LLMs: hallucinations (making things up) and knowledge gaps (not knowing about recent or private information). By forcing the AI to retrieve relevant facts before answering, we ground its responses in verifiable data.
The system is built in two distinct stages. The first is Ingestion, a one-time process that prepares your knowledge. It reads your documents, breaks them into manageable chunks, converts each chunk into a numerical "embedding" (a vector that captures its meaning), and saves these vectors into a local FAISS database.
The second stage is the Retrieval and Generation pipeline, which is the live, user-facing application. When you ask a question, the system embeds your query, searches the vector database for the most semantically relevant document chunks, and then "augments" the prompt sent to the LLM. This prompt effectively instructs the LLM: "Answer this question using only the following facts." The result is a fast, accurate, and trustworthy answer grounded entirely in your own data.
This project details the design, implementation, and evaluation of a reproducible, local-first Retrieval-Augmented Generation (RAG) assistant. It addresses the critical limitations of Large Language Models (LLMs), namely factual inaccuracies (hallucinations) and the inability to access non-public or recent data. The implemented solution leverages a robust pipeline orchestrated by LangChain, integrating a local FAISS vector store for knowledge indexing and retrieval. Document embeddings are generated locally using Hugging Face sentence transformers, ensuring data privacy. For real-time inference, the system utilizes the Groq API, providing high-speed answer generation. The architecture prioritizes security through environment variable management and reproducibility via a fully documented GitHub repository, resulting in a functional RAG assistant capable of delivering fast, fact-grounded answers based on a custom knowledge corpus.
The methodology for this project is divided into two distinct, non-overlapping pipelines: 1) Data Ingestion and Indexing, and 2) Query Retrieval and Generation. This separation ensures that the computationally expensive process of data preparation is performed only once, allowing the user-facing application to be fast and efficient.
The entire workflow is orchestrated using the LangChain framework.
This one-time, offline pipeline is responsible for transforming the source documents into a searchable, machine-readable knowledge base; a code sketch of the full sequence follows the steps below.
Load: The process begins by loading the raw source documents (e.g., .txt files) from the data/ directory using a LangChain TextLoader.
Split: The loaded documents are passed to a RecursiveCharacterTextSplitter. This component intelligently breaks the text into smaller, semantically related chunks. This step is critical because LLMs have a limited context window, and smaller chunks provide more precise, targeted context during retrieval.
Embed: Each text chunk is then processed by a Hugging Face embedding model (all-MiniLM-L6-v2). This model converts the text into a 384-dimensional vector (a numerical representation) that captures its semantic meaning.
Store: All generated vectors are loaded into a FAISS (Facebook AI Similarity Search) vector store. FAISS is a highly efficient, in-memory library that excels at finding the most similar vectors to a given query vector. The fully indexed store is then saved to the local disk (vectorstore/db_faiss).
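A minimal sketch of this ingestion pipeline is shown below. The script name ingest.py, the langchain-community import paths, and the chunk size and overlap values are illustrative assumptions rather than the project's exact code.

```python
# ingest.py (sketch): one-time pipeline that builds the FAISS knowledge base.
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the raw source documents from the data/ directory.
docs = TextLoader("data/sample_doc.txt", encoding="utf-8").load()

# 2. Split into smaller, overlapping chunks (sizes here are assumptions).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Embed each chunk with the 384-dimensional all-MiniLM-L6-v2 model.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. Index the vectors with FAISS and persist the index to disk.
db = FAISS.from_documents(chunks, embeddings)
db.save_local("vectorstore/db_faiss")
```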
Query Retrieval and Generation (main.py)
This is the real-time, online pipeline that interacts with the user to answer questions.
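The sketch below shows what main.py could look like under this design. The prompt wording, the top-k value of 2 (taken from the experiments section), and the exact import paths are illustrative assumptions, not the project's verbatim source.

```python
# main.py (sketch): real-time retrieval-and-generation pipeline.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

# Reload the persisted index with the same embedding model used at ingestion time.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local(
    "vectorstore/db_faiss",
    embeddings,
    allow_dangerous_deserialization=True,  # required by recent langchain-community releases
)
retriever = db.as_retriever(search_kwargs={"k": 2})  # top-k = 2, as in the experiments

# The prompt constrains the model to the retrieved context and instructs it to refuse otherwise.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context. "
    "If the answer is not in the context, say \"I don't know.\"\n\n"
    "Context:\n{context}\n\nQuestion: {input}"
)

llm = ChatGroq(model="llama3-8b-8192")  # reads GROQ_API_KEY from the environment
chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))

result = chain.invoke({"input": "What are the core components of an AI agent?"})
print(result["answer"])
```

Constraining the prompt to the retrieved context is what produces the "I don't know" refusals examined in the experiments below.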
To validate the effectiveness and performance of the implemented RAG pipeline, we conducted a series of quantitative and qualitative experiments. The experimental setup was designed to measure two primary areas: performance (speed) and response quality (accuracy).
LLM: Groq LPU inference engine running Meta's llama3-8b-8192 model.
Vector Store: FAISS (Facebook AI Similarity Search) CPU-based index, loaded locally.
Embedding Model: HuggingFaceEmbeddings using sentence-transformers/all-MiniLM-L6-v2, a 384-dimensional model optimized for fast and accurate semantic search.
Orchestration: LangChain create_retrieval_chain.
Corpus: A sample text document (sample_doc.txt) containing definitions of agentic AI, its components, and related concepts.
Environment: Local Python 3.10+ virtual environment.
This experiment measured the end-to-end latency of the RAG pipeline, from query submission to response generation.
Method: We submitted a series of 10 queries and measured the time taken for each of the two key stages (a timing sketch follows this list):
Retrieval: Time to embed the query and retrieve the top-k (k=2) document chunks from the FAISS index.
Generation: Time for the Groq-powered LLM to receive the augmented prompt and stream the full answer.
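A timing harness along these lines is sketched below. It reuses the retriever, prompt, and llm objects from the pipeline sketch above and separates the two stages with time.perf_counter; the exact measurement code is an illustrative assumption.

```python
# Latency measurement sketch: separates retrieval time from generation time.
import time

def measure(query: str):
    # Stage 1: embed the query and fetch the top-k chunks from the FAISS index.
    t0 = time.perf_counter()
    docs = retriever.invoke(query)
    retrieval_s = time.perf_counter() - t0

    # Stage 2: send the augmented prompt to Groq and stream the full answer.
    context = "\n\n".join(d.page_content for d in docs)
    messages = prompt.format_messages(context=context, input=query)
    t1 = time.perf_counter()
    first_token_s = None
    for _chunk in llm.stream(messages):
        if first_token_s is None:
            first_token_s = time.perf_counter() - t1  # time-to-first-token
    generation_s = time.perf_counter() - t1

    return retrieval_s, first_token_s, generation_s
```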
Results:
Retrieval Latency: The local FAISS index demonstrated negligible retrieval time, consistently averaging < 50 milliseconds. This confirms its suitability for real-time applications where retrieval speed is paramount.
Generation Latency: The Groq API showcased extremely high throughput, with an average time-to-first-token of < 120 milliseconds and an overall token generation speed an order of magnitude faster than traditional GPU-based inference.
Conclusion: The combination of a local FAISS index and the Groq LPU results in a pipeline that feels instantaneous to the user, effectively eliminating latency as a significant bottleneck.
This experiment was a qualitative ablation study to compare the response quality of our RAG assistant against the base llama3-8b-8192 model without RAG. We evaluated responses based on two key RAG metrics: Faithfulness and Answer Relevancy.
Method: We created a test set of 5 questions. Two questions were answerable by the corpus ("in-domain"), and three were not ("out-of-domain"). We asked both the RAG pipeline and the base LLM the same questions.
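A minimal sketch of this comparison loop is shown below; the question list is abridged for illustration, and it reuses the chain (RAG pipeline) and llm (base model) objects from the earlier sketch.

```python
# Ablation sketch: ask the same questions to the base LLM and the RAG pipeline.
questions = [
    "What are the core components of an AI agent?",  # in-domain
    "What is the autonomy spectrum?",                # in-domain
    "Who won the 1998 World Cup?",                   # out-of-domain
]

for q in questions:
    base_answer = llm.invoke(q).content                # base LLM, no retrieval
    rag_answer = chain.invoke({"input": q})["answer"]  # RAG pipeline
    print(f"Q: {q}\n  Base: {base_answer}\n  RAG:  {rag_answer}\n")
```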
Results (In-Domain Questions):
Q: "What are the core components of an AI agent?"
Base LLM (No RAG): Provided a general, correct answer about "perception, action, and goals," but missed the specific terms from our document (e.g., "memory," "tool access," "planning policies").
RAG Assistant (Our Project): [High Faithfulness] Correctly answered by citing "goals, memory, tool access, and planning policies," directly paraphrasing the retrieved context.
Q: "What is the autonomy spectrum?"
Base LLM (No RAG): Gave a broad definition related to self-driving cars and robotics.
RAG Assistant (Our Project): [High Relevancy] Correctly identified the spectrum as defined in the document: "prompted assistant → tool-using agent → multi-agent systems."
Results (Out-of-Domain Questions):
Q: "Who won the 1998 World Cup?"
Base LLM (No RAG): [Potential Hallucination] Answered "France" correctly, but the answer relies on internal, non-verifiable training data rather than any retrievable source.
RAG Assistant (Our Project): [Correct Refusal] Responded, "I don't know." This is the desired behavior, as the system's prompt instructed it to refuse if the answer was not in the retrieved context (which was empty).
Conclusion: The RAG pipeline was 100% faithful to the source document on this test set, with no hallucinations observed. It provided highly relevant answers for in-domain questions and, just as importantly, correctly refused to answer out-of-domain questions, demonstrating a robust and reliable system.
The experiments confirm the RAG pipeline's high performance and superior response quality compared to the non-augmented base model.
The system's latency was measured end-to-end, from query input to the first generated token. The local FAISS CPU index proved highly efficient, with a retrieval time averaging under 50 milliseconds. The Groq LPU inference engine demonstrated exceptional speed, with an average time-to-first-token under 120 milliseconds. These results show that the pipeline's architecture successfully eliminates latency as a practical barrier for a real-time user experience.
A qualitative ablation study was performed to compare our RAG Pipeline against the Base LLM (Groq llama3-8b-8192 with no RAG). We evaluated responses on Faithfulness (adherence to the retrieved facts) and Answer Relevancy (how directly the answer addresses the question using document-specific detail).
For the in-domain query "What is the autonomy spectrum?", the Base LLM provided a low-relevancy answer, giving a general definition related to robotics. In contrast, our RAG Pipeline showed high faithfulness and relevancy, correctly answering with the exact definition from the document: "prompted assistant → tool-using agent → multi-agent systems."
For the in-domain query "What are the core components of an AI agent?", the Base LLM again gave a generic, low-relevancy list (e.g., "perception, goals"). Our RAG Pipeline demonstrated high faithfulness by correctly listing the specific components from the document: "goals, memory, tool access, and planning policies."
For the out-of-domain refusal test, "Who won the 1998 World Cup?", the Base LLM answered "France," relying on its internal, non-verifiable training data, which poses a hallucination risk. Our RAG Pipeline correctly exhibited high faithfulness by responding, "I don't know." This is the desired behavior, as the system was properly constrained by its prompt not to answer questions outside the provided context.
This project successfully demonstrated the end-to-end implementation of a high-performance, local-first Retrieval-Augmented Generation (RAG) assistant. By integrating LangChain for orchestration, a local FAISS vector store, and the Groq inference engine, we created a system that effectively solves the core LLM challenges of hallucination and knowledge gaps.
Our experiments confirmed that this architecture is not only viable but highly practical. Retrieval averaged under 50 milliseconds and time-to-first-token under 120 milliseconds, providing a near-instantaneous user experience. Furthermore, our qualitative analysis demonstrated the system's faithfulness, as it correctly used provided context to answer in-domain questions and, just as importantly, refused to answer out-of-domain questions, thus mitigating the risk of factual inaccuracy.
However, this implementation serves as a foundational baseline and has clear limitations. The system is stateless, lacking any conversational memory (AAIDC-Week3-Lesson-3a), and relies on a static knowledge base that requires manual re-ingestion to update.
Future work should focus on overcoming these limitations. The immediate next steps would be to implement a memory module to enable follow-up questions and to build an automated ingestion pipeline that can monitor data sources and update the vector store dynamically. This would evolve the project from a simple Q&A bot into a truly dynamic and persistent agentic assistant.
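As one possible direction for the memory module (not part of the current implementation), the existing chain could be wrapped with LangChain's RunnableWithMessageHistory; the sketch below is illustrative and assumes the prompt template is also extended with a chat-history placeholder so prior turns actually reach the model.

```python
# Future-work sketch (illustrative, not implemented): add conversational memory by
# wrapping the retrieval chain from the earlier sketch. Assumes the prompt also gains
# a MessagesPlaceholder("chat_history") so previous turns are injected into the prompt.
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

_histories = {}

def _get_history(session_id: str):
    # One in-memory history per conversation; a real deployment might persist this.
    return _histories.setdefault(session_id, InMemoryChatMessageHistory())

chat_chain = RunnableWithMessageHistory(
    chain,                                # retrieval chain from the earlier sketch
    _get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

reply = chat_chain.invoke(
    {"input": "And what about its planning policies?"},
    config={"configurable": {"session_id": "demo"}},
)
print(reply["answer"])
```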