Improving Factual Accuracy in LLMs with Retrieval-Augmented Generation (RAG)
🔹 Abstract
Large Language Models (LLMs) excel at generating fluent natural language but face a critical limitation: factual inaccuracy. They sometimes produce answers that are not grounded in any source data, which reduces their reliability in knowledge-intensive tasks.
This project implements a Retrieval-Augmented Generation (RAG) pipeline that integrates semantic retrieval (FAISS vector database) with open-source language models (Hugging Face Transformers) to improve factual accuracy. The system allows users to query research publications and receive context-grounded answers, demonstrating how retrieval enhances the reliability of LLMs.
🔹 Introduction
GPT-style LLMs can produce convincing yet incorrect outputs when the required knowledge is missing from their parameters. This project addresses that challenge with RAG, a framework that augments LLMs with external knowledge sources. Instead of generating answers based solely on model parameters, the assistant first retrieves relevant passages from a custom dataset and then uses the LLM to synthesize accurate responses.
The project was completed as part of Module 1 of the Agentic AI Developer Certification Program by ReadyTensor.
🔹 Project Overview
The assistant was designed around a simple but effective pipeline:
Retriever → a FAISS vector store identifies semantically relevant text chunks.
Generator → a Hugging Face model (flan-t5-base) generates natural-language answers using only the retrieved context.
This integration ensures that the assistant produces fluent yet factually grounded outputs, reducing hallucinations commonly seen in LLMs.
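At a high level, the flow can be sketched as follows. The helper function answer_query is illustrative only and not part of the project code; the full implementation appears in the code section at the end of this post.
```python
# Minimal sketch of the retrieve-then-generate flow (illustrative helper name;
# see the full implementation in the code section at the end of this post).
def answer_query(query, retriever, llm, prompt_template):
    # 1. Retrieve the most relevant chunks from the FAISS vector store.
    context_docs = retriever.invoke(query)
    context = "\n\n".join(doc.page_content for doc in context_docs)
    # 2. Generate an answer grounded only in the retrieved context.
    return llm.invoke(prompt_template.format(question=query, context=context))
```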
🔹 Objectives
Build a working retrieval + generation assistant.
Demonstrate how RAG improves factual reliability over LLM-only outputs.
Provide a clear, reproducible implementation for future developers.
🔹 Methodology & Tech Stack
Workflow Steps:
Data Ingestion → Publications were loaded from a JSON dataset.
Chunking → Text was split into overlapping passages for better semantic retrieval (see the configuration sketch after this list).
Embeddings → Hugging Face's all-MiniLM-L6-v2 model was used to encode chunks.
Vector Store → Chunks were stored and indexed in FAISS for efficient search.
Retriever → The top-k relevant chunks were retrieved for each query.
Generator → Hugging Face's flan-t5-base model produced final answers using the retrieved evidence.
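The key settings for these steps are excerpted from the implementation at the end of this post: 800-character chunks with 120 characters of overlap, all-MiniLM-L6-v2 embeddings, and top-3 retrieval.
```python
# Indexing and retrieval configuration, excerpted from the implementation below.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# `docs` is the list of LangChain Documents produced by the data-ingestion step.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # top-3 chunks per query
```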
Tech Stack:
Python 3.10+
LangChain for pipeline orchestration
Hugging Face Transformers for embeddings & generation
FAISS for vector database search
Google Colab for implementation
GitHub for version control and submission
🔹 Workflow Diagram
User Query
     │
     ▼
┌────────────────┐
│   Retriever    │ → (searches FAISS vector store)
└────────────────┘
     │
     ▼
┌────────────────┐
│   Vector DB    │ → (FAISS + embeddings)
└────────────────┘
     │
     ▼
Relevant Context
     │
     ▼
┌────────────────┐
│ LLM Generator  │ → (flan-t5-base)
└────────────────┘
     │
     ▼
Final Answer
🔹 Results
The assistant successfully answered user questions such as:
“What is this publication about?”
“What methods or tools were applied?”
“What limitations were discussed?”
In all cases, the responses were derived from the dataset rather than fabricated, demonstrating how retrieval grounding improves the factual accuracy of LLM responses.
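For reference, the same questions can be posed programmatically with the rag_chain and preprocess_query defined in the implementation section below:
```python
# Querying the assistant with the example questions above
# (uses `rag_chain` and `preprocess_query` from the implementation below).
for question in [
    "What is this publication about?",
    "What methods or tools were applied?",
    "What limitations were discussed?",
]:
    print("Q:", question)
    print("A:", rag_chain.invoke(preprocess_query(question)))
```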
🔹 Conclusion
This project shows that RAG can significantly reduce hallucinations in LLMs by grounding answers in external datasets. Through the combination of FAISS retrieval and Hugging Face generation, the assistant delivers accurate, context-driven, and reliable outputs.
🔹 Future Work
While the current system demonstrates strong performance on a small dataset, future enhancements could include:
Expanding the dataset to cover larger domains.
Adding a web-based user interface (e.g., Streamlit; a sketch follows below).
Incorporating session memory for multi-turn conversations.
Benchmarking performance against baseline LLM-only outputs.
These improvements would make the assistant more scalable, interactive, and suitable for real-world deployment.
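As one illustration of the Streamlit idea above, a minimal front end could wrap the existing chain. This is only a sketch: it assumes streamlit is installed and that rag_chain and preprocess_query from the implementation below are importable in the same module.
```python
# Hypothetical Streamlit front end for the assistant (sketch only).
# Assumes `streamlit` is installed and that `rag_chain` and `preprocess_query`
# from the implementation below are available in this module.
import streamlit as st

st.title("RAG Research Assistant")
question = st.text_input("Ask a question about the publications:")
if question:
    with st.spinner("Retrieving and generating..."):
        answer = rag_chain.invoke(preprocess_query(question))
    st.write(answer)
```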
🔹 Implementation Code
```python
import os
import re
import json
from dotenv import load_dotenv
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
load_dotenv()
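# JSON dataset of research publications to index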
DATA_FILE = "project_1_publications.json"
def preprocess_query(query: str) -> str:
"""Clean query for better retrieval."""
query = query.lower().strip()
query = re.sub(r"[^a-zA-Z0-9\s]", "", query)
return query
def load_documents(file_path: str):
    with open(file_path, "r") as f:
        data = json.load(f)
    docs = []
    for record in data:
        text = f"Title: {record.get('title', '')}\n\nContent: {record.get('content', '')}"
        docs.append(Document(page_content=text, metadata={"id": record.get("id", None)}))
    return docs
docs = load_documents(DATA_FILE)
print("Loaded:", len(docs), "docs")
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)
print("Chunks:", len(chunks))
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
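# Load google/flan-t5-base as a local text2text-generation pipeline for answer generation.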
gen_pipeline = pipeline(
    task="text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=256,
    temperature=0,
)
llm = HuggingFacePipeline(pipeline=gen_pipeline)
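# Prompt that constrains the model to answer only from the retrieved context.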
prompt = PromptTemplate.from_template(
"""You are a helpful assistant that answers ONLY using the provided context.
If the answer is not in the context, say: "I don't know based on the provided documents."
Question:
{question}
Context:
{context}
Answer:"""
)
def format_docs(docs):
return "\n\n".join(d.page_content for d in docs)
rag_chain = (
    {
        "question": RunnablePassthrough(),
        "context": retriever | format_docs,
    }
    | prompt
    | llm
    | StrOutputParser()
)
if __name__ == "__main__":
    while True:
        user_q = input("\nEnter your question (or 'exit' to quit): ")
        if user_q.lower() in ["exit", "quit"]:
            break
        clean_q = preprocess_query(user_q)
        print("\n🤖 Query after preprocessing:", clean_q)
        answer = rag_chain.invoke(clean_q)
        print("\n💡 Answer:", answer)
```