Improving Factual Accuracy in LLMs with Retrieval-Augmented Generation (RAG)
🔹 Abstract
Large Language Models (LLMs) excel at generating fluent natural language but face a critical limitation: factual inaccuracy. They sometimes produce answers that are not grounded in any source data, which reduces their reliability in knowledge-intensive tasks.
This project implements a Retrieval-Augmented Generation (RAG) pipeline that integrates semantic retrieval (FAISS vector database) with open-source language models (Hugging Face Transformers) to improve factual accuracy. The system allows users to query research publications and receive context-grounded answers, demonstrating how retrieval enhances the reliability of LLMs.
🔹 Introduction
GPT-style LLMs can produce convincing yet incorrect outputs when the required knowledge is missing from their parameters. This project addresses that challenge with RAG, a framework that augments LLMs with external knowledge sources. Instead of generating answers based solely on model parameters, the assistant first retrieves relevant passages from a custom dataset and then uses the LLM to synthesize accurate responses.
The project was completed as part of Module 1 of the Agentic AI Developer Certification Program by ReadyTensor.
🔹 Project Overview
The assistant was designed around a simple but effective pipeline:
Retriever → a FAISS vector store identifies semantically relevant text chunks.
Generator → a Hugging Face model (flan-t5-base) generates natural-language answers using only the retrieved context.
This integration ensures that the assistant produces fluent yet factually grounded outputs, reducing hallucinations commonly seen in LLMs.
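At a high level, the flow can be sketched as follows. The helper function answer_query is illustrative only and not part of the project code; the full implementation appears in the code section at the end of this post.
```python
# Minimal sketch of the retrieve-then-generate flow (illustrative helper name;
# see the full implementation in the code section at the end of this post).
def answer_query(query, retriever, llm, prompt_template):
    # 1. Retrieve the most relevant chunks from the FAISS vector store.
    context_docs = retriever.invoke(query)
    context = "\n\n".join(doc.page_content for doc in context_docs)
    # 2. Generate an answer grounded only in the retrieved context.
    return llm.invoke(prompt_template.format(question=query, context=context))
```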
🔹 Objectives
Build a working retrieval + generation assistant.
Demonstrate how RAG improves factual reliability over LLM-only outputs.
Provide a clear, reproducible implementation for future developers.
🔹 Methodology & Tech Stack
Workflow Steps:
Data Ingestion → Publications were loaded from a JSON dataset.
Chunking → Text was split into overlapping passages for better semantic retrieval (see the configuration sketch after this list).
Embeddings → Hugging Face's all-MiniLM-L6-v2 model was used to encode chunks.
Vector Store → Chunks were stored and indexed in FAISS for efficient search.
Retriever → The top-k relevant chunks were retrieved for each query.
Generator → Hugging Face's flan-t5-base model produced final answers using the retrieved evidence.
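The key settings for these steps are excerpted from the implementation at the end of this post: 800-character chunks with 120 characters of overlap, all-MiniLM-L6-v2 embeddings, and top-3 retrieval.
```python
# Indexing and retrieval configuration, excerpted from the implementation below.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# `docs` is the list of LangChain Documents produced by the data-ingestion step.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # top-3 chunks per query
```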
Tech Stack:
Python 3.10+
LangChain for pipeline orchestration
Hugging Face Transformers for embeddings & generation
FAISS for vector database search
Google Colab for implementation
GitHub for version control and submission
🔹 Workflow Diagram
User Query
     │
     ▼
┌────────────────┐
│   Retriever    │ → (searches FAISS vector store)
└────────────────┘
     │
     ▼
┌────────────────┐
│   Vector DB    │ → (FAISS + embeddings)
└────────────────┘
     │
     ▼
Relevant Context
     │
     ▼
┌────────────────┐
│ LLM Generator  │ → (flan-t5-base)
└────────────────┘
     │
     ▼
Final Answer
🔹 Results
The assistant successfully answered user questions such as:
“What is this publication about?”
“What methods or tools were applied?”
“What limitations were discussed?”
In all cases, the responses were derived from the dataset rather than fabricated, demonstrating how retrieval grounding improves the factual accuracy of LLM responses.
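For reference, the same questions can be posed programmatically with the rag_chain and preprocess_query defined in the implementation section below:
```python
# Querying the assistant with the example questions above
# (uses `rag_chain` and `preprocess_query` from the implementation below).
for question in [
    "What is this publication about?",
    "What methods or tools were applied?",
    "What limitations were discussed?",
]:
    print("Q:", question)
    print("A:", rag_chain.invoke(preprocess_query(question)))
```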
🔹 Conclusion
This project shows that RAG can significantly reduce hallucinations in LLMs by grounding answers in external datasets. Through the combination of FAISS retrieval and Hugging Face generation, the assistant delivers accurate, context-driven, and reliable outputs.
🔹 Future Work
While the current system demonstrates strong performance on a small dataset, future enhancements could include:
Expanding the dataset to cover larger domains.
Adding a web-based user interface (e.g., Streamlit; a sketch follows below).
Incorporating session memory for multi-turn conversations.
Benchmarking performance against baseline LLM-only outputs.
These improvements would make the assistant more scalable, interactive, and suitable for real-world deployment.
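As one illustration of the Streamlit idea above, a minimal front end could wrap the existing chain. This is only a sketch: it assumes streamlit is installed and that rag_chain and preprocess_query from the implementation below are importable in the same module.
```python
# Hypothetical Streamlit front end for the assistant (sketch only).
# Assumes `streamlit` is installed and that `rag_chain` and `preprocess_query`
# from the implementation below are available in this module.
import streamlit as st

st.title("RAG Research Assistant")
question = st.text_input("Ask a question about the publications:")
if question:
    with st.spinner("Retrieving and generating..."):
        answer = rag_chain.invoke(preprocess_query(question))
    st.write(answer)
```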
🔹 Implementation Code
```python
import os
import re
import json
from dotenv import load_dotenv
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
load_dotenv()
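# JSON dataset of research publications to index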
DATA_FILE = "project_1_publications.json"
def preprocess_query(query: str) -> str:
"""Clean query for better retrieval."""
query = query.lower().strip()
query = re.sub(r"[^a-zA-Z0-9\s]", "", query)
return query
def load_documents(file_path: str):
    with open(file_path, "r") as f:
        data = json.load(f)
    docs = []
    for record in data:
        text = f"Title: {record.get('title', '')}\n\nContent: {record.get('content', '')}"
        docs.append(Document(page_content=text, metadata={"id": record.get("id", None)}))
    return docs
docs = load_documents(DATA_FILE)
print("Loaded:", len(docs), "docs")
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)
print("Chunks:", len(chunks))
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
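# Load google/flan-t5-base as a local text2text-generation pipeline for answer generation.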
gen_pipeline = pipeline(
    task="text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=256,
    temperature=0,
)
llm = HuggingFacePipeline(pipeline=gen_pipeline)
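# Prompt that constrains the model to answer only from the retrieved context.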
prompt = PromptTemplate.from_template(
"""You are a helpful assistant that answers ONLY using the provided context.
If the answer is not in the context, say: "I don't know based on the provided documents."
Question:
{question}
Context:
{context}
Answer:"""
)
def format_docs(docs):
return "\n\n".join(d.page_content for d in docs)
rag_chain = (
    {
        "question": RunnablePassthrough(),
        "context": retriever | format_docs,
    }
    | prompt
    | llm
    | StrOutputParser()
)
if __name__ == "__main__":
    while True:
        user_q = input("\nEnter your question (or 'exit' to quit): ")
        if user_q.lower() in ["exit", "quit"]:
            break
        clean_q = preprocess_query(user_q)
        print("\n🤖 Query after preprocessing:", clean_q)
        answer = rag_chain.invoke(clean_q)
        print("\n💡 Answer:", answer)
```