Accessing accurate, up-to-date information within academic institutions remains a significant challenge due to the fragmentation of knowledge across various websites, personal academic pages, and administrative documents. To address this, an advanced Retrieval-Augmented Generation (RAG) chatbot was developed to assist users in navigating research-related information within the University of Sheffield’s Department of Computer Science. The system integrates a hybrid retrieval pipeline, combining BM25 keyword matching, Chroma semantic vector search, and a FlagEmbedding-based reranking model, with a Llama-3.2-3B-Instruct language model for answer generation and refinement.
The system features a user-friendly Gradio interface that delivers accurate, coherent responses through a dual-stage generation process incorporating chain-of-thought prompting. Performance was assessed on 105 human-generated queries, evaluating both retrieval and generation with multiple methods: automated LLM-based grading, human assessment, and the RAGAS framework.
The evaluation results showed that the developed hybrid retrieval system outperformed baseline methods such as TF-IDF and BM25 in terms of retrieval relevance.
Furthermore, in the generation evaluation, the complete hybrid RAG system consistently surpassed baseline models, including TF-IDF-based RAG, web-search RAG, and LLM-only approaches, across all major evaluation dimensions and metrics.
The Figure below shows the user interface of the developed University of Sheffield Assistant chatbot. It demonstrates the system’s ability to retrieve relevant documents and generate context-aware responses to queries related to the Computer Science department.
Accessing accurate and up-to-date academic information can be challenging due to the fragmented nature of university content across various websites, PDFs, and staff pages. This leads to information overload and slow responses; university email replies, for example, can take two working days. Studies show that AI chatbots can significantly reduce search time and improve efficiency.
However, standard large language models (LLMs) struggle with domain-specific queries and can hallucinate or give unreliable answers. Retrieval-Augmented Generation (RAG) systems address these issues by grounding responses in external documents, improving accuracy and trust.
This project builds a chatbot for the University of Sheffield’s Computer Science department using a hybrid RAG system that combines keyword-based and semantic retrieval with LLM-based answer generation. It allows users to ask research-related questions and receive accurate, context-aware, and source-backed responses through a conversational interface.
The Figure below shows the overall System Architecture of the Hybrid RAG Chatbot.
Queries are routed through either a general query handler or a hybrid retrieval pipeline combining BM25 and ChromaDB. Texts are embedded using the nomic-embed-text-v1.5 model. Retrieved documents are reranked using bge-reranker-v2-m3, and answers are generated and refined using the llama3.2:3b-instruct-fp16 model. The final response is delivered via a Gradio UI.
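As a rough illustration of the front end only, the sketch below wires a Gradio ChatInterface around a hypothetical answer_query stub; in the real system this entry point would dispatch to the routing and retrieval pipeline described in the following sections.

```python
# Minimal Gradio front-end sketch; `answer_query` is a hypothetical stub
# standing in for the real route -> retrieve -> generate pipeline.
import gradio as gr

def answer_query(message: str) -> str:
    # Placeholder: the actual system dispatches to the regex router for
    # general queries or to the hybrid RAG pipeline for domain questions.
    return f"(answer for: {message})"

def respond(message, history):
    return answer_query(message)

gr.ChatInterface(fn=respond, title="University of Sheffield Assistant").launch()
```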
The data was collected from two key sources:
All documents were split into 512-character chunks with 50-character overlaps using LangChain's RecursiveCharacterTextSplitter. This process resulted in 1,734 structured chunks.
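A minimal sketch of this chunking step, assuming the stated parameters and a pre-loaded `documents` list of LangChain Document objects:

```python
# Chunking sketch: 512-character chunks with 50-character overlap, as above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # max characters per chunk
    chunk_overlap=50,  # overlap preserves context across chunk boundaries
)
# `documents` is assumed to hold the pages loaded from the two data sources.
chunks = splitter.split_documents(documents)  # ~1,734 chunks in this project
```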
This section describes the chatbot's hybrid retrieval pipeline, a crucial part of its RAG architecture designed to find the most relevant documents from a knowledge base to ground answers and reduce errors.
The system combines two retrieval methods:
BM25, a traditional keyword-based retriever used for lexical precision (e.g., exact word matches), returning its top 10 document chunks.
Chroma semantic vector search, in which text is converted into dense numerical embeddings using the nomic-embed-text-v1.5 model. These embeddings are stored in a Chroma vector database, which retrieves the 10 document chunks most semantically similar to the query using cosine similarity.
These two sets of results are combined with equal weighting using LangChain's EnsembleRetriever. To further refine relevance and correct for biases introduced by combining scores, a cross-encoder reranker (BAAI/bge-reranker-v2-m3) re-scores the combined documents and outputs the top 3 most relevant chunks, applying a relevance-score threshold of 0.2 to filter out irrelevant information and handle out-of-scope questions. This two-step retrieve-then-rerank process aims to maximize both recall and precision.
If no documents pass the threshold, the chatbot responds with a fallback message.
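A condensed sketch of this retrieve-and-rerank pipeline, assuming the LangChain and FlagEmbedding APIs and the parameters stated above (top 10 per retriever, equal weights, top 3 after reranking, 0.2 threshold); variable names are illustrative, not the project's code.

```python
# Hybrid retrieval + cross-encoder reranking sketch.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from FlagEmbedding import FlagReranker

# Dense embeddings: nomic-embed-text-v1.5 (needs trust_remote_code).
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1.5",
    model_kwargs={"trust_remote_code": True},
)

# `chunks` is the list of 512-character Documents built at indexing time.
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 10  # lexical top 10

vectordb = Chroma.from_documents(chunks, embeddings)
dense = vectordb.as_retriever(search_kwargs={"k": 10})  # semantic top 10

# Equal-weight fusion of the two candidate sets.
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.5, 0.5])

# Cross-encoder reranker re-scores every (query, chunk) pair.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def retrieve(query: str, top_k: int = 3, threshold: float = 0.2):
    candidates = hybrid.invoke(query)
    scores = reranker.compute_score(
        [[query, d.page_content] for d in candidates], normalize=True
    )
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    # Keep only chunks above the relevance threshold; an empty result
    # triggers the chatbot's fallback message.
    return [d for s, d in ranked[:top_k] if s >= threshold]
```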
The Figure below shows the design of the hybrid retrieval and cross-encoder reranking pipeline used.
For the answer generation component, the LLaMA 3.2 3B Instruct model (llama3.2:3b-instruct-fp16) was selected. This choice was driven by several factors critical for practical deployment: performance, cost-effectiveness (as an open-source model, it can be deployed locally), and instruction tuning.
Prompt Engineering techniques were leveraged to optimize response quality:
To further enhance the reliability and accuracy of the output, a second-stage refinement step is performed. After retrieving relevant documents and generating an initial answer using the LLaMA 3.2 model, the system re-evaluates this initial response for factual accuracy, completeness, and fluency, using the same model and the original context. If the initial answer is deemed accurate, it is rephrased to improve clarity. However, if inaccuracies are detected, the answer is revised using only the retrieved documents to ensure correctness and coherence. This two-step generation process significantly improves the overall quality and reliability of responses while actively working to reduce hallucinations.
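The sketch below illustrates this generate-then-refine loop, assuming the model is served through the `ollama` Python client; the prompt wording is illustrative, not the project's actual prompts.

```python
# Two-stage generation sketch: draft an answer, then verify and refine it
# against the same retrieved context with the same model.
import ollama

MODEL = "llama3.2:3b-instruct-fp16"

def ask(prompt: str) -> str:
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def answer(query: str, context: str) -> str:
    # Stage 1: draft an answer grounded in the retrieved chunks.
    draft = ask(
        f"Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Stage 2: re-check the draft for accuracy, completeness and fluency;
    # rephrase if accurate, rewrite from the documents if not.
    refined = ask(
        f"Context:\n{context}\n\nQuestion: {query}\nDraft answer: {draft}\n"
        f"Check the draft for factual accuracy, completeness and fluency. "
        f"If it is accurate, rephrase it for clarity; if not, rewrite it "
        f"using only the context."
    )
    return refined
```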
To handle general user queries (e.g., greetings, bot capability questions) efficiently, a lightweight regex-based classifier was implemented. This approach uses a dictionary of predefined patterns and responses, bypassing the full RAG pipeline for non-domain-specific questions, thereby reducing latency and computational cost. This was a pragmatic solution adopted after an LLM-based classifier showed inconsistent performance.
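A toy version of this router, with a hypothetical pattern/response table standing in for the project's dictionary:

```python
# Regex-based router for general queries; patterns and replies below are
# illustrative stand-ins for the project's predefined dictionary.
import re

GENERAL_PATTERNS = {
    r"\b(hi|hello|hey)\b": "Hello! Ask me anything about the CS department.",
    r"\bwhat can you do\b": "I answer research-related questions about the "
                            "University of Sheffield's CS department.",
    r"\b(thanks|thank you)\b": "You're welcome!",
}

def route(query: str):
    """Return a canned reply for general chit-chat, or None to run full RAG."""
    for pattern, response in GENERAL_PATTERNS.items():
        if re.search(pattern, query.lower()):
            return response
    return None  # fall through to the hybrid RAG pipeline
```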
This chapter details the evaluation of the developed RAG chatbot.
A multi-faceted approach was used, combining:
To assess the developed RAG chatbot's performance, several baseline models representing standard retrieval and generation approaches were used for comparison:
The project successfully developed a RAG chatbot for answering research-related queries within the University of Sheffield's Computer Science Department. Evaluation results clearly demonstrate that it consistently outperformed the baseline models.