Accessing accurate, up-to-date information within academic institutions remains a significant challenge due to the fragmentation of knowledge across various websites, personal academic pages, and administrative documents. To address this, an advanced Retrieval-Augmented Generation (RAG) chatbot was developed to assist users in navigating research-related information within the University of Sheffield’s Department of Computer Science. The system integrates a hybrid retrieval pipeline, combining BM25 keyword matching, Chroma semantic vector search, and a FlagEmbedding-based reranking model, with a Llama-3.2-3B-Instruct language model for answer generation and refinement.
The system features a user-friendly Gradio interface that delivers accurate, coherent responses through a dual-stage generation process incorporating chain-of-thought prompting. Performance was assessed on 105 human-generated queries, evaluating both retrieval and generation with multiple methods: automated LLM-based grading, human assessment, and the RAGAS framework.
The evaluation results showed that the developed hybrid retrieval system outperformed baseline methods such as TF-IDF and BM25 in terms of retrieval relevance.
Furthermore, in the generation evaluation, the complete hybrid RAG system consistently surpassed baseline models, including TF-IDF-based RAG, web-search RAG, and LLM-only approaches, across all major evaluation dimensions and metrics.
The Figure below shows the user interface of the developed University of Sheffield Assistant chatbot. It demonstrates the system’s ability to retrieve relevant documents and generate context-aware responses to queries related to the Computer Science department.
Accessing accurate and up-to-date academic information can be challenging due to the fragmented nature of university content across various websites, PDFs, and staff pages. This leads to information overload and slow responses; university email replies, for example, can take two working days. Studies show that AI chatbots can significantly reduce search time and improve efficiency.
However, standard large language models (LLMs) struggle with domain-specific queries and can hallucinate or give unreliable answers. Retrieval-Augmented Generation (RAG) systems address these issues by grounding responses in external documents, improving accuracy and trust.
This project builds a chatbot for the University of Sheffield’s Computer Science department using a hybrid RAG system that combines keyword-based and semantic retrieval with LLM-based answer generation. It allows users to ask research-related questions and receive accurate, context-aware, and source-backed responses through a conversational interface.
The Figure below shows the overall System Architecture of the Hybrid RAG Chatbot.
Queries are routed through either a general query handler or a hybrid retrieval pipeline combining BM25 and ChromaDB. Texts are embedded using the nomic-embed-text-v1.5 model. Retrieved documents are reranked using bge-reranker-v2-m3, and answers are generated and refined using the llama3.2:3b-instruct-fp16 model. The final response is delivered via a Gradio UI.
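As a rough illustration of the front end only, the sketch below wires a Gradio ChatInterface around a hypothetical answer_query stub; in the real system this entry point would dispatch to the routing and retrieval pipeline described in the following sections.

```python
# Minimal Gradio front-end sketch; `answer_query` is a hypothetical stub
# standing in for the real route -> retrieve -> generate pipeline.
import gradio as gr

def answer_query(message: str) -> str:
    # Placeholder: the actual system dispatches to the regex router for
    # general queries or to the hybrid RAG pipeline for domain questions.
    return f"(answer for: {message})"

def respond(message, history):
    return answer_query(message)

gr.ChatInterface(fn=respond, title="University of Sheffield Assistant").launch()
```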
The data was collected from two key sources:
All documents were split into 512-character chunks with 50-character overlaps using LangChain's RecursiveCharacterTextSplitter. This process resulted in 1,734 structured chunks.
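A minimal sketch of this chunking step, assuming the stated parameters and a pre-loaded `documents` list of LangChain Document objects:

```python
# Chunking sketch: 512-character chunks with 50-character overlap, as above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # max characters per chunk
    chunk_overlap=50,  # overlap preserves context across chunk boundaries
)
# `documents` is assumed to hold the pages loaded from the two data sources.
chunks = splitter.split_documents(documents)  # ~1,734 chunks in this project
```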
This section describes the chatbot's hybrid retrieval pipeline, a crucial part of its RAG architecture designed to find the most relevant documents from a knowledge base to ground answers and reduce errors.
The system combines two retrieval methods:
BM25, a traditional keyword-based retriever used for lexical precision (e.g., exact word matches), returning its top 10 document chunks.
Chroma semantic vector search, in which text is converted into dense numerical embeddings using the nomic-embed-text-v1.5 model. These embeddings are stored in a Chroma vector database, which retrieves the 10 document chunks most semantically similar to the query using cosine similarity.
These two sets of results are combined with equal weighting using LangChain's EnsembleRetriever. To further refine relevance and correct for biases introduced by combining scores, a cross-encoder reranker (BAAI/bge-reranker-v2-m3) re-scores the combined documents and outputs the top 3 most relevant chunks, applying a relevance-score threshold of 0.2 to filter out irrelevant information and handle out-of-scope questions. This two-step retrieve-then-rerank process aims to maximize both recall and precision.
If no documents pass the threshold, the chatbot responds with a fallback message.
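A condensed sketch of this retrieve-and-rerank pipeline, assuming the LangChain and FlagEmbedding APIs and the parameters stated above (top 10 per retriever, equal weights, top 3 after reranking, 0.2 threshold); variable names are illustrative, not the project's code.

```python
# Hybrid retrieval + cross-encoder reranking sketch.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from FlagEmbedding import FlagReranker

# Dense embeddings: nomic-embed-text-v1.5 (needs trust_remote_code).
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1.5",
    model_kwargs={"trust_remote_code": True},
)

# `chunks` is the list of 512-character Documents built at indexing time.
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 10  # lexical top 10

vectordb = Chroma.from_documents(chunks, embeddings)
dense = vectordb.as_retriever(search_kwargs={"k": 10})  # semantic top 10

# Equal-weight fusion of the two candidate sets.
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.5, 0.5])

# Cross-encoder reranker re-scores every (query, chunk) pair.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def retrieve(query: str, top_k: int = 3, threshold: float = 0.2):
    candidates = hybrid.invoke(query)
    scores = reranker.compute_score(
        [[query, d.page_content] for d in candidates], normalize=True
    )
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    # Keep only chunks above the relevance threshold; an empty result
    # triggers the chatbot's fallback message.
    return [d for s, d in ranked[:top_k] if s >= threshold]
```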
The Figure below shows the design of the hybrid retrieval and cross-encoder reranking pipeline used.
For the answer generation component, the LLaMA 3.2 3B Instruct model (llama3.2:3b-instruct-fp16) was selected. This choice was driven by several factors critical for practical deployment: performance, cost-effectiveness (as an open-source model, it can be deployed locally), and instruction tuning.
Prompt Engineering techniques were leveraged to optimize response quality:
To further enhance the reliability and accuracy of the output, a second-stage refinement step is performed. After retrieving relevant documents and generating an initial answer using the LLaMA 3.2 model, the system re-evaluates this initial response for factual accuracy, completeness, and fluency, using the same model and the original context. If the initial answer is deemed accurate, it is rephrased to improve clarity. However, if inaccuracies are detected, the answer is revised using only the retrieved documents to ensure correctness and coherence. This two-step generation process significantly improves the overall quality and reliability of responses while actively working to reduce hallucinations.
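The sketch below illustrates this generate-then-refine loop, assuming the model is served through the `ollama` Python client; the prompt wording is illustrative, not the project's actual prompts.

```python
# Two-stage generation sketch: draft an answer, then verify and refine it
# against the same retrieved context with the same model.
import ollama

MODEL = "llama3.2:3b-instruct-fp16"

def ask(prompt: str) -> str:
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def answer(query: str, context: str) -> str:
    # Stage 1: draft an answer grounded in the retrieved chunks.
    draft = ask(
        f"Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Stage 2: re-check the draft for accuracy, completeness and fluency;
    # rephrase if accurate, rewrite from the documents if not.
    refined = ask(
        f"Context:\n{context}\n\nQuestion: {query}\nDraft answer: {draft}\n"
        f"Check the draft for factual accuracy, completeness and fluency. "
        f"If it is accurate, rephrase it for clarity; if not, rewrite it "
        f"using only the context."
    )
    return refined
```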
To handle general user queries (e.g., greetings, bot capability questions) efficiently, a lightweight regex-based classifier was implemented. This approach uses a dictionary of predefined patterns and responses, bypassing the full RAG pipeline for non-domain-specific questions, thereby reducing latency and computational cost. This was a pragmatic solution adopted after an LLM-based classifier showed inconsistent performance.
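A toy version of this router, with a hypothetical pattern/response table standing in for the project's dictionary:

```python
# Regex-based router for general queries; patterns and replies below are
# illustrative stand-ins for the project's predefined dictionary.
import re

GENERAL_PATTERNS = {
    r"\b(hi|hello|hey)\b": "Hello! Ask me anything about the CS department.",
    r"\bwhat can you do\b": "I answer research-related questions about the "
                            "University of Sheffield's CS department.",
    r"\b(thanks|thank you)\b": "You're welcome!",
}

def route(query: str):
    """Return a canned reply for general chit-chat, or None to run full RAG."""
    for pattern, response in GENERAL_PATTERNS.items():
        if re.search(pattern, query.lower()):
            return response
    return None  # fall through to the hybrid RAG pipeline
```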
This chapter details the evaluation of the developed RAG chatbot.
A multi-faceted approach was used, combining:
To assess the developed RAG chatbot's performance, several baseline models representing standard retrieval and generation approaches were used for comparison:
The project successfully developed a RAG chatbot for answering research-related queries within the University of Sheffield's Computer Science Department. Evaluation results clearly demonstrate that it consistently outperformed the baseline models.