Large Language Models (LLMs) have revolutionized the way we interact with textual data: instead of searching through long documents by hand, we can now simply pose a query to an LLM. However, LLMs often generate responses without clear attribution, raising concerns about factual accuracy and trustworthiness. This project presents a Retrieval-Augmented Generation (RAG) model that integrates source citation into LLM responses. Implemented in Streamlit with LangChain, the model dynamically retrieves information from a collection of text files and cites its sources. By leveraging vector stores and embedding-based search, our approach produces more transparent and reliable AI-generated responses. The system is designed to be adaptable, enabling users to query any uploaded text dataset with full source traceability.
Language models like GPT-4 and Mistral are powerful tools for text generation, but their lack of explicit source citation makes them less than ideal for fact-based applications like research and legal writing. The goal of this project is to bridge this gap by implementing a custom RAG system that enhances LLM responses with automated source attribution. Our Streamlit-based web application processes large text datasets, enables real-time querying, and retrieves supporting document references to ensure factual grounding.
This project was initially tested with Grimm’s Fairy Tales, but its architecture allows for expansion to any collection of text files. The primary components include the Streamlit front end, the LangChain retrieval pipeline, an embedding-based vector store, and the underlying LLM.
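A core piece of any such pipeline is splitting the source files into retrievable chunks that each remember where they came from, so citations can point back to the original file. The sketch below is illustrative only, with hypothetical helper and file names; the actual app delegates this to LangChain's loaders and splitters.

```python
def chunk_file(path: str, text: str, size: int = 500, overlap: int = 100) -> list[dict]:
    """Split `text` into overlapping chunks, each tagged with its source path
    so that retrieved passages can later be cited."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + size]
        if piece:
            chunks.append({"source": path, "start": start, "text": piece})
    return chunks

# Stand-in for a loaded text file (hypothetical filename).
sample = "Once upon a time " * 100
chunks = chunk_file("grimms_fairy_tales.txt", sample)
print(len(chunks), chunks[0]["source"])  # → 5 grimms_fairy_tales.txt
```

Keeping the `source` (and offset) alongside each chunk is what makes the later citation step possible: whatever the retriever returns already carries its provenance.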
The workflow of our RAG-based citation system consists of several key steps: the uploaded text files are split into chunks and embedded into a vector store; at query time, the most relevant chunks are retrieved via embedding-based similarity search; and the LLM generates its answer from those chunks, citing the source document of each retrieved passage.
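The retrieve-then-cite loop at the heart of this workflow can be sketched without any external dependencies. In the real app, a vector store (via LangChain) holds dense embeddings; here a toy bag-of-words similarity stands in for embedding search, and the index contents are hypothetical.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real app uses dense embeddings
    # held in a vector store.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexed chunks, each tagged with its source file (hypothetical data).
index = [
    {"source": "cinderella.txt", "text": "the slipper fit cinderella perfectly"},
    {"source": "hansel_gretel.txt", "text": "the house was made of bread and cakes"},
]

def retrieve(query: str, k: int = 1) -> list[dict]:
    """Return the k chunks most similar to the query, provenance included."""
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, embed(c["text"])), reverse=True)[:k]

hits = retrieve("whose slipper fit?")
# The retrieved text would be injected into the LLM prompt;
# the UI surfaces hits[0]["source"] as the citation.
print(hits[0]["source"])  # → cinderella.txt
```

Because each retrieved chunk carries its `source` field, citing is simply a matter of displaying the provenance of whatever the generator was grounded on.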
To evaluate the effectiveness of our LLM-powered RAG model, we conducted a series of experiments comparing its cited responses against the output of a standard language model without retrieval.
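One simple property to check during such experiments is citation coverage: whether each generated answer actually references one of the sources it was grounded on. The check below is a hypothetical sketch (the function name and citation format are assumptions, not the app's actual scoring code).

```python
def has_citation(answer: str, sources: list[str]) -> bool:
    """Hypothetical evaluation check: does the answer mention at least
    one of the retrieved source files?"""
    return any(src in answer for src in sources)

sources = ["cinderella.txt", "rapunzel.txt"]
answer = "The slipper fit her perfectly. [source: cinderella.txt]"
print(has_citation(answer, sources))  # → True
```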
The implementation of Retrieval-Augmented Generation (RAG) significantly improved the reliability and transparency of LLM-generated responses. Compared to standard language models, this approach reduced hallucinations by grounding answers in real sources. Users found that responses with citations were more trustworthy and verifiable, as they could trace statements back to the original text.
Key observations include fewer hallucinations, greater user trust in cited answers, and full traceability from each response back to the source text.
This project demonstrates that RAG-based models can enhance AI-generated content by making responses more transparent and source-aware, offering a solid foundation for research, education, and fact-based AI applications.
This project successfully integrates source citation into LLM-generated responses using Retrieval-Augmented Generation (RAG). Our Streamlit-based web application enables users to query any document collection, ensuring traceability and trustworthiness in AI responses. The combination of vector-based retrieval and LLM generation significantly reduces hallucinations while maintaining fast query execution.
Future work includes:
By improving transparency in AI-generated content, this project lays the foundation for fact-grounded, citation-aware AI applications.
```shell
# Download the repo
git clone https://github.com/DomenickD/LLM_Source_Citation

# Change directory
cd LLM_Source_Citation

# Install dependencies
pip install -r requirements.txt

# Run the Streamlit app
streamlit run app.py
```