This publication details the development and implementation of a Retrieval-Augmented Generation (RAG) system designed for research paper question-answering. The system leverages a modular architecture for efficient document processing, embedding generation, vector database storage, and intelligent retrieval. Utilizing Ollama for both embedding and large language models, the RAG pipeline processes PDF documents, extracts and chunks text, converts it into vector embeddings stored in a Chroma database, and retrieves relevant information to answer user queries through a Gradio-based interactive interface. The project demonstrates a practical approach to enhancing information retrieval and content generation from domain-specific documents.
In the era of information overload, efficiently extracting and understanding key insights from vast quantities of documents, particularly research papers, presents a significant challenge. Traditional keyword-based search methods often fall short in capturing semantic nuances and providing comprehensive answers. This project addresses this challenge by implementing a Retrieval-Augmented Generation (RAG) system. RAG combines the strengths of information retrieval with the generative capabilities of large language models (LLMs) to provide accurate, context-aware, and detailed answers to user queries based on a given corpus of documents. The system aims to facilitate better understanding and interaction with research paper content, reducing the manual effort required for information synthesis.
Our RAG system follows a modular design, encompassing text extraction, text chunking, embedding generation, vector database storage, information retrieval, LLM-based generation, and an interactive user interface.
Text Extraction: PDF documents are the primary input. Text content is extracted using Langchain's PyPDFLoader, as implemented in file_to_text.py. This converts the structured PDF content into plain text for further processing.
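A minimal sketch of this step follows, assuming the langchain-community package is installed; the helper name extract_text is illustrative rather than the exact function defined in file_to_text.py.

```python
# Illustrative extraction step, in the spirit of file_to_text.py.
# Assumes: pip install langchain-community pypdf
from langchain_community.document_loaders import PyPDFLoader

def extract_text(pdf_path: str) -> str:
    """Load a PDF and return its concatenated plain-text content."""
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()  # one Document per page
    return "\n".join(page.page_content for page in pages)
```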
Text Chunking: The extracted text is then divided into smaller, manageable chunks using RecursiveCharacterTextSplitter. This process, handled by the Retriever.py script, ensures that embeddings are generated for semantically coherent sections, optimizing retrieval.
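The chunking logic can be sketched as follows; the chunk_size and chunk_overlap values are assumptions, since the exact parameters used in Retriever.py are not reproduced here.

```python
# Illustrative chunking step, in the spirit of Retriever.py.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_text(text: str) -> list[str]:
    """Split extracted text into overlapping, semantically manageable chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # characters per chunk (assumed value)
        chunk_overlap=200,  # overlap preserves context across chunk boundaries (assumed)
    )
    return splitter.split_text(text)
```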
Embedding Generation: For each text chunk, a high-dimensional vector embedding is generated. We utilize the "granite-embedding:278m" model from Ollama, managed by the get_embedding function in embeddings.py. This allows the system to capture the semantic meaning of the text, enabling similarity-based searches.
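One possible shape of get_embedding is sketched below, assuming the official ollama Python client; the actual implementation in embeddings.py may differ.

```python
# Sketch of get_embedding from embeddings.py (exact implementation may differ).
# Assumes a local Ollama server with the granite-embedding:278m model pulled.
import ollama

def get_embedding(text: str) -> list[float]:
    """Return the embedding vector for a text chunk via the local Ollama server."""
    response = ollama.embeddings(model="granite-embedding:278m", prompt=text)
    return response["embedding"]
```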
Vector Database Storage: The generated embeddings, along with their corresponding text chunks, are stored in a persistent Chroma vector database. The database.py file manages the connection and provides the store_embedding function for efficient storage. The database persists in the ./chroma_db directory.
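The storage layer could look roughly like the following, assuming the chromadb client is used directly; the collection name shown is an assumption.

```python
# Sketch of the storage layer in database.py; the collection name is assumed.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="research_papers")  # name assumed

def store_embedding(doc_id: str, text: str, embedding: list[float]) -> None:
    """Persist a text chunk and its embedding in the Chroma collection."""
    collection.add(ids=[doc_id], documents=[text], embeddings=[embedding])
```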
Information Retrieval: When a user submits a query, the system converts the query into an embedding using the same Ollama model. A similarity search is then performed in the Chroma database via a retriever object, configured to fetch the top 5 most similar text chunks. This ensures that the most relevant context is retrieved for the LLM.
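Retrieval can be sketched with LangChain's Chroma and Ollama integrations; the wiring below is an assumption about how the persisted store is reopened, and the query string is purely illustrative.

```python
# Sketch of building the top-5 retriever over the persisted Chroma store.
# Assumes: pip install langchain-chroma langchain-ollama
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="granite-embedding:278m")
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Fetch the top 5 most similar chunks, matching the configuration described above.
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
relevant_chunks = retriever.invoke("What method does the paper propose?")  # example query
```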
Retrieval-Augmented Generation (RAG Pipeline): The core RAG pipeline is implemented in RAG_pipeline.py. It integrates the retrieved context with a large language model. We employ the "granite3.1-moe:1b" model from Ollama. A PromptTemplate guides the LLM to generate answers strictly based on the provided context, ensuring accuracy and preventing hallucination.
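A condensed sketch of the pipeline is shown below, reusing the retriever from the previous step; the prompt wording is illustrative and not the exact template defined in Prompt_template_for_LLM.py.

```python
# Sketch of the RAG chain in RAG_pipeline.py (prompt wording is illustrative).
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM

prompt = PromptTemplate.from_template(
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
llm = OllamaLLM(model="granite3.1-moe:1b")

def answer(question: str) -> str:
    docs = retriever.invoke(question)                 # retriever from the previous sketch
    context = "\n\n".join(d.page_content for d in docs)
    return llm.invoke(prompt.format(context=context, question=question))
```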
User Interface: A web-based interactive chat interface is provided using Gradio, as defined in GradioUI.py. This interface allows users to input queries and receive streamed responses from the RAG pipeline, offering a seamless user experience.
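A minimal Gradio chat loop with streaming might look like the following sketch, reusing the llm, prompt, and retriever objects from the earlier sketches; GradioUI.py may structure this differently.

```python
# Sketch of the streaming chat interface in GradioUI.py (illustrative).
import gradio as gr

def chat(message, history):
    docs = retriever.invoke(message)
    context = "\n\n".join(d.page_content for d in docs)
    partial = ""
    # Stream tokens back to the UI as the LLM produces them.
    for token in llm.stream(prompt.format(context=context, question=message)):
        partial += token
        yield partial

gr.ChatInterface(chat).launch()
```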
To validate the functionality and performance of our RAG system, we conducted several experiments, focusing on both individual components and the end-to-end pipeline.
Embedding Generation and Storage Tests: The test_embeddings.py script was used to rigorously test the get_embedding and store_embedding functions.
Test cases included long strings, text containing special characters, and invalid inputs, exercising both successful embedding generation and the error-handling paths of both functions.
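The following sketch illustrates tests in this spirit; the function names and expected behaviors are assumptions, not the exact contents of test_embeddings.py.

```python
# Illustrative tests in the spirit of test_embeddings.py (names are assumptions).
import pytest
from embeddings import get_embedding

def test_long_string_embedding():
    vector = get_embedding("word " * 2000)          # very long input
    assert isinstance(vector, list) and len(vector) > 0

def test_special_characters_embedding():
    vector = get_embedding("Résumé – ∑(xᵢ) ≈ 42 🚀")  # non-ASCII and symbols
    assert all(isinstance(v, float) for v in vector)

def test_invalid_input_raises():
    with pytest.raises(Exception):                   # error handling for invalid input
        get_embedding(None)
```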
Batch PDF Processing: The text_chunk.py utility was developed and utilized to simulate real-world data ingestion. This script automates the processing of all PDF files within a designated data folder, extracting text, generating embeddings, and storing them in the Chroma database. This experiment confirmed the system's ability to efficiently handle multiple documents and populate the vector store.
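The batch ingestion loop can be sketched as follows, reusing the helpers from the earlier sketches; the data folder name follows the description above, and the document ID scheme is an assumption.

```python
# Sketch of the batch ingestion loop in text_chunk.py (illustrative).
from pathlib import Path

for pdf_path in Path("data").glob("*.pdf"):
    text = extract_text(str(pdf_path))                # file_to_text.py step
    for i, chunk in enumerate(chunk_text(text)):      # Retriever.py chunking
        embedding = get_embedding(chunk)              # embeddings.py
        store_embedding(f"{pdf_path.stem}-{i}", chunk, embedding)  # database.py
```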
RAG Pipeline Performance: Qualitative assessments were performed using the Gradio UI by posing various questions related to the content of processed PDFs. This involved evaluating the relevance of retrieved information and the coherence and accuracy of the LLM's generated responses, particularly its adherence to the provided context as per the Prompt_template_for_LLM.py guidelines. The streaming capability of the LLM was also observed for responsiveness.
The experiments yielded positive results, demonstrating the effectiveness of the RAG system:
Embedding Robustness: The test_embeddings.py script confirmed that the "granite-embedding:278m" model consistently generated embeddings for diverse text inputs, including long strings and special characters, without critical failures. Error handling for invalid inputs was also validated.
Efficient Data Ingestion: The text_chunk.py script successfully processed and ingested multiple PDF documents into the Chroma database. The chunking mechanism and subsequent embedding storage proved efficient, laying a solid foundation for retrieval.
Accurate Retrieval: The retriever created via Chroma.as_retriever with k=5 consistently returned the most relevant text chunks from the vector database for user queries, providing rich context to the LLM.
Context-Aware Generation: The RAG pipeline, integrating the "granite3.1-moe:1b" LLM and the custom prompt template, consistently generated answers that were directly supported by the retrieved context. The LLM demonstrated adherence to the "Adherence to Context" and "No Fabrication" guidelines specified in the prompt.
Interactive User Experience: The Gradio interface provided a smooth and interactive chat experience, with the streaming responses enhancing user engagement. The ability to stream chunks of text from the LLM, as implemented in GradioUI.py, contributed to a responsive feel.
Overall, the system effectively combines retrieval and generation capabilities to provide a powerful tool for question-answering on research papers.
This project successfully developed and implemented a robust Retrieval-Augmented Generation (RAG) system for question-answering on research papers. By modularizing components for text extraction, embedding generation, vector database management, and LLM integration, we created an efficient and scalable solution. The use of Ollama's granite-embedding:278m and granite3.1-moe:1b models proved effective for semantic understanding and coherent response generation, respectively. The Chroma vector database provided reliable storage and efficient retrieval of document embeddings. The Gradio-based user interface enables intuitive interaction, making complex research paper content more accessible. Future work could involve exploring more advanced chunking strategies, fine-tuning LLMs for specific domains, and integrating a wider range of document formats to further enhance the system's capabilities.