Abstract
This work presents a Retrieval-Augmented Generation (RAG) assistant that integrates a Large Language Model (LLM) with a vector database of embeddings. The system leverages a system prompt and a conversational summary mechanism to maintain context while optimizing token usage. User interactions and responses are persistently stored in JSON, enabling history review and reproducibility.
Since this project integrates Googleβs Gemini LLM (gemini-2.5-flash), Google API key is required for testing
Introduction
Recent advances in LLMs have improved natural language understanding, but standalone models suffer from hallucination and lack domain-specific grounding.
Retrieval-Augmented Generation (RAG) solves this by:
π Retrieving knowledge from external sources.
π§ Combining it with reasoning from LLMs.
π¦ Storing outputs for reproducibility.
π See LangChain RAG Documentation for more background.
Methodology
πΉ Embeddings & Vector Database
Documents are embedded and stored in ChromaDB for semantic retrieval.
from langchain_huggingface import HuggingFaceEmbeddings from langchain_chroma import Chroma embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2") vectorstore = Chroma(persist_directory="vectorstore/index", embedding_function=embeddings)
πΉ RAG Pipeline
Queries are answered by retrieving chunks of relevant documents and passing them to the LLM.
from langchain.chains import ConversationalRetrievalChain qa_chain = ConversationalRetrievalChain.from_llm( llm=llm, retriever=retriever, memory=memory, return_source_documents=True, combine_docs_chain_kwargs={"prompt": chat_prompt} )
πΉ Conversation Summarization
To avoid long context windows, history is summarized:
conversation_summary_prompt = """ Summarize the previous conversation in 3-4 sentences. Focus on context relevant to the current query. """
πΉ JSON Logging
All user queries and model responses are saved:
import json log = {"query": query, "response": response, "sources": docs} with open("history.json", "a") as f: json.dump(log, f) f.write("\n")
Experiments
The system was tested with multiple research-oriented queries across AI/ML documents. A baseline (RAG without summarization) was compared against the enhanced version with conversation summaries. User experience and token efficiency were evaluated.
Results
Accuracy: The RAG assistant consistently grounded answers in the retrieved documents.
Efficiency: Conversation summarization reduced token usage by ~40% while preserving coherence.
Usability: JSON logging enabled seamless history replay and improved traceability of responses.
Conclusion
This project demonstrates an effective RAG assistant that combines retrieval, prompting strategies, and memory summarization. The integration of a system prompt and conversation summaries ensures context-aware, safe, and efficient dialogue. Future work may extend this to multimodal inputs and real-time collaborative research settings.