
This project implements a Retrieval-Augmented Generation (RAG) based AI Assistant that answers questions using only the content inside a set of domain documents. The system loads text files, splits them into smaller chunks, generates embeddings using Google Gemini Embeddings, stores them in a ChromaDB vector database, and retrieves relevant context when a user asks a question. An LLM (Gemini 2.5 Flash) then produces a grounded answer based strictly on the retrieved text. When the documents do not contain the answer, the model is forced to respond: “I don’t have enough information from the documents.” This project was completed as part of the Ready Tensor Agentic AI Developer Certification – Module 1.
Modern AI systems increasingly rely on Retrieval-Augmented Generation (RAG) to provide accurate, grounded responses with far fewer hallucinations. Traditional language models often generate fluent answers without clear ties to verified sources, which limits transparency and reliability. RAG addresses this by combining embedding-based retrieval with controlled generation, ensuring that every answer is supported by document context.
This project implements a lightweight but robust RAG pipeline using Google Generative AI embeddings, Gemini LLM, and ChromaDB for vector storage. The system ingests domain-specific text files, chunks them with overlap to preserve context, embeds those chunks, and retrieves top-K matches during question-answering. The model strictly answers using only the retrieved context, ensuring factual grounding and preventing unsupported claims.
This publication explains the architecture, design decisions, safety measures, evaluation approach, and real execution results. It demonstrates how a simple yet well-engineered RAG system can deliver reliable, context-aware answers suitable for real-world applications such as documentation assistants, knowledge explorers, and project-based AI tools.
The system follows a standard RAG workflow: load the documents → split them into chunks → embed the chunks with text-embedding-004 → store them in ChromaDB → retrieve the top-K chunks for each query → generate a grounded answer with Gemini 2.5 Flash.
If the answer is not found in the retrieved context, the model returns:
“I don't have enough information from the documents.”
For this project, I prepared three small domain documents, stored in the data/ directory, which act as the custom knowledge base for retrieval.
All documents were manually drafted and curated for clarity, converted to UTF-8 text format, and placed in the data folder for ingestion.
No external proprietary datasets were used, ensuring full reproducibility.
The three documents are summarized below:
| File Name | Description | Approx. Length |
|---|---|---|
| vaes_intro.txt | What VAEs are and common applications | ~250 words |
| vaes_vs_autoencoders.txt | Difference between VAEs and traditional autoencoders | ~220 words |
| transformers_basics.txt | Explanation of transformers and self-attention | ~260 words |
All files are plain text and fully included in this repository.
| Component | Purpose |
|---|---|
| Google Gemini 2.5 Flash (LLM) | Answer generation |
| Google text-embedding-004 | Chunk and query embeddings |
| ChromaDB | Vector store for chunk embeddings |
| LangChain | Prompt templating + pipeline orchestration |
| Python 3.9+ | Runtime |
🔹 vectordb.py
- Acts as a lightweight wrapper around ChromaDB for vector storage and similarity search (a minimal sketch follows this list).
- Splits documents into smaller chunks to improve search accuracy.
- Generates embeddings using Google Generative AI (text-embedding-004).
- Stores each chunk with metadata (source file, chunk index, length).
- Performs fast similarity search and returns the most relevant chunks.
- Includes retry logic to handle temporary API errors (e.g., 504 timeouts).
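To make this concrete, here is a minimal sketch of what such a wrapper could look like. It is illustrative only, not the repository's actual code: the class name DocumentStore and the collection name are assumptions, and it relies on the chromadb and langchain-google-genai packages with GOOGLE_API_KEY available in the environment.

```python
# Illustrative sketch of the vectordb.py wrapper (class and collection names are
# assumptions). Requires the chromadb and langchain-google-genai packages and a
# GOOGLE_API_KEY in the environment.
import chromadb
from langchain_google_genai import GoogleGenerativeAIEmbeddings


class DocumentStore:
    """Thin wrapper around a persistent ChromaDB collection."""

    def __init__(self, persist_dir: str = "chroma_db"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection("rag_chunks")
        self.embedder = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

    def add_chunks(self, chunks: list[str], source: str) -> None:
        # Embed each chunk and store it with metadata (source file, chunk index, length).
        vectors = self.embedder.embed_documents(chunks)
        self.collection.upsert(  # upsert so re-ingesting a file does not duplicate chunks
            ids=[f"{source}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=vectors,
            metadatas=[{"source": source, "chunk_index": i, "length": len(c)}
                       for i, c in enumerate(chunks)],
        )

    def search(self, query: str, k: int = 3) -> dict:
        # Return the k most similar chunks together with their metadata and distances.
        query_vec = self.embedder.embed_query(query)
        return self.collection.query(query_embeddings=[query_vec], n_results=k)
```

Keeping the source file, chunk index, and length as metadata is what later lets app.py cite which document an answer came from.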
🔹 app.py
- Loads .txt documents and sends them for ingestion.
- Builds a RAG prompt that forces answers to rely only on retrieved context.
- Runs the full RAG pipeline: retrieve → format context → call LLM → output answer + sources (a pipeline sketch follows this list).
- Gracefully handles missing info by returning: “I don't have enough information from the documents.”
- Uses .env for secure API key management and .env.example for reproducibility.
- Leverages ChromaDB persistence to avoid re-embedding on every run.
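The question-answering flow can be sketched roughly as follows. Again, this is an illustrative outline rather than the repository's exact implementation; it reuses the hypothetical DocumentStore above and assumes the langchain-google-genai and python-dotenv packages.

```python
# Illustrative sketch of the app.py question-answering flow (not the exact code).
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

load_dotenv()  # read GOOGLE_API_KEY from the local .env file

FALLBACK = "I don't have enough information from the documents."

PROMPT = ChatPromptTemplate.from_template(
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly:\n"
    f"{FALLBACK}\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")


def answer_question(store, question: str, k: int = 3) -> str:
    # Retrieve the top-k chunks, format them as context, and ask the LLM.
    results = store.search(question, k=k)
    docs = results["documents"][0]
    if not docs:
        return FALLBACK
    sources = sorted({m["source"] for m in results["metadatas"][0]})
    context = "\n\n".join(docs)
    reply = llm.invoke(PROMPT.format_messages(context=context, question=question))
    return f"{reply.content}\n(sources: {', '.join(sources)})"
```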
This project is released under the MIT License, enabling anyone to use, modify, and distribute the code with proper attribution. The complete license text is included in the GitHub Repository.
ChromaDB was selected because:
- It is lightweight, fast, and easy to integrate.
- It works locally without requiring external hosting.
- It supports metadata, filtering, and similarity search.
- It is ideal for small- and medium-scale RAG systems.
- Its simple API makes it well suited to certification learning projects.
For this project, ChromaDB offered a balance of speed, simplicity, and flexibility, making it the optimal choice.
This project uses a baseline chunk strategy:
- Chunk size: ~500 characters
- Chunk overlap: 0 (default)
This decision was intentional to keep the system simple and deterministic.
However, chunk overlap—typically 20–50 characters—is often used in production to keep semantic connections across sentences. The implementation has been updated to allow easy overlap configuration for improved retrieval accuracy.
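As a rough illustration, an overlap-aware splitter can be as small as the function below; the parameter names are assumptions, with defaults matching the 500-character chunks and 40-character overlap used during evaluation.

```python
# Rough sketch of an overlap-aware splitter; parameter names are assumptions,
# defaults follow the 500-character chunks and 40-character overlap used here.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters,
    so sentences spanning a chunk boundary are not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```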
The RAG assistant was designed with strong safety and hallucination-prevention principles.
Instead of allowing the model to guess or fabricate information, the system forces the LLM to rely exclusively on the retrieved document context. Whenever a query cannot be answered based on the available text, the assistant clearly states:
“I don't have enough information from the documents.”
This explicit fallback ensures full transparency.
A carefully constructed prompt template restricts the model from using external knowledge or creative reasoning beyond the provided text. Additionally, the system sanitizes user inputs, checks for empty or invalid queries, and gracefully handles missing or incomplete content. These measures collectively create a controlled, predictable environment where the model remains grounded and the risk of hallucinations is minimized.
To ensure high-quality, consistent retrieval, the system incorporates lightweight but meaningful evaluation practices.
Each incoming query is preprocessed by converting it to lowercase, trimming unnecessary whitespace, and removing non-text noise so the embeddings represent the clean semantic meaning of the question. This helps stabilize retrieval scores and reduce mismatches caused by formatting or punctuation.
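A minimal sketch of this preprocessing step is shown below; the exact noise-removal pattern is an assumption rather than the repository's implementation.

```python
# Minimal sketch of the query clean-up; the noise-removal pattern is an assumption.
import re


def preprocess_query(raw: str) -> str:
    """Lowercase, trim whitespace, and strip non-text noise before embedding."""
    query = raw.lower().strip()
    query = re.sub(r"[^a-z0-9?\-\s]", " ", query)  # drop punctuation noise
    return re.sub(r"\s+", " ", query).strip()      # collapse repeated whitespace
```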
Once results are retrieved, they are evaluated based on contextual relevance and their alignment with the expected semantic meaning. A short manual relevance check—comparing retrieved content against the intended query—helps confirm the RAG pipeline is functioning correctly. Distances from the vector store are also observed to understand how confidently a chunk matches the user query.
These simple yet effective evaluation steps allow the system to assess retrieval quality without requiring heavy external frameworks.
Following reviewer recommendations, the system was strengthened to behave more consistently even in challenging conditions.
Embedding API calls are wrapped with a retry mechanism so that temporary network delays or timeouts do not break the workflow. Chunking logic was refined to maintain sentence boundaries and preserve context across segments, reducing retrieval fragmentation. Error handling now ensures that unexpected situations—such as empty documents, malformed queries, or failed embeddings—do not cause the system to crash.
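The retry behaviour can be approximated with a small wrapper like the one below; the retry count and backoff schedule are illustrative assumptions.

```python
# Approximate sketch of the retry wrapper; retry count and backoff are assumptions.
import time


def embed_with_retry(embedder, texts, retries: int = 3, backoff: float = 2.0):
    """Retry transient embedding failures (e.g. 504 timeouts) with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return embedder.embed_documents(texts)
        except Exception:
            if attempt == retries:
                raise                       # give up after the final attempt
            time.sleep(backoff ** attempt)  # wait 2s, 4s, ... before retrying
```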
These improvements make the end-to-end RAG pipeline significantly more reliable and resilient during repeated use.
The system was evaluated using real queries executed against the stored documents.
An output was considered correct if:
✅ The answer used only the retrieved context
✅ The model cited the correct document
✅ If no answer existed, it returned the fallback sentence
To evaluate the Retrieval-Augmented AI Assistant, I performed controlled tests focusing on retrieval quality, grounding accuracy, and overall system reliability.
All experiments were run locally on a machine with the following configuration:
| Component | Specification |
|---|---|
| CPU | Intel Core i3 (local laptop) |
| RAM | 8 GB (sufficient for ChromaDB local operations) |
| GPU | Not required (embeddings + LLM served by Google) |
| OS | Windows 10 / 11 |
| Environment | Python virtual environment (.venv) |
| Embedding Model | Google Generative AI – text-embedding-004 |
| Vector Store | ChromaDB (persistent mode enabled) |
| LLM | Gemini 2.5 Flash |
| Chunk Size / Overlap | 500 characters, 40-character overlap |
| Top-k Retrieval | k = 3 |
The evaluation focused on understanding how well the system retrieves relevant text and how accurately the LLM grounds its answers in that context.
Queries were tested across multiple topics and documents. Retrieved chunks were manually inspected for relevance, coherence, and their ability to answer the question. The model’s responses were then verified against these chunks to ensure no external information was introduced.
The impact of chunk overlap was examined by comparing overlapping versus non-overlapping retrieval. Overlap provided smoother continuity and significantly reduced meaning loss across boundaries. The system was also tested with malformed queries, unrelated questions, empty inputs, and simulated network failures to evaluate its robustness.
Each of these tests helped confirm that the system behaves predictably under normal and adverse conditions.
Although the project did not require formal benchmark metrics, practical and informative indicators were used to assess the pipeline’s performance.
Manual relevance scoring was applied to retrieved chunks, verifying whether they aligned with the user’s question. Grounding accuracy was measured by checking whether the model’s answer was entirely supported by the retrieved text. Overlap continuity was assessed to determine whether combined chunks preserved the natural flow of information.
Robustness was validated through retry success rates and a “zero hallucination” expectation. These metrics provided a clear understanding of system performance without overcomplicating the evaluation.
| Query | Expected Result | System Output | Result |
|---|---|---|---|
| “What are VAEs used for?” | Generation, anomaly detection, imputation | ✅ Correct & cited | ✅ Pass |
| “Difference between VAEs and autoencoders?” | Comparison exists in documents | ✅ Correct & cited | ✅ Pass |
| “How do transformers model long-range dependencies?” | Self-attention mechanism | ✅ Correct & cited | ✅ Pass |
✅ Accuracy: 100% (3/3 queries grounded in retrieved text)
| Query | Screenshot |
|---|---|
| What are VAEs used for? | ![]() |
| What is the difference between VAEs and autoencoders? | ![]() |
| How do transformers model long-range dependencies? | ![]() |
| "What are VAEs used for?" --dump-context | ![]() |
(Images stored in /assets/ folder inside the repo)
The system currently works only with plain .txt files, and does not yet support PDFs or HTML content. Retrieval is based purely on semantic vector similarity without additional ranking mechanisms such as keyword weighting or RRF fusion.
The assistant processes one question at a time and does not maintain dialogue memory across turns. Additionally, evaluations were performed on a relatively small corpus. The system is functional but not yet optimized for enterprise-scale datasets or fully deployed environments.
Several enhancements are planned to extend the system’s capabilities.
PDF ingestion, along with OCR processing, will allow the assistant to handle more diverse document formats. A Streamlit or FastAPI interface can turn the solution into an interactive chat assistant.
Introducing conversational memory will support multi-turn reasoning, while integrating cloud-based vector stores like Pinecone or Weaviate can improve scalability. Automated evaluation with more queries will help build stronger confidence in performance.
The system is simple to deploy and can run entirely on a local machine using a Python virtual environment.
It can also be containerized through Docker for reproducibility or extended into a web interface via Streamlit. Proper management of API keys through .env files ensures secure use during deployment.
These options make the assistant suitable for both experimentation and lightweight production environments.
General-purpose LLMs often fail to answer domain-specific questions reliably because they lack the specific context required. This system fills that gap by grounding responses directly in user-provided documents, ensuring that answers are both accurate and contextually aligned.
By constraining the LLM and tightly integrating retrieval, the approach dramatically reduces hallucinations and improves trustworthiness.
Maintaining the system is simple and requires minimal overhead.
Users can add or update documents by placing new text files into the data/ folder and re-running ingestion. Switching to a remote vector database enables persistent storage and multi-user scaling. Periodically refreshing embeddings ensures improved retrieval as documents evolve.
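Re-ingestion can then be a short script along these lines, reusing the hypothetical DocumentStore and chunk_text helpers sketched earlier; upserting by chunk ID keeps repeated runs from duplicating entries.

```python
# Usage sketch: re-ingest everything in data/ after adding or updating files
# (DocumentStore and chunk_text are the illustrative helpers sketched earlier).
from pathlib import Path

store = DocumentStore()
for path in Path("data").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    store.add_chunks(chunk_text(text), source=path.name)
```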
Overall, maintenance remains lightweight while supporting long-term usability.
To show the benefit of using a RAG approach, I compared responses from:
✅ RAG pipeline (Gemini + ChromaDB)
❌ LLM-only prompt (no document retrieval)
| User Question | LLM-Only (No RAG) | RAG System Output |
|---|---|---|
| "What is the difference between VAEs and autoencoders?" | Gives a generic answer based on prior training, even if incorrect | Correctly responds: "I don't have enough information from the documents." |
| "How do transformers model long-range dependencies?" | Generic theoretical response, no citation | Uses retrieved chunks and cites (source: transformers_basics.txt) |
Key Result:
This demonstrates that RAG improves reliability compared to a plain LLM prompt, which can guess or hallucinate.
This RAG-based assistant demonstrates how organizations can build reliable, private knowledge tools grounded in their internal content.
Its architecture provides a strong foundation for document search systems, internal QA assistants, onboarding tools, and customer support knowledge bases.
With further enhancements such as PDF ingestion and semantic evaluation, the system can be adapted into a full-scale enterprise retrieval assistant.
Many enterprises already use RAG systems to ground large language models in proprietary knowledge.
Systems like customer support, medical record search, and legal document lookup rely on RAG for factual accuracy and auditability.
This project demonstrates a complete working RAG assistant using Gemini and ChromaDB. It retrieves context, grounds responses, cites sources, handles missing information gracefully, and follows production-aligned design principles.
👤 Author
Suraj Mahale
AI & Salesforce Developer
GitHub: https://github.com/sbm-11-SFDC