
This publication presents a comprehensive, production-ready Retrieval-Augmented Generation (RAG) assistant designed for the Ready Tensor Agentic AI Developer Certification Program. The system efficiently processes Ready Tensor publications, leveraging LangChain for workflow orchestration, Google Gemini 1.5 Flash for generative capabilities, and ChromaDB for vector-based information retrieval. Deployed on the Render cloud platform, the assistant offers a user-friendly Streamlit web interface with real-time performance metrics and conversational memory. Key aspects include robust data processing, efficient context management, and clear source attribution, all contributing to a reliable and verifiable AI interaction. This work demonstrates the practical implementation of a high-performance RAG system, emphasizing its architecture, deployment, and user experience for certification purposes.
Large Language Models (LLMs) are capable of generating fluent and informative responses, but they often suffer from hallucinations when answering questions that require factual accuracy or domain-specific knowledge beyond their training data. This limitation poses significant challenges for real-world applications that rely on trustworthy, document-grounded information.
Retrieval-Augmented Generation (RAG) addresses this issue by combining information retrieval techniques with generative language models. Instead of relying solely on parametric knowledge, a RAG system retrieves relevant document segments from an external knowledge source and uses them as context for response generation. This approach helps ensure that generated answers are grounded in actual data, improving reliability and interpretability.
The purpose of this project is to demonstrate the design and implementation of a domain-specific question answering system using a Retrieval-Augmented Generation pipeline. The primary objective is to reduce hallucinations, improve answer accuracy, and maintain contextual continuity across long documents through effective chunking, embedding-based retrieval, and controlled prompt engineering.
This publication presents a practical RAG implementation that includes document ingestion, text chunking with overlap, vector embeddings, similarity-based retrieval, and response generation using Google Gemini (free tier). The system is intentionally scoped to operate on the Ready Tensor publications dataset, making it suitable for technical documentation assistance and knowledge-based systems.
Live Demo: https://rag-assistant-pg48.onrender.com
Source Code: https://github.com/Yassin351/rag-assistant
While general-purpose language models can answer a wide range of questions, they lack access to up-to-date or domain-specific information and may generate incorrect or fabricated responses when queried about specialized content. This becomes a critical issue in scenarios where accuracy, traceability, and trustworthiness are required, such as technical documentation analysis or knowledge-based systems.
The challenge addressed in this project is to design a system that enables a language model to answer questions based strictly on the Ready Tensor publication corpus, ensuring that responses are grounded in retrieved context rather than unsupported assumptions. The system must also handle long documents effectively while preserving semantic continuity across sections.

The system follows a standard Retrieval-Augmented Generation architecture composed of two primary stages: retrieval and generation.
Document Ingestion: Raw documents (project_1_publications (2).json) are loaded and preprocessed to prepare them for indexing. The loader respects semantic markers (--DIVIDER--) to preserve document structure.
Chunking and Embedding: Documents are split into smaller overlapping chunks to preserve contextual flow. Each chunk is converted into a dense vector representation using Google's embedding-001 model (free tier).
Vector Store: The embeddings are stored in a ChromaDB vector database that enables efficient similarity search with cosine similarity indexing.
Query Processing and Retrieval: User queries are embedded and compared against stored document vectors to retrieve the most relevant chunks (Top-k=5).
Response Generation: Retrieved chunks are passed as context to Google Gemini 1.5 Flash, which generates a final answer grounded in the provided documents with source attribution.

Document Processing Pipeline
The document processing subsystem handles the ingestion and preparation of textual content from the Ready Tensor publications corpus:
Document Loading: Implemented in src.document_loader (or document_loader.py), this component loads the JSON publications file (project_1_publications (2).json) and processes entries using Python's JSON loader with regex-based section splitting to respect semantic boundaries.
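A minimal sketch of what such a loader might look like. The JSON field names (e.g., title, publication_description) are assumptions about the dataset schema, not confirmed from the actual document_loader.py:

```python
import json
import re

from langchain_core.documents import Document

def load_publications(path: str = "project_1_publications (2).json") -> list[Document]:
    """Load the publications JSON and split each entry on --DIVIDER-- markers."""
    with open(path, encoding="utf-8") as f:
        publications = json.load(f)

    documents = []
    for pub in publications:
        # Field names ("title", "publication_description") are assumed, not verified.
        body = pub.get("publication_description", "")
        for section in re.split(r"--DIVIDER--", body):
            section = section.strip()
            if section:
                documents.append(
                    Document(page_content=section, metadata={"title": pub.get("title", "")})
                )
    return documents
```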
Document Chunking: After loading, documents are segmented into smaller chunks using RecursiveCharacterTextSplitter with configurable chunk size (default: 1000 characters) and overlap (default: 200 characters). This chunking strategy preserves semantic coherence, particularly around code blocks and paragraph boundaries, while creating appropriately sized segments for embedding.
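A minimal sketch of the splitter configuration, assuming the default values from the configuration table below (the project's actual separator settings may differ):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk size and overlap follow the defaults described above.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
```

The loaded documents are then split into chunks: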
```python
chunks = text_splitter.split_documents(documents)
```
Vector Storage: ChromaDB serves as the persistent vector store, maintaining embeddings and their associated document metadata in the ./chroma_db directory. This component supports both the creation of new vector stores and the loading of existing ones between deployments.
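A hedged sketch of how the store could be created or reloaded with langchain-chroma and the Google embedding model; the actual vector_store.py logic may differ:

```python
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Create a new persistent store from the chunks...
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# ...or reload an existing one on a later run.
vector_db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
```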
RAG Chain Assembly: The system assembles the retrieval-augmented generation chain that connects vector similarity search with prompt formatting, conversational memory, and language model generation.
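The following is a minimal LCEL sketch of such a chain; the prompt wording is illustrative, and conversational memory (handled in rag_chain.py) is omitted here for brevity:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.3)
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is Retrieval-Augmented Generation?")
```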
Environment Variables: Sensitive information like API keys is loaded from environment variables (.env file locally, Render Environment Variables in production).
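A small sketch of the key-loading step, assuming python-dotenv is used locally:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env locally; on Render the variable is injected by the platform
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set")
```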
Project Scope and RAG Configuration
The scope of this project is intentionally limited to the Ready Tensor publications dataset to ensure focused and meaningful retrieval. By constraining the knowledge base to this defined set of documents, the system avoids irrelevant results and improves answer precision.
| Parameter | Value | Description |
|---|---|---|
| Chunk Size | 1000 characters | Text segment size |
| Chunk Overlap | 200 characters | Context preservation between chunks |
| Embedding Model | models/embedding-001 | Google Generative AI (768 dimensions) |
| Vector Store | ChromaDB | Local persistent storage |
| Top-k Retrieval | 5 documents | Number of chunks retrieved per query |
| LLM | Gemini 1.5 Flash | Temperature 0.3 for factual accuracy |
| Memory Window | 3 turns | Conversation buffer size |
These parameters represent a balance between retrieval accuracy, latency (~2.3s average), and resource usage.
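For reference, these defaults could be centralized in config.py roughly as follows; the constant names are illustrative assumptions, not the module's actual contents:

```python
# Illustrative configuration constants mirroring the table above.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "models/embedding-001"
PERSIST_DIRECTORY = "./chroma_db"
TOP_K = 5
LLM_MODEL = "gemini-1.5-flash"
LLM_TEMPERATURE = 0.3
MEMORY_WINDOW = 3
```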
## Initialization Phase
The system checks for an existing vector store
If none exists, it loads documents from the configured directory
Documents are split into semantic chunks
Chunks are embedded and stored in ChromaDB
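A compact sketch of this initialization flow, reusing the helpers from the earlier sketches; the real init_vector_db() in app.py may be structured differently:

```python
import os

def init_vector_db():
    """Check for an existing store; otherwise load, chunk, embed, and persist."""
    if os.path.isdir("./chroma_db") and os.listdir("./chroma_db"):
        # Reuse the persisted store from a previous run.
        return Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

    documents = load_publications()
    chunks = text_splitter.split_documents(documents)
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
    )
```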
## Query Phase
User submits a question through the CLI interface
The system loads the existing vector database
The question is embedded and used to perform similarity search
Relevant document chunks are retrieved
Retrieved context and question are formatted into a prompt
The LLM generates a response based on the provided context
The answer is returned to the user
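The query phase could look roughly like the sketch below, reusing the prompt and llm objects from the chain sketch above; get_vector_db() refers to the accessor named in the Integration Points list that follows:

```python
def query_agent(question: str) -> str:
    """Embed the question, retrieve top-k chunks, and generate a grounded answer."""
    vector_db = get_vector_db()  # assumed accessor from app.py
    retriever = vector_db.as_retriever(search_kwargs={"k": 5})

    docs = retriever.invoke(question)  # similarity search over the embedded chunks
    context = "\n\n".join(doc.page_content for doc in docs)

    messages = prompt.invoke({"context": context, "question": question})
    response = llm.invoke(messages)  # Gemini 1.5 Flash, temperature 0.3
    return response.content
```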
## Integration Points
The app.py file serves as the integration point, orchestrating the various components:
init_vector_db() handles the initialization workflow
get_vector_db() manages vector database access
query_agent() implements the query workflow
run_cli() provides the user interface
This architecture follows separation of concerns principles, with each module handling a specific aspect of the RAG workflow. The modular design allows for component replacement or enhancement without disrupting the overall system function, making the system adaptable to future improvements in embedding models, language models, or chunking strategies.
Ensuring responsible and safe usage of language models is a critical aspect of this project. The system is designed to reduce hallucinations by grounding responses strictly in retrieved document context rather than relying on the language model's internal knowledge.
Key Safety Measures:
Context Grounding: The model is instructed to answer only based on the retrieved context and to avoid speculative or out-of-scope responses
Source Attribution: Every response includes citations to specific publications and authors, enabling verification
Conservative Handling: Queries that cannot be answered using the available documents are handled with a clear "I don't have sufficient information" message
Domain Limitation: Limiting the system to the predefined Ready Tensor corpus helps control output relevance and minimizes unintended responses
API Security: API keys are stored in environment variables (.env) and never committed to version control
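An illustrative grounding prompt capturing the first three measures; the exact wording used in rag_chain.py may differ:

```python
from langchain_core.prompts import ChatPromptTemplate

# Illustrative grounding prompt; the project's actual instruction text is assumed.
grounded_prompt = ChatPromptTemplate.from_template(
    "You are an assistant for the Ready Tensor publications corpus.\n"
    "Answer ONLY from the context below and cite the publication titles you relied on.\n"
    "If the context does not contain the answer, reply: "
    "\"I don't have sufficient information to answer that.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
```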
## Key Achievements
The RAG-based AI Publication Agent successfully demonstrates:
Effective Document-Grounded Responses: By combining vector-based retrieval with language model generation, the system produces responses that remain faithful to the source material, reducing the hallucination problems common in pure LLM approaches.
Modular, Maintainable Architecture: The system's clear separation of concerns across document processing, vector representation, and response generation creates a flexible framework that can be extended and maintained with minimal coupling between components.
Practical Implementation Balance: The chosen technologies, particularly the models/embedding-001 embedding model and the Gemini 1.5 Flash language model, establish an effective balance between performance, resource efficiency, and response quality.
Intelligent Context Retrieval: The semantic search capabilities enabled by the vector database ensure that responses are informed by the most relevant document fragments, even when user queries don't exactly match document terminology.
Current Limitations
Ephemeral Vector Store: ChromaDB runs in-memory on Render (free tier), requiring rebuild on deployment restart
Rate Limiting: Subject to Google Gemini free tier limits (requests per minute)
Single Session Memory: No persistent user sessions across browser refreshes
Static Dataset: Limited to the provided Ready Tensor publications without real-time updates
Future Work
Persistent Memory: Implement Redis/PostgreSQL for cross-session memory storage
Multi-Agent Coordination: Extend to LangGraph for complex agent workflows (Module 2 scope)
Retrieval Evaluation: Implement quantitative metrics (precision@k, recall@k, MRR)
Advanced Reranking: Add cross-encoder reranking for improved retrieval accuracy
Multi-Modal Support: Extend to handle images and charts from publications
This project illustrates how Retrieval-Augmented Generation can be used to build a domain-specific question answering system that produces grounded, context-aware responses. By combining document retrieval with controlled generation using Google Gemini (free tier), the system addresses key limitations of standalone language models, such as hallucinations and lack of domain knowledge.
The work demonstrates foundational RAG concepts including chunking with overlap, embedding-based similarity search, conversational memory, and responsible prompt design. With live deployment on Render and source code available on GitHub, this project provides a complete, reproducible reference implementation for the Agentic AI Developer Certification.
Project: RAG-Based AI Assistant for Ready Tensor Publications
Certification: Agentic AI Developer Certification (Module 1)
Deployment: https://rag-assistant-pg48.onrender.com (Live)
Repository: https://github.com/Yassin351/rag-assistant
This project is a Retrieval-Augmented Generation (RAG) assistant developed as part of Module 1: Foundations of Agentic AI in the Agentic AI Developer Certification (AAIDC) by Ready Tensor.
The assistant answers user questions by retrieving relevant information from the Ready Tensor publications dataset stored in a vector database and generating grounded responses using Google Gemini (free tier).
Key Features
Memory Management: ConversationBufferWindowMemory with 3-turn context retention
Vector Search: ChromaDB with Google embedding-001 (cosine similarity)
Multiple Interfaces: Interactive CLI and Streamlit web UI
Observability: Comprehensive logging with latency tracking
Source Attribution: Automatic citation of publication titles and authors
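The 3-turn memory window could be wired up roughly as follows; the memory key and message format are assumptions:

```python
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last three exchanges, matching the 3-turn window described above.
memory = ConversationBufferWindowMemory(
    k=3,
    memory_key="chat_history",
    return_messages=True,
)
```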
# Technologies Used
Python 3.9+
LangChain (LCEL, LangChain Expression Language) - RAG pipeline orchestration
langchain-google-genai - Gemini integration
langchain-chroma - Vector store integration
ChromaDB 0.4.22+ - Vector database
Google Gemini 1.5 Flash (free tier) - LLM
Google Embedding models/embedding-001 - Vector embeddings
Streamlit 1.30.0+ - Web interface
Render - Cloud deployment platform
GitHub - Source control and CI/CD
Project Structure:

```
rag-assistant/
├── .env                 # API keys (gitignored)
├── .env.example         # Template for environment variables
├── requirements.txt     # Dependencies
├── config.py            # Configuration management
├── document_loader.py   # JSON processing & semantic chunking
├── vector_store.py      # ChromaDB management
├── rag_chain.py         # RAG logic with conversational memory
├── main.py              # CLI interface with colored output
├── app.py               # Streamlit web UI
├── test_queries.py      # Evaluation and benchmarking suite
├── setup.bat            # Windows setup script
├── logs/                # Application logs (gitignored)
└── chroma_db/           # Vector store persistence (gitignored)
```
## 1️⃣ Clone the Repository

```bash
git clone https://github.com/Yassin351/rag-assistant.git
cd rag-assistant
```

## 2️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```
## 3️⃣ Environment Variable Configuration
Create a .env file in the root directory:
```bash
cp .env.example .env
```
Edit .env and add your Google API key:
```
GOOGLE_API_KEY=your_gemini_api_key_here
```
Get your free API key from: https://aistudio.google.com/app/apikey
Alternative: GitHub Codespaces (Recommended for Cloud Development)
If using GitHub Codespaces:
Add a Codespaces Secret named GOOGLE_API_KEY
Value: your Gemini API key
Restart the Codespace after adding the secret
Step 1: Ingest Documents (Build Vector Store)
```bash
venv\Scripts\activate       # Windows
source venv/bin/activate    # Mac/Linux

python main.py --rebuild
```
# References
[1] LangChain. (2024). LangChain Documentation: Build context-aware reasoning applications. Retrieved from https://python.langchain.com/docs/
[2] Google DeepMind. (2024). Gemini 1.5 Flash: High-efficiency multimodal model for high-volume applications. Google AI Studio. Available at: https://ai.google.dev/
[3] Google. (2024). Embeddings API: models/embedding-001 Technical Documentation. Google Generative AI. Retrieved from https://ai.google.dev/tutorials/python_quickstart
[4] Chroma. (2024). ChromaDB: The AI-native open-source vector database. Retrieved from https://docs.trychroma.com/
[5] Ready Tensor. (2024). Agentic AI Developer Certification: Module 1 Dataset (project_1_publications.json). Ready Tensor Publications Corpus.
[6] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
[7] Streamlit. (2024). Streamlit Documentation: A faster way to build and share data apps. Retrieved from https://docs.streamlit.io/