
This publication presents a comprehensive, production-ready Retrieval-Augmented Generation (RAG) assistant designed for the Ready Tensor Agentic AI Developer Certification Program. The system efficiently processes Ready Tensor publications, leveraging LangChain for workflow orchestration, Google Gemini 1.5 Flash for generative capabilities, and ChromaDB for vector-based information retrieval. Deployed on the Render cloud platform, the assistant offers a user-friendly Streamlit web interface with real-time performance metrics and conversational memory. Key aspects include robust data processing, efficient context management, and clear source attribution, all contributing to a reliable and verifiable AI interaction. This work demonstrates the practical implementation of a high-performance RAG system, emphasizing its architecture, deployment, and user experience for certification purposes.
Large Language Models (LLMs) are capable of generating fluent and informative responses, but they often suffer from hallucinations when answering questions that require factual accuracy or domain-specific knowledge beyond their training data. This limitation poses significant challenges for real-world applications that rely on trustworthy, document-grounded information.
Retrieval-Augmented Generation (RAG) addresses this issue by combining information retrieval techniques with generative language models. Instead of relying solely on parametric knowledge, a RAG system retrieves relevant document segments from an external knowledge source and uses them as context for response generation. This approach helps ensure that generated answers are grounded in actual data, improving reliability and interpretability.
The purpose of this project is to demonstrate the design and implementation of a domain-specific question answering system using a Retrieval-Augmented Generation pipeline. The primary objective is to reduce hallucinations, improve answer accuracy, and maintain contextual continuity across long documents through effective chunking, embedding-based retrieval, and controlled prompt engineering.
This publication presents a practical RAG implementation that includes document ingestion, text chunking with overlap, vector embeddings, similarity-based retrieval, and response generation using Google Gemini (free tier). The system is intentionally scoped to operate on the Ready Tensor publications dataset, making it suitable for technical documentation assistance and knowledge-based systems.
Live Demo: https://rag-assistant-pg48.onrender.com
Source Code: https://github.com/Yassin351/rag-assistant
While general-purpose language models can answer a wide range of questions, they lack access to up-to-date or domain-specific information and may generate incorrect or fabricated responses when queried about specialized content. This becomes a critical issue in scenarios where accuracy, traceability, and trustworthiness are required, such as technical documentation analysis or knowledge-based systems.
The challenge addressed in this project is to design a system that enables a language model to answer questions based strictly on the Ready Tensor publication corpus, ensuring that responses are grounded in retrieved context rather than unsupported assumptions. The system must also handle long documents effectively while preserving semantic continuity across sections.

The system follows a standard Retrieval-Augmented Generation architecture composed of two primary stages: retrieval and generation.
Document Ingestion: Raw documents (project_1_publications (2).json) are loaded and preprocessed to prepare them for indexing. The loader respects semantic markers (--DIVIDER--) to preserve document structure.
Chunking and Embedding: Documents are split into smaller overlapping chunks to preserve contextual flow. Each chunk is converted into a dense vector representation using Google's embedding-001 model (free tier).
Vector Store: The embeddings are stored in a ChromaDB vector database that enables efficient similarity search with cosine similarity indexing.
Query Processing and Retrieval: User queries are embedded and compared against stored document vectors to retrieve the most relevant chunks (Top-k=5).
Response Generation: Retrieved chunks are passed as context to Google Gemini 1.5 Flash, which generates a final answer grounded in the provided documents with source attribution.

Document Processing Pipeline
The document processing subsystem handles the ingestion and preparation of textual content from the Ready Tensor publications corpus:
Document Loading: Implemented in src.document_loader (or document_loader.py), this component loads the JSON publications file (project_1_publications (2).json) and processes entries using Python's JSON loader with regex-based section splitting to respect semantic boundaries.
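A minimal sketch of what such a loader might look like. The JSON field names (e.g., title, publication_description) are assumptions about the dataset schema, not confirmed from the actual document_loader.py:

```python
import json
import re

from langchain_core.documents import Document

def load_publications(path: str = "project_1_publications (2).json") -> list[Document]:
    """Load the publications JSON and split each entry on --DIVIDER-- markers."""
    with open(path, encoding="utf-8") as f:
        publications = json.load(f)

    documents = []
    for pub in publications:
        # Field names ("title", "publication_description") are assumed, not verified.
        body = pub.get("publication_description", "")
        for section in re.split(r"--DIVIDER--", body):
            section = section.strip()
            if section:
                documents.append(
                    Document(page_content=section, metadata={"title": pub.get("title", "")})
                )
    return documents
```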
Document Chunking: After loading, documents are segmented into smaller chunks using RecursiveCharacterTextSplitter with configurable chunk size (default: 1000 characters) and overlap (default: 200 characters). This chunking strategy preserves semantic coherence, particularly around code blocks and paragraph boundaries, while creating appropriately sized segments for embedding.
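A minimal sketch of the splitter configuration, assuming the default values from the configuration table below (the project's actual separator settings may differ):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk size and overlap follow the defaults described above.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
```

The loaded documents are then split into chunks: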
```python
chunks = text_splitter.split_documents(documents)
```
Vector Storage: ChromaDB serves as the persistent vector store, maintaining embeddings and their associated document metadata in the ./chroma_db directory. This component supports both the creation of new vector stores and the loading of existing ones between deployments.
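A hedged sketch of how the store could be created or reloaded with langchain-chroma and the Google embedding model; the actual vector_store.py logic may differ:

```python
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Create a new persistent store from the chunks...
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# ...or reload an existing one on a later run.
vector_db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
```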
RAG Chain Assembly: The system assembles the retrieval-augmented generation chain that connects vector similarity search with prompt formatting, conversational memory, and language model generation.
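The following is a minimal LCEL sketch of such a chain; the prompt wording is illustrative, and conversational memory (handled in rag_chain.py) is omitted here for brevity:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.3)
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is Retrieval-Augmented Generation?")
```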
Environment Variables: Sensitive information like API keys is loaded from environment variables (.env file locally, Render Environment Variables in production).
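A small sketch of the key-loading step, assuming python-dotenv is used locally:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env locally; on Render the variable is injected by the platform
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set")
```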
Project Scope and RAG Configuration
The scope of this project is intentionally limited to the Ready Tensor publications dataset to ensure focused and meaningful retrieval. By constraining the knowledge base to this defined set of documents, the system avoids irrelevant results and improves answer precision.
| Parameter | Value | Description |
|---|---|---|
| Chunk Size | 1000 characters | Text segment size |
| Chunk Overlap | 200 characters | Context preservation between chunks |
| Embedding Model | models/embedding-001 | Google Generative AI (768 dimensions) |
| Vector Store | ChromaDB | Local persistent storage |
| Top-k Retrieval | 5 documents | Number of chunks retrieved per query |
| LLM | Gemini 1.5 Flash | Temperature 0.3 for factual accuracy |
| Memory Window | 3 turns | Conversation buffer size |
These parameters represent a balance between retrieval accuracy, latency (~2.3s average), and resource usage.
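For reference, these defaults could be centralized in config.py roughly as follows; the constant names are illustrative assumptions, not the module's actual contents:

```python
# Illustrative configuration constants mirroring the table above.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "models/embedding-001"
PERSIST_DIRECTORY = "./chroma_db"
TOP_K = 5
LLM_MODEL = "gemini-1.5-flash"
LLM_TEMPERATURE = 0.3
MEMORY_WINDOW = 3
```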
## Initialization Phase
The system checks for an existing vector store
If none exists, it loads documents from the configured directory
Documents are split into semantic chunks
Chunks are embedded and stored in ChromaDB
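A compact sketch of this initialization flow, reusing the helpers from the earlier sketches; the real init_vector_db() in app.py may be structured differently:

```python
import os

def init_vector_db():
    """Check for an existing store; otherwise load, chunk, embed, and persist."""
    if os.path.isdir("./chroma_db") and os.listdir("./chroma_db"):
        # Reuse the persisted store from a previous run.
        return Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

    documents = load_publications()
    chunks = text_splitter.split_documents(documents)
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
    )
```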
## Query Phase
User submits a question through the CLI interface
The system loads the existing vector database
The question is embedded and used to perform similarity search
Relevant document chunks are retrieved
Retrieved context and question are formatted into a prompt
The LLM generates a response based on the provided context
The answer is returned to the user
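The query phase could look roughly like the sketch below, reusing the prompt and llm objects from the chain sketch above; get_vector_db() refers to the accessor named in the Integration Points list that follows:

```python
def query_agent(question: str) -> str:
    """Embed the question, retrieve top-k chunks, and generate a grounded answer."""
    vector_db = get_vector_db()  # assumed accessor from app.py
    retriever = vector_db.as_retriever(search_kwargs={"k": 5})

    docs = retriever.invoke(question)  # similarity search over the embedded chunks
    context = "\n\n".join(doc.page_content for doc in docs)

    messages = prompt.invoke({"context": context, "question": question})
    response = llm.invoke(messages)  # Gemini 1.5 Flash, temperature 0.3
    return response.content
```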
## Integration Points
The app.py file serves as the integration point, orchestrating the various components:
init_vector_db() handles the initialization workflow
get_vector_db() manages vector database access
query_agent() implements the query workflow
run_cli() provides the user interface
This architecture follows separation of concerns principles, with each module handling a specific aspect of the RAG workflow. The modular design allows for component replacement or enhancement without disrupting the overall system function, making the system adaptable to future improvements in embedding models, language models, or chunking strategies.
Ensuring responsible and safe usage of language models is a critical aspect of this project. The system is designed to reduce hallucinations by grounding responses strictly in retrieved document context rather than relying on the language model's internal knowledge.
Key Safety Measures:
Context Grounding: The model is instructed to answer only based on the retrieved context and to avoid speculative or out-of-scope responses
Source Attribution: Every response includes citations to specific publications and authors, enabling verification
Conservative Handling: Queries that cannot be answered using the available documents are handled with a clear "I don't have sufficient information" message
Domain Limitation: Limiting the system to the predefined Ready Tensor corpus helps control output relevance and minimizes unintended responses
API Security: API keys are stored in environment variables (.env) and never committed to version control
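An illustrative grounding prompt capturing the first three measures; the exact wording used in rag_chain.py may differ:

```python
from langchain_core.prompts import ChatPromptTemplate

# Illustrative grounding prompt; the project's actual instruction text is assumed.
grounded_prompt = ChatPromptTemplate.from_template(
    "You are an assistant for the Ready Tensor publications corpus.\n"
    "Answer ONLY from the context below and cite the publication titles you relied on.\n"
    "If the context does not contain the answer, reply: "
    "\"I don't have sufficient information to answer that.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
```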
## Key Achievements
The RAG-based AI Publication Agent successfully demonstrates:
Effective Document-Grounded Responses: By combining vector-based retrieval with language model generation, the system produces responses that remain faithful to the source material, reducing the hallucination problems common in pure LLM approaches.
Modular, Maintainable Architecture: The system's clear separation of concerns across document processing, vector representation, and response generation creates a flexible framework that can be extended and maintained with minimal coupling between components.
Practical Implementation Balance: The chosen technologies, particularly the models/embedding-001 embedding model and the Gemini 1.5 Flash language model, establish an effective balance between performance, resource efficiency, and response quality.
Intelligent Context Retrieval: The semantic search capabilities enabled by the vector database ensure that responses are informed by the most relevant document fragments, even when user queries don't exactly match document terminology.
Current Limitations
Ephemeral Vector Store: ChromaDB runs in-memory on Render (free tier), requiring rebuild on deployment restart
Rate Limiting: Subject to Google Gemini free tier limits (requests per minute)
Single Session Memory: No persistent user sessions across browser refreshes
Static Dataset: Limited to the provided Ready Tensor publications without real-time updates
Future Work
Persistent Memory: Implement Redis/PostgreSQL for cross-session memory storage
Multi-Agent Coordination: Extend to LangGraph for complex agent workflows (Module 2 scope)
Retrieval Evaluation: Implement quantitative metrics (precision@k, recall@k, MRR)
Advanced Reranking: Add cross-encoder reranking for improved retrieval accuracy
Multi-Modal Support: Extend to handle images and charts from publications
This project illustrates how Retrieval-Augmented Generation can be used to build a domain-specific question answering system that produces grounded, context-aware responses. By combining document retrieval with controlled generation using Google Gemini (free tier), the system addresses key limitations of standalone language models, such as hallucinations and lack of domain knowledge.
The work demonstrates foundational RAG concepts including chunking with overlap, embedding-based similarity search, conversational memory, and responsible prompt design. With live deployment on Render and source code available on GitHub, this project provides a complete, reproducible reference implementation for the Agentic AI Developer Certification.
Project: RAG-Based AI Assistant for Ready Tensor Publications
Certification: Agentic AI Developer Certification (Module 1)
Deployment: https://rag-assistant-pg48.onrender.com (Live)
Repository: https://github.com/Yassin351/rag-assistant
This project is a Retrieval-Augmented Generation (RAG) assistant developed as part of Module 1: Foundations of Agentic AI in the Agentic AI Developer Certification (AAIDC) by Ready Tensor.
The assistant answers user questions by retrieving relevant information from the Ready Tensor publications dataset stored in a vector database and generating grounded responses using Google Gemini (free tier).
Key Features
Memory Management: ConversationBufferWindowMemory with 3-turn context retention
Vector Search: ChromaDB with Google embedding-001 (cosine similarity)
Multiple Interfaces: Interactive CLI and Streamlit web UI
Observability: Comprehensive logging with latency tracking
Source Attribution: Automatic citation of publication titles and authors
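The 3-turn memory window could be wired up roughly as follows; the memory key and message format are assumptions:

```python
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last three exchanges, matching the 3-turn window described above.
memory = ConversationBufferWindowMemory(
    k=3,
    memory_key="chat_history",
    return_messages=True,
)
```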
# Technologies Used
Python 3.9+
LangChain (LCEL, LangChain Expression Language) - RAG pipeline orchestration
langchain-google-genai - Gemini integration
langchain-chroma - Vector store integration
ChromaDB 0.4.22+ - Vector database
Google Gemini 1.5 Flash (free tier) - LLM
Google Embedding models/embedding-001 - Vector embeddings
Streamlit 1.30.0+ - Web interface
Render - Cloud deployment platform
GitHub - Source control and CI/CD
Project Structure:

```
rag-assistant/
├── .env                 # API keys (gitignored)
├── .env.example         # Template for environment variables
├── requirements.txt     # Dependencies
├── config.py            # Configuration management
├── document_loader.py   # JSON processing & semantic chunking
├── vector_store.py      # ChromaDB management
├── rag_chain.py         # RAG logic with conversational memory
├── main.py              # CLI interface with colored output
├── app.py               # Streamlit web UI
├── test_queries.py      # Evaluation and benchmarking suite
├── setup.bat            # Windows setup script
├── logs/                # Application logs (gitignored)
└── chroma_db/           # Vector store persistence (gitignored)
```
## 1️⃣ Clone the Repository

```bash
git clone https://github.com/Yassin351/rag-assistant.git
cd rag-assistant
```

## 2️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```
## 3️⃣ Environment Variable Configuration
Create a .env file in the root directory:
```bash
cp .env.example .env
```
Edit .env and add your Google API key:
```
GOOGLE_API_KEY=your_gemini_api_key_here
```
Get your free API key from: https://aistudio.google.com/app/apikey
Alternative: GitHub Codespaces (Recommended for Cloud Development)
If using GitHub Codespaces:
Add a Codespaces Secret named GOOGLE_API_KEY
Value: your Gemini API key
Restart the Codespace after adding the secret
Step 1: Ingest Documents (Build Vector Store)
```bash
venv\Scripts\activate       # Windows
source venv/bin/activate    # Mac/Linux

python main.py --rebuild
```
# References
[1] LangChain. (2024). LangChain Documentation: Build context-aware reasoning applications. Retrieved from https://python.langchain.com/docs/
[2] Google DeepMind. (2024). Gemini 1.5 Flash: High-efficiency multimodal model for high-volume applications. Google AI Studio. Available at: https://ai.google.dev/
[3] Google. (2024). Embeddings API: models/embedding-001 Technical Documentation. Google Generative AI. Retrieved from https://ai.google.dev/tutorials/python_quickstart
[4] Chroma. (2024). ChromaDB: The AI-native open-source vector database. Retrieved from https://docs.trychroma.com/
[5] Ready Tensor. (2024). Agentic AI Developer Certification: Module 1 Dataset (project_1_publications.json). Ready Tensor Publications Corpus.
[6] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
[7] Streamlit. (2024). Streamlit Documentation: A faster way to build and share data apps. Retrieved from https://docs.streamlit.io/