RAG-Based AI Assistant with langChain

Retrieval-Augmented AI Assistant

Introduction

This publication presents a Retrieval-Augmented Generation (RAG) assistant tailored for answering questions over technical documentation and internal knowledge base materials. The system combines: document ingestion, semantic chunking, vector embedding storage, similarity retrieval, and controlled generation. It targets scenarios where accurate, source-grounded answers are preferred over purely generative responses.

Built in the context of the Ready Tensor Agentic AI Developer Certification, the project emphasizes clarity, reproducibility, safety (prompt injection defenses), and evaluation of retrieval quality.

Domain Scope

"Domain" simply means the general type of documents the assistant focuses on. Here we choose: Technical & Project Documentation (design notes, API descriptions, architecture summaries). You can switch to another domain (policies, medical guidelines (non-diagnostic), legal briefs) by replacing the documents in data/.

System Architecture Diagram

Component Walkthrough

Loader: Reads .docx files and extracts plain text.
Chunking: Splits text into overlapping semantic chunks to preserve coherence.
Embeddings: Converts each chunk into a dense vector with a Sentence Transformers model.
Vector Store (ChromaDB): Persists vectors; supports similarity search (approximate nearest neighbors).
Query Preprocessing: Normalizes and optionally expands user queries to improve recall.
Retrieval: Finds top-k relevant chunks.
Prompt Template: Injects retrieved context + user question with strict instructions against hallucination.
Generation: LLM produces an answer constrained to provided context.

Real-World Applications

Developer support (answering questions about APIs and internal services)
Onboarding assistants (summarize architecture decisions)
Knowledge base Q&A for internal tools or processes
Change impact analysis (query past design notes)
Lightweight documentation search instead of full-text keyword search

Competitive Differentiation

Approach	Pros	Cons
Vanilla LLM (no retrieval)	Fast, simple	Hallucinations, outdated knowledge
Keyword Search + Manual Reading	Precise matching	Time-consuming, lacks synthesis
This RAG Assistant	Grounded answers, multi-provider LLM support, extensible preprocessing	Requires embedding/index build

Retrieval Evaluation & Metrics

Evaluating retrieval ensures the right context reaches the LLM.

Recall@K: Fraction of relevant chunks found within top K.
Precision@K: Fraction of retrieved chunks in top K that are actually relevant.
MRR (Mean Reciprocal Rank): Rewards putting the first relevant chunk early.
nDCG (Normalized Discounted Cumulative Gain) (optional): Considers graded relevance & position.

Metric Intuition

High Recall means fewer missed answers; high Precision means less noise; MRR emphasizes ranking quality.

Evaluation Workflow

Prepare data/eval/queries.json with objects: { "query": "...", "relevant_ids": ["doc_0_chunk_2", ...] }.
Run python src/evaluation.py --queries data/eval/queries.json.
Review metric summary and per-query diagnostics.

Query Processing Techniques

Implemented (or stubbed) in src/preprocess.py:

Lowercasing & whitespace normalization
Punctuation stripping
Stopword removal (simple list)
Synonym expansion (configurable map)
Optional spelling correction (placeholder)

Benefits: Improved matching (recall) and cleaner embeddings.

Safety & Prompt Injection Mitigation

Prompt injection attempts to hijack instructions via malicious context. Countermeasures:

Instruction boundary: Template explicitly forbids using anything outside provided context.
Refusal pattern: Respond "I don't have enough information" if answer absent.
No tool execution or external calls based solely on retrieved text.
Sanitization: Strip suspicious control phrases (placeholder for future enhancements).

Future improvements: Content filtering (PII masking), rate limiting, audit logging of queries.

Implementation Notes

Each chunk stored with an ID pattern: doc_<docIndex>_chunk_<chunkIndex> for labeling.
Overlap (50 chars) reduces boundary information loss.
Deterministic embedding model choice via environment variable EMBEDDING_MODEL.

Using the System

Place .docx documentation in data/.
Run python src/app.py (after setting an API key).
Ask a question. System retrieves top chunks and generates grounded answer.

Extensibility

Swap embedding model (larger transformer) for accuracy vs speed.
Add re-ranking stage (e.g., cross-encoder) to refine top-K ordering.
Introduce response citation: append chunk IDs.
Add web/GUI front-end.

Future Work

Dataset expansion & relevance labeling automation
Integration of nDCG & re-ranking
More robust security (prompt injection pattern detection)
Multi-lingual document support

References

LangChain Documentation
ChromaDB
Sentence Transformers
OpenAI / Groq / Google Gemini APIs