As part of my learning journey in the Ready Tensor Flow program on Agentic AI, I built a Retrieval-Augmented Generation (RAG) powered chatbot that specializes in answering questions related to the LangChain documentation. Although this project is not itself an Agentic AI application, it forms a solid foundation for understanding retrieval workflows that underpin more advanced agentic systems.
What sets this project apart is its end-to-end implementation:
- A FastAPI backend that manages document retrieval and response generation.
- A Blazor WebAssembly (WASM) frontend that provides a clean UI for interacting with the chatbot.
The LangChain ecosystem is a powerful framework for building applications with LLMs, yet its documentation can be overwhelming to navigate manually. My aim was to build a domain-specific, context-aware chatbot that allows developers and enthusiasts to extract concise answers to specific questions from LangChain’s extensive documentation.
Building both a robust API and an intuitive UI ensured that the project mirrored real-world software engineering practices.
| Component | Technology |
| --- | --- |
| Backend Framework | FastAPI |
| Embedding Model | sentence-transformers/all-MiniLM-L12-v2 |
| Vector Store | ChromaDB |
| LLM | TinyLlama 1.1B Q4_K_M (via ctransformers) |
| Frontend Framework | Blazor WebAssembly (WASM) |
| Document Parser | BeautifulSoup & LXML |
| Deployment | Local (with Docker planned) |
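The stack above pairs a quantized TinyLlama model, loaded through ctransformers, with the LangChain tooling. The backend's actual wiring appears later in this article; as a rough illustration of how such a quantized model can be exposed to LangChain, here is a minimal custom LLM wrapper. The model path, generation parameters, and class name are assumptions for illustration, not the project's exact code.

```python
from ctransformers import AutoModelForCausalLM
from langchain.llms.base import LLM
from pydantic import PrivateAttr


class CTransformersTinyLlama(LLM):
    """Minimal LangChain wrapper around a ctransformers model (illustrative only)."""

    _model: object = PrivateAttr()

    def __init__(self, model_path: str, **kwargs):
        super().__init__(**kwargs)
        # Load the quantized GGUF weights; model_type="llama" covers TinyLlama.
        self._model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama")

    @property
    def _llm_type(self) -> str:
        return "ctransformers-tinyllama"

    def _call(self, prompt: str, stop=None, **kwargs) -> str:
        # Generation settings here are placeholders, not the project's tuned values.
        return self._model(prompt, max_new_tokens=256, temperature=0.7)


# Hypothetical model path under the models/ directory shown in the project layout below.
llm = CTransformersTinyLlama(model_path="models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
```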
```
LanggraphBotDocIngestor/
├── chroma_db/                  # Persistent vector database
├── loaders/                    # Document ingestion scripts
├── models/                     # Contains quantized Llama model files
├── .env                        # Environment variables
├── memory.py                   # Session memory implementation
├── ingest.py                   # Data ingestion into ChromaDB
├── qa_api.py                   # Main FastAPI backend
├── requirements.txt            # Dev dependencies
├── pinned-requirements.txt     # Production dependencies
└── README.md                   # Project documentation
```
```python
from langchain_community.document_loaders import WebBaseLoader
from dotenv import load_dotenv
import os

load_dotenv()

USER_AGENT = os.getenv("USER_AGENT", "LanggraphBotDocIngestor/1.0 (youremailaddress@domain.com)")
HEADERS = {"User-Agent": USER_AGENT}

# This file contains logic to load and clean docs from the web (LangChain/LangGraph).
# 20-50 web pages from the LangChain documentation were used for this project.
# The URLS array below is not an exhaustive list.
URLS = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/tutorials/",
    "https://python.langchain.com/docs/how_to/",
    "https://python.langchain.com/docs/concepts/"
]

def load_documents():
    loader = WebBaseLoader(web_paths=URLS, header_template=HEADERS)
    return loader.load()
```
```python
import os
import shutil
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from loaders.langchain_docs import load_documents
from tqdm import tqdm

# Load .env variables
load_dotenv()

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L12-v2"

def ingest_documents():
    chroma_dir = "./chroma_db"
    if os.path.exists(chroma_dir):
        print("Cleaning up existing Chroma vectorstore...")
        shutil.rmtree(chroma_dir)

    print("Loading documents...")
    raw_docs = load_documents()

    print("Splitting documents into chunks...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(raw_docs)

    print("Embedding and storing in Chroma (locally)...")
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=chroma_dir
    )
    vectorstore.persist()
    print(f"Ingested {len(chunks)} chunks into vectorstore.")

if __name__ == "__main__":
    ingest_documents()
```
The scripts above scrape the documentation pages, split them into chunks, and store the embedded chunks in ChromaDB for semantic retrieval.
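As a quick sanity check after ingestion, the persisted store can be reopened and queried directly. The snippet below is a small sketch, assuming the `./chroma_db` directory produced above and the same embedding model; it is not part of the project code, and the test question is made up.

```python
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Retrieve the three chunks most similar to a test question.
for doc in store.similarity_search("How do I add memory to a LangChain chain?", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:120])
```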
```python
import os
import uuid
import traceback
from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from ctransformers import AutoModelForCausalLM
from langchain.llms.base import LLM
from memory import SessionMemory
from pydantic import BaseModel, PrivateAttr
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# python code

@app.post("/ask")
async def ask_question(
    request: QueryRequest,
    session_id: str = Query(default=None)
):
    try:
        if not session_id:
            session_id = str(uuid.uuid4())

        history_pairs = memory.get_session(session_id)
        history_text = "\n".join([f"User: {q}\nBot: {a}" for q, a in history_pairs])

        docs = retriever.get_relevant_documents(request.question)
        context = "\n\n".join([doc.page_content for doc in docs])

        if not context.strip() or len(context.strip()) < MIN_CONTEXT_LENGTH:
            fallback_answer = (
                "I specialize in answering questions about LangGraph and LangChain documentation. \n"
                "That topic appears unrelated, so I can't provide a reliable answer."
            )
            return {"response": fallback_answer, "session_id": session_id}

        result = llm_chain.invoke({
            "history": history_text,
            "context": context,
            "question": request.question
        })
        raw_answer = result.get("text", "").strip() if isinstance(result, dict) else str(result).strip()

        # More python code
```
This endpoint embeds the user's question, retrieves similar document chunks from ChromaDB, and passes them, together with the session history, to the language model for response generation.
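The `memory`, `retriever`, and `llm_chain` objects referenced in the handler are created during application startup, which is omitted from the excerpt above. Below is a minimal sketch of how that wiring could look, assuming the persisted `./chroma_db` store, the prompt template shown further down, and the custom TinyLlama wrapper sketched earlier; the variable names, `k` value, and `MIN_CONTEXT_LENGTH` threshold are illustrative, not the project's exact code.

```python
# Illustrative startup wiring (assumes the imports from qa_api.py above,
# plus `llm` from the TinyLlama wrapper sketch and `prompt_template` from below).
app = FastAPI()
memory = SessionMemory()
MIN_CONTEXT_LENGTH = 200  # assumed threshold for the fallback check

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

llm_chain = LLMChain(llm=llm, prompt=prompt_template)

class QueryRequest(BaseModel):
    question: str
```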
```python
from typing import Dict, List, Tuple
from threading import Lock

class SessionMemory:
    def __init__(self):
        self.sessions: Dict[str, List[Tuple[str, str]]] = {}
        self.lock = Lock()

    def add_message(self, session_id: str, question: str, answer: str):
        with self.lock:
            if session_id not in self.sessions:
                self.sessions[session_id] = []
            self.sessions[session_id].append((question, answer))

    def get_session(self, session_id: str) -> List[Tuple[str, str]]:
        with self.lock:
            return list(self.sessions.get(session_id, []))

    def clear_session(self, session_id: str):
        with self.lock:
            self.sessions.pop(session_id, None)
```
This implementation defines methods to get and set a user's session, which is used for chat history. The history only applies to the user's active browser session and is discarded when the browser window is closed. This was done as a lightweight alternative to building a full user-authentication setup.
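A short, hypothetical example of how the endpoint uses this class across turns (the session id and answer text here are just placeholders):

```python
from memory import SessionMemory

memory = SessionMemory()
session_id = "b7c3901c-e137-4cc8-9453-52cf0795c7f2"  # normally a uuid4 generated per browser session

# After each answered question, the (question, answer) pair is appended to that session's history.
memory.add_message(session_id, "What is LangGraph?",
                   "LangGraph extends LangChain to enable building applications as stateful graphs.")

# On the next request, the history is read back and folded into the prompt.
for question, answer in memory.get_session(session_id):
    print(f"User: {question}\nBot: {answer}")

# Clearing the session drops its history entirely.
memory.clear_session(session_id)
```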
```python
# Prompt Template
prompt_template = PromptTemplate(
    input_variables=["history", "context", "question"],
    template=(
        "You are a helpful assistant specializing in LangGraph and LangChain documentation.\n"
        "Example Q&A:\n"
        "Q: What is LangChain?\n"
        "A: LangChain is an open-source framework for developing applications powered by language models.\n"
        "Q: What is LangGraph?\n"
        "A: LangGraph extends LangChain to enable building applications as stateful graphs.\n\n"
        "Now, using the following context:\n{context}\n\n"
        "Conversation history:\n{history}\n\n"
        "Q: {question}\nA:"
    )
)
```
The above shows the prompt template that is sent to the LLM, combining few-shot examples, the retrieved context, the conversation history, and the user's question.
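For illustration, this is how the template gets filled before being handed to the model; the context and history strings below are made up:

```python
filled_prompt = prompt_template.format(
    history="User: What is LangChain?\nBot: LangChain is an open-source framework for building LLM applications.",
    context="LangGraph is a library for building stateful, multi-actor applications with LLMs...",
    question="What is LangGraph?"
)
print(filled_prompt)  # the exact string passed to the chain and, in turn, to TinyLlama
```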
@code {
private string question = "";
private bool isLoading = false;
private string? sessionId;
// Chat History
private List<(string User, string Bot)> chatHistory = new();
// More C# code
private async Task SubmitQuestion()
{
if (string.IsNullOrWhiteSpace(question) || string.IsNullOrWhiteSpace(sessionId)) return;
var thisQuestion = question;
question = string.Empty;
isLoading = true;
try
{
var request = new QuestionRequest { Question = thisQuestion };
var client = HttpClientFactory.CreateClient("LangGraphDocsBotAPI");
// Send session_id as query string
var url = $"/ask?session_id={sessionId}";
var result = await client.PostAsJsonAsync(url, request);
if (result.IsSuccessStatusCode)
{
var response = await result.Content.ReadFromJsonAsync<QaResponse>();
var answer = response?.Response ?? "No response received.";
//add to history
chatHistory.Add((thisQuestion, answer));
}
else
{
var errorText = await result.Content.ReadAsStringAsync();
chatHistory.Add((thisQuestion, $"Error: {result.StatusCode}\n{errorText}"));
}
}
catch (Exception ex)
{
chatHistory.Add((thisQuestion, $"Exception: {ex.Message}"));
Console.WriteLine("Exception: " + ex);
}
finally
{
isLoading = false;
}
}
}
@page "/ask"
@using LangGraphDocsBot.Models
@inject IHttpClientFactory HttpClientFactory
@inject IJSRuntime JS
<div class="container mt-4" style="max-width: 800px;">
<h3 class="mb-4">Ask LangGraphDocsBot</h3>
@if (chatHistory.Any() || isLoading)
{
<div class="chat-box border p-3 rounded bg-light mb-3">
@foreach (var exchange in chatHistory)
{
<div class="d-flex justify-content-end mb-2">
<div class="p-2 bg-primary text-white rounded" style="max-width: 75%;">
@exchange.User
</div>
</div>
<div class="d-flex justify-content-start mb-2">
<div class="p-2 bg-white border rounded shadow-sm" style="max-width: 75%;">
@((MarkupString)Markdig.Markdown.ToHtml(exchange.Bot))
</div>
</div>
}
@if (isLoading)
{
<div class="d-flex align-items-center mb-2">
<span class="spinner-border spinner-border-sm me-2 text-primary" role="status"></span>
<em>LangGraphDocsBot is typing...</em>
</div>
}
</div>
}
<div class="input-group">
<input class="form-control" @bind="question" @bind:event="oninput" placeholder="Ask a question..." />
<button class="btn btn-primary" @onclick="SubmitQuestion" disabled="@string.IsNullOrWhiteSpace(question)">Ask</button>
</div>
</div>
This block provides a conversational interface to make querying documentation seamless and user-friendly.
Request:

```http
POST /ask
Content-Type: application/json

{ "question": "What is LangGraph?" }
```

Response:

```json
{
  "response": "LangGraph extends LangChain to enable building applications as stateful graphs.",
  "session_id": "b7c3901c-e137-4cc8-9453-52cf0795c7f2"
}
```
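For quick testing outside the Blazor UI, the endpoint can also be exercised with a short script. This sketch assumes the API is running locally on port 8000 (uvicorn's default); adjust the base URL for your setup.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local address of the FastAPI app

# First question: no session_id yet, so the API generates one.
first = requests.post(f"{BASE_URL}/ask", json={"question": "What is LangGraph?"}).json()
print(first["response"])

# Follow-up question: reuse the returned session_id so the history is included in the prompt.
session_id = first["session_id"]
follow_up = requests.post(
    f"{BASE_URL}/ask",
    params={"session_id": session_id},
    json={"question": "How does it relate to LangChain?"},
).json()
print(follow_up["response"])
```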
To assess the performance of the retrieval component in this RAG pipeline, I adopted the following manual evaluation strategy:
- Used `sklearn.metrics.pairwise.cosine_similarity` to measure semantic overlap between the user's question and the retrieved context.

While future improvements may include automated evaluation pipelines and ground-truth benchmarks, these initial steps helped me gauge how well the vector store and embedding model perform under realistic use.
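A rough sketch of that manual check, assuming the same embedding model and persisted store as the backend; the question, `k` value, and any threshold you might apply to the scores are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
retriever = Chroma(persist_directory="./chroma_db",
                   embedding_function=embeddings).as_retriever(search_kwargs={"k": 4})

question = "How do I persist a Chroma vector store?"
docs = retriever.get_relevant_documents(question)

# Embed the question and each retrieved chunk, then compare them pairwise.
q_vec = np.array(embeddings.embed_query(question)).reshape(1, -1)
d_vecs = np.array(embeddings.embed_documents([d.page_content for d in docs]))
scores = cosine_similarity(q_vec, d_vecs)[0]

for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc.page_content[:80]}")
# Uniformly low scores suggest the question falls outside the ingested documentation.
```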
While the project achieves its primary goal, it comes with some notable limitations:
- LLM Performance: TinyLlama is computationally efficient but occasionally generates vague or generic answers.
- Contextual Memory: Session-based memory exists but lacks sophisticated dialogue management.
- Frontend Simplicity: The UI is intentionally minimal for demonstration purposes and could be improved with persistent chat history; for now, history is limited to the user's browser session.
- Limited Retrieval Scope: Only the LangChain documentation was ingested; adding LangGraph source examples would make answers richer.
Successfully combined RAG pipelines, semantic search, and quantized LLMs into an integrated solution.
Developed an end-to-end prototype with both backend and frontend.
This project is provided under the MIT License. You are free to use, modify, and distribute the codebase for both commercial and non-commercial purposes, provided that the original license terms are included. Refer to the LICENSE file in the GitHub repository for more information.
- LangChain Documentation
- LangGraph Documentation
- Blazor
- Chroma DB
- ctransformers
- TinyLlama
Thank you for reading!