As part of my learning journey in the Ready Tensor Flow program on Agentic AI, I built a Retrieval-Augmented Generation (RAG) powered chatbot that specializes in answering questions related to the LangChain documentation. Although this project is not itself an Agentic AI application, it forms a solid foundation for understanding retrieval workflows that underpin more advanced agentic systems.
What sets this project apart is its end-to-end implementation:
- A FastAPI backend that manages document retrieval and response generation.
- A Blazor WebAssembly (WASM) frontend that provides a clean UI for interacting with the chatbot.
The LangChain ecosystem is a powerful framework for building applications with LLMs, yet its documentation can be overwhelming to navigate manually. My aim was to build a domain-specific, context-aware chatbot that allows developers and enthusiasts to extract concise answers to specific questions from LangChain’s extensive documentation.
Building both a robust API and an intuitive UI ensured that the project mirrored real-world software engineering practices.
| Component | Technology |
| --- | --- |
| Backend Framework | FastAPI |
| Embedding Model | sentence-transformers/all-MiniLM-L12-v2 |
| Vector Store | ChromaDB |
| LLM | TinyLlama 1.1B Q4_K_M (via ctransformers) |
| Frontend Framework | Blazor WebAssembly (WASM) |
| Document Parser | BeautifulSoup & LXML |
| Deployment | Local (with Docker planned) |
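The stack above pairs a quantized TinyLlama model, loaded through ctransformers, with the LangChain tooling. The backend's actual wiring appears later in this article; as a rough illustration of how such a quantized model can be exposed to LangChain, here is a minimal custom LLM wrapper. The model path, generation parameters, and class name are assumptions for illustration, not the project's exact code.

```python
from ctransformers import AutoModelForCausalLM
from langchain.llms.base import LLM
from pydantic import PrivateAttr


class CTransformersTinyLlama(LLM):
    """Minimal LangChain wrapper around a ctransformers model (illustrative only)."""

    _model: object = PrivateAttr()

    def __init__(self, model_path: str, **kwargs):
        super().__init__(**kwargs)
        # Load the quantized GGUF weights; model_type="llama" covers TinyLlama.
        self._model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama")

    @property
    def _llm_type(self) -> str:
        return "ctransformers-tinyllama"

    def _call(self, prompt: str, stop=None, **kwargs) -> str:
        # Generation settings here are placeholders, not the project's tuned values.
        return self._model(prompt, max_new_tokens=256, temperature=0.7)


# Hypothetical model path under the models/ directory shown in the project layout below.
llm = CTransformersTinyLlama(model_path="models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
```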
```
LanggraphBotDocIngestor/
├── chroma_db/                  # Persistent vector database
├── loaders/                    # Document ingestion scripts
├── models/                     # Contains quantized Llama model files
├── .env                        # Environment variables
├── memory.py                   # Session memory implementation
├── ingest.py                   # Data ingestion into ChromaDB
├── qa_api.py                   # Main FastAPI backend
├── requirements.txt            # Dev dependencies
├── pinned-requirements.txt     # Production dependencies
└── README.md                   # Project documentation
```
```python
from langchain_community.document_loaders import WebBaseLoader
from dotenv import load_dotenv
import os

load_dotenv()

USER_AGENT = os.getenv("USER_AGENT", "LanggraphBotDocIngestor/1.0 (youremailaddress@domain.com)")
HEADERS = {"User-Agent": USER_AGENT}

# This file contains logic to load and clean docs from the web (LangChain/LangGraph).
# 20-50 web pages from the LangChain documentation were used for this project.
# The URLS array below is not an exhaustive list.
URLS = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/tutorials/",
    "https://python.langchain.com/docs/how_to/",
    "https://python.langchain.com/docs/concepts/"
]

def load_documents():
    loader = WebBaseLoader(web_paths=URLS, header_template=HEADERS)
    return loader.load()
```
```python
import os
import shutil
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from loaders.langchain_docs import load_documents
from tqdm import tqdm

# Load .env variables
load_dotenv()

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L12-v2"

def ingest_documents():
    chroma_dir = "./chroma_db"
    if os.path.exists(chroma_dir):
        print("Cleaning up existing Chroma vectorstore...")
        shutil.rmtree(chroma_dir)

    print("Loading documents...")
    raw_docs = load_documents()

    print("Splitting documents into chunks...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(raw_docs)

    print("Embedding and storing in Chroma (locally)...")
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=chroma_dir
    )
    vectorstore.persist()
    print(f"Ingested {len(chunks)} chunks into vectorstore.")

if __name__ == "__main__":
    ingest_documents()
```
The scripts above scrape the documentation pages, split them into chunks, and store the embedded chunks in ChromaDB for semantic retrieval.
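As a quick sanity check after ingestion, the persisted store can be reopened and queried directly. The snippet below is a small sketch, assuming the `./chroma_db` directory produced above and the same embedding model; it is not part of the project code, and the test question is made up.

```python
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Retrieve the three chunks most similar to a test question.
for doc in store.similarity_search("How do I add memory to a LangChain chain?", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:120])
```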
```python
import os
import uuid
import traceback
from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from ctransformers import AutoModelForCausalLM
from langchain.llms.base import LLM
from memory import SessionMemory
from pydantic import BaseModel, PrivateAttr
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# python code

@app.post("/ask")
async def ask_question(
    request: QueryRequest,
    session_id: str = Query(default=None)
):
    try:
        if not session_id:
            session_id = str(uuid.uuid4())

        history_pairs = memory.get_session(session_id)
        history_text = "\n".join([f"User: {q}\nBot: {a}" for q, a in history_pairs])

        docs = retriever.get_relevant_documents(request.question)
        context = "\n\n".join([doc.page_content for doc in docs])

        if not context.strip() or len(context.strip()) < MIN_CONTEXT_LENGTH:
            fallback_answer = (
                "I specialize in answering questions about LangGraph and LangChain documentation. \n"
                "That topic appears unrelated, so I can't provide a reliable answer."
            )
            return {"response": fallback_answer, "session_id": session_id}

        result = llm_chain.invoke({
            "history": history_text,
            "context": context,
            "question": request.question
        })
        raw_answer = result.get("text", "").strip() if isinstance(result, dict) else str(result).strip()

        # More python code
```
This endpoint embeds the user's question, retrieves similar document chunks from ChromaDB, and passes them, together with the session history, to the language model for response generation.
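The `memory`, `retriever`, and `llm_chain` objects referenced in the handler are created during application startup, which is omitted from the excerpt above. Below is a minimal sketch of how that wiring could look, assuming the persisted `./chroma_db` store, the prompt template shown further down, and the custom TinyLlama wrapper sketched earlier; the variable names, `k` value, and `MIN_CONTEXT_LENGTH` threshold are illustrative, not the project's exact code.

```python
# Illustrative startup wiring (assumes the imports from qa_api.py above,
# plus `llm` from the TinyLlama wrapper sketch and `prompt_template` from below).
app = FastAPI()
memory = SessionMemory()
MIN_CONTEXT_LENGTH = 200  # assumed threshold for the fallback check

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

llm_chain = LLMChain(llm=llm, prompt=prompt_template)

class QueryRequest(BaseModel):
    question: str
```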
```python
from typing import Dict, List, Tuple
from threading import Lock

class SessionMemory:
    def __init__(self):
        self.sessions: Dict[str, List[Tuple[str, str]]] = {}
        self.lock = Lock()

    def add_message(self, session_id: str, question: str, answer: str):
        with self.lock:
            if session_id not in self.sessions:
                self.sessions[session_id] = []
            self.sessions[session_id].append((question, answer))

    def get_session(self, session_id: str) -> List[Tuple[str, str]]:
        with self.lock:
            return list(self.sessions.get(session_id, []))

    def clear_session(self, session_id: str):
        with self.lock:
            self.sessions.pop(session_id, None)
```
This implementation defines methods to get and set a user's session, which is used for chat history. The history only applies to the user's active browser session and is discarded when the browser window is closed. This was done as a lightweight alternative to building a full user-authentication setup.
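A short, hypothetical example of how the endpoint uses this class across turns (the session id and answer text here are just placeholders):

```python
from memory import SessionMemory

memory = SessionMemory()
session_id = "b7c3901c-e137-4cc8-9453-52cf0795c7f2"  # normally a uuid4 generated per browser session

# After each answered question, the (question, answer) pair is appended to that session's history.
memory.add_message(session_id, "What is LangGraph?",
                   "LangGraph extends LangChain to enable building applications as stateful graphs.")

# On the next request, the history is read back and folded into the prompt.
for question, answer in memory.get_session(session_id):
    print(f"User: {question}\nBot: {answer}")

# Clearing the session drops its history entirely.
memory.clear_session(session_id)
```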
```python
# Prompt Template
prompt_template = PromptTemplate(
    input_variables=["history", "context", "question"],
    template=(
        "You are a helpful assistant specializing in LangGraph and LangChain documentation.\n"
        "Example Q&A:\n"
        "Q: What is LangChain?\n"
        "A: LangChain is an open-source framework for developing applications powered by language models.\n"
        "Q: What is LangGraph?\n"
        "A: LangGraph extends LangChain to enable building applications as stateful graphs.\n\n"
        "Now, using the following context:\n{context}\n\n"
        "Conversation history:\n{history}\n\n"
        "Q: {question}\nA:"
    )
)
```
The above shows the prompt template that is sent to the LLM, combining few-shot examples, the retrieved context, the conversation history, and the user's question.
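For illustration, this is how the template gets filled before being handed to the model; the context and history strings below are made up:

```python
filled_prompt = prompt_template.format(
    history="User: What is LangChain?\nBot: LangChain is an open-source framework for building LLM applications.",
    context="LangGraph is a library for building stateful, multi-actor applications with LLMs...",
    question="What is LangGraph?"
)
print(filled_prompt)  # the exact string passed to the chain and, in turn, to TinyLlama
```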
@code {
private string question = "";
private bool isLoading = false;
private string? sessionId;
// Chat History
private List<(string User, string Bot)> chatHistory = new();
// More C# code
private async Task SubmitQuestion()
{
if (string.IsNullOrWhiteSpace(question) || string.IsNullOrWhiteSpace(sessionId)) return;
var thisQuestion = question;
question = string.Empty;
isLoading = true;
try
{
var request = new QuestionRequest { Question = thisQuestion };
var client = HttpClientFactory.CreateClient("LangGraphDocsBotAPI");
// Send session_id as query string
var url = $"/ask?session_id={sessionId}";
var result = await client.PostAsJsonAsync(url, request);
if (result.IsSuccessStatusCode)
{
var response = await result.Content.ReadFromJsonAsync<QaResponse>();
var answer = response?.Response ?? "No response received.";
//add to history
chatHistory.Add((thisQuestion, answer));
}
else
{
var errorText = await result.Content.ReadAsStringAsync();
chatHistory.Add((thisQuestion, $"Error: {result.StatusCode}\n{errorText}"));
}
}
catch (Exception ex)
{
chatHistory.Add((thisQuestion, $"Exception: {ex.Message}"));
Console.WriteLine("Exception: " + ex);
}
finally
{
isLoading = false;
}
}
}
@page "/ask"
@using LangGraphDocsBot.Models
@inject IHttpClientFactory HttpClientFactory
@inject IJSRuntime JS
<div class="container mt-4" style="max-width: 800px;">
<h3 class="mb-4">Ask LangGraphDocsBot</h3>
@if (chatHistory.Any() || isLoading)
{
<div class="chat-box border p-3 rounded bg-light mb-3">
@foreach (var exchange in chatHistory)
{
<div class="d-flex justify-content-end mb-2">
<div class="p-2 bg-primary text-white rounded" style="max-width: 75%;">
@exchange.User
</div>
</div>
<div class="d-flex justify-content-start mb-2">
<div class="p-2 bg-white border rounded shadow-sm" style="max-width: 75%;">
@((MarkupString)Markdig.Markdown.ToHtml(exchange.Bot))
</div>
</div>
}
@if (isLoading)
{
<div class="d-flex align-items-center mb-2">
<span class="spinner-border spinner-border-sm me-2 text-primary" role="status"></span>
<em>LangGraphDocsBot is typing...</em>
</div>
}
</div>
}
<div class="input-group">
<input class="form-control" @bind="question" @bind:event="oninput" placeholder="Ask a question..." />
<button class="btn btn-primary" @onclick="SubmitQuestion" disabled="@string.IsNullOrWhiteSpace(question)">Ask</button>
</div>
</div>
This block provides a conversational interface to make querying documentation seamless and user-friendly.
Request:

```http
POST /ask
Content-Type: application/json

{ "question": "What is LangGraph?" }
```

Response:

```json
{
  "response": "LangGraph extends LangChain to enable building applications as stateful graphs.",
  "session_id": "b7c3901c-e137-4cc8-9453-52cf0795c7f2"
}
```
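For quick testing outside the Blazor UI, the endpoint can also be exercised with a short script. This sketch assumes the API is running locally on port 8000 (uvicorn's default); adjust the base URL for your setup.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local address of the FastAPI app

# First question: no session_id yet, so the API generates one.
first = requests.post(f"{BASE_URL}/ask", json={"question": "What is LangGraph?"}).json()
print(first["response"])

# Follow-up question: reuse the returned session_id so the history is included in the prompt.
session_id = first["session_id"]
follow_up = requests.post(
    f"{BASE_URL}/ask",
    params={"session_id": session_id},
    json={"question": "How does it relate to LangChain?"},
).json()
print(follow_up["response"])
```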
To assess the performance of the retrieval component in this RAG pipeline, I adopted the following manual evaluation strategy:
- Used `sklearn.metrics.pairwise.cosine_similarity` to measure semantic overlap between the user's question and the retrieved context.

While future improvements may include automated evaluation pipelines and ground-truth benchmarks, these initial steps helped me gauge how well the vector store and embedding model perform under realistic use.
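A rough sketch of that manual check, assuming the same embedding model and persisted store as the backend; the question, `k` value, and any threshold you might apply to the scores are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
retriever = Chroma(persist_directory="./chroma_db",
                   embedding_function=embeddings).as_retriever(search_kwargs={"k": 4})

question = "How do I persist a Chroma vector store?"
docs = retriever.get_relevant_documents(question)

# Embed the question and each retrieved chunk, then compare them pairwise.
q_vec = np.array(embeddings.embed_query(question)).reshape(1, -1)
d_vecs = np.array(embeddings.embed_documents([d.page_content for d in docs]))
scores = cosine_similarity(q_vec, d_vecs)[0]

for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc.page_content[:80]}")
# Uniformly low scores suggest the question falls outside the ingested documentation.
```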
While the project achieves its primary goal, it comes with some notable limitations:
- LLM Performance: TinyLlama is computationally efficient but occasionally generates vague or generic answers.
- Contextual Memory: Session-based memory exists but lacks sophisticated dialogue management.
- Frontend Simplicity: The UI is intentionally minimal for demonstration purposes and could be improved with persistent chat history; for now, history is limited to the user's browser session.
- Limited Retrieval Scope: Only the LangChain documentation was ingested; adding LangGraph source examples would make answers richer.
Successfully combined RAG pipelines, semantic search, and quantized LLMs into an integrated solution.
Developed an end-to-end prototype with both backend and frontend.
This project is provided under the MIT License. You are free to use, modify, and distribute the codebase for both commercial and non-commercial purposes, provided that the original license terms are included. Refer to the LICENSE file in the GitHub repository for more information.
- LangChain Documentation
- LangGraph Documentation
- Blazor
- Chroma DB
- ctransformers
- TinyLlama
Thank you for reading!