This project implements an advanced Retrieval Augmented Generation (RAG) workflow chatbot to improve question-answering accuracy and reduce LLM hallucinations. It leverages LangGraph to create a stateful, multi-step process that includes document retrieval, relevance grading, and a web-search fallback. The aim is a documentation bot that answers user questions from provided documentation, combining Large Language Models (LLMs), a Chroma vector store, and a self-reflection workflow to efficiently store, retrieve, and validate information.
Abstract: The primary goal of this project is to streamline the process of accessing and understanding documentation. By combining LLMs, a vector store, and a self-reflection workflow, the bot can provide accurate and relevant answers to user queries, even when they are complex or phrased differently from the original documentation. The bot also aims to minimize hallucinations and ensure that the generated answers are grounded in the provided documentation.
Input: Natural language questions from users.
Documents ingested into the vectorstore:

```python
urls = [
    "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
    "https://lilianweng.github.io/posts/2024-07-07-hallucination/",
    "https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",
    "https://lilianweng.github.io/posts/2024-02-05-human-data-quality/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
```
Question 1:
what is agent memory?
Output: Coherent and contextually relevant responses generated by the LLM, augmented with retrieved documents and/or web search results.
Console output:

```
Hello Advanced RAG!
```

Workflow graph (Mermaid):

```mermaid
%%{init: {'flowchart': {'curve': 'linear'}}}%%
graph TD;
    __start__([<p>__start__</p>]):::first
    retrieve(retrieve)
    grade_documents(grade_documents)
    generate(generate)
    websearch(websearch)
    __end__([<p>__end__</p>]):::last
    retrieve --> grade_documents;
    websearch --> generate;
    __start__ -.-> websearch;
    __start__ -.-> retrieve;
    grade_documents -.-> websearch;
    grade_documents -.-> generate;
    generate -. useful .-> __end__;
    generate -. not useful .-> websearch;
    generate -. not supported .-> generate;
    classDef default fill:#f2f0ff,line-height:1.2
    classDef first fill-opacity:0
    classDef last fill:#bfb6fc
```

Execution trace:

```
---route question---
---route question to RAG---
---Retrieving---
---check document relevance to question---
---grade: document not relevant---
---grade: document not relevant---
---grade: document not relevant---
---grade: document not relevant---
---assess granted documents---
---decision: not all docs are relevant to question ---
---web search---
State keys: dict_keys(['question', 'web_search', 'documents'])
---generate---
---check hallucination---
---decision: generation is grounded in documents---
---grade generation vs question---
---decision: generation addresses question---
Process finished with exit code 0
```

The final state (question, `web_search` flag, generation, and retrieved documents) appears below under Output.
LangSmith trace: https://smith.langchain.com/public/436a9f0e-f390-468a-bb0b-eaa9df0e08ce/r
Output:
Web Search: true
Generation: Agent memory is a crucial component in artificial intelligence that allows AI models to retain information across interactions rather than treating every query as a new conversation. It helps the agent to better plan and is used to store and recall information. Agents are stateful and they use this memory to store previous interactions and use them in subsequent calls.
Documents: In general, the memory for an agent is something that we provide via context in the prompt passed to LLM that helps the agent to better plan Understanding Agent Memory in AI: Types, Use Cases, and Implementation | by Siladitya Ghosh | Feb, 2025 | Medium Understanding Agent Memory in AI: Types, Use Cases, and Implementation The secret lies in agent memory — a crucial component that allows AI to store and recall information across interactions. What agent memory is and why it’s essential The different types of memory used in AI agents Real-world use cases of memory in AI systems By the end, you’ll have a solid understanding of how AI agents use memory and how to implement it in your own projects. What is Agent Memory? Agent memory allows AI models to retain information across interactions rather than treating every query as a new conversation. Types of Agent Memory Towards AGI: [Part 1] Agents with Memory - SuperAGI Towards AGI: [Part 1] Agents with Memory Agents are an emerging class of artificial intelligence (AI) systems that use large language models (LLMs) to interact with the world. However, agents store the previous interactions in variables and use it in subsequent LLM calls. Therefore, agents are stateful and they have memory. Deep dive into various types of Agent Memory LTM type 2: Semantic memory (aka Reflections): It stores an agent’s knowledge about the world and itself. LTM type 3: Procedural memory: This memory represents the agent’s procedures for thinking, acting, decision-making, etc. AI Agent Systems for fully automated sales, marketing, support and app development Enroll now: https://bit.ly/3YwWJeR Build agentic memory into your applications with LLMs as Operating Systems: Agent Memory, a short course Memory for agents Memory for agents At Sequoia’s AI Ascent conference in March, I talked about three limitations for agents: planning, UX, and memory. But what even is memory? While the exact shape of memory that your agent has may differ by application, we do see different high level types of memory. Below is my rough, ELI5 explanation of each type and practical ways for how todays agents may use and update this memory type. Besides just thinking about the type of memory to update in their agents, we also see developers thinking about how to update agent memory. One way to update agent memory is “in the hot path”. Why do we care about memory for agents?
Question 2:
what is GRPO and MoE in DeepSeek architecture?
Output:
Console output:

```
Hello Advanced RAG!
```

The same workflow graph as above is printed, followed by the execution trace:

```
---route question---
---route question to websearch---
---web search---
State keys: dict_keys(['question'])
---generate---
---check hallucination---
---decision: generation is grounded in documents---
---grade generation vs question---
---decision: generation addresses question---
Process finished with exit code 0
```

Here the router sends the question directly to web search. The final state (question, generation, and retrieved documents) appears below under Output.
LangSmith trace: https://smith.langchain.com/public/46a28e4e-c7ed-494b-8cfd-14ea7b6d3fe6/r
Output:
Group Relative Policy Optimization (GRPO) is a reinforcement learning technique developed by DeepSeek to enhance the reasoning capabilities of large language models. It regularizes by directly adding the KL divergence between the trained policy and the reference policy to the objective function. MoE (Mixture of Experts) is not responsible for emergent reasoning in DeepSeek-R1 architecture, as evidenced by the DeepSeek-R1-Distill-Qwen-32B model, which maintains all of DeepSeek-R1’s reasoning capabilities despite being fully dense and not having MoE.
“We intentionally limit our constraints to this structural format, avoiding any content-specific biases — such as mandating reflective reasoning or promoting particular problem-solving strategies — to ensure that we can accurately observe the model’s natural progression during the RL process.” — DeepSeek-R1-Zero/R1 technical report Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) technique developed by DeepSeek to enhance the reasoning capabilities of large language models (LLMs). “Instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the objective function, avoiding complicating the calculation of 𝐴_𝑖.” — DeepSeekMath “The group relative way that GRPO leverages to calculate the advantages, aligns well with the comparative nature of rewards models, as reward models are typically trained on datasets of comparisons between outputs on the same question” — DeepSeekMath The GRPO algorithm is at the heart of the newest DeepSeek R1 architecture. In this tutorial, we will discuss the details of the formula DeepSeek-R1-Distill-Qwen-32B is not an MoE model and yet it retains all reasoning properties of DeepSeek-R1. If MoE were essential to DeepSeek-R1’s emergent reasoning, then distilling the model into a fully dense architecture should have eliminated its ability to perform deep, structured reasoning. Since DeepSeek-R1-Distill-Qwen-32B has no MoE and retains its reasoning abilities, this is direct empirical proof that MoE is not responsible for emergent reasoning. The strongest empirical refutation of the claim that DeepSeek-R1’s reasoning abilities require MoE lies in the very nature of DeepSeek-R1-Distill-Qwen-32B, a model that maintains all of DeepSeek-R1’s reasoning capabilities despite being fully dense. It is a smoking gun proof that DeepSeek-R1’s emergent reasoning behaviors are model agnostic and stem purely from reinforcement learning, structured optimization, and Chain-of-Thought training, not from architectural sparsity or expert gating. Explore the groundbreaking advancements of DeepSeek R1, a reinforcement learning-driven reasoning model. Unlike traditional models that heavily rely on supervised fine-tuning, DeepSeek R1 adopts a reinforcement learning-only (RL-only) approach to develop its reasoning capabilities. DeepSeek-R1 was trained using GRPO instead of PPO, as it allowed reinforcement learning on a large-scale language model without requiring a critic network. By learning from DeepSeek V3’s evolution, DeepSeek R1 was able to push the boundaries of reinforcement learning-driven reasoning, achieving state-of-the-art performance in complex problem-solving tasks while maintaining computational efficiency. DeepSeek R1 represents a paradigm shift in reinforcement learning-driven reasoning models, proving that logical self-improvement can emerge without supervised fine-tuning. They used this data to train DeepSeek-V3-Base on a set of high quality thoughts, they then pass the model through another round of reinforcement learning, which was similar to that which created DeepSeek-r1-zero, but with more data (we’ll get into the specifics of the entire training pipeline later). They took DeepSeek-V3-Base, with these special tokens, and used GRPO style reinforcement learning to train the model on programming tasks, math tasks, science tasks, and other tasks where it’s relatively easy to know if an answer is correct or incorrect, but requires some level of reasoning.
- `Chroma.from_documents` is used to create and persist the vectorstore.
- `Chroma.as_retriever()` is used to create a retriever object.
- The `TavilySearchResults` tool is used to execute web searches.

This project has the potential to significantly improve knowledge access and retrieval within organizations. It can be used for:
Further improvements:
Benefits:
Advantages:
Disadvantages:
Tradeoffs
Prerequisites
pip install -r requirements.txt
Create a `.env` file and add your OpenAI and Tavily API keys:

```
OPENAI_API_KEY=
LANGCHAIN_API_KEY=
LANGCHAIN_PROJECT=
LANGCHAIN_TRACING_V2=true
TAVILY_API_KEY=
PYTHONPATH=/Users/junfanzhu/Desktop/langgraph
```
python graph/graph.py
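The compiled graph can also be invoked from Python. A minimal sketch, assuming `graph/graph.py` exposes the compiled workflow as `app` (the export name is an assumption):

```python
# minimal sketch: invoke the compiled LangGraph workflow directly
# (assumes graph/graph.py exposes the compiled graph as `app`)
from graph.graph import app

result = app.invoke({"question": "what is agent memory?"})
print(result["generation"])
```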
Key files:

- `graph/chains/answer_grader.py`:
  - `GradeAnswer` class: Defines the structure of the output for answer grading.
  - `answer_grader`: A chain that uses an LLM to assess whether an answer addresses a question.
- `graph/chains/hallucination_grader.py`:
  - `GradeHallucination` class: Defines the structure of the output for hallucination grading.
  - `hallucination_grader`: A chain that uses an LLM to assess whether a generation is grounded in the provided documents.
- `graph/chains/retrieval_grader.py`:
  - `GradeDocuments` class: Defines the structure of the output for document relevance grading.
  - `retrieval_grader`: A chain that uses an LLM to assess the relevance of a document to a question.
- `graph/chains/router.py`:
  - `RouteQuery` class: Defines the structure of the output for routing questions.
  - `question_router`: A chain that uses an LLM to route a question to the appropriate data source (vectorstore or web search).
- `graph/chains/tests/test_chains.py`: Tests for the chains above.
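The graders share the same pattern: a small Pydantic schema plus an LLM with structured output. A rough sketch of the retrieval grader under those assumptions (prompt wording and model choice are illustrative, not the project's exact code):

```python
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class GradeDocuments(BaseModel):
    """Binary relevance score for a retrieved document."""
    binary_score: str = Field(description="Document is relevant to the question, 'yes' or 'no'")

llm = ChatOpenAI(temperature=0)  # model choice is an assumption
structured_grader = llm.with_structured_output(GradeDocuments)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader assessing the relevance of a retrieved document to a user question. "
               "Answer 'yes' if the document is relevant, otherwise 'no'."),
    ("human", "Retrieved document:\n\n{document}\n\nUser question: {question}"),
])

retrieval_grader = prompt | structured_grader
# retrieval_grader.invoke({"document": "...", "question": "what is agent memory?"}).binary_score -> "yes"/"no"
```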
- `graph/generation.py`:
  - Defines `generation_chain`, which takes a user question and context, and generates an answer using an LLM.
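A sketch of what that chain might look like: a prompt pulled from LangChain Hub piped into an OpenAI chat model and a string parser (the `rlm/rag-prompt` hub prompt is a common choice but is an assumption here):

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)       # model choice is an assumption
prompt = hub.pull("rlm/rag-prompt")   # RAG prompt from LangChain Hub (assumed name)

# context + question in, answer text out
generation_chain = prompt | llm | StrOutputParser()
```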
- `graph/graph.py`:
  - Loads environment variables with `load_dotenv()`.
  - Builds a `StateGraph` with `GraphState`.
  - Nodes: `RETRIEVE`, `GRADE_DOCUMENTS`, `WEBSEARCH`, `GENERATE`.
  - Conditional entry point via `route_question`, allowing routing to either `RETRIEVE` or `WEBSEARCH` based on the query.
  - `RETRIEVE` -> `GRADE_DOCUMENTS`.
  - `GRADE_DOCUMENTS` -> `WEBSEARCH` or `GENERATE` (conditional).
  - `WEBSEARCH` -> `GENERATE`.
  - `GENERATE` -> conditional routing based on `grade_generation_grounded_in_documents_and_question` to either `GENERATE` (retry), `END` (success), or `WEBSEARCH` (augment with web results).
  - `decide_to_generate(state)` function: Checks `state["web_search"]` to determine if web search is needed and returns `WEBSEARCH` or `GENERATE` based on the condition.
  - `grade_generation_grounded_in_documents_and_question(state)`: Grades the generation against the documents and the question (detailed below).
  - `route_question(state)`: Routes to `WEBSEARCH` or `RETRIEVE` based on the query.
  - Compiles the workflow with `workflow.compile()` and renders the diagram with `.get_graph().draw_mermaid_png()`.
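Putting those pieces together, the wiring in `graph/graph.py` plausibly looks like the following sketch (node-name constants, import paths, and the PNG file name are assumptions based on the description above):

```python
from dotenv import load_dotenv
from langgraph.graph import StateGraph, END

load_dotenv()

# project-internal imports; exact module paths are assumptions
from graph.state import GraphState
from graph.nodes.retrieve import retrieve
from graph.nodes.grade_documents import grade_documents
from graph.nodes.generate import generate
from graph.nodes.web_search import web_search
# route_question, decide_to_generate, and grade_generation_grounded_in_documents_and_question
# are defined in this same file; sketches of them appear later in this write-up.

RETRIEVE, GRADE_DOCUMENTS, GENERATE, WEBSEARCH = "retrieve", "grade_documents", "generate", "websearch"

workflow = StateGraph(GraphState)
workflow.add_node(RETRIEVE, retrieve)
workflow.add_node(GRADE_DOCUMENTS, grade_documents)
workflow.add_node(GENERATE, generate)
workflow.add_node(WEBSEARCH, web_search)

# entry point chosen by the router
workflow.set_conditional_entry_point(route_question, {WEBSEARCH: WEBSEARCH, RETRIEVE: RETRIEVE})
workflow.add_edge(RETRIEVE, GRADE_DOCUMENTS)
workflow.add_conditional_edges(GRADE_DOCUMENTS, decide_to_generate, {WEBSEARCH: WEBSEARCH, GENERATE: GENERATE})
workflow.add_edge(WEBSEARCH, GENERATE)
workflow.add_conditional_edges(
    GENERATE,
    grade_generation_grounded_in_documents_and_question,
    {"useful": END, "not useful": WEBSEARCH, "not supported": GENERATE},
)

app = workflow.compile()
app.get_graph().draw_mermaid_png(output_file_path="graph.png")  # output file name is an assumption
```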
- `graph/state.py`:
  - Defines the `GraphState` TypedDict to manage the state of the workflow.
  - `GraphState` includes:
    - `question`: The user's question.
    - `generation`: The LLM-generated response.
    - `web_search`: A boolean flag indicating whether web search is needed.
    - `documents`: A list of retrieved documents.
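As a sketch, `graph/state.py` amounts to a small TypedDict (field types are inferred from the descriptions above):

```python
from typing import List, TypedDict
from langchain_core.documents import Document

class GraphState(TypedDict):
    """State carried between nodes of the LangGraph workflow."""
    question: str               # the user's question
    generation: str             # the LLM-generated response
    web_search: bool            # whether a web search is needed
    documents: List[Document]   # retrieved documents
```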
- `graph/nodes/generate.py`:
  - Takes the `GraphState` as input.
  - Invokes the `generation_chain` with the documents and question.
  - The `generation_chain` uses a prompt from LangChain Hub and the OpenAI LLM to generate the output.
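A plausible sketch of the node, given the state and chain described above (import paths are assumptions):

```python
from typing import Any, Dict

from graph.generation import generation_chain   # assumed import path
from graph.state import GraphState

def generate(state: GraphState) -> Dict[str, Any]:
    """Generate an answer from the graded documents and the question."""
    print("---generate---")
    question = state["question"]
    documents = state["documents"]
    generation = generation_chain.invoke({"context": documents, "question": question})
    return {"documents": documents, "question": question, "generation": generation}
```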
- `graph/nodes/grade_documents.py`:
  - Takes the `GraphState` as input.
  - Invokes the `retrieval_grader` chain with each document and the question.
  - Reads the `binary_score` from the grader's output; relevant documents are kept in `filtered_docs`.
  - If any document is graded irrelevant, sets `web_search` to `True`.
  - Returns the filtered documents and the `web_search` flag.
  - The `retrieval_grader` uses a structured output parser and a defined prompt to make the grading decision.
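A sketch of that node under the same assumptions:

```python
from typing import Any, Dict

from graph.chains.retrieval_grader import retrieval_grader   # assumed import path
from graph.state import GraphState

def grade_documents(state: GraphState) -> Dict[str, Any]:
    """Keep only documents the grader marks relevant; flag a web search otherwise."""
    print("---check document relevance to question---")
    question = state["question"]
    documents = state["documents"]

    filtered_docs = []
    web_search = False
    for doc in documents:
        grade = retrieval_grader.invoke({"question": question, "document": doc.page_content})
        if grade.binary_score.lower() == "yes":
            filtered_docs.append(doc)
        else:
            print("---grade: document not relevant---")
            web_search = True   # fall back to web search if any document is irrelevant
    return {"documents": filtered_docs, "question": question, "web_search": web_search}
```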
- `graph/nodes/retrieve.py`:
  - Takes the `GraphState` as input.
  - Reads the question from `state["question"]`.
  - Fetches documents with `retriever.invoke(question)`.
- `graph/nodes/web_search.py`:
  - Takes the `GraphState` as input.
  - Invokes the `web_search_tool` with the question.
  - Creates a `Document` object with the joined content of the search results.
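A sketch of the web search node (the `max_results` value and result-joining details are assumptions):

```python
from typing import Any, Dict

from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.documents import Document

from graph.state import GraphState   # assumed import path

web_search_tool = TavilySearchResults(max_results=3)   # max_results is an assumption

def web_search(state: GraphState) -> Dict[str, Any]:
    """Run a Tavily search and append the joined results as a single Document."""
    print("---web search---")
    question = state["question"]
    documents = state.get("documents") or []
    results = web_search_tool.invoke({"query": question})
    joined = "\n".join(result["content"] for result in results)
    documents.append(Document(page_content=joined))
    return {"documents": documents, "question": question}
```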
- `ingestion.py`:
  - Loads the source pages with `WebBaseLoader`.
  - Splits them with `RecursiveCharacterTextSplitter`.
  - Embeds the chunks with `OpenAIEmbeddings`.
  - Creates and persists the vectorstore with `Chroma.from_documents`.
  - Creates a retriever with `Chroma.as_retriever()`.
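A sketch of the ingestion script (chunk size, collection name, and persist directory are illustrative assumptions):

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

urls = [
    "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
    "https://lilianweng.github.io/posts/2024-07-07-hallucination/",
    "https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",
    "https://lilianweng.github.io/posts/2024-02-05-human-data-quality/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

# load, split, embed, and persist
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [doc for batch in docs for doc in batch]

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=250, chunk_overlap=0)
doc_splits = splitter.split_documents(docs_list)

vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",      # assumed collection name
    embedding=OpenAIEmbeddings(),
    persist_directory="./.chroma",     # assumed persist location
)
retriever = vectorstore.as_retriever()
```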
The overall workflow:

- The `question_router` determines whether to use the vectorstore or web search based on the question.
- The `retrieve` function fetches relevant documents from the Chroma database.
- The `grade_documents` function assesses the relevance of each retrieved document using the `retrieval_grader` chain.
- The `web_search` function performs a web search using the `TavilySearchResults` tool.
- The `generate` function uses the `generation_chain` to generate an answer based on the retrieved documents and user question.
- The `hallucination_grader` checks if the generated answer is grounded in the documents.
- The `answer_grader` checks if it addresses the user's question.
`grade_documents(state: GraphState)`
- Uses the `retrieval_grader` chain to determine the relevance of each retrieved document to the user's question.
- The `retrieval_grader` chain uses a structured LLM output, ensuring that the grading result is in a consistent format (`binary_score`: "yes" or "no"). This is achieved through the `GradeDocuments` class, which defines the expected output structure.
- The `web_search` flag is set to `True` if any document is deemed irrelevant, triggering a web search to provide more comprehensive information. This mechanism ensures that the system can adapt to situations where the initial retrieval might not be sufficient.
`decide_to_generate(state)`
- Checks the `web_search` flag within the `GraphState` to determine the next step in the workflow.
- If `web_search` is `True`, it routes the workflow to the `WEBSEARCH` node, ensuring that a web search is performed to supplement the retrieved documents. This is essential for cases where the initial document retrieval is not adequate or when external information is needed.
- If `web_search` is `False`, it routes the workflow to the `GENERATE` node, bypassing the web search and directly generating the response from the retrieved documents. This streamlines the process when the retrieved documents are sufficient, reducing latency and resource usage.
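A sketch of that decision function (it reuses the node-name constants and `GraphState` from the sketches above):

```python
def decide_to_generate(state: GraphState) -> str:
    """Route to web search if any retrieved document was judged irrelevant."""
    if state["web_search"]:
        print("---decision: not all docs are relevant to question---")
        return WEBSEARCH   # supplement the context with web results
    print("---decision: generate---")
    return GENERATE
```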
`retrieve(state: GraphState)`
- Takes the question from the `GraphState` and uses the ChromaDB retriever to perform a semantic search.
- Relies on `OpenAIEmbeddings` to find documents that are semantically similar to the query, even if they don't contain the exact keywords. This ensures that the system can understand the intent behind the question and retrieve relevant information.
`grade_generation_grounded_in_documents_and_question(state: GraphState)`
- Uses the `hallucination_grader` chain, which utilizes the `GradeHallucination` class for structured output, to check if the answer is grounded in the documents. This step is crucial for preventing the LLM from generating answers that are not supported by the provided context.
- Uses the `answer_grader` chain, which utilizes the `GradeAnswer` class for structured output, to check if it addresses the user's question. This ensures that the generated answer is not only accurate but also relevant to the user's query.
- The use of the `GradeHallucination` and `GradeAnswer` classes ensures that the grading process is consistent and reliable.

Ideas
Overall
This project offers a novel approach to documentation access and retrieval by incorporating a self-reflection workflow. By combining LLMs with a LangGraph-orchestrated workflow, a vector store, and adaptive routing, it provides an efficient and user-friendly way to find accurate and relevant answers to questions. While there are challenges to overcome, the potential benefits are significant.
The core of this project is to build a robust and reliable question-answering system using Retrieval Augmented Generation, but with a focus on mitigating common LLM challenges like hallucinations and irrelevant responses. We've implemented an advanced RAG pipeline that goes beyond basic retrieval.
First, we leverage LangGraph to orchestrate a stateful workflow, which allows us to manage complex interactions between different components. This is crucial because we're not just retrieving and generating; we're also grading the relevance of retrieved documents. The key innovation here is the 'grade_documents' node. We use an LLM to assess the semantic relevance of each retrieved document to the user's query. If a document is deemed irrelevant, we trigger a web search using Tavily Search as a fallback, ensuring that the final response is comprehensive and accurate. This conditional logic, managed by LangGraph, is what distinguishes this system.
We've also focused on modularity. Each step—retrieval, grading, web search, and generation—is encapsulated in its own node, making the system highly extensible. We're using ChromaDB for efficient vector storage and retrieval, and OpenAI's LLMs for both embedding and generation.
In essence, this project tackles the 'garbage in, garbage out' problem by actively filtering and augmenting the retrieved context. The LangGraph framework allows us to create a pipeline that dynamically adapts to the quality of retrieved information, resulting in more reliable and accurate responses compared to a basic RAG setup. This design shows an understanding of how to build complex LLM applications that address real-world challenges.
The RAG self-reflection workflow reflects on whether the retrieved documents are relevant to the question, whether the generated answer is grounded in those documents, and whether it actually addresses the question.
We also implement a routing element that sends each request to the datastore most likely to contain the answer (the vectorstore or web search).
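A rough sketch of that router chain (the prompt wording and system message are assumptions):

```python
from typing import Literal

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class RouteQuery(BaseModel):
    """Route a user question to the most relevant datasource."""
    datasource: Literal["vectorstore", "websearch"] = Field(
        description="Choose 'vectorstore' for questions covered by the ingested posts, otherwise 'websearch'."
    )

llm = ChatOpenAI(temperature=0)   # model choice is an assumption
route_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at routing a user question to a vectorstore or web search. "
               "The vectorstore contains a fixed set of ingested blog posts."),
    ("human", "{question}"),
])

question_router = route_prompt | llm.with_structured_output(RouteQuery)
# question_router.invoke({"question": "what is agent memory?"}).datasource -> "vectorstore" or "websearch"
```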
The project is based on three ideas:
ReAct (Reason-Act)
Autonomy in LLM applications (5 levels):
Eden Marco: LangGraph - Develop LLM-powered AI agents with LangGraph