The biosynthesis of amino acids, nucleotides, and related molecules constitutes a cornerstone of cellular metabolism, encompassing highly regulated and interconnected pathways that are fundamental to all living systems. These biosynthetic processes, characterized by their complexity and intricate regulatory mechanisms, present significant challenges for students and researchers seeking to master the underlying biochemical principles. Traditional approaches to accessing this material often require extensive navigation through multiple textbook chapters, making targeted information retrieval both time-consuming and inefficient.
To address these challenges, we have developed a domain-specific Retrieval-Augmented Generation (RAG) system that focuses exclusively on the biosynthesis of amino acids, nucleotides, and related molecules as presented in the sixth edition of Lehninger Principles of Biochemistry by David L. Nelson and Michael M. Cox. This specialized application leverages advanced natural language processing techniques to provide precise responses to biochemistry questions within this specific domain.
The system architecture integrates several state-of-the-art technologies: Streamlit serves as the user interface framework, enabling interactive query submission and result visualization; the Mistral-7B-Instruct-v0.2 model functions as the core language generation component, providing scientifically accurate responses based on retrieved content; the sentence-transformers/all-MiniLM-L6-v2 model generates high-quality embeddings for semantic similarity matching; and ChromaDB provides efficient vector storage and retrieval capabilities for the biochemical content.
The approach implemented in this project involves a domain-specific Retrieval-Augmented Generation (RAG) system. First, biochemistry content from Lehninger Principles of Biochemistry is preprocessed and segmented into semantically coherent text chunks to maintain contextual integrity. These chunks are then transformed into vector representations (embeddings) using the sentence-transformers/all-MiniLM-L6-v2 embedding model to capture semantic relationships between biochemical concepts. The generated embeddings are indexed and stored in a ChromaDB vector database for efficient similarity-based retrieval operations. Upon receiving user queries, the system employs cosine similarity calculations to identify and retrieve the top-4 most relevant text chunks from the database. Finally, these retrieved chunks serve as contextual input for the Mistral-7B-Instruct-v0.2 language model, which generates scientifically accurate responses grounded in the provided biochemical content.
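The ranking step described above relies on cosine similarity between the query embedding and each stored chunk embedding. As a minimal plain-Python illustration of the measure (not the vectorized implementation the vector database uses internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot product
    of the vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Chunks whose embeddings score closest to 1.0 against the query embedding are ranked highest and returned as context.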
The retrieval pipeline comprises four key components, described below: document preprocessing, vector storage, semantic retrieval, and language generation.
For robust document preprocessing, the system is configured with the following key parameters:
Data Directory:
The system expects the raw biochemical text data to reside in a directory named data.
Maximum Chunk Size:
Texts are segmented into semantically coherent chunks of up to 800 characters. This size balances contextual integrity with efficient embedding and retrieval performance.
Chunk Overlap:
To preserve context across chunk boundaries, an overlap of 50 characters is applied between consecutive chunks. This overlap helps ensure continuity and minimizes information loss during retrieval.
Validation Check:
A validation check is included to ensure that the overlap value remains smaller than the chunk size, thereby avoiding invalid segmentation logic.
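The chunking behaviour described above can be sketched as follows. `chunk_text` is a hypothetical helper name (the project may use a library splitter instead), but the parameters and the validation check mirror those described:

```python
def chunk_text(text, max_chunk_size=800, chunk_overlap=50):
    """Split text into overlapping character chunks (hypothetical helper)."""
    # Validation check: the overlap must be smaller than the chunk size,
    # otherwise the sliding window would never advance.
    if chunk_overlap >= max_chunk_size:
        raise ValueError("chunk_overlap must be smaller than max_chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step forward while keeping `chunk_overlap` characters of context
        # from the end of the previous chunk.
        start = end - chunk_overlap
    return chunks
```

With the default parameters, each chunk shares its last 50 characters with the start of the next chunk, preserving context across boundaries.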
Our system leverages ChromaDB as the vector store to enable efficient storage and retrieval of biochemical content embeddings. The configuration emphasizes persistence and semantic retrieval performance:
Persistence Directory:
The database persists its data at ./biochem_chroma_db, ensuring that embeddings are saved across sessions. This avoids repeated embedding, enabling faster startup and more efficient reuse.
Collection Name:
The biochemistry knowledge base is stored in a dedicated ChromaDB collection named "lehninger_biochem", providing a clear and domain-specific namespace.
Embedding Model:
The system uses sentence-transformers/all-MiniLM-L6-v2 to generate semantic vector representations of the text chunks. This model was selected for its strong balance between speed and semantic accuracy in scientific domains.
Embedding Device:
Device selection is handled dynamically, enabling the model to run on either CPU or GPU depending on the system’s available resources.
This configuration underpins the semantic search capabilities of the system, allowing it to identify and retrieve the most relevant document chunks in response to biochemical queries.
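Under these assumptions, the vector store setup might look like the following sketch. The client and embedding-function names follow the ChromaDB Python API; the exact wiring in the project may differ:

```python
import chromadb
from chromadb.utils import embedding_functions
import torch

# Embedding device is selected dynamically: GPU if available, else CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Sentence-transformers embedding function for semantic vectorization.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device=device,
)

# Persistent client so embeddings survive across sessions and are not
# recomputed at every startup.
client = chromadb.PersistentClient(path="./biochem_chroma_db")
collection = client.get_or_create_collection(
    name="lehninger_biochem",
    embedding_function=embed_fn,
)
```

`get_or_create_collection` makes startup idempotent: a fresh run creates the collection, while subsequent runs reuse the persisted embeddings.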
The system employs a sophisticated retrieval approach to identify the most relevant biochemical information in response to user queries. This process is configured with the following key parameters and methods:
Number of Results (n_results):
Upon receiving a query, the system retrieves the top 4 most semantically similar document chunks from the vector database. This number provides the language model with a focused yet comprehensive context for generation. A validation check ensures this value is always positive.
Search Mechanism:
Retrieval uses semantic similarity search instead of traditional keyword matching. By comparing the vector embeddings of the user's query with those of stored document chunks, the system identifies conceptually related content, even if exact keywords are absent, resulting in more accurate and contextually relevant results.
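A minimal sketch of this retrieval step, assuming a ChromaDB collection configured as described earlier (`retrieve_context` is a hypothetical helper name):

```python
def retrieve_context(collection, query, n_results=4):
    """Fetch the top-n semantically similar chunks for a query
    (hypothetical helper around a ChromaDB collection)."""
    # Validation: the result count must be positive.
    if n_results <= 0:
        raise ValueError("n_results must be positive")
    # ChromaDB embeds the query with the collection's embedding function
    # and ranks stored chunks by vector similarity -- no keyword matching.
    results = collection.query(query_texts=[query], n_results=n_results)
    # `documents` is a list of result lists, one per input query.
    return results["documents"][0]
```

Because ranking happens in embedding space, a query about "purine ring assembly" can retrieve chunks discussing IMP biosynthesis even if the exact query words never appear in the text.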
The core language generation component of our system is powered by a carefully configured Large Language Model (LLM), enabling accurate and grounded responses based on retrieved biochemical content.
Model:
The system uses mistralai/Mistral-7B-Instruct-v0.2, a 7-billion-parameter instruction-tuned model. It was selected for its strong performance in generating coherent, context-aware responses, a critical requirement for answering domain-specific scientific questions.
Provider:
The model is served via the Together platform, which provides hosted access to performant open-source LLMs.
Authentication:
An API key is used to authenticate requests to the Together platform, ensuring secure and controlled access.
Temperature:
Set to 0.0 for deterministic behavior. This eliminates sampling randomness, prioritizing the factual accuracy that is essential in scientific settings.
Max Tokens:
The model output is limited to 512 tokens, encouraging concise, focused answers while ensuring responses remain within the context window.
Streaming:
Disabled (False), so users receive the full response at once rather than partial tokens during generation.
This configuration ensures that the LLM synthesizes relevant context into precise and reliable answers, maintaining fidelity to the source biochemical material.
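Putting these parameters together, the generation call might look like the following sketch using the Together Python SDK. `build_prompt` and `answer_query` are hypothetical helper names, and the prompt template shown is illustrative, not the system's actual grounding prompt:

```python
import os

def build_prompt(question, context_chunks):
    """Assemble a grounded prompt from retrieved chunks (hypothetical helper)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def answer_query(question, context_chunks):
    """Call Mistral-7B-Instruct-v0.2 via the Together platform."""
    from together import Together  # deferred so the prompt helper stays import-free
    client = Together(api_key=os.environ["TOGETHER_API_KEY"])
    response = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": build_prompt(question, context_chunks)}],
        temperature=0.0,  # deterministic output, no sampling randomness
        max_tokens=512,   # concise answers within the context window
        stream=False,     # full response at once, no token streaming
    )
    return response.choices[0].message.content
```

The API key is read from an environment variable rather than hard-coded, keeping access to the hosted model secure and controlled.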
Evaluation during development was primarily manual and iterative. We tested numerous queries, carefully analyzing the LLM's responses for accuracy, conciseness, and adherence to the strict grounding rules. This hands-on approach allowed us to quickly identify and address issues like hallucination and unwanted commentary.
Automated Evaluation Loop: Implementing a formal automated evaluation pipeline with a "gold standard" question-answer dataset would be a valuable next step for systematic performance tracking.
Memory Components: Adding conversational memory to allow for follow-up questions within a session.
Broader Context: Expanding the knowledge base to include more chapters or even the entire Lehninger textbook.