Practical RAG for Domain QA: An In-Memory OpenAI-Embedded Retrieval-Augmented Generation Assistant

We use cookies to improve your browsing experience and to analyze our website traffic. By clicking “Accept All” you agree to our use of cookies. Privacy policy.

RAG Domain QA Assistant

GitHub Repository: samrat-kar/rag-domain-qa-assistant

Introduction

Large Language Models (LLMs) are powerful but have a critical blind spot: they can only answer from what they were trained on. When you ask them about private documents, proprietary knowledge, or niche domains not well-covered in their training data, they either hallucinate confidently wrong answers or admit they don't know. This is a fundamental limitation for any real-world enterprise or domain-specific use case.

Retrieval-Augmented Generation (RAG) solves this by combining two complementary strengths: the precision of search (finding the right information) and the fluency of generation (turning that information into a clear, coherent answer). Instead of relying on memorized weights, the model is given the relevant text at query time and asked to reason over it.

This project implements a clean, minimal RAG pipeline from scratch — covering document ingestion, semantic chunking, vector embedding, cosine-similarity retrieval, and grounded LLM generation — with the goal of building a working, understandable baseline for document question-answering.

Purpose and Objectives

Purpose

Build and evaluate a simple, reproducible RAG assistant that answers user questions exclusively from a provided local document corpus, grounding every response in retrieved evidence rather than model-memorized knowledge.

Specific Objectives

Ingest local text documents from a custom corpus.
Split documents into retrieval-friendly chunks that preserve semantic coherence.
Generate dense vector embeddings for each chunk.
Retrieve the top-k most relevant chunks for any user query using cosine similarity.
Build a grounded prompt from retrieved context and generate an answer with an LLM.
Provide a minimal CLI interface for interactive testing.
Return source file names alongside answers for transparency and citation.

Intended Audience and Use Case

Target Audience

Students learning RAG fundamentals who want a fully readable, dependency-light implementation.
Early-stage AI practitioners building document QA prototypes before moving to production-scale systems.
Developers who need a clean, understandable RAG baseline to extend or benchmark against.

Primary Use Case

Ask domain-specific questions over a curated multi-topic document collection and receive grounded, cited answers that can be traced back to source files.

Prerequisites

Python 3.10+ (tested with Python 3.14).
Basic Python and command-line familiarity.
An OpenAI API key (for embeddings and chat completion).

Problem Definition

Generic LLM responses are unreliable for niche domains and private documents. A model trained on public internet data cannot know the contents of your internal files, proprietary reports, or specialized corpora — and when asked, it will often fabricate plausible-sounding but incorrect answers (hallucination).

This project directly addresses that limitation by retrieving relevant text passages from user-provided documents before generation. The model is constrained to answer only from the retrieved context, which:

Reduces hallucination risk by anchoring generation to real source content.
Increases answer relevance for the selected corpus.
Enables source citation so users can verify answers.
Makes the system transparent and auditable.

Scope, Assumptions, and Boundaries

In Scope

Text-based document question-answering over local files.
Retrieval using dense OpenAI embeddings and cosine similarity.
CLI-based interactive query interface.
Source attribution per answer.

Out of Scope

Web application or graphical UI deployment.
Multi-user session management or authentication.
Fine-tuning custom LLM weights.
Large-scale distributed or persistent vector indexing.
Streaming responses.

Assumptions

Input files are valid UTF-8 plain text.
The OpenAI API key is available at runtime via .env.
The document volume is small-to-medium and fits comfortably in memory.

Dataset Sources and Characteristics

Data Source

All documents are local .txt files in the data/ directory, covering seven distinct knowledge domains:

File	Domain
`artificial_intelligence.txt`	AI, ML, LLMs, deep learning, ethics
`biotechnology.txt`	Genomics, CRISPR, biotech applications
`climate_science.txt`	Climate change, greenhouse gases, impacts
`quantum_computing.txt`	Qubits, quantum algorithms, hardware
`sample_documents.txt`	General reference
`space_exploration.txt`	Missions, rockets, space science
`sustainable_energy.txt`	Renewables, solar, wind, energy storage

Why Multi-Domain?

Using documents from seven distinct domains allows meaningful cross-domain retrieval testing. A query about machine learning should retrieve from artificial_intelligence.txt and not confuse it with content from quantum_computing.txt — validating that embedding-based retrieval is doing semantic work and not just keyword matching.

Dataset Statistics

Number of source documents: 7
File format: .txt
Encoding: UTF-8
Total domain coverage: AI, biotech, climate, quantum computing, space, sustainability, general reference

Methodology

The pipeline follows a standard RAG flow, decomposed into seven sequential stages. Each stage has a clear input, transformation, and output.

Stage 1: Document Loading

What happens: The RAGAssistant.load_documents() method scans the data/ directory and reads every supported file (.txt, .csv, .json, .md) as UTF-8 text. Each file is stored as a dict with a content field and a metadata field containing the source filename and path.

Why: Keeping loading generic (plain text, no parsing) maximizes compatibility across document types and avoids format-specific failure modes during ingestion.

Stage 2: Document Chunking

What happens: Each document's text is passed to VectorDB.chunk_text(), which uses LangChain's RecursiveCharacterTextSplitter with a chunk size of 500 characters and 50-character overlap, splitting on \n\n, \n, . , , and "" in that priority order.

Why chunking matters: Embedding an entire multi-page document into a single vector loses retrieval precision — the vector represents an average of all topics in the document. Smaller, focused chunks give embeddings that represent tighter semantic concepts, making retrieval dramatically more accurate.

Why overlap matters: A 50-character overlap ensures that sentences or ideas spanning a chunk boundary are not cut off mid-thought. This preserves local context and prevents information loss at boundaries.

Why recursive splitting: Splitting on paragraph breaks first (\n\n), then line breaks, then sentences, then words preserves the natural structure of the text. It avoids mid-sentence splits unless absolutely necessary, keeping chunks more semantically coherent than naive fixed-length slicing.

Stage 3: Embedding

What happens: All chunks are batch-embedded using OpenAI's text-embedding-3-small model via OpenAIEmbeddings.embed_documents(). The resulting vectors are stored as a NumPy float32 matrix.

Why OpenAI embeddings: text-embedding-3-small is a production-quality, semantically rich embedding model. It maps text into a 1536-dimensional space where semantically similar texts cluster together — the foundation of meaningful retrieval.

Why float32 NumPy: For small-to-medium corpora, a dense NumPy matrix is fast, simple, and requires no external vector database infrastructure. It keeps the system self-contained and easy to inspect.

Stage 4: In-Memory Indexing

What happens: The VectorDB maintains three parallel in-memory lists (_documents, _metadatas, _ids) plus the NumPy embedding matrix (_embeddings). All chunk data is aligned by index position.

Why in-memory: For this assignment's scope (7 documents, ~tens of chunks), an in-memory index provides instant lookup with zero infrastructure overhead. The design can be replaced with a persistent vector database (e.g., Chroma, Pinecone, Weaviate) for larger workloads without changing the rest of the pipeline.

Stage 5: Query-Time Retrieval

What happens: When a user submits a question, it is embedded using the same text-embedding-3-small model (embed_query()). The query vector is compared against all stored document vectors using cosine similarity:

similarity(q, d) = (q · d) / (||q|| × ||d||)

The top-k chunks ranked by similarity are selected and returned.

Why cosine similarity: Cosine similarity measures the angular distance between two vectors, ignoring their magnitude. This is ideal for text embeddings because it captures semantic direction (what the text is about) independently of text length or embedding scale. Two chunks that discuss the same concept will point in similar directions in embedding space, even if one is longer.

Why top-k (not just top-1): Providing multiple retrieved chunks gives the LLM richer context. A single chunk might partially answer a question; three chunks together may span the full answer across paragraph boundaries. The default n_results=3 balances context richness against prompt length.

Stage 6: Grounded Prompt Construction and Generation

What happens: The retrieved chunks are joined with --- separators into a context string. This context plus the user's question are inserted into a ChatPromptTemplate:

You are a helpful assistant. Use only the provided context to answer the question.
If the context does not contain enough information, say so honestly rather than making up an answer.

Context:
{context}

Question: {question}

Provide a clear, concise answer based on the context provided.

The prompt is passed through a LangChain chain (prompt | llm | StrOutputParser) to gpt-4o-mini at temperature 0.2, and the answer string is extracted.

Why temperature 0.2: Lower temperature makes the model more deterministic and factual — appropriate for QA where consistency and accuracy matter more than creativity.

Why explicit grounding instructions: The prompt explicitly tells the model to use only the provided context and to admit when context is insufficient. This is the key RAG safety mechanism that suppresses hallucination — without it, the model may blend retrieved context with memorized training data.

Stage 7: Answer and Source Return

What happens: The query() method returns a dict containing the answer text, the raw context chunks, and the unique set of source filenames that contributed to the retrieved context.

Why return sources: Source attribution is essential for verifiability. Users can inspect which documents the answer came from and cross-check if needed. It also supports debugging — if an answer is wrong, checking sources quickly reveals whether the problem is in retrieval (wrong chunks) or generation (model misread correct chunks).

Domain Handling

How the System Handles Multiple Domains

The corpus spans seven distinct knowledge domains — AI, Biotechnology, Climate Science, Quantum Computing, Space Exploration, Sustainable Energy, and General Reference. Each domain is stored in a separate .txt file. During ingestion, the extract_domain() function in src/vectordb.py maps each source filename to a human-readable domain label:

Source File	Domain Label
`artificial_intelligence.txt`	AI
`biotechnology.txt`	Biotechnology
`climate_science.txt`	Climate Science
`quantum_computing.txt`	Quantum Computing
`space_exploration.txt`	Space Exploration
`sustainable_energy.txt`	Sustainable Energy
`sample_documents.txt`	General

This domain label is stored in the metadata of every chunk at index time, so every retrieved chunk carries its domain origin.

Domain-Filtered Retrieval

The VectorDB.search() and ChromaKnowledgeBase.search() methods both accept an optional domain_filter parameter. When set, only chunks belonging to that domain are considered during ranking — non-matching chunks are masked before the cosine similarity sort.

# Restrict retrieval to Biotechnology documents only
result = assistant.query("How does CRISPR work?", domain_filter="Biotechnology")

# See all available domain labels in the current index
print(assistant.list_domains())
# ['AI', 'Biotechnology', 'Climate Science', 'Quantum Computing', ...]

In demo.py, the interactive CLI also supports domain-tag prefixes:

You: [AI] What is backpropagation?
(Filtered to domain: AI)
Assistant: ...

Cross-Domain Queries

For queries that naturally span multiple domains (e.g., "How is AI used in climate science?"), no filter should be applied — the embedding model handles cross-domain routing implicitly. The domain_filter parameter is optional and defaults to None (full corpus search).

Knowledge Base Integration

Persistent Knowledge Base with ChromaDB

The original in-memory VectorDB is fully functional for small corpora and learning use cases, but re-embeds all documents on every startup. The new ChromaKnowledgeBase in src/knowledge_base.py provides a persistent alternative backed by ChromaDB.

from src.knowledge_base import ChromaKnowledgeBase
from src.app import RAGAssistant

# Build a persistent knowledge base (embeddings saved to ./chroma_db/)
kb = ChromaKnowledgeBase(persist_dir="./chroma_db")
assistant = RAGAssistant(store=kb)
assistant.load_and_ingest("./data")   # only runs once; subsequent startups reuse the index

Both stores share the same interface (add_documents(), search(), list_domains()), so RAGAssistant works with either without any other changes.

Incremental Indexing

ChromaKnowledgeBase.add_documents() checks which source files are already indexed and skips them automatically. This means:

First run: all 7 documents are chunked, embedded, and stored.
Subsequent runs: startup is instant — no API calls, no re-embedding.
Adding a new document: only the new file is processed.

Choosing a Store

	`VectorDB` (in-memory)	`ChromaKnowledgeBase` (persistent)
Persistence	No — re-embeds on every run	Yes — index survives restarts
Incremental updates	No	Yes — skips already-indexed sources
Infrastructure	None	Local `./chroma_db/` directory
Best for	Learning, quick demos	Production, larger corpora

Query Processing

Limitations of Naive Query Processing

Passing the raw user question directly to the embedding model works well for clear, concise queries, but fails for:

Vague queries — "Tell me about energy" retrieves broadly without focus.
Short queries — sparse embeddings that may not align with longer document chunks.
Compound questions — "Compare solar and wind energy" may retrieve well for one topic but miss the other.

Query Rewriting

QueryProcessor.rewrite() in src/query_processor.py uses the LLM to rephrase a query into a more precise, retrieval-friendly form before embedding:

# Enable at init time
assistant = RAGAssistant(use_query_processor=True)

# Or per-query
result = assistant.query("Tell me about energy", rewrite_query=True)
print(result["retrieval_query"])
# → "What are the main sources and technologies of sustainable energy?"

The rewritten query is embedded for retrieval; the original question is still used for the LLM generation prompt, so the final answer remains natural.

Query Decomposition

QueryProcessor.decompose() splits compound or comparative questions into independent sub-queries, each of which can be retrieved separately:

from src.query_processor import QueryProcessor
qp = QueryProcessor(llm)

sub_queries = qp.decompose("Compare deep learning and quantum computing")
# → ["What is deep learning?", "What is quantum computing?"]

Each sub-query can then be passed to assistant.query() independently and the results merged, giving the LLM richer, targeted context for each part of the comparison.

Memory and Reasoning Capabilities

Current Memory Model: Stateless Per-Query

In the current implementation, each call to assistant.query() is completely independent. The assistant has no memory of previous questions or answers within a session. Every query starts fresh — only the retrieved document chunks and the current question are visible to the LLM.

This is intentional for the baseline: stateless QA is simpler, predictable, and avoids the complexity of managing conversation history. But it has a clear limitation for interactive use — the user cannot ask follow-up questions that reference prior answers.

Turn 1: "What is machine learning?"       → ✅ answered from corpus
Turn 2: "Can you give me an example?"    → ❌ assistant has no context of Turn 1

Adding Conversational Memory

To support multi-turn dialogue, conversation history can be maintained as a list of prior (question, answer) pairs and injected into the prompt alongside retrieved context:

# Conceptual extension — not in current codebase
history = []

def query_with_memory(assistant, question, history, n_results=3):
    result = assistant.query(question, n_results=n_results)
    history_text = "\n".join(
        f"Q: {h['question']}\nA: {h['answer']}" for h in history[-3:]
    )
    # Inject history into prompt for context-aware generation
    history.append({"question": question, "answer": result["answer"]})
    return result

This pattern — known as windowed conversation memory — keeps only the last k turns in the prompt to avoid exceeding the model's context window. LangChain provides ready-made memory classes (ConversationBufferWindowMemory, ConversationSummaryMemory) that can be plugged into the existing chain.

Memory Strategies

Strategy	How It Works	Best For
No memory (current)	Each query is independent	Simple QA, batch processing
Window memory	Last k turns injected into prompt	Short interactive sessions
Summary memory	Prior turns summarised by LLM into a short paragraph	Long sessions, limited context window
Entity memory	Track key entities (people, topics) mentioned across turns	Domain-specific dialogue
External memory	Store session history in a database; retrieve relevant turns	Persistent multi-session agents

Reasoning Capabilities

The reasoning in this system operates at two levels:

1. Retrieval reasoning (implicit) — The embedding model reasons about semantic similarity: it maps the user's question and all document chunks into a shared vector space where conceptually related text clusters together. This is not symbolic reasoning, but it is a form of learned semantic inference — the model understands that "machine learning" and "supervised training" are related without explicit rules.

2. Generative reasoning (explicit) — Once relevant chunks are retrieved, gpt-4o-mini performs in-context reasoning over the provided text. The model can:

Synthesise information spread across multiple chunks into a coherent answer.
Draw inferences that are implied by but not stated verbatim in the context.
Recognise when the retrieved context is insufficient and admit it honestly.

The temperature setting of 0.2 keeps this reasoning focused and consistent — the model follows the evidence rather than generating creative but unsupported conclusions.

Reasoning Enhancement Opportunities

Enhancement	What It Adds
Chain-of-thought prompting	Ask the LLM to reason step-by-step before answering, improving accuracy on complex questions
Multi-hop retrieval	Retrieve, generate an intermediate answer, then retrieve again using that answer as a new query
Self-consistency	Run the same query multiple times and select the most common answer — reduces variance on ambiguous questions
Reasoning traces	Return the LLM's step-by-step reasoning alongside the final answer for auditability

System Architecture

User Query
    │
    ▼
[Embed Query]  ──── text-embedding-3-small ────▶  Query Vector
    │
    ▼
[Cosine Similarity Search]  ──── VectorDB._embeddings ────▶  Top-k Chunks
    │
    ▼
[Build Grounded Prompt]  ──── ChatPromptTemplate ────▶  Prompt String
    │
    ▼
[LLM Generation]  ──── gpt-4o-mini (temp=0.2) ────▶  Answer Text
    │
    ▼
[Return Answer + Sources]

Ingestion flow (one-time at startup):

data/*.txt
    │
    ▼
[load_documents()]  ──▶  Document Dicts
    │
    ▼
[chunk_text()]  ──▶  Text Chunks (500 chars, 50 overlap)
    │
    ▼
[embed_documents()]  ──▶  Float32 Vectors
    │
    ▼
[VectorDB._embeddings]  ──▶  In-Memory NumPy Matrix

Component Breakdown

`src/app.py` — RAGAssistant

The central orchestration class. It owns the LLM, the prompt chain, and the vector store, and exposes two main methods:

load_and_ingest(data_path) — loads documents from disk and adds them to the vector store.
query(question, n_results) — runs the full RAG pipeline (retrieve → build context → generate → return).

The LangChain chain (prompt | llm | StrOutputParser) makes the generation pipeline composable and easy to swap components in.

`src/vectordb.py` — VectorDB

The retrieval engine. Responsibilities:

chunk_text() — splits raw text into retrieval-friendly chunks.
add_documents() — embeds and indexes a list of document dicts.
search() — embeds a query and returns top-k chunks by cosine similarity.

The class is self-contained: it manages its own embedding client and internal storage, making it independently testable.

`demo.py` — CLI Interface

Minimal entry point for running the system. It:

Initializes RAGAssistant and ingests the data/ directory.
Runs three predefined example queries to show baseline behavior.
Opens an interactive loop for free-form question-answering.

Design Decisions and Trade-offs

Decision	Choice	Rationale
Vector store	In-memory NumPy	Zero infrastructure, simple, fits the assignment scope
Embedding model	`text-embedding-3-small`	Production-quality, cost-efficient, 1536-dim
LLM	`gpt-4o-mini`	Capable, fast, cost-effective for QA tasks
Chunking strategy	Recursive character splitting	Preserves natural text structure better than fixed slicing
Chunk size	500 chars / 50 overlap	Balances retrieval precision vs. context completeness
Similarity metric	Cosine similarity	Magnitude-invariant, standard for text embedding retrieval
LLM temperature	0.2	Favors accuracy and consistency over creativity
Retrieval k	3	Enough context breadth without excessive prompt length

Evaluation Framework and Verification

Evaluation Strategy

The system is evaluated across four layers:

Ingestion verification — confirm all seven files load and chunk without errors.
Retrieval verification — automated ground-truth test suite via RetrievalEvaluator.
Grounding verification — confirm generated answers align with retrieved context rather than hallucinated content.
Source transparency — confirm source filenames and domain labels are returned with every response.

Retrieval Evaluation with `RetrievalEvaluator`

src/evaluator.py provides a RetrievalEvaluator class with 13 ground-truth QA pairs covering all 7 domains. It runs each question through the full RAG pipeline and measures three metrics:

Metric	Definition
Source Accuracy	Fraction of queries where the expected domain appears anywhere in the top-k retrieved chunks
Top-1 Accuracy	Fraction of queries where the expected domain is the highest-ranked retrieved chunk
Mean Top Similarity	Average cosine similarity of the top-ranked chunk across all queries

from src.evaluator import RetrievalEvaluator

evaluator = RetrievalEvaluator()
output = evaluator.run(assistant, n_results=3)
RetrievalEvaluator.print_report(output)

Sample output:

========================================================================
  RETRIEVAL EVALUATION REPORT
========================================================================
[PASS] What is machine learning?
       Expected : AI
       Retrieved: AI
       Top sim  : 0.8712

[PASS] How does CRISPR gene editing work?
       Expected : Biotechnology
       Retrieved: Biotechnology
       Top sim  : 0.8541
...
------------------------------------------------------------------------
  Source Accuracy  (domain in top-k)  : 100.00%
  Top-1 Accuracy   (domain ranked #1) : 100.00%
  Mean Top Similarity                 : 0.8634
  Total Queries Evaluated             : 13
========================================================================

Ground-Truth Test Cases

Query	Expected Domain
"What is machine learning?"	AI
"How does deep learning work?"	AI
"What are key AI ethics concerns?"	AI
"What is quantum entanglement?"	Quantum Computing
"How do quantum computers differ from classical computers?"	Quantum Computing
"How does CRISPR gene editing work?"	Biotechnology
"What are applications of genomics in medicine?"	Biotechnology
"What causes climate change?"	Climate Science
"What are the effects of greenhouse gas emissions?"	Climate Science
"How does solar energy generate electricity?"	Sustainable Energy
"What are the benefits of wind power?"	Sustainable Energy
"What was the Apollo moon mission?"	Space Exploration
"How do rockets achieve orbit?"	Space Exploration

Grounding Check

For each query, the retrieved context chunks are inspected to verify:

The chunks come from the correct domain file.
The answer does not introduce facts absent from the retrieved context.
When context is insufficient, the model appropriately admits it rather than fabricating.

Results Summary

Observed behavior across local test runs:

All 7 documents loaded and chunked successfully at startup.
Retrieval consistently returns domain-appropriate chunks for test queries (e.g., AI questions retrieve from artificial_intelligence.txt, not space_exploration.txt).
Generated answers are coherent, concise, and grounded in the retrieved corpus.
Source filenames are consistently returned and match the expected domain for each query.
The model appropriately hedges when a query falls outside the document scope.

Limitations and Trade-offs

In-memory index only — The vector store is not persisted between runs. Every startup re-embeds all documents. For larger corpora or production serving, a persistent vector database (Chroma, FAISS, Pinecone) would be necessary.
No reranker — Retrieval relies solely on first-stage embedding cosine similarity. A cross-encoder reranker (e.g., a fine-tuned BERT model) could improve precision by re-scoring retrieved chunks before passing them to the LLM.
No quantitative benchmark — Evaluation is functional and qualitative. A production system would use a ground-truth QA evaluation dataset with metrics like RAGAS (faithfulness, answer relevance, context precision/recall) for rigorous measurement.
API dependency — Both embedding and generation require live OpenAI API access. An offline alternative (local embedding models via SentenceTransformers + local LLM via Ollama) would remove this dependency.
Single-hop retrieval — Complex multi-part questions that require information from multiple documents may not be fully answered by a single retrieval pass. Multi-hop or iterative RAG approaches could address this.

Deployment and Maintenance Considerations

Local Deployment

The system runs as a CLI tool with python demo.py. No server or database setup required. Suitable for individual use, prototyping, and learning.

Path to Production

Concern	Recommendation
Persistence	Replace in-memory VectorDB with Chroma, FAISS, or Pinecone
Scale	Add batch embedding with rate-limit handling
API reliability	Add retry logic and fallback for OpenAI outages
Monitoring	Log query latency, retrieval latency, top-k source distribution
Observability	Track user feedback on answer usefulness
Auth	Add API key or session auth for multi-user serving

Corpus Maintenance

Refresh documents periodically as source knowledge evolves.
Re-index after significant corpus updates (new files, major edits).
Review chunking settings and prompt based on observed failure cases.

Reproducibility and Setup

Environment Variables

Create a .env file in the project root:

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

Install Dependencies

pip install -r requirements.txt

Run

python demo.py

The CLI will:

Initialize the assistant and embed all documents in data/.
Run three predefined example queries.
Open an interactive prompt for free-form questions.

Why This Work Matters

RAG is one of the most practical and widely adopted patterns in applied LLM engineering today. It bridges the gap between general-purpose language models and domain-specific knowledge by grounding generation in retrieved evidence — making AI systems more accurate, verifiable, and trustworthy for real-world use.

This project provides a compact, readable, end-to-end RAG baseline that is:

Educational — every component is explicit, documented, and inspectable.
Practical — the pipeline covers real production concerns (chunking strategy, embedding choice, grounding prompts, source citation).
Extensible — clear module boundaries make it straightforward to swap in a persistent vector store, a reranker, a local LLM, or a quantitative evaluation suite.

It establishes a clean foundation for anyone building document-grounded QA systems, from learning projects to early-stage production prototypes.

Licensing and Usage Rights

Project License

This project is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Permission	Details
Share	Copy and redistribute the material in any medium or format
Adapt	Remix, transform, and build upon the material
Attribution required	You must credit the original author and indicate if changes were made
NonCommercial	You may not use the material for commercial purposes
ShareAlike	Adaptations must be distributed under the same CC BY-NC-SA 4.0 license

Full license text: LICENSE

In plain terms: You are free to learn from, fork, adapt, and share this project for educational, research, and non-commercial use, provided you give attribution and apply the same license to any derivative works. Commercial use (selling, productising, or monetising this code or its derivatives) is not permitted without explicit written permission.

Third-Party Dependencies and Their Licenses

The project builds on several open-source libraries. Their licenses are compatible with non-commercial use:

Library	License	Notes
LangChain	MIT	Core abstraction for prompting and chaining
langchain-openai	MIT	OpenAI integration for LangChain
langchain-text-splitters	MIT	RecursiveCharacterTextSplitter
ChromaDB	Apache 2.0	Persistent vector store
NumPy	BSD 3-Clause	Vector math and embedding matrix
python-dotenv	BSD 3-Clause	Environment variable loading
OpenAI Python SDK	MIT	API client for embeddings and chat

All libraries listed above permit use, modification, and redistribution for non-commercial purposes. Refer to each library's individual license for full terms.

OpenAI API Terms

This project calls the OpenAI API for:

Embeddings via text-embedding-3-small
Chat completion via gpt-4o-mini

Use of the OpenAI API is governed by OpenAI's Terms of Service and Usage Policies. Key points relevant to this project:

You must provide your own OpenAI API key — no key is bundled with this project.
API usage incurs costs billed to your OpenAI account.
Outputs generated via the API may not be used to train or fine-tune models that compete with OpenAI.
Ensure your use case complies with OpenAI's content and usage policies.

Document Corpus

The seven .txt files in the data/ directory are educational reference documents created for this project covering publicly known topics in AI, biotechnology, climate science, quantum computing, space exploration, sustainable energy, and general science. They are included under the same CC BY-NC-SA 4.0 license as the rest of the project and may be used, adapted, or replaced for non-commercial educational purposes.