GitHub repo: https://github.com/Ugofebe/tensor-ready-rag-assistant

This publication examines the architecture of a simple RAG system and walks through each step of building it; it is beginner-friendly for readers with solid Python knowledge. It describes Jacmate, a Retrieval-Augmented Generation (RAG) assistant that ingests local research PDFs and TXT files into a persistent ChromaDB vector store, uses LangChain utilities for loading, chunking, and embeddings, and calls OpenAI chat models (via LangChain) with carefully crafted prompts to generate concise responses. It produces human-readable answers to research queries, and the endpoint is served with FastAPI. By the end of this publication you should have a solid theoretical and practical understanding of how the system works and of the problems RAG applications solve.
Jacmate is structured to be easy to run locally and to extend for production. The technologies and frameworks used are popular and well trusted in industry. Key files:
- inserting_file.py: loads PDFs and .txt files using LangChain community loaders (PyPDFLoader, TextLoader).
- jac_functions.py: chunking, embedding, insertion into ChromaDB, search/retrieval helpers, and prompt construction.
- insert_chroma.py: example ingestion script that calls the ingestion flow and populates a persistent Chroma DB at ./research_db.
- app.py: FastAPI server exposing /ask to query the RAG pipeline.
Flow: Files → Loaders → Chunking → Embeddings → ChromaDB → Retriever → LLM → Client
Ingestion is the process of loading these files and inserting them into ChromaDB.
High-level flow:
- Load PDFs/TXT from the data folders with the loaders in inserting_file.py.
- Chunk and embed the text.
- Insert the chunks into the persistent Chroma collection (ml_publications).

Example loader (defined in inserting_file.py):
def load_pdf_to_strings(documents_path):
    """Load research publications from PDF files (including subfolders) and return as a list of strings."""
    # List to store all loaded documents
    documents = []

    # Walk through all subfolders
    for root, dirs, files in os.walk(documents_path):
        for file in files:
            if file.lower().endswith(".pdf"):  # check for PDFs
                file_path = os.path.join(root, file)
                try:
                    loader = PyPDFLoader(file_path)
                    loaded_docs = loader.load()
                    documents.extend(loaded_docs)
                    print(f"✅ Successfully loaded: {file}")
                except Exception as e:
                    print(f"❌ Error loading {file}: {str(e)}")

    print(f"\n📂 Total documents loaded: {len(documents)}")

    # Extract non-empty page contents as strings and return
    publications = [doc.page_content for doc in documents if doc.page_content.strip()]
    print(f"\n📂 Total documents after stripping: {len(publications)}")
    return publications
Invocation in insert_chroma.py:

from inserting_file import load_pdf_to_strings

publication = load_pdf_to_strings("data/400 Level/1st Semester")
db = insert_publications(collection, publication, title="400 level")
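inserting_file.py also handles plain .txt files via TextLoader. A hypothetical companion helper (the name load_txt_to_strings and its details are illustrative, not taken verbatim from the repo) could mirror the PDF loader like this:

import os
from langchain_community.document_loaders import TextLoader  # community loader for plain text

def load_txt_to_strings(documents_path):
    """Illustrative companion to load_pdf_to_strings: load .txt files (including subfolders)."""
    documents = []
    for root, dirs, files in os.walk(documents_path):
        for file in files:
            if file.lower().endswith(".txt"):
                file_path = os.path.join(root, file)
                try:
                    loader = TextLoader(file_path, encoding="utf-8")
                    documents.extend(loader.load())
                except Exception as e:
                    print(f"Error loading {file}: {e}")
    # Keep only non-empty page contents, as the PDF loader does
    return [doc.page_content for doc in documents if doc.page_content.strip()]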
Inserting runs insert_publications (see next section).
Chunking: splitting the loaded text into smaller, overlapping pieces; the chunk size and overlap are chosen by the developer.
Embedding: converting those text chunks into vectors so they can be stored in and retrieved from whichever vector store you prefer, e.g., Pinecone, ChromaDB, FAISS, Weaviate.
Chunking strategy (in jac_functions.chunk_research_paper):
- RecursiveCharacterTextSplitter with chunk_size=1000 and chunk_overlap=200.
- Separators ordered from paragraph breaks down to single characters: ["\n\n", "\n", ". ", " ", ""].
This approach preserves continuity across chunk boundaries and improves retrieval recall.
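A quick, illustrative way to see the overlap at work (toy sizes are used here so the repetition is visible; this snippet is not part of the repo):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy sizes for demonstration only; the project uses chunk_size=1000, chunk_overlap=200.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=30,
    separators=["\n\n", "\n", ". ", " ", ""],
)

sample = "retrieval augmented generation grounds language model answers in passages pulled from a vector store " * 4
chunks = splitter.split_text(sample)

# The tail of each chunk reappears at the head of the next one,
# so text near a boundary is never stranded without context.
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))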
Embedding model (used in jac_functions.embed_documents):
- sentence-transformers/all-MiniLM-L6-v2 (via langchain_huggingface.HuggingFaceEmbeddings): efficient and commonly used for semantic search.
- The model runs on cuda, mps, or cpu, whichever is available.

Snippet (embedding call):
model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": device},
)
embeddings = model.embed_documents(documents)
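The device value passed in model_kwargs can be picked automatically. A minimal sketch, assuming PyTorch is installed (the repo's own detection logic may differ):

import torch

# Pick the fastest available backend: CUDA GPU, Apple Silicon (MPS), or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

Passing this device to HuggingFaceEmbeddings keeps embedding fast on GPU machines while still working on CPU-only hosts.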
Inserting into ChromaDB (insert_chroma.py & insert_publications): insert_chroma.py sets up the Chroma client and collection:
client = chromadb.PersistentClient(path="./research_db")
collection = client.get_or_create_collection(
    name="ml_publications",
    metadata={"hnsw:space": "cosine"}
)
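Once ingestion has run, the store can be sanity-checked with standard ChromaDB collection methods, for example:

# Quick sanity check on the persistent store (uses the `collection` created above).
print(collection.count())  # number of stored chunks
print(collection.peek(3))  # a small sample of ids, documents, and metadata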
insert_publications (core logic in jac_functions.py):
- Splits each publication into chunks with chunk_research_paper.
- Calls embed_documents() on the chunk texts to get vectors.
- Stores everything with collection.add(embeddings=..., ids=..., documents=..., metadatas=...).

Key behavior considerations:
- next_id = collection.count() is used to avoid ID collisions when adding multiple publications in the same run.
- Each chunk's metadata stores title and chunk_id, which helps when presenting sources to users. A condensed sketch of the whole insertion flow follows.
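For orientation, here is a condensed, hedged reconstruction of that flow; it assumes chunk_research_paper returns a list of metadata dicts with title, chunk_id, and content keys, and the id format shown is purely illustrative (see jac_functions.py for the real implementation):

# chunk_research_paper and embed_documents are the helpers defined in jac_functions.py.
def insert_publications(collection, publications, title):
    """Chunk, embed, and store a list of publication texts (condensed sketch)."""
    next_id = collection.count()  # continue numbering so ids never collide
    for publication in publications:
        chunked_publication = chunk_research_paper(publication, title)  # assumed list of dicts with title/chunk_id/content
        chunk_texts = [chunk["content"] for chunk in chunked_publication]
        embeddings = embed_documents(chunk_texts)
        ids = [f"chunk_{next_id + i}" for i in range(len(chunk_texts))]  # id format is illustrative
        collection.add(
            embeddings=embeddings,
            ids=ids,
            documents=chunk_texts,
            metadatas=chunked_publication,
        )
        next_id += len(chunk_texts)
    return collection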
Retrieval: this is the process of finding the stored chunks whose meaning is closest to the user's query, scored with cosine similarity (the hnsw:space configured above). Search is implemented in search_research_db:
- Embed the query: query_vector = embeddings.embed_query(query).
- Call collection.query(...) with query_embeddings=[query_vector] and n_results=k.
- Return the matching documents, metadatas, and distances.

Example snippet:
results = collection.query(
    query_embeddings=[query_vector],
    n_results=top_k,
    include=["documents", "metadatas", "distances"]
)
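collection.query returns parallel lists (one entry per result), so search_research_db reshapes them into per-chunk dictionaries before they reach the prompt-building step. A sketch of that reshaping, assuming the title metadata field described above (the exact keys in the repo may differ):

# `results` comes from the collection.query call above.
relevant_chunks = []
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    relevant_chunks.append({
        "content": doc,                          # the chunk text
        "title": meta.get("title", "Unknown"),   # publication title from metadata
        "distance": dist,                        # cosine distance (lower = closer)
    })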
answer_research_question builds the context and prompt, then calls the LLM:
context = "\n\n".join([
    f"From {chunk['title']}:\n{chunk['content']}"
    for chunk in relevant_chunks
])
prompt = prompt_template.format(context=context, question=query)
response = llm.invoke(prompt)
return response.content, relevant_chunks
Notes:
- llm.invoke(...) is used because a LangChain chat-model wrapper (e.g., ChatOpenAI) is passed in rather than a raw client.

The app.py FastAPI server handles requests to /ask. The endpoint instantiates the embeddings model and the LLM on each request (a simple approach) and calls answer_research_question:
@app.post("/ask")
def ask_question(payload: QueryRequest):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    llm_gpt = ChatOpenAI(model_name='gpt-4o-mini', temperature=0.7)
    answer, sources = answer_research_question(payload.question, collection, embeddings, llm_gpt)
    return {...}
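The excerpt references app, QueryRequest, and collection without showing how they are created. A minimal sketch of that scaffolding, assuming the persistent store and collection name used earlier (the repo's app.py may differ in details):

import chromadb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Jacmate RAG API")

# Re-open the persistent store built by insert_chroma.py.
client = chromadb.PersistentClient(path="./research_db")
collection = client.get_or_create_collection(name="ml_publications")

class QueryRequest(BaseModel):
    question: str  # the research question sent by the client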
Example Python client:
import requests

payload = {'question': 'Summarize the primary contributions across these papers.'}
resp = requests.post('http://localhost:8000/ask', json=payload)
print(resp.json())
Curl example:
curl -X POST "http://localhost:8000/ask" -H "Content-Type: application/json" -d '{"question": "What are the main contributions in the ML papers?"}'
To use GPT-3.5 (chat) through LangChain:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
Sample concise output:
These papers present an efficient embedding and retrieval pipeline using all-MiniLM-L6-v2, chunked indexing with overlaps to preserve context, and an LLM-based synthesizer that compiles high-level findings into readable summaries.
Cited synthesis (longer):
- Embedding & Efficiency — MiniLM provides compact, meaningful vectors.
- Chunking Strategy — overlapping chunks help preserve context and improve recall.
- Application — the RAG approach produces concise summaries with citations to retrieved chunks.
Cost & latency notes:
gpt-3.5-turbo is typically cheaper and faster than GPT-4/4o; prompt size (how much retrieved context you pass in) is the main driver of cost and latency.

Running locally:
- Ingest the documents: python insert_chroma.py (this creates/updates ./research_db).
- Start the API: python app.py or uvicorn app:app --host 0.0.0.0 --port 8000.
- Generate the write-up (publication.docx): python scripts/generate_publication.py.
- Convert it to PDF (publication.pdf): python scripts/docx_to_pdf.py.

Optimization tips:
Add .env to .gitignore and never commit API keys. Example .env:

OPENAI_API_KEY=your_openai_api_key_here
GROQ_API_KEY=your_groq_api_key_here
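At runtime the keys can be loaded from .env before the LLM client is created; a minimal sketch, assuming python-dotenv is installed (the repo may load them differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the process environment
openai_key = os.environ["OPENAI_API_KEY"]  # ChatOpenAI also picks this variable up automatically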
Do not commit research_db/ or large binary files; use .gitignore to exclude research_db/ and generated docs (publication.docx, publication.pdf).

Recommended .gitignore entries (already suggested in repo):
.env
.venv
research_db/
publication.docx
publication.pdf
.DS_Store
Contributions are welcome: add tests for jac_functions, build a small UI, or add CI checks. The project ships under the MIT License (see below), which keeps it easy to reuse and distribute.
Chunking (excerpt from jac_functions.py):
def chunk_research_paper(paper_content, title):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = text_splitter.split_text(paper_content)
    ...
Insert/embedding (excerpt):
embeddings = embed_documents(chunk_texts)
collection.add(
    embeddings=embeddings,
    ids=ids,
    documents=chunk_texts,
    metadatas=chunked_publication
)
Search snippet:
query_vector = embeddings.embed_query(query)
results = collection.query(
    query_embeddings=[query_vector],
    n_results=top_k,
    include=["documents", "metadatas", "distances"]
)
Prompt template (excerpt):
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Based on the following research findings, answer the students' question:

Research Context: {context}

Researcher's Question: {question}

Answer: Provide a comprehensive answer based on the findings above.
"""
)
This project is licensed under the MIT License - see the LICENSE file for details.
Third-Party Licenses
ChromaDB: Apache License 2.0
Sentence Transformers: Apache License 2.0
OpenAI SDK: MIT License
LangChain: MIT License
Acknowledgements
HuggingFace for Sentence Transformers
ChromaDB team for their wonderful database
OpenAI for LLM API access
The open-source community at large
Author: Ugochukwu Febechukwu
Repository: https://github.com/Ugofebe/tensor-ready-rag-assistant