GitHub repo: https://github.com/Ugofebe/tensor-ready-rag-assistant

This publication examines the architecture of a simple RAG system and walks through each step of building it; it is beginner-friendly for readers with solid Python knowledge. It describes Jacmate, a Retrieval-Augmented Generation (RAG) assistant that ingests local research PDFs and TXT files into a persistent ChromaDB vector store, uses LangChain utilities for loading, chunking, and embeddings, and calls OpenAI chat models (via LangChain) with carefully crafted prompts to generate concise responses. It produces human-readable answers to research queries, and the endpoint is served with FastAPI. By the end of this publication you should have a solid theoretical and practical understanding of how the system works and of the problems RAG applications solve.
Jacmate is structured to be easy to run locally and to extend for production. The technologies and frameworks used are popular and well trusted in industry. Key files:
- inserting_file.py: loads PDFs and .txt files using LangChain community loaders (PyPDFLoader, TextLoader).
- jac_functions.py: chunking, embedding, insertion into ChromaDB, search/retrieval helpers, and prompt construction.
- insert_chroma.py: example ingestion script that calls the ingestion flow and populates a persistent Chroma DB at ./research_db.
- app.py: FastAPI server exposing /ask to query the RAG pipeline.
Flow: Files → Loaders → Chunking → Embeddings → ChromaDB → Retriever → LLM → Client
Ingestion is the process of loading these files and inserting them into ChromaDB.
High-level flow:
- Load PDFs/TXT from the data folders with the loaders in inserting_file.py.
- Chunk and embed the text.
- Insert the chunks into the persistent Chroma collection (ml_publications).

Example loader (defined in inserting_file.py):
def load_pdf_to_strings(documents_path):
    """Load research publications from PDF files (including subfolders) and return as a list of strings."""
    # List to store all loaded documents
    documents = []

    # Walk through all subfolders
    for root, dirs, files in os.walk(documents_path):
        for file in files:
            if file.lower().endswith(".pdf"):  # check for PDFs
                file_path = os.path.join(root, file)
                try:
                    loader = PyPDFLoader(file_path)
                    loaded_docs = loader.load()
                    documents.extend(loaded_docs)
                    print(f"✅ Successfully loaded: {file}")
                except Exception as e:
                    print(f"❌ Error loading {file}: {str(e)}")

    print(f"\n📂 Total documents loaded: {len(documents)}")

    # Extract non-empty page contents as strings and return
    publications = [doc.page_content for doc in documents if doc.page_content.strip()]
    print(f"\n📂 Total documents after stripping: {len(publications)}")
    return publications
Invocation in insert_chroma.py:

from inserting_file import load_pdf_to_strings

publication = load_pdf_to_strings("data/400 Level/1st Semester")
db = insert_publications(collection, publication, title="400 level")
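inserting_file.py also handles plain .txt files via TextLoader. A hypothetical companion helper (the name load_txt_to_strings and its details are illustrative, not taken verbatim from the repo) could mirror the PDF loader like this:

import os
from langchain_community.document_loaders import TextLoader  # community loader for plain text

def load_txt_to_strings(documents_path):
    """Illustrative companion to load_pdf_to_strings: load .txt files (including subfolders)."""
    documents = []
    for root, dirs, files in os.walk(documents_path):
        for file in files:
            if file.lower().endswith(".txt"):
                file_path = os.path.join(root, file)
                try:
                    loader = TextLoader(file_path, encoding="utf-8")
                    documents.extend(loader.load())
                except Exception as e:
                    print(f"Error loading {file}: {e}")
    # Keep only non-empty page contents, as the PDF loader does
    return [doc.page_content for doc in documents if doc.page_content.strip()]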
Inserting runs insert_publications (see next section).
Chunking: splitting the loaded text into smaller, overlapping pieces; the chunk size and overlap are chosen by the developer.
Embedding: converting those text chunks into vectors so they can be stored in and retrieved from whichever vector store you prefer, e.g., Pinecone, ChromaDB, FAISS, Weaviate.
Chunking strategy (in jac_functions.chunk_research_paper):
- RecursiveCharacterTextSplitter with chunk_size=1000 and chunk_overlap=200.
- Separators ordered from paragraph breaks down to single characters: ["\n\n", "\n", ". ", " ", ""].
This approach preserves continuity across chunk boundaries and improves retrieval recall.
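A quick, illustrative way to see the overlap at work (toy sizes are used here so the repetition is visible; this snippet is not part of the repo):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy sizes for demonstration only; the project uses chunk_size=1000, chunk_overlap=200.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=30,
    separators=["\n\n", "\n", ". ", " ", ""],
)

sample = "retrieval augmented generation grounds language model answers in passages pulled from a vector store " * 4
chunks = splitter.split_text(sample)

# The tail of each chunk reappears at the head of the next one,
# so text near a boundary is never stranded without context.
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))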
Embedding model (used in jac_functions.embed_documents):
- sentence-transformers/all-MiniLM-L6-v2 (via langchain_huggingface.HuggingFaceEmbeddings): efficient and commonly used for semantic search.
- The model runs on cuda, mps, or cpu, whichever is available.

Snippet (embedding call):
model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": device},
)
embeddings = model.embed_documents(documents)
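The device value passed in model_kwargs can be picked automatically. A minimal sketch, assuming PyTorch is installed (the repo's own detection logic may differ):

import torch

# Pick the fastest available backend: CUDA GPU, Apple Silicon (MPS), or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

Passing this device to HuggingFaceEmbeddings keeps embedding fast on GPU machines while still working on CPU-only hosts.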
Inserting into ChromaDB (insert_chroma.py & insert_publications): insert_chroma.py sets up the Chroma client and collection:
client = chromadb.PersistentClient(path="./research_db")
collection = client.get_or_create_collection(
    name="ml_publications",
    metadata={"hnsw:space": "cosine"}
)
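Once ingestion has run, the store can be sanity-checked with standard ChromaDB collection methods, for example:

# Quick sanity check on the persistent store (uses the `collection` created above).
print(collection.count())  # number of stored chunks
print(collection.peek(3))  # a small sample of ids, documents, and metadata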
insert_publications (core logic in jac_functions.py):
- Splits each publication into chunks with chunk_research_paper.
- Calls embed_documents() on the chunk texts to get vectors.
- Stores everything with collection.add(embeddings=..., ids=..., documents=..., metadatas=...).

Key behavior considerations:
- next_id = collection.count() is used to avoid ID collisions when adding multiple publications in the same run.
- Each chunk's metadata stores title and chunk_id, which helps when presenting sources to users. A condensed sketch of the whole insertion flow follows.
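For orientation, here is a condensed, hedged reconstruction of that flow; it assumes chunk_research_paper returns a list of metadata dicts with title, chunk_id, and content keys, and the id format shown is purely illustrative (see jac_functions.py for the real implementation):

# chunk_research_paper and embed_documents are the helpers defined in jac_functions.py.
def insert_publications(collection, publications, title):
    """Chunk, embed, and store a list of publication texts (condensed sketch)."""
    next_id = collection.count()  # continue numbering so ids never collide
    for publication in publications:
        chunked_publication = chunk_research_paper(publication, title)  # assumed list of dicts with title/chunk_id/content
        chunk_texts = [chunk["content"] for chunk in chunked_publication]
        embeddings = embed_documents(chunk_texts)
        ids = [f"chunk_{next_id + i}" for i in range(len(chunk_texts))]  # id format is illustrative
        collection.add(
            embeddings=embeddings,
            ids=ids,
            documents=chunk_texts,
            metadatas=chunked_publication,
        )
        next_id += len(chunk_texts)
    return collection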
Retrieval: this is the process of finding the stored chunks whose meaning is closest to the user's query, scored with cosine similarity (the hnsw:space configured above). Search is implemented in search_research_db:
- Embed the query: query_vector = embeddings.embed_query(query).
- Call collection.query(...) with query_embeddings=[query_vector] and n_results=k.
- Return the matching documents, metadatas, and distances.

Example snippet:
results = collection.query(
    query_embeddings=[query_vector],
    n_results=top_k,
    include=["documents", "metadatas", "distances"]
)
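collection.query returns parallel lists (one entry per result), so search_research_db reshapes them into per-chunk dictionaries before they reach the prompt-building step. A sketch of that reshaping, assuming the title metadata field described above (the exact keys in the repo may differ):

# `results` comes from the collection.query call above.
relevant_chunks = []
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    relevant_chunks.append({
        "content": doc,                          # the chunk text
        "title": meta.get("title", "Unknown"),   # publication title from metadata
        "distance": dist,                        # cosine distance (lower = closer)
    })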
answer_research_question builds the context and prompt, then calls the LLM:
context = "\n\n".join([
    f"From {chunk['title']}:\n{chunk['content']}"
    for chunk in relevant_chunks
])
prompt = prompt_template.format(context=context, question=query)
response = llm.invoke(prompt)
return response.content, relevant_chunks
Notes:
- llm.invoke(...) is used because a LangChain chat-model wrapper (e.g., ChatOpenAI) is passed in rather than a raw client.

The app.py FastAPI server handles requests to /ask. The endpoint instantiates the embeddings model and the LLM on each request (a simple approach) and calls answer_research_question:
@app.post("/ask")
def ask_question(payload: QueryRequest):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    llm_gpt = ChatOpenAI(model_name='gpt-4o-mini', temperature=0.7)
    answer, sources = answer_research_question(payload.question, collection, embeddings, llm_gpt)
    return {...}
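The excerpt references app, QueryRequest, and collection without showing how they are created. A minimal sketch of that scaffolding, assuming the persistent store and collection name used earlier (the repo's app.py may differ in details):

import chromadb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Jacmate RAG API")

# Re-open the persistent store built by insert_chroma.py.
client = chromadb.PersistentClient(path="./research_db")
collection = client.get_or_create_collection(name="ml_publications")

class QueryRequest(BaseModel):
    question: str  # the research question sent by the client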
Example Python client:
import requests

payload = {'question': 'Summarize the primary contributions across these papers.'}
resp = requests.post('http://localhost:8000/ask', json=payload)
print(resp.json())
Curl example:
curl -X POST "http://localhost:8000/ask" -H "Content-Type: application/json" -d '{"question": "What are the main contributions in the ML papers?"}'
To use GPT-3.5 (chat) through LangChain:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
Sample concise output:
These papers present an efficient embedding and retrieval pipeline using all-MiniLM-L6-v2, chunked indexing with overlaps to preserve context, and an LLM-based synthesizer that compiles high-level findings into readable summaries.
Cited synthesis (longer):
- Embedding & Efficiency — MiniLM provides compact, meaningful vectors.
- Chunking Strategy — overlapping chunks help preserve context and improve recall.
- Application — the RAG approach produces concise summaries with citations to retrieved chunks.
Cost & latency notes:
gpt-3.5-turbo is typically cheaper and faster than GPT-4/4o; prompt size (how much retrieved context you pass in) is the main driver of cost and latency.

Running locally:
- Ingest the documents: python insert_chroma.py (this creates/updates ./research_db).
- Start the API: python app.py or uvicorn app:app --host 0.0.0.0 --port 8000.
- Generate the write-up (publication.docx): python scripts/generate_publication.py.
- Convert it to PDF (publication.pdf): python scripts/docx_to_pdf.py.

Optimization tips:
Add .env to .gitignore and never commit API keys. Example .env:

OPENAI_API_KEY=your_openai_api_key_here
GROQ_API_KEY=your_groq_api_key_here
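At runtime the keys can be loaded from .env before the LLM client is created; a minimal sketch, assuming python-dotenv is installed (the repo may load them differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the process environment
openai_key = os.environ["OPENAI_API_KEY"]  # ChatOpenAI also picks this variable up automatically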
Do not commit research_db/ or large binary files; use .gitignore to exclude research_db/ and generated docs (publication.docx, publication.pdf).

Recommended .gitignore entries (already suggested in repo):
.env
.venv
research_db/
publication.docx
publication.pdf
.DS_Store
Contributions are welcome: add tests for jac_functions, build a small UI, or add CI checks. The project ships under the MIT License (see below), which keeps it easy to reuse and distribute.
Chunking (excerpt from jac_functions.py):
def chunk_research_paper(paper_content, title):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = text_splitter.split_text(paper_content)
    ...
Insert/embedding (excerpt):
embeddings = embed_documents(chunk_texts)
collection.add(
    embeddings=embeddings,
    ids=ids,
    documents=chunk_texts,
    metadatas=chunked_publication
)
Search snippet:
query_vector = embeddings.embed_query(query)
results = collection.query(
    query_embeddings=[query_vector],
    n_results=top_k,
    include=["documents", "metadatas", "distances"]
)
Prompt template (excerpt):
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Based on the following research findings, answer the students' question:

Research Context: {context}

Researcher's Question: {question}

Answer: Provide a comprehensive answer based on the findings above.
"""
)
This project is licensed under the MIT License - see the LICENSE file for details.
Third-Party Licenses
ChromaDB: Apache License 2.0
Sentence Transformers: Apache License 2.0
OpenAI SDK: MIT License
LangChain: MIT License
Acknowledgements
HuggingFace for Sentence Transformers
ChromaDB team for their wonderful database
OpenAI for LLM API access
The open-source community at large
Author: Ugochukwu Febechukwu
Repository: https://github.com/Ugofebe/tensor-ready-rag-assistant