Multi-Source RAG: A Unified Pipeline for TEXT, PDF, JSON, and CSV Ingestion
Advanced RAG Implementation Project
An end-to-end Retrieval-Augmented Generation (RAG) pipeline designed for multi-format document intelligence. This project demonstrates how to transform raw, heterogeneous data into an actionable knowledge base using LangChain and ChromaDB.
Overview
Primary Focus: Educational content providing tutorials and instructional guides for RAG implementation.
Objective: Showcase expertise in building production-ready RAG systems for professional opportunities.
This repository contains a modular implementation of a RAG system. It solves the challenge of "contextual blindness" in LLMs by providing a structured bridge between local files (PDF, JSON, CSV, TXT) and a vector-based retrieval engine.
Target Audience: Students and educators learning about RAG systems, vector databases, and LLM integration.
Key Features
1. Multi-Format Loader: Unified ingestion for structured (CSV, JSON) and unstructured (PDF, TXT) data.
2. High-Performance Vector Store: Persistent storage using ChromaDB for fast similarity search.
3. Robust Pipeline: Built-in error handling for encoding issues and missing files.
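A unified multi-format loader typically dispatches on file extension. The mapping below is an illustrative sketch (the loader class names are LangChain's, but the exact wiring in this project may differ):

```python
from pathlib import Path

# Illustrative mapping from file extension to the LangChain loader class
# that would handle it; the project's actual dispatch logic may differ.
LOADER_BY_EXT = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".json": "JSONLoader",
    ".txt": "TextLoader",
}

def pick_loader(path: str) -> str:
    """Return the loader class name for a file, or raise for unsupported types."""
    ext = Path(path).suffix.lower()
    try:
        return LOADER_BY_EXT[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")

print(pick_loader("reports/summary.PDF"))  # PyPDFLoader
```

Centralizing the dispatch in one table makes adding a new format a one-line change and gives a single place to raise a clear error for unsupported files.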
Data Processing Steps
Format-Specific Transformations:
PDF (Text-to-Markdown): Preserves document hierarchy (headings, bullet points) so the LLM understands the relative importance of specific text blocks.
CSV (Row-to-Key/Value): Transforms each row into a string of the form ColumnA: Value1, ColumnB: Value2, keeping the header context attached to every data point.
JSON (Schema Flattening): Flattens nested objects into a single level. Nested structures are often lost to standard embedding models; flattening makes every key visible to the search.
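The CSV and JSON transformations above can be sketched in a few lines. These helper names are illustrative, not the project's actual function names:

```python
def row_to_text(header, row):
    """Serialize one CSV row as 'ColumnA: Value1, ColumnB: Value2' so the
    header context travels with every data point."""
    return ", ".join(f"{col}: {val}" for col, val in zip(header, row))

def flatten_json(obj, prefix=""):
    """Flatten nested dicts into a single level of dotted keys so every
    field is visible to the embedding model."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_json(value, full_key))
        else:
            flat[full_key] = value
    return flat

print(row_to_text(["ColumnA", "ColumnB"], ["Value1", "Value2"]))
# ColumnA: Value1, ColumnB: Value2
print(flatten_json({"user": {"name": "Ada", "role": "admin"}}))
# {'user.name': 'Ada', 'user.role': 'admin'}
```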
Chunking & Overlap Strategy
We utilized a Recursive Character Text Splitter with the following parameters:
Chunk Size: 500 tokens.
Chunk Overlap: 100.
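The size/overlap behavior can be illustrated with a naive character-level splitter. The project itself uses LangChain's RecursiveCharacterTextSplitter, which additionally tries to break on paragraph and sentence boundaries; this sketch only shows how overlapping windows preserve context across chunk edges:

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Naive fixed-window splitter: each chunk starts (chunk_size -
    chunk_overlap) characters after the previous one, so consecutive
    chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated storage.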
Rationale for Persistent Storage
We chose ChromaDB's Persistent Client over in-memory storage so that the embedded index is written to disk and survives process restarts, avoiding a costly re-embedding of the corpus on every run.
Dataset Methodology & Citations
Primary Knowledge Base
The documents and datasets utilized in this RAG implementation were sourced from Ready Tensor, a global hub for reproducible AI and machine learning research. Specifically, the following resources were integrated:
Source URL: https://github.com/readytensor/rt-aaidc-project1-template
readytensor/rt-aaidc-project1-template: This repository contains a bare-minimum working template for Project 1 in the AAIDC program.
LangChain Documentation: LangChain Community (2025). "Document Loaders and Vector Stores." [Online]. Available: https://python.langchain.com/docs/.
ChromaDB: Chroma Team (2024). "Chroma: The AI-native open-source embedding database." [Online]. Available: https://www.trychroma.com/.
HuggingFace Embeddings: Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Dependencies
Core Dependencies
chromadb (1.0.12): Vector database for similarity search
langchain (0.3.27): Framework for LLM applications
sentence-transformers (5.1.0): Embedding model library
langchain-text-splitters (0.3.11): Text chunking utilities
LLM Provider Libraries
langchain-openai (0.3.33): OpenAI integration
langchain-groq (0.3.8): Groq integration
langchain-google-genai (2.1.10): Google Gemini integration
Document Processing
langchain-community (0.3.30): Community loaders
pypdf (6.5.0): PDF processing
jq (1.10.0): JSON processing
Results & Technical Discussion
The evaluation of the RAG system yielded distinct performance metrics across the different document types:
Precision: The system achieved high precision. Rule-based metadata filtering significantly reduced noise by pre-selecting the relevant document category before performing vector similarity search.
Contextual Recall: The use of PyPDF ensured that the structural integrity of PDFs (headers and paragraphs) was preserved. This led to a higher Contextual Recall compared to standard loaders, as the LLM received complete, well-formatted chunks rather than fragmented sentences.
Latency: Average query response time was measured at <200ms for the retrieval step. Using a local ChromaDB instance with SentenceTransformers proved to be highly efficient for small-to-medium datasets.
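The "filter before search" idea behind the precision result can be sketched without any vector-database dependency. The function names and the two-dimensional toy vectors below are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def filtered_search(query_vec, docs, category, k=2):
    """Pre-select candidates by metadata category, then rank only the
    survivors by cosine similarity: off-category documents can never
    appear in the results, which is what removes the 'noise'."""
    pool = [d for d in docs if d["category"] == category]
    return sorted(pool, key=lambda d: cosine(query_vec, d["vector"]),
                  reverse=True)[:k]
```

In ChromaDB the same effect is achieved by passing a metadata filter (a `where` clause) to the query rather than filtering in Python.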
Performance Interpretation: "The Relevance Shield"
The primary success of this RAG implementation was its selective retrieval mechanism. Unlike standard LLMs, which often try to answer any question from general internal knowledge, this system demonstrated a "Relevance Shield" behavior:
High faithfulness for relevant queries.
Intelligent refusal of irrelevant noise.
Limitations in the Current Approach
Lack of Re-ranking: Currently, the system returns the "Top-K" results based on raw vector distance. In more complex scenarios, a Cross-Encoder Re-ranker would be necessary to further refine the relevance of the retrieved chunks before they reach the LLM.
Lexical Sensitivity: While vector search is powerful, this system lacks a Hybrid Search component (combining Vector + BM25). It may occasionally miss specific technical jargon or part numbers that lexical search would catch.
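A second-stage re-ranker slots in between retrieval and generation. The skeleton below shows the control flow; in production the scoring function would be a cross-encoder (e.g. from sentence-transformers), while the token-overlap stand-in here just keeps the sketch dependency-free. All names are illustrative:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order retrieved chunks by a second-stage relevance score and
    keep only the best top_n before they reach the LLM."""
    return sorted(candidates, key=lambda c: score_fn(query, c),
                  reverse=True)[:top_n]

def token_overlap(query, chunk):
    """Toy scorer: fraction of query tokens present in the chunk.
    A cross-encoder would replace this with a learned relevance score."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)
```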
Future Scope for Improvements
Advanced Retrieval: Reranking & Hybrid Search
Currently, the system relies on Dense Retrieval (semantic similarity). A significant gap in this approach is its "Keyword Blindness": it may struggle with specific technical IDs or acronyms that lack a clear semantic meaning.
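One common way to combine a dense ranking with a lexical (BM25-style) ranking is Reciprocal Rank Fusion, where each document scores the sum of 1/(k + rank) over the lists it appears in. This is a standard technique, sketched here as a possible future addition rather than something the current system implements:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document ids (e.g. one from dense
    vector search, one from BM25). score(d) = sum over lists of
    1 / (k + rank); k=60 is the conventional default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of putting cosine distances and BM25 scores on a comparable scale.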
Lessons Learned & Unexpected Challenges
Semantic Fragmentation from Chunking
Challenge: Initially, using a fixed character limit for chunking often split sentences or tables in half, producing "contextual orphans": chunks that made no sense on their own.
Challenge: We noticed the system sometimes failed to refuse irrelevant questions even though similarity search was active; raw similarity scores carry no built-in relevance cutoff, so every query, however off-topic, still has a "nearest" chunk.
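The usual remedy is a similarity threshold: if the best hit scores below a cutoff, the system refuses instead of answering. A minimal sketch, where the 0.5 threshold is an illustrative value that needs tuning per corpus and embedding model:

```python
REFUSAL = "I don't have relevant information to answer that."

def answer_or_refuse(hits, threshold=0.5):
    """Refuse when the best similarity score falls below a cutoff.
    `hits` is a list of (chunk, similarity) pairs sorted highest-first;
    the threshold value must be tuned empirically."""
    if not hits or hits[0][1] < threshold:
        return REFUSAL
    return f"Answering from context: {hits[0][0]}"
```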
System Architecture
The pipeline follows a 4-stage process:
Load: Extracting text and metadata using PyPDF and DirectoryLoader.
Classify: Applying rule-based logic to tag documents dynamically.
Embed & Store: Chunking text and generating vectors via SentenceTransformers.
Retrieve: Executing similarity search to find relevant context for user queries.
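The four stages above can be chained as a skeleton pipeline. Everything here is a stand-in: the toy "embedding" and extension-based classification rule only illustrate the data flow, while the real project uses PyPDF/DirectoryLoader for loading and SentenceTransformers for embedding:

```python
def load(paths):
    # Stage 1: stand-in for PyPDFLoader / DirectoryLoader extraction.
    return [{"path": p, "text": f"contents of {p}"} for p in paths]

def classify(doc):
    # Stage 2: rule-based tagging; here, simply by file extension.
    doc["category"] = doc["path"].rsplit(".", 1)[-1].lower()
    return doc

def embed(text):
    # Stage 3: toy 2-d "embedding"; SentenceTransformers would go here.
    return [float(len(text)), float(text.count(" "))]

def build_index(paths):
    # Stages 1-3 chained; stage 4 (retrieve) then queries this index.
    docs = [classify(d) for d in load(paths)]
    for d in docs:
        d["vector"] = embed(d["text"])
    return docs
```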
Getting Started
Prerequisites
Python 3.9 or higher
ChromaDB
Results
The app initializes and is ready to use.
Queries return answers retrieved by relevance.
The app answers questions that are relevant to the available data.
The app declines to answer questions that fall outside the available data.