CsvPal: A RAG-driven assistant for CSV analysis with automatic Python code generation

Abstract

CsvPal is a retrieval-augmented generation (RAG) system designed for interactive analysis of small to medium-sized CSV files through natural language queries. It helps data science beginners overcome a fundamental challenge: identifying and carrying out the analyses that their CSV datasets support. The system combines LangChain's document processing, ChromaDB vector storage, Sentence Transformers embeddings, and Groq's Llama-3.3-70b-versatile model to retrieve the 10 most relevant data rows for each query and return contextually grounded insights alongside executable Python code. Users interact with it through a Gradio web interface deployed on Hugging Face Spaces.

Link to the Tool:

CsvPal-AI

Introduction

The entry barrier to data science remains high for beginners. While many educational resources cover theoretical concepts and programming syntax, novice analysts often struggle with a more basic question: "What analysis can be done with this data?" This gap between theoretical understanding and practical application creates a significant bottleneck in data science education. Students may grasp statistical concepts but find it difficult to identify which analyses to apply to real datasets. The problem is especially visible with CSV files, one of the most common data formats in both business and academic contexts.

The rise of large language models (LLMs) offers promising potential to bridge this educational gap. By providing conversational interfaces, LLMs could guide users through data exploration. However, LLMs face significant limitations, including issues with hallucinations and the inability to access user-specific data, which restrict their effectiveness as tools for analyzing real datasets. Retrieval-Augmented Generation (RAG) addresses these challenges by grounding model responses in actual data, ensuring that generated insights and suggestions reflect the unique characteristics of the user's dataset rather than relying on generic examples.

CsvPal harnesses these principles to create an educational tool tailored for beginner data scientists. The system retrieves the 10 most relevant rows for each query, ensuring that the generated insights are grounded in the actual data. Users upload CSV files and ask open-ended questions like "What analysis can I perform with this data?" or "What patterns might be interesting here?" The system responds with specific suggestions tailored to their dataset's characteristics, complete with executable Python code that demonstrates proper analytical techniques.

Methodology

1) CSV upload flow

Users upload small to medium-sized CSVs via a Gradio interface.
After the CSV is uploaded, the first 5 rows are displayed along with the shape of the dataset.
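
The preview step can be sketched as follows. This is a minimal illustration using only Python's standard library, not the tool's actual code; the function name `preview_csv` and the sample data are hypothetical, and the deployed app may well use a different library to build the preview.

```python
import csv
import io


def preview_csv(csv_text, n_rows=5):
    """Return the header, the first n_rows data rows, and the (rows, columns) shape."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                  # first line holds the column names
    rows = [row for row in reader if row]  # skip blank lines
    shape = (len(rows), len(header))
    return header, rows[:n_rows], shape


# Tiny illustrative sample (hypothetical values, not a real dataset file).
sample = (
    "sepal_length,petal_length,species\n"
    "5.1,1.4,setosa\n"
    "7.0,4.7,versicolor\n"
    "6.3,6.0,virginica\n"
)
header, head, shape = preview_csv(sample)
print(header)
print(head)
print(shape)
```

In a Gradio app, a function like this would typically be wired to the file upload component so the preview refreshes whenever a new file arrives.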

2) Document creation

LangChain’s CSVLoader converts rows into structured, retrievable text units that preserve column-value pairs and schema cues.
Each upload initializes a unique collection namespace to prevent cross-file contamination.
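
To make the row-to-document idea concrete, here is a standard-library approximation. It assumes each row becomes a block of "column: value" lines with the row index kept as metadata, which mirrors the general shape of a row-based CSV loader; the helper names `rows_to_documents` and `new_collection_name` are hypothetical and this is not LangChain's implementation.

```python
import csv
import io
import uuid


def rows_to_documents(csv_text, source="upload.csv"):
    """Turn each CSV row into a 'column: value' text unit plus metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for index, row in enumerate(reader):
        # Keeping the column name next to every value preserves schema cues.
        text = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({"text": text, "metadata": {"source": source, "row": index}})
    return documents


def new_collection_name(prefix="csv"):
    """Build a fresh namespace per upload so rows from different files never mix."""
    return f"{prefix}-{uuid.uuid4().hex}"


docs = rows_to_documents("sepal_length,species\n5.1,setosa\n7.0,versicolor\n")
print(docs[0]["text"])
print(docs[1]["metadata"])
print(new_collection_name())
```

Because every text unit carries its column names, a retrieved row stays interpretable on its own when it is later pasted into a prompt.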

3) Embedding generation

A Sentence Transformers model (e.g., all-MiniLM-L6-v2) encodes row chunks and schema summaries into high-dimensional vectors.

4) Vector store configuration

ChromaDB stores embeddings and metadata (e.g., row indices, column names) so that they can be searched quickly during queries.

5) Query Response Phase

Natural-language input defines the task or insight the user wants from the CSV.
The same Sentence Transformers model encodes the question into an embedding, so that the query and the stored rows share one vector space and their similarity can be compared directly.
ChromaDB retrieves the top-10 most relevant rows/chunks that best match the query intent for grounding.
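
The retrieval logic can be illustrated without any external services. In the sketch below, `toy_embed` is a deliberately crude hashed bag-of-words stand-in for the real sentence-transformer, and `top_k` is a brute-force stand-in for the ChromaDB query; both names are hypothetical. Only the overall pattern (embed the query with the same function as the rows, rank by cosine similarity, keep the best k) reflects the pipeline described above.

```python
import math
import re
import zlib


def toy_embed(text, dim=64):
    """Stand-in embedding: hashed bag of words, scaled to unit length."""
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9_.]+", text.lower()):
        # crc32 is deterministic across runs, unlike the built-in hash().
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def top_k(query, documents, k=10):
    """Return up to k (score, document) pairs, most similar first."""
    query_vec = toy_embed(query)
    scored = []
    for doc in documents:
        doc_vec = toy_embed(doc)
        # Dot product equals cosine similarity because both vectors are unit length.
        score = sum(q * d for q, d in zip(query_vec, doc_vec))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [
    "sepal_length: 5.1\npetal_length: 1.4\nspecies: setosa",
    "sepal_length: 7.0\npetal_length: 4.7\nspecies: versicolor",
    "sepal_length: 6.3\npetal_length: 6.0\nspecies: virginica",
]
for score, row in top_k("species: versicolor", rows, k=2):
    print(round(score, 3), row.replace("\n", " | "))
```

A real embedding model captures meaning rather than token overlap, so its rankings will differ; the ranking mechanics are the same.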
The model is prompted via LangChain to produce a beginner-friendly explanation that directly references the retrieved chunks and their values. The response also outlines clear, stepwise guidance for replicating or extending the analysis, along with relevant Python code.
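
A grounded prompt of this kind might be assembled as shown below. The wording of the instructions and the function name `build_prompt` are assumptions made for illustration; the tool's actual prompt template is not given in this article.

```python
def build_prompt(question, retrieved_rows):
    """Assemble a grounded prompt from the user's question and the retrieved rows."""
    context = "\n\n".join(
        f"[Row {i}]\n{row}" for i, row in enumerate(retrieved_rows, start=1)
    )
    return (
        "You are a patient tutor for beginner data scientists.\n"
        "Answer using only the CSV rows below and refer to their column names and values.\n"
        "Give step-by-step guidance and include runnable Python code.\n\n"
        f"Retrieved rows:\n{context}\n\n"
        f"Question: {question}\n"
    )


prompt = build_prompt(
    "What patterns can be seen between sepal length and petal length?",
    ["sepal_length: 5.1\npetal_length: 1.4", "sepal_length: 7.0\npetal_length: 4.7"],
)
print(prompt)
```

Numbering the rows gives the model a simple way to cite which retrieved row supports each statement.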

Experiments

1) Objective

Evaluate CsvPal-AI on the Iris dataset for accuracy, beginner-friendliness, and responsiveness, using top-10 row retrieval.

2) Dataset & Setup

Dataset: 150 samples, 4 numeric features, 1 species label (3 classes).
Stack: all-MiniLM-L6-v2 embeddings, ChromaDB vector store, Llama-3.3-70B-Versatile, LangChain, Gradio (Hugging Face Spaces).
Retrieval: schema-aware row chunks.

3) Queries Tested

What analysis can be done in the data?
What patterns can be seen between sepal length and petal length?

Once the analysis is done, users can select the Clear option in the interface. This removes all vectors for the current CSV from ChromaDB and resets the session, ensuring no mixing of data between files. The user can then upload a new CSV and start fresh without any leftover context from the previous dataset.
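
The Clear behavior amounts to dropping the active collection and resetting session state. The class below is an in-memory stand-in written for illustration only; `SessionStore` and its methods are hypothetical names, and the real tool performs the equivalent deletion in ChromaDB.

```python
import uuid


class SessionStore:
    """In-memory stand-in for a vector store with one collection per upload."""

    def __init__(self):
        self.collections = {}
        self.active = None

    def load(self, rows):
        """Create a fresh collection for a newly uploaded CSV and make it active."""
        name = f"csv-{uuid.uuid4().hex}"
        self.collections[name] = list(rows)
        self.active = name
        return name

    def clear(self):
        """Drop the active collection and reset the session."""
        if self.active is not None:
            self.collections.pop(self.active, None)
        self.active = None


store = SessionStore()
first = store.load(["species: setosa"])
store.clear()                                # nothing from the first file remains
second = store.load(["species: virginica"])
print(first != second, list(store.collections.values()))
```

Giving every upload its own collection name means a forgotten Clear still cannot mix rows from two files into one search.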

Results

CsvPal-AI provided accurate, context-aligned answers, consistently referencing the correct feature names and species labels from the retrieved rows.

Input: What analysis can be done in the data?
Input: What patterns can be seen between sepal length and petal length?

Conclusion

CsvPal-AI demonstrates the effectiveness of retrieval-augmented generation for making CSV data analysis accessible through natural language interaction. The integration of semantic retrieval with code generation addresses both immediate user information needs and longer-term reproducibility requirements, bridging the gap between conversational interfaces and traditional programmatic data analysis.

Future enhancements could address current limitations through several directions: implementing hybrid retrieval strategies combining semantic and keyword-based search, adding support for larger datasets through optimized chunking and indexing approaches, and incorporating automatic data profiling to enhance query understanding.

Technologies and Tools

LangChain CSVLoader for loading CSV files
Groq API for Llama model responses
LangChain for LLM integration
Sentence Transformers for embedding generation
ChromaDB for storing embeddings
Gradio for web interface
Hugging Face Spaces for deployment
Python for backend logic and integrations
