
This project presents a Retrieval-Augmented Generation (RAG) chatbot that integrates dense vector search with a lightweight language model to provide accurate, context-grounded responses. Unlike conventional chatbots that rely purely on model knowledge, this system retrieves relevant information from structured JSON documents before generating its responses.
The chatbot is designed to be lightweight, easy to deploy on local machines, and adaptable to real-world knowledge bases such as research archives or organizational document collections. It supports both OpenVINO for Intel accelerators and PyTorch for CPUs or CUDA-enabled GPUs, making it flexible across hardware environments.
Large language models (LLMs) have demonstrated impressive conversational abilities, yet they can produce incorrect or hallucinated information when external context is lacking. RAG addresses this limitation by combining retrieval of relevant document content with model-based generation.
In this project, a RAG chatbot was built that loads structured JSON datasets, splits documents into semantically coherent chunks, indexes them with FAISS for fast retrieval, and generates answers with a lightweight LLM backend. This design keeps responses grounded in the indexed data, fast to produce, and hardware-agnostic.
The methodology was designed to maximize retrieval accuracy and generate context-aware responses.
Input documents are stored in JSON format with `title` and `content` fields. Each document is converted into a LangChain `Document` object, and its text is then split into overlapping chunks with `RecursiveCharacterTextSplitter`, which preserves semantic coherence for retrieval.
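A minimal sketch of this ingestion step (the file name `documents.json` is a placeholder, and the exact imports may differ by LangChain version; chunk sizes match the parameters reported in the results section):

```python
import json

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the JSON knowledge base; each record has "title" and "content" fields.
with open("documents.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# Wrap each record in a LangChain Document, keeping the title as metadata.
docs = [
    Document(page_content=rec["content"], metadata={"title": rec["title"]})
    for rec in records
]

# Split into overlapping chunks; sizes match the parameters reported below.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
```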

Chunks are converted into dense vector embeddings with the Hugging Face Sentence Transformers model `all-MiniLM-L6-v2`. These vectors are indexed with FAISS, enabling fast similarity search for any user query.
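Continuing the sketch above, the indexing step might look like this with LangChain's FAISS wrapper (the sample query is illustrative):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Embed each chunk with all-MiniLM-L6-v2 (384-dimensional vectors) and build
# an in-memory FAISS index over them.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(chunks, embeddings)

# Quick sanity check: retrieve the nearest chunks for a sample query.
hits = vectorstore.similarity_search("What does the publication propose?", k=10)
```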
The chatbot uses a dual-backend strategy: OpenVINO for Intel accelerators and PyTorch for CPUs or CUDA-enabled GPUs.
Whichever backend is selected, the model is wrapped in a Hugging Face `pipeline`, allowing seamless integration with LangChain, as sketched below.
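A hedged sketch of backend selection and pipeline wrapping; the OpenVINO path assumes `optimum-intel`'s `OVModelForCausalLM`, and the fallback logic is illustrative rather than the project's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline

MODEL_ID = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

try:
    # OpenVINO backend for Intel accelerators (requires optimum-intel;
    # export=True converts the checkpoint to OpenVINO IR on the fly).
    from optimum.intel import OVModelForCausalLM
    model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True)
except ImportError:
    # PyTorch backend: use a CUDA GPU if one is available, otherwise the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

# Wrap the model in a transformers pipeline so LangChain can drive it.
generator = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256
)
llm = HuggingFacePipeline(pipeline=generator)
```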
When a user submits a query, the system:
1. embeds the query with the same sentence-transformer model used for indexing,
2. retrieves the top-k most similar chunks from the FAISS index,
3. assembles the retrieved chunks and the query into a prompt, and
4. generates a grounded answer with the LLM backend.
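One way to wire these steps together is LangChain's `RetrievalQA` chain; this is a sketch, not necessarily the project's exact chain, and the query string is a placeholder. `k` matches the top-k value in the results table.

```python
from langchain.chains import RetrievalQA

# Build a retriever over the FAISS index created earlier; k matches the
# top-k value reported in the results section.
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

result = qa_chain.invoke({"query": "Which publications cover this topic?"})
print(result["result"])
```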
A GUI built with Toga allows users to interact with the chatbot in a clean, intuitive environment. Messages and responses are logged for easy tracking.
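A minimal sketch of such a Toga front end, assuming the `qa_chain` from the previous sketch; the widget layout and app identifiers are illustrative, not the project's actual code:

```python
import toga
from toga.style import Pack
from toga.style.pack import COLUMN, ROW

class RagChatApp(toga.App):
    def startup(self):
        # Read-only log for the conversation, plus an input row below it.
        self.log = toga.MultilineTextInput(readonly=True, style=Pack(flex=1))
        self.entry = toga.TextInput(style=Pack(flex=1))
        send = toga.Button("Send", on_press=self.on_send)

        row = toga.Box(children=[self.entry, send], style=Pack(direction=ROW))
        root = toga.Box(children=[self.log, row], style=Pack(direction=COLUMN))

        self.main_window = toga.MainWindow(title=self.formal_name)
        self.main_window.content = root
        self.main_window.show()

    def on_send(self, widget):
        query = self.entry.value
        # qa_chain is the retrieval chain from the sketch above.
        result = qa_chain.invoke({"query": query})
        self.log.value += f"You: {query}\nBot: {result['result']}\n\n"
        self.entry.value = ""

RagChatApp("RAG Chatbot", "org.example.ragchat").main_loop()
```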
The RAG chatbot was tested on multiple hardware configurations to evaluate retrieval accuracy, response speed, and overall output quality.
- **Hardware tested:** Intel UHD GPU, NVIDIA GPU, CPU fallback supported
- **Dataset:** sample JSON file with multiple publication entries
- **Embedding model:** `sentence-transformers/all-MiniLM-L6-v2`
- **Generation model:** `HuggingFaceTB/SmolLM2-360M-Instruct`
| Parameter | Value |
|---|---|
| Chunk Size | 1000 characters |
| Overlap | 200 characters |
| Top-k Retrieved Chunks | 10 |
| Similarity Threshold | 0–1 |

This project shows how Retrieval-Augmented Generation (RAG) can make chatbots more reliable by grounding responses in real data instead of relying only on a language model's internal knowledge.
By combining a retriever, a vector database, and a generation model, the chatbot provides accurate and context-aware answers to document-based queries.
Although the system performs well, there's still room for improvement in speed, memory handling, and scalability to larger datasets.
Overall, this work is a practical example of how RAG can turn traditional chatbots into smarter, knowledge-driven assistants.