This publication introduces an open-source toolkit for building Retrieval-Augmented Generation (RAG) systems using Wikipedia data. It provides a reproducible pipeline for data ingestion, chunking, embedding, vector indexing, and LLM-powered retrieval, enabling rapid experimentation and deployment of RAG architectures for research and production.
This toolkit demonstrates how to build a scalable RAG system using Wikipedia as a knowledge base, leveraging modern embedding models and LLMs. It aims to accelerate research and prototyping in retrieval-augmented NLP.
RAG systems ground an LLM's answers in retrieved source text, which helps keep responses current, traceable to their sources, and relevant to the query. This project lowers the barrier to entry for practitioners and researchers by providing a ready-to-use, modular, and extensible codebase.
(Wikipedia → Chunker → Embedder → Vector Store → Retriever → LLM)
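The stages in the diagram above can be sketched in a few lines of dependency-free Python. This is an illustration only, not the toolkit's implementation: the names `chunk_text`, `embed`, `cosine`, and `VectorStore` are hypothetical, the embedding is a toy term-frequency vector rather than a neural encoder, and the index is a brute-force scan rather than a library such as FAISS.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping fixed-size word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase term counts (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Brute-force in-memory index (stand-in for a real vector index)."""

    def __init__(self):
        self.items = []  # list of (chunk, vector) pairs

    def add(self, chunk):
        self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Illustrative input text, not actual Wikipedia data.
article = (
    "The perceptron is an early neural network model. "
    "Backpropagation computes gradients for training neural networks."
)
store = VectorStore()
for chunk in chunk_text(article, max_words=8, overlap=2):
    store.add(chunk)

# The retrieved chunks would then be passed to the LLM as context.
print(store.search("perceptron model", k=1))
```

Swapping `embed` for a sentence-embedding model and `VectorStore` for an approximate nearest-neighbour index changes the quality and scale of retrieval, but the data flow between the stages stays the same.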
git clone https://github.com/SosiSis/Deep-Learning-Wikipedia-RAG
cd Deep-Learning-Wikipedia-RAG
pip install -r requirements.txt
cp .env.example .env
Edit .env with your API keys and preferences
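To check that the copied `.env` file has been filled in, a small standard-library reader such as the one below may help. It is a sketch under assumptions: the key name `LLM_API_KEY` is hypothetical (the actual keys are whatever `.env.example` defines), and the project itself may load its configuration differently, for example through a dedicated dotenv library.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            # Drop surrounding whitespace and optional quotes around the value.
            settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


if os.path.exists(".env"):
    settings = load_env_file()
    # "LLM_API_KEY" is a placeholder name; use the keys from .env.example.
    print("Missing settings:", missing_keys(settings, ["LLM_API_KEY"]))
```

Running this before the first query surfaces an unfilled key early rather than as an authentication error from the LLM provider.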
from rag_pipeline import RAGPipeline
pipeline = RAGPipeline()
answer = pipeline.query("What is the history of deep learning?")
print(answer)
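A call like `pipeline.query(...)` above typically ends by combining the retrieved chunks with the question into a single prompt for the LLM. The sketch below shows one common way to do that; `build_prompt`, its `max_chars` budget, and the prompt wording are assumptions made for illustration and are not taken from the toolkit's code.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Number the retrieved chunks and place them ahead of the question.

    Chunks are added in rank order until the character budget is used up,
    so the most relevant context survives when space is tight.
    """
    context_lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break
        context_lines.append(entry)
        used += len(entry)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder chunks standing in for retriever output.
prompt = build_prompt(
    "What is the history of deep learning?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Numbering the chunks lets the model cite which passage supports its answer, and the explicit instruction to admit insufficient context is a simple guard against unsupported answers.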
GitHub Issues: https://github.com/SosiSis/Deep-Learning-Wikipedia-RAG/issues
Maintainer Email: sosinasisay29@gmail.com
RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
FAISS: Facebook AI Similarity Search
Sentence Transformers
.env.example: Configuration template
Sample Data: Example Wikipedia chunks and embeddings