This project implements an Arabic-language Retrieval-Augmented Generation (RAG) system that answers Arabic user queries using real-world information extracted from Arabic news sources. It leverages modern NLP components including multilingual embeddings, vector search, and large language models (LLMs).
The goal is to provide accurate, context-aware answers by retrieving the most relevant passages from a vector database populated with scraped Arabic content and passing them to an LLM, which generates the response.
Scrape Arabic news articles from the web
Generate semantic embeddings from Arabic text
Store and retrieve embeddings using a vector database
Build a question-answering pipeline using LangChain and an LLM
Deploy a simple Gradio interface for Arabic Q&A
The system consists of the following components:
Data Extraction: Arabic news scraping using Firecrawl
Text Cleaning: Content cleaning with BeautifulSoup
Embedding Generation: Using paraphrase-multilingual-MiniLM-L12-v2 from Sentence Transformers
Vector Storage: Embeddings indexed in a ChromaDB vector database
Question Embedding: Arabic questions converted to vector representations
Document Retrieval: Similarity search for relevant article segments
Answer Generation: Responses generated by Google Gemini 1.5 Flash via LangChain
Web UI: Gradio interface for Arabic question input and answer display
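The question embedding, document retrieval, and answer generation steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names (`cosine_similarity`, `retrieve`, `build_prompt`, `answer`), the prompt wording, and the in-memory list standing in for ChromaDB are all assumptions made for the example. The `embed` and `generate` callables are placeholders for the embedding model and the LLM.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question_vec: Vector,
             indexed: List[Tuple[str, Vector]],
             k: int = 2) -> List[str]:
    """Return the k article segments whose vectors are most similar to the question.

    `indexed` is a hypothetical in-memory stand-in for the vector database:
    a list of (segment_text, embedding) pairs.
    """
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, segments: List[str]) -> str:
    """Combine the retrieved segments and the question into one prompt (wording assumed)."""
    context = "\n\n".join(segments)
    return (
        "Answer the question in Arabic using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str,
           embed: Callable[[str], Vector],
           indexed: List[Tuple[str, Vector]],
           generate: Callable[[str], str],
           k: int = 2) -> str:
    """Embed the question, retrieve the top-k segments, and ask the LLM."""
    segments = retrieve(embed(question), indexed, k)
    return generate(build_prompt(question, segments))
```

In the real system, `embed` would presumably wrap the `paraphrase-multilingual-MiniLM-L12-v2` model, the similarity search would be delegated to ChromaDB rather than a Python list, and `generate` would call Gemini 1.5 Flash through LangChain. Passing them in as callables keeps the sketch runnable without any of those dependencies.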