The KNBS RAG System is an AI-powered question-answering tool designed to provide fast, accurate, and fully sourced responses to statistical queries about Kenya. It automatically processes and indexes documents published by the Kenya National Bureau of Statistics (KNBS), enabling users to search across the entire KNBS publication archive using natural language. Instead of manually navigating hundreds of PDFs, users can ask plain-language questions and receive precise answers with citations in seconds. This system is built to support researchers, policymakers, students, journalists, and data analysts who require reliable statistical information at scale.
Data Acquisition
All documents were sourced directly from the official KNBS Publications portal:
https://www.knbs.or.ke/publications/
A total of 141 documents were included (137 PDFs and 4 text files). These covered national surveys, economic reports, census data, sectoral statistics, and household surveys up to June 2025.
Document Pre-processing:
Each document was converted into structured text and split into semantically meaningful chunks.
-Total Chunks Generated: 65,104
-Chunk Size: 1,200 characters
-Overlap: 250 characters for context preservation
-Document Metadata: year of publication, data period, document category
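The chunking settings above can be illustrated with a minimal sketch. It assumes a plain sliding window over characters; the source does not say how the "semantically meaningful" boundaries were chosen, so the functions `chunk_text` and `build_records` and the record keys (`id`, `text`, `metadata`) are hypothetical names used only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 250) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Assumption: a plain sliding window. The real pipeline may split on
    sentence or section boundaries instead.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # 950 characters with the stated settings
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def build_records(doc_id: str, text: str, metadata: dict) -> list[dict]:
    """Attach the document-level metadata to every chunk of one document."""
    return [
        {"id": f"{doc_id}-{i}", "text": chunk, "metadata": dict(metadata)}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

With these settings, each chunk starts 950 characters after the previous one, so the last 250 characters of a chunk reappear at the start of the next.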
Embedding and Indexing:
To enable semantic search, each chunk was transformed into vector embeddings using:
Embedding Model: sentence-transformers/all-MiniLM-L6-v2
The resulting embeddings were stored in:
-Vector Database: ChromaDB (SQLite backend)
-Collection: rag_documents
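The embedding and indexing step might look roughly like the sketch below. The model name and collection name come from the text above; the storage path `./chroma_db`, the batch size, the record layout, the choice of cosine as the index metric, and the helper names `index_chunks` and `open_knbs_index` are assumptions. The third-party imports are kept inside `open_knbs_index` so the batching logic can be read and exercised without them.

```python
from typing import Callable, Sequence


def index_chunks(collection, embed: Callable[[Sequence[str]], list],
                 records: Sequence[dict], batch_size: int = 256) -> int:
    """Embed records in batches and add them to a Chroma-style collection."""
    total = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        texts = [r["text"] for r in batch]
        collection.add(
            ids=[r["id"] for r in batch],
            documents=texts,
            embeddings=embed(texts),
            metadatas=[r["metadata"] for r in batch],
        )
        total += len(batch)
    return total


def open_knbs_index(path: str = "./chroma_db"):
    """Create the model and collection (needs sentence-transformers and chromadb).

    The path and the cosine setting are assumptions, not taken from the source.
    """
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(
        name="rag_documents",
        metadata={"hnsw:space": "cosine"},  # use cosine distance for queries
    )

    def embed(texts: Sequence[str]) -> list:
        return model.encode(list(texts)).tolist()

    return collection, embed
```

A typical call would be `collection, embed = open_knbs_index()` followed by `index_chunks(collection, embed, records)`.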
Retrieval Pipeline:
When a user asks a question, the system:
-Generates an embedding for the query
-Retrieves the top 8 most relevant chunks using cosine similarity
-Filters results using a distance threshold to improve precision
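The retrieval steps above can be sketched in plain Python to show the logic, independent of any vector database. The threshold value `0.6` is a placeholder (the source does not state the value used), and `cosine_distance` and `retrieve` are illustrative names.

```python
import math


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as unrelated
    return 1.0 - dot / norm


def retrieve(query_vec, indexed, top_k: int = 8, max_distance: float = 0.6):
    """Rank (chunk_id, vector) pairs by distance, keep top_k, then apply the threshold.

    max_distance is a hypothetical value chosen for illustration.
    """
    scored = sorted(
        ((cosine_distance(query_vec, vec), chunk_id) for chunk_id, vec in indexed),
        key=lambda pair: pair[0],
    )
    return [(chunk_id, dist) for dist, chunk_id in scored[:top_k] if dist <= max_distance]
```

Because the threshold is applied after the top-k cut, a query with few close matches returns fewer than 8 chunks rather than padding the context with weak ones.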
Answer Generation:
The answer is generated from the retrieved context only, which keeps outputs grounded in the indexed documents.
Primary LLM Used:
-Groq (Llama-3.1-8B Instant)
Backup Models:
-OpenAI (GPT-4o-mini)
-Google Gemini (gemini-pro)
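One way to arrange a primary model with backups is a simple ordered fallback, sketched below. The source does not describe when the backups are used, so this assumes they are tried in order whenever the previous provider raises an error. Each provider is represented as a plain callable; `generate_with_fallback` and the provider labels are hypothetical and no real client library is invoked here.

```python
from typing import Callable, Sequence


def generate_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # e.g. rate limit, timeout, or outage
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice, each callable would wrap the corresponding client (Groq first, then OpenAI, then Gemini), so the calling code does not need to know which provider produced the answer.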
Every response includes mandatory citations in the format:
[Source: Document Name, Published: YYYY, Data Period: YYYY–YYYY]
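The citation format and the context-only constraint can be combined when assembling the prompt. The sketch below is an assumption about how this might be done: the prompt wording, the chunk keys (`text`, `document`, `published`, `period`), and the function names are illustrative and not taken from the system itself.

```python
def format_citation(document: str, published: int, period: str) -> str:
    """Render the citation line in the format shown above."""
    return f"[Source: {document}, Published: {published}, Data Period: {period}]"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that restricts the model to the retrieved chunks.

    Each chunk is followed by its citation line so the model can copy it verbatim.
    """
    context = "\n\n".join(
        f"{c['text']}\n{format_citation(c['document'], c['published'], c['period'])}"
        for c in chunks
    )
    return (
        "Answer the question using only the context below. "
        "Cite every figure with the bracketed source line that follows it. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Placing the citation directly under each chunk means the model repeats an existing string rather than composing a citation itself, which reduces the chance of a malformed or invented source.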
System Constraints:
-English-language queries only
-Knowledge limited to documents included in the index
-No access to real-time data or external websites
System Performance
The KNBS RAG System demonstrated strong performance across accuracy, speed, and citation reliability.
Metric: Result
-Citation Accuracy: 100%
-Factual Accuracy: 100%
-Average Response Time: 2.34 seconds
-Retrieval Precision: 94.4%
-Documents Processed: 141
-Total Chunks Indexed: 65,104
Coverage Across Domains
The system was tested on over 200 queries covering:
-Economics (GDP, inflation, trade, national accounts)
-Agriculture (crop statistics, livestock data)
-Demographics (census insights, age distribution)
-Employment (labour force participation, unemployment)
-ICT (digital access and adoption)
-Health, education, and housing statistics
Retrieval and grounding remained consistent across all domains.
Example Query
User Query:
“What is Kenya’s unemployment rate?”
System Response:
Kenya’s unemployment rate in 2023–24 was 12.5% at the national level, with youth aged 20–24 experiencing higher unemployment at 30.3%.
[Source: Kenya Housing Survey 2023–24, Published: 2025, Data Period: 2023–24]
Limitations Observed
Responses are limited to KNBS publications and cannot incorporate external sources.
Some older PDFs with poor formatting reduced chunking quality, though retrieval remained usable.
The system does not perform time-series forecasting.
The complete installation guide,