This paper presents an improved implementation of a Retrieval-Augmented Generation (RAG) system that addresses two critical challenges in question-answering systems: factual accuracy and contextual awareness. Using Wikipedia articles as the knowledge source, we introduce a fact verification mechanism and conversation management system that significantly improve the reliability and user experience of RAG-based applications. Our implementation demonstrates enhanced accuracy in information retrieval and response generation while maintaining strict adherence to source document content.
👉 GitHub Repository: daishir0/rag-document-assistant
The complete implementation is available in the repository above. It contains all core modules including data collection, vector store building, fact verification, and conversation management.
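RAG systems have emerged as a powerful approach for grounding large language model outputs in factual information. However, existing implementations often struggle with factual accuracy and contextual awareness.
Our implementation aims to address both by adding a fact verification mechanism and a conversation management system, while keeping answers strictly within the content of the source documents: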
class RAGAssistant:
    def __init__(self, config: Optional[Config] = None):
        self.config = config or Config()
        self.vectorstore = None
        self.llm = OpenAI(
            model_name=self.config.model_name,
            temperature=self.config.temperature
        )
        self.embeddings = OpenAIEmbeddings(
            model=self.config.embedding_model
        )
The system consists of four main components: data collection, vector store building, fact verification, and conversation management.
class FactChecker:
    def validate_answer(
        self,
        answer: str,
        context: str,
        sources: List[Dict]
    ) -> Tuple[bool, float, List[str]]:
        statements = self._extract_statements(answer)
        validations = []
        for statement in statements:
            similarity = self._compute_similarity(statement, context)
            validations.append(similarity >= self.similarity_threshold)
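A minimal sketch of how a validation loop like this could produce its three return values, assuming sentence-level statements and a simple word-overlap similarity. The helper names `split_statements` and `token_overlap`, the averaging rule for the confidence score, and the warning text are illustrative assumptions; the repository may well use embedding similarity instead.

```python
import re
from typing import List, Tuple


def split_statements(answer: str) -> List[str]:
    """Split an answer into sentence-like statements (naive, punctuation-based)."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def token_overlap(statement: str, context: str) -> float:
    """Fraction of the statement's distinct words that also occur in the context."""
    words = set(re.findall(r"\w+", statement.lower()))
    if not words:
        return 0.0
    context_words = set(re.findall(r"\w+", context.lower()))
    return len(words & context_words) / len(words)


def validate_answer(
    answer: str, context: str, similarity_threshold: float = 0.3
) -> Tuple[bool, float, List[str]]:
    """Return (all statements supported, mean similarity, warnings)."""
    statements = split_statements(answer)
    if not statements:
        return False, 0.0, ["Answer contains no statements to verify"]
    scores = [token_overlap(s, context) for s in statements]
    # Assumption: a statement is unsupported when its score is below the threshold.
    warnings = [
        f"Unsupported statement: {s}"
        for s, score in zip(statements, scores)
        if score < similarity_threshold
    ]
    confidence = sum(scores) / len(scores)
    return not warnings, confidence, warnings
```

Word overlap is cheap and transparent, but it cannot tell a paraphrase from an unsupported claim, so a threshold as low as 0.3 would likely need a stronger similarity measure in practice.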
class ConversationMemory:
    def add_turn(self, question: str, answer: str, context: str):
        turn = ConversationTurn(
            question=question,
            answer=answer,
            context=context,
            timestamp=datetime.now()
        )
        self.turns.append(turn)
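To show how stored turns might feed back into later queries, here is a small self-contained sketch. The `ConversationTurn` fields mirror the excerpt above; the `max_turns` cap and the `history_prompt` method are hypothetical additions for illustration, not confirmed parts of the repository.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ConversationTurn:
    question: str
    answer: str
    context: str
    timestamp: datetime


@dataclass
class ConversationMemory:
    max_turns: int = 5  # hypothetical cap on retained turns; assumed positive
    turns: List[ConversationTurn] = field(default_factory=list)

    def add_turn(self, question: str, answer: str, context: str) -> None:
        self.turns.append(
            ConversationTurn(question, answer, context, datetime.now())
        )
        # Drop the oldest turns so that at most max_turns remain.
        del self.turns[:-self.max_turns]

    def history_prompt(self) -> str:
        """Render retained turns as text that could be prepended to the next query."""
        return "\n".join(
            f"Q: {t.question}\nA: {t.answer}" for t in self.turns
        )
```

Capping the history keeps the prompt size bounded, at the cost of forgetting earlier turns in long conversations.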
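The system includes a Wikipedia data collector that looks up each keyword as a Wikipedia page, skips keywords with no matching page, and processes each article it finds: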
class WikipediaDataCollector:
    def collect_topic_articles(
        self,
        keywords: List[str],
        max_articles: int
    ) -> Dict[str, int]:
        for keyword in keywords[:max_articles]:
            page = self.wiki.page(keyword)
            if not page.exists():
                continue
            self._process_article(page)
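The collection loop can be exercised without network access by swapping in an in-memory stand-in for the Wikipedia client. In the sketch below, `FakePage` and `FakeWiki` are hypothetical test doubles, and the `collected`/`missing` keys of the returned summary are an assumption about what the `Dict[str, int]` result might hold.

```python
from typing import Dict, List


class FakePage:
    """In-memory stand-in for a Wikipedia page object (illustrative only)."""

    def __init__(self, title: str, text: str = ""):
        self.title = title
        self.text = text

    def exists(self) -> bool:
        return bool(self.text)


class FakeWiki:
    """Stand-in for the client behind self.wiki; serves pages from a dict."""

    def __init__(self, pages: Dict[str, str]):
        self._pages = pages

    def page(self, title: str) -> FakePage:
        return FakePage(title, self._pages.get(title, ""))


def collect_topic_articles(
    wiki, keywords: List[str], max_articles: int
) -> Dict[str, int]:
    """Look up each keyword and count collected versus missing pages."""
    summary = {"collected": 0, "missing": 0}
    for keyword in keywords[:max_articles]:
        page = wiki.page(keyword)
        if not page.exists():
            summary["missing"] += 1
            continue
        summary["collected"] += 1  # a real collector would save page.text here
    return summary
```

Because the slice is applied to the keyword list before any lookup, a keyword with no page still uses up one of the `max_articles` slots, so fewer than `max_articles` articles may be collected.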
@dataclass
class Config:
    model_name: str = "gpt-4-turbo-preview"
    temperature: float = 0.0
    chunk_size: int = 1000
    chunk_overlap: int = 200
    similarity_threshold: float = 0.3
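The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This sketch assumes both values are measured in characters, which the configuration above does not state, and the function `chunk_text` is a hypothetical helper; the repository may rely on a library text splitter with different boundary rules.

```python
from typing import List


def chunk_text(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError(
            "chunk_overlap must be non-negative and smaller than chunk_size"
        )
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, each new chunk starts 800 characters after the previous one, so a sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.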
We evaluated our system using three types of queries:
In-context queries
Out-of-context queries
Irrelevant queries
Our system demonstrated:
Context Relevance
Source Attribution
git clone https://github.com/daishir0/rag-document-assistant
cd rag-document-assistant
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your OpenAI API key
from rag_assistant import RAGAssistant

assistant = RAGAssistant()
assistant.load_vectorstore("data/vectorstore")
result = assistant.query("What is data science?")
Our implementation improves the reliability of a RAG system by adding fact verification and conversation management. In the example queries, the system answered an in-context question with source attribution and declined to answer questions that the documents do not cover.
MIT License
$ python scripts/collect_data.py --topic data_science --max-articles 15
📊 Collection Summary:
✅ Successfully collected: 14 articles
❌ Failed: 0 articles
📁 Data saved to: data
📄 Processed files: data/processed
Collected articles:
- Data science
- Big data
- Data mining
- Statistical inference
- Predictive analytics
- Data visualization
- Business intelligence
- Apache Spark
- Hadoop
- Python
- R
- Pandas
- NumPy
- Scikit-learn
$ python scripts/build_vectorstore.py --input data/processed --output data/vectorstore
2025-05-22 15:53:07,993 - INFO - Building vector store from: data/processed
2025-05-22 15:53:13,520 - INFO - Created vector store with 474 documents
✅ Vector store built successfully!
📁 Saved to: data/vectorstore
🔍 Ready for queries!
Example query responses:
❓ Question: What is data science and what are its main components?
💡 Answer: Data science is a concept that unifies statistics, data analysis, informatics, and their related methods to understand and analyze actual phenomena with data. Its main components include techniques and theories from mathematics, statistics, computer science, information science, and domain knowledge.
🎯 Confidence: 1.00
📚 Sources: data_science_1.txt
❓ Question: Who is the CEO of Google?
💡 Answer: That information is not available in the documents.
⚠️ Warnings: No relevant documents found for this query
🎯 Confidence: 0.00
❓ Question: How do I make a pizza?
💡 Answer: That information is not available in the documents.
⚠️ Warnings: No relevant documents found for this query
🎯 Confidence: 0.00
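The two refusals above suggest a relevance gate in front of generation. The following is a minimal sketch of such a gate, assuming retrieval returns (source name, similarity score) pairs and that the same 0.3 threshold from the configuration is used to decide relevance. The function `answer_query` and the `generate` callback are hypothetical, and the confidence score is left out because the source does not describe how it is computed.

```python
from typing import Callable, Dict, List, Tuple

NOT_FOUND = "That information is not available in the documents."
NO_DOCS_WARNING = "No relevant documents found for this query"


def answer_query(
    question: str,
    scored_docs: List[Tuple[str, float]],
    generate: Callable[[str, List[str]], str],
    similarity_threshold: float = 0.3,
) -> Dict[str, object]:
    """Refuse when no retrieved document is relevant; otherwise generate an answer."""
    relevant = [
        name for name, score in scored_docs if score >= similarity_threshold
    ]
    if not relevant:
        # Skip the language model entirely so it cannot answer from its own knowledge.
        return {"answer": NOT_FOUND, "sources": [], "warnings": [NO_DOCS_WARNING]}
    return {
        "answer": generate(question, relevant),
        "sources": relevant,
        "warnings": [],
    }
```

Refusing before the model is called is what keeps questions such as "Who is the CEO of Google?" from being answered out of the model's own training data.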