Abstract
This paper presents a novel Retrieval-Augmented Generation (RAG) chatbot system designed specifically for intelligent question-answering over PowerPoint presentation content. The system combines semantic search capabilities with large language models to enable users to interactively query and extract information from uploaded presentation files. My approach leverages FAISS (Facebook AI Similarity Search) for efficient vector-based retrieval and integrates Google's Gemini API for context-aware response generation. The system architecture implements a modular pipeline comprising text extraction from PowerPoint files, semantic chunking, vector embedding generation, and RAG-based answer synthesis. I demonstrate the system's effectiveness through a web-based interface that supports multi-file upload, real-time processing, and interactive querying. Preliminary experimental results show that my system accurately retrieves relevant information from presentation slides and generates coherent answers grounded in the retrieved context. The proposed solution addresses the growing need for intelligent document interaction systems, particularly in educational and corporate environments where presentations serve as primary knowledge artifacts.
Keywords: Retrieval-Augmented Generation, Question-Answering, PowerPoint Processing, Semantic Search, FAISS, Vector Embeddings, Natural Language Processing
Methodology
My RAG-based PPT QA system follows a modular architecture comprising four main components: Document Processing, Vector Indexing, Retrieval, and Generation.
3.1 Overall Architecture
The system architecture consists of three main layers:
• Presentation Layer: Web-based user interface for interaction
• Application Layer: FastAPI backend with REST endpoints (a minimal endpoint sketch follows this list)
• Data Layer: File storage, vector indices, and embedding management
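To make the application layer concrete, the sketch below shows how the REST endpoints could be wired up in FastAPI. The route names, the request schema, and the two commented pipeline hooks are illustrative assumptions rather than the exact implementation.

```python
# Hypothetical sketch of the application layer. Route names, the request
# schema, and the commented pipeline hooks are assumptions; the paper does
# not specify the exact REST interface.
import os
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel

app = FastAPI()
os.makedirs("uploads", exist_ok=True)  # file storage in the data layer

class Question(BaseModel):
    question: str
    top_k: int = 3  # number of chunks to retrieve (default k=3)

@app.post("/upload")
async def upload_ppt(file: UploadFile = File(...)):
    # Persist the .pptx, then hand it to the extraction/indexing pipeline.
    path = os.path.join("uploads", file.filename)
    with open(path, "wb") as out:
        out.write(await file.read())
    # index_presentation(path)  # pipeline entry point, defined elsewhere
    return {"status": "indexed", "file": file.filename}

@app.post("/ask")
async def ask(q: Question):
    # Retrieval and generation happen behind this endpoint.
    # answer = rag_answer(q.question, k=q.top_k)  # defined elsewhere
    return {"answer": "..."}
```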
3.2 Document Processing Module
3.2.1 PowerPoint Text Extraction
The system uses the python-pptx library to extract text content from PowerPoint files. The extraction process iterates through all slides and shapes, capturing text from:
• Slide titles and content placeholders
• Text boxes and shapes
• Tables and lists
• Notes sections
The extraction function processes each slide sequentially, aggregating text from all text-bearing shapes while preserving the logical structure through newline separators.
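The following sketch illustrates this extraction step with python-pptx; the function name and the per-slide aggregation format are illustrative, but the library calls are standard.

```python
# Minimal extraction sketch using python-pptx: captures shape text,
# table cells, and speaker notes, producing one aggregated string per slide.
from pptx import Presentation

def extract_text(path: str) -> list[str]:
    prs = Presentation(path)
    slides_text = []
    for slide in prs.slides:
        parts = []
        for shape in slide.shapes:
            if shape.has_text_frame and shape.text_frame.text.strip():
                parts.append(shape.text_frame.text)  # titles, placeholders, text boxes
            if shape.has_table:
                for row in shape.table.rows:         # flatten table cells row by row
                    parts.append(" | ".join(cell.text for cell in row.cells))
        if slide.has_notes_slide:                    # speaker notes, if present
            notes = slide.notes_slide.notes_text_frame.text
            if notes.strip():
                parts.append(notes)
        slides_text.append("\n".join(parts))         # newline-separated structure
    return slides_text
```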
3.2.2 Text Chunking
Extracted text is segmented into semantic chunks to enable efficient retrieval. My chunking strategy uses two approaches:
• Semantic Chunking: Splits text at natural boundaries (double newlines, paragraph breaks)
• Fixed-Size Chunking: Falls back to fixed-size chunks (500 characters) when semantic boundaries are not available
This hybrid approach balances semantic coherence with retrieval granularity, ensuring that relevant information is not split across chunks while maintaining manageable chunk sizes.
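A minimal sketch of this hybrid strategy is shown below; the exact boundary rules of the production code are assumptions, with 500 characters as the fixed-size fallback described above.

```python
# Sketch of the hybrid chunking strategy: split on paragraph boundaries
# first, then fall back to 500-character fixed windows for long blocks.
CHUNK_SIZE = 500

def chunk_text(text: str) -> list[str]:
    chunks = []
    for block in text.split("\n\n"):      # semantic boundaries (double newlines)
        block = block.strip()
        if not block:
            continue
        if len(block) <= CHUNK_SIZE:
            chunks.append(block)          # keep semantically coherent blocks whole
        else:
            for i in range(0, len(block), CHUNK_SIZE):
                chunks.append(block[i:i + CHUNK_SIZE])  # fixed-size fallback
    return chunks
```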
3.3 Vector Indexing Module
3.3.1 Embedding Generation
Text chunks are converted into dense vector representations using the SentenceTransformer model all-MiniLM-L6-v2. This model generates 384-dimensional embeddings that capture semantic relationships between text segments. The choice of this model balances computational efficiency with semantic representation quality.
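As a sketch, encoding chunks with the sentence-transformers library looks as follows; the sample chunks are placeholders.

```python
# Sketch: encode chunks with all-MiniLM-L6-v2 (384-dimensional vectors).
from sentence_transformers import SentenceTransformer
import numpy as np

chunks = [
    "Gradient descent iteratively minimizes a loss function.",  # placeholder chunk
    "Git records snapshots of a project's files over time.",    # placeholder chunk
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.asarray(model.encode(chunks), dtype="float32")  # shape (n, 384); FAISS expects float32
```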
3.3.2 FAISS Index Construction
The generated embeddings are indexed using FAISS (Facebook AI Similarity Search) with an L2 distance metric (IndexFlatL2). The indexing process:
• Encodes all text chunks into embeddings using the SentenceTransformer model
• Creates a FAISS index with dimensionality matching the embedding size
• Adds all embeddings to the index
• Persists the index and chunk metadata to disk for future retrieval
The persistent storage allows the system to maintain indices across sessions, enabling efficient querying of previously processed presentations.
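Continuing from the embedding snippet above, a minimal version of index construction and persistence could look like the following; the file names are illustrative.

```python
# Build and persist an exact-search IndexFlatL2 index plus chunk metadata.
# File names ("ppt.index", "chunks.pkl") are illustrative assumptions.
import pickle
import faiss

dim = embeddings.shape[1]           # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dim)      # exact L2 (Euclidean) distance search
index.add(embeddings)               # float32 array of shape (n, dim)

faiss.write_index(index, "ppt.index")   # persist the vector index
with open("chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)              # persist chunk metadata alongside it
```

Because both the index and the chunk list are written to disk, faiss.read_index and pickle.load can restore them in a later session, which is what enables querying previously processed presentations.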
3.4 Retrieval Module
The retrieval module (PPTRetriever) implements semantic search over the indexed content:
• Query Encoding: User queries are encoded into embeddings using the same SentenceTransformer model
• Similarity Search: FAISS performs k-nearest neighbor search (default k=3) to find the most similar chunks
• Context Assembly: Retrieved chunks are assembled into a context string for generation (see the retrieval sketch below)
The system supports searching across:
• Individual presentation embeddings
• All uploaded presentations (cross-file search)
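A compact sketch of this retrieval flow over a single persisted index is shown below; PPTRetriever's actual interface is not reproduced here, and the names are illustrative. Cross-file search can be realized by running the same search over each presentation's index and merging hits by distance.

```python
# Retrieval sketch: encode the query with the same model, run k-NN search
# in FAISS, and assemble the retrieved chunks into a context string.
import pickle
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # same model as indexing
index = faiss.read_index("ppt.index")             # persisted per-file index
with open("chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

def retrieve(query: str, k: int = 3) -> str:
    q = np.asarray(model.encode([query]), dtype="float32")
    distances, ids = index.search(q, k)           # k-nearest-neighbor search
    # Join retrieved chunks; FAISS returns -1 for unfilled result slots.
    return "\n\n".join(chunks[i] for i in ids[0] if i != -1)
```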
3.5 Generation Module
3.5.1 RAG Pipeline
The generation module implements a RAG pipeline that combines retrieval with generation:
• Context Retrieval: Relevant chunks are retrieved using the retrieval module
• Prompt Construction: The user query and retrieved context are combined into a prompt (see the template sketch after this list)
• Answer Generation: The Gemini API generates a response grounded in the retrieved context
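The prompt construction step might look like the following template; the exact wording used in the system is an assumption.

```python
# Illustrative prompt template that grounds the answer in retrieved context;
# the production prompt's exact wording is an assumption.
def build_prompt(query: str, context: str) -> str:
    return (
        "Answer the question using only the presentation content below.\n"
        "If the answer is not in the content, say so.\n\n"
        f"Presentation content:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```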
3.5.2 Gemini Integration
The system integrates with Google's Gemini API for answer generation; a minimal call sketch follows this list. The integration supports:
• Multiple model endpoints (generateText, generateContent)
• Context-aware prompting that includes retrieved presentation content
• Error handling and fallback mechanisms
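A minimal call sketch using the google-generativeai Python client is given below. The model name and the shape of the fallback are assumptions; the client's configure, GenerativeModel, and generate_content calls are standard. A full query then chains retrieve, build_prompt, and generate_answer from the sketches above.

```python
# Minimal generation sketch using the google-generativeai client.
# The model name and the fallback behavior are assumptions; the paper only
# states that generateText and generateContent endpoints are supported.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def generate_answer(prompt: str) -> str:
    try:
        model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
        return model.generate_content(prompt).text         # generateContent call
    except Exception as exc:
        # Fallback path: e.g., retry another endpoint or return a safe message.
        return f"Generation failed: {exc}"
```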
Experimental Evaluation
5.1 Test Dataset
I evaluated the system using a collection of educational PowerPoint presentations covering topics including:
• Machine Learning concepts (Chapters 1-6)
• Version control systems (Git and GitHub)
The dataset contains 7 presentations with varying lengths and complexity, providing a diverse testbed for evaluation.
5.2 Evaluation Metrics
I assessed the system on several dimensions:
• Retrieval Accuracy: Precision of retrieved chunks relative to query relevance
• Answer Quality: Coherence and relevance of generated responses
• Response Time: End-to-end query processing latency
• System Robustness: Handling of edge cases and error scenarios
5.3 Results
Preliminary evaluation demonstrates:
• Effective Retrieval: FAISS-based semantic search successfully retrieves relevant content chunks with high semantic similarity to user queries
• Context-Aware Answers: Generated responses are grounded in presentation content and maintain coherence
• Fast Response Times: Average query processing time under 2 seconds (excluding API calls)
• Multi-File Support: System successfully handles queries across multiple presentations
5.4 Limitations and Future Work
Current limitations include:
• Text-Only Processing: Images and visual elements in presentations are not processed
• Simple Chunking: Current chunking strategy may split related content
• Single Embedding Model: No dynamic model selection based on content type
• Limited Context Window: Retrieved context may not capture long-range dependencies
Future improvements:
• Multi-modal Processing: Incorporate OCR and image understanding for visual content
• Advanced Chunking: Implement sliding window and semantic boundary detection
• Query Optimization: Improve query understanding and expansion
• Evaluation Framework: Develop comprehensive benchmarks for presentation QA
• Fine-tuning: Domain-specific fine-tuning of embedding and generation models