Abstract
"Quest" is a Retrieval-Augmented Generation (RAG) system designed to tackle complex coding queries with improved speed, reasoning, and user experience. Built from scratch using Flask, HNSW indexing, and models such as Qwen and DeepSeek-R1 7B, the project integrates a custom RAG engine, a memory buffer, and a responsive frontend. Over a 12-day development period (January 16–28, 2025), the system evolved from a proof of concept with one-minute query times to an optimized version delivering retrieval responses in under 15 seconds and reasoning responses in under 4 minutes. Key features include fast query retrieval via hashing, dynamic prompt switching, and a metadata-rich dataset of 1850 coding questions. The project demonstrates significant gains in inference speed, reasoning quality, and UI design, making it a robust tool for interactive problem-solving.
Introduction
The rapid evolution of AI-driven systems has opened new avenues for solving domain-specific problems, particularly in coding and technical education. Many existing solutions, however, lack efficiency, context retention, or user-friendly interfaces, leading to suboptimal user experiences. "Quest" addresses these gaps by combining Retrieval-Augmented Generation (RAG) with advanced reasoning capabilities and a tailored frontend. Initiated on January 16, 2025, the project set out to build a fully functional RAG system from scratch, leveraging open-source tools such as Ollama, Flask, and Bootstrap alongside the Qwen and DeepSeek-R1 7B models. The primary goals were to optimize inference speed, enhance response quality, and provide a seamless user experience. This write-up details the methodology, experiments, results, and conclusions drawn from that intensive development journey.
Methodology
The development of "Quest" followed an iterative approach, integrating multiple components into a cohesive system. The methodology can be broken down into the following key areas:
RAG Engine Design:
Built a custom RAG pipeline using Ollama for model inference, incorporating the Qwen and DeepSeek-R1 7B models.
Implemented dynamic prompt switching to adapt to query complexity, and introduced a bypass mechanism for low-confidence retrievals (confidence < 0.6).
Optimized retrieval using HNSW indexing with fine-tuned parameters and hashing for exact-match queries.
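The hashing shortcut and the low-confidence bypass described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class and method names are hypothetical, and only the 0.6 cutoff comes from the write-up.

```python
import hashlib


class RetrievalCache:
    """Exact-match shortcut: hash the normalized query and serve a cached
    answer before falling back to the slower HNSW vector search."""

    def __init__(self, confidence_threshold=0.6):
        # 0.6 is the bypass cutoff quoted in the write-up
        self.threshold = confidence_threshold
        self._store = {}  # query digest -> cached response

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, response):
        self._store[self._key(query)] = response

    def should_bypass(self, confidence):
        # Below the threshold, skip RAG context and answer with the bare model
        return confidence < self.threshold
```

Hashing the normalized query makes repeated or trivially rephrased exact-match lookups O(1), which is where the latency win for known queries comes from.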
Memory Buffer:
Developed a context retention system to store and retrieve the last K interactions, improving response coherence over multiple queries.
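A minimal sketch of such a buffer, assuming a fixed-size deque holds the last K turns (names are illustrative, not the actual implementation):

```python
from collections import deque


class MemoryBuffer:
    """Retains the last K (query, response) turns for prompt context."""

    def __init__(self, k=5):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def add(self, query, response):
        self.turns.append((query, response))

    def as_context(self):
        # Flatten the history into a transcript prepended to the next prompt
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)
```

Bounding the buffer at K turns keeps the prompt length, and therefore inference time, roughly constant as a conversation grows.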
Dataset Generation:
Created an on-device dataset of 1850 coding questions using the Qwen Coder model, enriched with metadata (e.g., difficulty, topics, solutions, companies).
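The metadata fields listed above suggest a record shape along these lines (a hypothetical sketch; the real dataset schema may differ):

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CodingQuestion:
    """One record in the generated dataset; field names are illustrative."""
    question: str
    difficulty: str               # e.g. "easy" / "medium" / "hard"
    topics: list                  # e.g. ["arrays", "hashing"]
    solution: str
    companies: list = field(default_factory=list)

    def to_json(self):
        # Serialize one record for on-device storage
        return json.dumps(asdict(self))
```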
Reasoning Optimization:
Integrated the DeepSeek-R1 7B model for complex reasoning, refining prompt templates to reduce inference time and improve consistency.
Frontend Development:
Designed a responsive UI using Flask, Bootstrap, and JavaScript, featuring interactive elements like a query box, stop button, and theme options.
Performance Tuning:
Employed techniques like hashmapping, partial/exact metadata matching, and inference speed optimization to enhance runtime efficiency.
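Partial/exact metadata matching could look like the following sketch, where scalar fields (e.g. difficulty) must match exactly and list fields (e.g. topics) match if the requested values are a subset. Field names and semantics here are assumptions, not the project's actual code.

```python
def matches(record, filters):
    """Exact match on scalar fields; partial (subset) match on list fields."""
    for key, wanted in filters.items():
        have = record.get(key)
        if isinstance(have, (list, set, tuple)):
            # Partial match: every requested value must be present
            if not set(wanted) <= set(have):
                return False
        elif have != wanted:
            # Exact match for scalars such as difficulty
            return False
    return True
```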
Experiments
The development process involved extensive experimentation to refine the components of "Quest":
Inference Speed Optimization:
Benchmarked the initial RAG engine (Version 1) at roughly 60-second query times; a custom architecture and hashing brought Version 2 down to 45 seconds and Version 3 to 15 seconds.
HNSW Indexing:
Experimented with HNSW parameters to improve retrieval accuracy and speed, comparing it against baseline retrieval methods.
Prompt Engineering:
Tested multiple prompt variations for Qwen and Deepseek models, introducing time constraints to prevent infinite reasoning loops.
Developed a hybrid prompt for reasoning and normal modes, evaluated on response quality and runtime.
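A hybrid prompt selector in the spirit described above might be sketched like this; the keyword heuristic and the templates are placeholders for whatever complexity signal and wording the engine actually uses.

```python
REASONING_PROMPT = (
    "Think step by step, but finish within the time budget. "
    "If still unsure, state your best answer.\n\nQuestion: {query}"
)
NORMAL_PROMPT = "Answer concisely.\n\nQuestion: {query}"

# Keyword heuristic standing in for a real query-complexity signal
COMPLEX_HINTS = ("prove", "optimize", "complexity", "design", "why")


def build_prompt(query):
    """Return (mode, rendered prompt) for the given query."""
    is_complex = any(w in query.lower() for w in COMPLEX_HINTS)
    mode = "reasoning" if is_complex else "normal"
    template = REASONING_PROMPT if is_complex else NORMAL_PROMPT
    return mode, template.format(query=query)
```

The explicit time-budget instruction in the reasoning template mirrors the time constraints introduced to prevent infinite reasoning loops.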
UI Enhancements:
Explored Tailwind, Material UI, and Bootstrap, ultimately selecting Bootstrap with CDN-based icons for simplicity and responsiveness.
Added features like multiline query support, stop generation, and copy buttons, testing usability across layouts.
Reasoning Model:
Compared DeepSeek-R1 7B against lighter models, experimenting with a two-step reasoning function and a bypass mechanism for unknown queries.
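The two-step reasoning function with its bypass could be orchestrated as below. The callables are injected so the control flow can be run without a live model; the prompts are placeholders, and only the 0.6 threshold is taken from the write-up.

```python
def answer(query, retrieve, reason, fallback, threshold=0.6):
    """Two-step reasoning with a bypass for unknown queries.

    retrieve(query) -> (context, confidence)
    reason(prompt)  -> model response
    fallback(query) -> bare-model response without RAG context
    """
    context, confidence = retrieve(query)
    if confidence < threshold:
        # Bypass: retrieval found nothing relevant, skip RAG context
        return fallback(query)
    # Step 1: sketch an approach grounded in the retrieved context
    outline = reason(f"Context:\n{context}\n\nOutline an approach to: {query}")
    # Step 2: expand the outline into a full answer
    return reason(f"Expand this outline into a complete answer:\n{outline}")
```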
Edge Case Testing:
Evaluated the system on ambiguous, complex, and unfamiliar queries to ensure robustness and fallback mechanisms worked effectively.
Results
Development of "Quest" yielded significant improvements across performance, usability, and functionality:
Speed Enhancements:
Query retrieval time reduced from 60 seconds (Version 1) to 15 seconds (Version 3), a 75% improvement.
Reasoning time for DeepSeek-R1 7B decreased from 5–20 minutes to under 4 minutes, an 80% reduction in the worst case.
Retrieval Accuracy:
HNSW indexing and metadata optimization enabled fast, accurate retrieval, with hashing cutting latency for known queries.
Response Quality:
Integration of DeepSeek-R1 7B and refined prompts improved reasoning depth and consistency, with responses rated "quite good" by January 20.
User Experience:
The Bootstrap-based UI, with features like stop buttons and history management, provided a smooth, interactive experience across themes.
Scalability:
The system handled 1850 coding questions with metadata, supporting diverse query types and edge cases effectively.
Codebase:
A well-structured GitHub repository (https://github.com/udit-rawat/Quest) was delivered, showcasing clean, extensible code.
Conclusion
"Quest" represents a successful effort to build a high-performance RAG system from scratch in under two weeks. Starting as a learning exercise on January 16, 2025, it evolved into a polished tool by January 28, 2025, with optimized inference (15-second retrieval, 4-minute reasoning), a metadata-rich dataset, and an intuitive frontend. The project underscores the value of iterative development and of balancing trade-offs between speed, quality, and usability. While early challenges such as prompt leakage and memory-buffer bugs slowed progress, solutions including HNSW indexing, hybrid prompts, and hashmapping transformed "Quest" into a reliable, efficient system. Future work could focus on further reducing reasoning time, expanding the dataset, and integrating additional models. The journey delivered not only a functional product but also invaluable insights into RAG systems, UI design, and AI optimization.