GPTScript Documentation Crawler & RAG Agent: A Technical Product Showcase
Product Overview
GPTScript Documentation Crawler & RAG Agent is an advanced AI-powered tool designed to automate the process of crawling technical documentation, storing relevant content in a vector database, and enabling intelligent retrieval and question-answering. Built with Pydantic AI and Supabase, this system provides a scalable and efficient solution for developers, researchers, and organizations seeking streamlined access to complex documentation resources.
By leveraging Retrieval-Augmented Generation (RAG) models, the system ensures accurate and contextually relevant responses to user queries. The combination of OpenAI embeddings, semantic search, and a user-friendly interface makes this tool a powerful addition to AI-driven documentation search and retrieval solutions.
Features
- Automated Documentation Crawling: Extracts, chunks, and processes documentation from various sources while maintaining structural integrity.
- Vector Database Storage: Uses Supabase as a scalable and optimized backend for storing embeddings and document metadata.
- Semantic Search with OpenAI Embeddings: Enables intelligent, context-aware lookup of relevant documentation sections, significantly improving search efficiency.
- RAG-based Q&A System: Employs Retrieval-Augmented Generation to provide precise and contextually accurate answers to user queries.
- Preserves Code Blocks & Formatting: Ensures that retrieved documentation retains its original structure, including syntax highlighting and paragraph integrity.
- Modern UI with Streamlit: Offers an interactive and intuitive querying experience, making documentation searches seamless for users.
- Fast and Scalable Processing: Efficient indexing and retrieval mechanisms allow for quick searches across large documentation sets.
- Continuous Updates: Automatically refreshes stored documentation at regular intervals to keep information up to date.
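Semantic search works by comparing embedding vectors rather than keywords. The core similarity measure is cosine similarity; the following is an illustrative helper, not code from this repository:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice the vector database performs this comparison server-side over stored embeddings, returning the closest documentation chunks for a query embedding.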
Use Cases
Developer Assistance
Developers often need quick access to API references, function definitions, and implementation examples. This tool eliminates the need for manual searching by providing intelligent, context-aware retrieval, saving time and effort.
AI-powered Knowledge Retrieval
Organizations dealing with extensive technical documentation, such as software firms and research institutions, can integrate this system into their internal knowledge bases. AI-assisted support enhances productivity and reduces response times for technical inquiries.
Automated Documentation Analysis
By automating the process of parsing and structuring documentation, this system improves accessibility and usability for developers, support teams, and researchers who need accurate information on demand.
Customer Support Enhancement
Integrating the tool into customer support systems allows for intelligent, automated responses to technical questions, reducing human effort while maintaining accuracy.
Technical Specifications
Prerequisites
To run the system, ensure you have the following:
- Python 3.11+
- Supabase account & database setup
- OpenAI API Key
- Streamlit for UI-based querying
Installation
Clone the repository and set up a virtual environment:
git clone https://github.com/ambruhsia/GPTScript-webCrawler-RAG.git
cd GPTScript-webCrawler-RAG
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Database Configuration
To set up the vector database in Supabase, follow these steps:
- Log in to your Supabase account.
- Navigate to the SQL Editor.
- Paste and execute the following SQL script:
-- SQL script to create necessary tables and enable vector similarity search
- Ensure that indexing is correctly configured for efficient search and retrieval.
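As a rough sketch of what such a script typically looks like for a Supabase (pgvector) setup, see below. Table and column names here are hypothetical; use the script shipped with the repository, and adjust the embedding dimension to the OpenAI model you choose.

```sql
-- Hypothetical schema sketch; the repository's own SQL script is authoritative.
create extension if not exists vector;

create table if not exists site_pages (
    id bigserial primary key,
    url text not null,
    chunk_number integer not null,
    title text,
    content text not null,
    metadata jsonb default '{}'::jsonb,
    embedding vector(1536)  -- dimension of OpenAI's text-embedding-3-small
);

-- Approximate nearest-neighbour index for cosine-similarity search.
create index if not exists site_pages_embedding_idx
    on site_pages using ivfflat (embedding vector_cosine_ops);
```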
Crawling Documentation
To fetch and store documentation, run:
python crawl_gptscript_docs.py
This command will:
- Crawl the specified documentation websites.
- Chunk content while preserving code blocks and paragraph boundaries.
- Generate vector embeddings using OpenAI's API and store them in Supabase.
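The chunking step above is the subtle part: naive fixed-size splitting can cut a code block in half. A minimal sketch of fence-aware chunking (a hypothetical helper, not the repository's actual implementation):

```python
import re

def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown into chunks, never breaking inside a fenced code block."""
    # Capture fenced code blocks as indivisible parts.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    chunks: list[str] = []
    current = ""
    for part in parts:
        if not part:
            continue
        if current and len(current) + len(part) > max_chars:
            chunks.append(current.strip())
            current = ""
        if part.startswith("```"):
            current += part  # keep the whole code fence in one chunk
        else:
            # Plain prose may be split on paragraph boundaries.
            for para in part.split("\n\n"):
                if current and len(current) + len(para) > max_chars:
                    chunks.append(current.strip())
                    current = ""
                current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each resulting chunk is then embedded and stored with its source URL and position, so retrieval can reconstruct context around a match.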
Querying via Streamlit UI
Launch the interactive UI for querying stored documentation:
streamlit run streamlit_ui.py
This provides a user-friendly interface for entering queries and retrieving relevant documentation snippets.
Architecture & Implementation
System Architecture
The system follows a modular architecture built from the following components and cross-cutting concerns:
- Crawler Module: Uses BeautifulSoup and requests to scrape documentation pages.
- Text Processing & Chunking: Splits long-form documentation into manageable chunks while maintaining contextual integrity.
- Embedding Generation: Converts textual content into vector embeddings using OpenAI models.
- Vector Database (Supabase): Stores processed embeddings and metadata for fast retrieval.
- Query Engine: Accepts user input, retrieves relevant embeddings, and generates responses using a RAG-based approach.
- Streamlit UI: Provides an interactive interface for users to enter queries and view responses.
- Rate Limiting & API Throttling: Ensures compliance with OpenAI API rate limits and prevents excessive load on the database.
- Data Privacy: Secures stored documentation and user queries with encryption and access controls.
- Scalability: Designed to handle growing documentation datasets with optimized indexing and efficient query execution.
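The query engine's embed → retrieve → generate flow can be sketched as a single function. The callables here are injected stand-ins (for the OpenAI embeddings API, the Supabase vector search, and the LLM call respectively); this is an illustration of the RAG pattern, not the repository's actual code:

```python
def answer_query(query, embed, search, generate, top_k=4):
    """RAG query flow: embed the query, retrieve matching chunks,
    then generate an answer grounded in the retrieved context.

    embed(query) -> vector; search(vector, k) -> list of chunk dicts;
    generate(prompt) -> answer string.
    """
    query_vector = embed(query)              # query -> embedding vector
    documents = search(query_vector, top_k)  # nearest chunks from the store
    context = "\n\n".join(doc["content"] for doc in documents)
    prompt = (
        "Answer using only the documentation below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Injecting the three stages also makes the flow easy to unit-test with stubs before wiring in live API calls.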
Usage & Integration
This tool can be integrated into various AI-driven applications where documentation retrieval and intelligent analysis are required. By leveraging OpenAI embeddings and Supabase, users gain a scalable, efficient, and intelligent search experience.
Integration Possibilities
- Embedding into developer portals for intelligent documentation lookup.
- Enhancing chatbots with AI-powered documentation responses.
- Streamlining internal knowledge management systems for technical teams.
Future Enhancements
- Multi-language Support: Expanding language models for broader accessibility.
- Custom Embedding Models: Allowing users to fine-tune their embeddings for domain-specific queries.
- Expanded Data Sources: Supporting more document formats like PDFs, Markdown, and structured databases.
- Improved UI/UX: Enhancing the Streamlit interface with additional features like filtering and ranking of results.
References & Acknowledgments
This project is adapted from ottomator-agents. We appreciate the contributions of the original developers in shaping this documentation crawler and RAG framework. Their work has been instrumental in building an effective and scalable solution for AI-powered documentation retrieval.