This document describes a compact, production-ready Retrieval-Augmented Generation (RAG) system that answers user queries based on an indexed corpus of documents. The system uses chunked documents stored in a vector database for retrieval, a strong system prompt for the Groq LLM, and a short-term memory component that stores previous interactions (responses & context) to maintain conversation state.
A RAG system augments an LLM with a retrieval step: relevant document passages are retrieved from a vector DB and passed to the LLM so that answers are grounded in the supplied corpus. Key goals for this simple system:
- Accurate, citation-friendly answers drawn from the indexed documents.
- A strong system prompt to constrain Groq LLM behavior (tone, citation style, hallucination mitigation).
- Short-term memory that stores recent user messages and LLM outputs to improve coherence across turns.
- Scalable vector storage of chunked document embeddings.
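
As a toy end-to-end illustration of that flow, here is a minimal sketch: keyword overlap stands in for real embedding similarity, and the assembled prompt is printed rather than sent to the LLM.

```python
# Toy end-to-end RAG flow: retrieve relevant chunks, then ground the prompt in them.
# Keyword overlap stands in for embedding similarity; the final LLM call is omitted.

SYSTEM_PROMPT = "Answer only from the provided context. Cite chunk ids like [1]."

corpus = [
    {"id": 1, "text": "Groq provides low-latency LLM inference."},
    {"id": 2, "text": "RAG grounds answers in retrieved document chunks."},
    {"id": 3, "text": "Vector databases support approximate nearest-neighbour search."},
]

def score(query: str, text: str) -> float:
    """Crude relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

def retrieve(query: str, k: int = 2) -> list[dict]:
    return sorted(corpus, key=lambda c: score(query, c["text"]), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in retrieve(query))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does RAG ground its answers?"))
```

The sections below break this flow into the components a production deployment would actually use.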
## API Gateway
- Receives user requests (chat or single-turn query).
- Handles authentication, rate limiting, and telemetry.
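
A minimal sketch of the gateway's query endpoint, assuming FastAPI (not mandated by this document) and a hypothetical `answer_query` hook into the rest of the pipeline; real authentication, rate limiting, and telemetry would replace the placeholder key check.

```python
# API gateway sketch (FastAPI assumed; answer_query is a hypothetical downstream call).
from typing import Optional

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_KEYS = {"demo-key"}  # placeholder; a real deployment would use a secret store / auth provider

class QueryRequest(BaseModel):
    query: str
    session_id: Optional[str] = None  # lets the memory component track the conversation

def answer_query(query: str, session_id: Optional[str]) -> str:
    """Hypothetical hook into the rest of the pipeline (retriever + prompt builder + Groq call)."""
    return f"(stub) answer for: {query}"

@app.post("/query")
def handle_query(req: QueryRequest, x_api_key: str = Header(default="")) -> dict:
    if x_api_key not in API_KEYS:        # crude authentication; rate limiting/telemetry omitted
        raise HTTPException(status_code=401, detail="invalid API key")
    return {"answer": answer_query(req.query, req.session_id)}
```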
## Ingestion Pipeline
- Accepts raw documents (PDF, HTML, plain text).
- Normalizes them, splits them into chunks, computes embeddings, and writes the chunks to the vector DB with metadata.
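
A sketch of the chunking and embedding step, assuming `sentence-transformers` as the embedding model (the document does not mandate a specific one) and a hypothetical `vector_store.add` interface like the one sketched under the Retriever below.

```python
# Ingestion sketch: normalize text, split into overlapping chunks, embed, and store with metadata.
# sentence-transformers is an assumed choice; any embedding model with an encode step works.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model (assumed choice)

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap so content spanning a boundary stays retrievable."""
    text = " ".join(text.split())        # normalize whitespace
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(doc_id: str, text: str, vector_store) -> None:
    chunks = chunk_text(text)
    embeddings = model.encode(chunks)    # one vector per chunk
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        # vector_store.add is the hypothetical interface sketched under the Retriever below
        vector_store.add(id=f"{doc_id}-{i}", embedding=emb, text=chunk,
                         metadata={"doc_id": doc_id, "chunk_index": i})
```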
## Vector Database
- Stores chunk embeddings plus metadata. Supports k-NN search (ANN) and filtering by metadata.
## Retriever
- Executes semantic search using the query embedding; returns the top-K chunks with relevance scores.
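
A minimal in-memory stand-in for both the vector store and the retriever, using exact cosine-similarity search with optional metadata filtering; a production deployment would swap this for an ANN-backed vector database with an equivalent interface.

```python
# In-memory stand-in for the vector DB + retriever: exact cosine k-NN with metadata filtering.
from typing import Optional

import numpy as np

class InMemoryVectorStore:
    def __init__(self):
        self.records: list[dict] = []

    def add(self, id: str, embedding, text: str, metadata: dict) -> None:
        vec = np.asarray(embedding, dtype=np.float32)
        vec = vec / (np.linalg.norm(vec) + 1e-12)   # normalize at write time so dot product == cosine
        self.records.append({"id": id, "vec": vec, "text": text, "metadata": metadata})

    def search(self, query_embedding, k: int = 5, where: Optional[dict] = None) -> list[dict]:
        """Top-K chunks by cosine similarity, optionally filtered by metadata (e.g. {"doc_id": "doc1"})."""
        q = np.asarray(query_embedding, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-12)
        candidates = [r for r in self.records
                      if not where
                      or all(r["metadata"].get(field) == value for field, value in where.items())]
        scored = [{"id": r["id"], "text": r["text"], "metadata": r["metadata"],
                   "score": float(np.dot(q, r["vec"]))}
                  for r in candidates]
        return sorted(scored, key=lambda s: s["score"], reverse=True)[:k]
```

The retriever side then simply embeds the incoming query with the same model used at ingestion and calls `search(...)`, passing the relevance scores downstream so the prompt builder can drop low-scoring chunks.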
## Prompt Builder
- Builds the final prompt from the system prompt, retrieved contexts, recent memory, and the user query.
- Applies length / token budget management (truncating or dropping the less relevant items first).
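
A sketch of prompt assembly under a token budget, using a crude whitespace token estimate as a stand-in for the model's real tokenizer.

```python
# Prompt assembly sketch: system prompt + retrieved context + recent memory + user query,
# trimmed to a token budget by keeping the highest-scoring context chunks first.
def estimate_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer count

def build_prompt(system_prompt: str, contexts: list[dict], memory: list[str],
                 user_query: str, budget: int = 3000) -> str:
    fixed = [system_prompt, *memory, user_query]
    remaining = budget - sum(estimate_tokens(t) for t in fixed)
    kept = []
    for ctx in sorted(contexts, key=lambda c: c["score"], reverse=True):
        cost = estimate_tokens(ctx["text"])
        if cost <= remaining:                      # keep high-score chunks while budget allows
            kept.append(f"[{ctx['id']}] {ctx['text']}")
            remaining -= cost
    context_block = "\n\n".join(kept) or "(no context fit the budget)"
    history = "\n".join(memory) or "(no prior turns)"
    return (f"{system_prompt}\n\n# Context\n{context_block}\n\n"
            f"# Recent conversation\n{history}\n\n# User question\n{user_query}")
```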
## Groq LLM Connector
- Sends the assembled prompt to the Groq LLM and receives the response.
- Handles retries and timeouts.
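
A sketch of the connector, assuming the official `groq` Python SDK (OpenAI-style chat-completions interface); the model name, timeout, and backoff policy are illustrative choices, not requirements.

```python
# Groq connector sketch: send the assembled prompt, retry on transient failures.
# Assumes the official `groq` SDK; model name and retry policy are illustrative.
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def call_groq(system_prompt: str, prompt: str, model: str = "llama-3.1-8b-instant",
              max_retries: int = 3, timeout: float = 30.0) -> str:
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": prompt}],
                timeout=timeout,                 # per-request timeout (assumed SDK option)
            )
            return resp.choices[0].message.content
        except Exception:
            if attempt == max_retries - 1:
                raise                            # give up after the last attempt
            time.sleep(2 ** attempt)             # exponential backoff: 1s, 2s, ...
```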
## Short-Term Memory
- Stores recent interactions in a fast store; prior responses are included in subsequent prompts.
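
A minimal sketch of the memory component, keeping the last few turns per session in process memory; a production system would typically back this with Redis or another fast external store behind the same interface.

```python
# Short-term memory sketch: keep the last N turns per session so they can be
# folded into the next prompt. An in-process dict stands in for a fast store like Redis.
from collections import defaultdict, deque

class ShortTermMemory:
    def __init__(self, max_turns: int = 5):
        self.sessions = defaultdict(lambda: deque(maxlen=max_turns))

    def add_turn(self, session_id: str, user_message: str, llm_response: str) -> None:
        self.sessions[session_id].append((user_message, llm_response))

    def recent(self, session_id: str) -> list[str]:
        """Format recent turns as lines the prompt builder can drop straight into the prompt."""
        return [f"User: {u}\nAssistant: {a}" for u, a in self.sessions[session_id]]

memory = ShortTermMemory()
memory.add_turn("s1", "What is RAG?", "RAG retrieves document chunks and grounds answers in them.")
print(memory.recent("s1"))
```

The formatted turns from `recent()` can be passed directly as the `memory` argument of the prompt-builder sketch above.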
## Post-Processor
- Adds citations, enforces safety rules, and optionally extracts structured answers.
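
One post-processing step is sketched below: verifying that bracketed citations in the model's answer refer to chunks that were actually retrieved, and attaching their source metadata (safety filtering and structured extraction would be additional steps).

```python
# Post-processing sketch: check that bracketed citations like [doc1-0] refer to chunks that
# were actually retrieved, and attach their source metadata to the response.
import re

def attach_citations(answer: str, retrieved: list[dict]) -> dict:
    by_id = {str(c["id"]): c for c in retrieved}
    cited = set(re.findall(r"\[(\w+(?:-\w+)*)\]", answer))   # ids the model claims to cite
    valid = {cid: by_id[cid]["metadata"] for cid in cited if cid in by_id}
    unknown = sorted(cited - set(valid))                     # possibly hallucinated citations
    return {"answer": answer, "citations": valid, "unverified_citations": unknown}

retrieved = [{"id": "doc1-0", "text": "...", "metadata": {"doc_id": "doc1", "chunk_index": 0}}]
print(attach_citations("RAG grounds answers in retrieved chunks [doc1-0].", retrieved))
```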