AAIDC Module 3 โ Agentic AI in Production.
๐ Live app: https://finsightagent.tech ยท API: https://api.finsightagent.tech/api/v1/health
๐ฆ Repository: https://github.com/phanminhtai23/finsight-agentic-production
Builds on the Module 2 multi-agent system: https://github.com/phanminhtai23/finsight-multi-agent
FinSight is a multi-agent financial research assistant that answers questions about any company โ from uploaded documents or live financial sources โ with every claim cited back to its source. This document describes Project 3: taking the system from "runs on my machine" to "safe to operate in production". It covers the six hardening layers added (reliability, safety, observability, health probes, rate limiting, secure deployment), how to run the system end-to-end, API reference, troubleshooting, and the evaluation results. Readers should leave with a clear picture of the problem solved, the engineering choices made, and enough detail to reproduce and extend the work.
FinSight is a multi-agent financial research assistant (LangGraph supervisor + Retrieval,
Market Research, Analyst, Writer, Critic agents over a Qdrant RAG layer, MCP tools, and grounded
citations). Project 2 made it work. Project 3 makes it operable โ I added the layers a real
deployment needs: reliability (retries/timeouts around every model call), a safety guardrail
layer (prompt-injection defense, PII redaction, advice disclaimer), observability (correlation
ids, structured logs, Prometheus metrics, a consistent error envelope), health/readiness probes,
per-user rate limiting, fail-fast secure config, CI/CD, a hardened container, and an
offline adversarial safety evaluation. 69 automated tests gate the system.
A demo that calls an LLM and returns an answer is not a product. In production the model provider
rate-limits you, users send adversarial or sensitive input, dependencies go down, and operators need
to see what happened when something breaks. Project 3 is about closing exactly that gap on top of
an existing agentic system โ turning FinSight from "runs on my machine" into "safe to operate".
Every LLM and embedding call is wrapped in bounded retry + exponential backoff with jitter +
per-attempt timeout, retrying only transient failures (429 / 5xx / timeout / connection drops)
and re-raising the rest. Streaming retries only before the first token, so a blip at connect
time recovers without ever duplicating mid-stream output.
โ app/core/resilience.py, wired in app/core/llm.py.
A fast, dependency-free layer that runs before and after the model:
app/core/guardrails.py, applied on both the streaming chat and /ask paths.X-Request-ID./metrics โ request rate/latency, LLM calls & retries, guardrail blocks,{"error": {code, message, request_id}}; never a raw stack trace.app/core/middleware.py, app/core/metrics.py, app/core/errors.py./health (liveness) and /readiness โ the latter actively checks Postgres, Redis and Qdrant
concurrently and returns 503 degraded if any dependency is unreachable, so an orchestrator only
sends traffic when the system can actually serve it.
A Redis fixed-window limiter keyed per authenticated user on the expensive chat endpoint. It
fails open if Redis is unavailable (availability over enforcement) and emits a metric on every
block. โ app/core/ratelimit.py.
Deleting a topic or document now removes all three copies of the data atomically: vectors from
Qdrant, relational rows from Postgres, and the raw file from Cloudinary (or local disk when
Cloudinary is not configured). The cloudinary_public_id stored on every Document row drives the
deletion; if it is absent (local-disk fallback), the cloud step is skipped safely.
โ app/rag/ingestion/storage.py (FileStorage.delete), wired in app/services/topic_service.py.
ENVIRONMENT=prod with a default/weak JWT_SECRET or a missingGOOGLE_API_KEY (fail-fast validation in config.py).backend/Dockerfile.prod): multi-stage, non-root, container HEALTHCHECK,docker-compose.prod.yml (built images, restart policies).ci.yml runs automated tests on every push for bothdeploy.yml then auto-deploys the backend to theflowchart TD U["๐ฅ๏ธ React UI"] -->|"REST ยท SSE ยท WebSocket"| MW["๐งฑ Request middleware<br/>correlation-id ยท access log ยท metrics"] MW --> RL{"โฑ๏ธ Rate limit<br/>+ ๐ก๏ธ Guardrails"} RL -->|"allowed"| API["โก FastAPI (SOLID)"] RL -->|"blocked / refused"| U API --> SUP["๐ค LangGraph supervisor + agents<br/>Retrieval ยท Research ยท Analyst ยท Writer ยท Critic"] SUP -->|"retry + timeout"| LLM["๐ง Gemini (chat + embeddings)"] SUP --> QD[("Qdrant")] SUP -->|"MCP client"| MCP["๐ MCP tools"] API --> PG[("Postgres")] API --> RD[("Redis")] API -.->|"/metrics"| PROM["๐ Prometheus"] API -.->|"trace"| LS["๐ LangSmith"] HC["โค๏ธ /health ยท /readiness"] --> PG & RD & QD
Guardrails and rate limiting sit at the edge; reliability wraps the model boundary; observability
spans the whole request. Full map + verify-steps in PRODUCTION.md.
| Dimension | Tooling | Result |
|---|---|---|
| Answer quality vs no-RAG baseline | evals/run_eval.py (LangSmith) | expected-recall, citation coverage, LLM-judge groundedness |
| Safety (adversarial) | evals/run_safety_eval.py โ offline, no API | injection block 5/5, benign false-positive 0/3, PII redaction 2/2 |
| Regression gates | pytest โ 69 tests | reliability, guardrails, rate-limit, health, error-envelope, prod-config |
The safety eval is a labelled adversarial set (evals/safety_dataset.py) and doubles as a CI gate
(tests/test_safety_eval.py), so a future change that weakens the guardrail fails the build.
cp .env.example .env # set GOOGLE_API_KEY (free: aistudio.google.com/apikey) docker compose up -d --build # postgres, qdrant, redis, mcp, api, worker docker compose exec api alembic upgrade head # verify the hardening docker compose exec api pytest -q # 69 tests docker compose exec api python -m evals.run_safety_eval curl localhost:8000/api/v1/readiness # {"status":"ready","dependencies":{...}} curl localhost:8000/metrics | grep finsight_ # Prometheus metrics # production-style run (enforces secure config, non-root image, restart policies) docker compose -f docker-compose.prod.yml --env-file .env up -d --build
A ready-made report ships at samples/sample_financial_report.docx for an end-to-end chat demo
(see the Quick demo section of README.md).
The React + Vite + TypeScript frontend is live at https://finsightagent.tech and ships with the Docker Compose stack.
| Feature | Description |
|---|---|
| Conversation sidebar | Create, switch between, and delete independent conversations; each has its own LangGraph thread and persisted memory. |
| Topic pinning | Pin one or more uploaded document topics to a conversation so the RAG layer scopes retrieval to the right files. |
| Streaming chat | Tokens stream in real time over SSE/WebSocket; a thinking toggle surfaces the agent's step-by-step reasoning before the final answer. |
| Citations | Every factual claim renders as an inline [n] link that deep-links to the exact page in the source document (Cloudinary-hosted). |
| Charts | The Analyst agent produces Chart.js bar, line and pie charts inline; charts persist on page reload. |
| Document upload | Drag-and-drop or file picker; ingestion progress streams live via WebSocket; status transitions queued โ processing โ ready. |
| Dark mode | One-click toggle; preference is persisted in localStorage. |
| User profile & tiers | Register/login with email; profile page shows the active plan (Free), usage stats, and account settings. |
| Rate-limit feedback | When the rate limiter activates the UI shows a clear message with a countdown rather than a silent failure. |
The backend exposes a RESTful API at http://localhost:8000 (production: https://api.finsightagent.tech). Interactive docs are at /docs (Swagger UI) and /redoc.
All endpoints except /api/v1/health and /api/v1/readiness require a Bearer token obtained via POST /api/v1/auth/login.
Authorization: Bearer <token>
| Method | Path | Description |
|---|---|---|
GET | /api/v1/health | Liveness probe โ returns {"status":"ok"} while the process is up. |
GET | /api/v1/readiness | Readiness probe โ checks Postgres, Redis and Qdrant; returns 503 with a degraded status if any dependency is down. |
GET | /metrics | Prometheus metrics endpoint โ request rate/latency, LLM calls, guardrail blocks, rate-limit hits. |
POST | /api/v1/auth/register | Register a new user {"email", "password"}. |
POST | /api/v1/auth/login | Obtain a JWT token {"email", "password"} โ {"access_token", "token_type"}. |
GET | /api/v1/conversations | List the authenticated user's conversations. |
POST | /api/v1/conversations | Create a conversation {"title"}. |
DELETE | /api/v1/conversations/{id} | Delete a conversation and all its messages. Returns 204. |
GET | /api/v1/conversations/{id}/messages | List messages for a conversation. |
POST | /api/v1/conversations/{id}/messages | Send a message (streaming SSE) {"content", "topic_ids"}. |
GET | /api/v1/topics | List the user's document topics. |
POST | /api/v1/topics | Create a topic {"name"}. |
POST | /api/v1/topics/{id}/documents | Upload a document (multipart/form-data); triggers async ingestion. |
GET | /api/v1/documents/{id} | Ingestion status (queued / processing / ready / failed). |
WS | /api/v1/ws/conversations/{id} | WebSocket โ streams tokens and background-task progress events. |
All 4xx/5xx responses share the same JSON structure so clients handle errors uniformly:
{ "error": { "code": "rate_limit_exceeded", "message": "Too many requests โ retry after 42 s.", "request_id": "a1b2c3d4" } }
The request_id matches the X-Request-ID response header and appears in every log line for that request.
Full step-by-step instructions are in DEPLOY.md. Summary:
| Component | Platform |
|---|---|
| Frontend | Vercel (Hobby) โ git push triggers auto-deploy via frontend/vercel.json |
| Backend + workers + datastores | DigitalOcean droplet (Ubuntu 24.04, 4 GB / 2 vCPU) โ Docker Compose (docker-compose.prod.yml) behind Nginx + Let's Encrypt TLS |
| CI/CD | GitHub Actions โ lint + test + build on every push; auto-deploy to the droplet over SSH when main is green |
# One-command production start (hardened image, restart policies, non-root) docker compose -f docker-compose.prod.yml --env-file .env up -d --build docker compose -f docker-compose.prod.yml exec api alembic upgrade head
/readinessOne or more dependencies (Postgres, Redis, Qdrant) are unreachable. Check:
docker compose ps # are all services running? docker compose logs postgres # look for init/auth errors docker compose logs qdrant docker compose logs redis
Run curl localhost:8000/api/v1/readiness โ the response body names the failing dependency:
{"status": "degraded", "dependencies": {"postgres": "ok", "redis": "error", "qdrant": "ok"}}
Gemini's free tier has per-minute quotas. The UI displays the remaining wait time. Alternatively, wait ~60 s and retry, or set RATE_LIMIT_ENABLED=false in .env to disable per-user limiting during development.
processingdocker compose logs -f workerCLOUDINARY_* keys are set (or uploads fall back to local disk โ verify uploads/ is writable).QDRANT_COLLECTION_SHARD_NUM or add swap./metrics is served on the same port as the API. If using Nginx, ensure your location block proxies /metrics to the backend (see nginx.nginx for the reference config).
Run inside the Docker container to match the installed environment:
docker compose exec api pytest -q
Or locally, install with pip install -e ".[dev]" from backend/ before running pytest.
JWT_SECRET is too weak and the app won't bootSet a strong secret in .env: JWT_SECRET=$(openssl rand -hex 32). The app deliberately refuses to start in ENVIRONMENT=prod with a default or short secret.
FinSight is a research aid, not financial advice: investment-style answers carry an automatic
disclaimer, every factual claim is citation-backed, user data is per-user scoped, and PII is redacted
before logging. Intended use, limitations and risk mitigations are documented in
MODEL_CARD.md.
Repository: https://github.com/phanminhtai23/finsight-agentic-production ยท
Docs: README ยท PRODUCTION.md ยท MODEL_CARD.md ยท
Contact: Phan Minh Tai โ phanminhtai23@gmail.com
Tags: agentic-ai ยท production ยท mlops ยท llmops ยท multi-agent ยท langgraph ยท rag ยท
guardrails ยท observability ยท prometheus ยท reliability ยท ci-cd ยท fastapi ยท aaidc