This research presents an enterprise-grade FinOps conversational AI agent that transforms cloud cost data into actionable intelligence through a multi-agent architecture powered by LangGraph, LLM-driven Text2SQL, and advanced analytics. Building upon the foundational work of Module 2, the system introduces production-critical capabilities including session-based memory persistence, multi-tenant authentication, forecasting and anomaly detection, comprehensive security guardrails, and dual interfaces (Streamlit UI + REST API). The architecture implements a PostgreSQL-based multi-tenant data model supporting tenant-user hierarchies, session management, and artifact tracking. Production enhancements include exponential backoff retry logic, configurable timeouts, execution loop limits, graceful degradation, structured logging, and comprehensive health monitoring. Reliability is validated through extensive test coverage spanning unit, integration, and end-to-end tests. Future research directions focus on hybrid memory architectures, GraphRAG integration, and scalable text-to-SQL optimization techniques.
Module 3 extends the ReadyTensor Project Module 2 by transforming a basic FinOps data analysis system into a production-ready, conversational AI agent with memory, advanced analytics, security guardrails, and dual interfaces (Streamlit UI + REST API).
Module 2 vs Module 3 — Feature Comparison
Architecture
Analytics
Visualizations
Memory
Security
Error Handling
Interfaces
Deployment
- Module 2 is designed for local deployment only.
- Module 3 is production-ready and supports monitoring.
Testing
Keywords: FinOps, Multi-Agent Systems, Text2SQL, LangGraph, Multi-Tenant Architecture, Conversational AI, Cloud Cost Optimization, Production Resilience
1.1 Problem Statement
Organizations face critical challenges in cloud financial operations:
1.2 Solution Overview
This research introduces a production-grade conversational AI agent addressing these challenges through:
1.3 Key Contributions
The FinOps Conversational Agent operates through a modular, multi-stage pipeline implemented using specialized agents. Each agent is responsible for a distinct phase of query understanding, data retrieval, and insight generation. This architecture supports interpretability, fault isolation, and incremental improvement of individual components.
Conversation Intake & Context Handling
Agent: small_talk.py
Supervisor Coordination Layer
Agent: supervisor.py
Intent Classification & Routing
Agent: intent_router.py
Distinguishes between:
Entity Extraction & Schema Alignment
Agent: entity_extraction.py
SQL Generation for FinOps Analytics
Agent: text2sql.py
Data Retrieval & Preprocessing
Agent: data_fetcher.py
Insight Synthesis
Agent: insightAgent.py
Visualization Production
Agent: visualizerAgent.py
Knowledge Graph Integration
Agent: knowledge.py
┌─────────────────────────────────────────────────────────────┐
│ USER INTERFACES │
├─────────────────────────────────────────────────────────────┤
│ Streamlit UI (Port 8501) │ REST API (Port 8000) │
│ - Chat interface │ - Session management │
│ - File uploads │ - Query processing │
│ - Memory stats │ - History retrieval │
│ - Visualizations │ - OpenAPI documentation │
└─────────────┬───────────────┴──────────────┬────────────────┘
│ │
└───────────┬───────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ ORCHESTRATION LAYER │
├─────────────────────────────────────────────────────────────┤
│ LangGraph Supervisor (supervisor.py) │
│ - Intent classification │
│ - Agent routing (data_fetcher, insights, visualizer) │
│ - State management │
│ - Memory integration │
└─────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AGENT LAYER │
├─────────────────────────────────────────────────────────────┤
│ Intent Router │ Data Fetcher │ Insight Agent │
│ - Classifies │ - SQL gen │ - Forecasting │
│ user intent │ - Entity ext. │ - Anomaly detection │
│ │ - Query exec. │ - Correlations │
├─────────────────┼─────────────────┼────────────────────────┤
│ Visualizer │ Knowledge │ Small Talk │
│ - 9 chart types│ - RAG system │ - Casual chat │
│ - Auto-detect │ - FinOps docs │ - Greetings │
└─────────────┬───┴─────────────────┴────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SECURITY & VALIDATION │
├─────────────────────────────────────────────────────────────┤
│ - Input sanitization (validators.py) │
│ - SQL injection prevention │
│ - Path traversal blocking │
│ - Rate limiting │
│ - Error boundaries │
└─────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ MEMORY & PERSISTENCE │
├─────────────────────────────────────────────────────────────┤
│ SQLite Database (finops_memory.db) │
│ ┌────────────────┬──────────────────────────────┐ │
│ │ Sessions │ Conversation History │ │
│ │ - session_id │ - id │ │
│ │ - created_at │ - session_id │ │
│ │ - csv_path │ - role (user/assistant) │ │
│ │ - metadata │ - content │ │
│ │ │ - timestamp │ │
│ │ │ - metadata │ │
│ └────────────────┴──────────────────────────────┘ │
└─────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
├─────────────────────────────────────────────────────────────┤
│ - Groq LLM API (llama-3.3-70b-versatile) │
│ - LangSmith (optional monitoring) │
└─────────────────────────────────────────────────────────────┘

Single Query Flow:
Detailed Flow:
Step 1: User submits query
query = "Show me cost trends for EC2"
Step 2: Validation (validators.py)
validated_query = validate_query(query)
validated_csv = validate_csv_path(csv_path)
Step 3: Memory retrieval (state.py)
conversation_history = get_session_history(session_id)
memory_context = format_memory_context(conversation_history)
Step 4: State initialization
state = init_state(
    original_query=validated_query,
    conversation_history=conversation_history,
    memory_context=memory_context
)
Step 5: Supervisor orchestration (supervisor.py): classify_node determines the intent and routes to the appropriate agent; data_fetcher_node generates and executes SQL; visualize_node creates the chart; knowledge_node adds context.
result = run_supervisor(state, validated_csv)
Step 6: Response delivery
return { "response": result["response"], "chart_path": result["chart_path"] }

2.1 Memory System
Architecture:
Short-term memory: Last 5-10 conversation turns in RAM
Long-term memory: Full history in SQLite
Entity memory: Remembered filters, columns, services
Implementation:
schema/state.py
from typing import Dict, List

def init_state(
    original_query: str,
    conversation_history: List[Dict] = None,
):
    conversation_history = conversation_history or []
    memory_context = format_memory_context(conversation_history)
    remembered_entities = extract_entities_from_history(conversation_history)
    return {
        "original_query": original_query,
        "conversation_history": conversation_history,
        "memory_context": memory_context,
        "remembered_entities": remembered_entities,
        "turn_number": len(conversation_history) // 2 + 1,
    }
Benefits:
Context-aware responses
Reference resolution ("show me that again")
Follow-up questions work naturally
Persistent across sessions
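To illustrate how the memory context passed to the LLM could be assembled from stored turns, here is a minimal sketch; the helper name mirrors the one referenced above, but the actual implementation in schema/state.py may differ:

from typing import Dict, List

def format_memory_context(conversation_history: List[Dict], max_turns: int = 10) -> str:
    """Render the most recent turns as a compact context string for the LLM prompt."""
    if not conversation_history:
        return ""
    recent = conversation_history[-max_turns:]  # short-term window (last 5-10 turns)
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in recent)

# Example
history = [
    {"role": "user", "content": "Show total cost by service"},
    {"role": "assistant", "content": "EC2 leads at $5,000 this month."},
]
print(format_memory_context(history))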
2.2 Advanced Analytics
New Capabilities (illustrative sketches of the forecasting and anomaly helpers appear at the end of this section):
Cost Forecasting
Linear regression forecasting
forecast_linear(df, date_col='date', value_col='cost', periods=3)
Returns: [5000, 5200, 5400] (next 3 months)
Anomaly Detection
Z-score based anomaly detection
detect_anomalies_zscore(df, column='cost', z_thresh=3.0)
Returns: {date: cost} for outliers
Isolation Forest for complex patterns
detect_anomalies_isolation(df, column='cost', contamination=0.05)
Statistical Analysis
Moving averages for trend smoothing
moving_average(df, column='cost', window=7)
Correlation analysis
correlation_matrix(df)
Returns correlation between all numeric columns
Dynamic Code Generation:
LLM generates safe Python code based on user query
user_query = "Forecast next quarter costs with anomaly detection"
LLM generates:
result = {
'forecast': forecast_linear(df, 'date', 'cost', periods=3),
'anomalies': detect_anomalies_zscore(df, 'cost'),
'trend': moving_average(df, 'cost', window=30)
}
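For illustration, the forecasting and anomaly-detection helpers referenced above could be implemented roughly as follows. This is a sketch assuming pandas DataFrames and the parameter names shown in the calls; the project's actual analytics module may differ:

import numpy as np
import pandas as pd

def forecast_linear(df: pd.DataFrame, date_col: str = "date",
                    value_col: str = "cost", periods: int = 3) -> list:
    """Fit a simple linear trend to the series and project it `periods` steps ahead."""
    y = df.sort_values(date_col)[value_col].to_numpy(dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    future_x = np.arange(len(y), len(y) + periods)
    return list(slope * future_x + intercept)

def detect_anomalies_zscore(df: pd.DataFrame, column: str = "cost",
                            z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag rows whose value deviates from the mean by more than z_thresh standard deviations."""
    values = df[column].astype(float)
    z = (values - values.mean()) / values.std(ddof=0)
    return df[z.abs() > z_thresh]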
2.3 Enhanced Visualizations
Chart Types:
Bar Chart (vertical/horizontal)
Line Chart (with area fill)
Pie Chart (with percentages)
Stacked Bar Chart (multi-category over time)
Scatter Plot
Area Chart
Heatmap
Grouped Bar Chart
Custom combinations
Auto-Detection:
from typing import Dict

def determine_chart_type(query: str, detected_cols: Dict):
    if 'trend' in query and detected_cols['date']:
        return 'line'
    elif 'compare' in query and detected_cols['service']:
        return 'bar'
    elif 'distribution' in query:
        return 'pie'
    elif 'over time' in query and detected_cols['category']:
        return 'stacked_bar'
    return 'bar'  # fallback when no pattern matches (assumed default)
Features (illustrated in the sketch after this list):
Proper axes labels and formatting
Color schemes (viridis, Set3)
Value annotations
Grid lines for readability
Currency formatting ($1,234)
Date formatting
Legend placement
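As a concrete illustration of these formatting features, a bar chart with a currency axis, value annotations, a viridis color scheme, and a light grid could be rendered like this (a matplotlib sketch, not the visualizer agent's exact code; the function name is hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def plot_cost_by_service(services, costs, out_path="chart.png"):
    """Bar chart with currency formatting, value annotations, and a light grid."""
    fig, ax = plt.subplots(figsize=(8, 5))
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(services)))
    bars = ax.bar(services, costs, color=colors)
    ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"${v:,.0f}"))  # currency axis
    ax.set_xlabel("Service")
    ax.set_ylabel("Cost (USD)")
    ax.grid(axis="y", alpha=0.3)
    for bar, cost in zip(bars, costs):  # value annotation above each bar
        ax.annotate(f"${cost:,.0f}", (bar.get_x() + bar.get_width() / 2, cost),
                    ha="center", va="bottom")
    fig.tight_layout()
    fig.savefig(out_path)
    return out_path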
2.4 Security Features
Input Validation:
utils/validators.py
import re

BLOCKED_PATTERNS = [
    r'(?i)(drop|delete|truncate|alter)\s+(table|database)',
    r'(?i)(exec|execute|eval|system)',
    r'<script[^>]*>.*?</script>',
    r'\.\./|\.\.',   # Path traversal
    r'[;\|&`$]',     # Command injection
]

def validate_query(user_query: str):
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_query):
            raise SecurityError("Potentially harmful content detected")
SQL Injection Prevention:
def validate_sql_query(sql_query: str):
    if not sql_query.upper().strip().startswith('SELECT'):
        raise SecurityError("Only SELECT queries allowed")
    blocked = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER']
    for keyword in blocked:
        if keyword in sql_query.upper():
            raise SecurityError(f"Dangerous SQL operation: {keyword}")
File Security:
import os

def validate_csv_path(csv_path: str):
    if '..' in csv_path:
        raise SecurityError("Path traversal detected")
    file_size = os.path.getsize(csv_path) / (1024 * 1024)
    if file_size > 100:  # 100 MB limit
        raise ValidationError("File too large")
    if not csv_path.endswith('.csv'):
        raise ValidationError("Only CSV files allowed")
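Taken together, these validators would typically run before any agent or SQL work begins. A sketch of that wiring (assuming the validators and exceptions are importable from utils.validators, and that process_query is the downstream entry point as in the error-handling examples below):

from utils.validators import (
    SecurityError, ValidationError, validate_csv_path, validate_query,
)

def handle_request(user_query: str, csv_path: str) -> dict:
    """Reject unsafe input up front, then hand off to the normal pipeline."""
    try:
        validate_query(user_query)    # blocked patterns, injection attempts
        validate_csv_path(csv_path)   # traversal, size, and extension checks
    except (SecurityError, ValidationError) as exc:
        return {"error": True, "response": f"Rejected input: {exc}"}
    return process_query(user_query, csv_path)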
2.5 Error Handling
Multi-Layer Approach:
Layer 1: Input validation
try:
query = validate_query(user_query)
except ValidationError as e:
return {"response": f"Validation Error: {e}"}
Layer 2: Processing errors
try:
result = process_query(query, csv_path)
except FileNotFoundError:
return {"response": "File not found"}
except PermissionError:
return {"response": "Access denied"}
Layer 3: Agent-level errors
def data_fetcher_node(state):
try:
sql_result = execute_sql(query)
except Exception as e:
logger.error(f"SQL execution failed: {e}")
return {**state, "error": True}
Layer 4: Graceful degradation
if not result:
return {"response": "Unable to process. Using fallback..."}
Error Categories:
2.6 Logging & Monitoring
Structured Logging:
utils/logger_setup.py
logger = setup_execution_logger()
logger.info(f"Processing query: {query[
]}...")Metrics Tracked:
Query processing time
Agent routing decisions
Memory retrieval latency
SQL execution time
LLM API calls and tokens
Error rates by type
Session activity
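A minimal sketch of what setup_execution_logger() might look like; the real utils/logger_setup.py may add file handlers, JSON formatting, or log rotation:

import logging

def setup_execution_logger(name: str = "finops_agent", level: int = logging.INFO) -> logging.Logger:
    """Configure a console logger with timestamps and component names for traceability."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s | %(levelname)s | %(name)s | %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger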
Features:

Run in Streamlit app
https://readytensorproject-2-buqxvtwuwt5ldmpgardpcf.streamlit.app/
test_agent.py (End-to-end pipeline test)
Objective: Validate that the entire FinOps multi-agent pipeline executes correctly through: Supervisor → Text2SQL → Insight → Visualization → Knowledge → Memory → Response.
What it tests
Test Data
Uses the default CSV (data/data.csv) or test DB.
Query examples include:
"show total cost by service name"
"plot cost trends"
Why it matters
This test ensures the full FinOps agent experience works exactly as a user experiences it in the Streamlit UI.
test_integration_insight_agent.py (Integration Test)
Objective: Test how the Insight Agent behaves in combination with the DataFetcher and the SQL execution pipeline.
What it tests
Test Data
Uses a controlled test CSV or mock DataFrame.
Why it matters: The Insight Agent is one of the most critical parts of the FinOps system, responsible for:
test_processquery.py (Functional Test for Query Processing Logic)
Objective: Validate the process_query() pipeline logic, which handles:
This is a “middle layer” test — not a unit test, not a full end-to-end test.
What it tests
Test Data
Why it matters
This ensures the decision-making logic of the agent is correct before handing control to the LangGraph Supervisor.
test_unit_insightagent.py (Unit Test for Insight Agent)
Objective: Test the internal logic of the Insight Agent in isolation.
What it tests
OpenAPI Documentation:
http://localhost:8000/docs
Endpoints:
POST /session/create
  Response: {"session_id": "uuid", "message": "Session created"}

POST /session/{session_id}/upload-csv
  Body: FormData(file: CSV)
  Response: {"message": "File uploaded", "file_path": "..."}

POST /session/{session_id}/query
  Body: {"query": "Show costs"}
  Response: {
    "session_id": "uuid",
    "response": "Your costs are...",
    "chart_path": "path/to/chart.png",
    "turn_number": 3,
    "intent": "finops_query",
    "subagent": "data_fetcher"
  }

GET /session/{session_id}/history
  Response: {
    "session_id": "uuid",
    "history": [
      {"role": "user", "content": "...", "timestamp": "..."},
      {"role": "assistant", "content": "...", "timestamp": "..."}
    ],
    "total_messages": 10
  }

GET /sessions
  Response: [
    {
      "session_id": "uuid",
      "created_at": "...",
      "last_activity": "...",
      "message_count": 10,
      "has_csv": true
    }
  ]

DELETE /session/{session_id}
  Response: {"message": "Session deleted"}

GET /health
  Response: {
    "status": "healthy",
    "database": "connected",
    "active_sessions": 5
  }
Usage Example:
import requests

BASE_URL = "http://localhost:8000"

# Create session
response = requests.post(f"{BASE_URL}/session/create")
session_id = response.json()["session_id"]

# Upload CSV
files = {"file": open("data.csv", "rb")}
requests.post(
    f"{BASE_URL}/session/{session_id}/upload-csv",
    files=files
)

# Query
response = requests.post(
    f"{BASE_URL}/session/{session_id}/query",
    json={"query": "Show total costs"}
)
result = response.json()
print(result["response"])

# Get history
history = requests.get(
    f"{BASE_URL}/session/{session_id}/history"
).json()
Retry Logic with Exponential Backoff
External dependencies (LLM calls, database queries, supervisor execution) are protected using bounded retry logic with exponential backoff.
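A minimal sketch of bounded retry with exponential backoff and jitter (the attempt budget and base delay here are illustrative values, not the project's configuration):

import random
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the original error
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)

# Example: protect an LLM call
# result = with_retries(lambda: llm.invoke(prompt))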
Benefits:
Timeout Handling
All long-running operations are guarded by configurable timeouts.
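One way to guard a long-running step with a configurable timeout is to run it in a worker thread (a sketch only; the timeout values would come from configuration):

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(fn, timeout_seconds: float = 30.0):
    """Run fn() in a worker thread and raise if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"Operation exceeded {timeout_seconds}s limit")
    finally:
        # Do not block on a still-running worker; let it finish in the background.
        pool.shutdown(wait=False)

# Example: bound an SQL execution step
# rows = run_with_timeout(lambda: execute_sql(sql_query), timeout_seconds=20)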
Timeout Scenarios:
Execution & Loop Limits
Prevent runaway agent behavior through strict execution limits:
def supervisor_with_limits(state: Dict) -> Dict:
    """Execute supervisor with safety limits."""
    turn_count = state.get('turn_number', 0)
    if turn_count > MAX_AGENT_TURNS:
        raise ExecutionLimitError(
            f"Exceeded maximum turns ({MAX_AGENT_TURNS}). "
            f"Please simplify your query."
        )

    # Execute agents
    result = run_agent_graph(state)

    if result['node_count'] > MAX_GRAPH_NODES:
        raise ExecutionLimitError(
            f"Agent execution too complex ({result['node_count']} nodes). "
            f"Please break into smaller queries."
        )

    return result
Protection Against:
Graceful Degradation & Error Handling: The system degrades gracefully under partial failures.
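A sketch of that fallback pattern (hypothetical wiring; the fallback messages are illustrative, while run_supervisor and logger are the functions referenced earlier):

def answer_with_fallback(state: dict, csv_path: str) -> dict:
    """Try the full pipeline; on failure return a partial but still useful answer."""
    try:
        return run_supervisor(state, csv_path)
    except TimeoutError:
        return {"response": "The analysis timed out. Try a narrower date range or a simpler question.",
                "degraded": True}
    except Exception as exc:
        logger.error(f"Pipeline failure: {exc}")
        return {"response": "Unable to complete the full analysis. Please retry or simplify the query.",
                "degraded": True}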

Local Development
# Setup
git clone <repository>
cd finops-agent-module3
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env and add GROQ_API_KEY

# Run Streamlit UI
streamlit run integrations/app.py

# Run API
uvicorn api:app --reload --port 8000
Pre-requisites
AWS Requirements
Steps
# Use official Python runtime as a parent image
FROM python:3.11-slim

# Set environment variables
ENV PYTHONUNBUFFERED 1

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Copy application code
COPY . .

# Expose Streamlit port (if using Streamlit UI)
EXPOSE 8501

# Default command
CMD ["bash", "-lc", "streamlit run integrations/app.py --server.port=8501 --server.address=0.0.0.0"]
Build & Test the Image Locally
docker build -t finops-agent .
docker run -p 8501:8501 finops-agent
AWS ECR Setup
Create an ECR repository
In AWS Console → ECR → Create Repository
Name: finops-agent
Authenticate Docker to ECR
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com
Tag and Push the Docker Image
docker tag finops-agent:latest <aws_account_id>.dkr.ecr.<region>.amazonaws.com/finops-agent:latest
docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/finops-agent:latest
Launch EC2
Choose AMI: Amazon Linux 2
Instance type: t3.medium (or larger)
Allow inbound for: SSH (port 22)
App port (e.g., 8501)
Configure security groups appropriately
Install Docker on EC2
SSH into the instance:
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -a -G docker ec2-user
Log out and back in for group changes to take effect.
docker pull <aws_account_id>.dkr.ecr.<region>.amazonaws.com/finops-agent:latest
docker run -d -p 8501:8501 --name finops-agent <aws_account_id>.dkr.ecr.<region>.amazonaws.com/finops-agent:latest
Verify the app is available at:
http://<EC2_PUBLIC_IP>:8501
Configure the following GitHub Actions secrets (Secret: Value):
AWS_ACCESS_KEY_ID (from IAM user)
AWS_SECRET_ACCESS_KEY (from IAM user)
AWS_REGION e.g., us-east-1
ECR_REPOSITORY finops-agent
AWS_ACCOUNT_ID your account id
OPENAI_API_KEY your OpenAI key
...others as needed ...
name: CI / CD Deploy

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to ECR
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_REGION }} \
            | docker login --username AWS \
              --password-stdin ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com

      - name: Build Docker image
        run: |
          docker build -t finops-agent .

      - name: Tag Docker image
        run: |
          docker tag finops-agent:latest \
            ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/finops-agent:latest

      - name: Push to ECR
        run: |
          docker push ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/finops-agent:latest

      - name: Deploy on EC2
        uses: appleboy/ssh-action@v0.1.7
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ${{ secrets.EC2_USER }}
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            docker pull ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/finops-agent:latest
            docker stop finops-agent || true
            docker rm finops-agent || true
            docker run -d -p 8501:8501 --restart always \
              --name finops-agent \
              ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/finops-agent:latest
Key Achievements
Memory System: Session-based + SQLite persistence
Advanced Analytics: Forecasting, anomaly detection, correlations
Security: Input validation, SQL injection prevention, path traversal blocking
Error Handling: Multi-layer with graceful degradation
Dual Interfaces: Streamlit UI + REST API
Testing: Unit + Integration + System tests with 80%+ coverage
Visualizations: 9 chart types

7. Future Action: Multi-Tenant Database Architecture
7.1 Schema Design Philosophy
The multi-tenant architecture implements a shared-database, separate-schema pattern optimized for:

7.2.1 Tenants Table
CREATE TABLE tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    owner_user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    plan TEXT NOT NULL,  -- 'free', 'professional', 'enterprise'
    status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active', 'inactive', 'suspended')),
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Represents organizations or teams using the platform. Each tenant has an owner and subscription plan.
7.2.2 Users Table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    full_name TEXT NOT NULL,
    display_name TEXT,
    email TEXT UNIQUE NOT NULL,
    password_hash TEXT,
    email_verified BOOLEAN DEFAULT false,
    must_reset_password BOOLEAN DEFAULT true,
    is_system_admin BOOLEAN,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Stores user credentials and profile. Users can belong to multiple tenants with different roles.
7.2.3 Tenants-Users Association
CREATE TABLE tenants_users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    user_id UUID REFERENCES users(id) ON DELETE SET NULL,
    display_name TEXT,
    role TEXT,  -- 'owner', 'admin', 'member', 'viewer'
    status TEXT DEFAULT 'active' CHECK (status IN ('active', 'inactive')),
    domain TEXT,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Many-to-many relationship supporting role-based access control within each tenant.
7.2.4 Sessions Table
CREATE TABLE sessions (
    session_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE NO ACTION,
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE NO ACTION,
    title TEXT,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Represents individual chat sessions. Each session belongs to a tenant-user pair, enabling conversation isolation and retrieval.
Why tenant-user level conversations?
7.2.5 Messages Table
CREATE TABLE messages (
    message_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID REFERENCES sessions(session_id) ON DELETE CASCADE,
    run_id UUID,
    role VARCHAR(50) NOT NULL CHECK (role IN ('user', 'assistant')),
    message TEXT,
    status TEXT CHECK (status IN ('in_progress', 'completed', 'failed')),
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Stores individual conversation turns. Supports streaming responses through status tracking.
7.2.6 Artifacts Table
CREATE TABLE artifacts (
    artifact_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    session_id UUID NOT NULL REFERENCES sessions(session_id) ON DELETE CASCADE,
    run_id UUID NOT NULL,
    agent_type VARCHAR(50) NOT NULL,  -- 'data_fetcher', 'visualizer', 'insight'
    file_path TEXT NOT NULL,
    sql_query TEXT,
    file_hash VARCHAR(64),
    s3_url TEXT,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Links generated artifacts (SQL queries, charts, analysis results) to sessions for reproducibility and audit.
7.2.7 Datasets Table
CREATE TABLE datasets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID,
    user_id UUID,
    dataset_id UUID,
    sf_account_locator TEXT,
    sf_database_name TEXT,
    sf_schema_name TEXT,
    sf_table_name TEXT,
    provider TEXT,  -- 'snowflake', 'bigquery', 'redshift', 'csv'
    schema_json_url TEXT,
    uploaded_file_path TEXT,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Registers data sources available to tenants. Supports external warehouses (Snowflake, BigQuery) and uploaded CSV files.
7.2.8 Session Summary Table
CREATE TABLE session_summary (
    summary_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL REFERENCES sessions(session_id) ON DELETE CASCADE,
    first_message_id UUID REFERENCES messages(message_id) ON DELETE SET NULL,
    last_message_id UUID REFERENCES messages(message_id) ON DELETE SET NULL,
    summary_description TEXT,
    created_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC'),
    updated_at TIMESTAMPTZ DEFAULT (now() AT TIME ZONE 'UTC')
);
Purpose: Stores LLM-generated summaries of long conversations for quick retrieval and display in session lists.
7.4 Why This Architecture?
Tenant Isolation
User Attribution
Conversation Continuity
Artifact Traceability
Scalability
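To make these points concrete, a conversation-history read could be scoped by both tenant and user, for example (a sketch using psycopg2 and a parameterized query; the helper name and connection handling are assumptions, not project code):

import psycopg2

def get_tenant_session_history(tenant_id: str, user_id: str, session_id: str, dsn: str):
    """Fetch messages only for a session owned by the given tenant-user pair."""
    query = """
        SELECT m.role, m.message, m.created_at
        FROM messages m
        JOIN sessions s ON s.session_id = m.session_id
        WHERE s.session_id = %s AND s.tenant_id = %s AND s.user_id = %s
        ORDER BY m.created_at;
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (session_id, tenant_id, user_id))
        return cur.fetchall()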
Planned Enhancements:
8.2 Hybrid Memory System with Embeddings
Architecture:
┌───────────────────────────────────────────────────┐
│ Context Memory (In-Memory Cache) │
│ Last 5-10 turns + Active session state │
└─────────────────┬─────────────────────────────────┘
│
┌─────────────────┴─────────────────────────────────┐
│ Episodic Memory (PostgreSQL Messages) │
│ Full conversation history with timestamps │
└─────────────────┬─────────────────────────────────┘
│
┌─────────────────┴─────────────────────────────────┐
│ Semantic Memory (Vector Database - Pinecone) │
│ Embedded conversations + FinOps knowledge base │
│ Similarity search for relevant context │
└───────────────────────────────────────────────────┘
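As a rough sketch of the semantic layer, past turns could be embedded and retrieved by cosine similarity before being added to the prompt (illustrative only; the embedding model and vector store are not yet chosen):

import numpy as np

def cosine_top_k(query_vec: np.ndarray, memory_vecs: np.ndarray, k: int = 3) -> list:
    """Return indices of the k stored embeddings most similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q
    return list(np.argsort(-scores)[:k])

# Example: retrieve the 3 most relevant past turns for the current query
# relevant_turn_ids = cosine_top_k(embed(query), np.vstack(stored_embeddings))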
Research Questions:
8.3 Advanced Text2SQL Research
Open Problems:
This FinOps AI Agent already delivers a powerful multi-agent reasoning pipeline, memory persistence, advanced analytics, and seamless Text2SQL automation, but there is a clear roadmap for taking it to the next level.
This research builds upon the ReadyTensor FinOps Module 2 baseline and extends it with production-grade capabilities. Special thanks to the open-source communities behind LangChain, LangGraph, Groq, and Streamlit for enabling rapid development of AI-powered applications.