LoRAfrica: Modal Serverless Compute Deployment with Streamlit

Project Description

This project represents the final phase of LoRAfrica deployment, making the system accessible to users on the web through scalable serverless infrastructure using Modal and an interactive frontend built with Streamlit.

Aim & Objectives

Aim
Deploy LoRAfrica via Modal and Streamlit for seamless web-based user interaction.

Objectives

Benchmark performance using Naive and vLLM approaches while measuring TTFT, ITL, TPOT, E2E, and TPS
Deploy LoRAfrica using Modal serverless GPU infrastructure
Enable logging and observability with LangSmith, including PII redaction
Build and render a Streamlit-based frontend for user interaction

Methodology

LoRAfrica for CPU (Ollama & Llama.cpp)

The first phase of the project with making LoRAfrica available on user consumer hardware before deploying unto Modal and Streamlit. The base model microsoft/Phi-4-mini-instruct was merged with the LoRA adapters of DannyAI/phi4_african_history_lora_ds2_axolotl which was then pushed to HuggingFace. This allowed a unified model for which developers can further fine-tune id desired.

The Base model and Lora Model were quantised using Llama.cpp in the q4_k_m.gguf format, which reduced the model size to about 2GB (base model) and adapters to 6MB. Make it fully available and runable on consumer hardware via platforms such as Ollama and Llama.cpp interactive UI apps.

Below are the relevant links for LoRAfrica for CPU

LoRAfrica Serverless Compute Deployment

This involved the following;

Benchmarking TTFT, ITL, TPOT, E2E, and TPS with naive approach (baseline) against the vLLM approach.
Deployment of Model using Modal while enabling LangSmith traces and serving the Model frontend via Streamlit

Metrics Explained

Below are the brief overview of the metrics measured;

1. TTFT (Time to First Token)

Measures the latency of the first response.

Unit: ms

2. ITL (Inter-Token Latency)

The time elapsed between two consecutive tokens.

Unit: ms

3. TPOT (Time Per Output Token)

The average generation time per token, excluding the prefill (TTFT) phase.

Unit: ms

4. E2E (End-to-End Latency)

The total time from request to the final token.

Unit: ms

5. TPS (Tokens Per Second)

The overall throughput of the inference engine.

Unit: tokens/sec

Summary Table

Metric	Focus	LaTeX Logic
TTFT	Responsiveness
ITL	Smoothness
TPOT	Speed
E2E	Total Time
TPS	Throughput

Benchmarks were conducted using a custom dataset as shown below;

SYSTEM_PROMPT = "You are a helpful AI assistant specialised in African history which gives concise answers to questions asked."
TEST_PROMPTS = [
    "Who was the richest king in history from the Mali Empire?",
    "Explain the significance of the Great Zimbabwe ruins.",
    "Describe the impact of the Ashanti Empire on West African trade.",
    "What was the role of the Kingdom of Aksum in early Christianity?",
    "Detail the leadership of Queen Amina of Zazzau."
]

For the vLLM approach, the dataset above was converted to a .jsonlformat, as vLLM expects such as shown below;
Template used;

full_prompt = f"<|system|>\n{SYSTEM_PROMPT}\n<|user|>\n{user_q}\n<|assistant|>\n"

Example format

{"prompt": "<|system|>\nYou are a helpful AI assistant specialised in African history which gives concise answers to questions asked.\n<|user|>\nWho was the richest king in history from the Mali Empire?\n<|assistant|>\n", "output_len": 128}

Results

The results of the naive vs vLLM approach are shown and explained briefly below;

1. Global TPS (Tokens Per Second)

Batch Size	Naive TPS	vLLM TPS (output_throughput)
4	17.18	212.42
16	17.14	707.30
32	17.15	1172.94
64	16.50	1804.45

Interpretation:

vLLM achieves ~10–100× higher throughput than the naive approach across all batch sizes.
Naive approach (FastAPI + HuggingFace) TPS saturates at ~17 tokens/sec, indicating GPU underutilization.

2. TTFT (Time To First Token, ms)

Batch	Naive Mean	vLLM Mean	Naive Median	vLLM Median	Naive P95	vLLM P95	Naive P99	vLLM P99
4	1306.01	62.19	420.95	62.77	8533.89	67.57	8744.09	72.30
16	5273.29	71.41	2646.71	65.64	17867.76	105.55	18763.96	107.25
32	7396.85	92.39	5847.75	92.25	23157.50	121.94	24690.97	125.02
64	22182.49	157.05	23376.35	179.74	39222.21	210.91	59003.12	211.92

Interpretation:

Naive approach suffers from extreme TTFT at high batch sizes; vLLM maintains low and predictable TTFT.

3. TPOT (Time Per Output Token, ms)

Batch	Naive Mean	vLLM Mean	Naive Median	vLLM Median	Naive P95	vLLM P95	Naive P99	vLLM P99
4	44.43	17.76	41.87	17.78	62.21	17.95	64.18	17.98
16	44.19	19.03	42.10	19.14	53.61	19.37	64.61	19.45
32	44.77	20.25	42.31	20.39	59.11	20.83	64.68	20.98
64	44.87	22.50	42.40	22.65	64.81	23.39	64.99	23.60

Interpretation:

TPOT is consistently lower for vLLM, showing faster per-token generation when compared to the Naive approach.

4. ITL (Inter-Token Latency, ms)

Batch	Naive Mean	vLLM Mean	Naive Median	vLLM Median	Naive P95	vLLM P95	Naive P99	vLLM P99
4	57.90	17.77	63.02	17.49	67.26	21.83	68.25	30.03
16	57.39	19.04	63.37	18.33	67.44	27.55	68.69	30.81
32	56.40	20.23	62.87	19.63	67.73	27.67	68.74	32.58
64	59.42	22.43	64.13	22.21	68.10	28.87	69.01	38.08

Interpretation:

vLLM shows much lower ITL compared to naive, confirming faster streaming and token delivery.

5. E2E (End-to-End Latency, ms)

Batch	Naive Mean	vLLM Mean	Naive Median	vLLM Median	Naive P95	vLLM P95	Naive P99	vLLM P99
4	1931.48	1536.84	551.98	1538.78	8674.16	1557.75	9341.98	1565.19
16	5817.38	1650.05	3018.40	1653.50	18114.21	1712.05	19265.36	1720.17
32	7870.63	1806.79	6218.77	1818.53	23466.47	1884.95	24817.74	1901.24
64	22902.15	2030.43	23502.29	2065.39	43940.01	2157.93	59194.26	2177.12

Interpretation:

vLLM significantly reduces E2E latency compared to naive, especially at high batch sizes.
Naive approach exhibits huge spikes at P95/P99 due to GPU contention, Python threading overhead, and lack of efficient batching.
vLLM shows low tail latency, making it much more predictable and scalable.

Overall Analysis

Throughput: vLLM outperforms naive FastAPI by ~10–100×, scaling efficiently with batch size and concurrency.
Token Generation: TPOT and ITL are much lower and stable in vLLM.
Latency: TTFT and E2E are dramatically reduced in vLLM; naive FastAPI suffers long-tail spikes.
System Behavior:
- Naive approach is bottlenecked by Python threading, HuggingFace generation inefficiencies, and GPU underutilization.
- vLLM optimizes scheduling, streaming, and batching for high concurrency.

This is why vLLM was used for model deployment via Modal's Serverless Compute Platform.

Deployment

Modal

Modal is a serverless cloud platform for running AI and machine learning workloads without managing infrastructure. It handles scaling, GPU provisioning, and execution automatically, allowing developers to focus on building and deploying models.

Key Features

Serverless execution (no infrastructure setup)
Automatic scaling (CPU & GPU)
On-demand GPU access
Fast deployment with low latency
Code-first workflow

In this setup the,
L4 GPU and 2 T4 GPU (a fallback safety net) was deployed. Which allows 100 concurrent requests with an allowance of 10 containers. An idle container was not allowed so as to manage costs in infrastructure management

Estimating Cost Per Token for L4/T4 GPUs

1️⃣ Formula

Cost per token = GPU cost per hour ÷ tokens per hour throughput

GPU cost per hour: from Modal
- $0.80 for L4,
- $0.59 for T4
Tokens per hour: estimate based on model’s generation speed

2️⃣ Estimate Tokens Per Hour

Max output tokens per request: 128
The Baseline Estimates

Assume L4 throughput ≈ 1,000,000 tokens/hour
Assume T4 throughput ≈ 750,000 tokens/hour

3️⃣ Compute Cost Per Token

L4:

cost per token = 0.80 / 1,000,000 = 0.0000008 USD/token

T4:

cost per token = 0.59 / 750,000 ≈ 0.000000787 USD/token

GPU	Cost per token (USD)
L4	0.0000008
T4	0.000000787

4️⃣ Split Input vs Output Cost (example calculation for a request on L4 compute)

If your prompt is ~50 tokens, output is 128 tokens:

Total tokens per request = 50 + 128 = 178

Fractional cost:

input cost per token  = 0.0000008 * (50 / 178) = 0.000000224
output cost per token = 0.0000008 * (128 / 178) = 0.000000576

Actual Compute Cost

input_cost = input_tokens * input_cost_per_token = 
50 * 0.000000224 = 0.0000112 USD

output_cost = output_tokens * output_cost_per_token = 
128 * 0.000000576 = 0.000073728 USD

total_cost = input_cost + output_cost = 
0.0000112 + 0.000073728 = 0.000084928 USD

This means per-request cost is extremely low; Making serverless LLM deployment economically viable at scale. Moreover, this means 10,000 requests would cost less than $1, highlighting the cost-efficiency of optimised LLM deployment using modern GPU infrastructure.

LangSmith

LangSmith is a development and monitoring platform for building, debugging, and evaluating LLM-powered applications. It provides deep visibility into how models and agents behave, making it easier to trace, test, and improve performance.

Key Features

End-to-end tracing of LLM calls and workflows
Debugging and inspection of prompts, responses, and chains

In this project, it was used to trace the LLM responses and ensuring PIIs were being redacted on the developer's side to ensure privacy of users.

Redaction of users' information was achieved using a hybrid approach; regular expressions and Microsoft's Presidio anonymizer & Presidio analyzer; which recognises entities allowing for redaction where regular expressions fails.

NB: Noticeable latency was introduced with the use of these redaction approaches when loading from cold starts.

For further safety concerns; the system prompt was structured in a way that it hidden as a secret on Modal and immune to prompt injections. I cannot share the full prompt but here is an example way to do this;

"""
### ROLE
You are LoRAfrica, an expert-level repository of African Historical Knowledge........

### DOMAIN LAYER


### SECURITY & ANTI-INJECTION LAYERS
- **The Obsidian Firewall:**

In addition to this, the code is structured in a way that rejects systems prompts provided by users; thereby ensuring safety of the platform.

Streamlit

Streamlit is an open-source Python framework for building and sharing interactive web applications for data science and machine learning projects. It allows developers to create UI components directly in Python without needing frontend development skills.

Key Features

Simple Python-based UI development
Real-time interaction and updates
Easy integration with data science & ML libraries
Fast deployment and sharing
Built-in support for visualizations

Streamlit was used to quickly build and deploy an interactive frontend interface for LoRAfrica. Moreover, Streamlit secrets was also used to hide the endpoint from users as a security concern

NB: Before the use of Streamlit, server side was tested using POSTMAN to send requests and generate outputs from the model.

Conclusion

LoRAfrica demonstrates an end-to-end pipeline for deploying a domain-specific Large Language Model in LLMOps, from fine-tuning and benchmarking to scalable production deployment. By combining LoRA fine-tuning with efficient inference strategies (naive and vLLM), the project evaluates key performance metrics such as TTFT, ITL, TPOT, E2E, and TPS to optimise real-world performance.

The system leverages Modal for serverless GPU-based deployment, enabling automatic scaling and simplified infrastructure management, while Streamlit provides an accessible web interface for user interaction. Additional integrations such as LangSmith ensured observability, logging, and PII redaction for safer and more reliable usage.

Overall, the project highlights how modern tools can be orchestrated to deliver a scalable, efficient, and production-ready AI application tailored to African history knowledge.

Key Contributions

Built a domain-specific LLM using LoRA fine-tuning for African history
Benchmarked inference performance using both naive and vLLM approaches
Measured and analysed key latency and throughput metrics (TTFT, ITL, TPOT, E2E, TPS)
Deployed the model using serverless GPU infrastructure via Modal
Developed an interactive frontend using Streamlit for real-time user interaction
Integrated LangSmith for monitoring, logging, and PII redaction

Issues Faced

Setting up LiteLLM as a proxy to monitor billing and token usage which would allow alerting messages on SLACK
Trying to rent the needed GPU on Runpod to condut experimentation of naive and vLLM approaches
Pattern recognition to redact PII's in LangSmith

Future Scope

Use user's inputs logged by LangSmith to have more questions and research for ground truth thereby generating more training data further to fine-tune & improve LoRAfrica
Further use a proxy interface which hides the endpoint from users in addition to the use of Streamlit secrets
Implement more measures to redact users' PIIs from being logged by LangSmith