
This project represents the final phase of LoRAfrica deployment, making the system accessible to users on the web through scalable serverless infrastructure using Modal and an interactive frontend built with Streamlit.
Aim
Deploy LoRAfrica via Modal and Streamlit for seamless web-based user interaction.
Objectives

The first phase of the project with making LoRAfrica available on user consumer hardware before deploying unto Modal and Streamlit. The base model microsoft/Phi-4-mini-instruct was merged with the LoRA adapters of DannyAI/phi4_african_history_lora_ds2_axolotl which was then pushed to HuggingFace. This allowed a unified model for which developers can further fine-tune id desired.
The Base model and Lora Model were quantised using Llama.cpp in the q4_k_m.gguf format, which reduced the model size to about 2GB (base model) and adapters to 6MB. Make it fully available and runable on consumer hardware via platforms such as Ollama and Llama.cpp interactive UI apps.
Below are the relevant links for LoRAfrica for CPU
LoRAfrica's Serverless Compute Deployment involved the following;
Below are the brief overview of the metrics measured;
Measures the latency of the first response.
Unit: ms
The time elapsed between two consecutive tokens.
Unit: ms
The average generation time per token, excluding the prefill (TTFT) phase.
Unit: ms
The total time from request to the final token.
Unit: ms
The overall throughput of the inference engine.
Unit: tokens/sec
| Metric | Focus | LaTeX Logic |
|---|---|---|
| TTFT | Responsiveness | |
| ITL | Smoothness | |
| TPOT | Speed | |
| E2E | Total Time | |
| TPS | Throughput |
Benchmarks were conducted using a custom dataset as shown below;
SYSTEM_PROMPT = "You are a helpful AI assistant specialised in African history which gives concise answers to questions asked." TEST_PROMPTS = [ "Who was the richest king in history from the Mali Empire?", "Explain the significance of the Great Zimbabwe ruins.", "Describe the impact of the Ashanti Empire on West African trade.", "What was the role of the Kingdom of Aksum in early Christianity?", "Detail the leadership of Queen Amina of Zazzau." ]
For the vLLM approach, the dataset above was converted to a .jsonlformat, as vLLM expects such as shown below;
Template used;
full_prompt = f"<|system|>\n{SYSTEM_PROMPT}\n<|user|>\n{user_q}\n<|assistant|>\n"
Example format
{"prompt": "<|system|>\nYou are a helpful AI assistant specialised in African history which gives concise answers to questions asked.\n<|user|>\nWho was the richest king in history from the Mali Empire?\n<|assistant|>\n", "output_len": 128}
The results of the naive vs vLLM approach are shown and explained briefly below;
| Batch Size | Naive TPS | vLLM TPS (output_throughput) |
|---|---|---|
| 4 | 17.18 | 212.42 |
| 16 | 17.14 | 707.30 |
| 32 | 17.15 | 1172.94 |
| 64 | 16.50 | 1804.45 |
Interpretation:
| Batch | Naive Mean | vLLM Mean | Naive Median | vLLM Median | Naive P95 | vLLM P95 | Naive P99 | vLLM P99 |
|---|---|---|---|---|---|---|---|---|
| 4 | 1306.01 | 62.19 | 420.95 | 62.77 | 8533.89 | 67.57 | 8744.09 | 72.30 |
| 16 | 5273.29 | 71.41 | 2646.71 | 65.64 | 17867.76 | 105.55 | 18763.96 | 107.25 |
| 32 | 7396.85 | 92.39 | 5847.75 | 92.25 | 23157.50 | 121.94 | 24690.97 | 125.02 |
| 64 | 22182.49 | 157.05 | 23376.35 | 179.74 | 39222.21 | 210.91 | 59003.12 | 211.92 |
Interpretation:
| Batch | Naive Mean | vLLM Mean | Naive Median | vLLM Median | Naive P95 | vLLM P95 | Naive P99 | vLLM P99 |
|---|---|---|---|---|---|---|---|---|
| 4 | 44.43 | 17.76 | 41.87 | 17.78 | 62.21 | 17.95 | 64.18 | 17.98 |
| 16 | 44.19 | 19.03 | 42.10 | 19.14 | 53.61 | 19.37 | 64.61 | 19.45 |
| 32 | 44.77 | 20.25 | 42.31 | 20.39 | 59.11 | 20.83 | 64.68 | 20.98 |
| 64 | 44.87 | 22.50 | 42.40 | 22.65 | 64.81 | 23.39 | 64.99 | 23.60 |
Interpretation:
| Batch | Naive Mean | vLLM Mean | Naive Median | vLLM Median | Naive P95 | vLLM P95 | Naive P99 | vLLM P99 |
|---|---|---|---|---|---|---|---|---|
| 4 | 57.90 | 17.77 | 63.02 | 17.49 | 67.26 | 21.83 | 68.25 | 30.03 |
| 16 | 57.39 | 19.04 | 63.37 | 18.33 | 67.44 | 27.55 | 68.69 | 30.81 |
| 32 | 56.40 | 20.23 | 62.87 | 19.63 | 67.73 | 27.67 | 68.74 | 32.58 |
| 64 | 59.42 | 22.43 | 64.13 | 22.21 | 68.10 | 28.87 | 69.01 | 38.08 |
Interpretation:
| Batch | Naive Mean | vLLM Mean | Naive Median | vLLM Median | Naive P95 | vLLM P95 | Naive P99 | vLLM P99 |
|---|---|---|---|---|---|---|---|---|
| 4 | 1931.48 | 1536.84 | 551.98 | 1538.78 | 8674.16 | 1557.75 | 9341.98 | 1565.19 |
| 16 | 5817.38 | 1650.05 | 3018.40 | 1653.50 | 18114.21 | 1712.05 | 19265.36 | 1720.17 |
| 32 | 7870.63 | 1806.79 | 6218.77 | 1818.53 | 23466.47 | 1884.95 | 24817.74 | 1901.24 |
| 64 | 22902.15 | 2030.43 | 23502.29 | 2065.39 | 43940.01 | 2157.93 | 59194.26 | 2177.12 |
Interpretation:
This is why vLLM was used for modal deployment via Modal's Serverless Compute.
Modal is a serverless cloud platform for running AI and machine learning workloads without managing infrastructure. It handles scaling, GPU provisioning, and execution automatically, allowing developers to focus on building and deploying models.
Key Features
In this setup the,
L4 GPU and 2 T4 GPU (a fallback safety net) was deployed. Which allows 100 concurrent requests with an allowance of 10 containers. An idle container was not allowed so as to manage costs in infrastructure management
1️⃣ Formula
Cost per token = GPU cost per hour ÷ tokens per hour throughput
2️⃣ Estimate Tokens Per Hour
Max output tokens per request: 128
The Baseline Estimates
3️⃣ Compute Cost Per Token
L4:
cost per token = 0.80 / 1,000,000 = 0.0000008 USD/token
T4:
cost per token = 0.59 / 750,000 ≈ 0.000000787 USD/token
| GPU | Cost per token (USD) |
|---|---|
| L4 | 0.0000008 |
| T4 | 0.000000787 |
If your prompt is ~50 tokens, output is 128 tokens:
Fractional cost:
input cost per token = 0.0000008 * (50 / 178) = 0.000000224
output cost per token = 0.0000008 * (128 / 178) = 0.000000576
Actual Compute Cost
input_cost = input_tokens * input_cost_per_token =
50 * 0.000000224 = 0.0000112 USD
output_cost = output_tokens * output_cost_per_token =
128 * 0.000000576 = 0.000073728 USD
total_cost = input_cost + output_cost =
0.0000112 + 0.000073728 = 0.000084928 USD
This means per-request cost is extremely low; Making serverless LLM deployment economically viable at scale. Moreover, this means 10,000 requests would cost less than $1, highlighting the cost-efficiency of optimised LLM deployment using modern GPU infrastructure.
LangSmith is a development and monitoring platform for building, debugging, and evaluating LLM-powered applications. It provides deep visibility into how models and agents behave, making it easier to trace, test, and improve performance.
Key Features
In this project, it was used to trace the LLM responses and ensuring PIIs were being redacted on the developer's side to ensure privacy of users.
Redaction of users' information was achieved using a hybrid approach; regular expressions and Microsoft's Presidio anonymizer & Presidio analyzer; which recognises entities allowing for redaction where regular expressions fails.
For further safety concerns; the system prompt was structured in a way that it hidden as a secret on Modal and immune to prompt injections. I cannot share the full prompt but here is an example way to do this;
""" ### ROLE You are LoRAfrica, an expert-level repository of African Historical Knowledge........ ### DOMAIN LAYER ### SECURITY & ANTI-INJECTION LAYERS - **The Obsidian Firewall:**
In addition to this, the code is structured in a way that rejects systems prompts provided by users; thereby ensuring safety of the platform.
Streamlit is an open-source Python framework for building and sharing interactive web applications for data science and machine learning projects. It allows developers to create UI components directly in Python without needing frontend development skills.
Key Features
Streamlit was used to quickly build and deploy an interactive frontend interface for LoRAfrica.
With the frontend being managed by Streamlit, there were some noticeable latency; a drawback well known in Streamlit but good for testing purposes.
NB: Before the use of Streamlit, server side was tested using POSTMAN to send requests and generate outputs from the model.
LoRAfrica demonstrates an end-to-end pipeline for deploying a domain-specific Large Language Model in LLMOps, from fine-tuning and benchmarking to scalable production deployment. By combining LoRA fine-tuning with efficient inference strategies (naive and vLLM), the project evaluates key performance metrics such as TTFT, ITL, TPOT, E2E, and TPS to optimise real-world performance.
The system leverages Modal for serverless GPU-based deployment, enabling automatic scaling and simplified infrastructure management, while Streamlit provides an accessible web interface for user interaction. Additional integrations such as LangSmith ensured observability, logging, and PII redaction for safer and more reliable usage.
Overall, the project highlights how modern tools can be orchestrated to deliver a scalable, efficient, and production-ready AI application tailored to African history knowledge.
Malicious prompt-1

Malicious prompt-2

POSTMAN Base Model

POSTMAN LoRAfrica Model

Streamlit Interface

LangSmith tracing and Redaction-1

LangSmith tracing and Redaction-2
