
The objective of this project is to deploy, evaluate, and monitor a fine-tuned BART dialogue summarization model trained on the HighlightSum dataset, providing a practical reference for ML engineers, MLOps practitioners, and data scientists interested in production LLM systems.
The model training process is documented separately in the publication LLM Engineering and Deployment Certification: PEFT of BART for Dialogue Summarization.
This project focuses exclusively on the deployment and operational phase of the model lifecycle. It demonstrates how a fine-tuned model can be deployed as a managed cloud-based inference service using the Hugging Face Inference API, how it processes real dialogue inputs to generate summaries, and how key performance metrics are monitored post-deployment.
A controlled evaluation using 10 dialogue samples was conducted to assess operational behavior, including inference latency, cost projections, and endpoint reliability. In addition, a monitoring setup was implemented to track usage metrics, errors, and operational performance indicators.
Overall, this project demonstrates how to:
- Deploy a fine-tuned model as a managed inference endpoint
- Evaluate inference latency, reliability, and cost under a controlled test setup
- Monitor usage, errors, and operational performance after deployment
This project deploys a fine-tuned BART-based Large Language Model (LLM) for dialogue summarization. The system takes multi-speaker conversational transcripts as input and generates concise, structured summaries.
The objective is to enable automated summarization of customer support conversations, meeting transcripts, and internal communication logs in order to:
The system is designed for:
The following examples were generated using the live Hugging Face Inference Endpoint.
All outputs shown are actual responses returned by the deployed model and are also stored in logs/demo_summaries.csv, available in the project's GitHub repository.
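For reference, a minimal client sketch in the spirit of that script is shown below; the endpoint URL and token environment variable are placeholders rather than the project's actual values, and the response parsing assumes the standard summarization payload returned by Hugging Face Inference Endpoints.

```python
import csv
import os

import requests

# Placeholder values: substitute the real endpoint URL and supply a valid token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = os.environ["HF_TOKEN"]


def summarize(dialogue: str) -> str:
    """Send one dialogue to the deployed summarization endpoint and return the summary."""
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": dialogue},
        timeout=30,
    )
    response.raise_for_status()
    # Summarization endpoints typically return a list of {"summary_text": ...} objects.
    return response.json()[0]["summary_text"]


if __name__ == "__main__":
    dialogue = "A: Hows the product launch going? B: Marketing is finalizing the assets."
    summary = summarize(dialogue)
    with open("logs/demo_summaries.csv", "a", newline="") as f:
        csv.writer(f).writerow([dialogue, summary])
    print(summary)
```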
Input Dialogue
Olivia: Who are you voting for in this election? Oliver: Liberals as always. Olivia: Me too!! Oliver: Great
Model Output (Deployed Endpoint)
Oliver is voting for Liberals this election.
This output demonstrates the model’s ability to extract and concisely summarize the main decision expressed within a short conversational exchange.
Input Dialogue
Yannick: I heard you are going to sing the anthem at the game Nicki: Yes I am. I am nervous! Yannick: Dont be. It's a huge privilege to sing the anthem in front of thousands of people! Nicki: I hoep I won't forget the lyrics Yannick: Youre a great singer. You will be fine Nicki: I have to rehearse Yannick: Your parents coming with you? Nicki: Yes Yannick: They're probably proud of you good luck! Nicki: Thank you Yannick
Model Output (Deployed Endpoint)
Nicki is going to sing the anthem at the game. Yannick is going with Nicki to the game with her parents. Nicki is nervous.
This output demonstrates the model’s capacity to synthesize key contextual details and emotional cues from a longer multi-speaker dialogue into a coherent summary. While the model performs reliably, occasional abstraction drift may occur in longer dialogues, which is a known characteristic of abstractive summarization systems.
Input Dialogue
A: Hows the product launch going? B: Marketing is finalizing the assets. A: Great, keep me posted.
Model Output (Deployed Endpoint)
B tells A how marketing is finalizing the assets for the product launch.
This output demonstrates the model’s effectiveness in generating a structured summary of a brief status-oriented conversation.
Overall, the above summaries reflect real-time responses from the deployed inference endpoint. They demonstrate the model’s ability to generate concise abstractive summaries of multi-speaker dialogue inputs in a production setting.
| Metric | Target |
|---|---|
| Average Latency | ≤ 3 seconds |
| Reliability | ≥ 99% successful responses |
| Cost per 1,000 Requests | < $1 |
| ROUGE-L Score | ≥ Fine-tuned baseline |
| Error Rate | < 1% |
Deployment is considered successful if performance, reliability, and cost remain within defined thresholds under expected load.
These assumptions guide infrastructure sizing and scaling strategy.
The deployed model is a fine-tuned version of BART-large, originally developed by Meta AI.
| Aspect | Configuration |
|---|---|
| Model | BART-large (Fine-tuned for Dialogue Summarization) |
| Model Source | Hugging Face Model Hub |
| Parameter Count | 406 Million |
| Quantization | None (FP16 Inference) |
| Context Length | 1024 Tokens |
| Max Output Tokens | 128 Tokens |
| Generation Strategy | Beam Search (4 beams) |
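To illustrate how this configuration maps to code, the sketch below loads a BART summarization checkpoint with the transformers library and generates with 4-beam search, a 1024-token input limit, and a 128-token output cap; the model ID is a placeholder for the fine-tuned checkpoint on the Hub.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder model ID; the actual fine-tuned checkpoint is hosted on the Hugging Face Hub.
MODEL_ID = "your-username/bart-large-dialogue-summarizer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

dialogue = "Olivia: Who are you voting for in this election? Oliver: Liberals as always."
inputs = tokenizer(dialogue, truncation=True, max_length=1024, return_tensors="pt")

# Mirror the table above: beam search with 4 beams and at most 128 output tokens.
summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```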
| Factor | Consideration |
|---|---|
| Size vs Quality | 406M parameters provide strong summarization without extreme GPU requirements |
| Cost vs Performance | Smaller than 7B+ models → lower inference cost |
| Encoder-Decoder vs Decoder-Only | Better structured summarization output |
| Quantization | Not used to preserve output quality |
Quantization (e.g., INT8 or INT4) could reduce the model's memory footprint, lower inference cost, and improve throughput. However, it may introduce minor degradation in summarization quality. Future optimization may include INT8 quantization for cost reduction.
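As a rough illustration of what that future optimization could look like (not part of the current deployment), PyTorch dynamic INT8 quantization can be applied to the linear layers for CPU inference; the model ID is again a placeholder, and summary quality would need to be re-validated afterwards.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

MODEL_ID = "your-username/bart-large-dialogue-summarizer"  # placeholder

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Convert linear layers to INT8 for CPU inference; weights are quantized,
# activations are quantized dynamically at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model.generate(...) is a drop-in replacement, but ROUGE-L should be
# re-measured before adopting it in production.
```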
The deployment uses:
- Hugging Face Inference Endpoints for managed model serving
- A Gradio web app on Hugging Face Spaces as the user-facing interface
- Weights & Biases for monitoring and alerting

Alternative platforms were considered and rejected for the following reasons:
| Platform | Reason Rejected |
|---|---|
| AWS SageMaker | Higher setup complexity |
| Modal | Less direct integration with model hub |
| Self-hosted EC2 | Requires manual scaling & DevOps |
| vLLM on Cloud VM | Optimized for large decoder-only models rather than encoder-decoder architectures like BART |
| Component | Configuration |
|---|---|
| vCPU Type | Intel Sapphire Rapids |
| Memory | 16 GB RAM |
| Endpoint Type | Real-time Inference |
| Scaling Strategy | Fixed capacity with autoscaling |
| Geographic Region | US-East |
| Deployment Artifact | Merged fine-tuned model |
The vCPU instance offers:
- More predictable scaling for lightweight summarization workloads
- Suitable performance for low-to-moderate traffic scenarios
This configuration balances cost efficiency and performance reliability.
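For reproducibility, an endpoint with approximately this configuration could be provisioned programmatically via the huggingface_hub library, as sketched below; the repository name, instance identifiers, and replica counts are assumptions and should be checked against the Inference Endpoints catalog.

```python
from huggingface_hub import create_inference_endpoint

# Repository and instance identifiers below are assumptions for illustration only.
endpoint = create_inference_endpoint(
    "bart-dialogue-summarizer",
    repository="your-username/bart-large-dialogue-summarizer",
    framework="pytorch",
    task="summarization",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x4",
    instance_type="intel-spr",
    min_replica=1,
    max_replica=2,
)

endpoint.wait()       # block until the endpoint reaches the "running" state
print(endpoint.url)   # base URL used by the client script
```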
The end-to-end deployment flow proceeds through the following stages:
1. User Application (Web Browser)
2. Gradio Web App (Hugging Face Spaces UI Layer)
3. Client Script (Authenticated API Call)
4. Hugging Face Inference Endpoint
5. Managed vCPU Instance
6. Fine-Tuned BART Model (PEFT)
7. Generated Summary Response
8. Metric Logging (Latency, Errors, Token Usage)
9. Weights & Biases Dashboard
10. Monitoring and Alerting
The workflow emphasizes:
Note: The earlier stages of the model lifecycle are documented in the original training publication, LLM Engineering and Deployment Certification: PEFT of BART for Dialogue Summarization.
A detailed cost model supporting the estimates below is provided in the accompanying spreadsheet cost_estimate.xlsx in the project's GitHub repository.
| Cost Component | Monthly Estimate |
|---|---|
| Compute (vCPU Endpoint) | $194 |
| Storage (Model + Logs) | $10 |
| Network Transfer | $15 |
| Monitoring (W&B) | |
| Total Estimated | ~ |
The baseline estimate assumes a single Intel Sapphire Rapids vCPU running continuously (720 hours/month) at $0.27 per hour. Monitoring costs vary depending on Weights & Biases plan selection.
The full calculation logic (including formulas and adjustable parameters) is available in cost_estimate.xlsx for transparency and reproducibility.
Assumptions:
- Instance cost: $0.27 per hour (single Intel Sapphire Rapids vCPU)
- Steady-state throughput: 200 requests per hour

Cost per request: $0.27 ÷ 200 = $0.00135

Cost per 1,000 requests: $0.00135 × 1,000 = $1.35

Estimated cost ≈ $1.35 per 1,000 requests
The spreadsheet includes editable cells for:
This allows dynamic adjustment of cost projections under different deployment scenarios.
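For readers who prefer code to spreadsheets, the same calculation can be expressed as a small function whose defaults match the baseline assumptions above; changing the arguments mirrors editing the spreadsheet cells.

```python
def cost_per_1k_requests(hourly_rate: float = 0.27, requests_per_hour: int = 200) -> float:
    """Cost of serving 1,000 requests for a given instance rate and throughput."""
    return hourly_rate / requests_per_hour * 1000


def monthly_compute_cost(hourly_rate: float = 0.27, hours_per_month: int = 720) -> float:
    """Always-on compute cost for a single instance."""
    return hourly_rate * hours_per_month


print(cost_per_1k_requests())   # 1.35  -> ~$1.35 per 1,000 requests
print(monthly_compute_cost())   # 194.4 -> ~$194 per month
```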
Cost projections are based on assumed steady-state throughput (200 requests/hour) and measured stability from a 10-dialogue evaluation. Large-scale concurrency and stress testing were not conducted; therefore, actual production costs may vary depending on traffic distribution, batching efficiency, autoscaling configuration, and sustained CPU utilization.
| Strategy | Description | Impact on Cost | Trade-Off |
|---|---|---|---|
| Quantization | Reduce model precision to INT8 or INT4 to lower memory footprint and GPU requirements. | Lower compute cost and improved throughput. | Potential minor degradation in summary quality. |
| Batching | Process multiple inference requests simultaneously to maximize GPU utilization. | Reduced cost per request. | Slight increase in per-request latency. |
| Caching | Store summaries of frequently repeated dialogue inputs. | Reduces repeated inference calls. | Requires cache invalidation strategy. |
| Auto-Scaling | Dynamically scale replicas based on traffic volume. | Prevents overpaying during low-traffic periods. | Cold-start latency during scale-up events. |
| Spot Instances | Use discounted/preemptible GPU instances where supported. | Significant infrastructure cost reduction. | Risk of instance interruption. |
A combination of quantization, batching, and auto-scaling provides the most balanced approach to minimizing cost while maintaining acceptable latency and summary quality.
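Of these strategies, caching is the simplest to prototype on the client side. The sketch below (assuming the summarize() client function from the earlier example) reuses results for repeated dialogue inputs; any such cache would need to be invalidated whenever the deployed model version changes.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def cached_summarize(dialogue: str) -> str:
    """Return a cached summary for repeated inputs, calling the endpoint only on cache misses."""
    return summarize(dialogue)  # summarize() is the client function shown earlier


# Repeated identical dialogues now cost a single inference call.
print(cached_summarize("A: Status update? B: Shipping on Friday."))
print(cached_summarize("A: Status update? B: Shipping on Friday."))  # served from cache
```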
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| p50 Latency | UX consistency | > 2.5s |
| p99 Latency | Tail performance | > 4s |
| Error Rate | Reliability | > 2% |
| Throughput | Capacity | < 2 RPS |
| Token Usage | Cost control | > 800 tokens/request |
| GPU Utilization | Efficiency | < 20% or > 95% |
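A minimal sketch of how these thresholds could be evaluated against logged request records is shown below; the record format (latency in seconds, a success flag, and a token count) is an assumption for illustration.

```python
import numpy as np


def check_alerts(records: list[dict]) -> list[str]:
    """records: dicts like {"latency_s": 3.2, "ok": True, "tokens": 412}."""
    latencies = np.array([r["latency_s"] for r in records])
    error_rate = 1 - sum(r["ok"] for r in records) / len(records)
    alerts = []
    if np.percentile(latencies, 50) > 2.5:
        alerts.append("p50 latency above 2.5 s")
    if np.percentile(latencies, 99) > 4.0:
        alerts.append("p99 latency above 4 s")
    if error_rate > 0.02:
        alerts.append("error rate above 2%")
    if np.mean([r["tokens"] for r in records]) > 800:
        alerts.append("average token usage above 800 per request")
    return alerts
```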
| Tool | Purpose |
|---|---|
| Weights & Biases | Logging & visualization |
| Hugging Face Dashboard | Endpoint monitoring |
| Custom Logging | Latency + token tracking |
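A minimal per-request logging sketch with the Weights & Biases SDK could look like the following; the project name is a placeholder, the token count is approximated by a word count rather than a tokenizer, and summarize() is the client function from the earlier example.

```python
import time

import wandb

wandb.init(project="bart-dialogue-summarizer-monitoring")  # placeholder project name


def logged_summarize(dialogue: str) -> str:
    """Call the endpoint and log latency, approximate token usage, and errors to W&B."""
    start = time.time()
    try:
        summary = summarize(dialogue)  # client function from the earlier sketch
        error = 0
    except Exception:
        summary, error = "", 1
    wandb.log({
        "latency_s": time.time() - start,
        "input_tokens_approx": len(dialogue.split()),  # rough proxy, not a tokenizer count
        "error": error,
    })
    return summary
```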
The table below summarizes the key performance results obtained from the reproducible test setup (10 inference requests).
| Metric | Target/References | Actual Measured Value |
|---|---|---|
| Average Latency | ≤ 3s | ~3.6s |
| Reliability | ≥ 99% | 100% |
| Cost per 1K Tokens | < $0.50 | ~$0.0015 |
| Total Cost (10 requests) | Negligible | ~$0.0027 |
Performance is stable, with slightly elevated average latency that could be reduced by using fewer beams, a smaller maximum output length, or faster hardware.
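A sketch of how this 10-request measurement could be reproduced (again assuming the summarize() client from the earlier example) is shown below; it reports the average latency and the fraction of successful responses.

```python
import time


def evaluate(dialogues: list[str]) -> dict:
    """Measure average latency and success rate over a batch of dialogues."""
    latencies, successes = [], 0
    for d in dialogues:
        start = time.time()
        try:
            summarize(d)  # endpoint client from the earlier sketch
            successes += 1
        except Exception:
            pass
        latencies.append(time.time() - start)
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "reliability": successes / len(dialogues),
    }
```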
The deployment includes an interactive Gradio-based web application hosted on Hugging Face Spaces, titled “LLM Dialogue Summarizer”.
The web interface demonstrates real-time LLM inference for dialogue summarization.
Demonstration Overview
The application allows users to:
The interface includes:
This demonstrates the deployed model functioning as a user-facing inference service.
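A stripped-down version of such a Spaces app could be assembled with Gradio as sketched below; this assumes the summarize() client function from the earlier example and is not the project's exact application code.

```python
import gradio as gr


def summarize_dialogue(dialogue: str) -> str:
    # summarize() is the authenticated endpoint client from the earlier sketch.
    return summarize(dialogue)


demo = gr.Interface(
    fn=summarize_dialogue,
    inputs=gr.Textbox(lines=10, label="Dialogue"),
    outputs=gr.Textbox(label="Summary"),
    title="LLM Dialogue Summarizer",
)

if __name__ == "__main__":
    demo.launch()
```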

This example illustrates:
*(Screenshot: the deployed web application interface with an example dialogue and generated summary)*
Below are example screenshots (the latency plot and the error plot, respectively) from the W&B monitoring dashboard.
*(Screenshot: W&B dashboard latency plot)*

*(Screenshot: W&B dashboard error plot)*
While the deployment successfully demonstrates managed model serving, cost modeling, and monitoring integration, several limitations should be noted:
Potential future improvements include:
Feel free to suggest more ideas by opening an issue or starting a discussion! For bug reports or feature requests, open an issue; for general questions or to share your thoughts, start a discussion.
This project demonstrates a complete end-to-end deployment of a fine-tuned BART dialogue summarization model, including:
- Deployment as a managed Hugging Face Inference Endpoint
- An interactive Gradio demo hosted on Hugging Face Spaces
- A controlled evaluation of latency, reliability, and cost
- A cost model with optimization strategies
- Monitoring and alerting with Weights & Biases
The deployment achieves high reliability and low cost while maintaining acceptable latency, providing a practical reference architecture for LLM production systems.
We welcome contributions to improve the project:
1. Fork the GitHub repository
2. Create a feature branch: `git checkout -b your-feature-name`
3. Commit and push your changes
4. Submit a Pull Request and describe your contribution.
Please follow our code style and guidelines. For questions or suggestions, open an issue.
Licensed under the MIT License
For questions or feedback, please contact the authors:
This project is part of the LLM Engineering and Deployment Certification program by Ready Tensor. We appreciate the Ready Tensor developer community for their guidance and contributions.