This work presents an evaluation framework for assessing the quality and relevance of doctor-patient dialogues in a medical chatbot environment using BERTScore. By leveraging contextual embeddings from a domain-specific BERT model (emilyalsentzer/Bio_ClinicalBERT), the framework compares user-generated diagnostic questions against expert-authored ground truth questions. The evaluation provides precision, recall, and F1 metrics, offering a nuanced measure of dialogue quality beyond traditional surface-level metrics. This approach demonstrates improved alignment with clinical relevance and can inform the development and refinement of medical conversational agents.
Medical chatbots are increasingly used to simulate patient interactions, assist in diagnosis, and support medical training. However, evaluating the quality of these dialogues remains challenging due to the complexity of medical language and the need for contextually relevant responses. Traditional metrics such as BLEU and ROUGE, while useful, often fail to capture semantic similarity and clinical appropriateness. BERTScore, which utilizes transformer-based contextual embeddings, offers a more robust alternative for evaluating the relevance and accuracy of medical conversations. This study introduces a BERTScore-based evaluation pipeline tailored for medical chatbot interactions, aiming to bridge the gap between human expert judgment and automated assessment.
Previous efforts in medical chatbot evaluation have primarily relied on n-gram-based metrics (BLEU, ROUGE) or manual expert review, both of which have notable limitations in capturing semantic and contextual nuances. Recent research has explored the use of BERT and its variants for both chatbot generation and evaluation, demonstrating superior performance in understanding medical language and intent. Other projects have employed sentence transformers (SBERT, MEDBERT) for retrieval and matching tasks in medical QA systems. However, systematic frameworks for evaluating chatbot dialogue quality using BERTScore, particularly with domain-specific models, remain underexplored.
Bio_ClinicalBERT.import json
with open('user_questions.json') as f:
user_questions = json.load(f)
with open('ground_truth.json') as f:
ground_truth = json.load(f)
P, R, F1 = score(user_questions, ground_truth, lang="en", model_type="emilyalsentzer/Bio_ClinicalBERT")
emilyalsentzer/Bio_ClinicalBERT was selected for its pretraining on clinical notes, enhancing its understanding of medical terminology.{ "results": [ { "student_question": "What do you do for work?", "ground_truth_question": "what kind of work do you do", "patient_response": "I'm a landscaper.", "bertscore_precision": 0.88, "bertscore_recall": 0.91, "bertscore_f1": 0.89 }, ... ], "summary": { "average_precision": 0.87, "average_recall": 0.89, "average_f1": 0.88, "alignment_score": "8/10 matched (80%)", "total_questions": 10 } }
| Metric | Token-based | Context-aware | Synonym-aware | Suitable for Long Texts |
|---|---|---|---|---|
| BLEU | Yes | No | No | Poor |
| ROUGE | Yes | No | No | Somewhat |
| BERTScore | No | Yes | Yes | Yes |
BERTScore consistently outperformed n-gram metrics in capturing the semantic relevance of medical questions.
The use of BERTScore with a domain-specific model like Bio_ClinicalBERT allows for a more accurate and context-sensitive evaluation of medical chatbot dialogues. This approach addresses key limitations of traditional metrics by leveraging contextual embeddings, which are crucial for understanding medical nuances and intent. The framework also provides granular feedback (per-question scores) and summary statistics, which can guide iterative improvement of chatbot systems.
Limitations:
This study demonstrates the effectiveness of BERTScore for evaluating medical chatbot dialogues, particularly when combined with a clinically pretrained BERT model. The framework offers a robust, context-aware alternative to traditional metrics, enabling more meaningful assessment of chatbot performance in medical training and patient interaction scenarios. Future work may explore real-world deployment and integration with continuous learning systems.
We thank the contributors of the emilyalsentzer/Bio_ClinicalBERT model and the open-source community for providing the resources and tools that made this research possible.