This work presents a modular pipeline for grammar proficiency assessment from audio samples. Using robust transcription systems such as Whisper and fine-tuned transformer models (BERT, FLAN-T5), I developed a proof-of-concept grammar scoring engine for spoken English. The system ingests raw audio, performs transcription, and scores grammatical correctness against a rubric. Despite several practical challenges, including noisy long-duration samples, imperfect transcription artifacts, and the lack of a direct grammar evaluation metric, my pipeline achieves a promising RMSE of 0.439 on held-out samples. Here I outline key architectural decisions, evaluate multiple approaches, and discuss limitations along with a roadmap for future work.
In the age of conversational AI and remote language assessment, evaluating grammar from spoken audio remains a complex but important task. Unlike textual grammar correction, spoken grammar assessment introduces two additional bottlenecks: speech-to-text fidelity and the noise inherent in informal speech. My work aims to bridge this gap by proposing a lightweight yet extensible pipeline for grammar scoring from audio samples. Here I explore modular design principles, experiment with transcription and language models, and deliver an interpretable system prototype. The project was run entirely on local infrastructure and is available on GitHub.
I began by analyzing over 370 training samples of spoken English, most ranging between 30 and 180 seconds. A notable insight from the dataset available to me was the correlation between audio length and noise: clips exceeding 70 seconds frequently contained distortion, inconsistent pauses, or overlapping speech. These samples degraded transcription quality and skewed downstream grammar scoring. As a result, I adopted a hard cutoff at 70 seconds, flagging and excluding such samples, as sketched below.
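The following is a minimal sketch of that 70-second cutoff, not the exact filtering code; it assumes WAV files readable by the `soundfile` library and a hypothetical `data/train` directory.

```python
from pathlib import Path

import soundfile as sf

MAX_DURATION_S = 70.0  # clips longer than this are flagged and excluded


def filter_by_duration(audio_dir: str) -> tuple[list[Path], list[Path]]:
    """Split audio files into kept and flagged lists based on duration."""
    kept, flagged = [], []
    for path in sorted(Path(audio_dir).glob("*.wav")):
        duration = sf.info(str(path)).duration  # duration in seconds, read from the header
        (flagged if duration > MAX_DURATION_S else kept).append(path)
    return kept, flagged


kept, flagged = filter_by_duration("data/train")
print(f"kept {len(kept)} clips, flagged {len(flagged)} clips over {MAX_DURATION_S}s")
```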
Initially, I experimented with Wenet Gigaspeech (n-gram decoding), which handled accents reasonably well but produced semantically inconsistent transcriptions. It was therefore insufficient, since transcription fidelity directly affects grammar quality in this task.
I then transitioned to OpenAI's Whisper, which significantly improved results but suffered from latency because it processes audio in 30-second windows. A custom audio chunking and stitching pipeline was introduced to handle long audios, but inference remained slow.
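A minimal sketch of that chunking idea, not the exact implementation: split the audio into 30-second segments with `pydub`, transcribe each with Whisper, and naively concatenate the text (the naive stitching noted later in the limitations).

```python
import tempfile

import whisper
from pydub import AudioSegment

CHUNK_MS = 30 * 1000  # Whisper's native processing window is 30 seconds

model = whisper.load_model("base")


def transcribe_long_audio(path: str) -> str:
    """Split a long recording into 30 s chunks, transcribe each, and stitch the text."""
    audio = AudioSegment.from_file(path)
    pieces = []
    for start in range(0, len(audio), CHUNK_MS):
        chunk = audio[start:start + CHUNK_MS]
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            chunk.export(tmp.name, format="wav")
            result = model.transcribe(tmp.name)
            pieces.append(result["text"].strip())
    return " ".join(pieces)  # naive stitching; no sentence-level alignment
```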
To overcome this, I adopted Faster-Whisper, a lighter variant that offers similar accuracy with reduced runtime and memory usage. This change improved throughput and made the pipeline suitable for near-real-time use cases such as AI interviews.
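A minimal sketch of the Faster-Whisper call, assuming the `faster-whisper` package on a CPU-only machine; the model size and decoding options here are illustrative, not my exact configuration.

```python
from faster_whisper import WhisperModel

# int8 quantization keeps memory usage low on CPU-only local infrastructure
model = WhisperModel("small", device="cpu", compute_type="int8")


def transcribe(path: str) -> str:
    """Transcribe an audio file and return the stitched text."""
    segments, info = model.transcribe(path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)


text = transcribe("sample.wav")
```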
The other main challenge for the pipeline is evaluating grammar proficiency from the transcriptions. I approached this using:
Fine-tuned BERT and FLAN-T5: These models were trained on rubric-annotated transcriptions with regression heads predicting grammar scores. BERT achieved an RMSE of ~0.439 on the test subset, a strong result considering the transcription imperfections (see the regression-head sketch after this list).
LanguageTool-Python: I used this library to extract grammar error counts (a minimal usage sketch also follows the list). While conceptually useful, it often triggered false positives on the noisy transcriptions, reducing the reliability of the final scores. It was therefore only used during early ensemble experiments.
MPNet for Semantic Matching: I attempted to calculate similarity between transcriptions and ideal rubric descriptions via embedding cosine distance. However, this approach lacked alignment with grammar quality and was discarded.
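For reference, a minimal sketch of a BERT regression head using Hugging Face transformers; the checkpoint name, tokenization settings, and inference wrapper are assumptions for illustration, not my exact training setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed base checkpoint; fine-tuned weights not shown here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 with problem_type="regression" attaches a single-score regression head (MSE loss)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression"
)


def predict_grammar_score(transcription: str) -> float:
    """Return the raw regression output for one transcription."""
    inputs = tokenizer(transcription, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1)
    return logits.item()
```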
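And a minimal sketch of the LanguageTool error-count feature, assuming the `language_tool_python` package with its local Java-backed server; the per-word normalization shown here is illustrative.

```python
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")


def grammar_error_rate(transcription: str) -> float:
    """LanguageTool matches per word; noisy transcriptions tend to inflate this."""
    matches = tool.check(transcription)
    words = max(len(transcription.split()), 1)
    return len(matches) / words
```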
I acknowledge several areas where the project could be significantly improved:
Lack of Explicit Grammar Metrics: Due to time constraints around my end-semester examinations, I could not incorporate a definitive grammar error metric (such as GLEU or an error-type distribution) when evaluating language model outputs. This is a major limitation that I intend to address in future iterations.
Error Propagation from Transcription: All scoring is dependent on transcription fidelity. No error-aware or uncertainty-aware scoring was implemented, causing the system to misfire in edge cases.
Simplistic Postprocessing: Chunked transcriptions were naively stitched without attention to context loss or sentence-level alignment. Improvements here could greatly benefit fluency and scoring accuracy.
Audio Quality Filtering: No denoising was applied to flagged audios, resulting in data loss instead of recovery.
Despite these issues, I believe the modularity and adaptability of my pipeline set a strong foundation for future work.
The best-performing BERT model achieved the following metrics on the held-out validation subset:
| Metric | Value |
|---|---|
| Training RMSE Loss | 0.0133 |
| Evaluation RMSE Loss | 0.439 |
| Epochs | 50 |
| Learning Rate | 1.5e-8 |
| Eval Speed | 131.9 samples/sec |
Although this evaluation lacks direct grammar-metric validation, qualitative inspection suggests meaningful alignment between the predicted scores and rubric expectations.
After my end-semester examinations, the following improvements are planned:
Metric Integration: Introducing GLEU, grammatical error rate (GER), or error-category distributions for evaluating model predictions (a GLEU sketch follows this list).
Grammar-specific Architectures: Investigating transformers trained for grammatical error detection and correction (e.g., GECToR, T5+EditTagger).
Self-supervised Grammar Correction: Using pseudo-labeled data and semi-supervised learning to expand the available labeled dataset.
Better Transcription Handling: Context-aware stitching and filtering of transcription artifacts, potentially via self-attention postprocessors.
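As a pointer for the metric-integration item above, here is a minimal sketch of sentence-level GLEU using NLTK; it assumes corrected reference transcriptions exist, which is itself part of the planned work. Note that NLTK implements the Google-GLEU variant, whereas the GEC-specific GLEU additionally conditions on the source sentence.

```python
from nltk.translate.gleu_score import sentence_gleu


def gleu(reference: str, hypothesis: str) -> float:
    """Sentence-level GLEU between a corrected reference and a model output."""
    return sentence_gleu([reference.split()], hypothesis.split())


score = gleu(
    "she has been working here for three years",
    "she have been working here for three years",
)
```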
This project demonstrates a first step towards evaluating grammar from spoken English using a modular, reproducible pipeline. While limitations remain, particularly in evaluation and robustness, the approach showcases the viability of using Faster-Whisper and pre-trained transformers for low-resource grammar assessment. I hope this work encourages further experimentation and community contribution in this domain.