In this work, we present a comparative analysis of automated video-to-text transcription using two state-of-the-art deep learning models, Wav2Vec2 and Whisper, integrated into an intuitive Streamlit-based user interface. Video-to-text transcription plays a key role in domains such as accessibility, content indexing, and real-time translation. The proposed system extracts the audio track from video files and applies these high-performance models to transcribe the speech accurately and efficiently.
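As a concrete illustration of this extract-then-transcribe pipeline, the following minimal sketch pulls a 16 kHz mono audio track out of a video with ffmpeg and transcribes it with a small Whisper checkpoint. The specific tooling (ffmpeg, the openai-whisper package, the "base" model size) is an assumption for illustration, not necessarily the system's actual implementation.

```python
import subprocess

import whisper  # assumed dependency: the openai-whisper package


def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    """Strip the audio track from a video file as 16 kHz mono WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    return audio_path


def transcribe(audio_path: str) -> str:
    """Transcribe the extracted audio with a small Whisper model."""
    model = whisper.load_model("base")  # model size is an illustrative choice
    return model.transcribe(audio_path)["text"]


if __name__ == "__main__":
    print(transcribe(extract_audio("input_video.mp4")))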
Wav2Vec2 is compared against Whisper, which was designed for multilingual transcription and is a particularly strong performer in noisy environments. To bring out the differences between the two systems, we analyze their transcription results both qualitatively and quantitatively under various audio conditions.
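One way to make the quantitative side of this comparison concrete is sketched below. The abstract does not fix a metric, so word error rate (WER) via the jiwer package is assumed here as the standard ASR measure, and the Hugging Face checkpoints named are illustrative stand-ins for the exact model variants under test.

```python
from jiwer import wer  # assumed metric: word error rate
from transformers import pipeline

# Illustrative checkpoints; the study's exact model variants may differ.
wav2vec2 = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
whisper_asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

reference = "the quick brown fox jumps over the lazy dog"  # ground-truth transcript

for name, asr in [("Wav2Vec2", wav2vec2), ("Whisper", whisper_asr)]:
    hypothesis = asr("audio.wav")["text"]  # the same clip is fed to both models
    # Wav2Vec2 checkpoints emit uppercase text, so normalize before scoring.
    print(f"{name}: WER = {wer(reference, hypothesis.lower().strip()):.2%}")
```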
Our application not only produces transcriptions but also lets researchers and practitioners view and compare the results directly; a sketch of such a comparison view appears below. This paper shows that Whisper is the stronger performer in noisy, multilingual settings, while Wav2Vec2 dominates in clean, monolingual ones. By bridging state-of-the-art transcription models with a practical realization, this work contributes to advances in automatic speech recognition (ASR) and provides a robust tool applicable across a diverse range of settings.
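The following is a minimal sketch of how such a side-by-side view can be built in Streamlit; the layout (a file uploader and two columns) and the cached model loading are illustrative assumptions rather than the application's actual code.

```python
import streamlit as st
from transformers import pipeline


@st.cache_resource
def load_models():
    """Load both ASR models once and reuse them across Streamlit reruns."""
    return (
        pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h"),
        pipeline("automatic-speech-recognition", model="openai/whisper-base"),
    )


st.title("Wav2Vec2 vs. Whisper Transcription")
audio_file = st.file_uploader("Upload extracted audio", type=["wav"])

if audio_file is not None:
    wav2vec2, whisper_asr = load_models()
    audio_bytes = audio_file.read()

    # Show both transcripts next to each other for direct comparison.
    left, right = st.columns(2)
    left.subheader("Wav2Vec2")
    left.write(wav2vec2(audio_bytes)["text"])
    right.subheader("Whisper")
    right.write(whisper_asr(audio_bytes)["text"])
```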