Multi-modal Cross-Platform Application for Text and Speech Recognition
Abstract
The Multi-modal App is an innovative cross-platform solution designed to assist specially-abled individuals by leveraging text-to-speech (TTS), speech-to-text (STT), and image-processing technologies. Built with Flutter, it operates seamlessly across devices and platforms, ensuring wide accessibility. Because it builds on Google's ML Kit, whose on-device models improve with successive releases, the app provides reliable recognition entirely offline, with no additional hardware required. This mobile solution offers significant benefits to individuals facing communication challenges, enabling them to express themselves effectively and independently.
Introduction
Effective communication is essential for human interaction, yet individuals with speech impairments or irregular speech patterns often encounter significant barriers. The Multi-modal App addresses these challenges by integrating advanced speech and text processing capabilities. By detecting text from images, translating speech to text, and vocalizing constructed sentences, the app empowers users to convey their thoughts and needs efficiently. Its cross-platform nature, facilitated by Flutter, ensures compatibility with various ecosystems, enhancing its accessibility and usability.
Methodology
The app combines multiple modalities of interaction:
Text-to-Speech (TTS): Converts text into natural-sounding speech, aiding users in vocalizing their thoughts (see the sketch following this list).
Speech-to-Text (STT): Transcribes spoken words into text, making irregular speech patterns comprehensible (see the sketch at the end of this section).
Image Text Recognition: Detects and extracts text from images, enabling users to input text by capturing visuals.
On-Device Processing: Ensures all functionalities work offline for enhanced reliability and user privacy.
Cross-Platform Support: Developed with Flutter, the app runs from a single codebase on a wide range of devices and platforms, offering broad support for different ecosystems.
Google's ML Kit Integration: Uses Google's ML Kit, whose on-device models are updated over time, improving recognition accuracy with each release.
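As an illustration of the TTS modality, the following minimal sketch uses the flutter_tts plugin; the plugin choice is an assumption, since this paper does not name the specific TTS library used.

```dart
// Minimal TTS sketch using the flutter_tts plugin (an assumed
// package choice; the paper does not name its TTS library).
import 'package:flutter_tts/flutter_tts.dart';

final FlutterTts _tts = FlutterTts();

Future<void> speakSentence(String sentence) async {
  await _tts.setLanguage('en-US'); // spoken language
  await _tts.setSpeechRate(0.5);   // slower rate for clarity
  await _tts.setPitch(1.0);        // neutral pitch
  await _tts.speak(sentence);      // vocalize the constructed text
}
```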
The app is designed to run entirely on a standard smartphone, eliminating the need for external devices and making it accessible to a broad audience.
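The STT modality can be sketched in the same spirit with the speech_to_text plugin (again an assumed package choice), which delivers recognized words through a callback:

```dart
// Minimal STT sketch using the speech_to_text plugin (assumed).
import 'package:speech_to_text/speech_to_text.dart';
import 'package:speech_to_text/speech_recognition_result.dart';

final SpeechToText _speech = SpeechToText();

Future<void> startListening(void Function(String) onText) async {
  // initialize() requests microphone permission and loads the engine.
  final bool available = await _speech.initialize();
  if (!available) return;
  await _speech.listen(
    onResult: (SpeechRecognitionResult result) {
      // recognizedWords holds the transcript so far; act on the final one.
      if (result.finalResult) onText(result.recognizedWords);
    },
  );
}

Future<void> stopListening() => _speech.stop();
```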
App Features
Cross-Platform Compatibility: Developed using Flutter, the app works seamlessly on Android, iOS, and other platforms, ensuring broad ecosystem support.
Offline Functionality: All ML models run completely offline, ensuring user privacy and reliability without requiring an internet connection.
No Account Needed: No login or account setup is required, making it quick and easy for users to start using the app.
Camera and Gallery Input: Users can take photos directly through the app or select images from their device's gallery for text recognition (a combined sketch follows this list).
Easy-to-Use UI: The intuitive and accessible user interface ensures a smooth experience for all users.
Text Copying: Extracted text can be easily copied to the clipboard for use in other applications, as shown in the sketch below.
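A condensed sketch of the capture, recognize, and copy flow described above, assuming the image_picker and google_mlkit_text_recognition plugins (the paper names ML Kit itself but not the exact Flutter packages):

```dart
// Capture -> recognize -> copy sketch; image_picker and
// google_mlkit_text_recognition are assumed package choices.
import 'package:flutter/services.dart' show Clipboard, ClipboardData;
import 'package:image_picker/image_picker.dart';
import 'package:google_mlkit_text_recognition/google_mlkit_text_recognition.dart';

final ImagePicker _picker = ImagePicker();
final TextRecognizer _recognizer =
    TextRecognizer(script: TextRecognitionScript.latin);

/// Picks an image from the camera or gallery, runs on-device OCR,
/// copies the extracted text to the clipboard, and returns it.
Future<String?> recognizeAndCopy({bool fromCamera = true}) async {
  final XFile? image = await _picker.pickImage(
    source: fromCamera ? ImageSource.camera : ImageSource.gallery,
  );
  if (image == null) return null; // user cancelled the picker

  final InputImage input = InputImage.fromFilePath(image.path);
  final RecognizedText recognized = await _recognizer.processImage(input);

  await Clipboard.setData(ClipboardData(text: recognized.text));
  return recognized.text;
}
```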
Experiments
To validate the app's capabilities:
Speech and Text Accuracy: Evaluated TTS and STT modules with a dataset of common phrases and irregular speech patterns.
Image Text Detection: Tested the accuracy and speed of optical character recognition (OCR) on various text-rich images.
User Interaction: Simulated real-life scenarios to assess the app’s usability in constructing sentences and conveying messages.
Offline Performance: Measured response times and reliability under different on-device processing conditions (an illustrative timing helper follows this list).
Cross-Platform Functionality: Assessed performance consistency across multiple devices and operating systems.
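As a rough illustration of how the offline response-time measurement could be taken, the helper below wraps a single recognition call in a stopwatch; it is a sketch, not the harness actually used in these experiments:

```dart
// Illustrative latency probe for on-device OCR (not the actual
// benchmark harness used in the experiments).
import 'package:google_mlkit_text_recognition/google_mlkit_text_recognition.dart';

Future<Duration> timeRecognition(String imagePath) async {
  final recognizer = TextRecognizer(script: TextRecognitionScript.latin);
  final stopwatch = Stopwatch()..start();
  await recognizer.processImage(InputImage.fromFilePath(imagePath));
  stopwatch.stop();
  await recognizer.close(); // release native ML Kit resources
  return stopwatch.elapsed;
}
```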
Results
High accuracy rates for text recognition from images, with performance improvements observed over time due to ML Kit's evolving capabilities.
Reliable conversion of irregular speech patterns to coherent text.
Smooth and natural text-to-speech conversion, aiding clear communication.
Seamless offline functionality without latency issues, ensuring a user-friendly experience.
Consistent performance across various devices and platforms, demonstrating effective cross-platform support.
Conclusion
The Multi-modal App is a groundbreaking solution for specially-abled individuals, offering a comprehensive set of features to enhance their communication abilities. Its offline functionality, cross-platform compatibility, and reliance on existing mobile devices make it a highly accessible and effective tool. By empowering users to express themselves without barriers, this app has the potential to significantly improve their quality of life.
Future Goals
Chatbot Integration: Use extracted text to provide translations, educate users, and help them understand the content of images better.
Multi-Language Support: Expand the app's capabilities to include multiple languages for a global audience.
Phrase and Sentence Builder: Enable users to create common phrases or construct sentences for easier communication.
Enhanced AI Models: Improve irregular speech recognition using advanced machine learning algorithms.
Customizable Sentence Templates: Facilitate faster and more personalized communication.
Medical Device Integration: Explore tailored solutions for specific healthcare needs.
Collaboration with Experts: Work with speech therapists, healthcare professionals, and educators to refine and expand app functionality.
The complete source code and details for the Multi-modal App are available on GitHub: Multi-modal App Repository
This app is more than a tool; it’s a step toward inclusivity, ensuring no one is left unheard.