This Multi-Modal Document Intelligence System processes video, audio, and text documents by combining computer vision and speech recognition. It uses Whisper ASR for speech-to-text conversion, OCR for document analysis, and an LLM for summarization and question answering.
Through a Streamlit interface, users can extract, summarize, and query information from PDFs, DOCX files, text, audio, video, and YouTube content. The system demonstrates a practical application of multi-modal AI to document understanding and information retrieval.
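To make the routing idea concrete, the following is a minimal sketch of how an input could be classified by modality and mapped to an ordered list of processing stages. It is not the project's actual code: the extension table, the host list, the stage names, and the functions `detect_modality` and `plan_steps` are all hypothetical labels chosen for illustration, and the stages are plain strings rather than calls into Whisper, an OCR engine, or an LLM.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical extension table; the real system may accept other formats.
EXTENSION_TO_MODALITY = {
    ".pdf": "document", ".docx": "document", ".txt": "text",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".mp4": "video", ".mov": "video", ".mkv": "video",
}

# Hosts treated as YouTube links (an assumption for this sketch).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

# Illustrative stage names only; they do not correspond to real library calls.
EXTRACTION_STEPS = {
    "text": ["read_text"],
    "document": ["extract_text_with_ocr_fallback"],
    "audio": ["transcribe_speech"],
    "video": ["extract_audio_track", "transcribe_speech"],
    "youtube": ["download_audio", "transcribe_speech"],
}


def detect_modality(source: str) -> str:
    """Classify a file path or URL into one of the modalities above."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube"
        raise ValueError(f"Unsupported URL: {source}")
    suffix = Path(source).suffix.lower()
    if suffix not in EXTENSION_TO_MODALITY:
        raise ValueError(f"Unsupported file type: {suffix or source}")
    return EXTENSION_TO_MODALITY[suffix]


def plan_steps(source: str) -> list[str]:
    """Return extraction stages followed by the shared LLM stage."""
    modality = detect_modality(source)
    # Every modality ends in plain text, so the LLM stage is common to all.
    return EXTRACTION_STEPS[modality] + ["summarize_or_answer"]


if __name__ == "__main__":
    for item in ["report.pdf", "lecture.MP4", "https://youtu.be/example"]:
        print(item, "->", plan_steps(item))
```

The design point this illustrates is that every modality is reduced to plain text first, so summarization and Q&A can share a single downstream stage regardless of the input format.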