TubeQuery - Ask anything from a YouTube video
TubeQuery: An LLM-Powered Tool for Querying YouTube Video Content
Abstract:
TubeQuery is a Python-based tool that leverages Large Language Models (LLMs) and advanced Natural Language Processing (NLP) techniques to enable users to extract information from YouTube videos. By simply providing a YouTube video URL, users can obtain a transcript, a concise summary, and answers to specific questions about the video's content. This project demonstrates the practical application of speech-to-text conversion, text summarization, and question-answering models in a real-world scenario.
1. Introduction
In the age of information overload, efficiently extracting key insights from video content is crucial. TubeQuery addresses this challenge by providing a streamlined way to interact with YouTube videos, moving beyond passive viewing to active information retrieval. The tool is particularly useful for:
- Educational Content: Quickly grasping the core concepts of lectures, tutorials, and online courses.
- Information Gathering: Extracting specific details from interviews, documentaries, and news reports.
- Content Summarization: Generating concise overviews of lengthy videos.
This document details the architecture, implementation, and usage of TubeQuery, along with a discussion of its limitations and potential future enhancements.
2. System Architecture and Workflow
TubeQuery's operation can be broken down into the following key stages:
- Video Acquisition and Audio Extraction:
  - The user provides a YouTube video URL.
  - The `yt_dlp` library is used to download the audio stream from the video. `yt_dlp` is preferred over `youtube_dl` for its continued maintenance and improved handling of YouTube's evolving platform.
  - The audio is saved in WAV format using FFmpeg, ensuring high quality for subsequent processing.
```python
import os
from pathlib import Path

import yt_dlp


def download_audio(yt_url):
    output_dir = "/kaggle/working/files/audio"
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Download the best available audio stream and convert it to WAV via FFmpeg
    ydl_config = {
        "format": "bestaudio/best",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "wav",
                "preferredquality": "192",
            }
        ],
        "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
        "verbose": True,
    }

    print(f"Downloading audio from {yt_url}")
    try:
        with yt_dlp.YoutubeDL(ydl_config) as ydl:
            ydl.download([yt_url])
        print("Download successful!")
    except Exception as e:
        print(f"Error downloading audio: {e}")


# Example usage (replace with user input)
# yt_url = input("Input YouTube URL: ")
yt_url = "https://youtu.be/ad79nYk2keg"  # Example URL
download_audio(yt_url)
```
- Speech-to-Text Transcription:
  - OpenAI's Whisper model is employed for accurate speech-to-text conversion. The `small` model is used here, offering a balance between speed and accuracy. Larger Whisper models (`medium`, `large`) could be used for improved transcription quality, at the cost of increased processing time.
  - The transcribed text is saved to a text file for later use.
```python
import glob
import os
import warnings

import whisper

# Suppress unnecessary warnings
warnings.filterwarnings("ignore", category=UserWarning, module="whisper")
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")


def get_audiofile_path(output_dir="/kaggle/working/files/audio"):
    audio_files = glob.glob(os.path.join(output_dir, "*.wav"))
    return audio_files[-1]  # Select the most recently downloaded file


def transcribe_with_whisper(audio_path):
    model = whisper.load_model("small")
    result = model.transcribe(audio_path)
    return result["text"]


def save_text(text, output_path):
    with open(output_path, "w") as file:
        file.write(text)
    print(f"Text successfully saved to {output_path}")


# Example usage (assuming audio has been downloaded)
audio_filepath = get_audiofile_path()
transcribed_text = transcribe_with_whisper(audio_filepath)
output_path = "/kaggle/working/files/output.txt"
save_text(transcribed_text, output_path)

# Preview the transcribed text (first 10%)
print(transcribed_text[:len(transcribed_text) // 10])
```
- Text Summarization:
  - The `transformers` library from Hugging Face is utilized for text summarization. Specifically, the `t5-small` model is used within a summarization pipeline.
  - The transcribed text is split into chunks to accommodate the model's input token limit (512 tokens for `t5-small`). Token counts are determined with the model's own `T5Tokenizer`, so chunk boundaries match what the model actually receives.
  - Each chunk is summarized, and the individual summaries are concatenated to produce the final summary.
```python
from math import ceil

from langchain.document_loaders import TextLoader
from transformers import T5Tokenizer, pipeline

# Load the summarization pipeline and tokenizer
summarizer = pipeline("summarization", model="t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")


def split_text_into_chunks(text, max_tokens=512):
    # Encode the full text, then slice the token list into fixed-size chunks
    tokens = tokenizer.encode(text)
    num_chunks = ceil(len(tokens) / max_tokens)
    chunks = []
    for i in range(num_chunks):
        chunk = tokens[i * max_tokens: (i + 1) * max_tokens]
        decoded_chunk = tokenizer.decode(chunk, skip_special_tokens=True)
        chunks.append(decoded_chunk)
    return chunks


def generate_summary(file_path):
    loader = TextLoader(file_path)
    documents = loader.load()
    text = documents[0].page_content
    # Use slightly fewer than 512 tokens per chunk to stay under the model limit
    chunks = split_text_into_chunks(text, max_tokens=505)
    summaries = []
    for chunk in chunks:
        if len(chunk.strip()) > 0:
            summary = summarizer(chunk)
            summaries.append(summary[0]["summary_text"])
    final_summary = " ".join(summaries)
    print("Summary generated successfully!")
    return final_summary


# Example usage (assuming the transcript file exists)
file_path = output_path  # Use the saved transcript file
summary = generate_summary(file_path)
print(summary)
```
- Question Answering:
  - A question-answering pipeline, also from Hugging Face's `transformers`, is used to answer user queries. The `distilbert-base-uncased-distilled-squad` model, fine-tuned on the SQuAD dataset, is chosen for its efficiency and effectiveness in extractive question answering.
  - The user's question and the transcribed text (as context) are provided to the model.
  - The model identifies the span of text within the transcript that best answers the question.
```python
import warnings

from transformers import pipeline

warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")


def qa(file_path, question):
    # Extractive QA model fine-tuned on the SQuAD dataset
    qa_pipeline = pipeline(
        "question-answering", model="distilbert-base-uncased-distilled-squad"
    )

    def read_text_from_file(file_path):
        with open(file_path, "r", encoding="utf-8") as file:
            return file.read()

    def answer_question_from_text(text, question):
        answer = qa_pipeline(question=question, context=text)
        return answer["answer"]

    text = read_text_from_file(file_path)
    answer = answer_question_from_text(text, question)
    return "Answer: " + answer


# Example usage
# question = input("What is your question: ")
question = "What is the topic of the video?"  # Example
answer = qa(file_path, question)  # Use the transcript file
print(answer)

question2 = "What are the pros of AI?"  # Example
answer2 = qa(file_path, question2)
print(answer2)

question3 = "What are the cons of AI?"  # Example, likely to fail
answer3 = qa(file_path, question3)
print(answer3)
```
3. Implementation Details and Tech Stack
- Programming Language: Python 3.10
- Key Libraries:
  - `yt_dlp`: For downloading YouTube video audio.
  - `openai-whisper`: For speech-to-text transcription.
  - `transformers`: For NLP tasks (summarization and question answering).
  - `langchain` / `langchain-community`: For loading and managing text documents (used with `TextLoader`).
  - `ffmpeg`: For audio processing (invoked by `yt_dlp`).
- Dependencies Installation:
```bash
pip install langchain docarray==0.38.0 yt_dlp openai-whisper transformers==4.44.0
pip install -U langchain-community
```
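Tying the stages together, the following is a minimal end-to-end sketch (not part of the original notebook) that assumes the `download_audio`, `get_audiofile_path`, `transcribe_with_whisper`, `save_text`, `generate_summary`, and `qa` functions defined in Section 2:

```python
# Minimal end-to-end sketch: compose the functions from Section 2.
# Assumes download_audio, get_audiofile_path, transcribe_with_whisper,
# save_text, generate_summary, and qa are already defined as shown above.

def run_tubequery(yt_url, question, transcript_path="/kaggle/working/files/output.txt"):
    download_audio(yt_url)                            # 1. fetch audio as WAV
    audio_path = get_audiofile_path()                 # 2. locate the downloaded file
    transcript = transcribe_with_whisper(audio_path)  # 3. speech-to-text
    save_text(transcript, transcript_path)            #    persist for later queries
    summary = generate_summary(transcript_path)       # 4. chunked summarization
    answer = qa(transcript_path, question)            # 5. extractive QA
    return summary, answer


summary, answer = run_tubequery(
    "https://youtu.be/ad79nYk2keg",
    "What is the topic of the video?",
)
print(summary)
print(answer)
```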
4. Results and Evaluation
The system successfully processes YouTube video links, generates transcripts, summaries, and answers questions based on the video content. The accuracy of the results depends on:
- Audio Quality: Clear audio leads to better transcriptions.
- Whisper Model Choice: Larger models improve accuracy but increase processing time.
- Question Complexity: The question-answering model performs best on fact-based questions directly answerable from the transcript. It may struggle with complex reasoning or questions requiring information not explicitly stated in the video.
The example provided in the notebook demonstrates the system's capabilities. The summary captures the main points of the video, and the question-answering component correctly identifies the topic. However, as expected, it fails to answer a question about "cons of AI" because that information was not present in the video.
5. Limitations and Future Work
- Out-of-Video Questions: The current implementation is limited to answering questions that can be directly answered from the transcribed text. It cannot handle questions requiring external knowledge or inference beyond the video content.
- Transcript Accuracy: While Whisper is generally accurate, errors in transcription can occur, especially with noisy audio or speakers with strong accents. These errors propagate to the summarization and question-answering stages.
- Context Window Limits: The summarization and question-answering models have limited context windows. Very long videos may require more sophisticated chunking and aggregation strategies.
- Single Video Source: The system currently only supports single YouTube videos.
- No Error Handling for Invalid URLs: The code does not include robust error handling for cases where the user provides an invalid YouTube URL or a URL for a video that cannot be downloaded.
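As a rough illustration of how this last limitation might be addressed, the snippet below sketches input validation in front of `download_audio`; the regular expression is an illustrative assumption and deliberately not exhaustive:

```python
import re

# Illustrative pattern for common YouTube URL shapes (an assumption;
# yt_dlp itself accepts many more URL formats than this).
YOUTUBE_URL_RE = re.compile(
    r"^(https?://)?(www\.)?(youtube\.com/watch\?v=|youtu\.be/)[\w-]{11}"
)


def safe_download_audio(yt_url):
    # Reject obviously malformed input before invoking the downloader.
    if not YOUTUBE_URL_RE.match(yt_url):
        raise ValueError(f"Not a recognizable YouTube URL: {yt_url!r}")
    # download_audio (Section 2) already wraps the download in try/except
    # and reports failures such as private or removed videos.
    download_audio(yt_url)
```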
The following improvements are planned for future versions:
- Enhanced Accuracy:
  - Experiment with larger Whisper models (`medium`, `large-v2`, `large-v3`) and evaluate the trade-off between accuracy and processing time.
  - Explore fine-tuning the question-answering model on a dataset of video transcripts and related questions to improve performance on this specific task.
  - Implement a retrieval-augmented generation (RAG) approach to allow the system to answer questions that are not directly addressed in the video. This involves retrieving relevant information from external sources (e.g., Wikipedia, web search) and using it to augment the model's context.
- Real-Time Processing: Investigate techniques for streaming audio and performing transcription and analysis in real time. This would be challenging due to the computational requirements of the models.
- Support for Multiple Video Sources: Extend support to include playlists, other video platforms, and local file uploads.
- Improved Interface: Develop a user-friendly web interface using a framework like Streamlit or Flask, making the tool accessible to a wider audience.
- Advanced Analytics: Incorporate features such as sentiment analysis, keyword extraction, and topic modeling to provide more comprehensive insights into the video content.
- Integration with External Tools: Enable integration with note-taking applications and learning management systems.
- Robust Error Handling: Add try-except blocks to gracefully handle invalid URLs, network errors, and other potential issues.
- Long-Form Video Handling: Implement a hierarchical summarization approach, where the video is first divided into large sections, each section is summarized, and those summaries are combined to create an overall summary (sketched below).
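A minimal sketch of that hierarchical idea, reusing the `summarizer` pipeline and `split_text_into_chunks` helper from Section 2 (the recursive two-level structure and chunk size are illustrative choices, not the current implementation):

```python
def hierarchical_summary(text, chunk_tokens=505):
    # Level 1: summarize each transcript chunk independently.
    chunks = split_text_into_chunks(text, max_tokens=chunk_tokens)
    section_summaries = [
        summarizer(chunk)[0]["summary_text"] for chunk in chunks if chunk.strip()
    ]
    combined = " ".join(section_summaries)
    # Level 2: if the combined section summaries still exceed one chunk,
    # recurse; otherwise a single final pass yields the overall summary.
    if len(split_text_into_chunks(combined, max_tokens=chunk_tokens)) > 1:
        return hierarchical_summary(combined, chunk_tokens)
    return summarizer(combined)[0]["summary_text"]
```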
6. Conclusion
TubeQuery provides a valuable tool for interacting with YouTube video content in a more efficient and informative way. By combining speech-to-text, text summarization, and question-answering capabilities, it demonstrates the power of LLMs in transforming how we consume and learn from online video. The planned future enhancements will further improve its accuracy, usability, and versatility.
7. Acknowledgements
- OpenAI for the Whisper model.
- Hugging Face for the Transformers library and pre-trained models.
- The developers of `yt-dlp`.
Code Availability
The complete code for TubeQuery is available in the following Kaggle Notebook:
https://www.kaggle.com/code/sitama/tubequery