Abstract:
TubeQuery is a Python-based tool that leverages Large Language Models (LLMs) and advanced Natural Language Processing (NLP) techniques to enable users to extract information from YouTube videos. By simply providing a YouTube video URL, users can obtain a transcript, a concise summary, and answers to specific questions about the video's content. This project demonstrates the practical application of speech-to-text conversion, text summarization, and question-answering models in a real-world scenario.
In the age of information overload, efficiently extracting key insights from video content is crucial. TubeQuery addresses this challenge by providing a streamlined way to interact with YouTube videos, moving beyond passive viewing to active information retrieval. The tool is particularly useful for:
This document details the architecture, implementation, and usage of TubeQuery, along with a discussion of its limitations and potential future enhancements.
TubeQuery's operation can be broken down into the following key stages:
Video Acquisition and Audio Extraction:
The yt_dlp library is used to download the audio stream from the video. yt_dlp is preferred over youtube_dl for its continued maintenance and improved handling of YouTube's evolving platform.

    import os
    from pathlib import Path

    import yt_dlp

    def download_audio(yt_url):
        output_dir = "/kaggle/working/files/audio"
        Path(output_dir).mkdir(parents=True, exist_ok=True)

        ydl_config = {
            "format": "bestaudio/best",
            # Convert the downloaded stream to WAV with ffmpeg
            "postprocessors": [
                {
                    "key": "FFmpegExtractAudio",
                    "preferredcodec": "wav",
                    "preferredquality": "192",
                }
            ],
            "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
            "verbose": True,
        }

        print(f"Downloading audio from {yt_url}")
        try:
            with yt_dlp.YoutubeDL(ydl_config) as ydl:
                ydl.download([yt_url])
            print("Downloading successful!")
        except Exception as e:
            print(f"Error downloading audio: {e}")

    # Example usage (replace with user input)
    # yt_url = input("Input YouTube URL")
    yt_url = "https://youtu.be/ad79nYk2keg"  # Example URL
    download_audio(yt_url)
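download_audio passes whatever string it receives straight to yt_dlp. As a minimal sketch of a pre-check, assuming only the two link forms handled here matter, the helper below extracts the video ID before any download is attempted. The name extract_video_id and the YOUTUBE_HOSTS set are hypothetical and not part of TubeQuery; yt_dlp itself accepts more URL forms than this sketch recognizes.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames accepted by this sketch; an assumption, not an exhaustive list.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if it does not look like one."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https") or host not in YOUTUBE_HOSTS:
        return None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    else:
        # Standard watch links carry the ID in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    return video_id or None
```

A caller could skip the download and show a message when the helper returns None, instead of waiting for yt_dlp to raise.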
Speech-to-Text Transcription:
Whisper's small model is used here, offering a balance between speed and accuracy. Larger Whisper models (medium, large) could be used for improved transcription quality, at the cost of increased processing time.

    import glob
    import os
    import warnings

    import whisper

    # Suppress unnecessary warnings
    warnings.filterwarnings("ignore", category=UserWarning, module="whisper")
    warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

    def get_audiofile_path(output_dir="/kaggle/working/files/audio"):
        audio_files = glob.glob(os.path.join(output_dir, "*.wav"))
        if not audio_files:
            raise FileNotFoundError(f"No .wav files found in {output_dir}")
        # glob returns files in arbitrary order, so pick the most recently modified one
        return max(audio_files, key=os.path.getmtime)

    def transcribe_with_whisper(audio_path):
        model = whisper.load_model("small")
        result = model.transcribe(audio_path)
        return result["text"]

    def save_text(text, output_path):
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(text)
        print(f"Text successfully saved to {output_path}")

    # Example usage (assuming audio has been downloaded)
    audio_filepath = get_audiofile_path()
    transcribed_text = transcribe_with_whisper(audio_filepath)
    output_dir = "/kaggle/working/files/output.txt"  # Path of the transcript file, despite the name
    save_text(transcribed_text, output_dir)

    # Preview the transcribed text (first 10%):
    print(transcribed_text[:len(transcribed_text) // 10])
Text Summarization:
The transformers library from Hugging Face is utilized for text summarization. Specifically, the t5-small model is used within a summarization pipeline. Because a full transcript can exceed the model's input limit, the text is split into chunks before summarization (512 tokens for t5-small). The tiktoken library, developed by OpenAI, accurately determines token counts for OpenAI models; the code below counts tokens with the model's own T5Tokenizer instead.

    from math import ceil

    from langchain_community.document_loaders import TextLoader
    from transformers import pipeline, T5Tokenizer

    # Load the summarization pipeline and tokenizer
    summarizer = pipeline("summarization", model="t5-small")
    tokenizer = T5Tokenizer.from_pretrained("t5-small")

    def split_text_into_chunks(text, max_tokens=512):
        tokens = tokenizer.encode(text)
        num_chunks = ceil(len(tokens) / max_tokens)
        chunks = []
        for i in range(num_chunks):
            # Slicing already caps each chunk at max_tokens
            chunk = tokens[i * max_tokens:(i + 1) * max_tokens]
            decoded_chunk = tokenizer.decode(chunk, skip_special_tokens=True)
            chunks.append(decoded_chunk)
        return chunks

    def generate_summary(file_path):
        loader = TextLoader(file_path)
        documents = loader.load()
        text = documents[0].page_content

        chunks = split_text_into_chunks(text, max_tokens=505)  # Slightly below the 512-token limit
        summaries = []
        for chunk in chunks:
            if len(chunk.strip()) > 0:
                summary = summarizer(chunk)
                summaries.append(summary[0]["summary_text"])

        final_summary = " ".join(summaries)
        print("Summary generated successfully!")
        return final_summary

    # Example usage (assuming transcript file exists)
    file_path = output_dir  # Use the saved transcript file
    summary = generate_summary(file_path)
    print(summary)
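split_text_into_chunks cuts the transcript at fixed token boundaries, so a sentence can end up divided between two chunks. One common mitigation, which is an addition here and not part of TubeQuery, is to let consecutive chunks share a few tokens. The sketch below shows the windowing logic on a plain list of tokens; the function name chunk_with_overlap and the default overlap of 50 are assumptions for illustration.

```python
def chunk_with_overlap(tokens, max_tokens=505, overlap=50):
    """Split a token list into windows of at most max_tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # The current window already reaches the end of the input
    return chunks
```

With a real tokenizer, each window would be decoded back to text before being passed to the summarizer, as in the loop above. Overlap increases the number of chunks slightly, so it trades some extra processing time for context continuity.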
Question Answering:
A question-answering pipeline, also from transformers, is used to answer user queries. The distilbert-base-uncased-distilled-squad model, fine-tuned on the SQuAD dataset, is chosen for its efficiency and effectiveness in extractive question answering.

    import warnings

    from transformers import pipeline

    warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

    def qa(file_path, question):
        qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

        def read_text_from_file(file_path):
            with open(file_path, "r", encoding="utf-8") as file:
                return file.read()

        def answer_question_from_text(text, question):
            # The pipeline returns a dict; only the answer span is used here
            answer = qa_pipeline(question=question, context=text)
            return answer["answer"]

        text = read_text_from_file(file_path)
        answer = answer_question_from_text(text, question)
        return "Answer: " + answer

    # Example usage
    # question = input("What is your question: ")
    question = "What is the topic of the video?"  # Example
    answer = qa(file_path, question)  # Use the transcript file
    print(answer)

    question2 = "What are the pros of AI?"  # Example
    answer2 = qa(file_path, question2)
    print(answer2)

    question3 = "What are the cons of AI?"  # Example, likely to fail
    answer3 = qa(file_path, question3)
    print(answer3)
yt_dlp: For downloading YouTube video audio.
openai-whisper: For speech-to-text transcription.
transformers: For NLP tasks (summarization and question answering).
langchain: For loading and managing text documents (used with TextLoader).
ffmpeg: For audio processing (handled via yt_dlp).
langchain-community: Community integrations for langchain, including its document loaders.

    pip install langchain docarray==0.38.0 yt_dlp openai-whisper transformers==4.44.0
    pip install -U langchain-community
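Before running the pipeline, it can help to confirm that the packages above are importable and that ffmpeg is on the PATH. The following is a minimal sketch using only the standard library; the helper name find_missing is hypothetical, and the import names in REQUIRED_MODULES are the ones assumed for each pip package.

```python
import importlib.util
import shutil

# Import names assumed for the packages listed above (pip name -> module name).
REQUIRED_MODULES = {
    "yt_dlp": "yt_dlp",
    "openai-whisper": "whisper",
    "transformers": "transformers",
    "langchain": "langchain",
    "langchain-community": "langchain_community",
}

def find_missing(modules=REQUIRED_MODULES, executables=("ffmpeg",)):
    """Return the pip package names and executables that cannot be located."""
    missing = [pkg for pkg, mod in modules.items()
               if importlib.util.find_spec(mod) is None]
    missing += [exe for exe in executables if shutil.which(exe) is None]
    return missing
```

Calling find_missing() with no arguments returns an empty list when everything is in place; otherwise it names what still needs to be installed.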
The system successfully processes YouTube video links, generates transcripts, summaries, and answers questions based on the video content. The accuracy of the results depends on:
The example provided in the notebook demonstrates the system's capabilities. The summary captures the main points of the video, and the question-answering component correctly identifies the topic. However, as expected, it fails to answer a question about "cons of AI" because that information was not present in the video.
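The question-answering pipeline returns a confidence score alongside each answer span, which the qa function discards. One possible way to surface questions the transcript cannot answer is to compare that score against a cut-off. The sketch below assumes a result dict with the "answer" and "score" keys that the pipeline returns; the function name format_answer and the min_score value of 0.3 are hypothetical and would need tuning on real transcripts.

```python
def format_answer(result, min_score=0.3):
    """Turn a question-answering result dict into a user-facing string.

    `result` is expected to carry the "answer" and "score" keys returned by
    the transformers question-answering pipeline. `min_score` is a
    hypothetical cut-off, not a value taken from TubeQuery.
    """
    if result["score"] < min_score or not result["answer"].strip():
        return "Answer: not found in the transcript"
    return f"Answer: {result['answer']} (score {result['score']:.2f})"
```

A low score does not prove that the information is absent, and a high score does not guarantee correctness, so the threshold is best treated as a heuristic.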
The following improvements are planned for future versions:
Enhanced Accuracy: Experiment with larger Whisper models (medium, large-v2, large-v3) and evaluate the trade-off between accuracy and processing time.
Real-Time Processing:
Support for Multiple Video Sources:
Improved Interface:
Advanced Analytics:
Integration with External Tools:
Robust Error Handling: Add try-except blocks to gracefully handle invalid URLs, network errors, and other potential issues.
Long-form video handling: Consider implementing a hierarchical summarization approach, where the video is first divided into large sections, each section is summarized, and then those summaries are combined to create an overall summary.
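The hierarchical approach described in the last item can be sketched independently of any particular model. In the example below, summarize stands for any callable that maps a string to a shorter string (for instance, a wrapper around the t5-small pipeline). The function name hierarchical_summary and the max_chars budget are assumptions for illustration, not part of the current code.

```python
def hierarchical_summary(sections, summarize, max_chars=2000):
    """Summarize each section, then summarize the combined section summaries.

    `summarize` is any callable mapping a string to a shorter string.
    `max_chars` is a hypothetical size budget for a single summarize call.
    """
    partial = [summarize(s) for s in sections if s.strip()]
    combined = " ".join(partial)
    # Merge summaries pairwise until the combined text fits the budget
    # or only one summary is left.
    while len(combined) > max_chars and len(partial) > 1:
        partial = [summarize(" ".join(partial[i:i + 2]))
                   for i in range(0, len(partial), 2)]
        combined = " ".join(partial)
    return summarize(combined) if combined else ""
```

Each pass halves the number of intermediate summaries, so the loop always terminates. How the transcript is divided into sections (fixed duration, chapters, or topic shifts) is left open here.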
TubeQuery provides a valuable tool for interacting with YouTube video content in a more efficient and informative way. By combining speech-to-text, text summarization, and question-answering capabilities, it demonstrates the power of LLMs in transforming how we consume and learn from online video. The planned future enhancements will further improve its accuracy, usability, and versatility.
The complete code for TubeQuery is available in the following Kaggle Notebook:
https://www.kaggle.com/code/sitama/tubequery