This paper presents a multi-agent AI system designed to automatically analyze and interpret video and audio content. The system decomposes complex processing tasks into specialized agents responsible for transcription, semantic understanding, contextual web search, summarization, and explanation of complex concepts. In addition, a dedicated questioning agent generates reflective and comprehension-driven questions to support active learning. By distributing responsibilities across specialized agents, the system improves clarity, accuracy, and user engagement compared to monolithic AI approaches. This architecture enables more interactive and educational content analysis, making the system suitable for students, researchers, and content creators.
Video and audio content contains valuable information but is difficult to analyze and understand efficiently, especially when concepts are complex or require additional context. Existing AI tools often rely on a single model, which limits accuracy and interaction.
This paper proposes a multi-agent AI system that analyzes multimedia content by assigning specialized agents to transcription, contextual web search, summarization, explanation, and question generation. This modular design improves clarity, understanding, and user engagement compared to monolithic approaches.

The proposed system follows a modular multi-agent architecture for structured content understanding. Input content from diverse sources is first transformed into a unified textual representation. A set of specialized agents then processes the content under the coordination of a central orchestrator. Each agent performs a distinct role, and their outputs are combined into a single response to the user.
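As a rough illustration of the first step (bringing diverse inputs to a unified textual representation), the sketch below decides which preparation steps a given input would need. The function name `plan_ingestion`, the step labels, and the list of media extensions are assumptions made for this example; the source does not specify how the system classifies its inputs.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical set of media extensions; the supported formats are not listed in the source.
MEDIA_EXTENSIONS = {".mp4", ".mkv", ".mp3", ".wav", ".m4a"}


def plan_ingestion(source: str) -> list[str]:
    """Return the ordered steps needed to turn `source` into plain text."""
    if urlparse(source).scheme in ("http", "https"):
        # Remote content is fetched locally first, then transcribed.
        return ["download", "transcribe"]
    if Path(source).suffix.lower() in MEDIA_EXTENSIONS:
        # A local media file only needs transcription.
        return ["transcribe"]
    # Anything else is treated as text that needs no preparation.
    return []


if __name__ == "__main__":
    for item in ("https://example.com/lecture",
                 "./workflow/downloads/video.mp4",
                 "Explain gradient descent"):
        print(item, "->", plan_ingestion(item))
```

The step labels mirror the Download and Transcribe tools described for the Planner Agent, but in the actual system the planner presumably makes this decision with a language model rather than with fixed rules.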

The system consists of four specialized agents:
Planner Agent – Determines the processing strategy and task sequence. It has access to a Download tool, which retrieves audio or video content locally, and a Transcribe tool, which converts the downloaded content into text for further analysis.
Summarizer Agent – Generates concise representations of the content, highlighting key ideas and structure to facilitate comprehension.
Explainer Agent – Clarifies complex concepts in an accessible manner. It can use the Search tool to retrieve additional context from the web when needed, ensuring explanations are accurate and grounded.
Question Generation Agent – Produces reflective and comprehension-based questions to encourage active engagement with the content and support learning outcomes.
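The division of roles and tool access described above can be written down as plain data. The sketch below is one possible representation, not the project's actual code: the `AgentSpec` class, the lowercase agent and tool identifiers, and the helper `agents_with_tool` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Describes one agent: its name, its role, and the tools it may call."""
    name: str
    role: str
    tools: tuple[str, ...] = ()


# Roles and tool assignments follow the agent descriptions in the text.
AGENTS = (
    AgentSpec("planner", "Determines the processing strategy and task sequence",
              tools=("download", "transcribe")),
    AgentSpec("summarizer", "Generates concise representations of the content"),
    AgentSpec("explainer", "Clarifies complex concepts in an accessible manner",
              tools=("search",)),
    AgentSpec("question_generator",
              "Produces reflective and comprehension-based questions"),
)


def agents_with_tool(tool: str) -> list[str]:
    """Return the names of all agents allowed to call `tool`."""
    return [agent.name for agent in AGENTS if tool in agent.tools]


if __name__ == "__main__":
    print(agents_with_tool("search"))
    print(agents_with_tool("transcribe"))
```

Keeping the tool lists separate per agent makes the restriction explicit: only the planner can download or transcribe, and only the explainer can search the web.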
# Assumes LangGraph. State, the *_NODE constants, and the node and edge
# functions are defined elsewhere in the repository.
from langgraph.graph import StateGraph, END


def createWorkflow():
    workflow = StateGraph(State)

    # Define nodes: one per agent, plus one tool node each for the planner and explainer
    workflow.add_node(PLANNER_NODE, plannerNode)
    workflow.add_node(EXPLAINER_NODE, explainerNode)
    workflow.add_node(QUIZ_GENERATOR_NODE, quizGeneratorNode)
    workflow.add_node(SUMMARIZER_NODE, summarizerNode)
    workflow.add_node(PLANNER_TOOLS_NODE, plannerToolsNode)
    workflow.add_node(EXPLAINER_TOOLS_NODE, explainerToolsNode)

    # Define entry point: every query starts at the planner
    workflow.set_entry_point(PLANNER_NODE)

    # Define edges: tool nodes return to their agent; terminal agents end the run
    workflow.add_edge(PLANNER_TOOLS_NODE, PLANNER_NODE)
    workflow.add_edge(EXPLAINER_TOOLS_NODE, EXPLAINER_NODE)
    workflow.add_edge(QUIZ_GENERATOR_NODE, END)
    workflow.add_edge(SUMMARIZER_NODE, END)

    # The planner routes to an agent or to its own tools
    workflow.add_conditional_edges(
        PLANNER_NODE,
        plannerEdge,
        {
            "explainer": EXPLAINER_NODE,
            "quizGenerator": QUIZ_GENERATOR_NODE,
            "summarizer": SUMMARIZER_NODE,
            "tools": PLANNER_TOOLS_NODE,
        },
    )

    # The explainer either calls its tools or finishes
    workflow.add_conditional_edges(
        EXPLAINER_NODE,
        explainerEdge,
        {
            "explainerTools": EXPLAINER_TOOLS_NODE,
            "end": END,
        },
    )

    return workflow.compile()
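The conditional edges in createWorkflow depend on two routing functions, plannerEdge and explainerEdge, whose bodies are not shown. The sketch below illustrates what such functions might look like. Only the returned route keys come from the workflow definition; the `State` fields `pending_tool_calls` and `next_agent`, and the fallback to "summarizer", are assumptions made for this example.

```python
from typing import TypedDict

# Route keys copied from the conditional-edge mappings in createWorkflow.
PLANNER_ROUTES = {"explainer", "quizGenerator", "summarizer", "tools"}
EXPLAINER_ROUTES = {"explainerTools", "end"}


class State(TypedDict, total=False):
    # Both fields are hypothetical; the real State schema is not shown in the source.
    pending_tool_calls: list[str]  # tools the current agent has asked to run
    next_agent: str                # agent the planner has chosen to hand off to


def plannerEdge(state: State) -> str:
    """Choose the planner's next hop: run requested tools first, else hand off."""
    if state.get("pending_tool_calls"):
        return "tools"
    route = state.get("next_agent", "summarizer")  # default is an assumption
    if route not in PLANNER_ROUTES:
        raise ValueError(f"unknown planner route: {route!r}")
    return route


def explainerEdge(state: State) -> str:
    """Send the explainer to its tools if it requested any, otherwise finish."""
    if state.get("pending_tool_calls"):
        return "explainerTools"
    return "end"


if __name__ == "__main__":
    print(plannerEdge({"pending_tool_calls": ["transcribe"]}))
    print(plannerEdge({"next_agent": "explainer"}))
    print(explainerEdge({}))
```

Each function returns a string key, and the mapping passed to add_conditional_edges translates that key into the next node. Validating the key inside the router surfaces a bad planner decision as an explicit error rather than as a failed lookup in the graph.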
git clone https://github.com/Achala-Elijah/Mastering_AI_Agents.git
cd Mastering_AI_Agents
python -m venv venv
# Activate the virtual environment on Windows:
venv\Scripts\activate
# Or on macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
python main.py
To evaluate the system’s capability in analyzing content and generating informative responses, we present an example query processed by the multi-agent system. The system integrates the Planner, Summarizer, Explainer, and Question Generation agents to produce structured, contextually rich outputs.
Query:
What is the video ./workflow/downloads/video.mp4 about?
System Response:
The file ./workflow/downloads/video.mp4 appears to be a video file containing a Christian rap song. The lyrics express the artist's faith and trust in Jesus Christ, and how their life has changed for the better since they started following God. The song also touches on themes of sin, redemption, and the importance of staying on the right path. The artist repeats the phrase "I keep my eyes on Jesus Christ" throughout the song, emphasizing their focus on their faith. The overall tone of the song is one of devotion, gratitude, and encouragement to others to follow God's path.
Planner Agent: Transcribed the video content.
Summarizer Agent: Optional for this query; could generate a summary of the song.
Explainer Agent: Provided thematic clarification of faith, redemption, and devotion.
Question Generation Agent: Optional for this query; could generate reflective questions on the song’s message.
This example demonstrates the system’s ability to combine multiple specialized agents to provide a clear, accurate, and context-aware understanding of input content. Even with a single query, the system correctly identified that it needed to transcribe the video and then answer the user’s question, which illustrates the advantage of modular, multi-agent processing over monolithic approaches.
This paper presented a multi-agent AI system for analyzing and interpreting content from diverse sources. By decomposing complex tasks across specialized agents (Planner, Summarizer, Explainer, and Question Generation), the system achieves more accurate, structured, and interactive responses compared to monolithic approaches. The single-query example illustrates its ability to transcribe content, extract key ideas, and clarify complex concepts, with summaries and learning-oriented questions available as optional outputs.