Abstract

This paper presents a multi-agent AI system that automatically analyzes and interprets video and audio content. The system decomposes complex processing tasks across specialized agents responsible for transcription, semantic understanding, contextual web search, summarization, and explanation of complex concepts. In addition, a dedicated questioning agent generates reflective, comprehension-driven questions to support active learning. By distributing responsibilities across specialized agents, the system aims to improve clarity, accuracy, and user engagement relative to monolithic AI approaches. This architecture enables more interactive, educational content analysis and is suitable for students, researchers, and content creators.

Introduction

Video and audio content contains valuable information but is difficult to analyze and understand efficiently, especially when the concepts are complex or require additional context. Existing AI tools often rely on a single model, which can limit both accuracy and interactivity.

This paper proposes a multi-agent AI system that analyzes multimedia content by assigning specialized agents to transcription, contextual web search, summarization, explanation, and question generation. This modular design is intended to improve clarity, understanding, and user engagement compared to monolithic approaches.

Methodology

System Architecture

[Figure: system architecture diagram]
The proposed system follows a modular multi-agent architecture for structured content understanding. Input content from diverse sources is first transformed into a unified textual representation. A set of specialized agents then processes this text under the coordination of a central orchestrator. Each agent performs a distinct role, and the agents' outputs are combined to produce the final response.
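As a rough illustration of the unified textual representation described above, the shared state passed between agents could look like the following sketch. The field names (messages, transcript, source_path) and the to_unified_text helper are assumptions made for illustration; the actual State definition in the repository may differ.

```python
from typing import List, Optional, TypedDict


class State(TypedDict):
    """Hypothetical shared state; field names are illustrative assumptions."""
    messages: List[str]          # conversation so far (user query, agent outputs)
    transcript: Optional[str]    # text produced from audio/video, if any
    source_path: Optional[str]   # location of the original media file, if any


def to_unified_text(state: State) -> str:
    """Return the text the agents should work on.

    Prefers the transcript when media was transcribed; otherwise falls back
    to the latest message, since plain-text input needs no conversion.
    """
    if state["transcript"]:
        return state["transcript"]
    return state["messages"][-1] if state["messages"] else ""
```

Keeping every input in one textual field means the downstream agents do not need to know whether the content began as text, audio, or video.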

Agent Design

The system consists of four specialized agents:

Planner Agent – Determines the processing strategy and task sequence. It has access to a Download tool, which retrieves audio or video content locally, and a Transcribe tool, which converts the downloaded content into text for further analysis.

Summarizer Agent – Generates concise representations of the content, highlighting key ideas and structure to facilitate comprehension.

Explainer Agent – Clarifies complex concepts in an accessible manner. It can use the Search tool to retrieve additional context from the web when needed, so that explanations are grounded in external sources.

Question Generation Agent – Produces reflective and comprehension-based questions to encourage active engagement with the content and support learning outcomes.

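from langgraph.graph import StateGraph, END

# State, the *_NODE name constants, the node functions, and the routing
# functions (plannerEdge, explainerEdge) are defined elsewhere in the project.


def createWorkflow():
    """Build and compile the multi-agent workflow graph."""
    workflow = StateGraph(State)

    # Define nodes: one per agent, plus one tool node each for the
    # planner (Download, Transcribe) and the explainer (Search)
    workflow.add_node(PLANNER_NODE, plannerNode)
    workflow.add_node(EXPLAINER_NODE, explainerNode)
    workflow.add_node(QUIZ_GENERATOR_NODE, quizGeneratorNode)
    workflow.add_node(SUMMARIZER_NODE, summarizerNode)
    workflow.add_node(PLANNER_TOOLS_NODE, plannerToolsNode)
    workflow.add_node(EXPLAINER_TOOLS_NODE, explainerToolsNode)

    # Define entry point: every query starts at the planner
    workflow.set_entry_point(PLANNER_NODE)

    # Define edges: tool nodes return control to the agent that called them;
    # the quiz generator and summarizer finish the run
    workflow.add_edge(PLANNER_TOOLS_NODE, PLANNER_NODE)
    workflow.add_edge(EXPLAINER_TOOLS_NODE, EXPLAINER_NODE)
    workflow.add_edge(QUIZ_GENERATOR_NODE, END)
    workflow.add_edge(SUMMARIZER_NODE, END)

    # The planner either calls its tools or hands off to one agent
    workflow.add_conditional_edges(
        PLANNER_NODE,
        plannerEdge,
        {
            "explainer": EXPLAINER_NODE,
            "quizGenerator": QUIZ_GENERATOR_NODE,
            "summarizer": SUMMARIZER_NODE,
            "tools": PLANNER_TOOLS_NODE
        }
    )

    # The explainer either calls its Search tool or finishes
    workflow.add_conditional_edges(
        EXPLAINER_NODE,
        explainerEdge,
        {
            "explainerTools": EXPLAINER_TOOLS_NODE,
            "end": END
        }
    )

    return workflow.compile()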
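
The conditional edges above rely on two routing functions, plannerEdge and explainerEdge, whose bodies are not shown. A minimal sketch of how such routers might work is given below, written in plain Python so that it runs without LangGraph. The state layout (a dict with tool_calls and next_agent keys) and the fallback to the explainer are assumptions made for illustration, not the repository's actual implementation; only the returned labels are taken from the mappings above.

```python
from typing import Any, Dict

# Labels the planner router may return; they match the keys of the
# conditional-edge mapping in createWorkflow().
PLANNER_ROUTES = {"explainer", "quizGenerator", "summarizer", "tools"}


def plannerEdge(state: Dict[str, Any]) -> str:
    """Hypothetical router: pick the node that runs after the planner."""
    # Assumption: pending tool requests (Download/Transcribe) are stored
    # under "tool_calls"; they must run before any agent is chosen.
    if state.get("tool_calls"):
        return "tools"
    # Assumption: the planner writes its chosen agent under "next_agent".
    choice = state.get("next_agent", "explainer")
    # Fall back to the explainer on an unknown label (illustrative choice).
    return choice if choice in PLANNER_ROUTES - {"tools"} else "explainer"


def explainerEdge(state: Dict[str, Any]) -> str:
    """Hypothetical router: run the Search tool or finish."""
    return "explainerTools" if state.get("tool_calls") else "end"
```

Each router returns a label rather than a node, and the mapping passed to add_conditional_edges translates that label into the node to run next. This keeps the routing logic independent of how the nodes are named.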

Installation and Usage

Requirements

Python 3.9 or higher
Git
pip (Python package manager)
Internet connection (required for downloads and web search)

Installation

Clone the repository:

git clone https://github.com/Achala-Elijah/Mastering_AI_Agents.git
cd Mastering_AI_Agents

Create a virtual environment (recommended):

python -m venv venv

Activate the virtual environment:

Windows:

venv\Scripts\activate

macOS / Linux:

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Run the application:

python main.py

Analyze a video or audio file:

Place the media file inside the workflow/downloads/ directory
Provide the file path as input when prompted
The system will transcribe the content, summarize key ideas, explain complex concepts, and optionally generate reflective questions

Example of usage

Input:

What is the video ./workflow/downloads/video.mp4 about?

Output:

The file ./workflow/downloads/video.mp4 appears to be a video file containing a Christian rap song. The lyrics express the artist's faith and trust in Jesus Christ, and how their life has changed for the better since they started following God. The song also touches on themes of sin, redemption, and the importance of staying on the right path. The artist repeats the phrase "I keep my eyes on Jesus Christ" throughout the song, emphasizing their focus on their faith. The overall tone of the song is one of devotion, gratitude, and encouragement to others to follow God's path.

Notes

The system accepts text, audio, and video input
Supported formats include .mp4, .mp3, and .wav
Internet access is required for contextual web search
Each agent can be extended or replaced independently
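
Since only certain formats are listed as supported, a small pre-check before submitting a file can catch unsupported input early. The sketch below is illustrative: the helper name is_supported_media is hypothetical, and the extension set contains only the three formats named above (the system may accept others).

```python
from pathlib import Path

# Extensions listed in the notes above; the real system may accept more.
SUPPORTED_EXTENSIONS = {".mp4", ".mp3", ".wav"}


def is_supported_media(path: str) -> bool:
    """Return True if the file extension is one of the listed formats.

    Only the extension is checked (case-insensitively); the file is not
    opened, so this does not guarantee the content is valid media.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, is_supported_media("./workflow/downloads/video.mp4") returns True, while a path ending in .txt returns False.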

Results

To illustrate the system’s capability in analyzing content and generating informative responses, we present one example query processed by the multi-agent system. The system integrates the Planner, Summarizer, Explainer, and Question Generation agents to produce structured, contextually rich outputs.

Query:
What is the video ./workflow/downloads/video.mp4 about?

System Response:

See the output under Example of usage above, which answers the same query.

Agent Contributions

Planner Agent: Transcribed the video content.

Summarizer Agent: Optional for this query; it could generate a summary of the song.

Explainer Agent: Provided thematic clarification of faith, redemption, and devotion.

Question Generation Agent: Optional for this query; it could generate reflective questions on the song’s message.

Observations

This example illustrates how the system combines specialized agents to produce a clear, context-aware interpretation of the input content. From a single query, the system correctly identified that it was supposed to transcribe the video and then answer the user's question, which is the kind of task decomposition that motivates modular, multi-agent processing over monolithic approaches.

Conclusion

This paper presented a multi-agent AI system for analyzing and interpreting content from diverse sources. By decomposing complex tasks across specialized agents (Planner, Summarizer, Explainer, and Question Generation), the system is designed to produce more accurate, structured, and interactive responses than monolithic approaches. The single-query example demonstrates its ability to transcribe content and clarify its themes, while summarization and question generation remain available as optional, learning-oriented outputs.

Multimedia Comprehension Agent