
Transforming ideas into compelling video content through intelligent automation
This publication presents a comprehensive multi-agent system designed for automated video generation from textual descriptions. The system employs five specialized AI agents with integrated database logging and human-in-the-loop feedback mechanisms to produce high-quality video content with synchronized audio narration.
In today's digital landscape, video content has become the dominant medium for communication, education, and entertainment. However, traditional video content creation remains a complex and resource-intensive process that demands significant manual effort, specialized technical expertise, and substantial time investment. Content creators consistently encounter numerous obstacles throughout their production workflows.
The journey begins with storyboard development and scene planning, where creators must transform abstract ideas into concrete visual narratives. This process requires not only creative vision but also technical understanding of how individual scenes will connect to form a cohesive story. Following the planning phase, creators face the intricate challenge of audio-visual synchronization, ensuring that narration, background music, and visual elements work harmoniously together to deliver their intended message.
Quality assurance presents another significant hurdle, as creators must iterate through multiple versions of their content, identifying areas for improvement and refining their work until it meets professional standards. Throughout this entire process, progress tracking and workflow management become increasingly complex, especially when working with teams or managing multiple projects simultaneously.
Recognizing these persistent challenges in video content creation, we have developed a sophisticated multi-agent system that fundamentally transforms how videos are produced from textual descriptions. Our approach centers on the implementation of specialized AI agents, each designed to excel at distinct aspects of video creation while working collaboratively toward a unified goal.
The system incorporates comprehensive database logging capabilities that create detailed audit trails, enabling creators to track their progress, understand decision-making processes, and maintain complete visibility into their production workflows. Rather than replacing human creativity, our solution integrates strategic human feedback loops that enhance quality while preserving creative control, allowing creators to guide and refine the automated processes according to their vision.
This innovative framework enables truly automated yet controllable video generation workflows, where creators can input their ideas and receive professional-quality video content while maintaining the ability to intervene, adjust, and refine the output at crucial decision points throughout the production process.

Watch the AI Video Generator in action:
At the heart of our solution lies a sophisticated multi-agent architecture that orchestrates five specialized AI agents, each contributing unique capabilities to the video creation process. The Scene Generator Agent serves as the creative foundation, intelligently transforming user input into structured video scenes that form the narrative backbone of the final product. Working alongside this creative engine, the Database Logger Agent ensures comprehensive persistent storage and meticulous progress tracking throughout the entire production pipeline.
Quality assurance comes through the Scene Critic Agent, which provides automated assessment and improvement suggestions, acting as an intelligent reviewer that identifies opportunities for enhancement before content moves to production. The Audio Agent brings the visual narrative to life by generating synchronized narration and audio content that perfectly complements the visual elements. Finally, the Video Agent serves as the master conductor, synthesizing all visual and audio elements into the final video output that represents the culmination of the collaborative agent workflow.
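To make this division of labor concrete, the sketch below outlines a shared workflow state that could be handed from agent to agent as the pipeline runs. The field names and types are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch of a shared workflow state passed between the five agents.
# Field names and types are assumptions for explanation, not the project's schema.
from typing import List, Optional, TypedDict


class Scene(TypedDict):
    index: int          # position of the scene in the final video
    narration: str      # text the Audio Agent will turn into speech
    visual_prompt: str  # description the Video Agent renders from
    duration_sec: float


class VideoWorkflowState(TypedDict):
    video_id: str
    user_input: str                  # original textual description
    scenes: List[Scene]              # produced by the Scene Generator Agent
    critique: Optional[str]          # feedback from the Scene Critic Agent
    approved: bool                   # approval gate before audio/video synthesis
    audio_files: List[str]           # paths written by the Audio Agent
    final_video_path: Optional[str]  # written by the Video Agent
```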
Our system's robustness stems from its sophisticated database integration layer, which includes custom tools specifically designed for comprehensive workflow logging and optimization. The save_scene_progress function ensures that both raw and improved scenes are persistently stored in the database, creating a complete historical record of the creative evolution process. Meanwhile, the log_progress_event tool captures detailed progress events for each workflow step, providing granular visibility into the system's operation and enabling precise monitoring of production timelines.
The system's intelligence is further enhanced through the get_video_context function, which retrieves contextual information that enables more sophisticated and relevant scene generation. Additionally, the search_similar_videos capability identifies related content for creative inspiration, helping the system learn from existing successful patterns while maintaining originality in new productions.
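The sketch below illustrates how these logging tools could be implemented as thin wrappers over the database. The four function names come from the system described above, but the signatures, table names, and use of plain sqlite3 are assumptions made for illustration.

```python
# Hedged sketch of the logging tools described above: the function names come
# from the project, but this minimal implementation (plain sqlite3, ad-hoc
# schema) is an illustrative assumption, not the project's actual code.
import json
import sqlite3
import time


def save_scene_progress(conn: sqlite3.Connection, video_id: str,
                        raw_scenes: list, improved_scenes: list) -> None:
    """Persist both raw and critic-improved scenes for a video."""
    conn.execute(
        "INSERT INTO scene_progress (video_id, raw_scenes, improved_scenes, created_at) "
        "VALUES (?, ?, ?, ?)",
        (video_id, json.dumps(raw_scenes), json.dumps(improved_scenes), time.time()),
    )
    conn.commit()


def log_progress_event(conn: sqlite3.Connection, video_id: str,
                       step: str, message: str) -> None:
    """Record a granular progress event for one workflow step."""
    conn.execute(
        "INSERT INTO progress_events (video_id, step, message, created_at) VALUES (?, ?, ?, ?)",
        (video_id, step, message, time.time()),
    )
    conn.commit()


def get_video_context(conn: sqlite3.Connection, video_id: str) -> dict:
    """Fetch prior progress so later agents can generate with context."""
    rows = conn.execute(
        "SELECT step, message FROM progress_events WHERE video_id = ? ORDER BY created_at",
        (video_id,),
    ).fetchall()
    return {"video_id": video_id, "history": [{"step": s, "message": m} for s, m in rows]}


def search_similar_videos(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Naive keyword lookup of related content for creative inspiration."""
    return conn.execute(
        "SELECT video_id, raw_scenes FROM scene_progress WHERE raw_scenes LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    ).fetchall()
```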
This comprehensive approach ensures that all progress is automatically tracked in the database, providing complete audit trails that not only enable recovery from failures but also facilitate continuous improvement of the system's capabilities through analysis of successful production patterns.
To make the platform production-ready, we hardened configuration, improved API reliability, and engineered resilience across the stack. Configuration is centralized in backend/config.py and initialized through an application factory that promotes clean startup in both development and test environments. All critical directories—output_dir, temp_dir, and data_dir—resolve to absolute, sandboxed paths under the backend root to prevent path drift across hosts and containers. The database connection is intentionally decoupled from file storage locations, defaulting to SQLite paths suitable for containers or local runs, while CORS policies are constrained by environment and configuration to only allow trusted origins. For development and CI, a mock_mode switch disables external API calls and substitutes deterministic behaviors, enabling fast, repeatable test runs.
On the API surface, request payloads are validated using Pydantic models that enforce strict bounds on titles, descriptions, and user input lengths. File downloads are protected by canonical path checks to ensure that only files within the configured output_dir can be served; the same endpoint supports inline playback for a better user experience. A health endpoint verifies database connectivity and filesystem writability, while a lightweight metrics endpoint exposes high-level counts by video status, providing a quick operational snapshot without external dependencies.
Resilience is addressed through explicit retry strategies with exponential backoff for external LLM calls, graceful fallbacks in mock mode for audio and video processing, and deterministic filenames based on time.time_ns() to avoid collisions. Duration reporting prioritizes reading the actual rendered video file; when unavailable, the system estimates from scene timings, keeping responses informative even when media tools are absent. The orchestration layer, built with LangGraph, incorporates a critique loop and approval gate so that only improved scenes progress downstream. Throughout processing, progress is persisted to the database and streamed via SSE, enabling real-time monitoring and robust recovery.
Safety is implemented in layers. At the prompt level, each agent is guided by safety system messages and explicit output constraints defined in backend/config/prompt_config.yaml. These instructions resist prompt injection and constrain outputs to policy-compliant formats. At request time, an optional moderation step calls _llm_moderation_flagged() which integrates the omni-moderation-latest model and applies a configurable threshold. This check can be disabled in tests or mock runs for determinism; even when disabled, downstream agents still sanitize outputs by removing URLs, emails, code fences, and by normalizing whitespace.
On the platform side, SQLAlchemy parameterization guards against injection, CORS policies restrict cross-origin access, and file serving enforces strict containment checks to mitigate path traversal risks. Secrets and environment-dependent values are provided via .env so that sensitive configuration never needs to be embedded directly in the codebase.
Testing emphasizes breadth, determinism, and operational realism. Unit tests verify agent utilities, validation logic, and database helpers. Integration tests exercise the API surface, including boundary validations, secure download behavior, moderation pathways, database operations, and the Server-Sent Events stream for progress updates. End-to-end tests execute the full multi-agent workflow in mock mode, ensuring that orchestration, persistence, and status reporting function coherently without external dependencies.
External services such as OpenAI and MoviePy are mocked where appropriate to remove flakiness and ensure reproducible results. For moderation, tests directly stub the _llm_moderation_flagged function to drive both allow and block outcomes deterministically. The suite runs under pytest with pytest-cov, and coverage is scoped to application modules by configuration. Recent runs report approximately 75% line coverage across backend modules, reflecting a comprehensive safety net for critical code paths.
The interface is designed to feel fast, informative, and polished. The library presents generated videos as interactive cards that display muted, looping previews with a prominent, centered play affordance. Clicking a card opens a modal player for focused viewing, while the creation flow includes an inline player in the final step that streams from the backend’s inline download endpoint. Throughout the process, a progress indicator subscribes to SSE updates to reflect the system’s current stage, offering responsive feedback without manual refreshes.
Screenshots:

Library view with clickable cards, live previews, and quick access to recent outputs.

Guided creation flow showing step-by-step progress and inline playback of the generated video.

Modal player with focused viewing experience and clean controls optimized for review.
Operational reliability is anchored by layered failure handling and simple, actionable observability. Transient failures against external services are retried with exponential backoff, while mock fallbacks ensure the system remains usable during outages or in constrained environments. Because progress events and workflow artifacts are persisted, operators can reconstruct state, diagnose issues, and resume work without guesswork.
Health checks validate database connectivity and directory writability to catch misconfigurations early, and the metrics endpoint summarizes video statuses for rapid triage and simple dashboards. Logs include structured progress messages and moderation outcomes, creating a clear operational narrative suitable for analysis and alerting. Absolute path resolution eliminates surprises between host and container filesystems, and startup ensures required directories are present and writable. CORS and environment-based configuration round out a secure default posture with minimal setup overhead.
The project ships with Docker Compose definitions for a straightforward deployment of both backend (Flask) and frontend (Next.js). Image builds leverage caching for faster iterations, and volumes persist outputs and instance data across restarts. Environment variables—OPENAI_API_KEY, DATABASE_URL, and FLASK_ENV—are managed via a backend .env, keeping secrets out of source control and making environment promotion predictable. The API is stateless and suitable for horizontal scaling behind a load balancer; persistent state is isolated in the database and object storage (outputs/), which can be mapped to cloud services or shared volumes in production.
Our system recognizes that the most compelling video content emerges from the synergy between artificial intelligence capabilities and human creativity. Rather than replacing human input, we have designed multiple strategic interaction points that amplify creative vision while leveraging AI efficiency.
The creative journey begins at the Initial Input Stage, where users provide natural language descriptions of their desired video content, transforming abstract ideas into concrete project specifications. Users can enhance their input by specifying optional technical parameters such as duration, style, and tone, giving them precise control over the final output characteristics. The system also accommodates reference materials, allowing users to upload supporting documents or media that provide additional context and inspiration for the AI agents.
During the Intermediate Review Points, the collaborative process deepens through scene review capabilities that enable users to examine and modify generated scenes before audio creation begins. This critical checkpoint prevents downstream issues and ensures content alignment with the creator's vision. The integrated rating system captures quality feedback and relevance assessments, while iterative refinement cycles allow for multiple revision rounds based on user feedback, ensuring that each element meets the creator's standards.
The Final Approval Stage provides comprehensive control over the finished product through preview generation, allowing users to experience the complete video before final rendering commits significant computational resources. Users can submit specific modification requests for individual scenes, and the flexible export system accommodates user-defined output formats and quality settings to match their distribution requirements.
Our feedback mechanisms create a transparent and responsive environment where users maintain visibility and control throughout the production process. Real-time progress monitoring through an intuitive web interface displays the current processing stage, while detailed progress indicators provide granular insight into each agent's work progress. The system proactively communicates through immediate error notifications, ensuring users stay informed about any processing issues that require attention.
The quality assurance framework operates through sophisticated loops that combine automated intelligence with human oversight. The Scene Critic Agent continuously provides improvement suggestions based on content analysis, while human override capabilities ensure users can accept, reject, or modify agent recommendations according to their creative judgment. This collaborative approach enables continuous learning, as the system analyzes user preferences and feedback patterns to improve future recommendations and outputs.
Collaborative enhancement features support complex production workflows through comprehensive version control that preserves multiple iterations alongside user annotations. The integrated comment system enables timestamped feedback on specific scenes or elements, facilitating detailed communication between team members. For organizations requiring formal review processes, multi-stage approval workflows accommodate complex project requirements while maintaining the system's efficiency and user-friendly operation.
```bash
# Clone the repository
git clone https://github.com/ezedinff/AAIDC.git
cd AAIDC/project-2

# Create required directories
mkdir -p data outputs temp backend/instance
```
```bash
# Create environment file
cp backend/env.example backend/.env

# Configure API credentials
echo "OPENAI_API_KEY=your_openai_api_key_here" >> backend/.env
echo "FLASK_ENV=production" >> backend/.env
echo "DATABASE_URL=sqlite:///instance/video_generator.db" >> backend/.env
```
```bash
# Build and start all services
docker-compose up --build -d

# Verify services are running
docker-compose ps
```
```text
# Frontend application
http://localhost:3000

# Backend API documentation
http://localhost:5000/api/docs
```
Create New Video Project:
```bash
curl -X POST http://localhost:5000/api/videos \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Educational Content",
    "description": "Tutorial video about machine learning",
    "user_input": "Create a 3-minute video explaining neural networks",
    "style": "educational",
    "duration": 180
  }'
```
Monitor Progress:
```bash
# Get real-time progress updates
curl http://localhost:5000/api/videos/{video_id}/progress

# Subscribe to Server-Sent Events
curl -N http://localhost:5000/api/videos/{video_id}/events
```
Provide Feedback:
```bash
curl -X POST http://localhost:5000/api/videos/{video_id}/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "scene_id": "scene_1",
    "rating": 4,
    "comments": "Excellent content, minor timing adjustment needed",
    "suggestions": ["Slow down narration", "Add visual examples"]
  }'
```
```yaml
# config/config.yaml
agents:
  scene_generator:
    model: "gpt-4"
    temperature: 0.7
    max_scenes: 10
  scene_critic:
    model: "gpt-4"
    evaluation_criteria: ["relevance", "clarity", "engagement"]
  audio_agent:
    voice_model: "tts-1"
    voice: "alloy"
    speed: 1.0
```
```yaml
# config/config.yaml
database:
  logging_level: "detailed"
  retention_days: 30
  backup_interval: "daily"
  audit_trail: true
```
Understanding that complex AI systems can present various challenges during deployment and operation, we have developed comprehensive troubleshooting guidance based on common user experiences and system behavior patterns.
Installation challenges typically manifest in three primary areas that users should address systematically. Docker build failures often indicate version compatibility issues, requiring verification that your Docker installation meets the minimum version requirements specified in our documentation. Port conflicts represent another frequent obstacle, particularly when ports 3000 and 5000 are already occupied by other services on your system. Additionally, permission issues can prevent proper system initialization, making it essential to ensure that all directories have appropriate read and write permissions for the Docker containers.
Runtime complications emerge primarily from resource constraints and external service interactions. API rate limits from OpenAI services can interrupt video generation workflows, making it crucial to implement exponential backoff strategies for API calls to maintain system stability. Memory usage monitoring becomes essential for sustained operation, as Docker containers must have adequate resources allocated to handle the computational demands of multi-agent video generation. Database locks can also occur during concurrent access scenarios, requiring careful monitoring of database connection patterns and potential conflicts.
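A generic backoff wrapper along these lines can protect any rate-limited call. The sketch below is illustrative: the retry bounds and the wrapped function are chosen for demonstration, and in practice you would catch the provider's specific rate-limit error rather than a bare Exception.

```python
# Hedged sketch of retrying rate-limited API calls with exponential backoff.
# Retry bounds and the wrapped call are illustrative assumptions.
import random
import time
from functools import wraps


def with_backoff(max_attempts: int = 5, base_delay: float = 1.0, max_delay: float = 30.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with jitter: 1s, 2s, 4s, ... capped at max_delay.
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    time.sleep(delay + random.uniform(0, 0.5))
        return wrapper
    return decorator


@with_backoff()
def generate_scene_text(prompt: str) -> str:
    # Placeholder for the rate-limited LLM call.
    raise RuntimeError("simulated rate limit")
```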
Quality-related issues in generated content often trace back to input specifications and system configuration. Poor scene generation typically results from unclear or insufficiently specific input descriptions, emphasizing the importance of providing detailed, well-structured content briefs. Audio synchronization problems usually stem from parameter misalignment in the audio generation configuration, while rendering failures commonly indicate output directory permission restrictions or insufficient disk space for the final video files.
Our support infrastructure provides multiple pathways for assistance, ensuring that users can access appropriate help regardless of their technical expertise level or project requirements.
Comprehensive documentation forms the foundation of our support ecosystem, with detailed API reference materials available directly through the /api/docs endpoint for immediate access during development work. Configuration guidance is thoroughly documented in the config/README.md file, providing step-by-step instructions for system customization. Development teams can access complete setup instructions through the CONTRIBUTING.md documentation, which covers everything from initial environment preparation to advanced customization techniques.
Community-driven support channels foster collaborative problem-solving and knowledge sharing among users. The GitHub Issues platform serves as the primary venue for submitting bug reports and feature requests, where community members and maintainers collaborate to resolve challenges and enhance system capabilities. Our GitHub Discussions forum provides a welcoming space for questions, best practice sharing, and community-driven troubleshooting sessions. The comprehensive wiki documentation offers detailed explanations, tutorials, and advanced configuration examples contributed by both maintainers and experienced community members.
Professional support services address enterprise requirements and specialized implementation needs. Commercial licensing options provide enhanced support and service level agreements for enterprise deployments requiring dedicated assistance. Custom development services enable specialized agent development and system modifications tailored to specific organizational requirements. Training services offer implementation and optimization consulting, helping teams maximize their investment in the video generation platform while developing internal expertise for ongoing operations and maintenance.
This project is licensed under the MIT License. See the LICENSE file for the complete license text.
Our system builds upon a carefully selected foundation of established open-source technologies and commercial services, each governed by their respective licensing terms. The OpenAI API integration operates under OpenAI's Terms of Service, ensuring compliance with their usage policies and rate limiting requirements. Docker containerization technology operates under the Apache License 2.0, providing robust deployment capabilities with clear licensing obligations. The Next.js frontend framework and Flask backend framework both operate under permissive licenses (MIT and BSD-3-Clause respectively), enabling flexible usage and modification while maintaining compliance with their attribution requirements.
Data privacy and security represent fundamental priorities in our system design, with comprehensive policies governing how user information and generated content are handled throughout the video creation process. All user inputs and generated content remain the exclusive property of the user, ensuring that creators maintain complete ownership and control over their intellectual property. API interactions with external services are governed by the respective service provider terms, with clear documentation of what data is transmitted and how it is processed.
Core video generation processing occurs locally within your deployment environment, minimizing external data exposure and providing enhanced security for sensitive content creation workflows. The system implements configurable retention policies for generated content, allowing organizations to establish data lifecycle management practices that align with their compliance requirements and storage preferences.
The open-source nature of our platform enables both non-commercial and commercial usage without licensing fees, making it accessible to individual creators, educational institutions, and commercial organizations alike. Attribution requirements for derivative works ensure proper recognition of the original contributors while enabling customization and extension for specific use cases.
Enterprise support services are available under separate commercial agreements for organizations requiring dedicated assistance, service level agreements, or custom development work. Standard MIT license limitations apply to the core platform, providing clarity on liability considerations while maintaining the flexibility that makes open-source adoption attractive for commercial deployments.
Our multi-agent video generation system represents a paradigm shift in how we approach automated content creation, with implications that extend far beyond traditional video production workflows. This work demonstrates that complex creative processes can be effectively decomposed into specialized, collaborative AI agents while maintaining quality standards and creative control that meet professional requirements.
The significance of this approach lies in its potential to democratize high-quality video content creation, making sophisticated production capabilities accessible to educators, small businesses, content creators, and researchers who previously lacked the resources or technical expertise for professional video production. By reducing the barrier to entry for video content creation, this system enables new forms of educational content, training materials, and creative expression that were previously constrained by resource limitations.
From a research perspective, this work contributes valuable insights to the growing field of multi-agent AI systems, particularly in creative and content generation domains. The integration of specialized agents with distinct responsibilities—scene generation, quality assessment, audio synthesis, and video compilation—demonstrates effective task decomposition strategies that could inform future multi-agent architectures across various creative and technical domains.
The human-in-the-loop integration patterns we've developed provide a blueprint for maintaining human agency and creative control in AI-driven workflows, addressing critical concerns about automation in creative industries. This balanced approach between automation efficiency and human oversight offers a model for implementing AI assistance in other creative fields such as writing, graphic design, and interactive media development.
The automated video generation capabilities present transformative opportunities for educational technology and knowledge transfer initiatives. Academic institutions can leverage this system to rapidly produce course materials, enabling faculty to focus on curriculum development rather than technical video production. The system's ability to transform textual content into engaging visual narratives has particular relevance for online education, where video content significantly improves learning outcomes and student engagement.
Corporate training and knowledge management applications represent another significant opportunity, where organizations can efficiently convert documentation, procedures, and training materials into accessible video formats. This capability becomes increasingly valuable as remote work patterns create demand for scalable, engaging training content that can be produced efficiently and updated regularly.
This work opens several promising avenues for future research and development. Adaptive Learning Integration represents one compelling direction, where the system could learn from user feedback patterns to automatically improve content generation quality and better align with individual creator preferences and audience requirements.
Cross-Modal Content Generation presents opportunities to extend the multi-agent approach beyond video, incorporating additional media types such as interactive presentations, augmented reality experiences, and adaptive content that responds to viewer engagement patterns. The modular architecture we've developed provides a foundation for integrating additional content generation capabilities.
Industry-Specific Customization offers practical research directions, where specialized agent configurations could be developed for specific domains such as medical education, technical documentation, marketing content, or scientific communication. Each domain brings unique requirements for accuracy, style, and presentation that could benefit from targeted agent specialization.
Collaborative Content Creation systems could build upon our human-in-the-loop patterns to enable multiple users to collaboratively guide and refine automated content generation, supporting team-based creative workflows and distributed content development processes.
The broader deployment of automated video generation technology carries significant implications for creative industries and content creation labor markets. While our system is designed to augment rather than replace human creativity, the efficiency gains it provides could reshape content production economics and workflow patterns across multiple industries.
Understanding and addressing these implications requires ongoing dialogue between technologists, content creators, and industry stakeholders to ensure that automation benefits enhance rather than disrupt existing creative ecosystems. The open-source nature of our platform enables community-driven development and adaptation, fostering innovation while maintaining accessibility for diverse users and use cases.
Our work contributes to the broader conversation about responsible AI development in creative domains, demonstrating approaches that preserve human agency while leveraging AI capabilities to enhance creative productivity and accessibility.
This multi-agent video generation system demonstrates the effective integration of specialized AI agents with human-in-the-loop feedback mechanisms. The comprehensive database logging, real-time progress monitoring, and quality assurance loops provide a robust foundation for automated yet controllable video content creation.
The system's modular architecture enables easy customization and extension, while the detailed installation and usage instructions ensure accessibility for both researchers and practitioners in the field of automated content generation. More importantly, this work establishes a framework for responsible AI integration in creative workflows that preserves human agency while democratizing access to professional-quality content creation capabilities.
Citation: If you use this system in academic research, please cite as:
AAIDC Project Contributors. (2024). AI Video Generator: A Multi-Agent System for Automated Video Content Creation. GitHub. https://github.com/ezedinff/AAIDC/tree/main/project-2