Multimodal CodeAct Agent Implementation with Custom Tools

🧠 Building a Multimodal Code Assistant with Groq and LlamaIndex

In this work, I present an experimental yet practical implementation of a multimodal code agent — an interactive assistant capable of understanding natural language, executing code, transcribing voice, and processing uploaded files.

The project is built using Streamlit for the interface, Groq LLMs for reasoning and generation, and LlamaIndex's CodeActAgent for orchestrating tool use and code execution. It is designed as a testbed for developing and evaluating multimodal reasoning agents, with a focus on:

Tool integration and code execution
Context management across modalities
Streaming inference with human-in-the-loop UX

You can explore all the code for this project in my GitHub repo.

🧭 Motivation

AI assistants are increasingly expected to do more than just answer questions — they need to reason, use tools, and handle inputs beyond plain text (e.g. files, audio, images). This project explores how to:

Leverage LLMs to run executable code as part of a conversation.
Maintain memory and context across multiple modalities (text, files, voice).
Provide an interface that feels interactive, explainable, and adaptable.

Rather than building an agent with rigid pipelines, the system uses dynamic prompting and a lightweight architecture that prioritizes flexibility and extensibility.

🛠️ Core Components

🧠 Reasoning Agent

At the core is CodeActAgent, a workflow-enabled LLM agent from LlamaIndex. It:

Parses natural language inputs
Decides whether to use tools or write code
Executes Python code safely via a custom sandbox
Streams partial outputs back to the frontend
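The project's actual sandbox is not shown here, so the following is only a minimal sketch of the kind of executor such an agent could call for each generated snippet. The class name `SimpleCodeExecutor` and its `execute` method are hypothetical. It keeps a shared namespace so variables persist between calls, captures printed output, and returns errors as text instead of raising. Note that plain `exec` is not a real security boundary; proper isolation would need a separate process or container.

```python
import contextlib
import io
import traceback
from typing import Any, Dict, Optional, Tuple


class SimpleCodeExecutor:
    """Hypothetical executor: runs snippets in one shared namespace.

    State (variables, imports, functions) persists across calls, which lets
    an agent build on earlier steps. This is NOT a secure sandbox.
    """

    def __init__(self, initial_namespace: Optional[Dict[str, Any]] = None) -> None:
        # Copy so the caller's dict is never mutated by executed code.
        self.namespace: Dict[str, Any] = dict(initial_namespace or {})

    def execute(self, code: str) -> Tuple[bool, str]:
        """Run `code`; return (succeeded, captured stdout or stdout + traceback)."""
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, self.namespace)
        except Exception:
            # Return the error as text so the agent can read it and retry.
            return False, stdout.getvalue() + traceback.format_exc()
        return True, stdout.getvalue()


executor = SimpleCodeExecutor()
ok, output = executor.execute("x = 21 * 2\nprint(x)")
print(ok, output)
```

Returning the traceback as a string, rather than letting the exception propagate, gives the model a chance to see what went wrong and correct its own code on the next turn.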

🗂️ Tool-Augmented Context

The assistant can be extended with custom tools — from file parsing to API lookups or image analysis. Tool functions are automatically imported and registered based on a naming convention (e.g. tools/custom_tools_<agent_name>.py).
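As an illustration of the naming convention, here is one way the discovery step might look. This is a sketch under assumptions, not the project's actual loader: the function name `load_custom_tools` and the rule "every public function defined in the module is a tool" are hypothetical choices.

```python
import importlib
import inspect
from typing import Callable, List


def load_custom_tools(agent_name: str, package: str = "tools") -> List[Callable]:
    """Import `<package>.custom_tools_<agent_name>` and return its public functions.

    Returns an empty list when no such module exists for the agent.
    """
    module_name = f"{package}.custom_tools_{agent_name}"
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Only swallow "the tools module itself is missing"; re-raise if the
        # module exists but one of its own imports failed.
        if exc.name not in (package, module_name):
            raise
        return []
    return [
        fn
        for name, fn in inspect.getmembers(module, inspect.isfunction)
        # Skip private helpers and functions merely imported into the module.
        if not name.startswith("_") and fn.__module__ == module.__name__
    ]
```

With a file such as `tools/custom_tools_demo.py` in place, `load_custom_tools("demo")` would return the functions defined there, ready to be handed to the agent as its tool list.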

🎙️ Voice + File Inputs

Voice: A built-in microphone recorder lets users dictate prompts. Audio is transcribed with a Whisper speech-to-text (STT) model served by Groq.
Files: Uploaded files are passed as context instructions to the agent, enabling use cases like CSV data exploration or PDF Q&A.
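To make "passed as context instructions" concrete, the sketch below turns uploaded files into a text block that could be prepended to the user's prompt. Everything here is illustrative: `UploadedFile`, `build_file_context`, the preview length, and the wording of the instruction are assumptions, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class UploadedFile:
    """Hypothetical stand-in for an uploaded file: a name plus raw bytes."""
    name: str
    data: bytes


def build_file_context(files: Iterable[UploadedFile], preview_chars: int = 500) -> str:
    """Describe each file (name, size, short preview) as one context string."""
    sections: List[str] = []
    for f in files:
        # Decode leniently so binary or oddly encoded files do not crash the app.
        preview = f.data.decode("utf-8", errors="replace")[:preview_chars]
        sections.append(f"File: {f.name} ({len(f.data)} bytes)\nPreview:\n{preview}")
    if not sections:
        return ""
    return "The user uploaded the following files:\n\n" + "\n\n".join(sections)


context = build_file_context([UploadedFile("sales.csv", b"region,total\nEU,10\n")])
print(context)
```

Sending only a short preview keeps the prompt small; the agent can then load the full file through generated code when it needs the complete data.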

💬 Streaming UI

The interface is implemented in Streamlit, allowing:

Real-time message rendering
Expandable sections for code and tool outputs
Sidebar file upload and mode toggles (e.g. reasoning, web search)

🔍 Use Cases

This assistant can be useful in multiple settings:

Data Analysis: Upload a CSV and ask the assistant to generate plots, compute summaries, or write cleaning code.
Coding Assistant: Describe logic in plain English; the assistant writes and runs Python snippets and shows you the results.
Conversational Q&A over Files: Drop in structured or semi-structured files, and ask questions without needing to manually inspect them.
Voice-Driven Interaction: Use hands-free prompts for accessibility or rapid prototyping.

📌 Design Considerations

Memory Persistence: Session-based memory lets the assistant remember files and context throughout the interaction.
Tool Use Transparency: All tool calls and code execution are surfaced to the user with labeled, collapsible sections.
Extensibility: New tools can be added simply by writing functions in Python and dropping them into a module.
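One way to support the labeled, collapsible sections is to split each agent response into prose and code segments before rendering. The sketch below assumes the agent wraps generated code in `<execute>...</execute>` tags; that delimiter, and the `split_response` helper itself, are assumptions for illustration rather than the project's confirmed format.

```python
import re
from typing import List, Tuple

# Assumed delimiter for agent-generated code; adjust to the agent's real format.
EXECUTE_BLOCK = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def split_response(response: str) -> List[Tuple[str, str]]:
    """Split a response into ordered ("text", ...) and ("code", ...) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in EXECUTE_BLOCK.finditer(response):
        # Prose that appears before this code block, if any.
        text = response[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("code", match.group(1).strip()))
        cursor = match.end()
    # Prose that follows the last code block, if any.
    tail = response[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


for kind, body in split_response("Let me compute.\n<execute>\nprint(1 + 1)\n</execute>\nDone."):
    print(kind, repr(body))
```

In the Streamlit UI, each "code" segment could then be placed inside an `st.expander` with a label, while "text" segments are rendered inline as normal chat output.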

🌱 Future Directions

RAG Integration for deeper file/document QA
Vision and Video Modalities via Groq VLMs
Web Search Tools for real-time knowledge grounding
Improved Reasoning Control (e.g. chain-of-thought toggles, tool triggers)

🎯 Takeaway

This project explores how we can bring together LLMs, tool use, voice input, and code execution into a single coherent assistant. It's lightweight, easy to extend, and designed to serve as a base for more complex multimodal agents.

If you're building interactive AI systems — especially with a need for explainability, code reasoning, or multimodal input — this approach can serve as a practical template.