Notes on CrewAI multimodal agents

Lennex Zinyando

21 Jan 2025 • 1 min read

Multimodal AI agents are systems that can process, understand, and generate multiple types of data, or "modalities", such as text, images, audio, and video. They are a step up from the text-only AI agents most people are used to. Below are some use cases for multimodal agents:

Healthcare - Doctors can get AI assistance that analyzes medical images and patient records together, combining visual findings with detailed medical histories. Integrating these data sources can support more accurate diagnoses and better patient care.
Customer Service - Virtual assistants can understand a problem by both seeing and reading about it. When customers share photos alongside their descriptions, these agents can quickly grasp the issue and provide more precise solutions.
Education - Learning becomes more dynamic with systems that blend visuals, text, and audio seamlessly. These agents can create personalized learning experiences by adapting to how students interact across different formats, making education more engaging and effective.

Multimodal Agents in CrewAI

CrewAI recently released a feature that enables multimodality in agents. For now, agents support text and image processing. To create a multimodal agent, set the multimodal parameter to True, which automatically configures the tools needed to handle non-text content, including AddImageTool.

from crewai import Agent

agent = Agent(
    role="Image Analyst",
    goal="Analyze and extract insights from images",
    backstory="An expert in visual content interpretation with years of experience in image analysis",
    multimodal=True,  # enables image handling; CrewAI adds AddImageTool automatically
)

The AddImageTool allows agents to process images. To get the most out of these agents, provide more context in the tasks you create so that the agents stay focused. The tool takes two parameters:

image_url - a link to the image you want processed
action - additional context or questions about the image
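
To make the point about context concrete, here is a minimal sketch of one way to compose a focused task description from those two values. The helper names build_image_task_description and make_image_task are hypothetical and not part of CrewAI; the sketch assumes the agent picks up the image URL from the task description and that Task accepts description, expected_output, and agent arguments, so check the CrewAI documentation for your version before relying on it.

```python
from urllib.parse import urlparse


def build_image_task_description(image_url: str, action: str) -> str:
    """Combine an image URL and a specific question into one task description.

    Hypothetical helper for illustration; not part of CrewAI.
    """
    parsed = urlparse(image_url)
    # Reject anything that is not an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Expected an absolute http(s) URL, got: {image_url!r}")
    # An empty action gives the agent nothing to focus on.
    if not action.strip():
        raise ValueError("Provide a specific question or instruction about the image")
    return f"Analyze the image at {image_url}. {action.strip()}"


def make_image_task(agent, image_url: str, action: str, expected_output: str):
    """Wrap the description in a CrewAI Task (assumes crewai is installed)."""
    # Imported lazily so the helper above can be used without crewai.
    from crewai import Task

    return Task(
        description=build_image_task_description(image_url, action),
        expected_output=expected_output,
        agent=agent,
    )
```

A call such as make_image_task(agent, "https://example.com/chart.png", "List the three largest bars and their values.", "A short bulleted list") keeps the request narrow, which is the kind of specificity the agent benefits from. The URL here is a placeholder.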

Remember, being specific with your requirements helps these agents deliver more focused and valuable results.

Want to explore further? The CrewAI documentation offers comprehensive guidance on multimodal agents, including an essential Best Practices section to help you maximize their potential.

I'd love to hear about your experiences with multimodal agents! Connect with me on X (formerly Twitter) or LinkedIn to share your thoughts and questions.
