Multimodal AI - the next frontier (and its convergence with agents)

AI systems now combine text, images, audio, and video simultaneously. This isn’t just an extension of language models - it’s a fundamental shift in how AI works. Two technologies are converging: multimodal perception and autonomous agents. AI systems can now see, listen, interpret, and act without human intervention.

The technical reality of multimodality

Building systems that handle multiple inputs is a complex task. Each input type creates specific challenges:

  • Text: Language processing works well, but models struggle to connect text with visual or audio context.
  • Vision: Image recognition is solid, but visual reasoning is still weak. Advanced models can read text in images yet miss the logical connections between visual elements, such as the steps in a flowchart or the relationships in a diagram.
  • Audio: Speech recognition works, but real-time speech systems need precise integration across transcription, language generation, emotional control, and speech synthesis.
  • Video: Video adds a temporal dimension. Models must track changes over time, identify patterns, and keep their understanding consistent as objects move and lighting shifts across frames.

Building a single model that handles all of these inputs requires shared internal representations. Most providers still rely on hybrid pipelines of specialised models stitched together, rather than truly unified ones.
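To make that distinction concrete, here is a minimal sketch of the cascaded, hybrid approach for a voice assistant: separate speech recognition, language, and speech synthesis components joined by glue code. The class names are illustrative stubs rather than a real framework; the point is that every hand-off between stages adds latency and drops nuance such as tone, which is exactly the gap unified multimodal models aim to close.

```python
# Sketch of the "hybrid pipeline" pattern: one specialised model per modality,
# connected by glue code, rather than a single unified multimodal model.
# The stub classes below are illustrative, not a real framework.
from dataclasses import dataclass


@dataclass
class AgentReply:
    text: str
    audio: bytes


class SpeechToText:
    def transcribe(self, audio: bytes) -> str:
        # In practice: a dedicated ASR model (e.g. a Whisper-class model).
        raise NotImplementedError


class LanguageModel:
    def respond(self, prompt: str) -> str:
        # In practice: an LLM call, local or hosted.
        raise NotImplementedError


class TextToSpeech:
    def synthesise(self, text: str) -> bytes:
        # In practice: a TTS engine, ideally with control over tone and pacing.
        raise NotImplementedError


def voice_turn(audio_in: bytes,
               asr: SpeechToText,
               llm: LanguageModel,
               tts: TextToSpeech) -> AgentReply:
    """One conversational turn through the cascaded pipeline.

    Each hand-off (audio -> text -> text -> audio) adds latency and loses
    information such as emphasis and emotion, which unified models avoid
    by sharing one internal representation across modalities.
    """
    transcript = asr.transcribe(audio_in)
    answer = llm.respond(transcript)
    return AgentReply(text=answer, audio=tts.synthesise(answer))
```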

When perception meets action

Multimodal models and autonomous agents work together for real tasks:

  • Sales agents combine CRM data, pipeline screenshots, and meeting notes to create reports
  • Virtual machine agents process screen captures, recognise interface elements, and take actions like clicking or typing
  • Voice assistants detect emotional cues and adjust responses accordingly

You’re seeing “embodied software agents” - AI that works like humans in digital environments. These systems don’t just understand language; they navigate interfaces and take autonomous actions.
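As a rough illustration of that perception-action loop, the sketch below captures the screen, asks a multimodal model for the next step, and executes it with the pyautogui library. The Action schema and the decide() function are hypothetical placeholders; a production agent would add a real model call, error handling, and guardrails.

```python
# Minimal observe -> decide -> act loop behind an "embodied software agent".
# decide() is a stand-in for a multimodal model call; pyautogui handles
# screen capture and input simulation. Illustrative only.
import time
from dataclasses import dataclass

import pyautogui  # pip install pyautogui


@dataclass
class Action:
    kind: str                 # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def decide(screenshot) -> Action:
    """Ask a vision-language model what to do next.

    In a real agent this would send the screenshot (plus the task
    description) to a hosted computer-use style API or a local
    vision-language model, and parse its reply into an Action.
    """
    raise NotImplementedError("plug a multimodal model in here")


def run_agent(max_steps: int = 20) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()   # observe the current screen
        action = decide(screenshot)           # let the model pick an action
        if action.kind == "done":
            break
        if action.kind == "click":
            pyautogui.click(action.x, action.y)
        elif action.kind == "type":
            pyautogui.write(action.text, interval=0.02)
        time.sleep(0.5)                       # give the UI time to react
```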

Anthropic’s Computer Use API and OpenAI’s equivalent tooling let models operate software by observing screens and simulating keyboard and mouse actions, while Gemini processes real-time video feeds of active workspaces. Open-source models such as Qwen-VL demonstrate screen control on local machines, so you can run fully private automation agents without relying on a cloud provider.
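To ground the local option, the sketch below follows Qwen’s published Hugging Face Transformers examples to ask a locally run Qwen2-VL checkpoint about a screenshot. The model ID, helper package, and prompt are assumptions you would adapt to your setup, but nothing in this loop leaves your machine.

```python
# Sketch of local screen understanding with a Qwen vision-language model,
# based on Qwen's published Hugging Face examples. The exact model ID and
# helper utilities are assumptions and may differ across versions.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint; swap as needed

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Ask the model to interpret a captured screenshot of the desktop.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},
        {"type": "text", "text": "Describe this screen and locate the Save button."},
    ],
}]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```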

Two paths forward

Closed Source (OpenAI, Google, Anthropic)

Corporate labs deliver the most sophisticated models today. GPT-4o, Gemini 2.5, and Claude 4 offer:

  • Integrated speech, vision, and reasoning pipelines
  • Sub-300ms speech-to-speech interaction
  • Real-time video and screen perception
  • Multi-step reasoning across combined contexts

But you face limitations:

  • Usage-based API costs that climb quickly at scale
  • Data privacy restrictions for sensitive information
  • Limited customisation and fine-tuning options

Open Source (Qwen, OpenGVLab, Maitrix, Kyutai)

Open-source models are catching up fast:

  • Qwen VL offers vision-language reasoning with local screen control
  • Qwen Omni and Maitrix Voila integrate speech for local voice agents
  • OpenGVLab InternVL-3 outperforms its parent models through community fine-tuning
  • Kyutai Moshi combines vision and speech for richer conversations

You get:

  • Private, on-premise deployment
  • Custom agents for specialised business needs
  • Full control over security and compliance
  • Rapid iteration on domain-specific tasks

While open-source models still lag behind in some benchmarks, community-driven improvements suggest they’ll soon rival closed platforms for enterprise use.

What this means for you

AI that combines vision, speech, and action is happening now. Models can see, listen, reason, and act in single, coherent systems. This creates new application categories, from private digital workers to autonomous software controllers. Both closed and open-source approaches contribute: corporate labs push real-time performance boundaries, while open-source innovators democratise access and customisation. The organisations experimenting with these capabilities today will shape how practical AI develops next.

#ai-llm

Tomás Correia Marques
