Multimodal Live API with Gemini 2.0

Dec 11, 2024

Gemini 2.0 and the Multimodal Live API: A Deep Dive

Google's Gemini 2.0 Flash, released December 11th, 2024, introduces significant advancements in AI capabilities, particularly with its new Multimodal Live API. This article explores the API's features and how it enhances the Gemini experience.

Gemini 2.0 Flash: A Multimodal Powerhouse

[Image: Gemini stage presentation at Made by Google '24]

As reported by TechCrunch, Gemini 2.0 Flash is designed for speed and efficiency, natively generating images and audio alongside text. It can also use tools such as Google Search and code execution, as well as third-party apps and services. An experimental version is available via the Gemini API, AI Studio, and Vertex AI, with full audio and image generation capabilities rolling out to early-access partners ahead of a wider January release. Future integration into Android Studio, Chrome DevTools, Firebase, and Gemini Code Assist is planned.
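
To try the experimental model through the Gemini API, a minimal request with the google-genai Python SDK looks roughly like the sketch below. It assumes your API key is available as the GOOGLE_API_KEY environment variable; the prompt text is illustrative only.

from google import genai

# Assumes the GOOGLE_API_KEY environment variable is set;
# you can also pass api_key="..." to genai.Client() explicitly.
client = genai.Client()

# Call the experimental Gemini 2.0 Flash model with a plain text prompt.
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Summarize the Multimodal Live API in two sentences.",
)
print(response.text)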

The Google Developers blog highlights three key developer-focused updates: the experimental Gemini 2.0 Flash via the Gemini API, the Multimodal Live API for real-time interactions, and the Jules AI code assistant (currently in testing). An experimental data science agent using Gemini 2.0 in Colab is also available to trusted testers. Furthermore, developers can enhance their coding with Gemini 2.0 Flash in Gemini Code Assist.


Multimodal Live API: Real-time Interaction

Both TechCrunch and the Google Cloud documentation emphasize the Multimodal Live API's role in enabling low-latency bidirectional voice and video interactions. This API allows developers to build real-time applications with natural, human-like voice conversations, including the ability to interrupt the model's responses using voice commands. The API supports text, audio, and video input, providing text and audio output. It's available in the Gemini API as the BidiGenerateContent method, utilizing WebSockets.

The Google Cloud documentation provides a code example demonstrating a simple text-to-text interaction using the Multimodal Live API; the version below is wrapped in an asyncio entry point so it runs as a standalone script:

import asyncio

from google import genai

client = genai.Client()
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}

async def main():
    # Open a live, WebSocket-backed session with the model.
    async with client.aio.live.connect(model=model_id, config=config) as session:
        message = "Hello? Gemini, are you there?"
        print("> ", message, "\n")
        # end_of_turn=True marks the end of the user's turn.
        await session.send(message, end_of_turn=True)

        # Stream the model's reply as it arrives.
        async for response in session.receive():
            print(response.text)

asyncio.run(main())
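
Under the hood, the SDK's live session speaks the BidiGenerateContent protocol over a WebSocket connection. For readers curious about that lower level, the sketch below shows roughly what the exchange could look like with the Python websockets package. The endpoint URL and the JSON field names (setup, client_content, serverContent, and so on) are assumptions based on the v1alpha API surface, not verified sample code, so check the Multimodal Live API reference guide before relying on them.

import asyncio
import json
import os

import websockets  # pip install websockets

# Assumed v1alpha WebSocket endpoint for BidiGenerateContent; verify against the reference guide.
URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
    f"?key={os.environ['GOOGLE_API_KEY']}"
)

async def main():
    async with websockets.connect(URI) as ws:
        # The first message configures the session: model and response modalities.
        await ws.send(json.dumps({
            "setup": {
                "model": "models/gemini-2.0-flash-exp",
                "generation_config": {"response_modalities": ["TEXT"]},
            }
        }))
        await ws.recv()  # wait for the setup acknowledgement

        # Send one complete user turn.
        await ws.send(json.dumps({
            "client_content": {
                "turns": [{"role": "user", "parts": [{"text": "Hello? Gemini, are you there?"}]}],
                "turn_complete": True,
            }
        }))

        # Read streamed server messages until the model's turn is complete.
        async for raw in ws:
            msg = json.loads(raw)
            server_content = msg.get("serverContent", {})
            for part in server_content.get("modelTurn", {}).get("parts", []):
                print(part.get("text", ""), end="")
            if server_content.get("turnComplete"):
                print()
                break

asyncio.run(main())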

The API supports voice-to-voice and voice-and-video-to-voice interactions, with a maximum session duration of 2 minutes, and currently supports only English. A key limitation is that audio input/output negatively impacts function calling. Further details on capabilities and limitations are available in the Multimodal Live API reference guide (link provided in the Google Cloud documentation). A sketch of requesting audio output follows below.
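
For voice output, the session setup is the same with the response modality switched to audio. The sketch below is illustrative rather than documented sample code: it assumes the SDK exposes the streamed audio bytes on response.data and that the output is 16-bit PCM mono at 24 kHz, and simply collects those bytes into a WAV file.

import asyncio
import wave

from google import genai

client = genai.Client()
model_id = "gemini-2.0-flash-exp"
# Ask the model to reply with audio instead of text.
config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model_id, config=config) as session:
        await session.send("Please say a short hello.", end_of_turn=True)

        # Collect the streamed audio chunks for this turn.
        # Assumption: audio bytes are exposed on response.data.
        audio = bytearray()
        async for response in session.receive():
            if response.data:
                audio.extend(response.data)

    # Assumption: output audio is 16-bit PCM, mono, 24 kHz.
    with wave.open("gemini_reply.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(24000)
        f.writeframes(bytes(audio))

asyncio.run(main())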

Gemini Models: A Comparison

The Google AI for Developers documentation provides a detailed comparison of available Gemini models:

Model variant | Input(s) | Output | Optimized for
Gemini 2.0 Flash (experimental) | Audio, images, videos, and text | Text, images (coming soon), and audio (coming soon) | Next-generation features, speed, multimodal generation for diverse tasks
Gemini 1.5 Flash | Audio, images, videos, and text | Text | Fast and versatile performance across diverse tasks
Gemini 1.5 Flash-8B | Audio, images, videos, and text | Text | High volume, lower intelligence tasks
Gemini 1.5 Pro | Audio, images, videos, and text | Text | Complex reasoning tasks requiring more intelligence
Gemini 1.0 Pro (Deprecated) | Text | Text | Natural language tasks, multi-turn text and code chat, code generation
Text Embedding (text-embedding-004) | Text | Text embeddings | Measuring the relatedness of text strings
AQA | Text | Text | Providing source-grounded answers to questions

This table highlights the different input and output modalities supported by each model, along with their respective strengths. Note that Gemini 2.0 Flash is experimental and features are still under development.
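
To check programmatically which model variants are visible to your API key, the google-genai SDK also exposes a model-listing call; a minimal sketch (only the name and display_name attributes are assumed here):

from google import genai

client = genai.Client()

# List the model variants available to this API key.
for model in client.models.list():
    print(model.name, "-", model.display_name)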

Conclusion

The Multimodal Live API significantly enhances Gemini 2.0's capabilities, enabling real-time, interactive applications. Combined with Gemini's other multimodal features and tool integrations, it opens up exciting possibilities for developers to create innovative and engaging AI-powered experiences. However, it's crucial to be aware of the current limitations, especially regarding audio's impact on function calling and the experimental nature of some features.
