Gemini Live API Voice Agent macOS Automation Hackathon #GeminiLiveAgentChallenge

Building a Voice-Controlled macOS Agent with Gemini

What if you could control your entire Mac just by talking? We built Gemini Mac Pilot for the Gemini Live Agent Challenge — a voice agent that sees your screen, understands your apps, and takes action.

March 2026 8 min read #GeminiLiveAgentChallenge
Gemini Mac Pilot — Voice Control Hub

Every AI assistant today lives in a text box. You type a question, get an answer, maybe copy-paste something into another app. But the promise of AI has always been bigger than that — an assistant that actually does things on your computer, not just talks about them.

That idea is what drove us to build Gemini Mac Pilot for the Gemini Live Agent Challenge. It is a voice-controlled macOS agent that can open your apps, navigate your browser, read your screen, type messages, run commands, and complete multi-step workflows — all from natural speech. No keyboard required.

Say "Open WhatsApp and message Daniel that I'll be late" and Mac Pilot opens WhatsApp, finds Daniel's conversation, types the message, and sends it. Say "Play Rosalia on YouTube" and it opens Chrome, searches YouTube, and plays the video. The interaction feels like having a skilled assistant sitting next to you, operating your Mac while you talk.

The Problem

AI assistants that can't actually assist.

Current AI assistants are fundamentally disconnected from where you actually work. They live in their own window, isolated from your desktop, your apps, your browser tabs. When you ask an AI to "check my email," it tells you how to check your email. When you ask it to "schedule a meeting," it gives you instructions. The gap between AI capability and AI usefulness is the last mile problem: getting the AI to actually interact with your real environment.

We wanted to bridge that gap completely. Not a chatbot that gives instructions, but an agent that executes. Not text-only, but voice-first — because if you are going to hand control to an AI, you need to be able to talk to it naturally, interrupt it, correct it, and guide it in real time. And it has to work across everything on your Mac: native apps, web apps, system utilities, the terminal.

The Gemini Live Agent Challenge gave us the perfect excuse to build this. Gemini's Live API provides something no other foundation model offers: true bidirectional native audio streaming. Not speech-to-text-to-LLM-to-text-to-speech. Actual audio in, audio out, with function calling in between. That changes the architecture fundamentally.

Architecture

Two brains, one agent.

The key architectural insight is separating voice from reasoning. Trying to do both in a single model creates a bottleneck — voice requires low-latency streaming, while tool-calling workflows need deliberate multi-step planning. So we split the agent into two layers.

Gemini Mac Pilot — Architecture layers diagram showing Voice, Brain, and Tools layers
Voice Layer

Gemini Live API

The voice layer uses the Gemini Live API with native audio for bidirectional speech. The user speaks naturally, and the model streams audio responses back in real time. When the user requests an action — "open my email" — the voice layer calls an execute_task function, handing the request to the brain layer. The voice session stays open, providing a continuous conversational experience while the brain works in the background.
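Conceptually, the voice layer boils down to one function declaration plus a streaming session. Here is a minimal sketch using the google-genai SDK; the model id, config shape, and handler names are illustrative, not our exact production code:

```python
# Sketch of the voice layer. The execute_task declaration is the single
# bridge to the brain layer; everything else is handled by the Live API.

EXECUTE_TASK = {
    "name": "execute_task",
    "description": "Hand a desktop task to the brain layer for execution.",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "task": {"type": "STRING", "description": "Natural-language task"},
        },
        "required": ["task"],
    },
}

def build_live_config() -> dict:
    """Live API session config: native audio in/out plus one function."""
    return {
        "response_modalities": ["AUDIO"],
        "tools": [{"function_declarations": [EXECUTE_TASK]}],
    }

async def run_voice_layer(handle_task):
    # Imported lazily so the config builder above works without the SDK.
    from google import genai
    client = genai.Client()
    async with client.aio.live.connect(
        model="gemini-live-native-audio-preview",  # assumed model id
        config=build_live_config(),
    ) as session:
        async for message in session.receive():
            if message.tool_call:  # the model asked for execute_task
                for call in message.tool_call.function_calls:
                    result = await handle_task(call.args["task"])
                    await session.send_tool_response(function_responses=[
                        {"id": call.id, "name": call.name,
                         "response": {"result": result}},
                    ])
```

The important property is that `session.receive()` keeps streaming audio while `handle_task` runs, which is what keeps the conversation alive during long brain tasks.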

Brain Layer

Gemini 3 Flash Preview

The brain layer uses Gemini 3 Flash Preview with native function calling and parallel function call support. It receives a task description, reads the current macOS accessibility tree to understand what is on screen, plans a sequence of actions, and executes them through tool calls — multiple tools can fire simultaneously when independent. It runs in a loop — call tools, observe the results, decide the next step — until the task is complete. This is where the actual reasoning happens: figuring out which buttons to click, what text to type, which Workspace APIs to call, and how to navigate complex multi-step workflows across 24 tools.

24 tools for full Mac + Workspace control

Native macOS: open_app find_app click set_value focus type_text press_keys shell
Browser (Chrome CDP): browse read_page get_links click_text browser_click browser_type chrome_js screenshot
Google Workspace: gmail_read gmail_send calendar_list calendar_create drive_list drive_read docs_read docs_edit

8 native macOS tools (click, focus, set_value, shell), 8 browser tools via Chrome CDP (browse, click_text, chrome_js), and 8 Google Workspace tools (Gmail, Calendar, Drive, Docs) — the brain can reach anything on the Mac and in the cloud.

Deep Dive

The interesting parts.

Reading any app's UI with the Accessibility API

The macOS Accessibility API (AX API) is the backbone of the native app control. Every macOS application exposes its UI as an accessibility tree — a hierarchy of elements with roles (button, text field, menu item), labels, values, and positions. We traverse this tree recursively, assigning each element a numeric ID, and present it to Gemini as a structured text representation. The brain sees something like [42] Button "Send" (230, 450) and can call click(42) to interact with it.
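The numbered-element rendering can be sketched as a pure formatter over the tree. The dict shape below (role, label, pos, children) is an assumed simplification of what the AX API actually returns via pyobjc:

```python
# Sketch: walk an accessibility tree depth-first, numbering each element so
# the brain can call click(n) and we can resolve n back to the element.

def render_tree(root: dict) -> tuple[list[str], dict[int, dict]]:
    """Return text lines like [42] Button "Send" (230, 450) plus an
    ID -> element registry for executing clicks later."""
    lines: list[str] = []
    registry: dict[int, dict] = {}

    def walk(node: dict) -> None:
        n = len(registry) + 1
        registry[n] = node
        x, y = node.get("pos", (0, 0))
        lines.append(f'[{n}] {node["role"]} "{node.get("label", "")}" ({x}, {y})')
        for child in node.get("children", []):
            walk(child)

    walk(root)
    return lines, registry
```

In the real implementation the same walk also applies the filtering described below (dropping hidden elements, collapsing redundant containers) before the text reaches the model.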

This approach is powerful because it works with any native macOS app without any app-specific integration. WhatsApp, Notes, Finder, System Settings — if it has an accessibility tree, Mac Pilot can read and control it. The challenge is keeping the tree representation compact enough for the model's context while retaining enough detail for accurate interaction. We filter out hidden elements, collapse redundant containers, and prioritize interactive elements.

How the brain decides what to do

The brain loop is deceptively simple. It receives a task, gets the current UI state, and enters a generate-execute cycle. Each turn, Gemini 3 Flash Preview receives the task description, the current accessibility tree or browser page state, and the history of actions taken so far. It decides the next tool call — or declares the task complete. There is no hardcoded workflow logic. The model reasons about what it sees and picks the right tool.

This means Mac Pilot handles unexpected situations gracefully. If a dialog box pops up, the brain sees it in the next UI state read and deals with it. If a button is not where expected, it searches for alternatives. The native function calling in Gemini 3 Flash Preview is what makes this practical — the model reliably outputs well-structured tool calls without prompt engineering gymnastics.
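The whole loop fits in a few lines. In this sketch, `model_step` stands in for a Gemini 3 Flash Preview call with function calling, and the dispatch mechanics are an assumption about the design rather than our exact code:

```python
# Minimal sketch of the generate-execute brain loop: no hardcoded workflow,
# just "show the model the world, run what it asks for, repeat".

def run_brain(task: str, model_step, tools: dict, read_ui, max_turns: int = 20):
    """Each turn the model sees the task, a fresh UI read, and the action
    history; it returns either a tool call or a done signal."""
    history = []
    for _ in range(max_turns):
        decision = model_step(task=task, ui=read_ui(), history=history)
        if decision["done"]:
            break
        name, args = decision["tool"], decision.get("args", {})
        result = tools[name](**args)  # execute the chosen tool
        history.append({"tool": name, "args": args, "result": result})
    return history
```

Because the UI is re-read every turn, surprises like a popped-up dialog simply show up in the next `read_ui()` result, and the model plans around them.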

Browser automation with Chrome DevTools Protocol

For web interactions, the Accessibility API is not enough — web content inside Chrome is opaque to AX. So we connect directly to the user's real Chrome browser via the Chrome DevTools Protocol (CDP), available in Chrome 146+. The brain can navigate to URLs, read page text, list all interactive elements, click by text or CSS selector, type into inputs, and execute arbitrary JavaScript — all inside the user's actual browsing session with their cookies, extensions, and logged-in accounts. This gives Mac Pilot full web capabilities: searching Google, reading emails in Gmail, playing YouTube videos, filling out forms. The CDP browser tools and native macOS tools can be used together in the same workflow.

Google Workspace integration

Gemini Mac Pilot — Workspace colored streams showing Gmail, Calendar, Drive, and Docs integration

Beyond the desktop and browser, Mac Pilot now integrates directly with Google Workspace through CLI tools. The brain can read and send Gmail messages, list and create Google Calendar events, browse and read files from Google Drive, and read or edit Google Docs — all through voice commands. Say "check my email" and the agent reads your inbox. Say "create a meeting with Sarah tomorrow at 2pm" and it creates a Calendar event. This adds 8 workspace-specific tools, bringing the total to 24 tools across native macOS, browser, and cloud productivity.
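Each Workspace tool is a thin shell-out wrapper. In this sketch the CLI name (`gws`) and its flags are hypothetical stand-ins for the actual binaries we call; the mapping idea is what matters:

```python
# Sketch: map a brain tool call like gmail_send(to=..., subject=...) onto
# a CLI invocation. "gws" is a hypothetical command name.
import subprocess

def workspace_argv(tool: str, **kwargs) -> list[str]:
    """gmail_send -> ["gws", "gmail", "send", "--to", ..., "--subject", ...]"""
    service, action = tool.split("_", 1)
    argv = ["gws", service, action]
    for key, value in kwargs.items():
        argv += [f"--{key}", str(value)]
    return argv

def run_workspace_tool(tool: str, **kwargs) -> str:
    """Execute the CLI and hand stdout back to the brain as the tool result."""
    out = subprocess.run(workspace_argv(tool, **kwargs),
                         capture_output=True, text=True, check=True)
    return out.stdout
```

Because the brain only sees tool names and stdout, swapping a CLI for a direct Google API client later would not change the tool interface.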

Challenges

What made us sweat.

Chrome CDP on macOS. Our original plan was to use Chrome DevTools Protocol (CDP) to connect directly to the user's existing Chrome instance. Early on, this was unreliable on macOS — connecting to an already-running Chrome via the debugging port was flaky. With Chrome 146+, Google added stable CDP support that we now use in production. The agent connects to the user's real Chrome with their cookies, extensions, and logged-in sessions — a much better experience than a separate browser instance.

Voice session time limits. The Gemini Live API has a 15-minute session limit. For a desktop agent that should run all day, that is a problem. We implemented automatic session reconnection — when the session approaches the time limit or disconnects, the voice layer cleanly reconnects and resumes listening. The user does not notice the transition. Getting the reconnection logic right without dropping audio or creating duplicate sessions took several iterations.
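The rollover logic is easier to show than to describe. The 15-minute ceiling comes from the Live API; the safety margin and the supervisor shape below are assumptions about the design:

```python
# Sketch of session rollover: rotate shortly before the server-side limit
# so the close is clean and the reconnect is invisible to the user.
SESSION_LIMIT_S = 15 * 60
RECONNECT_MARGIN_S = 30  # assumed safety margin

def should_reconnect(session_age_s: float) -> bool:
    """True once the session is within the margin of the hard limit."""
    return session_age_s >= SESSION_LIMIT_S - RECONNECT_MARGIN_S

async def voice_supervisor(open_session, handle_event):
    import time
    while True:  # each loop iteration is one Live session
        started = time.monotonic()
        async with open_session() as session:
            async for event in session:
                handle_event(event)
                if should_reconnect(time.monotonic() - started):
                    break  # exit cleanly; the while loop reopens a session
```

The hard-won part is not shown here: making sure the microphone stream is handed over without dropping audio and that exactly one session is ever open.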

Keeping the UI responsive. Brain tasks can take 10-30 seconds as the model reasons through multi-step workflows. During that time, the floating overlay UI needs to show progress — which step is being executed, what tool is being called, whether something failed. We built an event bus that streams status updates from the brain and voice layers to the PyWebView overlay via WebSocket. The UI shows live transcription, step-by-step action progress, and result summaries.
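The event bus itself is small. A queue-per-subscriber fan-out like the sketch below is our assumption about the cleanest shape; the WebSocket handler that feeds the PyWebView overlay would simply drain its queue:

```python
# Sketch of the status event bus: brain and voice layers publish,
# the overlay's WebSocket handler subscribes and drains its own queue.
import asyncio

class EventBus:
    def __init__(self) -> None:
        self._subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, event: dict) -> None:
        # Non-blocking: publishers (the brain mid-task) must never stall.
        for q in self._subscribers:
            q.put_nowait(event)

async def demo() -> dict:
    bus = EventBus()
    overlay_queue = bus.subscribe()
    bus.publish({"type": "tool_call", "tool": "click", "step": 1})
    return await overlay_queue.get()
```

Keeping `publish` non-blocking matters: a slow or disconnected overlay must never stall a brain task mid-workflow.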

Accessibility permissions. macOS requires explicit user permission for accessibility control, and each terminal app needs separate authorization. The setup experience requires users to grant access in System Settings, which is not something most people do daily. We added clear setup instructions and runtime error messages that guide users through the permission flow, but it remains the biggest onboarding friction point.

Tech Stack

What powers it.

AI & Voice

Gemini Live API Gemini 3 Flash Preview Vertex AI Native Audio Function Calling

Desktop, Browser & Workspace

macOS Accessibility API Chrome CDP (Chrome 146+) Google Workspace CLI AppleScript PyWebView WebSockets

Runtime

Python 3.11+ asyncio PyAudio Google GenAI SDK macOS 13+

What's Next

Where we're taking it.

Mac Pilot has grown from a hackathon prototype to a 24-tool agent covering native macOS, browser, and Google Workspace. The immediate priorities are more robust error recovery — when a tool call fails, the brain should try alternative approaches rather than giving up — and expanding further with clipboard operations, notification handling, and deeper Workspace integrations.

We are also exploring a Cloud Run deployment model where the brain logic runs in the cloud, reducing the local footprint to just the voice layer, the UI overlay, and the tool execution bridge. This would make updates seamless and allow more powerful reasoning models without worrying about local compute.

The long-term vision is a desktop agent that learns your patterns. It should know that when you say "start my morning," you mean open Slack, check email, and open your project management tool. Personal automation powered by observation and voice.

Try it yourself.

Gemini Mac Pilot is open source. If you have a Mac, a Google Cloud project, and a microphone, you can be voice-controlling your desktop in minutes. We built this in a hackathon sprint, and we think it demonstrates something important: the combination of Gemini's native audio capabilities with its function calling makes truly useful voice agents possible — not as a future promise, but right now.

Need a voice-powered AI agent?

We build AI agents that go beyond chat — systems that interact with real applications and automate real workflows. Let's talk about your use case.