Voice AI · Gemini · macOS Automation · Browser Automation

Gemini Mac Pilot

A voice-controlled macOS agent with 24 tools that lets you control your entire Mac just by talking. Speak naturally and it executes complex, multi-step workflows — opening apps, clicking buttons, typing text, browsing the web, managing your Google Workspace, and running shell commands — all hands-free.

View on GitHub
24 agent tools · Real-time voice I/O · 3 integration layers · System-wide automation scope
Gemini Mac Pilot voice control hub
Overview

Voice-controlled macOS agent powered by Gemini.

Gemini Mac Pilot turns natural speech into real actions on your Mac. Say what you want and the agent executes complex, multi-step workflows: opening apps, clicking buttons, typing text, browsing the web, managing your Google Workspace (Gmail, Calendar, Drive, Docs), and running shell commands, all hands-free.

The architecture is split into three layers: a Voice layer, powered by the Gemini Live API, handles bidirectional speech in real time; a Brain layer, using Gemini 3 Flash Preview, reads the macOS accessibility tree and autonomously decides which tools to call (with parallel function-call support); and a Tools layer executes actions across native apps, the browser, Google Workspace, and the system shell.

This project was built for the Google Gemini Live Agent Challenge hackathon, showcasing how Gemini's native voice and function-calling capabilities can power a fully autonomous desktop agent that bridges voice interaction with real system control.

Architecture

Three layers, one voice command, zero manual steps.

The system runs on a three-layer architecture where each layer owns a specific domain. Voice handles speech, Brain handles decisions, and Tools handle execution — coordinated seamlessly through Gemini's native function-calling capabilities.

Layer 01

Voice Layer

Gemini Live API handles bidirectional speech in real time with native audio I/O. When the user speaks a natural language command, the voice layer captures it, understands the intent, and calls the brain's execute_task function to trigger the workflow. Live transcription and status updates are displayed in the floating overlay UI throughout the process.
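As a sketch of how that handoff can be wired: the Live API lets a client register function declarations the model may call during a conversation. Below is a hypothetical schema for the execute_task declaration; the parameter names and descriptions are assumptions for illustration, not the project's actual code.

```python
# Hypothetical function declaration the voice layer registers with the
# Gemini Live API so the model can hand detected tasks to the brain.
# The name "execute_task" comes from the project description; the
# parameter schema here is an illustrative assumption.
EXECUTE_TASK_DECLARATION = {
    "name": "execute_task",
    "description": (
        "Hand a natural-language task to the brain layer, which reads "
        "the macOS accessibility tree and calls automation tools until "
        "the task is complete."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "task": {
                "type": "string",
                "description": "The user's request, e.g. "
                               "'open Safari and search for flights'",
            },
        },
        "required": ["task"],
    },
}
```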

Layer 02

Brain Layer

Gemini 3 Flash Preview with native function calling receives the task along with the current macOS accessibility tree — a structured snapshot of every UI element on screen. It autonomously decides which tools to call, in what order, and with what parameters. It loops until the task is complete, re-reading the UI state after each action. Supports parallel function calls for efficiency.
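The decide-act-observe loop described above can be sketched in Python. Everything here is illustrative: the model and tools are stubbed callables, and the "done" signal is an assumed convention, not the project's actual protocol.

```python
from typing import Callable

def run_task(task: str,
             read_ui_tree: Callable[[], str],
             ask_model: Callable[[str, str], list[dict]],
             tools: dict[str, Callable[..., str]],
             max_steps: int = 20) -> str:
    """Minimal brain loop: re-read the UI, let the model pick tool
    calls, execute them, and stop when the model signals completion."""
    observations: list[str] = []
    for _ in range(max_steps):
        # The model sees the task, prior tool results, and a fresh
        # snapshot of the accessibility tree on every turn.
        calls = ask_model(task + "\n".join(observations), read_ui_tree())
        for call in calls:
            if call["name"] == "done":  # model reports the task finished
                return call["args"]["summary"]
            result = tools[call["name"]](**call["args"])
            observations.append(f"{call['name']} -> {result}")
    return "stopped: step budget exhausted"

# Demo with a stubbed model that clicks once, then reports done.
_turns = iter([
    [{"name": "click", "args": {"label": "OK"}}],
    [{"name": "done", "args": {"summary": "clicked OK"}}],
])
summary = run_task("press OK",
                   read_ui_tree=lambda: "<accessibility tree>",
                   ask_model=lambda prompt, tree: next(_turns),
                   tools={"click": lambda label: f"clicked {label}"})
```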

Layer 03

Tools Layer

24 tools across three domains: 8 native macOS tools (Accessibility API, keyboard, AppleScript, shell), 8 browser tools via Chrome DevTools Protocol connecting to the user's real Chrome sessions (Chrome 146+), and 8 Google Workspace tools (Gmail, Calendar, Drive, Docs) via CLI integration. Each tool call is dispatched to the appropriate handler and results flow back to the brain for the next decision.
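A tools layer like this typically reduces to a name-to-handler dispatch table. The sketch below shows two native-domain handlers and the dispatcher; the tool names are illustrative, not the project's actual identifiers.

```python
import subprocess

def run_shell(command: str) -> str:
    """Native-domain tool: run a shell command, return its stdout."""
    return subprocess.run(command, shell=True,
                          capture_output=True, text=True).stdout

def run_applescript(script: str) -> str:
    """Native-domain tool: execute AppleScript via osascript (macOS)."""
    return subprocess.run(["osascript", "-e", script],
                          capture_output=True, text=True).stdout

# One handler per tool name; the brain's function calls are routed
# through this table. Illustrative subset of the 24 tools.
TOOL_HANDLERS = {
    "run_shell": run_shell,
    "run_applescript": run_applescript,
    # accessibility, keyboard, browser, and Workspace tools would follow
}

def dispatch(name: str, args: dict) -> str:
    """Route one function call to its handler; the string result flows
    back to the brain for the next decision."""
    if name not in TOOL_HANDLERS:
        return f"unknown tool: {name}"
    return TOOL_HANDLERS[name](**args)
```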

Gemini Mac Pilot architecture layers diagram
Browser Automation

Real Chrome sessions, not headless browsers.

The browser tools connect to the user's actual Chrome session via the Chrome DevTools Protocol — no sandboxed environments, no simulations.

Navigate & Read

Browse to URLs and read page text content from the active Chrome tab. The agent understands what's on the page and can extract specific information.

Click & Type

Click elements by visible text or CSS selector, type into input fields, and interact with web pages exactly as a human would — in the user's real browser with all their cookies and sessions.

JavaScript Execution

Execute arbitrary JavaScript in the Chrome page context for advanced interactions, data extraction, or page manipulation that goes beyond standard click-and-type operations.
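Under the hood, CDP commands are plain JSON messages sent over the target tab's WebSocket endpoint (Chrome exposes it when launched with --remote-debugging-port). A minimal sketch of building a Runtime.evaluate call; the helper function is illustrative, not the project's code:

```python
import json
from itertools import count

_msg_id = count(1)  # CDP requires a unique id per message

def cdp_message(method: str, **params) -> str:
    """Build one Chrome DevTools Protocol message as JSON text, ready
    to send over the tab's webSocketDebuggerUrl."""
    return json.dumps({"id": next(_msg_id),
                       "method": method,
                       "params": params})

# Evaluate JavaScript in the page context and get the value back:
msg = cdp_message("Runtime.evaluate",
                  expression="document.title",
                  returnByValue=True)
```

The reply arrives on the same socket as a JSON object carrying the matching id and the evaluation result.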

Screenshots & Links

Capture screenshots of the current browser page and list all interactive elements and links — giving the brain full visibility into the page state for decision-making.

Gemini Mac Pilot workspace and tools visualization
Google Workspace

Gmail, Calendar, Drive, and Docs — by voice.

Eight dedicated tools integrate with Google Workspace via CLI, turning voice commands into real productivity actions across your entire Google ecosystem.

Gmail

Read and search emails from your inbox, compose and send new messages — all through natural voice commands. "Read my latest emails" or "Send a reply to John about the meeting."

Calendar

List upcoming events and create new ones on Google Calendar. "What's on my schedule today?" or "Create a meeting with the team tomorrow at 3pm."

Drive

List, search, and read files in Google Drive. Find documents by name, browse folders, and read file contents — all without touching the keyboard.

Docs

Read and edit Google Docs documents. The agent can read a document's content, make edits, and update text — turning voice into written productivity.
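One way such a CLI integration can look: the agent translates a tool call into an argv for a command-line wrapper and runs it as a subprocess. The binary name gws and the flag names below are hypothetical, purely to show the shape.

```python
def workspace_command(service: str, action: str, **kwargs) -> list[str]:
    """Build an argv for a hypothetical Google Workspace CLI wrapper.
    'gws' and its flags are illustrative, not the project's actual CLI."""
    argv = ["gws", service, action]
    for key, value in kwargs.items():
        argv += [f"--{key.replace('_', '-')}", str(value)]
    return argv

# "Send a reply to John about the meeting" might become:
cmd = workspace_command("gmail", "send",
                        to="john@example.com",
                        subject="Re: meeting",
                        body="Sounds good.")
```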

How It Works

From voice command to completed action.

Step 01 — Voice Input

Natural language capture

The user speaks a natural language command. The Gemini Live API captures audio bidirectionally, transcribes speech in real time, and understands the user's intent. When a task is detected, the voice layer calls the brain's execute_task function to initiate the workflow.

Step 02 — Brain Decides

Autonomous decision-making

The brain (Gemini 3 Flash Preview) receives the task together with the current macOS accessibility tree, a structured snapshot of every UI element on screen. It decides which tools to call, in what order, and with what parameters, then loops, re-reading the UI state after each action, until the task is complete.

Step 03 — Tools Execute

Multi-domain tool dispatch

Each tool call is dispatched to the appropriate handler across 24 tools: clicking native UI elements via the Accessibility API, typing text via keyboard simulation, navigating and interacting with Chrome via CDP, managing Gmail, Calendar, Drive, and Docs through Google Workspace CLI, opening apps, or running shell commands. Parallel function calls allow multiple tools to execute simultaneously.
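The parallel dispatch can be sketched with a small fan-out helper. This uses a thread pool and stubbed tools; it is an illustration of the idea, not the project's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(calls: list[dict], tools: dict) -> list[str]:
    """Run independent tool calls from a single model turn concurrently,
    returning results in the original call order so the brain can read
    them back as one batch."""
    with ThreadPoolExecutor(max_workers=max(len(calls), 1)) as pool:
        futures = [pool.submit(tools[c["name"]], **c["args"])
                   for c in calls]
        return [f.result() for f in futures]

# Two independent reads dispatched at once (stubbed tools):
results = execute_parallel(
    [{"name": "read_page", "args": {"url": "https://example.com"}},
     {"name": "list_events", "args": {"day": "today"}}],
    {"read_page": lambda url: f"text of {url}",
     "list_events": lambda day: f"events for {day}"},
)
```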

Step 04 — Voice Response

Spoken result & UI feedback

Once the brain completes the workflow, it returns a summary to the voice layer. Gemini Live speaks the result back to the user through the floating overlay UI, which also displays live transcription, action steps, and status updates throughout the entire process.

Tech Stack

The full system.

AI & Voice

Gemini Live API · Gemini 3 Flash Preview · Vertex AI · WebSocket

macOS & Browser

macOS Accessibility API · Chrome DevTools Protocol · AppleScript · PyWebView

Platform & Integrations

Python · Google Workspace CLI · Gmail · Google Calendar · Google Drive · Google Docs

Need a voice-controlled AI agent?

We build autonomous AI agents that bridge voice interaction with real system control. Let's talk about your use case.