Gemini Mac Pilot
A voice-controlled macOS agent with 24 tools that lets you control your entire Mac just by talking. Speak naturally and it executes complex, multi-step workflows — opening apps, clicking buttons, typing text, browsing the web, managing your Google Workspace, and running shell commands — all hands-free.
View on GitHub
Voice-controlled macOS agent powered by Gemini.
The architecture is split into three layers: a Voice layer powered by the Gemini Live API handles bidirectional speech in real time; a Brain layer using Gemini 3 Flash Preview reads the macOS accessibility tree and autonomously decides which tools to call (with parallel function call support); and a Tools layer executes actions across native apps, the browser, Google Workspace, and the system shell.
This project was built for the Google Gemini Live Agent Challenge hackathon, showcasing how Gemini's native voice and function-calling capabilities can power a fully autonomous desktop agent that bridges voice interaction with real system control.
Three layers, one voice command, zero manual steps.
The system runs on a three-layer architecture where each layer owns a specific domain. Voice handles speech, Brain handles decisions, and Tools handle execution — coordinated seamlessly through Gemini's native function-calling capabilities.
Voice Layer
Gemini Live API handles bidirectional speech in real time with native audio I/O. When the user speaks a natural language command, the voice layer captures it, understands the intent, and calls the brain's execute_task function to trigger the workflow. Live transcription and status updates are displayed in the floating overlay UI throughout the process.
Brain Layer
Gemini 3 Flash Preview with native function calling receives the task along with the current macOS accessibility tree — a structured snapshot of every UI element on screen. It autonomously decides which tools to call, in what order, and with what parameters, and it loops until the task is complete, re-reading the UI state after each action. It also supports parallel function calls for efficiency.
Tools Layer
24 tools across three domains: 8 native macOS tools (Accessibility API, keyboard, AppleScript, shell), 8 browser tools via Chrome DevTools Protocol connecting to the user's real Chrome sessions (Chrome 146+), and 8 Google Workspace tools (Gmail, Calendar, Drive, Docs) via CLI integration. Each tool call is dispatched to the appropriate handler and results flow back to the brain for the next decision.
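The Brain-to-Tools loop described above can be sketched in a few lines. Everything here — `run_task`, `get_accessibility_tree`, the `model.decide` interface, the handler table — is an illustrative stand-in for the project's actual internals, not its real API:

```python
# Minimal sketch of the brain loop, assuming a model interface that
# returns either tool calls or a completion summary. All names are
# hypothetical placeholders, not Gemini Mac Pilot's actual code.

def run_task(model, task, get_accessibility_tree, tool_handlers, max_steps=20):
    """Show the model the current UI state, execute the tools it
    requests, and repeat until it reports the task as done."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Re-read the accessibility tree so the model sees the UI
        # state produced by the previous action.
        ui_state = get_accessibility_tree()
        response = model.decide(history, ui_state)
        if response.done:
            return response.summary
        # The model may request several tools in one turn.
        for call in response.tool_calls:
            result = tool_handlers[call.name](**call.args)
            history.append({"role": "tool", "name": call.name, "result": result})
    return "step limit reached"
```

The key design point is the re-read before every decision: each tool call can change the screen, so the loop never trusts a stale snapshot.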
Real Chrome sessions, not headless browsers.
The browser tools connect to the user's actual Chrome session via the Chrome DevTools Protocol — no sandboxed environments, no simulations.
Browse to URLs and read page text content from the active Chrome tab. The agent understands what's on the page and can extract specific information.
Click elements by visible text or CSS selector, type into input fields, and interact with web pages exactly as a human would — in the user's real browser with all their cookies and sessions.
Execute arbitrary JavaScript in the Chrome page context for advanced interactions, data extraction, or page manipulation that goes beyond standard click-and-type operations.
Capture screenshots of the current browser page and list all interactive elements and links — giving the brain full visibility into the page state for decision-making.
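Under the hood, CDP is a JSON-RPC-style protocol over Chrome's DevTools WebSocket. A rough sketch of what the browser tools send — `Page.navigate` and `Runtime.evaluate` are real CDP methods, but the helper around them is illustrative, not the project's code:

```python
import json
from itertools import count

# Each CDP command is a JSON message with a unique id so the
# asynchronous reply coming back over the WebSocket can be matched
# to the request that caused it.

_ids = count(1)

def cdp_command(method, **params):
    """Build one CDP command message as a JSON string."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# Navigate the user's real tab, then read the page text via JS evaluation.
navigate = cdp_command("Page.navigate", url="https://example.com")
read_text = cdp_command("Runtime.evaluate",
                        expression="document.body.innerText",
                        returnByValue=True)
```

Because the WebSocket belongs to the user's own Chrome instance, these commands act on their logged-in tabs, cookies included.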
Gmail, Calendar, Drive, and Docs — by voice.
Eight dedicated tools integrate with Google Workspace via CLI, turning voice commands into real productivity actions across your entire Google ecosystem.
Read and search emails from your inbox, compose and send new messages — all through natural voice commands. "Read my latest emails" or "Send a reply to John about the meeting."
List upcoming events and create new ones on Google Calendar. "What's on my schedule today?" or "Create a meeting with the team tomorrow at 3pm."
List, search, and read files in Google Drive. Find documents by name, browse folders, and read file contents — all without touching the keyboard.
Read and edit Google Docs documents. The agent can read a document's content, make edits, and update text — turning voice into written productivity.
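The CLI-integration pattern amounts to translating each Workspace tool call into a command line. A sketch under stated assumptions — the `gws` CLI name and its subcommands are hypothetical placeholders for whatever Workspace CLI the project actually wires in:

```python
# Sketch of mapping Workspace tool calls onto CLI invocations.
# "gws" and every subcommand below are invented placeholders.

def workspace_command(tool, **args):
    """Translate a Workspace tool call into an argv list for subprocess.run."""
    table = {
        "gmail_search":  ["gws", "gmail", "search", args.get("query", "")],
        "calendar_list": ["gws", "calendar", "list", "--day", args.get("day", "today")],
        "drive_read":    ["gws", "drive", "cat", args.get("file_id", "")],
        "docs_edit":     ["gws", "docs", "update", args.get("doc_id", ""),
                          "--text", args.get("text", "")],
    }
    return table[tool]

# "What's on my schedule today?" would become:
#   workspace_command("calendar_list", day="today")
# executed with subprocess.run(..., capture_output=True, text=True).
```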
From voice command to completed action.
Natural language capture
The user speaks a natural language command. The Gemini Live API streams audio bidirectionally, transcribes speech in real time, and understands the user's intent. When a task is detected, the voice layer calls the brain's execute_task function to initiate the workflow.
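The voice layer exposes execute_task to the Live session through a function declaration. A sketch in the OpenAPI-style schema Gemini function calling uses — the description text and parameter shape here are illustrative guesses, not the project's actual declaration:

```python
# Hypothetical function declaration handed to the Gemini Live session,
# letting the model route detected tasks to the brain layer.

EXECUTE_TASK_DECL = {
    "name": "execute_task",
    "description": (
        "Hand a natural-language task to the brain layer, which plans "
        "and executes it with the available tools."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "task": {
                "type": "string",
                "description": "The user's request, e.g. "
                               "'open Safari and search for flights'",
            },
        },
        "required": ["task"],
    },
}
```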
Autonomous decision-making
Gemini 3 Flash Preview receives the task along with the current macOS accessibility tree — a structured snapshot of every UI element on screen. It autonomously decides which tools to call, in what order, and with what parameters. It loops until the task is complete, re-reading the UI state after each action.
Multi-domain tool dispatch
Each tool call is dispatched to the appropriate handler across 24 tools: clicking native UI elements via the Accessibility API, typing text via keyboard simulation, navigating and interacting with Chrome via CDP, managing Gmail, Calendar, Drive, and Docs through Google Workspace CLI, opening apps, or running shell commands. Parallel function calls allow multiple tools to execute simultaneously.
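The parallel dispatch step can be sketched with a thread pool: when the model returns several tool calls in one turn, they run concurrently and their results are collected in order. The handler names are illustrative, not the project's real tool set:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of running one turn's tool calls in parallel.

def dispatch_parallel(tool_calls, handlers):
    """Run each (name, args) tool call in its own thread; return
    results in the same order the model requested them."""
    with ThreadPoolExecutor(max_workers=max(len(tool_calls), 1)) as pool:
        futures = [pool.submit(handlers[name], **args) for name, args in tool_calls]
        return [f.result() for f in futures]
```

Threads fit here because most tool calls are I/O-bound — waiting on the Accessibility API, a CDP round-trip, or a CLI subprocess — so they overlap well even under the GIL.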
Spoken result & UI feedback
Once the brain completes the workflow, it returns a summary to the voice layer. Gemini Live speaks the result back to the user through the floating overlay UI, which also displays live transcription, action steps, and status updates throughout the entire process.
The full system.
AI & Voice
macOS & Browser
Platform & Integrations
Need a voice-controlled AI agent?
We build autonomous AI agents that bridge voice interaction with real system control. Let's talk about your use case.