Google adds screen-control to Gemini 3.5 Flash for cross‑platform agents
Google has built “Computer Use” directly into Gemini 3.5 Flash, letting the model see and operate a screen. The capability is no longer a separate Gemini 2.5 add‑on, and it works across browsers, mobile devices and desktop environments. Combined with the model's existing tools (function calls, Search and Maps), it lets developers build agents for tasks such as software testing and office automation without stitching together multiple tools.
On the OSWorld benchmark, Gemini 3.5 Flash scores 78.4, ahead of Gemini 3 Flash (65.1), GPT‑5.4 mini (72.1) and Gemini 3.1 Pro (76.2). Sonnet 4.6 matches it at 78.4, GPT‑5.5 sits just ahead at 78.7, and Anthropic's Opus 4.8 leads at 83.4.
To guard against prompt injection attacks, Google relies on adversarial training and two optional enterprise safeguards. One requires user confirmation for sensitive or irreversible actions; the other automatically stops a task when it detects an indirect prompt injection. Google also recommends sandboxing, human oversight and strict access controls, with more detail in its best practices documentation.
The feature is available through the Gemini API and the Gemini Enterprise Agent Platform, and a Browserbase demo and a GitHub reference implementation have already been posted.
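The two optional safeguards described above can be pictured as checks wrapped around each step of an agent's run. The following is a minimal sketch under stated assumptions, not Google's implementation: `Action`, `SENSITIVE_KINDS`, `TaskAborted`, `run_guarded` and the `confirm` and `looks_injected` callbacks are hypothetical names invented for illustration, and no Gemini API call is made.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical set of action kinds treated as sensitive or irreversible.
SENSITIVE_KINDS = {"submit_payment", "delete_file", "send_email"}


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type", "delete_file"
    target: str  # UI element or resource the action touches


class TaskAborted(Exception):
    """Raised when the injection check flags observed screen content."""


def run_guarded(
    steps: Iterable[tuple[str, Action]],
    confirm: Callable[[Action], bool],
    looks_injected: Callable[[str], bool],
) -> list[Action]:
    """Apply both safeguards to each (screen_text, proposed_action) pair.

    Returns the actions that were allowed to proceed.
    """
    executed: list[Action] = []
    for screen_text, action in steps:
        # Safeguard 2: stop the whole task on a suspected indirect injection.
        if looks_injected(screen_text):
            raise TaskAborted(f"suspected injection before {action.kind}")
        # Safeguard 1: sensitive actions need explicit user confirmation.
        if action.kind in SENSITIVE_KINDS and not confirm(action):
            continue  # user declined, so this action is skipped
        executed.append(action)  # a real agent would perform the action here
    return executed


if __name__ == "__main__":
    demo_steps = [
        ("Inbox: 3 unread", Action("click", "compose")),
        ("Draft ready", Action("send_email", "draft-1")),
        ("Settings", Action("click", "close")),
    ]
    allowed = run_guarded(
        demo_steps,
        confirm=lambda a: False,  # stand-in for a user who declines
        looks_injected=lambda t: "ignore previous instructions" in t.lower(),
    )
    print([a.kind for a in allowed])  # prints ['click', 'click']
```

In this sketch a declined confirmation skips only the one action, while a suspected injection ends the entire task. That mirrors the distinction the article draws between the two safeguards, although how a production system detects an injection is outside the scope of the example.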
Why this matters
We now see Google folding “Computer Use” into Gemini 3.5 Flash, letting the model watch and manipulate a screen without a separate service. That integration could simplify building cross‑platform agents, but it also raises questions about control and security. Can we trust a model to click without oversight?
Developers can combine the new capability with existing function calls, Search and Maps to build tools that run in browsers and on phones and desktops for software testing or office automation. On the OSWorld benchmark, Gemini 3.5 Flash scores 78.4, a 13.3‑point jump from Gemini 3 Flash’s 65.1, which suggests the integrated vision‑action loop helps, although the two models differ in other ways as well. The score alone also tells us little about real‑world reliability; the gap between benchmark success and production robustness is often wide.
Founders may be tempted to embed these agents into products quickly, but they should first verify that the model’s screen‑level actions respect user intent and privacy. Researchers will likely probe how well the model generalizes beyond the test set, and whether the integrated approach scales without new failure modes. It remains unclear whether the convenience outweighs the need for tighter oversight.