AI agents are reaching a new level. With UI-TARS-1.5, ByteDance has developed a multimodal agent that can understand and operate graphical user interfaces, with significantly better results than previous models such as Claude and GPT-4.
Built on a vision-language model, the agent can carry out complex tasks across different GUI environments. On the ScreenSpotPro benchmark, UI-TARS-1.5 reaches an accuracy of 61.6% and clearly outperforms the competition: Claude-3 achieves only 27.7%, while GPT-4o reaches 41.2%.
Technical innovations as drivers of success
The architecture of UI-TARS-1.5 is based on the Qwen2.5-VL-7B model, further optimized with 1.5 billion GUI-specific training data points. The visual encoder processes screenshots at a resolution of 1120×1120 pixels and enables precise localization of UI elements with a coordinate error of less than 5 pixels.
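As a rough illustration, mapping between the model's 1120×1120 input space and the original screen only requires a scaling step. The sketch below is a minimal Python/PIL example under that assumption; the function names and the example prediction are hypothetical and not part of UI-TARS-1.5's published code.

```python
from PIL import Image

MODEL_INPUT_SIZE = 1120  # input resolution stated in the article

def preprocess_screenshot(path: str) -> tuple[Image.Image, float, float]:
    """Resize a screenshot to the model's 1120x1120 input and keep the scale factors."""
    img = Image.open(path).convert("RGB")
    sx = MODEL_INPUT_SIZE / img.width
    sy = MODEL_INPUT_SIZE / img.height
    return img.resize((MODEL_INPUT_SIZE, MODEL_INPUT_SIZE)), sx, sy

def to_screen_coords(pred_x: float, pred_y: float, sx: float, sy: float) -> tuple[int, int]:
    """Map a coordinate predicted in model space back to original screen pixels."""
    return round(pred_x / sx), round(pred_y / sy)

# Hypothetical example: a prediction of (560, 300) in model space on a 1920x1080
# screenshot gives sx = 1120/1920 and sy = 1120/1080, i.e. screen coords ~ (960, 289).
```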
Particularly noteworthy is the “think-before-act” approach, which enables complex, multi-step actions: the model first reasons about the current screen and only then emits an action. In Minecraft navigation tasks, this mechanism reduces erroneous block placements by 38% compared to direct action prediction. The performance also shows in gaming environments: in 14 Poki.com mini-games, UI-TARS-1.5 achieves perfect scores 2.4 times faster than human players.
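A minimal sketch of such a think-before-act loop might look as follows. The `model.plan`, `capture_screenshot`, and `execute` interfaces, as well as the `finished()` action string, are placeholders assumed for illustration, not ByteDance's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # free-form reasoning emitted before acting
    action: str    # structured GUI action, e.g. "click(x=960, y=289)"

def run_episode(model, task: str, capture_screenshot, execute, max_steps: int = 25):
    """Minimal think-before-act loop: reason about the current screen, then act."""
    history: list[Step] = []
    for _ in range(max_steps):
        screen = capture_screenshot()
        # The model is prompted to emit a reasoning trace followed by a single action.
        thought, action = model.plan(task=task, screenshot=screen, history=history)
        history.append(Step(thought, action))
        if action == "finished()":
            break
        execute(action)  # forward the action to the GUI or game environment
    return history
```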
Summary
- UI-TARS-1.5 is an open-source AI agent from ByteDance that achieves new best scores in seven GUI benchmarks
- The model was trained with reinforcement learning on 450,000 human-annotated interaction trajectories (see the sketch after this list)
- In enterprise applications, UI-TARS-1.5 reduces RPA scripting time by 68%
- Computational costs are 43% lower than GPT-4V ($0.12/1,000 actions vs. $0.21/1,000 actions)
- ByteDance’s roadmap includes multimodal memory modules and ROS-based robotics control
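As referenced above, the sketch below shows, in heavily simplified form, how annotated interaction trajectories could feed a policy-gradient update. The `policy.log_prob` interface and the reward handling are assumptions for illustration and do not reflect ByteDance's published training recipe.

```python
import torch

def reinforce_update(policy, optimizer, trajectories):
    """Simplified policy-gradient (REINFORCE-style) update over GUI interaction trajectories.

    Each trajectory is a list of (observation, action, reward) tuples; the reward stands
    in for whatever signal the human annotations provide.
    """
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)             # undiscounted return of the trajectory
        for obs, action, _ in traj:
            log_prob = policy.log_prob(obs, action)  # hypothetical policy interface
            loss = loss - log_prob * ret             # maximize return-weighted log-likelihood
    (loss / max(len(trajectories), 1)).backward()
    optimizer.step()
```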
Source: Seed Tars