AI agents are reaching a new level. With UI-TARS-1.5, ByteDance has developed a multimodal agent that can understand and operate graphical user interfaces, with significantly better results than previous models such as Claude and GPT-4.
Built on a vision-language model, the agent can carry out complex tasks across different GUI environments. On the ScreenSpotPro benchmark, UI-TARS-1.5 reaches an accuracy of 61.6% and clearly outperforms the competition: Claude-3 achieves only 27.7%, while GPT-4o reaches 41.2%.
Technical innovations as drivers of success
The architecture of UI-TARS-1.5 is based on the Qwen2.5-VL-7B model, further optimized with 1.5 billion GUI-specific training data points. The visual encoder processes screenshots at a resolution of 1120×1120 pixels and enables precise localization of UI elements with a coordinate error of less than 5 pixels.
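As a rough illustration, mapping between the model's 1120×1120 input space and the original screen only requires a scaling step. The sketch below is a minimal Python/PIL example under that assumption; the function names and the example prediction are hypothetical and not part of UI-TARS-1.5's published code.

```python
from PIL import Image

MODEL_INPUT_SIZE = 1120  # input resolution stated in the article

def preprocess_screenshot(path: str) -> tuple[Image.Image, float, float]:
    """Resize a screenshot to the model's 1120x1120 input and keep the scale factors."""
    img = Image.open(path).convert("RGB")
    sx = MODEL_INPUT_SIZE / img.width
    sy = MODEL_INPUT_SIZE / img.height
    return img.resize((MODEL_INPUT_SIZE, MODEL_INPUT_SIZE)), sx, sy

def to_screen_coords(pred_x: float, pred_y: float, sx: float, sy: float) -> tuple[int, int]:
    """Map a coordinate predicted in model space back to original screen pixels."""
    return round(pred_x / sx), round(pred_y / sy)

# Hypothetical example: a prediction of (560, 300) in model space on a 1920x1080
# screenshot gives sx = 1120/1920 and sy = 1120/1080, i.e. screen coords ~ (960, 289).
```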
Particularly noteworthy is the “think-before-act” approach, which enables complex, multi-step actions: the model first reasons about the current screen and only then emits an action. In Minecraft navigation tasks, this mechanism reduces erroneous block placements by 38% compared to direct action prediction. The performance also shows in gaming environments: in 14 Poki.com mini-games, UI-TARS-1.5 achieves perfect scores 2.4 times faster than human players.
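A minimal sketch of such a think-before-act loop might look as follows. The `model.plan`, `capture_screenshot`, and `execute` interfaces, as well as the `finished()` action string, are placeholders assumed for illustration, not ByteDance's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # free-form reasoning emitted before acting
    action: str    # structured GUI action, e.g. "click(x=960, y=289)"

def run_episode(model, task: str, capture_screenshot, execute, max_steps: int = 25):
    """Minimal think-before-act loop: reason about the current screen, then act."""
    history: list[Step] = []
    for _ in range(max_steps):
        screen = capture_screenshot()
        # The model is prompted to emit a reasoning trace followed by a single action.
        thought, action = model.plan(task=task, screenshot=screen, history=history)
        history.append(Step(thought, action))
        if action == "finished()":
            break
        execute(action)  # forward the action to the GUI or game environment
    return history
```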
Summary
- UI-TARS-1.5 is an open-source AI agent from ByteDance that achieves new best scores in seven GUI benchmarks
- The model was trained with reinforcement learning on 450,000 human-annotated interaction trajectories (see the sketch after this list)
- In enterprise applications, UI-TARS-1.5 reduces RPA scripting time by 68%
- Computational costs are 43% lower than GPT-4V ($0.12/1,000 actions vs. $0.21/1,000 actions)
- ByteDance’s roadmap includes multimodal memory modules and ROS-based robotics control
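As referenced above, the sketch below shows, in heavily simplified form, how annotated interaction trajectories could feed a policy-gradient update. The `policy.log_prob` interface and the reward handling are assumptions for illustration and do not reflect ByteDance's published training recipe.

```python
import torch

def reinforce_update(policy, optimizer, trajectories):
    """Simplified policy-gradient (REINFORCE-style) update over GUI interaction trajectories.

    Each trajectory is a list of (observation, action, reward) tuples; the reward stands
    in for whatever signal the human annotations provide.
    """
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)             # undiscounted return of the trajectory
        for obs, action, _ in traj:
            log_prob = policy.log_prob(obs, action)  # hypothetical policy interface
            loss = loss - log_prob * ret             # maximize return-weighted log-likelihood
    (loss / max(len(trajectories), 1)).backward()
    optimizer.step()
```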
Source: Seed Tars