UI-TARS-1.5: ByteDance’s AI agent outperforms GPT-4 and Claude in GUI tests

AI agents are reaching a new level. With UI-TARS-1.5, ByteDance has developed a multimodal agent that can understand and operate graphical user interfaces, with significantly better benchmark results than earlier models such as Claude and GPT-4.

Built on a vision-language model, the agent can perform complex tasks in a range of GUI environments. With an accuracy of 61.6% on the ScreenSpot-Pro benchmark, UI-TARS-1.5 clearly outperforms the competition: Claude-3 achieves only 27.7%, while GPT-4o reaches 41.2%.

Technical innovations as drivers of success

The architecture of UI-TARS-1.5 is based on the Qwen2.5-VL-7B model, optimized with a GUI-specific training corpus of 1.5 billion data points. The visual encoder processes screenshots at a resolution of 1120×1120 and enables precise localization of UI elements, with a coordinate error of less than 5 pixels.
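To make the resolution figures concrete, here is a minimal sketch of how a point predicted in a 1120×1120 input space could be mapped back to a real screen. It assumes the screenshot is simply stretched to the square input (no letterboxing) and that the 5-pixel figure refers to the model's input space; neither detail is stated in the source, and the function names are hypothetical.

```python
def model_to_screen(x, y, screen_w, screen_h, model_size=1120):
    """Map a point from the square model input space to screen pixels.

    Assumption: the screenshot was stretched to model_size x model_size.
    """
    return (x * screen_w / model_size, y * screen_h / model_size)


def error_on_screen(err_px, screen_w, screen_h, model_size=1120):
    """Scale a per-axis error from model space to screen space."""
    return (err_px * screen_w / model_size, err_px * screen_h / model_size)


if __name__ == "__main__":
    # Centre of the model input maps to the centre of a 1920x1080 screen.
    print(model_to_screen(560, 560, 1920, 1080))  # (960.0, 540.0)
    # A 5 px error in model space, expressed in screen pixels per axis.
    print(error_on_screen(5, 1920, 1080))
```

Under these assumptions, a 5-pixel error in model space grows to roughly 8.6 pixels horizontally on a 1920×1080 display, which is still well within the size of a typical button or icon.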

Particularly noteworthy is the “think-before-act” approach, which enables complex, multi-step actions. In Minecraft navigation tasks, this mechanism reduces erroneous block placements by 38% compared to direct action predictions. The performance is also evident in gaming environments: In 14 Poki.com mini-games, UI-TARS-1.5 achieves perfect scores 2.4 times faster than human players.
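The "think-before-act" idea can be illustrated with a small agent loop in which every model output carries a reasoning step before the action that gets executed. This is only a sketch: the `Thought:`/`Action:` text format, the `finished()` stop signal, and the `policy`, `observe`, and `execute` callables are hypothetical placeholders, not the interface of UI-TARS-1.5 itself.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    thought: str
    action: str


# Hypothetical output format: "Thought: <reasoning>\nAction: <command>"
_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)", re.DOTALL
)


def parse_step(model_output: str) -> Step:
    """Split one model output into its reasoning and its action."""
    match = _PATTERN.search(model_output)
    if match is None:
        raise ValueError("output lacks a Thought/Action pair")
    return Step(match.group("thought").strip(), match.group("action").strip())


def run_episode(policy, observe, execute, max_steps=10):
    """Loop: observe the screen, let the model think, then act."""
    history = []
    for _ in range(max_steps):
        step = parse_step(policy(observe(), history))
        history.append(step)
        if step.action == "finished()":  # hypothetical stop signal
            break
        execute(step.action)
    return history
```

Keeping the thought in the history lets later steps condition on earlier reasoning, which is the property that distinguishes this loop from direct action prediction.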

Summary

  • UI-TARS-1.5 is an open-source AI agent from ByteDance that achieves new best scores in seven GUI benchmarks
  • The model uses reinforcement learning with 450,000 human-annotated interaction paths
  • In enterprise applications, UI-TARS-1.5 reduces RPA scripting time by 68%
  • Computational costs are 43% lower than GPT-4V ($0.12/1,000 actions vs. $0.21/1,000 actions)
  • ByteDance’s roadmap includes multimodal memory modules and ROS-based robotics control
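The 43% cost figure in the summary can be checked from the two per-1,000-action prices quoted there. The short calculation below uses only those numbers; the function name is illustrative.

```python
def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction by which candidate is cheaper than baseline."""
    return (baseline - candidate) / baseline


# Prices per 1,000 actions, as quoted in the summary.
gpt4v_cost = 0.21
ui_tars_cost = 0.12

saving = relative_saving(gpt4v_cost, ui_tars_cost)
print(f"{saving:.1%}")  # 42.9%
```

The result, about 42.9%, rounds to the 43% stated in the summary.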

Source: Seed Tars
