In the coming years, agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they're too error prone to be much use.
A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks like using apps and manipulating files, and suggests that turning to different models in different situations may help agents advance.
“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It's a different type of problem.”
In Simular's approach, a powerful general-purpose AI model, like OpenAI's GPT-4o or Anthropic's Claude 3.7, is used to reason about how best to complete the task at hand, while smaller open source models step in for tasks like interpreting web pages.
Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but aren't as good at recognizing the elements of a graphical user interface.
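The division of labor described here, a generalist that plans and a specialist that finds things on screen, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `Step`, `stub_planner`, `stub_grounder`, and `run_task` are hypothetical, the "models" are plain Python stand-ins, and none of it reflects Simular's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Coords = Tuple[int, int]
Screen = Dict[str, Coords]  # hypothetical: UI element label -> screen position


@dataclass
class Step:
    description: str  # e.g. "click Search"


def stub_planner(task: str) -> List[Step]:
    """Stand-in for a general-purpose model that splits a task into steps."""
    return [Step(f"click {name}") for name in task.split(" then ")]


def stub_grounder(step: Step, screen: Screen) -> Coords:
    """Stand-in for a smaller specialist model that locates a UI element."""
    target = step.description[len("click "):]
    return screen[target]


def run_task(
    task: str,
    screen: Screen,
    planner: Callable[[str], List[Step]] = stub_planner,
    grounder: Callable[[Step, Screen], Coords] = stub_grounder,
) -> List[Tuple[str, Coords]]:
    """Plan with one component, then resolve each step with another."""
    return [(step.description, grounder(step, screen)) for step in planner(task)]


if __name__ == "__main__":
    screen = {"Search": (10, 20), "Submit": (30, 40)}
    for description, coords in run_task("Search then Submit", screen):
        print(description, "at", coords)
```

The point of the sketch is only the separation: the planner never sees pixel positions, and the grounder never decides what to do next, so either part could be swapped for a different model.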
S2 is designed to learn from experience with an external memory module that records actions and user feedback and uses those recordings to improve future actions.
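As a rough illustration of what an external memory of this kind could look like, here is a minimal sketch. It is not S2's implementation: the `Episode` and `ExperienceMemory` names, the +1/-1 feedback convention, and the word-overlap matching are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Episode:
    task: str
    actions: List[str]
    feedback: int  # assumed convention: +1 helpful, -1 unhelpful


@dataclass
class ExperienceMemory:
    episodes: List[Episode] = field(default_factory=list)

    def record(self, task: str, actions: List[str], feedback: int) -> None:
        """Store what was done for a task and how the user rated it."""
        self.episodes.append(Episode(task, list(actions), feedback))

    def recall(self, task: str) -> Optional[Episode]:
        """Return the positively rated episode sharing the most words with the task."""
        words = set(task.lower().split())
        best, best_score = None, 0
        for episode in self.episodes:
            if episode.feedback <= 0:
                continue  # skip runs the user did not find helpful
            overlap = len(words & set(episode.task.lower().split()))
            if overlap > best_score:
                best, best_score = episode, overlap
        return best


if __name__ == "__main__":
    memory = ExperienceMemory()
    memory.record("book a flight to Paris", ["open browser", "search flights"], 1)
    memory.record("book a flight to Rome", ["open email"], -1)
    hit = memory.recall("book flight to Rome")
    print(hit.actions if hit else "no relevant experience")
```

A real system would presumably use a learned similarity measure rather than word overlap; the sketch only shows the record-then-reuse loop.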
On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent's ability to use a computer operating system.
For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI's Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent.
Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future big AI models may incorporate training data that helps the models themselves make sense of graphical user interfaces.
“This will help agents navigate GUIs with much higher precision,” Zhong says. “I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.”
To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it fared better than some of the open source agents I tried last year, including AutoGen and vimGPT.
But even the smartest AI agents are, it seems, still troubled by edge cases and occasionally exhibit odd behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop, hopping between the project page and the login for OSWorld.
OSWorld's benchmarks show why agents remain more hype than reality for now. While humans can complete 72 percent of OSWorld tasks, agents are foiled 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks.