A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
Abstract
An automated benchmark generation method creates challenging tasks with broader tool-use coverage by evolving tool sequences through adaptive contrastive n-gram modeling and iterative difficulty refinement.
As agent capabilities advance, existing benchmarks, such as τ^2-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ^2-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ^2-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
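To make the idea of a contrastive n-gram model over tool sequences more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the model order, the scoring rule, or how the model adapts, so this toy version assumes a bigram model that scores each transition by the log-ratio of its smoothed probability among sequences judged valid versus invalid. The class name, the tool names, and the hard-coded validity labels (standing in for an LLM judge) are all hypothetical.

```python
import math
import random
from collections import defaultdict

START, END = "<s>", "</s>"


class ContrastiveBigram:
    """Toy contrastive bigram model over tool names (illustrative assumption)."""

    def __init__(self, tools, smoothing=1.0):
        self.vocab = list(tools) + [END]
        self.smoothing = smoothing
        # Separate transition counts for sequences judged valid and invalid.
        self.valid = defaultdict(lambda: defaultdict(float))
        self.invalid = defaultdict(lambda: defaultdict(float))

    def update(self, sequence, is_valid):
        """Add one judged sequence to the matching count table."""
        counts = self.valid if is_valid else self.invalid
        padded = [START] + list(sequence) + [END]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1.0

    def _log_prob(self, counts, prev, nxt):
        # Add-k smoothed conditional log-probability of nxt given prev.
        total = sum(counts[prev].values()) + self.smoothing * len(self.vocab)
        return math.log((counts[prev][nxt] + self.smoothing) / total)

    def score(self, prev, nxt):
        """Log-ratio: positive if the transition is more typical of valid sequences."""
        return self._log_prob(self.valid, prev, nxt) - self._log_prob(self.invalid, prev, nxt)

    def sample(self, rng, max_len=6):
        """Sample a tool sequence, weighting each step by exp(contrastive score)."""
        seq, prev = [], START
        while len(seq) < max_len:
            weights = [math.exp(self.score(prev, t)) for t in self.vocab]
            nxt = rng.choices(self.vocab, weights=weights, k=1)[0]
            if nxt == END:
                break
            seq.append(nxt)
            prev = nxt
        return seq


# Hypothetical tools and hand-written labels in place of LLM-judged validity.
TOOLS = ["find_user", "get_order", "cancel_order"]
model = ContrastiveBigram(TOOLS)
model.update(["find_user", "get_order", "cancel_order"], is_valid=True)
model.update(["cancel_order", "find_user"], is_valid=False)
sampled = model.sample(random.Random(0))
```

In an adaptive loop, newly sampled sequences would presumably be judged and fed back through `update`, so the model shifts toward transitions that keep being judged valid; that loop is omitted here because the abstract does not describe it.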
Community
TASTE is a new way to automatically create diverse, harder, and verified benchmarks for tool-using AI agents.
Instead of writing tasks first, we start from the tool sequences agents need to execute, then synthesize realistic tasks around them.
The result: models that look strong on existing benchmarks face a much tougher and broader test.
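One way to make "broader" concrete is to count the distinct tool combinations a task set requires. The sketch below is only an illustration: it assumes a "tool combination" means the unordered set of tools in a task's reference sequence, which may differ from the definition used in the paper, and the task lists and tool names are made up.

```python
def tool_combinations(task_sequences):
    """Return the distinct unordered tool sets across a list of tool sequences."""
    return {frozenset(seq) for seq in task_sequences}


# Made-up reference sequences for two hypothetical task sets.
baseline_tasks = [
    ["find_user", "get_order"],
    ["get_order", "find_user"],  # same combination, different order
    ["find_user"],
]
generated_tasks = baseline_tasks + [
    ["find_user", "get_order", "cancel_order"],
    ["find_user", "update_address"],
    ["get_order", "cancel_order"],
]

baseline_coverage = len(tool_combinations(baseline_tasks))
generated_coverage = len(tool_combinations(generated_tasks))
```

With these made-up lists, the baseline set covers 2 combinations and the extended set covers 5. Counting ordered sequences or n-grams instead of unordered sets would give a stricter notion of coverage.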