AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Agentic AI
arXiv: 2510.21652v1
Authors

Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, Daniel S. Weld

Abstract

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

Paper Summary

Problem
The paper addresses the lack of rigorous benchmarking for AI agents in scientific research. Existing benchmark suites fail to provide holistic, product-informed measures of real-world use cases, lack the reproducible agent tools needed for controlled comparison, do not account for confounding variables such as model cost and tool access, lack standardized interfaces for quick agent prototyping and evaluation, and omit the comprehensive baseline agents needed to identify true advances.
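This summary does not reproduce AstaBench's actual API, so the following is only a minimal Python sketch of what a "standardized interface for quick agent prototyping and evaluation" could look like; every name here (Task, Result, Agent, solve, and the confounder fields) is an illustrative assumption, not the real interface.

```python
# Hypothetical sketch only -- these types are invented for illustration and are
# not the actual AstaBench interface.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    """One benchmark problem, e.g. a literature-review query or data-analysis request."""
    task_id: str
    prompt: str


@dataclass
class Result:
    """An agent's answer plus the metadata a controlled comparison needs."""
    answer: str
    model_cost_usd: float  # confounder: LLM spend behind the answer
    tool_calls: int        # confounder: how heavily the agent leaned on tools


class Agent(Protocol):
    """Any agent exposing solve() can be plugged into the same evaluation harness."""

    def solve(self, task: Task) -> Result: ...
```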
Key Innovation
The key innovation is AstaBench, a holistic benchmark suite for agentic scientific research that addresses these limitations. It comprises 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, many inspired by actual user requests to deployed Asta agents. The suite ships with a scientific research environment whose production-grade search tools enable controlled, reproducible evaluation and better accounting for confounders, along with nine science-optimized classes of Asta agents and numerous baselines.
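Building on the hypothetical interface sketched above, a controlled evaluation loop might look like the following; again, the function names and reported metrics are assumptions meant only to illustrate how quality can be reported alongside cost so that confounders stay visible, not the real AstaBench harness.

```python
# Hypothetical harness sketch reusing the invented Task/Result/Agent types above;
# not the actual AstaBench evaluation code.
from typing import Callable


def evaluate(agent: Agent, tasks: list[Task],
             score_fn: Callable[[Task, str], float]) -> dict:
    """Run one agent over a task set, reporting quality alongside total cost."""
    scores, costs = [], []
    for task in tasks:
        result = agent.solve(task)  # same interface for every agent under test
        scores.append(score_fn(task, result.answer))
        costs.append(result.model_cost_usd)
    return {
        "mean_score": sum(scores) / len(scores),
        # Cost is reported separately so that cheap and expensive agents are
        # not conflated on score alone.
        "total_cost_usd": sum(costs),
    }
```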
Practical Impact
AstaBench provides a standardized platform for evaluating AI agents on scientific research tasks, enabling researchers and developers to compare agents fairly under controlled conditions. This supports the development of more effective research-assistant agents and, ultimately, faster scientific discovery. It also pinpoints where current agents struggle, informing new techniques and tools to close those gaps: the authors' own evaluation of 57 agents across 22 agent classes finds that, despite meaningful progress on individual aspects, AI remains far from solving the challenge of science research assistance.
Analogy / Intuitive Explanation
Imagine you're trying to find the best chef in a city. You can't just taste one dish from each restaurant and declare the winner. You need to consider the variety of dishes, the cooking techniques, and the ingredients used. Similarly, evaluating AI agents in scientific research requires a comprehensive and holistic approach, considering various aspects of their performance, such as their ability to retrieve papers, analyze data, and propose new research directions. AstaBench provides this comprehensive evaluation framework, enabling researchers to compare and contrast AI agents in a fair and controlled manner.
Paper Information
Categories: cs.AI, cs.CL
arXiv ID: 2510.21652v1
