AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

Generative AI & LLMs
Published: arXiv:2601.16964v1
Authors

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Abstract

The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive
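To make the factorized scenario space concrete, the sketch below shows what a single scenario record spanning the seven orthogonal axes might look like. The field names and values are illustrative assumptions, not the released AgentDrive schema.

```python
# Hypothetical scenario record factorized along the seven axes named in the
# abstract; field names and values are assumptions, not the released schema.
scenario = {
    "scenario_type": "unprotected_left_turn",
    "driver_behavior": "aggressive",
    "environment": "rain_night",
    "road_layout": "four_way_intersection",
    "objective": "reach_goal_without_collision",
    "difficulty": "hard",
    "traffic_density": "high",
}

REQUIRED_AXES = {
    "scenario_type", "driver_behavior", "environment",
    "road_layout", "objective", "difficulty", "traffic_density",
}

def check_axes(spec: dict) -> bool:
    """Minimal structural check: every axis present, no extras."""
    return set(spec) == REQUIRED_AXES

assert check_axes(scenario)
```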

Paper Summary

Problem
The rapid advancement of Large Language Models (LLMs) has sparked interest in integrating them into autonomous systems, enabling reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks.
Key Innovation
This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. The dataset is built around a factorized scenario space across seven orthogonal axes and employs an LLM-driven prompt-to-JSON pipeline to produce semantically rich, simulation-ready specifications.
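The abstract states that generated specifications are validated against physical and schema constraints before simulation. A minimal sketch of such a schema-validation step is shown below, assuming a JSON Schema stand-in; the schema, field names, and rejection policy are placeholders, not the paper's released pipeline code.

```python
# Sketch of a schema-validation step for an LLM prompt-to-JSON pipeline.
# The schema below is a simplified stand-in, not the AgentDrive schema.
import json
from jsonschema import validate, ValidationError

SCENARIO_SCHEMA = {
    "type": "object",
    "properties": {
        "scenario_type": {"type": "string"},
        "driver_behavior": {"type": "string"},
        "environment": {"type": "string"},
        "road_layout": {"type": "string"},
        "objective": {"type": "string"},
        "difficulty": {"enum": ["easy", "medium", "hard"]},  # assumed levels
        "traffic_density": {"type": "string"},
    },
    "required": [
        "scenario_type", "driver_behavior", "environment", "road_layout",
        "objective", "difficulty", "traffic_density",
    ],
}

def parse_and_validate(llm_output: str) -> dict | None:
    """Parse the LLM's JSON output and keep it only if it matches the schema."""
    try:
        spec = json.loads(llm_output)
        validate(instance=spec, schema=SCENARIO_SCHEMA)
        return spec
    except (json.JSONDecodeError, ValidationError):
        return None  # rejected specs would be regenerated or discarded
```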
Practical Impact
AgentDrive has the potential to advance the development of autonomous driving systems by providing a comprehensive benchmark for evaluating and training agentic AI models. The dataset and its accompanying evaluation framework can be used to improve the reasoning capabilities of LLM-based agents, enabling them to make safer and more informed decisions in complex and dynamic environments.
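As one example of how the evaluation side might be used, the sketch below scores a model on multiple-choice items in the style of AgentDrive-MCQ. The item format and the ask_model callable are hypothetical placeholders, not the released evaluation code.

```python
# Hypothetical scoring loop for AgentDrive-MCQ-style items; item layout and
# ask_model() are placeholders, not the paper's released evaluation code.
def score_mcq(items: list[dict], ask_model) -> float:
    """Return the fraction of multiple-choice questions answered correctly."""
    correct = 0
    for item in items:
        # item is assumed to carry a question, lettered options, and the key
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["options"].items()
        )
        answer = ask_model(prompt).strip().upper()[:1]  # e.g. "A"
        correct += answer == item["answer"]
    return correct / len(items)
```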
Analogy / Intuitive Explanation
Imagine you're teaching a child to drive a car. You want to expose them to various scenarios, such as driving on different roads, encountering different weather conditions, and interacting with other drivers. AgentDrive is like a vast library of driving scenarios, each carefully designed to test the reasoning capabilities of LLM-based agents. By exposing these agents to a wide range of scenarios, we can better understand their strengths and weaknesses, and ultimately develop more reliable and safe autonomous driving systems.
Paper Information
Categories: cs.AI
arXiv ID: 2601.16964v1
