OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Generative AI & LLMs
arXiv: 2508.13141v1
Authors

Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha

Abstract

Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks. Using novel thinking-adjusted accuracy metrics, we perform extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.

Paper Summary

Problem
Large Language Models (LLMs) exhibit two complementary failure modes. Non-thinking models underthink: they are fast and cheap but fall short on hard reasoning problems that require step-by-step thinking. Thinking models overthink: they spend hundreds of extra tokens on simple queries without improving performance. This has led to separate thinking and non-thinking variants of the same models, leaving users to decide which variant to use for each query.
Key Innovation
OptimalThinkingBench is a unified benchmark that jointly evaluates overthinking and underthinking, so that a single score tracks progress toward optimally-thinking LLMs that balance performance and efficiency. It consists of two sub-benchmarks: OverthinkingBench, with simple queries across 72 domains, and UnderthinkingBench, with 11 challenging reasoning tasks. Thinking-adjusted accuracy metrics reward models that answer simple queries efficiently while still reasoning carefully through hard ones.
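The paper's exact thinking-adjusted accuracy formulas are not reproduced in this summary. As a rough illustration only, the sketch below shows one way such metrics could be shaped: on simple queries, a correct answer earns full credit only within a small thinking-token budget, with credit decaying as extra tokens are spent, while hard reasoning tasks use plain accuracy. All names, the budget value, and the decay form here are assumptions, not the authors' definitions.

```python
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool          # did the model answer correctly?
    thinking_tokens: int   # tokens spent in the "thinking" phase

def overthinking_score(results: list[Result], budget: int = 100) -> float:
    """Hypothetical thinking-adjusted accuracy for simple queries:
    full credit for a correct answer within the token budget,
    decaying credit beyond it. Illustrative only, not the paper's metric."""
    if not results:
        return 0.0
    total = 0.0
    for r in results:
        if not r.correct:
            continue
        excess = max(0, r.thinking_tokens - budget)
        total += budget / (budget + excess)  # 1.0 within budget, shrinking as excess grows
    return total / len(results)

def underthinking_score(results: list[Result]) -> float:
    """On hard reasoning tasks, thinking is worth its cost, so use plain accuracy."""
    return sum(r.correct for r in results) / len(results) if results else 0.0
```

Under a scheme like this, a model that overthinks a trivial query for hundreds of tokens gets little more credit than one that answers it wrong, while skipping reasoning on a hard task simply costs accuracy, which is the trade-off the benchmark is designed to expose.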
Practical Impact
An optimally-thinking model would improve the user experience by answering simple queries quickly and cheaply while spending more compute only on queries that need it, eliminating the need for users to choose between thinking and non-thinking variants. OptimalThinkingBench is designed to track progress toward such models; the paper's evaluation of 33 models finds that none yet achieves this balance.
Analogy / Intuitive Explanation
Imagine working through a set of math problems. A hard problem deserves careful step-by-step work, but re-deriving a trivial sum over and over wastes time without getting you any closer to the answer. OptimalThinkingBench measures whether a model knows when to slow down and think deeply, and when to answer quickly.
Paper Information
Categories: cs.CL, cs.LG
arXiv ID: 2508.13141v1
