Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

Generative AI & LLMs
arXiv: 2508.13144v1
Authors

David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge

Abstract

Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
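To make the two metrics concrete, here is a minimal, hypothetical Python sketch of how signal, noise, and their ratio could be estimated from benchmark scores. The paper's exact estimators (for example, how the spread across models and the checkpoint-to-checkpoint variability are normalized) may differ; the function names and toy numbers below are illustrative assumptions only.

```python
import numpy as np

def signal(model_scores: np.ndarray) -> float:
    """Spread of a benchmark's scores across different models.

    Taken here as the gap between the best and worst model, normalized by
    the mean score -- one plausible reading, not necessarily the paper's
    exact definition.
    """
    return (model_scores.max() - model_scores.min()) / model_scores.mean()

def noise(checkpoint_scores: np.ndarray) -> float:
    """Sensitivity to training variability: standard deviation of the score
    over a single model's final checkpoints, normalized by the mean score."""
    return checkpoint_scores.std() / checkpoint_scores.mean()

def signal_to_noise_ratio(model_scores, checkpoint_scores) -> float:
    return signal(np.asarray(model_scores)) / noise(np.asarray(checkpoint_scores))

# Toy example: accuracies of five different models, and one model's accuracy
# over its last six training checkpoints.
models = [0.42, 0.47, 0.55, 0.61, 0.68]
checkpoints = [0.545, 0.552, 0.548, 0.556, 0.551, 0.549]
print(f"SNR = {signal_to_noise_ratio(models, checkpoints):.1f}")
```

In this reading, a benchmark is more trustworthy for small-scale decisions when the spread between models is large relative to the jitter a single model shows across its final checkpoints.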

Paper Summary

Problem
Developing large language models requires making design decisions from small-scale experiments, and such decisions can be unreliable: many current benchmarks are not built with the uncertainty of small-scale evaluation in mind, so an apparent improvement can be indistinguishable from random variability between training steps.
Key Innovation
This paper introduces a framework for reducing uncertainty in language model evaluation by measuring two properties of a benchmark: signal, its ability to separate better models from worse ones, and noise, its sensitivity to random variability between training steps. The authors propose three interventions that directly target signal or noise: switching to a metric with a better signal-to-noise ratio (e.g., perplexity rather than accuracy), filtering out noisy subtasks before aggregating, and averaging model output over intermediate checkpoints. Two of these are sketched in code below.
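Two of the interventions lend themselves to short sketches: dropping subtasks whose signal-to-noise ratio is low before aggregating, and averaging a score over a run's last few checkpoints. The threshold, subtask names, and numbers below are hypothetical illustrations of the idea, not the paper's implementation.

```python
import numpy as np

def filter_noisy_subtasks(subtask_snr: dict[str, float], threshold: float = 2.0) -> list[str]:
    """Keep only subtasks whose signal-to-noise ratio clears a threshold,
    so the aggregate benchmark score is dominated by informative subtasks.
    The threshold value is illustrative, not taken from the paper."""
    return [name for name, snr in subtask_snr.items() if snr >= threshold]

def checkpoint_averaged_score(per_checkpoint_scores: list[float], last_k: int = 5) -> float:
    """Average a benchmark score over the last few intermediate checkpoints
    of a training run to smooth out step-to-step noise."""
    return float(np.mean(per_checkpoint_scores[-last_k:]))

# Toy usage with made-up subtask SNRs and checkpoint scores.
snrs = {"subtask_a": 4.1, "subtask_b": 0.7, "subtask_c": 2.9}
print(filter_noisy_subtasks(snrs))                              # ['subtask_a', 'subtask_c']
print(checkpoint_averaged_score([0.51, 0.53, 0.52, 0.54, 0.53, 0.55]))
```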
Practical Impact
Practically, this research gives developers concrete criteria for creating new benchmarks or selecting existing ones: aim for high signal and low noise. Benchmarks chosen this way support more reliable decisions from small-scale experiments and yield lower scaling law prediction error, which matters as increasingly general-purpose language models must be evaluated across many diverse benchmarks.
Analogy / Intuitive Explanation
Imagine trying to predict the weather by looking at a small sample of temperature readings from different locations. If these readings are noisy (i.e., affected by random variability), you won't be able to make an accurate prediction about the overall weather pattern. Similarly, current language model evaluation benchmarks are often noisy and unreliable, which can lead to inaccurate predictions about large model behavior. By improving the signal-to-noise ratio of benchmarks, we can create more reliable and accurate evaluations that will ultimately lead to better language models.
Paper Information
Categories:
cs.CL cs.LG
Published Date:

arXiv ID: 2508.13144v1
