Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

arXiv: 2512.24503v1
Authors

Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal

Abstract

Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of "fair" comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a standard step. Consequently, we posit that the objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning. To mitigate the high cost of hyperparameter tuning, we introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training. We show that this approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Theoretically, we prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation, demonstrating dramatic improvements in the reliability of small-scale experiments.

Paper Summary

Problem
This paper asks whether small-scale proxy model experiments can reliably guide data curation decisions for full-scale training runs. It identifies a subtle flaw in the standard protocol for data recipe assessment: in the name of a "fair" comparison, every recipe is evaluated with the same small-scale training configuration, yet the optimal configuration is inherently data-dependent. As a result, conclusions about data quality can flip under even minor hyperparameter adjustments, and the fixed-configuration protocol diverges from full-scale development pipelines, where hyperparameter tuning is standard. The toy sketch below illustrates how a fixed configuration can invert a recipe ranking.
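The following is a toy illustration of that failure mode, not an experiment from the paper: the loss surfaces are synthetic quadratics in log learning rate, chosen so that each hypothetical recipe has a different optimal learning rate. Under one shared "fair" learning rate the worse recipe looks better; under a per-recipe sweep the ordering flips.

```python
# Toy illustration (synthetic stand-in, not the paper's experiments): each
# "recipe" has a different optimal learning rate, so comparing all recipes at
# one fixed learning rate can invert the ranking obtained under per-recipe
# tuning.
import numpy as np

# Hypothetical recipes: best achievable loss and the LR that achieves it.
RECIPES = {
    "recipe_A": {"best_loss": 2.80, "best_lr": 3e-3},
    "recipe_B": {"best_loss": 2.75, "best_lr": 1e-3},  # better data, different LR optimum
}

def proxy_loss(recipe: str, lr: float) -> float:
    """Synthetic proxy-model loss: a bowl around the recipe's own optimal LR."""
    spec = RECIPES[recipe]
    return spec["best_loss"] + 5.0 * (np.log10(lr) - np.log10(spec["best_lr"])) ** 2

fixed_lr = 3e-3  # one "fair" configuration shared by every recipe
ranking_fixed = sorted(RECIPES, key=lambda r: proxy_loss(r, fixed_lr))

# Per-recipe tuning: sweep a small LR grid and keep each recipe's best loss.
lr_grid = np.logspace(-4, -2, 9)
ranking_tuned = sorted(RECIPES, key=lambda r: min(proxy_loss(r, lr) for lr in lr_grid))

print("ranking at fixed LR :", ranking_fixed)   # recipe_A looks better here
print("ranking after tuning:", ranking_tuned)   # recipe_B is actually better
```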
Key Innovation
The paper's fix is a simple patch to the evaluation protocol: train the proxy models with reduced learning rates. Empirically, the relative performance of recipes under this protocol correlates strongly with that of fully tuned large-scale LLM pretraining runs, across 23 recipes spanning four dimensions of data curation. Theoretically, for random-feature models, the reduced-learning-rate protocol provably preserves the ordering of datasets by their optimal achievable loss. A sketch of the protocol follows.
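Below is a minimal sketch of how such a protocol could be wired up, assuming a generic training hook `train_proxy` supplied by the caller and an illustrative learning-rate reduction factor (the paper's exact choice is not reproduced here); `scipy.stats.spearmanr` is used only to sanity-check the proxy ranking against any available full-scale, tuned runs.

```python
# Sketch of the patched proxy-evaluation protocol (a paraphrase, not the
# authors' code): train one proxy per data recipe at a *reduced* learning
# rate, rank recipes by validation loss, and optionally measure agreement
# with full-scale, tuned results via Spearman rank correlation.
from typing import Callable, Dict
from scipy.stats import spearmanr

def rank_recipes(
    recipes: Dict[str, str],                     # recipe name -> dataset-mix spec
    train_proxy: Callable[[str, float], float],  # (data_spec, lr) -> validation loss
    base_lr: float = 3e-3,
    lr_reduction: float = 0.25,                  # illustrative factor, not from the paper
) -> Dict[str, float]:
    """Train each recipe's proxy at a reduced LR and return validation losses."""
    reduced_lr = base_lr * lr_reduction
    return {name: train_proxy(spec, reduced_lr) for name, spec in recipes.items()}

def agreement_with_full_scale(
    proxy_losses: Dict[str, float],
    full_scale_losses: Dict[str, float],
) -> float:
    """Spearman correlation between proxy and full-scale recipe rankings."""
    names = sorted(proxy_losses)
    rho, _ = spearmanr(
        [proxy_losses[n] for n in names],
        [full_scale_losses[n] for n in names],
    )
    return rho
```

In this sketch, `train_proxy` would wrap whatever small-model training stack a team already uses; the only change from the standard fixed-configuration protocol is the reduced learning rate passed to every recipe's run.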
Practical Impact
By evaluating proxy models at reduced learning rates, data teams can make data curation decisions that better predict full-scale outcomes without paying for per-recipe hyperparameter sweeps, improving final model quality and reducing wasted full-scale training runs. The approach applies wherever large-scale pretraining is used, including natural language processing, computer vision, and healthcare applications.
Analogy / Intuitive Explanation
Think of data curation as choosing ingredients for a banquet by cooking small test dishes first. A test dish cooked at one fixed heat setting can mislead you, because different ingredient mixes cook best at different temperatures, just as different data recipes train best under different hyperparameters. The paper's fix is like gently lowering the heat for every test dish: cooked this way, the small samples rank the ingredient mixes in the same order the fully prepared banquet would, so you can pick the right ingredients before committing to the expensive full meal.
Paper Information
Categories: cs.LG cs.AI
Published Date:
arXiv ID: 2512.24503v1