A Discrepancy-Based Perspective on Dataset Condensation

Generative AI & LLMs
Published: arXiv: 2509.10367v1
Authors

Tong Chen, Raghavendra Selvan

Abstract

Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i = 1}^N$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j = 1}^M$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
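One common way to make this precise (an illustrative choice here, not necessarily the paper's exact definition) is to measure the discrepancy as an integral probability metric over a function class $\mathcal{F}$, comparing the empirical distributions of $\mathcal{T}$ and $\mathcal{S}$:

$$D_{\mathcal{F}}(\mathcal{T}, \mathcal{S}) = \sup_{f \in \mathcal{F}} \left| \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{x}_i) - \frac{1}{M}\sum_{j=1}^{M} f(\tilde{\mathbf{x}}_j) \right|, \qquad \mathcal{S}^{\star} = \arg\min_{|\mathcal{S}| = M} D_{\mathcal{F}}(\mathcal{T}, \mathcal{S}).$$

Different choices of $\mathcal{F}$ recover familiar distances: a unit ball in an RKHS gives the maximum mean discrepancy (MMD), while 1-Lipschitz functions give the Wasserstein-1 distance.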

Paper Summary

Problem
This paper addresses dataset condensation (DC): reducing a large dataset to a much smaller synthetic one while preserving the performance of models trained on it. This matters because large datasets require significant computational resources, contribute to the carbon footprint of machine learning, and can be difficult to interpret.
Key Innovation
The key innovation of this paper is a unified framework that encompasses existing DC methods and extends the task-specific notion of DC to a more general, formal definition based on discrepancies. A discrepancy measures the distance between two probability distributions in a given regime, which allows the condensation objective to be stated formally as minimizing the discrepancy between the real and synthetic data distributions, rather than being tied to one downstream task.
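As a concrete (and deliberately simplified) illustration of this view, the sketch below condenses a dataset by directly minimizing one such discrepancy, the kernel MMD, with respect to the synthetic points. The feature dimension, dataset sizes, kernel bandwidth, and optimizer settings are placeholder choices, and the real data is replaced by random vectors; this is not the authors' algorithm, only the general discrepancy-minimization recipe.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel values k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * sigma^2)).
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, s, sigma=1.0):
    # Biased estimate of the squared MMD between the empirical distributions of x and s.
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, s, sigma).mean()
            + gaussian_kernel(s, s, sigma).mean())

# Real dataset T (N points) and a learnable synthetic set S with M << N points.
N, M, d = 1000, 20, 32
T = torch.randn(N, d)                      # stand-in for the real data
S = torch.randn(M, d, requires_grad=True)  # synthetic points, optimized directly

opt = torch.optim.Adam([S], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = mmd2(T, S)   # discrepancy between real and synthetic distributions
    loss.backward()
    opt.step()
```

After optimization, S can serve as a drop-in, much smaller training set; in practice DC methods typically match distributions in a learned feature or gradient space rather than in raw input space.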
Practical Impact
This research has significant practical implications. By providing a principled foundation for DC, this paper enables the development of more efficient, robust, and private synthetic datasets and learning algorithms. This can lead to reduced computational costs, lower carbon emissions, and improved model interpretability. Additionally, the framework's focus on multi-objective problems can help designers balance competing objectives like accuracy, efficiency, and robustness.
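To make the multi-objective angle concrete, a hypothetical combined loss might reuse the mmd2 helper from the sketch above and add surrogate terms for robustness and privacy; the surrogates and weights below are assumptions for illustration, not the paper's formulation.

```python
def condensation_loss(T_feats, S, lambda_rob=0.1, lambda_priv=0.01):
    # Generalization: match the real data distribution.
    gen_term = mmd2(T_feats, S)
    # Robustness (illustrative surrogate): also match a perturbed copy of the data.
    rob_term = mmd2(T_feats + 0.1 * torch.randn_like(T_feats), S)
    # Privacy (illustrative surrogate): keep synthetic points away from real ones.
    priv_term = -torch.cdist(S, T_feats).min(dim=1).values.mean()
    return gen_term + lambda_rob * rob_term + lambda_priv * priv_term
```

The weights trade off the competing objectives, mirroring how the framework accommodates goals beyond generalization, such as robustness and privacy.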
Analogy / Intuitive Explanation
Imagine trying to summarize a long book into a concise summary. The goal of dataset condensation is similar: to distill the essence of a large dataset into a smaller, more manageable version that still captures the key information. The unified framework presented in this paper provides a systematic approach to achieving this goal, allowing for the creation of synthetic datasets that are more efficient, robust, and private.
Paper Information
Categories: cs.LG
arXiv ID: 2509.10367v1
