DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction

Published: arXiv: 2512.21433v1
Authors

Khondoker Mirazul Mumenin, Robert Underwood, Dong Dai, Jinzhen Wang, Sheng Di, Zarija Lukić, Franck Cappello

Abstract

Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the lightweight metric prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.
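The mixture-of-experts design mentioned in the abstract can be illustrated with a minimal sketch: several expert predictors each estimate a quality metric from a feature vector, and a softmax gate blends their outputs. The linear experts, gating weights, and shapes below are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_predict(features, experts, gate_w):
    """Combine per-expert quality predictions with a softmax gate.

    features : (d,) feature vector for one data block / timestep
    experts  : list of callables, each mapping features -> predicted metric
    gate_w   : (n_experts, d) gating weights (illustrative, would be learned)
    """
    gate = softmax(gate_w @ features)               # routing probabilities
    preds = np.array([e(features) for e in experts])
    return float(gate @ preds)                      # convex mixture of experts

# Toy usage: two linear "experts" predicting a quality metric from 4 features.
rng = np.random.default_rng(0)
feats = rng.normal(size=4)
experts = [lambda f: f @ np.ones(4), lambda f: f @ np.arange(4.0)]
pred = moe_predict(feats, experts, rng.normal(size=(2, 4)))
```

Because the gate weights are nonnegative and sum to one, the mixture always lies between the smallest and largest expert prediction, which is one way such a design can stay robust when individual experts drift on unseen timesteps.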

Paper Summary

Problem
The main problem addressed in this research paper is the challenge of assessing data quality after lossy compression in scientific data management and analytics. As the volume of data generated by scientific simulations and instruments continues to grow, it becomes increasingly difficult and computationally expensive to evaluate the quality of compressed data.
Key Innovation
The researchers introduce DeepCQ, a general-purpose deep-surrogate framework that efficiently predicts multiple compression quality metrics for a given input dataset and error bound across various lossy compressors. The framework is distinctive in that it decouples the computationally expensive feature-extraction stage from the lightweight metric prediction, enabling efficient training and modular inference.
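The two-stage split can be sketched as follows: a heavy feature extractor runs once per data block, and a cheap prediction head is then reused for each compressor/error-bound combination. The summary-statistic "encoder" and the linear head here are illustrative stand-ins under assumed names, not the paper's actual networks.

```python
import numpy as np

def extract_features(block):
    """Heavy stage: run once per data block and cache the result.
    Simple summary statistics stand in for a learned feature encoder."""
    return np.array([block.mean(), block.std(),
                     np.abs(np.diff(block)).mean()])

def predict_metric(features, error_bound, head_w, head_b):
    """Light stage: a tiny head, reusable per (compressor, metric) pair.
    Appends the log error bound to the cached features."""
    x = np.append(features, np.log10(error_bound))
    return float(x @ head_w + head_b)

block = np.sin(np.linspace(0.0, 8.0, 256))
feats = extract_features(block)                      # expensive: once per block
estimates = {eb: predict_metric(feats, eb, np.ones(4), 0.0)
             for eb in (1e-2, 1e-3, 1e-4)}           # cheap: once per error bound
```

The payoff of the decoupling is visible even in this toy: the expensive stage is amortized across many error bounds and compressors, so sweeping candidate settings costs only a few inexpensive head evaluations.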
Practical Impact
The DeepCQ framework has significant practical implications for scientific data analysis. By predicting data quality accurately and efficiently, researchers can make informed decisions about data compression, reducing I/O and computational overhead in scientific data analysis. This can lead to faster and more efficient data processing, storage, and transmission, ultimately enabling scientists to focus on more critical tasks.
Analogy / Intuitive Explanation
Imagine you're trying to compress a large folder of files to send to a colleague. You want to know how much fidelity will be lost in compression, but actually compressing the data and measuring the result takes a long time. The DeepCQ framework is like a fast, accurate "compression advisor" that predicts the resulting quality before you even start the compression process. This way, you can make an informed decision about whether to proceed with a given compressor and error bound or look for alternative settings.
Paper Information
Categories:
cs.LG cs.DC cs.PF
Published Date:
arXiv ID: 2512.21433v1
