Effective Training Data Synthesis for Improving MLLM Chart Understanding

Generative AI & LLMs
Published on arXiv: 2508.06492v1
Authors

Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng

Abstract

Being able to effectively read scientific plots, or chart understanding, is a central part of building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, still fall behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to real charts, which can compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.

Paper Summary

Key Innovation
The innovation lies in designing a five-step pipeline for generating synthetic training data that improves chart understanding capabilities. The pipeline is modularized and diversified, allowing for the creation of high-quality datasets with complex visualizations. Specifically, it involves separating data and function creation for single plot generation, conditioning later subplots on earlier ones for multi-subplot figures, visually diversifying generated figures, filtering out low-quality data, and generating question-answer pairs with GPT-4o.
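The five steps above can be sketched as a small Python pipeline. This is a hypothetical illustration of the control flow only, not the authors' implementation: every function name, data shape, and quality check here is an assumed placeholder (the paper's actual pipeline renders real chart images and calls GPT-4o for QA generation).

```python
import random

# Illustrative sketch of the five-step ECD synthesis pipeline.
# All helpers below are hypothetical stand-ins for the paper's components.

def create_data(topic):
    """Step 1a: synthesize raw data for a single plot, created
    separately from the plotting function that will render it."""
    random.seed(hash(topic) % 2**32)
    return {"topic": topic, "values": [random.random() for _ in range(10)]}

def create_plot_function(chart_type):
    """Step 1b: choose a rendering routine independently of the data."""
    return {"chart_type": chart_type}

def add_conditioned_subplot(figure, chart_type):
    """Step 2: condition a new subplot on subplots already in the
    figure (here, by reusing the first subplot's topic)."""
    data = create_data(figure["subplots"][0]["data"]["topic"])
    figure["subplots"].append(
        {"data": data, "spec": create_plot_function(chart_type)})
    return figure

def diversify(figure):
    """Step 3: vary visual details (a style choice as a placeholder)."""
    figure["style"] = random.choice(["ggplot", "seaborn", "classic"])
    return figure

def passes_filter(figure):
    """Step 4: drop low-quality figures (trivial illustrative check)."""
    return all(len(sp["data"]["values"]) > 0 for sp in figure["subplots"])

def generate_qa(figure):
    """Step 5: QA pairs would come from GPT-4o; stubbed here."""
    return [{"q": "How many subplots does the figure have?",
             "a": str(len(figure["subplots"]))}]

def synthesize(topic, chart_types):
    """Run all five steps for one multi-subplot figure."""
    first = {"data": create_data(topic),
             "spec": create_plot_function(chart_types[0])}
    figure = {"subplots": [first]}
    for ct in chart_types[1:]:
        figure = add_conditioned_subplot(figure, ct)
    figure = diversify(figure)
    if not passes_filter(figure):
        return None
    return figure, generate_qa(figure)
```

The point of the sketch is the modularity the paper argues for: data creation, plot-function selection, subplot conditioning, diversification, and filtering are separate stages that can each be varied independently to raise dataset diversity.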
Practical Impact
The effective chart dataset (ECD) generated by this pipeline has the potential to improve the performance of various MLLMs on a range of real-world and synthetic test sets. By providing high-quality training data, ECD can help bridge the gap between synthetic training data and authentic scientific visualizations, enabling AI agents to better analyze and learn from complex data.
Analogy / Intuitive Explanation
Imagine trying to teach a child to recognize different types of cars by showing them only simple drawings or 2D images. The child might struggle to understand the nuances of each car model without seeing real-world examples with varying angles, lighting conditions, and backgrounds. Similarly, AI models trained on synthetic data may not be able to generalize well to complex, real-world scenarios unless they are provided with high-quality training data that mimics these complexities.
Paper Information
Categories: cs.CV, cs.CL
arXiv ID: 2508.06492v1
