Training-Time Action Conditioning for Efficient Real-Time Chunking

Generative AI & LLMs
arXiv: 2512.05964v1
Authors

Kevin Black, Allen Z. Ren, Michael Equi, Sergey Levine

Abstract

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $\pi_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.

Paper Summary

Problem
Real-time robot control is challenging for large vision-language-action models (VLAs): their inference latency can reach tens to hundreds of milliseconds, during which the robot must keep acting. Real-time chunking (RTC) handles this by committing to a prefix of the previous action chunk while the next one is computed, but the inference-time inpainting it uses to stay consistent with that prefix adds its own computational overhead, further increasing latency.
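To make the timing concrete, here is a back-of-the-envelope illustration of how inference latency turns into a committed action prefix (the control rate and latency figures are assumptions for the example, not numbers from the paper):

```python
# Illustrative numbers only (assumed, not from the paper).
control_hz = 50     # controller rate in steps per second
latency_s = 0.100   # model inference latency in seconds

# While the model computes the next chunk, the robot keeps executing
# actions from the previous chunk; those steps are already committed,
# and the new chunk must be consistent with them.
committed_steps = int(round(latency_s * control_hz))
print(f"{committed_steps} actions are committed during inference")  # -> 5
```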
Key Innovation
The researchers propose "training-time action conditioning": simulate the inference delay during training and condition the model directly on the action prefix it will already be committed to at execution time. Because the conditioning is learned, inference-time inpainting and its computational overhead are eliminated entirely; the method requires no changes to the model architecture or robot runtime and amounts to a few extra lines of training code, as sketched below.
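A minimal sketch of what such a training step might look like, assuming a chunk-prediction model that accepts a committed prefix and a prefix mask as extra conditioning. The model signature, the uniform delay distribution, and the plain MSE objective are all illustrative assumptions; the paper's $\pi_{0.6}$ VLA would use its own chunk-generation objective.

```python
import torch

def training_step(model, obs, action_chunk, max_delay, optimizer):
    """One training step with a simulated inference delay (sketch).

    obs:          observation batch, shape [B, ...]
    action_chunk: ground-truth action chunk, shape [B, H, action_dim]
    max_delay:    largest delay (in control steps) to simulate, < H
    """
    B, H, _ = action_chunk.shape
    # Sample a per-example delay; d = 0 recovers ordinary chunk prediction.
    d = torch.randint(0, max_delay + 1, (B,))
    # Mark which steps fall inside the committed prefix.
    steps = torch.arange(H).unsqueeze(0)       # [1, H]
    prefix_mask = steps < d.unsqueeze(1)       # [B, H], True on the prefix
    # Actions the controller will already have committed to by the time
    # this prediction arrives; zeros elsewhere.
    prefix = torch.where(prefix_mask.unsqueeze(-1), action_chunk,
                         torch.zeros_like(action_chunk))
    # The prefix (and its mask) enter as extra conditioning inputs.
    pred = model(obs, prefix, prefix_mask)     # [B, H, action_dim]
    # Supervise only the steps the policy must still produce.
    keep = (~prefix_mask).unsqueeze(-1).float()
    loss = ((pred - action_chunk) ** 2 * keep).sum() / keep.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the real committed prefix from the previously predicted chunk would replace the simulated one, so no extra inpainting computation is needed.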
Practical Impact
This research has practical implications for real-time robot control. By removing inference-time inpainting, training-time action conditioning reduces the computational cost of every prediction. In the paper's real-world experiments on box building and espresso making, it matched inference-time RTC in both task performance and execution speed while being computationally cheaper, making it a practical drop-in replacement.
Analogy / Intuitive Explanation
Think of training-time action conditioning like a musician rehearsing with a metronome. The metronome imposes the timing constraints of a live performance during practice, so the musician learns to stay on the beat before ever taking the stage. Similarly, simulating the inference delay at training time lets the model learn to generate smooth, reactive trajectories under the same timing constraints it will face during real-time robot control.
Paper Information
Categories: cs.RO, cs.AI
arXiv ID: 2512.05964v1