VISTA: A Test-Time Self-Improving Video Generation Agent

Computer Vision & MultiModal AI
Published: arXiv: 2510.15831v1
Authors

Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık

Abstract

Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.

Paper Summary

Problem
Text-to-video (T2V) generation has made significant progress, but it still faces several challenges:
* Models struggle to align precisely with user goals
* Outputs often fail to adhere to physical laws and common sense
* Results are highly sensitive to the exact phrasing of the input prompt
* Deployment remains limited as a result of these challenges
Key Innovation
VISTA is a multi-agent framework that emulates human-like prompt refinement to improve T2V generation. The authors present it as the first to jointly improve the visual, audio, and context dimensions of videos. VISTA consists of four key components:
* Structured Video Prompt Planning
* Pairwise Tournament Selection
* Multi-Dimensional Multi-Agent Critiques
* Deep Thinking Prompting Agent

These components run in an iterative loop: the prompt is planned, candidate videos are generated and compared, the winning video is critiqued, and the prompt is rewritten for the next generation cycle.
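The loop formed by these four components can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Candidate`, `pairwise_tournament`, `self_improve`, and the `plan`, `generate`, `prefer`, `critics`, and `rewrite` callables) is a hypothetical placeholder, the tournament is reduced to a simple sequential knockout, and carrying the previous winner into the next round is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    """A prompt together with the video generated from it (hypothetical type)."""
    prompt: str
    video: str


def pairwise_tournament(
    cands: List[Candidate],
    prefer: Callable[[Candidate, Candidate], Candidate],
) -> Candidate:
    """Sequential knockout: the current champion meets each challenger in turn."""
    champion = cands[0]
    for challenger in cands[1:]:
        champion = prefer(champion, challenger)
    return champion


def self_improve(
    idea: str,
    plan: Callable[[str], str],
    generate: Callable[[str], str],
    prefer: Callable[[Candidate, Candidate], Candidate],
    critics: Dict[str, Callable[[Candidate], str]],
    rewrite: Callable[[str, Dict[str, str]], str],
    n_candidates: int = 2,
    n_iters: int = 2,
) -> Candidate:
    """Plan -> generate -> select -> critique -> rewrite, repeated n_iters times."""
    prompt = plan(idea)  # structured prompt planning
    best: Optional[Candidate] = None
    for _ in range(n_iters):
        cands = [Candidate(prompt, generate(prompt)) for _ in range(n_candidates)]
        if best is not None:
            cands.append(best)  # assumption: the previous winner defends its place
        best = pairwise_tournament(cands, prefer)
        # One critique per dimension (e.g. visual, audio, context).
        feedback = {name: critic(best) for name, critic in critics.items()}
        prompt = rewrite(best.prompt, feedback)  # prompt for the next cycle
    return best


# Toy stand-ins so the sketch runs end to end; real agents would call models.
def toy_plan(idea: str) -> str:
    return f"Scene 1: {idea}"


def toy_generate(prompt: str) -> str:
    return f"video({prompt})"


def toy_prefer(a: Candidate, b: Candidate) -> Candidate:
    # Pretend the more detailed (longer) prompt wins; ties keep the champion.
    return a if len(a.prompt) >= len(b.prompt) else b


def toy_rewrite(prompt: str, feedback: Dict[str, str]) -> str:
    return prompt + " | " + "; ".join(feedback.values())


toy_critics = {
    dim: (lambda cand, d=dim: f"fix {d}") for dim in ("visual", "audio", "context")
}

result = self_improve(
    "a cat", toy_plan, toy_generate, toy_prefer, toy_critics, toy_rewrite
)
print(result.prompt)
```

In practice each callable would wrap a model call (a video generator for `generate`, a multimodal judge for `prefer` and the critics), and the stopping rule would likely depend on a budget or on the critiques rather than a fixed `n_iters`.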
Practical Impact
VISTA has several practical applications, including:
* Creative storytelling: VISTA can generate high-quality videos that align with user goals and preferences.
* Education: VISTA can create engaging and informative videos that cater to different learning styles.
* Content creation: VISTA can assist in generating videos for various purposes, such as advertising and entertainment.

VISTA's ability to jointly optimize visual, audio, and contextual elements makes it a powerful tool for various industries and applications.
Analogy / Intuitive Explanation
Imagine you're trying to create a video based on a user's prompt. You start by planning the video's structure and content, then you generate a few options. Next, you pick the best option, critique it, and refine the prompt based on that feedback. Finally, you generate a new video from the refined prompt and repeat until the result meets the user's expectations. VISTA runs this process automatically, replacing the human editor with cooperating agents that plan, compare, critique, and rewrite.
Paper Information
Categories: cs.CV
arXiv ID: 2510.15831v1
