LuxDiT: Lighting Estimation with Video Diffusion Transformer

Published on arXiv: 2509.03680v1
Authors

Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang

Abstract

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.

Paper Summary

Problem
Estimating scene lighting from a single image or video remains a long-standing challenge in computer vision and graphics. The problem is hard for learning-based methods because ground-truth high-dynamic-range (HDR) environment maps are scarce: they are expensive to capture and limited in diversity.
Key Innovation
LuxDiT is a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. The model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes.
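The low-rank adaptation (LoRA) strategy the paper uses for semantic-alignment fine-tuning freezes the pretrained transformer weights and learns only a small low-rank correction per layer. A minimal NumPy sketch of that idea follows; the rank, scaling factor, and layer sizes below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 64, 64, 4, 16.0   # illustrative values, not from the paper
W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, rank))                  # trainable up-projection, zero-initialized
scale = alpha / rank

def lora_forward(x):
    """Frozen path plus low-rank update: x W^T + scale * (x A^T) B^T."""
    return x @ W.T + (x @ A.T) @ B.T * scale

x = rng.standard_normal((2, d_in))
y = lora_forward(x)

# With B zero-initialized, the adapted layer starts out identical to the frozen
# one, so fine-tuning begins exactly from the pretrained model's behavior.
assert np.allclose(y, x @ W.T)
```

The payoff is parameter efficiency: this layer trains `rank * (d_in + d_out) = 512` values instead of the full `d_in * d_out = 4096`, which is what makes adapting a large video diffusion transformer on a modest HDR panorama dataset feasible.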
Practical Impact
This research could enable more realistic virtual object insertion, augmented reality, and synthetic data generation. Because LuxDiT produces accurate lighting predictions while preserving scene semantics, it could be a valuable tool in industries such as gaming, film, and architecture.
Analogy / Intuitive Explanation
Imagine working out where the sun is, and how bright, purely from the shadows and highlights in a photograph, without ever seeing the sky itself. That is essentially what LuxDiT does: it reads indirect visual cues in an image or video and, using priors learned from large-scale synthetic data, generates a high-quality HDR environment map that captures the scene's lighting conditions.
Paper Information
Categories:
cs.GR cs.AI cs.CV
Published Date:

arXiv ID:

2509.03680v1
