SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Published: arXiv: 2511.16666v1
Authors

Zhenyuan Qin, Xincheng Shuai, Henghui Ding

Abstract

Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.

Paper Summary

Problem
The paper addresses the difficulty of simultaneously controlling the 9D poses (location, size, and orientation) of multiple objects in a generated image. Existing methods often suffer from limited controllability and degraded quality, so they fall short of comprehensive multi-object 9D pose control.
Key Innovation
The key innovation is SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner adds a branched network to a pre-trained base model and uses a new representation, the CNOCS map, which encodes 9D pose information from the camera view. According to the authors, this representation has strong geometric interpretation properties, which leads to more efficient and stable training.
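This summary does not spell out how a CNOCS map is computed. As a rough illustration of the general idea of normalized object coordinates, the Python sketch below stores a 9-DoF pose (3 values each for location, size, and orientation) and maps a camera-space point into the object's unit cube. The names `Pose9D`, `rotation_matrix`, and `normalized_coords`, as well as the yaw-pitch-roll convention, are assumptions made for this example and are not taken from the paper or its code.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose9D:
    """Hypothetical 9-DoF pose container: 3 + 3 + 3 parameters."""
    location: tuple     # (x, y, z) of the box center, camera coordinates
    size: tuple         # (width, height, depth) of the box
    orientation: tuple  # (yaw, pitch, roll) in radians; convention assumed


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Object-to-camera rotation R = Ry(yaw) @ Rx(pitch) @ Rz(roll)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(ry, rx), rz)


def normalized_coords(point, pose):
    """Map a camera-space point to the box's unit cube ([0, 1] per axis)."""
    r = rotation_matrix(*pose.orientation)
    offset = [point[i] - pose.location[i] for i in range(3)]
    # Apply R transposed to express the offset in the object's own axes.
    local = [sum(r[i][j] * offset[i] for i in range(3)) for j in range(3)]
    # Divide by the box size and shift so the center lands at 0.5.
    return tuple(local[j] / pose.size[j] + 0.5 for j in range(3))


# Example: an axis-aligned box 5 units in front of the camera.
box = Pose9D(location=(0.0, 0.0, 5.0), size=(2.0, 1.0, 4.0),
             orientation=(0.0, 0.0, 0.0))
center_code = normalized_coords((0.0, 0.0, 5.0), box)
corner_code = normalized_coords((1.0, 0.5, 7.0), box)
```

Evaluating such a code for every visible pixel of every object would give an image-like map in which location, size, and orientation all influence the values. Whether CNOCS is constructed this way is not stated in this summary.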
Practical Impact
SceneDesigner could be useful in several application areas:
* Image editing and manipulation: Users can control the pose of multiple objects in a scene, enabling creative and realistic image editing.
* Virtual reality and augmented reality: SceneDesigner could generate realistic scene imagery with controlled object poses, enhancing the user experience.
* Computer vision and robotics: Pose-controlled images could help improve object recognition and tracking in complex scenes and support robots that manipulate objects in a controlled manner.
Analogy / Intuitive Explanation
Imagine you're an interior designer arranging several pieces of furniture in a room. You want to control the size, orientation, and position of each piece while making sure they fit together harmoniously. SceneDesigner is like a tool that lets you set the 9D pose of each object and then produces a realistic, aesthetically pleasing scene.
Paper Information
Categories: cs.CV
arXiv ID: 2511.16666v1
