MINT-RVAE: Multi-Cues Intention Prediction of Human-Robot Interaction using Human Pose and Emotion Information from RGB-only Camera Data

Computer Vision & MultiModal AI
Published: arXiv:2509.22573v1
Authors

Farida Mohsen, Ali Safa

Abstract

Efficiently detecting human intent to interact with ubiquitous robots is crucial for effective human-robot interaction (HRI) and collaboration. Over the past decade, deep learning has gained traction in this field, with most existing approaches relying on multimodal inputs, such as RGB combined with depth (RGB-D), to classify time-sequence windows of sensory data as interactive or non-interactive. In contrast, we propose a novel RGB-only pipeline for predicting human interaction intent with frame-level precision, enabling faster robot responses and improved service quality. A key challenge in intent prediction is the class imbalance inherent in real-world HRI datasets, which can hinder the model's training and generalization. To address this, we introduce MINT-RVAE, a synthetic sequence generation method, along with new loss functions and training strategies that enhance generalization on out-of-sample data. Our approach achieves state-of-the-art performance (AUROC: 0.95), outperforming prior works (AUROC: 0.90-0.912), while requiring only RGB input and supporting precise frame onset prediction. Finally, to support future research, we openly release our new dataset with frame-level labeling of human interaction intent.
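The headline metric, AUROC (area under the ROC curve), equals the probability that a randomly chosen interactive sample is scored above a randomly chosen non-interactive one. A minimal, library-free illustration of the metric itself (the function name and toy data below are ours, not the paper's):

```python
def auroc(labels, scores):
    """Rank-based AUROC: fraction of positive/negative pairs ranked
    correctly, with ties counting as half a win (Mann-Whitney U form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: per-frame intent scores vs. ground-truth intent labels.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUROC of 0.95 therefore means a 95% chance that the model ranks a true interaction frame above a non-interaction frame, regardless of any decision threshold.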

Paper Summary

Problem
Detecting human intent to interact with robots is crucial for effective human-robot interaction (HRI) and collaboration. However, most existing approaches rely on multimodal inputs, such as RGB combined with depth (RGB-D), which can limit system scalability and increase costs. The main problem is to predict human interaction intent with frame-level precision using only RGB input, enabling faster robot responses and improved service quality.
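The frame-level idea can be sketched as follows: extract per-frame cues from RGB alone (pose keypoints plus emotion probabilities), score each frame for intent, and report the onset as the first frame crossing a threshold. Everything below is an illustrative stand-in (random features, a linear scorer, and exponential smoothing in place of the paper's learned model), not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features from an RGB-only camera:
# 17 2-D pose keypoints (34 values) + 7 emotion probabilities = 41-D vector.
N_FRAMES, POSE_DIM, EMO_DIM = 60, 34, 7
features = rng.normal(size=(N_FRAMES, POSE_DIM + EMO_DIM))

# Stand-in scorer: a random linear head mapping features to a logit.
w = rng.normal(size=POSE_DIM + EMO_DIM)

def frame_intent_probs(feats, w, alpha=0.3):
    """Per-frame intent probability, smoothed over time as a crude
    stand-in for a recurrent model's temporal context."""
    probs, state = [], 0.0
    for x in feats:
        p = 1.0 / (1.0 + np.exp(-x @ w))         # sigmoid of the frame logit
        state = alpha * p + (1 - alpha) * state  # exponential smoothing
        probs.append(state)
    return np.array(probs)

def onset_frame(probs, threshold=0.5):
    """First frame whose smoothed probability crosses the threshold, else -1."""
    above = np.flatnonzero(probs >= threshold)
    return int(above[0]) if above.size else -1

probs = frame_intent_probs(features, w)
print(onset_frame(probs))
```

The point of frame-level labels is exactly this last step: instead of classifying a whole time window as interactive, the robot can react at the first frame where intent becomes evident.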
Key Innovation
The researchers propose a novel RGB-only pipeline for predicting human interaction intent with frame-level precision. Its centerpiece is MINT-RVAE, a synthetic sequence generation method: a multimodal recurrent variational autoencoder (VAE) that synthesizes realistic interaction sequences to counter the class imbalance inherent in real-world HRI datasets, which can otherwise hinder training and generalization.
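The recurrent-VAE idea behind synthetic sequence generation can be sketched with toy NumPy weights: encode a real minority-class sequence into a Gaussian latent, sample via the reparameterization trick, and unroll a recurrent decoder into a new synthetic sequence. All weights and dimensions here are illustrative assumptions, not the trained MINT-RVAE:

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, FEAT_DIM, LATENT_DIM, HID = 30, 41, 8, 16

# Toy "trained" weights; a real model learns these from minority-class data.
W_mu  = rng.normal(scale=0.1, size=(FEAT_DIM, LATENT_DIM))
W_lv  = rng.normal(scale=0.1, size=(FEAT_DIM, LATENT_DIM))
W_in  = rng.normal(scale=0.1, size=(LATENT_DIM, HID))
W_rec = rng.normal(scale=0.1, size=(HID, HID))
W_out = rng.normal(scale=0.1, size=(HID, FEAT_DIM))

def encode(seq):
    """Encode a sequence into a diagonal-Gaussian latent (mean-pooled encoder)."""
    pooled = seq.mean(axis=0)
    return pooled @ W_mu, pooled @ W_lv  # mean, log-variance

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, the trick that keeps sampling differentiable."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode(z, seq_len=SEQ_LEN):
    """Unroll a simple recurrent decoder from the latent into a synthetic sequence."""
    h, frames = np.tanh(z @ W_in), []
    for _ in range(seq_len):
        h = np.tanh(z @ W_in + h @ W_rec)
        frames.append(h @ W_out)
    return np.stack(frames)

real = rng.normal(size=(SEQ_LEN, FEAT_DIM))  # stand-in minority-class sequence
mu, logvar = encode(real)
synthetic = decode(reparameterize(mu, logvar))
print(synthetic.shape)  # (30, 41)
```

Sampling many latents yields many plausible minority-class sequences, which is how a generative model of this kind rebalances a skewed training set without collecting more rare real interactions.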
Practical Impact
This research has significant practical implications for the development of service robots that operate in public spaces. By accurately detecting human interaction intent with frame-level precision, robots can respond in a timely and socially appropriate manner, improving fluency, safety, and user trust. This can lead to seamless user experiences in domains such as hotels, shopping centers, and healthcare facilities.
Analogy / Intuitive Explanation
Imagine you're walking towards a robot receptionist in a hotel lobby. The robot needs to detect your intention to interact with it before you explicitly verbalize or gesture. This is like a game of "reading the mind" between humans and robots. The researchers have developed a way for the robot to "read" your intention more accurately, using only a regular camera, without needing special hardware like depth cameras. This allows the robot to respond more quickly and appropriately, making the interaction more efficient and enjoyable.
Paper Information
Categories: cs.RO, cs.CV
arXiv ID: 2509.22573v1