Neuro-Symbolic Spatial Reasoning in Segmentation

Computer Vision & Multimodal AI
arXiv: 2510.15841v1
Authors

Jiayi Lin, Jiabo Huang, Shaogang Gong

Abstract

Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of the spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first-order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and shows particularly clear advantages on images containing multiple categories, at the cost of only a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
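To make the relation-extraction step concrete, below is a minimal sketch of deriving <A, to-right-of, B> triples from per-class masks by comparing mask centroids along the x-axis. The function name and the centroid heuristic are illustrative assumptions, not the paper's exact extraction procedure.

```python
import torch

def extract_right_of_triples(masks: torch.Tensor, names: list[str]):
    """Derive <A, to-right-of, B> triples from binary per-class masks
    (shape [C, H, W]) by comparing mask x-centroids. Hypothetical helper:
    a simple stand-in for RelateSeg's automatic relation extraction."""
    _, _, width = masks.shape
    xs = torch.arange(width, dtype=torch.float32)
    cx = []
    for m in masks.float():
        area = m.sum()
        # x-centroid = (per-column pixel counts weighted by column index) / area
        cx.append((m.sum(dim=0) @ xs / area).item() if area > 0 else None)
    return [(names[i], "to-right-of", names[j])
            for i, a in enumerate(cx) for j, b in enumerate(cx)
            if i != j and a is not None and b is not None and a > b]
```

On a scene where the cat mask sits to the right of the person mask, this returns [("cat", "to-right-of", "person")], the kind of triple RelateSeg then encodes as an FOL constraint.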

Paper Summary

Problem
The main problem this paper addresses is Open-Vocabulary Semantic Segmentation (OVSS): segmenting an image into regions and assigning those regions labels drawn from an open set of categories. Current state-of-the-art methods rely on vision-language models (VLMs) to associate image regions with diverse textual concepts, but they struggle with contextual reasoning and with a structured understanding of how objects in a scene relate spatially.
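For context, the correlation-based baseline can be pictured as patch-text matching: each image patch is labelled by its most similar class-name embedding. The sketch below assumes pre-computed CLIP-style patch and text embeddings; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def correlate_patches(patch_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Assign each patch its best-matching open-vocabulary class.
    patch_emb: (N, D) visual features; text_emb: (K, D) class-name features.
    Note: no term considers where patches sit relative to one another,
    which is exactly the gap RelateSeg targets."""
    sims = F.normalize(patch_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T  # (N, K)
    return sims.argmax(dim=-1)  # per-patch class index
```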
Key Innovation
The key innovation of this work is the introduction of neuro-symbolic (NeSy) spatial reasoning in OVSS, combining the strengths of neural perception and symbolic reasoning. The proposed Relational Segmentor (RelateSeg) represents spatial relations among objects in an image as first-order logic formulas and, through fuzzy-logic relaxation, incorporates them into network optimization as a differentiable auxiliary loss.
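As a minimal sketch of how one such formula could enter optimization, the auxiliary loss below relaxes the constraint "every cat pixel lies to the right of a person" with a Reichenbach-style fuzzy implication, t(a → b) = 1 − a + a·b. The two-head setup, function name, and choice of implication are assumptions for illustration; the paper's exact relaxation is not reproduced here.

```python
import torch

def relational_consistency_loss(sem_logits: torch.Tensor, rel_logits: torch.Tensor,
                                subj_idx: int, rel_idx: int) -> torch.Tensor:
    """Fuzzy relaxation of: forall pixels p, cat(p) -> right_of_person(p).
    sem_logits: (B, C, H, W) semantic-category head;
    rel_logits: (B, R, H, W) spatial pseudo-category head.
    Under t(a -> b) = 1 - a + a*b, the per-pixel violation is a * (1 - b);
    averaging gives a differentiable loss that adds no parameters."""
    a = sem_logits.softmax(dim=1)[:, subj_idx]  # P(pixel is "cat")
    b = rel_logits.softmax(dim=1)[:, rel_idx]   # P(pixel is "right of person")
    return (a * (1.0 - b)).mean()
```

In training, this term would simply be added to the usual segmentation objective, e.g. loss = seg_loss + lam * relational_consistency_loss(sem_logits, rel_logits, cat_idx, right_of_person_idx).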
Practical Impact
This research has the potential to improve the accuracy of semantic segmentation, particularly in scenes where multiple objects stand in meaningful spatial relations. RelateSeg can be applied in real-world settings such as autonomous vehicles, robotics, and medical imaging, where an accurate understanding of spatial relationships is crucial. Its ability to learn spatial relations from data may also inform the design of more efficient and effective segmentation methods.
Analogy / Intuitive Explanation
Imagine you're trying to segment an image of a room, where there's a cat sitting on a chair next to a person. Traditional segmentation models might struggle to distinguish between the cat and the chair, or between the person and the background. The RelateSeg model, on the other hand, can learn to recognize the spatial relationships between objects, such as "the cat is to the right of the person," and use this knowledge to improve the segmentation accuracy. This is similar to how humans use contextual information to understand complex scenes and relationships between objects.
Paper Information
Categories: cs.CV
Published Date: October 2025
arXiv ID: 2510.15841v1
