AI Research Roundup: December 21, 2025
Discover the latest breakthroughs in artificial intelligence with our curated selection of this week's top research papers.
AI in healthcare
Machine learning advances in medical imaging and clinical decision support
Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data
Problem
The main problem addressed in this paper is the labor-intensive process of manually annotating fiber bundles on histological slides for anatomic tracing data. This bottleneck has limited the availability of annotated data and restricted large-scale validation studies of diffusion MRI (dMRI) tractography.
Analogy
Imagine searching for specific strands in a tangled mass of yarn. Manually annotating fiber bundles is just as painstaking. The automated framework presented in this paper is like a specialized tool that identifies the fibers efficiently and accurately, even in complex cases, enabling researchers to analyze large amounts of data quickly and gain new insights into brain function and connectivity.
Key Innovation
This research presents a fully automated framework for fiber bundle segmentation in macaque tracer data, using a U-Net architecture with large patch sizes, foreground aware sampling, and semi-supervised pre-training. This approach eliminates common errors, improves detection of sparse bundles by over 20%, and reduces the False Discovery Rate (FDR) by 40% compared to the state-of-the-art.
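To make the sampling idea concrete, here is a minimal sketch of foreground-aware patch sampling: training patches are centered on annotated (foreground) pixels with high probability, so sparse bundles appear often enough in training batches. The function name, probability, and 2D setting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_patch_center(mask, fg_prob=0.8, rng=None):
    """Pick a patch center, preferring foreground (annotated) pixels.

    mask: 2D boolean array marking annotated fiber-bundle pixels.
    fg_prob: chance of forcing the center onto a foreground pixel.
    """
    rng = rng or np.random.default_rng()
    fg = np.argwhere(mask)
    if len(fg) > 0 and rng.random() < fg_prob:
        return tuple(fg[rng.integers(len(fg))])  # center on a foreground pixel
    # otherwise sample uniformly over the whole image
    return (rng.integers(mask.shape[0]), rng.integers(mask.shape[1]))
```

With `fg_prob` high, rare bundle pixels dominate the training patches, which is one simple way to counteract extreme class imbalance.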
Practical Impact
This research has significant practical implications. The automated framework will facilitate large-scale analysis of anatomic tracing data, generating more ground-truth data that can be used to validate and optimize dMRI tractography methods. This will improve our understanding of brain connectivity patterns and enable more accurate reconstruction of white matter pathways.
Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
Problem
Radiology reports are crucial for clinical decision making, but current report-generation systems struggle to produce accurate and interpretable reports from chest X-rays. The main challenge is integrating different modalities of data, including visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals.
Analogy
Imagine trying to understand a medical image without knowing what the radiologist is looking at. By incorporating eye-tracking data, this framework helps machines "look" at the same things as humans, improving their ability to classify diseases and generate accurate reports. This is like having a virtual guide that shows you where to focus your attention, making it easier to understand complex medical images.
Key Innovation
The proposed framework addresses this challenge by introducing a two-stage multimodal approach that leverages the MIMIC-Eye dataset. The first stage uses gaze-guided contrastive learning to improve disease classification, while the second stage generates region-aware radiology reports. The key innovation is the use of a novel multi-term gaze-attention loss that unifies different facets of fixation data, such as pixel-wise fidelity and pattern-aware similarity.
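As an illustration of what a multi-term gaze loss might look like, the sketch below combines a pixel-wise fidelity term (mean squared error) with a pattern-aware term (cosine similarity) between a model attention map and a radiologist gaze heatmap. The specific terms, weights, and function name are assumptions for illustration; the paper's actual loss may differ.

```python
import numpy as np

def gaze_attention_loss(attn, gaze, w_pix=1.0, w_pat=1.0):
    """Illustrative multi-term loss between a model attention map and a
    gaze heatmap (both assumed normalized to sum to 1).

    Pixel term: mean squared error (per-pixel fidelity).
    Pattern term: 1 - cosine similarity (global fixation pattern).
    """
    pix = np.mean((attn - gaze) ** 2)
    cos = np.sum(attn * gaze) / (np.linalg.norm(attn) * np.linalg.norm(gaze))
    return w_pix * pix + w_pat * (1.0 - cos)
```

The loss is zero only when the attention map matches the gaze map exactly, and each term can be reweighted to trade off local against global agreement.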
Practical Impact
This research has several practical implications. First, it demonstrates the effectiveness of incorporating eye-tracking data into multimodal learning for disease classification and report generation. Second, it shows how to generate region-aligned sentences via structured prompts, improving report quality. Third, it highlights the potential benefits of using gaze-informed attention supervision in radiology applications.
Computer Vision & MultiModal AI
Advances in image recognition, video analysis, and multimodal learning
Multi-Phase Automated Segmentation of Dental Structures in CBCT Using a Lightweight Auto3DSeg and SegResNet Implementation
Problem
The main problem addressed by this research is the need for efficient and accurate automated segmentation of dental structures in cone-beam computed tomography (CBCT) images. This is particularly important in radiation oncology, where accurate diagnosis and treatment planning are crucial for patients with head and neck cancer.
Analogy
Imagine trying to find a specific toy in a messy playroom. You look at the room as a whole, then zoom in on smaller areas until you find what you're looking for. That's similar to what this algorithm does: it considers the entire CBCT image, then focuses on specific dental structures, such as teeth and nerves, to segment them accurately. Preprocessing plays the role of tidying the playroom first, making the structures easier to find.
Key Innovation
The key innovation of this work is the development of a lightweight deep learning pipeline using the MONAI Auto3DSeg framework and a 3D SegResNet architecture. The pipeline is designed to be computationally efficient while achieving high accuracy in segmenting dental structures. The approach also involves preprocessing steps, such as image resampling and intensity clipping, to improve model performance.
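The preprocessing steps can be illustrated with a small sketch that clips CBCT intensities to a fixed window and rescales them to [0, 1]. The window bounds and function name here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def preprocess_cbct(volume, clip_lo=-1000.0, clip_hi=3000.0):
    """Clip CBCT intensities to a plausible window and rescale to [0, 1].

    Clipping removes extreme outliers (e.g., metal artifacts); rescaling
    puts all scans on a common intensity range for the network.
    """
    v = np.clip(volume.astype(np.float32), clip_lo, clip_hi)
    return (v - clip_lo) / (clip_hi - clip_lo)
```

In a full pipeline this would follow resampling to a common voxel spacing, so every training volume has comparable geometry and intensity statistics.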
Practical Impact
The practical impact of this research is the potential for automating dental segmentation in CBCT images, which can streamline clinical workflows and improve patient care. Specifically, the algorithm can be used to identify high-dose teeth and quantify patient-specific risk factors for osteoradionecrosis (ORN), a severe complication that can occur after radiation therapy. The goal is to integrate this technology into the clinical workflow for head and neck oncology, enabling automatic dental reports that flag high-dose teeth and inform personalized supportive care.
Denoising diffusion models for inverse design of inflatable structures with programmable deformations
Problem
The paper addresses the challenge of designing inflatable structures that can deform into specific shapes under pressure-driven actuation. This is a crucial problem in various fields, such as soft robotics, deployable aerospace systems, biomedical devices, and adaptive architecture.
Analogy
Imagine trying to draw a specific shape with playdough. You need to start with the right initial shape and then gradually mold it into the desired form. The DDPM framework works similarly, but instead of using your hands, it uses mathematical equations to generate images that represent the undeformed structure. These images are then used as inputs to predict how the structure will deform when inflated under specific conditions.
Key Innovation
The researchers present a generative design framework based on denoising diffusion probabilistic models (DDPMs) to tackle this inverse design problem. The framework generates structural designs that deform into prescribed geometries when inflated under fixed boundary conditions. Unlike traditional methods, this approach uses simple images as inputs and outputs, making it more efficient and flexible.
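For readers unfamiliar with DDPMs, the sketch below shows one standard reverse (denoising) step, the basic operation such a framework repeats to turn pure noise into a design image. This is the textbook DDPM update with a placeholder noise estimate, not code from the paper.

```python
import numpy as np

def ddpm_reverse_step(x_t, eps_pred, t, betas, rng=None):
    """One reverse (denoising) step of a standard DDPM.

    x_t: current noisy design image.
    eps_pred: the model's estimate of the noise in x_t.
    betas: the forward-process noise schedule.
    """
    rng = rng or np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])
    # posterior mean of x_{t-1} given x_t and the predicted noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # final step is deterministic
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Run for all timesteps from T down to 0, this chain converts Gaussian noise into a sample; conditioning each step on a target deformed shape is what turns it into an inverse-design tool.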
Practical Impact
This research has significant practical implications for the development of inflatable structures with programmable deformations. The proposed framework can be used to quickly generate diverse undeformed configurations that achieve the desired deformations when inflated, enabling parallel exploration of viable design candidates while accommodating complex constraints. This can lead to breakthroughs in various applications, such as soft robotics, deployable aerospace systems, and biomedical devices.
Motion2Motion: Cross-topology Motion Transfer with Sparse Correspondence
Problem
The paper addresses the long-standing problem of transferring a motion from one character (with a specific topology) to another character with a different topology in computer animation. This is a challenging task, especially when dealing with complex characters like those with skirts or hair.
Analogy
Imagine trying to retarget a dance move from a human to a robot. You wouldn't just copy the exact same movements, but rather try to capture the essence and spirit of the original dance. Motion2Motion does something similar by identifying key points (joints) on both characters' skeletons and aligning them in a way that preserves the core kinematic characteristics of the motion.
In other words, it's not just about matching specific bone movements, but also understanding the underlying dynamics and intent behind the original motion. This allows for more flexible and robust motion transfer across different topologies, making it a powerful tool for animators and motion designers.
Key Innovation
The key innovation is the introduction of Motion2Motion, a novel, training-free framework that enables cross-topology motion transfer with sparse correspondence. The framework assumes only minimal data availability (a few-shot setting) and a sparse joint correspondence between source and target skeletons. This allows for meaningful transfer while avoiding the need for large-scale annotation.
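The idea of sparse correspondence can be shown with a toy sketch: only a few target joints are mapped to source joints, and unmatched joints fall back to a default pose. The dictionaries and scalar "rotations" are illustrative stand-ins for real skeleton data, not the paper's representation.

```python
def transfer_frame(src_pose, correspondence, tgt_joints, rest=0.0):
    """Transfer one frame of motion through a sparse joint correspondence.

    src_pose: {source_joint: rotation} for the current frame.
    correspondence: {target_joint: source_joint}, covering only a few joints.
    Unmatched target joints keep the rest pose.
    """
    return {j: src_pose[correspondence[j]] if j in correspondence else rest
            for j in tgt_joints}
```

A real system would then propagate plausible motion to the unmatched joints (hair, skirts, tails) rather than freezing them, which is where the paper's contribution lies.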
Practical Impact
The practical impact of this research is significant, as it can be applied in various real-world scenarios where motion transfer is crucial, such as animation creation pipelines. The framework's ability to work with minimal data availability makes it a valuable tool for industries where data is scarce or expensive to collect.
IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion
Problem
Reconstructing complete and interactive 3D scenes from partially observed environments is a fundamental challenge in computer vision and robotics. Current approaches often rely on multi-stage pipelines or require dense scanning, which can be error-prone and not easily scalable.
Analogy
Imagine taking multiple photos of the same room from different angles. Each photo captures some parts of the scene, but not everything. IGFuse is like combining those photos to create a single, detailed picture of the entire scene, while also correcting for any gaps or misalignments between them. This allows you to see the whole scene in high quality and even manipulate individual objects within it.
The analogy is imperfect, since IGFuse works with 3D Gaussian fields rather than 2D photos, but it conveys the core idea: combining multiple partial views into a complete, accurate representation of the scene.
Key Innovation
IGFuse is a novel framework that reconstructs interactive Gaussian scenes by fusing observations from multiple scans. This approach leverages natural object rearrangements between captures to reveal previously occluded regions and refine geometry.
Practical Impact
IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines. Its effectiveness for real-world 3D reconstruction and real-to-simulation transfer makes it a valuable tool for various applications, such as robotics, gaming, and architecture.
4DNeX: Feed-Forward 4D Generative Modeling Made Easy
Problem
The main challenge addressed in this paper is generating 4D (dynamic 3D) scene representations from a single image. Current methods require video input or rely on computationally intensive optimization procedures, making it difficult to create a scalable solution for image-to-4D modeling.
Analogy
Think of this research like trying to create a movie from a single still image. You would need to infer how the scene changes over time and what the 3D objects look like from different angles. The proposed framework uses machine learning techniques to make educated guesses about these missing pieces, allowing it to generate dynamic 3D scenes from a single image.
Key Innovation
The key innovation of this work is the development of 4DNeX, a feed-forward framework that fine-tunes a pretrained video diffusion model to enable efficient image-to-4D generation. This approach addresses the scarcity of 4D data by introducing a large-scale dataset with high-quality pseudo-4D annotations and proposes a set of simple yet effective adaptation strategies to repurpose video diffusion models for 4D modeling.
Practical Impact
The practical impact of this research is the potential to create a scalable solution for image-to-4D modeling, enabling applications such as novel-view video synthesis, augmented reality (AR), and digital content creation. The proposed framework can also be used to simulate dynamic scene evolution, laying the foundation for generative 4D world models.
Explainable & Ethical AI
Transparency, fairness, and responsible AI development
A Perfectly Truthful Calibration Measure
Problem
The problem this paper addresses is the need for a truthful calibration measure in machine learning. Calibration measures quantify how well a model's predictions align with the true probabilities of different outcomes. However, existing calibration measures incentivize models to "lie" and appear more calibrated than they actually are. This can lead to poor performance in real-world applications.
Analogy
Imagine a weather forecaster who announces an 80% chance of rain on days when it actually rains only 60% of the time: the forecasts are miscalibrated. Worse, some ways of scoring forecasters reward them for reporting fudged numbers rather than their honest beliefs. ATB is like a scoring rule that cannot be gamed: a forecaster's best strategy is simply to report the probabilities they actually believe.
Key Innovation
The key innovation in this paper is the design of a perfectly truthful calibration measure called Averaged Two-Bin Calibration Error (ATB). ATB is a new type of calibration error that is both truthful and computationally efficient. Unlike existing measures, ATB does not incentivize models to "lie" and can be used to evaluate the calibration of any predictor.
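One way to picture a two-bin calibration measure is to split predictions into two bins at a threshold, measure each bin's gap between mean prediction and empirical frequency, and average over many split points. The sketch below is a simplified guess at the spirit of ATB, not the paper's exact estimator.

```python
import numpy as np

def two_bin_error(preds, labels, threshold):
    """Calibration error from a single two-bin split at `threshold`:
    in each bin, |mean prediction - empirical frequency|, weighted by bin size."""
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    err = 0.0
    for in_bin in (preds <= threshold, preds > threshold):
        if in_bin.any():
            err += in_bin.mean() * abs(preds[in_bin].mean() - labels[in_bin].mean())
    return err

def averaged_two_bin(preds, labels, thresholds=np.linspace(0.05, 0.95, 19)):
    """Average the two-bin calibration error over many split points."""
    return float(np.mean([two_bin_error(preds, labels, t) for t in thresholds]))
```

A perfectly calibrated predictor scores zero, while systematic over- or under-confidence shows up as a positive error regardless of where the split lands.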
Practical Impact
The practical impact of this research is significant. A truthful calibration measure like ATB can be used in a wide range of applications where accurate predictions are critical, such as medical diagnosis, finance, and self-driving cars. By using ATB, developers can ensure that their models are producing reliable and trustworthy predictions.
Generative AI & LLMs
Breakthroughs in language models, text generation, and creative AI systems
Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry
Problem
The European process industry is facing increasing pressures from economic competition and regulatory demands, particularly concerning energy efficiency and greenhouse gas emission reduction targets. To maintain global competitiveness, staying on top of industrial and scientific advancements is a necessity. The growth of retrofitted sensors across various sectors has led to an explosion in data volume, offering opportunities to leverage complex information for enhanced operational efficiency and decision-making.
Analogy
Imagine trying to understand a complex system by looking at individual components in isolation. This is like trying to model industrial processes using channel-independent models. However, these models lack the ability to capture specific cross-variable dynamics that are crucial for predicting real-world outcomes. The CGPT architecture is like a "systemic thinking" approach, where you break down the complex system into smaller pairs of variables and then use those pairs to understand how they interact with each other. This allows the model to capture both channel-dependent interactions and channel-independent generalization, making it a powerful tool for predicting industrial outcomes.
Key Innovation
The Causally-Guided Pairwise Transformer (CGPT) is a novel architecture that integrates a known causal graph as an inductive bias. This approach tackles the CD/CI conflict by decomposing multidimensional data into pairs, using channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables.
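The pairwise decomposition can be sketched in a few lines: a known causal graph is unrolled into (driver, target) variable pairs, each of which could then be fed to a single shared, channel-agnostic module. The dictionary-based graph format is an assumption for illustration.

```python
def causal_pairs(causal_graph):
    """Decompose a multivariate problem into (driver, target) pairs from a
    known causal graph, given as {target: [list of causal parents]}.

    Each pair can be processed by one shared module whose parameters do not
    depend on the total number of variables.
    """
    return [(src, dst)
            for dst, parents in causal_graph.items()
            for src in parents]
```

Because the shared module only ever sees one pair at a time, adding or removing sensors changes the pair list but not the model's parameters, which is what enables any-variate adaptability.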
Practical Impact
The CGPT architecture ensures scalability and any-variate adaptability, making it a significant step towards a versatile, "one-for-all" predictive model for the process industry. By handling arbitrary sensor configurations without architectural changes, CGPT excels at long-term forecasting by leveraging causal drivers, outperforming both channel-independent and channel-dependent baselines.
MuDRiC: Multi-Dialect Reasoning for Arabic Commonsense Validation
Problem
The main problem addressed by this research is the lack of a commonsense dataset for Arabic dialects, despite their prevalence in both spoken contexts and formal settings. Most existing datasets focus on Modern Standard Arabic (MSA), neglecting the rich diversity of Arabic dialects. This gap limits the applicability of models trained on MSA to real-world dialectal content.
Analogy
Imagine trying to understand a conversation between two people speaking different dialects of Arabic. You might struggle to pick up on the nuances of each dialect, even if you're familiar with one of them. That's what's happening when AI systems are trained only on MSA and then applied to real-world dialectal content. The MuDRiC dataset is like a Rosetta Stone for Arabic dialects, providing a common language understanding framework that can help bridge the gap between different dialects.
Key Innovation
The key innovation is the introduction of MuDRiC, a multi-dialect commonsense benchmark covering four major Arabic dialects: Egyptian, Gulf, Levantine, and Moroccan. The research also presents a novel approach adapting Graph Convolutional Networks (GCNs) to Arabic commonsense reasoning, which improves the modeling of semantic relationships for commonsense validation.
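For readers unfamiliar with GCNs, here is a generic single graph-convolution layer: symmetric normalization of the adjacency matrix with self-loops, neighbor aggregation, then a linear map and ReLU. This is the standard GCN formulation, not the paper's specific architecture.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One generic graph-convolution layer.

    A: (n, n) adjacency matrix; X: (n, d_in) node features;
    W: (d_in, d_out) weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt            # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)              # aggregate, project, ReLU
```

Stacking such layers lets each node's representation absorb information from progressively larger neighborhoods, which is how semantic relationships between words or concepts get modeled.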
Practical Impact
The practical impact of this research is the provision of a foundational dataset and a novel methodology for handling the complex variations in Arabic dialects. This will enhance Arabic natural language understanding by enabling AI systems to interpret and generate text in ways that align with human intuition. The dataset and framework can be applied in various real-world scenarios, such as chatbots, voice assistants, or social media platforms.
Improving Detection of Watermarked Language Models
Problem
This paper addresses the problem of detecting text generated by large language models (LLMs). As LLMs become increasingly popular and widely used, it is essential to be able to determine whether a given piece of text was written by a human or produced by an AI model.
Analogy
Imagine trying to identify a specific song by listening to snippets of it. If you only listen to the song's melody, it might be difficult to distinguish it from other similar songs. However, if you also consider the song's lyrics and rhythm, your chances of correctly identifying the song increase significantly. Similarly, in this research, combining watermark-based detection (which looks at the "melody" of AI-generated text) with non-watermark-based detection (which looks at the "lyrics" and "rhythm" of human-written text) improves the accuracy of detecting AI-generated content.
Key Innovation
The key innovation in this work is combining watermark-based detection with non-watermark-based detection approaches to improve the accuracy of first-party detection (i.e., detecting a specific AI model's output). The researchers explore various hybrid schemes and find that these combinations outperform either approach alone under a wide range of experimental conditions.
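A minimal sketch of such a hybrid scheme: squash a watermark z-score into a probability with a logistic function and mix it with a generic AI-text detector's probability. The logistic mapping, equal weighting, and function name are assumptions for illustration; the paper explores a range of combination schemes.

```python
import numpy as np

def hybrid_score(watermark_z, detector_prob, w=0.5):
    """Combine a watermark z-score and a generic detector probability into
    one decision score in [0, 1].

    watermark_z: z-statistic from a watermark test (higher = more watermarked).
    detector_prob: a non-watermark detector's P(AI-generated).
    """
    wm_prob = 1.0 / (1.0 + np.exp(-watermark_z))  # logistic squash to [0, 1]
    return w * wm_prob + (1.0 - w) * detector_prob
```

Thresholding the combined score can outperform either signal alone when the two detectors fail on different kinds of text, e.g. paraphrased watermarked output versus unwatermarked model output.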
Practical Impact
The practical impact of this research is significant. Improved detection methods can help institutions, organizations, and individuals identify whether text was generated by an AI model or not. This has important implications for education, content creation, and intellectual property protection. For example, academic institutions may want to detect whether students are using AI-generated content in their assignments, while LLM providers may need to understand how their models are being used.
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Problem
Large Language Models (LLMs) struggle with two opposite failure modes: underthinking and overthinking. Underthinking occurs when a model reasons too little on challenging problems that require step-by-step thinking; overthinking occurs when it spends excessive reasoning effort on simple queries without any gain in accuracy. This tension has led to separate thinking and non-thinking variants of LLMs, leaving users to decide which model to use for each query.
Analogy
Imagine solving a math problem. You need to think step by step to reach the correct solution, but if you overthink a trivial calculation, you waste time without getting closer to the answer. OptimalThinkingBench is like an exam that checks whether an LLM knows when to step back and think deeply, and when to speed up and answer simple queries directly.
Key Innovation
The OptimalThinkingBench is a new benchmark that simultaneously tracks the progress of optimally-thinking LLMs in terms of both performance and efficiency. It consists of two sub-benchmarks: OverthinkingBench and UnderthinkingBench, which test an LLM's ability to balance its thinking approach depending on the complexity of the query.
Practical Impact
By tracking performance and efficiency together, OptimalThinkingBench can guide the development of a single model that answers simple queries quickly while spending more effort on complex ones. That would eliminate the need for users to choose between thinking and non-thinking variants, making it easier to get results that are both accurate and efficient.
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
Problem
The main problem addressed by this research paper is the lack of spatial intelligence in advanced artificial intelligence (AI) models, particularly in multi-modal large language models (MLLMs). Despite impressive advancements in MLLMs, they often struggle with basic spatial tasks that are trivially easy for humans.
Analogy
Imagine trying to navigate a new city without a map or compass. You might know how to read signs and follow streets, but you'd struggle to understand the layout of the city and find your way around. This is similar to what happens when AI models lack spatial intelligence – they can process text and data, but they struggle to understand and reason about the physical world.
In this study, the researchers evaluated GPT-5, a highly advanced AI model, on various spatial tasks. While GPT-5 demonstrated remarkable strength in some areas, it still fell short of human performance across many tasks. The study also identified more challenging spatial intelligence problems for multi-modal models and found that proprietary models did not exhibit a decisive advantage when facing the most difficult problems.
Key Innovation
What's new and unique about this work is the comprehensive evaluation of state-of-the-art proprietary and open-source models on eight key benchmarks designed to assess spatial intelligence. The study also proposes a unified taxonomy of spatial tasks and discusses challenges in ensuring fair evaluation.
Practical Impact
This research has significant practical implications for the development of artificial general intelligence (AGI). By understanding where AI models stand on the path toward spatial intelligence, researchers can focus on improving these capabilities, which are essential for AGI. Additionally, this study highlights the need for more diverse and challenging benchmarks to evaluate spatial intelligence.
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Problem
Developing large language models requires making decisions with small-scale experiments, which can be unreliable. This problem arises because current benchmarks are not designed to handle uncertainty and noise in model evaluation.
Analogy
Imagine trying to predict the weather by looking at a small sample of temperature readings from different locations. If these readings are noisy (i.e., affected by random variability), you won't be able to make an accurate prediction about the overall weather pattern. Similarly, current language model evaluation benchmarks are often noisy and unreliable, which can lead to inaccurate predictions about large model behavior. By improving the signal-to-noise ratio of benchmarks, we can create more reliable and accurate evaluations that will ultimately lead to better language models.
Key Innovation
This paper introduces a framework for reducing uncertainty in language model evaluation by analyzing the signal (ability to separate better models from worse ones) and noise (sensitivity to random variability between training steps) of benchmarks. The authors propose three interventions to improve signal or noise, such as switching to a metric with better signal and noise or filtering noisy subtasks.
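The signal-to-noise idea can be sketched directly: compare the spread of final scores across different models (signal) with the score jitter across one model's last few training checkpoints (noise). The exact estimator below is illustrative, not the paper's definition.

```python
import numpy as np

def signal_to_noise(final_scores, checkpoint_scores):
    """Illustrative benchmark signal-to-noise ratio.

    final_scores: one benchmark score per model (spread = signal).
    checkpoint_scores: one model's scores over its last training
    checkpoints (step-to-step jitter = noise).
    """
    signal = np.std(final_scores)      # how well the benchmark separates models
    noise = np.std(checkpoint_scores)  # random variability within one model
    return signal / noise
```

A benchmark with a high ratio lets small-scale experiments rank models reliably; one with a low ratio will frequently flip its verdict between adjacent checkpoints.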
Practical Impact
The practical impact of this research is that it provides a framework for creating more reliable benchmarks, which can lead to more accurate predictions about large model behavior. This is particularly important for developing more general-purpose language models that need to be evaluated on diverse benchmarks.
MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Problem
The main problem addressed in this paper is the "training-inference divide" in Masked Diffusion Language Models (MDLMs). During training, tokens are masked at random, but during inference the model unmasks tokens progressively, conditioning at each step on the partially completed sequence it has generated so far. This mismatch between how the model is trained and how it is used can lead to suboptimal performance.
Analogy
Think of MDLMs as a puzzle-solving process. During training, you randomly mask some of the puzzle pieces to help the model learn to fill in the blanks. But during inference, you progressively reveal the correct answers by removing the masks. MDPO is like optimizing the puzzle-solving strategy by learning from intermediate rewards and adjusting the denoising trajectory accordingly.
The analogy is imperfect, but it gives a rough idea of how MDLMs generate text and what MDPO changes about their training.
Key Innovation
The key innovation is the proposal of a novel Masked Diffusion Policy Optimization (MDPO) framework that uses reinforcement learning to optimize denoising trajectories with intermediate rewards. MDPO explicitly trains the model under the same progressive refining schedule used at inference, addressing the training-inference divide overlooked by previous works.
Practical Impact
The practical impact is significant. MDPO matches the performance of the state-of-the-art method with 60× fewer gradient updates and achieves average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. The Running Confidence Remasking (RCR) strategy, which is a plug-in inference replacement, also consistently improves performance.
Agentic AI
Autonomous agents, multi-agent systems, and intelligent decision-making
Contrastive Representations for Temporal Reasoning
Problem
The paper addresses a crucial challenge in artificial intelligence: how can we learn representations that enable efficient planning and temporal reasoning in complex domains? Currently, perception relies on learning state-based representations, while planning is typically achieved through search algorithms like A* or Best First Search (BestFS). This approach can be computationally expensive and may not always lead to optimal solutions.
Analogy
Imagine trying to solve a Rubik's Cube without using an external search algorithm. Traditional approaches would require you to examine each piece individually, searching for the correct move to make. CRTR is like learning a new way of looking at the cube, where the pieces are already arranged in a way that allows you to directly visualize the solution. This "representation" can be used to solve the puzzle without needing to search through all possible combinations.
In essence, CRTR enables us to learn patterns and structures within complex domains, allowing us to make decisions and solve problems more efficiently.
Key Innovation
The authors introduce Contrastive Representations for Temporal Reasoning (CRTR), a novel method that uses a negative sampling scheme to remove spurious features and facilitate temporal reasoning. Unlike standard temporal contrastive learning, CRTR is designed to capture both perceptual and temporal structure, enabling efficient planning and problem-solving.
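Standard temporal contrastive learning, the baseline CRTR builds on, can be sketched as an InfoNCE loss: pull an anchor state toward a temporally nearby positive and push it away from sampled negatives. CRTR's specific negative-sampling scheme is the paper's contribution and is not reproduced here.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """Generic InfoNCE contrastive loss on state embeddings.

    anchor/positive: embeddings of temporally nearby states.
    negatives: list of embeddings of unrelated states.
    """
    def sim(a, b):  # cosine similarity
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / temp
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive is class 0
```

How the negatives are drawn determines which features survive in the representation; CRTR's insight is to sample them so that spurious perceptual features are removed while temporal structure is kept.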
Practical Impact
The CRTR approach has the potential to revolutionize the way we solve complex problems in areas like robotics, logistics, and planning. By learning representations that capture temporal structure, CRTR can reduce or eliminate the need for search algorithms, leading to faster and more efficient solutions. This technology could be applied to real-world domains such as robotic assembly, chemical retrosynthesis, and puzzle-solving.
Bayesian Optimization-based Search for Agent Control in Automated Game Testing
Problem
Automated game testing is a challenging task, especially when it comes to detecting potential bugs within a game level. Traditional methods can be slow, inefficient, or even miss important issues.
Analogy
Imagine you're trying to find a specific location within a large maze. Traditional methods would be like searching the entire maze one step at a time, which can be slow and inefficient. The proposed system is like having a map of the maze that helps you navigate and prioritize your search, allowing you to find the location more quickly and effectively.
Key Innovation
This research introduces a novel approach that combines Bayesian Optimization (BO) with a game testing-specific model built on top of a grid map. This allows for efficient search and exploration of the game level, while also providing scalability and uncertainty estimation required by BO.
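A much-simplified stand-in for such a search step is a UCB-style acquisition over a grid surrogate: favor cells the agent predicts to be valuable or has rarely visited. The paper's actual surrogate model and acquisition function are more sophisticated; this sketch only shows the explore-exploit trade-off.

```python
import numpy as np

def pick_next_cell(means, visits, beta=1.0):
    """Choose the next grid cell to explore with a UCB-style rule.

    means: per-cell predicted value (e.g., chance of new coverage).
    visits: per-cell visit counts; rarely visited cells get an
    uncertainty bonus that shrinks as visits grow.
    """
    ucb = means + beta / np.sqrt(1.0 + visits)
    return np.unravel_index(np.argmax(ucb), ucb.shape)
```

Repeatedly picking the argmax cell, sending the agent there, and updating `means`/`visits` yields a search loop that spreads exploration over the level instead of wandering at random.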
Practical Impact
The proposed system has significant potential to improve automated game testing by increasing map coverage capabilities in both time efficiency and exploration distribution. This could lead to faster bug detection, reduced development costs, and improved overall gaming experience.
Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Problem
The main problem this paper addresses is how to enable robots to effectively navigate dynamic environments where obstacles can move or return to their original positions. This "manipulate-to-navigate" challenge requires the robot to interact with its environment by moving objects out of the way before it can safely move forward.
Analogy
Think of a robot trying to navigate through a crowded room. To get to its destination, it needs to move obstacles (like people) out of the way first. This paper proposes a way for the robot to learn how to effectively "manipulate" these obstacles using visual cues and prior knowledge about what actions are likely to be successful. By doing so, the robot can reduce the complexity of the task and focus on finding the best path forward.
Key Innovation
What's new and unique about this work is a reinforcement learning-based approach that integrates manipulability priors (which help focus the robot on high-manipulability body positions) and visual affordance maps (which select high-quality manipulation actions). This combination reduces unnecessary exploration and allows the robot to learn manipulation strategies more effectively.
Practical Impact
This research has significant practical implications. By enabling robots to successfully navigate dynamic environments, this work has applications in areas such as human assistance, manufacturing, and agriculture, where mobile manipulators can perform complex tasks that involve both navigation and manipulation. The proposed approach could also be used in search and rescue scenarios or other situations where the environment is unpredictable.