MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection

Published: arXiv: 2604.16172v1
Authors

Yeganeh Abdollahinejad, Ahmad Mousavi, Naeemul Hassan, Kai Shu, Nathalie Japkowicz, Shahriar Khosravi, Amir Karami

Abstract

The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content alone, but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping time windows, enabling the framework to capture both short-term fluctuations and longer-term trends in misinformation propagation. In addition, domain-adversarial learning and a prototype memory bank improve domain invariance and stabilize representation learning across datasets. The model is trained using a multi-objective optimization strategy that jointly enforces classification performance, cross-modal alignment, contrastive learning, temporal consistency, and domain robustness. Experiments on Fakeddit, MMCoVaR, Weibo, and XFacta show that MOMENTA achieves strong, consistent results across accuracy, F1-score, AUC, and MCC, highlighting its effectiveness for multimodal misinformation detection.
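The abstract describes modality-specific mixture-of-experts modules for modeling diverse misinformation patterns. As an intuition for how such a module routes an embedding through several experts, here is a minimal, hypothetical sketch of soft MoE gating: each expert is a linear map, and a learned gate produces a convex combination of their outputs. All names, shapes, and the linear-expert choice are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, expert_weights, gate_weights):
    """Route one modality embedding through a soft mixture of experts.

    x: (d,) input embedding for a single modality (e.g. text).
    expert_weights: list of (d, d) matrices, one linear expert each
                    (a real model would use small MLPs).
    gate_weights: (n_experts, d) gating matrix.
    """
    gates = softmax(gate_weights @ x)                    # (n_experts,) sums to 1
    outputs = np.stack([W @ x for W in expert_weights])  # (n_experts, d)
    return gates @ outputs                               # (d,) gated combination

d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
gate = rng.standard_normal((n_experts, d))
text_emb = rng.standard_normal(d)
fused = moe_layer(text_emb, experts, gate)
print(fused.shape)  # (8,)
```

Because the gate weights are a softmax, the output is always a convex combination of expert outputs, which keeps routing differentiable and trainable end to end.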

Paper Summary

Problem
The widespread dissemination of misinformation on social media has become a significant challenge. Misleading content can arise not only from textual or visual content alone, but also from semantic inconsistencies between modalities and from how those narratives evolve over time. Existing methods for multimodal misinformation detection have limitations: they typically model cross-modal interactions statically and show limited robustness across heterogeneous datasets, domains, and narrative settings.
Key Innovation
MOMENTA is a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. It employs modality-specific mixture-of-experts modules, bidirectional co-attention, and an explicit discrepancy-aware branch to capture semantic alignment and inconsistency between textual and visual modalities. MOMENTA also introduces an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping time windows to model narrative evolution.
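The temporal aggregation mechanism described above combines per-window embeddings with drift and momentum signals. A plausible reading is: drift captures short-term change between adjacent windows (a first difference), momentum captures the longer-term trend (an exponential moving average), and an attention step pools the augmented sequence. The sketch below is a minimal illustration under those assumptions; the function names, the mean-query attention, and the EMA coefficient are all hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_aggregate(window_embs, beta=0.9):
    """Attention-pool a sequence of time-window embeddings,
    augmented with drift (first difference) and momentum (EMA) features."""
    E = np.asarray(window_embs)                  # (T, d), one row per window
    drift = np.diff(E, axis=0, prepend=E[:1])    # short-term change; first row is zero
    momentum = np.zeros_like(E)                  # longer-term trend via EMA
    m = np.zeros(E.shape[1])
    for t in range(E.shape[0]):
        m = beta * m + (1 - beta) * E[t]
        momentum[t] = m
    feats = np.concatenate([E, drift, momentum], axis=1)  # (T, 3d)
    query = feats.mean(axis=0)                   # simple pooled query (assumption)
    scores = feats @ query / np.sqrt(feats.shape[1])
    attn = softmax(scores)                       # one weight per window
    return attn @ feats                          # (3d,) aggregated representation

rng = np.random.default_rng(1)
windows = rng.standard_normal((5, 16))           # 5 overlapping windows, d = 16
agg = temporal_aggregate(windows)
print(agg.shape)  # (48,)
```

Overlapping windows mean adjacent rows of `E` share posts, so the drift rows stay small under gradual narrative change and spike when the narrative shifts abruptly, which is exactly the signal attention can then emphasize.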
Practical Impact
MOMENTA has the potential to improve misinformation detection on social media by capturing the complex interactions between modalities together with their temporal dynamics. This can help reduce the spread of false or misleading content, which carries significant social, political, and public health consequences. Because the framework is evaluated across platforms and languages (Fakeddit, MMCoVaR, Weibo, and XFacta), it offers a practical basis for deployment in varied settings.
Analogy / Intuitive Explanation
Imagine trying to detect a forgery in a painting. A good forgery might look similar to the original at first glance, but upon closer inspection, it might reveal inconsistencies in the brushstrokes, colors, or composition. Similarly, MOMENTA's framework detects misinformation by analyzing the semantic inconsistencies between textual and visual content, as well as the temporal dynamics of the content's evolution. By combining these insights, MOMENTA can identify potential misinformation and help to prevent its spread.
Paper Information
Categories: cs.MM
Published Date:
arXiv ID: 2604.16172v1