Sample-efficient Integration of New Modalities into Large Language Models

Generative AI & LLMs
Published on arXiv: 2509.04606v1
Authors

Osman Batur İnce, André F. T. Martins, Oisin Mac Aodha, Edoardo M. Ponti

Abstract

Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector -- placed between modality-specific encoders and an LLM -- to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64× more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
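The core mechanism in the abstract, a hypernetwork that emits projector weights from a handful of samples, is easiest to see in code. Below is a minimal PyTorch sketch under assumptions of our own: a single linear projector, mean-pooled conditioning over the support set, and a fixed encoder dimensionality (the actual SEMI hypernetwork additionally handles encoders of arbitrary embedding dimensionality). All names here (HyperProjector, support, etc.) are hypothetical, not the paper's.

```python
import torch
import torch.nn as nn

class HyperProjector(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): a
    hypernetwork that, conditioned on a few embeddings from a new
    modality, generates the weights of a linear projector mapping
    encoder features into the LLM's embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int, hidden: int = 512):
        super().__init__()
        self.enc_dim, self.llm_dim = enc_dim, llm_dim
        # Summarize the few-shot support set into one condition vector.
        self.condition = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Heads that emit the projector's weight matrix and bias.
        self.weight_head = nn.Linear(hidden, llm_dim * enc_dim)
        self.bias_head = nn.Linear(hidden, llm_dim)

    def forward(self, support: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # support: (k, enc_dim) few-shot samples from the new modality
        # feats:   (n, enc_dim) embeddings to project at inference time
        cond = self.condition(support).mean(dim=0)  # permutation-invariant pooling
        W = self.weight_head(cond).view(self.llm_dim, self.enc_dim)
        b = self.bias_head(cond)
        return feats @ W.T + b  # (n, llm_dim): soft tokens for the LLM

# Toy usage: condition on 32 samples from an unseen 384-dim encoder.
hyper = HyperProjector(enc_dim=384, llm_dim=4096)
support = torch.randn(32, 384)
feats = torch.randn(8, 384)
print(hyper(support, feats).shape)  # torch.Size([8, 4096])
```

Note the design choice this sketch illustrates: the projector's parameters are an output of the hypernetwork rather than values learned per modality, so adapting to a new modality costs only a forward pass over a few support samples instead of a training run.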

Paper Summary

Problem
The paper addresses the challenge of integrating new modalities into large language models (LLMs) when only a small amount of paired data is available. This is a crucial problem because the space of possible modalities is large and keeps evolving, so training a model from scratch for every new modality is infeasible, and low-resource modalities rarely come with the large paired datasets that conventional integration requires.
Key Innovation
The key innovation is sample-efficient modality integration (SEMI): a hypernetwork that adapts a shared projector, sitting between a modality-specific encoder and the LLM, to any modality given only a few samples. Trained on high-resource modalities whose diversity is artificially multiplied through isometric transformations of the encoders, the hypernetwork can generate a suitable adapter for an arbitrary new modality at inference time, extending multimodal coverage to low-resource modalities.
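To diversify training beyond the four high-resource modalities, the number of encoders is multiplied via isometric transformations. One simple isometry is a random orthogonal rotation of the embedding space, sketched below; the function name and the QR-based construction are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import torch

def isometric_variant(embeddings: torch.Tensor, seed: int) -> torch.Tensor:
    """Apply a random orthogonal (hence distance-preserving) map to an
    encoder's embeddings, simulating a distinct 'new' encoder whose
    geometry is identical but whose basis differs."""
    d = embeddings.shape[-1]
    g = torch.Generator().manual_seed(seed)
    # QR decomposition of a Gaussian matrix yields an orthogonal Q.
    Q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return embeddings @ Q  # norms and pairwise distances are preserved

# Each seed acts as one synthetic training encoder for the hypernetwork.
emb = torch.randn(16, 256)
variant_a = isometric_variant(emb, seed=0)
variant_b = isometric_variant(emb, seed=1)
```

Because an orthogonal map preserves all distances and angles, each rotated copy presents the hypernetwork with the same semantic geometry in a different basis, which is exactly the kind of variation it must handle when a genuinely new encoder arrives.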
Practical Impact
The practical impact is that new modalities can be attached to an existing LLM without collecting large paired datasets. This matters for domains such as geo-location (satellite images), astronomy (astronomical images), navigation (inertial measurements), and biology and medicine (molecules), where paired data is scarce yet multimodal models could help solve complex problems.
Analogy / Intuitive Explanation
Imagine trying to learn a new language from only a few sentences, without any context or prior knowledge. It would be difficult. That is the situation an LLM faces when a new modality arrives with little paired data. SEMI supplies the adaptation mechanism: from just a few samples of the new modality, its hypernetwork generates a projector that lets the LLM interpret the new inputs and generate text about them.
Paper Information
Categories: cs.CL cs.AI cs.CV
arXiv ID: 2509.04606v1
