Tabular foundation models for in-context prediction of molecular properties

Published: arXiv: 2604.16123v1
Authors

Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos

Abstract

Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate frozen molecular foundation model representations as well as classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100% win rates on 30 MoleculeACE tasks, while compact RDKit2d and Mordred descriptors provide strong descriptor-based alternatives. Molecular representation emerges as a key determinant in TFM performance, with molecular foundation model embeddings and 2D descriptor sets both providing substantial gains over classic molecular fingerprints on many tasks. These results suggest that in-context learning with TFMs provides a highly accurate and cost-efficient alternative for property prediction in practical applications.
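
The abstract reports results as win rates across tasks. The sketch below shows one common way such a number can be computed: the fraction of tasks on which one method has a strictly lower error than another. This is an assumption for illustration; the paper may define win rate differently (for example, with tie handling or significance tests). The function name `win_rate` and the per-task error values are hypothetical placeholders, not results from the paper.

```python
def win_rate(errors_a, errors_b):
    """Fraction of tasks on which method A has strictly lower error than method B.

    Assumed definition for illustration; ties count as non-wins for A.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty error lists of equal length")
    wins = sum(a < b for a, b in zip(errors_a, errors_b))
    return wins / len(errors_a)


# Placeholder per-task errors (e.g. RMSE on four tasks). NOT data from the paper.
method_a = [0.61, 0.72, 0.55, 0.80]
method_b = [0.65, 0.70, 0.60, 0.91]

rate = win_rate(method_a, method_b)
print(f"win rate of A over B: {rate:.0%}")  # A wins on 3 of 4 tasks -> 75%
```

Under this definition, a 100% win rate on 30 tasks would mean the method had the lower error on every one of the 30 tasks against the method it is compared with.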

Paper Summary

Problem
Predicting molecular properties is central to drug discovery, catalysis, and process design. In practice, however, datasets are often small, which makes accurate prediction difficult. Molecular foundation models, the usual remedy, typically need task-specific fine-tuning and often fail to outperform classical baselines.
Key Innovation
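The paper evaluates tabular foundation models (TFMs) as an alternative paradigm for molecular property prediction. Unlike molecular foundation models, which typically require task-specific fine-tuning and machine learning expertise, TFMs predict through in-context learning: labeled examples are supplied as context at inference time, so no task-specific training is needed. The molecules are presented to the TFM as rows of features, either frozen molecular foundation model embeddings (such as CheMeleon) or classical descriptors and fingerprints (such as RDKit2d and Mordred).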
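
The in-context workflow described above can be sketched as follows. This is a minimal illustration of the interface only, not the paper's method: the class `InContextRegressor` is a hypothetical stand-in that uses distance-weighted nearest neighbours in place of a real TFM, and the feature rows are placeholder numbers rather than actual molecular descriptors. What it shares with a TFM is the calling pattern: `fit` stores the labeled context without any gradient-based training, and `predict` conditions on that context to answer queries.

```python
import math


class InContextRegressor:
    """Hypothetical stand-in with a TFM-like interface.

    NOT a tabular foundation model: predictions come from distance-weighted
    k-nearest neighbours. It only mirrors the workflow, in which the labeled
    rows act as context and no task-specific training takes place.
    """

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Fitting" only stores the context rows and their labels.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        preds = []
        for query in X:
            # Rank context rows by Euclidean distance to the query row.
            ranked = sorted(
                (math.dist(query, row), target)
                for row, target in zip(self.X_, self.y_)
            )
            nearest = ranked[: self.k]
            # Closer context rows get larger weights.
            weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
            total = sum(w * target for w, (_, target) in zip(weights, nearest))
            preds.append(total / sum(weights))
        return preds


# Placeholder feature rows standing in for descriptors or embeddings,
# with placeholder property values. NOT real molecular data.
context_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_y = [0.0, 1.0, 1.0, 2.0]

model = InContextRegressor(k=3).fit(context_X, context_y)
predictions = model.predict([[0.9, 0.1], [0.1, 0.9]])
print(predictions)
```

With an actual TFM, the stand-in class would be swapped for the pretrained model, and the placeholder rows for descriptor or embedding vectors computed from the molecules; the fit-then-predict calling pattern would stay the same.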
Practical Impact
TFMs can be applied in real-world scenarios such as:
* Predicting the properties of molecules in the pharmaceutical industry, reducing the need for costly experiments and speeding up the development of new drugs.
* Optimizing the design of chemical processes and catalysts, leading to more efficient and environmentally friendly production methods.
* Enabling data-driven decision-making in industries where reliable estimates of molecular properties are essential.
Analogy / Intuitive Explanation
Imagine a large library in which each book represents a molecule with its own properties. Traditional methods are like asking a librarian (a machine learning expert) to rebuild the search system for every new question. TFMs are like a well-read librarian who is shown a handful of example books with known answers and can then judge a new book from its title and description alone, with no retraining. The examples play the role of the in-context data, and the titles and descriptions play the role of the molecular representation.
Paper Information
Categories: cs.LG, physics.chem-ph
arXiv ID: 2604.16123v1