Overview

Large atomic models (LAMs), also known as machine learning interatomic potentials (MLIPs), are foundation models that predict atomic interactions across diverse systems using data-driven approaches. LAMBench is a benchmark designed to evaluate the performance of such models. It provides a comprehensive suite of tests and metrics that help developers and researchers understand the accuracy and generalizability of their models.

Our mission is to:

  • Provide a comprehensive benchmark: Covering diverse atomic systems across multiple domains, moving beyond domain-specific benchmarks.
  • Align with real-world applications: Bridging the gap between model performance on benchmarks and their impact on scientific discovery.
  • Enable clear model differentiation: Offering high discriminative power to distinguish between models with varying performance.
  • Facilitate continuous improvement: Creating dynamically evolving benchmarks that grow with the community, integrating new tasks and models.

Features

  • Easy to Use: Simple setup and configuration to get started quickly.
  • Extensible: Easily add new benchmarks and metrics.
  • Detailed Reports: Generates detailed performance reports and visualizations.

LAMBench Leaderboard

The LAMBench Leaderboard. $\bar{M}^m_{\mathrm{FF}}$ refers to the generalizability error on force-field prediction tasks, while $\bar{M}^m_{\mathrm{PC}}$ denotes the generalizability error on domain-specific tasks. $M_{\mathrm{E}}^m$ stands for the efficiency metric, and $M^m_{\mathrm{IS}}$ refers to the instability metric. Arrows alongside the metrics denote whether a higher or lower value corresponds to better performance.

| Model | Generalizability $\bar{M}^m_{\mathrm{FF}}$ (↓) | Generalizability $\bar{M}^m_{\mathrm{PC}}$ (↓) | Applicability $M_{\mathrm{E}}^m$ (↑) | Applicability $M^m_{\mathrm{IS}}$ (↓) |
|---|---|---|---|---|
| DPA-3.0-7M | 0.245 | 0.161 | 0.151 | 0.291 |
| DPA-2.4-7M | 0.265 | 0.208 | 0.614 | 0.039 |
| Orb-v3 | 0.280 | 0.240 | 0.400 | 0.000 |
| DPA-3.0-3M | 0.338 | 0.257 | 0.296 | 0.480 |
| GRACE-2L-OAM | 0.340 | 0.262 | 0.678 | 0.309 |
| SevenNet-l3i5 | 0.355 | 0.240 | 0.279 | 0.036 |
| MACE-MPA-0 | 0.356 | 0.291 | 0.291 | 0.000 |
| Orb-v2 | 0.356 | 0.560 | 1.343 | 2.649 |
| SevenNet-MF-ompa | 0.358 | 0.300 | 0.088 | 0.000 |
| SevenNet-0 | 0.369 | 0.246 | 0.760 | 0.556 |
| MatterSim-v1-5M | 0.389 | 0.280 | 0.388 | 0.000 |
| MACE-MP-0 | 0.405 | 0.341 | 0.291 | 0.089 |

LAMBench Metrics Calculations

Generalizability

Force Field Prediction

We categorize all force-field prediction tasks into five domains:

  • Inorganic Materials: Torres2019Analysis, Batzner2022equivariant, SubAlex_9k, Sours2023Applications, Lopanitsyna2023Modeling_A, Lopanitsyna2023Modeling_B, Dai2024Deep, WBM_25k
  • Small Molecules: ANI-1x
  • Catalysis: Vandermause2022Active, Zhang2019Bridging, Zhang2024Active, Villanueva2024Water
  • Reactions: Gasteiger2020Fast, Guan2022Benchmark
  • Biomolecules/Supramolecules: MD22, AIMD-Chig

To assess model performance across these domains, we use zero-shot inference with energy-bias term adjustments based on test dataset statistics. Performance metrics are aggregated as follows:

  1. The error metric is normalized against the error metric of a baseline (dummy) model as follows: $\hat{M}^m_{k,p,i} = \frac{M^m_{k,p,i}}{M^{\mathrm{dummy}}_{k,p,i}}$

where $M^m_{k,p,i}$ is the original error metric, $m$ indicates the model, $k$ denotes the domain index, $p$ signifies the prediction type, and $i$ represents the test-set index. For instance, in force-field tasks the domains are $k \in \{\text{Small Molecules, Inorganic Materials, Biomolecules, Reactions, Catalysis}\}$. The prediction types are categorized as energy ($E$), force ($F$), or virial ($V$), with $p \in \{E, F, V\}$. For the specific domain of Reactions, the test sets are indexed as $i \in \{\text{Guan2022Benchmark, Gasteiger2020Fast}\}$. The baseline model predicts energy based solely on the chemical formula, disregarding any structural details, thereby providing a reference point for evaluating the improvement offered by more sophisticated models.
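To make the baseline concrete, a dummy model of this kind can be realized as a least-squares fit of per-element reference energies to the training compositions; the energy of a frame is then just a linear function of its element counts. The function names and the toy data below are illustrative, not part of LAMBench:

```python
import numpy as np

# Hypothetical sketch of a composition-only "dummy" baseline: predict a
# frame's total energy as a fitted linear function of its element counts,
# ignoring all structural information.

def fit_dummy_model(compositions, energies):
    """Fit per-element reference energies via least squares.

    compositions: (n_frames, n_elements) array of element counts
    energies:     (n_frames,) array of reference total energies
    """
    coeffs, *_ = np.linalg.lstsq(compositions, energies, rcond=None)
    return coeffs

def predict_dummy(compositions, coeffs):
    """Energy prediction from composition alone."""
    return compositions @ coeffs

# Toy data: frames of two element types with per-atom energies of
# roughly -3.0 and -5.0 eV, plus small noise.
rng = np.random.default_rng(0)
X = rng.integers(1, 20, size=(50, 2)).astype(float)
y = X @ np.array([-3.0, -5.0]) + rng.normal(0, 0.1, size=50)

coeffs = fit_dummy_model(X, y)
rmse = np.sqrt(np.mean((predict_dummy(X, coeffs) - y) ** 2))
```

Because the fit sees only compositions, its residual error reflects exactly the structural information the baseline ignores, which is what the normalization in step 1 divides out.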

  2. For each domain, we compute the log-average of the normalized metrics across all datasets within this domain by

     $\bar{M}^m_{k,p} = \exp\left(\frac{1}{n_{k,p}}\sum_{i=1}^{n_{k,p}}\log \hat{M}^m_{k,p,i}\right)$

where $n_{k,p}$ denotes the number of test sets for domain $k$ and prediction type $p$.

  3. Subsequently, we calculate a weighted dimensionless domain error metric to encapsulate the overall error across the various prediction types:

     $\bar{M}^m_{k} = \sum_p w_{p} \bar{M}^m_{k,p} \Bigg/ \sum_p w_{p}$

where $w_{p}$ denotes the weight assigned to each prediction type $p$.

  4. Finally, the generalizability error metric of a model across all domains is defined as the average of the domain-wise error metrics,

$\bar{M}^m = \frac{1}{n_D}\sum_{k=1}^{n_D}\bar{M}^m_{k}$

where $n_D$ denotes the number of domains under consideration.
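The four aggregation steps can be sketched directly from the formulas above; this is a minimal illustration assuming the per-test-set errors have already been normalized by the dummy baseline (step 1), with made-up input values:

```python
import math

# metrics[domain][prediction_type] is a list of per-test-set errors,
# already normalized by the dummy baseline (step 1 above).

def log_average(values):
    """Step 2: geometric mean of normalized metrics within a domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def domain_error(per_type, weights):
    """Step 3: weighted average over prediction types."""
    num = sum(weights[p] * log_average(vals) for p, vals in per_type.items())
    den = sum(weights[p] for p in per_type)
    return num / den

def generalizability_error(metrics, weights):
    """Step 4: arithmetic mean of domain-wise errors."""
    return sum(domain_error(t, weights) for t in metrics.values()) / len(metrics)

# Toy input: two domains, energy and force errors, w_E = w_F = 0.5.
metrics = {
    "Inorganic Materials": {"E": [0.2, 0.4], "F": [0.3]},
    "Small Molecules": {"E": [0.1], "F": [0.2]},
}
score = generalizability_error(metrics, {"E": 0.5, "F": 0.5})
```

The log-average in step 2 keeps a single easy test set from dominating a domain, while the arithmetic mean in step 4 weights all domains equally regardless of how many test sets each contains.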

The generalizability error metric $\bar{M}^m$ allows for the comparison of generalizability across different models. It reflects the overall generalization capability across all domains, prediction types, and test sets, with a lower error indicating superior performance. The only tunable parameters are the weights assigned to the prediction types, thereby minimizing arbitrariness in the comparison system.

For the force-field generalizability tasks, we adopt the RMSE as the error metric. The prediction types include energy and force, with weights assigned as $w_E = w_F = 0.5$. When periodic boundary conditions are assumed and virial labels are available, virial predictions are also considered; in this scenario, the weights are adjusted to $w_E = w_F = 0.45$ and $w_V = 0.1$. The resulting error is referred to as $\bar{M}^m_{\mathrm{FF}}$.

The error metric is designed such that a dummy model, which predicts system energy solely from chemical formulae, results in $\bar{M}^m_{\mathrm{FF}} = 1$. In contrast, an ideal model that perfectly matches the Density Functional Theory (DFT) labels achieves $\bar{M}^m_{\mathrm{FF}} = 0$.

Domain Specific Property Calculation

For the domain-specific property tasks, we adopt the MAE as the error metric. In the Inorganic Materials domain, the MDR phonon benchmark predicts the maximum phonon frequency, entropy, free energy, and heat capacity at constant volume, with each prediction type assigned a weight of 0.25. In the Small Molecules domain, the TorsionNet500 benchmark predicts the torsion profile energy, the torsion barrier height, and the number of molecules for which the model's prediction of the torsional barrier height errs by more than 1 kcal/mol; each prediction type in this domain is assigned a weight of $\frac{1}{3}$. The resulting score is denoted as $\bar{M}^m_{\mathrm{PC}}$.

Applicability

Efficiency

To assess model efficiency, we randomly selected 2000 frames from the Inorganic Materials and Catalysis domains of the aforementioned out-of-distribution datasets. Each frame was expanded to between 800 and 1000 atoms by replicating the unit cell, ensuring that inference efficiency is measured within the regime of convergence. The first 20% of the test samples were treated as a warm-up phase and excluded from the timing; we report the average efficiency over the remaining 1600 frames.
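One simple way to realize the expansion step is to replicate the unit cell isotropically until the atom count reaches the target window. The isotropic heuristic and the function below are an illustrative assumption, not necessarily how LAMBench chooses its replication factors:

```python
# Hypothetical sketch of supercell expansion for timing: replicate the
# unit cell equally along each axis until the atom count reaches the
# lower edge of the target window.

def replication_factors(n_atoms, target_min=800, target_max=1000):
    """Smallest isotropic (n, n, n) replication reaching >= target_min atoms."""
    n = 1
    while n_atoms * n**3 < target_min:
        n += 1
    total = n_atoms * n**3
    # The window cannot always be hit exactly; report whether it was.
    in_window = target_min <= total <= target_max
    return (n, n, n), total, in_window

# e.g. a 4-atom fcc conventional cell -> 6x6x6 supercell with 864 atoms
factors, total, ok = replication_factors(4)
```

In practice one could also choose per-axis factors independently for non-cubic cells so the supercell stays roughly isotropic in extent.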

We define an efficiency score, $M_E^m$, by normalizing the average inference time (in $\mathrm{\mu s/atom}$), $\bar\eta^m$, of a given LAM measured over the 1600 configurations with respect to an artificial reference value, thereby rescaling it to a range between zero and positive infinity. A larger value indicates higher efficiency.

$M_E^m = \frac{\eta^0}{\bar\eta^m},\quad \eta^0 = 100\ \mathrm{\mu s/atom},\quad \bar\eta^m = \frac{1}{1600}\sum_{i=1}^{1600}\eta_i^m$

where $\eta_i^m$ is the inference time of configuration $i$ for model $m$.
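The score itself is a one-line normalization; the sketch below uses made-up timing values for illustration:

```python
# Efficiency score: normalize the mean per-atom inference time
# (in microseconds/atom) by the 100 us/atom reference value.

ETA_REF = 100.0  # reference inference time eta^0, us/atom

def efficiency_score(timings_us_per_atom):
    """M_E = eta^0 / mean(eta_i); larger means faster."""
    mean_eta = sum(timings_us_per_atom) / len(timings_us_per_atom)
    return ETA_REF / mean_eta

# A model averaging 250 us/atom scores 100 / 250 = 0.4.
score = efficiency_score([240.0, 250.0, 260.0])
```

A model running exactly at the reference speed scores 1; faster models score above 1, slower ones below.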

Stability

Stability is quantified by measuring the total energy drift in NVE simulations across nine structures. For each simulation trajectory, an instability metric is defined based on the magnitude of the slope obtained via linear regression of total energy per atom versus simulation time. A tolerance value of $5\times10^{-4}\ \mathrm{eV/atom/ps}$ is set as three times the statistical uncertainty in estimating the slope from a 10 ps NVE-MD trajectory using the MACE-MPA-0 model. If the measured slope is smaller than the tolerance, the energy drift is considered negligible. We define the dimensionless measure of instability for structure $i$ as follows:

If the computation is successful:

$M^m_{\mathrm{IS},i} = \max\left(0, \log_{10}\left(\frac{\Phi_{i}}{\Phi_{\mathrm{tol}}}\right)\right)$

Otherwise:

$M^m_{\mathrm{IS},i} = 5$

where $\Phi_i$ represents the magnitude of the energy-drift slope for structure $i$ and $\Phi_{\mathrm{tol}} = 5\times10^{-4}\ \mathrm{eV/atom/ps}$ denotes the tolerance. This metric indicates the relative order of magnitude of the slope compared to the tolerance. In cases where an MD simulation fails, a penalty of 5 is assigned, representing a drift five orders of magnitude larger than the typical statistical uncertainty in measuring the slope. The final instability metric is computed as the average over all nine structures.

$M^m_{\mathrm{IS}} = \frac{1}{9}\sum_{i=1}^{9} M^m_{\mathrm{IS},i}$

This result is bounded within $[0, +\infty)$, where a lower value signifies greater stability.
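Putting the pieces together, the per-structure metric can be sketched as a least-squares slope fit followed by the thresholded logarithm above; the trajectory values below are synthetic, and the function names are illustrative:

```python
import math

PHI_TOL = 5e-4  # tolerance Phi_tol, eV/atom/ps

def drift_slope(times_ps, energies_ev_per_atom):
    """Least-squares slope of total energy per atom vs. simulation time."""
    n = len(times_ps)
    t_mean = sum(times_ps) / n
    e_mean = sum(energies_ev_per_atom) / n
    num = sum((t - t_mean) * (e - e_mean)
              for t, e in zip(times_ps, energies_ev_per_atom))
    den = sum((t - t_mean) ** 2 for t in times_ps)
    return num / den

def instability(slope=None, failed=False):
    """Per-structure instability; 0 if the drift is below tolerance,
    5 if the MD simulation failed outright."""
    if failed:
        return 5.0
    return max(0.0, math.log10(abs(slope) / PHI_TOL))

# Synthetic trajectory drifting at 0.05 eV/atom/ps: two orders of
# magnitude above tolerance, so the metric is 2.
times = [0.0, 1.0, 2.0, 3.0]
energies = [0.0, 0.05, 0.10, 0.15]
m_is = instability(drift_slope(times, energies))
```

Averaging this value over the nine benchmark structures then yields $M^m_{\mathrm{IS}}$ as defined above.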