Large atomistic models (LAMs), also known as machine learning interatomic potentials (MLIPs), are foundation models that predict atomic interactions across diverse systems using data-driven approaches. LAMBench is a benchmark designed to evaluate the performance of such models: it provides a comprehensive suite of tests and metrics that help developers and researchers understand the accuracy and generalizability of their models.
$\bar{\mathcal{E}}_{\mathrm{FF}}$ refers to the generalizability error on force-field prediction tasks, while $\bar{\mathcal{E}}_{\mathrm{DS}}$ denotes the generalizability error on domain-specific tasks. $\bar{\nu}$ stands for the efficiency metric, and $\bar{\mathcal{I}}$ refers to the instability metric. Arrows alongside the metrics denote whether a higher or lower value corresponds to better performance.
| Model | Generalizability $\bar{\mathcal{E}}_{\mathrm{FF}}$ ↓ | Generalizability $\bar{\mathcal{E}}_{\mathrm{DS}}$ ↓ | Applicability $\bar{\nu}$ ↑ | Applicability $\bar{\mathcal{I}}$ ↓ |
|---|---|---|---|---|
| DPA-3.1-3M | 0.175 | 0.322 | 0.261 | 0.572 |
| Orb-v3 | 0.215 | 0.414 | 0.396 | 0.000 |
| DPA-2.4-7M | 0.241 | 0.342 | 0.617 | 0.039 |
| GRACE-2L-OAM | 0.251 | 0.404 | 0.639 | 0.309 |
| Orb-v2 | 0.253 | 0.601 | 1.341 | 2.649 |
| SevenNet-MF-ompa | 0.255 | 0.455 | 0.084 | 0.000 |
| MatterSim-v1-5M | 0.283 | 0.467 | 0.393 | 0.000 |
| MACE-MPA-0 | 0.308 | 0.425 | 0.293 | 0.000 |
| SevenNet-l3i5 | 0.326 | 0.397 | 0.272 | 0.036 |
| MACE-MP-0 | 0.351 | 0.472 | 0.296 | 0.089 |
Figure 1: Generalizability on force-field prediction tasks, $1 - \bar{\mathcal{E}}_{\mathrm{FF}}$.
Figure 2: Accuracy-efficiency trade-off, $1 - \bar{\mathcal{E}}_{\mathrm{FF}}$ vs. $\bar{\nu}$.
We categorize all force-field prediction tasks into three domains:

- Inorganic Materials: Torres2019Analysis, Batzner2022equivariant, Sours2023Applications, Lopanitsyna2023Modeling, Mazitov2024Surface, Gao2025Spontaneous
- Molecules: ANI-1x, MD22, AIMD-Chig
- Catalysis: Vandermause2022Active, Zhang2019Bridging, Villanueva2024Water
To assess model performance across these domains, we use zero-shot inference with an energy-bias term adjustment estimated from the test dataset statistics (a sketch of this adjustment is given below).
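The energy-bias adjustment can be pictured as a least-squares fit of per-element energy shifts to the prediction residuals. Below is a minimal sketch under that assumption; `adjust_energy_bias` and its inputs are hypothetical, not LAMBench's actual API:

```python
import numpy as np

def adjust_energy_bias(compositions, e_pred, e_ref):
    """Apply a per-element energy-bias correction to zero-shot predictions.

    compositions : (n_frames, n_elements) atom counts for each frame
    e_pred       : (n_frames,) zero-shot model energies
    e_ref        : (n_frames,) reference (DFT) energies from the test set
    """
    # Least-squares fit of per-element shifts to the prediction residuals:
    # compositions @ bias ~= e_ref - e_pred
    bias, *_ = np.linalg.lstsq(compositions, e_ref - e_pred, rcond=None)
    return e_pred + compositions @ bias

# Hypothetical usage: three frames containing H and O
comps  = np.array([[2.0, 1.0], [4.0, 2.0], [2.0, 2.0]])
e_pred = np.array([-14.1, -28.0, -18.9])
e_ref  = np.array([-14.8, -29.5, -19.8])
print(adjust_energy_bias(comps, e_pred, e_ref))
```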
Performance metrics are aggregated as follows. The error metric is normalized against that of a baseline (dummy) model:

$$\hat{\varepsilon}_{m,d,p,t} = \min\left(\frac{\varepsilon_{m,d,p,t}}{\varepsilon_{\mathrm{dummy},d,p,t}},\; 1\right)$$

where $\varepsilon_{m,d,p,t}$ is the original error metric, $m$ indicates the model, $d$ denotes the domain index, $p$ signifies the prediction index, and $t$ represents the test set index. For a model with worse accuracy than the dummy model, the normalized error metric is set to 1. For instance, in force-field tasks the domains include Molecules, Inorganic Materials, and Catalysis, such that $d \in \{\text{Molecules}, \text{Inorganic Materials}, \text{Catalysis}\}$. The prediction types are categorized as energy ($e$), force ($f$), or virial ($v$), with $p \in \{e, f, v\}$. For the Molecules domain, the test sets are indexed as $t \in \{\text{ANI-1x}, \text{MD22}, \text{AIMD-Chig}\}$. The dummy model predicts energy based solely on the chemical formula, disregarding any structural details, thereby providing a reference point for evaluating the improvement offered by more sophisticated models.
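For illustration with hypothetical numbers: if a model's energy RMSE on a given test set is 0.05 eV/atom while the dummy model's is 0.20 eV/atom, the normalized error is $\min(0.05/0.20,\,1) = 0.25$; a model with an RMSE of 0.30 eV/atom would be capped at 1.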
For each domain, we compute the log-average of the normalized metrics across all datasets within this domain:

$$\bar{\varepsilon}_{m,d,p} = \exp\left(\frac{1}{N_{d,p}} \sum_{t=1}^{N_{d,p}} \log \hat{\varepsilon}_{m,d,p,t}\right)$$

where $N_{d,p}$ denotes the number of test sets for domain $d$ and prediction type $p$.
Subsequently, we calculate a weighted dimensionless domain error metric to encapsulate the overall error across prediction types:

$$\bar{\varepsilon}_{m,d} = \frac{\sum_{p} w_p\, \bar{\varepsilon}_{m,d,p}}{\sum_{p} w_p}$$

where $w_p$ denotes the weight assigned to prediction type $p$.
Finally, the generalizability error metric of a model across all domains is defined as the average of the domain-wise error metrics:

$$\bar{\mathcal{E}}_{m} = \frac{1}{N_d} \sum_{d=1}^{N_d} \bar{\varepsilon}_{m,d}$$

where $N_d$ denotes the number of domains under consideration.
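Putting the four aggregation steps together, a minimal sketch; the dictionary layout and function name are illustrative, not LAMBench's actual code:

```python
import numpy as np

def generalizability_error(model_err, dummy_err, weights):
    """Aggregate raw per-test-set errors into the generalizability metric.

    model_err, dummy_err : {domain: {ptype: [error per test set, ...]}}
    weights              : {ptype: weight w_p}
    """
    domain_errors = []
    for domain, by_ptype in model_err.items():
        num = den = 0.0
        for ptype, errs in by_ptype.items():
            # Step 1: normalize against the dummy baseline, capped at 1.
            norm = [min(e / d, 1.0)
                    for e, d in zip(errs, dummy_err[domain][ptype])]
            # Step 2: log-average over the N_{d,p} test sets.
            log_avg = float(np.exp(np.mean(np.log(norm))))
            num += weights[ptype] * log_avg
            den += weights[ptype]
        # Step 3: weighted domain error over prediction types.
        domain_errors.append(num / den)
    # Step 4: plain average over the N_d domains.
    return float(np.mean(domain_errors))
```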
The generalizability error metric enables comparison of generalizability across different models. It reflects the overall generalization capability across all domains, prediction types, and test sets, with a lower error indicating superior performance. The only tunable parameters are the weights assigned to the prediction types, which minimizes arbitrariness in the comparison system.
For the force-field generalizability tasks, we adopt the RMSE as the error metric. The prediction types include energy and force, with weights $w_e$ and $w_f$. When periodic boundary conditions are assumed and virial labels are available, virial predictions are also considered, and the weights are adjusted to include $w_v$. The resulting error is referred to as $\bar{\mathcal{E}}_{\mathrm{FF}}$.
The error metric is designed such that the dummy model, which predicts system energy solely from the chemical formula, yields $\bar{\mathcal{E}}_{\mathrm{FF}} = 1$. In contrast, an ideal model that perfectly matches the Density Functional Theory (DFT) labels achieves $\bar{\mathcal{E}}_{\mathrm{FF}} = 0$.
For the domain-specific property calculation tasks, we adopt the MAE as the primary error metric.
In the Inorganic Materials domain, the MDR phonon benchmark predicts the maximum phonon frequency, entropy, free energy, and heat capacity at constant volume, while the elasticity benchmark evaluates the shear and bulk moduli. Each of these six prediction types is assigned an equal weight of $1/6$.
In the Molecules domain, the TorsionNet500 benchmark evaluates the torsion profile energy, torsional barrier height, and the number of molecules for which the predicted torsional barrier height error exceeds 1 kcal/mol. The Wiggle150 benchmark assesses the relative conformer energy profile. Each prediction type in this domain is assigned a weight of 0.25.
In the Catalysis domain, the OC20NEB-OOD benchmark evaluates the energy barrier, reaction energy change (delta energy), and the percentage of reactions with predicted energy barrier errors exceeding 0.1 eV for three reaction types: transfer, dissociation, and desorption. Each prediction type in this domain is assigned a weight of 0.2.
The resulting error metric after averaging over all domains is denoted as $\bar{\mathcal{E}}_{\mathrm{DS}}$.
To assess model efficiency, we randomly selected 1000 frames from the Inorganic Materials and Catalysis domains of the aforementioned out-of-distribution datasets. Each frame was expanded by replicating its unit cell until it contained between 800 and 1000 atoms, a count determined dynamically with a binary search algorithm so as to fully utilize GPU capacity (see the sketch below). This ensured that inference efficiency was measured within the regime of convergence. The first 10% of the test samples were treated as a warm-up phase and excluded from the efficiency timing; we report the average efficiency over the remaining 900 frames.
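One way to read the binary-search detail: find the largest number of unit-cell copies whose atom count fits the budget. A sketch under that reading; the function name and the exact search target are assumptions:

```python
def copies_within_budget(n_atoms, lo=800, hi=1000):
    """Binary-search the largest number of unit-cell copies whose total
    atom count stays at or below `hi`; returns None if the [lo, hi]
    window cannot be reached for this cell."""
    left, right = 1, max(1, hi // n_atoms)
    while left < right:
        mid = (left + right + 1) // 2
        if mid * n_atoms <= hi:
            left = mid
        else:
            right = mid - 1
    # Distributing `left` copies over the three lattice vectors
    # (e.g. via ase.Atoms.repeat) then yields the expanded frame.
    return left if lo <= left * n_atoms <= hi else None

print(copies_within_budget(96))  # hypothetical 96-atom cell -> 10 copies
```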
We define an efficiency score, $\bar{\nu}_m$, by normalizing the average inference time, $\bar{t}_m$, of a given LAM measured over the 900 timed configurations with respect to an artificial reference value, $t_{\mathrm{ref}}$, thereby rescaling it to a range between zero and positive infinity; a larger value indicates higher efficiency:

$$\bar{\nu}_m = \frac{t_{\mathrm{ref}}}{\bar{t}_m}, \qquad \bar{t}_m = \frac{1}{900} \sum_{i=1}^{900} t_{m,i}$$

where $t_{m,i}$ is the inference time of configuration $i$ for model $m$.
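A sketch of the scoring step, including the warm-up exclusion described above; `T_REF` stands in for the artificial reference value, whose actual magnitude and units are not specified here:

```python
import numpy as np

T_REF = 1.0  # hypothetical reference inference time t_ref (units as measured)

def efficiency_score(times):
    """Efficiency score from per-frame inference times of a 1000-frame run;
    the first 10% of frames are warm-up and are excluded."""
    times = np.asarray(times, dtype=float)
    timed = times[len(times) // 10:]   # drop the warm-up phase
    return T_REF / timed.mean()        # larger score => faster model
```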
Stability is quantified by measuring the total-energy drift in NVE simulations across nine structures. For each simulation trajectory, an instability metric is defined from the magnitude of the slope obtained via linear regression of total energy per atom versus simulation time. A tolerance value, $\delta_{\mathrm{tol}}$, is set to three times the statistical uncertainty of the slope estimated from a 10 ps NVE-MD trajectory run with the MACE-MPA-0 model. If the measured slope is smaller than the tolerance, the energy drift is considered negligible. We define the dimensionless measure of instability for structure $s$ as follows:
$$\mathcal{I}_s = \begin{cases} \min\left(5,\; \max\left(0,\; \log_{10}\dfrac{|\delta_s|}{\delta_{\mathrm{tol}}}\right)\right) & \text{if the simulation completes successfully,} \\[6pt] 5 & \text{otherwise,} \end{cases}$$

where $\delta_s$ represents the total-energy drift (the regression slope) and $\delta_{\mathrm{tol}}$ denotes the tolerance. This metric indicates the relative order of magnitude of the slope compared to the tolerance. In cases where an MD simulation fails, a penalty of 5 is assigned, representing a drift five orders of magnitude larger than the typical statistical uncertainty in measuring the slope. The final instability metric, $\bar{\mathcal{I}}$, is computed as the average over all nine structures:

$$\bar{\mathcal{I}} = \frac{1}{9} \sum_{s=1}^{9} \mathcal{I}_s$$
This result is bounded within the range $[0, 5]$, where a lower value signifies greater stability.
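A per-structure sketch of this computation, assuming per-atom total energies sampled along the trajectory; `DELTA_TOL` is a placeholder for the tolerance, whose actual value is derived from the MACE-MPA-0 reference run:

```python
import numpy as np

DELTA_TOL = 1e-6  # hypothetical tolerance (energy drift per unit time)

def instability(t_ps, e_per_atom, failed=False):
    """Instability score I_s for one NVE trajectory, clipped to [0, 5].

    t_ps       : simulation times of the sampled frames (ps)
    e_per_atom : total energy per atom at each frame
    failed     : True if the MD run crashed (maximum penalty)
    """
    if failed:
        return 5.0
    slope, _ = np.polyfit(t_ps, e_per_atom, 1)  # energy drift delta_s
    if abs(slope) <= DELTA_TOL:                 # drift within tolerance
        return 0.0
    return float(min(5.0, np.log10(abs(slope) / DELTA_TOL)))

# The final metric averages `instability` over the nine structures.
```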