Large atomic models (LAMs), also known as machine learning interatomic potentials (MLIPs), are foundation models that predict atomic interactions across diverse systems using data-driven approaches. LAMBench is a benchmark designed to evaluate the performance of such models, providing a comprehensive suite of tests and metrics to help developers and researchers understand the accuracy and generalizability of their models.
The LAMBench Leaderboard. $\bar{\mathcal{E}}^{\mathrm{FF}}$ refers to the generalizability error on force-field prediction tasks, while $\bar{\mathcal{E}}^{\mathrm{DS}}$ denotes the generalizability error on domain-specific tasks; together these two columns constitute Generalizability. $\bar{S}$ stands for the efficiency metric and $\bar{I}$ for the instability metric; these two columns constitute Applicability. Arrows alongside the metrics denote whether a higher or lower value corresponds to better performance.
| Model | $\bar{\mathcal{E}}^{\mathrm{FF}}$ ↓ | $\bar{\mathcal{E}}^{\mathrm{DS}}$ ↓ | $\bar{S}$ ↑ | $\bar{I}$ ↓ |
| --- | --- | --- | --- | --- |
DPA-3.0-7M | 0.245 | 0.161 | 0.151 | 0.291 |
DPA-2.4-7M | 0.265 | 0.208 | 0.614 | 0.039 |
Orb-v3 | 0.280 | 0.240 | 0.400 | 0.000 |
DPA-3.0-3M | 0.338 | 0.257 | 0.296 | 0.480 |
GRACE-2L-OAM | 0.340 | 0.262 | 0.678 | 0.309 |
SevenNet-l3i5 | 0.355 | 0.240 | 0.279 | 0.036 |
MACE-MPA-0 | 0.356 | 0.291 | 0.291 | 0.000 |
Orb-v2 | 0.356 | 0.560 | 1.343 | 2.649 |
SevenNet-MF-ompa | 0.358 | 0.300 | 0.088 | 0.000 |
SevenNet-0 | 0.369 | 0.246 | 0.760 | 0.556 |
MatterSim-v1-5M | 0.389 | 0.280 | 0.388 | 0.000 |
MACE-MP-0 | 0.405 | 0.341 | 0.291 | 0.089 |
We categorize all force-field prediction tasks into five domains:

- Inorganic Materials: Torres2019Analysis, Batzner2022equivariant, SubAlex_9k, Sours2023Applications, Lopanitsyna2023Modeling_A, Lopanitsyna2023Modeling_B, Dai2024Deep, WBM_25k
- Small Molecules: ANI-1x
- Catalysis: Vandermause2022Active, Zhang2019Bridging, Zhang2024Active, Villanueva2024Water
- Reactions: Gasteiger2020Fast, Guan2022Benchmark
- Biomolecules: MD22, AIMD-Chig
To assess model performance across these domains, we use zero-shot inference with energy-bias term adjustments based on test dataset statistics. Performance metrics are aggregated as follows:
$$\hat{\varepsilon}_{m,d,p,t} = \frac{\varepsilon_{m,d,p,t}}{\varepsilon_{\mathrm{dummy},d,p,t}},$$

where $\varepsilon$ is the original error metric, $m$ indicates the model, $d$ denotes the domain index, $p$ signifies the prediction index, and $t$ represents the test set index. For instance, in force-field tasks the domains include Small Molecules, Inorganic Materials, Biomolecules, Reactions, and Catalysis, such that $d \in \{\mathrm{SM}, \mathrm{IM}, \mathrm{BM}, \mathrm{RE}, \mathrm{CA}\}$. The prediction types are categorized as energy ($e$), force ($f$), or virial ($v$), with $p \in \{e, f, v\}$. For the specific domain of Reactions, the test sets are indexed as $t \in \{\text{Gasteiger2020Fast}, \text{Guan2022Benchmark}\}$. The normalization uses a dummy baseline model that predicts energy based solely on the chemical formula, disregarding any structural details, thereby providing a reference point for evaluating the improvement offered by more sophisticated models.
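As an illustration, the normalization and a composition-only baseline can be sketched in a few lines of Python. Fitting one reference energy per element by least squares is one plausible way to realize a formula-only predictor; the function names and the fitting choice are ours, not necessarily LAMBench's exact implementation.

```python
import numpy as np

def fit_dummy_model(compositions, energies):
    """Hypothetical composition-only baseline: fit one reference energy per
    element by least squares, so predictions depend only on the formula.

    compositions: (n_frames, n_elements) element-count matrix
    energies:     (n_frames,) DFT total energies
    """
    coeffs, *_ = np.linalg.lstsq(compositions, energies, rcond=None)
    return lambda comp: np.asarray(comp) @ coeffs

def normalized_error(eps_model, eps_dummy):
    """Normalized error for model m on domain d, prediction p, test set t."""
    return eps_model / eps_dummy
```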
For each domain, we compute the log-average of the normalized metrics across all test sets within this domain:

$$\bar{\varepsilon}_{m,d,p} = \exp\left(\frac{1}{N_{d,p}} \sum_{t=1}^{N_{d,p}} \log \hat{\varepsilon}_{m,d,p,t}\right),$$

where $N_{d,p}$ denotes the number of test sets for domain $d$ and prediction type $p$.
Subsequently, we calculate a weighted dimensionless domain error metric to encapsulate the overall error across the various prediction types:

$$\bar{\varepsilon}_{m,d} = \sum_{p} w_p\, \bar{\varepsilon}_{m,d,p},$$

where $w_p$ denotes the weight assigned to prediction type $p$, with $\sum_p w_p = 1$.
Finally, the overall generalizability error is obtained by averaging the domain errors:

$$\bar{\mathcal{E}}_m = \frac{1}{N_D} \sum_{d=1}^{N_D} \bar{\varepsilon}_{m,d},$$

where $N_D$ denotes the number of domains under consideration.
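Putting the three aggregation stages together, a minimal Python sketch (the function names and data layout are ours, not the LAMBench API) could read:

```python
import math

def domain_error(norm_errors, weights):
    """Weighted domain error for one domain d.

    norm_errors: dict mapping prediction type p -> list of normalized
                 errors over the N_{d,p} test sets of the domain
    weights:     dict mapping prediction type p -> weight w_p (summing to 1)
    """
    per_type = {
        # log-average (geometric mean) across the domain's test sets
        p: math.exp(sum(math.log(e) for e in errs) / len(errs))
        for p, errs in norm_errors.items()
    }
    return sum(weights[p] * per_type[p] for p in per_type)

def generalizability_error(domains, weights):
    """Arithmetic mean of the weighted domain errors over the N_D domains."""
    return sum(domain_error(d, weights) for d in domains) / len(domains)

# Illustrative numbers only: two domains with energy ("e") and force ("f") errors
domains = [
    {"e": [0.20, 0.35], "f": [0.30, 0.45]},
    {"e": [0.15], "f": [0.25]},
]
print(generalizability_error(domains, weights={"e": 0.5, "f": 0.5}))
```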
The generalizability error metric $\bar{\mathcal{E}}_m$ allows for the comparison of generalizability across different models. It reflects the overall generalization capability across all domains, prediction types, and test sets, with a lower error indicating superior performance. The only tunable parameters are the weights $w_p$ assigned to the prediction types, thereby minimizing arbitrariness in the comparison system.
For the force-field generalizability tasks, we adopt the RMSE as the error metric. The prediction types include energy and force, with weights assigned as $w_e = w_f = 0.5$. When periodic boundary conditions are assumed and virial labels are available, virial predictions are also considered; in this scenario, the prediction weights are adjusted to $w_e = w_f = 0.45$ and $w_v = 0.1$. The resulting error is referred to as $\bar{\mathcal{E}}^{\mathrm{FF}}_m$.
The error metric is designed such that a dummy model, which predicts system energy solely from chemical formulae, results in $\bar{\mathcal{E}}^{\mathrm{FF}}_m = 1$. In contrast, an ideal model that perfectly matches the Density Functional Theory (DFT) labels achieves $\bar{\mathcal{E}}^{\mathrm{FF}}_m = 0$.
For the domain-specific property tasks, we adopt the MAE as the error metric. In the Inorganic Materials domain, the MDR phonon benchmark evaluates predictions of the maximum phonon frequency, entropy, free energy, and heat capacity at constant volume, with each prediction type assigned a weight of 0.25. In the Small Molecules domain, the TorsionNet500 benchmark evaluates the torsion profile energy, the torsion barrier height, and the number of molecules for which the predicted torsional barrier height errs by more than 1 kcal/mol; each prediction type in this domain is assigned a weight of 1/3. The resulting score is denoted as $\bar{\mathcal{E}}^{\mathrm{DS}}_m$.
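Reusing the hypothetical `domain_error` helper from the sketch above, the MDR phonon aggregation would take the form below (the property keys and numbers are illustrative, not benchmark results):

```python
# Four phonon properties, one test set each, weight 0.25 per property
phonon_errors = {"omega_max": [0.42], "entropy": [0.31],
                 "free_energy": [0.28], "heat_capacity": [0.35]}
phonon_weights = {p: 0.25 for p in phonon_errors}
im_domain_error = domain_error(phonon_errors, phonon_weights)
```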
To assess model efficiency, we randomly selected 2000 frames from the Inorganic Materials and Catalysis domains of the aforementioned out-of-distribution datasets. Each frame was expanded to between 800 and 1000 atoms by replicating the unit cell, ensuring that inference efficiency was measured in the convergence regime. The initial 20% of the test samples were treated as a warm-up phase and excluded from the timing. We report the average efficiency over the remaining 1600 frames.
We define an efficiency score $\bar{S}_m$ by normalizing the average inference time $\bar{t}_m$ (in µs/atom) of a given LAM, measured over the 1600 configurations, with respect to an artificial reference value $t_{\mathrm{ref}}$, thereby rescaling it to a range between zero and positive infinity. A larger value indicates higher efficiency:

$$\bar{S}_m = \frac{t_{\mathrm{ref}}}{\bar{t}_m}, \qquad \bar{t}_m = \frac{1}{N} \sum_{i=1}^{N} t_{m,i},$$

where $t_{m,i}$ is the inference time of configuration $i$ for model $m$ and $N = 1600$.
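Under the stated protocol, the timing and normalization might be sketched as follows; `T_REF` is a placeholder for the artificial reference value, which the text does not specify:

```python
import numpy as np

T_REF = 1.0  # placeholder reference time, same units as the measured times

def efficiency_score(times, warmup_frac=0.2):
    """S_m = t_ref / t_m, where t_m is the mean per-configuration inference
    time after discarding the initial warm-up fraction of the samples.

    times: per-configuration inference times, in measurement order
    """
    times = np.asarray(times, dtype=float)
    n_warm = int(len(times) * warmup_frac)  # first 20% (400 of 2000 frames)
    t_mean = times[n_warm:].mean()          # average over the remaining 1600
    return T_REF / t_mean                   # in (0, +inf); larger is better
```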
Stability is quantified by measuring the total energy drift in NVE simulations across nine structures. For each simulation trajectory, an instability metric is defined from the magnitude of the slope obtained via linear regression of the total energy per atom versus simulation time. A tolerance value, $\tau$, is set to three times the statistical uncertainty in the slope estimated from a 10 ps NVE-MD trajectory using the MACE-MPA-0 model. If the measured slope is smaller than the tolerance, the energy drift is considered negligible. We define the dimensionless instability measure for structure $s$ as follows:
$$I_s = \begin{cases} \min\left(\max\left(\log_{10}\dfrac{|\delta_s|}{\tau},\, 0\right),\, 5\right), & \text{if the computation is successful}, \\[6pt] 5, & \text{otherwise}, \end{cases}$$

where $\delta_s$ represents the total energy drift (the regression slope) and $\tau$ denotes the tolerance. This metric indicates the relative order of magnitude of the slope compared to the tolerance. In cases where an MD simulation fails, a penalty of 5 is assigned, representing a drift five orders of magnitude larger than the typical statistical uncertainty in measuring the slope. The final instability metric $\bar{I}_m$ is computed as the average of $I_s$ over all nine structures.
This result is bounded within the range $[0, 5]$, where a lower value signifies greater stability.
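A per-trajectory sketch of this metric, assuming per-atom total energies and simulation times are available and that a crashed run is flagged by the caller (the names are ours):

```python
import numpy as np

def instability(times_ps, energies_per_atom, tol, failed=False):
    """Dimensionless instability I_s for one NVE trajectory.

    times_ps:          simulation times (ps)
    energies_per_atom: total energy per atom along the trajectory
    tol:               tolerance tau on the drift slope
    """
    if failed:
        return 5.0  # penalty: five orders of magnitude above the tolerance
    # Drift = slope of a linear regression of energy per atom vs. time
    slope, _intercept = np.polyfit(times_ps, energies_per_atom, deg=1)
    return float(min(max(np.log10(abs(slope) / tol), 0.0), 5.0))

# Final metric: average of I_s over the nine benchmark structures, e.g.
# I_bar = np.mean([instability(t, e, tol) for (t, e) in trajectories])
```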