PREreview of Evaluation of Foundational Machine Learned Interatomic Potentials for Migration Barrier Predictions
- Published
- DOI: 10.5281/zenodo.20935522
- License: CC0 1.0
Summary
The authors benchmark five foundation MLIPs (MACE-MP-0, SevenNet-MF-ompa, Orb-v3, CHGNet, M3GNet) run through ASE-NEB against DFT-NEB migration barriers for 574 battery-relevant paths (Dataset-2), plus a 60-path set with full intermediate-image geometries (Dataset-1). They report barrier-MAE rankings under three outlier schemes, good/bad-conductor classification at a 500 meV threshold, a new geometry-similarity metric θ comparing MLIP-NEB images to DFT-NEB versus linear interpolation, and the finding that barrier accuracy and geometry accuracy are anti-correlated across E_m. As far as I can tell, scoring foundation MLIPs on the barrier itself inside the NEB workflow (rather than on static forces) has not been done at this scale, and it answers a question practitioners actually have.
Strengths
The question is the right one. Static energy/force RMSE does not tell you whether a model gets a saddle, and this paper measures the property people screen on.
Real scope and full openness: 574 paths across multivalent (Mg/Ca), Na, and a wide structural zoo, with data and scripts on GitHub and Zenodo. I was able to reproduce the headline numbers from the archive directly (see point 1 below), which is a credit to how the data was released.
The methodological choices are sensible and stated: the conservative Orb-v3 variant, IDPP-vs-LI and full-spring EB pre-tested, identical NEB settings across the set.
The geometry/barrier anti-correlation (§3.5) is the most interesting result in the paper. Addressing the one caveat in point 7 would keep a referee from picking at it.
Major comments
1. Add confidence intervals. The good news is that your full-set MACE result holds up; the only thing to tighten is the abstract’s “co-best” billing of Orb-v3. Because the barriers are archived, it was straightforward to run a paired bootstrap on top of them. I reproduce your MAEs to within 8 meV (MACE 0.305 vs 0.310, Orb-v3 0.337 vs 0.336, CHGNet 0.335 vs 0.343, SevenNet 0.345, M3GNet 0.349), and the bootstrap (B=2000, 494 complete cases) adds the error bars the paper is missing. They support you: MACE has the lowest MAE in 97% of resamples and is statistically clear of Orb-v3 (Δ = +0.037 eV, 95% CI [+0.009, +0.068]), SevenNet (+0.040 [+0.012, +0.075]) and M3GNet (+0.042 [+0.016, +0.070]); only CHGNet is a genuine tie (+0.027 [−0.000, +0.057]). So the full-set ranking is real and it favours MACE. My one concern is that the abstract reads “MACE-MP-0 and Orb-v3 exhibit the lowest MAEs across the entire dataset and over data points that are not outliers, respectively,” presenting Orb-v3 as co-best, whereas Orb’s headline 0.198 eV comes from the own-outlier-removed scheme and on the full set Orb sits significantly behind MACE. I would lead with MACE as the robust full-set winner, present Orb’s 0.198 explicitly as a best-case-on-systems-it-handles-well number, and add the CIs (about an hour on the public data; happy to send my script if it saves you the trouble). One more thing to flag: MACE is scored on 494 systems versus 572 for the others, since its barriers sit in neb_results.csv for fewer paths, so that column is not quite apples-to-apples.
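For concreteness, a minimal sketch of the kind of paired bootstrap described in point 1. It assumes per-path barriers for the DFT reference and two models, restricted to paths where both models have a value. The function name `paired_bootstrap_mae_diff` and the synthetic data are illustrative only; this is not the script behind the numbers quoted above.

```python
import random

def paired_bootstrap_mae_diff(ref, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap the MAE difference (model B minus model A) over shared paths.

    Resampling path indices (not errors independently) keeps the pairing,
    so path-to-path difficulty cancels in the difference.
    """
    rng = random.Random(seed)
    err_a = [abs(p - r) for p, r in zip(pred_a, ref)]
    err_b = [abs(p - r) for p, r in zip(pred_b, ref)]
    n = len(ref)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        mae_a = sum(err_a[i] for i in idx) / n
        mae_b = sum(err_b[i] for i in idx) / n
        diffs.append(mae_b - mae_a)
    diffs.sort()
    # Percentile 95% confidence interval on the MAE difference.
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    point = sum(err_b) / n - sum(err_a) / n
    # Fraction of resamples in which model A has the lower MAE.
    frac_a_better = sum(d > 0 for d in diffs) / n_boot
    return point, ci, frac_a_better

# Synthetic placeholder data (NOT the paper's barriers), in eV.
_rng = random.Random(1)
ref = [_rng.uniform(0.1, 1.5) for _ in range(200)]
pred_a = [r + _rng.gauss(0.0, 0.1) for r in ref]  # hypothetical "model A"
pred_b = [r + _rng.gauss(0.0, 0.3) for r in ref]  # hypothetical "model B"

point, ci, frac_a_better = paired_bootstrap_mae_diff(ref, pred_a, pred_b)
print(f"delta MAE = {point:+.3f} eV, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}]")
print(f"model A has the lower MAE in {100 * frac_a_better:.0f}% of resamples")
```

A confidence interval that excludes zero is what "statistically clear" means in point 1; an interval that straddles zero is the tie case.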
2. Make the common-outlier scheme the primary comparison. The own-outlier-removed numbers in §3.1 drop a different set of systems for each model, so they are a per-model best case rather than a ranking (which is what the bootstrap in point 1 shows). The fixed common-outlier scheme (17 shared paths removed: MACE 0.239, Orb 0.245, SevenNet 0.251, and so on) is the fairer headline, and the count of catastrophic outliers per model is worth reporting as a number in its own right.
3. Three images without a climbing image will under-resolve the saddle, and that protocol error may account for some of the error you attribute to the models. Dataset-2, which carries the headline numbers, uses three non-climbing images, against seven for Dataset-1. For a transition state displaced even 15% from the path midpoint, three images can miss the barrier by roughly 40 to 160 meV, and by more for the double-humped NaSICON profiles. That is at least the size of the entire inter-model spread in point 1, and it could account for part of the under-estimation you ascribe to CHGNet and M3GNet. Re-running 30 to 50 Dataset-2 paths with seven images (about a CPU-day, no new DFT) would settle how much is protocol and how much is model.
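To illustrate the mechanism in point 3, here is a toy calculation, not a NEB run. It assumes a single-hump model profile (a warped sin² curve, my choice for illustration) with its maximum at a chosen path fraction, and equally spaced interior images that sit exactly on that profile. The names `toy_profile` and `image_barrier` are hypothetical, and real paths will differ in shape.

```python
import math

def toy_profile(s, e_m, s_saddle):
    """Toy single-hump energy profile with maximum e_m at path fraction s_saddle."""
    # Piecewise-linear warp that sends s_saddle to 0.5, so sin^2 peaks there.
    if s <= s_saddle:
        w = 0.5 * s / s_saddle
    else:
        w = 0.5 + 0.5 * (s - s_saddle) / (1.0 - s_saddle)
    return e_m * math.sin(math.pi * w) ** 2

def image_barrier(n_images, e_m, s_saddle):
    """Highest energy among n_images equally spaced interior images (no climbing image)."""
    fractions = [k / (n_images + 1) for k in range(1, n_images + 1)]
    return max(toy_profile(s, e_m, s_saddle) for s in fractions)

e_m = 0.5  # assumed true barrier in eV, chosen at the 500 meV screening threshold
for s_saddle in (0.50, 0.65):  # saddle at the midpoint, then displaced by 15%
    for n_images in (3, 7):
        miss = e_m - image_barrier(n_images, e_m, s_saddle)
        print(f"saddle at {s_saddle:.2f}, {n_images} images: "
              f"barrier missed by {1000 * miss:.0f} meV")
```

With the saddle at the midpoint the central image lands on it and nothing is lost; once the saddle is displaced, the three-image estimate falls short by noticeably more than the seven-image one. The exact shortfall depends on the assumed profile, which is why re-running a subset of real paths is the better test.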
4. The reference set spans four functionals; stratifying the MAE by functional would separate model error from reference spread. I checked the public Dataset-2.json: the XC field is GGA for 505 paths (88%), GGA+U for 40 (7%), SCAN for 19 (3.3%) and LDA for 10 (1.7%), and 417 of 574 systems (73%) contain a transition metal. The foundation models are trained mostly on GGA-level MPtrj/OMat data, so a single MAE against this reference folds together genuine model error and a GGA-vs-(GGA+U/SCAN) reference mismatch, with the obvious circularity that a model reproducing the GGA number looks “wrong” on the SCAN and +U entries through no fault of its physics. The fix is nearly free because the data is already labelled: report the MAE split by XC, or restrict the headline ranking to the homogeneous GGA subset. That is probably where the inter-model differences either live or disappear.
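The stratification suggested in point 4 could be sketched as follows. The `XC` key is the field I found in Dataset-2.json; the `E_m_dft` and `E_m_mlip` keys and the three example records are hypothetical placeholders, since the sketch does not reproduce the file's actual schema or values.

```python
from collections import defaultdict

def mae_by_group(records, group_key="XC", ref_key="E_m_dft", pred_key="E_m_mlip"):
    """Return {group: (n_paths, MAE)} for barrier errors grouped by functional."""
    errors = defaultdict(list)
    for rec in records:
        errors[rec[group_key]].append(abs(rec[pred_key] - rec[ref_key]))
    return {group: (len(errs), sum(errs) / len(errs)) for group, errs in errors.items()}

# Placeholder records in eV (made up for illustration, not taken from the dataset).
records = [
    {"XC": "GGA", "E_m_dft": 0.40, "E_m_mlip": 0.50},
    {"XC": "GGA", "E_m_dft": 0.80, "E_m_mlip": 0.70},
    {"XC": "SCAN", "E_m_dft": 0.60, "E_m_mlip": 0.30},
]

for xc, (n, mae) in sorted(mae_by_group(records).items()):
    print(f"{xc:6s} N = {n:3d}  MAE = {mae:.3f} eV")
```

Reporting N next to each MAE matters here, because the SCAN and LDA subsets are small enough that their MAEs will be noisy.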
5. Orb-v3’s 27% endpoint non-convergence deserves to be in the Results, not buried in the Discussion. 153 of 574 Orb-v3 endpoint relaxations did not converge in 1000 steps (§4). The authors kept those results as-is rather than dropping them, and note that better relaxation might improve Orb, so this is a reliability point, not evidence that Orb is secretly worse: its barriers for that 27% simply carry uncertainty in an uncontrolled direction. Given the screening claim rests partly on Orb, a per-model convergence-rate column belongs in the main table, and re-relaxing a few dozen of the failed cases with a longer optimizer run would show whether the ranking moves.
6. Add a balance-aware classification metric; it will make Orb and SevenNet look better, not worse. Accuracies are Orb 84.84%, SevenNet 82.93%, MACE 79.44%, CHGNet 73.87%, M3GNet 73.52% at the 500 meV cut. From the public data the actual split is 245 good (E_m < 0.5 eV) to 329 bad, so a trivial “always bad” classifier scores 57.3%. The problem is moderately balanced, so the accuracies are not badly inflated and every model clears the 57.3% floor (CHGNet and M3GNet do carry real skill, not zero). The subtle point is that at a 43/57 split raw accuracy slightly rewards the majority-leaning under-estimators, so the gap between Orb (84.8%) and CHGNet (73.9%) may understate the true difference. Matthews correlation or balanced accuracy alongside the 57.3% baseline would show the real separation.
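A short sketch of the metrics meant in point 6, computed from reference and predicted barriers at the 500 meV cut. The function name `classification_report` is illustrative, and treating "good conductor" as the positive class is my assumption. The demonstration uses the 245/329 class split from the public data with a trivial "always bad" predictor; the barrier values themselves are dummies.

```python
import math

def classification_report(e_ref, e_pred, threshold=0.5):
    """Accuracy, balanced accuracy, MCC and majority baseline at a barrier cut (eV)."""
    tp = fp = tn = fn = 0
    for r, p in zip(e_ref, e_pred):
        ref_good, pred_good = r < threshold, p < threshold  # "good" = low barrier
        if ref_good and pred_good:
            tp += 1
        elif ref_good:
            fn += 1
        elif pred_good:
            fp += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    recall_good = tp / (tp + fn) if tp + fn else 0.0
    recall_bad = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "balanced_accuracy": 0.5 * (recall_good + recall_bad),
        # MCC is set to 0 by convention when a row or column of the table is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "majority_baseline": max(tp + fn, tn + fp) / n,
    }

# 245 good and 329 bad reference paths, with a predictor that always says "bad".
e_ref = [0.3] * 245 + [0.8] * 329
always_bad = [0.8] * 574
for name, value in classification_report(e_ref, always_bad).items():
    print(f"{name:18s} {value:.3f}")
```

The trivial predictor reaches 57.3% raw accuracy but only 0.5 balanced accuracy and an MCC of 0, which is why these two metrics are the ones to put next to the accuracies in the table.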
7. The geometry/barrier anti-correlation is real but partly built into θ. Since θ is defined relative to LI, the bar for “better than LI” is low at high E_m (where LI is far from the path) and high at low E_m, so some of the anti-correlation is the metric’s asymmetric baseline rather than PES physics. With about 12 points per Dataset-1 bin the fractions are also thin. The physical reading (flat PES at low E_m, high-E_m images far from equilibrium training data) is reasonable and consistent with what I would expect. A sentence noting the relative-to-LI definition, plus the per-bin N, closes the gap.
Minor
Verb calibration in the Conclusion: given points 1 and 5, “useful coarse pre-screen” is probably better calibrated than “highly suitable” and “we recommend.”
The θ “good” threshold of δ < 0.01 Å is tighter than DFT’s own coordinate accuracy for TM–O bonds (~0.02–0.05 Å); a δ < 0.03 Å check (half a day) would show whether the good fractions are threshold-sensitive.
One paragraph positioning this against concurrent alkali-only benchmarks (e.g. arXiv:2601.10938, Orb E_m MAE ~75–111 meV vs ~198–336 here) would pre-empt the obvious question. Priority is not at stake; the scope here is genuinely broader (multivalent, Na, zero-shot), and the gap is most likely the three-image protocol of point 3 plus the harder chemistries.
Verdict
Significance: valuable. The first NEB-integrated foundation-MLIP barrier benchmark, and a useful map of which model to reach for. Strength of evidence: solid. The central claims survive scrutiny: I bootstrapped the public data and MACE’s full-set lead is statistically robust (point 1), and the secondary results (Orb/SevenNet classification well above the 57% floor, the geometry/barrier anti-correlation, MLIP-NEB beating LI in ~66% of cases) are supported. What remains is presentation and robustness rather than correctness: the abstract’s Orb framing (1, 2), the three-image saddle (3), the mixed-functional reference (4), Orb’s convergence rate (5), and a balance-aware classification metric (6). Each is reachable with data already in hand and a day or two of analysis, and several with the authors’ own public files. With those, this becomes a reference the community will cite.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they used generative AI to come up with new ideas for their review.