Write a PREreview

Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction

by Rajesh Sathyamoorthy and Munish Puri

Published: April 1, 2026
Server: bioRxiv
DOI: 10.64898/2026.03.31.715487

Accurate prediction of Enzyme Commission (EC) numbers is foundational to genome annotation, metabolic reconstruction, and enzyme engineering. Protein language models (PLMs) have transformed protein function prediction, yet their systematic evaluation for EC number prediction across architectures, EC hierarchy levels, and sequence identity thresholds is lacking. Here we present a comprehensive benchmark of three PLMs (ESM2-650M, ESM2-3B, ProtT5-XL) combined with nine downstream neural architectures, evaluated across four EC hierarchy levels and four sequence identity thresholds with 1,296 trained models in total. Our results establish that simple MLP classifiers achieve 98.0% accuracy at EC1, 96.9% at EC2, 96.6% at EC3, and 97.0% at EC4, matching or marginally exceeding a train-set-matched BLASTp baseline (±0.7 pp) for in-distribution proteins. Crucially, PLM-based methods dramatically outperform BLAST for evolutionarily distant eukaryotes: gains reach +31.8 pp over a fair 90K-sequence BLAST baseline (Giardia lamblia) and +26.4 pp over a full 520K SwissProt database (Trichomonas vaginalis). For held-out prokaryotic proteomes, PLMs outperform BLAST by a mean of +16.9 pp at EC4. Our benchmark reveals that (i) MLP architectures are sufficient and consistently superior to CNN/ResNet/Transformer variants, (ii) ESM2-650M is statistically distinguishable from but practically equivalent to the 5× larger ESM2-3B, and (iii) Transformer re-encoding of PLM embeddings fails at a shared learning rate due to convergence instability. All code, models, and benchmark results are available at https://github.com/r-mbio/plm_benchmark.git.
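For readers unfamiliar with the four EC hierarchy levels mentioned in the abstract, the sketch below shows one plausible way to score predictions at each level: truncate a four-field EC number to its first one to four fields and compare prefixes. This is only an illustration. The helper names `ec_prefix` and `accuracy_at_level` and the sample EC numbers are assumptions made here, and the preprint may define or compute its per-level accuracy differently.

```python
from typing import Sequence


def ec_prefix(ec_number: str, level: int) -> str:
    """Return the first `level` fields of a four-field EC number.

    Hypothetical helper for illustration; not taken from the preprint's code.
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    if not 1 <= level <= 4:
        raise ValueError("level must be between 1 and 4")
    return ".".join(parts[:level])


def accuracy_at_level(true_ecs: Sequence[str], pred_ecs: Sequence[str], level: int) -> float:
    """Fraction of predictions whose EC prefix matches the truth at `level`."""
    if len(true_ecs) != len(pred_ecs) or not true_ecs:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        ec_prefix(t, level) == ec_prefix(p, level)
        for t, p in zip(true_ecs, pred_ecs)
    )
    return matches / len(true_ecs)


# Illustrative labels only (assumed for this sketch, not benchmark data).
true_ecs = ["1.1.1.1", "3.4.21.4", "2.7.11.1"]
pred_ecs = ["1.1.1.2", "3.4.21.4", "2.7.1.1"]

for level in range(1, 5):
    print(f"EC{level}: {accuracy_at_level(true_ecs, pred_ecs, level):.3f}")
```

Under this scoring, a prediction that is wrong in the third field also counts as wrong at the fourth, so accuracy cannot increase at deeper levels. A setup that trains a separate classifier per level, as the abstract's model count suggests, need not share that property.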

You can write a PREreview of Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction. A PREreview is a review of a preprint and can range from a few sentences to a lengthy report, similar to a peer-review report organized by a journal.

Before you start

We will ask you to log in with your ORCID iD. If you don't have an iD, you can create one.