
Write a PREreview

Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction

Published
Server
bioRxiv
DOI
10.64898/2026.03.31.715487

Accurate prediction of Enzyme Commission (EC) numbers is foundational to genome annotation, metabolic reconstruction, and enzyme engineering. Protein language models (PLMs) have transformed protein function prediction, yet their systematic evaluation for EC number prediction across architectures, EC hierarchy levels, and sequence identity thresholds is lacking. Here we present a comprehensive benchmark of three PLMs (ESM2-650M, ESM2-3B, ProtT5-XL) combined with nine downstream neural architectures, evaluated across four EC hierarchy levels and four sequence identity thresholds, with 1,296 trained models in total. Our results establish that simple MLP classifiers achieve 98.0% accuracy at EC1, 96.9% at EC2, 96.6% at EC3, and 97.0% at EC4, matching or marginally exceeding a train-set-matched BLASTp baseline (±0.7 pp) for in-distribution proteins. Crucially, PLM-based methods dramatically outperform BLAST for evolutionarily distant eukaryotes: gains reach +31.8 pp over a fair 90K-sequence BLAST baseline (Giardia lamblia) and +26.4 pp over a full 520K SwissProt database (Trichomonas vaginalis). For held-out prokaryotic proteomes, PLMs outperform BLAST by a mean of +16.9 pp at EC4. Our benchmark reveals that (i) MLP architectures are sufficient and consistently superior to CNN/ResNet/Transformer variants, (ii) ESM2-650M is statistically distinguishable from, but practically equivalent to, the 5× larger ESM2-3B, and (iii) Transformer re-encoding of PLM embeddings fails at a shared learning rate due to convergence instability. All code, models, and benchmark results are available at https://github.com/r-mbio/plm_benchmark.git.
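The pipeline the abstract describes — fixed PLM embeddings fed to a small MLP classifier over EC classes — can be sketched as below. This is a minimal illustration, not the authors' implementation: the embedding dimension, hidden size, synthetic data, and training loop are all placeholder assumptions (a real run would use precomputed ESM2-650M embeddings, which are 1280-dimensional, and the paper's own hyperparameters).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for precomputed PLM embeddings and EC1 labels (7 top-level classes).
# Real inputs would come from ESM2/ProtT5 mean-pooled per-protein embeddings.
n, d, n_classes = 200, 32, 7
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, n_classes)), axis=1)  # synthetic labels

# One-hidden-layer MLP: embedding -> ReLU hidden layer -> softmax over classes.
h = 64
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    a = np.maximum(X @ W1 + b1, 0.0)                 # hidden activations
    logits = a @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return a, e / e.sum(axis=1, keepdims=True)       # softmax probabilities

lr = 0.3
for _ in range(500):                                  # full-batch gradient descent
    a, p = forward(X)
    g = p.copy(); g[np.arange(n), y] -= 1.0; g /= n   # dL/dlogits (cross-entropy)
    dW2, db2 = a.T @ g, g.sum(0)
    da = g @ W2.T; da[a <= 0.0] = 0.0                 # backprop through ReLU
    dW1, db1 = X.T @ da, da.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = float((forward(X)[1].argmax(axis=1) == y).mean())
print(f"train accuracy: {acc:.2f}")
```

In practice the heavy lifting is the frozen embedding step; the classifier itself is deliberately small, which is consistent with the paper's finding that MLPs match or beat CNN/ResNet/Transformer heads on the same embeddings.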

You can write a PREreview of Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction. A PREreview is a review of a preprint and can range from a few sentences to a lengthy report, similar to a journal-organized peer review report.

Before you start

We will ask you to sign in with your ORCID iD. If you don't have an iD, you can create one.

What is an ORCID iD?

An ORCID iD is a unique identifier that distinguishes you from others with the same or a similar name.

Get started now