Short Summary of the Research’s Main Findings
This paper explores how different sampling strategies and sample sizes influence the predictive accuracy of AI models for peptide physicochemical properties. The study evaluates four sampling methods—Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS)—across sample sizes ranging from 100 to 20,000. A key takeaway is that a sample size of 12,000 (7.5% of the dataset) marks the threshold beyond which predictions become stable and reliable. The findings offer practical guidance for designing AI models that depend on efficient, representative sampling, particularly in peptide-based drug discovery and bioengineering.
Major Issues
The choice of sampling methods is not fully justified. The paper does not explain why these four specific methods were chosen. Would alternative approaches, like stratified sampling or active learning, offer better results? A brief comparison with other techniques would strengthen the argument.
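To make the suggestion concrete, a minimal sketch of proportional stratified sampling is shown below. All names (record fields, the hydrophobicity bins) are hypothetical and not taken from the paper; this is only meant to illustrate the kind of alternative the authors could compare against.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, n_total, seed=0):
    """Draw a stratified sample: allocate draws to each stratum
    in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_key(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Proportional allocation, at least one draw per stratum.
        k = max(1, round(n_total * len(members) / len(records)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical usage: stratify tetrapeptides by a property bin.
peptides = [{"seq": f"P{i}", "bin": i % 4} for i in range(1000)]
subset = stratified_sample(peptides, lambda r: r["bin"], 100)
```

Unlike SRS, this guarantees every property stratum appears in the sample, which is one reason a head-to-head comparison would be informative.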
While UDS ensures an even distribution of sequences, it does not necessarily create a balanced dataset in terms of physicochemical properties. The study could explore ways to adjust for property distribution bias, such as weighted sampling or a hybrid approach.
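One simple form such an adjustment could take is inverse-frequency weighted sampling on a target property, sketched below. The data, field names, and binning choice are all hypothetical; sampling here is with replacement, a simplification a real pipeline would likely avoid.

```python
import random
from collections import Counter

def property_weighted_sample(records, prop, n, n_bins=10, seed=0):
    """Sample with weights inversely proportional to how common each
    property bin is, flattening the property distribution."""
    rng = random.Random(seed)
    values = [prop(r) for r in records]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    # Rare property values get proportionally higher selection weight.
    weights = [1.0 / counts[b] for b in bins]
    # Note: random.choices samples WITH replacement (a simplification).
    return rng.choices(records, weights=weights, k=n)

# Hypothetical usage: flatten a skewed hydrophilicity distribution.
data = [{"seq": f"P{i}", "hyd": (i % 100) ** 0.5} for i in range(2000)]
balanced = property_weighted_sample(data, lambda r: r["hyd"], 200)
```

A hybrid scheme could apply such weights on top of the UDS-selected pool, trading some sequence-space uniformity for property-space balance.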
The study suggests that increasing the sample size improves model performance, but it does not quantify computational trade-offs. Would the accuracy gains from increasing the dataset from 8,000 to 12,000 samples justify the extra computational cost? A cost-benefit analysis would help clarify this.
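A cost-benefit analysis need not be elaborate: reporting the marginal accuracy gain per unit of extra compute between consecutive sample sizes would suffice. A sketch, using illustrative numbers that are not from the paper:

```python
def marginal_gain(sizes, scores, costs):
    """Accuracy gain per unit of extra compute between consecutive
    sample sizes -- a simple way to see where returns diminish."""
    out = []
    for i in range(1, len(sizes)):
        d_score = scores[i] - scores[i - 1]
        d_cost = costs[i] - costs[i - 1]
        out.append((sizes[i], d_score / d_cost))
    return out

# Illustrative numbers only (not taken from the paper):
sizes = [4000, 8000, 12000, 16000]
r2 = [0.88, 0.92, 0.94, 0.945]
gpu_hours = [1.0, 2.1, 3.3, 4.6]
gains = marginal_gain(sizes, r2, gpu_hours)
```

If the gain per GPU-hour collapses past 12,000 samples, that would directly support the paper's claimed threshold.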
The AI model predicts key peptide properties such as aggregation propensity, hydrophilicity, and isoelectric point, but none of these predictions are validated experimentally. Even verifying a small set of AI predictions through lab experiments would add credibility to the results.
The study focuses on tetrapeptides but does not discuss whether the findings apply to longer peptide sequences. Would a pentapeptide or decapeptide dataset require exponentially larger samples for similar accuracy? A discussion on this would be valuable.
Minor Issues
The UDS method maintains sequence diversity but is inconsistent in capturing the full range of property distributions. Would combining UDS with PPS help correct this issue? A short discussion on potential improvements would be useful.
Some figures, particularly those showing property distribution errors, could be better labeled. Highlighting key trends directly in the captions would improve readability.
The paper includes an effect size analysis, but the real-world impact of these values isn’t fully explained. How do the reported effect sizes translate into meaningful improvements for peptide property prediction?
A few sentences could be clearer. For example, the sentence "No significant differences in AI prediction accuracy are observed among all four sampling methods" could be rewritten as "The study found no major differences in prediction accuracy across the four sampling methods." Small refinements in wording would improve overall readability.
Final Recommendation
Accept with minor revisions. The study provides practical insights into sampling strategies for AI-driven peptide property prediction. With some refinements—particularly around sampling method justification, property bias, computational trade-offs, and validation—this could be a very strong contribution to the field.
The author declares that they have no competing interests.