How Does Sampling Affect the AI Prediction Accuracy of Peptides’ Physicochemical Properties?

Server: bioRxiv
DOI: 10.1101/2025.01.29.635451

Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. A sample size of approximately 12,000 (7.5% of the total tetrapeptide dataset) marks a key threshold for stable and consistent model performance. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides’ physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.
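
To make the numbers in the abstract concrete, here is a minimal Python sketch of the sampling setup. It is not the authors' code. It assumes the tetrapeptide space is built from the 20 standard amino acids (which gives 20^4 = 160,000 sequences, consistent with 12,000 being 7.5% of the total), and the function names `simple_random_sample` and `latin_hypercube_style_sample` are hypothetical. The second function is a simplified, discrete stand-in for Latin Hypercube Sampling: it balances residue usage at each position independently, which may differ from the LHS variant used in the preprint.

```python
import itertools
import random

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_tetrapeptides():
    """Enumerate the complete tetrapeptide sequence space (20^4 sequences)."""
    return ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=4)]


def simple_random_sample(space, n, seed=0):
    """Simple Random Sampling: n distinct sequences, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(space, n)


def latin_hypercube_style_sample(n, length=4, seed=0):
    """Latin-hypercube-style sampling over residue positions (illustrative).

    Each position is split into n equal strata that are mapped onto the
    20 residues, so every residue appears about n/20 times per position.
    The strata are shuffled independently per position and then combined.
    Duplicate sequences are possible.
    """
    rng = random.Random(seed)
    k = len(AMINO_ACIDS)
    columns = []
    for _ in range(length):
        # Stratum i maps to residue index i * k // n (balanced by construction).
        levels = [i * k // n for i in range(n)]
        rng.shuffle(levels)
        columns.append([AMINO_ACIDS[j] for j in levels])
    return ["".join(residues) for residues in zip(*columns)]


space = all_tetrapeptides()
srs = simple_random_sample(space, 12_000)
lhs = latin_hypercube_style_sample(12_000)
fraction = len(srs) / len(space)  # share of the sequence space that is sampled
```

With these assumptions, `len(space)` is 160,000 and a 12,000-sequence sample covers 7.5% of it. The stratified variant guarantees that each residue appears exactly 600 times at every position, whereas the simple random sample only achieves that balance on average.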
