
PREreview of Intrahost dynamics, together with genetic and phenotypic effects predict the success of viral mutations

DOI: 10.5281/zenodo.14391071
License: CC BY 4.0


This paper presents a machine learning approach for predicting how advantageous single amino acid variants (SAVs) were at various time points during the SARS-CoV-2 pandemic. The authors leverage data collected throughout the pandemic to investigate the fitness of intrahost versus interhost SAVs and to determine whether intrahost mutations provide insight into the fitness of interhost mutations. Using the XGBoost algorithm, they build an initial model and then extend it with predictors for SAV linkage and for intrahost versus interhost mutations. Notably, their final models identify linkage as a necessary component for understanding viral evolution and SAV fitness in SARS-CoV-2, as including it improves the overall performance of the machine learning predictors.

Major comments: 

  1. Consider training non-RBD and RBD SAVs in separate, side-by-side models. Since non-RBD SAVs make up the majority of SARS-CoV-2 mutations, removing RBD SAVs from the training data could improve prediction accuracy.

    • Additionally, as the DMS and expression data are only available for the RBD, please explain why you decided to use these data to predict the impact of all mutations.

    • Please address the limitations or advantages of developing the model using DMS data for RBD mutations only, rather than all mutations.

  2. For lines 241-251, consider adding a supplementary figure that shows each of the models built for each of the seven timeframe datasets with observed versus predicted values plotted.

  3. Please add a schematic of how data is being filtered throughout the training of the model. This schematic should highlight the total number of genomes (15 million) and sequencing libraries (7 million) from line 81, and each step taken to reduce the number of libraries or genomes that were used in the analysis.

  4. Adding a schematic describing linkage of SAVs would be useful to better understand the dynamics of genetic linkage. Ideally, this schematic would describe linkage decay and epistasis that are mentioned in the paper.

  5. When including statistical results (Pearson's correlation, R-squared values, and p-values), please clearly state which two variables are being compared.

  6. As stated in lines 111-113, “There were no obvious protein-specific patterns in the number of mutations observed for the vast majority of SAVs that failed to reached fixation (Extended Data Fig. 1).” The intended meaning of “no obvious protein-specific patterns” was unclear to us; please clarify.

  7. Consider explaining why the majority of SAVs that stay below 10% frequency are valuable training data, given that they may cause class imbalance in your model. From a reader's perspective, these data introduce noise that may affect the models because of the unbalanced frequency of fixation (0.1%, line 110) versus non-fixation (99.9%, lines 109-110) among SAVs.

    • Could these data be preprocessed before model building to alleviate that imbalance, or does XGBoost have parameters that counteract it? (One possible weighting approach is sketched after the major comments.)

  8. As some of the variables in your model may be correlated with one another, potentially affecting the results of the XGBoost model, please provide information on how the variables are correlated and whether it is appropriate to include all of them in the model (one way to check this is sketched after the major comments).

  9. Please expand on how your approach to modeling SAV fitness provides advantages over previous models that focus on spike protein SAVs.

  10. Please address the location of the SAVs in the virion represented by the discrete points (cyan, future freq. > 10%) in Figure 2h.

  11. Please provide more details on the seven intrahost datasets. How many patients were in each dataset? How many sequences were there per host? Over what time period, on average, were these data collected?

  12. When reporting the MAE, please interpret the metric in the context of these data. An MAE of 7.5% to 11.3% indicates that predicted frequencies deviate from observed frequencies by that much on average; does a model with this level of error still provide useful information?

  13. The following comments are related to this portion of the discussion: “For example, the estimated fitness of a moderately successful SAV that circulated only in countries with lower surveillance would be lower than a less successful SAV that spread in countries with more extensive surveillance. However, these discrepancies are likely random and manifest as noise, rather than a systematic bias, in the fitness estimates. As such, these sampling biases are not likely to skew the patterns observed in our study.”

    • It is reasonable to state that differences in surveillance and data collection between countries could bias the data, but it is not justified to conclude that your data are not skewed by this unless you can demonstrate it. If this cannot be demonstrated, it would be better to acknowledge it as a limitation of epidemiological studies that rely on these databases.

    • Related to Figure 1b, different countries or regions likely experienced different timelines of SARS-CoV-2 variants, which should be discussed in the paper. Consider noting that your results likely reflect well-surveilled regions and may not be applicable to regions with less sequencing data.

  14. On lines 283-284, you discuss training another model on the Alpha dataset to predict future waves. Please clarify whether this model used the same variables as the XGBoost model trained on intrahost variation.
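
To illustrate the weighting question in major comment 7, here is a minimal sketch of two options XGBoost offers; the data, feature matrix, and thresholds below are random placeholders rather than the authors' pipeline, and the choice between a classification and a regression framing is our assumption.

```python
# Minimal sketch: countering the fixed vs. non-fixed SAV imbalance in XGBoost.
# All data below are randomly generated placeholders.
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 12))                  # placeholder predictor matrix
fixed = (rng.random(10_000) < 0.001).astype(int)   # ~0.1% of SAVs reach fixation

# Option 1 (classification framing): scale_pos_weight up-weights the rare class.
clf = XGBClassifier(
    n_estimators=300,
    scale_pos_weight=(fixed == 0).sum() / max((fixed == 1).sum(), 1),  # ~neg/pos ratio
    eval_metric="aucpr",  # precision-recall AUC is more informative than accuracy here
)
clf.fit(X, fixed)

# Option 2 (regression framing, closer to predicting a future frequency):
# per-sample weights keep rare high-frequency SAVs from being drowned out.
future_freq = rng.beta(0.2, 20, 10_000)            # placeholder target in [0, 1]
weights = np.where(future_freq > 0.10, 10.0, 1.0)  # e.g. 10x weight above 10%
reg = XGBRegressor(n_estimators=300)
reg.fit(X, future_freq, sample_weight=weights)
```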

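Similarly, for major comment 8 (and minor comment 12), a pairwise correlation table of the predictors computed before model fitting would answer the question directly. The sketch below uses hypothetical feature names and simulated values purely for illustration; note that tree ensembles such as XGBoost are fairly robust to correlated predictors in terms of prediction accuracy, but feature-importance and SHAP attributions can be split arbitrarily between correlated features.

```python
# Minimal sketch: flag strongly correlated predictor pairs before fitting XGBoost.
# Feature names and values are hypothetical stand-ins for the model's actual predictors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000
dms_binding = rng.normal(size=n)
predictors = pd.DataFrame({
    "intrahost_freq": rng.random(n),
    "blosum62_score": rng.integers(-4, 12, n).astype(float),
    "dms_binding": dms_binding,
    "dms_expression": 0.8 * dms_binding + 0.2 * rng.normal(size=n),  # deliberately collinear
})

corr = predictors.corr(method="pearson")
print(corr.round(2))

# Report predictor pairs above a chosen threshold, e.g. |r| > 0.7
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack()[upper.stack() > 0.7])
```
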
Minor comments:

  1. To help contextualize the paper, consider including SARS-CoV-2 in the article title (“Intrahost dynamics, together with genetic and phenotypic effects, predict the success of viral mutations in SARS-CoV-2”).

  2. Line 110 states the mutation frequencies as <10%, 10-90%, and >90%. In lines 213-225 and Figure 2a, 10-90% is reported as >10%. Please keep this naming consistent throughout the paper as having >10% and >90% together is confusing. If the percentages are being misinterpreted, please clarify.

  3. In line 154, clarify that this trend is only for mutations that did not reach fixation.

  4. In line 165, please remind us of the physicochemical variables used in the model for improved readability.

  5. Please improve the description of Figure 1b in the text to explain how the variants are defined and what dates the gray dashed lines represent (i.e., the start of each new variant). This is vital for understanding how the predictive models are being used.

  6. In lines 165, 167, 171, 174, 177, and 178 where statistics are provided, consider including supplementary plots that show the statistics.

  7. Consider adding arrows on the x-axis of Figure 1e to provide conserved versus non-conserved directionality to the BLOSUM62 score index.

  8. Please clarify the range of SHAP values to help contextualize the median value.

  9. Please consider relabeling the x-axis title from “Dataset” to “Variants” or “SARS-CoV-2 Datasets” in Figure 2a and subsequent figures.

  10. Please increase the size of the inset in Figure 2b, as it is very difficult to read.

  11. Please correct in-text references to Figure 4, which are currently referenced as Figure 3.

  12. Clarify whether there is any multicollinearity among the variables, as this could affect the XGBoost model (see also major comment 8).

  13. Please consider increasing the spacing between panels in the figures, as some x-axis labels are easily confused with the title of the panel/graph below them.

  14. Consider including additional metrics such as MAPE and RMSE, in addition to MAE, for model scoring and evaluation; reporting several metrics gives readers a more robust understanding of model performance (a brief sketch computing these metrics follows this list).
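
Relatedly, the metrics mentioned in minor comment 14 (and the MAE question in major comment 12) could be reported side by side. The following minimal sketch uses made-up observed and predicted frequencies purely to show the calculations:

```python
# Minimal sketch: MAE, RMSE, and MAPE on hypothetical observed vs. predicted SAV frequencies.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

observed = np.array([0.02, 0.15, 0.40, 0.85, 0.05])   # illustrative values only
predicted = np.array([0.10, 0.12, 0.55, 0.70, 0.01])

mae = mean_absolute_error(observed, predicted)
rmse = np.sqrt(mean_squared_error(observed, predicted))
# MAPE is unstable when observed frequencies are near zero, which is common for
# SAVs that never spread, so report it with that caveat or on a filtered subset.
mape = np.mean(np.abs((observed - predicted) / observed))

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.1%}")
```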

Competing interests

The authors declare that they have no competing interests.