
PREreview of Machine Learning Analysis of Post-Acute COVID Symptoms Identifies Distinct Clusters and Severity Groups

Published
DOI
10.5281/zenodo.17886094
License
CC0 1.0

This preprint presents a novel use of machine learning to identify and analyze symptoms of long COVID, a post-infection condition in which a range of symptoms begins around three months after SARS-CoV-2 infection. The authors set out to use often-overlooked questionnaire data to identify potential symptom patterns, severity groups, and biological endotypes in patients with long COVID drawn from four cohorts at the University of California, San Francisco (UCSF), the Icahn School of Medicine at Mount Sinai, Emory, and Cardiff. To do this, they analyzed symptom questionnaire data from the four independent cohorts, together comprising 1,661 individuals who reported 32–50 symptoms in binary (present/absent) form. Because each cohort used a slightly different survey, the authors modeled each dataset separately. They then used topic modeling, specifically Poisson Factor Analysis, to identify latent symptom “topics” that capture symptom co-occurrence patterns, and clustered participants on their topic proportions to define symptom clusters and broader, organ-system–based endotypes. The authors also derived a data-driven severity scale from symptom burden, analyzed how clusters map onto clinical severity, and examined how symptom profiles change over time in the longitudinal UCSF cohort. They further explored how the symptom-based clusters correlate with immune responses measured in a subset of participants. This novel use of machine learning could uncover patterns of long-COVID symptoms that may represent endotypes of a disease that remains under-researched. Overall, the authors present an in-depth and well-conducted study proposing that long COVID is heterogeneous and composed of reproducible symptom clusters, with the caveat that further validation is needed.
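To make the pipeline concrete for readers, here is a minimal sketch of the topic-then-cluster approach described above. It is not the authors' implementation: scikit-learn's NMF with a Kullback-Leibler loss stands in for Poisson Factor Analysis (the two share a Poisson-likelihood interpretation), and the symptom matrix, topic count, and cluster count are simulated placeholders.

```python
# Minimal sketch of a topic-then-cluster pipeline (not the authors' code).
# NMF with a KL loss stands in for Poisson Factor Analysis; the data,
# topic count T, and cluster count K are simulated, illustrative values.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1661, 40)).astype(float)  # binary participant-by-symptom matrix (simulated)

T, K = 8, 4                                             # assumed numbers of topics and clusters
pfa_like = NMF(n_components=T, beta_loss="kullback-leibler",
               solver="mu", max_iter=500, random_state=0)
loadings = pfa_like.fit_transform(X)                    # participant-by-topic weights

# Normalize to per-participant topic proportions, then cluster on them.
topic_props = loadings / np.clip(loadings.sum(axis=1, keepdims=True), 1e-12, None)
clusters = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(topic_props)

# `clusters` would then be summarized by dominant symptoms, mapped to
# organ-system endotypes, and related to severity or immune measures.
```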

Major Issues 

Methods

  1. Differences in cohorts make cross-site comparison more difficult. The authors did a good job of explaining the efforts taken to construct their analysis model and the timeline over which each cohort's enrollment data were analyzed separately before being compiled. However, because of the nature of these analyses, participant data inevitably come from varied recruitment methods (walk-in, case-control, longitudinal, etc.), differ in timing post-infection, and rely on symptom questionnaires that vary across cohorts. While the heterogeneity of the four cohorts is acceptable, I recommend that the authors address potential confounding by adding sensitivity analyses for time since infection, reporting sex-stratified models, and verifying that the questionnaires were similar enough to support clustering. Additional factors, such as vaccination history and reinfections, should also be controlled for in the study model. Without such validation, it is difficult to generalize these findings to a broader population; they would benefit from confirmation in standardized cohorts.

  2. While multiple cohorts are included, each cluster solution is trained separately on its own cohort. Does this mean there is no single model trained across all cohorts? Was any cohort's model validated on another cohort's data? As a reader, I would love to hear the authors' opinion on generalizability and universal endotypes given that each cohort's model is separate.

    • Because symptoms differ across cohorts, and some cohorts do not measure symptom domains captured in others, I would like to see the authors quantify how many symptoms overlap and test whether the missing domains affect the topic structure (a minimal sketch of such an overlap check appears after this list). This is not a limitation of the study per se, but discussing this distinction across cohorts could strengthen the study's validity.

  3. The number of topics does not appear to be validated. While the authors do a good job of stating that the topic count was chosen by coherence and likelihood, they do not provide model selection curves or sensitivity analyses, and topic composition may shift substantially with different topic counts. I recommend reporting this transparently, including a plot of coherence versus T, cluster stability metrics, and a brief rationale for the chosen number of topics T (a model-selection sketch follows this list). This guards against overfitting and provides a more defensible, credible basis for the subsequent modeling.
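
To illustrate the kind of model-selection report suggested in point 3 (and the stability checks that would also serve the sensitivity analyses in point 1), here is a hedged sketch. It sweeps the number of topics T, records reconstruction error as a stand-in for the coherence and likelihood criteria the authors describe, and estimates cluster stability with the adjusted Rand index over bootstrap resamples; all data and parameter values are simulated placeholders, not the authors' settings.

```python
# Illustrative sweep over topic counts with a bootstrap stability check.
# Simulated data; NMF with KL loss stands in for Poisson Factor Analysis.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 40)).astype(float)  # simulated binary symptom matrix
K = 4                                                  # illustrative number of participant clusters

def fit_topics(data, T, seed=0):
    model = NMF(n_components=T, beta_loss="kullback-leibler",
                solver="mu", max_iter=300, random_state=seed)
    return model.fit_transform(data), model.reconstruction_err_

for T in range(2, 11):
    W, err = fit_topics(X, T)
    ref_labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(W)

    # Refit on bootstrap resamples and compare labels for the resampled
    # participants against the reference solution.
    aris = []
    for b in range(20):
        idx = rng.choice(len(X), size=len(X), replace=True)
        Wb, _ = fit_topics(X[idx], T, seed=b + 1)
        boot_labels = KMeans(n_clusters=K, n_init=10, random_state=b).fit_predict(Wb)
        aris.append(adjusted_rand_score(ref_labels[idx], boot_labels))

    print(f"T={T:2d}  reconstruction_err={err:8.2f}  mean_bootstrap_ARI={np.mean(aris):.2f}")

# Plotting these curves (fit/coherence vs T, stability vs T) would make the
# chosen T, and the downstream clusters, easier to defend.
```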
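For the cross-cohort overlap question raised under point 2, quantifying item overlap could be as simple as a Jaccard comparison of harmonized symptom labels; the cohort item lists below are invented placeholders, not the actual questionnaire contents.

```python
# Sketch: quantify symptom-item overlap between cohort questionnaires.
# The symptom sets are invented placeholders, not the real instruments.
from itertools import combinations

cohort_items = {
    "UCSF":    {"fatigue", "brain_fog", "dyspnea", "palpitations", "anosmia"},
    "Sinai":   {"fatigue", "brain_fog", "dyspnea", "joint_pain"},
    "Emory":   {"fatigue", "dyspnea", "anosmia", "insomnia"},
    "Cardiff": {"fatigue", "brain_fog", "palpitations", "insomnia"},
}

for (a, items_a), (b, items_b) in combinations(cohort_items.items(), 2):
    shared = items_a & items_b
    jaccard = len(shared) / len(items_a | items_b)
    print(f"{a} vs {b}: {len(shared)} shared items, Jaccard = {jaccard:.2f}")

# Domains present in one cohort but absent in another could then be dropped
# in a sensitivity refit to test whether the topic structure is robust.
```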

Discussion

  1. The authors do a good job of acknowledging potential limitations and challenges in the discussion. However, I recommend that they avoid making causal inferences and more prominently acknowledge the observational design of the study.

Minor Issues 

Methods

  1. While the authors provide a supplement listing the exact symptoms across the various questionnaires, a short sample list of symptoms in the main text would save readers time and make the methods easier to follow.

  2. I also wonder why duration since infection or duration since symptom onset was not included in the modeling, and whether there is a reason for this that could be mentioned in the methods or discussion.

Results

  1. While the authors’ visualizations are strong, the reporting of statistical uncertainty is inconsistent. Figure 3 presents the only confidence interval in the paper; extending confidence interval reporting to other figures and analyses would enhance interpretability.
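
As one low-cost way to extend uncertainty reporting, a nonparametric bootstrap interval can be attached to most of the plotted summaries; the sketch below uses made-up data for a single per-cluster symptom prevalence.

```python
# Sketch: bootstrap 95% CI for a per-cluster symptom prevalence (simulated data).
import numpy as np

rng = np.random.default_rng(2)
has_symptom = rng.integers(0, 2, size=200)   # binary indicator for one symptom within one cluster

boot_means = [rng.choice(has_symptom, size=len(has_symptom), replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"prevalence = {has_symptom.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```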

Competing interests

The authors declare that they have no competing interests.

Use of Artificial Intelligence (AI)

The authors declare that they used generative AI to come up with new ideas for their review.