Machine Learning Analysis of Post-Acute COVID Symptoms Identifies Distinct Clusters and Severity Groups

Posted
Server: medRxiv
DOI: 10.1101/2025.11.16.25340350

Questionnaires that capture patient-reported symptoms provide low-cost but potentially high-value data for the de novo discovery of patient groupings by disease phenotype, severity, and responsiveness to intervention within an umbrella condition. The availability of comprehensive electronic health records (EHRs) has nonetheless overshadowed the use of questionnaire data for symptom analysis in the context of COVID-19. We analyzed de-identified questionnaires from post-acute COVID-19 cohorts at the University of California, San Francisco (UCSF, n = 669), Icahn School of Medicine at Mount Sinai (ISMMS, n = 615), Emory University (Emory, n = 60), and the University Hospital of Wales (Cardiff, n = 317). Using topic modeling followed by unsupervised clustering, we identified distinct symptom clusters and their corresponding symptom signatures. Mapping these signatures to organ systems revealed nine to twelve endotypes per cohort, capturing the heterogeneity of post-COVID-19 symptoms. Some clusters were associated with pre-existing conditions, including a female-predominant severity cluster with neurological and hormonal symptoms. Longitudinal analysis distinguished three symptom trajectories: acute then resolving, persistent but attenuated, and progressive disease. Across all cohorts, three severity levels (mild, moderate, and severe) were evident from symptoms alone. Symptom-based severity scores correlated with patient-reported health status (EQ-5D) and with SARS-CoV-2-specific antibody responses in plasmablasts, validating the symptom-based severity predictions. Cluster-level analyses further stratified patients into recovered and non-recovered subgroups, identifying endotypes associated with different recovery trajectories. Finally, a meta-analysis integrating cohort-specific clusters defined ten global endotypes and a unified map of severity scores, highlighting cohort-specific patterns, sex differences, and relationships among organ systems.
These findings demonstrate that machine learning-assisted screening of questionnaire data can robustly identify symptom clusters, endotypes, and severity groups, providing a framework for stratifying long COVID patients for precision medicine trial design.
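
As a rough illustration of what "unsupervised clustering" of questionnaire answers can look like, the sketch below groups patients by yes/no symptom responses and reads off a symptom signature per cluster. It is not the authors' pipeline: the abstract does not name the algorithms used, the topic-modeling step is omitted entirely, plain k-means is swapped in as a stand-in, and the symptom names, the toy responses, and the helper functions (`kmeans`, `farthest_point_init`, `signature`) are all hypothetical.

```python
# Hypothetical toy data: rows are patients, columns are yes/no symptom answers.
# None of these values come from the cohorts described in the abstract.
SYMPTOMS = ["fatigue", "brain_fog", "headache", "cough", "dyspnea", "chest_pain"]
RESPONSES = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 1],
]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding: take the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means (assumed stand-in for the unnamed clustering method)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each patient goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def signature(centroid, names, threshold=0.5):
    """Symptoms reported by at least `threshold` of a cluster's members."""
    return [name for name, freq in zip(names, centroid) if freq >= threshold]


labels, centroids = kmeans(RESPONSES, k=2)
signatures = [signature(c, SYMPTOMS) for c in centroids]

for j, sig in enumerate(signatures):
    print(f"cluster {j}: n={labels.count(j)}, signature={sig}")
```

On this toy input the two clusters separate into a neurological-leaning and a respiratory-leaning signature. In a real analysis the number of clusters, the distance measure, and the signature threshold would all need to be chosen and validated against the data rather than fixed as they are here.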
