
PREreview of Combining blood transcriptomic signatures improves the prediction of progression to tuberculosis among household contacts in Brazil

Published
DOI
10.5281/zenodo.17634756
License
CC BY 4.0

Summary:

This manuscript presents the first systematic head-to-head comparison of multiple TB risk signatures on the same platform and dataset. With nearly a third of the world’s population infected with tuberculosis and only 5-10% of those infected progressing to active disease, there is a need for better triage tests that could significantly improve resource allocation and reduce unnecessary treatment toxicity. The authors aim to address this need by deriving new blood transcriptomic signatures with greater specificity using machine learning, and by assessing whether combining signatures (ML-derived + published) enhances predictive performance.

The dataset consisted of 272 individuals, specifically 11 progressors to active TB and 261 non-progressors. Expression of gene signatures was quantified using the NanoString nCounter platform and scored using Gene Set Variation Analysis (GSVA) and Pathway Level Analysis of Gene Expression (PLAGE).
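To make the scoring concrete: PLAGE summarizes a signature as each sample’s coordinate on the first principal component of the standardized signature-gene submatrix, while GSVA uses a rank-based enrichment statistic. Below is a minimal sketch of PLAGE-style scoring only, using simulated data and hypothetical gene names (in practice both scores would typically be computed with the GSVA Bioconductor package rather than hand-rolled code).

```python
# Minimal sketch of PLAGE-style gene-set scoring. The expression matrix,
# gene identifiers, and signature below are simulated placeholders, not
# values from the study.
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(75, 272))          # genes x samples (toy dimensions)
genes = [f"gene{i}" for i in range(75)]    # hypothetical gene identifiers
signature = ["gene3", "gene10", "gene42"]  # hypothetical signature members

def plage_scores(expr, genes, signature):
    """Score one signature per sample via the first right singular vector."""
    idx = [genes.index(g) for g in signature]
    sub = expr[idx, :]
    # Standardize each gene across samples (zero mean, unit variance).
    sub = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, keepdims=True)
    # The first right singular vector gives per-sample scores on the first PC.
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    return vt[0]

scores = plage_scores(expr, genes, signature)
print(scores.shape)  # (272,) -- one score per individual
```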

The authors report that the ML-derived signatures (h2otg, h2ox1) outperformed published signatures (AUCs of 0.89 vs 0.83-0.84) and that their two-signature combinations achieved WHO target product profile levels (>75% sensitivity and specificity) for predicting TB progression multiple years before disease onset. They conclude that their research has "potential clinical utility in identifying high-risk individuals for targeted prophylaxis to reduce morbidity and mortality."

Overall, this work makes a valuable contribution to the TB biomarker field by presenting a genuinely novel and clinically relevant combination approach and constitutes an important proof-of-concept. However, the small number of progression events and the use of the same dataset for both signature derivation and combination testing create substantial risk of overfitting. The impressive performance metrics should be viewed as preliminary estimates requiring validation in independent cohorts before clinical utility can be claimed.

Strengths

·      Clinical relevance and contextualization. The authors clearly articulate the clinical problem they address and provide a strong rationale for why their work matters, explaining that it could help reduce unnecessary treatment toxicity and appropriately positioning it within the WHO End TB 2030 initiative.

·      Rigorous quality control measures.  All progressors had microbiological confirmation of active TB, preventing misclassification from clinical diagnosis alone. The researchers systematically excluded co-prevalent cases who developed TB within 90 days of baseline, preventing confounding between disease detection and progression prediction. RNA quality was maintained through PAXgene tube collection, standardized Qiagen extraction protocols, and Nanodrop quality assessment to ensure only high-quality RNA proceeded to NanoString analysis.

·      Comprehensive head-to-head comparison design. They achieve remarkably comprehensive signature coverage: 15 published signatures that represent major work in the field, including one of their own, PREDICT29, which they validate in this study. Using a single platform for the comparison eliminates the technical confounding that plagued previous multi-platform studies, and using two scoring approaches (PLAGE and GSVA) adds methodological robustness.

·      Comprehensive reporting. Multiple performance metrics (AUC, sensitivity, specificity, PPV, NPV) are reported, along with p-values. Confusion matrices and Venn diagrams that enable the reader to visualize performance are also included. Comprehensive supplementary materials are provided.

Major Comments

·      Asymmetric evaluation framework. H2O signatures were developed using 55% of this cohort and then evaluated on the full dataset, whereas the published signatures never saw any of these data during development. This asymmetry biases the evaluation in favor of the h2o signatures. Yet the authors state, without nuance, that ML-derived signatures “performed significantly better than published signatures”.

I recommend acknowledging that h2o signatures have a systematic advantage over published signatures that is unrelated to biological relevance or true predictive power and that performance differences between the two types of signatures may reflect prior data exposure rather than genuine superiority of h2o signatures.

·      Circular validation. The authors derived h2o signatures using the same dataset that they then evaluated two-signature combinations on, which creates a risk of overfitting and inflated performance. Therefore, the reported performance cannot be trusted as a reliable estimate of how these combinations would perform in new populations.

I recommend that the authors explicitly acknowledge this circularity and state that the reported performance figures should be viewed as upper bounds rather than unbiased estimates.

·      Recurring numerical inconsistencies. The Methods report 273 total participants. However, the Figure 1 caption states “n=272”, and all confusion matrices in Figures 2-4 show calculations based on 272 individuals comprising 11 progressors (TP+FN) and 261 non-progressors (TN+FP). The text also refers to 262 non-progressors more than once, e.g., “reduced the potential prophylaxis candidates from 262 to 80”. Additionally, Figure 4B contains errors: the top panel (h2ox1+ & NANO6) shows non-progressors summing to 244 (11+35+47+151) rather than the expected 261, and the bottom panel (h2ox1+p29) labels one segment “47 NP, 1 NP”, which should likely be “47 NP, 1 P” given the total progressor count.

The authors should reconcile these discrepancies, either by stating whether one non-progressor was excluded from the analysis (and, if so, justifying the exclusion) or by correcting the Methods to state that 272 participants were enrolled. The Figure 1 caption and all text referencing the total sample size should then be revised so that consistent values are reported throughout the manuscript. The numbers in Figure 4B should be corrected to match the true number of participants in each class. I also recommend that the authors verify their confusion matrix, specificity, PPV, and NPV calculations; a simple consistency check is sketched below.
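A minimal sketch of such a consistency check, assuming the class totals stated in the manuscript (11 progressors, 261 non-progressors); the confusion-matrix counts passed in are placeholders, not values from the paper:

```python
# Verify that a confusion matrix sums to the reported class totals and
# recompute the derived metrics. Counts below are placeholders.
def check_confusion(tp, fn, tn, fp, n_pos=11, n_neg=261):
    assert tp + fn == n_pos, f"progressors sum to {tp + fn}, expected {n_pos}"
    assert tn + fp == n_neg, f"non-progressors sum to {tn + fp}, expected {n_neg}"
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return dict(sensitivity=sens, specificity=spec, ppv=ppv, npv=npv)

print(check_confusion(tp=9, fn=2, tn=200, fp=61))  # placeholder counts
```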

·      Statistical power and minority-class inadequacy. With only 11 progressors (6 used in training) in the dataset and 75 gene features, there is a substantial risk of overfitting: the ML models quite likely "memorized" these specific individuals rather than learning generalizable patterns, leading to inflated performance estimates. There may also be inadequate statistical power to reliably detect true differences between signatures, combinations, and models; biomarker validation studies typically require substantially larger event counts for reliable performance estimation. No evidence is provided of nested cross-validation during training, regularization strategies, dimensionality reduction prior to modeling, or assessment of signature stability through repeated random splits.

The authors should more directly acknowledge these limitations arising from the small number of positive cases. If they believe 11 progressors are sufficient to confidently support the conclusions of this work, they should provide evidence for this. They should also consider reporting bootstrap confidence intervals for sensitivity and specificity in Tables 3 and 4 to better reflect the true uncertainty in their estimates given the minority-class size (a minimal sketch is given below). For comparisons between signatures, they should report whether confidence intervals overlap and explicitly state when differences are not statistically distinguishable.
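A minimal sketch of percentile-bootstrap confidence intervals for sensitivity and specificity, using simulated labels and predictions in place of the study data; note how wide the sensitivity interval becomes with only 11 positives:

```python
# Percentile-bootstrap CIs for sensitivity and specificity. Labels and
# predictions are simulated stand-ins, not study data.
import numpy as np

rng = np.random.default_rng(42)
y_true = np.array([1] * 11 + [0] * 261)                     # 11 P, 261 NP
y_pred = rng.binomial(1, np.where(y_true == 1, 0.85, 0.2))  # toy classifier

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    sens, spec = [], []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # resample individuals with replacement
        t, p = y_true[idx], y_pred[idx]
        if t.sum() == 0 or t.sum() == n:   # skip degenerate resamples
            continue
        sens.append(((t == 1) & (p == 1)).sum() / (t == 1).sum())
        spec.append(((t == 0) & (p == 0)).sum() / (t == 0).sum())
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return np.percentile(sens, [lo, hi]), np.percentile(spec, [lo, hi])

sens_ci, spec_ci = bootstrap_ci(y_true, y_pred)
print("sensitivity 95% CI:", sens_ci, "specificity 95% CI:", spec_ci)
```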

Minor Comments

·      Extensive multiple testing. The authors tested 15 published signatures, multiple h2o-derived signatures, and multiple two-signature combinations without any correction for multiple testing. They should either apply an appropriate correction (e.g., false discovery rate control, sketched below) and report adjusted p-values, or explicitly acknowledge this limitation, for example: “We did not correct for multiple testing when comparing signatures and combinations, which increases the possibility that some apparent performance differences are due to chance.”
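A minimal sketch of Benjamini-Hochberg FDR adjustment using statsmodels, with illustrative placeholder p-values rather than values from the paper:

```python
# Benjamini-Hochberg FDR adjustment over a family of signature comparisons.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.004, 0.02, 0.03, 0.04, 0.049, 0.2, 0.6]  # placeholders
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for p, q, r in zip(pvals, p_adj, reject):
    print(f"p={p:.3f}  q={q:.3f}  significant_at_FDR_0.05={r}")
```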

·      PLAGE scoring portability considerations. PLAGE scoring requires principal-component weights derived from the original cohort's covariation patterns, which adds a population-dependent parameter compared to GSVA, which only requires a threshold. The clinical significance of this theoretical difference is uncertain. The authors should consider disclosing it and noting that PLAGE-scored signature combinations may not transfer to new clinical settings as readily as GSVA-scored combinations (see the sketch below).
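A minimal sketch of the portability point, on simulated data: the first-PC loadings (and the standardization parameters) are fit on the derivation cohort and must be frozen and carried over to score any new cohort.

```python
# PLAGE weights are cohort-derived: fit first-PC loadings on one cohort,
# freeze them, and re-apply to a new cohort. All data here are simulated.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(5, 200))   # signature genes x derivation samples
new = rng.normal(size=(5, 50))      # same genes, new cohort

# Fit on the derivation cohort: per-gene mean/SD plus first-PC loadings.
mu, sd = train.mean(axis=1, keepdims=True), train.std(axis=1, keepdims=True)
u, _, _ = np.linalg.svd((train - mu) / sd, full_matrices=False)
weights = u[:, 0]                   # frozen gene loadings

# Apply to the new cohort: standardize with the *training* mu/sd, then project.
new_scores = weights @ ((new - mu) / sd)
print(new_scores.shape)             # (50,) -- one score per new individual
```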

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.