The Second Brain: Diffusion Models for Realistic Human Microbiome Generation

Posted
Server: bioRxiv
DOI: 10.64898/2026.05.07.723523

The human microbiome is a critical determinant of health and disease, but microbiome machine learning is constrained by limited data availability, heterogeneous cohort coverage, and privacy risks from individually identifying microbial signatures. Synthetic microbiome generation could support method development and privacy-preserving sharing, provided that generated samples preserve the ecological zero-inflation of real communities. We present a diffusion-based generative model with a sparsity-preserving decoder built around two sparsity-focused mechanisms: (1) prevalence-aware bias initialization that anchors per-taxon presence probabilities to observed prevalences from epoch one; and (2) a hard sparsity loss implemented with straight-through gradient estimators. The implementation also uses hyperbolic taxonomic embeddings as an unvalidated, phylogeny-aware architectural prior in the diffusion backbone. Evaluated on the American Gut Project (AGP; 4,827 samples, 500 taxa), the full 15.2M-parameter model achieves parametric-level sparsity preservation: 1.4% deviation in the main comparison and 2.6% ± 0.5% deviation across three AGP seeds. SparseDOSSA2 achieves the lowest sparsity deviation in this comparison (0.7%), and MIDASim also passes the operational sparsity threshold (4.9%). Among the three threshold-passing methods, MIDASim achieves the best ecological distance scores, SparseDOSSA2 is best on sparsity deviation, and our model achieves the best prevalence correlation (0.996) while narrowly improving on SparseDOSSA2 on Bray–Curtis (0.0485 vs. 0.0495) and UniFrac (0.0400 vs. 0.0435) discrepancies. PERMANOVA remains able to distinguish generated from real AGP samples (F = 64.29), which we treat as an important limitation rather than evidence of indistinguishability.
These results support a deliberately narrow conclusion: this is, to our knowledge, the first deep generative model to match parametric-level sparsity preservation for human microbiome profiles while remaining competitive on standard ecological distance metrics.
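
The abstract does not spell out how prevalence-aware bias initialization is implemented. One plausible reading, offered here only as a minimal sketch, is that the decoder has a per-taxon presence head with a sigmoid output, and that its bias is set to the logit of each taxon's observed prevalence so the initial presence probability already matches the data. The function names (`prevalence`, `prevalence_bias`, `zero_fraction`), the clipping constant `eps`, and the toy abundance table below are all assumptions for illustration and are not taken from the preprint.

```python
import math


def prevalence(samples):
    """Fraction of samples in which each taxon has a nonzero abundance."""
    n_samples = len(samples)
    n_taxa = len(samples[0])
    return [
        sum(1 for row in samples if row[j] > 0) / n_samples
        for j in range(n_taxa)
    ]


def prevalence_bias(prev, eps=1e-4):
    """Logit of the clipped prevalence, so that sigmoid(bias) ~= prevalence.

    Clipping to [eps, 1 - eps] is an assumption made here to keep the logit
    finite for taxa that are always or never present.
    """
    biases = []
    for p in prev:
        p = min(max(p, eps), 1.0 - eps)
        biases.append(math.log(p / (1.0 - p)))
    return biases


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_fraction(samples):
    """Overall share of exact zeros in an abundance table."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for value in row if value == 0)
    return zeros / total


# Hypothetical toy table: 4 samples x 4 taxa of relative abundances.
real = [
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
]

prev = prevalence(real)
bias = prevalence_bias(prev)
# Presence probabilities the head would emit before any training step.
init_presence = [sigmoid(b) for b in bias]
real_sparsity = zero_fraction(real)
```

Under this reading, a sparsity deviation could be computed as the absolute difference between `zero_fraction` on generated and real tables; whether the preprint uses an absolute or a relative difference is not stated in the abstract.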
