This preprint is a literature review focused on major computational and conceptual pathways in bioinformatics, genomics, and health applications. The review discusses reproducibility in genomics, comparative genomics, genomic language models, tokenization strategies, deep learning architectures, state space models, telomere-to-telomere assemblies, and the connection of these technologies to precision medicine. The manuscript frames genomics as a data-driven discipline in which the major challenge has shifted from data generation to large-scale computational interpretation and clinical translation.
The review’s main contribution is its attempt to connect several rapidly evolving areas (genomic reproducibility, benchmarking, comparative genomics, BPE tokenization, large language models, state space models such as Caduceus and Mamba, and precision medicine) into one broad conceptual roadmap. The manuscript is especially useful for readers seeking a high-level orientation to how computational genomics is evolving beyond conventional sequence analysis toward scalable, AI-enabled interpretation of long genomic sequences.
Timely and relevant topic. The review addresses important current themes in genomics, including T2T assemblies, long-range genomic modeling, genomic language models, benchmarking, and precision medicine. These are highly relevant areas as genomics becomes increasingly computational and clinically integrated.
Good broad conceptual framing. The manuscript correctly emphasizes that modern genomics is no longer limited by sequence generation alone, but increasingly by reproducible analysis, scalable computation, and clinically meaningful interpretation.
Useful inclusion of reproducibility and benchmarking. The discussion of reproducibility, GIAB-style benchmarking, and workflow standardization is valuable because clinical genomics depends heavily on consistent and reliable pipelines.
Interesting coverage of genomic language models and tokenization. The section on Byte-Pair Encoding and repetitive elements is useful because tokenization can strongly influence how genomic foundation models learn sequence structure.
Good attention to state space models. Highlighting Caduceus, Mamba, and related SSM architectures is timely because models whose cost scales linearly with sequence length may help overcome the quadratic attention cost that limits transformer-based approaches on ultra-long genomic sequences.
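To make the tokenization point above concrete for readers less familiar with Byte-Pair Encoding, the following is a minimal sketch of BPE applied to a DNA string. It is a toy illustration only, not the tokenizer of any model discussed in the manuscript; the function names (`most_frequent_pair`, `merge_pair`, `bpe_tokenize`) are hypothetical and chosen for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if no pair exists."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe_tokenize(sequence, num_merges):
    """Start from single nucleotides and apply up to `num_merges` greedy merges.

    Returns the final token list and the ordered list of merged pairs.
    """
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:  # a single token remains, nothing left to merge
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


if __name__ == "__main__":
    # A short tandem repeat: the repeated unit dominates the first merges.
    tokens, merges = bpe_tokenize("ACACACAC", 2)
    print(tokens)
    print(merges)
```

In this toy case the repeated unit is absorbed into progressively longer tokens after only a couple of merges, which is one way to see why repetitive elements can shape a learned vocabulary.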
The manuscript covers many topics: reproducibility, comparative genomics, DCJ distance, tokenization, genomic language models, deep learning, SSMs, T2T assemblies, and precision medicine. Each topic is relevant, but the current structure sometimes reads as a collection of important themes rather than a tightly argued review.
Suggested improvement: The author should define a clearer central question, such as:
“How are computational models enabling clinical interpretation of long-range genomic information?” or
“What computational barriers must be solved before genomic AI can reliably support precision medicine?”
This would help connect the sections more strongly and make the review more focused.
The manuscript states that literature was searched across PubMed, IEEE Xplore, Google Scholar, and arXiv, with preference for studies from 2012–2025 and a final selection of 40 key references. This is helpful, but it does not provide enough detail to determine how references were selected or excluded.
Suggested improvement: The author should add:
exact search strings
search dates
number of records identified
number excluded and why
whether duplicate records were removed
whether screening was title/abstract-based or full-text-based
whether a PRISMA-style flow diagram was used
criteria for including preprints versus peer-reviewed articles
Because this is a literature review, transparent search and selection methods are important.
The manuscript includes both peer-reviewed literature and recent preprints, especially for fast-moving areas such as SSMs, Caduceus, Mamba, and genomic tokenization. Including preprints is reasonable for an emerging computational field, but readers should be able to clearly distinguish established findings from preliminary claims.
Suggested improvement: Add a table classifying key references as:
peer-reviewed article
preprint
benchmarking paper
theoretical/methodological paper
review article
This would improve transparency and help readers assess the maturity of the evidence.
The manuscript states that state space models can match or surpass transformer performance and generalize to much longer genomic sequences. This is an important point, but the review should avoid implying that SSMs have already broadly solved long-range genomic modeling.
Suggested improvement: The author should add a more balanced discussion of:
task-specific performance differences
limitations of current benchmarks
interpretability challenges
training-data bias
generalization across species and genome builds
whether improved computational scaling translates into better biological insight
whether these models are clinically validated
This would make the review more critical and less promotional.
The manuscript connects computational genomics to precision medicine, but the clinical integration section is relatively high-level. For readers in clinical genomics or molecular diagnostics, the review would be stronger with concrete examples.
Suggested improvement: Add examples such as:
variant calling and benchmarking in clinical NGS
rare disease diagnosis using WGS
oncology variant interpretation
pharmacogenomics
polygenic risk scores
noncoding variant interpretation
RNA-seq integration
clinical assay validation and regulatory requirements
This would better connect the computational discussion to real health applications.
Some computational methods discussed in the review are promising but not necessarily ready for clinical decision-making. For example, genomic language models and SSMs may improve prediction tasks, but clinical implementation requires validation, interpretability, reproducibility, bias assessment, and regulatory-grade evidence.
Suggested improvement: Add a section titled something like “Barriers to Clinical Translation” covering:
reproducibility
model interpretability
dataset bias
ancestry representation
regulatory validation
clinical utility
workflow integration
data privacy
prospective validation
This would make the health-application framing more realistic.
The DCJ distance and genome rearrangement discussion is scientifically valid, but it is not clearly connected to the later sections on genomic AI, SSMs, and precision medicine.
Suggested improvement: Either strengthen the connection by explaining how comparative genomics supports modern health applications, or reduce this section and focus more on clinical genomics, benchmarking, variant interpretation, and AI-based genome modeling.
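For readers unfamiliar with the measure, a short definition in the manuscript would help. The form below is the adjacency-graph formulation commonly used in the genome rearrangement literature; the symbols are chosen here for illustration and are not taken from the manuscript.

```latex
% DCJ distance between genomes A and B defined over the same n genes.
% c = number of cycles in the adjacency graph of A and B
% i = number of odd-length paths in the adjacency graph of A and B
\[
  d_{\mathrm{DCJ}}(A, B) = n - \left( c + \frac{i}{2} \right)
\]
```

As a sanity check, two identical circular genomes over n genes yield n cycles and no paths, so the distance is zero.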
Clarify the type of review. The manuscript should state whether it is a narrative review, scoping review, or systematic review. Based on the current structure, it appears to be a narrative review with thematic synthesis.
Add a summary figure. A conceptual diagram showing the pathway from sequencing data → reproducible pipelines → genomic representation learning → long-range models → clinical applications would improve readability.
Add a table of reviewed themes. A table listing each major topic, representative methods, key references, strengths, limitations, and clinical relevance would make the review more useful.
Strengthen transitions between sections. Some sections move quickly from reproducibility to comparative genomics to tokenization to SSMs. Brief transition paragraphs would improve flow.
Define technical terms early. Terms such as DCJ, BPE, SSM, T2T, GLRB, and LLM should be briefly defined at first use for readers outside computational genomics.
Avoid overgeneralized statements. Phrases suggesting broad “superior performance” should be tied to specific tasks, benchmarks, or references.
Discuss limitations of genomic foundation models. The review should mention issues such as repeat-driven learning, limited interpretability, training-data bias, reference-genome dependency, and poor clinical explainability.
Expand the discussion of benchmarking. Since benchmarking is central to clinical trust, the review could include examples such as GIAB, SEQC/MAQC-style studies, benchmark truth sets, and reproducible workflow containers.
Improve clinical-health framing in the conclusion. The conclusion is conceptually strong but could better specify what researchers, clinicians, and policymakers should do next.
Proofread for style and consistency. Some sentences are broad and abstract. The manuscript would benefit from tightening language and reducing repetition around “pathways,” “integration,” and “precision medicine.”
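On the benchmarking comment above, a brief worked example in the manuscript could show what comparison against a benchmark truth set involves. The following is a minimal sketch under a strong simplifying assumption: variants are compared by exact match on a (chromosome, position, reference allele, alternate allele) key. Dedicated comparison tools also handle differing variant representations and restrict scoring to confident regions, which this sketch does not attempt. The function name `benchmark_calls` and the example variants are hypothetical.

```python
def benchmark_calls(truth, calls):
    """Score a call set against a truth set by exact variant-key match.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Returns counts of true positives, false positives, and false
    negatives, plus precision, recall, and F1.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up variants for illustration only.
    truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
    calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")]
    print(benchmark_calls(truth, calls))
```

Reporting these metrics per variant type and per genomic context, rather than as a single number, would fit the review's emphasis on clinically trustworthy pipelines.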
This is a timely and useful narrative review that introduces several important computational directions in modern genomics, including reproducibility, tokenization, genomic language models, state space models, and precision medicine integration. The manuscript’s main strength is its broad synthesis of emerging computational themes that are likely to shape future genomic analysis.
The most important improvements would be to narrow the central thesis, make the literature search strategy more transparent, distinguish peer-reviewed evidence from preprints, add more critical evaluation of SSMs and genomic AI models, and strengthen the clinical translation section with concrete examples. With these revisions, the review would become more focused, more rigorous, and more useful for readers working at the intersection of bioinformatics, genomics, and health applications.
The author declares that they have no competing interests.
The author declares that they used generative AI to come up with new ideas for their review.