PREreview of FusOn-pLM: A Fusion Oncoprotein-Specific Language Model via Focused Probabilistic Masking

Published
DOI
10.5281/zenodo.12658078
License
CC BY 4.0

Summary:

Fusion oncoproteins arising from genomic rearrangements are commonly observed in cancers and often drive oncogenesis. Although these fusions frequently involve kinases or transcription factors, they are a diverse group at both the molecular and functional levels, and a unified description of their oncogenic properties is lacking. Robust methods for predicting the oncogenicity of uncharacterized fusions would be immediately useful in the clinic, making this an important gap. At a more basic level, it also reflects a gap in our ability to describe a key biological phenomenon. Some recent work has tackled this problem by examining the physicochemical properties of fusion oncoproteins, notably [1], but the question remains essentially open.

In this manuscript, the authors present a language model of fusion oncoproteins, FusOn-pLM, built by fine-tuning ESM-2 on two recent databases of human fusion oncoproteins. They compare random masking with a focused masking strategy guided by their previously fine-tuned ESM-2 model, SaLT&PepPr, and benchmark the results on a number of tasks, demonstrating reasonable gains in specificity on particular tasks and an improvement with non-random masking. The model training and benchmarking are sound and convincingly demonstrate the improvement.

Despite this, the lack of clarity about what unifies fusion oncogenes is a major challenge. Language models can be powerful ways to learn these sorts of definitions in a less biased way, and in that light this is an important step towards closing this basic gap. However, as written, the work relies on a working definition of a fusion oncogene based on physicochemical properties that may or may not be specific to oncogenes. Examining the benchmarking tasks makes this clearer: they are almost entirely predictions of condensate and IDR properties rather than oncogenic ones. The one truly cancer-specific benchmark, differentiating carcinoma classes, is fairly narrow, and no model performs particularly well on it. As a result, we are unsure how well this model will perform in discrimination or generalization tasks.

Another general problem for the field is the lack of negative controls. Gene fusions are relatively common mutations, but bona fide oncogenic fusions are a small fraction of all fusions, making this a class imbalance problem. Even within tumors, the majority of fusions are thought to be passenger rather than driver mutations. Any predictor should be able to discriminate between the two, but the lack of good data on non-oncogenic fusions makes this challenging. This limitation is evident in the present work, where the model’s discrimination is not strongly tested.
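To make the class imbalance concern concrete, the following is a minimal sketch of how precision depends on the fraction of true drivers among the fusions being scored. All numbers are hypothetical and chosen only for illustration; none are taken from the manuscript, and the function name is ours.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value of a binary classifier at a given driver prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% sensitivity and 90% specificity.
balanced = precision_at_prevalence(0.9, 0.9, prevalence=0.5)    # balanced benchmark
imbalanced = precision_at_prevalence(0.9, 0.9, prevalence=0.05)  # drivers are rare

print(f"precision at 50% drivers: {balanced:.2f}")
print(f"precision at  5% drivers: {imbalanced:.2f}")
```

Under these assumed values, the same classifier that looks strong on a balanced benchmark would flag mostly passengers when drivers are rare, which is why evaluation against realistic negative sets matters.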

In summary, we believe this is technically strong work that addresses a pressing need and presents some general strategies for domain-specific language model fine-tuning. It is unfortunately hamstrung by defects in the available data and in the conceptualization of the field that are outside the authors’ control. As presented, it will be of interest to AI practitioners and oncofusion researchers, but its clinical utility is unclear.

Reviewed by Priyanka Bajaj and Christian B. Macdonald

Major points

  1. As discussed, we think the concept of an “oncofusion” is somewhat diffuse, as it describes an extremely heterogeneous set of proteins. This makes the prediction task particularly difficult. While the introduction discusses the barriers to predicting fusion oncoproteins posed by their intrinsically disordered regions and large size, we believe a bit more care with the effective definition the authors are using is warranted. Related to this is the choice of FOdb to train the model, which is essentially a database of the condensate properties of oncofusions rather than their oncogenic properties. The implications of this choice also warrant a bit more discussion.

  2. We wonder if there is a class imbalance problem. The databases used to fine-tune the model contain a small fraction of possible fusion proteins and little negative training information. We are thus unsure whether FusOn-pLM’s significant improvements over ESM-2 are specific to driver fusion oncogenes.

  3. The method is not contextualized with respect to prior work in computational oncofusion prediction and characterization. Such methods are few ([2],[3],[4],[5],[6] among others) but important for understanding FusOn-pLM’s performance.

  4. Several experimental datasets for fusion oncogenes have been published, including [7], [5], and [8]. Evaluating FusOn-pLM on these would be a compelling way to show its utility and would provide a more specifically oncogenic task.

Minor points:

  1. Figure 2D: Although FusOn-pLM is slightly better at distinguishing the two carcinoma classes (BRCA vs. STAD), the performance metrics on this task are the worst across the board. What does this mean for the prediction problem overall? Does the fact that IDR and condensate properties are much better predicted mean that the model is not actually learning an oncogenic task? This seems worthy of more discussion.

  2. Figure 4A: The authors present a FusOn-pLM embedding visualization of fusion oncoproteins, along with the corresponding head and tail protein sequences. It would be beneficial to clarify whether the head and tail sequences used are full-length or extend only up to the exon breakpoint that forms the chimeric fusion protein. This information could be included in the Materials and Methods section.

  3. Figure 4A: The authors demonstrate that FusOn-pLM is able to separate fusions from their head and tail components. To demonstrate that it is learning more specific embeddings for fusion oncoproteins, a comparison with the embeddings from untuned ESM-2 would be appropriate.

  4. Figure 4B: In the main text of the Results section, the authors write “FusOn-pLM largely clusters sequences by key properties such as the fraction of polar, charged, and disordered residues as well as the propensity to form pi-pi and pi-cation interactions and prion-like domains, via the PLAC NLLR score.” By eye, the data in Figure 4B appear to support this conclusion for polar residues and NLLR scores, but not for disordered residues or pi-pi/pi-cation interaction propensity. Without quantification of the clustering, we are not sure this statement is supported.
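As one possible way to quantify the clustering raised in the Figure 4B comment, the sketch below measures how well each sequence's property value agrees with the mean value of its nearest neighbours in embedding space, and compares that agreement to a permutation null. This is our illustration, not the authors' method: the function names, the choice of k, and the toy 2-D data are all assumptions, and real embeddings and property values would replace the toy inputs.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def knn_concordance(points, values, k=5):
    """Correlate each point's value with the mean value of its k nearest neighbours."""
    neighbour_means = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbour_means.append(sum(values[j] for _, j in dists[:k]) / k)
    return pearson(values, neighbour_means)


def permutation_test(points, values, k=5, n_perm=200, seed=0):
    """Return the observed concordance and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = knn_concordance(points, values, k)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if knn_concordance(points, shuffled, k) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy stand-in for a 2-D embedding (hypothetical data, not from the manuscript).
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(60)]
structured = [x for x, _ in points]            # property that tracks the embedding
unstructured = [rng.random() for _ in points]  # property unrelated to the embedding

obs_s, p_s = permutation_test(points, structured)
obs_u, p_u = permutation_test(points, unstructured)
print(f"structured:   concordance={obs_s:.2f}, p={p_s:.3f}")
print(f"unstructured: concordance={obs_u:.2f}, p={p_u:.3f}")
```

Reporting a statistic of this kind for each property in Figure 4B, computed on the full-dimensional embeddings rather than the 2-D projection, would let readers judge the clustering claim without relying on visual inspection.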

References:

1. Tripathi S, Shirnekhi HK, Gorman SD, Chandra B, Baggett DW, Park C-G, et al. Defining the condensate landscape of fusion oncoproteins. Nat Commun. 2023;14: 6008.

2. Shugay M, Ortiz de Mendíbil I, Vizmanos JL, Novo FJ. Oncofuse: a computational framework for the prediction of the oncogenic potential of gene fusions. Bioinformatics. 2013;29: 2539–2546.

3. Abate F, Zairis S, Ficarra E, Acquaviva A, Wiggins CH, Frattini V, et al. Pegasus: a comprehensive annotation and prediction tool for detection of driver gene fusions in cancer. BMC Syst Biol. 2014;8: 97.

4. Lovino M, Montemurro M, Barrese VS, Ficarra E. Identifying the oncogenic potential of gene fusions exploiting miRNAs. J Biomed Inform. 2022;129: 104057.

5. Li J, Lu H, Ng PK-S, Pantazi A, Ip CKM, Jeong KJ, et al. A functional genomic approach to actionable gene fusions for precision oncology. Sci Adv. 2022;8: eabm2382.

6. Liu J, Tokheim C, Lee JD, Gan W, North BJ, Liu XS, et al. Genetic fusions favor tumorigenesis through degron loss in oncogenes. Nat Commun. 2021;12: 6704.

7. Frenkel M, Hujoel MLA, Morris Z, Raman S. Discovering chromatin dysregulation induced by protein-coding perturbations at scale. bioRxiv. 2023. doi:10.1101/2023.09.20.555752

8. Kobayashi Y, Oxnard GR, Cohen EF, Mahadevan NR, Alessi JV, Hung YP, et al. Genomic and biological study of fusion genes as resistance mechanisms to EGFR inhibitors. Nat Commun. 2022;13: 5614.

Competing interests

The authors declare that they have no competing interests.