Summary
Much progress continues to be made in deep learning structure prediction methods, such as AlphaFold 3 (AF3), RoseTTAFold All-Atom, and Boltz-1. Here, the authors present a new open-source structure prediction model, Chai-1, which is claimed to outperform AF3 and RoseTTAFold All-Atom in predicting structures from sequences. Additionally, it provides input channels for constraining predictions with experimental data such as cross-linking mass spectrometry (XL-MS) data. For single-sequence predictions, Chai-1 adds an input channel for residue embeddings from protein language models to maintain accuracy across a broad range of targets. Constraint features such as pocket and docking constraints were also added to mimic experimental constraints. For multimeric protein folding, Chai-1 appears to excel both with multiple sequence alignments (MSAs) and, in single-sequence mode, without MSAs or structural templates. However, the paper's main thesis is unclear: is the main goal to fill the need for open-access structure prediction software, or for better-performing software that works without MSAs or structural templates? This should be made explicit in the introduction and conclusion.
This paper provides an important prediction model that is a more accessible alternative to AF3 while also offering additional input features. Although the paper does a good job of introducing the model, the Chai-1 algorithm itself is poorly explained, with little detail given on what is novel in its architecture. Additionally, we examined the web interface version of Chai-1 and found it intuitive and straightforward to use.
Finally, we believe that all publicly available protein prediction models, including AF3, should be made truly open source to help drive improvements within the field and foster discoveries without the burden of convoluted and restrictive use licenses. Chai-1 thus takes a step in the right direction: “We release Chai-1 model weights and inference code as a Python package for non-commercial use and via a web interface where it can be used for free, including for commercial drug discovery purposes,” under an Apache license.
Major Comments
1) Since this preprint was posted, Boltz-1 (doi: https://doi.org/10.1101/2024.11.19.624167) has been released; its stated mission and model structure are very similar to those of Chai-1. The authors of Boltz-1 have compared their model to Chai-1 and claimed comparable, and sometimes better, performance. They have also committed to “releasing the training and inference code, model weights, datasets, and benchmarks” under an MIT license. Additionally, DeepMind has now released the weights behind AlphaFold 3, albeit upon request (https://www.nature.com/articles/d41586-024-03708-4). Therefore, new benchmarks can now be run for a true head-to-head comparison between the models.
We believe that, prior to publication, the authors should compare the performance of Boltz-1 with that of Chai-1 and comment on the two model architectures more broadly.
New benchmarks should be run between Chai-1, AF3, and Boltz-1 where applicable.
2) We believe the introduction needs an explanation of the current field of modeling, the gap that Chai-1 aims to improve, and the unique characteristics that Chai-1 possesses in predicting structures. It would be helpful to format the introduction by funneling from the current relevant protein modeling software, such as AlphaFold 3 and Rosetta, and pointing out how these are not open access, rely on MSAs, etc., to highlight the need for more accessible modeling software.
3) Abramson et al. is repeatedly cited for the model architecture and training strategy used, but nothing further is explained. Please provide a brief in-text outline of the algorithm implemented by Abramson et al. at the end of the introduction.
4) In section 2.2 on protein-ligand prediction, the authors state that structures with more than 2048 tokens were cropped to the 2048 tokens closest to the ligand binding site (a sketch of our reading of this cropping step follows this item).
a) Why was this cutoff chosen, given that we presume many structures would exceed it?
b) When users encounter this limit, how do you suggest they proceed? Have the authors tested with more tokens, and if so, what were the results?
c) Is 2048 tokens the maximum token limit for Chai-1? Would increasing it to the level of AF3 be beneficial?
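For concreteness, the sketch below shows how we interpret the cropping step: rank tokens by their distance to the ligand and keep the nearest 2048. The function name, the use of per-token representative coordinates, and the nearest-atom distance definition are our assumptions for illustration, not details taken from the Chai-1 code.

```python
import numpy as np

def crop_to_ligand_site(token_coords, ligand_coords, max_tokens=2048):
    """Keep the max_tokens tokens closest to the ligand.

    token_coords:  (N, 3) representative coordinates, one per token
    ligand_coords: (M, 3) ligand atom coordinates
    Returns the indices of the retained tokens.
    (Illustrative only; the paper does not specify the exact distance
    definition or tie-breaking that Chai-1 uses.)
    """
    if len(token_coords) <= max_tokens:
        return np.arange(len(token_coords))
    # Distance from each token to its nearest ligand atom.
    dists = np.linalg.norm(
        token_coords[:, None, :] - ligand_coords[None, :, :], axis=-1
    ).min(axis=1)
    return np.argsort(dists)[:max_tokens]
```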
5) In section 2.4, the authors compared predicted structures that AF2.3 struggled with to Chai-1’s predictions.
a) Are there any targets that Chai-1 fails to predict that AF2.3 does well on? It would be helpful to investigate and report this to further improve on possible shortcomings in Chai-1’s protein monomer prediction if those exist.
6) When running nucleic acid structure prediction in single-sequence mode, the authors report that Chai-1's performance is comparable to that of RoseTTAFold2NA. However, Chai-1 is run without MSAs in this mode, while RoseTTAFold2NA has access to the evolutionary data (section 2.5).
a) What is Chai-1's performance with MSAs for nucleic acid prediction, and does it outperform RoseTTAFold2NA? To support this, we suggest that Figure S3, which depicts Chai-1 outperforming RoseTTAFold2NA, be promoted from the supplementary material to a primary figure.
7) The constraints feature section of the methods (section 5.1.2) states that the distance threshold is set between 6 and 20 angstroms.
a) Is this range derived from prior models and literature?
b) What happens when the distance falls exactly at a boundary, 6 or 20 Å?
c) Additionally, the formula used to calculate the minimum distance between two tokens could be explained more clearly. Please clarify what is considered a token in this model (functional groups, bonds, or something else); we sketch our own reading of this calculation below.
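To make our question about the formula concrete, here is a minimal sketch of how we currently read "minimum distance between two tokens", namely as an all-atom minimum over the atoms assigned to each token. The all-atom interpretation and the function names are our assumptions, and the boundary behaviour at 6 and 20 Å is exactly what we would like the authors to specify.

```python
import numpy as np

def min_token_distance(atoms_a, atoms_b):
    """Minimum pairwise distance (in Å) between the atoms of two tokens.

    atoms_a, atoms_b: (Na, 3) and (Nb, 3) coordinate arrays.
    Whether Chai-1 uses all atoms or a single representative atom per
    token (e.g., C-alpha) is one of the details we are asking about.
    """
    d = np.linalg.norm(atoms_a[:, None, :] - atoms_b[None, :, :], axis=-1)
    return float(d.min())

def within_constraint(atoms_a, atoms_b, lo=6.0, hi=20.0):
    # Are the 6 and 20 Å endpoints inclusive or exclusive? The paper does not say.
    return lo <= min_token_distance(atoms_a, atoms_b) <= hi
```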
8) In section 5.1.2, the docking feature is set by binning pairwise distances of input tokens into four specified bins. Please clarify how these bin boundaries were determined and how the groups were separated.
Additionally, when a pairwise distance falls exactly on the edge of two bins, such as 4 Å, how is it binned? Are the distances rounded to the nearest integer and then binned? We sketch this ambiguity below.
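The small sketch below, using hypothetical bin edges (the actual edges are precisely what we are asking the authors to specify), illustrates why the boundary case matters:

```python
import numpy as np

# Hypothetical bin edges in Å; the paper should state the actual values
# and whether edges are inclusive on the left or on the right.
bin_edges = np.array([0.0, 4.0, 8.0, 16.0])  # four bins: [0,4), [4,8), [8,16), [16,inf)

def docking_bin(distance):
    """Assign a pairwise distance to one of four bins.

    np.digitize with right=False puts a distance of exactly 4.0 into the
    second bin; with right=True it would fall into the first. This is
    exactly the edge-case behaviour we would like the authors to state.
    """
    return int(np.digitize(distance, bin_edges[1:], right=False))

print(docking_bin(3.9), docking_bin(4.0), docking_bin(4.1))  # 0 1 1
```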
9) We believe that one of the most distinct differences between Chai-1 and AF3 is the pocket-aligned ligand RMSD calculation. Chai-1 selects the chain permutation that minimizes the RMSD between the predicted and native chains by referring to averaged global DockQ scores, whereas AF3 uses a simulated annealing approach to find the minimum. We believe that highlighting this in the protein-ligand complex prediction section (2.2) would help explain the higher ligand RMSD success rates reported; we sketch our reading of the permutation selection below.
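For reference, the brute-force search below reflects our reading of the chain-permutation selection described for Chai-1. The `score_fn` placeholder standing in for the averaged global DockQ scoring, and the exhaustive enumeration itself, are our illustrative assumptions rather than the actual implementation.

```python
from itertools import permutations

def best_chain_permutation(pred_chains, native_chains, score_fn):
    """Exhaustively search chain assignments and keep the best-scoring one.

    pred_chains / native_chains: lists of chain objects of equal length.
    score_fn(mapping) should return, e.g., an averaged global DockQ score
    for a candidate assignment (higher is better). This is only a sketch
    of the selection step as we understand it, not Chai-1's code.
    """
    best_map, best_score = None, float("-inf")
    for perm in permutations(range(len(native_chains))):
        mapping = list(zip(pred_chains, (native_chains[i] for i in perm)))
        score = score_fn(mapping)
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score
```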
10) In the discussion, restating the hypothesis and emphasizing the results would help readers understand the goal of the paper as a whole. If the main goal is to fill the need for open-access structure prediction software, or for better-performing software that works without MSAs or structural templates, the discussion should state that explicitly. Essentially, Chai-1 is an addition to the current field of deep learning structure prediction. Emphasizing that Chai-1 is open access and appears to outperform existing models such as AlphaFold 3 would sharpen the focus of the paper.
11) (Figure 1) Since this figure aims to describe the whole algorithm and the additional input channels, we believe it would be better to divide the architecture schematic into subfigures. For example, the DNA, RNA, and protein sequence inputs could form subfigure A, with the origin of the data and the selection criteria clearly referenced and explained. Doing this for each step of the model (input features, MSA module, pair-bias attention, and structure prediction) would be helpful, and each step or subfigure should be properly referenced in the text where it is explained in depth.
12) (Figure 2) In Abramson et al., the ligand PoseBusters benchmark uses n = 428, whereas in this paper AF3 is reported with n = 427. It is stated that one structure, 7D6O, was removed from the Chai-1 benchmark. It is unclear why 7D6O was also removed from the AF3 results, given that the Ligand PoseBusters results for AF3 were taken from AlphaFold 3's publicly released PoseBusters predictions.
13) (Figure 6) We think that Figure 6 would be better integrated as a subfigure within Figure 1, serving as an example of a structure predicted accurately by Chai-1.
Minor Comments
1) On the web interface version of Chai-1, we suggest including spreadsheet import options for adding restraints since entering them manually is time-consuming and prone to human error.
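To illustrate what we have in mind, a simple one-restraint-per-row spreadsheet (e.g., CSV) upload would be enough; the column names and fields below are hypothetical and only sketch the idea, not an existing Chai-1 format:

```python
import csv

# Hypothetical restraint spreadsheet layout (column names are our invention):
# chain_a,res_a,chain_b,res_b,min_dist,max_dist
# A,45,B,112,6.0,20.0

def load_restraints(path):
    """Read restraints from a CSV file into a list of dicts."""
    with open(path, newline="") as fh:
        return [
            {
                "chain_a": row["chain_a"],
                "res_a": int(row["res_a"]),
                "chain_b": row["chain_b"],
                "res_b": int(row["res_b"]),
                "min_dist": float(row["min_dist"]),
                "max_dist": float(row["max_dist"]),
            }
            for row in csv.DictReader(fh)
        ]
```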
2) The acronym MSA (multiple sequence alignment) is first used in the abstract but is not defined until the introduction. Please define it at first use.
3) In the second paragraph of section 2.2, it is stated that specifying the apo structure of a protein boosts Chai-1's success rate to 81%. Please add a figure reference indicating where these data are shown.
4) We believe it would be interesting to include, where possible, DockQ scores for each protein–antigen interface and the Predicted Aligned Error (PAE) matrix for the predicted residues in Figures 3 and 4.
5) In the limitations section, please include suggestions on how these limitations could be addressed or resolved. For example, for the first limitation, the authors could suggest including additional contact information to obtain the best results. For the second, they could propose, as a future direction, making the model less reliant on modified residues, or suggest how to structure a workflow when a sequence must be run both with and without modifications.
6) We believe Tables 1 and 2 contain the numeric values visualized in Figure 2. It would be helpful to move these tables to the supplementary material, divide Figure 2 into subfigures A and B, and mention the values from the tables in the text while referencing the subfigures.
7) It would also be helpful to include the significance values from these tables in the caption of Figure 2, along with the statistical test used to calculate them.
8) Section 2.3, paragraph 2 states that Chai-1 predicts significantly better than AF2.3 on targets that AF2.3 usually struggles with, but it is unclear where the data supporting this statistical significance are shown. Please include a figure reference indicating where to find these data.
9) We suggest making all figures color-blind friendly.
10) We suggest addressing the following typos and grammatical errors:
Abstract:
Add commas after “(e.g., data)” and “free”
Introduction
Paragraph 2
Add comma after “Chai-1 excels on a variety of tasks”
Remove commas after “Chai-1 outperforms ESMFold,” and “multiple sequence alignments (MSAs),”
Results
2.1.2 Constraint features
Remove comma after “distances,”
2.2 Protein-ligand prediction
Paragraph 1
Add “, the” before “(i.e.” in “(i.e. fraction of predictions with DockQ > 0.23)”
Add “the” before “difference between means is not statistically meaningful)”
Remove commas after “without the need for MSAs,”
Paragraph 2
Add “the” before “success rate to 81%.”
“prompt following” should be “prompt-following”
Paragraph 3
Add comma after “crystallization aid for elucidating structures”
Remove comma after “importance of manually inspecting models,”
2.3 Multimeric protein prediction
Paragraph 2
“redundancy reduced” should be “redundancy-reduced”
Remove comma after “these antibody-protein interfaces,”
Add commas after “highly variable immunological protein sequences”
Paragraph 3
Remove comma after “full performance mode with MSAs,”
Paragraph 4
Add commas after “Unfortunately” and “(4-8%)”
“high quality” should be “high-quality”
Last sentence
Add comma after “AF3 restricts commercial use”
2.4 Protein monomer prediction
Paragraph 1
Add commas after “We find that Chai-1", and “MSA information”
Paragraph 2
Remove “to” in “we attempt to fairly compare to”
Remove comma after “full set of 70 CASP15 targets,”
2.5 Nucleic acid structure prediction
Paragraph 1
Add “, the” before “(i.e. no RNA MSAs)”
2.6 Predicted confidence scores track accuracy
“include” in “evaluated in Figure 5 include” should be “includes”
2.8 Limitations
Paragraph 1
Change “encountering” to “encounter” in the first sentence.
Remove comma after “chains in a complex correctly,”
Paragraph 2
Change “posses” to “possesses”
3 Discussion
“scientific understanding of cellular processes, and ultimately,” should be “scientific understanding of cellular processes and, ultimately,”
The authors declare that they have no competing interests.