
PREreview of Ensemble reweighting using Cryo-EM particles

DOI: 10.5281/zenodo.7727867
License: CC BY 4.0

Review of “Ensemble reweighting using Cryo-EM particles” by Tang et al.

Reviewed by F. Emil Thomasen and Kresten Lindorff-Larsen

The manuscript by Tang et al. describes an approach to analyse cryo-EM data as single-molecule measurements. As such, the work—together with previous work from Cossio and colleagues—stands out as an alternative approach for extracting as much information as possible from the revolution in cryo-EM experiments. The authors apply their approach to a set of synthetic data and need to cut a few corners to implement their overall idea in practice. This obviously limits the practical utility of the method—at least for the moment. That said, one needs to start somewhere, and this work could be the beginning of a potentially very fruitful approach for studying biomolecular structure and dynamics. From this perspective, it is understandable that we are not all the way at the end (yet). The paper is well written, the chignolin example is useful, and the paper makes it relatively clear what has and hasn’t been achieved.

Major comments:

1.

The comments above lead me to my main suggestion for the authors. They already do a good job of building up the idea from basic principles; this is appreciated, because it makes it possible to understand better what approximations are made. That said, I would have liked to see an even clearer outline of where this approach is heading. The authors make several approximations, and it would be good to understand better how they interact with one another, what the computational gain is by making them, and how one might envisage overcoming them. To be clear, it would in my opinion be unreasonable to ask for these issues to be solved now, or even to present clear ideas on how to do this. I merely ask for an outline (maybe a flow chart) so that others can see where the main problems lie and where to focus.

The authors make several approximations, including:

— not fitting/integrating out parameters in the forward model (rotations, centering, etc.)

— discretizing conformational space by clustering

— considering only Calpha atoms

I think it would be relevant to know what the computational gains are from these approximations. Is the overall algorithm roughly linear in the number of structures? How long does the current forward model calculation take compared to a more realistic one? What are the expected problems one might run into when using a more realistic treatment of nuisance parameters in Eq. 17? In addition, I think it would be useful to have an estimate of the scaling of the overall algorithm with the different parameters (I guess mostly the number of images and structures).
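As a rough sketch of the scaling I have in mind (the notation below is mine, not the authors’): if the likelihood requires a forward-model evaluation for every image–structure pair, the naive cost is

cost ≈ N_images × N_structures × c_forward,

where c_forward is the cost of a single forward-model evaluation (projection, CTF, noise model). Each approximation then enters multiplicatively: fixing the nuisance parameters shrinks c_forward, clustering shrinks N_structures, and the Calpha representation reduces the cost of generating each projection.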

2.

I would have liked to see a slightly better description of how the prior is implemented. If I understand correctly, the prior ends up being encoded in p(alpha) as in Eq. 15. If the model were perfect, I can see that this works well. But I would assume that if the data are not i.i.d. and/or if the noise model is wrong, then the balance between the prior and the data might end up wrong. Have the authors considered adding a parameter like “theta” in reference 11?
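To be concrete about the kind of parameter I mean (this is my sketch of the general idea, not necessarily the exact form used in reference 11): one could scale the relative weight of the prior and the likelihood with a scalar theta, e.g. by minimizing

theta · D_KL(alpha || alpha_0) − Σ_i log p(y_i | alpha),

where alpha_0 is the prior population vector. Larger theta then expresses lower confidence in the noise model or in the i.i.d. assumption, and the inferred populations revert more strongly to the prior.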

2a.

I am asking also because I was surprised to see that the population of e.g. the misfolded state in Table 2 does not seem to revert to the prior (“original”) even when the SNR is low (10^-4). Is this because SNR = 10^-4 still carries a lot of information, or is there something else at play? Would one not expect the model to revert to the prior if the data carries no/little information? Maybe one would just need to go below 10^-4, or maybe the simplification in the forward model (assuming known parameters) makes it possible to squeeze out some information even with very low SNR.

3.

I will leave this up to the authors, but I would have liked to see an analysis with a bad prior (e.g. for the chignolin work). As it stands, the prior has one state too many (the intermediate), and the authors show that this can be filtered away. But the other two states are there, and the populations in the prior are relatively similar to those in the simulation used to generate the synthetic data. In many cases, I suspect one will not be in such a nice scenario. Given that the authors use clusters, it would seem relatively straightforward to perturb the prior (assigning new weights to the clusters) to examine how far one can push the model. I don’t think this should be a lot of work, but I would also be OK if the authors prefer not to do this and/or delay it to later work. I do, however, think it would strengthen the current paper.
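To make concrete the kind of perturbation I have in mind (a hypothetical sketch; the prior populations and the Dirichlet concentration below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original prior populations of the clusters (must sum to 1);
# these numbers are purely illustrative.
prior_weights = np.array([0.6, 0.3, 0.1])

# Concentration controls how strongly the perturbed priors deviate
# from the original: smaller values give larger perturbations.
concentration = 10.0

# Draw a set of perturbed priors from a Dirichlet centred on the original.
perturbed_priors = rng.dirichlet(concentration * prior_weights, size=20)

# Each row sums to 1 and can be used as an alternative prior p(alpha)
# to test how far the data can correct a poor prior.
print(perturbed_priors[:3])
```

One could then report how the inferred populations depend on the strength of the perturbation, which would give a sense of how far the data can correct a poor prior.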

4.

A key aspect of the work is the MCMC algorithm used to fit the parameters. It would be nice to have a bit more detail about what the actual algorithm looks like. In particular, it seems a bit surprising that the authors spend a lot of space explaining the theory and so little on the practical implementation (I realize the code is available, which is highly appreciated).
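To illustrate the level of detail I would find helpful (to be clear, the following is a generic Metropolis-Hastings sketch over cluster populations, not the authors’ actual algorithm), something along these lines, with the specific proposal, initialization, and convergence criteria stated, would already help:

```python
import numpy as np
from scipy.stats import dirichlet

def mh_over_weights(log_posterior, n_clusters, n_steps=10_000, conc=100.0, seed=0):
    """Generic Metropolis-Hastings sampler over cluster populations alpha.

    `log_posterior(alpha)` should return log p(alpha) + sum_i log p(y_i | alpha);
    the proposal is a Dirichlet centred on the current weights, with the
    Hastings correction for its asymmetry. Purely illustrative.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(n_clusters, 1.0 / n_clusters)  # start from uniform weights
    samples = []
    for _ in range(n_steps):
        prop = rng.dirichlet(conc * alpha)
        log_accept = (log_posterior(prop) - log_posterior(alpha)
                      + dirichlet.logpdf(alpha, conc * prop)   # q(current | proposal)
                      - dirichlet.logpdf(prop, conc * alpha))  # q(proposal | current)
        if np.log(rng.random()) < log_accept:
            alpha = prop
        samples.append(alpha.copy())
    return np.array(samples)
```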

5.

Maybe this belongs under minor points, but the authors state in the conclusions that their approach “provides an approximation of the full Boltzmann ensemble in conformation space without requiring prior collective variables or dimensionality reduction.” This is perhaps true in the ideal case of the algorithm, and if the data are at equilibrium. But as long as they need to reduce conformational space by clustering, this does require some manual choices about which variables to cluster on. Different choices of clustering variables (including RMSD) will lead to highlighting different parts of conformational space. I think the authors could better separate the potential of the algorithm from what is currently achieved. This doesn’t subtract from the work in my opinion.

Minor

1.

The authors discuss the ensembles and densities as being at thermal equilibrium. While this is briefly mentioned, I think it is important to make clear that there is still uncertainty about whether the EM data represent an equilibrium and, if so, at which temperature. I think this should be stated a bit more clearly. This doesn’t subtract from the overall work, since it would be mostly OK from a theoretical viewpoint; the link to the Boltzmann distribution would just be less direct.

2.

The authors resort to clustering to avoid having too many structures to compare to the images. If this will be the practical solution for the foreseeable future, it might be good to include a brief discussion of how best to achieve the clustering. Naively, I would assume that the most robust approach would be to merge structures that give rise to similar maps/projections, as these cannot be distinguished by the data. On the other hand, if two structures are distinct but cannot easily be told apart from the data, one might also say that the prior should do the job. I am not asking the authors to try different clustering methods (they already do that), but some discussion of what such clustering should ideally achieve, i.e. which metric to use, would be helpful. I expect that it should be possible to do clustering based on map similarity if that were a more relevant target.
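As a sketch of what I mean by clustering on map similarity (the map-simulation step is only a placeholder here, since it depends on the forward model; function and variable names are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def map_distance_matrix(maps):
    """Pairwise (1 - Pearson correlation) between flattened density maps.

    `maps` has shape (n_structures, n_voxels); how the maps are simulated
    from the structures is up to the forward model.
    """
    flat = maps.reshape(len(maps), -1)
    corr = np.corrcoef(flat)
    return 1.0 - corr

# Illustrative random "maps"; in practice these would be simulated densities.
rng = np.random.default_rng(0)
maps = rng.normal(size=(50, 32 * 32 * 32))

dist = map_distance_matrix(maps)
np.fill_diagonal(dist, 0.0)

# Hierarchical clustering on the condensed distance matrix; structures whose
# maps correlate strongly end up in the same cluster.
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=0.8, criterion="distance")
print(labels)
```

Merging structures whose maps correlate above some threshold would then directly encode the idea that the data cannot distinguish them.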

3.

I must admit that I didn’t find the toy model (sections 3.2 and 4.1) particularly useful. I understand the idea and also like the result that the algorithm is robust e.g. when splitting up clusters. But I also don’t know whether any of these results transfer to high-dimensional data, and so I didn’t really get much out of it. I would be completely fine with the authors leaving it in there as is; I just wanted to state my personal opinion.

4.

Maybe I missed it, but I only noticed when looking at the SI that the chignolin work deals with Calpha atoms only. I think this should be stated in the main text too.

Competing interests

The authors declare that they have no competing interests.