A rigorous and efficient method to reweight very large conformational ensembles using average experimental data and to determine their relative information content.
Flexible polypeptides such as unfolded proteins may access an astronomical number of conformations. The most advanced simulations of such states usually comprise tens of thousands of individual structures. In principle, a comparison of parameters predicted from such ensembles to experimental data provides a measure of their quality. In practice, analyses that go beyond the comparison of unbiased average data have been impossible to carry out on the entirety of such very large ensembles and have, therefore, been restricted to much smaller subensembles and/or nondeterministic algorithms. Here, we show that such very large ensembles, on the order of 10⁴ to 10⁵ conformations, can be analyzed in full by a maximum entropy fit to experimental average data. Maximizing the entropy of the population weights of individual conformations under experimental χ² constraints is a convex optimization problem, which can be solved in a very efficient and robust manner to a unique global solution even for very large ensembles. Since the population weights can be determined reliably, the reweighted full ensemble presents the best model of the combined information from simulation and experiment. Furthermore, since the reduction of entropy due to the experimental constraints is well-defined, its value provides a robust measure of the information content of the experimental data relative to the simulated ensemble and an indication of the density of the sampling of conformational space. The method is applied to the reweighting of a 35 000-frame molecular dynamics trajectory of the nonapeptide EGAAWAASS by extensive NMR ³J coupling and RDC data. The analysis shows that RDCs provide significantly more information than ³J couplings and that a discontinuity in the RDC pattern at the central tryptophan is caused by a cluster of helical conformations. Reweighting factors are moderate and consistent with errors in MD force fields of less than 3 kT.
The required reweighting is larger for an ensemble derived from a statistical coil model, consistent with its coarser nature. We call the method COPER, for convex optimization for ensemble reweighting. Similar advantages of large-scale efficiency and robustness can be obtained for other ensemble analysis methods with convex targets and constraints, such as constrained χ² minimization and the maximum occurrence method.
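To illustrate why a maximum entropy fit to average data is a well-behaved convex problem, the following minimal sketch reweights a small ensemble against a single averaged observable. It is not the COPER implementation. The function name `maxent_reweight`, the parameters `sigma` and `theta`, the uniform prior weights, and the use of a penalized form (relative entropy plus a scaled squared deviation) in place of an explicit χ² constraint are all assumptions made for illustration. The sketch relies on the standard result that the optimal weights have the exponential form w_i ∝ exp(-λ f_i), so the problem reduces to minimizing a one-dimensional convex dual function of λ, here by a damped Newton iteration. The returned relative entropy with respect to the uniform prior corresponds to the entropy reduction that the abstract uses as a measure of information content.

```python
import math


def maxent_reweight(values, target, sigma=0.0, theta=1.0, tol=1e-10, max_iter=200):
    """Maximum entropy reweighting of a uniform ensemble (illustrative sketch).

    values : predicted observable f_i for each conformer
    target : experimental ensemble average
    sigma  : experimental error (0 enforces the average exactly)
    theta  : hypothetical scaling of the error term
    Returns (weights, lam, entropy_reduction).
    """
    n = len(values)
    c = theta * sigma * sigma  # curvature contributed by the error model
    if c == 0.0 and not (min(values) < target < max(values)):
        raise ValueError("target must lie strictly inside the range of values")

    def state(lam):
        # Weights proportional to exp(-lam * f_i), shifted to avoid overflow.
        a = [-lam * f for f in values]
        m = max(a)
        e = [math.exp(x - m) for x in a]
        z = sum(e)
        w = [x / z for x in e]
        mean = sum(wi * f for wi, f in zip(w, values))
        var = sum(wi * (f - mean) ** 2 for wi, f in zip(w, values))
        grad = target - mean + c * lam  # first derivative of the convex dual
        return w, grad, var + c         # second derivative is var + c >= 0

    lam = 0.0
    w, grad, hess = state(lam)
    for _ in range(max_iter):
        if abs(grad) < tol:
            break
        step = -grad / hess
        t = 1.0
        while True:
            # Halve the Newton step until the gradient magnitude decreases.
            w_new, grad_new, hess_new = state(lam + t * step)
            if abs(grad_new) < abs(grad) or t < 1e-12:
                break
            t *= 0.5
        lam += t * step
        w, grad, hess = w_new, grad_new, hess_new

    # Relative entropy of the new weights with respect to the uniform prior 1/n.
    entropy_reduction = sum(wi * math.log(wi * n) for wi in w if wi > 0.0)
    return w, lam, entropy_reduction


if __name__ == "__main__":
    # Toy data: four conformers with hypothetical predicted values.
    weights, lam, ds = maxent_reweight([1.0, 2.0, 3.0, 4.0], target=3.0)
    print(weights, lam, ds)
```

Because the dual function is convex, the iteration converges to the same solution regardless of the starting value of λ, which mirrors the uniqueness argument made above. For many observables the same construction yields one multiplier per data point, and a general-purpose convex solver would replace the one-dimensional Newton step.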