English Pages 695 Year 2003
METHODS IN ENZYMOLOGY EDITORS-IN-CHIEF
John N. Abelson
Melvin I. Simon
DIVISION OF BIOLOGY CALIFORNIA INSTITUTE OF TECHNOLOGY PASADENA, CALIFORNIA
FOUNDING EDITORS
Sidney P. Colowick and Nathan O. Kaplan
Preface
Five years ago, Academic Press published parts A and B of volumes of Methods in Enzymology devoted to Macromolecular Crystallography, which we had edited. The editors of the series, in their wisdom, requested that we assemble the present volumes. We have done so with the same logical style as before, moving smoothly from methods required to prepare and characterize high quality crystals and to measure high quality data, in the first volume, to structure solving, refinement, display, and evaluation in the second. Although we continue to look forward in these volumes, we also look resolutely back in time by having recruited three chapters of reminiscence from some of those on whose shoulders we stand in developing methods in modern times: Brian Matthews, Michael Rossmann, and Uli Arndt. A spiritually similar contribution opens the second volume: David Blow’s introduction to our Phases section has his personal reflections on the impact that Johannes Bijvoet has had on modern protein crystallography. In the earlier volumes, we foreshadowed a time when macromolecular crystallography would become as automated as the technique applied to small molecules. That time is not quite upon us, but we all feel rattling of the windows from the heavy tread of high-throughput synchrotron-based macromolecular crystallography. As for the previous volumes, we have tried to provide in this volume sufficient reference that those becoming immersed in the field might find an explanation of methods they confront, while hopefully also stimulating others to create the new and better methods that sustain intellectual vitality. The years since publication of parts A and B have seen amazing advances in all areas of the discipline. 
Super high brightness synchrotron sources (Advanced Photon Source in the United States, European Synchrotron Radiation Facility in Europe, and Super Photon Ring-8 in Japan) are producing numerous important results even while the older sources are increasing productivity. Proteomics and structural genomics have appeared in the lexicon of all biologists and have become vital research programs in many laboratories. In the spirit of the time, these chapters approach many of the methods that are pertinent to high-throughput structure determination. There are now robots for large-scale screening of crystal-growth conditions using sub-microliter volumes, which were accessible only in a few dedicated research laboratories a decade ago. Similarly, automation has begun to assume increasing roles in cryogenic specimen changing for data collection; many laboratories are building and beginning to use robots for this purpose.
The first and largest section of technical chapters dissects the cutting-edge methods for thinking about or accomplishing crystal growth, including theoretical aspects, using physical chemistry to understand and improve crystal diffraction quality, robotics, and cryocrystallography. The other large section addresses phasing. A profound shift has occurred with the growing appreciation that map interpretation and model refinement are inseparable from the phase problem itself. Various methods of integrating the two processes in automated algorithms constitute an important step toward realization of high-throughput. More importantly perhaps, they improve the resulting structures themselves. New algorithms for representing the variance parameters have come into wider practice. The database of solved macromolecular structures has grown to the point where its statistical properties now afford impressive insight and can be used to improve the quality of structures. Concurrently, simulation methods have become more accessible, reliable, and relevant. The validation process is therefore one that impacts a widening sphere of activities, including homology modeling and the presentation and analysis of conformational, packing, and surface properties. Many of these are reviewed in the concluding chapters. We take little credit, either for the quality of the volume, which goes to the chapter authors, or for comprehensive coverage of competing methods. We will happily accept blame for mistakes and omissions. Academic Press has remained supportive and helpful throughout the long and trying process of completing this job, earning our sincere appreciation. Charles W. Carter Robert M. Sweet
Contributors to Volume 374 Article numbers are in parentheses and following the names of contributors. Affiliations listed are current.
Jan Pieter Abrahams (8), Biophysical Structural Chemistry, Leiden Institute of Chemistry, 2300 RA Leiden, The Netherlands
Axel T. Brunger (3), The Howard Hughes Medical Institute and Departments of Molecular and Cellular Physiology, Neurology, and Neurological Sciences, Stanford Radiation Laboratory, Stanford University, 1201 Welch Road, Stanford, California 94205
Paul D. Adams (3), Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, California 94720
Sergey V. Buldyrev (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Vandim Alexandrov (23), Department of Biochemistry and Biophysics, Texas A & M University, College Station, Texas, 77843
Kyle Burkhardt (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers The State University of New York, Piscataway, New Jersey 08854
W. Bryan Arendall, III. (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Nenad Ban (8), Institute for Molecular Biology and Biophysics, Swiss Federal Institute of Technology, CH8093 Zurich, Switzerland
Raul E. Cachau (15), Advanced Biomedical Computer Center, Frederick, Maryland 21703
Stephen Cammer (22), University of California San Diego Libraries, 9500 Gilman Drive, La Jolla, California 92093
Joel Berendzen (3), Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico 87545
Charles W. Carter, Jr. (7, 22), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Helen M. Berman (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers The State University of New York, Piscataway, New Jersey 08854
D. M. Blow (1), 26 Riversmeet, Appledore, Bideford, Devon EX39 1RE, United Kingdom
Zbigniew Dauter (5), Synchrotron Radiation Research Section, NCI Brookhaven National Laboratory Building, Upton, New York 11973
Jose M. Borreguero (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Feng Ding (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Eleanor J. Dodson (3), Department of Chemistry, University of York, Heslington York YO1 5DD, United Kingdom
Nikolay V. Dokholyan (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Andrzej Joachimiak (15), Structural Biology Sciences, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Jochen Junker (23), Max Planck Institut fur Biophysikalische Chemie, D37070 Gottingen, Germany
Zukang Feng (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers The State University of New York, Piscataway, New Jersey 08854
Michel H. J. Koch (24), European Molecular Biology Laboratory, Hamburg Outstation, D-22603 Hamburg, Germany
András Fiser (20), Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461
W. G. Krebs (23), San Diego Supercomputer Center, University of California San Diego, La Jolla California 92093
Roger Fourme (4), Soleil (CNRS-CEAMEN), Batiment 209d, Universite Paris XI, 91898 Orsay, Cedex France
Mark Gerstein (23), Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520
Ralf W. Grosse-Kunstleve (3), Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, California 94720
Dorit Hanein (10), The Burnham Institute, La Jolla, California 92037
Jan Hermans (19), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Barry Honig (21), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032
Victor S. Lamzin (11), European Molecular Biology Laboratory, Hamburg Outstation, 22603 Hamburg, Germany
Richard J. Morris (11), European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
Garib N. Murshudov (14), Chemistry Department, University of York, Heslington York, YO1 5DD, United Kingdom
Ronaldo A. P. Nagem (5), CBME Laboratorio Nacional de Luz Sincrotron and Instituto de Fisica Gleb Wataghin, Unicamp Caixa, CEP 13084-971 Campinas SP, Brazil
Tom Oldfield (13), European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
Thomas R. Ioerger (12), Texas A & M University, College Station, Texas 77843
Miroslav Z. Papiz (14), Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, United Kingdom
Ronald Jansen (23), Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520
Anastassis Perrakis (11), Netherlands Cancer Institute, Department of Carcinogenesis, 1066 CX Amsterdam, The Netherlands
Donald Petrey (21), The Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032
Alberto Podjarny (15), Structural Biology Sciences, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Igor Polikarpov (5), Instituto de Fisica de Sao Carlos, Universidade de Sao Paulo, Av Trabalhador, Saovarlense, 13560 Sao Carlos SP, Brazil
Thierry Prangé (4), LURE (CNRS-CEAMEN), Batiment 209d, Universite Paris XI, 91898 Orsay, Cedex France
David C. Richardson (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Eugene L. Shakhnovich (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
George M. Sheldrick (3), Lehrstuhl fur Strukturchemie, Gottingen University D37077 Gottingen, Germany
H. Eugene Stanley (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Dmitri I. Svergun (24), Institute of Crystallography Russian Academy of Sciences, 117333 Moscow, Russia
Lynn F. Ten Eyck (16), National Partnership for Advanced Computational Infrastructure, San Diego Supercomputer Center, La Jolla, California 92093
Jane S. Richardson (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Thomas C. Terwilliger (2, 3), Los Alamos National Laboratory, Los Alamos, New Mexico 87545
Jeffrey Roach (6), Department of Chemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
Alexander Tropsha (22), Department of Medicinal Chemistry and Natural Products, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Mark A. Rould (7), Department of Physiology, University of Vermont, School of Medicine, Burlington, Vermont 05405
James C. Sacchettini (12), Texas A & M University, College Station, Texas 77843
Andrej Šali (20), Mission Bay Genentech Hall, University of California at San Francisco, San Francisco, California 94143
Celia Schiffer (19), Department of Biochemistry and Molecular Pharmacology, University of Massachusetts, Medical School, Worcester, Massachusetts 01655
Marc Schiltz (4), LURE (CNRS-CEAMEN), Batiment 209d, Universite Paris XI, 91898 Orsay, Cedex France
Thomas R. Schneider (3, 15), Lehrstuhl fur Strukturchemie, Gottingen University D37077 Gottingen, Germany
J. Tsai (23), Department of Biochemistry and Biophysics, Texas A & M University, College Station, Texas, 77843
Maria G. W. Turkenburg (3), Department of Chemistry, University of York, Heslington York YO1 5DD, United Kingdom
Isabel Uson (3), Lehrstuhl fur Strukturchemie, Gottingen University D37077 Gottingen, Germany
Patrice Vachette (24), LURE Bat. 209d, University Paris-Sud, F-91898 Orsay, Cedex France
Iosif I. Vaisman (22), School of Computational Sciences, George Mason University, Manassas, Virginia 20110
Niels Volkmann (10), The Burnham Institute, La Jolla, California 92037
Charles M. Weeks (3), Hauptman-Woodward Medical Research Institute, 73 High Street, Buffalo, New York 14203
Martyn D. Winn (14), Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, United Kingdom
John Westbrook (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers The State University of New York, Piscataway, New Jersey 08854
Kam Y. J. Zhang (9), Department of Structural Biology, Plexxikon, Inc., Berkeley, California 94710
[1] How Bijvoet Made the Difference: The Growing Power of Anomalous Scattering
By D. M. Blow
History
Johannes Bijvoet (1892–1980) made pioneering contributions to the determination of noncentrosymmetric structures. He was the first to exploit the isomorphous replacement method to reveal a noncentrosymmetric structure, using isomorphous sulfate and selenate salts to determine the structure of strychnine on the basis of two projections.1–3 In space group C2, the selenium atoms, one in each asymmetric unit, make a centrosymmetric array, and the structure factors of the heavy atoms (with appropriate choice of origin) are all real. The isomorphous difference then determines the real part of the strychnine structure factor, but the sign of the imaginary part of the structure factor is undefined. The best estimate of the strychnine structure factor is its real part. This leads to an electron density map in which the structure and its inverse are superimposed, with symmetry C2/m. The structure of the strychnine molecule was deduced by discarding one of each pair of atoms related by the mirror, using the same principles that Carlisle and Crowfoot4 had used in separating the two images of the cholesteryl iodide molecule generated by the ‘‘heavy atom’’ method. In both cases, the authors deriving their structure did not know which interpretation was a true representation of the molecule, and which was its inverted image. Bijvoet recognized that anomalous scattering could be used to identify the correct enantiomorph of a noncentrosymmetric structure. He wrote,5 There is in principle a general way of determining the sign [of a phase angle]. . . .We can use the abnormal scattering of an atom for a wavelength just beyond its absorption limit. . . .It also becomes possible to attribute the d or l structure to an optically active compound on actual grounds and not merely by a basic convention.
Nishikawa and Matsukawa6 and Coster et al.7 had observed departure from Friedel’s law8 in diffraction from opposite polar faces of a zinc sulfide
1 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Kon. Ned. Akad. Wet. 51, 825 (1948).
2 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Kon. Ned. Akad. Wet. 52, 120 (1949).
3 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Acta Crystallogr. 4, 275 (1951).
4 H. C. Carlisle and D. M. Crowfoot, Proc. R. Soc. A 184, 64 (1945).
5 J. M. Bijvoet, Kon. Ned. Akad. Wet. 52, 313 (1949).
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
crystal. In a beautifully clear exposition of anomalous scattering effects, Bijvoet9 drew on this example: Normal X-ray reflection does not detect any difference between one side [of the octahedral faces of a zinc blende crystal], a dull and poorly developed tetrahedron plane, and the other, a shining well-developed one. In this respect, it is less sensitive than the human eye. Coster, however, chose a radiation—Lα1 radiation of gold—which just excites the K electrons of zinc. . . . Now X-ray analysis not only detects a difference, but it concludes—and this is, of course completely impossible for the human eye—that it is the dull plane that has the zinc plane facing outwards.
In 1951 Bijvoet and colleagues10 observed the intensity differences between Friedel-related pairs of X-ray reflections from sodium rubidium tartrate crystals. These observed differences showed that the convention established by Emil Fischer to discuss the configuration of bonds at asymmetric carbon atoms, especially in sugars, by good chance represents the true three-dimensional enantiomorph of these molecules. This was a substantial achievement, but Bijvoet was looking much further ahead. His visionary 1954 paper11 opens by mentioning . . . the great successes of X-ray analysis that determined structures as complicated as those of sterols and alkaloids and that now approach the domain of Nature’s most complicated biochemical compounds, the proteins. . . .
A flow chart (see Fig. 1) sets the agenda for structure determination for the next half-century. It shows how isomorphous substitution and anomalous scattering can determine phases for all noncentrosymmetric reflections. But there is a question mark. Bijvoet warns,11 It has not yet been thoroughly investigated whether the small effect of the anomalous scattering will be measurable for a sufficient part of the reflections involved in a complete Fourier synthesis.
How thrilled he would have been to know that tunable X-ray sources could produce such measurable effects that anomalous scattering could solve protein structures on its own! These methods began to be introduced in the last year of his life.
6 S. Nishikawa and R. Matsukawa, Proc. Imp. Acad. Jpn. 4, 96 (1928).
7 D. Coster, K. S. Knol, and J. A. Prins, Z. Phys. 63, 345 (1930).
8 G. Friedel, Comptes Rendus 157, 1533 (1913).
9 J. M. Bijvoet, Endeavour 14, 71 (1955).
10 J. M. Bijvoet, J. F. Peerdeman, and J. A. van Bommel, Nature 168, 271 (1951).
11 J. M. Bijvoet, Nature 173, 888 (1954).
[Flow chart, text content only]
PHASE DETERMINATION IN THE ISOMORPHOUS SUBSTITUTION METHOD
Center of symmetry (determination of amplitude sign)
  Heavy atom on center: algebraic amplitude addition (1936).
  Heavy atom out of center: (a) Location of the heavy atom (Patterson analysis). (b) Algebraic amplitude addition (1939).
No center of symmetry (determination of phase angle)
  (a) Location of the heavy atom.
  (b) Determination of the absolute value of phase angle from amplitude addition in vector diagram (1949).
  (c) Synthesis of double Fourier and resolution by geometrical considerations.
  (c') Determination of all phase signs by anomalous scattering (19 ?).
  (d) Phase shift by anomalous scattering (1930). Determination of absolute configuration (1951).
  (Vector diagrams in the chart are labeled FB-FA and ∆f".)
Fig. 1. The Bijvoet presentation of phase determination in the isomorphous substitution method. Redrawn with permission from Nature 173, 888–891. Copyright 1954 Macmillan Magazines Limited.
Anomalous scattering became popular. Pepinsky’s group12 devised a type of anomalous difference Patterson function, known as the $P_s$ function, which is the sine transform of the intensities:
$$P_s(\mathbf{u}) = (1/V)\sum_{\mathbf{h}} |F(\mathbf{h})|^2 \sin(2\pi\,\mathbf{h}\cdot\mathbf{u})$$
Because the sine function is odd [$\sin x = -\sin(-x)$], the terms of this summation are only nonzero to the extent that intensities for $\mathbf{h}$ and $-\mathbf{h}$ differ. The $P_s$ function is antisymmetric, and its positive and negative peaks represent vectors between an anomalous scatterer and a normal scatterer. Peerdeman and Bijvoet13 and Ramachandran and Raman14 both discovered a simple way to use a centrosymmetric array of anomalous scatterers in a noncentrosymmetric structure to derive the imaginary components of the structure factors.
Personal Notes
During 1954–1957 I was a student working on ways to apply isomorphous replacement to phase noncentrosymmetric reflections of proteins, and especially to deal with the ambiguities and errors that appeared to dominate the results in practice.15,16 I met Johannes Bijvoet at an international crystallography meeting in Madrid in April 1956, introduced myself, and outlined my research project to him. I remember him as a strongly built man with slightly receding hair (Fig. 2), who was gentle and encouraging to the nervous student talking to him; indeed, he was clearly excited by the progress in developing methods to exploit isomorphous replacement in proteins. He spoke excellent English, and his attitude was warm and friendly. I met him again at an International Crystallography Congress in Cambridge in August 1960. By that time the subject of protein crystallography had become established by three-dimensional electron density maps for hemoglobin and myoglobin, and anomalous scattering had been used to help with the phasing of hemoglobin. Bijvoet was enthusiastic about these developments. He retired in 1962 and I did not see him again, but I had one other personal involvement. In 1972 I was elected to the Royal Society, and a few days later (before the formal admission ceremony) I learned that a vote was to be taken on the election of Bijvoet as a Foreign Member of the
12 Y. Okaya, Y. Saito, and R. Pepinsky, Phys. Rev. 98, 1857 (1955).
13 A. F. Peerdeman and J. M. Bijvoet, Acta Crystallogr. 9, 1012 (1956).
14 G. N. Ramachandran and S. Raman, Curr. Sci. 25, 348 (1956).
15 D. M. Blow, Proc. R. Soc. A. 247, 302 (1958).
16 D. M. Blow and F. H. C. Crick, Acta Crystallogr. 12, 794 (1959).
Fig. 2. Johannes Bijvoet. Photograph courtesy of Han Meijer.
Royal Society. I was told it would be in order for me to vote, as I had already been elected. It was a pleasure and an honor to travel to London to cast my vote for his election.
Anomalous Scattering in Proteins
In 1956, a consistent and obvious difference was observed between the diffracted intensities of a Friedel pair for a low-order reflection of myoglobin (what is now known as a Bijvoet difference).17 Wyckoff ruled out the possibility that it was an experimental artifact, or that it depended on solvent concentration, and it was recognized to be an anomalous scattering effect. These
17 J. C. Kendrew, G. Bodo, H. M. Dintzis, J. Kraut, and H. W. Wyckoff, unpublished data (1956).
Fig. 3. (a) Normal and anomalous components of the heavy atom structure factor for reflection $\mathbf{h}$, and for reflection $-\mathbf{h}$. (b) Comparison of the heavy atom structure factor $F_H(\mathbf{h})$ with the complex conjugate of its Friedel mate $F_H(-\mathbf{h})^*$.
studies were made with CuKα radiation, for which the iron atom of myoglobin has a significant anomalous scattering component (about 3.4 electrons). We can now recognize that this effect could have given clear evidence about the position of the iron atom in the myoglobin crystals. It suggested that anomalous scattering might provide useful phase information, even though the effects at accessible wavelengths were much smaller than those of isomorphous replacement. Let us refer18 to the normal part of the heavy atom structure factor as $F'_H$. This is calculated using the real component of the atomic scattering factor $f^0 + f'$. (In practice, $f'$ is a negative quantity.) The anomalous part $F''_H$ is calculated using the imaginary part of the atomic scattering factor, $if''$. As indicated in Fig. 3,
$$F'_H(\mathbf{h}) = F'_H(-\mathbf{h})^*$$
$$F''_H(\mathbf{h}) = -F''_H(-\mathbf{h})^*$$
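The two Friedel relations for the normal and anomalous parts can be checked numerically. The following is a minimal sketch, not taken from the chapter: the function name `heavy_atom_parts`, the two heavy-atom sites, and the scattering factor values are all hypothetical and chosen only for illustration.

```python
import cmath
import math

def heavy_atom_parts(h, sites, f_normal, f_anom):
    """Return (F'_H, F''_H) for reflection h.

    h        : Miller indices (h, k, l)
    sites    : fractional coordinates of identical heavy atoms
    f_normal : real part of the scattering factor, f0 + f'
    f_anom   : imaginary part of the scattering factor, f''
    """
    geometric = sum(
        cmath.exp(2j * math.pi * sum(hi * xi for hi, xi in zip(h, site)))
        for site in sites
    )
    # The normal part uses the real factor; the anomalous part uses i*f''.
    return f_normal * geometric, 1j * f_anom * geometric

# Hypothetical sites and scattering factors (illustration only).
sites = [(0.12, 0.40, 0.77), (0.63, 0.05, 0.31)]
f_normal, f_anom = 75.0, 8.0

h = (2, -1, 3)
minus_h = tuple(-i for i in h)
Fn_h, Fa_h = heavy_atom_parts(h, sites, f_normal, f_anom)
Fn_mh, Fa_mh = heavy_atom_parts(minus_h, sites, f_normal, f_anom)

# Both residuals should vanish to rounding error.
normal_residual = abs(Fn_h - Fn_mh.conjugate())      # F'_H(h) - F'_H(-h)*
anomalous_residual = abs(Fa_h + Fa_mh.conjugate())   # F''_H(h) + F''_H(-h)*
print(normal_residual, anomalous_residual)
```

Because all atoms in this sketch are of one type, the anomalous part is simply the normal part multiplied by i f''/(f0 + f'), which is the quadrature relation referred to in the notation.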
In the simple case of a centrosymmetric distribution of heavy atoms (which always exists for a single heavy atom site in a space group with an even-fold symmetry axis), the normal structure factor $F'_H$ of the heavy atom is real. In this case, the isomorphous replacement method estimates the cosine of the phase angle, but gives no information about its sine; measurement of the Bijvoet difference estimates the sine of the phase angle, but gives no information about its cosine (Fig. 1). In a general case (Fig. 4), the isomorphous replacement method and the anomalous scattering method give orthogonal information about the phases.
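The cosine and sine statement for the centrosymmetric case can be illustrated with a small calculation. This is a sketch under stated assumptions: the amplitudes (200 for the parent, 12 for the normal heavy-atom part, 1.5 for the anomalous part) are hypothetical, and the first-order expressions in the comments hold only when the heavy-atom contribution is small compared with the parent amplitude.

```python
import cmath
import math

F_P_AMP = 200.0       # parent amplitude |F_P| (hypothetical)
F_H_NORMAL = 12.0     # F'_H, real because the heavy-atom array is centrosymmetric
F_H_ANOM = 1.5j       # F''_H, in quadrature with F'_H

def differences(phase_deg):
    """Return (isomorphous difference, Bijvoet difference) for a parent phase."""
    F_P = cmath.rect(F_P_AMP, math.radians(phase_deg))
    plus = abs(F_P + F_H_NORMAL + F_H_ANOM)    # |F_PH(h)|
    minus = abs(F_P + F_H_NORMAL - F_H_ANOM)   # |F_PH(-h)|, equal to |F_PH(-h)*|
    isomorphous = 0.5 * (plus + minus) - F_P_AMP
    bijvoet = plus - minus
    return isomorphous, bijvoet

for phase in (0, 90, 180, 270):
    iso, bij = differences(phase)
    # To first order: iso ~ F'_H cos(phase), bij ~ 2 |F''_H| sin(phase)
    print(phase, round(iso, 2), round(bij, 2))
```

At phases of 0 and 180 degrees the Bijvoet difference vanishes while the isomorphous difference is largest in magnitude; at 90 and 270 degrees the roles are reversed.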
18 Notation: When discussing anomalous scattering, the subscript P refers to all the ordered atoms in the crystal whose atomic scattering factors are real. The subscript H refers to atoms that exhibit significant anomalous scattering, usually assumed to be all of the same type. The structure factor $F_H$ has two components: $F'_H$, which arises from the normal part $f^0 + f'$ of the scattering factor of the H atoms, and $F''_H$, which arises from the anomalous part $f''$ of their scattering factor. When all the anomalous scatterers are of the same type, $F''_H$ is in quadrature with $F'_H$.
Fig. 4. Harker constructions for (a) isomorphous replacement difference; (b) Bijvoet amplitude difference, showing how they give orthogonal phase information. In (a) the real part of the scattering by the heavy atoms is used to calculate the structure factor $F'_H$ appropriate for isomorphous replacement. In (b), the Bijvoet difference is due to opposite effects of the imaginary part of the heavy atom scattering factor on $F(\mathbf{h})$ and on $[F(-\mathbf{h})]^*$.
Considering the effects with CuKα too small, Blow15 plated a rotating anode with chromium and made measurements on a mercury derivative of hemoglobin, using CrKα radiation ($f''$ for mercury then estimated as 15e at 2.29 Å). This was a mistake, because the need for large absorption corrections at this wavelength seriously prejudiced precise observation of Bijvoet differences. It was subsequently concluded that MoKα radiation would have been more suitable, being relatively close to the mercury L absorption edge, but at a wavelength at which absorption errors are much smaller. The Bijvoet differences did give significant information to resolve ambiguities in the phases determined by isomorphous replacement, but the large errors made a quantitative estimate difficult. The results were simply categorized as Bijvoet difference probably positive, insufficient information, or as Bijvoet difference probably negative. Even this information was useful in estimating phases, when available isomorphous replacements left a large ambiguity in phase.15 Using only CuKα radiation, similar methods were employed by Cullis et al.19 to help resolve ambiguities of phase left by the isomorphous
19 A. F. Cullis, H. Muirhead, M. F. Perutz, M. G. Rossmann, and A. C. T. North, Proc. R. Soc. A. 265, 15 (1961).
replacement technique, in determining the hemoglobin structure to 5.8-Å resolution. The squares of the Bijvoet amplitude differences provide a set of Fourier coefficients of an approximate Patterson function of the anomalous scatterers (closely analogous to the difference Patterson for an isomorphous pair). Using CuKα radiation, the iron atoms of hemoglobin are the only important anomalous scatterers in the molecule and Rossmann showed how the iron atom positions could be determined directly from the Bijvoet differences.19a Blow and Rossmann20 showed that a recognizable but more noisy electron density map could be obtained using only the data from the parent crystal and a single isomorphous derivative, including anomalous scattering observations, the method now known as SIRAS (single isomorphous replacement with anomalous scattering).
Methods of Analysis at Fixed Wavelength
Blow and Rossmann20 did not simply use the sign of the observed Bijvoet difference to resolve the ambiguity of phase left by the single isomorphous replacement. Instead, they followed a procedure similar to that of Blow and Crick,16 in which a probability is assigned to every possible phase angle depending on how accurately it fits the observations. For a particular reflection $\mathbf{h}$, the observations of $|F_{PH}(\mathbf{h})|$ and $|F_{PH}(-\mathbf{h})|$ were treated as separate observations. The heavy-atom structure factors $F_H(\mathbf{h})$ and $F_H(-\mathbf{h})^*$ were calculated using a complex atomic scattering factor $f^0 + f' + if''$ (Fig. 1). The analysis was carried out as though the two members of the Friedel pair were separate isomorphous derivatives. An improvement was suggested by North,21 who pointed out that the errors in exploiting the Bijvoet difference are far smaller than those that arise in isomorphous replacement, and the Bijvoet difference can be interpreted with greater precision. The implications of the isomorphous difference are confused by departures from ideal isomorphism between ‘‘parent’’ and ‘‘derivative,’’ but there is no corresponding inaccuracy affecting the Bijvoet difference. Also, because measurements are made on the same crystal, often under similar geometric conditions, the amplitude difference is measured more accurately. North suggested a different algorithm for calculation of the phase probabilities, depending on the three observed structure amplitudes, $|F_P(\mathbf{h})|$, $|F_{PH}(\mathbf{h})|$, and $|F_{PH}(-\mathbf{h})|$, and on the estimated normal and anomalous components of the scattering by the heavy atoms, $F'_H(\mathbf{h})$ and $F''_H(\mathbf{h})$. However, as North recognized, the
19a M. G. Rossmann, Acta Crystallogr. 14, 383 (1961).
20 D. M. Blow and M. G. Rossmann, Acta Crystallogr. 14, 1195 (1961).
21 A. C. T. North, Acta Crystallogr. 18, 212 (1965).
algorithm depended on an approximation and could be used in different ways. This was a considerable improvement, but Matthews22 found a better formulation. The essence of the method was to change the variables used in the analysis. Instead of working with the observed quantities $|F_{PH}(\mathbf{h})|$ and $|F_{PH}(-\mathbf{h})|$, Matthews worked with the mean structure amplitude $\tfrac{1}{2}(|F_{PH}(\mathbf{h})| + |F_{PH}(-\mathbf{h})|)$ and the Bijvoet amplitude difference $|F_{PH}(\mathbf{h})| - |F_{PH}(-\mathbf{h})|$. The mean structure amplitude estimates the structure amplitude that would exist if $f''$ were zero, designated $F'_{PH}(\mathbf{h})$. This is used in the usual way with $|F_P(\mathbf{h})|$ and with the calculated normal part of the heavy atom structure factor $F'_H(\mathbf{h})$, to obtain a phase probability distribution by isomorphous replacement. The Bijvoet amplitude difference is used in a similar way with the calculated anomalous part of the heavy atom structure factor. This second contribution to the phase probability distribution is independent of any assumption about isomorphism with the parent crystal P. The Bijvoet amplitude difference ($|F_{PH}(\mathbf{h})| - |F_{PH}(-\mathbf{h})|$) is used to develop a phase probability distribution derived from anomalous effects. Because there are no errors due to nonisomorphism, the intrinsic errors in interpreting the Bijvoet difference are much smaller, so the root-mean-square (RMS) lack of closure $E''$ is smaller, leading to a more tightly defined phase distribution. This formulation22 is valid even when different types of anomalous scatterer exist, but usually one type of anomalous scatterer is assumed. This method of phase determination was coded by L. F. Ten Eyck23 and by J. E. Ladner24 into a widely used program PHARE (now incorporating a maximum likelihood refinement procedure and called MLPHARE25). The Ten Eyck phasing algorithm seems to have been used without significant change. In that era X-ray analysis of proteins was practicable only when using characteristic radiations such as CuKα.
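The change of variables can be sketched as a small phase-probability calculation. This is an illustrative sketch only, not the PHARE or MLPHARE code: the function name `phase_distributions`, the Gaussian error model, the lack-of-closure values, and the synthetic amplitudes are all assumptions, and the calculated Bijvoet difference is evaluated exactly from trial structure factors rather than with the closed-form approximation of the original formulation.

```python
import cmath
import math

def phase_distributions(fp_amp, plus_amp, minus_amp, fh_normal, fh_anom,
                        e_iso, e_ano, step_deg=1):
    """Return trial phases with isomorphous-only and combined probabilities."""
    mean_amp = 0.5 * (plus_amp + minus_amp)   # estimate of |F'_PH|
    bijvoet_obs = plus_amp - minus_amp        # observed Bijvoet difference
    phases = list(range(0, 360, step_deg))
    p_iso, p_both = [], []
    for phi in phases:
        fph_normal = cmath.rect(fp_amp, math.radians(phi)) + fh_normal
        # Isomorphous lack of closure, in amplitude.
        x_iso = abs(fph_normal) - mean_amp
        # Anomalous lack of closure: observed minus calculated Bijvoet difference.
        bijvoet_calc = abs(fph_normal + fh_anom) - abs(fph_normal - fh_anom)
        x_ano = bijvoet_obs - bijvoet_calc
        w_iso = math.exp(-x_iso ** 2 / (2 * e_iso ** 2))
        w_ano = math.exp(-x_ano ** 2 / (2 * e_ano ** 2))
        p_iso.append(w_iso)
        p_both.append(w_iso * w_ano)
    norm_iso, norm_both = sum(p_iso), sum(p_both)
    return (phases,
            [p / norm_iso for p in p_iso],
            [p / norm_both for p in p_both])

# Synthetic, error-free "observations" built from a known parent phase.
true_phase = 50
fh_normal = cmath.rect(20.0, math.radians(110))   # hypothetical F'_H
fh_anom = 0.1j * fh_normal                        # F''_H in quadrature with F'_H
fp = cmath.rect(100.0, math.radians(true_phase))
plus_amp = abs(fp + fh_normal + fh_anom)          # |F_PH(h)|
minus_amp = abs(fp + fh_normal - fh_anom)         # |F_PH(-h)|

phases, p_iso, p_both = phase_distributions(
    100.0, plus_amp, minus_amp, fh_normal, fh_anom, e_iso=3.0, e_ano=1.0)
best_phase = phases[p_both.index(max(p_both))]
mirror_phase = (2 * 110 - true_phase) % 360       # the other isomorphous solution
print(best_phase, p_iso[true_phase], p_iso[mirror_phase])
```

In this toy case the isomorphous term alone gives equal weight to the true phase and to its mirror image about the heavy-atom phase, while the product with the anomalous term leaves a single dominant peak. The smaller value chosen for `e_ano` reflects the point made in the text that the lack of closure for the Bijvoet difference is smaller.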
Investigators concentrated on $f''$, which causes a Bijvoet difference, and tended to ignore $f'$, which modifies the magnitude of the isomorphous difference, because at the given wavelength it is a fixed quantity. Minor criticisms of the Matthews algorithm22 can be made.
1. The normal part of the scattering $F'_{PH}(\mathbf{h})$ is the complex quantity $\tfrac{1}{2}[F_{PH}(\mathbf{h}) + F_{PH}(-\mathbf{h})^*]$. But Matthews approximates $|F'_{PH}(\mathbf{h})|$ as $\tfrac{1}{2}$[$|F_{PH}(\mathbf{h})|$ +
22 B. W. Matthews, Acta Crystallogr. 20, 82 (1966).
23 L. F. Ten Eyck, J. Mol. Biol. 100, 3 (1976).
24 J. E. Ladner, personal communication (2002).
25 Z. Otwinowski, ‘‘CCP4 Study Weekend Proceedings’’ (W. Wolf, P. R. Evans, and A. G. W. Leslie, eds.), p. 80. Daresbury Laboratory, Warrington, UK, 1991.
12
phases
[1]
Fig. 5. The two triangles shown include the length jFPH0 (h)j, which in each triangle can be calculated from the lengths of the other sides, and from two related (but unknown) angles. Equating two trigonometric expressions for jFPH0 (h)j leads to Eq. (1).
jFPH(h)j]. By straightforward trigonometry (Fig. 5) (see also Burling et al.26), 1 00 0 jFPH ðhÞj2 ¼ ðjFPH ðhÞj2obs þ jFPH ðhÞj2obs Þ jFPH ðhÞj2calc 2
(1)
Subscripts “obs” and “calc” emphasize that the calculated part of this expression is a small correction to the value derived from observation. In practice the error will often be on the order of 1% and it will rarely exceed 3–4% unless $F''_{PH}$ is extraordinarily large.

2. The isomorphous replacement method, as conventionally used, gives a phase probability distribution for the parent crystal $F_P$. The phase probability distribution derived from the Bijvoet difference applies to the normal part of the scattering from the derivative crystal, $F'_{PH}$. That means it is the phase probability distribution for $(F_P + F'_H)$. This fact was ignored by Matthews, who estimated the phase probability as if these two distributions applied to the same quantity.

The errors created by these simplified assumptions were insignificant in relation to the precision of phase estimation at the time.

A slightly different approach to isomorphous replacement was introduced by Hendrickson and Lattman.27 They devised a method to summarize the phase probability distribution by four coefficients A, B, C, and D, which essentially represent the first two Fourier components of the phase probability curve. To achieve this, they expressed the lack of closure error $e(\varphi)$ as the lack of agreement of observed and calculated intensity,

$$e(\varphi) = \left|\,|F_P|\exp(i\varphi) + F_H\right|^2 - |F_{PH}|^2$$
26 F. T. Burling, W. I. Weis, K. M. Flaherty, and A. T. Brünger, Science 271, 72 (1996).
27 W. A. Hendrickson and E. E. Lattman, Acta Crystallogr. B 26, 136 (1970).
(In this expression $|F_P|$ and $|F_{PH}|$ are derived from the observed intensities, and $F_H$ is calculated from the heavy atom parameters.) This method has been incorporated into a number of computer programs. A criticism of it would be that the observational error in intensity tends to be proportional to the intensity, so that larger errors $e$ are usually encountered for intense reflections. In contrast, the observational error in amplitude is fairly constant for weak and medium strength reflections. Moreover, errors due to nonisomorphism are not correlated with intensity. Therefore the root-mean-square error $E = \langle e(\varphi_{best})^2 \rangle^{1/2}$ depends on the intensity. Blow and Crick16 used the amplitude error

$$x(\varphi) = \left|\,|F_P|\exp(i\varphi) + F_H\right| - |F_{PH}|$$

in their analysis, and this justifies using a single value for the root-mean-square lack of closure $E = \langle x(\varphi_{best})^2 \rangle^{1/2}$ at a given resolution, independent of the observed intensity (see also Kumar and Rossmann28).

A new era in the use of anomalous scattering began when Hendrickson and Teeter29 spectacularly demonstrated the possibilities of using anomalous scattering on its own, by total determination of the crambin structure using the anomalous scattering of its six sulfur atoms in Cu Kα radiation. In terms of a Harker diagram (Fig. 4b), this method (now called SAD: single-wavelength anomalous diffraction) produces a phase ambiguity equivalent to that of a single isomorphous replacement, but still gives important information. Because the constellation of six sulfur atoms will never be centrosymmetric there is no restriction on the indicated phase angle. The resulting image is not the structure plus its inverse, but a noisy image of the structure, which can be refined using other information, especially at high resolution. As is discussed in the final section of this article, these methods have now become powerful.

Synchrotron Radiation
In the late 1970s synchrotron radiation became more accessible, the first beamline facilities for macromolecular crystallography were set up, and for the first time experiments became feasible using any chosen wavelength. The possibilities of using anomalous scattering at several wavelengths were recognized early by Phillips et al.30
28 A. Kumar and M. G. Rossmann, Acta Crystallogr. D 52, 518 (1996).
29 W. A. Hendrickson and M. Teeter, Nature 290, 107 (1981).
30 J. C. Phillips, A. Wlodawer, J. M. Goodfellow, K. D. Watenpaugh, L. C. Sieker, L. H. Jensen, and K. O. Hodgson, Acta Crystallogr. A 33, 445 (1977).
It had now become possible to exploit the changes in $f'$ at different wavelengths, as well as the existence of Bijvoet differences. Karle31 made a new analysis of the problem that was rigorous and general. It developed the possibility of determining phase angles using anomalous scattering by a single crystal at different wavelengths [now known as MAD: multiple-wavelength anomalous diffraction (or dispersion)]. Karle’s analysis, although presented in a general way, concentrates on the usual case in practice, in which there is one type of anomalous scatterer and other types of atom whose anomalous scattering factors $f'$ and $f''$ can be taken as zero. To preserve the notation already adopted, this article shall identify the scattering by these two parts of the structure by subscripts H and P, respectively (Karle uses subscripts 2 and 1). The subscript PH is omitted, because it refers to the whole structure of the crystal under investigation, including normal and anomalous scatterers. Following Karle, a structure factor $F(h)$ without subscript refers to scattering by the whole crystal. Because the crystal includes anomalous scatterers, this structure factor depends on the wavelength and is written ${}^{\lambda}F(h)$.

An important feature of Karle’s analysis31 lies in the definition of the “normal” and “anomalous” parts of the structure. The normal structure factors $F^n$ are calculated as though all the electrons in the structure scatter normally. The anomalous component ${}^{\lambda}F^a$ is the correction that must be made to give the actual scattering at some wavelength $\lambda$. To emphasize wavelength dependence, structure factors that are dependent on wavelength are given a presuperscript $\lambda$:

$${}^{\lambda}F(h) = F^n(h) + {}^{\lambda}F^a(h)$$
(Although ${}^{\lambda}F^a$ derives entirely from the heavy atoms that exhibit anomalous scattering, the subscript H is not needed, because there is no other anomalous scattering.) If the parameters of the anomalous scatterers are known (including the complex atomic scattering factors at wavelength $\lambda$), ${}^{\lambda}F^a$ may be calculated. In contrast to the earlier approaches discussed above, the effects of changes to the real parts of the atomic scattering factors $f'$ are now included in the anomalous component ${}^{\lambda}F^a$. This facilitates comparison of scattering at different wavelengths. But the anomalous scattering ${}^{\lambda}F^a$ is not in quadrature with the normal part $F^n$. The relationship $F''_H(\bar{h})^{*} = -F''_H(h)$ stated above does not apply to ${}^{\lambda}F^a$ because $F''_H$ is only one component of ${}^{\lambda}F^a$. The notation has many advantages, but it does not allow the Bijvoet difference and the Bijvoet amplitude difference to be interpreted so simply. The two notations are compared in Fig. 6.
31 J. Karle, Int. J. Quantum Chem. Quantum Biol. Symp. 7, 357 (1980).
Fig. 6. Comparison of Karle notation using $F^n$, ${}^{\lambda}F^a$ with “pseudo-isomorphous” notation using $F_P$, $F'_H$, and $F''_H$. $F'_H$ and $F''_H$ are orthogonal because all the atoms scattering anomalously are assumed to be of the same type.
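The two notations compared in Fig. 6 can be checked numerically. The sketch below is an illustration only: the one-dimensional toy cell, the atom positions, the scattering-factor values, and the helper names `structure_factor` and `angle_between` are assumptions introduced here, not values taken from the text. It builds both decompositions for one reflection, confirms that they give the same total structure factor, shows that $F''_H$ is in quadrature with $F'_H$ whereas ${}^{\lambda}F^a$ is not in quadrature with $F^n$, and verifies the identity of Eq. (1).

```python
import numpy as np

def structure_factor(h, positions, f):
    """Sum f * exp(2*pi*i*h*x) over atoms at fractional positions x (1-D toy cell)."""
    return np.sum(f * np.exp(2j * np.pi * h * np.asarray(positions)))

def angle_between(a, b):
    """Signed angle in degrees from b to a, treating complex numbers as 2-D vectors."""
    return np.degrees(np.angle(a * np.conj(b)))

# Hypothetical toy structure; every number here is an illustrative assumption.
x_P = [0.11, 0.27, 0.52, 0.68, 0.93]   # normally scattering atoms (P)
x_H = [0.19, 0.61]                     # one type of anomalous scatterer (H)
f_P = 7.0                              # real scattering factor of the P atoms
f0_H, f1_H, f2_H = 34.0, -8.0, 5.0     # f0, f', f'' of the H atoms at one wavelength
h = 3

# "Pseudo-isomorphous" notation: F = F_P + F'_H + F''_H
F_P = structure_factor(h, x_P, f_P)
Fp_H = structure_factor(h, x_H, f0_H + f1_H)    # normal part, built from f0 + f'
Fpp_H = structure_factor(h, x_H, 1j * f2_H)     # anomalous part, built from i*f''
F_total = F_P + Fp_H + Fpp_H

# Karle notation: F = F^n + F^a
F_n = F_P + structure_factor(h, x_H, f0_H)          # all electrons scatter normally
F_a = structure_factor(h, x_H, f1_H + 1j * f2_H)    # wavelength-dependent correction

def total(hh):
    """Total structure factor of the crystal for reflection hh."""
    return (structure_factor(hh, x_P, f_P)
            + structure_factor(hh, x_H, f0_H + f1_H + 1j * f2_H))

# Eq. (1): |F'_PH|^2 = (|F(h)|^2 + |F(-h)|^2)/2 - |F''_H|^2
lhs = abs(F_P + Fp_H) ** 2
rhs = 0.5 * (abs(total(h)) ** 2 + abs(total(-h)) ** 2) - abs(Fpp_H) ** 2

print("same total:", np.isclose(F_total, F_n + F_a))
print("angle(F''_H, F'_H) in degrees:", angle_between(Fpp_H, Fp_H))
print("angle(F^a, F^n) in degrees:", angle_between(F_a, F_n))
print("Eq. (1) holds:", np.isclose(lhs, rhs))
```

The quadrature of $F''_H$ with $F'_H$ depends on the single-atom-type assumption stated in the caption; with two anomalous scatterer types of different $f''/(f_0 + f')$ ratio the first angle would no longer be exactly 90°.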
Karle’s analysis31 expands the algebraic expression for each observable intensity as a linear combination of terms which depend on four variables. Two variables are $|F^n_P|^2$ and $|F^n_H|^2$, and the other two depend on the sine and cosine of the angle $(\varphi^n - \varphi^a)$, which determines their phase relationship. If ${}^{\lambda}F^a$ can be calculated from the parameters of the heavy atoms, this angle leads to a phase angle for the normal component of the structure factor $F^n$. At each wavelength two independent intensity observations $[{}^{\lambda}F(h)]^2$ and $[{}^{\lambda}F(\bar{h})]^2$ provide two independent quantities that depend on these four variables. If measurements are made at two wavelengths, giving four independent observations, the system is determined in principle. If three wavelengths are used there are six equations in four unknowns, and standard linear algebra can give a definite least-squares solution (as in MADLSQ32). The precision of the result depends on the properties of the normal matrix of the equations, that is, on how well “conditioned” they are.

When the equations are solved for each reflection, the Fourier transform of $|F^n_H|^2$ provides a Patterson function of the array of anomalous scatterers. From this, the parameters of the heavy atoms may be determined.

Compared with the isomorphous replacement method, anomalous scattering analysis is relatively error-free. The experimental errors arise from inaccuracies of intensity measurement, and from inaccurate estimation of the anomalous scattering component, arising from errors in the estimated parameters of the atoms that cause it, at the particular wavelengths employed. There is also a significant but usually small error that arises from the assumption that all of the atoms in the P part of the structure are “normal” scatterers, with $f'$ and $f''$ precisely zero.

32 W. A. Hendrickson, Trans. Am. Crystallogr. Assoc. 21, 245 (1985).

For the Karle method,31 it was assumed at first that no sophisticated error analysis was needed, and many structures were solved in this way, using three and sometimes four different wavelengths. These usually include two very close wavelengths at the maximum anomalous scattering $f''$ (“peak”) and at the absorption edge (“edge,” where the change $f'$ to the normal scattering is maximized), and one or more “remote” wavelengths where $f'$ is fairly small, although $f''$ may be significant.

A study of selenobiotinyl streptavidin by MAD undertook a direct analysis of experimental error.33 The formulas derived from Karle’s analysis31 were rearranged into a form parallel to Hendrickson and Lattman’s expressions for MIR (multiple isomorphous replacement).27 The lack of agreement of observed and calculated intensities $e(\varphi)$ at each wavelength could be expressed directly in terms of four quantities A, B, C, and D, which allow a phase probability distribution to be calculated, leading to a best phase and figure of merit for each structure factor. The method was reported to give a smaller phase error, and the map using the resulting Fourier coefficients was significantly enhanced in appearance and ease of interpretation, compared with results from MADLSQ.

Two Ways with MAD
An alternative to the Karle approach is to apply the method of Matthews,22 originally developed to deal with anomalous dispersion at a single wavelength in conjunction with isomorphous replacement. Hendrickson and Ogata34 and Smith and Hendrickson35 have contrasted the two approaches. When the Matthews approach is applied to MAD, it is referred to as “pseudo-MIR” by Smith and Hendrickson. The methods based on Karle’s approach,31–33 developed specifically for multiple wavelength studies, are called the “explicit” approach. An important practical difference between the methods arises in identifying the positions of the anomalous scatterers. In pseudo-MIR the observed Bijvoet amplitude difference directly provides coefficients $[|{}^{\lambda}F(h)| - |{}^{\lambda}F(\bar{h})|]^2$ for an anomalous difference Patterson synthesis as in Rossmann.19a The coefficients from observations at several wavelengths may be combined. In the explicit approach, the Karle simultaneous equations generate $|F^n_H|^2$, which provide coefficients for a Patterson function of the anomalous scatterers. The advantages and disadvantages of these approaches are discussed briefly by Smith and Hendrickson.35
33 A. Pähler, J. L. Smith, and W. A. Hendrickson, Acta Crystallogr. A 46, 537 (1990).
34 W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997).
35 J. L. Smith and W. A. Hendrickson, in “International Tables for Crystallography” (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 299. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
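The explicit approach can be illustrated with a small least-squares sketch for a single reflection. It assumes one type of anomalous scatterer, so that the two Bijvoet intensities at each wavelength are linear in four unknowns: $|F^n_P|^2$, $|F^n_H|^2$, $|F^n_P||F^n_H|\cos\Delta$, and $|F^n_P||F^n_H|\sin\Delta$, with $\Delta = \varphi^n_P - \varphi^n_H$. The coefficients follow from expanding $|F_P + (f/f_0)F^n_H|^2$ with $f = f_0 + f' + if''$. The function names and the scattering-factor values are illustrative assumptions, and this is not the MADLSQ code.

```python
import numpy as np

def design_matrix(f0, wavelengths):
    """Two rows per wavelength, for |F(+h)|^2 and |F(-h)|^2.
    Columns multiply |F_P|^2, |F_H|^2, |F_P||F_H|cos(delta), |F_P||F_H|sin(delta)."""
    rows = []
    for f1, f2 in wavelengths:                     # (f', f'') at each wavelength
        a = ((f0 + f1) ** 2 + f2 ** 2) / f0 ** 2
        b = 2.0 * (f0 + f1) / f0
        c = 2.0 * f2 / f0
        rows.append([1.0, a, b, +c])               # reflection h
        rows.append([1.0, a, b, -c])               # Friedel mate of h
    return np.array(rows)

def simulate_intensities(F_P, F_H, f0, wavelengths):
    """Error-free Bijvoet intensities for known F_P and normal heavy atom part F_H."""
    obs = []
    for f1, f2 in wavelengths:
        scale = (f0 + f1 + 1j * f2) / f0
        obs.append(abs(F_P + scale * F_H) ** 2)
        obs.append(abs(np.conj(F_P) + scale * np.conj(F_H)) ** 2)
    return np.array(obs)

def solve_reflection(intensities, f0, wavelengths):
    """Linear least-squares estimate of the four unknowns for one reflection."""
    A = design_matrix(f0, wavelengths)
    x, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    FP2, FH2, cos_term, sin_term = x
    delta = np.degrees(np.arctan2(sin_term, cos_term))
    return FP2, FH2, delta, np.linalg.cond(A)

# Illustrative (assumed) values: f0 and "peak", "edge", "remote" (f', f'') pairs.
f0 = 34.0
wavelengths = [(-8.0, 5.0), (-10.0, 3.0), (-2.5, 3.5)]
F_P = 100.0 * np.exp(1j * np.radians(40.0))
F_H = 20.0 * np.exp(1j * np.radians(-30.0))

I_obs = simulate_intensities(F_P, F_H, f0, wavelengths)
FP2, FH2, delta, cond = solve_reflection(I_obs, f0, wavelengths)
print("|F_P|^2 =", FP2, " |F_H|^2 =", FH2, " delta (deg) =", delta)
print("condition number of the design matrix:", cond)
```

The printed condition number gives a rough feel for the remark about how well “conditioned” the equations are: choosing wavelengths with similar $f'$ makes the first three columns nearly dependent and inflates it. The recovered $|F^n_H|^2$ values are the quantities that would serve as Patterson coefficients.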
Once the positions of the anomalous scatterers have been established by either of these approaches, an estimate of $F^n_H$ $(= |F^n_H| \exp i\varphi^n_H)$ can be calculated, and the parameters of the H atoms can be refined by many available methods. When this has been done, either the explicit or the pseudo-isomorphous method may be used to obtain phases.

In the explicit approach, quantities proportional to $\cos(\varphi^n_P - \varphi^n_H)$ and $\sin(\varphi^n_P - \varphi^n_H)$ have already been calculated, so that $\varphi^n_H$ leads directly to the phase angle $\varphi^n_P$, which allows calculation of an electron density map representing the scattering density of the normal scatterers. Alternatively, the Karle equations are revisited to generate a best phase and figure of merit using the ABCD algorithm.

An adaptation of the MIRAS (multiple isomorphous replacement with anomalous scattering) approach36,37 uses the Bijvoet difference to give phase information at each wavelength as in the Matthews method.22 The changes in $F'_H$ at different wavelengths are used like isomorphous replacement differences. This approach worked well, but it includes an approximation because it is not defined which phase is being determined. (Each Bijvoet difference indicates the phase at a different wavelength.) Terwilliger38 suggested further approximations, but Burling and colleagues carried out a more precise analysis.26 In this scheme, every pair of observed intensities either related as a Bijvoet pair, or related by a wavelength change, can be treated separately to provide a phase probability curve. These results were compared with those obtained by Hendrickson’s method32 and consistent and significant improvement in phasing accuracy was reported. Similar results are reported by other authors. Some differences remain, however, about which phase is to be determined. Burling et al.26 chose diffraction at the “remote” wavelength to represent the “parent,” but comment that this represents a difference from calculating the phase of $F^n$ as defined by Karle.31

A Better Way?
We must be clear about what phase is actually to be determined. There are two sensible choices: either the phase of $F_P$ (the phase of the structure factor corresponding to the normally scattering atoms in the crystal, but omitting the anomalous scatterers), or the phase of $F^n$ (the phase of the structure factor if all the electrons in the crystal scattered normally). The Fourier transform of $F_P$ will show the electron density of all the normal scatterers in the crystal; the Fourier transform of $F^n$ will show the density of all the electrons in the crystal. In the MAD technique, neither of these structure factors is observable. But because the structure factors ${}^{\lambda}F_H$ corresponding to the scattering by the anomalous scatterers are calculable from their parameters (equivalently, the structure factor ${}^{\lambda}F^a$ caused by anomalous scattering effects may be calculated), this creates no fundamental problem.

36 V. Ramakrishnan, J. T. Finch, V. Graziano, P. L. Lee, and R. M. Sweet, Nature 362, 219 (1993).
37 V. Ramakrishnan and V. Biou, Methods Enzymol. 276, 538 (1997).
38 T. C. Terwilliger, Acta Crystallogr. D 50, 17 (1994).

A straightforward approach was suggested by Bella and Rossmann,39 who chose to estimate the phase of $F_P$. Each experimental observation of $|{}^{\lambda}F(h)|$ or $|{}^{\lambda}F(\bar{h})|$, together with the calculated contribution of the anomalous scatterers ${}^{\lambda}F_H(h)$ or ${}^{\lambda}F^{*}_H(\bar{h})$, follows the usual relationship

$${}^{\lambda}F(h) = F_P(h) + {}^{\lambda}F_H(h)$$
Using the Harker construction, each observation generates a circle of possible values for $F_P(h)$ (Fig. 7), but because of observational and systematic errors the circles do not all intersect perfectly. In the MAD technique there is no direct measure of $|F_P|$, and Bella and Rossmann exploited a method of analysis in which the most probable phase is identified as that where the Harker circles intersect most closely.19

This method has several advantages. The analysis is done directly in terms of the observed quantities $|{}^{\lambda}F(h)|$. Each observation is treated in an equivalent way. There is total clarity about which phase is being evaluated. Compared with the isomorphous replacement method, there is a complication because there is no direct measure of $|F_P|$.

The analysis does not need to be restricted to the method of selecting close intersection.19 For any chosen value of $F_P$ (specifying both amplitude and phase), a lack of closure for each observation of $|{}^{\lambda}F(\pm h)|$ is readily calculated. As seen on the Harker diagram, it is simply the radial distance $x_i$ of this value of $F_P$ from the corresponding Harker circle (Fig. 8). This approach could be applied equally well to find a “best” value of the quantity $F^n$ defined by Karle.31

A more sophisticated method of analysis is embodied in the program SHARP (statistical heavy atom refinement and phasing).40 In SHARP, all possible values of an “unperturbed native structure factor” $F_P^{*}$ are considered, using a maximum-likelihood formulation in which errors in all observations and parameters of the problem including the “lack of closure” are retained as variables. In this way the effects of correlations and feedback in estimated parameters for a particular reflection are included in the analysis, and the resulting estimate of the complex quantity $F_P^{*}$ is said to be unbiased. In SHARP every observation is handled equivalently, and all contribute to the likelihood function for $F_P^{*}$ (a two-dimensional function representing its amplitude as well as its phase). The absence of any observation of $|F_P|$ does not change the formulation of the problem, so the MAD technique can be treated in the same way as usual.

39 J. Bella and M. G. Rossmann, Acta Crystallogr. D 54, 159 (1998).
40 E. de la Fortelle and G. Bricogne, Methods Enzymol. 276, 472 (1997).

Fig. 7. Harker diagram for the MAD technique. To keep the diagram as simple as possible only two wavelengths are illustrated, denoted as $\lambda_1$ and $\lambda_2$. Data from reflection $h$ are identified by a subscript plus symbol, and data from its Friedel mate $\bar{h}$ are denoted by a subscript negative symbol. A possible value for $F_P$ is indicated as a dashed line, but because $|F_P|$ is not observable, its amplitude is unknown. The open circle represents the structure factor if all atoms scattered normally.

Around the turn of the century, MAD became the most frequently used technique for direct determination of an unknown macromolecular structure (where the structure cannot be inferred from a homologous structure). The different approaches remain in competition. In either case, the analysis proceeds by two steps, the first of which obtains the positions of the anomalous scatterers by Patterson methods, either using the Karle equations for $|F^n_H|^2$, or using coefficients derived directly from the Bijvoet differences. Phase determination (or maximum likelihood analysis) then uses all the parameters of the anomalous scatterers. Again, two methods are available (explicit, using the Karle equations, or pseudo-isomorphous), and there is no reason why the second step should use the same formulation as the first.

Fig. 8. An enlarged view of a small part of Fig. 7. The solid square indicates a point chosen as a possible origin for the $F_P$ vector. The radial distance, measured along $F_P$, between this point and any Harker circle represents a lack of closure $x(F_P)$ for the corresponding observation of $|{}^{\lambda}F(\pm h)|$. One such distance is identified.

The Future Is SAD
It seems likely, however, that the various improvements to analyze MAD data more correctly are fading into insignificance. The MAD technique is losing ground to SAD.

SAD has problems similar to those of SIR, because there are only two measurements, $|F(h)|$ and $|F(\bar{h})|$, so that they indicate two possible values for the phase angle as in Fig. 4b. If the distribution of anomalous scattering electrons does not have a tendency to centrosymmetry, the phases of the anomalous scattering contributions will be fairly random, and will not tend to generate false symmetry of the kind that Bokhoven et al. encountered with strychnine.3 Without other information, the “best” value for $F(h)$ is the mean of the two possible structure factors indicated by the two phase angles.

The use of this mean structure factor allows considerable error, introducing noise into the electron density map. In the absence of other errors, the electron density introduced by this noise is equal to the signal, equivalent to a mean phase error of 45°. But, in fact, we know that interpretable maps have been obtained from data with phase errors larger than 45°. Phase errors contribute to “noise” distributed over the whole map, whereas the correct components of the phase information contribute density to “signal” at the atomic positions. As the resolution improves, additional reflections will not only sharpen the image but will also increase the signal-to-noise ratio. For this reason the consequences of phase ambiguity become less severe.

A further method to improve phase estimation was first used by Hendrickson and Teeter.29 It becomes more effective as the diffraction from the anomalous scatterers becomes a larger part of the total scattering. If the scattering by the anomalous scatterers $F_H(h)$ represents a significant fraction of the total scattering $F(h)$, the phase angle of $F(h)$ is more likely to be close to that of $F_H(h)$, and this may partly resolve the ambiguity between two possible phases. Sim41 derived the relevant probability function.

In the last decade there has been great progress in the improvement of poor-quality electron density maps. In addition to improved refinement procedures for the parameters of anomalous scatterers, the solvent flattening and histogram matching algorithms have become powerful. At a resolution at which individual atoms can be resolved, direct methods allow further refinement.

Many authors have reported excellent results using SAD. A significant advantage is that because only one wavelength is used, the complication of resetting the apparatus to precisely defined wavelengths is avoided. Data collection can proceed without interruption, so that radiation damage problems are greatly reduced, and more accurate measurement is possible. Brodersen et al.42 determined two structures from a direct SAD analysis in 2000. From a thorough comparison of MAD and SAD techniques to interpret eight different structures, Rice et al.43 concluded that “the combination of SAD phasing and solvent flattening will be sufficient to determine most structures.” Dauter et al.44 have reported on 13 structures, mostly macromolecules, which with one exception proved interpretable by SAD. An important advance is the use of heavy halide ions in the supernatant, which frequently bind to favorable sites on the protein surface and can be used as anomalous scatterers.45
41 G. A. Sim, Acta Crystallogr. 12, 813 (1959).
42 D. E. Brodersen, E. de la Fortelle, C. Vonrhein, G. Bricogne, J. Nyborg, and M. Kjeldgaard, Acta Crystallogr. D 56, 431 (2000).
43 L. M. Rice, T. N. Earnest, and A. T. Brunger, Acta Crystallogr. D 56, 1413 (2000).
44 Z. Dauter, M. Dauter, and E. Dodson, Acta Crystallogr. D 58, 494 (2002).
45 Z. Dauter and M. Dauter, Structure 9, R21 (2001).
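The twofold phase ambiguity of SAD, and the effect of averaging the two candidates, can be made concrete with a short sketch. It relies on the small-anomalous-signal approximation $|F(h)| - |F(\bar{h})| \approx 2|F''_H|\cos(\varphi - \varphi'')$, where $\varphi''$ is the phase of the calculated anomalous part $F''_H$. The function name `sad_candidates` and all numerical values are assumptions made for illustration, and equal weights are given to the two candidates (no Sim-type weighting).

```python
import numpy as np

def sad_candidates(F_plus, F_minus, Fpp_H):
    """From one Bijvoet pair of amplitudes and the calculated F''_H (complex),
    return the two candidate phases in degrees, the equal-weight centroid
    structure factor, and its figure of merit."""
    mean_amp = 0.5 * (F_plus + F_minus)
    # cos(theta) = Bijvoet amplitude difference / (2 |F''_H|), clipped for safety
    ratio = np.clip((F_plus - F_minus) / (2.0 * abs(Fpp_H)), -1.0, 1.0)
    theta = np.arccos(ratio)
    phi_pp = np.angle(Fpp_H)
    phases = np.array([phi_pp + theta, phi_pp - theta])   # symmetric about phi''
    candidates = mean_amp * np.exp(1j * phases)
    centroid = candidates.mean()            # the "best" F without other information
    fom = abs(centroid) / mean_amp          # equals |cos(theta)|
    return np.degrees(phases), centroid, fom

# Simulated reflection (illustrative values only).
true_phase = 75.0
F_normal = 100.0 * np.exp(1j * np.radians(true_phase))   # normal part of the scattering
Fpp_H = 3.0 * np.exp(1j * np.radians(20.0))              # calculated anomalous part
F_plus = abs(F_normal + Fpp_H)     # amplitude for reflection h
F_minus = abs(F_normal - Fpp_H)    # amplitude for its Friedel mate

phases_deg, centroid, fom = sad_candidates(F_plus, F_minus, Fpp_H)
print("candidate phases (deg):", phases_deg)
print("centroid phase (deg):", np.degrees(np.angle(centroid)))
print("figure of merit:", fom)
```

One of the two candidates lies close to the true phase, the other is its mirror image about the phase of $F''_H$, and the equal-weight centroid points along $F''_H$ with reduced amplitude. Any additional information that favors one candidate would enter as unequal weights in the centroid.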
Bijvoet’s work has come full circle. Despite the development of sophisticated algorithms for MAD techniques, emphasis is returning to a method that relies simply on measurement of the mean intensity and the Bijvoet difference at a single wavelength, for every reflection.

Acknowledgment

I thank Gerard Bricogne and Brian Matthews for constructive criticism and helpful comment.
[2] SOLVE and RESOLVE: Automated Structure Solution and Density Modification

By Thomas C. Terwilliger

The analysis of X-ray diffraction measurements from macromolecular crystals and their interpretation in terms of a model of the macromolecule constitute a process of many steps, each involving significant decisions. Major steps include structure solution (scaling, heavy atom location, and phasing), density modification, model building, and refinement. The complexity of this process has for many years required the involvement of a highly trained crystallographer for reasonable decision-making and successful completion of the process. The combination of several factors has made it possible to carry out the critical structure solution process (scaling through phasing) in a fully automated fashion (SOLVE1). In addition, automated methods for refinement and model building with high-resolution data have been developed (wARP/ARP; Perrakis et al.2,3). The separate automation of structure solution and of model building and refinement presents the promise of full automation of the entire structure determination process from scaling diffraction data to a refined model.

The software packages SOLVE and RESOLVE comprise a suite for automated structure solution using multiple isomorphous replacement (MIR), multiwavelength anomalous dispersion (MAD), single-wavelength anomalous diffraction (SAD), and other approaches. Their defining feature is that they can carry out all the steps necessary for structure solution in a fully automated way. SOLVE can scale data, find heavy atom sites, and calculate phases, while RESOLVE can improve density, find patterns of electron density corresponding to helices and sheets, and build a preliminary model. The theory and operation of both SOLVE and RESOLVE have been described in detail.1,4,5 In addition, an overview of SOLVE and a comparison with other methods for finding heavy atom sites is presented elsewhere (see [3] in this volume).5a This chapter reviews how SOLVE and RESOLVE operate, emphasizing the computational approaches and philosophy that have been employed.

1 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 849 (1999).
2 A. Perrakis, T. K. Sixma, K. S. Wilson, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 53, 448 (1997).
3 A. Perrakis, R. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).

METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00

General Computational Approaches for Automated Structure Solution in SOLVE and RESOLVE
SOLVE and RESOLVE use three basic principles to carry out automated structure solution. The first is to create a seamless set of subprograms that carry out all the operations that are needed. The second is to develop scoring algorithms that can be used to replace conventional decision-making steps, and the third is to make these part of software systems that are highly error tolerant.

Creating a sequential set of subprograms that carry out all the tasks needed for structure solution is an obvious necessity for automated structure solution. Somewhat less obvious is the importance of having each subprogram provide, in a convenient fashion, all the information necessary for the next one to operate. Many software packages contain programs that can carry out all the steps of structure solution, but often the key parameters for one step (e.g., heavy atom sites) are simply part of the printout from a previous step. The key requirement for automation is that this information and all necessary data to go with it be passed from one step to another in as straightforward a fashion as possible. It is possible to automate by scanning output files for parameters, but this is far more difficult than simply passing on the key information.

Decision-making is the most complex part of structure solution, and is important in density modification and model building as well. There are many ways to make decisions in a process such as structure solution, which is iterative and somewhat branched. The approach taken in SOLVE is to use a scoring system to replace the conventional decision-making process. In SOLVE, what is scored is the quality of each potential heavy atom solution. The advantage of this approach is that once a scoring system is devised, the decision-making process simply becomes an optimization process for which there are many well-known algorithms.

Error-tolerant programming is a final and useful element in automating complex procedures. Although it would be convenient to eliminate all errors in algorithms and in programming these algorithms, in practice it is difficult to reduce these to fewer than about 1 error per 1000 lines of software code. In a package with 150,000 lines of code, it is reasonable to expect hundreds of errors. The approach used in SOLVE and RESOLVE is to minimize the effects of these difficult-to-identify errors by using the scoring and optimization system to reject analyses that result in identifiable errors. For example, any time the refinement of heavy atom positions in SOLVE fails for any reason, from zero occupancy of all sites to a programming error leading to a failed refinement, the score for that heavy atom solution is recorded as “very poor.” That particular solution is then rejected and other similar solutions are considered instead. As other solutions may have different characteristics, they may refine successfully and be used, getting around either truly incorrect solutions or programming errors that prevent successful refinement. This approach has the underlying reasonable premise that most errors will lead to scores that are poor.

4 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 55, 1863 (1999).
5 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 58, 2082 (2002).
5a C. M. Weeks, P. D. Adams, J. Berendzen, A. T. Brunger, E. J. Dodson, R. W. Grosse-Kunstleve, T. R. Schneider, G. M. Sheldrick, T. C. Terwilliger, M. G. W. Turkenburg, and I. Uson, Methods Enzymol. 374, [3], 2003 (this volume).

SOLVE: Automated Structure Solution
The SOLVE software1 is capable of full automation of structure solution for MIR (multiple isomorphous replacement) and MAD or SAD (multiwavelength or single-wavelength anomalous diffraction) data. SOLVE can begin with raw measurements of crystallographic intensities, scale the data, find the heavy atom sites, refine parameters, and calculate an electron density map. The automation of this key step shows the feasibility of automation of all steps in structure determination. The SOLVE software also demonstrates the usefulness of a decision-making process based on a scoring algorithm and provides a basis for decision-making in full structure determination.

Key Technical Developments for Automated Structure Solution
SOLVE contains a number of advances in the central steps in structure solution as well as the seamless linkage and decision-making necessary for automation. SOLVE includes developments in the treatment of MAD data, estimation of heavy atom structure factor amplitudes (FA values), heavy atom refinement, and phasing, which in turn have made automated structure solution practical.
A MAD data set containing any number of wavelengths of data can be viewed, to a good approximation, as if it were a single isomorphous replacement data set with anomalous scattering (SIRAS6). This meant that our fast and unbiased method for refinement of heavy atom parameters using an origin-removed difference Patterson function7 could be used for heavy atom refinement. A method for estimation of amplitudes of heavy atom structure factors (FA values) in an optimal way from MAD data was also developed.8 This allowed the calculation of optimal heavy atom Patterson functions that could be searched with our automated heavy atom search procedure9 to find initial partial solutions for the locations of anomalously scattering atoms in the crystals. Once the parameters describing the locations, occupancies, and thermal factors for the anomalously scattering atoms were refined, we used our Bayesian correlated MAD phasing approach to calculate optimal estimates of the crystallographic phases.10

For multiple isomorphous replacement (MIR) structure solution, SOLVE incorporates methods for phasing that include a detailed analysis of error estimates.11 It also includes methods for calculating phases in cases in which the derivatives have substantial lack of isomorphism, but in which this nonisomorphism is correlated among several derivatives.12 The use of this Bayesian correlated phasing allows the SOLVE software to handle MIR data sets that contain serious nonisomorphism in exactly the same way as it deals with data sets that are highly isomorphous.

Decision-Making in Automated Structure Solution
One of the most time-consuming steps in macromolecular structure solution is the testing of many possible arrangements of heavy atoms in the crystal for compatibility with the X-ray data.
In the multiwavelength (MAD) technique, for example, the differences in X-ray diffraction at different X-ray wavelengths can be used to generate sets of potential arrangements of heavy atoms in the crystal, but they generally cannot be used directly to identify which one of these arrangements is correct. Consequently, the crystallographer is faced with the tedious prospect of using manual or semiautomated methods to generate potential heavy atom arrangements, using each heavy atom arrangement to calculate an electron
6 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 50, 17 (1994).
7 T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 39, 813 (1983).
8 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 50, 11 (1994).
9 T. C. Terwilliger, S.-H. Kim, and D. Eisenberg, Acta Crystallogr. A 43, 1 (1987).
10 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 53, 571 (1997).
11 T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 43, 6 (1987).
12 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 52, 749 (1996).
density map, and looking at each map on a graphics screen to subjectively evaluate whether it looks like a picture of a protein. In parallel with this, the crystallographer evaluates in a subjective way whether each heavy atom arrangement appears to be compatible with the differences in diffraction at different X-ray wavelengths used to generate it (using the Patterson function), and whether a subset of the heavy atom sites in each arrangement can be used to derive the other sites.13

A key feature of the SOLVE software is the introduction of a uniform scoring scheme for evaluating heavy atom arrangements. For each criterion that crystallographers have used to compare potential solutions, a Z-score is calculated that reflects how each trial arrangement compares with the mean and standard deviation of the set of trial solutions as a whole. The composite Z-score for each trial arrangement is then used as the criterion for choosing the arrangement leading to a good electron density map. At every stage at which a decision would ordinarily be made by a crystallographer, such as whether to include a particular heavy atom site in a heavy atom arrangement or not, the choice is made by picking the arrangement that leads to the higher score. In this way, a complicated decision-making process is converted into a well-defined optimization process.

Scoring Criteria for Heavy Atom Partial Structures
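The uniform Z-score scheme can be made concrete with a short Python sketch. It rests on assumptions that go beyond the text: the composite score is taken here as the unweighted sum of the per-criterion Z-scores, every criterion is treated as "larger is better," and the names `z_scores`, `composite_scores`, and the criterion labels are hypothetical. The actual weighting and reference set used by SOLVE are described in the cited references.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Z-score of each value relative to the mean and spread of the whole set."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All trials are identical for this criterion; it cannot discriminate.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(trials):
    """Sum the per-criterion Z-scores for each trial arrangement.

    `trials` maps a trial name to a dict of criterion -> raw value, where a
    larger raw value is assumed to be better for every criterion.
    """
    names = list(trials)
    criteria = list(next(iter(trials.values())))
    totals = {name: 0.0 for name in names}
    for criterion in criteria:
        zs = z_scores([trials[name][criterion] for name in names])
        for name, z in zip(names, zs):
            totals[name] += z
    return totals


# Toy values for three trial arrangements (illustrative numbers only).
trials = {
    "A": {"patterson": 1.0, "figure_of_merit": 0.3},
    "B": {"patterson": 3.0, "figure_of_merit": 0.5},
    "C": {"patterson": 2.0, "figure_of_merit": 0.4},
}
totals = composite_scores(trials)
best = max(totals, key=totals.get)
```

Once every criterion is on the same Z-score scale, choosing between arrangements reduces to comparing single numbers, which is what turns the decision into an optimization.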
We developed four criteria for evaluating the quality of a heavy atom partial structure. These are as follows:
• Evaluation of the match between a heavy-atom model and the Patterson function
• Cross-validation difference Fourier maps
• Figure of merit (internal consistency) of the phasing
• Analysis of the native Fourier maps

The match between a heavy atom model and the Patterson function has always been an important criterion in the MIR and MAD methods.14 Our scoring in this case essentially consists of the average value of the Patterson function at the predicted locations of peaks (based on the model), weighted by a factor based on the number of heavy atom sites in the trial solution.

Our cross-validation difference Fourier method is based on the idea of Dickerson et al.13 for MIR cross-validation, in which one derivative is omitted from phasing and the others are used to calculate phases and a heavy
13 R. E. Dickerson, J. C. Kendrew, and B. E. Strandberg, Acta Crystallogr. 14, 1188 (1961).
14 T. L. Blundell and L. N. Johnson, "Protein Crystallography," p. 368. Academic Press, New York, 1976.
atom difference Fourier for it. We extended this to MAD data by omitting one site at a time and using all others to calculate phases and a heavy atom Fourier. Those sites that have a high peak height in the resulting map are likely to be correct.

The third criterion for scoring is the figure of merit. Although the figure of merit is sensitive to errors in heavy atom occupancies, the origin-removed Patterson refinement procedure we use11 yields essentially unbiased estimates of these parameters, so the criterion is useful.1

The final criterion for judging the quality of a heavy atom arrangement is the quality of the resulting electron density map. There are several features of protein crystals that could be used for such a measure. One of these would be the connectivity of the positive density in the map.15 Another feature of protein crystals is the presence of distinct regions of protein and solvent. The electron density maps in protein regions are rough, whereas in the solvent regions they are flat. Protein molecules are relatively compact and when they pack together in a crystal the regions between them are filled with solvent. As the solvent molecules are not fixed, and as an X-ray diffraction experiment averages the scattering of X-rays both over time and over the many repetitions of the protein in the crystal, the electron density in the solvent region is generally flat. In contrast, the electron density in the protein region is rough, as it is high at the locations of protein atoms and low between them. We have found that the variation in local roughness is a powerful indicator of the quality of an electron density map.16,17 The SOLVE software uses the variation in local roughness as a measure of the quality of the electron density map.
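A toy Python sketch may help to show why the variation in local roughness separates a protein-like map from a featureless one. It is a deliberate simplification: the "map" is a one-dimensional list of density samples, roughness is taken as the standard deviation within consecutive boxes, and the function names and box size are hypothetical. SOLVE itself works on three-dimensional maps, and its exact procedure is given in the cited references.

```python
from statistics import pstdev


def local_roughness(density, box):
    """Standard deviation of the density within consecutive boxes of `box` samples."""
    return [
        pstdev(density[i:i + box])
        for i in range(0, len(density) - box + 1, box)
    ]


def roughness_variation(density, box):
    """Spread of the local roughness values over the whole map."""
    return pstdev(local_roughness(density, box))


# A map with a rough "protein" half and a flat "solvent" half ...
protein_like = [2.0, -1.0] * 4 + [0.0] * 8
# ... versus a map that is equally rough everywhere (no solvent boundary).
uniformly_rough = [1.0, -1.0] * 8

good = roughness_variation(protein_like, box=4)
poor = roughness_variation(uniformly_rough, box=4)
```

In this toy case the map with distinct rough and flat regions gives a larger variation than the uniformly rough one, which is the qualitative behavior the criterion relies on.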
Using this measure, it can reliably differentiate between a random map and a noisy map of a protein molecule at just about the same map quality (a signal-to-noise ratio of about 1) that a crystallographer can.1,16

SOLVE: Working Automated System for Structure Solution
We have put the algorithms and decision-making process described above into a single system (SOLVE) that is capable of fully automated structure solution.1 The information that is required from a user consists of (1) the locations of data files, (2) space group and symmetry information, (3) the identity of the anomalously scattering atom, scattering factor estimates, and the number of sites (for MAD data), and (4) the number of
15 D. Baker, A. E. Krukowski, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 186 (1993).
16 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 501 (1999).
17 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 1872 (1999).
Fig. 1. SOLVE electron density map for IF-5A. Reprinted from T. C. Terwilliger, ‘‘Maximum-likelihood density modification for X-ray crystallography’’, in Image reconstruction from incomplete data, SPIE, vol. 4123, pp. 243–247.
amino acid residues in the asymmetric unit (for proteins). The output from this system consists of information about heavy atom locations, estimates of crystallographic phases, and an electron density map. The software has been used to solve structures as large as the ribosome 30S subunit18 and with as many as 56 selenium sites (W. Smith and C. Jansen, unpublished results).

Figure 1 shows an example of an electron density map produced automatically by the SOLVE software.19 The X-ray data consisted of three wavelengths of selenomethionine MAD data collected to a resolution of 2.1 Å. The protein (initiation factor 5A; IF-5A) contains 149 amino acids and there is one molecule in the asymmetric unit of space group P4₁22. SOLVE found three of the four selenium atoms (the selenomethionine at position 7 was relatively disordered). The electron density map produced by SOLVE was highly interpretable. Figure 1 shows this map overlaid with the final refined atomic model.
18 W. M. Clemons, J. L. C. May, B. T. Wimberly, J. P. McCutcheon, M. S. Capel, and V. Ramakrishnan, Nature 400, 833 (1999).
19 T. S. Peat, J. Newman, G. S. Waldo, J. Berendzen, and T. C. Terwilliger, Structure 6, 1207 (1998).
RESOLVE: Statistical Density Modification
Density modification is a method for improving the quality of electron density maps by incorporating real-space information such as the flatness of the solvent region.20 If density modification techniques could be made even more powerful than they already are, then structures could be solved with fewer methionines for selenomethionine MAD, with weakly diffracting crystals, and with less diffraction data. In addition, because of the high volume of data collection in structural genomics efforts and the limited supply of the necessary synchrotron beam time, it is advantageous to collect a minimal amount of X-ray data sufficient to obtain phase information. The MAD method requires several complete data sets to be collected at different X-ray wavelengths, whereas the related SAD (single-wavelength anomalous diffraction) method21 requires only one complete data set with anomalous measurement. When the SAD method is used alone, it leads to incomplete phasing information, but with density modification it can lead to highly interpretable electron density maps.22,23

Basis for Density Modification
Many density modification methods have been developed. Exceptionally powerful for phase improvement are solvent flattening and noncrystallographic symmetry averaging.20,24–27 Additional density modification methods include histogram matching and phase extension,28 entropy maximization,29 iterative skeletonization,30,31 and iterative model building and refinement.2,3 The fundamental basis of density modification is that there are many possible sets of structure factor amplitudes and phases that are all reasonable based on the limited experimental data. Those structure factors that lead to maps that are most consistent with both the
20 B. C. Wang, Methods Enzymol. 115, 90 (1985).
21 Z. J. Liu, E. S. Vysotski, C. J. Chen, J. P. Rose, J. Lee, and B. C. Wang, Protein Science 9, 2085 (2000).
22 Z. Dauter and N. Dauter, J. Mol. Biol. 289, 93 (1999).
23 M. A. Turner, C. S. Yuan, R. T. Borchardt, M. S. Hershfield, G. D. Smith, and P. L. Howell, Nat. Struct. Biol. 5, 369 (1998).
24 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D Biol. Crystallogr. 52, 30 (1996).
25 G. Bricogne, Acta Crystallogr. A 30, 395 (1974).
26 K. D. Cowtan and P. Main, Acta Crystallogr. D Biol. Crystallogr. 49, 148 (1993).
27 F. M. D. Vellieux and R. J. Read, Methods Enzymol. 277, 18 (1997).
28 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
29 S. B. Xiang, C. W. Carter, G. Bricogne, and C. J. Gilmore, Acta Crystallogr. A 49, 193 (1993).
30 D. Baker, C. Bystroff, R. J. Fleterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 429 (1993).
31 C. Wilson and D. A. Agard, Acta Crystallogr. A 49, 97 (1993).
experimental data and prior knowledge about what the electron density map should look like are the most likely overall.

Until recently, the statistical foundation of density modification was poorly developed.32–34 The procedure used to carry out density modification has involved iterations of calculating an electron density map, using the best estimates of phases; modifying the map by flattening the solvent or otherwise making it conform to expectations; calculating "model" phases; and estimating new phases based on a weighted average of model and experimental phases.20 The problem with this approach is that the model phases are partly based on the experimental ones, so that it is not clear how to weight the model and experimental phases in an optimal fashion. Several methods have been devised to address this problem, including "solvent flipping"35 and cross-validation,32,36 but neither of these approaches fully addresses the fundamental problem of the correlation between model and experimental phases.34

We have developed a method of density modification based on a statistical formulation that preserves the independence of model and experimental phases (see Terwilliger4,5 for details). This statistical density modification technique (previously known as maximum-likelihood density modification) is able to make much better use of knowledge about characteristics of electron density maps than the earlier methods because of its improved and completely different statistical treatment.34 In particular, statistical density modification is able to properly treat the relationship between experimental phases and phase information obtained by solvent flattening. In addition, statistical density modification is capable of taking advantage of information specifying which parts of the modified electron density map are known and which are not, whereas previous formulations could not.
The statistical density modification method we have developed can be applied to a wide range of situations in which information from experimental measurements is to be combined with information derived from expectations about plausible electron density arrangements in a map. These range from solvent flattening and noncrystallographic symmetry averaging, to phasing using a partial model and molecular replacement.5 Furthermore, as the method has a sound statistical basis, it can be combined with probabilistic methods for molecular fragment detection in an electron density map37 to yield phase improvement and to serve as the basis for model building.
32 K. D. Cowtan and P. Main, Acta Crystallogr. D Biol. Crystallogr. 52, 43 (1996).
33 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 55, 1555 (1999).
34 K. Cowtan, in "CCP4 Newsletter": http://www.dl.ac.uk/CCP/CCP4/newsletter38/07_gaussian.html
35 J. P. Abrahams, Acta Crystallogr. D Biol. Crystallogr. 53, 371 (1997).
36 A. L. U. Roberts and A. T. Brunger, Acta Crystallogr. D Biol. Crystallogr. 51, 990 (1995).
Statistical Density Modification
The general idea of statistical density modification is simple. It combines the experimental information that is available about the probability of a particular value of the phase for each reflection with an examination of the electron density map that results from such a set of phases. Statistical density modification is a way of finding the set of phases that is compatible with the experimental information and that produces the most plausible electron density map.

There are a number of cases in which we may have a good idea about what the features in an electron density map should look like. If we have even a poor set of crystallographic phases, then the electron density maps of most macromolecules will show regions that are relatively flat (the solvent region) and others that have a large amount of variation (where the macromolecule is located). Once the solvent region is identified, we can be confident that the true electron density in that region is nearly constant. This means we have a great deal of information about the pattern of electron density in that region. Another case occurs if we have been able to identify a feature such as an α-helix in the map. As we know what helices look like, we may have a better idea of the electron density in that region than the map alone provides.

The power of density modification (and statistical density modification in particular) is that by choosing a set of phases that are compatible with the experimental information and that lead to a plausible map, all the other features of the map become clearer as well, even those about which we have no information. This comes about because the crystallographic phases are improved by the density modification.20,25 The key requirement for statistical density modification is that we have an estimate, for each point in our electron density map, of what values of electron density are plausible.
This can be in the form of a probability distribution for each point in the map, or an estimate of electron density and an uncertainty, for example.

Mathematics of Statistical Density Modification
The goal of statistical density modification is to come up with the set of crystallographic phases that is the most likely given all available information. To carry out statistical density modification,4,5 we express the logarithm of the probability of a set of structure factors as the sum of two basic quantities: (1) the log probability that we would have measured the
37 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 54, 750 (1998).
observed set of structure factors if this structure factor set were correct, and (2) the log probability that the map resulting from this structure factor set is consistent with our prior knowledge about this and other macromolecular structures. In this formulation, density modification consists of maximizing the total probability. To maximize this probability it is necessary both to define a "map probability function" and to have a practical way of finding structure factors that maximize it.

We developed a formulation of the map probability function that allows a straightforward and rapid optimization of the total probability.4,5 The log probability for an electron density map is written as the integral over the map of a local log probability of electron density. In essence, we look at the map, point by point, and evaluate how likely it is that the true electron density at this point has the value given in the map. For example, if we look in the solvent region, it is unlikely that the true electron density would have a high or low value. We have shown that as long as the first and second derivatives of the local log probability of electron density with respect to electron density can be calculated, a steepest ascent method can be used to optimize the total probability when expressed in this way.4,5 In this broad class of situations, a fast Fourier transform (FFT)-based method can be used to approximate derivatives of the total map log probability function with respect to each structure factor. These derivatives in turn can then be used in a Taylor series expansion to approximate the total map log probability function as a function of each structure factor. This makes it practical to optimize the total probability because the other terms (a priori knowledge of phases and experimental phase information) are also normally expressed separately for each structure factor.
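The formulation just described can be written schematically as follows. The notation is an assumption introduced here for illustration and is not taken from the text: $\{F_{\mathbf h}\}$ is the set of structure factors, $p_{\mathrm{OBS}}$ and $p_{\mathrm{MAP}}$ denote the experimental and map probabilities, $\rho(\mathbf x)$ is the electron density at point $\mathbf x$ in the unit cell of volume $V$, $c$ is an unspecified normalization constant, and $F^{0}_{\mathbf h}$ is the current estimate of one structure factor. The expansion is shown for a single scalar variable, whereas structure factors are complex; the full treatment is given in the cited references.

```latex
\begin{aligned}
\ln p\bigl(\{F_{\mathbf h}\}\bigr)
  &\simeq \ln p_{\mathrm{OBS}}\bigl(\{F_{\mathbf h}\}\bigr)
        + \ln p_{\mathrm{MAP}}\bigl(\{F_{\mathbf h}\}\bigr), \\
\ln p_{\mathrm{MAP}}\bigl(\{F_{\mathbf h}\}\bigr)
  &\simeq c \int_{V} \ln p\bigl(\rho(\mathbf x)\bigr)\, d^{3}x, \\
\ln p_{\mathrm{MAP}}\bigl(F_{\mathbf h}\bigr)
  &\simeq \ln p_{\mathrm{MAP}}\bigl(F^{0}_{\mathbf h}\bigr)
        + \Delta F_{\mathbf h}\,
          \frac{\partial \ln p_{\mathrm{MAP}}}{\partial F_{\mathbf h}}
        + \tfrac{1}{2}\,\Delta F_{\mathbf h}^{2}\,
          \frac{\partial^{2} \ln p_{\mathrm{MAP}}}{\partial F_{\mathbf h}^{2}},
  \qquad \Delta F_{\mathbf h} = F_{\mathbf h} - F^{0}_{\mathbf h}.
\end{aligned}
```

The third line is what makes the optimization practical: once the map term is approximated separately for each structure factor, it can be added to the experimental term for that same structure factor and the two maximized together.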
Outline of Cycle of Solvent Flattening-Based Phase Improvement with RESOLVE
The process used to improve crystallographic phases with this maximum-probability algorithm is straightforward in concept. At the start of the process, the crystallographic phases are known only approximately (an experimentally derived probability distribution for their possible values is known). Each cycle of solvent flattening using statistical density modification (RESOLVE) involves six basic steps (see Terwilliger5 for more details). They are as follows:
• Calculation of an electron density map based on current best estimates of crystallographic phases
• Estimation of the probability that each point in the map lies in the protein or solvent region
• Calculation of expected probability distributions for values of electron density in the protein and solvent regions (based on statistics of maps generated from model structures)
• Calculation of the local map log probability function (the logarithm of the probability that each value of electron density in the map is consistent with the designated protein and solvent regions)
• Calculation of how the probability of the map would change if an individual phase were changed
• Adjustment of each phase to maximize its contribution to the total probability (the probability of the map plus the probability based on the experimental measurements of the phase)

The local map log probability function is a critical element in our statistical density modification approach. This probability function can include any type of expectations about the electron density value at a particular point in the map. In particular, we have shown that expectations about electron density values at points both in the solvent region and in the protein region of a protein crystal can be included in statistical density modification and that this approach can be powerful for improving crystallographic phases.4,5 We have also shown that the same approach can be used to incorporate detailed information about patterns of electron density in a map, such as those corresponding to secondary structural elements in a protein structure (see below and Terwilliger5).

Example of Statistical Density Modification Using Solvent Flattening
Figure 2 illustrates the power of statistical density modification based on solvent flattening.5 Figure 2A shows a section through a model (perfect) electron density map based on the refined model of IF-5A (from Fig. 1). We then created a poor "experimental" electron density map by using just one of the three selenium atoms used in the IF-5A selenomethionine MAD structure solution example shown in Fig. 1 to calculate phases. This electron density map is shown in Fig. 2B.
The correlation coefficient of this initial map to one calculated using the final model is only 0.37. Crystals of IF-5A contain 60% solvent, and the solvent region can be identified from this initial electron density map (notice that the contours of electron density are less pronounced on the right side of Fig. 2B, where the solvent is located). We tested both our statistical density modification technique and existing methods (dm, Cowtan and Main32; SOLOMON, Abrahams35) for the improvement in map quality obtained with solvent flattening. Figure 2C shows the RESOLVE-modified map. It has a correlation coefficient to the model map of 0.79, and the strand running vertically next to the
Fig. 2. Section through electron density maps. (A) Model map; (B) map created by SOLVE (using just one selenium in phasing); (C) map produced from (B) by RESOLVE (maximum-likelihood density modification); and (D) map produced from (B) by dm (conventional density modification). See text for details. Reprinted from T. C. Terwilliger, "Maximum-likelihood density modification for X-ray crystallography", in Image reconstruction from incomplete data, SPIE, vol. 4123, pp. 243–247.
solvent region is clearly visible. Figure 2D shows the dm-modified map. It has a correlation coefficient to the model map of 0.65, and the density is much less clear. A similar result (correlation coefficient of 0.63) was obtained with SOLOMON.

Example of Pattern Matching with Statistical Density Modification
Figure 3 illustrates how statistical density modification can be combined with a search for fragments of secondary structure in an electron density map. We tested a pattern-matching approach to density modification, using the armadillo repeat region of β-catenin, which is largely helical and contains 50% solvent.38 This structure was solved at a resolution of 2.7 Å, using MAD phasing on 15 selenium atoms incorporated into methionine residues in the protein. To make the test suitably difficult, we used only 3 of the 15 selenium atoms in calculating initial phases. As expected, this led to a noisy map; the correlation coefficient of this map with a map calculated on the basis of phases from the refined model was only 0.33 (Fig. 3A). The statistical density modification approach (without any pattern recognition) resulted in a great improvement in the map, with a correlation coefficient of 0.62 (Fig. 3C).

Next we identified the location of helices in the map, using an FFT-based method37 in which a template of a helix (in all orientations) was placed at all locations in the unit cell and the correlation of the template with the local electron density was calculated. Those locations where there was a high correlation were considered to be locations of helices (Fig. 3B). The expected electron density in the region was then estimated from the template, and these expectations about the map, combined with the expectation of a flat solvent region, were used in statistical density modification. The statistical density modification with pattern recognition of helices improved the map even more substantially, with an overall correlation coefficient of 0.67 (Fig. 3D).
This density-modified map is of sufficiently high quality that a model could be built into most of it, yet it is derived using phases based on just 3 selenium atoms in 700 amino acid residues and an initial map that is completely uninterpretable. This example shows that advances in density modification techniques will allow even further extensions of the range of targets that are accessible to X-ray crystallographic structure analysis. This example of pattern recognition combined with density modification also suggests the possibility of iterative pattern recognition and map improvement as a way of building up an atomic model of a macromolecule.
38 A. H. Huber, W. J. Nelson, and W. I. Weis, Cell 90, 871 (1997).
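The map correlation coefficients quoted in these examples compare a test map with a map calculated from the refined model. The text does not state exactly how they were computed, so the following Python sketch assumes the ordinary Pearson correlation over density values sampled on a common grid; the function name `map_correlation` and the flattened-list representation of a map are illustrative choices only.

```python
from math import sqrt


def map_correlation(map_a, map_b):
    """Pearson correlation coefficient between two maps sampled on the same grid.

    Each map is given as a flat sequence of density values, one per grid point.
    """
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("maps must be non-empty and the same size")
    n = len(map_a)
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant map has no defined correlation")
    return cov / sqrt(var_a * var_b)


# Toy maps: a "model" map, a scaled copy of it, and an inverted copy.
model_map = [0.0, 1.0, 3.0, 1.0, 0.0, -1.0]
scaled_map = [2.0 * v + 0.5 for v in model_map]
inverted_map = [-v for v in model_map]

cc_scaled = map_correlation(model_map, scaled_map)
cc_inverted = map_correlation(model_map, inverted_map)
```

Because the coefficient is insensitive to overall scale and offset, it measures agreement in the shape of the density rather than in its absolute level, which is why it is convenient for comparing maps before and after density modification.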
Fig. 3. Pattern matching and statistical density modification. (A) "Experimental" electron density map obtained with 3 selenium atoms in 700 amino acids used in phasing. The model is the refined model of β-catenin (Huber et al.38). (B) Electron density based on helical segments recognized by pattern matching. Note that some of the helices were recognized in this poor map, but not all. (C) Electron density after statistical density modification (without pattern matching) of the map shown in (A). (D) Electron density after statistical density modification with pattern matching. CC, Correlation coefficient.
The identification of helices in Fig. 3A and their use in making the map shown in Fig. 3D correspond to building part of the atomic model (the main-chain atoms for some of the helical segments) for β-catenin. We expect that this process could be extended and repeated to build up a large part of the structure.

Conclusions and Summary
SOLVE and RESOLVE have shown that it is possible to automate a significant part of the macromolecular X-ray structure determination process. The key elements of seamless and compatible subprograms, scoring algorithms, and error-tolerant software systems have been important in implementing these programs. The principles used in SOLVE and RESOLVE can be applied to other aspects of structure determination as well, suggesting that full automation of the entire structure determination process from scaling diffraction data to a refined model will be possible in the near future.
[3] Automatic Solution of Heavy-Atom Substructures
By Charles M. Weeks, Paul D. Adams, Joel Berendzen, Axel T. Brunger, Eleanor J. Dodson, Ralf W. Grosse-Kunstleve, Thomas R. Schneider, George M. Sheldrick, Thomas C. Terwilliger, Maria G. W. Turkenburg, and Isabel Usón

Introduction
With the exception of small proteins that can be solved by ab initio direct methods1 or proteins for which an effective molecular replacement model exists, protein structure determination is a two-step process. If two or more measurements are available for each reflection with differences arising only from some property of a small substructure, then the positions of the substructure atoms can be found first and used as a bootstrap to initiate the phasing of the complete structure. Historically, substructures were first created by isomorphous replacement in which heavy atoms (usually metals) are soaked into crystals without displacing the protein structure, and measurements were made from both the unsubstituted (native) and substituted (derivative) crystals. When possible, measurements were made also of the anomalous diffraction generated by the metals at appropriate wavelengths. Now, it is common to incorporate anomalous scatterers such as selenium into proteins before crystallization and to make measurements of the anomalous dispersion at multiple wavelengths.

The computational procedures that can be used to solve heavy-atom substructures include both Patterson-based and direct methods. In either case, the positions of the substructure atoms are determined from difference coefficients based on the measurements available from the diffraction experiments as summarized in Table I. The isomorphous difference magnitude, $|\Delta F|_{\mathrm{iso}}$ ($= \bigl|\,|F_{PH}| - |F_{P}|\,\bigr|$), approximates the structure amplitude, $|F_{H}\cos(\;)|$, and the anomalous-dispersion difference magnitude, $|\Delta F|_{\mathrm{ano}}$
1 G. M. Sheldrick, H. A. Hauptman, C. M. Weeks, R. Miller, and I. Usón, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 333. Kluwer Academic, Dordrecht, The Netherlands, 2001.
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
[3]
automatic solution of heavy-atom substructures
37
RESOLVE can be applied to other aspects of structure determination as well, suggesting that full automation of the entire structure determination process from scaling diffraction data to a refined model will be possible in the near future.
[3] Automatic Solution of Heavy-Atom Substructures By Charles M. Weeks, Paul D. Adams, Joel Berendzen, Axel T. Brunger, Eleanor J. Dodson, Ralf W. Grosse-Kunstleve, Thomas R. Schneider, George M. Sheldrick, Thomas C. Terwilliger, Maria G. W. Turkenburg, and Isabel Uso´n Introduction
With the exception of small proteins that can be solved by ab initio direct methods¹ or proteins for which an effective molecular replacement model exists, protein structure determination is a two-step process. If two or more measurements are available for each reflection with differences arising only from some property of a small substructure, then the positions of the substructure atoms can be found first and used as a bootstrap to initiate the phasing of the complete structure. Historically, substructures were first created by isomorphous replacement, in which heavy atoms (usually metals) are soaked into crystals without displacing the protein structure, and measurements were made from both the unsubstituted (native) and substituted (derivative) crystals. When possible, measurements were also made of the anomalous diffraction generated by the metals at appropriate wavelengths. Now, it is common to incorporate anomalous scatterers such as selenium into proteins before crystallization and to make measurements of the anomalous dispersion at multiple wavelengths.

The computational procedures that can be used to solve heavy-atom substructures include both Patterson-based and direct methods. In either case, the positions of the substructure atoms are determined from difference coefficients based on the measurements available from the diffraction experiments, as summarized in Table I. The isomorphous difference magnitude, $|\Delta F|_{iso}$ $(= ||F_{PH}| - |F_P||)$, approximates the structure amplitude, $|F_H \cos(\alpha)|$, and the anomalous-dispersion difference magnitude, $|\Delta F|_{ano}$ $(= ||F^{+}| - |F^{-}||)$, approximates $2|F''_H \sin(\alpha)|$. (The angle $\alpha$ is the difference between the phase of the whole protein and that of the substructure.) When SIRAS or MAD data are available, the differences can be combined to give an estimate of the complete $F_A$ structure factor.²,³

TABLE I
Measurements Used for Substructure Determination(a)

Acronym      Type of experiment                                            Measurements
SIR          Single isomorphous replacement                                $F_P$, $F_{PH}$
SIRAS        Single isomorphous replacement with anomalous scattering      $F_P$, $F_{PH}^{+}$, $F_{PH}^{-}$
MIR          Multiple isomorphous replacement                              $F_P$, $F_{PH1}$, $F_{PH2}$, ...
MIRAS        Multiple isomorphous replacement with anomalous scattering    $F_P$, $F_{PH1}^{+}$, $F_{PH1}^{-}$, $F_{PH2}^{+}$, $F_{PH2}^{-}$, ...
SAD or SAS   Single anomalous dispersion or single anomalous scattering    $F_{PH}^{+}$, $F_{PH}^{-}$ at one wavelength
MAD          Multiple anomalous dispersion                                 $F_{PH}^{+}$, $F_{PH}^{-}$ at several wavelengths

(a) The notation used for the structure factors is $F_P$ (native protein), $F_{PH}$ (derivative), $F_H$ or $F_A$ (substructure), and $F^{+}$ and $F^{-}$ (for $F_{hkl}$ and $F_{\bar{h}\bar{k}\bar{l}}$, respectively, in the presence of anomalous dispersion).

¹ G. M. Sheldrick, H. A. Hauptman, C. M. Weeks, R. Miller, and I. Usón, in “International Tables for Crystallography” (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 333. Kluwer Academic, Dordrecht, The Netherlands, 2001.
² J. Karle, Acta Crystallogr. A 45, 303 (1989).
³ W. Hendrickson, Science 254, 51 (1991).

Both Patterson and direct methods require extremely accurate data for the successful determination of substructures. Care should be taken to eliminate outliers and observations with small signal-to-noise ratios, especially in the case of single anomalous differences. Fortunately, it is usually possible to be stringent in the application of appropriate cutoffs because the problem is overdetermined in the sense that the number of available observations is much larger than the number of heavy-atom positional parameters. In particular, it is important that the largest isomorphous and anomalous differences be reliable. The coefficients that are used are small differences between two or more much larger measurements, so errors in the measurements can easily disguise the true signal. If there are even a few outliers in a data set, or some of the large coefficients are serious overestimates, substructure determination is likely to fail.

Patterson and direct-methods procedures have been implemented in a number of computer programs that permit even large substructures to be determined with little, if any, user intervention. (The current record is 160 selenium sites.) The methodology, capabilities, and use of several such
popular programs and program packages are described in this chapter. The SOLVE⁴ program, which uses direct-space Patterson search methods to locate the heavy-atom sites, provides a fully automated pathway for phasing protein structures, using the information obtained from MIR or MAD experiments. The two major software packages currently in use in macromolecular crystallography [i.e., the Crystallography and NMR System (CNS⁵) and the Collaborative Computational Project Number 4 (CCP4⁶)] provide internally consistent formats that make it easy to proceed from heavy-atom sites to density map, but user intervention is required. CNS employs both direct-space and reciprocal-space Patterson searches. The CCP4 suite includes programs for computing Pattersons as well as the direct-method programs RANTAN⁷ and ACORN.⁸ The dual-space direct-method programs SnB⁹,¹⁰ and SHELXD¹¹,¹¹ᵃ provide only the heavy-atom sites, but they are efficient and capable of solving large substructures currently beyond the capabilities of programs that use only Patterson-based methods. SnB uses a random number generator to assign initial positions to the starting atoms in its trial structures, but SHELXD strives to obtain better-than-random initial coordinates by deriving information from the Patterson superposition minimum function. In some cases, this has significantly decreased the computing time needed to find a heavy-atom solution. Other direct-method programs (e.g., SIR2000¹²), not described in this chapter, also can be used to solve substructures.

⁴ T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 849 (1999).
⁵ A. T. Brunger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gross, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D. Biol. Crystallogr. 54, 905 (1998).
⁶ Collaborative Computational Project Number 4, Acta Crystallogr. D. Biol. Crystallogr. 50, 760 (1994).
⁷ J.-X. Yao, Acta Crystallogr. A 39, 35 (1983).
⁸ J. Foadi, M. M. Woolfson, E. J. Dodson, K. S. Wilson, J.-X. Yao, and C.-D. Zheng, Acta Crystallogr. D. Biol. Crystallogr. 56, 1137 (2000).
⁹ R. Miller, S. M. Gallo, H. G. Khalak, and C. M. Weeks, J. Appl. Crystallogr. 27, 613 (1994).
¹⁰ C. M. Weeks and R. Miller, Acta Crystallogr. D. Biol. Crystallogr. 55, 492 (1999).
¹¹ G. M. Sheldrick, in “Direct Methods for Solving Macromolecular Structures” (S. Fortier, ed.), p. 401. Kluwer Academic, Dordrecht, The Netherlands, 1998.
¹¹ᵃ T. R. Schneider and G. M. Sheldrick, Acta Crystallogr. D. Biol. Crystallogr. 58, 1772 (2002).
¹² M. C. Burla, M. Camalli, B. Carrozzini, G. L. Cascarano, C. Giacovazzo, G. Polidori, and R. Spagna, Acta Crystallogr. A 56, 451 (2000).

Pertinent aspects of data preparation are described in detail in the following sections devoted to the individual programs. Automated or semiautomated procedures for locating heavy-atom sites operate by generating many trial structures. Thus, a key step in any such procedure is the scoring or ranking of trial structures by some measure of quality in such a way that
any probable solution can be identified. Therefore, the methods used to accomplish this are described for each program, along with methods for validating the correctness of individual sites. Where applicable, methods used to determine the correct hand (enantiomorph) and refine the substructure also are described. Finally, interesting applications to large selenomethionine derivatives, substructures phased by weak anomalous signals, and substructures created by short halide cryosoaks are discussed.

SOLVE
In favorable cases, the determination of heavy-atom substructures using MAD or MIR data is a straightforward, although often lengthy, process. SOLVE⁴ is designed to automate fully the analysis of such data. The overall approach is to link together into one seamless procedure all the steps that a crystallographer would normally do manually and, in the process, to convert each decision-making step into an optimization problem. A somewhat more generalized description of SOLVE, together with a description of RESOLVE, a maximum-likelihood solvent-flattening routine, appears in the chapter by T. Terwilliger (see [2] in this volume¹²ᵃ).

The MAD and MIR approaches to structure solution are conceptually similar and share several important steps. In each method, trial partial structures for the heavy or anomalously scattering atoms often are obtained by inspection of difference-Patterson functions or by semiautomated analysis.¹³⁻¹⁵ These initial structures are refined against the observed data and used to generate initial phases. Then, additional sites and sites in other derivatives can be found from weighted difference or gradient maps using these phases.

The analysis of the quality of potential heavy-atom solutions is also similar for the two methods. In both cases, a partial structure is used to calculate native phases for the entire structure, and the electron density that results is then examined to see whether the expected features of the macromolecule can be found. In addition, the figure of merit of phasing and the agreement of the heavy-atom model with the difference Patterson function are commonly used to evaluate the quality of a solution. In many cases, an analysis of heavy-atom sites by sequential deletion of individual sites or derivatives is also an important criterion of quality.¹⁶

¹²ᵃ T. C. Terwilliger, Methods Enzymol. 374, [2], 2003 (this volume).
¹³ T. C. Terwilliger, S.-H. Kim, and D. Eisenberg, Acta Crystallogr. A 43, 1 (1987).
¹⁴ G. Chang and M. Lewis, Acta Crystallogr. D. Biol. Crystallogr. 50, 667 (1994).
¹⁵ A. Vagin and A. Teplyakov, Acta Crystallogr. D. Biol. Crystallogr. 54, 400 (1998).
¹⁶ R. E. Dickerson, J. C. Kendrew, and B. E. Strandberg, Acta Crystallogr. 14, 1188 (1961).
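The reason difference-Patterson functions reveal trial sites can be illustrated with a toy calculation. The sketch below is not part of SOLVE. It assumes a one-dimensional unit cell containing point atoms at hypothetical fractional coordinates, and it uses squared substructure amplitudes directly where a real analysis would use squared isomorphous or anomalous differences. The function names are illustrative only.

```python
import numpy as np

def structure_factors(frac_coords, n_grid):
    """Structure factors F(h), h = 0..n_grid-1, of unit point atoms in a 1D cell."""
    h = np.arange(n_grid)
    x = np.asarray(frac_coords, dtype=float)
    return np.exp(2j * np.pi * np.outer(h, x)).sum(axis=1)

def patterson(frac_coords, n_grid=64):
    """Patterson function on n_grid points: the Fourier transform of |F|^2."""
    amplitudes_sq = np.abs(structure_factors(frac_coords, n_grid)) ** 2
    return np.fft.ifft(amplitudes_sq).real

sites = [0.125, 0.500]          # hypothetical sites, both exactly on the grid
p = patterson(sites)
strongest = np.argsort(p)[::-1][:3]
peaks = sorted(float(i) / len(p) for i in strongest)
print(peaks)
```

In this example the two non-origin peaks lie at u = 0.375 and u = 0.625, that is, at plus and minus the interatomic vector 0.500 − 0.125, while the origin peak arises from the vector of each atom to itself. No phases are needed, which is why Patterson searches can start from measured differences alone.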
Data Preparation

SOLVE prepares data for heavy-atom substructure solution in two steps. First, the data are scaled using the local scaling procedure of Matthews and Czerwinski.¹⁷ Second, MAD data are converted to a pseudo-SIRAS form that permits more rapid analysis.¹⁸ Systematic errors are minimized by scaling all types of data (e.g., $F^{+}$ and $F^{-}$, native and derivative, and the different wavelengths of MAD data) in similar ways and by keeping different data sets separate until the end of scaling. The scaling procedure is optimized for cases in which the data are collected in a systematic fashion. For both MIR and MAD data, the overall procedure is to construct a reference data set that is as complete as possible and that contains information from either a native data set (for MIR) or all wavelengths (for MAD). This reference data set is constructed for just the asymmetric unit of data and is essentially the average of all measurements obtained for each reflection. The reference data set is then expanded to the entire reciprocal lattice and used as the basis for local scaling of each individual data set (see Terwilliger and Berendzen⁴ for additional details).

For MAD data, Bayesian calculations of phase probabilities are slow.¹⁹,²⁰ Consequently, SOLVE uses an alternative procedure for all MAD phase calculations except those done at the final stage. This alternative is to convert the multiwavelength MAD data set into a form that is similar to that used for SIRAS data. The information in a MAD experiment is largely contained in just three quantities: a structure factor $F_o$ corresponding to the scattering from nonanomalously scattering atoms, a dispersive or isomorphous difference at a standard wavelength $\lambda_o$ ($\Delta_{ISO}^{\lambda_o}$), and an anomalous difference ($\Delta_{ANO}^{\lambda_o}$) at the same standard wavelength.¹⁸ It is easy to see that these three quantities could be treated just like an SIRAS data set with the “native” structure factor $F_P$ replaced by $F_o$, the derivative structure factor $F_{PH}$ replaced by $F_o + \Delta_{ISO}^{\lambda_o}$, and the anomalous difference replaced by $\Delta_{ANO}^{\lambda_o}$. In this way, a single data set with isomorphous and anomalous differences is obtained that can be used in heavy-atom refinement by the origin-removed Patterson refinement method and in phasing by conventional SIRAS phasing.²¹

¹⁷ B. W. Matthews and E. W. Czerwinski, Acta Crystallogr. A 31, 480 (1975).
¹⁸ T. C. Terwilliger, Acta Crystallogr. D. Biol. Crystallogr. 50, 17 (1994).
¹⁹ T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 53, 571 (1997).
²⁰ E. de la Fortelle and G. Bricogne, Methods Enzymol. 277, 472 (1997).
²¹ T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 43, 6 (1987).

The conversion of MAD data to a pseudo-SIRAS form that has almost the same information content requires two important assumptions. The first assumption is that the structure factor
corresponding to anomalously scattering atoms in a structure varies in magnitude, but not in phase, at various X-ray wavelengths. This assumption will hold when there is one dominant type of anomalously scattering atom. The second assumption is that the structure factor corresponding to anomalously scattering atoms is small compared with the structure factor from all other atoms.

The conversion of MAD to pseudo-SIRAS data is implemented in the program segment MADMRG.¹⁸ In most cases, there is more than one pair of X-ray wavelengths corresponding to a particular reflection. The estimates from each pair of wavelengths are all averaged, using weighting factors based on the uncertainties in each estimate. Data from various pairs of X-ray wavelengths and from various Bijvoet pairs can have different weights in their contributions to the total. This can be understood by noting that pairs of wavelengths that differ considerably in dispersive contributions would yield relatively accurate estimates of $\Delta_{ISO}^{\lambda_o}$. In the same way, Bijvoet differences measured at the wavelength with the largest value of $f''$ will contribute by far the most to estimates of $\Delta_{ANO}^{\lambda_o}$. The standard wavelength choice in this analysis is arbitrary because values at any wavelength can be converted to values at any other wavelength. The standard wavelength does not even have to be one of the wavelengths in the experiment, although it is convenient to choose one of them.

Heavy-Atom Searching and Phasing

The process of structure solution can be thought of largely as a decision-making process. In the early stages of solution, a crystallographer must choose which of several potential trial solutions may be worth pursuing. At a later stage, the crystallographer must choose which peaks in a heavy-atom difference Fourier are to be included in the heavy-atom model, and which hand of the solution is correct. At a final stage, the crystallographer must decide whether the solution process is complete and which of the possible heavy-atom models is the best. The most important feature of the SOLVE software is the use of a consistent scoring algorithm as the basis for making all these decisions.

To make automated structure solution practical, it is necessary to evaluate trial heavy-atom solutions (typically 300–1000) rapidly. For each potential solution, the heavy-atom sites must be refined and the phases calculated. In implementing automated structure solution, it was important to recognize the need for a trade-off between the most accurate heavy-atom refinement and phasing at all stages of structure solution and the time required to carry it out. The balance chosen for SOLVE was to use the most accurate available methods for final phase calculations and
to use approximate, but much faster, methods for all intermediate refinements and phase calculations. The refinement method chosen on this basis was origin-removed Patterson refinement,²² which treats each derivative in an MIR data set independently, and which is fast because it does not require phase calculation. The phasing approach used for MIR data throughout SOLVE is Bayesian-correlated phasing,²¹,²³ a method that takes into account the correlation of nonisomorphism among derivatives without slowing down phase calculations substantially.

Once MIR data have been scaled, or MAD data have been scaled and converted to a pseudo-SIRAS form, automated searches of difference Patterson functions are then used to find a large number (typically 30) of potential one-site and two-site solutions. In the case of MIR data, difference-Patterson functions are calculated for each derivative. For MAD data, anomalous and dispersive differences are combined to yield a Bayesian estimate of the Patterson function for the anomalously scattering atoms.²⁴ In principle, Patterson methods could be used to solve the complete heavy-atom substructure, but the approach used in SOLVE is to find just the initial sites in this way and to find all others by difference Fourier analysis. This initial set of one-site and two-site trial solutions becomes a list of “seeds” for further searching.

²² T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 39, 813 (1983).
²³ T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 52, 749 (1996).
²⁴ T. C. Terwilliger, Acta Crystallogr. D. Biol. Crystallogr. 50, 11 (1994).

Once each of the potential seeds is scored and ranked, the top seeds (typically five) are selected as independent starting points in the search for heavy-atom solutions. For each seed, the main cycle in the automated structure-solution algorithm used by SOLVE consists of two basic steps. The first is to refine heavy-atom parameters and to rank all existing solutions generated from this seed so far, on the basis of the four criteria discussed below. The second is to take the highest-ranking partial solution that has not yet been analyzed exhaustively and use it in an attempt to generate a more complete solution. Generation of new solutions is carried out in three ways: by deletion of sites, by addition of sites from difference Fouriers, and by reversal of hand. A partial solution is considered to have been analyzed exhaustively when all single-site deletions have been considered, when no more peaks that result in improvement can be found in a difference Fourier, when inversion does not cause improvement, or when the maximum number of sites specified by the user has been reached. In each case, new solutions generated in these ways are refined, scored, and ranked, and the cycle is continued until all the top trial solutions have been analyzed fully and no new possibilities are found. Throughout this process, a tally of the
solutions that have already been considered is kept, and any duplicates are eliminated.

In some cases, one clear solution appears early in this process. In other cases, there are several solutions that have similar scores at early (and sometimes even late) stages of the analysis. When no one possibility is much better than the others, all the seeds are analyzed exhaustively. On the other hand, if a promising partial solution emerges from one seed, then the search is narrowed to focus on that seed, deletions are not carried out until the end of the analysis, and many peaks from the difference Fourier analysis are added simultaneously so as to build up the solution as quickly as possible. Once the expected number of heavy-atom sites is found, then each site is deleted in turn to see whether the solution can be further improved. If this occurs, then the process is repeated in the same way by addition and deletion of sites and by inversion until no further improvement is obtained.

At the conclusion of the SOLVE algorithm, an electron-density map and phases for the top solution are reported in a form that is compatible with the CCP4⁶ suite. In addition, command files that can be modified to look for additional heavy-atom sites or to construct other electron-density maps are produced. If more than one possible solution is found, the heavy-atom sites and phasing statistics for all of them are reported.

Scoring, Site Validation, Enantiomorph Determination, and Substructure Refinement

Scoring of potential heavy-atom solutions is an essential part of the SOLVE algorithm because it allows ranking of solutions and appropriate decision-making. Scoring, validation, and enantiomorph determination are all part of the same process, and they are carried out continuously during the solution process. For each trial solution, SOLVE first refines the heavy-atom substructure against the origin-removed Patterson function. Then, it scores the trial solutions using four criteria that are described in detail below: agreement with the Patterson function, cross-validation of heavy-atom sites, the figure of merit, and nonrandomness of the electron-density map. The scores for each criterion are normalized to those for a group of starting solutions (most of which are incorrect) to obtain a so-called Z score. The total score for a solution is the sum of its Z scores after correction for anomalously high scores in any category.

SOLVE identifies the enantiomorph, using the score for the nonrandomness criterion. All the other scores are independent of the hand of the heavy-atom substructure, but the final electron-density map will be just noise if anomalous differences are measured and the hand of the heavy atoms is incorrect.
Consequently, this score can be used effectively in later stages of structure solution to identify the correct enantiomorph.

Patterson Agreement. The first criterion used by SOLVE for evaluating a trial heavy-atom solution is the agreement between calculated and observed Patterson functions. Comparisons of this type have always been important in the MIR and MAD methods.²⁵ The score for Patterson function agreement is the average value of the Patterson function at predicted peak locations after multiplication by a weighting factor based on the number of heavy-atom sites in the trial solution. The weighting factor⁴ is adjusted such that, if two solutions have the same mean value at predicted Patterson peaks, the one with the larger number of sites receives the higher score. In some cases, predicted Patterson vectors fall on high peaks that are not related to the heavy-atom solution. To exclude these contributions, the occupancy of each heavy-atom site is refined so that the predicted peak heights approximately match the observed peak heights at the predicted interatomic positions. Then, all peaks with heights more than $1\sigma$ larger than their predicted values are truncated. The average values are corrected further for instances in which more than one predicted Patterson vector falls at the same location by scaling that peak height by the fraction of predicted vectors that are unique.

Cross-Validation of Sites. A cross-validation difference Fourier analysis is the basis of the second scoring criterion. One at a time, each site in a solution (and any equivalent sites in other derivatives for MIR solutions) is omitted from the heavy-atom model, and the phases are recalculated. These phases are used in a difference Fourier analysis, and the peak height at the location of the omitted site is noted. A similar analysis, in which a derivative is omitted from phasing and all other derivatives are used to phase a difference Fourier, has been used for many years.¹⁶ The score for cross-validation difference Fouriers is the average peak height after weighting by the same factor used in the difference Patterson analysis.

²⁵ T. L. Blundell and L. N. Johnson, “Protein Crystallography.” Academic Press, New York, 1976.

Figure of Merit. The mean figure of merit of phasing, m,²⁵ can be a remarkably useful measure of the quality of phasing despite its susceptibility to systematic error.⁴ The overall figure of merit is essentially a measure of the internal consistency of the heavy-atom solution with the data. Because heavy-atom refinement in SOLVE is carried out using origin-removed Patterson refinement,²² occupancies of heavy-atom sites are relatively unbiased. This minimizes the problem of high occupancies leading to inflated figures of merit. In addition, using a single procedure for phasing allows
comparison among solutions. The score based on figure of merit is simply the unweighted mean for all reflections included in phasing.

Nonrandomness of Electron Density. The most important criterion used by a crystallographer in evaluating the quality of a heavy-atom solution is the interpretability of the resulting electron-density map. Although a full implementation of this criterion is difficult, it is quite straightforward to evaluate instead whether the electron-density map has general features that are expected for a crystal of a macromolecule. A number of features of electron-density maps could be used for this purpose, including the connectivity of electron density in the maps,²⁶ the presence of clearly defined regions of protein and solvent,²⁷⁻³³ and histogram matching of electron densities.³¹,³⁴ The identification of solvent and protein regions has been used as the measure of map quality in SOLVE. This requires that there be both solvent and protein regions in the electron-density map. Fortunately, for most macromolecular structures the fraction of the unit cell that is occupied by the macromolecule is in the suitable range of 30–70%.

The criteria used in scoring by SOLVE are based on the solvent and protein regions each being fairly large, contiguous regions.³³ The unit cell is divided into boxes having each dimension approximately twice the resolution of the map, and the root-mean-square (rms) electron density is calculated within each box without including the $F_{000}$ term in the Fourier synthesis. Boxes within the protein region will typically have high values of this rms electron density (because there will be some points where atoms are located and other points that lie between atoms), whereas boxes in the solvent region will have low values because the electron density will be fairly uniform. The score, based on the connectivity of the protein and solvent regions, is simply the correlation coefficient of this rms density for adjacent boxes. If there is a large contiguous protein region and a large contiguous solvent region, then adjacent boxes will have highly correlated values. If the electron density is random, there will be little or no correlation. On the other hand, the correlation may be as high as 0.5 or 0.6 for a good map.

²⁶ D. Baker, A. E. Krukowski, and D. A. Agard, Acta Crystallogr. D. Biol. Crystallogr. 49, 186 (1993).
²⁷ B.-C. Wang, Methods Enzymol. 115, 90 (1985).
²⁸ S. Xiang, C. W. Carter, Jr., G. Bricogne, and C. J. Gilmore, Acta Crystallogr. D. Biol. Crystallogr. 49, 193 (1993).
²⁹ A. D. Podjarny, T. N. Bhat, and M. Zwick, Annu. Rev. Biophys. Biophys. Chem. 16, 351 (1987).
³⁰ J. P. Abrahams, A. G. W. Leslie, R. Lutter, and J. E. Walker, Nature 370, 621 (1994).
³¹ K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
³² T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 501 (1998).
³³ T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 1872 (1999).
³⁴ A. Goldstein and K. Y. J. Zhang, Acta Crystallogr. D. Biol. Crystallogr. 54, 1230 (1998).
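The box-based nonrandomness score described above can be sketched as follows. This is a simplified illustration and not the SOLVE implementation. It assumes the map is a NumPy array on a regular grid whose dimensions are multiples of a hypothetical `box` edge length (in grid points), it treats only face-sharing boxes (with periodic wrap-around) as adjacent, and it subtracts the map mean to stand in for omitting the $F_{000}$ term.

```python
import numpy as np

def box_rms(density, box):
    """Rms density in each cubic box of `box` grid points per edge."""
    rho = density - density.mean()            # drop the F000 (mean) term
    nx, ny, nz = (n // box for n in rho.shape)
    blocks = rho.reshape(nx, box, ny, box, nz, box)
    return np.sqrt((blocks ** 2).mean(axis=(1, 3, 5)))

def neighbour_correlation(density, box):
    """Correlation coefficient of the rms density between adjacent boxes."""
    rms = box_rms(density, box)
    here, there = [], []
    for axis in range(3):
        here.append(rms.ravel())
        there.append(np.roll(rms, -1, axis=axis).ravel())  # periodic neighbour
    return float(np.corrcoef(np.concatenate(here), np.concatenate(there))[0, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=(32, 32, 32))

# Hypothetical test map: strong fluctuations ("protein") in half of the cell
# and nearly flat density ("solvent") in the other half.
structured = noise * 0.05
structured[:16] = noise[:16]

score_structured = neighbour_correlation(structured, box=4)
score_random = neighbour_correlation(noise, box=4)
print(round(score_structured, 2), round(score_random, 2))
```

The map with one contiguous high-variance region and one contiguous flat region gives a high correlation between neighbouring boxes, whereas the purely random map gives a value near zero. This contrast is what lets a score of this kind separate interpretable maps from noise, including the noise obtained with the wrong hand.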
The four-point scoring scheme described above provides the foundation for automated structure solution. To make it practical, the conversion of MAD data to a pseudo-SIRAS form and the use of rapid origin-removed, Patterson-based, heavy-atom refinement have been critical. The remainder of the SOLVE algorithm for automated structure solution is largely a standardized form of local scaling, an integrated set of routines to carry out all the calculations required for heavy-atom searching, refinement, and phasing, as well as routines to keep track of the lists of current solutions being examined and past solutions that have already been tested.

SOLVE is an easy program to use. Only a few input parameters are needed in most cases, and the SOLVE algorithm carries out the entire process automatically. In principle, the procedure also can be thorough: many starting solutions can be examined, and difficult heavy-atom structures can be determined. In addition, for the most difficult cases, the failure to find a solution can be useful in confirming that additional information is needed.

Crystallography and NMR System
The Crystallography and NMR System (CNS)⁵ implements a novel Patterson-based method for the location of heavy atoms or anomalous scatterers.³⁵ The procedure is implemented using a combination of direct-space and reciprocal-space searches, and it can be applied to both isomorphous replacement and anomalous scattering data. The goal of the algorithm is to make it practical to locate automatically a subset of the heavy atoms without manual interpretation or intervention. Once the sites have been located, CNS provides tools for heavy-atom refinement, phase estimation, density modification, and heavy-atom model completion. These tools, known as task files, are scripts written in the CNS language and are supplied with reasonable default parameters. Using these task files, the process of phasing is greatly simplified and initial electron-density maps, even for large complex structures, can be calculated in a relatively short time. CNS has been used successfully to solve problems with up to 40³⁶ and 66 selenium sites (see Applications, below).

³⁵ R. W. Grosse-Kunstleve and A. T. Brunger, Acta Crystallogr. D. Biol. Crystallogr. 55, 1568 (1999).
³⁶ M. A. Walsh, Z. Otwinowski, A. Perrakis, P. M. Anderson, and A. Joachimiak, Struct. Fold. Des. 8, 505 (2000).

Data Preparation

Sigma Cutoffs and Outlier Elimination. The peaks in a Patterson map correspond to interatomic vectors of the crystal structure.³⁷ However, the
atoms are not point scatterers, and there are errors associated with experimental data, making the interpretation of the Patterson map difficult. Therefore, steps are taken to minimize the amount of error that is introduced. In practice, the suppression of outliers can be essential to the success of a heavy-atom search.38 In CNS, reflections are first rejected on the basis of their signal-to-noise ratio ("sigma cutoff"). This is performed on both the observed amplitudes and the computed difference between pairs of amplitudes. For the computation of differences, the observed amplitudes are scaled relative to each other, using overall k-scaling and B-scaling, in order to compensate for systematic errors caused by differences between crystals and data collection conditions. Additional reflections are rejected if their amplitudes or difference amplitudes deviate too much from the corresponding root-mean-square (rms) value for all of the data in their resolution shell ("rms outlier removal"). Empirical observation has led to the values of the rejection criteria shown in Table II.

TABLE II
Default Parameters for CNS Automated Heavy-Atom Search Procedure

Parameter | Default value(a) | Comment(b)
Number of sites | 2/3 of total expected | Typically not all sites are well ordered, and it is easy to add additional sites using gradient map methods once phasing has started with the 2/3 partial solution
Minimum Bragg spacing | 4.0 Å | If there are a large number of heavy-atom sites per macromolecule, a higher resolution limit may be required (3.5 Å)
Averaging of Patterson maps | No | If solutions are not found with a single map, then multiple maps can be tried
Special positions | No | Can be set to true if the heavy atoms have been soaked into the crystal
Sigma cutoff on F | 1 | Decrease to 0 for FA structure factors
RMS outlier cutoff on F for native or on ΔF for difference Patterson maps | 4 | Increase to 10 for FA structure factors
Expected increase in correlation coefficient for dead-end test | 0.01 | When there are a large number of heavy-atom sites, it may be necessary to decrease this value (to 0.005)

(a) Values present in the heavy_search.inp task file supplied with CNS.
(b) Situations in which the default parameter may require modification.

37 M. J. Buerger, "Vector Space." John Wiley & Sons, New York, 1959.
38 G. M. Sheldrick, Methods Enzymol. 276, 628 (1997).

Except for the
instances noted in Table II, these values can generally be used without modification.

Combining Patterson Maps. CNS provides the option to average Patterson maps based on different data sets. For example, several MAD wavelengths or a combination of isomorphous and anomalous difference maps can be combined. This is useful if the signal in any individual data set is too weak to locate the heavy atoms unambiguously. A small signal-to-noise ratio in the observed data leads to noise in the Patterson maps. Combining data increases the signal-to-noise ratio in the resulting Patterson map by averaging out the noise and therefore improves the chances of locating the heavy-atom positions (Fig. 1d).

Using FA Structure Factors. If MAD data are available, it is possible to define structure factors FA that approximate the component of the observed structure factors resulting from the anomalous scatterers.2,3,18 FA structure factors can be calculated using programs such as XPREP,39 MADSYS,3 or the MADBST module of SOLVE.4 Although CNS does not perform FA estimation, the heavy-atom search procedure can make use of FA structure factors, and doing so has been found to increase the chances of locating the correct sites (Fig. 1e). Ideally, an algorithm for the estimation of FA structure factors includes a careful treatment of outliers similar to the sigma cutoff and rms outlier removal outlined above. If this is the case, the parameters for the sigma cutoff and rms outlier removal in CNS should be adjusted to include all data in the heavy-atom search procedure (see Table II).

Heavy-Atom Searching

The CNS heavy-atom search procedure (Fig. 2) consists of four stages that are described in more detail by Grosse-Kunstleve and Brunger.35 In the first stage, the observed diffraction intensities are filtered by the criteria described above, and two or more Patterson maps (calculated from MIR, MAD, or MIRAS data) can be averaged.
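The filtering performed in the first stage can be illustrated with a minimal sketch. It is not the CNS implementation: the function name `filter_reflections`, the tuple layout of the input, and the reading of "deviate too much from the rms value" as "amplitude greater than the cutoff times the shell rms" are all assumptions made for illustration. The default cutoffs of 1 and 4 are the Table II values.

```python
import math
from collections import defaultdict


def filter_reflections(reflections, sigma_cutoff=1.0, rms_cutoff=4.0):
    """Return the reflections that pass both rejection criteria.

    reflections: list of (amplitude, sigma, shell) tuples, where `shell`
    is the index of the resolution shell (hypothetical input layout).
    """
    # Sigma cutoff: drop reflections whose signal-to-noise ratio is too low.
    kept = [r for r in reflections if r[1] > 0 and abs(r[0]) / r[1] >= sigma_cutoff]

    # Root-mean-square amplitude of the surviving data in each shell.
    totals = defaultdict(lambda: [0.0, 0])
    for amp, _sigma, shell in kept:
        totals[shell][0] += amp * amp
        totals[shell][1] += 1
    rms = {shell: math.sqrt(s / n) for shell, (s, n) in totals.items()}

    # rms outlier removal (assumed form): drop amplitudes far above the shell rms.
    return [r for r in kept if abs(r[0]) <= rms_cutoff * rms[r[2]]]


# Thirty ordinary reflections, one gross outlier, one weak reflection.
data = [(10.0, 1.0, 0)] * 30 + [(100.0, 1.0, 0), (0.5, 1.0, 0)]
clean = filter_reflections(data)
```

In this toy data set the weak reflection fails the sigma cutoff and the gross outlier fails the rms test, so only the thirty ordinary reflections survive. The same function could be applied to difference amplitudes after scaling.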
The second stage consists of a Patterson search by a reciprocal-space single-atom fast translation function, by a direct-space symmetry minimum function, or by a combination of both. Combination searches have been shown to be the most accurate.35 A given number (typically 100) of the highest peaks in the resulting Patterson search map are sorted and subsequently used as initial trial sites. The third stage consists of a sequence of alternating reciprocal-space or direct-space Patterson searches as well as Patterson-correlation (PC) refinements40 starting with each of the initial trial sites. The highest peak is selected that has distances to its symmetrically equivalent points and all preexisting sites larger than the given cutoff distance. If two or more sites already have been placed, a dead-end elimination test is performed.

[Figure 1: five panels, (a) through (e), each plotting the correlation coefficient (CC, axis from 0 to 0.6) against trial number.]

Fig. 1. Results of automated CNS heavy-atom search with the MAD data from 2-aminoethylphosphonate transaminase. Sixty-six selenium sites are present in the asymmetric unit. Automated searches for 44 sites (two-thirds of the expected total) were performed. In all cases, 100 trial solutions were generated and sorted by the correlation coefficient (F2F2). (a) No solutions were found using the anomalous ΔF structure factors at the high-energy remote wavelength, as indicated by no separation between the trials. (b) A few solutions were found using the anomalous ΔF structure factors at the peak wavelength. (c) The anomalous ΔF structure factors at the inflection-point wavelength found more solutions, indicating a larger anomalous signal than at the peak wavelength. (d) Using combined anomalous ΔF structure factors at the inflection-point wavelength and the dispersive differences between the inflection point and high-energy remote gave an even higher success rate. (e) Finally, the greatest success rate was with FA structure factors calculated from all three wavelengths, using XPREP.39

39 Written by G. Sheldrick. Available from Bruker Advanced X-Ray Solutions (Madison, WI).
[Figure 2: flowchart. Boxes shown: first Patterson search => list of initial trial sites; go through list of initial trial sites; distance within specified range to all sites? (yes/no); positional and/or B-factor PC refinement of all sites; expected number of sites placed? (yes/no); write sites to file; dead end? (yes/no); do Patterson search for next site and go through top peaks of this search.]

Fig. 2. CNS automated heavy-atom location protocol.
The correlation coefficient computed before placing and refining the last new site is compared with the correlation coefficient computed after the addition of the new site. If the target value does not increase by a specified amount, typically 0.01 (see Table II), then the search for that particular initial trial site is deemed to have reached a dead end, and no additional sites are placed. Otherwise, another Patterson search is carried out until the expected number of sites is found. The final stage consists of sorting the solutions by the value of the target function (a correlation coefficient) of the PC refinement. If the correct solution has been found, it is normally characterized by the best value of the target function and a significant separation from incorrect solutions (compare, e.g., Fig. 1a and b).

40 A. T. Brunger, Acta Crystallogr. A 47, 195 (1991).

Reciprocal-Space Method: Single-Atom Fast Translation Function. A single heavy-atom site is translated throughout an asymmetric unit, and the standard linear correlation coefficient of $F^2_{\mathrm{patt}}$ and $F^2_{\mathrm{calc}}(t)$ (referred to as F2F2) is computed for each position t:

$$
\mathrm{F2F2}(t) = \frac{\sum_H \left(F^2_{H,\mathrm{patt}} - \langle F^2_{\mathrm{patt}} \rangle\right)\left(F^2_{H,\mathrm{calc}} - \langle F^2_{\mathrm{calc}} \rangle\right)}{\sqrt{\sum_H \left(F^2_{H,\mathrm{patt}} - \langle F^2_{\mathrm{patt}} \rangle\right)^2}\;\sqrt{\sum_H \left(F^2_{H,\mathrm{calc}} - \langle F^2_{\mathrm{calc}} \rangle\right)^2}} \tag{1}
$$
The summations are computed for all Miller indices H, and $\langle F^2 \rangle$ denotes the mean of $F^2$ over all Miller indices. Other target expressions can be used, including the correlation coefficient between $F_{\mathrm{patt}}$ and $F_{\mathrm{calc}}(t)$, $E^2_{\mathrm{patt}}$ and $E^2_{\mathrm{calc}}(t)$, and $E_{\mathrm{patt}}$ and $E_{\mathrm{calc}}(t)$, where the E values are normalized structure factors (see Dual-Space Direct Methods, below). The F2F2 target function is preferred because it permits the use of a fast translation function (FTF),41 which is 300–500 times faster35 than the conventional translation function.42 Thus, the FTF makes the automated reciprocal-space heavy-atom search procedure practical even for large numbers of sites. The reciprocal-space search for an additional site is similar to the search for the initial trial sites, except that the previously placed sites are kept fixed and are included in the structure-factor ($F_{\mathrm{calc}}$) calculation.41

Direct-Space Method: Symmetry and Image-Seeking Minimum Functions. The symmetry minimum function (SMF)43–45 makes maximal use of the information contained in the Harker regions. The computation of an SMF requires a Patterson map as well as a table of the unique Harker vectors and their weights.43 These Harker vectors and weights are supplied automatically by CNS. The image-seeking minimum function (IMF)43,45 can be used to locate additional sites once one or more are placed. Computing an IMF map is equivalent to a deconvolution of the Patterson map using knowledge of the already placed heavy-atom sites. Because of coincidental overlap of peaks in the Patterson map, thermal motion of the sites, and noise in the data, the IMF maps typically provide only limited information for macromolecular crystal structures.

41 J. Navaza and E. Vernoslova, Acta Crystallogr. A 51, 445 (1995).
42 M. Fujinaga and R. J. Read, J. Appl. Crystallogr. 20, 517 (1987).
43 P. G. Simpson, R. D. Dobrott, and W. N. Lipscomb, Acta Crystallogr. 18, 169 (1965).
44 F. Pavelcik, J. Appl. Crystallogr. 19, 488 (1986).
45 M. A. Estermann, Nucl. Instr. Methods Phys. Res. A 354, 126 (1995).
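As a rough illustration of the F2F2 target of Eq. (1), the sketch below computes the linear correlation coefficient between squared amplitudes for one trial position. The function name `f2f2` and the plain-list inputs are assumptions for illustration. This is the direct (slow) evaluation of the target, not the fast translation function, which obtains the same quantity for all positions t at once.

```python
import math


def f2f2(f_patt, f_calc):
    """Linear correlation coefficient between squared amplitudes (Eq. 1)."""
    p = [f * f for f in f_patt]
    c = [f * f for f in f_calc]
    mean_p = sum(p) / len(p)
    mean_c = sum(c) / len(c)
    num = sum((a - mean_p) * (b - mean_c) for a, b in zip(p, c))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p)) * math.sqrt(
        sum((b - mean_c) ** 2 for b in c)
    )
    # A constant set of amplitudes has no variance; report zero correlation.
    return num / den if den > 0 else 0.0
```

Because the coefficient is invariant to an overall scale factor, calculated amplitudes that are proportional to the observed ones give a value of 1 regardless of the scale.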
Peak Search and Special Position Check. The list of initial trial sites is determined by a peak search in the single-atom FTF, the SMF, or their combination. A grid point is considered to be a peak if the corresponding density in the map is at least as high as that of its six nearest neighbors. Redundancies due to space-group symmetry and allowed origin shifts are automatically removed. Similarly, additional sites are determined by a peak search in the FTF, the IMF, or their combination. The treatment of redundancies due to symmetry is fully integrated into the search procedure. Sites at or close to a special position can be accepted or rejected. In the latter case, the shortest distance to all its symmetry-equivalent sites is computed for each of the trial sites. If this distance is less than a given cutoff distance (typically 3.5 Å), the site is rejected. Because selenomethionine substitution is the predominant technique for introducing anomalous scatterers into a macromolecule, the rejection of peaks on special positions is set to be the default. However, if heavy atoms have been soaked, cocrystallized, or chemically reacted with the macromolecule, a site could be located on a special position. In such cases, it is appropriate to search for heavy atoms first with special positions rejected and then with them accepted in order to determine whether further sites are found.

Scoring Trial Structures

The result of the CNS heavy-atom search is a number of trial solutions, each containing up to the specified maximum number of sites. There are typically as many of these trial solutions as were requested by the user before running the heavy_search.inp task file. However, when the input Patterson map has only a small number of peaks, fewer trial solutions may be found. The trial solutions can be ranked by the scoring function (which is typically F2F2, the correlation between the squared amplitudes), but other score functions can be used.
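The peak criterion described above (a grid point whose density is at least as high as that of its six nearest neighbors) can be sketched as follows. The function name `find_peaks`, the nested-list map layout, and the periodic wrap-around at the map edges are assumptions for illustration. The removal of symmetry redundancies and the special-position check are not included.

```python
def find_peaks(grid):
    """Return (value, (i, j, k)) for every six-neighbor peak, highest first.

    grid: 3D nested list of map values covering one periodic cell
    (hypothetical layout; indices wrap around at the edges).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    peaks = []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                value = grid[i][j][k]
                neighbors = (
                    grid[(i - 1) % nx][j][k], grid[(i + 1) % nx][j][k],
                    grid[i][(j - 1) % ny][k], grid[i][(j + 1) % ny][k],
                    grid[i][j][(k - 1) % nz], grid[i][j][(k + 1) % nz],
                )
                # "At least as high as" the six nearest neighbors.
                if all(value >= n for n in neighbors):
                    peaks.append((value, (i, j, k)))
    peaks.sort(reverse=True)
    return peaks


# A 4 x 4 x 4 test map with a single maximum at grid point (1, 2, 3).
test_map = [[[-((i - 1) ** 2 + (j - 2) ** 2 + (k - 3) ** 2) for k in range(4)]
             for j in range(4)] for i in range(4)]
peaks = find_peaks(test_map)
```

Note that the "at least as high" criterion marks every point of a flat region as a peak, so a practical implementation would need some additional handling of plateaus.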
Although the absolute value of the correlation coefficient could be used as a guide to the correctness of each trial solution, empirical observation has shown that a more informative guide is the presence of solutions with correlation coefficients that are outstanding compared with the rest (Fig. 1). Similar observations have also been made by the authors of other automatic programs for locating heavy atoms.9 The heavy_search.inp task file creates a list file (heavy_search.list) that contains an unsorted list of the score function for each trial solution. Each solution with a correlation score that is 1.5σ above the mean of all the solutions is marked with a plus sign (+). To interpret the results easily, the list of configurations can be sorted by correlation coefficient and then plotted graphically (Fig. 1). In the majority of cases encountered to date, if the solution with the highest correlation is also more than 1.5σ above the mean, then all or most of the heavy-atom positions in that solution are correct.

Substructure Refinement, Site Validation, and Enantiomorph Determination

The trial solutions produced by the automated heavy-atom search are used to determine initial phases to generate an electron-density map. Several different tasks must be performed in order to refine the heavy-atom substructure, calculate phases, complete the heavy-atom model, resolve the enantiomorph, and possibly resolve phase ambiguities. A similar approach is followed for MAD, SAD, and (M/S)IR(AS) experiments. In all cases, the following methods are employed.

Substructure Refinement. The heavy-atom sites located automatically with CNS are refined and phase probability distributions generated using the ir_phase.inp or mad_phase.inp task files that deal with isomorphous replacement and anomalous diffraction, respectively. A generalized phase refinement formulation is used when lack-of-closure expressions are calculated between a user-selected reference data set and all other data sets.46,47 A maximum-likelihood target function47 is employed that makes use of an error model similar to that of Terwilliger and Eisenberg.21 Coordinates, B-factors and, when appropriate, occupancies are refined using the Powell conjugate gradient minimization algorithm.48

Site Validation. The heavy-atom positions are not extensively validated during the search procedure; instead, the refinement of B-factors during each cycle decreases the contribution from incorrect sites. After phase calculation, the gradient map technique is used to validate the existing sites further, and also to detect sites missing from the current model.49 The gradient map is a Fourier synthesis calculated from the first derivative of the phasing target function, which can be interpreted as a difference map.
A positive peak, clearly separated from any existing atom, corresponds to an atom missing from the heavy-atom model, whereas a negative peak, located at the position of an existing atom, indicates that this atom is either incorrectly placed or has been assigned an incorrect chemical type or occupancy. Anisotropic motion of atoms in the substructure also can lead to peaks in the gradient map close to existing sites.

46 J. C. Phillips and K. O. Hodgson, Acta Crystallogr. A 36, 856 (1980).
47 F. T. Burling, W. I. Weis, K. M. Flaherty, and A. T. Brunger, Science 271, 72 (1996).
48 M. J. D. Powell, Math. Program. 12, 241 (1977).
49 G. Bricogne, Acta Crystallogr. A 40, 410 (1984).

Enantiomorph Determination. The use of the gradient map method in combination with substructure refinement allows the heavy-atom model to be completed even though the correct hand of the heavy-atom configuration is often still unknown. In CNS, the correct hand is determined by repeating the phase determination with the alternate hand followed by inspection of the two electron-density maps (see below). In the majority of cases, obtaining the alternative hand is achieved simply by inverting the coordinates about the origin. However, in the case of enantiomorphic space groups, the space group must be changed at the same time as the coordinates are inverted (e.g., P61 is mapped to P65). In addition, in a small number of space groups, the inversion of the coordinates is not about the origin, but rather some other point in the unit cell. The CNS task file flip_sites.inp automatically takes account of both of these situations. Once phasing has been performed with the two possible choices of heavy-atom coordinates, the electron-density maps can be compared to determine which hand is correct. Making this decision from the raw experimental phases is feasible only with high-quality MIR(AS) or MAD data sets. In such cases, the solvent boundary, secondary structure elements, or atomic detail in the electron-density map can show clearly which heavy-atom configuration is correct. However, in the general case the raw experimental phases are not sufficient to reveal such features. In particular, in the case of a single anomalous diffraction (SAD) or a single isomorphous replacement (SIR) experiment, it is not possible to distinguish the two hands in this way because of the bimodal phase distributions that are produced. Therefore, it is usually better to perform phase improvement by density modification in the form of solvent flattening or solvent flipping50 to resolve the phase ambiguity present in the SAD and SIR cases. The CNS task file density_modify.inp should be used to improve the phases irrespective of the type of phasing experiment.
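The hand inversion described above can be sketched for the common case of inversion about the origin. This is not the flip_sites.inp implementation: the function name `flip_hand` and the data layout are hypothetical, and the few space groups in which the inversion is about a point other than the origin are deliberately not handled. The table lists the 11 pairs of enantiomorphic space groups.

```python
# The 11 pairs of enantiomorphic space groups.
ENANTIOMORPH_PAIRS = [
    ("P41", "P43"), ("P4122", "P4322"), ("P41212", "P43212"),
    ("P31", "P32"), ("P3121", "P3221"), ("P3112", "P3212"),
    ("P61", "P65"), ("P62", "P64"), ("P6122", "P6522"),
    ("P6222", "P6422"), ("P4132", "P4332"),
]
PARTNER = {}
for first, second in ENANTIOMORPH_PAIRS:
    PARTNER[first] = second
    PARTNER[second] = first


def flip_hand(space_group, sites):
    """Invert fractional coordinates about the origin.

    Returns (space group of the other hand, inverted sites). Space groups
    needing inversion about another point are NOT handled in this sketch.
    """
    flipped = [tuple((-x) % 1.0 for x in site) for site in sites]
    return PARTNER.get(space_group, space_group), flipped


new_group, new_sites = flip_hand("P61", [(0.25, 0.5, 0.0)])
```

Phasing would then be repeated with `new_group` and `new_sites`, and the resulting maps compared with those from the original hand.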
After density modification of phases from both heavy-atom hands, the electron-density maps usually identify the correct hand unambiguously and are good enough to begin model building.

Dual-Space Direct Methods: SnB and SHELXD

Direct methods are techniques that use probabilistic relationships among the phases to derive values of the individual phases from the measured amplitudes. The purpose of this section is to give a concise summary of these techniques as they apply to substructure determination. The basic theory underlying direct methods,51 as well as macromolecular applications of direct methods,1 have been reviewed; the reader is referred to these sources for additional details.

50 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D. Biol. Crystallogr. 52, 30 (1996).
51 C. Giacovazzo, in "International Tables for Crystallography" (U. Shmueli, ed.), Vol. B, p. 201. Kluwer Academic, Dordrecht, The Netherlands, 1996.

Historically, direct methods have targeted the determination of complete structures, especially small molecules containing fewer than 100 nonhydrogen atoms. In the early 1990s, the size range of routine direct-methods applications was extended by almost an order of magnitude through a procedure that has come to be known as Shake-and-Bake.52,53 The distinctive feature of this procedure is the repeated and unconditional alternation of reciprocal-space phase refinement (Shaking) with a complementary real-space process that seeks to improve phases by applying constraints (Baking). This algorithm has been implemented independently in two computer programs, SnB9,10 and SHELXD11,11a (alias Halfbaked or SHELXM). These programs provide default parameters and protocols for the phasing process, but they allow easy user intervention in difficult cases.

It has been recognized for some time that the formalism of direct methods carries over to substructures when applied to single isomorphous54 (SIR) or single anomalous55 (SAD or SAS) difference data. MIR data can be accommodated simply by treating the data separately for each derivative, and MAD data can be handled by examining the anomalous differences for each wavelength individually or by combining them together in the form of FA structure factors.2,3 The dispersive differences between two wavelengths of MAD data also can be treated as pseudo-SIR differences. If substructure determination were the only concern, it is unclear whether it would be best to measure anomalous scattering data a few times for each of three wavelengths or many times for one wavelength. What is clear is that high redundancy leads to a highly beneficial reduction in measurement errors. SnB and SHELXD can both use either $|F_{\mathrm{ANO}}|$ or $|F_A|$ values, and so far both approaches have worked well.
SnB is normally applied to peak-wavelength anomalous differences computed using the DREAR56 program suite, and SHELXD is normally applied to $|F_{\mathrm{ANO}}|$ or $|F_A|$ values that have been calculated using XPREP.39 It is reassuring to know that one wavelength is generally sufficient for substructure determination when not all wavelengths were measured or when one or more wavelengths were in error. In addition, treating the wavelengths separately allows for useful cross-correlation of sites (see below, Site Validation).

52 C. M. Weeks, G. T. DeTitta, R. Miller, and H. A. Hauptman, Acta Crystallogr. D. Biol. Crystallogr. 49, 179 (1993).
53 C. M. Weeks, G. T. DeTitta, H. A. Hauptman, P. Thuman, and R. Miller, Acta Crystallogr. A 50, 210 (1994).
54 K. S. Wilson, Acta Crystallogr. B 34, 1599 (1978).
55 A. K. Mukherjee, J. R. Helliwell, and P. Main, Acta Crystallogr. A 45, 715 (1989).
56 R. H. Blessing and G. D. Smith, J. Appl. Crystallogr. 32, 664 (1999).
The largest substructure solved so far by direct methods contained 160 independent selenium sites.57 The upper limit of size is unknown, but, by analogy to the complete structure case, it is reasonable to think that it is at least a few hundred sites. In all likelihood, the inherently noisier nature of difference data and the fact that $|F_{\mathrm{ANO}}|$ and $|F_A|$ values provide imperfect approximations to the substructure amplitudes mean that the maximal substructure size that can be accommodated is less than that of complete structures. Although, at present, full-structure direct-methods applications require atomic-resolution data of 1.2 Å or better, the resolution of the data typically collected for isomorphous replacement or MAD experiments is sufficient for direct-methods determinations of substructures. Because it is rare for heavy atoms or anomalous scatterers to be closer than 3–4 Å, data having a maximum resolution in this range are adequate.

Data Preparation

Normalization. To take advantage of the probabilistic relationships that form the foundation of direct methods, the usual structure factors, F, must be replaced by the normalized structure factors,58 E. The condition $\langle |E|^2 \rangle = 1$ is always imposed for every data set. Unlike $\langle |F| \rangle$, which decreases as $\sin(\theta)/\lambda$ increases, the values of $\langle |E| \rangle$ are constant for concentric resolution shells. Similarly, correction factors (ε) are applied that take into account the average intensities of particular classes of reflections as a result of space-group symmetry.59 The distribution of $|E|$ values is, in principle, and often in practice, independent of the unit cell size and contents, but it does depend on whether a center of symmetry is present. Normalization is a necessary first step in data processing for direct-methods computations. It can be accomplished simply by dividing the data into resolution shells and applying the condition $\langle |E|^2 \rangle = 1$ to each shell.
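Shell-wise normalization of this kind can be sketched in a few lines. The function name `normalize_by_shell` and the input layout are assumptions for illustration, and the symmetry correction factors ε mentioned above are omitted (treated as 1) to keep the example short.

```python
import math
from collections import defaultdict


def normalize_by_shell(amplitudes, shells):
    """Scale amplitudes so that the mean of |E|^2 is 1 in every resolution shell.

    amplitudes: list of |F| (or difference) values.
    shells: resolution-shell index of each reflection (hypothetical layout).
    The epsilon symmetry factors are ignored in this sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for f, shell in zip(amplitudes, shells):
        totals[shell][0] += f * f
        totals[shell][1] += 1
    # Root-mean-square amplitude per shell is the normalizing divisor.
    scale = {shell: math.sqrt(s / n) for shell, (s, n) in totals.items()}
    return [f / scale[shell] if scale[shell] > 0 else 0.0
            for f, shell in zip(amplitudes, shells)]


e_values = normalize_by_shell([3.0, 4.0, 1.0, 1.0], [0, 0, 1, 1])
```

With few reflections per shell the shell averages become noisy, which is one reason a smooth fitted scaling function can be preferable.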
Alternatively, a least-squares-fitted scaling function can be used to impose the normalization condition. The procedures are similar regardless of whether the starting information consists of $|F|$, $|\Delta F|$ (iso or ano), or $|F_A|$ values and lead to $|E|$, $|E_\Delta|$, or $|E_A|$ values. Mathematically precise definitions of the SIR and SAD difference magnitudes, $|E_\Delta|$, that take into account the atomic scattering factors $|f_j| = |f_j^0 + f_j' + i f_j''|$ have been presented by Blessing and Smith56 and implemented in the program DIFFE that is distributed as part of the SnB package. The $|F_A|$ values that are used in SHELXD to form $|E_A|$ values are computed in XPREP,39 using algorithms similar to those employed in the MADBST component of SOLVE.4

Sigma Cutoffs and Outlier Elimination. Direct methods are notoriously sensitive to the presence of even a small number of erroneous measurements. This is especially problematic in the case of difference data, which can be quite noisy. The best antidote is to eliminate any questionable measurement before initiating the phasing process. Fortunately, it is possible to be stringent in the application of cutoffs because the number of difference reflections that must be phased is typically a small fraction of the total available observations. In small-molecule cases in which all reflections accessible to copper radiation have been measured, it is normal to phase about 10 reflections for every atom to be found, and this means that about 15% of the total data are used. In substructure cases, the unit cell for an N-site problem will be much larger than it would be for a small molecule with the same number of atoms to be positioned. Thus, the number of possible reflections will also be much larger, and many more can be rejected if necessary. In fact, only 2–3% of the total possible reflections at 3 Å need be phased in order to solve substructures using direct methods, but these reflections must be chosen from those with the largest $|E_\Delta|$ values.

The DIFFE56 program rejects data pairs ($|E_1|$, $|E_2|$) [i.e., SIR pairs ($|E_P|$, $|E_{PH}|$), SAD pairs ($|E_+|$, $|E_-|$), and pseudo-SIR dispersive pairs ($|E_{\lambda 1}|$, $|E_{\lambda 2}|$)] or difference E magnitudes ($|E_\Delta|$) that are not significantly different from zero or deviate markedly from the expected distribution. The following tests are applied; the default values supplied by the SnB interface for the cutoff parameters (TMAX, XMIN, YMIN, ZMIN, and ZMAX) are shown in parentheses and are based on empirical tests with known data sets.60,61

1. Pairs of data are excluded if $\big|(|E_1|-|E_2|) - \mathrm{median}(|E_1|-|E_2|)\big| \,/\, \{1.25 \times \mathrm{median}[\,|(|E_1|-|E_2|) - \mathrm{median}(|E_1|-|E_2|)|\,]\}$ > TMAX (6.0).
2. Pairs of data are excluded for which either $|E_1|/\sigma(|E_1|)$ or $|E_2|/\sigma(|E_2|)$ < XMIN (3.0).
3. Pairs of data are excluded if $\big||E_1|-|E_2|\big| \,/\, [\sigma^2(|E_1|) + \sigma^2(|E_2|)]^{1/2}$ < YMIN (1.0).
4. Normalized $|E_\Delta|$ are excluded if $|E_\Delta|/\sigma(|E_\Delta|)$ < ZMIN (3.0).
5. Normalized $|E_\Delta|$ are excluded if $[\,|E_\Delta| - |E_\Delta|_{\mathrm{MAX}}\,]/\sigma(|E_\Delta|)$ > ZMAX (0.0).

57 F. von Delft, T. Inoue, S. A. Saldanha, H. H. Ottenhof, F. Schmitzberger, L. M. Birch, V. Dhanaraj, M. Witty, A. G. Smith, T. L. Blundell, and C. Abell, Struct. 11, 985 (2003).
58 H. A. Hauptman and J. Karle, "Solution of the Phase Problem. I. The Centrosymmetric Crystal." ACA Monograph No. 3. Polycrystal Book Service, Dayton, OH, 1953.
59 U. Shmueli and A. J. C. Wilson, in "International Tables for Crystallography" (U. Shmueli, ed.), Vol. B, p. 190. Kluwer Academic, Dordrecht, The Netherlands, 1996.
60 G. D. Smith, B. Nagar, J. M. Rini, H. A. Hauptman, and R. H. Blessing, Acta Crystallogr. D. Biol. Crystallogr. 54, 799 (1998).
61 P. L. Howell, R. H. Blessing, G. D. Smith, and C. M. Weeks, Acta Crystallogr. D. Biol. Crystallogr. 56, 604 (2000).
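The first three tests in the list above act on data pairs and can be sketched as follows. This is not the DIFFE code: the function name `select_pairs`, the tuple layout, and the choice to compute the medians over all input pairs before any rejection are assumptions for illustration. The default cutoffs are the TMAX, XMIN, and YMIN values quoted in the list.

```python
import math
from statistics import median


def select_pairs(pairs, tmax=6.0, xmin=3.0, ymin=1.0):
    """Return the pairs (e1, sig1, e2, sig2) that pass tests 1 to 3."""
    diffs = [e1 - e2 for e1, _s1, e2, _s2 in pairs]
    center = median(diffs)
    # 1.25 x median absolute deviation: the robust scale used in test 1.
    scale = 1.25 * median(abs(d - center) for d in diffs)
    kept = []
    for (e1, s1, e2, s2), d in zip(pairs, diffs):
        if scale > 0 and abs(d - center) / scale > tmax:
            continue  # test 1: difference lies in the tails of the distribution
        if e1 / s1 < xmin or e2 / s2 < xmin:
            continue  # test 2: one member of the pair is too weak
        if abs(d) / math.sqrt(s1 * s1 + s2 * s2) < ymin:
            continue  # test 3: difference not significantly different from zero
        kept.append((e1, s1, e2, s2))
    return kept


pairs = [
    (0.8, 0.05, 1.0, 0.05), (0.9, 0.05, 1.0, 0.05), (1.0, 0.05, 1.0, 0.05),
    (1.1, 0.05, 1.0, 0.05), (1.2, 0.05, 1.0, 0.05),
    (6.0, 0.05, 1.0, 0.05),  # gross outlier, rejected by test 1
    (0.1, 0.05, 0.3, 0.05),  # weak measurement, rejected by test 2
]
kept = select_pairs(pairs)
```

In this toy set, the pair with identical magnitudes is also removed (test 3), leaving four pairs. Tests 4 and 5 would be applied afterward to the normalized difference magnitudes.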
The parameter TMAX is used to reject data with unreliably large values of $\big||E_1|-|E_2|\big|$ in the tails of the $(|E_1|-|E_2|)$ distribution. This test assumes that the distribution of $(|E_1|-|E_2|)/\sigma(|E_1|-|E_2|)$ should approximate a zero-mean unit-variance normal distribution for which values less than −TMAX or greater than +TMAX are extremely improbable. The quantity $|E_\Delta|_{\mathrm{MAX}}$ is a physical least upper bound such that $|E_\Delta|_{\mathrm{MAX}} = \sum |f| \,/\, [\varepsilon \sum |f|^2]^{1/2}$ for SIR data and $|E_\Delta|_{\mathrm{MAX}} = \sum f'' \,/\, [\varepsilon \sum (f'')^2]^{1/2}$ for SAD data.

Resolution Cutoffs. Before attempting to use MAD or SAD data to locate the anomalous scatterers, a critical decision is to choose the resolution to which the data should be truncated. If data are used to a higher resolution than is supported by significant dispersive and anomalous information, the effect will be to add noise. Because direct methods are based on normalized structure factors, which emphasize the high-resolution data, they are particularly sensitive to this. Because there is some anomalous signal at all the wavelengths in the MAD experiment, a good test is to calculate the correlation coefficient between the signed anomalous differences ΔF at different wavelengths as a function of the resolution. A good general rule is to truncate the data where this correlation coefficient falls below 25–30%. Table III (calculated using XPREP39) illustrates three different cases. In case A, the high values involving the peak (PK) and inflection-point (IP) data show that it is not necessary to truncate the data because there is significant MAD information at the highest resolution collected. A poorer correlation would be expected with the low-energy remote data (LR), which has a much smaller anomalous signal. In case B, it is advisable to truncate the data to about 3.9 Å (which indeed led to a successful solution using SHELXD). Case C is clearly hopeless and, in fact, could not be solved.
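The truncation rule can be made concrete with a small sketch that uses the peak-wavelength (PK) correlation coefficients of case B in Table III. The function name `resolution_cutoff` and the convention of stopping at the first shell that falls below the threshold are assumptions for illustration.

```python
def resolution_cutoff(shell_limits, cc_percent, threshold=30.0):
    """Return the high-resolution limit of the last shell before CC drops below threshold.

    shell_limits: high-resolution edge of each shell in angstroms,
    ordered from low to high resolution.
    cc_percent: correlation coefficient (%) in each shell.
    Returns None if even the first shell is below the threshold.
    """
    cutoff = None
    for limit, cc in zip(shell_limits, cc_percent):
        if cc < threshold:
            break
        cutoff = limit
    return cutoff


# Case B of Table III: PK data versus high-energy remote data.
limits = [8.0, 6.0, 5.0, 4.6, 4.4, 4.2, 4.0, 3.8, 3.6, 3.4, 3.2, 3.0]
cc_pk = [69.3, 73.1, 62.2, 56.9, 49.6, 45.6, 48.6, 29.6, 20.6, 24.6, 20.1, 14.2]
cut_30 = resolution_cutoff(limits, cc_pk, threshold=30.0)
cut_25 = resolution_cutoff(limits, cc_pk, threshold=25.0)
```

The 30% and 25% thresholds give limits of 4.0 and 3.8 Å, respectively, which bracket the truncation of about 3.9 Å quoted for case B.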
For SAD data collected at a single wavelength, it is still possible to use the correlation coefficient between the anomalous differences collected from two crystals, or from one crystal in two orientations, before merging the two data sets. Such information is also available from the CCP4 programs SCALA and REVISE (see Collaborative Computational Project Number 4, below).

TABLE III
Correlation Coefficients (%) Between High-Energy Remote Data and Other Wavelengths as a Function of Resolution Range

A. Apical domain,(a) 1 (3 SeMet in 144 residues), C2221

Resolution range (Å) | PK | IP | LR
Inf–8.0 | 91.2 | 89.7 | 48.5
8.0–6.0 | 93.9 | 90.0 | 52.8
6.0–5.0 | 93.9 | 87.0 | 52.9
5.0–4.0 | 89.6 | 84.4 | 38.0
4.0–3.6 | 88.6 | 79.8 | 28.4
3.6–3.4 | 89.4 | 78.9 | 34.6
3.4–3.2 | 89.4 | 79.4 | 14.2
3.2–3.0 | 83.9 | 74.7 | 21.1
3.0–2.8 | 76.9 | 71.1 | 24.7
2.8–2.6 | 65.7 | 54.3 | 9.1
2.6–2.4 | 57.0 | 47.2 | 5.4
2.4–2.2 | 44.8 | 39.2 | 3.7

B. Ribosome recycling factor,(b) 1 (4 SeMet in 185 residues), P43212

Resolution range (Å) | PK | IP
Inf–8.0 | 69.3 | 59.4
8.0–6.0 | 73.1 | 58.3
6.0–5.0 | 62.2 | 41.9
5.0–4.6 | 56.9 | 43.3
4.6–4.4 | 49.6 | 40.7
4.4–4.2 | 45.6 | 50.4
4.2–4.0 | 48.6 | 34.6
4.0–3.8 | 29.6 | 24.7
3.8–3.6 | 20.6 | 17.5
3.6–3.4 | 24.6 | 16.6
3.4–3.2 | 20.1 | 8.1
3.2–3.0 | 14.2 | 3.9

C. Unknown protein, 4 (4 SeMet in 350 residues), P21

Resolution range (Å) | PK | IP
Inf–8.0 | 33.2 | 37.6
8.0–6.0 | 29.5 | 38.9
6.0–5.0 | 19.9 | 37.8
5.0–4.6 | 10.6 | 26.5
4.6–4.4 | 7.7 | 13.5
4.4–4.2 | 17.4 | 24.0
4.2–4.0 | 7.6 | 14.2
4.0–3.8 | 9.8 | 27.3
3.8–3.6 | 9.3 | 25.9
3.6–3.4 | 13.4 | 23.1
3.4–3.2 | 6.0 | 24.3
3.2–3.0 | 2.8 | 22.8

Abbreviations: PK, peak; IP, inflection point; LR, low-energy remote.
(a) M. A. Walsh, I. Dementieva, G. Evans, R. Sanishvili, and A. Joachimiak, Acta Crystallogr. D. Biol. Crystallogr. 55, 1168 (1999).
(b) M. Selmer, S. Al-Karadaghi, G. Hirokawa, A. Kaji, and A. Liljas, Science 286, 2349 (1999).

Heavy-Atom Searching and Phasing

The phase problem of X-ray crystallography may be defined as the problem of determining the phases φ of the normalized structure factors E when only the magnitudes $|E|$ are given. Owing to the atomicity of crystal structures and the redundancy of the known magnitudes, the phase problem is overdetermined. This overdetermination implies the existence of relationships among the phases that are dependent on the known magnitudes alone, and the techniques of probability theory have identified the linear
combinations of three phases whose Miller indices sum to zero (i.e., $\Phi_{HK} = \varphi_H + \varphi_K + \varphi_{-H-K}$) as relationships useful for determining unknown structures. (The quantities $\Phi_{HK}$ are known as structure invariants because their values are independent of the choice of origin of the unit cell.) The conditional probability distribution of the three-phase or triplet invariants depends on the parameter $A_{HK}$, where $A_{HK} = (2/N^{1/2})\,|E_H E_K E_{-H-K}|$ and N is the number of atoms, here presumed to be identical, in the asymmetric unit of the corresponding primitive unit cell.62 Probabilistic estimates of the invariant values are most reliable when the associated normalized magnitudes ($|E_H|$, $|E_K|$, and $|E_{-H-K}|$) are large and the number of atoms in the unit cell is small. Thus, it is the largest $|E_\Delta|$ or $|E_A|$, remaining after the application of all appropriate cutoffs, that are phased in direct-methods substructure determinations. The triplet invariants involving these reflections are generated, and a sufficient number of those invariants with the highest $A_{HK}$ values are retained to achieve the desired invariant-to-reflection ratio (e.g., SnB uses a default ratio of 10:1). The inability to obtain a sufficient number of accurate invariant estimates is the reason why full-structure phasing by direct methods is possible only for the smallest proteins.

62 W. Cochran, Acta Crystallogr. 8, 473 (1955).

"Multisolution" Methods and Trial Structures. Once the values for some pairs of phases ($\varphi_K$ and $\varphi_{-H-K}$) are known, the triplet structure invariants can be used to generate further phases ($\varphi_H$) which, in turn, can be used iteratively to evaluate still more phases. The number of cycles of phase expansion or refinement that must be performed depends on the size of the structure to be determined. Older, conventional, direct-methods programs operate in reciprocal space alone, but the SnB and SHELXD programs alternate phase improvement in both reciprocal and real spaces within each cycle. To obtain starting phases, a so-called multisolution or multitrial approach63 is taken in which the reflections are each assigned many different starting values in the hope that one or more of the resultant phase combinations will lead to a solution. Solutions, if they occur, must be identified on the basis of some suitable figure of merit. Typically, a random-number generator is used to assign initial values to all phases from the outset.64 A variant of this procedure employed in SnB is to use the random-number generator to assign initial coordinates to the atoms in the trial structures and then to obtain initial phases from a structure-factor calculation.

The efficiency of direct methods, however, often can be improved considerably by using better-than-random starting trial structures that are, in some way, consistent with the Patterson function. In SHELXD, this is accomplished by computing a Patterson minimum function (PMF)65 to screen for likely candidates. First, one presumes that the strongest general Patterson peaks may well correspond to a vector between two heavy atoms. For a selected number (e.g., 100) of these vectors, the pair of atoms related by the vector are subjected to a number of random translations (e.g., 99,999).
For each of these potential two-atom trial structures, all the symmetry-equivalent atoms are found, the Patterson-function values corresponding to the unique vectors between all of these atoms are calculated and sorted in ascending order, and then the PMF scoring criterion is computed as the mean value of the lowest (e.g., 30%) values in this list. For each two-atom vector, the random translation with the highest PMF is retained. Next, the two-atom trial structures are extended to N atoms by using a technique that involves the computation of a full-symmetry Patterson superposition minimum function (PSMF).37 A list containing all symmetry equivalents of the two starting atoms is generated. Then, each pixel of the PSMF map is

63 G. Germain and M. M. Woolfson, Acta Crystallogr. B 24, 91 (1968).
64 R. Baggio, M. M. Woolfson, J.-P. Declercq, and G. Germain, Acta Crystallogr. A 34, 883 (1978).
65 C. E. Nordman, Trans. Am. Crystallogr. Assoc. 2, 29 (1966).
assigned a value equal to the PMF for all vectors in the list and a dummy atom placed at that pixel. Finally, the N − 2 highest peaks in the PSMF map are obtained by interpolation and sorting, and then they are added to the trial structure. Tests using SHELXD have shown that this combination of direct and Patterson methods produces more complete and precise solutions than using the Patterson methods alone. To make this method applicable in space group P1, SHELXD places an extra atom at the origin and performs random translations of the two-atom fragment.

Reciprocal-Space Phase Refinement or Expansion: Shaking. Once a set of initial phases has been chosen, it must be refined against the set of structure invariants whose values are presumed known. So far, two optimization methods (tangent refinement and parameter-shift reduction of the minimal function) have proved useful for extracting phase information in this way. Both of these optimization methods are available in both SnB and SHELXD, but SnB uses the minimal function by default whereas SHELXD uses the tangent formula. The tangent formula66

$$\tan(\phi_{\mathbf{H}}) = \frac{-\sum_{\mathbf{K}} |E_{\mathbf{K}} E_{-\mathbf{H}-\mathbf{K}}| \sin(\phi_{\mathbf{K}} + \phi_{-\mathbf{H}-\mathbf{K}})}{\sum_{\mathbf{K}} |E_{\mathbf{K}} E_{-\mathbf{H}-\mathbf{K}}| \cos(\phi_{\mathbf{K}} + \phi_{-\mathbf{H}-\mathbf{K}})} \tag{2}$$
is the relationship used in conventional direct-methods programs to compute $\phi_{\mathbf{H}}$ given a sufficient number of pairs ($\phi_{\mathbf{K}}$, $\phi_{-\mathbf{H}-\mathbf{K}}$) of known phases. It is also an option within the phase-refinement portion of the dual-space Shake-and-Bake procedure.67,68 In each cycle, SnB uses the tangent formula to redetermine all the phases, a process referred to as tangent-formula refinement. On the other hand, SHELXD performs a process of tangent expansion in which, during each cycle, the phases of (typically) the 40% highest calculated E magnitudes are held fixed while the phases of the remaining 60% are determined by the tangent formula.

The tangent formula suffers from the disadvantage that, in space groups without translational symmetry, it is perfectly fulfilled by a false solution with all phases equal to zero, thereby giving rise to the so-called “uranium-atom” solution with one dominant peak in the corresponding Fourier synthesis. In conventional direct-methods programs, the tangent formula is often modified in various ways to include (explicitly or implicitly) information from the so-called negative quartet or four-phase structure invariants69,70 that are

66 J. Karle and H. A. Hauptman, Acta Crystallogr. 9, 635 (1956).
67 C. M. Weeks, H. A. Hauptman, C.-S. Chang, and R. Miller, Trans. Am. Crystallogr. Assoc. 30, 153 (1994).
68 G. M. Sheldrick and R. O. Gould, Acta Crystallogr. B 51, 423 (1995).
dependent on the smallest as well as the largest E magnitudes. Such modified tangent formulas do indeed largely overcome the problem of false minima for small structures, but because of the dependence of quartet term probabilities on 1/N, they are little more effective than the normal tangent formula for large structures.

Constrained minimization of an objective function like the minimal function71,72

$$R(\Phi) = \frac{\sum_{\mathbf{H},\mathbf{K}} A_{\mathbf{HK}} \left[\cos\Phi_{\mathbf{HK}} - I_1(A_{\mathbf{HK}})/I_0(A_{\mathbf{HK}})\right]^2}{\sum_{\mathbf{H},\mathbf{K}} A_{\mathbf{HK}}} \tag{3}$$
provides an alternative approach to phase refinement or phase expansion. $R(\Phi)$ is a measure of the mean-square difference between the values of the triplets calculated using a particular set of phases and the expected probabilistic values of the same triplets as given by the ratio of modified Bessel functions [i.e., $I_1(A_{\mathbf{HK}})/I_0(A_{\mathbf{HK}})$]. The minimal function is expected to have a constrained global minimum when the phases are equal to their correct values for some choice of origin and enantiomorph. The minimal function also can be written to include contributions from quartet invariants, although their use is not as imperative as with the tangent formula because the minimal function does not have a minimum when all phases are zero.

An algorithm known as parameter shift73 has proved to be quite powerful and efficient as an optimization method when used within the Shake-and-Bake context to reduce the value of the minimal function. For example, a typical phase-refinement stage consists of three iterations or scans through the reflection list, with each phase being shifted a maximum of two times by 90° in either the positive or negative direction during each iteration. The refined value for each phase is selected, in turn, through a process that involves evaluating the minimal function using the original phase and each of its shifted values.53 The phase value that results in the lowest minimal-function value is chosen at each step. Refined phases are used immediately in the subsequent refinement of other phases.

Real-Space Constraints: Baking. Peak picking is a simple but powerful way of imposing an atomicity constraint. Karle74 found that even a relatively small, chemically sensible, fragment extracted by manual interpretation of a small-molecule electron-density map could be expanded

69 H. Schenk, Acta Crystallogr. A 30, 477 (1974).
70 H. Hauptman, Acta Crystallogr. A 30, 822 (1974).
71 T. Debaerdemaeker and M. M. Woolfson, Acta Crystallogr. A 39, 193 (1983).
72 G. T. DeTitta, C. M. Weeks, P. Thuman, R. Miller, and H. A. Hauptman, Acta Crystallogr. A 50, 203 (1994).
73 A. K. Bhuiya and E. Stanley, Acta Crystallogr. 16, 981 (1963).
74 J. Karle, Acta Crystallogr. B 24, 182 (1968).
into a complete solution by transformation back to reciprocal space and then performing additional iterations of phase refinement with the tangent formula. Automatic real-space electron-density map interpretation in the Shake-and-Bake procedure consists of selecting an appropriate number of the largest peaks in each cycle to be used as an updated trial structure without regard to chemical constraints other than a minimum allowed distance between atoms (e.g., 1.0 Å for full structures and 3–3.5 Å for substructures). If markedly unequal atoms are present, appropriate numbers of peaks (atoms) can be weighted by the proper atomic numbers during transformation back to reciprocal space in a subsequent structure-factor calculation. Thus, a priori knowledge concerning the chemical composition of the crystal is used, but no knowledge of constitution is required or used during peak selection.

It is useful to think of peak picking in this context as simply an extreme form of density modification appropriate when the resolution of the data is small compared with the distance separating the atoms. In theory, under appropriate conditions it should be possible to substitute alternative density-modification procedures such as low-density elimination75,76 or solvent flattening,27 but no practical applications of such procedures have yet been made. The imposition of physical constraints counteracts the tendency of phase refinement to propagate errors or produce overly consistent phase sets. For example, the ability to eliminate chemically impossible peaks at special positions using a symmetry-equivalent cutoff distance (similar to the procedure described in the Crystallography and NMR System section) prevents the occurrence of most cases of false minima.10

In its simplest form as implemented in the SnB program, peak picking consists of simply selecting the top N E-map peaks, where N is the number of unique nonhydrogen atoms in the asymmetric unit.
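The simplest form of peak picking described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not code from SnB: the function name `pick_peaks` is hypothetical, peak positions are taken to be Cartesian coordinates in Å, and space-group symmetry (including the special-position check) is ignored.

```python
import numpy as np

def pick_peaks(coords, heights, n_sites, min_dist):
    """Greedy selection of the n_sites highest peaks that are mutually
    at least min_dist apart (Cartesian coordinates, no symmetry)."""
    coords = np.asarray(coords, dtype=float)
    order = np.argsort(heights)[::-1]  # indices sorted by height, highest first
    kept = []
    for i in order:
        # Accept a peak only if it is far enough from every accepted peak.
        if all(np.linalg.norm(coords[i] - coords[j]) >= min_dist for j in kept):
            kept.append(int(i))
        if len(kept) == n_sites:
            break
    return kept

# Illustrative values only: four peaks on a line, two sites expected.
peaks = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
heights = [10.0, 9.0, 8.0, 7.0]
selected = pick_peaks(peaks, heights, n_sites=2, min_dist=3.0)
```

In this toy case the second-highest peak is rejected because it lies only 1 Å from the highest one, so the third peak is taken instead.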
Selecting the top N peaks in this way is adequate for small-molecule structures. It has also been shown to work well for heavy-atom or anomalously scattering substructures where N is taken to be the number of expected substructure sites.60,77 For larger structures or substructures (e.g., N > 100), the number of peaks selected is reduced to 0.8N peaks, thereby taking into account the probable presence of some atoms that, owing to high thermal motion or disorder, will not be visible. An alternative approach to peak picking used in SHELXD is to begin by selecting approximately N top peaks, but then to eliminate some of them (typically one-third) at random. By analogy to the common practice in macromolecular crystallography of omitting part of a structure from a

75 M. Shiono and M. M. Woolfson, Acta Crystallogr. A 48, 451 (1992).
76 L. S. Refaat and M. M. Woolfson, Acta Crystallogr. D Biol. Crystallogr. 49, 367 (1993).
77 M. A. Turner, C.-S. Yuan, R. T. Borchardt, M. S. Hershfield, G. D. Smith, and P. L. Howell, Nat. Struct. Biol. 5, 369 (1998).
Fourier calculation in hopes of finding an improved position for the deleted fragment, this version of peak picking is described as making a random omit map. It has the potential for being a more efficient search algorithm.

Scoring Trial Structures

SnB and SHELXD compute figures of merit that allow the user to judge the quality of a trial structure and decide whether or not it is a solution. It is worth repeating the caution given above (see Crystallography and NMR System). Although it is sometimes possible to give absolute values that strongly indicate a solution, it is safer to consider relative values. A true solution should have one or more figure-of-merit values that are outstanding relative to the nonsolutions, which generally are in the majority.

Minimal Function. The minimal function itself, $R(\Phi)$ [Eq. (3)], is a highly reliable figure of merit, provided that it has been calculated directly from the constrained phases corresponding to the final peak positions.53 This figure of merit is computed by both programs, and solutions typically have the smallest values. The SnB graphical user interface provides an option for checking the status of a running job by displaying a histogram of the minimal-function values for all trials that have been processed so far, as illustrated in Fig. 3 for the peak-anomalous difference data for a 30-site selenomethionyl (SeMet) substructure.77 A clear bimodal distribution of figure-of-merit values is a strong indication that a solution has, in fact, been found. Confirmation that this is true for trial 913 in the example in Fig. 3 can be obtained by inspecting a trace of the minimal-function value as a function of refinement cycle (Fig. 4). Solutions usually show an abrupt decrease in value over a few cycles, followed by stability at the lower value.

Crystallographic R. SnB and SHELXD compute $R_{\mathrm{CRYST}} = \left(\sum \big| |E_O| - |E_C| \big|\right) / \sum |E_O|$. This figure of merit, which is also highly reliable, has small values for solutions.

PATFOM.
The Patterson figure of merit, PATFOM, is the mean Patterson minimum function value for a specified number of atoms. It is computed by SHELXD. Although the absolute value depends on the structure in question, solutions almost always have the largest PATFOM values.

Correlation Coefficient. The correlation coefficient42 computed in SHELXD is defined by

$$CC = \frac{\sum w E_o E_c \sum w - \sum w E_o \sum w E_c}{\left\{\left[\sum w E_o^2 \sum w - \left(\sum w E_o\right)^2\right]\left[\sum w E_c^2 \sum w - \left(\sum w E_c\right)^2\right]\right\}^{1/2}} \tag{4}$$
Fig. 3. This bimodal histogram of minimal function (RMIN) values for 1000 trials suggests that there are 39 solutions. RTRUE and RRANDOM are theoretical values for true and random phase sets, respectively.53
Fig. 4. Plots of the minimal-function value over 60 cycles (a) for a solution (trial 913) and (b) for a nonsolution (trial 914).
with default weights $w = 1/[0.1 + \sigma^2(E)]$. Solutions typically have the largest values for this figure of merit. Values of 0.7 or greater when based on all, or almost all, of the $|E|$ data for full structures strongly indicate that a solution has been found. Also, when computed in SHELXD for substructures using $|E_A|$ data, values greater than 0.4 typically indicate a solution. SnB also computes a correlation coefficient, but this criterion has not been found to be reliable for substructures when based on the limited number of $|E_\Delta|$ difference data normally used.
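As a worked illustration of Eq. (4), the sketch below evaluates the weighted correlation coefficient with the default weights quoted above. It is a minimal example rather than SHELXD's own code; the function name `weighted_cc` and the input values are assumptions made for illustration.

```python
import numpy as np

def weighted_cc(e_obs, e_calc, sigma_e):
    """Weighted correlation coefficient between observed and calculated
    E magnitudes, with weights w = 1 / (0.1 + sigma(E)^2)."""
    e_obs = np.asarray(e_obs, dtype=float)
    e_calc = np.asarray(e_calc, dtype=float)
    w = 1.0 / (0.1 + np.asarray(sigma_e, dtype=float) ** 2)
    sw = w.sum()
    swo = (w * e_obs).sum()
    swc = (w * e_calc).sum()
    numerator = (w * e_obs * e_calc).sum() * sw - swo * swc
    denominator = np.sqrt(
        ((w * e_obs ** 2).sum() * sw - swo ** 2)
        * ((w * e_calc ** 2).sum() * sw - swc ** 2)
    )
    return numerator / denominator

# Illustrative values: calculated magnitudes exactly proportional to the
# observed ones, so the coefficient should be 1 whatever the weights are.
cc_linear = weighted_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.1, 0.2, 0.3])
# Perfectly anti-correlated magnitudes give -1.
cc_anti = weighted_cc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 0.0])
```

Because the coefficient is invariant to the overall scale of the calculated magnitudes, no scale factor between observed and calculated values needs to be determined before it is evaluated.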
Site Validation

Direct-methods programs provide as output a file of peak positions, for one or more of the best trials, sorted in descending order according to the electron density at those positions on the Fourier map. For an N-site substructure, SnB provides 1.5N peaks for each trial. The user must then decide which, and how many, of these peaks correspond to actual atoms. The first N peaks have the highest probability of being correct, and in many cases this simple guideline is adequate. Sometimes, there will be a significant break in the density values between true and false peaks, and, when this occurs in the expected place, it is additional confirmation. In other cases, a conservative approach is to accept the 0.8N to 0.9N top peaks, compute a difference Fourier map, and compare the peaks on this map to the original direct-methods map.

Crossword Tables. The Patterson superposition function is the basis of the crossword table,78,79 introduced in SHELXS-8680 and available also in SHELXD, that provides another way to assess which of the heavy-atom sites are correct and, in some cases, to recognize the presence of noncrystallographic symmetry. Each entry in the table links the potential atom forming the row with the potential atom forming the column. For each pair of atoms, the top number is the minimum distance between them, taking the space-group symmetry into account. The bottom number is the Patterson minimum function (PMF) value calculated from all vectors between the two atoms, also taking symmetry into account. The first vertical column is based on the self-vectors (i.e., the vectors between one atom and its symmetry equivalents). In general, wrong sites can be recognized by the presence in the table of several zero PMF values (negative values are replaced by zero).
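The rule of thumb about zero PMF values can be expressed as a simple count. The sketch below is a hypothetical helper, not part of SHELXD: it assumes the PMF entries of a crossword table have been copied into a full symmetric matrix (self-vector values on the diagonal), and the threshold `max_zeros` and the numbers used are illustrative only.

```python
import numpy as np

def suspect_sites(pmf, max_zeros=1):
    """Return indices of sites whose row of the symmetric PMF matrix
    contains more than max_zeros zero entries."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = np.maximum(pmf, 0.0)          # negative values are replaced by zero
    zeros = (pmf == 0.0).sum(axis=1)    # zero count for each site
    return [int(i) for i in np.nonzero(zeros > max_zeros)[0]]

# Illustrative four-site table: the last site has poor PMF values
# with every other site.
pmf = [[30.0, 25.0, 22.0,  0.0],
       [25.0, 40.0,  6.0,  0.0],
       [22.0,  6.0, 28.0, -1.5],
       [ 0.0,  0.0, -1.5, 18.0]]
flagged = suspect_sites(pmf)
```

A count of this kind is only a screening aid; as the text notes, the decision should also take the interatomic distances and the peak heights into account.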
Table IV shows the crossword table for the Cu Kα anomalous ΔF data for a HiPIP with two Fe4S4 clusters in the asymmetric unit.81 It is easy to find the two clusters (atoms 1–4 and 5–8) by looking for Fe–Fe distances of approximately 2.8 Å, and the PMF values for the eight correct atoms are, in general, higher than those involving spurious atoms despite the weakness of the anomalous signal.

Comparison of Trials. When trying to decide which peaks are correct, it is also helpful to compare the peak positions from two or more solutions.

78 G. M. Sheldrick, Z. Dauter, K. S. Wilson, and L. C. Sieker, Acta Crystallogr. D Biol. Crystallogr. 49, 18 (1993).
79 G. M. Sheldrick, in “Direct Methods for Solving Macromolecular Structures” (S. Fortier, ed.), p. 131. Kluwer Academic, Dordrecht, The Netherlands, 1998.
80 G. M. Sheldrick, J. Mol. Struct. 130, 9 (1985).
81 I. Rayment, G. Wesenberg, T. E. Meyer, M. A. Cusanovich, and H. M. Holden, J. Mol. Biol. 228, 672 (1992).
TABLE IV
Crossword Table for Location of Eight Iron Atoms

Self and cross-vector entries are given as minimum distance (Å) / PMF value; cross-vector columns are headed by the number of the atom forming the column.

No.  Peak  x       y       z       Self       1          2          3          4          5         6         7
 1   99.9  0.9201  0.0784  0.1133  27.7/26.6
 2   88.4  0.9719  0.1047  0.1356  27.4/39.7  2.4/25.1
 3   85.5  0.9043  0.1258  0.0884  27.7/27.3  2.6/23.3   3.0/5.5
 4   82.7  0.9546  0.0950  0.0503  26.7/15.2  2.3/28.4   2.5/43.5   2.7/26.4
 5   81.1  0.3542  0.5285  0.2615  31.2/20.9  14.6/41.4  16.6/14.8  14.4/9.5   14.6/21.5
 6   80.5  0.4316  0.5144  0.2451  30.0/25.5  16.5/24.6  18.7/20.0  16.4/21.2  16.8/8.9   3.0/0.0
 7   80.4  0.3942  0.5575  0.1995  29.6/0.0   14.4/31.4  16.4/7.7   13.9/22.6  14.6/33.8  2.7/26.6  2.9/19.4
 8   73.9  0.3920  0.5023  0.1694  29.1/26.1  14.3/22.3  16.6/16.0  14.5/24.5  14.8/18.3  3.2/10.9  2.6/0.0   3.0/17.5
 9   63.8  0.4025  0.4641  0.2218  29.9/18.4  16.1/17.0  18.4/13.1  16.4/0.0   16.5/4.5   4.0/0.0   2.9/5.4   5.0/0.0
10   58.9  0.9655  0.0517  0.0945  26.9/45.9  2.2/7.3    3.0/15.8   4.5/7.8    2.6/5.3    15.2/0.0  17.3/0.0  15.4/6.1
Peaks recurring in several solutions are more likely to be real. However, in order to do this comparison, one must take into account the fact that different solutions may have different origins and/or enantiomorphs. A stand-alone program for doing this is available,82 and the capability of making such comparisons automatically for all space groups will be available in future versions of SnB and SHELXD. The usefulness of peak correlation is illustrated by an example for a 30-site SeMet substructure.61,77 Table V presents the relative rankings of peaks, from nine other trials, that correspond to peaks 29–45 of trial 149, which had the lowest minimal-function value for the peak-wavelength difference data for crystal 1. The top 29 peaks for trial 149 were correct selenium positions, but peak 30 (the Nth peak) was spurious. Peak 33 of trial 149 was found to have a match on every other map, and indeed, it did correspond to the final selenium site. It appears that, in general, the same noise is not reproduced on different maps, especially maps originating from different data sets. Thus, peak correlation can be used to identify correct peaks ranking below the Nth peak.

82 G. D. Smith, J. Appl. Crystallogr. 35, 368 (2002).
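The origin and enantiomorph ambiguity in such a comparison can be illustrated with a small search over candidate alignments. This sketch is not the stand-alone program cited above. It assumes fractional coordinates, a caller-supplied list of permitted origin shifts (which depends on the space group), inversion through the origin as the change of hand, and it ignores symmetry-equivalent positions; all function names are hypothetical.

```python
import itertools
import numpy as np

def count_matches(ref, other, tol=0.02):
    """Number of sites in `other` lying within `tol` (fractional units,
    per axis, with lattice wrap-around) of some site in `ref`."""
    diff = other[:, None, :] - ref[None, :, :]
    diff -= np.round(diff)                      # wrap differences into [-0.5, 0.5]
    close = np.all(np.abs(diff) < tol, axis=2)
    return int(close.any(axis=1).sum())

def best_alignment(ref, other, shifts, tol=0.02):
    """Try both hands and every candidate origin shift; return
    (number of matching sites, hand, shift) for the best combination."""
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    best = (-1, None, None)
    for hand, shift in itertools.product((1, -1), shifts):
        moved = hand * other + np.asarray(shift, dtype=float)
        n = count_matches(ref, moved, tol)
        if n > best[0]:
            best = (n, hand, tuple(shift))
    return best

# Illustrative data: `trial_b` is `trial_a` inverted and referred to an
# origin shifted by (0.5, 0, 0), plus one spurious peak.
trial_a = [[0.10, 0.20, 0.30], [0.40, 0.50, 0.60], [0.70, 0.80, 0.90]]
trial_b = [[0.40, 0.80, 0.70], [0.10, 0.50, 0.40], [0.80, 0.20, 0.10],
           [0.33, 0.11, 0.77]]
shifts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)]
best = best_alignment(trial_a, trial_b, shifts)
```

Peaks that match under the best alignment are the candidates for real sites; the spurious fourth peak of `trial_b` finds no partner under any of the alignments tried.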
TABLE V
Trial Comparison for 30-Site Substructure

Crystal:       1    1    1    1    1    1    1    2    2    2
Wavelength(a): PK   PK   PK   PK   PK   IP   HR   IP   PK   HR
Trial no.:     149  31   158  165  176  104  23   476  93   86

Peak rank in trial 149: 29, 31, 33, 34, 37, 39, 40, 45
Matching peak ranks in the other trials (row and column positions not recoverable from the source): 22, 29, 29, 42, 30, 33, 29, 34, 30, 35, 42, 24, 21, 38, 29, 28, 22, 34, 30, 30, 43, 40, 42, 38, 42, 40

(a) The wavelengths are peak (PK), inflection point (IP), and high-energy remote (HR).
Enantiomorph Determination

Because all publicly distributed direct-methods programs, including SnB and SHELXD, work with only $|E|$, $|E_\Delta|$, or $|E_A|$ values, they have no way to determine the proper hand. Both enantiomorphs are found with equal frequency among the solutions. If a structure crystallizes in an enantiomorphic space group, either of the space groups may be used during the direct-methods step, but chances are 50% that, at a later stage, the coordinates will have to be inverted and the space group changed to its enantiomorph in order to produce an interpretable protein map. A direct-methods formalism has been proposed83 that uses both $|E^+|$ and $|E^-|$ and, in theory, should make it possible to produce only solutions with the proper hand. However, this theory has never been successfully applied to actual experimental data. Similarly, it should be noted that solutions occur at all permitted origin positions with equal frequency. This means that, in the MIR case, cross-phasing is necessary to ensure that all derivatives are referred to the same origin. A direct-methods formalism84 exists that should automatically do this, but it has never been implemented in a distributed program.

Substructure Refinement

Fourier refinement, often called E-Fourier recycling, has been used for many years in direct-methods programs to improve the quality and completeness of solutions.85 Additional refinement cycles are performed in real

83 H. Hauptman, Acta Crystallogr. A 38, 632 (1982).
84 S. Fortier, C. M. Weeks, and H. Hauptman, Acta Crystallogr. A 40, 646 (1984).
space alone, using many more reflections than is possible in the direct-methods steps that are dependent on the accuracy of triplet-invariant relationships. In SHELXD, the final model can be improved further by occupancy or isotropic displacement parameter (Biso) refinement for the individual atoms,86 followed by calculation of the Sim87- or sigma-A88-weighted map. The development of a common interface89 for SnB and the PHASES package90 permits coordinates determined by direct methods to be passed easily for conventional substructure phase refinement and protein phasing, and for SHELXD this facility is provided by a program SHELXE.90a

Collaborative Computational Project Number 4
Unlike many other packages, the Collaborative Computational Project Number 4 (CCP4) suite is a set of separate programs that communicate via standard data files rather than having all operations integrated into one huge program. This has some disadvantages in that it is less easy for programs to make decisions about what operation to do next, even though communication is now being coordinated through a graphical user interface (CCP4i). The advantage of loose organization is that it is easy to add new programs or to modify existing ones without upsetting other parts of the suite.

Data Preparation

The CCP4 suite provides a number of programs (i.e., SCALA,91 TRUNCATE,92 and SCALEIT) that are useful in preparing data for experimental phasing. SCALA treats scaling and merging as different operations, thereby allowing an analysis of data quality before merging. For isomorphous replacement studies, the native data can be used as the reference set, and all of the derivatives scaled to it. This provides

85 G. M. Sheldrick, in “Crystallographic Computing” (D. Sayre, ed.), p. 506. Clarendon Press, Oxford, 1982.
86 I. Usón, G. M. Sheldrick, E. de la Fortelle, G. Bricogne, S. di Marco, J. P. Priestle, M. G. Grütter, and P. R. E. Mittl, Struct. Fold. Des. 7, 55 (1999).
87 G. A. Sim, Acta Crystallogr. 12, 813 (1959).
88 R. J. Read, Acta Crystallogr. A 42, 140 (1986).
89 C. M. Weeks, R. H. Blessing, R. Miller, R. Mungee, S. A. Potter, J. Rappleye, G. D. Smith, H. Xu, and W. Furey, Z. Kristallogr. 217, 686 (2002).
90 W. Furey and S. Swaminathan, Methods Enzymol. 277, 590.
90a G. M. Sheldrick, Z. Kristallogr. 217, 644 (2002).
91 P. R. Evans, in “Recent Advances in Phasing.” Proceedings of CCP4 Study Weekend (1997).
92 G. S. French and K. S. Wilson, Acta Crystallogr. A 34, 517 (1978).
well-parameterized “local” scales. For MAD data, all sets are scaled in one pass, gross outliers are rejected (e.g., any measurement four to five times greater than the mean), and then each data set is merged separately to give a weighted mean for each reflection. A detailed analysis of the data is provided in a graphical form. Useful information is given on the scale factors themselves (which can often pinpoint rogue images), on the Rmerge values, and on the correlation coefficients between wavelengths for MAD data (coefficients

$\{p(1 - p)/4\}^{1/2}$

Combining Eqs. (9), (10), and (13), it can be concluded that solvent flattening reduces the error in an experimentally phased set of structure factors according to

$$\langle |(1 - p)F_t(a) - F_u(a)| \rangle = (1/n)^{1/2} \, \langle |F_e(x)| \rangle \, \{p(1 - p)/4\}^{1/2} \tag{14}$$

where $\langle |F(a)| \rangle \{p(1 - p)/4\}^{1/2}$ … $\langle |F_u(a)| \rangle$ … $\langle |F(a)| \rangle (1 - p)$. Comparing this result with Eq. (15), which is a rearranged form of Eq. (7), indicates the effects of unbiased solvent flattening (solvent flipping) on the average errors:

$$\langle |F_t(a) - F(a)| \rangle = \langle |F_e(x)| \rangle \tag{15}$$
In conclusion, unbiased solvent flattening (solvent flipping) has the following effects.

1. The average error of each structure factor that remains after solvent flipping has a random phase.
2. The expected mean squared amplitude of the error is reduced by a factor determined by p, the protein content of the crystal, and by the inverse square root of n, the number of relevant structure factors in G(x).
3. The resolution and the absolute volume of the protein mask determine n [see Eq. (9a)].

These theoretical considerations are borne out in practice. For example, the structure of F1-ATPase (54% solvent; unit cell volume, approximately 5.5 × 10^6 Å^3) could be solved using the isomorphous differences extending to 3.2 Å of just 2.5 Hg atoms.42 In the case of even larger asymmetric structures like ribosomal subunits, the phases could be extended from even

41 J. Drenth, in “Principles of Protein X-Ray Crystallography,” p. 122. Springer-Verlag, New York, 1994.
42 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D Biol. Crystallogr. 52, 30 (1996).
lower resolution (6–8 Å) to yield interpretable electron density.39 Besides very careful data processing, these results could be obtained because the large size of the unit cell increases the power of density modification techniques in improving phases.

Conclusions
We have witnessed a dramatic increase in the success of crystallography in solving the structures of large asymmetric subunits. The major contributors to this success are the improved quality and flux of synchrotron beam lines, allowing data to be measured at the highest possible accuracy. More accurate data allow even weak phasing signals to become useful. These weak phasing signals are often essential for the success of the more powerful computational methods that have been developed for determining and refining such phases. Finally, a hitherto unrealized benefit of large unit cells has assisted structure determinations of large asymmetric macromolecular complexes: phase refinement through solvent flipping is more powerful for large unit cells.

Acknowledgments

Some of the strategies and procedures described here were developed through inspiring discussions between N.B. and researchers at Yale University in the course of the large ribosomal subunit structure determination: T. A. Steitz, P. B. Moore, and P. Nissen. This work was supported by a Burroughs Wellcome Fund Career Award to N.B. J.P.A. warmly thanks N. Pannu, J. Plaisier, and R. A. G. de Graaff for vital feedback on the mathematical derivations.
[9] Multidimensional Histograms for Density Modification

By Kam Y. J. Zhang

Introduction
Density modification improves the quality of an approximate electron density map by imposing physical constraints based on some conserved features of the correct electron density map. These conserved features are independent of the unknown fine detail of the structural conformation. They are often expressed as constraints on the electron density in various forms, either in real or reciprocal space. Because the structure factor amplitudes
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
are known, these constraints restrict the value of phases and therefore can be used for phase improvement.

Density modification methods generally require an initial map with substantial phase information. In most cases, these phases are obtained from multiple isomorphous replacement (MIR)1 or multiwavelength anomalous dispersion (MAD).2 Density modification is of pivotal importance when the experimental source of phase information is bimodal, as it is in single-wavelength anomalous scattering (SAD) and single isomorphous replacement (SIR) methods. It is also possible to improve maps from other sources, such as molecular replacement. The amount of information in the initial map is dependent on phase accuracy, data resolution, and completeness. As more powerful constraints are incorporated, the density modification can be initiated from lower resolution maps with less accurate phases.

Density modification methods are usually implemented as an iterative procedure that alternates between density modification in real space and phase combination in reciprocal space. This paradigm was first proposed by Hoppe and Gassmann3 in their “phase correction” method. This approach takes advantage of the particular properties of the constraints and uses them in a way that is most convenient to implement.

A broad range of techniques has been developed to modify electron density maps by imposing chemical or physical information. The most commonly used density modification method is solvent flattening,4 which exploits the observation that the solvent region of the electron density map is featureless at medium resolution because of the high thermal motion and disorder of the solvent molecules. Flattening of the solvent region suppresses noise in the electron density map and thereby improves phases.
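The real-space half of one solvent-flattening cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the published procedure: the solvent mask is taken as already known, the map is a plain NumPy array on a grid, the function name `flatten_solvent` is hypothetical, and the subsequent back-transformation and phase combination in reciprocal space are omitted.

```python
import numpy as np

def flatten_solvent(density, protein_mask):
    """Return a copy of `density` with every solvent grid point set to
    the mean solvent density; protein grid points are left unchanged."""
    density = np.asarray(density, dtype=float)
    protein_mask = np.asarray(protein_mask, dtype=bool)
    flattened = density.copy()
    solvent = ~protein_mask
    flattened[solvent] = density[solvent].mean()
    return flattened

# Illustrative 2 x 3 "map": two protein grid points, four solvent points.
rho = np.array([[0.9, 1.4, 0.2],
                [0.1, -0.3, 0.4]])
mask = np.array([[True, True, False],
                 [False, False, False]])
flat = flatten_solvent(rho, mask)
```

In an iterative scheme of the kind described above, the flattened map would next be Fourier transformed, and the resulting phases combined with the experimental phase information before a new map is calculated.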
A complementary method to solvent flattening is histogram matching,5 which modifies the protein region of the map by systematically adjusting the electron density values so that the electron density distribution conforms to an ideal distribution. Sayre’s equation is used to restrain the local shape of the electron density.6 Molecular averaging forces the electron density at equivalent positions to be equal when there are multiple copies of the same molecule in the asymmetric unit.7 Electron density skeletonization
1 M. F. Perutz, Acta Crystallogr. 9, 867 (1956).
2 W. A. Hendrickson, J. R. Horton, and D. M. LeMaster, EMBO J. 9, 1665 (1990).
3 W. Hoppe and J. Gassmann, Acta Crystallogr. B 24, 97 (1968).
4 B. C. Wang, in “Diffraction Methods for Biological Macromolecules” (H. W. Wyckoff, C. H. W. Hirs, and S. N. Timasheff, eds.), Vol. 115, p. 90. Academic Press, Orlando, FL, 1985.
5 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 41 (1990).
6 D. Sayre, Acta Crystallogr. 5, 60 (1952).
7 G. Bricogne, Acta Crystallogr. A 32, 832 (1976).
imposes main-chain connectivity in the electron density, which is characteristic of protein molecules.8–10 Comprehensive descriptions of various density modification techniques can be found in Cowtan and Zhang11 and Zhang et al.12

Density Histogram and Histogram Matching
Histogram matching seeks to bring the distribution of electron density values of a map to that of an ideal map. It has proved to be a powerful method for phase improvement.5,13–15 The electron density histogram of a map is the probability distribution of the electron density values. The density histogram specifies not only the permitted values of the electron density but also their frequencies of occurrence. This distribution contains structural information about the underlying protein structure, such as the types of atoms and their packing. Proteins consist mostly of C, N, and O atoms, with a few S atoms, and these atoms are certain characteristic distances apart. The atoms are packed together in protein structures and the packing density is relatively independent of the detailed structure conformation.16,17 The distribution of atomic types, and the distances and angles between different atomic types, are all similar among different structures. Differences in structural conformation arise mainly from the dihedral angles of each residue. The density histogram discards this spatial information and therefore is independent of the factors that make each structure unique. Rather, it captures the commonality between different structures: the

8 C. Wilson and D. A. Agard, Acta Crystallogr. A 49, 97 (1993).
9 D. Baker, C. Bystroff, R. J. Fletterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 429 (1993).
10 C. Bystroff, D. Baker, R. J. Fletterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 440 (1993).
11 K. D. Cowtan and K. Y. J. Zhang, in “Progress in Biophysics and Molecular Biology” (T. Blundell, ed.), Vol. 72, p. 245. Elsevier Science, Amsterdam, 1999.
12 K. Y. J. Zhang, K. D. Cowtan, and P. Main, in “International Tables for Crystallography” (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 311. Kluwer Academic, Dordrecht, The Netherlands, 2001.
13 K. Y. J. Zhang, K. D. Cowtan, and P. Main, in “Macromolecular Crystallography” (C. W. Carter and R. M. Sweet, eds.), Vol. 277, p. 53. Academic Press, New York, 1997.
14 K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 49, 213 (1993).
15 K. D. Cowtan, K. Y. J. Zhang, and P. Main, in “International Tables for Crystallography” (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 705. Kluwer Academic, Dordrecht, The Netherlands, 2001.
16 B. W. Matthews, J. Mol. Biol. 33, 491 (1968).
17 B. W. Matthews, J. Mol. Biol. 82, 513 (1974).
similar atomic composition and the characteristic distances between atoms. These common features can distinguish correct from incorrect structures. Therefore the ideal density histogram can be used to improve an electron density map5,18,19 or to select a correct phase set among many randomly generated phase sets in ab initio phasing.20 The density histogram is degenerate in encoding structural information. While having an ideal density distribution is a necessary condition for being a correct structure, it is not a sufficient condition. Many incorrect structures may also have ideal density histograms. Moreover, the density histogram does not capture all the common features found in protein structures. Because the density histogram accounts for the value only at a given point and ignores its neighboring environment, any information about the neighborhood of a grid point will be complementary to the density histogram. Multidimensional Histograms
One way of reducing the degeneracy of the electron density histogram is to incorporate more stereochemical information into the constraints. The electron density histogram takes the density values as independent objects, and no relationship between them is taken into account. Xiang and Carter proposed extending the density histogram to include relationships between neighboring density values via multidimensional histograms defined as joint probabilities of the density values and their higher-order derivatives.21 Stereochemical information is usually expressed as bond length and angles between atoms. This information has been routinely used as restraints in structure refinement.22–24 However, it cannot be applied in the same form to the electron density distribution because the objects in the density distribution are not atomic positions but pixels of electron density. Nevertheless, in macromolecular structure determination, the characteristic geometric shape of the electron density that provides a unique guide for the crucial step of model building derives implicitly from the same bond lengths, bond angles, and atomic types. Thus this characteristic geometric shape expresses the stereochemical information.
18 R. W. Harrison, J. Appl. Crystallogr. 21, 949 (1988).
19 V. Y. Lunin, Acta Crystallogr. A 44, 144 (1988).
20 V. Y. Lunin, A. G. Urzhumtsev, and T. P. Skovoroda, Acta Crystallogr. A 46, 540 (1990).
21 S. Xiang and C. W. J. Carter, Acta Crystallogr. D 52, 49 (1996).
22 A. T. Brünger, J. Kuriyan, and M. Karplus, Science 235, 458 (1987).
23 D. E. Tronrud, L. F. Ten Eyck, and B. W. Matthews, Acta Crystallogr. A 43, 489 (1987).
24 J. H. Konnert and W. A. Hendrickson, Acta Crystallogr. A 36, 344 (1980).
Geometric shape at a particular grid point, $a$, in an electron density map is not completely defined by the electron density value, $\rho(a)$. Complementary information is provided by the derivatives, $\rho^{(1)}(a), \rho^{(2)}(a), \ldots, \rho^{(n)}(a)$, at grid point $a$. If all the derivatives are known, the electron density, $\rho(r)$, in the neighborhood of $a$ can be expressed as a Taylor series,

$$\rho(r) = \rho(a) + (r-a)\,\rho^{(1)}(a) + \frac{(r-a)^2}{2!}\,\rho^{(2)}(a) + \cdots = \sum_{n=0}^{\infty} \frac{(r-a)^n}{n!}\,\rho^{(n)}(a) \tag{1}$$

Here, $\rho^{(k)} = \nabla^k \rho$ is defined as the $k$th derivative of $\rho$, where

$$\nabla = (\nabla_x\ \nabla_y\ \nabla_z)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix} = \left(\frac{\partial}{\partial X}\ \ \frac{\partial}{\partial Y}\ \ \frac{\partial}{\partial Z}\right)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix}$$

is the gradient operator. Here, $\partial/\partial X$, $\partial/\partial Y$, and $\partial/\partial Z$ represent the partial derivatives along the orthogonal axes X, Y, and Z, respectively, and $(\mathbf{i}\ \mathbf{j}\ \mathbf{k})$ represents the unit vectors along the three orthogonal axes. Successive applications of the gradient operator on the electron density give the successive orders of derivatives. The $n$th derivative can be represented as the derivative of the $(n-1)$th derivative, and this implies that each successive derivative contains information about the neighborhood of the $(n-1)$th derivative. The $\rho(a)$ in Eq. (1) represents the “average” value of the function in the neighborhood of $a$. The first-order derivative represents the difference between $\rho(r)$ and its immediate neighbor at $(r-a)$. The second-order derivative represents the differences between its neighbor's neighbor. The higher the order of the derivative, the longer range interaction it represents, and it therefore enables the expansion to a longer range around $r$. If all these derivatives are known, $\rho(r)$ could be determined precisely using the Taylor series. Even if the derivatives are not known, their distribution will constrain the values of the density and, more importantly, its relationship with its neighbors. A multidimensional histogram therefore could capture neighborhood information about the electron density. The n-dimensional (n-D) histogram is the joint distribution of the derivatives of the electron density to the $n$th order,

$$H_n = P(\rho^{(0)}, \rho^{(1)}, \ldots, \rho^{(n)}) \tag{2}$$
The projection of the n-D histogram along each dimension could also give us the 1D histogram or histograms of lower dimensions, such as

$$\begin{aligned}
P(\rho^{(0)}) &= \sum_{\rho^{(i)} \neq \rho^{(0)}} P(\rho^{(0)}, \rho^{(1)}, \ldots, \rho^{(n)}) = P(\rho)\\
P(\rho^{(1)}) &= \sum_{\rho^{(i)} \neq \rho^{(1)}} P(\rho^{(0)}, \rho^{(1)}, \ldots, \rho^{(n)}) = P(g)\\
P(\rho^{(0)}, \rho^{(1)}) &= \sum_{\rho^{(i)} \neq \rho^{(0)},\, \rho^{(1)}} P(\rho^{(0)}, \rho^{(1)}, \ldots, \rho^{(n)}) = P(\rho, g)
\end{aligned} \tag{3}$$
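The derivatives of any order can be calculated using the following formulae. The fractional coordinates $(x\ y\ z)$ along the crystal axes a, b, and c are transformed to the orthogonal coordinates $(X\ Y\ Z)$ along the orthonormal axes $(\mathbf{i}\ \mathbf{j}\ \mathbf{k})$ by the orthogonalization matrix

$$\begin{pmatrix} X\\ Y\\ Z\end{pmatrix} = \begin{pmatrix} a & b\cos\gamma & c\cos\beta\\ 0 & b\sin\gamma & c(\cos\alpha - \cos\beta\cos\gamma)/\sin\gamma\\ 0 & 0 & V/(ab\sin\gamma)\end{pmatrix}\begin{pmatrix} x\\ y\\ z\end{pmatrix} \tag{4}$$

where $V = abc\,(1 - \cos^2\alpha - \cos^2\beta - \cos^2\gamma + 2\cos\alpha\cos\beta\cos\gamma)^{1/2}$. The first derivative of the orthogonal coordinates $(X\ Y\ Z)$ along the orthonormal axes $(\mathbf{i}\ \mathbf{j}\ \mathbf{k})$ is given by

$$\nabla = \left(\frac{\partial}{\partial X}\ \ \frac{\partial}{\partial Y}\ \ \frac{\partial}{\partial Z}\right)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix} = \begin{pmatrix} a & b\cos\gamma & c\cos\beta\\ 0 & b\sin\gamma & c(\cos\alpha - \cos\beta\cos\gamma)/\sin\gamma\\ 0 & 0 & V/(ab\sin\gamma)\end{pmatrix}\begin{pmatrix}\partial/\partial x\\ \partial/\partial y\\ \partial/\partial z\end{pmatrix}\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix} \tag{5}$$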
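As a hedged illustration of Eq. (4), the sketch below builds the orthogonalization matrix from the six cell parameters and applies it to one fractional coordinate. The function name `orthogonalization_matrix` and the example cell are assumptions made for this illustration and do not come from any of the programs cited in this chapter.

```python
import numpy as np

def orthogonalization_matrix(a, b, c, alpha, beta, gamma):
    """Matrix of Eq. (4): fractional (x, y, z) -> orthogonal (X, Y, Z).

    Cell edges in any length unit, cell angles in degrees.
    """
    al, be, ga = np.radians([alpha, beta, gamma])
    ca, cb, cg = np.cos(al), np.cos(be), np.cos(ga)
    sg = np.sin(ga)
    # Unit cell volume, as defined below Eq. (4).
    volume = a * b * c * np.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)
    return np.array([
        [a,   b * cg, c * cb],
        [0.0, b * sg, c * (ca - cb * cg) / sg],
        [0.0, 0.0,    volume / (a * b * sg)],
    ])

# Hypothetical triclinic cell, chosen only to exercise every matrix element.
M = orthogonalization_matrix(50.0, 60.0, 70.0, 80.0, 95.0, 110.0)
frac = np.array([0.25, 0.5, 0.75])
cart = M @ frac  # orthogonal coordinates of the fractional point
```

The columns of `M` are the cell edge vectors expressed in the orthonormal frame, so their lengths and mutual angles reproduce the cell parameters, which gives a quick consistency check on any implementation.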
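The second derivative is then given by

$$\nabla^2 = \left(\frac{\partial}{\partial X}\ \ \frac{\partial}{\partial Y}\ \ \frac{\partial}{\partial Z}\right)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix}(\mathbf{i}\ \mathbf{j}\ \mathbf{k})\begin{pmatrix}\partial/\partial X\\ \partial/\partial Y\\ \partial/\partial Z\end{pmatrix} = \frac{\partial^2}{\partial X^2} + \frac{\partial^2}{\partial Y^2} + \frac{\partial^2}{\partial Z^2} \tag{6}$$

The third derivative follows as

$$\nabla^3 = \nabla^2\nabla = \left(\frac{\partial^2}{\partial X^2} + \frac{\partial^2}{\partial Y^2} + \frac{\partial^2}{\partial Z^2}\right)\left(\frac{\partial}{\partial X}\ \ \frac{\partial}{\partial Y}\ \ \frac{\partial}{\partial Z}\right)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix} \tag{7}$$

For even order derivatives the equation becomes

$$\nabla^n = (\nabla^2)^{n/2} = \left(\frac{\partial^2}{\partial X^2} + \frac{\partial^2}{\partial Y^2} + \frac{\partial^2}{\partial Z^2}\right)^{n/2} \tag{8}$$

For odd order derivatives we have

$$\nabla^n = (\nabla^2)^{(n-1)/2}\,\nabla = \left(\frac{\partial^2}{\partial X^2} + \frac{\partial^2}{\partial Y^2} + \frac{\partial^2}{\partial Z^2}\right)^{(n-1)/2}\left(\frac{\partial}{\partial X}\ \ \frac{\partial}{\partial Y}\ \ \frac{\partial}{\partial Z}\right)\begin{pmatrix}\mathbf{i}\\ \mathbf{j}\\ \mathbf{k}\end{pmatrix} \tag{9}$$

The even order derivatives are scalars that can be calculated by one fast Fourier transform (FFT). The odd order derivatives are vectors that can be calculated by three FFTs. The modulus of the odd order derivatives can be calculated as

$$|\nabla^n| = \sqrt{\left(\frac{\partial}{\partial X}\right)^2 + \left(\frac{\partial}{\partial Y}\right)^2 + \left(\frac{\partial}{\partial Z}\right)^2}\ \left(\frac{\partial^2}{\partial X^2} + \frac{\partial^2}{\partial Y^2} + \frac{\partial^2}{\partial Z^2}\right)^{(n-1)/2} \tag{10}$$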
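The FFT counts quoted above (one transform for an even order derivative, three for an odd order one) can be illustrated with a minimal sketch for the two lowest orders. It is written under simplifying assumptions that are not part of the original method: the cell is orthogonal, the map is periodic and sampled over exactly one unit cell, and NumPy's FFT sign convention is used, which is the opposite of the usual crystallographic one. The function name `fft_derivatives` is hypothetical.

```python
import numpy as np

def fft_derivatives(rho, cell):
    """Gradient components, gradient modulus, and Laplacian of a periodic map.

    rho  : 3-D array sampled on a regular grid over one unit cell
    cell : (a, b, c) edge lengths of an orthogonal cell (assumption)
    """
    F = np.fft.fftn(rho)
    # Angular spatial frequencies 2*pi*h/a along each axis.
    freqs = [2 * np.pi * np.fft.fftfreq(n, d=L / n) for n, L in zip(rho.shape, cell)]
    kx, ky, kz = np.meshgrid(*freqs, indexing="ij")
    # Odd order: one inverse FFT per vector component (three in total).
    grad = [np.fft.ifftn(1j * k * F).real for k in (kx, ky, kz)]
    # Even order: the Laplacian is a scalar and needs a single inverse FFT.
    lap = np.fft.ifftn(-(kx**2 + ky**2 + kz**2) * F).real
    gmod = np.sqrt(sum(gc**2 for gc in grad))
    return grad, gmod, lap

# Test map: a single cosine wave along X, whose derivatives are known in closed form.
shape, cell = (16, 8, 8), (10.0, 8.0, 6.0)
x = np.arange(shape[0]) * cell[0] / shape[0]
rho = np.cos(2 * np.pi * x / cell[0])[:, None, None] * np.ones(shape)
grad, gmod, lap = fft_derivatives(rho, cell)
```

For a non-orthogonal cell the frequency factors would have to be replaced by the reciprocal-space combinations of the orthogonalization matrix elements, which this sketch does not attempt.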
There is no restriction on the number of components used to construct a multidimensional histogram. The more components used, the more stereochemical information can be encoded in the multidimensional histogram. However, the density derivatives are not independent of each other. The lower order derivatives carry most of the information about the characteristic shape of the electron density. Also considering the computational cost, Xiang and Carter examined only the electron density value and its two lowest order derivatives, gradient and Laplacian,21

$$EGL = P(\rho, g, l) \tag{11}$$
where $\rho = \rho^{(0)}$ is the electron density, $g = \rho^{(1)}$ is the gradient, and $l = \rho^{(2)}$ is the Laplacian. Projections of the above-described three-dimensional histogram give rise to the following one-dimensional and two-dimensional histograms,

$$E = P(\rho) = \int_g\!\int_l P(\rho, g, l)\,dg\,dl \tag{12}$$

$$G = P(g) = \int_\rho\!\int_l P(\rho, g, l)\,d\rho\,dl \tag{13}$$

$$L = P(l) = \int_\rho\!\int_g P(\rho, g, l)\,d\rho\,dg \tag{14}$$

$$EG = P(\rho, g) = \int_l P(\rho, g, l)\,dl \tag{15}$$

$$EL = P(\rho, l) = \int_g P(\rho, g, l)\,dg \tag{16}$$

$$GL = P(g, l) = \int_\rho P(\rho, g, l)\,d\rho \tag{17}$$
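On a binned histogram the integrals of Eqs. (12) to (17) become sums over the axes that are projected out. The following sketch accumulates a joint histogram from three maps and forms the six projections. The random input arrays stand in for real density, gradient, and Laplacian maps, and the function names and the bin count are assumptions made for this illustration.

```python
import numpy as np

def joint_histogram(rho, g, l, bins=16):
    """Normalised 3-D histogram P(rho, g, l) and the bin edges along each axis."""
    samples = np.column_stack([rho.ravel(), g.ravel(), l.ravel()])
    counts, edges = np.histogramdd(samples, bins=bins)
    return counts / counts.sum(), edges

def projections(egl):
    """Sum P(rho, g, l) over the unwanted axes (0 = rho, 1 = g, 2 = l)."""
    return {
        "E": egl.sum(axis=(1, 2)),
        "G": egl.sum(axis=(0, 2)),
        "L": egl.sum(axis=(0, 1)),
        "EG": egl.sum(axis=2),
        "EL": egl.sum(axis=1),
        "GL": egl.sum(axis=0),
    }

# Placeholder maps (random numbers, not real crystallographic data).
rng = np.random.default_rng(0)
rho = rng.normal(size=(8, 8, 8))
g = np.abs(rng.normal(size=rho.shape))
l = rng.normal(size=rho.shape)
egl, edges = joint_histogram(rho, g, l, bins=8)
proj = projections(egl)
```

Because every projection is derived from the same 3-D array, the lower-dimensional histograms are automatically consistent with one another, for example summing `EG` over its gradient axis returns `E`.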
The gradient can be calculated by three FFTs with the Fourier coefficient modified accordingly,

$$g = |\nabla\rho| = \sqrt{\left(\frac{\partial\rho}{\partial X}\right)^2 + \left(\frac{\partial\rho}{\partial Y}\right)^2 + \left(\frac{\partial\rho}{\partial Z}\right)^2} \tag{18}$$

where

$$\begin{aligned}
\frac{\partial\rho}{\partial X} &= -\frac{2\pi i}{V}\sum_{hkl} a_1 h\,F(hkl)\exp[-2\pi i(hx + ky + lz)] = g_x\\
\frac{\partial\rho}{\partial Y} &= -\frac{2\pi i}{V}\sum_{hkl} (a_2 h + b_2 k)\,F(hkl)\exp[-2\pi i(hx + ky + lz)] = g_y\\
\frac{\partial\rho}{\partial Z} &= -\frac{2\pi i}{V}\sum_{hkl} (a_3 h + b_3 k + c_3 l)\,F(hkl)\exp[-2\pi i(hx + ky + lz)] = g_z
\end{aligned} \tag{19}$$

Similarly, the Laplacian can be calculated by one FFT with the Fourier coefficient modified accordingly,

$$l = \nabla^2\rho = -\frac{4\pi^2}{V}\sum_{hkl} D_{hkl}\,F(hkl)\exp[-2\pi i(hx + ky + lz)] \tag{20}$$

with

$$D_{hkl} = (a_1^2 + a_2^2 + a_3^2)h^2 + (b_2^2 + b_3^2)k^2 + c_3^2 l^2 + 2(a_2 b_2 + a_3 b_3)hk + 2a_3 c_3\,hl + 2b_3 c_3\,kl$$

where the elements of the orthogonalization matrix are

$$\begin{aligned}
a_1 &= a^*\sin\beta^*\sin\gamma, & a_2 &= -a^*\sin\beta^*\cos\gamma, & a_3 &= a^*\cos\beta^*,\\
b_2 &= b^*\sin\alpha^*, & b_3 &= b^*\cos\alpha^*, & c_3 &= c^*
\end{aligned}$$

The $a^*$, $b^*$, $c^*$, $\alpha^*$, $\beta^*$, and $\gamma^*$ variables are reciprocal space cell parameters and x, y, and z are crystallographic coordinates. The orthogonal axes X, Y, and Z were chosen such that X is along the crystallographic axis a and Z is along the c* axis.
To verify the insensitivity of the multidimensional histograms to molecular conformation, Xiang, Carter, and coworkers built three different secondary structures artificially from the same 16-residue peptide taken from cytidine deaminase25 as well as a random atom model from the same peptide. For simplicity, only the one-dimensional histograms E, G, and L were tested from the above-described four different atomic models. They found that histograms from different secondary structural conformations, helix, sheet, and loop, are almost the same. In contrast, the histograms of the random atom model differ significantly everywhere from those of the corresponding secondary structures. They noted specifically that the gradient histogram recorded the largest overall differences between the random atom model and those models with regular secondary structures. It was suggested that the gradient histogram encodes much more stereochemical information than either the electron density or the Laplacian histograms.
The usefulness of histograms in density modification and ab initio phasing depends on their sensitivity to phase errors. The phase sensitivity of the multidimensional histograms was tested with 3.5-Å X-ray diffraction data of cyclophilin A.26 To give a quantitative measure of the sensitivity of histograms to phase errors, Xiang and Carter defined a histogram R factor as $R_h = \sum |P - P_m| / \sum P_m$, where $P$ is the histogram of the electron density in question and $P_m$ is the error-free histogram. $R_h$ measures the difference between a histogram and the error-free histogram and, by implication, the phase difference between them because the structure factor amplitudes are the same in both cases. The changes of $R_h$ with phase error for the three-dimensional histogram, EGL, as well as its various projections to the lower dimensions are illustrated in Fig. 5 in Xiang and Carter.21 The variation of $R_h$, $\Delta R_h$, when the phase error changes from 0° to 90°, is summarized in the following table:

        E     G     L     EG    EL    GL    EGL
ΔRh     0.16  0.36  0.17  0.43  0.29  0.42  0.49

25 L. Betts, S. Xiang, S. A. Short, R. Wolfenden, and C. W. J. Carter, J. Mol. Biol. 235, 635 (1994).
26 H. M. Ke, L. D. Zydowsky, J. Liu, and C. T. Walsh, Proc. Natl. Acad. Sci. USA 88, 9483 (1991).
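The histogram R factor defined above is a one-line computation once the two histograms share the same binning. The sketch below is a direct transcription of that definition; the function name and the four-bin example values are invented for illustration and are not data from Xiang and Carter.

```python
import numpy as np

def histogram_r_factor(p, p_ref):
    """R_h = sum|P - P_m| / sum P_m for two histograms with identical binning."""
    p = np.asarray(p, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    if p.shape != p_ref.shape:
        raise ValueError("histograms must share the same binning")
    return np.abs(p - p_ref).sum() / p_ref.sum()

# Made-up four-bin histograms: the reference and a perturbed copy.
p_ref = np.array([0.1, 0.4, 0.3, 0.2])
p_test = np.array([0.2, 0.3, 0.3, 0.2])
r_h = histogram_r_factor(p_test, p_ref)
```

The same function works unchanged for two- and three-dimensional histograms, since the sums run over all bins regardless of the array shape.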
Thus, higher dimension histograms have increased sensitivity to phase errors. This indicates that the components of the histograms, the density, the gradient, and the Laplacian, encode somewhat independent stereochemical information. They also found that histograms that contain the gradient have a higher sensitivity to phase error. This enhanced phase sensitivity probably arises because the gradient captures more stereochemical information, owing to its higher sensitivity to molecular shape compared with either the density or the Laplacian. On the basis of their studies, Xiang and Carter have concluded that the multidimensional histogram, including additional dimensions composed of the gradient magnitude and the Laplacian of the density, is minimally dependent on molecular folding and packing, while capturing substantially more stereochemical information than the conventional electron density histogram. The multidimensional histogram substantially reduces the degeneracy of the electron density histogram. They suggested multidimensional histograms could be used as improved targets for density modification and as a more reliable figure of merit for evaluating correct phases.
Multidimensional Histogram as Constraint for Density Modification
Double Histogram Method
In the conventional (1D) electron density histogram matching method,5 a one-to-one mapping is made from the original electron density to the new electron density so that the density histogram of the modified map matches the ideal histogram. The order of the electron density values is retained after histogram matching. Two grid points with the same electron density value will have the same density value after histogram matching. Therefore, the pattern of peaks and troughs in the modified map is similar to that in the original map. This is necessary in the histogram matching process, because there are many alternative ways of adjusting the electron density values to match an ideal electron density distribution. However, this feature is undesired, especially when spurious electron density peaks need to be removed and new electron densities corresponding to missing atoms need to be generated. This coupling of original and modified maps is broken during phase combination or when other constraints are introduced. Alternative ways of decoupling the electron density order between the modified and original maps include incorporating other features of electron density in addition to the electron density distribution.
Refaat et al. proposed a double-histogram matching method in which the density modification takes into account not only the current density value at a grid point but also some characteristics of the environment of that grid point within some distance.27 They investigated three local density environments: (1) local minimum density, $L_{\min}$, (2) local maximum density, $L_{\max}$, and (3) local density variance, $V_L$. Here “local” means within some distance R of the grid point in question. The local minimum and maximum density for a given grid point can be easily determined by comparing the density values of all the grid points within a radius R. The local density variance can be found by the use of Fourier transforms,

$$\begin{aligned}
V_L &= \langle\rho^2\rangle_L - \langle\rho\rangle_L^2 = \rho^2 \otimes W - (\rho \otimes W)^2\\
&= \mathcal{F}^{-1}\!\left[\mathcal{F}(\rho^2) \times \mathcal{F}(W)\right] - \left\{\mathcal{F}^{-1}\!\left[\mathcal{F}(\rho) \times \mathcal{F}(W)\right]\right\}^2\\
&= \mathcal{F}^{-1}[G \times Q] - \left\{\mathcal{F}^{-1}[F \times Q]\right\}^2
\end{aligned} \tag{21}$$

Here, $\mathcal{F}$ and $\mathcal{F}^{-1}$ represent Fourier transform and inverse Fourier transform and the symbols $\otimes$ and $\times$ represent convolution and multiplication, respectively. $F$ and $G$ are the Fourier transforms of the electron density $\rho$ and the squared electron density $\rho^2$, respectively. $Q$ is the Fourier transform of the weight function $W$ used to calculate the average. Two types of weighting schemes were used to calculate the average density within a given radius. The first was a uniform weight everywhere within a sphere of radius R (ball function, $W_b$). The second was a weight that varies linearly from a maximum at the center to zero at the surface of a sphere of radius R (tent function, $W_t$). The Fourier transforms of both the ball and tent functions can be derived analytically and scaled so that the integration of the function over the sphere gives unity. The Fourier transform of the ball function is

$$Q_b(s) = \mathcal{F}(W_b) = \frac{3}{4\pi^2 R^2 s^2}\left[\frac{\sin(2\pi R s)}{2\pi R s} - \cos(2\pi R s)\right] \tag{22}$$

The Fourier transform of the tent function is

$$Q_t(s) = \mathcal{F}(W_t) = \frac{3}{2\pi^3 R^3 s^3}\left[\frac{1 - \cos(2\pi R s)}{\pi R s} - \sin(2\pi R s)\right] \tag{23}$$
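The chain of equalities in Eq. (21) can be sketched numerically as two FFT convolutions. The code below is an illustration under stated assumptions rather than the procedure of Refaat et al.: it samples a ball weight directly on the grid and normalises its sum to one instead of using the analytic transform of Eq. (22), and it treats the grid as cubic with a single spacing. The names `ball_weight` and `local_variance` are hypothetical.

```python
import numpy as np

def ball_weight(shape, spacing, radius):
    """Uniform weight inside a sphere around the grid origin (periodic wrap-around),
    normalised so that the weights sum to one."""
    # Signed grid offsets 0, 1, ..., -2, -1 along each axis, scaled to lengths.
    axes = [np.fft.fftfreq(n, d=1.0 / n) * spacing for n in shape]
    dx, dy, dz = np.meshgrid(*axes, indexing="ij")
    w = (dx**2 + dy**2 + dz**2 <= radius**2).astype(float)
    return w / w.sum()

def local_variance(rho, weight):
    """V_L = <rho^2>_L - <rho>_L^2, both local averages taken as FFT convolutions."""
    Q = np.fft.fftn(weight)
    mean = np.fft.ifftn(np.fft.fftn(rho) * Q).real
    mean_sq = np.fft.ifftn(np.fft.fftn(rho**2) * Q).real
    return mean_sq - mean**2

# Placeholder map (random numbers) on an 8 x 8 x 8 grid with unit spacing.
rng = np.random.default_rng(1)
rho = rng.normal(size=(8, 8, 8))
w = ball_weight(rho.shape, spacing=1.0, radius=1.5)
v_local = local_variance(rho, w)
```

A tent weight could be substituted by replacing the boolean mask with a linearly decreasing function of the distance before normalisation; the convolution step stays the same.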
In their double-histogram matching procedure, the grid points are divided into 10 groups, each containing the same number of grid points, over 10 different value ranges of the local characteristic, such as the local minimum, local maximum, or local variance of the density. Therefore, 10 different histograms are created and each corresponds to a different value of the local environment. The electron density values within each local environment are modified according to the histogram matching process described by Zhang and Main5 such that the resulting histogram after modification conforms to the ideal histogram of the same local environment. The ideal histograms for the 10 different local environments are obtained from a model structure resembling the one under investigation. To reduce the changes to the density in each cycle, a damping factor, $c$, was used. The revised modified density, $\rho'$, is given by

$$\rho' = (1 - c)\,\rho_0 + c\,\rho_m \tag{24}$$

where $\rho_0$ and $\rho_m$ are the original and histogram matching modified density, respectively.
The double-histogram matching method has been tested on two known protein structures, RNAp1 (ref. 28) and 2Zn insulin.29 Various averaging radii and damping factors have been tested. It was found that the best results for RNAp1 are from using the local density variance as the local characteristic with the tent function and a radius of 0.5 Å as the averaging scheme. The mean phase error was reduced by 10° and the map correlation coefficient was improved by 0.14 as compared with the normal density histogram matching method. The best results for 2Zn insulin are from using the local density maximum as the local characteristic with a radius of 0.5 Å and a damping factor of 0.9. The improvement over the normal density histogram matching method was a 4° reduction in phase error and a 0.06 increase in map correlation.
Refaat et al. have shown that judicious use of the double-histogram matching method can give appreciably better results than use of the normal density histogram matching procedure. The choice of the damping factor $c$ is consistently indicated at about 0.9, but the best value for the radius of local characteristic R seems to be structure dependent. Good results are obtained with either the local density maximum or the tent function weighted local density variance as the local characteristic. However, the most reliable choice of parameters seems to be to use local maximum density and a damping factor of 0.9 and a value of R in the range of 0.5–0.6, which gives good results for both RNAp1 and 2Zn insulin. The double-histogram matching procedure has been incorporated into a computer program package, PERP (phase extension and refinement program), by Refaat et al.30
27 L. S. Refaat, C. Tate, and M. M. Woolfson, Acta Crystallogr. D Biol. Crystallogr. 52, 252 (1996).
28 S. I. Bezborodova, L. A. Ermekbaeva, S. V. Shlyapnikov, K. M. Polyakov, and A. M. Bezborodov, Biokhimiya 53, 965 (1988).
29 E. N. Baker, T. L. Blundell, J. F. Cutfield, S. M. Cutfield, E. J. Dodson, G. G. Dodson, D. M. Hodgkin, R. E. Hubbard, N. W. Isaacs, C. D. Reynolds, N. Sakabe, and M. Vijayan, Philos. Trans. R. Soc. Lond. B Biol. Sci. 319, 369 (1988).
Two-Dimensional Histogram Matching Method
In pursuit of an electron density constraint that reduces the degeneracy of the electron density histogram and incorporates more stereochemical information from the underlying structure, Goldstein and Zhang examined the 2D histogram of the joint probability distribution of the electron density and its gradient31 in a manner similar to that of Xiang and Carter.21 They considered an extended scope of protein structures from 16 distinct fold families.32 Electron density gradients were calculated by FFTs in a way similar to that proposed by Xiang and Carter.21 The accumulation of the 2D histogram is similar to that for the 1D density histogram.5 They have systematically examined the 16 structures to study the dependence of the 2D histogram on resolution, overall temperature factor, structural conformation, and phase error. The 2D histogram was found to vary with resolution and overall temperature factor, but was found to be insensitive to structure conformation. The average correlation coefficient between pairs of 2D histograms at three different resolutions examined was 0.90, with a standard deviation of 0.04. The 2D histogram was also found to be sensitive to phase error. The average correlation coefficient between 2D histograms with a 10° phase difference is 0.71. The variation of the 2D histogram due to structure conformation was estimated to be equivalent to that of a 4° phase error. This establishes the minimal phase error that a 2D histogram matching method could achieve. The conservation of the 2D histogram with respect to structure conformation enables the prediction of the ideal 2D histogram for unknown structures.
The sensitivity of the 2D histogram to phase error suggests that it could be used as a target for the density modification method and also as a figure of merit for phase selection in ab initio phasing. With the predictability of the 2D histogram established, owing to its independence from structural conformation, together with its sensitivity to phase error, Nieh and Zhang developed a 2D histogram matching procedure that exploits the joint probability distribution of electron density and its gradient as a constraint for density modification.33
30 L. S. Refaat, C. Tate, and M. M. Woolfson, Acta Crystallogr. D Biol. Crystallogr. 52, 1119 (1996).
31 A. Goldstein and K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 54, 1230 (1998).
32 C. A. Orengo, T. P. Flores, W. R. Taylor, and J. M. Thornton, Protein Eng. 6, 485 (1993).
33 Y. P. Nieh and K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 55, 1893 (1999).
The 2D histogram matching on density and its gradients is achieved through two alternating steps of 1D histogram matching on density and 1D histogram matching on gradients. The 1D histogram matching on density follows the method described by Zhang and Main.5 In this method, the new electron density value is derived from the old electron density value through a linear transformation such that the cumulative distribution of the new density value equals the cumulative distribution of the ideal histogram. The histogram matching on gradients also follows a similar protocol in which the density value was replaced by the gradients. The modified gradient maps were converted to the modified structure factors by the fast Fourier transform method as shown in the following equation:

$$\begin{aligned}
h\,F(hkl) &= -\frac{V}{2\pi i N}\sum_{xyz} g_x \exp[2\pi i(hx + ky + lz)]\\
k\,F(hkl) &= -\frac{V}{2\pi i N}\sum_{xyz} g_y \exp[2\pi i(hx + ky + lz)]\\
l\,F(hkl) &= -\frac{V}{2\pi i N}\sum_{xyz} g_z \exp[2\pi i(hx + ky + lz)]
\end{aligned} \tag{25}$$

Three structure factor sets are generated through the inverse FFT on each of the three modified gradient maps after the histogram matching on gradients. A single structure factor data set is obtained by vector averaging the equivalent reflections in the three structure factor sets.
The 2D histogram matching process was implemented in two modes: the parallel mode and the sequential mode. In the parallel mode, the histogram matching on density and gradients is applied in parallel, using the same initial structure factor set. After matching, the two new structure factor sets are combined by vector summation. In the sequential mode, as shown in Fig. 1, the structure factor set calculated after density histogram matching is used as input for the gradient histogram matching, and vice versa. Test results showed that the histogram matching in sequential mode gave better phase improvements and converged closer to the ideal 2D histogram in fewer matching cycles compared with the parallel mode.
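The 1D matching step, in which each value is moved to the ideal value at the same cumulative probability, can be sketched as follows. Note the substitution: this illustration maps through the empirical quantiles of a reference map, whereas the method of Zhang and Main uses a binned, piecewise linear transformation derived from a predicted ideal histogram. The function name `match_histogram` and the random test maps are assumptions.

```python
import numpy as np

def match_histogram(values, ref_values):
    """Rank-preserving map of `values` onto the distribution of `ref_values`.

    Each value is replaced by the reference value at the same cumulative
    probability, so the ordering of the input is retained.
    """
    flat = values.ravel()
    order = np.argsort(flat, kind="stable")
    # Cumulative probability of each input point (mid-rank convention).
    cdf = np.empty(flat.size)
    cdf[order] = (np.arange(flat.size) + 0.5) / flat.size
    ref_sorted = np.sort(ref_values.ravel())
    ref_cdf = (np.arange(ref_sorted.size) + 0.5) / ref_sorted.size
    # Look up the reference value at the same cumulative probability.
    matched = np.interp(cdf, ref_cdf, ref_sorted)
    return matched.reshape(values.shape)

# Placeholder maps: a "current" map and a reference with a different distribution.
rng = np.random.default_rng(2)
current = rng.normal(size=(6, 6, 6))
reference = rng.gamma(2.0, size=(6, 6, 6))
modified = match_histogram(current, reference)
```

The same function can be applied to a gradient map by passing gradient values in place of density values, which mirrors the statement above that the gradient step follows the same protocol.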
Fig. 1. The 2D histogram matching procedure. The 2D histogram matching is achieved through alternating application of 1D histogram matching on electron density and 1D histogram matching on density gradients in the sequential mode. The ideal 1D density histogram and 1D gradient histogram are obtained from the projection of the ideal 2D histogram along the gradient and density, respectively. Starting from an initial structure factor set, $F_0$, an electron density map, $\rho_0$, is calculated by a fast Fourier transform, $\mathcal{F}$. The initial map, $\rho_0$, is modified by the 1D histogram matching, $H$, to produce a new map, $\rho_0'$, whose density histogram conforms to the ideal density histogram. The modified map, $\rho_0'$, is inverse Fourier transformed, $\mathcal{F}^{-1}$, to give a new structure factor set, $F_1$, from which three gradient maps, $g_x$, $g_y$, and $g_z$, along each crystal axis are calculated. The three gradient maps are transformed from the crystal axes system to the orthogonal axes system by the orthogonalization matrix, $O$. The transformed gradient maps, $g_u$, $g_v$, and $g_w$, are modified by the 1D histogram matching on gradients, similar to that on density, to produce new gradient maps, $g_u'$, $g_v'$, and $g_w'$, whose gradient distribution matches the ideal gradient distribution. The new gradient maps are then transformed from the orthogonal axes to the fractional axes by the deorthogonalization matrix, $D$, to produce $g_x'$, $g_y'$, and $g_z'$. Three sets of structure factors, $F_x'$, $F_y'$, and $F_z'$, are obtained from the three new gradient maps, $g_x'$, $g_y'$, and $g_z'$, by inverse Fourier transforms. These three sets of structure factors are combined to produce a new set of structure factors, $F_2$, from which a new map, $\rho_0$, is calculated. The process is iterated until the 2D histogram of the modified map matches the ideal 2D histogram.
The 2D histogram matching procedure was incorporated into the density modification program SQUASH.14,34 Nieh and Zhang have tested their 2D histogram matching procedure using 2Zn insulin29 with both MIR phases and calculated phases with random errors. The MIR phases contain both random and systematic errors. The contribution of these two components to MIR phase errors varies from structure to structure. Whereas random errors can all be modeled statistically and are therefore easier to eliminate, systematic errors vary from case to case and are difficult to model and therefore more difficult to eliminate. Phase refinement and extension results from phases with random errors will demonstrate the upper limit of phase improvement when no systematic errors are present in the MIR phases.
34 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
Tests on 2Zn insulin have shown that employing extra constraints based on the density gradients can further reduce the phase errors and improve the overall map quality. For phase refinement and extension to high resolution, the result showed that 2D histogram matching improves the phases more than 1D histogram matching. The phase improvement for the refinement and extension of MIR phases from 1.9 to 1.5 Å was 9.6°. However, the difference between 2D and 1D histogram matching methods decreases at medium resolution. The phase improvement for the refinement and extension of MIR phases from 3.0 to 2.0 Å was 6.2°.
Although the results using the MIR phases of 2Zn insulin provided a typical example of phase refinement and extension, the results from the phases with random errors gave an upper limit of phase improvement when no systematic errors are present in the MIR phases. When tested on phases with randomly generated errors, the 2D histogram matching method improved the 1.9- to 1.5-Å phases by 34.2° versus the 1D histogram matching method. This demonstrates the importance of eliminating systematic errors during any phasing process.
The 2D histogram specifies not only the probability of the electron density value for a given grid point in the map but also the local environment around that grid point as reflected by the density gradients, which arise from chemical bonding. It also provides a means of decoupling the electron density order between the modified and original maps through sequential incorporation of the electron density gradient distribution.
Test results have demonstrated that the increased sensitivity to phase error in the 2D histogram of the density and gradient can be translated into an improved density modification method. The method used to achieve 2D histogram matching is computationally efficient. The density and gradient maps, and the corresponding structure factors, can be efficiently transformed back and forth through fast Fourier transform techniques. More importantly, 2D histogram matching can be efficiently achieved by applying 1D matching on density and density gradients alternately. The strategy of using alternating 1D histogram matching to achieve 2D histogram matching can be generalized to the matching of higher dimension histograms. By exploring histograms of density values and its higher-order derivatives, it may be possible to obtain a density modification method with further enhanced effectiveness in phase determination, refinement, and extension.
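The back-and-forth transformation between gradient maps and structure factors mentioned above can be illustrated in one dimension. This is a toy sketch under stated assumptions: a single fractional coordinate, NumPy's FFT sign and normalisation conventions rather than the crystallographic ones, and a band-limited test signal with no Nyquist component. The function names are hypothetical.

```python
import numpy as np

def gradient_from_sf(F):
    """Gradient d(rho)/dx (x fractional) from 1-D structure factors F(h)."""
    h = np.fft.fftfreq(F.size, d=1.0 / F.size)  # integer indices h
    return np.fft.ifft(2j * np.pi * h * F).real

def sf_from_gradient(g, f000):
    """Recover F(h) from a gradient map by dividing out the factor 2*pi*i*h.

    The h = 0 term carries no gradient information, so it must be supplied
    separately (here as f000).
    """
    h = np.fft.fftfreq(g.size, d=1.0 / g.size)
    G = np.fft.fft(g)
    F = np.zeros(g.size, dtype=complex)
    nonzero = h != 0
    F[nonzero] = G[nonzero] / (2j * np.pi * h[nonzero])
    F[0] = f000
    return F

# Band-limited 1-D "density" on 32 grid points.
x = np.arange(32) / 32
rho = 1.0 + np.cos(2 * np.pi * x) + 0.5 * np.sin(4 * np.pi * x)
F = np.fft.fft(rho)
g = gradient_from_sf(F)
F_back = sf_from_gradient(g, F[0])
```

The need to carry the zero-order term separately is one practical reason a gradient matching step has to be combined with information from the density map itself.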
[10] Docking of Atomic Models into Reconstructions from Electron Microscopy
By Niels Volkmann and Dorit Hanein
Introduction
In the three-dimensional structure determination of macromolecules, X-ray crystallography covers the full range from small molecules to large assemblies with molecular masses of megadaltons. The limiting factors are expression, the stability and homogeneity of the structure, and subsequent crystallization. In the case of nuclear magnetic resonance (NMR), structures can be determined from molecules in solution, but the size limit, although increasing, is presently on the order of 100 kDa. Dynamic aspects can be quantified, but again the structures of mixed conformational states cannot be determined. Owing to dramatic improvements in experimental methods and computational techniques, electron microscopy (EM) has matured into a powerful and diverse collection of methods that allow visualization of the structure and dynamics of an extraordinary range of macromolecular assemblies at resolutions spanning from molecular to near atomic.1–6 In addition, cryomethods enable the observation of molecules under nearly physiological conditions in their native aqueous environment.7 Although not hampered by many of the limitations of NMR or crystallography, EM imaging is ˚ ), thus limited to lower resolution for most biological specimens (10–30 A precluding atomic modeling directly from the data. Still, atomic models often can be generated by combining highresolution structures of individual components in a macromolecular complex with a low-resolution structure of the entire assembly.8 Analysis of the resulting models can lead to new hypotheses and contribute important insights into interaction and regulation of the individual molecular 1
W. Ku¨hlbrandt and K. A. Williams, Curr. Opin. Chem. Biol. 3, 537 (1999). W. Chiu, A. McGough, M. B. Sherman, and M. F. Schmid, Trends Cell Biol. 9, 154 (1999). 3 W. Baumeister, R. Grimm, and J. Walz, Trends Cell Biol. 9, 81 (1999). 4 W. Baumeister and A. C. Steven, Trends Biochem. Sci. 25, 624 (2000). 5 H. Stahlberg, D. Fotiadis, S. Scheuring, H. Remigy, T. Braun, K. Mitsuoka, Y. Fujiyoshi, and A. Engel, FEBS Lett. 504, 166 (2001). 6 H. R. Saibil, Nat. Struct. Biol. 7, 711 (2000). 7 J. Dubochet, M. Adrian, J.-J. Chang, J.-C. Homo, J. Lepault, A. W. McDowall, and P. Schultz, Q. Rev. Biophys. 21, 129 (1988). 8 T. S. Baker and J. E. Johnson, Curr. Opin. Struct. Biol. 6, 585 (1996). 2
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
[10]
docking of atomic models into EM maps
components. Thus the combination of atomic resolution structures with EM provides a powerful tool to gain insight into cellular processes.

The Fitting Problem
The combination of high-resolution features of individual components with the more complete picture of macromolecular assemblies that EM produces requires fitting of the atomic structures of the components into the density provided by EM. According to Numerical Recipes in C,9 a genuinely useful fitting procedure should provide the following items:

1. The best possible fitting parameters (globally)
2. Error estimates on these parameters
3. A statistical measure for goodness-of-fit (a confidence interval)

To illustrate the rationale behind this statement, suppose that the third item suggests that a certain “best fit” satisfies the data just as well as any other fit. Providing the best-fitting parameters (item 1) would then be basically meaningless. Items 2 and 3 are especially important in the context of inferring high-resolution information from fitting into low-resolution reconstructions as determined by EM. The accuracy and reliability of the conclusions depend highly on the error margin and confidence of the fit. Despite the importance of items 2 and 3, most of the work on fitting as of today has focused on item 1.

Fitting Methods
The current practice for fitting atomic models into EM reconstructions can be divided into three classes: (1) interactive manual fitting with various degrees of refinement and quality assessment, (2) semiautomatic fitting based on a reduced vector representation of the model and the data, and (3) automated fitting based on density-correlation measures with optional filtering operations. Until recently, interactive manual fitting with optional refinement was the method of choice. However, the other two methodologies are gaining significantly in popularity. Correlation-based fitting approaches, in particular, are available in a variety of implementations. A comparison between various computational fitting methods was performed using calculated error-free densities.10 The comparison gives an
9 W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, “Numerical Recipes in C: The Art of Scientific Computing.” Cambridge University Press, Cambridge, 1988.
10 W. Wriggers and S. Birmanns, J. Struct. Biol. 133, 193 (2001).
phases
assessment for the fitting precision of these methods in one particular implementation in an error-free environment. Unfortunately, this comparison is of only limited use for practical purposes. One of the main concerns in fitting atomic structures into EM reconstructions is the presence of fitting artifacts due to experimental errors in the EM data. Implementations of fitting functions that perform well on error-free, perfect data may fail completely in a noisy environment. In particular, edge enhancement algorithms such as convolution with Laplacian operators11 are notoriously sensitive to noise.12 Another potential source of fitting artifacts that was not addressed in the comparison is the tendency of molecules to exhibit local conformational changes (induced fit mechanism) on complex formation.13 In this chapter we first give an overview of the main features of the three classes of fitting methods, with real-life applications in mind. We try to pinpoint potential problem areas of these methods in dealing with experimental data that carry systematic and random errors. We then give a more detailed example of one particular implementation of a density correlation measure. Last, we present the concept of solution sets14,15 that can provide error estimates (item 2) and confidence intervals (item 3) for fitting of atomic models into reconstructions from EM, and build the basis for a more quantitative assessment of the final fits.

Manual Fitting
Interactive manual fitting is widely used for combining atomic structures with EM reconstructions.16–23 In this approach, the fit of the model into isosurface envelopes is judged by eye and corrected manually, using
11 W. Wriggers and P. Chacon, Structure 9, 779 (2001).
12 M. Seul, L. O’Gorman, and M. Sammon, “Practical Algorithms for Image Analysis.” Cambridge University Press, Cambridge, 2000.
13 T. J. Smith, E. S. Chase, T. J. Schmidt, N. H. Olson, and T. S. Baker, Nature 383, 350 (1996).
14 N. Volkmann, D. Hanein, G. Ouyang, K. M. Trybus, D. J. DeRosier, and S. Lowey, Nat. Struct. Biol. 7, 1147 (2000).
15 N. Volkmann and D. Hanein, J. Struct. Biol. 125, 176 (1999).
16 I. Rayment, H. M. Holden, M. Whittaker, C. B. Yohn, M. Lorenz, K. C. Holmes, and R. A. Milligan, Science 261, 58 (1993).
17 D. Voges, R. Berendes, A. Burger, P. Demange, W. Baumeister, and R. Huber, J. Mol. Biol. 238, 199 (1994).
18 R. Beroukhim and N. Unwin, Neuron 15, 323 (1995).
19 H. Sosa, D. P. Dias, A. Hoenger, M. Whittaker, E. Wilson-Kubalek, E. Sablin, R. J. Fletterick, R. D. Vale, and R. A. Milligan, Cell 90, 217 (1997).
20 A. Hoenger, S. Sack, M. Thormahlen, A. Marx, J. Muller, H. Gross, and E. Mandelkow, J. Cell Biol. 141, 419 (1998).
a modeling program such as O,24 until the fit “looks best.” Sometimes this initial fit is refined locally using various reciprocal-space25–28 or real-space scoring functions,29–32 some of which originate in crystallographic refinement or molecular replacement.33 If the components of the assembly under study are large molecules with distinctive shapes at the resolution of the reconstruction, manual fitting often can be performed with relatively little ambiguity.16,34 For example, when the complex of human rhinovirus and an attached Fab was solved by X-ray crystallography, it was found that a model from previous manual docking experiments was accurate to within 4 Å.13 On the other hand, divergent models docked manually in different laboratories using EM data of identical constructs (e.g., microtubule decorated with kinesin) also have been reported.20,35 One obvious disadvantage of the manual docking approach is its subjectivity. Objective scoring functions have been used occasionally to assess the quality and to refine the initial manual fit. However, local refinement does not necessarily increase precision or resolve ambiguities because the refinement can easily become trapped in a local maximum close to the initial manual fit that served as a starting point. Local refinement cannot answer the question of whether there is perhaps a better or equivalent fit in some remote corner of parameter space that was missed
21 E. A. Hewat, T. C. Marlovits, and D. Blass, J. Virol. 72, 4396 (1998).
22 R. J. Gilbert, J. L. Jimenez, S. Chen, I. J. Tickle, J. Rossjohn, M. Parker, P. W. Andrew, and H. R. Saibil, Cell 97, 647 (1999).
23 X. Yu, T. Horiguchi, K. Shigesada, and E. H. Egelman, J. Mol. Biol. 299, 1299 (2000).
24 T. A. Jones, J.-Y. Zou, S. W. Cowan, and M. Kjeldgaard, Acta Crystallogr. A 47, 110 (1991).
25 R. H. Cheng, V. S. Reddy, N. H. Olson, A. J. Fisher, T. S. Baker, and J. E. Johnson, Structure 2, 271 (1994).
26 Z. Che, N. H. Olson, D. Leippe, W. M. Lee, A. G. Mosser, R. R. Rueckert, T. S. Baker, and T. J. Smith, J. Virol. 72, 4610 (1998).
27 W. R. Wikoff, G. Wang, C. R. Parrish, R. H. Cheng, M. L. Strassheim, T. S. Baker, and M. G. Rossmann, Structure 2, 595 (1994).
28 M. Mathieu, I. Petitpas, J. Navaza, J. Lepault, E. Kohli, P. Pothier, B. V. Prasad, J. Cohen, and F. A. Rey, EMBO J. 20, 1485 (2001).
29 J. M. Grimes, J. Jakana, M. Ghosh, A. K. Basak, P. Roy, W. Chiu, D. I. Stuart, and B. V. Prasad, Structure 5, 885 (1997).
30 P. L. Stewart, S. D. Fuller, and R. M. Burnett, EMBO J. 12, 2589 (1993).
31 E. Nogales, M. Whittaker, R. A. Milligan, and K. H. Downing, Cell 96, 79 (1999).
32 E. A. Hewat, N. Verdaguer, I. Fita, W. Blakemore, S. Brookes, A. King, J. Newman, E. Domingo, M. G. Mateu, and D. I. Stuart, EMBO J. 16, 1492 (1997).
33 J. Navaza, J. Lepault, F. A. Rey, C. Alvarez-Rua, and J. Borge, Acta Crystallogr. D Biol. Crystallogr. 58, 1820 (2002).
34 T. J. Smith, N. H. Olson, R. H. Cheng, H. Liu, E. S. Chase, W. M. Lee, D. M. Leippe, A. G. Mosser, R. R. Rueckert, and T. S. Baker, J. Virol. 67, 1148 (1993).
35 F. Kozielski, I. Arnal, and R. Wade, Curr. Biol. 8, 191 (1998).
in the initial manual fitting attempt. Only global fitting protocols can address this question.

Fitting Based on Vector Quantization
In this approach the distribution of atoms within the high-resolution structure as well as the low-resolution reconstructions are approximated by a small number of vectors (typically about three to six each)36 that are calculated by vector quantization (VQ), a technique used in data compression for image and speech processing applications.37 The use of a small number of vectors reduces the complexity of the fitting problem to a least-squares fit of two coordinate sets, making the method fast. However, the price that must be paid is the loss of information: VQ is known to be a so-called lossy compression technique in its compression implementation.37 In the fitting context, VQ does not preserve the information content of the density but reduces it substantially. In VQ-based fitting, the best fit is selected according to the lowest root-mean-square deviation (RMSD) between the two fitted vector sets. However, the RMSD of matched vector distributions is known to be a poor estimator for the fitting accuracy of the rest of the aligned volumes.38 In addition, the accuracy of the docking and the usefulness of the RMSD as a scoring function are critically dependent on how well the vector distribution represents the atomic structure and the EM reconstruction at the given resolution. Because the exact placement of the VQ vectors is sensitive to experimental noise,11 the method needs to be used carefully in real cases. To account for conformational changes that occur on complex formation, one can perform an atom-based positional refinement after the initial rigid-body fit.10 The movement of the atoms is constrained by a standard molecular dynamics force field and is subject to enforcing a match of the VQ vector distributions. This essentially deforms the atomic structure with the hope that the deformed structure is a better representation of the EM density than the original structure.
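The least-squares step of the rigid-body fit described above can be sketched as follows. This is a minimal illustration, assuming that the two sets of VQ vectors have already been computed and paired one to one; it uses the standard SVD-based (Kabsch) superposition, which is not necessarily the procedure used in the cited implementations. The function name `kabsch_fit` and its arguments are hypothetical.

```python
import numpy as np

def kabsch_fit(model_vecs, map_vecs):
    """Least-squares rigid-body superposition of two matched 3D vector sets.

    model_vecs, map_vecs: (n, 3) arrays of paired vectors (assumed to be
    precomputed VQ codebook vectors; the pairing is an assumption here).
    Returns rotation r, translation t, and the RMSD after superposition,
    such that model_vecs @ r.T + t best matches map_vecs.
    """
    p = np.asarray(model_vecs, dtype=float)
    q = np.asarray(map_vecs, dtype=float)
    pc, qc = p.mean(axis=0), q.mean(axis=0)       # centroids
    h = (p - pc).T @ (q - qc)                     # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T       # optimal proper rotation
    t = qc - r @ pc                               # optimal translation
    fitted = p @ r.T + t
    rmsd = np.sqrt(np.mean(np.sum((fitted - q) ** 2, axis=1)))
    return r, t, rmsd
```

With only three to six vectors per set, the RMSD returned here measures the agreement of the vectors themselves, not of the underlying densities, which is why a low value does not by itself guarantee an accurate fit.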
The procedure introduces a large number of additional degrees of freedom by allowing the atoms to move relatively freely. This can lead to overfitting artifacts, especially because the information content of the data is already reduced by the VQ procedure. This problem can be reduced somewhat by interactively introducing some additional distance constraints (skeletons) during docking.11 A test of the flexible VQ-based fitting method with error-free calculated data10
36 W. Wriggers, R. A. Milligan, and J. A. McCammon, J. Struct. Biol. 125, 185 (1999).
37 R. Gray, IEEE ASSP Mag. 1, (1984).
38 J. Fitzpatrick, J. West, and C. Maurer, IEEE Trans. Med. Imaging 17, 694 (1998).
shows that, even in the absence of noise, distortions of up to 9 Å in a 15-Å resolution map can occur,10 likely because of an insufficient parameter-to-observable ratio. As a consequence, real-life applications of flexible VQ-based docking tend to lead to unacceptable loss of secondary structure.39 VQ-based fitting is limited to cases in which all density in the EM map is accounted for by the atomic model.11 Missing portions or disordered regions of the atomic model that are present in the EM map need to be modeled and accounted for before VQ-based fitting can be applied. Sometimes the application of iterative rounds of discrepancy mapping14,15 can be used to help alleviate the problem.40 In summary, the high computational speed of the method is appealing, particularly for applications in which accuracy is less of an issue. If accuracy is a concern, the methodology needs to be applied with care in the presence of noise, especially if the flexible fitting option is used. In practice, even rigid-body applications of VQ-based fitting often need to be refined by correlation-based methods in order to make the fit acceptable.41 Several parameters need to be adjusted interactively, making the approach semiautomatic rather than automatic.

Density-Correlation-Based Fitting
Various flavors of automated, global searches using density-correlation measures as fitting criteria are being developed by different laboratories.10,11,14,15,42,43 These flavors vary in the exact mathematical form of the density correlation, the use of various preprocessing steps including masking and filtering, and in implementation details. Masking operations using the calculated envelope42 or using the atomic model directly43 both enhance high-resolution features and suppress low-resolution information, making them somewhat equivalent to high-pass filtering in Fourier space. The amplification of background noise is a common side effect of high-pass filtering.12 Thus, the success of this type of masking will depend strongly on the noise level in the reconstruction. Convolution with a Laplacian operator11 is also known to be sensitive to, and tends to amplify, noise.12
39 W. J. Rice, H. S. Young, D. W. Martin, J. R. Sachs, and D. L. Stokes, Biophys. J. 80, 2187 (2001).
40 S. A. Darst, N. Opalka, P. Chacon, A. Polyakov, C. Richter, G. Zhang, and W. Wriggers, Proc. Natl. Acad. Sci. USA 99, 4296 (2002).
41 M. Kikkawa, E. P. Sablin, Y. Okada, H. Yajima, R. J. Fletterick, and N. Hirokawa, Nature 411, 439 (2001).
42 A. M. Roseman, Acta Crystallogr. D Biol. Crystallogr. 56, 1332 (2000).
43 M. G. Rossmann, Acta Crystallogr. D Biol. Crystallogr. 56, 1341 (2000).
Although these filters have the potential to boost the signal in the absence of noise,11 they must be used with considerable care if noise is present in the reconstruction. Implementation details can dramatically affect the performance and accuracy of the underlying algorithm. For example, using one particular implementation of a density-correlation measure, the fitting becomes severely inaccurate (more than 10 Å root-mean-square difference from correct fit) at resolution lower than 15 Å for error-free data.11 The time for completing a typical docking with this implementation is stated as 6 h. A different implementation of essentially the same scoring function14 yields reasonable results even at 100-Å resolution in the presence of noise (root-mean-square difference from correct fit was 3.5 Å)44 and completes a typical docking task within 5 min on a single-processor Linux box.

Product-Moment Correlation Coefficient as a Fitting Criterion
We now describe in detail a fitting approach based on a global search of parameter space using as a scoring function a real-space, product-moment correlation coefficient (CC).14,15 There are six adjustable parameters involved in the fitting problem of a single rigid body: the three translational parameters x, y, and z; and the rotational parameters α, β, and γ. As a consequence, CC(x, y, z, α, β, γ) is a function of these six parameters. The correlation coefficient used in this fitting approach is defined as

$$\mathrm{CC} = \frac{\sum_i (a_i - \bar{a})(e_i - \bar{e})}{\sqrt{\sum_i (a_i - \bar{a})^2 \, \sum_i (e_i - \bar{e})^2}} \tag{1}$$
where $a_i$ denotes density at voxel i, calculated from the atomic model in the trial position; $e_i$ denotes experimental EM density at voxel i, and the sum is over all i. The overbars denote mean value. The CC is a Pearson-type, product-moment correlation coefficient and is used routinely as a voxel-based similarity measure to align 3D representations of medical data such as those coming from magnetic resonance and computed tomography.45 The CC was shown to be among the best criteria in blind tests using experimental data,46,47 whereas techniques based on surface information performed poorest.48
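The correlation coefficient defined above can be evaluated directly on two density grids sampled on the same voxel lattice. The following is a minimal sketch, assuming the model density has already been calculated on the grid of the EM map for the trial position; the function name and argument names are hypothetical.

```python
import numpy as np

def correlation_coefficient(model_density, em_density):
    """Real-space product-moment correlation coefficient between two maps.

    model_density: density calculated from the atomic model (a_i)
    em_density:    experimental EM density on the same grid (e_i)
    """
    a = np.asarray(model_density, dtype=float).ravel()
    e = np.asarray(em_density, dtype=float).ravel()
    if a.shape != e.shape:
        raise ValueError("density grids must have the same number of voxels")
    da = a - a.mean()                      # deviations from the mean, model
    de = e - e.mean()                      # deviations from the mean, EM map
    denom = np.sqrt(np.sum(da ** 2) * np.sum(de ** 2))
    if denom == 0.0:
        raise ValueError("CC is undefined for a constant density")
    return float(np.sum(da * de) / denom)
```

Because the means are subtracted and the result is normalized, the value does not change when either map is multiplied by a positive constant or shifted by a constant offset, so no explicit density scaling between model and reconstruction is needed for this score.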
44 N. Volkmann, in “Biophysical Discussion,” Asilomar, www.biophysics.org/discussions/volkmann-speaker.pdf (2002).
45 P. Van den Elsen, E. Pol, T. Sumanawaeera, P. Hemler, S. Napel, and J. Adler, in “Proceedings of Visualization in Biomedical Computing,” p. 227. SPIE Press, Rochester, MN, 1994.
Once the global CC search is done, it is followed by a statistical analysis of the CC distribution. Statistical properties of distributions related to the CC are well characterized and can be used to obtain confidence intervals49 that lead eventually to the definition of “solution sets.” These sets contain all fits that satisfy the data within the error margin defined by the chosen confidence level, which implicitly accounts for all error sources in the data and the fitting calculation. Structural parameters of interest such as fitting uncertainty or interaction probabilities50 can be evaluated as properties of these sets. The method can be used to fit modules (domains or substructures) of the assembly individually into the EM reconstruction and thus accounts for conformational changes that can be modeled as relative domain movements.14 Most protein conformational changes involve movements of rigid domains that have their internal structure preserved.51–53 Because the CC is calculated in real space, all types of real-space constraints, including the position of physical labels in the reconstruction such as heavy atom clusters54 or compact protein domains,55 can be easily incorporated into the fitting process by restricting search space accordingly. All information that is contained in the reconstruction is used. No compression, arbitrary cutoffs, or reliance on surface representations are necessary. Biochemical and mutagenesis information can be exploited to assist the fitting process.50

Accuracy of Fitting

To assess the performance of the CC, the crystal structure of the iron-binding 691-residue protein lactoferrin containing iron bound (PDB entry 1LFG) was fit into a density calculated from lactoferrin without iron bound (PDB entry 1LFH). This is a good test system, because it displays a
46 C. Studholme, D. L. Hill, and D. J. Hawkes, in “Proceedings of the British Machine Vision Conference” (D. Pycock, ed.), Vol. 1, p. 27. British Machine Vision Association, Southampton, UK, 1995.
47 C. Studholme, D. L. Hill, and D. J. Hawkes, Med. Image Anal. 1, 163 (1996).
48 J. West, J. M. Fitzpatrick, M. Y. Wang, B. M. Dawant, C. R. Maurer, Jr., R. M. Kessler, and R. J. Maciunas, IEEE Trans. Med. Imaging 18, 144 (1999).
49 R. von Mises, “Mathematical Theory of Probability and Statistics.” Academic Press, New York, 1964.
50 D. Hanein, N. Volkmann, S. Goldsmith, A. M. Michon, W. Lehman, R. Craig, D. DeRosier, S. Almo, and P. Matsudaira, Nat. Struct. Biol. 5, 787 (1998).
51 W. G. Krebs and M. Gerstein, Nucleic Acids Res. 28, 1665 (2000).
52 M. Gerstein and W. Krebs, Nucleic Acids Res. 26, 4280 (1998).
53 S. Hayward, Proteins 36, 425 (1999).
54 J. F. Hainfeld, J. Struct. Biol. 127, 93 (1999).
55 T. G. Wendt, N. Volkmann, G. Skiniotis, K. N. Goldie, J. Muller, E. Mandelkow, and A. Hoenger, EMBO J. 21, 5969 (2002).
substantial conformational change involving three independently moving rigid-body domains. Size-wise it is somewhere in between small (about 300 residues) and large (more than 1000 residues) proteins that, while bound to helical filaments such as actin or microtubules, are accessible by EM and image reconstruction techniques that take advantage of the helical symmetry of the filament scaffold for alignment and averaging. Lactoferrin would be too small for reconstruction techniques that do not rely on symmetry (single-particle reconstruction). The smallest particle solved without symmetry so far is the Arp2/3 complex with about 2000 residues.56 The “gold standard” for the fitting was chosen to be the least-squares fit of the coordinates after dividing the structure into the three rigid-body domains that move relative to each other during the conformational change between 1LFG and 1LFH. This calculation yields a root-mean-square deviation (RMSD) of 0.94 Å if all flexible loops are included. The problem of fitting 1LFG to 1LFH was previously also tackled using the VQ-based flexible fitting approach,10 allowing a direct comparison between the accuracy of the two methods. Modular, CC-based fitting of the atomic model of one lactoferrin conformation into a calculated 15 Å map of the other conformation yields an RMSD of 0.98 Å. This accuracy is essentially the same as that of the least-squares fit using both coordinate sets, our gold standard. The vector-based fitting yielded 4.54 Å without user intervention and 2.72 Å after interactive selection of vectors.10 In problematic regions of the VQ-based fit, the displacement exceeds 9 Å (Fig. 1), making the fit virtually useless for interpretation in that particular region.

Confidence Intervals
It is reassuring that the CC gives such accurate parameter estimates in this test application. However, there are important issues that go beyond the mere finding of the best fitting parameters. Data are generally not exact. They are subject to measurement errors (noise). Thus, typical data never exactly fit the model, even when that model is correct. We need to assess whether or not a model is appropriate, that is, we need to test the goodness-of-fit against some useful statistical standard. The statistical properties of the CC distribution are well characterized.49 In particular, the distribution derived by Fisher’s z-transformation of the CC follows a Gaussian distribution. We can use this fact to estimate confidence intervals. Fisher’s z-transform of the CC is defined by
56 N. Volkmann, K. J. Amann, S. Stoilova-McPhie, C. Egile, D. C. Winter, L. Hazelwood, J. E. Heuser, R. Li, T. D. Pollard, and D. Hanein, Science 293, 2456 (2001).
Fig. 1. Comparison of fits to a calculated map of lactoferrin. The density was calculated at a resolution of 15 Å. The gray chain corresponds to the iron-free conformation that was used to calculate the density. The white chain corresponds to the fitted iron-bound conformation. (a) Correlation-based modular fitting (RMSD 0.98 Å). (b) Vector quantization-based flexible fitting using molecular mechanics rules (RMSD 2.72 Å).10 The blow-up in (b) shows a problematic region in the vector quantization fitting.
$$z = \frac{1}{2} \log\left(\frac{1 + \mathrm{CC}}{1 - \mathrm{CC}}\right) \quad \text{and} \quad \sigma_z = \frac{1}{\sqrt{N - 3}} \tag{2}$$
$\sigma_z$ denotes the standard deviation of the z distribution and N is the number of independent pieces of information in the data (degrees of freedom). N cannot be estimated easily from the EM density because neighboring voxels are not independent, especially if we use oversampling. However, $\sigma_z$ can be estimated directly from the data if we have more than one data set for the fitting problem. If only a single data set exists it can be split in two, as commonly done for estimation of EM resolution.57 This is different from the crystallographic cross-validation procedures (e.g., free R factor) that rely on splitting the data into a small test set that is omitted from refinement and a working set used for refinement. Here, we try to generate two independent structures from the same data by splitting it in two (or
57 J. Frank, “Three-Dimensional Electron Microscopy of Macromolecular Assemblies.” Academic Press, San Diego, CA, 1996.
more) equal-sized parts and repeat the complete fitting procedure for each of the independent sets. This will give us a CC distribution with a defined standard deviation (related to $\sigma_z$). Once we have an estimate for $\sigma_z$, we can use the complementary error function to test the hypothesis that a particular $CC_i$ is significantly different from the maximum $CC_{\max}$:

$$P_i = \operatorname{erfc}\left(\frac{z_{\max} - z_i}{2\sigma_z}\right) \tag{3}$$

$z_{\max}$ denotes the z value derived from $CC_{\max}$, $z_i$ that from $CC_i$. erfc denotes the complementary error function. Now, by choosing a particular confidence level Conf, we can define the solution set {S} as

$$\{S\} \text{ is the set of all } m(x, y, z, \alpha, \beta, \gamma) \text{ for which } 1 - P_m \le \mathrm{Conf} \tag{4}$$
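The chain from correlation values to a solution set can be sketched in a few lines. The sketch assumes that Eq. (3) has the form P_i = erfc[(z_max − z_i)/(2σ_z)] and that membership in the solution set is decided by 1 − P_m ≤ Conf; the function names are hypothetical, and the standard deviation σ_z is taken as an input (estimated either from split data sets or from an assumed number of degrees of freedom N).

```python
import math

def fisher_z(cc):
    """Fisher z-transform of a correlation coefficient; requires -1 < cc < 1."""
    return 0.5 * math.log((1.0 + cc) / (1.0 - cc))

def sigma_z_from_n(n):
    """Standard deviation of z for n degrees of freedom (n > 3)."""
    return 1.0 / math.sqrt(n - 3.0)

def p_value(cc_i, cc_max, sigma_z):
    """Probability that cc_i is NOT significantly different from cc_max."""
    return math.erfc((fisher_z(cc_max) - fisher_z(cc_i)) / (2.0 * sigma_z))

def solution_set(ccs, sigma_z, conf=0.995):
    """Indices of all fits whose CC satisfies 1 - P <= conf."""
    cc_max = max(ccs)
    return [i for i, cc in enumerate(ccs)
            if 1.0 - p_value(cc, cc_max, sigma_z) <= conf]
```

A call such as `solution_set(ccs, sigma_z_from_n(n), conf=0.995)` then returns the fits that cannot be distinguished from the best-scoring one at the 99.5% level; raising `conf` can only enlarge the returned set.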
A model m in the orientation (x, y, z, α, β, γ) is an element of {S} if $z_m$ is not significantly different from $z_{\max}$.

Interpretation and Analysis of Solution Sets

In this context, the confidence level can be interpreted as the likelihood of finding the truly correct (or best possible) fit inside the corresponding solution set. In general, the higher the confidence level, the more solutions must be taken into account. In the limiting case, to achieve a likelihood of 1.0 (confidence level, 1.0), that is, to be absolutely sure that the best solution is included, all possible solutions must be included. Otherwise there is always the possibility that the one solution excluded was the correct one. The more restrictive the solution set, the lower the likelihood of having the best fit included. At the other extreme, the confidence level is zero when we rely on a single solution (even the one with the best score). This makes sense: the likelihood of picking exactly the correct fit, not even 0.000001 of an angstrom off, is practically zero, no matter how good the fitting procedure. The quality of the fitting can be assessed by analyzing the size (volume) and shape of the corresponding solution set in six-dimensional fitting space. These properties of solution sets and their dependence on the confidence level can be best visualized by plotting $P^{1/2}$ as a function of one of the six docking parameters (termed confidence plots; see Fig. 2). The shape of the sets can be arbitrary and gives information about degeneracies in the fitting. For example, presuming the center of mass is well defined and looking only at the three orientational parameters (α, β, γ), a well-defined fit would yield a spherical solution set (with a small volume). If there is a 2-fold rotational symmetry (or pseudo-symmetry) present, the solution set would consist of two disjoint regions. If the fitting is less well defined around
Fig. 2. Application of the z-transform. This shows an angular scan through the correlation landscape (a) from a docking experiment using a helical reconstruction of microtubules decorated with kinesin and the corresponding atomic model. There are two local maxima showing up in this scan, at 0° (correct solution) and 90°. The correlation (CC) does not fall below 0.7 during the whole scan. The correlation plot is difficult to interpret. It is difficult to tell whether the secondary maximum should be considered a valid solution and what the angular uncertainty would be. The use of the z-transform yields a confidence plot (b) that is much more straightforward to interpret. Plotted is the square root of the confidence measure P [Eqs. (3) and (4)]. The dashed line indicates a confidence level (1 − P) of 99.5%. Every angle that has a value above that line is part of the solution set; every angle with a value below the line can be considered significantly different from the best solution at this confidence level and is not part of the solution set. The angular uncertainty can be estimated from the width of the solution set (here, about 15°). If the confidence level is raised, the line will move down and more solutions will need to be considered. In this example, increasing the confidence level to 99.9% would move the line down far enough so that the secondary solution needs to be considered. At the 99.5% confidence level, the secondary solution does not need to be considered.
one of the axes, the solution set would take the shape of an ellipsoid; the permissible solutions would be smeared out around that axis. A good tool for analyzing the solution sets and for detecting 6D degeneracies is cluster analysis of the best few hundred refined solutions from the global fitting protocol. We use the pairwise RMSD between these fits as a distance measure in a modified agglomerative hierarchical clustering approach. The major modification from the original procedure58 consists of weighting by the individual fitting scores during the cluster merging step. The procedure sorts the fits into clusters with similar orientations and positions biased toward the solution with the highest score in that particular cluster. This allows efficient detection of local correlation maxima. An inspection of the resulting dendrogram (Fig. 3) gives a good indication for the size and partitioning of the respective solution set.
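The idea of grouping fits by pairwise RMSD can be illustrated with a plain agglomerative clustering. Note that this sketch uses ordinary average linkage with a distance cutoff and does not reproduce the score-weighted merging step described above; the scores are used here only to rank the members of each cluster so that the first member is the local maximum. The function name, arguments, and cutoff are hypothetical.

```python
import numpy as np

def cluster_fits(dist, scores, cutoff):
    """Average-linkage agglomerative clustering of fits.

    dist:   (n, n) symmetric matrix of pairwise RMSDs between fits
    scores: length-n sequence of fitting scores (e.g., CC values)
    cutoff: merging stops once the closest clusters are farther apart

    Returns a list of clusters; each cluster is a list of fit indices
    sorted by decreasing score.
    """
    dist = np.asarray(dist, dtype=float)
    clusters = [[i] for i in range(len(scores))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean inter-member RMSD.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return [sorted(c, key=lambda i: -scores[i]) for c in clusters]
```

Each returned cluster then corresponds to one candidate local maximum, and the number of solution-set members per cluster gives a first impression of how the set is partitioned.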
58 R. Sokal and P. Sneath, “Principles of Numerical Taxonomy.” W. H. Freeman, San Francisco, 1963.
Fig. 3. Cluster analysis of the correlation distribution. The mean distance between all atoms of the fitted structures is used as a distance criterion for the clustering. The relative strength of the correlation between the corresponding solution and the density is indicated below the zero-distance line of the dendrogram. The higher the correlation, the longer the bar. Any solution with a bar extending into the white area is part of the solution set. The cluster analysis finds local correlation maxima and identifies degeneracies in the fitting. Each cluster corresponds to a local maximum. The various maxima often relate to each other by low-resolution pseudo-symmetry. Of the four local maxima identified in this analysis, three are part of the solution set (A, C, and D). The correct maximum (here subcluster D) is usually associated with a subcluster that has a large number of solution set members with a small mean distance between them. This example is taken from docking the atomic structure of the kinesin motor ncd into the corresponding density isolated from helical reconstructions of ncd-decorated microtubules.55
Once solution sets are determined, parameters of interest are extracted from the sets with all members contributing equally. For example, the center-of-mass position of the fitted model and its experimental uncertainty can be estimated by calculating the mean and standard error of the center-of-mass positions for all solution set members. Similarly, the orientation parameters can be extracted. Because estimates for the expectation value as well as the standard deviation are available, standard statistical tests such as the Student t test can be used to test for the significance of differences between solution sets.14

Robustness of Solution Sets

The final outcome of the fitting procedure described here is a solution set. Once we have decided on a suitable confidence level, we are no longer interested in the actual values of the CC. Also, the exact location of the global maximum is not of major interest. The only thing one needs to know is whether a particular CC is significantly different from the global CC
maximum. In other words, is this CC a member of the solution set or not? The size and shape of the solution set determine the outcome of the structural interpretations (e.g., fitting uncertainties, interaction probabilities, and so on). It is therefore important to understand how robust the size and shape of the estimated solution set are. In particular, the potential difference in scale between the EM reconstruction and the atomic model has been perceived as a factor that can significantly influence fitting results.43 Note that the solution set will not change if we only shift the global maximum within the set. The solution sets are always more robust estimators than the actual maximum. In the solution set approach, all error sources are automatically accounted for without the need for explicit error modeling. Reduction of experimental errors will show immediately in the size and shape of the solution sets (Fig. 4). The influence of various factors on the size and shape of solution sets was assessed by using calculated data from lactoferrin (1LFH). Because error-free data were used for this assessment, an experimental estimate of $\sigma_z$ cannot be obtained. Instead, the value of N was estimated. A crystallographic FFT (space group P1) was calculated using a cubic box with twice the maximum diameter of the molecule. With this sampling, the structure factors of the molecular transform should be spatially uncorrelated; any higher sampling (larger box) will result in correlation between the structure factors. The number of structure factors at this sampling should therefore be a fair estimate of the degrees of freedom involved in the molecular transform (which is equivalent to the real-space density by Fourier transform). Comparison with experimentally derived $\sigma_z$ from experimental actomyosin data14 indeed indicated that the number of structure factors up to the resolution in question can be used as a rough estimate for N.
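The counting argument above can be made concrete with a short sketch. It counts the reciprocal-lattice points of a P1 cubic cell, with an edge of twice the maximum molecular diameter, that lie within the resolution sphere. Whether Friedel-related reflections should be counted once or twice is not specified in the text, so the function returns the full count and leaves that choice to the caller; the function name and arguments are hypothetical.

```python
import math

def count_structure_factors(max_diameter, resolution):
    """Count P1 reciprocal-lattice points (h, k, l) within the resolution limit.

    The cell is cubic with edge 2 * max_diameter (same length unit as
    resolution, e.g., angstroms). For a cubic cell with edge a, a reflection
    has spacing d = a / sqrt(h^2 + k^2 + l^2), so d >= resolution is
    equivalent to h^2 + k^2 + l^2 <= (a / resolution)^2.
    The origin (0, 0, 0) is excluded.
    """
    edge = 2.0 * max_diameter
    hmax = int(math.floor(edge / resolution))
    limit = (edge / resolution) ** 2
    total = 0
    for h in range(-hmax, hmax + 1):
        for k in range(-hmax, hmax + 1):
            for l in range(-hmax, hmax + 1):
                if 0 < h * h + k * k + l * l <= limit:
                    total += 1
    return total
```

The resulting count can be inserted for N in Eq. (2) to obtain a rough value of the standard deviation of z; halving the count would correspond to treating Friedel mates as a single observation.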
A potential problem with using this approximation in practical applications is that the resolution of EM reconstructions is not always well defined. In crystallography, inspection of the X-ray diffraction pattern gives a good indication of where the resolution limit is exceeded and the signal disappears (no more spots). EM reconstructions are not necessarily subsampled in Fourier space but are based on continuous Fourier transforms. In such a case, it is much less straightforward to determine when the signal disappears into the noise. Therefore, an experimental determination of z is preferable because this is independent of the resolution estimation and also potentially accounts for undetected systematic errors. To test the influence of potential disrupting parameters on the solution sets, the calculated data were perturbed accordingly and then the CC distribution was reevaluated using the model for 1LFH as derived from modular fitting. The 6D volumes of the solution sets were roughly spherical with respect to orientation and position. Thus, the volume of the solution sets can
Fig. 4. Influence of various parameters that can be potential sources of systematic errors on solution sets. Shown are scans of the square root of the confidence measure P [Eqs. (3) and (4)] along an arbitrary translation direction (confidence plots). The dotted line parallel to the x axis corresponds to a confidence level of 99.5%. Everything above that line is a member of the solution set defined by that confidence level. The zero on the x axis denotes the position of the correct fit. The dashed graph represents the confidence plot without perturbation at 15 Å. The solution set radius is the x projection of the point where the confidence line (dotted line for 99.5%) and the graph cross. The solution set radius for unperturbed data at 15 Å is about 0.5 Å. (a) Influence of perturbing parameter N of Eq. (2). (b) Influence of scaling errors. The step size is 1%. (c) Influence of errors in reciprocal-space amplitude fall-off estimation. Step size, 50 Å². (d) Influence of resolution limitation. Step size, 5 Å. (e) Additive Gaussian noise. Step size is 0.1 of mean protein density. SNR denotes signal-to-noise ratio. (f) Influence of lowering the contrast between protein and surroundings. Step size, 0.2 of mean protein density.
be accurately described by a radius. In the following we parameterized this radius in terms of permitted translational displacements within the solution sets. Two types of behavior can be distinguished (Fig. 4).

1. Most of the parameters display an approximately linear relationship with the solution set radius (solution set radius increase, RI). These parameters include scaling (0.1-Å RI per 1% misestimation), reciprocal-space intensity fall-off (0.1-Å RI per 50-Å² misestimation in B-factor), resolution (1-Å RI per 5-Å loss in resolution), and additive Gaussian random noise (0.1-Å RI for every 10% of mean protein density).
2. Some parameters do not change the sets at all up to a certain threshold value; beyond it, the effect is dramatic. These parameters include sampling (smaller than one-fifth the resolution: no effect) and contrast (as long as the outside density is below the mean protein density: no effect).

The parameter N deserves special attention because it is actually used in calculating the confidence interval [Eqs. (2) and (3)]. Reevaluation of the solution sets using 2N and 0.5N in the formula indicates that this changes the solution set radius by only about 0.1 Å in either case (Fig. 4a). A misestimation of a factor of two in N (or a corresponding misestimation of z) is not critical. An approximate N is fully sufficient to generate meaningful solution sets. All in all, the solution set size and shape are remarkably robust concerning these parameters, especially if one considers the likely size of those errors in practical applications. Because of the large amount of averaging in most EM reconstruction techniques, the signal-to-noise level should be well above 5, meaning that the corresponding Gaussian noise would increase the solution set radius by less than 0.5 Å. The misestimation in reciprocal-space fall-off is not likely to be much worse than 200 Å² (RI < 0.4 Å). Scaling (i.e., how to convert the voxel size in an EM reconstruction into angstroms) is usually accurate within 1–3%59 (therefore RI < 0.3 Å) and can be improved if additional information (such as the known layer-line positions of filamentous structures added to the sample) is introduced.59 The solution set concept, in conjunction with real-space-correlation scoring and evaluation of confidence intervals, leads to more meaningful parameter and uncertainty estimates. The size of the solution set can serve as a normalized goodness-of-fit criterion. The smaller the set, the better the data determine the position of the fitted atomic structure.
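The linear sensitivities quoted above lend themselves to a quick error-budget calculation. The sketch below simply encodes those four slopes and evaluates them for the error sizes mentioned in the text; the dictionary keys and function name are hypothetical, the linear regime is assumed to hold over the whole range, and treating a signal-to-noise ratio of 5 as noise of 20% of the mean protein density is an assumption made only for illustration.

```python
# Sensitivities quoted in the text: radius increase (RI, in angstroms)
# per unit of misestimation, assuming the approximately linear regime.
RI_PER_UNIT = {
    "scaling_percent": 0.1 / 1.0,         # 0.1 A per 1% scale error
    "bfactor_A2": 0.1 / 50.0,             # 0.1 A per 50 A^2 fall-off error
    "resolution_loss_A": 1.0 / 5.0,       # 1 A per 5 A loss in resolution
    "noise_percent_of_mean": 0.1 / 10.0,  # 0.1 A per 10% of mean density
}


def radius_increase(kind, amount):
    """Return the estimated solution set radius increase in angstroms."""
    return RI_PER_UNIT[kind] * amount


if __name__ == "__main__":
    # Error sizes taken from the discussion above; the 20% noise figure
    # is this sketch's reading of a signal-to-noise ratio of 5.
    examples = [
        ("scaling_percent", 3.0),
        ("bfactor_A2", 200.0),
        ("noise_percent_of_mean", 20.0),
    ]
    for kind, amount in examples:
        print(kind, round(radius_increase(kind, amount), 2))
```

The individual contributions are reported separately because the text does not state how they combine when several error sources are present at once.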
The statistical nature of the approach allows the use of standard statistical tests, such as the Student t test, to evaluate differences between models of assemblies in different functional states and to gain deeper insight into biological problems than previously possible.

Validation of Results
If the resolution is high enough to resolve residues, and a structure is fit into such a map, one can employ tests for validity based on stereochemistry. At a resolution lower than 10 Å, no side chains or even secondary structure elements can be recognized in the density map. No mechanism exists that can be used to validate fits into maps with resolution lower than 10 Å; there is no molecular Ramachandran plot to help evaluate the quality of a fit. Evaluation of close contacts has been proposed43 but is of limited

59 R. A. Milligan, M. Whittaker, and D. Safer, Nature 348, 217 (1990).
use because substantial local conformational changes close to the interface often occur on complex formation.34 If independent information from labeling, mutagenesis, or other biophysical and biochemical experiments is available, these data can be used to validate the fitting. If the fitting is ambiguous, this additional information can be used actively in the fitting in order to resolve these ambiguities.50 If no external information is available, the only validation tools that can be used are checks for self-consistency, using multiple data sets or by splitting the original data set into multiple parts.

Application Examples
In practice, each docking problem will tend to be different in one way or another, and it is essential to keep in mind that the main objectives are to extract the maximum amount of reliable information while avoiding overinterpretation of the underlying data. In the following sections we describe two application examples to demonstrate the use of the solution set concept and the modular docking approach toward this end.

Actin-Bound Smooth Muscle Myosin

Myosins are a superfamily of actin-based molecular motors, ubiquitous in animal cells. Interaction of myosin with filamentous actin has been implicated in a variety of biological activities including muscle contraction, cytokinesis, cell movement, membrane transport, and certain signal transduction pathways. The filamentous nature of actomyosin complexes has so far hampered all crystallization attempts but also makes this structure an ideal target for EM and helical reconstruction techniques. We analyzed two different strong-binding states of actomyosin (ADP and nucleotide-free, rigor) by electron cryomicroscopy and helical reconstruction.14 The resulting 3D maps had a resolution of approximately 21 Å. Several crystal structures of myosin fragments (more than 20) in different conformations and from different organisms are available. These include a total of three different conformations of fragments with similar length and composition as that used for the EM experiment. Docking of these crystal structures into the EM reconstructions showed that both actin-bound conformations are distinctly different from all unbound myosin conformations imaged by crystallography.14 Analysis of the crystal structures allowed us to define rigid-body domains that move independently during the conformational changes observed by crystallography. The two most significant rigid bodies are the motor domain (MD), which contains the actin-binding interface,
and the light-chain domain (LC) that is believed to act as a lever arm during force production. We divided the structure into three rigid-body domains (MD, LC, and a small domain called converter) and used the modular fitting approach to model the two actin-bound conformations. A movie of the modular fitting procedure applied to the rigor reconstruction can be viewed at www.burnham.org/papers/actoS1/modu.mov. We used multiple data sets and crystal structures for validation of the solution sets. For example, the fitting of the MD module into the rigor reconstruction was repeated for 12 of the available crystallized MD fragments and into 4 different maps, making 48 independent docking experiments. The resulting solution sets are of well-defined, near-spherical shape and are essentially identical (by t test) for all 48 docking experiments (RMSD, 2.9 Å). Similarly, the docking into the ADP maps results in well-defined, identical solution sets for all 12 crystal structures (RMSD, 2.2 Å). However, a t test between the various rigor and ADP solution sets always shows a significant difference for the MD docking. The difference corresponds to a 9° rotation, a relatively subtle change at 21-Å resolution that went undetected by a previous study with similar-quality reconstructions of the same constructs.60 The use of solution sets and the associated statistical tests was essential in detecting this movement. The solution sets for the subsequent LC docking were less well defined than those for MD. The main contribution to the spread of solutions was a rotational uncertainty parallel to the actin filament axis (Fig. 5). The amount of uncertainty correlates with the distance from the MD connection. The RMSD of the connection point itself was similar to the RMSD in the respective MD. Note that there was no restriction for the movement of this connection point.
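The comparison of two solution sets by a t test can be sketched as follows. This is only an illustration under stated assumptions: it uses Welch's unequal-variance form of the t test rather than any specific variant named in the text, the function names are hypothetical, and the angle values in the usage example are invented numbers, not data from the actomyosin study.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean and standard error of one fit parameter over a solution set."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    mean_a, sem_a = mean_and_standard_error(sample_a)
    mean_b, sem_b = mean_and_standard_error(sample_b)
    var_a, var_b = sem_a ** 2, sem_b ** 2
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, dof


if __name__ == "__main__":
    # Invented rotation angles (degrees) for members of two solution sets.
    rigor = [0.4, -1.1, 0.9, 0.2, -0.6, 1.3]
    adp = [8.2, 9.9, 9.1, 7.8, 10.3, 8.7]
    t, dof = welch_t(rigor, adp)
    print("t =", round(t, 2), "dof =", round(dof, 1))
```

The resulting statistic would then be compared with the critical value of the t distribution for the chosen significance level and the computed degrees of freedom; that lookup is left out here to keep the sketch free of external libraries.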
The fact that the beginning of the LC domain is well determined and the end allows more variation indicates that the LC can adopt multiple conformations that are pivoted at the same place in the MD and consist primarily of a rotational freedom parallel to the filament axis. This interpretation is independently supported by a variance analysis for helical reconstructions61 applied to smooth muscle actomyosin.14 We tested the Laplacian filter on the MD docking of the rigor data sets. Consistent with results with calculated densities,62 the absolute differences between the correlation values were larger than for density-only correlation. However, analyzing the two correlation distributions for confidence levels using the z-transform reveals that the solution set size is actually

60 M. Whittaker, E. M. Wilson-Kubalek, J. E. Smith, L. Faust, R. A. Milligan, and H. L. Sweeney, Nature 378, 748 (1995).
61 L. E. Rost, D. Hanein, and D. J. DeRosier, Ultramicroscopy 72, 187 (1998).
62 P. Chacon and W. Wriggers, J. Mol. Biol. 317, 375 (2002).
Fig. 5. Solution set for actin-bound smooth muscle myosin.14 (a) Representation of the solution set (several representative solutions) within the experimental density for the rigor state. The MD and LC domains were fitted independently, using a modular docking approach. The line shows the approximate location of the interface between MD and LC. The uncertainty within the MD is homogeneous; the uncertainty in the LC correlates with the distance from the MD/LC interface. (b) Component of RMSD from rotations perpendicular (dashed line) and parallel (solid line) to the actin filament axis as functions of the distance from the MD/LC interface. The plot indicates that the main component of the uncertainty is rotational, parallel to the filament axis and pivoted close to the MD/LC interface.
somewhat larger (0.5 Å) for the data subjected to Laplacian filtering. Thus, there is no advantage in using a Laplacian filter for these data.

Actin-Binding Domain of Fimbrin

Fimbrin is a member of a large superfamily of actin-binding proteins and is responsible for cross-linking of actin filaments into ordered, tightly packed 3D networks such as actin bundles in microvilli or stereocilia of the inner ear. Similar to actomyosin, the tendency of this complex to form higher-order structures hampers crystallization but also makes it a good candidate for EM and image analysis. Helical reconstructions of actin decorated with one of the actin-binding domains of fimbrin yielded 3D maps at about 25-Å resolution.63 A crystal structure of the same actin-binding

63 D. Hanein, P. Matsudaira, and D. J. DeRosier, J. Cell Biol. 139, 387 (1997).
domain of fimbrin lacking the N-terminal 110 residues (of a total of 375 residues)64 was docked into the fimbrin portion of the maps to elucidate the interactions between fimbrin and actin and to locate the missing N-terminal domain.50 The analysis of the solution sets for this docking problem indicated a partitioning of the set into two separate subsets and an additional degeneracy along the long axis of the density (Fig. 6). According to the confidence level analysis, the two subsets are equally likely to contain the correct solution and the angle around the long axis is arbitrary. A closer examination of the solution sets using cluster analysis reveals that the solutions are arranged in four subclusters, clustering around four distinct local maxima, three of which are part of the solution set. These maxima relate to each other by low-resolution pseudo-symmetry around the principal axes of the molecule (Fig. 6b). To resolve the ambiguity of this docking, we constructed an additional scoring function based on biochemical and mutagenesis data.50 Using this function, we were able to rule out one of the solution subsets and to restrict the angular range of the second subset.50 The correct subcluster turned out not to contain the absolute maximum but only the third highest local maximum. We repeated the analysis for different resolution ranges to validate these results and check consistency. The organization of the solutions into four subclusters occurred for all resolutions between 25 and 40 Å, and the correct subcluster always came out as the second or third highest local maximum. We tested the use of Laplacian filtering for this docking problem in order to assess its performance in a real-life difficult case (Fig. 6c). The Laplacian-based docking partitioned into three subclusters. The first subcluster corresponds to the absolute maximum of the density-only docking.
Surprisingly, the second highest local maximum corresponds to a solution that places the molecule partially outside the density. The third maximum corresponds to the second highest maximum of the density-only docking. Only the first maximum is part of the Laplacian solution set. The correct local maximum does not show up at all. Lowering the resolution improves the situation somewhat. Maxima that place the molecule partially outside the density are less frequent and the correct maximum shows up occasionally. However, apart from the absolute maximum, neither the partitioning of the clusters nor the maxima are consistent across resolution ranges. This analysis clearly shows the dangers of Laplacian filtering. If one were to rely on an analysis using this technique, one would have to conclude that the absolute maximum is the correct solution. However, this solution is incompatible with mutagenesis and biochemical data.

64 S. C. Goldsmith, N. Pokala, W. Shen, A. A. Fedorov, P. Matsudaira, and S. C. Almo, Nat. Struct. Biol. 4, 708 (1997).
Fig. 6. Docking of the actin-binding domain of fimbrin into density from helical reconstructions of an actin–fimbrin complex.50 (a) Analysis of the solution set from density-based docking. Top: Cluster analysis. It indicates the existence of four main local maxima (mean distance less than 10 Å), three of which (X, Y, and Z) are part of the solution set. These three maxima relate to each other by 180° rotations around the principal axes shown in (b) (top): X is related to Y through one axis, Y to Z through a second, and Z to X through the third. Analysis of biochemical and mutagenesis data indicates that Z is the correct maximum. The orientation of Z is shown with the principal axes and the location of the two CH domains in (b) (top). The other three parts of (b) show the orientation of X, Y, and Z within the density of the helical reconstruction. The middle portion of (a) shows the confidence plot of a scan around the long axis of the density. Solution Z appears at 0° and solution X at 180°. The dashed line corresponds to a confidence level of 99.5%. There is no orientation that can be safely ignored, even if the confidence level were lowered (the dashed line raised) considerably. This axis is degenerate. The bottom part of (a) shows the confidence plot around the principal axis that relates Z and Y. Solution Z appears at 0° and solution Y at 180°. The dashed line corresponds to a confidence level of 99.5%. The solution set partitions into two distinct subsets around Z and Y. (c) Results of Laplacian filter-based docking. Top: Cluster analysis; the remaining portions show the three resulting local maxima (U, V, and W) within the experimental density. Only cluster U is part of the Laplacian-based solution set according to the analysis of the correlation distribution.
If those data were not available, the mistake would go unnoticed. The solution set analysis of unfiltered data, on the other hand, clearly exposes the degeneracy in the docking and identifies the problem.

Conclusions
Apart from interactive manual fitting, a number of semiautomatic and automatic procedures are available for the fitting of atomic models into reconstructions from electron microscopy. Vector quantization leads to fast algorithms but suffers from potential inaccuracies. Correlation-based approaches are somewhat slower, but tend to be more accurate. Masking and filtering operations can enhance the signal-to-noise ratio in the correlation under favorable circumstances but are also more sensitive to noise artifacts. The concept of solution sets leads to the possibility of defining confidence intervals and error margins for the fitting parameters. Because no independent information is available on how a correct fit should look, definition of confidence intervals and error margins is particularly important in the context of docking atomic structures into low-resolution density maps. For validation of results, tests on self-consistency (cross-validation) using multiple independent reconstructions or splitting of the original data set in two are options.

Acknowledgments

This work was supported by NIH research grants AR47199 (D.H.), U54 GM64346 (Cell Migration Consortium; D.H., N.V.), and GM64473 (N.V.).
[11] ARP/wARP and Automatic Interpretation of Protein Electron Density Maps

By Richard J. Morris, Anastassis Perrakis, and Victor S. Lamzin

Introduction
X-ray crystallography has become a routine tool to aid the investigation of biological phenomena at the atomic level. Once phase information becomes available for the measured structure factor amplitudes, a three-dimensional image of the diffracting electronic matter may be computed. From this electron density distribution a chemically sensible model of the molecule must be derived. However, the initial phase estimates are often poor. Model building affords a means to improve these phases by providing a set of atoms whose parameters may be refined according to some optimization residual and a set of stereochemical restraints, the latter being needed for the refinement to proceed smoothly against diffraction data extending to less than atomic resolution. In this chapter phase improvement, coupled with automated map interpretation and model building, is presented as one unified process within the framework of the Automated Refinement Procedure, ARP/wARP, software suite.1

Theory
Pattern Recognition

Map interpretation and model building are pattern recognition problems that consist of mapping features of an electron density distribution onto a chemical model of the molecule under study. These features and their significance depend on the information content of the data, which necessarily depends on the resolution of the diffraction pattern. For resolution around 1.8 Å or higher (electron density maps of good quality, in which density peaks correspond well to atomic centers), map interpretation is an exercise in connecting points to produce well-known covalent geometry. At medium resolution, about 2.5–3.5 Å, atoms lose their individuality and connectivity becomes the important feature to search for.
1 V. S. Lamzin, A. Perrakis, and K. S. Wilson, “International Tables for Crystallography: Crystallography of Biological Macromolecules” (M. Rossmann and E. Arnold, eds.), p. 720. Kluwer Academic, Dordrecht, The Netherlands, 2001.
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
In the low-resolution range, map interpretation may simply reduce to distinguishing between a macromolecule and its surrounding solvent. In general, pattern recognition2 may be seen as a mapping, $\Phi(\mathbf{f}) = \omega_k$, that takes a continuous input variable $\mathbf{f}$ and assigns it to one class $\omega_k$ of a finite set of (predefined) output classes $\Omega = \{\omega_i \mid i = 1, \ldots, C\}$. $C$ is the cardinality of the set, $|\Omega|$, which is the number of chosen classes. The pattern recognition function $\Phi$ maps feature space $F$ to the classification set $\Omega$. The goal is to extract from a wealth of information only those properties that are interesting to the problem. The set of features chosen to drive classification is often referred to as a feature vector $\mathbf{f} = (f_1, f_2, \ldots, f_D)$. $D$ is the dimension of feature space, which is the number of features chosen. For map interpretation at medium resolution, classification space may consist of the classes {helix}, {strand}, {loop}, and {nonprotein}. An appropriate feature vector may consist of properties such as the number of distances between density maxima, minima, and saddle-points, moments of inertia, and other moments of the electron density distribution. The actual values of a feature vector will be denoted by $\mathbf{x} = (x_1, x_2, \ldots, x_D)$ and will be called the observed feature vector. An observed feature vector $\mathbf{x}$ belongs to class $\omega_k$ if and only if the posterior probability of that class is greatest, $P(\omega_k \mid \mathbf{x}) = \max_i P(\omega_i \mid \mathbf{x})$, where $P(\omega_i \mid \mathbf{x}) = n\,P(\mathbf{x} \mid \omega_i)\,P(\omega_i)$, in which $n$ is a normalizing multiplication factor, $P(\mathbf{x} \mid \omega_i)$ is the probability (likelihood) of the class $\omega_i$ being able to reproduce the observed feature vector $\mathbf{x}$, and $P(\omega_i)$ is the prior expectation of this class being observed. The general problem of electron density map interpretation is to find features that can be mapped onto model classes, together with the appropriate mapping functions. For optimal recognition, the features should be as distinct as possible between the individual classes of a given set.
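The maximum-posterior rule just described can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: the class-conditional likelihoods are taken to be independent Gaussians, the two unnamed features and all means, widths, and priors are invented numbers, and the function names are hypothetical. It is not the classifier used in ARP/wARP.

```python
import math


def gaussian_likelihood(x, mean, sigma):
    """P(x | class), assuming independent Gaussian features."""
    p = 1.0
    for xi, mi, si in zip(x, mean, sigma):
        p *= math.exp(-0.5 * ((xi - mi) / si) ** 2) / (si * math.sqrt(2.0 * math.pi))
    return p


def posteriors(x, classes):
    """Return {class name: P(class | x)} by Bayes' rule."""
    unnormalized = {
        name: gaussian_likelihood(x, c["mean"], c["sigma"]) * c["prior"]
        for name, c in classes.items()
    }
    norm = sum(unnormalized.values())  # the factor n in the text is 1 / norm
    return {name: value / norm for name, value in unnormalized.items()}


def classify(x, classes):
    """Assign x to the class with the greatest posterior probability."""
    post = posteriors(x, classes)
    return max(post, key=post.get)


# Invented class models for two illustrative features.
CLASSES = {
    "helix": {"mean": (1.5, 5.5), "sigma": (0.3, 0.5), "prior": 0.30},
    "strand": {"mean": (3.3, 6.5), "sigma": (0.3, 0.5), "prior": 0.25},
    "loop": {"mean": (2.5, 6.0), "sigma": (0.8, 1.0), "prior": 0.25},
    "nonprotein": {"mean": (2.5, 4.0), "sigma": (1.5, 2.0), "prior": 0.20},
}

if __name__ == "__main__":
    observed = (1.6, 5.3)
    print(classify(observed, CLASSES), posteriors(observed, CLASSES))
```

In this form the priors act as a bias toward classes expected to be common, so a broad class such as the invented "nonprotein" model only wins where the narrower classes have negligible likelihood.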
A good feature for classification will therefore have a large variance over the entire classification set, or equivalently, different mean values between the classes, and small variances within each class. A standard approach for finding useful features on which to drive classification is initially to choose a large set of all conceivable characteristics. Techniques such as Principal Components Analysis (PCA) may then be employed to analyze the importance of these feature parameters and to introduce new features of higher classification power as linear combinations of the original ones. By choosing only the most significant new features in terms of information content, a large reduction in the dimension of feature space often can be achieved. The abundance of structures in the Protein Data Bank3,4 offers sufficient
2 K. Fukunaga, “Introduction to Statistical Pattern Recognition,” 2nd Ed. Academic Press, New York, 1990.
flexibility for the application of learning algorithms to develop, validate, and test possible coordinate-based features.

Problems in Automation

What currently hinders many standard modeling software packages from full automation is the need for decisions to be made by the user. Decision making during the process of model building becomes necessary owing to the failure of current implementations to recognize protein fragments correctly in medium-to-poor quality electron density, or when the fragments or atoms are not placed with sufficient accuracy. This point can be readily shown by adding a random coordinate error to the atomic positions of a refined structure and attempting to find the correct connectivity. Based purely on local geometrical criteria, this becomes increasingly difficult as the positions become less accurate, since many pairs of originally nonbonded atoms fall into a valid bonded geometry. This placement error is often caused by poor density. The reasons for poor density are manifold; here we divide them into temporary and permanent. Temporary errors may arise from, for example, poor starting phases from a molecular replacement solution with significant parts of the structure far from the final position or even missing, strong nonisomorphism for MIR, weak anomalous signal in any anomalous scattering technique, and so on. Permanent errors may arise from, for example, disordered loops, nonrandom missing data (e.g., poor resolution; incomplete data), or simply badly measured diffraction data. Both causes will give rise to the same result: the interpretation of the density will be ambiguous and will require a sound knowledge of protein structures and much expertise to be narrowed down successfully to the correct result.
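The effect of coordinate error on connectivity can be demonstrated with a toy calculation. The sketch below is an illustration under stated assumptions rather than a test on a refined structure: it uses an idealized α-helical Cα trace (radius 2.3 Å, rise 1.5 Å, 100° per residue) as a stand-in for real coordinates, accepts a pair as "connected" when its distance is within 0.5 Å of 3.8 Å, and adds isotropic Gaussian noise. All names and parameter values are hypothetical.

```python
import math
import random


def helix_ca_trace(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized alpha-helical C-alpha positions (assumed parameters)."""
    coords = []
    for i in range(n):
        phi = math.radians(turn_deg * i)
        coords.append((radius * math.cos(phi), radius * math.sin(phi), rise * i))
    return coords


def perturb(coords, sigma, rng):
    """Add Gaussian noise of standard deviation sigma to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in xyz) for xyz in coords]


def count_candidate_pairs(coords, target=3.8, tolerance=0.5):
    """Count true and false neighbor pairs accepted by a distance test."""
    true_pairs = false_pairs = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if abs(d - target) <= tolerance:
                if j == i + 1:
                    true_pairs += 1   # sequence neighbors: a real link
                else:
                    false_pairs += 1  # nonadjacent atoms passing the test
    return true_pairs, false_pairs


if __name__ == "__main__":
    rng = random.Random(0)
    trace = helix_ca_trace(30)
    for sigma in (0.0, 0.25, 0.5, 1.0):
        print(sigma, count_candidate_pairs(perturb(trace, sigma, rng)))
```

With exact coordinates only sequence neighbors satisfy the distance test; as the error grows, true links start to be rejected while nonadjacent pairs begin to pass, which is the ambiguity that local geometric criteria alone cannot resolve.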
Many model-building and modeling packages do a good job in finding connectivity (tracing the main chain) in reasonable density, and database routines then can match the main-chain fragments to a given sequence, build the polypeptide backbone, and place in the side chains. In regions of poor density, these methods often break down and one must rely on the eye of an imaginative and experienced crystallographer (an increasingly rare and endangered species) to find the most plausible solution. The above-described distinction between different sources for poor density was made on the basis of our experience with the automated
3 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
4 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
model-building module, warpNtrace,5 of the ARP/wARP package. Through the incorporation of model-building routines into an iterative cycle of refinement, the noninterpretability caused by poor density of the temporary type often can be overcome. If part of the density is correctly interpreted, resulting in a hybrid model consisting of free atoms and the partial model (the atoms that were recognized as part of the protein), subsequent refinement with added restraints on the partial model will provide the next interpretation step with better phases. To initiate this iterative approach, one must have methods that are capable of building parts of a model with a reasonable accuracy into density calculated from initial phase estimates. Although a variety of numerical methods have been proposed for dealing with the identification problem in poor density, the most straightforward is to lower the threshold criteria for acceptance of, for example, connectivity, atoms, fragments, and so on. The price to pay for this simplicity is that often a significant number of false positives are thereby introduced, causing the route of the main chain to become ambiguous (we say the chain becomes branched) and requiring some form of further processing. Decisions must be made as to which route to follow, and we refer to the solution of this problem as resolving branch-points.

Model Building and Refinement

In terms of the above-described definitions, the Automated Refinement Procedure (ARP)6 has a simple set of output classes consisting of two elements: (1) a given point in real space is an atomic center; (2) a given point is not an atomic center. With this classification ARP drives its real-space model update based on a feature vector reflecting the density shape.
This atomic map interpretation with iterative refinement cycles bears a close resemblance to the successful direct methods packages SnB7 and SHELX “half-baked.”8 The approach has been shown to be an extremely powerful model refinement method enjoying a large radius of convergence. The basic idea of ARP is that the model consists only of what is found in the electron density map. The initial model after map interpretation with ARP consists of a set of atoms that reproduce the density calculated with the current phases. To go from this set of free atoms to a chemical model of the molecule requires a further layer of pattern recognition based on this intermediate interpretation. It is the macromolecular models with their

5 A. Perrakis, R. J. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
6 V. S. Lamzin and K. S. Wilson, Acta Crystallogr. D Biol. Crystallogr. 49, 129 (1993).
7 C. M. Weeks and R. Miller, J. Appl. Crystallogr. 32, 120 (1999).
8 G. Sheldrick, in “Direct Methods for Solving Macromolecular Structures” (S. Fortier, Ed.), p. 401. Kluwer Academic, Dordrecht, The Netherlands, 1998.
rich structural information of atom types and bonds that have become such an important tool for biology, and not the actual result of a successful diffraction experiment (the electron density) or a representation in terms of just free atoms. A second motivation for model building is that the initial phases often are poor, and substantial improvement is necessary to reproduce faithfully the electron density distribution of the scattering matter within the crystal. A powerful method for improving the phase estimates and thereby the density is that of map interpretation and model building coupled in an iterative manner with refinement: the built model provides restraints with which the diffraction data may be titrated to enhance refinement. This is the underlying idea behind the success of ARP/wARP. The general ARP/wARP flowchart is depicted in Fig. 1.

From Free Atoms to a Protein Model

Our approach is based on atomic entities: the free atoms placed by ARP. Given a set of $N$ candidate positions, $S = \{\mathbf{x}_i \mid i = 1, \ldots, N\}$, the goal is to find a subset of positions, denoted by $M \subseteq S$, such that the degree with which the geometrical consequences of this set, $G(M)$, resemble known geometric expectations (prior knowledge) is greatest: $M = \arg\max_{s \subseteq S} [p(G(s))]$, in which $p$ is some similarity score between the observed and the expected stereochemical parameters (the protein-likeness). Despite the similarity of equations, note that this formulation reverses the above-described strategy of pattern recognition by now fixing the desired output class (protein) and trying to find the best possible feature vector (geometric quantities within a set of free atoms). One seeks to find a subset of positions that maximizes the protein-likeness of the resulting model. Frequencies for various geometric parameters obtained from an analysis of structures in the PDB can be used as prior geometric expectations.
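The subset search expressed by the argmax above can be made concrete with a deliberately small sketch. The score used here is a hypothetical stand-in for the protein-likeness p(G(s)): a Gaussian log-score on consecutive distances around 3.8 Å with an assumed width of 0.3 Å. The search is exhaustive over ordered subsets of a fixed length, which is feasible only for a handful of candidates and is not how ARP/wARP performs the search.

```python
import itertools
import math


def protein_likeness(chain, target=3.8, sigma=0.3):
    """Hypothetical score: Gaussian log-score of neighbor distances."""
    score = 0.0
    for a, b in zip(chain, chain[1:]):
        score -= 0.5 * ((math.dist(a, b) - target) / sigma) ** 2
    return score


def best_chain(candidates, length):
    """Exhaustively search ordered subsets of `length` candidate positions."""
    best_score, best_indices = -math.inf, None
    for indices in itertools.permutations(range(len(candidates)), length):
        if indices[0] > indices[-1]:
            continue  # a chain and its reverse score the same; keep one
        score = protein_likeness([candidates[i] for i in indices])
        if score > best_score:
            best_score, best_indices = score, indices
    return best_indices, best_score


if __name__ == "__main__":
    # Four positions spaced 3.8 A apart plus two invented decoys.
    free_atoms = [
        (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
        (5.0, 9.0, 0.0), (1.0, -8.0, 2.0),
    ]
    print(best_chain(free_atoms, 4))
```

The number of ordered subsets grows factorially with the number of candidates, which is why practical implementations need the database-derived distributions and pruning steps described in the text rather than brute force.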
For the purpose of model building, a protein may be thought of as a set of long, nonbranching chains of repetitive units, the main chain or backbone, with a number of short structural units attached to it, the side chains. This is a standard simplification of the problem and one used by most modeling programs. The geometry of the main chain characterizes the tertiary structure of a protein. The main chain itself can be determined to an acceptable degree of accuracy by the positions of the Cα atoms alone.9 Protein model building is therefore often rightly seen as the problem of locating the Cα positions, and we reformulate our task as trying to identify Cα atoms in a set of free atoms.
9 R. M. Esnouf, Acta Crystallogr. D Biol. Crystallogr. 53, 665 (1997).
Fig. 1. The ARP/wARP flowchart.
Once the free atoms that best reproduce expected Cα geometry have been determined, the remaining main-chain atoms can be put in place and the current main-chain hypothesis tested by submitting the current model to refinement against structure factor amplitudes. These main-chain fragments then can be docked into sequence and the side chains placed in density.
The Main Chain

We have concentrated on the analysis of expected Cα-backbone geometry (and, as a second class, the expected not-Cα geometry). The parameterization of the problem in terms of Cα geometry represents only an approximation to the problem, but the idea may easily be extended to incorporate as many parameters as is feasible; for more details see Morris et al.10 Multidimensional frequency distributions have been computed from the PDB for all Cα(n)–Cα(n + 1)–Cα(n + 2)–Cα(n + 3) distances, valence angles between nonbonded atoms, and dihedral angles. The distance distributions together with peptide planarity checks are used in ARP/wARP to identify Cα–Cα pairs.11 In brief, one searches for pairs of atoms separated by approximately 3.8 Å, and checks that there is reasonable density between them at the expected positions of the atomic centers in the peptide plane (Fig. 2A and B). To catch the correct peptide units, a large number of false positives are also accepted. The multidimensional distance and angle distributions derived directly from database analyses are well suited for pattern recognition in sets of accurate candidate positions. The accuracy of the free atom positions is, however, dependent on current phase quality and resolution. The patterns in a free atoms model differ to a varying degree from those of well-refined structures. Therefore frequency distributions have been computed that correspond to structures with a wide range of random coordinate errors. The
Fig. 2. The main steps in building the main chain. (A) Search for pairs of atoms separated approximately by a distance of 3.8 Å and determine the most likely peptide plane orientation between these atoms (B). This procedure results in a large number of possible peptide planes, indicated by the arrows in (C). Graph-search techniques are then employed to reduce this to the most likely set of nonbranched, nonoverlapping chains (D).
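Step (A) of Fig. 2 can be illustrated with a minimal sketch. It is not the ARP/wARP implementation: the function name `candidate_pairs` and the tolerance of 0.5 Å are assumptions made here for illustration, and the density check at the expected peptide-plane positions is omitted entirely.

```python
import math
from itertools import combinations

CA_CA_DISTANCE = 3.8  # Å, approximate Cα–Cα separation quoted in the text
TOLERANCE = 0.5       # Å, hypothetical acceptance window (not an ARP/wARP value)


def candidate_pairs(atoms, target=CA_CA_DISTANCE, tol=TOLERANCE):
    """Return index pairs (i, j), i < j, whose separation is within tol of target."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if abs(math.dist(a, b) - target) <= tol:
            pairs.append((i, j))
    return pairs


# Toy free-atom coordinates in Å (made up for illustration).
free_atoms = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.7, 0.0),
    (10.0, 10.0, 10.0),
]
pairs = candidate_pairs(free_atoms)
```

The all-against-all loop is quadratic in the number of free atoms; for realistic atom counts a spatial grid or neighbor list would be the more usual choice.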
10 R. J. Morris, A. Perrakis, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 58, 968 (2002).
11 V. S. Lamzin and K. S. Wilson, Methods Enzymol. 277, 269 (1997).
distributions rapidly lose classification power beyond a coordinate error of about 0.7 Å. Provided the accuracy of the free atoms positioning can be correctly estimated, the appropriate error distributions prove to be a powerful tool for recognizing Cα atoms in sequence.

The solution strategy for building the best chain, with choices having to be made at each Cα atom, should ideally consult what has been built so far and what result would be obtained if each possible route were followed from the current point onward, before making a decision. This idea is implicit in the formulation given above of choosing the best chain from all possibilities. When dealing only with a Cα atom model of the main chain, one must determine (1) which free atoms to use, and (2) the directed (N–C or C–N) connections between all Cα atom pairs that comprise one peptide unit. The problem of choosing a subset of atoms and their connections so as to maximize the similarity with expected geometry can therefore be formulated as an optimization problem in which the optimization variables are binary (0/1) (Fig. 2C and D). Each optimization variable $c_{ij}$ represents a connection between two candidate positions: $c_{ij} = 1$ means that a directed connection from atom $i$ to atom $j$ is chosen, and $c_{ij} = 0$ that it is not. Constraints must be added to the problem to ensure that every point in a Cα backbone trace has at most one incoming and one outgoing connection. The problem is merely to choose which connections to turn on and which to turn off. This transforms the problem to an exercise in combinatorial optimization.
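The binary formulation can be made concrete with a small sketch. The dictionary representation of the variables and the helper names `is_valid_trace` and `chains_from` are assumptions introduced here; they only check the constraint described above and read off the resulting chains, and do not perform any optimization.

```python
def is_valid_trace(connections):
    """Check that every atom has at most one incoming and one outgoing connection.

    connections maps (i, j) -> 0 or 1, where 1 means that the directed
    connection from atom i to atom j is switched on.
    """
    outgoing, incoming = {}, {}
    for (i, j), on in connections.items():
        if on:
            outgoing[i] = outgoing.get(i, 0) + 1
            incoming[j] = incoming.get(j, 0) + 1
    return (all(n <= 1 for n in outgoing.values())
            and all(n <= 1 for n in incoming.values()))


def chains_from(connections):
    """Follow the switched-on connections from each chain start to its end."""
    if not is_valid_trace(connections):
        raise ValueError("connections violate the nonbranching constraint")
    nxt = {i: j for (i, j), on in connections.items() if on}
    starts = set(nxt) - set(nxt.values())  # atoms with no incoming connection
    chains = []
    for start in sorted(starts):
        chain = [start]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        chains.append(chain)
    return chains


# Toy assignment: two chains, 0 -> 1 -> 2 and 3 -> 4; connection (0, 2) is off.
connections = {(0, 1): 1, (1, 2): 1, (0, 2): 0, (3, 4): 1}
# Atom 0 with two outgoing connections is a branch and must be rejected.
branched = {(0, 1): 1, (0, 2): 1}
```

Closed loops of connections have no start atom and are therefore silently ignored by `chains_from`; a real tracer would have to treat them explicitly.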
This type of problem has been well studied and belongs to the NP-hard class of problems, a class of such complexity that no known algorithm can be guaranteed to run in polynomial time.12 If one assumes that each free atom has an average number $\langle B \rangle$ of free atoms to connect to ($\langle B \rangle$ is called the branching average), and that of the $N$ free atoms one can always build chains of length $\langle L \rangle$, then the number of chains is equal to the number of decisions that have to be made along the chain, and is approximately proportional to $\langle B \rangle^{\langle L \rangle}$. For a modest branching average of 3 and a chain length of 50, the worst-case number of chains ($3^{50} \approx 7 \times 10^{23}$) exceeds by far the number of seconds elapsed since the Big Bang (roughly $4 \times 10^{17}$). This complexity analysis is crude but demonstrates correctly the kind of worst-case behavior to be expected. Enumerating all possible subsets and all possible connections between the elements to find the one with the highest protein-likeness is clearly a formidable task. Rather than evaluating each test chain globally as a single entity, it would be of advantage for approximation schemes to have a handle on decisions at a more local level. One would like to approximate each
12 C. H. Papadimitriou and K. Steiglitz, “Combinatorial Optimization,” p. 194. Dover, Mineola, NY, 1998.
chain evaluation by a summation over smaller units,

$$P(G(s)) = P(G(\{c_{ij} \mid i, j \in S\})) \approx \sum_u \phi_u(\{c_{ij} \mid i, j \in u \subseteq S\}),$$

where the sum over $u$ runs over all structural units along a chain and over all possible chains, and $\phi_u$ is a unit-based score. The obvious choice for these units would be the actual connection variables, but these have already undergone a quality assessment. Also, they are unsuitable for testing protein-likeness since they fail to capture any 3D structure. The minimum 3D structural information in terms of a Cα parametrization of the problem is provided by the use of fragments consisting of four Cα atoms. From a list of putative Cα atoms one can easily scan all possible Cα fragments of length four. These fragments can be evaluated and stored as structural building blocks for the problem at hand. The main chain can then be built by overlapping the last three atoms of each fragment with the first three of the following one. The summation of some probability-based score means that $\phi_u \propto \log P(u)$; this can be put on a significance scale by using the log-odds ratio, $\phi_u \propto \log\{P(u \mid \mathrm{C}\alpha)/P(u \mid R)\}$, where the probability has been conditioned on a Cα atoms model and on a random model $R$. Random here means random under the restriction that the atoms are approximately 3.8 Å apart, since this condition has already been applied at an earlier stage. For inaccurately placed free atoms the classification probability for non-Cα atoms is frequently higher than for Cα, even for atoms that correspond to Cα positions. This would result in a local classification error, should the building be carried out only as a classification problem. But in the current optimization scheme this results only in an insignificant lowering of the overall chain score. We have developed specific heuristics based on the divide-and-conquer approach outlined above by recasting the optimization problem as a search for the longest path in a weighted graph.
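As a rough illustration of this recasting, the sketch below scores four-atom fragments by a log-odds ratio and searches exhaustively, depth first, for the chain of overlapping fragments with the highest summed score. The function names, the toy probabilities, and the exhaustive search itself are assumptions made here; the heuristics actually used in ARP/wARP restrict the search and are not reproduced.

```python
import math


def log_odds(p_calpha, p_random):
    """Log-odds score of one fragment: log{P(u|Cα) / P(u|R)}."""
    return math.log(p_calpha / p_random)


def best_chain(fragments, scores):
    """Depth-first search for the highest-scoring chain of overlapping fragments.

    fragments: list of 4-tuples of atom indices.
    scores: one log-odds score per fragment.
    A fragment may follow another when its first three atoms equal the last
    three atoms of its predecessor. No atom is used twice in a chain.
    """
    followers = {
        a: [b for b in range(len(fragments))
            if b != a and fragments[a][1:] == fragments[b][:3]]
        for a in range(len(fragments))
    }
    best = (float("-inf"), [])

    def extend(frag, atoms, total):
        nonlocal best
        if total > best[0]:
            best = (total, list(atoms))
        for nxt in followers[frag]:
            new_atom = fragments[nxt][3]
            if new_atom in atoms:  # keep the chain nonoverlapping
                continue
            atoms.append(new_atom)
            extend(nxt, atoms, total + scores[nxt])
            atoms.pop()  # step back and free the atom again

    for start in range(len(fragments)):
        extend(start, list(fragments[start]), scores[start])
    return best


# Toy fragments over atom indices, with made-up (P(u|Cα), P(u|R)) pairs.
fragments = [(0, 1, 2, 3), (1, 2, 3, 4), (1, 2, 3, 5), (2, 3, 4, 6)]
probabilities = [(0.8, 0.2), (0.6, 0.3), (0.2, 0.4), (0.5, 0.25)]
scores = [log_odds(pc, pr) for pc, pr in probabilities]
total, chain = best_chain(fragments, scores)
```

A fragment with a negative log-odds score, such as the third one here, lowers the chain score without forbidding the chain, which mirrors the remark that a local misclassification only slightly lowers the overall score.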
The weights of the connections are the scores derived from the geometry. The nonbranching nature of polypeptide chains imposes restraints on the ideal search strategy. For a given starting point one would like to obtain a set of all single, nonbranching, longest chains. This deep probing into graph structures is accomplished by the depth-first-search (DFS) algorithm.13 The time requirements of the standard algorithm are proportional to the number of nodes and arcs in the graph. The algorithm can readily be modified to keep track internally of all found chains and the fragments used, and to check for geometric clashes. When an end node is encountered the full chain is returned and the algorithm steps back, thereby resetting the availability of those nodes over which the routine back-traced, and creating a new chain for each decision
13 R. Sedgewick, “Algorithms in C++,” p. 415. Addison-Wesley, Reading, MA, 1992.
that is made. The chains are stored as a list that is returned as the result of the function call. In this manner the whole structure may be systematically searched. A full search through all possible chains can easily become intractable, and one must introduce a number of restrictions and settle for an approximate solution. We circumvent this by exponentially limiting the search depth with the average number of branching points per node. Each accepted four-Cα fragment is assigned a quality score based on the frequency of its geometry in the PDB. The search algorithm can be set up to require a minimum quality of the fragments while scanning for chains. For large problems (above 10,000 candidate positions with a branching average greater than 2) the algorithm attempts first to build chain stretches of high quality before reducing the acceptance threshold level. The high-scoring fragments are most commonly helices, followed by strands. The second measure we have taken is to limit the search depth and/or the total number of chains per node to evaluate. Our initial implementation may be seen in this framework as an extreme case of search depth equal to one.

Sequence Docking and Side-Chain Fitting

The process outlined above delivers a set of main-chain fragments (Cα atoms with the remaining main-chain atoms placed into the positions of best agreement between them). The side chain-building module has two tasks: first, to assign the main-chain fragments to the known protein sequence, and second, to build and refine side chains according to the sequence assignment. For the sequence docking, a feature vector is used that represents the possible connectivity between the free atoms in the vicinity of each Cα.
If there is one free atom close to the Cα and this atom is also close to one more atom, the feature vector would be “11.” If one further atom is connected then the feature would be “111,” and if two atoms are connected to the last one it would be “1112.” For each of the 20 residues the full side-chain connectivity vector is known (e.g., serine is “11,” valine is “12,” and aspartate is “112”). By comparing each observed feature vector with all 20 full side-chain connectivity vectors a probability is assigned to each Cα for its side chain being of a certain residue type. This way each piece of main chain can be represented as a vector of probability vectors (an array); each probability vector contains 20 values representing how well the observed free atoms match all possible residues. By sliding that array across the given protein sequence, the probability that this placement is correct is computed by calculating the product of all probabilities within
the main-chain fragment that each sequence residue is observed. In the next step the difference of the score (probability) for the best placement with the second best score for each fragment (a kind of z-score) is used to derive a confidence score for that placement to be unique, that is, presumably correct. After the fragment with the best confidence score is docked, the corresponding sequence space is no longer available for placement of additional fragments, and thus the confidence scores are updated and the algorithm is iterated until either all fragments are docked or the confidence level becomes lower than a preset threshold. At that stage every residue of all fragments is assigned to a specific residue type. In the final step the best rotamer from the Richardsons' rotamer database14 is built. The chosen rotamer angles are refined in real space with a target function that includes the interpolated density at atomic centers and geometric considerations (bad contacts, hydrogen bond contacts). During torsional refinement the angle of the residue also is allowed to vary to achieve better C placement. The methods here resemble closely those implemented by Jones and Thirup15 in the modeling package O and those published by Oldfield.16

Practice
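Before turning to applications, the sequence-docking step described in the preceding section can be sketched for a single fragment. This is a toy illustration under stated assumptions: the function names `placement_scores` and `dock` are hypothetical, the probability vectors hold only five residue types rather than 20, and the iteration over several fragments with blocking of docked sequence space is left out.

```python
import math


def placement_scores(fragment_probs, sequence):
    """Score every ungapped placement of a fragment along the sequence.

    fragment_probs: one dict per fragment residue, mapping a one-letter
    residue code to the probability that the observed free atoms match it.
    Returns a list of (offset, score); the score is the product of the
    probabilities of the sequence residues covered by the placement.
    """
    n = len(fragment_probs)
    scores = []
    for offset in range(len(sequence) - n + 1):
        window = sequence[offset:offset + n]
        score = math.prod(p.get(res, 0.0) for p, res in zip(fragment_probs, window))
        scores.append((offset, score))
    return scores


def dock(fragment_probs, sequence):
    """Return (best_offset, confidence); confidence = best minus second-best score."""
    ranked = sorted(placement_scores(fragment_probs, sequence),
                    key=lambda item: item[1], reverse=True)
    best_offset, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_offset, best - second


# Made-up probability vectors for a three-residue fragment.
fragment = [
    {"S": 0.6, "V": 0.2, "D": 0.1, "G": 0.05, "A": 0.05},
    {"V": 0.7, "S": 0.1, "D": 0.1, "G": 0.05, "A": 0.05},
    {"D": 0.8, "S": 0.05, "V": 0.05, "G": 0.05, "A": 0.05},
]
offset, confidence = dock(fragment, "GSVDAS")
```

For long fragments the product of many small probabilities underflows quickly, so summing logarithms would be the more robust choice in practice.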
Applications and Limitations of ARP/wARP

The ARP/wARP package can currently be used for the following applications.

1. Automatic construction of a protein model from diffraction data extending to a resolution of 2.5 Å (in some cases 2.7 Å) or higher and reasonable initial phase estimates from heavy atom methods or molecular replacement. The required phase quality for successful autobuilding can vary greatly and depends on the overall quality of the data and on the resolution (in general, the lower the resolution, the better the phases need to be). Sometimes, localized areas of good density in an otherwise uninterpretable map can provide a sufficient seed for the iterative model building to proceed smoothly.
2. Density modification by free atoms models can provide significant phase improvement for data of resolution higher than 3.0 Å, depending on the solvent content.
14 S. C. Lovell, J. M. Word, J. S. Richardson, and D. C. Richardson, Proteins 40, 389 (2000).
15 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986).
16 T. J. Oldfield, Acta Crystallogr. D Biol. Crystallogr. 57, 82 (2001).
3. Automated solvent building with ARP. This requires the resolution of the diffraction data to be about 2.5 Å or higher. Protocol validation with Rfree can be employed and is recommended.
4. Side-chain mutations for molecular replacement solutions. The side chain-fitting routines are part of the autobuilding procedure but they can also be used as a stand-alone application. The side-chain fitting performs well for data of resolution higher than 3.5 Å and can be used whenever density, main chain (with residue assignment), and sequence are available.

Program Interface

We have developed a simple knowledge-based system in the form of automated scripts that set up most standard ARP/wARP applications with reasonable default parameters. These scripts take the user through the whole setup by interactive questioning. The startup script takes care of job initialization and distribution over multiple processors. A parameter file is created that then drives the individual routines of the ARP/wARP package. This can be edited if fine tuning is required. There is plenty of flexibility provided within the scripts and these options should be experimented with first. In addition to the UNIX shell scripts, a graphical user interface (GUI) has been written using modules of the CCP4i toolbox. For details, see the documentation at www.arp-warp.org.

Example

Autobuilding typically takes about 2 to 12 h on a standard workstation as used for other crystallographic computations (depending on the size of the structure and the initial phase quality) and successfully builds about 70–95% of the structure (again dependent on resolution, phase quality, and the amount of disordered regions). The following example is a novel structure solution kindly provided by R. Meijers before publication. It shows a currently rather atypical example of how the modules of ARP/wARP can be applied to a difficult molecular replacement case. Data were collected at EMBL Outstation Hamburg beamline X11 to a resolution of 1.5 Å. The data are 98% complete and overall of good quality. The highest scoring sequence alignment showed 32% sequence identity in the overlapping regions. The model underwent a series of carefully chosen mutilations, deleting step by step those parts with the least sequence similarity until a molecular replacement solution could be picked up with AmoRe.17 Other PDB models of lower similarity were also attempted but no solution was found. The successful
17 J. Navaza, Acta Crystallogr. A 50, 157 (1994).
search model is shown in Fig. 3A. The Cα positions of the molecular replacement solution are on average 1.5 Å away from the nearest equivalents in the final model and less than 70% of the total number of Cα atoms were present. This model proved notoriously difficult to refine. The standard warpNtrace protocol of ARP/wARP failed and even with customization it was not able to correct for the poor, highly biased starting phases from the molecular replacement solution. The density modification routine wARP was employed, using four independent free atoms models to calculate a weighted average phase and figure of merit for each structure factor. The improvement gained by such averaging procedures is often not significant in terms of phase quality indicators, but it can be crucial in terms of structure solution. The initial density map had only about 30% correlation
Fig. 3. (A) A drawing of the model used for molecular replacement. (B) A drawing of the model autobuilt by ARP/wARP, starting from initial phases from the molecular replacement solution. (C) A drawing of the final model.
to the final map. The wARP map was given as a starting point to warpNtrace and subjected to 100 ARP cycles (each ARP cycle consisting of real-space density modeling by ARP and three internal reciprocal-space refinement cycles using REFMAC18 from the CCP4 suite19) and autobuilding of the main chain after every 10 ARP cycles. The progress of the iterative building and phase refinement is shown in Fig. 4A and B. At the first model-building stage, merely four small chain fragments are found with a total number of 14 built residues. The ARP cycles then attempt to remove or add atoms according to their density criteria. The poor density makes this procedure rather slow in this case. The phases, however, improve gradually and the procedure takes off at cycle 61 after the sixth building cycle. The slight jump in R-factor after each autobuilding cycle is due to the rearrangement, deletion, and addition of atoms to accommodate the built main chain; the system relaxes again after a few refinement cycles. Figure 3B shows a drawing of the autobuilt model and Fig. 3C shows the final model. The Cα atoms show a mean distance to their closest Cα atoms in the autobuilt structure of 0.09 Å.

Discussion
ARP/wARP is a software suite (copyrighted by the European Molecular Biology Laboratory) based on the paradigm of viewing model building and refinement as one unified procedure for optimizing phase estimates. The current version, ARP/wARP 6.0, released in July 2002, works with density recognition-driven procedures for placing and removing atoms and is therefore limited to diffraction data extending to about 2.5 Å. The iterative cycles of density modeling by the placement of atoms, unrestrained refinement of their parameters, automated model building, and restrained refinement of the hybrid model provide a powerful means of phase refinement. One receives an almost complete protein model as a by-product. Initial phase estimates may be provided in the form of a molecular replacement solution, MIR/SIRAS/MAD/SAD phases, experimental measurements, witchcraft, or heavy atom sites alone (provided the data extend to atomic resolution). Pattern recognition techniques are a crucial element in such a procedure and more robust algorithms for medium resolution are currently under development. Even with better density processing and classification algorithms, the building of a protein model will
18 G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997).
19 Collaborative Computational Project Number 4, Acta Crystallogr. D Biol. Crystallogr. 50, 760 (1994).
Fig. 4. The progress of warpNtrace over 100 ARP cycles with autobuilding after every 10 cycles. (A) The crystallographic R-factor and the free R-factor [axes: R/Rfree versus number of ARP cycles]. (B) The number of autobuilt residues [axes: number of residues in the hybrid model versus number of ARP cycles].
remain a complex process, and decision-making is required to enhance the state of automation. ARP/wARP is an experimental hypothesis-generating and testing procedure for placing atoms in the most likely places (according to density), and using graph-searching combined with geometric comparisons against expected stereochemical parameters to determine the most likely main-chain fragments. The iterative approach, with maximum likelihood refinement using REFMAC of the current model at every stage, has proved to be a powerful tool for overcoming the insufficient robustness of
the map interpretation routines regarding phase quality, and the inadequate use of knowledge-based decision making during model building.

Acknowledgments

This work was supported by EU Grant BIO2-CT920524/BIO4-CT96-0189 (R.J.M.). The authors thank Keith Wilson, Zbyszek Dauter, Rob Meijers, and Petrus Zwart for fruitful discussions and useful comments; Gérard Bricogne, for helpful suggestions, advice, mathematical rigor, and for generously allowing R. J. M. to write this contribution while working at Global Phasing, Ltd.; Eric Blanc, Pietro Roversi, Claus Flensburg, and Clemens Vonrhein for constructive critique on an initial draft of this manuscript; Garib Murshudov and Eleanor Dodson for help with REFMAC; and all ARP/wARP users for helpful suggestions.
[12] TEXTAL System: Artificial Intelligence Techniques for Automated Protein Model Building

By Thomas R. Ioerger and James C. Sacchettini

Introduction
Significant advances have been made toward improving many of the complex steps in macromolecular crystallography, from new crystal growth techniques, to more powerful phasing methods such as multiwavelength anomalous diffraction (MAD),1 to new computational algorithms for heavy-atom search,2–4 reciprocal-space refinement, and so on.5 However, the step of interpreting the electron density map and building an accurate model of a protein (i.e., determining atomic coordinates) from the electron density map remains one of the most difficult to improve. Currently it takes days to weeks for a human crystallographer to build a structure from an electron density map, often with the help of a 3D visualization and model-building program such as O.6 This manual process is both time-consuming and error-prone. Even with an electron density map of high quality, model
1 W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997).
2 C. M. Weeks, G. T. DeTitta, H. A. Hauptman, P. Thuman, and R. Miller, Acta Crystallogr. A 50, 210 (1994).
3 E. de la Fortelle and G. Bricogne, Methods Enzymol. 276, 590 (1997).
4 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 849 (1999).
5 A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 54, 905 (1998).
6 T. A. Jones, J.-Y. Zou, and S. W. Cowan, Acta Crystallogr. A 47, 110 (1991).
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
building is a long and tedious process.7 There are many sources of noise and errors that can perturb the appearance of the density map.8–10 All these effects contribute to making the density sometimes difficult to interpret.

There exist several prior methods for, or related to, automated model building, such as searching fragment libraries,11,12 template convolution and other fast Fourier transform (FFT)-based approaches,13,14 the free-atom insertion method of ARP/wARP,15,16 DADI17 (which uses a real-space correlational search), X-Powerfit,18 MAID,19 MAIN,20 and molecular scene analysis.21–24 However, most of these methods have limitations. For example, fragment library searches11,12 require user intervention to pick Cα coordinates in the sequence (although backbone tracing can help, it does not reliably determine Cα locations), and methods like ARP/wARP and molecular scene analysis seem to work best only at high resolution, for example, around 2.5 Å or better.

TEXTAL is a new computer program designed to build protein structures automatically from electron density maps. It uses AI (artificial intelligence) and pattern recognition techniques to try to emulate the intuitive
7 G. J. Kleywegt and T. A. Jones, Methods Enzymol. 277, 208 (1997).
8 J. S. Richardson and D. C. Richardson, Methods Enzymol. 115, 189 (1985).
9 C. I. Branden and T. A. Jones, Nature 343, 687 (1990).
10 T. A. Jones and M. Kjeldgaard, Methods Enzymol. 277, 173 (1997).
11 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986).
12 L. Holm and C. Sander, J. Mol. Biol. 218, 183 (1991).
13 G. J. Kleywegt and T. A. Jones, Acta Crystallogr. D Biol. Crystallogr. 53, 179 (1997).
14 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 54, 750 (1998).
15 A. Perrakis, T. K. Sixma, K. S. Wilson, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 53, 448 (1997).
16 A. Perrakis, R. Morris, and V. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
17 D. J. Diller, M. R. Redinbo, E. Pohl, and W. G. J. Hol, Proteins Struct. Funct. Genet. 36, 526 (1999).
18 T. J. Oldfield, in “Crystallographic Computing 7: Proceedings from the Macromolecular Crystallography Computing School” (P. E. Bourne and K. Watenpaugh, eds.). Oxford University Press, New York, 1996.
19 D. G. Levitt, Acta Crystallogr. D Biol. Crystallogr. 57, 1013 (2001).
20 D. Turk, in “Methods in Macromolecular Crystallography” (D. Turk and L. Johnson, eds.). NATO Science Series I, Vol. 325, p. 148. Kluwer Academic, Dordrecht, The Netherlands, 2001.
21 L. Leherte, S. Fortier, J. Glasgow, and F. H. Allen, Acta Crystallogr. D Biol. Crystallogr. 50, 155 (1994).
22 K. Baxter, E. Steeg, R. Lathrop, J. Glasgow, and S. Fortier, in “Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology,” p. 25. American Association for Artificial Intelligence, Menlo Park, CA, 1996.
23 L. Leherte, J. Glasgow, K. Baxter, E. Steeg, and S. Fortier, Artif. Intell. Res. 7, 125 (1997).
24 S. Fortier, A. Chiverton, J. Glasgow, and L. Leherte, Methods Enzymol. 277, 131 (1997).
decision-making of experts in solving protein structures. Previously solved structures are exploited to help understand the relationship between patterns of electron density and local atomic coordinates. The method applies this along with a number of heuristics to predict the likely positions of atoms in a model for an uninterpreted map. TEXTAL is aimed at solving maps in the 2.5 to 3.5-Å range, which are just at the range of human interpretability, but turn out to be the majority of cases in practice for maps constructed from MAD data.

TEXTAL has the potential to reduce one of the bottlenecks of high-throughput structural genomics.25 By automating the final step of model building (for noisy, medium-to-low resolution maps), less effort will be required of human crystallographers, allowing them to focus on regions of a map where the density is poor. TEXTAL will eventually be integrated with other computational methods, such as reciprocal-space refinement26 or statistical density modification,27 to iterate between building approximate models (in poor maps) and improving phases, which can then be used to produce more accurate maps and allow better models to be built. This will be implemented in the PHENIX crystallographic computing environment currently under development at the Lawrence Berkeley National Laboratory.28

Principles of Pattern Recognition and Application to Crystallography
Pattern recognition techniques can be used to mimic the way the crystallographer’s eye processes the shape of density in a region and comprehends it as something recognizable, such as a tryptophan side chain, or a β sheet, or a disulfide bridge. The history of statistical pattern recognition is long, and a great deal of research, both theoretical and applied (i.e., development of algorithms), has been done in a wide range of application domains. Much work has been done in image recognition in two dimensions, such as recognizing military vehicles or geographic features in satellite images, faces, fingerprints, carpet textures, parts on an assembly line for manufacturing, and even vegetables for automated sorting29; however,
25 S. K. Burley, S. C. Almo, J. B. Bonanno, M. Capel, M. R. Chance, T. Gaasterland, D. Lin, A. Sali, W. Studier, and S. Swaminathan, Nat. Genet. 23, 151 (1999).
26 G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997).
27 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 56, 965 (2000).
28 P. D. Adams, R. W. Grosse-Kunstleve, L.-W. Hung, T. R. Ioerger, A. J. McCoy, N. W. Moriarty, R. J. Read, J. C. Sacchettini, and T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 58, 1948 (2002).
patterns in electron density maps are three-dimensional, and less work has been done for these kinds of problems. The basic idea behind pattern recognition (at least for supervised learning) is to ‘‘train’’ the system by giving it labeled examples in several competing categories (such as tanks vs. civilian cars and trucks). Each example is usually represented by a set of descriptive features, which are measurements derived from the data source that characterize the unique aspects of each one (e.g., color, size). Then the pattern recognition algorithm tries to find some combination of feature values that is characteristic of each category that can be used to discriminate among the categories (i.e., given a new unlabeled example, classify it by determining to which category it most likely belongs). There are many pattern recognition algorithms that have been developed for this purpose but work in different ways, including decision trees, neural networks, Bayesian classifiers, nearest neighbor learners, and support vector machines.30,31 The central problem in many pattern recognition applications is in identifying features that reflect significant similarities among the patterns. In some cases, features may be noisy (e.g., clouds obscuring a satellite photograph). In other cases, features may be truly irrelevant, such as trying to determine the quality of an employee by the color of his or her shirt. 
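The supervised-learning idea described above can be illustrated with a minimal nearest-neighbor classifier. This is only a sketch: the function name, the two features (length and mass), and all numeric values are invented for illustration and do not come from TEXTAL.

```python
import numpy as np

def nearest_neighbor_classify(train_features, train_labels, query):
    """Return the label of the labeled example closest to `query`."""
    train = np.asarray(train_features, dtype=float)
    q = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every labeled example.
    distances = np.linalg.norm(train - q, axis=1)
    return train_labels[int(np.argmin(distances))]

# Hypothetical labeled examples: (length in m, mass in tonnes).
features = [[7.0, 60.0], [6.5, 55.0], [4.5, 1.5], [5.5, 3.0]]
labels = ["tank", "tank", "car", "truck"]

# Classify a new, unlabeled example by its most similar labeled neighbor.
print(nearest_neighbor_classify(features, labels, [6.8, 58.0]))
```

In this toy setting both features are informative; with noisy or irrelevant features the plain Euclidean distance degrades, which is why feature selection and weighting matter.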
The most difficult issue of all is interaction among features, where features are present that contain information, but their relevance to the target class on an individual basis is weak, and their relationship to the pattern is recognizable only when they are looked at in combination with other features.32,33 A good example of this is the almost inconsequential value of individual pixels in an image; no single pixel can tell much about the content of an image (unless the patterns were rigidly constrained in space), yet the combination of them all contains all the information that is needed to make a classification. While some methods exist for extracting features automatically,34 currently consultation with domain experts is almost always needed to determine how to process raw data into meaningful, high-level features that are likely to have some form of correlation with
29 R. C. Gonzalez and R. E. Woods, ‘‘Digital Image Processing.’’ Addison-Wesley, Reading, MA, 1992.
30 T. Mitchell, ‘‘Machine Learning.’’ McGraw-Hill, New York, 1997.
31 R. O. Duda, P. E. Hart, and D. G. Stork, ‘‘Pattern Classification.’’ John Wiley & Sons, New York, 2001.
32 L. Rendell and R. Seshu, Comput. Intell. 6, 247 (1990).
33 G. John, R. Kohavi, and K. Pfleger, in ‘‘Proceedings of the Eleventh International Conference on Machine Learning,’’ p. 121. Morgan Kaufmann, San Francisco, 1994.
34 H. Liu and H. Motoda, ‘‘Feature Extraction, Construction, and Selection: A Data Mining Perspective.’’ Kluwer Academic, Dordrecht, The Netherlands, 1998.
the target classes, often making manual decisions about how to normalize, smooth, transform, or otherwise manipulate the input variables (i.e., ‘‘feature engineering’’). A pattern recognition approach can be used to interpret electron density maps in the following way. First, we restrict our attention to local regions of density, which are defined as spheres of 5-Å radius* (a whole map can be thought of and modeled as a collection of overlapping 5-Å spheres). Our goal is to predict the local molecular structure (atomic coordinates) in each such region. This can be done by identifying the region with the most similar pattern of density in a database of previously solved maps, and then using the coordinates of atoms in the known region as a basis for estimating coordinates of atoms in the unknown. To facilitate this pattern-matching process, features must be extracted that characterize the patterns of density in spherical regions and can be used to recognize numerically when two regions might be similar. This idea of feature-based retrieval is illustrated in Fig. 1. Recall that electron density maps are really 3D volumetric data sets (i.e., a three-dimensional grid of density values that covers the space around and including the protein). As we describe in more detail later, features for regions can be computed from an array of local density values by methods such as calculating statistics of various orders, moments of inertia, contour properties (e.g., surface area, smoothness, connectivity), and other geometric properties of the distribution and ‘‘shape’’ of the density. Not all of these features are equally relevant; we use a specialized feature-weighting scheme to determine which ones are most important. For this pattern recognition approach to work, an important requirement of the features is that they should be rotation-invariant.
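As a rough illustration of what such features might look like, the sketch below computes a few statistics and the moments of inertia of the density inside a sphere, then checks numerically that they do not change when the grid is rotated by 90°. It is not TEXTAL's implementation: the function name `sphere_features`, the cubic grid with an odd edge length, the 1-Å grid spacing, and the random test density are all assumptions made for this example, and only a subset of possible features is computed.

```python
import numpy as np

def sphere_features(density, spacing=1.0, radius=5.0):
    """Features of the density inside a sphere centered on the grid center.

    Assumes `density` is a cubic 3D array with an odd edge length.
    """
    n = density.shape[0]
    center = (n - 1) / 2.0
    axes = (np.arange(n) - center) * spacing
    x, y, z = np.meshgrid(axes, axes, axes, indexing="ij")
    inside = x**2 + y**2 + z**2 <= radius**2
    rho = density[inside]
    coords = np.stack([x[inside], y[inside], z[inside]], axis=1)

    # Statistical features: mean and root-scaled central moments.
    mean = rho.mean()
    dev = rho - mean
    std = np.sqrt(np.mean(dev**2))
    skew = np.cbrt(np.mean(dev**3))
    kurt = np.mean(dev**4) ** 0.25

    # Symmetry: distance from the sphere center to the density-weighted centroid.
    dist_com = np.linalg.norm((coords * rho[:, None]).mean(axis=0))

    # Moments of inertia: eigenvalues of the inertia tensor, density as mass.
    r2 = (coords**2).sum(axis=1)
    inertia = (rho * r2).sum() * np.eye(3) - np.einsum(
        "i,ij,ik->jk", rho, coords, coords)
    moments = np.sort(np.linalg.eigvalsh(inertia))[::-1]

    return np.array([mean, std, skew, kurt, dist_com, *moments])

rng = np.random.default_rng(0)
region = rng.random((11, 11, 11))                 # stand-in for real density
rotated = np.rot90(region, k=1, axes=(0, 1))      # 90-degree rotation about z
print(np.allclose(sphere_features(region), sphere_features(rotated)))
```

A 90° rotation is used because it maps the grid onto itself exactly; arbitrary rotations would require interpolation, so agreement would only be approximate.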
A feature is rotation-invariant if its value would remain constant even if the region were rotated, i.e., F(region) = F(Rot(θ, region)), where θ is an arbitrary set of rotation parameters that can be used to transform grid-point coordinates locally around the center of the region. Rotation-invariance is important for matching regions of electron density because proteins can appear in arbitrary orientations in electron density maps. If one region has a similar pattern to another, we want to detect this robustly, even if they are in different orientations. Features such as Fourier coefficient amplitudes are well-known to be translation-invariant, but they are not rotation-invariant. Therefore, one of the initial challenges in the design of TEXTAL was to
*We selected 5 Å as a standard size for our patterns because (a) this is just about large enough to cover a single side-chain, (b) smaller regions would lead to redundancy in the database of feature-extracted regions and fewer predicted atoms per region, and (c) larger regions would contain such complexity in their density patterns that sufficiently similar matches might not be found in even the largest of databases, i.e., they might be unique.
Fig. 1. Illustration of feature-based retrieval. A 5-Å spherical region of density is shown around a histidine residue (centered on the Cα atom). Below it is shown a hypothetical feature vector, consisting of a list of scalar values that are a function of the density pattern in the region. This feature vector may be used to search for other regions with similar patterns of density, which would presumably have a similar profile of feature values, independent of orientation.
develop a set of rotation-invariant numeric features to capture patterns in spherical regions of electron density. Given a set of rotation-invariant features, this pattern-matching approach may be used to predict local atomic coordinates in arbitrary spherical regions throughout a density map. However, to provide some structure for the process, we subdivide the problem of map interpretation along the traditional lines of decomposition used by human crystallographers: first, we try to identify the backbone (or main-chain) of the protein, modeled as linear chains of Cα atoms, and then we apply the pattern-matching process to predict the coordinates of other backbone and side-chain atoms around each Cα. The first step is accomplished by a routine called CAPRA (for ‘‘C-Alpha Pattern Recognition Algorithm’’). We refer to the second step as LOOKUP, because of the use of a database of previously solved maps. Both routines use pattern recognition (though different techniques), and both rely centrally on the extraction of rotation-invariant features. CAPRA feeds the features into a neural network to predict likely locations of Cα atoms in a map.35 Then LOOKUP is run on each consecutive Cα by extracting features for the 5-Å sphere around it and using them to identify
35 T. R. Ioerger and J. C. Sacchettini, Acta Crystallogr. D Biol. Crystallogr. 58, 2043 (2002).
the most similar region of density in the database of solved maps (with features pre-extracted from Cα-centered regions in maps of known structures), from which coordinates of atoms in the new map may be estimated.
Methods
Overview of TEXTAL System
As an automated model-building system, TEXTAL takes an electron density map as input and ultimately outputs a protein model (with atomic coordinates). There are three overall stages, depicted in Fig. 2, that TEXTAL goes through to solve an uninterpreted electron density map. The first step involves tracing the main chain, which is done by the CAPRA subsystem. The output of CAPRA is a set of Cα chains: a PDB file containing several chains (multiple fragments are possible, due to breaks in the main-chain density), each of which contains one atom per residue (the predicted Cα atoms). These Cα chains are fed as input into the second stage, which we call LOOKUP. During LOOKUP, TEXTAL calculates features for each region around a predicted Cα atom and uses these features to search for regions with similar density patterns in a database of regions from maps of known structures, whose features have been calculated offline. For each Cα, LOOKUP extracts the atoms of the local residue in the best-matching region (from a PDB file) and translates and rotates them into corresponding positions in the uninterpreted map. By concatenating these transformed ATOM records, LOOKUP fills out the Cα chains with all the additional side-chain and backbone atoms; the output of LOOKUP is a complete PDB file of the residues that could be modeled. There is a variety of inconsistencies that might need to be resolved, such as getting the independent residue predictions in a chain to agree on directionality
Fig. 2. Main stages of TEXTAL. [Flow diagram: electron density map, CAPRA, backbone (Cα chains), LOOKUP, initial model, post-processing, final model (PDB file).]
of the backbone, refining the coordinates of backbone atoms to draw close to the ideal 3.8-Å spacing of Cα atoms, and adjusting the side-chain atoms to eliminate any steric conflicts. Also, another major postprocessing step involves correcting the identities of the residues by aligning chains into likely positions in the amino acid sequence of the protein, if known. Because TEXTAL is able to recognize amino acids only by the shape (size, structure) of their side chains, there is ambiguity that occasionally leads to prediction of incorrect amino acid identities in chains. By using a special amino acid similarity matrix that reflects the kinds of mistakes TEXTAL tends to make, we often can determine the exact identity of amino acids based on where those chains fit in the known sequence. This can be used to go back through the LOOKUP routine to select the best match of the correct type, producing a more accurate model.
CAPRA: C-Alpha Pattern Recognition Algorithm
CAPRA operates in essentially four main steps, shown in Fig. 3. First, the map is scaled to roughly 1.0σ = 1, which is important for making patterns comparable between different maps. Then, a trace of the map is made. The trace gives a connected skeleton of pseudo-atoms (on a 0.5-Å grid) that generally goes through the center of the contours (i.e., approximating the medial axis). Note that the trace goes not only along the backbone, but also branches out into side chains (similar to the output of Bones).6 CAPRA picks a subset of the pseudo-atoms in the trace (which we refer to as ‘‘waypoints’’) that appear to represent Cα atoms. This central step is done in a pattern-analytic way using a neural network, described later. Finally, after deciding on a set of likely Cα atoms, CAPRA must link them together into chains. This is a difficult task because there are often breaks in the density along the main chain, as well as many false connections between contacting side chains in the density.
CAPRA uses a combination of several heuristic search and analysis techniques to try to arrive at a set of reasonable Cα chains. Tracing is done in a way similar to many other skeletonization algorithms,36,37 as follows. First, a list is built of all lattice points on a 0.5-Å grid throughout the map that are contained within a contour of some fixed threshold (we currently use a cutoff of around 0.7 in density). This leaves hundreds of thousands of clustered points that must be reduced to a backbone. The points are removed by an iterative process, going from worst (lowest density) to best (highest density), as long as they do not create a
36 J. Greer, Methods Enzymol. 115, 206 (1985).
37 S. M. Swanson, Acta Crystallogr. D Biol. Crystallogr. 50, 695 (1994).
Fig. 3. Steps within CAPRA. [Flow diagram: electron density map, scaling of density, tracing of map, predicting Cα locations (neural network), linking Cα atoms into chains, Cα chains.]
local break in connectivity. This is evaluated by collecting the 26 surrounding points in a 3 × 3 × 3 box and preventing the elimination of the center point if it would create two or more (disconnected) components among its 26 neighbors. Generally, because the highest density occurs near the center of contour regions, the outer points are removed first, and maintaining connectivity becomes a factor only as the list of points is reduced nearly to a linear skeleton. What remains is roughly a few thousand pseudo-atoms, typically around 10 times as many as expected Cα atoms (due to the closer 0.5-Å spacing along the backbone, and the meandering of the skeleton into side chains). To determine which of these pseudo-atoms are likely to represent true Cα atoms, CAPRA relies on pattern recognition. The goal is to learn how to associate certain characteristics in the local density pattern with an estimate of the proximity to the closest Cα. CAPRA uses a two-layer feedforward neural network (with 20 hidden units and sigmoid thresholds in each layer) to predict, for each pseudo-atom in the trace, how close it is likely to be to a true Cα (see Ioerger and Sacchettini35 for details). The inputs to the network consist of 19 feature values extracted from the region of density surrounding each pseudo-atom (Table I). The neural network is
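TABLE I
Rotation-Invariant Features Used in TEXTAL to Characterize Patterns of Density in Spherical Regions

Class of features | Specific features | Method of calculation
Statistical | Mean of density | $(1/n)\sum_i \rho_i$
Statistical | Standard deviation | $[(1/n)\sum_i (\rho_i - \bar{\rho})^2]^{1/2}$
Statistical | Skewness | $[(1/n)\sum_i (\rho_i - \bar{\rho})^3]^{1/3}$
Statistical | Kurtosis | $[(1/n)\sum_i (\rho_i - \bar{\rho})^4]^{1/4}$
Symmetry | Distance to center of mass | $|\langle x_c, y_c, z_c \rangle|$, where $x_c = (1/n)\sum_i x_i \rho_i$, etc.
Moments of inertia and their ratios | Magnitude of primary moment | Compute inertia matrix, diagonalize, sort eigenvalues
Moments of inertia and their ratios | Magnitude of secondary moment | (as above)
Moments of inertia and their ratios | Magnitude of tertiary moment | (as above)
Moments of inertia and their ratios | Ratio of primary to secondary moment | (as above)
Moments of inertia and their ratios | Ratio of primary to tertiary moment | (as above)
Moments of inertia and their ratios | Ratio of secondary to tertiary moment | (as above)
Shape/geometry | Min angle between density spokes | Compute 3 distinct radial vectors that have greatest local density summation
Shape/geometry | Max angle between density spokes | (as above)
Shape/geometry | Sum of angles between density spokes | (as above)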
trained by giving it examples of these feature vectors for high-density lattice points at varying distances from Cα atoms, ranging from 0 to around 6 Å, in maps of sample proteins. The weights in the network are optimized on this data set, using the well-known back-propagation algorithm.38 Given these distance predictions, the set of candidate Cα atoms (i.e., waypoints) is selected as follows. All the pseudo-atoms in the trace are ranked by their predicted distance to the nearest Cα. Then, starting from the top (smallest predicted distance), atoms are chosen as long as they are not within 2.5 Å of any previously chosen atoms. This procedure has the effect of choosing waypoints in a more-or-less random order through the protein, but the advantage is that preference is given to the pseudo-atoms with the highest scores first, since these atoms are generally most likely to be near Cα atoms; decisions on those with lower scores are put off until later. Each trace point selected as a candidate Cα has the property of being locally best, in that it has the closest predicted distance to a true Cα among its neighbors within a 2.5-Å radius. Trace points whose predicted distances to Cα atoms are greater than 3.0 Å are discarded, as they are highly unlikely to really be near Cα atoms, and are almost always found in side chains.
38 G. E. Hinton, Artif. Intell. 40, 185 (1989).
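The waypoint selection just described is a greedy procedure, and a minimal sketch of it might look like the following. The function name `select_waypoints`, its parameter names, and the toy coordinates are hypothetical; only the 2.5-Å separation and 3.0-Å cutoff come from the text, and the predicted distances would in practice come from the neural network.

```python
import numpy as np

def select_waypoints(coords, predicted_dist,
                     min_separation=2.5, max_predicted=3.0):
    """Greedily pick trace points likely to be C-alpha atoms.

    coords         : (N, 3) pseudo-atom coordinates in angstroms
    predicted_dist : (N,) predicted distance of each point to the nearest C-alpha
    Returns the indices of the chosen points, best first.
    """
    coords = np.asarray(coords, dtype=float)
    predicted_dist = np.asarray(predicted_dist, dtype=float)
    chosen = []
    # Visit points from smallest to largest predicted distance.
    for i in np.argsort(predicted_dist):
        if predicted_dist[i] > max_predicted:
            break  # every remaining point is even less likely
        # Accept only if no previously chosen point lies within min_separation.
        if all(np.linalg.norm(coords[i] - coords[j]) >= min_separation
               for j in chosen):
            chosen.append(int(i))
    return chosen

# Toy trace: two clusters of points along x plus one distant side-chain point.
trace = [[0.0, 0, 0], [0.5, 0, 0], [1.0, 0, 0],
         [3.8, 0, 0], [4.3, 0, 0], [10.0, 0, 0]]
scores = [0.2, 0.6, 1.1, 0.1, 0.5, 3.5]
print(select_waypoints(trace, scores))
```

Each accepted point suppresses its neighbors within 2.5 Å, so one locally best candidate survives per cluster; a spatial index would be preferable to the quadratic distance check for traces with thousands of pseudo-atoms.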
Next, the putative Cα atoms are linked together into linear chains using the BUILD_CHAINS routine. There are many choices on how to link the Cα atoms together, since the underlying connectivity of the trace forms a graph with many branches and cycles. BUILD_CHAINS relies on a variety of heuristics to help distinguish between genuine connections along the main-chain and false connections (e.g., between side-chain contacts). Possible links between Cα atoms are first discovered by following connected chains of trace atoms to a neighboring Cα candidate not too far away (within 5 Å), provided the path through the trace atoms does not go too close ([...] Å) to another Cα candidate. This links the Cα candidates to [...]
[...] 40 Å defined in the MAD map and contribute weakly to lower the R-free value.
Comparison of Medium- and High-Resolution Refinements
The crystal structure of the aldose reductase complex with IDD 594 had been originally solved at 1.8-Å resolution.28a The availability of two structures from the same crystal form refined initially at 1.8 Å and then
[15]
structural information content at high resolution
Fig. 10. R-free factor as a function of increasing number of water molecules, together with its linear approximation (R-free = 22.03 − 0.024 N waters, r = 0.96).
extended to subatomic resolution provides a unique opportunity to compare them and evaluate the refinement procedures. The overall shape of the models is similar; agreement between the structures is best in the regions with lowest B factor in the ultra-high-resolution structure (RMSD overall, 0.335 Å; RMSD for backbone atoms with B < 4 Å², 0.315 Å). The largest differences are observed at the surface of the molecule. However, the solvent structure is different, and the number of double occupancies in the ultra-high-resolution structure (99 residues show at least 1 non-fully occupied atom) is characteristically large when compared with the 1.8-Å structure, in which no multiple conformations were modeled. The molecular mechanics restrictions used in the refinement of the 1.8-Å resolution structure could not accommodate the departures from standard stereochemistry and the multiple conformations, resulting in shorter contacts (e.g., region around Met-168, Leu-175, and Pro-179). Other short contacts (e.g., C to C contacts around Lys-84) and slight differences in the placement of many side chains can be traced to regions exposed to solvent. In the context of the 0.66-Å resolution structure, one of the most troublesome aspects of the 1.8-Å resolution structure is the general level of noise in the placement of the backbone atoms (on the order of 0.3 Å smeared across the structure). This is particularly obvious when the best resolved regions are compared. The structure was reoptimized, using an all-atom force field, leading to a small improvement (backbone RMSD against ultrahigh resolution, 0.27 Å). R values are too insensitive to reflect
28a R. E. Cachau, S. W. Rick, and A. Podjarny, in preparation (2003).
these differences. These results agree with observations29 suggesting that the use of all-atom force fields results in discrete improvements in low/medium-resolution structures. A characteristic observation in the high-resolution structure is the deviation of the peptide bond from planarity. Given the dominant role of the N–C bond character in the values of all the other parameters30 we decided to test the use of an ω value dependent on secondary structure, improving further the agreement between the backbone atom positions to an RMSD of 0.22 Å. This test is useful as a proof of concept and shows that there is an inherent agreement between the two structures if the restraints are improved. In conclusion, comparison of the structures of aldose reductase at 0.66- and 1.8-Å resolution highlights the difficulties in current refinement protocols. In the lower-resolution structure the geometries of the backbone are largely idealized and do not reflect the perturbation of the geometries due to the presence of the neighboring atoms. The geometry of the peptide bond can be greatly affected by the refinement protocol and it can be used as a yardstick with which to measure the quality of the model.
Conclusions
The atomic model refined against subatomic resolution data enables us to determine with high accuracy the well-ordered regions, and to establish departures from usual stereochemistry or unusual short contacts. The unbiased experimental MAD maps can validate these unexpected structural features. It is also possible to determine fine details, such as hydrogen atoms, in the well-ordered regions. However, in the less-ordered regions, such as the side chains with multiple conformations or the disordered solvent zones, the signal is weak for the low-occupancy conformers and it is necessary to validate possible interpretations. Brute force refinement can lead to model bias, even at atomic resolution. It is in these regions that the experimental phases can be extremely useful, as they show clearly and unambiguously the multiple conformations. In the solvent zone, the experimental maps indicate that only the waters with B < 40 Å² can be clearly assigned to an ordered site, while the rest should only be considered as indication of high-density values. Therefore the experimental phases from MAD experiments at atomic
29 T. R. Transue, J. M. Krahn, and T. A. Darden, in ‘‘Annual Meeting of the American Crystallographic Association,’’ P032. San Antonio, TX (2002).
30 R. Susnow, C. Schutt, and H. Rabitz, J. Comput. Chem. 15, 963 (1994).
resolution can improve interpretation of electron density maps and reduce the model bias in the less-ordered regions. In the more general case, the prevalence of the MAD method in the solution of the phase problem has a strong impact on the speed and accuracy of obtaining a final model. In fact, since the MAD phases are free of model bias, interpretation of the electron density maps is more straightforward and can be automated,31 leading in favorable cases to fast structure determination.1 In the high-resolution cases, models obtained from MAD maps should not be biased by a priori stereochemical information or subjective interpretations, since the atomic positions are directly taken from the electron density map peaks. Therefore, MAD models are more reliable, more accurate, and faster to obtain.
Acknowledgments
We thank Patrick Barth, Cecile Bon, B. Chevrier, Fabio D’all Antonia, Benoit Guillot, Eduardo Howard, Andre Mitschler, Dino Moras, Nukri Sanishvili, Andrea Schmidt, and George Sheldrick for providing data. We thank Federico Ruiz for figure preparation. We thank the SBC staff for assistance in the use of beamlines. This work was supported by the Centre National de la Recherche Scientifique (CNRS), France, by a CNRS-NSF collaboration, by a CERC-CNRS collaboration, by ECOS Sud (project A99502), by funds from the National Cancer Institute (Contract No. NO1-CO-12400) and the National Institutes of Health, by the Institute for Diabetes Discovery, Inc. through a contract with the CNRS, and by the U.S. Department of Energy, Office of Biological and Environmental Research under Contract No. W-31-109-ENG-38. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organization imply endorsement by the U.S. Government.
The submitted manuscript has been created by the University of Chicago as Operator of Argonne National Laboratory (‘‘Argonne’’) under Contract No. W-31-109-ENG-38 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
31 R. J. Morris, A. Perrakis, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 58, 968 (2002).
[16] Full Matrix Refinement as a Tool to Discover the Quality of a Refined Structure
By Lynn F. Ten Eyck
Introduction
Refinement of macromolecular crystal structures against observed X-ray diffraction data is the most precise method for determining those structures. Despite major progress in theory and methods for this task, the refinement step remains labor intensive and continually presents unexpected obstacles. Fundamental problems include (1) evaluation of the accuracy of the refined model and (2) evaluation of the precision of the parameters of the refined model. The first of these problems is essentially the problem of determining where the model is inconsistent with the data, and the second involves determining how precisely the data determine the adjustable parameters of the model. These problems are becoming much more acute with the advent of structural genomics. The structural genomics projects are designed to produce large numbers of structures automatically to populate the structural databases. The number of structures to be solved precludes detailed expert examination of all refinement calculations. If we are to avoid pollution of the structural databases, we must develop more robust and reliable refinement methods. The methods described in this chapter, which can be run during or after a structure refinement, are intended to address this problem. With these algorithms both the overall and local quality of the structure can be rigorously evaluated. These methods also appear capable of precisely identifying at least some places where the refined model will need further attention. Structure refinement is most simply posed as a large optimization problem, typically with thousands of variables and comparable numbers of observations. The complexity of the refinement problem is owed only partly to the size of the problem; the relationship between parameters and observables is nonlinear, and the function to be optimized contains a number of local minima. 
The nonlinearity greatly slows convergence of the refinement calculations, and unfortunately the local minima can closely resemble the global minimum. This chapter steps back from the complexities of model construction and model building to look at the foundations of optimization. The methods described here are thus applicable to both least-squares and maximum-likelihood refinement protocols. They are also extremely useful
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
analysis and software
[16]
for studying the effects of different parameterizations of the models: anisotropic atomic displacement parameters, rigid substructures, internal versus Cartesian coordinates, restraints versus constraints, and so on. The methods show promise for automated detection of inconsistencies between the model and the data.
Principles of Optimization
The crystallographic optimization problem for biological macromolecules is nonlinear and plagued with local minima, but has some redeeming features. The functions to be refined are generally continuous, with continuous derivatives. Furthermore, the problem is remarkably stable compared with other nonlinear problems of comparable size. The stability arises in part because the diffraction experiment amounts to a physical Fourier transform, and thus the problem is expanded in orthogonal functions (sines and cosines). This feature tends to isolate the effects of observational error, missing observations, and limitations on resolution. The discussion here is restricted to optimization of functions of continuous variables, which covers the issues encountered in refinement of models against X-ray diffraction data, but does not cover other aspects of crystallography, such as optimization of electron density maps as a function of phase angles; phase angles for reflections in centrosymmetric zones are not continuous variables.
Taylor Expansion
The basic optimization methods are based on the Taylor series expansion of the function to be optimized, that is,

\[
\phi(x) = \phi(x_0) + \left\langle (x - x_0) \,\middle|\, \frac{\partial \phi}{\partial x_i} \right\rangle_{x_0} + \frac{1}{2} \left\langle (x - x_0) \,\middle|\, \frac{\partial^2 \phi}{\partial x_i\, \partial x_j} \,\middle|\, (x - x_0) \right\rangle_{x_0} + \cdots \tag{1}
\]
which, provided the derivatives exist and are bounded, converges over some interval about $x_0$. Equation (1) uses the Dirac bracket notation, in which $|x\rangle$ denotes a column vector with elements $x_i$, $|m|$ represents a matrix with elements $m_{i,j}$, and $\langle x|$ is a row vector with elements $x_i$. Thus $|M|x\rangle$ is the column vector produced by premultiplying the column vector $x$ by the matrix $M$, and $\langle x|y\rangle$ is the inner product of vectors $x$ and $y$. Equation (1) could be written (more verbosely) as
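\[
\phi(x) = \phi(x_0) + \sum_{i=1}^{n} (x_i - x_{0,i}) \left.\frac{\partial \phi}{\partial x_i}\right|_{x_0} + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_{0,i}) \left.\frac{\partial^2 \phi}{\partial x_i\, \partial x_j}\right|_{x_0} (x_j - x_{0,j}) + \cdots
\]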
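A small numerical check can make the role of each term concrete. The sketch below truncates the expansion at zeroth, first, and second order for an arbitrary two-variable test function and compares each truncation with the exact value. The test function, the expansion point, and all names (`taylor`, `f`, `gradient`, `hessian`) are chosen purely for illustration and have nothing to do with a crystallographic target function.

```python
import numpy as np

def f(p):
    """Arbitrary smooth test function of two variables."""
    x, y = p
    return np.exp(x) + x * y**2

def gradient(p):
    """Analytic first derivatives of f."""
    x, y = p
    return np.array([np.exp(x) + y**2, 2.0 * x * y])

def hessian(p):
    """Analytic second-derivative matrix of f."""
    x, y = p
    return np.array([[np.exp(x), 2.0 * y],
                     [2.0 * y, 2.0 * x]])

def taylor(order, x0, x):
    """Taylor series of f about x0, truncated at the given order (0, 1, or 2)."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    value = f(x0)
    if order >= 1:
        value += gradient(x0) @ d            # first-order (gradient) term
    if order >= 2:
        value += 0.5 * d @ hessian(x0) @ d   # second-order (curvature) term
    return value

x0 = np.array([0.0, 1.0])
x = np.array([0.1, 1.1])
for order in (0, 1, 2):
    approx = taylor(order, x0, x)
    print(order, approx, abs(approx - f(x)))
```

For this small displacement the error shrinks with each added term, which is the behavior that the zeroth-, first-, and second-order optimization methods exploit to different degrees.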
Different optimization methods truncate the Taylor series at different levels.
Zeroth-Order Methods
Methods that use no derivatives are useful for functions that are irregular or have many local minima. In these cases they can locate regions in which the global minimum may be found. They are generally slow to converge to the final result. In crystallography these methods are generally used on the phase problem but not on the refinement problem. This category of methods includes direct search (scanning the entire function space), Monte Carlo algorithms (randomly searching the function space), some forms of simulated annealing, genetic algorithms, and simplex searches. SnB1 and SHELXD2 are examples of crystallographic programs that work by direct searches. Other examples are found in structure solution by molecular replacement; the initial search for rotation peaks is usually a direct scan of the entire function space. Candidate solutions are then refined, sometimes by higher-order methods, but the initial search is a zeroth-order method.
First-Order Methods
First-order methods use information from the gradients of the target function. This class of methods is popular for macromolecular crystallography because the storage requirements grow only linearly with the number of parameters. The most direct version, steepest descents, simply follows the gradient into the closest local minimum. This is not particularly efficient. Variations on these methods search in the direction of the gradient by taking several steps and extrapolating to a predicted position of a minimum. An efficient line search can greatly accelerate the performance of these methods without sacrificing their robustness.
Second-Order Methods
Second-order methods include information about correlation between parameters, which is found in all of the off-diagonal elements of the second derivative matrix. This information can greatly accelerate convergence,
1 R. Miller, S. M. Gallo, H. G. Khalak, and C. M. Weeks, J. Appl. Crystallogr. 27, 613 (1994).
2 T. R. Schneider and G. M. Sheldrick, Acta Crystallogr. D Biol. Crystallogr. 58, 1772 (2002).
often at a high cost in storage. The memory requirements for full matrix methods are proportional to the square of the number of parameters, which for macromolecular problems has been prohibitively large. Advances in computer design, and reductions in cost, now make these methods feasible for all but the largest problems, and probably feasible for those in the near future. The most popular full matrix program in crystallography is SHELXL.³,⁴ Full matrix methods retain all the correlation information.

There are numerous methods for obtaining some of the benefits of second-order methods without paying the full penalty in storage and execution time. The most obvious is to use a sparse second derivative matrix, which contains only the more significant matrix elements. The program REFMAC,⁵,⁶ for example, does this by using a matrix that contains only the single atom blocks on the diagonal, and the off-diagonal blocks relating the coordinates of atoms connected by restraints. TNT⁷,⁸ uses the method of conjugate gradients, which accumulates information about the effects of off-diagonal second derivative matrix elements by monitoring changes in the gradient vector as the refinement proceeds. The gradients used in TNT are scaled by the inverse of the curvature,⁸ which markedly increases the speed and convergence radius of the program. SHELXL has a conjugate gradients mode as well, especially designed for macromolecular refinement. These methods do not return information about the full matrix at the end of the calculation.

The key to the power of second-order methods, both for acceleration of convergence and for analysis of the properties of the function being optimized, is that as the coordinates approach a minimum, the first-order term of the Taylor series expansion of the function vanishes and the second-order term becomes the curvature matrix for the function.
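Thus if the position x_0 is the (unknown) position of the global minimum,

\[
\phi(x) \approx \phi_0 + \frac{1}{2} \left\langle (x - x_0) \right| \left.\frac{\partial^2 \phi}{\partial x_i\,\partial x_j}\right|_{x_0} \left| (x - x_0) \right\rangle \tag{2}
\]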
³ G. M. Sheldrick and T. R. Schneider, SHELXL: Methods Enzymol. 277, 319 (1997).
⁴ G. M. Sheldrick, “SHELXL-97: A Program for the Refinement of Crystal Structures from Diffraction Data.” Institut für Anorganische Chemie, Göttingen, Germany, 1997.
⁵ G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997).
⁶ G. N. Murshudov, A. Lebedev, A. A. Vagin, K. S. Wilson, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 55, 247 (1999).
⁷ D. E. Tronrud, L. F. Ten Eyck, and B. W. Matthews, Acta Crystallogr. A 43, 489 (1987).
⁸ D. E. Tronrud, Acta Crystallogr. A 48, 912 (1992).
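and by differentiation,

\[
\left.\frac{\partial \phi}{\partial x_i}\right|_{x} \approx \left.\frac{\partial^2 \phi}{\partial x_i\,\partial x_j}\right|_{x_0} \left| (x - x_0) \right\rangle \tag{3}
\]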
The gradient at position x depends on the coordinate shift and the curvature matrix. The parameter shifts can be estimated by solving Eq. (3) using matrix elements calculated at x instead of at the unknown position x_0. This approximation is not a problem for parameter estimates close to x_0. Note that the assumption that the function can be approximated locally by a quadratic polynomial is equivalent to assuming that the matrix of second derivatives is constant.

In the crystallographic refinement case the function φ(x) is either the least-squares target or the negative log likelihood target.⁹ In both cases the functional form to be minimized is

\[
\phi(x) = \frac{1}{2} \sum_{j=1}^{m} w_j^2 \left( f_j(x) - y_j \right)^2
\]

where y_j corresponds to an observed value and f_j(x) is either the calculated value or the expectation value given the parameters x. The term w_j is the weight applied to the difference between observation and calculation. In least-squares refinement w_j² is simply the inverse of the variance of the measurement of y_j, but for maximum likelihood refinement the weight takes into account both the reliability of the observation and the reliability of the estimate of the expectation value.⁹ The solutions for the parameter shifts can be written in the form of a set of linear equations as

\[
H (x - x_0) = g
\]

where the elements of H and g are given by

\[
h_{ij} = \sum_{k=1}^{m} w_k^2 \, \frac{\partial f_k(x)}{\partial x_i} \, \frac{\partial f_k(x)}{\partial x_j},
\qquad
g_i = \sum_{k=1}^{m} w_k^2 \left( f_k(x) - y_k \right) \frac{\partial f_k(x)}{\partial x_i} \tag{4}
\]
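As a rough illustration of Eq. (4), the following sketch accumulates H and g for a toy two-parameter model and solves the linear equations for the shift. The model matrix `A`, the starting point, and the helper names `normal_equations` and `solve_2x2` are hypothetical choices made for this example and are not taken from any refinement program. The toy model is exactly linear, so the calculated shift lands on the minimum.

```python
def normal_equations(jacobian, residuals, weights):
    """Accumulate H and g as in Eq. (4).

    jacobian[k][i] = d f_k / d x_i, residuals[k] = f_k(x) - y_k,
    weights[k] = w_k (squared inside the sums).
    """
    n = len(jacobian[0])
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for row, r, w in zip(jacobian, residuals, weights):
        w2 = w * w
        for i in range(n):
            g[i] += w2 * r * row[i]
            for j in range(n):
                H[i][j] += w2 * row[i] * row[j]
    return H, g


def solve_2x2(H, g):
    """Solve H s = g for a 2 x 2 system by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


# Hypothetical linear model: f_k(x) = A[k][0]*x[0] + A[k][1]*x[1]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
x_min = [0.3, -0.2]                      # the "unknown" minimum x_0
y = [a0 * x_min[0] + a1 * x_min[1] for a0, a1 in A]

x = [1.0, 1.0]                           # current parameter estimate
residuals = [a0 * x[0] + a1 * x[1] - yk for (a0, a1), yk in zip(A, y)]
H, g = normal_equations(A, residuals, [1.0] * len(A))

shift = solve_2x2(H, g)                  # estimate of (x - x_0)
x_new = [xi - si for xi, si in zip(x, shift)]
print(x_new)                             # close to x_min after one step
```

For a real structure the derivatives change with x, so H and g must be recomputed and the step repeated.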
If the problem were truly linear this process would converge in one step. In practice a number of steps are required, and the magnitude of the gradient vector ⟨g|g⟩^{1/2} does not reach zero. The “last mile” of the refinement is difficult because the model does not exactly match the data, and attempting to
⁹ N. S. Pannu and R. J. Read, Acta Crystallogr. A 52, 659 (1996).
match it produces inconsistencies. These inconsistencies, and a number of other features of the problem, can be analyzed by examination of the curvature matrix H. If v is an arbitrary vector in parameter space representing a displacement from the minimum position x_0, we have from Eqs. (2) and (3)

\[
\phi(x_0 + v) \approx \phi_0 + \frac{1}{2} v^{T} H v \tag{5}
\]

which is a standard quadratic form describing an N-dimensional ellipsoid in the parameter space. The eigenvalues and eigenvectors of H are defined as solutions to the matrix equation

\[
H v = \lambda v
\]

where the scalar λ is an eigenvalue of H and the vector v is the corresponding eigenvector of H. A symmetric n × n matrix of rank n has n eigenvectors with nonzero eigenvalues; any point in the n-dimensional space can be expressed as a weighted sum of the eigenvectors. The eigenvectors have the property that v_i · v_j = 0 if i ≠ j. Setting v in Eq. (5) to an eigenvector of H gives

\[
\phi(x_0 + v_i) - \phi_0 = \frac{1}{2} v_i^{T} H v_i = \frac{1}{2} \lambda_i v_i^{T} v_i = \frac{1}{2} \lambda_i \tag{6}
\]
when v_i is the ith normalized eigenvector of H and λ_i is the corresponding eigenvalue. Equation (6) means that if the parameters are shifted in the direction of one of the eigenvectors of H, the value of the function will change in direct proportion to the corresponding eigenvalue. Because the eigenvectors of H are perpendicular to one another, they specify combinations of parameters that are statistically independent of one another, and the eigenvalues are proportional to the reciprocal of the variance of those parameter combinations. Another way of expressing the same idea is that the eigenvectors that correspond to large eigenvalues are directions in which small parameter shifts have a large effect on the sum of squares of the residuals, and thus the parameters are well determined. Shifts of the coordinates in the directions of eigenvectors that correspond to small eigenvalues have little effect on the sum of squares of the residuals and thus correspond to poorly determined combinations of parameters. This situation is illustrated for a hypothetical two-parameter example in Fig. 1.

[Figure 1: plot of y against x, both axes running from −1 to 1, showing two ellipses and the lines x − y = 0 (constant separation) and x + y = 0 (constant center).]

Fig. 1. Two possible scenarios for a one-dimensional, two-parameter problem described in text. Both ellipses have the same eigenvectors, but the eigenvalues are swapped. The eigenvector in the (+, +) quadrant (upper right) corresponds to a shift of parameters that preserves the distance between the two atoms but moves the center of mass of the system. The eigenvector in the (+, −) quadrant (lower right) corresponds to a shift of parameters that preserves the center of mass of the two atoms but changes the separation between them. The two ellipses reflect situations in which either the separation is more accurately known than the position, or in which the position is known more accurately than the separation. The former could arise if low-resolution data are missing, and the latter could arise if high-resolution data are missing.

Consider a one-dimensional crystallographic case with two atoms, one at position x and one at position y. Let x and y be close enough that at low resolution the peaks are not well resolved. Low-resolution Fourier coefficients will be sensitive to the position of the unresolved peak, while high-resolution Fourier coefficients will bring out the detail in the electron density that shows the separation between the atomic centers. The model could be equally well described in terms of the center of mass (x + y)/2 and the distance between centers (x − y). If significant low-resolution data are missing, the accuracy of the center of mass is affected, and thus x + y is not well determined. This would produce a small eigenvalue in the direction corresponding to shifts of both coordinates in the same direction (the shift that preserves the value of x − y). If significant high-resolution data are missing, the accuracy of the determination of the distance between x and y is affected and the eigenvalue corresponding to shifts of the coordinates in opposite directions is small. Figure 1 shows two ellipses reflecting these situations. The ellipse with the long axis in the (+, −) direction corresponds to a case with poor high-resolution data, while the ellipse with the long axis in the (+, +) direction corresponds to poor low-resolution data. Small eigenvalues, which imply long axes for the ellipses, mean that statistical noise in the measured data can easily perturb the solution in those directions. The practical consequence of large variances is that small errors in the data can easily have large effects on the values of the refined parameters.

Underdetermined Systems

Crystals often do not diffract to a resolution sufficient to determine fully all of the parameters of an atomic model. In such cases the matrix of second derivatives is singular, and Eq. (4) does not have a unique solution. In this case one or more of the eigenvalues is zero. If λ_i = 0 the optimization target function is not affected by any change of parameters in the direction v_i until the shift becomes large enough that the nonlinearity of the problem causes the Taylor series approximation to break down. If more than one of the eigenvalues is zero, any point in the entire subspace spanned by all of the corresponding eigenvectors is as good as any other with respect to the value of φ. It should be noted that none of the zeroth-order, first-order, or truncated second-order methods properly detect singularity. These methods will converge to an arbitrary point in the undetermined parameter subspace, but will give no reliable indication that portions of the solution reported are undetermined.

Crystallographic Applications
The methods suggested here have been tried on three test systems. The first case examined was a small lactone, C₇H₁₁NO₃, the data for which are distributed with the SHELXL-93 program as a test case. The data extend to 0.8-Å resolution. The model contains coordinates and anisotropic thermal parameters for each heavy atom; the hydrogen atoms ride at calculated positions with one exception, which has a free bond rotation parameter. In all, there are 105 parameters in the lactone model.

The second test case is a small protein, amicyanin,¹⁰,¹¹ which contains 105 amino acid residues and 1 copper atom. The model also contains 88 solvent molecules, for a total of 896 atoms and 3856 parameters when using
¹⁰ R. Durley, L. Chen, L. W. Lim, F. S. Mathews, and V. L. Davidson, Protein Sci. 2, 739 (1993).
¹¹ L. Chen, R. C. Durley, F. Scott Mathews, and V. L. Davidson, Science 264, 86 (1994).
isotropic thermal parameters. (Two parameters are used for scaling.) The data extend to 1.07-Å resolution and were kindly provided by F. Scott Mathews and N.-h. Xuong. The model used in these preliminary calculations was not fully refined.

The third test case is a 7-Fe ferredoxin from Azotobacter vinelandii for which two sets of data are available,¹²,¹³ at resolutions of 1.9 Å (room temperature) and 1.3 Å (frozen). These data were provided by D. E. McRee and C. D. Stout.

Small Molecule Test Case

The lactone refinement was done as described in the SHELXL manual. Refinement is on |F|², hydrogen atoms were given a riding model, and all heavy atoms were refined with anisotropic thermal parameters. The program was modified so that the matrix was written to disk before Marquardt damping was applied. Some results for the lactone are presented in Figs. 2–5. Figures 2 and 3 introduce the use of the “power” represented by a particular eigenvector component. The eigenvectors of H are orthonormal, which means that if the matrix V contains the eigenvectors as columns we have

\[
\sum_{k=1}^{N} v_{ik}^2 = 1 \quad \text{and} \quad \sum_{k=1}^{N} v_{ki}^2 = 1
\]

Each element v_ij gives the ith component of the jth eigenvector, that is, the contribution of parameter i to eigenvector j. The components of the eigenvectors corresponding to particular parameters can be examined to see which are important for determining the value of that parameter. A convenient measure is to look at the sums of squares of elements of V; these give the square of the length of the projection of the parameter i into the subspace spanned by the corresponding columns of V.

Figures 2 and 3 show that the quality of the parameters does not decrease uniformly. Significant information persists at low resolution. These figures are also completely consistent with small-molecule refinement experience in that anisotropic thermal parameters cannot be refined at low resolution, and that they are clearly more strongly correlated with other parameters than the spatial coordinates are.

Singular matrices do not mean that all parameters of the model are undetermined. Figure 2 shows that if the data are truncated to 2.0 Å, the first 20 eigenvalues are zero, but all the eigenvectors that contain components
¹² C. D. Stout, J. Biol. Chem. 268, 25920 (1993).
¹³ C. D. Stout, E. A. Stura, and D. E. McRee, J. Mol. Biol. 278, 629 (1998).
[Figure 2: C-11 x coordinate. Power in eigenvector (left axis, 0 to 0.20) and eigenvalue (right axis, logarithmic, 10² to 10¹⁰) versus eigenvector number (0 to 100), for 0.8-Å and 2.0-Å data.]

Fig. 2. The peaks in this graph represent the contribution of each eigenvector (Power) to the x coordinate of atom C-11 at 0.8- and 2.0-Å resolution. The eigenvalue spectra are shown on the same graph as solid and dashed continuous curves. At 2.0 Å the first 20 eigenvalues are zero, but the coordinate is still strongly determined by the data.
in the direction of the x coordinate of atom C-11 have large eigenvalues. Most of the coordinates for this structure are demonstrably well determined even when the refinement is singular. Figure 3, in contrast, shows that the U₁₁ atomic displacement parameter of the same atom (C-11) is well determined with data to 0.8-Å resolution, but is not determined at all at 2.0-Å resolution.

Figures 4 and 5 show the effect of resolution on the eigenvectors of the lactone. The high-resolution vectors in Fig. 4 show a small amount of correlation. The peak in vector 100 near 50 on the scale is the x, y, and z coordinates of atom N-6, indicating that the positional uncertainty for this atom is essentially uncorrelated with any other parameter. The peaks at parameters 90–95 in vector 7 are the U_ij parameters of atom C-10. Finally, the peak at 96 in vector 6 corresponds to the torsion angle of hydrogen H-10.

The low-resolution eigenvectors shown in Fig. 5 are much more diffuse. Each eigenvector carries information about a number of parameters, but the number is not always large. For example, vector 100, which corresponds
[Figure 3: C-11 U₁₁. Power in eigenvector (left axis, 0 to 0.10) and eigenvalue (right axis, logarithmic, 10² to 10¹⁰) versus eigenvector number (0 to 100), for 0.8-Å and 2.0-Å data.]

Fig. 3. As with Fig. 2, the peaks on this graph show the contribution (Power) of each eigenvector to the U₁₁ parameter of atom C-11 at 0.8- and 2.0-Å resolution. This parameter is well determined at 0.8-Å resolution because the smallest eigenvalue contributing to the solution is greater than 10⁶. Any change in this parameter at this resolution will thus have a large effect on the sum of squares of residuals. At 2.0-Å resolution the parameter is essentially undetermined because there are contributions to the parameter from eigenvectors corresponding to very small eigenvalues. Shifts of the parameter in those directions will produce small changes in the sum of squares of residuals. The low-resolution data cannot distinguish between different values for this parameter.
to an eigenvalue of 5.5 × 10⁶, has four significant peaks, most of which span several adjacent parameters. These correspond to the x, y, and z coordinates of atoms O-1, O-2, and N-6; and to the z coordinate of atom O-3.

This test case demonstrates the way in which the eigenvectors of the normal matrix carry information about multiparameter correlation. The high-resolution examples show that with sufficient data most parameters can be completely resolved, although this will of course depend on the specific problem. Vector 100 in the 2.0-Å example showed mutual correlation between the positional parameters of three atoms, but interestingly the x and y parameters of atom O-3 are not correlated with the coordinates of atoms O-1, O-2, and N-6. This kind of information is presented much more clearly by the eigenvectors than by analysis of the pairwise correlation coefficients obtained by inversion of the normal matrix.
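The “power” plotted for the lactone is the squared eigenvector component, and its row and column sums follow from the orthonormality of the eigenvectors. The sketch below is a minimal illustration, assuming a hypothetical three-parameter normal matrix (two strongly coupled coordinates and one weakly determined, uncoupled parameter) and a small cyclic Jacobi routine, `jacobi_eigh`, written only for this example; a real calculation would use a library eigensolver.

```python
import math


def jacobi_eigh(matrix, sweeps=50, tol=1e-20):
    """Eigen-decompose a small symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, V) with the eigenvectors stored as columns of V.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p, q of a and of v
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):   # rotate rows p, q of a
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    return [a[i][i] for i in range(n)], v


# Hypothetical normal matrix: parameters 0 and 1 are coupled, 2 is not.
H = [[5.0, 3.0, 0.0],
     [3.0, 5.0, 0.0],
     [0.0, 0.0, 0.001]]
eigenvalues, V = jacobi_eigh(H)

# power[i][k]: contribution of eigenvector k to parameter i
power = [[V[i][k] ** 2 for k in range(3)] for i in range(3)]
row_sums = [sum(power[i]) for i in range(3)]
col_sums = [sum(power[i][k] for i in range(3)) for k in range(3)]

for lam, column in zip(eigenvalues, zip(*power)):
    print(f"eigenvalue {lam:10.4g}  power per parameter "
          f"{[round(p, 3) for p in column]}")
```

In this toy case parameters 0 and 1 each divide their power equally between two eigenvectors, which is the signature of correlation, while parameter 2 lies entirely in the eigenvector with the smallest eigenvalue and is therefore poorly determined but uncorrelated.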
[Figure 4: power (v_i²) versus parameter number (0 to 100) for vectors 6, 7, and 100; vertical scale 0 to 1.0.]

Fig. 4. Selected eigenvectors for the lactone test case at 0.8-Å resolution. Each eigenvector has only a few significant components, showing little correlation between these parameters at high resolution.

[Figure 5: power (v_i²) versus parameter number (0 to 100) for vectors 6, 7, and 100; vertical scale 0 to 0.20.]

Fig. 5. Selected eigenvectors for the lactone test case at 2.0-Å resolution. Note that the vertical scale is different from that in Fig. 4. These eigenvectors show a great deal of multiparameter correlation.
Amicyanin

Analysis of the amicyanin normal matrix is far more complex. Normal matrices were calculated for this protein at resolutions of 1.07, 2.0, 2.5, 3.0, and 3.5 Å. Refinement was done using |F|². Two refinements were calculated at each resolution: one that used the standard stereochemical restraints produced by the PDBINS program, which is part of the SHELX distribution, and one that used no stereochemical restraints. The model was refined with isotropic thermal parameters.

Figures 6 and 7 show the eigenvalue spectra for the 10 refinements. The restrained refinement contains no surprises. The matrices are nonsingular at resolutions better than 3.0 Å. The unrestrained refinements shown in Fig. 7 show that even at 3.5 Å there are 657 eigenvalues greater than 10⁵ and hence significant information determined by the X-ray diffraction data. Both sets of curves have a pronounced drop near 900 due to a large difference in the effective scale of atomic displacement parameters compared with the atomic coordinates. The scaling issue is discussed in more detail in Cowtan and Ten Eyck.¹⁴
[Figure 6: eigenvalue (logarithmic, 10⁰ to 10¹⁰) versus eigenvalue number (0 to 3000) for data to 1.07, 2.0, 2.5, 3.0, and 3.5 Å.]

Fig. 6. Eigenvalue spectra for restrained refinements of amicyanin show that the system becomes singular between 2.5- and 3.0-Å resolution. The pronounced “knee” in the spectra near 900 is due to the 896 thermal parameters.
¹⁴ K. Cowtan and L. F. Ten Eyck, Acta Crystallogr. D Biol. Crystallogr. 56, 842 (2000).
[Figure 7: eigenvalue (logarithmic, 10⁰ to 10¹⁰) versus eigenvalue number (0 to 3000) for data to 1.07, 2.0, 2.5, 3.0, and 3.5 Å.]

Fig. 7. Eigenvalue spectra for unrestrained refinements of amicyanin show the importance of restraints in maintaining the conditioning of the refinement problem. The matrices become singular between 2.0- and 2.5-Å resolution.
The quality of the parameter estimates is made much clearer by Fig. 8, which also demonstrates the power of the method for locating poorly refined portions of the structure. For Fig. 8 thresholds of 10⁶ and 10³ were chosen for well-determined and poorly determined parameters. The squares of the components of each parameter were summed in the subspaces determined by λ > 10⁶ and λ < 10³. The parameters are sorted so that the first two are the scale factors, the next 896 are the thermal parameters, and the remaining parameters are the spatial coordinates. In each case the parameters for the 88 solvent molecules are to the right. The well-determined set is drawn in gray; the poorly determined set is drawn in black. High values in gray indicate well-determined parameters, while high values in black indicate poorly determined parameters.

Sections of suspect structure are immediately evident. The 1.07-Å restrained refinement contains a number of gray spikes downward, which clearly indicate parameters that are not as well determined as the rest of the structure. The spike in the black curve indicates a thermal parameter that is unstable on refinement. As the resolution decreases, the behavior of the thermal parameters and the level of accuracy of the coordinates show behavior consistent with current experience, except that the thermal parameters are perhaps more poorly determined at 2.5 Å than an optimist might hope. The vertical black bar close to 900 indicates that
[Figure 8: ten panels of power in range (0 to 1.0) versus parameter number (0 to 3000), arranged in two columns (unrestrained, restrained) and five rows (1.07, 2.0, 2.5, 3.0, and 3.5 Å); black, λ < 10³; gray, λ > 10⁶.]

Fig. 8. Fraction of the information for each parameter contained in eigenvectors with eigenvalues either greater than 10⁶ (gray) or less than 10³ (black) for the protein amicyanin at resolutions from 3.5 to 1.07 Å. High values plotted in gray correspond to well-determined parameters; high values plotted in black correspond to badly determined parameters. Parameters are sorted so that atomic displacement parameters are on the left and atomic coordinates are on the right. Within each block the parameters for the 808 protein atoms are first, followed by parameters for 88 solvent molecules. The 10 graphs show the effects of resolution and of geometric restraints on the precision of the parameter determinations.
the thermal parameters for the solvent atoms are almost uniformly dubious at 2.0 Å.

Comparison of the restrained and unrestrained columns is particularly enlightening. Many nonsolvent coordinates are well determined at 3.5 Å in the restrained refinement, but are not determined at all without restraints. This shows how much of the information in the final structures is coming from the restraints as a function of resolution. This feature is highlighted further by the poor quality of the solvent coordinates in the restrained refinements; the only restraints that help them are the antibumping restraints. Another noteworthy feature is that the quality of the solvent coordinates in the 2.5-Å restrained structure is better than the quality of the same coordinates in the unrestrained structure, even though there are few restraints on the solvent molecules. This demonstrates that the restraints help the overall structure, not just the portions restrained.

Ferredoxin

Two data sets at different resolutions for the protein Azotobacter vinelandii 7-Fe ferredoxin¹³ were used to test the methods further. More detailed results may be found in Cowtan and Ten Eyck,¹⁴ from which these data are taken. The protein crystallized in space group P4₁2₁2, with cell dimensions a = b = 54.8, c = 92.6 for the frozen crystals. The structure contains 106 residues and 2 Fe-S clusters. The structure was originally solved by Stout¹² with room temperature data to 1.9 Å (9586 reflections). The model was refined in X-PLOR, with a final R factor of 21.5%. Thirty water molecules were modeled.

The structure was later rerefined by Stout et al.,¹³ using data from frozen crystals to a resolution of 1.3 Å (30,880 reflections). The model was refined with anisotropic thermal parameters and 162 water molecules (9 parameters per atom = 9211 parameters) using the conjugate gradient mode of SHELXL⁴ to an R factor of 15%.
The free-R factor was estimated for this model by perturbing the coordinates and forcing the thermal parameters to be isotropic. Anisotropic refinement was then repeated with 5% of the reflections left out of the refinement. The R factor for these data was 19%. Restraints. To examine the effect of various restraints on the conditioning of the problem, the normal matrices were calculated, adding successive restraints to the unrestrained calculation to obtain some indication of the effect of various restraints. The eigenvalue spectra for all the calculations are shown in Fig. 9. The types and numbers of different geometric restraints are listed in Table I. These spectra show the same scaling effect for atomic displacement parameters as the amicyanin test case.
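The way a restraint raises an eigenvalue can be seen in a toy calculation. The sketch below assumes a hypothetical one-dimensional, two-atom X-ray normal matrix in which the separation of the atoms is almost undetermined, and adds a single distance restraint with an esd of 0.02 (the DFIX value listed in Table I) as an extra term w²JJᵀ, where J holds the derivatives of the restrained distance with respect to the two coordinates. All numbers other than the esd are invented for illustration, and the function name `eigenvalues_sym_2x2` is not from any crystallographic package.

```python
import math


def eigenvalues_sym_2x2(m):
    """Eigenvalues (ascending) of a symmetric 2 x 2 matrix, closed form."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return [mean - radius, mean + radius]


# Hypothetical X-ray term for coordinates (x, y) of two unresolved atoms:
# the center x + y is determined, the separation y - x barely so.
H_xray = [[5.0, 4.9],
          [4.9, 5.0]]

# Distance restraint on d = y - x: derivatives J = (-1, +1), weight w = 1/esd.
esd = 0.02
w2 = 1.0 / esd ** 2
J = [-1.0, 1.0]
H_restraint = [[w2 * J[i] * J[j] for j in range(2)] for i in range(2)]

H_total = [[H_xray[i][j] + H_restraint[i][j] for j in range(2)]
           for i in range(2)]

unrestrained = eigenvalues_sym_2x2(H_xray)
restrained = eigenvalues_sym_2x2(H_total)
print("unrestrained:", unrestrained)
print("restrained:  ", restrained)
```

In this toy case the eigenvalue belonging to the separation rises from about 0.1 to about 5000, while the eigenvalue belonging to the center stays at about 9.9, because the restraint acts only along the (+, −) direction.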
[Figure 9: log10 (eigenvalue) versus eigenvalue number (0 to 3000) for five calculations: unrestrained; Dist; Dist,ang; Dist,ang,geom; and Dist,ang,geom,δU.]

Fig. 9. Effect of various geometric restraints on the eigenvalue spectrum of Azotobacter vinelandii 7-Fe ferredoxin with 1.9-Å data. Note: Line styles are obscured where lines overlap.
TABLE I
Number of Restraints, Type, and Desired esd for 1.9-Å Restrained Refinement

Type of restraint        SHELXL keyword   SHELXL restraint esd   Number of restraints
Bond length              DFIX             0.02 Å                 862
Bond angle               DANG             0.04 rad               1177
Chiral volume            CHIV             0.10 Å³                149
Flat res/ring            FLAT             0.45 Å³                262
Antibumping              BUMP             0.02 Å                 15
Bonded atom U values     SIMU             0.14 Å²                851
Addition of bond length restraints to the unrestrained calculation makes a significant difference to the shape of the spectrum—all the eigenvalues increase, but in particular the 900 largest eigenvalues increase significantly more than the others. This correlates well with the 862 bond length restraints. Addition of the bond angle restraints significantly increases the remaining eigenvalues not affected by the bond lengths. This is consistent
with the bond angle restraints acting mainly perpendicular to the bond length restraints. The few remaining positional restraints (chiral volumes, flat rings, antibumping) give a slight further increase in the remaining eigenvalues. Restraining the U values of neighboring atoms increases the eigenvalues in the thermal region of the spectrum, including the smallest eigenvalues. The plot also shows that the positional restraints (particularly bond lengths and angles) do affect the smaller eigenvalues, which are mainly from thermal parameters. This supports the observation that the positional and thermal parameters are not separable.

Most of the eigenvectors have clear physical interpretations. The primary components of a typical well-determined eigenparameter are shown in Fig. 10. This eigenvector captures parameter shifts that would cause Trp-78 to buckle.

Detection of Inconsistencies. The right-hand side of the least-squares equations (the RHS vector) is obtained by premultiplying the residual vector of restraint disagreements by the matrix of derivatives with respect to the parameters. In theory, at the end of a refinement the RHS vector of the least-squares equations should be zero, otherwise technically the refinement has not converged. However, in practice this is never the case, for a number of reasons.

1. The eigenvalues have a large dynamic range. Small shifts to well-determined parameters will perturb the normal matrix sufficiently that
[Figure 10: molecular drawing of Trp-78 with atoms CA and CE2 labeled.]

Fig. 10. Eigenvector 3546: This well-determined eigenparameter from the ferredoxin refinement is involved in the flatness of a tryptophan.
estimates for the ill-determined parameters will be subject to large errors. Ill-determined parameters may therefore not refine until all the well-determined parameters have been fully refined.
2. The elements of the normal matrix may not be slowly varying from cycle to cycle. If elements of the normal matrix vary rapidly with some parameters, those parameters will refine slowly or not at all. This is because in such cases the second-order approximation to the sum of squares of residuals is poor. Inconsistent restraints can cause this behavior in some cases.
3. Limits of machine precision will introduce noise at all stages.

The remaining RHS vector of the normal equations can be projected onto the eigenvector axes by taking the dot-product of the RHS vector with each eigenvector in turn. This projection is the contribution of each eigenvector to the gradient. Squaring the projection and normalizing by the eigenvalue shows the nonconvergence in units of the variance of the corresponding eigenparameter. This is shown for the 1.9-Å refinement of ferredoxin in Fig. 11. Strong features are evident around eigenvalues 2647 and 2840. Eigenvector 2647, corresponding to the highest peak in Fig. 11, is shown in Fig. 12. The eigenparameter includes strong contributions from the
[Figure 11: projected RHS/eigenvalue (× 10⁹, 0 to 2.5) versus eigenvalue number (0 to 4000).]

Fig. 11. Ratio of the square of the projection of the RHS of Eq. (4) onto the eigenparameters to the eigenvalue, as a function of eigenparameter number. This represents the contribution of each eigenvector to the remaining errors in the refinement. The ratio shows sharp peaks where the model and the data do not agree.
[Figure 12: molecular drawing of Lys-98 with atom CG labeled.]

Fig. 12. Eigenvector 2647 of the ferredoxin refinement, corresponding to a large peak in the RHS spectrum. The side chain is one of the worst defined in the structure.
positional parameters for the side chain of residue 98: this is a surface lysine for which the density is particularly poor. Variation of the eigenparameter corresponds to stretching and compressing bonds along the side chain. The restraint dissatisfaction has become concentrated in this eigenparameter because the X-ray terms (reflected in the map density) and the geometric restraints are irreconcilable for this side chain. Examination of this side chain in the 1.3-Å refinement indicates that it is sufficiently disordered that modeling a second conformation would not help. The ratio plot has therefore revealed a genuine problem area in the structure.

Strong peaks appear in the RHS/eigenvalue ratio plot only after full matrix refinement has been applied to near convergence. Models that have been refined using the conjugate gradient method, or that have not been refined to near convergence, tend not to show sharp peaks. With an unrefined model, the restraint dissatisfaction is spread among all the parameters. As the refinement approaches convergence, the dissatisfaction is concentrated into those parameters for which the quadratic approximation is a poor description. In this case the dissatisfaction is concentrated in those parameter combinations for which the model does not describe the X-ray data well, and therefore are among the last to refine.
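The ratio plot discussed above can be sketched in a few lines: project the residual RHS vector onto each eigenvector, square the projection, and divide by the eigenvalue. The example assumes a hypothetical three-parameter eigen-decomposition and an invented RHS vector; the function name `rhs_eigenvalue_ratio` is chosen for this illustration, and all eigenvalues are assumed to be nonzero.

```python
def rhs_eigenvalue_ratio(eigenvalues, eigenvectors, rhs):
    """Square of the projection of rhs onto each eigenvector, divided by
    that eigenvector's eigenvalue.

    eigenvectors[i][k] is component i of eigenvector k (stored as columns).
    """
    n = len(rhs)
    ratios = []
    for k, lam in enumerate(eigenvalues):
        projection = sum(eigenvectors[i][k] * rhs[i] for i in range(n))
        ratios.append(projection ** 2 / lam)
    return ratios


r = 2 ** -0.5
# Hypothetical eigen-decomposition of a three-parameter normal matrix
eigenvalues = [2.0, 8.0, 0.001]
eigenvectors = [[r, r, 0.0],
                [-r, r, 0.0],
                [0.0, 0.0, 1.0]]
rhs = [0.1, 0.1, 0.01]          # residual gradient left after refinement

ratios = rhs_eigenvalue_ratio(eigenvalues, eigenvectors, rhs)
worst = max(range(len(ratios)), key=ratios.__getitem__)
print(ratios, "largest peak at eigenvector", worst)
```

Although the third component of the RHS vector is the smallest, the corresponding eigenvalue is so small that this eigenparameter dominates the ratio, which is how a poorly determined combination of parameters stands out in such a plot.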
The RHS contribution/eigenvalue ratio has a particular statistical significance (see, e.g., Kendall et al.¹⁵). The inverse eigenvalues are the variances of the eigenparameters, and the refinement shifts to the eigenparameters are given by the RHS vector in the eigenparameter space divided by the eigenvalues. Therefore the RHS contribution/eigenvalue ratio represents the σ²-normalized eigenparameter shift. This function may also be calculated in parameter space for a direct indication of the significance of the refinement shift applied to each parameter.

The ratio of RHS contributions to eigenvalues for the restrained calculation at 1.3 Å is plotted in Fig. 13a. This graph again shows some sharp peaks, but in contrast to the room temperature case the largest feature is at the top end of the thermal region of the spectrum. The eigenvectors corresponding to the highest peaks in this plot are listed in Table II in terms of their largest contributors. The eigenvectors in the large peak all include contributions from sulfur atom thermal parameters, which might be expected to have well-determined thermal parameters; however, the common and distinctive feature of the eigenvectors is contributions from thermal parameters of atoms in the side chain of residue 18. Contributions from the long flexible side chains of residues 83 and 92 (both glutamates) comprise the lesser peak.

The model for residue 18 is shown in Fig. 14, with the electron density. The electron density is poor, and the thermal ellipsoids for these atoms are extremely anisotropic. The thermal parameters appear to be trying to fit absent density, and are in disagreement with the geometric restraints. There is density to suggest an alternate, possibly better, conformation. The side chain was moved into the alternative density and subjected to local refinement, and then the whole model was refined using full matrix least-squares refinement.
The new side chain displayed good density, and the R factor dropped, but only after refinement of the whole model (Table III). There were no major motions during this refinement but most parameters displayed small shifts, confirming that the whole model had been biased by the incorrect side chain. The RHS/eigenvalue ratio plot for the refined model (Fig. 13b) shows the original peak has disappeared, but the secondary peak, due to the glutamate side chains, has increased.
Residue 18 was reexamined in the room temperature model. There was no clear density for the side chain, even with the low-temperature model as a guide. The residue does not contribute to any clear peaks in the RHS/eigenvalue ratio plot at low resolution. It is probable that this side chain is disordered at room temperature and ordered at low temperature.
15 M. G. Kendell, J. K. Ord, and A. Stuart, ‘‘Advanced Theory of Statistics.’’ 5th ed., Vol. 2. Edward Arnold, London, 1991.
Fig. 13. RHS/eigenvalue ratio for the 1.3-Å data and (a) the initial refined 1.3-Å model and (b) the model after modification of the side chain and refinement. [Plots not reproduced: each panel shows log10 (eigenvalue), on an axis from 0.0 to 12.0, and the RHS/eigenvalue ratio (× 10^9), on an axis from 0 to 40, over a horizontal range of 0 to 9000.]
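The quantities plotted in Fig. 13 can be illustrated with a minimal sketch, assuming a small symmetric normal (curvature) matrix and its right-hand-side vector are already available. The function names and the toy matrix are hypothetical, and a textbook cyclic Jacobi diagonalization stands in for the production eigensolver; it is only practical for very small systems. For each eigenvector the sketch reports the eigenvalue, the eigenparameter variance (the inverse eigenvalue), and the eigenparameter shift (the RHS component divided by the eigenvalue).

```python
import math


def jacobi_eigh(matrix, max_sweeps=100, tol=1e-24):
    """Diagonalize a small symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, vectors); column k of `vectors` is eigenvector k.
    This is an illustrative stand-in, not a production eigensolver.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p, q of A and of V
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # update rows p, q of A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    return [a[i][i] for i in range(n)], v


def eigen_shift_analysis(normal_matrix, rhs):
    """Per-eigenvector eigenvalue, variance (1/eigenvalue), and shift."""
    eigenvalues, v = jacobi_eigh(normal_matrix)
    n = len(rhs)
    report = []
    for k in range(n):
        # Component of the RHS vector along eigenvector k.
        rhs_k = sum(v[i][k] * rhs[i] for i in range(n))
        report.append({
            "eigenvalue": eigenvalues[k],
            "variance": 1.0 / eigenvalues[k],
            "shift": rhs_k / eigenvalues[k],
        })
    return report, v


def parameter_shifts(report, v):
    """Transform eigenparameter shifts back into parameter space."""
    n = len(report)
    return [sum(v[i][k] * report[k]["shift"] for k in range(n))
            for i in range(n)]


# Toy 3-parameter system (hypothetical numbers, for illustration only).
A_demo = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b_demo = [1.0, 2.0, 3.0]
report_demo, v_demo = eigen_shift_analysis(A_demo, b_demo)
shifts_demo = parameter_shifts(report_demo, v_demo)
```

Transforming the eigenparameter shifts back to parameter space reproduces the ordinary least-squares shift, so multiplying the matrix by `shifts_demo` should return `b_demo`. Eigenvectors with an unusually large shift are the ones worth inspecting for their largest parameter contributions.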
Conclusion
Full matrix optimization methods, coupled with examination of the eigenvalues and eigenvectors of the curvature matrices, provide a powerful tool for automatic identification of problematic regions in a refined structure. Further, they provide a theoretically sound means of determining
TABLE II
Strongest Contributions from Parameters to Eigenvectors for Peaks in RHS/Eigenvalue Ratio for Initial Refined 1.3-Å Model

Eigenvector num.   RHS/eigenvalue (× 10^9)   Strongest contributions, x% y z (a)
4737               4.1                       2% U13 OE2-92; 1% U13 CD-92; 1% U11 OE2-92; 1% U11 CD-92; 1% U33 CG2-82; 1% U22 CG-62; 1% U23 CB-82; 1% U33 OE2-92
4754               2.5                       2% U23 OE2-83; 1% U23 CD-83; 1% U13 OE2-92; 1% U22 OE2-83; 1% U13 O-104; 1% U13 O-62; 1% U22 CD-83; 1% U13 CD-92
6005               13.8                      31% U13 SG-45; 11% U12 SD-64; 8% U23 S1-108; 8% U13 OE2-18; 6% U12 S1-108; 4% U33 OE2-18; 2% U13 SG-49; 2% U13 SG-20
6006               25.2                      25% U13 SG-45; 13% U13 OE2-18; 11% U12 SD-64; 8% U13 SG-49; 6% U33 OE2-18; 5% U23 SG-24; 2% U11 OE2-18; 2% U23 OE2-18
6007               12.1                      16% U12 SD-64; 12% U23 S1-108; 9% U12 S1-108; 6% U13 OE2-18; 5% U23 SG-11; 4% U13 SG-16; 4% U13 SG-49; 4% U13 S1-108
6009               22.9                      26% U13 SG-49; 12% U13 S2-107; 12% U13 OE2-18; 5% U33 OE2-18; 4% U13 SG-20; 3% U22 SG-49; 2% U23 SG-11; 2% U11 OE2-18

(a) x, contribution; y, parameter type; z, atom name.
Fig. 14. Residue 18 and density from the initial 1.3-Å model. At lower contour levels there is continuous (but poor) density for the side chain.
TABLE III
Change in Magnitude and Intensity R-Factor during Modeling of Residue 18 Side Chain

Refinement stage                 Conventional R factor (R1)   Intensity R factor (wR2)
Initially refined model          0.1501                       0.3585
Side chain 18 moved              0.1514                       0.3715
Local refinement (5 residues)    0.1505                       0.3673
Refinement of whole model        0.1481                       0.3552
the precision of parameters at ‘‘low’’ resolution. The process can be reasonably automated. Full matrix methods are expensive in terms of both memory and processor time. Fortunately the prices of both resources are falling exponentially, while the requirements scale as a low power of the number of parameters. Tests of matrix diagonalization carried out by R. Leary at the San Diego Supercomputer Center are summarized in Fig. 15. Systems as large as 20,000 parameters can be solved in less than 2 h using 128 processors, each of which is roughly one-third the speed of currently available
Fig. 15. Computing requirements for calculation of eigenvalues and eigenvectors using the ScaLAPACK parallel linear algebra package. Diagonalization of a system corresponding to 20,000 parameters required less than 2 h on 128 processors. These processors are about one-third as fast as current production (3-GHz) Intel Pentium 4 processors. [Plot not reproduced: ‘‘Parallel Eigensystem Solution Timings, SDSC Blue Horizon’’; total time in minutes (log scale) against order of matrix (log scale) for 8, 16, 64, and 128 processors.]
Intel 3.0-GHz Pentium 4 processors. This test used the ScaLAPACK16 parallel linear algebra package. There is clearly no problem in finding computers large enough for this task. At present there is no software available to compute the matrices in parallel. SHELXL was designed for vector computers and is not readily adapted to parallel systems. There are no full matrix packages presently available for the maximum likelihood refinement target. The work presented here demonstrates the potential significance of the technique.
Acknowledgments
Supported in part by grants BIR 9223760 and DBI 9911196 from the National Science Foundation.
16 L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ‘‘ScaLAPACK Users Guide.’’ Society for Industrial and Applied Mathematics, Philadelphia, 1997.
[17] Validation of Protein Structures for Protein Data Bank
By John Westbrook, Zukang Feng, Kyle Burkhardt, and Helen M. Berman
Overview
The Protein Data Bank (PDB; http://www.pdb.org/)1,2 was first established in 1971 by Walter Hamilton at Brookhaven National Laboratory in response to community requirements for a central repository for information about biological macromolecular structures. Seven structures were included in the PDB at its inception. In 1974, the first PDB Newsletter3 announced the availability of 13 structures (Fig. 1). During the 30-year history of the PDB, the technologies for determining macromolecular structure have undergone remarkable improvements. Combined with similar advances in the field of structural biology, there have been dramatic increases in the size of the archive and the diversity of structure data, and the community of PDB users has broadened significantly.
The PDB collects the results of structure determination experiments, organizes the data, and makes them available to an increasingly broad community of users. As part of this process, it is the responsibility of the PDB to ensure that the data released from the PDB are represented accurately, and that diagnostics are provided to assist depositors in correcting errors in deposited structures. This chapter presents the evolving methods used by the PDB to reach these goals.
Most people would agree that the accurate transcription of deposited data, consistency, and compliance with standards are key determinants of data quality, and that attainment of data quality should be a primary goal of any data resource. While consensus on the importance of data quality may be easy to obtain in general terms, the issues surrounding data quality are far more complex when placed in the specific context of a resource like the PDB. During the first half of its life, the PDB served as a simple repository and a point of dissemination for structure data and was used primarily by
1 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
2 F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. E. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
3 Protein Data Bank, Protein Data Bank Newslett. 1 (1974). ftp://ftp.rcsb.org/pub/pdb/doc/newsletters/old_bnl/news01_sep74.pdf
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
Number 1, September '74
Protein Data Bank Newsletter

We thought it would be a good idea to mail out a report on the status of the Protein Data Bank and to establish at this time a regular newsletter. Some of the material in this first edition of the newsletter may be familiar to you. We promise that the next edition will be much smaller!

Deposition of coordinates
Data may be deposited by filling out the form in appendix 1. Tape or cards rather than a listing are appreciated. Mail these to
T.F. Koetzle
Department of Chemistry
Brookhaven National Laboratory
Upton, New York 11973
Telephone: 516-345-4384

Coordinate Directory
The coordinate sets in final distributable form are listed below along with coordinate sets soon to be available (marked *):
carboxypeptidase A
carp muscle calcium binding parvalbumin
α-chymotrypsin
cytochrome b5
flavodoxin*
D-glyceraldehyde-3-phosphate dehydrogenase*
horse hemoglobin (deoxy and met)
lactate dehydrogenase
lamprey hemoglobin
lysozyme*
myoglobin
pancreatic trypsin inhibitor
papain
rubredoxin
staphylococcal nuclease
subtilisin
thermolysin*

Format
The format of the coordinates is given in Appendix 2. Torsion angles, structure factors and phases are also available for some proteins, as indicated in Appendix 3.

Fig. 1. The first PDB newsletter announced the contents of the archive.3
crystallographers and nuclear magnetic resonance (NMR) spectroscopists. Depositions to the archive during this early period resemble journal publications and contain lengthy descriptive text sections encoded in the REMARK records of the PDB file format. The perception that the PDB is an accumulation of independent structures rather than a collection of data records in a database is one that continues to this day.
As the number of structures in the archive increased and the user base broadened, the PDB confronted changing requirements to enable comparative analysis of the data in the archive. Such studies minimally require a consistent representation of the data and an increase in the types of the data included in each entry. Accordingly, extensions in the PDB format were advanced in 1992 (ref. 4) and 1996 (ref. 5). Because of the archival nature of the prior entries, neither format change was propagated backward.
As researchers began to use the archive for structural analysis and comparison, they confronted inconsistencies in the PDB data representation. A considerable number of papers have been published describing these inconsistencies.6–9 The results of these comparative studies exist in both published works and in active derivative databases, and form the underpinnings of many widely used methods of structural validation and comparison.10–19
Changing an existing archive such as the PDB to comply with evolving data and nomenclature standards, and imposing consistency constraints in data representation, present a variety of problems. Such changes, no matter
4 Protein Data Bank, ‘‘Protein Data Bank Atomic Coordinate and Bibliographic Entry Format Description.’’ Brookhaven National Laboratory, Upton, NY, 1992.
5 J. Callaway, M. Cummings, B. Deroski, P. Esposito, A. Forman, P. Langdon, M. Libeson, J. McCarthy, J. Sikora, D. Xue, E. Abola, F. Bernstein, N. Manning, R. Shea, D. Stampf, and J. Sussman, ‘‘Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description.’’ Brookhaven National Laboratory, Upton, NY, 1996.
6 R. A. Laskowski, E. G. Hutchinson, A. D. Michie, A. C. Wallace, M. L. Jones, and J. M. Thornton, Trends Biochem. Sci. 22, 488 (1997).
7 R. W. Hooft, C. Sander, and G. Vriend, J. Appl. Crystallogr. 29, 714 (1996).
8 R. W. Hooft, G. Vriend, C. Sander, and E. E. Abola, Nature 381, 272 (1996).
9 P. Schultze and J. Feigon, Nature 387, 668 (1997).
10 A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995).
11 C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, Structure 5, 1093 (1997).
12 C. Dodge, R. Schneider, and C. Sander, Nucleic Acids Res. 26, 313 (1998).
13 W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).
14 H. Ohkawa, J. Ostell, and S. Bryant, Intell. Systems Mol. Biol. 3, 259 (1995).
15 C. Hogue, H. Ohkawa, and S. Bryant, Trends Biochem. Sci. 21, 226 (1996).
16 A. Siddiqui and G. Barton, ‘‘Perspectives on Protein Engineering 2,’’ CD-ROM edition (M. J. Geisow, Ed.). Biodigm, Nottingham, UK, 1996.
17 L. Holm and C. Sander, Nucleic Acids Res. 26, 316 (1998).
18 L. Holm and C. Sander, Nucleic Acids Res. 25, 231 (1997).
19 I. N. Shindyalov and P. E. Bourne, Protein Eng. 11, 739 (1998).
how well intended, may corrupt references to a wide variety of published and Web-accessible work. Even small changes may have profound consequences for existing software applications. Such is the dilemma faced by the Research Collaboratory for Structural Bioinformatics (RCSB) in assuming the management for the PDB archive in 1998.
The RCSB approached the PDB global data quality problem with great enthusiasm and naïveté. It is easy to view the data quality in purely technical terms. It is much more difficult to address the problem in the context of existing resources and applications that depend heavily on the archive in its current format. Changes in the archive, even for the purpose of improving data uniformity, may inadvertently create problems for software or databases that have adapted to the existing inconsistencies in the data. In fact, the latter was the overwhelming concern of the community.
The state of the PDB with respect to uniformity is difficult to reconcile, particularly from the perspective of a new user who is not aware of the archive’s rich history. To participate in current and future scientific challenges, the PDB must advance to a level of data quality that facilitates systematic archive-wide analyses and integration with other biological and structural databases. The path to attain this level of data quality and maintain historical continuity is a difficult one with many trade-offs. For example, it has been strongly recommended that the PDB follow IUPAC nomenclature for labeling hydrogen atoms.20 PDB atom labeling syntax differs from IUPAC in both numbering conventions and order of presentation (i.e., 2HB instead of HB3) of the hydrogen atom label. Although it is technically easy to follow IUPAC conventions, doing so would introduce an inconsistency into the archive that would violate assumptions made by software applications already coping with the current PDB naming convention.
Here, the judgment to set continuity as the high priority is at odds with the tenet of data quality that prefers adherence to data standards. While all possible attempts are made to accommodate the nomenclature used by depositors, at times the maintenance of continuity makes this impossible.
The use of the PDB chain identifier is a common source of difficulty. It has been the PDB’s practice to assign chain identifiers to polymer species and covalently bound prosthetic groups. Since the PDB format has limited scope for labeling groups of molecular species, the chain identifier has been used by some depositors to label aggregates of solvent and unbound ligands. In the interest of consistency, these latter uses of the chain identifier have typically not been supported. Another issue with the chain identifier arises when addressing data quality in early PDB entries in which no chain identifier was used. Raising these
20 J. L. Markley, A. Bax, Y. Arata, C. W. Hilbers, R. Kaptein, B. D. Sykes, P. E. Wright, and K. Wüthrich, J. Biomol. NMR 12, (1998).
entries to be consistent in format with current entries requires the introduction of a chain identifier, yet adding this identifier invalidates references to this structure both in the literature and in derivative databases.
The major obstacle underlying the application of data quality and uniform standards within the PDB has been a limitation in the format used to describe PDB data. The PDB format,5 available from http://deposit.pdb.org/, provides space for the encoding of a single chemical description. The fixed-column format has proved difficult to extend without disrupting existing software applications. To address this limitation, the RCSB has used the macromolecular crystallographic information file (mmCIF)21 format as a means of representing PDB data. This format has built-in support for multiple nomenclatures and extensibility in content to support evolving technologies. The flexibility provided by mmCIF allows for the encoding of the depositor’s nomenclature preference along with an archive-wide systematic nomenclature.
All new structures have been processed and archived using the mmCIF data format, and data have been released into the public archive in PDB format, following as closely as possible the format description published in 1996. This has been done in the spirit of maintaining the greatest possible continuity with the data in the existing archive. Data are also provided in mmCIF format following the PDB exchange dictionary (http://deposit.pdb.org/mmcif/).
In addition to addressing the ongoing data quality issues, a significant effort has been undertaken to unify the legacy data released before 1998 in a single format using a systematic nomenclature while still preserving original nomenclature. A large component of this project has involved resolving consistency issues in chemical description and nomenclature required to represent each entry properly in mmCIF.
In this chapter we describe the structure validation procedures that are used by the PDB to maintain data quality. We then present an analysis of how these procedures have been used to process new entries and ‘‘postprocess,’’ or revisit, the legacy data.
Validation Procedures
Overview of Process
To provide the community with high-quality data, the RCSB has developed a number of tools that support the deposition and processing of X-ray and NMR structures. Deposition is accomplished through an
21 P. Bourne, H. M. Berman, K. Watenpaugh, J. D. Westbrook, and P. M. D. Fitzgerald, Methods Enzymol. 277, 571 (1997).
integrated Web-based user interface, the AutoDep Input Tool (ADIT).22 ADIT (http://deposit.pdb.org/adit/) takes data from an uploaded file, and presents an editor for the opportunity to modify and make additions to an entry. Data processing involves checking various aspects of the structure and the data collected through ADIT. Deposited information is converted to mmCIF representation and is subsequently processed by PDB validation programs. Over the years many programs and procedures have been developed to diagnose the errors in PDB files.7,8,23,24 These programs have allowed authors to detect errors and correct them before deposition to the PDB. The PDB has incorporated many of these methods and has developed a series of procedures to review and validate structures. All these validation procedures are made available on the PDB validation server (http://deposit.pdb.org/validate/) for use before deposition. A skilled annotator with a background in biological structure reviews the output of these validation checks. A distilled summary report of the validation diagnostics is forwarded to depositors for their information and for corrections. Since the author is the most knowledgeable about his/her own structure, the PDB collaborates with the author to ensure that the entry that is ultimately released to the public is the best possible representation of the results of the experiment. On receipt of the depositor’s response to the report, the processing of the entry is completed. The entry is loaded into an internal core relational database and stored in an archival file system. At this point, the entry is ready for public distribution according to the HOLD/RELEASE status of the entry. On a weekly schedule, entries to be released are transferred to the main PDB distribution site at the San Diego Supercomputer Center (SDSC), where additional derived features are calculated, public databases are loaded, and the data are transferred to the PDB mirror sites. 
Specific Structure Checks
In this section, we outline the results contained in the PDB validation summary report for both NMR and X-ray crystal structures.
Covalent Bond Distances and Angles. Covalent bond distances and angles for macromolecules are compared against standard values. For proteins, these are taken from Engh and Huber.25 The standard values are
22 J. Westbrook, Z. Feng, and H. M. Berman, ‘‘ADIT: The AutoDep Input Tool.’’ Department of Chemistry, Rutgers, State University of New Jersey, RCSB-99 (1998).
23 R. A. Laskowski, M. W. MacArthur, D. S. Moss, and J. M. Thornton, J. Appl. Crystallogr. 26, 283 (1993).
24 R. A. Laskowski, J. A. Rullmann, M. W. MacArthur, R. Kaptein, and J. M. Thornton, J. Biomol. NMR 8, 477 (1996).
taken from Clowney et al.26 for nucleic acid bases, and from Gelbin et al.27 for nucleic acid sugar and phosphates. Bonds and angles related to hydrogens are not checked.
For each type of bond (e.g., N–CA, N–C) or angle (e.g., N–CA–C, CA–CB–CG), the RMS deviation of that bond or angle (V_actual) relative to the standard value (V_standard) is

$$\mathrm{RMSD} = \left( \frac{\sum (V_{\mathrm{actual}} - V_{\mathrm{standard}})^2}{N} \right)^{1/2}$$

where N is the number of individual angles or bonds of a particular type included in the summation. V_actual for a particular bond is listed as a 6 RMSD violation if

$$|V_{\mathrm{actual}} - V_{\mathrm{standard}}| > 6\,\mathrm{RMSD}$$

In addition, the validation summary provides the total average deviations from standard dictionaries. This offers an overall measure of agreement with the standard values. Although other methods exist to make this comparison, this approach tends to highlight only serious outliers.
Stereochemical Validation. All chiral centers of proteins and nucleic acids are checked for correct stereochemistry. Violations of standard stereochemistry are reported for both proteins and nucleic acids, using the following method.
1. Neighboring atoms a, b, and c of the chiral center form vectors V_a, V_b, and V_c with the center.
2. The chiral volume is $V_C = V_a \cdot (V_b \times V_c)$.
3. If the sign of the actual chiral volume is different from the standard chiral volume, a chirality violation is listed.
Atom Nomenclature. The nomenclature of all atoms is checked for compliance with current PDB practice. In some cases, this nomenclature is not in complete agreement with IUPAC standards.20 This is particularly true for hydrogen atoms. Correspondences between PDB hydrogen nomenclature and the nomenclature used by IUPAC and many refinement programs are available from the BioMagResBank (BMRB) at http://www.bmrb.wisc.edu/ref_info/atom_nom.tbl.
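The bond-deviation test and the chirality test described above can be sketched as follows. This is a minimal illustration, not the PDB's validation code: the function names are hypothetical, and it assumes the RMSD is accumulated over all bonds (or angles) of one type within a single entry. Under that assumption a value can exceed 6 RMSD only when more than 36 values of that type are present, which is consistent with the remark that the test highlights only serious outliers.

```python
import math


def rmsd_from_standard(actual_values, standard_value):
    """RMS deviation of observed values from the dictionary standard."""
    n = len(actual_values)
    return math.sqrt(sum((v - standard_value) ** 2 for v in actual_values) / n)


def six_rmsd_violations(actual_values, standard_value, factor=6.0):
    """Indices of values whose deviation exceeds factor * RMSD."""
    rmsd = rmsd_from_standard(actual_values, standard_value)
    return [i for i, v in enumerate(actual_values)
            if abs(v - standard_value) > factor * rmsd]


def chiral_volume(center, a, b, c):
    """Scalar triple product Va . (Vb x Vc) of the center-to-neighbor vectors."""
    va = [a[i] - center[i] for i in range(3)]
    vb = [b[i] - center[i] for i in range(3)]
    vc = [c[i] - center[i] for i in range(3)]
    cross = [vb[1] * vc[2] - vb[2] * vc[1],
             vb[2] * vc[0] - vb[0] * vc[2],
             vb[0] * vc[1] - vb[1] * vc[0]]
    return sum(va[i] * cross[i] for i in range(3))


def chirality_violation(center, a, b, c, standard_volume):
    """True when the observed chiral volume has the opposite sign."""
    return chiral_volume(center, a, b, c) * standard_volume < 0
```

Swapping any two of the three neighbors changes the sign of the chiral volume, which is why the sign alone is enough to detect an inverted center.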
25 R. A. Engh and R. Huber, Acta Crystallogr. 47, 392 (1991).
26 L. Clowney, S. C. Jain, A. R. Srinivasan, J. Westbrook, W. K. Olson, and H. M. Berman, J. Am. Chem. Soc. 118, 509 (1996).
27 A. Gelbin, B. Schneider, L. Clowney, S.-H. Hsieh, W. K. Olson, and H. M. Berman, J. Am. Chem. Soc. 118, 519 (1996).
Although it would be desirable for PDB to support the IUPAC hydrogen nomenclature, most existing software supports the prior PDB conventions. For this reason, and to maintain consistency within the archive, the use of the PDB nomenclature has been continued.
In addition, particular attention is paid to the nomenclature of hydrogen atoms on the ND2 atoms of Asn residues and/or the NE2 atoms of Gln as well as the NH1 and NH2 atoms of Arg residues for agreement with the standard for E/Z orientation presented by the IUPAC.20,28 For nucleic acids, the atom labeling of O1P/O2P atoms is checked against the convention defined by the IUBMB.29 During processing, the nomenclature of all of the described atoms is adjusted if necessary.
Close Contacts. Many of the PDB data processing tasks are performed by the Macromolecular Exchange Input Tool, MAXIT.30 This program calculates the distances among all atoms within the asymmetric unit of crystal structures and the unique molecule of NMR structures. For crystal structures, contacts between symmetry-related molecules are checked as well. These checks include ligand and solvent molecules in addition to the macromolecular structure.
For X-ray structures, heavy atoms less than 2.2 Å apart are listed as close contacts. Interactions of atoms forming standard bonds, defined through PDB LINK records, or related by one to four contacts, are not listed as close contacts.
In crystal structures, atoms that have full occupancy and lie on special positions that place two or more atoms on top of one another in the structure are listed as having close contacts to indicate that a lower occupancy is appropriate. Atoms with full occupancy related by a crystallographic symmetry element will be listed as having close contacts if they are less than 2.2 Å apart.
If disulfide bridges are denoted in the coordinate file with SSBOND records, they will not be listed as close contacts. In data annotation, close
28 IUPAC-IUB Commission on Biochemical Nomenclature, ‘‘Abbreviations and Symbols for the Description of the Conformation of Polypeptide Chains: Tentative Rules.’’ Arch. Biochem. Biophys. 145, 405 (1971); Biochem. J. 121, 577 (1971); Biochemistry 9, 3471 (1971); Biochim. Biophys. Acta 229, 1 (1971); Eur. J. Biochem. 17, 193 (1969); J. Biol. Chem. 245, 6489 (1970); J. Mol. Biol. 52, 1 (1970); Pure Appl. Chem. 40, 291 (1974); and in ‘‘Biochemical Nomenclature and Related Documents,’’ 2nd ed., p. 73. Portland Press, London, 1992.
29 C. Liebecq (Ed.), in ‘‘Biochemical Nomenclature and Related Documents: A Compendium Prepared for the Committee of Editors of Biochemical Journals,’’ p. 121. Portland Press, Chapel Hill, NC, 1992.
30 Z. Feng, S.-H. Hsieh, A. Gelbin, and J. Westbrook, ‘‘MAXIT: Macromolecular Exchange and Input Tool.’’ Rutgers University, New Brunswick, NJ, 1998.
contacts corresponding to metal coordination are represented as LINK records in the PDB file. Ligand and Atom Nomenclature. The names of residues and atoms are compared against the nomenclature used in the PDB HET group dictionary (ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt) for all ligands as well as for standard residues and bases. Unrecognized ligand groups are flagged and any discrepancies in known ligands are listed as extra or missing atoms. When structures are processed, residue and atom nomenclature for existing HET groups are corrected to follow the residue- and atom-naming convention that is given in the PDB HET group dictionary. For new ligands, a residue name is assigned with the preference given to that provided by the author. An automated procedure is then used to make a topological comparison of the new ligand with the dictionary to find similar molecules. Such similar molecules are used to create the most appropriate atom nomenclature for the new group. We are currently standardizing the existing HET group dictionary in order to make it more usable both for the community and ourselves. Significant effort has been made to classify the contents of the dictionary and to correct errors in chemical description. The review and maintenance of the dictionary is very much an ongoing project. Sequence Comparison. The sequence given in the PDB SEQRES records is compared against the sequence derived from the coordinate records. This information is displayed in a table where any differences or missing residues are marked. During structure processing, the sequence database references given by PDB DBREF and SEQADV records are checked for accuracy. If no reference is given, a BLAST31 search is used to find the best match. Any conflict between the PDB SEQRES records and the sequence derived from the coordinate records is resolved by comparison with various sequence databases. 
Residues in disordered regions modeled as alanines are switched both in the SEQRES and in the coordinate section to their true residue names. In general, the sequence and coordinates are made to reflect the sequence of the protein studied, even if it was not possible to model every region. The sequence database references are always selected to be consistent with the source organism that produced the molecule under study. The NCBI source organism taxonomy database32,33 is used to standardize source organism terminology.
31 S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, J. Mol. Biol. 215, 403 (1990).
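The comparison of the SEQRES sequence with the sequence derived from the coordinate records can be sketched as below. This is only an illustration under stated assumptions: both sequences are given as one-letter strings, the function name is hypothetical, and Python's general-purpose `difflib.SequenceMatcher` stands in for whatever alignment the PDB tools actually use. The output marks mismatches, residues present in SEQRES but missing from the coordinates, and residues found only in the coordinates.

```python
from difflib import SequenceMatcher


def compare_sequences(seqres, coord_seq):
    """List differences between SEQRES and the coordinate-derived sequence.

    Each item is (kind, seqres_position, seqres_residue, coord_residue),
    with 1-based SEQRES positions. For "extra" residues the position is
    that of the preceding SEQRES residue (0 if there is none).
    """
    differences = []
    matcher = SequenceMatcher(None, seqres, coord_seq, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pair residues one-to-one where both sequences have something.
        paired = min(i2 - i1, j2 - j1)
        for k in range(paired):
            differences.append(
                ("mismatch", i1 + k + 1, seqres[i1 + k], coord_seq[j1 + k]))
        # Residues in SEQRES with no counterpart in the coordinates.
        for i in range(i1 + paired, i2):
            differences.append(("missing", i + 1, seqres[i], "-"))
        # Residues in the coordinates with no counterpart in SEQRES.
        for j in range(j1 + paired, j2):
            differences.append(("extra", i2, "-", coord_seq[j]))
    return differences
```

Residues reported as "missing" at the termini or in loops are the expected signature of disordered regions that could not be modeled, whereas a "mismatch" points to a conflict of the kind that is resolved against the sequence databases.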
Distant Waters. The program MAXIT calculates the distances between all water oxygen atoms and all polar atoms (oxygen and nitrogen) of the macromolecules, ligands, and solvent in the asymmetric unit. Isolated waters with no neighboring atoms within 3.5 Å are listed in the validation report. Water clusters that are similarly distant from the macromolecule or ligands are also listed in the report. Distant waters that are part of hydration shells are excluded from the list. Isolated water molecules further than 5.0 Å from any polar neighbor atom are listed in REMARK records in the PDB file.
For X-ray crystal structures, the validation summary also lists distant waters (>3.5 Å from polar atoms of the macromolecules, ligands, or solvent of the asymmetric unit) that can be moved through the application of symmetry operations to be closer to the asymmetric unit. For example, if the closest contact that the oxygen of a water molecule makes with a polar atom of the macromolecules, ligands, or solvent of the asymmetric unit is 5.5 Å, and a symmetry operation (e.g., x, 1 + y, 1 + z in space group P21) would place this water in closer proximity to a polar group of the asymmetric unit, then the water will be relocated.
For all structures, a report called an Atlas Summary is created to highlight general information and to present molecular graphic images (in GIF and VRML) to check the overall appearance of the structure. For X-ray crystal structures, molecular graphic images are generated for the asymmetric unit and for crystal packing. For NMR structures, molecular images are generated for the first structure and the ensemble.
Checks against Experimental Data. For X-ray crystallographic structures, structure factor data are validated using SFCHECK.34 This program extracts the deposited R factor, resolution, and model information, and then compares these with values calculated from deposited coordinates and structure factors.
This program also calculates an overall B factor, coordinate errors, and effective resolution and completeness. The summary of density correlation, shift, and B factor are reported by residue. The SFCHECK report is available as part of the validation server report and is returned to the depositor during the annotation process if a problem is detected. Although the SFCHECK results do not exactly reflect values reported by depositors, differences are typically small.
32 D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler, Nucleic Acids Res. 28, 15 (2000).
33 D. L. Wheeler, C. Chappey, A. E. Lash, D. D. Leipe, T. L. Madden, G. D. Schuler, T. A. Tatusova, and B. A. Rapp, Nucleic Acids Res. 28, 10 (2000).
34 A. A. Vaguine, J. Richelle, and S. J. Wodak, Acta Crystallogr. D. Biol. Crystallogr. 55, 191 (1999).
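The comparison of a deposited R factor with a recalculated one can be illustrated with the conventional definition R = Σ| |Fobs| − k|Fcalc| | / Σ|Fobs|. The sketch below is not the SFCHECK algorithm, which also handles scaling and solvent modeling; the function names, the fixed scale factor `scale`, and the 0.05 tolerance are assumptions made for illustration.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Conventional crystallographic R factor for paired amplitudes."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(fo) - scale * abs(fc))
                    for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def check_deposited_r(deposited_r, f_obs, f_calc, tolerance=0.05):
    """Return (recalculated R, True if it differs from the deposited value).

    The tolerance is a hypothetical threshold, not a PDB policy value.
    """
    recalculated = r_factor(f_obs, f_calc)
    return recalculated, abs(recalculated - deposited_r) > tolerance
```

A large discrepancy of this kind is usually a hint to look first at the deposited cell constants, space group, or resolution limits rather than at the model itself.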
NMR experimental data are collected but not currently validated. These data are passed on automatically to the BMRB.35 The BMRB is actively developing and integrating existing software to validate NMR constraint data. In future we hope to move some of this validation processing to the PDB so that any problems with these data may be resolved during the annotation of the structure data.
Validation Server. The Validation Server allows the user to check the format of coordinate and structure factor files and to perform a variety of validation tests on the structure before deposition in the PDB. Available at http://deposit.pdb.org/validate/, these checks can be done independently by the user. Validation involves two steps: the coordinate format check and structure validation. The coordinate format precheck produces a brief report identifying any changes that need to be made in a data file in order to obtain a validation report. Structure validation presents the user with a validation summary report that contains a collection of structural and nomenclature diagnostics, including bond distance and angle comparisons, chirality, close contacts, and sequence comparisons. This report is designed to highlight features of a structure that may require some special attention (e.g., close contacts) and to present this information in a concise summary report. The validation letter is produced by MAXIT.30 In addition, the Validation Server presents an Atlas summary page, graphic images, and diagnostic reports from a variety of programs: PROCHECK 3.4.4,23 PROCHECK-NMR,24 NUCHECK,36 SFCHECK,34 and CURVES 5.1.37,38
Review of Findings
Primary Processing
We have reviewed the structures processed since October 1998, and can relate some of our experiences in doing this. The relatively rapid turnaround and close collaboration with the authors help ensure that errors are detected before release of the structure into the public domain. If either the depositor or a user finds errors after release, the current policy is to correct the errors as the PDB is made aware of them.
35 E. L. Ulrich, J. L. Markley, and Y. Kyogoku, Protein Seq. Data Anal. 2, 23 (1989).
36 Z. Feng, J. Westbrook, and H. M. Berman, "NUCheck." Rutgers University, New Brunswick, NJ, 1998.
37 R. Lavery and H. Sklenar, J. Biomol. Struct. Dyn. 6, 655 (1989).
38 R. Lavery and H. Sklenar, J. Biomol. Struct. Dyn. 6, 63 (1988).
[17]
protein structure validation for PDB
381
In addition to detecting and correcting errors, much of the work in the annotation process involves standardizing the data with respect to nomenclature and format. This type of standardization is necessary in order to be able to compare structures across the archive. A key aspect of annotation involves checking the chemistry of all the components in a crystal. For the macromolecular components this involves a careful review of the sequence. This process can be time consuming and errors are commonly found. PDB SEQRES records are meant to describe the sequence of the macromolecule that has been studied. They must contain all residues, including those that are not observed in the electron density. Sequence errors are detected from inconsistencies between the coordinate and SEQRES records and by examining the results of searches of sequence databases. Depositors resolve any differences in coordinate and chemical sequences before an entry is approved for release. Ligand geometry and chemical bonding are also checked. Since the coordinates of only the nonhydrogen atoms are typically provided in a deposition, bond orders and hydrogen valence must be assigned during annotation. Some depositions contain incorrect R values, incorrectly specified resolution ranges and reflection statistics, and even incorrect cell dimensions. These errors are generally obvious after validation, and are typically corrected during annotation. Depositors who take advantage of the electronic output of data collection and refinement parameters are less likely to encounter problems with the accurate description of the details of structure determination. Poor covalent geometry and serious stereochemical errors are not typically observed. Problems with model geometry are almost always the result of errors in encoding cell constants or space group.
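The consistency check between SEQRES and coordinate records can be sketched in a few lines. This is a minimal illustration and not the PDB annotation software: it assumes fixed-column PDB-format SEQRES and ATOM records, and the function names and the example records are hypothetical. Because unobserved residues may be absent from the coordinates, the test asks only whether the observed residues occur, in order, within the SEQRES sequence.

```python
def seqres_by_chain(lines):
    """Collect three-letter residue names from SEQRES records, keyed by chain ID."""
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            # Column 12 holds the chain ID; residue names start at column 20.
            chains.setdefault(line[11], []).extend(line[19:70].split())
    return chains


def observed_by_chain(lines):
    """List the residues that actually have ATOM records, one entry per residue."""
    chains, last_seen = {}, {}
    for line in lines:
        if line.startswith("ATOM"):
            chain_id = line[21]
            residue_key = line[22:27]  # residue number plus insertion code
            if last_seen.get(chain_id) != residue_key:
                chains.setdefault(chain_id, []).append(line[17:20].strip())
                last_seen[chain_id] = residue_key
    return chains


def is_ordered_subsequence(observed, full):
    """True if `observed` appears in order within `full` (gaps are allowed)."""
    remaining = iter(full)
    return all(residue in remaining for residue in observed)


def atom_record(serial, name, res_name, chain_id, res_seq):
    """Hypothetical helper: build a column-aligned ATOM record for the example."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}"
            f"    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Made-up entry: four residues in SEQRES, the first one not observed.
example = [
    "SEQRES   1 A    4  MET ALA GLY SER",
    atom_record(1, "CA", "ALA", "A", 2),
    atom_record(2, "CA", "GLY", "A", 3),
    atom_record(3, "CA", "SER", "A", 4),
]
full = seqres_by_chain(example)["A"]
seen = observed_by_chain(example)["A"]
consistent = is_ordered_subsequence(seen, full)
```

A chain that fails this test is only flagged for review; deciding which record is wrong still requires the sequence database searches and depositor contact described in the text.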
Validation of Legacy Files
We have used the software developed for primary data processing for the validation and standardization of data released into the archive before October 1998. We have previously described how we have used our data processing system for file-by-file postprocessing of families of related structures.39 Although this work has resulted in the production of uniform data collections like the Nucleic Acid Database (NDB),40 the manual effort involved in file-by-file postprocessing was substantial. The pursuit of greater
39 T. N. Bhat, P. Bourne, Z. Feng, G. Gilliland, S. Jain, V. Ravichandran, B. Schneider, K. Schneider, N. Thanki, H. Weissig, J. Westbrook, and H. M. Berman, Nucleic Acids Res. 29, 214 (2001).
40 H. M. Berman, W. K. Olson, D. L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S. H. Hsieh, A. R. Srinivasan, and B. Schneider, Biophys. J. 63, 751 (1992).
software automation in primary data processing has resulted in the elimination of many manual processing steps. The combination of improved software and the experience gained from three years of primary processing has made it possible to attempt a more automated remediation of the legacy data in a batch mode. Although the tasks of primary data processing and postprocessing share a common software infrastructure, these processes differ in both scope and objective. This postprocessing of data differs from primary processing in that there is no communication with the depositor. Most of the problems discovered in primary data processing are resolved with the assistance of the depositor during the annotation process. In postprocessing of the 8368 legacy data files, it was not possible to interact with individual depositors. This collection of files and discussion that follows does not include the nucleic acid-containing crystal structures, which were processed as part of the NDB project. During primary processing, serious errors in model geometry are corrected by rerefinement by the depositor. Errors of this sort are beyond the scope of postprocessing. We discuss in the remainder of this section the focus of our postprocessing work that has concentrated on problems associated with the representation of the polymer sequence and atom and ligand nomenclature. A summary of these errors, as well as those found in recently processed files, is given in Table I. Errors in Sequence Representation. Correct and consistent representation of sequence is required by virtually all applications of the PDB data. The complete polymer sequence for the macromolecule under study is encoded in PDB SEQRES records as a list of three-letter residue codes. These records are intended to describe the full polymer sequence for the macromolecule or domain for which coordinates are deposited. In comparing the legacy sequence data with sequence data from GENBANK,32
TABLE I
Summary of Released Entries Containing Nomenclature and Chemical Representation Errors

                               Incorrect   Sequence–coordinate   Atom nomenclature   Stereochemical
Data set                       sequence    mismatch              errors              labeling errors
Legacy data (8368 entries)a    166         90                    3311                294
1999 data (3150 entries)b      0           5                     162                 3
2000 data (3569 entries)b      0           0                     31                  3

a Pre-October 1998 entries, excluding nucleic acid-containing crystal structures.
b Structures processed and released by the RCSB.
166 cases were found in which the legacy sequence was incorrect. In most cases, these sequence errors reflect gaps in the model sequence or incompletely modeled residues where residues or side chains were not experimentally observed. In all these cases, the sequences were updated with the correct or missing residues. In some instances, two PDB chains were used to represent a single polymer with a residue gap. These sequences were consolidated into single PDB chains. Sequence information also can be derived from PDB coordinate records. Since coordinate data may not be deposited for all the residues in a structure, the PDB SEQRES records are provided to define the full chemical sequence. Even though the coordinate records may not provide complete sequence information, the sequence information in the SEQRES and coordinate records should be consistent. We found 90 cases in the legacy data in which SEQRES and coordinate records did not correspond. The majority of these inconsistencies result from the experimenter having labeled residues in the coordinate records with missing side chains as alanines. Only 4 of the 90 cases could not be reconciled on the basis of a missing side chain. Errors in Atom and Ligand Nomenclature. The most common problems found in the legacy data are related to the labeling of atoms and ligands. Atom nomenclature problems were found in 3311 (40%) of the legacy files. The labeling of terminal atoms was found to be the most common nomenclature error in that atoms adjacent to a gap of unobserved residues in continuous sequence were mislabeled as terminal atoms. All errors of this type could be corrected automatically. Labeling of ligand atoms and residues was the second most common nomenclature problem. Ligand atom names were standardized in software to the nomenclature used in the PDB ligand dictionary. This was accomplished by topology matching against the chemical descriptions in the dictionary. 
New ligand descriptions were created and added to the dictionary where necessary. Another common nomenclature problem arises from the duplication of atom labels. A total of 636 legacy data files were found to include redundant atom labels. This was most commonly the result of the mislabeling of alternate conformations. In a small number of cases identical coordinate records were duplicated. All instances of duplicated atom records could be resolved. Perhaps the most serious class of errors in atom nomenclature is that related to stereochemistry. Errors in chirality were found in 549 legacy files. Only 255 of these cases could be resolved as errors in atom labeling; the remainder represents exceptions to current stereochemical conventions.
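Detection of duplicated atom labels lends itself to a simple automated check. The following is a minimal sketch under stated assumptions, not the software used for the legacy remediation: it reads fixed-column PDB-format ATOM and HETATM records, treats the combination of chain, residue number, residue name, atom name, and alternate-location indicator as the label, and uses a hypothetical helper and made-up records for the example.

```python
from collections import Counter


def duplicated_atom_labels(lines):
    """Return atom labels that occur more than once among ATOM/HETATM records.

    A label is (chain ID, residue number + insertion code, residue name,
    atom name, alternate-location indicator), read from fixed PDB columns.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            label = (line[21], line[22:27].strip(), line[17:20].strip(),
                     line[12:16].strip(), line[16].strip())
            counts[label] += 1
    return sorted(label for label, n in counts.items() if n > 1)


def atom_record(serial, name, alt_loc, res_name, chain_id, res_seq):
    """Hypothetical helper: build a column-aligned ATOM record for the example."""
    return (f"ATOM  {serial:>5} {name:<4}{alt_loc:1}{res_name:>3} "
            f"{chain_id}{res_seq:>4}    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Made-up residue: CB appears twice with no alternate-location indicator
# (a mislabeled alternate conformation), while OG is labeled A and B correctly.
records = [
    atom_record(1, "CB", " ", "SER", "A", 10),
    atom_record(2, "CB", " ", "SER", "A", 10),
    atom_record(3, "OG", "A", "SER", "A", 10),
    atom_record(4, "OG", "B", "SER", "A", 10),
]
duplicates = duplicated_atom_labels(records)
```

Including the alternate-location indicator in the label is the design choice that matters here: correctly labeled alternate conformations stay distinct, and only records that cannot be told apart are reported.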
Revisiting Data Processed since October 1998
We have applied the batch validation procedure used to postprocess the legacy data to data processed since October 1998. Since the functionality of our validation software and our understanding of the validation process have evolved greatly during the time we have been responsible for the archive, we present the results for data deposited before and after January 2000. The revalidation of the 3150 files that we processed before January 2000 showed 5 entries with conflicts in sequence information between SEQRES and coordinate records, 162 errors in atom and ligand nomenclature, 19 duplicated atom labels, 3 errors in stereochemical labeling, and 30 terminal atom labeling errors. The largest number of errors related to ligand atom nomenclature results from changes in our ligand dictionary that underwent significant correction and development during 1999. The remaining errors were undetected by our software or were omissions in our annotation procedures during this period. The results for files processed after January 2000 show further improvement. In this group of 3569 files we found 31 errors in atom and ligand nomenclature, 1 duplicated atom label, 3 errors in stereochemical labeling, and 2 terminal atom labeling errors. No sequence consistency errors were detected. All these errors were corrected.
Future
As we look to the increasing volume of structure depositions in the future, the extent to which validation operations can be automated becomes increasingly important. One of the greatest current obstacles to greater automation in primary data processing is the lack of consistency in the syntax and representation of structure data. The majority of depositions at PDB continue to be a file of coordinate records with additional information being manually input through a Web interface. Coordinate data may not follow any particular nomenclature and may not even follow the column syntax of the PDB atom records. Dealing with data in this unstructured form requires some initial human intervention and prohibits the automation of the early stages of primary data processing. Although considerable progress has been made in program packages like CNS41 and CCP442 in capturing information for deposition in mmCIF format, these features are not yet widely used. PDB has recently provided 41
A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J. S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 54, 905 (1998).
42 CCP4, Acta Crystallogr. D Biol. Crystallogr. 50, 760 (1994).
software tools to automatically harvest information from the output and log files of other structure determination packages, and to organize this information for automatic deposition (http://deposit.pdb.org/software/). Structural genomics projects that have a focus on high-throughput methodologies have helped to promote greater integration among software applications used in structure determination. Collectively, this should promote the production of complete and fully self-consistent data sets for structures that can be automatically validated and deposited.
Acknowledgments
The PDB is operated by the RCSB and is supported by funds from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine. Questions and comments about the PDB may be sent to [email protected].
[18] New Tools and Data for Improving Structures, Using All-Atom Contacts
By Jane S. Richardson, W. Bryan Arendall, III, and David C. Richardson
The methodology of macromolecular crystallography is mature, powerful, and effective, and it has transformed our understanding of biology at the molecular level. However, anyone who has done such a structure knows there are imperfections well worth correcting: occasional mistakes, and difficult places where no alternative seems right. Macromolecular structures have been improved significantly (i.e., made more accurate for a given resolution and data quality) by the use of validation tools such as the free R factor1 and Ramachandran plot criteria.2 Much of the sensitivity and the power of those tools derive from their independence of the target function being optimized in refinement. We have developed a new suite of validation tools based on the fact that the van der Waals contacts of hydrogen atoms are almost never part of the refinement target function, yet they yield a large set of powerful constraints on allowed conformations. These new techniques, reviewed here, promise to make significant further improvements in the accuracy and reliability of protein and nucleic acid crystal structures.
1 A. T. Brünger, Nature 355, 472 (1992).
2 R. A. Laskowski, M. W. MacArthur, D. S. Moss, and J. M. Thornton, J. Appl. Crystallogr. 26, 283 (1993).
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. Figures 1–3, 6, 8–13 Copyright William Bryan Arendall, III, David Richardson, and Jane S. Richardson All rights reserved. 0076-6879/03 $35.00
The all-atom contact technique depends, of course, on adding the hydrogen atoms, most of which are completely determined to a suitable accuracy by the positions of the heavier atoms. This is done by our program Reduce as described and discussed in Word et al.3 H atoms are placed at ideal bond lengths and angles, with methyls staggered except for terminal Met methyls. Entire local H-bond networks are optimized, including rotation of OH, SH, NH3, and so on, and 180° flips of Asn, Gln, and His, but with a simplified model for waters. With hydrogen atoms present, the Probe program can calculate all-atom contacts.4 It uses a small probe sphere of radius 0.25 Å, in an algorithm close to the inverse of the Connolly solvent-accessible surface calculation.5 Instead of leaving dots where the probe does not intersect another atom, Probe leaves dots where the small probe intersects an atom greater than 3 covalent bonds away. The result is paired patches of contact surface wherever atoms are within 0.5 Å of touching, as shown in Fig. 1. Either for graphical display or for numerical scoring, there are three terms: favorable van der Waals contacts (shown by green and blue dots); favorable overlaps of H-bond donor and acceptor atoms (shown by pillows of pale green dots); and unfavorable overlaps of other atom pairs, shown as "spikes" of increasingly violent red colors as the atomic clash becomes more physically impossible beyond about 0.4-Å overlap. The two favorable contact terms evaluate the local goodness-of-fit inside or between molecules. However, the equilibrium structure of a real molecule can have no large clashes, so a crucial criterion for validation of crystallographically derived structural models is the avoidance of serious atomic clashes, a surprisingly demanding requirement once all H atoms are included.
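The three contact terms can be illustrated with a per-pair simplification. This sketch is not the Probe algorithm: Probe rolls a probe sphere over dot surfaces and skips atoms within three covalent bonds, whereas the hypothetical function below only classifies one pair of atoms from the gap between their van der Waals spheres. The 0.5-Å and 0.4-Å cutoffs come from the text; the function name, the caller-supplied radii, and the `hbond_pair` flag are assumptions made for illustration.

```python
import math

CONTACT_CUTOFF = 0.5  # Å: largest gap still counted as a contact (from the text)
CLASH_CUTOFF = 0.4    # Å: overlap at which a clash is treated as serious


def classify_pair(xyz1, radius1, xyz2, radius2, hbond_pair=False):
    """Classify one atom pair by the gap between van der Waals surfaces.

    `hbond_pair` marks a suitable donor/acceptor pair; covalently close
    pairs are assumed to have been excluded by the caller.
    """
    gap = math.dist(xyz1, xyz2) - (radius1 + radius2)
    if gap > CONTACT_CUTOFF:
        return "no contact"
    if gap >= 0.0:
        return "van der Waals contact"
    if hbond_pair:
        return "H-bond overlap"
    if -gap >= CLASH_CUTOFF:
        return "serious clash"
    return "small overlap"


# Two atoms of radius 1.0 Å placed 1.5 Å apart overlap by 0.5 Å.
example = classify_pair((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
```

A whole-structure score would sum such terms over all non-bonded pairs after hydrogens have been added; the per-pair form is shown only to make the thresholds concrete.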
Strategies for making use of that criterion are the topic of this chapter, along with an update of geometric criteria that complement the all-atom contact analysis.
Geometric Validation Criteria from Updated Survey of Database
Three circumstances have motivated us to update the traditional geometric criteria for protein structure validation: (1) development of the all-atom contact method, which can discriminate one major category of physically impossible from possible conformations; (2) the need to omit high B-factor examples, which surprisingly has seldom been done before; 3
J. M. Word, S. C. Lovell, J. S. Richardson, and D. C. Richardson, J. Mol. Biol. 285, 1735 (1999b). 4 J. M. Word, S. C. Lovell, T. H. LaBean, H. C. Taylor, M. E. Zalis, B. K. Presley, J. S. Richardson, and D. C. Richardson, J. Mol. Biol. 285, 1711 (1999). 5 M. L. Connolly, Science 221, 709 (1983).
Fig. 1. Slice through a small section of protein structure (backbone white, side chains cyan) showing the relation of all-atom contact surfaces (colored dots) to the atomic van der Waals surfaces (gray dots) and to the 0.25-Å radius probe sphere (gray ball) used in the calculation. The probe sphere is rolled over the surface of each atom, leaving a contact dot only when the probe touches another not covalently bonded atom. The dots are colored by local gap width between the two atoms: blue near maximum 0.5-Å separation, shading to bright green at perfect van der Waals contact (0-Å gap). When suitable H-bond donor and acceptor atoms overlap, the dots are pale green, forming lens shapes. When incompatible atoms interpenetrate, their overlap is emphasized with "spikes" instead of dots, and with colors from yellow for negligible overlaps to bright reds and pinks for serious clash overlaps ≥0.4 Å. Kinemage-format contact dots also carry color information about their source atom (e.g., O red, S yellow); in Mage, one can toggle between the two color schemes. For black-and-white figures, careful attention must be paid to the different appearance of dots (favorable) and spikes (unfavorable). Figures produced in Mage.6,7
and (3) the greatly expanded number of structures now available at high resolution. Filtering for quality at the local level (e.g., by B factor) as well as at the whole-structure level (e.g., by resolution) can remove much of the noise in empirical distributions of conformational features, while plotting occurrence as a function of quality indicator can identify and allow removal of some kinds of systematic errors. Therefore, we have revisited the classical rotamer and φ,ψ criteria and have proposed the use of Cβ deviation as a single measure encapsulating the most important aspects of bond angle distortions.
6 D. C. Richardson and J. S. Richardson, Protein Sci. 1, (1992).
7 J. S. Richardson and D. C. Richardson, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F: "Crystallography of Biological Macromolecules," p. 727. Kluwer Academic, Dordrecht, The Netherlands, 2001.
Side-Chain Rotamers
It has been known since Ponder and Richards8 that not only do individual side-chain χ angles show distinct preferences (e.g., staggered for tetrahedral geometry), but also there are strong preferences for and against particular combinations of those angles over and above what would be predicted by multiplying the individual distributions. Favorable local energy minima in the multidimensional χ space are known as rotamers, and many authors have compiled libraries of side-chain rotamers.8–13 Such libraries are often used when fitting models to electron density maps, and either χ1 alone or χ1–χ2 distributions are also used as structure validation criteria.2 Side-chain rotamer libraries are a powerful and productive tool, but some are too sparse and, now that all-atom contact analysis is available, it can be seen that all previous libraries included at least some physically impossible rotamers. Since H-atom contacts are not refined, even high-resolution structures can have impossible clashes in regions where the electron density was ambiguous; if those bad conformations were systematic errors that occur more often than random (such as flipped-over side chains) then they made their way into rotamer compilations, where again their H-atom contacts were not checked. Putative rotamers with serious internal clashes also occur in some libraries because of methodological idiosyncrasies. Once incorrect rotamers were listed in libraries, a vicious cycle made them occur even more often in the experimental structures. Figure 2 shows a sample of such cases from earlier rotamer libraries; the spikes show severe unfavorable H-atom contacts internal to each of these defined conformations.
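The statement that angle combinations occur more or less often than the product of the individual distributions would predict can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, the binning into three staggered classes is a simplification of the smoothed distributions used for real rotamer libraries, and the angle pairs are made up for the example rather than taken from any database.

```python
from collections import Counter


def stagger_bin(angle_deg):
    """Assign a dihedral angle to a staggered bin: plus, trans, or minus."""
    a = angle_deg % 360.0
    if a < 120.0:
        return "plus"   # around +60 degrees
    if a < 240.0:
        return "trans"  # around 180 degrees
    return "minus"      # around -60 degrees


def joint_vs_independent(chi_pairs):
    """Ratio of observed joint frequency to the product of the marginals.

    A ratio above 1 means the combination is favored beyond what the
    individual angle preferences predict; below 1 means it is disfavored.
    Combinations that never occur are simply absent from the result.
    """
    binned = [(stagger_bin(c1), stagger_bin(c2)) for c1, c2 in chi_pairs]
    n = len(binned)
    joint = Counter(binned)
    first = Counter(b1 for b1, _ in binned)
    second = Counter(b2 for _, b2 in binned)
    return {combo: count * n / (first[combo[0]] * second[combo[1]])
            for combo, count in joint.items()}


# Made-up (chi1, chi2) pairs in degrees, for illustration only.
pairs = ([(-65.0, 180.0)] * 6 + [(-65.0, -65.0)] * 2
         + [(180.0, 180.0)] + [(180.0, 65.0)])
ratios = joint_vs_independent(pairs)
```

In this toy set the minus-plus combination never occurs even though both individual bins are populated, which is the kind of combination effect the text describes.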
The primary goal of our new "penultimate" rotamer library14 was nearly complete coverage of the high-quality database, using rotamers located from the empirical distributions but avoiding the problems described above, so that each rotamer represents a physically reasonable local energy minimum. That library was compiled from a nonredundant database of 240 structures at 1.7 Å or better, satisfying various other quality and relevance criteria,14 and individual side chains were omitted if they had any atom
8 J. W. Ponder and F. M. Richards, J. Mol. Biol. 193, 775 (1987).
9 P. Tuffery, C. Etchebest, S. Hazout, and R. Lavery, J. Biomol. Struct. Dyn. 8, 1267 (1991).
10 T. A. Jones, J.-Y. Zou, S. W. Cowan, and M. Kjeldgaard, Acta Crystallogr. A 47, 110 (1991).
11 H. Schrauber, F. Eisenhaber, and P. Argos, J. Mol. Biol. 230, 592 (1993).
12 M. De Maeyer, J. Desmet, and I. Lasters, Fold. Des. 2, 53 (1997).
13 R. L. Dunbrack and F. E. Cohen, Protein Sci. 6, 1661 (1997).
14 S. C. Lovell, J. M. Word, J. S. Richardson, and D. C. Richardson, Proteins Struct. Funct. Genet. 40, 389 (2000).
Fig. 2. Examples of defined rotamers that have serious internal clashes, taken from previous rotamer libraries: Ponder and Richards,8 Tuffery et al.,9 Jones et al.,10 Dunbrack and Cohen,13 and De Maeyer et al.12 In addition to the stick figure for the relevant residue in ideal geometry,28 only the spikes for clash overlaps are shown; all include one or more serious clashes ≥0.4 Å. Figure produced in Mage.6,7
with B ≥ 40, alternate conformations, serious clashes, covalent modifications, or (for Asn, Gln, and His) an uncertain flip state.3 Rotamers were defined as the modal (peak) values in the smoothed distribution rather than as mean values to avoid dependence on a priori bin or range definitions, to allow for skewed distributions, and to correspond better with energy minima. Wherever compatible with the data, related rotamers were given common χ angles (producing common atom positions), to avoid having users choose between rotamers based on differences that are not statistically significant. The most general result from this new side-chain survey is that the conformational distributions are even more tightly clustered than observed previously. The weighted average of all χ1 standard deviations is now only 8.6°, compared with 15.3° for Ponder and Richards8 and 12.4° for Dunbrack and Cohen.13 Figure 3A shows the distribution of χ1–χ2 values for Met. Note both that the clusters are not always centered exactly on the staggered values, and also that occurrence frequencies often differ greatly from those predicted by the individual χ values (e.g., χ1 minus is most common overall, but is nearly absent if χ2 is plus). Figure 3B shows the superimposed side chains (from 100 proteins) for all Met with χ2 trans and χ1 either minus or trans; note the two well-separated clusters for the
Fig. 3. Rotamers of methionine. (A) χ1–χ2 plot for all Met in the Top500 database with B < 40. Although all staggered combinations occur, only five of the nine are common. Note that some clusters are significantly shifted, and also that occurrence frequencies are different from what would be predicted by multiplying the χ1 and the χ2 preferences. (B) All Met examples superimposed that have χ1χ2 trans-trans or plus-trans, from the first 100 structures of the Top500 database. Note that the sulfur positions (marked with balls) form two tight and distinct clusters. The Cε atoms form six looser but still well-defined clusters. Figure produced in Mage.6,7
S atom positions. Adjacent rotamers are nearly always quite distinct, as shown even for the extreme case of Lys in Fig. 7 of Lovell et al.14 For all 18 movable side-chain types in the quality-filtered data set, we found 94.5% to be "rotameric" (within the ranges around defined rotamers, usually ±30° in each angle). The library includes a total of 152 rotamers, or an average of 8.5 per movable residue type, from a maximum of 34 for Arg down to 2 for Pro (only Cγ exo and Cγ endo puckers). The rotamers are tabulated in Lovell et al.14 and are available at our Web site15 in various forms, including drop-in files for use in O10 or XtalView.16,17 For most residue types, different secondary structures (α helix, β sheet, left-handed, and other) show different occurrence frequencies for each rotamer, but the modal χ values stay the same. Therefore, percentage occurrences for each case are given in the penultimate library tables, but the list of possible rotamers is common. For Asn and Asp, however, the set of
15 Richardson laboratory, http://kinemage.biochem.duke.edu, Duke University, Durham, NC, 2002.
16 D. E. McRee, "Practical Protein Crystallography." Academic Press, San Diego, CA, 1993.
17 D. E. McRee, J. Struct. Biol. 125, 156 (1999).
Fig. 4. Electron density contours for two Thr side chains of the 1LYS19 hen egg lysozyme. (A) Thr-51 of molecule A with the expected boomerang-shaped density for the tetrahedral branch around the Cβ. (B) Thr-51 of molecule B, with approximately straight-across density; this occurs relatively often and makes it easy to fit the side chain eclipsed and backward, as happened here.
modal positions varies with secondary structure,18 so that backbone-dependent rotamers are needed and are defined only for those two residues. For example, α-helical Asn with χ1 minus shows two close but quite distinct clusters at χ2 = 20° and at χ2 = 80°; the latter conformation makes an N H-bond to the i − 4 CO, while the former has the flat face of the side-chain amide packed against the i − 4 backbone.18 Left-handed (or +φ) Asn with χ1 minus, however, shows a single peak at χ2 = 30° and for β-sheet Asn the peak is at χ2 = 50°. Another observation resulting from the rotamer survey was that crystal structures are prone to occasional systematic errors which can be recognized and expunged from rotamer libraries and also from individual structures. One cause of such errors can be electron density for a tetrahedrally branched side chain that is straight across (as in Fig. 4B) rather than boomerang-shaped, and which is therefore easy to fit incorrectly with the group rotated by 180°. Such examples for Thr are analyzed below, and the more complex case of Leu is presented in detail in Lovell et al.14 The incorrect fitting can be distinguished definitively, because it has distorted bond angles, eclipsed χ values, atomic clashes, and an increased occurrence rate at high B and low resolution. Overall, the use of a good rotamer library makes side-chain fitting more accurate as well as faster. Not every side chain is rotameric, since strained contacts or several good H-bonds can dictate an otherwise unfavorable conformation; however, such cases are surprisingly rare and should be
18 S. C. Lovell, J. M. Word, J. S. Richardson, and D. C. Richardson, Proc. Natl. Acad. Sci. USA 96, 400 (1999).
19 K. Harata, Acta Crystallogr. D Biol. Crystallogr. 50, 250 (1994).
accepted only when there are good physical reasons and when no rotamer can fit acceptably. In particular, partially disordered surface side chains should always be fit rotameric or as mixtures of rotamers, since there are no interactions to force them away from the local minima.
Ramachandran Plots
The Ramachandran plot is especially useful as a geometric validation criterion because φ and ψ are not part of the target function for refinement.20 The percentage of residues found within the most favored φ,ψ regions correlates strongly with resolution and is now standardly reported in protein structure articles, while individual "outlier" residues in accurate structures are taken to indicate either possible errors or potentially interesting strained conformations. The pioneering, and still most widely used, φ,ψ criteria are those in ProCheck,2 which have the considerable advantage of defining multiple levels of core, allowed, and generously allowed regions; however, they have the serious drawback of being based on old and inaccurate data (the entire PDB from 1990, including structures at 3.5-Å resolution and residues with B > 100), which made it impossible to locate those outer regions correctly. In reaction, Kleywegt and Jones21 chose to define only one "strictly allowed" boundary at 98% of a much more accurate data set, which provides a better validation criterion but does not address the issue of identifying outliers.
We have used a new 500-protein database at 1.8-Å resolution or better, B-factor filtering (keeping only residues with all backbone B < 30), and density-dependent smoothing to update the assignment of favored, allowed, and outlier regions in φ,ψ space.22 This quality-filtered database of about 100,000 residues defines the allowed but disfavored regions quite clearly, because there now are essentially no points at all in the strongly disallowed regions which occupy nearly 60% of the plot area.
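A boundary that encloses a fixed fraction of the data, such as the 98% level mentioned above, can be located by filling a density grid from its densest cells downward. The sketch below shows only that step and is not the published procedure: it assumes the φ,ψ density has already been binned (and smoothed, if desired) into a grid, the function names are hypothetical, and the tiny grid of counts is made up for the example.

```python
def level_enclosing(grid, fraction):
    """Return the lowest cell density inside the region that holds at least
    `fraction` of all counts, filling from the densest cells downward."""
    values = sorted((v for row in grid for v in row), reverse=True)
    target = fraction * sum(values)
    running = 0.0
    for value in values:
        running += value
        if running >= target:
            return value
    return 0.0  # only reached for an empty grid


def region_mask(grid, level):
    """Mark the cells whose density is at or above the contour level."""
    return [[value >= level for value in row] for row in grid]


# Made-up 2 x 2 grid of occupancy counts, for illustration only.
counts = [[50, 30],
          [15, 5]]
favored_level = level_enclosing(counts, 0.9)
favored = region_mask(counts, favored_level)
```

Because whole cells are added at a time, and cells of equal density enter together, the enclosed fraction can slightly exceed the requested one; a finer grid reduces that effect.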
Figure 5a shows that new φ,ψ distribution for the general case: that is, for all residues not Gly, Pro, or pre-Pro. The omission of pre-Pro has the effect of deleting an area around φ = −130°, ψ = +80° (below left of β) from the favored region. Other than that difference, the inner smoothed contour enclosing 98% of the data (our ‘‘favored’’ region) matches quite exactly with the allowed region of Kleywegt and Jones.21 In addition, to separate the somewhat disfavored but ‘‘allowed’’ conformations from the strongly
20 A. L. Morris, M. W. MacArthur, E. G. Hutchinson, and J. M. Thornton, Proteins Struct. Funct. Genet. 12, 345 (1992).
21 G. J. Kleywegt and T. A. Jones, Structure 4, 1395 (1996).
22 S. C. Lovell, I. W. Davis, W. B. Arendall III, P. I. W. de Bakker, J. M. Word, M. G. Prisant, J. S. Richardson, and D. C. Richardson, Proteins Struct. Funct. Genet. 50, 437 (2002).
[18]
improving structures using all-atom contacts
393
Fig. 5. φ,ψ plots for all data from the Top500 structures with backbone B factors […] 0.25 or 0.3 Å on a computer screen). If almost all Cβ deviations in a structure are small ([…] Å) but a few are large, those cases usually signal misfit side chains ([…]
[…] ‘>P1;code’. The second line with 10 fields separated by colons generally contains information about the structure file, if applicable. Only two of these fields are used for sequences, ‘sequence’ (indicating that the file contains a sequence without known structure) and ‘TvLDH’ (the model file name). The rest of the file contains the sequence of TvLDH, with ‘*’ marking its
74 G. Wu, A. Fiser, B. ter Kuile, A. Šali, and M. Müller, Proc. Natl. Acad. Sci. USA 96, 6285 (1999).
75 W. C. Barker, J. S. Garavelli, D. H. Haft, L. T. Hunt, C. R. Marzec, B. C. Orcutt, G. Y. Srinivasarao, L. S. L. Yeh, R. S. Ledley, H. W. Mewes, F. Pfeiffer, and A. Tsugita, Nucleic Acids Res. 26, 27 (1998).
[20]
comparative protein structure modeling
479
end. A search for potentially related sequences of known structure can be performed by the SEQUENCE_SEARCH command of Modeller. The following script uses the query sequence ‘TvLDH’ assigned to the variable ALIGN_CODES from the file ‘TvLDH.ali’ assigned to the variable FILE (file ‘seqsearch.top’).
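The PIR layout described above (a ‘>P1;code’ line, a line of 10 colon-separated fields, and a sequence terminated by ‘*’) can be illustrated independently of Modeller with a short Python sketch. The parser and its name are hypothetical, and the 20-letter sequence in the example is a placeholder, not the TvLDH sequence.

```python
def parse_pir(text):
    """Parse a single PIR entry into (code, header_fields, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines[0].startswith(">P1;"):
        raise ValueError("first line must look like '>P1;code'")
    code = lines[0][len(">P1;"):]
    fields = [f.strip() for f in lines[1].split(":")]
    if len(fields) != 10:
        raise ValueError("second line must have 10 colon-separated fields")
    sequence = "".join(lines[2:])
    if not sequence.endswith("*"):
        raise ValueError("sequence must be terminated by '*'")
    return code, fields, sequence[:-1]


# Build a toy entry; only the first two header fields are filled in.
header = ":".join(["sequence", "TvLDH"] + [""] * 8)
entry = ">P1;TvLDH\n" + header + "\nACDEFGHIKL\nMNPQRSTVWY*\n"
code, fields, sequence = parse_pir(entry)
```

For a sequence without known structure, only ‘sequence’ and the model name carry information, so the remaining eight fields are left empty here.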
The SEQUENCE_SEARCH command has many options,73 but in this example only SEARCH_RANDOMIZATIONS and DATA_FILE are set to nondefault values. SEARCH_RANDOMIZATIONS specifies the number of times the query sequence is randomized during the calculation of the significance score for each sequence–sequence comparison. The higher the number of randomizations, the more accurate the significance score. DATA_FILE = ON triggers creation of an additional summary output file (‘seqsearch.dat’).
Selecting a Template. The output of the ‘seqsearch.top’ script is written to the ‘seqsearch.log’ file. Modeller always produces a log file. Errors and warnings in log files can be found by searching for the ‘_E>’ and ‘_W>’ strings, respectively. At the end of the log file, Modeller lists the hits sorted by alignment significance. Because the log file is sometimes long, a separate data file is created that contains the summary of the search. The example shows only the top 10 hits (file ‘seqsearch.dat’).
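How randomization of the query can yield a significance score may be sketched in Python as follows. This is a generic shuffle-based z-score and not Modeller's actual scoring scheme: the longest-common-subsequence score, the function names, and the default of 100 randomizations are all assumptions chosen for illustration.

```python
import random
import statistics


def lcs_length(a, b):
    """Length of the longest common subsequence (a toy similarity score)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def significance(query, target, randomizations=100, seed=0):
    """Z-score of the real score against scores of shuffled queries."""
    rng = random.Random(seed)
    real = lcs_length(query, target)
    letters = list(query)
    background = []
    for _ in range(randomizations):
        rng.shuffle(letters)                 # same composition, random order
        background.append(lcs_length(letters, target))
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0                           # no spread: z-score undefined
    return (real - statistics.mean(background)) / spread


z = significance("MKVAVLGAAGGIGQALALLLK", "MKVAVLGAAGGIGQALALLLK", 50)
```

More randomizations give a better estimate of the background mean and spread, which is why the accuracy of the score grows with their number.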
The most important columns in the SEQUENCE_SEARCH output are the ‘CODE_2’, ‘%ID’, and ‘SIGNI’ columns. The ‘CODE_2’ column reports the code of the PDB sequence that was compared with the target sequence. The PDB code in each line is the representative of a group of PDB sequences that share 40% or more sequence identity to each other and have fewer than 30 residues or 30% sequence length difference. All
the members of the group can be found in the Modeller ‘CHAINS_3.0_40_XN.grp’ file. The ‘%ID1’ and ‘%ID2’ columns report the percentage sequence identities between TvLDH and a PDB sequence normalized by their lengths, respectively. In general, a ‘%ID’ value above approximately 25% indicates a potential template unless the alignment is short (i.e., fewer than 100 residues). A better measure of the significance of the alignment is given by the ‘SIGNI’ column.73 A value above 6.0 is generally significant irrespective of the sequence identity and length. In this example, one protein family represented by 1bdmA shows significant similarity with the target sequence, at more than 40% sequence identity. While some other hits are also significant, the differences between 1bdmA and other top-scoring hits are so pronounced that we use only the first hit as the template. As expected, 1bdmA is a malate dehydrogenase (from a thermophilic bacterium). Other structures closely related to 1bdmA (and thus not scanned against by SEQUENCE_SEARCH) can be extracted from the ‘CHAINS_3.0_40_XN.grp’ file: 1b8vA, 1bmdA, 1b8uA, 1b8pA, 1bdmA, 1bdmB, 4mdhA, 5mdhA, 7mdhA, 7mdhB, and 7mdhC. All these proteins are malate dehydrogenases. During the project, all of them and other malate and lactate dehydrogenase structures were compared and considered as templates (there were 19 structures in total). However, for the sake of illustration, we will investigate only four of the proteins that are sequentially most similar to the target: 1bmdA, 4mdhA, 5mdhA, and 7mdhA. The following script performs all pairwise comparisons among the selected proteins (file ‘compare.top’).
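The rules of thumb given above for judging hits (a ‘SIGNI’ value above 6.0, or a ‘%ID’ value above approximately 25% with an alignment of at least 100 residues) can be written as a simple filter. The sketch below is ours, not part of Modeller; the hit codes and values are invented for illustration and are not taken from the search output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    code: str            # hypothetical hit code
    aligned_length: int  # number of aligned residues
    pct_id: float        # percentage sequence identity
    signi: float         # alignment significance score


def is_candidate(hit, min_pct_id=25.0, min_length=100, min_signi=6.0):
    """Apply the two rules of thumb; either one is enough to keep the hit."""
    significant = hit.signi > min_signi
    similar = hit.pct_id > min_pct_id and hit.aligned_length >= min_length
    return significant or similar


hits = [
    Hit("hypA", 330, 45.0, 28.0),  # long, similar, significant
    Hit("hypB", 60, 40.0, 2.1),    # similar but too short
    Hit("hypC", 250, 18.0, 7.5),   # low identity but significant
    Hit("hypD", 200, 15.0, 1.0),   # neither
]
candidates = sorted(
    (h for h in hits if is_candidate(h)), key=lambda h: h.signi, reverse=True
)
```

Such a filter only narrows the list; the final choice among candidates still rests on the structural comparison and the quality of each crystal structure.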
The READ_ALIGNMENT command reads the protein sequences and information about their PDB files. MALIGN calculates their multiple sequence alignment, used as the starting point for the multiple structure alignment. The MALIGN3D command performs an iterative least-squares superposition of the four 3D structures. The COMPARE command compares the structures according to the alignment constructed by MALIGN3D. It does not make an alignment, but it calculates the RMS and DRMS deviations between atomic positions and distances, differences between the main-chain and side-chain dihedral angles, percentage sequence identities, and
several other measures. Finally, the ID_TABLE command writes a file with pairwise sequence distances that can be used directly as the input to the DENDROGRAM command (or the clustering programs in the Phylip package42). DENDROGRAM calculates a clustering tree from the input matrix of pairwise distances, which helps in visualizing differences among the template candidates. Excerpts from the log file are shown below (file ‘compare.log’).
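The two deviation measures mentioned above can be sketched in Python for two equal-length lists of equivalent atoms. The definitions used here are the usual ones (RMS over positional differences, DRMS over differences in intramolecular distances); whether Modeller normalizes them in exactly this way is an assumption, and the function names are ours.

```python
import math
from itertools import combinations


def rms(a, b):
    """Root-mean-square deviation between equivalent atomic positions.

    The two coordinate sets are assumed to be superposed already.
    """
    squared = [sum((p - q) ** 2 for p, q in zip(u, v)) for u, v in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


def drms(a, b):
    """Root-mean-square deviation between equivalent interatomic distances."""
    diffs = [
        (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
        for i, j in combinations(range(len(a)), 2)
    ]
    return math.sqrt(sum(diffs) / len(diffs))


# Toy coordinates: `shifted` is `ref` translated by 1 unit along x.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
shifted = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
```

In the toy example the RMS deviation equals the translation, whereas the DRMS is zero: DRMS depends only on internal distances and therefore needs no prior superposition.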
The comparison above shows that 5mdhA and 4mdhA are almost identical, both sequentially and structurally. They were solved at similar resolutions, 2.4 and 2.5 Å, respectively. However, 4mdhA has a better crystallographic R factor (16.7 versus 20%), eliminating 5mdhA. Inspection of the PDB file for 7mdhA reveals that its crystallographic refinement was
based on 1bmdA. In addition, 7mdhA was refined at a lower resolution than 1bmdA (2.4 versus 1.9 Å), eliminating 7mdhA. These observations leave only 1bmdA and 4mdhA as potential templates. Finally, 4mdhA is selected because of the higher overall sequence similarity to the target sequence.
Aligning TvLDH with Template. A good way of aligning the sequence of TvLDH with the structure of 4mdhA is the ALIGN2D command in Modeller. Although ALIGN2D is based on a dynamic programming algorithm,76 it is different from standard sequence–sequence alignment methods because it takes into account structural information from the template when constructing an alignment. This task is achieved through a variable gap penalty function that tends to place gaps in solvent exposed and curved regions, outside secondary structure segments, and between two Cα positions that are close in space. As a result, the alignment errors are reduced by approximately one third relative to those that occur with standard sequence alignment techniques. This improvement becomes more important as the similarity between the sequences decreases and the number of gaps increases. In the current example, the template–target similarity is so high that almost any alignment method with reasonable parameters will result in the same alignment. The following Modeller script aligns the TvLDH sequence in file ‘TvLDH.seq’ with the 4mdhA structure in the PDB file ‘4mdh.pdb’ (file ‘align2d.top’).
In the first line, Modeller reads the 4mdhA structure file. The SEQUENCE_TO_ALI command transfers the sequence to the alignment array and assigns it the name of ‘4mdhA’ (ALIGN_CODES). The third line reads the TvLDH sequence from file ‘TvLDH.seq’, assigns it the name ‘TvLDH’ (ALIGN_CODES), and adds it to the alignment array (‘ADD_SEQUENCE = ON’). The fourth line executes the ALIGN2D command to perform the alignment. Finally, the alignment is written out in two formats: PIR (‘TvLDH-4mdh.ali’) and PAP (‘TvLDH-4mdh.pap’). The PIR format is used by Modeller in the subsequent model-building stage. The PAP alignment format is easier to inspect visually. Due to the
76 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970).
high target–template similarity, there are only a few gaps in the alignment. In the PAP format, all identical positions are marked with a ‘*’ (file ‘TvLDH-4mdh.pap’).
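The principle of a variable gap penalty can be sketched with a small dynamic programming routine in Python. This is not the ALIGN2D penalty function: it uses only one structural feature (whether a template position lies in a secondary structure segment), a linear rather than affine penalty, and hypothetical cost values and function names.

```python
def gap_costs_from_ss(ss, in_segment=4.0, in_coil=1.0):
    """Per-position gap cost: high inside helices/strands (H/E), low elsewhere."""
    return [in_segment if s in "HE" else in_coil for s in ss]


def align(target, template, gap, match=2, mismatch=-1):
    """Global alignment with a gap cost that depends on the template position."""
    n, m = len(target), len(template)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j], back[0][j] = S[0][j - 1] - gap[j - 1], "left"
    for i in range(1, n + 1):
        S[i][0], back[i][0] = S[i - 1][0] - gap[0], "up"
        for j in range(1, m + 1):
            sub = match if target[i - 1] == template[j - 1] else mismatch
            options = [
                (S[i - 1][j - 1] + sub, "diag"),
                (S[i][j - 1] - gap[j - 1], "left"),  # template residue faces a gap
                (S[i - 1][j] - gap[j - 1], "up"),    # target residue faces a gap
            ]
            S[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the alignment.
    out_t, out_p, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            out_t.append(target[i - 1]); out_p.append(template[j - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_t.append("-"); out_p.append(template[j - 1])
            j -= 1
        else:
            out_t.append(target[i - 1]); out_p.append("-")
            i -= 1
    return "".join(reversed(out_t)), "".join(reversed(out_p)), S[n][m]


# The same sequences, with the coil (C) at a different template position.
result_a = align("AGD", "AGGD", gap_costs_from_ss("HHCH"))
result_b = align("AGD", "AGGD", gap_costs_from_ss("HCHH"))
```

With identical sequences and scores, the single gap moves to whichever template position is outside a secondary structure segment, which is the behavior the variable penalty is meant to encourage.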
Model Building. Once a target–template alignment is constructed, Modeller calculates a 3D model of the target completely automatically. The following script will generate five similar models of TvLDH based on the 4mdhA template structure and the alignment in file ‘TvLDH-4mdh.ali’ (file ‘model-single.top’).
The first line includes many standard variable and routine definitions. The following five lines set parameter values for the ‘model’ routine. ALNFILE names the file that contains the target–template alignment in the PIR format. KNOWNS defines the known template structure(s) in
ALNFILE (‘TvLDH-4mdh.ali’). SEQUENCE defines the name of the target sequence in ALNFILE. STARTING_MODEL and ENDING_MODEL define the number of models that are calculated (their indices will run from 1 to 5). The last line in the file calls the ‘model’ routine that actually calculates the models. The most important output files are ‘model.log’, which reports warnings, errors, and other useful information
including the input restraints used for modeling that remain violated in the final model; and ‘TvLDH.B99990001’, which contains the model coordinates in the PDB format. The model can be viewed with any program that reads the PDB format, such as CHIMERA (http://www.cgl.ucsf.edu/chimera/).
Evaluating a Model. If several models are calculated for the same target, the ‘‘best’’ model can be selected by picking the model with the lowest value of the Modeller objective function, which is reported in the second line of the model PDB file. The value of the objective function in Modeller is not an absolute measure in the sense that it can only be used to rank models calculated from the same alignment. Once a final model is selected, there are many ways to assess it (Section II.E). In this example, ProsaII45 is used to evaluate the model fold and Procheck70 is used to check the stereochemistry of the model. Before any external evaluation of the model, one should check the log file from the modeling run for runtime errors (‘model.log’) and restraint violations (see the Modeller manual for details73). Both ProsaII and Procheck confirm that a reasonable model was obtained, with a Z score comparable to that of the template (10.53 and 12.69 for the model and the template, respectively). However, the ProsaII energy profile indicates errors in the active site loop regions between residues 90–100 and 220–250 (Fig. 4). Loop 90–100 interacts with region 220–250, which forms the other half of the active site. In general, an error indicated by ProsaII is not necessarily an actual error, especially if it highlights an active site or a protein–protein interface. However, in this case, the same active site loops have a better profile in the template structure, which strengthens the assessment that the model is probably incorrect in the active site region.
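Picking the model with the lowest objective function is easily automated. The Python sketch below assumes that the value is the last whitespace-separated token on the second line of each model file; the exact layout of that line is an assumption, and the header text and numbers in the example are made up.

```python
def objective_from_pdb_text(text):
    """Read the objective function value from the second line of a model file.

    Assumes (hypothetically) that the value is the last token on that line.
    """
    second_line = text.splitlines()[1]
    return float(second_line.split()[-1])


def best_model(models):
    """Return the name of the model with the lowest objective function.

    `models` maps a model file name to the text of that file.
    """
    return min(models, key=lambda name: objective_from_pdb_text(models[name]))


# Made-up file contents for three of the models; only line 2 matters here.
models = {
    "TvLDH.B99990001": "REMARK toy header\nREMARK OBJECTIVE FUNCTION: 1763.5\n",
    "TvLDH.B99990002": "REMARK toy header\nREMARK OBJECTIVE FUNCTION: 1560.9\n",
    "TvLDH.B99990003": "REMARK toy header\nREMARK OBJECTIVE FUNCTION: 1712.4\n",
}
chosen = best_model(models)
```

Because the objective function is not an absolute measure, such a ranking is meaningful only among models calculated from the same alignment.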
Example 2: Modeling of Protein–Ligand Complex Based on Multiple Templates and User-Specified Restraints
An important aim of modeling is to contribute to understanding of the function of the modeled protein. Inspection of the 4mdhA template structure revealed that loop 93–100, one of the functionally most important parts of the enzyme, is more disordered than the rest of the protein (Fig. 4). While loop 220–250 also has an unfavorable ProsaII profile in the model, it
Fig. 4. ProsaII45 energy profile for the raw TvLDH model (dashed line), refined TvLDH model (thin line), and the 4mdhA template structure (thick line) (examples 1 and 2). The extended peak above the zero line in regions 90–100 and 220–250 of the raw model highlights a possible error in the raw model, significantly improved in the refined model.
is crystallographically well resolved in the template structure and is probably reported as an error in the model only because of its unfavorable nonbonded interactions with the incorrectly modeled region 90–100. Therefore, we focus here on refining region 90–100 only. To build a better model of the active site in TvLDH, we need to search for another template malate dehydrogenase structure, which may have a lower overall sequence similarity to TvLDH, but a better resolved active site loop. The old and new templates can then be used together to obtain a model of TvLDH. The active site loop tends to be more defined if the structure is solved together with its physiological ligand and a cofactor. The model based on a template with ligands bound is also expected to be more relevant for the purposes of our study of enzymatic specificity, especially if we also build the model with the ligands. 1emd, a malate dehydrogenase from E. coli, was identified in the PDB. While the 1emd sequence shares only 32% sequence identity with TvLDH,
the active site loop and its environment are more conserved. The loop in the 1emd structure is well resolved. Moreover, 1emd was solved in the presence of a citrate substrate analog and the NADH cofactor. The new alignment in the PAP format is shown below (file ‘TvLDH-4mdh.pap’).
The modified alignment refers to an edited 1emd structure (see below), 1emd_ed, as a second template. The alignment corresponds to a model that is based on 1emd_ed in its active site loop and on 4mdhA in the rest of the fold. Four residues on both sides of the active site loop are aligned with both templates to ensure that the loop has a good orientation relative to the rest of the model. The modeling script below has several changes with respect to ‘model-single.top’. First, the name of the alignment file assigned to ALNFILE is updated. Next, the variable KNOWNS is redefined to include both templates. Another change is an addition of the ‘SET HETATM_IO = ON’ command to allow reading of the non-standard pyruvate and NADH residues from the input PDB files. Finally, the subroutine ‘special_restraints’ is also defined to restrain the pyruvate ligand to a desired position in the model. The script is shown next (file ‘model-multiple-hetero.top’).
A ligand can be included in a model in two ways by Modeller. The first case corresponds to the ligand that is not present in the template structure, but is defined in the Modeller residue topology library. Such ligands include water molecules, metal ions, nucleotides, heme groups, and many other ligands (see FAQ 18 in the Modeller manual). This situation is not explored further here. The second case corresponds to the ligand that is already present in the template structure. We can assume either that the ligand interacts similarly with the target and the template, in which case we can rely on Modeller to extract and satisfy distance restraints automatically, or that the relative orientation is not necessarily conserved, in which
case the user needs to supply restraints on the relative orientation of the ligand and the target (the conformation of the ligand is assumed to be rigid). The two cases are illustrated by the NADH cofactor and pyruvate modeling, respectively. Both NADH and pyruvate are indicated by the ‘.’ characters at the end of each sequence in the alignment file above (the ‘/’ character indicates a chain break). In general, the ‘.’ character in Modeller indicates an arbitrary generic residue called a ‘‘block’’ residue (for details see the Modeller manual73). The 1emd structure file contains a citrate substrate analog. To obtain a model with pyruvate, the physiological substrate of TvLDH, we convert the citrate analog in 1emd into pyruvate by deleting the –CH(COOH)2 group, thus obtaining the 1emd_ed template file. A major advantage of using the ‘.’ characters is that it is not necessary to define the residue topology.
To obtain the restraints on pyruvate, we first superpose the structures of several LDH and MDH enzymes solved with ligands. Such a comparison identifies absolutely conserved electrostatic interactions involving catalytic residues Arg-161 and His-186 on the one hand, and the oxo groups of the lactate and malate ligands on the other hand. The modeling script can now be expanded by appending a routine that specifies the user-defined distance restraints between the conserved atoms of the active site residues and their substrate. The ADD_RESTRAINT command has two arguments. ATOM_IDS defines the restrained atoms, by specifying their atom types and the residue numbers as listed in the model coordinate file. RESTRAINT_PARAMETERS defines the restraints, by specifying the mathematical form (e.g., harmonic, cosine, cubic spline), modality, the type of the restrained feature (e.g., distance, angle, dihedral angle), the number of atoms in the restraint, and the restraint parameters.
In this case, a harmonic upper bound restraint of 3.5 ± 0.1 Å is imposed on the distances between the specified pairs of atoms. A trick is used to prevent Modeller from automatically calculating distance restraints on the pyruvate–TvLDH complex; the ligand in the 1emd_ed template is moved beyond the upper bound on the ligand–protein distance restraints (i.e., 10 Å).
The new script produces a model with a significantly improved ProsaII profile (Fig. 4). The predicted error in the 90–100 active site loop is much smaller, and the error in loop region 220–250 is practically resolved. The overall Z score is improved from 10.7 to 11.7, which compares well with the template Z score of 12.7. With this favorable evaluation, we gain confidence in the final model. The model was used for interpreting site-directed mutagenesis experiments aimed at elucidating the determinants of enzyme specificity in this class of enzymes.74
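The effect of the harmonic upper bound restraint of 3.5 ± 0.1 Å described above can be sketched in Python. The functional form below (zero at or below the bound, quadratic above it, scaled by the standard deviation) is the usual textbook form; the precise expression and units used inside Modeller are not given here, so this should be read as an assumption, and the function name is ours.

```python
def upper_bound_penalty(distance, bound=3.5, sigma=0.1):
    """Penalty for a harmonic upper bound restraint on a distance (in Å).

    Distances at or below `bound` are not penalized; beyond it the penalty
    grows quadratically, with `sigma` setting how steep the wall is.
    """
    excess = distance - bound
    if excess <= 0.0:
        return 0.0
    return 0.5 * (excess / sigma) ** 2


# Penalties for a few trial distances between a restrained pair of atoms.
trial = {d: upper_bound_penalty(d) for d in (3.0, 3.5, 3.7, 4.5)}
```

An upper bound leaves the pair free to approach more closely than 3.5 Å while strongly discouraging larger separations, which suits a conserved contact whose exact length is not known.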
Example 3: Modeling Fold and Loop in Circularly Permuted Cyanovirin
Cyanovirin-N (CV-N) was originally isolated from Nostoc ellipsosporum. It was identified in a screening effort as a highly potent inhibitor of diverse laboratory-adapted strains and clinical isolates of HIV-1, HIV-2, and SIV. Subsequently, the structure of CV-N was solved, first by NMR spectroscopy and later by X-ray crystallography at a resolution of 1.5 Å. The two structures are similar. The CV-N monomer consists of two similar domains with 32% sequence identity to each other. In the crystal structure, the domains are connected by a flexible linker region, forming a dimer by intermolecular domain swapping. Work was initiated to solve the monomer structure of a CV-N variant with circularly permuted domains (cpCV-N).77 Assuming that the overall structure does not change significantly, the new protein can be modeled by comparative modeling. An initial coarse model is built by using the following alignment file in the PAP format (file ‘circ.pap’).
Next, the new linker loop and the short N and C termini are refined by ab initio loop modeling. The selected segments that are subjected to loop modeling are indicated by stars in the alignment above. The loop modeling script is as follows (file ‘loop.top’).
77 L. G. Barrientos, R. Campos-Olivas, J. M. Louis, A. Fiser, A. Šali, and A. M. Gronenborn, J. Biomol. NMR 19, 289 (2001).
Fig. 5. Superposition of models for six linker segments with lengths from six to nine residues. Toward the C terminus of the loop, a larger structural variation can be observed, but the dominant conformation is well defined by a cluster of four loops.
SEQUENCE defines the name of the model. LOOP_MODEL defines the name of the input coordinate file containing the cpCV-N model whose loops need to be refined. LOOP_STARTING_MODEL and LOOP_ENDING_MODEL define how many final loop models are calculated (in this case, 200). The subroutine ‘select_loop_atoms’ selects regions of the model for loop modeling. Two arguments are submitted to the PICK_ATOMS command. SELECTION_SEGMENT defines the starting and ending residues of the loop. SELECTION_STATUS defines whether the program initializes the selection or adds the current loop to the previously defined set of loops. In this case, three loops are selected and optimized simultaneously. The filenames of output models with refined loops have the ‘.BL’ extension to distinguish them from the default file naming convention of the regular models (‘.B’). For instance, the first loop model file generated is ‘cpCN-V.BL00010001’. Although the linker segment is only six residues long, it is not known whether or not some of the preceding and subsequent residues undergo
conformational changes in the new construct. To investigate this question, we gradually extended the length of the modeled linker region from 6 to 12 residues. For this purpose, one needs to modify only the selection routine in the script above. The model with the lowest energy score of the 200 generated models was selected for each linker length from 6 to 12 residues. The superposition of the best models of varying length showed a dominant cluster of conformations, indicating that the modeling of the linker region is not limited by conformational changes in the immediately preceding or subsequent parts of the sequence (Fig. 5). The final comparative model with the optimized linker and terminal segments was used to refine the structure of cpCV-N against NMR dipolar coupling data. A good agreement between the experimental values and those calculated from the model confirmed that the fold of cpCV-N is similar to that of the wild type and that the model may facilitate characterization of the structure and dynamics of cpCV-N.77
Acknowledgments
We are grateful to all the members of our research group for many discussions about comparative protein structure modeling. A. F. was a Burroughs Wellcome Fund Postdoctoral Fellow and is a Charles Revson Foundation Postdoctoral Fellow. A. S. is an Irma T. Hirschl Trust Career Scientist. Research was supported by NIH/GM 54762, a Merck Genome Research Award (A. S.), and the Mathers Foundation. This review is based on Refs. 6, 7, and 78.
Modeller is available freely to academic users at http://salilab.org/modeller/modeller.html. It runs on many UNIX systems, including PCs running LINUX. All the sample files shown in this review are available at http://salilab.org/modeller/methenz/. Modeller, with a graphical interface, is also available as part of Quanta, InsightII, and GeneExplorer (Accelrys, San Diego, CA: dje@accelrys.com).
78 R. Sánchez and A. Šali, in ‘‘Protein Structure Prediction: Methods and Protocols’’ (D. M. Webster, ed.), p. 97. Humana Press, Totowa, NJ, 2000.
[21] GRASP2: Visualization, Surface Properties, and Electrostatics of Macromolecular Structures and Sequences
By Donald Petrey and Barry Honig
Introduction
The surface of a biological macromolecule is the region of the structure with which other molecules can interact. Consequently, visualization and analysis of surface properties can contribute significantly to the understanding of the physical determinants of protein function, from chemical catalysis, to allosteric properties, to protein–protein, protein–ligand, and protein–membrane interactions. The GRASP (Graphical Representation and Analysis of Surface Properties) program, written a number of years ago,1 was the first program specifically designed to display the molecular surface. It also provided several unique analytical tools. For example, electrostatic potentials calculated using the finite-difference Poisson–Boltzmann (FDPB) method could be displayed on the molecular surface.
This chapter describes the newest version of GRASP, GRASP2. This version has been written using the object-oriented programming language C++, making the algorithms and code easily accessible and modifiable. Also, molecular graphics functionality is carried out using the platform-independent graphics language OpenGL, making it more portable (GRASP2 currently runs on PCs running Microsoft Windows). Features have been added that, through the use of storable atomic subsets as well as other objects, allow more convenient control of the different functions the program can perform. In addition, many new capabilities have been added that facilitate the comparison of structures and the correlation of structural properties with primary sequence. All these capabilities have been incorporated into the Troll object-oriented software library of macromolecular analysis tools (D. Petrey, PhD Thesis, 2002).
One new feature in GRASP2 is its graphical user interface (GUI). Figure 1 shows the main components of the GRASP2 GUI. Details on its use are provided below in a set of five examples.
Most functions of the program can be carried out using mouse operations in the three main windows of the GUI: an object window, a molecular graphics window, and a
1 A. Nicholls, K. A. Sharp, and B. Honig, Proteins 11, 281 (1991).
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
Fig. 1. Graphical user interface for GRASP2. The interface contains a molecular graphics window similar to the previous version of the program, shown at upper right. The interface also contains two new windows. The object window is shown to the left of the molecular graphics window; the sequence window is shown below it.
sequence window. The molecular graphics window is, for the most part, identical to that used in the previous version of the program. The sequence window, in the bottom right-hand portion of Fig. 1, is a new feature. Whenever a protein or nucleic acid structure is loaded into the program, the primary sequences of all polypeptide and nucleic acid chains are displayed in this window. Sequences can be displayed in a number of formats. Highlighting of residues in sequence windows provides a means of manipulating or analyzing the properties of those residues. This is described in more detail in example 4 below. The object window, on the left in Fig. 1, contains icons representing the different types of objects that can be manipulated in order to control the rendering style of the macromolecular structure and to use the structural analysis tools that GRASP2 provides. In particular, the usual method of activating a particular function in GRASP2 involves the selection of one or more objects in the object window by clicking on them with the mouse,
right-clicking with the mouse to bring up the menu associated with that object (the ‘‘context menu’’), and selecting the desired function from among the menu options. These options are described in more detail below. Examples of all the objects that GRASP2 stores and displays are shown in Fig. 1 for hemoglobin Thionville (PDB code 1bab). The user can assign unique names to each object. The objects are as follows.
Atomic subsets: Icons representing subsets of atoms in the macromolecular structure are the primary interface to the most frequently used functions that GRASP2 performs. For example, to change the rendering style of the heme in chain B of hemoglobin Thionville to ‘‘ball-and-stick,’’ one would simply right-click on that subset in the object window (see Fig. 1) and select the ‘‘Style’’ menu option. The same procedure is used to display surfaces, calculate electrostatic potential, and so on. Other rendering schemes include wire frames (the default), ball-and-stick, CPK spheres, backbone worms or ribbons, strand arrows, helical cylinders, and Cα trace. Coloring is also controlled in this way. On initial loading of a molecular structure, the program will attempt to parse the structure into reasonable subsets. These usually include a subset representing all the atoms in the structure, each chain, each ligand, each ion, and a subset representing solvent molecules. The object window in Fig. 1 contains the default subsets created for hemoglobin Thionville. New subsets of the structure can be defined and saved as described below. Each subset that is created will be represented as an icon in the object window.
Surfaces: Five types of surface can be displayed by the program: the molecular surface, the accessible and van der Waals surfaces, a surface based on a Gaussian representation of the electron density of the molecule, and electrostatic potential contours. Many coloring schemes can be applied to each surface type automatically by the program.
The user can define arbitrary coloring schemes as well. The creation of surfaces and the specification of their properties are described in more detail in examples 3 and 5 below.
Electrostatic Potential Maps (Phimaps): The electrostatic potential at each point in space as determined by a numerical solution to the Poisson–Boltzmann equation is represented by a ‘‘phimap.’’ Phimaps are described in more detail in example 5 below.
Views: The current graphics state of the program can be saved. This includes all rendering styles and coloring schemes applied to the objects in a scene as well as rotations and translations applied to the objects. Double-clicking on a View icon in the object window restores that particular graphics state. This feature of GRASP2, combined with the ability to save objects into a single project file (as described in example 1) can greatly simplify the creation of figures for publication.
Molecular Models: Molecular models are used primarily to set the focus of translation and rotation operations. In general, a single molecular model is created by the program for each PDB file loaded. Molecular models can also be copied. For example, to change the rotation or translation of only chain B of hemoglobin Thionville (see Fig. 1), one would copy the molecular model for 1bab, hide all chains except chain B, and set the rotation/translation focus to that molecular model. Until the focus is changed to some other object or made global, translation and rotation operations will be applied only to chain B. As usual, all these functions are carried out by right-clicking on the objects to bring up the context menu, and selecting the appropriate menu option. Also, each separate molecular model can be displayed in different sections of the molecular graphics window, facilitating the comparison of structures. This process is described in example 3.
Examples
Example 1: File Input/Output, Manipulating Objects, Subsets
In addition to the usual PDB-formatted three-dimensional coordinate files, GRASP2 now supports storage and retrieval of ‘‘project files’’ that contain information about rendering styles, surfaces, views, potential maps, and subsets that have been created by a user. All of this information can be stored and accessed at a later time. Multiple PDB-formatted files can be viewed simultaneously by adding a file to the current project through the ‘‘File Add’’ menu option of the main menu. Also, if a particular PDB file is unavailable locally, the program can download files by accession code from the RCSB Protein Data Bank,2,3 using the FTP protocol. Other conveniences standard to the Windows operating system include a recent file list as well as direct printing (without the necessity of performing a screen capture) at full printer resolution, as long as there is a suitable driver for a given printer available on the system. A new feature of the program is a ‘‘Print preview’’ option, allowing a user to adjust the size and position of an image on the page before it is printed. A figure caption can be included on the printed image.
Most of the capabilities of GRASP2 can be accessed by clicking on the icons in the object window. Objects are selected by left-clicking on the icon.
2 F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. F. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
3 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
496
analysis and software
[21]
Multiple objects can be selected by holding down the “shift” key and left-clicking. As discussed above, each type of object has a different menu of options associated with it (a “context menu”), which can be displayed by right-clicking on the icon for that object. Figure 2 shows the context menu that applies to atomic subsets, as well as a submenu listing the types of surfaces that can be created. Subsets reflecting a reasonable partitioning of the molecular structure into chains, ligands, and ions are created automatically by the program and displayed in the object window. Subsets can be created in several other ways. Simple subsets can be created in the subset creation bar, shown just below the main menu in Fig. 1. More complicated subsets involving the Boolean operations “and,” “or,” and “not” can be created using the “Edit Subset” menu option on the main menu. Each time a subset is defined, an icon for it is placed in the object window, making it possible to revise the properties of that subset without redefining it. Other subsets can be calculated automatically by the program. For example, the atoms in an interface between two chains can be calculated by the “Calculate Interface” menu option. In GRASP2, an atom of a given chain is interfacial if there is an atom of some other chain at a distance less than the sum of the van der Waals radii of the two atoms plus a probe radius. Neighbors of a subset (defined as atoms within some distance cutoff) can be calculated using the “Calculate Neighbors” menu option of the main menu. Finally, highlighted residues in the sequence window implicitly define a subset of residues that can be manipulated in the same way as subsets in the object window (i.e., by right-clicking and selecting a menu option). Sequence windows are described in more detail in Example 3.
Fig. 2. Creation of a molecular surface for a subset.
Example 2: Alignment of Structures
Comparison of similar macromolecules has the potential to reveal valuable information about the physical determinants of their structure and function. For example, analysis of the differences between the complexed and isolated forms of interacting proteins can yield information about important conformational changes that occur on binding. While an automated procedure identifying all important differences between different forms of a protein would be desirable, it is often more efficient to “map” different structural properties onto the molecule, that is, draw each in a way that depends on some physical or statistical property, and simply compare them by eye so as to identify interesting similarities or differences that would be subjected to further analysis.
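The interface and neighbor criteria stated in Example 1 are simple distance tests, and a minimal sketch may make them concrete. The code below is an illustration only, not the GRASP2 implementation: the function names, the tuple layout for atoms, and the default probe radius of 1.4 Å are all assumptions introduced here.

```python
import math


def interface_atoms(chain_a, chain_b, probe_radius=1.4):
    """Return indices of atoms in chain_a that are interfacial with chain_b.

    Each atom is a tuple (x, y, z, vdw_radius). An atom of chain_a counts as
    interfacial if some atom of chain_b lies closer than the sum of the two
    van der Waals radii plus the probe radius (1.4 is an assumed default).
    """
    found = []
    for i, (xa, ya, za, ra) in enumerate(chain_a):
        for (xb, yb, zb, rb) in chain_b:
            d = math.dist((xa, ya, za), (xb, yb, zb))
            if d < ra + rb + probe_radius:
                found.append(i)
                break  # one close partner is enough
    return found


def neighbors(subset, others, cutoff):
    """Return indices of atoms in `others` within `cutoff` of any subset atom."""
    return [
        j
        for j, (xo, yo, zo, _) in enumerate(others)
        if any(
            math.dist((xo, yo, zo), (xs, ys, zs)) <= cutoff
            for (xs, ys, zs, _) in subset
        )
    ]


# Hypothetical coordinates (in angstroms) for a two-atom and a one-atom chain.
chain_a = [(0.0, 0.0, 0.0, 1.7), (10.0, 0.0, 0.0, 1.7)]
chain_b = [(4.0, 0.0, 0.0, 1.5)]
print(interface_atoms(chain_a, chain_b))
print(neighbors(chain_b, chain_a, cutoff=5.0))
```

The double loop is quadratic in the number of atoms; a production implementation would more likely use a spatial grid or tree to limit the pairs tested.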
In this and the examples that follow, the capabilities of GRASP2 are demonstrated by analyzing and comparing four antibodies binding to different antigens: lysozyme, cytochrome c, and two sites on hemagglutinin.
One of the primary technical difficulties with structural comparisons involves placing the structures to be compared into the same coordinate system so that it is possible to view them simultaneously and in the same orientation. GRASP2 provides three means of orienting pairs of molecules. All three superposition methods are controlled through a window similar to that shown in Fig. 3, used to transform coordinates using rigid body superposition. This window allows a user to specify the subset of atoms whose coordinates should be transformed, and the two subsets of atoms whose coordinates should be superposed to determine the transformation. To place the antibody–hemagglutinin complex in the same orientation as the antibody–lysozyme complex, the necessary transformation would be applied to all three chains of the antibody–hemagglutinin complex but calculated based solely on minimizing the RMSD between the heavy and light chains of the antibodies. The subset selections in Fig. 3 demonstrate how this would be specified. The default behavior of the program is to superpose all atoms. However, in the window shown in Fig. 3, a filter can be applied to the atoms so that only Cα atoms are superposed. When the number of residues in the subsets to be superposed is identical, the program simply carries out the rigid body superposition and reflects the changes in the molecular graphics window. When rigid body superposition is selected and the subsets to be superposed do not have the same number of residues, the program first performs a sequence alignment using the Smith–Waterman algorithm,4 and then brings up a second window that displays this alignment and allows the user to modify it. When the alignment is accepted, the rigid body superposition is carried out.
Fig. 3. Rigid body superposition. The three-dimensional coordinates of one subset of atoms (specified by selecting them in the top box) can be transformed by applying the linear transformation that minimizes the RMSD between other subsets of atoms (specified in the middle and bottom boxes).
4 T. F. Smith and M. S. Waterman, J. Mol. Biol. 147, 195 (1981).
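The sequence alignment step that precedes superposition of subsets of unequal length can be illustrated with a short sketch of Smith–Waterman local alignment. This is not the GRASP2 code: the function name, the flat match/mismatch scores, and the linear gap penalty are assumptions chosen for brevity, whereas a real protein aligner would normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for a local alignment.

    Scoring values are illustrative assumptions, not program defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores are never allowed to drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("HEAGAWGHEE", "AWGH"))
```

The residue pairs matched in the returned alignment are the ones that would then be passed to the RMSD-minimizing superposition.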
In situations in which two molecules are structurally similar but not similar in primary sequence, structural alignment methods must be used to place the two molecules in the same orientation. GRASP2 implements a double dynamic programming algorithm5,6 to align proteins structurally. An important step in this algorithm is a preliminary alignment of secondary structure elements, and the accuracy of the alignment can depend on the secondary structure definitions. When the structural alignment feature of GRASP2 is selected, a window is displayed that allows the user to view and edit the secondary structure definitions. When the secondary structure definitions are accepted, the structural alignment is carried out and a window is displayed showing the structure-based sequence alignment. Figure 4 shows the result of using these methods to orient the four sample antibody–antigen systems.
Fig. 4. Structural superposition of the heavy chains of four antibodies in complex with different antigens: lysozyme (PDB code 1mlc), cytochrome c (1wej), and two sites on hemagglutinin (1qfu and 2vir).
Example 3: Viewing Different Molecules Simultaneously, Comparing Surface Properties
Multiple views of different molecules can be displayed on the screen simultaneously. This is done by splitting the molecular graphics window, using the “Window Split” menu option of the main menu. This allows a user to define the dimensions of a two-dimensional grid of views, or “panes,” into which different molecular model objects can be drawn. Figure 5 shows the use of this option for the four antibody–antigen systems, with each complex displayed in a different pane of the molecular graphics window. When multiple panes are active in the molecular graphics window, rotations and translations can be applied in a synchronized fashion, or individually to each pane. In Fig. 5, rotations and translations were applied synchronously to each pane, so that all molecules are oriented with the interfacial atoms of the antibody facing the viewer. In this example and all that follow, we will use this orientation of the molecules to illustrate the capabilities of the program. Note that this view can be applied as needed, by selecting the “View 1” object from the object window.
Fig. 5. Viewing multiple molecules simultaneously and in the same orientation.
Figure 6 shows the molecular surface of each of the antibodies used here as examples. The coloring of the surface in Fig. 6 is new to this version of the program and made possible by a different algorithm for constructing the molecular surface, which is a local implementation of the Connolly algorithm.7 In that algorithm, the molecular surface is divided into two parts, the surface area of atoms directly in contact with a probe sphere (the “contact” area) and the surface area determined by the probe sphere itself (the “reentrant” area). In Fig. 6, the coloring of contact areas is determined by atom type, and the reentrant area is shown in pink and gold. This coloring scheme makes visual identification of individual residues and of individual atoms (by clicking on the contact area of a particular atom) more convenient.
Fig. 6. Simultaneous viewing of the antigen-binding surfaces of four antibodies. The coloring of the surfaces is determined by the component of the molecular surface to which a particular point on the surface belongs. Vertices in contact areas are colored by atom type, and the reentrant areas are plum.
One means of visualizing and comparing structural properties is to convert them into some range of numerical values, which can then be associated with a range of colors and “mapped” onto the molecular surface. For example, the minimum distance of the surface to some subset of atoms in the structure can be mapped to the surface. Converting these distances to colors and coloring the surface accordingly is a method frequently used to specify the surface patch that is in contact with, or within some distance cutoff of, a ligand. In Fig. 7, distance from the antigen is mapped to the molecular surfaces of the antibodies, with yellow indicating short distances, blue intermediate distances, and white long distances.
Fig. 7. Coloring the molecular surface using a structural property. Here distance from the antibody to the antigen is associated with a range of colors and mapped to the molecular surface. Yellow indicates regions of the surface that are close to the antigen, blue indicates regions at medium distances, and white indicates regions that are far away.
A means of visualizing and comparing molecular shape is to map a property, loosely termed “curvature,” onto the molecular surface. In Fig. 8 the coloring of each point on the surface is determined by a measure of the degree to which solvent molecules are constrained at that particular point. More specifically, the numerical value mapped onto the surface is the percentage of solid angle of a probe sphere accessible to other water molecules, a number that is also related to the hydrophobicity of the surface.1 Since fewer water molecules will be able to contact a probe sphere placed at a vertex in a concave region of the surface (and vice versa for convex regions), this coloring scheme also indicates deeply concave or highly convex (gray and green, respectively) regions of the surface. Differences in the shapes of the molecules are thus reflected in their different colorings.
Fig. 8. Coloring the molecular surface using a structural property. Here “curvature” (see text) is associated with a range of colors and mapped to the molecular surface. Green represents concave regions of the surface, white represents flat regions, and gray represents convex regions.
Distance and curvature are just two examples of the properties that GRASP2 can map onto the molecular surface (and onto any other surface that the program can display). A complete list of these properties is provided in Table I with a brief description of how each property is defined and used. Attention should be drawn to the “General Property 1” and “General Property 2” coloring schemes. When one of these two coloring schemes is selected for a surface, the program will read the numerical values in the B-factor field of the “ATOM” record of the PDB file (“General Property 1”) or the occupancy field (“General Property 2”) and map these values to the surface. The mapping is based on a coloring scheme specified by the user, who chooses a minimum, median, and maximum value for the range of numbers and the color that should be associated with each. The program will then color the surface accordingly. An increasingly popular coloring scheme for surfaces is based on a measure of the conservation of a particular residue within a family of structurally related proteins. An interesting example of the use of this feature can be found in Armon et al.,8 which describes a method of identifying conserved functional sites on the surfaces of related molecules.
5 C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, Structure 5, 1093 (1997).
6 A.-S. Yang and B. Honig, J. Mol. Biol. 301, 665 (2000).
7 M. L. Connolly, J. Appl. Crystallogr. 16, 548 (1983).
8 A. Armon, D. Graur, and N. Ben-Tal, J. Mol. Biol. 307, 447 (2001).
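The “General Property” coloring described above amounts to reading one numeric field per atom and passing it through a three-point color ramp. The sketch below is a hedged illustration rather than the program's own code: the function names are hypothetical, linear interpolation between the three colors is an assumption (the text does not say how intermediate values are blended), and the field positions follow the standard fixed-column PDB ATOM record (occupancy in columns 55–60, B-factor in columns 61–66).

```python
def atom_property(pdb_line, general_property=1):
    """Read the B-factor (1) or occupancy (2) field of a PDB ATOM record."""
    # Columns 61-66 hold the B-factor, columns 55-60 the occupancy.
    field = pdb_line[60:66] if general_property == 1 else pdb_line[54:60]
    return float(field)


def lerp(c0, c1, t):
    """Linearly interpolate between two RGB tuples for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def ramp_color(value, vmin, vmid, vmax, cmin, cmid, cmax):
    """Map a value to RGB using a (minimum, median, maximum) color ramp.

    Assumes vmin < vmid < vmax. Values outside the range are clamped to
    the end colors; linear blending in between is an assumption.
    """
    if value <= vmin:
        return cmin
    if value >= vmax:
        return cmax
    if value < vmid:
        return lerp(cmin, cmid, (value - vmin) / (vmid - vmin))
    return lerp(cmid, cmax, (value - vmid) / (vmax - vmid))


# Hypothetical ramp: red at 0, light gray at 50, blue at 100.
RED, GRAY, BLUE = (200, 0, 0), (200, 200, 200), (0, 0, 200)
print(ramp_color(25.0, 0.0, 50.0, 100.0, RED, GRAY, BLUE))
```

Because the two fields are ordinary numeric columns, any per-atom quantity (for example a conservation score) can be written into them by an external script and then displayed through the same ramp.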
TABLE I
Coloring Schemes for Surfaces
Property | Description
CPK | Vertices associated with each convex region of the surface are colored according to the CPK coloring of the atom associated with that vertex. Reentrant areas are colored plum
Potential | Vertices are colored according to the electrostatic potential at that point, as determined by a trilinear interpolation of the electrostatic potential at the eight nearest grid points of a phimap specified by the user
General property 1,2 | User-defined colors
Solid | The surface is colored uniformly with a single color
Accessibility/Gaussian curvature | Concave regions are colored gray and convex regions green
Distance | Vertices are colored according to the minimum distance of that vertex to an atom in a subset specified by the user
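The “Potential” entry in Table I relies on trilinear interpolation over the eight grid points surrounding a surface vertex. A minimal sketch of that interpolation is given below. It assumes a regular cubic lattice stored as nested lists, and the function name and argument layout are hypothetical rather than taken from the program.

```python
import math


def trilinear(grid, origin, spacing, point):
    """Interpolate a scalar grid[i][j][k] at an arbitrary point.

    grid    -- nested lists of potential values on a regular cubic lattice
    origin  -- (x, y, z) coordinates of grid[0][0][0]
    spacing -- distance between adjacent grid points
    point   -- (x, y, z) position, assumed to lie inside the grid
    """
    # Fractional grid coordinates of the point along each axis.
    f = [(p - o) / spacing for p, o in zip(point, origin)]
    n = (len(grid), len(grid[0]), len(grid[0][0]))
    # Lower corner of the enclosing cell, clamped so that corner + 1 exists.
    i0 = [min(int(math.floor(v)), dim - 2) for v, dim in zip(f, n)]
    tx, ty, tz = (v - i for v, i in zip(f, i0))
    i, j, k = i0

    # Weighted sum over the eight corners of the cell.
    value = 0.0
    for di, wx in ((0, 1 - tx), (1, tx)):
        for dj, wy in ((0, 1 - ty), (1, ty)):
            for dk, wz in ((0, 1 - tz), (1, tz)):
                value += wx * wy * wz * grid[i + di][j + dj][k + dk]
    return value


# A 2 x 2 x 2 test grid holding the linear function i + 2j + 3k.
grid = [[[i + 2 * j + 3 * k for k in range(2)] for j in range(2)] for i in range(2)]
print(trilinear(grid, (0.0, 0.0, 0.0), 1.0, (0.5, 0.5, 0.5)))
```

Trilinear interpolation reproduces any function that is linear in x, y, and z exactly, which makes a linear test grid a convenient check.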
Example 4: Analyzing Sequence–Structure–Function Relationships
Besides the sequence window contained in the primary GUI of the program, windows containing pairwise or multiple sequence alignments can be displayed. Figure 9 shows a multiple sequence alignment of the heavy chains of the four antibodies we are using as examples. Alignments such as these are conveniently constructed by double-clicking on the label of a previously defined subset (listed in the top window of Fig. 9). This adds the sequence spanned by that subset to the sequence alignment and updates the lower window of Fig. 9, which contains the sequence alignment itself. The algorithm used to calculate pairwise alignments is a local implementation of the Smith and Waterman algorithm.4 The multiple sequence alignment is generated using the method of Thompson et al.,9 with the exception that no “guide tree” is used. Alignments based on structural superposition are displayed in a similar window. Multiple structure alignments are generated using the same algorithm as for multiple sequence alignments, with similarity scores between residues being determined geometrically.5,6
Fig. 9. Multiple sequence alignment of the heavy chains of four antibody–antigen systems. Hydrophobic residues are colored in black, acidic and basic residues in red and blue, polar residues in green, and aromatic residues are colored in black with a white background.
The ability to view multiple sequences and structures simultaneously can facilitate the analysis of the relationship between sequence, structure, and function. For example, a reasonable question to ask about the structurally similar but functionally different set of antibodies used here as examples is whether or not certain types of interactions between the antibody and antigen, such as hydrogen bonds or ion pairs, are conserved. In GRASP2, hydrogen bonds and ion pairs between the antibodies and antigens can be identified by selecting a menu option on the main menu, which brings up the window shown in Fig. 10. Selecting a particular hydrogen bond in that window by left-clicking on it automatically highlights the residues participating in that interaction in all open sequence windows. Selecting all the hydrogen bonds and ion pairs between the antibodies and antigens results in the pattern of highlighted residues shown in Fig. 9. As an example of the type of information that may be gleaned from this type of analysis, note that these polar interactions cluster in three distinct regions of the sequence, the well-studied complementarity-determining regions (CDRs). This connection between the sequence views and windows used to carry out structural analysis is a new feature of the program and applies to other types of windows as well.
Fig. 10. Window used to calculate and analyze hydrogen bonds/salt bridges.
Another way that this feature might be used is to examine simultaneously the properties of aligned residues in a multiple sequence alignment. In Fig. 11, an interesting pattern of conserved aromatic residues that bracket the CDRs of the antibodies has been highlighted by a combination of left-clicking and shift-left-clicking. As with all sequence windows, highlighted residues define a subset that can be used to carry out several functions. To compare the positions of these residues in the three-dimensional structure, the user could change the display properties of aligned residues in sequence windows simultaneously by right-clicking on the sequence in Fig. 11 and selecting the appropriate menu option. Similarly, a user can ask about the types of interactions of these residues with the antigen using a process similar to that described in the previous paragraph for hydrogen bonds and ion pairs.
Fig. 11. Multiple sequence alignment window. The window is designed so as to make it possible to examine the properties of aligned or conserved residues simultaneously. Here a conserved pattern of aromatics is highlighted in the multiple sequence alignment and the context menu for that window is shown.
Example 5: Calculation and Visualization of Electrostatics, Electrostatic Interaction Energies
Comparison of macromolecules within the same structural family such as the immunoglobulins is useful in that it allows the correlation of structural and physicochemical features with function and, in that sense, should have predictive value when a structure of a new member of that family becomes available. The electrostatic properties of macromolecular systems are an important component of such a description. GRASP2 provides a means of calculating the electrostatic potential at any point in the macromolecular system (a “phimap”) by solving the linear Poisson–Boltzmann equation using finite-difference methods.10–12 Phimaps are created by selecting, in the object window, the subset or subsets for which the electrostatic potential calculation should be carried out, right-clicking on the selected subsets, and selecting the “Calculate Phimap” menu option. When this menu option is selected, a window will appear that allows a user to change the parameters that control how the calculation is performed, such as the charge set to use, boundary conditions, assignment of specific charges to specific atoms, etc. In the examples and illustrations that follow, this procedure was carried out four times, for the heavy and light chains of each of the antibody–antigen systems. When a phimap is calculated, an icon for it appears in the object window.
Phimaps can be used in several ways. Figure 12 shows the molecular surfaces of the interfacial atoms of each of the four antibodies. The coloring scheme in Fig. 12 is determined by the electrostatic potential at each point of the molecular surface, with red indicating electronegative regions of the surface (i.e., the energy required to bring a positive charge from an infinite distance to that point on the molecular surface would be favorable), white indicating neutral regions, and blue indicating electropositive regions. The residues that contribute significantly to the electrostatic potential at a particular point can be identified by clicking on the surface. As usual, residues identified in this way will be highlighted in all sequence windows. In addition, if a multiple sequence alignment window such as that shown in Fig. 9 is visible, residues that are conserved and contributing significantly to the electrostatic properties of the system can be identified conveniently.
Fig. 12. Coloring the molecular surface using a structural property. Here electrostatic potential is mapped to the molecular surface. Red indicates regions of negative electrostatic potential, white indicates neutral regions, and blue indicates positive regions.
Other uses of phimaps include the construction and display of two-dimensional isopotential contours and three-dimensional isopotential surfaces for arbitrary values of the electrostatic potential. Also, permanent storage of the phimaps makes it possible to calculate electrostatic interaction energies between different charged groups on the protein that accurately account for solvent effects (in the continuum solvent model). The electrostatic interaction energy of two charged groups is the energy of the charge distribution of one group in the potential generated by the second. For example, if one wanted to calculate the electrostatic interaction energy of a particular residue in the antibody with the antigen, one would first construct a phimap of the electrostatic potential due to the charges in the antigen, define a subset that included only atoms in the side chain of that particular residue in the antibody, and ask the program to calculate the energy of that subset in the potential generated by the antigen (again, by bringing up the context menu for that phimap). This could be carried out for an arbitrary number of residues, a process that could be used to identify all residues contributing significantly to the electrostatic energy of binding.
9 J. D. Thompson, D. G. Higgins, and T. J. Gibson, Nucleic Acids Res. 22, 4673 (1994).
10 A. Nicholls and B. Honig, J. Comput. Chem. 12, 435 (1991).
11 W. Rocchia, E. Alexov, and B. Honig, J. Phys. Chem. B 105, 6507 (2001).
12 W. Rocchia, S. Sridharan, A. Nicholls, E. Alexov, A. Chiabrera, and B. Honig, J. Comput. Chem. 23, 128 (2002).
Summary
The widespread use of the original version of GRASP revealed the importance of the visualization of physicochemical and structural properties on the molecular surface. This chapter describes a new version of GRASP that contains many new capabilities. In terms of analysis tools, the most notable new features are sequence and structure analysis and alignment tools and the graphical integration of sequence and structural information.
[22]
509
SNAPP: computational geometry
Not all the new features of GRASP2 could be described here and more capabilities are continually being added. An on-line manual, details on obtaining the software, and technical notes about the program and the Troll software library can be found at the Honig laboratory Web site (http://trantor.bioc.columbia.edu).
Acknowledgment
This work was supported by the National Science Foundation (DBI-9904841).
[22] Simplicial Neighborhood Analysis of Protein Packing (SNAPP): A Computational Geometry Approach to Studying Proteins
By Alexander Tropsha, Charles W. Carter, Jr., Stephen Cammer, and Iosif I. Vaisman
Introduction
The growth in the number of structures in the public protein structure database (expected to triple or quadruple over the next 5 years1), together with the addition of new structures determined in structural genomics initiatives, stimulates the need for computational tools designed to analyze and compare protein structures. Such analysis is perhaps the most important step toward utilization and comprehension of the new structural data provided by large-scale determination efforts. Related structural comparison methods are also relevant in the initial target selection process. Protein comparison relies on establishing relationships between protein 3D structure and its sequence, and recognition of recurrent structural and sequence-specific patterns, or motifs. Most often these motifs are detected in proteins with well-established evolutionary history. Such motifs can be detected on the basis of multiple sequence alignments of proteins with similar structural domains.2 In some cases, common structural fragments may be small, corresponding to active sites or other functionally relevant features such as in the Prosite patterns.3 Such global alignment-based approaches work best for closely related homologues; they cannot always be applied to detect similarity relationships between evolutionarily distant relatives. Furthermore, detection of structural similarities in evolutionarily unrelated proteins for understanding recurrent protein architectures usually depends on extensive pairwise comparisons of many structures.4
Our group has developed an approach to protein structure analysis and representation based on a computational geometry technique known as Delaunay tessellation. The latter method affords a unique identification of sets of four nearest-neighbor residues in mutual contact in folded protein structures.5–7 Our approach, termed the Simplicial Neighborhood Analysis of Protein Packing (SNAPP),8 employs Delaunay tessellation to identify recurrent tertiary packing motifs that may be characteristic of protein structural and functional families.9 The SNAPP method has afforded several important applications that are discussed in this chapter. They include automatic identification of elementary tertiary packing motifs recurring in a large database of protein structures, automatic identification of global patterns of protein structure organization by recognizing the protein’s hydrophobic core, and finally, identification of functional signature motifs in a family of proteins, which can assist structure-based annotation. These tools are summarized in Table I and are available from a SNAPP Web server (http://mmlsun4.pha.unc.edu/3dworkbench.html).
The chapter is organized as follows. The first section describes the application of computational geometry methodology to protein structure analysis and comparison. Then, the SNAPP method is applied to the problem of automatically identifying recurrent substructures in a large database of diverse protein structures. The subsequent section presents a novel approach to mapping protein cores with application to fold recognition via structural templates. The fourth section describes the use of the SNAPP methodology for recognizing functional patterns characteristic of three unique protein families. We conclude by discussing areas for future work.
1 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
2 S. Henikoff, J. Henikoff, and S. Pietrokovski, Bioinformatics 15, 471 (1999).
3 K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch, Nucleic Acids Res. 27, 215 (1999).
4 L. Holm and C. Sander, Nucleic Acids Res. 26, 316 (1998).
5 A. Tropsha, R. Singh, I. Vaisman, and W. Zheng, Pac. Symp. Biocomput. 614 (1996).
6 R. Singh, A. Tropsha, and I. Vaisman, J. Comput. Biol. 3, 213 (1996).
7 W. Zheng, S. Cho, I. Vaisman, and A. Tropsha, Pac. Symp. Biocomput. 486 (1997).
8 http://mmlsun4.pha.unc.edu/psw/3dworkbench.html
9 S. A. Cammer, C. W. Carter, and A. Tropsha, in “Proceedings of the Third International Workshop for Methods for Macromolecular Modeling: Lecture Notes in Computational Science and Engineering (LNCSE),” p. 477. Springer-Verlag, New York, 2002.
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
TABLE I
Modules Available at SNAPP Web Site^a
Module | Functionality
ProCAM and WebProCAM: Protein Core Alignment Map | Identify and visualize protein hydrophobic cores
MuSE: Mutagenesis with SNAPP Evaluation | Calculate changes in the SNAPP score induced by single or multiple virtual mutations
SNAPP-P: SNAPP Patterns | Identify and visualize all four-residue packing patterns in a protein, ranking them by the value of SNAPP scores
SNAPP-M: SNAPP Motif | Identify all four-residue packing motifs characteristic of a protein family (i.e., shared by the majority of family members)
Contact Order Profile | Identify all contact residues for any given residue in the protein structure and generate contact order profile for the entire structure
Motif Library | Identify all protein structures in PDB Select or CulledPDB data sets that contain quadruplet motif with any given four-residue composition
SChiSM | Create Web page annotation of molecular models, using Chime
Motif Threader | Identify protein sequences in PDB or SwissProt databases that contain a specific sequence motif (with or without gaps)
^a http://mmlsun4.pha.unc.edu/3dworkbench.html
Computational Geometry in Protein Structure Analysis
Reduced Representation of Protein Structure and Residue Contacts
The analysis and prediction of protein structure frequently employ a reduced representation of the structure. Most often, 1D protein models (sequences) are compared using sequence alignment algorithms that are able to detect similarity when the proteins share >30% identity for aligned residues. More distant homology detection, however, requires assessing compatibility using a 3D model of a known protein as a template to guide comparison. Such structures are often modeled using a united-atom representation of each amino acid residue following the work of Tanaka and Scheraga10 and Levitt and Warshel.11 In these models, protein 3D structures are represented by a set of pairwise contacts between residues in the coordinate space. Usually, these contacts are evaluated with a Boltzmann-like statistical analysis of contact amino acid specificity in a diverse database
10 S. Tanaka and H. A. Scheraga, Proc. Natl. Acad. Sci. USA 72, 3802 (1975).
11 M. Levitt and A. Warshel, Nature 253, 694 (1975).
of protein structures,12 in which relative frequencies are considered as virtual equilibrium constants. Database contact analysis therefore leads to expression of contact preferences as a pseudo-free energy13–21 that may be used to assess sequence–structure (conformation) compatibility when evaluated for a particular protein. Several protein fold recognition protocols have been devised on the basis of statistical analysis of pairwise residue contacts in X-ray crystal structures.22,23 These algorithms are aimed at detecting structural similarity and evolutionary relationships between new and already known protein structures and sequences, which lie at the foundation of the knowledge-based methods for protein structure prediction from sequence, and structure comparison in protein classification. Moreover, they have potential applications in automated interpretation of electron density maps and provide important visualization tools for structure analysis. Residue-pair contacts also have been used in fold recognition to assess sequence–structure compatibility.24–26 Some researchers have extended residue contact definitions to include three-body terms.27 Statistical four-body contact potentials have been successfully applied in nongapped threading experiments and fold recognition as well.28,29
Higher-Order Residue Contact Geometry
Two- and three-body interactions capture linear and planar relationships in structures, respectively. For example, evaluation of three-body contacts has been based on ad hoc addition of constituent pairwise scores.30
12 M. J. Sippl, J. Mol. Biol. 213, 859 (1990).
13 S. H. Bryant and C. E. Lawrence, Proteins 16, 92 (1993).
14 G. M. Crippen, Biochemistry 30, 4232 (1991).
15 D. T. Jones, W. R. Taylor, and J. M. Thornton, Nature 358, 86 (1992).
16 R. S. DeWitte and E. I. Shakhnovich, Protein Sci. 3, 1570 (1994).
17 J. P. Kocher, M. J. Rooman, and S. J. Wodak, J. Mol. Biol. 235, 1598 (1994).
18 S. Miyazawa and R. L. Jernigan, J. Mol. Biol. 256, 623 (1996).
19 N. N. Alexandrov, R. Nussinov, and R. M. Zimmer, Pac. Symp. Biocomput. 53 (1996).
20 L. Mirny and E. Domany, Proteins 26, 391 (1996).
21 K. Yue and K. A. Dill, Protein Sci. 5, 254 (1996).
22 M. J. Sippl, Curr. Opin. Struct. Biol. 5, 229 (1995).
23 D. Jones and J. Thornton, Curr. Opin. Struct. Biol. 6, 210 (1996).
24 G. Casari and M. Sippl, Proteins 13, 258 (1992).
25 V. N. Maiorov and G. M. Crippen, J. Mol. Biol. 227, 876 (1992).
26 S. H. Bryant and C. E. Lawrence, Proteins 16, 92 (1993).
27 A. Godzik and J. Skolnick, Proc. Natl. Acad. Sci. USA 89, 12098 (1992).
28 P. Munson and R. Singh, Protein Sci. 6, 1467 (1998).
29 B. Krishnamoorthy and A. Tropsha, Bioinformatics 19, 1540 (2003).
30 A. Kolinski, W. Galazka, and J. Skolnick, J. Chem. Phys. 108, 2608 (1998).
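The Boltzmann-like treatment described above, in which relative contact frequencies are read as virtual equilibrium constants, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not reproduce any of the cited potentials: the function name `pair_pseudo_energy`, the toy counts, and the reference state (contacts formed at random in proportion to residue composition) are all assumptions made for the example.

```python
import math

def pair_pseudo_energy(contact_counts, residue_counts):
    """Pseudo-free energies (in units of kT) for residue-pair contacts.

    contact_counts: dict mapping an alphabetically ordered pair (a, b) to the
                    number of observed contacts in the database
    residue_counts: dict mapping residue type to its number of occurrences
    Assumed reference state: contacts form at random, in proportion to
    residue composition. Pairs with zero observations would need
    pseudocounts, which this sketch does not add.
    """
    total_contacts = sum(contact_counts.values())
    total_residues = sum(residue_counts.values())
    energies = {}
    for (a, b), n_obs in contact_counts.items():
        fa = residue_counts[a] / total_residues
        fb = residue_counts[b] / total_residues
        # an unlike pair can be drawn in two orders, a like pair in one
        expected = fa * fb if a == b else 2.0 * fa * fb
        observed = n_obs / total_contacts
        # inverse-Boltzmann relation: E = -ln(observed / expected)
        energies[(a, b)] = -math.log(observed / expected)
    return energies

# Toy database: two residue types, 100 contacts (made-up numbers)
residue_counts = {"L": 50, "K": 50}
contact_counts = {("L", "L"): 50, ("K", "L"): 30, ("K", "K"): 20}
energies = pair_pseudo_energy(contact_counts, residue_counts)
for pair, energy in sorted(energies.items()):
    print(pair, round(energy, 3))
```

In this toy set the L–L contact is observed twice as often as the reference state predicts, so its pseudo-energy is negative (favorable), whereas the under-represented K–L and K–K contacts come out positive.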
However, four-body contacts, or 3D simplices, are intrinsic to three-dimensional nearest-neighbor patterns. Delaunay tessellation was thus introduced as a means to extend contact analysis to four interacting residues, in order to capture these arrangements.5,6 It is worth emphasizing that when structural comparisons are made between related proteins, four points are both necessary and sufficient for aligning their three-dimensional structures for detailed comparison. Thus, partitioning structures into unique sets of quadruplet contacts reduces protein tertiary structure to a natural basis set of motifs, for identification and subsequent comparison. More importantly, sets of neighboring residues (local or distant in primary sequence) common to a group of proteins may be identified as multiple four-body packing patterns. When the motif is part of a functionally important substructure such as an active site or metal-binding cluster, even a single four-body contact common to a group of proteins can be sufficient to allow detection of a familial relationship. Searches for further structural similarity need only detect the simultaneous presence of multiple four-body motifs.
A statistical geometry approach to the study of the structure of molecular systems was pioneered by J. D. Bernal, who in the late 1950s suggested that “many of the properties of liquids can be most readily appreciated in terms of the packing of irregular polyhedra.”31 This approach, based on the Voronoi partitioning of space32 occupied by the molecule, was further developed by Finney for liquid and glass studies.33 In the mid-1970s it was first applied to study packing and volume distributions in proteins by Richards,34 Chothia,35 and Finney.36 The standard Voronoi tessellation algorithm treats all atoms as points and allocates volume to each atom regardless of the atom size, which leads to overestimation of the volume of small atoms and underestimation of the volume of large ones. Richards introduced changes in the algorithm34 that made Voronoi volumes proportional to the atom sizes, creating a chemically more relevant partitioning; however, this was achieved at the expense of the robustness of the algorithm. In addition to polyhedra around the atoms, Richards’s method produces vertex polyhedra that reduce the accuracy of the tessellation. An alternative procedure, “radical plane” partitioning, which is both chemically appropriate and completely rigorous, was designed by Gellatly and Finney and applied to the
31 J. D. Bernal, Nature 183, 141 (1959).
32 G. F. Voronoi, J. Reine Angew. Math. 134, 198 (1908).
33 J. L. Finney, Proc. R. Soc. Lond. B Biol. Sci. A319, 479 and 495 (1970).
34 F. M. Richards, J. Mol. Biol. 82, 1 (1974).
35 C. Chothia, Nature 254, 304 (1975).
36 J. L. Finney, J. Mol. Biol. 96, 721 (1975).
calculation of protein volumes.37 Volume calculation remains one of the most popular applications of Voronoi tessellation to protein structure analysis. It has been used to evaluate the differences in amino acid residue volumes in solution and in the interior of folded proteins,38 to compare the sizes of atomic groups in proteins and organic compounds,39 and to measure sensitivity of residue volumes to the selection of tessellation parameters and protein training set.40 One problem in constructing Voronoi diagrams for molecular systems is the difficulty of defining a closed polyhedron around the surface atoms. This problem can be addressed by “solvating” the tessellated molecule with at least one layer of solvent or by using a computed solvent-accessible surface for the external closures of Voronoi polyhedra. Analysis of atomic volumes on the protein surface can be used to adjust parameters of the force field for implicit solvent models.41 Examination of the packing of residues in proteins by Voronoi tessellation revealed a strong 5-fold local symmetry similar to random packing of hard spheres, suggesting a condensed matter character of folded proteins.42 Correlations between the geometric parameters of Voronoi cells around residues and residue conformations were discovered by Angelov et al.43 Another study described application of the Voronoi procedure to study atom–atom contacts in proteins.44
A topological counterpart to Voronoi partitioning, the Delaunay tessellation,45 has an additional utility as a method for nonarbitrary identification of neighboring points in molecular systems represented by sets of points in space. Originally the Delaunay tessellation had been applied to study model46 and simple47 liquids, as well as water48 and aqueous solutions.49 The first application of the Delaunay tessellation for identification of nearest-neighbor residues in proteins and derivation of a four-body
37 B. J. Gellatly and J. L. Finney, J. Mol. Biol. 161, 305 (1982).
38 Y. Harpaz, M. Gerstein, and C. Chothia, Structure 2, 641 (1994).
39 J. Tsai, R. Taylor, C. Chothia, and M. Gerstein, J. Mol. Biol. 290, 253 (1999).
40 J. Tsai and M. Gerstein, Bioinformatics 18, 985 (2002).
41 M. Schaefer, C. Bartels, F. Leclerc, and M. Karplus, J. Comput. Chem. 22, 1857 (2001).
42 A. Soyer, J. Chomilier, J. P. Mornon, R. Jullien, and J. F. Sadoc, Phys. Rev. Lett. 85, 3532 (2000).
43 B. Angelov, J. F. Sadoc, R. Jullien, A. Soyer, J. P. Mornon, and J. Chomilier, Proteins 49, 446 (2002).
44 B. J. McConkey, V. Sobolev, and M. Edelman, Bioinformatics 18, 1365 (2002).
45 B. N. Delaunay, Izv. Akad. Nauk. SSSR Otd. Mat. Est. Nauk. 7, 793 (1934).
46 V. P. Voloshin, Y. I. Naberukhin, and N. N. Medvedev, Mol. Simul. 4, 209 (1989).
47 Y. I. Naberukhin, V. P. Voloshin, and N. N. Medvedev, Mol. Phys. 73, 917 (1991).
48 I. I. Vaisman, L. Perera, and M. L. Berkowitz, J. Chem. Phys. 98, 9859 (1993).
49 I. I. Vaisman, F. K. Brown, and A. Tropsha, J. Phys. Chem. 98, 5559 (1994).
statistical potential was described by Tropsha et al.5 and Singh et al.6 This potential has been successfully tested for fold recognition,6,7 protein folding on a lattice,50 and mutant stability studies.9,51 Statistical compositional analysis of Delaunay simplices revealed highly nonrandom clustering of amino acid residues in folded proteins when all residues were treated separately as well as when they were grouped according to their chemical, structural, or genetic properties.52 A Delaunay tessellation-based α-shape procedure for protein structure analysis was developed by Liang et al.53 It proved to be useful for the detection of cavities and pockets in protein structures.54 Several alternative Delaunay- and Voronoi-based techniques for cavity identification were described by Richards,55 Bakowies and van Gunsteren,56 and Chakravarty et al.57 Kobayashi and Go used Delaunay tessellation to compare similarity of protein local substructures58 and to study the mechanical response of a protein under applied pressure.59 As Delaunay tessellation forms the basis of the discrete geometric representation of protein molecules, the Voronoi and Delaunay tessellations of a set of points are discussed below.
Voronoi Diagram. The idea of formalizing proximity relationships within a set of point objects was first discussed by Dirichlet, and then by Voronoi.60 Given a set of points in a space, one wants to know which points are closest to a given point or which points form local neighborhoods. The Voronoi diagram of the set of points, or sites, distills the information necessary to describe these relationships. It tessellates, or tiles, the space by defining regions that are closer to each site than to any other. Formally, for a set of points $P = (p_1, p_2, \ldots, p_n)$ in Euclidean space, the Voronoi polytope, $V(p_i)$, associated with each point is defined as follows:
$$V(p_i) = \{\, x : |p_i - x| \le |p_j - x|,\ \forall j \ne i \,\}$$
50 H. H. Gan, A. Tropsha, and T. Schlick, Proteins 43, 161 (2001).
51 C. W. Carter, Jr., B. C. LeFebvre, S. A. Cammer, A. Tropsha, and M. A. Edgell, J. Mol. Biol. 311, 625 (2001).
52 I. I. Vaisman, A. Tropsha, and W. Zheng, “Proceedings of the IEEE Symposium on Intelligence and Systems,” 1998, p. 163.
53 J. Liang, H. Edelsbrunner, P. Fu, P. V. Sudhakar, and S. Subramaniam, Proteins 33, 1 (1998).
54 J. Liang, H. Edelsbrunner, P. Fu, P. V. Sudhakar, and S. Subramaniam, Proteins 33, 18 (1998); J. Liang, H. Edelsbrunner, and C. Woodward, Protein Sci. 7, 1884 (1998).
55 F. M. Richards, Methods Enzymol. 115, 440 (1985).
56 D. Bakowies and W. F. van Gunsteren, Proteins 47, 534 (2002).
57 S. Chakravarty, A. Bhinge, and R. Varadarajan, J. Biol. Chem. 277, 31345 (2002).
58 N. Kobayashi and N. Go, Eur. Biophys. J. 26, 135 (1997).
59 N. Kobayashi, T. Yamato, and N. Go, Proteins 28, 109 (1997).
60 F. Aurenhammer, ACM Comput. Surv. 23, 345 (1991).
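The definition of the Voronoi polytope given above translates almost literally into code. The following Python sketch is a brute-force illustration only (it constructs no polytope boundaries); the function names and the three example sites are assumptions introduced for the example.

```python
import math

def voronoi_cell_index(x, sites):
    """Index i of the site p_i whose Voronoi polytope V(p_i) contains x."""
    return min(range(len(sites)), key=lambda i: math.dist(sites[i], x))

def in_voronoi_cell(x, i, sites):
    """Literal form of the definition: |p_i - x| <= |p_j - x| for all j != i."""
    d_i = math.dist(sites[i], x)
    return all(d_i <= math.dist(p_j, x)
               for j, p_j in enumerate(sites) if j != i)

# Three hypothetical sites in the plane
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

print(voronoi_cell_index((1.0, 1.0), sites))      # the point lies in the cell of site 0
# A point on the boundary between sites 0 and 1 belongs to both closed cells
print(in_voronoi_cell((2.0, 0.0), 0, sites),
      in_voronoi_cell((2.0, 0.0), 1, sites))
```

Because the inequality is non-strict, a point equidistant from two sites satisfies the membership test for both; such points are exactly those that make up the cell boundaries.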
Fig. 1. Voronoi diagram for a set of points in the plane, showing clustering of 2D points representing objects.
As formalized above, each polytope encloses a region with edges that are boundaries equidistant from a site and a nearest-neighbor site. Points on these boundaries have more than one nearest site, and the collection of the boundaries constitutes the Voronoi diagram. Thus, the tessellation comprises a set of nonoverlapping, convex polytopes, each defining the boundary between one site and all neighboring sites. The diagram serves as a map of the clustering of objects represented by single points in space. A two-dimensional example of the Voronoi diagram illustrates the basic features of the construction (Fig. 1). Because the Voronoi polytopes define volumes associated with individual atoms or residues, this analysis in 3D has been applied to solid modeling problems including atomic packing analysis in crystallographic protein structure studies,39,61 where each atom is located by a single point and such questions as packing density are of interest. In these applications the Voronoi diagram captures the agglomeration of atoms in a molecule. Restriction of the method to united-atom representations facilitates identification of close packing arrangements between residues in proteins.
Delaunay Tessellation. The Delaunay construction can be obtained directly from the Voronoi diagram. Note that, for each vertex of a polygon in the 2D diagram in Fig. 1, there are three sites equidistant from it that form a local neighborhood of objects that cluster together. An equivalent representation of the information in the Voronoi diagram is the dual graph obtained by
61 B. Lee and F. M. Richards, J. Mol. Biol. 55, 379 (1971).
Fig. 2. Dual tessellation of the point set in Fig. 1, showing the relationship between the Voronoi tessellation (gray lines) and Delaunay tessellation (black lines). Circumspheres (circles in 2D) are shown for four of the Delaunay 2-simplices at different gray levels.
connecting all adjacent Voronoi sites (sharing one edge in 2D) by the straight line between them. This construct was proved by Delaunay to give a triangulation of all points in a set, where each basic geometric feature is termed a simplex.60 In 2D the set is partitioned into a tessellation composed of triangles; in 3D the situation is analogous, with four neighbors clustering together and the volume tiled by irregular tetrahedra. Formally, in 3D, a Voronoi site is referred to as a 0-simplex, an edge is a 1-simplex, a triangle is a 2-simplex, and a tetrahedron is a 3-simplex. The collection of these k-simplexes is referred to as a simplicial complex. Each tetrahedron in a 3D simplicial complex corresponds exactly to one vertex of a Voronoi polyhedron: the four sites of the 3-simplex are equidistant from this vertex. Thus, the Delaunay tessellation contains the same information as the Voronoi tessellation, but represented in a different way. A dual map of the Voronoi and Delaunay tessellation shows the relationship between the two constructions (Fig. 2). A popular algorithm for computing the Delaunay tessellation in 3D is an incremental buildup procedure applicable to any n-dimensional space.62 In 3D the Delaunay tessellation partitions a structure such that the four vertices of each 3-simplex lie on a circumsphere centered at the corresponding
62 D. F. Watson, Comput. J. 24, 167 (1981).
Voronoi polyhedron vertex about which the neighboring sites cluster. The task, then, is to identify all tetrahedra whose circumspheres contain no other site from the input data. The process is illustrated for four 2D simplices and their circumcircles in Fig. 2. Initially, four points are chosen that lie on the bounding surface of the set, its convex hull (the smallest convex volume that contains all points). These four points define a 3-simplex, to which other points from the global set are added incrementally. After each new point is introduced to the tessellation, the circumspheres associated with the existing tetrahedra are examined for occupancy. New points contained by existing circumspheres form a simplicial subcomplex with the associated vertices of the overlapping simplices. In these cases old simplices are replaced and new simplices are constructed and tested for overlap at each step. If a new point does not lie within any predefined circumsphere, then the existing simplicial complex is not changed and new simplices are constructed that include the new point. In subsequent steps, as points are incrementally added and the resulting tessellation verified, the structure maintains a Delaunay tessellation of the growing substructure until all input points have been added. The final configuration is independent of the order in which points are added, although preordering of the input can improve computational speed. For protein structures, where each point represents an amino acid residue, the ordering along the chain is advantageous in this regard. When applied to protein structure, the Delaunay 4-tuples correspond to elementary 3D packing motifs of agglomerated residues, which can be defined explicitly for statistical analysis in terms of their compositions and dimensions.
Application of Delaunay Tessellation to Protein Structure
As shown in the 2D example (Fig. 1), the Voronoi polytopes are irregular, with many possible coordination numbers. In contrast, the Delaunay tessellation partitions the plane into triangles only, or space into tetrahedra in 3D (Fig. 3). This representation offers an advantage over the Voronoi site neighbor description when two different point sets are compared, since residue neighborhoods defined by the triangulation each contain the same number of members. More precisely, the Delaunay tessellation partitions a discrete model into elementary substructures that are geometrically homologous, and thus form a natural basis set for structure comparison aimed at recognizing global structural homology. This representation offers two distinct kinds of benefits. One stems from the unique tabulation of all tertiary contacts between residues or atoms, which simplifies visualization, identification, and comparison. The other results from the essential
Fig. 3. Delaunay tessellation of protein structure. Left: Raw tessellation of a protein structure. Right: Schematic showing that a tetrahedron in the raw tessellation corresponds to a four-body contact between residue side chains.
homology of all Delaunay simplices, which facilitates the analysis of frequencies when consulting a database. When two proteins are modeled using the united-atom residue scheme, substructural similarity detection is achieved when two simplicial subcomplexes of residues are recognized as homologous, and contain similar or identical residues. In general, homologous 3-simplexes contain homologous 2-simplexes (triplet contacts) and 1-simplexes (pair contacts). Therefore, 3-simplex-based structure comparison is a natural generalization of pair-contact descriptors to include 3D geometry. Thus, it is particularly suitable for 3D protein structure evaluation and comparison. The examples in the next section illustrate the use of Delaunay tessellation of simplified protein structure models for identifying interresidue contacts in the simplicial complex of a protein.
Identification of Recurring Tertiary Packing Motifs in Large Databases of Diverse Protein Structures
Definition of Quadruplet Residue Contacts and Elementary Packing Motifs Using Delaunay Tessellation
In a protein crystal structure the Delaunay simplices resulting from the united-atom reduced representation define four nearest-neighbor amino acids at their vertices. These, in turn, are connected by six interresidue physical distances (edge lengths) for the associated tetrahedron. Some of the edge lengths, especially for many surface residues, are unreasonably large, which makes the definition of the associated residues as nearest neighbors geometrically correct but physically meaningless. Consequently, tetrahedra with at least one edge length exceeding some threshold (e.g., 10 Å) are generally excluded from consideration. This restriction is
Fig. 4. Five types of sequence topology for residue clusters are possible. Only clusters composed of residues not adjacent in the sequence (the leftmost example) are used in the present analysis.
similar to pairwise contact definition schemes where interresidue contact distance ranges typically vary from 6.5 to 10 Å. Tighter residue clusters may be selected for study by reducing the cutoff used to filter the results of tessellation. For protein structures the relatively tight quadruplets define interfaces where four residue side chains interact (Fig. 3). The procedure allows an unambiguous definition of unique residue neighborhoods, thus providing elementary 3D structural descriptors for partitioning a protein structure into its elementary tertiary packing motifs. Four-residue clusters identified by Delaunay tessellation have an associated set of three sequence distances, that is, numbers of residues, separating the vertex residues along the protein sequence. These residue contacts may be divided into five classes based on sequence separation7 (Fig. 4). Taking residue identities into account, an elementary packing motif M can be generally defined as four nearest-neighbor residues in protein structure separated by three sequence distances as in the following expression:
$$M = (aa_i\, aa_j\, aa_k\, aa_l;\ d_1, d_2, d_3) \qquad (1)$$
In this formulation, i, j, k, and l are residue numbers in a protein sequence; the sequence distances are $d_1 = j - i$, $d_2 = k - j$, and $d_3 = l - k$. For this work only contacts with no sequence-adjacent residues have been included (Fig. 4) because they are the most representative of nonlocal, or tertiary, contacts. For the purpose of classification, sequence distances $d_1$, $d_2$, and $d_3$ are separated into four bins: (2, 3, 4, x), where x > 4. Three databases of protein structures have been considered for tertiary packing analysis. These included a set of 1100 structures generated using Culled PDB63 (20% identity, 2.5-Å resolution), the PDB Select64 set of 1289 structures (982 X-ray, 307 NMR), and the What if?
63 CulledPDB: http://www.fccc.edu/research/labs/dunbrack/pisces/
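The construction of an elementary packing motif according to expression (1), including the edge-length filter and the binning of sequence distances, can be sketched as follows. This Python fragment is a minimal illustration and not the SNAPP implementation: the function name `packing_motif`, the input layout, and the example coordinates are hypothetical, and the tetrahedra are assumed to come from a Delaunay tessellation computed elsewhere.

```python
import math
from itertools import combinations

def packing_motif(simplex, max_edge=10.0):
    """Describe one Delaunay simplex as (composition, binned sequence distances).

    simplex: four tuples (residue_number, one_letter_code, (x, y, z)); the
             coordinates are the united-atom points used in the tessellation.
    Returns None if any of the six edges exceeds max_edge (in angstroms) or
    if two of the residues are adjacent in sequence.
    """
    residues = sorted(simplex, key=lambda r: r[0])      # i < j < k < l
    coords = [r[2] for r in residues]
    # filter on the six edge lengths of the tetrahedron
    if any(math.dist(a, b) > max_edge for a, b in combinations(coords, 2)):
        return None
    numbers = [r[0] for r in residues]
    distances = [b - a for a, b in zip(numbers, numbers[1:])]   # d1, d2, d3
    if min(distances) < 2:                              # sequence-adjacent pair
        return None
    bins = tuple(d if d <= 4 else "x" for d in distances)       # bins 2, 3, 4, x
    composition = "".join(r[1] for r in residues)
    return composition, bins

# Hypothetical simplex: residue numbers, types, and coordinates are made up
example = [
    (44, "I", (0.0, 0.0, 5.0)),
    (10, "L", (0.0, 0.0, 0.0)),
    (40, "V", (0.0, 5.0, 0.0)),
    (13, "F", (5.0, 0.0, 0.0)),
]
print(packing_motif(example))
```

For the example simplex the sequence distances are 3, 27, and 4, so the second distance falls into the open bin x and the motif is reported as composition LFVI with bins (3, x, 4). Lowering `max_edge` (for instance to 8.5) selects tighter clusters, as described in the text.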
Fig. 5. Sequence–structure space of nonlocal contacts. For 1100 protein structures included in the CullPDB set (80% of hhhh contacts fall below the 8.5-Å
Fig. 11. Frequency of contact types, using two binary residue classifications. In both classifications all-hydrophobic clusters tend to be compact (maximum interaction distance of 10 Å). To focus on tighter contacts, a lower cutoff (8.5 Å) can be applied without loss of many clusters in visualization. HP: h = {CFILMVWY}. AHP: h = {ACFILMVWY}.
Therefore, a cutoff at 8.5 Å, which has been used often in residue contact order studies,75 may be applied to visualize more intimate contacts. In practice, short-distance filtering is used to examine the tightest clusters, which may be important for protein folding and stability. Color coding the tetrahedra by SNAPP score provides a clear delineation of the hydrophobic core regions of proteins. The ProCAM (Protein Core Alignment Map; Table I) program interface implements this approach for mapping hydrophobic cores. Visual inspection of structures using this color filter shows that contact patterns are distributed nonrandomly (Fig. 12); the example displayed is a typical protein of 294 residues: cytidine deaminase from Escherichia coli. Clusters that score highly are
75 A. R. Fersht, Proc. Natl. Acad. Sci. USA 97, 1525 (2000).
Fig. 12. Distribution of quadruplets in 1ctt colored according to log score. Contacts formed in the hydrophobic core appear as aggregates of red tetrahedra. Scores decrease away from the cores; surface tetrahedra appear in blue, owing to the occurrence of many polar side chains.
most often buried, while other types are distributed progressively toward the protein surface and away from the cores; a similar pattern appears when using simplified residue classification (Fig. 11). These contact distributions create a layered appearance in the visual representation of a protein. Separation of the hydrophobic phase from the protein surface is well defined in the core maps. In the 1ctt structure for cytidine deaminase (Figs. 12 and 13) there are three core regions associated with domains in the protein monomer. Each hydrophobic cluster is clearly separated from the polar surface residues.
Remote Fold Recognition
The cytidine deaminase structure also offers an interesting test of the relationship between structural core mapping and fold recognition. Two of the three domains in the cytidine deaminase monomer are structurally homologous and oriented with 2-fold symmetry. These two domains can
Fig. 13. Comparison of the distribution of all-hydrophobic (hhhh) contacts using two binary residue classifications. The AHP model uses the following grouping: h = {A,C,F,I,L,M,V,W,Y}. For the HP classification, the residues {C,F,I,L,M,V,W,Y} are designated as hydrophobic. Both schemes map core residues well; however, the AHP model includes contacts left out of the HP grouping.
be superimposed with root-mean-square deviation (RMSD) of at least 2.1 Å for backbone atoms of 102 corresponding pairs of residues (Fig. 14). Despite structural homology, the sequences share only 16% identity for aligned residue pairs. The N-terminal domain (residues 48–149) also contains residues directly involved in ligand binding. Patterns of nonpolar contacts, mapped for each substructure and then compared (Fig. 15), show that core regions appear in homologous positions within the two domain folds. Alignment of the two subdomains shows that residues contributing nonpolar side chains to the cores occur at exactly corresponding positions along the sequences in most, but not all, cases. Specifically, analogous hydrophobic core pairs include I147–I288, L139–L283, M136–L280, V124–L262, I122–A260, L119–I257, A111–L247, I108–L244, and A77–L215 (Fig. 15). Details of core packing differ in the two domains; the core for domain 1 (which includes most of the active site) contains fewer high-scoring contacts. Sparser packing in domain 1 might allow flexibility and improve
Fig. 14. Alignment of homologous substructural domains in E. coli cytidine deaminase. The bound ligand accompanies the domain with lighter shading (residues 48–149). The loop to the right of the ligand (darker shade) occurs in the structural domain; it sterically overlaps the corresponding region in the first domain.
substrate binding. This potentially implies that the zinc- and substrate-bound enzyme is in a relatively less stable conformation because of reduced nonpolar interactions. Much of the difference in cores is due to replacement of a glutamate (E104) in the domain containing the active site by a leucine residue in the C-terminal domain; the glutamate forms part of the active site zinc-binding cluster, which is necessary for activity of the enzyme. Three other zinc-chelating residues, C129, C132, and H102, are
also missing counterparts in the second of these domains. The conformational requirement to attain a functional zinc-binding site reduces interdigitation and packing of core residue side chains, while in return providing nearly covalent cross-linking of the three Zn-coordinating residues. Gene fusion probably occurred to produce the global structure from the homologous domains within the same monomer. It is likely that mutation of residues led to evolution of a new function for an extant module. Optimizing function along with sequence drift eliminated recognizable sequence similarity, which precludes homologous fold recognition based on sequence comparison alone by a variety of other methods. The capacity of the two distinct sequences to adopt the same folding pattern is apparently explained by the occurrence of residues able to form similar structural cores. Identification of a common core in the two different domains in cytidine deaminase unambiguously establishes the conservation of the hydrophobic core as the dominant fold-determining homology.
Virtual Mutagenesis
Figures 9 and 10 suggest a strong link between the statistical potentials for high-likelihood quadruplets and the free energy contributed by hydrophobic residues to protein stability. An important and growing database with which to test this proportionality further is provided by the directed mutations made to hydrophobic core residues for which experimental Δ(ΔG) measurements have been published. The SNAPP MuSE (Mutagenesis with SNAPP Evaluation) module was developed to test and exploit this idea. The interface can be used either with PDB files on disk or directly from the database. Residue numbers for up to eight residues can be entered, together with new amino acids for each position, and the module returns a virtual change in SNAPP score. “Virtual” mutations are evaluated assuming that mutation does not alter the structure.
The SNAPP score for the native residue is accumulated for the native protein as the sum of the SNAPP scores for all tetrahedra in which the mutated residue(s) participate. The process is repeated for the mutant protein, in which the likelihoods are obtained for the mutant tetrahedra by changing the compositions of those tetrahedra summed for the native score. The initial test of this module on mutations in five proteins showed an excellent correlation between the change in SNAPP score and the measured stability change (R² = 0.74), which improved if the proteins were considered separately (0.70 < R² < 0.94). The protein-specific effect is apparently closely related to the mean SNAPP score of the protein, as the slopes for the five individual proteins correlate quite well (R² = 0.95) with the reciprocal of the mean SNAPP score.76
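The bookkeeping behind a virtual mutation, as described above, can be sketched in Python. This is an illustrative sketch and not the MuSE code: the function names, the data layout, the sign convention (mutant minus native), and all numerical scores in the example are assumptions. In keeping with the text, the tessellation of the native structure is reused unchanged for the mutant.

```python
def simplex_key(residue_types):
    """Order-independent key for a quadruplet composition, e.g. 'ILLV'."""
    return "".join(sorted(residue_types))

def virtual_mutation_delta(tetrahedra, sequence, mutations, score_table):
    """Change in summed score over the tetrahedra touched by the mutation(s).

    tetrahedra:  iterable of 4-tuples of residue numbers (native tessellation)
    sequence:    dict residue number -> one-letter code of the native protein
    mutations:   dict residue number -> new one-letter code
    score_table: dict composition key -> log-likelihood score
                 (hypothetical values in the example below)
    """
    mutant = {**sequence, **mutations}
    native_sum = 0.0
    mutant_sum = 0.0
    for tet in tetrahedra:
        if not any(r in mutations for r in tet):
            continue        # tetrahedron does not contain a mutated residue
        native_sum += score_table[simplex_key(sequence[r] for r in tet)]
        mutant_sum += score_table[simplex_key(mutant[r] for r in tet)]
    return mutant_sum - native_sum

# Made-up miniature example: six residues, three tetrahedra
sequence = {1: "L", 2: "I", 3: "V", 4: "L", 5: "F", 6: "K"}
tetrahedra = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6)]
score_table = {"ILLV": 0.75, "FILV": 1.0, "ALLV": 0.25, "AFLV": 0.5, "FKLV": -0.5}

# Replace isoleucine 2 by alanine; only the first two tetrahedra are affected
print(virtual_mutation_delta(tetrahedra, sequence, {2: "A"}, score_table))  # -1.0
```

With these made-up scores the two affected tetrahedra sum to 1.75 in the native protein and 0.75 in the mutant, so the virtual change is -1.0. A negative value under this sign convention corresponds to a loss of favorable packing.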
The high correlations between SNAPP scores and observed stability changes suggest that the MuSE module might be used, for example, to design mutations that would induce temperature sensitivity in proteins of known structure, but unknown function, for functional genomics studies. In this section, we have described a methodology and tools to facilitate identification and visualization of hydrophobic cores in protein structures based on four-body interresidue contacts. A structure is partitioned into a unique set of four-body contacts using Delaunay tessellation of the reduced representation of structure defined by residue side-chain centers. These contacts are grouped according to statistical log-likelihood scores derived by analysis of an 1100-structure database. Scores for clusters of uncharged residues correlate with hydrophobicities determined experimentally for constituent residues; scores increase with increasing hydrophobicity. Based on this correlation, regions of extensive nonpolar contacts may be identified and visualized by selecting contacts above a threshold likelihood score, and the thermodynamic effects of mutation may be examined by virtual mutagenesis.
Identification of Functionally Relevant Packing Motifs
Definition of Structure–Sequence Motifs
We conclude this chapter with several applications in structural genomics. The combination of a three-dimensional simplex with its associated three distances defined by the number of residues that separate two consecutive vertex residues along the sequence, that is, relation (1), affords an unambiguous description of an elementary sequence–structure motif. We hypothesized that four-body contacts formed by residues nonadjacent in protein sequence may represent the most interesting packing motifs, since the constituent residues may fold together as a result of specific tertiary interactions embodied in the resulting Delaunay tetrahedron, as opposed to physical proximity in the primary sequence. This methodology is well suited for identifying recurrent sequence–structure patterns common to particular protein families. Three groups of structures from SCOP-classified proteins77 were analyzed for recurrent motifs, using the SNAPP-motif module. One set consisted of the 7 structures for ligand-binding domains of nuclear receptors,
76 C. W. Carter, Jr., B. C. LeFebvre, S. A. Cammer, A. Tropsha, and M. A. Edgell, J. Mol. Biol. 311, 625 (2001).
77 A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995).
TABLE V
Sequence–Structure Patterns for Ligand-Binding Domain of Nuclear Receptor

Protein      Residue sequence
identifier   FLQL                  [AS]F[IL]L            LDL[ILV]
1a28         F739 L742 Q747 L750   S733 F739 I751 L826   L742 D746 L750 L835
1bsx         F293 L296 Q301 L304   A287 F293 L305 L378   L296 D300 L304 I392
1db1         F251 L254 Q259 L262   A245 F251 L263 I338   L254 D258 L262 I352
1lbd         F289 L292 Q297 L300   A283 F289 L301 L375   L292 D296 L300 V389
2lbd         F251 L254 Q259 L262   A245 F251 L263 L336   L254 D258 L262 V350
2prg         F306 L309 Q314 L317   A300 F306 L318 I392   L309 D313 L317 I406
3erd         F367 L370 Q375 L378   A361 F367 L379 L453   L370 D374 L378 I475
another comprised 27 members spanning the eukaryotic trypsin-like serine protease family, and the third included 13 structures from the G-protein family. In some cases it was necessary to relax the requirement for residue sequence identities to allow for conservative variation of the patterns. Sequence distances greater than five are written as ranges of actual values in the descriptions below.
Ligand-Binding Domains of Nuclear Receptors. This structure set contains seven proteins related by less than 33% sequence identity between pairs. One tertiary packing pattern was observed in all of the proteins: M = (FLQL; 3, 5, 3) (Table V). This substructure appears in the scaffold holding residues involved in coactivator protein binding to these receptor domains. The feature is located in the sequence segment that defines the signature for identifying these proteins from primary sequence. A second motif (AFLL; 6, 12, 73–74) appears in four of the seven structures; two variants are found in the other structures at equivalent structural positions (AFLI; 6, 12, 74–75) and (SFIL; 6, 12, 75). A third motif is shared by four proteins (LDLI; 4, 4, 88–97); variants of this pattern are found in the other structures (LDLL; 4, 4, 85) and (LDLV; 4, 4, 88–89). All three repeated patterns share common residues. A single phenylalanine residue appears as part of both the FLQL and [AS]F[IL]L motifs. A leucine residue is shared by the first and third motif. The three patterns form a structural core common to all the members of this protein family for which structural data are available. A consensus pattern may be written in Prosite-like form (i.e., as a specific amino acid residue motif allowing for limited residue substitution and flexible sequence separation between residues that constitute the motif) combining these structure-derived motifs: [AS]-x(5, 5)-F-x-L-x(3)-DQ-x(2)-L[IL]-x(72, 74)-[IL]-x(8, 21)-[ILV]. This composite substructure was used as a structural template to thread
sequences for all protein chains in the PDB. There were 44 sequences identified as containing this pattern. All 44 sequences correspond to structures for the ligand-binding domains of nuclear receptors. There were no false positives for this search.
Eukaryotic Trypsin-Like Serine Proteases. One motif appears in all 27 of the structures in this set; a second is found in 26 structures, and a third is found in 24 structures (Table VI). The most common pattern contains two residues (D, H) of the catalytic triad as well as a noncatalytic serine in contact with the catalytic histidine sidechain (AHDS; 2, 45–50, 112–114) (Table VI). The second most common pattern (GGDG; 97–102, 54–56, 3) has a variant in one of the structures: (GGDS; 97, 54, 3). This substructure is located near the first motif; the aspartate is adjacent to the catalytic serine in sequence. The third motif (AHSS; 2, 138, 19) includes the catalytic serine and histidine residues. Two variants are found in two of the remaining three structures: (AHGS; 2, 145, 19) and (AHAS; 2, 138, 19). Together, the first and third motifs define a single cluster of five residues common to 26 proteins. All sequences for PDB structures were threaded onto this template to yield 429 sequence matches. Two proteins (six sequences) that do not belong to the eukaryotic trypsin-like serine protease family were matched. One of the sequence matches corresponds to another serine protease, a subtilase (proteinase K), that is thought not to be related to the proteins used to derive the threading template.
G Proteins. This group of structures proved to be the most variable of the three families studied (Table VII). The most common pattern (GLNA; 4, 96–108, 30–44) appears in six structures. Variants of this pattern appear in the other structures.
TABLE VI
Sequence–Structure Patterns for Eukaryotic Serine Protease

Protein identifier               Residue sequence
                                 AHDS                  GGD[GS]
1a0l, 1aht, 1ao5, 1aut, 1azz,    A55 H57 D102 S214     G43 G140 D194 G197
1bda, 1bqy, 1cgh, 1dan, 1dfp,
1ekb, 1fon, 1fuj, 1fxy, 1gct,
1hcg, 1lmw, 1mct, 1npm, 1pfx,
1rtf, 1sgf, 1ton, 2hlc, 3rp2
1ddj                             A601 H603 D646 S760   G589 G684 D740 G743
1ppf                             A55 H57 D102 S214     G43 G140 D194 S197

TABLE VII
Sequence–Structure Patterns for G Proteins

Protein      Residue sequence
identifier   G[FILT][NT]A          [KQ]D[CS][EKLT]
1efu         G23 L27 N135 A174     K136 D138 S173 L175
1ega         G20 L24 N124 A156     K125 D127 S155 E157
1gua         G15 L19 N116 A148     K117 D119 S147 K149
1tx4         G17 L21 N117 A161     K118 D120 S160 K162
2rap         G15 L19 N116 A146     K117 D119 S145 K147
5p21         G15 L19 N116 A146     K117 D119 S145 K147
1hur         G29 I33 N126 A160     K127 D129 C159 T161
1tad         G41 I45 N265 A322     K266 D268 C321 T323
1a2k         G22 F26 N122 A151     K123 D125 S150 K152
3rab         G34 F38 N135 A166     K136 D138 S165 K167
1dar         G23 T26 N135 A174     K138 D140 S262 L264
1mh1         G15 L19 T115 A159     K116 D118 S158 L160
2ngr         G15 L19 T115 A159     Q116 D118 S158 L160

In two structures the corresponding pattern
is (GINA; 4, 93–220, 34–57), in two structures it is (GFNA; 4, 96–97, 29–31), in two structures it is (GLTA; 4, 96, 44), and in one structure it is (GTNA; 4, 109, 126). This group of patterns appears in the middle of the guanine-binding pocket of the proteins at the interface of the two structural modules. A second sequence–structure pattern (KDSK; 2, 25–40, 2) also occurs in six proteins. A variant (KDSL; 2, 35–122, 2) appears in three proteins. Another variant (KDCT; 2, 30–53, 2) appears in two proteins. Two variants occur in one protein each: (QDSL; 2, 40, 2) and (KDSE; 2, 28, 2). This set of variants forms a pocket around the aspartate that interacts specifically with the guanine ring of the substrate. The global sequence pattern obtained by combining the two families of sequence–structure motifs is G-x(3)-[FILT]-x(92,219)-[NT][KQ]-x-D-x(24,121)-[CS]A[EKLT]. A template-based search against PDB sequences identified 124 structures (174 sequences) that contain this pattern. From these, 146 sequences were identified as belonging to the G-protein family. One of the structure matches (three sequences) was found in a protein from a different family in the SCOP77 superfamily that includes the G proteins. Eleven proteins (28 sequences) identified did not belong to this protein family. Because of the great variability in the sequence spacing between residues for this pattern, it appears to be more effective to perform sequence comparisons with appropriate variants of this consensus. For example, using a pattern based on the shorter sequence ranges that appear
Fig. 15. Cores for two homologous domains in cytidine deaminase appear in analogous positions (ligand-binding domain is on the left). Illustrated here is the reduced packing of the core in the ligand-binding domain to create the zinc-binding site, especially the change from a core leucine to zinc-chelating glutamate at corresponding positions.
in most of the structures, G-x(3)-[FILT]-x(92,109)-[NT][KQ]-x-D-x(24,52)-[CS]A[EKLT], 127 sequences for G proteins were matched with no false positives. Subsequent search using long variants, G-x(3)-[FILT]-x(219)-[NT][KQ]-x-D-x(24,52)-[CS]A[EKLT] and G-x(3)-[FILT]-x(92,109)-[NT][KQ]-x-D-x(121)-[CS]A[EKLT], yielded 26 more matches with no false positives. Moreover, this set of proteins might be subdivided into groups based on less recurrent motifs in order to improve sequence analysis based on these patterns. The recurrent substructures identified for these protein families may be used for global structural alignment. Alignment of a representative pair of structures for each family demonstrates that this is a reasonable first step toward global structure alignment based on a single quadruplet of residues (Fig. 16). The composite structural templates described for these protein families may be used to scan new sequences to identify active site features or otherwise functionally relevant substructures. Other active sites may be mapped using this approach so that new sequence patterns derived from structural analysis can be included in motif searches and patterned sequence searches such as PSI-BLAST.78 Combination of the motifs identified here with patterns obtained by other methods is likely to improve sensitivity of sequence comparisons.79
78 S. F. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, Nucleic Acids Res. 25, 3389 (1997).
79 G. Eriani, M. Delarue, O. Poch, J. Gangloff, and D. Moras, Nature 347, 203 (1990).
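The Prosite-like patterns quoted above can be scanned against sequences with an ordinary regular-expression engine. The following is a minimal sketch, not the SNAPP implementation: the function names `prosite_like_to_regex` and `place` are hypothetical, the notation rules are assumed from the way the patterns are written in this chapter, and the test sequence is synthetic (only the motif positions listed for 5p21 in Table VII are taken from the text; all other positions are an arbitrary filler).

```python
import re


def prosite_like_to_regex(pattern):
    """Translate a Prosite-like pattern, as written in this chapter, to a regex.

    Assumed conventions: uppercase letters are residues, [..] is a set of
    allowed residues, x is any residue, x(n) is exactly n of them, and
    x(m,n) is between m and n of them. Hyphens and spaces are separators.
    """
    compact = pattern.replace("-", "").replace(" ", "")
    compact = re.sub(r"x\((\d+),(\d+)\)", r".{\1,\2}", compact)  # x(m,n)
    compact = re.sub(r"x\((\d+)\)", r".{\1}", compact)           # x(n)
    return compact.replace("x", ".")                             # bare x


def place(residues, length, filler="P"):
    """Build a synthetic sequence with the given residues at 1-based positions."""
    seq = [filler] * length
    for pos, aa in residues.items():
        seq[pos - 1] = aa
    return "".join(seq)


SHORT_G_PATTERN = "G-x(3)-[FILT]-x(92,109)-[NT][KQ]-x-D-x(24,52)-[CS]A[EKLT]"

# Toy sequence: motif residue positions as listed for 5p21 in Table VII;
# every other position is a proline filler and carries no biological meaning.
toy = place({15: "G", 19: "L", 116: "N", 117: "K", 119: "D",
             145: "S", 146: "A", 147: "K"}, length=160)

regex = re.compile(prosite_like_to_regex(SHORT_G_PATTERN))
hit = regex.search(toy)
print(regex.pattern)               # G.{3}[FILT].{92,109}[NT][KQ].D.{24,52}[CS]A[EKLT]
print(hit.start() + 1, hit.end())  # 15 147 (1-based first and last matched positions)
```

Because the wide x(m,n) ranges are what admit false positives, running the short-range and long-range variants as separate patterns, as described above, corresponds here to compiling one regular expression per variant and merging the hits.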
Fig. 16. Structural alignments based on geometry of one motif. (A) Alignment of two nuclear receptor ligand-binding domains (2prg, 1a28) using the (FLQL; 3, 5, 3) motif. (B) Alignment of two serine proteases (1a0l, 1azz) using the (AHDS; 2, x, x) motif. (C) Two G proteins (1efu, 5p21) aligned using the (GLNA; 4, x, x) motif. For each alignment the cluster of residues used to direct the alignment are shown as all-atom space-filling models.
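The motif notation used in this caption and in Tables V to VII, for example (FLQL; 3, 5, 3), can be recomputed directly from the residue numbers of the four vertex residues. The sketch below is illustrative only: the function names are hypothetical, and it assumes, as the Table V entries indicate (F739, L742, Q747, L750 gives 3, 5, 3), that each sequence distance is the difference between consecutive residue numbers.

```python
def motif_descriptor(residues):
    """Return (residue letters, sequence separations) for one four-body contact.

    `residues` is a list of four (one-letter code, residue number) pairs.
    """
    ordered = sorted(residues, key=lambda r: r[1])
    letters = "".join(aa for aa, _ in ordered)
    gaps = tuple(b[1] - a[1] for a, b in zip(ordered, ordered[1:]))
    return letters, gaps


def family_summary(motifs):
    """Summarize equivalent motifs from several structures.

    Returns the residue classes observed at each vertex and the (min, max)
    range of each of the three sequence separations.
    """
    described = [motif_descriptor(m) for m in motifs]
    classes = []
    for i in range(4):
        seen = sorted({letters[i] for letters, _ in described})
        classes.append(seen[0] if len(seen) == 1 else "[" + "".join(seen) + "]")
    ranges = [(min(g[i] for _, g in described), max(g[i] for _, g in described))
              for i in range(3)]
    return "".join(classes), ranges


def parse(row):
    """Turn 'L742 D746 L750 L835' into [('L', 742), ('D', 746), ...]."""
    return [(tok[0], int(tok[1:])) for tok in row.split()]


# LDL[ILV] column of Table V, in the order 1a28, 1bsx, 1db1, 1lbd, 2lbd, 2prg, 3erd.
LDLX_ROWS = [
    "L742 D746 L750 L835",
    "L296 D300 L304 I392",
    "L254 D258 L262 I352",
    "L292 D296 L300 V389",
    "L254 D258 L262 V350",
    "L309 D313 L317 I406",
    "L370 D374 L378 I475",
]

flql_1a28 = motif_descriptor(parse("F739 L742 Q747 L750"))
ldlx_family = family_summary([parse(r) for r in LDLX_ROWS])
print(flql_1a28)    # ('FLQL', (3, 5, 3))
print(ldlx_family)  # ('LDL[ILV]', [(4, 4), (4, 4), (85, 97)])
```

A separation of d in this notation leaves d - 1 intervening residues, which is the number that appears inside x(...) when the motif is rewritten as a Prosite-like pattern.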
Conclusions
This chapter describes applications of a novel computational geometric approach (termed SNAPP) based on Delaunay tessellation and likelihood scoring to problems of relevance to the macromolecular crystallography community. These all depend on identification of tertiary packing motifs in protein structures. These motifs represent unique sequence-specific packing arrangements of four nearest-neighbor residues identified by means of Delaunay tessellation, the contributions of which to stability can be estimated quantitatively by likelihood scoring. We examined protein structural databases for the presence of recurrent motifs, and have shown that they frequently represent components of secondary–secondary interfaces in protein structures. The low tolerance for structural variation (0.6 Å RMSD) ensures that they may adopt puzzle-piece80 patterns repeated in unrelated proteins. These pieces may serve as tertiary building blocks for construction of novel proteins or as parallel platforms for combinatorial mutagenesis studies81 aimed at testing structural propensities of amino acids. The quantitative proportionality of SNAPP scores to experimentally derived potentials based on hydrophobicity, together with that of SNAPP scores to experimental Δ(ΔG) values, suggests that the ready identification and visualization tools we have described have a variety of useful applications, some of which we have suggested. Finally, detection of functional motifs related to primary sequence patterns offers the promise of applications in annotation and structure–function analysis of nonredundant structures in the PDB. In a separate effort, we have investigated a hypothesis that some of the packing motifs could serve as unique markers of protein functional families. Using three families as test cases (G proteins, serine proteases, and nuclear receptors), we demonstrated that in every case we could identify common and unique motifs characteristic of each family.
These results are promising and our future work should expand this analysis in a number of ways. Active sites should be further analyzed for motifs that do not fit the secondary structure contact rules, that is, a more general analysis of recurrent patterns beyond the types described in this chapter should be conducted. This future work will include sequence–structure patterns that are not exclusively nonlocal. An analysis of all active sites in the known 3D structures would be particularly useful for developing strategies to predict active site or otherwise functionally relevant residues in new sequences. Analysis of other protein
80 K. M. Gernert, B. D. Thomas, J. C. Plurad, J. S. Richardson, D. C. Richardson, and L. D. Bergman, Pac. Symp. Biocomput. 331 (1996).
81 S. J. Lahr, A. Broadwater, C. W. Carter, Jr., M. L. Collier, L. Hensley, J. C. Waldner, G. J. Pielak, and M. H. Edgell, Proc. Natl. Acad. Sci. USA 96, 14860 (1999).
families derived by structural classifications should yield motif patterns for individual sets of proteins. These patterns could then be used to define sequence–structure cores common to family members. These cores may be used to constrain alignments of new family members. Motif patterns may be used as additional constraints in current fold recognition protocols to search databases for new sequences with structural/functional relationship to a protein of interest. Such analyses will be undertaken as more families of structures are mapped using this methodology. In summary, a novel protein structure packing analysis method, SNAPP, has been developed. The SNAPP modules discussed in this chapter have been made accessible at the following server: http://mmlsun4.pha.unc.edu/3dworkbench.html. We expect the SNAPP methodology will continue to be useful for the analysis of structures determined by structural genomics projects.
[23] Tools and Databases to Analyze Protein Flexibility; Approaches to Mapping Implied Features onto Sequences
By W. G. Krebs, J. Tsai, Vadim Alexandrov, Jochen Junker, Ronald Jansen, and Mark Gerstein
Introduction
Macromolecular motions are often the essential link between structure and function. They also are intrinsically interesting because of their relationship to principles of macromolecular structure and stability. A rich literature in macromolecular motions exists.1–4 Studying individual protein motions provides the most information about how a specific protein operates. However, systematizing and analyzing many of the instances of protein structures solved in multiple conformations allows the study of motions in a database framework. This provides a statistical overview of motions, making it possible to sense broad patterns and trends, as well as to place an individual motion in perspective. This approach also encourages the development of standardized tools and approaches.
1 B. Isralewitz, M. Gao, and K. Schulten, Curr. Opin. Struct. Biol. 11, 224 (2001).
2 M. Young, K. Kirshenbaum, K. A. Dill, and S. Highsmith, Protein Sci. 8, 1752 (1999).
3 R. Shaknovich, G. Shue, and D. S. Kohtz, Mol. Cell. Biol. 12, 5059 (1992).
4 M. M. Dixon, H. Nicholson, L. Shewchuk, W. A. Baase, and B. W. Matthews, J. Mol. Biol. 227, 917 (1992).
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
Previously, we developed a comprehensive scheme for classifying and systematizing protein motions.5,6 This scheme is intended to be useful to those studying structure–function relationships (in particular, rational drug design7) and also to those involved in large-scale protein or genome surveys. Boutonnet et al. also made a detailed attempt at the systematic classification of protein motions.8,9 Numerous computational methods have been developed for the study of protein motions.5 Among these are many computational methods from traditional biophysics (e.g., molecular dynamics, energy minimization), which also relate to problems involving protein folding and the analysis of static (i.e., nonmoving) protein structures; these are well described in the literature.10–17 In this chapter, we describe how protein flexibility can be analyzed statistically in a database. The Database of Macromolecular Movements, which is accessible over the Internet at http://molmovdb.org, organizes a few hundred well-characterized motions on the basis of size and then packing, with the involvement of a well-packed interface in the motion being a key classifying feature. It also contains 4000 putative motions (September 2001) from automatic comparisons within the whole PDB. [As this goes to press in September of 2003, newer automatic comparison methods within an enlarged PDB have more than tripled the number of putative motions in the database.] Systematic literature searches suggest that the “universe” of studied motions is not much more than twice this. We illustrate some overall statistical themes derived from it, particularly related to the prevalence of motions in proteins. We then survey the computational tools employed in the database analysis: (1) structure comparison is useful to align and superpose different conformations; (2) adiabatic mapping interpolation, which is implemented on
5 W. G. Krebs and M. Gerstein, Nucleic Acids Res. 28, 1665 (2000).
6 M. Gerstein and W. G. Krebs, Nucleic Acids Res. 26, 4280 (1998).
7 I. D. Kuntz, Science 257, 1078 (1992).
8 M. Gerstein, R. Jansen, T. Johnson, J. Tsai, and W. G. Krebs, in “Rigidity Theory and Applications” (M. F. Thorpe and P. M. Duxbury, eds.), pp. 401–442. Kluwer Academic, Dordrecht, The Netherlands, 1999.
9 N. S. Boutonnet, M. J. Rooman, and S. J. Wodak, J. Mol. Biol. 253, 633 (1995).
10 J. Tsai, M. Levitt, and D. Baker, J. Mol. Biol. 291, 215 (1999).
11 Y. Z. Tang, W. Z. Chen, and C. X. Wang, Eur. Biophys. J. 29, 523 (2000).
12 D. Van Belle, L. De Maria, G. Iurcu, and S. J. Wodak, J. Mol. Biol. 298, 705 (2000).
13 S. T. Wlodek, T. Shen, and J. A. McCammon, Biopolymers 53, 265 (2000).
14 V. Daggett and M. Levitt, Annu. Rev. Biophys. Biomol. Struct. 22, 353 (1993).
15 S. Berneche and B. Roux, Biophys. J. 78, 2900 (2000).
16 M. K. Gilson et al., Science 263, 1276 (1994).
17 W. Wriggers and K. Schulten, Proteins 35, 262 (1999).
a large scale by the morph server, provides movie-like pathways between two superposed conformations, and in the process generates many standardized statistics; (3) normal mode analysis provides readily interpretable information about the flexibility of a single conformation; and (4) Voronoi volume calculations provide a rigorous basis for characterizing packing. Finally, we explain how the structural features in the motions database can be related to sequence, an important part of the overall process of transferring annotation to uncharacterized genomic data. This allows determination of a sequence-propensity scale for amino acids to be in linkers, in general, or flexible hinges, in particular. Preliminary calculations show more proline and less alanine and tryptophan in linkers.
Database of Macromolecular Motions
A statistical survey of protein and nucleic acid motions is embodied in the Database of Macromolecular Motions (Fig. 1), a comprehensive Internet-accessible database7 that attempts to classify all known instances of macromolecular motions on the basis of size and packing (Fig. 2). This database is accessible on the Web at molmovdb.org and is tightly integrated with a number of other Internet resources, such as the PDB,18 scop,19 CATH,20 Entrez,21 SPINE,22 and PartsList.23–26
Attributes of a Motion
Each motion in the database is associated with a variety of information.
Classification: A classification number gives the place of a motion in the size and packing classification scheme for motions described below.
3D structures: The identifiers have been made into hypertext links that link indirectly to the structure entries at PDB and other databases.
Literature references: Cross-referenced through Medline.
18 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
19 A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995).
20 C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, Structures 5, 1093 (1997).
21 C. W. Hogue, H. Ohkawa, and S. H. Bryant, Trends Biochem. Sci. 21, 226 (1996).
22 P. Bertone et al., Nucleic Acids Res. 29, 2884 (2001).
23 J. Qian et al., Nucleic Acids Res. 29, 1750 (2001).
24 G. D. Schuler, J. A. Epstein, H. Ohkawa, and J. A. Kans, Methods Enzymol. 266, 141 (1996).
25 J. A. Epstein, J. A. Kans, and G. D. Schuler, 2nd Ann. Int. WWW Conf. (1994).
26 A. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995).
Fig. 1. The Database of Macromolecular Motions, on the Web. (A) The World Wide Web “home page” of the database. One can type keywords in the small box at the top to retrieve entries. (B) A new “ranker” interface to the motions database. Movies (and their associated motions entries) can be sorted on the basis of dozens of useful statistics, including the size of the motion in angstroms, rotation of the motion around the hinge in degrees, date of submission, as well as energy statistics associated with the interpolated pathway. (C) A protein “morph” (animated representation) for calmodulin referenced by the database, along with the start of the database entry. Graphics and movies are accessed by clicking on an entry page. The main URL for the database is http://www.molmovdb.org. Beneath this are pages listing all the current movies, graphics illustrating the use of VRML to represent endpoints, and an automated submission form to add entries to the database. The database has direct links to the PDB for current entries (http://www.pdb.bnl.gov); the obsolete database (http://pdbobs.sdsc.gov) for obsolete entries; scop (http://scop.mrc-lmb.cam.ac.uk); Entrez/PubMed (http://www.ncbi.nlm.nih.gov/PubMed/medline.html); and LPFC (http://smi-web.stanford.edu/projects/helix/LPFC). Through these links one can easily connect to other common protein databases such as Swiss-Prot, Pro-Site, CATH, RiboWeb, and FSSP.24,160–166
[Fig. 2 schematic. Left: motions grouped by number of known forms (2 forms, 1 form), then by size of motion (fragment, domain, subunit) and mechanism of motion (hinge, shear, refold, special, unclassifiable for fragments and domains; allosteric, non-allosteric, unclassifiable for subunits), with example proteins and entry counts for each class. Examples shown for two-form domain motions: hinge (LF, ADK, CM), shear (CS, TrpR, AAT), refold (Serpin, RT), special (Ig elbow), unclassifiable (TBP, EF-tu). Right: diagrams of shear motion and hinge motion at interfaces.]
Fig. 2. Schematic showing the overall classification scheme for motions. Left: The database is organized around a hierarchical classification scheme, based on size (fragment, domain, subunit) and then packing (hinge or shear). Currently, the hierarchy also contains a third level for whether or not the motion is inferred. Right: A schematic showing the difference between shear (sliding) and hinge motions. It is important to realize that the hinge–shear classification in the database is only ‘‘predominant,’’ so that a motion classified as shear can contain a newly formed interface and one classified as hinge can have a preserved interface across which there is motion. Adapted from Gerstein and Krebs.6
Standardized numeric values: These describe the motion, such as the maximum displacement (overall and of just backbone atoms), the degree of rotation around the hinge, and residues with large torsion angle changes, when these numbers are available from the scientific literature. (The morph server, described below, attempts to compute these values automatically from the structures.)
Annotation level: The database is constructed so that each entry indicates the evidence behind its description and classification. For example, the classification might be based on careful manual analysis of two conformations, automatic output of a conformation comparison program, inferred based on structure comparison, or inferred based on sequence comparison. A clear distinction was made between the carefully analyzed, “gold standard” motions, such as lactoferrin, and the much more tentatively understood motion in a protein that is a sequence homolog of another protein
that is structurally similar to lactoferrin. The database indicates the evidence behind a motion by listing information about the experimental techniques used, stating whether or not the motion is inferred, and giving a standardized “annotation level.”
Size Classification
The classification scheme for proteins has a hierarchical layout shown in Fig. 2 (left). Protein motions are first ranked in order of their size (subunit, domain, and fragment). Domain motions, such as those in hexokinase or citrate synthase,27,28 provide the most common examples of protein flexibility.29–31 Usually, the motion of fragments smaller than domains refers to the motion of surface loops, such as the ones in triose phosphate isomerase or lactate dehydrogenase. It can also refer to the motion of secondary structures, such as of the helices in insulin.32–34
Packing Classification
For fragment and domain protein motions, the database systematizes the motions on the basis of the packing of atoms inside of proteins, which is a fundamental constraint on protein structure.29,34–42 Interfaces between different parts of a protein are usually packed tightly. Consequently, two basic mechanisms for protein motions, hinge and shear, are proposed depending on whether or not there is a continuously maintained interface preserved through the motion (Fig. 2, right). The special kind of sliding motion a protein must undergo to maintain a well-packed interface is
27 S. Remington, G. Wiegand, and R. Huber, J. Mol. Biol. 158, 111 (1982).
28 W. S. Bennett, Jr. and T. A. Steitz, Proc. Natl. Acad. Sci. USA 75, 4848 (1978).
29 M. Gerstein, A. M. Lesk, and C. Chothia, Biochemistry 33, 6739 (1994).
30 W. S. Bennett and R. Huber, Crit. Rev. Biochem. 15, 291 (1984).
31 J. Janin and S. Wodak, Prog. Biophys. Mol. Biol. 42, 21 (1983).
32 C. Abad-Zapatero, J. P. Griffith, J. L. Sussman, and M. G. Rossman, J. Mol. Biol. 198, 445 (1987).
33 R. K. Wierenga et al., Proteins 10, 93 (1991).
34 C. Chothia, A. M. Lesk, G. G. Dodson, and D. C. Hodgkin, Nature 302, 500 (1983).
35 F. M. Richards and W. A. Lim, Q. Rev. Biophys. 26, 423 (1994).
36 Y. Harpaz, M. Gerstein, and C. Chothia, Structure 2, 641 (1994).
37 M. Levitt, M. Gerstein, E. Huang, S. Subbiah, and J. Tsai, Annu. Rev. Biochem. 66, 549 (1997).
38 F. M. Richards, Methods Enzymol. 115, 440 (1985).
39 F. M. Richards, Annu. Rev. Biophys. Bioeng. 6, 151 (1977).
40 M. Gerstein and F. Richards, in “International Tables for Crystallography” (M. Rossman and E. Arnold, eds.), p. 531. Kluwer Academic, Dordrecht, The Netherlands, 2001.
41 J. Tsai, N. Voss, and M. Gerstein, Bioinformatics 17, 949 (2001).
42 A. M. Lesk and C. Chothia, J. Mol. Biol. 174, 175 (1984).
known as a shear motion; individual shear motions are small. When no continuously maintained interface constrains the motion, a hinge motion occurs. Typically, these motions occur in proteins with two domains (or fragments) connected by linkers (i.e., hinges) that are relatively unconstrained by packing. The whole motion may be produced by a few large torsion-angle changes. Beyond hinge and shear, there are a number of other possible classifications. In a special motion, a mechanism that is clearly neither hinge nor shear accounts for the motion. An example of this sort of motion is what occurs in the immunoglobulin ball-and-socket joint,43 where the motion involves sliding over a continuously maintained interface (like a shear motion) but because the interface is smooth and not interdigitating the motion can be large (like a hinge). In a refolding motion, the motion involves a partial refolding of the protein. This usually results in dramatic changes in the overall structure. A complete protein motion can be built up from a number of these basic motions. For the database, a motion is classified as “Shear” if it is predominantly a shear motion and “Hinge” if it is predominantly composed of hinge motions. Subunit motions are classified differently as allosteric, nonallosteric, or unclassifiable. Finally, large protein motions that cannot easily be classified as subunit motions are classified as complex movements. For example, the Lac repressor is classified as a complex motion because it can undergo at least two different motions: an order-to-disorder transition that the headpiece domain undergoes when it binds DNA, as well as a motion between the two dimers that make up the Lac repressor. Another example of complex motion involves a molecule binding between two other domains in a protein, such as that observed in the bacterial periplasmic binding proteins.44
How Many Motions Are There?
One basic question relates to the number of protein motions and to what degree they divide amongst the basic classification categories in the database. This can be answered on a number of levels, depending on our degree of knowledge about the motion. There are currently (September 21, 2001) the following motions in the protein motions database and other public repositories:
43 A. M. Lesk and C. Chothia, Nature 335, 188 (1988).
44 N. K. Vyas, M. N. Vyas, and F. A. Quiocho, J. Biol. Chem. 266, 5226 (1991).
1. 120 manually classified and curated motions: There are 261 pairs of PDB identifiers that are associated with the best-studied (gold standard) motions. [As this goes to press in September of 2003, the number of manually classified and curated motions has more than doubled.] These are motions for which evidence has been manually gathered from the scientific literature and compared with the structures. The vast majority of these have at least two solved X-ray structures. The gold standard motions can be statistically tabulated by classification or experimental technique (Fig. 3). Over 60% of the motions in the database are classified by size as domain motions, while the hinge mechanism is the most common mechanistic classification in the database, accounting for 45% of the entries. Reflecting the greater ease with which smaller motions can be studied experimentally, a greater percentage of fragment motions have structures for multiple conformations in the motion when gold standard motions are tabulated by size classification. Most of the fragment and domain motions in the database fall into the hinge or shear categories when classified by mechanism. The most common method for the study of protein motions involving a mechanical function is traditional X-ray crystallography,45,46 which was found to have contributed experimental data to nearly all of the protein motions in one comprehensive survey.6 NMR,47 time-resolved X-ray crystallography,48–50 and computational techniques such as molecular dynamics each contributed to less than 7% of the surveyed motions.7 However, it is conceivable that one or more of these latter techniques may become considerably more important as methodological advances continue to be made.
2. 240 submitted morphs: These were contributed interactively by Internet users via a Web form. Their annotation is of variable quality, depending on the person submitting the motion. Some of these explicitly reference two different PDB structures, although many use uploaded coordinates.
3. 441 PDB annotated entries: There are 441 entries in the PDB that explicitly mention the phrase ‘‘conformational change’’ in their comments section.
45 Z. Xu and P. B. Sigler, J. Struct. Biol. 124, 129 (1998).
46 C. A. Wilson, J. Kreychman, and M. Gerstein, J. Mol. Biol. 297, 233 (2000).
47 B. F. Volkman, D. Lipson, D. E. Wemmer, and D. Kern, Science 291, 2429 (2001).
48 T. Oka et al., Proc. Natl. Acad. Sci. USA 97, 14278 (2000).
49 U. K. Genick et al., Science 275, 1471 (1997).
50 I. Schlichting et al., Nature 345, 309 (1990).
[Fig. 3 panels: 120 gold standard motions by packing (hinge 45%, unclassifiable 20%, shear 14%, allosteric 7%, other/nonallosteric 7%, partial refolding 4%, nucleic acid 2%, notably motionless 1%); by size (domain 62%, fragment 22%, subunit 11%, complex 5%); known versus suspected; by experimental technique; database composition (3814 automatically found motions in the PDB, 240 user-submitted morphs, 120 gold standard motions); universe of studied motions (~10K PubMed hits), studied versus not studied.]
Fig. 3. Classification statistics. Approximately one-third of protein motions that have been studied are in the database (middle right). The database itself is divided into three categories: automatically found in the PDB, user-submitted, and the extensively studied ‘‘gold standard’’ motions that were manually curated from the scientific literature (top right). The gold standard set is further classified into subcategories on the basis of packing (top left), size of motion (middle left), known versus suspected motions based on the number of solved conformations (bottom left), and the experimental techniques used to study each motion (bottom right). The latter numbers sum to more than 100% because some motions were studied by more than one technique.
4. 3814 automatically found conformational change outliers: Wilson et al.45 performed a comprehensive set of structural alignments on version 1.39 of the scop database, which represented most known structures as of the beginning of 1999. From this they found 4403 pairs of domains that had appreciable sequence similarity yet had great structural differences; these represented putative instances of conformational changes. Of these, 3814 could be standardized and processed by the morph server (see below).
5. 13,191 hits in PubMed: Searching the titles and abstracts provides an additional way to identify putative motions and get a sense of the full size of the motions ‘‘universe.’’ There are currently 13,191 entries in the NCBI PubMed database that contain phrases such as ‘‘conformational change’’ or ‘‘macromolecular motion.’’ The increase in these terms over the years is diagrammed in Fig. 4 and the search methodology is explained in the caption. This number likely contains many false positives. One can estimate the fraction of false positives by examining their occurrence in a randomly selected subset of 100 articles. Doing this yields a false-positive rate of 20%, implying that only about 10,000 of the hits (13,191 × 0.80 ≈ 10,550) represent real motions.
[Fig. 4 plot: number of publications (y axis, 0 to 1000) versus year (x axis, 1970 to 2000). Legend entries as printed: "conformational changes" protein; conformational changes protein; protein macromolecular motions; protein motion; domain motions; with exponential trend lines for "conformational changes" protein, conformational changes protein, and protein motion.]
Fig. 4. An illustration of how the use of various terms and phrases associated with protein motion has increased every year in the literature. To construct this graph, various searches were done with the NCBI PubMed database. The graph distinguishes between various quoted and not-quoted searches. In total there were 13,191 hits in the PubMed database relating to protein motions.
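The false-positive correction applied to the PubMed count in item 5 is simple arithmetic, and a short sketch makes the uncertainty of a 100-article sample visible. The counts below are those quoted in the text; the function name `estimate_true_hits` and the normal-approximation interval are assumptions of this sketch, not part of the original survey.

```python
import math

def estimate_true_hits(total_hits, sample_size, false_positives, z=1.96):
    """Scale a raw hit count by the false-positive rate seen in a random sample.

    Returns (point_estimate, low, high). The interval uses the normal
    approximation to the binomial, which is an assumption of this sketch.
    """
    fp_rate = false_positives / sample_size
    # Standard error of a sample proportion.
    se = math.sqrt(fp_rate * (1.0 - fp_rate) / sample_size)
    true_rate = 1.0 - fp_rate
    point = total_hits * true_rate
    low = total_hits * max(0.0, true_rate - z * se)
    high = total_hits * min(1.0, true_rate + z * se)
    return point, low, high

# Numbers quoted in the text: 13,191 PubMed hits, 20 false positives in 100.
point, low, high = estimate_true_hits(13191, 100, 20)
print(round(point), round(low), round(high))
```

With only 100 articles inspected, the interval around the point estimate is wide, which is consistent with the text quoting the result only as ‘‘about 10,000.’’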
Methods for Protein Structure Comparison
The study of protein motions in a database framework rests on a number of techniques, which we discuss in the following sections. One of the most basic of these techniques is structure comparison, that is, determining which residues in two structures are analogous and then superposing the structures on the basis of these residues.
Sieve-Fit Superposition and Screw Axis Orientation
If one has the correspondences between atoms in two structures (i.e., an alignment between an open and a closed structure), one can use traditional ‘‘RMS superposition’’ to minimize the RMS difference between the atoms. However, there are some complexities associated with this. In a simple hinge motion (for example, that of calmodulin), such an alignment fits the closed conformation symmetrically inside the open conformation. Among other things, the maximum Cα displacement computed from such a superposition is considerably underestimated relative to the common-sense alignment, and an analysis or morph movie made with such an alignment would give the impression of a motion far more complicated than a simple opening of a hinge. To overcome these problems, one needs to do an iterated superposition or ‘‘sieve fit.’’42,51–53 Once the ‘‘sieve-fit’’ procedure has superimposed the stationary domains of the protein and, by exclusion, identified one or more mobile domains, the change in position between the starting and ending conformations of each mobile domain may be analyzed to identify a geometric transformation whose screw axis is (approximately) the axis of the corresponding hinge motion.51 If a significant hinge motion is present, one can use these transformations to align the z axis of the coordinate frame parallel to the hinge axis so that, when the motion is rendered, viewers will look down the screw axis of the hinge motion.
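The idea of an iterated superposition can be sketched on a toy model. The following is a minimal two-dimensional illustration, not the published sieve-fit procedure: it repeatedly performs a least-squares rigid fit and discards the single worst-fitting atom until all remaining atoms agree to within a cutoff. The function names, the cutoff, the one-atom-per-round rule, and the restriction to 2D (where the optimal rotation has a closed form) are all assumptions made for brevity.

```python
import math

def fit_2d(a, b):
    """Least-squares rigid fit of point list b onto a; returns (angle, ca, cb)."""
    n = len(a)
    ca = (sum(p[0] for p in a) / n, sum(p[1] for p in a) / n)
    cb = (sum(p[0] for p in b) / n, sum(p[1] for p in b) / n)
    s_cos = s_sin = 0.0
    for (ax, ay), (bx, by) in zip(a, b):
        ax, ay, bx, by = ax - ca[0], ay - ca[1], bx - cb[0], by - cb[1]
        s_cos += bx * ax + by * ay
        s_sin += bx * ay - by * ax
    return math.atan2(s_sin, s_cos), ca, cb

def transform(points, angle, ca, cb):
    """Rotate points about centroid cb by angle, then move cb onto ca."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * (x - cb[0]) - s * (y - cb[1]) + ca[0],
             s * (x - cb[0]) + c * (y - cb[1]) + ca[1]) for x, y in points]

def sieve_fit(a, b, cutoff=0.1, min_core=3):
    """Refit repeatedly, discarding the worst-fitting atom each round."""
    keep = list(range(len(a)))
    while True:
        fit = fit_2d([a[i] for i in keep], [b[i] for i in keep])
        moved = transform(b, *fit)
        dev = {i: math.hypot(a[i][0] - moved[i][0], a[i][1] - moved[i][1])
               for i in keep}
        worst = max(keep, key=dev.get)
        if dev[worst] <= cutoff or len(keep) <= min_core:
            return keep, moved
        keep.remove(worst)

# Toy "protein": a 3 x 3 static domain plus a 3-atom arm that swings by 60
# degrees about a hinge at (3, 1). All coordinates are made up.
static = [(float(x), float(y)) for y in range(3) for x in range(3)]
arm = [(4.0, 1.0), (5.0, 1.0), (6.0, 1.0)]
hinge = (3.0, 1.0)
arm_rotated = transform(arm, math.radians(60.0), hinge, hinge)
open_form, closed_form = static + arm, static + arm_rotated

core, moved = sieve_fit(open_form, closed_form)
angle, _, _ = fit_2d(arm, moved[9:])   # rotation of the arm relative to the core
print(core, round(abs(math.degrees(angle)), 1))
```

In this example the sieve retains the nine static atoms as the core, and fitting the arm after the core has been superposed recovers the hinge rotation that was used to build the model. In three dimensions the same loop applies, but the rigid fit requires a general rotation-fitting method in place of the closed-form angle.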
Structural Alignment
When the proteins being compared have different sequences and there is no obvious alignment between the two sequences, one first must use pairwise structural alignment before superposition can be attempted. Structural alignment consists of establishing equivalences between the residues in two different proteins, as is the case with conventional sequence
51 M. Gerstein, A. Lesk, and C. Chothia, in ‘‘Protein Motions’’ (S. Subbiah, ed.). R. G. Landes, Austin, TX, 1995.
52 M. Gerstein and C. H. Chothia, J. Mol. Biol. 220, 133 (1991).
53 M. Gerstein and R. Altman, J. Mol. Biol. 251, 161 (1995).
alignment. However, this equivalence is determined principally on the basis of the three-dimensional coordinates corresponding to each residue, not on the basis of the amino acid type. The general idea of structural alignment has been around since the first comparisons of the structures of myoglobin and hemoglobin.54 Systematic structural alignment began with the analysis of heme-binding proteins and dehydrogenases by Rossmann and colleagues.55,56 Completely automatic methods have the advantage of speed and objectivity. However, the structural classifications produced by a computer are not always as understandable or reliable as those produced by humans. Furthermore, although manual classification is slow, if it is done correctly it need be done only once. Because of their obvious utility, a large number of automated methods for protein structure comparison have been developed, using different representations of structures, definitions of similarity measures, and optimization algorithms.57–72 Among them, methods based on the use of distance matrices (also called distance maps or distance plots)73 for describing and comparing protein conformations were found quite useful for treating large structures. Some of these effectively compare the respective distance matrices of each structure, trying to minimize the difference in intraatomic distances for selected aligned substructures.60,61,74 Other methods58,75
54 M. F. Perutz et al., Nature 185, 416 (1960).
55 M. G. Rossmann and P. Argos, J. Biol. Chem. 250, 7525 (1975).
56 P. Argos and M. G. Rossmann, Biochemistry 18, 4951 (1979).
57 S. J. Remington and B. W. Matthews, J. Mol. Biol. 140, 77 (1980).
58 Y. Satow, G. H. Cohen, E. A. Padlan, and D. R. Davies, J. Mol. Biol. 190, 593 (1986).
59 P. Artymiuk, E. Mitchell, D. Rice, and P. Willett, J. Inform. Sci. 15, 287 (1989).
60 W. R. Taylor and C. A. Orengo, J. Mol. Biol. 208, 1 (1989).
61 A. Sali and T. L. Blundell, J. Mol. Biol. 212, 403 (1990).
62 G. Vriend and C. Sander, Proteins 11, 52 (1991).
63 R. B. Russell and G. J. Barton, Proteins 14, 309 (1992).
64 H. M. Grindley, P. J. Artymiuk, D. W. Rice, and P. Willett, J. Mol. Biol. 229, 707 (1993).
65 L. Holm and C. Sander, FEBS Lett. 315, 301 (1993).
66 A. Godzik and J. Skolnick, Comput. Appl. Biosci. 10, 587 (1994).
67 Z. K. Feng and M. J. Sippl, Fold. Des. 1, 123 (1996).
68 A. Falicov and F. E. Cohen, J. Mol. Biol. 258, 871 (1996).
69 J. F. Gibrat, T. Madej, and S. H. Bryant, Curr. Opin. Struct. Biol. 6, 377 (1996).
70 G. Cohen, ALIGN: J. Appl. Crystallogr. (1997).
71 M. Gerstein and M. Levitt, Protein Sci. 7, 445 (1998).
72 M. Levitt and M. Gerstein, Proc. Natl. Acad. Sci. USA 95, 5913 (1998).
73 D. C. Phillips, P. S. Rivers, M. J. E. Sternberg, J. M. Thornton, and I. A. Wilson, Biochem. Soc. Trans. 5, 642 (1977).
74 L. Holm and C. Sander, J. Mol. Biol. 233, 123 (1993).
75 G. H. Cohen, J. Appl. Crystallogr. 30, 1160 (1997).
directly try to minimize the interatomic distances between two structures. A similar approach is taken in minimizing the ‘‘soap bubble area’’ between two structures.68 Yet other methods involve further techniques, such as geometric hashing or lattice fitting.59,66,69 To understand these procedures, it is useful to compare structural alignment with the much more thoroughly studied methods for sequence alignment.76,77 Both sequence and structure alignment methods produce an alignment that can be described as an ordered set of equivalent pairs (i, j) associating residue i in protein A with residue j in protein B. Both methods allow gaps in these alignments that correspond to nonsequential i (or j) values in consecutive pairs (i.e., one has pairs (i, j ≠ i)). And both methods reach an alignment by optimizing a function that scores well for good matches and badly for gaps. The major difference between the methods is that the optimization used for sequence alignment is globally convergent, whereas that used for structural alignment is not. This is the case for sequence alignment because the optimum match for one part of a sequence is not affected by the match for any other part. Structural alignment fails to converge globally because the possible matches for different segments are tightly linked since they are part of the same rigid 3D structure. For this reason, the alignment found by a structural alignment algorithm can depend on the initial equivalences, whereas in sequence alignment there is no such dependence. The lack-of-convergence problem has led to a large number of different approaches to structural alignment, the methods differing in how they attack the problem. However, no current algorithm can find the globally optimum solution all the time; the convergence problem remains unsolved in the general case.
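The global convergence of sequence alignment can be made concrete with a short dynamic programming sketch. The code below uses the standard Needleman–Wunsch recurrence, which is named here for illustration and is not singled out in the text; the scoring values (match, mismatch, gap) are arbitrary assumptions. Because the best score for each prefix pair depends only on smaller prefix pairs, filling the table once yields the global optimum and the ordered set of equivalent pairs (i, j).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally optimal alignment of strings a and b by dynamic programming.

    Returns (score, pairs), where pairs is the ordered list of equivalent
    positions (i, j), 0-based, with gaps appearing as skipped indices.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the corner to recover the equivalent pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score[n][m], pairs[::-1]

print(needleman_wunsch("ACGT", "AGT"))   # (1, [(0, 0), (2, 1), (3, 2)])
```

No comparable table exists for structural alignment, because the score of matching one segment depends on the superposition implied by all the others; this is why structural methods start from initial equivalences and iterate.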
The methods also differ in the function they optimize (the equivalent of the amino acid substitution matrix used in sequence alignment) and how they treat gaps.
Multiple Structural Alignment
The next step after pairwise structural alignment is multiple structural alignment, simultaneously aligning three or more structures together. This is an essential first step in the construction of consensus structural templates, which aim to encapsulate the information in a family of structures. It can also form the nucleus for a large multiple sequence alignment; that is, highly homologous sequences can be aligned to each structure in the
76 R. Doolittle, ‘‘Of Urfs and Orfs.’’ University Science Books, Mill Valley, CA, 1987.
77 M. Gribskov and J. Devereux, ‘‘Sequence Analysis Primer.’’ Oxford University Press, New York, 1992.
multiple alignment. There are currently a number of approaches for this.72,78–80 Most of them proceed by analogy to multiple sequence alignment,80–83 building up an alignment by adding one structure at a time to the growing consensus. Since most new structures are similar structurally to ones reported previously, they can be grouped into families, and with sufficient members in each family it becomes possible to summarize, statistically, the commonalities and differences within each family. A method has been developed for finding the atoms in a family alignment that have low spatial variance and those that have higher spatial variance.84,85 It allows one to determine the ‘‘core’’ atoms that have the same relative position in all family members and the ‘‘noncore’’ atoms that do not.
HMMs for Structural Alignment
A hidden Markov model (HMM) is a stochastic chain of connected probabilistic states, where each state generates an observation. One can see only the observations, and the goal is to infer the hidden state sequence. For example, in speech recognition, the observations are sounds forming a word, and a model is one that by its ‘‘hidden’’ random process generates these sounds with high probability. HMMs applied to sequences were found to be highly useful for relating protein structures. In particular, they have been used for building the Pfam database of protein families,86–88 for gene finding,89 and for predicting secondary structure90 and transmembrane helices.91 In such HMMs we would typically have match states, corresponding to the letters of the consensus sequence, and insert states,
78 R. B. Russell and G. J. Barton, J. Mol. Biol. 234, 951 (1993).
79 A. Sali and T. L. Blundell, J. Mol. Biol. 212, 403 (1990).
80 W. R. Taylor, T. P. Flores, and C. A. Orengo, Protein Sci. 3, 1858 (1994).
81 W. R. Taylor, Comput. Appl. Biosci. 3, 81 (1987).
82 W. R. Taylor, J. Mol. Evol. 28, 161 (1988).
83 W. R. Taylor, Methods Enzymol. 183, 456 (1990).
84 M. Gerstein and R. Altman, CABIOS 11, 633 (1995).
85 R. Schmidt, M. Gerstein, and R. Altman, Protein Sci. 6, 246 (1997).
86 S. R. Eddy, Curr. Opin. Struct. Biol. 6, 361 (1996).
87 R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, ‘‘Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.’’ Cambridge University Press, New York, 1998.
88 A. Bateman et al., Nucleic Acids Res. 28, 263 (2000).
89 M. G. Reese, D. Kulp, H. Tammana, and D. Haussler, Genome Res. 10, 529 (2000).
90 C. Bystroff, V. Thorsson, and D. Baker, J. Mol. Biol. 301, 173 (2000).
91 E. L. Sonnhammer, G. von Heijne, and A. Krogh, Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175 (1998).
corresponding to the positions between the letters of the consensus sequence where gaps can be opened up and letters inserted. In addition, we can also allow special delete states that do not ‘‘use up’’ any characters of the sequence, but allow a path through the model to skip certain match or insert states. HMMs can be built (trained) from observation data and then used for scoring (classifying) a query sequence against a particular model. An important property of HMMs is their ability to capture information about the degree of conservation at various positions in a sequence alignment, and the varying degree to which insertions and deletions are permitted. This explains why HMMs can detect considerably more homologs than simple pairwise comparison.89,92
Although linear HMMs have proved useful for structural studies,90,91 none of the suggested schemes is fundamentally three-dimensional (coordinate dependent): all of them are based on building a 1D HMM profile representing a sequence alignment, and structural information enters only in the form of encoded symbols (e.g., H for helix and E for sheet). Adding in real 3D structure turns out to be nontrivial, since structure differs fundamentally from sequence not only in its increased dimensionality but also in the transition from a discrete to a continuous representation. Efforts to build HMMs that explicitly represent a protein in terms of 3D coordinates are currently under way.93
Interpolating between Structures: Morph Server
Following alignment and superposition of two structures, one can characterize the extent to which the two conformations differ, using a variety of straightforward analytical operations and statistical measures. Many of these metrics can be derived through trying to interpolate ‘‘intelligently’’ between the two structures. This is achieved through the morph server, which is associated with the database of macromolecular motions.
Adiabatic Mapping Interpolation
The morph server attempts to describe protein motions as a rigid-body rotation of a small ‘‘core’’ relative to a larger one, using a set of hinges. To ensure all statistics between any two motions are directly comparable, the motion is placed in a standardized coordinate system. Without special techniques, such as high temperature simulation or Brownian
92 J. Park et al., J. Mol. Biol. 284, 1201 (1998).
93 V. Alexandrov and M. Gerstein, in preparation (2001).
dynamics,94,95 normal dynamics simulations cannot approach the time scales of the large-scale motions in the database. Rather, a pathway interpolation is produced by two principal methods.
Straight Cartesian interpolation: The difference in each atomic coordinate (between the known endpoint structures) is calculated and then divided into a number of evenly spaced steps.
Adiabatic mapping: This is a modification of straight Cartesian interpolation that adds energy minimization after each interpolation step. This procedure produces interpolated frames with much more realistic geometry. A criticism of adiabatic mapping, often made by researchers attempting to interpolate between a protein in an unfolded and a native, folded state, is that the intermediates, although geometrically realistic, have somewhat higher energies than a theoretical analysis would indicate. However, from the standpoint of the server, which tackles the significantly easier problem of interpolating between two folded conformations, the energies of the intermediates are not that unrealistic (they can serve at least as a reasonable upper bound). Moreover, the technique has the advantage of providing useful results back to the user in a reasonable amount of time.
Visualization and Statistics
With the intermediate conformations morphed, the molecule is now visually rendered (Fig. 5). In connection with this, Martz96 has developed an external Web site that provides a Chime-based interface to the interpolated images. Users have already submitted hundreds of examples of protein motions to the server, producing a comprehensive set of statistics. Some examples of morphs include human interleukin 5,97 bc1 complex,98,99 glycerol kinase,100,101 and lactoferrin.102,103 The server collects a number of statistics,
94 D. Joseph, G. A. Petsko, and M. Karplus, Science 249, 1425 (1990).
95 R. C. Wade, M. E. Davis, B. A. Luty, J. D. Madura, and J. A. McCammon, Biophys. J. 64, 9 (1993).
96 E. Martz, URL: http://www.umass.edu/microbio/chime/explorer/index.htm (1999).
97 J. L. Verschelde et al., FEBS Lett. 424, 121 (1998).
98 A. R. Crofts et al., Proc. Natl. Acad. Sci. USA 96, 10021 (1999).
99 A. R. Crofts and E. A. Berry, Curr. Opin. Struct. Biol. 8, 501 (1998).
100 C. E. Bystrom, D. W. Pettigrew, B. P. Branchaud, P. O’Brien, and S. J. Remington, Biochemistry 38, 3508 (1999).
101 M. D. Feese, H. R. Faber, C. E. Bystrom, D. W. Pettigrew, and S. J. Remington, Structure 6, 1407 (1998).
102 A. B. Thompson et al., Am. Rev. Respir. Dis. 146, 389 (1992).
103 J. A. Sykes, M. J. Thomas, D. J. Goldie, and G. M. Turner, Clin. Chim. Acta 122, 385 (1982).
Fig. 5. (A) Protein chemistry elucidated through the motion of the protein’s moving parts. Shown is the use of the ‘‘flickerbook’’ feature of the morph server associated with the motions
include hinge angle rotation during motion. Of the 200 motions submitted for analysis, the median motion has a maximum rotation of 9.5° over a range of 0° through 150° as computed by our algorithm, whereas the 12 motions culled from the scientific literature had an average rotation of 24° over a range of 5° through 148°. Similarly, the algorithms found a median maximum Cα displacement of 17 Å, ranging from 0 to 81 Å, for the submitted motions, whereas 11 motions reported in the scientific literature average 12 Å over a range of 1.5 through 60 Å. Although most of the structures are similar in sequence, the server has been able to accommodate sequence identity down to 8% for some motions.
Normal Mode Analysis
While the morph server can analyze motions when two or more solved conformations exist, in many cases a protein having a suspected motion will only have one conformation with a solved 3D structure. Given only one
database. Each boxed image of the protein represents a frame in the movie of the motion produced by the server. Frames are sequential in time, from the bottom to the top of the page, and then from left to right. This particular flickerbook is a movie of the apical domain motion of GroEL. Readers can photocopy this figure, cut along the edges of the boxes to produce 10 still frames, and then bind these 10 frames into a booklet (using, say, a stapler) to produce a flickerbook. The movie may then be visualized by rapidly flipping the pages of the flickerbook to create the illusion of motion. Flickerbooks represent a low-tech means of displaying protein movies when Internet access to the server is not available. The high-tech means of seeing this movie and other movies is to access the Web site for the motions database, http://www.molmovdb.org. (This particular movie has 71095-15408 as its movie ID; it is referenced under the movies for ‘‘GroEL.’’) (B) This is a flickerbook of NtrC, a nitrogen-sensing regulatory protein. The motion and NMR determination of both structures are described in Volkman et al.47 This particular movie is available for viewing from within the online text of the article at the Science magazine Web site (http://www.sciencemag.org). It is also available for viewing as movie ID 7kern from the morph server Web site (URL: http://www.molmovdb.org/molmovdb/cgi-bin/morph.cgi?ID=7kern). Normally, Web users can generate flickerbooks like this (as well as Internet-accessible protein movies) by supplying the morph server component of the motions database (http://www.molmovdb.org) with the PDB IDs or solved structures of the conformations. However, in this case, NMR provided additional experimental information about changes in protein secondary structure as well as a more precise identification of the mobile atoms.
Because this experimental information is not normally readily available, the online interface to the morph server did not, at the time, provide an easy means for users to input this surplus information to the morphing engine of the server. For this reason, this particular movie actually represents a custom morph in which the extra experimental data were manually fed into the server by its authors. Readers without Internet access to the database can view this movie using only scissors and a stapler by following the instructions in the caption to (A).
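The frames of a movie such as the flickerbooks of Fig. 5 come from interpolating between two superposed endpoint structures. A minimal sketch of the straight Cartesian variant follows, together with the maximum atomic displacement between the endpoints. The function names and the two-atom coordinates are hypothetical, and the energy minimization step that distinguishes adiabatic mapping is deliberately omitted, so intermediate frames from this sketch can have distorted geometry.

```python
import math

def cartesian_morph(start, end, n_frames):
    """Evenly spaced linear interpolation between two coordinate sets.

    start, end: lists of (x, y, z) tuples for equivalent, superposed atoms.
    Returns n_frames coordinate sets, including both endpoints.
    """
    if len(start) != len(end):
        raise ValueError("endpoint structures must contain the same atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)          # 0.0 at the start, 1.0 at the end
        frames.append([tuple(s + t * (e - s) for s, e in zip(p, q))
                       for p, q in zip(start, end)])
    return frames

def max_displacement(start, end):
    """Largest distance moved by any single atom between the endpoints."""
    return max(math.sqrt(sum((e - s) ** 2 for s, e in zip(p, q)))
               for p, q in zip(start, end))

# Made-up endpoints: one fixed atom and one atom that moves by 5 units.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
frames = cartesian_morph(start, end, 10)   # ten frames, as in a flickerbook
print(len(frames), max_displacement(start, end))   # 10 5.0
```

Applied to Cα coordinates, `max_displacement` corresponds to the maximum Cα displacement statistic; its value depends on how the endpoints were superposed, which is why the superposition step matters.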
solved conformation, normal mode analysis is one of the best ways to understand and perhaps predict its flexibility. Normal mode analysis (Figs. 6 and 7) has two advantages for large-scale database analysis over other techniques, such as molecular dynamics: (1) it requires little CPU power (especially when certain realistic approximations are made), and thus is amenable to database-screening techniques,104 and (2) it provides an intuitive conceptual model of protein motions in terms of frequencies and vibrations. Normal mode analysis has been widely used by spectroscopists for years,105 and advances in computer technology have made normal mode analysis of large molecules practical.106–115 The concept of normal mode analysis is to find a set of basis vectors (normal modes) describing the molecule’s concerted atomic motion and spanning the set of all 3N − 6 degrees of freedom. For large molecules such as proteins, the lowest frequency normal modes are thought to correspond to the large-scale real-world vibrations of the protein,116 and can be used to deduce significant biological properties. Moreover, there is evidence to suggest117–122 that proper, symmetric normal mode vibration of binding pockets is crucial to correct biological activity in some proteins. The classic Lagrangian for the vibrations of a protein with N atoms is given by
\[ L = T - V \tag{1} \]
104 W. Krebs, V. Alexandrov, C. Wilson, and M. Gerstein, Proteins 48, 682–695 (2002).
105 E. B. Wilson, J. C. Decius, and P. C. Cross, ‘‘Molecular Vibrations.’’ McGraw-Hill, New York, 1955.
106 J. Ma, P. B. Sigler, Z. Xu, and M. Karplus, J. Mol. Biol. 302, 303 (2000).
107 J. Ma and M. Karplus, Proc. Natl. Acad. Sci. USA 95, 8502 (1998).
108 S. Hayward, A. Kitao, and H. J. Berendsen, Proteins 27, 425 (1997).
109 D. van der Spoel, B. L. de Groot, S. Hayward, H. J. Berendsen, and H. J. Vogel, Protein Sci. 5, 2044 (1996).
110 B. S. Duncan and A. J. Olson, J. Mol. Graphics 13, 250 (1995).
111 M. Levitt, C. Sander, and P. S. Stern, J. Mol. Biol. 181, 423 (1985).
112 B. Brooks and M. Karplus, Proc. Natl. Acad. Sci. USA 82, 4995 (1985).
113 B. Brooks and M. Karplus, Proc. Natl. Acad. Sci. USA 80, 6571 (1983).
114 R. M. Levy, Ann. N.Y. Acad. Sci. 482, 24 (1986).
115 R. M. Levy, O. Rojas, and R. A. Friesner, J. Phys. Chem. 88, 4233 (1984).
116 R. Levy, D. Perahia, and M. Karplus, Proc. Natl. Acad. Sci. USA 79, 1346 (1982).
117 D. W. Miller and D. A. Agard, J. Mol. Biol. 286, 267 (1999).
118 O. Marques and Y. H. Sanejouand, Proteins 23, 557 (1995).
119 A. Thomas, K. Hinsen, M. J. Field, and D. Perahia, Proteins 34, 96 (1999).
120 A. Thomas, M. J. Field, and D. Perahia, J. Mol. Biol. 261, 490 (1996).
121 A. Thomas, M. J. Field, L. Mouawad, and D. Perahia, J. Mol. Biol. 257, 1070 (1996).
122 K. Hinsen, Proteins 33, 417 (1998).
Fig. 6. An illustration of figures generated by a new set of Web tools associated with normal mode analysis that the user may request on any protein for which a PDB structure file is available. (B) performs a normal mode flexibility analysis on the structure. Regions that are more flexible are colored in red, while less flexible regions are colored in blue. (A) gives similar information, using experimental B factors supplied in the PDB file, if available. (C) shows the parts of the protein that actually move, as calculated from comparison of the starting and ending PDB structures for the motion. Areas that move are colored in red, while areas that remain stationary are colored in blue. The user may compare these three panels to deduce structural information. For example, hinge locations involved in the motion may be deduced, as these are highly flexible regions [as identified by (A) and (B)] located near the moving domains [shown in red in (C)].
where V is the potential energy describing interactions among atoms, and T is the kinetic energy
\[ T = \frac{1}{2} \sum_{i=1}^{N} m_i \left( \dot{x}_i^2 + \dot{y}_i^2 + \dot{z}_i^2 \right) \tag{2} \]
where the dot notation has been used for derivatives with respect to time. The above expression for T can be rewritten by introducing mass-weighted Cartesian displacement coordinates. Let
\[ q_1 = \sqrt{m_1}\,(x_1 - x_{1e}), \; \ldots, \; q_{3N} = \sqrt{m_N}\,(z_N - z_{Ne}) \tag{3} \]
Fig. 7. Normal modes. (A) A brief summary of basic normal mode concepts. (It was inspired by P. Steinbach’s Web illustration, http://cmm.info.nih.gov/intro simulation/node26.html.) (B) A drawing of one of the low-frequency normal modes of bacteriorhodopsin. This particular normal mode is approximately perpendicular to the cell membrane. (C) Illustration of the concept of the normalized dot-product, which is sometimes useful in statistical calculations on normal mode vectors and their relationship to experimental displacement vectors. Adapted from P. Steinbach’s Web illustration, part of Chapter 3A of the Biophysical Society’s ‘‘Biophysics Textbook Online’’ at http://www.biophysics.org/btol. Used with permission.
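The normalized dot-product mentioned in the caption to Fig. 7C is the cosine of the angle between two 3N-component vectors, for instance a normal mode vector and an experimentally observed displacement vector. A minimal sketch follows; the function name and the two-atom vectors are hypothetical and chosen only to show the two limiting cases.

```python
import math

def normalized_dot(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero-length vector has no direction")
    return dot / (norm_u * norm_v)

# Made-up two-atom example: 3N = 6 components per vector.
mode = [1.0, 0.0, 0.0, -1.0, 0.0, 0.0]       # atoms move apart along x
observed = [0.4, 0.0, 0.0, -0.4, 0.0, 0.0]   # same direction, smaller size
sideways = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]    # perpendicular to the mode
print(normalized_dot(mode, observed), normalized_dot(mode, sideways))
```

A value near 1 indicates that the observed displacement lies along the mode, and a value near 0 that it is orthogonal to it. Because the overall sign of a normal mode vector is arbitrary, the absolute value is often the quantity of interest.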
in which the $q_i$ coordinate is proportional to the displacement from the equilibrium value $q_{ie}$. Expanding the potential energy in a Taylor series about the equilibrium geometry and neglecting all terms with powers greater than two (harmonic approximation), the potential energy assumes the form
\[ V = \frac{1}{2} \sum_{i,j=1}^{3N} f_{ij}\, q_i q_j \tag{4} \]
where $f_{ij}$, the force constants, are defined as the second derivatives of the potential energy function:
\[ f_{ij} = \frac{\partial^2 V}{\partial q_i\, \partial q_j} \tag{5} \]
Substituting $T = \frac{1}{2} \sum_{i=1}^{3N} \dot{q}_i^2$ and $V = \frac{1}{2} \sum_{i,j=1}^{3N} f_{ij}\, q_i q_j$ into Lagrange’s equation of motion
\[ \frac{d}{dt} \left( \frac{\partial L}{\partial \dot{q}_i} \right)_{q_i} - \left( \frac{\partial L}{\partial q_i} \right)_{\dot{q}_i} = 0 \tag{6} \]
we obtain
\[ \ddot{q}_i + \sum_{j=1}^{3N} f_{ij}\, q_j = 0, \qquad i = 1, \ldots, 3N \tag{7a} \]
or in matrix form
\[ \ddot{\mathbf{q}} + \mathbf{F}\mathbf{q} = 0 \tag{7b} \]
This is a set of 3N coupled second-order differential equations with constant coefficients. It can be solved by assuming a solution of the form
\[ q_i = A_i \cos(\omega t + \varphi) \tag{8} \]
This substitution converts the set of differential equations into a set of 3N homogeneous linear equations:
\[ \sum_{j=1}^{3N} f_{ij} A_j - \omega^2 A_i = 0 \tag{9a} \]
~¼0 ~ !2 A FA
(9b)
or
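The eigenvalue problem of Eq. (9b) can be illustrated numerically. The following is a minimal sketch, not a protein calculation: it assumes a toy one-dimensional chain of three unit masses joined by two unit springs, and the function name `normal_modes` is hypothetical. It mass-weights a Cartesian Hessian and diagonalizes it with NumPy.

```python
import numpy as np

def normal_modes(hessian, masses):
    """Diagonalize a mass-weighted Hessian.

    hessian : (n, n) second derivatives of V in Cartesian displacements
    masses  : (n,) mass associated with each coordinate
    Returns (omega_squared, modes); modes[:, k] is the k-th eigenvector.
    """
    inv_sqrt_m = 1.0 / np.sqrt(masses)
    # Mass weighting: F_ij = H_ij / sqrt(m_i * m_j), consistent with
    # q_i = sqrt(m_i) * (displacement of coordinate i).
    F = hessian * np.outer(inv_sqrt_m, inv_sqrt_m)
    # F is symmetric, so eigh returns real eigenvalues in ascending order
    # and orthonormal eigenvectors as columns.
    omega_sq, modes = np.linalg.eigh(F)
    return omega_sq, modes

# Toy system (an assumption for illustration): three unit masses on a line,
# joined by two springs of unit force constant, free at both ends.
k = 1.0
hessian = k * np.array([[ 1.0, -1.0,  0.0],
                        [-1.0,  2.0, -1.0],
                        [ 0.0, -1.0,  1.0]])
masses = np.ones(3)

omega_sq, modes = normal_modes(hessian, masses)
# Clip tiny negative round-off before taking the square root.
frequencies = np.sqrt(np.clip(omega_sq, 0.0, None))
```

In this toy chain the lowest eigenvalue is zero and its eigenvector moves all three masses together. This rigid translation is the one-dimensional counterpart of the zero eigenvalues associated with overall translation and rotation of a molecule.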
This problem may now be solved with any eigenvector/eigenvalue solution method. One of the simplest is to attempt to diagonalize the matrix F and extract the eigenvalues from the diagonal. It turns out that six eigenvalues of F are zero for a nonlinear molecule. This result can be expected from the fact that there are three degrees of freedom associated with the translation of the center of mass, and three with rotational motion of the molecule as a whole. Since there is no restoring force acting on these degrees of freedom, their eigenvalues are zero. Associated with each eigenvalue is a coordinate, called the normal mode coordinate $Q_i$. The normal modes represent a set of coordinates related to the old one by an orthogonal linear transformation U:

$$\mathbf{Q} = \mathbf{U}\mathbf{q} \tag{10}$$

such that the transformation matrix U diagonalizes F:

$$\mathbf{U}\mathbf{F}\mathbf{U}^{T} = \boldsymbol{\Lambda} \quad \text{(diagonal)} \tag{11}$$

This transformation has a deep impact on the resulting form of the differential equations: Eq. (7b) transforms to

$$\ddot{\mathbf{Q}} + \boldsymbol{\Lambda}\mathbf{Q} = 0 \tag{12}$$

but since $\boldsymbol{\Lambda}$ is diagonal, with diagonal elements $\lambda_i$, Eq. (12) is effectively decoupled:

$$\ddot{Q}_1 + \lambda_1 Q_1 = 0,\; \ldots,\; \ddot{Q}_{3N} + \lambda_{3N} Q_{3N} = 0 \tag{13}$$

and the system therefore behaves like a set of 3N independent harmonic oscillators, each oscillating without interaction with the others. It is of considerable importance to examine the nature of the above solutions. It is evident from Eq. (8) that each atom is oscillating about its equilibrium position with the same frequency and phase for a given solution $\omega_k$. In other words, each atom reaches its position of maximum displacement at the same time, and each atom passes through its equilibrium position at the same time. A mode of vibration having all these characteristics is called a normal mode of vibration, and its frequency is known as a normal mode frequency.

Tools for Quantification of Packing: Voronoi Polyhedra
Packing clearly is an essential component of a motion's classification. Crystallographers analyzing a particular protein structure often discuss this concept loosely, for instance, “Asp-23 is packed against Gly-38” or “the interface between domains appears to be tightly packed.” One can systematize and quantify the discussion of packing in the context of the motions database through the use of particular geometric constructions called Voronoi polyhedra. Nearly a century ago, Voronoi developed the method to construct polyhedra as a novel application of quadratic equations.123 Points equidistant from two atoms lie on a plane, those equidistant from three atoms lie on a line, and those equidistant from four centers form a vertex. Bernal and Finney used these polyhedra to study the structure of liquids in the 1960s.124 However, despite the general utility of these polyhedra, their application to proteins was limited by a serious methodological difficulty. The Voronoi construction partitions space among a collection of “equal” points, but protein atoms are not all equal: some are clearly larger than others (e.g., sulfur versus oxygen). Richards found a solution to this problem and first applied the Voronoi method to proteins in 1974.125 He subsequently reviewed its use in this application.38,40 Richards' solution was to allocate space in proportion to the radius of each atom. The resulting Voronoi-like polyhedra were no longer an equal partition of space, but were weighted by the size of each atom. As an additional complication, atoms usually include their bonded hydrogens, since these are not usually resolved in crystal structures. This united atom model has posed a problem in finding the correct radii to use in Voronoi as well as other applications.126,127 In a detailed analysis of organic crystals and protein structures, we have developed a standard set of protein atom radii for united atom models,41,127,128 which are shown in Table I. Voronoi polyhedra are a useful way of partitioning space among a collection of atoms.
The simplest method for calculating volumes in a Voronoi-like manner is to put all atoms in the system on a grid. Then go to each grid point (i.e., voxel) and add its volume to the atom center closest to it. This is prohibitively slow for a real protein structure, but it can be made somewhat faster by randomly sampling grid points.129 More classic approaches to calculating Voronoi volumes have two parts: (1) for each atom find the vertices of the polyhedron around it and (2) systematically
123 G. F. Voronoi, J. Reine Angew. Math. 134, 198 (1908).
124 J. D. Bernal and J. L. Finney, Disc. Faraday Soc. 43, 62 (1967).
125 F. M. Richards, J. Mol. Biol. 82, 1 (1974).
126 A. J. Li and R. Nussinov, Proteins 32, 111 (1998).
127 J. Tsai, R. Taylor, C. Chothia, and M. Gerstein, J. Mol. Biol. 290, 253 (1999).
128 J. Tsai and M. Gerstein, Bioinformatics 18, 985 (2002).
129 M. Gerstein, J. Tsai, and M. Levitt, J. Mol. Biol. 249, 955 (1995).
TABLE I. Summary of ProtOr Type Set^a

| Atom type | Number (173) | Volume (Å³) | Radii (Å) | Comments | Protein atoms |
|---|---|---|---|---|---|
| C3H0s | 20 | 8.72 | 1.61 | Carbonyl carbons with branching (main-chain carbonyls from residues with a Cβ, so no Gly carbon) | ALA_C, ARG_C, ASN_C, ASP_C, CSS_C, CYS_C, GLN_C, GLU_C, HIS_C, ILE_C, LEU_C, LYS_C, MET_C, PHE_C, PRO_C, SER_C, THR_C, TRP_C, TYR_C, VAL_C |
| C3H0b | 13 | 9.70 | 1.61 | Carboxyl and carbonyl carbons w/o branching (side chain and glycines) and aromatic carbons w/o hydrogen | ARG_CZ, ASN_CG, ASP_CG, GLN_CD, GLU_CD, GLY_C, HIS_CG, PHE_CG, TRP_CD2, TRP_CE2, TRP_CG, TYR_CG, TYR_CZ |
| C4H1s | 18 | 13.17 | 1.88 | Aliphatic carbons with one hydrogen and branching from all three heavy atom bonds | ARG_CA, ASN_CA, ASP_CA, CSS_CA, CYS_CA, GLN_CA, GLU_CA, HIS_CA, ILE_CA, LEU_CA, LYS_CA, MET_CA, PHE_CA, SER_CA, THR_CA, TRP_CA, TYR_CA, VAL_CA |
| C4H1b | 6 | 14.35 | 1.88 | Aliphatic carbons with one hydrogen and no branching through at least one heavy atom bond | ALA_CA, ILE_CB, LEU_CG, PRO_CA, THR_CB, VAL_CB |
| C3H1s | 8 | 20.44 | 1.76 | Small aromatic carbons with one hydrogen | HIS_CD2, HIS_CE1, PHE_CD1, TRP_CD1, TYR_CD1, TYR_CD2, TYR_CE1, TYR_CE2 |
| C3H1b | 8 | 21.28 | 1.76 | Big aromatic carbons with one hydrogen | PHE_CD2, PHE_CE1, PHE_CE2, PHE_CZ, TRP_CE3, TRP_CH2, TRP_CZ2, TRP_CZ3 |
| C4H2s | 21 | 23.19 | 1.88 | Aliphatic carbons with two hydrogens, small | ARG_CB, ARG_CD, ARG_CG, ASN_CB, ASP_CB, GLN_CB, GLN_CG, GLU_CB, GLU_CG, GLY_CA, HIS_CB, LEU_CB, LYS_CB, LYS_CD, LYS_CG, MET_CB, PHE_CB, PRO_CD, SER_CB, TRP_CB, TYR_CB |
| C4H2b | 7 | 24.26 | 1.88 | Aliphatic carbons with two hydrogens, big | CSS_CB, CYS_CB, ILE_CG1, LYS_CE, MET_CG, PRO_CB, PRO_CG |
| C4H3u | 9 | 36.73 | 1.88 | Aliphatic carbons with three hydrogens, i.e., methyl groups | ALA_CB, ILE_CD1, ILE_CG2, LEU_CD1, LEU_CD2, MET_CE, THR_CG2, VAL_CG1, VAL_CG2 |
| N3H0u | 1 | 8.65 | 1.64 | Imide nitrogens (only member is ProN) | PRO_N |
| N3H1s | 20 | 13.62 | 1.64 | Amide nitrogens with one hydrogen (all other main-chain N's) | ALA_N, ARG_N, ASN_N, ASP_N, CSS_N, CYS_N, GLN_N, GLU_N, GLY_N, HIS_N, ILE_N, LEU_N, LYS_N, MET_N, PHE_N, SER_N, THR_N, TRP_N, TYR_N, VAL_N |
| N3H1b | 4 | 15.72 | 1.64 | Amide nitrogens with one hydrogen (on side chains) | ARG_NE, HIS_ND1, HIS_NE2, TRP_NE1 |
| N3H2u | 4 | 22.69 | 1.64 | All amide nitrogens with 2 hydrogens (only on side chains) | ARG_NH1, ARG_NH2, ASN_ND2, GLN_NE2 |
| N4H3u | 1 | 21.41 | 1.64 | Amide nitrogen charged, with 3 hydrogens | LYS_NZ |
| O1H0u | 27 | 15.91 | 1.42 | All oxygens in carboxyl or carbonyl groups (no distinction made between oxygens in carboxyl group) | ALA_O, ARG_O, ASN_O, ASN_OD1, ASP_O, ASP_OD1, ASP_OD2, CSS_O, CYS_O, GLN_O, GLN_OE1, GLU_O, GLU_OE1, GLU_OE2, GLY_O, HIS_O, ILE_O, LEU_O, LYS_O, MET_O, PHE_O, PRO_O, SER_O, THR_O, TRP_O, TYR_O, VAL_O |
| O2H1u | 3 | 17.98 | 1.46 | All hydroxyl atoms | SER_OG, THR_OG1, TYR_OH |
| S2H0u | 2 | 29.17 | 1.77 | Sulfurs with no hydrogens | CSS_SG, MET_SD |
| S2H1u | 1 | 36.75 | 1.77 | Sulfurs with one hydrogen | CYS_SG |

^a Presented are ProtOr parameters for the various atom types used to model and compute the volumes of the amino acids.
Fig. 8. Voronoi polyhedra. Left: A representative Voronoi polyhedron from 1CSE (subtilisin): a united polyhedron representing the volume of the six atoms in a Phe ring. Right: The Voronoi polyhedra construction. A schematic showing the construction of a Voronoi polyhedron in two dimensions. Points represent atoms. To construct a polyhedron, atoms within a distance cutoff from a central atom (shown by the outer circle) are considered first. Lines (or planes in three dimensions) bisect the distance from each of these selected atoms to the central one. Vertices are connected to form the gray polyhedron that is the volume of the central atom. Circled points connected by boldface lines are true neighbors, while the noncircled, thin-lined points are not. The two radial circles also show the under- and overestimation of contacts by radial cutoffs. The inner circle underestimates the number of true neighbors by two and includes a false contact. The outer circle overestimates the neighbors by three. Also, the asymmetry parameter is defined as the ratio of the distances between the central atom and the farthest and nearest vertex. Adapted from Gerstein and Krebs.6
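The grid-based volume estimate described in the text can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published implementation. It treats all atoms as equal (plain bisection, not method B or the radical-plane method). It visits every voxel instead of sampling randomly. It confines the calculation to a hypothetical bounding box so that every polyhedron is closed. The function name `grid_voronoi_volumes` and the two-atom example are invented for illustration.

```python
import itertools

def grid_voronoi_volumes(centers, lower, upper, spacing):
    """Approximate Voronoi volumes by giving each voxel to its nearest center.

    centers      : list of (x, y, z) atom positions
    lower, upper : opposite corners of the bounding box (an assumption;
                   surface polyhedra would otherwise be open)
    spacing      : edge length of one cubic voxel
    """
    steps = [int(round((hi - lo) / spacing)) for lo, hi in zip(lower, upper)]
    counts = [0] * len(centers)
    for index in itertools.product(*(range(n) for n in steps)):
        # Use the midpoint of the voxel as its representative point.
        point = [lo + (i + 0.5) * spacing for lo, i in zip(lower, index)]
        nearest = min(
            range(len(centers)),
            key=lambda a: sum((p - c) ** 2 for p, c in zip(point, centers[a])),
        )
        counts[nearest] += 1
    voxel_volume = spacing ** 3
    return [n * voxel_volume for n in counts]

# Two equal "atoms" placed symmetrically in a 4 x 4 x 4 box.
centers = [(1.0, 2.0, 2.0), (3.0, 2.0, 2.0)]
volumes = grid_voronoi_volumes(
    centers, lower=(0.0, 0.0, 0.0), upper=(4.0, 4.0, 4.0), spacing=0.5
)
```

The cost grows with the number of voxels times the number of atoms, which is why the text calls the approach slow for a real protein. A finer grid improves the volume estimate at proportionally higher cost.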
collect these vertices to draw the polyhedron and calculate its volume. In the classic Voronoi construction (Fig. 8), each atom is surrounded by a unique limiting polyhedron such that all points within an atom's polyhedron are closer to this atom than to any other atom. Because, as noted above, points equidistant from four centers form a vertex, one can easily find all the vertices associated with an atom. With the coordinates of four atoms, it is straightforward to solve for possible vertex coordinates using the equation of a sphere. One then checks whether this putative vertex is closer to these four atoms than to any other atom; if so, it is a vertex. In the procedure just outlined, all the atoms are considered equal, and the dividing planes are positioned midway between atoms (Fig. 8). As mentioned above, this method of partition, called bisection, is not physically reasonable for proteins, which have atoms of obviously different size (such as oxygen and sulfur). It chemically misallocates volume, giving too much to the smaller atom. Currently, two principal methods of repositioning the dividing plane have been proposed to make the partition more physically reasonable: method B125 and the radical-plane method.130 Both methods depend on the radii of the atoms in contact (R1 and R2) and the distance between the atoms (D).

As Richards originally showed131 and many have shown more recently,132–137 the Voronoi procedure is particularly well suited for analyzing the packing of the protein interior. The Voronoi procedure fails at the surface of a protein, since surface atoms lack neighbors on one side and only incomplete polyhedra can be built. In contrast, atoms in the protein interior all have neighbors, and the construction of Voronoi polyhedra is able to allocate all space among this collection of atoms. There are no gaps as there would be if one, say, simply drew spheres around the atoms. Thus, the volume of cavities or defects between atoms is included in their Voronoi volume, and one finds that the packing efficiency is inversely proportional to the size of the polyhedra. This indirect measurement of cavities contrasts with other types of calculations that measure the volume of cavities explicitly. Moreover, since protein interiors are tightly packed, fitting together like a jigsaw puzzle, the various types of protein atoms occupy well-defined amounts of space. This fact has made the calculation of standard volumes for atoms and residues in proteins a worthwhile proposition using Voronoi constructs.127,128,136,138 It was shown that the quality of packing can be analyzed by comparing standard residue volumes derived from the previously mentioned radii set, using a Voronoi procedure (Table II), with those calculated along the interface of a domain motion, as has been done in a similar analysis of protein crystals.136 Such an analysis provides another property to use in the characterization of protein motions.

Additional Techniques for Studying Protein Motions
We have described computational techniques most suitable for a high-throughput analysis of thousands of protein motions within a database. Over the years, many other techniques have been developed for analyses
130 B. J. Gellatly and J. L. Finney, J. Mol. Biol. 161, 305 (1982).
131 F. M. Richards, Carlsberg Res. Commun. 44, 47 (1979).
132 C. Duyckaerts and G. Godefroy, J. Chem. Neuroanat. 20, 83 (2000).
133 P. J. Fleming and F. M. Richards, J. Mol. Biol. 299, 487 (2000).
134 E. Kussell, J. Shimada, and E. I. Shakhnovich, J. Mol. Biol. 311, 183 (2001).
135 V. A. Likic and F. G. Prendergast, Protein Sci. 8, 1649 (1999).
136 J. Pontius, J. Richelle, and S. J. Wodak, J. Mol. Biol. 264, 121 (1996).
137 M. L. Quillin and B. W. Matthews, Acta Crystallogr. D Biol. Crystallogr. 56, 791 (2000).
138 J. Tsai and M. Gerstein, Bioinformatics 18, 985 (2002).
TABLE II. ProtOr Residue Volumes^a

| Amino acid | ProtOr volume (Å³) |
|---|---|
| GLY | 63.8 |
| ALA | 89.3 |
| VAL | 138.2 |
| LEU | 163.1 |
| ILE | 163.0 |
| PRO | 121.3 |
| MET | 165.8 |
| PHE | 190.8 |
| TYR | 194.6 |
| TRP | 226.4 |
| SER | 93.5 |
| THR | 119.6 |
| ASN | 122.4 |
| GLN | 146.9 |
| CYS | 112.8 |
| CSS | 102.5 |
| HIS | 157.5 |
| GLU | 138.8 |
| ASP | 114.4 |
| ARG | 190.3 |
| LYS | 165.1 |

^a Given are the volumes of the various amino acids as computed by ProtOr, using the parameter set given in Table I. Note that reduced cysteine (CYS) was considered distinct from disulfide-bonded cysteine (CSS). Adapted from Tsai and Gerstein.138
of individual protein motions, mostly derived from classic molecular mechanics approaches.2,4,5,139–142 Many of these require much greater computational resources than the methods described here, and so are better suited to detailed study of an individual molecule rather than surveys of a whole database. Table III presents a summary of many of the computational techniques. Molecular dynamics (MD), energy minimization (EM), and normal mode analysis (NMA) are arguably the most widely used of the techniques presented.143 There are many variants on these techniques that do not differ substantially in terms of computational cost. Adiabatic mapping, used in the morph server, is essentially a form of energy minimization. It should be emphasized that there are significant differences in the tractability of the various techniques. For instance, for constructing a protein interpolation in the morph server, molecular dynamics requires six orders of magnitude more processing power than simple energy minimization.
139 P. D. Thomas and K. A. Dill, J. Mol. Biol. 257, 457 (1996).
140 S. J. Wodak, M. De Crombrugghe, and J. Janin, Prog. Biophys. Mol. Biol. 49, 29 (1987).
141 M. Karplus and J. A. McCammon, Sci. Am. 254, 42 (1986).
142 R. A. Friesner and B. D. Dunietz, Acc. Chem. Res. 34, 351 (2001).
143 T. Schlick, Structure 9, R45 (2001).
TABLE III. Computational Techniques for Studying Protein Motions^a

| Technique | Pros | Cons | CPU complexity |
|---|---|---|---|
| Molecular dynamics (MD) | Continuous actions | Expensive; short time span | 10 ps = weeks for 50,000 atoms |
| Targeted MD (TMD) | Connection between two states; useful for ruling out steric clashes | Not necessarily physical | Same as MD for each step |
| Continuum solvation | Mean-force potential approximates environment and reduces model's cost; useful information on ionic atmosphere and intermolecular associations | Approximate | Technique dependent; can be as expensive as MD, but number of variables is reduced |
| Brownian dynamics (BD) | Large-scale and long-time motion | Approximate hydrodynamics; limited to systems with small relative inertia | Days for long DNA (1000s of base pairs) |
| Monte Carlo (MC) | Large-scale sampling; useful statistics | Move definitions are difficult | Hours for a million configurations |
| Minimization | Valuable equilibrium information; experimental constraints can be incorporated | No dynamic information | Minutes to hours for biomolecules |
| Stochastic path approach | Filtering of high-frequency motion; approximate long-time trajectories | Expensive (global optimization of entire trajectory) | 1-ps approximate trajectory (1000 simulated annealing steps) = 1 day on 100 processors for 25,000 atoms |
| Normal mode analysis | Fast with interesting statistics, but potentially unrealistic; may have large memory requirements | No dynamic information | Seconds to minutes to hours, depending on problem and implementation |

^a Based in part on a figure in Schlick.143
Relating Protein Motions to Genome Sequences
Genome sequencing has vastly expanded the amount of information available for bioinformatic analyses. However, most of the information in genomes is raw and uncharacterized from the point of view of protein structure and function.144–146 One of the current challenges is to take the information about the few relatively well-characterized proteins, such as those in the macromolecular motions database, and to extrapolate this to uncharacterized genome sequences. In general this process is dubbed annotation transfer.45,147,148

Propensities for Linkers in General

In relation to macromolecular motions, it may be possible to predict protein domains in protein sequences of unknown structure, using information about the amino acid composition of linker sequences. For example, a profile of flexible linker regions might be used to predict the location of domain hinges for the structural annotation of genome sequences. A tool to achieve this successfully would be quite useful in the context of gene finding in genomic sequences.150 A first step in accomplishing this is to determine the amino acid propensities of interdomain linkers. Two propensity scales, for amino acids to be in linkers in general, or in flexible hinges in particular, were calculated using structural data from the database of macromolecular motions.149 Flexible as well as inflexible linkers are included in the first method of analysis. In this method we have arbitrarily defined a linker sequence as a 16-residue region centered on the peptide bond linking two domains. The location of protein domains and other structural information can be found in SCOP,26,151 which contains several databases of amino acid sequences of protein domains. The PDB40 database provided by SCOP was used to create a database of linker sequences. The PDB40 database consists of a
144 M. Gerstein, J. Mol. Biol. 274, 562 (1997).
145 M. Gerstein, Proteins 33, 518 (1998).
146 S. Teichmann, C. Chothia, and M. Gerstein, Curr. Opin. Struct. Biol. 9, 390 (1999).
147 H. Hegyi and M. Gerstein, J. Mol. Biol. 288, 147 (1999).
148 H. Hegyi and M. Gerstein, Genome Res. 11, 1632 (2001).
149 M. B. Gerstein, R. Jansen, T. Johnson, B. Park, and W. Krebs, in “Rigidity: Theory and Applications” (M. F. Thorpe and P. M. Duxbury, eds.), p. 401. Kluwer Academic/Plenum Press, New York, 1999.
150 P. M. Harrison, N. Echols, and M. B. Gerstein, Nucleic Acids Res. 29, 818 (2001).
151 T. J. P. Hubbard, A. G. Murzin, S. E. Brenner, and C. Chothia, Nucleic Acids Res. 25, 236 (1997).
subset of proteins in the Protein Data Bank (PDB) with known structures selected so that, when aligned, no two proteins in the subset show a sequence identity of 40% or greater. Thus, the data set is not biased toward protein structures listed multiple times in the PDB. From the 1500 protein sequences in the PDB40 database it was possible to extract 234 linker sequences, reflecting that only a small fraction of proteins contain multiple domains and therefore linker regions. Table IV shows a profile of the amino acid composition at each of the 16 positions in the linker sequences. The residue-specific amino acid composition can be summarized in the average amino acid composition of the whole linker sequences (Fig. 9). There seems to be some selection of amino acid types into linkers; can we find a statistic that will show whether there is a clear preference? To find this statistic, we can regard the linker sequences as a random sample of the sequences in the PDB40 database. In particular, the probability $P_n(k)$ that a particular amino acid occurs k times among the n amino acids in a sequence sample is given by the familiar binomial distribution:

$$P_n(k) = \binom{n}{k}\, p^k (1 - p)^{n-k} \tag{13}$$

where p is the probability that the amino acid occurs in the PDB40 database (n = 234 for the distribution at each of the 16 specific linker positions and n = 234 × 16 for the distribution of the linker average). Accordingly, the cumulative distribution function $\mathrm{CDF}_n(k)$, representing the probability that the amino acid occurs at most k times, is then given by

$$\mathrm{CDF}_n(k) = \sum_{i=0}^{k} P_n(i) \tag{14}$$
Consequently, if o and e are the observed and expected counts, then a two-sided P value is given by $1 - \mathrm{CDF}_n(e + |o - e|) + \mathrm{CDF}_n(e - |o - e|)$. This is the probability that the number of amino acid counts in a random subset of the PDB40 would be either smaller or greater than the expected value by a difference $|o - e|$. The two-sided P values are shown in Fig. 10 for the average linker compositions across all 16 positions. The results imply, with better than 98% confidence, that linker regions are proline rich and alanine and tryptophan poor. In particular, the statistical evidence that linkers are proline rich is unusually strong and is significant at better than the hundredth-of-a-percent level. No particular trends could be seen after roughly grouping the amino acids according to the attributes hydrophobic, charged, and polar (Table V and Fig. 10) following the classification of
TABLE IV. Profile of Amino Acid Composition in Linker Sequences for Every Single Linker Position in Detail, Compared with PDB40 Averages^a

| Position | A | V | F | P | M | I | L | D | E | K | R | S | T | Y | H | C | N | Q | W | G | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8.6 | 6.0 | 4.7 | 3.9 | 4.7 | 5.6 | 11.6 | 4.7 | 5.2 | 5.2 | 5.2 | 7.8 | 4.7 | 2.2 | 1.7 | 1.7 | 4.7 | 3.9 | 1.3 | 6.0 | 0.4 |
| 2 | 7.8 | 8.2 | 3.9 | 6.5 | 1.3 | 3.5 | 9.1 | 6.5 | 5.2 | 6.5 | 3.9 | 6.0 | 5.6 | 3.9 | 3.5 | 2.6 | 3.9 | 5.2 | 0.9 | 6.0 | 0.4 |
| 3 | 4.7 | 8.2 | 6.5 | 6.0 | 1.3 | 7.3 | 11.2 | 6.0 | 3.9 | 3.9 | 4.7 | 5.2 | 3.0 | 6.5 | 3.0 | 0.9 | 3.5 | 3.5 | 0.9 | 9.9 | 0.0 |
| 4 | 5.6 | 6.0 | 3.5 | 6.0 | 2.6 | 6.5 | 6.0 | 3.9 | 6.5 | 5.6 | 9.1 | 6.9 | 5.6 | 3.0 | 3.5 | 1.3 | 6.5 | 5.2 | 2.6 | 4.3 | 0.0 |
| 5 | 6.0 | 8.2 | 2.6 | 5.2 | 2.6 | 3.9 | 16.4 | 6.0 | 4.7 | 5.2 | 6.5 | 6.5 | 6.5 | 3.5 | 3.5 | 1.7 | 3.0 | 2.6 | 0.4 | 5.2 | 0.0 |
| 6 | 8.6 | 5.6 | 2.6 | 9.1 | 0.0 | 6.0 | 7.3 | 4.7 | 4.7 | 6.9 | 5.2 | 8.2 | 9.5 | 2.2 | 2.6 | 2.6 | 4.3 | 0.9 | 0.9 | 8.2 | 0.0 |
| 7 | 9.5 | 9.1 | 6.0 | 6.9 | 1.7 | 3.9 | 4.3 | 5.6 | 7.8 | 4.7 | 5.2 | 6.9 | 6.9 | 2.6 | 3.5 | 0.4 | 2.6 | 3.0 | 0.4 | 9.1 | 0.0 |
| 8 | 5.6 | 6.0 | 2.6 | 10.8 | 1.7 | 3.5 | 6.5 | 8.6 | 4.7 | 4.7 | 5.6 | 6.5 | 6.0 | 3.5 | 2.2 | 2.2 | 3.0 | 2.2 | 0.9 | 13.4 | 0.0 |
| 9 | 4.7 | 8.2 | 4.7 | 9.1 | 4.3 | 5.2 | 8.2 | 4.3 | 6.5 | 6.0 | 5.6 | 3.5 | 6.5 | 2.2 | 2.2 | 0.9 | 5.6 | 3.5 | 0.4 | 8.2 | 0.4 |
| 10 | 6.5 | 4.7 | 3.0 | 10.3 | 3.0 | 6.9 | 3.5 | 3.9 | 4.3 | 7.8 | 4.7 | 6.0 | 11.2 | 3.9 | 0.9 | 1.3 | 5.2 | 4.7 | 1.3 | 6.9 | 0.0 |
| 11 | 5.6 | 6.0 | 4.3 | 9.9 | 1.3 | 4.7 | 7.3 | 3.5 | 6.5 | 3.9 | 6.0 | 9.5 | 7.3 | 2.6 | 1.7 | 4.7 | 3.5 | 3.5 | 0.0 | 8.2 | 0.0 |
| 12 | 7.3 | 4.7 | 6.0 | 6.0 | 1.3 | 2.6 | 5.2 | 7.3 | 9.1 | 6.5 | 5.2 | 7.8 | 6.5 | 2.2 | 2.2 | 1.7 | 6.5 | 2.2 | 1.3 | 8.6 | 0.0 |
| 13 | 6.9 | 7.3 | 5.2 | 8.6 | 2.2 | 4.7 | 7.3 | 6.9 | 7.3 | 5.2 | 5.2 | 4.3 | 6.0 | 3.0 | 1.7 | 1.7 | 3.9 | 6.5 | 0.4 | 5.6 | 0.0 |
| 14 | 9.1 | 9.1 | 4.3 | 2.6 | 1.7 | 8.6 | 6.5 | 7.3 | 5.2 | 5.2 | 4.7 | 3.9 | 4.7 | 3.5 | 2.6 | 3.9 | 6.0 | 4.3 | 0.9 | 6.0 | 0.0 |
| 15 | 9.5 | 5.2 | 4.3 | 4.7 | 3.0 | 5.6 | 10.3 | 4.3 | 8.6 | 3.0 | 3.0 | 8.6 | 8.2 | 3.5 | 1.3 | 0.4 | 3.0 | 4.3 | 2.2 | 6.9 | 0.0 |
| 16 | 9.9 | 8.6 | 5.6 | 3.5 | 3.0 | 6.0 | 7.8 | 5.6 | 5.6 | 7.8 | 4.3 | 4.7 | 3.5 | 4.3 | 2.2 | 0.9 | 5.6 | 4.7 | 0.9 | 5.6 | 0.0 |
| PDB40 average | 8.4 | 7.0 | 4.0 | 4.7 | 2.2 | 5.6 | 8.5 | 6.0 | 6.3 | 5.9 | 4.8 | 6.0 | 5.8 | 3.7 | 2.2 | 1.7 | 4.6 | 3.8 | 1.5 | 7.8 | 0.2 |

^a Profile values are given as percentages; columns are amino acids and rows are linker positions. A linker has been arbitrarily defined as the 16-residue region centered around the peptide bond (between positions 8 and 9) linking two domains. In the original table, positions where the amino acid frequency is less than the PDB40 average are in bold face. Adapted from Gerstein and Krebs.6
[Figure 9: bar chart. Horizontal axis: amino acid (A V F P M I L D E K R S T Y H C N Q W G X). Vertical axis: average frequency (%), 0 to 9. Series: average amino acid frequencies in PDB40; average amino acid frequencies in linkers.]
Fig. 9. Comparison of the average amino acid composition in linker sequences and proteins in general (as represented by the PDB40 database). Adapted from Gerstein and Krebs.6
[Figure 10: bar chart. Horizontal axis: amino acid (A V F P M I L D E K R S T Y H C N Q W G X), with groups labeled hydrophobic, polar, and charged. Vertical axis: P value, 0 to 1.]
Fig. 10. P values for the average amino acid compositions in linker sequences. The P values of alanine, proline, and tryptophan are close to zero. The difference between the content of these amino acids in linkers and protein sequences in general (as represented by the PDB40 database) is statistically significant at better than 98% confidence. Adapted from Gerstein and Krebs.6
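The two-sided P value defined in the text can be sketched in a few lines. This is a minimal illustration, not the code used to produce Fig. 10. It follows the printed expression 1 − CDFn(e + |o − e|) + CDFn(e − |o − e|) literally. Rounding non-integer arguments of the CDF down is an assumption made here, and the function names are hypothetical. The binomial terms are evaluated through logarithms so that large n (such as 234 × 16) does not overflow.

```python
from math import exp, floor, lgamma, log

def binom_pmf(n, k, p):
    """P_n(k): probability of exactly k occurrences in n trials (0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1.0 - p))

def binom_cdf(n, k, p):
    """CDF_n(k): sum of P_n(i) for i = 0 .. floor(k)."""
    k = floor(k)
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(binom_pmf(n, i, p) for i in range(k + 1))

def two_sided_p(n, observed, p):
    """Two-sided P value: 1 - CDF_n(e + |o - e|) + CDF_n(e - |o - e|)."""
    expected = n * p
    mirror = 2.0 * expected - observed   # reflection of o about e
    upper = max(observed, mirror)        # equals e + |o - e|
    lower = min(observed, mirror)        # equals e - |o - e|
    return 1.0 - binom_cdf(n, upper, p) + binom_cdf(n, lower, p)

# Hypothetical counts, for illustration only: an amino acid with a database
# frequency of 4.7% observed 25 times among 234 residues at one position.
p_value = two_sided_p(234, 25, 0.047)
```

With n = 234 the sum has at most a few hundred terms, so the direct summation is adequate. For much larger samples a library routine for the binomial distribution would be the more usual choice.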
TABLE V. P Values for Profile of Amino Acid Composition of Linker Sequences for Every Single Position in Linkers^a

| Position | A | V | F | P | M | I | L | D | E | K | R | S | T | Y | H | C | N | Q | W | G | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.908 | 0.577 | 0.598 | 0.573 | 1×10⁻² | 0.990 | 0.084 | 0.442 | 0.476 | 0.638 | 0.793 | 0.269 | 0.498 | 0.234 | 0.619 | 0.997 | 0.942 | 0.937 | 0.810 | 0.324 | 0.717 |
| 2 | 0.728 | 0.481 | 0.911 | 0.207 | 0.366 | 0.155 | 0.754 | 0.750 | 0.476 | 0.730 | 0.530 | 0.990 | 0.897 | 0.864 | 0.237 | 0.336 | 0.597 | 0.281 | 0.459 | 0.324 | 0.717 |
| 3 | 4×10⁻² | 0.481 | 0.059 | 0.346 | 0.366 | 0.267 | 0.136 | 0.966 | 0.127 | 0.194 | 0.974 | 0.599 | 0.069 | 2×10⁻² | 0.455 | 0.345 | 0.404 | 0.804 | 0.459 | 0.233 | 0.752 |
| 4 | 0.125 | 0.577 | 0.666 | 0.346 | 0.717 | 0.585 | 0.186 | 0.185 | 0.936 | 0.842 | 2×10⁻³ | 0.578 | 0.897 | 0.619 | 0.237 | 0.647 | 0.193 | 0.281 | 0.193 | 5×10⁻² | 0.752 |
| 5 | 0.196 | 0.481 | 0.276 | 0.737 | 0.717 | 0.257 | 3×10⁻⁵ | 0.966 | 0.327 | 0.638 | 0.240 | 0.774 | 0.673 | 0.872 | 0.237 | 0.997 | 0.251 | 0.359 | 0.197 | 0.139 | 0.752 |
| 6 | 0.908 | 0.417 | 0.276 | 2×10⁻³ | 2×10⁻² | 0.793 | 0.541 | 0.442 | 0.327 | 0.538 | 0.793 | 0.166 | 2×10⁻² | 0.234 | 0.740 | 0.336 | 0.820 | 2×10⁻² | 0.459 | 0.823 | 0.752 |
| 7 | 0.562 | 0.224 | 0.126 | 0.114 | 0.637 | 0.257 | 2×10⁻² | 0.821 | 0.384 | 0.457 | 0.793 | 0.578 | 0.485 | 0.402 | 0.237 | 0.139 | 0.143 | 0.562 | 0.197 | 0.482 | 0.752 |
| 8 | 0.125 | 0.577 | 0.276 | 5×10⁻⁵ | 0.637 | 0.155 | 0.280 | 0.089 | 0.327 | 0.457 | 0.575 | 0.774 | 0.886 | 0.872 | 0.939 | 0.634 | 0.251 | 0.206 | 0.459 | 1×10⁻³ | 0.752 |
| 9 | 4×10⁻² | 0.481 | 0.598 | 2×10⁻³ | 3×10⁻² | 0.772 | 0.882 | 0.296 | 0.936 | 0.945 | 0.575 | 0.101 | 0.673 | 0.234 | 0.939 | 0.345 | 0.500 | 0.804 | 0.197 | 0.823 | 0.717 |
| 10 | 0.293 | 0.184 | 0.449 | 1×10⁻⁴ | 0.433 | 0.408 | 6×10⁻³ | 0.185 | 0.211 | 0.243 | 0.974 | 0.990 | 5×10⁻⁴ | 0.864 | 0.166 | 0.647 | 0.710 | 0.460 | 0.810 | 0.621 | 0.752 |
| 11 | 0.125 | 0.577 | 0.836 | 3×10⁻⁴ | 0.366 | 0.571 | 0.541 | 0.108 | 0.936 | 0.194 | 0.389 | 2×10⁻² | 0.328 | 0.402 | 0.619 | 2×10⁻² | 0.404 | 0.804 | 0.055 | 0.823 | 0.752 |
| 12 | 0.561 | 0.184 | 0.126 | 0.346 | 0.366 | 4×10⁻² | 0.071 | 0.389 | 0.092 | 0.730 | 0.793 | 0.269 | 0.673 | 0.234 | 0.939 | 0.997 | 0.193 | 0.206 | 0.810 | 0.643 | 0.752 |
| 13 | 0.415 | 0.841 | 0.393 | 4×10⁻³ | 0.961 | 0.571 | 0.541 | 0.556 | 0.545 | 0.638 | 0.793 | 0.283 | 0.886 | 0.619 | 0.619 | 0.997 | 0.597 | 3×10⁻² | 0.197 | 0.218 | 0.752 |
| 14 | 0.729 | 0.224 | 0.836 | 0.134 | 0.637 | 5×10⁻² | 0.280 | 0.389 | 0.476 | 0.638 | 0.974 | 0.176 | 0.498 | 0.872 | 0.740 | 2×10⁻² | 0.326 | 0.684 | 0.459 | 0.324 | 0.752 |
| 15 | 0.562 | 0.285 | 0.836 | 0.971 | 0.433 | 0.990 | 0.312 | 0.296 | 0.158 | 0.061 | 0.215 | 0.095 | 0.121 | 0.872 | 0.354 | 0.139 | 0.251 | 0.684 | 0.452 | 0.621 | 0.752 |
| 16 | 0.416 | 0.338 | 0.235 | 0.385 | 0.433 | 0.793 | 0.705 | 0.821 | 0.653 | 0.243 | 0.742 | 0.425 | 0.127 | 0.612 | 0.939 | 0.345 | 0.500 | 0.460 | 0.459 | 0.218 | 0.752 |

^a Columns are amino acids and rows are linker positions. The amino acids are grouped by attribute in the order hydrophobic, charged, and polar. In the original table, P values less than 0.05 are represented in bold face. The low P values for proline in positions 6 to 11 are most conspicuous. The classification according to the attributes hydrophobic, charged, and polar (Branden and Tooze152) does not provide a satisfactory explanation for the observed levels of amino acids (see also Fig. 10). Adapted from Gerstein and Krebs.6
Branden and Tooze.152 The frequencies of the remaining amino acids in linkers are not statistically different from those in the database as a whole at the 5% significance level. P values for amino acids at each of the 16 linker positions are shown (Table V).

Toward Propensities for Flexible Linkers

A variant of this procedure involves focusing just on linkers that are known to be flexible. The Database of Macromolecular Movements contains residue selections for known protein hinge regions (i.e., flexible linkers) that have been found in the scientific literature. These sequences were manually verified to be true flexible linker regions, and thus this database is a potential “gold standard” free from algorithmic biases that can be used as a starting point in the development of propensity scales and other research leading toward algorithmic techniques. By expanding these residue selections slightly with a predetermined protocol and extracting the corresponding sequences from the PDB, a series of sequences of known flexible linkers can be obtained. A FASTA search with a suitable cutoff (e.g., e value 0.001) can then be performed on known linker sequences to obtain a series of near homologs (Table VI). It is then possible to arrange these homologs into a multiple alignment (via the CLUSTAL W program153,154), and the multiple alignment can be fused into a variety of consensus pattern representations, such as hidden Markov models or simply consensus sequences.155–159 A sample multiple alignment for the hinge in calmodulin was performed (Table VI) and a number of consensus sequences were generated (Table VII). It is possible to average the amino acid composition over all the different hinges, and different positions within a hinge, to give a single composition vector for flexible hinges.
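The step from aligned homologs to a consensus sequence and a composition vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for Table VII. It takes a simple per-column majority vote over sequences that are already aligned and of equal length, whereas the text also mentions hidden Markov models. The function names are hypothetical, and the input is the set of distinct sequence variants from the calmodulin example rather than the full list of FASTA hits.

```python
from collections import Counter

def consensus(aligned):
    """Majority residue in each column of equal-length aligned sequences.

    Ties are resolved in favor of the residue seen first in that column.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )

def composition(sequences):
    """Fraction of each residue over all positions of all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Distinct variants of the calmodulin hinge among the near homologs.
hits = [
    "MARKMKDTDSE",
    "MARKMKDTDNE",
    "MARKMKDVDSE",
    "MARKMQDSDSE",
    "MARKMKDTDSE",
]

hinge_consensus = consensus(hits)
hinge_composition = composition(hits)
```

Pooling the sequences of several hinges in a single call to `composition` gives one averaged composition vector, at the cost of weighting long hinges more heavily than short ones.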
Finally, by comparing this latter quantity with the overall amino acid composition, or that of just linkers, a preliminary scale of amino acid propensity in flexible linkers may be obtained (Table VIII). This can be compared
152 C. Branden and J. Tooze, “Introduction to Protein Structure.” Garland Publishing, New York, 1991.
153 J. D. Thompson, D. G. Higgins, and T. J. Gibson, Nucleic Acids Res. 22, 4673 (1994).
154 D. G. Higgins, J. D. Thompson, and T. J. Gibson, Methods Enzymol. 266, 383 (1996).
155 E. L. Sonnhammer, S. R. Eddy, E. Birney, A. Bateman, and R. Durbin, Nucleic Acids Res. 26, 320 (1998).
156 A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler, J. Mol. Biol. 235, 1501 (1994).
157 S. R. Eddy, Curr. Opin. Struct. Biol. 6, 361 (1996).
158 S. R. Eddy, G. Mitchison, and R. Durbin, J. Comput. Biol. 2, 9 (1994).
159 P. Baldi, Y. Chauvin, and T. Hunkapiller, Proc. Natl. Acad. Sci. USA 91, 1059 (1994).
TABLE VI. Example of FASTA Results^a

| OWL ID | Sequence |
|---|---|
| CALN_CHICK | MARKMKDTDSE |
| MUSCAMC | MARKMKDTDSE |
| CALM_PATSP | MARKMKDTDSE |
| CALM_PYUSP | MARKMKDTDSE |
| CALM_METSE | MARKMKDTDSE |
| CALM_STIJA | MARKMKDTDSE |
| CALM_HUMAN | MARKMKDTDSE |
| CALM_DROME | MARKMKDTDSE |
| HSCAM3X1 | MARKMKDTDSE |
| CALM_EMENI | MARKMKDTDSE |
| CALM_NEUCR | MARKMKDTDSE |
| CALM_ELEEL | MARKMKDTDSE |
| NEUCLMDLN | MARKMKDTDSE |
| SSO4B01 | MARKMKDTDSE |
| CALL_ARBPU | MARKMKDTDSE |
| CALM_PLECO | MARKMKDTDSE |
| CALL_HUMAN | MARKMKDTDNE |
| CALS_CHICK | MARKMKDTDSE |
| CALM_PHYIN | MARKMKDTDSE |
| CALM_PNECA | MARKMKDVDSE |
| CALM_TRYBB | MARKMQDSDSE |
| CALM_TRYCR | MARKMQDSDSE |
| S53019 | MARKMKDTDSE |
| TRBCMRSG | MARKMQDSDSE |
| CALM_HORVU | MARKMKDTDSE |
| JC1033 | MARKMKDTDSE |
| CAL1_PETHY | MARKMKDTDSE |
| CAL6_ARATH | MARKMKDTDSE |

^a An example of sequences that might be obtained from a FASTA run on a known flexible linker sequence. In this case, the output of one FASTA run on the OWL database using the flexible linker region from calmodulin (4cln) with a cutoff (e value) of 0.001 is shown.

160 A. Bairoch and B. Boeckmann, Nucleic Acids Res. 20, 2019 (1992).
161 L. Holm and C. Sander, Nucleic Acids Res. 22, 3600 (1994).
162 E. Abola, J. Sussman, J. Prilusky, and N. Manning, Methods Enzymol. 277, 556 (1997).
163 C. A. Orengo, D. T. Jones, and J. M. Thornton, Nature 372, 631 (1994).
164 R. B. Altman, N. F. Abernethy, and R. O. Chen, ISMB 5, 15 (1997).
165 R. O. Chen, R. Felciano, and R. B. Altman, ISMB 5, 84 (1997).
166 A. Bairoch, P. Bucher, and K. Hofmann, Nucleic Acids Res. 24, 189 (1996).
TABLE VII. Example of Protein Flexible Linker Consensus Sequences Extracted from Macromolecular Movements Database^a

| Linker ID | Linker consensus sequence |
|---|---|
| 4cln | MARKMKDTDSE |
| 6ldh | AGARQQEGESRLNLVQRNVNIFKF |
| adenkin1 | VPFEVI |
| adenkin2 | LRLTA |
| adenkin3 | GEPLIQRDDDKE |
| adenkin4 | AYHAQTE |
| anxbreat | MKGAGT |
| anxtrp1 | YEAGELKWG |
| anxtrp2 | EETIDRET |
| dt | LEQVVHNS |
| enolase | GASTGIY |
| enolase2 | SDKS |
| lfh_hinge1 | QTHY |
| lfh_hinge2 | RVPS |
| ras | AGQEEYSAMRDQYMR |
| tbsv | PQPTNTL |

^a The database contains residue selections for known hinge regions (flexible linkers) culled from the scientific literature. Sixteen of these residue selections were then “grown” slightly in both directions according to a fixed protocol. Each selection was assigned a linker ID, which is based either on a PDB ID or on the macromolecular movements database motion ID plus possibly an optional additional numeric suffix to identify the specific residue selection used. A FASTA search with a cutoff of 0.01 was then performed on each sequence to obtain near homologs. The consensus sequence corresponding to each linker ID is given here.
with the scale of amino acid propensities in linkers as obtained by the procedure previously described (Table IV).

Conclusions
We have described how protein flexibility can be studied in a database framework. The database of macromolecular motions contains thousands of motions, with varying levels of annotation. We survey a number of the tools that underlie the statistics in the database (i.e., structural alignment, adiabatic mapping interpolation, normal mode analysis, and Voronoi packing calculations). Finally, in looking toward the future, we suggest how database analysis of motions can be extended to the vast new frontier
TABLE VIII
Preliminary Flexible Linker Propensity Scale^a

Residue   Propensity   Residue   Propensity
A         1.3268       M         2.6603
C         0.1097       N         0.7729
D         1.1684       P         0.4051
E         1.4702       Q         1.8076
F         0.5624       R         1.8013
G         1.2972       S         0.8269
H         0.4806       T         0.9002
I         0.4462       V         0.6865
K         1.0519       W         0.308
L         0.5303       Y         1.3375

^a A FASTA search with a cutoff of 0.01 was performed on 16 flexible linker sequences, as described in text. Amino acid frequencies in the flexible linker sequences and their near homologs obtained in the FASTA search were tabulated and divided by the amino acid frequencies in the PDB to obtain the preliminary propensities given here. (The high propensity shown for methionine may be an artifact arising from the presence of methionine as the first residue in many proteins.)
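
The propensity calculation described in the footnote to Table VIII can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the function names (`residue_frequencies`, `linker_propensities`, `mean_propensity`) are hypothetical, the background set is whatever sequences the caller supplies in place of the PDB, and averaging the scale over a segment is an assumed scoring rule rather than one given in the text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Preliminary propensities copied from Table VIII.
TABLE_VIII = {
    "A": 1.3268, "C": 0.1097, "D": 1.1684, "E": 1.4702, "F": 0.5624,
    "G": 1.2972, "H": 0.4806, "I": 0.4462, "K": 1.0519, "L": 0.5303,
    "M": 2.6603, "N": 0.7729, "P": 0.4051, "Q": 1.8076, "R": 1.8013,
    "S": 0.8269, "T": 0.9002, "V": 0.6865, "W": 0.308, "Y": 1.3375,
}


def residue_frequencies(sequences):
    """Fractional frequency of each standard amino acid in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def linker_propensities(linker_seqs, background_seqs):
    """Frequency in the linker set divided by frequency in the background set."""
    f_link = residue_frequencies(linker_seqs)
    f_back = residue_frequencies(background_seqs)
    # A residue absent from the background has an undefined propensity (NaN).
    return {
        aa: (f_link[aa] / f_back[aa]) if f_back[aa] > 0 else float("nan")
        for aa in AMINO_ACIDS
    }


def mean_propensity(segment, scale):
    """Average propensity over a segment (hypothetical scoring rule)."""
    values = [scale[ch] for ch in segment.upper() if ch in scale]
    if not values:
        raise ValueError("segment contains no scored residues")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Score the calmodulin linker consensus (4cln) from Table VII.
    print(round(mean_propensity("MARKMKDTDSE", TABLE_VIII), 3))
```

A value above 1 for a segment indicates that its residues are, on average, enriched in the linker set relative to the background; the methionine caveat in the table footnote applies to any such score.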
of genomic sequences, through the identification of likely hinge residues in primary sequence. We expect that the number of confirmed macromolecular motions available for study will greatly increase in future, making a database of motions increasingly valuable. We reason as follows: the number of new structures continues to rise at a rapid rate (nearly exponential). However, the increase in the number of folds is much slower and is expected to level off much more in future as we find more and more of the limited number of folds in nature, estimated to be as low as 1000. Each new structure solved that has the same fold as one in the database represents a potential new motion, that is, it is often a structure in a different liganded state or a structurally perturbed homolog. Thus, as we find more and more of the finite number of folds, crystallography and NMR will increasingly provide information about the variability and mobility of a given fold, rather than identifying new folding patterns. Databases potentially represent a new paradigm for scientific computing. In a highly schematized view, scientific computing traditionally involved big calculations on fast computers. The aim in these often was prediction based on first principles—for example, prediction of protein folding based on molecular dynamics. These calculations naturally emphasized the processor speed of the computer. In contrast, the new ‘‘database paradigm’’ focuses on small, interconnected information sources on many different computers. The aim is communication of scientific information
and the discovery of unexpected relationships in the data (for example, the finding that heat shock protein looks like hexokinase). In contrast to their more traditional counterparts, these calculations depend more on disk storage and networking than on raw CPU power.

Acknowledgments

The authors gratefully acknowledge the financial support of the Keck Foundation and the National Science Foundation (Grant DBI-9723182). Numerous people have also contributed entries or information to the database and morph server or have given us feedback on what the user community wants. The authors also wish to thank Informix Software, Inc., for providing a grant of its database software.
[24] Looking behind the Beamstop: X-Ray Solution Scattering Studies of Structure and Conformational Changes of Biological Macromolecules

By Patrice Vachette, Michel H. J. Koch, and Dmitri I. Svergun

Introduction
Advances in X-ray crystallography have made it much easier to determine high-resolution structures of biological macromolecules, even those of large macromolecular assemblies. As a consequence, the emphasis has shifted from the actual structure determination, or molecular anatomy, back to classic questions of mechanism, or molecular physiology, such as: How do differences in structure reflect adaptation in different species? What is the role of intermolecular interactions in the optimization of metabolic pathways? How are differences in structure related to pathological conditions? At a more technical level those trying to extract operational information from the rapidly increasing amount of structural data are thus more and more often faced with questions like the following: Is the (absence of) conformational change observed on ligand binding affected by crystal packing? Is the conformational transition an all-or-none, two-state process, or does it involve structural intermediates? What is the rate of the process? How are structurally known partners arranged within this noncrystallizing multiprotein complex? Can some structural features of this invisible, presumably disordered part of the molecule be characterized? These and similar questions can be addressed, and even at times answered, by shining a collimated X-ray beam on a solution of the relevant macromolecules or macromolecular assemblies. The aim of the present review
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
[24]
SAXS studies of conformational changes in solution
585
is to illustrate the use of solution scattering methods to solve problems related to the crystal structures of biological macromolecules. As a consequence, important applications that have no explicit link with crystallography, such as those related to (un)folding of proteins and nucleic acids, are hardly mentioned.

For all practical purposes, solution scattering measurements are nowadays carried out at synchrotron radiation or neutron facilities. The reliance on such facilities also implies that most of the experimental aspects (source, optics, sample environment, beam path, and detectors) are largely specific to a particular installation and hence not under the control of the user. They will therefore not be considered here and it will be assumed that accurate scattering and background curves, corrected for instrumental effects (e.g., smearing due to finite beam dimensions, wavelength spread, detector characteristics, etc.), are available. (For an introduction, see Feigin and Svergun.1) In this context it should be noted that radiation damage can become a serious problem with X-rays, especially with the intense beams of third-generation sources. Sporadic evidence suggests that intermittent exposures to an intense undulator beam (e.g., 10 exposures of 0.1 s at 1-s intervals) are significantly less damaging than the corresponding continuous (1-s) exposure (T. Narayanan and S. Finet, personal communication). It is thus good practice to collect patterns in a succession of short time frames that can be averaged after checking for possible radiation damage. For neutrons, which have energies of only a few millielectron volts, as compared with the 12.4 keV of 1-Å X-rays, these effects are usually negligible.

The following section gives a brief reminder of the main concepts of solution scattering. Details can be found in the International Tables for Crystallography2 and in several monographs.1,3

Basics of Solution Scattering
Due to the random orientation of the macromolecules in solution, the scattering depends only on the modulus of the scattering vector $s = 2\sin\theta/\lambda = 1/d$, where $\lambda$ is the radiation wavelength, $2\theta$ is the scattering angle, and $d$ denotes the Bragg resolution. Usually, the intensity patterns are
1 L. A. Feigin and D. I. Svergun, "Structure Analysis by Small-Angle X-Ray and Neutron Scattering." Plenum Press, New York, 1987.
2 O. Glatter, "International Tables for Crystallography." Kluwer Academic, Dordrecht, The Netherlands, 1995.
3 O. Glatter and O. Kratky, "Small Angle X-Ray Scattering," p. 515. Academic Press, London, 1982.
represented as functions of the momentum transfer $q = 2\pi s = 4\pi\sin\theta/\lambda$. The measured scattering intensity $I_M(q)$ is obtained from the number $N(n)$ of photons or neutrons counted by the detector in the $n$th channel of the histogramming device as

$$ I_M(q) = \frac{N(n)_{\mathrm{sol}}}{I_{0,\mathrm{sol}}} - f\,\frac{N(n)_{\mathrm{b}}}{I_{0,\mathrm{b}}} - (1 - f)\,\frac{N(n)_{\mathrm{ec}}}{I_{0,\mathrm{ec}}} \tag{1} $$

where the subscripts sol, b, and ec refer to the solution, solvent (buffer), and empty cell, respectively. $I_0$ represents the intensity of the transmitted direct beam. In actual practice, the value of $f$ is normally taken to be 1 and expression (1) reduces to

$$ I_M(q) = \frac{N(n)_{\mathrm{sol}}}{I_{0,\mathrm{sol}}} - \frac{N(n)_{\mathrm{b}}}{I_{0,\mathrm{b}}} \tag{1'} $$

except when concentrated solutions are studied (25 to 200 mg/ml), in which case $f$ is taken as the volume fraction of the solvent in the solution. The relationship between the value of $n$, the channel number in the histogramming device, and the momentum transfer $q$ is obtained either by calibration with a standard sample or from an accurate knowledge of the sample-detector distance, the wavelength, and the physical dimensions of the detector. In general, scattering from a monodisperse solution of macromolecules is given by

$$ I_M(q) = N_M\, I(q)\, SF(q) \tag{2} $$
where $N_M$ is the number of macromolecules in the irradiated volume, $I(q)$ is the scattering from a single particle averaged over all orientations, and $SF(q)$ is the structure factor of the solution. The latter takes into account attractive or repulsive interparticle interactions and may significantly differ from unity at small angles ($q < 2$ nm$^{-1}$). At infinite dilution $SF(q) = 1$ and it is thus useful to collect the scattering intensities for a concentration series (e.g., 1, 3, and 5 mg/ml) to obtain an undistorted scattering pattern by extrapolation to zero concentration in the low-angle region. For the high-angle region ($q > 2$ nm$^{-1}$) it is generally necessary to use concentrated solutions ($c > 10$ mg/ml) to obtain a sufficient signal-to-background ratio, but the influence of the structure factor is usually negligible in this range. In the following, we concentrate on the analysis of the single particle scattering intensity $I(q)$ that contains information about the structure of the macromolecule. For studies of the interparticle interactions through measurements of the structure factor, the reader is referred to Tardieu et al.4

Solution scattering emerges from inhomogeneities of the scattering length density in the sample due to the dissolved macromolecules (electron

4 A. Tardieu, F. Bonneté, S. Finet, and D. Vivarès, Methods Enzymol. 368, 105 (2003), this volume.
density for X-rays or nuclear and spin density for neutrons). Mathematically, $I(q)$ is the spherical average of the square of the modulus of the scattered amplitude $A(\mathbf{q})$, which is the Fourier transform of the excess scattering density of the particle: $I(q) = \langle I(\mathbf{q}) \rangle_{\Omega} = \langle |A(\mathbf{q})|^{2} \rangle_{\Omega}$. Here, $\Omega$ is the solid angle in reciprocal space [$\mathbf{q} = (q, \Omega)$], and $\Delta\rho(\mathbf{r}) = \rho(\mathbf{r}) - \rho_s$, where $\rho(\mathbf{r})$ is the scattering length density distribution inside the particle and $\rho_s$ is the scattering length density of the solvent. An important characteristic of the sample is the average value of the excess density inside the particle, that is, the contrast $\overline{\Delta\rho} = \langle \Delta\rho(\mathbf{r}) \rangle$. For high contrasts [$\overline{\Delta\rho}$ much larger than the fluctuations of $\Delta\rho(\mathbf{r})$], $I(q)$ is mostly due to the shape of the macromolecule, whereas for low contrasts $I(q)$ is dominated by the scattering from the internal particle structure. In most practical cases, scattering patterns contain both terms, whereby the relative contribution of the scattering due to internal structure increases with $q$. The two contributions can be separated using contrast variation,5 which involves measurements at different solvent scattering length densities. The best conditions for contrast variation exist in neutron scattering due to a remarkable scattering length difference between hydrogen and deuterium. Valuable additional information about the particle structure is obtained either by measurements in different water–heavy water mixtures or by isotopic labeling of specific fragments of macromolecules.6–8 Contrast variation is especially useful for multicomponent (e.g., nucleoprotein) complexes.9–11

For X-ray studies of proteins in aqueous solutions, shape scattering usually prevails at low resolution (up to about $d = 2$–3 nm, i.e., $s < 0.5$–0.33 nm$^{-1}$ or $q < 0.66\pi$ nm$^{-1}$), whereas for higher angles the scattering data reflect the internal particle structure. Contrast variation is rarely used with X-rays since it requires the addition of small molecules (salt, sucrose, glycerol).
This introduces all the complications associated with three-component systems (water, small solute, macromolecule), often compounded by large changes in absorption (e.g., with NaI, KI, or CsI) or in

5 H. B. Stuhrmann and R. G. Kirste, Zeitschr. Physik. Chem. Neue Folge 46, 247 (1965).
6 M. H. J. Koch and H. B. Stuhrmann, Methods Enzymol. 59, 670 (1979).
7 G. Zaccai and B. Jacrot, Annu. Rev. Biophys. Bioeng. 12, 139 (1983).
8 H. B. Stuhrmann, O. Scharpf, M. Krumpolc, T. O. Niinikoski, M. Rieubland, and A. Rijllart, Eur. Biophys. J. 14, 1 (1986).
9 M. S. Capel, D. M. Engelman, B. R. Freeborn, M. Kjeldgaard, J. A. Langer, V. Ramakrishnan, D. G. Schindler, D. K. Schneider, B. P. Schoenborn, I. Y. Sillers, et al., Science 238, 1403 (1987).
10 R. Junemann, N. Burkhardt, J. Wadzack, M. Schmitt, R. Willumeit, H. B. Stuhrmann, and K. H. Nierhaus, Biol. Chem. 379, 807 (1998).
11 D. I. Svergun and K. H. Nierhaus, J. Biol. Chem. 275, 14432 (2000).
Fig. 1. Experimental scattering pattern of a solution of hen egg white lysozyme (solid circles) and scattering of the corresponding homogeneous particle (open circles) obtained by subtraction of the slope of the Porod plot (shown in the inset).
solvent viscosity (e.g., with sucrose or glycerol). In X-ray scattering a good approximation of the shape scattering can be obtained by making use of Porod's law,12,13 which describes the asymptotic behavior of the scattering of homogeneous particles. At higher angles, the scattering oscillates around a straight line given by the Porod plot:

$$ q^{4} I(q) \approx B q^{4} + A \tag{3} $$
The scattering of the corresponding homogeneous body can be obtained from the experimental data by subtracting the contribution of the internal structure given by the constant $B$ obtained from the slope of the Porod plot. As illustrated in Fig. 1, this leads to a significant modification of the experimental data at large angles.

12 G. Porod, Kolloid-Z. 124, 83 (1951).
13 G. Porod, in "General Theory" (O. Glatter and O. Kratky, eds.), p. 17. Academic Press, London, 1982.
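
The Porod correction just described amounts to a straight-line fit of $q^{4}I(q)$ against $q^{4}$ followed by subtraction of the fitted slope. The sketch below is a minimal illustration of that arithmetic and not the procedure of any particular program; the function names and the choice of the fitting range (`q_porod_min`) are assumptions left to the user.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def porod_subtract(q, intensity, q_porod_min):
    """Estimate B and A of the Porod plot and subtract B from I(q).

    Only points with q >= q_porod_min enter the fit of q^4 I(q) versus q^4.
    Returns (shape_intensity, B, A).
    """
    x = [qi ** 4 for qi in q if qi >= q_porod_min]
    y = [qi ** 4 * ii for qi, ii in zip(q, intensity) if qi >= q_porod_min]
    b, a = linear_fit(x, y)
    # The constant B is attributed to the internal structure and removed.
    shape_intensity = [ii - b for ii in intensity]
    return shape_intensity, b, a


if __name__ == "__main__":
    # Synthetic curve that obeys q^4 I(q) = A + B q^4 exactly.
    q = [1.0 + 0.1 * k for k in range(41)]
    intensity = [2.0 / qi ** 4 + 0.05 for qi in q]
    shape, b, a = porod_subtract(q, intensity, 3.0)
    print(b, a)
```

With real data the fitted range should lie where the Porod plot is actually linear, which has to be judged from the plot itself.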
Fourier transformation of the scattering intensity $I(q)$ yields the characteristic function of the particle

$$ \gamma(r) = \frac{1}{2\pi^{2}} \int_{0}^{\infty} q^{2} I(q)\, \frac{\sin(qr)}{qr}\, dq \tag{4} $$

$\gamma(r)$ is directly related to the distance distribution function of the particle, $p(r) = r^{2}\gamma(r)$. The latter is the spherically averaged self-convolution of the excess scattering density, $p(r) = r^{2} \langle \Delta\rho(\mathbf{r}) * \Delta\rho(-\mathbf{r}) \rangle_{\omega}$, which is equal to zero beyond $r = D_{\max}$, where $D_{\max}$ is the maximum particle size. Figure 2 illustrates that the shape of $p(r)$ gives direct information about the overall shape of the particles (e.g., globular or cylindrical) and/or their domain structure.

First Analysis: Distance Distribution Function and Invariants
An experimental data set after data reduction, background subtraction, and extrapolation to zero concentration is a collection of discrete points $I_{\exp}(q_i)$, $i = 1, \ldots, N$, in a restricted angular range $q_{\min} < q_i < q_{\max}$, affected by statistical errors $\sigma(q_i)$ and, possibly, also instrumental effects like the smearing associated with the size and shape of the direct beam and/or the wavelength distribution. A first check is provided by the use of the Guinier
Fig. 2. Typical p(r) curves computed from the experimental data of (1) a compact globular protein (pyruvate decarboxylase from Zymomonas mobilis65), (2) an elongated protein (Z1Z2 domain of titin), and (3) a protein with two domains (FBKPmem from Legionella pneumonia60).
approximation, historically the first tool of small-angle scattering data analysis, dating back to 1939.14 At small angles, the intensity is a function of only two parameters:

$$ I(q) \approx I(0) \exp\!\left(-q^{2} R_g^{2}/3\right) \tag{5} $$

and the slope and intercept of a Guinier plot ($\ln[I(q)]$ versus $q^{2}$) readily yield the values of the radius of gyration of the particle $R_g$ and of the forward scattering $I(0)$. The first parameter characterizes the particle size, the second one its molecular mass. The approximation is only strictly valid in the range $qR_g < 1$ and is sensitive to the presence of small amounts of aggregates or interparticle interactions: nonlinearity of the Guinier plot is an indication of polydispersity and/or of significant interparticle interactions in the sample solution. Linearity of the Guinier plot alone does not, however, guarantee the absence of aggregates and/or interparticle interferences.

The reliability of any further analysis rests on the accuracy of computation of the distance distribution function $p(r)$. Direct Fourier transformation [Eq. (4)] is not reliable, as the exact intensity $I(q)$ cannot be measured, and it is thus better to compute $p(r)$ indirectly using the inverse transformation

$$ I(q) = 4\pi \int_{0}^{D_{\max}} p(r)\, \frac{\sin qr}{qr}\, dr \tag{6} $$

where $D_{\max}$ is the maximum dimension of the particle. Representing $p(r)$ on $[0, D_{\max}]$ by a linear combination of orthogonal functions $\varphi_k(r)$

$$ p(r) = \sum_{k=1}^{K} c_k \varphi_k(r) \tag{7} $$

the coefficients $c_k$ can be determined by fitting the experimental data, minimizing the functional

$$ F = \sum_{i=1}^{N} \left[ \frac{I_{\exp}(q_i) - \sum_{k=1}^{K} c_k \psi_k(q_i)}{\sigma(q_i)} \right]^{2} + \alpha \int_{0}^{D_{\max}} \left[ p'(r) \right]^{2} dr \tag{8} $$

where $\psi_k(q)$ are the Fourier transformed and smeared functions $\varphi_k(r)$. The regularizing multiplier $\alpha \geq 0$ balances the goodness of fit to the data (first summand) against the smoothness of the $p(r)$ function (second

14 A. Guinier, Ann. Phys. (Paris) 12, 161 (1939).
summand). This "indirect transform" approach, first introduced by Glatter,15 is usually superior to other data-processing techniques as it imposes strong constraints, namely boundedness and smoothness of $p(r)$. Note that an approximate estimate of $D_{\max}$ is usually known a priori and can be further refined by successive calculations with different values of $D_{\max}$. The main problem when using an indirect transform technique is to select the proper value of the regularizing multiplier $\alpha$. Too-small values of $\alpha$ yield solutions that are unstable to experimental errors, whereas too-large values lead to systematic deviations from the experimental data. The program GNOM16,17 provides the necessary guidance using a set of perceptual criteria describing the quality of the solution. The program either finds the optimal solution automatically or signals that the assumptions about the system (e.g., the value of $D_{\max}$) are incorrect. The forward scattering and the radius of gyration are readily derived from $p(r)$ as

$$ I(0) = 4\pi \int_{0}^{D_{\max}} p(r)\, dr; \qquad R_g^{2} = \frac{\int_{0}^{D_{\max}} r^{2} p(r)\, dr}{2 \int_{0}^{D_{\max}} p(r)\, dr} \tag{9} $$
This method implicitly makes use of the entire scattering curve and is thus more reliable than the direct evaluation of Rg using the Guinier approximation.
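
The two routes to $R_g$ and $I(0)$ discussed in this section, the Guinier fit of Eq. (5) and the integrals of Eq. (9), can be compared numerically. The sketch below is a minimal illustration and is unrelated to GNOM: all function names are hypothetical, the Guinier range `q_max` is chosen by the caller (who should check afterward that `q_max * Rg` stays below about 1), and a homogeneous sphere, whose analytic $I(q)$ and $p(r)$ are standard results, is used as a test object that does not appear in the text.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def guinier_fit(q, intensity, q_max):
    """Fit ln I(q) = ln I(0) - q^2 Rg^2 / 3 for q <= q_max; return (Rg, I0)."""
    pts = [(qi * qi, math.log(ii))
           for qi, ii in zip(q, intensity) if qi <= q_max and ii > 0]
    slope, intercept = linear_fit([p[0] for p in pts], [p[1] for p in pts])
    if slope >= 0:
        raise ValueError("non-negative Guinier slope; check the data")
    return math.sqrt(-3.0 * slope), math.exp(intercept)


def trapezoid(x, y):
    """Trapezoidal integration of tabulated y(x)."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def invariants_from_pr(r, p):
    """Return (I0, Rg) from a tabulated p(r) using the integrals of Eq. (9)."""
    zeroth = trapezoid(r, p)
    second = trapezoid(r, [ri * ri * pv for ri, pv in zip(r, p)])
    return 4.0 * math.pi * zeroth, math.sqrt(second / (2.0 * zeroth))


def sphere_intensity(q, radius):
    """Normalized scattering of a homogeneous sphere (test object)."""
    x = q * radius
    return (3.0 * (math.sin(x) - x * math.cos(x)) / x ** 3) ** 2


def sphere_pr(r, radius):
    """Unnormalized p(r) of a homogeneous sphere, zero beyond Dmax = 2R."""
    d = 2.0 * radius
    if r > d:
        return 0.0
    t = r / d
    return r * r * (1.0 - 1.5 * t + 0.5 * t ** 3)


if __name__ == "__main__":
    radius = 2.0  # nm; the exact Rg is sqrt(3/5) * radius
    q = [0.01 * k for k in range(1, 61)]
    rg_guinier, i0 = guinier_fit(q, [sphere_intensity(v, radius) for v in q], 0.6)
    r = [4.0 * k / 2000 for k in range(2001)]
    _, rg_pr = invariants_from_pr(r, [sphere_pr(v, radius) for v in r])
    print(rg_guinier, rg_pr, math.sqrt(0.6) * radius)
```

On noise-free data both estimates are close to the exact value; the Guinier estimate carries a small systematic bias from the terms neglected in Eq. (5), which is one way to see why the text prefers the $p(r)$ route.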
Modeling Tools
Ab Initio Methods

An unambiguous ab initio reconstruction of three-dimensional models from one-dimensional scattering curves is clearly not possible without imposing restrictions on the model. The simplest and most frequently used method is to limit the search to homogeneous models, usually a valid assumption at low resolution. In the past, trial-and-error approaches involving evaluation of scattering patterns from different simple shapes (spheres, ellipsoids, cylinders, etc.) and their comparison with the experimental data were popular. A modern version of this approach implemented in the program SASMODEL combines ellipsoids and cylinders filled with a dense random arrangement of spheres to compute $p(r)$.18 More recently, ab initio

15 O. Glatter, J. Appl. Crystallogr. 10, 415 (1977).
16 D. I. Svergun, A. V. Semenyuk, and L. A. Feigin, Acta Crystallogr. A 44, 244 (1988).
17 D. I. Svergun, J. Appl. Crystallogr. 25, 495 (1992).
18 J. Zhao, E. Hoye, S. Boylan, D. A. Walsh, and J. Trewhella, J. Biol. Chem. 273, 30448 (1998).
shape determination procedures were developed. In the multipole expansion method19,20 the shape is represented by an angular envelope function $r = F(\omega)$ describing the particle boundary in spherical coordinates $(r, \omega)$. This function is economically parameterized as

$$ F(\omega) \approx F_L(\omega) = \sum_{l=0}^{L} \sum_{m=-l}^{l} f_{lm} Y_{lm}(\omega) \tag{10} $$

where $Y_{lm}(\omega)$ are spherical harmonics, and the multipole coefficients $f_{lm}$ are complex numbers. The truncation value $L$ defines the number of parameters [for the general case, $N_p = (L + 1)^{2} - 6$] and the spatial resolution $\delta r \approx \pi\, 5^{1/2} R_g / [3^{1/2} (L + 1)]$. For a homogeneous particle, the density is

$$ \rho_c(\mathbf{r}) = \begin{cases} 1, & 0 \leq r < F(\omega) \\ 0, & r \geq F(\omega) \end{cases} \tag{11} $$

and the shape scattering intensity is expressed as21

$$ I(q) = 2\pi^{2} \sum_{l=0}^{\infty} \sum_{m=-l}^{l} |A_{lm}(q)|^{2} \tag{12} $$

where the partial amplitudes $A_{lm}(q)$ are readily computed from the shape coefficients $f_{lm}$.19,22 These coefficients are determined by a nonlinear optimization procedure starting from a spherical initial approximation and minimizing the discrepancy $\chi^{2}$ between the calculated and the experimental scattering curves

$$ \chi^{2} = \sum_{j=1}^{N} \left[ \frac{I(q_j) - I_{\exp}(q_j)}{\sigma(q_j)} \right]^{2} \tag{13} $$

The algorithm uses only a few parameters to describe the particle shape (10 to 20 parameters for $L = 3$–4), which in most practical cases ensures a unique low-resolution shape restoration. The method, implemented in the program SASHA,20,23 was successfully used in a number of applications by various groups.24–27

19 D. I. Svergun and H. B. Stuhrmann, Acta Crystallogr. A 47, 736 (1991).
20 D. I. Svergun, V. V. Volkov, M. B. Kozin, and H. B. Stuhrmann, Acta Crystallogr. A 52, 419 (1996).
21 H. B. Stuhrmann, Zeitschr. Physik. Chem. Neue Folge 72, 177 (1970).
22 D. I. Svergun, J. Appl. Crystallogr. 30, 792 (1997).
23 D. I. Svergun, V. V. Volkov, M. B. Kozin, H. B. Stuhrmann, C. Barberato, and M. H. J. Koch, J. Appl. Crystallogr. 30, 798 (1997).
24 D. I. Svergun, M. Malfois, M. H. J. Koch, S. R. Wigneshweraraj, and M. Buck, J. Biol. Chem. 275, 4210 (2000).
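
Two of the quantities used above are simple enough to state as code: the parameter count $N_p = (L + 1)^{2} - 6$ and the discrepancy of Eq. (13). The sketch below is a minimal illustration with hypothetical function names; it does not reproduce SASHA, and in particular it leaves out the computation of the model intensity from the coefficients $f_{lm}$.

```python
def n_shape_parameters(l_max):
    """Number of free parameters in the general case, N_p = (L + 1)^2 - 6."""
    if l_max < 0:
        raise ValueError("L must be non-negative")
    return (l_max + 1) ** 2 - 6


def chi_square(i_calc, i_exp, sigma):
    """Sum over data points of the squared, error-weighted residuals (Eq. 13)."""
    if not (len(i_calc) == len(i_exp) == len(sigma)):
        raise ValueError("curves must have the same number of points")
    return sum(((c - e) / s) ** 2 for c, e, s in zip(i_calc, i_exp, sigma))


if __name__ == "__main__":
    # L = 3 and L = 4 give the 10 to 20 parameters quoted in the text.
    print(n_shape_parameters(3), n_shape_parameters(4))
    print(chi_square([1.0, 2.0], [1.0, 1.0], [0.5, 0.5]))
```

Because each residual is divided by its own $\sigma(q_j)$, noisy high-angle points weigh less in the optimization than precise low-angle points.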
The use of an angular envelope function is limited to not-too-complicated shapes (in particular, without holes inside the particle). A more comprehensive description is achieved by the bead methods.28,29 A spherical volume with diameter $D_{\max}$ is filled by $M$ densely packed spheres of much smaller radius $r_0$. Each of these spheres may belong either to the particle (index = 1) or to the solvent (index = 0), and the shape is thus fully described by a binary string $X$ of length $M$. Starting from a random distribution of 1's and 0's, the model is randomly modified using a Monte Carlo-like search to find a binary string (i.e., the shape) that fits the experimental data. As the search models usually contain thousands of beads, the solution must be constrained by imposing conditions of compactness and connectivity. In the simulated annealing procedure of Svergun,29 an explicit penalty term $P(X)$ is added to the goal function $f(X) = \chi^{2} + P(X)$ to meet these conditions. In the original method of Chacon et al.,28 which relies on a genetic algorithm, the solution is implicitly constrained by gradually decreasing $r_0$ during minimization, whereas in the latest version30 explicit constraints have also been added. Other Monte Carlo-based ab initio approaches have been proposed, such as the "give-n-take" procedure31 and a model of interconnected ellipsoids.32 The ability of these methods to satisfactorily restore low-resolution shapes of macromolecules from solution scattering data was demonstrated in test examples and in several applications.24,25,30,33–35

A principal limitation of the shape determination methods lies in the necessity to fit the shape scattering curve. As discussed in the introduction, X-ray scattering data at higher resolution are dominated by the contribution from the internal structure, and in practice only a restricted portion of the scattering patterns can be used. This not only limits the resolution
25 D. I. Svergun, A. Becirevic, H. Schrempf, M. H. J. Koch, and G. Grueber, Biochemistry 39, 10677 (2000).
26 J. G. Grossmann, S. A. Ali, A. Abbasi, Z. H. Zaidi, S. Stoeva, W. Voelter, and S. S. Hasnain, Biophys. J. 78, 977 (2000).
27 J. K. Krueger, S. C. Gallagher, C. A. Wang, and J. Trewhella, Biochemistry 39, 3979 (2000).
28 P. Chacon, F. Moran, J. F. Diaz, E. Pantos, and J. M. Andreu, Biophys. J. 74, 2760 (1998).
29 D. I. Svergun, Biophys. J. 76, 2879 (1999).
30 P. Chacon, J. F. Diaz, F. Moran, and J. M. Andreu, J. Mol. Biol. 299, 1289 (2000).
31 D. Walther, F. E. Cohen, and S. Doniach, J. Appl. Crystallogr. 33, 350 (2000).
32 D. Vigil, S. C. Gallagher, J. Trewhella, and A. E. Garcia, Biophys. J. 80, 2082 (2001).
33 M. Bada, D. Walther, B. Arcangioli, S. Doniach, and M. Delarue, J. Mol. Biol. 300, 563 (2000).
34 T. Fujisawa, A. Kostyukova, and Y. Maeda, FEBS Lett. 498, 67 (2001).
35 A. Sokolova, M. Malfois, J. Caldentey, D. I. Svergun, M. H. J. Koch, D. H. Bamford, and R. Tuma, J. Biol. Chem. 276, 46187 (2001).
to 2–3 nm but also the reliability of the models. A new approach to reconstruct protein models including wide-angle X-ray scattering data up to a resolution of 0.5 nm has been proposed.36 The C$\alpha$ atoms of neighboring amino acid residues in the primary sequence are separated by approximately 0.38 nm so that at 0.5-nm resolution a protein can be regarded as an assembly of dummy residues (DR) centered at the C$\alpha$ positions (the number of residues $M$ is usually known from the protein or translated DNA sequence). The method starts from a randomly distributed gas of $M$ DRs in a spherical search volume of diameter $D_{\max}$. Following a simulated annealing protocol, a DR taken at random is relocated within the search volume to an arbitrary point at a distance of 0.38 nm from another randomly selected DR. The compactness criterion used in shape determination is replaced by a requirement for the model to have a "chain-compatible" spatial arrangement of the DRs that fits the experimental scattering pattern. Compared with shape determination, DR modeling substantially improves the resolution and reliability of the models and has potential for further development (e.g., use of residue-specific information).

An example of application of different ab initio methods is presented in Fig. 3, which displays the reconstructed models of hen egg white lysozyme superimposed on its atomic structure in the crystal (PDB entry 6lyz37). Lysozyme is a small protein [molecular mass (MM) = 14.3 kDa, 129 residues], and the contribution from the internal structure dominates its solution scattering curve starting from $q = 2$ nm$^{-1}$ (Fig. 4). The low-resolution models restored by ab initio shape determination (programs SASHA and DAMMIN) are only able to fit the low-angle portion of the experimental scattering pattern, but still provide a fair approximation of the overall appearance of the protein. The dummy residues method (program GASBOR) neatly fits the entire scattering pattern and yields a significantly more detailed model.

If reliable information about the particle structure is available it should, of course, also be used with ab initio methods. In particular, symmetry restrictions significantly speed up the computations and reduce the effective number of model parameters. In the programs SASHA, DAMMIN, and GASBOR, symmetry restrictions associated with the point groups P2 to P6 and P222 to P62 can be imposed.

36 D. I. Svergun, M. V. Petoukhov, and M. H. J. Koch, Biophys. J. 80, 2946 (2001).
37 R. Diamond, J. Mol. Biol. 82, 371 (1974).
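
The elementary move of the dummy residue search described above (relocating a randomly chosen DR to a point 0.38 nm from another randomly chosen DR, inside the search sphere) can be sketched as follows. This is an illustration of the move only and not the GASBOR implementation: the function names, the rejection-sampling loop, and the Metropolis acceptance rule are assumptions, and the evaluation of the goal function against the scattering data is omitted.

```python
import math
import random

CA_CA_DISTANCE = 0.38  # nm, spacing of neighboring C-alpha atoms quoted in the text


def random_unit_vector(rng):
    """Direction distributed uniformly over the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def relocate_dummy_residue(coords, d_max, rng, max_tries=1000):
    """Move one random DR to 0.38 nm from another random DR, in place.

    The search volume is a sphere of diameter d_max centered at the origin.
    Returns (index_moved, index_anchor).
    """
    if len(coords) < 2:
        raise ValueError("need at least two dummy residues")
    moved, anchor = rng.sample(range(len(coords)), 2)
    ax, ay, az = coords[anchor]
    for _ in range(max_tries):
        ux, uy, uz = random_unit_vector(rng)
        new = (ax + CA_CA_DISTANCE * ux,
               ay + CA_CA_DISTANCE * uy,
               az + CA_CA_DISTANCE * uz)
        # Reject trial positions that fall outside the search sphere.
        if math.sqrt(sum(c * c for c in new)) <= d_max / 2.0:
            coords[moved] = new
            return moved, anchor
    raise RuntimeError("could not place the residue inside the search volume")


def metropolis_accept(delta_f, temperature, rng):
    """Usual simulated-annealing acceptance test (assumed, not from the text)."""
    return delta_f <= 0.0 or rng.random() < math.exp(-delta_f / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(relocate_dummy_residue(model, d_max=4.0, rng=rng))
```

In a full search, each such move would be followed by recomputing the goal function and applying the acceptance test while the temperature is gradually lowered.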
Fig. 3. Atomic model of lysozyme (C$\alpha$ chain) superimposed with ab initio models obtained with the programs SASHA (left column, semitransparent envelope), DAMMIN (middle column, semitransparent dummy atoms), and GASBOR (right column, semitransparent dummy residues). The low-resolution models are superimposed on the atomic structure using the program SUPCOMB.38 The middle and bottom rows are rotated counterclockwise by 90° around x and y, respectively. All three-dimensional models were displayed on an SGI Workstation using the program ASSA.39
Computation of Scattering Patterns from Atomic Models

Information about the atomic structure of the entire macromolecule or of its individual fragments (e.g., from crystallography or NMR) can significantly enrich the interpretation of solution scattering studies. A necessary prerequisite is an accurate evaluation of the scattering patterns from atomic models, taking the influence of the solvent into account. Earlier methods40–42 differently represented the particle volume inaccessible to the solvent, but did not account for hydration effects. Analysis of numerous

38 M. B. Kozin and D. I. Svergun, J. Appl. Crystallogr. 34, 33 (2001).
39 M. B. Kozin, V. V. Volkov, and D. I. Svergun, J. Appl. Crystallogr. 30, 811 (1997).
40 E. E. Lattman, Proteins 5, 149 (1989).
41 J. Ninio, V. Luzzati, and M. Yaniv, J. Mol. Biol. 71, 217 (1972).
42 M. Y. Pavlov and B. A. Fedorov, Biofizika 28, 931 (1983).
Fig. 4. X-ray scattering from lysozyme (1) and scattering from the ab initio models in Fig. 3: (2) envelope model (SASHA); (3) bead model (DAMMIN); (4) dummy residue model (GASBOR).
X-ray scattering patterns from proteins with known atomic structure indicated that it is indispensable to include a hydration shell to adequately describe the experimental data. This can be done by surrounding the macromolecule by a 0.3-nm-thick hydration layer with an adjustable density $\rho_b$, which may differ from that of the bulk solvent $\rho_s$.43 Typically, this border layer has a density of 1.05 to 1.20 times that of the bulk. Taking advantage of the significantly different contrasts between the protein and the solvent for X-rays and neutrons in H2O and D2O it was demonstrated that the higher scattering density in the shell cannot be explained by disorder or mobility of the surface side chains in solution and that it is indeed due to a higher density of the bound solvent.44 The scattering from such a particle (macromolecule with hydration shell) in solution is then
$$I(q) = \left\langle \left| A_a(q) - \rho_s A_s(q) + \delta\rho_b A_b(q) \right|^2 \right\rangle \tag{14}$$
43 D. I. Svergun, C. Barberato, and M. H. J. Koch, J. Appl. Crystallogr. 28, 768 (1995).
44 D. I. Svergun, S. Richard, M. H. J. Koch, Z. Sayers, S. Kuprin, and G. Zaccai, Proc. Natl. Acad. Sci. USA 95, 2267 (1998).
where $A_a(q)$ is the scattering amplitude from the particle in vacuo, $A_s(q)$ and $A_b(q)$ are, respectively, the scattering amplitudes from the excluded volume and the hydration layer, both with unit density, and $\delta\rho_b = \rho_b - \rho_s$. This method is implemented in the computer programs CRYSOL43 for X-rays and CRYSON44 for neutrons. These programs utilize spherical harmonics expansion and compute partial amplitudes $A_{lm}(q)$ for all terms in Eq. (14), which permits analytical spherical averaging [see Eq. (12)]. The partial amplitudes can also be used in rigid body modeling as described below. Given the atomic coordinates, for example, from the Protein Data Bank,45,46 these programs either fit the experimental scattering curve using two free parameters, the excluded volume of the particle and the contrast of the hydration layer $\delta\rho_b$, or predict the scattering pattern using the default values of these parameters.
Rigid Body Refinement
The concept of rigid body refinement of quaternary structures against solution scattering data is readily illustrated by considering a macromolecule consisting of two domains with known atomic structures. If domain A is fixed and B is moving and rotating, the scattering intensity of the entire particle is
$$I(q, \alpha, \beta, \gamma, \mathbf{u}) = I_a(q) + I_b(q) + 4\pi^2 \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \operatorname{Re}\left[ A_{lm}(q)\, C_{lm}^{*}(q) \right] \tag{15}$$
where $I_a(q)$ and $I_b(q)$ are the scattering intensities from domains A and B, respectively, $A_{lm}(q)$ are partial amplitudes of the fixed domain A, and $C_{lm}(q)$ are those of domain B rotated by the Euler angles $\alpha$, $\beta$, $\gamma$ and translated by the vector $\mathbf{u}$. The structure and the scattering intensity from such a complex depend on the six positional and rotational parameters and these can be refined to fit the experimental scattering data. The scattering amplitudes from both domains in reference positions are evaluated separately using CRYSOL or CRYSON. Using appropriate algorithms47,48 the amplitudes $C_{lm}(q)$ and thus the intensity $I(q, \alpha, \beta, \gamma, \mathbf{u})$ can be evaluated sufficiently rapidly to employ an exhaustive search of positional parameters to fit the experimental scattering from the complex. Such a straightforward search may, however, yield a model that perfectly fits the data but fails to display
45 F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. E. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
46 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
47 D. I. Svergun, J. Appl. Crystallogr. 24, 485 (1991).
48 D. I. Svergun, Acta Crystallogr. A 50, 391 (1994).
proper intersubunit contacts. Relevant biochemical information (e.g., contacts between specific residues) can be taken into account by using an interactive search mode. Enhanced possibilities for combining interactive and automated search strategies are provided by the rigid body modeling systems ASSA for major UNIX platforms39,49 and MASSHA for Wintel-based machines.50 In these systems, a main three-dimensional graphics program is coupled with computational modules implementing Eq. (15). This permits the user to interactively manipulate the subunits as rigid bodies while observing corresponding changes in the fit to the experimental data and to start automated search programs performing an exhaustive search in a specified range of positional parameters.
Addition of Missing Loops and Domains
Protein function is related not only to the three-dimensional arrangement of polypeptide chains but among other factors also to their intrinsic mobility. Conformational heterogeneity in proteins often renders flexible loops undetectable in structures determined by X-ray crystallography or NMR. In large multidomain proteins, inherent flexibility between domains can prevent successful crystallization, and in these cases high-resolution analysis may be limited to studies of individual domains produced using genetic or proteolytic methods. SAXS offers the possibility of obtaining complementary information about the structures of disordered or missing protein fragments. Methods have been developed for adding missing loops or domains by fixing a known portion of the structure and building the unknown regions to fit the experimental scattering data obtained from the entire particle.51 These methods are based on an extension of the dummy residue approach described above in the section on ab initio methods. The program CREDO adds domains to low-resolution structural models in the form of a chainlike distribution of dummy residues without specific residue information.
The programs CHADD and GLOOPY complement the available high-resolution models by adding the loops or domains at appropriate locations along the polypeptide chain to obtain the best fit to the solution scattering data. The program CHADD represents the missing loops/domains in terms of free residues described by their Cα positions, connected by spring forces along the chain and subject to excluded volume constraints. The program GLOOPY employs additional residue-specific information (hydrophobicity,52 knowledge-based potentials53), as well as
49 M. B. Kozin and D. I. Svergun, J. Appl. Crystallogr. 33, 775 (2000).
50 P. V. Konarev, M. V. Petoukhov, and D. I. Svergun, J. Appl. Crystallogr. 34, 527 (2001).
51 M. V. Petoukhov, N. A. J. Eady, K. A. Brown, and D. I. Svergun, Biophys. J. 83, 3113 (2002).
restrained distributions of angles and dihedral angles involving adjacent Cα atoms,54 to construct native-like folds of the missing domains and loops. The program CHARGE may be used to further account for the secondary structure of the missing fragments. Following validation with a series of simulated examples, these methods were used to add missing loops or domains to several proteins for which partial structures were available.51
Automated Constrained Fit Procedure
Another approach to shape modeling has been developed that is more specifically oriented toward the study of large, modular proteins such as those encountered in the complement system.55 Homology modeling is used to obtain a structure for each domain of the multidomain protein and a library of conformations for each flexible linker is created using molecular dynamics calculations. Starting from this information more than 10⁴ possible models are generated and represented as assemblies of small spheres from which the scattering patterns are calculated using Debye's formula.56 Only those models are retained that give values of selected structural parameters (e.g., the radius of gyration and the radius of gyration of the cross-section in the case of elongated particles) in agreement with the experimental values. They are finally ranked according to the quality of the fit to the experimental scattering data. A study of monomeric factor H of human complement, a protein comprising 20 domains, illustrates this procedure and leads to the conclusion that the molecule is bent back on itself and flexible in solution, a result attributed to the presence of several long linker sequences.55 This conclusion could already be anticipated from the comparison between the value of Dmax (380 Å) and that of the fully extended chain (750 Å).
Applications
Crystal Structure Not Yet Available
In this case solution scattering can be used to determine physicochemical parameters prior to crystallization and to establish the monodispersity of the preparation, which, even if not always a prerequisite for
52 E. S. Huang, S. Subbiah, and M. Levitt, J. Mol. Biol. 252, 709 (1995).
53 M. J. Sippl, J. Mol. Biol. 213, 859 (1990).
54 G. J. Kleywegt, J. Mol. Biol. 273, 371 (1997).
55 M. Aslam and S. J. Perkins, J. Mol. Biol. 309, 1117 (2001).
56 P. Debye, Ann. Physik 46, 809 (1915).
crystallization, often facilitates matters. The case of arcelin-1, a 244-residue lectin-like glycoprotein expressed in the seeds of the kidney bean (Phaseolus vulgaris), which it protects from predation by larvae, provides a good example of such studies. SAXS revealed that the protein is active as a dimer in solution and that anions have specific effects on its aggregation, solubility, and crystal growth.57 These results were used to optimize the crystallization conditions. In the crystals, which diffract beyond 0.19-nm resolution, noncrystallographic symmetry-related arcelin-1 molecules form a lectin-like dimer. A solvent-exposed anion binding site at a crystal-packing interface of the protein explains the solution properties of arcelin-1 and the observed crystal twinning. The systematic study of interactions between proteins in solution (see Tardieu et al.4) is especially important in the context of crystal growth studies. The second application, which obviously requires monodisperse solutions, is ab initio modeling of the shape of the macromolecule. Some examples include the structures of pyruvate decarboxylase from Saccharomyces cerevisiae,58,59 FKBPmem from Legionella pneumophila,60 and a functional unit of Rapana venosa hemocyanin.61 In general, the ab initio shapes are in good agreement with the corresponding crystallographic models, which became available later.62–64 A detailed comparison of a series of thiamine diphosphate-dependent enzymes65 revealed, however, significant differences between the crystal and SAXS structures, as explained in the next section. Ab initio shape determination is also useful when only incomplete crystallographic models are available. As an example, the shape of the
57 L. Mourey, J. D. Pedelacq, C. Fabre, H. Causse, P. Rouge, and J. P. Samama, Proteins 29, 433 (1997).
58 S. Koenig, D. I. Svergun, M. H. J. Koch, G. Huebner, and A. Schellenberger, Eur. Biophys. J. 22, 185 (1993).
59 S. Koenig, D. I. Svergun, V. V. Volkov, L. A. Feigin, and M. H. J. Koch, Biochemistry 37, 5329 (1998).
60 B. Schmidt, S. Koenig, D. I. Svergun, V. Volkov, G. Fischer, and M. H. J. Koch, FEBS Lett. 372, 169 (1995).
61 E. Dainese, D. I. Svergun, M. Beltramini, P. Di Muro, and B. Salvato, Arch. Biochem. Biophys. 373, 154 (2000).
62 P. Arjunan, T. Umland, F. Dyda, S. Swaminathan, W. Furey, M. Sax, B. Farrenkopf, Y. Gao, D. Zhang, and F. Jordan, J. Mol. Biol. 256, 590 (1996).
63 A. Riboldi-Tunnicliffe, B. Koenig, S. Jessen, M. S. Weiss, J. Rahfeld, J. Hacker, G. Fischer, and R. Hilgenfeld, Nat. Struct. Biol. 8, 779 (2001).
64 M. E. Cuff, K. I. Miller, K. E. van Holde, and W. A. Hendrickson, J. Mol. Biol. 278, 855 (1998).
65 D. I. Svergun, M. V. Petoukhov, M. H. J. Koch, and S. Koenig, J. Biol. Chem. 275, 297 (2000).
complete F1 ATPase complex from Escherichia coli, which contains five subunits in the stoichiometry α3β3γδε, was determined ab initio from solution scattering data to a resolution of 3 nm.66 The α3β3 moiety, which is well defined in the crystal structure,67 could be located unambiguously in the shape of the complex. The γ, δ, and ε subunits, which are partly or completely unresolved in the electron density maps, were interactively located by rigid body modeling inside the remaining space using the structure of the α-helical domain of the γ subunit and the NMR structures of the δ and ε subunits.68,69 The resulting model, which minimizes the discrepancy between the experimental and calculated scattering curves, not only incorporates structural data but is also compatible with the available cross-linking evidence. A comparison of the shape of the model with that of bovine F1 ATPase70 illustrates the close structural similarity between the two complexes.71 Differences between the positions of the subunits that are not homologous can only be resolved by a high-resolution structure for the E. coli F1 ATPase.72 Attempts have also been made to use SAXS envelopes as a search model in molecular replacement. This approach has been shown to be successful with two crystals of oligomeric proteins, the trimer nitrite reductase73 and the dimer superoxide dismutase.74 The noncrystallographic symmetry reduced the problem from a six-parameter to a more manageable four-parameter search. This approach should also be useful for large structures. The cryoelectron microscopy model of the 50S bacterial ribosome75 was used for initial phasing of the crystal structure.76 The model
66 D. I. Svergun, I. Aldag, T. Sieck, K. Altendorf, M. H. J. Koch, D. J. Kane, M. B. Kozin, and G. Grueber, Biophys. J. 75, 2212 (1998).
67 J. P. Abrahams, A. G. Leslie, R. Lutter, and J. E. Walker, Nature 370, 621 (1994).
68 S. Wilkens, F. W. Dahlquist, L. P. McIntosh, L. W. Donaldson, and R. A. Capaldi, Nat. Struct. Biol. 2, 961 (1995).
69 R. A. Capaldi, R. Aggeler, S. Wilkens, and G. Grueber, J. Bioenerg. Biomembr. 28, 397 (1996).
70 C. Gibbons, M. G. Montgomery, A. G. Leslie, and J. E. Walker, Nat. Struct. Biol. 7, 1055 (2000).
71 G. Grueber, D. I. Svergun, U. Coskun, T. Lemker, M. H. J. Koch, H. Schaegger, and V. Mueller, Biochemistry 40, 1890 (2001).
72 A. C. Hausrath, R. A. Capaldi, and B. W. Matthews, J. Biol. Chem. 276, 47227 (2001).
73 Q. Hao, F. E. Dodd, J. G. Grossmann, and S. S. Hasnain, Acta Crystallogr. D Biol. Crystallogr. 55, 243 (1999).
74 D. M. Ockwell, M. A. Hough, J. G. Grossmann, S. S. Hasnain, and Q. Hao, Acta Crystallogr. D Biol. Crystallogr. 56, 1002 (2000).
75 J. Frank, J. Zhu, P. Penczek, Y. Li, S. Srivastava, A. Verschoor, M. Radermacher, R. Grassucci, R. K. Lata, and R. K. Agrawal, Nature 376, 441 (1995).
76 N. Ban, B. Freeborn, P. Nissen, P. Penczek, R. A. Grassucci, R. Sweet, J. Frank, P. B. Moore, and T. A. Steitz, Cell 93, 1105 (1998).
obtained from solution scattering,77 which was in good agreement with the EM model, could have been used in a similar way.
Crystal Structure Available
With the high rate at which crystal structures are solved, which should further increase with the structural genomics initiatives, solution studies will in many cases start with a knowledge of the crystal structure of the protein under investigation or at least of a related one. Similarly, for large, supramolecular assemblies, EM models might be available. This opens new opportunities for the interpretation of SAXS data by providing an independent starting point for structural studies.
Comparison between Crystal and Solution Structures. It is generally accepted that at the resolution achieved by SAXS, the crystal and solution structures of a protein are essentially identical, in spite of crystal-packing constraints. By and large, this statement is valid in the case of small, globular, single-domain proteins. This is largely due to the high solvent content of macromolecular crystals, generally between 30 and 60% (v/v), ensuring a proper solvation of the molecules. A reality check is, however, always useful as significant and sometimes even major differences are observed in an increasing number of multidomain proteins. Bacteriophage T4 lysozyme, a 164-residue protein comprising two domains, provides an excellent example.
In the absence of ligand, wild-type T4 lysozyme crystallizes in a closed conformation similar to the peptidoglycan-bound form.78 The first experimental evidence for a more open conformation of the unliganded protein in solution than in the crystal was obtained by SAXS.79 More recently a solution NMR study in which the domain orientation was determined using dipolar couplings also concluded that differences exist between the average solution and crystal conformations for both the unliganded and the substrate-bound enzyme.80 Unligated glutamate dehydrogenase from Thermococcus profondus, which consists of six identical (MM = 46 kDa) subunits, each comprising two domains of roughly the same size separated by the active site cleft, has been investigated by cryocrystallography and SAXS.81 Significant
77 D. I. Svergun, N. Burkhardt, J. S. Pedersen, M. H. J. Koch, V. V. Volkov, M. B. Kozin, W. Meerwink, H. B. Stuhrmann, G. Diedrich, and K. H. Nierhaus, J. Mol. Biol. 271, 602 (1997).
78 L. H. Weaver and B. W. Matthews, Protein Eng. 1, 115 (1987).
79 A. A. Timchenko, O. B. Ptitsyn, A. V. Troitsky, and A. I. Denesyuk, FEBS Lett. 88, 109 (1978).
80 N. K. Goto, N. R. Skrynnikov, F. W. Dahlquist, and L. E. Kay, J. Mol. Biol. 308, 745 (2001).
81 M. Nakasako, T. Fujisawa, S. Adachi, T. Kudo, and S. Higuchi, Biochemistry 40, 3069 (2001).
conformational differences in the orientation of the two domains of the subunits, which most likely result from hinge-bending movements, were found and the SAXS patterns indicate that the hexamer is slightly larger in solution than in the crystal. The authors propose that the enzyme displays spontaneous domain motions in solution that are trapped in the crystal through different crystal contacts, although with reduced magnitude. The structural analysis includes a careful study of the hydration water, correlating in particular the observed variations in the active site cleft between the different subunits with the domain motions and discussing possible coupling between the two sets of changes. The importance of packing forces is further illustrated by a comparative study of multisubunit thiamine diphosphate-dependent enzymes from various organisms.65 Among the proteins investigated, only the compact pyruvate decarboxylase from Zymomonas mobilis adopts the same conformation in solution and in the crystal. In contrast, there are significant differences between the calculated and experimental patterns of transketolase, native and pyruvamide-activated pyruvate decarboxylase from Saccharomyces cerevisiae, and pyruvate oxidase from Lactobacillus plantarum. Using rigid body movements of domains in the crystal structure (essentially rotations around a flexible hinge), models yielding much improved agreement with the experimental data were found. Interestingly, a clear correlation was found between the magnitude of the distortions induced by the crystal packing and the interfacial area between subunits. Another striking case is that of the R structure of aspartate transcarbamylase (ATCase) from E. coli, the enzyme catalyzing the first step of the biosynthetic pathway of pyrimidines, the condensation of carbamyl phosphate and aspartate.
ATCase is a highly regulated enzyme with positive homotropic cooperativity for the binding of the substrate L-aspartate, heterotropic activation by ATP, and inhibition by CTP. This textbook illustration of allostery comprises two trimers of catalytic chains (MM = 34 kDa) and three dimers of regulatory (r) chains (MM = 17 kDa) forming the 306-kDa holoenzyme with quasi-D3 symmetry. Both nucleotides bind to the same site on the regulatory chain, about 6 nm away from the nearest active site. The solution scattering curves computed from the crystal structures of the T82 and R83 states of the enzyme were compared with the experimental X-ray scattering patterns. Experimental and calculated T curves agree, while large deviations reflecting significant differences between the quaternary structures in the crystal and in solution are observed for the R state.84
82 R. C. Stevens, J. E. Gouaux, and W. N. Lipscomb, Biochemistry 29, 7691 (1990).
83 H. Ke, W. N. Lipscomb, Y. Cho, and R. B. Honzatko, J. Mol. Biol. 204, 725 (1988).
84 D. I. Svergun, C. Barberato, M. H. J. Koch, L. Fetler, and P. Vachette, Proteins 27, 110 (1997).
The experimental curve of the R state was fitted using rigid body movements of the subunits in the crystal R structure. Taking the latter as a reference, the distance between the catalytic trimers along the 3-fold axis is increased by 0.34 nm (from 1.08 to 1.42 nm with respect to the T structure) and the trimers are rotated by 11° around the same axis, while the regulatory dimers are rotated by 9° around the corresponding 2-fold axis. The crystal structures used for this study displayed no density for the seven N-terminal residues of the r chains. Reinvestigation of the rigid body modeling using subsequent higher-resolution structures85,86 with complete r chains confirmed the initial conclusions with a somewhat smaller increase in the distance between the two trimers (0.28 instead of 0.34 nm).87 As allosteric enzymes appear to have been selected for easy reorganization on ligand binding, involving only low-energy noncovalent interactions, it is not entirely surprising that the crystal-packing forces, which also originate from noncovalent bonds between neighboring molecules, would distort these subtle architectures. Similar changes, although of smaller amplitude, are likely to be observed with increasing frequency in the case of allosteric proteins, complexes or larger assemblies. Here, the combination of a high-resolution crystal study with solution work is an absolute requirement and one can only concur with the authors of the NMR study on T4 lysozyme mentioned above when they emphasize "the importance of alternative methods to X-ray crystallography for evaluating inter-domain structure,"80 simply to add that SAXS could well be one of them.
Grb2 is an adaptor protein composed of an SH2 domain flanked on either side by an SH3 domain, which is involved in a variety of signal transduction cascades through interactions of its different domains with specific sites on a number of partner proteins.88 In the crystal Grb2 forms a dimer in which the monomer has a compact structure with the two SH3 domains in close contact.89 The suggestion that the molecule was likely to adopt a more open and flexible conformation in solution,89 thereby facilitating interactions with various target proteins, was confirmed by a joint NMR and SAXS study of Grb2 in solution.88 From the NMR study of the whole protein and of its isolated domains, it follows that the two SH3 domains maintain their individual structure within the protein, without detectable interactions. The junction between the SH2 and the C-terminal SH3
85 R. P. Kosman, J. E. Gouaux, and W. N. Lipscomb, Proteins 15, 147 (1993).
86 L. Jin, B. Stec, W. N. Lipscomb, and E. R. Kantrowitz, Proteins 37, 729 (1999).
87 L. Fetler and P. Vachette, J. Mol. Biol. 309, 817 (2001).
88 S. Yuzawa, M. Yokochi, H. Hatanaka, K. Ogura, M. Kataoka, K. Miura, V. Mandiyan, J. Schlessinger, and F. Inagaki, J. Mol. Biol. 306, 527 (2001).
89 S. Maignan, J. P. Guilloteau, N. Fromage, B. Arnoux, J. Becquart, and A. Ducruix, Science 268, 291 (1995).
domains appears to be much more flexible than other regions. The SAXS study indicates that in solution Grb2 is a relatively extended and flexible monomer with significantly larger radius of gyration and maximum dimension than derived from the crystal structure. Starting from the latter, molecular dynamics was used to generate 750 possible structures of Grb2, forming an ensemble considered to represent the "solution model," which also yielded a distance distribution function p(r) close to the experimental curve. The analysis of the intra- and interdomain distance distributions led to the conclusion that in solution, Grb2 exists in an ensemble of conformations with relatively open structures. The structure of complexes is particularly likely to be affected by crystallization conditions and crystal-packing interactions and it is thus especially important to verify the relevance of symmetric arrangements which these interactions may induce. A case in point is that of the ternary complex between aminoacyl-tRNA, elongation factor Tu, and guanosine triphosphate,90 where the existence of a trimeric assembly, similar to that observed in the asymmetric unit of the crystal structure of an otherwise similar complex91 and thought to be the physiologically relevant form, could be excluded. Singular value decomposition of these SAXS titration curves also indicated that there was no temperature dependence of the complex stoichiometry in contrast with the results of previous RNase A protection experiments. The investigation of the growth of low-pH crystals of BPTI combining diffraction and SAXS in solution provides an even clearer example of such effects.92 Crystallization of bovine pancreatic trypsin inhibitor (BPTI) at acidic pH in the presence of thiocyanate, chloride, and sulfate ions leads to three different polymorphs in P2₁, P6₄22, and P6₃22 space groups.
The same decamer with 10 BPTI molecules organized through two perpendicular 2-fold and 5-fold axes forming a well-defined compact object is found in the three polymorphs. In contrast, at basic pH only monomeric crystal forms are observed. The SAXS data recorded during crystallization of BPTI at pH 4.5 in both undersaturated and supersaturated BPTI solutions were analyzed in terms of the formation of discrete oligomers (n = 1 to 10). In addition to the monomer, a dimer, a pentamer, and a decamer were identified within the crystal structure and their scattering patterns were
90 N. Bilgin, M. Ehrenberg, C. Ebel, G. Zaccai, Z. Sayers, M. H. J. Koch, D. I. Svergun, C. Barberato, V. Volkov, P. Nissen, and J. Nyborg, Biochemistry 37, 8163 (1998).
91 P. Nissen, M. Kjeldgaard, S. Thirup, G. Polekhina, L. Reshetnikova, B. F. Clark, and J. Nyborg, Science 270, 1464 (1995).
92 C. Hamiaux, J. Pérez, T. Prangé, S. Veesler, M. Riès-Kautt, and P. Vachette, J. Mol. Biol. 297, 697 (2000).
computed using CRYSOL (Fig. 5). The experimental curves were then analyzed as linear combinations of these theoretical patterns using a nonlinear curve-fitting procedure. The results, confirmed by gel filtration experiments, unambiguously demonstrate the coexistence of two different BPTI particles in solution: a monomer and a decamer, without any evidence for other intermediates. The fraction of decamers increases with salt concentration (i.e., when reaching and crossing the solubility curve). This suggests that crystallization of BPTI at acidic pH is a two-step process whereby decamers first form in under- and supersaturated solutions followed by the growth of what are best described as "BPTI decamer" crystals.
Conformational Changes on Ligand Binding. Although functionally important changes caused by substrate or effector binding can be limited to reorientation of a (few) side chain(s), domain movements are frequently observed. These are best studied in parallel in crystals and in solution. If crystals of both the unliganded and the bound forms are available, a high-resolution description of the structural change can be obtained, with the solution investigation validating or completing this picture. Some caution is required in the interpretation of solution data, especially when no independent measure of the extent of binding (e.g., spectroscopy) is available, as one is more often than not dealing with equilibria. An excellent illustration of this kind of investigation is provided by the SAXS study of the effect of ligand binding on the association properties and conformation of the ligand-binding domain (LBD) of nuclear receptors for retinoic acid, RXRα and RARα.93 While the crystal structures of RXRα and RARα LBDs are in good agreement with the average conformation in solution, some interesting new features were observed in the latter environment.
Measurements on (apo/apo, apo/holo, and holo/holo) RXRα/RARα LBD heterodimers, where apo and holo refer to the occupancy of the ligand-binding site, revealed the existence of three conformational states suggesting that the subunits are structurally independent within the heterodimers. Unliganded RXRα LBD forms dimers and tetramers in solution, and a structural model of this autorepressed tetramer was proposed, which was confirmed by an independent crystal study.94 In contrast to the monomer observed in the crystal, holo RXRα LBD is a dimer in solution. Multimeric proteins are often difficult to crystallize while their individual subunits are more amenable to crystallization and structure
93 P. F. Egea, N. Rochel, C. Birck, P. Vachette, P. A. Timmins, and D. Moras, J. Mol. Biol. 307, 557 (2001).
94 R. T. Gampe, V. G. Montana, M. H. Lambert, G. B. Wisely, M. V. Milburn, and E. H. Xu, Genes Dev. 14, 2229 (2000).
Fig. 5. Ribbon representation of the monomer, dimer, pentamer, and decamer of BPTI, with their calculated scattering patterns. Courtesy of C. Hamiaux and T. Prangé.
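The decomposition of a measured BPTI curve into contributions from the oligomers of Fig. 5 can be illustrated with synthetic data. The sketch below is not the procedure used in the cited study, which relied on a nonlinear curve-fitting routine and CRYSOL patterns: here homogeneous spheres stand in for the computed oligomer patterns, the monomer radius of 1.5 nm and the q range are arbitrary illustrative values, and an ordinary unweighted linear least-squares fit (NumPy `lstsq`) replaces the original fitting procedure.

```python
import numpy as np


def sphere_form_factor(q, radius):
    """Normalized scattering intensity of a homogeneous sphere, P(0) = 1."""
    x = q * radius
    return (3.0 * (np.sin(x) - x * np.cos(x)) / x**3) ** 2


q = np.linspace(0.1, 3.0, 200)        # momentum transfer in nm^-1
oligomer_sizes = np.array([1, 2, 5, 10])
monomer_radius = 1.5                  # nm, illustrative value only

# One column per oligomer. At equal mass concentration the forward
# scattering of an n-mer is n times that of the monomer.
components = np.column_stack([
    n * sphere_form_factor(q, monomer_radius * n ** (1.0 / 3.0))
    for n in oligomer_sizes
])

# Synthetic "measurement": 70% of the mass as monomer, 30% as decamer.
true_fractions = np.array([0.7, 0.0, 0.0, 0.3])
measured = components @ true_fractions

# Unweighted linear least squares for the mass fractions.
fitted, *_ = np.linalg.lstsq(components, measured, rcond=None)
for n, w in zip(oligomer_sizes, fitted):
    print(f"{n:2d}-mer: mass fraction {w:.3f}")
```

With real data the fit would be weighted by the experimental errors and the fractions constrained to be nonnegative; the unconstrained fit is used here only to keep the example short.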
determination. An illustrative case is that of the cAMP-dependent protein kinase (PKA), a tetramer comprising two identical catalytic (C) subunits and two regulatory (R) subunits. The crystal structure of the C subunit has been solved, as well as that of the cAMP-binding site of the R subunit together with the NMR structure of the dimerization domain of the R subunit. Small-angle X-ray and neutron scattering were used to investigate the quaternary structure of the holoenzyme and of a heterodimer R′C (R′ is a truncation mutant of R) with partially deuterated R subunits, which enabled the authors to obtain structural parameters of the C and R subunits within the complex.18 From these data, a low-resolution model was derived for the holoenzyme and the dimer, using the program SASMODEL mentioned earlier. The crystal structures of C and R fit well within the shape of the R′C dimer model, with C in a closed conformation similar to that observed for C with a bound pseudosubstrate peptide. The model for PKA holoenzyme has an extended dumbbell shape, which was constrained by taking into account the results of several mutation studies identifying the likely residues important for R–C interactions. Finally, the study suggests that the PKA structure may be flexible with a hinge movement of each lobe with respect to the dimerization domain. This approach combining the available biochemical and mutagenesis data with crystallographic, NMR, and solution scattering studies has been successfully applied to a number of systems such as troponin C–troponin I interactions,95 calmodulin–myosin light chain kinase interactions,96 and the assembly of immunoglobulin-like modules in titin.97 A study of ligand-induced conformational changes involves the R form of allosteric ATCase mentioned above.
A new and even more extended form of the enzyme was observed for the ternary complex of the enzyme with PALA (N-phosphonacetyl-L-aspartate), a bisubstrate analog, and the allosteric activator Mg-ATP.87 Here, the increase in the intertrimer distance with respect to the crystallographic R structure amounts to 0.44 nm, that is, 40% larger than above. A proposal is put forward to account for the difference between free ATP, which causes no other difference in the scattering pattern than that directly due to its contribution to scattering, and Mg-ATP. The main features of a plausible mechanism for the effect of Mg-ATP are delineated, based on the crystal structures of the enzyme in the presence of ATP. It is worth mentioning that the information content of the SAXS pattern is sufficient to establish that Mg-ATP specifically
95 G. A. Olah and J. Trewhella, Biochemistry 33, 12800 (1994).
96 J. K. Krueger, G. Zhi, J. T. Stull, and J. Trewhella, Biochemistry 37, 13997 (1998).
97 S. Improta, J. K. Krueger, M. Gautel, R. A. Atkinson, J. F. Lefèvre, S. Moulton, J. Trewhella, and A. Pastore, J. Mol. Biol. 284, 761 (1998).
alters the R quaternary structure without modifying the T–R equilibrium, which depends only on substrate binding to the active site. Indeed, the SAXS pattern constitutes a specific and sensitive probe of the quaternary structure of the enzyme, which can be used to titrate the T–R structural transition at equilibrium by recording a series of patterns (typically 20 curves) at different substrate concentrations.98,99 The entire data set is then analyzed using singular value decomposition (SVD), a method giving the minimal number of component patterns (i.e., conformations) involved in the transition. An experiment using PALA, from which a few representative curves are shown in Fig. 6, was performed. The three well-defined crossing points shown by arrows are already suggestive of a two-state transition. This is confirmed by the SVD analysis, which indicates that all curves can be approximated within experimental errors by a linear combination of only two eigenvectors. In other words, the transition triggered by PALA involves only the two extreme structures T and R, without any detectable intermediate species. This supports the view of an equilibrium between the two forms that is shifted by PALA binding, as proposed by Monod, Wyman, and Changeux in their model of a concerted transition.100 With only two components, the analysis gives, after proper scaling, the fractional concentration of each form, thereby leading to the titration curve of the structural transition, that is, the variation of the fraction of molecules in the R state versus the active site occupancy, represented by the thin line in the inset of Fig. 6. Practically all molecules appear to be in the R state when only two-thirds of the active sites are occupied. Similar titration experiments have been performed using succinate, an analog of aspartate, in the presence of saturating amounts of carbamyl phosphate (CP).99 Again, only two quaternary structures are detected.
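The SVD analysis of a titration series described above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the procedure of the cited studies: the two end-state patterns are hypothetical Guinier-like curves, the noise is taken as uniform and Gaussian with a known standard deviation, the significance threshold is a simple heuristic, and all function names are invented for the example.

```python
import numpy as np


def count_components(curves, sigma, margin=2.0):
    """Count singular values that stand above the assumed noise level.

    curves: array of shape (n_q, n_curves), one scattering pattern per column.
    sigma:  assumed standard deviation of uniform Gaussian noise.
    The threshold margin * sigma * (sqrt(n_q) + sqrt(n_curves)) is a heuristic
    based on the typical largest singular value of a pure-noise matrix.
    """
    n_q, n_curves = curves.shape
    singular_values = np.linalg.svd(curves, compute_uv=False)
    threshold = margin * sigma * (np.sqrt(n_q) + np.sqrt(n_curves))
    return int(np.sum(singular_values > threshold)), singular_values


def two_state_fractions(curves):
    """Fraction of the second end state in each curve, assuming two states.

    The first and last columns are taken as the pure end-state patterns, and
    every curve is fitted by least squares as a linear combination of them.
    """
    basis = curves[:, [0, -1]]
    coeff, *_ = np.linalg.lstsq(basis, curves, rcond=None)
    return coeff[1] / coeff.sum(axis=0)


def synthetic_titration(n_q=200, n_curves=20, sigma=1e-3, seed=0):
    """Hypothetical two-state series: mixtures of two Guinier-like patterns."""
    rng = np.random.default_rng(seed)
    q = np.linspace(0.05, 1.0, n_q)          # nm^-1
    i_t = np.exp(-(q * 3.0) ** 2 / 3.0)      # "T-like" pattern, Rg = 3 nm
    i_r = np.exp(-(q * 4.0) ** 2 / 3.0)      # "R-like" pattern, Rg = 4 nm
    frac_r = np.linspace(0.0, 1.0, n_curves)
    clean = np.outer(i_t, 1.0 - frac_r) + np.outer(i_r, frac_r)
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    return noisy, frac_r


curves, true_fraction = synthetic_titration()
n_components, _ = count_components(curves, sigma=1e-3)
estimated_fraction = two_state_fractions(curves)
print("significant components:", n_components)
print("largest error in R fraction:", np.abs(estimated_fraction - true_fraction).max())
```

With real data the noise varies with q, so the curves would usually be weighted by their experimental errors before the decomposition; that step is omitted here for brevity.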
However, the scattering patterns of the unligated enzyme (T state) and of the ATCase–CP complex display small but significant differences. An SVD analysis established that a third quaternary structure is involved. This proves that CP modifies the quaternary structure of ATCase to a T′ state, close to but different from T. A last titration experiment by PALA in the presence of saturating amounts of CP also reveals a two-state transition, but the titration curve is systematically shifted by 10 to 15% toward R (inset of Fig. 6, solid symbols). Therefore, CP changes the quaternary structure of ATCase to T′, a structure more readily converted to R by PALA.99 This work illustrates the use of SAXS as a probe of the conformation of an enzyme for structural
98 L. Fetler, P. Tauc, G. Hervé, M. F. Moody, and P. Vachette, J. Mol. Biol. 251, 243 (1995).
99 L. Fetler, P. Tauc, and P. Vachette, J. Appl. Crystallogr. 30, 781 (1997).
100 J. Monod, J. Wyman, and J. P. Changeux, J. Mol. Biol. 12, 88 (1965).
[Figure 6 plot: scattered intensity I(q) versus q = (4π sin θ)/λ (nm−1). Inset: R versus [PALA]Tot/active site.]
Fig. 6. X-ray scattering curves of ATCase in the presence of increasing PALA concentrations, expressed as moles of PALA per mole of active site (0, 0.20, 0.40, 0.60, and 2). The arrows point toward the three crossing points. Inset: Titration curves (R versus [PALA]tot) for PALA alone and for PALA plus 5 mM CP. The lines simply join the successive data points. The diagonal corresponds to the saturation function Y. From Fetler et al.99 with kind permission from the IUCr.
biochemistry studies, providing a wealth of information on the relationship between structural changes and mechanism. Removal of metal atoms from metalloproteins may cause major structural rearrangements, as reported in a study of human ceruloplasmin, a copper-containing serum glycoprotein with multiple functions.101 The crystal structure at 0.31-nm resolution reveals that the 1046-amino acid-long polypeptide chain is folded in six similar domains arranged in three pairs
101 P. Vachette, E. Dainese, V. B. Vasyliev, P. Di Muro, M. Beltramini, D. I. Svergun, V. De Filippis, and B. Salvato, J. Biol. Chem. 277, 40823 (2002).
forming a triangular array around a pseudo-3-fold axis and joined by flexible linkers, with three mononuclear copper sites in domains 2, 4, and 6 and a trinuclear cluster at the interface between N-terminal domain 1 and C-terminal domain 6 (Fig. 7B).102,103 The apo form of the molecule following copper removal was proposed to be in a ‘‘molten globule’’ state.104 On removal of copper, the SAXS pattern undergoes a dramatic modification (Fig. 7A), with a 40% increase in both the radius of gyration (from 3.25 to 4.52 nm) and the maximum dimension (from 11.0 to 15.5 nm) indicative of a large structural change toward a more open conformation. Based on the crystal structure with its quasisymmetrical distribution of domains and the trinuclear copper site at the N and C interface, it was proposed that this interface no longer exists in the absence of copper. This allows each of the (1,2) and (5,6) pairs of domains, connected to the central (3,4) pair by flexible linkers, to move freely in solution, leading to a more extended conformation of the entire apo molecule, in qualitative agreement with the SAXS observations. Calculated SAXS patterns obtained by rigid body movements on models yield an excellent agreement with the experimental data (Fig. 7A, thick line; and Fig. 7B). As one is dealing with a protein exploring a large conformational space, the specific open model obtained by this procedure represents nothing more than the conformation that best fits the data. This is a clear example in which crystal studies cannot give a direct answer, although the crystal structure of the holo form is necessary for a meaningful interpretation of the scattering data. 
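Rigid-body modeling of this kind rests on computing a scattering pattern from a trial model and comparing it with the data. The following is a minimal sketch, assuming a coarse model of identical point-like beads and the standard Debye formula, I(q) = Σ_i Σ_j sin(q r_ij)/(q r_ij). The function names and the two-domain toy geometry are hypothetical; this is not the program or the model used in the ceruloplasmin study.

```python
import numpy as np


def radius_of_gyration(coords):
    """Radius of gyration of equally weighted beads (same unit as coords)."""
    centered = coords - coords.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())


def debye_intensity(coords, q):
    """Debye formula for identical point scatterers with unit form factor.

    coords: (n_beads, 3) array in nm; q: 1D array in nm^-1.
    np.sinc(x) is sin(pi x)/(pi x), hence the division by pi.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.linalg.norm(diff, axis=-1)                 # pairwise distances
    qr = q[:, None, None] * r[None, :, :]
    return np.sinc(qr / np.pi).sum(axis=(1, 2))


def two_domain_model(separation, n_beads=50, seed=0):
    """Toy model: one rigid bead cloud and a copy shifted along x (nm)."""
    rng = np.random.default_rng(seed)
    domain = rng.normal(scale=1.0, size=(n_beads, 3))
    shifted = domain + np.array([separation, 0.0, 0.0])
    return np.vstack([domain, shifted])


q = np.linspace(0.0, 3.0, 61)
closed_model = two_domain_model(separation=3.0)
open_model = two_domain_model(separation=6.0)   # rigid-body move of one domain
for name, model in (("closed", closed_model), ("open", open_model)):
    intensity = debye_intensity(model, q)
    print(name, "Rg = %.2f nm" % radius_of_gyration(model),
          "I(0) = %.0f" % intensity[0])
```

Moving one rigid domain away from the other increases the radius of gyration and steepens the low-angle part of the calculated curve, which is the kind of change a fit to an apo-like pattern would exploit.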
An example of the power of coupling SAXS to X-ray absorption spectroscopy (XAFS) based on a known crystal structure is provided by a study of the ligand-induced conformational change in transferrin.105 Indeed, previous crystallographic and SAXS studies led to questions concerning the nature of the change: was the apo molecule sufficiently mobile to continuously explore a sufficiently large conformational space to also sample the closed form, or was the open conformational state a well-defined one, with the molecule only occasionally in the closed state? Several mutants were constructed and studied by SAXS. One, with the mutation located close to one of the hinges, was also studied by XAFS to probe possible ligand
102 I. Zaitseva, V. Zaitsev, G. Card, K. Moshkov, B. Bax, A. Ralph, and P. Lindley, J. Biol. Inorg. Chem. 1, 15 (1996).
103 P. F. Lindley, G. Card, I. Zaitseva, V. Zaitsev, B. Reinhammar, E. Selin-Lindgren, and K. Yoshida, J. Biol. Inorg. Chem. 2, 454 (1997).
104 V. De Filippis, V. B. Vassiliev, M. Beltramini, A. Fontana, B. Salvato, and V. S. Gaitskhoki, Biochim. Biophys. Acta 1297, 119 (1996).
105 J. G. Grossmann, J. B. Crawley, R. W. Strange, K. J. Patel, L. M. Murphy, M. Neu, R. W. Evans, and S. S. Hasnain, J. Mol. Biol. 279, 461 (1998).
Fig. 7. (A) X-ray scattering patterns of the holo (thin line) and apo (error bars) forms of human ceruloplasmin. The scattering pattern of the best-fitting model (thick line) is superimposed on the apo experimental data. (B) View along the pseudo-3-fold axis of the crystal structure of the holo protein (left) and of the model of the apo form (right) in ribbon representation. The three copper atoms at the (1–6) interface are shown in CPK at the bottom of the molecule (WebLab ViewerLite37; Molecular Simulations, 1999).
rearrangement around the iron atom. The results indicate that both the iron-loaded form and the metal-free forms of the molecule present a unique conformation. In addition, some of the mutants are in an intermediate conformation, displaying the initial 20° hinge–twist of one domain without the hinge-bending motion, suggesting a two-step process for the closure of the interdomain cleft.
Large Structures

A special case of incompleteness occurs for large structures such as viruses for which low-resolution (100–1 nm) and high-resolution (1.5–0.3 nm) data may be available from cryoelectron microscopy and crystallography, respectively. Merging such data, which have only small, if any, overlap, is greatly facilitated by small-angle diffraction106 and also leads to much improved models where loosely ordered regions (e.g., nucleic acids), which scatter but do not diffract, can be detected. For spherical particles the intensities can also be usefully compared with those of the scattering patterns in solution. Small-angle scattering data also yield better approximations to the contrast transfer function for image reconstruction in electron microscopy107 for icosahedral and spherical particles. A wide-angle X-ray solution scattering pattern was also successfully used to correct for the contrast transfer function in the cryo-EM study of the 70S ribosome from E. coli at a resolution of 1.15 nm.108 Using contrast variation on specifically deuterated ribosomes, a map of the protein–RNA distribution in the 70S ribosome from E. coli at a resolution of 3 nm was constructed,11 which was confirmed by high-resolution crystallographic studies.109,110 This is an area where progress can be expected as an increasing number of large structures is being investigated.

Time-Resolved Experiments

The availability of high-flux X-ray beams and high rate detectors has revived interest in fast kinetic studies. This is true for crystallography and SAXS alike. The aims are, however, different in both approaches.
Crystallography, using the wide bandpass radiation from an undulator in single bunch mode together with femtosecond laser excitation, aims at determining the steps of a (hypothetical) defined chemical mechanism (i.e., the sequence of rupture and formation of covalent bonds in the substrate and occasionally between enzyme and substrate) on the femtosecond to picosecond scale.111,112 Alternatively, hypothetical intermediate structures are
106 H. Tsuruta, V. S. Reddy, W. R. Wikoff, and J. E. Johnson, J. Mol. Biol. 284, 1439 (1998).
107 P. A. Thuman-Commike, H. Tsuruta, B. Greene, P. E. Prevelige, J. King, and W. Chiu, Biophys. J. 76, 2249 (1999).
108 I. S. Gabashvili, R. K. Agrawal, C. M. Spahn, R. A. Grassucci, D. I. Svergun, J. Frank, and P. Penczek, Cell 100, 537 (2000).
109 N. Ban, P. Nissen, J. Hansen, P. B. Moore, and T. A. Steitz, Science 289, 905 (2000).
110 M. M. Yusupov, G. Z. Yusupova, A. Baucom, K. Lieberman, T. N. Earnest, J. H. Cate, and H. F. Noller, Science 292, 883 (2001).
111 B. Perman, V. Srajer, Z. Ren, T. Teng, C. Pradervand, T. Ursby, D. Bourgeois, F. Schotte, M. Wulff, R. Kort, K. Hellingwerf, and K. Moffat, Science 279, 1946 (1998).
flash-frozen and studied in a more conventional way.113 In contrast, SAXS addresses global, quaternary structure changes and entropy-driven assembly processes, which occur with characteristic times on the order of 1 ms (i.e., essentially diffusive or dissipative processes, as opposed to strictly chemical processes). The process is initiated using fast mixing devices (stopped-flow)114 or other perturbation methods. For instance, the kinetics of the allosteric transition of ATCase has been investigated at low temperature (5 ) in the presence of glycerol to slow down the reaction, yielding kinetic results supporting the model of a concerted transition.115 Assembly mechanisms can also be studied, as well as the (un)folding of proteins or nucleic acids.116 Some steps in assembly and maturation processes in viruses117,118 or Ca2+-dependent swelling119 are other examples of time-resolved studies. This last study is currently being pursued using both the intense beam of the ESRF to increase the useful resolution of the data and neutrons at the ILL to distinguish the rearrangement of the protein capsid from that of the RNA moiety. The study of the first and fastest events in protein folding or unfolding occurring in the submillisecond range can benefit from the use of pink radiation,120–122 the wide bandpass beam coming from a double multilayer monochromator123 or from the unmonochromatized first harmonic of an undulator, the higher orders being eliminated by mirrors. Obviously, radiation damage is a major concern in these applications. Here too, the
112 K. Moffat, Nat. Struct. Biol. 5(Suppl.), 641 (1998).
113 A. Heine, G. DeSantis, J. G. Luz, M. Mitchell, C. H. Wong, and I. A. Wilson, Science 294, 369 (2001).
114 H. Tsuruta, T. Nagamura, K. Kimura, Y. Igarashi, A. Kajita, Z. X. Wang, K. Wakabayashi, Y. Amemiya, and H. Kihara, Rev. Sci. Instrum. 60, 2356 (1989).
115 H. Tsuruta, P. Vachette, T. Sano, M. F. Moody, Y. Amemiya, K. Wakabayashi, and H. Kihara, Biochemistry 33, 10007 (1994).
116 R. Russell, I. S. Millett, S. Doniach, and D. Herschlag, Nat. Struct. Biol. 7, 367 (2000).
117 C. Berthet-Colominas, M. Cuillel, M. H. J. Koch, P. Vachette, and B. Jacrot, Eur. Biophys. J. 15, 159 (1987).
118 R. Lata, J. F. Conway, N. Cheng, R. L. Duda, R. W. Hendrix, W. R. Wikoff, J. E. Johnson, H. Tsuruta, and A. C. Steven, Cell 100, 253 (2000).
119 J. Pérez, S. Defrenne, J. Witz, and P. Vachette, Cell. Mol. Biol. 46, 937 (2000).
120 K. W. Plaxco, I. S. Millett, D. J. Segel, S. Doniach, and D. Baker, Nat. Struct. Biol. 6, 554 (1999).
121 D. J. Segel, A. Bachmann, J. Hofrichter, K. O. Hodgson, S. Doniach, and T. Kiefhaber, J. Mol. Biol. 288, 489 (1999).
122 L. Pollack, M. W. Tate, A. C. Finnefrock, C. Kalidas, S. Trotter, N. C. Darnton, L. Lurio, R. H. Austin, C. A. Batt, S. M. Gruener, and S. G. J. Mochrie, Phys. Rev. Lett. 86, 4962 (2001).
123 H. Tsuruta, S. Brennan, Z. U. Rek, T. C. Irving, W. H. Tompkins, and K. O. Hodgson, J. Appl. Crystallogr. 31, 672 (1998).
availability of the crystal structure of the starting and/or the end point of a structural transition considerably increases the potential and scope of the analysis of time-resolved data. The basic numerical techniques used to detect intermediates (e.g., singular value decomposition) or to model scattering curves124 are essentially the same as those used for equilibrium studies on mixtures.

Conclusion and Perspectives
The perception of small-angle X-ray scattering by solutions of biological macromolecules has undergone deep changes as large facilities and increasingly powerful data analysis and modeling software have made it more accessible to a larger community. The value of SAXS and the unique links it provides between high- and low-resolution structure determination methods and also between structural, hydrodynamic, and other scattering methods has been amply demonstrated in many applications. Indeed, a growing number of crystallographers show interest in using SAXS as a complement to crystal studies. A new generation of SAXS instruments using undulator beams has become available at several facilities. The full potential of these instruments, especially for time-resolved experiments, has not yet been exploited, and the modeling methods clearly also have a margin of progress. One can therefore be confident that SAXS will become in the near future part of the methodological toolkit of any structural biologist and will allow those who realize that there is no life at equilibrium to contribute significantly to the understanding of the more complex interactions and structural changes that are characteristic of biological systems.

Acknowledgments

We thank A. Ducruix and J. Pérez for critical reading of the manuscript.
124 S. Koenig, D. I. Svergun, M. H. J. Koch, G. Huebner, and A. Schellenberger, Biochemistry 31, 8726 (1992).
[25] Identifying Importance of Amino Acids for Protein Folding from Crystal Structures

By Nikolay V. Dokholyan, Jose M. Borreguero, Sergey V. Buldyrev, Feng Ding, H. Eugene Stanley, and Eugene I. Shakhnovich

Introduction
One of the most intriguing questions in biophysics is how protein sequences determine their unique three-dimensional structure. This question, known as the protein-folding problem,1–25 is of great importance because understanding protein-folding mechanisms is a key to successful manipulation of protein structure and, consequently, function. The ability to manipulate protein function is, in turn, crucial for effective drug discovery.
1 C. B. Anfinsen, Science 181, 223 (1973).
2 H. Taketomi, Y. Ueda, and N. Go, Int. J. Pept. Protein Res. 7, 445 (1975).
3 N. Go, Annu. Rev. Biophys. Bioeng. 12, 183 (1983).
4 J. D. Bryngelson and P. G. Wolynes, J. Phys. Chem. 93, 6902 (1989).
5 M. Karplus and E. I. Shakhnovich, in “Protein Folding” (T. Creighton, ed.). W. H. Freeman, New York, 1994.
6 O. B. Ptitsyn, Protein Eng. 7, 593 (1994).
7 V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich, Biochemistry 33, 10026 (1994).
8 E. I. Shakhnovich, V. I. Abkevich, and O. Ptitsyn, Nature 379, 96 (1996).
9 D. K. Klimov and D. Thirumalai, Phys. Rev. Lett. 76, 4070 (1996).
10 A. R. Fersht, Curr. Opin. Struct. Biol. 7, 3 (1997).
11 E. I. Shakhnovich, Curr. Opin. Struct. Biol. 7, 29 (1997).
12 J. N. Onuchic, Z. Luthey-Schulten, and P. G. Wolynes, Annu. Rev. Phys. Chem. 48, 545 (1997).
13 E. I. Shakhnovich, Fold. Des. 3, R45 (1998).
14 C. Micheletti, J. R. Banavar, A. Maritan, and F. Seno, Phys. Rev. Lett. 82, 3372 (1998).
15 V. S. Pande, A. Yu Grosberg, D. S. Rokhsar, and T. Tanaka, Curr. Opin. Struct. Biol. 8, 68 (1998).
16 V. P. Grantcharova, D. S. Riddle, J. V. Santiago, and D. Baker, Nat. Struct. Biol. 5, 714 (1998).
17 J. C. Martinez, M. T. Pisabarro, and L. Serrano, Nat. Struct. Biol. 5, 721 (1998).
18 H. S. Chan and K. A. Dill, Proteins Struct. Funct. Genet. 30, 2 (1998).
19 A. F. P. de Araújo, Proc. Natl. Acad. Sci. USA 96, 12482 (1999).
20 A. R. Dinner and M. Karplus, J. Chem. Phys. 37, 7976 (1999).
21 B. D. Bursulaya and C. L. Brooks, J. Am. Chem. Soc. 121, 9947 (1999).
22 B. Nölting and K. Andert, Proteins Struct. Funct. Genet. 41, 288 (2000).
23 S. B. Ozkan, I. Bahar, and K. A. Dill, Nat. Struct. Biol. 8, 765 (2001).
24 N. V. Dokholyan, Recent Res. Dev. Stat. Phys. 1, 77 (2001).
25 A. V. Finkelstein and O. B. Ptitsyn, eds., “Protein Physics: A Course of Lectures.” Academic Press, Boston, 2002.
METHODS IN ENZYMOLOGY, VOL. 374
Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00
identifying protein folding from crystal structures
Understanding the mechanisms of protein folding is also crucial for deciphering the imprints of evolution on protein sequence and structural spaces. For example, some positions along the sequence in a set of structurally similar nonhomologous proteins are more conserved in the course of evolution than others.26 Such conservation can be attributed to evolutionary pressure to preserve amino acids that play a crucial role in (1) protein function, (2) stability, and (3) folding kinetics, that is, the ability of proteins to rapidly reach their native state.27,28 Interestingly, function is not conserved among nonhomologous proteins that share the same fold, so we can assume that the evolutionary pressure to preserve functionally important amino acids in such a set of proteins is “weaker” than the pressure to preserve amino acids involved in protein stability and folding kinetics. It has been shown27 that up to 80% of the conservation of amino acids observed in the course of evolution can be explained by pressure to preserve protein stability. Thus, to understand the role of evolutionary pressure to preserve rapid folding kinetics, we need to be able to quantify the importance of amino acids for protein-folding kinetics. Because of the difficulty and cost of experimental studies, it is important to develop rapid computational tools to identify the folding kinetics of a given protein from its crystal structure. The ultimate goal is to be able to predict the folding kinetics of a given protein from its sequence. However, this goal requires the solution of the protein-folding problem (Section II), that is, understanding how a given amino acid sequence folds into a native protein structure. Since protein crystal structures provide invaluable information about amino acid interactions, it is possible to reduce the problem to identifying the folding kinetics of a protein from its structure.
The revolution in protein purification and structure-refining methods29–32 produced a large and constantly growing number of high-quality protein crystal structures. The underlying assumption in these models is that all the information necessary to fold a specific protein is encoded in the protein structure and that the crystal structure can be used as the primary source of information on protein-folding mechanisms. This assumption became the foundation for the development of theoretical and
26 L. A. Mirny and E. I. Shakhnovich, J. Mol. Biol. 291, 177 (1999).
27 N. V. Dokholyan and E. I. Shakhnovich, J. Mol. Biol. 312, 289 (2001).
28 N. V. Dokholyan, L. A. Mirny, and E. I. Shakhnovich, Physica A 314, 600 (2002).
29 J. Drenth, “Principles of Protein X-Ray Crystallography.” Springer-Verlag, New York, 1994.
30 C. W. Carter, Jr. and R. M. Sweet, eds., Macromolecular Crystallography, Part A, Methods Enzymol. 276 (1997).
31 C. W. Carter, Jr. and R. M. Sweet, eds., Methods Enzymol. 277 (1997).
32 J. Roach and C. W. Carter, Jr., Acta Crystallogr. A 59, 273–280 (2003).
computational models of protein-folding mechanisms.4,33–46 Surprisingly, such an approach has already yielded promising and robust results, which are summarized in this chapter. Here we present an overview of computational techniques for reconstructing the folding mechanisms of proteins from their crystal structures. We also describe methods that we have validated on the Src SH3 domain, a 56-amino acid protein, studied in detail in experiments16,22,43,47–59 and molecular dynamics simulations.45,60–63 We describe a new protein model
33 H. S. Chan and K. A. Dill, Annu. Rev. Biochem. 20, 447 (1991).
34 A. M. Gutin and E. I. Shakhnovich, J. Chem. Phys. 98, 8174 (1993).
35 C. J. Camacho and D. Thirumalai, Proc. Natl. Acad. Sci. USA 90, 6369 (1993).
36 J. D. Bryngelson, J. Chem. Phys. 100, 6038 (1994).
37 V. S. Pande, A. Y. Grosberg, and T. Tanaka, Proc. Natl. Acad. Sci. USA 91, 12972 (1994).
38 A. J. Li and V. Daggett, Proc. Natl. Acad. Sci. USA 91, 10430 (1994).
39 H. Li, C. Tang, and N. S. Wingreen, Proc. Natl. Acad. Sci. USA 95, 4987 (1998).
40 V. Munoz and W. A. Eaton, Proc. Natl. Acad. Sci. USA 96, 11311 (1999).
41 O. V. Galzitskaya and A. V. Finkelstein, Proc. Natl. Acad. Sci. USA 96, 11299 (1999).
42 E. Alm and D. Baker, Proc. Natl. Acad. Sci. USA 96, 11305 (1999).
43 R. Guerois and L. Serrano, J. Mol. Biol. 304, 967 (2000).
44 H. Nymeyer, N. D. Socci, and J. N. Onuchic, Proc. Natl. Acad. Sci. USA 97, 634 (2000).
45 C. Clementi, H. Nymeyer, and J. N. Onuchic, J. Mol. Biol. 278, 937 (2000).
46 N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley, and E. I. Shakhnovich, J. Mol. Biol. 296, 1183 (2000).
47 A. P. Combs, T. M. Kapoor, S. Feng, and J. K. Chen, J. Am. Chem. Soc. 118, 287 (1996).
48 S. Feng, J. K. Chen, H. Yu, J. A. Simons, and S. L. Schreiber, Science 266, 1241 (1994).
49 S. Feng, C. Kasahara, R. J. Rickles, and S. L. Schreiber, Proc. Natl. Acad. Sci. USA 92, 12408 (1995).
50 H. Yu, M. K. Rosen, T. B. Shin, C. Seidel-Dugan, and J. S. Brugge, Science 258, 1665 (1992).
51 V. P. Grantcharova and D. Baker, Biochemistry 36, 15685 (1997).
52 D. S. Riddle, V. P. Grantcharova, J. V. Santiago, E. Alm, I. Ruczinski, and D. Baker, Nat. Struct. Biol. 6, 1016 (1999).
53 A. R. Viguera, J. C. Martinez, V. V. Filimonov, P. L. Mateo, and L. Serrano, Biochemistry 33, 10925 (1994).
54 A. R. Viguera, L. Serrano, and M. Wilmanns, Nat. Struct. Biol. 10, 874 (1996).
55 J. C. Martinez, A. R. Viguera, R. Berisio, M. Wilmanns, P. L. Mateo, V. V. Filimonov, and L. Serrano, Biochemistry 38, 549 (1999).
56 S. Knapp, P. T. Mattson, P. Christova, K. D. Berndt, A. Karshikoff, M. Vihinen, C. I. Smith, and R. Ladenstein, Proteins Struct. Funct. Genet. 23, 309 (1998).
57 Y.-K. Mok, E. L. Elisseeva, A. R. Davidson, and J. D. Forman-Kay, J. Mol. Biol. 307, 913 (2001).
58 J. G. B. Northey, A. A. Di Nardo, and A. R. Davidson, Nat. Struct. Biol. 9, 126 (2002).
59 J. G. B. Northey, K. L. Maxwell, and A. R. Davidson, J. Mol. Biol. 320, 389 (2002).
60 J. Gsponer and A. Caflisch, J. Mol. Biol. 309, 285 (2001).
61 C. Clementi, P. A. Jennings, and J. N. Onuchic, J. Mol. Biol. 311, 879 (2001).
62 J. M. Borreguero, N. V. Dokholyan, S. Buldyrev, H. E. Stanley, and E. I. Shakhnovich, J. Mol. Biol. 318, 863 (2002).
63 F. Ding, N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley, and E. I. Shakhnovich, Biophys. J. 83, 3525 (2002).
and show that the thermodynamics of Src SH3 from molecular dynamics simulations is consistent with that observed experimentally. We test the proposed mechanism for protein folding, the nucleation scenario, and identify the transition state ensemble of protein conformations characterized by the maximum of the free energy.7,13,15,46,64

How Do Proteins Fold?
Levinthal “Paradox”

In 1968, Levinthal formulated a simple argument that points out the nonrandom character of protein-folding kinetics,65 which we illustrate with the following example. Consider a 100-amino acid protein and let us estimate the time necessary for such a protein to reach its unique native state by a random search. If each amino acid can move in 6 possible directions (e.g., up, down, left, right, forward, and backward), the total number of conformations that a 100-amino acid protein can assume is $6^{100} \approx 10^{78}$. The fastest vibrational mode of a protein is that of its tails and has a period on the order of 1 ps, so, sampling one conformation per picosecond, the time necessary for a 100-amino acid protein to fold by random search is approximately $10^{78} \times 10^{-12}$ s $= 10^{66}$ s, or about $10^{58}$ years. Many proteins, however, fold in the range of 1 ms–1 s. Thus, there must be a specific mechanism by which proteins avoid most conformations en route to their native state.

Nucleation Scenario

Two-state proteins are characterized by fast folding and the absence of stable intermediates at physiological temperatures. If we follow the folding process for an ensemble of initially unfolded proteins, both the average potential energy and the entropy of the ensemble decrease smoothly to their native state values. The absence of energetic and topological frustrations defines a “good folder.”12,66 Various measures have been proposed to determine whether a protein sequence qualifies as a two-state folder, relying on either kinetic67 or thermodynamic9,68 properties. The free energy landscape of two-state proteins at physiological temperatures is characterized by two deep minima.7,11,15,41,53,69,70 One
64 A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich, Fold. Des. 3, 183 (1998).
65 C. Levinthal, J. Chim. Phys. 65, 44 (1968).
66 J. N. Onuchic, H. Nymeyer, A. E. Garcia, J. Chahine, and N. D. Socci, Adv. Protein Chem. 53, 87 (2000).
67 J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes, Proteins Struct. Funct. Genet. 21, 167 (1995).
68 V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich, Folding Design 1, 221 (1996).
minimum corresponds to the unique native state with the lowest potential energy and low conformational entropy, while the second minimum corresponds to a set of unfolded conformations with higher values of potential energy and high conformational entropy. At the folding transition temperature $T_F$, these minima have equal depths, and both native and unfolded states coexist with equal probabilities. The two minima are separated by a free energy barrier. The set of conformations that belong to the top of this barrier, having maximal values of free energy, is called the transition state ensemble (TSE). At equilibrium, the probability of observing a conformation with free energy $G$ is given by $p \propto \exp(-G/k_B T)$, where $k_B$ is the Boltzmann constant and $T$ is the temperature of the system. Since at $T_F$ the free energies of the native and the unfolded or misfolded ensembles are equal, the probability of finding the protein in each of these states is the same. The probability of finding a conformation at the top of the free energy barrier is minimal. Therefore, if we consider any protein conformation at the top of the free energy barrier, such a conformation unfolds or reaches its native state with equal probability, 1/2. So, the TSE is characterized by a probability of 1/2 for its conformations to reach the native state directly, without unfolding.62,63,71 The questions then concern which conformations belong to the top of the free energy barrier, and whether there are any specific mechanisms that are responsible for the rapid folding transition. Numerous folding scenarios have been proposed to answer these questions.6,13,42,72–77 The mechanism that we advocate in this chapter is called a nucleation scenario.10,11 According to the nucleation scenario, there is a specific obligatory set of contacts at the transition state ensemble, called a specific nucleus, the formation of which determines the future of a conformation at the transition state ensemble.
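The criterion that a transition-state conformation reaches the native state with probability 1/2 can be illustrated numerically. The sketch below is not the protein model of this chapter: it assumes a hypothetical one-dimensional symmetric double-well free energy, U(x) = h(x² − 1)², with the "unfolded" minimum at x = −1 and the "native" minimum at x = +1, and overdamped Langevin dynamics with unit mobility. Many short trajectories are launched from a starting point, and the fraction that reaches the native well first estimates the folding probability.

```python
import numpy as np


def estimate_pfold(x0, n_traj=1000, barrier=4.0, kT=1.0, dt=1e-3,
                   max_steps=100_000, seed=0):
    """Fraction of trajectories started at x0 that reach x = +1 before x = -1.

    Assumed toy landscape: U(x) = barrier * (x**2 - 1)**2, in units of kT.
    Dynamics: overdamped Langevin (Euler-Maruyama), unit mobility.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    folded = np.zeros(n_traj, dtype=bool)
    active = np.ones(n_traj, dtype=bool)
    for _ in range(max_steps):
        if not active.any():
            break
        xa = x[active]
        force = -4.0 * barrier * xa * (xa ** 2 - 1.0)      # -dU/dx
        noise = np.sqrt(2.0 * kT * dt) * rng.standard_normal(xa.size)
        x[active] = xa + force * dt + noise
        folded |= active & (x >= 1.0)       # committed to the native well
        active &= np.abs(x) < 1.0           # stop once either well is reached
    finished = ~active
    return folded[finished].mean()


for start in (-0.6, 0.0, 0.6):
    print("x0 = %+.1f  estimated p_fold = %.2f" % (start, estimate_pfold(start)))
```

In this toy landscape only the barrier top, x0 = 0, gives a value close to 1/2; starting points on either side of the barrier commit mostly to the nearer well. The same counting procedure, applied to trial protein conformations, is what the probability-1/2 criterion for the TSE refers to.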
If the specific nucleus is formed, a protein rapidly folds to its native conformation. If the specific nucleus is disrupted in the transition state, the protein rapidly unfolds. Thus, to verify the nucleation scenario
69 S. E. Jackson, N. Elmasry, and A. R. Fersht, Biochemistry 32, 11270 (1993).
70 J. P. K. Doye and D. J. Wales, J. Chem. Phys. 105, 8428 (1996).
71 R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka, and E. I. Shakhnovich, J. Chem. Phys. 108, 334 (1998).
72 O. B. Ptitsyn, Dokl. Acad. Nauk. 210, 1213 (1973).
73 P. S. Kim and R. L. Baldwin, Annu. Rev. Biochem. 59, 631 (1994).
74 M. Karplus and D. L. Weaver, Protein Sci. 3, 650 (1994).
75 D. B. Wetlaufer, Proc. Natl. Acad. Sci. USA 70, 691 (1973).
76 D. B. Wetlaufer, Trends Biochem. Sci. 15, 414 (1990).
77 O. B. Ptitsyn, Nat. Struct. Biol. 3, 488 (1996).
we must determine the nucleus and the TSE of a protein. Next, we describe a protein model that we use in molecular dynamics simulations (Fig. 1).

Protein Engineering Experiments
A way to test the importance of amino acids in experiments was proposed by Fersht and co-workers.78,79 The method, called protein engineering or Φ value analysis, is based on engineering a mutant protein in which the amino acids under consideration are replaced by other ones. The free energy difference between the wild-type and the mutant proteins is measured at the transition state, $\Delta G^{\ddagger}$, the folded state, $\Delta G_F$, and the unfolded state, $\Delta G_U$ (Fig. 3). Φ values are defined as

$$\Phi = \frac{\Delta G^{\ddagger} - \Delta G_U}{\Delta G_F - \Delta G_U} \tag{1}$$
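As a small worked illustration of Eq. (1), the sketch below evaluates the ratio for two limiting cases. The function name, argument names, and the numerical values (in kcal/mol) are hypothetical and chosen only to show the arithmetic; they are not measurements from the cited experiments.

```python
def phi_value(dg_ts, dg_folded, dg_unfolded):
    """Eq. (1): Phi = (dG_ts - dG_U) / (dG_F - dG_U).

    Each argument is the mutant-minus-wild-type free energy change of one
    state (transition, folded, unfolded), all in the same units.
    """
    denom = dg_folded - dg_unfolded
    if denom == 0:
        # The mutation leaves the folded-state stability unchanged,
        # so the ratio is undefined.
        raise ValueError("dG_F equals dG_U; Phi is undefined")
    return (dg_ts - dg_unfolded) / denom


# Hypothetical numbers (kcal/mol), chosen only to show the two limits.
print(phi_value(0.0, 1.5, 0.0))  # transition state unaffected by the mutation
print(phi_value(1.5, 1.5, 0.0))  # transition state shifted as much as the folded state
```

In practice the denominator is small for many mutations, which makes the ratio noisy; the guard above only catches the exactly degenerate case.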
Φ values are close to zero for those amino acids whose substitution does not affect the transition states. Thus, to a zeroth approximation, these amino acids are the least important for protein-folding kinetics. Φ values are close to unity for those amino acids whose substitution affects the transition states to the same extent as the folded states. Thus, these amino acids are the most important for protein-folding kinetics.

Developments in Determination of Protein-Folding Kinetics
Developments in protein purification and structure-refining methods29 have led to publication of high-resolution protein crystal structures. This set of data boosted theoretical studies of protein folding beyond the general heteropolymer models.4,33–37 Early studies targeting important amino acids for protein dynamics applied the available crystal structure data in two different approaches: structures were used (1) as reference states (decoys) for theoretical predictions,38,80–86 and (2) as a source of dynamic 78
A. Matouschek, J. T. Kellis, Jr., L. Serrano, and A. R. Fersht, Nature 342, 122 (1989).
79 A. Matouschek, J. T. Kellis, Jr., L. Serrano, M. Bycroft, and A. R. Fersht, Nature 346, 440 (1990).
80 E. M. Boczko and C. L. Brooks, Science 269, 393 (1995).
81 V. Daggett, A. J. Li, L. S. Itzhaki, D. E. Otzen, and A. R. Fersht, J. Mol. Biol. 257, 430 (1996).
82 T. Lazaridis and M. Karplus, Science 278, 1928 (1997).
83 F. B. Sheinerman and C. L. Brooks III, Proc. Natl. Acad. Sci. USA 95, 1562 (1998).
84 S. L. Kazmirski and V. Daggett, J. Mol. Biol. 284, 793 (1998).
85 A. G. Ladurner, L. S. Itzhaki, V. Daggett, and A. R. Fersht, Proc. Natl. Acad. Sci. USA 95, 8473 (1998).
86 N. A. Marti-Renom, R. H. Stote, E. Querol, F. X. Aviles, and M. Karplus, J. Mol. Biol. 284, 145 (1998).
analysis and software
[25]
information.87–91 Studies relied on the developed theoretical framework92 that explained folding of relatively small proteins as a chemical reaction between two sets of species, folded and denatured protein states, separated by transition states and possibly by a set of metastable intermediates. Transition states control the rate of the folding reaction, and solving for the portions of the protein that provide structural coherence to these transition states became a major effort in determining the kinetically important amino acids. Computational power limitations and the inaccuracies in the interatomic force field93 forced all-atom folding simulations to be performed under extreme conditions favoring denaturation, typically high temperatures.38,80–85 This approach assumes that protein folding can be described by running the unfolding simulation backward in time, and that folding at high temperatures is comparable to folding at room temperature. These assumptions are questionable, since experimental folding studies are performed under conditions favoring the native state. Furthermore, the low stability of proteins at physiological conditions, only a few kilocalories per mole,94 indicates that folding of the protein to its native structure is the result of a delicate balance between enthalpic and entropic terms. This balance is distorted at high temperatures, where folding becomes a rare event and the transition state may change drastically.95 In simulations, Daggett et al.38,81,84 unfolded target proteins starting from their crystal structures, and monitored the time evolution of a parameter representing the structural integrity of proteins during simulations. Abrupt changes in the parameter pinpointed denaturation of these proteins, and analysis of the trajectories revealed disrupted native amino acid interactions.
The amino acids involved in these key interactions were identified as kinetically important, and the authors found good correlation with experimental Φ values. The issue of the limited statistical significance of the results,38,81,84 due to a small number of unfolding simulations, was addressed by Lazaridis and
87 C. Wilson and S. Doniach, Proteins 6, 193 (1989).
88 J. Skolnick and A. Kolinski, Science 250, 1121 (1990).
89 K. A. Dill, K. M. Fiebig, and H. S. Chan, Proc. Natl. Acad. Sci. USA 90, 1942 (1993).
90 A. Kolinski and J. Skolnick, Proteins Struct. Funct. Genet. 18, 338 (1994).
91 M. Vieth, A. Kolinski, C. L. Brooks, and J. Skolnick, J. Mol. Biol. 251, 448 (1995).
92 E. I. Shakhnovich and A. M. Gutin, Biophys. Chem. 34, 187 (1989).
93 W. Wang, O. Donini, C. M. Reyes, and P. A. Kollman, Annu. Rev. Biophys. Biophys. Struct. 30, 211 (2001).
94 T. Creighton, "Proteins: Structures and Molecular Properties," 2nd Ed. W. H. Freeman, New York, 1993.
95 A. V. Finkelstein, Protein Eng. 10, 843 (1997).
Karplus,82 who performed a larger series of unfolding simulations starting from conformations slightly different from the initial crystal structure. A wealth of simulations allowed those authors to extract the common set of key interactions and to identify the important amino acids with higher accuracy. Other attempts to circumvent the poor statistics rested on the discretization of a representative unfolding simulation, followed by long equilibrium simulations of the protein around each of the discretized steps.80,83 This method assumes that a protein is at equilibrium at every step in the folding process, but given that folding is a rare event at high temperatures, the results must be interpreted with caution. All-atom simulations were also used to increase the efficiency of protein-engineering experiments in a self-consistent experimental and computational approach toward determination of the TSE.85 This method is most useful for proteins for which only a small fraction of the residues play a key role. Such a method may also serve as a refining tool of the protein-engineering results. Protein databases96 of crystal structures have been widely used as a source of dynamic information with application to folding simulations. In their pioneering study, Wilson and Doniach87 computed effective pairwise amino acid contact potentials from the frequencies of spatial proximities between pairs of amino acids obtained in the database of structures. The authors used these potentials to reproduce, with modest success, the folding process of a one-atom-per-residue model of crambin on a square lattice.97 Skolnick and Kolinski88 developed a statistical potential using a two-atom representation of apoplastocyanin.98 Folding simulations on a finer lattice than that used in previous studies allowed the authors to fold a model protein with a root-mean-square deviation (RMSD) of 6 Å with respect to the crystal structure.
However, the propensity of the amino acids to adopt a specific crystal structure prevented the authors from generalizing the applicability of the model to more than one protein at a time. Kolinski and Skolnick90 extended the original model88 with a sophisticated potential energy including a variety of energetic and entropic contributions and a hierarchy of finer lattices. Refolding simulations of three different proteins allowed the authors to describe folding processes with moderate success. Vieth et al.91 used a similar model to study the aggregation kinetics of the GCN4 leucine zipper99 into dimers, trimers, and
96 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
97 M. M. Teeter, S. M. Roe, and N. H. Heo, J. Mol. Biol. 230, 292 (1993).
98 T. P. Garrett, D. J. Clingeleffer, J. M. Guss, S. J. Rogers, and H. C. Freeman, J. Biol. Chem. 259, 2822 (1984).
99 E. K. O'Shea, J. D. Klemm, P. S. Kim, and T. Alber, Science 254, 539 (1991).
tetramers. The authors identified the amino acids that regulated the aggregation kinetics, in accordance with experiments. A systematic study100 indicated that simple pairwise statistical potentials are of limited use in refolding simulations, and although statistically derived potentials are gaining in predictive power, their rapidly increasing complexity compromises their efficiency when compared with ab initio molecular dynamics simulations. Since all information necessary to fold a particular protein is precisely encoded in the protein structure, the crystal structure can be used as the sole source of information, with no regard to the protein database. This approach was taken by Dill et al.89 in their study of the folding mechanisms of crambin and chymotrypsin inhibitor.101 Dill et al. assigned attractive interactions between all pairs of hydrophobic amino acids, neglecting other amino acid interactions. The folding dynamics was implemented through a sequence of folding events in which hydrophobic contacts act as constraints that bring other contacts into spatial proximity. The authors found that 1 in every 4000 simulations ended in the crystal structure and proposed a folding pathway for the two proteins. This technique, although able to find a folding event, cannot reproduce a statistically significant ensemble, since the sequence of folding events is forced in the simulation. Thus, only when the proposed sequence of events coincides with the most probable ones can the results be representative of the folding of the protein. Crystal structure-based approaches to identifying the important amino acids for protein folding have attracted interest.38,39 These methods use the crystal structure as the reference state. Starting from the crystal structure, temperature-induced unfolding38,39 in all-atom molecular dynamics simulations with explicit solvent molecules has been applied to study the transition states.
However, the limitation of the computational ability of traditional molecular dynamics algorithms only enables one to sample over several unfolding trajectories from the folded state. Thus, this technique can only capture one or a few transition state conformations instead of a statistically significant ensemble. Moreover, derivation of a folding transition state ensemble from high-temperature unfolding may be problematic in some cases due to possible significant differences between the high-temperature free energy landscape and the free energy landscape of a protein at physiological temperatures.95,102
100 D. Thomas, G. Casari, and C. Sander, Protein Eng. 9, 941 (1996).
101 C. A. McPhalen and M. N. James, Biochemistry 26, 261 (1987).
102 A. R. Dinner and M. Karplus, J. Mol. Biol. 292, 403 (1999).
Alternative theoretical approaches40–43 have been proposed to predict the transition states in protein folding and have obtained significant correlations with experimental Φ values for several proteins. However, each of these models involves drastic assumptions. For example, each amino acid can adopt only two states, native or denatured, and the ability to be in the native state is considered to be independent of other residues. Such an assumption holds for one-dimensional systems, but may be inappropriate for three-dimensional proteins, because the native state of a residue depends on its contacts with its neighbors. Combined with an effective dynamic algorithm, simplified protein models with a crystal structure-based interaction potential44,45,103 have been applied to study folding kinetics. The principal difficulty in the kinetics studies is the classification of various protein conformations, that is, the knowledge of the reaction coordinate: a parameter that can uniquely identify the position of a protein conformation on a folding landscape with respect to the native state. The fraction of native contacts Q44,45 has been proposed as an approximation to the reaction coordinate. However, other authors have argued that the reaction coordinate for folding is not well defined,7,71,104 and the principal difficulty in identifying the folding reaction coordinate from crystal structures is in uncovering the relationship between protein-folding thermodynamics and kinetics, that is, how much kinetic information we can obtain about protein folding barriers from equilibrium sampling of folding trajectories.

Protein-Folding Kinetics from Discrete Molecular Dynamics Simulations
Protein Model

The problem of protein modeling in simulations is as complex as the protein-folding problem itself. Such complexity often makes brute force approaches of all-atom simulations impractical. Lattice models7,35,70,105–110 became popular due to their ability to reproduce a significant amount of
103 C. Micheletti, J. R. Banavar, and A. Maritan, Phys. Rev. Lett. 87, 88102 (2001).
104 D. K. Klimov and D. Thirumalai, Proteins Struct. Funct. Genet. 43, 465 (2001).
105 J. Skolnick, A. Kolinski, C. L. Brooks, A. Godzik, and A. Rey, Curr. Opin. Struct. Biol. 3, 414 (1993).
106 E. I. Shakhnovich, Phys. Rev. Lett. 72, 3907 (1994).
107 M. H. Hao and H. A. Scheraga, J. Phys. Chem. 98, 4940 (1994).
108 A. Sali, E. I. Shakhnovich, and M. Karplus, J. Mol. Biol. 235, 1614 (1994).
109 V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich, J. Mol. Biol. 252, 460 (1995).
110 A. Kolinski, W. Galazka, and J. Skolnick, Proteins 26, 271 (1996).
folding events in a reasonable computational time. However, the role of topology (common structural features) in determining the folding nucleus requires study beyond lattice models, which impose unphysical constraints on the protein degrees of freedom. Simplified off-lattice models45,46,111–115 are a compromise between lattice and all-atom models. They can be regarded as the first step toward modeling the realistic conformational dynamics of proteins. A simple minimalistic off-lattice protein model is a beads-on-a-string model, representing a chain with maximal flexibility.116 One drawback of a beads-on-a-string model is that its chain flexibility is higher than that observed in real proteins. As a result, the folding kinetics of the model is often altered by the conformational traps that occur in excessively flexible protein models during folding. Stiffer chains allow more cooperative motions of protein chains, drastically reducing the number of collapsed conformations. It is, thus, crucial to introduce an additional set of chain constraints in order to mimic the flexibility of the proteins. In Ding et al.63 we model a protein by beads representing Cα and Cβ atoms (Fig. 1A). There are four types of bonds: (1) covalent bonds between $C_{\alpha,i}$ and $C_{\beta,i}$, (2) peptide bonds between $C_{\alpha,i}$ and $C_{\alpha,i\pm1}$, (3) effective bonds between $C_{\beta,i}$ and $C_{\alpha,i\pm1}$, and (4) effective bonds between $C_{\alpha,i}$ and $C_{\alpha,i\pm2}$. To determine the effective bond lengths, we calculate the average and the standard deviation of the distances between carbon pairs of types 3 and 4 for $10^3$ representative globular proteins obtained from the Protein Data Bank.96 We find that the average distances are 4.7 and 6.2 Å for type 3 and type 4 bonds, respectively. The ratios of the standard deviation to the average for bond types 3 and 4 are 0.036 and 0.101, respectively. The standard deviation of bond type 4 is larger than that of bond type 3 because it is related to the angle between two consecutive peptide bonds.
Thus, the bond lengths of type 4 fluctuate more than those of type 3. The effective bonds impose additional constraints on the protein backbone so that our model closely mimics the stiffness of the protein backbone, and can give rise to cooperative folding thermodynamics. In our simulation, the four types of bonds are realized by assigning infinitely high potential well barriers116 (Fig. 1B):
111 A. Irbäck and H. Schwarze, J. Phys. A Math. Gen. 28, 2121 (1995).
112 G. F. Berriz, A. M. Gutin, and E. I. Shakhnovich, J. Chem. Phys. 106, 9276 (1997).
113 Z. Guo and C. L. Brooks III, Biopolymers 42, 745 (1997).
114 J. E. Shea, Y. D. Nochomovitz, Z. Guo, and C. L. Brooks III, J. Chem. Phys. 109, 2895 (1998).
115 D. K. Klimov, D. Newfield, and D. Thirumalai, Proc. Natl. Acad. Sci. USA 99, 8019 (2002).
116 N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley, and E. I. Shakhnovich, Folding Design 3, 577 (1998).
Fig. 1. (A) Schematic diagram of the protein model. Grey spheres represent α carbons, black ones represent β carbons (for Gly, α and β carbons are the same). In the present model only the interactions between side chains are counted, so that the interaction exists only between β carbons, and the α carbon plays only the role of the backbone. (B and C) The potential of interaction between (B) specific residues; (C) constrained residues. a1 is the diameter of the hard sphere and a2 is the diameter of the attractive sphere. [b1, b2] is the interval where residues that are neighbors on the chain can move freely. ε is negative for native contacts and positive for nonnative ones.
$$V_{ij}^{\mathrm{bond}} = \begin{cases} 0, & D_{ij}(1-\sigma) < |\mathbf{r}_i - \mathbf{r}_j| < D_{ij}(1+\sigma) \\ +\infty, & \text{otherwise} \end{cases} \tag{2}$$

where $D_{ij}$ is the distance between atoms i and j in the native state, σ = 0.0075 for a bond of type 1, σ = 0.02 for a bond of type 2, σ = 0.036 for a bond of type 3, and σ = 0.101 for a bond of type 4. The covalent and peptide bonds are given a smaller width and the effective bonds are given a larger width to mimic the protein flexibility. Other models tailored for molecular dynamics include the use of continuous potentials for bond and dihedral angles45,114,117 and for distances.118 However, the use of a discrete potential of interaction presents a computational simplification over continuous potentials, which require calculations at every discrete time step.
117 H. Nymeyer, A. E. Garcia, and J. N. Onuchic, Proc. Natl. Acad. Sci. USA 95, 5921 (1998).
118 M. Sasai, Proc. Natl. Acad. Sci. USA 92, 8438 (1995).
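The square-well bond constraint of Eq. (2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name, the dictionary name, and the use of `sigma` for the relative width are assumptions made here, while the four width values are the ones quoted in the text.

```python
import math

# Relative widths quoted with Eq. (2), keyed by bond type (1 to 4).
BOND_WIDTH = {1: 0.0075, 2: 0.02, 3: 0.036, 4: 0.101}


def bond_potential(r_i, r_j, d_native, bond_type):
    """Return 0.0 inside the allowed distance window and +inf outside it.

    r_i, r_j: (x, y, z) positions of the two bonded beads, in angstroms.
    d_native: distance between the same two beads in the native state.
    """
    sigma = BOND_WIDTH[bond_type]  # hypothetical name for the relative width
    r = math.dist(r_i, r_j)
    if d_native * (1 - sigma) < r < d_native * (1 + sigma):
        return 0.0
    return math.inf


# A type 3 bond at its native length of 4.7 angstroms is allowed;
# stretching it to 5.0 angstroms leaves the window.
print(bond_potential((0, 0, 0), (4.7, 0, 0), 4.7, 3))
print(bond_potential((0, 0, 0), (5.0, 0, 0), 4.7, 3))
```

In an event-driven simulation the infinite walls are never evaluated as energies; they act as reflecting boundaries at which the pair velocity is reversed. The function above only makes the allowed window explicit.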
We use a modified Go model similar to the one described in Dokholyan et al.,116 in which interactions are determined by the native structure of proteins. In our model, only Cβ atoms that are not nearest neighbors along the chain interact with each other. The cutoff distance between Cβ atoms is chosen to be 7.5 Å. The Go model has been widely applied to study various aspects of protein-folding thermodynamics and kinetics.24,42,46,62,63,119,120 Despite the drawback of the Go model, namely that it requires prior knowledge of the native structure, it has important advantages. It is the simplest model that satisfies the principal thermodynamic and kinetic requirements for a protein-like model: (1) a unique and stable native state; (2) a cooperative folding transition resembling a first-order phase transition. Importantly, protein sequences with amino acids represented by only two or three types showed a relatively slow decrease in potential energy at TF until proteins reached their native state. The corresponding folding scenario is a coil-to-globule collapse, followed by a slow search of the native structure through metastable intermediates.113,117,121 Similarly, the addition of nonspecific interactions to the Go model resulted in analogous trapping114; and (3) the Go model is derived from the native topology, which according to protein-engineering experiments16,17,122,123 is the determinant of the resulting structure of the transition state. Furthermore, in a study with an all-atom energy function, Paci et al. determined that native interactions account for 85% of the energy of the transition state ensemble of the two-state folder AcP.124 The use of the Go model is based (implicitly or explicitly) on the assumption that the topology of the native structure is more important in determining the folding mechanism than the energetics of actual sequences that fold into it.
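To make the contact definition concrete, the following sketch builds a native contact list from side-chain bead coordinates with the 7.5 Å cutoff and evaluates the fraction of native contacts for another conformation. The function names, the `min_separation` parameter (our reading of "not nearest neighbors along the chain"), and the four-bead toy coordinates are assumptions for illustration only.

```python
import math


def native_contacts(cb_coords, cutoff=7.5, min_separation=2):
    """Return the set of residue pairs (i, j), i < j, in contact natively.

    cb_coords: list of (x, y, z) side-chain bead positions in angstroms.
    min_separation=2 skips nearest neighbors along the chain (assumed).
    """
    contacts = set()
    n = len(cb_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(cb_coords[i], cb_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def fraction_native(cb_coords, contacts, cutoff=7.5):
    """Fraction of the native contacts that are formed in a conformation."""
    if not contacts:
        return 0.0
    formed = sum(
        1 for i, j in contacts
        if math.dist(cb_coords[i], cb_coords[j]) <= cutoff
    )
    return formed / len(contacts)


# Toy example: a compact four-bead "native" square and an extended chain.
native = [(0, 0, 0), (3.8, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
extended = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]
pairs = native_contacts(native)
print(sorted(pairs))
print(fraction_native(extended, pairs))
```

The double loop is quadratic in chain length, which is adequate for domains of SH3 size; a cell list would be the usual choice for larger systems.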
Apparently, conclusive proof of such an assumption can be obtained either in simulations that do not use the Go model, or in experiments that compare folding pathways of analogs—proteins with nonhomologous sequences that fold into similar conformations. The dominating role of topology in defining folding mechanisms was first found in simulations in 1994 when Abkevich and coauthors7 observed that various nonhomologous sequences designed to fold to the same lattice structure featured the same 119
Y. Zhou and M. Karplus, Nature 401, 400 (1999).
120 N. V. Dokholyan, L. Li, F. Ding, and E. I. Shakhnovich, Proc. Natl. Acad. Sci. USA 99, 8637 (2002).
121 H. S. Chan and K. A. Dill, J. Chem. Phys. 100, 9238 (1994).
122 F. Chiti, N. Taddei, P. M. White, M. Bucciantini, F. Magherini, M. Stefani, and C. M. Dobson, Nat. Struct. Biol. 6, 1005 (1999).
123 J. Clarke, E. Cota, S. B. Fowler, and S. J. Hamill, Structure 7, 1145 (1999).
124 E. Paci, M. Vendruscolo, and M. Karplus, Proteins Struct. Funct. Genet. 47, 379 (2002).
folding nucleus. This finding was further corroborated in Mirny et al.,125 where evolution-like selection of fast-folding sequences generated many families of sequences (akin to superfamilies in real proteins) that all have the same nucleus positions stabilized, despite the fact that the actual amino acid types that delivered such stabilization varied from family to family. Similar behavior was observed in structural and sequence alignment analysis of real proteins,26,125,126 where extra conservation was detected in positions corresponding to a common folding nucleus for proteins representing that fold. Experimentally, a common folding nucleus was found in α/β plait proteins that have no sequence homology.122,127 Other works provided support for the important role of protein topology in its folding kinetics.45,122,128–130

Discrete Molecular Dynamics Algorithm

Due to the computational burden of traditional molecular dynamics,131 simplified simulation methods are needed to study protein folding. Our program employs the discrete molecular dynamics algorithm, which received strong attention due to its rapid performance132,133 in simulating polymer fluids,132 single homopolymers,133,134 proteins,116,119,135 and protein aggregates.136,137 A detailed description of the algorithm can be found in Refs. 138–141.
125 L. A. Mirny, V. I. Abkevich, and E. I. Shakhnovich, Proc. Natl. Acad. Sci. USA 95, 4976 (1998).
126 O. B. Ptitsyn and K.-L. H. Ting, J. Mol. Biol. 291, 671 (1999).
127 V. Villegas, J. C. Martínez, F. X. Avilés, and L. Serrano, J. Mol. Biol. 283, 1027 (1998).
128 K. W. Plaxco, K. T. Simons, and D. Baker, J. Mol. Biol. 277, 985 (1998).
129 A. R. Fersht, Proc. Natl. Acad. Sci. USA 97, 1525 (2000).
130 K. W. Plaxco, S. Larson, I. Ruczinski, D. S. Riddle, E. C. Thayer, B. Buchwitz, A. R. Davidson, and D. Baker, J. Mol. Biol. 278, 303 (2000).
131 Y. Duan and P. Kollman, Science 282, 740 (1998).
132 S. W. Smith, C. K. Hall, and B. D. Freeman, J. Comput. Phys. 134, 16 (1997).
133 Y. Zhou, M. Karplus, J. M. Wichert, and C. K. Hall, J. Chem. Phys. 107, 10691 (1997).
134 N. V. Dokholyan, E. Pitard, S. V. Buldyrev, and H. E. Stanley, Phys. Rev. E 65, 030801(R) (2002).
135 Y. Zhou and M. Karplus, J. Mol. Biol. 293, 917 (1999).
136 A. V. Smith and C. K. Hall, J. Mol. Biol. 312, 187 (2001).
137 F. Ding, N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley, and E. I. Shakhnovich, J. Mol. Biol. 324, 851 (2002).
138 B. J. Alder and T. E. Wainwright, J. Chem. Phys. 31, 459 (1959).
139 A. Y. Grosberg and A. R. Khokhlov, "Giant Molecules." Academic Press, Boston, 1997.
140 M. P. Allen and D. J. Tildesley, "Computer Simulation of Liquids." Clarendon Press, Oxford, 1987.
141 D. C. Rapaport, "The Art of Molecular Dynamics Simulation." Cambridge University Press, Cambridge, 1997.
To control the temperature of the protein we introduce $10^3$ particles, which do not interact with the protein or with each other in any way but via elastic collisions, serving as a heat bath. Thus, by changing the kinetic energy of those "ghost" particles we are able to control the temperature of the environment. The ghost particles are hard spheres of the same radii as chain residues and have unit mass. The temperature is measured in units of ε/kB, where ε is the typical interaction strength between pairs of amino acids and kB is the Boltzmann constant. The time step is equal to the shortest time between two consecutive collisions between any two particles in the system.

Folding Thermodynamics

To test whether our models faithfully reproduce the experimentally observed16,52,53,142 thermodynamic and kinetic properties of the SH3 domain, Ding et al.63 and Borreguero et al.62 performed discrete molecular dynamics simulations of the model SH3 domain at various temperatures. At each temperature we calculate the potential energy E, the radius of gyration Rg,143 the root-mean-square deviation from the native state (RMSD),144 and the specific heat Cv(T). The radius of gyration is a measure of protein size, RMSD measures the similarity between a given conformation and the native state, and the specific heat measures the fluctuations of the potential energy of the protein at a given temperature. At low temperatures, the average potential energy ⟨E⟩ increases slowly with temperature, and the RMSD remains below 3 Å. Near the transition temperature TF, the quantities E, Rg, and RMSD fluctuate between values characterizing two states, folded and unfolded, yielding a bimodal distribution of the potential energy. Potential energy fluctuations at TF give rise to a sharp peak in Cv(T) (see, e.g., Fig. 7 of Dokholyan et al.116), which is characteristic of a first-order phase transition for a finite system.
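Two of the observables above are simple to compute from sampled conformations and energies. The sketch below assumes equal bead masses, reduced units with kB = 1 (temperature in units of ε/kB), and the standard fluctuation relation Cv = (⟨E²⟩ − ⟨E⟩²)/T², which the text implies but does not write out; function names and the sample values are illustrative only.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the beads from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


def specific_heat(energies, temperature):
    """Cv from potential energy fluctuations, in reduced units (kB = 1)."""
    n = len(energies)
    mean = sum(energies) / n
    variance = sum((e - mean) ** 2 for e in energies) / n
    return variance / temperature ** 2


# A trajectory that hops between two energy levels (bimodal distribution)
# has a much larger Cv than one that stays in a single state.
print(specific_heat([-10.0, 0.0, -10.0, 0.0], 1.0))
print(specific_heat([-10.0, -10.0, -10.0, -10.0], 1.0))
```

The comparison in the usage lines mirrors the argument in the text: a bimodal energy distribution at the transition temperature is what produces the sharp peak in Cv(T).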
Our findings are consistent with the two-state folding thermodynamics experimentally observed for the c-Src SH3 domain.16,52,53,142

Protein-Folding Kinetics

Identifying Folding Nucleus. A method to identify the protein-folding nucleus from equilibrium trajectories was proposed in Dokholyan et al.116 and later used on SH3 domain proteins.62,63 The idea is to study ensembles of conformations that have a specific history and future. For example,
142 S. E. Jackson, Folding Design 3, R81 (1998).
143 M. Doi, ed., "Introduction to Polymer Physics." Oxford University Press, New York, 1997.
144 W. Kabsch, Acta Crystallogr. A 34, 827 (1978).
conformations that originate in the unfolded state, reach a putative transition state, and later unfold, must differ from the conformations that originate in the folded state, reach a putative transition region, and later fold. Both sets of conformations, which we denote by UU and FF, are characterized by the same potential energy and similar overall structural characteristics. Nevertheless, there is a crucial kinetic difference between them. According to the nucleation scenario, UU conformations lack the folding nucleus. The nucleus is not created at the transition region, which leads to the protein unfolding. FF conformations have the nucleus intact at the transition state, so that the protein does not unfold. Thus, to determine the nucleus, we compare the average frequencies of contacts between amino acids in the UU and FF ensembles of conformations. Amino acid contacts that have the largest frequency difference form the folding nucleus. We test this method to identify the nucleus of a computationally designed protein46 and later, in Ding et al.,63 to determine the folding nucleus of the Src SH3 domain protein. For the SH3 domain we find that the crucial contact that is formed at the top of the free energy barrier is between two loops, the distal hairpin and the divergent turn, namely L24–G54. The observation of this contact is statistically significant: the probability of observing the L24–G54 contact in our molecular dynamics simulations by chance is about 0.04, although L24–G54 is the most persistent contact in the FF–UU ensemble. The formation of this contact clips the distal hairpin and the RT-loop together, drastically reducing protein entropy. To additionally test the role of contact L24–G54 in SH3, we "covalently" constrain this contact in our molecular dynamics simulations. If L24–G54 constitutes the folding nucleus, then by constraining it we do not allow the folding nucleus to be disrupted, and, thus, the protein should rarely unfold in equilibrium simulations.
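The comparison of contact frequencies between the FF and UU ensembles can be sketched as follows. The data layout (each conformation represented as a set of residue-index pairs), the function names, and the toy ensembles are assumptions made for illustration; they are not taken from the cited simulations.

```python
from collections import Counter


def contact_frequencies(ensemble):
    """Fraction of conformations in which each contact is present.

    ensemble: list of conformations, each a set of (i, j) contact pairs.
    """
    counts = Counter()
    for contacts in ensemble:
        counts.update(contacts)
    n = len(ensemble)
    return {pair: c / n for pair, c in counts.items()}


def nucleus_candidates(ff_ensemble, uu_ensemble, top=5):
    """Rank contacts by frequency in FF minus frequency in UU."""
    f_ff = contact_frequencies(ff_ensemble)
    f_uu = contact_frequencies(uu_ensemble)
    pairs = set(f_ff) | set(f_uu)
    diff = {p: f_ff.get(p, 0.0) - f_uu.get(p, 0.0) for p in pairs}
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy ensembles: contact (24, 54) is present in every FF conformation
# and absent from every UU conformation.
ff = [{(24, 54), (1, 5)}, {(24, 54)}]
uu = [{(1, 5)}, {(1, 5), (2, 9)}]
print(nucleus_candidates(ff, uu, top=1))
```

With real trajectories the ranking would need an estimate of statistical significance, as the text notes for the L24–G54 contact; this sketch returns only the raw frequency differences.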
After cross-linking L24–G54, we observe that the SH3 domain exists predominantly in the folded state. In fact, the histogram of the potential energy states, which is bimodal at TF for the unconstrained protein, becomes unimodal, with a maximum corresponding to the energy of the native conformation. Thus, cross-linking of L24–G54 strongly biases conformations to the native state. For a control, we test whether constraining any other contact leads to a similar bias of conformations to the native state. We cross-link T9–S64, the N and C termini of Src SH3. T9–S64 is the longest range contact along the protein chain, and, in the case of a homopolymer, it reduces the entropy of conformational space the most.145 We find that fixation of T9–S64 does not
145 A. Y. Grosberg and A. R. Khokhlov, "Statistical Physics of Macromolecules." AIP Press, New York, 1994.
significantly affect the distribution of energy states, indicating that formation of an arbitrary contact is not a sufficient condition to bias the protein conformation toward its native state.

Identifying Transition State Ensemble. Next, we identify the transition state ensemble of the SH3 protein: the set of all conformations that belong to the top of the free energy barrier. We test whether selected conformations belong to the transition state ensemble by computing their probability to fold, pFOLD.71 To determine pFOLD for a given conformation in molecular dynamics simulations, we randomize the velocities of the particles and simulate the protein for a fixed interval of time, long enough to observe a folding transition in equilibrium simulations at TF. We then determine pFOLD as the ratio of the number of successful folding events to the total number of trials. As mentioned above, transition state ensemble conformations are characterized by pFOLD values close to 1/2. We study three types of conformations: (1) UU, (2) FF, and (3) UF. The latter is a set of conformations that originates in the unfolded state, crosses the putative transition barrier, and reaches the folded state. We choose the putative transition conformations as those having an energy higher than that of the native state but lower than the average energy of unfolded states, so that their potential energy has the lowest probability at TF (Fig. 2). In UU conformations the nucleus is not present, and since there is little chance that it will be created after randomization of velocities, we expect pFOLD to be close to zero. In FF conformations, the nucleus is present, and since there is little chance that it will be disrupted, we expect pFOLD to be close to unity. In UF conformations the nucleus is present with some probability; thus, if we select UF conformations so that the nucleus is formed with the probability 1/2, we expect pFOLD to be close to 1/2.
Fig. 2. An illustration of the probability distribution of the potential energies of conformations of two-state proteins at the folding transition temperature (vertical axis: probability; horizontal axis: energy). The two maxima represent the folded (F) and unfolded (U) conformations, which are separated from each other by low-probability transition states.
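The pFOLD estimate described in the text is a ratio of successful folding trials to total trials, each trial starting from the same conformation with fresh random input. The sketch below shows only that counting logic on a deliberately simple stand-in: an unbiased one-dimensional random walk with an "unfolded" boundary at 0 and a "folded" boundary at `folded_at`. The walk, all names, and all parameter values are assumptions for illustration and do not represent the discrete molecular dynamics model.

```python
import random


def run_trial(start, folded_at, rng, max_steps=10_000):
    """One toy trajectory; True if the folded boundary is reached first."""
    x = start
    for _ in range(max_steps):
        if x <= 0:
            return False  # reached the unfolded boundary
        if x >= folded_at:
            return True   # reached the folded boundary
        x += rng.choice((-1, 1))
    return False  # did not fold within the allotted time


def estimate_pfold(start, folded_at=10, trials=2000, seed=0):
    """Ratio of successful folding trials to the total number of trials."""
    rng = random.Random(seed)
    folded = sum(run_trial(start, folded_at, rng) for _ in range(trials))
    return folded / trials


# Starting points near either boundary commit to that boundary;
# the midpoint plays the role of a transition state.
for start in (1, 5, 9):
    print(start, estimate_pfold(start))
```

For this toy walk the exact committor is start/folded_at, so the midpoint gives a value near 1/2. As in the text, a trial that does not fold within the fixed time window is counted as unsuccessful.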
Fig. 3. An illustration of the Φ value analysis for a two-state protein (free energy levels of the unfolded states, transition states, and native state for the wild type and the mutant, with the shifts labeled ΔG_U, ΔG‡, and ΔG_F). An amino acid at a specific position is selected in the wild-type protein and is mutated to a specific target. Such mutations affect the free energies of the unfolded, transition, and native states. The extent to which the transition state is affected with respect to the unfolded and native states is measured by Φ values, defined in Eq. (1).
In Refs. 62 and 63 we show that pFOLD is in fact close to zero for the ensemble of UU conformations and approximately unity for the ensemble of FF conformations. Only for the ensemble of UF conformations do we find that pFOLD is close to 1/2. Thus, the set of UF conformations represents the transition state ensemble. It is important that, even though we perform thermodynamic simulations, we study the protein-folding kinetics, because we select UU, FF, and UF conformations based on their past and future states. It is because of this kinetic selection of the UU, FF, and UF conformations that we observe differences in pFOLD values, even though their energetic (potential energy) and structural (RMSD, Rg) characteristics are close to each other.

Virtual Screening Method. We use a technique similar to experimental Φ value analysis to predict the TSE via computer simulations. We assume that the mutation does not give rise to significant variation of the three-dimensional structures of the folded and transition state ensembles, the same assumption that is made in protein engineering experiments. In our simulations, the free energy shifts due to mutation can be computed separately in the unfolded, transition, and folded state ensembles:

$$\Delta G_x = -kT \ln \langle \exp(-\Delta E / kT) \rangle_x \qquad (3)$$
Here x denotes a state ensemble (folded, F; unfolded, U; and transition, ‡), ΔE is the change in potential energy due to the mutation (details in Ref. 45), and the average ⟨...⟩_x is taken over all conformations of the unfolded, transition, or folded state ensemble. We compute (Ref. 45)

$$\Phi = \frac{\ln \langle \exp(-\Delta E / kT) \rangle_{\ddagger} - \ln \langle \exp(-\Delta E / kT) \rangle_{U}}{\ln \langle \exp(-\Delta E / kT) \rangle_{F} - \ln \langle \exp(-\Delta E / kT) \rangle_{U}} \qquad (4)$$

Fig. 4. Two-state protein free energy landscapes are characterized by two distinct minima at the folding transition temperature. One minimum corresponds to a misfolded/unfolded set of protein conformations, while the other corresponds to the native conformation. The two minima are separated by the free energy barrier. The set of conformations at the top of the free energy barrier constitutes the transition state ensemble and is characterized by a probability to rapidly fold to the native state of pFOLD ≈ 1/2. In pretransition states, the folding nucleus is not formed, and thus the probability to fold of such conformations is close to zero. In posttransition states, the folding nucleus is formed, and thus the probability to fold of such conformations is close to 1. The difference in the folding kinetics of pre- and posttransition conformations is drastic, even though their potential energies, Rg, RMSD, and other structural characteristics are close to each other. This difference is exemplified with pre- and posttransition conformations of CI2, obtained by all-atom Monte Carlo simulations (Ref. 120). In pretransition states the nucleus, A16, L49, and I57 (beads), is not formed, while in posttransition states the nucleus is intact.
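Equations (3) and (4) amount to an exponential average of the mutation-induced energy changes over each ensemble, followed by a ratio of differences. The Python sketch below is illustrative only. It assumes that ΔE has already been evaluated for every sampled conformation of the unfolded, transition, and folded state ensembles, and that energies and kT are given in the same units. The function names and the sample numbers are hypothetical and are not taken from Ref. 45.

```python
import math


def free_energy_shift(delta_e, kT=1.0):
    """Eq. (3): dG_x = -kT * ln < exp(-dE / kT) >_x.

    delta_e is a list of energy changes dE, one per sampled
    conformation of ensemble x (same units as kT).
    """
    exponents = [-de / kT for de in delta_e]
    # Shift by the largest exponent so that exp() cannot overflow.
    shift = max(exponents)
    mean = sum(math.exp(v - shift) for v in exponents) / len(exponents)
    return -kT * (shift + math.log(mean))


def phi_value(de_transition, de_unfolded, de_folded, kT=1.0):
    """Eq. (4): (dG_ts - dG_U) / (dG_F - dG_U).

    The common factor -kT cancels, so this equals the ratio of
    differences of logarithms printed in Eq. (4).
    """
    g_ts = free_energy_shift(de_transition, kT)
    g_u = free_energy_shift(de_unfolded, kT)
    g_f = free_energy_shift(de_folded, kT)
    if g_f == g_u:
        raise ValueError("mutation does not shift dG_F relative to dG_U")
    return (g_ts - g_u) / (g_f - g_u)


if __name__ == "__main__":
    # Hypothetical dE samples for one mutation (arbitrary units, kT = 1).
    unfolded = [0.1, 0.0, 0.2, 0.0]
    transition = [0.9, 1.1, 0.7, 1.0]
    folded = [1.8, 2.1, 2.0, 1.9]
    print(phi_value(transition, unfolded, folded))
```

Because the average is taken over exp(−ΔE/kT) rather than over ΔE itself, conformations in which the mutation costs little energy dominate the result, which is how the entropic contribution enters the estimate.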
The Φ values in our analysis are determined using the free energy relationship of Eq. (3), which takes into account both energetic and entropic contributions but assumes that mutations do not change the TSE. Interestingly, if one adopts the simplified definition of the Φ value used in more recent work (Ref. 146), in which Φ is proportional to the number of contacts a residue makes in the TSE, the correlation coefficient between theoretical and experimental Φ values is reduced to 0.27 from approximately 0.6 (Section V.D.4). An approximation to the Φ value based on the difference between the average number of contacts that residue i forms in the TSE and in the unfolded states, Φ ≈ (⟨N_i⟩_‡ − ⟨N_i⟩_U)/(⟨N_i⟩_F − ⟨N_i⟩_U), provides a better correlation coefficient between predicted and experimentally observed Φ values (0.48) than does the approximation of Vendruscolo et al. (Ref. 146). The reason that a thermodynamic definition of the Φ value yields better agreement with experiments can be inferred from a ΔG plot (Ref. 63), which shows that ΔG_F − ΔG_U is not negligible for most of the amino acids. Indeed, there are several amino acids that make persistent short-range contacts in the unfolded states.

146 M. Vendruscolo, E. Paci, C. Dobson, and M. Karplus, Nature 409, 641 (2001).

Comparing Simulations with Experiments. In Ding et al. (Ref. 63), Φ values are computed using the virtual screening method, and the comparison with experimental Φ values (Refs. 16 and 52) for the Src SH3 protein is statistically significant: the linear regression coefficient is approximately 0.6. By comparing the number of contacts that an amino acid makes in the TSE with that number in the unfolded state, those amino acids that are most important for the formation of the transition state ensemble are selected: L24, F26, L32, V35, W43, A45, G54, Y55, and I56. In general, the majority of the residues from that list have high experimental Φ values; remarkably, residue A45, which has the highest number of contacts in the transition state ensemble with respect to the unfolded states, has the highest experimental Φ value, 1.2. Notable exceptions are L24, W43, and G54, which have Φ values that are either small or negative, as in the case of G54. For residue G54, mutation destabilizes the protein while accelerating folding, strongly suggesting that it participates in the transition state ensemble (Ref. 147). Additional evidence supporting the important roles of L24 and G54 for the transition state of SH3 comes from the evolutionary observation that these amino acids are conserved in a family of homologous SH3 domain proteins (Refs. 62 and 148).

Role of Protein Topology
En route to the native state, a protein loses entropy at the transition states by forming a specific nucleus (see Fig. 4). Entropically and energetically, pre- and posttransition states (conformations with pFOLD approximately zero and unity, respectively) are indistinguishable. In fact, in Dokholyan et al. (Ref. 120), pre- and posttransition sets of conformations were selected for the SH3 and CI2 proteins. Both pre- and posttransition states had similar structural and energetic properties. The question then is: "What distinguishes pre- and posttransition states?"
147 L. S. Itzhaki, D. E. Otzen, and A. R. Fersht, J. Mol. Biol. 254, 260 (1995).
148 S. M. Larson and A. R. Davidson, Protein Sci. 9, 2170 (2000).
Fig. 5. Constructing protein graphs from protein conformations. Each node corresponds to an amino acid. We draw an edge between any two nodes of a graph if there exists a contact between the amino acids corresponding to these nodes. The contact between two amino acids is defined by the spatial proximity of the Cβ carbons (Cα for Gly) of these amino acids. The contact distance is taken to be 8.5 Å.
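The graph construction described in the caption of Fig. 5, together with the average minimal path L of Eq. (5), can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that one representative coordinate per residue (Cβ, or Cα for Gly) has already been extracted from the structure, and all function names are hypothetical. The default cutoff follows the 8.5 Å of the caption; the text uses 7.5 Å for SH3.

```python
import math
from collections import deque


def contact_graph(coords, cutoff=8.5):
    """Adjacency lists of the protein graph.

    coords: one (x, y, z) tuple per residue, in angstroms.
    Two residues are joined by an edge if their representative
    atoms lie within `cutoff` of each other.
    """
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def path_lengths_from(adj, source):
    """Breadth-first search: minimal number of edges from source to each node."""
    dist = [None] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_minimal_path(adj):
    """L = 1 / (N (N - 1)) * sum over pairs i > j of the minimal path l_ij.

    The normalization follows Eq. (5) as printed.
    """
    n = len(adj)
    total = 0
    for i in range(n):
        dist = path_lengths_from(adj, i)
        for j in range(i):
            if dist[j] is None:
                raise ValueError("graph is disconnected")
            total += dist[j]
    return total / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical coordinates: four residues on a line, 5 angstroms apart.
    toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
    graph = contact_graph(toy)
    print(graph, average_minimal_path(graph))
```

The length of each adjacency list is the number of contacts the corresponding residue makes, which is the quantity N_i that enters the contact-based approximation to the Φ value discussed earlier.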
To answer this question, we hypothesize (Ref. 120) that the actual topological properties of pre- and posttransition conformations are different. To test this hypothesis, we construct a protein graph (Fig. 5), the nodes of which represent amino acids and the edges of which represent pairs of amino acids that are within the contact range of each other. For SH3, we define the maximum distance between Cβ atoms at which a contact exists as 7.5 Å (Ref. 149). A simple measure of the topological properties of the graph is the average minimal path along the edges between any two nodes of the graph, L, used in Vendruscolo et al. (Ref. 150) and later used for discriminating pre- and posttransition states of the SH3 and CI2 proteins:

$$L = \frac{1}{N(N-1)} \sum_{i>j}^{N} \ell_{ij} \qquad (5)$$

where N is the number of amino acids and ℓ_ij is the minimal path between nodes i and j. L characterizes the "tightness" of the network as the average separation of its elements from each other. For both the SH3 and CI2 proteins, L was observed to be significantly different in pre- and posttransition states, supporting the hypothesis of Ref. 120 that the protein conformation topology plays an important role in protein-folding kinetics. Additional evidence of the importance of topology in protein folding was shown in Vendruscolo et al. (Ref. 150), where, using other determinants of protein graph topology, the most important amino acids for the protein-folding kinetics were identified for several proteins: AcP, human procarboxypeptidase A2, tyrosine-protein kinase SRC, α-spectrin SH3 domain, CI2, and protein L. Using Monte Carlo simulations of the hydrophobic protein model, Treptow et al. (Ref. 151) also suggested a role for protein topology in folding kinetics.

149 R. L. Jernigan and I. Bahar, Curr. Opin. Struct. Biol. 6, 195 (1996).
150 M. Vendruscolo, N. V. Dokholyan, E. Paci, and M. Karplus, Phys. Rev. E 65, 061910 (2002).

Conclusion
The revolution in protein crystallography has resulted in the identification of a large number of protein structures. These structures have become an invaluable source of information on amino acid interactions. We review extensive studies that uncovered the important role of protein topology in folding kinetics. These studies suggest that one can determine protein-folding kinetics to a reasonably detailed level from knowledge of the crystal structure. We describe analytical and computational tools for determining and characterizing protein-folding kinetics from crystal structures. These include a protein model for off-lattice molecular dynamics simulations that faithfully reproduces many aspects of SH3 folding thermodynamics and kinetics. Using molecular dynamics simulations, we verify the nucleation scenario for the SH3 protein family by comparing the fluctuations originating in the native and unfolded states. We find an important role of the L24–G54 contact for the folding kinetics of SH3 proteins. A possible test of the kinetic importance of the L24–G54 contact may come from cross-linking this contact and determining whether cross-linking stabilizes the native state of the Src SH3.

The relationship between protein crystal structures and their folding kinetics is best illustrated by the success of the Go model of amino acid interactions in studies of protein folding. We describe several studies that are based on the Go model. In one such study, we identify the elusive protein-folding transition state ensemble for the Src SH3 protein and find that it is consistent with experimental observations. We dissect the transition state ensemble by studying the wiring properties of protein graphs. The structural properties of protein graphs are related to protein topology and, thus, may explain the kinetics of the folding process. These studies unveil the expanding possibilities for studying the folding kinetics of proteins from their crystal structures.

151 W. L. Treptow, M. A. A. Barbosa, L. G. Garcia, and A. F. P. de Araújo, Proteins 49, 167 (2002).

Acknowledgments

We gratefully acknowledge E. Deeds and A. F. P. de Araújo for critical reading of the manuscript. This work is supported by the NSF and Petroleum Research Fund (to H.E.S.), and by the NIH (GM52126 to E.S.). N.V.D. is supported by an NIH NRSA Grant (GM20251–01).
Author Index
Numbers in parentheses are footnote reference numbers and indicate that an author’s work is referred to although the name is not cited in the text.
A Abad-Zapatero, C., 551 Abbasi, A., 594(26), 595 Abendroth, J., 77 Abernethy, N. F., 549(164), 550 Abkevich, V. I., 618, 621, 621(7), 627, 627(7), 630(7), 631 Abola, E. E., 374, 377(8), 417, 476, 549(162), 550 Abrahams, J. P., 29, 30, 33(35), 46, 55, 135, 163, 164, 165(6), 171, 178(6), 184, 187, 603 Abrahams, S. C., 303, 313(2) Adachi, S., 604 Adamiak, D. A., 120 Adams, P. D., 23, 37, 39, 47(5), 171, 177, 244, 246, 273(28), 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Adler, J., 210 Adrian, M., 204 Agard, D. A., 27, 29, 46, 190, 566 Aggeler, R., 603 Agmon, I., 164 Agrawal, R. K., 603, 615 Ailey, B., 465 Alber, T., 625 Albert, M. S., 119 Aldag, I., 603 Alder, B. J., 631 Alexandratos, J., 121, 408 Alexandrov, N. N., 514 Alexandrov, V., 546, 560, 565 Alexov, E., 509 Ali, S. A., 594(26), 595 Al-Karadaghi, S., 60 Allen, F. H., 245 Allen, M. P., 631
639
Alm, E., 620, 622(42), 626(42), 627(42), 630(42), 632(52), 637(52) Almagro, J. C., 418 Almo, S. C., 87, 98(26), 99(26), 211, 220(50), 223, 223(50), 224(50), 246, 465 Alphey, M. S., 77 Altendorf, K., 603 Altman, R. B., 549(164; 165), 550, 557, 559 Altschul, S. F., 380, 467, 544 Alvarez-Rua, C., 207 Alzari, P. M., 327, 338(12) Amada, F., 402 Amann, K. J., 212 Amara, P., 118 Amemiya, Y., 616 Amzel, L. M., 528 Anand, K., 77 Andersen, N. H., 421 Anderson, A. G., 451 Anderson, P. M., 47, 49(35), 76 Andersson, I., 115, 116(87) Andert, K., 618, 620(22) Andreu, J. M., 595 Andrew, P. W., 206(22), 207 Andrews, B. K., 423 Angelov, B., 516 Anifsen, C. B., 618 Antson, A. A., 322 Apweiler, R., 465 Arata, Y., 375, 378(20), 379(20) Arcangioli, B., 595 Arendall, W. B. III, 387, 394, 395(19), 397(19), 417, 448, 452(64), 456(64), 461(64) Argos, P., 390, 557 Arjunan, P., 76, 602 Armon, A., 505 Arnal, I., 207 Arnoux, B., 606
640
author index
Arthur, J. W., 472 Artymiuk, P. J., 557, 558, 558(59) Aruffo, A., 463(3), 464, 472(3) Aslam, M., 601 Aszo´di, A., 473 Atilgan, A. R., 304 Atkinson, R. A., 610 Aurenhammer, F., 517, 519(60) Austin, R. H., 616 Avile´s, F. X., 623, 631 ´ zevedo, E. D., 371 A
B Baase, W. A., 547, 575(5) Bachet, B., 87, 93(34), 97(34), 115(34), 116(34) Bachmann, A., 616 Bacon, D. J., 316, 475 Bada, M., 595 Bader, R. F. W., 456 Badretdinov, A. Y., 479, 481(73), 482(73), 486(73), 490(73) Baggio, R., 61 Bahar, I., 304, 618, 638 Bairoch, A., 465, 511, 549(160; 166), 550 Bajorath, J., 463(3), 464, 472(3) Baker, D., 27, 29, 46, 190, 463(8), 464, 547, 560, 616, 618, 620, 620(16), 622(42), 626(42), 627(42), 630(16; 42), 631, 632(16; 52), 637(16; 52) Baker, E. N., 199 Baker, L.-J., 125, 133(31) Baker, P. J., 77 Baker, T. S., 204, 206, 207, 207(13), 209(13), 216(13), 220(34) Bakowies, D., 517 Balaram, P., 395(22), 396 Balass, M., 130 Baldi, P., 581 Baldwin, J. E., 115, 116(87) Baldwin, R. L., 451, 622 Baldwin, T. O., 401 Bamford, D. H., 595 Ban, N., 163, 164, 165(9), 172, 174(27), 175(27), 180(27), 181(39), 182, 603, 615 Banavar, J. R., 618, 627 Banuelos, S., 403, 405(30) Barberato, C., 594, 605, 607
Barbosa, M. A. A., 639 Barker, W. C., 480 Barrett, C., 469 Barrientos, L. G., 491, 493(77) Bartels, C., 516 Bartels, H., 164 Bartels, K. S., 170 Barth, P., 328 Barton, G. J., 374, 467, 557, 559 Basak, A. K., 207 Bash, P. A., 448 Bashan, A., 164 Bashford, D., 421, 473, 475(63) Bateman, A., 560, 581 Bateman, R. C., Jr., 405, 411(33) Batt, C. A., 616 Battino, R., 85, 94 Baucom, A., 615 Baumann, U., 313, 318(34), 319(34) Baumeister, W., 163, 204, 206 Bax, A., 375, 378(20), 379(20) Bax, B., 613 Baxter, K., 245 Bayly, C., 422, 447(34) Beamer, L. J., 88, 98(40), 101(40), 115, 116(83) Becirevic, A., 594(25), 595 Becquart, J., 606 Behlke, J., 323 Bella, J., 18 Bellott, M., 473, 475(63) Bellucci, F., 332 Beltramini, M., 602, 612, 613 Bennett, W. S., 551 Bennett, W. S., Jr., 551 Benson, D. A., 380(32), 381, 384(32), 465 Benson, S. W., 111 Ben-Tal, N., 505 Berendes, R., 206 Berendsen, H. J., 311, 418, 565 Berendsen, J., 23 Berendzen, J., 22, 23(1), 24(1), 25, 27, 27(1), 28, 37, 39, 40(4), 41, 41(4), 42(18), 43, 45(4), 46, 49(4; 18), 58(4), 160, 162, 171, 244 Bergman, L. D., 544 Bergner, A., 174 Berkeley Structural Genomics Center, 77 Berkowitz, M. L., 516
author index Berman, H. M., 230(4), 231, 372, 376, 377, 378, 382, 383, 465, 497, 511, 548, 599, 625, 628(96) Bernal, J. D., 515, 570 Bernard, A., 125, 132(25) Berndt, K. D., 620 Berneche, S., 547 Bernstein, F. C., 230(3), 231, 372, 497, 599 Beroukhim, R., 206 Berriz, G. F., 628 Berry, E. A., 562 Berthault, P., 118 Berthet-Colominas, C., 616 Bertolasi, V., 332 Bertone, P., 548 Bestor, T. H., 115(95), 116(95), 118 Betancourt, M. R., 473 Betts, L., 196 Beveridge, D. L., 383 Bezborodov, A. M., 199 Bezborodova, S. I., 199 Bhat, T. N., 46, 230(4), 231, 372, 383, 465, 497, 511, 548, 599, 625, 628(96) Bhatnagar, A., 340 Bhinge, A., 517 Bhuiya, A. K., 63 Bijvoet, J. M., 3, 4, 6, 15(2), 20(3) Bilgin, N., 607 Biou, V., 17 Birck, C., 115, 116(84; 85), 608 Birmanns, S., 205, 208(10), 209(10), 213(10) Birney, E., 581 Bjorkman, P. J., 115, 116(86) Blackford, L. S., 371 Blake, C. C. F., 79, 122 Blake, J. D., 472 Blakemore, W., 207 Blanc, E., 411, 412(35) Blass, D., 206(21), 207 Blessing, R. H., 56, 57(56), 58, 58(56), 64(60), 68(61), 71, 399, 447(62), 448, 450(62) Blow, D. M., 3, 6, 9(15), 10, 10(16) Blum, D. L., 115(99), 116(99), 118 Blumenthal, R., 82, 125, 132(23) Blundell, T. E., 123 Blundell, T. L., 26, 45, 120, 199, 306, 325, 463(2; 4), 464, 467, 472, 472(2; 4), 473(22), 479(22), 557, 559 Boczko, E. M., 623, 624(80), 625(80) Bodo, G., 7
641
Boeckmann, B., 549(160), 550 Boguski, M. S., 465, 467 Boisvert, D. C., 163 Bokhoven, C., 3, 15(2), 20(3) Bompard-Gilles, C., 115(93), 116(93), 118 Bonanno, J. B., 246, 465 Bond, C. S., 77 Bondi, A., 92 Borchardt, R. T., 29, 64, 65(77), 68(77) Borge, J., 207 Borghese, C., 119 Borreguero, J. M., 618, 620, 622(62), 630(62), 632(62), 635(62), 637(62) Bourgeois, D., 615 Bourguet, W., 87, 88(36), 93(30; 36), 115(36), 116(36), 118(30) Bourne, P. C., 315, 376, 383 Bourne, P. E., 230(4), 231, 372, 374, 465, 497, 511, 548, 599, 625, 628(96) Boutonnet, N. S., 547 Bower, M., 469 Bowie, J. U., 76, 469, 476 Boylan, S., 593, 610(18) Brady, L., 122 Braig, K., 163 Branchaud, B. P., 562 Branden, C. I., 245, 581 Brannigan, J. A., 316 Braun, T., 204 Bray, J. E., 465 Brejc, K., 130 Brennan, S., 616 Brenner, S. E., 374, 465, 540, 543(77), 548, 578, 578(26) Breslow, E., 122 Brew, K., 463(1), 464, 472(1) Breyer, W. A., 88, 92(43), 112(43) Brice, M. D., 230(3), 231, 372, 497, 599 Bricogne, G., 18, 21, 29, 31(25), 41, 42(18), 46, 49(18), 54, 70, 80, 87, 88(31; 32), 90(31), 93(31), 95, 96(32), 97(31), 98(31), 104(31), 109(31), 112(31), 113(31; 32; 66; 81), 115, 116(81), 120, 122(3), 134, 135(41), 177, 189, 244, 325, 336, 411, 412(35) Broadwater, A., 544 Brock, C. P., 311 Brocklehurst, S. M., 473 Brodersen, D. E., 21, 164, 165(10), 175, 181(32)
642
author index
Bronstein, I. B., 322 Brookes, S., 207 Brooks, B. R., 422, 447(32), 456, 565 Brooks, C. L., 421, 618, 623, 624, 624(80), 625(80; 91), 627 Brooks, C. L. III, 623, 624(83), 625(83), 628, 629(114), 630(113; 114) Broutin, I., 87, 89(29), 91(29), 93(29), 110(29) Brown, F. K., 516 Brown, K. A., 601 Brown, M., 469, 581 Browne, W. J., 463(1), 464, 472(1) Bruccoleri, R. E., 422, 447(32) Brugde, J. S., 620 Bru¨nger, A. T., 11, 17(26), 21, 23, 30, 37, 39, 47, 47(5), 49(35), 50(40), 51, 54, 171, 173, 177, 191, 194(21), 196(21), 200(21), 244, 282(17), 283, 306, 386, 387, 415, 417, 418(1; 2), 421(1; 2), 423, 424, 426(35; 44), 432(2), 433(2), 435(1), 436(2), 441(2), 443, 445(35), 446, 446(35; 53), 462(1; 2) Brunmark, A., 316(41), 320 Brunne, R. M., 445 Brunskill, A., 76 Bryant, S., 374 Bryant, S. H., 455, 514, 528, 548, 558 Bryngelson, J. D., 618, 620, 620(4), 621, 623(4; 36) Brzozowski, A. M., 122 Bucciantini, M., 630, 631(122) Buchanan, S. K., 77 Bucher, P., 511, 549(166), 550 Buchwitz, B., 631 Buck, M., 594, 595(24) Buckley, P. A., 77 Buerger, M. J., 48, 61(37) Bugg, C. E., 398 Buldyrev, S. V., 618, 620, 621(46), 622(62; 63), 628, 628(46; 63), 630(46; 62; 63; 116), 631, 631(116), 632(62; 63; 116), 633(46; 63), 635(62; 63), 637(62; 63) Bunick, G. J., 152 Burden, L. M., 82, 125, 132(27) Burger, A., 206 Burgess, R. R., 177 Bu¨rgi, H.-B., 303, 313(2), 455 Burkhardt, K., 372 Burkhardt, N., 589, 604 Burla, M. C., 39 Burley, S. K., 246, 455, 465
Burling, F. T., 11, 17(26), 54, 423, 426(35), 443, 445(35), 446(35; 53) Burnett, M. N., 314 Burnett, R. M., 207 Bursulaya, B. D., 618 Burzlaff, H., 303, 313(2) Bushnell, D. A., 164, 165(7), 166(7), 177 Butler, S. A., 306, 314(21), 315(21) Butterfoss, G., 457, 458, 459, 460, 461(104; 105) Butterworth, S., 295 Bycroft, M., 623 Byrant, S. H., 514 Bystroff, C., 29, 190, 560 Bystrom, C. E., 562
C Cachau, R. E., 324, 332, 341(28a), 342 Caldentey, J., 595 Caldwell, J. W., 422, 447(34) Callaway, J., 374, 376(5) Camacho, C. J., 620, 623(35), 627(35) Camalli, M., 39 Camble, R., 113(81), 115, 116(81) Cammer, S. A., 511, 512, 517, 517(9), 539 Campbell, E. A., 164 Campos-Olivas, R., 491, 493(77) Capaldi, R. A., 603 Capel, M. S., 28, 181(39), 182, 246, 465, 589 Capitani, G., 115(97), 116(97), 118 Capozzi, F., 143 Card, G., 613 Carlisle, H. C., 3 Caroll, S. F., 115, 116(83) Carrozzini, B., 39 Carter, A. P., 164, 165(10), 175, 181(32) Carter, C. W., 29, 411, 412(35), 512, 517(9) Carter, C. W., Jr., 46, 145, 150, 158, 191, 194(21), 196, 196(21), 200(21), 511, 517, 539, 544, 619 Carugo, K. D., 403, 405(30) Carvin, D., 123 Casari, G., 514, 526(24), 626 Cascarano, G. L., 39 Cascio, D., 88, 98(40), 101(40) Case, D. A., 86, 418, 421, 421(14) Castellano, E. E., 134
author index Castro, B., 87, 93(34), 97(34), 115(34), 116(34) Cate, J. H., 179, 615 Cates, G. D., 119 Catfilsch, A., 620 Causse, H., 602 CCP4, 386 Cech, T. R., 179 Chacko, K. K., 141, 142(3) Chacon, P., 206, 208(11), 209, 209(11), 210(11), 221, 595 Chahine, J., 621 Chakravarty, S., 517 Chambon, P., 87, 88(36), 93(36), 115(36), 116(36) Chan, H. S., 618, 620, 623(33), 624, 626(89), 630 Chance, M. R., 246, 465 Chandrasekhar, J., 418 Chandrasekhar, K., 76 Chang, C., 125, 129(24), 132(24) Chang, C.-S., 62 Chang, G., 40 Chang, J.-J., 204 Chang, W. R., 122 Changeux, J. P., 611 Chapman, M. S., 261 Chappey, C., 380(33), 381 Chase, E. S., 206, 207, 207(13), 209(13), 216(13), 220(34) Chauvin, Y., 581 Chayen, N. E., 96, 115(69) Che, Z., 207 Chen, C. C. H., 77 Chen, C.-J., 80, 115(98), 116(98), 118 Chen, C. Y., 115(96), 116(96), 118 Chen, J. K., 620 Chen, L., 354 Chen, L. Q., 122 Chen, R. O., 549(164; 165), 550 Chen, S., 206(22), 207 Chen, W. Z., 547 Chen, X., 469 Cheng, B., 475 Cheng, N., 616 Cheng, R. H., 207, 220(34) Cheng, X., 115(92; 95), 116(92; 95), 118 Chertov, O., 82, 125, 132(23) Chiabrera, A., 509 Chien, F. T., 115(96), 116(96), 118
643
Chik, J. K., 553 Chinardet, N., 115, 116(84) Chiti, F., 630, 631(122) Chiu, W., 204, 207, 615 Chiverton, A., 245 Cho, Y., 605 Choi, J., 371 Chomilier, J., 516 Chothia, C., 374, 464, 465, 469, 515, 516, 518(39), 524, 540, 543(77), 548, 551, 551(51), 552, 552(29; 34), 556, 556(42), 557(51), 570, 574(127), 577, 578, 578(26) Christiansen, L., 122 Christopher, J. A., 256, 257(40), 259(40) Christova, P., 620 Cianci, M., 96, 115(69) Cieplak, P., 422, 447(34) Claessens, M., 473 Clarage, J. B., 167, 423, 446 Clark, B. F., 607 Clarke, J., 630 Cleary, A., 371 Clementi, C., 620, 626(45), 627(45), 628(45), 629(45), 631(45), 635(45) Clemons, W. M., 28 Clemons, W. M., Jr., 164, 165(10), 175, 181(32) Clever, H. L., 94 Cline, J. E., 89 Clingeleffer, D. J., 625 Clippe, A., 125, 132(25) Clore, G. M., 39, 47(5), 177, 244, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Clowney, L., 378 Cochran, W., 60 Cogdell, R. J., 316 Cohen, A., 88, 97(42), 112(42) Cohen, F. E., 390, 391(11), 459, 469, 472, 558, 595 Cohen, G. H., 557, 558 Cohen, J., 207 Colau, D., 125, 129(26), 132(26) Coleman, W. G., Jr., 77 Collaborative Computational Project Number 4, 39, 242 Collaborative Crystallographic Project No. 4, 305, 306(13), 310(13), 311(13), 314(13) Collier, M. L., 544
644
author index
Colloc’h, N., 87, 93(30; 34), 97(34), 115(34), 116(34), 118(30) Colls, J., 113(81), 115, 116(81) Colman, P. M., 79 Combs, A. P., 620 Conn, H. L., Jr., 85 Connolly, M. L., 86, 388, 419, 503 Conway, J. F., 616 Cook, W. J., 398, 403, 405(32) Cooper, J. B., 306 Cornell, W. D., 422, 447(34) Coskun, U., 603 Cosme, J., 115(90), 116(90), 118 Costabel, M., 327, 338(12) Coster, D., 3(7), 4 Cota, E., 630 Cowan, S. W., 207, 278, 390, 391(8), 411(8), 412(8) Cowtan, K. D., 29, 30, 31, 33(32), 35(37), 134, 135(42), 148, 190, 245, 282, 282(15; 16), 283, 359, 362(14) Cowtan, S. W., 244 Craievich, A., 134 Craig, R., 211, 220(50), 223(50), 224(50) Cramer, C., 422 Cramer, F., 89 Cramer, P., 164, 165(7), 166(7) Crawley, J. B., 613 Creagh, D. C., 143 Creighton, T., 624 Crick, F. H. C., 6, 9(15), 10(16), 325 Crippen, G. M., 514, 526(25) Crofts, A. R., 562 Cross, P. C., 565 Crowfoot, D. M., 3 Cuff, M. E., 602 Cui, Q., 449 Cuillel, M., 616 Cullens, S. C., 84 Cullis, A. F., 9 Cusack, S., 77 Cusanovich, M. A., 67 Cutfield, J. F., 199 Cutfield, S. M., 199 Cutsem, E. V., 473 Czerwinski, E. W., 41, 154
D Daggett, V., 547, 620, 623, 623(38), 624(38; 81; 84; 85), 625(85), 626(38) Dahlquist, F. W., 603, 604 Dailey, T. A., 115(98), 116(98), 118 Dainese, E., 602, 612 Dalby, A. R., 315 Dall’Antonia, F., 326(11), 327, 328(11), 339(11) Dann, C. E., 125, 133(32) Danz, H., 170 Darden, T. A., 343 Darnton, N. C., 616 Darst, S. A., 164, 209 Daugherty, M., 115(94), 116(94), 118 Dauter, M., 21, 80, 95, 120, 121, 122, 122(3; 7) Dauter, N., 29 Dauter, Z., 21, 29, 67, 79, 80, 82, 95, 120, 121, 122, 122(3; 7), 123(8), 124(8), 125, 130, 132(8), 133(28; 30), 153, 295, 325, 408 David, P. R., 177 Davidson, A. R., 620, 631, 637 Davidson, V. L., 354 Davies, D. R., 557 Davis, I. W., 394, 395(19), 397(19), 448, 452(64), 456(64), 461(64) Davis, M. E., 561 Dawant, B. M., 210(48), 211 Deacon, A. M., 77 de Arau´jo, A. F. P., 618, 639 Debaerdemaeker, T., 63 de Bakker, P. I. W., 394, 395(19), 397(19), 448, 452(64), 456(64), 461(64) Debon, F. L., 85 Debye, P., 601 Decius, J. C., 565 Declercq, J.-P., 61, 125, 132(25) De Crombrugghe, M., 575 De Filippis, V., 612, 613 de Forcrand, R., 88 Defrenne, S., 616 De Graaff, R. A. G., 171 de Groot, B. L., 565 Deisenhofer, J., 77, 115(89), 116(89), 118, 164 de la Fortelle, E., 18, 21, 41, 42(18), 49(18), 70, 80, 87, 88(31; 32), 90(31), 93(31), 95, 96(32), 97(31), 98(31), 104(31), 109(31), 112(31), 113(31; 32; 66; 81), 115, 116(81), 120, 122(3), 134, 135(41), 177, 244, 336
author index DeLano, W. L., 39, 47(5), 177, 244, 386, 405 Delarue, M., 544, 595 Delaunay, B. N., 516 Delgoda, R., 77 De Maeyer, M., 390, 391(10) Demange, P., 206 Demarcelle, M., 115(93), 116(93), 118 De Maria, L., 547 Dementieva, I., 60, 123, 124(19), 328 Demeny, T., 383 Demirel, M. C., 304 Demmel, J., 371 Denesyuk, A. I., 604 de Oliveira, R. T., 134 Derewenda, Z. S., 122, 125, 133(30) DeRosier, D. J., 206, 209(14), 210(14), 211, 211(14), 217(14), 220(14; 50), 221, 221(14), 222, 222(14), 223(50), 224(50) De Santis, G., 616 Deshayes, K. D., 125, 132(22) Desmet, J., 390, 391(10) de Sousa, S. L. M., 119 Desvaux, H., 118 DeTitta, G. T., 56, 63, 63(53), 65(53), 66(53), 244 Devedjiev, Y., 125, 133(30) Devereux, J., 558 Devine, C. S., 340 de Vos, A. M., 125, 132(22) Dewan, J. C., 105, 109(74) Dewey, T. G., 472 deWind, N., 77 DeWitte, R. S., 514 Dhillon, I., 371 Diamond, R., 261, 304, 324(4), 596 Dias, D. P., 206 Diaz, J. F., 595 Dickerson, R. E., 26, 40, 45(16), 305, 311(11) Dickinson, R., 119 Diedrich, G., 604 Dill, K. A., 455, 514, 547, 575, 618, 620, 623(33), 624, 626(89), 630 Diller, D. J., 245 di Marco, S., 70 Di Muro, P., 602, 612 Di Nardo, A. A., 620 Ding, F., 618, 620, 622(63), 628(63), 630, 630(63), 631, 632(63), 633(63), 635(63), 637(63; 120) Dinner, A. R., 618, 626
645
Dintzis, H. M., 7 Dixon, M. M., 547, 575(5) Djinovic-Carugo, K., 88, 106(45), 113 Do, R. K. G., 467, 475(24), 478(24), 479(24) Dobrott, R. D., 52 Dobson, C. M., 438, 630, 631(122; 146), 636 Dodd, F. E., 603 Dodge, C., 374 Dodson, E., 21, 122 Dodson, E. J., 23, 37, 39, 72(8), 171, 199, 242, 246, 273(26), 295, 306, 308, 316, 332, 350, 400, 408(38) Dodson, G. G., 122, 199, 322, 551, 552(34) Doi, M., 632 Dokholyan, A. V., 618, 620, 622(63), 628(63), 630(24; 63), 631, 632(63), 633(63), 635(63), 637(63) Dokholyan, N. V., 618, 619, 620, 621(46), 622(62), 628, 628(46), 630, 630(46; 62; 116), 631(116), 632(62; 116), 633(46), 635(62), 637(62; 120), 638, 639(150) Domany, E., 514 Domingo, E., 207 Dominy, B. N., 421 Donaldson, L. W., 603 Dong, A., 115(95), 116(95), 118 Dongarra, J., 371 Doniach, S., 595, 616, 624, 625(87) Donini, O., 624 Doolittle, R., 558 Dorocke, J. A., 125, 133(31) Doublie, S., 150, 168 Doudna, J. A., 179 Downing, K. H., 207 Doye, J. P. K., 622, 622(70), 627(70) Drenth, J., 120, 186(41), 187, 619, 623(29) Drickamer, K., 179 Driehuys, B., 119 Driessen, H., 305, 311(12) Driessen, H. P. C., 306, 314(21), 315(21) Du, R., 622, 627(71), 634(71) Duan, Y., 631 Dubochet, J., 204 Ducros, V. M.-A., 316 Ducruix, A., 606 Duda, R. L., 616 Duda, R. O., 247 Dumoutier, L., 125, 129(26), 132(26) Dunaway-Mariano, D., 77
646
author index
Dunbrack, R. L., 390, 391(11), 456, 459, 459(103), 472 Dunbrack, R. L., Jr., 469, 473, 475(63) Duncan, B. S., 565 Dunietz, B. D., 575 Dunitz, J. D., 303, 311, 313(2), 455 Dunn, B. M., 125, 133(28) Duporque, D., 79 Durbin, R., 560, 581 Durell, S. R., 304 Durley, R., 354 Duyckaerts, C., 574 Dyda, F., 602
E Eady, N. A. J., 601 Ealick, S. E., 77 Earnest, T. N., 21, 615 Eaton, W. A., 620, 626(40), 627(40) Ebel, C., 607 Echols, N., 577 Eddy, S. R., 469, 560, 581 Edelman, M., 516 Edelsbrunner, H., 517 Edgell, M. A., 517, 539 Edgell, M. H., 544 Edmundson, A. B., 399 Efimov, V. P., 94, 115(57), 116(57) Egea, P. F., 608 Egelman, E. H., 206(23), 207 Egerton, M., 113(81), 115, 116(81) Egile, C., 212 Ehrenberg, M., 607 Eicken, 262 Eigenbrot, C., 432 Eisenberg, D., 25, 27(11), 40, 41, 43, 43(21), 45(22), 54(21), 115, 116(83), 469, 476, 528 Eisenhaber, F., 390 Eker, F., 451 El Hajji, M., 87, 93(34), 97(34), 115(34), 116(34) Elisseeva, E. L., 620 Ellis, P., 88, 97(42), 112(42) Elmasry, N., 622 Elsner, J., 448, 449(70) Elstner, M., 448, 448(71), 449, 449(70), 450, 450(71; 73), 452, 453, 453(84)
Engel, A., 204 Engel, J., 94, 115(57), 116(57) Engelman, D. M., 589 Engh, R. A., 377(25), 378, 391(25), 397 Enyeart, J. J., 94, 95(63) Enzlin, J. H., 77 Epp, O., 164 Epstein, J. A., 548, 549(24) Eriani, G., 544 Erman, B., 304 Ermekbaeva, L. A., 199 Esnouf, R. M., 232 Esposito, L., 325, 328(3) Estermann, M. A., 52 Eswar, N., 465 Etchebest, C., 390, 391(7) Evans, G., 60, 328, 329 Evans, P. R., 70 Evans, R. W., 613 Evanseck, J. D., 473, 475(63) Everitt, P., 88, 106(45) Evrard, C., 125, 132(25) Ewyng, G. J., 86
F Faber, H. R., 562 Fabre, C., 115, 116(85), 602 Fairlamb, A. H., 77 Falicov, A., 558 Falquet, L., 511 Fan, H.-F., 72 Fanuel, L., 115(93), 116(93), 118 Farrenkopf, B., 602 Faust, L., 221 Faye, I., 115, 116(86) Featherstone, R. M., 84, 85, 86, 87(15), 90(5) Fedorov, A. A., 223 Fedorov, B. A., 597 Feese, M. D., 562 Fefe`vre, J. F., 610 Feigin, L. A., 587, 593, 602 Feigon, J., 374 Felciano, R., 549(165), 550 Felsenstein, J., 470, 483(42) Feng, S., 620 Feng, Z., 230(4), 231, 372, 377, 379, 382, 382(30), 383, 465, 497, 511, 548, 558, 599, 625, 628(96)
647
author index Ferguson, D. M., 422, 447(34) Ferretti, V., 332 Ferrin, T. E., 435 Fersht, A. R., 534, 618, 622, 622(10), 623, 624(81; 85), 625(85), 631, 637 Fesinmeyer, R. M., 421 Fetler, L., 312(99), 605, 606, 610(87), 611 Feyfant, E., 479, 481(73), 482(73), 486(73), 490(73) Fidelis, K., 475 Fiebig, K. M., 624, 626(89) Field, M. J., 118, 448, 473, 475(63), 566 Filimonov, V. V., 620, 621(53), 632(53) Finch, J. T., 17, 166 Findsen, L., 445(55), 446 Finet, S., 588, 602(4) Finkelstein, A. V., 618, 620, 621(41), 624, 626(41; 95), 627(41) Finnefrock, A. C., 616 Finney, J. L., 515, 516, 570, 574 Fischer, G., 591(60), 602 Fischer, S., 473, 475(63) Fiser, A., 463, 463(6; 7), 464, 467, 467(6), 471, 472(6), 473, 475(24), 478(24), 479, 479(24), 480, 481(73), 482(73), 486(73), 490(73; 74), 491, 493(6; 7; 73) Fisher, A. J., 207, 401 Fita, I., 207 Fitzgerald, P. M. D., 376 Fitzpatrick, J. M., 208, 210(48), 211 Flaherty, K. M., 11, 17(26), 54, 443, 446(53) Flannery, B. P., 205 Fleming, P. J., 574 Fleming, T. M., 315 Fletcher, R. J., 115(91), 116(91), 118 Fletterick, R. J., 29, 190, 206, 209 Flo¨ckner, H., 469 Flores, T. P., 200, 559 Foadi, J., 39, 72(8) Fontana, A., 613 Fontecilla-Camps, J. C., 106, 118 Forman-Kay, J. D., 620 Forsythe, E. L., 122 Fortier, S., 69, 245 Fotiadis, D., 204 Fourme, R., 83, 87, 88(31; 32), 89(29), 90(31), 91(29), 93(29–31), 96(32), 97(31), 98(31), 99(28), 101(28), 104(31), 106(28), 109(31), 110(29), 112(31), 113(31; 32), 115(35), 116(35), 118(30), 179
Fowler, S. B., 630 Fox, T., 422, 447(34) Franceschi, F., 164 Frank, J., 172, 174(27), 175(27), 180(27), 213, 603, 615 Franks, N. P., 119 Frauenheim, T., 448, 448(71), 449, 449(70), 450, 450(71; 73) Frederick, C. A., 118 Freeborn, B., 172, 174(27), 175(27), 180(27), 589, 603 Freeman, B. D., 631 Freeman, H. C., 625 Freer, A. A., 316 French, G. S., 70 Frere, J. M., 115(93), 116(93), 118 Frey, M., 118 Fridkin, M., 130 Friedel, G., 3(8), 4 Friesner, R. A., 565, 575 Frisch, M. J., 455 Frolow, F., 170 Fromage, N., 606 Fromme, P., 115, 116(82) Fu, J., 177 Fu, P., 517 Fuchs, S., 130 Fujinaga, M., 52, 65(42) Fujisawa, T., 595, 604 Fujiyoshi, Y., 204 Fukunaga, K., 230 Fukuyama, K., 402 Fuller, S. D., 207 Furey, W., 70, 76, 602 Furey, W. F., 122
G Gaasterland, T., 246, 465 Gabashvili, I. S., 615 Gaitskoki, V. S., 613 Gala´n, J. E., 125, 129(33), 133(33) Galazka, W., 514, 627 Gallagher, S. C., 594(27), 595 Gallagher, W., 432 Gallo, S. M., 39, 56(9), 349 Galzitskaya, O. V., 620, 621(41), 626(41), 627(41) Gampe, R. T., 608
648
author index
Gan, H. H., 517 Gangloff, J., 544 Gao, J., 448, 473, 475(63) Gao, M., 547, 575(2) Gao, Y., 602 Garavelli, J. S., 480 Garcia, A. E., 595, 621, 629, 630(117) Garcia, L. G., 639 Garman, E. F., 105, 168, 325 Garratt, R. C., 134 Garrett, T. P., 625 Gassmann, J., 189 Gastinel, L. N., 115, 116(86) Gautel, M., 610 Gelbin, A., 378, 379, 382(30), 383 Gellatly, B. J., 516, 574 Genick, U. K., 555 Gerloff, D. L., 469 Germain, G., 61 Gernert, K. M., 544 Gerstein, M., 211, 467, 516, 518(39), 528, 546, 547, 550(7), 551, 551(52), 552, 552(29), 553(7), 555, 555(7), 556, 557, 557(72), 558, 559, 559(72), 560, 565, 570, 570(40; 41), 573, 574, 574(127; 128), 577, 577(45; 51) Ghosh, M., 207 Giacovazzo, C., 39, 55 Gibbons, C., 603 Gibrat, J. F., 558 Gibson, T. J., 506, 581 Gilbert, R. J., 206(22), 207 Gilli, G., 332 Gilliland, G., 230(4), 231, 372, 383, 465, 497, 511, 548, 599, 625, 628(96) Gilliland, G. L., 399, 432 Gilmore, C. J., 29, 46 Gilson, M. K., 547 Gish, W., 380, 467 Glasgow, J., 245 Glatter, O., 587, 593 Gloor, S. M., 115, 116(88) Gluehmann, M., 164 Gnatt, A. L., 177 Go, N., 284, 304, 324(5; 6), 517, 618 Go, R. T., 119 Godefroy, G., 574 Godzik, A., 469, 472, 514, 558, 627 Goetz, D., 310, 320(24) Golden, B. L., 179
Goldie, D. J., 562 Goldie, K. N., 211, 215(55), 216(55) Goldsmith, S. C., 211, 220(50), 223, 223(50), 224(50) Goldstein, A., 46, 200 Golubev, A. M., 125, 129(34), 131, 133(29; 34) Gonzales, R. C., 246(29), 247 Gonzalez, A., 327, 338(12) Goodfellow, J. M., 13, 16(30) Gooding, A. R., 179 Goodwill, K. E., 76 Görbitz, C. H., 456 Gordon, E. J., 80 Gosse-Kunstleve, R. W., 23, 171 Goto, N. K., 604 Gouaux, J. E., 605, 606 Gould, I. R., 422, 447(34) Gould, R. O., 62 Gramaccioli, C. M., 303, 313(2) Grant, X., 419 Grantcharova, V. P., 618, 620, 620(16), 630(16), 632(16; 52), 637(16; 52) Grassucci, R. A., 172, 174(27), 175(27), 180(27), 603, 615 Graur, D., 505 Gray, R., 208 Gray, W. R., 399 Graziano, V., 17 Greene, B., 615 Greer, J., 252, 472 Gribskov, M., 469, 558 Griebenow, K., 451 Griffith, J. P., 551 Grimes, J. M., 207 Grimm, R., 204 Grindley, H. M., 558 Grishin, N. V., 115(94), 116(94), 118 Grisworld, I. J., 88, 92(43), 112(43) Gronenborn, A. M., 491, 493(77) Gronenmayer, H., 87, 88(36), 93(36), 115(36), 116(36) Gros, P., 177, 244, 386, 415, 418(2), 421(2), 423, 424(38; 40), 426(2; 40), 432(2), 433(2), 436(2), 441(2), 462(2) Grosberg, A. Y., 618, 620, 621(15), 622, 623(37), 627(71), 631, 633, 634(71) Gross, E. G., 84 Gross, H., 206, 207(20) Gross, P., 39, 47(5)
Grosse-Kunstleve, R. W., 37, 39, 47, 47(5), 49(35), 177, 244, 246, 273(28), 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Grossmann, J. G., 594(26), 595, 603, 613 Growhurst, G. S., 315 Grueber, G., 594(25), 595, 603 Gruener, S. M., 616 Grütter, M. G., 70, 115(97), 116(97), 118 Gsponer, J., 620 Guddat, L. W., 399 Guenther, B., 479 Guerois, R., 620, 626(43), 627(43) Guilloteau, J. P., 606 Guinier, A., 592 Gunasekaran, K., 395(22), 396 Guo, H., 473, 475(63) Guo, Z., 628, 629(114), 630(113; 114) Guss, J. M., 115(91), 116(91), 118, 130, 625 Gustchina, A., 125, 133(28) Gutin, A. M., 618, 620, 621, 621(7), 623(34), 624, 627, 627(7), 628, 630(7)
H Ha, S., 473, 475(63) Haas, D. J., 170 Hacker, J., 602 Haft, D. H., 480 Hagen, S. J., 421 Hainfeld, J. F., 211 Hajdu, J., 115, 116(87) Halfon, Y., 170 Haliloglu, T., 304 Hall, A. C., 119 Hall, C. K., 631 Hamiaux, C., 607 Hamill, S. J., 630 Hamm, P., 451, 453(80) Hammarling, S., 371 Haneef, M. I. J., 305, 311(12) Hanein, D., 204, 206, 209(14; 15), 210(14; 15), 211, 211(14), 212, 217(14), 220(14; 50), 221, 221(14), 222, 222(14), 223(50), 224(50) Hansen, J., 164, 165(9), 181(39), 182, 615 Hanson, M. A., 163 Hao, M. H., 627 Hao, Q., 603 Happer, W., 119
Harata, K., 393 Harel, D., 473 Harel, M., 115(91), 116(91), 118, 130 Harms, J., 164 Harp, J. M., 152 Harpaz, Y., 516, 552 Harris, G. W., 305, 306, 308(15), 309(14; 15), 311(12), 314(21), 315(21) Harris, R. A., 119, 125, 133(31) Harrison, P. M., 577 Harrison, R. W., 191 Harrison, S. C., 163, 171 Hart, M., 167 Hart, P. E., 247 Harter, T. M., 340 Hartsch, T., 164, 165(10) Hasnain, S. S., 594(26), 595, 603, 613 Hatanaka, H., 606 Hatchikian, C., 118 Haugk, M., 448, 449(70) Hauptman, H. A., 37, 56, 56(1), 57, 58, 62, 62(70), 63, 63(53), 64(60), 65(53), 66(53), 69, 75, 244 Hauptman, J., 156 Hausrath, A. C., 603 Haussler, D., 469, 560, 581 Havel, T. F., 473 Hawkes, D. J., 210(46; 47), 211 Hawkins, G., 422 Hawley, R. C., 421 Hawthornethwaite-Lawless, A. M., 316 Hayward, S., 211, 311, 565 Hazelwood, L., 212 Hazout, S., 390, 391(7) He, J. J., 403, 405(31) Hegde, R., 163 Hegyi, H., 577 Heightman, T. D., 115, 116(88) Heine, A., 616 Heinemann, U., 323, 324, 334(2) Hellingwerf, K., 615 Helliwell, J. R., 56, 96, 115(69), 167 Hemler, P., 210 Hendrickson, T., 421 Hendrickson, W. A., 12, 13(27), 15, 16, 16(27; 28), 18(34), 21(27), 38, 49(3), 56(3), 74, 79, 121, 122, 122(6), 123, 138, 141(1), 179, 189, 191, 244, 325, 417, 432(4), 602 Hendrix, R. W., 616 Hengelein, F. M., 89
Henikoff, J., 511 Henikoff, S., 511 Hennecke, H., 115(97), 116(97), 118 Henri, L., 115(89), 116(89), 118 Henry, G., 371 Hensley, L., 544 Heo, N. H., 450, 625 Hermans, J., 118, 414, 418, 420, 449, 450(73), 451, 452, 453, 453(84), 457, 458, 459, 460, 461(104; 105) Herschlag, D., 616 Hershfield, M. S., 29, 64, 65(77), 68(77) Hervé, G., 611 Herzberg, O., 77, 395(21), 396 Heuser, J. E., 212 Hewat, E. A., 206(21), 207 Higgins, D. G., 506, 581 Highsmith, S., 547 Higuchi, S., 604 Hilbers, C. W., 375, 378(20), 379(20) Hilge, M., 115, 116(88), 171 Hilgenfeld, R., 80, 113, 602 Hill, D. L., 210(46; 47), 211 Hill, R. C., 463(1), 464, 472(1) Himm, J. F., 94, 95(63) Hinsen, K., 566 Hinton, G. E., 254 Hirokawa, G., 60 Hirokawa, N., 209 Ho, Y.-S. J., 82, 125, 132(27) Hobohm, U., 258, 522, 528, 528(64), 531(70) Hodgkin, D. C., 551, 552(34) Hodgkin, D. M., 199 Hodgson, K. O., 13, 16(30), 54, 616 Hoenger, A., 206, 207(20), 211, 215(55), 216(55) Hofmann, K., 511, 549(166), 550 Hofmann, T., 306 Hofricher, J., 616 Hogue, C. W., 374, 548 Hol, W. G. J., 166, 245, 423, 424(38) Holbrook, S. R., 305, 311(10; 11) Holden, H. M., 67, 206, 207(16) Hollinger, F. P., 421 Holm, L., 245, 374, 465, 512, 549(161), 550, 558 Holmes, K. C., 206, 207(16) Holmgren, A., 323 Holton, T. R., 256, 257(40), 259(40) Homo, J.-C., 204
Honda, K., 455 Honig, B., 419, 420, 420(15), 494, 501, 506(6), 509 Honig, H., 494 Honzatko, R. B., 605 Hooft, R. W. W., 295, 332, 374, 377(7; 8), 417, 476, 523 Hoover, D. M., 82, 125, 132(23) Hope, H., 170 Hoppe, W., 189 Hörer, S., 313, 318(34), 319(34) Horiguchi, T., 206(23), 207 Horton, J. R., 74, 189 Horwich, A. L., 163 Hough, M. A., 603 Hovmoller, S., 396 Howard, A. J., 77 Howard, E., 326, 328, 330(10), 339(10) Howell, P. L., 29, 58, 64, 65(77), 68(61; 77), 71 Howlin, B., 305, 306, 308(15), 309(14; 15), 311(12), 314(21), 315(21) Hoye, E., 593, 610(18) Hsieh, J.-C., 125, 133(32) Hsieh, S.-H., 378, 379, 382(30), 383 Hsu, W. H., 115(96), 116(96), 118 Hu, H., 452, 453, 453(84) Hu, M., 411, 412(35) Huang, C. C., 435 Huang, E., 552, 600 Huang, Q., 451 Hubbard, R. E., 199, 255, 282(18), 283, 289(18) Hubbard, T., 374, 469, 540, 543(77), 548, 578(26) Hubbard, T. J. P., 465, 578 Hubbel, J. H., 103, 104(72) Huber, A. H., 35, 36(38) Huber, R., 163, 164, 173, 174, 206, 377(25), 378, 391(25), 397, 551 Huebner, G., 602, 617 Huge-Jensen, B., 122 Hughey, R., 469 Hung, L.-W., 246, 273(28) Hunkapiller, T., 581 Hunt, L. T., 480 Hunter, W. N., 77, 120 Hurley, J. H., 82, 125, 132(27) Hutchinson, E. G., 374, 394, 524
I Igarshi, Y., 616 Illing, G., 324, 334(2) Impey, R. W., 418 Improta, S., 610 Inagaki, F., 606 Ioerger, T. R., 244, 246, 250, 253(35), 256, 257(40), 259(40), 273(28) Irbäck, A., 628 Irving, T. C., 616 Irwin, J., 113(81), 115, 116(81) Isaacs, N. W., 199, 316 Islam, S. A., 123 Israelachvili, J., 90, 91(53), 92(53) Isralewitz, B., 547, 575(2) Istvan, E. S., 77 Isupov, M. N., 306, 315, 315(20) Itzhaki, L. S., 623, 624(81; 85), 625(85), 637 IUPAC-IUB Commission on Biochemical Nomenclature, 379 Iurcu, G., 547 Iwaoka, M., 456
J Jack, A., 163, 171 Jackson, J. B., 77 Jackson, M. R., 316(41), 320 Jackson, S. E., 622, 632 Jacrot, B., 589, 616 Jain, S. C., 378, 383 Jakana, J., 207 Jalkanen, K., 450 James, M. N., 626 Janell, D., 164 Janin, J., 524, 551, 575 Jansen, R., 546, 577 Jansonius, J. N., 79 Jap, B., 163 Jaroszewski, L., 472 Jarvis, L. E., 435 Jeffrey, L. C., 403, 405(32) Jelsch, C., 399, 447(62), 448, 450(62) Jennings, A. J., 465, 472(11) Jennings, P. A., 620 Jensen, G. J., 177 Jensen, L. H., 13, 16(30) Jernigan, R. L., 304, 514, 526(18), 638 Jessen, S., 602
Jez, J. M., 332 Jiang, J.-S., 39, 47(5), 177, 244, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 446, 462(2) Jia-Xing, Y., 400, 408(38) Jimenez, J. L., 206(22), 207 Jin, L., 606 Joachimiak, A., 47, 49(35), 60, 76, 123, 124(19), 163, 324, 328 John, G., 247 John Hart, P., 88, 98(40), 101(40) Johnson, C. K., 314 Johnson, E. F., 115(90), 116(90), 118 Johnson, J. E., 204, 207, 615, 616 Johnson, L. N., 26, 45, 120, 325 Johnson, M. S., 463(4), 464, 472(4) Johnson, T., 577 Jones, D. T., 374, 469, 501, 506(5), 514, 549(163), 550 Jones, M. L., 374 Jones, S., 374, 501, 506(5), 548 Jones, T. A., 207, 239, 244, 245, 278, 390, 391(8), 394, 411(8), 412(8) Jones, T. H., 473 Jones, T. L. Z., 125, 133(30) Jordan, F., 602 Jorgensen, W. L., 418 Joris, B., 115(93), 116(93), 118 Joseph, D., 561 Joseph-McCarthy, D., 473, 475(63) Joyce, J. A., 119 Jullien, R., 516 Junemann, R., 589 Jungnickel, G., 448, 449(70) Junker, J., 546 Jurnak, F., 166
K Kabsch, W., 71, 374, 632 Kack, H., 332 Kahn, R., 87, 88(32), 96(32), 113(32), 179 Kajander, T., 77 Kaji, A., 60 Kajita, A., 616 Kalidas, C., 616 Kallenbach, N. R., 451 Kamiya, N., 137 Kammerer, R. A., 94, 115(57), 116(57) Kane, D. J., 603
Kans, J. A., 548, 549(24) Kantrowitz, E. R., 606 Kapoor, T. M., 620 Kaptein, R., 295, 332, 375, 377, 378(20), 379(20), 382(24) Kar, T., 456 Karle, J., 13(31), 14, 16(31), 17(31), 38, 49(2), 56(2), 57, 62, 63 Karlsson, A., 322 Karplus, M., 191, 194(21), 196(21), 200(21), 422, 447(32), 448, 449, 456, 459, 459(103), 469, 471, 473, 475(63), 477(43), 479(43), 516, 561, 565, 566, 575, 618, 622, 623, 624(82), 625(82), 626, 627, 630, 631, 631(119), 636, 637(146), 638, 639(150) Karplus, P. A., 396 Karsch-Mizrachi, I., 380(32), 381, 384(32) Karshikoff, A., 620 Kasahara, C., 620 Kasher, R., 130 Kataeva, I. A., 115(99), 116(99), 118 Kataoka, M., 606 Katchalski-Katzir, E., 130 Kawamoto, M., 137 Kaxiras, E., 448(71), 449, 450(71; 73) Kay, L. E., 604 Kazmirski, S. L., 623, 624(84) Ke, H. M., 196, 605 Keep, N., 322 Kelis, J. T., Jr., 623 Keller, W., 150 Kendell, M. G., 367 Kendrew, J. C., 7, 26, 40, 45(16), 85, 99(11) Kennan, R. P., 94, 95(63) Kennard, O., 230(3), 231, 372, 497, 599 Kern, D., 553, 565(47) Keskin, O., 304 Kessler, R. M., 210(48), 211 Khalak, H. G., 39, 56(9), 349 Khan, G., 305, 311(12) Khokhlov, A. R., 631, 633 Kidera, A., 304, 324(5; 6) Kiefhaber, T., 616 Kihara, D., 473 Kihara, H., 616 Kikkawa, M., 209 Kim, A., 77 Kim, P. S., 622, 625 Kim, S., 305, 311(10; 11) Kim, S.-H., 25, 40
Kimura, K., 616 King, A., 207 King, J., 615 Kirshenbaum, K., 547 Kirste, R. G., 589 Kisker, C., 88, 94(59), 98(40), 101(40), 115, 115(59), 116(59) Kitao, A., 565 Kjeldgaard, M., 21, 207, 245, 278, 390, 391(8), 411(8), 412(8), 589, 607 Klein, D. J., 410 Klein, M. L., 418 Klemm, J. D., 625 Kleywegt, G. J., 245, 394, 601 Klimov, D. K., 618, 621(9), 627, 628 Klug, A., 166, 175 Klukas, O., 115, 116(82) Knablein, J., 174 Knapp, S., 620 Knapp-Mohammady, M., 450 Knol, K. S., 3(7), 4 Knoops, B., 125, 132(25) Kobayashi, H., 96, 97(70), 517 Koch, M. H. J., 586, 589, 591(60; 65), 594, 594(25), 595, 595(24), 596, 598, 599(42), 602, 603, 604, 605, 607, 616, 617 Kocher, J. P., 514 Koenig, B., 602 Koenig, S., 591(60; 65), 602, 617 Koetzle, T. F., 230(3), 231, 372, 497, 599 Kohavi, J., 247 Kohli, E., 207 Kohtz, D. S., 547, 575(4) Kolinski, A., 469, 473, 514, 624, 625(88; 91), 626(90), 627 Kollman, P. A., 86, 422, 447(34), 624, 631 Konarev, P. V., 600 Konnert, J. H., 191 Kornberg, R. D., 164, 165(7), 166(7), 177 Korolev, S., 123, 124(19) Kort, R., 615 Kosman, R. P., 606 Kossiakoff, A. A., 432 Kostyukova, A., 595 Kozielski, F., 207 Kozin, M. B., 594, 597(47; 49), 600, 603, 604 Krahn, J. M., 343 Kratky, C., 88, 95(39), 106(37), 108(37), 109(37), 112(37; 39), 170 Kratky, O., 587
Kraulis, P. J., 315 Krauß, N., 115, 116(82) Kraut, J., 7, 141 Krebs, W. G., 211, 546, 547, 550(7), 553(7), 555(7), 565, 577 Kresge, N., 88, 97(42), 112(42) Kretsinger, R. H., 86 Kreychman, J., 555, 577(45) Krimm, S., 456 Krishnamoorthy, B., 514 Krogh, A., 469, 560, 581 Krueger, J. K., 594(27), 595, 610 Krukowski, A. E., 27, 46 Krumpolc, M., 589 Kryger, G., 115(91), 116(91), 118 Kubota, T., 402 Kuchnir, L., 473, 475(63) Kuczera, K., 473, 475(63) Kudo, T., 604 Kühlbrandt, W., 204 Kulp, D., 560 Kumar, A., 13 Kundrot, C. E., 179 Kunishima, N., 402 Kuntz, I. D., 547 Kuntz, I. D., Jr., 85(21), 86, 118(21) Kuprin, S., 598, 599(42) Kuriyan, J., 191, 194(21), 196(21), 200(21), 310, 479 Kussell, E., 574 Kuszewski, J., 39, 47(5), 177, 244, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Kuznetsov, S. R., 125, 133(30) Kvick, Å., 87, 88(32), 96(32), 113(32) Kyogoku, Y., 382
L LaBean, T. H., 388, 409(4), 417 Ladenstein, R., 620 Ladner, J. E., 11, 377, 382(23) Ladurner, A. G., 623, 624(85), 625(85) Lahr, S. J., 544 Lambert, M. H., 608 Lamers, M. H., 77 Lamour, V., 328
Lamzin, V. S., 22, 29(2; 3), 75, 134, 135(43), 137(43), 143, 229, 232, 235, 245, 276, 295, 325, 327, 338(12), 344, 399, 447(62), 448, 450(62) Landon, C., 118 Langer, J. A., 589 Langridge, R., 435 Larson, S. M., 631, 637 Lash, A. E., 380(33), 381 Laskowski, R. A., 295, 332, 374, 377, 382(23; 24), 387, 394(2), 476, 486(70) Lasters, I., 390, 391(10), 473 Lata, R. K., 603, 616 Lathrop, R., 245 Lattman, E. E., 12, 16(28), 597 Lau, F. T. K., 473, 475(63) Lavery, R., 382, 390, 391(7) Lawrence, C. E., 455, 514 Lawrence, J., 84 Lazaridis, T., 623, 624(82), 625(82) Leahy, D. J., 125, 133(32) Lebedev, A., 308, 350 Leclerc, F., 516 Leclerc, J., 84 Lecomte, C., 399, 447(62), 448, 450(62) Ledley, R. S., 480 Lee, B., 518 Lee, C., 403 Lee, H. J., 115, 116(87) Lee, J., 80 Lee, P. L., 17 Lee, T.-S., 448 Lee, W. M., 207, 220(34) LeFebvre, B. C., 517, 539 Leherte, L., 245 Lehman, W., 211, 220(50), 223(50), 224(50) Lehmann, C., 122 Lehmbeck, J., 400, 408(38) Leippe, D., 207, 220(34), 380(33), 381 LeMaster, D. M., 74, 189 Lemker, T., 603 Leonard, G. A., 80, 120, 316 Lepault, J., 204, 207 Lesk, A. M., 464, 551, 552, 552(29; 34), 556, 556(42), 557(51) Leslie, A. G., 29, 46, 55, 135, 164, 165(6), 178(6), 187, 603 Levinthal, C., 621 Levitt, D. G., 245
Levitt, M., 448, 467, 473, 513, 524, 528, 547, 552, 557(72), 558, 559(72), 565, 573, 600 Levy, R. M., 565, 566 Lewis, M., 40 Lewis, R. J., 316 Lewis, T., 115(91), 116(91), 118 Lhermite, G., 87, 93(34), 97(34), 115(34), 116(34) Li, A. J., 570, 620, 623, 623(38), 624(38; 81), 626(38) Li, H., 620, 626(39) Li, L., 630, 637(120) Li, M., 82, 125, 130, 133(28) Li, R., 212 Li, Y., 603 Liang, J., 517 Lichtarge, O., 469 Li de la Sierra, I., 87, 115(35), 116(35) Lieb, W. R., 119 Liebecq, C., 379 Lieberman, K., 615 Liepinsh, E., 445 Likic, V. A., 574 Liljas, A., 60 Lim, K., 122 Lim, L. W., 354 Lim, W. A., 552 Lin, D., 246, 465 Lindberg, U., 553 Lindley, P., 613 Lindqvist, Y., 174, 323, 332 Lipman, D. J., 380, 380(32), 381, 384(32), 465, 544 Lippard, S. J., 118, 175 Lipscomb, W. N., 52, 605, 606 Lipsitz, R. S., 456 Lipson, D., 553, 565(47) Littlechild, J. A., 315 Liu, H., 207, 220(34), 247, 449, 450(73) Liu, J., 125, 132(22), 196 Liu, S., 340 Liu, Z.-J., 80 Ljungdahl, L. G., 115(99), 116(99), 118 Lloyd, M. D., 115, 116(87) Lobkowski, J., 125, 132(23) Loll, P. J., 122 London, F., 91 Longhi, S., 87, 93(30), 118(30) Loomis, W. F., 84 Lorenz, M., 206, 207(16)
Louis, J. M., 491, 493(77) Lounnas, V., 431, 445(45; 55–57), 446 Lovell, S. C., 239, 388, 390, 391(3), 392(12), 393, 393(12), 394, 395(19), 397(19), 398(12), 401(3), 403(12), 405, 406(12), 409(4), 411(33), 412(12), 417, 448, 452, 452(63; 64), 456(63; 64), 461(63; 64) Lowe, J., 163, 174 Lowenhaupt, K., 323 Lowey, S., 206, 209(14), 210(14), 211(14), 217(14), 220(14), 221(14), 222(14) Lu, H., 469 Lubini, P., 143 Lubkowski, J., 82, 121, 408 Luchinat, C., 143 Luger, K., 164, 165(5) Lukanidin, E. M., 322 Lunin, V. Y., 191 Lurio, L., 616 Luthey-Schulten, Z., 618, 621(12) Lüthy, R., 469, 476 Lutter, R., 46, 164, 165(6), 178(6), 603 Luty, B. A., 561 Luz, J. G., 616 Luzzati, V., 597
M Ma, J., 565 MacArthur, M. W., 295, 332, 387, 394, 394(2), 453 Machius, M., 115(89), 116(89), 118 MacIntyre, W. J., 119 Maciunas, R. J., 210(48), 211 MacKerell, A. D., Jr., 473, 475(63) Madden, T. L., 380(33), 381, 544 Madej, T., 558 Mader, A. W., 164, 165(5) Madhusudhan, M. S., 465, 473 Madura, J. D., 418, 561 Maeda, Y., 595 Maestas, S., 86 Magdoff, B. S., 325 Magherini, F., 630, 631(122) Maignan, S., 606 Main, P., 29, 30, 33(32), 46, 56, 189, 190, 190(5), 191(5), 197(5), 199(5), 200(5), 201, 282(15), 283 Maiorov, V. N., 514, 526(25)
Mair, G. A., 79, 122 Maitland, N. J., 322 Makowski, I., 170 Malashkevitch, V. N., 94, 115(57), 116(57) Malfois, M., 594, 595, 595(24) Mallender, W. D., 115(91), 116(91), 118 Mancia, F., 113(81), 115, 116(81) Mandelcorn, L., 89 Mandelkow, E., 206, 207(20), 211, 215(55), 216(55) Mandiyan, V., 606 Manjasetty, B. A., 77 Mann, G., 118 Manning, N., 549(162), 550 March, C. J., 473 Maritan, A., 618, 627 Mark, A. E., 438 Markley, J. L., 375, 378(20), 379(20), 382 Marlovits, T. C., 206(21), 207 Marques, O., 566 Martin, D. W., 209 Martin, G., 150 Martin, J. A., 399 Martinez, J. C., 618, 620, 621(53), 630(17), 631, 632(53) Martí-Renom, M. A., 463(6), 464, 465, 467(6), 472(6), 473, 479, 481(73), 482(73), 486(73), 490(73), 493(6), 623 Martz, E., 562 Marx, A., 206, 207(20) Marzec, C. R., 480 Mateo, P. L., 620, 621(53), 632(53) Mateu, M. G., 207 Mathews, F. S., 354 Mathieu, M., 207 Matouschek, A., 623 Matssumoto, T., 96, 97(70) Matsubara, H., 402 Matsudaira, P., 211, 220(50), 222, 223, 223(50), 224(50) Matta, C. F., 456 Mattaj, I. W., 77 Matthews, B. W., 11, 16(22), 17(22), 41, 79, 88, 92(43), 98(44), 112(43), 154, 190, 191, 350, 395(20), 396, 547, 557, 574, 575(5), 603, 604 Mattos, C., 473, 475(63) Mattson, P. T., 620 Mattsson, N., 316(41), 320 Maurer, C., 208, 210(48), 211
Max, N., 86 May, J. L., 28, 175, 181(32) Mazza, C., 77 Mazzarella, L., 325, 328(3) McArthur, A. G., 471 McArthur, M. W., 332, 377, 382(23; 24), 476, 486(70) McAuley, K. E., 400, 408(38) McAulley, W. J., 143 McCammon, J. A., 208, 547, 561, 575 McConkey, B. J., 516 McCoy, A. J., 246, 273(28) McCutcheon, J. P., 28, 175, 181(32) McDermott, G., 316 McDowall, A. W., 204 McGough, A., 204 McIntosh, L. P., 603 McPhalen, C. A., 626 McRee, D. E., 115(90), 116(90), 118, 278, 355, 362(13), 390, 411(14; 15) McSweeney, S., 80 Medvedev, N. N., 516 Meerwink, W., 604 Melo, F., 463(6; 7), 464, 465, 467(6), 472(6), 479, 481(73), 482(73), 486(73), 490(73), 493(6; 7) Menge, U., 122 Merckel, M., 77 Merkel, G., 121, 408 Merritt, E. A., 316 Merz, K. M. J., 422, 447(34) Messerschmidt, A., 174 Mesyanzhinov, V. V., 94, 115(58), 116(58) Mewes, H. W., 480 Meyer, E. E., 372 Meyer, E. E., Jr., 599 Meyer, E. F., 497 Meyer, E. F., Jr., 230(3), 231 Meyer, T. E., 67 Mian, I. S., 469, 581 Michel, H., 164 Micheletti, C., 618, 627 Michie, A. D., 374, 501, 506(5), 548 Michnick, S., 473, 475(63) Michon, A. M., 211, 220(50), 223(50), 224(50) Micossi, E., 120 Mikami, M., 455 Miki, K., 137, 164 Milburn, M. V., 608 Miller, D. W., 566
Miller, K. I., 602 Miller, K. W., 118 Miller, R., 37, 39, 56, 56(1; 9; 10), 62, 63, 63(53), 64(10), 65(53), 66(53), 70, 134, 135(40), 232, 244, 349 Miller, W., 380, 544 Millett, I. S., 616 Milligan, R. A., 206, 207, 207(16), 208, 219, 221 Minakhin, L., 164 Minor, W., 123, 124(19), 330 Mirkovic, N., 465 Mirny, L. A., 514, 619, 631, 631(26) Mishikawa, R., 3(6), 4 Mistry, A., 113(81), 115, 116(81) Mitchell, E., 557, 558(59) Mitchell, M., 616 Mitchell, T., 247 Mitchison, G., 560, 581 Mitschler, A., 328 Mitsuoka, K., 204 Mittl, P. R. E., 70 Miura, K., 606 Miyatake, H., 137 Miyazawa, S., 514, 526(18) Mizuguchi, K., 284, 472 Mochrie, S. G. J., 616 Moffat, K., 615, 616 Mok, Y.-K., 620 Monod, J., 611 Montana, V. G., 608 Montet, Y., 118 Montgomery, M. G., 603 Moody, M. F., 611, 616 Mooibroek, H., 313, 318(34), 319(34) Moore, P. B., 164, 165(9), 172, 174(27), 175(27), 180(27), 181(39), 182, 410, 603, 615 Mooser, A., 125, 129(24), 132(24) Moran, F., 595 Moras, D., 87, 88(36), 93(36), 115(36), 116(36), 328, 544, 608 Morgan-Warren, R. J., 164, 165(10), 175, 181(32) Moriarty, N. W., 246, 273(28) Mornon, J. P., 87, 93(34), 97(34), 115(34), 116(34), 516 Moroz, O. V., 322 Morris, A. L., 394
Morris, R. J., 22, 29(3), 75, 134, 135(43), 137(43), 229, 232, 235, 245, 276, 327, 338(12), 344 Morrison, H. G., 471 Moshkov, K., 613 Moss, D. S., 305, 306, 308(15), 309(14; 15), 311(12), 314(21), 315(21), 377, 382(23), 387, 394(2), 476, 486(70) Mosser, A. G., 207, 220(34) Motoda, H., 247 Mouawad, L., 566 Moult, J., 395(21), 396, 475 Moulton, S., 610 Mourey, L., 115, 116(84; 85), 602 Muehlbaecher, C. A., 84, 85, 90(5) Mueller, V., 603 Muirhead, H., 9 Mukherjee, A. K., 56 Müller, H. R., 88, 89(48) Muller, J., 206, 207(20), 211, 215(55), 216(55) Müller, M., 471, 480, 490(74) Muñoz, V., 453, 620, 626(40), 627(40) Munson, P., 514 Murphy, L. M., 613 Murshudov, G. N., 242, 246, 273(26), 303, 306, 308, 315(20), 316, 322, 332, 350 Murshudow, C., 295 Murzin, A. G., 374, 465, 540, 543(77), 548, 578, 578(26) Myers, E. W., 380
N Naberukhin, Y. I., 516 Nachman, J., 432 Nadarajah, A., 122 Naday, I., 328, 329 Nagamura, T., 616 Nagar, B., 58, 64(60) Nagem, R. A. P., 79, 95, 120, 121, 123(8), 124(8), 125, 129(26; 34), 132(8; 26), 133(34) Nakasako, M., 604 Nanzer, A. P., 446 Napel, S., 210 Nathans, J., 125, 133(32) Navaza, J., 52, 207, 240 Nave, C., 105, 109(74) Nayeem, A., 475
Needleman, S. B., 484 Neidigh, J. W., 421 Nelson, W. J., 35, 36(38) Neu, M., 613 Neuefeind, T., 174 Neustroev, K. N., 131 Newfield, D., 628 Newman, J., 28, 207 Ngo, T., 473, 475(63) Nguyen, D. T., 473, 475(63) Ni, Y. S., 77 Nicholls, A., 494, 509 Nicholson, H., 547, 575(5) Nicolas, A., 130 Nieh, Y. P., 200 Nierhaus, K. H., 589, 604, 615(11) Nieuviarts, R., 84 Nigles, M., 244 Niinikoski, T. O., 589 Nilges, M., 39, 47(5), 177, 386 Nilges, N., 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Ninio, J., 597 Nishikawa, S., 3(6), 4 Nissen, P., 164, 165(9), 172, 174(27), 175(27), 180(27), 181(39), 182, 603, 607, 615 Nobbs, C. L., 85 Noble, M. E., 77 Nochomovitz, Y. D., 628, 629(114), 630(114) Nogales, E., 207 Noller, H. F., 615 Nölting, B., 618, 620(22) Nonaka, T., 96, 97(70) Nordman, C. E., 61 Norskov, L., 122 North, A. C. T., 9, 10, 79, 122, 463(1), 464, 472(1) Northey, J. G. B., 620 Nunes, A. C., 86 Nussinov, R., 514, 570 Nyas, M. N., 553 Nyborg, J., 21, 607 Nymeyer, H., 620, 621, 626(44; 45), 627(44; 45), 628(45), 629, 629(45), 630(117), 631(45), 635(45)
O O’Brien, P., 562 Ockwell, D. M., 603
Oda, K., 125, 133(28) O’Donnell, M., 479 Ogata, C. M., 16, 18(34), 121, 122(6), 125, 129(26), 132(26), 244, 325 O’Gorman, L., 206, 209(12) Ogura, K., 606 O’Halloran, T. V., 175 Ohkawa, H., 374, 548, 549(24) Ohlson, T., 396 Ohno, M., 77 Oka, T., 555 Okada, Y., 209 Okaya, Y., 6 Olafson, B. D., 422, 447(32) Olah, G. A., 610 Olczak, A., 96, 115(69) Oldfield, T. J., 239, 240(16), 245, 255, 261, 274, 276, 279(7), 282(18), 283, 284, 284(6), 285(7), 289(18), 295, 297(5), 332 Olins, P. O., 340 Oliva, G., 134 Olson, A. J., 565 Olson, C. A., 451 Olson, N. H., 206, 207, 207(13), 209(13), 216(13), 220(34) Olson, W. K., 378, 383 Onrust, R., 479 Onuchic, J. N., 618, 620, 621, 621(12), 626(44; 45), 627(44; 45), 628(45), 629, 629(45), 630(117), 631(45), 635(45) Opalka, N., 209 Oppenheim, J. J., 82, 125, 132(23) Orcutt, B. C., 480 Ord, J. K., 367 Orengo, C. A., 200, 374, 465, 501, 506(5), 548, 549(163), 550, 557, 559 Oschkinat, H., 324, 334(2) O’Shea, E. K., 625 Ostell, J., 374, 380(32), 381, 384(32), 465 Ostergaard, P. R., 400, 408(38) Osterman, A. L., 115(94), 116(94), 118 Otting, G., 445 Otwinowski, Z., 11, 47, 49(35), 76, 123, 124(19), 163, 330 Otzen, D. E., 623, 624(81), 637 Ouellette, B. F. F., 465 Ouyang, G., 206, 209(14), 210(14), 211(14), 217(14), 220(14), 221(14), 222(14) Overington, J. P., 467, 479, 479(23), 481(73), 482(73), 486(73), 490(73)
Oxford Cryosystems, 106 Oyama, H., 125, 133(28) Ozkan, S. B., 618
P Pabit, S. A., 421 Paci, E., 630, 636, 637(146), 638, 639(150) Padlan, E. A., 557 Padrón, G., 87, 115(35), 116(35) Pähler, A., 16 Palnitkar, M., 77 Pande, V. S., 618, 620, 621(15), 622, 623(37), 627(71), 634(71) Panjikar, S., 108 Pannu, N. S., 39, 47(5), 177, 244, 351, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Pantos, E., 595 Papadimitriou, C. H., 236 Papiz, M. Z., 303, 310, 311(27), 314(27), 316, 317(27), 318(27) Parisini, E., 143 Park, B., 577 Park, J., 469, 560 Park, S.-Y., 137 Parker, M., 206(22), 207 Parrish, C. R., 207 Pastore, A., 610 Patel, K. J., 613 Pattanayak, J., 456 Pauptit, R. A., 113(81), 115, 116(81) Pavelcik, F., 52 Pavlov, M. Y., 597 Pearl, F. M., 465 Pearson, W. R., 467 Peat, T. S., 28 Pédelacq, J. D., 115, 116(85), 602 Pedersen, J. S., 604 Peerdeman, A. F., 6 Peerdeman, J. F., 4 Penczek, P., 172, 174(27), 175(27), 180(27), 603, 615 Penning, T. M., 332 Pepinsky, R., 6 Perahia, D., 566 Perera, L., 516 Pérez, J., 607, 616 Perham, R. N., 473
Perkins, S. J., 601 Perles, L. A., 134 Perman, B., 615 Pernot, L., 87, 93(30), 115(35), 116(35), 118(30) Perrakis, A., 22, 29(2; 3), 47, 49(35), 75, 76, 77, 115, 116(87), 134, 135(43), 137(43), 229, 232, 235, 245, 276, 344 Perutz, M. F., 9, 189, 557 Peters, J. W., 88, 98(40), 101(40) Peterson, P. A., 316(41), 320 Petitpas, I., 207 Petoukhov, M. V., 591(65), 596, 600, 601, 602 Petrash, J. M., 340 Petrey, D., 420, 494 Petsko, G. A., 85(21), 86, 118(21), 455, 561 Pettifer, R. F., 329 Pettigrew, D. W., 562 Pettitt, B. M., 423, 431, 445(45; 55–58), 446 Pfeiffer, F., 480 Pfleger, K., 247 Philips, G. N., 88, 106(41) Phillips, D. C., 79, 122, 167, 463(1), 464, 472(1), 558 Phillips, G. N. J., 445(56; 58), 446 Phillips, J. C., 13, 16(30), 54 Pichon-Lesme, V., 399 Pichon-Pesme, V., 447(62), 448, 450(62) Pickersgill, R. W., 306, 308(15), 309(15) Pielak, G. J., 544 Pieper, U., 465 Pietrokovski, S., 511 Piontek, K., 115, 116(88) Pissabarro, M. T., 618, 630(17) Pitard, E., 631 Plaxco, K. W., 616, 631 Plückthun, A., 125, 129(24), 132(24) Plurad, J. C., 544 Poch, O., 544 Podell, E., 179 Podjarny, A. D., 46, 172, 324, 328 Pohl, E., 245 Pol, E., 210 Polaka, N., 223 Polekhina, G., 607 Polidori, G., 39 Polikarpov, I., 79, 95, 120, 121, 123(8), 124(8), 125, 129(26; 34), 132(8; 26), 133(29; 34), 134 Pollack, G. L., 94, 95(63)
Pollack, L., 616 Pollard, T. D., 212 Polyakov, A., 209 Polyakov, K. M., 199 Ponder, J. W., 390, 391(6), 436(51), 437, 456(51) Pontius, J., 295, 574, 575(136) Poon, C. D., 451 Poon, P., 115, 116(86) Porezag, D., 448, 449(70) Porod, G., 590 Postma, J. P. M., 418 Poterszman, A., 328 Pothier, P., 207 Potter, S. A., 70 Potterton, L., 471, 477(43), 479(43) Powell, M. J. D., 54 Powlowski, J., 77 Pradervand, C., 615 Prangé, T., 83, 87, 88(31; 32), 89(29), 90(31), 91(29), 93(29–31; 34), 96(32), 97(31; 34), 98(31), 99(28), 101(28), 104(31), 106(28), 109(31), 110(29), 112(31), 113(31; 32), 115(34; 35; 93), 116(34; 35; 93), 118, 118(30), 607 Prasad, B. V., 207 Prendergast, F. G., 574 Presley, B. K., 388, 405, 409(4), 411(33) Presley, P. K., 417 Press, W. H., 205 Prevelige, P. E., 615 Price, A. C., 77 Priestle, J. P., 70 Prilusky, J., 549(162), 550 Prince, S. M., 316 Prins, J. A., 3(7), 4 Prisant, M. G., 394, 395(19), 397(19), 448, 452(64), 456(64), 461(64) Prodhom, B., 473, 475(63) Protein Data Bank, 372, 373(3), 374 Pryor, A. W., 303 Ptitsyn, O. B., 604, 618, 622, 622(6), 631 Puri, A., 82, 125, 132(23) Pusey, M. L., 122
Q Qian, J., 548 Qian, W., 456 Qiu, D., 421
Qiu, L., 421 Querol, E., 623 Quillin, M. L., 88, 92(43), 98(44), 112(43), 574 Quiocho, F. A., 278, 403, 405(31), 553
R Rabitz, H., 343 Radermacher, M., 603 Radzicka, A., 529 Raftery, J., 96, 115(69) Rahfeld, J., 602 Rajashankar, K. R., 80, 82, 95, 121, 122(7), 125, 132(23) Rajogopalan, K. V., 94(59), 115, 115(59), 116(59) Ralph, A., 613 Ramachandran, G. N., 6 Ramakrishnan, C., 395(22), 396 Ramakrishnan, V., 17, 28, 164, 165(10), 175, 181(32), 589 Raman, S., 6 Ramaswamy, S., 115, 116(87) Randal, M., 432 Rapaport, D. C., 631 Rapp, B. A., 380(32; 33), 381, 384(32), 465 Rappleye, J., 70 Rattner, A., 125, 133(32) Ravichandran, V., 383 Ravikumar, K., 123 Rayment, I., 67, 206, 207(16), 401 Read, R. J., 29, 39, 47(5), 52, 65(42), 70, 156, 177, 244, 246, 273, 273(28), 351, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Reddy, V. S., 207, 615 Redinbo, M. R., 245 Rees, D. C., 88, 94(59), 98(40), 101(40), 106(41), 115, 115(59), 116(59) Reese, M. G., 560 Refaat, L. S., 6, 198, 200 Reiher, W. E. III, 473, 475(63) Reinhammar, B., 613 Rek, Z. U., 616 Remaut, H., 115(93), 116(93), 118 Remigy, H., 204 Remington, S. J., 551, 557, 562 Ren, Z., 615 Renauld, J.-C., 125, 129(26), 132(26) Rendell, L., 247
Reo, N. V., 118 Reshetnikova, L., 607 Retailleau, P., 411, 412(35) Rey, A., 627 Rey, F. A., 207 Reyes, C. M., 624 Reynolds, C. D., 199 Rhodes, D., 166 Riboldi-Tunnicliffe, A., 602 Rice, D. W., 77, 557, 558, 558(59) Rice, L. M., 21, 39, 47(5), 177, 244, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Rice, W. J., 209 Rich, A., 323 Richard, S., 598, 599(42) Richards, F. M., 390, 391(6), 515, 517, 518, 552, 570, 570(40), 574, 574(125) Richards, R. M., 436(51), 437, 456(51) Richardson, D. C., 239, 245, 387, 388, 389, 390, 391(3; 36; 37), 392(12; 36; 37), 393, 393(12), 394, 395(19; 36; 37), 397(19), 398(12; 36; 37), 400(36; 37), 401(3; 36; 37), 402(36; 37), 403(12), 405, 406(12), 409(4), 410(36; 37), 411(33), 412(12), 417, 448, 452, 452(63; 64), 456(63; 64), 461(63; 64), 524, 544 Richardson, J. S., 239, 245, 387, 388, 389, 390, 391(3; 36; 37), 392(12; 36; 37), 393, 393(12), 394, 395(19; 36; 37), 397(19), 398(12; 36; 37), 399, 400(36; 37), 401(3; 36; 37), 402(36; 37), 403(12), 406(12), 409(4), 410(36; 37), 412(12), 417, 448, 452, 452(63; 64), 456(63; 64), 461(63; 64), 544 Richelle, J., 295, 381, 382(34), 574, 575(136) Richmond, R. K., 164, 165(5) Richmond, T. J., 115(97), 116(97), 118, 164, 165(5), 166, 175 Richter, C., 164, 209 Rick, S. W., 332 Rickles, R. J., 620 Riddle, D. S., 618, 620, 620(16), 630(16), 631, 632(16; 52), 637(16; 52) Riès-Kautt, M., 607 Rieubland, M., 589 Rijllart, A., 589 Rini, J. M., 58, 64(60) Rivers, P. S., 558 Rizkallah, P. J., 96, 115(69)
Roach, J., 137, 411, 412(35), 619 Robbins, A. H., 87, 98(26), 99(26) Roberts, A. L. U., 30, 282(17), 283 Rocchia, W., 509 Rochel, N., 608 Rock, C. O., 77 Rodgers, J. R., 230(3), 231, 372, 497, 599 Roe, S. M., 450, 625 Roessner, C. A., 262 Rogers, S. J., 625 Rogniaux, H., 328 Roitberg, A. E., 421 Rojas, A., 125, 129(34), 133(34) Rojas, O., 565 Rokshar, D. S., 618, 621(15) Romo, T., 423 Rooman, M. J., 514, 547 Rose, G. D., 451 Rose, J. P., 80, 115(98; 99), 116(98; 99), 118, 122 Roseman, A. M., 209 Rosemann, M. G., 209, 217(43), 219(43) Rosen, M. K., 620 Rosenbaum, 328 Rosenberry, T. L., 115(91), 116(91), 118 Rosenbrock, G., 113(81), 115, 116(81) Rosenfield, R. E., 311 Rosenzweig, A. C., 118 Rossjohn, J., 206(22), 207 Rossmann, M. G., 9, 10, 13, 16(19a), 18, 94, 115(58), 116(58), 164, 170, 207, 551, 557 Rossmann, R., 115(97), 116(97), 118 Rost, B., 473 Rost, L. E., 221 Roth, M., 88, 95(39), 112(39) Rotkiewicz, P., 473 Rouge, P., 115, 116(85), 602 Rould, M. A., 145, 154 Roux, B., 547 Roux, M., 473, 475(63) Roversi, P., 411, 412(35) Roy, P., 207 Ruczinski, I., 620, 631, 632(52), 637(52) Rudolph, M. G., 310, 313(28), 316(41), 320, 320(28) Rueckert, R. R., 207, 220(34) Ruff, M., 87, 88(36), 93(36), 115(36), 116(36) Rullman, J. A. C., 295, 332, 377, 382(24) Rummel, G., 88, 95(39), 112(39) Rupp, B., 163
Rushton, B., 166 Russell, R. B., 557, 559, 616 Rychlewski, L., 472 Rypniewki, W., 115, 116(88)
S Saam, B., 119 Sablin, E. P., 206, 209 Sacchettini, J. C., 244, 246, 250, 253(35), 256, 257(40), 259(40), 262, 273(28) Sachs, J. R., 209 Sack, J. S., 435 Sack, S., 206, 207(20), 278 Sadoc, J. F., 516 Saenger, W., 115, 116(82) Safer, D., 219 Saha, G. B., 119 Saibil, H. R., 204, 206(22), 207 Saito, Y., 6 Sakabe, N., 199 Šali, A., 246, 306, 463, 463(5–7), 464, 465, 467, 471, 472, 472(5; 6), 473, 473(22), 475(24), 477(43; 47), 478(24; 44), 479, 479(22–24; 43; 44), 480, 481(73), 482(73), 486(73), 490(73; 74), 491, 493, 493(6; 7; 77), 557, 559, 627 Saloman, E. B., 103, 104(72) Saludjian, P., 87, 115(35), 116(35) Salvato, B., 602, 612, 613 Samama, J. P., 115, 116(84; 85), 602 Sammon, M., 206, 209(12) Samulski, E. T., 451 Sánchez, R., 463(5–7), 464, 465, 471, 472, 472(5; 6), 477(43; 47), 478(44), 479, 479(43; 44), 481(73), 482(73), 486(73), 490(73), 493, 493(6; 7) Sandalova, T., 323, 332 Sander, C., 245, 258, 295, 332, 374, 377(7; 8), 417, 465, 476, 512, 522, 523, 528(64), 549(161), 550, 557, 558, 565, 626 Sandy, J., 77 Sanejouand, Y. H., 566 Sanishvili, R., 60, 123, 124(19), 328 Sano, T., 616 Santiago, J. V., 618, 620, 620(16), 630(16), 632(16; 52), 637(16; 52) Saraste, M., 403, 405(30) Sargent, D. F., 115(97), 116(97), 118, 164, 165(5)
Sarma, V. R., 79, 122 Sasai, M., 629 Sassoon, J., 313, 318(34), 319(34) Satow, Y., 557 Sauder, J. M., 472 Sauer, O., 88, 95(38; 39), 106(37), 108(37), 109(37), 112(37; 39), 115, 116(88), 118(38) Sax, M., 76, 122, 602 Sayers, Z., 598, 599(42), 607 Sayre, D., 189 Schaefer, M., 516 Schaefer, S., 422 Schaegger, H., 603 Schäffer, A., 544 Schaffer, M. L., 125, 132(22) Scharf, M., 258, 528, 531(70) Scharpf, O., 589 Scheek, R. M., 424 Scheiner, S., 456 Schellenberger, A., 602, 617 Schenk, H., 62(69), 63 Scheraga, H. A., 419, 475, 513, 627 Scherpereel, P., 84 Scheuring, S., 204 Schick, B., 166 Schiffer, C., 414, 423, 424(40), 425(39; 41), 426(40; 41), 431(41), 432(41), 435, 441(41), 442 Schiltz, M., 83, 87, 88(31–33), 89(29), 90(31), 91(29), 92(33), 93(29–31; 33; 34), 96(32), 97(31; 34), 98(31), 99(28), 101(28), 104(31), 106(28), 109(31), 110(29; 33), 112(31; 33), 113(31; 32), 115(34; 35), 116(34; 35), 118(30) Schindelin, H., 88, 94(59), 98(40), 101(40), 115, 115(59), 116(59) Schindler, D. G., 589 Schirmer, T., 88, 95(39), 112(39) Schlenkrich, B., 473, 475(63) Schlessinger, J., 606 Schlichting, I., 555 Schlick, T., 517, 575, 576(143) Schluenzen, F., 164 Schmeing, T. M., 410 Schmid, M. F., 204 Schmidt, A., 88, 106(37), 108(37), 109(37), 112(37), 327, 338(12) Schmidt, B., 591(60), 602 Schmidt, R., 559
Schmidt, T. J., 206, 207(13), 209(13), 216(13) Schmitt, M., 589 Schneider, B., 378, 383 Schneider, D. K., 589 Schneider, F., 174 Schneider, G., 174, 323, 332 Schneider, K., 383 Schneider, R., 258, 374, 528, 531(70) Schneider, T. R., 23, 37, 77, 105, 171, 311, 324, 325, 336, 349, 350, 359(2), 367(2) Schoenborn, B. P., 79, 85, 86, 87(15), 99(11), 589 Schofield, C. J., 115, 116(87) Schomaker, V., 304, 306, 310(22), 314(3), 315(3) Schoone, J. C., 3, 15(2), 20(3) Schoot-Uiterkamp, A. J. M., 118 Schotte, F., 615 Schrauber, H., 390 Schreiber, S. L., 620 Schrempf, H., 594(25), 595 Schubert, W. D., 115, 116(82) Schubot, F. D., 115(98; 99), 116(98; 99), 118 Schuler, G. D., 380(33), 381, 467, 548, 549(24) Schulten, K., 547, 575(2) Schulthess, T., 94, 115(57), 116(57) Schultz, P., 204 Schultze, P., 374 Schulz, H. H., 303, 313(2) Schulz, W. G., 324, 344(1) Schutt, C. E., 343, 553 Schuttler, J., 119 Schwartz, T., 323 Schwarze, H., 628 Schweitzer-Stenner, R., 451 Schweizer, W. B., 311 Schwilden, H., 119 Scofield, J. H., 103, 104(72) Scott, A. I., 262 Sedgewick, R., 237 Segel, D. J., 616 Segelke, B., 163 Segref, A., 77 Seidel-Dugan, C., 620 Seifert, G., 448, 448(71), 449, 449(70), 450(71) Selin-Lindgren, E., 613 Selmer, M., 60 Semenyuk, A. V., 593
Seno, F., 618 Serrano, L., 453, 618, 620, 621(53), 623, 626(43), 627(43), 630(17), 631, 632(53) Seshu, R., 247 Sessions, R., 524 Settle, W., 84 Seul, M., 206, 209(12) Severinov, K., 164 Shah, A. K., 115(99), 116(99), 118 Shakhnovich, E. I., 514, 547, 574, 618, 619, 620, 621, 621(7; 11; 46), 622, 622(11; 13; 62; 63), 623(34), 624, 627, 627(71), 628, 628(46; 63), 630, 630(7; 46; 62; 63; 116), 631, 631(13; 26; 116), 632(62; 63; 116), 633(46; 63), 634(71), 635(62; 63), 637(7; 62; 63; 120) Shan, L., 399 Sharma, D., 125, 133(32) Sharma, Y., 456 Sharp, K. A., 419, 420(15), 494 Shea, J. E., 628, 629(114), 630(114) Sheinerman, F. B., 623, 624(83), 625(83) Sheldrick, G. M., 23, 37, 39, 48, 49, 50(39), 56(1; 11; 39), 58(39), 59(39), 62, 67, 69(85), 70, 77, 80, 120, 122(3), 143, 171, 232, 336, 349, 350, 359(2), 362(4), 367(2) Shen, T., 547 Shen, W., 223 Shenkin, P. S., 421 Shepard, W., 87, 88(31; 32), 90(31), 93(31), 96(32), 97(31), 98(31), 104(31), 109(31), 112(31), 113(31; 32) Sheriff, S., 138, 141(1) Sherman, M. B., 204 Sherrill, C. D., 455 Shewchuk, L., 547, 575(5) Shi, J., 472 Shi, Z., 451 Shigesada, K., 206(23), 207 Shimada, J., 574 Shimanouchi, T., 372, 497, 599 Shin, T. B., 620 Shindyalov, I. N., 230(4), 231, 372, 374, 465, 497, 511, 548, 599, 625, 628(96) Shiono, M., 6 Shipman, L. W., 262 Shlyapnikov, S. V., 199 Shmueli, U., 57, 303, 313(2) Shneider, M. M., 94, 115(58), 116(58) Shomanouchi, T., 230(3), 231
Short, S. A., 150, 158, 196 Shue, G., 547, 575(4) Sibanda, B. L., 463(2), 464, 472, 472(2) Sibbiah, S., 600 Sicker, T., 80, 113 Siddiqui, A., 374 Sieck, T., 603 Sieker, L. C., 13, 16(30), 67 Sigler, P. B., 163, 553, 565 Sillers, I. Y., 589 Silman, I., 115(91), 116(91), 118 Sim, E., 77 Sim, G. A., 21, 70 Simmerling, C., 421 Simmonson, T., 244 Simons, J. A., 620 Simons, K. T., 631 Simonson, T., 39, 47(5), 177, 386, 415, 418(2), 421(2), 422, 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Simpson, P. G., 52 Sinclair, J. C., 77 Singh, R., 512, 514, 515(5; 6), 517(5; 6) Singh, U. C., 86 Sinnokrot, M. O., 455 Sippl, M. J., 455, 469, 471, 475, 476(45), 478(45), 486(45), 514, 526(24), 558, 600 Sitkoff, D., 419, 420(15) Sixma, T. K., 22, 29(2), 77, 130, 245 Sjölander, K., 469, 581 Sjölin, L., 399 Skalka, A. M., 121, 408 Skelton, N. J., 125, 132(22) Skibshoj, I., 322 Skiniotis, G., 211, 215(55), 216(55) Sklenar, H., 382 Skolnick, J., 469, 473, 514, 558, 624, 625(88; 91), 626(90), 627 Skovoroda, T. P., 191 Skrynnikov, N. R., 604 Smit, A. B., 130 Smith, A. V., 631 Smith, C. I., 620 Smith, G. D., 29, 56, 57(56), 58, 58(56), 64, 64(60), 65(77), 68, 68(61; 77), 70, 71 Smith, J. C., 473, 475(63) Smith, J. E., 221 Smith, J. L., 16, 122 Smith, L. J., 438 Smith, P. E., 451
Smith, S. W., 631 Smith, T. F., 262, 500, 506(4) Smith, T. J., 206, 207, 207(13), 209(13), 216(13), 220(34) Sneath, P., 215 Snow, M. E., 473 Sobolev, V., 516 Socci, N. D., 620, 621, 626(44), 627(44) Sogin, M. L., 471 Sokal, R., 215 Sokolova, A., 595 Soltis, M., 88, 97(42), 98(40), 101(40), 112(42) Soltis, S. M., 88, 106(41) Sonnhammer, E. L., 560, 581 Sosa, H., 206 Sowdhamini, R., 463(4), 464, 472(4) Soyer, A., 516 Spagna, R., 39 Spahn, C. M., 615 Speir, J. A., 316(41), 320 Spellmeyer, D. C., 422, 447(34) Springer, C. S., Jr., 119 Srajer, V., 615 Sreerama, N., 451 Sridhar, V., 115(90), 116(90), 118 Sridharan, S., 509 Srinivasan, A. R., 383 Srinivasan, N., 463(4), 464, 472(4) Srinivasan, R., 141, 142(3), 378 Srinivasan, S., 473 Srinivasarao, G. Y., 480 Srivastava, S. K., 340, 603 Stahlberg, H., 204 Stanley, E., 63 Stanley, H. E., 618, 620, 621(46), 622(62; 63), 628, 628(46; 63), 630(46; 62; 63; 116), 631, 631(116), 632(62; 63; 116), 633(46; 63), 635(62; 63), 637(62; 63) Stanley, K., 371 States, D. J., 422, 447(32) Stebbins, C. E., 125, 129(33), 133(33) Stec, B., 306, 310(17), 606 Steeg, E., 245 Stefani, M., 630, 631(122) Steiglitz, K., 236 Steinrauf, L. K., 122 Steipe, B., 174 Steitz, T. A., 164, 165(9), 172, 174(27), 175(27), 180(27), 181(39), 182, 410, 551, 603, 615
Stengle, D. P., 118 Stengle, T. R., 118 Stenkamp, R., 463(3), 464, 472(3) Stern, P. S., 475, 565 Sternberg, M. J. E., 123, 463(2), 464, 465, 472, 472(2; 11), 558 Steven, A. C., 204, 616 Stevens, R. C., 163, 605 Stewart, J. M., 156 Stewart, P. L., 207 Still, W. C., 421 Stock, D., 163 Stoeva, S., 594(26), 595 Stoilova-McPhie, S., 212 Stokes, D. L., 209 Stoop, J., 313, 318(34), 319(34) Stork, D. G., 247 Stote, J., 473, 475(63) Stote, R. H., 623 Stout, C. D., 355, 362(13) Stowell, M. H. B., 88, 98(40), 101(40), 106(41) Strandberg, B. E., 26, 40, 45(16) Strange, R. W., 613 Strassheim, M. L., 207 Straub, R., 473, 475(63) Strelkov, S. V., 94, 115(58), 116(58) Stricht, D. V., 125, 132(25) Strockbine, B., 421 Struck, M. M., 166 Stuart, A., 367, 463(6), 464, 467(6), 472(6), 493(6) Stuart, D. I., 207 Studholme, C., 210(46; 47), 211 Studier, F. W., 465 Studier, W., 246 Stuhrmann, B. B., 589 Stuhrmann, H. B., 589, 594, 604 Stull, J. T., 610 Stura, E. A., 355, 362(13) Su, X. D., 115, 116(86) Subbiah, S., 403, 552 Subramaniam, S., 445(55), 446, 517 Sudarsanam, S., 473 Sudhakar, P. V., 517 Suhai, S., 448, 448(71), 449, 449(70), 450, 450(71) Sullivan, M. L., 403, 405(32) Sumanawaeera, T., 210 Susnow, R., 343
Sussman, J. L., 115(91), 116(91), 118, 130, 473, 549(162), 550, 551 Svensson, L. A., 399 Svensson, O., 87, 88(32), 96(32), 113(32) Svergun, D. I., 586, 587, 589, 591(60; 65), 593, 594, 594(25), 595, 595(24), 596, 597(47; 49), 598, 599, 599(42), 600, 601, 602, 603, 604, 605, 607, 612, 615, 615(11), 617 Swaminathan, S., 70, 246, 422, 447(32), 465, 602 Swanson, S. M., 252 Sweeney, H. L., 221 Sweet, R. M., 17, 172, 174(27), 175(27), 180(27), 411, 412(35), 528, 603, 619 Swindells, M. B., 374, 453, 501, 506(5), 548 Sykes, B. D., 375, 378(20), 379(20) Sykes, J. A., 562
T Taddei, N., 630, 631(122) Takeda, K., 137 Takemoto, S., 456 Taketomi, H., 618 Tammana, H., 560 Tanabe, K., 455 Tanaka, S., 513 Tanaka, T., 618, 620, 621(15), 622, 623(37), 627(71), 634(71) Tang, C., 620, 626(39) Tang, Y. Z., 547 Tao, Y., 94, 115(58), 116(58) Tardieu, A., 588, 602(4) Tasumi, M., 230(3), 231, 372, 497, 599 Tate, C., 198, 200 Tate, M. W., 616 Tatusova, T. A., 380(33), 381 Tauc, P., 312(99), 611 Tavernier, B., 84 Taylor, H. C., 388, 409(4), 417 Taylor, I., 113(81), 115, 116(81) Taylor, R., 516, 518(39), 570, 574(127) Taylor, W. R., 200, 469, 473, 514, 557, 559 Teeter, M. M., 12, 13(27), 16(27), 21(27), 79, 306, 310(17), 399, 447(62), 448, 450, 450(62), 625 Teichmann, S., 577 Tempczyk, A., 421 Ten Eyck, L. F., 11, 191, 347, 350, 359, 362(14)
Teng, T., 107(77), 108, 615 Teplyakov, A., 40 ter Kuile, B., 480, 490(74) Terwilliger, T. C., 17, 22, 23, 23(1), 24(1), 25, 27, 27(1; 11), 28, 30(4; 5), 31(4; 5), 32(4; 5), 37, 39, 40, 40(4), 41, 41(4), 42(18), 43, 43(21), 45(4; 22), 46, 49(4; 18), 54(21), 58(4), 160, 162, 171, 244, 246, 273(27; 28), 282 Tetaud, E., 77 Teukolsky, S. A., 205 Teyton, L., 316(41), 320 Thanki, N., 383 Thanos, C. D., 76 Thayer, E. C., 631 Thim, L., 122 Thirumalai, D., 618, 620, 621(9), 623(35), 627, 627(35), 628 Thirup, S., 239, 245, 473, 607 Thoden, J. B., 401 Thomas, A., 566 Thomas, B. D., 544 Thomas, D., 626 Thomas, M. J., 562 Thomas, P. D., 455, 575 Thompson, A., 115, 116(87) Thompson, A. B., 562 Thompson, J. D., 506, 581 Thompson, N. E., 177 Thompson, T. B., 401 Thormahlen, M., 206, 207(20) Thornton, J. M., 200, 295, 332, 374, 377, 382(23; 24), 387, 394, 394(2), 453, 463(2), 464, 465, 469, 472, 472(2), 476, 486(70), 501, 506(5), 514, 524, 548, 549(163), 550, 558 Thorsson, V., 560 Thuman, P., 56, 63, 63(53), 65(53), 66(53), 244 Thuman-Commike, P. A., 615 Tickle, I. J., 206(22), 207 Tildesley, D. J., 631 Tilton, R. F., 87, 98(26), 99(26) Tilton, R. F., Jr., 85(21), 86, 99, 105, 109(74), 118(21) Timchenko, A. A., 604 Timm, D. E., 125, 133(31), 152 Timmins, P. A., 608 Ting, K.-L. H., 631 Tirion, M. M., 304
Tjandra, N., 456 Tobias, C. A., 84 Tocilj, A., 164 Todd, A. E., 465 Tolley, S., 122 Tomoda, S., 456 Tompkins, W. H., 616 Tooze, J., 581 Torda, A. E., 424, 446, 469 Transue, T. R., 343 Treptow, W. L., 639 Trewhella, J., 593, 594(27), 595, 610, 610(18) Troitsky, A. V., 604 Tronrud, D. E., 191, 261, 350 Tropsha, A., 511, 512, 514, 515(5; 6), 516, 517, 517(5–7; 9), 522(7), 539 Trotter, S., 616 Trueblood, K. N., 303, 304, 306, 310(22), 311, 313(2), 314(3), 315(3) Truhlar, D., 422 Trybus, K. M., 206, 209(14), 210(14), 211(14), 217(14), 220(14), 221(14), 222(14) Tsai, J., 516, 518(39), 546, 547, 552, 570, 570(41), 573, 574, 574(127; 128) Tsugita, A., 480 Tsui, V., 418, 421(14) Tsuruta, H., 615, 616 Tsuzuki, S., 455 Tucker, A. D., 113(81), 115, 116(81) Tucker, P., 88, 106(45), 108 Tuffery, P., 390, 391(7) Tuma, R., 595 Tunnicliffe, A., 113(81), 115, 116(81) Turk, D., 245, 278 Turkenburg, J. P., 122, 316 Turkenburg, M. G. W., 23, 37, 171 Turner, G. M., 562 Turner, M. A., 29, 64, 65(77), 68(77) Turpin, F. H., 84
U Uchida, K., 125, 133(28) Uchimaru, T., 455 Ueda, Y., 618 Ulrich, E. L., 382 Ultsch, M., 125, 132(22) Umland, T., 602 Unger, R., 473
Unwin, N., 206 Ursby, T., 615 Urzhumtsev, A. G., 172, 191 Usón, I., 23, 37, 56(1), 70, 171
V Vachette, P., 312(99), 586, 605, 606, 607, 608, 610(87), 611, 612, 616 Vagin, A. A., 40, 242, 246, 273(26), 306, 308, 350 Vaguine, A., 295, 381, 382(34) Vaisman, I. I., 511, 512, 515(5; 6), 516, 517, 517(5–7), 522(7) Vajdos, F. F., 125, 132(22) Vale, R. D., 206 Valeev, E. F., 455 Vålegard, K., 115, 116(87) Vallet, B., 84 Vanaman, T. C., 463(1), 464, 472(1) van Beeumen, J., 115(93), 116(93), 118 Van Belle, D., 547 van Bommel, J. A., 4 Van den Elsen, P., 210 van der Plas, J. L., 171 van der Spoel, D., 565 Van Dorsselaer, A., 328 van Gunsteren, W. F., 418, 423, 424, 424(38; 40), 425(41), 426(40; 41), 431(41), 432(41), 435, 438, 441(41), 442, 445, 446, 517 van Holde, K. E., 602 van Scheltinga, A. C., 115, 116(87) van Vlijmen, H., 471, 477(43), 479(43) Varadarajan, R., 517 Vassiliev, V. B., 613 Vasyliev, V. B., 612 Vaughn, D. E., 115, 116(86) Veerapandian, B., 306 Veesler, S., 607 Vellieux, F. M. D., 29 Vendruscolo, M., 630, 636, 637(146), 638, 639(150) Verdaguer, N., 207 Verma, C. S., 316 Vernède, X., 106, 118 Vernoslova, E., 52 Verschelde, J. L., 562 Verschoor, A., 603 Vetterling, W. T., 205
Vierstra, R. D., 403, 405(32) Vieth, M., 624, 625(91) Vigil, D., 595 Viguera, A. R., 620, 621(53), 632(53) Vihinen, M., 620 Vijayan, M., 199 Vijay-Kumar, S., 398 Villegas, V., 631 Villeret, V., 115(93), 116(93), 118 Vitagliano, L., 325, 328(3) Vitali, J., 87, 98(26), 99(26) Vivarès, D., 588, 602(4) Voelter, W., 594(26), 595 Vogel, H. J., 565 Voges, D., 206 Volbeda, A., 118 Volkman, B. F., 553, 565(47) Volkmann, N., 204, 206, 209(14; 15), 210, 210(14; 15), 211, 211(14), 212, 215(55), 216(55), 217(14), 220(14; 50), 221(14), 222(14), 223(50), 224(50) Volkov, V. V., 591(60), 594, 597(47), 600, 602, 604, 607 Voloshin, V. P., 516 von Bohlen, K., 170 von Delft, F., 57, 77 von Heijne, G., 560 von Mises, R., 211 Vonrhein, C., 21, 164, 165(10), 411, 412(35) Von Stackelberg, M., 88, 89(48) Vorobjev, Y. N., 418, 419, 420 Voronoi, G. F., 515, 570 Voss, N., 552, 570(41) Vovelle, F., 118 Vrielink, A., 77 Vriend, G., 295, 332, 374, 377(7; 8), 417, 476, 523, 557 Vyas, N. K., 553 Vysotski, E. S., 80
W Wade, R. C., 207, 561 Wadzack, J., 589 Wagner, F., 422 Wainwright, T. E., 631 Wakabayashi, K., 616 Waldner, J. C., 544 Waldo, G. S., 28 Wales, D. J., 622, 622(70), 627(70)
Walker, D., 371 Walker, J. E., 46, 164, 165(6), 178(6), 603 Wallace, A. C., 374 Wallace, B. A., 123 Walsh, C. T., 196 Walsh, D. A., 593, 610(18) Walsh, M. A., 47, 49(35), 60, 76, 295, 328 Walther, D., 595 Walz, J., 204 Wang, B. C., 29, 30(20), 31(20), 46, 64(27), 79(27), 80, 115(98; 99), 116(98; 99), 118, 122, 189 Wang, C. A., 594(27), 595 Wang, C. X., 547 Wang, G., 207 Wang, M. Y., 210(48), 211 Wang, W., 624 Wang, W. C., 115(96), 116(96), 118 Wang, Z. X., 616 Warren, G. L., 39, 47(5), 177, 244, 386, 415, 418(2), 421(2), 426(2), 432(2), 433(2), 436(2), 441(2), 462(2) Warshel, A., 448, 513 Watanabe, M., 473, 475(63) Watenpaugh, K. D., 13, 16(30), 376 Waterman, M. S., 262, 500, 506(4) Watson, D. F., 519, 531(62) Watson, H. C., 85, 99(11) Weaver, D. L., 622 Weaver, L. H., 604 Webster, P., 170 Weeks, C. M., 23, 37, 39, 56, 56(1; 9; 10), 58, 62, 63, 63(53), 64(10), 65(53), 66(53), 68(61), 69, 70, 75, 134, 135(40), 171, 232, 244, 349 Weiner, S. J., 86 Weis, W. I., 11, 17(26), 35, 36(38), 54, 179, 310, 443, 446(53) Weise, C. F., 451 Weiss, M. S., 80, 113, 602 Weisshaar, J. C., 451 Weissig, H., 230(4), 231, 372, 383, 465, 497, 511, 548, 599, 625, 628(96) Welch, M., 115, 116(84) Wemmer, D. E., 553, 565(47) Wendt, T. G., 211, 215(55), 216(55) Wesenberg, G., 67 West, J., 208, 210(48), 211 Westbrook, E. M., 328, 329
Westbrook, J., 230(4), 231, 372, 376, 377, 378, 379, 382, 382(30), 383, 465, 497, 511, 548, 599, 625, 628(96) Weston, S. A., 113(81), 115, 116(81) Wetlaufer, D. B., 622 Whaley, R. C., 371 Wheeler, D. L., 380(32; 33), 381, 384(32), 465 Wherland, S., 473 Whitby, F. G., 88, 98(40), 101(40) White, P. M., 630, 631(122) White, S. A., 77 White, S. W., 77 Whittaker, M., 206, 207, 207(16), 219, 221 Whittington, D. A., 118 Wichert, J. M., 631 Wiegand, G., 551 Wiener, M. C., 88, 98(40), 101(40), 106(41) Wierenga, R. K., 551 Wigneshweraraj, S. R., 594, 595(24) Wikoff, W. R., 207, 615, 616 Wilcock, J., 94 Wilcock, R. J., 85 Wilhelm, E., 94 Wilhem, E., 85 Wilkens, S., 603 Wilkinson, A. J., 316 Willett, P., 557, 558, 558(59) Williams, G. J., 230(3), 231, 372, 497, 599 Williams, K. A., 204 Williams, P. A., 115(90), 116(90), 118 Williamson, K. L., 118 Willis, B. T. M., 303 Willumeit, R., 589 Wilmanns, M., 620 Wilson, A. J. C., 57 Wilson, C., 29, 190, 555, 565, 577(45), 624, 625(87) Wilson, E. B., 565 Wilson, I. A., 316(41), 320, 558, 616 Wilson, K. S., 22, 29(2), 39, 56, 67, 70, 72(8), 229, 232, 235, 245, 295, 308, 322, 325, 332, 350, 400, 408(38) Wilson, M. A., 306 Wilson-Kubalek, E. M., 206, 221 Wimberly, B. T., 28, 164, 165(10), 175, 181(32) Wingreen, N. S., 620, 626(39) Winn, M. D., 303, 306, 310, 315(20) Winter, D. C., 212 Winterhalter, K., 115, 116(88)
Winterwerp, H. H. K., 77 Wiorkiewicz-Kuczera, J., 473, 475(63) Wisely, G. B., 608 Wishnia, A., 119 Witt, H. T., 115, 116(82) Wittmann, H. G., 170 Witz, J., 616 Wlodawer, A., 13, 16(30), 82, 121, 125, 129(24), 130, 132(24), 133(28), 399, 408, 432 Wlodek, S. T., 547 Wodak, S. J., 295, 381, 382(34), 473, 514, 547, 551, 574, 575, 575(136) Wolfenden, R., 150, 158, 196, 529 Wolynes, P. G., 618, 620(4), 621, 621(12), 623(4) Wong, C. H., 616 Woods, R. C., 246(29), 247 Woodward, C., 432, 517 Woody, R. W., 451 Woolfson, D., 524 Woolfson, M. M., 6, 39, 61, 63, 72, 72(8), 198, 200 Wootton, J. C., 467 Word, J. M., 239, 388, 390, 391(3), 392(12), 393, 393(12), 394, 395(19), 397(19), 398(12), 401(3), 403(12), 405, 406(12), 409(4), 411(33), 412(12), 417, 448, 452, 452(64), 456(63; 64), 461(63; 64) Woutersen, S., 451, 453(80) Wriggers, W., 205, 206, 208, 208(10; 11), 209, 209(10; 11), 210(11), 213(10), 221, 547 Wright, P. E., 375, 378(20), 379(20) Wu, G., 471, 480, 490(74) Wulff, M., 615 Wunsch, C. D., 484 Wüthrich, K., 375, 378(20), 379(20), 445 Wyckoff, H. W., 7 Wyman, J., 611
X Xia, X., 448 Xiang, S., 29, 46, 150, 158, 191, 194(21), 196, 196(21), 200(21) Xu, E. H., 608 Xu, H., 70, 75 Xu, Z., 553, 565
Y Yajima, H., 209 Yamakura, T., 119 Yamato, T., 517 Yang, A.-S., 501, 506(6) Yang, D., 122, 262 Yang, F., 121, 408 Yang, W., 448, 449, 450(73) Yaniv, M., 597 Yao, J.-X., 39, 72, 72(7; 8) Yeh, J. I., 166 Yeh, L. S. L., 480 Yin, D., 473, 475(63) Yin, Y. H., 411, 412(35) Yoder, J. A., 115(95), 116(95), 118 Yohn, C. B., 206, 207(16) Yokochi, M., 606 Yonath, A., 164, 170 Yoshida, K., 613 Young, A. C. M., 105, 109(74) Young, H. S., 209 Young, M., 547 Yu, H., 620 Yu, X., 206(23), 207 Yuan, C.-S., 29, 64, 65(77), 68(77) Yuan, F., 471, 477(43), 479(43) Yuda, S., 96, 97(70) Yue, K., 514 Yusupov, M. M., 615 Yusupova, G. Z., 615 Yuzawa, S., 606
Z Zaccai, G., 589, 598, 599(42), 607 Zagalsky, P. F., 80, 96, 115(69) Zaidi, Z. H., 594(26), 595 Zaitsev, V., 613 Zaitseva, I., 613 Zalis, M. E., 388, 409(4), 417 Zarivach, R., 164 Zhang, D., 602 Zhang, G., 164, 209 Zhang, H., 77, 115(94), 116(94), 118 Zhang, J., 544 Zhang, K. Y. J., 29, 46, 188, 189, 190, 190(5), 191(5), 197(5), 199(5), 200, 200(5), 201, 282
Zhang, X., 115(92; 95), 116(92; 95), 118 Zhang, Y.-M., 77 Zhang, Z., 544 Zhao, J., 593, 610(18) Zheng, C.-D., 39, 72(8) Zheng, W., 512, 515(5), 517, 517(5; 7), 522(7) Zhi, G., 610 Zhong, L., 323 Zhou, K., 179 Zhou, L., 115(92; 95), 116(92; 95), 118 Zhou, R., 306, 310(17)
Zhou, T., 115(94), 116(94), 118, 396 Zhou, Y., 630, 631, 631(119) Zhu, J., 603 Zimmer, R. M., 514 Zimmermann, W., 115, 116(88) Zou, J.-Y., 207, 244, 278, 390, 391(8), 411(8), 412(8) Zwick, M., 46 Zwickl, P., 163 Zydowsky, L. D., 196
Subject Index
A Acyl protein thioesterase I, rapid halide soak in structure determination, 127, 135–136 Aldose reductase, subatomic high-resolution multiple-wavelength anomalous diffraction phasing comparison of refined 0.66 angstrom map with MAD map, 337 comparison of refined phases with MAD phases, 337 comparison with medium-resolution refinement, 341–343 crystallization, 328 data collection, 328–330 MAD phasing to 0.9 angstroms, 336–337 protonation, 330–331 refinement of model against selenomethionine derivative data, 336 short contacts, 331 solvent structure, 334–336 stereochemistry, 332 validation of refined model against MAD map, 337–340 water structure validation, 340–341 Amicyanin, SHELXL program refinement of structure, 354–355, 359–360, 362 2-Aminoethylphosphonate transaminase, automated structure solution, 77–78 Anisotropic displacement parameters, see Translation, rotation, screw-rotation parameterization ARP, see Automated Refinement Procedure Asparagine, all-atom contact validation of protein structures, 402–403 Aspartate transcarbamylase, small-angle X-ray scattering studies ligand-induced conformational change, 610–611 R state, 605–606 time-resolved studies, 616 Automated Refinement Procedure ARP/wARP software
applications, 239–240 availability, 242 example, 240–242 interface, 240 overview, 232–234, 242–244 protein backbone modeling, 235–238 sequence docking and side-chain fitting, 238–239 pattern recognition theory, 229–231 problems in automation, 231–232 protein modeling from free atoms, 233–234
B Benzoate dioxygenase reductase, translation, rotation, screw-rotation refinement, 322 Bijvoet difference Fourier synthesis approximate electron density, 141–143 example, 143–145 Friedel’s law, 138–141 overview, 137–138 historical perspective and principles, 7–11 multiple isomorphous replacement with anomalous scattering, 18 Bovine pancreatic trypsin inhibitor small-angle X-ray scattering, 607–608 time-averaging refinement crystallographic data, 422 flexibility probing overall protein, 435–436 specific regions, 436–438, 440–441 refinement, 432–433 success of refinement, 433, 435 water structure, 433, 441 BPTI, see Bovine pancreatic trypsin inhibitor
C Cα, CAPRA pattern recognition algorithm, 250–255 CAPRA, see TEXTAL System
Cβ deviations, all-atom contact validation of structures, 397–398, 409–410, 413 CCP4, see Collaborative Computational Project Number 4 Ceruloplasmin, small-angle X-ray scattering of ligand-induced conformational change, 612–613 CNS, see Crystallography and NMR System Collaborative Computational Project Number 4 applications, 74–81 availability, 81, 83 data preparation, 70–72 enantiomorph determination, 73–74 heavy-atom searching and phasing, 72–73 overview, 39, 70 scoring trial structures, 73 site validation, 73–74 substructure refinement, 73–74 Comparative protein structural modeling, see Homology-based protein structural modeling Computational geometry, see Simplicial Neighborhood Analysis of Protein Packing Crambin, quantum mechanics modeling of solution structure, 450–451 Crystallography and NMR System applications, 74–81 availability, 81, 83 data preparation FA structure factors, 49 Patterson map combination, 49 sigma cutoffs and outlier elimination, 47–49 enantiomorph determination, 54–55 heavy-atom searching direct-space method, 52 peak search and position check, 53 principles, 49–52 reciprocal-space method, 52 overview, 39, 47 scoring trial structures, 53–54 site validation, 54 substructure refinement, 54 Cyanovirin, homology-based protein structural modeling with MODELLER, 491–493 CzrA, TEXTAL System in automated model building, 262–269
D Database of Macromolecular Movements, see Protein flexibility Delaunay tessellation, see Simplicial Neighborhood Analysis of Protein Packing Density modification density histogram matching, 190–191 multidimensional histograms components and analysis, 194–197 constraint for density modification double-histogram matching, 197–200 two-dimensional histogram matching, 200–201, 203 derivatives, 193–194 geometric shape and stereochemical information, 191–192 insensitivity to molecular conformation, 196–197 n-dimensional histogram, 192–193 overview, 188–190
E Electron microscopy atomic model docking into images classification of fitting methods, 205–206 density correlation-based fitting, 209–210 fimbrin, 222–223, 225 fitting requirements, 205 manual fitting, 206–208 myosin, 220–222 product-moment correlation coefficient as fitting criterion accuracy of fitting, 211–212 confidence intervals, 212–214 overview, 210–211 solution set interpretation and analysis, 214–216 solution set robustness, 216–219 validation, 219–220 vector quantization-based fitting, 208–209 resolution, 204 small-angle X-ray scattering complementation, 615 EM, see Electron microscopy
F Ferredoxin, SHELXL program refinement of structure inconsistency detection, 364–367 overview, 355, 362 restraints, 362–364 Fimbrin, atomic model docking into electron microscopy images, 222–223, 225 Friedel’s law, Bijvoet-difference Fourier synthesis, 138–141 Full matrix optimization, see also SHELXL program overview of optimization problem, 347–348 principles of optimization first-order methods, 349 second-order methods, 349–354 Taylor series expansion, 348–349 underdetermined systems, 354 zeroth-order methods, 349 programs, 350 prospects, 368, 370–371
G α-Galactosidase, rapid halide soak in structure determination, 126, 131, 134 β-Galactosidase, rapid halide soak in structure determination, 126, 134–135 Generalized Born model, see Implicit solvent models Glutamate dehydrogenase, small-angle X-ray scattering, 604–605 Glutamine, all-atom contact validation of protein structures, 402–403 G proteins, packing motif identification using Simplicial Neighborhood Analysis of Protein Packing, 542–544 GRASP2 program alignment of structures, 499–501 design and capabilities, 494, 510–511 electrostatic energy calculation and visualization, 508–510 file input/output, 497 graphical user interface, 494–486 objects atomic subsets, 496 electrostatic potential maps, 496
manipulation, 497–498 molecular models, 497 surfaces, 496 views, 496 sequence–structure–function relationship analysis, 506–508 subsets, 498–499 surface property comparison between different molecules, 501–505 Grb2, small-angle X-ray scattering, 606–607
H Halide rapid soaks advantages, 136–137 anomalous scatterer sites acyl protein thioesterase I, 127 -galactosidase, 126 trypsin inhibitor, 127 automated substructure determination, 81–82 bromine, 122, 124 chlorine, 122 classic heavy atom reagent comparison, 123–124 cryosoaking conditions, 124–125, 128–130 examples of solved structures acyl protein thioesterase I, 135–136 α-galactosidase, 126, 131, 134 β-galactosidase, 134–135 table, 132–133 heavy alkali metal comparison, 123, 136–137 iodine, 122, 124 principles, 120–121 Heavy-atom soaking, see Halide rapid soaks Hidden Markov model, structure comparison, 559–560 Histidine, all-atom contact validation of protein structures, 400–401 Homology-based protein structural modeling applications, 464 MODELLER program availability, 479 capabilities, 480 cyanovirin modeling, 491–493 lactate dehydrogenase modeling active site modeling with multiple templates, 486–490
Homology-based protein structural modeling (cont.) model building, 485–486 related structure searching, 480–481 sequence alignment with template, 484–485 template selection, 481–484 loop modeling, 475–476 rationale, 464–465 steps iterating alignment, modeling, and model evaluation, 479 model building, 472–476 model evaluation, 476, 478–479 overview, 465–466 target sequence alignment with structures, 471–472 template searches, 467, 469–470 template selection, 470–471 Web servers and resources, 465, 467–469
I Implicit solvent models continuum dielectric models generalized Born model, 418, 420–422 polarization free energy calculation, 419–420 overview, 415–418, 462 Isoleucine, all-atom contact validation of protein structures, 406 Isomorphous difference Fourier methods advantages, 163 difference Fourier map computation deviant observations, identification and deletion, 155–156 pairing, 152–153 scaling, 154–155 difference visualization between similar crystal structures, 145–150 interpretation of difference density maps, 157–159 isomorphism maintenance, 151–152 refinement of isomorphous structures, 159–160, 162 resolution range, 157 sensitivity, 150, 163 software, 150
Isomorphous replacement Bijvoet’s contributions, 3–7 fixed-wavelength analysis, 10–13
K Karle’s analysis, historical perspective and principles, 13–17 Ketopantoate hydroxymethyltransferase, automated structure solution, 75, 77 Krypton heavy atom in protein structure determination cryocrystallography gas binding kinetics, 106 overview, 105–106 pressurization cells, 106–108 room temperature experiment comparison, 108–109 examples, 115–118 N-myristoyltransferase, 113, 115 porcine pancreatic elastase, 112–113 pressurization of crystals, 83–84, 99–102, 110–112 room temperature experiments, 99–105, 108–109 molecular forces in protein interactions, 89–94 physical properties, 89–90 protein complexes, 88–89 solubilities, 94–95
L Lactate dehydrogenase, homology-based protein structural modeling with MODELLER active site modeling with multiple templates, 486–490 model building, 485–486 related structure searching, 480–481 sequence alignment with template, 484–485 template selection, 481–484 Lactone, SHELXL program refinement of structure, 354–357, 359 LDH, see Lactate dehydrogenase Leucine, all-atom contact validation of protein structures, 406
Ligand binding, crystal structures all-atom contact validation of protein structures, 407–409 model building with X-BUILD, 292 placement with X-LIGAND, 296–297 Protein Data Bank structure nomenclature, 380 Light-harvesting complex II, translation, rotation, screw-rotation refinement, 316–318 LOOKUP, see TEXTAL System Lysozyme, small-angle X-ray scattering of T4 enzyme, 604
M MAD, see Multiple-wavelength anomalous diffraction Major histocompatibility complex class I peptide, translation, rotation, screw-rotation refinement of complexes, 320 Mannitol dehydrogenase, translation, rotation, screw-rotation refinement, 318–320 Matthews algorithm, fixed-wavelength analysis, 11–12 Metal rapid soaks, see Halide rapid soaks Methionine, all-atom contact validation of protein structures, 406 Mevalonate kinase, TEXTAL System in automated model building, 262–263, 265–267 MIR, see Multiple isomorphous replacement MIRAS, see Multiple isomorphous replacement with anomalous scattering MODELLER program availability, 479 capabilities, 480 cyanovirin modeling, 491–493 lactate dehydrogenase modeling active site modeling with multiple templates, 486–490 model building, 485–486 related structure searching, 480–481 sequence alignment with template, 484–485 template selection, 481–484 loop modeling, 475–476
MOLPROBITY, Web-based all-atom contact validation of structures, 4, 411–412 Multidimensional histogram, see Density modification Multiple isomorphous replacement automated solution, see Crystallography and NMR System; RESOLVE program; SOLVE program error sources, 325 historical perspective, 16–17 measurements used for substructure determination, 38 solution approaches, 16–17 Multiple isomorphous replacement with anomalous scattering automated solution, see Crystallography and NMR System; SHELXD program; SnB program measurements used for substructure determination, 38 Multiple-wavelength anomalous diffraction accuracy and error sources, 325–326 automated solution, see Collaborative Computational Project Number 4; Crystallography and NMR System; RESOLVE program; SOLVE program historical perspective, 14, 1 9–18 measurements used for substructure determination, 38 single-wavelength anomalous diffraction comparison, 21–22 subatomic high-resolution phasing aldose reductase comparison of refined 0.66 angstrom map with MAD map, 337 comparison of refined phases with MAD phases, 337 comparison with medium-resolution refinement, 341–343 crystallization, 328 data collection, 328–330 MAD phasing to 0.9 angstroms, 336–337 protonation, 330–331
Multiple-wavelength anomalous diffraction (cont.) refinement of model against selenomethionine derivative data, 336 short contacts, 331 solvent structure, 334–336 stereochemistry, 332 validation of refined model against MAD map, 337–340 water structure validation, 340–341 data collection, 327–329 overview, 326–327 prospects, 343–344 Myosin, atomic model docking into electron microscopy images, 220–222 N-Myristoyltransferase, krypton as heavy atom in protein structure determination, 113, 115
N NMR, see Nuclear magnetic resonance Nuclear magnetic resonance size limitations in macromolecular structure determination, 204 xenon studies of macromolecular structure, 118–119
P Patterson function automated solution of heavy-atom substructures, 37–39 historical perspective, 6 PDB structures, see Protein Data Bank structures PKA, see Protein kinase A PROBE program all-atom contact validation of structures, 388, 399, 411 availability, 410–411 Protein Data Bank structure validation, 381 Protein Data Bank structures historical perspective and applications, 372, 374 newsletters, 372–373 nomenclature standardization, 375–376
Research Collaboratory for Structural Bioinformatics management, 375 validation legacy files, 383–384 nomenclature errors, 385 overview of process, 376–377 primary processing, 382–383 prospects, 386–387 revalidation, 386 sequence representation errors, 384–385 specific structure checks atom nomenclature, 378–380 checks against experimental data with SFCHECK, 381–382 close contacts, 379–380 covalent bond distances and angles, 377–378 ligand nomenclature, 380 sequence comparison, 380 stereochemical validation, 378 Validation Server, 382 water, 381 Protein flexibility classification of motions, 547 Database of Macromolecular Movements motion attributes, 550–551 normal mode analysis, 562, 565–566, 568–570 number of protein motions, 553, 555 overview, 548–549 packing classification, 552–553 size classification, 551 Web access, 547 energy minimization, 575–576 flexible linker identification from genome sequences, 577–578, 581, 583 molecular dynamics, 575–576 normal mode analysis, 575–576 prospects for computational analysis, 584–586 structure comparison techniques for databases hidden Markov model, 559–560 interpolation between structures adiabatic mapping interpolation, 561 Morph server, 562 multiple structural alignment, 559 pairwise structural alignment, 557–559 sieve fit superposition and screw axis orientation, 556–557
Voronoi polyhedra for packing quantification, 570, 573–575 Protein folding computation from sequence historical perspective, 623–625 overview, 619–621 structure databases, 625–627 transition state ensemble, 625 evolutionary conservation, 619 kinetics from discrete molecular dynamics simulations algorithm, 631–632 experimental validation, 637 folding nucleus identification, 632–634 prospects, 639–640 protein model, 627–631 thermodynamics of folding, 632 transition state ensemble identification, 634–635 virtual screening, 635–637 Levinthal paradox, 621 nucleation scenario, 621–623 protein engineering analysis, 623 small-angle X-ray scattering time-resolved studies, 616–617 topology factors in modeling, 637–639 Protein kinase A, small-angle X-ray scattering of ligand-induced conformational change, 610 Protein packing, see Simplicial Neighborhood Analysis of Protein Packing
Q QM/MM model, see Quantum mechanics modeling Quanta program all-atom model generation, 288–209 automated model re-building overview, 274–275 bones editing for map mask, 282 generation in X-AUTOFIT, 279, 281 parameters, 281–282 data input, 278–279 logging, 298–299 recovery, 299–300 design, 276–277
external program running, 300–301 hydrogen fitting, 296 interface, 277–278 ligand placement with X-LIGAND, 296–297 map mask generation, 282–283 map tracing with X-POWERFIT and CA BUILD, 284–286 memory usage, 300 modal versus amodal actions, 279 model building with X-BUILD alternate conformations, 292–293 ligands, 292 non-bonding restraints, 292 nucleic acids, 291 refinement to map, 290–291 residue types, 291 validation, 293–295 non-bond marks, 302 palettes and tools, 277, 302 parameter file, 300 resolution scaling, 301 secondary structure identification with X-POWERFIT, 284 sequence assignment, 286–288 solvent placement with X-SOLVENT, 295–297 symmetry handling, 301 text annotation, 298 training, 302–303 water modeling, 295–296 Quantum mechanics modeling high-level quantum mechanics modeling confidence in database and false rotamers, 461–462 mean conformation versus minimum energy conformation, 456–458 principles, 453–456 rotamer distribution, 458–459, 461 torsion angle distribution, 458 overview, 416, 447–448, 463 QM/MM model crambin solution structure, 450–451 dipeptide solution structure, 451–453 features, 448–450
R Ramachandran plot, all-atom contact validation of structures, 394
Rapid soaks, see Halide rapid soaks REDUCE program all-atom contact validation of structures, 388, 399, 403, 407, 411 availability, 410–411 REFMAC program, see Translation, rotation, screw-rotation parameterization RESOLVE program general computational approaches for automated structure solution, 23–24 overview, 22–23 prospects, 36–37 statistical density modification cycle of solvent flattening, 32–33 mathematics, 31–32 overview, 29–31 pattern matching example, 35–36 validation, 33–35 Retinoid receptors, small-angle X-ray scattering of ligand-induced conformational change, 608 RhoE, translation, rotation, screw-rotation refinement, 322 Ribonucleic acid, all-atom contact validation of structures, 409–410 RNA, see Ribonucleic acid Rotamer, see Side-chain rotamer libraries
S S100A12, translation, rotation, screw-rotation refinement, 322 SAD, see Single-wavelength anomalous diffraction SAXS, see Small-angle X-ray scattering Serine proteases, packing motif identification using Simplicial Neighborhood Analysis of Protein Packing, 541–542 SHARP, see Statistical heavy atom refinement and phasing SHELXD program applications, 74–81 availability, 81, 83 data preparation normalization, 57–58 resolution cutoffs, 59 sigma cutoffs and outlier elimination, 58–59 enantiomorph determination, 69
halide soaks in substructure determination, 81–82 heavy-atom searching and phasing multisolution methods and trial structures, 61–62 principles, 59–61 real-space constraints, 63–65 reciprocal-space phase refinement or expansion, 62–63 overview, 39, 55–57 scoring trial structures correlation coefficient, 65–66 crystallographic R, 65 minimal function, 65 Patterson figure of merit, 65 site validation comparison of trials, 67–68 crossword tables, 67 substructure refinement, 69–70 weak anomalous signals in substructure determination, 78–80 SHELXL program full matrix optimization amicyanin, 354–355, 359–360, 362 ferredoxin inconsistency detection, 364–367 overview, 355, 362 restraints, 362–364 lactone, 354–357, 359 vector computers, 371 Side-chain rotamer libraries, all-atom contact validation of structures, 390–394, 406–407 Siderophore-binding protein, translation, rotation, screw-rotation refinement, 320–321 Simplicial Neighborhood Analysis of Protein Packing computational geometry in protein structure analysis Delaunay tessellation application to protein structure, 520–521 higher-order residue contact geometry Delaunay tessellation, 516–520 four-body contacts, 515 three-body contacts, 514 Voronoi diagram, 517–518 Voronoi tessellation, 515–517
reduced representation of protein structure and residue contacts, 513–514 elementary packing motif identification using Delaunay tessellation, 521–523 examples of packing motif identification G proteins, 542–544 nuclear receptor ligand-binding domains, 540–541 overview, 539–540 serine proteases, 541–542 four-body statistical scoring function for fold recognition based on Delaunay tessellation contact mapping, 526, 528–529 hydrophobic core identification and visualization, 531–534, 536 remote fold recognition, 536–538 score correlation with experimentally measured hydrophobicities, 529, 544 virtual mutagenesis, 538–539 modules, 512–513 overview, 511–513 prospects, 544, 546 quadruplet residue contact identification using Delaunay tessellation, 521–523 sequence–structure space of tertiary contacts, 523–524 tertiary packing patterns and secondary structure interfaces, 524–526 Web server, 512, 546 Single isomorphous replacement, measurements used for substructure determination, 38 Single isomorphous replacement with anomalous scattering, measurements used for substructure determination, 38 Single-wavelength anomalous diffraction automated solution, see Collaborative Computational Project Number 4; RESOLVE program; SHELXD program; SnB program; SOLVE program historical perspective, 13, 20 measurements used for substructure determination, 38 multiple-wavelength anomalous diffraction comparison, 21–22 prospects, 20–22
SIR, see Single isomorphous replacement SIRAS, see Single isomorphous replacement with anomalous scattering Small-angle X-ray scattering applications ab initio shape modeling, 602–603 comparison of crystal and solution structures, 604–608 conformational changes on ligand binding, 608, 610–614 crystallization condition optimization, 601–602 large structure data merging, 615 prospects, 617 search model in molecular replacement, 603–604 time-resolved studies, 615–617 distance distribution function and invariants, 591–593 modeling tools ab initio methods, 593–596 addition of missing loops and domains, 600–601 automatic constrained fitting, 601 computation of scattering patterns from atomic models, 597–599 rigid body refinement, 599–600 radiation sources and damage, 587 solution scattering theory, 587–591 X-ray absorption spectroscopy coupling, 613–614 SNAPP, see Simplicial Neighborhood Analysis of Protein Packing SnB program applications, 74–81 availability, 81, 83 data preparation normalization, 57–58 resolution cutoffs, 59 sigma cutoffs and outlier elimination, 58–59 enantiomorph determination, 69 halide soaks in substructure determination, 81–82 heavy-atom searching and phasing multisolution methods and trial structures, 61–62 principles, 59–61 real-space constraints, 63–65
reciprocal-space phase refinement or expansion, 62–63 overview, 39, 55–57 scoring trial structures correlation coefficient, 65–66 crystallographic R, 65 minimal function, 65 Patterson figure of merit, 65 site validation comparison of trials, 67–68 crossword tables, 67 substructure refinement, 69–70 weak anomalous signals in substructure determination, 78–80 SOLVE program applications, 74–81 availability, 81, 83 data preparation, 41–42 decision-making in automated structure solution, 25–26 electron density map output, 28 figure of merit of phasing, 45–46 general computational approaches for automated structure solution, 23–24 heavy-atom searching and phasing, 42–44 input, 27–28 multiple isomorphous replacement structure solution, 24–25 multiple-wavelength anomalous diffraction structure solution, 24–25 nonrandomness of electron density, 46–47 overview, 22–23, 39–40 Patterson agreement, 45 prospects, 36–37 scoring, 44–47 scoring criteria for heavy atom partial structures, 26–27 single-wavelength anomalous diffraction structure solution, 24–25 site validation, 45 weak anomalous signals in substructure determination, 78–80 Statistical heavy atom refinement and phasing, principles, 18–19
T Taylor series expansion, structure optimization, 348–349
TEXTAL System accuracy of protein models, 269–270, 273 CAPRA C-Alpha Pattern Recognition Algorithm, 250–255 examples CAPRA results, 264–269 CzrA, 262–269 LOOKUP results, 267–269 mevalonate kinase, 262–263, 265–267 tracer results, 263–264 LOOKUP core pattern matching, 250, 255 feature extraction to represent density patterns, 255–257 feature matching in region database searching, 257–260 overview, 245–246 pattern recognition, 246–250 postprocessing steps, 260–262 prospects, 273 Thioredoxin reductase, translation, rotation, screw-rotation refinement, 323 Threonine, all-atom contact validation of protein structures, 403, 405–406 Time-averaging refinement bovine pancreatic trypsin inhibitor crystallographic data, 422 flexibility probing overall protein, 435–436 specific regions, 436–438, 440–441 refinement, 432–433 success of refinement, 433, 435 water structure, 433, 441 interpretation of results, 430–431 overview, 416, 422–424, 462 prospects, 443, 445–447 steps, 425–427, 430 water channels within crystals, 442–443 TLSANL program, see Translation, rotation, screw-rotation parameterization TLS parameterization, see Translation, rotation, screw-rotation parameterization Translation, rotation, screw-rotation parameterization anisotropic displacement parameter refinement overview, 303–305 parameter features, 306–308 programs, 305–306 prospects, 323–324
refinement examples benzoate dioxygenase reductase, 322 light-harvesting complex II, 316–318 major histocompatibility complex class I peptide complexes, 320 mannitol dehydrogenase, 318–320 rhoE, 322 S100A12, 322 siderophore-binding protein, 320–321 table, 317 thioredoxin reductase, 323 REFMAC program input, 311–312 interpretation of results, 312–314 overview, 308–310 rigid group selection, 310–311 rigid body motion, 304–305 TLSANL analysis, 314–315
V Valine, all-atom contact validation of protein structures, 406 Voronoi polyhedra, packing quantification, 570, 573–575 Voronoi tessellation overview, 515–517 packing region identification with Voronoi diagram, 517–518
W Water structure, see also Implicit solvent models all-atom contact validation of protein structures, 407 channels within crystals, 442–443 Quanta program modeling, 295–296 solvent placement with X-SOLVENT, 295–297 subatomic high-resolution multiple-wavelength anomalous diffraction phasing validation in aldose reductase, 334–336, 340–341 time-averaging refinement in bovine pancreatic trypsin inhibitor, 433, 441
X X-AUTOFIT, see Quanta program X-BUILD, see Quanta program
Xenon anesthesia, 84–85, 119 heavy atom in protein structure determination advantages, 97–98 cryocrystallography gas binding kinetics, 106 overview, 105–106 pressurization cells, 106–108 room temperature experiment comparison, 108–109 examples, 115–118 historical perspective, 85–88 N-myristoyltransferase, 113, 115 porcine pancreatic elastase, 112–113 pressurization of crystals, 83–85, 99–102, 110–112 room temperature experiments, 99–105, 108–109 X-ray scattering properties, 95–97 molecular forces in protein interactions, 89–94 nuclear magnetic resonance studies, 118–119 physical properties, 89–90 protein complexes, 84–85, 88–89 solubilities, 94–95 X-LIGAND, see Quanta program X-POWERFIT, see Quanta program X-ray crystallography atomic model docking into electron microscopy images, see Electron microscopy automated structure solution, see specific programs large asymmetric assemblies crystallization challenges, 164, 166 data collection challenges, 166–168, 170 examples, 165 heavy atom clusters for phasing, 174–175, 177–178 high-resolution phasing, 178–179, 181–182 overview, 163–164 phasing challenges, 171–174 prospects, 188 solvent-flipping phase refinement, 182–188 X-SOLVENT, see Quanta program