* Department of Chemistry & Institute of Molecular Biophysics, Florida State University, Tallahassee, FL 32306-3015, USA.
¥ Department of Molecular Biology, The Scripps Research Institute, 10550 N. Torrey Pines Rd., MB-13, La Jolla, CA 92037, USA.
‡ Department of Biological Sciences, University of Warwick, Coventry, CV4 7AL, UK.
§ WP44-B122, Merck & Co. Inc., West Point, PA 19486, USA.
¶ Purdue University, Department of Biological Sciences, West Lafayette, IN 47907, USA.
% Center for Macromolecular Crystallography, 260 BHS THT 79, University of Alabama, UAB Station, Birmingham, AL 35294, USA.
A generous definition of "ab initio phasing" can include molecular replacement, MAD methods etc. (see elsewhere, this volume). Virus structures are among the few large structures solved with phases calculated from single monochromatic data sets without an atomic model. Here, "ab initio" will be used only when there is no prior source of phases. "Ab initio" has been used to describe the calculation of all phases from scratch, or the calculation of high resolution phases from low resolution phases determined by other methods [1]. Here, to distinguish the two, the latter will be termed "phase extension" which is discussed more fully in the accompanying article [2].
Initial phasing of viruses has distinctive features, but phase improvement and extension by NCS is a specific form of density modification. Similar iterative techniques are used, as shown in Figure 1 of the accompanying article [3]. For viruses, the equivalence of reciprocal- and real-space procedures is especially instructive. Each phase can be regarded as calculated from a weighted vector sum of neighboring structure factors (see accompanying article [3] and references therein).
The forebear of high resolution studies was the 22.5 Å phase determination of SBMV [12]. Between 60 and 35 Å, spherically averaged structure amplitudes agreed well with those of a solid sphere with diameter of 281 Å, the nearest neighbor distance for packed spheres in the crystal. Centric phases were calculated from the model, refined using NCS and extended to 22.5 Å where the agreement broke down between experimental structure amplitudes and those calculated by back transformation of the map. The limitation to 22.5 Å was blamed on coincidence of the 532 icosahedral symmetry and special points of 23 crystallographic symmetry, and the resulting failure to break the centric nature of phases. In retrospect, with only 10-fold non-crystallographic symmetry, SBMV would be expected to be one of the more difficult phase determinations.
A more sophisticated, non-centric starting model was built by decorating a sphere with cylinders, consistent with electron microscopic images [13]. Maps calculated by phase extension to the 22.5 Å resolution limit were consistent with later high resolution structure determination [14].
With viruses, NCS was used first to improve inaccurate phases obtained by isomorphous replacement. Then it was used to improve molecular replacement phases calculated from structures that were first closely related, but later decreasingly so, as confidence improved in NCS phase refinement. In what was then an extreme case, the structure of the bacteriophage MS2 was determined starting with 13 Å phases calculated from a completely unrelated model, SBMV. Extension to 3.4 Å led to an uninterpretable map, but the phases were good enough to determine the binding sites of heavy metals. This overcame the formidable problem of determining > 90 heavy atom sites per crystallographic asymmetric unit. Isomorphous replacement phases were extended successfully to 3.3 Å resolution. This initial phase determination suggested that it might be possible to start at very low resolution, that a very crude initial model might suffice, and gave valuable insights into the phasing process (see later).
The initial phasing model for CPV was similar to that used for SBMV, except that a spherical shell allowed the nucleic acid and protein to have different uniform densities. CPV showed the critical importance of precise phasing models. Phases for a model can be calculated by back-transformation of an electron density map. However, an analytical calculation of structure factors is useful for parameter refinement [5]:
where Ro and Ri are the outer and inner shell radii, ρNA is the density of the nucleic acid (relative to the protein), j = √(−1), h is the reflection index, and Sn is the position of the nth virus in the unit cell. By comparing calculated and observed structure amplitudes it is possible to determine Ro, Ri, ρNA and Sn by systematic search or least-squares refinement. Refinement is facilitated by analytical partial derivatives [5].
How are the phases calculated? The crystal is the convolution of point scatterers (exponential term) and spherical shells (G term). In most cases, N is small (1 or 2 per asymmetric unit) and the exponential generates only centric terms. G, the analytical expression for the Fourier transform of a spherical shell, also has only centric terms. Its effect on the phase of the point scatterer term is either to change it by 180° (G < 0) or to leave it unchanged (G > 0).
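A form of the structure factor expression consistent with the symbols defined above can be sketched as follows. It rests on assumptions that are not stated in the text: densities are taken relative to solvent, the protein shell density is set to 1, s = |h| is the length of the reciprocal lattice vector in reciprocal Å, and the sign of the exponent follows the usual crystallographic convention. The convolution of point scatterers with shells in real space becomes a product in reciprocal space.

```latex
% Assumed conventions: x = 2\pi s R with s = |\mathbf{h}| in 1/\AA;
% densities relative to solvent; protein shell density = 1.

% Transform of a unit-density solid sphere of radius R
\Phi(s, R) = \frac{4\pi R^{3}\,(\sin x - x\cos x)}{x^{3}}, \qquad x = 2\pi s R

% Two-density model: protein shell (R_i to R_o) around a nucleic acid core
G(s) = \Phi(s, R_o) + (\rho_{NA} - 1)\,\Phi(s, R_i)

% Point-scatterer (exponential) term multiplied by the shell (G) term
F(\mathbf{h}) = G(|\mathbf{h}|) \sum_{n=1}^{N} \exp\!\left(2\pi j\,\mathbf{h}\cdot\mathbf{S}_n\right)
```

Because G is real, it can only leave the phase of the exponential sum unchanged (G > 0) or shift it by 180° (G < 0), as described above.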
Figure 1: Ambiguities of phase determination. (a) The Fourier transform of a solid sphere with radius of ~ 125 Å. In determining parameters for a phasing model, its agreement with experimental structure amplitudes is maximized. Data are normally only available at resolutions where non-spherical components of the virus gradually dominate those of a perfect sphere. (b) G-functions of "a" in thick line and with radius changed by 4% in thin line. Within the 35 to 20 Å resolution range, ~½ the calculated phases are incorrect (G > 0 instead of G < 0 and vice versa). (c) With 8% error in radius, most phases are the Babinet opposites. Provided enough phases are mutually consistent, extension might yield a Babinet solution which can be corrected before an atomic model is built. Thus, this error is not as serious as in (b). (d) Panels a-c show phased (signed) G-functions, but only the magnitudes are observable. Within the available narrow resolution window, it is possible to nearly superimpose the G-functions of discretely different models. Different peaks superimpose at 22 Å, showing that complete low resolution data could resolve the ambiguity. With this single-parameter solid sphere, the superimposition in (d) is not good. When additional degrees of freedom are added by fitting inner radius, density levels and position, a nearly exact superimposition can be obtained over a finite resolution range.
Figure 1 illustrates how discretely different models may have similar fit to the diffraction data. Multiple optima are obvious in systematic searches [5]. The wrong choice can lead to an incorrect low resolution physical model. However, with extension to ~ 3 Å, Babinet inversion can be detected and corrected prior to building an atomic model. More damaging to phasing prospects are parameter sets between the optima. Simulations showed that these generate mixtures of correct phases and Babinet opposites from which convergence is not possible [19]. The tests showed that spherical shell radii need to be within 3% of one of the models corresponding to a correct or Babinet phase solution. Systematic fine search and least-squares refinement can meet these stringent criteria [5], but crystal packing calculations and electron microscopy might not. Lack of required precision is one of several possible explanations for the initial difficulties of several phase determinations (see below).
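The sensitivity to radius illustrated in Figure 1 can be explored with a minimal sketch such as the one below. It assumes a single-parameter solid sphere, samples the 35 to 20 Å range uniformly in reciprocal distance (not weighted by the number of reflections), and uses hypothetical function names (`sphere_transform`, `sign_disagreement`) that do not come from any of the cited programs.

```python
import math

def sphere_transform(s, radius):
    """Fourier transform of a unit-density solid sphere at reciprocal distance s (1/Å)."""
    x = 2.0 * math.pi * s * radius
    if x < 1e-4:
        # Small-argument limit: the transform tends to the sphere volume.
        return 4.0 / 3.0 * math.pi * radius ** 3
    return 4.0 * math.pi * radius ** 3 * (math.sin(x) - x * math.cos(x)) / x ** 3

def sign_disagreement(radius_a, radius_b, d_max=35.0, d_min=20.0, steps=2000):
    """Fraction of sampled points between d_max and d_min (Å) where the two signs differ."""
    s_lo, s_hi = 1.0 / d_max, 1.0 / d_min
    differing = 0
    for k in range(steps):
        s = s_lo + (s_hi - s_lo) * (k + 0.5) / steps
        # Opposite signs of G correspond to phases that differ by 180 degrees.
        if sphere_transform(s, radius_a) * sphere_transform(s, radius_b) < 0.0:
            differing += 1
    return differing / steps

if __name__ == "__main__":
    for error in (0.0, 0.04, 0.08):
        frac = sign_disagreement(125.0, 125.0 * (1.0 + error))
        print(f"radius error {error:.0%}: fraction of differing signs = {frac:.2f}")
```

A fraction near one half corresponds to the damaging mixture of correct and Babinet-opposite phases, whereas a fraction near one corresponds to a largely consistent Babinet solution.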
De novo phase determination for CPV was not completely successful due to failure to refine the point symmetry location as the resolution increased. This was only realized with analysis of partial isomorphous derivative data [9]. As with MS2, the heavy atom sites had negative peaks, indicating that the extended phases were Babinet-inverted. The 532 symmetry location was refined to maximize the heavy atom peaks. Extension was restarted with isomorphous replacement phases. Retrospective analysis showed that the ab initio phasing failed due to a ~ 2 Å error in the symmetry location. The 2 Å precision achieved at low resolution was not sufficient for extension beyond 9 Å. Average phase error of 41° quickly dropped to < 10° upon correcting the position, demonstrating that with greater experience, complete ab initio determination would have been possible [9].
Phasing of φX174 was initiated prior to completion of CPV. Spherically symmetric models with unrefined radii from packing calculations failed, as did early attempts with phases derived from electron microscopic (EM) images [20]. Success came with an unrelated atomic model fit into the EM envelope. Retrospective analysis showed that phases could have been extended from the EM image had the point symmetry location been accurate. Why did the atomic model work, but not the EM image directly? Perhaps the higher resolution structure factors from an atomic model led to a more precise virus position. Experience with CPV also suggests that phasing could fail if, due to experimental error, the size of the EM image was not within the 3% required precision. It might be necessary to refine, against the X-ray data, the EM magnification or the contrast transfer function [21], either of which can change the size of a near-spherical image.
Comments are restricted to differences with methods used for proteins [3].
Envelope: The symmetry of a virus normally belongs to a closed point group. The envelope needs only to distinguish different assemblies and not individual proteins. Relatively crude envelopes often suffice at the start, even polyhedra bounded by the planes that perpendicularly bisect the vectors between neighboring viral centers. Spheres, or a combination of spheres and planes, are often used [9, 24], but always set to err on the side of generous envelopes. When surface features are prominent, structure determinations have benefited from more detailed envelopes [20]. As phase recombination following 30+-fold NCS prevails over minor masking errors, envelopes can be improved during refinement by comparing the electron density at specific points with that at points to which they might be related by NCS [25]. Automatic solvent definition [26] can also be used. At very low resolution, the scattering of viral nucleic acid is significant, and it is better to flatten this to an average value that is independent of that chosen for the solvent [5]. At high resolution, scattering from the disordered nucleic acid can be considered to be the same as from the external solvent.
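The low resolution flattening described above can be illustrated with a minimal sketch. It assumes a purely spherical envelope around one particle (ignoring neighboring particles and bisecting planes), and the function name `flatten_regions` and its input layout are hypothetical.

```python
def flatten_regions(points, r_inner, r_outer):
    """Flatten nucleic acid core and solvent to separate average densities.

    points  : list of (radius, density) samples, radius measured from the
              particle center in Å (an assumed, simplified map layout)
    r_inner : inner radius of the protein shell (Å)
    r_outer : outer radius of the protein shell (Å)
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    # Independent averages for the core and for the external solvent.
    core_level = mean([rho for r, rho in points if r < r_inner])
    solvent_level = mean([rho for r, rho in points if r > r_outer])

    flattened = []
    for r, rho in points:
        if r < r_inner:
            flattened.append((r, core_level))
        elif r > r_outer:
            flattened.append((r, solvent_level))
        else:
            # Inside the protein shell the density is left for NCS averaging.
            flattened.append((r, rho))
    return flattened
```

Setting the two levels to a single common average would correspond to the high resolution treatment, in which the disordered nucleic acid is handled like external solvent.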
Filling missing observations: Even with data collected from hundreds of crystals, virus data sets rarely approach completion. It has been argued that with appropriate processing of accurate data and phases, only 1/60th of a data set for an assembly with 60-fold NCS is required to generate a map equivalent in quality to that of a complete data set without NCS [4, 27]. Thus the Herculean effort of completing a virus data set is rarely undertaken. The potential consequences of missing data are dire. A reflection does not contribute equally to all points of a map related by NCS. Thus a map calculated with partial data should not be expected to have the exact NCS that will be constrained. In reciprocal space, this is equivalent to calculating phases from neighboring structure factors, some of which have been incorrectly reset to zero. This shows a potential future benefit of reciprocal space implementation, in which the missing reflections could be ignored. In real-space implementations, it is now routine to complete the data by calculation of missing observations by back-transformation of the current map. There has not been a systematic study of how weighting of the filled reflections relative to observed reflections might be optimized to improve convergence.
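The filling of unmeasured reflections described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular program: the calculated structure factors are assumed to come from back-transformation of the current averaged map, and `fill_weight` is a hypothetical parameter standing in for the relative weighting whose optimum has not been systematically studied.

```python
import cmath

def fill_missing(f_obs, f_calc, fill_weight=1.0):
    """Combine observed amplitudes with back-transformed structure factors.

    f_obs       : dict mapping (h, k, l) -> observed amplitude
                  (unmeasured reflections are simply absent)
    f_calc      : dict mapping (h, k, l) -> complex structure factor
                  from back-transformation of the current map
    fill_weight : hypothetical weight applied to filled-in reflections
    """
    combined = {}
    for hkl, fc in f_calc.items():
        if hkl in f_obs:
            # Measured reflection: observed amplitude with the current phase.
            combined[hkl] = cmath.rect(f_obs[hkl], cmath.phase(fc))
        else:
            # Unmeasured reflection: calculated value instead of an implicit zero.
            combined[hkl] = fill_weight * fc
    return combined
```

With `fill_weight=0.0` the sketch reproduces the unfilled case, in which missing reflections are effectively reset to zero.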
Related are the effects of resolution truncation. Series termination errors, greatest near the viral surface, need not be icosahedrally symmetric. Simulations showed benefit in leaving a surface margin unaveraged [18], but series termination is usually less problematic with real data.
Extension: Especially without prior phases, resolution can only be extended in small steps (see accompanying article [3]). Progress is usually followed with graphs of the correlation coefficient between observed and back-transformed structure magnitudes plotted against resolution. A gradual decrease might indicate imprecise point symmetry, or poor envelope definition. Sudden changes might indicate the presence of multiple solutions (see below) and/or that more gradual extension might be required.
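The monitoring statistic mentioned above can be sketched as a Pearson correlation coefficient computed in resolution shells. The input layout, shell edges and function names are assumptions made for illustration only.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient; returns None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)

def cc_by_shell(reflections, shell_edges):
    """Correlation of observed and back-transformed magnitudes per shell.

    reflections : (d_spacing, f_obs, f_calc) tuples, d_spacing in Å
    shell_edges : shell boundaries in Å, in descending order
    Returns a list of (d_outer, d_inner, count, cc) rows.
    """
    table = []
    for d_outer, d_inner in zip(shell_edges[:-1], shell_edges[1:]):
        shell = [(fo, fc) for d, fo, fc in reflections if d_inner < d <= d_outer]
        cc = correlation([fo for fo, _ in shell], [fc for _, fc in shell])
        table.append((d_outer, d_inner, len(shell), cc))
    return table
```

Plotting the last column against resolution after each extension step gives the kind of graph in which gradual decreases and sudden changes can be distinguished.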
Programs: As few virus structures are solved, some prefer programs such as RAVE [28] that have been tested extensively in many protein structure determinations. These are not well optimized for problems as large as viruses. Increases of speed have been achieved with programs exploiting highly parallel computer architecture [29], or by loading the entire map onto cheap physical memory that has recently become available (EB & MSC, unpublished).
Histograms comparing final and intermediate phases have been revealing (Figure 2). Analysis of MS2 [30] showed that the initial (unsuccessful) refinement converged towards the Babinet solution, but some phases were close to the enantiomer or enantiomeric Babinet solution. Application of 532 symmetry is consistent with any of these solutions, but their co-existence was unexpected, because of their incompatible electron densities (that result in uninterpretable maps). The multiple solutions were thought linked to prior observations of correlation coefficient (above) oscillating with resolution, indicating variable quality of phase determination. Experience of several structure determinations suggested that reciprocal space was being partitioned with regions converging upon different solutions. As each phase depends only on neighboring structure factors, phases for each region could become consistent with each other and the NCS, but inconsistent with other regions of reciprocal space. Overall statistics might indicate little, because it would only be on the boundaries of regions where two solutions might "fight" each other and structure factors might be inconsistent with each other.
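The comparison of intermediate and final phases can be sketched by assigning each intermediate phase to the nearest of the four candidate solutions. The sketch assumes phases in degrees and uses the standard relations φ + 180° for the Babinet solution, −φ for the enantiomer and 180° − φ for the enantiomeric Babinet solution; the function names are hypothetical. For centric reflections some candidates coincide, and ties are resolved in the order listed.

```python
def wrap(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def classify_phase(phi, phi_final):
    """Name the candidate solution closest to an intermediate phase (degrees)."""
    candidates = {
        "correct": phi_final,
        "babinet": phi_final + 180.0,
        "enantiomer": -phi_final,
        "enantiomeric babinet": 180.0 - phi_final,
    }
    return min(candidates, key=lambda name: abs(wrap(phi - candidates[name])))

def solution_counts(phases, final_phases):
    """Count how many intermediate phases fall nearest to each solution."""
    counts = {"correct": 0, "babinet": 0, "enantiomer": 0, "enantiomeric babinet": 0}
    for phi, phi_final in zip(phases, final_phases):
        counts[classify_phase(phi, phi_final)] += 1
    return counts
```

Tabulating these counts separately for different regions of reciprocal space would be one way to look for the regional partitioning suggested above.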
The sensitivity of ab initio phasing to starting model can now be rationalized on the basis that for success, the vast majority of phases must belong to one consistent set.
As with MS2, competing solutions appeared limiting in the uninterpretable initial maps of NwV [23]. By contrast, the final ab initio-derived phases of φX174 and CPV showed only modest random error about a single Babinet solution [9, 20]. However, intermediate phases of φX174 showed multiple solutions, and also repeated switching of large numbers of phases between the correct, Babinet and enantiomeric solutions during phase refinement [20]. This might have been due to slight (~ 1 Å) adjustments of the symmetry location, but is not known for certain. It should be emphasized that although the ab initio determinations have been more exhaustively analyzed, phases for some other viruses have been difficult to refine [31-33]. Perhaps similar problems occur with phases originally derived from isomorphous or molecular replacement.
With the ability to compare to accurate final phases, had we chanced upon a general limitation of symmetry averaging? Did competing internally consistent regional phase sets limit phase refinement of proteins? The structure determination of human rhinovirus 50 (HRV50) (Blanc et al., in progress) presented an opportunity for an intermediate test of this hypothesis. Phases were determined not ab initio, but by molecular replacement from the closely related 3.5 Å HRV1A structure [34]. They were refined and extended to 2.4 Å resolution using 15-fold symmetry averaging. However, the map was disappointing, looking more like 4 Å resolution than 2 Å. Following model-building and real-space refinement [35], phases were calculated from the atomic model and re-extended from 4 Å resolution. Although refinement continues, the resulting improved map enabled the building of a structure with Rfree = 25% using all data to 2.0 Å. Unexpectedly, comparison of the two phase sets showed no evidence of multiple phase sets -- just random phase error about a single solution. Furthermore, searches through reciprocal space revealed that there was no significant correlation in the phase error of neighboring reflections.
The conclusion is anecdotal, because only one of the detailed analyses did not involve ab initio determination. There is a preliminary suggestion that low resolution ab initio phase determinations are more susceptible to competing solutions than higher resolution phase determinations. At both high and low resolution, different phase solutions may be consistent with the symmetry averaging. It may be that at high resolution, the effect of a poor model is near-random phase, whereas at low resolution, it can lead to large regions of reciprocal space with phase solutions that are internally consistent, consistent with the NCS, but compete with different solutions in other regions. It appears that competing solutions can persist through much phase refinement.