Combining Protein Evolution and Secondary Structure

Combining Protein Evolution and Secondary Structure Jeflrey L. Thorne, * Nick Goldman,t and David T. Jones-/-S2 *Program in Statistical Genetics, Statistics Department, North Carolina State University; TDivision of Mathematical Biology, National Institute for Medical Research, London; and SBiomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, London An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. It allows likelihood analysis of aligned protein sequences and does not require the underlying secondary (or tertiary) structures of these sequences to be known. One component of the model describes the organization of secondary structure along a protein sequence and another specifies the evolutionary process for each category of secondary structure. A database of proteins with known secondary structures is used to estimate model parameters representing these two components. Phylogeny, the third component of the model, can be estimated from the data set of interest. As an example, we employ our model to analyze a set of sucrose synthase sequences. For the evolution of sucrose synthase, a parametric bootstrap approach indicates that our model is statistically preferable to one that ignores secondary structure. Introduction It is widely recognized that evolutionary divergence of protein structures occurs much less rapidly than divergence of protein sequences (e.g., Chothia and Lesk 1986; Flores et al. 1993). This indicates that selective constraints may act to preserve protein structure. The nature of these constraints is poorly understood and they have received relatively little direct attention in the areas of population genetics and phylogeny reconstruction. We believe this lack of attention should be rectified. The process of molecular evolution will not be well understood until the constraints that affect it have been characterized. There has been less work on modelling amino acid replacement than on modelling nucleotide substitution. This disparity may be attributable to the extra complexity of modelling the replacement process. Replacements tend to occur between chemically similar amino acid types and a replacement model should reflect this. There are many ways to categorize amino acids by chemical properties (e.g., hydrophobicity, charge, relative size of side chain), and physicochemical distances between amino acid types have been suggested (Grantham 1974; Taylor and Jones 1993), but these categorizations or physicochemical distances may not directly reflect the differences among amino acid types that are acted upon by evolution. Cognizant of the difficulty of creating a realistic model for protein evolution based solely on physicochemical principles, Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978) developed an empirical approach. To construct Present address: Department of Genetics, University of Cambridge. 2 Present address: Department of Biological Sciences, University of Warwick. Key words: hidden Markov model, maximum likelihood, molecular evolution, phylogeny, protein structure. Address for correspondence and reprints: Jeffrey L. Thorne, Program in Statistical Genetics, Statistics Department, Box 8203, North Carolina State University, Raleigh, North Carolina 27695-8203. E- mail: thorne@stat.ncsu.edu. Mol. Biol. Evol. 13(5):666-673. 1996 0 1996 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 their empirical amino acid transition matrix, sets of easily aligned (i.e., closely related) sequences were collected. Only closely related sequences were considered because, when evolutionary distance is sufficiently small, the possibility of multiple replacements can be ignored. The observed replacement patterns were used to construct a probabilistic replacement model. When the Dayhoff model was originally proposed, relatively few protein sequences were known. More recent studies (Gonnet, Cohen, and Benner 1992; Jones, Taylor, and Thornton 1992) followed the spirit of the Dayhoff approach but were able to tabulate a larger number of observed replacements. Because the Dayhoff model is empirical, it reflects the fact that different amino acid types are replaced at different rates and the fact that amino acids are usually replaced by chemically similar amino acids. A problem with the Dayhoff approach is that it effectively models the replacement process at the average site in the average protein. There may be no such thing as an average site in an average protein. The physical environment of a protein site may greatly influence the replacement process at the site. Therefore, there may be variation among sites in the replacement process. Important features of the physical environment might include the secondary structure and whether the site is on the surface or in the interior of a protein. There have been tabulations of the observed number of amino acid replacements for each of several different categories of physical environment (e.g., Overington et al. 1990; Liithy, McLachlan, and Eisenberg 1991; Topham et al. 1993; Wako and Blundell 1994), but the potential of these tabulations to disentangle phylogenetic correlations and constraints due to physical environment has not been exploited previously. In this study, we introduce a probabilistic model that relates protein secondary structure to protein evolution. Our empirical approach is similar to the Dayhoff procedure but we model the replacement process for each of several categories of physical environment. We classify the physical environment of a site according to the secondary structure at the site. Three categories of secondary structure (a-helix, B-sheet, and loop) are be- 666

Combining Protein Evolution and Structure 667 ing employed. The term loop is being used loosely type i. If the amount of evolution separating two sehere to indicate that a site is neither in an o-helix nor quences is small, the possibility that multiple substituin a P-sheet. In future studies, we plan to consider more tions at a site have occurred since two sequences have detailed classification schemes. diverged is negligible. In this case, p$(t> can be approximated, when i # j, by the product of t and the rate at The Model The Known Structure Data Set We utilize a data set maintained by one of us (D.T.J.) that contains representative sequences of 207 protein families. Protein families are only included in the data set if the tertiary structure of at least one member of the family has been experimentally determined. We will refer to this data set as the known structure data set. Most of the protein families in the known structure data set are represented by sequences in addition to the one with known tertiary structure. The members of each protein family are aligned. Fortunately, alignment is not difficult for these sequences because the sequences included in the data set are relatively similar to one another. To convert information in the known structure data set from three-dimensional structure to secondary structure categories, we rely on the DSSP computer program (Kabsch and Sander 1983). This program can determine the secondary structure of a protein from the atomic coordinates that specify its tertiary structure. We proceed with the assumption that the category of underlying secondary structure is identical for residues of different sequences that are in the same column of the alignment. This assumption is reasonable because, as noted earlier, structural divergence occurs much less rapidly than divergence at the sequence level. With this assumption, the category of underlying secondary structure for a column in the alignment can be determined in most cases from the sequence with known tertiary structure. For an occasional column of the alignment the underlying secondary structure cannot be determined in this way because, due to an insertion or deletion event, the sequence of known tertiarv structure does not have which amino acid type i is replaced by amino acid type j for category k of secondary structure. This rate will be referred to as CY$ and the array containing ol$ for all combinations of amino acids and secondary structure categories will be referred to as CX. Also, if t is small, pi(t) is well approximated by 1 - c cx$. j#i Consider a comparison between two protein sequences and assume N,,, is the number of alignment positions where neither sequence has a gap. This study follows the example of Jones, Taylor, and Thornton (1992) in that pairwise comparisons are made only if the two sequences possess O.SSN, or more alignment positions with identical residues and at least one of the two proteins is the closest homolog in the data set to the other. The incentive for adopting the somewhat stringent sequence identity threshold is to make approximations for pi(t) and pi(t) accurate. The relationship between the threshold choice and resulting replacement model is discussed later in this paper. Assume C is the number of sequence comparisons with similarity exceeding our threshold. Let ni be twice the number of alignment positions of category k in the C sequence comparisons that contained residues of type i in both sequences. Let r$ be the number of alignment positions of category k where one of the two residues is type i and the other is type j (i # j). Notice that n$ k = nji. Our Markov model of replacement is constructed to be reversible. This means that, if + is the equilibrium probability of residue type i for category k, a residue in the column. Insertions and deletions tend to Reversibility is ensured if occur in regions of secondary structure that are loops +kcx$ = @c$ V i, j. (2) (Benner and Gerloff 1991; Thornton et al. 1991). Therefore, we have chosen to classify these occasional col- Denoting the expected proportion of alignment positions umns in an alignment as belonging to the loop category. in category k by vk and the amount of evolution separating the sequences in the mth comparison by t,,,, the The Replacement Process for Each Category of reversibility property implies that for i # j, Secondary Structure We model the replacement process for each category of secondary structure, whereas Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978; also see Kishino, Miyata, (3) and Hasegawa 1990) model the process for an average site. We assume that, given the secondary structure Also, at a site, the amino acid replacement process at the site is Markovian and independent of the replacement pro- 1 cess at other sites. This assumption is not ideal but is E c n$ = 2vk+f 5 N,,,. (4) superior to completely ignoring secondary structure. i m=l Let p;(t) be the probability, for secondary structure If we approximate E[n$ by n$ and E[Z&] by Z&, both k, of a site containing residue type j after an amount of of which can be observed from the known structure evolution t given that the site initially contained residue data set, then for i # j, (1)

668 Thome et al. Table 1 Equilibrium Amino Acid Frequencies and Mutabilities for Each Category of Secondary Structure as Estimated from the Known Structure Data Set FREQUENCY MUTABILITY Loop P-sheet a-helix Loop P-sheet a-helix G... 0.120 0.052 0.043 A... 0.070 0.061 0.133 v... 0.053 0.105 0.075 L... 0.066 0.098 0.111 I M :: 0.03 0.0161 0.083 0.022 0.050 0.021 F 0.032 0.059 0.035 P ::: 0.069 0.025 0.020 S... 0.088 0.063 0.044 T c::: 0.064 0.015 0.079 0.034 0.046 0.018 g::: 0.048 0.036 0.018 0.037 0.036 0.039 Y... 0.027 0.062 0.034 w.. 0.013 0.022 0.015 D... 0.07 1 0.032 0.045 E H::: 0.05 0.0261 0.036 0.028 0.084 0.023 K... 0.061 0.043 0.075 R... 0.043 0.042 0.054 34 42 106 100 100 100 84 105 102 56 77 55 100 127 131 99 131 131 54 78 59 45 59 87 99 115 212 87 87 150 49 39 70 113 185 147 91 80 127 65 79 55 39 41 35 70 106 139 84 107 100 79 106 100 67 91 71 77 97 83 The above equation shows we can estimate the relative rate of replacement via the product of cx$ and a constant that does not depend on i, j, or k. In fact, this constant is unimportant for our purposes. To model the replacement process, we need only estimate the relative rates of replacement from i to j for each pair of amino acids i # j. These relative rates can thus be estimated from the n$. For our model, the replacement process at a site depends on the underlying secondary structure and the cx$ parameters that are estimated from the known structure data set. The ol$ estimates determine the estimates of rj$ that are shown in table 1. The table indicates that our approach of considering secondary structure is promising because, as expected, we find that equilibrium frequencies of amino acids are not independent of secondary structure. The CX$ estimates themselves are not shown but are available upon request from the authors. Table 1 also lists inferred mutabilities of amino acids for each category of secondary structure. The mutability of amino acid i for secondary structure category k reveals the rate, relative to other amino acids in the same category, at which amino acid i is replaced. The mutability of amino acid i in category k would be c cx; j#i except that (following Dayhoff) we have chosen to standardize mutabilities within each category so that the mutability of alanine (A) is 100. Mutabilities of amino acids within each category are measured relative to the mutability of alanine. Dayhoff and coworkers measure amounts of evolution in terms of the expected number of replacement events per 100 sites since sequence divergence. We follow their example but emphasize that the units are expetted number of replacement events per 100 randomly selected sites. One category of secondary structure may experience replacements at a higher rate than another. Imagine, for example, a randomly selected site is an 01- helix with probability 0.3, a P-sheet with probability 0.5, and a loop with probability 0.2. Furthermore, imagine that the amount of evolution since sequence divergence is an expected 20 replacements per 100 o-helix sites, an expected 10 replacements per 100 P-sheet sites, and an expected 30 replacements per 100 loop sites. Then, this amount of evolution would be an expected 17 replacements per 100 average sites because (0.3)(20) + (OS)(lO) + (0.2)(30) = 17. With reasoning similar to equation (5), it can be shown that This equation reveals that we can estimate the rate of change for category k up to a constant. Applying this to the known structure data set, we find that, if the replacement rate of a loop site is scaled to be 1.0, then the replacement rate of an o-helix site is 1.027 and the rate of a P-sheet site is 0.775. It is interesting that the o-helix rate estimate is greater than that for loops, but the biological significance is unclear. It may be the case that helices do evolve at greater rates than loops. Alternatively, this may be an artifact of our defining the loop category as consisting of all sites that are neither o-helix nor p- sheet. Sites fitting this definition may actually be some mixture of those that are slowly evolving and those that are quickly evolving. We intend to investigate this issue in the future. Organization Sequence of Secondary Structure Along a The secondary structure at a site in a protein is not independent of secondary structure at surrounding sites; the types of secondary structure at adjacent sites are positively correlated. Secondary structure is a regional property of proteins. For this reason, we believe the organization of secondary structure along a sequence might be well described by a hidden Markov model. Churchill (1989) was the first to apply hidden Markov models to a biological sequence. He employed hidden Markov models to describe the organization of ATrich and GC-rich regions in DNA. Hidden Markov models have been used to describe the organization of secondary structure along a sequence (Asai, Hayamizu, and Handa 1993; White, Stultz, and Smith 1994), but these models were employed to analyze single protein sequences and phylogenetic correlations among sequences were therefore not an issue. The first use of hidden Markov models in sequence evolution is by Felsenstein and Churchill (1996) and has been implemented in versions 3.5 and later of the PHYLIP software package (see Fel-

Combining Protein Evolution and Structure 669 Table 2 Likelihood Calculations Estimated Probabilities of the Secondary Structure Category at a Position Given the Category at the Previous Position FROM CATEGORY a-helix To CATEGORY P-sheet Loop a-helix.... 0.9085 0.0005 0.09 10 (13,290) (8) (1,331) P-sheet.... 0.005 1 0.8113 0.1836 (50) (8,008) (1,812) Loop...... 0.0619 0.0862 0.8519 (1,341) (1,867) (18,450) NOTE.--The observed number of transitions in the known structure data set between each pair of categories is in parentheses below the estimated probabilities of the transition. senstein 1989). The Felsenstein-Churchill framework was developed to allow regional heterogeneity of nucleotide substitution rates but we modify it to permit regional heterogeneity of protein secondary structure. With our hidden Markov model, we can analyze data sets consisting of aligned protein sequences with unknown structures. The hidden states of our model correspond to the underlying secondary structure. The secondary structure cannot be directly observed but it may be inferred through the pattern of amino acid replacements. Each category of secondary structure is associated with a distinct pattern of replacements and the task is to match the pattern of replacements with secondary structure. Although the secondary structures of the sequences are unknown, we assume the organization of secondary structure along the sequences can be described by a stationary first-order Markov chain. This assumption implies that, given the secondary structure at site i, the secondary structure at site i+ 1 is independent of the secondary structures of sites 1,..., i- 1. This assumption is biologically implausible. Secondary structure at a site is likely to be influenced not only by neighboring sites in the sequence but also by sites that are nearby at the level of tertiary structure. The model we adopt is not perfect but it is a starting point; other models of replacement completely ignore structure. Sequences in the known structure data set can be used to estimate pc,m,ci, the probability that site i has underlying secondary structure category ci given that site i- 1 has secondary structure category Ci_ i. The matrix values of pc,-,, for each pair of secondary structure categories will be referred to as p. Imagine that Ci represents an o-helix and Ci_l represents a P-sheet. A simple way to estimate the probability that an a-helix position will follow a P-sheet position is to count the num- ber of times for sequences of known tertiary structure that an o-helix position follows a P-sheet position and divide this number by the number of times that an 01- helix or P-sheet or loop position follows a P-sheet position. Estimates obtained this way from the known structure data set are in table 2. Stationary probabilities of o-helix, P-sheet, and loop positions (i.e., the Vk values) derived from the transition probability estimates are respectively 0.325, 0.212, and 0.463. We can use our model and a likelihood approach to reconstruct phylogenies from aligned protein sequences with unknown structures. The aligned data set will be assumed to have length N and will be represented by S. The first i columns of the aligned data set will be represented by Si, and the ith column itself by Si. To reconstruct a phylogeny, we want to estimate T, the topology and branch lengths of the tree. We estimate the rates of replacement (i.e., cx) and the parameters for organization of secondary structure along the sequence (i.e., p) from the known structure data set. For the sake of simplicity, we will not introduce notation to distinguish between the true values of (Y and p and their estimates. Our goal is to find the T that maximizes Pr(SIT, o, p). The first step is to calculate Pr(SIT, cx, p) for some T of interest. This can be done with an iterative algorithm. At each step, the algorithm considers Ci, a passible secondary structure category for site i, and computes Pr(Si, CilT, a, p). For i > 1, the algorithm uses the fact that Pr(Si, Ci I T, 01, PI = c W&-l, ci-l 1 T, a, p) cl-1 *Pr(ci( Si-1, Ci-1, T, 01, p> -Pr(si I Si- 1, ci- 1, Ci, T, a, p>* (7) The assumptions of our model allow the above to be simplified: To calculate Pr(Si, ci I T, 01, PI = C P XSi-1, Ci-1 IT, a, P)Pc,_lcl cl-1 *Pr(si I Ci, T, (x). Pr(s i, cl I T, 01, p), Pr(si, cl 1 T, 01, p> = Pr(sl, cl 1 T, a, p)wq IT, a, p> To estimate T, we use numerical optimization algorithms (e.g., Swofford et al., 1996) to find f, the value of T that maximizes Pr(SIT, CY, p). We can also use our model to reconstruct evolutionary trees in cases where a protein structure has been exnerimentallv determined. In these cases. the amino (8) = Pr(si Ici, T, o)yr,,. (9) The pruning algorithm of Felsenstein (1981) can be applied to calculate Pr(silci, T, cx). The pruning algorithm requires determination of the replacement transition probabilities p:(t) for each combination of amino acids and branch lengths. These transition probabilities can be calculated from CY and t (e.g., see Goldman and Yang 1994). The iterative algorithm allows calculation of the likelihood because Pr(S 1 T, 01, p) = c Pr($,r, CNI T, a, P>. CN (10)

670 Thorne et al. Table 3 Log-likelihoods of the Sucrose Synthase Data Set for Different Unrooted Tree Topologies and Replacement Models MODEL (Ath, Pau), (SW, Vfa) UNROOTED TOPOLOGY (Ath, Stu), (Pau, Vfa) (Ath, Fva), (Pau, Stu) DAY.... -4,473.48-4,401.75-4,479.54 JTT..... -4,422.Ol -4,348.94-4,426.12 THEM... -4,417.19-4,345.g 1-4,420,76 us...... - 4,405.93-4,339.35-4,409.12 NOTE.-A list of the sources of the four sucrose synthase protein sequences along with abbreviations, common names, and Swiss-Prot accession numbers is: Arabidopsis thaliana (Ath, mouse-ear cress, Q00197), Phaseolus aureus (Pau, mung bean, Q01390). Solanum tuberosum (Stu, potato, P10691), and Vicia faba (Vfa, broad bean, P3 1926). acid replacement process for each site is determined by Arabidopsis thaliana the known secondary structure category at the site. Be- FIG. 1.-The maximum-likelihood tree topology and branch cause the secondary structure is known, the likelihood lengths for the sucrose synthase gene under the US model. calculations do not involve p. Instead, the likelihood is the product of likelihoods at individual sites, again calculated with the pruning algorithm of Felsenstein protein secondary structure is ignored is referred to as (198 1). Reconstruction of phylogenies from sequences THEM ( Typical Homogeneous Evolutionary Model ). with known structures involves less uncertainty and is Table 3 contains the maximum log-likelihood value therefore expected to be more accurate than reconstruc- for each combination of the three unrooted topologies tion of phylogenies from sequences with unknown struc- and the four replacement models. The US model contures. siders secondary structure and yields higher likelihoods than other models. The maximum-likelihood topology Example and Parametric Bootstrap and branch length estimates obtained by this model are To illustrate our approach, we consider four sucrose shown in figure 1. The cumulative CPU time required synthase protein sequences with unknown secondary by a Sparcstation 10 Model 30 to estimate branch structure. Sequences were aligned with the assistance of lengths of the three unrooted sucrose synthase topolothe Treealign software (Hein 1990). Regions of uncer- gies with the US model was 403.2 seconds. In general, tain alignment may bias a phylogeny reconstruction pro- the US model requires approximately the amount of cedure toward certain tree topologies. In columns where computation required by THEM multiplied by the numthe alignment was deemed relatively uncertain, we treat- ber of secondary structure categories. ed all residues-or all but one residue-as residues of Interestingly, the likelihoods of THEM for the suunknown amino acid type. This treatment of uncertain crose synthase-data are higher than those for the JTT or regions in the alignment should reduce the potential for DAY models (table 3). One possible explanation is that alignment uncertainty to bias phylogeny reconstruction. THEM is based on replacement counts from the known Gap positions in the alignment were also treated as un- structure data set. Protein families with known tertiary known residues. The resulting alignment contained 809 structure may not be a random sample of protein famicolumns and a total of 19 residues of unknown amino lies (e.g., see Orengo, Jones, and Thornton 1994). For acid ty Pee example, families of membrane- bound proteins are like- The four aligned sucrose synthase genes were an- ly to be u nderrepresented in the known structure data alyzed by the replacement model described here and by set. This underrepresentation may be less severe in data three amino acid replacement models that ignore sec- sets used to construct the JTT or DAY models. Thereondary structure. We will refer to the model described fore, it may be the case that THEM is superior to the here as the US ( Uses Structure ) model. The remain- JTT and DAY models when some protein families are ing three amino acid replacement models that we con- studied but is inferior when others are studied. Clearly, sidered ignore secondary structure but otherwise employ the potentially biased composition of the known strucapproaches similar to that described above to estimate ture data set could also affect the appropriateness of relative replacement rates from counts of observed re- the US model in certain cases. placements. The models referred to as DAY and JTT are Although the US model produces higher likelirespectively derived from the replacement counts of hoods than THEM for the sucrose synthase data set, Dayhoff and collaborators (Dayhoff, Schwartz, and Or- additional investigation is required before rejecting cutt 1978) and the replacement counts of Jones and col- THEM in favor of the US model. The replacement prolaborators (Jones, Taylor, and Thornton 1992). The mod- cess and secondary structure organization parameters of el obtained from the known structure data set when the US model were estimated independently of the su-

Combining Protein Evolution and Structure 671 al ence FIG. 2.-A histogram of simulated log-likelihood differences. The observed difference of 6.46 for the sucrose synthase data set is also indicated. 1 0 crose synthase data; these parameters are not free to vary. Therefore, THEM is not a special case of the US model. Fortunately, the parametric bootstrap approach to model comparison described by Goldman (1993) does not require that models be nested. To employ the parametric bootstrap, we simulated 100 data sets according to our null hypothesis of THEM. We simulated under the topology and branch lengths that were estimated with THEM from the actual sucrose synthase data set. Each simulated data set had the same alignment length and the same locations of residues of unknown type as the actual data set. For each simulated data set, we subtracted the maximum log-likelihood of THEM from the maximum log-likelihood of the US model. If we calculate the mean and sample standard deviation of the 100 simulated differences, we find that the actual difference is 3.6 sample standard deviations higher than the mean of the simulated differences. A histogram of these 100 log-likelihood differences is shown in figure 2. The fact that each of the 100 simulated log-likelihood differences is less than the observed log-likelihood difference 6.46 indicates that we can reject THEM in favor of the US model. In other words, consideration of secondary structure appears to be warranted for the sucrose synthase data. Discussion Phylogeny Reconstruction Because the US model is more biologically realistic than previously proposed models of protein evolution, it is reasonable to expect that it will yield better estimates of the phylogeny than other models (e.g., see Yang, Goldman, and Friday 1994). The statistical comparison of THEM and the US model for the sucrose synthase data adds support to this expectation but should not be viewed as a confirmation. Further study is necessary to verify that the US model is preferable to other models when the objective is phylogeny inference. Bias and Variance of Replacement Rate Estimators In this study, our empirically derived replacement model considers only sequence pairs from the known structure data set with 85% or more identical residues. If categories of secondary structure evolve at sufficiently different rates, this could potentially yield biased estimates of CY$. Imagine two pairs of sequences, one rich in loop sites and the other rich in sheet sites, each separated by the amount of evolution t. If loop sites evolve more quickly than sheet sites, then the pair of sequences rich in loop sites may be less than 85% identical and the pair rich in sheet sites may be more than 85% identical. The pair rich in sheet sites but not the pair rich in loop sites would be considered in this study. This scenario could have effects that include loop sites being underrepresented. Reliance upon pairwise comparisons in this study may increase the variance of cx$ estimators. The same replacement event could be counted in several different pairwise comparisons. Imagine a group of four protein sequences and the unrooted topology that relates them. There are six possible pairwise comparisons between the four sequences. If all six comparisons were made and each comparison exceeded the 85% identity threshold, a replacement event that took place on the interior branch of the topology relating the four sequences would be counted in four of the six pairwise comparisons. In contrast, a replacement event that took place on any other branch of the topology would be counted in three of the six pairwise comparisons. To lessen this unbalanced counting of replacement events, the only pairwise comparisons that we make are those in which at least one of the sequences is being compared with its closest homolog in the data set. The unbalanced counting of replacement events that remains would inflate the variance of our estimators of 0~; but would not bias them. In the future, we plan to investigate estimators that avoid the potential bias and inflated variance described here. We also hope to determine the variances of these estimators. Probabilistic versus Biological Assumptions It is well known among molecular evolutionists that likelihood approaches are based on explicit probabilistic models but the fact that explicit probabilistic assumptions are not always easily translated to explicit biological meanings seems less appreciated. A likeli-

672 Thome et al. hood approach has less value for illuminating the process of sequence divergence if the parameters in the evolutionary model are not biologically interpretable. The probabilistic models we present here are biologically motivated. By examining models with explicit biological assumptions, we avoid the difficult task of assigning biological meaning to the parameters of a model after the model has been statistically selected. We are encouraged by the results of the parametric bootstrap test because they support the idea that secondary structure influences evolution, but we emphasize that the value of our model stems primarily from its biological basis and not from its favorable statistical performance. Our perspective is well summarized by Hunter (1982, p. 5 1): All models are wrong; some are more useful than others. Future Directions Modifications to our model can be envisioned and some have already been discussed above. Many of these modifications may improve the statistical performance of the model and some may be biologically meaningful. For example, heterogeneity of replacement rates among sites within a secondary structure category could be permitted by adopting approaches already developed for the purposes of modelling nucleotide substitution (see Yang 1993, 1995; Yang, Goldman, and Friday 1994; Felsenstein and Churchill, 1996). Another modification would be to increase the number of categories in our hidden Markov model. For example, sites on the surface of a protein could be distinguished from interior sites and the additional category turn could be included. Another research direction would involve combining DNA substitution models with our model of amino acid replacement. Recent work (Goldman and Yang 1994; Muse and Gaut 1994) indicates that this may be feasible. The current version of our model is not tailored to any specific family of proteins but it may be possible to design models appropriate for specific families or even for particular sites within a specific family. Some protein families are particularly rich in o-helix positions and others in B-sheet positions (Levitt and Chothia 1976). Our hidden Markov model for organization of secondary structure along a sequence allows for chance variation in frequencies of secondary structure categories among protein families but the actual variation among protein families is probably larger than our model predicts. Perhaps this additional variation can be considered in future investigations. One concern that arises when such models are contemplated is whether sufficient information exists to estimate model parameters accurate- ly* It has not escaped our notice that this approach also has value in prediction of protein secondary structure. We will discuss this application in a forthcoming study. We hope our approach and its subsequent modifications will shed light on protein evolution and structure. Acknowledgments We thank Hirohisa Kishino, Joseph Felsenstein, and two anonymous reviewers for their constructive comments. We also thank Francis Giesbrecht, Elizabeth Thompson, and Simon Tavare for their assistance with the quotation by W. G. Hunter. The research was supported by NIH grant ROl GM 49654, principal investigator J. L. Thorne. Programs written in Fortran that implement the techniques described here are available from N. Goldman (N.Goldman@gen.cam.ac.uk) and programs written in C are available from J. Thorne (thorne@stat.ncsu.edu). LITERATURE CITED ASAI, K., S. HAYAMIZU, and K. HANDA. 1993. Prediction of protein secondary structure by the hidden Markov model. CABIOS 9:141-146. BENNER, S. A., and D. GERLOFF. 199 1. Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the catalytic domain of protein kinases. Adv. Enzyme Regul. 31: 12 l-l 8 1. CHOTHIA, C., and A. M. LESK. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J 5:823-826. CHURCHILL, G. A. 1989. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51:79-94. DAYHOFF, M. O., R. V. ECK, and C. M. ECK. 1972. A model of evolutionary change in proteins. Pp. 89-99 in M. 0. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5. National Biomedical Research Foundation, Washington, D.C. DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. 1978. A model of evolutionary change in proteins. Pp. 345-352 in M. 0. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, D.C. FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376. FELSENSTEIN, J. 1989. Phylip-phylogeny inference package (version 3.2). Cladistics 5: 164-166. FELSENSTEIN, J., and G. A. CHURCHILL. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104. FLORES, T. P, C. A. ORENGO, D. S. Moss, and J. M. THORN- TON. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2: 181 l-l 826. GOLDMAN, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198. GOLDMAN, N., and Z. YANG. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736. GONNET, G. H., M. A. COHEN, and S. A. BENNER. 1992. Exhaustive matching of the entire protein sequence database. Science 256: 1443-1445. GRANTHAM, R. 1974. Amino acid difference formula to help explain protein evolution. Science 185:862-864. HEIN, J. 1990. A unified approach to alignment and phylogenies. Pp. 626-645 in R. E DOOLITTLE, ed. Methods in enzymology. Vol. 183. Academic Press, San Diego, Calif. HUNTER, W. G. 1982. Graduate programs in statistics. Pp. 35-69 in J. S. RUSTAGI and D. A. WOLFE, eds. Teaching of statistics and statistical consulting. Academic Press, New York, N.Y. JONES, D. T., W. R. TAYLOR, and J. M. THORNTON. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275-282.

Combining Protein Evolution and Structure 673 KABSCH, W., and C. SANDER. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637. KISHINO, H., T. MIYATA, and M. HASEGAWA. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151-160. LEVITT, M., and C. CHOTHIA. 1976. Structural patterns in globular proteins. Nature 261:552-558. LOTHY, R., A. D. MCLACHLAN, and D. EISENBERG. 1991. Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:229-239. MUSE, S. V., and B. S. GAUT. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome. Mol. Biol. Evol. 11:715-724. ORENGO, C. A., D. T. JONES, and J. M. THORNTON. 1994. Protein superfamilies and domain superfolds. Nature 372:63 l- 634. OVERINGTON, J., M. S. JOHNSON, A. SALI, and T. L. BLUNDELL. 1990. Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Sot. Lond. B 241:132-145. SWOFFORD, D. L., G. J. OLSEN, I? J. WADDELL, and D. M. HILLIS. 1996. Phylogenetic inference. Pp. 407-5 14 In D. M. HILLIS, C. MORITZ, and B. K. MABLE, eds. Molecular systematics, 2nd edition. Sinauer Associates, Sunderland, Mass. TAYLOR, W. R., and D. T. JONES. 1993. Deriving an amino acid distance matrix. J. Theor. Biol. 164:65-83. THORNTON, J. M., T. I? FLORES, D. T. JONES, and M. B. SWIN- DELLS. 199 1. Prediction of progress at last. Nature 354: 105-106. TOPHAM, C. M., A. MCLEOD, E EISENMENGER, J. I? OVERING- TON, M. S. JOHNSON, and T. L. BLUNDELL. 1993. Fragment ranking in modelling of protein structure: conformationally constrained substitution tables. J. Mol. Biol. 229: 194-220. WAKO, H., and T. L. BLUNDELL. 1994. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:682-692. WHITE, J. V., C. M. STULTZ, and T. E SMITH. 1994. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:35-75. YANG, Z. 1993. Maximum likelihood phylogenetic estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10: 1396-1401. YANG, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005. YANG, Z., N. GOLDMAN, and A. FRIDAY. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324. PAUL M. SHARP, reviewing Accepted January 30, 1996 editor