Combining Protein Evolution and Secondary Structure


Jeffrey L. Thorne,* Nick Goldman,†¹ and David T. Jones‡²

*Program in Statistical Genetics, Statistics Department, North Carolina State University; †Division of Mathematical Biology, National Institute for Medical Research, London; and ‡Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, London

An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. It allows likelihood analysis of aligned protein sequences and does not require the underlying secondary (or tertiary) structures of these sequences to be known. One component of the model describes the organization of secondary structure along a protein sequence and another specifies the evolutionary process for each category of secondary structure. A database of proteins with known secondary structures is used to estimate model parameters representing these two components. Phylogeny, the third component of the model, can be estimated from the data set of interest. As an example, we employ our model to analyze a set of sucrose synthase sequences. For the evolution of sucrose synthase, a parametric bootstrap approach indicates that our model is statistically preferable to one that ignores secondary structure.

Introduction

It is widely recognized that evolutionary divergence of protein structures occurs much less rapidly than divergence of protein sequences (e.g., Chothia and Lesk 1986; Flores et al. 1993). This indicates that selective constraints may act to preserve protein structure. The nature of these constraints is poorly understood and they have received relatively little direct attention in the areas of population genetics and phylogeny reconstruction. We believe this lack of attention should be rectified. The process of molecular evolution will not be well understood until the constraints that affect it have been characterized.
¹ Present address: Department of Genetics, University of Cambridge.
² Present address: Department of Biological Sciences, University of Warwick.
Key words: hidden Markov model, maximum likelihood, molecular evolution, phylogeny, protein structure.
Address for correspondence and reprints: Jeffrey L. Thorne, Program in Statistical Genetics, Statistics Department, Box 8203, North Carolina State University, Raleigh, North Carolina. E-mail: thorne@stat.ncsu.edu.
Mol. Biol. Evol. 13(5). By the Society for Molecular Biology and Evolution.

There has been less work on modelling amino acid replacement than on modelling nucleotide substitution. This disparity may be attributable to the extra complexity of modelling the replacement process. Replacements tend to occur between chemically similar amino acid types and a replacement model should reflect this. There are many ways to categorize amino acids by chemical properties (e.g., hydrophobicity, charge, relative size of side chain), and physicochemical distances between amino acid types have been suggested (Grantham 1974; Taylor and Jones 1993), but these categorizations or physicochemical distances may not directly reflect the differences among amino acid types that are acted upon by evolution. Cognizant of the difficulty of creating a realistic model for protein evolution based solely on physicochemical principles, Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978) developed an empirical approach. To construct their empirical amino acid transition matrix, sets of easily aligned (i.e., closely related) sequences were collected. Only closely related sequences were considered because, when evolutionary distance is sufficiently small, the possibility of multiple replacements can be ignored. The observed replacement patterns were used to construct a probabilistic replacement model. When the Dayhoff model was originally proposed, relatively few protein sequences were known.
More recent studies (Gonnet, Cohen, and Benner 1992; Jones, Taylor, and Thornton 1992) followed the spirit of the Dayhoff approach but were able to tabulate a larger number of observed replacements. Because the Dayhoff model is empirical, it reflects the fact that different amino acid types are replaced at different rates and the fact that amino acids are usually replaced by chemically similar amino acids. A problem with the Dayhoff approach is that it effectively models the replacement process at the average site in the average protein. There may be no such thing as an average site in an average protein. The physical environment of a protein site may greatly influence the replacement process at the site. Therefore, there may be variation among sites in the replacement process. Important features of the physical environment might include the secondary structure and whether the site is on the surface or in the interior of a protein. There have been tabulations of the observed number of amino acid replacements for each of several different categories of physical environment (e.g., Overington et al. 1990; Lüthy, McLachlan, and Eisenberg 1991; Topham et al. 1993; Wako and Blundell 1994), but the potential of these tabulations to disentangle phylogenetic correlations and constraints due to physical environment has not been exploited previously.

In this study, we introduce a probabilistic model that relates protein secondary structure to protein evolution. Our empirical approach is similar to the Dayhoff procedure but we model the replacement process for each of several categories of physical environment. We classify the physical environment of a site according to the secondary structure at the site. Three categories of secondary structure (α-helix, β-sheet, and loop) are being employed. The term "loop" is being used loosely here to indicate that a site is neither in an α-helix nor in a β-sheet. In future studies, we plan to consider more detailed classification schemes.

The Model

The Known Structure Data Set

We utilize a data set maintained by one of us (D.T.J.) that contains representative sequences of 207 protein families. Protein families are only included in the data set if the tertiary structure of at least one member of the family has been experimentally determined. We will refer to this data set as the known structure data set. Most of the protein families in the known structure data set are represented by sequences in addition to the one with known tertiary structure. The members of each protein family are aligned. Fortunately, alignment is not difficult for these sequences because the sequences included in the data set are relatively similar to one another.

To convert information in the known structure data set from three-dimensional structure to secondary structure categories, we rely on the DSSP computer program (Kabsch and Sander 1983). This program can determine the secondary structure of a protein from the atomic coordinates that specify its tertiary structure. We proceed with the assumption that the category of underlying secondary structure is identical for residues of different sequences that are in the same column of the alignment. This assumption is reasonable because, as noted earlier, structural divergence occurs much less rapidly than divergence at the sequence level. With this assumption, the category of underlying secondary structure for a column in the alignment can be determined in most cases from the sequence with known tertiary structure. For an occasional column of the alignment the underlying secondary structure cannot be determined in this way because, due to an insertion or deletion event, the sequence of known tertiary structure does not have a residue in the column. Insertions and deletions tend to occur in regions of secondary structure that are loops (Benner and Gerloff 1991; Thornton et al. 1991). Therefore, we have chosen to classify these occasional columns in an alignment as belonging to the loop category.

The Replacement Process for Each Category of Secondary Structure

We model the replacement process for each category of secondary structure, whereas Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978; also see Kishino, Miyata, and Hasegawa 1990) model the process for an average site. We assume that, given the secondary structure at a site, the amino acid replacement process at the site is Markovian and independent of the replacement process at other sites. This assumption is not ideal but is superior to completely ignoring secondary structure.

Let $p_{ij}^k(t)$ be the probability, for secondary structure $k$, of a site containing residue type $j$ after an amount of evolution $t$ given that the site initially contained residue type $i$. If the amount of evolution separating two sequences is small, the possibility that multiple substitutions at a site have occurred since two sequences have diverged is negligible. In this case, $p_{ij}^k(t)$ can be approximated, when $i \neq j$, by the product of $t$ and the rate at which amino acid type $i$ is replaced by amino acid type $j$ for category $k$ of secondary structure. This rate will be referred to as $\alpha_{ij}^k$ and the array containing $\alpha_{ij}^k$ for all combinations of amino acids and secondary structure categories will be referred to as $\alpha$. Also, if $t$ is small, $p_{ii}^k(t)$ is well approximated by $1 - t \sum_{j \neq i} \alpha_{ij}^k$.

Consider a comparison between two protein sequences and assume $N_m$ is the number of alignment positions where neither sequence has a gap. This study follows the example of Jones, Taylor, and Thornton (1992) in that pairwise comparisons are made only if the two sequences possess $0.85 N_m$ or more alignment positions with identical residues and at least one of the two proteins is the closest homolog in the data set to the other. The incentive for adopting the somewhat stringent sequence identity threshold is to make approximations for $p_{ij}^k(t)$ and $p_{ii}^k(t)$ accurate. The relationship between the threshold choice and resulting replacement model is discussed later in this paper.

Assume $C$ is the number of sequence comparisons with similarity exceeding our threshold. Let $n_{ii}^k$ be twice the number of alignment positions of category $k$ in the $C$ sequence comparisons that contained residues of type $i$ in both sequences. Let $n_{ij}^k$ be the number of alignment positions of category $k$ where one of the two residues is type $i$ and the other is type $j$ ($i \neq j$). Notice that $n_{ij}^k = n_{ji}^k$.

Our Markov model of replacement is constructed to be reversible. This means that, if $\pi_i^k$ is the equilibrium probability of residue type $i$ for category $k$,

$$\pi_i^k \, p_{ij}^k(t) = \pi_j^k \, p_{ji}^k(t) \quad \forall\, i, j, t. \tag{1}$$

Reversibility is ensured if

$$\pi_i^k \, \alpha_{ij}^k = \pi_j^k \, \alpha_{ji}^k \quad \forall\, i, j. \tag{2}$$

Denoting the expected proportion of alignment positions in category $k$ by $\nu_k$ and the amount of evolution separating the sequences in the $m$th comparison by $t_m$, the reversibility property implies that for $i \neq j$,

$$E\left[n_{ij}^k\right] = 2\, \nu_k\, \pi_i^k\, \alpha_{ij}^k \sum_{m=1}^{C} N_m t_m. \tag{3}$$

Also, summing over all residue types $l$ (including $l = i$),

$$E\left[\sum_{l} n_{il}^k\right] = 2\, \nu_k\, \pi_i^k \sum_{m=1}^{C} N_m. \tag{4}$$

If we approximate $E[n_{ij}^k]$ by $n_{ij}^k$ and $E[\sum_l n_{il}^k]$ by $\sum_l n_{il}^k$, both of which can be observed from the known structure data set, then for $i \neq j$,

$$\frac{n_{ij}^k}{\sum_{l} n_{il}^k} \approx \alpha_{ij}^k \, \frac{\sum_{m=1}^{C} N_m t_m}{\sum_{m=1}^{C} N_m}. \tag{5}$$

Table 1
Equilibrium Amino Acid Frequencies and Mutabilities for Each Category of Secondary Structure as Estimated from the Known Structure Data Set

Columns: frequency (loop, β-sheet, α-helix) and mutability (loop, β-sheet, α-helix). Rows: the amino acids G, A, V, L, I, M, F, P, S, T, C, N, Q, Y, W, D, E, H, K, R.

Equation (5) shows that we can estimate the relative rate of replacement via the product of $\alpha_{ij}^k$ and a constant that does not depend on $i$, $j$, or $k$. In fact, this constant is unimportant for our purposes. To model the replacement process, we need only estimate the relative rates of replacement from $i$ to $j$ for each pair of amino acids $i \neq j$. These relative rates can thus be estimated from the $n_{ij}^k$.

For our model, the replacement process at a site depends on the underlying secondary structure and the $\alpha_{ij}^k$ parameters that are estimated from the known structure data set. The $\alpha_{ij}^k$ estimates determine the estimates of $\pi_i^k$ that are shown in table 1. The table indicates that our approach of considering secondary structure is promising because, as expected, we find that equilibrium frequencies of amino acids are not independent of secondary structure. The $\alpha_{ij}^k$ estimates themselves are not shown but are available upon request from the authors.

Table 1 also lists inferred mutabilities of amino acids for each category of secondary structure. The mutability of amino acid $i$ for secondary structure category $k$ reveals the rate, relative to other amino acids in the same category, at which amino acid $i$ is replaced. The mutability of amino acid $i$ in category $k$ would be $\sum_{j \neq i} \alpha_{ij}^k$ except that (following Dayhoff) we have chosen to standardize mutabilities within each category so that the mutability of alanine (A) is 100. Mutabilities of amino acids within each category are measured relative to the mutability of alanine.

Dayhoff and coworkers measure amounts of evolution in terms of the expected number of replacement events per 100 sites since sequence divergence.
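The estimation steps described above, from pairwise counts to relative rates, equilibrium frequencies, and mutabilities, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_category`, the dictionary layout of the counts, and the two-residue toy data are hypothetical and are not taken from the authors' Fortran or C programs. The frequency estimate uses the fact, from equation (4), that the row total of the counts for residue type i is proportional to its equilibrium probability.

```python
def summarize_category(counts, reference="A"):
    """Summarize pairwise counts for ONE secondary structure category.

    counts[(i, j)] holds n_ij: for i != j, the number of aligned positions
    with residues i and j (stored symmetrically); for i == j, twice the
    number of positions with residue i in both sequences.
    Returns (relative_rates, frequencies, mutabilities).
    """
    # Row totals: sum over l of n_il for each residue type i.
    row_totals = {}
    for (i, _), n in counts.items():
        row_totals[i] = row_totals.get(i, 0) + n
    grand_total = sum(row_totals.values())

    # Equation (5): n_ij / sum_l n_il equals alpha_ij up to a shared constant.
    relative_rates = {
        (i, j): n / row_totals[i] for (i, j), n in counts.items() if i != j
    }

    # Equation (4): row totals are proportional to equilibrium frequencies.
    frequencies = {i: total / grand_total for i, total in row_totals.items()}

    # Mutability: total rate of leaving residue i, rescaled so that the
    # reference residue (alanine, following Dayhoff) has mutability 100.
    raw = {}
    for (i, _), rate in relative_rates.items():
        raw[i] = raw.get(i, 0.0) + rate
    mutabilities = {i: 100.0 * m / raw[reference] for i, m in raw.items()}

    return relative_rates, frequencies, mutabilities


# Hypothetical toy counts for two residue types (not data from the paper).
toy_counts = {("A", "A"): 180, ("A", "G"): 10, ("G", "A"): 10, ("G", "G"): 20}
rates, freqs, muts = summarize_category(toy_counts)
```

Because the counts are symmetric, the estimates produced this way satisfy the reversibility condition of equation (2) automatically: the product of a frequency and the corresponding relative rate is the same in both directions.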
We follow the example of Dayhoff and coworkers but emphasize that the units are expected number of replacement events per 100 randomly selected sites. One category of secondary structure may experience replacements at a higher rate than another. Imagine, for example, a randomly selected site is an α-helix with probability 0.3, a β-sheet with probability 0.5, and a loop with probability 0.2. Furthermore, imagine that the amount of evolution since sequence divergence is an expected 20 replacements per 100 α-helix sites, an expected 10 replacements per 100 β-sheet sites, and an expected 30 replacements per 100 loop sites. Then, this amount of evolution would be an expected 17 replacements per 100 average sites because (0.3)(20) + (0.5)(10) + (0.2)(30) = 17.

With reasoning similar to equation (5), it can be shown that

$$\frac{\sum_{i} \sum_{j \neq i} n_{ij}^k}{\sum_{i} \sum_{l} n_{il}^k} \approx \left( \sum_{i} \pi_i^k \sum_{j \neq i} \alpha_{ij}^k \right) \frac{\sum_{m=1}^{C} N_m t_m}{\sum_{m=1}^{C} N_m}. \tag{6}$$

This equation reveals that we can estimate the rate of change for category $k$ up to a constant. Applying this to the known structure data set, we find that, if the replacement rate of a loop site is scaled to be 1.0, then the replacement rate of an α-helix site is and the rate of a β-sheet site is. It is interesting that the α-helix rate estimate is greater than that for loops, but the biological significance is unclear. It may be the case that helices do evolve at greater rates than loops. Alternatively, this may be an artifact of our defining the loop category as consisting of all sites that are neither α-helix nor β-sheet. Sites fitting this definition may actually be some mixture of those that are slowly evolving and those that are quickly evolving. We intend to investigate this issue in the future.

Organization of Secondary Structure Along a Sequence

The secondary structure at a site in a protein is not independent of secondary structure at surrounding sites; the types of secondary structure at adjacent sites are positively correlated. Secondary structure is a regional property of proteins.
For this reason, we believe the organization of secondary structure along a sequence might be well described by a hidden Markov model. Churchill (1989) was the first to apply hidden Markov models to a biological sequence. He employed hidden Markov models to describe the organization of AT-rich and GC-rich regions in DNA. Hidden Markov models have been used to describe the organization of secondary structure along a sequence (Asai, Hayamizu, and Handa 1993; White, Stultz, and Smith 1994), but these models were employed to analyze single protein sequences and phylogenetic correlations among sequences were therefore not an issue. The first use of hidden Markov models in sequence evolution is by Felsenstein and Churchill (1996) and has been implemented in versions 3.5 and later of the PHYLIP software package (see Felsenstein 1989). The Felsenstein-Churchill framework was developed to allow regional heterogeneity of nucleotide substitution rates but we modify it to permit regional heterogeneity of protein secondary structure.

With our hidden Markov model, we can analyze data sets consisting of aligned protein sequences with unknown structures. The hidden states of our model correspond to the underlying secondary structure. The secondary structure cannot be directly observed but it may be inferred through the pattern of amino acid replacements. Each category of secondary structure is associated with a distinct pattern of replacements and the task is to match the pattern of replacements with secondary structure.

Although the secondary structures of the sequences are unknown, we assume the organization of secondary structure along the sequences can be described by a stationary first-order Markov chain. This assumption implies that, given the secondary structure at site $i$, the secondary structure at site $i+1$ is independent of the secondary structures of sites $1, \ldots, i-1$. This assumption is biologically implausible. Secondary structure at a site is likely to be influenced not only by neighboring sites in the sequence but also by sites that are nearby at the level of tertiary structure. The model we adopt is not perfect but it is a starting point; other models of replacement completely ignore structure.

Table 2
Estimated Probabilities of the Secondary Structure Category at a Position Given the Category at the Previous Position

                   To category
From category      α-helix      β-sheet      Loop
α-helix            (13,290)     (8)          (1,331)
β-sheet            (50)         (8,008)      (1,812)
Loop               (1,341)      (1,867)      (18,450)

NOTE. The observed number of transitions in the known structure data set between each pair of categories is in parentheses below the estimated probabilities of the transition.
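The counts in Table 2 can be turned into transition probability estimates by dividing each count by its row total, and the stationary distribution of the resulting chain can then be found numerically. The sketch below is a minimal illustration, assuming that rows of Table 2 are the "from" category and columns the "to" category; the names `COUNTS`, `transition_matrix`, and `stationary_distribution` are hypothetical, and plain repeated multiplication is used here only because it is simple, not because the authors are known to have used it.

```python
# Observed transition counts from Table 2: COUNTS[previous][current].
COUNTS = {
    "helix": {"helix": 13290, "sheet": 8, "loop": 1331},
    "sheet": {"helix": 50, "sheet": 8008, "loop": 1812},
    "loop": {"helix": 1341, "sheet": 1867, "loop": 18450},
}


def transition_matrix(counts):
    """Divide each count by its row total to estimate transition probabilities."""
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        matrix[prev] = {cur: n / total for cur, n in row.items()}
    return matrix


def stationary_distribution(matrix, tol=1e-12, max_iter=10000):
    """Iterate pi <- pi P from a uniform start until pi stops changing."""
    states = list(matrix)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(max_iter):
        new = {
            s: sum(pi[prev] * matrix[prev][s] for prev in states) for s in states
        }
        if max(abs(new[s] - pi[s]) for s in states) < tol:
            return new
        pi = new
    return pi


rho = transition_matrix(COUNTS)
pi = stationary_distribution(rho)
```

With these counts the stationary probabilities come out at roughly 0.325 for α-helix, 0.212 for β-sheet, and 0.463 for loop, in agreement with the values quoted in the text.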
Sequences in the known structure data set can be used to estimate $\rho_{c_{i-1} c_i}$, the probability that site $i$ has underlying secondary structure category $c_i$ given that site $i-1$ has secondary structure category $c_{i-1}$. The matrix values of $\rho_{c_{i-1} c_i}$ for each pair of secondary structure categories will be referred to as $\rho$. Imagine that $c_i$ represents an α-helix and $c_{i-1}$ represents a β-sheet. A simple way to estimate the probability that an α-helix position will follow a β-sheet position is to count the number of times for sequences of known tertiary structure that an α-helix position follows a β-sheet position and divide this number by the number of times that an α-helix or β-sheet or loop position follows a β-sheet position. Estimates obtained this way from the known structure data set are in table 2. Stationary probabilities of α-helix, β-sheet, and loop positions (i.e., the $\nu_k$ values) derived from the transition probability estimates are respectively 0.325, 0.212, and 0.463.

Likelihood Calculations

We can use our model and a likelihood approach to reconstruct phylogenies from aligned protein sequences with unknown structures. The aligned data set will be assumed to have length $N$ and will be represented by $S$. The first $i$ columns of the aligned data set will be represented by $S_i$, and the $i$th column itself by $s_i$. To reconstruct a phylogeny, we want to estimate $T$, the topology and branch lengths of the tree. We estimate the rates of replacement (i.e., $\alpha$) and the parameters for organization of secondary structure along the sequence (i.e., $\rho$) from the known structure data set. For the sake of simplicity, we will not introduce notation to distinguish between the true values of $\alpha$ and $\rho$ and their estimates. Our goal is to find the $T$ that maximizes $\Pr(S \mid T, \alpha, \rho)$.

The first step is to calculate $\Pr(S \mid T, \alpha, \rho)$ for some $T$ of interest. This can be done with an iterative algorithm. At each step, the algorithm considers $c_i$, a possible secondary structure category for site $i$, and computes $\Pr(S_i, c_i \mid T, \alpha, \rho)$.
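The iterative calculation has the shape of the forward recursion for hidden Markov models, which the text goes on to write as equations (8) to (10). A minimal sketch follows, assuming that the per-site, per-category likelihoods $\Pr(s_i \mid c_i, T, \alpha)$ have already been obtained (for example with the pruning algorithm, which is not implemented here). The function name `hmm_likelihood`, the dictionary layout, and the numbers in the usage lines are hypothetical; a practical implementation would also rescale or work with logarithms to avoid numerical underflow on long alignments.

```python
def hmm_likelihood(site_likelihoods, rho, nu):
    """Sum over all secondary structure assignments with the forward recursion.

    site_likelihoods[i][c] : Pr(s_i | c, T, alpha) for column i and category c
    rho[prev][cur]         : probability of category cur following prev
    nu[c]                  : stationary probability of category c
    Returns Pr(S | T, alpha, rho).
    """
    categories = list(nu)

    # First column: Pr(S_1, c_1) = Pr(s_1 | c_1) * nu[c_1]   (equation 9).
    forward = {c: site_likelihoods[0][c] * nu[c] for c in categories}

    # Later columns: combine the previous column's values with the
    # transition probabilities and the current site likelihood (equation 8).
    for site in site_likelihoods[1:]:
        forward = {
            c: site[c] * sum(forward[prev] * rho[prev][c] for prev in categories)
            for c in categories
        }

    # Sum over the category of the last column (equation 10).
    return sum(forward.values())


# Hypothetical two-category, two-column example (not data from the paper).
example_nu = {"a": 0.6, "b": 0.4}
example_rho = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
example_sites = [{"a": 0.3, "b": 0.1}, {"a": 0.5, "b": 0.2}]
example_likelihood = hmm_likelihood(example_sites, example_rho, example_nu)
```

The cost grows linearly with the number of columns and quadratically with the number of categories, which is why a three-category model remains practical.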
For $i > 1$, the algorithm uses the fact that

$$\Pr(S_i, c_i \mid T, \alpha, \rho) = \sum_{c_{i-1}} \Pr(S_{i-1}, c_{i-1} \mid T, \alpha, \rho)\, \Pr(c_i \mid S_{i-1}, c_{i-1}, T, \alpha, \rho)\, \Pr(s_i \mid S_{i-1}, c_{i-1}, c_i, T, \alpha, \rho). \tag{7}$$

The assumptions of our model allow the above to be simplified:

$$\Pr(S_i, c_i \mid T, \alpha, \rho) = \sum_{c_{i-1}} \Pr(S_{i-1}, c_{i-1} \mid T, \alpha, \rho)\, \rho_{c_{i-1} c_i}\, \Pr(s_i \mid c_i, T, \alpha). \tag{8}$$

To calculate $\Pr(S_1, c_1 \mid T, \alpha, \rho)$, we use

$$\Pr(S_1, c_1 \mid T, \alpha, \rho) = \Pr(s_1 \mid c_1, T, \alpha, \rho)\, \Pr(c_1 \mid T, \alpha, \rho) = \Pr(s_1 \mid c_1, T, \alpha)\, \nu_{c_1}. \tag{9}$$

The pruning algorithm of Felsenstein (1981) can be applied to calculate $\Pr(s_i \mid c_i, T, \alpha)$. The pruning algorithm requires determination of the replacement transition probabilities $p_{ij}^k(t)$ for each combination of amino acids and branch lengths. These transition probabilities can be calculated from $\alpha$ and $t$ (e.g., see Goldman and Yang 1994). The iterative algorithm allows calculation of the likelihood because

$$\Pr(S \mid T, \alpha, \rho) = \sum_{c_N} \Pr(S_N, c_N \mid T, \alpha, \rho). \tag{10}$$

To estimate $T$, we use numerical optimization algorithms (e.g., Swofford et al. 1996) to find $\hat{T}$, the value of $T$ that maximizes $\Pr(S \mid T, \alpha, \rho)$.

We can also use our model to reconstruct evolutionary trees in cases where a protein structure has been experimentally determined. In these cases, the amino acid replacement process for each site is determined by the known secondary structure category at the site. Because the secondary structure is known, the likelihood calculations do not involve $\rho$. Instead, the likelihood is the product of likelihoods at individual sites, again calculated with the pruning algorithm of Felsenstein (1981). Reconstruction of phylogenies from sequences with known structures involves less uncertainty and is therefore expected to be more accurate than reconstruction of phylogenies from sequences with unknown structures.

Example and Parametric Bootstrap

To illustrate our approach, we consider four sucrose synthase protein sequences with unknown secondary structure. Sequences were aligned with the assistance of the Treealign software (Hein 1990). Regions of uncertain alignment may bias a phylogeny reconstruction procedure toward certain tree topologies. In columns where the alignment was deemed relatively uncertain, we treated all residues (or all but one residue) as residues of unknown amino acid type. This treatment of uncertain regions in the alignment should reduce the potential for alignment uncertainty to bias phylogeny reconstruction. Gap positions in the alignment were also treated as unknown residues. The resulting alignment contained 809 columns and a total of 19 residues of unknown amino acid type.

The four aligned sucrose synthase genes were analyzed by the replacement model described here and by three amino acid replacement models that ignore secondary structure. We will refer to the model described here as the US ("Uses Structure") model. The remaining three amino acid replacement models that we considered ignore secondary structure but otherwise employ approaches similar to that described above to estimate relative replacement rates from counts of observed replacements. The models referred to as DAY and JTT are respectively derived from the replacement counts of Dayhoff and collaborators (Dayhoff, Schwartz, and Orcutt 1978) and the replacement counts of Jones and collaborators (Jones, Taylor, and Thornton 1992). The model obtained from the known structure data set when protein secondary structure is ignored is referred to as THEM ("Typical Homogeneous Evolutionary Model").

Table 3
Log-likelihoods of the Sucrose Synthase Data Set for Different Unrooted Tree Topologies and Replacement Models

Rows: the DAY, JTT, THEM, and US models. Columns: the unrooted topologies (Ath, Pau), (Stu, Vfa); (Ath, Stu), (Pau, Vfa); and (Ath, Vfa), (Pau, Stu).

NOTE. A list of the sources of the four sucrose synthase protein sequences along with abbreviations, common names, and Swiss-Prot accession numbers is: Arabidopsis thaliana (Ath, mouse-ear cress, Q00197), Phaseolus aureus (Pau, mung bean, Q01390), Solanum tuberosum (Stu, potato, P10691), and Vicia faba (Vfa, broad bean, P31926).

Table 3 contains the maximum log-likelihood value for each combination of the three unrooted topologies and the four replacement models. The US model considers secondary structure and yields higher likelihoods than other models. The maximum-likelihood topology and branch length estimates obtained by this model are shown in figure 1. The cumulative CPU time required by a Sparcstation 10 Model 30 to estimate branch lengths of the three unrooted sucrose synthase topologies with the US model was seconds. In general, the US model requires approximately the amount of computation required by THEM multiplied by the number of secondary structure categories.

FIG. 1. The maximum-likelihood tree topology and branch lengths for the sucrose synthase gene under the US model.

Interestingly, the likelihoods of THEM for the sucrose synthase data are higher than those for the JTT or DAY models (table 3). One possible explanation is that THEM is based on replacement counts from the known structure data set. Protein families with known tertiary structure may not be a random sample of protein families (e.g., see Orengo, Jones, and Thornton 1994). For example, families of membrane-bound proteins are likely to be underrepresented in the known structure data set. This underrepresentation may be less severe in data sets used to construct the JTT or DAY models. Therefore, it may be the case that THEM is superior to the JTT and DAY models when some protein families are studied but is inferior when others are studied. Clearly, the potentially biased composition of the known structure data set could also affect the appropriateness of the US model in certain cases.

Although the US model produces higher likelihoods than THEM for the sucrose synthase data set, additional investigation is required before rejecting THEM in favor of the US model. The replacement process and secondary structure organization parameters of the US model were estimated independently of the sucrose synthase data; these parameters are not free to vary. Therefore, THEM is not a special case of the US model. Fortunately, the parametric bootstrap approach to model comparison described by Goldman (1993) does not require that models be nested.

To employ the parametric bootstrap, we simulated 100 data sets according to our null hypothesis of THEM. We simulated under the topology and branch lengths that were estimated with THEM from the actual sucrose synthase data set. Each simulated data set had the same alignment length and the same locations of residues of unknown type as the actual data set. For each simulated data set, we subtracted the maximum log-likelihood of THEM from the maximum log-likelihood of the US model. If we calculate the mean and sample standard deviation of the 100 simulated differences, we find that the actual difference is 3.6 sample standard deviations higher than the mean of the simulated differences. A histogram of these 100 log-likelihood differences is shown in figure 2. The fact that each of the 100 simulated log-likelihood differences is less than the observed log-likelihood difference 6.46 indicates that we can reject THEM in favor of the US model. In other words, consideration of secondary structure appears to be warranted for the sucrose synthase data.

FIG. 2. A histogram of simulated log-likelihood differences. The observed difference of 6.46 for the sucrose synthase data set is also indicated.

Discussion

Phylogeny Reconstruction

Because the US model is more biologically realistic than previously proposed models of protein evolution, it is reasonable to expect that it will yield better estimates of the phylogeny than other models (e.g., see Yang, Goldman, and Friday 1994). The statistical comparison of THEM and the US model for the sucrose synthase data adds support to this expectation but should not be viewed as a confirmation.
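The statistical comparison referred to here rests on the parametric bootstrap, whose final step is a simple summary of the simulated log-likelihood differences. The sketch below is an illustration only, assuming the simulated differences are already available as a list. The function name `bootstrap_summary` and the add-one p-value convention are assumptions of this sketch rather than part of the paper, and the numbers in the usage lines are made up for demonstration.

```python
from statistics import mean, stdev


def bootstrap_summary(observed, simulated):
    """Compare an observed log-likelihood difference with simulated ones.

    Returns (z, p) where z is the number of sample standard deviations by
    which the observed difference exceeds the simulated mean, and p is an
    empirical one-sided p-value using the add-one convention.
    """
    z = (observed - mean(simulated)) / stdev(simulated)
    n_at_least = sum(1 for d in simulated if d >= observed)
    p = (n_at_least + 1) / (len(simulated) + 1)
    return z, p


# Made-up simulated differences; the paper's 100 values are not reproduced.
z_score, p_value = bootstrap_summary(6.0, [0.0, 1.0, 2.0, 3.0, 4.0])
```

With 100 simulated differences that all fall below the observed one, as reported for the sucrose synthase data, this convention would give an empirical p-value of 1/101.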
Further study is necessary to verify that the US model is preferable to other models when the objective is phylogeny inference.

Bias and Variance of Replacement Rate Estimators

In this study, our empirically derived replacement model considers only sequence pairs from the known structure data set with 85% or more identical residues. If categories of secondary structure evolve at sufficiently different rates, this could potentially yield biased estimates of $\alpha_{ij}^k$. Imagine two pairs of sequences, one rich in loop sites and the other rich in sheet sites, each separated by the amount of evolution $t$. If loop sites evolve more quickly than sheet sites, then the pair of sequences rich in loop sites may be less than 85% identical and the pair rich in sheet sites may be more than 85% identical. The pair rich in sheet sites but not the pair rich in loop sites would be considered in this study. This scenario could have effects that include loop sites being underrepresented.

Reliance upon pairwise comparisons in this study may increase the variance of $\alpha_{ij}^k$ estimators. The same replacement event could be counted in several different pairwise comparisons. Imagine a group of four protein sequences and the unrooted topology that relates them. There are six possible pairwise comparisons between the four sequences. If all six comparisons were made and each comparison exceeded the 85% identity threshold, a replacement event that took place on the interior branch of the topology relating the four sequences would be counted in four of the six pairwise comparisons. In contrast, a replacement event that took place on any other branch of the topology would be counted in three of the six pairwise comparisons. To lessen this unbalanced counting of replacement events, the only pairwise comparisons that we make are those in which at least one of the sequences is being compared with its closest homolog in the data set.
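The four-sequence counting argument can be checked by enumeration. A minimal sketch, assuming leaves labelled A to D and the unrooted topology ((A,B),(C,D)); these labels and the function name `comparisons_crossing` are hypothetical. A replacement on a branch is seen in a pairwise comparison exactly when the two sequences lie on opposite sides of that branch.

```python
from itertools import combinations

LEAVES = ("A", "B", "C", "D")

# Each branch of the unrooted tree ((A,B),(C,D)) is described by the set
# of leaves found on one side of it.
BRANCH_SIDES = {
    "interior": {"A", "B"},
    "terminal_A": {"A"},
    "terminal_B": {"B"},
    "terminal_C": {"C"},
    "terminal_D": {"D"},
}


def comparisons_crossing(side):
    """Count pairwise comparisons whose two sequences are separated by a branch."""
    return sum(
        1 for x, y in combinations(LEAVES, 2) if (x in side) != (y in side)
    )


crossing_counts = {
    name: comparisons_crossing(side) for name, side in BRANCH_SIDES.items()
}
```

The enumeration gives four of the six comparisons for the interior branch and three of the six for each terminal branch, matching the counts stated in the text.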
The unbalanced counting of replacement events that remains would inflate the variance of our estimators of $\alpha_{ij}^k$ but would not bias them. In the future, we plan to investigate estimators that avoid the potential bias and inflated variance described here. We also hope to determine the variances of these estimators.

Probabilistic versus Biological Assumptions

It is well known among molecular evolutionists that likelihood approaches are based on explicit probabilistic models but the fact that explicit probabilistic assumptions are not always easily translated to explicit biological meanings seems less appreciated. A likelihood approach has less value for illuminating the process of sequence divergence if the parameters in the evolutionary model are not biologically interpretable. The probabilistic models we present here are biologically motivated. By examining models with explicit biological assumptions, we avoid the difficult task of assigning biological meaning to the parameters of a model after the model has been statistically selected. We are encouraged by the results of the parametric bootstrap test because they support the idea that secondary structure influences evolution, but we emphasize that the value of our model stems primarily from its biological basis and not from its favorable statistical performance. Our perspective is well summarized by Hunter (1982, p. 51): "All models are wrong; some are more useful than others."

Future Directions

Modifications to our model can be envisioned and some have already been discussed above. Many of these modifications may improve the statistical performance of the model and some may be biologically meaningful. For example, heterogeneity of replacement rates among sites within a secondary structure category could be permitted by adopting approaches already developed for the purposes of modelling nucleotide substitution (see Yang 1993, 1995; Yang, Goldman, and Friday 1994; Felsenstein and Churchill 1996). Another modification would be to increase the number of categories in our hidden Markov model. For example, sites on the surface of a protein could be distinguished from interior sites and the additional category "turn" could be included. Another research direction would involve combining DNA substitution models with our model of amino acid replacement. Recent work (Goldman and Yang 1994; Muse and Gaut 1994) indicates that this may be feasible.
The current version of our model is not tailored to any specific family of proteins, but it may be possible to design models appropriate for specific families or even for particular sites within a specific family. Some protein families are particularly rich in α-helix positions and others in β-sheet positions (Levitt and Chothia 1976). Our hidden Markov model for organization of secondary structure along a sequence allows for chance variation in frequencies of secondary structure categories among protein families, but the actual variation among protein families is probably larger than our model predicts. Perhaps this additional variation can be considered in future investigations. One concern that arises when such models are contemplated is whether sufficient information exists to estimate model parameters accurately.

It has not escaped our notice that this approach also has value in prediction of protein secondary structure. We will discuss this application in a forthcoming study. We hope our approach and its subsequent modifications will shed light on protein evolution and structure.

Acknowledgments

We thank Hirohisa Kishino, Joseph Felsenstein, and two anonymous reviewers for their constructive comments. We also thank Francis Giesbrecht, Elizabeth Thompson, and Simon Tavaré for their assistance with the quotation by W. G. Hunter. The research was supported by NIH grant R01 GM 49654, principal investigator J. L. Thorne. Programs written in Fortran that implement the techniques described here are available from N. Goldman (N.Goldman@gen.cam.ac.uk) and programs written in C are available from J. Thorne (thorne@stat.ncsu.edu).

LITERATURE CITED

ASAI, K., S. HAYAMIZU, and K. HANDA. Prediction of protein secondary structure by the hidden Markov model. CABIOS 9:
BENNER, S. A., and D. GERLOFF. Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the catalytic domain of protein kinases. Adv. Enzyme Regul. 31:121-181.
CHOTHIA, C., and A. M. LESK. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:
CHURCHILL, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51:
DAYHOFF, M. O., R. V. ECK, and C. M. ECK. A model of evolutionary change in proteins. Pp. in M. O. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5. National Biomedical Research Foundation, Washington, D.C.
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. A model of evolutionary change in proteins. Pp. in M. O. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, D.C.
FELSENSTEIN, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:
FELSENSTEIN, J. Phylip-phylogeny inference package (version 3.2). Cladistics 5:
FELSENSTEIN, J., and G. A. CHURCHILL. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:
FLORES, T. P., C. A. ORENGO, D. S. MOSS, and J. M. THORNTON. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2:1811-1826.
GOLDMAN, N. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:
GOLDMAN, N., and Z. YANG. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:
GONNET, G. H., M. A. COHEN, and S. A. BENNER. Exhaustive matching of the entire protein sequence database. Science 256:
GRANTHAM, R. Amino acid difference formula to help explain protein evolution. Science 185:
HEIN, J. A unified approach to alignment and phylogenies. Pp. in R. F. DOOLITTLE, ed. Methods in enzymology. Vol. Academic Press, San Diego, Calif.
HUNTER, W. G. Graduate programs in statistics. Pp. in J. S. RUSTAGI and D. A. WOLFE, eds. Teaching of statistics and statistical consulting. Academic Press, New York, N.Y.
JONES, D. T., W. R. TAYLOR, and J. M. THORNTON. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:
KABSCH, W., and C. SANDER. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:
KISHINO, H., T. MIYATA, and M. HASEGAWA. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:
LEVITT, M., and C. CHOTHIA. Structural patterns in globular proteins. Nature 261:
LÜTHY, R., A. D. MCLACHLAN, and D. EISENBERG. Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:
MUSE, S. V., and B. S. GAUT. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome. Mol. Biol. Evol. 11:
ORENGO, C. A., D. T. JONES, and J. M. THORNTON. Protein superfamilies and domain superfolds. Nature 372:631
OVERINGTON, J., M. S. JOHNSON, A. SALI, and T. L. BLUNDELL. Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. Lond. B 241:
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, and D. M. HILLIS. Phylogenetic inference. Pp. in D. M. HILLIS, C. MORITZ, and B. K. MABLE, eds. Molecular systematics, 2nd edition. Sinauer Associates, Sunderland, Mass.
TAYLOR, W. R., and D. T. JONES. Deriving an amino acid distance matrix. J. Theor. Biol. 164:
THORNTON, J. M., T. P. FLORES, D. T. JONES, and M. B. SWINDELLS. Prediction of progress at last. Nature 354:
TOPHAM, C. M., A. MCLEOD, F. EISENMENGER, J. P. OVERINGTON, M. S. JOHNSON, and T. L. BLUNDELL. Fragment ranking in modelling of protein structure: conformationally constrained substitution tables. J. Mol. Biol. 229:
WAKO, H., and T. L. BLUNDELL. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:
WHITE, J. V., C. M. STULTZ, and T. F. SMITH. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:
YANG, Z. Maximum likelihood phylogenetic estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:
YANG, Z. A space-time process model for the evolution of DNA sequences. Genetics 139:
YANG, Z., N. GOLDMAN, and A. FRIDAY. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:

PAUL M. SHARP, reviewing editor
Accepted January 30, 1996


More information

Estimating the Rate of Evolution of the Rate of Molecular Evolution

Estimating the Rate of Evolution of the Rate of Molecular Evolution Estimating the Rate of Evolution of the Rate of Molecular Evolution Jeffrey L. Thorne,* Hirohisa Kishino, and Ian S. Painter* *Program in Statistical Genetics, Statistics Department, North Carolina State

More information

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities Computational Statistics & Data Analysis 40 (2002) 285 291 www.elsevier.com/locate/csda Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities T.

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid Lecture 11 Increasing Model Complexity I. Introduction. At this point, we ve increased the complexity of models of substitution considerably, but we re still left with the assumption that rates are uniform

More information

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION AND CALIBRATION Calculation of turn and beta intrinsic propensities. A statistical analysis of a protein structure

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Probabilistic modeling and molecular phylogeny

Probabilistic modeling and molecular phylogeny Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU) What is a model? Mathematical

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Supplementary Figure 1 Histogram of the marginal probabilities of the ancestral sequence reconstruction without gaps (insertions and deletions).

Supplementary Figure 1 Histogram of the marginal probabilities of the ancestral sequence reconstruction without gaps (insertions and deletions). Supplementary Figure 1 Histogram of the marginal probabilities of the ancestral sequence reconstruction without gaps (insertions and deletions). Supplementary Figure 2 Marginal probabilities of the ancestral

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Understanding relationship between homologous sequences

Understanding relationship between homologous sequences Molecular Evolution Molecular Evolution How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective

More information

Modeling Noise in Genetic Sequences

Modeling Noise in Genetic Sequences Modeling Noise in Genetic Sequences M. Radavičius 1 and T. Rekašius 2 1 Institute of Mathematics and Informatics, Vilnius, Lithuania 2 Vilnius Gediminas Technical University, Vilnius, Lithuania 1. Introduction:

More information

Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach

Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach Author manuscript, published in "Journal of Computational Intelligence in Bioinformatics 2, 2 (2009) 131-146" Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach Omar GACI and Stefan

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Classification and Phylogeny

Classification and Phylogeny Classification and Phylogeny The diversity of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

More information