BIOINFORMATICS ORIGINAL PAPER

Size: px
Start display at page:

Download "BIOINFORMATICS ORIGINAL PAPER"

Transcription

1 BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 1 25, pages doi:1.193/bioinformatics/bti347 Phylogenetics REVCOM: a robust Bayesian method for evolutionary rate estimation Andrew J. Bordner and Ruben Abagyan Molsoft LLC, 3366 North Torrey Pines Court, Suite 3, La Jolla, CA 9237, USA Received on January 25, 25; accepted on February 19, 25 Advance Access publication March 4, 25 ABSTRACT Motivation: Evolutionary conservation estimated from a multiple sequence alignment is a powerful indicator of the functional significance of a residue and helps to predict active sites, ligand binding sites, and protein interaction interfaces. Many algorithms that calculate conservation work well, provided an accurate and balanced alignment is used. However, such a strong dependence on the alignment makes the results highly variable. We attempted to improve the conservation prediction algorithm by making it more robust and less sensitive to (1) local alignment errors, (2) overrepresentation of sequences in some branches and (3) occasional presence of unrelated sequences. Results: A novel method is presented for robust constrained Bayesian estimation of evolutionary rates that avoids overfitting independent rates and satisfies the above requirements. The method is evaluated and compared with an entropy-based conservation measure on a set of 1494 protein interfaces. We demonstrated that 62% of the analyzed protein interfaces are more conserved than the remaining surface at the 5% significance level. A consistent method to incorporate alignment reliability is proposed and demonstrated to reduce arbitrary variation of calculated rates upon inclusion of distantly related or unrelated sequences into the alignment. Contact: bordner@ornl.gov Supplementary information: The protein protein interface dataset, multiple sequence alignments and corresponding phylogenetic trees are available at bordner/revcom/ 1 INTRODUCTION The evolutionary conservation varies among amino acid sites owing to differing degrees of functional constraints on them (Fitch and Margoliash, 1967; Uzzell and Corbin, 1971; Holmquist et al., 1983). Sites that are important for the protein s tertiary structure and folding, enzymatic activity, ligand binding or interaction with other proteins are generally more conserved. The greater conservation of residues in binding sites compared with other surface residues has been exploited by some methods that predict small ligand or protein protein interfaces by mapping residue conservation values to the query protein surface (Landgraf et al., 21; Lichtarge and Sowa, 22; Pupko et al., 22). The identification of conserved residues may be useful for identifying functionally important residues even in the absence of structural information. Binding site prediction results To whom correspondence should be addressed at Computer Science and Mathematics Division, Oak Ridge National Laboratory, PO Box 28, MS6173, Oak Ridge, TN 37831, USA have applications in structure-based drug design as well as protein functional assignment and suggest important residues for mutation analysis studies. Models that account for the evolutionary relationship of the sequences through a phylogenetic tree should be less sensitive to the choice of sequences than methods based on residue frequencies in a corresponding multiple sequence alignment column. In particular, a large number of closely related sequences may give erroneously high residue conservation. Evolutionary tracing is one such method that utilizes a phylogenetic tree to identify residues that are identically conserved in a subtree (Lichtarge et al., 1996). The maximum tree depth at which a residue remains unchanged is used to rank the degree of conservation. This analysis was later modified to incorporate a quantitative model of residue substitutions (Landgraf et al., 1999). Another algorithm, ConSurf, used a maximum parsimony tree to calculate a site conservation score as the number of substitutions weighted by their physicochemical distance (Armon et al., 21). A study by Pupko et al. (22) describes a method for a maximumlikelihood estimation of independent evolutionary rates assuming a homogeneous Markov model of residue substitution. The maximumlikelihood method (Neyman, 1971; Felsenstein, 1981) is not subject to errors present in the maximum parsimony principle, including systematic overestimation of conservation due to neglect of backward and parallel substitutions and no dependence on branch lengths. It also has the advantage that a fast recursive algorithm may be used to calculate likelihoods for a given tree (Felsenstein, 1981). However, a straightforward estimation of independent site rates suffers from overfitting since there are as many parameters as sites (Felsenstein, 21). One solution to this problem is to assume that the rates come from a probability distribution, usually assumed to be a gamma distribution. Yang (1993, 1994) used a discrete approximation to the gamma distribution to incorporate rate variation in phylogeny determination using the maximum-likelihood method. Although there is no evolutionary process that supports it, a gamma rate distribution is often used since it is defined on the correct interval (positive rates). It also leads to a negative binomial distribution for total substitution frequencies in a simple Poisson model of site substitution probabilities. An early paper by Uzzell and Corbin (1971) supported a gamma rate distribution by showing that the negative binomial distribution fits sequence data better than a Poisson distribution. On account of this evidence and mathematical convenience we also use the gamma distribution, but as a Bayesian prior rates distribution. A lack of robustness or excessive sensitivity of a conservation calculation to the input multiple sequence alignment is a general The Author 25. Published by Oxford University Press. All rights reserved. For Permissions, please journals.permissions@oupjournals.org 2315

2 A.J.Bordner and R.Abagyan problem for the existing methods. This sensitivity may be owing to overrepresentation of a particular subfamily or alignment errors. While distantly related sequences have the potential to provide strong evidence of residue conservation they also generally introduce more noise. Three sources of errors at large evolutionary distances may affect the site rate calculation: (1) local alignment errors due to ambiguities in divergent sequence segments, (2) alignment to a nonhomologous sequence mistakenly picked out by a database search and (3) uncertainties in the residue substitution matrix owing to alignment errors or extrapolation to large distances. All these errors are expected to become more possible at larger distances and are neglected in most conservation prediction methods. This is important, particularly for automatically generated alignments which may include non-homologous sequences or alignment errors for distantly related sequences. While iterative alignment methods, such as PSI- BLAST, have high sensitivity in detecting remote homologs (Park et al., 1998) they are in particular susceptible to inclusion of nonhomologous sequences owing to overly permissive parameters or profile drift (Sjölander, 24). We introduce the Robust EVolutionary COnservation Measure (REVCOM) which consistently incorporates the alignment reliability into the likelihood calculation in order to render the method more robust to the inclusion of distantly related sequences. The resulting likelihood for each site may be interpreted as a sum of likelihoods for all possible trees with a subset of sequences removed, weighted by the probability that the excluded sequences are unreliable and the included ones are reliable. A large non-redundant dataset of 1494 protein protein interfaces with available X-ray crystal structures was also compiled in order to test the conservation algorithm. Since they are larger in general than ligand binding interfaces the variance in conservation statistics for protein protein interfaces is expected to be lower. We compared the REVCOM method with a simple entropy-based method that does not use evolutionary trees. Both the ability of the methods to detect the higher conservation in protein protein interfaces and their stability with the inclusion of distantly related sequences in the alignment were evaluated. 2 SYSTEMS AND METHODS 2.1 REVCOM method overview First, pairwise evolutionary distances and a phylogenetic tree are estimated from a multiple sequence alignment without assuming rate heterogeneity. Next, the site rate distribution is inferred using a maximum-likelihood estimate that accounts for alignment reliability. Finally, the individual site rates are calculated using this distribution as a Bayesian prior distribution. 2.2 Residue substitution model Amino acid substitutions at sites are assumed to follow independent timehomogeneous Markov processes with the Jones Taylor Thornton (JTT) matrix used to calculate substitution probabilities (Jones et al., 1992). The substitution matrix for a given evolutionary distance was extrapolated from the matrix given in this reference for a distance corresponding to one percent accepted point mutation (one PAM). 2.3 Multiple sequence alignments Multiple sequence alignments were generated automatically for each protein sequence in the dataset. First, the BLAST program (Altschul et al., 199) with an E-value cutoff of.1 was used to collect similar sequences from the NCBI nr database. This relatively permissive cutoff was chosen such that distantly related homologous sequences that make a disproportionately large contribution to the evolutionary rates estimation are included. Sequences with >9% identity to another sequence in the set were then iteratively removed. The ClustalW program (Thompson et al., 1994) with default alignment parameters (Gonnet 25 scoring matrix with gap opening = 1 and gap extension =.1) was used to align the remaining sequences. 2.4 Phylogenetic trees Phylogenetic trees were then generated for each alignment using the neighborjoining algorithm (Saitou and Nei, 1987) as implemented in the Quicktree (Howe et al., 22) program. Thus the trees for every protein in a complex were determined independently. A weighted least squares fit using pairwise PAM distances between sequences was then used to recalculate the branch lengths. The PAM distance t was calculated by inverting the expression for the expected fraction of identical residues q(t) q(t) = (M t ) ii p i, (1) where M t = exp(tq) is the JTT matrix for distance t, defined by the matrix exponential of the instantaneous JTT rate matrix Q and p i is the occurrence probability of amino acid type i. This simple estimate for the PAM distance was used along with linear interpolation in order to reduce the computational time. The variance σt 2 in the distance was approximated from the binomial variance using the delta method ( ) dq(t) 2 σt 2 q(1 q). (2) dt n The weighted least squares fit then minimizes the sum of squares of weighted residuals N(N 1)/2 (d i 2N 2 j=1 C ij b j ) 2 SS = σ 2, (3) i where N is the number of sequences, d i is the pairwise PAM distance, b j is the branch length and C is a matrix with elements C ij = 1 when branch j is in the unique path joining sequences in the pair i and otherwise (Bulmer, 1991). Defining a diagonal variance matrix V ii σi 2 the estimates of the branch lengths are b = (C T V 1 C) 1 C T V 1 d. 2.5 Simple residue conservation model We compared our rate prediction model with a simple entropy-based model in which the site conservation is calculated for each alignment column using the Shannon entropy S S = f i log f i, (4) where f i are the residue occurrence frequencies for the column (Valdar, 22; Mihalek et al., 24). Gaps are not included in the calculation of the residue frequencies. 2.6 Protein protein interface datasets A dataset of protein intermolecular interfaces in complexes was compiled from biological unit information in the Protein Data Bank (PDB) (Berman et al., 2) archive using the ICM scripting language (Molsoft, LLC, 24). Only the first biological unit was included and mmcif format files were utilized since they contain biological unit information for all entries. Only X-ray structures of proteins with 2 residues were considered and pairs of interacting proteins were clustered at 3% sequence identity. Complexes whose PDB information conflicted with their Swiss-Prot (Apweiler et al., 24) subunit annotations were corrected or removed after consulting the literature. Alignments of the protein sequences in each cluster to a representative sequence were then used to compare interface residues. Two interfaces on a protein were considered distinct if their residue sets, referred to the representative sequence, overlapped <2%. The highest quality structure, with the least missing coordinates and the highest resolution, for each unique protein protein interface was then included in the dataset. Interfaces including immune system proteins that are highly polymorphic proteins 2316

3 REVCOM: evolutionary rate estimation or undergo somatic mutation, namely MHC, T-cell receptors and antibodies, were excluded. Finally, only interfaces containing at least 1 residues were included in the set because of the large variance of the residue-based statistics as well as difficulties in validating the interface prediction for smaller interfaces. The statistical analyses were performed for each distinct protein with its interface residues determined from the structure of the complex. The resulting set had 1494 protein protein interfaces (1143 unordered pairs), of which 518 were in homodimers, 114 were in heterodimers and the remaining 862 were in multimers. 2.7 Statistical tests The Mann Whitney Wilcoxon rank sum test was used to compare the evolutionary rates for the interface and non-interface regions. This non-parametric test was used since the underlying distribution of the posterior rates is unknown and is not a normal distribution. 3 ALGORITHM 3.1 Evolutionary site rates estimation The REVCOM method calculates evolutionary rates for sites in a reference protein sequence using the following steps: (1) Perform a BLAST search of the sequence database using the reference sequence as a query. (2) Remove redundant sequences (>9% identity). (3) Construct a multiple alignment of all sequences. (4) Calculate all pairwise PAM distances between sequences from fractional identities using Equation (1). (5) Determine the phylogenetic tree from pairwise distances using neighbor-joining algorithm. (6) Recalculate branch lengths using weighted least squares method. (7) Find the maximum-likelihood estimate of the α parameter in the prior gamma distribution. (8) Calculate site rates by averaging over the posterior rates distribution. P -values from the BLAST search in the first step are used as estimates of alignment reliability in the final two steps. 3.2 Maximum-likelihood estimate of the gamma distribution parameter The likelihood that is maximized for the α parameter estimate is L(α) = m p( x m r)f(α, r)dr, (5) where f(α, r) is the gamma distribution with mean 1 f(α, r) = αα r α 1 e αr Ɣ(α) and p( x m r)is the likelihood for site m, given the rate r. x m is a vector of amino acid types in column m of the multiple sequence alignment. Non-standard amino acid symbols, e.g. X and gaps are excluded from the calculation by removing the corresponding branches in the tree for that site. (6) 3.3 Accounting for alignment reliability One new feature in the REVCOM method is the use of an alignment reliability probability in order to prevent distantly related or nonhomologous sequences from inordinately affecting the likelihood in Equation (5) and consequently the predicted rates. The probabilities of each sequence to be misaligned with a reference sequence, for which the rates are to be calculated, are used. The misalignment probabilities for different sequences will be assumed to be independent. This allows them to be easily incorporated into the maximum-likelihood calculation. Misalignment probabilities from progressive alignment methods such as ClustalW are generally correlated since early alignment errors are retained and early accurate profiles yield subsequent alignments of greater accuracy. Misalignment probabilities in adjacent columns are also expected to be correlated, both, because errors are more likely to occur in segments corresponding to solvent-exposed loops than in segments corresponding to core helices or strands, and because errors in, e.g. gap placement, will adversely affect the alignment of the nearby columns. Both correlations yield more extreme misalignment probabilities, i.e. low probabilities are lower and high probabilities are higher; however, these deviations from independence are not expected to cause large errors in our method. BLAST P -values, 1 exp( E i ), with E i, the E-value from the database search with the reference sequence of the protein onto which the rates will be projected, will be used for a rough global estimate of the misalignment probabilities. This underestimates the misalignment probabilities since it only explicitly accounts for error (2) mentioned above; however, this is a serious global alignment error whose statistical distribution is known (Altshul and Gish, 1996). Our algorithm can be used with more accurate local, or column dependent, misalignment probabilities that account for other sources of alignment error since the probabilities are defined for each individual site. This is an area of future investigation since we are unaware of any existing method to calculate such probabilities. Previously developed methods for local alignment reliability measures (Vingron and Argos, 199; Chao et al., 1993; Mevissen and Vingron, 1996; Abagyan and Batalov, 1997; Schlosshauer and Ohlsson, 22; Löytynoja and Milinkovitch, 23) may provide useful starting points. The site likelihood p( x m r) is calculated recursively using a modification of the method of Felsenstein (1981). p( x m r) = p i L i (), (7) where p i is the residue probability, as in Equation 1; and L i (I) is the conditional likelihood for a subtree based at node I with I = as the root node. The following recursion relations are then used to calculate L i () (Fig. 1): L i (I) = j=1 [q M p j + (1 q M )(M rbm ) ij ]L j (J ) k=1 [q N p k + (1 q N )(M rbn ) ik ]L k (K) L i (I) = δ j,si, I leaf nodes. (8) In the above equation, b M is the length and q M the misalignment probability of branch M and S I is the residue type in the sequence 2317

4 A.J.Bordner and R.Abagyan A Bayesian estimate of the site rate, ˆr, is then calculated as the average of this posterior distribution ˆr = p(r x m )r dr. (11) Fig. 1. Tree branch and vertex labels for Equation (8). Using the average posterior distribution as an estimate corresponds to minimizing a square-error loss function (Hogg and Craig, 1978). We note that our method uses an empirical Bayes approach since the parameter α is first estimated in the prior rates distribution. This approach has been used Yang and co-workers to calculate DNA evolutionary rates using maximum-likelihood estimates of additional model parameters, including nucleotide substitution rates and the tree topologies (Yang and Wang, 1995), as well as to calculate nonsynonymous/synonymous DNA substitution rate ratios (Yang et al., 2). Fig. 2. Schematic representation of the four probabilities for each pruned tree contributing to the total likelihood, calculated using Equation (8), for the case of one reference sequence R and two other sequences M and N. A dot under the sequence name indicates that it is treated as independent for calculating the joint probability of all sequences. A detailed derivation of this result is given in the Supplementary data. associated with leaf node I at the site. The value of q M for a branch connected to a leaf node is equal to the misalignment probability of the corresponding sequence and is equal to zero for all other branches. The total site likelihood is then the sum of the likelihoods for the given site residues, phylogenetic trees with all combinations of the non-reference sequences removed, weighted by the probability that the missing sequences are misaligned and the included sequences are correctly aligned. For example, the four terms that contribute to the likelihood sum and their respective weights are shown in Figure 2 for the case of one reference sequence R and two other sequences M and N. A detailed derivation for this example is given in the Supplementary Information. 3.4 Bayesian rates estimate The site rate distribution with the estimated parameter ˆα may then be used as a prior distribution to calculate the posterior rate distribution using Bayes formula with p(r x m ) = p( x m r)f(ˆα, r) p( x m ) p( x m ) = (9) p( x m r)f(ˆα, r)dr. (1) 4 IMPLEMENTATION Several numerical approximations were used to speed up the calculation. First, PAM distances were estimated using linear interpolation based on distances that were precalculated using Equation (1) for discrete fractional identity values. A conjugate gradient iterative method was used to solve the linear system of equations resulting from the least squares fit of PAM distances. Also, components of the JTT substitution matrix, which are continuous functions of the evolutionary PAM distance, were approximated by Chebyshev polynomials for fast calculation (Pupko and Graur, 22). Finally, the integral in Equation (5) for the total likelihood may be efficiently calculated using Gaussian quadrature based on Laguerre polynomials (Press et al., 1992), as discussed in the study by Felsenstein (21). 5 DISCUSSION 5.1 Comparison with entropy based conservation measure The site rates calculated using the REVCOM method were compared with the simple entropy conservation measure by examining their ability to detect statistically significant differences in evolutionary conservation between protein protein interface residues and non-interface surface residues. Statistical tests with an alternative hypothesis of higher residue conservation in the interface were performed at three different significance levels, 5%, 1% and.1% using both conservation measures. All interfaces in the dataset were analyzed. The number of proteins with significantly higher conservation in their interaction interfaces as well as the median P -values are given in Table 1 for each method. It is apparent that the REV- COM method discriminates conservation differences better than the entropy measure since it detects significant conservation differences in more interfaces at all significance levels. Overall 62% of the interfaces have significantly higher residue conservation than the non-interface surface residues at the 5% level. 5.2 Robustness to inclusion of distantly related sequences Site rate predictions may be compared with predictions using alignments containing additional distantly related or non-homologous sequences in order to verify that including misalignment probabilities renders the method more robust. First, the NCBI nr database was searched with two different BLAST E-value cutoffs,.1 and 1., 2318

5 REVCOM: evolutionary rate estimation Table 1. Number of protein protein interfaces in the set of 1494 that have greater evolutionary conservation as compared with non-interface surface residues at the given significance levels Conservation model P -value cutoff Median P -value Bayesian rates Entropy Number of residues higher rate higher conservation REVCOM REVCOM without alignment reliability Entropy Method Difference in conservation measure Fig. 3. Comparison of the robustness of three conservation calculation methods. Histograms of conservation differences upon multiple alignment expansions are shown. A perfectly robust conservation measure would have all differences typically as zero. Since the area under each curve is equal, a higher peak at zero difference indicates higher robustness. using sequences for each of the 1494 proteins in the dataset as queries. Next, rates were calculated for the 82 proteins for which additional sequences were found at the higher E-value cutoff. The average absolute deviation in rates was only.23 for the REVCOM method as compared with.33 for the same method without misalignment probabilities. A histogram of the differences in the rates as well as in the entropy measures with an expansion of the multiple alignment is shown in Figure 3. The greater robustness from accounting for alignment reliability is apparent from the histogram of the REVCOM method, which is more strongly peaked about zero rate difference. The histogram also shows that expansion of the multiple alignment causes a large systematic increase in the entropy conservation measure whereas the differences for the REVCOM method are more symmetrical about zero. Next, we illustrate how the misalignment probabilities render the method more robust to the inclusion of non-homologous sequences by a specific example. Site rates are calculated for the SH2 domain using a reliable alignment and an alignment with additional non-homologous sequences and compared. The Pfam seed alignment (Bateman et al., 22) with 58 sequences is used for the reliable alignment and 23 ( 4%) random non-homologous sequences are then added and aligned to a profile of the original sequences using ClustalW (Thompson et al., 1994) for the test alignment. The median absolute site rate differences between the reliable Fig. 4. Surface residue site rates for E.coli malate dehydrogenase homodimer (PDB entry 1EMD (Hall and Banaszak, 1993)) calculated using the REVCOM method. The surface of one protein is colored with blue, white and red corresponding to low, intermediate and high rates, respectively. The most conserved residues are clustered near the dimer interface, NAD cofactor (shown in stick representation) and substrate (not visible beneath the surface to the left of the cofactor). The other protein in the dimer is shown in a ribbon representation with helices colored red and strands colored green. and test alignments is.33 without including alignment reliability and only.65 when it is included. This is because of the decoupling of the non-homologous sequences in the site likelihood calculation when misalignment probabilities are accounted for. 5.3 Conservation in protein protein interfaces It is important to compare conservation of interface residues with that of other surface residues rather than with all protein residues. This is because the lower conservation of surface residues compared with buried residues yields a greater contrast. In fact, one study compared the evolutionary conservation of protein protein interface residues with the entire protein sequence and found little difference (Grishin and Phillips, 1994). However, other studies, which compared interface residues with surface residues, found significant differences in conservation (Valdar and Thornton, 21; Caffrey et al., 24). Our results corroborate this observation but with a much larger dataset than those used in these earlier papers. The site rates may be displayed on the protein surface to identify protein protein interaction interfaces and active sites. An example of Escherichia coli malate dehydrogenase is shown in Figure 4. This homodimeric enzyme uses NAD as a cofactor to reversibly oxidize malate to oxaloacetate. It is an essential component of the citric acid cycle with orthologs in both prokaryotes and eukaryotes. The figure shows that residues in a surface region extending from the dimer interface to the cofactor and substrate binding sites have lower rates, i.e. higher conservation, in general than the remaining surface residues. The lower evolutionary rates for the interface is also evident from the low P -value of for the statistic comparing interface and non-interface residue rates. Although the majority of protein protein interfaces in our dataset are more conserved than the remainder of the surface, the paper of Caffrey et al. (24) concluded that conservation alone is insufficient to predict the interfaces when the surface is divided into small patches. Extending their analysis to a larger dataset or using different conservation measures, such as the one presented here, may yield better results. However, including physicochemical, geometric or residue distribution properties, such as those considered 2319

6 A.J.Bordner and R.Abagyan by Jones and Thornton (1997) in their patch analysis of protein protein interfaces, can only improve the prediction performance. Thus an accurate protein protein interface prediction method should not only account for residue conservation but also include other discriminating properties. 5.4 Future directions There are several possible extensions of the REVCOM method presented here. A local or column-dependent alignment reliability measure (Vingron and Argos, 199; Chao et al., 1993; Mevissen and Vingron, 1996; Abagyan and Batalov, 1997; Schlosshauer and Ohlsson, 22; Löytynoja and Milinkovitch, 23) may be used for calculating the misalignment probabilities instead of the global measure. However, the most appropriate measure, not described in these references, would be the probability of a local misalignment error between a given sequence and the reference sequence in the context of the multiple sequence alignment. Since misalignment probabilities are required by the method described in this paper a local alignment reliability index must be converted into a probability, e.g. by calibrating it against a structural alignment database (Mevissen and Vingron, 1996). It is also interesting to investigate the use of different prior rates distributions. The distribution should not have many parameters because estimation of the gamma distribution parameter is currently the most computationally intensive step in the rates calculation. Finally, since there is evidence that alignment gaps are less frequent in protein protein interfaces, at least for permanent dimers (Caffrey et al., 24), accounting for gaps in the stochastic evolutionary process, rather than treating them as missing data, may result in more accurate site rates. It would be informative to divide the interface dataset into permanent and transient protein protein interactions to investigate the differences in interface residue conservation. Caffrey et al. (24) found that central interface residues were more conserved than peripheral interface residues in permanent interfaces but not transient ones, but their dataset was considerably smaller than the one used here. The difficulty is in automatically assigning binding affinities to a large number of protein interactions. Another issue that has not been thoroughly investigated is whether residues that contribute the most to the binding free energy are more conserved than other interface residues. Previous studies, using alanine-scanning mutagenesis results, have shown that a small number of interface residues contributes to the majority of the binding energy (Clackson and Wells, 1995; Bogan and Thorn, 1998). Hu et al. (2) identified polar residues that are conserved in a structural alignment of protein protein interfaces and speculated, based on the agreement with alanine scanning data, that these may be hot spot residues that strongly influence binding. 6 CONCLUSION We have presented a method, REVCOM, to calculate residue conservation that accounts for the evolutionary relationships of the sequences and avoids overfitting by using a Bayesian framework. This method also accounts for alignment errors by summing the likelihoods of truncated phylogenetic trees which are weighted by the probability that the missing sequences were unreliable and the included ones were reliable. The resulting conservation measure was shown to discriminate conservation differences for protein protein interfaces better than a column entropy measure and to prevent distantly related sequences from overly affecting the rates. The evolutionary rates provided by the REVCOM method are useful for identifying surface residues in enzyme active sites and small ligand or protein binding sites. ACKNOWLEDGEMENTS We thank J. Thorne for pointing out an error in an earlier version of this paper. This work was funded by a grant from the Department of Energy (No. DE-FG3-1ER83282). SUPPLEMENTARY DATA Supplementary data for this paper are available on Bioinformatics online. REFERENCES Abagyan,R. and Batalov,S. (1997) Do aligned sequences share the same fold? J. Mol. Biol., 273, Altshul,S. and Gish,W. (1996) Local alignment statistics. Methods Enzymol., 266, Altschul,S. et al. (199) Basic local alignment search tool. J. Mol. Biol., 215, Apweiler,R. et al. (24) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32, D115 D119. Armon,A. et al. (21) Consurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol., 37, Bateman,A. et al. (22) The Pfam protein families database. Nucleic Acids Res., 3, Berman,H. et al. (2) The Protein Data Bank. Nucleic Acids Res., 28, Bogan,A. and Thorn,K. (1998) Anatomy of hot spots in protein interfaces. J. Mol. Biol., 28, 1 9. Bulmer,M. (1991) Use of the method of generalized least-squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol., 8, Caffrey,D. et al. (24) Are protein protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci., 13, Chao,K. et al. (1993) Locating well-conserved regions within a pairwise alignment. Comput. Appl. Biosci., 9, Clackson,T. and Wells,J. (1995) A hot spot of binding energy in a hormone receptor interface. Science, 267, Felsenstein,J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17, Felsenstein,J. (21) Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol., 53, Fitch,W. and Margoliash,E. (1967) A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem. Genet., 1, Grishin,N. and Phillips,M. (1994) The subunit interfaces of oligomeric enzymes are conserved to a similar extent to the overall protein sequences. Protein Sci., 3, Hall,M. and Banaszak,L. (1993) Crystal structure of a ternary complex of Escherichia coli malate dehydrogenase citrate and NAD at 1.9 Å resolution. J. Mol. Biol., 232, Hogg,R. and Craig,A. (1978) Bayesian estimation. Introduction to Mathematical Statistics, 5th edn. Prentice-Hall, Englewood Cliffs, NJ, pp Holmquist,R. et al. (1983) The spatial distribution of fixed mutations within genes coding for proteins. J. Mol. Evol., 19, Howe,K. et al. (22) QuickTree: building huge neighbour-joining trees of protein sequences. Bioinformatics, 18, Hu,Z. et al. (2) Conservation of polar residues as hot spots at protein interfaces. Proteins, 39, Jones,D. et al. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8, Jones,S. and Thornton,J. (1997) Analysis of protein protein interaction sites using surface patches. J. Mol. Biol., 272, Landgraf,R. et al. (1999) Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng., 12, Landgraf,R. et al. (21) Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol., 37,

7 REVCOM: evolutionary rate estimation Lichtarge,O. and Sowa,M. (22) Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol., 12, Lichtarge,O. et al. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, Löytynoja,A. and Milinkovitch,M. (23) A hidden Markov model for progressive multiple alignment. Bioinformatics, 19, Mevissen,H. and Vingron,M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, Mihalek,I. et al. (24) A family of evolution entropy hybrid methods for ranking protein residues by importance. J. Mol. Biol., 336, Molsoft, LLC (24) ICM Software Manual. Version 3.. Neyman,J. (1971) Molecular studies of evolution: a source of novel statistical problems. In Gupta,S. and Yackel,J. (eds), Statistical Decision Theory and Related Topics, Academic Press, NY, pp Park,J. et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, Press,W., Teukolsky,S., Vetterling,W. and Flannery,B. (1992) Numerical Recipes in C. Cambridge University Press. Pupko,T. and Graur,D. (22) Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities. Comput. Stat. Data An., 4, Pupko,T. et al. (22) Rate 4site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics, 18, S71 S77. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, Schlosshauer,M. and Ohlsson,M. (22) A novel approach to local reliability of sequence alignments. Bioinformatics, 18, Sjölander,K. (24) Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics, 2, Thompson,J. et al. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, Uzzell,T. and Corbin,K. (1971) Fitting discrete probability distributions to evolutionary events. Science, 172, Valdar,W. (22) Scoring residue conservation. Proteins, 48, Valdar,W. and Thornton,J. (21) Protein protein interfaces: analysis of amino acid conservation in homodimers. Proteins, 42, Vingron,M. and Argos,P. (199) Determination of reliable regions in protein sequence alignments. Protein Eng., 3, Yang,Z. (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol., 1, Yang,Z. (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol., 39, Yang,Z. and Wang,T. (1995) Mixed model analysis of dna sequence evolution. Biometrics, 51, Yang,Z. et al. (2) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155,

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Statistical Analysis and Prediction of Protein Protein Interfaces

Statistical Analysis and Prediction of Protein Protein Interfaces PROTEINS: Structure, Function, and Bioinformatics 60:353 366 (2005) Statistical Analysis and Prediction of Protein Protein Interfaces Andrew J. Bordner* and Ruben Abagyan Molsoft LLC, San Diego, California

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species Paulo Bandiera-Paiva 1 and Marcelo R.S. Briones 2 1 Departmento de Informática em Saúde

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Optimization of a New Score Function for the Detection of Remote Homologs

Optimization of a New Score Function for the Detection of Remote Homologs PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

2 Spial. Chapter 1. Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6. Pathway level. Atomic level. Cellular level. Proteome level.

2 Spial. Chapter 1. Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6. Pathway level. Atomic level. Cellular level. Proteome level. 2 Spial Chapter Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6 Spial Quorum sensing Chemogenomics Descriptor relationships Introduction Conclusions and perspectives Atomic level Pathway level Proteome

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes David DeCaprio, Ying Li, Hung Nguyen (sequenced Ascomycetes genomes courtesy of the Broad Institute) Phylogenomics Combining whole genome

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004, Tracing the Evolution of Numerical Phylogenetics: History, Philosophy, and Significance Adam W. Ferguson Phylogenetic Systematics 26 January 2009 Inferring Phylogenies Historical endeavor Darwin- 1837

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Subfamily HMMS in Functional Genomics D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Pacific Symposium on Biocomputing 10:322-333(2005) SUBFAMILY HMMS IN FUNCTIONAL GENOMICS DUNCAN

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057 Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

In order to compare the proteins of the phylogenomic matrix, we needed a similarity

In order to compare the proteins of the phylogenomic matrix, we needed a similarity Similarity Matrix Generation In order to compare the proteins of the phylogenomic matrix, we needed a similarity measure. Hamming distances between phylogenetic profiles require the use of thresholds for

More information

Detection of Protein Binding Sites II

Detection of Protein Binding Sites II Detection of Protein Binding Sites II Goal: Given a protein structure, predict where a ligand might bind Thomas Funkhouser Princeton University CS597A, Fall 2007 1hld Geometric, chemical, evolutionary

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

Supplementary Figure 1. Aligned sequences of yeast IDH1 (top) and IDH2 (bottom) with isocitrate

Supplementary Figure 1. Aligned sequences of yeast IDH1 (top) and IDH2 (bottom) with isocitrate SUPPLEMENTARY FIGURE LEGENDS Supplementary Figure 1. Aligned sequences of yeast IDH1 (top) and IDH2 (bottom) with isocitrate dehydrogenase from Escherichia coli [ICD, pdb 1PB1, Mesecar, A. D., and Koshland,

More information

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22 Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities Computational Statistics & Data Analysis 40 (2002) 285 291 www.elsevier.com/locate/csda Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities T.

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Similarity searching summary (2)

Similarity searching summary (2) Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Do Aligned Sequences Share the Same Fold?

Do Aligned Sequences Share the Same Fold? J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

7. Tests for selection

7. Tests for selection Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Information content of sets of biological sequences revisited

Information content of sets of biological sequences revisited Information content of sets of biological sequences revisited Alessandra Carbone and Stefan Engelen Génomique Analytique, Université Pierre et Marie Curie, INSERM UMRS511, 91, Bd de l Hôpital, 75013 Paris,

More information

Supplementary Information

Supplementary Information Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing Bioinformatics Proteins II. - Pattern, Profile, & Structure Database Searching Robert Latek, Ph.D. Bioinformatics, Biocomputing WIBR Bioinformatics Course, Whitehead Institute, 2002 1 Proteins I.-III.

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Motif Prediction in Amino Acid Interaction Networks

Motif Prediction in Amino Acid Interaction Networks Motif Prediction in Amino Acid Interaction Networks Omar GACI and Stefan BALEV Abstract In this paper we represent a protein as a graph where the vertices are amino acids and the edges are interactions

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB) Protein structure databases; visualization; and classifications 1. Introduction to Protein Data Bank (PDB) 2. Free graphic software for 3D structure visualization 3. Hierarchical classification of protein

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26 Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

More information

2 Genome evolution: gene fusion versus gene fission

2 Genome evolution: gene fusion versus gene fission 2 Genome evolution: gene fusion versus gene fission Berend Snel, Peer Bork and Martijn A. Huynen Trends in Genetics 16 (2000) 9-11 13 Chapter 2 Introduction With the advent of complete genome sequencing,

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Inferring Phylogenies from Protein Sequences by. Parsimony, Distance, and Likelihood Methods. Joseph Felsenstein. Department of Genetics

Inferring Phylogenies from Protein Sequences by. Parsimony, Distance, and Likelihood Methods. Joseph Felsenstein. Department of Genetics Inferring Phylogenies from Protein Sequences by Parsimony, Distance, and Likelihood Methods Joseph Felsenstein Department of Genetics University of Washington Box 357360 Seattle, Washington 98195-7360

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information