Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis

Size: px
Start display at page:

Download "Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis"

Transcription

1 Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis William R. Atchley,* 1 Kurt R. Wollenberg,* Walter M. Fitch, Werner Terhalle, and Andreas W. Dress *Department of Genetics, North Carolina State University; Department of Ecology and Evolutionary Biology, University of California at Irvine; and Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany An information theoretic approach is used to examine the magnitude and origin of associations among amino acid sites in the basic helix-loop-helix (bhlh) family of transcription factors. Entropy and mutual information values are used to summarize the variability and covariability of amino acids comprising the bhlh domain for 242 sequences. When these quantitative measures are integrated with crystal structure data and summarized using helical wheels, they provide important insights into the evolution of three-dimensional structure in these proteins. We show that amino acid sites in the bhlh domain known to pack against each other have very low entropy values, indicating little residue diversity at these contact sites. Noncontact sites, on the other hand, exhibit significantly larger entropy values, as well as statistically significant levels of mutual information or association among sites. High levels of mutual information indicate significant amounts of intercorrelation among amino acid residues at these various sites. Using computer simulations based on a parametric bootstrap procedure, we are able to partition the observed covariation among various amino acid sites into that arising from phylogenetic (common ancestry) and stochastic causes and those resulting from structural and functional constraints. These results show that a significant amount of the observed covariation among amino acid sites is due to structural/functional constraints, over and above the covariation arising from phylogenetic constraints. These quantitative analyses provide a highly integrated evolutionary picture of the multidimensional dynamics of sequence diversity and protein structure. Introduction Analyses integrating protein structure and evolution can proceed in different ways. One popular experimental approach, for example, is to carry out site-directed mutagenesis where specific amino acids within a protein are altered, and then to examine the effects of these changes on protein characteristics. Changes in amino acid attributes can then be correlated with changes in protein characteristics (e.g., Koshi and Goldstein 1997 and references therein). Another approach is to model large families of naturally occurring proteins or protein domains and determine how nature has changed their characteristics over billions of years of evolutionary diversification. By examining patterns of sequence diversity, one can explore how naturally occurring sequence variability and amino acid properties (e.g., hydrophobicity, volume, and charge) are important in maintaining protein structure. The latter approach permits analyses of protein structure and function over extensive geological timescales where evolutionary processes have experimented in nature with amino acid changes with regard to protein stability, foldability, and functionality. Experimental, as well as quantitative, analyses of proteins (including multiple-alignment procedures) often proceed by modeling frequencies of residues at individual amino acid sites. For computational expediency, these analyses assume that amino acid sites are inde- Present address: Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany. Key words: mutual information, protein evolution, entropy, parametric bootstrap, helix-loop-helix. Address for correspondence and reprints: William R. Atchley, Fakultät für Mathematik, Universität Bielefeld, Postfach , D Bielefeld, Germany. Mol. Biol. Evol. 17(1): by the Society for Molecular Biology and Evolution. ISSN: pendent, i.e., the presence of a residue at one site is assumed to be independent of residues at other sites (Swofford et al. 1996). However, it is well known that this assumption is naïve, since the activities and properties of proteins are the result of interactions among their constitutive amino acids. Interactions among amino acid sites include salt bridges between charged residues, hydrogen bonds between electron acceptors and donors, size constraints reflecting structural interactions between large and small side chains, electrostatic interactions, hydrophobic effects, Van der Waal s forces, and similar phenomena. Detecting structural interactions and statistical covariance or associations among separate amino acid sites is fundamental for understanding protein structure and evolution. Consequently, it is important to determine the magnitude and direction of residue covariability, its origin, and its structural and functional significance. Because associations among separate amino acid sites may arise from several different sources, partitioning these associations into their component sources is fundamental to understanding protein structure, function, and evolution. The observed covariation in residue composition between amino acid sites i and j (C ij ) arises from several separate underlying causes, which can be expressed by a linear model of the form: C ij C phylogeny C structure C function C interactions C stochastic. Indeed, the primary null hypothesis to be evaluated for such a model is that any component covariance is equal to zero and, as a consequence, makes no significant contribution to the observed association between amino acid sites i and j. Let us consider what these various sources of variation entail. 164

2 Correlation Among Amino Acid Sites 165 One obvious source of covariation among residues at different sites is common evolutionary history (C phylogeny ). Felsenstein (1985) discussed this problem with regard to evolution of complex polygenic traits among species. He pointed out quite elegantly that species are part of a hierarchically structured phylogeny and therefore cannot be regarded for statistical purposes as being drawn independently from the same distribution. Felsenstein s (1985) argument holds as well for associations among amino acid sites in related proteins. For example, an ancient gene may have undergone early duplications followed by sequence diversification through mutation, natural selection, and genetic drift, which may act differentially in separate evolutionary lineages. The result will be collections of related proteins, e.g., families of bhlh proteins like MyoD and Myc. These families contain a number of functionally and structurally similar proteins that have arisen from a common ancestral protein followed by evolutionary diversification and hierarchical branching (Atchley, Fitch, and Bronner-Fraser 1994; Atchley and Fitch 1995). Within the individual members of such protein families, we would expect to find associations among residues at various amino acid sites that have persisted from the early duplication events. Additionally, covariation among sites can arise for structural or functional reasons, i.e., C structure and C function. In this instance, associations among amino acids arise independently of common ancestry and reflect a bias in amino acid replacements in order to satisfy structural demands. The folded nature of a native functioning protein requires that only certain amino acid replacements can occur at particular sites and still maintain the structural integrity of the folded protein. Furthermore, there are constraints on amino acid replacements that arise for functional reasons, such as amino acid bias at recognition sites related to DNA binding in transcriptional regulators. These functional changes may arise when selection operates to optimize adaptation and subsequently generate protein diversification. Clearly, the main effects in the linear model (i.e., structure, function, and phylogeny) are confounded and therefore are not statistically independent. It is well known that structural and functional changes arise through evolutionary processes. Consequently, inclusion of a covariance term, C interactions, in the model is necessary to account for such higher-order statistical nonindependence. Finally, covariation among sites may occur that cannot be explained by the main effects in the model and their statistical interaction. This component, designated here C stochastic, refers to the lack of fit of the data to the model and is analogous to the unexplained sum of squares in analysis of variance or regression. For the sake of simplicity, this stochastic effect can be assumed to represent background covariability. While it is obvious that covariability among sites has a multidimensional basis, partitioning the observed covariability among sites into appropriate underlying components is not a simple matter. Rather, it is a process fraught with many statistical and computational difficulties. Not the least of these difficulties is that biological sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical analyses typically used to partition variability and covariability are difficult to apply with sequence data. Herein, we use an entropy (information theoretic) approach coupled with simulation-based parametric bootstrap procedures to examine the magnitude and origin of associations among amino acid sites in the highly conserved basic helix-loop-helix (bhlh) domain. The bhlh domain is a DNA-binding and dimerization domain of approximately amino acids found in a large and diverse family of transcription factors (Murre et al. 1994). A number of these proteins have been the focus of detailed structural and functional analyses. Furthermore, the bhlh domain has been the subject of several recent evolutionary analyses (Atchley and Fitch 1997; Atchley, Terhalle, and Dress 1999; Morgenstern and Atchley 1999). The present paper explores a number of questions about amino acid associations and protein structure. First, we ascertain the magnitude of association or covariation among residues between amino acid sites within the highly conserved bhlh domain. Second, we carry out computer simulations to elucidate the underlying origins of the observed associations among amino acid sites. We inquire if the observed covariability arises simply from stochastic events or if it is due to evolutionary history or structural and functional constraints. Third, we integrate measures of variability and covariability derived from information theory with structural data from published crystal studies on the bhlh domain. In doing so, we explore the relationships between primary sequence diversity and protein structure/function. Fourth, we examine the evolution of the -helical structure of the bhlh domain among a diverse collection of proteins. Methods and Materials Data The analyses, and conclusions inherent to these analyses and discussions, are based on 242 bhlh domain sequences reported in Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999). Multiple-sequence alignment was initially carried out using CLUS- TAL W (Thompson, Higgins, and Gibson 1994), and the alignment was then improved by eye. The various phylogenetic analyses, definitions of various clades and evolutionary lineages, descriptions of the protein families, and the like discussed herein are reported in Atchley and Fitch (1997). Structure of bhlh Proteins Crystal structure studies have been carried out on the bhlh domains of six proteins, i.e., Max, E47, MyoD, USF, PHO4, and SREBP (Ferre-D Amare et al. 1993, 1994; Ellenberger et al. 1994; Ma et al. 1994; Brownlie et al. 1997; Shimizu et al. 1997; Parraga et al. 1998). The Max protein, which is the dimerization part-

3 166 Atchley et al. FIG. 1. Illustration showing the protein homodimer DNA interaction for Max. This figure (modified from Ferre-D Amare et al. 1993) shows two -helices separated by a loop. One -helix comprises the basic DNA-binding region and helix 1, while the second component involves helix 1 and the leucine zipper. The basic region is shown interacting with DNA in this figure. Of particular relevance to these discussions is the interaction of the two helical components proximal to the loop. ner of the protooncogene Myc, has been examined in considerable crystallographic detail and shown to have an amphipathic -helical structure (fig. 1) in which the protein has opposing hydrophobic and hydrophilic faces. The crystal structure of the Max homodimer shows it to be a parallel, left-handed, four-helix bundle with a hydrophobic core (Ferre-D Amare et al. 1993). The conserved hydrophobic amino acids from helix 1 (H1) and helix 2 (H2) in Max are buried in the interior of the four-helix bundle, where they pack together and exhibit strong van der Waals interactions that stabilize the structure of the homodimer. The conserved, hydrophobic amino acids in H1 and H2 appear to be required for a stable protein DNA complex formation in Max. The crystal structure of Max appears to be generally similar to that of other bhlh proteins, such as E47 (Ellenberger et al. 1994), MyoD (Ma et al. 1994), USF (Ferre-D Amare et al. 1994), PHO4 (Shimizu et al. 1997, and SREBP (Parraga et al. 1998). Indeed, Parraga et al. (1998) point out the remarkable similarity in structure between SREBP and Max and that the hydrophobic cores are virtually identical in these two distantly related bhlh proteins. The protein Max will be used as the structural model for our discussions, and it will be assumed that the 242 bhlh proteins involved in these analyses have the

4 Correlation Among Amino Acid Sites 167 same general structural features as Max. This extrapolation is based on the studies of Ferre-D Amare et al. (1993, 1994), Ma et al. (1994), Ellenberger et al. (1994), Shimizu et al. (1997), and Parraga et al. (1998). Helical Wheel Projections Helical wheel projections are used in these analyses to provide insight into residue interrelationships with a protein structure. Helical wheels graphically display the disposition of amino acid side chains about an assumed -helix. The projection is along the central axis of the helix, from the N-terminus to the C-terminus, and it is a useful device for displaying the symmetry (or asymmetry) of hydrophobic/hydrophilic side chains. The helical wheel assumes a periodicity of 3.6 residues per helical turn. Thirty-three exemplar sequences are used to explore the phylogenetic aspects of the helical wheel projections. Generally speaking, these 33 sequences are well-studied proteins that reflect the evolutionary diversity of the bhlh domain. The evolutionary relationships among the various clades and lineages for these sequences are represented by a neighbor-joining tree. Sequences are arranged phylogenetically and shown in a helical wheel configuration for helices 1 and 2. Secondary Structure Prediction In several instances, the secondary structure of a particular bhlh-domain-containing protein is examined using the Protein Sequence Analysis (PSA) Server from Boston University. The computer model analyzes amino acid sequences and calculates the probability of secondary structures and folding classes within regions of a sequence. The underlying theory for these predictions is described in White, Stultz, and Smith (1994), and the URL of the server is Variability and Covariability in Protein Sequences As noted earlier, statistical analyses of biological sequences present difficulties because these sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical estimates of variability and covariability are difficult to apply. Recently, several authors have suggested the use of the concepts of entropy and mutual information (Korber et al. 1993; Clarke 1995; Herzel and Gross 1995; Schneider 1996; Roman-Roldan, Bernaola-Gavan, and Oliver 1996; Atchley, Terhalle, and Dress, 1999). Entropy (E) is a measure of uncertainty derived from thermodynamics and statistical physics which has considerable utility for studies of protein structure. Assume X is a discrete random variable (the amino acid sites) for which we are uncertain which of its 20 values (x 1, x 2,...,x 20 ) (amino acid residues) will occur at site X, but we do know their expected frequencies, p i,..., p n. These expected frequencies can be used to calculate how much information E(X) is present at site X. In this context, information is a measure of the uncertainty about which residue will occur at a specified site. The Boltzmann-Shannon entropy E(X) is defined (Applebaum 1996) by n j 2 j j 1 E(X) : p log (p ), (1) where n 20, p j is the probability of an amino acid being of the jth kind, and p j log 2 p j : 0ifp j 0. E 0 when all elements are in the same category (the same amino acid residue at a particular site). E increases with both the number of categories (residues at a site) and their equiprobability. Entropy of a uniform distribution whose range has size n is E n log 2 (n). (2) Thus, the minimum entropy or uncertainty value will be zero when only a single residue occurs at a particular site in all included proteins. The expected maximum entropy will be 4.32 when all 20 residues are present in equal frequencies at a given site. The relative information content of Y contained in X is termed the mutual information, or MI(X, Y), where n m p jk jk j 1 k 1 j k MI(X, Y) p log 2. (3) pq Note that MI(X, Y) MI(Y, X), and if X and Y are independent, then MI(X, Y) 0, corresponding to the fact that no information is obtained regarding Y by finding out about X. In biological sequences, MI describes the extent of correlation or association between residues at amino acid sites X and Y that might arise from evolutionary, functional, or structural constraints. More algebraic details are provided in Atchley, Terhalle, and Dress (1999). Statistical Inference about Mutual Information Values An important question in biological sequence analyses is whether one can distinguish signals due to various biological sources (phylogeny, structure, and function) from any background noise (stochastic variation) inherent in a set of sequences. This is analogous, in quantitative genetics, to partitioning phenotypic variability into genetic components (including additive, dominance, and epistatic variance components) and environmental components. In these analyses, we use a parametric bootstrap approach (Efron and Tibshirani 1993; Goldman 1993; Huelsenbeck, Hillis, and Jones 1996) to generate a distribution of MI values reflecting only covariation involving stochastic and phylogenetic constraints. Additional details of this method are presented in Wollenberg and Atchley (2000). The parameters used in the parametric bootstrap simulations were the phylogenetic tree generated from the aligned protein sequences and a residue substitution matrix. Because the tree was derived from the data and the substitution matrix was not (it was chosen to reflect general amino acid substitution probabilities), the data sets generated in the parametric bootstrap simulations contained only stochastic and phylogenetic associations between sites. A neighbor-joining tree (Saitou and Nei 1987) was computed for the 237 sequences using p-distances. The residue change matrix used was that used in the computer program PAML, version 1.3 (Yang 1997). This

5 168 Atchley et al. matrix was generated using the algorithm of Jones, Taylor, and Thornton (1992) and is hereinafter referred to as the JTT matrix. This matrix does not consider gaps as characters for the generation of replicate data sets. Therefore, for MI values calculated on the empirical data to be comparable with MI values calculated on the parametrically generated data sets, only ungapped sites could be used for this statistical analysis. (All sites were used for generating the phylogeny.) For this reason, the original 242 sequences of Atchley and Fitch (1997) were reduced to 237 to decrease the number of sites in the alignment having gaps. This resulted in 32 sites without gaps for analysis. Like any numerical simulation of a physical process, the results depend on the assumptions of the underlying models for their validity. As in any phylogenetic analysis, results depend on the confidence one has that the tree is a realistic description of the history of the subjects being analyzed. The parametric bootstrap also depends on the tree as the source of information about the level and distribution of sequence variation. The residue substitution matrix used will control the changes that occur between sequences in the simulation. Biases in this matrix can affect the potential associations measured in the resulting simulated sequences. However, a matrix having no biases (i.e., a matrix of uniform substitution probabilities) would ignore the biology of the substitution process. Alternatively, one could use a substitution matrix derived from the empirical data, such as that calculated by the RIND program (Bruno 1996). However, a matrix of this type would reflect biases due to phylogeny, structure, and function that are inherent in the empirical data being analyzed (Wollenberg and Atchley 2000). For these reasons, we used a general protein substitution matrix derived using the JTT algorithm. The statistical significance of the MI values was determined by comparing the frequency distributions of MI values for the 237 bhlh sequences and the results for the parametric bootstrap analyses. Any MI value above a specific threshold was considered to contain significant associations over and above those due to stochastic or phylogenetic constraints. This threshold MI value was that value in the frequency distribution of parametric bootstrap MI values greater than a specified percentage (i.e., 99%, 99.9%) of parametric bootstrap MI values. Thus, this procedure does not test whether a given MI value is different from random; rather, it tests whether an MI value reflects a significant association due to structural and functional constraints over and above covariation arising from evolutionary history and stochastic events. Results Entropy Values Figure 2 provides a histogram of the entropy values (E) for the individual amino acid sites in the bhlh domain for 242 proteins. The basic region extends from site 1 to site 13, helix 1 extends from site 14 to site 28, and helix 2 extends from site 50 to site 64. The basic FIG. 2. Histogram of entropy values for individual amino acid sites in the bhlh domain. and helix 1 regions are modeled as a continuous helix (Ferre-D Amare et al. 1993) separated from the second helix by a variable-length loop. The loop is not considered in these discussions. Actual numerical values of E range from 0.15 (98% L residues) at site 23 to the maximum observed value of 3.45 for highly variable sites 21 and 62. As noted earlier, the maximum possible entropy value will be 4.32 when all 20 residues are present at a site in equal frequencies. Crystal studies of the protein Max by Ferre- D Amare et al. (1993) have shown that residues at various sites pack against each other. These sites are described herein as contact sites, using the numerical site identification described by Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999). Thus, K1 refers to a K residue at site 1. With regard to the contact sites, I16 packs against F20, which contacts R50, I53, and L54. Site L23 abuts I53 and A57. V27 and P28 pack against Y60, and I61 packs against its symmetry mate (the site in the dimerization partner) I61 and Y60, while M64 interacts with its homodimer symmetry mate M64. These packed residues, together with the relevant van der Waals forces, stabilize the structure of the homodimer and help conserve the hydrophobic core. The analyses described here assume the crystal structure of Max, E47, MyoD, USF, PHO4, and SREBP, extrapolated to the other known bhlh proteins. Based on the multiple alignment of the bhlh domain sequences used in Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999), the most prevalent residues at these contact or packing sites are 16 (I, L, V), 20 (F, I, L), 23 (L), 27 (I, L, V), and 28 (L) in helix 1, and 50 (K), 53 (I, T, V), 57 (A), 60 (Y), and 64 (L) in helix 2. Thus, the five relevant contact sites in helix 1 are highly hydrophobic while the five sites in helix 2 (except for site 50, which initiates the second helix) are predominantly hydrophobic. The packing interrelationships among sites in helix 1 and helix 2 are shown graphically in figure 3. Additionally, structural studies on SREBP (Parraga et al. 1998) also indicate interactions between site 12 in the basic region with site 17 in helix 1 and site 50 in helix 2 in the other monomer of the homodimer. All known core contact positions given by Ferre- D Amare et al. (1993) and Parraga et al. (1998) have

6 Correlation Among Amino Acid Sites 169 FIG. 3. Interactions (packing) between helix 1 and helix 2 sites in bhlh proteins. The dotted lines between numbered boxes refer to those sites known to pack together. The entropy values for each site are given in italics. Sites that are symmetry mates with the dimerization partner are underlined. entropy values which range from 0.15 to 2.27 (table 1). All of the remaining amino acid positions except two (described below) have entropy values ranging from 2.46 to The distributions of E values for contact versus noncontact values are nonoverlapping, and their medians are statistically highly significantly different by a Mann-Whitney test (Sokal and Rohlf 1995) (table 1). One might conclude that contact residues can be identified by their entropy values, i.e., assigning contact positions to those with E values below a particular threshold value (approximately 2.3 in the case of these bhlh domains). While this works for the bhlh domain, it is unknown whether it can be generalized to other proteins. If the various amino acid residues are classified into functional groups (e.g., D, E acidic; K, R, H basic) as described in Atchley, Terhalle, and Dress (1999), the contact sites have entropy values ranging from 0.03 to All values are 1.0 except those of site 60 (1.31) and site 28 (1.64). For sites not in contact, the entropy values for functional groups range from 0.58 to The noncontact site with the smallest entropy value (site 51, with E 0.58) has a variety of residues, but they are all aliphatic/hydrophobic, which accounts for this seemingly anomalously low value. A. R. Ferre-D Amare generously provided information from his crystal studies about the structural interactions of site 51. The side chains of the alanine residue in Max and the leucine residue in E47 at site 51 interact with those of site 16 to facilitate stabilization of the protein dimers hydrophobic core. Furthermore, in SREBP, the serine at site 51 makes a water-mediated hydrogen bond with a phosphate oxygen anchoring the dimer to DNA. Other bhlh proteins have histidine residues at site 51, and this side chain could make DNA contacts as well. Our results suggest that two positions not originally considered contact sites by Ferre-D Amare et al. (1993) may be such. Sites 17 and 24 have E values of 1.18 and 1.93, respectively (0.93 and 0.94 with residues classified into functional groups), which are well within the range Table 1 Entropy Values for Amino Acid Sites in Helix 1 and Helix 2 that Are in Contact or Not in Contact Sites in Contact Sites Not in Contact Site Entropy E F Site Entropy E F * * Mean Mean SD SD NOTE. The Entropy columns give the entropy values for individual amino acids at each site. E F is the entropy value when the amino acids at each site are converted to their functional groups (sensu Atchley, Terhalle, and Dress 1999). Sites 17 and 24 are putative contact sites based on their entropy values (see text for more details). A Mann-Whitney test of the null hypothesis that these two sets of sites have the same median entropy value is rejected at P for both E and E F. of the E values for contact sites (table 1). Further examination of the bhlh structural data (A. R. Ferre D Amare, personal communication) suggests that site 17 functions in a water-mediated DNA-protein contact. This interrelationship has apparently resulted in a high level of conservation of hydrophilic residues at this site (74% asparagine, 24% lysine or arginine). This interaction is clearly demonstrated in the sterol-regulatoryelement-binding proteins (SREBPs) by Parraga et al. (1998). Unfortunately, the situation with site 24 is more difficult to explain. Side chains in this position form part of an extension of the hydrophobic core of the bhlh dimer by burying under the flap formed by the loop component. The conservation of longer side-chain residues at site 24 may stem from their hydrophobic methylenes being partially buried and the hydrophilic tips being exposed (A. R. Ferre-D Amare, personal communication). Additional high-resolution studies of the bhlh domain may provide a biological explanation for these quantitative observations on site 24. Variability in Buried and Exposed Sites The results reported here are in line with many previous observations that internal (buried) residues are less variable than external or exposed residues, and less variation means lower entropy values (e.g., Goldman, Thorne, and Jones 1998). Indeed, Atchley, Terhalle, and Dress (1999) previously tested this hypothesis and found that the entropy values of the buried sites are significantly smaller than those of the exposed sites. Mutual Information Tables 2 and 3 provide MI values describing the level of association among amino acid sites within the

7 170 Atchley et al. Table 2 Mutual Information (MI) Values 1.0 Within and Between Components for Basic, Helix 1, and Helix 2 Sites Basic region Helix 1 Helix NOTE. Only values 1.0, which reflect the top 5% of all MI values for the bhlh domain, are included. Basic region sites are 1 13, helix 1 sites are 14 28, and helix 2 sites are Sites from the loop region are not included. bhlh domain. Values 1.0 reported in table 2 constitute the top 5% of MI values for all bhlh domain sites as reported by Atchley, Terhalle, and Dress (1999). When the sites with MI 1.0 are arranged in a network, specific patterns of association are made apparent (fig. 4). Within this network are subnetworks consisting of sets of sites for which each site has connections to all other sites in the subnetwork. These completely connected subnetworks correspond to the cliques previously defined in Atchley, Terhalle, and Dress (1999). A maximum clique corresponds to the largest maximally connected subnetwork. The two maximum cliques for the data from table 2 are presented in figure 4A and B. Mutual Information Between the Basic Region and Helices 1 and 2 Table 2 describes the pattern in mutual information values between the basic DNA-binding region (sites 1 13) and helices 1 and 2. Within the basic component itself, MI values 1.0 are noted between sites (3, 4), (3, 7), (4, 5) (4, 7), (4, 8), and (7, 11). Sites 3, 4, 7, and 8 in the basic region show high levels of association with specific sites in the two heli- Table 3 Entropy (E) and Mutual Information (MI) Values for Sites in Helix 1 (H1) (Sites 14 28) and Helix 2 (H2) Sites (50 64) H2 SITES 64 (E 1.21) 63 (E 3.36) 62 (E 3.45) 61 (E 1.46) 60 (E 1.25) 59 (E 3.16) 58 (E 2.49) 57 (E 1.23) 56 (E 3.08) 55 (E 2.84) 54 (E 0.20) 53 (E 1.24) 52 (E 3.00) 51 (E 2.46) 50 (E 0.52) E H * * NOTE. Sites known to interact in the protein Max are underlined and in bold italic type. Two sites predicted from these analyses to interact are indicated with asterisks. MI values 0.56 reflect sites with high probabilities (P 99%) of having associations due to structural and functional constraints. Values 1.0 represent the top 5% of all bhlh MI values and are given in bold type and underlined.

8 Correlation Among Amino Acid Sites 171 From Table 3, it is clear that sites with low sequence diversity (small entropy values) also show little covariation with other amino acid sites. This is to be expected, since residues at paired sites do not covary if the individual sites themselves had little residue variation. Thus, the four primary sites in each helix previously shown to pack together (sites 16, 20, 23, and 27 in helix 1, and sites 50, 53, 57, and 60 in helix 2) exhibit very little residue variability (low entropy values), and there is very little covariability among the contact sites. The latter finding is reflected by the fact there are no MI values among these eight contact residues higher than 0.22 and, as will be seen below, none show significant covariation due to structural and functional constraints. Several sites within the helices with higher residue diversity exhibit considerable mutual information with other variable sites. Thus, site 23 (E 3.48) in helix 1 exhibits the highest observed MI values (MI 1.19) with sites 52, 55, and 56 in helix 2. In helix 2, site 52 likewise shows high MI values with sites 19, 21, 25, and 26 in helix 1. Note, however, that while site 18 exhibits considerable residue diversity (E 3.26), it does not necessarily exhibit high MI values with other sites. This demonstrates that high entropy does not necessarily produce high values of mutual information even in highly conserved protein domains. FIG. 4. Network diagrams of the data from table 2 (pairs of sites with MI 1.0). Maximum cliques are indicated by the heavy black lines. A, Maximum cliques containing site 7 from the basic region. B, Maximum clique containing site 52 from helix 2. ces, primarily sites 14 and 21 in helix 1 and sites 52, 56, and 62 in helix 2. The helical wheel assumes a periodicity of 3.6 residues per helical turn, so site 14 is at the start of helix 1, and site 21 is two complete turns into the helix. In rank order, the MI values for the top 13 paired sites are as follows: 3 and 21 (1.20), 7 and 21 (1.19), 3 and 14 (1.18), 3 and 4 (1.17), 3 and 56 (1.15), 3 and 62 (1.13), 7 and 15 (1.12), 8 and 19 (1.12), 4 and 7 (1.11), 7 and 11 (1.10) 4 and 21 (1.10), 4 and 52 (1.10), and 11 and 21 (1.10). Mutual Information Within and Between Helix 1 and Helix 2 Table 3 provides the E and MI values for interacting sites in helices 1 and 2. Packed or contact sites are underlined and in italics in table 3, and entropy values are provided for each site. These values provide a description of residue diversity at each site, ranging from E 0.15 at site 23 (which is 98% leucine in this large database) to E 3.48 at site 21. The maximum possible value for E is The remainder of the table provides pairwise MI values for these amino acid sites. Sites in this subset with MI values 1.0 are shown in bold type and underlined. As a point of reference, the largest observed MI value for the entire bhlh domain was 1.26 between sites 14 and 21. Origins of Significant Mutual Information There are several possible explanations for significant levels of association among many sites within the bhlh domain (other than simply chance associations). These include associations arising from evolutionary constraints, correlated mutations, functional associations, and structural constraints. Simulation and the Partition of Observed Associations A simulation was carried out using parametric bootstrap procedures to partition the observed covariation among amino acid sites into that due to evolutionary history (phylogeny) and stochastic events on the one hand, and that due to structural and functional constraints on the other. The distributions of MI values for the parametric bootstrap data and the empirical data were significantly different at P This suggests that there are significant associations among many amino acid sites in the bhlh domain over and above those due to stochastic and phylogenetic effects. The distribution of MI values from 1,000 parametric bootstrap replicates, calculated using the neighborjoining tree and the JTT substitution matrix, was compared with the distribution of MI values for 237 bhlh proteins (fig. 5). This comparison permits calculation of threshold values for distinguishing between structural/ functional and phylogenetic/chance associations. In these analyses, sites having MI values above a given threshold value have a specific probability of covariation due to structural and functional constraints, rather than due to phylogenetic constraint or chance. The specific MI value used as this threshold has an associated prob-

9 172 Atchley et al. FIG. 5. Inverse cumulative frequency distributions of MI values for the alignment of 237 bhlh protein sequences and 1,000 parametric bootstrap replicates using the JTT substitution matrix. Three probability thresholds (P 0.01, P 0.001, and P ), their associated MI values, and the the probability associated with MI are indicated. These thresholds were calculated from the JTT bootstrap distribution. Because MI is a pairwise measure, each replicate consisted of x(x 1)/2 values, where x is the number of sites in the alignment. For the alignment of 237 bhlh sequences, there were 32 sites without gaps, resulting in 496 MI values per replicate. ability based on the number of values in the parametric bootstrap distribution that are greater than the threshold. Based on the frequency distribution shown in figure 5, any pair of nongapped sites in the bhlh alignment that have MI values will have a probability 0.01 that the covariation among sites is due to phylogeny or chance associations. With a threshold value of 0.749, the probability of this level of covariation being due to phylogeny or chance is reduced by an order of magnitude to P In table 2, sites are designated that have MI values 1.0 and it is noted that they reflect the top 5% of all MI values for the entire bhlh domain. These values 1.0 have a probability of P (from the parametric bootstrap simulations) that the associations are due to phylogeny or chance associations. Among the contact sites, there is no significant covariation from structural and functional constraints. However, there is considerable significant covariation among noncontact sites that stems from structural and functional constraints (fig. 6). The latter is evident in those MI values larger than 0.56, the 1% critical value from the parametric bootstrap simulations. Correlations in Hydropathy From a structural and functional perspective, correlated amino acid replacements may have occurred among sites to maintain the same hydropathic relationships, similar electrostatic charges, size constraints, etc. It is difficult to examine statistical associations among sites with regard to electrostatic charge, since only five amino acids (K, R, H, D, and E) are charged and 15 residues have no charge. Since paired sites must show variability to exhibit correlation, attempts to compute correlation coefficients with invariant data (charge of zero in both variables) is undefined, as can be seen by equation (3). Consequently, we are unable to examine associations due to charge and have focused instead on the possibility of significant associations due to hydropathy and size constraints. For 33 exemplar sequences, residues in helix 1 and helix 2 were coded as 1 if they were hydrophobic (A, C, G, I, L, M, F, P, and V), 1 if they were hydrophilic (R, N, D, E, Q, H, and K), and 0 if they were S, T, Y, or W. Then, pairwise product-moment correlation coefficients (r) were computed among all sites in the two helices to determine if significant associations existed among sites for hydropathy states. Several pairs of sites showed high correlations, including sites 20 and 23 (r 0.89), 17 and 61 (r 0.81), 50 and 61 (r 0.75), and 27 and 61 (r 0.71). (Any product-moment correlation 0.45 in this analysis is significant at P 0.01) As can be seen in the data given in figure 7, the high positive correlations relate to strong association for hydrophobic residues at sites 20 and 23, as well as at sites 27 and 61 (fig. 8). The negative values refer to inverse relationships between hydrophobic and hydrophilic residues. In addition to these four pairs of sites, two pairs of sites (sites 23 and 27 and sites 25 and 50) had correlations 0.6. Correlations in the Sizes of Amino Acid Residues For these same 33 bhlh sequences, product-moment correlation coefficients were computed for the sizes (volume) of residues at pairs of sites for helices 1 and 2. Seven pairwise correlations occurred that were statistically significant. These included sites 23 and 50 (r 0.66), 50 and 60 (r 0.64), 53 and 54 (r 0.57), 22 and 26 (r 0.57), 23 and 60 (r 0.52), 54 and 60 (r 0.51), and 56 and 61 (r 0.50). However, some

10 Correlation Among Amino Acid Sites 173 FIG. 6. Representation of high mutual information values among a subset of amino acid sites in bhlh proteins. The solid boxes are the conserved sites known to pack together (see fig. 1) from crystal structure studies. The lines between sites in helix 1 and helix 2 reflect the 12 sites with the highest mutual information ( 0.9). values must be viewed with caution because of computational difficulties associated with high levels of conservation (low entropy) and different residues with equivalent volumes. In spite of these caveats, there are several interesting findings here involving associations based on size. The sites with the largest correlation coefficient (sites 20 and 53) are contact sites, and these observations indicate that there is a like association; i.e., large residues are paired with large residues. Thus, when F occurs at site 20, I, L, Y, or T occurs at site 53. Similarly, an L at site 20 pairs with an L, T, or V at site 53. The largest negative value occurs between two adjacent sites (sites 53 and 54). It might be expected that significant inverse relationships would exist for the volume of adjacent residues. However, the difference between the entropy values for sites 53 (E 1.24) and 54 (E 0.20) stresses the need for caution in interpreting this correlation, since variable site 53 is paired with a largely unvaried site 54. The correlation coefficient of 0.57 between sites 22 and 26 probably represents a meaningful structural/functional association, because there is considerable residue variability at both of these sites. Both sites occur away from the contact sites and, consequently, exhibit more variability. Helical Wheels and Sequence Diversity The helical wheel shown in figure 7 displays the helical distribution of residues including both the basic DNA-binding region and helix 1, while figure 8 provides a helical wheel for amino acid sites 50 64, which constitute helix 2. These figures summarize information for the 392 bhlh domain sequences in the database. The most prevalent residues at each site are shown in the figure. Amino acid cliques (sensu Atchley, Terhalle, and Dress 1999) shown in these figures are defined for each helix. Cliques are groups of amino acid positions FIG. 7. A helical wheel drawing of the basic DNA-binding region and helix 1. Highly conserved sites are shaded, and those sites involved in the predictive motif of Atchley, Terhalle, and Dress (1999) are denoted with an X. The five sites involved in the highest-ranked multisite clique are enclosed with a dotted line. all of which are more highly associated with each other than any are with a nonmember of the clique. Maximal cliques are those not contained in larger cliques. Finally, those sites involved in determining the predictive motif described in Atchley, Terhalle, and Dress (1999) are marked by an X. This predictive motif is a collection of 19 highly conserved sites whose amino acid compositions accurately discriminate bhlh-domain-containing proteins into groups A D according to the evolutionary classification proposed by Atchley and Fitch (1997). In Figure 7, sites are denoted that have entropy values 2.0, values between 2.0 and 2.4, and values 2.4. The most prevalent amino acid residues at each site are noted with the appropriate symbols where possible. Furthermore, five sites (sites 3, 4, 7, 14, and 21) that constitute the highest-ranked multisite clique in helix 1 are denoted. FIG. 8. Helical wheel of helix 2 sites. Labeling conventions follow figure 7.

11 174 Atchley et al. FIG. 9. Neighbor-joining tree for 33 exemplar proteins and the distribution of residues for helix 1 and helix 2. Within a helix, residues are grouped by their positions in the helical wheel (figs. 6 and 7). In the tree, lineages representing Groups A D of Atchley and Fitch (1997) are denoted. In the DNA-binding region, there are five strongly conserved sites with E 1.7 (sites 1, 2, 9, 10, and 12) (fig. 7), and four of these are highly basic in that they have K or R residues in great preponderance. The exception is site 9, which has a glutamic acid in 93% of all sequences. The glutamic acid at site 9 contacts the C in the E-box (CANNTG), and its presence indicates that DNA binding occurs. Those bhlh proteins lacking an E at site 9 do not bind DNA (groups C and D, sensu Atchley and Fitch 1997). The remaining highly variable sites in the basic region (5, 6, and 11) do not appear to show a systematic pattern of functional group amino acids. The remainder of the sites shown in figure 7 (sites 14 28) constitute helix 1 in the bhlh domain. There are a number of highly conserved sites constituting one face on the helical wheel. These include sites 16, 17, 20, 23, 24, 27, and 28, and they comprise a conspicuous distribution. Sites 16, 20, 23, and 27 are hydrophobic sites with a high preponderance of I, L, V, and F amino acid residues. According to Klingler and Brutlag (1994) and others, there is a hydrophobic periodicity that characterizes many helices, i.e., the relative positions of amino acids in an amphipathic helix may influence their interresidue correlation structure. Thus, an amino acid at position i may show a preference for similar types of amino acids at sites i 3 and i 4, which are on the same side of the helix. Analogously, an amino acid at position i may show a preference for dissimilar amino acid types at positions i 2 and i 5. More explicitly, if a hydrophobic residue occurs at site i, there is a greater expectation of seeing a hydrophobic residue at sites i 3 and i 4. Thus, the more coincident two residues are on one side of an helix, the more likely they are to be of the same hydropathy. Conversely, the closer to a 180 separation two residues are, the more likely they are to be of opposite hydropathies (Klingler and Brutlag 1994). Sites 16, 20, 23, and 27 are three to four sites apart, in accordance with the i 3 and i 4 pattern of hydrophobic periodicity described by Klingler and Brutlag (1994). Thus, this set of four sites provides a highly hydrophobic face for the helix. Sites 17 and 24, on the other hand, are conserved sites with hydrophilic residues

12 Correlation Among Amino Acid Sites 175 (K, R, N) which provide the hydrophilic face indicative of amphipathic helices. Finally, site 28 has a high frequency of P residues (63%), indicative of the last site of an helix. At site 32, 35% of the sequences have P residues. Some proteins, like MyoD, continue helix 1 another turn, and the protein is turned out of the helix into the loop one turn later than in other bhlh proteins. Figure 8 shows the helical wheel for helix 2. The helix starts with site 50, which is a highly conserved K residue (93% of all sequences). Hydrophobic periodicity is easily seen relative to sites 57 and 64, which are predominantly hydrophobic residues, as are the i 3 and i 4 sites. However, the i 2 and i 5 sites to sites 57 and 60 are not necessarily hydrophobic and are rather diverse in their amino acid compositions. This is also the case for sites 54 and 61. On average, helix 2 appears to be more hydrophobic than helix 1. The predictive motif proposed by Atchley, Terhalle, and Dress (1999) to discriminate bhlh proteins employs sites that fall on these highly conserved faces of the two helices. Phylogenetic Aspects of the Helical Configuration To explore the phylogenetic aspects of an helix configuration, we combined a neighbor-joining tree of 33 bhlh domains with the distribution of their amino acids along a helical wheel (fig. 9). The proteins chosen for these analyses were simply some typical representatives of the various clades and evolutionary lineages as reported by Atchley and Fitch (1997). This tree delimits those proteins that belong to groups A D. Group A and B bhlh domain proteins are most prevalent in the literature and databases, followed by those of group C. Group A proteins bind to a CAGCTG E-box configuration, while group B proteins bind to the CACGTG E-box configuration. Group C has a more complex DNA-binding behavior, and group D does not bind DNA. Atchley, Terhalle, and Dress (1999) derived a 19- element predictive motif that accurately identifies bhlh-domain-containing proteins. Groups A and B show the smallest deviation from this motif, while groups C and D deviate considerably, mostly in the basic DNA-binding region, as would be expected. Proteins that fit this predictive motif most closely include MyoD and Achaete-Scute (no mismatches) and LYL1, d- HAND, and SREBP (one mismatch). The most divergent proteins are INO4 and CENP-B, with 10 mismatches, while INO2 has 8 mismatches. (EMC also has 10 mismatches, but this is to be expected since it is a group D protein which does not have a typical basic DNAbinding component.) Let us consider the structures of those proteins that deviate most from the predictive motif. CENP-B was originally described as a bhlh protein (Sullivan and Glass 1991; Sugimoto, Muro, and Himeno 1992). However, Atchley and Fitch (1997) suggested that it deviated considerably in its primary sequence from more typical bhlh proteins. In addition to considerable deviation from the predictive motif for bhlh domains (Atchley, Terhalle, and Dress 1999), the first residue in helix 1 is a proline. Proline is an amino acid generally associated with breaking of helices, and, indeed, the last residue in helix 1 and the first one in the loop regions are prolines. Analysis of the secondary structure of the CENP- B bhlh domain by the PSA algorithm described in Materials and Methods indicates that the stretch of residues considered to be the helix 1 region have low probabilities of being helices. The probability values range from 0.34 for the first residue (proline) to 0.56 (an isoleucine midway in helix 1). The means and standard deviations for the probabilities of being helices for the residues in the basic, H1, L, and H2 regions are 0.67 ( 0.16), 0.37 ( 0.15), 0.40 ( 0.15), and 0.55 ( 0.15), respectively. An analysis of variance of the basic and H1 regions to test the null hypothesis that the basic and helix 1 regions have an equal probability of being helices is rejected at P Recently, it was suggested that CENP-B is not a helix-loop-helix protein but, rather, should be classified as helix-turn-helix (Iwahara et al. 1998). Our analyses here suggest that CENP-B probably should not be classified as an HLH protein. Another protein that deviates considerably from the predictive model is INO4, which is believed to positively regulate the coordinate expression of phospholipid structural genes in yeast (Nikoloff and Henry 1994). The fit of INO4 to the predictive motif in helix 2 is particularly bad (five mismatches). However, in spite of the sequence differences, the PSA analyses indicate that INO4 fits the helix model rather well. Our analyses suggest that a more detailed analysis of the structure of INO4 might be fruitful. The extent of conservation at particular sites and pairs of sites can now be more clearly seen, together with deviations from these patterns of conservation. For example, all of the proteins except four have proline residues at site 28. The exceptions include the proteins ADD1, INO2, MyoD, and E12. Furthermore, only a single sequence (PHO4) has a residue other than L or I at site 54. In PHO4, the residue at this site is an E. Discussion A clear understanding of interactions among amino acid sites is fundamental to producing a comprehensive model for the structure, function, and evolution of proteins. Because of the limited number of secondary and tertiary structures that proteins can assume (Sternberg 1996), it can be argued that evolution at the amino acid sequence level is constrained by higher-order structure. This phylogenetic inertia (sensu Felsenstein 1985), which determines the rate and direction of evolutionary change at the sequence level, is reflected in the variability and covariability among primary sequence elements. Covariance structure and phylogenetic inertia are important concepts in Darwinian evolution that have not been adequately explored at the molecular level. The analyses reported here deal with the magnitude and origins of covariability among amino acid sites and their relevance to protein structure and evolution.

Solving the protein sequence metric problem

Solving the protein sequence metric problem Solving the protein sequence metric problem William R. Atchley*, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke *Department of Genetics, Bioinformatics Research Center, Graduate Program in Biomathematics,

More information

ABSTRACT. ZHAO, JIEPING. Multivariate statistical analysis of protein variation. (Under the direction of Dr. William R. Atchley)

ABSTRACT. ZHAO, JIEPING. Multivariate statistical analysis of protein variation. (Under the direction of Dr. William R. Atchley) ABSTRACT ZHAO, JIEPING. Multivariate statistical analysis of protein variation. (Under the direction of Dr. William R. Atchley) The purpose of this research is to study the protein sequence metric problem

More information

Spectral Analysis of Sequence Variability in Basic-Helix-loop-helix (bhlh) Protein Domains

Spectral Analysis of Sequence Variability in Basic-Helix-loop-helix (bhlh) Protein Domains ORIGINAL RESEARCH Spectral Analysis of Sequence Variability in Basic-Helix-loop-helix (bhlh) Protein Domains Zhi Wang 1 and William R. Atchley 1,2 1 Graduate Program In Biomathematics And Bioinformatics,

More information

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES Protein Structure W. M. Grogan, Ph.D. OBJECTIVES 1. Describe the structure and characteristic properties of typical proteins. 2. List and describe the four levels of structure found in proteins. 3. Relate

More information

From Amino Acids to Proteins - in 4 Easy Steps

From Amino Acids to Proteins - in 4 Easy Steps From Amino Acids to Proteins - in 4 Easy Steps Although protein structure appears to be overwhelmingly complex, you can provide your students with a basic understanding of how proteins fold by focusing

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition David D. Pollock* and William J. Bruno* *Theoretical Biology and Biophysics, Los Alamos National

More information

A natural classification of the basic helix loop helix class of transcription factors

A natural classification of the basic helix loop helix class of transcription factors Proc. Natl. Acad. Sci. USA Vol. 94, pp. 5172 5176, May 1997 Evolution A natural classification of the basic helix loop helix class of transcription factors WILLIAM R. ATCHLEY* AND WALTER M. FITCH *Center

More information

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION AND CALIBRATION Calculation of turn and beta intrinsic propensities. A statistical analysis of a protein structure

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Non-independence in Statistical Tests for Discrete Cross-species Data

Non-independence in Statistical Tests for Discrete Cross-species Data J. theor. Biol. (1997) 188, 507514 Non-independence in Statistical Tests for Discrete Cross-species Data ALAN GRAFEN* AND MARK RIDLEY * St. John s College, Oxford OX1 3JP, and the Department of Zoology,

More information

ABSTRACT. Wang, Zhi. Spectral Analysis of Protein Sequences. (Under the direction of Dr.

ABSTRACT. Wang, Zhi. Spectral Analysis of Protein Sequences. (Under the direction of Dr. ABSTRACT Wang, Zhi. Spectral Analysis of Protein Sequences. (Under the direction of Dr. William R. Atchley and Dr. Charles E. Smith.) The purpose of this research is to elucidate how to apply spectral

More information

CHAPTER 29 HW: AMINO ACIDS + PROTEINS

CHAPTER 29 HW: AMINO ACIDS + PROTEINS CAPTER 29 W: AMI ACIDS + PRTEIS For all problems, consult the table of 20 Amino Acids provided in lecture if an amino acid structure is needed; these will be given on exams. Use natural amino acids (L)

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Basics of protein structure

Basics of protein structure Today: 1. Projects a. Requirements: i. Critical review of one paper ii. At least one computational result b. Noon, Dec. 3 rd written report and oral presentation are due; submit via email to bphys101@fas.harvard.edu

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

The protein folding problem consists of two parts:

The protein folding problem consists of two parts: Energetics and kinetics of protein folding The protein folding problem consists of two parts: 1)Creating a stable, well-defined structure that is significantly more stable than all other possible structures.

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Ziheng Yang Department of Biology, University College, London An excess of nonsynonymous substitutions

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Viewing and Analyzing Proteins, Ligands and their Complexes 2

Viewing and Analyzing Proteins, Ligands and their Complexes 2 2 Viewing and Analyzing Proteins, Ligands and their Complexes 2 Overview Viewing the accessible surface Analyzing the properties of proteins containing thousands of atoms is best accomplished by representing

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Dana Alsulaibi. Jaleel G.Sweis. Mamoon Ahram

Dana Alsulaibi. Jaleel G.Sweis. Mamoon Ahram 15 Dana Alsulaibi Jaleel G.Sweis Mamoon Ahram Revision of last lectures: Proteins have four levels of structures. Primary,secondary, tertiary and quaternary. Primary structure is the order of amino acids

More information

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix) Computat onal Biology Lecture 21 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22 Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

More information

Supporting Information

Supporting Information Supporting Information I. INFERRING THE ENERGY FUNCTION We downloaded a multiple sequence alignment (MSA) for the HIV-1 clade B Protease protein from the Los Alamos National Laboratory HIV database (http://www.hiv.lanl.gov).

More information

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty.

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty. Lecture 9: Protein Architecture V: Evolution, Function & Classification Margaret A. Daugherty Fall 2004 Amino acid use *Proteins don t use aa s equally; eg, most proteins not repeating units. Caveat: collagen

More information

Computational modeling of G-Protein Coupled Receptors (GPCRs) has recently become

Computational modeling of G-Protein Coupled Receptors (GPCRs) has recently become Homology Modeling and Docking of Melatonin Receptors Andrew Kohlway, UMBC Jeffry D. Madura, Duquesne University 6/18/04 INTRODUCTION Computational modeling of G-Protein Coupled Receptors (GPCRs) has recently

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 8.3.1 Simple energy minimization Maximizing the number of base pairs as described above does not lead to good structure predictions.

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

2 Genome evolution: gene fusion versus gene fission

2 Genome evolution: gene fusion versus gene fission 2 Genome evolution: gene fusion versus gene fission Berend Snel, Peer Bork and Martijn A. Huynen Trends in Genetics 16 (2000) 9-11 13 Chapter 2 Introduction With the advent of complete genome sequencing,

More information

PROTEIN STRUCTURE AMINO ACIDS H R. Zwitterion (dipolar ion) CO 2 H. PEPTIDES Formal reactions showing formation of peptide bond by dehydration:

PROTEIN STRUCTURE AMINO ACIDS H R. Zwitterion (dipolar ion) CO 2 H. PEPTIDES Formal reactions showing formation of peptide bond by dehydration: PTEI STUTUE ydrolysis of proteins with aqueous acid or base yields a mixture of free amino acids. Each type of protein yields a characteristic mixture of the ~ 20 amino acids. AMI AIDS Zwitterion (dipolar

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Outline. Classification of Living Things

Outline. Classification of Living Things Outline Classification of Living Things Chapter 20 Mader: Biology 8th Ed. Taxonomy Binomial System Species Identification Classification Categories Phylogenetic Trees Tracing Phylogeny Cladistic Systematics

More information

Bio nformatics. Lecture 23. Saad Mneimneh

Bio nformatics. Lecture 23. Saad Mneimneh Bio nformatics Lecture 23 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1:

Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1: Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1: Biochemistry: An Evolving Science Tips on note taking... Remember copies of my lectures are available on my webpage If you forget to print them

More information

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years. Structure Determination and Sequence Analysis The vast majority of the experimentally determined three-dimensional protein structures have been solved by one of two methods: X-ray diffraction and Nuclear

More information

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE Examples of Protein Modeling Protein Modeling Visualization Examination of an experimental structure to gain insight about a research question Dynamics To examine the dynamics of protein structures To

More information

Introduction to" Protein Structure

Introduction to Protein Structure Introduction to" Protein Structure Function, evolution & experimental methods Thomas Blicher, Center for Biological Sequence Analysis Learning Objectives Outline the basic levels of protein structure.

More information

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics. Evolutionary Genetics (for Encyclopedia of Biodiversity) Sergey Gavrilets Departments of Ecology and Evolutionary Biology and Mathematics, University of Tennessee, Knoxville, TN 37996-6 USA Evolutionary

More information

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition Sequence identity Structural similarity Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Fold recognition Sommersemester 2009 Peter Güntert Structural similarity X Sequence identity Non-uniform

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

Biophysics II. Hydrophobic Bio-molecules. Key points to be covered. Molecular Interactions in Bio-molecular Structures - van der Waals Interaction

Biophysics II. Hydrophobic Bio-molecules. Key points to be covered. Molecular Interactions in Bio-molecular Structures - van der Waals Interaction Biophysics II Key points to be covered By A/Prof. Xiang Yang Liu Biophysics & Micro/nanostructures Lab Department of Physics, NUS 1. van der Waals Interaction 2. Hydrogen bond 3. Hydrophilic vs hydrophobic

More information

Lecture 2 and 3: Review of forces (ctd.) and elementary statistical mechanics. Contributions to protein stability

Lecture 2 and 3: Review of forces (ctd.) and elementary statistical mechanics. Contributions to protein stability Lecture 2 and 3: Review of forces (ctd.) and elementary statistical mechanics. Contributions to protein stability Part I. Review of forces Covalent bonds Non-covalent Interactions: Van der Waals Interactions

More information

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny 008 by The University of Chicago. All rights reserved.doi: 10.1086/588078 Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny (Am. Nat., vol. 17, no.

More information

Model Worksheet Teacher Key

Model Worksheet Teacher Key Introduction Despite the complexity of life on Earth, the most important large molecules found in all living things (biomolecules) can be classified into only four main categories: carbohydrates, lipids,

More information

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26 Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Chem. 27 Section 1 Conformational Analysis Week of Feb. 6, TF: Walter E. Kowtoniuk Mallinckrodt 303 Liu Laboratory

Chem. 27 Section 1 Conformational Analysis Week of Feb. 6, TF: Walter E. Kowtoniuk Mallinckrodt 303 Liu Laboratory Chem. 27 Section 1 Conformational Analysis TF: Walter E. Kowtoniuk wekowton@fas.harvard.edu Mallinckrodt 303 Liu Laboratory ffice hours are: Monday and Wednesday 3:00-4:00pm in Mallinckrodt 303 Course

More information

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted

More information

The MOLECULES of LIFE

The MOLECULES of LIFE The MLEULE of LIFE Physical and hemical Principles olutions Manual Prepared by James Fraser and amuel Leachman hapter 1 From enes to RA and Proteins Problems and olutions and Multiple hoice 1. When two

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Charged amino acids (side-chains)

Charged amino acids (side-chains) Proteins are composed of monomers called amino acids There are 20 different amino acids Amine Group Central ydrocarbon N C C R Group Carboxyl Group ALL amino acids have the exact same structure except

More information

Packing of Secondary Structures

Packing of Secondary Structures 7.88 Lecture Notes - 4 7.24/7.88J/5.48J The Protein Folding and Human Disease Professor Gossard Retrieving, Viewing Protein Structures from the Protein Data Base Helix helix packing Packing of Secondary

More information

Big Idea #1: The process of evolution drives the diversity and unity of life

Big Idea #1: The process of evolution drives the diversity and unity of life BIG IDEA! Big Idea #1: The process of evolution drives the diversity and unity of life Key Terms for this section: emigration phenotype adaptation evolution phylogenetic tree adaptive radiation fertility

More information

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building How Molecules Evolve Guest Lecture: Principles and Methods of Systematic Biology 11 November 2013 Chris Simon Approaching phylogenetics from the point of view of the data Understanding how sequences evolve

More information

Protein Structure & Motifs

Protein Structure & Motifs & Motifs Biochemistry 201 Molecular Biology January 12, 2000 Doug Brutlag Introduction Proteins are more flexible than nucleic acids in structure because of both the larger number of types of residues

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation 1 dist est: Estimation of Rates-Across-Sites Distributions in Phylogenetic Subsititution Models Version 1.0 Edward Susko Department of Mathematics and Statistics, Dalhousie University Introduction The

More information

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon A Comparison of Methods for Assessing the Structural Similarity of Proteins Dean C. Adams and Gavin J. P. Naylor? Dept. Zoology and Genetics, Iowa State University, Ames, IA 50011, U.S.A. 1 Introduction

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Contrasts for a within-species comparative method

Contrasts for a within-species comparative method Contrasts for a within-species comparative method Joseph Felsenstein, Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, USA email address: joe@genetics.washington.edu

More information

Lecture 26: Polymers: DNA Packing and Protein folding 26.1 Problem Set 4 due today. Reading for Lectures 22 24: PKT Chapter 8 [ ].

Lecture 26: Polymers: DNA Packing and Protein folding 26.1 Problem Set 4 due today. Reading for Lectures 22 24: PKT Chapter 8 [ ]. Lecture 26: Polymers: DA Packing and Protein folding 26.1 Problem Set 4 due today. eading for Lectures 22 24: PKT hapter 8 DA Packing for Eukaryotes: The packing problem for the larger eukaryotic genomes

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Phylogeny: the evolutionary history of a species

More information

Supersecondary Structures (structural motifs)

Supersecondary Structures (structural motifs) Supersecondary Structures (structural motifs) Various Sources Slide 1 Supersecondary Structures (Motifs) Supersecondary Structures (Motifs): : Combinations of secondary structures in specific geometric

More information

Protein Structure. Hierarchy of Protein Structure. Tertiary structure. independently stable structural unit. includes disulfide bonds

Protein Structure. Hierarchy of Protein Structure. Tertiary structure. independently stable structural unit. includes disulfide bonds Protein Structure Hierarchy of Protein Structure 2 3 Structural element Primary structure Secondary structure Super-secondary structure Domain Tertiary structure Quaternary structure Description amino

More information

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG Department of Biology (Galton Laboratory), University College London, 4 Stephenson

More information

Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae. Emily Germain, Rensselaer Polytechnic Institute

Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae. Emily Germain, Rensselaer Polytechnic Institute Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae Emily Germain, Rensselaer Polytechnic Institute Mentor: Dr. Hugh Nicholas, Biomedical Initiative, Pittsburgh Supercomputing

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Biology 211 (2) Week 1 KEY!

Biology 211 (2) Week 1 KEY! Biology 211 (2) Week 1 KEY Chapter 1 KEY FIGURES: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 VOCABULARY: Adaptation: a trait that increases the fitness Cells: a developed, system bound with a thin outer layer made of

More information

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1 Zhou Pei-Yuan Centre for Applied Mathematics, Tsinghua University November 2013 F. Piazza Center for Molecular Biophysics and University of Orléans, France Selected topic in Physical Biology Lecture 1

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 06 Protein Structure IV

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 06 Protein Structure IV Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur Lecture - 06 Protein Structure IV We complete our discussion on Protein Structures today. And just to recap

More information

Name: Class: Date: ID: A

Name: Class: Date: ID: A Class: _ Date: _ Ch 17 Practice test 1. A segment of DNA that stores genetic information is called a(n) a. amino acid. b. gene. c. protein. d. intron. 2. In which of the following processes does change

More information

Properties of amino acids in proteins

Properties of amino acids in proteins Properties of amino acids in proteins one of the primary roles of DNA (but not the only one!) is to code for proteins A typical bacterium builds thousands types of proteins, all from ~20 amino acids repeated

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Transmembrane Domains (TMDs) of ABC transporters

Transmembrane Domains (TMDs) of ABC transporters Transmembrane Domains (TMDs) of ABC transporters Most ABC transporters contain heterodimeric TMDs (e.g. HisMQ, MalFG) TMDs show only limited sequence homology (high diversity) High degree of conservation

More information

Introduction to Computational Structural Biology

Introduction to Computational Structural Biology Introduction to Computational Structural Biology Part I 1. Introduction The disciplinary character of Computational Structural Biology The mathematical background required and the topics covered Bibliography

More information

4) Chapter 1 includes heredity (i.e. DNA and genes) as well as evolution. Discuss the connection between heredity and evolution?

4) Chapter 1 includes heredity (i.e. DNA and genes) as well as evolution. Discuss the connection between heredity and evolution? Name- Chapters 1-5 Questions 1) Life is easy to recognize but difficult to define. The dictionary defines life as the state or quality that distinguishes living beings or organisms from dead ones and from

More information