Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis

Size: px

Start display at page:

Download "Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis"

Claud Blair
5 years ago
Views:

1 Correlations Among Amino Acid Sites in bhlh Protein Domains: An Information Theoretic Analysis William R. Atchley,* 1 Kurt R. Wollenberg,* Walter M. Fitch, Werner Terhalle, and Andreas W. Dress *Department of Genetics, North Carolina State University; Department of Ecology and Evolutionary Biology, University of California at Irvine; and Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany An information theoretic approach is used to examine the magnitude and origin of associations among amino acid sites in the basic helix-loop-helix (bhlh) family of transcription factors. Entropy and mutual information values are used to summarize the variability and covariability of amino acids comprising the bhlh domain for 242 sequences. When these quantitative measures are integrated with crystal structure data and summarized using helical wheels, they provide important insights into the evolution of three-dimensional structure in these proteins. We show that amino acid sites in the bhlh domain known to pack against each other have very low entropy values, indicating little residue diversity at these contact sites. Noncontact sites, on the other hand, exhibit significantly larger entropy values, as well as statistically significant levels of mutual information or association among sites. High levels of mutual information indicate significant amounts of intercorrelation among amino acid residues at these various sites. Using computer simulations based on a parametric bootstrap procedure, we are able to partition the observed covariation among various amino acid sites into that arising from phylogenetic (common ancestry) and stochastic causes and those resulting from structural and functional constraints. These results show that a significant amount of the observed covariation among amino acid sites is due to structural/functional constraints, over and above the covariation arising from phylogenetic constraints. These quantitative analyses provide a highly integrated evolutionary picture of the multidimensional dynamics of sequence diversity and protein structure. Introduction Analyses integrating protein structure and evolution can proceed in different ways. One popular experimental approach, for example, is to carry out site-directed mutagenesis where specific amino acids within a protein are altered, and then to examine the effects of these changes on protein characteristics. Changes in amino acid attributes can then be correlated with changes in protein characteristics (e.g., Koshi and Goldstein 1997 and references therein). Another approach is to model large families of naturally occurring proteins or protein domains and determine how nature has changed their characteristics over billions of years of evolutionary diversification. By examining patterns of sequence diversity, one can explore how naturally occurring sequence variability and amino acid properties (e.g., hydrophobicity, volume, and charge) are important in maintaining protein structure. The latter approach permits analyses of protein structure and function over extensive geological timescales where evolutionary processes have experimented in nature with amino acid changes with regard to protein stability, foldability, and functionality. Experimental, as well as quantitative, analyses of proteins (including multiple-alignment procedures) often proceed by modeling frequencies of residues at individual amino acid sites. For computational expediency, these analyses assume that amino acid sites are inde- Present address: Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany. Key words: mutual information, protein evolution, entropy, parametric bootstrap, helix-loop-helix. Address for correspondence and reprints: William R. Atchley, Fakultät für Mathematik, Universität Bielefeld, Postfach , D Bielefeld, Germany. Mol. Biol. Evol. 17(1): by the Society for Molecular Biology and Evolution. ISSN: pendent, i.e., the presence of a residue at one site is assumed to be independent of residues at other sites (Swofford et al. 1996). However, it is well known that this assumption is naïve, since the activities and properties of proteins are the result of interactions among their constitutive amino acids. Interactions among amino acid sites include salt bridges between charged residues, hydrogen bonds between electron acceptors and donors, size constraints reflecting structural interactions between large and small side chains, electrostatic interactions, hydrophobic effects, Van der Waal s forces, and similar phenomena. Detecting structural interactions and statistical covariance or associations among separate amino acid sites is fundamental for understanding protein structure and evolution. Consequently, it is important to determine the magnitude and direction of residue covariability, its origin, and its structural and functional significance. Because associations among separate amino acid sites may arise from several different sources, partitioning these associations into their component sources is fundamental to understanding protein structure, function, and evolution. The observed covariation in residue composition between amino acid sites i and j (C ij ) arises from several separate underlying causes, which can be expressed by a linear model of the form: C ij C phylogeny C structure C function C interactions C stochastic. Indeed, the primary null hypothesis to be evaluated for such a model is that any component covariance is equal to zero and, as a consequence, makes no significant contribution to the observed association between amino acid sites i and j. Let us consider what these various sources of variation entail. 164

2 Correlation Among Amino Acid Sites 165 One obvious source of covariation among residues at different sites is common evolutionary history (C phylogeny ). Felsenstein (1985) discussed this problem with regard to evolution of complex polygenic traits among species. He pointed out quite elegantly that species are part of a hierarchically structured phylogeny and therefore cannot be regarded for statistical purposes as being drawn independently from the same distribution. Felsenstein s (1985) argument holds as well for associations among amino acid sites in related proteins. For example, an ancient gene may have undergone early duplications followed by sequence diversification through mutation, natural selection, and genetic drift, which may act differentially in separate evolutionary lineages. The result will be collections of related proteins, e.g., families of bhlh proteins like MyoD and Myc. These families contain a number of functionally and structurally similar proteins that have arisen from a common ancestral protein followed by evolutionary diversification and hierarchical branching (Atchley, Fitch, and Bronner-Fraser 1994; Atchley and Fitch 1995). Within the individual members of such protein families, we would expect to find associations among residues at various amino acid sites that have persisted from the early duplication events. Additionally, covariation among sites can arise for structural or functional reasons, i.e., C structure and C function. In this instance, associations among amino acids arise independently of common ancestry and reflect a bias in amino acid replacements in order to satisfy structural demands. The folded nature of a native functioning protein requires that only certain amino acid replacements can occur at particular sites and still maintain the structural integrity of the folded protein. Furthermore, there are constraints on amino acid replacements that arise for functional reasons, such as amino acid bias at recognition sites related to DNA binding in transcriptional regulators. These functional changes may arise when selection operates to optimize adaptation and subsequently generate protein diversification. Clearly, the main effects in the linear model (i.e., structure, function, and phylogeny) are confounded and therefore are not statistically independent. It is well known that structural and functional changes arise through evolutionary processes. Consequently, inclusion of a covariance term, C interactions, in the model is necessary to account for such higher-order statistical nonindependence. Finally, covariation among sites may occur that cannot be explained by the main effects in the model and their statistical interaction. This component, designated here C stochastic, refers to the lack of fit of the data to the model and is analogous to the unexplained sum of squares in analysis of variance or regression. For the sake of simplicity, this stochastic effect can be assumed to represent background covariability. While it is obvious that covariability among sites has a multidimensional basis, partitioning the observed covariability among sites into appropriate underlying components is not a simple matter. Rather, it is a process fraught with many statistical and computational difficulties. Not the least of these difficulties is that biological sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical analyses typically used to partition variability and covariability are difficult to apply with sequence data. Herein, we use an entropy (information theoretic) approach coupled with simulation-based parametric bootstrap procedures to examine the magnitude and origin of associations among amino acid sites in the highly conserved basic helix-loop-helix (bhlh) domain. The bhlh domain is a DNA-binding and dimerization domain of approximately amino acids found in a large and diverse family of transcription factors (Murre et al. 1994). A number of these proteins have been the focus of detailed structural and functional analyses. Furthermore, the bhlh domain has been the subject of several recent evolutionary analyses (Atchley and Fitch 1997; Atchley, Terhalle, and Dress 1999; Morgenstern and Atchley 1999). The present paper explores a number of questions about amino acid associations and protein structure. First, we ascertain the magnitude of association or covariation among residues between amino acid sites within the highly conserved bhlh domain. Second, we carry out computer simulations to elucidate the underlying origins of the observed associations among amino acid sites. We inquire if the observed covariability arises simply from stochastic events or if it is due to evolutionary history or structural and functional constraints. Third, we integrate measures of variability and covariability derived from information theory with structural data from published crystal studies on the bhlh domain. In doing so, we explore the relationships between primary sequence diversity and protein structure/function. Fourth, we examine the evolution of the -helical structure of the bhlh domain among a diverse collection of proteins. Methods and Materials Data The analyses, and conclusions inherent to these analyses and discussions, are based on 242 bhlh domain sequences reported in Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999). Multiple-sequence alignment was initially carried out using CLUS- TAL W (Thompson, Higgins, and Gibson 1994), and the alignment was then improved by eye. The various phylogenetic analyses, definitions of various clades and evolutionary lineages, descriptions of the protein families, and the like discussed herein are reported in Atchley and Fitch (1997). Structure of bhlh Proteins Crystal structure studies have been carried out on the bhlh domains of six proteins, i.e., Max, E47, MyoD, USF, PHO4, and SREBP (Ferre-D Amare et al. 1993, 1994; Ellenberger et al. 1994; Ma et al. 1994; Brownlie et al. 1997; Shimizu et al. 1997; Parraga et al. 1998). The Max protein, which is the dimerization part-

3 166 Atchley et al. FIG. 1. Illustration showing the protein homodimer DNA interaction for Max. This figure (modified from Ferre-D Amare et al. 1993) shows two -helices separated by a loop. One -helix comprises the basic DNA-binding region and helix 1, while the second component involves helix 1 and the leucine zipper. The basic region is shown interacting with DNA in this figure. Of particular relevance to these discussions is the interaction of the two helical components proximal to the loop. ner of the protooncogene Myc, has been examined in considerable crystallographic detail and shown to have an amphipathic -helical structure (fig. 1) in which the protein has opposing hydrophobic and hydrophilic faces. The crystal structure of the Max homodimer shows it to be a parallel, left-handed, four-helix bundle with a hydrophobic core (Ferre-D Amare et al. 1993). The conserved hydrophobic amino acids from helix 1 (H1) and helix 2 (H2) in Max are buried in the interior of the four-helix bundle, where they pack together and exhibit strong van der Waals interactions that stabilize the structure of the homodimer. The conserved, hydrophobic amino acids in H1 and H2 appear to be required for a stable protein DNA complex formation in Max. The crystal structure of Max appears to be generally similar to that of other bhlh proteins, such as E47 (Ellenberger et al. 1994), MyoD (Ma et al. 1994), USF (Ferre-D Amare et al. 1994), PHO4 (Shimizu et al. 1997, and SREBP (Parraga et al. 1998). Indeed, Parraga et al. (1998) point out the remarkable similarity in structure between SREBP and Max and that the hydrophobic cores are virtually identical in these two distantly related bhlh proteins. The protein Max will be used as the structural model for our discussions, and it will be assumed that the 242 bhlh proteins involved in these analyses have the

4 Correlation Among Amino Acid Sites 167 same general structural features as Max. This extrapolation is based on the studies of Ferre-D Amare et al. (1993, 1994), Ma et al. (1994), Ellenberger et al. (1994), Shimizu et al. (1997), and Parraga et al. (1998). Helical Wheel Projections Helical wheel projections are used in these analyses to provide insight into residue interrelationships with a protein structure. Helical wheels graphically display the disposition of amino acid side chains about an assumed -helix. The projection is along the central axis of the helix, from the N-terminus to the C-terminus, and it is a useful device for displaying the symmetry (or asymmetry) of hydrophobic/hydrophilic side chains. The helical wheel assumes a periodicity of 3.6 residues per helical turn. Thirty-three exemplar sequences are used to explore the phylogenetic aspects of the helical wheel projections. Generally speaking, these 33 sequences are well-studied proteins that reflect the evolutionary diversity of the bhlh domain. The evolutionary relationships among the various clades and lineages for these sequences are represented by a neighbor-joining tree. Sequences are arranged phylogenetically and shown in a helical wheel configuration for helices 1 and 2. Secondary Structure Prediction In several instances, the secondary structure of a particular bhlh-domain-containing protein is examined using the Protein Sequence Analysis (PSA) Server from Boston University. The computer model analyzes amino acid sequences and calculates the probability of secondary structures and folding classes within regions of a sequence. The underlying theory for these predictions is described in White, Stultz, and Smith (1994), and the URL of the server is Variability and Covariability in Protein Sequences As noted earlier, statistical analyses of biological sequences present difficulties because these sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical estimates of variability and covariability are difficult to apply. Recently, several authors have suggested the use of the concepts of entropy and mutual information (Korber et al. 1993; Clarke 1995; Herzel and Gross 1995; Schneider 1996; Roman-Roldan, Bernaola-Gavan, and Oliver 1996; Atchley, Terhalle, and Dress, 1999). Entropy (E) is a measure of uncertainty derived from thermodynamics and statistical physics which has considerable utility for studies of protein structure. Assume X is a discrete random variable (the amino acid sites) for which we are uncertain which of its 20 values (x 1, x 2,...,x 20 ) (amino acid residues) will occur at site X, but we do know their expected frequencies, p i,..., p n. These expected frequencies can be used to calculate how much information E(X) is present at site X. In this context, information is a measure of the uncertainty about which residue will occur at a specified site. The Boltzmann-Shannon entropy E(X) is defined (Applebaum 1996) by n j 2 j j 1 E(X) : p log (p ), (1) where n 20, p j is the probability of an amino acid being of the jth kind, and p j log 2 p j : 0ifp j 0. E 0 when all elements are in the same category (the same amino acid residue at a particular site). E increases with both the number of categories (residues at a site) and their equiprobability. Entropy of a uniform distribution whose range has size n is E n log 2 (n). (2) Thus, the minimum entropy or uncertainty value will be zero when only a single residue occurs at a particular site in all included proteins. The expected maximum entropy will be 4.32 when all 20 residues are present in equal frequencies at a given site. The relative information content of Y contained in X is termed the mutual information, or MI(X, Y), where n m p jk jk j 1 k 1 j k MI(X, Y) p log 2. (3) pq Note that MI(X, Y) MI(Y, X), and if X and Y are independent, then MI(X, Y) 0, corresponding to the fact that no information is obtained regarding Y by finding out about X. In biological sequences, MI describes the extent of correlation or association between residues at amino acid sites X and Y that might arise from evolutionary, functional, or structural constraints. More algebraic details are provided in Atchley, Terhalle, and Dress (1999). Statistical Inference about Mutual Information Values An important question in biological sequence analyses is whether one can distinguish signals due to various biological sources (phylogeny, structure, and function) from any background noise (stochastic variation) inherent in a set of sequences. This is analogous, in quantitative genetics, to partitioning phenotypic variability into genetic components (including additive, dominance, and epistatic variance components) and environmental components. In these analyses, we use a parametric bootstrap approach (Efron and Tibshirani 1993; Goldman 1993; Huelsenbeck, Hillis, and Jones 1996) to generate a distribution of MI values reflecting only covariation involving stochastic and phylogenetic constraints. Additional details of this method are presented in Wollenberg and Atchley (2000). The parameters used in the parametric bootstrap simulations were the phylogenetic tree generated from the aligned protein sequences and a residue substitution matrix. Because the tree was derived from the data and the substitution matrix was not (it was chosen to reflect general amino acid substitution probabilities), the data sets generated in the parametric bootstrap simulations contained only stochastic and phylogenetic associations between sites. A neighbor-joining tree (Saitou and Nei 1987) was computed for the 237 sequences using p-distances. The residue change matrix used was that used in the computer program PAML, version 1.3 (Yang 1997). This

5 168 Atchley et al. matrix was generated using the algorithm of Jones, Taylor, and Thornton (1992) and is hereinafter referred to as the JTT matrix. This matrix does not consider gaps as characters for the generation of replicate data sets. Therefore, for MI values calculated on the empirical data to be comparable with MI values calculated on the parametrically generated data sets, only ungapped sites could be used for this statistical analysis. (All sites were used for generating the phylogeny.) For this reason, the original 242 sequences of Atchley and Fitch (1997) were reduced to 237 to decrease the number of sites in the alignment having gaps. This resulted in 32 sites without gaps for analysis. Like any numerical simulation of a physical process, the results depend on the assumptions of the underlying models for their validity. As in any phylogenetic analysis, results depend on the confidence one has that the tree is a realistic description of the history of the subjects being analyzed. The parametric bootstrap also depends on the tree as the source of information about the level and distribution of sequence variation. The residue substitution matrix used will control the changes that occur between sequences in the simulation. Biases in this matrix can affect the potential associations measured in the resulting simulated sequences. However, a matrix having no biases (i.e., a matrix of uniform substitution probabilities) would ignore the biology of the substitution process. Alternatively, one could use a substitution matrix derived from the empirical data, such as that calculated by the RIND program (Bruno 1996). However, a matrix of this type would reflect biases due to phylogeny, structure, and function that are inherent in the empirical data being analyzed (Wollenberg and Atchley 2000). For these reasons, we used a general protein substitution matrix derived using the JTT algorithm. The statistical significance of the MI values was determined by comparing the frequency distributions of MI values for the 237 bhlh sequences and the results for the parametric bootstrap analyses. Any MI value above a specific threshold was considered to contain significant associations over and above those due to stochastic or phylogenetic constraints. This threshold MI value was that value in the frequency distribution of parametric bootstrap MI values greater than a specified percentage (i.e., 99%, 99.9%) of parametric bootstrap MI values. Thus, this procedure does not test whether a given MI value is different from random; rather, it tests whether an MI value reflects a significant association due to structural and functional constraints over and above covariation arising from evolutionary history and stochastic events. Results Entropy Values Figure 2 provides a histogram of the entropy values (E) for the individual amino acid sites in the bhlh domain for 242 proteins. The basic region extends from site 1 to site 13, helix 1 extends from site 14 to site 28, and helix 2 extends from site 50 to site 64. The basic FIG. 2. Histogram of entropy values for individual amino acid sites in the bhlh domain. and helix 1 regions are modeled as a continuous helix (Ferre-D Amare et al. 1993) separated from the second helix by a variable-length loop. The loop is not considered in these discussions. Actual numerical values of E range from 0.15 (98% L residues) at site 23 to the maximum observed value of 3.45 for highly variable sites 21 and 62. As noted earlier, the maximum possible entropy value will be 4.32 when all 20 residues are present at a site in equal frequencies. Crystal studies of the protein Max by Ferre- D Amare et al. (1993) have shown that residues at various sites pack against each other. These sites are described herein as contact sites, using the numerical site identification described by Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999). Thus, K1 refers to a K residue at site 1. With regard to the contact sites, I16 packs against F20, which contacts R50, I53, and L54. Site L23 abuts I53 and A57. V27 and P28 pack against Y60, and I61 packs against its symmetry mate (the site in the dimerization partner) I61 and Y60, while M64 interacts with its homodimer symmetry mate M64. These packed residues, together with the relevant van der Waals forces, stabilize the structure of the homodimer and help conserve the hydrophobic core. The analyses described here assume the crystal structure of Max, E47, MyoD, USF, PHO4, and SREBP, extrapolated to the other known bhlh proteins. Based on the multiple alignment of the bhlh domain sequences used in Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999), the most prevalent residues at these contact or packing sites are 16 (I, L, V), 20 (F, I, L), 23 (L), 27 (I, L, V), and 28 (L) in helix 1, and 50 (K), 53 (I, T, V), 57 (A), 60 (Y), and 64 (L) in helix 2. Thus, the five relevant contact sites in helix 1 are highly hydrophobic while the five sites in helix 2 (except for site 50, which initiates the second helix) are predominantly hydrophobic. The packing interrelationships among sites in helix 1 and helix 2 are shown graphically in figure 3. Additionally, structural studies on SREBP (Parraga et al. 1998) also indicate interactions between site 12 in the basic region with site 17 in helix 1 and site 50 in helix 2 in the other monomer of the homodimer. All known core contact positions given by Ferre- D Amare et al. (1993) and Parraga et al. (1998) have

6 Correlation Among Amino Acid Sites 169 FIG. 3. Interactions (packing) between helix 1 and helix 2 sites in bhlh proteins. The dotted lines between numbered boxes refer to those sites known to pack together. The entropy values for each site are given in italics. Sites that are symmetry mates with the dimerization partner are underlined. entropy values which range from 0.15 to 2.27 (table 1). All of the remaining amino acid positions except two (described below) have entropy values ranging from 2.46 to The distributions of E values for contact versus noncontact values are nonoverlapping, and their medians are statistically highly significantly different by a Mann-Whitney test (Sokal and Rohlf 1995) (table 1). One might conclude that contact residues can be identified by their entropy values, i.e., assigning contact positions to those with E values below a particular threshold value (approximately 2.3 in the case of these bhlh domains). While this works for the bhlh domain, it is unknown whether it can be generalized to other proteins. If the various amino acid residues are classified into functional groups (e.g., D, E acidic; K, R, H basic) as described in Atchley, Terhalle, and Dress (1999), the contact sites have entropy values ranging from 0.03 to All values are 1.0 except those of site 60 (1.31) and site 28 (1.64). For sites not in contact, the entropy values for functional groups range from 0.58 to The noncontact site with the smallest entropy value (site 51, with E 0.58) has a variety of residues, but they are all aliphatic/hydrophobic, which accounts for this seemingly anomalously low value. A. R. Ferre-D Amare generously provided information from his crystal studies about the structural interactions of site 51. The side chains of the alanine residue in Max and the leucine residue in E47 at site 51 interact with those of site 16 to facilitate stabilization of the protein dimers hydrophobic core. Furthermore, in SREBP, the serine at site 51 makes a water-mediated hydrogen bond with a phosphate oxygen anchoring the dimer to DNA. Other bhlh proteins have histidine residues at site 51, and this side chain could make DNA contacts as well. Our results suggest that two positions not originally considered contact sites by Ferre-D Amare et al. (1993) may be such. Sites 17 and 24 have E values of 1.18 and 1.93, respectively (0.93 and 0.94 with residues classified into functional groups), which are well within the range Table 1 Entropy Values for Amino Acid Sites in Helix 1 and Helix 2 that Are in Contact or Not in Contact Sites in Contact Sites Not in Contact Site Entropy E F Site Entropy E F * * Mean Mean SD SD NOTE. The Entropy columns give the entropy values for individual amino acids at each site. E F is the entropy value when the amino acids at each site are converted to their functional groups (sensu Atchley, Terhalle, and Dress 1999). Sites 17 and 24 are putative contact sites based on their entropy values (see text for more details). A Mann-Whitney test of the null hypothesis that these two sets of sites have the same median entropy value is rejected at P for both E and E F. of the E values for contact sites (table 1). Further examination of the bhlh structural data (A. R. Ferre D Amare, personal communication) suggests that site 17 functions in a water-mediated DNA-protein contact. This interrelationship has apparently resulted in a high level of conservation of hydrophilic residues at this site (74% asparagine, 24% lysine or arginine). This interaction is clearly demonstrated in the sterol-regulatoryelement-binding proteins (SREBPs) by Parraga et al. (1998). Unfortunately, the situation with site 24 is more difficult to explain. Side chains in this position form part of an extension of the hydrophobic core of the bhlh dimer by burying under the flap formed by the loop component. The conservation of longer side-chain residues at site 24 may stem from their hydrophobic methylenes being partially buried and the hydrophilic tips being exposed (A. R. Ferre-D Amare, personal communication). Additional high-resolution studies of the bhlh domain may provide a biological explanation for these quantitative observations on site 24. Variability in Buried and Exposed Sites The results reported here are in line with many previous observations that internal (buried) residues are less variable than external or exposed residues, and less variation means lower entropy values (e.g., Goldman, Thorne, and Jones 1998). Indeed, Atchley, Terhalle, and Dress (1999) previously tested this hypothesis and found that the entropy values of the buried sites are significantly smaller than those of the exposed sites. Mutual Information Tables 2 and 3 provide MI values describing the level of association among amino acid sites within the

7 170 Atchley et al. Table 2 Mutual Information (MI) Values 1.0 Within and Between Components for Basic, Helix 1, and Helix 2 Sites Basic region Helix 1 Helix NOTE. Only values 1.0, which reflect the top 5% of all MI values for the bhlh domain, are included. Basic region sites are 1 13, helix 1 sites are 14 28, and helix 2 sites are Sites from the loop region are not included. bhlh domain. Values 1.0 reported in table 2 constitute the top 5% of MI values for all bhlh domain sites as reported by Atchley, Terhalle, and Dress (1999). When the sites with MI 1.0 are arranged in a network, specific patterns of association are made apparent (fig. 4). Within this network are subnetworks consisting of sets of sites for which each site has connections to all other sites in the subnetwork. These completely connected subnetworks correspond to the cliques previously defined in Atchley, Terhalle, and Dress (1999). A maximum clique corresponds to the largest maximally connected subnetwork. The two maximum cliques for the data from table 2 are presented in figure 4A and B. Mutual Information Between the Basic Region and Helices 1 and 2 Table 2 describes the pattern in mutual information values between the basic DNA-binding region (sites 1 13) and helices 1 and 2. Within the basic component itself, MI values 1.0 are noted between sites (3, 4), (3, 7), (4, 5) (4, 7), (4, 8), and (7, 11). Sites 3, 4, 7, and 8 in the basic region show high levels of association with specific sites in the two heli- Table 3 Entropy (E) and Mutual Information (MI) Values for Sites in Helix 1 (H1) (Sites 14 28) and Helix 2 (H2) Sites (50 64) H2 SITES 64 (E 1.21) 63 (E 3.36) 62 (E 3.45) 61 (E 1.46) 60 (E 1.25) 59 (E 3.16) 58 (E 2.49) 57 (E 1.23) 56 (E 3.08) 55 (E 2.84) 54 (E 0.20) 53 (E 1.24) 52 (E 3.00) 51 (E 2.46) 50 (E 0.52) E H * * NOTE. Sites known to interact in the protein Max are underlined and in bold italic type. Two sites predicted from these analyses to interact are indicated with asterisks. MI values 0.56 reflect sites with high probabilities (P 99%) of having associations due to structural and functional constraints. Values 1.0 represent the top 5% of all bhlh MI values and are given in bold type and underlined.

8 Correlation Among Amino Acid Sites 171 From Table 3, it is clear that sites with low sequence diversity (small entropy values) also show little covariation with other amino acid sites. This is to be expected, since residues at paired sites do not covary if the individual sites themselves had little residue variation. Thus, the four primary sites in each helix previously shown to pack together (sites 16, 20, 23, and 27 in helix 1, and sites 50, 53, 57, and 60 in helix 2) exhibit very little residue variability (low entropy values), and there is very little covariability among the contact sites. The latter finding is reflected by the fact there are no MI values among these eight contact residues higher than 0.22 and, as will be seen below, none show significant covariation due to structural and functional constraints. Several sites within the helices with higher residue diversity exhibit considerable mutual information with other variable sites. Thus, site 23 (E 3.48) in helix 1 exhibits the highest observed MI values (MI 1.19) with sites 52, 55, and 56 in helix 2. In helix 2, site 52 likewise shows high MI values with sites 19, 21, 25, and 26 in helix 1. Note, however, that while site 18 exhibits considerable residue diversity (E 3.26), it does not necessarily exhibit high MI values with other sites. This demonstrates that high entropy does not necessarily produce high values of mutual information even in highly conserved protein domains. FIG. 4. Network diagrams of the data from table 2 (pairs of sites with MI 1.0). Maximum cliques are indicated by the heavy black lines. A, Maximum cliques containing site 7 from the basic region. B, Maximum clique containing site 52 from helix 2. ces, primarily sites 14 and 21 in helix 1 and sites 52, 56, and 62 in helix 2. The helical wheel assumes a periodicity of 3.6 residues per helical turn, so site 14 is at the start of helix 1, and site 21 is two complete turns into the helix. In rank order, the MI values for the top 13 paired sites are as follows: 3 and 21 (1.20), 7 and 21 (1.19), 3 and 14 (1.18), 3 and 4 (1.17), 3 and 56 (1.15), 3 and 62 (1.13), 7 and 15 (1.12), 8 and 19 (1.12), 4 and 7 (1.11), 7 and 11 (1.10) 4 and 21 (1.10), 4 and 52 (1.10), and 11 and 21 (1.10). Mutual Information Within and Between Helix 1 and Helix 2 Table 3 provides the E and MI values for interacting sites in helices 1 and 2. Packed or contact sites are underlined and in italics in table 3, and entropy values are provided for each site. These values provide a description of residue diversity at each site, ranging from E 0.15 at site 23 (which is 98% leucine in this large database) to E 3.48 at site 21. The maximum possible value for E is The remainder of the table provides pairwise MI values for these amino acid sites. Sites in this subset with MI values 1.0 are shown in bold type and underlined. As a point of reference, the largest observed MI value for the entire bhlh domain was 1.26 between sites 14 and 21. Origins of Significant Mutual Information There are several possible explanations for significant levels of association among many sites within the bhlh domain (other than simply chance associations). These include associations arising from evolutionary constraints, correlated mutations, functional associations, and structural constraints. Simulation and the Partition of Observed Associations A simulation was carried out using parametric bootstrap procedures to partition the observed covariation among amino acid sites into that due to evolutionary history (phylogeny) and stochastic events on the one hand, and that due to structural and functional constraints on the other. The distributions of MI values for the parametric bootstrap data and the empirical data were significantly different at P This suggests that there are significant associations among many amino acid sites in the bhlh domain over and above those due to stochastic and phylogenetic effects. The distribution of MI values from 1,000 parametric bootstrap replicates, calculated using the neighborjoining tree and the JTT substitution matrix, was compared with the distribution of MI values for 237 bhlh proteins (fig. 5). This comparison permits calculation of threshold values for distinguishing between structural/ functional and phylogenetic/chance associations. In these analyses, sites having MI values above a given threshold value have a specific probability of covariation due to structural and functional constraints, rather than due to phylogenetic constraint or chance. The specific MI value used as this threshold has an associated prob-

9 172 Atchley et al. FIG. 5. Inverse cumulative frequency distributions of MI values for the alignment of 237 bhlh protein sequences and 1,000 parametric bootstrap replicates using the JTT substitution matrix. Three probability thresholds (P 0.01, P 0.001, and P ), their associated MI values, and the the probability associated with MI are indicated. These thresholds were calculated from the JTT bootstrap distribution. Because MI is a pairwise measure, each replicate consisted of x(x 1)/2 values, where x is the number of sites in the alignment. For the alignment of 237 bhlh sequences, there were 32 sites without gaps, resulting in 496 MI values per replicate. ability based on the number of values in the parametric bootstrap distribution that are greater than the threshold. Based on the frequency distribution shown in figure 5, any pair of nongapped sites in the bhlh alignment that have MI values will have a probability 0.01 that the covariation among sites is due to phylogeny or chance associations. With a threshold value of 0.749, the probability of this level of covariation being due to phylogeny or chance is reduced by an order of magnitude to P In table 2, sites are designated that have MI values 1.0 and it is noted that they reflect the top 5% of all MI values for the entire bhlh domain. These values 1.0 have a probability of P (from the parametric bootstrap simulations) that the associations are due to phylogeny or chance associations. Among the contact sites, there is no significant covariation from structural and functional constraints. However, there is considerable significant covariation among noncontact sites that stems from structural and functional constraints (fig. 6). The latter is evident in those MI values larger than 0.56, the 1% critical value from the parametric bootstrap simulations. Correlations in Hydropathy From a structural and functional perspective, correlated amino acid replacements may have occurred among sites to maintain the same hydropathic relationships, similar electrostatic charges, size constraints, etc. It is difficult to examine statistical associations among sites with regard to electrostatic charge, since only five amino acids (K, R, H, D, and E) are charged and 15 residues have no charge. Since paired sites must show variability to exhibit correlation, attempts to compute correlation coefficients with invariant data (charge of zero in both variables) is undefined, as can be seen by equation (3). Consequently, we are unable to examine associations due to charge and have focused instead on the possibility of significant associations due to hydropathy and size constraints. For 33 exemplar sequences, residues in helix 1 and helix 2 were coded as 1 if they were hydrophobic (A, C, G, I, L, M, F, P, and V), 1 if they were hydrophilic (R, N, D, E, Q, H, and K), and 0 if they were S, T, Y, or W. Then, pairwise product-moment correlation coefficients (r) were computed among all sites in the two helices to determine if significant associations existed among sites for hydropathy states. Several pairs of sites showed high correlations, including sites 20 and 23 (r 0.89), 17 and 61 (r 0.81), 50 and 61 (r 0.75), and 27 and 61 (r 0.71). (Any product-moment correlation 0.45 in this analysis is significant at P 0.01) As can be seen in the data given in figure 7, the high positive correlations relate to strong association for hydrophobic residues at sites 20 and 23, as well as at sites 27 and 61 (fig. 8). The negative values refer to inverse relationships between hydrophobic and hydrophilic residues. In addition to these four pairs of sites, two pairs of sites (sites 23 and 27 and sites 25 and 50) had correlations 0.6. Correlations in the Sizes of Amino Acid Residues For these same 33 bhlh sequences, product-moment correlation coefficients were computed for the sizes (volume) of residues at pairs of sites for helices 1 and 2. Seven pairwise correlations occurred that were statistically significant. These included sites 23 and 50 (r 0.66), 50 and 60 (r 0.64), 53 and 54 (r 0.57), 22 and 26 (r 0.57), 23 and 60 (r 0.52), 54 and 60 (r 0.51), and 56 and 61 (r 0.50). However, some

10 Correlation Among Amino Acid Sites 173 FIG. 6. Representation of high mutual information values among a subset of amino acid sites in bhlh proteins. The solid boxes are the conserved sites known to pack together (see fig. 1) from crystal structure studies. The lines between sites in helix 1 and helix 2 reflect the 12 sites with the highest mutual information ( 0.9). values must be viewed with caution because of computational difficulties associated with high levels of conservation (low entropy) and different residues with equivalent volumes. In spite of these caveats, there are several interesting findings here involving associations based on size. The sites with the largest correlation coefficient (sites 20 and 53) are contact sites, and these observations indicate that there is a like association; i.e., large residues are paired with large residues. Thus, when F occurs at site 20, I, L, Y, or T occurs at site 53. Similarly, an L at site 20 pairs with an L, T, or V at site 53. The largest negative value occurs between two adjacent sites (sites 53 and 54). It might be expected that significant inverse relationships would exist for the volume of adjacent residues. However, the difference between the entropy values for sites 53 (E 1.24) and 54 (E 0.20) stresses the need for caution in interpreting this correlation, since variable site 53 is paired with a largely unvaried site 54. The correlation coefficient of 0.57 between sites 22 and 26 probably represents a meaningful structural/functional association, because there is considerable residue variability at both of these sites. Both sites occur away from the contact sites and, consequently, exhibit more variability. Helical Wheels and Sequence Diversity The helical wheel shown in figure 7 displays the helical distribution of residues including both the basic DNA-binding region and helix 1, while figure 8 provides a helical wheel for amino acid sites 50 64, which constitute helix 2. These figures summarize information for the 392 bhlh domain sequences in the database. The most prevalent residues at each site are shown in the figure. Amino acid cliques (sensu Atchley, Terhalle, and Dress 1999) shown in these figures are defined for each helix. Cliques are groups of amino acid positions FIG. 7. A helical wheel drawing of the basic DNA-binding region and helix 1. Highly conserved sites are shaded, and those sites involved in the predictive motif of Atchley, Terhalle, and Dress (1999) are denoted with an X. The five sites involved in the highest-ranked multisite clique are enclosed with a dotted line. all of which are more highly associated with each other than any are with a nonmember of the clique. Maximal cliques are those not contained in larger cliques. Finally, those sites involved in determining the predictive motif described in Atchley, Terhalle, and Dress (1999) are marked by an X. This predictive motif is a collection of 19 highly conserved sites whose amino acid compositions accurately discriminate bhlh-domain-containing proteins into groups A D according to the evolutionary classification proposed by Atchley and Fitch (1997). In Figure 7, sites are denoted that have entropy values 2.0, values between 2.0 and 2.4, and values 2.4. The most prevalent amino acid residues at each site are noted with the appropriate symbols where possible. Furthermore, five sites (sites 3, 4, 7, 14, and 21) that constitute the highest-ranked multisite clique in helix 1 are denoted. FIG. 8. Helical wheel of helix 2 sites. Labeling conventions follow figure 7.

11 174 Atchley et al. FIG. 9. Neighbor-joining tree for 33 exemplar proteins and the distribution of residues for helix 1 and helix 2. Within a helix, residues are grouped by their positions in the helical wheel (figs. 6 and 7). In the tree, lineages representing Groups A D of Atchley and Fitch (1997) are denoted. In the DNA-binding region, there are five strongly conserved sites with E 1.7 (sites 1, 2, 9, 10, and 12) (fig. 7), and four of these are highly basic in that they have K or R residues in great preponderance. The exception is site 9, which has a glutamic acid in 93% of all sequences. The glutamic acid at site 9 contacts the C in the E-box (CANNTG), and its presence indicates that DNA binding occurs. Those bhlh proteins lacking an E at site 9 do not bind DNA (groups C and D, sensu Atchley and Fitch 1997). The remaining highly variable sites in the basic region (5, 6, and 11) do not appear to show a systematic pattern of functional group amino acids. The remainder of the sites shown in figure 7 (sites 14 28) constitute helix 1 in the bhlh domain. There are a number of highly conserved sites constituting one face on the helical wheel. These include sites 16, 17, 20, 23, 24, 27, and 28, and they comprise a conspicuous distribution. Sites 16, 20, 23, and 27 are hydrophobic sites with a high preponderance of I, L, V, and F amino acid residues. According to Klingler and Brutlag (1994) and others, there is a hydrophobic periodicity that characterizes many helices, i.e., the relative positions of amino acids in an amphipathic helix may influence their interresidue correlation structure. Thus, an amino acid at position i may show a preference for similar types of amino acids at sites i 3 and i 4, which are on the same side of the helix. Analogously, an amino acid at position i may show a preference for dissimilar amino acid types at positions i 2 and i 5. More explicitly, if a hydrophobic residue occurs at site i, there is a greater expectation of seeing a hydrophobic residue at sites i 3 and i 4. Thus, the more coincident two residues are on one side of an helix, the more likely they are to be of the same hydropathy. Conversely, the closer to a 180 separation two residues are, the more likely they are to be of opposite hydropathies (Klingler and Brutlag 1994). Sites 16, 20, 23, and 27 are three to four sites apart, in accordance with the i 3 and i 4 pattern of hydrophobic periodicity described by Klingler and Brutlag (1994). Thus, this set of four sites provides a highly hydrophobic face for the helix. Sites 17 and 24, on the other hand, are conserved sites with hydrophilic residues

12 Correlation Among Amino Acid Sites 175 (K, R, N) which provide the hydrophilic face indicative of amphipathic helices. Finally, site 28 has a high frequency of P residues (63%), indicative of the last site of an helix. At site 32, 35% of the sequences have P residues. Some proteins, like MyoD, continue helix 1 another turn, and the protein is turned out of the helix into the loop one turn later than in other bhlh proteins. Figure 8 shows the helical wheel for helix 2. The helix starts with site 50, which is a highly conserved K residue (93% of all sequences). Hydrophobic periodicity is easily seen relative to sites 57 and 64, which are predominantly hydrophobic residues, as are the i 3 and i 4 sites. However, the i 2 and i 5 sites to sites 57 and 60 are not necessarily hydrophobic and are rather diverse in their amino acid compositions. This is also the case for sites 54 and 61. On average, helix 2 appears to be more hydrophobic than helix 1. The predictive motif proposed by Atchley, Terhalle, and Dress (1999) to discriminate bhlh proteins employs sites that fall on these highly conserved faces of the two helices. Phylogenetic Aspects of the Helical Configuration To explore the phylogenetic aspects of an helix configuration, we combined a neighbor-joining tree of 33 bhlh domains with the distribution of their amino acids along a helical wheel (fig. 9). The proteins chosen for these analyses were simply some typical representatives of the various clades and evolutionary lineages as reported by Atchley and Fitch (1997). This tree delimits those proteins that belong to groups A D. Group A and B bhlh domain proteins are most prevalent in the literature and databases, followed by those of group C. Group A proteins bind to a CAGCTG E-box configuration, while group B proteins bind to the CACGTG E-box configuration. Group C has a more complex DNA-binding behavior, and group D does not bind DNA. Atchley, Terhalle, and Dress (1999) derived a 19- element predictive motif that accurately identifies bhlh-domain-containing proteins. Groups A and B show the smallest deviation from this motif, while groups C and D deviate considerably, mostly in the basic DNA-binding region, as would be expected. Proteins that fit this predictive motif most closely include MyoD and Achaete-Scute (no mismatches) and LYL1, d- HAND, and SREBP (one mismatch). The most divergent proteins are INO4 and CENP-B, with 10 mismatches, while INO2 has 8 mismatches. (EMC also has 10 mismatches, but this is to be expected since it is a group D protein which does not have a typical basic DNAbinding component.) Let us consider the structures of those proteins that deviate most from the predictive motif. CENP-B was originally described as a bhlh protein (Sullivan and Glass 1991; Sugimoto, Muro, and Himeno 1992). However, Atchley and Fitch (1997) suggested that it deviated considerably in its primary sequence from more typical bhlh proteins. In addition to considerable deviation from the predictive motif for bhlh domains (Atchley, Terhalle, and Dress 1999), the first residue in helix 1 is a proline. Proline is an amino acid generally associated with breaking of helices, and, indeed, the last residue in helix 1 and the first one in the loop regions are prolines. Analysis of the secondary structure of the CENP- B bhlh domain by the PSA algorithm described in Materials and Methods indicates that the stretch of residues considered to be the helix 1 region have low probabilities of being helices. The probability values range from 0.34 for the first residue (proline) to 0.56 (an isoleucine midway in helix 1). The means and standard deviations for the probabilities of being helices for the residues in the basic, H1, L, and H2 regions are 0.67 ( 0.16), 0.37 ( 0.15), 0.40 ( 0.15), and 0.55 ( 0.15), respectively. An analysis of variance of the basic and H1 regions to test the null hypothesis that the basic and helix 1 regions have an equal probability of being helices is rejected at P Recently, it was suggested that CENP-B is not a helix-loop-helix protein but, rather, should be classified as helix-turn-helix (Iwahara et al. 1998). Our analyses here suggest that CENP-B probably should not be classified as an HLH protein. Another protein that deviates considerably from the predictive model is INO4, which is believed to positively regulate the coordinate expression of phospholipid structural genes in yeast (Nikoloff and Henry 1994). The fit of INO4 to the predictive motif in helix 2 is particularly bad (five mismatches). However, in spite of the sequence differences, the PSA analyses indicate that INO4 fits the helix model rather well. Our analyses suggest that a more detailed analysis of the structure of INO4 might be fruitful. The extent of conservation at particular sites and pairs of sites can now be more clearly seen, together with deviations from these patterns of conservation. For example, all of the proteins except four have proline residues at site 28. The exceptions include the proteins ADD1, INO2, MyoD, and E12. Furthermore, only a single sequence (PHO4) has a residue other than L or I at site 54. In PHO4, the residue at this site is an E. Discussion A clear understanding of interactions among amino acid sites is fundamental to producing a comprehensive model for the structure, function, and evolution of proteins. Because of the limited number of secondary and tertiary structures that proteins can assume (Sternberg 1996), it can be argued that evolution at the amino acid sequence level is constrained by higher-order structure. This phylogenetic inertia (sensu Felsenstein 1985), which determines the rate and direction of evolutionary change at the sequence level, is reflected in the variability and covariability among primary sequence elements. Covariance structure and phylogenetic inertia are important concepts in Darwinian evolution that have not been adequately explored at the molecular level. The analyses reported here deal with the magnitude and origins of covariability among amino acid sites and their relevance to protein structure and evolution.

Solving the protein sequence metric problem

Solving the protein sequence metric problem William R. Atchley*, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke *Department of Genetics, Bioinformatics Research Center, Graduate Program in Biomathematics,