Supplementing information theory with opposite polarity of amino acids for protein contact prediction

Size: px

Start display at page:

Download "Supplementing information theory with opposite polarity of amino acids for protein contact prediction"

Britton Rose
5 years ago
Views:

1 Supplementing information theory with opposite polarity of amino acids for protein contact prediction Yancy Liao 1, Jeremy Selengut 1 1 Department of Computer Science, University of Maryland - College Park The covariation of protein residue pairs is often measured by mutual information scores. Residue pairs with high mutual information scores, however, are not guaranteed to be functionally or structurally significant, since their covariation may be due to a number of indirect effects. We propose filtering for residue pairs whose covariational patterns suggest a direct interaction based on their chemical properties. Specifically, we took three protein families (grol, gudd, and BADH) and filtered them for pairs that covary by opposite polarity of charge ([K or R] with [E or D]). The resulting sets of pairs had moderate improvements in physical proximity compared to using mutual information alone. This suggests a potentially new approach of applying mutual information in tandem with chemical properties to improve contact prediction. Introduction Residue pairs which share a high degree of covariation can offer insights concerning the structure and function of a protein family. When changes in one residue is mirrored by compensatory changes in the other residue, this suggests that the residues have co-evolved. 1

2 Covariation can be used to infer residue pairs which are in physical contact or which have functional relationships. This can then be used to construct models of protein folding or mutational processes (Juan et al., 2013). While these inferences have become more widely accessible with the proliferation of sequence data, they are imperfect. Apart from random background noise, there are a variety of confounding factors by which residue pairs can exhibit a spurious degree of covariation. Such spurious relationships can come about, e.g., through a transitive chain of direct interactions, or because of the presence of a common lineage within the sequence data (Marks et al., 2012). Covariational methods are split broadly into two groups: local and global (Marks et al., 2012). While global methods are typically more complex and avoid the problem of transitivity by modeling sequences as a whole, local methods are relatively simpler computationally and allow us to easily isolate for the contributions of each amino acid pairing, and thereby to augment the method to more heavily favor those pairings which correspond to meaningful chemical interactions. Corrected Mutual Information In this paper, we focus on one commonly used local method called mutual information. In particular, we use a variant which corrects for a portion of the background noise including the impact of common lineage. Mutual information is a way to quantify the dependence of two random variables, or how much knowing one variable can tell you about the other. Concretely, suppose X and Y are positions in a multiple sequence alignment (MSA). Then the mutual information of X and Y is: MI(X, Y ) = p(x, y) log y Y x X ( ) p(x, y) p(x)p(y) 2

3 Like many other local methods, mutual information revolves around a comparison of the observed distribution of pairs (p(x, y)) with the expected distribution under an assumption of independence (p(x)p(y)). The contribution of each amino acid pairing (or combination of x and y) is additive, so as we make use of later, we can meaningfully distinguish particular pairings which have an out-sized effect on the overall mutual information score. Introduced by Dunn et al. (2008), corrected mutual information (cmi) is a minor variation which subtracts away a quantity called the average product correction (APC). The APC of positions X and Y is defined as: AP C(X, Y ) = MI(X, )MI(, Y ) MI where, if we allow n to be the length of the MSA, MI = 2 na=1 nb A MI(A, B) n(n 1) is the average of MI scores across all position pairs, and MI(X, ) = 1 nb X MI(X, B) is n 1 the average of MI scores among position pairs featuring X. And finally, we have: cmi(x, Y ) = MI(X, Y ) AP C(X, Y ) By incorporating a positional averaging, the APC under certain theoretical assumptions approximates the amount of background noise or shared ancestry inherent in its input positions, which is then subtracted away from the MI to produce the cmi, which has been empirically shown to be a more accurate measure of the covariation from structural or functional constraints (Dunn et al., 2008). Methods The selection of protein families on which to test our hypothesis was constrained by finding proteins with: ample sequence length, a representative hidden markov model (HMM), a PDB 3

4 crystal structure with multiple chains, and a sufficient number of representatives in our sequence database. With these criteria in mind, we selected and tested on three protein families: grol, gudd, and BADH. In this section we focus on giving a detailed breakdown of the process for grol. Chaperonin GroEL (grol) is a protein found in bacteria which is involved with protein folding. It has an HMM of length 526, denoted TIGR02348 in the TIGRFAMs database. It has a PDB crystal structure of length 525, denoted 2EU1 in the Protein Data Bank. The crystal structure consists of 14 different chains. Physical distance within the crystal structure was defined as the minimum distance between atoms in the r-groups. For glycine, which has no r-group, we used the location of its c-alpha atom. We further refined the notion of distance between two positions to be the minimum distance for those positions across combinations of the 14 approximately identical chains. Using HMMER 3.0, we performed an HMM search against the Uniref full database which found over 5000 sequences above the trusted cut-off, a number of sequences well surpassing the minimum for MI methods (Gloor et al., 2005). We aligned the crystal structure sequence against the HMM as well, in order to translate residue positions from HMM to crystal structure, and vice versa. With the 5000 sequences as our data set, we computed cmi for each pair of positions from the 526. We selected those pairs above the 99.5th percentile in cmi, or the top 687 pairs of positions. For each pair, we translated them to the crystal structure and computed their physical distance. We then generated 196 pairs of positions taken at random, and computed their physical distance. We also generated 99 pairs of adjacent positions, meaning they are within 5 positions apart on the sequence. Intuitively, we would expect the high cmi positions, since they are covarying, to be closer in distance than the random positions. We would also expect the 4

5 adjacent positions to be fairly close in distance since they are near in terms of sequence order. Finally, with regards our hypothesis, we wished to filter the high cmi positions for covariations from amino acids having opposite polarity. We selected high cmi pairs for which [K or R] with [E or D] interactions accounted for at least 15% of their MI score. Since opposite polarities have a physical explanation for covarying, we expected those positions to be closer. Results We first plotted the three sets of position pairs: high cmi, random, and adjacent, according to their physical distance (in angstroms) versus their cmi score (note that while regular MI is always non-negative, it is possible for cmi to be negative). Figure 1: 5

6 We see in Figure 1 that the random positions are spread fairly evenly in terms of distance, from 0 angstroms apart to over 60 angstroms apart, with a mean distance of Meanwhile, they are clustered tightly around 0 in cmi score. The high cmi positions, naturally, only occur on the right side of the graph, with a clear separation from the random positions. While still fairly spread out in terms of distance, the high cmi positions are weighted more heavily towards the lower side of the graph, indicating their closer distances while having a mean of The adjacent positions are salient in the uniformity of their closer distances with a mean of We then plotted within the high cmi group, highlighting those with covariation coming from opposite polarity amino acids (KR-ED correlation). Figure 2: Figure 3: We see in Figure 2 that of the 687 high cmi positions, 112 of them were found to be KR-ED correlated. The KR-ED correlated pairs do indeed have a closer mean distance (12.88) compared to the overall high cmi (19.04). However, of the 112 KR-ED correlated pairs, we found that 38 of them were sequentially adjacent. This is a fairly high percentage. And as we know from the previous Figure 1, adjacent pairs are almost always close in distance. So, what portion of the improved distance in KR-ED correlated pairs could be explained by the sequen- 6

7 tially adjacent? In Figure 3, we excluded the sequentially adjacent and plotted 576 non-adjacent high cmi positions alongside 74 non-adjacent KR-ED correlated positions. The former had a mean of 21.71, while the latter had a mean of 16.98, meaning the difference was diminished slightly, but still presents a moderate improvement. The overall results we presented thus far for grol hold true for the other two protein families, gudd and BADH, as well, and their figures are included in the supplemental section. Discussion and Future Work We have seen that filtering for opposite polarity of covarying amino acid pairs improves upon a mere reliance of high cmi score, even when removing the effects of selecting for sequentially adjacent pairs. The improvement effect is moderate, on an averaged basis, and by no means gives a guaranteed set of contacting pairs. In this study we focused exclusively on opposite polarity. In the course of the study we did attempt to filter by other interactions, such as those for hydrophobic affiliation, but obtained weak or mixed results. We also ran through the most common specific pairings found for close positions versus far positions but did not detect a discernibly generalizable pattern. Nevertheless, this line of investigation may hold untapped potential in leading to additional, more fine-grained filters based on the chemical properties of covariational patterns, and thereby augmenting local methods like mutual information in producing more accurate contact predictions. Acknowledgments I am extremely grateful to Dr. Jeremy Selengut for his generous guidance and expertise while working on this project; to members of the CBCB including Dr. Mihai Pop, Jay Ghurye, and Nidhi Shah for their helpful insights during conversations; and to IARPA for providing the 7

8 funding for this project to occur. References and Notes 1. Dunn, S.D., Wahl, L.M., Gloor, G.B. (2008) Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics, Volume 24, Issue 3, Pages Gloor, Gregory B., Martin, Louise C., Wahl, Lindi M., Dunn, Stanley D. (2005) Mutual Information in Protein Multiple Sequence Alignments Reveals Two Classes of Coevolving Positions. Biochemistry 44 (19), Pages Juan, David de, Pazos, Florencio, Valencia, Alfonso. (2013) Emerging methods in protein co-evolution. Nature Review Genetics 14, Pages Marks, Debora S., Hopf, Thomas A., Sander, Chris. (2012) Protein structure prediction from sequence variation. Nature Biotechnology 30, Pages

9 Supplemental 9

10 10

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot