Mutual Information & Genotype-Phenotype Association Norman MacDonald January 31, 2011 CSCI 4181/6802
Overview: What is information (specifically, Shannon information)? What are information entropy and mutual information? How are they used? In-depth example: genotype-phenotype association.
Which message has more information?
What is information? There are many definitions of (different types of) information. Here, we are talking about Shannon Information. Shannon Information is not knowledge.
Aside: a little history on Claude Shannon. A Research Fellow at Princeton's Institute for Advanced Study, Claude Shannon worked as a wartime cryptanalyst from 1940-1945 at Bell Labs. His work led to his influential "A Mathematical Theory of Communication," published in 1948. Some say his MIT master's thesis, which laid the groundwork of electronic communication with Boolean algebra, was already the most famous master's thesis of the century.
What is information? Information is often defined in terms of communication. It depends only on the probability of a message. The more improbable a message, the more information it contains.
What is information? We can measure information in bits. Less probable events have more information. This can intuitively be thought of as "surprisal" (a term coined by Myron Tribus in Thermostatics and Thermodynamics, 1961).
[Figure: drawings by David Mosher, from M. Mitchell's Complexity: A Guided Tour (2009).]
Intuitively, either outcome of a fair coin flip has 1 bit of information. E.g., let Heads = 1, Tails = 0. Each outcome is equally probable; thus, each outcome is equally informative.
The result of each possible fair six-sided die roll has log2(6), about 2.58, bits of information. 2.58 bits?? Ok, practically we need 3 bits, but theoretically only 2.58 bits are needed (3 bits can represent up to 8 states).
How do we measure information? The surprisal of an outcome is

    I(w_n) = -log2 P(w_n)

where w_n is a given outcome and P(.) is the probability mass function for w. With a log base of two, the units are bits. The more unlikely an event, the more information is received when it occurs. Definite events (P = 1.0) have 0 bits of information.
Another example: winning the lottery. Let M be a language with two messages, W: "Yay! I won!" and L: "Boo! I lost!" Let P(M=W) = 0.0000001 and P(M=L) = 0.9999999. Then L has 1.44 x 10^-7 bits of information, and W has 23.3 bits of information.
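These surprisal values are easy to verify numerically. A minimal sketch in Python (the `surprisal` helper is ours, for illustration, not from the slides):

```python
import math

def surprisal(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

print(surprisal(0.5))        # a fair coin outcome: 1.0 bit
print(surprisal(1 / 6))      # a fair die outcome: ~2.58 bits
print(surprisal(0.0000001))  # winning the lottery: ~23.3 bits
print(surprisal(0.9999999))  # losing the lottery: ~1.44e-07 bits
```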
Information Entropy. Now that we can measure the information of actual messages received, we can think about the overall information content of a random variable. A useful measure of this is the expected value of the information of a random variable, otherwise known as the information entropy.
Information Entropy. The entropy is the expected surprisal:

    H(X) = E[I(X)] = -sum over x of P(x) log2 P(x)

where E is the expected-value function and H is the information entropy.
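The definition translates directly into a few lines of Python (an illustrative sketch, not part of the slides):

```python
import math

def entropy(probs):
    """H(X) = -sum of p * log2(p), in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1 / 6] * 6))       # fair die: ~2.58 bits
print(entropy([1e-7, 1 - 1e-7]))  # lottery: near zero
```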
Information Entropy. [Figure: entropy H(X) of a two-state variable as a function of P(X); it peaks at 1 bit when P(X) = 0.5 and falls to 0 at P(X) = 0 or 1.] For a fair coin flip (p = 0.5), the entropy (average surprisal) is 1 bit. The lottery example (p = 1.0 x 10^-7) has near-zero entropy.
More examples with entropy. A flip of a fair coin: initially, low prior information, thus high uncertainty. A roll of a six-sided die: initially, low prior information, thus high uncertainty. A lottery ticket: initially, high prior information, thus low uncertainty. Note that entropy has to do with uncertainty, and uncertainty deals with the future. The actual information contained in a message depends on the probability of the event that actually occurred!
Conditional Entropy H(X|Y): What uncertainty is left in X if we know Y? E.g., X: {grass wet, grass dry}; Y: {rainy, sunny}. In this case, very little uncertainty remains.
Conditional Entropy. If the entropy in a system is H(X,Y), and we remove the entropy of X, then we have H(Y|X) = H(X,Y) - H(X). Note: H(Y|X) = H(Y) iff X and Y are independent (knowing one gives no information about the other).
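To make the wet-grass example concrete, here is a small sketch with an invented joint distribution (the 0.45/0.05 numbers are ours, chosen so that weather mostly determines the grass):

```python
import math
from collections import Counter

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution over (weather, grass) pairs.
joint = {("rainy", "wet"): 0.45, ("rainy", "dry"): 0.05,
         ("sunny", "wet"): 0.05, ("sunny", "dry"): 0.45}

# Marginal distribution of the weather.
p_weather = Counter()
for (w, g), p in joint.items():
    p_weather[w] += p

# H(grass | weather) = H(weather, grass) - H(weather)
H_cond = H(joint.values()) - H(p_weather.values())
print(round(H_cond, 3))  # ~0.469: well below the 1-bit maximum
```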
So far we now have a sense of: the information (surprisal) of a specific state, and the expected information over all states, known as the entropy. What about the information shared between two random variables?
Mutual Information. Given two random variables, we can formally define the level of relationship between them by the average mutual information. A couple of extremes: zero mutual information means the variables are independent; mutual information close to the full entropy means the variables are potentially redundant. Mutual information can be thought of as "agreement."
Mutual Information. Formally:

    I(X;Y) = sum over x,y of P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]

Other quantities: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
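The double sum implements directly; here is an illustrative sketch with the two extreme cases from the previous slide (the `mutual_information` helper is ours):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly agreeing binary variables: I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0
# Independent variables: I(X;Y) = 0.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```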
Mutual Information. Important point: mutual information is ignorant of the message itself. Each value contributes to the information; e.g., the absence and the presence of a feature contribute equally. Reminder: information depends only on the probability of an outcome, not on any meaning attributed to the outcome.
Entropy Relationships
Application areas: lossless data compression (e.g. Huffman encoding); theoretical channel capacity; corpus linguistics (word collocation); RNA secondary structure prediction (covarying sites); feature selection (relevance and redundancy); microarray expression; measuring cluster quality; genotype-phenotype association.
Genotype-Phenotype Association
The problem: Gene A, Gene B -> Trait. [Figure credits: http://www.csb.yale.edu/userguides/graphics/ribbons/help/dna_rgb.html, http://oceanexplorer.noaa.gov/explorations/04fire/logs/hirez/champagne_vent_hirez.jpg]
We can create two binary random variables: X in {1, 0}, the presence or absence of a gene, and Y in {1, 0}, the presence or absence of a trait. With this encoding, we can measure the agreement between X and Y to determine whether they may be related.
Genotype -> Phenotype [Images: http://www.giantmicrobes.com/]
NETCAR (Tamura and D'haeseleer, 2008, Bioinformatics)
So we need examples of organisms with and without genes and traits to analyze. We can get our examples from complete genomes available for download online.
However, some of these microbes will be distantly related, having genes that are similar in function but not identical in sequence. We need to group genes based on orthology.
Clusters of Orthologous Groups. Homologous genes: a set of genes that share a last common ancestor. Orthologous genes: homologous genes that are separated by a speciation event. The COGs used here are from the NCBI and STRING databases.
Once we have our genomes, COGs, and traits, we can build phylogenetic profiles (Pellegrini et al. 1999):

    Organism    alpha  beta  gamma
    Gene A        1      1     1
    Gene B        0      0     1
    Gene C        1      0     1
    Trait Y       1      0     1

We can analyze patterns of presence and absence.
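Given the profile table above, mutual information with the trait row already separates informative genes from uninformative ones. A sketch (the `mi_bits` helper is ours, estimating MI from observed counts):

```python
import math
from collections import Counter

def mi_bits(xs, ys):
    """Empirical mutual information (bits) between two equal-length vectors."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Rows of the phylogenetic profile (columns are organisms alpha, beta, gamma).
gene_A = [1, 1, 1]
gene_B = [0, 0, 1]
gene_C = [1, 0, 1]
trait_Y = [1, 0, 1]

print(mi_bits(gene_A, trait_Y))  # 0.0: gene A is present everywhere, no signal
print(mi_bits(gene_C, trait_Y))  # ~0.918: gene C matches the trait exactly
```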
Associative rule models: (Gene A and Gene B and Gene C) -> Trait. If we were to exhaustively search all possible interactions of size three in a 26,290-gene set, we would have a search space of size 3.03 x 10^12. Association rule mining allows us to prune this search space.
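The search-space figure is just the number of 3-element subsets, C(26290, 3), which is easy to check:

```python
import math

# All unordered triples from 26,290 orthologous gene clusters.
n_triples = math.comb(26290, 3)
print(f"{n_triples:.2e}")  # ~3.03e+12
```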
Associative rule models (Agrawal et al. 1993). A classical example is mining a set of grocery store sales transactions.
NETCAR (association rule mining algorithm):
1. Find parent features strongly associated with the phenotype. [Figure: network of orthologous gene clusters linked to thermophily.]
2. Find all child features within x steps of a parent, in terms of mutual information.
3. Generate candidate rules with at least one parent, e.g. [A E], [F E G], [F E], [F G C], [F G], [F G K], [F], [F C K], [A].
4. Save rules with high mutual information with the phenotype.
Tamura and D'haeseleer, Bioinformatics, 2008, 24(13):1523-1529.
Classification based on Predictive Association Rules (CPAR). [Figure: rule-growth trace; each rule adds literals (F then Q; F then Z; A) until no literal's gain exceeds the threshold.] Rules discovered: 1. F, Q -> POSITIVE; 2. F, Z -> POSITIVE; 3. A -> POSITIVE. Covered samples get their weight reduced before the next iteration. Yin and Han, Proceedings of the Third SIAM International Conference on Data Mining (SDM03), 2003.
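The greedy rule growth described above can be sketched as follows. This is a simplified toy, not CPAR itself: real CPAR also keeps several near-best literals, reuses computed counts, and reweights covered examples across iterations, none of which is shown here. The gain formula is the standard FOIL gain; the data at the bottom is invented for illustration.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL gain for adding a literal: p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(samples, labels, weights, min_gain=0.1):
    """Greedily conjoin feature literals while some literal's gain beats min_gain."""
    rule, covered = set(), list(range(len(samples)))
    while True:
        p0 = sum(weights[i] for i in covered if labels[i])
        n0 = sum(weights[i] for i in covered if not labels[i])
        best, best_gain = None, min_gain
        for f in range(len(samples[0])):
            if f in rule:
                continue
            cov = [i for i in covered if samples[i][f]]
            p1 = sum(weights[i] for i in cov if labels[i])
            n1 = sum(weights[i] for i in cov if not labels[i])
            gain = foil_gain(p0, n0, p1, n1)
            if gain > best_gain:
                best, best_gain = f, gain
        if best is None:
            return rule, covered
        rule.add(best)
        covered = [i for i in covered if samples[i][best]]

# Four samples over two features; feature 0 perfectly predicts the label.
samples = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [True, True, False, False]
rule, covered = grow_rule(samples, labels, [1.0] * 4)
print(rule)  # {0}
```

In a full CPAR-style loop, the weights of the covered samples would then be reduced and `grow_rule` called again to discover further rules.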
Data: 427 organisms (STRING 8); 26,290 unique orthologous gene cluster patterns; 10 phenotypes (focus on thermophily; JGI IMG); taxonomy (NCBI).
CPAR versus NETCAR. [Figure: comparison of accuracy and runtime (s).]
Dependent Samples
Dependence among samples. Both Gene A and Gene B have a strong association with the phenotype (measured with mutual information). Gene A's association can be explained by shared ancestry; Gene B's cannot, and should be highlighted. [Figure: presence/absence profiles of the phenotype, Gene A, and Gene B.]
Dependent samples: 29 of the 40 correctly classified thermophiles are homogeneous at the taxonomic rank of order. [Figure: profiles of the phenotype, Gene A, and Gene B; light: non-thermophiles, dark: thermophiles and hyperthermophiles.]
Accounting for shared ancestry with conditional mutual information
Confounding. [Figure legend: H: Shannon entropy; I: mutual information.]
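One way to express the confounding is the conditional mutual information I(gene; trait | ancestry). A small empirical sketch with invented presence/absence vectors: gene A tracks its clade exactly, so its agreement with the trait vanishes once we condition on the clade; gene B matches the trait even across clade boundaries, so its signal survives (the `cmi_bits` helper is ours):

```python
import math
from collections import Counter

def cmi_bits(xs, ys, zs):
    """Empirical I(X;Y|Z) = sum p(x,y,z) * log2( p(z) p(x,y,z) / (p(x,z) p(y,z)) )."""
    n = len(xs)
    pxyz = Counter(zip(xs, ys, zs))
    pxz, pyz, pz = Counter(zip(xs, zs)), Counter(zip(ys, zs)), Counter(zs)
    return sum((c / n) * math.log2((pz[z] / n) * (c / n)
                                   / ((pxz[x, z] / n) * (pyz[y, z] / n)))
               for (x, y, z), c in pxyz.items())

clade  = [1, 1, 1, 1, 0, 0, 0, 0]  # shared ancestry (taxonomic group)
trait  = [1, 1, 1, 0, 0, 0, 0, 1]  # phenotype, mostly clade-linked
gene_A = [1, 1, 1, 1, 0, 0, 0, 0]  # tracks the clade exactly
gene_B = [1, 1, 1, 0, 0, 0, 0, 1]  # tracks the trait, even across clades

print(round(cmi_bits(gene_A, trait, clade), 3))  # 0.0: explained by ancestry
print(round(cmi_bits(gene_B, trait, clade), 3))  # ~0.811: survives conditioning
```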
Results of CWMI versus MI: there is no difference in accuracy, but there is a difference in the genes that are selected.
Thermophily: top MI versus top CWMI gene clusters. [Figure: the top-ranked clusters under each score.] X: "A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis." Makarova et al., Nucleic Acids Research, 2002, 30(2):482-496.
Misclassifications: some organisms are classified correctly with one score and not the other. For example, consider ten replicates of 5-fold cross-validation on thermophily.
Misclassifications (10 replicates):

    Organism                                    CPAR  MI  CWMI
    Streptococcus_thermophilus_LMG_18311           0  10    10
    Streptococcus_thermophilus_CNRZ1066            1  10    10
    Carboxydothermus_hydrogenoformans_Z-2901       1   8     5
    Geobacillus_kaustophilus_HTA426                3  10     9
    Synechococcus_sp._JA-3-3Ab                     6   8     2
    Methanocaldococcus_jannaschii_DSM_2661         8   0     0
    Acidothermus_cellulolyticus_11B                8   9     6
    Deinococcus_geothermalis_DSM_11300             9   8     5
    Clostridium_thermocellum_ATCC_27405            9  10     4
    Chlorobium_tepidum_TLS                        10  10     8
Thermophilic streptococci. [Figure: rules applying to thermophilic streptococci.]
Discussion: CPAR vs. MI. CPAR uses an approximation of the conditional probability P(Trait | Gene): when we see gene G, what is the probability of trait P? Mutual information is a measure of agreement: how well does the presence and absence of G match the presence and absence of P?
Discussion. CPAR mines rules 100x faster than NETCAR, and those rules are better predictors. Shared ancestry confounds gene-to-trait association problems. Some of the rules weighted with CMI are already known to biologically influence the target traits. We may be subtracting predictive features in favor of those that defy ancestry.
References
1. Tamura and D'haeseleer (2008). Microbial genotype-phenotype mapping by class association rule mining. Bioinformatics, 24(13):1523-1529.
2. Steuer, Kurths, et al. (2002). The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18(Suppl 2):S231-S240.
3. Yin X and Han J (2003). CPAR: Classification based on predictive association rules. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA.
4. Kastenmuller G, Schenk M, Gasteiger J, and Mewes H-W (2009). Uncovering metabolic pathways relevant to phenotypic traits of microbial genomes. Genome Biol, 10(3):R28.
5. Cover and Thomas (2006). Elements of Information Theory. Wiley, New Jersey.