Mutual Information & Genotype-Phenotype Association. Norman MacDonald January 31, 2011 CSCI 4181/6802


Mutual Information & Genotype-Phenotype Association Norman MacDonald January 31, 2011 CSCI 4181/6802

2 Overview What is information (specifically Shannon Information)? What are information entropy and mutual information? How are they used? In-depth example: Genotype-Phenotype Association

Which message has more information? 3

4 What is information? There are many definitions of (different types of) information. Here, we are talking about Shannon Information. Shannon Information is not knowledge.

5 Aside: A little history on Claude Shannon. Claude Shannon, a former Research Fellow at Princeton's Institute for Advanced Study, worked as a wartime cryptanalyst from 1940-1945 at Bell Labs. His work led to his influential "A Mathematical Theory of Communication," published in 1948. Some say he had already written the most famous master's thesis of the century at MIT, laying the groundwork of electronic communication with Boolean algebra.

6 What is information? Information is often defined in terms of communication. It depends only on the probability of a message. The more improbable a message, the more information it contains.

What is information? We can measure information in bits. Less probable events carry more information. This can intuitively be thought of as surprisal*. *Coined by Myron Tribus in Thermostatics and Thermodynamics (1961). 7

Drawings by David Mosher from M. Mitchell's Complexity: A Guided Tour (2009) 8

9 Intuitively, either outcome of a fair coin flip has 1 bit of information. E.g., let Heads = 1, Tails = 0. Each outcome is equally probable; thus, each outcome is equally informative.

The result of each possible roll of a fair six-sided die has 2.58 bits of information. 2.58 bits?? Huh? Ok, practically we need 3 bits, but theoretically only 2.58 bits are needed (3 bits can represent up to 8 states). 10

11 How do we measure information? The surprisal of an outcome is h(ω_n) = -log2 P(ω_n), where ω_n is a given outcome and P(·) is the probability mass function for ω. With a log base of two, the units are bits. The more unlikely an event, the more information is received when it occurs. Definite events (P = 1.0) carry 0 bits of information.

12 Another example: Winning the lottery. Let M be a language with two messages: W: Yay! I won! L: Boo! I lost! Let P(M=W) = 0.0000001 and P(M=L) = 0.9999999. Then L has 1.44 x 10^-7 bits of information and W has 23.3 bits of information.
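The coin, die, and lottery numbers on these slides all come from the same one-line calculation. A minimal Python sketch (the helper name surprisal is mine, not from the slides):

```python
# A minimal sketch of surprisal, h(x) = -log2 P(x), in bits.
import math

def surprisal(p):
    """Information, in bits, of an outcome that occurs with probability p."""
    return -math.log2(p)

print(surprisal(0.5))        # fair coin flip: 1.0 bit
print(surprisal(1 / 6))      # one face of a fair die: ~2.58 bits
print(surprisal(0.0000001))  # winning the lottery: ~23.25 bits (slide rounds to 23.3)
print(surprisal(0.9999999))  # losing the lottery: ~1.44e-7 bits
```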

13 Information Entropy Now that we can measure the information of actual messages received, we can think about overall information content of a random variable. A useful measure of this is the Expected Value of the Information of a random variable, otherwise known as the Information Entropy.

14 Information Entropy. The expected surprisal: H(X) = E[h(X)] = -Σ_x P(x) log2 P(x), where E is the expected value function and H is the information entropy.

15 Information Entropy. (Figure: entropy H(X) of a two-state variable plotted against P(X); it peaks at 1 bit when P(X) = 0.5.) For a coin flip, the distribution is p = 0.5 and the entropy (average surprisal) is 1 bit. The lottery example (p = 1.0 x 10^-7) has near-zero entropy.
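A tiny sketch, under the definition above, that reproduces these entropy values (the helper name entropy is mine):

```python
# Entropy as expected surprisal: H(X) = -sum_x P(x) log2 P(x), in bits.
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1 / 6] * 6))       # fair six-sided die: ~2.58 bits
print(entropy([1e-7, 1 - 1e-7]))  # lottery ticket: ~2.5e-6 bits, i.e. near zero
```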

16 More examples with entropy. A flip of a fair coin: initially, low prior information, thus high uncertainty. A roll of a six-sided die: initially, low prior information, thus high uncertainty. A lottery ticket: initially, high prior information, thus low uncertainty. Note that entropy has to do with uncertainty, and uncertainty deals with the future; the actual information contained in a message depends on the probability of the actual event that occurred!

17 Conditional Entropy H(X|Y): What uncertainty is left in X if we know Y? E.g. X: {grass wet, grass dry}, Y: {rainy, sunny}. In this case, very little uncertainty remains.

18 Conditional Entropy. If the joint entropy of the system is H(X,Y) and we remove the entropy of X, we are left with H(Y|X) = H(X,Y) - H(X). Note: H(Y|X) = H(Y) iff X and Y are independent (knowing one gives no information about the other).
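A small sketch of this identity on the grass/rain example; the counts are made up for illustration, not from the slides:

```python
# Conditional entropy via H(X | Y) = H(X, Y) - H(Y), from observed counts.
import math
from collections import Counter

def entropy_of(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

# observed (weather, grass) pairs: wet grass almost always follows rain
pairs = ([("rainy", "wet")] * 49 + [("rainy", "dry")] * 1 +
         [("sunny", "wet")] * 1 + [("sunny", "dry")] * 49)

H_joint = entropy_of(Counter(pairs))                  # H(X, Y)
H_weather = entropy_of(Counter(w for w, g in pairs))  # H(Y)
print(H_joint - H_weather)  # H(X | Y) ~ 0.14 bits of uncertainty left about the grass
```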

19 So far We now have a sense of: The information (surprisal) of a specific state. The expected information over all states, known as the entropy. What about the information shared between two random variables?

20 Mutual Information. Given two random variables, we can formally define the level of relationship between them by the average mutual information. A couple of extremes: zero mutual information means the variables are independent; mutual information approximately equal to the information means the variables are potentially redundant. It can be thought of as agreement.

21 Mutual Information. Formally: I(X;Y) = Σ_x Σ_y P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]. Other quantities give the same value: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
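As a quick numeric check of the identities above, a small sketch over an illustrative joint distribution (the probabilities are mine, not from the slides):

```python
# Verify I(X;Y) = H(X) + H(Y) - H(X,Y) on a toy joint distribution.
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint distribution P(x, y) over x, y in {0, 1}
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

Px = [sum(p for (x, y), p in P.items() if x == v) for v in (0, 1)]  # marginal of X
Py = [sum(p for (x, y), p in P.items() if y == v) for v in (0, 1)]  # marginal of Y
print(H(Px) + H(Py) - H(P.values()))  # ~0.28 bits of mutual information
```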

22 Mutual Information. Important point: mutual information is ignorant of the message itself; each value contributes to the information, e.g. the absence and the presence of a feature contribute equally to the information (agreement). Reminder: information depends only on the probability of an outcome, not on any meaning attributed to the outcome.

Entropy Relationships 23

Application areas: lossless data compression (e.g. Huffman encoding); theoretical channel capacity; corpus linguistics (word collocation); RNA secondary structure prediction (covarying sites); feature selection (relevance and redundancy, microarray expression); measuring cluster quality; genotype-phenotype association. 24

Genotype-Phenotype Association 25

26 The problem: Gene A, Gene B → Trait. (Images: http://www.csb.yale.edu/userguides/graphics/ribbons/help/dna_rgb.html, http://oceanexplorer.noaa.gov/explorations/04fire/logs/hirez/champagne_vent_hirez.jpg)

27 We can create two random variables: X = 1 or 0, the presence or absence of a gene, and Y = 1 or 0, the presence or absence of a trait. With this encoding, we can measure the agreement between X and Y to determine whether they may be related.

Genotype Phenotype http://www.giantmicrobes.com/ 28

Genotype Phenotype http://www.giantmicrobes.com/ 29

NETCAR. Tamura and D'haeseleer, 2008, Bioinformatics. 30

31 So we need examples of organisms with and without genes and traits to analyze. We can get our examples from complete genomes available for download online.

32 However, some of these microbes will be distantly related, carrying genes with similar function that are not identical in sequence. We need to group genes based on orthology.

33 Clusters of Orthologous Groups (COGs). Homologous genes: a set of genes that share a last common ancestor. Orthologous genes: homologous genes that are separated by a speciation event. The COGs used here are from the NCBI and STRING databases.

34 Once we have our genomes, COGs, and traits, we can build phylogenetic profiles (Pellegrini et al. 1999):

Organism   α  β  γ
Gene A     1  1  1
Gene B     0  0  1
Gene C     1  0  1
Trait Y    1  0  1

We can analyze patterns of presence and absence.
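A small sketch of scoring gene-trait agreement in such a profile with mutual information, using the tiny three-organism table above (helper names are mine, not from any of the cited tools):

```python
# Mutual information between presence/absence profiles, computed from counts.
import math
from collections import Counter

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def mutual_information(xs, ys):
    return (H(Counter(xs).values()) + H(Counter(ys).values())
            - H(Counter(zip(xs, ys)).values()))

gene_A  = [1, 1, 1]
gene_C  = [1, 0, 1]
trait_Y = [1, 0, 1]

print(mutual_information(gene_A, trait_Y))  # 0.0 bits: gene A is present everywhere, so uninformative
print(mutual_information(gene_C, trait_Y))  # ~0.92 bits: gene C's pattern matches the trait exactly
```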

35 Associative rule models: Gene A and Gene B and Gene C → Trait. If we were to exhaustively search all possible interactions of size three in a 26,290-gene set, we would have a search space of size 3.03 x 10^12. Association rule mining allows us to prune this search space.
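The search-space figure is just the number of three-gene combinations; a quick check:

```python
# Distinct size-three combinations from 26,290 orthologous gene clusters.
from math import comb

print(comb(26290, 3))  # ~3.03e12, matching the slide
```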

36 Associative rule models (Agrawal et al. 1993). A classical example is a set of grocery store sales transactions.

NETCAR (association rule mining algorithm), step 1: Find parent features strongly associated with the phenotype. (Figure: network of orthologous gene clusters and the thermophily phenotype.) Tamura and D'haeseleer, Bioinformatics, 2008, 24(13), 1523-1529. 37

NETCAR, step 2: Find all child features within x steps of a parent in terms of mutual information. (Figure: network of orthologous gene clusters and the thermophily phenotype.) Tamura and D'haeseleer, Bioinformatics, 2008, 24(13), 1523-1529. 38

NETCAR, step 3: Generate candidate rules with at least one parent, e.g. [A E], [F E G], [F E], [F G C], [F G], [F G K], [F], [F C K], [A]. (Figure: network of orthologous gene clusters and the thermophily phenotype.) Tamura and D'haeseleer, Bioinformatics, 2008, 24(13), 1523-1529. 39

NETCAR, step 4: Save rules with high mutual information with the phenotype (thermophily). Candidate rules: [A E], [F E G], [F E], [F G C], [F G], [F G K], [F], [F C K], [A]. Tamura and D'haeseleer, Bioinformatics, 2008, 24(13), 1523-1529. 40
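A rough, simplified sketch of the four steps above as code. The thresholds, the one-step neighbourhood (instead of "within x steps"), the AND-combination of genes in a rule, and all helper names are my assumptions, not the published algorithm; the mi argument can be a mutual information function like the one sketched earlier.

```python
# Simplified NETCAR-style search: parents, children, candidate rules, rule filter.
from itertools import combinations

def netcar_sketch(profiles, phenotype, mi, parent_t, link_t, rule_t, max_size=3):
    # 1. Parent features: gene clusters strongly associated with the phenotype.
    parents = {g for g, v in profiles.items() if mi(v, phenotype) >= parent_t}

    # 2. Child features: clusters linked to some parent by high gene-gene MI.
    children = {g for g, v in profiles.items() if g not in parents
                and any(mi(v, profiles[p]) >= link_t for p in parents)}

    # 3. Candidate rules: small gene sets containing at least one parent.
    pool = sorted(parents | children)
    candidates = [c for size in range(1, max_size + 1)
                  for c in combinations(pool, size)
                  if any(g in parents for g in c)]

    # 4. Keep rules whose combined presence pattern has high MI with the phenotype.
    def combined(genes):
        return [int(all(profiles[g][i] for g in genes)) for i in range(len(phenotype))]

    return [c for c in candidates if mi(combined(c), phenotype) >= rule_t]
```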

Classification based on Predictive Association Rules (CPAR). (Figure: rules are grown literal by literal, stopping when no literal is above the gain threshold.) Rules discovered: 1. F, Q -> POSITIVE; 2. F, Z -> POSITIVE; 3. A -> POSITIVE. Covered samples get their weight reduced before the next iteration. Yin and Han, Proceedings of the Third SIAM International Conference on Data Mining (SDM03), 2003. 41

42 Data: 427 organisms (STRING 8); 26,290 unique orthologous gene cluster patterns; 10 phenotypes (focus on thermophily; JGI IMG); taxonomy (NCBI).

43 CPAR versus NETCAR. (Figure: accuracy and runtime in seconds for each method.)

Dependent Samples 44

45 Dependence among samples: Both Gene A and Gene B have a strong association with the phenotype (measured with mutual information). Gene A's association can be explained by shared ancestry; Gene B's cannot, and should be highlighted.

Dependent samples: 29 of the 40 correctly classified thermophiles are homogeneous to taxonomic rank order. (Figure: phenotype, Gene A, and Gene B presence/absence columns; light = non-thermophiles, dark = thermophiles and hyperthermophiles.) 46

Accounting for shared ancestry with conditional mutual information 47

48 Confoundment. (Figure; legend: H = Shannon entropy, I = mutual information.)
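Since the score itself is not reproduced here, the sketch below shows plain conditional mutual information, I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z), as one way to discount associations explained by shared ancestry. This is generic CMI, not the CWMI score used in these slides, and the toy data and clade labels are mine.

```python
# Conditional mutual information of gene and trait given a hypothetical clade label.
import math
from collections import Counter

def H(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mi(xs, ys):
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def cmi(xs, ys, zs):
    return H(list(zip(xs, zs))) + H(list(zip(ys, zs))) - H(zs) - H(list(zip(xs, ys, zs)))

gene  = [1, 1, 1, 1, 0, 0, 0, 0]
trait = [1, 1, 1, 0, 0, 0, 0, 1]
clade = ["a"] * 4 + ["b"] * 4   # hypothetical taxonomic grouping

print(mi(gene, trait))          # ~0.19 bits of apparent gene-trait association
print(cmi(gene, trait, clade))  # ~0: the association is fully explained by shared ancestry
```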

49 Results of CWMI versus MI: There is no difference in accuracy, but there is a difference in the genes that are selected.

Thermophily: top genes by MI versus top genes by CWMI. X: a DNA repair system specific for thermophilic Archaea and bacteria, predicted by genomic context analysis. Makarova et al., Nucleic Acids Research, 2002, 30(2), 482-496. 50

51 Misclassifications: Some organisms are classified correctly with one score and not the other. For example, over ten replicates of 5-fold cross-validation on thermophily:

52 Misclassifications (10 replicates):

Organism                                    CPAR  MI  CWMI
Streptococcus_thermophilus_LMG_18311           0  10    10
Streptococcus_thermophilus_CNRZ1066            1  10    10
Carboxydothermus_hydrogenoformans_Z-2901       1   8     5
Geobacillus_kaustophilus_HTA426                3  10     9
Synechococcus_sp._JA-3-3Ab                     6   8     2
Methanocaldococcus_jannaschii_DSM_2661         8   0     0
Acidothermus_cellulolyticus_11B                8   9     6
Deinococcus_geothermalis_DSM_11300             9   8     5
Clostridium_thermocellum_ATCC_27405            9  10     4
Chlorobium_tepidum_TLS                        10  10     8

53 Thermophilic streptococci: rules applying to thermophilic streptococci.

54 Discussion: CPAR vs MI. CPAR uses an approximation of the conditional probability P(Trait | Gene): when we see gene G, what is the probability of trait P? Mutual information is a measure of agreement: how well does the presence and absence of G match the presence and absence of P?
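A toy contrast between the two views; the data and helpers are illustrative only, not CPAR's actual scoring:

```python
# Rule confidence P(trait=1 | gene=1) versus mutual information as agreement.
import math
from collections import Counter

gene  = [1, 1, 1, 1, 1, 1, 0, 0]
trait = [1, 1, 1, 1, 1, 1, 1, 1]   # trait present in every organism

confidence = sum(g and t for g, t in zip(gene, trait)) / sum(gene)
print(confidence)  # 1.0: the rule "gene -> trait" looks perfect

def H(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(H(gene) + H(trait) - H(list(zip(gene, trait))))  # ~0.0 bits: a constant trait carries no agreement information
```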

55 Discussion. CPAR mines rules 100x faster than NETCAR, and those rules are better predictors. Shared ancestry confounds gene-to-trait association problems. Some of the rules weighted with CMI are already known to biologically influence the target traits. We may be subtracting predictive features in favor of those that defy ancestry.

56 References
1. Tamura and D'haeseleer (2008). Microbial genotype-phenotype mapping by class association rule mining. Bioinformatics, 24(13):1523-1529.
2. Steuer, Kurths et al. (2002). The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18(2):S231-S240.
3. Yin X and Han J (2003). CPAR: Classification based on predictive association rules. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA.
4. Kastenmuller G, Schenk M, Gasteiger J, and Mewes H-W (2009). Uncovering metabolic pathways relevant to phenotypic traits of microbial genomes. Genome Biol, 10(3):R28.
5. Cover and Thomas (2006). Elements of Information Theory. Wiley, New Jersey.