Quantitative Biology Lecture 3


23rd Sep 2015 Quantitative Biology Lecture 3 Gurinder Singh Mickey Atwal Center for Quantitative Biology

Summary: Covariance and Correlation; Confounding Variables (Batch Effects); Information Theory

Covariance So far, we have been analyzing summary statistics that describe aspects of a single list of numbers, i.e. a single variable. Frequently, however, we are interested in how variables behave together.

Smoking and Lung Capacity Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity. We might ask a group of people about their smoking habits, and measure their lung capacities.

Smoking and Lung Capacity

Cigarettes (X)    Lung Capacity (Y)
0                 45
5                 42
10                33
15                31
20                29

Smoking and Lung Capacity: scatter plot of lung capacity (Y) against smoking rate (X), with the sample means <X>, <Y> and the deviations ΔX, ΔY marked.

Covariance

The Sample Covariance The covariance quantifies the linear relationship between two variables. The sample covariance Cov(x,y) is an unbiased estimate of the true covariance from a collection of N data points: Cov(x,y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}). Why N-1 and not N in the denominator? The reason is that the averages \bar{x} and \bar{y} in the formula are not the true means of the x and y variables, but only estimates of the means from the finite set of available data. The N-1 corrects for this.

Pearson Correlation The correlation is a normalized version of the covariance and ranges from -1 to 1: -1 (negative correlation), 0 (uncorrelated), +1 (positive correlation). r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right), where s_x and s_y are the standard deviations of the x and y variables.
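As an illustration (not part of the original slides), here is a minimal Python sketch that computes the sample covariance and the Pearson correlation for the smoking/lung-capacity data above; the variable names are hypothetical.

    import numpy as np

    # Smoking/lung-capacity data from the table above
    cigarettes = np.array([0, 5, 10, 15, 20], dtype=float)       # X
    lung_capacity = np.array([45, 42, 33, 31, 29], dtype=float)  # Y

    n = len(cigarettes)
    dx = cigarettes - cigarettes.mean()
    dy = lung_capacity - lung_capacity.mean()

    # Sample covariance with the N-1 (unbiased) denominator
    cov_xy = np.sum(dx * dy) / (n - 1)

    # Pearson correlation: covariance normalized by the sample standard deviations
    r_xy = cov_xy / (cigarettes.std(ddof=1) * lung_capacity.std(ddof=1))

    print(f"Cov(x, y) = {cov_xy:.2f}")   # negative: capacity falls as smoking rises
    print(f"r_xy      = {r_xy:.3f}")     # close to -1 (strong negative linear relation)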

Correlation (figure: Wikipedia). Note that the Pearson correlation does not capture non-linear relationships.

Correlation does not imply Causation! Confounding variables can give rise to a correlation between two indirectly related variables. Example: an association between cancer risk and genetic variation can be confounded by population history. Population history drives genetic variation in cancer genes (correlation and causation with the cancer phenotype) as well as genetic variation in non-cancer genes (possible correlation with the phenotype, but no causation).

Confounding Example: Genetic Association Studies

Simpson's Paradox Correlations can sometimes be reversed when combining different sets of data. Example: test results.

         Week 1          Week 2          Total
Eve      60/100 (60%)    1/10 (10%)      61/110 (55%)
Adam     9/10 (90%)      30/100 (30%)    39/110 (35%)

Adam performs better than Eve in each week, but worse when all the results are added up.
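A small sketch (illustrative, not from the lecture) that reproduces the arithmetic of the table: the per-week rates favour Adam, yet the pooled rate favours Eve.

    # Test results: (passes, attempts) per week, as in the table above
    eve  = [(60, 100), (1, 10)]
    adam = [(9, 10), (30, 100)]

    def rate(passed, total):
        return passed / total

    for week, (e, a) in enumerate(zip(eve, adam), start=1):
        print(f"Week {week}: Eve {rate(*e):.0%}  Adam {rate(*a):.0%}")

    # Pooled totals reverse the comparison (Simpson's paradox)
    eve_total  = (sum(p for p, _ in eve),  sum(t for _, t in eve))
    adam_total = (sum(p for p, _ in adam), sum(t for _, t in adam))
    print(f"Total:  Eve {rate(*eve_total):.0%}  Adam {rate(*adam_total):.0%}")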

Batch Effects in Gene Expression Data The two colours denote two different processing dates. Raw data from a published bladder cancer microarray study: 10 example genes show batch effects even after normalization, and samples cluster perfectly by processing date. Leek et al., Nature Reviews Genetics (2010)

Batch Effects in Next-Generation Sequencing Uneven sequencing coverage of human DNA (Chr16): some days the coverage is high (orange) and some days low (blue). Leek et al., Nature Reviews Genetics (2010)

Eliminating batch effects: Pooling and Randomization. E.g. eliminating lane batch effects for RNA-Seq by pooling samples across lanes (the figure contrasts a correct, balanced design with a wrong, confounded one). Auer and Doerge, Genetics (2010)

Information Theory

Role of Information Theory in Biology
i) Mathematical modeling of biological phenomena, e.g. optimization of early neural processing in the brain; bacterial population strategies.
ii) Extraction of biological information from large data sets, e.g. gene expression analyses; GWAS (genome-wide association studies).

Mathematical Theory of Communication Claude Shannon (1948) Bell Sys. Tech. J. Vol.27, 379-423, 623-656 How to encode information? How to transmit messages reliably?

Model of General Communication System: Information source → (message) → Channel → Destination. Examples:
Visual Image → Retina → Visual Cortex
Morphogen Concentration → Gene Pathway → Differentiation Genes
Computer File → Fiber Optic Cable → Another Computer

Model of General Communication System: Information source → (message) → Transmitter → (signal) → Channel → Receiver → (message) → Destination, with noise entering at the channel. The message is encoded by the transmitter and decoded by the receiver.

Model of General Communication System 1) Shannon's source coding theorem: there exists a fundamental lower bound on the size of the compressed message if no information is to be lost.

Model of General Communication System 2) Shannon's channel coding theorem: information can be transmitted, with negligible error, at rates no faster than the channel capacity.

Information Theory What is the information content of a message (random variable)? How much uncertainty is there in the outcome of an event? E.g. genome-wide nucleotide frequencies: Homo sapiens, p(A)=p(T)=p(G)=p(C)=0.25 (high information content); Plasmodium falciparum, p(A)=p(T)=0.4, p(G)=p(C)=0.1 (low information content).

Measure of Uncertainty H({p_i}) Suppose we have a set of N possible events with probabilities p_1, p_2, ..., p_N. General requirements of H: (i) continuous in the p_i; (ii) if all p_i are equal, H should be monotonically increasing with N; (iii) H should be consistent: if a choice is broken down into successive choices, H should be the weighted sum of the individual values of H (e.g. outcomes with probabilities {1/2, 1/3, 1/6} can be decomposed into a first choice between {1/2, 1/2} followed by a second choice between {2/3, 1/3}).

Entropy as a measure of uncertainty Unique answer provided by Shannon. For a discrete random variable B with N possible states b: H[B] = -\sum_{b \in B} p(b) \log_2 p(b). Similar to the Gibbs entropy in statistical mechanics. Maximum when all probabilities are equal, p(b) = 1/N, giving H_max = \log_2 N (cf. the Boltzmann entropy). With base-2 logarithms the units are bits (binary digits). For continuous states: H[B] = -\int p(b) \log_2 p(b) \, db.

Interpretations of entropy H: (i) the average length of the shortest possible code to transmit a message (Shannon's source coding theorem); (ii) it captures the variability of a variable without making any model assumptions; (iii) the average number of yes/no questions needed to determine the outcome of a random event. E.g. Homo sapiens, p(A)=p(T)=p(G)=p(C)=0.25, H = 2 bits; Plasmodium falciparum, p(A)=p(T)=0.4, p(G)=p(C)=0.1, H ≈ 1.7 bits.
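A minimal sketch (not part of the original slides) computing the Shannon entropy of the two nucleotide distributions; it reproduces H = 2 bits for the uniform case and about 1.7 bits for the AT-rich case.

    from math import log2

    def entropy_bits(probs):
        """Shannon entropy H = -sum p log2 p, ignoring zero-probability states."""
        return -sum(p * log2(p) for p in probs if p > 0)

    homo_sapiens = [0.25, 0.25, 0.25, 0.25]   # p(A), p(T), p(G), p(C)
    p_falciparum = [0.40, 0.40, 0.10, 0.10]

    print(f"H(H. sapiens)    = {entropy_bits(homo_sapiens):.2f} bits")   # 2.00
    print(f"H(P. falciparum) = {entropy_bits(p_falciparum):.2f} bits")   # ~1.72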

Example: Binding sequence conservation. Sequence conservation at a position is R_seq = H_max - H_obs = \log_2 N + \sum_{n=1}^{N} p_n \log_2 p_n, where N is the number of symbols (N = 4 for DNA). CAP (Catabolite Activator Protein) acts as a transcription activator at more than 100 sites within the E. coli genome. Sequence conservation reveals the CAP binding site.
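To make the formula concrete, here is an illustrative sketch (the aligned binding-site sequences are made up, not real CAP sites) that computes R_seq = log2(4) - H_obs at each position of a toy alignment.

    from math import log2
    from collections import Counter

    # Hypothetical aligned binding-site sequences (toy data, not real CAP sites)
    sites = ["TGTGA", "TGTGA", "TGAGA", "TTTGA", "TGTGC"]

    def column_conservation(column, alphabet_size=4):
        """R_seq = H_max - H_obs = log2(N) + sum_n p_n log2 p_n for one column."""
        counts = Counter(column)
        total = len(column)
        h_obs = -sum((c / total) * log2(c / total) for c in counts.values())
        return log2(alphabet_size) - h_obs

    for i, column in enumerate(zip(*sites)):
        print(f"position {i}: R_seq = {column_conservation(column):.2f} bits")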

Two random variables? Joint entropy: H[X,Y] = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(x,y). If the variables are independent, p(x,y) = p(x)p(y), then H[X,Y] = H[X] + H[Y]. The difference measures the total amount of correlation between the two variables: the Mutual Information, I(X;Y) = H[X] + H[Y] - H[X,Y] = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}.

Mutual Information, I(X;Y). (Venn diagram: the joint entropy H[X,Y] decomposes into H[X|Y], I[X;Y] and H[Y|X].) I(X;Y) = H(X) - H(X|Y). I(X;Y) quantifies how much the uncertainty of X is reduced if we know Y. If X and Y are independent, then I(X;Y) = 0. It is model independent, captures all non-linear correlations (cf. Pearson's correlation), is independent of the measurement scale, and its units (bits) have physical meaning.
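A short sketch (illustrative, not from the slides) computing I(X;Y) = H(X) + H(Y) - H(X,Y) from a small joint probability table; the joint distributions used here are hypothetical.

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_information(joint):
        """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution joint[x][y]."""
        px = [sum(row) for row in joint]        # marginal over y
        py = [sum(col) for col in zip(*joint)]  # marginal over x
        hxy = entropy([p for row in joint for p in row])
        return entropy(px) + entropy(py) - hxy

    # Hypothetical joint distributions of two binary variables (rows: x, columns: y)
    correlated  = [[0.4, 0.1],
                   [0.1, 0.4]]
    independent = [[0.25, 0.25],
                   [0.25, 0.25]]

    print(f"I(X;Y) correlated  = {mutual_information(correlated):.3f} bits")
    print(f"I(X;Y) independent = {mutual_information(independent):.3f} bits")  # 0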

Mutual information captures non-linear relationships. Panel A: R^2 = 0.487 ± 0.019, I = 0.72 ± 0.08, MIC = 0.48 ± 0.02. Panel B: R^2 = 0.001 ± 0.002, I = 0.70 ± 0.09, MIC = 0.40 ± 0.02. The two relationships carry similar mutual information, even though the non-linear one in panel B has R^2 ≈ 0. Kinney and Atwal, PNAS 2014.

Responsiveness to complicated relations. Two scatter plots of gene-b expression level against gene-a expression level: a simple relation with MI ~ 1 bit and correlation ~ 0.9, and a complicated relation with MI ~ 1.3 bits and correlation ~ 0.

Data processing inequality. Suppose we have a sequence of processes, e.g. a signal transduction pathway (Markov process): A → B → C. Physical statement: in any physical process the information about A gets continually degraded along the sequence of processes. Mathematical statement: I(A;C) ≤ min[ I(A;B), I(B;C) ].
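A numerical sketch (not from the slides) of a Markov chain A → B → C in which each step flips a binary signal with 10% probability; it shows I(A;C) is no larger than I(A;B) or I(B;C). The flip probability is an arbitrary choice for illustration.

    from math import log2
    from itertools import product

    FLIP = 0.1  # probability that each noisy step flips the bit (arbitrary choice)

    def step(prev_bit, next_bit):
        return 1 - FLIP if next_bit == prev_bit else FLIP

    # Joint distribution p(a, b, c) for the Markov chain A -> B -> C, with A uniform
    joint = {(a, b, c): 0.5 * step(a, b) * step(b, c)
             for a, b, c in product([0, 1], repeat=3)}

    def mi(pair):
        """Mutual information (bits) between the two coordinates named by 'pair'."""
        pxy, px, py = {}, {}, {}
        for abc, p in joint.items():
            x, y = abc[pair[0]], abc[pair[1]]
            pxy[(x, y)] = pxy.get((x, y), 0) + p
            px[x] = px.get(x, 0) + p
            py[y] = py.get(y, 0) + p
        return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

    i_ab, i_bc, i_ac = mi((0, 1)), mi((1, 2)), mi((0, 2))
    print(f"I(A;B) = {i_ab:.3f}  I(B;C) = {i_bc:.3f}  I(A;C) = {i_ac:.3f}")
    print("Data processing inequality holds:", i_ac <= min(i_ab, i_bc) + 1e-12)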

Multi-Entropy, H(X_1 X_2 ... X_n): H[X_1 X_2 ... X_n] = -\sum_{x_1, x_2, ..., x_n} p(x_1 x_2 ... x_n) \log_2 p(x_1 x_2 ... x_n). Multi-Information, I(X_1 X_2 ... X_n), measures the total correlation among n variables: I[X_1 X_2 ... X_n] = \sum_{x_1, x_2, ..., x_n} p(x_1 x_2 ... x_n) \log_2 \frac{p(x_1 x_2 ... x_n)}{p(x_1) p(x_2) ... p(x_n)}.

Generalised correlation between more than two elements. Multi-information is a natural extension of Shannon's mutual information to an arbitrary number of random variables: I({X_1, X_2, ..., X_N}) = \sum_{i=1}^{N} H(X_i) - H({X_1, X_2, ..., X_N}). It provides a general measure of non-independence among multiple variables in a network and captures higher-order interactions beyond simple pairwise interactions.

Capturing more than pairwise relations. Two panels of expression against experiment index: gene-a/gene-b expression with MI ~ 0 bits and correlation ~ 0, and gene-a/gene-b/gene-c expression with multi-information ~ 1.0 bits.

Multi-allelic associations. The phenotype P is the XOR of allele A and allele B:

A  B  P
0  0  0
0  1  1
1  0  1
1  1  0

All pairwise mutual informations vanish, I(A;B) = I(A;P) = I(B;P) = 0, yet the multi-information I(A;B;P) = 1 bit. Multi-locus associations can be completely masked by single-locus studies!
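An illustrative sketch (not from the slides) verifying the XOR example: every pairwise mutual information is zero, while the multi-information I(A;B;P) is 1 bit.

    from math import log2
    from collections import Counter
    from itertools import product

    # The four equally likely (A, B, P) states with P = A XOR B, as in the table
    states = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
    prob = 1 / len(states)

    def entropy(columns):
        """Joint entropy (bits) of the chosen columns of the (A, B, P) table."""
        counts = Counter(tuple(s[i] for i in columns) for s in states)
        return -sum((c * prob) * log2(c * prob) for c in counts.values())

    A, B, P = 0, 1, 2

    def mi(x, y):
        return entropy([x]) + entropy([y]) - entropy([x, y])

    print(f"I(A;B) = {mi(A, B):.2f}  I(A;P) = {mi(A, P):.2f}  I(B;P) = {mi(B, P):.2f}")
    multi_info = entropy([A]) + entropy([B]) + entropy([P]) - entropy([A, B, P])
    print(f"I(A;B;P) = {multi_info:.2f} bit")   # 1.00: the association is purely three-way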

Synergy and Redundancy S = I(X;Y;Z) - I(X;Y) - I(X;Z) - I(Y;Z) = I({X,Y};Z) - [I(X;Z) + I(Y;Z)], where I(X;Y;Z) is the multi-information of the previous slide. S compares the information that X and Y together provide about Z with the information that these two variables provide separately. If S < 0 then X and Y are redundant in providing information about Z. If S > 0 then there is synergy between X and Y. Motivating example: X = SNP 1, Y = SNP 2, Z = phenotype (apoptosis level).

How do we quantify distance between distributions? Kullback-Leibler Divergence (D_KL), also known as relative entropy, quantifies the difference between two distributions P(x) and Q(x): D_KL(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} (discrete) = \int P(x) \ln \frac{P(x)}{Q(x)} \, dx (continuous). It is a non-symmetric measure; D_KL(P||Q) ≥ 0, with D_KL(P||Q) = 0 if and only if P = Q; and it is invariant to reparameterization of x.

Kullback-Leibler Divergence: D_KL ≥ 0. Proof: use Jensen's inequality. For a concave function f(x), ⟨f(x)⟩ ≤ f(⟨x⟩); e.g. ⟨ln x⟩ ≤ ln⟨x⟩ (for a concave function, every chord lies below the function). Then D_KL(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} = -\sum_x P(x) \ln \frac{Q(x)}{P(x)} = -⟨\ln \frac{Q(x)}{P(x)}⟩ ≥ -\ln ⟨\frac{Q(x)}{P(x)}⟩ = -\ln \sum_x P(x) \frac{Q(x)}{P(x)} = -\ln \sum_x Q(x) = -\ln 1 = 0. Hence D_KL(P||Q) ≥ 0.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics. Flip a fair coin N times, i.e. q_H = q_T = 0.5. E.g. N = 50, and we observe 27 heads and 23 tails. What is the probability of observing this? Observed distribution P(x) = {p_H = 0.54, p_T = 0.46}; actual distribution Q(x) = {q_H = 0.50, q_T = 0.50}.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics. The binomial distribution P(n_H, n_T) = \frac{N!}{n_H! \, n_T!} q_H^{n_H} q_T^{n_T} \approx \exp(-N p_H \ln(p_H/q_H) - N p_T \ln(p_T/q_T)) = \exp(-N D_KL[P||Q]) for large N. The probability of observing the counts depends on (i) N and (ii) how much the observed distribution differs from the true distribution. D_KL emerges from the large-N limit of a binomial (multinomial) distribution and quantifies how much the observed distribution diverges from the true underlying distribution. If D_KL > 1/N then the distributions are very different.
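A small sketch (illustrative) computing D_KL between the observed and fair-coin distributions for the N = 50 example; 27 heads gives D_KL well below 1/N (consistent with a fair coin), while a hypothetical 40-heads outcome gives D_KL above 1/N.

    from math import log

    def d_kl(p, q):
        """Kullback-Leibler divergence (in nats) between two discrete distributions."""
        return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    N = 50
    fair = [0.5, 0.5]

    for heads in (27, 40):                     # 40 heads is a hypothetical comparison
        observed = [heads / N, (N - heads) / N]
        dkl = d_kl(observed, fair)
        verdict = "very different" if dkl > 1 / N else "consistent with a fair coin"
        print(f"{heads} heads: D_KL = {dkl:.4f} nats, 1/N = {1/N:.4f} -> {verdict}")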

Kullback-Leibler Divergence, Motivation 2: Information Theory. How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)? D_KL(P||Q) = (avg no. of bits using the bad code) - (avg no. of bits using the optimal code) = (-\sum_x P(x) \log_2 Q(x)) - (-\sum_x P(x) \log_2 P(x)) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}.

Kullback-Leibler Divergence, Motivation 2: Information Theory.

Symbol   Probability P(x)   Bad code (optimal for Q(x))   Optimal code for P(x)
A        1/2                00                            0
C        1/4                01                            10
T        1/8                10                            110
G        1/8                11                            111

P(x) = {1/2, 1/4, 1/8, 1/8}, Q(x) = {1/4, 1/4, 1/4, 1/4}. Average length of the bad code = 2 bits; average length of the optimal code = 1.75 bits, which equals the entropy of the symbol distribution, -\sum_x p(x) \log_2 p(x) = 1.75 bits, and is therefore optimal. D_KL(P||Q) = 2 - 1.75 = 0.25 bits, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00, C=01, T=10, G=11} instead of the optimal code.
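A final sketch (illustrative) that recomputes the average code lengths in the table and confirms that the overhead of the bad code equals D_KL(P||Q) = 0.25 bits.

    from math import log2

    P = {"A": 0.5, "C": 0.25, "T": 0.125, "G": 0.125}   # true symbol probabilities
    Q = {"A": 0.25, "C": 0.25, "T": 0.25, "G": 0.25}    # assumed (wrong) probabilities

    bad_code     = {"A": "00", "C": "01", "T": "10",  "G": "11"}   # optimal for Q
    optimal_code = {"A": "0",  "C": "10", "T": "110", "G": "111"}  # optimal for P

    def avg_length(code):
        return sum(P[s] * len(code[s]) for s in P)

    entropy_P = -sum(p * log2(p) for p in P.values())
    dkl_bits  = sum(P[s] * log2(P[s] / Q[s]) for s in P)

    print(f"avg length, bad code     = {avg_length(bad_code):.2f} bits")      # 2.00
    print(f"avg length, optimal code = {avg_length(optimal_code):.2f} bits")  # 1.75 = H(P)
    print(f"entropy H(P)             = {entropy_P:.2f} bits")
    print(f"D_KL(P||Q)               = {dkl_bits:.2f} bits")                  # 0.25 overhead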