23rd Sep 2015 Quantitative Biology Lecture 3 Gurinder Singh Mickey Atwal Center for Quantitative Biology
Summary: Covariance and Correlation; Confounding Variables (Batch Effects); Information Theory
Covariance So far, we have been analyzing summary statistics that describe aspects of a single list of numbers, i.e. a single variable. Frequently, however, we are interested in how variables behave together.
Smoking and Lung Capacity Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity. We might ask a group of people about their smoking habits and measure their lung capacities.
Smoking and Lung Capacity

Cigarettes (X)    Lung Capacity (Y)
0                 45
5                 42
10                33
15                31
20                29
[Figure: scatter plot of Lung Capacity (Y) against Smoking rate (X), with the means <X>, <Y> and the deviations ΔX, ΔY marked.]
Covariance
The Sample Covariance
The covariance quantifies the linear relationship between two variables. The sample covariance Cov(x,y) is an unbiased estimate of the true covariance from a collection of N data points:
$\mathrm{Cov}(x, y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
Why N-1 and not N in the denominator? The reason is that the averages $\bar{x}$ and $\bar{y}$ in the formula are not the true averages of the x and y variables, but only estimates of the averages from the finite set of available data. The N-1 corrects for this.
Pearson Correlation
The correlation is a normalized version of the covariance:
$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
where $s_x$ and $s_y$ are the standard deviations of the x and y variables. The correlation ranges from -1 to 1: -1 (negative correlation), 0 (uncorrelated), +1 (positive correlation).
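As a quick numerical check of the two formulas above, here is a minimal Python sketch (assuming NumPy is available) applied to the smoking/lung-capacity table:

    import numpy as np

    # Data from the smoking / lung capacity table above
    x = np.array([0, 5, 10, 15, 20])    # cigarettes (X)
    y = np.array([45, 42, 33, 31, 29])  # lung capacity (Y)

    n = len(x)
    # Sample covariance with the N-1 correction
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    # Pearson correlation: covariance normalized by the sample standard deviations
    r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

    print(cov_xy)                   # -53.75
    print(r_xy)                     # about -0.96 (strong negative correlation)
    print(np.corrcoef(x, y)[0, 1])  # same value from NumPy's built-in routine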
Correlation
[Figure: example scatter plots with their Pearson correlation coefficients (Wikipedia).]
Note that the Pearson correlation does not capture non-linear relationships.
Correlation does not imply Causation!
Confounding variables can give rise to a correlation between two indirectly related variables.
Example: an association between cancer risk and genetic variation can be confounded by population history.
Population history → genetic variation in cancer genes → cancer phenotype: correlation and causation.
Population history → genetic variation in non-cancer genes: possible correlation with the cancer phenotype, but no causation.
Confounding Example: Genetic Association Studies
Simpson's Paradox
Correlations can sometimes be reversed when combining different sets of data.
Example: test results

         Week 1          Week 2          Total
Eve      60/100 (60%)    1/10 (10%)      61/110 (55%)
Adam     9/10 (90%)      30/100 (30%)    39/110 (35%)

Adam performs better than Eve in each week, but worse when all the results are added up.
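The reversal can be verified with a few lines of plain Python using only the counts in the table:

    # (passed, attempted) for each week
    eve  = [(60, 100), (1, 10)]
    adam = [(9, 10), (30, 100)]

    def pass_rates(weeks):
        weekly = [p / n for p, n in weeks]
        overall = sum(p for p, _ in weeks) / sum(n for _, n in weeks)
        return weekly, overall

    print(pass_rates(eve))   # ([0.6, 0.1], 0.5545...)  -> 60%, 10%, 55%
    print(pass_rates(adam))  # ([0.9, 0.3], 0.3545...)  -> 90%, 30%, 35%
    # Adam wins each week, yet Eve has the higher overall pass rate.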
Batch Effects in Gene Expression Data
The two colours represent two different processing dates. Raw data from a published bladder cancer microarray study: 10 example genes showing batch effects, even after normalization. Samples cluster perfectly by processing date. Leek et al., Nature Reviews Genetics (2010)
Batch Effects in Next-Generation Sequencing
Uneven DNA sequencing coverage (human, Chr16). On some days the coverage is high (orange) and on some days low (blue). Leek et al., Nature Reviews Genetics (2010)
Eliminating batch effects: Pooling and Randomization
E.g. eliminating lane batch effects for RNA-Seq by pooling samples across lanes. [Figure contrasts a CORRECT (pooled, balanced) design with a WRONG (confounded) design.] Auer and Doerge, Genetics (2010)
Information Theory
Role of Information Theory in Biology i) Mathematical modeling of biological phenomena e.g. Optimization of early neural processing in the brain; bacterial population strategies ii) Extraction of biological information from large data-sets e.g. Gene expression analyses; GWAS (genome-wide association studies)
Mathematical Theory of Communication Claude Shannon (1948) Bell Sys. Tech. J. Vol.27, 379-423, 623-656 How to encode information? How to transmit messages reliably?
Model of General Communication System
Information source → message → Channel → Destination

Information source         Channel             Destination
Visual Image               Retina              Visual Cortex
Morphogen Concentration    Gene Pathway        Differentiation Genes
Computer File              Fiber Optic Cable   Another Computer
Model of General Communication System
[Diagram: Information source → message → Transmitter → signal → Channel (+ noise) → Receiver → message → Destination. The message is encoded by the transmitter and decoded by the receiver.]
Model of General Communication System
1) Shannon's source coding theorem: there exists a fundamental lower bound on the size of the compressed message without losing information.
Model of General Communication System
2) Shannon's channel coding theorem: information can be transmitted, with negligible error, at rates no faster than the channel capacity.
Information Theory
What is the information content of a message (random variable)? How much uncertainty is there in the outcome of an event?
E.g. nucleotide frequencies:
Homo sapiens: p(A) = p(T) = p(G) = p(C) = 0.25 → high information content
Plasmodium falciparum: p(A) = p(T) = 0.4, p(G) = p(C) = 0.1 → low information content
Measure of Uncertainty H({p_i})
Suppose we have a set of N possible events with probabilities p_1, p_2, ..., p_N.
General requirements of H:
- Continuous in the p_i
- If all p_i are equal, H should be monotonically increasing with N
- H should be consistent under grouping of events, e.g. H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
Entropy as a measure of uncertainty
A unique answer is provided by Shannon. For a discrete random variable B with N states b:
$H[B] = -\sum_{b \in B} p(b) \log_2 p(b)$
For continuous states:
$H[B] = -\int p(b) \log_2 p(b)\, db$
Similar to the Gibbs/Boltzmann entropy in statistical mechanics. H is maximal when all probabilities are equal, p(b) = 1/N, giving $H_{max} = \log_2 N$. With base-2 logarithms the units are bits (binary digits).
Interpretations of entropy H
- Average length of the shortest possible code to transmit a message (Shannon's source coding theorem)
- Captures the variability of a variable without making any model assumptions
- Average number of yes/no questions needed to determine the outcome of a random event
E.g. Homo sapiens: p(A) = p(T) = p(G) = p(C) = 0.25 gives H = 2 bits. Plasmodium falciparum: p(A) = p(T) = 0.4, p(G) = p(C) = 0.1 gives H ≈ 1.7 bits.
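A small Python sketch (NumPy assumed) reproduces the two entropies quoted above:

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy H = -sum p log2 p, skipping zero-probability states."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    human      = [0.25, 0.25, 0.25, 0.25]  # p(A), p(T), p(G), p(C)
    plasmodium = [0.40, 0.40, 0.10, 0.10]

    print(entropy_bits(human))       # 2.0 bits (the maximum for 4 states)
    print(entropy_bits(plasmodium))  # about 1.72 bits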
Example: Binding Sequence Conservation
Sequence conservation at a position: $R_{seq} = H_{max} - H_{obs} = \log_2 N - \left(-\sum_{n=1}^{N} p_n \log_2 p_n\right)$
where N is the number of possible symbols (N = 4 for DNA). CAP (Catabolite Activator Protein) acts as a transcriptional activator at more than 100 sites within the E. coli genome. Sequence conservation reveals the CAP binding site.
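As an illustration of the formula (not the actual CAP analysis), the sketch below scores each column of a toy alignment; the binding-site sequences are hypothetical, made up purely for demonstration:

    import numpy as np

    def r_seq(column, alphabet="ACGT"):
        """Per-position conservation: R_seq = log2(N) - H_obs, in bits."""
        counts = np.array([column.count(a) for a in alphabet], dtype=float)
        p = counts / counts.sum()
        p = p[p > 0]
        h_obs = -np.sum(p * np.log2(p))
        return np.log2(len(alphabet)) - h_obs

    # Hypothetical toy alignment: one aligned binding-site sequence per row
    sites = ["TGTGA", "TGAGA", "TGTGA", "CGTGA"]
    columns = ["".join(s[i] for s in sites) for i in range(len(sites[0]))]
    print([round(r_seq(c), 2) for c in columns])
    # Higher values (up to 2 bits for DNA) mark the more conserved positions.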
Two random variables?
Joint entropy: $H[X,Y] = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(x,y)$
If the variables are independent, p(x,y) = p(x)p(y), then H[X,Y] = H[X] + H[Y]. The difference measures the total amount of correlation between the two variables:
Mutual Information: $I[X;Y] = H[X] + H[Y] - H[X,Y] = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}$
Mutual Information, I(X;Y)
[Venn diagram: H[X,Y] decomposes into H[X|Y], I[X;Y], and H[Y|X].]
$I(X;Y) = H(X) - H(X|Y)$
- I(X;Y) quantifies how much the uncertainty of X is reduced if we know Y
- If X and Y are independent, then I(X;Y) = 0
- Model independent
- Captures all non-linear correlations (c.f. Pearson's correlation)
- Independent of measurement scale
- Units (bits) have physical meaning
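A minimal sketch of the computation from a small discrete joint probability table (the toy numbers below are made up):

    import numpy as np

    def mutual_information_bits(pxy):
        """I(X;Y) = sum_xy p(x,y) log2[ p(x,y) / (p(x) p(y)) ] for a joint table."""
        pxy = np.asarray(pxy, dtype=float)
        px = pxy.sum(axis=1, keepdims=True)  # marginal p(x)
        py = pxy.sum(axis=0, keepdims=True)  # marginal p(y)
        mask = pxy > 0
        return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

    independent = np.outer([0.5, 0.5], [0.5, 0.5])  # p(x,y) = p(x) p(y)
    correlated  = np.array([[0.4, 0.1],
                            [0.1, 0.4]])

    print(mutual_information_bits(independent))  # 0.0 bits
    print(mutual_information_bits(correlated))   # about 0.28 bits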
Mutual information captures non-linear relationships
[Figure: two simulated x-y relationships. Panel A: R² = 0.487 ± 0.019, I = 0.72 ± 0.08, MIC = 0.48 ± 0.02. Panel B: R² = 0.001 ± 0.002, I = 0.70 ± 0.09, MIC = 0.40 ± 0.02. The mutual information is similar in both cases even though R² differs drastically.] Kinney and Atwal, PNAS 2014
Responsiveness to complicated relations
[Figure: scatter plots of gene-b expression level vs gene-a expression level. Left: MI ≈ 1 bit, correlation ≈ 0.9. Right: MI ≈ 1.3 bits, correlation ≈ 0.]
Data processing inequality
Suppose we have a sequence of processes, e.g. a signal transduction pathway (a Markov process): A → B → C
Physical statement: in any physical process the information about A gets continually degraded along the sequence of processes.
Mathematical statement: $I(A;C) \le \min\big(I(A;B),\, I(B;C)\big)$
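A sketch that checks the inequality numerically for a toy Markov chain of two binary symmetric channels (the flip probabilities are arbitrary choices):

    import numpy as np

    def mi_bits(pxy):
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        m = pxy > 0
        return np.sum(pxy[m] * np.log2(pxy[m] / (px @ py)[m]))

    def flip_channel(eps):
        """Binary symmetric channel: flips the input bit with probability eps."""
        return np.array([[1 - eps, eps],
                         [eps, 1 - eps]])

    pa = np.array([0.5, 0.5])        # uniform input A
    a_to_b = flip_channel(0.1)       # channel A -> B
    b_to_c = flip_channel(0.2)       # channel B -> C

    p_ab = pa[:, None] * a_to_b      # joint p(a,b)
    pb   = p_ab.sum(axis=0)
    p_bc = pb[:, None] * b_to_c      # joint p(b,c)
    p_ac = p_ab @ b_to_c             # joint p(a,c) through the Markov chain

    # I(A;B) ~ 0.53, I(B;C) ~ 0.28, I(A;C) ~ 0.17: I(A;C) <= min of the other two
    print(mi_bits(p_ab), mi_bits(p_bc), mi_bits(p_ac))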
Multi-Entropy, H(X_1 X_2 ... X_n)
$H[X_1 X_2 \ldots X_n] = -\sum_{x_1, x_2, \ldots, x_n} p(x_1 x_2 \ldots x_n) \log_2 p(x_1 x_2 \ldots x_n)$
Multi-Information, I(X_1 X_2 ... X_n), measures the total correlation in n variables:
$I[X_1 X_2 \ldots X_n] = \sum_{x_1, x_2, \ldots, x_n} p(x_1 x_2 \ldots x_n) \log_2 \frac{p(x_1 x_2 \ldots x_n)}{p(x_1)\, p(x_2) \cdots p(x_n)}$
Generalised correlation between more than two elements
Multi-information is a natural extension of Shannon's mutual information to an arbitrary number of random variables:
$I(\{X_1, X_2, \ldots, X_N\}) = \sum_{i=1}^{N} H(X_i) - H(\{X_1, X_2, \ldots, X_N\})$
- Provides a general measure of non-independence among multiple variables in a network
- Captures higher-order interactions beyond simple pairwise interactions
Capturing more than pairwise relations
[Figure: gene-a/gene-b expression vs experiment index (pairwise MI ≈ 0 bits, correlation ≈ 0) compared with gene-a/gene-b/gene-c expression vs experiment index (multi-information ≈ 1.0 bits).]
Multi-allelic associations
XOR example: the phenotype P is the XOR of allele A and allele B.

A  B  P
0  0  0
0  1  1
1  0  1
1  1  0

I(A;B) = I(A;P) = I(B;P) = 0, but I(A;B;P) = 1 bit.
Multi-locus associations can be completely masked by single-locus studies!
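The XOR claims can be verified directly by enumerating the four equiprobable rows of the table; a minimal Python sketch:

    import numpy as np

    # XOR table, each (A, B, P) row equally likely
    p = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}

    def H(*idx):
        """Entropy (bits) of the marginal over the given variable indices (0=A, 1=B, 2=P)."""
        marg = {}
        for row, prob in p.items():
            key = tuple(row[i] for i in idx)
            marg[key] = marg.get(key, 0.0) + prob
        return -sum(q * np.log2(q) for q in marg.values() if q > 0)

    print(H(0) + H(2) - H(0, 2))            # I(A;P) = 0
    print(H(1) + H(2) - H(1, 2))            # I(B;P) = 0
    print(H(0) + H(1) - H(0, 1))            # I(A;B) = 0
    print(H(0) + H(1) + H(2) - H(0, 1, 2))  # multi-information = 1 bit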
Synergy and Redundancy
$S(X;Y;Z) = I(\{X,Y\};Z) - \big[\,I(X;Z) + I(Y;Z)\,\big]$
S compares the information that X and Y together provide about Z with the information that these two variables provide separately.
If S < 0, then X and Y are redundant in providing information about Z.
If S > 0, then there is synergy between X and Y.
Motivating example: X = SNP 1, Y = SNP 2, Z = phenotype (apoptosis level).
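As a quick worked check against the XOR example above: the pair (X,Y) = (A,B) determines Z = P exactly, so $I(\{A,B\};P) = H(P) - H(P \mid A,B) = 1 - 0 = 1$ bit, while $I(A;P) = I(B;P) = 0$. Hence $S = 1 - (0 + 0) = 1$ bit > 0: the two loci are purely synergistic with respect to the phenotype.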
How do we quantify distance between distributions?
Kullback-Leibler Divergence (D_KL), also known as relative entropy, quantifies the difference between two distributions P(x) and Q(x):
$D_{KL}(P\|Q) = \sum_x P(x) \ln\frac{P(x)}{Q(x)}$ (discrete)
$D_{KL}(P\|Q) = \int P(x) \ln\frac{P(x)}{Q(x)}\, dx$ (continuous)
- Non-symmetric measure
- $D_{KL}(P\|Q) \ge 0$, with $D_{KL}(P\|Q) = 0$ if and only if P = Q
- Invariant to reparameterization of x
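A minimal sketch of the discrete formula (natural log, i.e. nats), using the coin-flip distributions from the counting-statistics example below as input:

    import numpy as np

    def kl_divergence(p, q):
        """D_KL(P||Q) = sum_x P(x) ln[P(x)/Q(x)]; assumes Q(x) > 0 wherever P(x) > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.54, 0.46]  # observed distribution
    q = [0.50, 0.50]  # fair-coin distribution

    print(kl_divergence(p, q))  # about 0.0032 nats
    print(kl_divergence(q, p))  # slightly different value: D_KL is not symmetric
    print(kl_divergence(p, p))  # 0.0: vanishes if and only if the distributions match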
Kullback-Leibler Divergence: $D_{KL} \ge 0$
Proof: use Jensen's inequality. For a concave function f(x), $\langle f(x)\rangle \le f(\langle x\rangle)$, e.g. $\langle \ln x\rangle \le \ln\langle x\rangle$ (for a concave function, every chord lies below the function). Then
$D_{KL}(P\|Q) = \sum_x P(x)\ln\frac{P(x)}{Q(x)} = -\sum_x P(x)\ln\frac{Q(x)}{P(x)} = -\left\langle \ln\frac{Q(x)}{P(x)}\right\rangle_P \ge -\ln\left\langle \frac{Q(x)}{P(x)}\right\rangle_P = -\ln\sum_x P(x)\frac{Q(x)}{P(x)} = -\ln\sum_x Q(x) = -\ln 1 = 0$
Therefore $D_{KL}(P\|Q) \ge 0$.
Kullback-Leibler Divergence, Motivation 1: Counting Statistics
Flip a fair coin N times, i.e. q_H = q_T = 0.5. E.g. N = 50: observe 27 heads and 23 tails. What is the probability of observing this?
Observed distribution: P(x) = {p_H, p_T} = {0.54, 0.46}. Actual distribution: Q(x) = {q_H, q_T} = {0.50, 0.50}.
Kullback-Leibler Divergence, Motivation 1: Counting Statistics
$P(n_H, n_T) = \frac{N!}{n_H!\, n_T!}\, q_H^{n_H} q_T^{n_T}$ (binomial distribution)
$\approx \exp\!\big(-N p_H \ln(p_H/q_H) - N p_T \ln(p_T/q_T)\big) = \exp\!\big(-N D_{KL}[P\|Q]\big)$ (for large N)
- The probability of observing the counts depends on (i) N and (ii) how much the observed distribution differs from the true distribution.
- D_KL emerges from the large-N limit of a binomial (multinomial) distribution.
- D_KL quantifies how much the observed distribution diverges from the true underlying distribution.
- If D_KL > 1/N then the distributions are very different.
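A quick numerical check of the large-N statement for the 27-heads example; the Stirling prefactor in the last line is an added detail, not part of the slide:

    import numpy as np
    from math import comb, exp, log

    N, n_heads = 50, 27
    n_tails = N - n_heads
    p_obs = [n_heads / N, n_tails / N]  # observed P = {0.54, 0.46}
    q = [0.5, 0.5]                      # true fair-coin Q

    exact = comb(N, n_heads) * q[0]**n_heads * q[1]**n_tails
    d_kl = sum(pi * log(pi / qi) for pi, qi in zip(p_obs, q))

    print(exact)           # ~0.096: exact binomial probability of 27 heads / 23 tails
    print(exp(-N * d_kl))  # ~0.85: exp(-N D_KL) captures the exponential dependence on N
    # Restoring the Stirling prefactor of the binomial coefficient recovers the exact value:
    print(exp(-N * d_kl) / np.sqrt(2 * np.pi * N * p_obs[0] * p_obs[1]))  # ~0.096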
Kullback-Leibler Divergence, Motivation 2: Information Theory
How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?
D_KL(P||Q) = (avg no. of bits using the bad code) - (avg no. of bits using the optimal code)
$= \left(-\sum_x P(x)\log_2 Q(x)\right) - \left(-\sum_x P(x)\log_2 P(x)\right) = \sum_x P(x)\log_2\frac{P(x)}{Q(x)}$
Kullback-Leibler Divergence, Motivation 2: Information Theory
P(x) = {1/2, 1/4, 1/8, 1/8}, Q(x) = {1/4, 1/4, 1/4, 1/4}

Symbol   Probability of symbol, P(x)   Bad code (optimal for Q(x))   Optimal code for P(x)
A        1/2                           00                            0
C        1/4                           01                            10
T        1/8                           10                            110
G        1/8                           11                            111

Average length of the bad code = 2 bits per symbol. Average length of the optimal code = 1.75 bits, which equals the entropy of the symbol distribution, $-\sum_x p(x)\log_2 p(x) = 1.75$ bits, and is thus optimal.
D_KL(P||Q) = 2 - 1.75 = 0.25 bits, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00; C=01; T=10; G=11} instead of the optimal code.
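The numbers in the table can be reproduced with a few lines of Python (NumPy assumed):

    import numpy as np

    symbols = ["A", "C", "T", "G"]
    p = np.array([1/2, 1/4, 1/8, 1/8])  # true probabilities P(x)
    q = np.array([1/4, 1/4, 1/4, 1/4])  # assumed probabilities Q(x)

    bad_code     = {"A": "00", "C": "01", "T": "10", "G": "11"}    # optimal for Q
    optimal_code = {"A": "0",  "C": "10", "T": "110", "G": "111"}  # optimal for P

    avg_bad = sum(pi * len(bad_code[s]) for s, pi in zip(symbols, p))      # 2.0 bits
    avg_opt = sum(pi * len(optimal_code[s]) for s, pi in zip(symbols, p))  # 1.75 bits
    entropy = -np.sum(p * np.log2(p))                                      # 1.75 bits
    d_kl    = np.sum(p * np.log2(p / q))                                   # 0.25 bits

    print(avg_bad, avg_opt, entropy, d_kl)  # 2.0 1.75 1.75 0.25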