Introduction to Information Theory

Gurinder Singh Mickey Atwal (atwal@cshl.edu), Center for Quantitative Biology

Topics: Entropy, Mutual Information, Multi-information, Kullback-Leibler Divergence, Shannon's coding theorems, Summary

Role of Information Theory in Biology. i) Mathematical modeling of biological phenomena, e.g. optimization of early neural processing in the brain; bacterial population strategies. ii) Extraction of biological information from large data-sets, e.g. gene expression analyses; GWAS (genome-wide association studies).

A Mathematical Theory of Communication, Claude Shannon (1948), Bell Sys. Tech. J., Vol. 27, 379-423, 623-656. How to encode information? How to transmit messages reliably?

Model of General Communication System. Information source → Channel → Destination, e.g.:
Visual Image → Retina → Visual Cortex
Morphogen Concentration → Gene Pathway → Differentiation Genes
Computer File → Fiber Optic Cable → Another Computer

Model of General Communication System. Information source → (message) → Transmitter → (signal) → Channel, with noise added → Receiver → (message) → Destination. The message is encoded by the transmitter and decoded by the receiver.

Model of General Communication System. 1) Shannon's source coding theorem: there exists a fundamental lower bound on the size of the compressed message without losing information.

Model of General Communication System. 2) Shannon's channel coding theorem: information can be transmitted, with negligible error, at rates no faster than the channel capacity.

Information Theory. What is the information content of a message (random variable)? How much uncertainty is there in the outcome of an event? E.g. nucleotide frequencies (bar charts over A, T, G, C): Homo sapiens, p(A)=p(T)=p(G)=p(C)=0.25, high information content; Plasmodium falciparum, p(A)=p(T)=0.4, p(G)=p(C)=0.1, low information content.

Measure of Uncertainty, H({p_i}). Suppose we have a set of N possible events with probabilities p_1, p_2, ..., p_N. General requirements of H: continuous in the p_i; if all p_i are equal then H should be monotonically increasing with N; H should be consistent, e.g. a choice among probabilities {1/2, 1/3, 1/6} can be broken into a first choice between {1/2, 1/2} followed by a choice between {2/3, 1/3}.

Entropy as a measure of uncertainty. Unique answer provided by Shannon. For a discrete random variable B with N elements b:
H[B] = -\sum_{b \in B} p(b) \log_2 p(b)
For continuous states:
H[B] = -\int p(b) \log_2 p(b) \, db
Similar to the Gibbs entropy in statistical mechanics. H is maximum when all probabilities are equal, p(b) = 1/N, giving H_max = \log_2 N (the Boltzmann entropy). Units are measured in bits (binary digits), from the base-2 logarithm.
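
A minimal sketch in Python (standard library only) of the discrete entropy just defined, evaluated on the two nucleotide distributions from the earlier slide; the function and variable names are illustrative:

import math

def entropy(probs):
    # Shannon entropy H = -sum_i p_i log2(p_i), in bits; zero-probability states contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform nucleotide frequencies (the Homo sapiens example): maximum entropy log2(4) = 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# Skewed frequencies (the Plasmodium falciparum example): lower entropy, about 1.72 bits
print(entropy([0.4, 0.4, 0.1, 0.1]))       # ~1.72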

Interpretations of entropy H: the average length of the shortest code to transmit a message (Shannon's source coding theorem); captures the variability of a variable without making any model assumptions; the average number of yes/no questions needed to determine the outcome of a random event. E.g. Homo sapiens, p(A)=p(T)=p(G)=p(C)=0.25: H = 2 bits. Plasmodium falciparum, p(A)=p(T)=0.4, p(G)=p(C)=0.1: H ≈ 1.7 bits.

Entropy as average length of the shortest code.

Symbol | Probability of symbol, P(x) | Optimal code length = -log_2 P(x) | Optimal code
A      | 1/2                         | 1                                 | 0
C      | 1/4                         | 2                                 | 10
T      | 1/8                         | 3                                 | 110
G      | 1/8                         | 3                                 | 111

Average length = \sum_x P(x) [-\log_2 P(x)] = H[X] = 1.75 bits. Note that the average length of the optimal code is equal to the entropy of the distribution.

Example: binding sequence conservation. Sequence conservation
R_seq = H_max - H_obs = \log_2 N + \sum_{n=1}^{N} p_n \log_2 p_n
CAP (Catabolite Activator Protein) acts as a transcription promoter at more than 100 sites within the E. coli genome. Sequence conservation reveals the CAP binding site.
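
A sketch of the per-position conservation score R_seq for a toy alignment; the sequences below are made up for illustration (not actual CAP sites) and the small-sample correction used in real sequence logos is omitted:

import math
from collections import Counter

def r_seq(column):
    # R_seq = log2(4) - H_obs for one column of a DNA alignment (no small-sample correction)
    counts = Counter(column)
    total = sum(counts.values())
    h_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(4) - h_obs

# Toy alignment of four "binding sites" (hypothetical sequences)
sites = ["TGTGA", "TGAGA", "TGTGC", "TGCGA"]
for i in range(len(sites[0])):
    column = "".join(s[i] for s in sites)
    print(i, round(r_seq(column), 2))   # fully conserved columns score the full 2 bits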

Two random variables? Joint entropy:
H[X,Y] = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(x,y)
If the variables are independent, p(x,y) = p(x)p(y), then H[X,Y] = H[X] + H[Y]. The difference measures the total amount of correlation between the two variables: the Mutual Information, I(X;Y),
I[X;Y] = H[X] + H[Y] - H[X,Y] = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}

Mutual Information, I(X;Y). (Venn diagram: H[X,Y] = H[X|Y] + I[X;Y] + H[Y|X], with H[X] = H[X|Y] + I[X;Y] and H[Y] = H[Y|X] + I[X;Y].)
I(X;Y) = H(X) - H(X|Y)
I(X;Y) quantifies how much the uncertainty of X is reduced if we know Y. If X and Y are independent, then I(X;Y) = 0. Model independent. Captures all non-linear correlations (c.f. Pearson's correlation). Independent of measurement scale. Units (bits) have physical meaning.
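
A minimal sketch computing I(X;Y) directly from a joint probability table; the helper name and the two toy distributions are illustrative choices, not part of the lecture:

import math

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ] for a joint table {(x, y): p}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# Perfectly correlated binary variables: I = 1 bit
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))                        # 1.0
# Independent binary variables: I = 0 bits
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))    # 0.0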

Mutual information captures non-linear relationships. (Scatter plots of y vs x from Kinney and Atwal, PNAS 2014.) Panel A: R^2 = 0.487 ± 0.019, I = 0.72 ± 0.08, MIC = 0.48 ± 0.02. Panel B: R^2 = 0.001 ± 0.002, I = 0.70 ± 0.09, MIC = 0.40 ± 0.02.

Responsiveness to complicated relations. (Scatter plots of gene-b expression level vs gene-a expression level.) Left: MI ≈ 1 bit, Corr. ≈ 0.9. Right: MI ≈ 1.3 bits, Corr. ≈ 0.

Data processing inequality. Suppose we have a sequence of processes, e.g. a signal transduction pathway (Markov process): A → B → C. Physical statement: in any physical process the information about A gets continually degraded along the sequence of processes. Mathematical statement:
I(A;C) \le \min\{ I(A;B), \; I(B;C) \}
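
A numeric illustration of the inequality on a toy Markov chain A → B → C built from two binary symmetric channels; the 10% and 20% flip probabilities are arbitrary choices for this example:

import math

def mi(joint):
    # mutual information (bits) of a joint table {(x, y): p}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

def bsc(p_in, flip):
    # joint p(input, output) for a binary symmetric channel with the given flip probability
    return {(x, y): p_in[x] * (flip if x != y else 1 - flip) for x in (0, 1) for y in (0, 1)}

pA = {0: 0.5, 1: 0.5}
jAB = bsc(pA, 0.1)                            # A -> B, 10% flips
pB = {0: jAB[(0, 0)] + jAB[(1, 0)], 1: jAB[(0, 1)] + jAB[(1, 1)]}
jBC = bsc(pB, 0.2)                            # B -> C, 20% flips
jAC = bsc(pA, 0.1 * 0.8 + 0.9 * 0.2)          # A -> C flips with probability 0.26
print(mi(jAB), mi(jBC), mi(jAC))              # I(A;C) is the smallest of the three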

Multi-Entropy, H(X_1 X_2 ... X_n):
H[X_1 X_2 ... X_n] = -\sum_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) \log_2 p(x_1 x_2 ... x_n)
Multi-Information, I(X_1 X_2 ... X_n), measures the total correlation in n variables:
I[X_1 X_2 ... X_n] = \sum_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) \log_2 \frac{p(x_1 x_2 ... x_n)}{p(x_1) p(x_2) \cdots p(x_n)}

Generalised correlation between more than two elements. Multi-information is a natural extension of Shannon's mutual information to an arbitrary number of random variables:
I({X_1, X_2, ..., X_N}) = \sum_{i=1}^{N} H(X_i) - H({X_1, X_2, ..., X_N})
Provides a general measure of non-independence among multiple variables in a network. Captures higher-order interactions beyond simple pairwise interactions.
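
A sketch of the multi-information computed from a joint distribution over tuples; the three perfectly coupled binary variables below are just an illustrative example:

import math

def H(probs):
    # entropy in bits of a collection of probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

def multi_information(joint):
    # sum_i H[X_i] - H[X_1,...,X_N] for a joint table {(x_1, ..., x_n): p}
    n = len(next(iter(joint)))
    marginals = [{} for _ in range(n)]
    for xs, p in joint.items():
        for i, x in enumerate(xs):
            marginals[i][x] = marginals[i].get(x, 0.0) + p
    return sum(H(m.values()) for m in marginals) - H(joint.values())

# Three binary variables that always take the same value: 3 * 1 bit - 1 bit = 2 bits
print(multi_information({(0, 0, 0): 0.5, (1, 1, 1): 0.5}))   # 2.0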

Capturing more than pairwise relations. (Plots of gene-a/gene-b expression and of gene-a/gene-b/gene-c expression against experiment index.) Pairwise: MI ≈ 0 bits, Corr. ≈ 0. Three-way: multi-information ≈ 1.0 bits.

Multi-allelic associations. Allele A, allele B, phenotype P = A XOR B:

A B P
0 0 0
0 1 1
1 0 1
1 1 0

I(A;B) = I(A;P) = I(B;P) = 0, yet I(A;B;P) = 1 bit. Multi-locus associations can be completely masked by single-locus studies!
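
A check of the XOR table above: each pairwise mutual information vanishes, yet the allele pair (A,B) carries one full bit about the phenotype (which is also why S > 0 on the next slide); the dictionary labels are illustrative:

import math

def mi(joint):
    # mutual information (bits) of a joint table {(x, y): p}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# Uniform distribution over the XOR truth table: phenotype P = A xor B
table = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
pairs = {"A,P": {}, "B,P": {}, "A,B": {}, "(A,B),P": {}}
for a, b, p in table:
    for key, xy in (("A,P", (a, p)), ("B,P", (b, p)), ("A,B", (a, b)), ("(A,B),P", ((a, b), p))):
        pairs[key][xy] = pairs[key].get(xy, 0.0) + 0.25

for key, joint in pairs.items():
    print(key, mi(joint))   # all pairwise values are 0.0; I((A,B);P) = 1.0 bit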

Synergy and Redundancy
S = I(X;Y;Z) - I(X;Y) - I(X;Z) - I(Y;Z) = I(\{X,Y\};Z) - [I(X;Z) + I(Y;Z)]
S compares the information that X and Y together provide about Z with the information that these two variables provide separately. If S < 0 then X and Y are redundant in providing information about Z. If S > 0 then there is synergy between X and Y. Motivating example: X = SNP 1, Y = SNP 2, Z = phenotype (apoptosis level).

How do we quantify distance between distributions? Kullback-Leibler Divergence (D_KL), also known as relative entropy. Quantifies the difference between two distributions, P(x) and Q(x):
D_KL(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}   (discrete)
D_KL(P||Q) = \int P(x) \ln \frac{P(x)}{Q(x)} \, dx   (continuous)
Non-symmetric measure. D_KL(P||Q) ≥ 0, with D_KL(P||Q) = 0 if and only if P = Q. Invariant to reparameterization of x.
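
A minimal sketch of D_KL for two discrete distributions; log base 2 is used here so the value matches the 0.25-bit coding example later in the deck (the definition above uses the natural log, which only changes the units to nats):

import math

def kl_divergence(p, q):
    # D_KL(P||Q) = sum_x P(x) log2[ P(x) / Q(x) ], in bits; assumes Q(x) > 0 wherever P(x) > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(P, Q))   # 0.25 bits (the coding-overhead example a few slides below)
print(kl_divergence(P, P))   # 0.0: the divergence vanishes only when the two distributions match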

Kullback-Leibler Divergence: D_KL ≥ 0. Proof, using Jensen's inequality: for a concave function f(x), \langle f(x) \rangle \le f(\langle x \rangle), e.g. \langle \ln x \rangle \le \ln \langle x \rangle (for a concave function, every chord lies below the function).
D_KL(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} = -\sum_x P(x) \ln \frac{Q(x)}{P(x)} = -\left\langle \ln \frac{Q(x)}{P(x)} \right\rangle_P \ge -\ln \left\langle \frac{Q(x)}{P(x)} \right\rangle_P = -\ln \sum_x P(x) \frac{Q(x)}{P(x)} = -\ln \sum_x Q(x) = -\ln 1 = 0
Therefore D_KL(P||Q) ≥ 0.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics. Flip a fair coin N times, i.e., q_H = q_T = 0.5. E.g. N = 50; observe 27 heads and 23 tails. What is the probability of observing this? (Bar charts of the Heads/Tails frequencies.) Observed distribution: P(x) = {p_H = 0.54, p_T = 0.46}. Actual distribution: Q(x) = {q_H = 0.50, q_T = 0.50}.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics.
P(n_H, n_T) = \frac{N!}{n_H! \, n_T!} q_H^{n_H} q_T^{n_T}   (binomial distribution)
            \approx \exp(-N p_H \ln(p_H/q_H) - N p_T \ln(p_T/q_T)) = \exp(-N D_{KL}[P||Q])   (for large N)
The probability of observing the counts depends on (i) N and (ii) how much the observed distribution differs from the true distribution. D_KL emerges from the large-N limit of a binomial (multinomial) distribution; it quantifies how much the observed distribution diverges from the true underlying distribution. If D_KL > 1/N then the distributions are very different.
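
A numeric check of the large-N statement above for the 27-heads example; the Stirling prefactor in the last line is an addition of mine (a standard refinement of the approximation), not something on the slide:

import math

N, n_heads = 50, 27
q = 0.5                    # fair coin: q_H = q_T = 0.5
p = n_heads / N            # observed frequency of heads, 0.54

# Exact binomial probability of exactly 27 heads in 50 fair flips
exact = math.comb(N, n_heads) * q**N
# Leading large-N factor kept on the slide: exp(-N * D_KL[P||Q]), with D_KL in nats
d_kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / q)
print(exact)                                                            # ~0.096
print(math.exp(-N * d_kl))                                              # ~0.85 (exponential factor alone)
# Multiplying by the Stirling prefactor 1/sqrt(2*pi*N*p*(1-p)) recovers the exact value closely
print(math.exp(-N * d_kl) / math.sqrt(2 * math.pi * N * p * (1 - p)))   # ~0.096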

Kullback-Leibler Divergence, Motivation 2: Information Theory. How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?
D_KL(P||Q) = (avg no. of bits using bad code) - (avg no. of bits using optimal code)
           = \left(-\sum_x P(x) \log_2 Q(x)\right) - \left(-\sum_x P(x) \log_2 P(x)\right) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}

Kullback-Leibler Divergence, Motivation 2: Information Theory.

Symbol | Probability of symbol, P(x) | Bad code (optimal for Q) | Optimal code for P
A      | 1/2                         | 00                       | 0
C      | 1/4                         | 01                       | 10
T      | 1/8                         | 10                       | 110
G      | 1/8                         | 11                       | 111

P(x) = {1/2, 1/4, 1/8, 1/8}, Q(x) = {1/4, 1/4, 1/4, 1/4}. Average length with the bad code = 2 bits; average length with the optimal code = 1.75 bits, which equals the entropy of the symbol distribution, -\sum_x p(x) \log_2 p(x) = 1.75 bits, and is thus optimal. D_KL(P||Q) = 2 - 1.75 = 0.25 bits, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00; C=01; T=10; G=11} instead of the optimal code.
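
A short check of the overhead calculation on this slide, comparing the expected lengths of the two codes with D_KL(P||Q); variable names are illustrative:

import math

P = {"A": 0.5, "C": 0.25, "T": 0.125, "G": 0.125}
bad_code = {"A": "00", "C": "01", "T": "10", "G": "11"}      # optimal for the uniform Q
good_code = {"A": "0", "C": "10", "T": "110", "G": "111"}    # optimal for P

bad_len = sum(P[s] * len(bad_code[s]) for s in P)            # 2.0 bits per symbol
good_len = sum(P[s] * len(good_code[s]) for s in P)          # 1.75 bits per symbol (= entropy of P)
d_kl = sum(p * math.log2(p / 0.25) for p in P.values())      # D_KL(P||Q) with Q uniform
print(bad_len - good_len, d_kl)                              # both equal 0.25 bits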