QB LECTURE #4: Motif Finding


QB LECTURE #4: Motif Finding. Adam Siepel, Nov. 20, 2015

2 Plan for Today
- Probability models for binding sites
- Scoring and detecting binding sites
- De novo motif finding

3 Transcription Initiation
[Figure: transcription initiation, showing chromatin, a distal TFBS within a cis-regulatory module (CRM), a co-activator complex, proximal TFBSs, and the transcription initiation complex.]

4 Binding Sites
(a) Source binding sites (eight sites, 14 positions each):
Site 1: GACCAAATAAGGCA
Site 2: GACCAAATAAGGCA
Site 3: TGACTATAAAAGGA
Site 4: TGACTATAAAAGGA
Site 5: TGCCAAAAGTGGTC
Site 6: CAACTATCTTGGGC
Site 7: CAACTATCTTGGGC
Site 8: CTCCTTACATGGGC
(b) Consensus sequence: BRMCWAWHRWGGBM

5 Probability Model for Motifs
Let x = (x_1, ..., x_k) be a sequence possibly representing a binding site of length k.
We represent the motif as a sequence of position-specific multinomial models, π = (π_{1,A}, π_{1,C}, π_{1,G}, π_{1,T}, π_{2,A}, ..., π_{k,T}), such that π_{i,j} is the probability of base j at position i.
The likelihood is:
L(x \mid \pi) = \prod_{i=1}^{k} P(x_i \mid \pi_{i,\cdot}) = \prod_{i=1}^{k} \pi_{i, x_i}

6 Background Model
Assume an iid multinomial background model, θ = (θ_A, θ_C, θ_G, θ_T), so that:
L(x \mid \theta) = \prod_{i=1}^{k} \theta_{x_i}
As with alignment, classical theory says a good statistic for discrimination is the log-odds score:
S(x) = \log \frac{L(x \mid \pi)}{L(x \mid \theta)} = \sum_{i=1}^{k} \left( \log \pi_{i, x_i} - \log \theta_{x_i} \right) = \sum_{i=1}^{k} s_{i, x_i}
where s_{i,a} = \log \pi_{i,a} - \log \theta_a
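As an illustration of these two models, here is a minimal Python sketch (the toy probabilities in pi and theta are made up for illustration, and logs are taken in base 2, i.e., bits) that computes L(x | pi), L(x | theta), and the log-odds score.

```python
import math

# Toy motif of width k = 3 (hypothetical numbers): pi[i][b] is the probability
# of base b at motif position i; theta[b] is the iid background probability.
pi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
theta = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def motif_likelihood(x, pi):
    """L(x | pi) = product over positions i of pi_{i, x_i}."""
    return math.prod(pi[i][b] for i, b in enumerate(x))

def background_likelihood(x, theta):
    """L(x | theta) = product over positions of theta_{x_i}."""
    return math.prod(theta[b] for b in x)

def log_odds(x, pi, theta):
    """S(x) = sum_i (log2 pi_{i,x_i} - log2 theta_{x_i}), in bits."""
    return sum(math.log2(pi[i][b]) - math.log2(theta[b]) for i, b in enumerate(x))

x = "AGC"
print(motif_likelihood(x, pi))          # 0.7^3 = 0.343
print(background_likelihood(x, theta))  # 0.25^3 = 0.015625
print(log_odds(x, pi, theta))           # log2(0.343 / 0.015625), about 4.46 bits
```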

7 Weight Matrix
(a) Source binding sites and (b) consensus sequence, as on the previous slide.
(c) Position frequency matrix (PFM):
Pos:  1  2  3  4  5  6  7  8  9 10 11 12 13 14
A:    0  4  4  0  3  7  4  3  5  4  2  0  0  4
C:    3  0  4  8  0  0  0  3  0  0  0  0  2  4
G:    2  3  0  0  0  0  0  0  1  0  6  8  5  0
T:    3  1  0  0  5  1  4  2  2  4  0  0  1  0
(d) Position weight matrix (PWM) of log-odds scores {s_{i,a}}:
A:  -1.93  0.79  0.79 -1.93  0.45  1.50  0.79  0.45  1.07  0.79  0.00 -1.93 -1.93  0.79
C:   0.45 -1.93  0.79  1.68 -1.93 -1.93 -1.93  0.45 -1.93 -1.93 -1.93 -1.93  0.00  0.79
G:   0.00  0.45 -1.93 -1.93 -1.93 -1.93 -1.93 -1.93 -0.66 -1.93  1.30  1.68  1.07 -1.93
T:   0.45 -0.66 -1.93 -1.93  1.07 -0.66  0.79  0.00  0.00  0.79 -1.93 -1.93 -0.66 -1.93

8 Estimating the Model
If we have several training examples, we can estimate the parameters π_{i,j} in the usual way for multinomial models. Using the PFM from the eight example sites:
π_{1,A} = 0, π_{1,C} = 3/8, π_{1,G} = 1/4, ..., π_{14,T} = 0
Problem: sparse data. With only eight sites, many estimates are exactly zero.
(Site alignment and PFM as on the previous slide.)

9 Example of Estimates with Pseudocounts
Adding a pseudocount of 1 to each count (so each column total becomes 8 + 4 = 12) gives:
π_{1,A} = 1/12, π_{1,C} = 4/12, π_{1,G} = 3/12, ..., π_{14,T} = 1/12
(Site alignment and PFM as on the previous slides.)
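The pseudocount estimates above can be reproduced with a short sketch; the eight site sequences come from the earlier alignment, and a pseudocount of 1 per base is assumed.

```python
from collections import Counter

# The eight source binding sites from the example alignment.
sites = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]
BASES = "ACGT"

def estimate_pi(sites, pseudocount=1.0):
    """Per-position estimates: pi[i][b] = (count + pseudocount) / (N + 4 * pseudocount)."""
    n = len(sites)
    pi = []
    for column in zip(*sites):              # iterate over the 14 alignment columns
        counts = Counter(column)
        denom = n + 4 * pseudocount
        pi.append({b: (counts[b] + pseudocount) / denom for b in BASES})
    return pi

pi = estimate_pi(sites)
print(pi[0])    # position 1: A -> 1/12, C -> 4/12, G -> 3/12, T -> 4/12
```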

10 Prediction of Binding Sites
We predict a binding site if and only if:
S(x) = \sum_{i=1}^{k} s_{i, x_i} \geq T
where T is chosen to achieve the desired tradeoff between sensitivity and specificity.
Sensitivity is the fraction of true sites that are predicted (1 - false negative rate).
Specificity is the fraction of false sites that are not predicted (1 - false positive rate).
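A minimal sketch of threshold-based prediction: slide a window of the motif width across a sequence, score it with the weight matrix, and report windows with S(x) >= T. The 3-column weight matrix and threshold below are hypothetical.

```python
# Hypothetical 3-column weight matrix of log2-odds scores s_{i,a}; in practice
# these come from the estimated pi and background theta as on earlier slides.
pwm = [
    {"A": 1.5, "C": -1.9, "G": -1.9, "T": -0.7},
    {"A": -1.9, "C": -0.7, "G": 1.5, "T": -1.9},
    {"A": -0.7, "C": 1.5, "G": -1.9, "T": -1.9},
]

def score(window, pwm):
    """S(x) = sum of position-specific weights s_{i, x_i}."""
    return sum(pwm[i][b] for i, b in enumerate(window))

def predict_sites(seq, pwm, T):
    """Slide a window of the motif width across seq; report hits with S(x) >= T."""
    k = len(pwm)
    hits = []
    for i in range(len(seq) - k + 1):
        s = score(seq[i:i + k], pwm)
        if s >= T:
            hits.append((i, s))
    return hits

print(predict_sites("TTAGCAAGCTAGC", pwm, T=2.0))   # start positions and scores of hits
```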

11
[Figure: distributions of the score S(x) under the background (null) model and for binding sites (alternative), with the threshold T marked; the alternative mass below T gives false negatives, and the null mass above T gives false positives.]

12 Sorting Out Terms
Prediction outcome vs. true condition:
           True    False
  Pos      TP      FP      (PPV)
  Neg      FN      TN      (NPV)
Sens = TP / (TP + FN)
Spec = TN / (FP + TN)
FP rate = type I error = α = FP / (FP + TN) = 1 - Spec
FN rate = type II error = β = FN / (TP + FN) = 1 - Sens
Power = 1 - β (for bounded α)
A p-value is an estimate of α for a given observation.
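These definitions translate directly into code; the counts in the usage line are arbitrary example numbers.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Rates from the 2x2 table of prediction outcome vs. true condition."""
    return {
        "sens": tp / (tp + fn),   # sensitivity = 1 - FN rate
        "spec": tn / (fp + tn),   # specificity = 1 - FP rate
        "fpr": fp / (fp + tn),    # type I error rate (alpha)
        "fnr": fn / (tp + fn),    # type II error rate (beta)
        "ppv": tp / (tp + fp),    # positive predictive value
        "npv": tn / (tn + fn),    # negative predictive value
    }

print(confusion_metrics(tp=80, fp=20, fn=10, tn=890))   # arbitrary example counts
```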

13 How to Choose T?
- If known positive and negative examples are available, can estimate sensitivity and specificity directly and adjust accordingly
- Can control the false positive rate only, using a reasonable proxy for background (sometimes permuted data)
- Can generate synthetic data from the background model and use it to simulate from the null distribution of S(x)
- Can compute the exact null distribution by dynamic programming in some cases
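A sketch of the simulation-based option from the list above: draw many length-k sequences from the background model, score them, and set T at the quantile corresponding to a target false positive rate. The background, PWM, target rate, and number of simulations are illustrative.

```python
import random

theta = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}    # illustrative background
pwm = [
    {"A": 1.5, "C": -1.9, "G": -1.9, "T": -0.7},        # hypothetical 3-column PWM
    {"A": -1.9, "C": -0.7, "G": 1.5, "T": -1.9},
    {"A": -0.7, "C": 1.5, "G": -1.9, "T": -1.9},
]

def choose_threshold(pwm, theta, target_fpr=1e-3, n_sims=100_000, seed=0):
    """Set T at the (1 - target_fpr) quantile of simulated null scores."""
    rng = random.Random(seed)
    bases, probs = zip(*theta.items())
    k = len(pwm)
    scores = []
    for _ in range(n_sims):
        x = rng.choices(bases, weights=probs, k=k)      # one draw from the background
        scores.append(sum(pwm[i][b] for i, b in enumerate(x)))
    scores.sort()
    return scores[int((1 - target_fpr) * n_sims)]

print(choose_threshold(pwm, theta))   # threshold giving roughly a 0.1% false positive rate
```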

14 Computing p-values
Similar methods can be used to compute p-values for predicted motifs.
First, characterize the null distribution f of the log-odds score S(x), empirically or analytically.
Then assign a p-value to a prediction by computing:
p = \sum_{y \geq S(x)} f(y)
Must be corrected for multiple testing.
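The dynamic-programming idea mentioned on the previous slide can be sketched as follows: convolve the score distribution one PWM column at a time under the background model, pooling partial sums at a fixed rounding precision, then sum the tail at or above the observed score. The rounding precision and the toy PWM/background are assumptions of this sketch.

```python
from collections import defaultdict

def null_score_distribution(pwm, theta, precision=3):
    """Distribution of S(x) under the iid background, by dynamic programming:
    convolve one PWM column at a time, pooling partial sums rounded to `precision`."""
    dist = {0.0: 1.0}
    for col in pwm:
        new = defaultdict(float)
        for partial, p in dist.items():
            for b, q in theta.items():
                new[round(partial + col[b], precision)] += p * q
        dist = dict(new)
    return dist

def p_value(s, null_dist):
    """p = P(S >= s) under the null distribution."""
    return sum(p for score, p in null_dist.items() if score >= s)

theta = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}    # illustrative background
pwm = [
    {"A": 1.5, "C": -1.9, "G": -1.9, "T": -0.7},        # hypothetical 3-column PWM
    {"A": -1.9, "C": -0.7, "G": 1.5, "T": -1.9},
    {"A": -0.7, "C": 1.5, "G": -1.9, "T": -1.9},
]
null = null_score_distribution(pwm, theta)
print(p_value(4.5, null))   # probability of a score >= 4.5 by chance (1/64 here)
```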

15 Improving the Background Model
Bases are not independent: CpGs, poly-A runs, simple sequence repeats, transposons, etc.
In some cases, this nonindependence will inflate false positive rates, so a better background model is needed.
Typically, higher-order Markov models are used.

16 Markov Models
We are interested in the joint distribution of X_1, ..., X_k and for convenience have assumed independence:
P(X_1, ..., X_k) = P(X_1) \cdots P(X_k)
It may be slightly less egregious to assume:
P(X_1, ..., X_k) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) \cdots P(X_k \mid X_{k-1})
This is a 1st-order Markov model. In an Nth-order model, each X_i depends on X_{i-N}, ..., X_{i-1}.

17 Markov Scores
Now the background model is θ = (θ_{A|A}, θ_{C|A}, ..., θ_{T|T}), the conditional probabilities of each base given the previous base, where θ_{x_1|x_0} is specially defined to denote the marginal probability of x_1. The log-odds score is:
S(x) = \log \frac{L(x \mid \pi)}{L(x \mid \theta)} = \sum_{i=1}^{k} \left( \log \pi_{i, x_i} - \log \theta_{x_i \mid x_{i-1}} \right) = \sum_{i=1}^{k} s_{i, x_i \mid x_{i-1}}
where s_{i, a \mid b} = \log \pi_{i,a} - \log \theta_{a \mid b}
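A sketch of a first-order Markov background: estimate the conditional probabilities θ_{a|b} from a background sequence (with pseudocounts), use the marginal for the first position as above, and score against the motif model. The background sequence and the tiny width-2 motif in the usage example are made up for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_markov_background(seq, pseudocount=1.0):
    """theta[b][a] ~ P(next base = a | previous base = b), plus marginals for position 1."""
    pair_counts = Counter(zip(seq, seq[1:]))     # (previous, next) pairs
    prev_counts = Counter(seq[:-1])
    theta = {b: {a: (pair_counts[(b, a)] + pseudocount) /
                    (prev_counts[b] + 4 * pseudocount) for a in BASES}
             for b in BASES}
    marginal = {a: (seq.count(a) + pseudocount) / (len(seq) + 4 * pseudocount)
                for a in BASES}
    return theta, marginal

def markov_log_odds(x, pi, theta, marginal):
    """S(x) = sum_i (log2 pi_{i,x_i} - log2 theta_{x_i | x_{i-1}}), marginal for i = 1."""
    s = 0.0
    for i, b in enumerate(x):
        bg = marginal[b] if i == 0 else theta[x[i - 1]][b]
        s += math.log2(pi[i][b]) - math.log2(bg)
    return s

# Illustrative background sequence and a made-up width-2 motif model.
theta, marginal = estimate_markov_background("ACGTACGTAAAACCCCGGGGTTTTACGT")
pi = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
      {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
print(markov_log_odds("AG", pi, theta, marginal))
```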

18 Effect of Better Background Model
[Figure: two panels, "First Model" and "Better Model", each showing the background (null) and binding-site (alternative) distributions of S(x) with the threshold T and the resulting false negatives and false positives.]

19 An Aside on Information Theory
- Invented by Claude Shannon in the late 1940s, at the dawn of the digital age
- Motivated by problems in information transmission, especially data compression
- Has deep connections with probability theory, computer science, statistical mechanics, gambling and investment, etc.
- You benefit from it every time you gzip a file or look at a JPEG image!

20 Entropy
The entropy of a (discrete) rv X is:
H(X) = -\sum_x p(x) \log p(x) = E\left[ \log \frac{1}{p(X)} \right]
Interpretations of H(X):
- Min. ave. length of binary encoding of X
- Ave. information gained by observing X
- Min. ave. number of yes/no questions to find out X
- Min. ave. number of fair coins required to generate X
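Entropy in bits is a one-liner; the example distributions (a fair coin, the θ = 0.2 coin used in the next slides, and uniform DNA) are chosen to match numbers that appear later in these slides.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

print(entropy({"H": 0.5, "T": 0.5}))    # 1.0 bit for a fair coin
print(entropy({"H": 0.2, "T": 0.8}))    # about 0.722 bits for the theta = 0.2 coin
print(entropy({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))   # 2 bits for random DNA
```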

21 Encoding Example
Suppose we want to encode n coin tosses as a binary sequence. If the coin is fair, we can do no better than to use a bit for each coin toss, e.g., 00101110 for TTHTHHHT. It will always take n bits to encode the sequence.
Suppose, however, that the coin is biased, with heads probability θ = 0.2. Can we do better? It turns out we can (for large enough n), by encoding subsequences and giving shorter codes to more probable subsequences.

22 Encoding Example, cont.
X     P(X)    Code
TTT   0.512   0
TTH   0.128   100
THT   0.128   101
HTT   0.128   110
THH   0.032   11100
HTH   0.032   11101
HHT   0.032   11110
HHH   0.008   11111
[Figure: binary code tree corresponding to this code.]
Expected length: 0.512×1 + 0.128×3×3 + 0.032×5×3 + 0.008×5 = 2.184 bits per block of three tosses
Therefore, 2.184 / 3 = 0.728 bits/coin are needed.
For the naive code: 1 bit/coin. Entropy: 0.722 bits/coin.
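A short check of the numbers above: compute the expected codeword length per toss for this block code and compare it with the entropy of a θ = 0.2 coin.

```python
import math

# The block code from the table above, for three tosses of a theta = 0.2 coin.
code = {
    "TTT": "0",     "TTH": "100",   "THT": "101",   "HTT": "110",
    "THH": "11100", "HTH": "11101", "HHT": "11110", "HHH": "11111",
}
theta = 0.2   # probability of heads

def block_prob(block, theta):
    """Probability of a particular sequence of three tosses."""
    return math.prod(theta if c == "H" else 1 - theta for c in block)

expected_len = sum(block_prob(b, theta) * len(cw) for b, cw in code.items())
entropy_per_toss = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

print(expected_len)        # 2.184 bits per 3-toss block
print(expected_len / 3)    # 0.728 bits per toss for this code
print(entropy_per_toss)    # about 0.722 bits per toss, the theoretical lower bound
```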

23 Entropy for Bernoulli rv with Parameter p
[Plot: the binary entropy function H(p) as a function of p.]
H(X) is always concave and nonnegative.

24 Perfect Code
Suppose X has pdf:
p(x) = 1/2 for x = a, 1/4 for x = b, 1/8 for x = c, 1/8 for x = d
An optimal binary encoding is:
a -> 0, b -> 10, c -> 110, d -> 111
Expected length = H(X) = 1.75 bits. Naive encoding: 2 bits.

25 Entropy and Information
Before an event X, your uncertainty about it is measured by H(X).
Therefore, when you observe X, your ave. gain in information is measured by H(X).
However, you may not observe X directly; after observing a noisy message Y, there may still be uncertainty about X.
We can measure the (ave.) information content of Y as H_before(X) - H_after(X).

26 Relative Entropy
The relative entropy of pdf p wrt pdf q is:
D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
It represents the average additional bits needed to encode X if it comes from p but the code was optimized for q:
D(p \| q) = H_{pq}(X) - H_{pp}(X) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) = \sum_x p(x) \log \frac{p(x)}{q(x)}
Useful as a measure of divergence between distributions.
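A direct translation of D(p || q); the two toy distributions are chosen for illustration, and the asymmetry is worth noting.

```python
import math

def relative_entropy(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x) / q(x)); requires q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Toy distributions for illustration.
p = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(relative_entropy(p, q))   # extra bits/symbol if data from p are coded optimally for q
print(relative_entropy(q, p))   # note the asymmetry: D(p||q) != D(q||p) in general
```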

27 Mutual Information
The mutual information in rv's X and Y is:
I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
I(X;Y) is the relative entropy of P(X,Y) wrt P(X)P(Y). It represents the reduction in uncertainty about X due to knowledge of Y.
Mutual information can be thought of as a test statistic for independence (connected with the χ² test and G test).
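Mutual information computed from a joint probability table; the joint distribution over two binary variables below is made up for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ] from a joint table."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p      # marginal of X
        py[y] = py.get(y, 0.0) + p      # marginal of Y
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# A toy joint distribution over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(joint))   # about 0.278 bits; 0 would mean X and Y are independent
```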

28 Likelihood Connection
Suppose we have n iid random variables X_1, ..., X_n. What is the expected log likelihood?
\sum_{i=1}^{n} \sum_x p(x) \log p(x) = -n H(X)
Similarly, what is the expected log-odds score of model 1 wrt model 2, if the variables are drawn from model 1?
\sum_{i=1}^{n} \sum_x p_1(x) \log \frac{p_1(x)}{p_2(x)} = n D(p_1 \| p_2)
If drawn from model 2?
\sum_{i=1}^{n} \sum_x p_2(x) \log \frac{p_1(x)}{p_2(x)} = -n D(p_2 \| p_1)
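A quick Monte Carlo check of the middle identity: the average per-symbol log-odds of data drawn from p1 approaches D(p1 || p2). The distributions and sample size below are illustrative.

```python
import math, random

# Toy models; averaging the per-symbol log-odds over draws from p1 should
# approach D(p1 || p2) as the sample size grows.
p1 = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
p2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def kl(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

rng = random.Random(0)
bases, weights = zip(*p1.items())
draws = rng.choices(bases, weights=weights, k=200_000)
avg_log_odds = sum(math.log2(p1[b] / p2[b]) for b in draws) / len(draws)

print(avg_log_odds)   # Monte Carlo estimate of E_{p1}[log2 p1(X)/p2(X)]
print(kl(p1, p2))     # D(p1 || p2), which the estimate should approach
```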

29 Motif Information Content
The entropy of the distribution at each position determines the information content of that position:
IC_i = 2 - H(X_i)
This can be considered the ave. reduction in uncertainty wrt random DNA. It is also the relative entropy wrt random DNA:
\sum_b p(X = b) \log \frac{p(X = b)}{1/4} = \sum_b p(X = b) \log p(X = b) - \sum_b p(X = b) \log \frac{1}{4} = 2 - H(X)
Also related to the binding energy and to evolutionary constraint.
Visualized in widely used sequence logos.
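Per-position information content follows directly from the entropy function; the position-1 probabilities used in the example are the pseudocount estimates from slide 9.

```python
import math

def information_content(pi_i):
    """IC_i = 2 - H(X_i), in bits, relative to uniform (random) DNA."""
    h = -sum(p * math.log2(p) for p in pi_i.values() if p > 0)
    return 2.0 - h

# Position-1 estimates with pseudocounts from the earlier slide.
pi_1 = {"A": 1/12, "C": 4/12, "G": 3/12, "T": 4/12}
print(information_content(pi_1))   # bits of information at this motif position
```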

30
[Figure slide repeating the weight-matrix example:]
(a) Source binding sites, (b) consensus sequence, (c) PFM, and (d) PWM, as on slide 7.
(e) Site scoring of the candidate sequence TTACATAAGTAGTC against the PWM:
0.45 - 0.66 + 0.79 + 1.68 + 0.45 - 0.66 + 0.79 + 0.45 - 0.66 + 0.79 + 0.00 + 1.68 - 0.66 + 0.79, Σ = 5.23, 78% of maximum
(f) [Sequence logo: information content in bits (0 to 2) at each of the 14 positions.]

31 Motif Discovery
Consider the problem of estimating a motif model from N sequences, each believed to have a binding site for some TF.
As before, we assume a motif model of width k with a multinomial distribution θ_l at each position l, and an iid multinomial background model θ_bg.
The goal in this case is to learn the parameters of the motif model.
The location of the binding site in each sequence i, denoted z_i, is a latent variable.

32 Illustration
[Figure: the iterative motif-discovery scheme. Initialize the motif parameters θ_1, θ_2, ..., θ_k, then repeatedly "sample or average" over candidate binding-site positions given the current parameters and "sample or average" to update the parameters.]

33 EM vs. Gibbs Sampling
- In EM, we average over potential positions; in Gibbs sampling, we sample positions
- In EM, you estimate parameters that maximize the likelihood (locally)
- In Gibbs, you sample both binding sites and parameters, allowing for uncertainty in both
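A compact sketch of the EM variant in a "one occurrence per sequence" setting: the E-step averages over all candidate start positions z_i with a uniform prior, and the M-step re-estimates the position-specific probabilities from the expected counts. A Gibbs sampler would instead sample one z_i per sequence and resample the parameters. The input sequences, initialization, pseudocount, and iteration count here are all illustrative, and EM finds only a local maximum of the likelihood.

```python
import math, random
from collections import defaultdict

BASES = "ACGT"

def em_motif(seqs, k, theta, n_iter=50, pseudocount=0.5, seed=0):
    """EM for a one-occurrence-per-sequence motif model: the E-step computes a
    posterior weight for every candidate start position, the M-step re-estimates pi."""
    rng = random.Random(seed)
    # Initialize pi as a randomly perturbed, renormalized copy of the background.
    pi = []
    for _ in range(k):
        col = {b: theta[b] * (1.0 + rng.random()) for b in BASES}
        z = sum(col.values())
        pi.append({b: col[b] / z for b in BASES})

    for _ in range(n_iter):
        counts = [defaultdict(float) for _ in range(k)]
        for seq in seqs:
            # E-step: weight of each start position z (uniform prior over positions).
            weights = []
            for z in range(len(seq) - k + 1):
                log_w = sum(math.log(pi[j][seq[z + j]]) - math.log(theta[seq[z + j]])
                            for j in range(k))
                weights.append(math.exp(log_w))
            total = sum(weights)
            # Accumulate expected base counts for each motif column.
            for z, w in enumerate(weights):
                for j in range(k):
                    counts[j][seq[z + j]] += w / total
        # M-step: re-estimate pi with pseudocounts.
        pi = [{b: (counts[j][b] + pseudocount) /
                  (sum(counts[j].values()) + 4 * pseudocount) for b in BASES}
              for j in range(k)]
    return pi

# Illustrative input: short sequences, each containing a planted TATAAT-like site.
seqs = ["GCGCTATAATGCGC", "ATTATAATCCGGAA", "GGGTATAATTTTCC", "CCTATAATGGATCG"]
theta = {b: 0.25 for b in BASES}
pi = em_motif(seqs, k=6, theta=theta)
print([max(col, key=col.get) for col in pi])   # rough consensus of the learned motif (may vary with initialization)
```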

34 That's All