CS284A: Representations and Algorithms in Molecular Biology

Scribe Notes on Lectures 3 & 4: Motif Discovery via Enumeration & Motif Representation Using Position Weight Matrix

Joshua Gervi

Based on presentations by Professor Xiaohui Xie on January 14 & 16, 2008

I. Motif Discovery via Enumeration

A. A Model for Motif Discovery (Review from Lecture 2)

We want to identify biologically significant motifs in a set $S$ of $n$ sequences, $s_1, s_2, \ldots, s_n$. Each potentially significant motif $m_i$ of length $w$ is associated with a summation variable $k_i$, which is the total number of sequences from $S$ in which the motif appears.

To systematically measure this significance, we must first find the underlying probability $p$ that a sequence of length $l$ contains a given theoretical motif of length $w$. With the overriding assumption that the four bases are uniformly distributed, or

$$\bigl(P(A), P(C), P(G), P(T)\bigr) = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right),$$

we have calculated a value for $p$ of

$$p = 1 - \left(1 - \frac{1}{4^{w}}\right)^{l-w+1}.$$

We use $p$ as the probability of success for finding this theoretical motif each time we sample a sequence from the set $S$. For $k$ successes out of $n$ trials, the probability is binomial,

$$P(k) = \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$

as a motif either is in a sequence or is not. To test the significance of our specific motif $m_i$, we evaluate a p-value: the probability, based on our distribution, that $m_i$ would appear in at least $k_i$ sequences.
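The quantities in Part A can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function names and the example numbers (`l=100`, `w=6`, `n=50`, `k_i=5`) are assumptions made up for the sketch, and the p-value is computed as the probability of at least $k_i$ successes, as defined in the text.

```python
from math import comb


def motif_hit_probability(l: int, w: int) -> float:
    """Probability that a length-l sequence contains a given length-w motif,
    assuming uniform base frequencies: p = 1 - (1 - 1/4**w)**(l - w + 1)."""
    return 1.0 - (1.0 - 0.25 ** w) ** (l - w + 1)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binomial_tail(k_i: int, n: int, p: float) -> float:
    """p-value: probability that the motif appears in at least k_i of n sequences."""
    return sum(binomial_pmf(k, n, p) for k in range(k_i, n + 1))


# Hypothetical numbers, chosen only for illustration.
p = motif_hit_probability(l=100, w=6)
p_value = binomial_tail(k_i=5, n=50, p=p)
print(f"p = {p:.4g}, p-value = {p_value:.3g}")
```

Summing the probability mass function directly is adequate for the small $n$ used here; for very large $n$ a library routine for the binomial survival function would be the more robust choice.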

The p-value is the upper tail of the binomial distribution:

$$\sum_{k=k_i}^{n} P(k) = \sum_{k=k_i}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}.$$

If the p-value is smaller than a chosen significance level,¹ we can say with some confidence that our motif $m_i$ is biologically significant. For large $n$ the binomial distribution is approximated by a normal distribution, and we can map $k_i$ to a new distribution and compute the z-score to determine the significance of our motif $m_i$.

B. Problems with this Model

1. The assumption that the four bases are uniformly distributed in the sequences is not necessarily correct. To be more accurate, we would need to model the first-order statistics (i.e., $P(A)$, $P(C)$, $P(G)$, and $P(T)$) of the nucleotide distribution.

2. The model ignores second-order statistics. Two bases might be more likely to occur together than if they were distributed at random (e.g., $P(GA) \neq P(G)P(A)$). The same could also be said for higher-order statistics.

C. Control Sequences

In order not to rely on the assumption of a uniform distribution of bases to measure significance, we can generate a set of $N$ control sequences, $s^o_1, s^o_2, \ldots, s^o_N$. The assumption is that our motif of interest $m_i$ is not significant in the control sequences. Now we have two sets of sequences. Each $m_i$ is associated with two values $k_i$ and $k^o_i$, which correspond to the number of different sequences in which this motif appears in the sets $S$ and $S^o$, respectively. To find out if our motif $m_i$ is biologically significant, we choose the appropriate probability distribution for successfully finding a motif in $k$ out of $n$ trials. There are two types to choose from:

1. The binomial distribution

If the set $S$ is independent of $S^o$, we can still model the probability $P(k)$ of finding a motif in $k$ out of $n$ trials using the binomial distribution. If $S \subset S^o$ (i.e., the set $S$ is a subset of $S^o$), choosing the appropriate distribution depends on the size of both sets and the distribution of our motif $m_i$ in them. If the number $N$ of $S^o$ sequences and the number $k^o_i$ of sequences containing our motif are large compared to the number $n$ of $S$ sequences, then the probability $p$ of randomly picking a sequence with our motif remains essentially unchanged over $n$ trials, and we can still model the probability $P(k)$ using the binomial distribution.²

For these scenarios the only change we need to make from the model in Part A is to adopt a different underlying probability $p$ of success for finding a motif every time we sample a sequence. For $p$ we will use the relative frequency $k^o_i / N$ with which our motif $m_i$ is found in the set $S^o$. This way, when we run the trials, we can compare the distributions from both $S$ and $S^o$ to see if our motif indeed stands out in $S$. The probability of success on $k$ out of $n$ trials may be written as

$$P(k) = \binom{n}{k} \left(\frac{k^o_i}{N}\right)^{k} \left(1 - \frac{k^o_i}{N}\right)^{n-k}.$$

To test the significance of our motif, we calculate the p-value in the same fashion as before: $\sum_{k=k_i}^{n} P(k)$. For large $n$ we can again map $k_i$ to a normal distribution with mean $np$ and variance $np(1-p)$ and compute the z-score.

2. The hypergeometric distribution

If $S \subset S^o$ and if either $N$ or $k^o_i$ is not large compared to $n$ for a given $m_i$, the sequence of trials is analogous to sampling without replacement. The probability $p$ of randomly picking a sequence with our motif changes significantly over $n$ trials. Hence, we cannot use the binomial distribution, which assumes the same $p$ for all trials. The appropriate distribution is hypergeometric, where the probability of finding a motif in $k$ out of $n$ trials is

$$P(k) = \frac{\dbinom{K}{k} \dbinom{N-K}{n-k}}{\dbinom{N}{n}},$$

where $\binom{K}{k}$ is the number of ways of choosing $k$ sequences with a motif from the total number $K$ of sequences with that motif, $\binom{N-K}{n-k}$ is the number of ways of choosing $n-k$ sequences without the motif from the total number $N-K$ of sequences without the motif, and $\binom{N}{n}$ is the number of ways of choosing $n$ sequences from the total number $N$ of sequences.
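The two choices in Part C can be compared numerically. The sketch below is an illustration under stated assumptions, not the lecture's own code: the function names and the example counts (`N=40`, `K=10`, `n=12`, `k_i=6`) are hypothetical, `K` stands for the number of control sequences that contain the motif, and the z-score uses the binomial mean $np$ and variance $np(1-p)$ with $p = K/N$.

```python
from math import comb, sqrt


def binomial_tail(k_i: int, n: int, p: float) -> float:
    """P(k >= k_i) for n independent trials with success probability p."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(k_i, n + 1))


def binomial_z_score(k_i: int, n: int, p: float) -> float:
    """Normal approximation: (k_i - mean) / std with mean n*p, variance n*p*(1-p)."""
    return (k_i - n * p) / sqrt(n * p * (1.0 - p))


def hypergeom_pmf(k: int, n: int, K: int, N: int) -> float:
    """P(k) = C(K, k) * C(N - K, n - k) / C(N, n): sampling without replacement."""
    if k < 0 or k > n:
        return 0.0
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible draws get probability 0 automatically.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tail(k_i: int, n: int, K: int, N: int) -> float:
    """P(k >= k_i) under the hypergeometric distribution."""
    return sum(hypergeom_pmf(k, n, K, N) for k in range(k_i, min(n, K) + 1))


# Hypothetical counts: N control sequences, K of them containing the motif,
# n sequences in S, k_i of which contain the motif.
N, K, n, k_i = 40, 10, 12, 6
print("binomial p-value:      ", binomial_tail(k_i, n, K / N))
print("binomial z-score:      ", binomial_z_score(k_i, n, K / N))
print("hypergeometric p-value:", hypergeom_tail(k_i, n, K, N))
```

When $n$ is not small relative to $N$, as in this example, the two tails differ noticeably, which is the reason the text switches distributions in that regime.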

When using this distribution to test the significance of our particular motif $m_i$, we assign the value $k^o_i$ to $K$. As before, we calculate the p-value using the summation $\sum_{k=k_i}^{n} P(k)$. We do not compute a z-score here, since the normal approximation used above for the binomial distribution does not apply directly to the hypergeometric distribution.

II. Representation of a Motif Using a Position Weight Matrix

A. What is a Position Weight Matrix?

Motifs are hardly ever represented accurately by a unique consecutive sequence of A's, C's, G's and T's. Instead, we create a position weight matrix (PWM) to represent the frequencies of each base at each position in the motif:

Position    1     2     3     4     5     6     7     8     9    10
G           0   1.0     0     0   0.7   1.0     0     0   0.4   0.8
A         0.4     0   1.0     0     0     0   1.0     0     0     0
T         0.6     0     0   1.0   0.3     0     0   1.0   0.4   0.2
C           0     0     0     0     0     0     0     0   0.2     0

Sometimes a position weight matrix is represented by a sequence logo, where the height of the letters representing the nucleotides correlates with the frequency with which that base is found in different sequences containing the motif:

[Figure: sequence logo for the example motif]

In the example above, position 1 is said to be degenerate; there is no single nucleotide that represents the motif here. On the other hand, position 3 is said to be stringent because the motif is well represented by adenine (A).

B. Mathematical Representation of a Position Weight Matrix

The position weight matrix for a motif of width $w$ can be expressed as

$$\theta = \begin{pmatrix}
\theta_{11} & \theta_{21} & \cdots & \theta_{w1} \\
\theta_{12} & \theta_{22} & \cdots & \theta_{w2} \\
\theta_{13} & \theta_{23} & \cdots & \theta_{w3} \\
\theta_{14} & \theta_{24} & \cdots & \theta_{w4}
\end{pmatrix},$$

where each row $j$ represents A, C, G, or T, and each column $i$ represents one position of the motif and is normalized:

$$\sum_{j=1}^{4} \theta_{ij} = 1 \quad \text{for all } i = 1, 2, \ldots, w.$$

For example, $\theta_{23}$ is the relative frequency with which guanine is found in position 2 of the motif.

C. Likelihood of a Sequence

If all the relative frequencies $\theta_{ij}$ are given for the position weight matrix $\theta$, we can measure the probability of generating a sequence $S = (s_1, s_2, \ldots, s_w)$. This is also known as the likelihood $L(\theta)$ of the sequence. For example, we can use a position weight matrix of width $w = 3$ to calculate the likelihood of the sequence GGG. It is simply the product of three relative frequencies $\theta_{13}$, $\theta_{23}$, and $\theta_{33}$. Generalizing, the likelihood of a sequence $S = (s_1, s_2, \ldots, s_w)$ given $\theta$ is

$$L(\theta) = P(S \mid \theta) = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{\,I(s_i = j)}, \qquad \text{where } I(s_i = j) = \begin{cases} 1 & \text{if } s_i = j \\ 0 & \text{if not.} \end{cases}$$

Let us briefly go over a few syntax elements. First of all, the expression $P(S \mid \theta)$ represents a conditional probability: we are asking, "What is the likelihood of sequence $S$ given the condition that the position weight matrix is $\theta$?" Secondly, the $\prod$ (i.e., capital pi) notation means we take the product of the associated terms. Finally, for convenience we converted the alphabetical string (A, C, G, T) into a numerical one (1, 2, 3, 4). These numbers are represented by the variable $j$ in the above expression.
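As an illustration of the likelihood formula, here is a short Python sketch. It is written under a few assumptions: the matrix values are the ten-position example from Section II.A as transcribed here, the names `PWM`, `likelihood`, and `likelihood_indicator` are invented for the sketch, and the test sequence is arbitrary.

```python
BASES = "ACGT"  # j = 1, 2, 3, 4 in the notes corresponds to A, C, G, T

# Example position weight matrix (rows: bases, columns: positions 1..10),
# transcribed from the example in Section II.A.
PWM = {
    "A": [0.4, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
    "G": [0.0, 1.0, 0.0, 0.0, 0.7, 1.0, 0.0, 0.0, 0.4, 0.8],
    "T": [0.6, 0.0, 0.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.4, 0.2],
}


def columns_normalized(pwm, tol=1e-9):
    """Check that the four frequencies at every position sum to 1."""
    width = len(pwm["A"])
    return all(abs(sum(pwm[b][i] for b in BASES) - 1.0) < tol
               for i in range(width))


def likelihood(seq, pwm):
    """L(theta) = product over positions i of theta[i, s_i]."""
    if len(seq) != len(pwm["A"]):
        raise ValueError("sequence length must equal the motif width")
    result = 1.0
    for i, base in enumerate(seq):
        result *= pwm[base][i]
    return result


def likelihood_indicator(seq, pwm):
    """Same value via the double product with the indicator I(s_i = j)."""
    result = 1.0
    for i, base in enumerate(seq):
        for b in BASES:
            result *= pwm[b][i] ** (1 if base == b else 0)
    return result


print(columns_normalized(PWM))
print(likelihood("TGATGGATGG", PWM))
```

The two functions return the same number; the indicator form mirrors the formula, while the direct lookup is what one would normally use in practice.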

Other ways of expressing the likelihood $L(\theta)$ are

$$L(\theta) = P(S \mid \theta) = \prod_{i=1}^{w} P(s_i \mid \theta_i) = \prod_{i=1}^{w} \theta_{i,s_i}.$$

The conditional probability $P(s_i \mid \theta_i)$ is the probability of generating a nucleotide element $s_i$ given the relative frequencies $\theta_i$ at position $i$. We can expand this idea further and measure the likelihood for a set of sequences $S_1, S_2, \ldots, S_n$ given $\theta$. Since we are assuming each sequence $S_k$ is generated independently from $\theta$, this probability is simply the product of the relative frequencies $\theta_{i,s_{ki}}$ representing each nucleotide element $s_{ki}$:

$$L(\theta) = P(S_1, \ldots, S_n \mid \theta) = \prod_{k=1}^{n} P(S_k \mid \theta) = \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,s_{ki}}.$$

Note that the syntax $P(S_1, S_2, \ldots, S_n \mid \theta)$ represents a joint probability (the probability of generating sequences $S_1, S_2, \ldots, S_n$) as well as a conditional probability (the probability given $\theta$).

D. Using Maximum Likelihood to Estimate the Positional Weight Matrix θ

Often we want to construct a position weight matrix $\theta$ of length $w$ from observed sequence data. For a set of sequences $S_1, S_2, \ldots, S_n$ represented by the same $\theta$, our strategy is to maximize the likelihood $L(\theta)$ over all possible values of $\theta_{ij}$. This could be done by setting the partial derivative

$$\frac{\partial L(\theta)}{\partial \theta_{ij}}$$

equal to zero and solving for $\theta_{ij}$; however, it is much easier to take the partial derivative of the log-likelihood function (i.e., the logarithm of the likelihood) and set it to zero,

$$\frac{\partial \log L(\theta)}{\partial \theta_{ij}} = 0,$$

because the product associated with the likelihood $L(\theta)$ turns into a sum. Note that there are only $3w$ and not $4w$ parameters for which we need to solve, since if we figure out $\theta_{i1}$, $\theta_{i2}$, and $\theta_{i3}$, we can use the relation $\sum_{j=1}^{4} \theta_{ij} = 1$ to give us $\theta_{i4}$.
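The maximization can be worked through explicitly. The following is a sketch of the standard argument, not a transcription from the lecture; it assumes $n_{ij}$ denotes the number of sequences that have nucleotide $j$ at position $i$ (so $\sum_j n_{ij} = n$) and that all $n_{ij} > 0$.

```latex
% Log-likelihood, grouping identical factors by position i and nucleotide j
\log L(\theta) = \sum_{k=1}^{n} \sum_{i=1}^{w} \log \theta_{i,s_{ki}}
               = \sum_{i=1}^{w} \sum_{j=1}^{4} n_{ij} \log \theta_{ij}

% Eliminate the dependent parameter: \theta_{i4} = 1 - \theta_{i1} - \theta_{i2} - \theta_{i3}.
% For j = 1, 2, 3:
\frac{\partial \log L(\theta)}{\partial \theta_{ij}}
  = \frac{n_{ij}}{\theta_{ij}} - \frac{n_{i4}}{\theta_{i4}} = 0
  \quad\Longrightarrow\quad
  \frac{n_{ij}}{\theta_{ij}} = \frac{n_{i4}}{\theta_{i4}} \;\; \text{for all } j

% Hence \theta_{ij} = c_i \, n_{ij} for a constant c_i; normalization fixes c_i:
\sum_{j=1}^{4} \theta_{ij} = c_i \sum_{j=1}^{4} n_{ij} = c_i \, n = 1
  \quad\Longrightarrow\quad
  \theta_{ij} = \frac{n_{ij}}{n}
```

Each column can be treated separately because the log-likelihood is a sum of per-position terms that share no parameters.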

Using this method on a set of sequences $S_1, S_2, \ldots, S_n$, all with the same $\theta$, we can derive an expression for the relative frequency,

$$\theta_{ij} = \frac{n_{ij}}{n},$$

which is simply the absolute frequency $n_{ij}$ of each nucleotide $j$ in every column $i$, divided by the total number $n$ of sequences.

Often it is much harder to solve for the position weight matrix $\theta$. It is quite likely that within a set of given sequences $S_1, S_2, \ldots, S_n$ only some sequences contain the motif, and thus only this subset can generate the weight matrix $\theta$. The problem is that we do not know which sequences form this subset. Let us assume the rest of the non-motif (also called background) sequences form a subset generated from a single distribution (i.e., from a second position weight matrix $\theta^o$ made up of identical columns $p^o = (p^o_A, p^o_C, p^o_G, p^o_T) = (p^o_1, p^o_2, p^o_3, p^o_4)$). The likelihood $L(\theta, \theta^o)$ for this set of sequences $S_1, S_2, \ldots, S_n$ is now

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid z, \theta, \theta^o) = \prod_{k=1}^{n} \bigl[ z_k P(S_k \mid \theta) + (1 - z_k) P(S_k \mid \theta^o) \bigr],$$

$$\text{where } z_k = \begin{cases} 1 & \text{if } S_k \text{ is generated by } \theta \\ 0 & \text{if } S_k \text{ is generated by } \theta^o. \end{cases}$$

The problem of not knowing if a sequence $S_k$ belongs to the motif ($\theta$) or the background model ($\theta^o$) can now be expressed mathematically as not knowing which value, 0 or 1, to use for the binary variable $z_k$ associated with each $S_k$. Fortunately, we can remove $z$ from the equation by integrating (here, summing) the likelihood over all possible events $z$:³

$$P(S_1, \ldots, S_n \mid \theta, \theta^o) = \sum_{z} P(S_1, \ldots, S_n \mid z, \theta, \theta^o) \, P(z).$$

After integration, we are left with

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid \theta, \theta^o) = \prod_{k=1}^{n} \bigl[ P(z_k = 1) P(S_k \mid \theta) + \bigl(1 - P(z_k = 1)\bigr) P(S_k \mid \theta^o) \bigr].$$

We may be fortunate enough to know the probability $P(z_k = 1)$ for the set of sequences $S_1, S_2, \ldots, S_n$; we represent this probability by the constant $\alpha$.
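Both steps above, the counting estimate of $\theta$ and the marginalized mixture likelihood, can be sketched in Python. This is an illustration under assumptions, not the lecture's code: the function names are invented, `prior_motif` stands for the known probability $P(z_k = 1)$, the background is a single base distribution `p_o`, and the toy sequences are made up.

```python
from math import log, prod

BASES = "ACGT"


def estimate_pwm(seqs):
    """Maximum likelihood estimate: theta[i][b] = (count of b at position i) / n."""
    n, w = len(seqs), len(seqs[0])
    return [{b: sum(s[i] == b for s in seqs) / n for b in BASES}
            for i in range(w)]


def motif_likelihood(seq, pwm):
    """P(S | theta): product of the position-specific frequencies."""
    return prod(column[base] for column, base in zip(pwm, seq))


def background_likelihood(seq, p_o):
    """P(S | theta_o): every position uses the same base distribution p_o."""
    return prod(p_o[base] for base in seq)


def mixture_log_likelihood(seqs, pwm, p_o, prior_motif):
    """log of prod_k [ P(z_k=1) P(S_k|theta) + (1 - P(z_k=1)) P(S_k|theta_o) ]."""
    return sum(
        log(prior_motif * motif_likelihood(s, pwm)
            + (1.0 - prior_motif) * background_likelihood(s, p_o))
        for s in seqs
    )


# Toy data, hypothetical: four aligned sequences of width 3.
motif_seqs = ["ACG", "ACT", "AGG", "TCG"]
pwm_hat = estimate_pwm(motif_seqs)
uniform = {b: 0.25 for b in BASES}
print(pwm_hat[0])
print(mixture_log_likelihood(motif_seqs, pwm_hat, uniform, prior_motif=0.5))
```

Working with the log of each bracketed term keeps the product over sequences from underflowing, but the sum inside the logarithm is what prevents the derivative equations from separating as they do in the single-model case.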

With $\alpha$ in place of $P(z_k = 1)$, the likelihood of the set may now be written as

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid \theta, \theta^o) = \prod_{k=1}^{n} \bigl[ \alpha P(S_k \mid \theta) + (1 - \alpha) P(S_k \mid \theta^o) \bigr].$$

Having successfully expressed the likelihood as a function of $3w$ independent variables $\theta_{i,s_{ki}}$ and 3 independent variables $\theta^o_{i,s_{ki}}$, we can now use our strategy of solving for $\theta_{i,s_{ki}}$ and $\theta^o_{i,s_{ki}}$ when the likelihood is at a maximum. However, setting the partial derivatives of the log-likelihood function equal to zero is too difficult a task, because the likelihood $L(\theta, \theta^o)$ in this case is no longer simply a product of the independent variables. We will implement the EM Algorithm next lecture to solve this maximum likelihood estimation problem.

¹ Wikipedia, P-value, http://en.wikipedia.org/wiki/p-value

² The relative frequency $k^o_i / N$ with which the motif is found in the set $S^o$ must also not be close to 0 or 1.

³ In general we can calculate a marginal probability from a conditional or joint probability by removing one of the variables using integration,

$$P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y) \, P(Y),$$

where we take the sum over all possible events $Y$. From R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 2006, p. 6.