CS284A: Representations and Algorithms in Molecular Biology

Scribe Notes on Lectures 3 & 4: Motif Discovery via Enumeration & Motif Representation Using Position Weight Matrix

Joshua Gervin

Based on presentations by Professor Xiaohui Xie on January 14 & 16, 2008

I. Motif Discovery via Enumeration

A. A Model for Motif Discovery (Review from Lecture 2)

We want to identify biologically significant motifs in a set $S$ of $n$ sequences, $s_1, s_2, \ldots, s_n$. Each potentially significant motif $m_i$ of length $w$ is associated with a summation variable $k_i$, which is the total number of sequences from $S$ in which the motif appears.

To systematically measure this significance, we must first find the underlying probability $p$ that any sequence of length $l$ contains a given theoretical motif of length $w$. With the overriding assumption that the four bases are uniformly distributed, or $(P(A), P(C), P(G), P(T)) = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)$, we have calculated a value for $p$ of

$$p = 1 - \left(1 - \frac{1}{4^w}\right)^{l-w+1}.$$

We use $p$ as the probability of success for finding this theoretical motif each time we sample a sequence from set $S$. For $k$ out of $n$ trials, the probability of success is binomial,

$$P(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$

as a motif either is in a sequence or is not. To test the significance of our specific motif $m_i$, we evaluate a p-value, or the probability, based on our distribution, that $m_i$ would appear in at least $k_i$ sequences:

$$\sum_{k=k_i}^{n} P(k) = \sum_{k=k_i}^{n} \binom{n}{k} p^k (1-p)^{n-k}$$

If the p-value is smaller than a chosen significance level,[1] we can say with some confidence that our motif $m_i$ is biologically significant. For large $n$ the binomial distribution is approximated by a normal distribution, and we can map $k_i$ to a new distribution and compute the z-score to determine the significance of our motif $m_i$.

B. Problems with this Model

1. The assumption that the four bases are uniformly distributed in the sequences is not necessarily correct. To be more accurate, we would need to model the first-order statistics (i.e., $P(A)$, $P(C)$, $P(G)$, and $P(T)$) of the nucleotide distribution.

2. The model ignores second-order statistics. Two bases might be more likely paired together than distributed at random (e.g., $P(GA) \neq P(G)P(A)$). The same could also be said for higher-order statistics.

C. Control Sequences

In order not to rely on the assumption of uniform distribution of bases to measure significance, we can generate a set of $N$ control sequences, $s^o_1, s^o_2, \ldots, s^o_N$. The assumption is that our motif of interest $m_i$ is not significant in the control sequences. Now we have two sets of sequences. Each $m_i$ is associated with two values $k_i$ and $k^o_i$, which correspond to the number of different sequences in which this motif appears in the sets $S$ and $S^o$, respectively. To find out if our motif $m_i$ is biologically significant, we choose the appropriate probability distribution for successfully finding a motif in $k$ out of $n$ trials. There are two types to choose from:

1. The binomial distribution

If the set $S$ is independent of $S^o$, we can still model the probability of success $P(k)$ of finding a motif in $k$ out of $n$ trials using the binomial distribution.

If $S \subset S^o$ (i.e., the set $S$ is a subset of $S^o$), choosing the appropriate distribution now depends on the size of both sets and the distribution of our motif $m_i$ in them. If the number $N$ of $S^o$ sequences and the number $k^o_i$ of sequences containing our motif are large compared to the number $n$ of $S$ sequences, then the probability $p$ of randomly picking a sequence with our motif remains essentially unchanged for $n$ trials, and we could still model the probability $P(k)$ using the binomial distribution.[2]

For these scenarios the only change we need to make from the model in Part A is to adopt a different underlying probability $p$ of success for finding a motif every time we sample a sequence. For $p$ we will use the relative frequency $k^o_i / N$ with which our motif $m_i$ is found in the set $S^o$. This way, when we run $n$ trials, we can compare the distributions from both $S$ and $S^o$ to see if our motif indeed stands out in $S$. The probability of success on $k$ out of $n$ trials may be written as

$$P(k) = \binom{n}{k} \left(\frac{k^o_i}{N}\right)^{k} \left(1 - \frac{k^o_i}{N}\right)^{n-k}.$$

To test the significance of our motif, we calculate the p-value in the same fashion as we did before: $\sum_{k=k_i}^{n} P(k)$. For large $n$ we can again map $k_i$ to a normal distribution with mean $np$ and variance $np(1-p)$ and compute the z-score.

2. The hypergeometric distribution

If $S \subset S^o$ and if either $N$ or $k^o_i$ is not large compared to $n$ for a given $m_i$, the sequence of trials is analogous to sampling without replacement. The probability $p$ of randomly picking a sequence with our motif changes significantly over $n$ trials. Hence, we cannot use the binomial distribution, which assumes the same $p$ for all trials. The appropriate distribution is hypergeometric, where the probability of success of finding a motif in $k$ out of $n$ trials is

$$P(k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}},$$

where $\binom{K}{k}$ is the number of ways of choosing $k$ sequences with a motif from the total number $K$ of sequences with that motif, $\binom{N-K}{n-k}$ is the number of ways of choosing $n-k$ sequences without the motif from the total number $N-K$ of sequences without the motif, and $\binom{N}{n}$ is the number of ways of choosing $n$ sequences from the total number $N$ of sequences.

While using this distribution to test the significance of our particular motif $m_i$, we assign $k^o_i$ to the value $K$. As before, we calculate the p-value using the summation $\sum_{k=k_i}^{n} P(k)$. We do not compute a z-score here, because the normal approximation used above relied on the binomial model with a constant $p$, which does not hold when sampling without replacement.

II. Representation of a Motif Using a Position Weight Matrix

A. What is a Position Weight Matrix?

Motifs are hardly ever represented accurately by a unique consecutive sequence of A's, C's, G's and T's. Instead, we create a position weight matrix (PWM) to represent the frequencies of each base at each position in the motif:

Position   1     2     3     4     5     6     7     8     9     10
G          0     1.0   0     0     0.7   1.0   0     0     0.4   0.8
A          0.4   0     1.0   0     0     0     1.0   0     0     0
T          0.6   0     0     1.0   0.3   0     0     1.0   0.4   0.2
C          0     0     0     0     0     0     0     0     0.2   0

Sometimes a position weight matrix is represented by a sequence logo, where the height of the letters representing the nucleotides correlates with the frequency with which that base is found in different sequences containing the motif:

[Figure: sequence logo for the motif]

From the example above, position 1 is said to be degenerate; there is no single nucleotide that represents the motif here. On the other hand, position 3 is said to be stringent because the motif is well represented by adenine (A).

B. Mathematical Representation of a Position Weight Matrix

The position weight matrix for a motif of width $w$ can be expressed as

$$\theta = \begin{pmatrix} \theta_{11} & \theta_{21} & \cdots & \theta_{w1} \\ \theta_{12} & \theta_{22} & \cdots & \theta_{w2} \\ \theta_{13} & \theta_{23} & \cdots & \theta_{w3} \\ \theta_{14} & \theta_{24} & \cdots & \theta_{w4} \end{pmatrix},$$

where each row $j$ represents A, C, G, or T, and each column $i$ represents one position of the motif and is normalized:

$$\sum_{j=1}^{4} \theta_{ij} = 1 \quad \text{for all } i = 1, 2, \ldots, w.$$

For example, $\theta_{23}$ is the relative frequency with which guanine is found in position 2 of the motif.

C. Likelihood of a Sequence

If all the relative frequencies $\theta_{ij}$ are given for the position weight matrix $\theta$, we can measure the probability of generating a sequence $S = (s_1, s_2, \ldots, s_w)$. This is also known as the likelihood $L(\theta)$ of the sequence. For example, we can use a position weight matrix of width $w = 3$ to calculate the likelihood of the sequence GGG. It is simply the product of three relative frequencies $\theta_{13}$, $\theta_{23}$, and $\theta_{33}$. Generalizing this, we find the likelihood of a sequence $S = (s_1, s_2, \ldots, s_w)$ given $\theta$ is

$$L(\theta) = P(S \mid \theta) = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{\,I(s_i = j)}, \quad \text{where } I(s_i = j) = \begin{cases} 1 & \text{if } s_i = j \\ 0 & \text{if not.} \end{cases}$$

Let us briefly go over a few syntax elements. First of all, the expression $P(S \mid \theta)$ represents a conditional probability: we are asking, "What is the likelihood of sequence $S$ given the condition that the position weight matrix is $\theta$?" Secondly, the $\prod$ (i.e., capital pi) notation means we take the product of the associated terms. Finally, for convenience we converted the alphabetical string (A, C, G, T) into a numerical one (1, 2, 3, 4). These numbers are represented by the variable $j$ in the above expression.

Other ways of expressing the likelihood $L(\theta)$ are

$$L(\theta) = P(S \mid \theta) = \prod_{i=1}^{w} P(s_i \mid \theta_i) = \prod_{i=1}^{w} \theta_{i,s_i}.$$

The conditional probability $P(s_i \mid \theta_i)$ is the probability of generating a nucleotide element $s_i$ given the relative frequencies $\theta_i$ at position $i$.

We can expand this idea further and measure the likelihood for a set of sequences $S_1, S_2, \ldots, S_n$ given $\theta$. Since we are assuming each sequence $S_k$ is generated independently from $\theta$, this probability is simply the product of the relative frequencies $\theta_{i,s_{ki}}$ representing each nucleotide element $s_{ki}$:

$$L(\theta) = P(S_1, \ldots, S_n \mid \theta) = \prod_{k=1}^{n} P(S_k \mid \theta) = \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,s_{ki}}.$$

Note that the syntax $P(S_1, S_2, \ldots, S_n \mid \theta)$ represents a joint probability (the probability of generating sequences $S_1, S_2, \ldots, S_n$) as well as a conditional probability (the probability given $\theta$).

D. Using Maximum Likelihood to Estimate the Position Weight Matrix θ

Often we want to construct a position weight matrix $\theta$ of length $w$ from observed sequence data. For a set of sequences $S_1, S_2, \ldots, S_n$ represented by the same $\theta$, our strategy is to maximize the likelihood $L(\theta)$ over all possible values of $\theta_{ij}$. This could be done by setting the partial derivative $\partial L(\theta) / \partial \theta_{ij}$ equal to zero and solving for $\theta_{ij}$; however, it is much easier to take the partial derivative of the log-likelihood function (i.e., the logarithm of the likelihood) and set it to zero,

$$\frac{\partial \log L(\theta)}{\partial \theta_{ij}} = 0,$$

because the product associated with the likelihood $L(\theta)$ turns into a sum. Note that there are only $3w$ and not $4w$ parameters for which we need to solve, since if we figure out $\theta_{i1}$, $\theta_{i2}$, and $\theta_{i3}$, we can use the relation $\sum_{j=1}^{4} \theta_{ij} = 1$ to give us $\theta_{i4}$.
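The log-likelihood of a set of sequences can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `log_likelihood`, the width-3 matrix `pwm`, and the three example sequences are all hypothetical, and each PWM column is stored as a dictionary from base to relative frequency.

```python
import math

def log_likelihood(pwm, sequences):
    """Return log L(theta): the sum of log theta[i][s_ki] over all
    sequences k and positions i. Returns -inf if any observed base
    has relative frequency 0 in its column."""
    total = 0.0
    for seq in sequences:
        if len(seq) != len(pwm):
            raise ValueError("sequence length must equal motif width w")
        for column, base in zip(pwm, seq):
            freq = column[base]
            if freq == 0.0:
                return float("-inf")
            total += math.log(freq)
    return total

# Hypothetical width-3 PWM (illustration only); each column sums to 1.
pwm = [
    {"A": 0.4, "C": 0.0, "G": 0.0, "T": 0.6},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
]
sequences = ["AGA", "TGG", "TGA"]

log_l = log_likelihood(pwm, sequences)
# Exponentiating recovers the product form:
# (0.4 * 1.0 * 0.5) * (0.6 * 1.0 * 0.5) * (0.6 * 1.0 * 0.5)
likelihood = math.exp(log_l)
```

Working with the sum of logarithms rather than the raw product also avoids numerical underflow when the number of sequences is large.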

Using this method on a set of sequences $S_1, S_2, \ldots, S_n$, all with the same $\theta$, we can derive an expression for the relative frequency

$$\theta_{ij} = \frac{n_{ij}}{n},$$

which is simply the absolute frequency $n_{ij}$ of each nucleotide $j$ for every column $i$, divided by the total number $n$ of sequences.

Often it is much harder to solve for the position weight matrix $\theta$. It is quite likely within a set of given sequences $S_1, S_2, \ldots, S_n$ that only some sequences contain the motif, and thus only this subset can generate the weight matrix $\theta$. The problem is we do not know which sequences form this subset. Let us assume the rest of the non-motif (also called background) sequences form a subset generated from a single distribution (i.e., from a second position weight matrix $\theta^o$ made up of identical columns of $p^o = (p^o_A, p^o_C, p^o_G, p^o_T) = (p^o_1, p^o_2, p^o_3, p^o_4)$). The likelihood $L(\theta, \theta^o)$ for this set of sequences $S_1, S_2, \ldots, S_n$ is now

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid z, \theta, \theta^o) = \prod_{k=1}^{n} \left[ z_k P(S_k \mid \theta) + (1 - z_k) P(S_k \mid \theta^o) \right],$$

$$\text{where } z_k = \begin{cases} 1 & \text{if } S_k \text{ is generated by } \theta \\ 0 & \text{if } S_k \text{ is generated by } \theta^o. \end{cases}$$

The problem of not knowing if a sequence $S_k$ belongs to the motif ($\theta$) or the background model ($\theta^o$) can now be expressed mathematically as not knowing which value, 0 or 1, to use for the binary function $z_k$ associated with each $S_k$. Fortunately, we can remove $z$ from the equation by integrating the likelihood $L(\theta, \theta^o)$ over all possible events $z$:[3]

$$P(S_1, \ldots, S_n \mid \theta, \theta^o) = \sum_{z} P(S_1, \ldots, S_n \mid z, \theta, \theta^o) \, P(z).$$

After integration, we are left with

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid \theta, \theta^o) = \prod_{k=1}^{n} \left[ P(z_k = 1) P(S_k \mid \theta) + (1 - P(z_k = 1)) P(S_k \mid \theta^o) \right].$$

We may be fortunate to know the probability $P(z_k = 1)$ for the set of sequences $S_1, S_2, \ldots, S_n$. Representing this probability as the constant $\alpha$, the likelihood of the set may now be written as

$$L(\theta, \theta^o) = P(S_1, \ldots, S_n \mid \theta, \theta^o) = \prod_{k=1}^{n} \left[ \alpha P(S_k \mid \theta) + (1 - \alpha) P(S_k \mid \theta^o) \right].$$

Having successfully expressed the likelihood as a function of $3w$ independent variables $\theta_{i,s_{ki}}$ and 3 independent variables $\theta^o_{i,s_{ki}}$, we can now use our strategy of solving for $\theta_{i,s_{ki}}$ and $\theta^o_{i,s_{ki}}$ when the likelihood is at a maximum. However, setting the partial derivatives of the log-likelihood function equal to zero is too difficult a task, because the likelihood $L(\theta, \theta^o)$ in this case is no longer just a product of the independent variables. We will implement the EM Algorithm next lecture to solve this maximum likelihood estimation problem.

Notes

[1] Wikipedia, P-value, http://en.wikipedia.org/wiki/p-value

[2] The relative frequency $k^o_i / N$ with which the motif is found in the set $S^o$ must also not be close to 0 or 1.

[3] In general we can calculate a marginal probability from a conditional or joint probability by removing one of the variables using integration,
$$P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y) \, P(Y),$$
where we take the sum over all possible events $Y$. From R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 2006, p. 6.
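As a supplement to Section II.D, the likelihood with a known mixing constant α can be evaluated directly for fixed parameters, even though maximizing it is deferred to the EM Algorithm. The sketch below is illustrative only: the function names, the width-2 motif matrix `theta`, the uniform `background`, and the value `alpha = 0.5` are assumptions made for the example and do not come from the lecture. It assumes every sequence has length $w$, that $\alpha < 1$, and that all background frequencies are positive, so the logarithm is always defined.

```python
import math

def sequence_prob(columns, seq):
    """P(S | theta): product over positions of the frequency of the observed base."""
    prob = 1.0
    for column, base in zip(columns, seq):
        prob *= column[base]
    return prob

def mixture_log_likelihood(theta, background, alpha, sequences):
    """Sum over k of log(alpha * P(S_k | theta) + (1 - alpha) * P(S_k | theta_o)),
    where theta_o repeats the same background column at every position."""
    theta_o = [background] * len(theta)
    total = 0.0
    for seq in sequences:
        mixed = (alpha * sequence_prob(theta, seq)
                 + (1.0 - alpha) * sequence_prob(theta_o, seq))
        total += math.log(mixed)
    return total

# Hypothetical inputs (illustration only).
theta = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alpha = 0.5
sequences = ["AG", "CT"]

log_l = mixture_log_likelihood(theta, background, alpha, sequences)
```

Because each factor is a sum of two terms, the logarithm no longer splits into a sum over individual parameters, which is the difficulty that motivates the EM Algorithm.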