18.417 Introduction to Computational Molecular Biology    Lecture 19: November 16, 2004
Scribe: Tushara C. Karunaratna    Lecturer: Ross Lippert    Editor: Tushara C. Karunaratna

Gibbs Sampling

Introduction

Let's first recall the Motif Finding Problem: given a set of n DNA sequences, each of length t, find the profile (a set of l-mers, one from each sequence) that maximizes the consensus score. We have already seen various naive brute-force approaches for solving this problem. In this lecture, we will apply a probabilistic method known as Gibbs Sampling to solve this problem.

A probabilistic approach to Motif Finding

We can generalize the Motif Finding Problem as follows: given a multivariable scoring function f(y_1, y_2, ..., y_n), find the vector y that maximizes f. Consider a probability distribution p with p ∝ f. Intuitively, if f is relatively large at the optimum, then if we repeatedly sample from the probability distribution p, we are likely to quickly encounter the optimum.

Gibbs Sampling provides us a method of sampling from a probability distribution over a large set. We will use a technique known as simulated annealing to transform a probability distribution into one that has a relatively tall peak at the optimum, to ensure that Gibbs Sampling is likely to quickly encounter the optimum. In particular, we will observe visually that the probability distribution p ∝ f^(1/T), for a sufficiently small T, is a good choice.
Gibbs Sampling

Gibbs Sampling solves the following problem.

Input: a probability distribution p(y_1, y_2, ..., y_n), where each y_i ∈ S. The state space S^n may be big, but S itself is assumed to be manageable.

Output: a random y chosen from the probability distribution p.

Gibbs Sampling uses the technique of Monte Carlo Markov Chain simulation. The idea is to set up a Markov Chain having p as its steady-state distribution, and then simulate this Markov Chain for long enough to be confident that an approximation of the steady-state has been attained. The final state of the simulation approximately represents a sample from the steady-state distribution.

Let's now define our Markov Chain. The set of states of our Markov Chain is S^n. Transitions exist only between states differing in at most one coordinate. For states y = (y_1, ..., y_m, ..., y_n) and y' = (y_1, ..., y'_m, ..., y_n), we define the transition probability

    T(y → y') = (1/n) · p(y_1, ..., y'_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).

We now show that the distribution p is a steady-state distribution of our Markov Chain. Recall that the defining property of a steady-state distribution π is

    πT = π.

This property is known as global balance. The stronger property

    π(y) T(y → y') = π(y') T(y' → y)

is known as detailed balance. We can see that detailed balance implies global balance by summing both sides of the detailed balance condition over y':

    Σ_{y'} π(y) T(y → y') = Σ_{y'} π(y') T(y' → y)
    π(y) Σ_{y'} T(y → y') = Σ_{y'} π(y') T(y' → y)
    π(y) = (πT)(y).

Therefore, let's just check whether p satisfies detailed balance. If y differs from y' in zero coordinates or in more than one coordinate, then detailed balance trivially holds (in the latter case, both sides of the detailed balance condition evaluate to zero).
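The single-coordinate kernel and the detailed balance argument can be checked numerically. Below is a minimal sketch (the helper name `gibbs_kernel` and the toy distribution are my own, not from the lecture) that builds T(y → y') exactly as defined above and verifies p(y)T(y → y') = p(y')T(y' → y) on a random unnormalized distribution over S^n:

```python
import itertools
import random

def gibbs_kernel(p, states, n, y, m, y_new):
    """Transition probability T(y -> y') where y' changes only coordinate m.

    p: function mapping a state tuple to an (unnormalized) probability.
    states: the coordinate alphabet S.
    """
    yp = y[:m] + (y_new,) + y[m + 1:]
    # Denominator: sum over all choices y''_m for coordinate m.
    denom = sum(p(y[:m] + (s,) + y[m + 1:]) for s in states)
    # Factor 1/n: coordinate m is chosen uniformly at random.
    return (1.0 / n) * p(yp) / denom

# Sanity check of detailed balance on a random distribution over S^n.
random.seed(0)
S, n = [0, 1, 2], 3
table = {y: random.uniform(0.1, 1.0) for y in itertools.product(S, repeat=n)}
p = table.__getitem__

for y in table:
    for m in range(n):
        for s in S:
            yp = y[:m] + (s,) + y[m + 1:]
            lhs = p(y) * gibbs_kernel(p, S, n, y, m, s)
            rhs = p(yp) * gibbs_kernel(p, S, n, yp, m, y[m])
            assert abs(lhs - rhs) < 1e-12
```

Note that the denominator is the same on both sides of the check, which is exactly why the proof below goes through.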
So, suppose that y' differs from y in only one place, say coordinate m. The left-hand side of the detailed balance condition evaluates to

    p(y) · (1/n) · p(y_1, ..., y'_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).

The right-hand side evaluates to

    p(y') · (1/n) · p(y_1, ..., y_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).

Since p(y) = p(y_1, ..., y_m, ..., y_n) and p(y') = p(y_1, ..., y'_m, ..., y_n), the two sides are equal, as desired. Therefore, p is indeed a steady-state distribution of our Markov Chain.

Scoring profiles

Let's investigate a probabilistic approach to scoring profiles, as an alternative to simply using the consensus score. We assume a background frequency P_x for each character x. Let C_{x,i} denote the number of occurrences of character x in the i-th column of the profile; we call C the profile matrix. Then, in the background, the probability that a profile has profile matrix C is given by

    prob(C) = Π_{i=0}^{l-1} [ n! / (C_{a,i}! C_{c,i}! C_{g,i}! C_{t,i}!) ] P_a^{C_{a,i}} P_c^{C_{c,i}} P_g^{C_{g,i}} P_t^{C_{t,i}}
            ∝ Π_{x,i} P_x^{C_{x,i}} / C_{x,i}!

Since the profile corresponding to the actual motif locations should have small background probability, we assign

    score(C) ∝ 1/prob(C) ∝ Π_{x,i} C_{x,i}! P_x^{-C_{x,i}}.

Now, log(n!) = Θ(n log n), so log C_{x,i}! ≈ C_{x,i} log C_{x,i}. Therefore, up to constant factors,

    score(C) ≈ exp( Σ_{x,i} C_{x,i} log (C_{x,i} / P_x) ).

The exponent is known as the entropy of the profile. In summary, maximizing the entropy, rather than the consensus score, is a statistically more adequate approach to finding motifs.
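The entropy score is easy to compute directly from the chosen l-mers. The sketch below (assuming a uniform background P_x = 1/4, which the lecture does not fix) builds the profile matrix C and evaluates the exponent Σ_{x,i} C_{x,i} log(C_{x,i}/P_x); a perfectly conserved profile scores higher than a scattered one:

```python
import math

BACKGROUND = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}  # assumed uniform P_x

def profile_matrix(lmers):
    """C[x][i] = number of occurrences of character x in column i."""
    l = len(lmers[0])
    C = {x: [0] * l for x in BACKGROUND}
    for lmer in lmers:
        for i, x in enumerate(lmer):
            C[x][i] += 1
    return C

def entropy_score(lmers, P=BACKGROUND):
    """Exponent of score(C): sum over x, i of C_{x,i} * log(C_{x,i} / P_x)."""
    C = profile_matrix(lmers)
    return sum(c * math.log(c / P[x])
               for x, col in C.items() for c in col if c > 0)

# A perfectly conserved profile beats a mixed one under this score.
conserved = ["acgt", "acgt", "acgt", "acgt"]
mixed     = ["acgt", "cagt", "gcat", "tcga"]
assert entropy_score(conserved) > entropy_score(mixed)
```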
Motif finding via Gibbs Sampling

Here is pseudocode for Motif Finding using the Gibbs Sampling technique.

1. Randomly generate a start state y_1, ..., y_n (one motif start position per sequence).
2. Pick m uniformly at random from 1, ..., n.
3. Replace y_m with a y'_m picked randomly from the distribution that assigns relative weight 1/prob(C(y_1, ..., y'_m, ..., y_n)) to y'_m.
4. <do whatever with the sample>
5. Goto step 2.

Note that we are just doing a simulation of the Markov Chain defined by the Gibbs Sampling technique.

Simulated Annealing

Annealing is a process by which glass is put into a highly durable state by slow cooling. We can use the same idea here: to amplify the probability of sampling at the optimum of a probability distribution p, we instead sample from p^(1/T), where T → 0.

Figure 19.1 shows a graph of a probability distribution p. The optimum occurs at state 4, but there are other peaks of significant height.

Figure 19.1: Graph of a probability distribution p.
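The pseudocode above can be sketched as a short runnable program. This is an illustrative sketch only, not the lecture's implementation: the function name `gibbs_motif_step`, the toy sequences, and the use of the entropy approximation from the previous section as the relative weight (in place of an exact 1/prob(C)) are all my own choices.

```python
import math
import random

def gibbs_motif_step(seqs, l, y, P, rng):
    """One sampler step: resample the motif start position in one sequence.

    seqs: list of n DNA strings; y: current start positions (step 1's state);
    P: background frequencies. Candidate weights use the entropy score of the
    profile formed by the other n-1 l-mers plus the candidate l-mer.
    """
    n = len(seqs)
    m = rng.randrange(n)                       # step 2: pick m uniformly
    others = [seqs[j][y[j]:y[j] + l] for j in range(n) if j != m]
    weights = []
    for start in range(len(seqs[m]) - l + 1):  # step 3: weigh each candidate
        lmers = others + [seqs[m][start:start + l]]
        C = {}
        for lmer in lmers:
            for i, x in enumerate(lmer):
                C[(x, i)] = C.get((x, i), 0) + 1
        logw = sum(c * math.log(c / P[x]) for (x, _i), c in C.items())
        weights.append(math.exp(logw))
    y[m] = rng.choices(range(len(weights)), weights=weights)[0]
    return y

rng = random.Random(1)
P = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}
seqs = ["ttacgtaa", "gtacgttc", "catacgta"]  # toy data; "tacg" occurs in each
y = [rng.randrange(len(s) - 3) for s in seqs]  # step 1: random start state
for _ in range(100):                           # step 5: loop back to step 2
    y = gibbs_motif_step(seqs, 4, y, P, rng)
assert all(0 <= y[j] <= len(seqs[j]) - 4 for j in range(3))
```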
Figures 19.2 and 19.3 show the graphs of the probability distributions p^5 and p^50, respectively. The height of the peak at state 4 has increased considerably relative to the heights of the other peaks.

Figure 19.2: Graph of p^5.

Figure 19.3: Graph of p^50.

How do we find the right T? Here are two possible approaches: we can either drop T by a small amount after reaching steady-state, or we can drop T by a small amount at each step.

Some questions that we didn't answer

For how long should we run the Markov Chain? How often can we sample?
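The effect shown in the figures can be reproduced numerically: raising p to the power 1/T never moves the mode, but concentrates almost all of the mass there as T shrinks. A minimal sketch (the distribution below is a made-up stand-in for the one in the figures; only its mode at state 4 matches):

```python
def anneal(p, T):
    """Renormalize p^(1/T); smaller T concentrates mass on the mode."""
    w = [q ** (1.0 / T) for q in p]
    z = sum(w)
    return [q / z for q in w]

# A toy multi-peaked distribution with its optimum at index 4.
p = [0.05, 0.20, 0.05, 0.10, 0.30, 0.05, 0.25]
for T in (1.0, 0.2, 0.02):        # exponents 1, 5, and 50, as in the figures
    q = anneal(p, T)
    assert q.index(max(q)) == 4   # the mode never moves...
# ...but its mass approaches 1 as T drops:
assert anneal(p, 0.02)[4] > anneal(p, 0.2)[4] > anneal(p, 1.0)[4]
```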