Introduction to Computational Biology Lecture # 27: Gibbs Sampling and Bi-clustering



Lihi Pertman, 3/5/10

1 A Brief Review

Last week we learned about MCMC. Our goal was to obtain samples from some complex probability distribution $P(x)$. To do so, we built a Markov chain with a transition probability $P_t(X^{(t+1)} \mid X^{(t)})$ that converges to a stationary distribution as $t \to \infty$:

$$X^{(1)}, X^{(2)}, \ldots, X^{(t)} \quad \text{s.t.} \quad P_{MC}(X^{(t)} = x) \xrightarrow{t \to \infty} P(x)$$

We also described the Metropolis-Hastings sampling algorithm. It draws a candidate value $x'$ from a proposal distribution $R_x$ and returns $x'$ if it passes an acceptance test chosen so that the detailed balance condition holds; otherwise it returns the current value $x$. Repeating this procedure long enough will eventually produce samples from the stationary distribution.

    Sample next(x)
        Sample $x' \sim R_x(X)$
        Sample $u \sim U[0, 1]$
        If $u \le \frac{P(x' \mid e)}{P(x \mid e)} \cdot \frac{R_{x'}(x)}{R_x(x')}$
            Return $x'$
        Else
            Return $x$

2 Back to Clustering - Two Dimensional Clustering

In previous lectures we talked about clustering. If the clustering was successful, the rows in each cluster are similar, yet we didn't say anything about the columns.
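As a short aside on the review in Section 1, the Metropolis-Hastings step can be sketched in Python. This is only an illustration under stated assumptions: the five-state target `WEIGHTS`, the ring-shaped proposal, and the helper names `target`, `propose`, `proposal_prob` and `run_chain` are hypothetical and do not come from the lecture.

```python
import random

def mh_next(x, target, propose, proposal_prob, rng):
    """One Metropolis-Hastings step, following 'Sample next(x)' from the review."""
    x_new = propose(x, rng)                      # x' ~ R_x
    u = rng.random()                             # u ~ U[0, 1)
    w = (target(x_new) / target(x)) * (
        proposal_prob(x_new, x) / proposal_prob(x, x_new))
    return x_new if u <= w else x

# Hypothetical toy target: unnormalized weights over the states 0..4.
WEIGHTS = [1.0, 2.0, 3.0, 2.0, 1.0]

def target(x):
    return WEIGHTS[x]

def propose(x, rng):
    # Assumed proposal: step to the left or right neighbour on a ring.
    return (x + rng.choice((-1, 1))) % len(WEIGHTS)

def proposal_prob(a, b):
    # R_a(b) for the ring proposal above.
    n = len(WEIGHTS)
    return 0.5 if b in ((a - 1) % n, (a + 1) % n) else 0.0

def run_chain(n_steps, seed=0):
    """Run the chain and return the empirical frequency of each state."""
    rng = random.Random(seed)
    x = 0
    counts = [0] * len(WEIGHTS)
    for _ in range(n_steps):
        x = mh_next(x, target, propose, proposal_prob, rng)
        counts[x] += 1
    return [k / n_steps for k in counts]
```

For a long run, `run_chain(50000)` should return frequencies close to the normalized weights 1/9, 2/9, 3/9, 2/9, 1/9. Note that only ratios of `target` are used, so the normalizing constant of $P$ is never needed.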


Figure 1: Row clustering - the rows in each cluster are similar.

We defined:
$X_{i,j}$ - the value of row $i$ in sample $j$.
$C_i$ - the cluster of row $i$.

Let's also define:
$D_j$ - the cluster of column $j$.

In theory, we can cluster both rows and columns (run the procedure on $X^T$), and divide the matrix $X$ into clusters of rows and columns.

Figure 2: Row and column clustering - the matrix is divided into squares.
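To make the picture of a matrix divided into squares concrete, the following minimal Python sketch reorders a matrix so that rows with the same $C_i$ and columns with the same $D_j$ become adjacent. The function name `reorder_by_clusters` and the toy data are hypothetical, chosen only for illustration.

```python
def reorder_by_clusters(X, C, D):
    """Permute rows by row cluster and columns by column cluster.

    X is a list of rows, C[i] is the cluster of row i, D[j] is the
    cluster of column j. After the permutation, each (row cluster,
    column cluster) pair occupies one contiguous block of the matrix.
    """
    row_order = sorted(range(len(C)), key=lambda i: C[i])
    col_order = sorted(range(len(D)), key=lambda j: D[j])
    return [[X[i][j] for j in col_order] for i in row_order]

# Toy example: the value of a cell depends only on (C_i, D_j).
C = [0, 1, 0, 1]
D = [1, 0, 1, 0]
X = [[10 * C[i] + D[j] for j in range(4)] for i in range(4)]
blocks = reorder_by_clusters(X, C, D)
for row in blocks:
    print(row)
```

In the reordered matrix the four blocks are visible directly: the top-left 2x2 block holds the cells with $C_i = 0, D_j = 0$, the top-right block the cells with $C_i = 0, D_j = 1$, and so on.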

Formally, we are looking for the joint probability of $X, C, D$:

$$P(X, C, D) \overset{(1)}{=} \Big[\prod_i P(C_i)\Big] \Big[\prod_j P(D_j)\Big] P(X \mid C, D) \overset{(2)}{=} \Big[\prod_i P(C_i)\Big] \Big[\prod_j P(D_j)\Big] \Big[\prod_{i,j} P(X_{i,j} \mid C_i, D_j)\Big]$$

where $X = \{X_{i,j}\}$, $C = \{C_i\}$, $D = \{D_j\}$, and

(1) the $C_i$'s are independent and the $D_j$'s are independent;
(2) $X_{i,j} \perp X \setminus X_{i,j},\ C \setminus C_i,\ D \setminus D_j \mid C_i, D_j$, meaning that if we know which row cluster and which column cluster $X_{i,j}$ belongs to, the rest of the world is irrelevant.

The sufficient statistics are also similar to those of one dimensional clustering:

$$M^0_c = \sum_i 1(C_i = c)$$
$$M^0_d = \sum_j 1(D_j = d)$$
$$M^1_{cd} = \sum_{i,j} X_{i,j} \, 1(C_i = c) \, 1(D_j = d)$$
$$M^2_{cd} = \sum_{i,j} X_{i,j}^2 \, 1(C_i = c) \, 1(D_j = d)$$

However, solving this problem with EM is not practical. In this case, the row clustering is not independent of the column clustering given the data (otherwise all the $X_{i,j}$ would be independent), which makes the calculation of the expectation in the E step inefficient and impractical:

$$\mathbb{E}[S_{c,d}] = \sum_{i,j} \underbrace{P(C_i = c, D_j = d \mid X)}_{C_i,\, D_j \text{ are dependent given the data}} X_{i,j}$$

2.1 Metropolis-Hastings - The Two Dimensional Case

We will estimate the expectation using the Metropolis-Hastings algorithm. We want to sample $c, d \sim P(C, D \mid X)$. As in the one dimensional case, the Metropolis-Hastings procedure draws candidate values $c'$ and $d'$ from a proposal distribution $R_{c,d}$ and returns $c'$ and $d'$ if the sampled value $u$ is at most $w$, a quantity chosen to ensure that the detailed balance condition holds. Otherwise it keeps $c$ and $d$.

    Sample next(c, d)
        Sample $c', d' \sim R_{c,d}(C, D)$
        Sample $u \sim U[0, 1]$
        $w \leftarrow \frac{P(c', d' \mid X)}{P(c, d \mid X)} \cdot \frac{R_{c',d'}(c, d)}{R_{c,d}(c', d')}$
        If $u \le w$
            Return $c', d'$
        Else
            Return $c, d$

The only difference from the one dimensional case is the proposal distribution $R_{c,d}$, which we need to define. Let $n$ be the number of rows and $m$ the number of columns. The procedure changes a single entry, with an equal chance of changing the row vector $c$ or the column vector $d$.

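    Proc R(c, d)
        $c' \leftarrow c$
        $d' \leftarrow d$
        Sample $u \sim U[0, 1]$
        If $u \le 0.5$
            Sample $i \sim U[1 \ldots n]$
            Sample $c'_i \sim P(C_i)$
        Else
            Sample $j \sim U[1 \ldots m]$
            Sample $d'_j \sim P(D_j)$
        Return $c', d'$

Let's also take a closer look at the ratio $\frac{P(c', d' \mid X)}{P(c, d \mid X)} \cdot \frac{R_{c',d'}(c, d)}{R_{c,d}(c', d')}$. Assume $i_0$ is the row index that has changed, so that $d' = d$ (a change in a column index is similar):

$$\frac{P(c', d' \mid X)}{P(c, d \mid X)} = \frac{P(c', d', X)}{P(c, d, X)} = \frac{P(C_{i_0} = c'_{i_0})}{P(C_{i_0} = c_{i_0})} \cdot \prod_j \frac{P(X_{i_0,j} \mid c'_{i_0}, d_j)}{P(X_{i_0,j} \mid c_{i_0}, d_j)}$$

The last equality holds because every factor of the joint probability that does not involve row $i_0$ is identical in the numerator and the denominator and therefore cancels. This is the same structure that gives $C_i \perp C_{-i} \mid D, X$.

The ratio $\frac{R_{c',d'}(c, d)}{R_{c,d}(c', d')}$ is left as an exercise.

What bothers us now is: how do we know that we have converged to the stationary distribution, and how many iterations do we need?

2.2 Gibbs Sampling

Let's look at an example of the distribution $P$:

Figure 3: The distribution P. The red disk is the sampling area.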
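Returning briefly to Section 2.1, the proposal Proc R and the acceptance test can be sketched together in Python. Several things here are assumptions that the lecture does not fix: $P(X_{i,j} \mid C_i, D_j)$ is taken to be Gaussian with unit variance and known block means `mu[c][d]`, the priors $P(C_i)$ and $P(D_j)$ are taken to be uniform, and all function names and the toy data are hypothetical. Under a uniform prior this proposal happens to be symmetric, so its ratio is 1 in the sketch; the general case is the exercise mentioned above.

```python
import math
import random

def log_joint(X, c, d, mu):
    """log P(X, C, D) up to a constant (assumed: uniform priors,
    Gaussian cells with mean mu[c_i][d_j] and unit variance)."""
    K, L = len(mu), len(mu[0])
    lp = -len(c) * math.log(K) - len(d) * math.log(L)
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            lp -= 0.5 * (x - mu[c[i]][d[j]]) ** 2
    return lp

def propose(c, d, K, L, rng):
    """Proc R: resample one row label or one column label from its prior."""
    c_new, d_new = list(c), list(d)
    if rng.random() <= 0.5:
        i = rng.randrange(len(c))
        c_new[i] = rng.randrange(K)      # c'_i ~ P(C_i), uniform by assumption
    else:
        j = rng.randrange(len(d))
        d_new[j] = rng.randrange(L)      # d'_j ~ P(D_j), uniform by assumption
    return c_new, d_new

def mh_next(X, c, d, mu, rng):
    """One Metropolis-Hastings step over (c, d)."""
    K, L = len(mu), len(mu[0])
    c_new, d_new = propose(c, d, K, L, rng)
    # For clarity the full joint is recomputed; by the derivation in the
    # text only the terms of the changed row or column actually differ.
    log_w = log_joint(X, c_new, d_new, mu) - log_joint(X, c, d, mu)
    if log_w >= 0 or rng.random() <= math.exp(log_w):
        return c_new, d_new
    return c, d

def run_mh(X, mu, n_steps, seed=0):
    rng = random.Random(seed)
    c, d = [0] * len(X), [0] * len(X[0])
    for _ in range(n_steps):
        c, d = mh_next(X, c, d, mu, rng)
    return c, d

# Hypothetical noise-free toy data generated from c = [0,0,1,1], d = [0,1,1].
MU = [[0.0, 4.0], [8.0, 12.0]]
X_TOY = [[0, 4, 4], [0, 4, 4], [8, 12, 12], [8, 12, 12]]
```

Calling `run_mh(X_TOY, MU, 2000)` runs the chain on the toy matrix. Because the toy posterior is sharply peaked, the chain should settle on the labels that generated the data; on real data one would collect many samples rather than read off a single final state.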

The samples drawn from the proposal distribution $R_{c,d}$ fall into the area of the disk in Figure 3. A small disk indicates a conservative proposal that takes a tiny step, and a large disk indicates a large step. Using a small disk we will eventually reach the stationary distribution, but it will take a very long time. A large disk, however, can cover the whole interesting area, causing us to sample essentially from the uniform distribution, without remembering where we came from. We should also notice that calculating $\frac{P(c', d' \mid X)}{P(c, d \mid X)}$ becomes harder as the perturbations get wilder. We will try to overcome this problem using another procedure, a variant of Metropolis-Hastings, which tries to take larger steps without losing the proposal distribution.

Given $x$ we want to sample $y$: $y \sim P(y \mid x)$, where $y$ has length $n$.

    gibbs next(y)
        $y' \leftarrow y$
        Sample $i \sim U[1 \ldots n]$
        Sample $y'_i \sim P(y_i \mid y_1 \ldots y_{i-1}, y_{i+1} \ldots y_n, x)$
        Return $y'$

How can we be sure that $y'$ is from the stationary distribution, given that $y$ is from the stationary distribution? Let's define $y_{-i} = \langle y_1 \ldots y_{i-1}, y_{i+1} \ldots y_n \rangle$. Suppose $y$ is distributed according to $P(y \mid x)$, so that $y_{-i}$ is distributed according to $P(y_{-i} \mid x)$, and let $Q$ be the distribution of $y'$ after one step. Since the step leaves $y_{-i}$ unchanged,

$$Q(y' \mid x) = P(y_{-i} \mid x) \, P(y'_i \mid y_{-i}, x) = P(y' \mid x)$$

This equation is true for each $i$, therefore $Q = P$, and in each iteration we want to sample from

$$P(y_i \mid y_{-i}, X) \propto P(y_i, y_{-i}, X)$$

2.3 Two Sided Clustering - Gibbs Sampling

We will modify the abstract Gibbs procedure to solve our problem: clustering the rows and the columns together. The two sided clustering procedure is similar to the Gibbs sampling algorithm, only this time $P$ satisfies

$$P(c_i = c \mid c_{-i}, D, X) \propto P(c_i = c) \prod_j P(X_{i,j} \mid C_i = c, D_j)$$

    gibbs next(c, d)
        $c' \leftarrow c$
        $d' \leftarrow d$
        Sample $u \sim U[0, 1]$
        If $u \le 0.5$
            Sample $i \sim U[1 \ldots n]$
            Sample $c'_i \sim P(c_i \mid c_{-i}, d, x)$
        Else
            Sample $j \sim U[1 \ldots m]$
            Sample $d'_j \sim P(d_j \mid c, d_{-j}, x)$
        Return $c', d'$

We describe another procedure, Block-Gibbs, a variant of the procedure above, which samples $C$ first and then, given $C$, samples $D$:

    BG next(c, d)
        Sample $c' \sim P(c \mid d, x)$
        Sample $d' \sim P(d \mid c', x)$
        Return $c', d'$

We want to prove that we can sample from $P(C \mid D, X)$ efficiently.

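To do so we will prove that:

$$P(C \mid D, X) = \prod_i P(C_i \mid D, X)$$

First,

$$P(C_i \mid C_{-i}, D, X) \propto P(C_i) \prod_j P(X_{i,j} \mid C_i, D_j) \propto P(C_i \mid D, X_{i,1} \ldots X_{i,m})$$

and since both ends are normalized distributions over $C_i$, they are equal. In particular, $C_i \perp C_{-i} \mid D, X$. Then

$$P(C \mid D, X) \underbrace{=}_{\text{the chain rule}} \prod_i P(C_i \mid C_1 \ldots C_{i-1}, D, X) \underbrace{=}_{C_i \perp C_{-i} \mid D, X} \prod_i P(C_i \mid D, X_{i,1} \ldots X_{i,m})$$

and therefore we can sample each row separately.

The last procedure for today is an efficient variant of the above procedure (the number of iterations is smaller). On each step this procedure changes half of the parameters, and thus enlarges the steps and reduces the number of iterations. This can be done since, given $C$, the $d_j$'s are independent, and given $D$, the $c_i$'s are independent.

    BG sample(c, d)
        for $i = 1 \ldots n$
            Sample $c_i \sim P(c_i \mid D, X)$
        for $j = 1 \ldots m$
            Sample $d_j \sim P(d_j \mid C, X)$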
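The BG sample procedure above can be sketched in Python as follows. As before, this rests on assumptions the lecture does not fix: Gaussian cells with unit variance and known block means `mu[c][d]`, uniform priors on $C_i$ and $D_j$, and hypothetical function names and toy data. The last function computes the sufficient statistics $M^0_c$, $M^0_d$, $M^1_{cd}$ and $M^2_{cd}$ for a given sample.

```python
import math
import random

def sample_from_log_weights(log_w, rng):
    """Draw an index with probability proportional to exp(log_w[k])."""
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]   # subtract max for stability
    return rng.choices(range(len(weights)), weights=weights)[0]

def bg_sample(X, c, d, mu, rng):
    """One block-Gibbs sweep: all rows given D, then all columns given C.
    The incoming c is only replaced, since P(c_i | D, X) does not use it."""
    K, L = len(mu), len(mu[0])
    n, m = len(X), len(X[0])
    # Rows are independent given D, so each c_i is drawn on its own.
    c = [sample_from_log_weights(
            [sum(-0.5 * (X[i][j] - mu[k][d[j]]) ** 2 for j in range(m))
             for k in range(K)], rng)
         for i in range(n)]
    # Columns are independent given C (using the freshly sampled c).
    d = [sample_from_log_weights(
            [sum(-0.5 * (X[i][j] - mu[c[i]][l]) ** 2 for i in range(n))
             for l in range(L)], rng)
         for j in range(m)]
    return c, d

def sufficient_statistics(X, c, d, K, L):
    """Counts per cluster and per-block sums of X and X^2."""
    M0_c, M0_d = [0] * K, [0] * L
    M1 = [[0.0] * L for _ in range(K)]
    M2 = [[0.0] * L for _ in range(K)]
    for ci in c:
        M0_c[ci] += 1
    for dj in d:
        M0_d[dj] += 1
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            M1[c[i]][d[j]] += x
            M2[c[i]][d[j]] += x * x
    return M0_c, M0_d, M1, M2

def run_block_gibbs(X, mu, n_sweeps, seed=0):
    rng = random.Random(seed)
    c, d = [0] * len(X), [0] * len(X[0])
    for _ in range(n_sweeps):
        c, d = bg_sample(X, c, d, mu, rng)
    return c, d

# Hypothetical noise-free toy data generated from c = [0,0,1,1], d = [0,1,1].
MU = [[0.0, 4.0], [8.0, 12.0]]
X_TOY = [[0, 4, 4], [0, 4, 4], [8, 12, 12], [8, 12, 12]]
```

On the toy data, `run_block_gibbs(X_TOY, MU, 20)` followed by `sufficient_statistics` illustrates the intended use: each sweep updates every row label and every column label, and the statistics of the sampled $(c, d)$ can then be averaged over many sweeps to approximate the expectation that was impractical to compute in the E step.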
