Introduction to Computational Molecular Biology. Gibbs Sampling

18.417 Introduction to Computational Molecular Biology
Lecture 19: November 16, 2004
Scribe: Tushara C. Karunaratna
Lecturer: Ross Lippert
Editor: Tushara C. Karunaratna

Gibbs Sampling

Introduction

Let's first recall the Motif Finding Problem: given a set of DNA sequences, each of length t, find the profile (a set of l-mers, one from each sequence) that maximizes the consensus score. We have already seen various naive brute-force approaches for solving this problem. In this lecture, we will apply a probabilistic method known as Gibbs Sampling to solve this problem.

A probabilistic approach to Motif Finding

We can generalize the Motif Finding Problem as follows: given a multivariable scoring function f(y_1, y_2, ..., y_n), find the vector y that maximizes f. Consider a probability distribution p where p ∝ f. Intuitively, if f is relatively large at the optimum, then if we repeatedly sample from the probability distribution p, we are likely to quickly encounter the optimum.

Gibbs Sampling provides us with a method of sampling from a probability distribution over a large set. We will use a technique known as simulated annealing to transform a probability distribution into one that has a relatively tall peak at the optimum, to ensure that Gibbs Sampling is likely to quickly encounter the optimum. In particular, we will observe visually that the probability distribution p ∝ f^{1/T}, for a sufficiently small T, is a good choice.
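To make the effect of the exponent 1/T concrete, here is a minimal sketch (a made-up example, not from the original notes) with a toy scoring function f over five states whose optimum is state 4: sampling from p ∝ f hits the optimum often, while sampling from p ∝ f^{1/T} with a small T hits it almost always.

import random
from collections import Counter

# toy scoring function over five states; state 4 is the optimum
f = {1: 1.0, 2: 3.0, 3: 1.5, 4: 5.0, 5: 2.0}

def sample_counts(weights, k=10000, rng=random):
    """Draw k samples from the distribution proportional to the given weights."""
    states = list(weights)
    return Counter(rng.choices(states, weights=[weights[s] for s in states], k=k))

print(sample_counts(f))                                        # p proportional to f
T = 0.2
print(sample_counts({s: w ** (1 / T) for s, w in f.items()}))  # p proportional to f^(1/T)

With T = 0.2 in this toy example, state 4 carries more than 90% of the probability mass, so nearly every sample lands on the optimum.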

Gibbs Sampling

Gibbs Sampling solves the following problem.

Input: a probability distribution p(y_1, y_2, ..., y_n), where each y_i ∈ S. S^n may be big, but S is assumed to be manageable.

Output: a random y chosen from the probability distribution p.

Gibbs Sampling uses the technique of Monte Carlo Markov Chain simulation. The idea is to set up a Markov Chain having p as its steady-state distribution, and then simulate this Markov Chain for long enough to be confident that an approximation of the steady-state has been attained. The final state of the simulation approximately represents a sample from the steady-state distribution.

Let's now define our Markov Chain. The set of states of our Markov Chain is S^n. Transitions exist only between states differing in at most one coordinate. For states y = (y_1, ..., y_m, ..., y_n) and y' = (y_1, ..., y'_m, ..., y_n), we define the transition probability

    T(y → y') = (1/n) · p(y_1, ..., y'_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).
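Here is a minimal sketch of this transition for a generic unnormalized distribution p over S^n (the function and variable names are not from the notes): pick a coordinate m uniformly at random, then resample y_m with probability proportional to p evaluated at each candidate value, the other coordinates held fixed.

import random

def gibbs_step(p, y, S, rng=random):
    """One transition of the Markov Chain: resample one coordinate of y from its
    conditional distribution under the (possibly unnormalized) target p."""
    n = len(y)
    m = rng.randrange(n)              # pick coordinate m uniformly (the 1/n factor)
    values = list(S)
    weights = []
    for v in values:                  # relative weights p(y_1, ..., v, ..., y_n)
        y[m] = v
        weights.append(p(y))
    y[m] = rng.choices(values, weights=weights, k=1)[0]
    return y

# usage sketch: p need only be known up to its normalizing constant
def p(y):
    return 2.0 ** sum(y)              # unnormalized distribution over {0, 1}^4

y = [0, 0, 0, 0]
for _ in range(10000):
    y = gibbs_step(p, y, [0, 1])
print(y)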

We now show that the distribution p is a steady-state distribution of our Markov Chain. Recall that the defining property of a steady-state distribution π is

    πT = π.

This property is known as global balance. The stronger property

    π(y) T(y → y') = π(y') T(y' → y)

is known as detailed balance. We can see that detailed balance implies global balance by summing both sides of the detailed balance condition over y':

    Σ_{y'} π(y) T(y → y') = Σ_{y'} π(y') T(y' → y)
    π(y) Σ_{y'} T(y → y') = Σ_{y'} π(y') T(y' → y)
    π(y) = (πT)(y)

Therefore, let's just check whether p satisfies detailed balance. If y differs from y' in zero places or in more than one place, then detailed balance trivially holds (in the latter case, both sides of the detailed balance condition evaluate to zero). So, suppose that y differs from y' in only one place, say coordinate m. The left-hand side of the detailed balance condition evaluates to

    p(y) · (1/n) · p(y_1, ..., y'_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).

The right-hand side evaluates to

    p(y') · (1/n) · p(y_1, ..., y_m, ..., y_n) / Σ_{y''_m ∈ S} p(y_1, ..., y''_m, ..., y_n).

The two sides are equal, as desired. Therefore, p is indeed the steady-state distribution of our Markov Chain.

Scoring profiles

Let's investigate a probabilistic approach to scoring profiles, as an alternative to simply using the consensus score. We assume a background frequency P_x for character x. Let C_{x,i} denote the number of occurrences of character x in the i-th column of the profile. We call this the profile matrix. Then, in the background, the probability that a profile has profile matrix C is given by

    prob(C) = Π_{i=0}^{l-1} [ n! / (C_{a,i}! C_{c,i}! C_{g,i}! C_{t,i}!) ] · P_a^{C_{a,i}} P_c^{C_{c,i}} P_g^{C_{g,i}} P_t^{C_{t,i}}
            ∝ Π_{x,i} P_x^{C_{x,i}} / C_{x,i}!    (dropping the constant factor (n!)^l).

Since the profile corresponding to the actual motif locations should have small background probability, we assign

    score(C) ∝ 1/prob(C) ∝ Π_{x,i} C_{x,i}! / P_x^{C_{x,i}}.

Now, log(n!) = Θ(n log n). Therefore,

    score(C) ≈ exp( Σ_{x,i} C_{x,i} log( C_{x,i} / P_x ) ).

The exponent is known as the entropy of the profile. In summary, maximizing the entropy, rather than the consensus score, is a statistically more adequate approach to finding motifs.
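A small sketch of this scoring (the helper names and the uniform background are illustrative assumptions, not from the notes): build the profile matrix C from the chosen l-mers and evaluate the exponent Σ_{x,i} C_{x,i} log(C_{x,i}/P_x), using the same log n! ≈ n log n approximation as above.

import math

ALPHABET = "acgt"

def profile_matrix(lmers):
    """C[x][i] = number of occurrences of character x in column i of the profile."""
    l = len(lmers[0])
    C = {x: [0] * l for x in ALPHABET}
    for s in lmers:
        for i, x in enumerate(s):
            C[x][i] += 1
    return C

def entropy_score(lmers, background):
    """Exponent of score(C): sum of C_{x,i} * log(C_{x,i} / P_x) over all x and i."""
    C = profile_matrix(lmers)
    return sum(c * math.log(c / background[x])
               for x in ALPHABET for c in C[x] if c > 0)

# toy profile drawn from three sequences, with a uniform background
background = {x: 0.25 for x in ALPHABET}
print(entropy_score(["acgta", "acgtt", "accta"], background))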

Motif finding via Gibbs Sampling

Here is pseudocode for Motif Finding using the Gibbs Sampling technique.

1. Randomly generate a start state y_1, ..., y_n.
2. Pick m uniformly at random from 1, ..., n.
3. Replace y_m with y'_m picked randomly from the distribution that assigns relative weight 1/prob(C(y_1, ..., y'_m, ..., y_n)) to y'_m.
4. <do whatever with the sample>
5. Goto step 2.

Note that we are just doing a simulation of the Markov Chain defined by the Gibbs Sampling technique.
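The pseudocode can be turned into a short runnable sketch; this is one reading of it, with the entropy score standing in for the relative weight 1/prob(C), and with the function names, uniform background, and toy sequences all being illustrative assumptions. The state is one motif start position per sequence, and step 3 resamples one position with weight proportional to exp(entropy of the resulting profile).

import math
import random

def profile_entropy(seqs, starts, l, background):
    """Entropy of the profile formed by the l-mers seqs[j][starts[j] : starts[j] + l]."""
    counts = [{} for _ in range(l)]
    for s, start in zip(seqs, starts):
        for i in range(l):
            x = s[start + i]
            counts[i][x] = counts[i].get(x, 0) + 1
    return sum(c * math.log(c / background[x]) for col in counts for x, c in col.items())

def gibbs_motif_search(seqs, l, steps=5000, rng=random):
    background = {x: 0.25 for x in "acgt"}       # uniform background, for illustration
    n, t = len(seqs), len(seqs[0])
    # 1. randomly generate a start state y_1, ..., y_n
    y = [rng.randrange(t - l + 1) for _ in range(n)]
    best = (profile_entropy(seqs, y, l, background), list(y))
    for _ in range(steps):
        m = rng.randrange(n)                     # 2. pick m uniformly at random
        # 3. resample y_m with relative weight proportional to the profile score
        scores = []
        for start in range(t - l + 1):
            y[m] = start
            scores.append(profile_entropy(seqs, y, l, background))
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # rescale to avoid overflow
        y[m] = rng.choices(range(t - l + 1), weights=weights, k=1)[0]
        # 4. "do whatever with the sample": here, keep the best state seen so far
        best = max(best, (profile_entropy(seqs, y, l, background), list(y)))
    return best                                  # 5. (the loop above is step 5)

# toy input: the motif "ccgtta" is planted in each sequence
seqs = ["atgaccgtta", "ccgttagcat", "tagccgttaa"]
print(gibbs_motif_search(seqs, l=6))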

Simulated Annealing

Annealing is a process by which glass is put into a highly durable state by a process of slow cooling. We can use the same idea here: to amplify the probability of sampling at the optimum of a probability distribution p, we instead sample from p^{1/T} where T → 0.

Figure 19.1 shows us a graph of a probability distribution p. The optimum occurs at state 4, but there are other peaks that have significantly large height.

Figure 19.1: Graph of a probability distribution p.

Figures 19.2 and 19.3 show the graphs of the probability distributions p^5 and p^50 respectively. The height of the peak at state 4 has increased considerably with respect to the heights of the other peaks.

Figure 19.2: Graph of p^5.

Figure 19.3: Graph of p^50.

How do we find the right T? Here are two possible approaches: we can either drop T by a small amount after reaching steady-state, or we can drop T by a small amount at each step.

Some questions that we didn't answer

For how long should we run the Markov Chain? How often can we sample?
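As a closing illustration, here is a minimal sketch of the two cooling schedules just mentioned, using a made-up six-state distribution whose tallest peak is at state 4 in the spirit of Figure 19.1; how long each block should run is exactly the "how long should we run the chain" question left open above.

import random

# toy distribution with its tallest peak at state 4 (cf. Figure 19.1)
p = {1: 0.10, 2: 0.22, 3: 0.08, 4: 0.30, 5: 0.20, 6: 0.10}

def sample_tempered(dist, T, rng=random):
    """Draw a state from dist^(1/T); with a single variable the Gibbs update is exact."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] ** (1 / T) for s in states], k=1)[0]

# Option 1: hold T fixed for a block of steps (hopefully long enough to approach
# the steady state at that temperature), then drop T by a small amount.
T, state = 1.0, None
while T > 0.05:
    for _ in range(500):
        state = sample_tempered(p, T)
    T *= 0.9
print("final state:", state)

# Option 2 would instead drop T by a small amount at every step,
# e.g. T *= 0.999 after each call to sample_tempered.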