Advanced Sampling Algorithms
1 + Advanced Sampling Algorithms
2 + Advanced Sampling Algorithms Mobashir Mohammad, Hirak Sarkar, Parvathy Sudhir, Yamilet Serrano Llerena, Aditya Kulkarni, Tobias Bertelsen, Nirandika Wanigasekara, Malay Singh
3 + Introduction Mobashir Mohammad 3
4 + Sampling Algorithms 4 Monte Carlo algorithms: give an approximate answer with a certain probability; may terminate with a failure signal; the correct answer is not guaranteed; running time is bounded. Las Vegas algorithms: always give an exact answer; may not terminate in a definite time; the answer is guaranteed; running time is not bounded.
5 + Monte Carlo Method 5 What are the chances of winning with a properly shuffled deck? Hard to compute, because winning or losing depends on a complex procedure of reorganizing the cards. Insight: why not play just a few hands, and see empirically how many we do in fact win? More generally: we can approximate a probability density function using only a few samples from that density.
6 + Simplest Monte Carlo Method 6 Finding the value of π by shooting darts. π/4 equals the area of a circle of diameter 1: π/4 = ∬_{x² + y² < 1, 0 ≤ x, y ≤ 1} dx dy. The integral is solved with Monte Carlo: randomly select a large number of points inside the unit square, then π ≈ 4 · N_inside_circle / N_total (exact as N_total → ∞).
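The dart-throwing estimate above takes only a few lines; a minimal sketch in Python (the function name and sample count are illustrative choices, not from the slides):

```python
import random

def estimate_pi(n_total, seed=0):
    """Estimate pi: sample points uniformly in the unit square and count
    the fraction landing inside the quarter circle x^2 + y^2 < 1."""
    rng = random.Random(seed)
    n_inside_circle = sum(
        1 for _ in range(n_total)
        if rng.random() ** 2 + rng.random() ** 2 < 1.0
    )
    return 4.0 * n_inside_circle / n_total

print(estimate_pi(100_000))  # close to 3.14159 for large n_total
```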
7 + Monte Carlo Principle 7 Given a very large set X and a distribution p(x) over it, we draw a set of N independent and identically distributed samples x_1, …, x_N. We can then approximate the distribution using these samples: p_N(x) = (1/N) Σ_{i=1}^{N} 1(x_i = x) → p(x) as N → ∞. We can also use these samples to compute expectations.
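A quick numerical check of this principle (the particular target distribution below is an illustrative assumption):

```python
import random
from collections import Counter

def empirical_distribution(samples):
    """p_N(x) = (1/N) * number of samples equal to x."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

rng = random.Random(1)
# draw N i.i.d. samples from p = {0: 0.2, 1: 0.5, 2: 0.3}
samples = rng.choices([0, 1, 2], weights=[0.2, 0.5, 0.3], k=50_000)
p_hat = empirical_distribution(samples)  # approaches p as N grows
```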
8 + Sampling 8 Problem: exact inference from a probability distribution is hard. Example: determine the number of people likely to go shopping at a Nike store at Vivo City on a Sunday. Solution: use random samples to get an approximate inference. How: use a Markov chain to get random samples!
9 + Markov Chains Hirak Sarkar 9
10 + Introduction to Markov Chain 10 Real world systems contain uncertainty and evolve over time. Stochastic processes and Markov chains model such systems. A discrete time stochastic process is a sequence of random variables X_0, X_1, … and is typically denoted by {X_n}.
11 + Components of a Stochastic Process 11 The state space of a stochastic process is the set of all values that the {X_n} can take. Time: n = 1, 2, … The set of states is denoted by S = {s_1, s_2, …, s_k}; here there are k states, so each X_n takes one of k values.
12 + Markov Property and Markov Chain 12 Consider the special class of stochastic processes that satisfy the Markov property: the state X_n depends only on X_{n−1}, not on any X_i for i < n−1. Formally, P(X_n = i_n | X_0 = i_0, …, X_{n−1} = i_{n−1}) = P(X_n = i_n | X_{n−1} = i_{n−1}) = p(i_{n−1}, i_n). We call such a process a Markov chain. The matrix containing p(i_{n−1}, i_n) in cell (i_{n−1}, i_n) is called the transition matrix.
13 + Transition Matrix 13 The one-step transition matrix for a Markov chain with states S = {0, 1, 2} is P = [p_00 p_01 p_02; p_10 p_11 p_12; p_20 p_21 p_22], where p_ij = Pr(X_1 = j | X_0 = i). If P is the same at every point of time, the chain is called time homogeneous. Each row is a distribution: Σ_j p_ij = 1 and p_ij ≥ 0 for all i, j.
14 + Example: Random Walk 14 [figure: a graph, its adjacency matrix, and the corresponding transition matrix with entries 1 and 1/2]
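Turning an adjacency matrix into a random-walk transition matrix is just row normalization; a sketch, using a three-node graph assumed to match the A/B/C figures on the next slides:

```python
def adjacency_to_transition(adj):
    """p_ij = a_ij / degree(i): each row of the adjacency matrix is
    normalized so that it sums to 1."""
    return [[a / sum(row) for a in row] for row in adj]

# node A linked to B and C, while B and C link back only to A
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
P = adjacency_to_transition(adj)
# P[0] == [0.0, 0.5, 0.5]; P[1] == [1.0, 0.0, 0.0]
```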
15-18 + What is a random walk [figures: a walk on three nodes A, B, C with edge probabilities 1 and 1/2, shown step by step at t = 0, 1, 2, 3]
19 + Probability Distributions 19 Let x_t(i) = P(X_t = i) be the probability that the surfer is at node i at time t. Then x_{t+1}(i) = Σ_j P(X_{t+1} = i | X_t = j) P(X_t = j) = Σ_j p_{j,i} x_t(j), i.e. x_{t+1} = x_t P = x_{t−1} P·P = … = x_0 P^{t+1}. What happens when the surfer keeps walking for a long time?
20 + Stationary Distribution 20 When the surfer keeps walking for a long time, the distribution does not change anymore, i.e. x_{T+1} = x_T. We denote this distribution by π.
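The fixed point x_{T+1} = x_T can be found numerically by iterating x_{t+1} = x_t P; a minimal sketch with an assumed two-state chain (the matrix values are illustrative):

```python
def step(x, P):
    """One step of x_{t+1}(i) = sum_j x_t(j) * p_{j,i}."""
    k = len(P)
    return [sum(x[j] * P[j][i] for j in range(k)) for i in range(k)]

P = [[0.9, 0.1],
     [0.5, 0.5]]
x = [1.0, 0.0]           # start in state 0 with certainty
for _ in range(100):     # iterate until the distribution stops changing
    x = step(x, P)
# x is now (approximately) the stationary distribution pi = (5/6, 1/6)
```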
21 + Basic Limit Theorem 21 [figure: the matrix powers P, P^5, P^10, P^20, whose rows converge to one identical row, the stationary distribution]
22 + Example 22
23 + Classification of States 23 State j is reachable from state i if the probability to go from i to j in n > 0 steps is greater than zero (equivalently, if the state transition diagram has a path from i to j). A subset S′ of the state space S is closed if p_ij = 0 for every i ∈ S′ and j ∉ S′. A state i is said to be absorbing if it is a single-element closed set. A closed set S′ of states is irreducible if every state j ∈ S′ is reachable from every state i ∈ S′. A Markov chain is said to be irreducible if its state space S is irreducible.
24 + Example 24 [figure: an irreducible Markov chain on states 0, 1, 2 with transitions p_00, p_01, p_10, p_12, p_22, and a reducible Markov chain on states 0-4 with absorbing state 4 and a closed irreducible set {2, 3}]
25 + Periodic and Aperiodic States 25 Suppose that the structure of the Markov chain is such that state i is visited only after a number of steps that is an integer multiple of an integer d > 1. Then the state is called periodic with period d. If no such integer exists (i.e., d = 1), the state is called aperiodic. [figure: example of a periodic state]
26 + Stationary Distribution 26 Example: start from π^(0) = (π_0^(0), π_1^(0)) = (0.5, 0.5) and compute π^(1) = π^(0) P. [figure: the transition matrix P and the resulting numeric values]
27 + Steady State Analysis 27 Recall that the probability of finding the MC at state i after the k-th step is given by π_i(k) = Pr(X_k = i). An interesting question is what happens in the long run, i.e. lim_{k→∞} π_i(k). This is referred to as the steady state, equilibrium, or stationary state probability. Questions: Do these limits exist? If they exist, do they converge to a legitimate probability distribution, i.e. Σ_i π_i = 1? How do we evaluate π_j for all j?
28 + Steady State Analysis 28 THEOREM: In a finite irreducible aperiodic Markov chain, a unique stationary state probability vector π exists such that π_j > 0 and lim_{k→∞} π_j(k) = π_j. The steady state distribution π is determined by solving π = πP and Σ_i π_i = 1.
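Checking that a candidate π solves π = πP and Σ_i π_i = 1 is direct; a sketch (the two-state chain and its stationary vector are illustrative assumptions):

```python
def is_stationary(pi, P, tol=1e-9):
    """True iff sum(pi) == 1 and pi P == pi, up to tolerance tol."""
    k = len(P)
    if abs(sum(pi) - 1.0) > tol:
        return False
    return all(
        abs(sum(pi[j] * P[j][i] for j in range(k)) - pi[i]) <= tol
        for i in range(k)
    )

P = [[0.9, 0.1],
     [0.5, 0.5]]
print(is_stationary([5 / 6, 1 / 6], P))  # True
print(is_stationary([0.5, 0.5], P))      # False
```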
29 + Sampling from a Markov Chain Parvathy 29
30 + Sampling from a Markov Chain 30 Sampling: create random samples from the desired probability distribution P. Construct a Markov chain whose steady state distribution π equals P. Run the Markov chain long enough; on convergence, its states are approximately distributed according to π, so random sampling of the Markov chain yields random samples of P. But: it is difficult to determine how long to run the Markov chain.
31 + Ising Model 31 Named after physicist Ernst Ising, invented by Wilhelm Lenz. Mathematical model of ferromagnetism. Consists of discrete variables representing magnetic dipole moments of atomic spins (+1 or −1).
32 + 1D Ising Model 32 Solved by Ernst Ising. N spins arranged in a ring (x_N adjacent to x_1). P(s) = e^{−βH(s)} / Z, where β is the inverse temperature, H(s) is the energy, and s = (x_1, …, x_N) is one possible configuration. Higher probability on states with lower energy.
33 + 2D Ising Model 33 Spins are arranged in a lattice; each spin interacts with its 4 neighbours. One of the simplest statistical models to show a phase transition.
34 + 2D Ising Model (2) 34 G = (V, E); x_i ∈ {+1, −1} is the value of the spin at location i; S = {s_1, s_2, …, s_n} is the state space. H(s) = −Σ_{(x_i, x_j) ∈ E} s(x_i) s(x_j). Similar spins subtract energy; opposite spins add energy.
35 + 2D Ising Model (3) 35 P(s) = e^{−βH(s)} / Z = e^{β Σ_{(x_i, x_j) ∈ E} s(x_i) s(x_j)} / Z. If β = 0, all spin configurations have the same probability; if β > 0, lower energy is preferred.
36 + Exact Inference is Hard 36 Posterior distribution over s, with Y an observed variable: P(s | Y) = p(s, Y) / p(Y) = p(s, Y) / Σ_s p(s, Y). Intractable computations: the joint probability distribution has 2^N possible combinations of s; the marginal probability distribution at a site; MAP estimation.
37 + The Big Question 37 Given a distribution π on S = {s_1, s_2, …, s_n}, simulate a random object with distribution π.
38 + Methods 38 Generate random samples to estimate a quantity; samples are generated Markov-chain style. Markov Chain Monte Carlo (MCMC). Propp-Wilson simulation: a Las Vegas variant. Sandwiching: an improvement on Propp-Wilson.
39 + MARKOV CHAIN MONTE CARLO METHOD (MCMC) YAMILET R. SERRANO LL. 39
40 + Recall 40 Given a probability distribution π on S = {s1, …, sk}, how do we simulate a random object with distribution π?
41 + Intuition 41 Given a probability distribution π on S = {s1, …, sk}, how do we simulate a random object with distribution π? The state space is high-dimensional.
42 + MCMC 42 1. Construct an irreducible and aperiodic Markov chain [X0, X1, …] whose stationary distribution is π. 2. If we run the chain with an arbitrary initial distribution, the Markov chain convergence theorem guarantees that the distribution of the chain at time n converges to π. 3. Hence, if we run the chain for a sufficiently long time n, the distribution of Xn will be very close to π, so Xn can be used as a sample.
43 + MCMC (2) 43 Generally, two types of MCMC algorithms: Metropolis-Hastings and Gibbs sampling.
44 + Metropolis Hastings Algorithm 44 Original method: Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953). Generalized by Hastings in 1970. Rediscovered by Tanner and Wong (1987) and Gelfand and Smith (1990). It is one way to implement MCMC. [photo: Nicholas Metropolis]
45 + Metropolis Hastings Algorithm 45 Basic Idea GIVEN: A probability distribution π on S = {s1,, sk} GOAL : Approx. sample from π Start with a proposal distribution Q(x,y) Q(x,y) specifies transition of Markov Chain Q(x,y) plays the role of the transition matrix x: current state y: new proposal By accepting/rejecting the proposal, MH simulates a Markov Chain, whose stationary distribution is π
46 + Metropolis Hastings Algorithm 46 Algorithm. GIVEN: a probability distribution π on S = {s1, …, sk}. GOAL: approximately sample from π. Given the current sample x, draw y from the proposal distribution Q(x, ·), draw U ~ Uniform(0, 1), and update: the next state is y if U ≤ α(x, y) and stays x otherwise, where the acceptance probability is α(x, y) = min(1, [π(y) Q(y, x)] / [π(x) Q(x, y)]).
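The accept/reject loop above can be sketched for a small discrete target; the ring-neighbour proposal and the particular target weights are illustrative assumptions, not from the slides:

```python
import random

def metropolis_hastings(weights, n_steps, seed=0):
    """Metropolis-Hastings on states 0..k-1. The target pi is proportional
    to `weights`; the proposal moves to a uniformly chosen ring neighbour,
    which is symmetric, so alpha = min(1, pi(y)/pi(x))."""
    rng = random.Random(seed)
    k = len(weights)
    x = 0
    samples = []
    for _ in range(n_steps):
        y = (x + rng.choice([-1, 1])) % k       # propose a neighbour
        if rng.random() < min(1.0, weights[y] / weights[x]):
            x = y                               # accept
        samples.append(x)                       # a rejection keeps x
    return samples

samples = metropolis_hastings([1, 2, 3, 4], 200_000)[10_000:]
freq_3 = samples.count(3) / len(samples)        # approaches 4/10
```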
47 + Metropolis Hastings Algorithm 47 The Ising Model. Consider m sites around a circle. Each site i can have one of two spins x_i ∈ {−1, 1}. The target distribution: π(x) ∝ exp(β Σ_{i=1}^{m} x_i x_{i+1}), with x_{m+1} = x_1.
48 + Metropolis Hastings Algorithm The Ising Model. Target distribution: π(x) ∝ exp(β Σ_i x_i x_{i+1}). Proposal distribution: 1. randomly pick one of the m spins; 2. flip its sign. Acceptance probability (say the i-th spin is flipped): since the proposal is symmetric, α = min(1, π(y)/π(x)) = min(1, exp(−2β x_i (x_{i−1} + x_{i+1}))).
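A sketch of this sampler for the ring (the spin count, β and step count below are illustrative assumptions):

```python
import math
import random

def ising_ring_mh(m, beta, n_steps, seed=0):
    """Metropolis sampler for pi(x) ∝ exp(beta * sum_i x_i x_{i+1}) on a
    ring of m spins; the proposal flips one uniformly chosen spin."""
    rng = random.Random(seed)
    x = [rng.choice([-1, 1]) for _ in range(m)]
    for _ in range(n_steps):
        i = rng.randrange(m)
        neighbour_sum = x[(i - 1) % m] + x[(i + 1) % m]
        # pi(flip)/pi(current) = exp(-2 * beta * x_i * (x_{i-1} + x_{i+1}))
        if rng.random() < min(1.0, math.exp(-2.0 * beta * x[i] * neighbour_sum)):
            x[i] = -x[i]
    return x

state = ising_ring_mh(m=20, beta=1.0, n_steps=5_000)
```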
49 + Metropolis Hastings Algorithm 49 The Ising Model. [image taken from Monte Carlo Investigation of the Ising Model (2006), Tobin Fricke]
50 + Disadvantages of MCMC 50 No matter how large n is taken to be in the MCMC algorithm, there will still be some discrepancy between the distribution of the output and the target distribution π. To make this error small, we need to figure out how large n needs to be.
51 + Bounds on Convergence of MCMC Aditya Kulkarni 51
52 + Seminar on Probabilities 52 (Strasbourg) Since it is difficult to obtain asymptotic bounds on the convergence time of MCMC algorithms for Ising models, use quantitative bounds: use characteristics of the Ising model's MCMC algorithm to obtain quantitative bounds. [photo: David Aldous]
53 + What we need to ensure 53 As time goes on, we approach the stationary distribution monotonically: d(t) → 0 as t → ∞. When we stop, the sample should follow a distribution that is within some factor ε of the stationary distribution: d(t) ≤ ε.
54 + Total Variation Distance TV 54 Given two distributions p and q over a finite set of states S, the total variation distance is ‖p − q‖_TV = (1/2) ‖p − q‖_1 = (1/2) Σ_{x ∈ S} |p(x) − q(x)|.
55 + Convergence Time 55 Let X = (X_0, X_1, …) be a Markov chain on a state space V with stationary distribution π. We define d(t) as the worst total variation distance at time t: d(t) = max_{x ∈ V} ‖P(X_t = · | X_0 = x) − π‖_TV. The mixing time τ(ε) is the minimum time t such that d(t) is at most ε: τ(ε) = min {t : d(t) ≤ ε}. Define τ = τ(1/2e) = min {t : d(t) ≤ 1/2e}.
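These definitions can be computed exactly for a small chain by propagating point-mass initial distributions; the two-state chain below is an illustrative assumption:

```python
def tv_distance(p, q):
    """Total variation distance: (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def dist_after(t, start, P):
    """Distribution of X_t given X_0 = start."""
    k = len(P)
    x = [1.0 if i == start else 0.0 for i in range(k)]
    for _ in range(t):
        x = [sum(x[j] * P[j][i] for j in range(k)) for i in range(k)]
    return x

def d(t, P, pi):
    """Worst-case TV distance to pi over all starting states."""
    return max(tv_distance(dist_after(t, s, P), pi) for s in range(len(P)))

def mixing_time(eps, P, pi):
    """tau(eps) = min { t : d(t) <= eps }."""
    t = 0
    while d(t, P, pi) > eps:
        t += 1
    return t

P = [[0.9, 0.1], [0.5, 0.5]]
pi = [5 / 6, 1 / 6]
print(mixing_time(0.01, P, pi))  # 5
```

For this chain d(t) decays geometrically by the second eigenvalue 0.4 per step, so τ(0.01) works out to 5.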
56 + Quantitative results 56 [figure: d(t) as a decreasing curve starting at 1, crossing the levels 1/2e and ε at times τ and τ(ε)]
57 + Lemma 57 Consider two random walks started from state i and state j. Define ρ(t) as the worst total variation distance between their respective probability distributions at time t. Then: a. ρ(t) ≤ 2 d(t); b. d(t) is decreasing.
58 + Upper bound on d(t): proof 58 From part (a) and the definition τ = τ(1/2e) = min {t : d(t) ≤ 1/2e}, we get ρ(τ) ≤ 2 d(τ) ≤ e^{−1}. From part (b) we deduce that an upper bound on ρ at a particular time t_0 gives an upper bound for later times: ρ(t) ≤ (ρ(t_0))^n for n t_0 ≤ t < (n + 1) t_0, hence ρ(t) ≤ (ρ(t_0))^{t/t_0 − 1}. Substituting τ for t_0: ρ(t) ≤ (ρ(τ))^{t/τ − 1}, so d(t) ≤ exp(1 − t/τ) for t ≥ 0.
59 + Upper bound on τ(ε): proof 59 Algebraic calculations: d(t) ≤ exp(1 − t/τ) ⇒ log d(t) ≤ 1 − t/τ ⇒ t ≤ τ (1 + log(1/d(t))). Hence τ(ε) ≤ τ (1 + log(1/ε)), and also τ(ε) ≤ 2eτ (1 + log(1/ε)).
60 + Upper bounds 60 d(t) ≤ min(1, exp(1 − t/τ)) for t ≥ 0; τ(ε) ≤ τ (1 + log(1/ε)) for 0 < ε < 1; τ(ε) ≤ 2eτ (1 + log(1/ε)) for 0 < ε < 1.
61 + Entropy of initial distribution 61 Measure of randomness: ent(μ) = −Σ_{x ∈ V} μ(x) log μ(x), where x is the initial state and μ is the initial probability distribution.
62 + A few more lemmas 1. Let (X_t) be a random walk associated with μ, and let μ_t be the distribution of X_t; then ent(μ_t) ≤ t · ent(μ). 2. If ν is a distribution on S such that ‖ν − π‖ ≤ ε, then ent(ν) ≥ (1 − ε) log |S|.
63 + Lower bound on d(t): proof 63 From lemma 2, applied to ν = μ_t with ε = d(t): ent(μ_t) ≥ (1 − d(t)) log |S|, i.e. d(t) ≥ 1 − ent(μ_t) / log |S|. Combining with lemma 1: d(t) ≥ 1 − t · ent(μ) / log |S|.
64 + Lower bound on τ(ε): proof 64 From lemma 1, ent(μ_t) ≤ t · ent(μ), so t ≥ ent(μ_t) / ent(μ). From lemma 2, ent(μ_{τ(ε)}) ≥ (1 − ε) log |S|, hence τ(ε) ≥ (1 − ε) log |S| / ent(μ).
65 + Lower bounds 65 d(t) ≥ 1 − t · ent(μ) / log |S|; τ(ε) ≥ (1 − ε) log |S| / ent(μ).
66 + Propp Wilson Tobias Bertelsen 66
67 + An exact version of MCMC 67 Problems with MCMC: A. it has an accuracy error, which depends on the starting state; B. we must know the number of iterations. James Propp and David Wilson proposed Coupling from the past (1996), a.k.a. the Propp-Wilson algorithm. Idea: solve both problems by running the chain from infinitely far in the past.
68 + An exact version of MCMC 68 Theoretical: runs all configurations infinitely; literally takes infinite time; impossible. Coupling from the past: runs all configurations for finite time; might take thousands of years; infeasible. Sandwiching: runs few configurations for finite time; takes seconds; practicable.
69 + Theoretical exact sampling 69 Recall the convergence theorem: we approach the stationary distribution as the number of steps goes to infinity. Intuitive approach: to sample perfectly, start a chain and run it forever; start at t = 0, sample at t = ∞. Problem: we never get a sample. Alternative approach: to sample perfectly, take a chain that has already been running for an infinite amount of time; start at t = −∞, sample at t = 0.
70 + Theoretical independence of starting state 70 A sample from a Markov chain in MCMC depends solely on the starting state X_{−∞} and the sequence of random numbers U. We want to be independent of the starting state: for a given sequence of random numbers U_{−∞}, …, U_{−1} we want to ensure that the starting state X_{−∞} has no effect on X_0.
71 + Theoretical independence of starting state 71 Collisions: for a given U_{−∞}, …, U_{−1}, if two Markov chains are at the same state at some time t, they will continue on together.
72 + Coupling from the past 72 At some finite past time t = −N, all past chains (which have already run infinitely long) have coupled into one. We want to continue that coupled chain to t = 0. But we don't know which state they will be in at t = −N, so we run chains from all states starting at −N instead of −∞.
73 + Coupling from the past 1. Let U = U_{−1}, U_{−2}, U_{−3}, … be the sequence of independent uniformly random numbers. 2. For N_j ∈ {1, 2, 4, 8, …}: 1. extend U to length N_j, keeping the earlier values the same; 2. start one chain from each state at t = −N_j; 3. for t from −N_j to zero, simulate the chains using U_t; 4. if all chains have converged to the same state s_i at t = 0, return s_i; 5. else repeat the loop.
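The loop above can be sketched for a toy update function; the particular chain ("go to the first state with probability 1/2, otherwise move one step up") is an illustrative assumption, matching the chain used in the later "using the same U" slides:

```python
import random

def phi(i, u, k):
    """Toy update function: go to state 0 if u < 1/2, else one step up
    (the top state k-1 stays put)."""
    return 0 if u < 0.5 else min(i + 1, k - 1)

def cftp(k, seed=0):
    """Coupling from the past: start one chain in every state at t = -N,
    doubling N and reusing the same random numbers until all chains have
    coalesced at t = 0; the coalesced state is an exact sample."""
    rng = random.Random(seed)
    us = []                           # us[j] drives the step from t = -(j+1)
    n = 1
    while True:
        while len(us) < n:
            us.append(rng.random())   # extend U, keeping old values fixed
        states = list(range(k))
        for t in range(n, 0, -1):     # simulate from t = -n up to t = 0
            u = us[t - 1]
            states = [phi(i, u, k) for i in states]
        if len(set(states)) == 1:
            return states[0]          # all chains converged: exact sample
        n *= 2

print(cftp(k=5, seed=0))
```

Since every state maps to state 0 with probability 1/2, this chain's stationary distribution puts mass 1/2 on state 0, which the exact samples reproduce across independent seeds.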
74 + Coupling from the past 74
75 + Questions 75 Why do we double the lengths? Worst case < 4 N_opt steps, where N_opt is the minimal N at which we can achieve convergence; compare to trying N ∈ {1, 2, 3, 4, 5, …}, which takes O(N_opt²) steps. Why do we have to use the same random numbers? Different samples might take longer or shorter to converge; we must evaluate the same sample in each iteration. The sample should depend only on U, not on the different N.
76 + Using the same U 76 We have k states with the following update function: φ(S_i, U) = S_1 if U < 1/2, else S_{i+1}; φ(S_k, U) = S_1 if U < 1/2, else S_k. The stationary distribution is π(S_i) = 1/2^i for i < k, and π(S_k) = 2/2^k.
77 + Using the same U 77 The probability of S_1 only depends on the last random number: P(S_1) = P(U_{−1} < 1/2) = 1/2. Suppose we generate a new U for each run: U¹, U², U³, … Then over the runs n = 1, 2, 4, 8 the accumulated probability of returning S_1 drifts: 50 %, 75 %, …, so the algorithm would be biased.
78 + Using the same U 78 The probability of S_1 only depends on the last random number: P(S_1) = P(U_{−1} < 1/2) = 1/2. If we instead use the same U for each run, the accumulated probability of returning S_1 stays at 50 % for n = 1, 2, 4, 8, …, so the sample is unbiased.
79 + Problem 79 In each step we update up to k chains, so the total execution time is O(N_opt · k). BUT k = 2^{L²} in the Ising model, which is worse than the naïve approach O(k).
80 + Sandwiching Nirandika Wanigasekara 80
81 + Sandwiching 81 Propp-Wilson algorithm: with many vertices, running k chains takes too long; impractical for large k. Choose a relatively small set of chains to run. Can we still get the same results? Try sandwiching.
82 + Sandwiching 82 Idea: find two chains bounding all other chains. If we have two such boundary chains, check whether those two chains converge; then all other chains have also converged.
83 + Sandwiching 83 To come up with the boundary chains we need a way to order the states: s_1 ≼ s_2 ≼ s_3 ≼ … A chain in a higher state must not cross a chain in a lower state. This results in a Markov chain obeying certain monotonicity properties.
84 + Sandwiching 84 Let's consider a fixed set of k states, state space S = {1, …, k}, and a transition matrix with P_{1,1} = P_{1,2} = 1/2, P_{k,k} = P_{k,k−1} = 1/2, and for i = 2, …, k−1: P_{i,i−1} = P_{i,i+1} = 1/2. All other entries are 0.
85 + Sandwiching 85 What is this Markov chain doing? At each integer time it takes one step up or one step down the ladder, with probability 1/2 each; at the top (state k) or the bottom (state 1) it stays where it is with probability 1/2. This is the ladder walk on k vertices. Its stationary distribution is π_i = 1/k for i = 1, …, k.
86 + Sandwiching 86 Propp-Wilson for this Markov chain, with valid update function φ: if u < 1/2 then step down, if u ≥ 1/2 then step up; negative starting times N_1, N_2, … = 1, 2, 4, 8, …; number of states k = 5.
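For the ladder walk this update is monotone, so Propp-Wilson only needs the bottom and top chains; a minimal sketch (function names are illustrative):

```python
import random

def ladder_phi(i, u, k):
    """Monotone update for the ladder walk on states 1..k:
    step down if u < 1/2, step up otherwise, truncated at the ends."""
    return max(i - 1, 1) if u < 0.5 else min(i + 1, k)

def sandwich_cftp(k, seed=0):
    """Propp-Wilson with sandwiching: since ladder_phi preserves the order
    of states, tracking only the chains started at 1 (bottom) and k (top)
    suffices; when they meet, every chain in between has coalesced too."""
    rng = random.Random(seed)
    us = []
    n = 1
    while True:
        while len(us) < n:
            us.append(rng.random())
        lo, hi = 1, k
        for t in range(n, 0, -1):      # run from t = -n up to t = 0
            u = us[t - 1]
            lo, hi = ladder_phi(lo, u, k), ladder_phi(hi, u, k)
        if lo == hi:                   # the sandwich closed
            return lo                  # exact sample from pi_i = 1/k
        n *= 2

counts = [0] * 6
for s in range(1000):
    counts[sandwich_cftp(5, seed=s)] += 1
# each of the states 1..5 appears roughly 200 times out of 1000
```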
87 + Sandwiching 87
88 + Sandwiching 88 The update function preserves the ordering between states: for all U ∈ [0, 1] and all i, j ∈ {1, …, k} such that i ≤ j, we have φ(i, U) ≤ φ(j, U). Hence it is sufficient to run only 2 chains rather than k.
89 + Sandwiching 89 Are these conditions always met? No, not always. But there are frequent instances where they are met, and the technique is especially useful when k is large. The Ising model is a good example of this.
90 + Ising Model Malay Singh 90
91 + 2D Ising Model 91 A grid with sites numbered from 1 to L², where L is the grid size. Each site i can have spin x_i ∈ {−1, +1}. V is the set of sites; S = {−1, 1}^V defines all possible configurations (states). The magnetisation of a state s is m(s) = Σ_i s(x_i) / L². The energy of a state s is H(s) = −Σ_{⟨i,j⟩} s(x_i) s(x_j).
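Both quantities are direct to compute; a sketch for an L x L grid with free boundaries (the boundary convention is an assumption; the slides' lattice may use periodic boundaries instead):

```python
def magnetisation(s):
    """m(s) = sum of all spins / L^2."""
    L = len(s)
    return sum(sum(row) for row in s) / (L * L)

def energy(s):
    """H(s) = - sum over nearest-neighbour pairs of s_i * s_j
    (free boundaries: each site interacts only with grid neighbours)."""
    L = len(s)
    h = 0
    for i in range(L):
        for j in range(L):
            if i + 1 < L:
                h -= s[i][j] * s[i + 1][j]   # vertical bond
            if j + 1 < L:
                h -= s[i][j] * s[i][j + 1]   # horizontal bond
    return h

all_up = [[+1] * 3 for _ in range(3)]
print(magnetisation(all_up), energy(all_up))  # 1.0 -12
```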
92 + Ordering of states 92 For two states s, s′ we say s ≼ s′ if s(x) ≤ s′(x) for all x ∈ V. Minimum (m = −1): s_min(x) = −1 for all x ∈ V. Maximum (m = 1): s_max(x) = +1 for all x ∈ V. Hence s_min ≼ s ≼ s_max for all s.
93 + The update function 93 We use the sequence of random numbers U_n, U_{n−1}, …, U_0. When updating the state X_n, we choose a site x in {1, …, L²} uniformly and set X_{n+1}(x) = +1 if U_{n+1} < exp(2β (K₊(x, s) − K₋(x, s))) / (exp(2β (K₊(x, s) − K₋(x, s))) + 1), and −1 otherwise.
94 + Maintaining ordering 94 Ordering after an update (from X_n to X_{n+1}): we choose the same site x to update in both chains; the spin of x at X_{n+1} depends on the update function. We want to check that exp(2β (K₊(x, s) − K₋(x, s))) / (exp(2β (K₊(x, s) − K₋(x, s))) + 1) ≤ exp(2β (K₊(x, s′) − K₋(x, s′))) / (exp(2β (K₊(x, s′) − K₋(x, s′))) + 1), which is equivalent to checking K₊(x, s) − K₋(x, s) ≤ K₊(x, s′) − K₋(x, s′).
95 + Maintaining ordering 95 As s ≼ s′ we have K₊(x, s) ≤ K₊(x, s′) and K₋(x, s) ≥ K₋(x, s′). Subtracting the second inequality from the first gives K₊(x, s) − K₋(x, s) ≤ K₊(x, s′) − K₋(x, s′).
96 + Ising Model L = 4 96 [figures: T = 3.5 with N = 512; T = 4.8 with N = 128]
97 + Ising Model L = 8 97 [figure: T = 5.9 with N = 512]
98 + Ising Model 98 [figure: simulation with T = 5.3]
99 + Summary 99 Exact sampling: Markov chain Monte Carlo, but when do we converge? Propp-Wilson with sandwiching to the rescue.
100 + Questions? 100
101 + Additional 101 Example 6.2
102 + Additional 102 Update function
103 + Additional 103 Proof for the ordering between states: need to find
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationNumerical Analysis of 2-D Ising Model. Ishita Agarwal Masters in Physics (University of Bonn) 17 th March 2011
Numerical Analysis of 2-D Ising Model By Ishita Agarwal Masters in Physics (University of Bonn) 17 th March 2011 Contents Abstract Acknowledgment Introduction Computational techniques Numerical Analysis
More informationLecture 5: Random Walks and Markov Chain
Spectral Graph Theory and Applications WS 20/202 Lecture 5: Random Walks and Markov Chain Lecturer: Thomas Sauerwald & He Sun Introduction to Markov Chains Definition 5.. A sequence of random variables
More informationInference in Bayesian Networks
Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)
More informationComputer Vision Group Prof. Daniel Cremers. 11. Sampling Methods
Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationSimulated Annealing for Constrained Global Optimization
Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More informationMarkov Chain Monte Carlo. Simulated Annealing.
Aula 10. Simulated Annealing. 0 Markov Chain Monte Carlo. Simulated Annealing. Anatoli Iambartsev IME-USP Aula 10. Simulated Annealing. 1 [RC] Stochastic search. General iterative formula for optimizing
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationMATH 56A: STOCHASTIC PROCESSES CHAPTER 1
MATH 56A: STOCHASTIC PROCESSES CHAPTER. Finite Markov chains For the sake of completeness of these notes I decided to write a summary of the basic concepts of finite Markov chains. The topics in this chapter
More informationMarkov Chain Monte Carlo Lecture 6
Sequential parallel tempering With the development of science and technology, we more and more need to deal with high dimensional systems. For example, we need to align a group of protein or DNA sequences
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationBayesian networks: approximate inference
Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact
More informationThe Metropolis-Hastings Algorithm. June 8, 2012
The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings
More informationOnline Social Networks and Media. Link Analysis and Web Search
Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information
More informationMarkov Chain Monte Carlo
1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior
More informationMetropolis Monte Carlo simulation of the Ising Model
Metropolis Monte Carlo simulation of the Ising Model Krishna Shrinivas (CH10B026) Swaroop Ramaswamy (CH10B068) May 10, 2013 Modelling and Simulation of Particulate Processes (CH5012) Introduction The Ising
More informationMarkov Chain Monte Carlo Methods
Markov Chain Monte Carlo Methods p. /36 Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and Mobility Laboratory Markov Chain Monte Carlo Methods p. 2/36 Markov Chains
More informationOn random walks. i=1 U i, where x Z is the starting point of
On random walks Random walk in dimension. Let S n = x+ n i= U i, where x Z is the starting point of the random walk, and the U i s are IID with P(U i = +) = P(U n = ) = /2.. Let N be fixed (goal you want
More informationMarkov and Gibbs Random Fields
Markov and Gibbs Random Fields Bruno Galerne bruno.galerne@parisdescartes.fr MAP5, Université Paris Descartes Master MVA Cours Méthodes stochastiques pour l analyse d images Lundi 6 mars 2017 Outline The
More informationDiscrete solid-on-solid models
Discrete solid-on-solid models University of Alberta 2018 COSy, University of Manitoba - June 7 Discrete processes, stochastic PDEs, deterministic PDEs Table: Deterministic PDEs Heat-diffusion equation
More informationDefinition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.
Chapter 8 Finite Markov Chains A discrete system is characterized by a set V of states and transitions between the states. V is referred to as the state space. We think of the transitions as occurring
More informationBayesian GLMs and Metropolis-Hastings Algorithm
Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,
More informationQuantifying Uncertainty
Sai Ravela M. I. T Last Updated: Spring 2013 1 Markov Chain Monte Carlo Monte Carlo sampling made for large scale problems via Markov Chains Monte Carlo Sampling Rejection Sampling Importance Sampling
More informationLect4: Exact Sampling Techniques and MCMC Convergence Analysis
Lect4: Exact Sampling Techniques and MCMC Convergence Analysis. Exact sampling. Convergence analysis of MCMC. First-hit time analysis for MCMC--ways to analyze the proposals. Outline of the Module Definitions
More informationThe parallel replica method for Markov chains
The parallel replica method for Markov chains David Aristoff (joint work with T Lelièvre and G Simpson) Colorado State University March 2015 D Aristoff (Colorado State University) March 2015 1 / 29 Introduction
More information16 : Markov Chain Monte Carlo (MCMC)
10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions
More informationStochastic optimization Markov Chain Monte Carlo
Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing
More informationCh5. Markov Chain Monte Carlo
ST4231, Semester I, 2003-2004 Ch5. Markov Chain Monte Carlo In general, it is very difficult to simulate the value of a random vector X whose component random variables are dependent. In this chapter we
More informationSession 3A: Markov chain Monte Carlo (MCMC)
Session 3A: Markov chain Monte Carlo (MCMC) John Geweke Bayesian Econometrics and its Applications August 15, 2012 ohn Geweke Bayesian Econometrics and its Session Applications 3A: Markov () chain Monte
More informationA quick introduction to Markov chains and Markov chain Monte Carlo (revised version)
A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to
More informationLecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321
Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process
More informationAny live cell with less than 2 live neighbours dies. Any live cell with 2 or 3 live neighbours lives on to the next step.
2. Cellular automata, and the SIRS model In this Section we consider an important set of models used in computer simulations, which are called cellular automata (these are very similar to the so-called
More informationMCMC and Gibbs Sampling. Sargur Srihari
MCMC and Gibbs Sampling Sargur srihari@cedar.buffalo.edu 1 Topics 1. Markov Chain Monte Carlo 2. Markov Chains 3. Gibbs Sampling 4. Basic Metropolis Algorithm 5. Metropolis-Hastings Algorithm 6. Slice
More information16 : Approximate Inference: Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution
More informationPowerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature.
Markov Chains Markov chains: 2SAT: Powerful tool for sampling from complicated distributions rely only on local moves to explore state space. Many use Markov chains to model events that arise in nature.
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Bayes Nets: Sampling Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.
More informationSC7/SM6 Bayes Methods HT18 Lecturer: Geoff Nicholls Lecture 2: Monte Carlo Methods Notes and Problem sheets are available at http://www.stats.ox.ac.uk/~nicholls/bayesmethods/ and via the MSc weblearn pages.
More informationSTOCHASTIC PROCESSES Basic notions
J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationMarkov Chains and MCMC
Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,
More informationCS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions
CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions Instructor: Erik Sudderth Brown University Computer Science April 14, 215 Review: Discrete Markov Chains Some
More informationMonte Caro simulations
Monte Caro simulations Monte Carlo methods - based on random numbers Stanislav Ulam s terminology - his uncle frequented the Casino in Monte Carlo Random (pseudo random) number generator on the computer
More informationLIMITING PROBABILITY TRANSITION MATRIX OF A CONDENSED FIBONACCI TREE
International Journal of Applied Mathematics Volume 31 No. 18, 41-49 ISSN: 1311-178 (printed version); ISSN: 1314-86 (on-line version) doi: http://dx.doi.org/1.173/ijam.v31i.6 LIMITING PROBABILITY TRANSITION
More informationRobert Collins CSE586, PSU Intro to Sampling Methods
Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling
More informationMarkov Chains. Andreas Klappenecker by Andreas Klappenecker. All rights reserved. Texas A&M University
Markov Chains Andreas Klappenecker Texas A&M University 208 by Andreas Klappenecker. All rights reserved. / 58 Stochastic Processes A stochastic process X tx ptq: t P T u is a collection of random variables.
More informationRandom Walks A&T and F&S 3.1.2
Random Walks A&T 110-123 and F&S 3.1.2 As we explained last time, it is very difficult to sample directly a general probability distribution. - If we sample from another distribution, the overlap will
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll
More information