Cross Entropy. Owen Jones, Cardiff University. Cross Entropy. CE for rare events. CE for disc opt. TSP example. A new setting. Necessary.

1 Owen Jones, Cardiff University

2 The Cross-Entropy (CE) method started as an adaptive importance sampling scheme for estimating rare-event probabilities (Rubinstein, 1997), before being successfully applied to a variety of combinatorial optimisation problems (Rubinstein, 1999, 2001). It is a model-based stochastic search technique and requires a parameterised sampling distribution.

3 Let $X$ be a random variable on some space $\mathcal{X}$, and let $f(\cdot; v)$ be a family of pdfs on $\mathcal{X}$, indexed by $v$. Suppose that we are interested in estimating, for some given $S$, $\gamma$ and $u$,

$$\ell := P_u(S(X) \ge \gamma).$$

Using an importance sampling density $g$ we have the estimate

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma\}} \frac{f(X_i; u)}{g(X_i)},$$

where $X_1, \dots, X_N$ is an i.i.d. sample from $g$.

4 The ideal but unachievable choice for $g$ is

$$g^*(x) := \frac{f(x; u)\, I_{\{S(x) \ge \gamma\}}}{\ell}.$$

The idea behind the CE method is to choose $g$ from the family $f(\cdot; v)$ so that it is as close as possible to $g^*$, according to the Kullback-Leibler distance. That is (after a little work), choose $v$ to maximise

$$D(v) = E_u\left[ I_{\{S(X) \ge \gamma\}} \log f(X; v) \right].$$

In information theory the Kullback-Leibler distance is known as the cross-entropy.
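The "little work" can be spelled out as follows. This is a short derivation sketch using only the definitions on this slide; the integral is to be read as a sum when $\mathcal{X}$ is discrete.

```latex
\begin{align*}
\mathrm{KL}\bigl(g^*, f(\cdot; v)\bigr)
  &= \int g^*(x) \log g^*(x)\, dx \;-\; \int g^*(x) \log f(x; v)\, dx .
\end{align*}
% The first term does not depend on v, so minimising the distance over v
% is the same as maximising the second integral:
\begin{align*}
\int g^*(x) \log f(x; v)\, dx
  &= \frac{1}{\ell} \int I_{\{S(x) \ge \gamma\}}\, f(x; u) \log f(x; v)\, dx \\
  &= \frac{1}{\ell}\, E_u\!\left[ I_{\{S(X) \ge \gamma\}} \log f(X; v) \right]
   = \frac{D(v)}{\ell} .
\end{align*}
```

Since $\ell > 0$ does not depend on $v$, maximising $D(v)$ gives the same $v$ without needing to know $\ell$.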

5 We can estimate $D(v)$ using importance sampling! For any $w$ we can form

$$\hat{D}(v) = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma\}} \frac{f(X_i; u)}{f(X_i; w)} \log f(X_i; v),$$

where $X_1, \dots, X_N$ is an i.i.d. sample from $f(\cdot; w)$. In many cases of interest we can maximise $\hat{D}$ analytically, to get an estimate of the optimal $v$. We would like to choose $w$ to reduce the variance of $\hat{D}$, which is best achieved if $P_w(S(X) \ge \gamma)$ is large. We achieve this using an iterative approach...

6 The algorithm

We fix a parameter $\rho \in (0, 1)$.

1. Put $v_0 = u$ and $t = 1$.
2. Sample $X_1, \dots, X_N$ from $f(\cdot; v_{t-1})$. Let $\gamma_t$ be the $1 - \rho$ quantile of the ordered sample $S(X_1), \dots, S(X_N)$. If $\gamma_t > \gamma$ put $\gamma_t = \gamma$.
3. Maximise the modified estimate of $D$ using $\gamma_t$ (instead of $\gamma$):
   $$v_t = \arg\max_v \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma_t\}} \frac{f(X_i; u)}{f(X_i; v_{t-1})} \log f(X_i; v).$$
4. If $\gamma_t < \gamma$ put $t \leftarrow t + 1$ and go back to 2.

Use the final $v_t$ to generate an importance sample with which to estimate $P_u(S(X) \ge \gamma)$.
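As an illustration of these steps, here is a minimal R sketch for a toy problem that is not in the slides: $X$ exponential with mean $u$, $S(x) = x$, and the family $f(\cdot; v)$ of exponential densities with mean $v$. For this family the maximiser in step 3 has the closed form $v_t = \sum_i I_i W_i X_i / \sum_i I_i W_i$, with $W_i = f(X_i; u)/f(X_i; v_{t-1})$. The function name `ce_rare_exp`, the value `rho = 0.1` and the sample sizes are assumptions made for the example.

```r
# Minimal CE sketch for P_u(X >= gamma) with X ~ Exponential(mean u).
# Toy example (assumption): S(x) = x and f(.; v) is the exponential pdf with mean v.
ce_rare_exp <- function(u, gamma, N = 1000, rho = 0.1, N_final = 1e5) {
  dens <- function(x, v) dexp(x, rate = 1 / v)     # f(x; v)
  v <- u                                           # step 1: v_0 = u
  repeat {
    x <- rexp(N, rate = 1 / v)                     # step 2: sample from f(.; v_{t-1})
    gamma_t <- min(quantile(x, 1 - rho, names = FALSE), gamma)
    elite <- x >= gamma_t
    w <- dens(x, u) / dens(x, v)                   # likelihood ratios f(X; u) / f(X; v_{t-1})
    v <- sum(w[elite] * x[elite]) / sum(w[elite])  # step 3: closed-form maximiser
    if (gamma_t >= gamma) break                    # step 4: stop once the level reaches gamma
  }
  x <- rexp(N_final, rate = 1 / v)                 # final importance sample from f(.; v_t)
  l_hat <- mean((x >= gamma) * dens(x, u) / dens(x, v))
  list(v = v, l_hat = l_hat)
}

set.seed(1)
res <- ce_rare_exp(u = 1, gamma = 20)
c(estimate = res$l_hat, exact = exp(-20))          # exact value is known for this toy case
```

The exact probability $e^{-\gamma/u}$ is available here only because the example is so simple, which makes it a convenient check on the estimate.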

7 CE for discrete optimisation

Suppose now that we wish to find the maximum of some function $S$ over a (discrete) set $\mathcal{X}$. Let $\gamma^*$ be the maximum, obtained at $x^*$ say. To apply the CE method we need to convert this optimisation problem into an estimation problem. Let $f(\cdot; v)$ be a family of pmfs on $\mathcal{X}$; then, for (almost) any $u$, finding $\gamma^*$ is equivalent to estimating

$$P_u(S(X) \ge \gamma^*) = \sum_x I_{\{S(x) \ge \gamma^*\}} f(x; u).$$

Applying the CE method to this problem we get...

8 CE for discrete optimisation

Not quite the algorithm

We fix a parameter $\rho \in (0, 1)$.

1. Put $v_0 = u$ and $t = 1$.
2. Sample $X_1, \dots, X_N$ from $f(\cdot; v_{t-1})$. Let $\gamma_t$ be the $1 - \rho$ quantile of the ordered sample $S(X_1), \dots, S(X_N)$.
3. Maximise the modified estimate of $D$ using $\gamma_t$:
   $$v_t = \arg\max_v \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma_t\}} \frac{f(X_i; u)}{f(X_i; v_{t-1})} \log f(X_i; v).$$
4. If $\gamma_t$ hasn't changed for a while then stop, otherwise put $t \leftarrow t + 1$ and go back to 2.

9 CE for discrete optimisation

We can improve/simplify the algorithm in two ways. Firstly, given that $u$ is arbitrary, we replace $u$ by $v_{t-1}$ in step 3, so that the term $f(X_i; u)/f(X_i; v_{t-1})$ drops out. Secondly, we use smoothed updating, to avoid assigning zero probability to points in the sample space.

10 CE for discrete optimisation

The algorithm

We fix parameters $\rho, \alpha \in (0, 1)$.

1. Put $v_0 = u$ and $t = 1$.
2. Sample $X_1, \dots, X_N$ from $f(\cdot; v_{t-1})$. Let $\gamma_t$ be the $1 - \rho$ quantile of the ordered sample $S(X_1), \dots, S(X_N)$.
3. Maximise the modified estimate of $D$ using $\gamma_t$:
   $$w_t = \arg\max_w \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma_t\}} \log f(X_i; w).$$
4. Put $v_t = \alpha w_t + (1 - \alpha) v_{t-1}$.
5. If $\gamma_t$ hasn't changed for a while then stop, otherwise put $t \leftarrow t + 1$ and go back to 2.

11 Choosing parameters

There are a number of parameters to choose:

- The family of distributions $f(\cdot; v)$ often serves to provide a continuous interpolation of the discrete space $\mathcal{X}$.
- The sample size $N$ should be of the same order as the dimension of $v$.
- $\rho = 0.01$ works well ;)
- Small $\alpha$ will slow convergence, but increase the chance of finding the optimum (see later).

12 Consider a complete graph on $n$ nodes with edge weights $d_{ij}$ (distances). The Travelling Salesman Problem (TSP) is to find the permutation $\sigma$ which minimises

$$S(\sigma) := d_{\sigma(n),\sigma(1)} + \sum_{i=1}^{n-1} d_{\sigma(i),\sigma(i+1)}.$$

We take as our state space $\mathcal{X}$ the set of permutations of $\{1, \dots, n\}$, and we ascribe a probability to $x \in \mathcal{X}$ by treating it as the sample path of a Markov chain. That is, our family of probability measures is indexed by $n \times n$ transition matrices $P$, and

$$f(x; P) \propto P(x(n), x(1)) \prod_{i=1}^{n-1} P(x(i), x(i+1)).$$

13 Given a sample $X_1, \dots, X_N$ from $\mathcal{X}$, step 3 of the CE algorithm requires us to find the $P$ that maximises

$$\sum_{i=1}^{N} I_{\{S(X_i) \le \gamma_t\}} \log f(X_i; P).$$

Note that $P$ is constrained to the space of transition matrices. Note also that we now have $S(X_i) \le \gamma_t$, since we are minimising. It is easily checked that the solution is given by

$$\hat{P}(i, j) = \frac{\#\{X_k : S(X_k) \le \gamma_t \text{ and } X_k \text{ includes a transition from } i \text{ to } j\}}{\#\{X_k : S(X_k) \le \gamma_t\}}.$$

14
    n <- 20        # number of cities
    N <- 5*n^2     # sample size, O(num params)
    rho <- 0.01    # for culling; value lost in extraction, 0.01 assumed from the parameter-choice slide
    al <- 0.5      # smoothing param, in (0, 1)
    cities <- gen_cities(n)
    samples <- matrix(nrow=n, ncol=N)   # one tour per column
    lengths <- rep(NA, N)
    P <- matrix(1/(n-1), n, n); diag(P) <- 0
    n_reps <- 20
    for (rep in 1:n_reps) {
      # generate sample
      for (i in 1:N) {
        samples[,i] <- gen_tour(P)
        lengths[i] <- len_tour(cities$d, samples[,i])
      }
      # identify top performers (shortest tours)
      gamma <- sort(lengths)[ceiling(rho*N)]
      idx <- which(lengths <= gamma)
      # update P with smoothing
      sP <- sampleP(samples[,idx])
      P <- (1 - al)*P + al*sP
    }
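The listing calls four helpers that are not shown in the slides: `gen_cities`, `gen_tour`, `len_tour` and the P-update helper (written `sampleP` here). The sketch below gives one hypothetical way to write them, so that the loop can be run; these are guesses at what the helpers do, not the author's code. In particular, `gen_tour` uses the common sequential heuristic (follow the chain, skipping cities already visited), which is an assumption and not an exact sampler from $f(\cdot; P)$.

```r
# Hypothetical helpers; the names follow the calls in the listing, the bodies are assumptions.

# n random cities in the unit square, with their pairwise distance matrix d
gen_cities <- function(n) {
  xy <- matrix(runif(2 * n), ncol = 2)
  list(xy = xy, d = as.matrix(dist(xy)))
}

# Draw a tour by following the chain P, never revisiting a city
gen_tour <- function(P) {
  n <- nrow(P)
  tour <- integer(n)
  tour[1] <- 1L                       # tours are cyclic, so start at city 1
  visited <- rep(FALSE, n)
  visited[1] <- TRUE
  for (k in 2:n) {
    p <- P[tour[k - 1], ]
    p[visited] <- 0                   # forbid cities already on the tour
    if (sum(p) == 0) p <- as.numeric(!visited)  # fall back to uniform on the rest
    tour[k] <- sample.int(n, 1, prob = p)
    visited[tour[k]] <- TRUE
  }
  tour
}

# Length of the closed tour, including the edge back to the start
len_tour <- function(d, tour) {
  nxt <- c(tour[-1], tour[1])         # successor of each city
  sum(d[cbind(tour, nxt)])
}

# Proportion of tours (columns) that contain each transition i -> j
sampleP <- function(tours) {
  n <- nrow(tours)
  counts <- matrix(0, n, n)
  for (k in seq_len(ncol(tours))) {
    tour <- tours[, k]
    idx <- cbind(tour, c(tour[-1], tour[1]))
    counts[idx] <- counts[idx] + 1
  }
  counts / ncol(tours)
}

# Small usage check on 6 cities
set.seed(1)
cities <- gen_cities(6)
P0 <- matrix(1 / 5, 6, 6); diag(P0) <- 0
tours <- replicate(10, gen_tour(P0))  # 6 x 10 matrix, one tour per column
sP <- sampleP(tours)
```

Each tour leaves every city exactly once, so the rows of `sP` sum to 1 and the smoothed update keeps `P` a transition matrix. `sampleP` expects a matrix, so it needs at least two elite tours unless the caller indexes with `drop = FALSE`.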

15 Resetting the setting

In order to say something about the convergence of the CE method, we consider the following setting:

- $\mathcal{X} = \{0, 1\}^n$.
- $S$ has a unique maximum on $\mathcal{X}$.
- For $p = (p_1, \dots, p_n) \in (0, 1)^n$ and $x = (x_1, \dots, x_n) \in \mathcal{X}$, we put
  $$f(x; p) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}.$$
  That is, for $X \sim f(\cdot; p)$, the components $X_i$ are independent Bernoulli r.v.s with parameters $p_i$.
- The smoothing parameter $\alpha_t$ is allowed to vary.

With this choice of $f(\cdot; p)$ the CE algorithm becomes...

16 CE algorithm

For given $p_0$, $N$, $\rho$, and $\{\alpha_t\}$:

1. Put $t = 1$.
2. Sample $X_1, \dots, X_N$ from $f(\cdot; p_{t-1})$. Let $\gamma_t$ be the $1 - \rho$ quantile of the ordered sample $S(X_1), \dots, S(X_N)$.
3. For $w_t = (w_{t,1}, \dots, w_{t,n})$ put
   $$w_{t,i} = \frac{\#\{X_k : S(X_k) \ge \gamma_t \text{ and } X_{k,i} = 1\}}{\#\{X_k : S(X_k) \ge \gamma_t\}}.$$
4. Put $p_t = \alpha_t w_t + (1 - \alpha_t) p_{t-1}$.
5. If $\gamma_t$ hasn't changed for a while then stop, otherwise put $t \leftarrow t + 1$ and go back to 2.
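The steps above can be sketched in R as follows. This is an illustrative sketch under stated assumptions: the function name `ce_bernoulli` is hypothetical, the smoothing parameter is held constant, $p_0 = (0.5, \dots, 0.5)$, and a fixed number of iterations replaces the "hasn't changed for a while" stopping rule. The test objective (minus the Hamming distance to a fixed target) is also an assumption, chosen because it has a unique maximum.

```r
# Sketch of the Bernoulli CE algorithm with constant smoothing alpha.
# Assumptions: p_0 = 0.5 in every component, fixed iteration count instead of a stopping rule.
ce_bernoulli <- function(S, n, N = 10 * n, rho = 0.1, alpha = 0.5, n_iter = 50) {
  p <- rep(0.5, n)                                   # p_0
  best <- list(x = NULL, value = -Inf)               # best point sampled so far
  for (t in seq_len(n_iter)) {
    # step 2: N samples, one per row; column i is Bernoulli(p[i])
    X <- matrix(rbinom(N * n, 1, rep(p, each = N)), nrow = N)
    s <- apply(X, 1, S)
    gamma_t <- sort(s)[ceiling((1 - rho) * N)]       # 1 - rho sample quantile
    elite <- s >= gamma_t
    w <- colMeans(X[elite, , drop = FALSE])          # step 3: elite frequency of ones
    p <- alpha * w + (1 - alpha) * p                 # step 4: smoothed update
    if (max(s) > best$value) {
      best <- list(x = X[which.max(s), ], value = max(s))
    }
  }
  list(p = p, best = best)
}

# Usage: recover a hidden 10-bit target
set.seed(42)
target <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
S <- function(x) -sum(x != target)                   # unique maximum 0 at x = target
res <- ce_bernoulli(S, n = 10, N = 200)
res$best$value
round(res$p, 3)
```

Because `alpha < 1`, every component of `p` stays strictly inside $(0, 1)$ for all finite $t$, which is the point of the smoothed update.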

17 Necessary condition for optimality

The CE algorithm generates the optimum solution with probability 1 only if

$$\sum_{t=1}^{\infty} \prod_{m=1}^{t} (1 - \alpha_m) = \infty.$$

Proof. We can show that the probability the first component is never correct is at least

$$\prod_{t=1}^{\infty} \left( 1 - \phi_{1,1} \prod_{m=1}^{t-1} (1 - \alpha_m) \right)^{N},$$

for some $\phi_{1,1} \in (0, 1)$. This term is zero only if the stated condition holds.

18 Sufficient condition for optimality

The CE algorithm generates the optimum solution with probability 1 if

$$\sum_{t=1}^{\infty} \prod_{m=1}^{t} (1 - \alpha_m)^n = \infty.$$

Proof. We can show that the probability that we never generate the optimum solution is bounded above by

$$c \prod_{t=2}^{\infty} \left( 1 - \phi_1 \prod_{m=1}^{t-1} (1 - \alpha_m)^n \right)^{N},$$

for some $c$ and some $\phi_1 \in (0, 1)$. This bound is zero if the stated condition holds.

Remark. The condition holds if $\sum_{t=1}^{\infty} \alpha_t < \infty$.

19 Note on proofs

Both ultimately rely on upper and lower bounds for $p_{t,i}$:

$$p_{0,i} \prod_{m=1}^{t} (1 - \alpha_m) \le p_{t,i} \le 1 - (1 - p_{0,i}) \prod_{m=1}^{t} (1 - \alpha_m).$$

Clearly if $\sum_t \alpha_t < \infty$ then $p_{t,i}$ is bounded away from 0 and 1. That is, a necessary condition for $p_t$ to converge to a unit mass at some $x$ is that $\sum_t \alpha_t = \infty$.
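These bounds follow directly from the smoothed update $p_{t,i} = \alpha_t w_{t,i} + (1 - \alpha_t) p_{t-1,i}$ together with $0 \le w_{t,i} \le 1$. A short sketch of the induction step, using only the notation of the slides:

```latex
\begin{align*}
% Lower bound: take the worst case w_{t,i} = 0.
p_{t,i} &\ge (1 - \alpha_t)\, p_{t-1,i}
  \;\Longrightarrow\;
  p_{t,i} \ge p_{0,i} \prod_{m=1}^{t} (1 - \alpha_m), \\[4pt]
% Upper bound: take the worst case w_{t,i} = 1 and work with 1 - p.
1 - p_{t,i} &\ge (1 - \alpha_t)\,(1 - p_{t-1,i})
  \;\Longrightarrow\;
  p_{t,i} \le 1 - (1 - p_{0,i}) \prod_{m=1}^{t} (1 - \alpha_m).
\end{align*}
% If \sum_t \alpha_t < \infty (with every \alpha_t < 1), the infinite product
% \prod_m (1 - \alpha_m) is strictly positive, so both bounds keep p_{t,i}
% away from 0 and 1 for all t.
```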

20 CE with constant α

If $\alpha_t$ is constant, then $p_t$ converges.

Proof. In much the same way that you find upper and lower bounds on $p_{t,i}$, you can show that

$$P(w_{t,i} = 1 \ \forall t > k \mid \mathcal{F}_k) \ge g(p_{k,i}), \qquad P(w_{t,i} = 0 \ \forall t > k \mid \mathcal{F}_k) \ge g(1 - p_{k,i}),$$

where

$$g(u) = \prod_{t=0}^{\infty} \left( 1 - (1 - u)(1 - \alpha)^t \right)^{N}.$$

Moreover, we can find $0 < a < b < 1$ such that every time $p_{t,i}$ changes direction (that is, every time its increment changes sign) it must lie in $(a, b)$. Thus, every time $p_{t,i}$ changes direction, there is a non-zero probability that either $w_{t,i} = 1$ from then on, or $w_{t,i} = 0$ from then on. This results in $p_{t,i}$ either increasing or decreasing monotonically (to some limit).

21 CE with constant α

Moreover, $p_t$ converges to a unit mass at some $x$.

Proof. If $p_{t,i}$ converges to some $p_i \in (0, 1)$ then so must $w_{t,i}$. But we know that $w_{t,i}$ is eventually either 0 or 1, a contradiction.

22 CE with constant α

The probability that the optimal solution is generated increases to 1 as $\alpha \to 0$.

Proof. Previously we obtained an upper bound on the probability that the optimal solution is never generated. We can show that for constant $\alpha$, this bound decreases to 0 as $\alpha \to 0$.

23 We applied the CE algorithm to a max-cut problem with 8 vertices, using different values of α. For each α we applied the algorithm 100 times to get a Monte-Carlo estimate of the probability that the optimal solution is generated. In the following diagram the x-axis is the number of iterations and the y-axis the proportion of runs in which the optimal solution had been generated.

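24 [Figure: proportion of runs in which the optimal solution had been generated (y-axis) against the number of iterations (x-axis), one curve per smoothing choice; the legible labels are α_t = 1/nt, α = 0.1, α = 0.5 and α = 0.8.]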
