Optimization

Background: given a function $f(x)$ defined on $X$, find $x^*$ such that $f(x^*) \ge f(x)$ for all $x \in X$. The value $x^*$ is called a maximizer of $f$ and is written $\mathrm{argmax}_X f$. In general, $\mathrm{argmax}_X f$ may not be unique. It will be unique if $f$ is strictly concave. A function $g$ is strictly concave if
\[
g(\lambda x + (1-\lambda)y) > \lambda g(x) + (1-\lambda)g(y), \qquad 0 < \lambda < 1;\ x, y \in X,\ x \ne y.
\]
A twice continuously differentiable function is strictly concave if its Hessian matrix $H_{ij}(x) = \partial^2 g(x)/\partial x_i \partial x_j$ is negative definite for all $x$ (the converse need not hold). If $X$ is discrete, the problem is known as combinatorial optimization.

Newton's Method

Suppose $f$ has two continuous derivatives. Then all interior local maxima $x$ of $f$ satisfy $f'(x) = 0$. More generally, write the vector of partial derivatives of $f$ with respect to $x$ (the gradient of $f$) as
\[
\nabla f(x) = (\partial f/\partial x_1, \ldots, \partial f/\partial x_m)'.
\]
An interior local maximum of $f$ satisfies $\nabla f(x) = 0$. The Hessian of $f$, denoted $H_f$, is an $m \times m$ matrix for each value of $x$, defined by $(H_f)_{ij} = \partial^2 f/\partial x_i \partial x_j$. For twice continuously differentiable functions, $H_f$ is symmetric at each $x$. At an interior local maximizer $x^*$, $H_f(x^*)$ is negative semidefinite; conversely, if $\nabla f(x^*) = 0$ and $H_f(x^*)$ is negative definite, then $x^*$ is a strict local maximizer.

Suppose $f$ has a continuous Hessian at $x_0$. Then we can approximate $f$ quadratically in a neighborhood of $x_0$ using
\[
f(x) \approx f(x_0) + \nabla f(x_0)'(x - x_0) + \tfrac{1}{2}(x - x_0)' H_f(x_0)(x - x_0).
\]
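As a quick numerical check of this quadratic approximation (a sketch, not part of the notes; the test function is an arbitrary concave choice), the error of the quadratic model should shrink like $O(h^3)$ as $x \to x_0$:

```python
import numpy as np

# Hypothetical smooth concave test function with its gradient and Hessian.
def f(x):
    return -x[0]**2 - 2*x[1]**2 + x[0]*x[1] - 0.1*x[0]**4

def grad(x):
    return np.array([-2*x[0] + x[1] - 0.4*x[0]**3, -4*x[1] + x[0]])

def hess(x):
    return np.array([[-2.0 - 1.2*x[0]**2, 1.0], [1.0, -4.0]])

x0 = np.array([0.5, -0.3])
for h in [0.5, 0.1, 0.01]:
    x = x0 + h * np.array([1.0, 1.0])
    quad = f(x0) + grad(x0) @ (x - x0) + 0.5 * (x - x0) @ hess(x0) @ (x - x0)
    # Error of the quadratic Taylor model; it shrinks roughly like h**3.
    print(h, f(x) - quad)
```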

Differentiating this quadratic model leads to the following approximation to the gradient:
\[
\nabla f(x) \approx \nabla f(x_0) + H_f(x_0)(x - x_0).
\]
We can solve this expression for the stationary point $\nabla f(x) = 0$, giving
\[
x = x_0 - H_f(x_0)^{-1} \nabla f(x_0).
\]
This is the Newton-Raphson method for multidimensional optimization. We define a sequence of iterates starting at an arbitrary value $x_0$, and update using the rule
\[
x_{i+1} = x_i - H_f(x_i)^{-1} \nabla f(x_i).
\]
The Newton-Raphson algorithm converges at a quadratic rate to a local maximizer $x^*$ whenever $f$ has two continuous derivatives, $H_f(x^*)$ is nonsingular, and the starting point is sufficiently close to $x^*$; for smooth, strongly concave $f$, combining the Newton step with a line search (e.g., step halving) yields global convergence.

Maximum Likelihood

Suppose we observe independent realizations $Y_1, \ldots, Y_n$, where the density of $Y_i$ is $f_i(\cdot\,;\theta)$ and $\theta \in \mathbb{R}^d$ is an unknown parameter. Then the joint density is given by
\[
p(Y_1, \ldots, Y_n) = \prod_{i=1}^n f_i(Y_i; \theta),
\]
and the log-likelihood is given by
\[
L(\theta \mid Y_1, \ldots, Y_n) = \sum_{i=1}^n \log f_i(Y_i; \theta).
\]
The maximum likelihood estimator (MLE) of $\theta$ is defined to be
\[
\hat\theta = \mathrm{argmax}_\theta\, L(\theta \mid Y_1, \ldots, Y_n).
\]
In this setting, the gradient $\nabla L(\theta)$ is known as the score function, and is often written $s(\theta)$. It has the form
\[
s(\theta) = \sum_i \nabla f_i(Y_i; \theta)/f_i(Y_i; \theta) = \sum_i s_i(\theta).
\]
The score function is a random vector with expected value zero:
\[
E_\theta\, s(\theta) = \sum_i \int \frac{\nabla f_i(Y_i; \theta)}{f_i(Y_i; \theta)}\, f_i(Y_i; \theta)\, dY_i
= \sum_i \int \nabla f_i(Y_i; \theta)\, dY_i
= \sum_i \nabla \int f_i(Y_i; \theta)\, dY_i = 0.
\]
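As a minimal sketch (not from the notes) of Newton-Raphson applied to a log-likelihood, consider i.i.d. Poisson data parameterized by $\theta = \log\lambda$, so $L(\theta) = \theta\sum_i Y_i - ne^\theta + \text{const}$, $s(\theta) = \sum_i Y_i - ne^\theta$, and $H_L(\theta) = -ne^\theta$. The iteration should converge to $\hat\theta = \log \bar Y$; since $H_L$ does not depend on the data here, this is also the Fisher scoring iteration defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=200)   # simulated data; lam=3.0 is arbitrary
n, s_y = len(y), y.sum()

def score(theta):
    return s_y - n * np.exp(theta)

def hessian(theta):
    return -n * np.exp(theta)

theta = 0.0                          # arbitrary starting value
for it in range(25):
    step = -score(theta) / hessian(theta)   # Newton-Raphson update
    theta += step
    if abs(step) < 1e-10:
        break

print(theta, np.log(y.mean()))       # the two values should agree
```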

The Hessian of the log-likelihood function has the form
\[
H_L(\theta) = \sum_{i=1}^n \frac{f_i(Y_i; \theta)\, H_{f_i}(\theta) - \nabla f_i(Y_i; \theta)\, \nabla f_i(Y_i; \theta)'}{f_i^2(Y_i; \theta)}.
\]
For the first term we can write
\[
E_\theta\, \frac{f_i(Y_i; \theta) H_{f_i}(\theta)}{f_i^2(Y_i; \theta)} = \int H_{f_i}(\theta)\, dY_i = \nabla^2 \int f_i(Y_i; \theta)\, dY_i = 0.
\]
Thus,
\[
E_\theta\, H_L(\theta) = -\sum_{i=1}^n \int s_i(\theta) s_i(\theta)'\, f_i(Y_i; \theta)\, dY_i = -\sum_i \mathrm{cov}_\theta(s_i(\theta)) = -\mathrm{cov}_\theta(s(\theta)).
\]
The quantity $\mathrm{cov}_\theta(s(\theta)) = -E_\theta H_L(\theta)$ is the Fisher information, denoted $I(\theta)$. The Fisher information matrix is positive semidefinite, and $-I(\theta)$ can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update
\[
\theta_{i+1} = \theta_i + I(\theta_i)^{-1} s(\theta_i).
\]
This is the Fisher scoring algorithm. It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute $I$ than $H_L$. The Fisher information matrix is also the inverse of the asymptotic variance of the MLE $\hat\theta$: in large samples, $\mathrm{var}(\hat\theta) \approx I(\theta)^{-1}$.

Example: Logistic regression. Suppose we observe independent pairs $(Y_i, X_i)$, $i = 1, \ldots, n$ with $Y_i \in \{0, 1\}$, and the probability law is given by
\[
P(Y_i = 1 \mid X_i) = \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)}, \qquad
P(Y_i = 0 \mid X_i) = \frac{1}{1 + \exp(\beta'X_i)}.
\]
Our goal is to compute the MLE of $\beta$. The joint likelihood is
\[
P(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \prod_{i: Y_i = 1} \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)} \prod_{i: Y_i = 0} \frac{1}{1 + \exp(\beta'X_i)}.
\]
The log-likelihood function is
\[
L(\beta) = \beta' \sum_{i: Y_i = 1} X_i - \sum_i \log\big(1 + \exp(\beta'X_i)\big).
\]
The score function is
\[
s(\beta) = \sum_{i: Y_i = 1} X_i - \sum_i \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)}\, X_i.
\]
The Hessian of $L$ is
\[
H_L(\beta) = -\sum_i \frac{\exp(\beta'X_i)}{\big(1 + \exp(\beta'X_i)\big)^2}\, X_i X_i'.
\]
The score function is a linear combination of the $X_i$, and the information matrix is a linear combination of the $X_iX_i'$. This is typical in exponential family regression models. The coefficients of $X_iX_i'$ in the expression for $H_L(\beta)$ are negative. Thus $H_L$ is negative definite as long as the $X_i$ span $\mathbb{R}^m$, and it follows that $L$ is globally concave, so there is a unique local maximizer, which is also the global maximizer. The $Y_i$ do not appear in $H_L(\beta)$, so $H_L(\beta) = -I(\beta)$, and Newton-Raphson and Fisher scoring coincide for this model.
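A minimal sketch of this iteration on simulated data (the data-generating values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])  # includes intercept
beta_true = np.array([-0.5, 1.0, 2.0])                           # arbitrary truth
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, p)

beta = np.zeros(m)
for it in range(50):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))     # P(Y=1 | X) under current beta
    score = X.T @ (Y - mu)                   # s(beta)
    W = mu * (1.0 - mu)                      # exp(beta'X)/(1 + exp(beta'X))^2
    H = -(X * W[:, None]).T @ X              # H_L(beta)
    step = np.linalg.solve(H, -score)        # Newton step: -H^{-1} s
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

print(beta)   # should be close to beta_true for large n
```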

Example: Heteroscedastic regression. Suppose we are fitting a linear regression model in which $E(Y \mid X) = \alpha + \beta X$ and $\mathrm{var}(Y \mid X) = cX^2$. We wish to estimate $\alpha$, $\beta$, and $c$ using maximum likelihood. Under a Gaussian model, the log-likelihood is, up to an additive constant,
\[
L(\alpha, \beta, c) = -\frac{1}{2}\sum_i \frac{(Y_i - \alpha - \beta X_i)^2}{cX_i^2} - \frac{n}{2}\log(c) - \frac{1}{2}\sum_i \log X_i^2.
\]
Writing $r_i = Y_i - \alpha - \beta X_i$, the gradient is
\[
\nabla L(\alpha, \beta, c) = \sum_i \begin{pmatrix}
r_i/(cX_i^2) \\
r_i/(cX_i) \\
r_i^2/(2c^2X_i^2) - 1/(2c)
\end{pmatrix}.
\]
The Hessian matrix is
\[
H_L = -\sum_i \begin{pmatrix}
1/(cX_i^2) & 1/(cX_i) & r_i/(c^2X_i^2) \\
1/(cX_i) & 1/c & r_i/(c^2X_i) \\
r_i/(c^2X_i^2) & r_i/(c^2X_i) & r_i^2/(c^3X_i^2) - 1/(2c^2)
\end{pmatrix}.
\]
Since $E\,r_i = 0$ and $E\,r_i^2 = cX_i^2$, the expected Hessian is
\[
E\, H_L = -\sum_i \begin{pmatrix}
1/(cX_i^2) & 1/(cX_i) & 0 \\
1/(cX_i) & 1/c & 0 \\
0 & 0 & 1/(2c^2)
\end{pmatrix}.
\]

Methods using only one derivative

In many cases it is practical to compute the first derivative of $L(\theta)$, but not the Hessian or Fisher information matrix. In this case, we know that $\nabla L(\theta)$ gives the steepest uphill direction from a given position $\theta$. One strategy is to always move in the direction of $\nabla L(\theta)$: if we are currently at a point $\theta_i$, the next iterate is
\[
\theta_{i+1} = \theta_i + \lambda \nabla L(\theta_i).
\]
We do not know in advance how far to move in the direction $\nabla L(\theta)$ (i.e., the value of $\lambda$), so we can use a one-dimensional line search (e.g., bisection on the directional derivative) to find the best value. The resulting algorithm is known as the steepest ascent algorithm (a code sketch follows below):

1. Start at $\theta_0$, set $i = 0$.
2. Set $G_i = \nabla L(\theta_i)$.
3. Use a one-dimensional line search to obtain $\lambda_i = \mathrm{argmax}_\lambda\, L(\theta_i + \lambda G_i)$.
4. Set $\theta_{i+1} = \theta_i + \lambda_i G_i$, increment $i$, and return to step 2.

There are many variants of this algorithm. The main issue is whether a complete line search is done in step 3, or whether we take a single uphill step, or a fixed number of uphill steps. The best strategy in a given situation depends on the relative expense of evaluating $L(\theta)$ and $\nabla L(\theta)$. The steepest ascent algorithm never takes a downhill step, and in practice it almost always converges to a local or global maximum. Theoretically, the method can converge to a saddle point, but such a solution is unstable and hence rarely reached in practice.

The trajectory followed by the steepest ascent algorithm can be characterized geometrically in terms of the level sets of the objective function. The gradient $\nabla f(x_0)$ of $f$, evaluated at $x_0$, is perpendicular to the level set of $f$ through $x_0$. The extreme values of the restricted function $g(\lambda) = f(x_0 + \lambda \nabla f(x_0))$ are located at the points $\lambda$ where the direction $\nabla f(x_0)$ is tangent to the level curve of $f$ through $x_0 + \lambda \nabla f(x_0)$. Thus, over each line search, the angle of the search direction relative to the level sets of $f$ moves from $\pi/2$ to $0$, and consecutive search directions are orthogonal.
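A minimal sketch of steepest ascent, with a crude backtracking search standing in for the full line search of step 3 (the toy concave objective is an arbitrary choice, not from the notes):

```python
import numpy as np

def L(theta):
    # toy concave objective, deliberately ill-conditioned
    return -(theta[0] - 1.0)**2 - 10.0*(theta[1] + 0.5)**2

def grad_L(theta):
    return np.array([-2.0*(theta[0] - 1.0), -20.0*(theta[1] + 0.5)])

theta = np.zeros(2)
for it in range(500):
    G = grad_L(theta)
    if np.linalg.norm(G) < 1e-8:
        break
    # backtracking line search: shrink lam until a sufficient uphill step is found
    lam = 1.0
    while L(theta + lam * G) < L(theta) + 1e-4 * lam * (G @ G):
        lam *= 0.5
    theta = theta + lam * G

print(theta)   # should approach the maximizer (1.0, -0.5)
```

Even on this simple quadratic, the iterates zig-zag across the narrow axis, which illustrates why the method tends to be slow.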

The steepest ascent method is globally convergent for concave functions, but it tends to be very slow. It is not finitely convergent for any interesting class of functions. It is possible to construct optimization algorithms using only first derivatives that are finitely convergent for quadratic functions (conjugate gradient methods, for example); these methods are generally preferred to steepest ascent.

EM Algorithm

Suppose we observe $Y$ according to a density $f_\theta(Y)$, where $L(\theta \mid Y) = \log f_\theta(Y)$ is a complicated function of $\theta$, and therefore is hard to optimize. Suppose further that the nature of the problem suggests a random variable $Z$ such that if $Z$ were observed, then the augmented log-likelihood $L(\theta \mid Y, Z) = \log f_\theta(Y, Z)$ would be simpler, in that the following two conditions hold:

1. We can compute the following conditional expectation as a function of $\theta$ in closed form:
\[
Q(\theta \mid \theta') = E_{\theta'}\big(L(\theta \mid Y, Z) \mid Y\big).
\]
2. We can optimize $Q(\theta \mid \theta')$ as a function of $\theta$ for fixed $\theta'$. That is, we can compute
\[
M(\theta') = \mathrm{argmax}_\theta\, Q(\theta \mid \theta').
\]

Under these conditions, the EM algorithm constructs a sequence of iterates according to the rule $\theta_{i+1} = M(\theta_i)$. Under certain additional regularity conditions (C. F. J. Wu, Annals of Statistics 11, 1983, pp. 95-103), the $\theta_i$ converge to a stationary point of $L(\theta \mid Y)$. If $L(\theta \mid Y)$ is unimodal and has only one stationary point, then the convergence is to the MLE. When convergence occurs, the rate is typically linear. Step 1 is called the E-step, and step 2 is called the M-step.

The EM algorithm, like the steepest ascent and conjugate gradient algorithms, but unlike Newton's method, is monotonic, in that the sequence $\theta_n$ of iterates satisfies $L(\theta_{n+1} \mid Y) \ge L(\theta_n \mid Y)$.
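As a minimal self-contained illustration of these two conditions (a standard textbook example, not one from these notes), suppose $Z_i \sim \text{Exponential}(\lambda)$, but values above a censoring point $c$ are only observed as exceeding $c$. By memorylessness, the E-step fills in each censored value with $E_{\lambda'}(Z_i \mid Z_i > c) = c + 1/\lambda'$, and the M-step is the complete-data MLE $\lambda = n/\sum_i \hat Z_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam_true = 1000, 0.7          # arbitrary simulation settings
z = rng.exponential(1.0 / lam_true, size=n)
c = 2.0                          # fixed censoring point
y = np.minimum(z, c)             # observed value
censored = z > c                 # censoring indicator

lam = 1.0                        # starting value
for it in range(100):
    # E-step: conditional expectation of each Z given the observed data
    z_hat = np.where(censored, c + 1.0 / lam, y)
    # M-step: complete-data MLE of the exponential rate
    lam_new = n / z_hat.sum()
    if abs(lam_new - lam) < 1e-10:
        break
    lam = lam_new

print(lam)   # should be close to lam_true
```

The fixed point of this iteration is $\hat\lambda = (\#\text{uncensored})/\sum_i y_i$, the MLE for censored exponential data, and the log-likelihood increases at every step.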

To verify the monotonicity claim, begin by observing that $L(\theta \mid Y) = \log f_\theta(Y, Z) - \log f_\theta(Z \mid Y)$, and integrate both sides against $f_{\theta'}(Z \mid Y)$ (the left side does not depend on $Z$):
\[
L(\theta \mid Y) = \int f_{\theta'}(Z \mid Y) \log f_\theta(Y, Z)\, dZ - \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\, dZ
= Q(\theta \mid \theta') - K(\theta \mid \theta'),
\]
where $K(\theta \mid \theta') = \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\, dZ$. Continuing, we have
\[
L(\theta \mid Y) - L(\theta' \mid Y) = Q(\theta \mid \theta') - Q(\theta' \mid \theta') + K(\theta' \mid \theta') - K(\theta \mid \theta')
= Q(\theta \mid \theta') - Q(\theta' \mid \theta') - \int \log\!\big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ.
\]
If we set $\theta = \mathrm{argmax}_\tau\, Q(\tau \mid \theta')$, then $Q(\theta \mid \theta') - Q(\theta' \mid \theta') \ge 0$. By Jensen's inequality,
\[
\int \log\!\big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ
\le \log \int \big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ = \log 1 = 0.
\]
Thus $L(\theta \mid Y) - L(\theta' \mid Y) \ge 0$.

For any value of $\theta$,
\[
L(\theta \mid Y) \ge Q(\theta \mid \theta') + L(\theta' \mid Y) - Q(\theta' \mid \theta'),
\]
with equality for $\theta = \theta'$. This leads to the characterization of the EM algorithm as a majorization or optimization transfer algorithm: optimization of the surrogate function $Q(\theta \mid \theta')$ always drives $L(\theta)$ uphill.

Example: Finite mixture models. Suppose $\pi_1, \ldots, \pi_K$ are positive constants with $\sum_j \pi_j = 1$, and let $f_1, \ldots, f_K$ denote density functions. Then $\sum_j \pi_j f_j$ is also a density function. It is the density of an observation generated according to the following two-stage procedure: (i) generate $\kappa \in \{1, \ldots, K\}$ with $P(\kappa = k) = \pi_k$; (ii) generate $Y \sim f_\kappa$. We observe $Y$ but not $\kappa$.

Suppose we observe $Y_1, \ldots, Y_n$ according to the mixture density, and we wish to estimate the parameters $\pi = (\pi_1, \ldots, \pi_K)$ using maximum likelihood. The log-likelihood for the observed data is
\[
L(\pi \mid \{Y_j\}) = \sum_{j=1}^n \log\Big(\sum_{k=1}^K \pi_k f_k(Y_j)\Big).
\]
Suppose we were to observe as complete data the pairs $(Y_j, \kappa_j)$, where $\kappa_j$ is the random variable defined in (i) above. Then the log-likelihood for the complete data is
\[
L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \log(\pi_{\kappa_j}) + \log(f_{\kappa_j}(Y_j)).
\]
The E-step of the EM algorithm requires that we compute the expectation of $L(\pi \mid \{Y_j, \kappa_j\})$ given the $Y_j$ for a fixed probability vector $\pi'$. This is more easily accomplished if we rewrite the complete-data log-likelihood as follows:

\[
L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k) I(\kappa_j = k) + \log(f_k(Y_j)) I(\kappa_j = k).
\]
This is linear in the random variables $I(\kappa_j = k)$, so the expectation is obtained by plugging in the values of the conditional expectations
\[
E_{\pi'}\big(I(\kappa_j = k) \mid Y_j\big) = P_{\pi'}(\kappa_j = k \mid Y_j).
\]
Using Bayes' theorem, we have
\[
P_{\pi'}(\kappa_j = k \mid Y_j) \propto P(Y_j \mid \kappa_j = k)\, P_{\pi'}(\kappa_j = k) = f_k(Y_j)\, \pi_k'.
\]
Therefore the expectation is given by
\[
E_{\pi'}\big(I(\kappa_j = k) \mid Y_j\big) = f_k(Y_j)\pi_k' \Big/ \sum_l f_l(Y_j)\pi_l' \equiv p_{jk},
\]
and the $Q$ function takes the form
\[
Q(\pi \mid \pi') = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k)\, p_{jk} + \log(f_k(Y_j))\, p_{jk}.
\]
Note that only the first term involves $\pi$, so the M-step depends only on the first term. We update the parameters using the rule
\[
\pi_k \leftarrow \sum_j p_{jk}/n.
\]
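A minimal sketch of this E-step/M-step pair for the case of known component densities, so that only $\pi$ is estimated (the two Gaussian components are an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
pi_true = np.array([0.3, 0.7])                       # arbitrary mixture weights
comp = rng.choice(2, size=2000, p=pi_true)
y = np.where(comp == 0, rng.normal(-2.0, 1.0, 2000), rng.normal(1.0, 1.0, 2000))

# known component densities f_k evaluated at each Y_j (an n x K matrix)
F = np.column_stack([norm.pdf(y, -2.0, 1.0), norm.pdf(y, 1.0, 1.0)])

pi = np.array([0.5, 0.5])                            # starting value
for it in range(500):
    # E-step: posterior membership probabilities p_jk
    w = F * pi                                       # f_k(Y_j) * pi_k
    p = w / w.sum(axis=1, keepdims=True)
    # M-step: pi_k <- sum_j p_jk / n
    pi_new = p.mean(axis=0)
    if np.abs(pi_new - pi).max() < 1e-10:
        break
    pi = pi_new

print(pi)   # should be close to pi_true
```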

Example: Gaussian mixture models. Continuing with the previous example, suppose that
\[
f_k(Y) \propto \exp\big(-(Y - \mu_k)'\Sigma_k^{-1}(Y - \mu_k)/2\big) \big/ |\Sigma_k|^{1/2}
\]
is Gaussian, and we wish to estimate $\pi_1, \ldots, \pi_K$, $\mu_1, \ldots, \mu_K$, and $\Sigma_1, \ldots, \Sigma_K$. The complete-data log-likelihood becomes
\[
L(\pi, \{\mu_k\}, \{\Sigma_k\} \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k) I(\kappa_j = k)
- \frac{1}{2}\Big( \log|\Sigma_k| + (Y_j - \mu_k)'\Sigma_k^{-1}(Y_j - \mu_k) \Big) I(\kappa_j = k).
\]
To carry out the E-step, the posterior probabilities $p_{jk}$ can be computed as in the previous example. To carry out the M-step for the mean and variance parameters, use the weighted moment updates
\[
\mu_k \leftarrow \sum_j p_{jk} Y_j \Big/ \sum_j p_{jk}, \qquad
\Sigma_k \leftarrow \sum_j p_{jk} (Y_j - \mu_k)(Y_j - \mu_k)' \Big/ \sum_j p_{jk}.
\]
The probability parameters $\pi_k$ are updated as in the previous example.
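A minimal univariate sketch of the full Gaussian-mixture EM update (all simulation settings are arbitrary; in one dimension the $\Sigma_k$ update reduces to a weighted variance):

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2.0, 0.8, 600), rng.normal(1.5, 1.2, 1400)])
n, K = len(y), 2

pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])          # crude starting values
var = np.ones(K)

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(1000):
    # E-step: posterior membership probabilities p_jk (an n x K matrix)
    w = np.column_stack([pi[k] * normal_pdf(y, mu[k], var[k]) for k in range(K)])
    p = w / w.sum(axis=1, keepdims=True)
    # M-step: weighted moment updates
    nk = p.sum(axis=0)
    pi = nk / n
    mu_old = mu
    mu = (p * y[:, None]).sum(axis=0) / nk
    var = (p * (y[:, None] - mu)**2).sum(axis=0) / nk
    if np.abs(mu - mu_old).max() < 1e-8:
        break

print(pi, mu, var)   # should approximate (0.3, 0.7), (-2.0, 1.5), (0.64, 1.44)
```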

Example: Contingency tables with collapsed cells. Suppose we observe an $r \times c$ contingency table with independent row and column classifications. If $X_{ij}$ denotes the number of observations in cell $(i, j)$, then $EX_{ij} = Np_iq_j$, where $N$ is the total number of observations, $p_i$ is the probability of an observation falling in row $i$, and $q_j$ is the probability of an observation falling in column $j$. Now suppose that we are unable to distinguish between certain cells, so we can only observe their sum. To be concrete, consider a $2 \times 3$ table where we can only observe the following four values:
\[
Z_1 = X_{11} + X_{12}, \qquad Z_2 = X_{21}, \qquad Z_3 = X_{22} + X_{13}, \qquad Z_4 = X_{23}.
\]
The likelihood for the observable data is
\[
L(\{p_i\}, \{q_j\} \mid \{Z_j\}) \propto (p_1q_1 + p_1q_2)^{Z_1} (p_2q_1)^{Z_2} (p_2q_2 + p_1q_3)^{Z_3} (p_2q_3)^{Z_4}.
\]
We could optimize this function using gradient methods, but it would require a fair amount of programming. On the other hand, consider the log-likelihood for the $\{X_{ij}\}$ (some of which are unobservable):
\[
L(\{p_i\}, \{q_j\} \mid \{X_{ij}\}) = \sum_{ij} X_{ij} \log(p_iq_j) + \text{const}.
\]
This is linear in the $X_{ij}$, so to compute the $Q(\cdot)$ function we need only compute the conditional means of these values given the $Z_j$ under the current parameter values $(p', q')$, and then plug them in. Specifically, set
\[
X_{11}^* = E(X_{11} \mid \{Z_j\}) = p_1'q_1'Z_1/(p_1'q_1' + p_1'q_2'),
\]
\[
X_{12}^* = E(X_{12} \mid \{Z_j\}) = p_1'q_2'Z_1/(p_1'q_1' + p_1'q_2'),
\]
\[
X_{22}^* = E(X_{22} \mid \{Z_j\}) = p_2'q_2'Z_3/(p_2'q_2' + p_1'q_3'),
\]
\[
X_{13}^* = E(X_{13} \mid \{Z_j\}) = p_1'q_3'Z_3/(p_2'q_2' + p_1'q_3'),
\]
with the other $X_{ij}^*$ set to their directly observable values. Finally, we update the parameter estimates using
\[
p_i \leftarrow \sum_j X_{ij}^*/N, \qquad q_j \leftarrow \sum_i X_{ij}^*/N.
\]
This defines one iteration of the EM algorithm.
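A minimal sketch of this iteration (the observed counts $Z_1, \ldots, Z_4$ are made-up numbers for illustration):

```python
import numpy as np

Z1, Z2, Z3, Z4 = 40.0, 25.0, 30.0, 20.0     # hypothetical observed counts
N = Z1 + Z2 + Z3 + Z4

p = np.array([0.5, 0.5])                     # starting values
q = np.array([1/3, 1/3, 1/3])

for it in range(500):
    # E-step: split the collapsed counts in proportion to current cell means
    X = np.zeros((2, 3))
    X[0, 0] = p[0]*q[0]*Z1 / (p[0]*q[0] + p[0]*q[1])
    X[0, 1] = p[0]*q[1]*Z1 / (p[0]*q[0] + p[0]*q[1])
    X[1, 0] = Z2                             # directly observed
    X[1, 1] = p[1]*q[1]*Z3 / (p[1]*q[1] + p[0]*q[2])
    X[0, 2] = p[0]*q[2]*Z3 / (p[1]*q[1] + p[0]*q[2])
    X[1, 2] = Z4                             # directly observed
    # M-step: row and column proportions of the completed table
    p_new, q_new = X.sum(axis=1) / N, X.sum(axis=0) / N
    if max(np.abs(p_new - p).max(), np.abs(q_new - q).max()) < 1e-12:
        break
    p, q = p_new, q_new

print(p, q)
```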

Example: Missing data in a multivariate normal model. Suppose that we observe $Y_j \in \mathbb{R}^{d_j}$ with $Y_j \sim N(\mu_j, \Sigma_j)$. Suppose further that there exist injective functions $\sigma_j: \{1, \ldots, d_j\} \to \{1, \ldots, d\}$ such that $\mu_j(k) = \mu(\sigma_j(k))$ and $\Sigma_j(k_1, k_2) = \Sigma(\sigma_j(k_1), \sigma_j(k_2))$. In words, the components of each $Y_j$ comprise a subset of the components of a common multivariate normal distribution. Suppose that we want to estimate $\theta = (\mu, \Sigma)$ based on the following log-likelihood:
\[
L_\theta(\{Y_j\}) = \sum_j -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(Y_j - \mu_j)'\Sigma_j^{-1}(Y_j - \mu_j).
\]
Note that this might not be advisable if the $\sigma_j$ are viewed as random variables that could be dependent on the $Y_j$. In general, $L_\theta(\{Y_j\})$ cannot be optimized in closed form. A conjugate gradient algorithm would be a good choice here, but the EM algorithm is considerably simpler to derive and program (at least for a statistician).

The complete data are random vectors $Z_j \in \mathbb{R}^d$ such that $Z_j \sim N(\mu, \Sigma)$ and $Y_j(k) = Z_j(\sigma_j(k))$. The complete-data log-likelihood is
\[
L_\theta(\{Z_j\}) = \sum_j -\frac{1}{2}\log|\Sigma| - \frac{1}{2}(Z_j - \mu)'\Sigma^{-1}(Z_j - \mu)
= -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_j \mathrm{tr}\big(\Sigma^{-1}(Z_j - \mu)(Z_j - \mu)'\big).
\]
To compute the $Q(\theta \mid \theta')$ function, we need the values $\eta_j = E_{\mu',\Sigma'}(Z_j \mid Y_j)$ and $C_j = E_{\mu',\Sigma'}(Z_jZ_j' \mid Y_j)$. Once we have these values, the $Q(\theta \mid \theta')$ function can be expressed
\[
Q(\theta \mid \theta') = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\Big(\Sigma^{-1}\sum_j \big(C_j - \eta_j\mu' - \mu\eta_j' + \mu\mu'\big)\Big)
= -\frac{n}{2}\log|\Sigma| - \frac{n}{2}\,\mathrm{tr}\Big(\Sigma^{-1}(\mu - \bar\eta)(\mu - \bar\eta)' + \Sigma^{-1}\big(\bar C - \bar\eta\,\bar\eta'\big)\Big),
\]
where $\bar\eta = \sum_j \eta_j/n$ and $\bar C = \sum_j C_j/n$. The maximum of $Q(\theta \mid \theta')$ occurs where
\[
\mu = \bar\eta, \qquad
\Sigma = \bar C - \bar\eta\,\bar\eta' = \frac{1}{n}\sum_j \mathrm{Var}(Z_j \mid Y_j) + \frac{1}{n}\sum_j (\eta_j - \bar\eta)(\eta_j - \bar\eta)'.
\]

The values $\eta_j$ and $C_j$ can be easily computed using the usual formulas for the conditional mean and variance of a multivariate normal random vector. Specifically, let $Z^*$ denote the missing components of $Z$ and let $Y$ denote the observed components. Then
\[
E(Z^* \mid Y) = EZ^* + \mathrm{cov}(Z^*, Y)\,\mathrm{cov}(Y)^{-1}(Y - EY),
\]
\[
\mathrm{cov}(Z^* \mid Y) = \mathrm{cov}(Z^*) - \mathrm{cov}(Z^*, Y)\,\mathrm{cov}(Y)^{-1}\,\mathrm{cov}(Y, Z^*),
\]
where all of the values on the right-hand side are computed with respect to $\theta' = (\mu', \Sigma')$.
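A minimal sketch of this EM update on simulated data with components missing completely at random (all settings are arbitrary; the boolean masks play the role of the index maps $\sigma_j$):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 2000
mu_true = np.array([1.0, -1.0, 0.5])
A = rng.normal(size=(d, d)); Sigma_true = A @ A.T + np.eye(d)
Z = rng.multivariate_normal(mu_true, Sigma_true, size=n)
obs = rng.random((n, d)) < 0.7          # True where a component is observed
obs[:, 0] |= ~obs.any(axis=1)           # ensure every row has an observed entry
Y = np.where(obs, Z, np.nan)

mu, Sigma = np.zeros(d), np.eye(d)      # starting values
for it in range(200):
    eta = np.empty((n, d))
    V = np.zeros((d, d))                # running sum of Var(Z_j | Y_j)
    for j in range(n):
        o, m = obs[j], ~obs[j]
        eta[j, o] = Y[j, o]             # observed components are fixed
        if m.any():
            # conditional mean and variance of the missing block
            B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
            eta[j, m] = mu[m] + B @ (Y[j, o] - mu[o])
            V[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
    # M-step: mu = eta_bar; Sigma = mean conditional variance + spread of eta
    mu = eta.mean(axis=0)
    Sigma = V / n + (eta - mu).T @ (eta - mu) / n

print(mu)   # compare with mu_true
```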

General optimization transfer. The basic idea behind the EM algorithm can be extended to optimization problems where there is no data augmentation, or even to problems that have no likelihood. In fact, the reason that the procedure works has only to do with convexity, and nothing whatsoever to do with probability. We retain the idea of the minorant $Q(\theta \mid \theta')$, shifted by $Q(\theta' \mid \theta')$ so that $Q(\theta' \mid \theta') = 0$ and
\[
L(\theta) \ge Q(\theta \mid \theta') + L(\theta'),
\]
so that increasing $Q(\theta \mid \theta')$ (with respect to $\theta$) drives $L(\theta)$ uphill. There are many ways of constructing such a $Q(\cdot)$. For example, if $L(\theta)$ is convex and differentiable, then
\[
L(\theta) \ge \nabla L(\theta')'(\theta - \theta') + L(\theta'),
\]
suggesting $Q(\theta \mid \theta') = \nabla L(\theta')'(\theta - \theta')$.

Simulated Annealing

Suppose we want to maximize a real-valued function $f$ with domain $S$. Simulated annealing is an attractive optimization procedure if any of the following hold: (i) $f$ has many local extreme values, (ii) $S$ is discrete, (iii) $f$ is not smooth, or the derivative of $f$ is much more expensive to compute than the function value.

Suppose we have a family of proposal distributions on $S$ indexed by $X' \in S$: $R(\cdot \mid X')$. From a starting value $X_0$, define a sequence of points in $S$ using the rules
\[
Z_n \sim R(\cdot \mid X_{n-1}),
\]
\[
P(X_n = Z_n) = \min\Big(\exp\big((f(Z_n) - f(X_{n-1}))/T_n\big),\ 1\Big), \qquad
P(X_n = X_{n-1}) = 1 - P(X_n = Z_n).
\]
Note that if $f(Z_n) > f(X_{n-1})$ then $X_n = Z_n$: uphill proposals are always accepted, but some downhill proposals are accepted as well.

As $T_n \to 0$, downhill proposals are accepted less often. If $T_n \to 0$ sufficiently slowly, then almost-sure convergence to the global maximum occurs for a reasonable class of problems. Theoretically, the rate of $T_n$ should have the form $T_n \propto 1/\log(c + n)$. However, this rate is too slow to be useful for most practical problems, since for $n$ of manageable size the estimate $X_n$ will have substantial variability. There is some experimental evidence that for many problems the optimal value can be identified using a faster heuristic cooling rule, such as decreasing $T_n$ by 10% each time a fixed number of proposals have been accepted.
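A minimal sketch on a discrete problem (all settings, including the 10% cooling rule from above, are illustrative): maximizing a multimodal function on $S = \{0, \ldots, 99\}$ with a nearest-neighbor random-walk proposal.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # multimodal test function on {0, ..., 99} (arbitrary choice);
    # it has several local maxima, with the global maximum at x = 99
    return np.sin(x / 5.0) + 0.05 * x

x = 0                                # starting value
best = x
T, accepted = 1.0, 0
for n in range(50000):
    z = min(max(x + rng.choice([-1, 1]), 0), 99)   # nearest-neighbor proposal
    if np.log(rng.random()) < (f(z) - f(x)) / T:   # accept uphill always,
        x, accepted = z, accepted + 1              # downhill with prob exp(df/T)
        if f(x) > f(best):
            best = x
    if accepted >= 300:                            # heuristic cooling rule:
        T, accepted = 0.9 * T, 0                   # cut T by 10% per 300 acceptances
print(best, f(best))   # should find, or come close to, the maximum at x = 99
```

Tracking the best value visited, as done here, is a common safeguard: even if the chain later freezes in a local mode, the reported solution reflects the mobile high-temperature phase.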