Optimization

Background: given a function $f(x)$ defined on $X$, find $x^*$ such that $f(x^*) \ge f(x)$ for all $x \in X$. The value $x^*$ is called a maximizer of $f$ and is written $\mathrm{argmax}_X f$. In general, $\mathrm{argmax}_X f$ may not be unique. It will be unique if $f$ is strictly concave. A function $g$ is strictly concave if
\[
g(\lambda x + (1-\lambda)y) > \lambda g(x) + (1-\lambda)g(y), \qquad 0 < \lambda < 1;\ x, y \in X,\ x \ne y.
\]
A twice continuously differentiable function is strictly concave if its Hessian matrix $H_{ij}(x) = \partial^2 g(x)/\partial x_i \partial x_j$ is negative definite for all $x$ (the converse need not hold). If $X$ is discrete, the problem is known as combinatorial optimization.

Newton's Method

Suppose $f$ has two continuous derivatives. Then all interior local maxima $x$ of $f$ satisfy $f'(x) = 0$. More generally, write the vector of partial derivatives of $f$ with respect to $x$ (the gradient of $f$) as
\[
\nabla f(x) = (\partial f/\partial x_1, \ldots, \partial f/\partial x_m)'.
\]
An interior local maximum of $f$ satisfies $\nabla f(x) = 0$. The Hessian of $f$, denoted $H_f$, is an $m \times m$ matrix for each value of $x$, defined by $(H_f)_{ij} = \partial^2 f/\partial x_i \partial x_j$. For twice continuously differentiable functions, $H_f$ is symmetric at each $x$. At an interior local maximizer $x^*$, $H_f(x^*)$ is negative semidefinite; conversely, if $\nabla f(x^*) = 0$ and $H_f(x^*)$ is negative definite, then $x^*$ is a strict local maximizer.

Suppose $f$ has a continuous Hessian at $x_0$. Then we can approximate $f$ quadratically in a neighborhood of $x_0$ using
\[
f(x) \approx f(x_0) + \nabla f(x_0)'(x - x_0) + \tfrac{1}{2}(x - x_0)' H_f(x_0)(x - x_0).
\]
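As a quick numerical check of this quadratic approximation (a sketch, not part of the notes; the test function is an arbitrary concave choice), the error of the quadratic model should shrink like $O(h^3)$ as $x \to x_0$:

```python
import numpy as np

# Hypothetical smooth concave test function with its gradient and Hessian.
def f(x):
    return -x[0]**2 - 2*x[1]**2 + x[0]*x[1] - 0.1*x[0]**4

def grad(x):
    return np.array([-2*x[0] + x[1] - 0.4*x[0]**3, -4*x[1] + x[0]])

def hess(x):
    return np.array([[-2.0 - 1.2*x[0]**2, 1.0], [1.0, -4.0]])

x0 = np.array([0.5, -0.3])
for h in [0.5, 0.1, 0.01]:
    x = x0 + h * np.array([1.0, 1.0])
    quad = f(x0) + grad(x0) @ (x - x0) + 0.5 * (x - x0) @ hess(x0) @ (x - x0)
    # Error of the quadratic Taylor model; it shrinks roughly like h**3.
    print(h, f(x) - quad)
```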

Differentiating this quadratic model leads to the following approximation to the gradient:
\[
\nabla f(x) \approx \nabla f(x_0) + H_f(x_0)(x - x_0).
\]
We can solve this expression for the stationary point $\nabla f(x) = 0$, giving
\[
x = x_0 - H_f(x_0)^{-1} \nabla f(x_0).
\]
This is the Newton-Raphson method for multidimensional optimization. We define a sequence of iterates starting at an arbitrary value $x_0$, and update using the rule
\[
x_{i+1} = x_i - H_f(x_i)^{-1} \nabla f(x_i).
\]
The Newton-Raphson algorithm converges at a quadratic rate to a local maximizer $x^*$ whenever $f$ has two continuous derivatives, $H_f(x^*)$ is nonsingular, and the starting point is sufficiently close to $x^*$; for smooth, strongly concave $f$, combining the Newton step with a line search (e.g., step halving) yields global convergence.

Maximum Likelihood

Suppose we observe independent realizations $Y_1, \ldots, Y_n$, where the density of $Y_i$ is $f_i(\cdot\,;\theta)$ and $\theta \in \mathbb{R}^d$ is an unknown parameter. Then the joint density is given by
\[
p(Y_1, \ldots, Y_n) = \prod_{i=1}^n f_i(Y_i; \theta),
\]
and the log-likelihood is given by
\[
L(\theta \mid Y_1, \ldots, Y_n) = \sum_{i=1}^n \log f_i(Y_i; \theta).
\]
The maximum likelihood estimator (MLE) of $\theta$ is defined to be
\[
\hat\theta = \mathrm{argmax}_\theta\, L(\theta \mid Y_1, \ldots, Y_n).
\]
In this setting, the gradient $\nabla L(\theta)$ is known as the score function, and is often written $s(\theta)$. It has the form
\[
s(\theta) = \sum_i \nabla f_i(Y_i; \theta)/f_i(Y_i; \theta) = \sum_i s_i(\theta).
\]
The score function is a random vector with expected value zero:
\[
E_\theta\, s(\theta) = \sum_i \int \frac{\nabla f_i(Y_i; \theta)}{f_i(Y_i; \theta)}\, f_i(Y_i; \theta)\, dY_i
= \sum_i \int \nabla f_i(Y_i; \theta)\, dY_i
= \sum_i \nabla \int f_i(Y_i; \theta)\, dY_i = 0.
\]
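As a minimal sketch (not from the notes) of Newton-Raphson applied to a log-likelihood, consider i.i.d. Poisson data parameterized by $\theta = \log\lambda$, so $L(\theta) = \theta\sum_i Y_i - ne^\theta + \text{const}$, $s(\theta) = \sum_i Y_i - ne^\theta$, and $H_L(\theta) = -ne^\theta$. The iteration should converge to $\hat\theta = \log \bar Y$; since $H_L$ does not depend on the data here, this is also the Fisher scoring iteration defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=200)   # simulated data; lam=3.0 is arbitrary
n, s_y = len(y), y.sum()

def score(theta):
    return s_y - n * np.exp(theta)

def hessian(theta):
    return -n * np.exp(theta)

theta = 0.0                          # arbitrary starting value
for it in range(25):
    step = -score(theta) / hessian(theta)   # Newton-Raphson update
    theta += step
    if abs(step) < 1e-10:
        break

print(theta, np.log(y.mean()))       # the two values should agree
```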

The Hessian of the log-likelihood function has the form
\[
H_L(\theta) = \sum_{i=1}^n \frac{f_i(Y_i; \theta)\, H_{f_i}(\theta) - \nabla f_i(Y_i; \theta)\, \nabla f_i(Y_i; \theta)'}{f_i^2(Y_i; \theta)}.
\]
For the first term we can write
\[
E_\theta\, \frac{f_i(Y_i; \theta) H_{f_i}(\theta)}{f_i^2(Y_i; \theta)} = \int H_{f_i}(\theta)\, dY_i = \nabla^2 \int f_i(Y_i; \theta)\, dY_i = 0.
\]
Thus,
\[
E_\theta\, H_L(\theta) = -\sum_{i=1}^n \int s_i(\theta) s_i(\theta)'\, f_i(Y_i; \theta)\, dY_i = -\sum_i \mathrm{cov}_\theta(s_i(\theta)) = -\mathrm{cov}_\theta(s(\theta)).
\]
The quantity $\mathrm{cov}_\theta(s(\theta)) = -E_\theta H_L(\theta)$ is the Fisher information, denoted $I(\theta)$. The Fisher information matrix is positive semidefinite, and $-I(\theta)$ can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update
\[
\theta_{i+1} = \theta_i + I(\theta_i)^{-1} s(\theta_i).
\]
This is the Fisher scoring algorithm. It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute $I$ than $H_L$. The Fisher information matrix is also the inverse of the asymptotic variance of the MLE $\hat\theta$: in large samples, $\mathrm{var}(\hat\theta) \approx I(\theta)^{-1}$.

Example: Logistic regression. Suppose we observe independent pairs $(Y_i, X_i)$, $i = 1, \ldots, n$ with $Y_i \in \{0, 1\}$, and the probability law is given by
\[
P(Y_i = 1 \mid X_i) = \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)}, \qquad
P(Y_i = 0 \mid X_i) = \frac{1}{1 + \exp(\beta'X_i)}.
\]
Our goal is to compute the MLE of $\beta$. The joint likelihood is
\[
P(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \prod_{i: Y_i = 1} \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)} \prod_{i: Y_i = 0} \frac{1}{1 + \exp(\beta'X_i)}.
\]
The log-likelihood function is
\[
L(\beta) = \beta' \sum_{i: Y_i = 1} X_i - \sum_i \log\big(1 + \exp(\beta'X_i)\big).
\]
The score function is
\[
s(\beta) = \sum_{i: Y_i = 1} X_i - \sum_i \frac{\exp(\beta'X_i)}{1 + \exp(\beta'X_i)}\, X_i.
\]
The Hessian of $L$ is
\[
H_L(\beta) = -\sum_i \frac{\exp(\beta'X_i)}{\big(1 + \exp(\beta'X_i)\big)^2}\, X_i X_i'.
\]
The score function is a linear combination of the $X_i$, and the information matrix is a linear combination of the $X_iX_i'$. This is typical in exponential family regression models. The coefficients of $X_iX_i'$ in the expression for $H_L(\beta)$ are negative. Thus $H_L$ is negative definite as long as the $X_i$ span $\mathbb{R}^m$, and it follows that $L$ is globally concave, so there is a unique local maximizer, which is also the global maximizer. The $Y_i$ do not appear in $H_L(\beta)$, so $H_L(\beta) = -I(\beta)$, and Newton-Raphson and Fisher scoring coincide for this model.
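A minimal sketch of this iteration on simulated data (the data-generating values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])  # includes intercept
beta_true = np.array([-0.5, 1.0, 2.0])                           # arbitrary truth
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, p)

beta = np.zeros(m)
for it in range(50):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))     # P(Y=1 | X) under current beta
    score = X.T @ (Y - mu)                   # s(beta)
    W = mu * (1.0 - mu)                      # exp(beta'X)/(1 + exp(beta'X))^2
    H = -(X * W[:, None]).T @ X              # H_L(beta)
    step = np.linalg.solve(H, -score)        # Newton step: -H^{-1} s
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

print(beta)   # should be close to beta_true for large n
```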

Example: Heteroscedastic regression. Suppose we are fitting a linear regression model in which $E(Y \mid X) = \alpha + \beta X$ and $\mathrm{var}(Y \mid X) = cX^2$. We wish to estimate $\alpha$, $\beta$, and $c$ using maximum likelihood. Under a Gaussian model, the log-likelihood is, up to an additive constant,
\[
L(\alpha, \beta, c) = -\frac{1}{2}\sum_i \frac{(Y_i - \alpha - \beta X_i)^2}{cX_i^2} - \frac{n}{2}\log(c) - \frac{1}{2}\sum_i \log X_i^2.
\]
Writing $r_i = Y_i - \alpha - \beta X_i$, the gradient is
\[
\nabla L(\alpha, \beta, c) = \sum_i \begin{pmatrix}
r_i/(cX_i^2) \\
r_i/(cX_i) \\
r_i^2/(2c^2X_i^2) - 1/(2c)
\end{pmatrix}.
\]
The Hessian matrix is
\[
H_L = -\sum_i \begin{pmatrix}
1/(cX_i^2) & 1/(cX_i) & r_i/(c^2X_i^2) \\
1/(cX_i) & 1/c & r_i/(c^2X_i) \\
r_i/(c^2X_i^2) & r_i/(c^2X_i) & r_i^2/(c^3X_i^2) - 1/(2c^2)
\end{pmatrix}.
\]
Since $E\,r_i = 0$ and $E\,r_i^2 = cX_i^2$, the expected Hessian is
\[
E\, H_L = -\sum_i \begin{pmatrix}
1/(cX_i^2) & 1/(cX_i) & 0 \\
1/(cX_i) & 1/c & 0 \\
0 & 0 & 1/(2c^2)
\end{pmatrix}.
\]

Methods using only one derivative

In many cases it is practical to compute the first derivative of $L(\theta)$, but not the Hessian or Fisher information matrix. In this case, we know that $\nabla L(\theta)$ gives the steepest uphill direction from a given position $\theta$. One strategy is to always move in the direction of $\nabla L(\theta)$: if we are currently at a point $\theta_i$, the next iterate is
\[
\theta_{i+1} = \theta_i + \lambda \nabla L(\theta_i).
\]
We do not know in advance how far to move in the direction $\nabla L(\theta)$ (i.e., the value of $\lambda$), so we can use a one-dimensional line search (e.g., bisection on the directional derivative) to find the best value. The resulting algorithm is known as the steepest ascent algorithm (a code sketch follows below):

1. Start at $\theta_0$, set $i = 0$.
2. Set $G_i = \nabla L(\theta_i)$.
3. Use a one-dimensional line search to obtain $\lambda_i = \mathrm{argmax}_\lambda\, L(\theta_i + \lambda G_i)$.
4. Set $\theta_{i+1} = \theta_i + \lambda_i G_i$, increment $i$, and return to step 2.

There are many variants of this algorithm. The main issue is whether a complete line search is done in step 3, or whether we take a single uphill step, or a fixed number of uphill steps. The best strategy in a given situation depends on the relative expense of evaluating $L(\theta)$ and $\nabla L(\theta)$. The steepest ascent algorithm never takes a downhill step, and in practice it almost always converges to a local or global maximum. Theoretically, the method can converge to a saddle point, but such a solution is unstable and hence rarely reached in practice.

The trajectory followed by the steepest ascent algorithm can be characterized geometrically in terms of the level sets of the objective function. The gradient $\nabla f(x_0)$ of $f$, evaluated at $x_0$, is perpendicular to the level set of $f$ through $x_0$. The extreme values of the restricted function $g(\lambda) = f(x_0 + \lambda \nabla f(x_0))$ are located at the points $\lambda$ where the direction $\nabla f(x_0)$ is tangent to the level curve of $f$ through $x_0 + \lambda \nabla f(x_0)$. Thus, over each line search, the angle of the search direction relative to the level sets of $f$ moves from $\pi/2$ to $0$, and consecutive search directions are orthogonal.
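A minimal sketch of steepest ascent, with a crude backtracking search standing in for the full line search of step 3 (the toy concave objective is an arbitrary choice, not from the notes):

```python
import numpy as np

def L(theta):
    # toy concave objective, deliberately ill-conditioned
    return -(theta[0] - 1.0)**2 - 10.0*(theta[1] + 0.5)**2

def grad_L(theta):
    return np.array([-2.0*(theta[0] - 1.0), -20.0*(theta[1] + 0.5)])

theta = np.zeros(2)
for it in range(500):
    G = grad_L(theta)
    if np.linalg.norm(G) < 1e-8:
        break
    # backtracking line search: shrink lam until a sufficient uphill step is found
    lam = 1.0
    while L(theta + lam * G) < L(theta) + 1e-4 * lam * (G @ G):
        lam *= 0.5
    theta = theta + lam * G

print(theta)   # should approach the maximizer (1.0, -0.5)
```

Even on this simple quadratic, the iterates zig-zag across the narrow axis, which illustrates why the method tends to be slow.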

The steepest ascent method is globally convergent for concave functions, but it tends to be very slow. It is not finitely convergent for any interesting class of functions. It is possible to construct optimization algorithms using only first derivatives that are finitely convergent for quadratic functions (conjugate gradient methods, for example); these methods are generally preferred to steepest ascent.

EM Algorithm

Suppose we observe $Y$ according to a density $f_\theta(Y)$, where $L(\theta \mid Y) = \log f_\theta(Y)$ is a complicated function of $\theta$, and therefore is hard to optimize. Suppose further that the nature of the problem suggests a random variable $Z$ such that if $Z$ were observed, then the augmented log-likelihood $L(\theta \mid Y, Z) = \log f_\theta(Y, Z)$ would be simpler, in that the following two conditions hold:

1. We can compute the following conditional expectation as a function of $\theta$ in closed form:
\[
Q(\theta \mid \theta') = E_{\theta'}\big(L(\theta \mid Y, Z) \mid Y\big).
\]
2. We can optimize $Q(\theta \mid \theta')$ as a function of $\theta$ for fixed $\theta'$. That is, we can compute
\[
M(\theta') = \mathrm{argmax}_\theta\, Q(\theta \mid \theta').
\]

Under these conditions, the EM algorithm constructs a sequence of iterates according to the rule $\theta_{i+1} = M(\theta_i)$. Under certain additional regularity conditions (C. F. J. Wu, Annals of Statistics 11, 1983, pp. 95-103), the $\theta_i$ converge to a stationary point of $L(\theta \mid Y)$. If $L(\theta \mid Y)$ is unimodal and has only one stationary point, then the convergence is to the MLE. When convergence occurs, the rate is typically linear. Step 1 is called the E-step, and step 2 is called the M-step.

The EM algorithm, like the steepest ascent and conjugate gradient algorithms, but unlike Newton's method, is monotonic, in that the sequence $\theta_n$ of iterates satisfies $L(\theta_{n+1} \mid Y) \ge L(\theta_n \mid Y)$.
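As a minimal self-contained illustration of these two conditions (a standard textbook example, not one from these notes), suppose $Z_i \sim \text{Exponential}(\lambda)$, but values above a censoring point $c$ are only observed as exceeding $c$. By memorylessness, the E-step fills in each censored value with $E_{\lambda'}(Z_i \mid Z_i > c) = c + 1/\lambda'$, and the M-step is the complete-data MLE $\lambda = n/\sum_i \hat Z_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam_true = 1000, 0.7          # arbitrary simulation settings
z = rng.exponential(1.0 / lam_true, size=n)
c = 2.0                          # fixed censoring point
y = np.minimum(z, c)             # observed value
censored = z > c                 # censoring indicator

lam = 1.0                        # starting value
for it in range(100):
    # E-step: conditional expectation of each Z given the observed data
    z_hat = np.where(censored, c + 1.0 / lam, y)
    # M-step: complete-data MLE of the exponential rate
    lam_new = n / z_hat.sum()
    if abs(lam_new - lam) < 1e-10:
        break
    lam = lam_new

print(lam)   # should be close to lam_true
```

The fixed point of this iteration is $\hat\lambda = (\#\text{uncensored})/\sum_i y_i$, the MLE for censored exponential data, and the log-likelihood increases at every step.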

To verify the monotonicity claim, begin by observing that $L(\theta \mid Y) = \log f_\theta(Y, Z) - \log f_\theta(Z \mid Y)$, and integrate both sides against $f_{\theta'}(Z \mid Y)$ (the left side does not depend on $Z$):
\[
L(\theta \mid Y) = \int f_{\theta'}(Z \mid Y) \log f_\theta(Y, Z)\, dZ - \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\, dZ
= Q(\theta \mid \theta') - K(\theta \mid \theta'),
\]
where $K(\theta \mid \theta') = \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\, dZ$. Continuing, we have
\[
L(\theta \mid Y) - L(\theta' \mid Y) = Q(\theta \mid \theta') - Q(\theta' \mid \theta') + K(\theta' \mid \theta') - K(\theta \mid \theta')
= Q(\theta \mid \theta') - Q(\theta' \mid \theta') - \int \log\!\big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ.
\]
If we set $\theta = \mathrm{argmax}_\tau\, Q(\tau \mid \theta')$, then $Q(\theta \mid \theta') - Q(\theta' \mid \theta') \ge 0$. By Jensen's inequality,
\[
\int \log\!\big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ
\le \log \int \big(f_\theta(Z \mid Y)/f_{\theta'}(Z \mid Y)\big)\, f_{\theta'}(Z \mid Y)\, dZ = \log 1 = 0.
\]
Thus $L(\theta \mid Y) - L(\theta' \mid Y) \ge 0$.

For any value of $\theta$,
\[
L(\theta \mid Y) \ge Q(\theta \mid \theta') + L(\theta' \mid Y) - Q(\theta' \mid \theta'),
\]
with equality for $\theta = \theta'$. This leads to the characterization of the EM algorithm as a majorization or optimization transfer algorithm: optimization of the surrogate function $Q(\theta \mid \theta')$ always drives $L(\theta)$ uphill.

Example: Finite mixture models. Suppose $\pi_1, \ldots, \pi_K$ are positive constants with $\sum_j \pi_j = 1$, and let $f_1, \ldots, f_K$ denote density functions. Then $\sum_j \pi_j f_j$ is also a density function. It is the density of an observation generated according to the following two-stage procedure: (i) generate $\kappa \in \{1, \ldots, K\}$ with $P(\kappa = k) = \pi_k$; (ii) generate $Y \sim f_\kappa$. We observe $Y$ but not $\kappa$.

Suppose we observe $Y_1, \ldots, Y_n$ according to the mixture density, and we wish to estimate the parameters $\pi = (\pi_1, \ldots, \pi_K)$ using maximum likelihood. The log-likelihood for the observed data is
\[
L(\pi \mid \{Y_j\}) = \sum_{j=1}^n \log\Big(\sum_{k=1}^K \pi_k f_k(Y_j)\Big).
\]
Suppose we were to observe as complete data the pairs $(Y_j, \kappa_j)$, where $\kappa_j$ is the random variable defined in (i) above. Then the log-likelihood for the complete data is
\[
L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \log(\pi_{\kappa_j}) + \log(f_{\kappa_j}(Y_j)).
\]
The E-step of the EM algorithm requires that we compute the expectation of $L(\pi \mid \{Y_j, \kappa_j\})$ given the $Y_j$ for a fixed probability vector $\pi'$. This is more easily accomplished if we rewrite the complete-data log-likelihood as follows:

\[
L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k) I(\kappa_j = k) + \log(f_k(Y_j)) I(\kappa_j = k).
\]
This is linear in the random variables $I(\kappa_j = k)$, so the expectation is obtained by plugging in the values of the conditional expectations
\[
E_{\pi'}\big(I(\kappa_j = k) \mid Y_j\big) = P_{\pi'}(\kappa_j = k \mid Y_j).
\]
Using Bayes' theorem, we have
\[
P_{\pi'}(\kappa_j = k \mid Y_j) \propto P(Y_j \mid \kappa_j = k)\, P_{\pi'}(\kappa_j = k) = f_k(Y_j)\, \pi_k'.
\]
Therefore the expectation is given by
\[
E_{\pi'}\big(I(\kappa_j = k) \mid Y_j\big) = f_k(Y_j)\pi_k' \Big/ \sum_l f_l(Y_j)\pi_l' \equiv p_{jk},
\]
and the $Q$ function takes the form
\[
Q(\pi \mid \pi') = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k)\, p_{jk} + \log(f_k(Y_j))\, p_{jk}.
\]
Note that only the first term involves $\pi$, so the M-step depends only on the first term. We update the parameters using the rule
\[
\pi_k \leftarrow \sum_j p_{jk}/n.
\]
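A minimal sketch of this E-step/M-step pair for the case of known component densities, so that only $\pi$ is estimated (the two Gaussian components are an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
pi_true = np.array([0.3, 0.7])                       # arbitrary mixture weights
comp = rng.choice(2, size=2000, p=pi_true)
y = np.where(comp == 0, rng.normal(-2.0, 1.0, 2000), rng.normal(1.0, 1.0, 2000))

# known component densities f_k evaluated at each Y_j (an n x K matrix)
F = np.column_stack([norm.pdf(y, -2.0, 1.0), norm.pdf(y, 1.0, 1.0)])

pi = np.array([0.5, 0.5])                            # starting value
for it in range(500):
    # E-step: posterior membership probabilities p_jk
    w = F * pi                                       # f_k(Y_j) * pi_k
    p = w / w.sum(axis=1, keepdims=True)
    # M-step: pi_k <- sum_j p_jk / n
    pi_new = p.mean(axis=0)
    if np.abs(pi_new - pi).max() < 1e-10:
        break
    pi = pi_new

print(pi)   # should be close to pi_true
```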

Example: Gaussian mixture models. Continuing with the previous example, suppose that
\[
f_k(Y) \propto \exp\big(-(Y - \mu_k)'\Sigma_k^{-1}(Y - \mu_k)/2\big) \big/ |\Sigma_k|^{1/2}
\]
is Gaussian, and we wish to estimate $\pi_1, \ldots, \pi_K$, $\mu_1, \ldots, \mu_K$, and $\Sigma_1, \ldots, \Sigma_K$. The complete-data log-likelihood becomes
\[
L(\pi, \{\mu_k\}, \{\Sigma_k\} \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k) I(\kappa_j = k)
- \frac{1}{2}\Big( \log|\Sigma_k| + (Y_j - \mu_k)'\Sigma_k^{-1}(Y_j - \mu_k) \Big) I(\kappa_j = k).
\]
To carry out the E-step, the posterior probabilities $p_{jk}$ can be computed as in the previous example. To carry out the M-step for the mean and variance parameters, use the weighted moment updates
\[
\mu_k \leftarrow \sum_j p_{jk} Y_j \Big/ \sum_j p_{jk}, \qquad
\Sigma_k \leftarrow \sum_j p_{jk} (Y_j - \mu_k)(Y_j - \mu_k)' \Big/ \sum_j p_{jk}.
\]
The probability parameters $\pi_k$ are updated as in the previous example.
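A minimal univariate sketch of the full Gaussian-mixture EM update (all simulation settings are arbitrary; in one dimension the $\Sigma_k$ update reduces to a weighted variance):

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2.0, 0.8, 600), rng.normal(1.5, 1.2, 1400)])
n, K = len(y), 2

pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])          # crude starting values
var = np.ones(K)

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(1000):
    # E-step: posterior membership probabilities p_jk (an n x K matrix)
    w = np.column_stack([pi[k] * normal_pdf(y, mu[k], var[k]) for k in range(K)])
    p = w / w.sum(axis=1, keepdims=True)
    # M-step: weighted moment updates
    nk = p.sum(axis=0)
    pi = nk / n
    mu_old = mu
    mu = (p * y[:, None]).sum(axis=0) / nk
    var = (p * (y[:, None] - mu)**2).sum(axis=0) / nk
    if np.abs(mu - mu_old).max() < 1e-8:
        break

print(pi, mu, var)   # should approximate (0.3, 0.7), (-2.0, 1.5), (0.64, 1.44)
```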

Example: Contingency tables with collapsed cells. Suppose we observe an $r \times c$ contingency table with independent row and column classifications. If $X_{ij}$ denotes the number of observations in cell $(i, j)$, then $EX_{ij} = Np_iq_j$, where $N$ is the total number of observations, $p_i$ is the probability of an observation falling in row $i$, and $q_j$ is the probability of an observation falling in column $j$. Now suppose that we are unable to distinguish between certain cells, so we can only observe their sum. To be concrete, consider a $2 \times 3$ table where we can only observe the following four values:
\[
Z_1 = X_{11} + X_{12}, \qquad Z_2 = X_{21}, \qquad Z_3 = X_{22} + X_{13}, \qquad Z_4 = X_{23}.
\]
The likelihood for the observable data is
\[
L(\{p_i\}, \{q_j\} \mid \{Z_j\}) \propto (p_1q_1 + p_1q_2)^{Z_1} (p_2q_1)^{Z_2} (p_2q_2 + p_1q_3)^{Z_3} (p_2q_3)^{Z_4}.
\]
We could optimize this function using gradient methods, but it would require a fair amount of programming. On the other hand, consider the log-likelihood for the $\{X_{ij}\}$ (some of which are unobservable):
\[
L(\{p_i\}, \{q_j\} \mid \{X_{ij}\}) = \sum_{ij} X_{ij} \log(p_iq_j) + \text{const}.
\]
This is linear in the $X_{ij}$, so to compute the $Q(\cdot)$ function we need only compute the conditional means of these values given the $Z_j$ under the current parameter values $(p', q')$, and then plug them in. Specifically, set
\[
X_{11}^* = E(X_{11} \mid \{Z_j\}) = p_1'q_1'Z_1/(p_1'q_1' + p_1'q_2'),
\]
\[
X_{12}^* = E(X_{12} \mid \{Z_j\}) = p_1'q_2'Z_1/(p_1'q_1' + p_1'q_2'),
\]
\[
X_{22}^* = E(X_{22} \mid \{Z_j\}) = p_2'q_2'Z_3/(p_2'q_2' + p_1'q_3'),
\]
\[
X_{13}^* = E(X_{13} \mid \{Z_j\}) = p_1'q_3'Z_3/(p_2'q_2' + p_1'q_3'),
\]
with the other $X_{ij}^*$ set to their directly observable values. Finally, we update the parameter estimates using
\[
p_i \leftarrow \sum_j X_{ij}^*/N, \qquad q_j \leftarrow \sum_i X_{ij}^*/N.
\]
This defines one iteration of the EM algorithm.
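A minimal sketch of this iteration (the observed counts $Z_1, \ldots, Z_4$ are made-up numbers for illustration):

```python
import numpy as np

Z1, Z2, Z3, Z4 = 40.0, 25.0, 30.0, 20.0     # hypothetical observed counts
N = Z1 + Z2 + Z3 + Z4

p = np.array([0.5, 0.5])                     # starting values
q = np.array([1/3, 1/3, 1/3])

for it in range(500):
    # E-step: split the collapsed counts in proportion to current cell means
    X = np.zeros((2, 3))
    X[0, 0] = p[0]*q[0]*Z1 / (p[0]*q[0] + p[0]*q[1])
    X[0, 1] = p[0]*q[1]*Z1 / (p[0]*q[0] + p[0]*q[1])
    X[1, 0] = Z2                             # directly observed
    X[1, 1] = p[1]*q[1]*Z3 / (p[1]*q[1] + p[0]*q[2])
    X[0, 2] = p[0]*q[2]*Z3 / (p[1]*q[1] + p[0]*q[2])
    X[1, 2] = Z4                             # directly observed
    # M-step: row and column proportions of the completed table
    p_new, q_new = X.sum(axis=1) / N, X.sum(axis=0) / N
    if max(np.abs(p_new - p).max(), np.abs(q_new - q).max()) < 1e-12:
        break
    p, q = p_new, q_new

print(p, q)
```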

Example: Missing data in a multivariate normal model. Suppose that we observe $Y_j \in \mathbb{R}^{d_j}$ with $Y_j \sim N(\mu_j, \Sigma_j)$. Suppose further that there exist injective functions $\sigma_j: \{1, \ldots, d_j\} \to \{1, \ldots, d\}$ such that $\mu_j(k) = \mu(\sigma_j(k))$ and $\Sigma_j(k_1, k_2) = \Sigma(\sigma_j(k_1), \sigma_j(k_2))$. In words, the components of each $Y_j$ comprise a subset of the components of a common multivariate normal distribution. Suppose that we want to estimate $\theta = (\mu, \Sigma)$ based on the following log-likelihood:
\[
L_\theta(\{Y_j\}) = \sum_j -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(Y_j - \mu_j)'\Sigma_j^{-1}(Y_j - \mu_j).
\]
Note that this might not be advisable if the $\sigma_j$ are viewed as random variables that could be dependent on the $Y_j$. In general, $L_\theta(\{Y_j\})$ cannot be optimized in closed form. A conjugate gradient algorithm would be a good choice here, but the EM algorithm is considerably simpler to derive and program (at least for a statistician).

The complete data are random vectors $Z_j \in \mathbb{R}^d$ such that $Z_j \sim N(\mu, \Sigma)$ and $Y_j(k) = Z_j(\sigma_j(k))$. The complete-data log-likelihood is
\[
L_\theta(\{Z_j\}) = \sum_j -\frac{1}{2}\log|\Sigma| - \frac{1}{2}(Z_j - \mu)'\Sigma^{-1}(Z_j - \mu)
= -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_j \mathrm{tr}\big(\Sigma^{-1}(Z_j - \mu)(Z_j - \mu)'\big).
\]
To compute the $Q(\theta \mid \theta')$ function, we need the values $\eta_j = E_{\mu',\Sigma'}(Z_j \mid Y_j)$ and $C_j = E_{\mu',\Sigma'}(Z_jZ_j' \mid Y_j)$. Once we have these values, the $Q(\theta \mid \theta')$ function can be expressed
\[
Q(\theta \mid \theta') = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\Big(\Sigma^{-1}\sum_j \big(C_j - \eta_j\mu' - \mu\eta_j' + \mu\mu'\big)\Big)
= -\frac{n}{2}\log|\Sigma| - \frac{n}{2}\,\mathrm{tr}\Big(\Sigma^{-1}(\mu - \bar\eta)(\mu - \bar\eta)' + \Sigma^{-1}\big(\bar C - \bar\eta\,\bar\eta'\big)\Big),
\]
where $\bar\eta = \sum_j \eta_j/n$ and $\bar C = \sum_j C_j/n$. The maximum of $Q(\theta \mid \theta')$ occurs where
\[
\mu = \bar\eta, \qquad
\Sigma = \bar C - \bar\eta\,\bar\eta' = \frac{1}{n}\sum_j \mathrm{Var}(Z_j \mid Y_j) + \frac{1}{n}\sum_j (\eta_j - \bar\eta)(\eta_j - \bar\eta)'.
\]

The values $\eta_j$ and $C_j$ can be easily computed using the usual formulas for the conditional mean and variance of a multivariate normal random vector. Specifically, let $Z^*$ denote the missing components of $Z$ and let $Y$ denote the observed components. Then
\[
E(Z^* \mid Y) = EZ^* + \mathrm{cov}(Z^*, Y)\,\mathrm{cov}(Y)^{-1}(Y - EY),
\]
\[
\mathrm{cov}(Z^* \mid Y) = \mathrm{cov}(Z^*) - \mathrm{cov}(Z^*, Y)\,\mathrm{cov}(Y)^{-1}\,\mathrm{cov}(Y, Z^*),
\]
where all of the values on the right-hand side are computed with respect to $\theta' = (\mu', \Sigma')$.
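A minimal sketch of this EM update on simulated data with components missing completely at random (all settings are arbitrary; the boolean masks play the role of the index maps $\sigma_j$):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 2000
mu_true = np.array([1.0, -1.0, 0.5])
A = rng.normal(size=(d, d)); Sigma_true = A @ A.T + np.eye(d)
Z = rng.multivariate_normal(mu_true, Sigma_true, size=n)
obs = rng.random((n, d)) < 0.7          # True where a component is observed
obs[:, 0] |= ~obs.any(axis=1)           # ensure every row has an observed entry
Y = np.where(obs, Z, np.nan)

mu, Sigma = np.zeros(d), np.eye(d)      # starting values
for it in range(200):
    eta = np.empty((n, d))
    V = np.zeros((d, d))                # running sum of Var(Z_j | Y_j)
    for j in range(n):
        o, m = obs[j], ~obs[j]
        eta[j, o] = Y[j, o]             # observed components are fixed
        if m.any():
            # conditional mean and variance of the missing block
            B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
            eta[j, m] = mu[m] + B @ (Y[j, o] - mu[o])
            V[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
    # M-step: mu = eta_bar; Sigma = mean conditional variance + spread of eta
    mu = eta.mean(axis=0)
    Sigma = V / n + (eta - mu).T @ (eta - mu) / n

print(mu)   # compare with mu_true
```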

General optimization transfer. The basic idea behind the EM algorithm can be extended to optimization problems where there is no data augmentation, or even to problems that have no likelihood. In fact, the reason that the procedure works has only to do with convexity, and nothing whatsoever to do with probability. We retain the idea of the minorant $Q(\theta \mid \theta')$, shifted by $Q(\theta' \mid \theta')$ so that $Q(\theta' \mid \theta') = 0$ and
\[
L(\theta) \ge Q(\theta \mid \theta') + L(\theta'),
\]
so that increasing $Q(\theta \mid \theta')$ (with respect to $\theta$) drives $L(\theta)$ uphill. There are many ways of constructing such a $Q(\cdot)$. For example, if $L(\theta)$ is convex and differentiable, then
\[
L(\theta) \ge \nabla L(\theta')'(\theta - \theta') + L(\theta'),
\]
suggesting $Q(\theta \mid \theta') = \nabla L(\theta')'(\theta - \theta')$.

Simulated Annealing

Suppose we want to maximize a real-valued function $f$ with domain $S$. Simulated annealing is an attractive optimization procedure if any of the following hold: (i) $f$ has many local extreme values, (ii) $S$ is discrete, (iii) $f$ is not smooth, or the derivative of $f$ is much more expensive to compute than the function value.

Suppose we have a family of proposal distributions on $S$ indexed by $X' \in S$: $R(\cdot \mid X')$. From a starting value $X_0$, define a sequence of points in $S$ using the rules
\[
Z_n \sim R(\cdot \mid X_{n-1}),
\]
\[
P(X_n = Z_n) = \min\Big(\exp\big((f(Z_n) - f(X_{n-1}))/T_n\big),\ 1\Big), \qquad
P(X_n = X_{n-1}) = 1 - P(X_n = Z_n).
\]
Note that if $f(Z_n) > f(X_{n-1})$ then $X_n = Z_n$: uphill proposals are always accepted, but some downhill proposals are accepted as well.

As $T_n \to 0$, downhill proposals are accepted less often. If $T_n \to 0$ sufficiently slowly, then almost-sure convergence to the global maximum occurs for a reasonable class of problems. Theoretically, the rate of $T_n$ should have the form $T_n \propto 1/\log(c + n)$. However, this rate is too slow to be useful for most practical problems, since for $n$ of manageable size the estimate $X_n$ will have substantial variability. There is some experimental evidence that for many problems the optimal value can be identified using a faster heuristic cooling rule, such as decreasing $T_n$ by 10% each time a fixed number of proposals have been accepted.
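A minimal sketch on a discrete problem (all settings, including the 10% cooling rule from above, are illustrative): maximizing a multimodal function on $S = \{0, \ldots, 99\}$ with a nearest-neighbor random-walk proposal.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # multimodal test function on {0, ..., 99} (arbitrary choice);
    # it has several local maxima, with the global maximum at x = 99
    return np.sin(x / 5.0) + 0.05 * x

x = 0                                # starting value
best = x
T, accepted = 1.0, 0
for n in range(50000):
    z = min(max(x + rng.choice([-1, 1]), 0), 99)   # nearest-neighbor proposal
    if np.log(rng.random()) < (f(z) - f(x)) / T:   # accept uphill always,
        x, accepted = z, accepted + 1              # downhill with prob exp(df/T)
        if f(x) > f(best):
            best = x
    if accepted >= 300:                            # heuristic cooling rule:
        T, accepted = 0.9 * T, 0                   # cut T by 10% per 300 acceptances
print(best, f(best))   # should find, or come close to, the maximum at x = 99
```

Tracking the best value visited, as done here, is a common safeguard: even if the chain later freezes in a local mode, the reported solution reflects the mobile high-temperature phase.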