Optimization

September 4, 2018
Optimization problem

An optimization problem is the problem of finding the best solution for an objective function. Optimization methods play an important role in statistics, for example, in finding the maximum likelihood estimate (MLE).

- Unconstrained vs. constrained optimization problem: whether there are constraints on the solution space.
- Most algorithms are based on iterative procedures.

We will spend the next few lectures on several optimization methods, in the context of statistics:
- Newton-Raphson, Fisher scoring, etc.
- EM and MM.
- Hidden Markov models.
- Linear and quadratic programming.
Review: Newton-Raphson (NR) method

Goal: find the root of the equation f(θ) = 0.

Approach:
1. Choose an initial value θ^(0) as the starting point.
2. By Taylor expansion at θ^(0), we have f(θ) ≈ f(θ^(0)) + f'(θ^(0))(θ − θ^(0)). Setting f(θ) = 0 gives an update of the parameter: θ^(1) = θ^(0) − f(θ^(0))/f'(θ^(0)).
3. Repeat the update until convergence: θ^(k+1) = θ^(k) − f(θ^(k))/f'(θ^(k)).
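The update rule above fits in a few lines; a minimal Python sketch (the function name and the √2 example are ours, not from the slides):

```python
def newton_raphson(f, fprime, theta0, tol=1e-10, max_iter=100):
    """Find a root of f via theta_{k+1} = theta_k - f(theta_k)/f'(theta_k)."""
    theta = theta0
    for _ in range(max_iter):
        step = f(theta) / fprime(theta)
        theta -= step
        if abs(step) < tol:     # stop once the update is negligible
            break
    return theta

# Example: root of f(theta) = theta^2 - 2, i.e., sqrt(2)
root = newton_raphson(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta0=1.0)
```

Starting from θ^(0) = 1, the iterates 1.5, 1.41667, 1.414216, ... converge to √2 in a handful of steps.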
NR method convergence rate

Quadratic convergence: let θ* be the solution. Then

  lim_{k→∞} |θ^(k+1) − θ*| / |θ^(k) − θ*|² = c   (rate c > 0, order 2)

The number of significant digits nearly doubles at each step (in a neighborhood of θ*).

Proof: By Taylor expansion (to the second order) at θ^(k),

  0 = f(θ*) = f(θ^(k)) + f'(θ^(k))(θ* − θ^(k)) + (1/2) f''(ξ^(k))(θ* − θ^(k))²,  ξ^(k) ∈ [θ*, θ^(k)].

Dividing the equation by f'(θ^(k)) gives

  −f(θ^(k))/f'(θ^(k)) − (θ* − θ^(k)) = [f''(ξ^(k)) / (2 f'(θ^(k)))] (θ* − θ^(k))².

The definition of θ^(k+1) = θ^(k) − f(θ^(k))/f'(θ^(k)) then gives

  θ^(k+1) − θ* = [f''(ξ^(k)) / (2 f'(θ^(k)))] (θ* − θ^(k))².

What conditions are needed?
- f'(θ^(k)) ≠ 0 in the neighborhood of θ*
- f''(ξ^(k)) is bounded
- the starting point is sufficiently close to the root θ*
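The error recursion above can be observed numerically; a short Python check on the toy problem f(θ) = θ² − 2 (our example), where the error roughly squares at each iteration:

```python
import math

def f(t):
    return t * t - 2.0

def fprime(t):
    return 2.0 * t

theta, root = 1.0, math.sqrt(2.0)
errors = []
for _ in range(4):
    theta = theta - f(theta) / fprime(theta)   # one NR step
    errors.append(abs(theta - root))
# errors shrink roughly like err_{k+1} ~ c * err_k^2 (here c = |f''/(2 f')| ~ 0.35)
```

The recorded errors are about 8.6e-2, 2.5e-3, 2.1e-6, 1.6e-12: each one is on the order of the previous error squared.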
Review: maximum likelihood

Here is a list of some definitions related to the maximum likelihood estimate:
- Parameter θ, a p-vector
- Data X
- Log-likelihood l(θ) = log Pr(X | θ)
- Score function ∇l(θ) = (∂l/∂θ_1, ..., ∂l/∂θ_p)
- Hessian matrix ∇²l(θ) = {∂²l/∂θ_i ∂θ_j}_{i,j=1,...,p}
- Fisher information I(θ) = −E ∇²l(θ) = E ∇l(θ){∇l(θ)}ᵀ
- Observed information −∇²l(θ̂)

When θ* is a local maximum of l, ∇l(θ*) = 0 and ∇²l(θ*) is negative definite.
Application of the NR method in MLE: when θ is a scalar

Maximum likelihood estimation (MLE): θ̂ = arg max_θ l(θ).

Approach: find θ̂ such that l'(θ̂) = 0. If the closed-form solution of l'(θ) = 0 is difficult to obtain, one can use the NR method (replace f by l'). The NR update for solving the MLE is:

  θ^(k+1) = θ^(k) − l'(θ^(k)) / l''(θ^(k)).
What can go wrong?

- Bad starting point: may not converge to the global maximum.
- Saddle point: ∇l(θ̂) = 0, but ∇²l(θ̂) is neither negative definite nor positive definite (a stationary point but not a local extremum; can be used to check the likelihood).

[Figures: starting point & local extremum; saddle points of l(θ) = θ³ and l(θ_1, θ_2) = θ_1² − θ_2².]
Generalization to higher dimensions: when θ is a vector

General algorithm:
1. (Starting point) Pick a starting point θ^(0) and let k = 0.
2. (Iteration) Determine the direction d^(k) (a p-vector) and the step size α^(k) (a scalar), and calculate θ^(k+1) = θ^(k) + α^(k) d^(k) such that l(θ^(k+1)) > l(θ^(k)).
3. (Stopping criteria) Stop the iteration if

  |l(θ^(k+1)) − l(θ^(k))| / (|l(θ^(k))| + ε_1) < ε_2, or
  |θ^(k+1)_j − θ^(k)_j| / (|θ^(k)_j| + ε_1) < ε_2 for j = 1, ..., p,

for precisions such as ε_1 = 10⁻⁴ and ε_2 = 10⁻⁶. Otherwise go to 2.

Key: determine the direction and the step size.
Generalization to higher dimensions (continued)

Determining the direction (general framework, details later): we generally pick d^(k) = R⁻¹ ∇l(θ^(k)), where R is a positive definite matrix.

Choosing a step size (given the direction):
- Step halving: to find α^(k) such that l(θ^(k+1)) > l(θ^(k)), start at a large value of α^(k) and halve it until l(θ^(k+1)) > l(θ^(k)). Simple and robust, but relatively slow.
- Line search: to find α^(k) = arg max_α l(θ^(k) + α d^(k)), approximate l(θ^(k) + α d^(k)) by a polynomial interpolation and find the α^(k) that maximizes the polynomial. Fast.
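Step halving fits in a few lines; a minimal Python sketch (the function name and the toy objective l(θ) = −θ² are ours):

```python
def step_halving(l, theta, direction, alpha0=1.0, max_halvings=30):
    """Start from a large alpha and halve it until l increases."""
    alpha = alpha0
    base = l(theta)
    for _ in range(max_halvings):
        if l(theta + alpha * direction) > base:   # found an ascent step
            return alpha
        alpha /= 2.0
    return alpha

# Toy example: l(theta) = -theta^2 at theta = 1; steepest ascent
# direction is the gradient d = l'(1) = -2.
alpha = step_halving(lambda t: -t * t, 1.0, -2.0)
```

Here α = 1 overshoots past the maximum (θ moves to −1, no increase), but α = 1/2 lands exactly on θ = 0 and is accepted.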
Polynomial interpolation

Given a set of p + 1 data points from the function f(α) ≡ l(θ^(k) + α d^(k)), we can find a unique polynomial of degree p that goes through the p + 1 data points. (For a quadratic approximation, we only need 3 data points.)
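For the quadratic case, the vertex of the parabola through three points (α_i, f(α_i)) has a closed form; a small Python sketch (the helper name is ours):

```python
def quadratic_interp_vertex(g, a0, a1, a2):
    """Fit a parabola through (a_i, g(a_i)), i = 0, 1, 2, and return the
    location of its stationary point (the candidate step size)."""
    y0, y1, y2 = g(a0), g(a1), g(a2)
    num = y0 * (a1**2 - a2**2) + y1 * (a2**2 - a0**2) + y2 * (a0**2 - a1**2)
    den = 2.0 * (y0 * (a1 - a2) + y1 * (a2 - a0) + y2 * (a0 - a1))
    return num / den

# If g is itself quadratic, the vertex is recovered exactly:
alpha_star = quadratic_interp_vertex(lambda a: -(a - 0.3) ** 2, 0.0, 0.5, 1.0)
```

For a general l, one evaluates g(α) = l(θ^(k) + α d^(k)) at three trial step sizes and takes the vertex as α^(k).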
Survey of basic methods

1. Steepest ascent: R = I = identity matrix
- d^(k) = ∇l(θ^(k))
- α^(k) = arg max_α l(θ^(k) + α ∇l(θ^(k))), or a small fixed number
- θ^(k+1) = θ^(k) + α^(k) ∇l(θ^(k))

Why is ∇l(θ^(k)) the steepest ascent direction? By Taylor expansion at θ^(k),

  l(θ^(k) + Δ) − l(θ^(k)) = Δᵀ ∇l(θ^(k)) + o(‖Δ‖).

By the Cauchy-Schwarz inequality,

  Δᵀ ∇l(θ^(k)) ≤ ‖Δ‖ ‖∇l(θ^(k))‖.

The equality holds at Δ = α ∇l(θ^(k)). So when Δ = α ∇l(θ^(k)), l(θ^(k) + Δ) increases the most.

- Easy to implement; only requires the first derivative (gradient/score).
- Guarantees an increase at each step no matter where you start.
- Converges slowly: the directions of two consecutive steps are orthogonal, so the algorithm zigzags toward the maximum.
Steepest ascent (continued)

When α^(k) is chosen as arg max_α l(θ^(k) + α ∇l(θ^(k))), the directions of two consecutive steps are orthogonal, i.e., [∇l(θ^(k))]ᵀ ∇l(θ^(k+1)) = 0.

Proof: By the definition of α^(k) and θ^(k+1),

  0 = ∂l(θ^(k) + α ∇l(θ^(k)))/∂α |_{α=α^(k)} = ∇l(θ^(k) + α^(k) ∇l(θ^(k)))ᵀ ∇l(θ^(k)) = ∇l(θ^(k+1))ᵀ ∇l(θ^(k)).
Example: Steepest Ascent

Maximize the function f(x) = 6x − x³.

[Figure: plot of f(x) = 6x − x³ for x ∈ [−2, 2].]
Example: Steepest Ascent (cont.) 13/34 fun0 <- functon(x) return(- xˆ3 + 6*x) grd0 <- functon(x) return(- 3*xˆ2 + 6) # target functon # gradent # Steepest Ascent Algorthm Steepest_Ascent <- functon(x, fun=fun0, grd=grd0, step=0.01, kmax=1000, tol1=1e-6, tol2=1e-4) { dff <- 2*x # use a large value to get nto the followng "whle" loop k <- 0 # count teraton whle ( all(abs(dff) > tol1*(abs(x)+tol2) ) & k <= kmax) # stop crtera { g_x <- grd(x) # calculate gradent usng x dff <- step * g_x # calculate the dfference used n the stop crtera x <- x + dff # update x k <- k + 1 # update teraton } f_x = fun(x) } return(lst(teraton=k, x=x, f_x=f_x, g_x=g_x))
Example: Steepest Ascent (cont.) 14/34 > Steepest_Ascent(x=2, step=0.01) $teraton [1] 117 $x [1] 1.414228 $f_x [1] 5.656854 $g_x [1] -0.0001380379 > Steepest_Ascent(x=1, step=-0.01) $teraton [1] 159 $x [1] -1.414199 $f_x [1] -5.656854 $g_x [1] 0.0001370128
In large datasets

The data log-likelihood is usually a sum over n observations: l(θ) = Σ_{i=1}^n l(x_i; θ). When n is large, this poses a computational burden. One can implement a stochastic version of the algorithm: stochastic gradient descent (SGD). (Note: gradient descent is just steepest descent.)

- Simple SGD algorithm: replace the gradient ∇l(θ) by the gradient computed from a single sample, ∇l(x_i; θ), where x_i is randomly sampled.
- Mini-batch SGD algorithm: compute the gradient based on a small number of observations.

Advantages of SGD:
- Evaluates the gradient at one (or a few) observations, so it requires less memory.
- Better at escaping local minima (the gradient is noisy).

Disadvantage of SGD: slower convergence.
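A minimal sketch of mini-batch SGD on a toy problem: estimating a mean m by descending the loss Σ_i (x_i − m)², using noisy gradients from random mini-batches (all names and the toy data are ours):

```python
import random

def sgd_mean(data, lr=0.1, epochs=200, batch=4, seed=0):
    """Minimize sum_i (x_i - m)^2 over m with mini-batch SGD.
    The exact minimizer is the sample mean of `data`."""
    rng = random.Random(seed)
    m = 0.0
    for _ in range(epochs):
        sample = rng.sample(data, batch)                     # random mini-batch
        grad = sum(2.0 * (m - x) for x in sample) / batch    # noisy gradient
        m -= lr * grad
    return m

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # true minimizer: mean = 3.5
m_hat = sgd_mean(data)
```

Because each mini-batch gradient is noisy, the iterates hover around the true minimizer 3.5 rather than converging exactly, which illustrates both the "escapes local minima" advantage and the "slower convergence" disadvantage above.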
Survey of basic methods (continued)

2. Newton-Raphson: R = −∇²l(θ^(k)) = observed information
- d^(k) = −[∇²l(θ^(k))]⁻¹ ∇l(θ^(k))
- θ^(k+1) = θ^(k) − [∇²l(θ^(k))]⁻¹ ∇l(θ^(k))
- α^(k) = 1 for all k
- Fast, quadratic convergence
- Needs very good starting points

Theorem: If R is positive definite, the equation system R d^(k) = ∇l(θ^(k)) has a unique solution for the direction d^(k), and the direction ensures ascent of l(θ).

Proof: When R is positive definite, it is invertible, so we have a unique solution d^(k) = R⁻¹ ∇l(θ^(k)). Let θ^(k+1) = θ^(k) + α d^(k) = θ^(k) + α R⁻¹ ∇l(θ^(k)). By Taylor expansion,

  l(θ^(k+1)) ≈ l(θ^(k)) + α ∇l(θ^(k))ᵀ R⁻¹ ∇l(θ^(k)).

The positive definiteness of R ensures that l(θ^(k+1)) > l(θ^(k)) for sufficiently small positive α.
Newton-Raphson vs. steepest ascent

- Newton-Raphson converges much faster than steepest ascent (gradient descent).
- NR requires the computation of the second derivative, which can be difficult and computationally expensive. In contrast, gradient descent requires only the first derivative, which is easy to compute.
- For poorly behaved (non-convex) objective functions, gradient-based methods are often more stable.
- Gradient-based methods (especially SGD) are widely used in modern machine learning.
Example: Newton Raphson 18/34 fun0 <- functon(x) return(- xˆ3 + 6*x) grd0 <- functon(x) return(- 3*xˆ2 + 6) hes0 <- functon(x) return(- 6*x) # target functon # gradent # Hessan # Newton-Raphson Algorthm Newton_Raphson <- functon(x, fun=fun0, grd=grd0, hes=hes0, kmax=1000, tol1=1e-6, tol2=1e-4) { dff <- 2*x k <- 0 whle ( all(abs(dff) > tol1*(abs(x)+tol2) ) & k <= kmax) { g_x <- grd(x) h_x <- hes(x) # calculate the second dervatve (Hessan) dff <- -g_x/h_x # calculate the dfference used by the stop crtera x <- x + dff k <- k + 1 } f_x = fun(x) } return(lst(teraton=k, x=x, f_x=f_x, g_x=g_x, h_x=h_x))
Example: Newton Raphson 19/34 > Newton_Raphson(x=2) $teraton [1] 5 $x [1] 1.414214 $f_x [1] 5.656854 $g_x [1] -1.353229e-11 $h_x [1] -8.485281 > Newton_Raphson(x=1) $teraton [1] 5 $x [1] 1.414214 $f_x [1] 5.656854 $g_x [1] -1.353229e-11 $h_x [1] -8.485281
Survey of basic methods (continued)

3. Modifications of Newton-Raphson

Fisher scoring: replace −∇²l(θ) with E[−∇²l(θ)]
- E[−∇²l(θ)] = E[∇l(θ) ∇l(θ)ᵀ] is always positive semi-definite, which stabilizes the algorithm.
- E[−∇²l(θ)] can have a simpler form than −∇²l(θ).
- Newton-Raphson and Fisher scoring are equivalent for parameter estimation in GLMs with canonical links.

Quasi-Newton: aka variable metric methods or secant methods.
- Approximates −∇²l(θ) in a way that avoids calculating the Hessian and its inverse.
- Has convergence properties similar to Newton.
Fisher Scoring: Example

In the Poisson regression model of n subjects, the responses Y_i ~ Poisson(λ_i), i.e., Pr(Y_i) = (Y_i!)⁻¹ λ_i^{Y_i} e^{−λ_i}. We know that λ_i = E(Y_i | X_i). We relate the mean of Y_i to X_i by g(λ_i) = X_i β. Taking derivatives on both sides,

  g'(λ_i) ∂λ_i/∂β = X_i  ⟹  ∂λ_i/∂β = X_i / g'(λ_i).

Log-likelihood: l(β) = Σ_{i=1}^n (Y_i log λ_i − λ_i), where the λ_i satisfy g(λ_i) = X_i β.

Maximum likelihood estimation: β̂ = arg max_β l(β).

Newton-Raphson needs

  l'(β) = Σ_i (Y_i/λ_i − 1) ∂λ_i/∂β = Σ_i (Y_i/λ_i − 1) [1/g'(λ_i)] X_i

  l''(β) = −Σ_i [Y_i/λ_i²][1/g'(λ_i)²] X_i² − Σ_i (Y_i/λ_i − 1)[g''(λ_i)/g'(λ_i)³] X_i²
         = −Σ_i [1/(λ_i g'(λ_i)²)] X_i² − Σ_i (Y_i/λ_i − 1)[1/(λ_i g'(λ_i)²)] X_i² − Σ_i (Y_i/λ_i − 1)[g''(λ_i)/g'(λ_i)³] X_i²
Fisher Scoring: Example (continued)

Fisher scoring needs l'(β) and

  E[−l''(β)] = Σ_i [1/(λ_i g'(λ_i)²)] X_i²,

which is −l''(β) without the extra (mean-zero) terms.

With the canonical link for Poisson regression, g(λ_i) = log λ_i, we have g'(λ_i) = λ_i⁻¹ and g''(λ_i) = −λ_i⁻². So the extra terms equal zero (check this!) and we conclude that Newton-Raphson and Fisher scoring are equivalent.
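With the canonical log link, λ_i = exp(X_i β), the score reduces to Σ_i (Y_i − λ_i) X_i and the expected information to Σ_i λ_i X_i². A minimal Python sketch for a single coefficient (the toy data are ours, chosen so that y_i = 2^{x_i} and the MLE is exactly log 2):

```python
import math

def fisher_scoring_poisson(x, y, beta0=0.0, iters=25):
    """Poisson regression with canonical log link, one coefficient:
    lambda_i = exp(beta * x_i); score U = sum (y_i - lambda_i) x_i;
    expected information I = sum lambda_i x_i^2; update beta <- beta + U/I."""
    beta = beta0
    for _ in range(iters):
        lam = [math.exp(beta * xi) for xi in x]
        U = sum((yi - li) * xi for xi, yi, li in zip(x, y, lam))
        I = sum(li * xi * xi for xi, li in zip(x, lam))
        beta += U / I
    return beta

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 4.0, 8.0]   # y_i = 2^{x_i}, so the MLE is log(2)
beta_hat = fisher_scoring_poisson(x, y)
```

Because the link is canonical, these iterations are exactly the Newton-Raphson iterations as well.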
Quasi-Newton

1. Davidon-Fletcher-Powell (DFP) QNR algorithm

Let Δ∇l^(k) = ∇l(θ^(k)) − ∇l(θ^(k−1)) and Δθ^(k) = θ^(k) − θ^(k−1). Approximate the inverse of the negative Hessian by

  G^(k+1) = G^(k) + [Δθ^(k) (Δθ^(k))ᵀ] / [(Δθ^(k))ᵀ Δ∇l^(k)] − [G^(k) Δ∇l^(k) (Δ∇l^(k))ᵀ G^(k)] / [(Δ∇l^(k))ᵀ G^(k) Δ∇l^(k)].

Use the starting matrix G^(0) = I.

Theorem: If the starting matrix G^(0) is positive definite, the above formula ensures that every G^(k) during the iteration is positive definite.
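One DFP update can be sketched directly from the formula. The sketch below uses the standard quasi-Newton (minimization) convention, in which G approximates the inverse Hessian of the objective; its defining secant property, G^(k+1) Δ∇l^(k) = Δθ^(k), can be checked numerically (the vectors here are illustrative):

```python
import numpy as np

def dfp_update(G, s, y):
    """One Davidon-Fletcher-Powell update.
    G: current symmetric positive definite approximation (inverse Hessian)
    s: parameter difference theta_k - theta_{k-1}
    y: gradient difference grad_k - grad_{k-1}
    Returns G_next satisfying the secant condition G_next @ y = s."""
    s = s.reshape(-1, 1)
    y = y.reshape(-1, 1)
    Gy = G @ y
    return (G
            + (s @ s.T) / (s.T @ y).item()
            - (Gy @ Gy.T) / (y.T @ Gy).item())

G0 = np.eye(3)                     # starting matrix G^(0) = I
s = np.array([1.0, 0.0, 1.0])      # illustrative parameter difference
y = np.array([1.0, 1.0, 0.0])      # illustrative gradient difference
G1 = dfp_update(G0, s, y)
```

Since G1 @ y = s, a Newton-like step G1 @ grad can be taken without ever forming or inverting the Hessian, which is the point of the method.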
Nonlinear Regression Models

Data: (x_i, y_i) for i = 1, ..., n.

Notation and assumptions:
- Model: y_i = h(x_i, β) + ε_i, where ε_i ~ i.i.d. N(0, σ²) and h(·) is known
- Residual: e_i(β) = y_i − h(x_i, β)
- Jacobian: {J(β)}_{ij} = ∂h(x_i, β)/∂β_j = −∂e_i(β)/∂β_j, an n × p matrix

Goal: obtain the MLE β̂ = arg min_β S(β), where S(β) = Σ_i {y_i − h(x_i, β)}² = [e(β)]ᵀ e(β).

We could use the previously discussed Newton-Raphson algorithm:
- Gradient: g_j(β) = ∂S(β)/∂β_j = 2 Σ_i e_i(β) ∂e_i(β)/∂β_j, i.e., g(β) = −2 J(β)ᵀ e(β)
- Hessian: H_{jr}(β) = ∂²S(β)/∂β_j ∂β_r = 2 Σ_i {e_i(β) ∂²e_i(β)/∂β_j ∂β_r + [∂e_i(β)/∂β_j][∂e_i(β)/∂β_r]}

Problem: the Hessian could be hard to obtain.
Gauss-Newton algorithm

Recall that in linear regression models we minimize

  S(β) = Σ_i {y_i − x_iᵀβ}².

Because S(β) is a quadratic function, it is easy to get the MLE β̂ = (Σ_i x_i x_iᵀ)⁻¹ Σ_i x_i y_i.

Now in the nonlinear regression models, we want to minimize

  S(β) = Σ_i {y_i − h(x_i, β)}².

Idea: approximate h(x_i, β) by a linear function, iteratively at β^(k). Given β^(k), by Taylor expansion of h(x_i, β) at β^(k), S(β) becomes

  S(β) ≈ Σ_i {y_i − h(x_i, β^(k)) − (β − β^(k))ᵀ ∂h(x_i, β^(k))/∂β}².
Gauss-Newton algorithm (cont.)

1. Find a good starting point β^(0).
2. At step k + 1:
(a) Form e(β^(k)) and J(β^(k)).
(b) Use a standard linear regression routine to obtain δ^(k) = [J(β^(k))ᵀ J(β^(k))]⁻¹ J(β^(k))ᵀ e(β^(k)).
(c) Obtain the new estimate β^(k+1) = β^(k) + δ^(k).

- No need to compute the Hessian matrix.
- Needs good starting values.
- Requires J(β^(k))ᵀ J(β^(k)) to be invertible.
- This is not a general optimization method; it is only applicable to least-squares problems.
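The steps above can be sketched for a toy one-parameter model h(x, b) = exp(b·x), where the Jacobian is J_i = ∂h(x_i, b)/∂b = x_i exp(b·x_i) (the model, data, and starting value are illustrative, not from the slides):

```python
import numpy as np

def gauss_newton_exp(x, y, b0=0.1, iters=50):
    """Gauss-Newton for h(x, b) = exp(b*x) with a scalar parameter b.
    Step: delta = (J'J)^{-1} J'e, then b <- b + delta."""
    b = b0
    for _ in range(iters):
        h = np.exp(b * x)
        e = y - h                      # residuals e_i = y_i - h(x_i, b)
        J = x * h                      # Jacobian dh/db, an n-vector here
        delta = (J @ e) / (J @ J)      # normal equations (scalar case)
        b += delta
        if abs(delta) < 1e-12:
            break
    return b

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.exp(0.7 * x)                    # noise-free data generated at b = 0.7
b_hat = gauss_newton_exp(x, y)
```

Note only first derivatives of h appear; the second-derivative term of the full Hessian is dropped, which is exactly the Gauss-Newton approximation.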
Example: Generalized linear models (GLM)

Data: (y_i, x_i) for i = 1, ..., n.

Notation and assumptions:
- Mean: E(y | x) = µ
- Link g: g(µ) = xᵀβ
- Variance function V: Var(y | x) = φ V(µ)
- Log-likelihood (exponential family): l(θ, φ; y) = {yθ − b(θ)}/a(φ) + c(y, φ)

We obtain:
- Score function: l' = {y − b'(θ)}/a(φ)
- Observed information: −l'' = b''(θ)/a(φ)
- Mean (in terms of θ): E(y | x) = a(φ) E(l') + b'(θ) = b'(θ)
- Variance (θ, φ): Var(y | x) = E(y − b'(θ))² = a(φ)² E(l' l') = a(φ)² E(−l'') = b''(θ) a(φ)
- Canonical link: g such that g(µ) = θ, i.e., g⁻¹ = b'
- Generally we have a(φ) = φ/w, in which case φ will drop out of the following.
GLM

Model                 Normal     Poisson   Binomial         Gamma
φ                     σ²         1         1/m              1/ν
b(θ)                  θ²/2       exp(θ)    log(1 + e^θ)     −log(−θ)
µ                     θ          exp(θ)    e^θ/(1 + e^θ)    −1/θ
Canonical link g      identity   log       logit            reciprocal
Variance function V   1          µ         µ(1 − µ)         µ²
Towards iteratively reweighted least squares

In linear regression models, E(y_i | x_i) = x_iᵀβ, so we minimize

  S(β) = Σ_i {y_i − x_iᵀβ}².

Because S(β) is a quadratic function, it is easy to get the MLE β̂ = (Σ_i x_i x_iᵀ)⁻¹ Σ_i x_i y_i.

In generalized linear models, consider constructing a similar quadratic function S(β).

Question: can we use S(β) = Σ_i {g(y_i) − x_iᵀβ}²?
Answer: no, because E{g(y_i) | x_i} ≠ x_iᵀβ.

Idea: approximate g(y_i) by a linear function with expectation x_iᵀβ^(k), iteratively at β^(k).
Iteratively reweighted least squares

Linearize g(y_i) around µ̂_i^(k) = g⁻¹(x_iᵀβ^(k)), and denote the linearized value by ỹ_i^(k):

  ỹ_i^(k) = g(µ̂_i^(k)) + (y_i − µ̂_i^(k)) g'(µ̂_i^(k)).

Check the variances of the ỹ_i^(k) and use their inverses as weights:

  W_i^(k) = {Var(ỹ_i^(k))}⁻¹ = [{g'(µ̂_i^(k))}² V(µ̂_i^(k))]⁻¹.

Given β^(k), we consider minimizing

  S(β) = Σ_i W_i^(k) {ỹ_i^(k) − x_iᵀβ}².

IRLS algorithm:
1. Start with initial estimates, generally µ̂_i^(0) = y_i.
2. Form ỹ_i^(k) and W_i^(k).
3. Estimate β^(k+1) by regressing ỹ^(k) on x with weights W^(k).
4. Form µ̂_i^(k+1) = g⁻¹(x_iᵀβ^(k+1)) and return to step 2.
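A minimal Python sketch of the IRLS loop for Poisson regression with the canonical log link, where g'(µ) = 1/µ and V(µ) = µ, so the weights reduce to W_i = µ_i (the function name, starting value, and data are illustrative):

```python
import numpy as np

def irls_poisson(X, y, iters=25):
    """IRLS for Poisson regression with canonical log link.
    Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i), g'(mu) = 1/mu.
    Weights: W_i = 1 / (g'(mu_i)^2 * V(mu_i)) = mu_i, since V(mu) = mu."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu              # linearized response ~y
        w = mu                               # weights W
        XtW = X.T * w                        # broadcast weights over X.T rows
        beta = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares
    return beta

# Intercept-only example: the Poisson MLE of the rate is the sample mean,
# so beta should converge to log(mean(y)) = log(3).
X = np.ones((4, 1))
y = np.array([2.0, 3.0, 4.0, 3.0])
beta_hat = irls_poisson(X, y)
```

Each pass through the loop is one weighted least squares fit, which is why no special optimization routine is needed.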
Iteratively reweighted least squares (continued)

Model        Poisson   Binomial          Gamma
µ = g⁻¹(η)   e^η       e^η/(1 + e^η)     1/η
g'(µ)        1/µ       1/[µ(1 − µ)]      −1/µ²
V(µ)         µ         µ(1 − µ)          µ²

McCullagh and Nelder (1983) justified IRLS by showing that IRLS is equivalent to Fisher scoring. In the case of the canonical link, IRLS is also equivalent to Newton-Raphson. IRLS is attractive because no special optimization algorithm is required, just a subroutine that computes weighted least squares estimates.
Miscellaneous things

Dispersion parameter: when we do not take φ = 1, the usual estimate is via the method of moments:

  φ̂ = [1/(n − p)] Σ_i (y_i − µ̂_i)² / V(µ̂_i).

Standard errors: Var(β̂) = φ̂ (XᵀŴX)⁻¹.

Quasi-likelihood: pick a link and a variance function, and IRLS can proceed without worrying about the model. In other words, IRLS is a good thing!
A quick review

- Optimization methods are important in statistics (e.g., to find the MLE) and in machine learning in general (to minimize some loss function).
- Maximizing/minimizing an objective function is achieved by solving the equation that sets the first derivative to 0 (need to check the second derivative).
- Steepest ascent method: only needs the gradient; slow convergence. For large datasets with ill-behaved objective functions, the stochastic version (SGD) usually works better.
- Newton-Raphson (NR) method: quadratic convergence rate; can get stuck in a local maximum. In higher dimensions, the problems are to find directions and step sizes in each iteration.
- Fisher scoring: uses the expected information matrix; NR uses the observed information matrix. The expected information is more stable and simpler.
- Fisher scoring and Newton-Raphson are equivalent under the canonical link.
- Gauss-Newton algorithm for nonlinear regression: the Hessian matrix is not needed.