15 Lagrange Multipliers
The Method of Lagrange Multipliers is a powerful technique for constrained optimization. While it has applications far beyond machine learning (it was originally developed to solve physics equations), it is used for several key derivations in machine learning.

The problem set-up is as follows. We wish to find extrema (i.e., maxima or minima) of a differentiable objective function

    E(x) = E(x_1, x_2, \ldots, x_D).    (1)

If we have no constraints on the problem, then the extrema must necessarily satisfy the following system of equations in terms of the gradient of $E$:

    \nabla E = 0.    (2)

This is equivalent to requiring that $\partial E / \partial x_i = 0$ for all $i$. This equation says that there is no way to infinitesimally perturb $x$ to get a different value for $E$; the objective function is locally flat.

Now, however, our goal will be to find extrema subject to a constraint:

    g(x) = 0.    (3)

In other words, we want to find the extrema among the set of points $x$, all of which satisfy $g(x) = 0$. It is sometimes possible to reparameterize the problem to eliminate the constraint (i.e., so that the new parameterization includes all possible solutions to $g(x) = 0$), but this can be awkward in some cases, and impossible in others.

Given the constraint $g(x) = 0$, we are no longer looking for a point where no perturbation in any direction changes $E$. Instead, we need to find a point at which perturbations that satisfy the constraint do not change $E$. This can be expressed by the following condition:

    \nabla E + \lambda \nabla g = 0,    (4)

for some arbitrary scalar value $\lambda$. First note that, for points on the contour $g(x) = 0$, the gradient $\nabla g$ is always perpendicular to the contour (this is a great exercise if you don't remember how to prove it). Hence the expression $\nabla E = -\lambda \nabla g$ says that the gradient of $E$ must be parallel to the gradient of the constraint contour at a possible solution point. In other words, any perturbation to $x$ that changes $E$ also makes the constraint become violated. Perturbations that do not change $g$, and hence still lie on the contour $g(x) = 0$, do not change $E$ either.
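To make condition (4) concrete, consider a small example that is not from the text: minimize $E(x) = x_1^2 + x_2^2$ subject to $g(x) = x_1 + x_2 - 1 = 0$. The constrained minimum is at $(1/2, 1/2)$, where $\nabla E = (1, 1)$ is parallel to $\nabla g = (1, 1)$, so condition (4) holds with $\lambda = -1$. A minimal numpy sketch of this check:

```python
import numpy as np

# Hypothetical example: minimize E(x) = x1^2 + x2^2
# subject to g(x) = x1 + x2 - 1 = 0.
# The constrained minimum is x = (1/2, 1/2).
x = np.array([0.5, 0.5])

grad_E = 2 * x                    # gradient of E at x: (1, 1)
grad_g = np.array([1.0, 1.0])     # gradient of g (constant)

# The gradients are parallel, so grad_E + lam * grad_g = 0 with lam = -1.
lam = -1.0
residual = grad_E + lam * grad_g

print(np.allclose(residual, 0.0))  # True: the stationarity condition holds
```

Note that perturbing $x$ along the constraint (in the direction $(1, -1)$) leaves $E$ unchanged to first order, exactly as the discussion above describes.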
Hence, our goal is to find a point $x$ that satisfies this gradient condition and also $g(x) = 0$.

In the method of Lagrange multipliers, we change the constrained optimization above into an unconstrained optimization with a new objective function, called the Lagrangian:

    L(x, \lambda) = E(x) + \lambda g(x).    (5)

Now, our goal is to find extrema of $L$ with respect to both $x$ and $\lambda$. The key fact is that extrema of the unconstrained objective $L$ are the extrema of the original constrained problem.

Copyright © 2018 Aaron Hertzmann and David J. Fleet
[Figure 1: The set of solutions to $g(x) = 0$ visualized as a curve. The gradient $\nabla g$ is always normal to the curve. At an extremal point, $\nabla E$ is parallel to $\nabla g$. Figure from Pattern Recognition and Machine Learning by Chris Bishop.]

So we have eliminated the nasty constraint by changing the objective function and also introducing new unknowns. To see why this works, let's look at the extrema of $L$. Because $L$ depends on two sets of parameters, its extrema must necessarily satisfy two gradient conditions, i.e.,

    \frac{\partial L}{\partial \lambda} = g(x) = 0    (6)
    \nabla_x L = \nabla E + \lambda \nabla g = 0.    (7)

One can immediately see that these gradient conditions are exactly the conditions given above: the first equation ensures that $g(x)$ is zero, and the second is our requirement that the gradients of $E$ and $g$ must be parallel. Using the Lagrangian is a convenient way to combine these two conditions into one unconstrained optimization.

15.1 Examples

Minimizing on a circle. We begin with a simple geometric example. We have the following constrained optimization problem:

    \operatorname{argmin}_{x,y} \; x + y    (8)
    subject to x^2 + y^2 = 1    (9)

In words, we want to find the point on the unit circle that minimizes $x + y$. The problem is depicted in Fig. 2. Here, $E(x, y) = x + y$ and $g(x, y) = x^2 + y^2 - 1$. The Lagrangian for this problem is given by

    L(x, y, \lambda) = x + y + \lambda (x^2 + y^2 - 1).    (10)
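The stationary points of this Lagrangian can also be found numerically by solving the gradient system $\nabla L = 0$ directly. A sketch, assuming scipy is available (the starting point is chosen to land on the minimizing branch):

```python
import numpy as np
from scipy.optimize import fsolve

# Stationarity of L(x, y, lam) = x + y + lam*(x^2 + y^2 - 1):
#   dL/dx   = 1 + 2*lam*x   = 0
#   dL/dy   = 1 + 2*lam*y   = 0
#   dL/dlam = x^2 + y^2 - 1 = 0
def grad_L(v):
    x, y, lam = v
    return [1 + 2 * lam * x,
            1 + 2 * lam * y,
            x ** 2 + y ** 2 - 1]

# Root-finding on the gradient system; this start converges to the minimum.
x, y, lam = fsolve(grad_L, [-1.0, -1.0, 1.0])
print(x, y)  # both approximately -1/sqrt(2)
```

The other root of the system, $x = y = +1/\sqrt{2}$, is the constrained maximum; which stationary point a numerical solver finds depends on the starting guess.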
[Figure 2: Illustration of the maximization on a circle problem. Image from Wikipedia.]

Setting the gradient of $L$ to zero with respect to $x$, $y$, and $\lambda$ gives us the following system of equations:

    \frac{\partial L}{\partial x} = 1 + 2 \lambda x = 0    (11)
    \frac{\partial L}{\partial y} = 1 + 2 \lambda y = 0    (12)
    \frac{\partial L}{\partial \lambda} = x^2 + y^2 - 1 = 0    (13)

The first two equations imply that $x = y$. Substituting this into the constraint and solving gives two solutions, $x = y = \pm 1/\sqrt{2}$. Substituting these two solutions into the objective, we find that the minimum occurs at $x = y = -1/\sqrt{2}$.

Estimating a Categorical distribution. A Categorical distribution is a distribution over a random variable $c$ with $K$ possible discrete (i.e., disjoint) states or outcomes. Accordingly, it is specified by $K$ probabilities, denoted here by $p_k$:

    P(c = k) \equiv p_k,    (14)

for $k = 1 \ldots K$, and let $p = (p_1, \ldots, p_K)$. For example, in coin-flipping the outcome of a coin flip follows a Bernoulli distribution, which is the special case of a Categorical distribution with $K = 2$, and $c = 1$ might indicate that the coin lands heads side up.

Suppose we observe $N$ independent draws from such a random process, i.e., we observe the sequence $c_{1:N}$. The likelihood of the observed data is therefore the product of the individual likelihoods:

    P(c_{1:N} \mid p) = \prod_{i=1}^{N} P(c_i \mid p) = \prod_{k=1}^{K} p_k^{N_k},    (15)
where $N_k$ is the number of times that $c_i = k$, i.e., the number of occurrences of the $k$-th state.

To estimate this Categorical distribution, we minimize the negative log-likelihood of the observed data:

    \min \; -\sum_k N_k \ln p_k    (16)
    subject to \sum_k p_k = 1, and p_k \ge 0 for all k.    (17)

The constraints here are required to ensure that the $p_k$'s form a valid probability distribution. One way to optimize this problem is to reparameterize the probabilities, i.e., to replace $p_K$ in the likelihood by $1 - \sum_{k=1}^{K-1} p_k$, and then optimize the unconstrained problem in closed form. While this method does work in this case, it breaks the natural symmetry of the problem, resulting in some messy calculations. (Moreover, this method often cannot be generalized to other problems.)

The Lagrangian for this problem is

    L(p, \lambda) = -\sum_k N_k \ln p_k + \lambda \left( \sum_k p_k - 1 \right).    (18)

Here, we omit the constraint that $p_k \ge 0$ and hope that this constraint will be satisfied by the solution (it will). Setting the gradient to zero gives

    \frac{\partial L}{\partial p_k} = -\frac{N_k}{p_k} + \lambda = 0, for all k    (19)
    \frac{\partial L}{\partial \lambda} = \sum_k p_k - 1 = 0    (20)

Multiplying $\partial L / \partial p_k = 0$ by $p_k$ and summing over $k$ yields

    0 = \sum_{k=1}^{K} \left( -N_k + \lambda p_k \right) = -N + \lambda,    (21)

since $\sum_k N_k = N$ and $\sum_k p_k = 1$. Hence, the optimal $\lambda = N$. Substituting this into $\partial L / \partial p_k = 0$ and solving yields the estimated probabilities

    p_k = \frac{N_k}{N},    (22)

which is the familiar maximum-likelihood estimator for a Categorical distribution.

Maximum variance PCA. In the original formulation of PCA, the goal is to find a low-dimensional projection of data points $y_i$. Here, suppose we just want to find a one-dimensional subspace, spanned by a vector $w$. In that case the subspace projection is given by

    x = w^T (y - b).    (23)
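Returning to the Categorical estimate for a moment: the closed-form result $p_k = N_k / N$ in Eqn. (22) is easy to check numerically. A sketch on made-up data (the sequence below is illustrative only), verifying that the closed-form estimate achieves a negative log-likelihood no larger than that of randomly drawn valid distributions:

```python
import numpy as np

# Toy observed sequence of N = 10 draws over K = 3 states (made up here).
c = np.array([0, 0, 1, 2, 2, 2, 0, 1, 0, 2])
K = 3

N_k = np.bincount(c, minlength=K)  # N_k: occurrences of state k
p_hat = N_k / N_k.sum()            # closed-form estimate p_k = N_k / N

def nll(p):
    # negative log-likelihood of the observed counts under distribution p
    return -np.sum(N_k * np.log(p))

# The closed-form estimate minimizes the NLL over the probability simplex,
# so it should beat any other valid distribution.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(K), size=100)
print(p_hat)                                       # [0.4 0.2 0.4]
print(all(nll(p_hat) <= nll(q) for q in others))   # True
```

This is only a spot check against random competitors, not a proof, but it matches the derivation: the constrained stationary point is the global minimizer of the negative log-likelihood on the simplex.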
One way to formulate PCA is as an optimization that finds the direction $w$ maximizing the variance of the projection, subject to the constraint $w^T w = 1$. The Lagrangian can be expressed as

    L(w, b, \lambda) = \frac{1}{N} \sum_i \left( x_i - \frac{1}{N} \sum_j x_j \right)^2 + \lambda (w^T w - 1)
                     = \frac{1}{N} \sum_i \left( w^T (y_i - b) - \frac{1}{N} \sum_j w^T (y_j - b) \right)^2 + \lambda (w^T w - 1)
                     = \frac{1}{N} \sum_i \left( w^T \left( y_i - b - \frac{1}{N} \sum_j (y_j - b) \right) \right)^2 + \lambda (w^T w - 1)
                     = \frac{1}{N} \sum_i \left( w^T (y_i - \bar{y}) \right)^2 + \lambda (w^T w - 1)
                     = \frac{1}{N} \sum_i w^T (y_i - \bar{y}) (y_i - \bar{y})^T w + \lambda (w^T w - 1)
                     = w^T \left( \frac{1}{N} \sum_i (y_i - \bar{y}) (y_i - \bar{y})^T \right) w + \lambda (w^T w - 1),    (24)

where $\bar{y} = \sum_i y_i / N$. Solving $\partial L / \partial w = 0$ yields

    \left( \frac{1}{N} \sum_i (y_i - \bar{y}) (y_i - \bar{y})^T \right) w = \lambda w,    (25)

absorbing the arbitrary sign of $\lambda$. This is the eigenvector equation. That is, $w$ must be an eigenvector of the sample covariance matrix of the $y_i$'s, and $\lambda$ must be the corresponding eigenvalue. In order to determine which one, we can substitute this equality into the Lagrangian to obtain

    L = w^T \lambda w + \lambda (w^T w - 1) = \lambda,    (26)

since $w^T w = 1$. Since our goal is to maximize the variance, we choose the eigenvector $w$ which has the largest eigenvalue $\lambda$.

We have not yet selected $b$, but it is clear that the value of the objective function does not depend on $b$, so we might as well set it to be the mean of the data, $b = \sum_i y_i / N$, which results in the $x_i$'s having zero mean, i.e., $\sum_i x_i / N = 0$.

15.2 Least-Squares PCA in 1D

Let's now consider a different way to formulate PCA. Instead of finding the direction of maximum variance, let's find the one-dimensional projection which minimizes the squared error of the subspace approximation. Specifically, we are given a collection of data vectors $y_{1:N}$, and wish to find
a bias $b$, a single unit vector $w$, and one-dimensional coordinates $x_{1:N}$, to minimize:

    \operatorname{argmin}_{w, x_{1:N}, b} \sum_i \| y_i - (w x_i + b) \|^2    (27)
    subject to w^T w = 1    (28)

Here, $x_i$ specifies a position along a line with direction $w$ and offset $b$ from the origin. The total error is the sum of squared Euclidean distances between the data points $y_i$ and their corresponding points on the model line.¹ The vector $w$ is called the first principal component.

The Lagrangian is:

    L(w, x_{1:N}, b, \lambda) = \sum_i \| y_i - (w x_i + b) \|^2 + \lambda (\|w\|^2 - 1)    (29)

There are several sets of unknowns, and we derive their optimal values each in turn.

Projections ($x_i$). We first derive the projections:

    \frac{\partial L}{\partial x_i} = -2 w^T (y_i - (w x_i + b)) = 0    (30)

Using $w^T w = 1$ and solving for $x_i$ gives:

    x_i = w^T (y_i - b)    (31)

Bias ($b$). We begin by differentiating:

    \frac{\partial L}{\partial b} = -2 \sum_i (y_i - (w x_i + b))    (32)

Substituting in Equation 31 gives

    \frac{\partial L}{\partial b} = -2 \sum_i \left( y_i - (w w^T (y_i - b) + b) \right)
                                  = -2 \sum_i y_i + 2 w w^T \sum_i y_i - 2 N w w^T b + 2 N b
                                  = -2 (I - w w^T) \sum_i y_i + 2 N (I - w w^T) b = 0    (33)

Factoring out $2 (I - w w^T)$ from both terms, one can see that we obtain

    b = \frac{1}{N} \sum_i y_i    (34)

¹ It is important to note that this optimization problem differs in subtle ways from the linear regression earlier in the notes. With linear regression we had multi-dimensional inputs and a scalar output. Here we have vector-valued data $y_i$ and we are trying to find a scalar input $x_i$. In linear regression we minimized the error in the predicted $y$ (i.e., the vertical distance of each point to the curve), while here the error is the Euclidean distance from each 2D data point $y_i$ to a location on the model line.
Basis vector ($w$). To make things simpler, we will define $\tilde{y}_i = y_i - b$ as the mean-centered data points; the reconstructions are then $x_i = w^T \tilde{y}_i$, and the objective function is:

    L = \sum_i \| \tilde{y}_i - w x_i \|^2 + \lambda (w^T w - 1)
      = \sum_i \| \tilde{y}_i - w w^T \tilde{y}_i \|^2 + \lambda (w^T w - 1)
      = \sum_i (\tilde{y}_i - w w^T \tilde{y}_i)^T (\tilde{y}_i - w w^T \tilde{y}_i) + \lambda (w^T w - 1)
      = \sum_i \left( \tilde{y}_i^T \tilde{y}_i - 2 \tilde{y}_i^T w w^T \tilde{y}_i + \tilde{y}_i^T w w^T w w^T \tilde{y}_i \right) + \lambda (w^T w - 1)
      = \sum_i \left( \tilde{y}_i^T \tilde{y}_i - (\tilde{y}_i^T w)^2 \right) + \lambda (w^T w - 1)    (35)

where we have used $w^T w = 1$. We then differentiate and simplify:

    \frac{\partial L}{\partial w} = -2 \sum_i \tilde{y}_i \tilde{y}_i^T w + 2 \lambda w = 0    (36)

We can rearrange this to get:

    \left( \sum_i \tilde{y}_i \tilde{y}_i^T \right) w = \lambda w    (37)

This is exactly the eigenvector equation, meaning that extrema of $L$ occur when $w$ is an eigenvector of the matrix $\sum_i \tilde{y}_i \tilde{y}_i^T$, and $\lambda$ is the corresponding eigenvalue. Multiplying both sides by $1/N$, we see that this matrix has the same eigenvectors as the data covariance:

    \left( \frac{1}{N} \sum_i (y_i - b) (y_i - b)^T \right) w = \frac{\lambda}{N} w    (38)

Now we must determine which eigenvector to use. To this end, we rewrite Eqn. (35) as

    L = \sum_i \tilde{y}_i^T \tilde{y}_i - w^T \left( \sum_i \tilde{y}_i \tilde{y}_i^T \right) w + \lambda (w^T w - 1),    (39)

and substitute in Eqn. (37):

    L = \sum_i \tilde{y}_i^T \tilde{y}_i - \lambda w^T w + \lambda (w^T w - 1)
      = \sum_i \tilde{y}_i^T \tilde{y}_i - \lambda,    (40)

again using $w^T w = 1$. We must pick the eigenvalue $\lambda$ that gives the smallest value of $L$. Hence, we pick the largest eigenvalue, and set $w$ to be the corresponding eigenvector.
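The derivation above can be verified numerically: by Eqn. (40), the reconstruction error for an eigenvector $w$ equals $\sum_i \tilde{y}_i^T \tilde{y}_i - \lambda$, so the eigenvector with the largest eigenvalue gives the smallest error. A sketch on synthetic 2-D data (the data itself is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])  # elongated 2-D cloud

b = Y.mean(axis=0)          # optimal bias: the data mean (Eqn. 34)
Yc = Y - b                  # mean-centered data, y~_i = y_i - b
S = Yc.T @ Yc               # the matrix sum_i y~_i y~_i^T

evals, evecs = np.linalg.eigh(S)  # symmetric eigendecomposition, ascending order

def recon_error(w):
    # sum_i || y~_i - w w^T y~_i ||^2  for a unit vector w
    return np.sum((Yc - np.outer(Yc @ w, w)) ** 2)

w_best = evecs[:, -1]  # eigenvector with the largest eigenvalue

# Eqn. (40): error = sum_i y~_i^T y~_i - lambda, so the top eigenvector wins.
print(np.isclose(recon_error(w_best), np.sum(Yc ** 2) - evals[-1]))  # True
print(recon_error(w_best) < recon_error(evecs[:, 0]))                # True
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the last column of `evecs` is the first principal component.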
15.3 Multiple Constraints

Suppose we wish to optimize with respect to multiple constraints $\{g_k(x)\}$, i.e.,

    \operatorname{argmin}_x E(x)    (41)
    subject to g_k(x) = 0, for k = 1 \ldots K.    (42)

Extrema then occur when

    \nabla E + \sum_k \lambda_k \nabla g_k = 0,    (43)

where we have introduced $K$ Lagrange multipliers $\lambda_k$. The constraints can be combined into a single Lagrangian:

    L(x, \lambda_{1:K}) = E(x) + \sum_k \lambda_k g_k(x).    (44)

15.4 Inequality Constraints

The method can be extended to inequality constraints of the form $g(x) \ge 0$. For a solution to be valid and maximal, there are two possible cases:

- The optimal solution is inside the constraint region, and hence $\nabla E = 0$ and $g(x) > 0$. In this case, the constraint is inactive, meaning that $\lambda$ can be set to zero.
- The optimal solution lies on the boundary $g(x) = 0$. In this case, the gradient $\nabla E$ must point in the opposite direction of the gradient of $g$; otherwise, following the gradient of $E$ would cause $g$ to become positive while also modifying $E$. Hence, we must have $\nabla E = -\lambda \nabla g$ for $\lambda \ge 0$.

Note that, in both cases, $\lambda g(x) = 0$. Hence, we can enforce that one of these cases holds with the following optimization problem:

    \max_{x, \lambda} \; E(x) + \lambda g(x)    (45)
    such that g(x) \ge 0    (46)
              \lambda \ge 0    (47)
              \lambda g(x) = 0    (48)

These are called the Karush-Kuhn-Tucker (KKT) conditions, which generalize the Method of Lagrange Multipliers. When minimizing, we want $\nabla E$ to point in the same direction as $\nabla g$ when on the boundary, and so we minimize $E - \lambda g$ instead of $E + \lambda g$.
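The active-constraint case can be seen in a small numerical example, not from the text, using scipy's SLSQP solver (assumed available). We minimize $E(x) = (x_1 - 2)^2 + (x_2 - 2)^2$ subject to $g(x) = 1 - x_1 - x_2 \ge 0$; the unconstrained minimum $(2, 2)$ violates the constraint, so the solution lies on the boundary, where the constraint is active and its multiplier is nonzero:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize (x1 - 2)^2 + (x2 - 2)^2  subject to  1 - x1 - x2 >= 0.
E = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
cons = [{"type": "ineq", "fun": lambda x: 1 - x[0] - x[1]}]

res = minimize(E, x0=[0.0, 0.0], constraints=cons, method="SLSQP")

# The solution sits on the boundary x1 + x2 = 1 (active constraint), at the
# projection of (2, 2) onto that line: (1/2, 1/2).
print(res.x)  # approximately [0.5, 0.5]
```

Had the constraint been, say, $5 - x_1 - x_2 \ge 0$ instead, the unconstrained minimum $(2, 2)$ would satisfy it strictly; the constraint would then be inactive and, per the KKT conditions, its multiplier would be zero.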
[Figure 3: Illustration of the condition for inequality constraints: the solution may lie on the boundary of the constraint region, or in the interior. Figure from Pattern Recognition and Machine Learning by Chris Bishop.]
More informationSupport Vector Machines
Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class
More informationFisher Linear Discriminant Analysis
Fsher Lnear Dscrmnant Analyss Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan Fsher lnear
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationMechanics Physics 151
Mechancs Physcs 5 Lecture 0 Canoncal Transformatons (Chapter 9) What We Dd Last Tme Hamlton s Prncple n the Hamltonan formalsm Dervaton was smple δi δ p H(, p, t) = 0 Adonal end-pont constrants δ t ( )
More informationAPPENDIX A Some Linear Algebra
APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,
More information12. The Hamilton-Jacobi Equation Michael Fowler
1. The Hamlton-Jacob Equaton Mchael Fowler Back to Confguraton Space We ve establshed that the acton, regarded as a functon of ts coordnate endponts and tme, satsfes ( ) ( ) S q, t / t+ H qpt,, = 0, and
More informationCS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016
CS 294-128: Algorthms and Uncertanty Lecture 14 Date: October 17, 2016 Instructor: Nkhl Bansal Scrbe: Antares Chen 1 Introducton In ths lecture, we revew results regardng follow the regularzed leader (FTRL.
More informationCHAPTER 7 CONSTRAINED OPTIMIZATION 2: SQP AND GRG
Chapter 7: Constraned Optmzaton CHAPER 7 CONSRAINED OPIMIZAION : SQP AND GRG Introducton In the prevous chapter we eamned the necessary and suffcent condtons for a constraned optmum. We dd not, however,
More informationRadar Trackers. Study Guide. All chapters, problems, examples and page numbers refer to Applied Optimal Estimation, A. Gelb, Ed.
Radar rackers Study Gude All chapters, problems, examples and page numbers refer to Appled Optmal Estmaton, A. Gelb, Ed. Chapter Example.0- Problem Statement wo sensors Each has a sngle nose measurement
More informationSome Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)
Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998
More informationMATH Sensitivity of Eigenvalue Problems
MATH 537- Senstvty of Egenvalue Problems Prelmnares Let A be an n n matrx, and let λ be an egenvalue of A, correspondngly there are vectors x and y such that Ax = λx and y H A = λy H Then x s called A
More informationThe Feynman path integral
The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space
More informationLaboratory 1c: Method of Least Squares
Lab 1c, Least Squares Laboratory 1c: Method of Least Squares Introducton Consder the graph of expermental data n Fgure 1. In ths experment x s the ndependent varable and y the dependent varable. Clearly
More informationEconomics 101. Lecture 4 - Equilibrium and Efficiency
Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of
More informationECE559VV Project Report
ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate
More informationInner Product. Euclidean Space. Orthonormal Basis. Orthogonal
Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More informationLecture 6: Support Vector Machines
Lecture 6: Support Vector Machnes Marna Melă mmp@stat.washngton.edu Department of Statstcs Unversty of Washngton November, 2018 Lnear SVM s The margn and the expected classfcaton error Maxmum Margn Lnear
More informationCHAPTER 14 GENERAL PERTURBATION THEORY
CHAPTER 4 GENERAL PERTURBATION THEORY 4 Introducton A partcle n orbt around a pont mass or a sphercally symmetrc mass dstrbuton s movng n a gravtatonal potental of the form GM / r In ths potental t moves
More informationTransfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system
Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng
More information8.6 The Complex Number System
8.6 The Complex Number System Earler n the chapter, we mentoned that we cannot have a negatve under a square root, snce the square of any postve or negatve number s always postve. In ths secton we want
More information