ELE 538B: Large-Scale Optimization for Data Science. Introduction. Yuxin Chen Princeton University, Spring 2018

Surge of data-intensive applications

Widespread applications in large-scale data science and learning
2.5 exabytes of data are generated every day (2012): exabyte → zettabyte → yottabyte → ...??
limited processing ability (computation, storage, ...)

Optimization has transformed algorithm design

(Convex) optimization is almost a tool

Solvability / tractability

"... the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity."
(R. Rockafellar, 1993)

Polynomial-time solvability ≠ scalability

Even polynomial-time algorithms might be useless in large-scale applications

Example: Newton's method

x^{t+1} = x^t - (∇²f(x^t))^{-1} ∇f(x^t)

[Figure: f(x^{(k)}) - f^opt versus iteration k for an example in R² (page 10-9), with backtracking parameters α = 0.1, β = 0.7; the error falls from about 10⁵ to about 10⁻¹⁵ in 5 steps]

f(x^{(k)}) - f^opt converges in only 5 steps: quadratic local convergence, hence ε-accuracy is attained within O(log log(1/ε)) iterations

but Newton's method typically requires Hessian information ∇²f(x) ∈ R^{n×n}: a single iteration may last forever, and the storage requirement is prohibitive
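
To make the trade-off concrete, here is a minimal sketch (ours, not from the slides) that runs plain gradient descent and Newton's method side by side on a small ridge-regularized logistic-type objective; the data matrix A, the regularizer lam, the step size, and the iteration count are all illustrative assumptions.

```python
# Gradient descent vs. Newton's method on f(x) = sum_i log(1 + exp(a_i'x)) + 0.5*lam*||x||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))   # assumed data
lam = 0.1                         # assumed ridge weight

def f(x):
    return np.sum(np.logaddexp(0.0, A @ x)) + 0.5 * lam * (x @ x)

def grad(x):
    s = 1.0 / (1.0 + np.exp(-(A @ x)))          # sigmoid(a_i' x)
    return A.T @ s + lam * x

def hess(x):
    s = 1.0 / (1.0 + np.exp(-(A @ x)))
    return (A * (s * (1.0 - s))[:, None]).T @ A + lam * np.eye(d)

x_gd, x_nt = np.zeros(d), np.zeros(d)
for k in range(10):
    x_gd = x_gd - 0.02 * grad(x_gd)                          # O(nd) work per step
    x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))    # O(nd^2 + d^3) per step
    print(k, f(x_gd), f(x_nt))    # Newton settles at the optimum within ~5 steps
```

The point of the slide survives in the two inline comments: each Newton step is far more accurate, but also far more expensive, than a gradient step.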

Iteration complexity vs. per-iteration cost

computational cost = iteration complexity (#iterations needed) × cost per iteration

Large-scale problems call for methods with cheap iterations
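
A back-of-envelope illustration of this product, with made-up but representative numbers (the dimension, iteration counts, and flop estimates below are assumptions, not course figures):

```python
# computational cost = (#iterations needed) x (cost per iteration), in rough flop counts
n = 1e7                         # assumed problem dimension
newton_like = 6 * n**3          # ~6 iterations, each ~n^3 flops (a dense linear solve)
first_order = 1e4 * n           # ~10^4 iterations, each ~n flops (a gradient step)
print(f"Newton-like: {newton_like:.1e} flops")   # ~6.0e+21: hopeless at this scale
print(f"first-order: {first_order:.1e} flops")   # ~1.0e+11: routine on a laptop
```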

Methods of choice

[Figure: a first-order oracle; a query point x goes in, f(x) and ∇f(x) come out]

First-order methods: methods that exploit only information on function values and (sub)gradients (without using Hessian information)
cheap iterations
low memory requirement
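
A minimal sketch of the oracle abstraction just described, assuming a least-squares objective; the function names and problem data below are ours:

```python
# A first-order oracle returns only (f(x), grad f(x)); the method never sees more.
import numpy as np

def oracle(x, A, b):
    """f(x) = 0.5 * ||Ax - b||^2 and its gradient."""
    r = A @ x - b
    return 0.5 * (r @ r), A.T @ r

def gradient_method(oracle_fn, x0, step, iters):
    x = x0.copy()
    for _ in range(iters):
        _, g = oracle_fn(x)       # only first-order information is used
        x = x - step * g          # O(n) work and memory per iteration
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
x = gradient_method(lambda x: oracle(x, A, b), np.zeros(2), step=0.2, iters=100)
print(x)   # approaches the least-squares solution [0.5, 1.0]
```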

What this course will NOT cover

second-order methods: check ORF 522 (by M. Wang)
convex analysis: check ORF 522 (by M. Wang) and ORF 523 (by A. Ahmadi)
sum-of-squares programming: check ORF 523 (by A. Ahmadi)
approximation algorithms for NP-hard problems: check ORF 523 (by A. Ahmadi)
computational hardness: check ORF 523 (by A. Ahmadi)
online optimization: check COS 511 (by E. Hazan)

What this course will cover: convex optimization algorithms

gradient methods
Frank-Wolfe and projected gradient methods
subgradient methods
proximal gradient methods (a small taste appears in the sketch below)
accelerated proximal gradient methods
mirror descent
stochastic gradient methods
ADMM
quasi-Newton methods (BFGS)
large-scale linear algebra (conjugate gradient, Lanczos method)
ODE interpretations
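
As a small taste of one item on this list, here is a minimal proximal-gradient (ISTA) sketch for the lasso, min 0.5||Ax - b||² + λ||x||₁; the problem data, regularization weight, and iteration count are illustrative assumptions:

```python
# Proximal gradient (ISTA): a gradient step on the smooth part, then a prox
# (soft-threshold) step on the l1 part.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))      # assumed sensing matrix
x_true = np.zeros(100)
x_true[:5] = 1.0                        # assumed 5-sparse signal
b = A @ x_true
lam = 0.1
L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the smooth gradient

def soft_threshold(v, tau):             # prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(100)
for _ in range(2000):
    g = A.T @ (A @ x - b)               # gradient of the smooth part
    x = soft_threshold(x - g / L, lam / L)
print(np.round(x[:8], 2))               # approximately recovers the sparse support
```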

What this course will cover: nonconvex optimization?

geometry of matrix factorization (phase retrieval, matrix completion, matrix sensing)
escaping saddle points
gradient methods for matrix factorization (a toy instance is sketched below)
neural networks?
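
For concreteness, a toy sketch (ours; the sizes, step size, and initialization scale are assumptions) of gradient descent on the nonconvex matrix-factorization objective f(U, V) = 0.5 ||U Vᵀ - M||²_F:

```python
# Gradient descent on a rank-r matrix factorization: nonconvex, yet simple
# gradient steps from a small random initialization drive the error down.
import numpy as np

rng = np.random.default_rng(2)
n, r = 30, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r target
M = M / np.linalg.norm(M, 2)            # normalize spectral norm to 1

U = 0.1 * rng.standard_normal((n, r))   # small random initialization
V = 0.1 * rng.standard_normal((n, r))
eta = 0.1                               # assumed step size (small vs. ||M||)
for _ in range(3000):
    R = U @ V.T - M                     # residual
    U, V = U - eta * (R @ V), V - eta * (R.T @ U)   # partial gradients of f
print(np.linalg.norm(U @ V.T - M))      # error decreases steadily despite nonconvexity
```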

Textbooks

We recommend these three books, but will not follow them closely ...

[Figure: covers of the three recommended books]

WARNING

There will be quite a few THEOREMS and PROOFS ...

May be somewhat disorganized: taught for the first time

Prerequisites

basic linear algebra
basic probability
a programming language (e.g., Matlab, Python, ...)
knowledge of basic convex optimization

Somewhat surprisingly, most proofs rely only on basic linear algebra and elementary recursive formulas

Grading

[Figure: difficulty vs. workload]

Grading

Homeworks: 3 problem sets
use Piazza as the main mode of electronic communication; please post (and answer) questions here!

Term project: either individually or in groups of two

grade = max{0.5H + 0.5P, 0.5E + 0.5P, P},  if exam (whether there will be an exam is "random")
grade = max{0.5H + 0.5P, P},               else

where H: homework; P: project; E: take-home exam
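
Read as code, the rule is just a max over weighted averages; a tiny transcription (the argument names are ours) for concreteness:

```python
# Grade rule from the slide; H, P, E are homework, project, and exam scores in [0, 1].
def grade(H, P, E=None, exam=False):
    if exam:                        # whether an exam happens is "random"
        return max(0.5 * H + 0.5 * P, 0.5 * E + 0.5 * P, P)
    return max(0.5 * H + 0.5 * P, P)

print(grade(H=0.8, P=0.9))                    # no exam: 0.9
print(grade(H=0.8, P=0.9, E=1.0, exam=True))  # with exam: 0.95
```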

Term project

Two forms: literature review, or original research
You are strongly encouraged to combine it with your own research

Three milestones
Proposal (March 16): up to 1 page
Presentation (either last week of class or reading period)
Report (May 13): up to ... pages with unlimited appendix