A Proofs

We now give the details for the proofs of our main results, i.e., Theorems 1 and 2. Below, we outline the steps of the proof of FLAG's Theorem 1. The proof of Theorem 2 for FLARE follows the same line of reasoning. Also, we note that, in what follows, the lemmas/corollaries required for the proof of Theorem 2 are given immediately after those of FLAG.

1. FLAG is essentially a combination of mirror descent and proximal gradient descent steps (Lemmas 1 and 4).
2. $\gamma_k$ in Algorithm 1 plays the role of an effective gradient Lipschitz constant in each iteration. The convergence rate of FLAG ultimately depends on $\sum_k \eta_k\gamma_k$ and $\sum_k g_k^\top S_k^{-1} g_k$ (Lemma 8 and Corollary 3).
3. By picking $S_k$ adaptively like in AdaGrad, we achieve a non-trivial upper bound for $\sum_k g_k^\top S_k^{-1} g_k$ (Lemma 5).
4. FLAG relies on picking an $x_k$ at each iteration that satisfies an inequality involving $\gamma_k$ (Corollary 1). However, because $\gamma_k$ is not known prior to picking $x_k$, we must choose an $x_k$ that roughly satisfies the inequality for all possible values of $\gamma_k$. We do this by picking $x_k$ using binary search (Lemmas 2 and 3 and Corollary 1).
5. Finally, we need to pick the right stepsize for each iteration. Our scheme is very similar to the one used in [1], but generalized to handle a different $\eta_k$ in each iteration (Lemmas 6 and 8 as well as Corollary 3).
6. Theorem 3 combines items 1, 2 and 4 above. Finally, to prove Theorem 1, we combine Theorem 3 with items 3 and 5 above.

A.1 Proof of Theorem 1 and Theorem 2

First, we obtain the following key result (similar to [4, Lemma 2.3]) regarding the vector $p = -L(\mathrm{prox}(x) - x)$, as in Step 3 of FLAG, which is known as the gradient mapping of $F$ on $\mathcal{C}$.

Lemma 1 (Gradient Mapping) For any $x, y \in \mathcal{C}$, we have
$$F(\mathrm{prox}(x)) \le F(y) + L\langle \mathrm{prox}(x) - x,\; y - x\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2,$$
where $\mathrm{prox}(x)$ is defined as in (3). In particular,
$$F(\mathrm{prox}(x)) \le F(x) - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2.$$

Proof of Lemma 1 This result is the same as Lemma 2.3 in [4]. We bring its proof here for completeness.
For any $y \in \mathcal{C}$ and any sub-gradient $v$ of $h$ at $\mathrm{prox}(x)$, i.e., $v \in \partial h(\mathrm{prox}(x))$, by optimality of $\mathrm{prox}(x)$ in (3) we have
$$0 \le \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - \mathrm{prox}(x)\rangle = \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - x\rangle + \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; x - \mathrm{prox}(x)\rangle,$$
and so
$$\langle \nabla f(x),\; \mathrm{prox}(x) - x\rangle \le \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - x\rangle + \langle v,\; x - \mathrm{prox}(x)\rangle - L\|x - \mathrm{prox}(x)\|_2^2.$$
Now from the $L$-Lipschitz continuity of $\nabla f$ as well as the convexity of $f$ and $h$, we get
$$\begin{aligned}
F(\mathrm{prox}(x)) &= f(\mathrm{prox}(x)) + h(\mathrm{prox}(x))\\
&\le f(x) + \langle \nabla f(x),\, \mathrm{prox}(x) - x\rangle + \frac{L}{2}\|\mathrm{prox}(x) - x\|_2^2 + h(\mathrm{prox}(x))\\
&\le f(x) + \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\, y - x\rangle + \langle v,\, x - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&\le f(y) + \langle v + L(\mathrm{prox}(x) - x),\, y - x\rangle + \langle v,\, x - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&= f(y) + L\langle \mathrm{prox}(x) - x,\, y - x\rangle + \langle v,\, y - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&\le F(y) + L\langle \mathrm{prox}(x) - x,\, y - x\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2.
\end{aligned}$$

The following lemma establishes the Lipschitz continuity of the prox operator.

Lemma 2 (Prox Operator Continuity) $\mathrm{prox} : \mathbb{R}^d \to \mathbb{R}^d$ is 2-Lipschitz continuous, that is, for any $x, y \in \mathcal{C}$, we have $\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2 \le 2\|x - y\|_2$.
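Before proceeding to the proofs, the two claims above, the descent guarantee of the gradient mapping (Lemma 1 with $y = x$) and the 2-Lipschitz continuity of the prox operator (Lemma 2), can be probed numerically. The sketch below is ours, not the paper's: it assumes the special case $h \equiv 0$ with a box constraint, where $\mathrm{prox}(x)$ reduces to a clipped gradient step with stepsize $1/L$.

```python
import numpy as np

# Toy composite problem: f(x) = 0.5 x^T Q x, h = 0, C = [-1, 1]^d.
rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
Q = A.T @ A                              # positive semi-definite Hessian of f
L = float(np.linalg.eigvalsh(Q).max())   # Lipschitz constant of grad f

def F(x):
    return 0.5 * x @ Q @ x

def prox(x):
    # argmin_{z in C} <grad f(x), z - x> + (L/2)||z - x||^2  (h = 0),
    # i.e. a projected gradient step with stepsize 1/L.
    return np.clip(x - (Q @ x) / L, -1.0, 1.0)

for _ in range(100):
    x = rng.uniform(-1, 1, size=d)
    y = rng.uniform(-1, 1, size=d)
    px = prox(x)
    # Lemma 1 with y = x: F(prox(x)) <= F(x) - (L/2)||x - prox(x)||^2
    assert F(px) <= F(x) - 0.5 * L * np.sum((x - px) ** 2) + 1e-9
    # Lemma 2: ||prox(x) - prox(y)|| <= 2 ||x - y||
    assert np.linalg.norm(px - prox(y)) <= 2 * np.linalg.norm(x - y) + 1e-9
print("ok")
```

In this special case the map $x \mapsto x - Qx/L$ is in fact 1-Lipschitz and clipping is non-expansive, so the factor 2 of Lemma 2 is loose here; the point of the general lemma is that the factor 2 survives for arbitrary convex $h$ and constraint sets.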
Proof of Lemma 2 By definition (3), for any $x, y, z, z' \in \mathcal{C}$, $v \in \partial h(\mathrm{prox}(x))$ and $w \in \partial h(\mathrm{prox}(y))$, we have
$$\langle v,\; z - \mathrm{prox}(x)\rangle \ge -\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; z - \mathrm{prox}(x)\rangle,$$
$$\langle w,\; z' - \mathrm{prox}(y)\rangle \ge -\langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; z' - \mathrm{prox}(y)\rangle.$$
In particular, for $z = \mathrm{prox}(y)$ and $z' = \mathrm{prox}(x)$, we get
$$\langle v,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle \ge -\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle,$$
$$\langle w,\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle \ge -\langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle.$$
By monotonicity of the sub-gradient, we get $\langle v,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle \le \langle w,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle$. So
$$\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle \le \langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle,$$
and as a result
$$\begin{aligned}
\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle
&= \langle \nabla f(x) + L(\mathrm{prox}(x) - \mathrm{prox}(y) + \mathrm{prox}(y) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&= L\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2^2 + \langle \nabla f(x) + L(\mathrm{prox}(y) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&\le \langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle,
\end{aligned}$$
which gives
$$\begin{aligned}
L\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2^2 &\le \langle \nabla f(y) - \nabla f(x) + L(x - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&\le \big(\|\nabla f(y) - \nabla f(x)\|_2 + L\|x - y\|_2\big)\,\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2\\
&\le 2L\|x - y\|_2\,\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2,
\end{aligned}$$
and the result follows.

Using the prox operator continuity of Lemma 2, we can conclude that given any $y, z \in \mathcal{C}$, if $\langle \mathrm{prox}(y) - y,\; y - z\rangle < 0$ and $\langle \mathrm{prox}(z) - z,\; y - z\rangle > 0$, then there must be a $t^* \in (0, 1)$ for which $w = t^* y + (1 - t^*) z$ gives $\langle \mathrm{prox}(w) - w,\; y - z\rangle = 0$. Algorithm 2 finds an approximation to $w$ in $O(\log(1/\epsilon))$ iterations.

Lemma 3 (Binary Search Lemma) Let $x = \mathrm{BinarySearch}(z, y, \epsilon)$ be defined as in Algorithm 2. Then one of 3 cases happens: (i) $x = y$ and $\langle \mathrm{prox}(x) - x,\; x - z\rangle \ge 0$, (ii) $x = z$ and $\langle \mathrm{prox}(x) - x,\; y - x\rangle \le 0$, or (iii) $x = ty + (1 - t)z$ for some $t \in (0, 1)$ and $|\langle \mathrm{prox}(x) - x,\; y - z\rangle| \le 3\epsilon\|y - z\|_2^2$.

Proof of Lemma 3 Items (i) and (ii) are simply Steps 2 and 5, respectively. For item (iii), with $w = t^* y + (1 - t^*) z$ as above, we have
$$\|x - w\|_2 = \|ty + (1 - t)z - t^* y - (1 - t^*)z\|_2 = |t - t^*|\,\|y - z\|_2 \le \epsilon\|y - z\|_2.$$
Now it follows that
$$\begin{aligned}
|\langle \mathrm{prox}(x) - x,\; y - z\rangle| &= |\langle \mathrm{prox}(x) - x,\; y - z\rangle - \langle \mathrm{prox}(w) - w,\; y - z\rangle|\\
&\le |\langle \mathrm{prox}(x) - \mathrm{prox}(w),\; y - z\rangle| + |\langle x - w,\; y - z\rangle|\\
&\le \|\mathrm{prox}(x) - \mathrm{prox}(w)\|_2\,\|y - z\|_2 + \|x - w\|_2\,\|y - z\|_2\\
&\le 2\|x - w\|_2\,\|y - z\|_2 + \|x - w\|_2\,\|y - z\|_2\\
&= 3\|x - w\|_2\,\|y - z\|_2 \le 3\epsilon\|y - z\|_2^2,
\end{aligned}$$
where the third inequality follows by Lemma 2, and the result follows.

Using the above result, we can prove the following:

Corollary 1 Let $x_k$, $y_k$, $z_k$ and $\gamma_k$ be defined as in Algorithms 1 and 2, and let $\epsilon = 1/(6d)$. Then for all $k$,
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle \le \frac{\gamma_k L D_\infty^2}{2}.$$

Proof of Corollary 1 Note that by Step 3 of Algorithm 1, $p_k = -L(\mathrm{prox}(x_k) - x_k)$.
For $k = 1$, since $x_1 = y_0 = z_0$, the inequality is trivially true. For $k \ge 2$, we consider the three cases of Lemma 3: (i) if $x_k = y_{k-1}$, the right hand side is non-negative and the left hand side is
$$\langle p_k,\; x_k - z_{k-1}\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; x_k - z_{k-1}\rangle \le 0,$$
(ii) if $x_k = z_{k-1}$, the left hand side
is 0 and
$$\langle p_k,\; y_{k-1} - x_k\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; y_{k-1} - x_k\rangle \ge 0,$$
so the inequality holds trivially, and (iii) in this last case, for some $t \in (0, 1)$, we have
$$\langle p_k,\; x_k - z_{k-1}\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; ty_{k-1} + (1 - t)z_{k-1} - z_{k-1}\rangle = -tL\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle,$$
and
$$\langle p_k,\; y_{k-1} - x_k\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; y_{k-1} - ty_{k-1} - (1 - t)z_{k-1}\rangle = -(1 - t)L\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle.$$
Hence
$$\begin{aligned}
\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle
&= -\big(t - (\gamma_k - 1)(1 - t)\big)\,L\,\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle\\
&\le \big(t + (\gamma_k - 1)(1 - t)\big)\,L\,\big|\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle\big|\\
&\le 3\gamma_k L\epsilon\,\|y_{k-1} - z_{k-1}\|_2^2 \;\le\; 3\gamma_k L\epsilon\, d\, D_\infty^2 \;=\; \frac{\gamma_k L D_\infty^2}{2},
\end{aligned}$$
where the second inequality uses Lemma 3(iii) together with $t + (\gamma_k - 1)(1 - t) \le \gamma_k$, and in the last line we used the fact that $\|y_{k-1} - z_{k-1}\|_2^2 \le d D_\infty^2$ and $\epsilon = 1/(6d)$, and the result follows.

Similar to Corollary 1 for Algorithm 1, the following proves an analogous result for Algorithm 3.

Corollary 2 Let $x_k$, $y_k$, $z_k$ and $\gamma_k$ be defined as in Algorithm 3, and let $\epsilon = 1/(6d)$. Then for all $k$,
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle \le \frac{\gamma_k L D_\infty^2}{2}.$$

Proof of Corollary 2 We consider two cases:
1. If $x_k$ is generated through Algorithm 5, then $x_k = \mathrm{BinarySearch}(y_{k-1}, z_{k-1}, \epsilon)$, and so the statement follows from Corollary 1.
2. If $x_k$ is generated through Algorithm 4, then $x_k = \frac{1}{\gamma_k} z_{k-1} + \big(1 - \frac{1}{\gamma_k}\big) y_{k-1}$, and so it satisfies
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle = 0.$$

Next, we state a result regarding the mirror descent step. Similar results can be found in most texts on online optimization, e.g., [1].

Lemma 4 (Mirror Descent Inequality) Let
$$z_k = \arg\min_{z\in\mathcal{C}}\;\eta_k\langle p_k,\; z - z_{k-1}\rangle + \frac{1}{2}\|z - z_{k-1}\|_{S_k}^2,$$
and let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$ be the diameter of $\mathcal{C}$ measured by the infinity norm. Then for any $u \in \mathcal{C}$, we have
$$\sum_{k=1}^{T}\eta_k\langle p_k,\; z_{k-1} - u\rangle \le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Lemma 4 For any $u \in \mathcal{C}$, by optimality of $z_k$ we have $\langle \eta_k p_k,\; z_k - u\rangle \le \langle S_k(z_k - z_{k-1}),\; u - z_k\rangle$. Hence, using (5) and (4), it follows that
$$\begin{aligned}
\eta_k\langle p_k,\; z_{k-1} - u\rangle
&= \eta_k\langle p_k,\; z_{k-1} - z_k\rangle + \eta_k\langle p_k,\; z_k - u\rangle\\
&\le \eta_k\langle p_k,\; z_{k-1} - z_k\rangle + \langle S_k(z_k - z_{k-1}),\; u - z_k\rangle\\
&= \eta_k\langle p_k,\; z_{k-1} - z_k\rangle - \frac{1}{2}\|z_k - z_{k-1}\|_{S_k}^2 + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2\\
&\le \sup_{z\in\mathbb{R}^d}\Big\{\eta_k\langle p_k,\; z\rangle - \frac{1}{2}\|z\|_{S_k}^2\Big\} + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2\\
&= \frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2.
\end{aligned}$$
Now, recalling from Steps 5-7 of Algorithm 1 that $S_k = \mathrm{diag}(s_k) + \delta I$ and $s_k \ge s_{k-1}$, we sum over $k = 1$ to $T$ to
get
$$\begin{aligned}
\sum_{k=1}^{T}\eta_k\langle p_k,\; z_{k-1} - u\rangle
&\le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_0\|_{S_1}^2 + \sum_{k=2}^{T}\frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_{k-1}\|_{S_{k-1}}^2\Big)\\
&= \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_0\|_{S_1}^2 + \sum_{k=2}^{T}\frac{1}{2}\big\langle (s_k - s_{k-1})\odot(u - z_{k-1}),\; u - z_{k-1}\big\rangle\\
&\le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \langle s_1, \mathbf{1}\rangle\big) + \frac{D_\infty^2}{2}\sum_{k=2}^{T}\langle s_k - s_{k-1}, \mathbf{1}\rangle\\
&= \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).
\end{aligned}$$

Finally, we state a result similar to that of [7] that captures the benefits of using $S_k$ in FLAG.

Lemma 5 (AdaGrad Inequalities) Define $q_T := \sum_{i=1}^{d}\|G_T(i,:)\|_2$, where $G_T$ is as in Step 5 of Algorithm 1. We have (i) $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$, (ii) $q_T^2 = \min_{S\in\mathcal{S}}\sum_{k=1}^{T} g_k^\top S^{-1} g_k$, where $\mathcal{S} := \{S \in \mathbb{R}^{d\times d} : S \text{ is diagonal},\, S_{ii} > 0,\, \mathrm{trace}(S) \le 1\}$, and (iii) $\sqrt{T} \le q_T \le \sqrt{Td}$.

Proof of Lemma 5 To prove part (i), we use the following inequality, introduced in the proof of Lemma 4 in [7]: for any arbitrary real-valued sequence $\{a_i\}_i$ and its vector representation $a_{1:k} := [a_1, a_2, \ldots, a_k]$, we have
$$\sum_{k=1}^{T}\frac{a_k^2}{\|a_{1:k}\|_2} \le 2\|a_{1:T}\|_2.$$
So it follows that
$$\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le \sum_{k=1}^{T}\sum_{i=1}^{d}\frac{g_k^2(i)}{s_k(i)} = \sum_{i=1}^{d}\sum_{k=1}^{T}\frac{g_k^2(i)}{\|G_k(i,:)\|_2} \le 2\sum_{i=1}^{d}\|G_T(i,:)\|_2 = 2q_T,$$
where the middle equality follows from the definition of $s_k$ in Step 6 of Algorithm 1.

For the rest of the proof, one can easily see that
$$\sum_{k=1}^{T} g_k^\top S^{-1} g_k = \sum_{i=1}^{d}\frac{a(i)}{s(i)},$$
where $a(i) := \sum_{k=1}^{T} g_k^2(i)$ and $s = \mathrm{diag}(S)$. Now the Lagrangian, for $\lambda \ge 0$ and $\theta \ge 0$, can be written as
$$\Lambda(s, \lambda, \theta) = \sum_{i=1}^{d}\frac{a(i)}{s(i)} + \lambda\Big(\sum_{i=1}^{d} s(i) - 1\Big) - \langle \theta, s\rangle.$$
Since strong duality holds, for any primal-dual optimal solutions $s^*$, $\lambda^*$, and $\theta^*$, it follows from complementary slackness that $\theta^* = 0$ (since $s^* > 0$). Now requiring $\partial\Lambda(s^*, \lambda^*, \theta^*)/\partial s(i) = 0$ gives $s^*(i) = \sqrt{a_i/\lambda^*} > 0$, which, since $s^*(i) > 0$, implies that $\lambda^* > 0$. As a result, by using complementary slackness again, we must have $\sum_{i=1}^{d} s^*(i) = 1$. Now simple algebraic calculation gives $s^*(i) = \sqrt{a_i}/(\sum_{j=1}^{d}\sqrt{a_j})$, and part (ii) follows.

For part (iii), recall that $\|g_k\|_2 = 1$. Now, since $\mathrm{trace}(S) \le 1$ implies $\lambda_{\min}(S^{-1}) \ge 1$, one has $g_k^\top S^{-1} g_k \ge 1$, and so $q_T \ge \sqrt{T}$. On the other hand, consider the optimization problem
$$\max\;\sum_{i=1}^{d}\|G_T(i,:)\|_2 = \sum_{i=1}^{d}\sqrt{\sum_{k=1}^{T} g_k^2(i)} \quad \text{s.t.}\quad \|g_k\|_2^2 \le 1,\; k = 1, 2, \ldots, T.$$
The Lagrangian can be written as
$$\Lambda(\{g_k\}, \{\lambda_k\}) = \sum_{i=1}^{d}\sqrt{\sum_{k=1}^{T} g_k^2(i)} + \sum_{k=1}^{T}\lambda_k\Big(1 - \sum_{i=1}^{d} g_k^2(i)\Big).$$
By the KKT necessary conditions, we require $\partial\Lambda(\{g_k^*\}, \{\lambda_k^*\})/\partial g_k(i) = 0$, which implies
$$\lambda_k^* = \frac{1}{2\sqrt{\sum_{t=1}^{T}(g_t^*)^2(i)}}, \qquad i = 1, 2, \ldots, d.$$
Hence, $\sum_{i=1}^{d}\sum_{t=1}^{T}(g_t^*)^2(i) = d/(4(\lambda^*)^2)$, and so $\lambda^* \ge \frac{1}{2}\sqrt{d/T}$, which gives $q_T \le \sqrt{Td}$.

We can now prove the central theorem which is used to obtain FLAG's main result.

Theorem 3 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 1, we get
$$\eta_T\gamma_T^2\big(F(y_T) - F(u)\big) \le \sum_{k=1}^{T}\Big\{\frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2\Big\} + \frac{L D_\infty^2}{2}\sum_{k=1}^{T}\eta_k\gamma_k^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Theorem 3 Noting that $p_k = -L(\mathrm{prox}(x_k) - x_k)$ is the gradient mapping of $F$ on $\mathcal{C}$, it follows that
$$\begin{aligned}
F(y_k) - F(u) &= F(\mathrm{prox}(x_k)) - F(u)\\
&\le \langle p_k,\; x_k - u\rangle - \frac{1}{2L}\|p_k\|_2^2\\
&= \langle p_k,\; x_k - z_{k-1}\rangle + \langle p_k,\; z_{k-1} - u\rangle - \frac{1}{2L}\|p_k\|_2^2\\
&\le \langle p_k,\; x_k - z_{k-1}\rangle + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big) - \frac{1}{2L}\|p_k\|_2^2\\
&\le (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big) - \frac{1}{2L}\|p_k\|_2^2\\
&\le (\gamma_k - 1)\big(F(y_{k-1}) - F(y_k)\big) + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),
\end{aligned}$$
where the first equality is by Step 8 of Algorithm 1 (i.e., $y_k = \mathrm{prox}(x_k)$), the first inequality is by Lemma 1, the second inequality is by the per-step bound in the proof of Lemma 4 (applied with stepsize $\eta_k\gamma_k$), the third inequality is by Corollary 1, and the last inequality is by Lemma 1 again. Now we have
$$\gamma_k\big(F(y_k) - F(u)\big) - (\gamma_k - 1)\big(F(y_{k-1}) - F(u)\big) \le \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{1}{2L}\|p_k\|_2^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big).$$
Multiplying both sides by $\eta_k\gamma_k$ and using $\eta_k\gamma_k(\gamma_k - 1) = \eta_{k-1}\gamma_{k-1}^2$ (Lemma 6(ii)) gives
$$\eta_k\gamma_k^2\big(F(y_k) - F(u)\big) - \eta_{k-1}\gamma_{k-1}^2\big(F(y_{k-1}) - F(u)\big) \le \frac{\eta_k\gamma_k^2 L D_\infty^2}{2} + \frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2 + \frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big).$$
Summing over $k = 1$ to $T$ (note that $\gamma_1 = 1$), telescoping the left hand side, and bounding the sum of the last terms by $\frac{D_\infty^2}{2}(\delta d + \|s_T\|_1)$ as in the proof of Lemma 4,
and the result follows.

Once again, we present the analog of Theorem 3 for Algorithm 3.

Theorem 4 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 3, we get
$$\eta_T\gamma_T^2\big(F(y_T) - F(u)\big) \le \sum_{k=1}^{T}\Big\{\frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2\Big\} + \frac{L D_\infty^2}{2}\sum_{k=1}^{T}\eta_k\gamma_k^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Theorem 4 The parts of this proof which differ from the proof of Theorem 3 are the use of Corollary 2 in place of Corollary 1 and the algorithm steps referenced below. Noting that $p_k = -L(\mathrm{prox}(x_k) - x_k)$ is the gradient mapping of $F$ on $\mathcal{C}$, the same chain of inequalities as in the proof of Theorem 3 yields
$$F(y_k) - F(u) \le (\gamma_k - 1)\big(F(y_{k-1}) - F(y_k)\big) + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),$$
where the first inequality follows from Lemma 1, the second inequality follows from Lemma 4, the last equality follows from Steps 9 and 11 of Algorithm 4 and Steps 8 and 9 of Algorithm 5, the second to last inequality follows from Corollary 2, and the last inequality follows from Lemma 1.
Now we have, exactly as in the proof of Theorem 3,
$$\eta_k\gamma_k^2\big(F(y_k) - F(u)\big) - \eta_{k-1}\gamma_{k-1}^2\big(F(y_{k-1}) - F(u)\big) \le \frac{\eta_k\gamma_k^2 L D_\infty^2}{2} + \frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2 + \frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),$$
where we used $\eta_k\gamma_k(\gamma_k - 1) = \eta_{k-1}\gamma_{k-1}^2$ (Lemma 7(ii)). Summing over $k = 1$ to $T$ and telescoping as before, the result follows.

We now set out to put the final piece of the proof in place: choosing the stepsize for the mirror descent step.

Lemma 6 For the choice of $\gamma_k$ and $\eta_k$ in Algorithm 1, we have (i) $\eta_k\gamma_k^2 = \sum_{i=1}^{k}\eta_i\gamma_i$, (ii) $\eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2 = 0$, and (iii) $\gamma_k \ge 1$.

Proof We prove (i) by induction. For $k = 1$, it is easy to verify that $\gamma_1 = 1$, and so $\eta_1\gamma_1^2 = \eta_1\gamma_1$ and the base case follows trivially. Now suppose that $\eta_{k-1}\gamma_{k-1}^2 = \sum_{i=1}^{k-1}\eta_i\gamma_i$. Re-arranging (i) for $\gamma_k$ gives
$$0 = \eta_k\gamma_k^2 - \eta_k\gamma_k - \sum_{i=1}^{k-1}\eta_i\gamma_i = \eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2.$$
Now, it is easy to verify that the choice of $\gamma_k$ in Algorithm 1 is a solution of the above quadratic equation. The rest of the items follow immediately from part (i).

The FLARE analog:

Lemma 7 For the choice of $\gamma_k$ and $\eta_k$ in Algorithm 3, we have (i) $\eta_k\gamma_k^2 = \sum_{i=1}^{k}\eta_i\gamma_i$, (ii) $\eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2 = 0$, and (iii) $\gamma_k \ge 1$.

Proof of Lemma 7 Completely identical to the proof of Lemma 6.

Corollary 3 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 1, we get
$$F(y_T) - F(u) \le \frac{L D_\infty^2\sum_{k=1}^{T}\eta_k\gamma_k^2 + D_\infty^2\big(\delta d + \|s_T\|_1\big)}{2\,\eta_T\gamma_T^2}.$$

Proof of Corollary 3 The result follows from Theorem 3 and Lemma 6, as well as noting that the choice of $\eta_k$ guarantees $\eta_k\gamma_k\|p_k\|_{S_k^{-1}}^2 \le \|p_k\|_2^2/L$, so the braced terms in Theorem 3 are non-positive, and that $\eta_T\gamma_T^2 = \sum_{i=1}^{T}\eta_i\gamma_i \le \sum_{i=1}^{T}\eta_i\gamma_i^2$.

Corollary 4 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 3, we get
$$F(y_T) - F(u) \le \frac{L D_\infty^2\sum_{k=1}^{T}\eta_k\gamma_k^2 + D_\infty^2\big(\delta d + \|s_T\|_1\big)}{2\,\eta_T\gamma_T^2}.$$

Proof of Corollary 4 The result follows from Theorem 4 and Lemma 7, as well as noting that $\eta_T\gamma_T^2 = \sum_{i=1}^{T}\eta_i\gamma_i \le \sum_{i=1}^{T}\eta_i\gamma_i^2$.

Finally, it only remains to lower bound $\eta_T\gamma_T^2 = \sum_{k=1}^{T}\eta_k\gamma_k$, which is done in the following lemma.

Lemma 8 For the choice of $\eta_k$ in Algorithm 1, we have
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k}.$$
Proof of Lemma 8 We prove this by induction on $T$. For $T = 1$, we have $\gamma_1 = 1$ and $\eta_1\gamma_1^2 = \eta_1$, and the base case holds trivially. Suppose the desired relation holds for $T - 1$. We have
$$\eta_T\gamma_T^2 = \eta_T\gamma_T + \eta_{T-1}\gamma_{T-1}^2 \ge \eta_T\gamma_T + \frac{(T-1)^3}{1000\, L\sum_{k=1}^{T-1} g_k^\top S_k^{-1} g_k} \ge \eta_T\gamma_T + \frac{(T-1)^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k},$$
where the first inequality is by the induction hypothesis on $T - 1$. Now if
$$\eta_T\gamma_T \ge \frac{3T^2}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k},$$
then we are done, since $(T-1)^3 + 3T^2 \ge T^3$. Otherwise, denoting $Z_T := \sum_{k=1}^{T} g_k^\top S_k^{-1} g_k$, we must have $\eta_T\gamma_T < 3T^2/(1000\, L\, Z_T)$. Hence, using the choice of $\eta_T$ in Algorithm 1 together with Lemma 6(ii), a direct calculation again gives
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\, Z_T}.$$

Remark: We note here that we made little effort to minimize constants, and that we used rather sloppy bounds such as $1 - 1/T \ge 1/2$. As a result, the constant 1000 appearing above is very conservative and a mere by-product of our proof technique.

Once again, the FLARE analog of Lemma 8 is the following.

Lemma 9 For the choice of $\eta_k$ in Algorithm 3, we have
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k}.$$

Proof of Lemma 9 Once again, exactly identically to the proof of Lemma 8, we obtain the same lower bound. Finally, using the corresponding guarantee from Step 11 of Algorithm 4 and Step 9 of Algorithm 5, we get the conclusion.

The proof of FLAG's main result, Theorem 1, follows rather immediately.

Proof of Theorem 1 The result follows immediately from Lemma 8 and Corollary 3, noting that $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$ by Lemma 5, and $\|s_T\|_1 = q_T$ by Step 6 of Algorithm 1 and the definition of $q_T$ in Lemma 5. This gives
$$F(y_T) - F(u) \le \frac{1000\, L D_\infty^2\, q_T^2}{T^3} + \frac{100\, L D_\infty^2\, q_T^2}{T^3} = \frac{1100\,\beta\, L D_\infty^2}{T^2}.$$
Now from Lemma 5, we see that $\beta := q_T^2/T \in [1, d]$. Finally, the run-time per iteration follows from having to do $\log_2(1/\epsilon)$ calls to bisection, each taking $O(T_{\mathrm{prox}})$ time.

The proof of FLARE's main result, Theorem 2, is obtained similarly to that of Theorem 1.

Proof of Theorem 2 The result follows immediately from Lemma 9 and Corollary 4, noting that $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$ by Lemma 5, and $\|s_T\|_1 = q_T$ by Step 6 of Algorithm 4 and Step 5 of Algorithm 5 and the definition of $q_T$ in Lemma 5. This gives
$$F(y_T) - F(u) \le \frac{1100\,\beta\, L D_\infty^2}{T^2}, \qquad \beta := q_T^2/T \in [1, d].$$
Finally, we try to guess a suitable $\gamma_k$ for $\log(d/\epsilon)$ times, and resort to BinarySearch afterwards. If we resort
to Algorithm 5 (essentially BinarySearch), we make $\log(1/\epsilon)$ calls to bisection, so overall the number of inner iterations per outer iteration is the same as in Algorithm 1. Each inner iteration takes $O(T_{\mathrm{prox}})$ time in the worst case (if we have to resort to Algorithm 5 each time).
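As a closing illustration, the bisection routine analyzed above (Algorithm 2, reused by Algorithm 5) can be sketched in a few lines. This is our reconstruction under stated assumptions, not the paper's pseudocode: the function name and the toy map used in the usage note are ours; only the three-case structure of Lemma 3 and the $\lceil\log_2(1/\epsilon)\rceil$ call count are taken from the text.

```python
import math
import numpy as np

def binary_search(prox, z, y, eps):
    """Bisect on g(t) = <prox(x) - x, y - z> with x = t*y + (1 - t)*z.

    Mirrors the three cases of Lemma 3: return the endpoint y, the
    endpoint z, or an interior point whose mixing weight t is within
    eps of a sign change of g.
    """
    def g(t):
        x = t * y + (1.0 - t) * z
        return float(np.dot(prox(x) - x, y - z))

    if g(1.0) >= 0.0:        # case (i): x = y
        return y
    if g(0.0) <= 0.0:        # case (ii): x = z
        return z
    lo, hi = 0.0, 1.0        # now g(lo) > 0 > g(hi): a root lies in between
    for _ in range(math.ceil(math.log2(1.0 / eps))):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)      # |t - t*| <= eps, giving case (iii)
    return t * y + (1.0 - t) * z
```

For instance, with the monotone one-dimensional toy map `prox(x) = 1 - x`, `y = [1.0]` and `z = [0.0]`, the sign of `g` changes at `t* = 0.5`, and the routine returns a point within `eps` of 0.5 after `ceil(log2(1/eps))` evaluations of `g`, matching the bisection call count used in the runtime discussion above.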
More informationSECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS
SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss
More informationA class of Smoothing Method for Linear Second-Order Cone Programming
Columbia International Publishing Journal of Advanced Computing (13) 1: 9-4 doi:1776/jac1313 Research Article A class of Smoothing Method for Linear Second-Order Cone Programming Zhuqing Gui *, Zhibin
More informationSupplemental Material for Monte Carlo Sampling for Regret Minimization in Extensive Games
Supplemental Material for Monte Carlo Sampling for Regret Minimization in Extensive Games Marc Lanctot Department of Computing Science University of Alberta Edmonton, Alberta, Canada 6G E8 lanctot@ualberta.ca
More informationProximal splitting methods on convex problems with a quadratic term: Relax!
Proximal splitting methods on convex problems with a quadratic term: Relax! The slides I presented with added comments Laurent Condat GIPSA-lab, Univ. Grenoble Alpes, France Workshop BASP Frontiers, Jan.
More informationLecture: Duality of LP, SOCP and SDP
1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:
More informationWAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME)
WAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME ITAI BENJAMINI, GADY KOZMA, LÁSZLÓ LOVÁSZ, DAN ROMIK, AND GÁBOR TARDOS Abstract. We observe returns of a simple random wal on a finite graph to a fixed node,
More informationUSA Mathematical Talent Search Round 2 Solutions Year 27 Academic Year
1/2/27. In the grid to the right, the shortest path through unit squares between the pair of 2 s has length 2. Fill in some of the unit squares in the grid so that (i) exactly half of the squares in each
More information17 Solution of Nonlinear Systems
17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m
More informationMatrix Secant Methods
Equation Solving g(x) = 0 Newton-Lie Iterations: x +1 := x J g(x ), where J g (x ). Newton-Lie Iterations: x +1 := x J g(x ), where J g (x ). 3700 years ago the Babylonians used the secant method in 1D:
More informationFor those who want to skip this chapter and carry on, that s fine, all you really need to know is that for the scalar expression: 2 H
1 Matrices are rectangular arrays of numbers. hey are usually written in terms of a capital bold letter, for example A. In previous chapters we ve looed at matrix algebra and matrix arithmetic. Where things
More informationContinuity. Chapter 4
Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More information6.854J / J Advanced Algorithms Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.85J / 8.5J Advanced Algorithms Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 8.5/6.85 Advanced Algorithms
More informationA Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels
A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels Mehdi Mohseni Department of Electrical Engineering Stanford University Stanford, CA 94305, USA Email: mmohseni@stanford.edu
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More information15-780: LinearProgramming
15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationOptimization, Learning, and Games with Predictable Sequences
Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic
More informationOrthogonal Projection and Least Squares Prof. Philip Pennance 1 -Version: December 12, 2016
Orthogonal Projection and Least Squares Prof. Philip Pennance 1 -Version: December 12, 2016 1. Let V be a vector space. A linear transformation P : V V is called a projection if it is idempotent. That
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems
More informationFast proximal gradient methods
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient
More informationOptimal Newton-type methods for nonconvex smooth optimization problems
Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations
More informationSize-Depth Tradeoffs for Boolean Formulae
Size-Depth Tradeoffs for Boolean Formulae Maria Luisa Bonet Department of Mathematics Univ. of Pennsylvania, Philadelphia Samuel R. Buss Department of Mathematics Univ. of California, San Diego July 3,
More informationI.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010
I.3. LMI DUALITY Didier HENRION henrion@laas.fr EECI Graduate School on Control Supélec - Spring 2010 Primal and dual For primal problem p = inf x g 0 (x) s.t. g i (x) 0 define Lagrangian L(x, z) = g 0
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationRadial Subgradient Descent
Radial Subgradient Descent Benja Grimmer Abstract We present a subgradient method for imizing non-smooth, non-lipschitz convex optimization problems. The only structure assumed is that a strictly feasible
More informationGRADIENT = STEEPEST DESCENT
GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More information