
EE364b, Prof. S. Boyd

EE364b Homework 2

1. Subgradient optimality conditions for nondifferentiable inequality constrained optimization. Consider the problem
\[
\begin{array}{ll}
\mbox{minimize} & f_0(x)\\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1,\ldots,m,
\end{array}
\]
with variable $x \in \mathbf{R}^n$. We do not assume that $f_0,\ldots,f_m$ are convex. Suppose that $\bar x$ and $\lambda \succeq 0$ satisfy primal feasibility,
\[
f_i(\bar x) \le 0, \quad i = 1,\ldots,m,
\]
dual feasibility,
\[
0 \in \partial f_0(\bar x) + \sum_{i=1}^m \lambda_i \partial f_i(\bar x),
\]
and the complementarity condition
\[
\lambda_i f_i(\bar x) = 0, \quad i = 1,\ldots,m.
\]
Show that $\bar x$ is optimal, using only a simple argument and the definition of a subgradient. Recall that we do not assume the functions $f_0,\ldots,f_m$ are convex.

Solution. Let $g$ be defined by $g(x) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$. Then $0 \in \partial g(\bar x)$. By the definition of a subgradient, this means that for any $y$,
\[
g(y) \ge g(\bar x) + 0^T (y - \bar x).
\]
Thus, for any $y$,
\[
f_0(y) - f_0(\bar x) \ge -\sum_{i=1}^m \lambda_i \left( f_i(y) - f_i(\bar x) \right).
\]
For each $i$, complementarity implies that either $\lambda_i = 0$ or $f_i(\bar x) = 0$. Hence, for any feasible $y$ (for which $f_i(y) \le 0$), each $\lambda_i (f_i(y) - f_i(\bar x))$ term is either zero or negative. Therefore, any feasible $y$ satisfies $f_0(y) \ge f_0(\bar x)$, and $\bar x$ is optimal.
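As a quick sanity check (a small instance of our own, not part of the assignment), take $n = 1$, $f_0(x) = |x + 1|$, and $f_1(x) = -x$. At $\bar x = 0$ with $\lambda_1 = 1$, primal feasibility and complementarity hold since $f_1(\bar x) = 0$, and dual feasibility holds since
\[
0 \in \partial f_0(0) + \lambda_1 \partial f_1(0) = \{1\} + \{-1\}.
\]
Indeed, $\bar x = 0$ is optimal: every feasible $x \ge 0$ satisfies $f_0(x) = x + 1 \ge 1 = f_0(0)$.

2. Optimality conditions and coordinate-wise descent for $\ell_1$-regularized minimization. We consider the problem of minimizing
\[
\phi(x) = f(x) + \lambda \|x\|_1,
\]
where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and differentiable, and $\lambda \ge 0$. The number $\lambda$ is the regularization parameter, and is used to control the trade-off between small $f$ and small $\|x\|_1$. When $\ell_1$-regularization is used as a heuristic for finding a sparse $x$ for which $f(x)$ is small, $\lambda$ controls (roughly) the trade-off between $f(x)$ and the cardinality (number of nonzero elements) of $x$.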

(a) Show that $x = 0$ is optimal for this problem (i.e., minimizes $\phi$) if and only if $\|\nabla f(0)\|_\infty \le \lambda$. In particular, for $\lambda \ge \lambda_{\max} = \|\nabla f(0)\|_\infty$, $\ell_1$-regularization yields the sparsest possible $x$, the zero vector.

Remark. The value $\lambda_{\max}$ gives a good reference point for choosing a value of the penalty parameter $\lambda$ in $\ell_1$-regularized minimization. A common choice is to start with $\lambda = \lambda_{\max}/2$, and then adjust $\lambda$ to achieve the desired sparsity/fit trade-off.

Solution. A necessary and sufficient condition for optimality of $x = 0$ is that $0 \in \partial \phi(0)$. Now
\[
\partial \phi(0) = \nabla f(0) + \lambda \, \partial \|0\|_1 = \nabla f(0) + \lambda [-1,1]^n.
\]
In other words, $x = 0$ is optimal if and only if $\nabla f(0) \in [-\lambda, \lambda]^n$. This is equivalent to $\|\nabla f(0)\|_\infty \le \lambda$.

(b) Coordinate-wise descent. In the coordinate-wise descent method for minimizing a convex function $g$, we first minimize over $x_1$, keeping all other variables fixed; then we minimize over $x_2$, keeping all other variables fixed, and so on. After minimizing over $x_n$, we go back to $x_1$ and repeat the whole process, repeatedly cycling over all $n$ variables. Show that coordinate-wise descent fails for the function
\[
g(x) = |x_1 - x_2| + 0.1(x_1 + x_2).
\]
(In particular, verify that the algorithm terminates after one step at the point $(x_2^{(0)}, x_2^{(0)})$, while $\inf_x g(x) = -\infty$.) Thus, coordinate-wise descent need not work, for general convex functions.

Solution. We first minimize over $x_1$, with $x_2$ fixed at $x_2^{(0)}$. The optimal choice is $x_1 = x_2^{(0)}$, since the derivative on the left is $-0.9$, and on the right it is $1.1$. We then arrive at the point $(x_2^{(0)}, x_2^{(0)})$. We now optimize over $x_2$. But it is already optimal, with the same left and right derivatives, so $x$ is unchanged. We are now at a fixed point of the coordinate-wise descent algorithm. On the other hand, taking $x = (-t, -t)$ and letting $t \to \infty$, we see that $g(x) = -0.2t \to -\infty$.

It's good to visualize coordinate-wise descent for this function, to see why $x$ gets stuck at the crease along $x_1 = x_2$. The graph looks like a folded piece of paper, with the crease along the line $x_1 = x_2$. The bottom of the crease has a small tilt in the direction $(-1, -1)$, so the function is unbounded below. Moving along either axis increases $g$, so coordinate-wise descent is stuck. But moving in the direction $(-1, -1)$, for example, decreases the function.
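A minimal numerical illustration of this failure (a sketch of our own, not part of the original solution; it uses the closed-form coordinate minimizers derived above, and the starting point is arbitrary):

% Coordinate-wise descent on g(x) = |x1 - x2| + 0.1*(x1 + x2).
% By the derivative calculation above, exact minimization over either
% coordinate (with the other held fixed) sets it equal to the other one.
g = @(x) abs(x(1) - x(2)) + 0.1*(x(1) + x(2));
x = [1; -2];                 % arbitrary starting point x^(0)
for cycle = 1:5
    x(1) = x(2);             % exact minimization over x1
    x(2) = x(1);             % exact minimization over x2 (no change)
end
g(x)                         % stuck at (-2, -2): g = -0.4
g([-100; -100])              % g = -20, so the fixed point is not optimal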

(c) Now consider coordinate-wise descent for minimizing the specific function $\phi$ defined above. Assuming $f$ is strongly convex (say), it can be shown that the iterates converge to a fixed point $\bar x$. Show that $\bar x$ is optimal, i.e., minimizes $\phi$. Thus, coordinate-wise descent works for $\ell_1$-regularized minimization of a differentiable function.

Solution. For each $i$, $\bar x_i$ minimizes the function $\phi$, with all other variables kept fixed. It follows that
\[
0 \in \partial_{x_i} \phi(\bar x) = \frac{\partial f}{\partial x_i}(\bar x) + \lambda I_i, \quad i = 1,\ldots,n,
\]
where $I_i$ is the subdifferential of $|x_i|$ at $\bar x_i$: $I_i = \{-1\}$ if $\bar x_i < 0$, $I_i = \{+1\}$ if $\bar x_i > 0$, and $I_i = [-1,1]$ if $\bar x_i = 0$. But this is the same as saying $0 \in \nabla f(\bar x) + \lambda \, \partial \|\bar x\|_1$, which means that $\bar x$ minimizes $\phi$.

The subtlety here lies in the general formula that relates the subdifferential of a function to its partial subdifferentials with respect to its components. For a separable function $h : \mathbf{R}^2 \to \mathbf{R}$, we have
\[
\partial h(x) = \partial_{x_1} h(x) \times \partial_{x_2} h(x),
\]
but this is false in general.
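A standard counterexample (our addition, for concreteness): take $h(x) = \max\{x_1, x_2\}$, which is not separable, at $x = 0$. Then
\[
\partial h(0) = \{ (\theta, 1 - \theta) : 0 \le \theta \le 1 \},
\]
the segment joining $(1,0)$ and $(0,1)$, while the partial subdifferentials are $\partial_{x_1} h(0) = \partial_{x_2} h(0) = [0,1]$, whose Cartesian product $[0,1]^2$ is strictly larger.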

(d) Work out an explicit form for coordinate-wise descent for $\ell_1$-regularized least-squares, i.e., for minimizing the function
\[
\|Ax - b\|_2^2 + \lambda \|x\|_1.
\]
You might find the deadzone function
\[
\psi(u) = \left\{ \begin{array}{ll} u - 1 & u > 1 \\ 0 & |u| \le 1 \\ u + 1 & u < -1 \end{array} \right.
\]
useful. Generate some data and try out the coordinate-wise descent method. Check the result against the solution found using CVX, and produce a graph showing convergence of your coordinate-wise method.

Solution. At each step we choose an index $i$, and minimize $\|Ax - b\|_2^2 + \lambda \|x\|_1$ over $x_i$, while holding all other $x_j$, with $j \ne i$, constant. Selecting the optimal $x_i$ for this problem is equivalent to selecting the optimal $x_i$ in the problem
\[
\mbox{minimize} \quad a x_i^2 + c x_i + |x_i|,
\]
where $a = (A^TA)_{ii}/\lambda$ and $c = (2/\lambda)\left( \sum_{j \ne i} (A^TA)_{ij} x_j - (b^TA)_i \right)$. Using the theory discussed above, any minimizer $\bar x_i$ will satisfy
\[
0 \in 2a \bar x_i + c + \partial |\bar x_i|.
\]
Now we note that $a$ is positive, so the minimizer of the above problem will have opposite sign to $c$. From there we deduce that the (unique) minimizer is
\[
\bar x_i = \left\{ \begin{array}{ll} 0 & c \in [-1,1] \\ (1/2a)(-c + \mathrm{sign}(c)) & \mbox{otherwise,} \end{array} \right.
\]
where
\[
\mathrm{sign}(u) = \left\{ \begin{array}{ll} -1 & u < 0 \\ 0 & u = 0 \\ 1 & u > 0. \end{array} \right.
\]
Finally, we make use of the deadzone function $\psi$ defined above and write
\[
\bar x_i = -\frac{\psi\left( (2/\lambda)\left( \sum_{j \ne i} (A^TA)_{ij} x_j - (b^TA)_i \right) \right)}{(2/\lambda)(A^TA)_{ii}}.
\]
Coordinate descent was implemented in Matlab for a random problem instance with $A \in \mathbf{R}^{400 \times 200}$. When solving to within 0.1% accuracy, the iterative method required only a third of the time of CVX. Sample code appears below, followed by a graph showing the coordinate-wise descent method's function value converging to the CVX function value.

% Generate a random problem instance.
randn('state', 10239);
m = 400; n = 200;
A = randn(m, n);
ATA = A'*A;
b = randn(m, 1);
l = 0.1;
TOL = 0.001;
xcoord = zeros(n, 1);

% Solve in CVX as a benchmark.
cvx_begin
    variable xcvx(n);
    minimize(sum_square(A*xcvx - b) + l*norm(xcvx, 1));
cvx_end

% Solve using coordinate-wise descent.
while abs(cvx_optval - (sum_square(A*xcoord - b) + ...
        l*norm(xcoord, 1)))/cvx_optval > TOL
    for i = 1:n
        xcoord(i) = 0;
        ei = zeros(n, 1); ei(i) = 1;
        c = 2/l*ei'*(ATA*xcoord - A'*b);
        % pos(u) = max(u, 0) (elementwise positive part, provided by CVX).
        xcoord(i) = -sign(c)*pos(abs(c) - 1)/(2*ATA(i,i)/l);
    end
end
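As a quick check of the scalar update (again a sketch of our own, not part of the original solution), the closed-form minimizer of $a x_i^2 + c x_i + |x_i|$ can be compared against a brute-force grid search:

% Verify the deadzone-based scalar minimizer on arbitrary test values.
a = 1.7; c = 2.3;                           % any a > 0 and any c
xstar = -sign(c)*max(abs(c) - 1, 0)/(2*a);  % closed-form minimizer
xs = linspace(-5, 5, 1000001);              % fine grid around the origin
[~, k] = min(a*xs.^2 + c*xs + abs(xs));
[xstar, xs(k)]                              % should agree to grid resolution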

[Figure: convergence of the coordinate-wise descent method. The gap between its function value and the CVX optimal value (logarithmic scale, from about $10^1$ down to $10^{-7}$) is plotted versus iteration number, over roughly 30 iterations.]