Math 301, Winter 2015. Solutions to Homework 3. (Provided by Austin Benson and Victor Minden, with some modifications.)

1. Recognizing the convexity of g(x) := x log x, from Jensen's inequality we get

    d(x) = Σ_{i=1}^n x_i log x_i + log n ≥ n · ((x_1 + ⋯ + x_n)/n) log((x_1 + ⋯ + x_n)/n) + log n = log(1/n) + log n = 0,

since x_1 + ⋯ + x_n = 1 on the simplex; equality is attained only at x = (1/n, …, 1/n).

On the interior of the simplex, the Hessian ∇²d(x) is diagonal with i-th diagonal entry 1/x_i > 1. So ∇²d(x) ⪰ I ≻ 0. Hence, the strong convexity parameter is at least one.

Alternatively, it is acceptable to show that d(x) is 1-strongly convex in the ℓ1 norm. To this end, note that

    hᵀ ∇²d(x) h = Σ_{i=1}^n h_i²/x_i = (Σ_i x_i)(Σ_i h_i²/x_i) ≥ (Σ_i |h_i|)² = ‖h‖₁²,

where we make use of the fact Σ_i x_i = 1 and the Cauchy–Schwarz inequality.

2. We have that f_µ(x) = sup_{(u,v)∈Q} ⟨u − v, Ax − b⟩ − µ d(u, v). We can re-write this as the negated optimal value of

    minimize_{w ∈ ℝ^{2m}}  µ d(w) + wᵀc   subject to  wᵀ1 = 1

(with w ≥ 0 implicit in the domain of d), where w = (u, v) and c = (−Ax + b, Ax − b). We will follow some of the steps in Boyd & Vandenberghe [1] for the conjugate of the entropy function. For the function of a single variable f(z) = µ z log z + az, the conjugate is

    f*(y) = sup_{z ∈ [0,1]} yz − µ z log z − az.

Taking a derivative and setting it equal to zero gives z = e^{(y−a)/µ − 1}. Plugging in this solution gives f*(y) = µ e^{(y−a)/µ − 1}. Therefore, the conjugate of µ d(w) + wᵀc is

    f*(y) = Σ_{i=1}^{2m} µ e^{(y_i − c_i)/µ − 1}.

Hence, the dual function is

    g(λ) = −λ − f*(−λ·1) = −λ − Σ_{i=1}^{2m} µ e^{(−λ − c_i)/µ − 1} = −λ − µ e^{−(λ+µ)/µ} Σ_{i=1}^{2m} e^{−c_i/µ}.

We can maximize the dual over λ (setting g′(λ) = 0 gives λ = µ log Σ_i e^{−c_i/µ} − µ) and get that

    max_λ g(λ) = −µ log ( Σ_{i=1}^{2m} e^{−c_i/µ} ).
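For illustration (this sketch is not part of the original write-up; NumPy is assumed), both claims of problem 1 can be checked numerically at random interior points of the simplex: d(x) ≥ 0, and the quadratic form hᵀ∇²d(x)h dominates both ‖h‖₂² and ‖h‖₁².

```python
import numpy as np

def entropy_prox_checks(n=20, trials=100, seed=0):
    """Numerically verify d(x) >= 0 on the simplex and the strong convexity
    bounds h^T Hd(x) h >= ||h||_2^2 and h^T Hd(x) h >= ||h||_1^2."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.random(n) + 1e-3
        x /= x.sum()                          # interior point of the simplex
        d = np.sum(x * np.log(x)) + np.log(n)
        assert d >= -1e-12                    # Jensen: d(x) >= 0
        h = rng.standard_normal(n)
        quad = np.sum(h**2 / x)               # h^T Hd(x) h, with Hd = diag(1/x_i)
        assert quad >= h @ h - 1e-9           # Hd(x) >= I on the simplex
        assert quad >= np.linalg.norm(h, 1)**2 - 1e-9   # Cauchy-Schwarz bound
    return True
```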

But we know that c = (−Ax + b, Ax − b), so

    µ log ( Σ_{i=1}^{2m} e^{−c_i/µ} ) = µ log ( Σ_{i=1}^m 2 cosh((a_iᵀx − b_i)/µ) ).

By strong duality we know that this equals the optimal value of the primal minimization. Finally, we account for the µ log(2m) term and the fact that we need to negate the value (since we switched from maximizing to minimizing the objective). Putting everything together gives

    f_µ(x) = −µ log(2m) + µ log ( Σ_{i=1}^m 2 cosh((a_iᵀx − b_i)/µ) ) = µ log ( (1/m) Σ_{i=1}^m cosh((a_iᵀx − b_i)/µ) ).

We will need the gradient to run our algorithm for the next question. The derivative with respect to x_j is

    ∂f_µ(x)/∂x_j = ( Σ_{i=1}^m sinh((a_iᵀx − b_i)/µ) a_{ij} ) / ( Σ_{i=1}^m cosh((a_iᵀx − b_i)/µ) ),

so the gradient is given by

    ∇f_µ(x) = Aᵀ sinh((Ax − b)/µ) / ( 1ᵀ cosh((Ax − b)/µ) ),

where cosh and sinh are taken entry-wise.

3. We consider three algorithms for solving the optimization problem: (1) a standard optimization solver (cvx); (2) gradient descent on f_µ with the fixed value µ₀ = 0.05 (fixed smoothing); and (3) adaptive gradient descent on f_µ, where µ ranges from 5 down to µ₀ = 0.05, decreasing by a constant factor at each iteration (adaptive smoothing). We restrict each sub-problem of adaptive smoothing to one tenth the number of iterations used by fixed smoothing, in order to control the running time of the former algorithm. For gradient descent, we compute the step size from a line search.

Table 1 summarizes the performance results on a test problem where A ∈ ℝ^{00×50}.

    Table 1: Summary of performance of algorithms on the test problem when using entropy smoothing.

    Algorithm                    ‖Ax − b‖_∞   time (seconds)
    cvx                          0.54         0.38
    fixed smoothing (µ = 0.05)   0.05         .0
    adaptive smoothing           0.9          0.33

The cvx solution is the best, and its running time is about the same as that of adaptive smoothing. The fixed smoothing algorithm is by far the slowest. Figure 1 shows how the solution from adaptive smoothing varies with µ: adaptive smoothing approaches an objective value close to that of fixed smoothing, but the adaptive version is much faster.
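For illustration, the smoothed objective and the fixed/adaptive gradient-descent schemes can be sketched in Python (this is a NumPy reconstruction, not the original experiment code; the continuation factor, iteration budgets, and backtracking constants are assumptions, and the shift by max|r_i/µ| is a standard log-sum-exp trick to avoid overflow for small µ):

```python
import numpy as np

def f_grad_entropy(A, b, x, mu):
    """Entropy-smoothed objective f_mu(x) = mu*log((1/m) sum_i cosh(r_i/mu))
    and its gradient A^T sinh(r/mu) / (1^T cosh(r/mu)), with r = Ax - b."""
    m = A.shape[0]
    r = (A @ x - b) / mu
    t = np.max(np.abs(r))                    # stabilizing shift
    c = 0.5 * (np.exp(r - t) + np.exp(-r - t))   # e^{-t} * cosh(r)
    s = 0.5 * (np.exp(r - t) - np.exp(-r - t))   # e^{-t} * sinh(r)
    f = mu * (t + np.log(np.sum(c)) - np.log(m))
    g = A.T @ s / np.sum(c)
    return f, g

def grad_descent(oracle, x, iters):
    """Gradient descent with an Armijo backtracking line search."""
    for _ in range(iters):
        f, g = oracle(x)
        t = 1.0
        while oracle(x - t * g)[0] > f - 0.5 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    return x

def fixed_smoothing(A, b, x, mu0=0.05, iters=500):
    return grad_descent(lambda y: f_grad_entropy(A, b, y, mu0), x, iters)

def adaptive_smoothing(A, b, x, mu_start=5.0, mu0=0.05, factor=1.5, iters=50):
    """Continuation with warm starts: a few gradient steps per value of mu,
    shrinking mu geometrically and reusing the last iterate as the next start."""
    mu = mu_start
    while mu >= mu0:
        x = grad_descent(lambda y: f_grad_entropy(A, b, y, mu), x, iters)
        mu /= factor
    return x
```

The warm start is the point of the adaptive variant: each sub-problem starts from the minimizer of a slightly smoother surrogate, so a small iteration budget per value of µ suffices.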

Figure 1: Performance of adaptive smoothing as µ varies for entropy smoothing (‖Ax − b‖_∞ vs. µ; curves for adaptive smoothing, fixed smoothing, and cvx).

4. Let r = Ax − b. We have

    f_µ(r) = sup_{‖u‖₁ ≤ 1} uᵀr − (µ/2)‖u‖₂²
           = sup_{‖u‖₁ ≤ 1} −(µ/2)‖u − r/µ‖₂² + (µ/2)‖r/µ‖₂²
           = (µ/2) [ ‖r/µ‖₂² − inf_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂² ].

The solution to inf_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂² is the Euclidean projection of r/µ onto the ℓ1 ball. By the optimality conditions, we know that this is achieved by soft thresholding at some unknown threshold λ [2]. Specifically,

    u_i = sgn(r_i) ( |r_i|/µ − λ )₊

for the λ that satisfies Σ_i ( |r_i|/µ − λ )₊ = 1. Furthermore, we can use bisection to find this λ, using the fact that λ ∈ [0, max_i |r_i|/µ].

Now we compute the gradient. We know that f_µ(r) is the conjugate function of f(u) = I_{‖u‖₁ ≤ 1}(u) + (µ/2)‖u‖₂². Thus, from the lecture notes,

    ∇f_µ(r) = argmax_{‖u‖₁ ≤ 1} [ rᵀu − (µ/2)‖u‖₂² ] = argmin_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂²,

i.e., ∇f_µ(r) is just the projection of r/µ onto the ℓ1 ball. Applying the chain rule (since r = Ax − b), we arrive at

    ∇f_µ(x) = Aᵀ · argmin_{‖u‖₁ ≤ 1} ‖u − (Ax − b)/µ‖₂².

5. Now we solve the problem with quadratic smoothing and µ₀ = 0.01. Table 2 summarizes the performance results on the same test problem. We note that cvx is around two orders of
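The soft-threshold-plus-bisection step just described can be sketched as follows (an illustrative implementation, not the original code; the tolerance and the early return for points already inside the ball are assumptions):

```python
import numpy as np

def project_l1_ball(v, radius=1.0, tol=1e-12):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}, found by
    bisecting on the soft-threshold level lambda in [0, max_i |v_i|]."""
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()                       # already feasible: project to itself
    lo, hi = 0.0, a.max()
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.maximum(a - lam, 0.0).sum() > radius:
            lo = lam                          # threshold too small: still outside
        else:
            hi = lam
    return np.sign(v) * np.maximum(a - hi, 0.0)
```

With this routine, the gradient derived above is simply `A.T @ project_l1_ball((A @ x - b) / mu)`.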

magnitude faster than our solver. In this case, fixed smoothing did not converge within the allotted iterations, so the resulting objective value is large. (This is motivation to use an accelerated method, but we will not consider that here.) On the other hand, adaptive smoothing finds a solution with residual comparable to the cvx solution.

    Table 2: Summary of performance of algorithms on the same test problem when using quadratic smoothing.

    Algorithm                    ‖Ax − b‖_∞   time (seconds)
    cvx                          0.54         0.3769
    fixed smoothing (µ = 0.01)   5.374        3.49
    adaptive smoothing           0.554        8.43

Figure 2 shows how the solution from adaptive smoothing varies with µ; the figure illustrates the benefit of warm starts. Overall, adaptive smoothing performs much better than fixed smoothing.

Figure 2: Performance of adaptive smoothing as µ varies for quadratic smoothing (‖Ax − b‖_∞ vs. µ; curves for adaptive smoothing, fixed smoothing, and cvx).

6. We now test on a problem of size A ∈ ℝ^{3000×750}. To keep our analysis simple, we will just compare entropy and quadratic smoothing, using fixed smoothing with a few different values of µ. To compare the quality of the solutions, we use the true residual ‖Ax − b‖_∞ and the relative error ‖x₀ − x‖/‖x₀‖ of the computed solution x with respect to the vector x₀ used to generate the data (b = Ax₀ + z, where z is noise). Table 3 summarizes the performance of the two smoothing techniques. In general, entropy smoothing is much faster, while quadratic smoothing finds a solution that is as good or slightly better. Furthermore, entropy smoothing was much easier to implement. For these reasons, I prefer entropy smoothing.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Table 3: Summary of performance of fixed smoothing with the two different smoothing techniques for min ‖Ax − b‖_∞, where A ∈ ℝ^{3000×750}. The data is generated synthetically by b = Ax₀ + z, where A and x₀ have entries N(0, 1) and z has small Gaussian entries.

    µ     smoothing   ‖Ax − b‖_∞   ‖x₀ − x‖/‖x₀‖   time (seconds)
    000   entropy     0.030        .99e-4          .84
          quadratic   0.037        .98e-4          33.77
    500   entropy     0.037        .99e-4          0.89
          quadratic   0.037        .98e-4          33.77
    00    entropy     0.037        .98e-4          .33
          quadratic   0.037        .98e-4          64.3

[2] N. Parikh and S. Boyd. Proximal Algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2013.