Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Instructor: Moritz Hardt
Email: hardt+ee227c@berkeley.edu

Graduate Instructor: Max Simchowitz
Email: msimchow+ee227c@berkeley.edu

October 5, 2018

5 Conditional gradient method

In this lecture we discuss the conditional gradient method, also known as the Frank-Wolfe (FW) algorithm [FW56]. The motivation for this approach is that the projection step in projected gradient descent can be computationally inefficient in certain scenarios. The conditional gradient method provides an appealing alternative.

5.1 The algorithm

Conditional gradient sidesteps the projection step using a clever idea. We start from some point $x_0 \in \Omega$. Then, for time steps $t = 1$ to $T$, where $T$ is our final time step, we set
$$x_{t+1} = x_t + \eta_t (\bar{x}_t - x_t)$$
where
$$\bar{x}_t = \arg\min_{x \in \Omega} \; f(x_t) + \nabla f(x_t)^\top (x - x_t).$$
This expression simplifies to:
$$\bar{x}_t = \arg\min_{x \in \Omega} \; \nabla f(x_t)^\top x.$$
Note that we need step size $\eta_t \in [0, 1]$ to guarantee $x_{t+1} \in \Omega$.

So, rather than taking a gradient step and projecting onto the constraint set, we optimize a linear function (defined by the gradient) inside the constraint set, as summarized in Figure 1.
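To make the update concrete, here is a minimal Python sketch of the loop above. It is only a sketch under stated assumptions: grad_f (a gradient oracle), linear_oracle (which returns a minimizer of a linear function over $\Omega$), x0, and the step size schedule are hypothetical names supplied by the user; the default $\eta_t = 2/(t+2)$ is the schedule analyzed in the next section. The same scheme is summarized in Figure 1 below.

    import numpy as np

    def frank_wolfe(grad_f, linear_oracle, x0, T, step_size=lambda t: 2.0 / (t + 2)):
        # grad_f(x): gradient of f at x
        # linear_oracle(g): returns argmin over Omega of <g, x>
        # step_size(t): eta_t in [0, 1]
        x = np.asarray(x0, dtype=float)
        for t in range(T):
            x_bar = linear_oracle(grad_f(x))     # linear optimization step
            x = x + step_size(t) * (x_bar - x)   # convex combination stays in Omega
        return x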

Starting from $x_0 \in \Omega$, repeat:
$$\bar{x}_t = \arg\min_{x \in \Omega} \; \nabla f(x_t)^\top x \qquad \text{(linear optimization)}$$
$$x_{t+1} = x_t + \eta_t (\bar{x}_t - x_t) \qquad \text{(update step)}$$

Figure 1: Conditional gradient

5.2 Conditional gradient convergence analysis

As it turns out, conditional gradient enjoys a convergence guarantee similar to the one we saw for projected gradient descent.

Theorem 5.1 (Convergence Analysis). Assume we have a function $f \colon \Omega \to \mathbb{R}$ that is convex, $\beta$-smooth and attains its global minimum at a point $x^* \in \Omega$. Then, Frank-Wolfe achieves
$$f(x_t) - f(x^*) \leq \frac{2 \beta D^2}{t + 2}$$
with step size
$$\eta_t = \frac{2}{t + 2}.$$
Here, $D$ is the diameter of $\Omega$, defined as $D = \max_{x, y \in \Omega} \|x - y\|$.

Note that we can trade our assumption of the existence of $x^*$ for a dependence on $L$, the Lipschitz constant, in our bound.

Proof of Theorem 5.1. By smoothness and convexity, we have
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2} \|x - y\|^2.$$
Letting $y = x_{t+1}$ and $x = x_t$, combined with the progress rule of conditional gradient descent, the above inequality yields:
$$f(x_{t+1}) \leq f(x_t) + \eta_t \nabla f(x_t)^\top (\bar{x}_t - x_t) + \frac{\eta_t^2 \beta}{2} \|\bar{x}_t - x_t\|^2.$$
We now recall the definition of $D$ from Theorem 5.1 and observe that $\|\bar{x}_t - x_t\|^2 \leq D^2$. Thus, we rewrite the inequality:
$$f(x_{t+1}) \leq f(x_t) + \eta_t \nabla f(x_t)^\top (\bar{x}_t - x_t) + \frac{\eta_t^2 \beta D^2}{2}.$$
Since $\bar{x}_t$ minimizes $\nabla f(x_t)^\top x$ over $\Omega$ and by convexity, we also have that
$$\nabla f(x_t)^\top (\bar{x}_t - x_t) \leq \nabla f(x_t)^\top (x^* - x_t) \leq f(x^*) - f(x_t).$$

Thus,
$$f(x_{t+1}) - f(x^*) \leq (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\eta_t^2 \beta D^2}{2}. \tag{1}$$
We use induction in order to prove
$$f(x_t) - f(x^*) \leq \frac{2 \beta D^2}{t + 2}$$
based on Equation (1) above.

Base case $t = 0$. Since $f(x_{t+1}) - f(x^*) \leq (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\eta_t^2 \beta D^2}{2}$, when $t = 0$ we have $\eta_0 = \frac{2}{0+2} = 1$. Hence,
\begin{align*}
f(x_1) - f(x^*) &\leq (1 - \eta_0)\big(f(x_0) - f(x^*)\big) + \frac{\beta}{2} \|\bar{x}_0 - x_0\|^2 \\
&= (1 - 1)\big(f(x_0) - f(x^*)\big) + \frac{\beta}{2} \|\bar{x}_0 - x_0\|^2 \\
&\leq \frac{\beta D^2}{2} \leq \frac{2 \beta D^2}{3}.
\end{align*}
Thus, the induction hypothesis holds for our base case.

Inductive step. Proceeding by induction, we assume that $f(x_t) - f(x^*) \leq \frac{2 \beta D^2}{t + 2}$ holds for all integers up to $t$ and we show the claim for $t + 1$. By Equation (1),
\begin{align*}
f(x_{t+1}) - f(x^*) &\leq \Big(1 - \frac{2}{t+2}\Big)\big(f(x_t) - f(x^*)\big) + \frac{\beta D^2}{2} \cdot \frac{4}{(t+2)^2} \\
&\leq \Big(1 - \frac{2}{t+2}\Big) \frac{2 \beta D^2}{t+2} + \frac{2 \beta D^2}{(t+2)^2} \\
&= 2 \beta D^2 \Big(\frac{t}{(t+2)^2} + \frac{1}{(t+2)^2}\Big) \\
&= 2 \beta D^2 \, \frac{t+1}{(t+2)^2} \\
&\leq 2 \beta D^2 \, \frac{t+1}{(t+1)(t+3)} \\
&= \frac{2 \beta D^2}{(t+1) + 2},
\end{align*}
where the second-to-last step uses $(t+2)^2 \geq (t+1)(t+3)$. Thus, the inequality also holds for the $t + 1$ case, which completes the proof.
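As an illustrative sanity check of this rate, consider minimizing a small least-squares objective over the probability simplex, where the linear optimization step has a closed form: the minimizer of a linear function over the simplex is the vertex indexed by the smallest gradient coordinate. The following is a minimal sketch; A, b, and all function names are illustrative choices, not data from the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 10))
    b = rng.standard_normal(20)

    def f(x):
        return 0.5 * np.sum((A @ x - b) ** 2)

    def grad_f(x):
        return A.T @ (A @ x - b)

    def linear_oracle(g):
        # argmin of <g, x> over the simplex is the vertex at the smallest coordinate of g
        e = np.zeros_like(g)
        e[np.argmin(g)] = 1.0
        return e

    x = np.full(10, 0.1)                      # start at the center of the simplex
    for t in range(1000):
        x_bar = linear_oracle(grad_f(x))
        x = x + 2.0 / (t + 2) * (x_bar - x)   # eta_t = 2 / (t + 2)
    print("final objective:", f(x))
    # The suboptimality f(x_t) - f(x*) decays on the order of beta * D^2 / t, where here
    # D^2 <= 2 (squared diameter of the simplex) and beta is the largest eigenvalue of A^T A.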

5.3 Application to nuclear norm optimization problems

The code for the following examples can be found here.

5.3.1 Nuclear norm projection

The nuclear norm (sometimes called Schatten 1-norm or trace norm) of a matrix $A$, denoted $\|A\|_*$, is defined as the sum of its singular values
$$\|A\|_* = \sum_i \sigma_i(A).$$
The norm can be computed from the singular value decomposition of $A$. We denote the unit ball of the nuclear norm by
$$B_*^{m \times n} = \{A \in \mathbb{R}^{m \times n} \mid \|A\|_* \leq 1\}.$$
How can we project a matrix $A$ onto $B_*$? Formally, we want to solve
$$\min_{X \in B_*} \|A - X\|_F^2.$$
Due to the rotational invariance of the Frobenius norm, the solution is obtained by projecting the singular values onto the unit simplex. This operation corresponds to shifting all singular values by the same parameter $\theta$ and clipping values at $0$ so that the sum of the shifted and clipped values is equal to $1$. This algorithm can be found in [DSSSC08].

5.3.2 Low-rank matrix completion

Suppose we have a partially observable matrix $Y$, of which the missing entries are filled with $0$, and we would like to find its completion projected onto a nuclear norm ball. Formally, we have the objective function
$$\min_{X \in B_*} \frac{1}{2} \|Y - P_O(X)\|_F^2$$
where $P_O$ is a linear projection onto a subset of coordinates of $X$ specified by $O$. In this example, $P_O(X)$ generates a matrix that keeps the entries of $X$ at the observed positions of $Y$ and is $0$ elsewhere. We can write $P_O(X) = X \odot O$, where $O$ is a matrix with binary entries. Calculating the gradient of this function, we have
$$\nabla f(X) = -(Y - X \odot O) \odot O = X \odot O - Y,$$
where the last equality uses that $Y$ is $0$ outside the observed entries. We can use projected gradient descent to solve this problem, but it is more efficient to use the Frank-Wolfe algorithm. We need to solve the linear optimization oracle
$$\bar{X}_t \in \arg\min_{X \in B_*} \langle \nabla f(X_t), X \rangle.$$
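For concreteness, here is a minimal NumPy sketch of the projection described in Section 5.3.1, using a standard sort-based simplex projection in the spirit of [DSSSC08]; the function names are illustrative and not taken from the course code.

    import numpy as np

    def project_simplex(s):
        # Euclidean projection of s onto {w : w >= 0, sum(w) = 1}
        u = np.sort(s)[::-1]
        cumsum = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - cumsum) / (np.arange(len(s)) + 1) > 0)[0][-1]
        theta = (1.0 - cumsum[rho]) / (rho + 1)   # common shift
        return np.maximum(s + theta, 0.0)         # shift and clip at 0

    def project_nuclear_ball(A):
        # Projection of A onto the nuclear norm unit ball via its singular values
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        if s.sum() <= 1.0:
            return A                              # already inside the ball
        return (U * project_simplex(s)) @ Vt

By contrast, a Frank-Wolfe step never needs this full SVD-based projection; as discussed next, its linear optimization oracle only requires the top pair of singular vectors.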

To simplify this problem, we need a simple fact that follows from the singular value decomposition.

Fact 5.2. The unit ball of the nuclear norm is the convex hull of rank-1 matrices:
$$\mathrm{conv}\{uv^\top \mid \|u\| = \|v\| = 1, \; u \in \mathbb{R}^m, v \in \mathbb{R}^n\} = \{X \in \mathbb{R}^{m \times n} \mid \|X\|_* \leq 1\}.$$

From this fact it follows that the minimum of $\langle \nabla f(X_t), X \rangle$ over $B_*$ is attained at a rank-1 matrix $uv^\top$ for unit vectors $u$ and $v$. Equivalently, we can maximize $\langle -\nabla f(X_t), uv^\top \rangle$ over all unit vectors $u$ and $v$. Put $Z = -\nabla f(X_t)$ and note that
$$\langle Z, uv^\top \rangle = \mathrm{tr}(Z^\top uv^\top) = \mathrm{tr}(v^\top Z^\top u) = u^\top Z v.$$
Another way to see this is to note that the dual norm of the nuclear norm is the operator norm,
$$\|Z\| = \max_{\|X\|_* \leq 1} \langle Z, X \rangle.$$
Either way, we see that to run Frank-Wolfe over the nuclear norm ball we only need a way to compute the top left and right singular vectors of a matrix. One way of doing this is using the classical power method, described in Figure 2.

Pick a random unit vector $x_1$ and let $y_1 = A^\top x_1 / \|A^\top x_1\|$.
From $k = 1$ to $k = T - 1$:
    Put $x_{k+1} = \dfrac{A y_k}{\|A y_k\|}$
    Put $y_{k+1} = \dfrac{A^\top x_{k+1}}{\|A^\top x_{k+1}\|}$
Return $x_T$ and $y_T$ as approximate top left and right singular vectors.

Figure 2: Power method

References

[DSSSC08] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. 25th ICML, pages 272–279. ACM, 2008.

[FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.