Lecture 5: Projections

EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan

Up until now, we have seen convergence rates for unconstrained gradient descent. Now we consider a constrained minimization problem, where the constraint set C ⊆ R^d is convex and closed:

    min_{x ∈ C} f(x)

1 Objective Function

Consider some x_0 ∈ C. It might be that a gradient step takes us outside of C: the point x_0 − α ∇f(x_0) need not lie in C. The simplest thing we can do is to project this point back onto the set, to get an x_1 ∈ C:

    x_1 = Π_C(x_0 − α ∇f(x_0)),

where Π_C is the projection onto C. More explicitly, we are looking for

    argmin_{x ∈ C} ‖x − (x_0 − α ∇f(x_0))‖_2.

This algorithm is called projected gradient descent. With a few properties of the projection and some geometry, we can extend our understanding of gradient descent to this constrained setting.

Is this problem hard? If C is very complex, perhaps a polytope with many linear constraints, the projection itself may be too expensive to compute. The answer therefore depends on C.
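As a minimal sketch of the iteration just described (my own code, not from the notes; the names projected_gradient_descent, grad_f, and project_C are hypothetical), the method can be written in NumPy assuming a fixed step size and a user-supplied projection routine:

    import numpy as np

    def projected_gradient_descent(grad_f, project_C, x0, alpha, num_iters=100):
        """Minimal sketch of projected gradient descent.

        grad_f    : function returning the gradient of f at a point
        project_C : function computing the Euclidean projection onto C
        x0        : feasible starting point (x0 in C)
        alpha     : fixed step size
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(num_iters):
            # take a gradient step, which may leave C ...
            y = x - alpha * grad_f(x)
            # ... then project back onto the constraint set
            x = project_C(y)
        return x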

2 Background

Lemma. Say f is convex and differentiable. Then

    x* ∈ argmin_{x ∈ C} f(x)   if and only if   ⟨∇f(x*), z − x*⟩ ≥ 0 for all z ∈ C.

Without constraints, we are simply looking for a point with zero gradient; the above is the natural generalization to a constraint set C.

Consider the geometry of this optimality condition. Take the optimal point x* on the boundary of C, with a gradient that points out of C. The condition relates the gradient vector to every vector pointing inwards: in effect, it tells us that we cannot find any feasible descent direction. Equivalently, the angle between the descent direction −∇f(x*) and any feasible direction z − x* is at least 90 degrees.

What happens if the above condition holds for some x* ∈ int(C)? We will see that this actually implies ∇f(x*) = 0. To some degree, this means the constraints had no effect.

3 Projection Operator

To ensure that the projection makes sense, we have to check that the problem defining it has a unique solution. Take the Euclidean projection onto C, a closed, convex subset of R^d: for a fixed y ∈ R^d, consider

    min_{x ∈ C} ‖y − x‖_2^2.

Examining g(x) = ‖y − x‖_2^2, we see that the function value diverges as ‖x‖_2 → ∞, so the sub-level sets {x ∈ R^d : g(x) ≤ γ} are bounded. Together with closedness, this compactness implies existence of a minimizer; strict convexity of g implies uniqueness. So the projection problem makes sense, and we can define an operator. Hence

    Π_C : R^d → C,   Π_C(y) = argmin_{x ∈ C} ‖x − y‖_2^2

is well defined. In general, the projection is not linear. However, it still has a few nice properties, characterized by the following inequality:

    ⟨Π_C(y) − y, z − Π_C(y)⟩ ≥ 0   for all z ∈ C.

The inner product between (a) the difference between the projection Π_C(y) and the original point y and (b) any feasible direction z − Π_C(y) is non-negative; in short, the angle between those two vectors is at most 90 degrees. To prove this, apply the lemma above to f(x) = (1/2)‖x − y‖_2^2, whose gradient at the projection is ∇f(Π_C(y)) = Π_C(y) − y.
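As a quick numerical sanity check of this inequality (my own illustration, not part of the notes), one can test it for the box C = [0, 1]^d, whose projection is coordinate-wise clipping:

    import numpy as np

    # Check <Pi_C(y) - y, z - Pi_C(y)> >= 0 for C = [0, 1]^d.
    rng = np.random.default_rng(0)
    d = 5
    y = rng.normal(size=d) * 3.0           # an arbitrary point, possibly outside C
    proj_y = np.clip(y, 0.0, 1.0)          # Euclidean projection onto [0, 1]^d

    for _ in range(1000):
        z = rng.uniform(0.0, 1.0, size=d)  # a random feasible point z in C
        assert np.dot(proj_y - y, z - proj_y) >= -1e-12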

Consider a linear subspace C and the projection Π_C(y) of some y onto C. In this case the inequality says that we can move along any direction in C. In R^2 with C a line, there are two possible directions along C from Π_C(y), both of which form angles with y − Π_C(y) that total 180 degrees. This means both angles are 90 degrees, and thus y − Π_C(y) is orthogonal to C:

    ⟨Π_C(y) − y, z − Π_C(y)⟩ = 0   for all z ∈ C.

The projection also has one more important property. Consider the constraint set C and two points y, ỹ ∈ R^d, possibly outside of C. We are interested in the difference between the projections, Π_C(y) − Π_C(ỹ), compared to the difference between the original points, y − ỹ. We claim the projection is non-expansive, that is,

    ‖Π_C(y) − Π_C(ỹ)‖_2 ≤ ‖y − ỹ‖_2.

If y, ỹ ∈ C, then the above inequality holds with equality. Note that this is not true in general if C is not convex. For example, take the Euclidean sphere S = {x ∈ R^d : ‖x‖_2 = 1}, which consists of only the shell of the ball. Consider two points inside the sphere and their projections onto the shell: the distance between the projections can be greater than the distance between the original points. This matters in practice. Solving for eigenvalues is a quadratic program over a sphere, so we cannot assume that algorithms such as the power method are non-expansive.

4 Examples

Consider some examples of projections.

(a) The non-negative orthant O^+ = {x ∈ R^d : x_j ≥ 0, j = 1, ..., d}. The projection, intuitively, clips each coordinate at zero:

    Π_{O^+}(y) = max{0, y}   component-wise.

(b) The box [0, 1]^d = {x ∈ R^d : 0 ≤ x_j ≤ 1, j = 1, ..., d}, where, component-wise,

    [Π_{[0,1]^d}(y)]_j = 0 if y_j ≤ 0,   y_j if y_j ∈ [0, 1],   1 if y_j ≥ 1.
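Both of these projections are one-line clipping operations. As a small usage example (again my own sketch, reusing the hypothetical projected_gradient_descent routine from above), here is a box-constrained least-squares problem:

    import numpy as np

    # Projections for examples (a) and (b): coordinate-wise clipping.
    project_orthant = lambda y: np.maximum(y, 0.0)    # onto O^+
    project_box     = lambda y: np.clip(y, 0.0, 1.0)  # onto [0, 1]^d

    # Example: minimize f(x) = 0.5 * ||A x - b||_2^2 over the box [0, 1]^d.
    rng = np.random.default_rng(1)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)
    grad_f = lambda x: A.T @ (A @ x - b)

    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/M, with M = ||A||_2^2
    x_hat = projected_gradient_descent(grad_f, project_box, np.zeros(5), alpha, 500)
    assert np.all((x_hat >= -1e-12) & (x_hat <= 1 + 1e-12))  # iterate stays feasible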

(c) The ℓ1-ball ("kite") C = {x ∈ R^d : ‖x‖_1 ≤ R} forces a sparse solution. You may believe that a large set of features can be reduced to a small number of important features; this is called soft sparsity. We can show that

    [Π_C(y)]_j = sgn(y_j) max{0, |y_j| − λ},

where λ ≥ 0 is chosen such that ‖Π_C(y)‖_1 ≤ R. This is called soft thresholding. What does soft thresholding do? It defines a mapping that takes a real-valued u to a continuous function of u: on [−λ, λ] the function is constant at 0, above λ it slopes up, and below −λ it also slopes up:

    T_λ(u) = sgn(u) max{0, |u| − λ}.

This function is non-expansive. Hard thresholding, on the other hand, is discontinuous:

    H_λ(u) = 0 if |u| ≤ λ,   u otherwise.

(d) Matrix problems: the positive semidefinite cone S_+^{d×d} = {X ∈ R^{d×d} : X = X^T, X ⪰ 0}. Consider the Frobenius norm ‖X − Y‖_F^2 = Σ_{i,j=1}^d (X_ij − Y_ij)^2. What would the projection of a diagonal matrix onto the PSD cone look like? Each entry along the diagonal would be clipped at a minimum of 0.

Say Y is not diagonal. If it is symmetric, we can diagonalize it: Y = U^T D U, where U ∈ R^{d×d} is orthogonal (U^T U = I) and D = diag{λ_1, ..., λ_d}. For any X ∈ S_+^{d×d}, let X̃ = U X U^T, which is also in S_+^{d×d}. Then

    ‖Y − X‖_F^2 = ‖U^T(D − X̃)U‖_F^2 = ‖D − X̃‖_F^2.

In other words, the Frobenius norm is unitarily invariant: multiplying by orthogonal (unitary) matrices does not change its value. This also means that there is no dependence on the eigenvectors; the norm depends purely on the spectrum of the matrix. Minimizing over all X is therefore the same as minimizing over all X̃, which is the diagonal case. For a symmetric matrix, simply diagonalize, clip the eigenvalues at zero, and re-assemble.

(e) Take the nuclear norm. For symmetric X ∈ R^{d×d}, we have ‖X‖_nuc = Σ_{j=1}^d |λ_j(X)|. Note that the Frobenius norm is effectively the ℓ2 norm applied to the eigenspectrum, ‖X‖_F = (Σ_{j=1}^d λ_j^2(X))^{1/2}, whereas the nuclear norm is effectively the ℓ1 norm applied to the eigenspectrum.
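A brief sketch of these operations in NumPy (my own code, not from the notes; soft_threshold, hard_threshold, and project_psd are hypothetical helper names, and the eigendecomposition uses numpy.linalg.eigh):

    import numpy as np

    def soft_threshold(u, lam):
        """Soft thresholding T_lam(u) = sgn(u) * max(0, |u| - lam), element-wise.
        Continuous and non-expansive."""
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    def hard_threshold(u, lam):
        """Hard thresholding: zero out entries with |u| <= lam (discontinuous)."""
        return np.where(np.abs(u) <= lam, 0.0, u)

    def project_psd(Y):
        """Projection of a symmetric Y onto the PSD cone in Frobenius norm:
        diagonalize, clip the eigenvalues at zero, re-assemble."""
        eigvals, U = np.linalg.eigh(Y)               # Y = U diag(eigvals) U^T
        return (U * np.maximum(eigvals, 0.0)) @ U.T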

The nuclear norm is useful in practice because it is a convex surrogate for low-rankness. When you run PCA, you are minimizing the Frobenius norm between a matrix A and its low-rank approximations; other forms of low-rank approximation are computationally intractable.

Take the nuclear norm and consider its ratio with the Frobenius norm. Suppose your matrix has rank r ≪ d. Then

    ‖X‖_nuc / ‖X‖_F ≤ √r.

This tells you how close to low rank your matrix X is. It is analogous to the relation between the ℓ1 and ℓ2 norms for vectors: ‖x‖_1 / ‖x‖_2 ≤ √k for all k-sparse x ∈ R^d (at most k non-zeros).

What does the projection onto the nuclear norm ball look like? We soft-threshold the eigenvalues. We can check that the nuclear norm is unitarily invariant, since it depends only on the eigenvalues.

The above are all polynomial-time examples. Next time, we will analyze the projected gradient descent algorithm itself,

    x_{l+1} = Π_C(x_l − α ∇f(x_l)),

and show that under (m, M)-strong-convexity and smoothness guarantees,

    ‖x_l − x*‖_2 ≤ ((1 − m/M) / (1 + m/M))^l ‖x_0 − x*‖_2.
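As an illustrative sketch of the nuclear-norm-ball projection for symmetric matrices (my own code; the bisection search for λ is one simple way to satisfy the ℓ1 budget, not a method prescribed in the notes):

    import numpy as np

    def project_l1_ball(v, R, tol=1e-10):
        """Euclidean projection of a vector v onto {x : ||x||_1 <= R} by
        soft-thresholding, with lambda found via bisection."""
        if np.sum(np.abs(v)) <= R:
            return v.copy()
        lo, hi = 0.0, np.max(np.abs(v))
        while hi - lo > tol:
            lam = (lo + hi) / 2
            if np.sum(np.maximum(np.abs(v) - lam, 0.0)) > R:
                lo = lam
            else:
                hi = lam
        lam = (lo + hi) / 2
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def project_nuclear_ball(Y, R):
        """Projection of a symmetric Y onto {X : ||X||_nuc <= R}: soft-threshold
        the eigenvalues, i.e. project the eigenvalue vector onto the l1-ball."""
        eigvals, U = np.linalg.eigh(Y)
        return (U * project_l1_ball(eigvals, R)) @ U.T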