Subgradient Descent. David S. Rosenberg. New York University. February 7, 2018


Subgradient Descent. David S. Rosenberg. New York University. DS-GA 1003 / CSCI-GA 2567. February 7, 2018.

Contents
1 Motivation and Review: Support Vector Machines
2 Convexity and Sublevel Sets
3 Convex and Differentiable Functions
4 Subgradients

Motivation and Review: Support Vector Machines

The Classification Problem
Output space $\mathcal{Y} = \{-1, 1\}$. Action space $\mathcal{A} = \mathbf{R}$. Real-valued prediction function $f: \mathcal{X} \to \mathbf{R}$.
The value $f(x)$ is called the score for the input $x$. Intuitively, the magnitude of the score represents the confidence of our prediction.
Typical convention: $f(x) > 0 \implies$ predict 1; $f(x) < 0 \implies$ predict $-1$. (But we can choose other thresholds...)

The Margin
The margin (or functional margin) for a predicted score $\hat{y}$ and true class $y \in \{-1, 1\}$ is $y\hat{y}$.
The margin often looks like $y f(x)$, where $f(x)$ is our score function.
The margin is a measure of how correct we are. We want to maximize the margin.

[Margin-Based] Classification Losses
SVM/Hinge loss: $\ell_{\text{Hinge}}(m) = \max\{1 - m, 0\} = (1 - m)_+$.
Not differentiable at $m = 1$. We have a margin error when $m < 1$.
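
As a concrete illustration (a minimal sketch of my own, not from the slides), the hinge loss is one line of NumPy; the function name hinge_loss is a hypothetical choice.

import numpy as np

def hinge_loss(m):
    # Hinge loss max(1 - m, 0), applied elementwise to an array of margins m = y * f(x).
    return np.maximum(1.0 - m, 0.0)

# Margins of at least 1 incur no loss; margins below 1 are "margin errors".
print(hinge_loss(np.array([2.0, 1.0, 0.5, -1.0])))  # [0.  0.  0.5 2. ]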

[Soft Margin] Linear Support Vector Machine (No Intercept)
Hypothesis space $\mathcal{F} = \{ f(x) = w^T x \mid w \in \mathbf{R}^d \}$.
Loss $\ell(m) = \max(0, 1 - m)$, with $\ell_2$ regularization:
$$\min_{w \in \mathbf{R}^d} \ \sum_{i=1}^n \max\left(0,\, 1 - y_i w^T x_i\right) + \lambda \|w\|_2^2$$

SVM Optimization Problem (no intercept)
SVM objective function:
$$J(w) = \frac{1}{n} \sum_{i=1}^n \max\left(0,\, 1 - y_i w^T x_i\right) + \lambda \|w\|^2.$$
Not differentiable... but let's think about gradient descent anyway.
Derivative of the hinge loss $\ell(m) = \max(0, 1 - m)$:
$$\ell'(m) = \begin{cases} 0 & m > 1 \\ -1 & m < 1 \\ \text{undefined} & m = 1 \end{cases}$$

Gradient of SVM Objective
We need the gradient with respect to the parameter vector $w \in \mathbf{R}^d$:
$$\nabla_w \ell\left(y_i w^T x_i\right) = \ell'\left(y_i w^T x_i\right) y_i x_i \quad \text{(chain rule)}$$
$$= \begin{cases} 0 & y_i w^T x_i > 1 \\ -1 & y_i w^T x_i < 1 \\ \text{undefined} & y_i w^T x_i = 1 \end{cases} \cdot \, y_i x_i \quad \text{(expanded } m \text{ in } \ell'(m))$$
$$= \begin{cases} 0 & y_i w^T x_i > 1 \\ -y_i x_i & y_i w^T x_i < 1 \\ \text{undefined} & y_i w^T x_i = 1 \end{cases}$$

Gradient of SVM Objective
$$\nabla_w \ell\left(y_i w^T x_i\right) = \begin{cases} 0 & y_i w^T x_i > 1 \\ -y_i x_i & y_i w^T x_i < 1 \\ \text{undefined} & y_i w^T x_i = 1 \end{cases}$$
So
$$\nabla_w J(w) = \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell\left(y_i w^T x_i\right) + \lambda \|w\|^2 \right) = \frac{1}{n} \sum_{i=1}^n \nabla_w \ell\left(y_i w^T x_i\right) + 2\lambda w$$
$$= \begin{cases} \frac{1}{n} \sum_{i: y_i w^T x_i < 1} \left(-y_i x_i\right) + 2\lambda w & \text{if } y_i w^T x_i \neq 1 \text{ for all } i \\ \text{undefined} & \text{otherwise} \end{cases}$$
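
To make the formula above concrete, here is a minimal NumPy sketch (my own, not from the slides) that evaluates this expression; where a margin equals exactly 1 it uses 0 for that example's hinge term, which the subgradient results later in the lecture justify as a valid choice.

import numpy as np

def svm_subgradient(w, X, y, lam):
    # X: (n, d) data matrix, y: (n,) labels in {-1, +1}, lam: regularization strength.
    margins = y * (X @ w)                          # margins m_i = y_i w^T x_i
    active = margins < 1                           # examples contributing -y_i x_i
    g_hinge = -(X[active].T @ y[active]) / X.shape[0]
    return g_hinge + 2 * lam * w                   # plus the gradient of lam * ||w||^2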

Gradient Descent on SVM Objective?
The gradient of the SVM objective is
$$\nabla_w J(w) = \frac{1}{n} \sum_{i: y_i w^T x_i < 1} \left(-y_i x_i\right) + 2\lambda w$$
when $y_i w^T x_i \neq 1$ for all $i$, and otherwise is undefined.
Potential arguments for why we shouldn't care about the points of nondifferentiability:
If we start with a random $w$, will we ever hit exactly $y_i w^T x_i = 1$?
If we did, could we perturb the step size by $\varepsilon$ to miss such a point?
Does it even make sense to check $y_i w^T x_i = 1$ with floating point numbers?

Gradient Descent on SVM Objective?
If we blindly apply gradient descent from a random starting point, it seems unlikely that we'll hit a point where the gradient is undefined.
Still, that doesn't mean gradient descent will work if the objective is not differentiable!
The theory of subgradients and subgradient descent will clear up any uncertainty.
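
As a preview of where the lecture is heading, here is a hedged sketch of the plain descent loop on $J(w)$, reusing the svm_subgradient helper sketched above; the fixed step size and iteration count are arbitrary illustrative choices, not recommendations from the slides.

import numpy as np

def subgradient_descent(X, y, lam=0.1, step=0.01, n_steps=1000):
    # svm_subgradient is the helper defined in the earlier sketch.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - step * svm_subgradient(w, X, y, lam)
    return w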

Convexity and Sublevel Sets

Convex Sets
Definition: A set $C$ is convex if the line segment between any two points in $C$ lies in $C$.
(KPM Fig. 7.4)

Convex and Concave Functions
Definition: A function $f: \mathbf{R}^d \to \mathbf{R}$ is convex if the line segment connecting any two points on the graph of $f$ lies above the graph. $f$ is concave if $-f$ is convex.
(KPM Fig. 7.5)

Examples of Convex Functions on R
$x \mapsto ax + b$ is both convex and concave on $\mathbf{R}$ for all $a, b \in \mathbf{R}$.
$x \mapsto |x|^p$ for $p \geq 1$ is convex on $\mathbf{R}$.
$x \mapsto e^{ax}$ is convex on $\mathbf{R}$ for all $a \in \mathbf{R}$.
Every norm on $\mathbf{R}^n$ is convex (e.g. $\|x\|_1$ and $\|x\|_2$).
Max: $(x_1, \ldots, x_n) \mapsto \max\{x_1, \ldots, x_n\}$ is convex on $\mathbf{R}^n$.
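
For instance, convexity of the max function follows in one line from the definition (a check of my own, not on the slide): for any $\lambda \in [0, 1]$,
$$\max_i \left(\lambda x_i + (1 - \lambda) y_i\right) \leq \max_i \left(\lambda x_i\right) + \max_i \left((1 - \lambda) y_i\right) = \lambda \max_i x_i + (1 - \lambda) \max_i y_i.$$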

Simple Composition Rules
If $g$ is convex and $Ax + b$ is an affine mapping, then $g(Ax + b)$ is convex.
If $g$ is convex, then $\exp g(x)$ is convex.
If $g$ is convex and nonnegative and $p \geq 1$, then $g(x)^p$ is convex.
If $g$ is concave and positive, then $\log g(x)$ is concave.
If $g$ is concave and positive, then $1/g(x)$ is convex.

Main Reference for Convex Optimization
Boyd and Vandenberghe (2004). Very clearly written, but has a ton of detail for a first pass. See the Extreme Abridgment of Boyd and Vandenberghe.

Convex Optimization Problem: Standard Form
$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \end{array}$$
where $f_0, \ldots, f_m$ are convex functions.
Question: Is the $\leq$ in the constraint just a convention? Could we also have used $\geq$ or $=$?

Level Sets and Sublevel Sets
Let $f: \mathbf{R}^d \to \mathbf{R}$ be a function. Then we have the following definitions:
Definition: A level set or contour line for the value $c$ is the set of points $x \in \mathbf{R}^d$ for which $f(x) = c$.
Definition: A sublevel set for the value $c$ is the set of points $x \in \mathbf{R}^d$ for which $f(x) \leq c$.
Theorem: If $f: \mathbf{R}^d \to \mathbf{R}$ is convex, then the sublevel sets are convex. (Proof straight from the definitions.)
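
Spelling out the one-step proof the slide alludes to: if $f(x) \leq c$, $f(y) \leq c$, and $\lambda \in [0, 1]$, then convexity gives
$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y) \leq \lambda c + (1 - \lambda) c = c,$$
so $\lambda x + (1 - \lambda) y$ is also in the sublevel set.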

Convex Function
(Plot courtesy of Brett Bernstein.)

Contour Plot of a Convex Function: Sublevel Set
Is the sublevel set $\{x \mid f(x) \leq 1\}$ convex?
(Plot courtesy of Brett Bernstein.)

Nonconvex Function
(Plot courtesy of Brett Bernstein.)

Contour Plot of a Nonconvex Function: Sublevel Set
Is the sublevel set $\{x \mid f(x) \leq 1\}$ convex?
(Plot courtesy of Brett Bernstein.)

Fact: Intersection of Convex Sets is Convex
(Plot courtesy of Brett Bernstein.)

Level and Superlevel Sets
Level sets and superlevel sets of convex functions are not generally convex.
(Plot courtesy of Brett Bernstein.)

Convex Optimization Problem: Standard Form
$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \end{array}$$
where $f_0, \ldots, f_m$ are convex functions.
What can we say about each constraint set $\{x \mid f_i(x) \leq 0\}$? (convex)
What can we say about the feasible set $\{x \mid f_i(x) \leq 0, \ i = 1, \ldots, m\}$? (convex)

Convex Optimization Problem: Implicit Form
$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & x \in C \end{array}$$
where $f$ is a convex function and $C$ is a convex set.
An alternative generic convex optimization problem.

Convex and Differentiable Functions

First-Order Approximation
Suppose $f: \mathbf{R}^d \to \mathbf{R}$ is differentiable. Predict $f(y)$ given $f(x)$ and $\nabla f(x)$?
Linear (i.e. first-order) approximation:
$$f(y) \approx f(x) + \nabla f(x)^T (y - x)$$
(Boyd & Vandenberghe Fig. 3.2)

First-Order Condition for Convex, Differentiable Function
Suppose $f: \mathbf{R}^d \to \mathbf{R}$ is convex and differentiable. Then for any $x, y \in \mathbf{R}^d$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x).$$
The linear approximation to $f$ at $x$ is a global underestimator of $f$.
(Figure from Boyd & Vandenberghe Fig. 3.2; proof in Section 3.1.3.)
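
A quick sanity check on the scalar example $f(x) = x^2$ (my own addition, not on the slide): the first-order condition reads
$$y^2 \geq x^2 + 2x(y - x) \iff y^2 - 2xy + x^2 \geq 0 \iff (y - x)^2 \geq 0,$$
which always holds.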

First-Order Condition for Convex, Differentiable Function
Suppose $f: \mathbf{R}^d \to \mathbf{R}$ is convex and differentiable. Then for any $x, y \in \mathbf{R}^d$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x).$$
Corollary: If $\nabla f(x) = 0$, then $f(y) \geq f(x)$ for all $y$, so $x$ is a global minimizer of $f$.
For convex functions, local information gives global information.

Subgradients

Subgradients
Definition: A vector $g \in \mathbf{R}^d$ is a subgradient of $f: \mathbf{R}^d \to \mathbf{R}$ at $x$ if for all $z$,
$$f(z) \geq f(x) + g^T (z - x).$$
Blue is the graph of $f(x)$. Each red line $x \mapsto f(x_0) + g^T (x - x_0)$ is a global lower bound on $f(x)$.

Subdifferential
Definitions: $f$ is subdifferentiable at $x$ if there exists at least one subgradient at $x$. The set of all subgradients at $x$ is called the subdifferential, written $\partial f(x)$.
Basic Facts:
If $f$ is convex and differentiable, then $\partial f(x) = \{\nabla f(x)\}$.
At any point $x$, there can be 0, 1, or infinitely many subgradients.
If $\partial f(x) = \emptyset$, then $f$ is not convex.

Global Optimality Condition
Definition: A vector $g \in \mathbf{R}^d$ is a subgradient of $f: \mathbf{R}^d \to \mathbf{R}$ at $x$ if for all $z$,
$$f(z) \geq f(x) + g^T (z - x).$$
Corollary: If $0 \in \partial f(x)$, then $f(z) \geq f(x)$ for all $z$, so $x$ is a global minimizer of $f$.

Subdifferential of Absolute Value
Consider $f(x) = |x|$.
The plot on the right shows $\{(x, g) \mid x \in \mathbf{R}, \ g \in \partial f(x)\}$.
(Boyd EE364b: Subgradients slides)
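
Written out explicitly, the subdifferential shown in that plot is
$$\partial f(x) = \begin{cases} \{-1\} & x < 0 \\ [-1, 1] & x = 0 \\ \{1\} & x > 0, \end{cases}$$
so at the kink every slope between $-1$ and $1$ gives a valid global lower bound on $|x|$.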

$f(x_1, x_2) = |x_1| + 2|x_2|$
(Plot courtesy of Brett Bernstein.)

Subgradients of $f(x_1, x_2) = |x_1| + 2|x_2|$
Let's find the subdifferential of $f(x_1, x_2) = |x_1| + 2|x_2|$ at $(3, 0)$.
The first coordinate of the subgradient must be 1, from the $|x_1|$ part (at $x_1 = 3$).
The second coordinate of the subgradient can be anything in $[-2, 2]$.
So the graph of $h(x_1, x_2) = f(3, 0) + g^T (x_1 - 3, x_2 - 0)$ is a global underestimate of $f(x_1, x_2)$ for any $g = (g_1, g_2)$ with $g_1 = 1$ and $g_2 \in [-2, 2]$.
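
A small numeric spot-check of this claim (a sketch of my own, not part of the slides): evaluate $f$ and the affine function $h$ on a grid and confirm that $h$ never exceeds $f$.

import numpy as np

def f(x1, x2):
    return np.abs(x1) + 2 * np.abs(x2)

x0 = np.array([3.0, 0.0])
g = np.array([1.0, 1.5])   # any g with g_1 = 1 and g_2 in [-2, 2]

xs = np.linspace(-5, 5, 101)
X1, X2 = np.meshgrid(xs, xs)
H = f(x0[0], x0[1]) + g[0] * (X1 - x0[0]) + g[1] * (X2 - x0[1])
print(np.all(H <= f(X1, X2) + 1e-12))  # True: h underestimates f everywhere on the grid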

Underestimating Hyperplane to $f(x_1, x_2) = |x_1| + 2|x_2|$
(Plot courtesy of Brett Bernstein.)

Subdifferential on Contour Plot
Contour plot of $f(x_1, x_2) = |x_1| + 2|x_2|$, with the set of subgradients at $(3, 0)$.

Contour Lines and Gradients
For a function $f: \mathbf{R}^d \to \mathbf{R}$, the graph of the function lives in $\mathbf{R}^{d+1}$, the gradient and subgradients of $f$ live in $\mathbf{R}^d$, and the contours, level sets, and sublevel sets are in $\mathbf{R}^d$.
If $f: \mathbf{R}^d \to \mathbf{R}$ is continuously differentiable and $\nabla f(x_0) \neq 0$, then $\nabla f(x_0)$ is normal to the level set $S = \{x \in \mathbf{R}^d \mid f(x) = f(x_0)\}$.
(Proof sketch in notes.)
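
A standard sketch of that argument (hedged; the course notes may present it differently): let $\gamma(t)$ be any differentiable curve lying in $S$ with $\gamma(0) = x_0$. Then
$$f(\gamma(t)) = f(x_0) \ \text{for all } t \quad \Longrightarrow \quad \left.\frac{d}{dt} f(\gamma(t))\right|_{t=0} = \nabla f(x_0)^T \gamma'(0) = 0,$$
so $\nabla f(x_0)$ is orthogonal to every tangent direction of the level set at $x_0$.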

Gradient orthogonal to sublevel sets
(Plot courtesy of Brett Bernstein.)