Convex optimization COMS 4771

1. Recap: learning via optimization

Soft-margin SVMs

Soft-margin SVM optimization problem defined by training data:

$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n \left[ 1 - y_i x_i^T w \right]_+ .$$

Compare with Empirical Risk Minimization (ERM) for the zero-one loss:

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ y_i x_i^T w \le 0 \} .$$

In both cases, the $i$-th term in the summation is a function of $y_i x_i^T w$.

Zero-one loss vs. hinge loss

[Figure: the zero-one loss $\mathbf{1}\{y x^T w \le 0\}$ and the hinge loss $[1 - y x^T w]_+$ plotted as functions of the margin $y x^T w$.]

Zero-one loss: counts 1 whenever $f_w(x) \ne y$.

Hinge loss: an upper bound on the zero-one loss,

$$\mathbf{1}\{ y x^T w \le 0 \} \le \left[ 1 - y x^T w \right]_+ .$$

The soft-margin SVM minimizes an upper bound on the training error rate, plus a term that encourages large margins.
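The bound is easy to check numerically. Below is a minimal NumPy sketch (not from the lecture; the function names are my own) that evaluates both losses on a vector of margins $y_i x_i^T w$ and confirms the hinge loss dominates the zero-one loss:

```python
import numpy as np

def zero_one_loss(margins):
    # margins[i] = y_i * x_i^T w; loss is 1 exactly when the margin is <= 0
    return (margins <= 0).astype(float)

def hinge_loss(margins):
    # [1 - y x^T w]_+ : zero once the margin exceeds 1, linear below that
    return np.maximum(0.0, 1.0 - margins)

rng = np.random.default_rng(0)
margins = rng.normal(size=1000)
# Hinge upper-bounds zero-one at every point, as the inequality above states.
assert np.all(hinge_loss(margins) >= zero_one_loss(margins))
```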

Learning via optimization

Zero-one loss ERM: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ y_i x_i^T w \le 0 \}$

Squared loss ERM: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (x_i^T w - y_i)^2$

Logistic regression MLE: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i x_i^T w)\right)$

Soft-margin SVM: $\min_{w \in \mathbb{R}^d} \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - y_i x_i^T w]_+$

Generic learning objective:

$$\min_{w \in \mathbb{R}^d} \; \underbrace{r(w)}_{\text{regularization}} + \underbrace{\frac{1}{n} \sum_{i=1}^n \varphi_i(w)}_{\text{data fitting}} .$$

Regularization: encodes a learning bias (e.g., prefer large margins). Data fitting: measures how poor the fit to the training data is.
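As an illustration (a sketch of my own, not course-provided code), each objective above is just a plain function of $w$; here are the soft-margin SVM and logistic regression MLE objectives in NumPy:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    # (lam/2) ||w||_2^2 + (1/n) sum_i [1 - y_i x_i^T w]_+
    margins = y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(0.0, 1.0 - margins))

def logistic_objective(w, X, y):
    # (1/n) sum_i ln(1 + exp(-y_i x_i^T w)), via the stable logaddexp(0, -m)
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))
```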

Regularization

Other kinds of regularization encode other inductive biases, e.g.:

$r(w) = \lambda \|w\|_1$: encourages $w$ to be small and sparse.

$r(w) = \lambda \sum_{i=1}^{d-1} |w_{i+1} - w_i|$: encourages piecewise-constant weights.

$r(w) = \lambda \|w - w_{\text{old}}\|_2^2$: encourages closeness to an old solution $w_{\text{old}}$.

Regularization can be combined with other data-fitting objectives, e.g.:

Ridge regression: $\min_{w \in \mathbb{R}^d} \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{2n} \sum_{i=1}^n (x_i^T w - y_i)^2$

Lasso: $\min_{w \in \mathbb{R}^d} \lambda \|w\|_1 + \frac{1}{2n} \sum_{i=1}^n (x_i^T w - y_i)^2$

Sparse SVM: $\min_{w \in \mathbb{R}^d} \lambda \|w\|_1 + \frac{1}{n} \sum_{i=1}^n [1 - y_i x_i^T w]_+$

Sparse logistic regression: $\min_{w \in \mathbb{R}^d} \lambda \|w\|_1 + \frac{1}{n} \sum_{i=1}^n \ln\left(1 + \exp(-y_i x_i^T w)\right)$
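To make these inductive biases concrete, here is a hedged sketch of the three regularizers as code (the function names are mine, not the course's):

```python
import numpy as np

def l1_reg(w, lam):
    # lambda * ||w||_1: encourages small, sparse weights
    return lam * np.sum(np.abs(w))

def tv_reg(w, lam):
    # lambda * sum_i |w_{i+1} - w_i|: encourages piecewise-constant weights
    return lam * np.sum(np.abs(np.diff(w)))

def proximity_reg(w, w_old, lam):
    # lambda * ||w - w_old||_2^2: encourages staying close to an old solution
    return lam * np.sum((w - w_old) ** 2)
```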

2. Introduction to convexity

Convex sets

A set $A$ is convex if, for every pair of points $x, x'$ in $A$, the line segment between $x$ and $x'$ is also contained in $A$.

[Figure: four example sets, labeled convex / not convex / convex / convex.]

Examples:

All of $\mathbb{R}^d$.

The empty set.

Half-spaces: $\{ x \in \mathbb{R}^d : b^T x \le c \}$.

Intersections of convex sets.

Convex hulls: $\mathrm{conv}(S) := \{ \sum_{i=1}^k \alpha_i x_i : k \in \mathbb{N},\ x_i \in S,\ \alpha_i \ge 0,\ \sum_{i=1}^k \alpha_i = 1 \}$.
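To make the definition concrete, here is a small numerical check (my own sketch, with made-up data) that a half-space is convex: any convex combination of two feasible points stays feasible.

```python
import numpy as np

rng = np.random.default_rng(1)
b, c = rng.normal(size=3), 1.0
in_halfspace = lambda x: b @ x <= c  # membership in {x : b^T x <= c}

# Rejection-sample two points in the half-space, then test the segment between them.
pts = [x for x in rng.normal(size=(1000, 3)) if in_halfspace(x)]
x, xp = pts[0], pts[1]
for alpha in np.linspace(0.0, 1.0, 101):
    assert in_halfspace((1 - alpha) * x + alpha * xp)
```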

Convex functions

A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for any $x, x' \in \mathbb{R}^d$ and $\alpha \in [0, 1]$,

$$f\left((1 - \alpha) x + \alpha x'\right) \le (1 - \alpha) f(x) + \alpha f(x') .$$

[Figure: a non-convex function, where a chord passes below the graph, and a convex function, where every chord lies above the graph.]

Examples:

$f(x) = c^x$ for any $c > 0$ (on $\mathbb{R}$).

$f(x) = |x|^c$ for any $c \ge 1$ (on $\mathbb{R}$).

$f(x) = b^T x$ for any $b \in \mathbb{R}^d$.

$f(x) = \|x\|$ for any norm $\|\cdot\|$.

$f(x) = x^T A x$ for symmetric positive semidefinite $A$.

$a f + g$ for convex functions $f$ and $g$, and $a \ge 0$.

$\max\{f, g\}$ for convex functions $f$ and $g$.
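The chord inequality also gives a quick (non-rigorous) numerical test: sample random pairs and mixing weights and look for a violation. A sketch, with names of my choosing:

```python
import numpy as np

def looks_convex(f, dim, trials=10000, seed=0, tol=1e-9):
    """Test f((1-a)x + a x') <= (1-a) f(x) + a f(x') on random chords.
    Passing is evidence of convexity, not a proof; a failure disproves it."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, xp = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        if f((1 - a) * x + a * xp) > (1 - a) * f(x) + a * f(xp) + tol:
            return False
    return True

print(looks_convex(np.linalg.norm, 5))    # True: norms are convex
print(looks_convex(lambda x: -x @ x, 5))  # False: -||x||^2 is concave
```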

Verifying convexity

Is $f(x) = \|x\|$ convex? Pick any $\alpha \in [0, 1]$ and any $x, x' \in \mathbb{R}^d$. Then

$$\begin{aligned}
f\left((1 - \alpha) x + \alpha x'\right) &= \left\|(1 - \alpha) x + \alpha x'\right\| \\
&\le \|(1 - \alpha) x\| + \|\alpha x'\| && \text{(triangle inequality)} \\
&= (1 - \alpha) \|x\| + \alpha \|x'\| && \text{(homogeneity)} \\
&= (1 - \alpha) f(x) + \alpha f(x') .
\end{aligned}$$

Yes, $f$ is convex.

Convexity of differentiable functions

Differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then $f$ is convex if and only if

$$f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0) \quad \text{for all } x, x_0 \in \mathbb{R}^d .$$

[Figure: $f(x)$ lies above its affine approximation $a(x) = f(x_0) + f'(x_0)(x - x_0)$ at every $x_0$.]

Twice-differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is twice-differentiable, then $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ for all $x \in \mathbb{R}^d$ (i.e., the Hessian, the matrix of second derivatives, is positive semidefinite for all $x$).
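The second-order condition suggests a numerical check: approximate the Hessian by finite differences and test whether its eigenvalues are (nearly) nonnegative. A sketch, assuming $f$ is smooth; the helper name is mine:

```python
import numpy as np

def hessian_fd(f, x, h=1e-4):
    # Central-difference approximation of the Hessian of f at x.
    d = len(x)
    I = np.eye(d)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4*h*h)
    return H

# f(x) = sum_k x_k^4 has Hessian diag(12 x_k^2), which is PSD everywhere.
f = lambda x: np.sum(x ** 4)
x = np.random.default_rng(2).normal(size=3)
print(np.all(np.linalg.eigvalsh(hessian_fd(f, x)) >= -1e-6))  # True (up to error)
```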

Verifying convexity of differentiable functions

Is $f(x) = x^4$ convex? Use the second-order condition for convexity:

$$\frac{\partial f(x)}{\partial x} = 4 x^3 , \qquad \frac{\partial^2 f(x)}{\partial x^2} = 12 x^2 \ge 0 .$$

Yes, $f$ is convex.

Is $f(x) = e^{\langle a, x \rangle}$ convex? Use the first-order condition for convexity:

$$\nabla f(x) = e^{\langle a, x \rangle} \nabla \{\langle a, x \rangle\} = e^{\langle a, x \rangle} a \quad \text{(chain rule)} .$$

Difference between $f$ and its affine approximation:

$$\begin{aligned}
f(x) - \left( f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle \right)
&= e^{\langle a, x \rangle} - \left( e^{\langle a, x_0 \rangle} + e^{\langle a, x_0 \rangle} \langle a, x - x_0 \rangle \right) \\
&= e^{\langle a, x_0 \rangle} \left( e^{\langle a, x - x_0 \rangle} - \left( 1 + \langle a, x - x_0 \rangle \right) \right) \\
&\ge 0 \quad \text{(because } 1 + z \le e^z \text{ for all } z \in \mathbb{R}\text{)} .
\end{aligned}$$

Yes, $f$ is convex.
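The derivation can also be spot-checked numerically: sample random $x, x_0$ and verify $f(x) \ge f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle$. A small sketch (the tolerance and names are my own):

```python
import numpy as np

# Numeric spot-check of the first-order condition for f(x) = exp(<a, x>).
rng = np.random.default_rng(4)
a = rng.normal(size=4)
f = lambda x: np.exp(a @ x)
grad_f = lambda x: np.exp(a @ x) * a  # chain rule, as derived above

for _ in range(1000):
    x, x0 = rng.normal(size=4), rng.normal(size=4)
    gap = f(x) - (f(x0) + grad_f(x0) @ (x - x0))
    assert gap >= -1e-6 * (1.0 + abs(f(x)))  # nonnegative, up to rounding
```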

3. Convex optimization problems

Optimization problems

A typical optimization problem (in standard form) is written as

$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \text{ for all } i = 1, \dots, n .$$

$f_0 : \mathbb{R}^d \to \mathbb{R}$ is the objective function;

$f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ are the constraint functions;

the inequalities $f_i(x) \le 0$ are the constraints;

$\mathcal{A} := \{ x \in \mathbb{R}^d : f_i(x) \le 0 \text{ for all } i = 1, 2, \dots, n \}$ is the feasible region.

Goal: find $x \in \mathcal{A}$ so that $f_0(x)$ is as small as possible.

The (optimal) value of the optimization problem is the smallest value $f_0(x)$ achieved by a feasible point $x \in \mathcal{A}$. A point $x \in \mathcal{A}$ achieving the optimal value is a (global) minimizer.
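As a concrete (hypothetical) instance, here is a toy standard-form problem solved with SciPy. Note that scipy.optimize expects inequality constraints in the form $g(x) \ge 0$, so each standard-form constraint $f_i(x) \le 0$ is passed as $g = -f_i$:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize ||x - (2, 2)||_2^2  s.t.  x_1 + x_2 - 1 <= 0.
f0 = lambda x: np.sum((x - np.array([2.0, 2.0])) ** 2)
f1 = lambda x: x[0] + x[1] - 1.0

res = minimize(f0, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -f1(x)}])
print(res.x)  # approx. (0.5, 0.5): the projection of (2, 2) onto the half-plane
```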

Convex optimization problems

Standard form of a convex optimization problem:

$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \text{ for all } i = 1, \dots, n ,$$

where $f_0, f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex functions.

Fact: the feasible set $\mathcal{A} := \{ x \in \mathbb{R}^d : f_i(x) \le 0 \text{ for all } i = 1, 2, \dots, n \}$ is a convex set.

Example: Soft-margin SVM

$$\min_{w \in \mathbb{R}^d,\ \xi_1, \dots, \xi_n \in \mathbb{R}} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i x_i^T w \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0 \ \text{ for all } i = 1, \dots, n .$$

Bringing it to standard form:

$$\min_{w \in \mathbb{R}^d,\ \xi_1, \dots, \xi_n \in \mathbb{R}} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n \xi_i \quad \text{(sum of two convex functions, hence convex)}$$

$$\text{s.t.} \quad 1 - \xi_i - y_i x_i^T w \le 0 \ \text{ for all } i = 1, \dots, n \quad \text{(linear constraints)}$$

$$\quad\ \ -\xi_i \le 0 \ \text{ for all } i = 1, \dots, n \quad \text{(linear constraints)}$$

Which of Zero-one loss ERM, Squared loss ERM, Logistic regression MLE, Ridge regression, Lasso, and Sparse SVM are convex?
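Because the problem is convex in this standard form, it can be handed directly to a generic convex solver. A minimal sketch using the cvxpy package (assuming it is installed; the toy data below are made up for illustration):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, d, lam = 40, 2, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n))

w, xi = cp.Variable(d), cp.Variable(n)
# Objective: (lam/2) ||w||_2^2 + (1/n) sum_i xi_i
objective = cp.Minimize(0.5 * lam * cp.sum_squares(w) + cp.sum(xi) / n)
# Constraints: y_i x_i^T w >= 1 - xi_i and xi_i >= 0
constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value)
```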

Local minimizers

Consider an optimization problem (not necessarily convex):

$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{s.t.} \quad x \in \mathcal{A} .$$

We say $x^\star \in \mathcal{A}$ is a local minimizer if there is an open ball $U := \{ x \in \mathbb{R}^d : \|x - x^\star\|_2 < r \}$ of positive radius $r > 0$ such that $x^\star$ is a global minimizer for

$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{s.t.} \quad x \in \mathcal{A} \cap U .$$

In other words, nothing looks better than $x^\star$ in the immediate vicinity of $x^\star$.

[Figure: a locally optimal point on a non-convex curve.]

Local minimizers of convex problems

If the optimization problem is convex and $x \in \mathcal{A}$ is a local minimizer, then it is also a global minimizer.

[Figure: for a convex function, a local minimizer is a global minimizer.]
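This local-to-global property is what makes convex problems reliably solvable: any descent method that stops at a local minimizer has found a global one. A small numerical contrast (my own sketch) using plain gradient descent:

```python
import numpy as np

def gd(grad, x0, lr, steps=5000):
    # Plain gradient descent on a one-dimensional function.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex: f(x) = (x - 1)^2. Every starting point reaches the global minimizer 1.
print([round(gd(lambda x: 2 * (x - 1), x0, lr=0.1), 3) for x0 in (-5.0, 0.0, 7.0)])

# Non-convex: f(x) = x^4 - 3x^2 + x has two local minimizers; different starts
# land in different ones, and only one of them is global.
grad = lambda x: 4 * x**3 - 6 * x + 1
print([round(gd(grad, x0, lr=0.01), 3) for x0 in (-2.0, 2.0)])
```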

Key takeaways

1. Formulation of learning methods as optimization problems.
2. Convex sets, convex functions, and ways to check whether a function is convex.
3. Standard form of convex optimization problems; the concepts of local and global minimizers.