Recitation: Loss, Regularization, and Dual*


10-701 Recitation: Loss, Regularization, and Dual* Jay-Yoon Lee 02/26/2015 *Adapted figures from 10-725 lecture slides and from the book Elements of Statistical Learning

Loss and Regularization. An optimization problem can be expressed as minimizing a loss: $\arg\min_{M \in \text{models}} \sum_{i=1}^{n} \ell(x_i; M)$. If you want to maximize your objective function, the negative of the objective function is the loss.

Loss and Regularization. An optimization problem can be expressed as minimizing a loss: $\arg\min_{M \in \text{models}} \sum_{i=1}^{n} \ell(x_i; M)$. Introduce a regularization term (or "penalty") to prevent overfitting or to satisfy constraints: $\arg\min_{M \in \text{models}} \sum_{i=1}^{n} \ell(x_i; M) + \text{penalty}(M)$.
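To make this loss-plus-penalty view concrete, here is a minimal Python sketch (the particular loss, penalty, and data below are made-up placeholders, not taken from the slides):

import numpy as np

def objective(m, data, loss, penalty, lam=1.0):
    # Regularized empirical loss: sum_i loss(x_i; M) + lam * penalty(M).
    return sum(loss(x, m) for x in data) + lam * penalty(m)

# Illustrative choices: squared loss around a scalar model m, L2 penalty on m.
data = [1.0, 2.0, 3.0]
loss = lambda x, m: (x - m) ** 2
penalty = lambda m: m ** 2
candidates = np.linspace(-5.0, 5.0, 1001)
best_m = min(candidates, key=lambda m: objective(m, data, loss, penalty))
print(best_m)   # shrunk toward 0 relative to the unpenalized mean 2.0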

Loss and Regularization. Example: loss of the linear regression problem, $\arg\min_{\beta} \|y - X\beta\|_2^2$.

Loss and Regularization. Example: loss of the linear regression problem, $\arg\min_{\beta} \|y - X\beta\|_2^2$. Example: penalty for linear regression, $\arg\min_{\beta} \|y - X\beta\|_2^2 + \|\beta\|_1$.
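As a small numerical sketch of this penalized least-squares objective (random X and y are stand-ins, and the L1 weight is left at 1 as on the slide):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

def penalized_objective(beta, X, y):
    # ||y - X beta||_2^2 + ||beta||_1, the loss plus L1 penalty above.
    residual = y - X @ beta
    return residual @ residual + np.abs(beta).sum()

print(penalized_objective(beta_true, X, y))
print(penalized_objective(np.zeros(5), X, y))   # all-zero model pays no penalty but a large loss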

Loss and Regularization. More examples from Wikipedia: http://en.wikipedia.org/wiki/Regularization_(mathematics)

Dual: Lagrangian Function. Many constrained optimization problems can be expressed in terms of loss and penalty. Recall (Intermezzo: Convex Programs for Dummies). Primal optimization problem: $\min_x f(x)$ subject to $c_i(x) \le 0$. Lagrange function: $L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$. First-order optimality conditions in $x$: $\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$. Solve for $x$ and plug it back into $L$: maximize $L(x(\alpha), \alpha)$ over $\alpha$ (keep explicit constraints).

Dual: Lagrangian Function. More generally, $\min_{x \in \mathbb{R}^n} f(x)$ subject to $h_i(x) \le 0,\ i = 1, \dots, m$ and $\ell_j(x) = 0,\ j = 1, \dots, r$. Lagrangian: $L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i h_i(x) + \sum_{j=1}^{r} v_j \ell_j(x)$, with $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^r$. From 10-725 lecture notes.

Dual: Lagrangian Function, Important Property. The Lagrangian function is a lower bound on the loss function. Important property: for any $u \ge 0$ and $v$, $f(x) \ge L(x, u, v)$ at each feasible $x$. Why? For feasible $x$, $L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i \underbrace{h_i(x)}_{\le 0} + \sum_{j=1}^{r} v_j \underbrace{\ell_j(x)}_{= 0} \le f(x)$. From 10-725 lecture notes.
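To see this bound numerically, here is a small check on an invented toy problem (not from the notes): $f(x) = x_1^2 + x_2^2$ with one inequality constraint $h(x) = 1 - x_1 - x_2 \le 0$ and one equality constraint $\ell(x) = x_1 - x_2 = 0$.

import numpy as np

f = lambda x: x[0] ** 2 + x[1] ** 2
h = lambda x: 1.0 - x[0] - x[1]   # inequality constraint, h(x) <= 0 when feasible
ell = lambda x: x[0] - x[1]       # equality constraint, ell(x) = 0 when feasible

def lagrangian(x, u, v):
    return f(x) + u * h(x) + v * ell(x)

rng = np.random.default_rng(1)
for _ in range(1000):
    t = rng.uniform(0.5, 3.0)            # x = (t, t) with t >= 0.5 is feasible
    x = np.array([t, t])
    u = rng.uniform(0.0, 5.0)            # u >= 0
    v = rng.uniform(-5.0, 5.0)           # v unrestricted
    assert lagrangian(x, u, v) <= f(x) + 1e-12
print("L(x, u, v) <= f(x) held at every sampled feasible point")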

Loss Functions (Classification). Model: $f$; label: $y = \pm 1$; prediction: $\operatorname{sign}(f)$. Loss functions: misclassification (0-1) $I(\operatorname{sign}(f) \ne y)$; exponential $\exp(-yf)$; binomial deviance $\log(1 + \exp(-2yf))$; hinge $\max(1 - yf, 0)$. [Figure: misclassification, exponential, binomial deviance, squared error, and support vector (hinge) losses plotted against the margin $yf$.] From Elements of Statistical Learning, 2nd edition, Springer.
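A short sketch of these classification losses as functions of the margin $yf$ (the function names are my own; plotting is omitted):

import numpy as np

def zero_one_loss(y, f):
    # Misclassification loss I(sign(f) != y).
    return (np.sign(f) != y).astype(float)

def exponential_loss(y, f):
    return np.exp(-y * f)

def binomial_deviance(y, f):
    return np.log1p(np.exp(-2.0 * y * f))

def hinge_loss(y, f):
    return np.maximum(1.0 - y * f, 0.0)

f = np.linspace(-2.0, 2.0, 5)   # candidate scores for a positive example y = +1
print(hinge_loss(1.0, f))       # zero once the margin y*f exceeds 1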

Loss Functions (Regression). [Figure: squared error, absolute error, and Huber losses plotted against the residual $y - f$.] Squared error: $\ell(y, f(x)) = (y - f(x))^2$. Absolute loss: $\ell(y, f(x)) = |y - f(x)|$. Huber loss: $\ell(y, f(x)) = (y - f(x))^2$ for $|y - f(x)| \le \delta$, and $2\delta\,|y - f(x)| - \delta^2$ otherwise. From Elements of Statistical Learning, 2nd edition, Springer.
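The regression losses in code, with the Huber loss switching from the quadratic to the linear branch at $|y - f(x)| = \delta$ ($\delta = 1$ is an arbitrary illustrative choice):

import numpy as np

def squared_error(y, f):
    return (y - f) ** 2

def absolute_error(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)

residuals = np.linspace(-3.0, 3.0, 7)
print(huber_loss(0.0, residuals))   # quadratic near zero, linear in the tails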

Loss Functions. Classification: misclassification (0-1) $I(\operatorname{sign}(f) \ne y)$; exponential $\exp(-yf)$; binomial deviance $\log(1 + \exp(-2yf))$; hinge $\max(1 - yf, 0)$. Regression: squared error $\ell(y, f(x)) = (y - f(x))^2$; absolute loss $\ell(y, f(x)) = |y - f(x)|$; Huber loss $\ell(y, f(x)) = (y - f(x))^2$ for $|y - f(x)| \le \delta$, and $2\delta\,|y - f(x)| - \delta^2$ otherwise. From Elements of Statistical Learning, 2nd edition, Springer.

Classification Examples. Linear soft margin problem: $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i[\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$. Dual problem: $\max_{\alpha} -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$. From 10-701 lecture notes.
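A hedged sketch of solving this dual numerically; CVXPY is an assumed dependency (the slides do not prescribe a solver), and the two-Gaussian toy data are made up:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, C = 40, 1.0
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

# Q_ij = y_i y_j <x_i, x_j>, with a tiny jitter so the solver sees it as PSD.
Q = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(n)

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, alpha <= C, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

w = X.T @ (alpha.value * y)                     # recover the primal weight vector
print("support vectors:", int(np.sum(alpha.value > 1e-5)))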

Classification Examples: Logistic Regression.
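The logistic regression figure did not survive the transcription; as a stand-in, here is a minimal gradient-descent sketch for the L2-regularized logistic loss $\log(1 + \exp(-y\, w^\top x))$, with made-up data, step size, and regularization weight:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = np.where(X @ w_true + 0.5 * rng.normal(size=100) > 0, 1.0, -1.0)   # labels in {-1, +1}

def logistic_grad(w, X, y, lam=0.1):
    # Gradient of sum_i log(1 + exp(-y_i w^T x_i)) + (lam / 2) ||w||^2.
    sigma = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = exp(-m_i) / (1 + exp(-m_i))
    return -(X.T @ (sigma * y)) + lam * w

w = np.zeros(3)
for _ in range(500):
    w -= 0.01 * logistic_grad(w, X, y)
print(w)   # roughly aligned with w_true, shrunk by the penalty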

Penalty Functions. [Figure 3.12: contours of constant value of $\sum_j |\beta_j|^q$ for given values of $q$ = 4, 2, 1, 0.5, 0.1.] From Elements of Statistical Learning, 2nd edition, Springer.

Penalty Functions. [Figure: estimation picture for the lasso (left) and ridge regression (right) in the $(\beta_1, \beta_2)$ plane.] LASSO: $\arg\min_{\beta} \|y - X\beta\|_2^2 + \|\beta\|_1$. Ridge regression: $\arg\min_{\beta} \|y - X\beta\|_2^2 + \|\beta\|_2^2$. From Elements of Statistical Learning, 2nd edition, Springer.
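A brief comparison of the two penalties; scikit-learn is an assumed dependency, and the regularization strengths are arbitrary. The point is that the L1 penalty drives coefficients exactly to zero (the corner of the diamond in the figure) while the L2 penalty only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))   # typically several
print("ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))   # typically none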

Backup slides

Lagrange Multipliers. Lagrange function: $L(x, \alpha) := f(x) + \sum_{i=1}^{n} \alpha_i c_i(x)$, where $\alpha_i \ge 0$. Saddlepoint condition: if there are $x^*$ and nonnegative $\alpha^*$ such that $L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$ for all $x$ and all $\alpha \ge 0$, then $x^*$ is an optimal solution to the constrained optimization problem. From 10-701 Lecture 5.

Necessary Kuhn-Tucker Conditions. Assume the optimization problem satisfies the constraint qualifications and has a convex, differentiable objective and constraints. Then the KKT conditions are necessary and sufficient: $\partial_x L(x^*, \alpha^*) = \partial_x f(x^*) + \sum_i \alpha_i^* \partial_x c_i(x^*) = 0$ (saddlepoint in $x^*$); $\partial_{\alpha_i} L(x^*, \alpha^*) = c_i(x^*) \le 0$ (saddlepoint in $\alpha^*$); $\sum_i \alpha_i^* c_i(x^*) = 0$ (vanishing KKT gap). This yields an algorithm for solving optimization problems: solve for the saddlepoint and the KKT conditions. From 10-701 Lecture 5.
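As a numerical illustration, the conditions can be checked at the hand-derived solution of an invented toy problem, $\min_x x_1^2 + x_2^2$ subject to $c(x) = 1 - x_1 - x_2 \le 0$, whose solution is $x^* = (1/2, 1/2)$ with multiplier $\alpha^* = 1$:

import numpy as np

x_star = np.array([0.5, 0.5])
alpha_star = 1.0

grad_f = 2.0 * x_star                   # gradient of f(x) = x_1^2 + x_2^2
grad_c = np.array([-1.0, -1.0])         # gradient of c(x) = 1 - x_1 - x_2
c_val = 1.0 - x_star.sum()

print(grad_f + alpha_star * grad_c)     # saddlepoint in x: should be [0, 0]
print(c_val <= 0, alpha_star >= 0)      # feasibility of x* and alpha*
print(alpha_star * c_val)               # vanishing KKT gap: should be 0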

Lagrangian. Consider the general minimization problem $\min_{x \in \mathbb{R}^n} f(x)$ subject to $h_i(x) \le 0,\ i = 1, \dots, m$ and $\ell_j(x) = 0,\ j = 1, \dots, r$. It need not be convex, but of course we will pay special attention to the convex case. We define the Lagrangian as $L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i h_i(x) + \sum_{j=1}^{r} v_j \ell_j(x)$, with new variables $u \in \mathbb{R}^m$, $v \in \mathbb{R}^r$, $u \ge 0$ (implicitly, we define $L(x, u, v) = -\infty$ for $u < 0$). From 10-725 lecture notes.
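As a worked toy example of forming the dual from this Lagrangian (not from the notes): for $\min_x x^2$ subject to $x \ge 1$, i.e. $h(x) = 1 - x \le 0$, the Lagrangian is $L(x, u) = x^2 + u(1 - x)$. Minimizing over $x$ gives $x = u/2$ and the dual function $g(u) = u - u^2/4$, which is maximized at $u^* = 2$ with $g(u^*) = 1 = f(x^*)$, matching the primal optimum $x^* = 1$.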