Recitation: Loss, Regularization, and Dual*

1 Recitation: Loss, Regularization, and Dual*
Jay-Yoon Lee, 02/26/2015
*Adopted figures from lecture slides and from the book Elements of Statistical Learning

2 Loss and Regularization
An optimization problem can be expressed as minimizing a loss:
$\arg\min_{\text{models } M} \sum_{i=1}^{n} l(x_i; M)$
If you want to maximize your objective function, the negative of the objective function is the loss.

3 Loss and Regularization
An optimization problem can be expressed as minimizing a loss:
$\arg\min_{\text{models } M} \sum_{i=1}^{n} l(x_i; M)$
Introduce a regularization term (or penalty) to prevent overfitting or to satisfy constraints:
$\arg\min_{\text{models } M} \sum_{i=1}^{n} l(x_i; M) + \text{penalty}(M)$

4 Loss and Regularization
Example: Loss of linear regression problem
$\arg\min_{\beta} \|y - X\beta\|_2^2$

5 Loss and Regularization
Example: Loss of linear regression problem
$\arg\min_{\beta} \|y - X\beta\|_2^2$
Example: Penalty of linear regression
$\arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
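The loss-plus-penalty objective for linear regression can be sketched as follows. This is a minimal illustration assuming NumPy; the function names and the penalty weight `lam` are hypothetical and do not come from the slides.

```python
import numpy as np

def squared_loss(beta, X, y):
    """Squared-error loss ||y - X beta||_2^2."""
    r = y - X @ beta
    return float(r @ r)

def l1_penalty(beta):
    """L1 penalty ||beta||_1."""
    return float(np.sum(np.abs(beta)))

def objective(beta, X, y, lam):
    """Loss plus weighted penalty; lam is an assumed penalty weight."""
    return squared_loss(beta, X, y) + lam * l1_penalty(beta)

# Toy data chosen so that beta fits y exactly (illustrative only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

print(squared_loss(beta, X, y))    # the loss term alone
print(objective(beta, X, y, 0.5))  # loss plus 0.5 * ||beta||_1
```

Even when the fit is perfect, the penalty term keeps the objective positive, which is what pushes the minimizer toward smaller coefficients.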

6 Loss and Regularization
More examples from Wikipedia: http://en.wikipedia.org/wiki/regularization_(mathematics)

7 Dual: Lagrangian Function
Many constrained optimization problems can be expressed in terms of loss and penalty.
Intermezzo: Convex Programs for Dummies
Primal optimization problem: $\text{minimize}_x \; f(x)$ subject to $c_i(x) \le 0$
Lagrange function: $L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$
First order optimality conditions in x: $\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$
Solve for x and plug it back into L: $\text{maximize}_\alpha \; L(x(\alpha), \alpha)$ (keep explicit constraints)
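A small worked instance of this recipe may help. The one-dimensional problem below is an assumed toy example, not taken from the slides: minimize $x^2$ subject to $1 - x \le 0$.

```latex
\begin{align*}
L(x, \alpha) &= x^2 + \alpha (1 - x), \qquad \alpha \ge 0 \\
\partial_x L(x, \alpha) &= 2x - \alpha = 0 \;\Rightarrow\; x(\alpha) = \tfrac{\alpha}{2} \\
L(x(\alpha), \alpha) &= \tfrac{\alpha^2}{4} + \alpha - \tfrac{\alpha^2}{2} = \alpha - \tfrac{\alpha^2}{4} \\
\frac{d}{d\alpha}\Bigl(\alpha - \tfrac{\alpha^2}{4}\Bigr) &= 1 - \tfrac{\alpha}{2} = 0 \;\Rightarrow\; \alpha^* = 2, \quad L(x(\alpha^*), \alpha^*) = 1
\end{align*}
```

Plugging back gives $x^* = x(\alpha^*) = 1$ with $f(x^*) = 1$, so in this example the dual optimum equals the primal optimum.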

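8 Dual: Lagrangian Function
More generally,
$\min_{x \in \mathbb{R}^n} f(x)$ subject to $h_i(x) \le 0, \; i = 1, \ldots, m$ and $\ell_j(x) = 0, \; j = 1, \ldots, r$
Lagrangian:
$L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i h_i(x) + \sum_{j=1}^{r} v_j \ell_j(x)$, with $u \in \mathbb{R}^m$, $v \in \mathbb{R}^r$
From lecture notes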

9 Dual: Lagrangian Function
Important property: the Lagrangian function is a lower bound of the loss function.
For any $u \ge 0$ and $v$, $f(x) \ge L(x, u, v)$ at each feasible x.
Why? For feasible x,
$L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i \underbrace{h_i(x)}_{\le 0} + \sum_{j=1}^{r} v_j \underbrace{\ell_j(x)}_{= 0} \le f(x)$
From lecture notes
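The lower-bound property can be checked numerically on a toy problem. The sketch below assumes an illustrative problem that is not from the slides, $f(x) = x^2$ with the single inequality constraint $h(x) = 1 - x \le 0$, and simply evaluates the inequality on a grid.

```python
import numpy as np

def f(x):
    """Toy objective (assumed for illustration)."""
    return x ** 2

def h(x):
    """Toy inequality constraint h(x) <= 0, i.e. x >= 1."""
    return 1.0 - x

def lagrangian(x, u):
    """L(x, u) = f(x) + u * h(x); no equality constraints in this toy case."""
    return f(x) + u * h(x)

feasible_xs = np.linspace(1.0, 3.0, 21)   # all satisfy h(x) <= 0
multipliers = np.linspace(0.0, 5.0, 26)   # all satisfy u >= 0

# For feasible x and u >= 0 the term u * h(x) is <= 0, so L(x, u) <= f(x).
lower_bound_holds = all(
    lagrangian(x, u) <= f(x) + 1e-12
    for x in feasible_xs
    for u in multipliers
)
print(lower_bound_holds)
```

The bound can fail for infeasible x (for example x = 0 with u > 0), which is why the statement is restricted to feasible points.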

10 Loss Functions (Classification)
Model: f; Label: y = ±1; Prediction: sign(f)
Loss functions:
misclassification (0-1): $I(\mathrm{sign}(f) \ne y)$
exponential: $\exp(-yf)$
binomial deviance: $\log(1 + \exp(-2yf))$
hinge: $\max(1 - yf, 0)$
[Figure: loss plotted against the margin $y \cdot f$ for the Misclassification, Exponential, Binomial Deviance, Squared Error, and Support Vector losses]
From Elements of Statistical Learning, 2nd edition, Springer
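The four classification losses listed on this slide can be written directly as functions of the label y and the model output f. This is a minimal NumPy sketch; the function names are chosen here for illustration.

```python
import numpy as np

def zero_one(y, f):
    """Misclassification loss I(sign(f) != y)."""
    return (np.sign(f) != y).astype(float)

def exponential(y, f):
    """Exponential loss exp(-y f)."""
    return np.exp(-y * f)

def binomial_deviance(y, f):
    """Binomial deviance log(1 + exp(-2 y f))."""
    return np.log1p(np.exp(-2.0 * y * f))

def hinge(y, f):
    """Hinge loss max(1 - y f, 0)."""
    return np.maximum(1.0 - y * f, 0.0)

y = 1.0
f = np.array([-1.0, 0.5, 2.0])   # wrong side, inside the margin, well classified

print(zero_one(y, f))
print(exponential(y, f))
print(binomial_deviance(y, f))
print(hinge(y, f))
```

Note that the hinge loss is still positive for f = 0.5 even though the sign is correct, because the margin y·f is below 1.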

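11 Loss Functions (Regression)
[Figure: loss plotted against the residual $y - f$ for the Squared Error, Absolute Error, and Huber losses]
Squared-Error: $l(y, f(x)) = (y - f(x))^2$
Absolute Loss: $l(y, f(x)) = |y - f(x)|$
Huber Loss: $l(y, f(x)) = \begin{cases} (y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ 2\delta |y - f(x)| - \delta^2 & \text{otherwise} \end{cases}$
From Elements of Statistical Learning, 2nd edition, Springer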
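The regression losses can be sketched in the same way. The code below assumes NumPy and uses the Huber form given on the slide (quadratic up to a threshold `delta`, linear beyond it); the function names are illustrative.

```python
import numpy as np

def squared_error(y, f):
    """Squared-error loss (y - f)^2."""
    return (y - f) ** 2

def absolute_error(y, f):
    """Absolute loss |y - f|."""
    return np.abs(y - f)

def huber(y, f, delta):
    """Huber loss: (y - f)^2 if |y - f| <= delta, else 2*delta*|y - f| - delta^2."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)

y = 0.0
f = np.array([0.5, 1.0, 3.0])

print(squared_error(y, f))
print(absolute_error(y, f))
print(huber(y, f, delta=1.0))
```

For large residuals the Huber loss grows linearly rather than quadratically, so a single outlier contributes far less than it would under squared error.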

12 Loss Functions
Classification:
misclassification (0-1): $I(\mathrm{sign}(f) \ne y)$
exponential: $\exp(-yf)$
binomial deviance: $\log(1 + \exp(-2yf))$
hinge: $\max(1 - yf, 0)$
Regression:
Squared-Error: $l(y, f(x)) = (y - f(x))^2$
Absolute Loss: $l(y, f(x)) = |y - f(x)|$
Huber Loss: $l(y, f(x)) = \begin{cases} (y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ 2\delta |y - f(x)| - \delta^2 & \text{otherwise} \end{cases}$
From Elements of Statistical Learning, 2nd edition, Springer

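13 Classification Examples
Linear soft margin problem:
$\text{minimize}_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$
Dual problem:
$\text{maximize}_{\alpha} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
From 701 lecture notes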
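For a fixed (w, b), the smallest slack that satisfies both primal constraints is the hinge loss of each point, max(0, 1 - y_i(⟨w, x_i⟩ + b)), so the primal objective can be evaluated directly. The sketch below assumes NumPy and uses hypothetical toy data; it only evaluates the objective and does not solve the optimization problem.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Evaluate 0.5*||w||^2 + C*sum(slack) with the smallest feasible slack."""
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)   # equals the hinge loss per point
    value = 0.5 * float(w @ w) + C * float(slack.sum())
    return value, slack

# Toy data (illustrative): the third point lies on the wrong side of the boundary.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

value, slack = soft_margin_objective(w, b, X, y, C=1.0)
print(slack)   # only the third point needs slack
print(value)
```

This makes the link to the earlier loss-plus-penalty view explicit: the soft margin problem is a hinge loss with a squared-norm penalty on w.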

14 Classification Examples
Logistic Regression

15 Penalty Functions
[Figure: five panels, for q = 4, q = 2, q = 1, q = 0.5, q = 0.1]
FIGURE: Contours of constant value of $\sum_j |\beta_j|^q$ for given values of q.
From Elements of Statistical Learning, 2nd edition, Springer
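The penalty whose contours are drawn in the figure is straightforward to evaluate. The sketch below assumes NumPy; `lq_penalty` is a name chosen here for illustration. It compares a sparse and a dense coefficient vector for several values of q.

```python
import numpy as np

def lq_penalty(beta, q):
    """Penalty sum_j |beta_j|^q for q > 0."""
    return float(np.sum(np.abs(beta) ** q))

sparse = np.array([1.0, 0.0])   # all weight on one coordinate
dense = np.array([0.5, 0.5])    # same L1 norm, spread over two coordinates

for q in (4, 2, 1, 0.5, 0.1):
    print(q, lq_penalty(sparse, q), lq_penalty(dense, q))
```

In this example the two vectors tie at q = 1, the dense vector is cheaper for q > 1, and the sparse vector is cheaper for q < 1, which matches the change in contour shape across the panels.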

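16 Penalty Functions
[Figure: two panels with axes $\beta_1$ and $\beta_2$, each marking the estimate $\hat{\beta}$]
LASSO: $\arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Ridge Regression: $\arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
From Elements of Statistical Learning, 2nd edition, Springer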
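Ridge regression with a squared L2 penalty weighted by λ has the closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$, which makes the shrinkage effect easy to see numerically. The sketch below assumes NumPy, synthetic data, and an arbitrary penalty weight; none of these come from the slides.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimizer of ||y - X beta||_2^2 + lam * ||beta||_2^2 (lam = 0 gives least squares)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

beta_ols = ridge_fit(X, y, 0.0)
beta_ridge = ridge_fit(X, y, 10.0)

print(np.linalg.norm(beta_ols))
print(np.linalg.norm(beta_ridge))   # smaller: the penalty shrinks the coefficients
```

The lasso has no comparable closed form for a general design matrix, so it is not included in this sketch.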

17 Back up slides

18 Lagrange Multipliers
Lagrange function:
$L(x, \alpha) := f(x) + \sum_{i=1}^{n} \alpha_i c_i(x)$ where $\alpha_i \ge 0$
Saddlepoint Condition: if there are $x^*$ and nonnegative $\alpha^*$ such that
$L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$
then $x^*$ is an optimal solution to the constrained optimization problem.
From Lecture 5

19 Necessary Kuhn-Tucker Conditions
Assume the optimization problem
- satisfies the constraint qualifications
- has convex differentiable objective + constraints
Then the KKT conditions are necessary & sufficient:
$\partial_x L(x^*, \alpha^*) = \partial_x f(x^*) + \sum_i \alpha_i^* \partial_x c_i(x^*) = 0$ (Saddlepoint in $x^*$)
$\partial_{\alpha_i} L(x^*, \alpha^*) = c_i(x^*) \le 0$ (Saddlepoint in $\alpha^*$)
$\sum_i \alpha_i^* c_i(x^*) = 0$ (Vanishing KKT-gap)
Yields algorithm for solving optimization problems: solve for saddlepoint and KKT conditions.
From Lecture 5
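The three conditions can be verified numerically at a candidate solution. The sketch below assumes an illustrative convex problem that is not from the slides: minimize $x_1^2 + x_2^2$ subject to $c(x) = 1 - x_1 - x_2 \le 0$, with the candidate pair $x^* = (0.5, 0.5)$, $\alpha^* = 1$ worked out by hand.

```python
import numpy as np

def grad_f(x):
    """Gradient of the toy objective f(x) = x1^2 + x2^2."""
    return 2.0 * x

def c(x):
    """Toy constraint c(x) = 1 - x1 - x2 <= 0."""
    return 1.0 - float(x.sum())

def grad_c(x):
    """Gradient of the toy constraint."""
    return -np.ones_like(x)

x_star = np.array([0.5, 0.5])   # candidate primal solution
alpha_star = 1.0                # candidate multiplier, must be >= 0

stationarity = grad_f(x_star) + alpha_star * grad_c(x_star)   # should be 0
feasible = c(x_star) <= 1e-12                                 # c(x*) <= 0
kkt_gap = alpha_star * c(x_star)                              # should be 0

print(stationarity, feasible, kkt_gap)
```

Here the constraint is active (c(x*) = 0), so the KKT gap vanishes even though the multiplier is strictly positive.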

20 Lagrangian
Consider the general minimization problem
$\min_{x \in \mathbb{R}^n} f(x)$ subject to $h_i(x) \le 0, \; i = 1, \ldots, m$ and $\ell_j(x) = 0, \; j = 1, \ldots, r$
It need not be convex, but of course we will pay special attention to the convex case.
We define the Lagrangian as
$L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i h_i(x) + \sum_{j=1}^{r} v_j \ell_j(x)$
New variables $u \in \mathbb{R}^m$, $v \in \mathbb{R}^r$, with $u \ge 0$ (implicitly, we define $L(x, u, v) = -\infty$ for $u < 0$)
From 725 lecture notes
