10-701 Recitation: Loss, Regularization, and Dual*
Jay-Yoon Lee, 02/26/2015
*Adapted figures from 10-725 lecture slides and from the book Elements of Statistical Learning
Loss and Regularization
An optimization problem can be expressed as minimizing a loss:
arg min_{M ∈ models} Σ_{i=1}^n l(x_i; M)
If you want to maximize an objective function, the loss is the negative of that objective.
Loss and Regularization
An optimization problem can be expressed as minimizing a loss:
arg min_{M ∈ models} Σ_{i=1}^n l(x_i; M)
Introduce a regularization term (or penalty) to prevent overfitting or to satisfy constraints:
= arg min_{M ∈ models} Σ_{i=1}^n l(x_i; M) + penalty(M)
Loss and Regularization
Example: loss of the linear regression problem
arg min_β ‖y − Xβ‖₂²
Loss and Regularization
Example: loss of the linear regression problem
arg min_β ‖y − Xβ‖₂²
Example: penalty for linear regression
arg min_β ‖y − Xβ‖₂² + ‖β‖₁
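To make the loss-plus-penalty form concrete, here is a minimal NumPy sketch evaluating the L1-penalized least-squares objective (the data, the candidate coefficient vectors, and the penalty weight `lam` are made up for illustration):

```python
import numpy as np

def lasso_objective(beta, X, y, lam=1.0):
    """Regularized least-squares loss: ||y - X beta||_2^2 + lam * ||beta||_1."""
    residual = y - X @ beta
    return residual @ residual + lam * np.abs(beta).sum()

# Tiny made-up example: y is generated exactly from a sparse true beta.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta_true = np.array([2.0, 0.0])
y = X @ beta_true

# The sparse coefficient vector achieves a lower penalized objective
# than a denser alternative that fits the data worse.
dense = np.array([1.0, 1.0])
print(lasso_objective(beta_true, X, y))  # 2.0 (zero loss, penalty 2)
print(lasso_objective(dense, X, y))      # 4.0 (loss 2, penalty 2)
```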
Loss and Regularization
More examples from Wikipedia: http://en.wikipedia.org/wiki/Regularization_(mathematics)
Intermezzo — Convex Programs for Dummies
Many constrained optimization problems can be expressed in terms of loss and penalty.
Primal optimization problem:
minimize_x f(x) subject to c_i(x) ≤ 0
Lagrange function:
L(x, α) = f(x) + Σ_i α_i c_i(x)
First-order optimality conditions in x:
∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0
Solve for x and plug it back into L to obtain the dual problem (keep explicit constraints):
maximize_α L(x(α), α) subject to α ≥ 0
Dual: Lagrangian Function
More generally,
min_{x ∈ R^n} f(x) subject to h_i(x) ≤ 0, i = 1, …, m; ℓ_j(x) = 0, j = 1, …, r
Lagrangian:
L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j ℓ_j(x), with u ∈ R^m, v ∈ R^r
From 10-725 lecture notes
Dual: Lagrangian Function — Important Property
The Lagrangian is a lower bound on the loss function: for any u ≥ 0 and v, f(x) ≥ L(x, u, v) at each feasible x.
Why? For feasible x,
L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) [each term ≤ 0] + Σ_{j=1}^r v_j ℓ_j(x) [each term = 0] ≤ f(x)
From 10-725 lecture notes
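This lower-bound property is easy to check numerically. A minimal sketch on an assumed toy problem, minimize f(x) = x² subject to h(x) = 1 − x ≤ 0 (i.e. x ≥ 1):

```python
# Toy problem: minimize f(x) = x^2 subject to h(x) = 1 - x <= 0.
f = lambda x: x ** 2
h = lambda x: 1.0 - x

def lagrangian(x, u):
    """L(x, u) = f(x) + u * h(x); no equality constraints in this example."""
    return f(x) + u * h(x)

# For any feasible x (h(x) <= 0) and any u >= 0, L(x, u) <= f(x).
for x in [1.0, 2.0, 5.0]:        # all feasible points
    for u in [0.0, 0.5, 3.0]:    # all nonnegative multipliers
        assert lagrangian(x, u) <= f(x)
print("lower-bound property verified on the grid")
```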
Loss Functions (Classification)
Model: f; label: y = ±1; prediction: sign(f)
Loss functions:
- Misclassification (0-1): I(sign(f) ≠ y)
- Exponential: exp(−yf)
- Binomial deviance: log(1 + exp(−2yf))
- Hinge (support vector): max(1 − yf, 0)
[Figure: misclassification, exponential, binomial deviance, squared error, and support vector losses plotted against the margin yf]
From Elements of Statistical Learning, 2nd edition, Springer
Loss Functions (Regression)
- Squared error: l(y, f(x)) = (y − f(x))²
- Absolute loss: l(y, f(x)) = |y − f(x)|
- Huber loss: l(y, f(x)) = (y − f(x))² for |y − f(x)| ≤ δ; 2δ|y − f(x)| − δ² otherwise
[Figure: squared error, absolute error, and Huber losses plotted against the residual y − f]
From Elements of Statistical Learning, 2nd edition, Springer
Loss Functions
Classification:
- Misclassification (0-1): I(sign(f) ≠ y)
- Exponential: exp(−yf)
- Binomial deviance: log(1 + exp(−2yf))
- Hinge: max(1 − yf, 0)
Regression:
- Squared error: l(y, f(x)) = (y − f(x))²
- Absolute loss: l(y, f(x)) = |y − f(x)|
- Huber loss: l(y, f(x)) = (y − f(x))² for |y − f(x)| ≤ δ; 2δ|y − f(x)| − δ² otherwise
From Elements of Statistical Learning, 2nd edition, Springer
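The losses above can be written out directly. This is a sketch in NumPy; the function names are my own, and δ defaults to 1 for illustration:

```python
import numpy as np

# Classification losses: label y in {-1, +1}, real-valued score f.
def misclassification(y, f): return float(np.sign(f) != y)
def exponential(y, f):       return np.exp(-y * f)
def binomial_deviance(y, f): return np.log(1.0 + np.exp(-2.0 * y * f))
def hinge(y, f):             return max(1.0 - y * f, 0.0)

# Regression losses: target y, prediction f.
def squared_error(y, f):  return (y - f) ** 2
def absolute_error(y, f): return abs(y - f)

def huber(y, f, delta=1.0):
    """Quadratic near zero, linear in the tails; continuous at |r| = delta."""
    r = abs(y - f)
    return r ** 2 if r <= delta else 2.0 * delta * r - delta ** 2

print(hinge(1.0, 0.5))   # 0.5: inside the margin
print(huber(0.0, 3.0))   # 5.0: linear regime, 2*1*3 - 1
```

Note that the two Huber branches agree at |y − f| = δ (both give δ²), which is exactly the continuity the piecewise definition is designed for.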
Classification Examples
Linear soft-margin problem:
minimize_{w,b} (1/2)‖w‖² + C Σ_i ξ_i
subject to y_i[⟨w, x_i⟩ + b] ≥ 1 − ξ_i and ξ_i ≥ 0
Dual problem:
maximize_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
subject to Σ_i α_i y_i = 0 and α_i ∈ [0, C]
From 10-701 lecture notes
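On a tiny made-up dataset the dual can be maximized by brute force. A sketch with two 1-D points, where the constraint Σ_i α_i y_i = 0 forces the two multipliers to be equal:

```python
import numpy as np

# Two made-up 1-D training points: x = -1 labeled -1, x = +1 labeled +1.
X = np.array([-1.0, 1.0])
y = np.array([-1.0, 1.0])
K = np.outer(X, X)                 # Gram matrix of inner products <x_i, x_j>

def dual_objective(alpha):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 here, so with C = 1
# it suffices to search a grid of feasible equal pairs.
grid = np.linspace(0.0, 1.0, 101)
values = [dual_objective(np.array([a, a])) for a in grid]
a_star = grid[int(np.argmax(values))]

w = np.sum(a_star * y * X)         # recover the primal weight w = sum_i alpha_i y_i x_i
print(a_star, w)                   # maximizer alpha = 0.5, giving w = 1
```

Here the dual objective reduces to 2α − 2α², maximized at α = 0.5, and the recovered separating function f(x) = x correctly classifies both points with margin 1.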
Classification Examples
Logistic Regression
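The slide's details are not shown in the extracted text. As a hedged illustration, logistic regression minimizes the binomial deviance loss from the earlier slide; a sketch of plain gradient descent on a made-up separable 1-D dataset:

```python
import numpy as np

# Made-up separable 1-D data with labels in {-1, +1}.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = 0.0
for _ in range(200):
    # Gradient of the mean logistic loss log(1 + exp(-y w x)) w.r.t. w.
    grad = np.mean(-y * X / (1.0 + np.exp(y * w * X)))
    w -= 0.5 * grad                # fixed step size, chosen for illustration

predictions = np.sign(w * X)
print(w > 0, (predictions == y).all())
```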
Penalty Functions
[Figure 3.12: contours of constant value of Σ_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1]
From Elements of Statistical Learning, 2nd edition, Springer
Penalty Functions
[Figure: contours of the least-squares error around β̂ in the (β₁, β₂) plane, together with the L1 (LASSO) and L2 (ridge) constraint regions]
LASSO: arg min_β ‖y − Xβ‖₂² + ‖β‖₁
Ridge regression: arg min_β ‖y − Xβ‖₂² + ‖β‖₂²
From Elements of Statistical Learning, 2nd edition, Springer
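Ridge regression has the closed-form solution (XᵀX + λI)⁻¹Xᵀy, where λ is the penalty weight (not written explicitly on the slide). A sketch with made-up data showing that the coefficients shrink toward zero as λ grows:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Made-up design matrix and response.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 1.0, 3.0])

beta_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares
beta_reg = ridge(X, y, 10.0)     # a heavier penalty shrinks the coefficients
print(beta_ols)                  # [2. 1.]
print(np.linalg.norm(beta_reg) < np.linalg.norm(beta_ols))
```

The L1 (LASSO) penalty has no such closed form; its corners in the constraint region are what produce exactly-zero coefficients, which is the geometric point of the figure above.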
Backup Slides
Lagrange Multipliers
Lagrange function:
L(x, α) := f(x) + Σ_{i=1}^n α_i c_i(x), where α_i ≥ 0
Saddlepoint condition: if there are x* and nonnegative α* such that
L(x*, α) ≤ L(x*, α*) ≤ L(x, α*)
then x* is an optimal solution to the constrained optimization problem.
From 10-701 Lecture 5
Necessary Kuhn-Tucker Conditions
Assume the optimization problem satisfies the constraint qualifications and has a convex differentiable objective and constraints. Then the KKT conditions are necessary and sufficient:
∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0 (saddlepoint in x*)
∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0 (saddlepoint in α*)
Σ_i α_i* c_i(x*) = 0 (vanishing KKT gap)
This yields an algorithm for solving optimization problems: solve for the saddlepoint and the KKT conditions.
From 10-701 Lecture 5
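The three conditions can be verified by hand on an assumed toy problem: minimize f(x) = x² subject to c(x) = 1 − x ≤ 0. Stationarity 2x* − α* = 0 together with the active constraint x* = 1 gives α* = 2:

```python
# Toy problem: minimize f(x) = x^2 subject to c(x) = 1 - x <= 0.
# Candidate solution: x* = 1 with multiplier alpha* = 2.
x_star, alpha_star = 1.0, 2.0

grad_f = 2.0 * x_star        # f'(x*)
grad_c = -1.0                # c'(x*)
c = 1.0 - x_star             # constraint value at x*

assert grad_f + alpha_star * grad_c == 0.0   # saddlepoint in x
assert c <= 0.0                              # primal feasibility
assert alpha_star >= 0.0                     # dual feasibility
assert alpha_star * c == 0.0                 # vanishing KKT gap
print("KKT conditions hold at (x*, alpha*)")
```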
Lagrangian
Consider the general minimization problem
min_{x ∈ R^n} f(x) subject to h_i(x) ≤ 0, i = 1, …, m; ℓ_j(x) = 0, j = 1, …, r
It need not be convex, but of course we will pay special attention to the convex case.
We define the Lagrangian as
L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j ℓ_j(x)
with new variables u ∈ R^m, v ∈ R^r, u ≥ 0 (implicitly, we define L(x, u, v) = −∞ for u < 0)
From 10-725 lecture notes