Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham Kakade 26 26 Sham Kakade 2 Variable Selection by Regularization Simple Variable Selection LASSO: Sparse Regression Ridge regression: Penalizes large weights What if we want to perform feature selection? E.g., Which regions of the brain are important for word prediction? Can t simply choose features with largest coefficients in ridge solution Machine Learning CSE546 Sham Kakade University of Washington October, 26 26 Sham Kakade 3 Try new (convex) penalty: Penalize non-zero weights Regularization penalty: Leads to sparse solutions Just like ridge regression, solution is indexed by a continuous param λ Major impact in: statistics, machine learning & electrical engineering 26 Sham Kakade 4
LASSO Regression (Related) Constrained Optimization LASSO: least absolute shrinkage and selection operator New objective: LASSO solution: ŵ LASSO = arg min w j= 2 t(x j) (w + w ih i(x j))! + i= w i i= 26 Sham Kakade 5 26 Sham Kakade 6 Optimizing the LASSO Objective Coordinate Descent LASSO solution: ŵ LASSO = arg min w j= 2 t(x j) (w + w ih i(x j))! + i= w i i= Given a function F Want to find minimum Often, hard to find minimum for all coordinates, but easy for one coordinate Coordinate descent: How do we pick next coordinate? Super useful approach for *many* problems 26 Sham Kakade 7 Converges to optimum in some cases, such as LASSO 26 Sham Kakade 8 2
Optimizing LASSO Objective One Coordinate at a Time Subgradients of Convex Functions 2 t(x j ) (w + w i h i (x j ))! + w i j= i= i= Taking the derivative: Residual sum of squares (RSS):! @ RSS(w) = 2 h`(x j ) t(x j ) (w + w i h i (x j )) @w` j= i= Gradients lower bound convex functions: Gradients are unique at w iff function differentiable at w Penalty term: Subgradients: Generalize gradients to non-differentiable points: Any plane that lower bounds function: 26 Sham Kakade 9 26 Sham Kakade Taking the Subgradient Gradient of RSS term: @ RSS(w) =a`w` c` @w` If no penalty: Subgradient of full objective:!2 t(xj) (w + wihi(xj)) + wi j= i= i= a` =2 (h`(xj)) 2 j= c` =2 h`(xj) @t(xj) (w + X wihi(xj)) A i6=` j= Setting Subgradient to 8 < @ w`f (w) = : a`w` c` w` < [ c`, c` + ] w` = a`w` c` + w` > 26 Sham Kakade 26 Sham Kakade 2 3
Soft Thresholding ŵ` = 8 < : (c` + )/a` c` < c` 2 [, ] (c` )/a` c` > Coordinate Descent for LASSO (aka Shooting Algorithm) Repeat until convergence Pick a coordinate l at (random or sequentially) 8 Set: < (c` + )/a` c` < ŵ` = c` 2 [, ] : (c` )/a` c` > c` From Kevin Murphy textbook Where: a` =2 (h`(xj)) 2 j= c` =2 h`(xj) @t(xj) (w + X wihi(xj)) A i6=` j= For convergence rates, see Shalev-Shwartz and Tewari 29 Other common technique = LARS 26 Sham Kakade 3 Least angle regression and shrinkage, Efron et al. 24 26 Sham Kakade 4 Recall: Ridge Coefficient Path Now: LASSO Coefficient Path From Kevin Murphy textbook From Kevin Murphy textbook Typical approach: select λ using cross validation 26 Sham Kakade 5 26 Sham Kakade 6 4
What you need to know Sample size issues? Variable Selection: find a sparse solution to learning problem L regularization is one way to do variable selection Applies beyond regression Hundreds of other approaches out there LASSO objective non-differentiable, but convex è Use subgradient No closed-form solution for minimization è Use coordinate descent Shooting algorithm is simple approach for solving LASSO 26 Sham Kakade 7 Sham Kakade 26 8 Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington October 3, 26 Sham Kakade 26 9 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS Sham Kakade 26 2 5
Weather prediction revisted Reading Your Brain, Simple Example Pairwise classification accuracy: 85% Person Animal [Mitchell et al.] Temperature Sham Kakade 26 2 Sham Kakade 26 22 Classification Learn: h:x! Y X features Y target classes Link Functions Estimating P(Y X): Why not use standard linear regression? Conditional probability: P(Y X) Suppose you know P(Y X) exactly, how should you classify? Bayes optimal classifier: How do we estimate P(Y X)? Sham Kakade 26 23 Combing regression and probability? Need a mapping from real values to [,] A link function! Sham Kakade 26 24 6
.9.8.7.6.5.4.3.2. Logistic Regression Logistic function (or Sigmoid): Understanding the sigmoid Learn P(Y X) directly Assume a particular functional form for link function Sigmoid applied to a linear function of the input features: Z w =-2, w =- w =, w =- w =, w =-.5.9.8.7.6.5.4.3.2..9.8.7.6.5.4.3.2..9.8.7.6.5.4.3.2. -6-4 -2 2 4 6-6 -4-2 2 4 6-6 -4-2 2 4 6 Features can be discrete or continuous! Sham Kakade 26 25 Sham Kakade 26 26 Logistic Regression a Linear classifier -6-4 -2 2 4 6 Very convenient! implies implies implies linear classification rule! Sham Kakade 26 27 Sham Kakade 26 28 7
Optimizing concave function Gradient ascent Loss function: Conditional Likelihood Conditional likelihood for Logistic Regression is concave. Find optimum with gradient ascent Have a bunch of iid data of the form: Gradient: Update rule: Step size, h > Discriminative (logistic regression) loss function: Conditional Data Likelihood Gradient ascent is simplest of optimization approaches e.g., Conjugate gradient ascent can be much better Sham Kakade 26 29 Sham Kakade 26 3 Expressing Conditional Log Likelihood Maximizing Conditional Log Likelihood `(w) = X j y j ln P (Y = x j, w)+( y j )lnp (Y = x j, w) Good news: l(w) is concave function of w, no local optima problems Bad news: no closed-form solution to maximize l(w) Good news: concave functions easy to optimize Sham Kakade 26 3 Sham Kakade 26 32 8
Maximize Conditional Log Likelihood: Gradient ascent Gradient Ascent for LR Gradient ascent algorithm: iterate until change < e (t) For i=,,k, (t) repeat Sham Kakade 26 33 Sham Kakade 26 34 Regularization in linear regression Linear Separability Overfitting usually leads to very large parameter choices, e.g.: -2.2 + 3. X.3 X 2 -. + 4,7,9.7 X 8,585,638.4 X 2 + Regularized least-squares (a.k.a. ridge regression), for l >: Sham Kakade 26 35 Sham Kakade 26 36 9
Large parameters Overfitting Regularized Conditional Log Likelihood Add regularization penalty, e.g., L 2 : NY `(w) =ln P (y j x j, w) 2 w 2 2 j= If data is linearly separable, weights go to infinity Practical note about w : Gradient of regularized likelihood: In general, leads to overfitting: Penalizing high weights can prevent overfitting Sham Kakade 26 37 Sham Kakade 26 38 Standard v. Regularized Updates Please Stop!! Stopping criterion Maximum conditional likelihood estimate NY w = arg max ln P (y j x j, w) w j= (t) `(w) =ln Y j P (y j x j, w)) w 2 2 When do we stop doing gradient descent? Regularized maximum conditional likelihood estimate NY w = arg max ln P (y j x j, w) wi 2 w 2 j= i= (t) Because l(w) is strongly concave: i.e., because of some technical condition `(w ) `(w) apple 2 r`(w) 2 2 Thus, stop when: Sham Kakade 26 39 Sham Kakade 26 4
Digression: Logistic regression for more than 2 classes Logistic regression in more general case (C classes), where Y in {,,C-} Digression: Logistic regression more generally Logistic regression in more general case, where Y in {,,C-} for c> P (Y = c x, w) = exp(w c + P k i= w cix i ) + P C c = exp(w c + P k i= w c ix i ) for c= (normalization, so no weights for this class) P (Y = x, w) = + P C c = exp(w c + P k i= w c ix i ) Sham Kakade 26 4 Learning procedure is basically the same as what we derived! Sham Kakade 26 42 The Cost, The Cost!!! Think about the cost Stochastic Gradient Descent What s the cost of a gradient update step for LR??? (t) Machine Learning CSE546 Sham Kakade University of Washington October 3, 26 Sham Kakade 26 43 Sham Kakade 26 44
Learning Problems as Expectations Gradient ascent in Terms of Expectations Minimizing loss in training data: Given dataset: Sampled iid from some distribution p(x) on features: Loss function, e.g., hinge loss, logistic loss, We often minimize loss in training data: `D(w) = `(w, x j ) N j= However, we should really minimize expected loss on all data: Z `(w) =E x [`(w, x)] = p(x)`(w, x)dx So, we are approximating the integral by the average on the training data Z `(w) =E x [`(w, x)] = p(x)`(w, x)dx True objective function: Taking the gradient: True gradient ascent rule: How do we estimate expected gradient? Sham Kakade 26 45 Sham Kakade 26 46 SGD: Stochastic Gradient Ascent (or Descent) True gradient: Sample based approximation: What if we estimate gradient with just one sample??? Unbiased estimate of gradient Very noisy! Called stochastic gradient ascent (or descent) Among many other names VERY useful in practice!!! r`(w) =E x [r`(w, x)] Stochastic Gradient Ascent for Logistic Regression Logistic loss as a stochastic function: E x [`(w, x)] = E x ln P (y x, w) Batch gradient ascent updates: w 2 2 8 9 < w (t+) i w (t) i + : w(t) i + = x (j) i [y (j) P (Y = x (j), w (t) )] N ; j= Stochastic gradient ascent updates: Online setting: n o i + t w (t) i + x (t) i [y (t) P (Y = x (t), w (t) )] w (t+) i w (t) Sham Kakade 26 47 Sham Kakade 26 48 2
Stochastic Gradient Ascent: general case Given a stochastic function of parameters: Want to find maximum Start from w () Repeat until convergence: Get a sample data point x t Update parameters: Works on the online learning setting! Complexity of each gradient step is constant in number of examples! In general, step size changes with iterations What you should know Classification: predict discrete classes rather than real values Logistic regression model: Linear model Logistic function maps real values to [,] Optimize conditional likelihood Gradient computation Overfitting Regularization Regularized optimization Cost of gradient step is high, use stochastic gradient descent Sham Kakade 26 49 Sham Kakade 26 5 Stopping criterion Convergence rates for gradient descent/ascent `(w) =ln Y j P (y j x j, w)) w 2 2 Regularized logistic regression is strongly concave Negative second derivative bounded away from zero: Number of Iterations to get to accuracy `(w ) `(w) apple If func Lipschitz: O(/ϵ 2 ) If gradient of func Lipschitz: O(/ϵ) Strong concavity (convexity) is super helpful!! If func is strongly convex: O(ln(/ϵ)) For example, for strongly concave l(w): `(w ) `(w) apple 2 r`(w) 2 2 Sham Kakade 26 5 Sham Kakade 26 52 3