Classification: Logistic Regression


Announcements
Classification: Logistic Regression. Machine Learning CSE546, Sham Kakade, University of Washington, October 3, 2016.
HW due on Friday.
Today: review of sub-gradients and LASSO; logistic regression.

Variable Selection by Regularization
Simple variable selection / LASSO: sparse regression.
Ridge regression penalizes large weights, but what if we want to perform feature selection? E.g., which regions of the brain are important for word prediction? We can't simply choose the features with the largest coefficients in the ridge solution.
Try a new (convex) penalty that penalizes non-zero weights. This regularization penalty leads to sparse solutions. Just like ridge regression, the solution is indexed by a continuous parameter λ.
Major impact in statistics, machine learning & electrical engineering.

LASSO Regression and (Related) Constrained Optimization
LASSO: least absolute shrinkage and selection operator. New objective; the LASSO solution is

\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Optimizing the LASSO Objective: Coordinate Descent
Given a function F, we want to find its minimum. It is often hard to minimize over all coordinates at once, but easy to minimize over one coordinate at a time.
Coordinate descent: how do we pick the next coordinate? A super useful approach for *many* problems, and it converges to the optimum in some cases, such as LASSO.
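As a concrete reference, here is a minimal Python sketch of this objective, assuming the basis-function values h_i(x_j) have already been collected into an N-by-k matrix H (that layout, and the function name, are choices made here, not something from the slides):

```python
import numpy as np

def lasso_objective(w0, w, H, t, lam):
    """LASSO objective: squared error plus an L1 penalty on the weights.

    H   : (N, k) matrix of feature values h_i(x_j)
    t   : (N,)   targets t(x_j)
    w0  : scalar intercept (not penalized)
    w   : (k,)   weight vector
    lam : regularization strength lambda
    """
    residuals = t - (w0 + H @ w)          # t(x_j) - (w_0 + sum_i w_i h_i(x_j))
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))
```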

Optimizing the LASSO Objective One Coordinate at a Time; Subgradients of Convex Functions

F(w) = \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Taking the derivative of the residual sum of squares (RSS) term:

\frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)

Gradients lower bound convex functions, and the gradient is unique at w iff the function is differentiable at w. The penalty term is not differentiable at zero. Subgradients generalize gradients to non-differentiable points: any plane that lower bounds the function.

Taking the Subgradient
The gradient of the RSS term can be written as

\frac{\partial}{\partial w_\ell} RSS(w) = a_\ell w_\ell - c_\ell, \quad
a_\ell = 2 \sum_{j=1}^N \big(h_\ell(x_j)\big)^2, \quad
c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \Big( w_0 + \sum_{i \neq \ell} w_i h_i(x_j) \Big) \right)

If there were no penalty, setting this to zero would give w_\ell = c_\ell / a_\ell.

Setting the subgradient of the full objective to 0:

\partial_{w_\ell} F(w) = \begin{cases} a_\ell w_\ell - c_\ell - \lambda & w_\ell < 0 \\ [\, -c_\ell - \lambda,\; -c_\ell + \lambda \,] & w_\ell = 0 \\ a_\ell w_\ell - c_\ell + \lambda & w_\ell > 0 \end{cases}
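Under the same assumed matrix layout, a_\ell and c_\ell could be computed for the current weights roughly as follows (an illustrative helper, not from the slides):

```python
import numpy as np

def a_c_for_coordinate(ell, H, t, w0, w):
    """Compute a_ell and c_ell for coordinate ell at the current weights.

    H : (N, k) matrix of feature values h_i(x_j); t : (N,) targets.
    """
    # residual that ignores coordinate ell: t(x_j) - (w_0 + sum_{i != ell} w_i h_i(x_j))
    r_minus_ell = t - (w0 + H @ w) + H[:, ell] * w[ell]
    a_ell = 2.0 * np.sum(H[:, ell] ** 2)
    c_ell = 2.0 * np.dot(H[:, ell], r_minus_ell)
    return a_ell, c_ell
```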

Soft Thresholding

\hat{w}_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}

[Soft-thresholding figure from the Kevin Murphy textbook.]

Coordinate Descent for LASSO (aka the Shooting Algorithm)
Repeat until convergence: pick a coordinate \ell (at random or sequentially) and set \hat{w}_\ell by the soft-thresholding rule above, where

a_\ell = 2 \sum_{j=1}^N \big(h_\ell(x_j)\big)^2, \quad
c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \Big( w_0 + \sum_{i \neq \ell} w_i h_i(x_j) \Big) \right)

For convergence rates, see Shalev-Shwartz and Tewari 2009. Another common technique is LARS (least angle regression and shrinkage, Efron et al. 2004).

Recall: Ridge Coefficient Path. Now: LASSO Coefficient Path.
[Coefficient-path figures from the Kevin Murphy textbook.]
Typical approach: select λ using cross validation.
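Putting the pieces together, a hedged sketch of the shooting algorithm might look like the following; the intercept is fixed at the mean target purely for simplicity, and the sweep order and number of passes are choices not prescribed by the slides:

```python
import numpy as np

def shooting_lasso(H, t, lam, num_passes=100):
    """Coordinate descent for LASSO (shooting algorithm) -- a minimal sketch.

    H   : (N, k) matrix of feature values h_i(x_j)
    t   : (N,)   targets
    lam : regularization strength lambda
    """
    N, k = H.shape
    w0 = t.mean()                           # unpenalized intercept (simple choice)
    w = np.zeros(k)
    for _ in range(num_passes):
        for ell in range(k):                # sweep coordinates sequentially
            # residual with coordinate ell removed: t - (w0 + sum_{i != ell} w_i h_i)
            r = t - (w0 + H @ w) + H[:, ell] * w[ell]
            a = 2.0 * np.sum(H[:, ell] ** 2)
            c = 2.0 * np.dot(H[:, ell], r)
            # soft-thresholding update for coordinate ell
            if c < -lam:
                w[ell] = (c + lam) / a
            elif c > lam:
                w[ell] = (c - lam) / a
            else:
                w[ell] = 0.0
    return w0, w
```

Sweeping coordinates sequentially is the simplest choice; picking them at random, as the slide also allows, works as well.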

What you need to know
Sample size issues?
Variable selection: find a sparse solution to the learning problem.
L1 regularization is one way to do variable selection; it applies beyond regression, and hundreds of other approaches are out there.
The LASSO objective is non-differentiable but convex, so use the subgradient.
There is no closed-form solution for the minimization, so use coordinate descent.
The shooting algorithm is a simple approach for solving LASSO.

Classification: Logistic Regression
Machine Learning CSE546, Sham Kakade, University of Washington, October 3, 2016

Thus far, regression: predict a continuous value given some inputs.

Weather prediction revisited
[Figure: weather/temperature prediction example.]

Reading Your Brain, Simple Example
Pairwise classification accuracy: 85%, distinguishing "Person" from "Animal" words. [Mitchell et al.]

Classification
Learn h : X → Y, where X are the features and Y the target classes.
Conditional probability: P(Y|X). Suppose you know P(Y|X) exactly; how should you classify? Bayes optimal classifier: predict the most probable class. How do we estimate P(Y|X)?

Link Functions
Estimating P(Y|X): why not use standard linear regression?
Combining regression and probability? We need a mapping from real values to [0,1]: a link function!

Logistic Regression
Logistic function (or sigmoid): \sigma(z) = \frac{1}{1 + e^{-z}}
Learn P(Y|X) directly: assume a particular functional form for the link function, namely the sigmoid applied to a linear function of the input features:

P(Y = 1 \mid x, w) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i x_i)\big)}

Features can be discrete or continuous!

Understanding the sigmoid
[Plots of the sigmoid of a linear function for (w_0, w_1) = (-2, -1), (0, -1), and (0, -0.5), over the range -6 to 6.]

Logistic Regression is a Linear Classifier
Very convenient! P(Y = 1 \mid x, w) \ge 1/2 iff w_0 + \sum_i w_i x_i \ge 0, which implies a linear classification rule!
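A minimal sketch of the model and its linear decision rule, using the P(Y = 1 | x, w) parameterization written above (the function names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) link function, mapping reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w0, w):
    """P(Y = 1 | x, w) under the logistic regression model above."""
    return sigmoid(w0 + np.dot(w, x))

def classify(x, w0, w):
    """Linear classification rule: predict 1 iff w0 + w.x >= 0."""
    return 1 if (w0 + np.dot(w, x)) >= 0 else 0
```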

Loss Function: Conditional Likelihood
Have a bunch of iid data of the form \{(x_j, y_j)\}_{j=1}^N. The discriminative (logistic regression) loss function is the conditional data likelihood \prod_{j=1}^N P(y_j \mid x_j, w).

Optimizing a Concave Function: Gradient Ascent
The conditional likelihood for logistic regression is concave, so find the optimum with gradient ascent.
Gradient: \nabla_w \ell(w). Update rule: w^{(t+1)} = w^{(t)} + \eta \, \nabla_w \ell(w^{(t)}), with step size \eta > 0.
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent can be much better.

Expressing the Conditional Log Likelihood

\ell(w) = \sum_j y_j \ln P(Y = 1 \mid x_j, w) + (1 - y_j) \ln P(Y = 0 \mid x_j, w)

Maximizing the Conditional Log Likelihood
Good news: \ell(w) is a concave function of w, so there are no local-optima problems.
Bad news: there is no closed-form solution that maximizes \ell(w).
Good news: concave functions are easy to optimize.
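A short sketch of evaluating this conditional log likelihood; the small epsilon guarding against log(0) is an implementation detail added here, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(w0, w, X, y):
    """ell(w) = sum_j [ y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w) ].

    X : (N, k) feature matrix, y : (N,) labels in {0, 1}.
    """
    p = sigmoid(w0 + X @ w)        # P(Y = 1 | x_j, w) for every example
    eps = 1e-12                    # numerical guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```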

Maximize the Conditional Log Likelihood: Gradient Ascent for LR
Gradient ascent algorithm: iterate until the change is less than \epsilon,

w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y_j - P(Y = 1 \mid x_j, w^{(t)}) \big]

and, for i = 1, \dots, k,

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y_j - P(Y = 1 \mid x_j, w^{(t)}) \big]

then repeat.

Regularization in Linear Regression
Overfitting usually leads to very large parameter choices: a sensible fit has coefficients of a few units, while an overfit polynomial can have coefficients in the millions (e.g., 8,585,638.4 on the X^2 term).
Regularized least-squares (a.k.a. ridge regression), for \lambda > 0:

\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k w_i^2
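A hedged sketch of this batch gradient ascent loop; the step size eta and the tolerance are illustrative defaults that would need tuning for real data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_lr(X, y, eta=0.1, tol=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood of logistic regression.

    X : (N, k) features, y : (N,) labels in {0, 1}.
    Returns the intercept w0 and the weight vector w.
    """
    N, k = X.shape
    w0, w = 0.0, np.zeros(k)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)              # P(Y = 1 | x_j, w) for all examples
        err = y - p                          # y_j - P(Y = 1 | x_j, w)
        grad_w0 = np.sum(err)
        grad_w = X.T @ err                   # sum_j x_i^j [y_j - P(Y=1|x_j,w)]
        if np.sqrt(grad_w0**2 + np.sum(grad_w**2)) < tol:
            break                            # further change would be tiny: stop
        w0 += eta * grad_w0
        w += eta * grad_w
    return w0, w
```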

Large Parameters ↔ Overfitting; Linear Separability
If the data is linearly separable, the unregularized weights go to infinity. In general, this leads to overfitting; penalizing high weights can prevent overfitting.

Regularized Conditional Log Likelihood
Add a regularization penalty, e.g., L2:

\ell(w) = \ln \prod_{j=1}^N P(y_j \mid x_j, w) - \frac{\lambda}{2} \|w\|_2^2

Practical note about w_0: the intercept is typically left unpenalized.
Gradient of the regularized likelihood: the usual gradient minus \lambda w.

Standard v. Regularized Updates
Maximum conditional likelihood estimate:

w^* = \arg\max_w \ln \prod_{j=1}^N P(y_j \mid x_j, w), \qquad
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y_j - P(Y = 1 \mid x_j, w^{(t)}) \big]

Regularized maximum conditional likelihood estimate:

w^* = \arg\max_w \ln \prod_{j=1}^N P(y_j \mid x_j, w) - \frac{\lambda}{2} \sum_{i=1}^k w_i^2, \qquad
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \sum_j x_i^j \big[ y_j - P(Y = 1 \mid x_j, w^{(t)}) \big] \right\}

Please Stop!! Stopping Criterion

\ell(w) = \ln \prod_j P(y_j \mid x_j, w) - \frac{\lambda}{2} \|w\|_2^2

When do we stop doing gradient ascent? Because \ell(w) is strongly concave (i.e., because of a technical condition),

\ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2

Thus, stop when \|\nabla \ell(w)\|_2^2 is small enough that this bound is below the accuracy you want.
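A sketch combining the regularized update with the stopping rule above; purely for simplicity the intercept is folded into the weight vector and penalized along with everything else, so that the strong-concavity bound applies to the whole parameter vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient_ascent_lr(X, y, lam=1.0, eta=0.01, eps=1e-4, max_iters=100000):
    """Regularized batch gradient ascent with the strong-concavity stopping rule.

    The objective is lam-strongly concave, so stopping when
    ||grad||^2 / (2*lam) <= eps guarantees ell(w*) - ell(w) <= eps.
    """
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend an intercept feature
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iters):
        err = y - sigmoid(Xa @ w)                   # y_j - P(Y=1|x_j, w)
        grad = Xa.T @ err - lam * w                 # gradient of the regularized objective
        if np.sum(grad**2) / (2.0 * lam) <= eps:
            break                                   # provable suboptimality gap <= eps
        w += eta * grad
    return w
```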

Digression: Logistic Regression for More Than 2 Classes
Logistic regression in the more general case (C classes), where Y \in \{0, 1, \dots, C-1\}.
For c > 0:

P(Y = c \mid x, w) = \frac{\exp\!\big(w_{c0} + \sum_{i=1}^k w_{ci} x_i\big)}{1 + \sum_{c'=1}^{C-1} \exp\!\big(w_{c'0} + \sum_{i=1}^k w_{c'i} x_i\big)}

For c = 0 (the normalization class, so no weights for this class):

P(Y = 0 \mid x, w) = \frac{1}{1 + \sum_{c'=1}^{C-1} \exp\!\big(w_{c'0} + \sum_{i=1}^k w_{c'i} x_i\big)}

The learning procedure is basically the same as what we derived!

The Cost, The Cost!!! Think About the Cost
What is the cost of a gradient update step for LR? Each batch update sums over all N training examples, so every step touches the entire dataset.

Stochastic Gradient Descent
Machine Learning CSE546, Sham Kakade, University of Washington, October 3, 2016
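A small sketch of these class probabilities, with class 0 as the unweighted reference class; the (C-1)-by-(k+1) weight-matrix layout is an assumption made here for concreteness:

```python
import numpy as np

def multiclass_probs(x, W):
    """Class probabilities for C-class logistic regression, class 0 as reference.

    W : (C-1, k+1) weights for classes 1..C-1; W[c-1, 0] is the intercept w_{c0}.
    x : (k,) feature vector.
    Returns an array of length C with P(Y = c | x, w) for c = 0, ..., C-1.
    """
    scores = W[:, 0] + W[:, 1:] @ x          # w_{c0} + sum_i w_{ci} x_i, for c = 1..C-1
    expd = np.exp(scores)
    denom = 1.0 + np.sum(expd)               # the 1 accounts for the reference class 0
    return np.concatenate(([1.0 / denom], expd / denom))
```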

Learning Problems as Expectations
Given a dataset sampled iid from some distribution p(x) on features, and a loss function (e.g., hinge loss, logistic loss, ...), we often minimize the loss on the training data:

\ell_{\mathcal{D}}(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x_j)

However, we should really minimize the expected loss on all data:

\ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx

So we are approximating the integral by the average on the training data.

Gradient Ascent in Terms of Expectations
True objective function: \ell(w) = E_x[\ell(w, x)]. Taking the gradient gives the true gradient ascent rule: w^{(t+1)} = w^{(t)} + \eta\, E_x[\nabla \ell(w^{(t)}, x)]. How do we estimate the expected gradient?

SGD: Stochastic Gradient Ascent (or Descent)
True gradient: \nabla \ell(w) = E_x[\nabla \ell(w, x)]. Sample-based approximation: average the per-example gradients.
What if we estimate the gradient with just one sample? It is an unbiased estimate of the gradient, but very noisy! This is called stochastic gradient ascent (or descent), among many other names, and it is VERY useful in practice!

Stochastic Gradient Ascent for Logistic Regression
Logistic loss as a stochastic function:

E_x[\ell(w, x)] = E_x\!\left[ \ln P(y \mid x, w) - \frac{\lambda}{2} \|w\|_2^2 \right]

Batch gradient ascent updates:

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \big[ y^{(j)} - P(Y = 1 \mid x^{(j)}, w^{(t)}) \big] \right\}

Stochastic gradient ascent updates (online setting):

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[ y^{(t)} - P(Y = 1 \mid x^{(t)}, w^{(t)}) \big] \right\}
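A sketch of the stochastic update, processing one example at a time with a decaying step size eta_t = eta0 / sqrt(t) (a common choice; the slide only says the step size changes with iterations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lam=0.01, eta0=0.1, num_passes=5):
    """Stochastic gradient ascent for L2-regularized logistic regression.

    X : (N, k) features, y : (N,) labels in {0, 1}.
    """
    N, k = X.shape
    w0, w = 0.0, np.zeros(k)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(num_passes):
        for j in rng.permutation(N):              # visit examples in random order
            t += 1
            eta_t = eta0 / np.sqrt(t)             # decaying step size
            err = y[j] - sigmoid(w0 + X[j] @ w)   # y^(t) - P(Y=1|x^(t), w^(t))
            w0 += eta_t * err                     # intercept: no regularization term
            w += eta_t * (-lam * w + X[j] * err)
    return w0, w
```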

Stochastic Gradient Ascent: General Case
Given a stochastic function of the parameters, we want to find its maximum. Start from w^{(0)} and repeat until convergence: get a sample data point x_t and update the parameters using the gradient at that single sample.
This works in the online learning setting! The complexity of each gradient step is constant in the number of examples! In general, the step size changes with the iterations.

What you should know
Classification: predict discrete classes rather than real values.
Logistic regression model: a linear model; the logistic function maps real values to [0,1]; optimize the conditional likelihood; gradient computation; overfitting; regularization; regularized optimization.
The cost of a full gradient step is high, so use stochastic gradient descent.

Stopping Criterion and Convergence Rates for Gradient Descent/Ascent

\ell(w) = \ln \prod_j P(y_j \mid x_j, w) - \frac{\lambda}{2} \|w\|_2^2

Regularized logistic regression is strongly concave: the negative second derivative is bounded away from zero.
Number of iterations to get to accuracy \ell(w^*) - \ell(w) \le \epsilon:
If the function is Lipschitz: O(1/\epsilon^2).
If the gradient of the function is Lipschitz: O(1/\epsilon).
If the function is strongly convex (concave): O(\ln(1/\epsilon)).
Strong concavity (convexity) is super helpful! For example, for strongly concave \ell(w):

\ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2
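A generic sketch of this general case, parameterized by a user-supplied single-sample gradient estimator grad_fn (a hypothetical helper, not named on the slides):

```python
import numpy as np

def stochastic_gradient_ascent(grad_fn, samples, w_init, eta0=0.1, num_passes=1):
    """General-case stochastic gradient ascent sketch.

    grad_fn(w, x) should return an unbiased estimate of the objective's gradient
    at w based on the single sample x; the step size eta_t = eta0 / sqrt(t)
    decays with the iteration count, as the slide notes it should change.
    """
    w = np.array(w_init, dtype=float)
    t = 0
    for _ in range(num_passes):
        for x in samples:                              # online setting: stream samples
            t += 1
            w += (eta0 / np.sqrt(t)) * grad_fn(w, x)   # constant cost per example
    return w
```

For logistic regression, grad_fn(w, (x, y)) would return the single-example regularized gradient used in the previous sketch.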