Gaussian and Linear Discriminant Analysis; Multiclass Classification


Gaussian and Linear Discriminant Analysis; Multiclass Classification
Professor Ameet Talwalkar (slide credit: Professor Fei Sha)
CS260 Machine Learning Algorithms, October 13, 2015

Outline
1. Administration
2. Review of last lecture
3. Generative versus discriminative
4. Multiclass classification

Announcements: Homework 2 is due on Thursday.

Outline
1. Administration
2. Review of last lecture: Logistic regression
3. Generative versus discriminative
4. Multiclass classification

Logistic classification
Setup for two classes:
- Input: $x \in \mathbb{R}^D$
- Output: $y \in \{0, 1\}$
- Training data: $D = \{(x_n, y_n),\ n = 1, 2, \ldots, N\}$
- Model of the conditional distribution: $p(y = 1 \mid x; b, w) = \sigma[g(x)]$, where $g(x) = b + \sum_d w_d x_d = b + w^T x$

Why the sigmoid function? What does it look like?
$$\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad a = b + w^T x$$
[Figure: plot of $\sigma(a)$ for $a \in [-6, 6]$, rising from 0 to 1.]
Properties:
- Bounded between 0 and 1, thus interpretable as a probability
- Monotonically increasing, thus usable to derive classification rules:
  - $\sigma(a) > 0.5$: positive (classify as 1)
  - $\sigma(a) < 0.5$: negative (classify as 0)
  - $\sigma(a) = 0.5$: undecidable
- Nice computational properties: the derivative has a simple form
Linear or nonlinear classifier?

Linear or nonlinear?
$\sigma(a)$ is nonlinear; however, the decision boundary is determined by
$$\sigma(a) = 0.5 \;\Leftrightarrow\; a = 0 \;\Leftrightarrow\; g(x) = b + w^T x = 0,$$
which is a linear function in $x$. We often call $b$ the offset term.
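
To make the decision rule concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the lecture): it evaluates $\sigma(b + w^T x)$ and classifies by checking which side of the linear boundary $b + w^T x = 0$ the point falls on.

    import numpy as np

    def sigmoid(a):
        # logistic function, bounded in (0, 1)
        return 1.0 / (1.0 + np.exp(-a))

    def predict(x, w, b):
        # probability of class 1 and the induced hard label
        p = sigmoid(b + w @ x)
        return p, int(p > 0.5)   # threshold at 0.5 <=> b + w^T x > 0

    # tiny usage example with made-up parameters
    w = np.array([1.0, -2.0])
    b = 0.5
    print(predict(np.array([0.3, 0.1]), w, b))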

Likelihood function
Probability of a single training sample $(x_n, y_n)$:
$$p(y_n \mid x_n; b, w) = \begin{cases} \sigma(b + w^T x_n) & \text{if } y_n = 1 \\ 1 - \sigma(b + w^T x_n) & \text{otherwise} \end{cases}$$
Compact expression, exploiting the fact that $y_n$ is either 1 or 0:
$$p(y_n \mid x_n; b, w) = \sigma(b + w^T x_n)^{y_n} \left[1 - \sigma(b + w^T x_n)\right]^{1 - y_n}$$

Maximum likelihood estimation
Cross-entropy error (negative log-likelihood):
$$E(b, w) = -\sum_n \left\{ y_n \log \sigma(b + w^T x_n) + (1 - y_n) \log\left[1 - \sigma(b + w^T x_n)\right] \right\}$$
Numerical optimization:
- Gradient descent: simple, scalable to large-scale problems
- Newton's method: fast but not scalable
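
As a sanity check on the formula, a short NumPy sketch (my own helper names) that evaluates the cross-entropy error over a dataset; the small eps is just a standard guard against log(0), not part of the lecture:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cross_entropy(X, y, w, b):
        # X: (N, D) features, y: (N,) labels in {0, 1}
        p = sigmoid(X @ w + b)                      # predicted P(y=1 | x_n)
        eps = 1e-12                                 # numerical guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))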

Numerical optimization: gradient descent
- Choose a proper step size $\eta > 0$
- Iteratively update the parameters following the negative gradient to minimize the error function:
$$w^{(t+1)} \leftarrow w^{(t)} - \eta \sum_n \left\{ \sigma(w^T x_n) - y_n \right\} x_n$$
Remarks:
- The gradient is the direction of steepest ascent, so we step in the opposite (negative gradient) direction.
- The step size needs to be chosen carefully to ensure convergence.
- The step size can be adaptive (i.e., varying from iteration to iteration).
- A variant called stochastic gradient descent will be covered later this quarter.
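
A minimal batch gradient descent loop for logistic regression, as a sketch: here the bias is folded into $w$ by appending a constant feature, and the step size and iteration count are arbitrary placeholder choices, not values from the lecture.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic_gd(X, y, eta=0.1, num_iters=1000):
        # X: (N, D) features, y: (N,) labels in {0, 1}
        N, D = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])        # prepend 1s so w[0] plays the role of b
        w = np.zeros(D + 1)
        for _ in range(num_iters):
            grad = Xb.T @ (sigmoid(Xb @ w) - y)     # sum_n (sigma(w^T x_n) - y_n) x_n
            w = w - eta * grad
        return w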

Intuition for Newton's method
Approximate the true function with an easy-to-solve optimization problem.
[Figure: $f(x)$ and its quadratic approximation $f_{\text{quad}}(x)$ around $x_k$; the minimizer of the quadratic is $x_k + d_k$.]
In particular, we can approximate the cross-entropy error function around $w^{(t)}$ by a quadratic function (its second-order Taylor expansion), and then minimize this quadratic function.

Update rules
Gradient descent:
$$w^{(t+1)} \leftarrow w^{(t)} - \eta \sum_n \left\{ \sigma(w^T x_n) - y_n \right\} x_n$$
Newton's method:
$$w^{(t+1)} \leftarrow w^{(t)} - \left(H^{(t)}\right)^{-1} \nabla E(w^{(t)})$$
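
A sketch of the Newton update for this model, assuming the standard gradient and Hessian of the logistic-regression cross-entropy, i.e. $\nabla E = \sum_n (\sigma_n - y_n) x_n$ and $H = \sum_n \sigma_n (1 - \sigma_n) x_n x_n^T$; function names are mine.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic_newton(X, y, num_iters=10):
        # X: (N, D) features (assume a constant column is included), y: (N,) in {0, 1}
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(num_iters):
            p = sigmoid(X @ w)
            grad = X.T @ (p - y)                      # gradient of the cross-entropy error
            H = X.T @ (X * (p * (1 - p))[:, None])    # Hessian: sum_n p_n(1-p_n) x_n x_n^T
            w = w - np.linalg.solve(H, grad)          # Newton step: w - H^{-1} grad
        return w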

Contrast gradient descent and Newton's method
Similar: both are iterative procedures.
Different:
- Newton's method requires second-order derivatives (less scalable, but faster convergence).
- Newton's method does not require choosing the step size $\eta$.

Outline
1. Administration
2. Review of last lecture
3. Generative versus discriminative
   - Contrast Naive Bayes and logistic regression
   - Gaussian and Linear Discriminant Analysis
4. Multiclass classification

Naive Bayes and logistic regression: two different modelling paradigms
Consider the spam classification problem.
First strategy: use the training set to find a decision boundary in the feature space that separates spam and non-spam emails. Given a test point, predict its label based on which side of the boundary it is on.
Second strategy: look at spam emails and build a model of what they look like. Similarly, build a model of what non-spam emails look like. To classify a new email, match it against both the spam and non-spam models to see which is the better fit.
The first strategy is discriminative (e.g., logistic regression); the second strategy is generative (e.g., naive Bayes).

Generative vs discriminative
Discriminative: requires only specifying a model for the conditional distribution $p(y \mid x)$, and thus maximizes the conditional likelihood $\sum_n \log p(y_n \mid x_n)$. Models that try to learn mappings directly from the feature space to the labels are also discriminative, e.g., the perceptron and SVMs (covered later).
Generative: aims to model the joint probability $p(x, y)$ and thus maximizes the joint likelihood $\sum_n \log p(x_n, y_n)$. The generative models we'll cover do so by modeling $p(x \mid y)$ and $p(y)$.
Let's look at two more examples: Gaussian (or quadratic) discriminant analysis and linear discriminant analysis.

Determining sex (man or woman) based on measurements
[Figure: scatter plot of weight versus height; red = female, blue = male.]

Generative approach
Model the joint distribution of ($x$ = (height, weight), $y$ = sex). Our data:
Sex  Height  Weight
1    6'      175
2    5'2"    120
1    5'6"    140
1    6'2"    240
2    5.7     130
[Figure: the weight-versus-height scatter plot; red = female, blue = male.]
Intuition: we will model how heights vary (according to a Gaussian) in each sub-population (male and female).
Note: this is similar to Naive Bayes (in particular, Problem 1 of HW2).

Model of the joint distribution (1D)
$$p(x, y) = p(y)\, p(x \mid y) = \begin{cases} p_1 \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}} & \text{if } y = 1 \\[4pt] p_2 \frac{1}{\sqrt{2\pi}\sigma_2} e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2}} & \text{if } y = 2 \end{cases}$$
Here $p_1 + p_2 = 1$ are the prior probabilities, and $p(x \mid y)$ is a class-conditional distribution.
[Figure: the weight-versus-height scatter plot; red = female, blue = male.]

Parameter estimation
Log-likelihood of the training data $D = \{(x_n, y_n)\}_{n=1}^N$ with $y_n \in \{1, 2\}$:
$$\log P(D) = \sum_n \log p(x_n, y_n) = \sum_{n: y_n = 1} \log\left( p_1 \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x_n - \mu_1)^2}{2\sigma_1^2}} \right) + \sum_{n: y_n = 2} \log\left( p_2 \frac{1}{\sqrt{2\pi}\sigma_2} e^{-\frac{(x_n - \mu_2)^2}{2\sigma_2^2}} \right)$$
Maximum log-likelihood:
$$(p_1, p_2, \mu_1, \mu_2, \sigma_1, \sigma_2) = \arg\max \log P(D)$$
In HW2 Problem 1 we look at a variant of Naive Bayes where $\sigma_1 = \sigma_2$.
Maximum likelihood for $D > 1$ dimensions:
$$(p_1, p_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2) = \arg\max \log P(D)$$
For Naive Bayes we assume each $\Sigma_i$ is diagonal.

Decision boundary
As before, the Bayes optimal decision under the assumed joint distribution depends on whether
$$p(y = 1 \mid x) \ge p(y = 2 \mid x),$$
which is equivalent to
$$p(x \mid y = 1)\, p(y = 1) \ge p(x \mid y = 2)\, p(y = 2).$$
Namely,
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log \sqrt{2\pi}\sigma_1 + \log p_1 \;\ge\; -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log \sqrt{2\pi}\sigma_2 + \log p_2,$$
i.e.
$$a x^2 + b x + c \ge 0.$$
The decision boundary is not linear!
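
Equivalently, one can just compare the two log-joint scores directly. A small sketch under the 1D two-Gaussian model above; the parameter values here are placeholders I made up, not estimates from the lecture's data.

    import numpy as np

    def log_joint(x, p, mu, sigma):
        # log [ p * N(x; mu, sigma^2) ]
        return np.log(p) - 0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

    def classify(x, params):
        # params: priors, means, and standard deviations for classes 1 and 2
        s1 = log_joint(x, params["p1"], params["mu1"], params["sigma1"])
        s2 = log_joint(x, params["p2"], params["mu2"], params["sigma2"])
        return 1 if s1 >= s2 else 2

    params = {"p1": 0.5, "p2": 0.5, "mu1": 70.0, "sigma1": 3.0, "mu2": 64.0, "sigma2": 2.5}
    print(classify(66.0, params))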

Example of nonlinear decision boundary
[Figure: two classes of 2D points separated by a parabolic boundary.]
Note: the boundary is characterized by a quadratic function, giving rise to the shape of a parabolic curve.

A special case: what if we assume the two Gaussians have the same variance?
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log \sqrt{2\pi}\sigma_1 + \log p_1 \;\ge\; -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log \sqrt{2\pi}\sigma_2 + \log p_2, \qquad \text{with } \sigma_1 = \sigma_2$$
We get a linear decision boundary: $b x + c \ge 0$.
Note: equal variances across two different categories can be a very strong assumption.
[Figure: the weight-versus-height scatter plot; red = female, blue = male.]
For example, from the plot, it does seem that the male population has slightly bigger variance (i.e., a bigger ellipse) than the female population, so the assumption might not be applicable.

Mini-summary: Gaussian discriminant analysis
- A generative approach, assuming the data are modeled by $p(x, y) = p(y)\, p(x \mid y)$, where $p(x \mid y)$ is a Gaussian distribution.
- Parameters (of those Gaussian distributions) are estimated by maximizing the likelihood.
- Computationally, estimating those parameters is very easy: it amounts to computing sample mean vectors and covariance matrices.
Decision boundary:
- In general, a nonlinear function of $x$; in this case we call the approach quadratic discriminant analysis.
- In the special case where we assume equal variance of the Gaussian distributions, we get a linear decision boundary; we call the approach linear discriminant analysis.

So what is the discriminative counterpart?
Intuition: the decision boundary in Gaussian discriminant analysis is $a x^2 + b x + c = 0$. Let us model the conditional distribution analogously:
$$p(y \mid x) = \sigma[a x^2 + b x + c] = \frac{1}{1 + e^{-(a x^2 + b x + c)}}$$
Or, even simpler, going after the decision boundary of linear discriminant analysis:
$$p(y \mid x) = \sigma[b x + c]$$
Both look very similar to logistic regression, i.e., we focus on writing down the conditional probability, not the joint probability.

Does this change how we estimate the parameters?
First change: a smaller number of parameters to estimate. Our models are parameterized only by $a$, $b$, and $c$. There are no prior probabilities ($p_1$, $p_2$) or Gaussian distribution parameters ($\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$).
Second change: we need to maximize the conditional likelihood $p(y \mid x)$:
$$(a^*, b^*, c^*) = \arg\min \; -\sum_n \left\{ y_n \log \sigma(a x_n^2 + b x_n + c) + (1 - y_n) \log\left[1 - \sigma(a x_n^2 + b x_n + c)\right] \right\}$$
Computationally harder!

How easy is this for our Gaussian discriminant analysis?
Example:
$$p_1 = \frac{\#\text{ of training samples in class 1}}{\#\text{ of training samples}}, \qquad \mu_1 = \frac{\sum_{n: y_n = 1} x_n}{\#\text{ of training samples in class 1}}, \qquad \sigma_1^2 = \frac{\sum_{n: y_n = 1} (x_n - \mu_1)^2}{\#\text{ of training samples in class 1}}$$
Note: detailed derivations are in the textbooks. They generalize rather easily to multivariate distributions as well as multiple classes.
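
These closed-form estimates are just counts, sample means, and sample variances. A minimal sketch for the 1D two-class case (the function name and the toy data are mine):

    import numpy as np

    def fit_gda_1d(x, y):
        # x: (N,) measurements, y: (N,) labels in {1, 2}
        params = {}
        N = len(x)
        for c in (1, 2):
            xc = x[y == c]
            params[f"p{c}"] = len(xc) / N            # prior: fraction of samples in class c
            params[f"mu{c}"] = xc.mean()             # sample mean of class c
            params[f"sigma{c}"] = xc.std()           # MLE standard deviation (divides by len(xc))
        return params

    # toy data (made up)
    x = np.array([72.0, 62.0, 66.0, 74.0, 63.0])
    y = np.array([1, 2, 1, 1, 2])
    print(fit_gda_1d(x, y))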

Generative versus discriminative: which one to use?
There is no fixed rule.
- Selecting which type of method to use is dataset/task specific.
- It depends on how well your modeling assumption fits the data.
For instance, as we show in HW2, when data follows a specific variant of the Gaussian Naive Bayes assumption, $p(y \mid x)$ necessarily follows a logistic function. However, the converse is not true.
- Gaussian Naive Bayes makes a stronger assumption than logistic regression.
- When data follows this assumption, Gaussian Naive Bayes will likely yield a model that better fits the data.
- But logistic regression is more robust and less sensitive to incorrect modelling assumptions.

Outline
1. Administration
2. Review of last lecture
3. Generative versus discriminative
4. Multiclass classification
   - Use binary classifiers as building blocks
   - Multinomial logistic regression

Setup
Suppose we need to predict multiple classes/outcomes $C_1, C_2, \ldots, C_K$:
- Weather prediction: sunny, cloudy, raining, etc.
- Optical character recognition: 10 digits + 26 characters (lower and upper case) + special characters, etc.
Methods studied so far:
- Nearest neighbor classifier
- Naive Bayes
- Gaussian discriminant analysis
- Logistic regression

Logistic regression for predicting multiple classes? Easy.
The approach of one versus the rest. For each class $C_k$, change the problem into binary classification:
1. Relabel training data with label $C_k$ as positive (or 1).
2. Relabel all the remaining data as negative (or 0).
This step is often called 1-of-K encoding: only one label is nonzero and everything else is zero.
Example: for class $C_2$, the data go through the following change:
$$(x_1, C_1) \rightarrow (x_1, 0),\quad (x_2, C_3) \rightarrow (x_2, 0),\quad \ldots,\quad (x_n, C_2) \rightarrow (x_n, 1),\quad \ldots$$
Train $K$ binary classifiers using logistic regression to differentiate the two classes.
When predicting on $x$, combine the outputs of all binary classifiers (see the sketch below):
1. What if all the classifiers say negative?
2. What if multiple classifiers say positive?
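
A sketch of one versus the rest, assuming a binary trainer such as the fit_logistic_gd function above is passed in. Ties and the all-negative case are resolved here by picking the class whose classifier gives the largest score, which is one common choice rather than an answer prescribed by the lecture.

    def train_one_vs_rest(X, y, classes, fit_binary):
        # one classifier per class: class c vs. everything else
        return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

    def predict_one_vs_rest(x, models, score):
        # score(x, model) returns the classifier's confidence for the positive class;
        # pick the class whose binary classifier is most confident
        return max(models, key=lambda c: score(x, models[c]))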

Yet another easy approach
The approach of one versus one. For each pair of classes $C_k$ and $C_{k'}$, change the problem into binary classification:
1. Relabel training data with label $C_k$ as positive (or 1).
2. Relabel training data with label $C_{k'}$ as negative (or 0).
3. Disregard all other data.
Example: for classes $C_1$ and $C_2$,
$$(x_1, C_1), (x_2, C_3), (x_3, C_2), \ldots \;\rightarrow\; (x_1, 1), (x_3, 0), \ldots$$
Train $K(K-1)/2$ binary classifiers using logistic regression to differentiate the two classes.
When predicting on $x$, combine the outputs of all binary classifiers: there are $K(K-1)/2$ votes!
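
A matching sketch of one versus one with majority voting over the $K(K-1)/2$ pairwise classifiers, again assuming a binary trainer and a hard 0/1 predictor are supplied; the names are mine.

    from itertools import combinations
    from collections import Counter

    def train_one_vs_one(X, y, classes, fit_binary):
        models = {}
        for ck, cj in combinations(classes, 2):
            mask = (y == ck) | (y == cj)              # keep only data from the two classes
            models[(ck, cj)] = fit_binary(X[mask], (y[mask] == ck).astype(float))
        return models

    def predict_one_vs_one(x, models, predict_binary):
        votes = Counter()
        for (ck, cj), m in models.items():
            votes[ck if predict_binary(x, m) == 1 else cj] += 1
        return votes.most_common(1)[0][0]             # class with the most votes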

Contrast these two approaches
Pros and cons of each approach:
- One versus the rest: only needs to train $K$ classifiers. This makes a big difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes?
- One versus one: each classifier only needs to train on a smaller subset of the data (only the data labeled with those two classes is involved). This makes a big difference if you have a lot of data to go through.
What is bad about both of them? Combining the classifiers' outputs seems to be a bit tricky. Any other good methods?

Multinomial logistic regression
Intuition, from the decision rule of our naive Bayes classifier:
$$y = \arg\max_c\, p(y = c \mid x) = \arg\max_c\, \log p(x \mid y = c)\, p(y = c) = \arg\max_c\, \left[ \log \pi_c + \sum_k z_k \log \theta_{ck} \right] = \arg\max_c\, w_c^T x$$
Essentially, we are comparing
$$w_1^T x,\; w_2^T x,\; \ldots,\; w_C^T x,$$
one for each category.

First try
So, can we define the following conditional model?
$$p(y = c \mid x) = \sigma[w_c^T x]$$
This would not work, at least for the reason that
$$\sum_c p(y = c \mid x) = \sum_c \sigma[w_c^T x] \neq 1$$
in general, as each summand can be any number (independently) between 0 and 1. But we are close! We can learn the $K$ linear models jointly to ensure this property holds.

Definition of multinomial logistic regression
Model: for each class $C_k$, we have a parameter vector $w_k$ and model the posterior probability as
$$p(C_k \mid x) = \frac{e^{w_k^T x}}{\sum_{k'} e^{w_{k'}^T x}}$$
This is called the softmax function.
Decision boundary: assign $x$ the label that maximizes the posterior,
$$\arg\max_k P(C_k \mid x) \;\Leftrightarrow\; \arg\max_k w_k^T x$$

How does the softmax function behave?
Suppose we have $w_1^T x = 100$, $w_2^T x = 50$, $w_3^T x = 20$. We could have picked the winning class label 1 with certainty according to our classification rule. Softmax instead translates these scores into well-formed conditional probabilities:
$$p(y = 1 \mid x) = \frac{e^{100}}{e^{100} + e^{50} + e^{20}} < 1$$
The softmax function:
- preserves the relative ordering of the scores
- maps the scores to values between 0 and 1 that also sum to 1
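
A small sketch of the softmax computation. Note that evaluating $e^{100}$ directly overflows double precision, so the sketch subtracts the maximum score first; this leaves the result unchanged because the common factor cancels in the numerator and denominator (this numerical detail is my addition, not from the lecture).

    import numpy as np

    def softmax(scores):
        # scores: (K,) array of w_k^T x values
        shifted = scores - np.max(scores)     # subtract the max for numerical stability
        exp = np.exp(shifted)
        return exp / exp.sum()

    print(softmax(np.array([100.0, 50.0, 20.0])))
    # -> approximately [1.0, 1.9e-22, 1.8e-35]; ordering preserved, values sum to 1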

Sanity check
The multinomial model reduces to binary logistic regression when $K = 2$:
$$p(C_1 \mid x) = \frac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}} = \frac{1}{1 + e^{-(w_1 - w_2)^T x}} = \frac{1}{1 + e^{-w^T x}}, \qquad \text{where } w = w_1 - w_2.$$
Multinomial logistic regression thus generalizes (binary) logistic regression to deal with multiple classes.

Parameter estimation
Discriminative approach: maximize the conditional likelihood
$$\log P(D) = \sum_n \log P(y_n \mid x_n)$$
We will change $y_n$ to $\mathbf{y}_n = [y_{n1}\; y_{n2}\; \cdots\; y_{nK}]^T$, a $K$-dimensional vector using 1-of-K encoding:
$$y_{nk} = \begin{cases} 1 & \text{if } y_n = k \\ 0 & \text{otherwise} \end{cases}$$
Example: if $y_n = 2$, then $\mathbf{y}_n = [0\; 1\; 0\; \cdots\; 0]^T$.
Then
$$\sum_n \log P(y_n \mid x_n) = \sum_n \log \prod_{k=1}^K P(C_k \mid x_n)^{y_{nk}} = \sum_n \sum_k y_{nk} \log P(C_k \mid x_n)$$

Cross-entropy error function
Definition: the negative log-likelihood
$$E(w_1, w_2, \ldots, w_K) = -\sum_n \sum_k y_{nk} \log P(C_k \mid x_n)$$
Properties:
- Convex, therefore it has a unique global optimum
- Optimization requires numerical procedures, analogous to those used for binary logistic regression
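
To tie the pieces together, a sketch that evaluates this cross-entropy error and takes gradient steps for the softmax model, using the standard gradient $\nabla_{w_k} E = \sum_n \left(P(C_k \mid x_n) - y_{nk}\right) x_n$, which mirrors the binary case shown earlier; the code layout and names are mine.

    import numpy as np

    def softmax_rows(S):
        # row-wise softmax of a score matrix S with shape (N, K)
        S = S - S.max(axis=1, keepdims=True)
        E = np.exp(S)
        return E / E.sum(axis=1, keepdims=True)

    def fit_multinomial_gd(X, Y, eta=0.1, num_iters=1000):
        # X: (N, D) features, Y: (N, K) one-hot (1-of-K) labels
        N, D = X.shape
        K = Y.shape[1]
        W = np.zeros((D, K))                       # one weight vector w_k per column
        for _ in range(num_iters):
            P = softmax_rows(X @ W)                # P[n, k] = P(C_k | x_n)
            loss = -np.sum(Y * np.log(P + 1e-12))  # cross-entropy error E(w_1, ..., w_K); can be logged
            W = W - eta * (X.T @ (P - Y))          # gradient step for all classes at once
        return W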