Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber

Logistic Regression

Linear regression provides a nice representation and an efficient solution to a regression problem. Can we apply this representation to classification?

Using linear regression directly leads to too many output values that do not match any class well. Logistic regression tries to address this by applying a logistic function in order to achieve outputs that are closer to class values:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
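As a quick illustration, here is a minimal NumPy sketch of this hypothesis; the helper names (sigmoid, predict_proba) and the example values are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x): probability that x belongs to class 1."""
    return sigmoid(np.dot(theta, x))

# Example: a 3-dimensional input with a constant bias feature x_0 = 1.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.3, -0.2])
print(predict_proba(theta, x))  # a value in (0, 1), read as P(y = 1 | x)
```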

Logistic Regression

For two classes we can map the class labels to 1 and 0, respectively. Using this we can interpret the output as the probability of belonging to a given class:

P_\theta(y = 1 \mid x) = h_\theta(x)
P_\theta(y = 0 \mid x) = 1 - h_\theta(x)

Written compactly, this gives

p_\theta(y \mid x) = h_\theta(x)^y \, (1 - h_\theta(x))^{1-y}

Logistic Regression

This gives the likelihood of the parameters as

L(\theta) = p_\theta(D) = \prod_{i=1}^n p_\theta(y^{(i)} \mid x^{(i)}) = \prod_{i=1}^n h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}

Converted to the log likelihood:

\log L(\theta) = \log \prod_{i=1}^n h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}
              = \sum_{i=1}^n \log \left( h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}} \right)

Logistic Regression

Solving the maximum likelihood optimization using stochastic gradient ascent (equivalently, stochastic gradient descent on the negative log likelihood), the per-example gradient is

\frac{\partial}{\partial \theta_j} \log L(\theta)
  = \frac{\partial}{\partial \theta_j} \log \left( h_\theta(x)^y \, (1 - h_\theta(x))^{1-y} \right)
  = \left( \frac{y}{h_\theta(x)} - \frac{1-y}{1 - h_\theta(x)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x)
  = \left( \frac{y}{h_\theta(x)} - \frac{1-y}{1 - h_\theta(x)} \right) g(\theta^T x) (1 - g(\theta^T x)) \frac{\partial}{\partial \theta_j} \theta^T x
  = \left( y (1 - g(\theta^T x)) - (1 - y) g(\theta^T x) \right) x_j
  = \left( y - g(\theta^T x) \right) x_j
  = \left( y - h_\theta(x) \right) x_j

Logistic Regression

This gives us a stochastic gradient learning method for logistic regression:

\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}

This allows us to use regression for classification problems.
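A minimal sketch of this update rule as a training loop (function names, the learning rate alpha, the number of passes, and the toy data with an appended bias column are all illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, epochs=100, seed=0):
    """Stochastic gradient ascent on the log likelihood:
    theta := theta + alpha * (y_i - h_theta(x_i)) * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # visit examples in random order
            h = sigmoid(X[i] @ theta)      # h_theta(x_i)
            theta += alpha * (y[i] - h) * X[i]
    return theta

# Toy usage: two 2-D Gaussian blobs, with a bias column prepended to X.
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))
X1 = rng.normal([3, 3], 1.0, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
y = np.array([0] * 50 + [1] * 50)
theta = fit_logistic_sgd(X, y)
print(theta, np.mean((sigmoid(X @ theta) > 0.5) == y))  # parameters, accuracy
```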

Softmax Regression

Logistic regression allows us to address a classification problem with two classes. How can we address classification with more than 2 classes?

A multiclass classifier needs to compute a probability for each of the classes. This corresponds to a multinomial distribution:

h_\theta(x)_j = g_j(\theta^T x) = \frac{e^{\theta_j^T x}}{\sum_k e^{\theta_k^T x}}

Softmax Regression

For each class we can interpret the output as the probability of belonging to that class:

P_\theta(y = k \mid x) = h_\theta(x)_k

Written compactly, this gives

p_\theta(y \mid x) = \prod_k h_\theta(x)_k^{1\{y = k\}}

This gives the likelihood of the parameters as

L(\theta) = \prod_i p_\theta(y^{(i)} \mid x^{(i)}) = \prod_i \prod_k h_\theta(x^{(i)})_k^{1\{y^{(i)} = k\}}

Softmax Regression

Converted to the log likelihood:

\log L(\theta) = \sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)})
              = \sum_{i=1}^n \log h_\theta(x^{(i)})_{y^{(i)}}
              = \sum_{i=1}^n \log \frac{e^{\theta_{y^{(i)}}^T x^{(i)}}}{\sum_k e^{\theta_k^T x^{(i)}}}

The derivative is slightly more complex. It can be maximized using gradient ascent.
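A minimal sketch of the softmax hypothesis and this log likelihood (the names softmax, predict_proba, and log_likelihood are illustrative; Theta is assumed to be a d x K matrix with one parameter column per class and integer labels y in {0, ..., K-1}):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over class scores theta_k^T x."""
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / np.sum(e, axis=-1, keepdims=True)

def predict_proba(Theta, x):
    """h_theta(x)_k = exp(theta_k^T x) / sum_j exp(theta_j^T x)."""
    return softmax(x @ Theta)

def log_likelihood(Theta, X, y):
    """sum_i log h_theta(x_i)_{y_i} for integer labels y_i."""
    P = softmax(X @ Theta)
    return np.sum(np.log(P[np.arange(len(y)), y]))

# Example: 3 classes, 2 features plus a bias feature; zero parameters give
# a uniform class distribution [1/3, 1/3, 1/3].
Theta = np.zeros((3, 3))
x = np.array([1.0, 0.5, -1.0])
print(predict_proba(Theta, x))
```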

Generative Approaches

Logistic and softmax regression are discriminative approaches to classification using a linear parameter representation. Can we build generative algorithms that provide similar classification?

To build a generative algorithm we need to be able to predict a probability density for the input for each class.

Gaussian Naïve Bayes made the restrictive assumption that all input dimensions are independent given the class. We can relax this and assume a full multivariate Gaussian, which gives us Gaussian Discriminant Analysis.

Gaussian Discriminant Analysis

GDA assumes that the probability density function for the data in a class follows a multivariate Gaussian:

p(x \mid y) = p_{\mu_y, \Sigma_y}(x \mid y) = N(x; \mu_y, \Sigma_y)
            = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \, e^{-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}

The prior is assumed to be a Bernoulli distribution:

p(y) = \phi^y (1 - \phi)^{1-y}

Using Bayes' law, a log likelihood function for the parameters can be derived:

\log L(\phi, \mu_0, \Sigma_0, \mu_1, \Sigma_1) = \sum_i \log \left( p_{\mu_{y^{(i)}}, \Sigma_{y^{(i)}}}(x^{(i)} \mid y^{(i)}) \, p(y^{(i)}) \right)
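For reference, a small sketch of the multivariate Gaussian density N(x; mu, Sigma) used as the class-conditional model (the helper name gaussian_density is illustrative):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """N(x; mu, Sigma) = (2 pi)^{-m/2} |Sigma|^{-1/2}
                         * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

# Example: the 2-D standard normal at the origin has density 1/(2*pi) ~ 0.159.
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```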

Gaussian Discriminant Analysis

Learning the parameters again takes the form of finding the maximum likelihood parameters given the data:

\phi = \frac{\#\{i : y^{(i)} = 1\}}{n}

\mu_y = \frac{\sum_{i : y^{(i)} = y} x^{(i)}}{\#\{i : y^{(i)} = y\}}

\Sigma_y = \frac{\sum_{i : y^{(i)} = y} (x^{(i)} - \mu_y)(x^{(i)} - \mu_y)^T}{\#\{i : y^{(i)} = y\}}
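A minimal sketch of these closed-form estimates for a two-class problem (the function name fit_gda and the toy data are illustrative assumptions):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for phi, mu_0, mu_1, Sigma_0, Sigma_1 from data
    X (n x m) and binary labels y in {0, 1}."""
    params = {"phi": np.mean(y == 1)}          # fraction of class-1 examples
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                   # class mean
        diff = Xc - mu
        params[f"mu_{c}"] = mu
        params[f"Sigma_{c}"] = diff.T @ diff / len(Xc)   # per-class covariance
    return params

# Toy usage with two Gaussian blobs:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 40 + [1] * 60)
p = fit_gda(X, y)
print(p["phi"], p["mu_0"], p["mu_1"])
```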

Gaussian Discriminant Analysis

The log posterior for a class can be written (dropping the class-independent term \log p(x)) as

\log p_{\phi, \mu_y, \Sigma_y}(y \mid x) = \log \left( p_{\mu_y, \Sigma_y}(x \mid y) \, p(y) \right) - \log p(x)
  = -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) - \log (2\pi)^{m/2} - \frac{1}{2} \log |\Sigma_y| + y \log \phi + (1 - y) \log (1 - \phi) - \log p(x)

This forms a decision boundary:

\log p_{\phi, \mu_0, \Sigma_0}(y = 0 \mid x) - \log p_{\phi, \mu_1, \Sigma_1}(y = 1 \mid x) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \log |\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \log |\Sigma_1| > T

where T = 2 \log \frac{1 - \phi}{\phi} collects the prior terms.

Linear Discriminant Analysis

If we make the homoscedastic assumption, i.e. if we assume that the (co)variances of the two classes are identical, this results in linear discriminant analysis (LDA):

\log L(\phi, \mu_0, \mu_1, \Sigma) = \sum_i \log \left( p_{\mu_{y^{(i)}}, \Sigma}(x^{(i)} \mid y^{(i)}) \, p(y^{(i)}) \right)

Parameter maximization for the shared (co)variance gives

\Sigma = \frac{1}{n} \sum_i (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T

This is more stable for small amounts of data.
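Under the shared-covariance assumption the only change is that a single pooled covariance is estimated over all examples; a small sketch (fit_lda is an illustrative name, mirroring the hypothetical fit_gda above):

```python
import numpy as np

def fit_lda(X, y):
    """MLE for LDA: per-class means and a single shared (pooled) covariance."""
    phi = np.mean(y == 1)
    mu = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    # Sigma = (1/n) * sum_i (x_i - mu_{y_i})(x_i - mu_{y_i})^T
    centered = X - np.where((y == 1)[:, None], mu[1], mu[0])
    Sigma = centered.T @ centered / len(X)
    return phi, mu[0], mu[1], Sigma
```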

Linear Discriminant Analysis

With equal (co)variances the discriminant surface becomes linear.

[Figure: two-class scatter plot separated by a linear decision boundary.]

Gaussian Discriminant Analysis

In the decision boundary, linearity arises as

\log p_{\phi, \mu_0, \Sigma}(y = 0 \mid x) - \log p_{\phi, \mu_1, \Sigma}(y = 1 \mid x) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \log |\Sigma| - (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - \log |\Sigma| > T
\Leftrightarrow (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) - (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) > T
\Leftrightarrow -2 \mu_0^T \Sigma^{-1} x + 2 \mu_1^T \Sigma^{-1} x > T - \mu_0^T \Sigma^{-1} \mu_0 + \mu_1^T \Sigma^{-1} \mu_1
\Leftrightarrow (\mu_1 - \mu_0)^T \Sigma^{-1} x > \frac{1}{2} \left( T - \mu_0^T \Sigma^{-1} \mu_0 + \mu_1^T \Sigma^{-1} \mu_1 \right)

The quadratic terms in x cancel, leaving a condition that is linear in x.
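A sketch of this linear decision rule, computing w = Sigma^{-1}(mu_1 - mu_0) and the threshold that collects the constant terms (variable names are illustrative; T is taken to be 2*log((1-phi)/phi), i.e. how the Bernoulli prior enters the boundary above):

```python
import numpy as np

def lda_decision_rule(phi, mu0, mu1, Sigma):
    """Return (w, b) such that the classifier predicts y = 1 iff w^T x > b."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                  # (mu_1 - mu_0)^T Sigma^{-1} x
    T = 2 * np.log((1 - phi) / phi)              # prior term
    b = 0.5 * (T - mu0 @ Sigma_inv @ mu0 + mu1 @ Sigma_inv @ mu1)
    return w, b

# Usage with the hypothetical fit_lda from the previous sketch:
# phi, mu0, mu1, Sigma = fit_lda(X, y)
# w, b = lda_decision_rule(phi, mu0, mu1, Sigma)
# y_hat = (X @ w > b).astype(int)
```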

Linear Discriminant Analysis

In the case of linear discriminant analysis the class posterior can be rewritten as

p_{\phi, \mu_0, \mu_1, \Sigma}(y = 1 \mid x) = \frac{1}{1 + e^{-\theta(\phi, \mu_0, \mu_1, \Sigma)^T x}}

(with a constant bias feature included in x), so the solution to linear discriminant analysis has the same form as the representation for logistic regression.

LDA is more restrictive (it only allows decision boundaries arising from Gaussian class-conditional densities with a shared (co)variance). If the distributions really are Gaussian with the same (co)variance, then LDA will find the solution more efficiently and usually more precisely.

Logistic regression is more robust (it covers more than just Gaussian class-conditional distributions and is less sensitive to modeling errors).

Quadratic Discriminant Analysis

If we do not make the homoscedastic assumption, the discrimination surface becomes quadratic, resulting in Quadratic Discriminant Analysis (QDA):

\log p_{\phi, \mu_0, \Sigma_0}(y = 0 \mid x) - \log p_{\phi, \mu_1, \Sigma_1}(y = 1 \mid x) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \log |\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \log |\Sigma_1| > T

QDA is more flexible but also more complex. It requires more data to train.
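A minimal sketch of the corresponding quadratic decision rule, scoring each class by its log posterior up to the shared log p(x) term (names are illustrative; fit_gda refers to the hypothetical helper sketched earlier):

```python
import numpy as np

def qda_log_posterior(x, mu, Sigma, log_prior):
    """log p(y | x) up to the class-independent log p(x) term."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|
    return -0.5 * quad - 0.5 * logdet + log_prior

def qda_predict(x, params):
    """Predict the class with the larger (unnormalized) log posterior."""
    phi = params["phi"]
    s0 = qda_log_posterior(x, params["mu_0"], params["Sigma_0"], np.log(1 - phi))
    s1 = qda_log_posterior(x, params["mu_1"], params["Sigma_1"], np.log(phi))
    return int(s1 > s0)

# Usage with the hypothetical fit_gda from the earlier sketch:
# params = fit_gda(X, y)
# y_hat = np.array([qda_predict(x, params) for x in X])
```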

Logistic Regression and Linear Discriminant Analysis

Logistic regression and linear discriminant analysis are frequently used for classification.

Logistic regression provides an effective discriminative classification approach, with the same form of learning rule as linear regression.

LDA provides generative classification that, if its assumptions are met, solves the same problem.

Softmax regression generalizes logistic regression to more than two classes, but is harder to train.

QDA provides a more general generative classifier, but requires more data.