Machine Learning
Regression-Based Classification & Gaussian Discriminant Analysis
Manfred Huber, 2015
Logistic Regression
Linear regression provides a convenient representation and an efficient solution to regression problems. Can we apply this representation to classification?
Using linear regression directly produces many output values that do not match any class label well.
Logistic regression addresses this by applying a logistic (sigmoid) function to the linear prediction in order to achieve outputs that are closer to the class values:
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
Logistic Regression
For two classes we can map the class labels to 1 and 0, respectively.
With this mapping we can interpret the output as the probability of belonging to a given class:
P_\theta(y = 1 \mid x) = h_\theta(x)
P_\theta(y = 0 \mid x) = 1 - h_\theta(x)
Simplified, this gives
p_\theta(y \mid x) = h_\theta(x)^y \left( 1 - h_\theta(x) \right)^{1-y}
Logistic Regression
This gives the likelihood of the parameters as
L(\theta) = \prod_{i=1}^{n} p_\theta(y^{(i)} \mid x^{(i)}) = \prod_{i=1}^{n} h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1-y^{(i)}}
Converted to the log likelihood:
\log L(\theta) = \log \prod_{i=1}^{n} h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1-y^{(i)}}
= \sum_{i=1}^{n} \log\left( h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1-y^{(i)}} \right)
Logistic Regression
Solving the maximum likelihood optimization with stochastic gradient steps uses the per-example gradient (for a single example (x, y)):
\frac{\partial}{\partial \theta_j} \log L(\theta) = \frac{\partial}{\partial \theta_j} \log\left( h_\theta(x)^y \left( 1 - h_\theta(x) \right)^{1-y} \right)
= \left( y \frac{1}{h_\theta(x)} - (1-y) \frac{1}{1 - h_\theta(x)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x)
= \left( y \frac{1}{h_\theta(x)} - (1-y) \frac{1}{1 - h_\theta(x)} \right) g(\theta^T x) \left( 1 - g(\theta^T x) \right) \frac{\partial}{\partial \theta_j} \theta^T x
= \left( y \left( 1 - g(\theta^T x) \right) - (1-y) g(\theta^T x) \right) x_j
= \left( y - g(\theta^T x) \right) x_j = \left( y - h_\theta(x) \right) x_j
Logistic Regression
This gives us a stochastic gradient learning rule (ascent on the log likelihood) for logistic regression:
\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}
This allows us to use a regression-style representation for classification problems.
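The update rule above can be turned into a short program. The following is a minimal NumPy sketch of logistic regression trained with per-example stochastic gradient updates; the function names, learning rate, and synthetic two-blob data are illustrative choices, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, alpha=0.1, epochs=100, seed=0):
    """Stochastic gradient updates: theta := theta + alpha * (y - h_theta(x)) * x."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias feature x_0 = 1
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):      # visit examples in random order
            h = sigmoid(theta @ X[i])
            theta += alpha * (y[i] - h) * X[i]
    return theta

# Tiny illustrative example: two Gaussian blobs labeled 0 and 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])
theta = logistic_regression_sgd(X, y)
pred = (sigmoid(np.hstack([np.ones((100, 1)), X]) @ theta) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```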
Softmax Regression
Logistic regression allows us to address classification problems with two classes. How can we address classification with more than two classes?
A multiclass classifier needs to compute a probability for each of the classes, corresponding to a multinomial distribution:
h_\theta(x)_j = g_j(\theta^T x) = \frac{e^{\theta_j^T x}}{\sum_k e^{\theta_k^T x}}
Softmax Regression
For each class we can interpret the corresponding output as the probability of belonging to that class:
P_\theta(y = k \mid x) = h_\theta(x)_k
Simplified, this gives
p_\theta(y \mid x) = \prod_k h_\theta(x)_k^{1\{y = k\}}
This gives the likelihood of the parameters as
L(\theta) = \prod_{i=1}^{n} p_\theta(y^{(i)} \mid x^{(i)}) = \prod_{i=1}^{n} \prod_k h_\theta(x^{(i)})_k^{1\{y^{(i)} = k\}}
Softmax Regression
Converted to the log likelihood:
\log L(\theta) = \sum_{i=1}^{n} \log p_\theta(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{n} \log h_\theta(x^{(i)})_{y^{(i)}}
= \sum_{i=1}^{n} \log \frac{e^{\theta_{y^{(i)}}^T x^{(i)}}}{\sum_k e^{\theta_k^T x^{(i)}}}
The derivative is slightly more complex than in the two-class case, but the log likelihood can still be maximized using gradient ascent.
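As a concrete illustration of maximizing this log likelihood by gradient ascent, here is a minimal NumPy sketch of softmax regression. The one-hot encoding, the batch (rather than stochastic) gradient, and the three-blob toy data are assumptions made for the example.

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with the usual max-shift for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_regression(X, y, num_classes, alpha=0.1, epochs=200):
    """Batch gradient ascent on the multinomial log likelihood."""
    n, d = X.shape
    X = np.hstack([np.ones((n, 1)), X])            # bias feature x_0 = 1
    Theta = np.zeros((num_classes, d + 1))         # one parameter vector per class
    Y = np.eye(num_classes)[y]                     # one-hot encoding of the labels
    for _ in range(epochs):
        P = softmax(X @ Theta.T)                   # P[i, k] = h_theta(x_i)_k
        grad = (Y - P).T @ X / n                   # gradient of the average log likelihood
        Theta += alpha * grad
    return Theta

# Illustrative 3-class example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1, (40, 2)) for m in (0, 3, 6)])
y = np.repeat(np.arange(3), 40)
Theta = softmax_regression(X, y, num_classes=3)
pred = softmax(np.hstack([np.ones((len(X), 1)), X]) @ Theta.T).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```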
Generative Approaches
Logistic and softmax regression are discriminative approaches to classification that use a linear parameter representation. Can we build generative algorithms that provide similar classification?
To build a generative algorithm we need to be able to predict a probability density over the input for each class.
Gaussian Naïve Bayes made the restrictive assumption that all input dimensions are independent. We can relax this and assume a full multivariate Gaussian, which gives us Gaussian Discriminant Analysis (GDA).
Gaussian Discriminant Analysis
GDA assumes that the probability density function for the data in a class follows a multivariate Gaussian:
p(x \mid y) = p_{\mu_y, \Sigma_y}(x \mid y) = N(x; \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} e^{-\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}
The prior is assumed to follow a Bernoulli distribution:
p(y) = \phi^y (1 - \phi)^{1-y}
Using Bayes' law, a log likelihood function for the parameters can be derived:
\log L(\phi, \mu_0, \Sigma_0, \mu_1, \Sigma_1) = \sum_i \log\left( p_{\mu_{y^{(i)}}, \Sigma_{y^{(i)}}}(x^{(i)} \mid y^{(i)}) \; p(y^{(i)}) \right)
Gaussian Discriminant Analysis
Learning the parameters again takes the form of finding the maximum likelihood parameters given the data:
\phi = \frac{\#\{(x^{(i)}, y^{(i)}) : y^{(i)} = 1\}}{\#\{(x^{(i)}, y^{(i)})\}}
\mu_y = \frac{\sum_{i : y^{(i)} = y} x^{(i)}}{\#\{(x^{(i)}, y^{(i)}) : y^{(i)} = y\}}
\Sigma_y = \frac{\sum_{i : y^{(i)} = y} (x^{(i)} - \mu_y)(x^{(i)} - \mu_y)^T}{\#\{(x^{(i)}, y^{(i)}) : y^{(i)} = y\}}
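The maximum likelihood estimates above translate directly into code. A minimal sketch, assuming a NumPy feature matrix X, a 0/1 label vector y, and the hypothetical function name fit_gda:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA with two classes (0 and 1).

    Returns the class prior phi, per-class means, and per-class covariances.
    """
    phi = (y == 1).mean()                          # Bernoulli prior P(y = 1)
    mu = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    Sigma = {}
    for c in (0, 1):
        diff = X[y == c] - mu[c]
        Sigma[c] = diff.T @ diff / len(diff)       # per-class covariance (QDA-style)
    return phi, mu, Sigma

# Illustrative usage with two Gaussian blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (40, 2))])
y = np.hstack([np.zeros(60, dtype=int), np.ones(40, dtype=int)])
phi, mu, Sigma = fit_gda(X, y)
print(phi, mu[1])
```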
Gaussian Discriminant Analysis
The log likelihood of a class can be written as
\log\left( p_{\phi,\mu_y,\Sigma_y}(y \mid x) \right) = \log\left( p_{\mu_y,\Sigma_y}(x \mid y) \, p(y) \right) - \log p(x)
= -\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) - \frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_y| + y\log\phi + (1-y)\log(1-\phi) - \log p(x)
where the terms -\frac{m}{2}\log(2\pi) and -\log p(x) are the same for both classes. This forms a decision boundary:
\log\left( p_{\phi,\mu_0,\Sigma_0}(y = 0 \mid x) \right) - \log\left( p_{\phi,\mu_1,\Sigma_1}(y = 1 \mid x) \right) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \log|\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \log|\Sigma_1| > T
where the threshold T = 2\log\frac{1-\phi}{\phi} collects the prior terms.
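A sketch of the resulting classification rule, comparing the two class log posteriors (the constant terms that are identical for both classes are dropped); it reuses the phi, mu, Sigma values from the hypothetical fit_gda sketch above:

```python
import numpy as np

def gda_log_posterior(x, phi, mu, Sigma):
    """Unnormalized class log posteriors log p(x|y) + log p(y) for classes 0 and 1."""
    scores = {}
    for c, prior in ((0, 1 - phi), (1, phi)):
        diff = x - mu[c]
        _, logdet = np.linalg.slogdet(Sigma[c])
        # Quadratic term, covariance penalty, and class prior; constants are omitted
        scores[c] = (-0.5 * diff @ np.linalg.solve(Sigma[c], diff)
                     - 0.5 * logdet + np.log(prior))
    return scores

def gda_predict(x, phi, mu, Sigma):
    scores = gda_log_posterior(x, phi, mu, Sigma)
    return int(scores[1] > scores[0])
```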
Linear Discriminant Analysis
If we make the homoscedastic assumption, i.e. if we assume that the (co)variances of the two classes are identical, this results in linear discriminant analysis (LDA):
\log L(\phi, \mu_0, \mu_1, \Sigma) = \sum_i \log\left( p_{\mu_{y^{(i)}}, \Sigma}(x^{(i)} \mid y^{(i)}) \; p(y^{(i)}) \right)
The maximum likelihood estimate for the shared (co)variance is
\Sigma = \frac{\sum_i (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T}{\#\{(x^{(i)}, y^{(i)})\}}
This is more stable for small amounts of data.
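A minimal sketch of the pooled (shared) covariance estimate, reusing the hypothetical mu dictionary from the earlier fit_gda sketch:

```python
import numpy as np

def pooled_covariance(X, y, mu):
    """Shared covariance: average outer product of class-mean-centered examples."""
    centered = X.astype(float).copy()
    for c in (0, 1):
        centered[y == c] -= mu[c]                  # subtract each example's class mean
    return centered.T @ centered / len(X)
```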
Linear Discriminant Analysis
With equal (co)variances the discriminant surface becomes linear.
[Figure: two classes with equal covariance separated by a linear discriminant surface]
Gaussian Discriminant Analysis
In the decision boundary, linearity arises as
\log\left( p_{\phi,\mu_0,\Sigma}(y = 0 \mid x) \right) - \log\left( p_{\phi,\mu_1,\Sigma}(y = 1 \mid x) \right) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \log|\Sigma| - (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - \log|\Sigma| > T
\Leftrightarrow (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) - (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) > T
\Leftrightarrow -2\mu_0^T \Sigma^{-1} x + 2\mu_1^T \Sigma^{-1} x > T - \mu_0^T \Sigma^{-1} \mu_0 + \mu_1^T \Sigma^{-1} \mu_1
\Leftrightarrow (\mu_1 - \mu_0)^T \Sigma^{-1} x > \frac{1}{2}\left( T - \mu_0^T \Sigma^{-1} \mu_0 + \mu_1^T \Sigma^{-1} \mu_1 \right)
Linear Discriminant Analysis
In the case of linear discriminant analysis the class likelihood can be rewritten as
p_{\phi,\mu_0,\mu_1,\Sigma}(y = 1 \mid x) = \frac{1}{1 + e^{-\theta(\phi,\mu_y,\Sigma)^T x}}
The solution to linear discriminant analysis has the same form as the representation used for logistic regression.
LDA is more restrictive (it only allows the subset of such solutions that arise from Gaussian class densities with a shared covariance).
If the class distributions really are Gaussian with the same (co)variance, then LDA will find the solution more efficiently and usually more precisely.
Logistic regression is more robust (it covers more than just Gaussian class distributions and is less sensitive to modeling errors).
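To make the claimed equivalence concrete, the following sketch computes the weight vector and bias of the logistic form directly from the LDA parameters and checks numerically that the sigmoid of the linear function matches the generative posterior. The function names and the random test point are illustrative; the bias b corresponds to the component of θ multiplying an appended constant feature x_0 = 1.

```python
import numpy as np

def lda_to_logistic_form(phi, mu0, mu1, Sigma):
    """Rewrite the LDA posterior P(y=1|x) as sigmoid(w^T x + b).

    w = Sigma^{-1} (mu1 - mu0)
    b = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu0^T Sigma^{-1} mu0 + log(phi / (1 - phi))
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * mu1 @ Sigma_inv @ mu1
         + 0.5 * mu0 @ Sigma_inv @ mu0
         + np.log(phi / (1 - phi)))
    return w, b

def gaussian_pdf(x, mu, Sigma):
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / np.sqrt(
        (2 * np.pi) ** len(x) * np.linalg.det(Sigma))

# Quick numerical check against the generative posterior on a random point
rng = np.random.default_rng(0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
A = rng.normal(size=(2, 2))
Sigma = A @ A.T + np.eye(2)                        # a random SPD shared covariance
phi = 0.4
w, b = lda_to_logistic_form(phi, mu0, mu1, Sigma)
x = rng.normal(size=2)
posterior = phi * gaussian_pdf(x, mu1, Sigma) / (
    phi * gaussian_pdf(x, mu1, Sigma) + (1 - phi) * gaussian_pdf(x, mu0, Sigma))
print(posterior, 1 / (1 + np.exp(-(w @ x + b))))   # the two values should match
```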
Quadratic Discriminant Analysis
If we do not make the homoscedastic assumption, the discrimination surface becomes quadratic, resulting in Quadratic Discriminant Analysis (QDA):
\log\left( p_{\phi,\mu_0,\Sigma_0}(y = 0 \mid x) \right) - \log\left( p_{\phi,\mu_1,\Sigma_1}(y = 1 \mid x) \right) < 0
\Leftrightarrow (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \log|\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \log|\Sigma_1| > T
QDA is more flexible but also more complex.
It requires more data to train.
Logistic Regression and Linear Discriminant Analysis
Logistic regression and linear discriminant analysis are frequently used for classification.
Logistic regression provides an effective discriminative classification approach, with the same learning rule as for linear regression.
LDA provides generative classification that, if its assumptions are met, solves the same problem.
Softmax regression generalizes logistic regression to more than two classes, but is harder to train.
QDA provides a more general generative classifier, but requires more data.