Linear Regression and Discrimination
Kernel-based Learning Methods

Christian Igel
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
http://www.neuroinformatik.rub.de

July 16, 2009
Outline

- Warmup: Gaussian distributions
- Linear Regression
- Sum-of-squares Error
- Solution Using Pseudo-Inverse
- Linear Discriminant Analysis
Recall: Gaussian Distribution

Gaussian distribution of a single real-valued variable with mean $\mu \in \mathbb{R}$ and variance $\sigma^2$:
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$$

Multivariate Gaussian distribution of a $d$-dimensional real-valued random vector with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:
$$N(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
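As a quick numerical illustration, a minimal NumPy sketch of both densities (the function names are illustrative, not from any particular library):

```python
import numpy as np

def gaussian_1d(x, mu, sigma2):
    """Univariate density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def gaussian_nd(x, mu, Sigma):
    """Multivariate density N(x | mu, Sigma) for a d-dimensional x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

print(gaussian_1d(0.5, mu=0.0, sigma2=1.0))                        # ~0.352
print(gaussian_nd(np.array([0.5, -0.2]), np.zeros(2), np.eye(2)))  # ~0.138
```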
Linear functions

$$f(x) = \langle x, w \rangle + b$$

[Figure: a hyperplane $\langle x, w \rangle + b = 0$ with normal vector $w$ and offset $b/\lVert w \rVert$ from the origin; points with $\langle x, w \rangle + b > 0$ lie on one side, points with $\langle x, w \rangle + b < 0$ on the other.]
Recall: Linear regression

Goal of regression: given an input, predict the corresponding output from $Y \subseteq \mathbb{R}^m$, $1 \le m < \infty$. Let's consider $m = 1$.

We consider a mapping/encoding/representation $\phi: X \to \mathbb{R}^d$, $\phi(x) = (\phi_1(x), \dots, \phi_d(x))^T$. The $\phi_i$ are called basis functions.

In linear regression, the hypotheses are
$$H = \left\{ x \mapsto \sum_{i=1}^{d} w_i \phi_i(x) + b \;\middle|\; w \in \mathbb{R}^d,\, b \in \mathbb{R} \right\}.$$

For convenience, we redefine $\phi: X \to \mathbb{R}^{d+1}$, $\phi(x) = (\phi_1(x), \dots, \phi_d(x), 1)^T$ and write
$$H = \left\{ x \mapsto w^T \phi(x) \;\middle|\; w \in \mathbb{R}^{d+1} \right\}.$$
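For concreteness, a small sketch of such a feature map (polynomial basis functions, chosen only for illustration) with the constant basis function appended, so that a hypothesis is a single dot product:

```python
import numpy as np

def phi(x, degree=3):
    """Map a scalar input x to (x, x^2, ..., x^degree, 1)^T."""
    return np.array([x ** i for i in range(1, degree + 1)] + [1.0])

def h(x, w, degree=3):
    """Hypothesis h(x) = w^T phi(x); the last entry of w plays the role of b."""
    return w @ phi(x, degree)

w = np.array([0.5, -1.0, 2.0, 0.3])
print(h(1.5, w))   # 0.5*1.5 - 1.0*2.25 + 2.0*3.375 + 0.3 = 5.55
```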
Sum-of-squares Error and Gaussian Noise I

Consider a deterministic target function $f: X \to \mathbb{R}$ with zero-mean additive Gaussian noise with variance $\sigma^2$; then we have
$$p(y \mid x) = N(y \mid f(x), \sigma^2).$$

Given $S = \{(x_1, y_1), \dots, (x_l, y_l)\}$, the likelihood of a hypothesis $h(x) = w^T \phi(x)$ is given by
$$p\left((y_1, \dots, y_l)^T \mid (x_1, \dots, x_l), w, \sigma^2\right) := p(y \mid w) = \prod_{i=1}^{l} N(y_i \mid w^T \phi(x_i), \sigma^2).$$
Sum-of-squares Error and Gaussian Noise II

For convenience, we consider the logarithm:
$$\ln p(y \mid w) = \ln \prod_{i=1}^{l} N(y_i \mid w^T \phi(x_i), \sigma^2) = \sum_{i=1}^{l} \ln N(y_i \mid w^T \phi(x_i), \sigma^2)$$
$$= -\frac{l}{2} \ln \sigma^2 - \frac{l}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{l} \left\{ y_i - w^T \phi(x_i) \right\}^2$$

Thus, here maximizing the likelihood corresponds to minimizing the sum-of-squares loss.
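A numerical sanity check of this identity (a sketch with made-up data; SciPy is used only to evaluate the Gaussian log-density independently):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
l, sigma2 = 20, 0.25
Phi = rng.normal(size=(l, 3))                 # rows are phi(x_i)^T
w = np.array([1.0, -2.0, 0.5])
y = Phi @ w + rng.normal(scale=np.sqrt(sigma2), size=l)

# log-likelihood summed over the Gaussian noise model ...
loglik = norm.logpdf(y, loc=Phi @ w, scale=np.sqrt(sigma2)).sum()
# ... equals the closed-form expression from the slide
closed_form = (-0.5 * l * np.log(sigma2) - 0.5 * l * np.log(2.0 * np.pi)
               - 0.5 / sigma2 * np.sum((y - Phi @ w) ** 2))
print(np.isclose(loglik, closed_form))        # True
```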
Solution using pseudo-inverse I

The (halved) sum-of-squares error of the linear model has derivative
$$\nabla_w\, \frac{1}{2} \sum_{i=1}^{l} \left\{ y_i - w^T \phi(x_i) \right\}^2 = -\sum_{i=1}^{l} \left\{ y_i - w^T \phi(x_i) \right\} \phi(x_i)^T.$$

Setting it to zero yields
$$0^T = \sum_{i=1}^{l} y_i \phi(x_i)^T - w^T \sum_{i=1}^{l} \phi(x_i) \phi(x_i)^T.$$
Solution using pseudo-inverse II

$$\sum_{i=1}^{l} y_i \phi(x_i)^T - w^T \sum_{i=1}^{l} \phi(x_i) \phi(x_i)^T = y^T \Phi - w^T \Phi^T \Phi, \qquad
\Phi := \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d+1}(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d+1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_l) & \phi_2(x_l) & \cdots & \phi_{d+1}(x_l)
\end{pmatrix}$$

$$0^T = y^T \Phi - w^T \Phi^T \Phi \quad\Longleftrightarrow\quad w^T \Phi^T \Phi = y^T \Phi$$

Maximum likelihood estimate:
$$w^* = \left( \Phi^T \Phi \right)^{-1} \Phi^T y$$

$\Phi^\dagger := \left( \Phi^T \Phi \right)^{-1} \Phi^T$ is called the Moore-Penrose pseudo-inverse.
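A short NumPy sketch of this estimator on toy data; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse directly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x - 0.5 + rng.normal(scale=0.1, size=50)   # noisy line

# design matrix with basis function phi_1(x) = x and the constant 1
Phi = np.column_stack([x, np.ones_like(x)])

# w* = (Phi^T Phi)^{-1} Phi^T y = pseudo-inverse applied to y
w_star = np.linalg.pinv(Phi) @ y
print(w_star)                                        # approximately [ 2.0, -0.5]

# equivalent and numerically robust in practice:
# w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```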
Decision based on class posteriors

Classification means assigning an input $x \in X$ to one class of a finite set of classes $Y = \{C_1, \dots, C_m\}$, $2 \le m < \infty$.

One approach is to learn appropriate discrimination functions $\delta_k: X \to \mathbb{R}$, $1 \le k \le m$, and assign a pattern $x$ to class $\hat{y}$ by choosing $\hat{y} = \arg\max_k \delta_k(x)$.

If we know the class posteriors $\Pr(Y \mid X)$, we can perform optimal classification: a pattern $x$ is assigned to the class $C_k$ with maximum $\Pr(Y = C_k \mid X = x)$, i.e., $\hat{y} = \arg\max_k \Pr(Y = C_k \mid X = x)$.

$\Pr(Y = C_k \mid X = x)$ is proportional to the class-conditional density $p(X = x \mid Y = C_k)$ times the class prior $\Pr(Y = C_k)$:
$$\Pr(Y = C_k \mid X = x) = \frac{p(X = x \mid Y = C_k)\,\Pr(Y = C_k)}{p(X = x)}$$
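A minimal sketch of this decision rule with two one-dimensional Gaussian class-conditionals (the means, variances, and priors are made up for illustration); since $p(X = x)$ does not depend on $k$, it can be dropped from the argmax:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.3, 0.7])        # Pr(Y = C_k)
means = np.array([-1.0, 1.0])        # class-conditional means
scales = np.array([1.0, 1.0])        # class-conditional standard deviations

def predict(x):
    # posterior is proportional to class-conditional density times prior
    unnormalized = norm.pdf(x, loc=means, scale=scales) * priors
    return np.argmax(unnormalized)

print(predict(-1.5), predict(0.8))   # 0 1
```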
Decision based on class probabilities

[Figure: joint densities $p(x, C_1)$ and $p(x, C_2)$ over $x$ with decision regions $R_1$ and $R_2$ separated at the threshold $x_0$; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.]
Gaussian class-conditionals

Let's consider
$$\ln \Pr(Y = C_k \mid X = x) = \ln p(X = x \mid Y = C_k) + \ln \Pr(Y = C_k) + \text{const}.$$

For $X = \mathbb{R}^d$ and Gaussian class-conditionals we have $p(X = x \mid Y = C_k) = N(x \mid \mu_k, \Sigma_k)$ and
$$\ln N(x \mid \mu_k, \Sigma_k) = \ln\left( \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\} \right)$$
$$= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$$
$$= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k$$
Linear Discriminant Analysis (LDA)

Let's assume that all class-conditionals have the same covariance matrix $\Sigma$:
$$\ln \Pr(Y = C_k \mid X = x) - \text{const} = \ln \Pr(Y = C_k) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma - \frac{1}{2} x^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

Dropping all remaining terms that do not depend on $k$ (they do not affect the argmax), we estimate
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln \Pr(Y = C_k)$$
with
$$\hat{\Pr}(Y = C_k) = l_k / l, \qquad \hat{\mu}_k = \frac{1}{l_k} \sum_{(x, y) \in S_k} x, \qquad \hat{\Sigma} = \frac{1}{l} \sum_{k=1}^{m} \sum_{(x, y) \in S_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^T,$$
where $S_k = \{(x, y) \in S \mid y = C_k\}$ and $l_k = |S_k|$.
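Putting these estimators together, a sketch of LDA fitting and prediction (class labels are assumed to be coded as 0, ..., m−1; the toy data are made up):

```python
import numpy as np

def lda_fit(X, y, m):
    """Estimate priors, class means, and the pooled covariance matrix."""
    l, d = X.shape
    priors = np.array([np.mean(y == k) for k in range(m)])        # l_k / l
    means = np.array([X[y == k].mean(axis=0) for k in range(m)])  # mu_k
    Sigma = np.zeros((d, d))
    for k in range(m):
        diff = X[y == k] - means[k]
        Sigma += diff.T @ diff
    return priors, means, Sigma / l                               # pooled covariance, 1/l as above

def lda_predict(X, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + ln Pr(Y = C_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return np.argmax(deltas, axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([2.0, 2.0], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
priors, means, Sigma = lda_fit(X, y, m=2)
print(np.mean(lda_predict(X, priors, means, Sigma) == y))   # training accuracy
```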
Effect of learning the covariance

[Figure illustrating the effect of estimating the covariance matrix on the decision boundaries; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.]
Linear and Quadratic Discriminant Analysis

In LDA the decision boundaries $\{x \mid \delta_i(x) = \delta_j(x)\}$ (i.e., the hypotheses) between two classes $i$ and $j$ are linear functions.

Modeling independent covariance matrices for the class-conditionals leads to quadratic discriminant analysis (QDA) with quadratic decision functions:
$$\delta_k(x) = -\frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k + \ln \Pr(Y = C_k)$$
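A corresponding sketch of a single QDA discriminant function (per-class covariance $\Sigma_k$ instead of the pooled $\Sigma$; names are illustrative):

```python
import numpy as np

def qda_delta(x, mu_k, Sigma_k, prior_k):
    """delta_k(x) for quadratic discriminant analysis."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    return (-0.5 * np.log(np.linalg.det(Sigma_k))
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            + np.log(prior_k))

# classify x by evaluating qda_delta for every class and taking the argmax
print(qda_delta(np.array([0.5, 1.0]), np.zeros(2), np.eye(2), prior_k=0.5))
```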
Linear Regression and LDA

One could determine the $\delta_k$ by linear regression. If $l_1 = l_2$ and $Y = \{-1, 1\}$, LDA and linear regression come to the same boundaries; if $l_1 \neq l_2$, the cut point of the decision boundary is shifted.

For more than two classes, using standard linear regression for classification is not advisable.

Both linear regression and LDA should be considered as baseline algorithms for regression and classification. In particular, LDA often gives very good results!

References:
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.