SGN-41006 (4 cr) Chapter 5: Linear Discriminant Analysis
Jussi Tohka & Jari Niemi
Department of Signal Processing, Tampere University of Technology
January 21, 2014
Contents of This Lecture
1 Two-Class Algorithms
Material
Chapter 5 in WebCop:2011, except Section 5.4 (Support vector machines), which will be the topic of the next lecture.
What Should You Already Know?
- Basics of linear discriminant functions, g_i(x) = w_i^T x + w_{i0}, i = 1, ..., c.
- Two classes: combined discriminant function g(x) = w^T x + w_0.
- If a Bayes classifier has linear decision boundaries, the posteriors g_i(x) = P(ω_i | x), i = 1, ..., c, can be transformed to the above linear form g_i(x) = w_i^T x + w_{i0} while preserving the ordering of the g_i(x).
- Otherwise, linear discriminant functions can, even at best, only approximate the ideal underlying Bayes classifier: underlearning (not considered a severe problem here, but you should be aware of it in practice).
Two-Class Algorithms
Generative vs. Discriminative Learning (a simplification)
- Generative, probabilistic: design classifiers by estimating the class-conditional pdfs and prior probabilities from the training data and plugging them into the Bayes classification rule. (Everything covered up to this point, with the exception of KNN.)
  - Pros: with enough training data, good performance is guaranteed; the class models can be useful in themselves (e.g. they allow recognizing samples that cannot be classified).
  - Cons: can be inefficient when the amount of training data is small.
- Discriminative, geometric: estimate the discriminant functions (i.e. the decision regions) directly, without modelling the individual classes.
  - Pros: can be more efficient in small-sample situations.
Two-Class Algorithms
Example: Generative vs. Discriminative Learning
- The class-conditional pdfs are normal densities with equal covariances.
- The Bayes classifier is linear, and it can be represented with the discriminant function
  g(x) = g_1(x) - g_2(x) = w^T x + w_0,
  where w = Σ^{-1}(µ_1 - µ_2) and w_0 = -(1/2)(µ_1 + µ_2)^T Σ^{-1}(µ_1 - µ_2) + ln P(ω_1) - ln P(ω_2).
- 2d + (d^2 + d)/2 + 1 parameter values for the Gaussians, but only d + 1 parameter values for the discriminant function.
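As a concrete illustration of the generative (plug-in) route, the sketch below estimates the class means, a pooled covariance, and uses given priors to form w and w_0 as above. This is a minimal NumPy sketch under the equal-covariance assumption; the function name and interface are illustrative, not from the course material.

    import numpy as np

    def gaussian_linear_discriminant(X1, X2, prior1=0.5, prior2=0.5):
        # X1, X2: training samples of class 1 and class 2, one sample per row.
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        n1, n2 = len(X1), len(X2)
        # Pooled (shared) covariance estimate.
        S = (np.cov(X1, rowvar=False) * (n1 - 1) +
             np.cov(X2, rowvar=False) * (n2 - 1)) / (n1 + n2 - 2)
        Sinv = np.linalg.inv(S)
        # Plug-in parameters of g(x) = w^T x + w_0 (decide ω_1 if g(x) > 0).
        w = Sinv @ (mu1 - mu2)
        w0 = -0.5 * (mu1 + mu2) @ Sinv @ (mu1 - mu2) + np.log(prior1) - np.log(prior2)
        return w, w0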
Two-Class Algorithms
Linear Discriminant Functions: Generalities
- w: weight vector, w_0: threshold.
- Discriminant function g(x) = w^T x + w_0: if g(x) > 0, then x ∈ ω_1; if g(x) < 0, then x ∈ ω_2.
- Decision surface: w^T x + w_0 = 0; the distance of x to the decision surface is |g(x)| / ‖w‖.
Two-Class Algorithms
Preliminaries
- Augmented pattern vector y based on x: y = [1, x_1, ..., x_d]^T.
- The augmented weight vector is obtained from w and w_0: v = [w_0, w_1, ..., w_d]^T = [w_0, w^T]^T.
- Then g(x) = w^T x + w_0 = v^T y = g(y).
- Reduce the two training sets D_1, D_2 to a single training set D by replacing the training samples from ω_2 by their negatives. This works because v^T y < 0 ⟺ v^T (-y) > 0.
- The replacement by negatives must be performed on the augmented pattern vectors. (Why?)
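A minimal sketch of this preprocessing step, assuming the training data are given as one NumPy array per class (names are illustrative):

    import numpy as np

    def augment_and_negate(X1, X2):
        # Augment with a leading 1 so that g(x) = v^T y with v = [w_0, w^T]^T.
        Y1 = np.hstack([np.ones((len(X1), 1)), X1])
        Y2 = np.hstack([np.ones((len(X2), 1)), X2])
        # Negate the ω_2 samples: a correct v then satisfies v^T y > 0 for every row.
        return np.vstack([Y1, -Y2])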
Two-Class Algorithms
Perceptron Algorithm
- For linearly separable training sets.
- Minimize
  J_p(v) = -\sum_{j : v^T y_j < 0} v^T y_j.    (1)
- The summation is over the misclassified samples.
- Update equation based on gradient descent:
  v(t + 1) = v(t) + η \sum_{j : v(t)^T y_j < 0} y_j,
  where η can be selected as 1 (fixed-increment rule).
- Converges for linearly separable samples.
- Example from ItoPR.
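A minimal sketch of the batch fixed-increment rule, operating on the augmented, sign-normalized samples produced above (the stopping rule and the ≤ 0 convention for "misclassified" are implementation choices, not from the slide):

    import numpy as np

    def batch_perceptron(Y, eta=1.0, max_iter=1000):
        # Y: augmented, sign-normalized samples, one per row; seek v with Y v > 0.
        v = np.zeros(Y.shape[1])
        for _ in range(max_iter):
            misclassified = Y[Y @ v <= 0]            # samples with v^T y_j <= 0
            if len(misclassified) == 0:
                break                                # separable data: converged
            v = v + eta * misclassified.sum(axis=0)  # gradient-descent step on J_p
        return v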
Two-Class Algorithms
Perceptron Variants
- Absolute correction rule
- Fractional correction rule
- Variable increment: allows η to change from iteration to iteration; addresses non-separable samples.
Two-Class Algorithms
Margins and Relaxation
- Relaxation criterion:
  J_r(v) = \sum_{j : v^T y_j ≤ b} (v^T y_j - b)^2 / ‖y_j‖^2
- Updates: see the book.
- Support vector machines are based on the idea of margin maximization.
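The slide defers the update rules to the book; the following is one common batch gradient-descent variant of relaxation with margin b (an assumption for illustration, not necessarily the book's exact formulation):

    import numpy as np

    def batch_relaxation(Y, b=1.0, eta=0.5, max_iter=1000):
        # Y: augmented, sign-normalized samples; b: desired margin on v^T y_j.
        v = np.zeros(Y.shape[1])
        norms = (Y ** 2).sum(axis=1)        # ||y_j||^2 (>= 1 due to the leading 1)
        for _ in range(max_iter):
            mask = Y @ v <= b               # samples violating the margin
            if not mask.any():
                break
            # Gradient-descent step on the relaxation criterion above.
            grad = ((Y @ v - b) / norms)[mask] @ Y[mask]
            v = v - eta * grad
        return v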
Two-Class Algorithms
Fisher's Linear Discriminant Analysis (LDA)
- Seek a direction along which the two classes are best separated, according to the ratio of between-class to within-class variance.
- Maximize
  J_F(w) = (w^T (m_1 - m_2))^2 / (w^T S_W w),
  where m_1, m_2 are the class means and S_W is the pooled within-class covariance matrix.
- The maximizing w is proportional to S_W^{-1}(m_1 - m_2). This should look familiar!
- Fisher's criterion 1) does not invoke a Gaussianity assumption and 2) does not directly provide a classification rule (a threshold must still be selected).
- By assuming Gaussianity, we get the plug-in Gaussian linear classifier discussed previously.
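A minimal sketch of the Fisher direction (the returned w is only defined up to scale, so using the within-class scatter instead of the pooled covariance makes no difference; the threshold still has to be chosen separately, e.g. between the projected class means):

    import numpy as np

    def fisher_lda_direction(X1, X2):
        # Class means and within-class scatter (proportional to the pooled covariance).
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        # Direction maximizing J_F(w); assumes S_W is non-singular.
        w = np.linalg.solve(S_W, m1 - m2)
        return w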
Two-Class Algorithms
Least-Mean-Squared-Error Procedures
- Attempt to find v that satisfies the equalities v^T y_i = t_i (approximately).
- Minimize J_s(v) = ‖Y v - t‖^2, where Y is the n × (d + 1) matrix whose rows are the training samples and t is the n-component vector t = [t_1, t_2, ..., t_n]^T (n = n_1 + n_2).
- The solution is v = Y^+ t, where Y^+ is the pseudo-inverse of Y (Matlab command pinv). If Y^T Y is non-singular, then Y^+ = (Y^T Y)^{-1} Y^T.
- t = 1, i.e. t_i = 1 for all i, is a principled choice (Section 5.2.4.3).
- LMS is related to Fisher's LDA (see Section 5.2.4.2 for more info).
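A minimal sketch of the pseudo-inverse solution, using NumPy's pinv in place of the Matlab command mentioned above (function name and default target vector chosen for illustration):

    import numpy as np

    def lms_weight_vector(Y, t=None):
        # Y: n x (d+1) matrix of augmented, sign-normalized training samples.
        if t is None:
            t = np.ones(Y.shape[0])          # t_i = 1 for all i (Section 5.2.4.3)
        # Minimize ||Y v - t||^2; pinv also handles a singular Y^T Y.
        return np.linalg.pinv(Y) @ t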
Two-Class Algorithms
Least-Mean-Squared-Error and Perceptron Procedures
[Figure: decision surfaces for the same training data; the solid line is the decision surface based on the Perceptron criterion and the dashed line is the decision surface based on the Minimum Squared Error criterion.]