
SGN-41006 (4 cr) Chapter 5: Linear Discriminant Analysis
Jussi Tohka & Jari Niemi
Department of Signal Processing, Tampere University of Technology
January 21, 2014

Contents of This Lecture
1. Two-Class Algorithms

Material
Chapter 5 in WebCop:2011, except Section 5.4 (Support vector machines), which will be the topic of the next lecture.

What Should You Already Know?
- Basics of linear discriminant functions, g_i(x) = w_i^T x + w_{i0}, i = 1, ..., c.
- Two classes: combined discriminant function g(x) = w^T x + w_0.
- If a Bayes classifier has linear decision boundaries, the posteriors g_i(x) = p(ω_i | x), i = 1, ..., c, can be transformed to the above linear form g_i(x) = w_i^T x + w_{i0} while preserving the ordering of the g_i(x).
- Otherwise, linear discriminant functions can, even at best, only approximate the ideal underlying Bayes classifier: underlearning (not considered a severe problem here, but you should be aware of it in practice).

Two-Class Algorithms: Generative vs. Discriminative Learning (Simplification)
- Generative, probabilistic: design classifiers by estimating the class-conditional pdfs and the prior probabilities from training data, and plug them into the Bayes classification rule. (All methods up to this point, with the exception of KNNs.)
  Pros: With enough training data, good performance is guaranteed; the class models can be useful in themselves (e.g. they allow recognizing something that cannot be classified).
  Cons: Can be inefficient when the number of training samples is small.
- Discriminative, geometric: estimate the discriminant functions (i.e. the decision regions) directly, without modelling the individual classes.
  Pros: Can be more efficient in small-sample situations.

Two-Class Algorithms: Example of Generative vs. Discriminative Learning
- The class-conditional pdfs are normal densities with equal covariances. The Bayes classifier is then linear and can be represented with the discriminant function
  g(x) = g_1(x) - g_2(x) = w^T x + w_0,
  where w = Σ^{-1}(µ_1 - µ_2) and w_0 = -(1/2)(µ_1 + µ_2)^T Σ^{-1}(µ_1 - µ_2) + ln P(ω_1) - ln P(ω_2).
- 2d + (d^2 + d)/2 + 1 parameter values for the Gaussians, but only d + 1 parameter values for the discriminant function.
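As a concrete illustration of the plug-in approach, the sketch below estimates w and w_0 from two labelled sample sets. The function name and the inputs X1, X2 (one sample per row) are assumptions for this example, and NumPy is used here only as a stand-in for the Matlab referred to in the course material.

```python
# Minimal sketch, assuming X1 and X2 are (n_i x d) arrays of training samples.
import numpy as np

def gaussian_linear_classifier(X1, X2):
    """Plug-in estimates of w and w0 in g(x) = w^T x + w0 for equal-covariance Gaussians."""
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (shared) covariance estimate
    Sigma = (np.cov(X1, rowvar=False) * (n1 - 1) +
             np.cov(X2, rowvar=False) * (n2 - 1)) / (n1 + n2 - 2)
    P1, P2 = n1 / (n1 + n2), n2 / (n1 + n2)            # prior estimates
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(P1) - np.log(P2)
    return w, w0

# Decide x -> omega_1 if w @ x + w0 > 0, else omega_2.
```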

Two-Class Algorithms: Linear Discriminant Functions, Generalities
- w: weight vector, w_0: threshold.
- Discriminant function g(x) = w^T x + w_0: if g(x) > 0, then x ∈ ω_1; if g(x) < 0, then x ∈ ω_2.
- Decision surface: w^T x + w_0 = 0; the distance of x to the decision surface is |g(x)| / ||w||.
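A minimal numerical sketch of the decision rule and the distance formula above; the vectors and numbers are made up purely for illustration.

```python
import numpy as np

w = np.array([2.0, -1.0])             # weight vector (illustrative values)
w0 = 0.5                              # threshold
x = np.array([1.0, 3.0])              # pattern to classify

g = w @ x + w0                        # g(x) = w^T x + w_0
label = 1 if g > 0 else 2             # omega_1 if g(x) > 0, omega_2 if g(x) < 0
dist = abs(g) / np.linalg.norm(w)     # distance of x to the surface w^T x + w_0 = 0
print(label, dist)
```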

Two-Class Algorithms: Preliminaries
- Augmented pattern vector y based on x: y = [1, x_1, ..., x_d]^T.
- The augmented weight vector is obtained from w and w_0: v = [w_0, w_1, ..., w_d]^T = [w_0, w^T]^T.
- Linear discriminant function: g(x) = w^T x + w_0 = v^T y = g(y).
- Reduce the two training sets D_1, D_2 to a single training set D by replacing the training samples from ω_2 with their negatives. This works because v^T y < 0 if and only if v^T (-y) > 0.
- The replacement by negatives must be performed on the augmented pattern vectors. (Why?)
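The preprocessing above (augment, then negate the ω_2 samples) can be written in a few lines; X1 and X2 are assumed (n_i x d) sample matrices as in the earlier sketch, and the function name is illustrative. Note that negating the augmented vector also flips the leading 1, which is exactly what negating the whole inequality w^T x + w_0 < 0 requires; negating the raw x alone would leave the w_0 term untouched.

```python
# Sketch, assuming X1 and X2 hold the omega_1 / omega_2 training samples row-wise.
import numpy as np

def augment_and_negate(X1, X2):
    """Stack augmented vectors y = [1, x_1, ..., x_d]^T, with the omega_2 rows
    replaced by their negatives, so that a correct v has v^T y > 0 for every row."""
    Y1 = np.hstack([np.ones((len(X1), 1)), X1])    # augment omega_1 samples
    Y2 = np.hstack([np.ones((len(X2), 1)), X2])    # augment omega_2 samples
    return np.vstack([Y1, -Y2])                    # negate the *augmented* omega_2 rows
```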

Two-Class Algorithms: Perceptron Algorithm
- For linearly separable training sets.
- Minimize
  J_p(v) = - Σ_{j: v^T y_j < 0} v^T y_j.   (1)
  The summation is over the misclassified samples.
- Update equation based on gradient descent:
  v(t + 1) = v(t) + η Σ_{j: v(t)^T y_j < 0} y_j,
  where η can be selected as 1 (the fixed-increment rule).
- Converges for linearly separable samples.
- Example from ItoPR.
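A minimal batch-perceptron sketch with the fixed increment η = 1, operating on the matrix Y produced by the augmentation step above; function and variable names are illustrative.

```python
import numpy as np

def perceptron(Y, eta=1.0, max_iter=1000):
    """Batch gradient descent on J_p(v); Y holds the augmented (and negated) samples."""
    v = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        # Treat v^T y_j <= 0 as misclassified so the all-zero start is updated too.
        mis = Y[Y @ v <= 0]
        if mis.size == 0:               # every sample satisfies v^T y_j > 0: done
            return v
        v = v + eta * mis.sum(axis=0)   # v(t+1) = v(t) + eta * sum of misclassified y_j
    return v                            # may not have converged (e.g. non-separable data)
```

For linearly separable data the loop terminates; the cap max_iter is an assumption of this sketch, since the plain fixed-increment rule need not converge otherwise.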

Two-Class Algorithms: Perceptron Variants
- Absolute correction rule.
- Fractional correction rule.
- Variable increment: allows η to change from iteration to iteration; addresses non-separable samples.

Two-Class Algorithms: Margins and Relaxation
- Relaxation criterion:
  J_r(v) = Σ_{j: v^T y_j ≤ b} (v^T y_j - b)^2 / ||y_j||^2.
- Updates: see the book.
- Support vector machines are based on the idea of margin maximization.
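The slides defer the relaxation update rules to the book; purely as an illustrative sketch, a direct batch gradient step on J_r as written above could look as follows, with the margin b and step size eta treated as assumed parameters.

```python
# Illustrative sketch only: one gradient-descent step on J_r, not the book's update rule.
import numpy as np

def relaxation_step(v, Y, b=1.0, eta=0.1):
    """One gradient step on J_r(v) = sum_{v^T y_j <= b} (v^T y_j - b)^2 / ||y_j||^2."""
    scores = Y @ v
    active = Y[scores <= b]                        # samples inside the margin
    if active.size == 0:
        return v
    r = active @ v - b                             # residuals v^T y_j - b
    norms2 = np.sum(active**2, axis=1)             # ||y_j||^2
    grad = 2 * (active * (r / norms2)[:, None]).sum(axis=0)
    return v - eta * grad
```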

Two-Class Algorithms: Fisher's Linear Discriminant Analysis (LDA)
- Seek a direction along which the two classes are best separated, as measured by the ratio of between-class to within-class variance.
- Maximize
  J_F(w) = (w^T (m_1 - m_2))^2 / (w^T S_W w),
  where m_1, m_2 are the class means and S_W is the pooled within-class covariance matrix.
- The maximizing w is proportional to S_W^{-1}(m_1 - m_2). This should look familiar!
- Fisher's criterion 1) does not invoke a Gaussianity assumption and 2) does not directly provide a classification rule (a threshold must be selected).
- By assuming Gaussianity, we get the plug-in Gaussian linear classifier discussed previously.
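A short sketch of computing the Fisher direction w ∝ S_W^{-1}(m_1 - m_2); X1 and X2 are the same assumed per-class sample matrices as in the earlier sketches, and the direction alone does not fix the classification threshold.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the (unit-norm) Fisher LDA direction for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-class scatter; any positive scaling gives the same direction.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)   # proportional to S_W^{-1}(m1 - m2)
    return w / np.linalg.norm(w)
```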

Two-Class Algorithms: Least-Mean-Squared-Error Procedures
- Attempt to find v that satisfies the equalities v^T y_i = t_i (approximately).
- Minimize
  J_s(v) = ||Y v - t||^2,
  where Y is the n × (d + 1) matrix of the (augmented) training samples and t is the n-component vector t = [t_1, t_2, ..., t_n]^T (n = n_1 + n_2).
- The solution is v = Y^+ t, where Y^+ is the pseudo-inverse of Y (Matlab command pinv). If Y^T Y is non-singular, then Y^+ = (Y^T Y)^{-1} Y^T.
- t = 1, i.e., t_i = 1 for all i, is a principled choice (Section 5.2.4.3).
- LMS is related to Fisher's LDA (see Section 5.2.4.2 for more info).
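The pseudo-inverse solution is one line in practice; the sketch below uses NumPy's pinv in place of the Matlab pinv mentioned above, with the targets fixed to t_i = 1 and Y assumed to be the augmented-and-negated sample matrix from the earlier sketches.

```python
import numpy as np

def lms_weights(Y):
    """Least-mean-squared-error solution v = Y^+ t with t_i = 1 for all i."""
    t = np.ones(Y.shape[0])          # target vector t = 1
    return np.linalg.pinv(Y) @ t     # v = Y^+ t
```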

Two-Class Algorithms: Least-Mean-Squared-Error and Perceptron Procedures
[Figure: decision surfaces based on the Perceptron criterion and the Minimum Squared Error criterion. The solid line is the decision surface from the Perceptron criterion; the dashed line is the decision surface from the Minimum Squared Error criterion.]