Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea. dkim@postech.ac.kr

Contents: 4.1 Discriminant Functions; 4.2 Probabilistic Generative Models; 4.3 Probabilistic Discriminative Models; 4.4 The Laplace Approximation; 4.5 Bayesian Logistic Regression

Classification Models. A linear classification model defines a (D−1)-dimensional hyperplane in the D-dimensional input space. For K > 2 classes a 1-of-K coding scheme is used for the targets, e.g. t = (0, 1, 0, 0, 0)^T. A discriminant function directly assigns each vector x to a specific class (e.g. Fisher's linear discriminant). Approaches based on the conditional probability p(C_k | x) separate the inference and decision stages; there are two of them: the discriminative approach, which models the posterior probability directly, and the generative approach, which models the likelihood and the prior probability in order to calculate the posterior, and is therefore capable of generating samples.

Discriminant Functions: Two Classes. Classification by a hyperplane:
y(x) = w^T x + w_0;  if y(x) ≥ 0 then x ∈ C_1, otherwise x ∈ C_2.
Equivalently y(x) = w̃^T x̃, where w̃ = (w_0, w) and x̃ = (x_0, x) with x_0 = 1.

Discriminant Functions: Multiple Classes. One-versus-the-rest classifier: K−1 classifiers for a K-class discriminant; regions are ambiguous when more than one classifier claims a point. One-versus-one classifier: K(K−1)/2 binary discriminant functions combined by majority voting; still ambiguous when scores are tied. (Figure: ambiguous regions of the one-versus-the-rest and one-versus-one approaches.)

Discriminant Functions: Multiple Classes (Cont'd). A K-class discriminant comprising K linear functions
y_k(x) = w_k^T x + w_{k0},  k = 1, ..., K,
assigns x to C_k if y_k(x) > y_j(x) for all j ≠ k, i.e. to the class with the maximum output. The decision regions are always singly connected and convex: for any x_A, x_B ∈ R_k and x̂ = λ x_A + (1 − λ) x_B with 0 ≤ λ ≤ 1, linearity gives y_k(x̂) = λ y_k(x_A) + (1 − λ) y_k(x_B); since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k, it follows that y_k(x̂) > y_j(x̂), and therefore x̂ ∈ R_k.

Approaches for learning the parameters of linear discriminant functions: the least squares method; Fisher's linear discriminant (its relation to least squares, and the extension to multiple classes); the perceptron algorithm.

Least Squares Method. Minimize the sum-of-squares error (SSE) with a 1-of-K binary coding scheme for the target vector t. The model is
y(x) = W̃^T x̃,  where W̃ = (w̃_1, ..., w̃_K) and w̃_k = (w_{k0}, w_k^T)^T.
For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is
E_D(W̃) = (1/2) Tr{ (X̃W̃ − T)^T (X̃W̃ − T) },
where X̃ = (x̃_1, ..., x̃_N)^T and T = (t_1, ..., t_N)^T. Minimizing the SSE gives
W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T,  with X̃^† the pseudo-inverse of X̃.
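
A minimal sketch of the least-squares classifier just described, assuming numpy and one-hot targets; the function and variable names (fit_least_squares, X_tilde, W_tilde) are illustrative and not from the slides:

```python
import numpy as np

def fit_least_squares(X, T):
    """X: (N, D) inputs; T: (N, K) one-hot targets. Returns W_tilde of shape (D+1, K)."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])               # prepend the bias feature x0 = 1
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)   # pseudo-inverse solution
    return W_tilde

def predict(W_tilde, X):
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)             # class with the maximum output
```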

Least Squares Method (Cont'd): Limits and Disadvantages. The least-squares solution yields y(x) whose elements sum to 1, but the outputs are not constrained to lie in the range [0, 1]. It is also vulnerable to outliers, because the SSE function penalizes predictions that are "too correct", i.e. that lie far from the decision boundary on the correct side. This reflects the fact that least squares is ML under a Gaussian conditional distribution, a unimodal assumption that the target distribution does not satisfy.

Least Squares Method (Cont'd): Limits and Disadvantages. The lack of robustness comes from the fact that the least squares method corresponds to maximum likelihood under the assumption of a Gaussian distribution, and binary target vectors are far from this assumption. (Figure: comparison of the decision boundaries obtained by the least squares solution and by logistic regression.)

Fisher's Linear Discriminant. A linear classification model viewed as dimensionality reduction from the D-dimensional input space to one dimension. In the case of two classes,
y = w^T x;  if y ≥ −w_0 then x ∈ C_1, otherwise x ∈ C_2.
The goal is to find w such that the projected data are well separated into clusters.

Fisher's Linear Discriminant (Cont'd). Maximizing the projected mean distance? The separation of the projected class means is
m_2' − m_1' = w^T (m_2 − m_1),  where m_1 = (1/N_1) ∑_{n∈C_1} x_n and m_2 = (1/N_2) ∑_{n∈C_2} x_n.
This criterion alone is not appropriate when the class covariances are strongly non-diagonal, since the projected classes may still overlap considerably.

Fisher's Linear Discriminant (Cont'd). Also take the within-class variance of the projected data into account, and find the w that maximizes the Fisher criterion
J(w) = (m_2' − m_1')^2 / (s_1^2 + s_2^2),  where s_k^2 = ∑_{n∈C_k} (y_n − m_k')^2,
or in matrix form
J(w) = (w^T S_B w) / (w^T S_W w),
with the between-class covariance matrix S_B = (m_2 − m_1)(m_2 − m_1)^T and the within-class covariance matrix S_W = ∑_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + ∑_{n∈C_2} (x_n − m_2)(x_n − m_2)^T.
J(w) is maximized when (w^T S_B w) S_W w = (w^T S_W w) S_B w. Since S_B w is always in the direction of (m_2 − m_1), Fisher's linear discriminant is
w ∝ S_W^{−1} (m_2 − m_1).
If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
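
A minimal sketch of the two-class Fisher direction w ∝ S_W^{−1}(m_2 − m_1), assuming numpy arrays with one sample per row; names are illustrative:

```python
import numpy as np

def fisher_direction(X1, X2):
    """X1, X2: samples of class 1 and class 2, one sample per row."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class covariance
    w = np.linalg.solve(S_W, m2 - m1)                         # w proportional to S_W^{-1}(m2 - m1)
    return w / np.linalg.norm(w)                              # unit-length projection direction
```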

Fisher's Linear Discriminant: Relation to Least Squares. The Fisher criterion can be obtained as a special case of least squares by choosing the target values t_n = N/N_1 for class C_1 and t_n = −N/N_2 for class C_2. With E = (1/2) ∑_{n=1}^N (w^T x_n + w_0 − t_n)^2, setting the derivatives to zero gives
dE/dw_0 = ∑_n (w^T x_n + w_0 − t_n) = 0   (1)
dE/dw = ∑_n (w^T x_n + w_0 − t_n) x_n = 0   (2)
Solving (1) gives w_0 = −w^T m, where m = (1/N) ∑_n x_n = (N_1 m_1 + N_2 m_2)/N. Substituting this into (2) gives
(S_W + (N_1 N_2 / N) S_B) w = N (m_1 − m_2),
and since S_B w is always in the direction of (m_2 − m_1),
w ∝ S_W^{−1} (m_2 − m_1).

Fisher's Discriminant for Multiple Classes. For K > 2 classes, consider dimensionality reduction from D to D′ > 1 linear features y_k = w_k^T x, k = 1, ..., D′. Generalization of S_W and S_B:
S_W = ∑_{k=1}^K S_k,  where S_k = ∑_{n∈C_k} (x_n − m_k)(x_n − m_k)^T and m_k = (1/N_k) ∑_{n∈C_k} x_n;
S_T = ∑_{n=1}^N (x_n − m)(x_n − m)^T,  where m = (1/N) ∑_{n=1}^N x_n = (1/N) ∑_{k=1}^K N_k m_k;
S_T = S_W + S_B,  with S_B = ∑_{k=1}^K N_k (m_k − m)(m_k − m)^T.
S_B arises from this decomposition of the total covariance matrix (Duda and Hart, 1997).

Fisher's Discriminant for Multiple Classes (Cont'd). Covariance matrices in the projected y-space:
s_W = ∑_{k=1}^K ∑_{n∈C_k} (y_n − μ_k)(y_n − μ_k)^T and s_B = ∑_{k=1}^K N_k (μ_k − μ)(μ_k − μ)^T,
where μ_k = (1/N_k) ∑_{n∈C_k} y_n and μ = (1/N) ∑_{k=1}^K N_k μ_k.
Fukunaga's criterion: J(W) = Tr{ s_W^{−1} s_B } = Tr{ (W S_W W^T)^{−1} (W S_B W^T) }.
Another criterion (Duda et al., Pattern Classification, Ch. 3.8.3): J(W) = |s_B| / |s_W| = |W S_B W^T| / |W S_W W^T|, where the determinant is the product of the eigenvalues, i.e. the variances in the principal directions.

Fisher's Discriminant for Multiple Classes (Cont'd). (Figure-only slide.)

Perceptron Algorithm. Classification of x by a perceptron:
y(x) = f(w^T φ(x)),  where f(a) = +1 for a ≥ 0 and f(a) = −1 for a < 0.
Error functions: the total number of misclassified patterns is piecewise constant and discontinuous, and its gradient is zero almost everywhere. Instead, use the perceptron criterion
E_P(w) = −∑_{n∈M} w^T φ_n t_n,
where t_n is the target output and M is the set of misclassified patterns.

Perceptron Algorithm (Cont'd). Stochastic gradient descent:
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n.
The error contributed by a misclassified pattern is reduced after each update, since
−w^{(τ+1)T} φ_n t_n = −w^{(τ)T} φ_n t_n − (φ_n t_n)^T (φ_n t_n) < −w^{(τ)T} φ_n t_n,
but this does not imply that the overall error is reduced. Perceptron convergence theorem: if an exact solution exists (i.e. the training data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps. Remaining issues: learning speed, linearly non-separable data, and multiple classes.
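
A minimal sketch of the perceptron update rule above, assuming targets in {−1, +1} and the identity feature map plus a bias; names are illustrative:

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """X: (N, D) inputs, t: (N,) targets in {-1, +1}."""
    N, D = X.shape
    Phi = np.hstack([np.ones((N, 1)), X])    # phi(x) = (1, x)
    w = np.zeros(D + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):
            if t[n] * (w @ Phi[n]) <= 0:     # pattern n is misclassified
                w += eta * Phi[n] * t[n]     # w <- w + eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                    # converged (data linearly separable)
            break
    return w
```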

Perceptron Algorithm (Cont'd). (Figure: panels (a)-(d) illustrating the perceptron learning algorithm.)

Probabilistic Generative Models. Compute the posterior probabilities from class-conditional densities and class priors.
Two classes:
p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) ) = σ(a) = 1 / (1 + exp(−a)),
where a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ].
Generalization to K > 2 classes:
p(C_k | x) = p(x | C_k) p(C_k) / ∑_j p(x | C_j) p(C_j) = exp(a_k) / ∑_j exp(a_j),
where a_k = ln p(x | C_k) p(C_k). The normalized exponential is also known as the softmax function, i.e. a smoothed version of the max function.
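
A minimal sketch of computing these posteriors via the logistic sigmoid and softmax from class-conditional log densities and priors; function names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = a - np.max(a)                 # stabilise the exponentials
    e = np.exp(a)
    return e / e.sum()

def posterior_two_class(log_px_c1, log_px_c2, prior1, prior2):
    a = (log_px_c1 + np.log(prior1)) - (log_px_c2 + np.log(prior2))
    return sigmoid(a)                 # p(C1 | x)

def posterior_k_class(log_px_ck, priors):
    a = np.asarray(log_px_ck) + np.log(priors)   # a_k = ln p(x|Ck) p(Ck)
    return softmax(a)                 # p(Ck | x) for all k
```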

Probabilistic Generative Models: Continuous Inputs. Posterior probabilities when the class-conditional densities are Gaussian with a shared covariance matrix:
p(x | C_k) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) }.
Two classes:
p(C_1 | x) = σ(w^T x + w_0),
with w = Σ^{−1}(μ_1 − μ_2) and w_0 = −(1/2) μ_1^T Σ^{−1} μ_1 + (1/2) μ_2^T Σ^{−1} μ_2 + ln[ p(C_1) / p(C_2) ].
The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space; the priors only shift the decision boundary, giving parallel contours.

Probabilistic Generative Models: Continuous Inputs (Cont'd). Generalization to K classes:
a_k(x) = w_k^T x + w_{k0},  with w_k = Σ^{−1} μ_k and w_{k0} = −(1/2) μ_k^T Σ^{−1} μ_k + ln p(C_k).
When the classes share the same covariance matrix, the decision boundaries are again linear. If each class-conditional density has its own covariance matrix, the cancellation no longer occurs and we obtain quadratic functions of x, giving rise to a quadratic discriminant.

Probabilistic Generative Models: Maximum Likelihood Solution. Determine the parameters of p(x | C_k) and p(C_k) by maximum likelihood from a training data set.
Two classes: data set {x_n, t_n}, n = 1, ..., N, with t_n ∈ {1, 0} denoting classes C_1 and C_2 respectively. Priors p(C_1) = π and p(C_2) = 1 − π; joint densities p(x_n, C_1) = p(C_1) p(x_n | C_1) = π N(x_n | μ_1, Σ) and p(x_n, C_2) = (1 − π) N(x_n | μ_2, Σ). The likelihood function is
p(t | π, μ_1, μ_2, Σ) = ∏_{n=1}^N [ π N(x_n | μ_1, Σ) ]^{t_n} [ (1 − π) N(x_n | μ_2, Σ) ]^{1 − t_n},
where t = (t_1, ..., t_N)^T.

Probabilistic Generative Models: Maximum Likelihood Solution (Cont'd). Two classes (cont'd).
Maximization of the likelihood with respect to π: the terms of the log likelihood that depend on π are ∑_n { t_n ln π + (1 − t_n) ln(1 − π) }; setting the derivative with respect to π to zero gives
π = (1/N) ∑_n t_n = N_1 / N = N_1 / (N_1 + N_2).
Maximization with respect to μ_1: the relevant terms are ∑_n t_n ln N(x_n | μ_1, Σ) = −(1/2) ∑_n t_n (x_n − μ_1)^T Σ^{−1} (x_n − μ_1) + const, giving
μ_1 = (1/N_1) ∑_n t_n x_n,  and analogously μ_2 = (1/N_2) ∑_n (1 − t_n) x_n.

Probabilistic Generative Models: Maximum Likelihood Solution (Cont'd). Two classes (cont'd).
Maximization of the likelihood with respect to the shared covariance matrix Σ: the relevant terms are
−(N/2) ln |Σ| − (N/2) Tr{ Σ^{−1} S },
where
S = (N_1/N) S_1 + (N_2/N) S_2 and S_k = (1/N_k) ∑_{n∈C_k} (x_n − μ_k)(x_n − μ_k)^T,
giving Σ = S, a weighted average of the covariance matrices associated with the two classes. As with any ML fit of a Gaussian, this estimate is not robust to outliers.
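
A minimal sketch of the two-class ML fit (π, μ_1, μ_2, shared Σ) and the resulting linear posterior parameters w and w_0; names are illustrative:

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """X: (N, D) inputs, t: (N,) labels in {0, 1} (1 denotes C1)."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                               # prior p(C1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)           # weighted-average covariance
    # Posterior p(C1|x) = sigmoid(w^T x + w0) with the parameters below.
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return w, w0

def posterior_c1(w, w0, x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```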

Probabilistic Generative Models: Discrete Features. Discrete feature values x_i ∈ {0, 1}. A general distribution over D binary inputs would correspond to a table of 2^D entries, i.e. the table size grows exponentially with the number of features. Under the naïve Bayes assumption, the features are treated as independent conditioned on the class C_k:
p(x | C_k) = ∏_{i=1}^D μ_{ki}^{x_i} (1 − μ_{ki})^{1 − x_i},
so that
a_k(x) = ln p(x | C_k) p(C_k) = ∑_{i=1}^D { x_i ln μ_{ki} + (1 − x_i) ln(1 − μ_{ki}) } + ln p(C_k),
which is again linear in the features, as in the continuous case.
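
A minimal sketch of a Bernoulli naïve Bayes classifier implementing a_k(x) above; the Laplace smoothing constant alpha is an added assumption, not part of the slides, and names are illustrative:

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, alpha=1.0):
    """X: (N, D) binary matrix, y: class labels 0..K-1, alpha: Laplace smoothing."""
    priors = np.array([(y == k).mean() for k in range(K)])
    mu = np.array([(X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
                   for k in range(K)])              # mu[k, i] = p(x_i = 1 | C_k)
    return priors, mu

def predict_bernoulli_nb(priors, mu, x):
    # a_k(x) = sum_i [ x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) ] + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    return np.argmax(a)
```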

Bayes Decision Boundaries: 2D. (Figure from Pattern Classification, Duda et al., p. 42.)

Bayes Decision Boundaries: 3D. (Figure from Pattern Classification, Duda et al., p. 43.)

Probabilistic Generative Models: Exponential Family. For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions. This generalizes to class-conditional densities from the exponential family,
p(x | λ_k) = h(x) g(λ_k) exp{ λ_k^T u(x) },
restricted to the subclass for which u(x) = x; introducing a scaling parameter s,
p(x | λ_k, s) = (1/s) h(x/s) g(λ_k) exp{ (1/s) λ_k^T x }.
Two classes: p(C_1 | x) = σ(a(x)) with
a(x) = (λ_1 − λ_2)^T x + ln g(λ_1) − ln g(λ_2) + ln p(C_1) − ln p(C_2).
K classes: p(C_k | x) = exp(a_k) / ∑_j exp(a_j) with
a_k(x) = λ_k^T x + ln g(λ_k) + ln p(C_k),
again linear with respect to x.

Three Approaches to Classification. (1) Discriminant functions. (2) Probabilistic generative models: fit class-conditional densities and class priors separately, then apply Bayes' theorem to find the posterior class probabilities; the posterior can be written as a logistic sigmoid acting on a linear function of x (two classes) or a softmax transformation of a linear function of x (multiclass), and the parameters of the densities as well as the class priors can be determined using maximum likelihood. (3) Probabilistic discriminative models: use the functional form of the generalized linear model explicitly and determine its parameters directly using maximum likelihood.

Fixed basis functions. Assume a fixed nonlinear transformation of the inputs, i.e. transform the inputs using a vector of basis functions φ(x). The resulting decision boundaries are linear in the feature space, which generally corresponds to nonlinear decision boundaries in the original input space.

Logistic regression. Logistic regression model: the posterior probability of class C_1 for the two-class problem is
p(C_1 | φ) = y(φ) = σ(w^T φ).
Number of adjustable parameters (M-dimensional feature space, two classes): Gaussian class-conditional densities (generative model) require 2M parameters for the means and M(M+1)/2 parameters for the shared covariance matrix, growing quadratically with M; logistic regression (discriminative model) requires only the M parameters of w, growing linearly with M.

Logistic regression (Cont'd). Determining the parameters using ML. Likelihood function:
p(t | w) = ∏_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n},  where y_n = σ(w^T φ_n).
Cross-entropy error function (negative log likelihood):
E(w) = −ln p(t | w) = −∑_n { t_n ln y_n + (1 − t_n) ln(1 − y_n) }.
The gradient of the error function w.r.t. w takes the same form as for the linear regression model:
∇E(w) = ∑_n (y_n − t_n) φ_n.
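
A minimal sketch of ML training for logistic regression by batch gradient descent using the gradient ∇E(w) = Φ^T(y − t); the step size eta and iteration count are illustrative choices, as are the names:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_gd(Phi, t, eta=0.1, iters=1000):
    """Phi: (N, M) design matrix of basis functions, t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)          # same form as for linear regression
        w -= eta * grad / N
    return w
```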

Iterative reweighted least squares. For the linear regression models of Chapter 3, the ML solution under the assumption of Gaussian noise has a closed form, as a consequence of the quadratic dependence of the log likelihood on the parameters w. For the logistic regression model there is no longer a closed-form solution, but the error function is convex and has a unique minimum, so an efficient iterative technique can be used: the Newton-Raphson update for minimizing a function E(w),
w^(new) = w^(old) − H^{−1} ∇E(w),
where H is the Hessian matrix of second derivatives of E(w).

Iterative reweighted least squares (Cont'd).
Sum-of-squares error function: ∇E(w) = Φ^T Φ w − Φ^T t and H = Φ^T Φ, so the Newton-Raphson update
w^(new) = w^(old) − (Φ^T Φ)^{−1} (Φ^T Φ w^(old) − Φ^T t) = (Φ^T Φ)^{−1} Φ^T t
recovers the standard least-squares solution in a single step.
Cross-entropy error function: ∇E(w) = Φ^T (y − t) and H = Φ^T R Φ, where R is diagonal with R_nn = y_n (1 − y_n). The Newton-Raphson update is
w^(new) = w^(old) − (Φ^T R Φ)^{−1} Φ^T (y − t) = (Φ^T R Φ)^{−1} Φ^T R z,  z = Φ w^(old) − R^{−1}(y − t),
a set of normal equations for a weighted least-squares problem; because R depends on w, the update must be applied iteratively (iterative reweighted least squares).
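
A minimal sketch of the IRLS update w ← w − (Φ^T R Φ)^{−1} Φ^T (y − t) for two-class logistic regression; names are illustrative, and no safeguards (e.g. regularization of the Hessian) are included:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, iters=10):
    """Phi: (N, M) design matrix, t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y)                       # diagonal of the weighting matrix
        grad = Phi.T @ (y - t)                  # gradient of E(w)
        H = Phi.T @ (Phi * R[:, None])          # Hessian Phi^T R Phi
        w = w - np.linalg.solve(H, grad)        # Newton-Raphson step
    return w
```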

Multiclass logistic regression. Posterior probabilities for multiclass classification:
p(C_k | φ) = y_k(φ) = exp(a_k) / ∑_j exp(a_j),  with a_k = w_k^T φ.
We can use ML to determine the parameters {w_k} directly. Likelihood function using the 1-of-K coding scheme:
p(T | w_1, ..., w_K) = ∏_{n=1}^N ∏_{k=1}^K y_{nk}^{t_{nk}},  where y_{nk} = y_k(φ_n).
Cross-entropy error function for the multiclass classification:
E(w_1, ..., w_K) = −ln p(T | w_1, ..., w_K) = −∑_n ∑_k t_{nk} ln y_{nk}.

Multiclass logistic regression (Cont'd). The derivative of the error function with respect to w_j is
∇_{w_j} E = ∑_n (y_{nj} − t_{nj}) φ_n,
again the product of the error and the basis function. The blocks of the Hessian matrix are
∇_{w_k} ∇_{w_j} E = ∑_n y_{nk} (I_{kj} − y_{nj}) φ_n φ_n^T,
and the IRLS algorithm can again be used for batch processing.
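
A minimal sketch of multiclass logistic (softmax) regression trained with the gradient ∇_{w_j}E = ∑_n (y_{nj} − t_{nj}) φ_n, using plain batch gradient descent rather than IRLS; T is one-hot and the names are illustrative:

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def train_softmax_gd(Phi, T, eta=0.1, iters=1000):
    """Phi: (N, M) design matrix, T: (N, K) one-hot targets."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(iters):
        Y = softmax_rows(Phi @ W)       # y_nk = p(C_k | phi_n)
        grad = Phi.T @ (Y - T)          # column j holds grad_{w_j} E
        W -= eta * grad / N
    return W
```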

Probit regression. For a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, this is not the case for all choices of class-conditional density, so it may be worth exploring other types of discriminative probabilistic model.

Probit regression (Cont'd). Noisy threshold model: set t_n = 1 if a_n = w^T φ_n ≥ θ and t_n = 0 otherwise, where the threshold θ is drawn from a density p(θ). The corresponding activation function is the cumulative distribution
f(a) = ∫_{−∞}^{a} p(θ) dθ.
When p(θ) is a zero-mean, unit-variance Gaussian, f is the probit function
Φ(a) = ∫_{−∞}^{a} N(θ | 0, 1) dθ,
which has a sigmoidal shape. The generalized linear model based on a probit activation function is known as probit regression.
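
A minimal sketch of the probit activation expressed through the error function, Φ(a) = (1/2)(1 + erf(a/√2)), together with a probit-regression prediction; names are illustrative:

```python
import numpy as np
from math import erf, sqrt

def probit(a):
    return 0.5 * (1.0 + erf(a / sqrt(2.0)))    # standard normal CDF

def predict_probit(w, phi):
    return probit(np.dot(w, phi))              # p(t = 1 | phi) = Phi(w^T phi)
```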

Canonical link functions. We have seen that for some models the derivative of the error function with respect to the parameter vector w takes the form of the error times the feature vector: this holds for the logistic regression model with the sigmoid activation function and for the multiclass model with the softmax activation function. This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with the corresponding choice of activation function known as the canonical link function.

Canonical link functions (Cont'd). Conditional distribution of the target variable (restricted exponential family):
p(t | η, s) = (1/s) h(t/s) g(η) exp{ η t / s },
with conditional mean y ≡ E[t | η] = −s (d/dη) ln g(η), written as η = ψ(y).
Log likelihood:
ln p(t | η, s) = ∑_{n=1}^N { ln g(η_n) + η_n t_n / s } + const.
The derivative of the log likelihood with respect to w is
∇_w ln p(t | η, s) = ∑_n (1/s) { t_n − y_n } ψ'(y_n) f'(a_n) φ_n,  where a_n = w^T φ_n and y_n = f(a_n).
The canonical link function is f^{−1}(y) = ψ(y); then f'(a) ψ'(y) = 1 and the gradient simplifies to
∇E(w) = (1/s) ∑_n (y_n − t_n) φ_n.

The Laplace approximation. We cannot integrate exactly over the parameter vector, since the posterior is no longer Gaussian. The Laplace approximation finds a Gaussian approximation centered on a mode of the distribution. For p(z) = f(z)/Z in one dimension, a second-order Taylor expansion of the logarithm of the target function around a mode z_0 gives
ln f(z) ≃ ln f(z_0) − (1/2) A (z − z_0)^2,  where A = −(d²/dz²) ln f(z) |_{z = z_0},
resulting in the approximating Gaussian distribution
q(z) = (A / 2π)^{1/2} exp{ −(1/2) A (z − z_0)^2 }.

The Laplace approximation (Cont'd). M-dimensional case: for p(z) = f(z)/Z with mode z_0,
ln f(z) ≃ ln f(z_0) − (1/2) (z − z_0)^T A (z − z_0),  where A = −∇∇ ln f(z) |_{z = z_0},
giving
q(z) = ( |A|^{1/2} / (2π)^{M/2} ) exp{ −(1/2) (z − z_0)^T A (z − z_0) } = N(z | z_0, A^{−1}).
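
A minimal sketch of the M-dimensional Laplace approximation: Newton iterations locate a mode z_0 of ln f, and the Gaussian covariance is A^{−1} with A = −∇∇ ln f(z_0). The callables grad_lnf and hess_lnf are assumed to be supplied by the user; names are illustrative:

```python
import numpy as np

def laplace_approximation(z_init, grad_lnf, hess_lnf, iters=50, tol=1e-8):
    """grad_lnf(z), hess_lnf(z): gradient and Hessian of ln f at z."""
    z = np.array(z_init, dtype=float)
    for _ in range(iters):
        step = np.linalg.solve(hess_lnf(z), grad_lnf(z))
        z = z - step                       # Newton update toward the mode z0
        if np.linalg.norm(step) < tol:
            break
    A = -hess_lnf(z)                       # precision of the Gaussian approximation
    return z, np.linalg.inv(A)             # mean z0 and covariance A^{-1}
```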

Model comparison and BIC. The Laplace approximation to the normalization constant Z is
Z = ∫ f(z) dz ≃ f(z_0) (2π)^{M/2} / |A|^{1/2}.
This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison. Consider a set of models with parameters θ; the log of the model evidence can be approximated as
ln p(D) ≃ ln p(D | θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln |A|,
where A = −∇∇ ln p(D | θ_MAP) p(θ_MAP) and the last three terms form the Occam factor penalizing model complexity. With the further assumptions of a broad Gaussian prior over the parameters and a full-rank Hessian, this gives the Bayesian Information Criterion (BIC):
ln p(D) ≃ ln p(D | θ_MAP) − (1/2) M ln N.

Bayesian Logistic Regression. Exact Bayesian inference for logistic regression is intractable. Gaussian prior: p(w) = N(w | m_0, S_0). Posterior: p(w | t) ∝ p(w) p(t | w). Log posterior:
ln p(w | t) = −(1/2) (w − m_0)^T S_0^{−1} (w − m_0) + ∑_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) } + const,
where y_n = σ(w^T φ_n). Laplace approximation of the posterior distribution: q(w) = N(w | w_MAP, S_N), with
S_N^{−1} = S_0^{−1} + ∑_{n=1}^N y_n (1 − y_n) φ_n φ_n^T.
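
A minimal sketch of the Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression, assuming an isotropic prior N(w | 0, α^{−1} I); names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, alpha=1.0, iters=20):
    """Returns w_MAP and the posterior covariance S_N of q(w) = N(w_MAP, S_N)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):                       # Newton iterations for the MAP estimate
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + alpha * w       # gradient of the negative log posterior
        H = Phi.T @ (Phi * (y * (1 - y))[:, None]) + alpha * np.eye(M)
        w = w - np.linalg.solve(H, grad)
    y = sigmoid(Phi @ w)
    S_N_inv = alpha * np.eye(M) + Phi.T @ (Phi * (y * (1 - y))[:, None])
    return w, np.linalg.inv(S_N_inv)
```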

Predictive distribution. The predictive distribution for class C_1, given a new feature vector φ, is obtained by marginalizing with respect to the posterior p(w | t), which is approximated by the Gaussian q(w):
p(C_1 | φ, t) = ∫ p(C_1 | φ, w) p(w | t) dw ≃ ∫ σ(w^T φ) q(w) dw = ∫ σ(a) p(a) da,
where a = w^T φ and p(a), the marginal distribution of a linear function of a Gaussian random variable, is itself Gaussian:
p(a) = N(a | μ_a, σ_a²),  with μ_a = w_MAP^T φ and σ_a² = φ^T S_N φ.

Predictive distribution (Cont'd). The resulting approximation to the predictive distribution is
p(C_1 | t) ≃ ∫ σ(a) N(a | μ_a, σ_a²) da.
To integrate over a, we use the close similarity between the logistic sigmoid σ(a) and the probit function Φ(λa), with λ² = π/8:
∫ Φ(λa) N(a | μ, σ²) da = Φ( μ / (λ^{−2} + σ²)^{1/2} ),
so that
∫ σ(a) N(a | μ, σ²) da ≃ σ( κ(σ²) μ ),  where κ(σ²) = (1 + π σ² / 8)^{−1/2}.
Finally we obtain
p(C_1 | φ, t) ≃ σ( κ(σ_a²) μ_a ).
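
A minimal sketch of the approximate predictive distribution σ(κ(σ_a²) μ_a), reusing w_MAP and S_N from the Laplace-approximation sketch above; names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive(phi, w_map, S_N):
    mu_a = w_map @ phi                      # mean of a = w^T phi under q(w)
    sigma_a2 = phi @ S_N @ phi              # variance of a under q(w)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return sigmoid(kappa * mu_a)            # approximate p(C1 | phi, t)
```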