LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception


LINEAR MODELS FOR CLASSIFICATION

Classification: Problem Statement
In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input variable x is still continuous, but the target variable t is discrete. In the simplest case, t can take only two values.

Example Problem
Animal or Vegetable?

Linear models for classification separate input vectors into classes using linear decision boundaries.
Example: an input vector x and two discrete classes $C_1$ and $C_2$.
(Figure: a two-dimensional scatter of the two classes separated by a linear decision boundary.)

Discriminant Functions
A linear discriminant function $y(x) = f(w^T x + w_0)$ maps a real input vector x to a scalar value y(x). $f(\cdot)$ is called an activation function.

Outline
- Linear activation functions
  - Least-squares formulation
  - Fisher's linear discriminant
- Nonlinear activation functions
  - Probabilistic generative models
  - Probabilistic discriminative models
    - Logistic regression
    - Bayesian logistic regression

Two-Class Discriminant Function
Let $f(\cdot)$ be the identity, so that $y(x) = w^T x + w_0$.
$y(x) \ge 0 \Rightarrow$ x assigned to $C_1$
$y(x) < 0 \Rightarrow$ x assigned to $C_2$
Thus $y(x) = 0$ defines the decision boundary.
(Figure: in the $(x_1, x_2)$ plane, the line $y = 0$ separates regions $R_1$ and $R_2$; w is normal to the boundary and $w_0$ sets its offset.)

K > 2 Classes
Idea #1: Use $K-1$ discriminant functions, each of which separates one class $C_k$ from the rest (one-versus-the-rest classifier).
Problem: ambiguous regions.
(Figure: the boundaries '$C_1$ vs. not $C_1$' and '$C_2$ vs. not $C_2$' leave a region whose class is ambiguous.)

K > 2 Classes
Idea #2: Use $K(K-1)/2$ discriminant functions, each of which separates two classes $C_j$, $C_k$ from each other (one-versus-one classifier). Each point is classified by majority vote.
Problem: ambiguous regions remain.
(Figure: the pairwise boundaries between $C_1$, $C_2$ and $C_3$ leave a central region where the vote is tied.)

K > 2 Classes
Idea #3: Use $K$ discriminant functions $y_k(x) = w_k^T x + w_{k0}$ and use the magnitude of $y_k(x)$, not just the sign:
x is assigned to $C_k$ if $y_k(x) > y_j(x)\ \forall j \ne k$.
The decision boundary between classes $C_k$ and $C_j$ is $y_k(x) = y_j(x)$, i.e. $(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$.
This results in decision regions that are simply connected and convex.
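As a concrete illustration of Idea #3, here is a minimal NumPy sketch; the weight values below are arbitrary placeholders, not taken from the lecture:

```python
import numpy as np

# K = 3 linear discriminants y_k(x) = w_k^T x + w_k0; rows of W are the w_k.
# These weights are arbitrary placeholders, purely for illustration.
W = np.array([[ 1.0,  0.5],
              [-0.2,  1.0],
              [-0.8, -1.5]])        # shape (K, D)
w0 = np.array([0.1, -0.3, 0.0])     # shape (K,)

def classify(X):
    """Assign each row of X to the class whose discriminant value is largest."""
    scores = X @ W.T + w0           # shape (N, K)
    return np.argmax(scores, axis=1)

print(classify(np.array([[2.0, 1.0], [-1.0, 3.0]])))  # class index per input
```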

Learning the Parameters
Method #1: Least squares.
$y_k(x) = w_k^T x + w_{k0}$, or in vector form $y(x) = \tilde{W}^T \tilde{x}$, where $\tilde{x} = (1, x^T)^T$ and $\tilde{W}$ is a $(D+1) \times K$ matrix whose $k$th column is $\tilde{w}_k = (w_{k0}, w_k^T)^T$.

Learning the Parameters
Method #1: Least squares, $y(x) = \tilde{W}^T \tilde{x}$.
Training dataset: $(x_n, t_n)$, $n = 1, \ldots, N$, where we use the 1-of-K coding scheme for $t_n$.
Let $T$ be the $N \times K$ matrix whose $n$th row is $t_n^T$, and let $\tilde{X}$ be the $N \times (D+1)$ matrix whose $n$th row is $\tilde{x}_n^T$.
We define the error as
$E_D(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\{(\tilde{X}\tilde{W} - T)^T(\tilde{X}\tilde{W} - T)\}$.
Setting the derivative with respect to $\tilde{W}$ to zero yields
$\tilde{W} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T T = \tilde{X}^\dagger T$.
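A minimal NumPy sketch of this least-squares fit, assuming `X` is an N x D data matrix and `labels` holds integer class indices (the function names are illustrative, not from the lecture):

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Least-squares classifier: W_tilde = pinv(X_tilde) @ T with 1-of-K target coding."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias feature
    T = np.eye(K)[labels]                                # 1-of-K coding, shape (N, K)
    return np.linalg.pinv(X_tilde) @ T                   # W_tilde, shape (D+1, K)

def predict(W_tilde, X):
    """Assign each input to the class with the largest linear output."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```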

Fisher's Linear Discriminant
Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.
Let $m_1 = \frac{1}{N_1}\sum_{n \in C_1} x_n$ and $m_2 = \frac{1}{N_2}\sum_{n \in C_2} x_n$.
For example, we might choose w to maximize $w^T(m_2 - m_1)$, subject to $\|w\| = 1$. This leads to $w \propto m_2 - m_1$.
However, if the class-conditional distributions are not isotropic, this is typically not optimal.
(Figure: two elongated class clusters; projecting onto $m_2 - m_1$ leaves the classes poorly separated.)

Fisher's Linear Discriminant
Let $m_1' = w^T m_1$ and $m_2' = w^T m_2$ be the class means on the 1D subspace, and let $s_k^2 = \sum_{n \in C_k}(y_n - m_k')^2$ be the within-class variance on the subspace for class $C_k$.
The Fisher criterion is then
$J(w) = \frac{(m_2' - m_1')^2}{s_1^2 + s_2^2}$.
This can be rewritten as
$J(w) = \frac{w^T S_B w}{w^T S_W w}$,
where
$S_B = (m_2 - m_1)(m_2 - m_1)^T$ is the between-class covariance matrix, and
$S_W = \sum_{n \in C_1}(x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2}(x_n - m_2)(x_n - m_2)^T$ is the within-class covariance matrix.
$J(w)$ is maximized for $w \propto S_W^{-1}(m_2 - m_1)$.
(Figure: projecting the same two clusters onto $S_W^{-1}(m_2 - m_1)$ separates them well.)
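A short NumPy sketch of the Fisher direction, assuming the two classes are given as row-wise data matrices `X1` and `X2` (an illustrative helper, not from the lecture):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w proportional to S_W^{-1} (m2 - m1); rows of X1, X2 are data points."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                         # solve rather than invert
    return w / np.linalg.norm(w)                              # overall scale is arbitrary
```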

Connection Between Least Squares and Fisher's Linear Discriminant
Change the target coding scheme to $t_n = N/N_1$ for $C_1$ and $t_n = -N/N_2$ for $C_2$.
Then one can show that the least-squares (maximum likelihood) solution for w satisfies $w \propto S_W^{-1}(m_2 - m_1)$.

Least Squares Classifier
Problem #1: sensitivity to outliers.
(Figure: two scatter plots of the same two classes, without and with distant outliers; the outliers markedly shift the least-squares decision boundary.)

Least Squares Classifier
Problem #2: a linear activation function is not a good fit to binary target data. This can lead to problems.
(Figure: an example where the resulting decision boundaries classify part of the data poorly.)

Outline
- Linear activation functions
  - Least-squares formulation
  - Fisher's linear discriminant
- Nonlinear activation functions
  - Probabilistic generative models
  - Probabilistic discriminative models
    - Logistic regression
    - Bayesian logistic regression

Probabilistic Generative Models
Consider first $K = 2$. By Bayes' rule, the posterior for class $C_1$ can be written
$p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$,
where
$a = \ln\frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}$
and $\sigma(a)$ is the logistic sigmoid function.
(Figure: the logistic sigmoid $\sigma(a)$.)
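A quick numerical check of this identity, using arbitrary illustrative values for the class-conditional densities and priors at some point x:

```python
import numpy as np

px_c1, px_c2 = 0.30, 0.05   # p(x|C1), p(x|C2) at some x (made-up values)
p_c1, p_c2 = 0.4, 0.6       # class priors (made-up values)

posterior_direct = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)

a = np.log((px_c1 * p_c1) / (px_c2 * p_c2))     # log posterior odds
posterior_sigmoid = 1.0 / (1.0 + np.exp(-a))    # sigma(a)

print(posterior_direct, posterior_sigmoid)      # both 0.8, as the identity requires
```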

Probabilistic Generative Models
Let's assume that the input vector x is multivariate normal when conditioned on the class $C_k$, and that the covariance is the same for all classes:
$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right\}$.
Then we have that $p(C_1 \mid x) = \sigma(w^T x + w_0)$, where
$w = \Sigma^{-1}(\mu_1 - \mu_2)$
$w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(C_1)}{p(C_2)}$.
Thus we have a generalized linear model, and the decision surfaces are hyperplanes in the input space.
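A minimal NumPy sketch of this posterior, assuming the Gaussian parameters and class priors are given (names are illustrative):

```python
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, p1, p2):
    """p(C1|x) = sigma(w^T x + w0) for shared-covariance Gaussian class conditionals."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```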

Probabilistic Generative Models
This result generalizes to $K > 2$ classes:
$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ (the softmax function),
where $a_k = \ln\left(p(x \mid C_k)\,p(C_k)\right)$.
Then we have that $a_k(x) = w_k^T x + w_{k0}$, where
$w_k = \Sigma^{-1}\mu_k$
$w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)$.
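The corresponding multi-class sketch, again assuming the class means, shared covariance and priors are given (illustrative only):

```python
import numpy as np

def softmax_posteriors(x, mus, Sigma, priors):
    """p(C_k|x) from a_k(x) = w_k^T x + w_k0 with a shared covariance matrix."""
    Sigma_inv = np.linalg.inv(Sigma)
    a = np.array([(Sigma_inv @ mu) @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
                  for mu, p in zip(mus, priors)])
    a -= a.max()                     # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()               # softmax over the K activations
```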

Non-Constant Covariance
If the class-conditional covariances are different, the generative decision boundaries are in general quadratic.
(Figure: Gaussian classes with differing covariances produce curved decision boundaries.)

ML for the Probabilistic Generative Model
Let $t_n = 1$ denote Class 1 and $t_n = 0$ denote Class 2, and let $\pi = p(C_1)$, so that $1 - \pi = p(C_2)$.
Then the maximum likelihood estimates of the parameters are
$\pi = \frac{N_1}{N_1 + N_2}$
$\mu_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n x_n, \qquad \mu_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n) x_n$
$\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$,
where
$S_1 = \frac{1}{N_1}\sum_{n \in C_1}(x_n - \mu_1)(x_n - \mu_1)^T$ and $S_2 = \frac{1}{N_2}\sum_{n \in C_2}(x_n - \mu_2)(x_n - \mu_2)^T$.
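A minimal NumPy sketch of these ML estimates, assuming `X` holds the inputs as rows and `t` is the binary target vector (illustrative helper):

```python
import numpy as np

def fit_generative_model(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for the shared-covariance Gaussian model.
    X: (N, D) inputs; t: (N,) binary targets, t=1 for Class 1 and t=0 for Class 2."""
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)
    return pi, mu1, mu2, Sigma
```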

Probabilistic Discriminative Models
An alternative to the generative approach is to model the dependence of the target variable t on the input vector x directly, using the activation function f. One big advantage is that there will typically be fewer parameters to determine.

Logistic Regression (K = 2)
$p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi)$
$p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$
where $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid applied to a linear function of the feature vector $\phi$.

Logistic Regression
$p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi)$, $\quad p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
Number of parameters for an M-dimensional feature space:
Logistic regression: $M$
Generative model: $2M + M(M+1)/2 + 1 = M(M+5)/2 + 1$

ML for Logistic Regression
$p(t \mid w) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}$, where $t = (t_1, \ldots, t_N)^T$ and $y_n = p(C_1 \mid \phi_n)$.
We define the error function to be $E(w) = -\ln p(t \mid w)$.
Given $y_n = \sigma(a_n)$ and $a_n = w^T \phi_n$, one can show that
$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\phi_n$.
Unfortunately, there is no closed-form solution for w.
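Since the gradient has this simple form, plain gradient descent is easy to sketch (the learning rate and iteration count below are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(Phi, t, lr=0.1, n_iters=1000):
    """Minimize E(w) = -ln p(t|w) by gradient descent, using grad E = Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        w -= lr * (Phi.T @ (y - t))    # sum_n (y_n - t_n) phi_n
    return w
```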

ML for Logistic Regression: Iterative Reweighted Least Squares
Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex, so an appropriate iterative method is guaranteed to find the global minimum. A good method is to use a local quadratic approximation to the log likelihood function (the Newton-Raphson update):
$w^{(\text{new})} = w^{(\text{old})} - H^{-1}\nabla E(w)$,
where $H$ is the Hessian matrix of $E(w)$.

ML for Logistic Regression
$w^{(\text{new})} = w^{(\text{old})} - H^{-1}\nabla E(w)$, where $H$ is the Hessian matrix of $E(w)$:
$H = \Phi^T R \Phi$,
where $R$ is the $N \times N$ diagonal weight matrix with $R_{nn} = y_n(1 - y_n)$.
(Note that since $R_{nn} \ge 0$, $R$ is positive semi-definite, and hence $H$ is positive semi-definite; thus $E(w)$ is convex.)
Thus
$w^{(\text{new})} = w^{(\text{old})} - (\Phi^T R \Phi)^{-1}\Phi^T(y - t)$.
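A minimal NumPy sketch of the IRLS (Newton-Raphson) update above; in practice a small ridge term is sometimes added if $H$ is near-singular (illustrative only):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iters=20):
    """Iterate w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t) with R_nn = y_n (1 - y_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                    # N x N diagonal weight matrix
        H = Phi.T @ R @ Phi                           # Hessian of E(w)
        w = w - np.linalg.solve(H, Phi.T @ (y - t))   # Newton step
    return w
```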

ML for Logistic Regression: Iterative Reweighted Least Squares
(Figure: successive IRLS estimates of $(w_1, w_2)$ for the model $p(C_1 \mid \phi) = \sigma(w^T \phi)$, converging to the maximum likelihood solution.)

Bayesian Logistic Regression
We can make logistic regression Bayesian by applying a prior over w:
$p(w) = \mathcal{N}(w \mid m_0, S_0)$.
Unfortunately, the posterior over w will not be normal for logistic regression, and hence we cannot integrate over it analytically. This means that we cannot do Bayesian prediction analytically. However, there are methods for approximating the posterior that allow us to do approximate Bayesian prediction.

The Laplace Approximation
In the Laplace approximation, we approximate the log of a distribution by a local, second-order (quadratic) form, centred at the mode. This corresponds to a normal approximation to the distribution, with
- mean given by the mode of the original distribution
- precision matrix given by the Hessian of the negative log of the distribution.
(Figure: left, a non-Gaussian density $p(z)$ and its Gaussian approximation; right, the corresponding negative log densities, which agree to second order at the mode.)

Bayesian Logistic Regression
When applied to the posterior over w in logistic regression, this yields
$p(w \mid t) \approx q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N)$,
where
$S_N^{-1} = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\phi_n\phi_n^T$.
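A sketch of this Laplace approximation in NumPy, assuming a Gaussian prior $\mathcal{N}(m_0, S_0)$ and finding $w_{\text{MAP}}$ by Newton steps on the negative log posterior (an illustrative helper, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iters=50):
    """Return (w_MAP, S_N) for q(w) = N(w | w_MAP, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + S0_inv @ (w - m0)            # gradient of -log posterior
        H = Phi.T @ np.diag(y * (1 - y)) @ Phi + S0_inv       # Hessian of -log posterior
        w = w - np.linalg.solve(H, grad)                      # Newton step toward w_MAP
    y = sigmoid(Phi @ w)
    S_N_inv = S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi     # S_N^{-1} from the slide
    return w, np.linalg.inv(S_N_inv)
```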

Prediction
Bayesian prediction requires that we integrate over this posterior on w:
$p(C_1 \mid \phi, t) = \int p(C_1 \mid \phi, w)\,p(w \mid t)\,dw \approx \int \sigma(w^T \phi)\,q(w)\,dw$.
This integral is not tractable analytically. However, approximating the sigmoid function $\sigma(\cdot)$ by the inverse probit (cumulative normal) function yields the analytical approximation
$p(C_1 \mid \phi, t) \approx \sigma\!\left(\kappa(\sigma_a^2)\,\mu_a\right)$,
where $\mu_a = w_{\text{MAP}}^T \phi$, $\sigma_a^2 = \phi^T S_N \phi$, and $\kappa(\sigma_a^2) = (1 + \pi\sigma_a^2/8)^{-1/2}$.
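A minimal sketch of this approximate predictive distribution, assuming `w_map` and `S_N` come from the Laplace approximation above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bayesian_predict(phi, w_map, S_N):
    """Approximate p(C1 | phi, t) = sigma(kappa(sigma_a^2) * mu_a)."""
    mu_a = w_map @ phi                                  # mean of the activation a = w^T phi
    sigma_a2 = phi @ S_N @ phi                          # variance of the activation
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return sigmoid(kappa * mu_a)
```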

Bayesian Logistic Regression
(Figure: the approximation $\int \sigma(w^T \phi)\,q(w)\,dw \approx \sigma(\kappa(\sigma_a^2)\,\mu_a)$ plotted against the exact integral; the residual is very small.)
This last approximation is excellent!