LINEAR CLASSIFIERS. J. Elder. CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition. Last updated: Oct 22, 2012.


Problems. Please do Problem 8.3 in the textbook. We will discuss this in class.

Classification: Problem Statement. In regression, we model the relationship between a continuous input variable x and a continuous target variable t. In classification, the input variable x may still be continuous, but the target variable is discrete. In the simplest case, t can take only 2 values, e.g., t = +1 if x is assigned to C_1 and t = -1 if x is assigned to C_2.

Example Problem: Animal or Vegetable?

Discriminative Classifiers. If the class-conditional distributions are normal, the best thing to do is to estimate the parameters of these distributions and use Bayesian decision theory to classify input vectors. Decision boundaries are then generally quadratic. However, if the conditional distributions are not exactly normal, this generative approach will yield sub-optimal results. An alternative is to build a discriminative classifier, which models the decision boundary directly rather than the conditional distributions themselves.

Linear Models for Classification. Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries. Example: a 2D input vector x and two discrete classes C_1 and C_2. [Figure: scatter plot of the two classes in the (x_1, x_2) plane, separated by a linear decision boundary.]

Two-Class Discriminant Function. y(x) = w^T x + w_0. If y(x) >= 0, x is assigned to C_1; if y(x) < 0, x is assigned to C_2. Thus y(x) = 0 defines the decision boundary. [Figure: the boundary y = 0 separating regions R_1 (y > 0) and R_2 (y < 0) in the (x_1, x_2) plane, with w normal to the boundary and w_0 setting its offset from the origin.]

Two-Class Discriminant Function. y(x) = w^T x + w_0, with y(x) >= 0 meaning x is assigned to C_1 and y(x) < 0 meaning x is assigned to C_2. For convenience, absorb the bias into the weight vector: redefine w = (w_0, w_1, ..., w_M)^T and x = (1, x_1, ..., x_M)^T, so that we can express y(x) = w^T x.
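
To make the augmented decision rule concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture); the weights and test points are hypothetical.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each input so the bias w_0 is absorbed into w."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def classify(w, X):
    """Return +1 (class C_1) where y(x) = w^T x_aug >= 0, else -1 (class C_2)."""
    y = augment(X) @ w
    return np.where(y >= 0, 1, -1)

# Hypothetical weights (w_0, w_1, w_2) and two test points, purely for illustration.
w = np.array([-1.0, 2.0, 1.0])
X = np.array([[1.0, 0.5], [-2.0, 0.0]])
print(classify(w, X))  # [ 1 -1]
```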

Generalized Linear Models. For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector either into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1. For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically constrains y to lie between -1 and 1 or between 0 and 1: y(x) = f(w^T x + w_0).

The Perceptron. y(x) = f(w^T x + w_0), with y(x) >= 0 meaning x is assigned to C_1 and y(x) < 0 meaning x is assigned to C_2. A classifier based upon this simple generalized linear model is called a (single-layer) perceptron. It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.

End of Lecture Oct 15, 2012

Parameter Learning. How do we learn the parameters of a perceptron?

Outline: The Perceptron Algorithm; Least-Squares Classifiers; Fisher's Linear Discriminant; Logistic Classifiers.

Case 1: Linearly Separable Inputs. For starters, let's assume that the training data are in fact perfectly linearly separable. In other words, there exists at least one hyperplane (one set of weights) that yields zero classification error. We seek an algorithm that can automatically find such a hyperplane.

The Perceptron Algorithm. The perceptron algorithm was invented by Frank Rosenblatt (1962). The algorithm is iterative: the strategy is to start with a random guess at the weights w, and then to iteratively change the weights so as to move the hyperplane in a direction that lowers the classification error. [Photo: Frank Rosenblatt (1928-1971).]

The Perceptron Algorithm. Note that as we change the weights continuously, the classification error changes in a discontinuous, piecewise-constant fashion. Thus we cannot use the classification error itself as our objective function to minimize. What would be a better objective function?

The Perceptron Criterion. We seek w such that w^T x >= 0 when t = +1 and w^T x < 0 when t = -1. In other words, we would like w^T x_n t_n >= 0 for all n. Thus we seek to minimize E_P(w) = -Σ_{n∈M} w^T x_n t_n, where M is the set of misclassified inputs.

The Perceptron Criterion. E_P(w) = -Σ_{n∈M} w^T x_n t_n, where M is the set of misclassified inputs. Observations: E_P(w) is always non-negative; E_P(w) is continuous and piecewise linear, and thus easier to minimize. [Figure: E_P(w) plotted against a weight component w_i, showing a continuous, piecewise-linear function.]

The Perceptron Algorithm. E_P(w) = -Σ_{n∈M} w^T x_n t_n, where M is the set of misclassified inputs. Where the derivative exists, dE_P(w)/dw = -Σ_{n∈M} x_n t_n. Gradient descent: w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n.

The Perceptron Algorithm. w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n. Why does this make sense? If an input from C_1 (t = +1) is misclassified, we need to make its projection onto w more positive. If an input from C_2 (t = -1) is misclassified, we need to make its projection onto w more negative.

The Perceptron Algorithm. The algorithm can be implemented sequentially. Repeat until convergence: for each input (x_n, t_n), if it is correctly classified, do nothing; if it is misclassified, update the weight vector to w^(τ+1) = w^(τ) + η x_n t_n. Note that this lowers the contribution of input n to the objective function: -(w^(τ+1))^T x_n t_n = -(w^(τ))^T x_n t_n - η (x_n t_n)^T x_n t_n < -(w^(τ))^T x_n t_n.
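
As an illustration only (not the lecture's code), here is a minimal NumPy sketch of this sequential update rule; it assumes inputs have already been augmented with a leading 1 and targets are coded as +1/-1.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100, seed=0):
    """Sequential perceptron algorithm on augmented inputs X (N x (D+1)), targets t in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])        # random initial guess at the weights
    for _ in range(max_epochs):
        n_mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) < 0:        # misclassified: w^T x_n t_n < 0
                w = w + eta * x_n * t_n    # perceptron update
                n_mistakes += 1
        if n_mistakes == 0:                # all inputs correctly classified
            break
    return w
```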

Not Monotonic. While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase. Also, inputs that had previously been classified correctly may now be misclassified. As a result, the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.

The Perceptron Convergence Theorem. Despite this non-monotonicity, if the data are in fact linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).

Example. [Figure: four snapshots of the perceptron algorithm on a 2D linearly separable dataset, showing the decision boundary converging to a separating hyperplane.]

The First Learning Machine. Mark 1 Perceptron hardware (c. 1960): visual inputs; a patch board allowing configuration of the inputs φ; a rack of adaptive weights w (motor-driven potentiometers).

Practical Limitations. The Perceptron Convergence Theorem is an important result. However, there are practical limitations: convergence may be slow; if the data are not separable, the algorithm will not converge; we will only know that the data are separable once the algorithm converges; and the solution is in general not unique, depending upon initialization, the scheduling of input vectors, and the learning rate η.

Generalization to Inputs That Are Not Linearly Separable. The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable. Example: the Pocket Algorithm (Gallant, 1990). Idea: run the perceptron algorithm, but keep track of the weight vector w* that has produced the lowest classification error achieved so far. It can be shown that w* will converge to an optimal solution with probability 1.
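
A minimal sketch of the pocket idea, under the same assumptions as the perceptron sketch above (augmented inputs, +1/-1 targets); this is my own illustration, not the lecture's code.

```python
import numpy as np

def pocket_train(X, t, eta=1.0, max_epochs=100, seed=0):
    """Perceptron updates, but keep ('pocket') the weights with the fewest training errors seen so far."""
    rng = np.random.default_rng(seed)

    def n_errors(v):
        return int(np.sum(t * (X @ v) < 0))    # training errors under weights v

    w = rng.normal(size=X.shape[1])
    w_pocket, best = w.copy(), n_errors(w)
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) < 0:
                w = w + eta * x_n * t_n        # ordinary perceptron update
                e = n_errors(w)
                if e < best:                   # best so far: update the pocket
                    w_pocket, best = w.copy(), e
    return w_pocket
```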

Generalization to Multiclass Problems. How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?

K > 2 Classes. Idea #1: just use K-1 discriminant functions, each of which separates one class C_k from the rest (one-versus-the-rest classifier). Problem: ambiguous regions. [Figure: boundaries for C_1 vs. not C_1 and C_2 vs. not C_2 carve the plane into regions R_1, R_2, R_3, leaving a region whose class assignment is ambiguous.]

K > 2 Classes. Idea #2: use K(K-1)/2 discriminant functions, each of which separates two classes C_j, C_k from each other (one-versus-one classifier). Each point is classified by majority vote. Problem: ambiguous regions. [Figure: pairwise boundaries between C_1, C_2 and C_3 produce regions R_1, R_2, R_3 plus a central region where the vote is tied.]

K > 2 Classes. Idea #3: use K discriminant functions y_k(x) and use the magnitude of y_k(x), not just the sign: y_k(x) = w_k^T x, and x is assigned to C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between classes k and j is y_k(x) = y_j(x), i.e., (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0. This results in decision regions that are simply connected and convex. [Figure: convex decision regions R_i, R_j, R_k; any point on the line segment between x_A and x_B in R_k also lies in R_k.]
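
As an illustration (mine, not the lecture's), the argmax decision rule for K linear discriminants on augmented inputs can be written as follows; the weight matrix W is assumed to stack one augmented weight vector per column.

```python
import numpy as np

def classify_multiclass(W, X):
    """W: (D+1) x K, one augmented weight vector w_k per column; X: N x (D+1) augmented inputs.
    Each input is assigned to the class with the largest discriminant value y_k(x) = w_k^T x."""
    Y = X @ W                      # N x K matrix of discriminant values
    return np.argmax(Y, axis=1)    # index of the winning class for each input
```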

Example: Kesler's Construction. The perceptron algorithm can be generalized to K-class classification problems. Kesler's construction allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i. Inputs are then classified in class i if and only if w_i^T x > w_j^T x for all j ≠ i. The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.

1-of-K Coding Scheme. When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme: an input from class C_i receives the target t = [0 ... 0 1 0 ... 0]^T, with the 1 in element i.
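
A small helper (my own sketch) that builds the 1-of-K target matrix used later by the least-squares method; class labels are assumed to be integers 0 to K-1 rather than 1 to K.

```python
import numpy as np

def one_of_k(labels, K):
    """Encode integer class labels (0..K-1) as rows of a 1-of-K target matrix T (N x K)."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

print(one_of_k([2, 0, 1], K=3))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```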

Computational Limitations of Perceptrons. Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing. However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function. This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron. [Figure: the XOR pattern in the (x_1, x_2) plane, which no single linear boundary can separate; photo of Marvin Minsky.]

Multi-Layer Perceptrons. Minsky & Papert's book was widely misinterpreted as showing that artificial neural networks were inherently limited. This contributed to a decline in the reputation of neural network research through the 1970s and 1980s. However, their findings apply only to single-layer perceptrons. Multilayer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.

Outline: The Perceptron Algorithm; Least-Squares Classifiers; Fisher's Linear Discriminant; Logistic Classifiers.

Dealing with Non-Linearly Separable Inputs. The perceptron algorithm fails when the training data are not perfectly linearly separable. Let's now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.

The Least Squares Method. In the least squares method, we simply fit the (x, t) observations with a hyperplane y(x). Note that this is a somewhat odd idea, since the t values are binary (when K = 2), e.g., 0 or 1. However, it can work reasonably well.

Least Squares: Learning the Parameters. Assume D-dimensional input vectors x. For each class k = 1, ..., K: y_k(x) = w_k^T x + w_{k0}. In matrix form, y(x) = W^T x, where x = (1, x_1, ..., x_D)^T is the augmented input and W is a (D+1) × K matrix whose kth column is the augmented weight vector w_k = (w_{k0}, w_{k1}, ..., w_{kD})^T.

Learning the Parameters. Method #2: Least Squares. y(x) = W^T x. Given a training dataset (x_n, t_n), n = 1, ..., N, where we use the 1-of-K coding scheme for t_n: let T be the N × K matrix whose nth row is t_n^T, let X be the N × (D+1) matrix whose nth row is x_n^T, and let R_D(W) = XW - T. Then we define the error as E_D(W) = (1/2) Σ_{i,j} R_ij^2 = (1/2) Tr{ R_D(W)^T R_D(W) }. Setting the derivative with respect to W to zero yields W = (X^T X)^{-1} X^T T = X†T, where X† is the pseudo-inverse of X. (Recall: ∂Tr(AB)/∂A = B^T.)
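
A minimal sketch (my own, not the lecture's) of least-squares training and prediction under these definitions; np.linalg.lstsq is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer but gives the same solution.

```python
import numpy as np

def least_squares_fit(X, T):
    """Minimize (1/2)||XW - T||^2 for augmented inputs X (N x (D+1)) and 1-of-K targets T (N x K)."""
    W, *_ = np.linalg.lstsq(X, T, rcond=None)   # equivalent to W = pinv(X) @ T
    return W                                    # shape (D+1) x K

def least_squares_predict(W, X):
    """Assign each input to the class whose linear output y_k(x) = w_k^T x is largest."""
    return np.argmax(X @ W, axis=1)
```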

Outline: The Perceptron Algorithm; Least-Squares Classifiers; Fisher's Linear Discriminant; Logistic Classifiers.

Fisher's Linear Discriminant. Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes. Let m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n. For example, we might choose w to maximize w^T(m_2 - m_1), subject to ||w|| = 1. This leads to w ∝ m_2 - m_1. However, if the conditional distributions are not isotropic, this is typically not optimal. [Figure: two elongated class clusters projected onto the line joining the class means, showing considerable overlap of the projected classes.]

Fisher's Linear Discriminant. Let m'_1 = w^T m_1 and m'_2 = w^T m_2 be the class means projected onto the 1D subspace, and let s_k^2 = Σ_{n∈C_k} (y_n - m'_k)^2 be the within-class variance on the subspace for class C_k. The Fisher criterion is then J(w) = (m'_2 - m'_1)^2 / (s_1^2 + s_2^2). This can be rewritten as J(w) = (w^T S_B w) / (w^T S_W w), where S_B = (m_2 - m_1)(m_2 - m_1)^T is the between-class variance and S_W = Σ_{n∈C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n∈C_2} (x_n - m_2)(x_n - m_2)^T is the within-class variance. J(w) is maximized for w ∝ S_W^{-1}(m_2 - m_1). [Figure: the same two clusters projected onto the Fisher direction, showing much better class separation.]
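
A minimal NumPy sketch (my own illustration) of computing the Fisher direction w ∝ S_W^{-1}(m_2 - m_1) from the two class samples:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes, given as arrays X1 (N1 x D) and X2 (N2 x D)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of centred samples from both classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)        # normalized for convenience
```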

Connection to MVN Maximum Likelihood. J(w) is maximized for w ∝ S_W^{-1}(m_2 - m_1). Recall that if the two class-conditional distributions are normal with the same covariance Σ, the maximum likelihood classifier is linear, with w ∝ Σ^{-1}(m_2 - m_1). Further, note that S_W is proportional to the maximum likelihood estimator for Σ. Thus FLD is equivalent to assuming MVN distributions with common covariance.

Connection to Least Squares. Change the coding scheme used in the least-squares method to t_n = N/N_1 for C_1 and t_n = -N/N_2 for C_2. Then one can show that the resulting w satisfies w ∝ S_W^{-1}(m_2 - m_1).

End of Lecture October 17, 2012

Problems with Least Squares. Problem #1: sensitivity to outliers. [Figure: the same two-class data shown without and with a group of outlying points; the least-squares decision boundary is pulled substantially toward the outliers.]

Problems with Least Squares. Problem #2: a linear activation function is not a good fit to binary target data. This can lead to problems. [Figure: a classification example in which the least-squares solution performs poorly for this reason.]

Outline: The Perceptron Algorithm; Least-Squares Classifiers; Fisher's Linear Discriminant; Logistic Classifiers.

Logistic Regression (K = 2). p(C_1 | φ) = y(φ) = σ(w^T φ) and p(C_2 | φ) = 1 - p(C_1 | φ), where σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid. [Figure: the weight vector w defines the direction w^T φ in the (φ_1, φ_2) feature space along which the sigmoid output rises from 0 to 1.]

Logistic Regression. p(C_1 | φ) = y(φ) = σ(w^T φ), p(C_2 | φ) = 1 - p(C_1 | φ), where σ(a) = 1 / (1 + exp(-a)). Number of parameters for an M-dimensional feature space: logistic regression has M, whereas the Gaussian generative model has 2M + 2·M(M+1)/2 + 1 = M^2 + 3M + 1 (two means, two covariance matrices, and a class prior).

ML for Logistic Regression. p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, where t = (t_1, ..., t_N)^T and y_n = p(C_1 | φ_n). We define the error function to be the negative log likelihood, E(w) = -log p(t | w). Given y_n = σ(a_n) and a_n = w^T φ_n, one can show that ∇E(w) = Σ_{n=1}^{N} (y_n - t_n) φ_n. Unfortunately, there is no closed-form solution for w.
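
A minimal batch gradient-descent sketch (my own, not the lecture's) based on the gradient ∇E(w) = Σ_n (y_n - t_n) φ_n; Phi is assumed to be the N × M design matrix and t a vector of 0/1 targets.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_gradient_descent(Phi, t, eta=0.1, n_iters=1000):
    """Minimize the negative log likelihood E(w) of logistic regression by batch gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)       # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)     # sum_n (y_n - t_n) phi_n
        w -= eta * grad
    return w
```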

ML for Logistic Regression: Iterative Reweighted Least Squares. Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex, so an appropriate iterative method is guaranteed to find the global minimum. A good method is to use a local quadratic approximation to the log likelihood (the Newton-Raphson update): w^(new) = w^(old) - H^{-1} ∇E(w), where H is the Hessian matrix of E(w).

The Hessian Matrix H. H = ∇_w ∇_w E(w), i.e., H_ij = ∂²E(w) / (∂w_i ∂w_j). H_ij describes how the ith component of the gradient varies as we move in the w_j direction. Let u be any unit vector. Then Hu describes the variation in the gradient as we move in the direction u, and u^T Hu is the projection of this variation onto u. Thus u^T Hu measures how much the gradient is changing in the u direction as we move in the u direction.

ML for Logistic Regression. w^(new) = w^(old) - H^{-1} ∇E(w), where H is the Hessian matrix of E(w): H = Φ^T R Φ, where R is the N × N diagonal weight matrix with R_nn = y_n(1 - y_n) and Φ is the N × M design matrix whose nth row is φ_n^T. (Note that since R_nn >= 0, R is positive semi-definite, and hence H is positive semi-definite; thus E(w) is convex.) Therefore w^(new) = w^(old) - (Φ^T R Φ)^{-1} Φ^T (y - t). See Problem 8.3 in the text!
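
A minimal IRLS (Newton-Raphson) sketch under the same assumptions as above (0/1 targets, N × M design matrix Phi); this is my own illustration, and a small ridge term is added only to keep the Hessian numerically invertible.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iters=20, ridge=1e-8):
    """Newton-Raphson / IRLS for logistic regression: w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                     # diagonal weights R_nn = y_n (1 - y_n)
        H = Phi.T @ R @ Phi + ridge * np.eye(M)        # Hessian, with a tiny ridge for stability
        w = w - np.linalg.solve(H, Phi.T @ (y - t))    # Newton step
    return w
```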

ML for Logistic Regression: Iterative Reweighted Least Squares. [Figure: the fitted logistic model p(C_1 | φ) = y(φ) = σ(w^T φ), visualized over the feature space with the direction of w indicated.]

Logistic Regression. For K > 2, we can generalize the activation function by modeling the posterior probabilities with the softmax: p(C_k | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j), where the activations a_k are given by a_k = w_k^T φ.
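
A small sketch (mine, not the lecture's) of the softmax posteriors for K classes; W is assumed to stack one weight vector w_k per column, and the row maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def softmax_posteriors(W, Phi):
    """Posterior probabilities p(C_k | phi) = exp(a_k) / sum_j exp(a_j), with a_k = w_k^T phi.
    W is M x K, Phi is N x M; returns an N x K matrix of class probabilities."""
    A = Phi @ W                             # activations a_nk = w_k^T phi_n
    A = A - A.max(axis=1, keepdims=True)    # subtract row max for stability (does not change the result)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)
```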

Example. [Figure: decision regions for the same dataset computed by least squares (left) and logistic regression (right); the logistic classifier separates the classes far better.]

Outline: The Perceptron Algorithm; Least-Squares Classifiers; Fisher's Linear Discriminant; Logistic Classifiers.