
CS 189 Introduction to Machine Learning, Spring 2018, Note 18

1 Gaussian Discriminant Analysis

Recall the idea of generative models: we classify an arbitrary datapoint $x$ with the class label that maximizes the joint probability $p(x, Y)$ over the label $Y$:

$$\hat{y} = \arg\max_k p(x, Y = k)$$

Generative models typically form the joint distribution by explicitly forming the following:

- A prior probability distribution over all classes: $P(k) = P(\text{class} = k)$
- A conditional probability distribution for each class $k \in \{1, 2, \dots, K\}$: $p_k(X) = p(X \mid \text{class } k)$

In total there are $K + 1$ probability distributions: 1 for the prior, and $K$ for the individual classes. Note that the prior probability distribution is a categorical distribution over the $K$ discrete classes, whereas each class conditional probability distribution is a continuous distribution over $\mathbb{R}^d$ (often represented as a Gaussian). Using the prior and the conditional distributions in conjunction, we have (from Bayes' rule) that maximizing the joint probability over the class labels is equivalent to maximizing the posterior probability of the class label:

$$\hat{y} = \arg\max_k p(x, Y = k) = \arg\max_k P(k)\, p_k(x) = \arg\max_k P(Y = k \mid x)$$

Consider the example of digit classification. Suppose we are given a dataset of images of handwritten digits, each labeled with its known value in the range $\{0, 1, 2, \dots, 9\}$. The task is, given an image of a handwritten digit, to classify it as the correct digit. A generative classifier for this task would form the prior distribution and the conditional probability distributions over the 10 possible digits and choose the digit that maximizes the posterior probability:

$$\hat{y} = \arg\max_{k \in \{0, 1, \dots, 9\}} p(\text{digit} = k \mid \text{image}) = \arg\max_{k \in \{0, 1, \dots, 9\}} P(\text{digit} = k)\, p(\text{image} \mid \text{digit} = k)$$

Maximizing the posterior induces regions in the feature space in which one class has the highest posterior probability, and decision boundaries between classes where the posterior probabilities of two classes are equal.
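To make this recipe concrete, here is a minimal sketch of a generic generative classifier in Python with NumPy and SciPy: it evaluates the joint probability $P(k)\,p_k(x)$ for every class and returns the maximizing label. The function name `generative_predict` and the two-Gaussian toy setup are our own illustrative choices, not part of the note.

```python
import numpy as np
from scipy.stats import multivariate_normal

def generative_predict(x, priors, class_densities):
    """Return the label k maximizing the joint p(x, Y = k) = P(k) * p_k(x).

    priors:          array of shape (K,) holding the class priors P(k)
    class_densities: list of K callables, each evaluating the class-conditional
                     density p_k at a point x
    """
    joint = np.array([priors[k] * class_densities[k](x) for k in range(len(priors))])
    return int(np.argmax(joint))

# Toy two-class example with Gaussian class conditionals (values are arbitrary):
priors = np.array([0.5, 0.5])
densities = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)).pdf,
    multivariate_normal(mean=[3.0, 3.0], cov=np.eye(2)).pdf,
]
print(generative_predict(np.array([2.5, 2.8]), priors, densities))  # prints 1
```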

Gaussian Discriminant Analysis (GDA) is a specific generative method in which the class conditional probability distributions are Gaussian: $(X \mid Y = k) \sim \mathcal{N}(\mu_k, \Sigma_k)$. (Caution: the term "discriminant" in GDA is misleading; GDA is a generative method, not a discriminative method!)

Assume that we are given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ points. Estimating the prior distribution is the same as for any other generative model. The probability of a class $k$ is

$$P(k) = \frac{n_k}{n}$$

where $n_k$ is the number of training points that belong to class $k$. We can estimate the parameters of the conditional distributions with MLE. Once we form the estimated prior and conditional distributions, we use Bayes' rule to directly solve the optimization problem

$$\begin{aligned}
\hat{y} &= \arg\max_k p(k \mid x) \\
&= \arg\max_k P(k)\, p_k(x) \\
&= \arg\max_k \ln\!\left(\sqrt{2\pi}^{\,d}\, P(k)\, p_k(x)\right) \\
&= \arg\max_k \ln P(k) - \frac{1}{2}(x - \hat{\mu}_k)^\top \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) - \frac{1}{2}\ln|\hat{\Sigma}_k| \\
&= \arg\max_k Q_k(x)
\end{aligned}$$

For future reference, let's use $Q_k(x) = \ln\!\left(\sqrt{2\pi}^{\,d}\, P(k)\, p_k(x)\right)$ to simplify our notation. We classify an arbitrary test point as

$$\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} Q_k(x)$$

GDA comes in two flavors: Quadratic Discriminant Analysis (QDA), in which the decision boundary is quadratic, and Linear Discriminant Analysis (LDA), in which the decision boundary is linear. We will now present both and compare them in detail.

1.1 QDA Classification

In Quadratic Discriminant Analysis (QDA), the class conditional probability distributions are independent Gaussians; that is, the covariance $\Sigma_k$ of class $k$ has no dependence on the covariances of the other classes. Due to this independence property, we can estimate the true mean and covariance $\mu_k, \Sigma_k$ of each class conditional probability distribution $p_k(X)$ independently, using only the $n_k$ samples in our training data that belong to class $k$. The MLE estimates for the parameters of $p_k(X)$ are:

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i \qquad\qquad \hat{\Sigma}_k = \frac{1}{n_k} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$$
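The estimation and scoring rules above translate almost directly into code. The sketch below is our own (the names `fit_qda`, `qda_score`, and `qda_predict` are not from the note), assuming NumPy arrays and integer class labels; it computes the per-class MLE parameters and evaluates $Q_k(x)$ exactly as defined above.

```python
import numpy as np

def fit_qda(X, y):
    """MLE estimates for QDA: prior n_k/n, mean, and covariance for each class k.

    X: (n, d) data matrix; y: (n,) integer class labels.
    """
    n, _ = X.shape
    priors, means, covs = {}, {}, {}
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        priors[k] = nk / n
        means[k] = Xk.mean(axis=0)
        diff = Xk - means[k]
        covs[k] = diff.T @ diff / nk      # MLE: divide by n_k, not n_k - 1
    return priors, means, covs

def qda_score(x, prior, mu, Sigma):
    """Q_k(x) = ln P(k) - (x - mu)^T Sigma^{-1} (x - mu) / 2 - ln|Sigma| / 2."""
    diff = x - mu
    return (np.log(prior)
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.linalg.slogdet(Sigma)[1])

def qda_predict(x, priors, means, covs):
    """Classify x as the class with the largest Q_k(x)."""
    return max(priors, key=lambda k: qda_score(x, priors[k], means[k], covs[k]))
```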

1.2 LDA Classification

While QDA is a reasonable approach to classification, we might be interested in simplifying our model to reduce the number of parameters we have to learn. One way to do this is through Linear Discriminant Analysis (LDA) classification. Just as in QDA, LDA assumes that the class conditional probability distributions are normally distributed with different means $\mu_k$, but LDA differs from QDA in that it requires all of the distributions to share the same covariance matrix $\Sigma$. This is a simplification which, in the context of the bias-variance tradeoff, increases the bias of our method but may help decrease the variance.

The training and classification procedures for LDA are almost identical to those of QDA. To compute the within-class means, we still take the empirical mean. However, the empirical covariance for all classes is now computed as

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top$$

One way to understand this formula is as a weighted average of the within-class covariances. Here, assume we have sorted our training data by class, so that we can index through the $x_i$'s by specifying a class $k$ and an index within that class:

$$\begin{aligned}
\hat{\Sigma} &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top \\
&= \frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top \\
&= \frac{1}{n} \sum_{k=1}^{K} n_k \hat{\Sigma}_k \\
&= \sum_{k=1}^{K} \frac{n_k}{n} \hat{\Sigma}_k
\end{aligned}$$
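As a companion to the QDA sketch above, here is a short sketch (again our own, under the same assumptions) of the pooled LDA covariance estimate: it accumulates $n_k \hat{\Sigma}_k$ class by class and divides by $n$, which is exactly the weighted average derived above.

```python
import numpy as np

def fit_lda(X, y):
    """LDA MLE: per-class priors and means plus one shared (pooled) covariance."""
    n, d = X.shape
    priors, means = {}, {}
    Sigma = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        priors[k] = nk / n
        means[k] = Xk.mean(axis=0)
        diff = Xk - means[k]
        Sigma += diff.T @ diff            # accumulates n_k * Sigma_hat_k
    Sigma /= n                            # Sigma_hat = sum_k (n_k / n) Sigma_hat_k
    return priors, means, Sigma
```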

1.3 LDA and QDA Decision Boundary

Let's now derive the form of the decision boundary for QDA and LDA. As we will see, the terms quadratic in QDA and linear in LDA actually signify the shape of the decision boundary. We will prove this claim using a binary (2-class) example for simplicity, with classes A and B. An arbitrary point $x$ is classified according to the following cases:

$$\hat{y} = \begin{cases} A & Q_A(x) > Q_B(x) \\ B & Q_A(x) < Q_B(x) \\ \text{either } A \text{ or } B & Q_A(x) = Q_B(x) \end{cases}$$

The decision boundary is the set of all points in $x$-space that are classified according to the third case.

1.3.1 Identical Conditional Distributions with Identical Priors

The simplest case is when the two classes are equally likely in prior, and their conditional probability distributions are isotropic with identical covariances. Recall that isotropic Gaussian distributions have covariances of the form $\Sigma = \sigma^2 I$, which means that their isocontours are circles. In this case, $p_A(X)$ and $p_B(X)$ have identical covariances of the form $\Sigma_A = \Sigma_B = \sigma^2 I$.

Figure 1: Contour plot of two isotropic, identically distributed Gaussians in $\mathbb{R}^2$. The circles are the level sets of the Gaussians.

Geometrically, we can see that classifying a 2-D point into one of the two classes amounts simply to figuring out which of the means it is closer to. Using our notation of $Q_k(x)$ from before, this can be expressed mathematically as:

$$\begin{aligned}
Q_A(x) &= Q_B(x) \\
\ln P(A) - \frac{1}{2}(x - \hat{\mu}_A)^\top \hat{\Sigma}_A^{-1} (x - \hat{\mu}_A) - \frac{1}{2}\ln|\hat{\Sigma}_A| &= \ln P(B) - \frac{1}{2}(x - \hat{\mu}_B)^\top \hat{\Sigma}_B^{-1} (x - \hat{\mu}_B) - \frac{1}{2}\ln|\hat{\Sigma}_B| \\
\ln\tfrac{1}{2} - \frac{1}{2}(x - \hat{\mu}_A)^\top (\sigma^2 I)^{-1} (x - \hat{\mu}_A) - \frac{1}{2}\ln|\sigma^2 I| &= \ln\tfrac{1}{2} - \frac{1}{2}(x - \hat{\mu}_B)^\top (\sigma^2 I)^{-1} (x - \hat{\mu}_B) - \frac{1}{2}\ln|\sigma^2 I| \\
(x - \hat{\mu}_A)^\top (x - \hat{\mu}_A) &= (x - \hat{\mu}_B)^\top (x - \hat{\mu}_B)
\end{aligned}$$

The decision boundary is the set of points $x$ for which $\|x - \hat{\mu}_A\|^2 = \|x - \hat{\mu}_B\|^2$, which is simply the set of points that are equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$. This decision boundary is linear because the set of points equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$ is the perpendicular bisector of the segment connecting $\hat{\mu}_A$ and $\hat{\mu}_B$.
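The claim that the $Q_A$-versus-$Q_B$ comparison reduces to picking the nearer mean is easy to check numerically. The following sketch is ours; the means, the value $\sigma^2 = 2$, and the random test points are arbitrary choices. It asserts that the two rules agree in the isotropic, equal-prior setting.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_A, mu_B = np.array([0.0, 0.0]), np.array([3.0, 1.0])
sigma2 = 2.0      # shared isotropic covariance sigma^2 * I in d = 2
prior = 0.5       # equal priors

def Q(x, mu):
    # Q_k(x) with Sigma = sigma^2 I; ln|Sigma| = d * ln(sigma^2) is the same for both classes
    return np.log(prior) - 0.5 * (x - mu) @ (x - mu) / sigma2 - 0.5 * 2 * np.log(sigma2)

for _ in range(100):
    x = rng.normal(size=2) * 3.0
    by_Q = 'A' if Q(x, mu_A) > Q(x, mu_B) else 'B'
    by_dist = 'A' if np.linalg.norm(x - mu_A) < np.linalg.norm(x - mu_B) else 'B'
    assert by_Q == by_dist   # classification reduces to the nearest mean
```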

The next case is when the two classes are equally likely in prior, and their conditional probability distributions are anisotropic with identical covariances.

Figure 2: Two anisotropic, identically distributed Gaussians in $\mathbb{R}^2$. The ellipses are the level sets of the Gaussians.

The anisotropic case can be reduced to the isotropic case simply by performing a linear change of coordinates that transforms the ellipses back into circles, which induces a linear decision boundary in both the transformed and the original space. Therefore, the decision boundary is still the set of points that are equidistant from $\hat{\mu}_A$ and $\hat{\mu}_B$ (as measured in the transformed coordinates).

1.3.2 Identical Conditional Distributions with Different Priors

Now, let's find the decision boundary when the two classes still have identical covariances but are not necessarily equally likely in prior:

$$\begin{aligned}
Q_A(x) &= Q_B(x) \\
\ln P(A) - \frac{1}{2}(x - \hat{\mu}_A)^\top \hat{\Sigma}^{-1} (x - \hat{\mu}_A) - \frac{1}{2}\ln|\hat{\Sigma}| &= \ln P(B) - \frac{1}{2}(x - \hat{\mu}_B)^\top \hat{\Sigma}^{-1} (x - \hat{\mu}_B) - \frac{1}{2}\ln|\hat{\Sigma}| \\
\ln P(A) - \frac{1}{2}(x - \hat{\mu}_A)^\top \hat{\Sigma}^{-1} (x - \hat{\mu}_A) &= \ln P(B) - \frac{1}{2}(x - \hat{\mu}_B)^\top \hat{\Sigma}^{-1} (x - \hat{\mu}_B) \\
2\ln P(A) - x^\top \hat{\Sigma}^{-1} x + 2 x^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A &= 2\ln P(B) - x^\top \hat{\Sigma}^{-1} x + 2 x^\top \hat{\Sigma}^{-1} \hat{\mu}_B - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B \\
2\ln P(A) + 2 x^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A &= 2\ln P(B) + 2 x^\top \hat{\Sigma}^{-1} \hat{\mu}_B - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B
\end{aligned}$$

Simplifying, we have that

$$x^\top \hat{\Sigma}^{-1}(\hat{\mu}_A - \hat{\mu}_B) + \ln\!\left(\frac{P(A)}{P(B)}\right) - \frac{\hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B}{2} = 0$$

The decision boundary is the level set of a linear function $f(x) = w^\top x - b$. In fact, the decision boundary remains the level set of a linear function whenever the two class conditional probability distributions share the same covariance matrix. This is the reason why LDA has a linear decision boundary.
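For reference, the coefficients of this linear function can be read off directly from the estimated parameters. The helper below is our own (not a standard API): it returns $w = \hat{\Sigma}^{-1}(\hat{\mu}_A - \hat{\mu}_B)$ and $b = \tfrac{1}{2}\big(\hat{\mu}_A^\top \hat{\Sigma}^{-1} \hat{\mu}_A - \hat{\mu}_B^\top \hat{\Sigma}^{-1} \hat{\mu}_B\big) - \ln\tfrac{P(A)}{P(B)}$, which matches the simplified equation above.

```python
import numpy as np

def lda_boundary(mu_A, mu_B, Sigma, prior_A, prior_B):
    """Coefficients of the two-class LDA decision function f(x) = w^T x - b.

    Classify as A when f(x) > 0, as B when f(x) < 0; f(x) = 0 is the boundary.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu_A - mu_B)
    b = 0.5 * (mu_A @ Sigma_inv @ mu_A - mu_B @ Sigma_inv @ mu_B) - np.log(prior_A / prior_B)
    return w, b
```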

1.3.3 Nonidentical Conditional Distributions with Different Priors

This is the most general case. We have that:

$$\begin{aligned}
Q_A(x) &= Q_B(x) \\
\ln P(A) - \frac{1}{2}(x - \hat{\mu}_A)^\top \hat{\Sigma}_A^{-1} (x - \hat{\mu}_A) - \frac{1}{2}\ln|\hat{\Sigma}_A| &= \ln P(B) - \frac{1}{2}(x - \hat{\mu}_B)^\top \hat{\Sigma}_B^{-1} (x - \hat{\mu}_B) - \frac{1}{2}\ln|\hat{\Sigma}_B|
\end{aligned}$$

Here, unlike in LDA where $\hat{\Sigma}_A = \hat{\Sigma}_B$, we cannot cancel the quadratic terms in $x$ from both sides of the equation, and thus our decision boundary is now the level set of a quadratic function. It should now make sense why QDA is short for quadratic discriminant analysis and LDA is short for linear discriminant analysis!

1.4 LDA and Logistic Regression

As it turns out, LDA and logistic regression share the same type of posterior distribution. We already showed that the posterior distribution in logistic regression is

$$P(Y = A \mid x) = \frac{1}{1 + e^{-(w^\top x - b)}} = s(w^\top x - b)$$

for some appropriate vector $w$ and bias $b$. Now let's derive the posterior distribution for LDA. From Bayes' rule we have that

$$\begin{aligned}
P(Y = A \mid x) &= \frac{p(x \mid Y = A)\, P(Y = A)}{p(x \mid Y = A)\, P(Y = A) + p(x \mid Y = B)\, P(Y = B)} \\
&= \frac{e^{Q_A(x)}}{e^{Q_A(x)} + e^{Q_B(x)}} \\
&= \frac{1}{1 + e^{-(Q_A(x) - Q_B(x))}}
\end{aligned}$$

We already showed that the decision boundary in LDA is linear: it is the set of points $x$ such that $Q_A(x) - Q_B(x) = w^\top x - b = 0$ for some appropriate vector $w$ and bias $b$. We therefore have that

$$P(Y = A \mid x) = \frac{1}{1 + e^{-(w^\top x - b)}} = s(w^\top x - b)$$

As we can see, even though logistic regression is a discriminative method and LDA is a generative method, the two methods complement each other, arriving at the same form for the posterior distribution.
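A small sketch (ours, reusing the same $w$ and $b$ as the `lda_boundary` helper above) that computes the LDA posterior through the sigmoid, making the correspondence with logistic regression explicit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lda_posterior_A(x, mu_A, mu_B, Sigma, prior_A, prior_B):
    """P(Y = A | x) for two-class LDA, written as s(w^T x - b).

    Because Q_A(x) - Q_B(x) = w^T x - b, this equals 1 / (1 + exp(-(Q_A - Q_B))),
    i.e. the same functional form as the logistic-regression posterior.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu_A - mu_B)
    b = 0.5 * (mu_A @ Sigma_inv @ mu_A - mu_B @ Sigma_inv @ mu_B) - np.log(prior_A / prior_B)
    return sigmoid(w @ x - b)
```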

1.5 Generalizing to Multiple Classes

The analysis of the decision boundary in QDA and LDA can be extended to the general case of more than two classes. In the multiclass setting, the decision boundary is a collection of linear boundaries in LDA and quadratic boundaries in QDA. The following Voronoi diagrams illustrate the point:

Figure 3: LDA (left) vs. QDA (right): a collection of linear vs. quadratic level set boundaries. Source: Professor Shewchuk's notes.