Machine Learning (CS 567) Lecture 5


Machine Learning (CS 567) Lecture 5
Time: T-Th 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy (macskass@usc.edu), Office: SAL 216, Office hours: by appointment
Teaching assistant: Cheol Han (cheolhan@usc.edu), Office: SAL 229, Office hours: M 2-3pm, W 11-12
Class web page: http://www-scf.usc.edu/~csci567/index.html

Administrative: Regrading
This regrading policy holds for quizzes, homeworks, the midterm, and the final exam. Grading complaint policy: if you have a problem with how an assignment was graded, talk to the instructor or TA within one week of getting the graded work back. After one week, no regrading will be done. Even if you request regrading of only one answer, the entire assignment will be checked and graded again (so you take the risk of losing points overall).

Lecture 5 Outline
Logistic Regression (from last week)
Linear Discriminant Analysis
Off-the-shelf Classifiers

Linear Threshold Units

$h(\mathbf{x}) = \begin{cases} +1 & \text{if } w_1 x_1 + \cdots + w_n x_n \ge w_0 \\ -1 & \text{otherwise} \end{cases}$

We assume that each feature $x_j$ and each weight $w_j$ is a real number (we will relax this later). We will study three different algorithms for learning linear threshold units:
Perceptron: classifier
Logistic Regression: conditional distribution
Linear Discriminant Analysis: joint distribution
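To make the threshold rule concrete, here is a minimal NumPy sketch (not from the original slides; the function name ltu_predict and the example weights are illustrative):

```python
import numpy as np

def ltu_predict(x, w, w0):
    """Linear threshold unit: +1 if w . x >= w0, else -1."""
    return 1 if np.dot(w, x) >= w0 else -1

# Example: a 2-feature LTU with illustrative weights w = (1.0, -2.0), threshold w0 = 0.5
w = np.array([1.0, -2.0])
print(ltu_predict(np.array([2.0, 0.5]), w, w0=0.5))   # +1
print(ltu_predict(np.array([0.0, 1.0]), w, w0=0.5))   # -1
```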

Linear Discriminant Analysis
Learn P(x, y). This is sometimes called the generative approach, because we can think of P(x, y) as a model of how the data is generated. For example, if we factor the joint distribution into the form P(x, y) = P(y) P(x|y), we can think of P(y) as generating a value for y according to P(y). Then we can think of P(x|y) as generating a value for x given the previously generated value for y. This can be described as a two-node Bayesian network, y → x.

Linear Discriminant Analysis (2)
P(y) is a discrete multinomial distribution. Example: P(y = 0) = 0.31, P(y = 1) = 0.69 will generate 31% negative examples and 69% positive examples. For LDA, we assume that P(x|y) is a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma$:

$P(x \mid y = k) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_k]^T \Sigma^{-1} [x - \mu_k] \right)$

Multivariate Normal Distributions: A tutorial
Recall that the univariate normal (Gaussian) distribution has the formula

$p(x) = \frac{1}{(2\pi)^{1/2}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$

where $\mu$ is the mean and $\sigma^2$ is the variance. Graphically, it is the familiar bell-shaped curve centered at $\mu$.

The Multivariate Gaussian
A 2-dimensional Gaussian is defined by a mean vector $\mu = (\mu_1, \mu_2)$ and a covariance matrix

$\Sigma = \begin{bmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{1,2} & \sigma^2_{2,2} \end{bmatrix}$

where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the variance (if i = j) or covariance (if i ≠ j).

The Multivariate Gaussian (2)
If $\Sigma$ is the identity matrix, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, and $\mu = (0, 0)$, we get the standard normal distribution.

The Multivariate Gaussian (3)
If $\Sigma$ is a diagonal matrix, then $x_1$ and $x_2$ are independent random variables, and lines of equal probability are ellipses parallel to the coordinate axes. For example, we obtain such axis-parallel ellipses when

$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3).$

The Multivariate Gaussian (4)
Finally, if $\Sigma$ is an arbitrary matrix, then $x_1$ and $x_2$ are dependent, and lines of equal probability are ellipses tilted relative to the coordinate axes. For example, we obtain such tilted ellipses when

$\Sigma = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3).$

Estimating a Multivariate Gaussian
Given a set of N data points $\{x_1, \ldots, x_N\}$, we can compute the maximum likelihood estimate for the multivariate Gaussian distribution as follows:

$\hat{\mu} = \frac{1}{N} \sum_i x_i$

$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu})(x_i - \hat{\mu})^T$

Note that the product in the second equation is an outer product. The outer product of two vectors is a matrix:

$x\,y^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$

For comparison, the usual dot product is written as $x^T y$.
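As an illustration (not part of the original slides), a minimal NumPy sketch of these maximum likelihood estimates, assuming the data is stored as an N-by-n array X; the function name fit_gaussian_mle is hypothetical:

```python
import numpy as np

def fit_gaussian_mle(X):
    """Maximum likelihood estimates of a multivariate Gaussian.
    X is an (N, n) array with one data point per row."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                 # (1/N) * sum_i x_i
    centered = X - mu_hat                   # rows are (x_i - mu_hat)
    sigma_hat = centered.T @ centered / N   # (1/N) * sum_i outer(x_i - mu_hat, x_i - mu_hat)
    return mu_hat, sigma_hat

# Example: recover the parameters of the tilted Gaussian from the previous slide
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, 3.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)
mu_hat, sigma_hat = fit_gaussian_mle(X)
print(mu_hat)      # approximately [2, 3]
print(sigma_hat)   # approximately [[2, 0.5], [0.5, 1]]
```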

The LDA Model
Linear discriminant analysis assumes that the joint distribution has the form

$P(x, y) = P(y)\,\frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y] \right)$

where each $\mu_y$ is the mean of a multivariate Gaussian for examples belonging to class y and $\Sigma$ is a single covariance matrix shared by all classes.

Fitting the LDA Model
It is easy to learn the LDA model in two passes through the data. Let $\hat{\pi}_k$ be our estimate of P(y = k) and let $N_k$ be the number of training examples belonging to class k. Then

$\hat{\pi}_k = \frac{N_k}{N}$

$\hat{\mu}_k = \frac{1}{N_k} \sum_{\{i : y_i = k\}} x_i$

$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^T$

Note that each $x_i$ has its corresponding class mean $\hat{\mu}_{y_i}$ subtracted from it prior to taking the outer product. This gives us the pooled estimate of $\Sigma$.
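A sketch of this two-pass procedure in NumPy (illustrative, not from the slides; assumes the labels y are integers 0..K−1 so that mu_hat[y] indexes each example's class mean):

```python
import numpy as np

def fit_lda(X, y):
    """Two-pass LDA fit: class priors pi_k, class means mu_k, pooled covariance Sigma.
    X is (N, n); y is (N,) with integer labels 0..K-1."""
    N = X.shape[0]
    classes = np.unique(y)
    pi_hat = np.array([np.mean(y == k) for k in classes])         # pi_k = N_k / N
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k = mean of class k
    centered = X - mu_hat[y]        # subtract each x_i's own class mean mu_{y_i}
    sigma_hat = centered.T @ centered / N                         # pooled covariance estimate
    return pi_hat, mu_hat, sigma_hat
```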

LDA learns an LTU
Consider the 2-class case with a 0/1 loss function. Recall that

$P(y = 0 \mid x) = \frac{P(x, y = 0)}{P(x, y = 0) + P(x, y = 1)} \qquad P(y = 1 \mid x) = \frac{P(x, y = 1)}{P(x, y = 0) + P(x, y = 1)}$

Also recall from our derivation of the logistic regression classifier that we should classify into class $\hat{y} = 1$ if

$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0$

Hence, for LDA, we should classify into $\hat{y} = 1$ if

$\log \frac{P(x, y = 1)}{P(x, y = 0)} > 0$

because the denominators cancel.

LDA learns an LTU (2)

$P(x, y) = P(y)\,\frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y] \right)$

$\frac{P(x, y = 1)}{P(x, y = 0)} = \frac{P(y = 1)\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] \right)}{P(y = 0)\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}[x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)} = \frac{P(y = 1)\,\exp\left( -\frac{1}{2}[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] \right)}{P(y = 0)\,\exp\left( -\frac{1}{2}[x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)}$

$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2}\left( [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)$

LDA learns an LTU (3)
Let's focus on the term in brackets:

$[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0]$

Expand the quadratic forms as follows:

$[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1$

$[x - \mu_0]^T \Sigma^{-1} [x - \mu_0] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$

Subtract the lower line from the upper line and collect similar terms. Note that the quadratic terms cancel! This leaves only terms linear in x:

$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$

LDA learns an LTU (4)

$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$

Note that $a^T \Sigma^{-1} b = b^T \Sigma^{-1} a$ for any two vectors a and b, since $\Sigma^{-1}$ is symmetric. Hence, the first two terms can be combined to give

$2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0.$

Now plug this back in:

$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2}\left[ 2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right]$

$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0$

LDA learns an LTU (5)

$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0$

Let

$w = \Sigma^{-1} (\mu_1 - \mu_0)$

$c = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0$

Then we will classify into class $\hat{y} = 1$ if $w \cdot x + c > 0$. This is an LTU.
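Continuing the sketch above, the LTU parameters w and c follow directly from the fitted estimates (illustrative code, assuming the fit_lda sketch from the earlier slide and two classes labelled 0 and 1):

```python
import numpy as np

def lda_to_ltu(pi_hat, mu_hat, sigma_hat):
    """Turn fitted 2-class LDA parameters into LTU weights:
    w = Sigma^{-1}(mu_1 - mu_0),
    c = log(pi_1/pi_0) - 1/2 mu_1' Sigma^{-1} mu_1 + 1/2 mu_0' Sigma^{-1} mu_0."""
    sigma_inv = np.linalg.inv(sigma_hat)
    w = sigma_inv @ (mu_hat[1] - mu_hat[0])
    c = (np.log(pi_hat[1] / pi_hat[0])
         - 0.5 * mu_hat[1] @ sigma_inv @ mu_hat[1]
         + 0.5 * mu_hat[0] @ sigma_inv @ mu_hat[0])
    return w, c

def ltu_classify(X, w, c):
    """Classify each row of X into class 1 if w . x + c > 0, else class 0."""
    return (X @ w + c > 0).astype(int)
```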

Two Geometric Views of LDA
View 1: Mahalanobis Distance. The quantity

$D_M(x, u)^2 = (x - u)^T \Sigma^{-1} (x - u)$

is known as the (squared) Mahalanobis distance between x and u. We can think of the matrix $\Sigma^{-1}$ as a linear distortion of the coordinate system that converts the standard Euclidean distance into the Mahalanobis distance. Note that

$\log P(x \mid y = k) \propto \log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) = \log \pi_k - \frac{1}{2} D_M(x, \mu_k)^2$

Therefore, we can view LDA as computing $D_M(x, \mu_0)^2$ and $D_M(x, \mu_1)^2$ and then classifying x according to whichever mean, $\mu_0$ or $\mu_1$, is closest in Mahalanobis distance (corrected by $\log \pi_k$).
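A small NumPy sketch of this view (illustrative, not from the slides; assumes the pi_hat, mu_hat, sigma_hat estimates from the fitting sketch): score each class by log pi_k minus half the squared Mahalanobis distance to its mean, and pick the largest.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance D_M(x, mu)^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ sigma_inv @ d

def classify_by_mahalanobis(x, pi_hat, mu_hat, sigma_hat):
    """Choose the class whose mean is closest in Mahalanobis distance,
    corrected by the log prior log pi_k."""
    sigma_inv = np.linalg.inv(sigma_hat)
    scores = [np.log(pi_hat[k]) - 0.5 * mahalanobis_sq(x, mu_hat[k], sigma_inv)
              for k in range(len(pi_hat))]
    return int(np.argmax(scores))
```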

Mahalanobis Distance
Which point is closer to the vertical line? A weak point of the Mahalanobis distance: if only a few points have been observed, the estimated variance is not meaningful.

View 2: Most Informative Low-Dimensional Projection
LDA can also be viewed as finding a hyperplane of dimension K − 1 such that x and the $\{\mu_k\}$ are projected down into this hyperplane, and then x is classified to the nearest $\mu_k$ using Euclidean distance inside this hyperplane.

Nonlinear boundaries with LDA
Left: linear decision boundary found by LDA. Right: quadratic decision boundary found by LDA. This was obtained by finding linear decision boundaries in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$. Linear inequalities in this space are quadratic inequalities in the original space.
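One way to reproduce this construction is to expand the two inputs into the five quadratic features and run ordinary LDA in that space. A hedged sketch using scikit-learn (the library choice and the toy data are assumptions, not from the slides):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data with a circular (quadratic) class boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # columns are X1, X2
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Expand into the five-dimensional space X1, X2, X1^2, X1*X2, X2^2 ...
quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(X)

# ... and find a *linear* boundary there; in the original space it is quadratic.
lda = LinearDiscriminantAnalysis()
lda.fit(X_quad, y)
print(lda.score(X_quad, y))   # training accuracy with the quadratic boundary
```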

Generalizations of LDA
General Gaussian Classifier: instead of assuming that all classes share the same $\Sigma$, we can allow each class k to have its own $\Sigma_k$. In this case, the resulting classifier will be a quadratic threshold unit (instead of an LTU).
Naïve Gaussian Classifier: allow each class to have its own $\Sigma_k$, but require that each $\Sigma_k$ be diagonal. This means that within each class, any pair of features $x_{j_1}$ and $x_{j_2}$ will be assumed to be statistically independent. The resulting classifier is still a quadratic threshold unit (but with a restricted form).

Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis (QDA) is a generalization of LDA where different classes are allowed to have different covariance matrices, i.e., $\Sigma_k \neq \Sigma_l$ for classes $k \neq l$:

$P(x, y = k) = P(y = k)\,\frac{1}{(2\pi)^{n/2}\,|\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_k]^T \Sigma_k^{-1} [x - \mu_k] \right)$

with the per-class covariance estimate

$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$

We can solve for $\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0$ in a similar way as before. Note, however, that the $\Sigma_0$ and $\Sigma_1$ determinants do not cancel.
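A minimal NumPy sketch of the resulting decision rule (illustrative names, not from the slides); note the per-class log-determinant term, which does not cancel as it does in LDA:

```python
import numpy as np

def qda_fit(X, y):
    """Per-class priors, means, and covariance matrices (labels 0..K-1)."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    sigma = np.array([np.cov(X[y == k].T, bias=True) for k in classes])  # MLE (divide by N_k)
    return pi, mu, sigma

def qda_predict(x, pi, mu, sigma):
    """Assign x to the class k with the largest log P(x, y = k)."""
    scores = []
    for k in range(len(pi)):
        d = x - mu[k]
        sigma_inv = np.linalg.inv(sigma[k])
        log_det = np.linalg.slogdet(sigma[k])[1]   # log |Sigma_k|, does not cancel across classes
        scores.append(np.log(pi[k]) - 0.5 * log_det - 0.5 * d @ sigma_inv @ d)
    return int(np.argmax(scores))
```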

LDA vs. QDA
Left: quadratic fit by LDA in the transformed space. Right: quadratic fit by QDA. The differences in the fit are very small.


Different Discriminant Functions

Summary of Linear Discriminant Analysis
Learns the joint probability distribution P(x, y).
Direct Computation. The maximum likelihood estimate of P(x, y) can be computed from the data without search. However, inverting the $\Sigma$ matrix requires O(n^3) time.
Eager. The classifier is constructed from the training examples. The examples can then be discarded.
Batch. Only a batch algorithm is available. An online algorithm could be constructed if there is an online algorithm for incrementally updating $\Sigma^{-1}$. [This is easy for the case where $\Sigma$ is diagonal.]

Comparing Perceptron, Logistic Regression, and LDA
How should we choose among these three algorithms? There is a big debate within the machine learning community!

Issues in the Debate
Statistical Efficiency. If the generative model P(x, y) is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small. If the model is correct, LDA requires 30% less data than logistic regression in theory.
Computational Efficiency. Generative models are typically the easiest to learn. In our example, LDA can be computed directly from the data without using gradient descent.

Issues in the Debate
Robustness to changing loss functions. Both generative and conditional probability models allow the loss function to be changed at run time without re-learning. Perceptron requires re-training the classifier when the loss function changes.
Robustness to model assumptions. The generative model usually performs poorly when the assumptions are violated. For example, if P(x|y) is very non-Gaussian, then LDA won't work well. Logistic regression is more robust to model assumptions, and Perceptron is even more robust.
Robustness to missing values and noise. In many applications, some of the features $x_{ij}$ may be missing or corrupted in some of the training examples. Generative models typically provide better ways of handling this than non-generative models.