Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University

Linear methods for classification

1. Linear classifiers

For simplicity, we only consider two-class classification problems (i.e. $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$). The loss function for classification is the 0-1 loss given as $l(y, a) = I(y \neq a)$. Linear methods for classification assume that the decision boundary is given as $\{x : \beta_0 + x^\top\beta = 0\}$. That is, for a given linear function $f(x) = \beta_0 + x^\top\beta$ and $\mathcal{Y} = \{-1, 1\}$, we construct the corresponding classifier $G : \mathbb{R}^p \to \mathcal{Y}$ by $G(x) = \mathrm{sign}(f(x))$.
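
As a small illustration, here is a minimal sketch of such a classifier in Python; the function name and the toy numbers are purely illustrative, not from the slides.

```python
# Minimal sketch of a linear classifier G(x) = sign(beta_0 + x'beta).
import numpy as np

def linear_classify(X, beta0, beta):
    """Return labels in {-1, +1} for the rows of X."""
    f = beta0 + X @ beta              # f(x) = beta_0 + x'beta
    return np.where(f >= 0, 1, -1)    # G(x) = sign(f(x)), with sign(0) := +1

# Toy decision boundary x1 + x2 - 1 = 0 in R^2
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(linear_classify(X, beta0=-1.0, beta=np.array([1.0, 1.0])))   # [-1  1]
```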

Three popular linear classifiers

- Linear Discriminant Analysis (LDA): mixture-of-Gaussians model
- Logistic regression: regression approach
- Optimal separating hyperplane: machine-learning approach (SVM)

* Among these, in this section we consider only LDA and logistic regression. We will study the optimal separating hyperplane when we study the SVM.

2. LDA

Model

Let $f_j(x)$ be the class-conditional density of $x$ in class $y = j$, where $y \in \{-1, 1\}$. Let $\pi_j$, $j = -1, 1$, be the prior probabilities (i.e. $\pi_j = \Pr(y = j)$). Suppose that we model each class density as multivariate Gaussian
$$ f_j(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right), $$
where $\mu_j$ is the mean vector and $\Sigma_j$ is the covariance matrix.
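
For concreteness, a direct (purely illustrative) translation of this density into code might look as follows; in practice one would use a library routine such as scipy.stats.multivariate_normal.

```python
# Illustrative sketch: evaluate the class-conditional Gaussian density f_j(x)
# exactly as written above (mu and Sigma are the class mean and covariance).
import numpy as np

def gaussian_density(x, mu, Sigma):
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                  # (x - mu)' Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const
```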

Bayes classifier

As we have seen in Chap 1, the Bayes classifier is given as
$$ G(x) = \mathrm{sign}\left( \log \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} \right). $$
Since $\Pr(y = j \mid x) \propto f_j(x)\pi_j$, we have
$$ \log \Pr(y = j \mid x) \,(=: \delta_j(x)) = -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1}(x - \mu_j) + \log\pi_j + C. $$
Hence, the Bayes classifier is given as $G(x) = \mathrm{sign}(\delta_1(x) - \delta_{-1}(x))$. We call the functions $\delta_j(x)$ the discriminant functions.

LDA

We assume that $\Sigma_1 = \Sigma_{-1} \,(= \Sigma)$. In this case, we can easily see that the Bayes classifier is given as $G(x) = \mathrm{sign}(\delta_1(x) - \delta_{-1}(x))$, where
$$ \delta_j(x) = x^\top \Sigma^{-1}\mu_j - \frac{1}{2}\mu_j^\top \Sigma^{-1}\mu_j + \log\pi_j. $$
That is, the Bayes classifier is a linear classifier. The functions $\delta_j$ are called the linear discriminant functions.
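
A minimal sketch of prediction with these linear discriminant functions, assuming the class means, the common covariance, and the priors are already given (the names are illustrative):

```python
# Sketch: delta_j(x) = x'Sigma^{-1}mu_j - (1/2)mu_j'Sigma^{-1}mu_j + log pi_j,
# and G(x) = sign(delta_1(x) - delta_{-1}(x)).
import numpy as np

def lda_predict(X, mu_pos, mu_neg, Sigma, pi_pos, pi_neg):
    """Classify rows of X into {-1, +1} using the linear discriminant functions."""
    Sigma_inv = np.linalg.inv(Sigma)

    def delta(x, mu, pi):
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

    scores = np.array([delta(x, mu_pos, pi_pos) - delta(x, mu_neg, pi_neg) for x in X])
    return np.where(scores >= 0, 1, -1)
```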

QDA

QDA is an abbreviation of Quadratic Discriminant Analysis. When $\Sigma_1 \neq \Sigma_{-1}$, the decision boundary of the Bayes classifier is a quadratic function of $x$.

Estimation

We can easily estimate $\mu_j$ and $\Sigma_j$ by
$$ \hat\mu_j = \sum_{i=1}^n x_i I(y_i = j)/n_j, \qquad \hat\Sigma_j = \sum_{i=1}^n (x_i - \hat\mu_j)(x_i - \hat\mu_j)^\top I(y_i = j)/(n_j - 1), $$
where $n_j = \sum_{i=1}^n I(y_i = j)$. Also, unless specified otherwise, we estimate $\pi_j$ by $n_j/n$. For LDA, we estimate $\Sigma$ by the pooled variance-covariance matrix
$$ \hat\Sigma = \frac{(n_1 - 1)\hat\Sigma_1 + (n_{-1} - 1)\hat\Sigma_{-1}}{n - 2}. $$
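
A sketch of these plug-in estimates, assuming labels in $\{-1, +1\}$ and an $n \times p$ data matrix X (the names are illustrative):

```python
# Sketch: class priors, means, covariances, and the pooled covariance for LDA.
import numpy as np

def lda_estimates(X, y):
    n = len(y)
    est = {}
    for j in (-1, 1):
        Xj = X[y == j]
        nj = len(Xj)
        est[j] = {"pi": nj / n,
                  "mu": Xj.mean(axis=0),
                  "Sigma": np.cov(Xj, rowvar=False)}      # divides by n_j - 1
    # pooled variance-covariance matrix
    pooled = (((y == 1).sum() - 1) * est[1]["Sigma"] +
              ((y == -1).sum() - 1) * est[-1]["Sigma"]) / (n - 2)
    return est, pooled
```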

3. Logistic Regression

Model

Let $\mathcal{Y} = \{0, 1\}$. The logistic model assumes
$$ \Pr(y = 1 \mid x) = \frac{\exp(\beta_0 + x^\top\beta)}{1 + \exp(\beta_0 + x^\top\beta)} \,(=: \phi(x, \beta)). $$
* We abuse notation slightly and let $\beta = (\beta_0, \beta)$.
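
As a small sketch, the assumed probability $\phi(x, \beta)$ can be computed as follows, vectorized over the rows of X (the function name is illustrative):

```python
# Sketch: Pr(y = 1 | x) = exp(beta_0 + x'beta) / (1 + exp(beta_0 + x'beta)) = phi(x, beta)
import numpy as np

def phi(X, beta0, beta):
    eta = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-eta))   # algebraically equal to exp(eta) / (1 + exp(eta))
```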

Motivation 1

Consider a linear regression $\Pr(y = 1 \mid x) = \beta_0 + x^\top\beta$. It violates the constraint $\Pr(y = 1 \mid x) \in [0, 1]$. A simple remedy for this problem is to set $\Pr(y = 1 \mid x) = F(\beta_0 + x^\top\beta)$, where $F$ is a distribution function.

Examples for $F$:
- Gaussian: probit model
- Gompertz: $F(x) = \exp(-\exp(-x))$, popularly used in insurance
- Logistic: $F(x) = \exp(x)/(1 + \exp(x))$.

Motivation 2

Consider the decision boundary $\{x : \Pr(Y = 1 \mid X = x) = 0.5\}$. This is equivalent to $\{x : \log(\Pr(Y = 1 \mid X = x)/\Pr(Y = 0 \mid X = x)) = 0\}$. Suppose that the log-odds is linear, that is,
$$ \log\frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = \beta_0 + x^\top\beta. $$
This implies that
$$ \Pr(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + x^\top\beta)}{1 + \exp(\beta_0 + x^\top\beta)}. $$

Estimation

We use the maximum likelihood approach. The likelihood is simply the probability of the observations, given as
$$ L(\beta) = \prod_{i=1}^n \Pr(y = y_i \mid x = x_i). $$
Estimate $\beta$ by maximizing the log-likelihood
$$ l(\beta) = \sum_{i=1}^n \left( y_i(\beta_0 + x_i^\top\beta) - \log(1 + \exp(\beta_0 + x_i^\top\beta)) \right). $$
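
A minimal sketch of this log-likelihood, with $y_i \in \{0, 1\}$ (names illustrative):

```python
# Sketch: l(beta) = sum_i [ y_i (beta_0 + x_i'beta) - log(1 + exp(beta_0 + x_i'beta)) ]
import numpy as np

def log_likelihood(beta0, beta, X, y):
    eta = beta0 + X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))
```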

Computation

One obstacle to using logistic regression is computation, since the log-likelihood has no closed-form maximizer. However, we can maximize it efficiently using the Iteratively Reweighted Least Squares (IRLS) algorithm, explained as follows. Find the MLE of $\beta$ via the Newton-Raphson algorithm:
$$ \beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left( \frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^\top} \right)^{-1} \frac{\partial l(\beta)}{\partial\beta}, $$
where the derivatives are evaluated at $\beta^{\mathrm{old}}$.

This is equivalent to
$$ \beta^{\mathrm{new}} = (X^\top W X)^{-1} X^\top W z, $$
where $W$ is the $n \times n$ diagonal matrix whose $(i, i)$th element is $\phi(x_i; \beta^{\mathrm{old}})(1 - \phi(x_i; \beta^{\mathrm{old}}))$ and $z = X\beta^{\mathrm{old}} + W^{-1}(y - p)$, with $X$ the design matrix, $y = (y_1, \ldots, y_n)^\top$, and $p = (\phi(x_i; \beta^{\mathrm{old}}), i = 1, \ldots, n)^\top$. Hence, $\beta^{\mathrm{new}}$ is the weighted least squares estimator with the adjusted response $z$:
$$ \beta^{\mathrm{new}} = \mathrm{argmin}_\beta \,(z - X\beta)^\top W (z - X\beta). $$
To sum up, the MLE of the logistic regression coefficients can be obtained by applying weighted least squares iteratively.
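
A sketch of the IRLS iteration described above, assuming the design matrix X already contains a leading column of ones for the intercept and that the weights stay away from zero (names are illustrative):

```python
# Sketch: IRLS for logistic regression, i.e. repeated weighted least squares.
import numpy as np

def irls_logistic(X, y, n_iter=25):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))          # phi(x_i; beta_old)
        w = prob * (1.0 - prob)                           # diagonal entries of W
        z = X @ beta + (y - prob) / w                     # adjusted response z
        WX = X * w[:, None]                               # W X without forming W explicitly
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # (X'WX)^{-1} X'W z
    return beta
```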

4. LDA or Logistic regression

Note that both logistic regression and LDA have linear decision boundaries. Logistic regression needs only the specification of $\Pr(Y = 1 \mid X = x)$ (that is, $\Pr(X = x)$ is left completely unspecified). On the other hand, LDA needs the specification of the joint distribution $\Pr(Y, X)$. In fact, in LDA, the marginal distribution of $x$ is a mixture of Gaussians:
$$ \Pr(x) = \pi_1 N(\mu_1, \Sigma) + \pi_{-1} N(\mu_{-1}, \Sigma). $$
Hence, LDA makes stronger assumptions and so has narrower applicability than logistic regression.

Also, categorical input variables are allowed in logistic regression (using dummy variables), while LDA has trouble with such inputs. However, LDA is a useful tool when some of the outputs are missing (semi-supervised learning).

5. Extension to Multi-class problems

Let $\mathcal{Y} = \{1, \ldots, K\}$. In this case, we construct $K$ linear functions $f_k(x) = \beta_{0k} + x^\top\beta_k$ for $k = 1, \ldots, K$. Then, we construct a classifier by $G(x) = \mathrm{argmax}_k f_k(x)$.

Linear regression

For $k = 1, \ldots, K$, let $y_i^{(k)} = I(y_i = k)$ and construct $f_k(x)$ by regressing the $y_i^{(k)}$ on the $x_i$. Note that $E(y_i^{(k)} \mid x_i) = \Pr(y_i = k \mid x_i)$, and hence we would expect this to work reasonably well.
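
A sketch of this indicator-regression classifier, assuming labels $1, \ldots, K$ (the function names are illustrative):

```python
# Sketch: regress each indicator y^{(k)} = I(y = k) on x by least squares,
# then classify by the largest fitted value.
import numpy as np

def fit_indicator_regression(X, y, K):
    X1 = np.column_stack([np.ones(len(X)), X])                    # add intercept column
    Y = np.stack([(y == k).astype(float) for k in range(1, K + 1)], axis=1)
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)                    # one coefficient column per class
    return B

def predict_indicator_regression(B, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.argmax(X1 @ B, axis=1) + 1                          # labels 1, ..., K
```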

LDA and QDA

LDA: Simply assume that $x_i \mid y_i = k \sim N_p(\mu_k, \Sigma)$. Estimate the $\mu_k$'s and $\Sigma$ from the data and construct the Bayes classifier.

QDA: Assume $x_i \mid y_i = k \sim N_p(\mu_k, \Sigma_k)$. Estimate the $\mu_k$'s and $\Sigma_k$'s from the data and construct the Bayes classifier.

Logistic regression

Assume $\Pr(y = k \mid x) \propto \exp(\beta_{0k} + x^\top\beta_k)$. That is,
$$ \Pr(y = k \mid x) = \frac{\exp(\beta_{0k} + x^\top\beta_k)}{\sum_{l=1}^K \exp(\beta_{0l} + x^\top\beta_l)}. $$
For identifiability, we let $\beta_{01} = 0$ and $\beta_1 = 0$. The parameters are estimated by maximum likelihood.
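
A sketch of these multi-class probabilities with the first class as the reference ($\beta_{01} = 0$, $\beta_1 = 0$); the names below are illustrative:

```python
# Sketch: Pr(y = k | x) = exp(beta_{0k} + x'beta_k) / sum_l exp(beta_{0l} + x'beta_l)
import numpy as np

def multiclass_logistic_prob(X, beta0, B):
    """beta0: length-K vector with beta0[0] = 0; B: (p, K) matrix with B[:, 0] = 0."""
    eta = beta0 + X @ B                        # (n, K) matrix of linear scores
    eta -= eta.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)
```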

Masking effect of the linear regression

Masking effect of the linear regression (continued)

Empirical comparison

Go to http://www-stat.stanford.edu/~tibs/elemstatlearn/ for the vowel data set.

6. HW

Reconstruct Table 4.1.