DISCRIMINANT ANALYSIS: LDA AND QDA

Stat 427/627 Statistical Machine Learning (Baron)
HOMEWORK 6, Solutions

1. Chap 4, exercise 5.

(a) On a training set, LDA and QDA are both expected to perform well. LDA uses a correct model, and QDA includes the linear model as a special case. QDA, covering a wider range of models, will fit the given particular data set better. On a test set, however, LDA should be more accurate, because QDA, with its additional parameters, will overfit and have a higher variance.

(b) For a nonlinear boundary, QDA should perform better on both the training and the test sets, because LDA cannot capture the nonlinear pattern.

(c) As the sample size increases, QDA is expected to perform better, because its higher variance is offset by the large sample size.

(d) False. The higher variance of the QDA prediction, without any gain in bias, will result in a higher test error rate.
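As a quick illustration of the bias-variance point in (a), the small simulation below is not part of the assigned solution; the sample sizes, class means, and common covariance matrix are arbitrary choices. It generates two Gaussian classes with equal covariance matrices, so the Bayes boundary is linear; QDA then tends to attain a slightly lower training error, while LDA tends to have the lower test error.

# Illustration for exercise 5(a): linear Bayes boundary, small training sample.
library(MASS)                                    # lda(), qda(), mvrnorm()
set.seed(1)
Sigma = matrix( c(1, 0.3, 0.3, 1), 2, 2 )        # common covariance => linear boundary
make_data = function(n) {
  X = rbind( mvrnorm(n, c(0, 0), Sigma), mvrnorm(n, c(1, 1), Sigma) )
  data.frame( x1 = X[, 1], x2 = X[, 2], y = factor(rep(c("A", "B"), each = n)) )
}
train = make_data(50)                            # small training set
test  = make_data(1000)                          # large test set
lda.sim = lda( y ~ x1 + x2, data = train )
qda.sim = qda( y ~ x1 + x2, data = train )
mean( predict(lda.sim, train)$class != train$y ) # LDA training error
mean( predict(qda.sim, train)$class != train$y ) # QDA training error (usually a bit lower)
mean( predict(lda.sim, test)$class  != test$y )  # LDA test error
mean( predict(qda.sim, test)$class  != test$y )  # QDA test error (usually a bit higher)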

2. See the last page.

3. Chap 4, exercise 10 [e,f,g,h].

(e) > library(ISLR)                                # Contains the Weekly data set
    > attach(Weekly)
    > Weekly_training = Weekly[ Year < 2009, ]     # Split the data by year into
    > Weekly_testing  = Weekly[ Year > 2008, ]     # the training and testing sets
    > library(MASS)                                # This library contains LDA
    > lda.fit = lda( Direction ~ Lag2, data = Weekly_training )   # Fit using training data only
    # Now use this model to predict classification of the testing data
    > Direction_Predicted_LDA = predict( lda.fit, data.frame(Weekly_testing) )$class
    > table( Weekly_testing$Direction, Direction_Predicted_LDA )  # Confusion matrix
             Direction_Predicted_LDA
    Direction Down Up
         Down    9 34
         Up      5 56
    > mean( Weekly_testing$Direction == Direction_Predicted_LDA )
    [1] 0.625                        # Correct classification rate of LDA is 62.5%.

(f) # Similarly with QDA.
    > qda.fit = qda( Direction ~ Lag2, data = Weekly_training )
    > Direction_Predicted_QDA = predict( qda.fit, data.frame(Weekly_testing) )$class
    > table( Weekly_testing$Direction, Direction_Predicted_QDA )
             Direction_Predicted_QDA
    Direction Down Up
         Down    0 43    # Correct classification rate of QDA is only 58.7%.
         Up      0 61    # However, QDA always predicted that the market goes Up.
    > mean( Weekly_testing$Direction == Direction_Predicted_QDA )
    [1] 0.5865385        # LDA has a higher prediction power; QDA seems to overfit.
                         # Logistic regression and LDA yield the best results so far.

(g) # KNN with k = 1 is a very flexible method, matching the local features.
    > library(class)                               # Contains knn()
    > X.train = matrix( Lag2[ Year < 2009 ], sum(Year < 2009), 1 )
    > X.test  = matrix( Lag2[ Year > 2008 ], sum(Year > 2008), 1 )
    > Y.train = Direction[ Year < 2009 ]
    > Y.test  = Direction[ Year > 2008 ]
    > knn.fit = knn( X.train, X.test, Y.train, 1 )
    > table( Y.test, knn.fit )
          knn.fit
    Y.test Down Up
      Down   21 22
      Up     29 32
    > mean( Y.test == knn.fit )
    [1] 0.5096154
    # With k = 4, the KNN method gives a better classification rate for these data:
    > knn.fit = knn( X.train, X.test, Y.train, 4 )
    > mean( Y.test == knn.fit )
    [1] 0.5961538

(h) Among the methods that we explored, logistic regression and LDA yield the highest correct classification rate, 0.625. KNN with K = 4 yields the classification rate 0.596, QDA yields 0.587, and KNN with K = 1 yields 0.510.
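Rather than checking only K = 1 and K = 4, one can scan a small grid of K values on the same split. The sketch below is not part of the original solution; it simply reuses the X.train, X.test, Y.train, Y.test objects defined in (g).

# Sketch: test classification rate of KNN for several values of K (objects from part (g)).
library(class)
set.seed(1)                          # knn() breaks distance ties at random
for (K in c(1, 2, 3, 4, 5, 10, 20)) {
  rate = mean( knn(X.train, X.test, Y.train, k = K) == Y.test )
  cat("K =", K, " correct classification rate =", round(rate, 3), "\n")
}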

4. Chap 4, exercise 13.

> attach(Boston)                  # Attach the Boston data set (it comes with the MASS package)
> ?Boston                         # Data description
> names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
> Crime_Rate = ( crim > median(crim) )          # Categorical variable Crime_Rate
> table(Crime_Rate)
Crime_Rate
FALSE  TRUE
  253   253                       # The median splits the data evenly
> Boston = cbind( Boston, Crime_Rate )

# Logistic regression. First, we'll fit the full model.
> lreg.fit = glm( Crime_Rate ~ zn + indus + chas + nox + rm + age + dis + rad + tax
+                 + ptratio + black + lstat + medv, family = "binomial" )
> summary(lreg.fit)
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.103704   6.530014  -5.223 1.76e-07 ***
zn           -0.079918   0.033731  -2.369  0.01782 *
indus        -0.059389   0.043722  -1.358  0.17436
chas          0.785327   0.728930   1.077  0.28132
nox          48.523782   7.396497   6.560 5.37e-11 ***
rm           -0.425596   0.701104  -0.607  0.54383
age           0.022172   0.012221   1.814  0.06963 .
dis           0.691400   0.218308   3.167  0.00154 **
rad           0.656465   0.152452   4.306 1.66e-05 ***
tax          -0.006412   0.002689  -2.385  0.01709 *
ptratio       0.368716   0.122136   3.019  0.00254 **
black        -0.013524   0.006536  -2.069  0.03853 *
lstat         0.043862   0.048981   0.895  0.37052
medv          0.167130   0.066940   2.497  0.01254 *

# Some variables are not significant. We'll fit a reduced model without them.
> lreg.fit = glm( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
+                 family = "binomial" )
> summary(lreg.fit)
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -31.442712   6.048989  -5.198 2.02e-07 ***
zn           -0.082567   0.031424  -2.628  0.00860 **
nox          43.195824   6.452821   6.694 2.17e-11 ***
age           0.022851   0.009894   2.310  0.02091 *
dis           0.634380   0.207634   3.055  0.00225 **
rad           0.718773   0.142066   5.059 4.21e-07 ***
tax          -0.007676   0.002503  -3.066  0.00217 **
ptratio       0.303502   0.109255   2.778  0.00547 **
black        -0.012866   0.006334  -2.031  0.04224 *
medv          0.112882   0.034362   3.285  0.00102 **
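Dropping the non-significant predictors by hand is one reasonable approach; an AIC-based stepwise search gives a quick automated cross-check. The sketch below is not part of the graded solution; it assumes the attached Boston data and the Crime_Rate variable created above.

# Optional cross-check: backward selection by AIC, starting from the full logistic model.
full.fit    = glm( Crime_Rate ~ zn + indus + chas + nox + rm + age + dis + rad + tax
                   + ptratio + black + lstat + medv, family = "binomial" )
reduced.fit = step( full.fit, direction = "backward", trace = 0 )
summary(reduced.fit)              # compare with the hand-picked reduced model above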

# To evaluate the model's predictive performance, we'll use the validation set method.
> n = length(crim)                # Split the data randomly into training and testing subsets
> n
[1] 506
> Z = sample( n, n/2 )
> Data_training = Boston[ Z, ]
> Data_testing  = Boston[ -Z, ]
> lreg.fit = glm( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
+                 data = Data_training, family = "binomial" )
# Fit a logistic regression model on the training data
# and use this model to predict the testing data.
> Predicted_probability = predict( lreg.fit, data.frame(Data_testing), type = "response" )
> Predicted_Crime_Rate = 1*( Predicted_probability > 0.5 )    # Predicted crime rate: 1 = high, 0 = low
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate )
   Predicted_Crime_Rate
      0   1
  0 116  11
  1  14  98
> mean( Data_testing$Crime_Rate == Predicted_Crime_Rate )
[1] 0.8953975
# Logistic regression shows a correct classification rate of 89.5%; the error rate is 10.5%.

# Now, try LDA. For a fair comparison, use the same training data, then predict the testing data.
> lda.fit = lda( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
+               data = Data_training )
> Predicted_Crime_Rate_LDA = predict( lda.fit, data.frame(Data_testing) )$class
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate_LDA )
   Predicted_Crime_Rate_LDA
      0   1
  0 117  10
  1  28  84
> mean( Data_testing$Crime_Rate != Predicted_Crime_Rate_LDA )
[1] 0.1589958                     # LDA's error rate is higher, 15.9%.
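A single random split gives a somewhat noisy estimate of the test error, so the ranking of the methods can change from split to split. The sketch below is not part of the original solution; it assumes the Boston data frame already contains the Crime_Rate column and simply repeats the validation-set comparison of logistic regression and LDA over several random splits.

# Sketch: average validation-set error of logistic regression and LDA over 20 random splits.
library(MASS)
set.seed(1)
B   = 20
err = matrix( NA, B, 2, dimnames = list(NULL, c("logistic", "LDA")) )
for (b in 1:B) {
  Z     = sample( nrow(Boston), nrow(Boston)/2 )
  train = Boston[ Z, ]
  test  = Boston[ -Z, ]
  fit.glm = glm( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                 data = train, family = "binomial" )
  p.hat   = predict( fit.glm, test, type = "response" )
  err[b, "logistic"] = mean( (p.hat > 0.5) != test$Crime_Rate )
  fit.lda = lda( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                 data = train )
  err[b, "LDA"] = mean( predict(fit.lda, test)$class != test$Crime_Rate )
}
colMeans(err)                     # average error rates over the 20 splits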

# What about QDA? If the logit has a nonlinear relation with the predictors, QDA may be a better method.
> qda.fit = qda( Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
+               data = Data_training )
> Predicted_Crime_Rate_QDA = predict( qda.fit, data.frame(Data_testing) )$class
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate_QDA )
   Predicted_Crime_Rate_QDA
      0   1
  0 121   6
  1  26  86
> mean( Data_testing$Crime_Rate != Predicted_Crime_Rate_QDA )
[1] 0.1338912
# QDA's error rate of 13.4% is between the LDA and the logistic regression errors.
# According to these quick results, logistic regression has the best prediction power.

# It may be interesting to look at the distribution of the predicted probabilities of high-crime
# zones produced by the three classification methods. All of them discriminate between the two
# groups pretty well.
> hist( Predicted_probability, main = 'PREDICTED PROBABILITY BY LOGISTIC REGRESSION' )
> hist( predict(lda.fit, data.frame(Data_testing))$posterior, main = 'POSTERIOR DISTRIBUTION BY LDA' )
> hist( predict(qda.fit, data.frame(Data_testing))$posterior, main = 'POSTERIOR DISTRIBUTION BY QDA' )

# KNN method. Define the training and testing data.
> names(Data_training)
 [1] "crim"       "zn"         "indus"      "chas"       "nox"
 [6] "rm"         "age"        "dis"        "rad"        "tax"
[11] "ptratio"    "black"      "lstat"      "medv"       "Crime_Rate"
# Variables zn, nox, age, dis, rad, tax, ptratio, black, medv are in columns 2,5,7,8,9,10,11,12,14.
> X.train = Data_training[ , c(2,5,7,8,9,10,11,12,14) ]
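Only the definition of X.train appears above. For completeness, here is a minimal sketch of one way the KNN comparison could be finished on the same split; this is not the author's code, and both the standardization step and the choice k = 5 are my own assumptions. Standardizing matters because KNN is based on Euclidean distances and these predictors are on very different scales.

# Sketch (not from the original solution): finish the KNN comparison on the same split.
library(class)
cols    = c(2, 5, 7, 8, 9, 10, 11, 12, 14)     # zn, nox, age, dis, rad, tax, ptratio, black, medv
train.X = scale( Data_training[ , cols ] )     # standardize: KNN is distance-based
test.X  = scale( Data_testing[ , cols ],
                 center = attr(train.X, "scaled:center"),
                 scale  = attr(train.X, "scaled:scale") )
train.Y = factor( Data_training$Crime_Rate )
test.Y  = factor( Data_testing$Crime_Rate )
set.seed(1)
knn.pred = knn( train.X, test.X, train.Y, k = 5 )   # k = 5 is an arbitrary choice
mean( knn.pred != test.Y )                          # KNN test error rate on this split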

2. (a) For Class A, based on the observations (4, 2),
$$\bar X_A = 3, \qquad s_A^2 = \frac{(4-3)^2 + (2-3)^2}{2-1} = 2, \qquad s_A = \sqrt{2} \approx 1.4 .$$
For Class B, based on the observations (9, 6, 3),
$$\bar X_B = 6, \qquad s_B^2 = \frac{(9-6)^2 + (6-6)^2 + (3-6)^2}{3-1} = 9, \qquad s_B = 3 .$$
The pooled variance is
$$s^2 = \frac{(4-3)^2 + (2-3)^2 + (9-6)^2 + (6-6)^2 + (3-6)^2}{5-2} = \frac{20}{3},$$
and the pooled standard deviation is $s = 2.58$.

(b) We know that in this situation, when $\pi_A = \pi_B = 0.5$, $L(A|B) = L(B|A)$, and $\sigma_A = \sigma_B$, the point is classified to the closest class mean. Since $x = 5$ is closer to $\bar X_B = 6$, we classify it to Class B.

(c) Calculate the posterior probabilities:
$$P\{A \mid x=5\} = \frac{\pi_A f_A(5)}{\pi_A f_A(5) + \pi_B f_B(5)}
= \frac{\pi_A \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(5-\mu_A)^2/2\sigma^2\}}
       {\pi_A \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(5-\mu_A)^2/2\sigma^2\}
        + \pi_B \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(5-\mu_B)^2/2\sigma^2\}},$$
estimated, with the given prior probabilities $\pi_A = 0.8$ and $\pi_B = 0.2$, by
$$\frac{(0.8)\,\exp\{-(5-3)^2/(2 \cdot 20/3)\}}
       {(0.8)\,\exp\{-(5-3)^2/(2 \cdot 20/3)\} + (0.2)\,\exp\{-(5-6)^2/(2 \cdot 20/3)\}}
= \frac{0.5927}{0.5927 + 0.1855} = 0.7616,$$
$$P\{B \mid x=5\} = 1 - P\{A \mid x=5\} = 0.2384 .$$
With a zero-one loss function, we classify the new observation according to the higher posterior probability, i.e., to Class A.

(d) We already have the posterior probabilities. Then:
the risk of classifying the new observation into Class A is $\mathrm{Risk}(A) = (3)\,P\{B \mid x=5\} = 0.715$;
the risk of classifying it into Class B is $\mathrm{Risk}(B) = (1)\,P\{A \mid x=5\} = 0.7616$.
So, we classify $x = 5$ into Class A because this decision has the lower risk.
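The arithmetic in (a)-(c) is easy to double-check numerically. The short R sketch below is not part of the solution; it just recomputes the pooled variance and the posterior probabilities with dnorm().

# Quick numerical check of (a)-(c), not part of the graded solution.
A = c(4, 2);  B = c(9, 6, 3);  x = 5
s2.pooled = ( sum((A - mean(A))^2) + sum((B - mean(B))^2) ) / (length(A) + length(B) - 2)
s2.pooled                                   # 20/3 = 6.67, pooled sd = 2.58
piA = 0.8;  piB = 0.2
fA = dnorm( x, mean(A), sqrt(s2.pooled) )   # common variance => the LDA setting
fB = dnorm( x, mean(B), sqrt(s2.pooled) )
postA = piA*fA / ( piA*fA + piB*fB )
c( postA, 1 - postA )                       # approximately 0.76 and 0.24, as in (c)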

(e) Now we have to use the separate variances instead of the pooled variance. With equal prior probabilities,
$$P\{A \mid x=5\}
= \frac{\frac{\pi_A}{\sigma_A\sqrt{2\pi}}\,\exp\{-(5-\mu_A)^2/2\sigma_A^2\}}
       {\frac{\pi_A}{\sigma_A\sqrt{2\pi}}\,\exp\{-(5-\mu_A)^2/2\sigma_A^2\}
        + \frac{\pi_B}{\sigma_B\sqrt{2\pi}}\,\exp\{-(5-\mu_B)^2/2\sigma_B^2\}},$$
estimated by
$$\frac{\frac{0.5}{1.4}\,\exp\{-(5-3)^2/(2 \cdot 2)\}}
       {\frac{0.5}{1.4}\,\exp\{-(5-3)^2/(2 \cdot 2)\} + \frac{0.5}{3}\,\exp\{-(5-6)^2/(2 \cdot 9)\}}
= \frac{0.1301}{0.1301 + 0.1577} = 0.4520,$$
$$P\{B \mid x=5\} = 1 - P\{A \mid x=5\} = 0.5480 .$$
With a zero-one loss function, we classify the new observation according to the higher posterior probability, i.e., to Class B.

With the unequal prior probabilities $\pi_A = 0.8$ and $\pi_B = 0.2$,
$$P\{A \mid x=5\}
= \frac{\frac{0.8}{1.4}\,\exp\{-(5-3)^2/(2 \cdot 2)\}}
       {\frac{0.8}{1.4}\,\exp\{-(5-3)^2/(2 \cdot 2)\} + \frac{0.2}{3}\,\exp\{-(5-6)^2/(2 \cdot 9)\}}
= \frac{0.208}{0.208 + 0.063} = 0.7674,$$
$$P\{B \mid x=5\} = 1 - P\{A \mid x=5\} = 0.2326,$$
and we classify $x = 5$ to Class A.

With $L(A|B) = 3\,L(B|A)$, we compute the risks
$$\mathrm{Risk}(A) = (3)\,P\{B \mid x=5\} = (3)(0.2326) = 0.6977, \qquad
  \mathrm{Risk}(B) = (1)\,P\{A \mid x=5\} = 0.7674,$$
and we classify $x = 5$ into Class A because this decision has the lower risk.

(f) In fact, we used LDA in (b)-(d) and QDA in (e), because LDA assumes $\sigma_A = \sigma_B$, whereas QDA does not make this assumption.
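The same kind of numerical check works for (e), now with the separate class variances $s_A^2 = 2$ and $s_B^2 = 9$ (the QDA setting). Again, this is only a verification sketch, not part of the solution.

# Quick numerical check of (e), not part of the graded solution.
A = c(4, 2);  B = c(9, 6, 3);  x = 5
fA = dnorm( x, mean(A), sd(A) )     # separate variances => the QDA setting
fB = dnorm( x, mean(B), sd(B) )
# Equal priors:
0.5*fA / ( 0.5*fA + 0.5*fB )        # approximately 0.45, so classify to B
# Unequal priors 0.8 and 0.2:
0.8*fA / ( 0.8*fA + 0.2*fB )        # approximately 0.77, so classify to A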
