Stat 427/627 Statistical Machine Learning (Baron)
HOMEWORK 6, Solutions
DISCRIMINANT ANALYSIS: LDA AND QDA

1. Chap 4, exercise 5.

(a) On a training set, LDA and QDA are both expected to perform well. LDA uses the correct model, and QDA includes the linear model as a special case. QDA, covering a wider range of models, will fit the given particular data set better. On a test set, however, LDA should be more accurate, because QDA, with its additional parameters, overfits and therefore has a higher variance.

(b) For a nonlinear boundary, QDA should perform better on both the training and the test sets, because LDA cannot capture the nonlinear pattern.

(c) As the sample size increases, QDA is expected to perform better, because its higher variance is offset by the large sample size.

(d) False. The higher variance of the QDA prediction, without any gain in bias, will result in a higher test error rate.

2. See the last page.

3. Chap 4, exercise 10[e,f,g,h].

(e)
> library(ISLR)
> attach(Weekly)
> Weekly_training <- Weekly[ Year < 2009, ]   # Split the data by year into
> Weekly_testing  <- Weekly[ Year > 2008, ]   # the training and testing sets
> library(MASS)                               # This library contains LDA
> lda.fit <- lda(Direction ~ Lag2, data=Weekly_training)   # Fit using training data only
# Now use this model to predict classification of the testing data
> Direction_Predicted_LDA <- predict(lda.fit, data.frame(Weekly_testing))$class
> table( Weekly_testing$Direction, Direction_Predicted_LDA )   # Confusion matrix
          Direction_Predicted_LDA
Direction  Down Up
     Down     9 34
     Up       5 56
> mean( Weekly_testing$Direction == Direction_Predicted_LDA )
[1] 0.625            # Correct classification rate of LDA is 62.5%.

(f) # Similarly with QDA.
> qda.fit <- qda(Direction ~ Lag2, data=Weekly_training)
> Direction_Predicted_QDA <- predict(qda.fit, data.frame(Weekly_testing))$class
> table( Weekly_testing$Direction, Direction_Predicted_QDA )
          Direction_Predicted_QDA
Direction  Down Up
     Down     0 43   # Correct classification rate of QDA is only 58.7%.
     Up       0 61   # However, QDA always predicted that the market goes Up.
                     # LDA has higher prediction power. QDA seems to be an overfit.
> mean( Weekly_testing$Direction == Direction_Predicted_QDA )
[1] 0.5865385
# Logistic regression and LDA yield the best results so far.

(g) # KNN with k=1 is a very flexible method, matching the local features.
> library(class)     # This library contains knn
> X.train <- matrix( Lag2[ Year < 2009 ], sum(Year < 2009), 1 )
> X.test  <- matrix( Lag2[ Year > 2008 ], sum(Year > 2008), 1 )
> Y.train <- Direction[ Year < 2009 ]
> Y.test  <- Direction[ Year > 2008 ]
> knn.fit <- knn( X.train, X.test, Y.train, 1 )
> table( Y.test, knn.fit )
       knn.fit
Y.test  Down Up
   Down   21 22
   Up     29 32
> mean( Y.test == knn.fit )
[1] 0.5096154
# With k=4, the KNN method gives a better classification rate for these data:
> knn.fit <- knn( X.train, X.test, Y.train, 4 )
> mean( Y.test == knn.fit )
[1] 0.5961538

(h) Among the methods that we explored, logistic regression and LDA yield the highest classification rate, 0.625. KNN with K=4 yields the classification rate 0.596. QDA yields the classification rate 0.587. KNN with K=1 yields the classification rate 0.510.

4. Chap 4, exercise 13.

> attach(Boston)     # Attach the Boston dataset from the MASS package
> ?Boston            # Data description
> names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
> Crime_Rate <- ( crim > median(crim) )   # Categorical variable Crime_Rate
> table(Crime_Rate)
Crime_Rate
FALSE  TRUE
  253   253          # The median splits the data evenly
> Boston <- cbind( Boston, Crime_Rate )

# Logistic regression. First, we'll fit the full model.
> lreg.fit <- glm(Crime_Rate ~ zn + indus + chas + nox + rm + age + dis + rad
+                 + tax + ptratio + black + lstat + medv, family="binomial")
> summary(lreg.fit)
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.103704   6.530014  -5.223 1.76e-07 ***
zn           -0.079918   0.033731  -2.369  0.01782 *
indus        -0.059389   0.043722  -1.358  0.17436
chas          0.785327   0.728930   1.077  0.28132
nox          48.523782   7.396497   6.560 5.37e-11 ***
rm           -0.425596   0.701104  -0.607  0.54383
age           0.022172   0.012220   1.814  0.06963 .
dis           0.691400   0.218308   3.167  0.00154 **
rad           0.656465   0.152452   4.306 1.66e-05 ***
tax          -0.006412   0.002689  -2.385  0.01709 *
ptratio       0.368716   0.122136   3.019  0.00254 **
black        -0.013524   0.006536  -2.069  0.03853 *
lstat         0.043862   0.048981   0.895  0.37052
medv          0.167130   0.066940   2.497  0.01254 *

# Some variables are not significant. We'll fit a reduced model, without them.
> lreg.fit <- glm(Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio
+                 + black + medv, family="binomial")
> summary(lreg.fit)
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -31.442721   6.048989  -5.198 2.02e-07 ***
zn           -0.082567   0.031424  -2.628  0.00860 **
nox          43.195824   6.452812   6.694 2.17e-11 ***
age           0.022851   0.009894   2.310  0.02090 *
dis           0.634380   0.207634   3.055  0.00225 **
rad           0.718773   0.142066   5.059 4.21e-07 ***
tax          -0.007676   0.002503  -3.066  0.00217 **
ptratio       0.303502   0.109255   2.778  0.00547 **
black        -0.012866   0.006334  -2.031  0.04224 *
medv          0.112882   0.034362   3.285  0.00102 **

# To evaluate the model's predictive performance, we'll use the validation set method.
> n <- length(crim)    # Split the data randomly into training and testing subsets
> n
[1] 506
> Z <- sample(n, n/2)
> Data_training <- Boston[ Z, ]
> Data_testing  <- Boston[ -Z, ]
> lreg.fit <- glm(Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio
+                 + black + medv, data=Data_training, family="binomial")
# Fit a logistic regression model on the training data
# and use this model to predict the testing data
> Predicted_probability <- predict( lreg.fit, data.frame(Data_testing), type="response" )
> Predicted_Crime_Rate <- 1*(Predicted_probability > 0.5)   # Predicted crime rate: 1 = high, 0 = low
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate )
   Predicted_Crime_Rate
      0   1
  0 116  11
  1  14  98
> mean( Data_testing$Crime_Rate == Predicted_Crime_Rate )
[1] 0.8953975
# Logistic regression shows the correct classification rate of 89.5%; the error rate is 10.5%.

# Now, try LDA. For a fair comparison, use the same training data, then predict the testing data.
> lda.fit <- lda(Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio
+                + black + medv, data=Data_training)
> Predicted_Crime_Rate_LDA <- predict(lda.fit, data.frame(Data_testing))$class
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate_LDA )
   Predicted_Crime_Rate_LDA
      0   1
  0 117  10
  1  28  84
> mean( Data_testing$Crime_Rate != Predicted_Crime_Rate_LDA )
[1] 0.1589958
# LDA's error rate is higher, 15.9%.

# What about QDA? If the logit has a nonlinear relation with the predictors, QDA may be a better method.
> qda.fit <- qda(Crime_Rate ~ zn + nox + age + dis + rad + tax + ptratio
+                + black + medv, data=Data_training)
> Predicted_Crime_Rate_QDA <- predict(qda.fit, data.frame(Data_testing))$class
> table( Data_testing$Crime_Rate, Predicted_Crime_Rate_QDA )
   Predicted_Crime_Rate_QDA
      0   1
  0 121   6
  1  26  86
> mean( Data_testing$Crime_Rate != Predicted_Crime_Rate_QDA )
[1] 0.1338912
# QDA's error rate of 13.4% is between the LDA and the logistic regression rates.
# According to these quick results, logistic regression has the best prediction power.

# It may be interesting to look at the distribution of the predicted probabilities of high-crime
# zones produced by the three classification methods. All of them discriminate between the two
# groups pretty well.
> hist(Predicted_probability, main='PREDICTED PROBABILITY BY LOGISTIC REGRESSION')
> hist(predict(lda.fit, data.frame(Data_testing))$posterior, main='POSTERIOR DISTRIBUTION BY LDA')
> hist(predict(qda.fit, data.frame(Data_testing))$posterior, main='POSTERIOR DISTRIBUTION BY QDA')

# KNN method. Define the training and testing data.
> names(Data_training)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"
 [6] "rm"      "age"     "dis"     "rad"     "tax"
[11] "ptratio" "black"   "lstat"   "medv"    "Crime_Rate"
# Variables zn, nox, age, dis, rad, tax, ptratio, black, medv are in columns 2,5,7,8,9,10,11,12,14.
> X.train <- Data_training[, c(2,5,7,8,9,10,11,12,14) ]
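The transcript breaks off after selecting the predictor columns, but the mechanics of the KNN classification step it is setting up can be sketched in a few lines. The following is a minimal illustration in Python with made-up toy data (NOT the Boston variables): each test point is classified by majority vote among its k nearest training points.

```python
# Minimal k-nearest-neighbors sketch (pure Python), illustrating the step the
# transcript performs next with knn(): predict each test row from the k nearest
# training rows, then tabulate the correct classification rate.
# The toy data below are invented for illustration only.
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=1):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x_new)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two clusters, labels 0 ("low crime") and 1 ("high crime")
X_train = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (1.2, 0.9)]
y_train = [0, 0, 1, 1]
X_test  = [(0.1, 0.0), (1.1, 1.0)]
y_test  = [0, 1]

pred = [knn_predict(X_train, y_train, x, k=1) for x in X_test]
rate = sum(p == y for p, y in zip(pred, y_test)) / len(y_test)
print(pred, rate)   # -> [0, 1] 1.0
```

In practice (as in the R transcript), the predictors should be standardized first, since KNN distances are sensitive to the scales of the variables.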
2. (a) For Class A, based on (4, 2):
       Xbar_A = 3,   s_A^2 = [(4-3)^2 + (2-3)^2]/(2-1) = 2,   s_A = 1.41.
   For Class B, based on (9, 6, 3):
       Xbar_B = 6,   s_B^2 = [(9-6)^2 + (6-6)^2 + (3-6)^2]/(3-1) = 9,   s_B = 3.
   The pooled variance is
       s^2 = [(4-3)^2 + (2-3)^2 + (9-6)^2 + (6-6)^2 + (3-6)^2]/(2+3-2) = 20/3,
   and the pooled standard deviation is s = 2.58.

(b) We know that in this situation, when pi_A = pi_B = 0.5, L(A|B) = L(B|A), and sigma_A = sigma_B, the point is classified to the closest class mean. x = 5 is closer to Xbar_B = 6, so we classify it to Class B.

(c) Calculate the posterior probabilities:

    P{A | x=5} = pi_A f_A(5) / [ pi_A f_A(5) + pi_B f_B(5) ]
               = [pi_A/(sigma sqrt(2 pi))] exp{-(5-mu_A)^2/(2 sigma^2)}
                 / ( [pi_A/(sigma sqrt(2 pi))] exp{-(5-mu_A)^2/(2 sigma^2)}
                   + [pi_B/(sigma sqrt(2 pi))] exp{-(5-mu_B)^2/(2 sigma^2)} ),

    estimated (the common factor 1/(sigma sqrt(2 pi)) cancels) by

       (0.8) exp{-(5-3)^2/(2*20/3)}
       / [ (0.8) exp{-(5-3)^2/(2*20/3)} + (0.2) exp{-(5-6)^2/(2*20/3)} ]
       = 0.5927/(0.5927 + 0.1855) = 0.7616,

    P{B | x=5} = 1 - P{A | x=5} = 0.2384.

    With a zero-one loss function, we classify the new observation according to the higher posterior probability, i.e., to Class A.

(d) We already have the posterior probabilities. Then:
    The risk of classifying the new observation into Class A is Risk(A) = (3) P{B | x=5} = 0.7152.
    The risk of classifying the new observation into Class B is Risk(B) = (1) P{A | x=5} = 0.7616.
    So, we classify x = 5 into Class A because this decision has a lower risk.

(e) Now we have to use the separate variances instead of the pooled variance. With equal prior probabilities, we have

    P{A | x=5} = [pi_A/(sigma_A sqrt(2 pi))] exp{-(5-mu_A)^2/(2 sigma_A^2)}
                 / ( [pi_A/(sigma_A sqrt(2 pi))] exp{-(5-mu_A)^2/(2 sigma_A^2)}
                   + [pi_B/(sigma_B sqrt(2 pi))] exp{-(5-mu_B)^2/(2 sigma_B^2)} ),

    estimated by

       (0.5/1.41) exp{-(5-3)^2/(2*2)}
       / [ (0.5/1.41) exp{-(5-3)^2/(2*2)} + (0.5/3) exp{-(5-6)^2/(2*9)} ]
       = 0.1301/(0.1301 + 0.1577) = 0.4520,

    P{B | x=5} = 1 - P{A | x=5} = 0.5480.

    With a zero-one loss function, we classify the new observation according to the higher posterior probability, i.e., to Class B.

    With unequal prior probabilities pi_A = 0.8 and pi_B = 0.2,

    P{A | x=5} = (0.8/1.41) exp{-(5-3)^2/(2*2)}
                 / [ (0.8/1.41) exp{-(5-3)^2/(2*2)} + (0.2/3) exp{-(5-6)^2/(2*9)} ]
               = 0.2081/(0.2081 + 0.0631) = 0.7674,

    P{B | x=5} = 1 - P{A | x=5} = 0.2326,

    and we classify x = 5 to Class A.

    With L(A|B) = 3 L(B|A), we compute the risks
       Risk(A) = (3) P{B | x=5} = (3)(0.2326) = 0.6977,
       Risk(B) = (1) P{A | x=5} = 0.7674,
    and classify x = 5 into Class A because this decision has a lower risk.

(f) In fact, we used LDA in (b)-(d) and QDA in (e), because LDA assumes sigma_A = sigma_B whereas QDA does not make this assumption.
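The arithmetic in parts (c) and (e) is easy to verify numerically. Here is a short check (Python; the means, variances, and priors are taken from the solution above):

```python
# Numerical check of the posterior probabilities in parts (c) and (e).
from math import exp, sqrt

x = 5
mA, mB = 3, 6          # class sample means
s2_pooled = 20 / 3     # pooled variance (LDA setting)
s2A, s2B = 2, 9        # separate variances (QDA setting)

# Part (c): common variance, priors 0.8 and 0.2 (the 1/sigma factor cancels)
fA = 0.8 * exp(-(x - mA) ** 2 / (2 * s2_pooled))
fB = 0.2 * exp(-(x - mB) ** 2 / (2 * s2_pooled))
print(round(fA / (fA + fB), 4))   # -> 0.7616, so classify to Class A

# Part (e): separate variances, priors 0.8 and 0.2 (keep the 1/sigma factors)
gA = 0.8 / sqrt(s2A) * exp(-(x - mA) ** 2 / (2 * s2A))
gB = 0.2 / sqrt(s2B) * exp(-(x - mB) ** 2 / (2 * s2B))
print(round(gA / (gA + gB), 4))   # -> 0.7674, so classify to Class A
```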