Assignment 2: K-Nearest Neighbors and Logistic Regression

SDS293 - Machine Learning
Due: 4 Oct 2017 by 11:59pm

Conceptual Exercises

4.4 parts a-d (p ISLR)

When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse. In each of the following scenarios, we assume that each predictor is uniformly (evenly) distributed on [0, 1], and that each observation is associated with a response value.

(a) Suppose that we have a set of observations, each with measurements on p = 1 feature, X. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

Solution: On average, 10%. For simplicity, this ignores the edge cases where X < 0.05 or X > 0.95.

(b) Now suppose that we have a set of observations, each with measurements on p = 2 features, $X_1$ and $X_2$. We wish to predict a test observation's response using only observations that are within 10% of the range of $X_1$ and within 10% of the range of $X_2$ closest to that test observation. On average, what fraction of the available observations will we use to make the prediction?

Solution: On average, $0.10 \times 0.10 = 1\%$.

(c) Now suppose that we have a set of observations on p = 100 features. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

Solution: On average, $0.10^{100} \times 100 = 10^{-98}\%$.
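The pattern in parts (a)-(c) can be checked numerically. The sketch below is illustrative only: `fraction_used` is a hypothetical helper name, and it assumes independent Uniform(0, 1) predictors and ignores edge effects, as the solutions do.

```r
# Fraction of observations expected inside a hypercube that spans
# `width` of the range of each of p independent Uniform(0, 1) predictors.
# Edge effects (test points near 0 or 1) are ignored.
fraction_used <- function(p, width = 0.10) {
  width^p
}

p_values <- c(1, 2, 100)
fractions <- fraction_used(p_values)

# Show the fractions as percentages for p = 1, 2, 100
data.frame(p = p_values, percent = 100 * fractions)
```

Each additional feature multiplies the usable fraction by 0.10, which is why the result collapses so quickly as p grows.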

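(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when p is large is that there are very few training observations near any given test observation.

Solution: The fraction of observations within 10% of the range on every feature is $0.10^p$. As p increases linearly, the fraction of observations that are geometrically nearby decreases exponentially, so for large p almost no training observations are close to a given test observation.

4.6 (p. 170 ISLR)

Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and Y = received an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, and $\hat{\beta}_2 = 1$.

(a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

Solution:

$$p(X) = \frac{e^{-6 + 0.05X_1 + X_2}}{1 + e^{-6 + 0.05X_1 + X_2}} = \frac{e^{-6 + 0.05(40) + 3.5}}{1 + e^{-6 + 0.05(40) + 3.5}} = \frac{e^{-0.5}}{1 + e^{-0.5}} = 37.75\%$$

(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

Solution: Set $p(X) = 0.5$ with $X_2 = 3.5$ and solve for $X_1$:

$$\frac{e^{-6 + 0.05X_1 + 3.5}}{1 + e^{-6 + 0.05X_1 + 3.5}} = 0.5$$
$$e^{-2.5 + 0.05X_1} = 0.5\left(1 + e^{-2.5 + 0.05X_1}\right)$$
$$0.5\,e^{-2.5 + 0.05X_1} = 0.5$$
$$e^{-2.5 + 0.05X_1} = 1$$
$$\log(1) = -2.5 + 0.05X_1$$
$$0 = -2.5 + 0.05X_1$$
$$2.5 = 0.05X_1$$
$$50 \text{ hours} = X_1$$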
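Both parts of 4.6 can be verified with base R's logistic functions. This is a minimal sketch, assuming the intercept is -6 (the value consistent with the 37.75% answer); `prob_A` and `hours_needed` are hypothetical helper names introduced here for illustration.

```r
# Coefficients for exercise 4.6 (intercept assumed to be -6)
beta <- c(intercept = -6, hours = 0.05, gpa = 1)

# Logistic model: P(A) = plogis(b0 + b1 * hours + b2 * gpa)
prob_A <- function(hours, gpa) {
  plogis(beta[["intercept"]] + beta[["hours"]] * hours + beta[["gpa"]] * gpa)
}

# Part (a): 40 hours of study, GPA of 3.5
prob_a <- prob_A(hours = 40, gpa = 3.5)

# Part (b): invert the model. qlogis() gives the log-odds for the target
# probability; solve the linear predictor for hours.
hours_needed <- function(target_prob, gpa) {
  (qlogis(target_prob) - beta[["intercept"]] - beta[["gpa"]] * gpa) /
    beta[["hours"]]
}
hours_b <- hours_needed(0.5, gpa = 3.5)

round(prob_a, 4)  # 0.3775
hours_b           # 50
```

Since `qlogis(0.5)` is zero, part (b) reduces to setting the linear predictor to zero, which matches the hand derivation.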

4.8 (p. 170 ISLR)

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we use for classification of new observations? Why?

Solution: For KNN with K = 1, the training error rate is 0%, because the nearest neighbor of any training observation is the observation itself. Since the training and test sets are the same size, the 18% average implies (0% + test error) / 2 = 18%, so KNN has a test error rate of 36%. Thus, it makes sense to choose logistic regression to classify new data because of its lower test error rate of 30%.

Applied Exercises

4.13 variation (p. 173 ISLR)

Using the Boston data set, fit a KNN classification model to predict whether a given suburb has high, low, or average crime rate. You get to decide where the cutoffs are, which predictors to use, and how many neighbors to consider. Justify your choices, and describe your findings.

A2 Applied
Jordan Crouser
10/7/2017

Step 1: Get data

    library(MASS)    # provides the Boston data set
    attach(Boston)

Step 2: Split data into training and test

    library(dplyr)
    set.seed(1)
    train_boston = Boston %>% sample_frac(0.8)
    test_boston = Boston %>% setdiff(train_boston)

Step 3: Split off response column, convert to factor and vectorize

    summary(Boston$crim)

    # Cutoffs:
    # LOW = anything below the median
    # AVERAGE = anything between the median and mean
    # HIGH = anything above the mean
    # The cutoffs are computed from the full data rather than typed in by hand
    crim_breaks = c(-Inf, median(Boston$crim), mean(Boston$crim), Inf)

    train_crim = train_boston %>%
      select(crim) %>%
      mutate(crim = cut(crim, breaks = crim_breaks,
                        labels = c("low", "average", "high"))) %>%
      .$crim

    test_crim = test_boston %>%
      select(crim) %>%
      mutate(crim = cut(crim, breaks = crim_breaks,
                        labels = c("low", "average", "high"))) %>%
      .$crim

    # Remove response column from dataframe
    train_boston = train_boston %>% select(-crim)
    test_boston = test_boston %>% select(-crim)

Step 4: Build model using knn()

    library(class)
    knn_pred = knn(train_boston, test_boston, train_crim, k = 3)

Step 5: Evaluate performance

    # Confusion matrix: predicted class (rows) vs. true class (columns)
    table(knn_pred, test_crim)

    # Overall test accuracy
    mean(knn_pred == test_crim)
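The exercise also asks for a justification of the number of neighbors. One way to support that choice is to compare test accuracy across several values of k, sketched below. This is not part of the original solution: the grid of k values is arbitrary, the split uses base R instead of dplyr, and the predictors are standardized with `scale()`, which is an added assumption (the steps above use the raw predictors).

```r
library(MASS)   # Boston data
library(class)  # knn()

set.seed(1)

# Label crime rate using the same cutoffs: median and mean of crim
crim_class <- cut(Boston$crim,
                  breaks = c(-Inf, median(Boston$crim), mean(Boston$crim), Inf),
                  labels = c("low", "average", "high"))

# Assumption: standardize predictors so that no single variable
# dominates the Euclidean distance used by knn()
predictors <- scale(Boston[, names(Boston) != "crim"])

# 80/20 train/test split
train_idx <- sample(nrow(Boston), size = round(0.8 * nrow(Boston)))

# Test accuracy for a handful of candidate k values (arbitrary grid)
k_grid <- c(1, 3, 5, 7, 9)
accuracy <- sapply(k_grid, function(k) {
  pred <- knn(predictors[train_idx, ], predictors[-train_idx, ],
              crim_class[train_idx], k = k)
  mean(pred == crim_class[-train_idx])
})

data.frame(k = k_grid, accuracy = accuracy)
```

Picking k by test accuracy gives an optimistic estimate for the chosen model; cross-validation on the training set (for example with `knn.cv()` from the class package) would be a more careful alternative.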
