Classification

Classification is similar to regression in that the goal is to use covariates to predict an outcome. We still have a vector of covariates X. However, the response is binary (or one of a few classes), Y ∈ {0, 1} or Y ∈ {-1, 1}.

Examples:
- Y = 1 if the user clicks on an ad; X is the age, gender, ...
- Y = 1 if the customer picks Coke over Pepsi; X is the age, gender, ...
- Y = 1 if the email is spam; X is the email length, the number of commas, etc.
- Y = 1 if the patient responds to a treatment; X are baseline measures
- Y = 1 if the patient has breast cancer; X are 10,000 genetic markers

We need different models and measures of success.

Outline:
1. Logistic regression
2. Nearest neighbors methods
3. Discriminant analysis
4. Classification trees
5. Support vector machines

(5) Classification - Part 1 Page 1
Logistic regression

Logistic regression is the most common method among statisticians for analyzing binary data. It has nice interpretations and statistical properties, but often isn't the best for prediction.

The logistic regression model is:

  P(Y_i = 1 | X_i = x_i) = expit(x_i^T β) = exp(x_i^T β) / [1 + exp(x_i^T β)]

[Plot of the logistic and expit functions]

Other link functions are possible (e.g., probit or complementary log-log).

Interpretation of the logistic regression coefficients: exp(β_j) is the multiplicative change in the odds P(Y_i = 1)/P(Y_i = 0) for a one-unit increase in covariate j, holding the other covariates fixed.
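The model and the odds-ratio interpretation above can be checked numerically. A minimal numpy sketch (the course uses R; Python is used here only for illustration, and the coefficient values are hypothetical):

```python
import numpy as np

def expit(z):
    """Inverse logit: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: intercept and one slope.
beta = np.array([-1.0, 0.5])
x = np.array([1.0, 2.0])            # first entry is the intercept column
p = expit(x @ beta)                 # P(Y = 1 | x) under the model

# Odds-ratio interpretation: increasing the covariate by one unit
# multiplies the odds p / (1 - p) by exp(beta_1).
x_plus = np.array([1.0, 3.0])
p_plus = expit(x_plus @ beta)
odds_ratio = (p_plus / (1 - p_plus)) / (p / (1 - p))
```

Here `odds_ratio` comes out to exp(0.5) regardless of the starting value of the covariate, which is exactly the constant-odds-ratio interpretation of β.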
Estimating the parameters

The MLE of β is typically computed using Newton-Raphson optimization (equivalently, iteratively reweighted least squares).

The sampling distribution is approximately Gaussian for large samples:

  β̂ ≈ Normal(β, I(β)^{-1}),

where I(β) is the Fisher information matrix.

The glm function in R (with family = binomial) fits logistic regression.
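A Newton-Raphson iteration for the logistic MLE is short enough to write out. The sketch below (Python rather than R, simulated data with an assumed true β) updates β with the gradient and observed information of the log-likelihood:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=25):
    """Newton-Raphson for the logistic MLE (equivalently IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        g = X.T @ (y - p)                  # gradient of the log-likelihood
        w = p * (1 - p)                    # Bernoulli variance weights
        H = X.T @ (X * w[:, None])         # observed information matrix
        beta = beta + np.linalg.solve(H, g)
    return beta

# Simulated data from a known beta (assumed setup for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
true_beta = np.array([0.5, -1.0])
y = (rng.uniform(size=2000) < expit(X @ true_beta)).astype(float)
beta_hat = logistic_newton(X, y)
```

Newton-Raphson converges quadratically here, so 25 iterations drive the score equations essentially to zero; in practice glm does the same thing with step-halving safeguards.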
Making classifications

Logistic regression produces the estimated probability of an event, p̂_i = expit(x_i^T β̂). In many cases the application requires a yes/no classification. One approach is to simply predict Ŷ_i = 1 if p̂_i > 0.5 and Ŷ_i = 0 otherwise. Depending on the nature of the problem, other thresholds may be preferable. For example, if a false positive costs ℓ_FP and a false negative costs ℓ_FN, the expected loss is minimized by predicting Ŷ_i = 1 when p̂_i > ℓ_FP / (ℓ_FP + ℓ_FN).
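The effect of an asymmetric loss on the threshold is easy to see numerically. In this sketch (hypothetical costs and fitted probabilities), a false negative is four times as costly as a false positive, which lowers the threshold from 0.5 to 0.2:

```python
import numpy as np

# Predicting 1 has expected loss (1 - p) * c_fp; predicting 0 has
# expected loss p * c_fn.  Predict 1 when the first is smaller,
# i.e. when p > c_fp / (c_fp + c_fn).
c_fp, c_fn = 1.0, 4.0                    # hypothetical costs
threshold = c_fp / (c_fp + c_fn)         # = 0.2: missing an event is costlier

p_hat = np.array([0.10, 0.25, 0.60, 0.15])
y_hat_default = (p_hat > 0.5).astype(int)
y_hat_costed = (p_hat > threshold).astype(int)
```

The second observation flips from 0 to 1 under the cost-sensitive threshold: a 25% event probability is enough to act on when misses are expensive.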
Multi-class logistic regression

In some cases there are more than two possible outcomes, e.g., Y_i ∈ {small, medium, large} or Y_i ∈ {Democrat, Republican, Independent}.

If the outcomes are ordered, the ordinal (proportional odds) logistic regression model is:

  logit P(Y_i ≤ k | x_i) = α_k - x_i^T β.

If the outcomes are unordered, a common model is the discrete choice (multinomial logit) model:

  P(Y_i = k | x_i) = exp(x_i^T β_k) / Σ_l exp(x_i^T β_l).
Large p

As with linear regression, the MLE is not unique if p > n. Also, the large-sample approximation to the sampling distribution is invalid if p ≈ n. Many of the same variable selection methods apply with slight modification:

- Forward/backward/stepwise selection
- Sure independence screening
- PCA (regress on the leading principal components)
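Sure independence screening is the simplest of these to sketch: rank the covariates by absolute marginal association with the response and keep only the top d before fitting a full model. A Python illustration (simulated data; the signal variables, sample sizes, and cutoff d = 20 are all assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 1000                          # p >> n
X = rng.normal(size=(n, p))
# Only the first two covariates carry signal in this simulation.
eta = 2 * X[:, 0] - 2 * X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-eta))).astype(float)

# Marginal screening statistic: |covariance| of each standardized
# column with the centered response.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
score = np.abs(Xc.T @ (y - y.mean())) / n
keep = np.argsort(score)[::-1][:20]       # keep the d = 20 top-ranked covariates
```

The two signal covariates survive the screen with high probability, and any of the standard methods (stepwise, penalized regression) can then be run on the reduced set.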
Penalized regression

Penalized regression is also very popular. Conceptually it is the same as in the linear model, but computationally it is a bit more challenging. The LASSO becomes:

  β̂ = argmin_β { -Σ_i [y_i x_i^T β - log(1 + exp(x_i^T β))] + λ Σ_j |β_j| }.

The R function glmnet uses coordinate descent to compute the solution. The large-sample approximation (LSA; Wang and Leng) can be fit in lars.
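To make the objective concrete, here is a Python sketch of the penalized logistic LASSO solved by proximal gradient descent. This is a stand-in for exposition only: glmnet's actual algorithm is coordinate descent, and the data, λ, and step size below are illustrative assumptions.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_lasso(X, y, lam, n_iter=2000):
    """Proximal gradient descent for the L1-penalized logistic likelihood."""
    n, p = X.shape
    # Step = 1 / Lipschitz constant of the averaged logistic gradient.
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (expit(X @ beta) - y) / n      # smooth part of the objective
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
true_beta = np.array([2.0, -2.0] + [0.0] * 8)        # sparse truth (illustrative)
y = (rng.uniform(size=400) < expit(X @ true_beta)).astype(float)
beta_hat = logistic_lasso(X, y, lam=0.1)
```

The soft-threshold step is what produces exact zeros: noise coefficients whose gradients stay below λ are pinned at zero, while the two signal coefficients are retained (shrunk toward zero, as the L1 penalty always does).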
Non-linear logistic regression

The expected value is non-linear in the covariates, but the model still relies on a linear predictor. Many of the non-linear regression methods extend to logistic regression. For example, the mgcv package in R fits logistic regression with GAMs. The model is:

  logit P(Y_i = 1 | x_i) = Σ_j f_j(x_ij),

where the f_j are smooth functions. A neural network for logistic regression replaces the linear predictor with the output of one or more hidden layers and applies expit at the final layer.
Large n - stochastic gradient descent

Now say we have n = 10M observations. The Newton-Raphson optimization algorithm is:

  β^(t+1) = β^(t) - H(β^(t))^{-1} g(β^(t)),

where g is the gradient and H is the Hessian of the log-likelihood. Both the gradient and Hessian are sums over observations. These sums can be approximated using SGD: at each iteration, the gradient is computed on a small random subsample (mini-batch) and a step is taken in that direction.
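A mini-batch SGD loop for logistic regression fits in a few lines. This Python sketch uses an assumed learning rate, batch size, and simulated data; in a real 10M-row problem the data would be streamed rather than held in memory:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, lr=0.05, batch=32, n_epochs=30, seed=0):
    """Mini-batch SGD: the full-data gradient is replaced by the
    gradient on a random subsample at each step."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            # Unbiased estimate of the average log-likelihood gradient.
            grad = X[idx].T @ (expit(X[idx] @ beta) - y[idx]) / len(idx)
            beta -= lr * grad
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=5000) < expit(X @ true_beta)).astype(float)
beta_hat = logistic_sgd(X, y)
```

Each update touches only 32 rows, so the cost per step is independent of n; the price is a noisy trajectory that hovers near, rather than lands exactly on, the MLE (a decaying learning rate tightens this).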
Large n - meta-analysis

Another form of approximation relies on the asymptotic normality. Say we split the data into B subsets. From subset b, the estimate and its approximate sampling distribution are:

  β̂_b ≈ Normal(β, Σ_b).

These first-stage estimates can be computed in parallel. Now we treat the first-stage estimates as the data for our second-stage analysis. The inverse-variance-weighted pooled estimate is:

  β̂ = (Σ_b Σ_b^{-1})^{-1} Σ_b Σ_b^{-1} β̂_b.
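In the one-dimensional case the pooling formula reduces to a weighted average with weights 1/v_b. A small Python sketch with illustrative (made-up) subset estimates and variances:

```python
import numpy as np

# First-stage results: beta_b ~ Normal(beta, v_b), computed in parallel.
beta_b = np.array([1.2, 0.9, 1.1])        # illustrative subset estimates
v_b = np.array([0.04, 0.02, 0.04])        # illustrative subset variances

# Second stage: inverse-variance-weighted pooling.
w = 1.0 / v_b
beta_pooled = np.sum(w * beta_b) / np.sum(w)
var_pooled = 1.0 / np.sum(w)              # variance of the pooled estimate
```

The middle subset has half the variance of the others, so it receives twice the weight; the pooled variance 1/Σ(1/v_b) is smaller than any single subset's variance, recovering (approximately) full-data efficiency.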
glmnet with large n

If n is large and p << n, the lasso probably won't improve the fit. However, if both n and p are massive, the sparsity of the lasso solution and the efficiency of coordinate descent make glmnet attractive.
Nearest neighbors

Nearest neighbor methods can be applied to classification. Let d_ij be the distance between X_i and X_j, for example d_ij = ||X_i - X_j||. For a new observation Y_0 with covariates X_0, the predicted probability that Y_0 = 1 is the proportion of 1s among the k nearest neighbors of X_0. The classifier then predicts the majority class among those neighbors, Ŷ_0 = I(p̂_0 > 1/2).
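The whole method is two steps: sort by distance, average the labels. A Python sketch on a tiny made-up training set (two well-separated clusters, k = 3 chosen for illustration):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, k=5):
    """Predicted P(Y0 = 1): proportion of 1s among the k nearest neighbors."""
    d = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances d_0j
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()

def knn_classify(X_train, y_train, x0, k=5):
    """Hard classification by majority vote among the neighbors."""
    return int(knn_predict_proba(X_train, y_train, x0, k) > 0.5)

# Illustrative data: class 1 clustered near (1, 1), class 0 near (-1, -1).
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                    [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9]])
y_train = np.array([1, 1, 1, 0, 0, 0])
p0 = knn_predict_proba(X_train, y_train, np.array([0.8, 0.8]), k=3)
```

Note the method has no training step at all: all the work happens at prediction time, and the choice of k (and of the distance metric) is the tuning problem, typically settled by cross-validation.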
Evaluating classification accuracy

Cross-validation is a robust way to compare classifiers. A soft classifier (SC) gives a probability (or some other weight) to each class, p̂_i ∈ [0, 1]. A hard classifier (HC) definitively picks a class, Ŷ_i ∈ {0, 1}. An SC can be converted to an HC by thresholding, Ŷ_i = I(p̂_i > c).

The Brier score is a common way to compare SCs:

  BS = (1/n) Σ_i (Y_i - p̂_i)^2.

HCs are compared by summaries of the contingency/confusion table:
- Classification accuracy: proportion of correct predictions
- False positive rate: proportion of true 0s predicted to be 1
- True positive rate: proportion of true 1s predicted to be 1
- False negative rate: proportion of true 1s predicted to be 0
- True negative rate: proportion of true 0s predicted to be 0
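All of these summaries come from the four cells of the confusion table. A Python sketch with small made-up vectors of outcomes and fitted probabilities:

```python
import numpy as np

def brier_score(y, p_hat):
    """Mean squared distance between outcomes and predicted probabilities."""
    return np.mean((y - p_hat) ** 2)

def confusion_summaries(y, y_hat):
    """Accuracy and the four rates from the 2x2 confusion table."""
    tp = np.sum((y == 1) & (y_hat == 1))
    fp = np.sum((y == 0) & (y_hat == 1))
    fn = np.sum((y == 1) & (y_hat == 0))
    tn = np.sum((y == 0) & (y_hat == 0))
    return {
        "accuracy": (tp + tn) / len(y),
        "tpr": tp / (tp + fn),   # sensitivity
        "fpr": fp / (fp + tn),
        "tnr": tn / (tn + fp),   # specificity
        "fnr": fn / (fn + tp),
    }

# Illustrative outcomes and soft-classifier output.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p_hat = np.array([0.9, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1, 0.2])
bs = brier_score(y, p_hat)                         # compares the SC
stats = confusion_summaries(y, (p_hat > 0.5).astype(int))  # compares the HC at c = 0.5
```

Note TPR + FNR = 1 and TNR + FPR = 1, so two of the four rates determine the others; this is why the ROC curve on the next slide only needs (FPR, TPR).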
Evaluating classification accuracy

The receiver operating characteristic (ROC) curve evaluates the family of HCs created by thresholding an SC. Let c be the threshold so that Ŷ_c = I(p̂ > c). For each c, we compute the FPR and TPR. The ROC curve plots the TPR as a function of the FPR.

A common one-number summary of the ROC curve is the area under the ROC curve (AUC). An AUC near one is perfect; an AUC equal to 0.5 is random guessing. It can be shown that the AUC equals the probability of ranking a randomly chosen positive observation higher than a randomly chosen negative observation.

The R function ROC computes the ROC curve and AUC.
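The ranking characterization of the AUC can be computed directly, without tracing the curve. A Python sketch (same illustrative data as on the previous slide) that builds the ROC points by sweeping the threshold and computes the AUC as a pairwise-ranking probability:

```python
import numpy as np

def roc_points(y, p_hat):
    """(FPR, TPR) pairs as the threshold c sweeps down over the fitted probs."""
    pts = [(0.0, 0.0)]
    for c in np.sort(np.unique(p_hat))[::-1]:
        y_hat = (p_hat >= c).astype(int)
        tpr = np.sum((y == 1) & (y_hat == 1)) / np.sum(y == 1)
        fpr = np.sum((y == 0) & (y_hat == 1)) / np.sum(y == 0)
        pts.append((fpr, tpr))
    return pts

def auc_rank(y, p_hat):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as 1/2."""
    pos, neg = p_hat[y == 1], p_hat[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p_hat = np.array([0.9, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1, 0.2])
pts = roc_points(y, p_hat)
auc = auc_rank(y, p_hat)
```

With 3 positives and 5 negatives there are 15 positive-negative pairs; 13 are ranked correctly, so the AUC is 13/15. The same number is the trapezoidal area under the piecewise curve traced by `pts`.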
Discriminant analysis (DA)

DA is an alternative method of classification with a tie to logistic regression. In classification we want f(y | x). DA turns this around by estimating f(y) and f(x | y) and then applying Bayes' theorem for classification:

  f(y | x) = f(x | y) f(y) / Σ_{y'} f(x | y') f(y').

This is only advantageous when f(x | y) is very simple, for example multivariate normal or a product of simple univariate densities.
Discriminant analysis (DA)

The naive Bayes classifier assumes the elements of X are independent given Y. For example, if X_j | Y = y ~ Normal(µ_jy, σ²_jy), then we set µ_jy and σ_jy using sample moments and the classifier picks the class y maximizing f(y) Π_j f(x_j | y).

Connection to logistic regression: under the normal model with equal variances across classes, the log posterior odds log[f(1 | x)/f(0 | x)] is linear in x, so this DA model implies a logistic regression model for Y given X.
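The Gaussian naive Bayes classifier above amounts to sample moments plus a log-posterior comparison. A Python sketch on simulated two-class data (class means at ±1 are an illustrative assumption):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Sample-moment estimates of the class priors f(y) and the
    per-feature normal parameters (mu_jy, sigma_jy)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0))
    return params

def nb_classify(params, x0):
    """Pick the class maximizing log f(y) + sum_j log f(x0_j | y),
    using the conditional-independence (naive Bayes) assumption."""
    scores = {}
    for c, (prior, mu, sigma) in params.items():
        log_lik = -0.5 * np.sum(((x0 - mu) / sigma) ** 2
                                + np.log(2 * np.pi * sigma ** 2))
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)),
               rng.normal(1, 1, size=(100, 2))])
y = np.repeat([0, 1], 100)
params = fit_gaussian_nb(X, y)
```

Working on the log scale avoids underflow when the product Π_j f(x_j | y) runs over many features; with equal variances across classes, the resulting decision boundary is linear in x, matching the logistic regression connection noted above.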