BANA 7046 Data Mining I
Lecture 4. Logistic Regression and Classification
Shaobo Li, University of Cincinnati
Partially based on Hastie et al. (2009) ESL and James et al. (2013) ISLR
From Continuous to Categorical Outcome
The response variable, Y, is categorical.
Examples:
- Banking: default vs. non-default
- Medical: disease positive vs. negative
- Computer vision: sign recognition by self-driving cars
- Many others...
Classification
Denote C(X) as a classifier. Most DM algorithms estimate the probability that X belongs to each class; based on a specific decision rule, a classification can then be produced.
Example: the model predicts that the probability of default is 0.2. Then:
  Threshold   Class
  > 0.2       Non-default
  < 0.2       Default
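A minimal Python sketch of such a decision rule (the cut-off values below are hypothetical, chosen only to mirror the example):

```python
# Decision rule: classify as the event class when the predicted
# probability exceeds the chosen cut-off.
def classify(p_default, cutoff):
    return "Default" if p_default > cutoff else "Non-default"

print(classify(0.2, 0.5))   # cut-off above 0.2 -> Non-default
print(classify(0.2, 0.1))   # cut-off below 0.2 -> Default
```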
Classification Methods
- K-Nearest Neighbors
- Logistic regression
- Classification tree
- Discriminant analysis
- Support vector machine
- Neural networks
- Deep learning...
Is clustering a classification model?
Why Not Linear Regression
Example: default prediction
- Default (Y = 1) vs. non-default (Y = 0)
- X_1: credit card balance level, X_2: income level
Suppose the estimated linear regression is Ŷ = 1.5 + 2X_1 − X_2.
What is the predicted value if a person's balance level is 1 and income level is 3? How do we interpret this value?
An Illustration
[figure]
Logistic Regression
- Generalized linear model
- Logistic model for binary response, a function of X:
  P(y_i = 1 | x_i) = exp(x_iᵀβ) / (1 + exp(x_iᵀβ))
- Outcome is the predicted probability of the event
- More than two classes: multinomial logistic model
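As a sketch, a binary logistic model can be fit in Python with scikit-learn; the data below are simulated placeholders, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # e.g., balance and income levels
p = 1 / (1 + np.exp(-(0.5 + 2 * X[:, 0])))     # true event probabilities
y = rng.binomial(1, p)                         # binary response

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))              # columns: P(y=0|x), P(y=1|x)
```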
Generalized Linear Models
Still linear models. Three components:
1. Probability distribution of the response variable
   - E.g., binary (Bernoulli), Poisson, gamma...
2. Linear predictor: η = β_0 + β_1 X_1 + ... + β_p X_p
3. Link function: g[E(Y)] = η, or equivalently E(Y) = g⁻¹(η)
Here are some notes on other link functions for binary response.
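One way to see all three components together is statsmodels' GLM interface; a sketch with simulated data (swapping in sm.families.links.Probit() or CLogLog() changes only the link):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + two predictors
eta = X @ np.array([0.3, 1.0, -0.5])             # linear predictor η
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))      # binomial (binary) response

fit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Logit())).fit()
print(fit.params)                                # estimates of β
```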
Odds and Interpretation of β
The logistic model is also called the log-odds model.
Odds: a ratio of probabilities, Odds(X) = P(Y = 1 | X) / P(Y = 0 | X)
Logit link (logit transformation):
  logit(P) = log( P / (1 − P) ) = β_0 + β_1 X_1 + ... + β_p X_p
By simple algebra, with all X's held fixed except X_j,
  β_j = log( Odds(X_j + 1) / Odds(X_j) )
which is the log of the odds ratio.
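A one-line consequence worth remembering: exp(β_j) is the multiplicative change in the odds for a one-unit increase in X_j. A tiny sketch with a made-up coefficient:

```python
import numpy as np

beta_j = 0.7            # hypothetical fitted coefficient
print(np.exp(beta_j))   # ~2.01: one more unit of X_j about doubles the odds
```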
Multinomial Logit Model
Response Y = 1, 2, ..., K (K classes). Given predictors x_i:
  log( P(y_i = 2) / P(y_i = 1) ) = β_2ᵀ x_i
  log( P(y_i = 3) / P(y_i = 1) ) = β_3ᵀ x_i
  ...
  log( P(y_i = K) / P(y_i = 1) ) = β_Kᵀ x_i
The first class (Y = 1) is the reference.
There are (K − 1)(p + 1) coefficients to be estimated.
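For a sketch of the multinomial fit, scikit-learn's LogisticRegression handles K > 2 with a softmax parameterization; note that its coef_ stores K rows rather than the (K − 1)(p + 1) baseline-class coefficients on this slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))                  # p = 4 predictors (toy data)
y = rng.integers(0, 3, size=300)               # K = 3 classes (random labels)

clf = LogisticRegression().fit(X, y)           # multinomial fit, lbfgs solver
print(clf.coef_.shape, clf.intercept_.shape)   # (3, 4) and (3,)
```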
Estimation for Binary Logit Model
Maximum likelihood estimation, with y_i | x_i ~ Bernoulli(p_i(x_i))
Likelihood function of y_i | x_i:
  L(y_i; x_i, β) = p_i^{y_i} (1 − p_i)^{1 − y_i}
                 = ( exp(x_iᵀβ) / (1 + exp(x_iᵀβ)) )^{y_i} · ( 1 / (1 + exp(x_iᵀβ)) )^{1 − y_i}
By simple algebra, the total log-likelihood is (shown in exercise)
  l(β) = Σ_{i=1}^n { y_i x_iᵀβ − log(1 + exp(x_iᵀβ)) }
Numerical optimization: Newton's method (a very good tutorial)
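A compact numpy sketch of Newton's method for this log-likelihood (gradient Xᵀ(y − p), Hessian −XᵀWX with W = diag(p(1 − p)); the data are simulated):

```python
import numpy as np

def newton_logit(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))              # current fitted probabilities
        grad = X.T @ (y - p)                         # gradient of l(β)
        hess = X.T @ (X * (p * (1 - p))[:, None])    # XᵀWX (negative Hessian)
        beta += np.linalg.solve(hess, grad)          # Newton step
    return beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(1.0 - 2.0 * X[:, 1]))))
print(newton_logit(X, y))                            # roughly [1, -2]
```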
Prediction: From Probability to Class
- Direct outcome of the model: probability
- Next step: classification
- Need a decision rule (cut-off probability, p-cut)
- Not unique
Confusion Matrix
Classification table based on a specific cut-off probability; used for model assessment.
          Pred=1                Pred=0
True=1    True Positive (TP)    False Negative (FN)
True=0    False Positive (FP)   True Negative (TN)
FP: type I error; FN: type II error
A different p-cut results in a different confusion matrix.
Try to understand this table instead of memorizing it!
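A sketch of producing this table at a given p-cut with scikit-learn (toy labels and probabilities; labels=[1, 0] matches the slide's row/column order):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
p_hat  = np.array([0.9, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3, 0.8])
y_pred = (p_hat > 0.5).astype(int)               # p-cut = 0.5

# rows True=1, True=0; columns Pred=1, Pred=0 -> [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```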
Some Useful Measures
- Misclassification rate (MR) = (FP + FN) / Total
- True positive rate (TPR) = TP / (TP + FN): sensitivity, or recall
- True negative rate (TNR) = TN / (FP + TN): specificity
- False positive rate (FPR) = FP / (FP + TN): 1 − specificity
- False negative rate (FNR) = FN / (TP + FN): 1 − sensitivity
- Positive predictive rate (PPR) = TP / (TP + FP): precision
- False discovery rate (FDR) = 1 − precision
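The definitions in code, using the counts TP=3, FN=1, FP=1, TN=3 from the toy matrix above:

```python
TP, FN, FP, TN = 3, 1, 1, 3

MR  = (FP + FN) / (TP + FN + FP + TN)   # misclassification rate
TPR = TP / (TP + FN)                    # sensitivity / recall
TNR = TN / (FP + TN)                    # specificity
FPR = FP / (FP + TN)                    # 1 - specificity
FNR = FN / (TP + FN)                    # 1 - sensitivity
PPR = TP / (TP + FP)                    # precision
FDR = 1 - PPR                           # false discovery rate
print(MR, TPR, TNR, FPR, FNR, PPR, FDR)
```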
ROC Curve
- Receiver Operating Characteristic
- Plot of FPR (x-axis) against TPR (y-axis) at various p-cut values
- Overall model assessment (not tied to a particular decision rule)
- Unique for a given model
- Area under the curve (AUC): a measure of goodness of fit
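A sketch of tracing the curve and computing AUC with scikit-learn (same toy arrays as before; plotting fpr against tpr draws the ROC curve):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
p_hat  = np.array([0.9, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3, 0.8])

fpr, tpr, cuts = roc_curve(y_true, p_hat)   # one (FPR, TPR) point per cut-off
print(roc_auc_score(y_true, p_hat))         # AUC; 0.5 = random, 1 = perfect
```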
ROC Curve
[Figure: ROC curve]
Precision and Recall
- More informative measures for imbalanced data
- Widely used in document retrieval
- Precision = TP / (TP + FP): fraction of retrieved instances that are relevant
- Recall = TP / (TP + FN): fraction of relevant instances that are retrieved
- Neither incorporates TN (Y = 1 is the class of more interest)
- F-score: F = 2 · (precision · recall) / (precision + recall)
More details: see this highly cited paper
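A sketch at a fixed p-cut, checking that the F-score is the harmonic mean of precision and recall (toy arrays as above):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 1])   # labels at p-cut = 0.5

prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(2 * prec * rec / (prec + rec))          # F by the formula above
print(f1_score(y_true, y_pred))               # same value via sklearn
```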
Precision-Recall Curve
[Figure: precision-recall curve]
Asymmetric Cost
Example: compare the following two confusion matrices, based on two p-cut values.
          Pred=1   Pred=0
True=1    10       40
True=0    10       440

          Pred=1   Pred=0
True=1    40       10
True=0    130      320
Which one is better? In terms of what?
What if this is a loan application setting?
- Y = 1: default customer
- A default costs much more than rejecting a loan application
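A sketch of the comparison under an asymmetric cost; the 5:1 weight for a missed default is a made-up choice:

```python
def cost(FN, FP, w_fn=5.0, w_fp=1.0):
    # total cost when a false negative is w_fn/w_fp times worse than a false positive
    return w_fn * FN + w_fp * FP

print(cost(FN=40, FP=10))    # first matrix:  5*40 + 10  = 210
print(cost(FN=10, FP=130))   # second matrix: 5*10 + 130 = 180, cheaper
```

The second matrix has the higher misclassification rate (140/500 vs. 50/500) yet the lower cost, which is exactly the point of weighting the two error types asymmetrically.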
Choice of Decision Threshold (p-cut)
Do NOT simply use 0.5!
In general, we use a grid search to optimize a measure of classification accuracy/loss:
- Cost function (symmetric or asymmetric)
- F-score based on precision and recall
Grid search with cross-validation
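A sketch of the grid search over p-cut, minimizing the same hypothetical 5:1 asymmetric cost (in practice this would be wrapped in cross-validation):

```python
import numpy as np

def total_cost(y_true, p_hat, cut, w_fn=5.0, w_fp=1.0):
    y_pred = (p_hat > cut).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed events
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false alarms
    return w_fn * fn + w_fp * fp

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
p_hat  = np.array([0.9, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3, 0.8])

grid = np.arange(0.05, 1.0, 0.05)                # candidate cut-offs
best = grid[np.argmin([total_cost(y_true, p_hat, c) for c in grid])]
print(best)                                      # cost-minimizing p-cut on this toy grid
```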
Discriminant Analysis
Based on Bayes' theorem:
  P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / P(X = x)
Discriminant analysis:
  P(Y = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l
- f_k(x) is the assumed density function of X in class k
- π_k can simply be calculated as the fraction of Y = k
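A numerical sketch of this posterior with hypothetical one-dimensional Gaussian class densities and priors:

```python
import numpy as np
from scipy.stats import norm

mu = np.array([-1.0, 2.0])              # class means (made-up values)
sigma = 1.2                             # shared standard deviation
pi = np.array([0.6, 0.4])               # priors, e.g., class fractions

x = 0.8
f = norm.pdf(x, loc=mu, scale=sigma)    # f_k(x) for each class k
print(f * pi / np.sum(f * pi))          # posterior; assign x to the largest
```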
Linear Discriminant Function
Given x, find the k such that P(Y = k | X = x) is largest. Since the denominator does not depend on k, only f_k(x) π_k is of interest.
We assume f_k(x) to be a Gaussian density with shared variance σ²:
  f_k(x) = ( 1 / (√(2π) σ) ) exp( −(x − μ_k)² / (2σ²) )
By taking logs and discarding terms without k, we have
  δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log(π_k)
This is called the linear discriminant score function.
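The score function in code, with the same made-up means and priors as above; its argmax reproduces the posterior-based classification without evaluating the denominator:

```python
import numpy as np

mu = np.array([-1.0, 2.0])           # class means
pi = np.array([0.6, 0.4])            # class priors
sigma2 = 1.44                        # shared variance σ²

def delta(x):
    # δ_k(x) = x μ_k / σ² - μ_k² / (2σ²) + log(π_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

x = 0.8
print(delta(x), delta(x).argmax())   # classify x to the class with the largest score
```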
Comparison Between Logistic Model and LDA
- Logistic regression is a very popular classifier, especially for binary classification problems
- LDA is often used when n is small, the classes are well separated, and the Gaussian assumption is reasonable; also when K > 2
- Both are linear methods