Introduction to Logistic Regression
Problem & Data Overview. Primary Research Question: 1. What skills are important in winning tennis matches? Regression Questions: 1. What is Y? (Did the player win or lose?) 2. What is X? (Match statistics)
Exploratory Data Analysis 1. Side-by-side boxplots
Exploratory Data Analysis 2. Scatterplot (Win = 1, Lose = 0)
Exploratory Data Analysis 3. Scatterplot with smooth curve
Exploratory Data Analysis 4. Cross-tabulation of Ace (rows) by Result (columns; 0 = Loss, 1 = Win):

Ace    0    1   Sum
 0     1    0     1
 1     3    1     4
 2     4    0     4
 3     6    5    11
 4     4    3     7
 5     9    7    16
 6     5    4     9
 7     4    4     8
 8     6    6    12
 9     4    7    11
10     5    4     9
11     4    2     6
12     1    2     3
13     4    3     7
14     0    4     4
15     1    1     2
16     0    3     3
17     1    0     1
19     2    0     2
20     1    0     1
21     0    1     1
23     0    2     2
26     1    0     1
29     1    0     1
Sum   67   59   126
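A minimal sketch of this cross-tabulation in R; the file name "tennis.csv" and the column names ACE and Result (coded 0/1) are assumptions based on the slides, not confirmed by them.

```r
# Sketch: cross-tabulate aces against match result, with row/column totals.
tennis <- read.csv("tennis.csv")   # hypothetical file name
addmargins(table(Ace = tennis$ACE, Result = tennis$Result))
```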
Can we use linear regression? Our response is a categorical variable, so can we just use an indicator variable and set
$Y_i = \begin{cases} 1 & \text{if Win} \\ 0 & \text{otherwise} \end{cases}$
then use regular least squares multiple regression? No, because
1. predictions will be outside of {0, 1},
2. the linearity assumption might be violated,
3. the errors certainly won't be normal,
4. equal variance is also likely to be violated.
We need an entirely new regression framework!
Logistic regression. Going back to Day 1, we have the following generic framework for statistical modeling: $Y_i \overset{iid}{\sim} p_Y(y_i)$ with $E(y_i) = f(x_{i1}, \ldots, x_{ip})$. E.g., for simple and multiple linear regression modeling we had $Y_i \overset{iid}{\sim} N\!\left(\beta_0 + \sum_{p=1}^{P} x_{ip}\beta_p,\; \sigma^2\right)$, so $E(y_i) = \beta_0 + \sum_{p=1}^{P} x_{ip}\beta_p$, where the normal assumption was OK because Y was quantitative.
Logistic regression. What's an appropriate distribution when $Y_i \in \{0, 1\}$? The Bernoulli distribution: $f(y_i) = p^{y_i}(1-p)^{1-y_i}$. If our response follows a Bernoulli distribution, then $E(y_i) = p = \text{Prob}(Y = 1)$. So can we just set $E(y_i) = p = \beta_0 + \sum_{p=1}^{P} x_{ip}\beta_p$? No, because p has to be between 0 and 1. We need to choose a different mathematical function than we have used before (one that keeps p between 0 and 1).
Logistic Regression Model (a Generalized Linear Model):
$Y_i \overset{ind}{\sim} \text{Bern}(p_i)$, and the logit transform of the odds ratio is linear in the predictors:
$\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j \;\Rightarrow\; p_i = \frac{\exp\{\beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j\}}{1 + \exp\{\beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j\}} \in (0, 1)$ (the logistic function).
Logistic Regression Model: $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we interpret $\beta_j$?
1. For every unit increase in $x_j$, the log-odds ratio increases by $\beta_j$.
2. Just interpret the sign: if $\beta_j > 0$, then $p_i$ increases as $x_j$ increases.
3. As $x_j$ increases by 1, a player is $\exp\{\beta_j\}$ times more likely to win the game.
4. As $x_j$ increases by 1, a player is $100\,(\exp\{\beta_j\} - 1)\%$ more likely to win.
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we estimate the $\beta$'s? We use maximum likelihood (see Stat 340). In this class, we'll let R do it for us.
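A minimal sketch of letting R do the maximum-likelihood fit via glm(); the data frame tennis comes from the earlier sketch, and the predictor names follow the tennis example later in the slides.

```r
# Logistic regression fit by maximum likelihood; family = binomial gives the
# Bernoulli/logit model. Column names are assumptions based on the slides.
fit <- glm(Result ~ FSP + FSW + SSP + SSW + ACE + DBF + NPA + NPW,
           data = tennis, family = binomial)
summary(fit)   # beta-hats, standard errors, and z-tests
```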
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. Example: $\hat\beta_{DBF} = -0.272$. How do we interpret this number?
1. As DBF increases by 1, the log(odds) goes down by 0.272.
2. As DBF increases by 1, the probability of winning changes by $100\,(e^{-0.272} - 1)\% \approx -24\%$, i.e., it goes down by about 24%.
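A small sketch of computing that percent-change interpretation from the fitted object (fit from the sketch above).

```r
b_dbf <- coef(fit)["DBF"]   # about -0.272 in the slides' example
100 * (exp(b_dbf) - 1)      # roughly -24%: each extra double fault lowers the odds of winning by about 24%
```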
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. What assumptions are we making? Linear in log-odds (monotone in probability); check with a scatterplot with smoother.
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. What assumptions are we making? Linear in log-odds (monotone in probability), checked using a jittered scatterplot with smoother, and independence. Normality and equal variance are no longer assumptions of this model.
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How can we perform variable selection? Same way as before: compare AIC or BIC.
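A sketch of comparing candidate models by AIC in R; the smaller model shown is purely illustrative, not from the slides.

```r
# Compare the full fit against a smaller illustrative model by AIC.
fit_small <- glm(Result ~ ACE + DBF, data = tennis, family = binomial)
AIC(fit, fit_small)   # smaller AIC is preferred; BIC(fit, fit_small) works the same way
# step(fit) automates stepwise selection by AIC
```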
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we build confidence intervals (or perform hypothesis tests) for our effects? Use the approximation $\frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \approx N(0, 1)$, which gives the interval $\hat\beta_j \pm z^{\star}\, SE(\hat\beta_j)$.
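A sketch of the corresponding intervals in R, using the fitted object from the earlier sketch.

```r
confint.default(fit, level = 0.95)  # Wald intervals: beta-hat +/- z* SE(beta-hat)
# confint(fit) instead gives profile-likelihood intervals (R's default method for glm)
```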
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we build confidence intervals (or perform hypothesis tests) for our effects? The 95% CI for DBF is (-0.487, -0.078). How do we interpret this interval? 1. We are 95% confident that as DBF increases by 1, the log(odds) of winning goes down by between 0.078 and 0.487.
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we build confidence intervals (or perform hypothesis tests) for our effects? The 95% CI for DBF is (-0.487, -0.078). How do we interpret this interval? 2. We are 95% confident that as DBF increases by 1, the probability of winning decreases by between 7.5% and 38.6%, since $100\,(\exp\{(-0.487, -0.078)\} - 1) = (-38.6\%, -7.5\%)$.
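A quick sketch of turning the DBF interval on the log-odds scale into the percent interpretation above.

```r
ci_dbf <- c(-0.487, -0.078)   # 95% CI for beta_DBF from the slides
100 * (exp(ci_dbf) - 1)       # about -38.6% to -7.5%
```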
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we predict? Predict probabilities:
$\hat p = \frac{\exp\{\hat\beta_0 + \sum_{j=1}^{J} x_{ij}\hat\beta_j\}}{1 + \exp\{\hat\beta_0 + \sum_{j=1}^{J} x_{ij}\hat\beta_j\}}$
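A sketch of predicted probabilities in R; the new-match values are taken from the Djokovic vs. Nadal example later in these slides.

```r
new_match <- data.frame(FSP = 68, FSW = 60, SSP = 79, SSW = 16,
                        ACE = 6, DBF = 2, NPA = 6, NPW = 64)
predict(fit, newdata = new_match, type = "response")  # p-hat on the probability scale
```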
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. Many times we want to classify, so we set
$\hat y = \begin{cases} 1 & \text{if } \hat p > c \\ 0 & \text{if } \hat p \le c \end{cases}$
where c = cutoff probability.
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How do we choose the cutoff value?
1. $c = 0.5$ (the Bayes classifier), or
2. choose c to minimize the misclassification rate $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i)$ = percent misclassified, as in the plot and sketch below.
[Figure: misclassification rate (y-axis, 0.20 to 0.50) versus cutoff (x-axis, 0.0 to 1.0).]
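A sketch of the calculation behind that plot, assuming Result is coded 0/1 as in the earlier sketches.

```r
# In-sample misclassification rate over a grid of cutoffs.
p_hat   <- predict(fit, type = "response")
cutoffs <- seq(0, 1, by = 0.01)
misclass <- sapply(cutoffs, function(c) mean((p_hat > c) != tennis$Result))
plot(cutoffs, misclass, type = "l", xlab = "Cutoff", ylab = "Misclassification")
cutoffs[which.min(misclass)]  # cutoff minimizing in-sample misclassification
```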
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. Can we build a prediction interval? Sort of: we can build a confidence interval for a predicted probability by untransforming the interval
$\log\!\left(\frac{\hat p}{1-\hat p}\right) \pm z^{\star}\, SE\!\left(\log\frac{\hat p}{1-\hat p}\right)$.
Steps to building an interval for a predicted probability:
1. Calculate
$\text{Low} = \log\!\left(\frac{\hat p}{1-\hat p}\right) - z^{\star}\, SE\!\left(\log\frac{\hat p}{1-\hat p}\right)$, $\quad \text{Up} = \log\!\left(\frac{\hat p}{1-\hat p}\right) + z^{\star}\, SE\!\left(\log\frac{\hat p}{1-\hat p}\right)$
2. Untransform:
$\hat p_{Low} = \exp\{\text{Low}\}/(1 + \exp\{\text{Low}\})$, $\quad \hat p_{Up} = \exp\{\text{Up}\}/(1 + \exp\{\text{Up}\})$
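A sketch of those two steps in R, using the standard error on the link (log-odds) scale and the new_match data frame from the earlier sketch.

```r
# Step 1: interval on the log-odds scale; step 2: untransform with the logistic function.
pr  <- predict(fit, newdata = new_match, type = "link", se.fit = TRUE)
low <- pr$fit - qnorm(0.975) * pr$se.fit
up  <- pr$fit + qnorm(0.975) * pr$se.fit
exp(c(low, up)) / (1 + exp(c(low, up)))   # 95% CI for the win probability
```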
Confidence interval for a probability, example: if a player has FSP = 68, FSW = 60, SSP = 79, SSW = 16, ACE = 6, DBF = 2, NPA = 6, NPW = 64, then the estimated probability of winning is between 66% and 95%. Note: this was Djokovic vs. Nadal, and Nadal won (Nadal beat the odds).
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How can we tell how well our model fits? In-sample confusion matrix:

             Predicted Win   Predicted Loss
True Win           49              10
True Loss          14              53
Important definitions (from the confusion matrix above; computed in the sketch below):
Sensitivity: percent of true positives correctly predicted (49/59)
Specificity: percent of true negatives correctly predicted (53/67)
Positive predictive value: percent of correctly predicted Yes's (49/63)
Negative predictive value: percent of correctly predicted No's (53/63)
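A sketch of the in-sample confusion matrix and the four summaries above, using a 0.5 cutoff and the objects from the earlier sketches.

```r
y_hat <- as.numeric(predict(fit, type = "response") > 0.5)
conf  <- table(True = tennis$Result, Predicted = y_hat)
conf
conf["1", "1"] / sum(conf["1", ])  # sensitivity
conf["0", "0"] / sum(conf["0", ])  # specificity
conf["1", "1"] / sum(conf[, "1"])  # positive predictive value
conf["0", "0"] / sum(conf[, "0"])  # negative predictive value
```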
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How can we tell how well our model fits? Pseudo-$R^2$:
$R^2_{pseudo} = 1 - \frac{\text{What's left over after the model}}{\text{Total variation}} = 1 - \frac{\text{Residual Deviance}}{\text{Null Deviance}}$
Interpretation: the percent of variation in log(p/(1-p)) explained by the model.
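A one-line sketch of the pseudo-$R^2$ from the fitted glm object.

```r
1 - fit$deviance / fit$null.deviance   # residual deviance over null deviance
```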
Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$. How can we tell how well our model predicts? Cross-validated confusion matrix: repeat the confusion matrix, but first split the data into training and test sets, as sketched below.
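A sketch of a single train/test split for the cross-validated confusion matrix; the 70/30 proportion and the seed are arbitrary choices, not from the slides.

```r
set.seed(1)
train  <- sample(nrow(tennis), size = floor(0.7 * nrow(tennis)))
fit_cv <- glm(Result ~ FSP + FSW + SSP + SSW + ACE + DBF + NPA + NPW,
              data = tennis[train, ], family = binomial)
p_test <- predict(fit_cv, newdata = tennis[-train, ], type = "response")
table(True = tennis$Result[-train], Predicted = as.numeric(p_test > 0.5))
```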
End of Tennis Analysis (see webpage for R code)