Introduction to Logistic Regression

Easter Barton
5 years ago
1 Introduction to Logistic Regression

2 Problem & Data Overview
Primary Research Questions:
1. What skills are important in winning tennis matches?
Regression Questions:
1. What is Y? Did the player win or lose?
2. What is X? Match statistics

3 Exploratory Data Analysis 1. Side-by-side boxplots

4 Exploratory Data Analysis 2. Scatterplot (Win = 1, Lose = 0)

5 Exploratory Data Analysis 3. Scatterplot w/smooth curve

6 Exploratory Data Analysis 4. Cross-tabulations
[Table: cross-tabulation of Ace by Result (0, 1) with row and column sums; cell counts not preserved]

7 Can we use linear regression?
Our response is a categorical variable, so can we just use an indicator variable and set
$Y_i = \begin{cases} 1 & \text{if Win} \\ 0 & \text{otherwise} \end{cases}$
then use regular least squares multiple regression? No, because:
1. predictions can fall outside the interval [0, 1]
2. the linearity assumption might be violated
3. errors certainly won't be normal
4. equal variance is also likely to be violated.
We need an entirely new regression framework!
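The first objection can be checked numerically. The following is a minimal Python sketch (the course itself uses R) that fits ordinary least squares to a 0/1 response; the toy data and variable names are hypothetical and are not taken from the tennis data set.

```python
# Toy data (hypothetical): x is a match statistic, y is 1 for a win and 0 for a loss.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares slope and intercept for a single predictor.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fitted values at the observed x's; the smallest is negative and the largest exceeds 1.
fitted = [intercept + slope * xi for xi in x]
print(min(fitted), max(fitted))
```

On this toy data the fitted line dips below 0 at the low end of x and rises above 1 at the high end, so it cannot be read as a probability.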

8 Logistic regression
Going back to Day 1, we have the following generic framework for statistical modeling:
$Y_i \overset{iid}{\sim} p_Y(y_i)$, $E(y_i) = f(x_{i1}, \ldots, x_{iJ})$
E.g., for simple and multiple linear regression modeling we had:
$Y_i \overset{iid}{\sim} N\left(\beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j,\ \sigma^2\right)$, $E(y_i) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
where the normal assumption was OK because Y was quantitative.

9 Logistic regression
What's an appropriate distribution when $Y_i \in \{0, 1\}$?
Bernoulli Distribution: $f(y_i) = p^{y_i}(1 - p)^{1 - y_i}$
If our response follows a Bernoulli distribution then $E(y_i) = p = \text{Prob}(Y = 1)$.
So can we just set $E(y_i) = p = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$?
No, because p has to be between 0 and 1. We need to choose a different math function than we have used before (one that keeps p between 0 and 1).

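10 Logistic regression
Logistic Regression Model (Generalized Linear Model):
$Y_i \overset{ind}{\sim} \text{Bern}(p_i)$
Logit transform (the log of the odds $p_i/(1 - p_i)$): $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
Logistic function: $p_i = \frac{\exp\{\beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j\}}{1 + \exp\{\beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j\}} \in (0, 1)$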
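The logit transform and the logistic function are inverses of each other, which a few lines of Python can confirm. This is an illustrative sketch; the function names `logit` and `logistic` are chosen here and are not from the course materials.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logistic(eta):
    """Inverse of logit: maps any real number eta to a value in (0, 1)."""
    # Two algebraically equal forms are used so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# A linear predictor of 0 corresponds to a probability of one half.
print(logistic(0.0))
# Applying logit after logistic returns the original linear predictor.
print(logit(logistic(1.7)))
```

Whatever value the linear predictor takes, `logistic` returns a number strictly between 0 and 1, which is the property the linear model for p lacked.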

11 Logistic Regression Model: $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we interpret $\beta_j$?
1. For every unit increase in $x_j$, the log-odds increases by $\beta_j$.
2. Just interpret the sign: if $\beta_j > 0$, then $p_i$ increases as $x_j$ increases.
3. As $x_j$ increases by 1, the odds that a player wins the game are multiplied by $\exp\{\beta_j\}$.
4. As $x_j$ increases by 1, the odds that a player wins change by $100 \times (\exp\{\beta_j\} - 1)\%$.

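12 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we estimate the $\beta$'s? We use maximum likelihood (see Stat 340). In this class, we'll let R do it for us.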
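For intuition about what the software does, here is a minimal Python sketch of maximum likelihood for a single predictor using Newton-Raphson updates. It is an assumption-laden illustration, not the course's R workflow: the function name `fit_logistic`, the fixed iteration count, and the toy data are all hypothetical.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Maximum likelihood for log(p/(1-p)) = b0 + b1*x via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]  # Bernoulli variances
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix entries (negative Hessian).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data (hypothetical); wins and losses overlap, so the estimates are finite.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0_hat, b1_hat = fit_logistic(x, y)
print(b0_hat, b1_hat)
```

The sketch has no convergence check and would fail on perfectly separated data, where the maximum likelihood estimates do not exist; production software handles both cases.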

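13 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
Example:
- $\hat{\beta}_{DBF} =$ [...]
How do we interpret this number?
1. As DBF increases by 1, the log(odds) of winning goes down by [...].
2. As DBF increases by 1, the odds of winning go down by about 24%, since $100 \times (e^{\hat{\beta}_{DBF}} - 1) \approx -24\%$.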

14 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
What assumptions are we making?
- Linear in log-odds (monotone in probability): scatterplot w/smoother

15 What assumptions are we making?
- Linear in log-odds (monotone in probability): scatterplot w/smoother

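16 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
What assumptions are we making?
- Linear in log-odds (monotone in probability): check using a jittered scatterplot
- Independence
- Normality and equal variance are not assumed: a Bernoulli response has variance $p_i(1 - p_i)$, which changes with $p_i$.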

17 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How can we perform variable selection? Same way as before: compare AIC or BIC.
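As a reminder of what is being compared, the standard definitions are AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, where k is the number of estimated coefficients and log L is the maximized Bernoulli log-likelihood. A small Python sketch follows; the helper names and the fitted probabilities are hypothetical.

```python
import math

def log_likelihood(y, p_hat):
    """Bernoulli log-likelihood of 0/1 outcomes y given fitted probabilities p_hat."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p_hat))

def aic(loglik, k):
    """Akaike information criterion; k counts estimated coefficients."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion; n is the number of observations."""
    return k * math.log(n) - 2 * loglik

# Hypothetical fitted probabilities from a model with 2 coefficients.
y = [0, 0, 1, 1]
p_hat = [0.1, 0.2, 0.8, 0.9]
ll = log_likelihood(y, p_hat)
print(aic(ll, k=2), bic(ll, k=2, n=len(y)))
```

Smaller values are preferred under both criteria; BIC penalizes extra coefficients more heavily once n exceeds about 7, since log(n) is then greater than 2.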

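18 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we build confidence intervals (or perform hypothesis tests) for our effects?
$\frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)} \sim N(0, 1)$
$\hat{\beta}_j \pm z^{\star}\, SE(\hat{\beta}_j)$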
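The interval and the corresponding test of $\beta_j = 0$ can be computed directly from an estimate and its standard error. The Python sketch below uses a hypothetical estimate and standard error (not values from the tennis model) and hard-codes $z^{\star} = 1.96$ for 95% confidence.

```python
import math

def wald_interval(beta_hat, se, z_star=1.96):
    """Return (lower, upper) for beta_hat +/- z_star * se."""
    return beta_hat - z_star * se, beta_hat + z_star * se

def wald_p_value(beta_hat, se):
    """Two-sided p-value for H0: beta = 0 under the normal approximation."""
    z = beta_hat / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical estimate and standard error, chosen only for illustration.
lower, upper = wald_interval(0.8, 0.25)
p_value = wald_p_value(0.8, 0.25)
print(lower, upper, p_value)
```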

19 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we build confidence intervals (or perform hypothesis tests) for our effects?
- 95% CI for DBF is (−0.487, −0.078).
- How do we interpret this interval?
1. We are 95% confident that as DBF increases by 1 the log(odds) of winning goes down by between 0.078 and 0.487.

20 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we build confidence intervals (or perform hypothesis tests) for our effects?
- 95% CI for DBF is (−0.487, −0.078).
- How do we interpret this interval?
2. We are 95% confident that as DBF increases by 1 the odds of winning decrease by between 7.5% and 38.6%: $100 \times (\exp\{(-0.487, -0.078)\} - 1) = (-38.6\%, -7.5\%)$
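The percentages come from exponentiating each endpoint of the log-odds interval. A short Python check of the arithmetic, using the interval endpoints −0.487 and −0.078 from the slide (the variable names are illustrative):

```python
import math

# 95% confidence interval for the DBF coefficient on the log-odds scale.
ci_log_odds = (-0.487, -0.078)

# Percent change in the odds of winning per one-unit increase in DBF.
ci_percent = tuple(100 * (math.exp(b) - 1) for b in ci_log_odds)
print([round(v, 1) for v in ci_percent])  # [-38.6, -7.5]
```

Because exp is increasing, the endpoints keep their order after the transformation, so the transformed pair is still a valid interval.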

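21 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we predict? Predict probabilities:
$\hat{p}_i = \frac{\exp\{\hat{\beta}_0 + \sum_{j=1}^{J} x_{ij}\hat{\beta}_j\}}{1 + \exp\{\hat{\beta}_0 + \sum_{j=1}^{J} x_{ij}\hat{\beta}_j\}}$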

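22 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
Many times we want to classify, so we set:
$\hat{y}_i = \begin{cases} 1 & \text{if } \hat{p}_i > c \\ 0 & \text{if } \hat{p}_i \le c \end{cases}$
where c = Cutoff Probability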
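Prediction and classification together take only a few lines. In this Python sketch the coefficients and the predictor values are hypothetical placeholders, not estimates from the tennis model, and the function names are chosen for illustration.

```python
import math

def predict_prob(beta0, betas, x_row):
    """Predicted probability from an intercept, a list of slopes, and one row of predictors."""
    eta = beta0 + sum(b * x for b, x in zip(betas, x_row))
    return 1.0 / (1.0 + math.exp(-eta))

def classify(p_hat, c=0.5):
    """Return 1 if p_hat is strictly greater than the cutoff c, otherwise 0."""
    return 1 if p_hat > c else 0

# Hypothetical coefficients and predictor values.
p_hat = predict_prob(-1.0, [0.5, -0.25], [4, 2])
print(p_hat, classify(p_hat), classify(p_hat, c=0.7))
```

The same predicted probability can land in either class depending on the cutoff, which is why the choice of c matters.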

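23 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How do we choose the cutoff value?
1. c = 0.5 → Bayes Classifier
2. Choose c to minimize the misclassification rate: $\frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i)$ = Percent Misclassified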
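Option 2 amounts to a grid search over candidate cutoffs. The Python sketch below assumes a grid of cutoffs from 0.01 to 0.99 and uses made-up outcomes and predicted probabilities; both are assumptions for illustration.

```python
def misclassification_rate(y, p_hat, c):
    """Share of observations whose predicted class (p_hat > c) differs from y."""
    y_hat = [1 if p > c else 0 for p in p_hat]
    return sum(1 for yi, yh in zip(y, y_hat) if yi != yh) / len(y)

def best_cutoff(y, p_hat, grid):
    """First cutoff in the grid that attains the smallest misclassification rate."""
    return min(grid, key=lambda c: misclassification_rate(y, p_hat, c))

# Hypothetical outcomes and predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p_hat = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90]
grid = [i / 100 for i in range(1, 100)]

c_star = best_cutoff(y, p_hat, grid)
print(c_star, misclassification_rate(y, p_hat, c_star))
```

Several cutoffs often tie for the minimum (here c = 0.5 does as well as the selected value), so the misclassification curve is worth plotting rather than relying on a single minimizer.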

24 [Figure: misclassification rate plotted against cutoff]

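25 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
Can we build a prediction interval? Sort of: we can build a confidence interval for a predicted probability by untransforming the interval
$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) \pm z^{\star}\, SE\left(\log\left(\frac{\hat{p}}{1 - \hat{p}}\right)\right)$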

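26 Steps to building an interval for a predicted probability:
1. Calculate
$\text{Low} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) - z^{\star}\, SE\left(\log\left(\frac{\hat{p}}{1 - \hat{p}}\right)\right)$
$\text{Up} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) + z^{\star}\, SE\left(\log\left(\frac{\hat{p}}{1 - \hat{p}}\right)\right)$
2. Untransform
$\hat{p}_{Low} = \exp\{\text{Low}\}/(1 + \exp\{\text{Low}\})$
$\hat{p}_{Up} = \exp\{\text{Up}\}/(1 + \exp\{\text{Up}\})$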
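The two steps translate directly into code. This Python sketch assumes the predicted probability and the standard error of its log-odds are already available (in practice they come from the fitted model); the numbers used are hypothetical.

```python
import math

def logistic(eta):
    """Untransform a log-odds value to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def prob_interval(p_hat, se_logit, z_star=1.96):
    """Confidence interval for a predicted probability, built on the log-odds scale."""
    center = math.log(p_hat / (1.0 - p_hat))
    # Step 1: interval endpoints on the log-odds scale.
    low = center - z_star * se_logit
    up = center + z_star * se_logit
    # Step 2: untransform each endpoint.
    return logistic(low), logistic(up)

# Hypothetical predicted probability and standard error of its log-odds.
p_low, p_up = prob_interval(0.85, 0.6)
print(p_low, p_up)
```

Building the interval on the log-odds scale guarantees that both endpoints stay inside (0, 1), and the result is generally not symmetric around the predicted probability.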

27 Confidence Interval for a probability example:
If a player has the following:
FSP = 68, FSW = 60
SSP = 79, SSW = 16
ACE = 6, DBF = 2
NPA = 6, NPW = 64
then the estimated probability of winning is between 66% and 95%.
Note: This was Djokovic vs. Nadal and Nadal won (Nadal beat the odds).

28 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How can we tell how well our model fits? In-sample confusion matrix:
              Predicted Win    Predicted Loss
True Win           49                10
True Loss          14                53

29 Important Definitions:
              Predicted Win    Predicted Loss
True Win           49                10
True Loss          14                53
Sensitivity: percent of true wins that are predicted as wins (49/59)
Specificity: percent of true losses that are predicted as losses (53/67)
Positive Predictive Value: percent of predicted wins that are true wins (49/63)
Negative Predictive Value: percent of predicted losses that are true losses (53/63)
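The four definitions can be recomputed from the cell counts. In this Python sketch the counts 49, 10, 14, and 53 are the ones implied by the ratios on the slide; the variable names are illustrative.

```python
# Cell counts implied by the ratios above.
tp, fn = 49, 10   # true wins: predicted win, predicted loss
fp, tn = 14, 53   # true losses: predicted win, predicted loss

sensitivity = tp / (tp + fn)   # 49/59
specificity = tn / (tn + fp)   # 53/67
ppv = tp / (tp + fp)           # 49/63
npv = tn / (tn + fn)           # 53/63
misclassified = (fn + fp) / (tp + fn + fp + tn)   # 24/126

print(round(sensitivity, 3), round(specificity, 3), round(ppv, 3), round(npv, 3))
```

Sensitivity and specificity condition on the true outcome (rows of the table), while the predictive values condition on the prediction (columns), which is why their denominators differ.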

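30 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How can we tell how well our model fits? Pseudo-$R^2$:
$R^2_{pseudo} = 1 - \frac{\text{What's Left Over After Model}}{\text{Total Variation}} = 1 - \frac{\text{Residual Deviance}}{\text{Null Deviance}}$
Interpretation: Percent of variation in log(p/(1−p)) explained by modeling.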
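Both deviances are −2 times a Bernoulli log-likelihood: the residual deviance uses the model's fitted probabilities and the null deviance uses the overall win rate for every observation. A minimal Python sketch with hypothetical outcomes and fitted probabilities:

```python
import math

def deviance(y, p_hat):
    """Binomial deviance for 0/1 outcomes: -2 times the Bernoulli log-likelihood."""
    return -2.0 * sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                      for yi, pi in zip(y, p_hat))

def pseudo_r2(y, p_hat):
    """1 - residual deviance / null deviance (null model: intercept only)."""
    p_null = sum(y) / len(y)
    null_dev = deviance(y, [p_null] * len(y))
    return 1.0 - deviance(y, p_hat) / null_dev

# Hypothetical outcomes and fitted probabilities.
y = [0, 0, 1, 1]
p_hat = [0.1, 0.2, 0.8, 0.9]
r2 = pseudo_r2(y, p_hat)
print(r2)
```

A model that predicts the overall win rate for everyone gets a pseudo-$R^2$ of 0, and the value approaches 1 as the fitted probabilities move toward the observed 0/1 outcomes.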

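31 Logistic Regression Model: $y_i \overset{ind}{\sim} \text{Bern}(p_i)$, $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j$
How can we tell how well our model predicts? Cross-validated confusion matrix: split the data into training and test sets, fit the model on the training set, then build the confusion matrix from predictions on the test set.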
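A single train/test split can be sketched in Python as follows. Everything here is an assumption made for illustration: the data are simulated, the model has one predictor, the fit uses plain gradient ascent rather than R's fitting routine, and the cutoff is fixed at 0.5.

```python
import math
import random

def fit_logistic(x, y, n_iter=2000, lr=0.05):
    """Fit a one-predictor logistic regression by gradient ascent (illustrative only)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        b0 += lr * sum(yi - pi for yi, pi in zip(y, p)) / n
        b1 += lr * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p)) / n
    return b0, b1

def confusion_matrix(y, y_hat):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for yi, yh in zip(y, y_hat):
        if yi == 1:
            counts["tp" if yh == 1 else "fn"] += 1
        else:
            counts["fp" if yh == 1 else "tn"] += 1
    return counts

# Simulated data (hypothetical): true intercept 0.5 and slope 1.2.
random.seed(1)
x = [random.uniform(-3, 3) for _ in range(200)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(0.5 + 1.2 * xi))) else 0 for xi in x]

# Hold out 50 randomly chosen observations as the test set.
idx = list(range(len(x)))
random.shuffle(idx)
test_idx, train_idx = idx[:50], idx[50:]

# Fit on the training set only, then classify the held-out observations.
b0, b1 = fit_logistic([x[i] for i in train_idx], [y[i] for i in train_idx])
p_test = [1.0 / (1.0 + math.exp(-(b0 + b1 * x[i]))) for i in test_idx]
y_hat = [1 if p > 0.5 else 0 for p in p_test]
cm = confusion_matrix([y[i] for i in test_idx], y_hat)
print(cm)
```

Repeating the split many times and averaging the resulting sensitivity, specificity, and predictive values gives a steadier picture than any single split.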

32 End of Tennis Analysis (see webpage for R code)

Introduction to Logistic Regression

Introduction to Logistic Regression Problem & Data Overview Primary Research Questions: 1. What are the risk factors associated with CHD? Regression Questions: 1. What is Y? 2. What is X? Did player develop