Statistics II Logistic Regression. So far... Two-way repeated measures ANOVA: an example. RM-ANOVA example: the data after log transform

Size: px

Start display at page:

Download "Statistics II Logistic Regression. So far... Two-way repeated measures ANOVA: an example. RM-ANOVA example: the data after log transform"

Hollie Higgins
5 years ago
Views:

Statistics II Logistic Regression Çağrı Çöltekin Exam date & time: June 21, 10:00 13:00 (The same day/time lanned at the beginning of the semester) University of Groningen, Det of Information Science

1 Statistics II Logistic Regression Çağrı Çöltekin Exam date & time: June 21, 10:00 13:00 (The same day/time lanned at the beginning of the semester) University of Groningen, Det of Information Science May 29, 2013 More on the exam next week Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 So far Two-way reeated measures ANOVA: an examle Simle regresion Single (tyically) numeric redictor, numeric resonse Multile regression Multile redictors, numeric resonse ANOVA Single categorical redictor, numeric resonse Factorail ANOVA Multile categorical redictor, numeric resonse Reated-measures ANOVA Like single/factorial ANOVA with deendent observations This week: logistic regression We return to last week s roduction exeriment examle We want to analyze the auses after three different hrase tyes: IP, PP and PAR In addition, we are also interested in effect of length of the hrase Particularly, whether it is single word or multi word We have a 3 2 reeated measures ANOVA We use 5 different examles for each exerimental condition, resulting in utterances recorded for each of our 30 articiants Note: we need to randomize the order of the utterances for each subject Why? Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 RM-ANOVA examle: the data RM-ANOVA examle: the data after log transform Pause (s) log(pause) IP PAR PP Phrase Tye multi single Length IP PAR PP Phrase Tye multi single Phrase Length Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 RM-ANOVA examle: assumtions RM-ANOVA examle: main ANOVA results Check each (3 2) grou for normality Check for shericity: Mauchly s Test for Shericity Effect W <05 2 hrase tye hrase tye : length Why don t we have a shericity test for length? What if shericity assumtion was violated? R outut: > summary ( aov ( log ( ause ) ~ hrase tye * length + Error ( subj ))) Error : subj Df Sum Sq Mean Sq F value Pr(>F) Residuals Error : Within Df Sum Sq Mean Sq F value Pr(>F) hrase tye <2e -16 *** length hrase tye : length * Residuals Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37

2 RM-ANOVA examle: interaction RM-ANOVA: a summary and further directions mean of log(ause) IP PAR PP RM-ANOVA is an efficient method for exerimentation ANOVA design can be mixed, where on can analyze multile between- as well as within-subject factors RM-ANOVA is difficult to lease, excet in carefully constructed exerimental conditions In our examle today, we ignored inter-hrase reeated measures This is a common roblem known as language-as-a-fixed-effect fallacy in sycholinguistics A newer alternative is to use linear mixed-effect models (or multilevel models) multi single Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Logistic regression: motivation Logistic regression: some examles For all methods discussed in this class, the resonse variable is numeric Sometimes we want to analyze/redict categorical resonses: Alive or dead Grammatical or ungrammatical Correct or incorrect Pass or fail Win or loose Present or absent Logistic regression is an extension of regression for categorical (tyically binary) resonse variables survival after a surgery deending on age, length of surgery, whether urchase occurs deending on age, income, website characteristics, whether seech errors occur deending on alcohol level when linguistic rules aly (final [t] in Dutch) deending on seed of utterance, stress, social grou, whether one votes to a olitical arty (or not) deending on age, income, ethnicity, Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Simle regression: a refresher Regression assumtions y i = a + bx i + e i y is the resonse (or outcome, or deendent) variable The index i reresent each unit observation/measurement (sometimes called a case ) x is the redictor (or exlanatory, or indeendent) variable a is the intercet b is the sloe of the regression line a + bx is the deterministic art of the model (we sometimes use ^y) e is the residual, error, or the variation that is not accounted for by the model Assumed to be (aroximately) normally distributed with 0 mean (e i are assumed to be iid) indeendence cases (observations) should be indeendent linearity the relation between y vs x should be linear normality residuals should be normally distributed with 0 mean constant variance residual variance should be constant Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Multile regression Multile regression: examle y i = a + b 1 x i,1 +b 2 x 2,i + + b k x }{{ k,i +e } i ^y Multile regression is a generalization of the simle regression, where we redict the outcome using multile redictors Multicollinearity causes roblems in estimation and interretation of multile-regression models Model selection (finding a model that fits the data well, but not more comlex than necessary) is imortant We will try to determine reaction time differences in a lexical decision task in two exerimental conditions C 1 and C 2, and also investigate the effect of age Here is how our data looks like, Reaction time (ms) condition age 563 C C C C 2 31 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37

3 Multile regression: examle (2) Running multile regression (R outut): Call : lm( formula = log (rt) ~ age + condition ) Estimate Std Error t value Pr ( > t ) ( Intercet ) < 2e -16 *** age conditionc e -14 *** --- Residual standard error : on 37 degrees of freedom Multile R- squared : 07927, Adjusted R- squared : F- statistic : 7074 on 2 and 37 DF, - value : 2276e -13 Logistic Regression Logistic regression is a secial case of regression, where a binary outcome is redicted based on a number of redictors The main trick is to redict robability of an event, rather than the outcome directly This allows converting a categorical variable to a numeric variable (robabilities) One needs to overcome difficulties due to regression assumtions Probabilities are strictly bounded between 0 and 1 Error (residuals) with binary (or roortion) resonse is not normally distributed Note: you should do model diagnostics before interreting the model Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 An examle for logistic regression We go back to our lexical decision task examle In similar exeriments tyically we also whether the reaction was correct or incorrect Our data actually looks like this: Reaction time (ms) condition age correct 563 C C C C Our aim is to redict robability of correct decision Trying ordinary regression Model the outcome directly: lm( formula = correct ~ age + condition ) Estimate Std Error t value Pr ( > t ) ( Intercet ) age * conditionc * P(correct) = age 0210 condition Estimated robability of correct resonse to condition 1 from a 40-year old is: = 1102 Resonse variable (robability) is not bounded between 0 and 1 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Transforming the resonse variable The logit function Instead of redicting the robability,, we can redicts odds Odds in favor of an event (eg, FC Groningen winning against Ajax) is defined based on, the robability of the event occurring, as odds = 1 odds are bounded between 0 and infinity, 0 odds further if we take the logarithm of odds, log 1 = logit() Bounds of logit is logit() + logit() logit() = log Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 But, what about the name logistic? Logistic function is the inverse of the logit function: logistic(x) logistic(x) = 1 1 e x x Trying ordinary regression with logit transform lm( formula = logit ( correct ) ~ age + condition ) Estimate Std Error t value Pr ( > t ) ( Intercet ) age * conditionc * Residual standard error : 2182 on 37 degrees of freedom Multile R- squared : 02501, Adjusted R- squared : F- statistic : 6169 on 2 and 37 DF, - value : Now our estimate is: logit(p(correct)) = age 1538 condition The model is more correct, but interretation is difficult Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37

4 Logistic regression: are we there yet? Let s look at the residual distribution: Regression for binary resonse: roblems so far Frequency Histogram of residuals Binary data is strictly bounded, robabilities smaller than 0 and larger than 1 are meaningless We solve this by transforming the resonse variable using logit function (other transformations are also ossible) When resonse is binary, the residuals are not normally distributed with 0 mean We estimate the coefficients without assuming normally distributed error The cost of is also giving u the familiar (and uniquely determined) least-squares regression Estimation is done with more comlicated rocedures (tyically maximum likelihood) More recisely, logistic regression is an instance of generalized linear models (GLMs) Residuals Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Logistic regression Maximum likelihood estimation Logistic regression is similar to (multile) regression, differences are: Categorical resonse variable Non-normal errors Relationshi is non-linear (linear after the transformation) Estimation is (tyically) done using maximum likelihood estimation Least-squares regression is not alicable logit( i ) = a + b 1 x 1,i + + b k x k,i + e i Maximum likelihood estimation tries to find the set of model arameters, or coefficients, a, b 1, b k, which make the data most likely (or minimizes the error) MLE is an iterative search for the otimum arameter values There is no exact solution In some cases, MLE may fail to find a solution If errors are normally distributed, MLE is equivalent to least-squares estimation Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Logistic regression: examle We return to the same examle, this time fitting a correct model: glm ( formula = correct ~ age + condition, family = binomial ) Deviance Estimate Std Error z value Pr ( > z ) ( Intercet ) age conditionc ( Disersion arameter for binomial family taken to be 1) Null deviance : on 39 degrees of freedom Residual deviance : on 37 degrees of freedom AIC : Number of Fisher Scoring iterations : 19 Estimated equation is: logit(p(correct)) = age condition Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Logistic regression: interreting the coefficients log P(correct) = age condition 1 P(correct) Sloe of a variable indicates the change in log odds for unit-change in the redictor (while other redictors ket constant) For examle, one unit change in age, say from 40 to 41, means log P(correct41) P(correct40) 1 P(correct 41) log 1 P(correct 40) = e log P(correct 41 ) 1 P(correct 41 ) log P(correct 40 ) 1 P(correct 40 ) = e} {{} ex(b) odds(correct 41) odds(correct 40) = When age is increased form 40 to 41, odds of correct resonse increase 12 times Note: the change is non-linear Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Logistic regression: inference and model fit Logistic regression: where things may go wrong In logistic regression the significance testing for coefficients are carried out using z-statistics (also known as Wald statistic or Wald s z ) We do not have r 2 as measure of fit (aroximations exists, but none have all the roerties of r 2 2LL ( 2 times Log Likelihood, smaller better) is the measure of fit in maximum likelihood estimation Log likelihood is not bounded like r 2, comarisons are valid only on the same data set 2LL is aroximately χ 2 distributed, a model can be tested against null model (similar to F-test for least-squares regression) Overdisersion is the case when variance in the data is higher than exected Logistic regression requires resonse to follow binomial distribution Variance of binomial distribution is redictable from it s mean In case of comlete searation, ie, when redictors erfectly searate cases of success and failure, logistic regression arameters cannot be estimated Logistic regression may also fail to find a solution, esecially when data is not evenly distributed Logistic regression is multile regression, other roblems, such as collinearity is also roblems for logistic regression Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37

5 Logistic regression: another examle We would like to guess whether a child would develo dyslexia or not based on a test alied to re-verbal children Here is a simlified roblem: We test children when they are less than 2 years of age The hyothesis is that the test may be relevant in diagnosing dyslexia We observe the same children when they are in the school age, and note whether they are diagnosed with dyslexia or not Our data looks like the following: Test score Dyslexia Examle: the analysis glm ( formula = dys ~ score, family = binomial, data = d) Deviance Estimate Std Error z value Pr ( > z ) ( Intercet ) ** score ** --- ( Disersion arameter for binomial family taken to be 1) Null deviance : on 39 degrees of freedom Residual deviance : on 38 degrees of freedom AIC : Number of Fisher Scoring iterations : 5 logit((dyslexia)) = score * The research question is from the longitudinal study by Ben Maasen and his colleagues Data is fake as usual Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Examle: interretation Visualizing the logistic regression logit((dyslexia)) = score score logit = log 1 odds = (dyslexia) Test score Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Summary Next week Logistic regression is alicable when resonseis binomial Two major differences from ordinary least-squares regression: Resonse variable is strictly bounded logit transformation make sure regression outut is maed to range [0, 1] Errors are not normal GLM framework allows non-normal errors However, we use maximum likelihood estimation instead of least squares Problems to watch out for: overdiserion comlete searation, or insufficient data for estimation like in multile regression: multicollinearity Extensions Multinomial logistic regression, when there are more than two categories of the resonse variable Genralized linear mixed-effect models when observations are not indeendent A summary of all methods When to use which analysis What to do when assumtions fail Beyond hyothesis testing: effect sizes Time for your questions Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37 Ç Çöltekin / RuG Statistics II: Logistic Regression May 29, / 37

STK4900/ Lecture 7. Program

STK4900/ Lecture 7. Program STK4900/9900 - Lecture 7 Program 1. Logistic regression with one redictor 2. Maximum likelihood estimation 3. Logistic regression with several redictors 4. Deviance and likelihood ratio tests 5. A comment