Count data

1. Estimating, testing proportions

100 seeds, 45 germinate. We estimate the probability p that a seed will germinate to be 0.45 for this population. Is a 50% germination rate a reasonable possibility?

Pr{ 45 or fewer germinate in 100 trials if p = 0.5 } = ???

Binomial:
  n independent trials
  Each trial results in success or failure
  p = probability of success, the same on every trial
  X = observed number of successes in n trials

  Pr{X = r} = n!/[r!(n-r)!] p^r (1-p)^(n-r)

Recall: 0! = 1, 1! = 1, 2! = 2*1 = 2, 3! = 3*2*1 = 6, 4! = 4*3*2*1 = 24, etc.

Example: soft drinks - 2 cups, one with artificial sweetener, one with sugar.
Task: guess which has the artificial sweetener. 6 people, 5 get it right. Can they tell the difference?

H0: p = 0.5 (indistinguishable)

  Pr{ 5 or 6 right if p = 0.5 } = [6!/(5!1!)](0.5)^5(0.5)^1 + [6!/(6!0!)](0.5)^6
                                = 6/64 + 1/64 = 7/64 = 0.11 > 0.05, not significant.

Logistic Regression

Idea: p = probability of germinating = a function of some variables (maybe temperature, moisture, or both).

Example:
  Temperatures, germinating:     70 73 78 64 67 71 77 85 82
  Temperatures, not germinating: 50 63 58 72 67 75

First idea: regress Germ (0 or 1) on temperature:

                        Analysis of Variance
                                 Sum of        Mean
  Source             DF         Squares      Square    F Value    Pr > F
  Model               1         1.09755     1.09755       5.70    0.0328
  Error              13         2.50245     0.19250
  Corrected Total    14         3.60000

                        Parameter Estimates
                          Parameter     Standard
  Variable          DF     Estimate        Error    t Value    Pr > |t|
  Intercept          1     -1.55013      0.90756      -1.71      0.1114
  temperature        1      0.03066      0.01284       2.39      0.0328
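Both tail probabilities above can be checked directly from the binomial formula. The following Python sketch is illustrative and not part of the original SAS notes; the function name is mine.

```python
from math import comb

def binom_tail_le(n, r, p):
    """Pr{X <= r} for X ~ Binomial(n, p), summed exactly from the formula."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

# Seed germination: Pr{45 or fewer germinate in 100 trials if p = 0.5}
p_seeds = binom_tail_le(100, 45, 0.5)

# Taste test: Pr{5 or 6 right out of 6 if p = 0.5} = 7/64
p_taste = 1 - binom_tail_le(6, 4, 0.5)
```

The seed probability comes out to about 0.18, well above 0.05, so a 50% germination rate is not contradicted by the data; the taste-test probability is exactly 7/64 = 0.109, matching the hand calculation.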

Predicted response (= estimated probability of germination):

  Y    predicted p    temperature
  1        0.59591             70
  1        0.68789             73
  1        0.84117             78
  1        0.41197             64
  1        0.50394             67
  1        0.62657             71
  1        0.81052             77
  1        1.05578             85   <--
  1        0.96380             82
  0       -0.01724             50   <--
  0        0.38131             63
  0        0.22802             58
  0        0.65723             72
  0        0.50394             67
  0        0.74920             75

* Normal residuals?
* Reasonable predicted probabilities? 1.0558? -0.0172?

Better idea: map 0 < p < 1 into L = logit = ln(p/(1-p)), then model L as

  L = α0 + β(temperature)

or equivalently

  L = α + βX where X = temperature - 70 (approximate centering, for nicer graphics).

L is a function [ ln(p/(1-p)) ] of the expected value [p] of a binary variable Y, so L = ln(E{Y}/(1-E{Y})).

"Likelihood" = probability of the sample = p(1-p)ppp(1-p)p...p, using p for a germinated seed and 1-p for a non-germinated one. The values of p are all different, depending on temperature.

Substitute p = e^(α+βX)/(1+e^(α+βX)), and thus 1-p = 1/(1+e^(α+βX)).

"Maximum Likelihood Estimates"

  Likelihood = [e^(α+β(70-70))/(1+e^(α+β(70-70)))][e^(α+β(73-70))/(1+e^(α+β(73-70)))]...[1/(1+e^(α+β(75-70)))]

Graph the likelihood against α and β and find the values of α and β that maximize it.

[Figure: Example 2, Germination vs. Temperature. Likelihood surface vs. α and β, using input (temperature-70); maximum at α = 0.4961, β = 0.1821.]
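The maximization sketched above can be carried out with a few Newton-Raphson (Fisher scoring) steps, which is what PROC LOGISTIC does. This Python sketch is not part of the original SAS notes (the function name is mine); it reproduces the estimates α = 0.4961, β = 0.1821 from the germination data.

```python
import math

# Germination data from the notes; X = temperature - 70, Y = 1 if germinated.
germ_temps = [70, 73, 78, 64, 67, 71, 77, 85, 82]
nogerm_temps = [50, 63, 58, 72, 67, 75]
data = [(t - 70, 1) for t in germ_temps] + [(t - 70, 0) for t in nogerm_temps]

def fit_logistic(data, iters=25):
    """Maximize the logistic-regression likelihood by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))  # current fitted probability
            g0 += y - p                               # score for alpha
            g1 += (y - p) * x                         # score for beta
            w = p * (1.0 - p)                         # Fisher weight
            h00 += w; h01 += w * x; h11 += w * x * x  # information matrix entries
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det              # Newton step: H^-1 * score
        b += (h00 * g1 - h01 * g0) / det
    return a, b

alpha, beta = fit_logistic(data)

# -2 log(likelihood) at the maximum, reported by PROC LOGISTIC as 14.8667
m2ll = -2 * sum(
    math.log(p if y else 1 - p)
    for x, y in data
    for p in [1.0 / (1.0 + math.exp(-(alpha + beta * x)))])
```

Starting from (0, 0), a handful of iterations reaches the same maximum as the SAS iteration history shown later.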

For statistical reasons we often plot -2 ln(likelihood). Here this quantity has been truncated at a value such that the elliptical region forms a 95% confidence region for the true (α, β).

[Figure: Example 2, Germination vs. Temperature. Surface of -2 log(likelihood) vs. α and β, using input (temperature-70); the plane shows the 95% confidence ellipsoid. Minimum at α = 0.4961, β = 0.1821.]

Maximum likelihood theory also gives standard errors (large-sample approximations). Use PROC LOGISTIC (or PROC CATMOD) to fit. We get

  Pr{ Germinate } = e^L/(1+e^L) = 1/(1+e^(-L)) where L = 0.4961 + 0.1821*X and X = temperature - 70.

Title " ";
Data seeds;
  Input Germ $ 1-3 n @;
  Y = (Germ = "Yes");
  If Germ = " " then Y = .;
  do i = 1 to n;
    input temp @;
    X = temp - 70;
    output;
  end;
cards;
Yes 9 64 67 70 71 73 77 78 82 85
No  6 50 58 63 67 72 75
    23 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90
;
PROC LOGISTIC data=seeds order=data;
  model Germ = X / itprint ctable pprob=0.7000;
  output out=out1 predicted=p xbeta=logit;
proc sort data=out1;
  by X;
proc sgplot;
  series X=temp Y=p;
  scatter X=temp Y=Y;
run; quit;

The third data line (blank Germ, n = 23) supplies extra temperatures only for plotting the fitted curve; their response Y is set to missing, so they are read but not used in the fit.

The LOGISTIC Procedure

Model Information
  Data Set                    WORK.SEEDS
  Response Variable           Germ
  Number of Response Levels   2
  Model                       binary logit
  Optimization Technique      Fisher's scoring

  Number of Observations Read   38
  Number of Observations Used   15

Response Profile
  Ordered                Total
  Value     Germ     Frequency
  1         Yes              9
  2         No               6

Probability modeled is Germ='Yes'.
NOTE: 23 observations were deleted due to missing values for the response or explanatory variables.

Maximum Likelihood Iteration History
  Iter   Ridge     -2 Log L    Intercept           X
     0       0    20.190350     0.405465    0
     1       0    15.205626     0.388433    0.127740
     2       0    14.878609     0.478775    0.171150
     3       0    14.866742     0.495409    0.181644
     4       0    14.866718     0.496105    0.182141

Last Change in -2 Log L: 0.0000235249
Last Evaluation of Gradient:  Intercept 1.5459414E-6    X 0.0000962751
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
               Intercept    Intercept and
  Criterion         Only       Covariates
  AIC             22.190           18.867
  SC              22.898           20.283
  -2 Log L        20.190           14.867
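AIC and SC are simple penalized versions of -2 Log L (with k = number of parameters: 1 for the intercept-only model, 2 with the covariate), and the likelihood-ratio chi-square reported in the next table is just the drop in -2 Log L. A Python check of this arithmetic (illustrative; not part of the SAS output):

```python
from math import log

n = 15                # observations used
m2ll_null = 20.190    # -2 Log L, intercept only
m2ll_full = 14.867    # -2 Log L, intercept and covariates

def aic(m2ll, k):
    """AIC = -2 Log L + 2k, where k = number of parameters."""
    return m2ll + 2 * k

def sc(m2ll, k, n):
    """Schwarz criterion: -2 Log L + k * ln(n)."""
    return m2ll + k * log(n)

lr_chisq = m2ll_null - m2ll_full  # likelihood-ratio chi-square, 1 df
```

These reproduce AIC = 22.190 and 18.867, SC = 22.898 and 20.283, and a likelihood-ratio statistic of about 5.32.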

Testing Global Null Hypothesis: BETA=0
  Test                 Chi-Square   DF   Pr > ChiSq
  Likelihood Ratio         5.3236    1       0.0210
  Score                    4.5731    1       0.0325
  Wald                     3.1024    1       0.0782

Analysis of Maximum Likelihood Estimates
                            Standard         Wald
  Parameter   DF  Estimate     Error   Chi-Square   Pr > ChiSq
  Intercept    1    0.4961    0.6413       0.5984       0.4392
  X            1    0.1821    0.1034       3.1024       0.0782

Odds Ratio Estimates
             Point         95% Wald
  Effect  Estimate  Confidence Limits
  X          1.200     0.980    1.469

Association of Predicted Probabilities and Observed Responses
  Percent Concordant   79.6    Somers' D   0.611
  Percent Discordant   18.5    Gamma       0.623
  Percent Tied          1.9    Tau-a       0.314
  Pairs                54      c           0.806

Classification Table
              Correct        Incorrect                 Percentages
  Prob            Non-           Non-           Sensi-  Speci-  False  False
  Level  Event   Event   Event  Event  Correct  tivity  ficity    POS    NEG
  0.700      5       4       2      4     60.0    55.6    66.7   28.6   50.0

ctable pprob=.7000 prints a "classification table" using the cutoff probability 0.7000. In SAS, each observation is removed in turn, the logistic regression is refit, and if the re-estimated p for that observation exceeds 0.7000, a 1 is predicted; otherwise a 0. In this example that makes no difference: the model fit to the full data gives the same classification table as this more careful leave-one-out method.

Odds: p/(1-p). L is ln(odds) = ln(p/(1-p)).

At X, L is L0 = α + βX, p becomes p0 = exp(L0)/(1+exp(L0)), and the odds are exp(L0).
At X+1, L is L1 = α + β(X+1), p becomes p1 = exp(L1)/(1+exp(L1)), and the odds are exp(L1).

Each L is a logarithm of odds, so their difference (which is β, the slope on the logit scale) is the logarithm of the odds RATIO, because with logarithms ln(a/b) = ln(a) - ln(b). Thus (α + β(X+1)) - (α + βX) = β is the log of the odds RATIO, and exp(β) is the odds ratio itself. For numbers Y near 0, like Y = 0.18 for example, exp(Y) is approximately 1+Y, so if the log odds ratio is 0.18 then the odds ratio is approximately 1.18.
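The odds-ratio point estimate and its Wald limits come from exponentiating the slope and the endpoints of its normal-theory confidence interval. A quick Python check (illustrative; not part of the SAS output):

```python
from math import exp

beta, se = 0.1821, 0.1034  # slope estimate and standard error from the output above
z = 1.96                   # approximate 97.5th percentile of the standard normal

odds_ratio = exp(beta)                           # point estimate: about 1.200
lo, hi = exp(beta - z * se), exp(beta + z * se)  # 95% Wald limits: about 0.980, 1.469

# Near 0, exp(beta) is approximately 1 + beta: 1.18 vs. the exact 1.20 here.
```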

Small example from the notes, organized in a table:

                           Actual
  Decision      Non-Event     Event
  Non-Event         4           4       (8)
  Event             2           5       (7)
                   (6)         (9)

Your DECISION is either correct or incorrect.

The number of correct events (you say event and are correct) is 5, in the lower right corner.

"Number of incorrect events": does this mean you said it was an event and were wrong, OR that it was an event and you said it was not? It is the first: you said event and were incorrect twice (2 of the 7 declarations of event were wrong). It is the number of event decisions that were wrong - the lower left corner.

The number of correct non-events is 4. The number of incorrect non-events is also 4 (you said non-event when it was really an event 4 times); out of 8 declarations of non-event, 4 were incorrect.

These counts match the SAS classification table above (5 correct events, 4 correct non-events, 2 incorrect events, 4 incorrect non-events).

Sensitivity = Pr{say event given that it is an event}, denoted Pr{say event | event}
            = Pr{say event and it was an event}/Pr{true event} = 5/9 = 0.556

Specificity = Pr{say non-event | non-event} = 4/6 = 0.667

Proportion of false positives = proportion of positive decisions that are false
            = Pr{non-event | say event} = 2/7 = 0.286
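The four percentages in the classification table follow directly from the 2x2 counts. A small Python check (illustrative; the variable names are mine):

```python
# 2x2 decision table from the notes: counts[decision][actual],
# with index 0 = non-event and 1 = event.
counts = [[4, 4],   # decided non-event: 4 actual non-events, 4 actual events
          [2, 5]]   # decided event:     2 actual non-events, 5 actual events

sensitivity = counts[1][1] / (counts[0][1] + counts[1][1])  # 5/9: correct among actual events
specificity = counts[0][0] / (counts[0][0] + counts[1][0])  # 4/6: correct among actual non-events
false_pos   = counts[1][0] / (counts[1][0] + counts[1][1])  # 2/7: wrong among event decisions
false_neg   = counts[0][1] / (counts[0][0] + counts[0][1])  # 4/8: wrong among non-event decisions
```

These reproduce the 55.6, 66.7, 28.6, and 50.0 percent figures in the SAS classification table.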

Proportion of false negatives = proportion of negative decisions that are false
            = 4 (upper right) / 8 = 0.50

The following statistics are all rank-based correlation statistics for assessing the predictive ability of a model (nc = # concordant pairs, nd = # discordant pairs, N data points, t = # pairs with different responses):

  c (area under the ROC curve)    (nc + ½(# tied))/t
  Somers' D                       (nc - nd)/t
  Kendall's Tau-a                 (nc - nd)/(½N(N-1))
  Goodman-Kruskal Gamma           (nc - nd)/(nc + nd)
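Plugging in the germination example: the SAS association table above reports 54 pairs (9 events x 6 non-events), 79.6% concordant, 18.5% discordant, and 1.9% tied, which corresponds to nc = 43, nd = 10, and 1 tie (my reconstruction from the percentages). A Python check of the four formulas:

```python
# Counts reconstructed from the SAS association table above.
nc, nd, ties = 43, 10, 1
t = nc + nd + ties   # 54 pairs with different responses
N = 15               # observations used

c        = (nc + 0.5 * ties) / t          # area under the ROC curve
somers_d = (nc - nd) / t
tau_a    = (nc - nd) / (0.5 * N * (N - 1))
gamma    = (nc - nd) / (nc + nd)
```

These reproduce the reported values c = 0.806, Somers' D = 0.611, Tau-a = 0.314, and Gamma = 0.623.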