STA6938-Logistic Regression Model

Size: px

Start display at page:

Download "STA6938-Logistic Regression Model"

Kristin Atkins
5 years ago
Views:

1 Dr. Ying Zhang STA6938-Logistic Regression Model Topic 6-Logistic Regression for Case-Control Studies Outlines: 1. Biomedical Designs 2. Logistic Regression Models for Case-Control Studies 3. Logistic Regression Model for Matched Case-Control Studies 1

2 1. Biomedical Designs A. Types of Studies 1. Observational Study: collect data from an existing situation. The data collection does not intentionally interfere with the running of the system. Example: Sample Survey - sending out questionnaires about health care. 2. Experiment Study: investigator deliberately sets one or more factors to a specific level. Laboratory experiment: it takes place in an environment where experiment manipulation is facilitated. Comparative experiment: it compares two or more techniques, treatments or levels of some variables. Experimental unit (study unit): the smallest unit upon which the experiment or study is performed. Example: In a clinical study, the experimental units are usually humans. Cross-over experiment: the same experimental unit receives more than one treatment, or is investigated under more than one condition of the experiment. The different treatments are given during non-overlapping time periods. 2

3 Example: Laboratory animals are treated sequentially with more than one drug, and blood levels of certain metabolites are measured for each of the drugs. 3. Clinical Study: it takes place in the setting of clinical medicine. 4. Prospective Study: a cohort of people is followed for the occurrence or nonoccurrence of specified endpoints or events or measurements. Cohort: a group of people whose membership is clearly defined. Example: All individuals enrolling in the Graduate School at UCF. Endpoint: clearly defined outcome or event associated with an experimental or study unit. Baseline characteristics or baseline variables: values collected at the time of entry into the study. Example: (Cardiovascular Disease) A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women years of age. It is found that among 5,000 current OC users at baseline, 13 women develop a myocardial infarction (MI) over a 3- year period, while among 10,000 non-oc users, 7 develop an MI over a 3-year period. 3

4 OC Use Group MI Status Over 3 Years OC Users Non-OC Users This is a typical example of a prospective design. Cohort: Women in age years were disease free at baseline and had their exposure (OC use) measured at time of entry. They were followed for three years, during which some developed disease, while others remained disease free. Drawback: The endpoint of interest may occur infrequently. If that happens, extremely large numbers of people need to be followed. 5. Retrospective Study: individuals having a particular outcome or endpoint are identified and studied. The individuals with endpoint are usually compared to other individuals without the endpoint to investigate the possible causal factors of the endpoint. Subjects with particular characteristics of interest are often collected into registries. In the United States there are cancer registries, often defined for a specified metropolitan area. A cancer registry can be used retrospectively to compare presence or absence of possible causal factors of cancer after generating appropriate controls either randomly from the same population or by some matching mechanism. Alternatively, a cancer registry can be used prospectively by comparing survival times of cancer patients having various therapies. 4

5 Example: A hypothesis has been proposed that breast cancer in women is caused in part by events that occur between the age at menarche (i.e., the age when menstruation begins) and at first childbirth. In particular, the hypothesis states that the risk of breast cancer increases as the length of this time interval increases. If the theory is correct, then an important risk factor for breast cancer is age at first birth. This theory would explain in part why breast cancer incidence seems to be higher for women in the upper socioeconomic groups, since they tend to have their children relatively late. This is a typical example of a retrospective study. Breast cancer cases were identified together with controls who were in the hospital at the same time as the cases but did not have breast cancer and were of comparable age to the controls. Pregnancy history (age at first birth) was compared between cases and controls. Comparison between the two types of studies: Less bias in a prospective study than in a retrospective study, because (i) Patient s knowledge of their current health situation will be more precise than their recall of their past health situation. (ii) It is much more difficult to obtain a representative sample of people who already have the disease in question since, for example, some of the severe diseased individuals may have already died. (iii) The diseased individuals, if still alive, may tend to give biased answers about prior 5

6 health situation if they believe there is a relationship between their prior health situation and the disease. Retrospective study is much less expensive to perform and can be completed in much less time than a prospective study. 6. Case-control Study: select all cases, usually a disease, meeting fixed criteria. A comparison group for the cases, called controls is also selected. The cases and controls are compared with respect to various characteristics. Usually, the case-control study is applied to some rare disease in which the cases are hard to get. 7. Longitudinal Study: collect information on study units over a specified time interval. 8. Cross-sectional Study: collect information on study units at some fixed time. Longitudinal Study Time Cross- Sectional Study 9. Placebo: the treatment is designed to appear exactly like a comparison treatment, but to be devoid of the active part of the treatment. 6

7 10. Blind Study: B. Ethics Single blind: treated subjects are unaware of which treatment (including any control) they are receiving. Double blind: both patients and the individuals evaluating the outcome variables are unaware of which treatment the subjects are receiving. In the biomedical field, many researches and experiments involve both animal and/or human participants. Moral and legal issues are involved in both areas. 1. All investigators involved in a study are responsible for the conduct of an ethical study to the extent that they may be expected to know what is involved in the study. 2. Proposed studies involving humans (or animals) should be reviewed by individuals not concerned or connected with the study or the investigators. 3. Individuals participating in an experiment should understand and sign an informed consent form. 4. Participants should be free to withdraw at any time, or to refuse initial participation, without being penalized or jeopardized with respect to current and future care and activities. 5. Animal studies should be done prior to human experimentation. C. Steps to perform a study 7

8 Step 1: A question or problem area of interest is considered. Scientific question, not involve biostatistics Step 2: A study is to be designed to tackle the question (i) Identify the data to be collected Variable Experiment units Sample size (ii) An appropriate analytical model needs to be developed for describing and processing the data. (iii) What are the inferences one hopes to make from the particular study? What conclusions might one draw from the study? To what population(s) is the conclusion applicable? Step 3: The study is carried out and the data are collected Step 4: The data are analyzed and conclusions and inferences are drawn. Step 5: The results are used (i) changing operating procedures (ii) publishing results (iii) planning a subsequent study 8

9 2. Logistic Regression Models for Case-Control Studies a. Prospective Cohort studies In prospective cohort studies, a subject selected randomly from corresponding population is followed for a fixed period of time and the outcome variable is measured. A. Random sampling (including randomized trial according to treatments) Likelihood: L( β ) = PY ( i = yi xi) n i= 1 B. Stratified randomization: location, clinic, age. k Likelihood: L( β ) = PY ( ij = yij xij ) n i i= 1 j= 1 1. Model account for stratum effect π ( xij ) P( Yij 1 xij ) ' ( β0, i + β xij) ' ( β0, i β xi ) exp = = = 1 + exp + j 2. Model account for stratum-specific response π ( xij ) P( Yij 1 xij ) ' ( β0, i + βixij) ' ( β0, i βixij) exp = = = 1 + exp + To estimate these parameters, one can just use regular logistic regression model by adding dummy variable for 9

10 stratum and also consider the interaction between each covariate and stratum variables. b. Retrospective Case-Control Studies In retrospective case-control studies, we actually have two strata defined by the outcome variable (Y=1 or 0). The characteristics of subjects in each stratum are collected retrospectively. Let s be the selection variable, recording the selection status for each subject in the population. Suppose we have a case-control study with and n 0 controls (Y=0). n 1 cases (Y=1) n 1 0 Full Likelihood = Px ( y= 1, s= 1) Px ( y= 0, s= 1) (1) i i i i i i i= 1 i= 1 n Question: How do we incorporate the logistic regression model, ' exp( β0 + β x) π ( x) = P( Y = 1 x) = ' 1 + exp β + β x into the full likelihood (1)? ( 0 ) 10

11 Note: ( ) ( ) ( ) ( ) P( y s ) P x, y s = 1 P y x, s = 1 P x s = 1 P( x y, s = 1 ) = = (2) P y s = 1 = 1 ( 1 x, s 1) P y = = = = ( = 1, = 1 ) P( s = 1 x) P( y = 1 x) P( s = 1 y = 1, x) ( = 0 ) ( = 1 = 0, ) + ( = 1 ) ( = 1 = 1, ) P y s x P y x P s y x P y x P s y x Assume that the selection criterion is covariate independent, then 1 and 0 ( 1 1, ) ( 1 1) τ = P s = y = x = P s= y = τ = P( s = 1 y = 0, x) = P( s= 1 y = 0) and thus ( 1 x, s 1) P y = = = τπ 1 ( x) ( x ) + τ 1 π( ) τ π( x) 0 1 τ1 π ( x) τ 1 π( x) ' ( β + β x) ' ( x) 0 0 = = = π τ1 π( x) 1+ exp β0 + β 1+ τ 1 π( x) 0 exp ( x) with ( ) β = τ τ + β. 0 ln 1/

12 Also, we can assume that ( ) Px ( s= 1) = P x: the probability distribution of the covariates, because usually sampling is carried out independent of covariate values. Thus (2) can be written as ( 1, s 1) P x y = = = ( x) P( x) ( = 1 s= 1) π P y. We can do the same for P( x y 0, s 1) likelihood (1) can be reformulated as = =. Thus the full P( xi ) ( s = 1) n = L ( β ) i= 1 P yi i Full Likelihood, where L n y ( ) ( ) 1 ( ) 1 i β = π x i π x i yi. i= 1 -This is the likelihood obtained when we pretend the casecontrol data were collected in a cohort study. 12

13 What does this likelihood imply? Note that ( ) ( ) P y = 1 s = 1 = n / n and P y = 0 s = 1 = n / n. i i 1 i i 0 If we assume that the distribution of covariates in non informative about the regression coefficient, then maximizing the full likelihood with respect to β is equivalent to maximizing ( ) L β. Thus, maximization of the full likelihood with respect to the parameters in π ( x) need only considers that portion of the likelihood which looks like a cohort study. Implication: analysis of data from case-control studies via logistic regression may proceed in the same way and using the same computer programs as cohort study. But inference about β 0 is not available. Theoretically, the maximum likelihood estimators obtained by pretending that case-control data resulted from a cohort sample have the usual asymptotic properties associated with maximum likelihood estimators. Prentice and Pyke (1979) Biometrika, 66,

14 3. Logistic Regression Model for Matched Case-Control Studies Matched Case-Control Study: Subjects are stratified on the basis of variables belied to be associated with the outcome. Variables like age, sex are commonly used as stratification variables. Within each stratum, samples of cases (y=1) and controls (y=0) are chosen 1-1 matched study: one case and one control selected from each stratum 1-M matched study: one case and M controls selected from each stratum. A. Theoretical aspects of matched study Suppose we have total p strata, which amounts of total (n1+ n0) pobservations, which contains n1 cases and n0 controls for each stratum. Since we assume that the stratified variable is related to outcome, we should consider build a stratum-specific logistic regression model, i.e. p n + n 0 1 yki L ( β) = πk ( xk, i) 1 πk ( xk, i) where k= 1 i= 1 1 y, ki, 14

15 π k ( x) β k,0 e = 1 + e β + β' x k,0 + β' x (3) Note that the number of parameters increases in the rate of sample size. Consequence: failure to use the regular large sample theorem for making statistical inference. Remedy: Consider the conditional likelihood of observed data given the stratum total and the total number of cases observed. For the kth stratum, assume we have total n k,1 n k,0 which contain cases and controls. n k observations The conditional probability of the observed data given the total observations and cases is L ( β ) = n k,1 k k,1 n k,1 k ( = 0 ) ( = 0) P x y P x y i i i i i= 1 i= n + 1 k,1 k c n n where k ( ji, 0 ) (, 0) j i = j jij i = j P x y P x y j= 1 i = 1 i = n + 1 j j k,1 n c k nk nk! = = n n! n n n k subjects. cases to ( ) k,1 k,1 k k,1! : total possible ways to assign n k,1 15

16 If we assume the stratum-specific logistic regression model (3), then applying Bayes theorem to P( x y ) leads that L k ( ) n k,1 i= 1 β = c n k k,1 j= 1 i = 1 j e β ' x e i β ' x ji, j (4) Then the likelihood function for the K strata data is L ( ) n k,1 β ' xi K e i 1 c n k k,1 k = 1 β ' x ji, j. = e j= 1 i j = 1 β = Note that the number of unknown parameters in this likelihood function is fixed as the number of strata increases. The regular statistical inference procedures based on likelihood analysis apply. How to solve the problem? (Estimate the regression parameters) Modify the proportional hazards regression Some tricks in logistic regression model 16

17 B. 1-1 matched study Let x 1,k and x 0,k denote the covariate vectors for the case and control, respectively, in the kth stratum. The likelihood function (4) for the kth stratum is simply equal to L ( ) β = e β ' x k β x x 1, k e (5) ' 1, k β' 0, k + e L β = if x 1, k = x 0, k. Note that ( ) 0.5 k Implication: the kth stratum or kth pair is uninformative in estimating the coefficients-the value of covariate does not help distinguish which subject is more likely to be the case. So the estimator may be only based on small fraction of the total number of possible pairs (discordant pairs) if the covariate is a binary variable itself. Trick: We rewrite likelihood (5) in to L ( β ) β '( 1, 0, ) β ' = e e '( 1, 0, ) = β 1+ e 1 + e x x x k k k k x x x k k β ' k and then the full relevant conditional likelihood function can be written as L ( ) yk 1 yk k k k K K β' x K β' x β' x β e e e L ( β = k ) = = 1 β' x k β' xk β' xk k 1 k 11 e k 1 1 e 1 e = = + =

18 for all y k = 1, k = 1, 2,, K This implies that we can just think of the scenario that we have K observations with covariates given by xk = x1, k x0, k and all response being the case. Then the regular logistic regression package can be applied to find the estimate. But be aware that In making use of any statistical package for logistic regression model, make sure that the procedure allowing you to suppress the intercept term. Before making any adjustment, make sure that you have already had the design variables for the nominal covariates. Don t try to recode the categorical-like covariates in x k, just pretend that all covariates in that vector are continuous. Fortunately, SAS can do it! 18

19 Example 6.1 We fit the low birth-weight data LOWBWT11.DAT by taking care of the nature of 1-1 matched case-control experiment for this particular study. The data consists of 56 age matched case-control pairs. The description of the data set is given as follows Table: Code Sheet for the Variables in the Paired Low Birth Weight Data Set. Columns Variable Abbreviation Pair Number PAIR# 19 Low Birth Weight (0 = Birth Weight ge 2500g, LOW l = Birth Weight < 2500g) Age of the Mother in Years AGE Weight in Pounds at the Last Menstrual Period LWT 41 Race (1 = White, 2 = Black, 3 = Other) RACE 50 Smoking Status During Pregnancy (1 = Yes, 0 = No) SMOKE 57 History of Premature Labor (0 = None, 1 = Yes) PTD 63 History of Hypertension (1 = Yes, 0 = No) HT 69 Presence of Uterine Irritability (1 = Yes, 0 = No) UI We want to fit a logistic regression model to explain the risk factors for the low birth weight. 19

20 SAS code a. Data preparation data diffs; set Logistic.Lowbwtm11; / Initialize temporary variables to hold covariate values for the control of each pair. / retain pair1 lwt1 race3 race4 smoke1 ptd1 ht1 ui1 0; drop pair1 lwt1 race race3 race4 smoke1 ptd1 ht1 ui1; / Create design variables for race. / select(race); when (1) do; race1=0; race2=0; end; when (2) do; race1=1; race2=0; end; when (3) do; race1=0; race2=1; end; end; / Copy the values of the covariates of each control into temporary variables. / if(pair ne pair1) then do; pair1=pair; lwt1=lwt; race3=race1; race4=race2; smoke1=smoke; ptd1=ptd; ht1=ht; ui1=ui; end; / Calculate the differences (case-control) between the covariates of each pair and output the observation. / else do; lwt=lwt-lwt1; race1=race1-race3; race2=race2-race4; smoke=smoke-smoke1; ptd=ptd-ptd1; ht=ht-ht1; 20

21 end; run; ui=ui-ui1; output; b. Univariate Analysis / display the number of discordant pairs for the covariates. / proc freq data=diffs; tables lwt smoke race1 race2 ptd ht ui; run; / Univariate analysis. / proc logistic data=diffs; model low=lwt /noint; run; proc logistic data=diffs; model low=smoke /noint; run; proc logistic data=diffs; model low=race1 race2 /noint; run; proc logistic data=diffs; model low=ptd /noint; run; proc logistic data=diffs; model low=ht /noint; run; proc logistic data=diffs; model low=ui /noint; run; Table 1. Results from Univariate Analysis Variable Coeff. S.D. OR 95% Discordant Pairs ( n, n ) LWT (0.807,1.028) NA SMOKE (1.224,6.177) (22,8) RACE_ (0.391,3.042) NA RACE_ (0.446,2.114) NA PTD (1.245,11.299) (15,4) HT (0.603,9.023) (7,3) UI (0.968,9.302) (12,4) 21

22 Remarks: Odds ratio for LWT is based on the increment of 10 pounds For a binary variable, the coefficient is simply ˆ 10 β = ln n n 01 The univariate analysis shows that only SMOKE and PTD appear to be a significant factor for the low birth-weight. We have to be aware that the statistical inference we are making here is based on large sample theory. However, in the univariate analysis, the data are quite thin for many binary variables. So we shouldn t trust the results in Table 1 too much. c. Multivariate Analysis title1 'Multivariate Analysis of 1:1 Matched Case-Control Data'; proc logistic data=diffs; model low=lwt smoke race1 race2 ptd ht ui /noint; race: test race1=0, race2=0; run; 22

23 Model Fit Statistics Without With Criterion Covariates Covariates AIC SC Log L Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio Score Wald The LOGISTIC Procedure Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq LWT SMOKE race race PTD HT UI Linear Hypotheses Testing Results Wald Label Chi-Square DF Pr > ChiSq race Comments: By Wald-test, only RACE is apparently not a significant risk factor for the low birth-weight. However, the race is always considered as a potential confounding variable. Let examine the model without RACE. 23

24 proc logistic data=diffs; model low=lwt smoke ptd ht ui /noint; run; Convergence criterion (GCONV=1E-8) satisfied. Model Fit Statistics Without With Criterion Covariates Covariates AIC SC Log L Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio Score Wald The LOGISTIC Procedure Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq LWT SMOKE PTD HT UI Likelihood ratio test for RACE effect: G = ( ) = ( χ2 ) ( χ2 ) p value = Pr > G = Pr > = Adding RACE into the model does not seem to alter the regression coefficients too much. (22% is the maximum change happened for covariate LWT) 24

25 It is up to domain expert to decide if it is any advantage to keep this variable in the model. Here, we pretend that this variable is not practically important, so we will not consider including this variable in the model. d. Model diagnosis / Create an output data set of diagnostic statistics. / proc logistic data=diffs; model low=lwt smoke ptd ht ui /noint; output out=stats difchisq=d_chi difdev=d_dev h=hat predicted=pred; run; title1; options nolable; / Create some diagnostic plots. / proc gplot data=stats; axis1 label=(angle=-90 rotate=90 'Deviance Change'); plot d_devpred /frame vaxis=axis1; title2 'Change in Deviance Versus Predicted Probability'; run; axis2 label=(angle=-90 rotate=90 'Chi-Square Change'); bubble d_chipred=hat /frame vaxis=axis2; title2 'Change in Chi-Square Versus Predicted Probability'; run; 25

26 We plot the change of deviance and Pearson Chi-square statistics. 26

27 We found that there are 5 pairs that had large deviance and Pearson Chi-square changes if deleted. Among these five, two have the largest leverage values. These pairs should be closely examined. 27

STA6938-Logistic Regression Model

Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of