Beyond GLM and likelihood

Christian Henry

1 Stat 6620: Applied Linear Models
Department of Statistics
Western Michigan University

2 Statistics curriculum
Core knowledge (modeling and estimation)
  Math stat 1 (probability, distributions, convergence theorems)
  Math stat 2 (likelihood estimation, testing)
  Linear models (correlation, regression, ANOVA)
  Generalized linear models (repeated measures, random effects)
  Bayesian data analysis
Data types
  Categorical data (logistic regression, odds ratios, CMH χ²)
  Nonparametric data analysis
  Survival data
  Multivariate data
Computing
  Stat computing 1 (SAS, R, SPSS, Python)
  Stat computing 2 (data mining, machine learning)

3 Logistic regression: binary response
[Data table. Columns: Subj, Group (the response Y: Stroke or Nonstroke), and the predictors X: LogHcy, Sex, Age, BMI, SBP. Rows run from subject 1 to at least subject 1919; numeric entries not shown.]

4 Logistic regression

5 Logistic regression
  Model 1: unadjusted
  Model 2: adjusted for age and sex
  Model 3: adjusted for age, sex, BMI, SBP, DBP, Gluc, TCh, Trigl, HDL, LDL, HoS, HoA

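6 Logistic regression
$$Y = \begin{cases} 1 & \text{if Stroke} \\ 0 & \text{if Nonstroke} \end{cases} \qquad X = \text{log Hcy}$$
Logistic model:
$$P[Y = 1 \mid X] = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
Probit model:
$$P[Y = 1 \mid X] = \Phi(\beta_0 + \beta_1 X)$$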
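
The two link functions can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative, the coefficients (-2, 2) are one of the candidate pairs from slide 7, and the x values are log Hcy values that appear in later slides.

```python
import math

def logistic_prob(x, b0, b1):
    """P[Y = 1 | X = x] under the logistic model: e^eta / (1 + e^eta)."""
    eta = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-eta))

def probit_prob(x, b0, b1):
    """P[Y = 1 | X = x] under the probit model; Phi is the standard normal CDF."""
    eta = b0 + b1 * x
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# Same coefficients under both links give different probabilities,
# so fitted coefficients are not comparable across the two models.
for x in (0.89, 1.10, 1.33, 1.45):
    print(x, round(logistic_prob(x, -2.0, 2.0), 3), round(probit_prob(x, -2.0, 2.0), 3))
```

Both functions map the linear predictor to (0, 1) and equal 0.5 when the linear predictor is zero; they differ in how quickly the tails approach 0 and 1.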

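7 Q: What values of $(\beta_0, \beta_1)$ fit the data best?
Three candidate pairs: (-2, 2), (-7, 5), (-19, 15).
[Table. Columns: Obs, loghcy (X), the response (Y), and the predicted probabilities pred1, pred2, pred3 under the three candidate pairs; entries not shown.]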

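8 Many criteria for choosing the best fit $(\hat\beta_0, \hat\beta_1)$. For example, let $Y = (1, 1, 0, 0, \ldots, 1, 0)$ and let $\hat Y$ be the vector of predicted values. Minimize
$$D_1(Y, \hat Y) = \|Y - \hat Y\|_1 = \sum_i |Y_i - \hat Y_i|$$
or
$$D_2(Y, \hat Y) = \|Y - \hat Y\|^2 = \sum_i (Y_i - \hat Y_i)^2$$
or
$$D_3(Y, \hat Y) = \operatorname{Med}_i |Y_i - \hat Y_i|$$
or
$$D_4(Y, \hat Y) = \operatorname{Max}_i |Y_i - \hat Y_i|$$
or
$$D_5 = \text{total misclassification rate}$$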
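
The criteria above can be evaluated for any candidate pair. This is a rough sketch, not the author's computation: it uses only the six rows visible in the SAS DATALINES of slide 11 (the full data set is not shown), the three candidate pairs from slide 7, and it assumes a 0.5 classification cutoff for $D_5$, which the slides do not specify.

```python
import math
from statistics import median

def logistic_prob(x, b0, b1):
    """Predicted P[Y = 1 | X = x] under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_criteria(y, p):
    """Return (D1, D2, D3, D4, D5) for observed 0/1 responses y and predictions p."""
    abs_err = [abs(yi - pi) for yi, pi in zip(y, p)]
    d1 = sum(abs_err)                      # sum of absolute errors
    d2 = sum(e * e for e in abs_err)       # sum of squared errors
    d3 = median(abs_err)                   # median absolute error
    d4 = max(abs_err)                      # maximum absolute error
    # Assumed rule: classify as 1 when the predicted probability is >= 0.5
    wrong = sum(1 for yi, pi in zip(y, p) if (pi >= 0.5) != (yi == 1))
    d5 = wrong / len(y)                    # misclassification rate
    return d1, d2, d3, d4, d5

# Only the rows shown on slide 11; Yes = 1, No = 0
x = [1.45, 1.33, 1.26, 1.10, 1.45, 0.89]
y = [1, 1, 0, 0, 1, 0]

for b0, b1 in [(-2, 2), (-7, 5), (-19, 15)]:
    p = [logistic_prob(xi, b0, b1) for xi in x]
    print((b0, b1), [round(d, 3) for d in fit_criteria(y, p)])
```

Different criteria can rank the candidates differently, which is the motivation for settling on a single principle such as likelihood.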

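9 The likelihood principle
Choose $(\hat\beta_0, \hat\beta_1)$ to maximize
$$L(\beta_0, \beta_1) = \prod_i P[Y = y_i \mid X = x_i]$$
In the log Hcy example, maximize
$$P[Y = 1 \mid X = 1.45]\; P[Y = 1 \mid X = 1.33] \cdots P[Y = 0 \mid X = 0.89]$$
$$= \left[\frac{e^{\beta_0 + 1.45\beta_1}}{1 + e^{\beta_0 + 1.45\beta_1}}\right] \left[\frac{e^{\beta_0 + 1.33\beta_1}}{1 + e^{\beta_0 + 1.33\beta_1}}\right] \cdots \left[1 - \frac{e^{\beta_0 + 0.89\beta_1}}{1 + e^{\beta_0 + 0.89\beta_1}}\right]$$
The maximum likelihood estimates are $(\hat\beta_0, \hat\beta_1) = (-3.37, 2.91)$.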
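
To make the maximization concrete, here is a minimal sketch of Newton-Raphson for the two-parameter logistic log-likelihood, with step halving added as a safeguard. The ten data points are hypothetical (they are not the stroke data, which the slides do not show in full), and the function names are illustrative. SAS PROC LOGISTIC uses a related iterative scheme, but this is not a reproduction of it.

```python
import math

def log_lik(b0, b1, x, y):
    """Logistic log-likelihood: sum of y*eta - log(1 + e^eta)."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        total += yi * eta - math.log1p(math.exp(eta))
    return total

def fit_logistic(x, y, iters=50, tol=1e-10):
    """Maximize the log-likelihood by Newton-Raphson, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for beta0
            g1 += (yi - p) * xi     # score for beta1
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det   # Newton direction: I^{-1} * score
        s1 = (h00 * g1 - h01 * g0) / det
        step, base = 1.0, log_lik(b0, b1, x, y)
        # Halve the step until the log-likelihood does not decrease
        while log_lik(b0 + step * s0, b1 + step * s1, x, y) < base and step > 1e-8:
            step *= 0.5
        b0, b1 = b0 + step * s0, b1 + step * s1
        if abs(step * s0) + abs(step * s1) < tol:
            break
    return b0, b1

# Hypothetical log Hcy values and stroke indicators, chosen so the groups overlap
x = [0.89, 1.10, 1.26, 1.33, 1.45, 1.05, 1.20, 1.38, 1.15, 1.30]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
b0_hat, b1_hat = fit_logistic(x, y)
print(b0_hat, b1_hat, log_lik(b0_hat, b1_hat, x, y))
```

The groups must overlap for the maximum to exist. The six rows visible on slide 11 are perfectly separated (every Yes has a larger loghcy than every No), so on those six rows alone the likelihood has no finite maximizer.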

10 The LOGISTIC Procedure
Maximum Likelihood Estimates
  Parameter   DF   Estimate   Std Error   Wald Chi-Square   Pr > ChiSq
  Intercept
  loghcy
Odds Ratio Estimates
  Effect   Point Estimate   95% Wald Confidence Limits
  loghcy

11 SAS code
    DATA strokedat;
      INPUT stroke $ loghcy;
      DATALINES;
    Yes 1.45
    Yes 1.33
    No 1.26
    No 1.10
    :  :
    Yes 1.45
    No 0.89
    ;
    PROC LOGISTIC DATA=strokedat;
      MODEL stroke (EVENT='Yes') = loghcy;
    RUN;

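12 Logistic regression: odds ratio
Define the odds of an event $E$ as
$$\text{Odds}(E) = \frac{P(E)}{1 - P(E)}$$
Then
$$P(E) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad 1 - P(E) = \frac{1}{1 + e^{\beta_0 + \beta_1 X}}, \qquad \text{Odds}(E) = e^{\beta_0 + \beta_1 X}$$
and
$$\frac{\text{Odds}(Y = 1 \mid x + \delta)}{\text{Odds}(Y = 1 \mid x)} = \frac{e^{\beta_0 + \beta_1 (x + \delta)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1 \delta}$$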

13 The effect of having log Hcy one standard deviation (0.20) higher is
$$\frac{\text{Odds}(Y = 1 \mid x + .20)}{\text{Odds}(Y = 1 \mid x)} = e^{\hat\beta_1 (.20)} = e^{(2.91)(.20)} = 1.79$$
The odds of having a stroke increase by 79% for every 1 SD increase in log Hcy.
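
The arithmetic can be checked directly. A small sketch, using the estimate 2.91 from slide 9 and the standard deviation 0.20 from this slide; the function name is illustrative.

```python
import math

def odds_ratio(beta1, delta):
    """Multiplicative change in odds when X increases by delta: e^(beta1 * delta)."""
    return math.exp(beta1 * delta)

or_per_sd = odds_ratio(2.91, 0.20)
print(round(or_per_sd, 2))                 # 1.79
print(round(100 * (or_per_sd - 1)))        # 79 (percent increase in odds)
```

The same function gives the odds ratio for any other increment, for example `odds_ratio(2.91, 0.10)` for half a standard deviation.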

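14 What about categorical predictors?
For example, suppose log Hcy is categorized into three levels: Low (< 1.09), Normal (1.09 to 1.23), or High (> 1.23). The levels are coded with two indicator variables: X1 = 1 if High (0 otherwise) and X2 = 1 if Normal (0 otherwise), so Low is the reference level with X1 = X2 = 0.
[Data table. Columns: Subj, Group, Y, LogHcy, Level, X1, X2. Rows run from subject 1 to at least subject 1919; most entries not shown.]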

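15
$$P[Y = 1 \mid \text{Level}] = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}$$
$$\text{Odds}[Y = 1 \mid \text{Level}] = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2} = \begin{cases} e^{\beta_0 + \beta_1}, & \text{if High} \\ e^{\beta_0 + \beta_2}, & \text{if Normal} \\ e^{\beta_0}, & \text{if Low} \end{cases}$$
$$\frac{\text{Odds}[Y = 1 \mid \text{Level}]}{\text{Odds}[Y = 1 \mid \text{Low}]} = \begin{cases} e^{\beta_1}, & \text{if High} \\ e^{\beta_2}, & \text{if Normal} \\ 1, & \text{if Low} \end{cases}$$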

16 Logistic regression

18 Likelihood Theory
Given a sample $y_1, \ldots, y_n$, the likelihood function is
$$L(\theta; y_1, \ldots, y_n) = \prod_{i=1}^{n} f(y_i; \theta)$$
(Note: $\theta$ may be a vector.) The value $\hat\theta$ which maximizes this is called the maximum likelihood estimate (MLE).
Comment: It is often easier to maximize the log-likelihood
$$\log L(\theta) = \sum_{i=1}^{n} \log f(y_i; \theta)$$

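19 The shape of $f(\cdot)$ determines the properties of $\hat\theta$.
Normal: If $f(y_i; \theta) = \left(\frac{1}{2\pi}\right)^{1/2} e^{-(y_i - \theta)^2/2}$, then
$$L(\theta) = \left(\frac{1}{2\pi}\right)^{n/2} e^{-\sum (y_i - \theta)^2/2}$$
and $\hat\theta = \dfrac{y_1 + \cdots + y_n}{n}$.
Laplace: If $f(y_i; \theta) = \left(\frac{1}{2}\right) e^{-|y_i - \theta|}$, then
$$L(\theta) = \left(\frac{1}{2}\right)^{n} e^{-\sum |y_i - \theta|}$$
and $\hat\theta = \operatorname{med}\{y_1, \ldots, y_n\}$.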

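20 Theorem L1: Let $\theta_0$ denote the true value of $\theta$. Under regularity conditions,
$$\hat\theta \;\dot\sim\; N\!\left(\theta_0, \frac{1}{I(\theta_0)}\right)$$
where
$$I(\theta) = -E\left[\frac{\partial^2 \log L(\theta)}{\partial \theta^2}\right] = -n\,E\left[\frac{\partial^2 \log f(Y; \theta)}{\partial \theta^2}\right]$$
Example: Poisson($\theta$)
$$L(\theta) = \prod_i \frac{e^{-\theta} \theta^{y_i}}{y_i!} = \frac{e^{-n\theta}\, \theta^{\sum y_i}}{\prod y_i!}$$
$$\log L(\theta) = -n\theta + \sum y_i \log \theta - \log \prod y_i!$$
$$\frac{\partial \log L(\theta)}{\partial \theta} = -n + \frac{\sum y_i}{\theta} \equiv 0$$
The MLE is $\hat\theta = \dfrac{\sum y_i}{n}$.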

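21 Theorem L1 says $\operatorname{Var}(\hat\theta) \doteq \dfrac{1}{I(\theta)}$.
$$\log L(\theta) = -n\theta + \sum y_i \log \theta - \log \prod y_i!$$
$$\frac{\partial \log L(\theta)}{\partial \theta} = -n + \frac{\sum y_i}{\theta}$$
$$\frac{\partial^2 \log L(\theta)}{\partial \theta^2} = -\frac{\sum y_i}{\theta^2}$$
$$I(\theta) = -E\left[\frac{\partial^2 \log L(\theta)}{\partial \theta^2}\right] = E\left[\frac{\sum y_i}{\theta^2}\right] = \frac{n}{\theta}$$
so $\operatorname{Var}(\hat\theta) \doteq \dfrac{\theta_0}{n}$. The standard error is $SE \doteq \sqrt{\bar y / n}$.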
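
The Poisson result is easy to verify numerically. This is a minimal sketch with hypothetical counts (not from the slides) and illustrative function names: it computes the closed-form MLE and standard error, and checks that the log-likelihood is lower on either side of the MLE.

```python
import math

def poisson_log_lik(theta, y):
    """log L(theta) = -n*theta + sum(y_i) log(theta) - sum(log(y_i!))."""
    return sum(-theta + yi * math.log(theta) - math.lgamma(yi + 1) for yi in y)

def poisson_mle(y):
    """Closed-form MLE (the sample mean) and its standard error sqrt(ybar / n)."""
    n = len(y)
    theta_hat = sum(y) / n
    return theta_hat, math.sqrt(theta_hat / n)

counts = [2, 0, 3, 1, 4, 2, 1, 3]   # hypothetical counts
theta_hat, se = poisson_mle(counts)
print(theta_hat, se)                # 2.0 0.5

# The log-likelihood is lower at nearby values of theta
for theta in (1.9, 2.0, 2.1):
    print(theta, round(poisson_log_lik(theta, counts), 4))
```

Here `math.lgamma(y + 1)` equals $\log y!$, which keeps the factorial term stable for large counts.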

23 Theorem L2 (Rao-Cramér lower bound, RCLB): Let $y_1, \ldots, y_n$ be a random sample from $f(y; \theta_0)$. Let $U(y_1, \ldots, y_n)$ be a function such that $E(U) = \theta_0$. Then
$$\operatorname{Var}(U) \ge \frac{1}{I(\theta_0)}$$
Implications:
  1. The MLE is efficient, in the sense of smallest possible variance (asymptotically, since Theorem L1 gives it variance approximately $1/I(\theta_0)$).
  2. Provides a framework for comparing estimates.
Q: When is the sample median better than the sample mean?
A: When the distribution is closer to the Laplace than the Normal.

24 Definition: Let the efficiency of an estimator $U(y_1, \ldots, y_n)$ be the ratio of the RCLB to its variance, i.e.
$$\text{eff} = \frac{1/I(\theta_0)}{\operatorname{Var}(U)}$$
Table: Efficiency of estimators
                          Normal    Laplace
  Sample mean
  Sample median
  Med{(y_i + y_j)/2}

26 Confirm by simulation
    > nsamp <- rnorm(n=21, mean=0, sd=3)
    > nsamp
    > mean(nsamp)
    > median(nsamp)
    > # allocate storage first; assigning into an undefined vector is an error
    > stomean <- numeric(10000); stomed <- numeric(10000)
    > for (i in 1:10000) {
    +   nsamp <- rnorm(n=21, mean=0, sd=3)
    +   stomean[i] <- mean(nsamp)
    +   stomed[i] <- median(nsamp)
    + }
    > var(stomean)/var(stomed)
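
The same experiment can be extended to the Laplace case to illustrate the answer on slide 23. This is a sketch in Python rather than R, under stated assumptions: the seed, the helper name `var_ratio`, and the use of a difference of two unit exponentials to draw Laplace(0, 1) variates are choices made here, not taken from the slides.

```python
import random
from statistics import median, variance

def var_ratio(draw, n=21, reps=10000, seed=1):
    """Estimate Var(sample mean) / Var(sample median) by simulation.

    draw(rng) returns one observation from the distribution of interest.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(median(sample))
    return variance(means) / variance(medians)

# Normal(0, sd = 3), as in the R session above
normal_ratio = var_ratio(lambda rng: rng.gauss(0.0, 3.0))

# Laplace(0, 1): the difference of two independent unit exponentials
laplace_ratio = var_ratio(lambda rng: rng.expovariate(1.0) - rng.expovariate(1.0))

print("normal :", round(normal_ratio, 3))
print("laplace:", round(laplace_ratio, 3))
```

A ratio below 1 means the mean has the smaller variance, and a ratio above 1 means the median does. With normal data the ratio should fall below 1, and with Laplace data above 1.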

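28 Multi-parameter likelihood estimation
Let $(\hat\theta_1, \hat\theta_2)$ maximize
$$\log L(\theta_1, \theta_2) = \sum_{i=1}^{n} \log f(y_i; x_i, \theta_1, \theta_2)$$
The $2 \times 2$ information matrix $I(\theta_1, \theta_2)$ has $(j, k)$th element
$$-E\left[\frac{\partial^2}{\partial \theta_j\, \partial \theta_k} \sum_{i=1}^{n} \log f(Y_i; x_i, \theta_1, \theta_2)\right]$$
Theorem L3: The (approximate) variances of $\hat\theta_1$ and $\hat\theta_2$ are the diagonal elements of $I^{-1}(\theta_1, \theta_2)$.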
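
For the logistic model the information matrix has a simple closed form: its $(j, k)$th element is $\sum_i p_i (1 - p_i)\, x_{ij} x_{ik}$ with $x_{i0} = 1$. The sketch below builds that $2 \times 2$ matrix and inverts it to obtain standard errors. The x values are hypothetical and the function names are illustrative; the coefficients are the estimates quoted on slide 9, so the printed standard errors are not those of the stroke study.

```python
import math

def logistic_information(x, b0, b1):
    """Entries (I00, I01, I11) of the 2x2 information matrix at (b0, b1)."""
    i00 = i01 = i11 = 0.0
    for xi in x:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return i00, i01, i11

def standard_errors(i00, i01, i11):
    """Square roots of the diagonal of the inverse of [[i00, i01], [i01, i11]]."""
    det = i00 * i11 - i01 * i01
    return math.sqrt(i11 / det), math.sqrt(i00 / det)

# Hypothetical log Hcy values
x = [0.89, 1.10, 1.26, 1.33, 1.45, 1.05, 1.20, 1.38, 1.15, 1.30]
se_b0, se_b1 = standard_errors(*logistic_information(x, -3.37, 2.91))
print(se_b0, se_b1)
```

The standard errors that SAS reports in the "Std Error" column come from the same kind of inverse-information calculation, evaluated at the fitted coefficients.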

29 The LOGISTIC Procedure
Maximum Likelihood Estimates
  Parameter   DF   Estimate   Std Error   Wald Chi-Square   Pr > ChiSq
  Intercept
  loghcy
Odds Ratio Estimates
  Effect   Point Estimate   95% Wald Confidence Limits
  loghcy

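30 Confidence interval for odds ratio
Recall:
$$\frac{\text{Odds}[Y = 1 \mid \text{Level}]}{\text{Odds}[Y = 1 \mid \text{Low}]} = \begin{cases} e^{\beta_1}, & \text{if High} \\ e^{\beta_2}, & \text{if Normal} \\ 1, & \text{if Low} \end{cases}$$
Confidence interval for $\beta_1$: $\hat\beta_1 \pm 2(1.0699)$, with lower limit 0.7739.
Confidence interval for $e^{\beta_1}$: exponentiate both limits of the interval for $\beta_1$; the reported lower limit is 2.263.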
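
The interval arithmetic can be sketched as follows. The standard error 1.0699 is from this slide and the estimate 2.91 is the rounded value from slide 9, so the limits differ slightly from the slide's. Two observations here are inferences, not statements from the source: a lower limit of 0.7739 with multiplier 2 implies an unrounded estimate near 2.9137, and the odds-ratio lower limit 2.263 appears to correspond to the multiplier 1.96 (as SAS uses for 95% Wald limits) rather than 2.

```python
import math

def wald_ci(estimate, se, z=2.0):
    """Wald interval estimate +/- z*se, plus the same interval on the exp scale."""
    lo, hi = estimate - z * se, estimate + z * se
    return (lo, hi), (math.exp(lo), math.exp(hi))

# Rounded estimate from slide 9, multiplier 2 as on this slide
beta_ci, or_ci = wald_ci(2.91, 1.0699, z=2.0)
print(beta_ci, or_ci)

# Assumed unrounded estimate 2.9137, multiplier 1.96
beta_ci_196, or_ci_196 = wald_ci(2.9137, 1.0699, z=1.96)
print(beta_ci_196, or_ci_196)
```

Because the exponential function is increasing, exponentiating the endpoints of an interval for $\beta_1$ gives an interval for $e^{\beta_1}$ with the same coverage.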

33 Beyond likelihood
Example: Survival times for heart transplant patients
[Data table. Columns: Subj, Survive (T), Trans (X1), Prior (X2), Age (X3); entries not shown.]

34 Suppose that survival time $T$ has probability density function $f(t)$. The cumulative distribution function is
$$F(t) = P[T \le t] = \int_0^t f(u)\, du$$
Define the survivor function as
$$S(t) = P[T > t] = 1 - F(t)$$
and the hazard function as
$$h(t) = \frac{f(t)}{S(t)}$$
It can be shown that
$$f(t) = h(t)\, e^{-\int_0^t h(u)\, du}$$
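
The last identity follows in a few lines from the definitions above. A short derivation, assuming $S$ is differentiable and $S(0) = 1$ (that is, $T > 0$ with probability one):

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = \frac{-S'(t)}{S(t)} = -\frac{d}{dt}\log S(t)
  && \text{since } S(t) = 1 - F(t) \text{ gives } S'(t) = -f(t), \\
\int_0^t h(u)\,du &= -\log S(t) + \log S(0) = -\log S(t)
  && \text{since } S(0) = 1, \\
S(t) &= e^{-\int_0^t h(u)\,du}, \\
f(t) &= h(t)\,S(t) = h(t)\,e^{-\int_0^t h(u)\,du}.
\end{align*}
```

For example, a constant hazard $h(t) = \lambda$ gives $S(t) = e^{-\lambda t}$ and $f(t) = \lambda e^{-\lambda t}$, the exponential density.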

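35 Some hazard models:
$$h(t) = \begin{cases} \lambda, & \text{Exponential} \\ \lambda \gamma^{t}, & \text{Gompertz} \\ \lambda t^{\alpha}, & \text{Weibull} \end{cases}$$
Incorporating covariates like transplant, prior surgery, and age?
$$h(t) = \begin{cases} \lambda\, e^{\beta_1 x_1 + \cdots + \beta_k x_k}, & \text{Exponential} \\ \lambda \gamma^{t}\, e^{\beta_1 x_1 + \cdots + \beta_k x_k}, & \text{Gompertz} \\ \lambda t^{\alpha}\, e^{\beta_1 x_1 + \cdots + \beta_k x_k}, & \text{Weibull} \end{cases}$$
$$= \lambda_0(t)\, e^{\beta' x} \qquad \text{(proportional hazards)}$$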

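36 Estimate parameters by maximum likelihood
$$L = \prod_{i=1}^{n} f_i(t_i)$$
where
$$f_i(t) = h_i(t)\, e^{-\int_0^t h_i(u)\, du}$$
and
$$h_i(t) = \lambda_0(t)\, e^{\beta' x_i}$$
Problem: Specify $\lambda_0(t)$?
$$\lambda_0(t) = \begin{cases} \lambda, & \text{Exponential} \\ \lambda \gamma^{t}, & \text{Gompertz} \\ \lambda t^{\alpha}, & \text{Weibull} \end{cases}$$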

37 Partial Likelihood
Cox (1972) proposed an estimation method for the $\beta$s without needing to specify $\lambda_0(t)$. Maximize the partial likelihood
$$PL = \prod_{i=1}^{n} L_i$$
where $L_i$ is the conditional probability of failure at time $t_i$ given the number of cases at risk at time $t_i$.

38 Example: Survival times for heart transplant patients
[Data table. Columns: Subj, Survive (T), Trans (X1), Prior (X2), Age (X3); entries not shown.]

39 A death occurred 5 days after enrollment. What is the probability that it happened to patient 1 instead of to one of the other at-risk patients?
$$L_1 = \frac{h_1(5)}{h_1(5) + \cdots + h_{75}(5)}$$
$$L_2 = \frac{h_2(15)}{h_2(15) + \cdots + h_{75}(15)}$$
$$\vdots$$
$$L_{74} = \frac{h_{74}(541)}{h_{74}(541) + h_{75}(541)}$$
$$L_{75} = 1$$

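40
$$L_1 = \frac{h_1(5)}{h_1(5) + \cdots + h_{75}(5)} = \frac{\lambda_0(5)\, e^{\beta' x_1}}{\lambda_0(5)\, e^{\beta' x_1} + \cdots + \lambda_0(5)\, e^{\beta' x_{75}}} = \frac{e^{\beta' x_1}}{e^{\beta' x_1} + \cdots + e^{\beta' x_{75}}}$$
$$L_2 = \frac{e^{\beta' x_2}}{e^{\beta' x_2} + \cdots + e^{\beta' x_{75}}}$$
The combination of the PH assumption and the partial likelihood
$$PL = \prod_{i=1}^{n} L_i$$
allows estimation of $\beta$ without specifying the baseline hazard.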
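
A minimal sketch of partial-likelihood estimation for a single covariate, assuming no ties and no censoring (as in the slides' ordering of the 75 patients). The ages below are hypothetical, listed in order of failure time, and the function names are illustrative. Because the score is decreasing in $\beta$, a simple bisection is enough here; PROC PHREG uses a Newton-type method instead.

```python
import math

def log_partial_lik(beta, x):
    """log PL = sum_i [ beta*x_i - log(sum over risk set of e^(beta*x_j)) ].

    x must be ordered by failure time; subject i's risk set is x[i:].
    """
    total = 0.0
    for i in range(len(x)):
        total += beta * x[i] - math.log(sum(math.exp(beta * xj) for xj in x[i:]))
    return total

def partial_score(beta, x):
    """Derivative of log PL with respect to beta."""
    score = 0.0
    for i in range(len(x)):
        risk = x[i:]
        w = [math.exp(beta * xj) for xj in risk]
        score += x[i] - sum(wj * xj for wj, xj in zip(w, risk)) / sum(w)
    return score

def fit_cox_1d(x, lo=-1.0, hi=1.0, iters=60):
    """Solve score(beta) = 0 by bisection; assumes the root lies in [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if partial_score(mid, x) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical ages of six patients, ordered from earliest to latest death
ages = [54, 40, 51, 42, 49, 36]
beta_hat = fit_cox_1d(ages)
print(beta_hat, math.exp(beta_hat))   # coefficient and hazard ratio per year
```

Neither the baseline hazard nor the actual failure times appear in the code: only the order of the failures matters, which is the point of the cancellation shown on this slide.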

41 SAS output
The PHREG Procedure
Dependent variable: Survive
Maximum Likelihood Estimates
  Var     DF   Estimate   Standard Error   Wald Chi-sq   Pr > Chi-sq   Risk Ratio
  Trans
  Prior
  Age

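42 The hazard ratio for age is
$$\frac{h(t;\, \text{Age} = x_3 + 1)}{h(t;\, \text{Age} = x_3)} = \frac{\lambda_0(t)\, e^{-1.708 x_1 + \hat\beta_2 x_2 + .0586 (x_3 + 1)}}{\lambda_0(t)\, e^{-1.708 x_1 + \hat\beta_2 x_2 + .0586 x_3}} = e^{.0586} \approx 1.06$$
where $\hat\beta_2$ denotes the estimated coefficient of Prior, so every additional year of age increases the hazard of failure by 6%. The 95% confidence interval for the hazard ratio is
$$\left[e^{.0586 - 2(.0150)},\; e^{.0586 + 2(.0150)}\right]$$
with lower limit 1.0289.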
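
These figures can be reproduced from the quoted estimates. A short sketch, assuming a multiplier of 2 standard errors (as on slide 30) and a Trans coefficient of -1.708; the negative sign is an inference from the stated 82% reduction in hazard. Inputs are rounded, so the last digit can differ from the SAS output.

```python
import math

b_age, se_age = 0.0586, 0.0150    # Age estimate and standard error
b_trans = -1.708                  # Trans estimate (sign assumed, see above)

hr_age = math.exp(b_age)
ci_age = (math.exp(b_age - 2 * se_age), math.exp(b_age + 2 * se_age))
hr_trans = math.exp(b_trans)

print(round(hr_age, 4), [round(v, 4) for v in ci_age])
print(round(hr_trans, 4), round(100 * (1 - hr_trans)))   # percent reduction in hazard
```

A hazard ratio above 1 is read as a percent increase, $100(e^{\hat\beta} - 1)$, and one below 1 as a percent reduction, $100(1 - e^{\hat\beta})$.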

43 Example: Survival times for heart transplant patients
  Var     DF   Estimate   Standard Error   Risk Ratio
  Trans
  Prior
  Age
Age increases the hazard of death by 6% per year, and getting a transplant reduces the hazard of death by 82%. While the age effect may be real, the magnitude of the transplant effect is likely spurious.
Transplant: late death. No transplant: early death.

45 What does the data say?
Challenge:
  1. Data integrity
  2. Appropriate methodology
  3. Building a correct narrative: significant relationships, effect size
