Generalized linear models IV Examples

1 Generalized linear models IV Examples
Peter McCullagh, Department of Statistics, University of Chicago
Polokwane, South Africa, November 2013

2 Outline
Decay rates of vitamin C
Ship damage data
Fisher's tuberculin data
Birth date and death date
Drosophila diet and assortative mating

3 Example: Decay rates of vitamin C
Ascorbic acid concentrations of snap-beans after cold storage
[Table: concentration by weeks of storage (columns) and storage temperature in °F (rows), with totals]
Response Y = concentration
Gaussian model versus gamma model
Half life estimation
Model checking
Compatibility with pure error estimate

4 Gaussian non-linear model
Response: Y = concentration (not log(concentration))
Exponential decay model: $E(Y(t)) = \mu(t) = \exp(\beta_0 - \beta_T t)$
$\log \mu(t) = \beta_0 - \beta_T t$ (log link)
Model formula: as.factor(temp):time
$\beta_0 = \log \mu(0)$ (same for every temp)
Half life: $\log(2)/\beta_T$ at storage temp T
glm(y~temp:time, family=gaussian(link=log))
Coefficients:
              Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)   …                              …e-16 ***
temp0:time    …
temp10:time   …                              … ***
temp20:time   …                              …e-08 ***
(Dispersion parameter for gaussian family taken to be …)
Null deviance: … on 11 degrees of freedom
Residual deviance: … on 8 degrees of freedom

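5 Half-life, ...
Decay rate $\beta \ge 0$; half life: $\log(2)/\beta$
CI for $\beta$: $(\hat\beta - t\,\mathrm{se},\ \hat\beta + t\,\mathrm{se})_+$
Confidence interval for half-life (90%): $\big(\log(2)/(\hat\beta + t\,\mathrm{se}),\ \log(2)/(\hat\beta - t\,\mathrm{se})_+\big)$
Here $(x)_+ = \max(x, 0)$, so the upper limit is infinite when $\hat\beta - t\,\mathrm{se} \le 0$.
[Table columns: Temp °F, $\hat\beta$, SE, $\log(2)/\hat\beta$, half-life CI (weeks)]
Gaussian model, half-life CI (weeks): (89.4, ∞), (22.3, 43.0), (4.8, 5.7)
Gamma model, half-life CI (weeks): (82.2, ∞), (21.9, 42.3), (4.9, 5.6)
Half-life intervals not symmetric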
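
The half-life interval above can be sketched in R as follows. The snap-bean measurements are not reproduced in this transcript, so the data below are simulated: the storage times (2, 4, 6, 8 weeks), the decay rates in `beta_true`, and the noise level are all hypothetical. Only the model formula and the interval construction follow the slide.

```r
set.seed(3)

# Hypothetical design: 3 storage temperatures x 4 storage times = 12 responses
dat <- expand.grid(time = c(2, 4, 6, 8), temp = factor(c(0, 10, 20)))

# Hypothetical decay rates per week, used only to simulate data
beta_true <- c("0" = 0.005, "10" = 0.025, "20" = 0.13)
mu <- exp(log(45) - beta_true[as.character(dat$temp)] * dat$time)
dat$y <- unname(mu) + rnorm(nrow(dat), sd = 1)

# Gaussian model with log link: log mu = beta0 - beta_T * t
fit <- glm(y ~ temp:time, family = gaussian(link = "log"), data = dat)

# The fitted slopes are -beta_T, so change sign to get decay rates
est      <- summary(fit)$coefficients[-1, , drop = FALSE]
beta_hat <- -est[, "Estimate"]
se       <- est[, "Std. Error"]
tq       <- qt(0.95, df.residual(fit))        # 90% two-sided interval

# Positive part of the lower limit for beta gives an infinite upper half-life
lower_beta <- pmax(beta_hat - tq * se, 0)
half_life  <- log(2) / beta_hat
ci <- cbind(lower = log(2) / (beta_hat + tq * se),
            upper = log(2) / lower_beta)
print(cbind(half_life, ci))
```

Because the interval is obtained by inverting the interval for $\beta$, it is not symmetric about the point estimate, which is the remark made on the slide.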

6 Model checking using replicate info
External check: each response is a sum of measurements for 3 packets: $\mathrm{var}(Y_i) = 3\sigma^2$, where $\sigma^2$ = packet variance
Individual measurements not available, but replicate mean squared error = … on 24 df
Model mean squared error … on 8 df
F-ratio: F = 1.125/… = 0.53, at the 18% point of the $F_{8,24}$ distribution
External variance estimate helps to avoid over-fitting
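
As a quick check of the "18% point" statement, the lower-tail probability of the reported F-ratio can be computed directly. This is a minimal sketch that takes the rounded value F = 0.53 from the slide as given.

```r
# Lower-tail probability of the reported F-ratio under F(8, 24)
f_ratio <- 0.53
p_lower <- pf(f_ratio, df1 = 8, df2 = 24)
print(round(p_lower, 3))
```

A ratio this far below 1 is not unusual, so the model mean square is compatible with the replicate (pure error) estimate.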

7 Ship damage data
Background: structured data from Lloyd's of London in 1980
Cargo-carrying ships of five types A–E
Construction year: 4-level factor with levels 60–64; 65–69; 70–74; 75–79
Operation period: 2-level factor with levels … and 74–79 (pre and post OPEC)
Exposure unit t: ship-months at risk (…)
Response: Number of damage incidents caused by waves to the forward section...
Same ship may experience more than one incident
Same ship may operate in both periods

8 Number of reported damage incidents and aggregate months service by ship type, year of construction and period of operation
[Table columns: Ship type, Year of construction, Period of operation, Aggregate months service, Number of damage incidents; rows for ship types A–E, with one row in each of types A–D marked *]

9 Statistical considerations for ship damage data
Response is an event count in (0, t), a non-negative integer, suggesting a Poisson process or a renewal process
Possibility of moderate over-dispersion, $\mathrm{var}(Y) > E(Y)$, for various reasons...
Focus on event rates, suggesting $E(Y \mid t) \propto t$; t = 0 implies µ = 0 and Y = 0
Three factors stype, cons, oper affecting accident rate; multiplicative effects more plausible than additive
Leading to initial log-linear model $Y_i \sim \mathrm{Po}(\mu_i)$ (independent components)
Main effects additive model: $\log \mu_i = \beta_0 + \log(t_i) + \alpha_{stype} + \beta_{cons} + \gamma_{oper}$
Role of log(t): offset (coefficient = 1), not a covariate; does not appear in the model formula
glm(y~stype+cons+oper, family=poisson(), offset=log(t))
Six cells with t = 0 are uninformative and may be omitted
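
The main-effects fit can be tried with the `ships` data frame in the MASS package, which is assumed here to be the same Lloyd's data. Its variable names (`type`, `year`, `period`, `service`, `incidents`) differ from the slide's `stype`, `cons`, `oper`, `t`, `y`, so the mapping below is an assumption of this sketch.

```r
library(MASS)   # provides the `ships` data frame

# Drop the cells with zero months of service: log(0) cannot be used as an offset
ships_obs <- subset(ships, service > 0)

# Poisson log-linear model; log(service) enters with coefficient fixed at 1
fit <- glm(incidents ~ type + factor(year) + factor(period),
           family = poisson(), offset = log(service), data = ships_obs)

print(c(deviance = deviance(fit), df = df.residual(fit)))
```

Supplying `log(service)` through `offset=` rather than in the formula is what fixes its coefficient at 1, so the remaining coefficients describe log incident rates per ship-month.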

17 Some conclusions for ship damage data
Main-effects model: Deviance = 38.7 on 25 df; $X^2$ = 42.3 on 25 df
Suggests moderate over-dispersion (or interaction): $X^2/\mathrm{df} = 1.69$
Stationarity: coefficient of log(t): $\hat\beta = 0.9 \pm 0.1$, consistent with β = 1
Interactions: by adding 2-factor terms (one at a time); not much evidence of interaction
X2 <- sum((y - fitted(fit))^2/fitted(fit))
summary(fit, dispersion=X2/25)
to get correct standard errors with dispersion factor
Conclusions regarding risk factors
ship type relative rates: 1.00, 0.58, 0.51, 0.93, 1.38
operation: pre 74: 1.00, post 74: 1.47
construction period:
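
The dispersion adjustment and the relative rates can be sketched as below, again assuming that `MASS::ships` holds the same data and that its variable names map onto the slide's as indicated. The stationarity check refits the model with `log(service)` as an ordinary covariate, so that its coefficient is estimated rather than fixed at 1.

```r
library(MASS)   # provides the `ships` data frame
ships_obs <- subset(ships, service > 0)

fit <- glm(incidents ~ type + factor(year) + factor(period),
           family = poisson(), offset = log(service), data = ships_obs)

# Pearson statistic and the dispersion factor X2 / df
X2  <- sum(residuals(fit, type = "pearson")^2)
phi <- X2 / df.residual(fit)

# Standard errors inflated by sqrt(phi)
print(summary(fit, dispersion = phi)$coefficients)

# Relative rates: exponentiated coefficients (reference levels have rate 1.00)
rel_rate <- exp(coef(fit))
print(round(rel_rate[-1], 2))

# Stationarity check: estimate the coefficient of log(t) instead of fixing it
fit_t <- glm(incidents ~ type + factor(year) + factor(period) + log(service),
             family = poisson(), data = ships_obs)
print(coef(fit_t)["log(service)"])
```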

23 Tuberculin study design: a 4 × 4 Latin square
Design for tuberculin assay
Site on neck    Cow class
                I    II   III  IV
1               A    B    C    D
2               B    A    D    C
3               C    D    A    B
4               D    C    B    A
Responses in mm, tabulated by cow class I, II, III, IV
Treatments:
A: Standard double: Wey=0; log vol = 1
B: Standard single: Wey=0; log vol = 0
C: Weybridge single: Wey=1; log vol = 0
D: Weybridge half: Wey=1; log vol = -1
y <- c(454, 249, 349, 249, 408, 322,...,290)
site <- gl(4, 4, 16); class <- gl(4, 1, 16)
wey <- c(0,0,1,1, 0,0,1,1, 1,1,0,0, 1,1,0,0)
vol <- c(1,0,0,-1, 0,1,-1,0, 0,-1,1,0, -1,0,0,1)
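
The `wey` and `vol` covariates can be derived from the Latin square rather than typed by hand. The sketch below rebuilds them from the treatment letters and the coding given on the slide; the helper names `square`, `trt`, `wey_code` and `vol_code` are introduced here for illustration only.

```r
# Latin square: rows are sites on the neck, columns are cow classes
square <- matrix(c("A", "B", "C", "D",
                   "B", "A", "D", "C",
                   "C", "D", "A", "B",
                   "D", "C", "B", "A"),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(site = 1:4, class = c("I", "II", "III", "IV")))

# Responses are listed site by site, so read the square row by row
trt   <- as.vector(t(square))
site  <- gl(4, 4, 16)
class <- gl(4, 1, 16)

# Treatment coding from the slide: preparation indicator and log2 volume
wey_code <- c(A = 0, B = 0, C = 1, D = 1)
vol_code <- c(A = 1, B = 0, C = 0, D = -1)
wey <- unname(wey_code[trt])
vol <- unname(vol_code[trt])

# Each treatment appears once per site and once per cow class
print(table(site, trt))
print(table(class, trt))
```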

24 Log-linear model for tuberculin data
Note: Response is a measured variable, not a count
fit <- glm(y~site+class+wey+vol, family=poisson(link=log))
X2 <- sum((y - fitted(fit))^2/fitted(fit))
summary(fit, dispersion=X2/7)
wey …
vol …
Fitted response (log scale): $\dots + \beta_w I(W) + \beta \log(\mathrm{vol})$
Relative potency of A to B: ratio vol(b)/vol(a) required to produce equal responses
$\beta_w + \beta \log(\mathrm{vol}(w)) = \beta_s + \beta \log(\mathrm{vol}(s))$
$\log(\mathrm{vol}(s)/\mathrm{vol}(w)) = (\beta_w - \beta_s)/\beta$
Need a CI for the ratio $\beta_w/\beta$ of regression coefficients

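25 Fieller's method (CI for ratio of means)
Suppose $\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right)$ with Σ known (for simplicity).
We observe $(y_1, y_2)$ and want a CI for $\theta = \mu_1/\mu_2$.
Fieller's (1954) pivotal argument:
$Y_1 - \theta Y_2 \sim N(0, \tau^2(\theta))$, with $\tau^2(\theta) = \sigma_{11} - 2\theta\sigma_{12} + \theta^2\sigma_{22}$
$R(Y; \theta) = \dfrac{Y_1 - \theta Y_2}{\sqrt{\sigma_{11} - 2\theta\sigma_{12} + \theta^2\sigma_{22}}} \sim N(0, 1)$
90% CI: $\{\theta : -1.645 < R(y; \theta) < 1.645\}$
Properties:
exact 90% coverage under assumptions as stated
Interval always includes $\hat\theta = y_1/y_2$ (non-empty)
Either $I = (\theta_L(y), \theta_U(y))$ (bounded) or $I = (\theta_L(y), \theta_U(y))^c$ (the complement of a bounded interval)
Interval may be whole space: $I = \mathbb{R}$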

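26 Fieller interval with estimated variance
$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \sigma^2 \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} \right)$ with C known.
$s^2 \sim \sigma^2 \chi^2_f / f$, independent of Y
Fieller pivotal t-statistic: $R(Y; \theta) = \dfrac{Y_1 - \theta Y_2}{s \sqrt{C_{11} - 2\theta C_{12} + \theta^2 C_{22}}} \sim t_f$
$(1 - \alpha)$ CI: $\{\theta : t_f(\alpha/2) < R(y; \theta) < t_f(1 - \alpha/2)\}$
Boundary: $R^2(y, \theta) = t_f^2(\alpha/2)$ (quadratic in θ)
Generates random intervals of same structure:
exact coverage as stated
may be a bounded real interval or the complement
may be the whole space
does not meet with universal approval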
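
Since the boundary condition is quadratic in θ, the interval can be computed in closed form. The function below is a minimal sketch under the assumptions on this slide; `fieller_ci` and its argument names are introduced here for illustration, and the numbers in the usage lines are hypothetical rather than the tuberculin estimates.

```r
# Fieller set for theta = mu1 / mu2, given estimates (y1, y2) with
# covariance s2 * C and s2 estimated on f degrees of freedom.
fieller_ci <- function(y1, y2, s2, C, f, alpha = 0.05) {
  q <- qt(1 - alpha / 2, f)^2
  # R^2(y, theta) <= q  is equivalent to  a*theta^2 + b*theta + c0 <= 0
  a  <- y2^2 - q * s2 * C[2, 2]
  b  <- -2 * (y1 * y2 - q * s2 * C[1, 2])
  c0 <- y1^2 - q * s2 * C[1, 1]
  disc <- b^2 - 4 * a * c0
  if (a > 0) {
    # Denominator clearly non-zero: bounded interval between the roots
    list(type = "bounded",
         limits = sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a)))
  } else if (a < 0 && disc > 0) {
    # Confidence set is everything outside the two roots
    list(type = "complement",
         limits = sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a)))
  } else {
    # No real roots: every theta is accepted (the boundary case a == 0
    # is not treated separately in this sketch)
    list(type = "whole line", limits = c(-Inf, Inf))
  }
}

# Hypothetical inputs, for illustration only
ci_bounded <- fieller_ci(y1 = 1,  y2 = 2,   s2 = 0.01, C = diag(2), f = 7)
ci_compl   <- fieller_ci(y1 = 10, y2 = 0.1, s2 = 1,    C = diag(2), f = 7)
ci_whole   <- fieller_ci(y1 = 1,  y2 = 0.1, s2 = 1,    C = diag(2), f = 7)
print(ci_bounded)
```

The three cases correspond to the denominator estimate being clearly non-zero (bounded interval), not clearly non-zero while the numerator is (complement), and neither being distinguishable from zero (whole line).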

27 Tuberculin relative potency (continued)
Relative potency of Weybridge to Standard: ratio vol(s)/vol(w) required to produce equal responses
$\beta_w + \beta \log(\mathrm{vol}(w)) = \beta_s + \beta \log(\mathrm{vol}(s))$
$\theta = \log(\mathrm{vol}(s)/\mathrm{vol}(w)) = (\beta_w - \beta_s)/\beta$
Need a CI for the ratio of regression coefficients:
$\begin{pmatrix} \hat\beta_w \\ \hat\beta \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_w \\ \beta \end{pmatrix}, \sigma^2 C \right)$, with $s^2$ = … on 7 d.f.
Point estimate: $\hat\theta$ = … / … = …
Fieller 95% CI for θ: (0.8735, 1.153) for log rel potency
CI for rel potency $2^\theta$: (1.83, 2.22) (logs coded to base 2)
Weybridge roughly twice as potent as Standard

28 Graphical illustration of Fieller interval
[Figure: $R^2(y, \theta)$ plotted against θ.] 95% confidence interval for the log relative potency of Weybridge to Standard using the Fieller (or LR) pivotal statistic.
95% cutoff = qf(0.95, 1, 7) = 5.59
95% CI: (0.874, …, 1.153)
99% CI: (0.804, …, 1.223)

29 Association between birth month and death month
Dates for 348 famous Americans (Phillips and Feldman, 1973)
[Table: 12 × 12 counts, month of birth (rows J–D) by month of death (columns J–D)]
Q1: Is there an association between B and D?
Q2: Is the distn of differences D − B unusual?

30 Association between birth month and death month
[Table: 12 × 12 counts, month of birth (rows J–D) by month of death (columns J–D), with row and column totals T]
Q1: Is there an association between B and D?
Q2: Is the distn of differences D − B unusual?

31 Statistical strategies for birth and death
1. Terminology: the table and the values
   M = {Jan, Feb, ..., Nov, Dec}
   Tbl = $M^2$ (144 ordered pairs (empty cells))
   $Y : \mathrm{Tbl} \to \mathbb{R}$, value in the table
   structure in the table; pattern in the values
2. The indexing system is the table
   Homologous factors A and B (same set of levels)
   Diagonal is special: A = B
3. Circularly ordered levels give additional structure
   a metric $d(x, x') = d(x', x)$ on M is a function on $M^2$, a covariate
4. Q: Is there a pattern in the values associated with the table structure, e.g. with the metric?
   W/O structure: $X^2$ = 109.9; dev = … on 121 df
   no indication of any pattern...

32 Focusing the question by exploiting structure
Data tabulated by metric: death month − birth month; death month − birth month (mod 12)
[Table columns: diff, diff, count, fitted, resid]
(uniform): $X^2$ = 22.1 on 11 df; p = 2.4%
(non-uniform): $X^2$ = 21.7 on 11 df; p = 2.7%
Total deaths (3m before, 3m after) = (73, 114)
Total deaths (5m before, 5m after) = (124, 174)
Similar analysis using GLMs
fit0 <- glm(y ~ birth+death, offset=log(days), family=poisson())
fit1 <- glm(y ~ birth+death+diff,...
Dev0 - Dev1 = 22.6 on 11 df
Other conclusions:
Excess of famous births in Jan; big deficit May, June
Excess deaths April, July; big deficit in Nov
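
The structure of this comparison can be sketched as follows. The observed 12 × 12 table is not reproduced in this transcript, so the counts below are simulated Poisson values and carry no information about the real data. The `days` offset is omitted because its definition is not given on the slide.

```r
set.seed(4)

# One row per (birth month, death month) cell of the 12 x 12 table
tab <- expand.grid(birth = 1:12, death = 1:12)

# Circular difference: months from birth month to death month, mod 12
tab$diff <- (tab$death - tab$birth) %% 12

# Hypothetical counts with the same overall mean as 348 / 144
tab$y <- rpois(nrow(tab), lambda = 348 / 144)

# Treat all three classifications as factors
tab[c("birth", "death", "diff")] <- lapply(tab[c("birth", "death", "diff")], factor)

fit0 <- glm(y ~ birth + death, family = poisson(), data = tab)
fit1 <- update(fit0, . ~ . + diff)

# Deviance reduction for the 11 df carried by the circular difference
print(anova(fit0, fit1, test = "Chisq"))

# Counts tabulated by the metric
by_diff <- tapply(tab$y, tab$diff, sum)
print(by_diff)
```

The residual degrees of freedom of `fit0` (121) and the 11 df for `diff` agree with the figures quoted on the slides, which is a useful check that the factor coding matches.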

33 Hypergeometric distributions
Space: 2-way tables of non-negative integers $y_{rs}$; row and column totals specified
Hypergeometric distn: $p(y \mid \text{totals}) = \dfrac{\prod_r y_{r\cdot}! \ \prod_s y_{\cdot s}!}{y_{\cdot\cdot}! \ \prod_{r,s} y_{rs}!}$
Hypergeometric distribution by random matching
Two lists of length $n = y_{\cdot\cdot}$
R-labels: $R = (R_1, \dots, R_n)$ (values 1, ..., k); $y_{r\cdot}$ individuals have $R_i = r$
C-labels: $C = (C_1, \dots, C_n)$ (values 1, ..., k′); $y_{\cdot c}$ individuals have $C_i = c$
List matching: $i \mapsto \pi(i)$, $(R, C^\pi) = (R_1, C_{\pi(1)}), \dots, (R_n, C_{\pi(n)})$
Table: $H_{rs} = \#\{i : (R_i, C_{\pi(i)}) = (r, s)\}$
Row totals of H: same as those of y
Uniform random matching: π uniform on the group

34 p-values by hypergeometric simulation
Given 2-way array Y
Generate a random hypergeometric table $Y^* \sim H(\dots)$ with the same marginal totals as Y
rowsums <- rowSums(y); colsums <- colSums(y); n <- sum(y)
xr <- rep(1:nrow(y), rowsums)
xc <- rep(1:ncol(y), colsums)
ystar <- table(xr, xc[order(runif(n))])
Given a scalar statistic T defined on tables, compute the distn of $T(Y^*)$ with $Y^* \sim H(\dots)$
value <- numeric(nsim)
for(sim in 1:nsim){ ystar <- table(xr, xc[order(runif(n))]); value[sim] <- T(ystar) }
Compute Monte Carlo p-value: sum(value >= T(y)) / nsim
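
A self-contained version of this recipe is sketched below, using the Pearson statistic as the scalar statistic. The 3 × 3 table `y` is hypothetical, and the function names `pearson` and `hyper_pvalue` are introduced here for illustration.

```r
# Hypothetical 3 x 3 table of counts (illustrative only)
y <- matrix(c(5, 2, 1,
              1, 6, 2,
              2, 1, 7), nrow = 3, byrow = TRUE)

# Pearson statistic for independence in a 2-way table
pearson <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Monte Carlo p-value by random matching of row labels to column labels
hyper_pvalue <- function(y, stat, nsim = 2000) {
  n  <- sum(y)
  # Factors keep every level, so simulated tables always have full dimensions
  xr <- factor(rep(seq_len(nrow(y)), rowSums(y)), levels = seq_len(nrow(y)))
  xc <- factor(rep(seq_len(ncol(y)), colSums(y)), levels = seq_len(ncol(y)))
  value <- numeric(nsim)
  for (sim in seq_len(nsim)) {
    ystar <- table(xr, xc[order(runif(n))])   # uniform random permutation
    value[sim] <- stat(ystar)
  }
  mean(value >= stat(y))
}

set.seed(1)
p <- hyper_pvalue(y, pearson)
print(p)
```

Base R also provides `r2dtable()`, which generates random tables with given margins directly; the explicit matching is kept here because it mirrors the construction on the preceding slide.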

35 Distribution of Pearson statistic in sparse case
[Figure, four panels: Pearson statistic versus chisq(121); Deviance statistic versus chisq(121); Reduced P statistic versus chisq(11); Reduced D statistic versus chisq(11)]

36 Experimental Design: PNAS paper Fig. 1

37 Mating events for 18 generations
[Table, one row per generation. Column groups: Single mating wells; Double mating wells; Activity level (Zero, Two flies, Three flies, Four flies). Columns: Gen, null, cc, cs, sc, ss, cc.cs, sc.ss, cc.ss, cs.sc, zero, two, three, four, Tot]
$X^2$ = 48.4; $X^2$ = 30.2; $X^2$ = 7.8; $X^2$ = …
Data taken from the Yekutieli report
Activity level: double mating rate decreasing with gen
Given the activity level, the type distribution is constant?

38 Are the events in different wells independent? y: homogamic rate for single mating wells x: homogamic rate for double mating wells weighted Pearson correlation:

39 Null distribution of Pearson correlation statistic
Statistic: weighted correlation of 18 pairs of binomial fractions: cor(y1/m1, y2/m2, wt)
Simulate a table Y for homrate1
Simulate an indep table Y2 for homrate2, with the same row and col totals as in observed tables
Compute the sample correlation r
Repeat $10^4$ times for a null distribution
Where does the observed value lie relative to the distribution?
Answer by simulation: $F(-0.65) \approx 1/850$
Answer by normal approx: $F(-0.65) \approx 1/350$
Are events in distinct wells independent?
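
The simulation logic can be sketched as below, with several substitutions that should be read as assumptions. The well counts `m1` and `m2` are hypothetical, the weights are taken to be `m1 + m2`, and independent binomial draws stand in for the fixed-margin hypergeometric tables used on the slide. The observed correlation is read as −0.65. Base R's `cor()` has no weight argument, so the weighted correlation is computed with `cov.wt()`.

```r
set.seed(5)
ngen <- 18

# Hypothetical numbers of single- and double-mating wells per generation
m1 <- sample(30:60, ngen, replace = TRUE)
m2 <- sample(10:30, ngen, replace = TRUE)
wt <- m1 + m2                      # assumed weights

# Weighted Pearson correlation via the weighted covariance matrix
wcor <- function(x, y, w) {
  cov.wt(cbind(x, y), wt = w / sum(w), cor = TRUE)$cor[1, 2]
}

# Null distribution: homogamic rates simulated independently in the two well types
nsim <- 2000
r_null <- replicate(nsim, {
  y1 <- rbinom(ngen, m1, 0.5)
  y2 <- rbinom(ngen, m2, 0.5)
  wcor(y1 / m1, y2 / m2, wt)
})

r_obs  <- -0.65                    # observed value as read from the slide
p_sim  <- mean(r_null <= r_obs)    # lower-tail Monte Carlo estimate
p_norm <- pnorm(r_obs, mean = mean(r_null), sd = sd(r_null))
print(c(simulation = p_sim, normal_approx = p_norm))
```

With only a few thousand replicates a tail probability near 1/850 is estimated very roughly, which is one reason the slide uses $10^4$ simulations.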

40 Are the events in different wells independent?
[Figure: density of the weighted sample correlation under the null distribution; $F(-0.65) = 1/850$]

41 Limitations of significance testing
Posterior odds versus significance levels
$\text{posterior odds} = \dfrac{P(\text{data} \mid \text{independence})}{P(\text{data} \mid \text{non-independence})} \times \text{prior odds}$
How to evaluate the denominator?
Big question: Why are events in different wells not independent?
Open scientific question, not a statistical question
Speculations:
