Likelihood Ratio Tests that Certain Variance Components Are Zero
Ciprian M. Crainiceanu, Department of Statistical Science, Cornell University
www.people.cornell.edu/pages/cmc59
Work done jointly with David Ruppert, School of ORIE, Cornell University
Outline
A. Examples of problems (Ruppert, Wand and Carroll, 2003)
B. Penalized splines as a particular case of LMMs
C. Null hypotheses that include a zero random-effects variance in LMMs
D. LRTs and F-type tests for this type of hypothesis
E. Null distributions and power properties of the LRT and RLRT
F. Other applications: additive models, nonlinear models, goodness-of-fit
G. Conclusions
Mother age vs. child birthweight
[Figure: scatterplot of child birthweight (grams) against maternal age (years), with ML and GCV penalized-spline fits]
Janka hardness
[Figure: scatterplot of log(Janka hardness) against wood density, with a linear fit and a penalized-spline fit]
Nonparametric framework
Consider the regression model
  y_i = m(x_i) + ε_i,  where the ε_i are i.i.d. N(0, σ_ε²).
Null hypothesis: m(·) is a degree p − q polynomial
  m(x, β) = β_0 + β_1 x + … + β_{p−q} x^{p−q}
Alternative hypothesis: m(·) is a regression spline
  m(x, θ) = β_0 + β_1 x + … + β_p x^p + ∑_{k=1}^{K} b_k (x − κ_k)_+^p
Idea: use likelihood ratio tests.
P-splines as LMMs
Penalized sum of squares estimation criterion (to avoid overfitting):
  ∑_{i=1}^{n} {y_i − m(x_i; θ)}² + (1/λ) θᵀ D θ
The penalized spline criterion is equivalent to
  (1/σ_ε²) ‖Y − Xβ − Zb‖² + (1/(λ σ_ε²)) bᵀ Σ⁻¹ b.
If σ_b² = λ σ_ε², then minimizing this criterion is equivalent to ML estimation in the LMM
  Y = Xβ + Zb + ε,  Cov(b, ε) = [ σ_b² Σ, O_{K×n}; O_{n×K}, σ_ε² I_n ]
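For concreteness, the design matrices X and Z in this mixed-model representation come from the truncated power basis of the previous slide. A minimal sketch (the function name, data, and knot placement are illustrative choices, not from the talk):

```python
def truncated_power_basis(x, p, knots):
    """Build the fixed-effects design X = [1, x, ..., x^p] and the
    random-effects design Z with columns (x_i - kappa_k)_+^p."""
    X = [[xi ** j for j in range(p + 1)] for xi in x]
    Z = [[max(xi - kappa, 0.0) ** p for kappa in knots] for xi in x]
    return X, Z

# Example: linear spline (p = 1) with K = 3 interior knots on [0, 1]
x = [i / 9 for i in range(10)]
knots = [0.25, 0.5, 0.75]
X, Z = truncated_power_basis(x, 1, knots)
```

With b penalized (random), these X and Z are exactly the matrices appearing in Y = Xβ + Zb + ε above.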
Smoothing parameter estimation
  ML(Θ) = n log(σ_ε²) + log{det(V_λ)} + (Y − Xβ)ᵀ V_λ⁻¹ (Y − Xβ) / σ_ε²
  REML(Θ) = ML(Θ) − (p + 1) log(σ_ε²) + log{det(Xᵀ V_λ⁻¹ X)}
where Θ = (β, σ_ε², λ), Cov(Y) = σ_ε² V_λ, and V_λ = I_n + λ Z Σ Zᵀ.
Generalized Cross Validation (GCV) can also be used to estimate λ (but is not suited to LRT testing).
Hypotheses
  H_0: β_{p−q+1} = … = β_p = 0 and λ = 0 (σ_b² = 0)
  H_A: β_{p−q+1} ≠ 0 or … or β_p ≠ 0 or λ > 0 (σ_b² > 0)
Why is the problem hard?
- Non-standard hypothesis: under the null, λ lies on the boundary of the parameter space
- Correlated observations (at least under the alternative)
- Technicalities (e.g., fixed number of knots, not o(n)!)
Likelihood Ratio Tests
Likelihood and restricted likelihood ratio tests:
  LRT_n = inf_{H_0} ML(Θ) − inf_{H_A} ML(Θ)
  RLRT_n = inf_{H_0} REML(Θ) − inf_{H_A} REML(Θ)
Note: the RLRT makes sense only when the fixed effects are the same under H_0 and H_A.
Generality: the same considerations hold for any LMM with one variance component!
Asymptotic Theory Trivialities
An asymptotic distribution is useful iff:
- it provides a good approximation to the finite-sample distribution(s);
- it is much simpler than the finite-sample distribution(s);
- asymptotic reasoning makes sense for the problem at hand.
Key fact: asymptotics is just an approximation to the finite sample.
Boundary problem asymptotics
Test for one parameter on the boundary (q = 0). For observations that are independent under the alternative (Chernoff, 1954; Moran, 1971; Chant, 1974; Self and Liang, 1987, 1995):
  LRT_n → 0.5 χ²_0 + 0.5 χ²_1
Stram and Lee (1994): the same holds for the longitudinal mixed effects model
  Y_i = X_i β + Z_i b_i + ε_i
as the number of subjects K → ∞.
Pinheiro and Bates (2000), simulations:
  RLRT_n: 0.5 χ²_0 + 0.5 χ²_1,  LRT_n: 0.65 χ²_0 + 0.35 χ²_1
Shephard and Harvey (1990) found p_0 ≈ 0.95 in a related model.
Probability mass at zero
Test for a zero random-effects variance in an LMM with one variance component (Crainiceanu, Ruppert, Vogelsang, 2002).
First-order conditions for a local minimum at λ = 0:
  ∂ML(Θ)/∂β = 0,  ∂ML(Θ)/∂σ_ε² = 0,  ∂ML(Θ)/∂λ ≥ 0
Null finite-sample probability mass at zero of LRT_n:
  P( uᵀ P_0 Z Σ Zᵀ P_0 u / uᵀ P_0 u ≤ tr(Z Σ Zᵀ)/n ),
where u ~ N(0, I_n) and P_0 = I_n − X (Xᵀ X)⁻¹ Xᵀ.
One-way ANOVA: best case scenario
Model:
  Y_ik = β_0 + b_k + ε_ik,  i = 1, …, I;  k = 1, …, K;
  b ~ N(0, σ_b² I_K),  ε ~ N(0, σ_ε² I_n)
  H_0: σ_b² = 0 vs. H_A: σ_b² > 0
Asymptotic (I → ∞, K constant) probability mass at zero:
  p_ML(K) = P(χ²_{K−1} < K),  p_REML(K) = P(χ²_{K−1} < K − 1)
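These closed forms are easy to check numerically. A minimal Monte Carlo sketch (stdlib only; the sample size, seed, and the choice K = 3 are arbitrary illustrative choices):

```python
import random

def chi2_cdf_mc(df, x, n_sim=100_000, seed=1):
    """Monte Carlo estimate of P(chi^2_df < x), simulating each
    chi-square draw as a sum of df squared standard normals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        if s < x:
            hits += 1
    return hits / n_sim

K = 3
p_ml = chi2_cdf_mc(K - 1, K)        # P(chi^2_2 < 3) = 1 - exp(-1.5)
p_reml = chi2_cdf_mc(K - 1, K - 1)  # P(chi^2_2 < 2) = 1 - exp(-1)
```

Both probabilities exceed 0.5, which is why the 0.5 : 0.5 mixture understates the mass at zero.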
ANOVA: asymptotic mass at zero
[Figure: asymptotic probability mass at zero of the LRT and RLRT against the number of levels K (0–100); both curves stay above 0.5 and decrease toward it]
Non-parametric testing example
Constant mean (p = 0) vs. piecewise constant spline:
  Y_i = β_0 + ∑_{k=1}^{K} b_k I{x_i > κ_k} + ε_i,  i = 1, …, n;  k = 1, …, K;
  b ~ N(0, σ_b² I_K),  ε ~ N(0, σ_ε² I_n)
  H_0: σ_b² = 0 vs. H_A: σ_b² > 0
Asymptotic (n → ∞, K constant). Example: the x_i are equally spaced and κ_k is the sample quantile corresponding to k/(K + 1).
Asymptotic mass at zero
[Figure: asymptotic probability mass at zero for ML and REML against the number of knots (0–100); the probabilities lie between roughly 0.55 and 0.95]
Cracking the nut
- The 0.5 : 0.5 chi-square mixture approximation is useless in this case
- Calculating the probability mass at zero (finite sample and asymptotic) is simple: the 0.5 : 0.5 mixture is improved to a p_n : 1 − p_n approximation
- Maybe the non-zero part of the distribution is not χ²_1
- What happens when q > 0?
- The number of knots (levels) is fixed, not o(n)!
Distribution of LRT_n (I)
  LRT_n = n log( 1 + ∑_{s=1}^{q} u_s² / ∑_{s=1}^{n−p−1} w_s² ) + sup_{λ≥0} h(λ, w, µ, ξ)
  h(λ, w, µ, ξ) = n log{ 1 + N_n(λ, w, µ) / D_n(λ, w, µ) } − ∑_{s=1}^{K} log(1 + λ ξ_{s,n})
  N_n(λ, w, µ) = ∑_{s=1}^{K} [λ µ_{s,n} / (1 + λ µ_{s,n})] w_s²
  D_n(λ, w, µ) = ∑_{s=1}^{K} w_s² / (1 + λ µ_{s,n}) + ∑_{s=K+1}^{n−p−1} w_s²
where the u_s and w_s are i.i.d. N(0, 1), and µ_{s,n}, ξ_{s,n} are the K eigenvalues of Σ^{1/2} Zᵀ P_0 Z Σ^{1/2} and Σ^{1/2} Zᵀ Z Σ^{1/2}, respectively.
Distribution of LRT_n (II)
- Simulating the finite-sample distribution is very simple: for K = 20, 5,000 simulations/sec (2.66 GHz CPU)
- Simulations are feasible as long as we can diagonalize the matrices
- Finite-sample probability mass at zero of LRT_n (q = 0):
  P( ∑_{s=1}^{K} µ_{s,n} w_s² / ∑_{s=1}^{n−p−1} w_s² ≤ (1/n) ∑_{s=1}^{K} ξ_{s,n} )
- A similar expression holds for RLRT_n (one eigenvalue dominates the others)
- Wilks phenomenon in finite samples: the null distribution is free of nuisance parameters
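To illustrate how simple the simulation is, the sketch below draws from the null distribution of LRT_n via the spectral form of the previous slide, specialized to the balanced one-way ANOVA, where (with Σ = I_K, K groups of I observations, and p = 0) the eigenvalues reduce to µ_{s,n} = I for s ≤ K − 1, µ_{K,n} = 0, and ξ_{s,n} = I. The grid for the sup over λ, the seed, and the simulation size are illustrative choices, and a coarse grid only approximates the sup:

```python
import math, random

def lrt_sim(mu, xi, n, n_sim=2000, seed=7, grid=None):
    """Simulate the null finite-sample LRT_n (q = 0) from its spectral form:
    sup_{lam>=0} [ n*log(1 + N/D) - sum_s log(1 + lam*xi_s) ],
    with N, D built from the K eigenvalues mu_s, xi_s and i.i.d. N(0,1) w_s."""
    rng = random.Random(seed)
    K = len(mu)
    if grid is None:  # crude log-spaced grid over lam, plus lam = 0
        grid = [0.0] + [10 ** (e / 10) for e in range(-40, 41)]
    out = []
    for _ in range(n_sim):
        w2 = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n - 1)]  # p = 0 here
        tail = sum(w2[K:])
        best = 0.0  # lam = 0 always gives the value 0
        for lam in grid:
            N = sum(lam * m / (1 + lam * m) * w2[s] for s, m in enumerate(mu))
            D = sum(w2[s] / (1 + lam * m) for s, m in enumerate(mu)) + tail
            val = n * math.log1p(N / D) - sum(math.log1p(lam * x) for x in xi)
            best = max(best, val)
        out.append(best)
    return out

# Balanced one-way ANOVA: K groups of I observations, Sigma = I_K
I, K = 10, 5
sims = lrt_sim([float(I)] * (K - 1) + [0.0], [float(I)] * K, n=I * K)
mass_at_zero = sum(s == 0.0 for s in sims) / len(sims)
```

Only the eigenvalues enter the simulation, which is why diagonalizing the K × K matrices is all that is required.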
What kind of asymptotics?
- Number of observations n → ∞
- Number of knots K fixed
- Conditions on Σ^{1/2} Zᵀ P_0 Z Σ^{1/2} and Σ^{1/2} Zᵀ Z Σ^{1/2}:
  n^{−α} µ_{s,n} → µ_s  and  n^{−α} ξ_{s,n} → ξ_s
That is: many observations, a mean function with at most 1 + p + K degrees of freedom, and a stable asymptotic structure of the K × K design matrices.
LRT_n asymptotic distribution
  LRT_n → ∑_{s=1}^{q} u_s² + sup_{d≥0} [ ∑_{s=1}^{K} (d µ_s / (1 + d µ_s)) w_s² − ∑_{s=1}^{K} log(1 + d ξ_s) ]
where the u_s and w_s are i.i.d. N(0, 1); the first part corresponds to the q fixed effects, the second to the zero variance, and the two are independent.
Asymptotic probability mass at zero (q = 0):
  LRT: P( ∑_{s=1}^{K} µ_s w_s² ≤ ∑_{s=1}^{K} ξ_s ),  RLRT: P( ∑_{s=1}^{K} µ_s w_s² ≤ ∑_{s=1}^{K} µ_s )
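Both mass-at-zero formulas can be evaluated by Monte Carlo for arbitrary eigenvalue sequences. A stdlib-only sketch (seed and simulation size are arbitrary); for the balanced ANOVA the scaled eigenvalues are µ_s = 1/K for s ≤ K − 1, µ_K = 0, and ξ_s = 1/K, so the formulas reduce to p_ML(K) and p_REML(K) from the ANOVA slide:

```python
import random

def mass_at_zero(mu, xi, n_sim=100_000, seed=11):
    """Monte Carlo estimates of the asymptotic mass at zero:
    LRT:  P(sum_s mu_s w_s^2 <= sum_s xi_s)
    RLRT: P(sum_s mu_s w_s^2 <= sum_s mu_s),  w_s i.i.d. N(0,1)."""
    rng = random.Random(seed)
    sum_xi, sum_mu = sum(xi), sum(mu)
    lrt = rlrt = 0
    for _ in range(n_sim):
        t = sum(m * rng.gauss(0.0, 1.0) ** 2 for m in mu)
        lrt += t <= sum_xi
        rlrt += t <= sum_mu
    return lrt / n_sim, rlrt / n_sim

# Balanced ANOVA case: LRT mass ~ P(chi^2_{K-1} <= K),
# RLRT mass ~ P(chi^2_{K-1} <= K-1)
K = 5
p_lrt, p_rlrt = mass_at_zero([1 / K] * (K - 1) + [0.0], [1 / K] * K)
```

Note that the RLRT mass is smaller than the LRT mass here, consistent with the earlier mass-at-zero plots.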
Proof details
- Establish convergence in C[0, ∞) of the profile LRT (finite-dimensional convergence + tightness)
- Show that a Continuous Mapping Theorem type result holds
- Catch: sup is not continuous on C[0, ∞). Counterexample:
  x_n(t) = (n² t + n − n³) I{n − 1/n < t ≤ n} + n I{t > n} → 0 in C[0, ∞), yet sup_{t≥0} x_n(t) = n → ∞
Crainiceanu and Ruppert (2002)
One-way ANOVA: asymptotics
  LRT_n → K {X_K − 1 − log(X_K)} I{X_K > 1},  X_K = χ²_{K−1}/K
  RLRT_n → (K − 1) {X̃_K − 1 − log(X̃_K)} I{X̃_K > 1},  X̃_K = χ²_{K−1}/(K − 1)
When K → ∞, the two distributions → 0.5 χ²_0 + 0.5 χ²_1.
The probability at zero is always > 0.5. Is the non-zero part χ²_1?
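This limit is easy to simulate directly; its point mass at zero is P(X_K ≤ 1) = P(χ²_{K−1} < K), matching p_ML(K) from the ANOVA slide. A minimal sketch (seed and simulation size are arbitrary choices):

```python
import math, random

def lrt_anova_limit(K, n_sim=100_000, seed=3):
    """Draws from the asymptotic LRT law K*(X - 1 - log X)*1{X > 1},
    X = chi^2_{K-1}/K, simulating the chi-square as a sum of squared normals."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sim):
        x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(K - 1)) / K
        draws.append(K * (x - 1 - math.log(x)) if x > 1 else 0.0)
    return draws

sims = lrt_anova_limit(K=5)
mass = sum(d == 0.0 for d in sims) / len(sims)  # ~ P(chi^2_4 < 5)
```

The positive part of `sims` can then be QQ-plotted against χ²_1 to examine the question on this slide.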
QQ plot for LRT | LRT > 0, one-way ANOVA with K = 3, 5, 20 levels
[Figure: quantiles of the non-zero part of the LRT plotted against χ²_1 quantiles; reference line at the χ²_1 quantile q_0.99 = 6.63]
(R)LRT of linearity
Linear mean (p = 1) vs. linear spline (q = 0):
  Y_i = β_0 + β_1 x_i + ∑_{k=1}^{K} b_k (x_i − κ_k)_+ + ε_i,  i = 1, …, n;  k = 1, …, K;
  b ~ N(0, σ_b² I_K),  ε ~ N(0, σ_ε² I_n)
  H_0: σ_b² = 0 vs. H_A: σ_b² > 0
Asymptotic (n → ∞, K constant). Example: the x_i are equally spaced, κ_k is the sample quantile for k/(K + 1).
The finite-sample and asymptotic LRT_n distributions are close to χ²_0!
Testing linear regression vs. penalized linear spline (K = 20 knots)
[Figure: REML quantiles q_0.66, q_0.95, q_0.99, q_0.995 for n = 50, n = 100, n = ∞, and for the 0.5 : 0.5 mixture; the mixture quantiles (up to 5.32) exceed the asymptotic (n = ∞) quantiles (up to 4.20)]
F and F-type tests
Hastie and Tibshirani (1990), Cantoni and Hastie (2002).
H_0: λ = λ_0 vs. H_A: λ = λ_1:
  F_{γ_0,γ_1} = {(RSS_0 − RSS_1)/(γ_0 − γ_1)} / (RSS_1/γ_1)
  R_{λ_0,λ_1} = Yᵀ (S_{λ_1} − S_{λ_0}) Y / Yᵀ (I_n − S_{λ_1}) Y
where γ_λ = tr{(I_n − S_λ)²} = # d.f. of the residuals and S_λ is the smoother matrix.
H_0: λ = 0 vs. H_A: λ > 0 (λ_1 = λ̂):
- The null distribution of R_{0,λ_1} has no mass at zero
- The null distribution of R_{0,λ̂} has ≫ 0.5 mass at zero
Power properties
Test constant mean vs. a general alternative modeled by:
- a piecewise constant spline (RLRT)
- a linear spline (LRT)
Types of alternatives considered: increasing, concave, periodic.
Crainiceanu, Ruppert, Claeskens, Wand (2002)
Notations
- R: test from Cantoni and Hastie (2002)
- F: test from Hastie and Tibshirani (1990)
- C: alternative modeled by a piecewise constant spline
- L: alternative modeled by a linear spline
- 1: estimate under the alternative has DF one greater than p
- O: the design matrix is orthogonalized
- ML: smoothing parameter estimated by (RE)ML
- GCV: smoothing parameter estimated by GCV
Tests       Average  Maximum  Minimum
RLRT-C      0.89     0.97     0.82
R-GCV-L     0.87     0.99     0.72
R-ML-C      0.86     0.99     0.70
F-ML-L      0.85     0.88     0.83
R-ML-L      0.85     0.88     0.83
F-ML-C      0.84     0.99     0.67
F-GCV-L     0.84     0.99     0.66
LRT-L       0.76     0.85     0.68
LRT-L-O     0.69     0.86     0.43
F-1-L       0.68     0.94     0.30
R-1-L       0.62     0.91     0.15
R-GCV-C     0.61     0.92     0.34
RLRT-C-O    0.61     0.93     0.32
Power results
- There is no most powerful test against the three alternatives considered
- RLRT-C has good power compared to competing tests
- LRT-L is worse (probability mass at zero)
- Other good tests exist, but their null distributions have to be simulated (5,000 simulations, n = 100, K = 20): R-GCV-L takes 30 min vs. 1 sec for RLRT-C
- Our approach: the R and F tests include the variability in λ̂
Mother age/child birthweight: no effect
            K = 10            K = 20
Test        value   p-value   value   p-value
RLRT-C      0       0.35      0.04    0.29
F-ML-C      0.90    0.35      1.30    0.26
F-GCV-C     1.58    0.21      1.58    0.19
LRT-L       2.46    0.12      2.43    0.11
F-ML-L      2.50    0.13      2.50    0.12
F-GCV-L     4.12    0.05      4.59    0.03
Janka hardness: linearity
            K = 5                K = 10
Test        value   p-value      value   p-value
RLRT-L      16.90   1 × 10⁻⁵     17.27   1 × 10⁻⁵
F-ML        11.59   5 × 10⁻⁶     12.13   9 × 10⁻⁶
Null: more than one covariate
  y_i = m_1(x_{1i}) + m_2(x_{2i}) + ε_i
Both m_1(·) and m_2(·) can be modeled as splines. The model is
  Y = X_1 β_1 + X_2 β_2 + Z_1 b_1 + Z_2 b_2 + ε
  b_1 ~ N(0, σ_1²), b_2 ~ N(0, σ_2²), ε ~ N(0, σ_ε²), mutually independent
Tests of σ_1² = 0 and/or σ_2² = 0 are tests against a nonparametric H_A.
Spectral decomposition of (R)LRT_n for LMMs with more than one variance component (Crainiceanu, Ruppert, Claeskens, Wand, 2002).
Warning: a theoretical result with practical limitations.
Null: nonlinear regression
Linearity in the parameters is the only special property of polynomials. A parametric regression function can be embedded in a larger space (e.g., using p-splines):
  Y = f(X, β_1) + X β_2 + Z b + ε
  b ~ N(0, σ_b² Σ), ε ~ N(0, σ_ε² I_n), independent
  H_0: β_2 = 0, σ_b² = 0 vs. H_A: β_2 ≠ 0 or σ_b² > 0
LRT for non-linear regression
  H_0: Y_i = a + b exp(c x_i) + ε_i
  H_A: Y_i = a + b exp(c x_i) + β x_i + ∑_{k=1}^{K} b_k (x_i − κ_k)_+ + ε_i
n = 100, x_i equally spaced in [0, 2], K = 20 equally spaced knots; 5,000 simulations in 20 minutes.
The distribution of LRT_n for the 2 extra parameters (β = 0, σ_b² = 0) is approximately χ²_1.
WARNING: no RLRT here, since the fixed effects differ under H_0 and H_A!
Model linearization
Let β̂_1 be the MLE under the null Y = f(X, β_1) + ε. Approximate f(X, β_1) by a first-order Taylor expansion:
  f(X, β_1) ≈ f(X, β̂_1) + [∂f(X, β̂_1)/∂β_1]ᵀ (β_1 − β̂_1)
This reduces the problem to testing the null of linear regression against a general alternative. That problem is solved, and the RLRT can be used!
Goodness-of-fit
- Testing a null (constant, linear mean, etc.) against a general alternative is testing goodness-of-fit
- Classical goodness-of-fit tests use standardized residuals
- If the null model is correct, the standardized residuals under the null should behave like white noise
- We can use the (R)LRT for no effect on the estimated residuals: r_i = µ + e_i
- Empirical approach: Ruppert, Wand, Carroll (2002)
(R)LRT bootstrap
- (R)LRT distribution for LMMs with more than one variance component (feasible for L = 2, 3)
- Exact null distributions for linear, nonlinear, and GLM models
- Low-order smoothers make simulations tractable
- Hard to implement in standard software
- Recommended when unsure about theoretical results
- Our initial approach to inference (that is how we found p_0 ≫ 0.5)
Conclusions
- Testing nulls that include a zero random-effects variance in LMMs
- Unified testing theory (penalized likelihood models)
- (R)LRT null finite-sample and asymptotic distributions + power
- Standard asymptotics for i.i.d. data does not apply
- Feasible extensions: additive models, nonlinear models, GLMs
- P-splines are a powerful and flexible tool for (R)LRT testing of parametric versus nonparametric models