Generalized Cp (GCp) in a Model Lean Framework

Linda Zhao
University of Pennsylvania

Dedicated to Lawrence Brown (1940-2018)

September 9th, 2018, WHOA 3

Joint work with Larry Brown, Juhui Cai, Arun Kumar Kuchibhotla, and the Wharton Team: Richard Berk, Andreas Buja, Ed George, Weijie Su

Table of Contents

1. Introduction
   - Conventional Linear Model
   - Assumption Lean Framework
2. OLS and Predictive Risk under Model Lean Framework
3. Generalized $C_p$ ($GC_p$)
   - Definition
   - Properties
   - An alternative: boot $GC_p$
4. Distribution of the Difference in $GC_p$s
5. Simulations
6. Summary and Ongoing Research


Conventional Linear Model

The conventional linear model assumes:

$$Y = X\beta + \epsilon \tag{1}$$

- $Y_{N \times 1}$ is the response vector
- $X_{N \times r}$ contains the $r$ predictors
- $\beta_{r \times 1}$ is the vector of parameters
- $\epsilon \sim N(0, \sigma^2 I_{N \times N})$

Linear Model Violation

Often, the model assumptions may not hold!

- Nonlinearity
- Heteroscedasticity
- Missing important variables

Assumption Lean Setup

We proceed without many of the restrictions:

- Without assuming a well-specified linear model
- Allowing a random design
- Without homoscedasticity

Assumption Lean Setup: Well-defined $\beta$

Assumption Lean Framework:

- Observe a sample $(X_i, Y_i)$ with $X_i \in \mathbb{R}^r$, $(X_i, Y_i) \overset{\text{iid}}{\sim} F$
- No assumptions about $F$, other than existence of low order moments
- A well-defined parameter $\beta$:

$$\beta = \operatorname*{argmin}_b E_F\left[(Y - X^\top b)^2\right] = \left[E(XX^\top)\right]^{-1} E[XY]. \tag{2}$$

Interpretation of $\beta$

$$\beta = \operatorname*{argmin}_b E_F\left[(Y - X^\top b)^2\right] = \left[E(XX^\top)\right]^{-1} E[XY].$$

- It is a statistical functional
- Best linear approximation, or
- Best linear prediction, or
- The linear portion in a semi-parametric model
- Same meaning as in the linear model when all the usual assumptions hold
- See Buja et al. (2014, 2016)
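The "best linear approximation" reading can be checked numerically. The sketch below is only an illustration under assumptions that are not in the slides: a hypothetical nonlinear truth $Y = x^2 + \text{noise}$ with $x \sim \mathrm{Unif}(0, 1)$ and regressor vector $X = (1, x)$, for which the population moments in (2) are available in closed form.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients via a numerically stable solver."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = x**2 + rng.normal(0.0, 0.1, size=n)    # hypothetical truth, nonlinear in x
X = np.column_stack([np.ones(n), x])       # regressor vector (1, x)

# Population moments for x ~ Unif(0, 1):
# E[XX'] = [[1, 1/2], [1/2, 1/3]] and E[XY] = (E[x^2], E[x^3]) = (1/3, 1/4).
E_xx = np.array([[1.0, 1 / 2], [1 / 2, 1 / 3]])
E_xy = np.array([1 / 3, 1 / 4])
beta_pop = np.linalg.solve(E_xx, E_xy)     # the functional beta from (2)

beta_hat = ols(X, y)
print("population beta:", beta_pop)
print("OLS estimate:   ", beta_hat)
```

Even though no linear model holds here, the OLS estimate settles near the functional $\beta$, which is the sense in which $\beta$ stays well defined without model assumptions.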

$\beta_p$ of Sub-Model

Let $M_p$ be a sub-model where

- $M_p = \{i_1, i_2, \ldots, i_p\} \subseteq \{1, \ldots, r\}$, $p \le r$
- $X_p$ contains only $(x_{i_1}, x_{i_2}, \ldots, x_{i_p})$

$$\beta_p = \operatorname*{argmin}_b E_F\left[(Y - X_p^\top b)^2\right] = \left[E(X_p X_p^\top)\right]^{-1} E[X_p Y]$$

Note:

- For simplicity the submodel subscript will be dropped if unnecessary
- $\beta_p$ is defined within $M_p$

Model Lean Framework: OLS Estimate $\hat\beta$

- Given data $X$ and $Y$ in the usual sample matrix representation
- The natural estimate of $\beta$ is the Least Squares Estimate:

$$\hat\beta = (X^\top X)^{-1} X^\top Y. \tag{3}$$

Goal:

1. Properties of the OLS $\hat\beta$
2. Criterion to choose a good submodel
3. Properties of the criterion

Properties of OLS

Asymptotic sandwich formula:

$$\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{\text{Dist}} N(0, \Sigma_{\text{sand}}) \tag{4}$$

where

$$\Sigma_{\text{sand}} = \left[E(XX^\top)\right]^{-1} E\left[XX^\top (Y - X^\top\beta)^2\right] \left[E(XX^\top)\right]^{-1}.$$

See White (1980a, b)

The Sandwich Estimator

Simple (and rather naïve) plug-in yields the sandwich estimator:

$$\hat\Sigma_{\text{sand}} = \left[n^{-1} X^\top X\right]^{-1} \left\{ n^{-1} \sum_i \hat\rho_i^2 X_i X_i^\top \right\} \left[n^{-1} X^\top X\right]^{-1} \tag{5}$$

where $\hat\rho = Y - X\hat\beta$.

See White (1980a, b)
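A minimal sketch of the plug-in estimator (5) in NumPy follows. The function name and the simulated data are assumptions made for illustration: a hypothetical heteroscedastic example with $Y = 1 + 2x + |x| z$, where $x$ and $z$ are independent standard normals.

```python
import numpy as np

def sandwich_estimator(X, y):
    """Plug-in sandwich estimate of the asymptotic covariance of sqrt(n) (beta_hat - beta)."""
    n = X.shape[0]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat                       # rho_hat = Y - X beta_hat
    bread = np.linalg.inv(X.T @ X / n)             # [n^-1 X'X]^-1
    meat = (X * resid[:, None] ** 2).T @ X / n     # n^-1 sum_i rho_i^2 X_i X_i'
    return bread @ meat @ bread

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
# Hypothetical heteroscedastic noise: its scale grows with |x|.
y = 1.0 + 2.0 * x + np.abs(x) * rng.standard_normal(n)

sigma_sand = sandwich_estimator(X, y)
print(np.round(sigma_sand, 2))
```

For this particular example $E(XX^\top) = I$ and $\mathrm{Var}(Y \mid x) = x^2$, so the population $\Sigma_{\text{sand}}$ is $\mathrm{diag}(E x^2, E x^4) = \mathrm{diag}(1, 3)$, and the printed matrix should be close to that. The classical formula $\sigma^2 [E(XX^\top)]^{-1}$ would instead give the identity, understating the slope variance.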

The Sandwich Estimator

Theorem (Kuchibhotla et al. 2018)
Under mild assumptions, the sandwich estimator $\hat\Sigma_{\text{sand}}$ is a consistent estimator of $\Sigma_{\text{sand}}$, i.e.,

$$\hat\Sigma_{\text{sand}} \xrightarrow{P} \Sigma_{\text{sand}}.$$

Moreover, $\hat\Sigma_{\text{sand}}$ is a semi-parametrically efficient estimator of $\Sigma_{\text{sand}}$.

Model Lean Framework: Predictive Risk

Contemplate a future observation $(X, Y) \sim F$. For any submodel $M_p$, the predictive risk of the least squares fit is

$$R_p \triangleq E_F\left[(Y - X_p^\top \hat\beta_p)^2\right]. \tag{6}$$

We next

- Propose a good estimator $GC_p$ for $R_p$
- Study the properties of $GC_p$
- Derive the distribution of the $GC_p$ difference

Estimation of Predictive Risk: $GC_p$

Define the Generalized $C_p$ ($GC_p$) as follows.

$$GC_p = n^{-1}\,\mathrm{SSE} + 2n^{-1}\hat\xi^2 \tag{7}$$

where

$$\mathrm{SSE} = \|Y - X\hat\beta\|^2 \tag{8a}$$

$$\hat\xi^2 = \operatorname{tr}\left[\left(\frac{X^\top X}{n}\right)^{-1}\left(\frac{X^\top D_r^2 X}{n}\right)\right] \tag{8b}$$

and $D_r^2$ is the diagonal matrix with $D^2_{r,ii} = (Y_i - X_i^\top\hat\beta)^2$.
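Definitions (7), (8a) and (8b) translate almost line by line into NumPy. The sketch below is one possible implementation, not the authors' code; the function name `gcp` and the simulated data (a hypothetical nonlinear, heteroscedastic truth with nested polynomial submodels) are assumptions made for illustration.

```python
import numpy as np

def gcp(X, y):
    """Generalized C_p: n^-1 SSE + 2 n^-1 xi_hat^2, following (7), (8a), (8b)."""
    n = X.shape[0]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sse = float(resid @ resid)                      # (8a)
    A = X.T @ X / n                                 # X'X / n
    B = (X * resid[:, None] ** 2).T @ X / n         # X' D_r^2 X / n
    xi2 = float(np.trace(np.linalg.solve(A, B)))    # (8b): tr[A^-1 B]
    return sse / n + 2.0 * xi2 / n

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
# Hypothetical truth: nonlinear mean and noise scale depending on x.
y = np.sin(2.0 * x) + (0.5 + np.abs(x)) * rng.standard_normal(n)
X_full = np.column_stack([np.ones(n), x, x**2, x**3])

# GC_p for the nested submodels using the first p columns.
for p in range(1, 5):
    print(p, round(gcp(X_full[:, :p], y), 4))
```

Using `np.linalg.solve(A, B)` avoids forming the inverse of $X^\top X / n$ explicitly. Because $\operatorname{tr}[(X^\top X)^{-1} X^\top D_r^2 X] = \sum_i \hat\rho_i^2 h_{ii}$, with $h_{ii}$ the leverages, the penalty can also be read as a leverage-weighted sum of squared residuals.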

Properties of $GC_p$

Theorem I
$GC_p$ is a consistent estimator for the predictive risk $R$, i.e.,

$$GC_p \xrightarrow{P} R.$$

Remark: The theorems are true under mild assumptions such as existence of moments.

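$GC_p$: Derivation

$$
\begin{aligned}
R &\triangleq E\left[(Y - X^\top\hat\beta)^2\right] \\
&= E\left[(Y - X^\top\beta)^2\right] + E\left[\left(X^\top(\hat\beta - \beta)\right)^2\right] \\
&\approx E\left[(Y - X^\top\beta)^2\right] + n^{-1} E\left[X^\top \Sigma_{\text{sand}} X\right] && \text{(Sandwich)} \\
&\approx n^{-1}\|Y - X\beta\|^2 + n^{-1}\operatorname{tr}\left(\hat\Sigma_{\text{sand}}\, X^\top X/n\right) && \text{(Empirical moment)} \\
&\approx n^{-1}\|Y - X\hat\beta\|^2 + n^{-1}\operatorname{tr}\left(\hat\Sigma_{\text{sand}}\, X^\top X/n\right) + n^{-1}\operatorname{tr}\left(\hat\Sigma_{\text{sand}}\, X^\top X/n\right) \\
&= n^{-1}\|Y - X\hat\beta\|^2 + 2n^{-1}\operatorname{tr}\left[\left(\frac{X^\top X}{n}\right)^{-1}\left(\frac{X^\top D_r^2 X}{n}\right)\right] \\
&= n^{-1}\|Y - X\hat\beta\|^2 + 2n^{-1}\hat\xi^2 \triangleq GC_p.
\end{aligned}
$$

In the first three lines $X$ is the future observation; from the empirical moment step onward $X$ is the sample design matrix. The factor $n^{-1}$ in the sandwich step reflects that $\hat\beta - \beta$ has approximate covariance $\Sigma_{\text{sand}}/n$ by (4).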

$GC_p^{\text{boot}}$: $GC_p$ through Bootstrap

An alternative estimator formulation is obtained through the M-of-N bootstrap:

$$GC_p^{\text{boot}} \triangleq n^{-1}\|Y - X\hat\beta\|^2 + 2\operatorname{tr}\left(n^{-1} X^\top X\, \hat\Sigma_{\text{boot}}\right) \tag{9}$$

where

$$\hat\Sigma_{\text{boot}} = \frac{1}{n_{\text{boot}}} \sum_{i=1}^{n_{\text{boot}}} (\hat\beta_i^{bt} - \hat\beta)(\hat\beta_i^{bt} - \hat\beta)^\top$$

and $\hat\beta_i^{bt}$ is an M-of-N bootstrap OLS estimator.
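Formula (9) can be sketched with a pairs bootstrap. The code below is an illustration under stated assumptions rather than the authors' procedure: it resamples $(X_i, Y_i)$ pairs with replacement, takes the resample size `m` equal to `n` by default, and applies (9) literally with no rescaling. The slides do not say whether a rescaling is intended when `m` differs from `n`, so only the default is exercised here. The function names and simulated data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def gcp_boot(X, y, n_boot=500, m=None, seed=0):
    """Bootstrap GC_p following (9); m is the resample size (assumed m = n by default)."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    m = n if m is None else m
    beta_hat = ols(X, y)
    sse = float(np.sum((y - X @ beta_hat) ** 2))
    diffs = np.empty((n_boot, r))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)           # resample rows with replacement
        diffs[b] = ols(X[idx], y[idx]) - beta_hat  # beta_bt - beta_hat
    sigma_boot = diffs.T @ diffs / n_boot          # Sigma_hat_boot
    return sse / n + 2.0 * float(np.trace((X.T @ X / n) @ sigma_boot))

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.standard_normal(n)   # hypothetical data

value = gcp_boot(X, y)
print(round(value, 4))
```

With `m = n`, the bootstrap covariance of $\hat\beta$ is roughly $\hat\Sigma_{\text{sand}}/n$, so the penalty in (9) lands close to $2n^{-1}\hat\xi^2$ from the closed-form $GC_p$.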

$GC_p$ and $GC_p^{\text{boot}}$

To compare $GC_p$ and $GC_p^{\text{boot}}$ we have

Theorem II: $GC_p$ is the limit of the M-of-N bootstrap $GC_p^{\text{boot}}$ as $M \to \infty$ for a fixed sample of size $n$, i.e.,

$$\lim_{M \to \infty} GC_p^{\text{boot}} = GC_p.$$

Note: $GC_p$ and $GC_p^{\text{boot}}$ are different for fixed $n$.

Remark: $GC_p$ and Mallows $C_p$

Mallows' version for a sub-model of size $p$ is

$$C_p = \left(\mathrm{SSE}_p / \hat\sigma_r^2\right) - n + 2p \tag{10}$$

$C_p^U$, an alternate form of $C_p$:

$$C_p^U = n^{-1}\,\mathrm{SSE}_p + 2n^{-1} p\,\hat\sigma_r^2 \tag{11}$$

- $GC_p$ and Mallows $C_p$ are very different!
- Mallows $C_p$ is for fixed design, and all the related results only hold under strict linear model assumptions.
- Comparison and examples are presented in our paper.
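The two Mallows forms (10) and (11) are related by $C_p^U = \hat\sigma_r^2 (C_p + n)/n$, which the sketch below checks numerically. It assumes $\hat\sigma_r^2 = \mathrm{SSE}_r/(n - r)$, the usual unbiased variance estimate from the full model; the slides do not spell out this definition. The function names and data are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    return float(resid @ resid)

def mallows(X_full, y, p):
    """Return (C_p, C_p^U) for the sub-model made of the first p columns."""
    n, r = X_full.shape
    sigma2_r = sse(X_full, y) / (n - r)      # assumed: full-model variance estimate
    sse_p = sse(X_full[:, :p], y)
    c_p = sse_p / sigma2_r - n + 2 * p       # (10)
    c_u = sse_p / n + 2 * p * sigma2_r / n   # (11)
    return c_p, c_u

rng = np.random.default_rng(3)
n, r = 200, 4
X_full = np.column_stack([np.ones(n), rng.standard_normal((n, r - 1))])
y = X_full @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(n)   # hypothetical data

for p in range(1, r + 1):
    c_p, c_u = mallows(X_full, y, p)
    print(p, round(c_p, 2), round(c_u, 4))
```

The penalty in (11) uses one global variance estimate $p\,\hat\sigma_r^2$, whereas $\hat\xi^2$ in $GC_p$ weights each squared residual by its own leverage, which is where the two criteria part ways under heteroscedasticity or misspecification.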

Comparison of Sub-Models

For simplicity, let $M_p \subset M_{p+q}$ be two nested sub-models where

- $M_p = \{1, \ldots, p\}$ with $\beta^p = (\beta^p_1, \ldots, \beta^p_p)$
- $M_{p+q} = \{1, \ldots, p+q\}$ with $\beta^{p+q} = (\beta^{p+q}_1, \ldots, \beta^{p+q}_{p+q})$

Goal: Choose a model with $\min\{R_p, R_{p+q}\}$.

Question: How good is the decision based on $\Delta = GC_{p+q} - GC_p$?

Contiguity setup

Question: How good is the decision based on $\Delta = GC_{p+q} - GC_p$?

- Decisions based on $\Delta$ work well when the predictive risks of two nested submodels are well-separated, i.e., $R_p - R_{p+q} = O(1)$.
- The problem of interest is when the predictive risks of two nested submodels are close, i.e., under the contiguity condition, $R_p - R_{p+q} = O(1/n)$.

Distribution of $\Delta = GC_{p+q} - GC_p$

WLOG assume all the $X_i$'s are in their canonical form, i.e., $E(X_i) = 0$, $E(X_i X_j) = 0$ for $i \ne j$, and $E(X_i^2) = 1$. Consider two nested models $M_p \subset M_{p+q}$.

Theorem III
Under the contiguous setting, i.e. $R_{p+q} - R_p = O(1/n)$, and under the canonical conditions for $X$,

$$n\,(GC_{p+q} - GC_p) \to c_1 G_{\|Z\|^2} + c_2$$

in distribution.

Note: $Z$ follows a multivariate normal distribution and $G_{\|Z\|^2}$ denotes the CDF of $\|Z\|^2$.

Distribution of $\Delta = GC_{p+q} - GC_p$

As a special case of Theorem III, we have the following

Corollary
In addition to the canonical form, assume

- The full model is well-specified, and
- Homoscedasticity, i.e., $\mathrm{Var}(Y_i \mid X_i) = \sigma^2 = 1$

Then

$$n\,(GC_{p+q} - GC_p) \to -\chi^2_q\left(n\|\beta_{[p+1,\ldots,p+q]}\|^2\right) + 2q$$

and

$$n\,(R_{p+q} - R_p) = q - n\|\beta_{[p+1,\ldots,p+q]}\|^2.$$

$P\left(GC_{p+q} - GC_p < 0 \mid \rho, q\right)$

- $\rho = n\|\beta_{[p+1,\ldots,p+q]}\|^2 / q$
- $\rho \ge 1 \iff R_{p+q} \le R_p$
- $\rho = 0 \iff \|\beta_{[p+1,\ldots,p+q]}\|^2 = 0 \iff M_p = M_{p+q}$
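Under the Corollary's limit, $GC_{p+q} - GC_p < 0$ exactly when a noncentral $\chi^2_q$ variable with noncentrality $n\|\beta_{[p+1,\ldots,p+q]}\|^2 = q\rho$ exceeds $2q$. The sketch below approximates that limiting probability by Monte Carlo, using only NumPy; the function name, the number of draws and the grid of $\rho$ values are assumptions made for illustration.

```python
import numpy as np

def prob_choose_larger(rho, q, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(chi2_q(q * rho) > 2q), the limiting P(GC_{p+q} - GC_p < 0)."""
    rng = np.random.default_rng(seed)
    ncp = q * rho                            # noncentrality n * ||beta_[p+1..p+q]||^2
    mean = np.zeros(q)
    mean[0] = np.sqrt(ncp)                   # any mean vector with squared norm ncp works
    z = rng.standard_normal((n_draws, q)) + mean
    chi2 = np.sum(z**2, axis=1)              # noncentral chi-square draws
    return float(np.mean(chi2 > 2 * q))

for q in (1, 5):
    for rho in (0.0, 1.0, 4.0):
        print(q, rho, round(prob_choose_larger(rho, q), 3))
```

At $\rho = 0$ the two models coincide in risk terms, and the value is the chance that $GC_p$ still prefers the larger model; for $q = 1$ this is $P(\chi^2_1 > 2) \approx 0.157$.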

P(Choosing the model with smaller $R$)

- $\rho = n\|\beta_{[p+1,\ldots,p+q]}\|^2 / q$
- $\rho \ge 1 \iff R_{p+q} \le R_p$
- $\rho = 0 \iff \|\beta_{[p+1,\ldots,p+q]}\|^2 = 0 \iff M_p = M_{p+q}$

Set-up

- Sample size $n = 100, 1000$
- $X_1 = 1$, $X_i = \sqrt{2}\cos\left(\pi (i-1) U\right)$, $i = 2, \ldots, m+1 = r$, where $U \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-1, 1)$
- The design is in canonical form: $E(X_i^2) = 1$, $i = 1, \ldots, r$, and $E(X_i X_j) = 0$ for $i \ne j$
- $\beta_p^2 = \dfrac{E\left(X_p^2\, \sigma^2(\vec{X})\right) + \|\beta_{[-p]}\|^2}{n + p - 1}$
- $Y = X^\top\beta + \epsilon$ where $\epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1)$
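The cosine design can be generated and its canonical form checked empirically. The sketch below reads the set-up as one draw of $U$ per observation shared by all columns, which is an interpretation of the slide. The coefficient vector `beta` is hypothetical and is not the choice used in the talk, and the large `n` is only there to make the moment check sharp.

```python
import numpy as np

def cosine_design(n, r, rng):
    """Rows (1, sqrt(2) cos(pi U), ..., sqrt(2) cos(pi (r-1) U)) with U ~ Unif(-1, 1)."""
    u = rng.uniform(-1.0, 1.0, size=n)       # one U per observation (assumed reading)
    k = np.arange(r)                          # frequency i - 1 for column i = 1..r
    X = np.sqrt(2.0) * np.cos(np.pi * u[:, None] * k[None, :])
    X[:, 0] = 1.0                             # X_1 = 1 (intercept)
    return X

rng = np.random.default_rng(5)
n, r = 200_000, 5
X = cosine_design(n, r, rng)

# Canonical form: the empirical second-moment matrix should be close to the identity.
gram = X.T @ X / n
print(np.round(gram, 2))

beta = np.array([1.0, 0.5, 0.3, 0.2, 0.1])    # hypothetical coefficients
y = X @ beta + rng.standard_normal(n)         # Y = X'beta + eps, eps ~ N(0, 1)
```

Under this reading, every column is a function of the same $U$, so a submodel that drops columns is misspecified in the sense of the talk even though the full model is linear in $X$.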

Goal

For each of the models $M_1, M_2, \ldots, M_r$:

$$R_{M_i} \approx \sum_{j=1}^{10{,}000} GC_{p, M_i} \big/ 10{,}000$$
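A scaled-down version of this Monte Carlo comparison can be sketched as follows. Everything specific here is an assumption for illustration: 1,000 replications instead of 10,000, $n = 100$, $r = 4$, a hypothetical `beta`, and nested submodels $M_p$ made of the first $p$ columns. Because the design is canonical and the noise variance is 1, the risk of a fitted submodel for a new draw is $1 + \|\hat\beta_p - \beta_{[1..p]}\|^2 + \|\beta_{[p+1..r]}\|^2$, which the code uses as the reference value.

```python
import numpy as np

def cosine_design(n, r, rng):
    """Canonical cosine design with one U ~ Unif(-1, 1) per row (assumed reading)."""
    u = rng.uniform(-1.0, 1.0, size=n)
    X = np.sqrt(2.0) * np.cos(np.pi * u[:, None] * np.arange(r)[None, :])
    X[:, 0] = 1.0
    return X

def gcp_and_beta(X, y):
    """Return (GC_p, beta_hat) for the design X."""
    n = X.shape[0]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    A = X.T @ X / n
    B = (X * resid[:, None] ** 2).T @ X / n
    xi2 = np.trace(np.linalg.solve(A, B))
    return float(resid @ resid / n + 2.0 * xi2 / n), beta_hat

def simulate(n=100, beta=(1.0, 0.5, 0.3, 0.1), n_rep=1000, seed=0):
    """Average GC_p and average exact predictive risk for nested models M_1..M_r."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta)
    r = beta.size
    gcp_sum, risk_sum = np.zeros(r), np.zeros(r)
    for _ in range(n_rep):
        X = cosine_design(n, r, rng)
        y = X @ beta + rng.standard_normal(n)
        for p in range(1, r + 1):
            g, b = gcp_and_beta(X[:, :p], y)
            gcp_sum[p - 1] += g
            # Exact risk for a new draw, using E[XX'] = I and noise variance 1.
            risk_sum[p - 1] += 1.0 + np.sum((b - beta[:p]) ** 2) + np.sum(beta[p:] ** 2)
    return gcp_sum / n_rep, risk_sum / n_rep

mean_gcp, mean_risk = simulate()
print("mean GC_p:", np.round(mean_gcp, 3))
print("mean risk:", np.round(mean_risk, 3))
```

The two printed rows should track each other closely across submodels, which is the sense in which the averaged $GC_p$ stands in for $R_{M_i}$ on the slide.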

Summary

- Set up the assumption lean framework to explore the relationship between $Y$ and $X$.
- Studied the OLS estimator and the predictive risk $R$ under the assumption lean framework.
- Proposed the Generalized $C_p$ ($GC_p$) and an alternative, $GC_p^{\text{boot}}$, to estimate the predictive risk.
- Derived the distribution of the $GC_p$ difference between nested models.

Ongoing and Future Research

- $GC_p$-based decision rules are optimal.
- General formulation of the distribution of the $GC_p$ difference between non-nested models.
- $GC_p$ for Generalized Linear Models (GLM)
- Semi-supervised Regression

References

- Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., ... & Zhao, L. (2014). Models as approximations, Part I: A conspiracy of nonlinearity and random regressors in linear regression. arXiv preprint arXiv:1404.1578.
- Buja, A., Berk, R., Brown, L., George, E., Kuchibhotla, A. K., & Zhao, L. (2016). Models as Approximations Part II: A General Theory of Model-Robust Regression. arXiv preprint arXiv:1612.03257.
- Kuchibhotla, A. K., Brown, L. D., Buja, A., George, E. I., & Zhao, L. (2018). Valid Post-selection Inference in Assumption-lean Linear Regression. arXiv preprint arXiv:1806.04119.