Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail Traskin, Emil Pitkin, Dan McCarthy Mostly at the Department of Statistics, The Wharton School University of Pennsylvania WHOA-PSI 2017/08/14

Outline Rehash 1: The big problem, irreproducible empirical research Rehash 2: Review of the PoSI version of post-selection inference. Assumption-lean inference because degrees of misspecification are universal, especially after model selection. Regression functionals = statistical functionals of joint x y distributions. Assumption-lean standard errors: sandwich estimators, x y bootstrap Which is better, sandwich or bootstrap? Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 2 / 20

Principle: Models, yes, but they are not be believed Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 3 / 20

Principle: Models, yes, but they are not be believed Contributing to irreproducibility: Formal selection based on algorithms: Lasso, stepwise, AIC,... Informal selection based on plots, diagnostics,... Post hoc selection based on subject matter preference, cost,... Every potential (!) decision informed by data. Resulting models take advantage of data in unaccountable ways. Traditions of not believing in model correctness: G.E.P. Box:... Cox (1995): The very word model implies simplification and idealization. Tukey: The hallmark of good science is that it uses models and theory but never believes them. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 3 / 20

Principle: Models, yes, but they are not be believed Contributing to irreproducibility: Formal selection based on algorithms: Lasso, stepwise, AIC,... Informal selection based on plots, diagnostics,... Post hoc selection based on subject matter preference, cost,... Every potential (!) decision informed by data. Resulting models take advantage of data in unaccountable ways. Traditions of not believing in model correctness: G.E.P. Box:... Cox (1995): The very word model implies simplification and idealization. Tukey: The hallmark of good science is that it uses models and theory but never believes them. But: Lip service to these maxims is not enough. There are consequences. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 3 / 20

PoSI, just briefly PoSI: Reduction of post-selection inference to simultaneous inference Computational cost: exponential in p, linear in N PoSI covering all submodels feasible for p 20. Sparse PoSI for M 5, e.g., feasible for p 60. Coverage guarantees are marginal, not conditional. Statistical practice is minimally changed: Use a single ˆσ across all submodels M. For PoSI-valid CIs, enlarge the multiplier of SE[ ˆβ j M ] suitably. PoSI covers for formal, informal, and post-hoc selection. PoSI solves the circularity problem of selective inference: Selecting regressors with PoSI tests does not invalidate the tests. PoSI is possible for response and transformation selection. Current PoSI allows 1st order misspecification, but uses fixed-x theory and assumes normality, homoskedasticity and a valid ˆσ. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 4 / 20

Under Misspecification, Conditioning on X is Wrong Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Under Misspecification, Conditioning on X is Wrong What is the justification for conditioning on X, i.e., treating X as fixed? Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Under Misspecification, Conditioning on X is Wrong What is the justification for conditioning on X, i.e., treating X as fixed? Answer: Fisher s ancillarity argument for X. p(y, x; β (1) ) p(y, x; β (0) ) Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Under Misspecification, Conditioning on X is Wrong What is the justification for conditioning on X, i.e., treating X as fixed? Answer: Fisher s ancillarity argument for X. p(y, x; β (1) ) p(y, x; β (0) ) = p(y x; β(1) ) p(x) p(y x; β (0) ) p(x) Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Under Misspecification, Conditioning on X is Wrong What is the justification for conditioning on X, i.e., treating X as fixed? Answer: Fisher s ancillarity argument for X. p(y, x; β (1) ) p(y, x; β (0) ) = p(y x; β(1) ) p(x) p(y x; β (0) ) p(x) = p(y x; β(1) ) p(y x; β (0) ) Important! The ancillarity argument assumes correctness of the model. Refutation of Regressor Ancillarity under Misspecification: Assume xi random, y i = f (x i ) nonlinear deterministic, no errors. Fit a linear function ŷi = β 0 + β 1 x i. A situation with misspecification, but no errors. Conditional on x i, estimates ˆβ j have no sampling variability. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Under Misspecification, Conditioning on X is Wrong What is the justification for conditioning on X, i.e., treating X as fixed? Answer: Fisher s ancillarity argument for X. p(y, x; β (1) ) p(y, x; β (0) ) = p(y x; β(1) ) p(x) p(y x; β (0) ) p(x) = p(y x; β(1) ) p(y x; β (0) ) Important! The ancillarity argument assumes correctness of the model. Refutation of Regressor Ancillarity under Misspecification: Assume xi random, y i = f (x i ) nonlinear deterministic, no errors. Fit a linear function ŷi = β 0 + β 1 x i. A situation with misspecification, but no errors. Conditional on x i, estimates ˆβ j have no sampling variability. However, watch The Unconditional Movie... source("http://stat.wharton.upenn.edu/ buja/src-conspiracy-animation2.r") Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 5 / 20

Random X and Misspecification Randomness of X and misspecification conspire to create sampling variability in the estimates. Consequence: Under misspecification, conditioning on X is wrong. Method 1 for model-robust & unconditional inference: Asymptotic inference based on Eicker/Huber/White s Sandwich Estimator of Standard Error V[ ˆβ] = E[ X X ] 1 E[(Y X β) X X ] E[ X X ] 1. Method 2 for model-robust & unconditional inference: the Pairs or x-y Bootstrap (not the residual bootstrap) Validity of inference is in an assumption-lean/model-robust framework... Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 6 / 20

An Assumption-Lean Framework Joint ( x, y) distribution, i.i.d. sampling: (y i, x i ) P(dy, d x) = P Y, X = P Assume properties sufficient to grant CLTs for estimates of interest. No assumptions on µ( x) = E[Y X = x ], σ 2 ( x) = V[Y X = x] Define a population OLS parameter: [ ( ) ] 2 β := argmin β E Y β X = E[ X X ] -1 E[ X Y ] Target of inference: β = β(p) (best L 2 approximation at P) OLS Estimator: ˆβ = β(ˆp N ) (plug-in estimator, ˆP N = 1 N δ( xi,y i )) β = Statistical functional =: Regression Functional Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 7 / 20

Implication 1 of Misspecification misspecified well-specified Y Y Y = µ(x) Y = µ(x) X Under misspecification parameters depend on the X distribution: β(p Y, X ) = β(p Y X, P X ) β(p Y, X ) = β(p Y X ) Upside down: Ancillarity = parameters do not affect the X distribution. Upside up: "Ancillarity = the parameters are not affected by the X distribution. X Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 8 / 20

Implication 2 of Misspecification misspecified well-specified y y x x Misspecification and random X generate sampling variability: V[ E[ ˆβ X ] ] > 0 V[ E[ ˆβ X ] ] = 0 (Recall The Unconditional Movie. ) Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 9 / 20

Diagnostic for Dependence on the X-Distribution 10 20 30 40 50 60 70 2.0 1.5 1.0 0.5 0.0 MedianInc Coeff: MedianInc 0 2 4 6 8 10 0 5 10 15 PercVacant Coeff: PercVacant 0 20 40 60 80 1.0 0.5 0.0 0.5 1.0 1.5 PercMinority Coeff: PercMinority 0 20 40 60 80 1.0 0.5 0.0 0.5 PercResidential Coeff: PercResidential 10 0 10 20 30 1 0 1 2 PercCommercial Coeff: PercCommercial 10 0 10 20 1 0 1 2 PercIndustrial Coeff: PercIndustrial Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 10 / 20

CLT under Misspecification Sandwich Estimator N ( ˆβ β(p) ) D ( N 0, E[ x x ] 1 E[ (y x β(p)) 2 x x ] E[ x x ] 1 ) }{{} AV [P] AV [P] is another statistical functional (dependence on β not shown). Sandwich Estimator of Asymptotic Variance: AV [ˆP N ] (Plug-in) Sandwich Standard Error Estimate: ˆ SE sand [ ˆβ j ] = Assumption-Lean Asymptotically Valid Standard Error 1 N AV [ˆP N ] jj (Linear Models Assumptions: ɛ = y x β(p) AV [P] = E[ x x ] 1 σ 2.) Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 11 / 20

The x-y Bootstrap for Regression Assumption-lean framework: The nonparametric x-y bootstrap applies. Estimate SE( ˆβ j ) by Resample ( x i, y i ) pairs ( x i, y i ) ˆβ. ˆ SEboot ( ˆβ j ) = SD (β j ). Let SElin ˆ ( ˆβ j ) = models theory. ˆσ x j be the usual standard error erstimate of linear Question: Is the following always true? ˆ SE boot ( ˆβ j )? ˆ SElin ( ˆβ j ) Compare conventional and bootstrap standard errors empirically... Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 12 / 20

Conventional vs Bootstrap Std Errors LA Homeless Data (Richard Berk, UPenn) Response: StreetTotal of homeless in a census tract, N = 505 R 2 0.13, residual dfs = 498 ˆβ j SE lin SE boot SE boot /SE lin t lin MedianInc -4.241 4.342 2.651 0.611-0.977 PropVacant 18.476 3.595 5.553 1.545 5.140 PropMinority 2.759 3.935 3.750 0.953 0.701 PerResidential -1.249 4.275 2.776 0.649-0.292 PerCommercial 10.603 3.927 5.702 1.452 2.700 PerIndustrial 11.663 4.139 7.550 1.824 2.818 Reason for the discrepancy: misspecification. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 13 / 20

Sandwich and x-y Bootstrap Estimators The x-y bootstrap and sandwich are both asymptotically correct in the assumption-lean framework. Connection? Which should we use? Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 14 / 20

Sandwich and x-y Bootstrap Estimators The x-y bootstrap and sandwich are both asymptotically correct in the assumption-lean framework. Connection? Which should we use? Consider the M-of-N bootstrap (e.g., Bickel, Götze, van Zwet, 1997): Draw M iid resamples ( x i, yi ) (i = 1...M) from samples ( x i, y i ) (i = 1...N). PM = 1 δ( x M ), ˆPN = 1 δ( xi N,y i ). i,y i M-of-N Bootstrap estimator of AV [P]: M VˆPN [ β(p M) ] Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 14 / 20

Sandwich and x-y Bootstrap Estimators The x-y bootstrap and sandwich are both asymptotically correct in the assumption-lean framework. Connection? Which should we use? Consider the M-of-N bootstrap (e.g., Bickel, Götze, van Zwet, 1997): Draw M iid resamples ( x i, yi ) (i = 1...M) from samples ( x i, y i ) (i = 1...N). PM = 1 δ( x M ), ˆPN = 1 δ( xi N,y i ). i,y i M-of-N Bootstrap estimator of AV [P]: M VˆPN [ β(p M) ] Apply the CLT to β(p M ) as M for fixed N, treating ˆP N as population: M ( β(p M ) β(ˆp N ) ) D N ( 0, AV (ˆP N ) ) M Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 14 / 20

Sandwich and x-y Bootstrap Estimators The x-y bootstrap and sandwich are both asymptotically correct in the assumption-lean framework. Connection? Which should we use? Consider the M-of-N bootstrap (e.g., Bickel, Götze, van Zwet, 1997): Draw M iid resamples ( x i, yi ) (i = 1...M) from samples ( x i, y i ) (i = 1...N). PM = 1 δ( x M ), ˆPN = 1 δ( xi N,y i ). i,y i M-of-N Bootstrap estimator of AV [P]: M VˆPN [ β(p M) ] Apply the CLT to β(p M ) as M for fixed N, treating ˆP N as population: M ( β(p M ) β(ˆp N ) ) D N ( 0, AV (ˆP N ) ) M Consequence: M VˆPN [ β(p M ) ] M AV (ˆP N ) Bootstrap Estimator of AV Sandwich Estimator of AV Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 14 / 20

Connection with M-Bagging So we have a family of bootstrap estimators M VˆPN [ β(p M ) ] of asymptotic variance, parametrized by M. Connection with bagging: Bagging = averaging of an estimate over bootstrap resamples. For example, M-bagged β(ˆp N ): EˆPN [ β(p M ) ] Observation about bootstrap variance (assume β one-dimensional): VˆPN [ β(p M) ] = EˆPN [ β(p M) 2 ] EˆPN [ β(p M) ] 2 A simple function of M-bagged functionals Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 15 / 20

M-Bagged Functionals Rather than Statistics Q: What is the target of estimation of an M-bagged statistic EˆPN [ β(p M ) ]? A: Lift the notion of M-bagging to functionals: ( x i, y i ) P, rather than ˆP N β B M(P) = E P [ β(p M) ]. The functional β B M(P) is the M-bagged version of the functional β(p). Plug-in produces the usual bagged statistic: Its target of estimation is β B M(P). β B M(ˆP N ) = EˆPN [ β(p M) ]. The notion of M-bagged functional is made possible by distinguishing resample size M and sample size N. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 16 / 20

Von Mises Expansions of M-Bagged Functionals Von Mises expansion of β( ) around P: β(q) = β(p) + E Q [ψ 1 (X)] + 1 2 E Q[ψ 2 (X 1, X 2 )] +... ψ k (X 1,..., X k ) = k th order influence function at P Used to prove asymptotic normality with Q = ˆP N. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 17 / 20

Von Mises Expansions of M-Bagged Functionals Von Mises expansion of β( ) around P: β(q) = β(p) + E Q [ψ 1 (X)] + 1 2 E Q[ψ 2 (X 1, X 2 )] +... ψ k (X 1,..., X k ) = k th order influence function at P Used to prove asymptotic normality with Q = ˆP N. Theorem (B-S): M-bagged functionals have von Mises expansions of length M +1. Detail: The influence functions are obtained from the Serfling-Efron-Stein Anova decomposition of β(p M ) = β(x 1,..., X M ). Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 17 / 20

M-Bootstrap or Sandwich? A sense of smoothing achieved by bagging: The smaller M, the shorter the von Mises expansion, the smoother the M-bagged functional. Recall: M-of-N bootstrap estimates of standard error are simple functions of M-bagged functionals, and sandwich estimates of standard error are the limit M. It seems that some form of bootstrapping should be beneficial by providing more stable standard error estimates. (This agrees with Bickel, Götze, van Zwet (1997) who let M/N 0.) Caveat: This is not a fully worked out argument, rather a heuristic. The OLS functional β(p) is already very smooth, so there may not be much of an effect. But: The argument goes through for any type of regression, not just OLS. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 18 / 20

Conclusions Misspecification makes fixed-x regression illegitimate. Misspecification and random X create sampling variability. Misspecification creates dependence of parameters on X distribution. Such dependence can be diagnosed. Misspecification may invalidate usual standard errors. Alternatives: M-of-N bootstrap, x-y version, not residuals Bootstrap standard errors may be more stable than sandwich. Bagging makes anything smooth. Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 19 / 20

Conclusions Misspecification makes fixed-x regression illegitimate. Misspecification and random X create sampling variability. Misspecification creates dependence of parameters on X distribution. Such dependence can be diagnosed. Misspecification may invalidate usual standard errors. Alternatives: M-of-N bootstrap, x-y version, not residuals Bootstrap standard errors may be more stable than sandwich. Bagging makes anything smooth. THANKS! Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 19 / 20

The Meaning of Slopes under Misspecification Allowing misspecification messes up our regression practice: What is the meaning of slopes under misspecification? y y x x Case-wise slopes: ˆβ = i w i b i, b i := y i x i, w i := x 2 i i x 2 i Pairwise slopes: ˆβ = ik w ik b ik, b ik := y i y k x i x k, w ik := (x i x k ) 2 i k (x i x k )2 Andreas Buja (Wharton, UPenn) Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference 2017/08/14 20 / 20