Statistical Inference
Liu Yang
Florida State University
October 27, 2016
Outline

The Bayesian Lasso
  Trevor Park and George Casella (2008). The Bayesian Lasso. Journal of the American Statistical Association, June 2008, Vol. 103, No. 482, Theory and Methods.

Bootstrap Lasso
  Trevor Hastie, Robert Tibshirani and Martin Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations (pp. 142-147). Chapman and Hall/CRC.

Post-Selection Inference for the Lasso: The Covariance Test
  Trevor Hastie, Robert Tibshirani and Martin Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations (pp. 147-150). Chapman and Hall/CRC.
Why Statistical Inference

- Statistical inference provides confidence intervals and measures of the statistical strength of variables, such as p-values, in models.
- Statistical inference is well developed for low-dimensional problems; for high-dimensional problems, traditional methods may not be applicable.
- We therefore need statistical inference methods suited to high-dimensional studies.
The Bayesian Lasso

We adopt the approach of Park and Casella (2008), involving a hierarchical model of the form

  y | µ, X, β, σ² ∼ N_n(µ1_n + Xβ, σ²I_n)    (1)

  β | λ, σ² ∼ ∏_{j=1}^p (λ / (2√σ²)) e^{−λ|β_j|/√σ²}    (2)

For a complete Bayesian model, we use the improper prior density for σ² and a hyperprior for λ² from the class of gamma priors:

  π(σ²) = 1/σ²  and  π(λ²) = (δ^r / Γ(r)) (λ²)^{r−1} exp(−δλ²)    (3)
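Posterior computation under (1)-(3) is typically done with a Gibbs sampler. Below is a minimal sketch following the full conditional distributions derived in Park and Casella (2008); the code itself (function name, defaults, centering step, numerical guard) is our own illustration, not the authors' implementation.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, r=1.0, delta=1.0, n_iter=5000, seed=0):
    """Gibbs sampler for the hierarchy in (1)-(3); returns posterior draws of beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    y = y - y.mean()                      # center y so the intercept mu drops out
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2, lam2, tau2 = 1.0, 1.0, np.ones(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}),  A = X'X + D_tau^{-1}
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # sigma2 | rest ~ Inv-Gamma((n-1)/2 + p/2, ||y - X beta||^2/2 + beta' D^{-1} beta / 2)
        resid = y - X @ beta
        rate = resid @ resid / 2.0 + beta @ (beta / tau2) / 2.0
        sigma2 = 1.0 / rng.gamma((n - 1) / 2.0 + p / 2.0, 1.0 / rate)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam2 * sigma2 / beta_j^2), lam2)
        mu_ig = np.sqrt(lam2 * sigma2) / (np.abs(beta) + 1e-12)  # guard against beta_j ~ 0
        tau2 = 1.0 / rng.wald(mu_ig, lam2)
        # lam2 | rest ~ Gamma(p + r, rate = sum(tau_j^2)/2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (tau2.sum() / 2.0 + delta))
        draws[t] = beta
    return draws
```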
Bayesian Lasso

[Figure: Prior and posterior distribution for the seventh variable in the diabetes example.]
Bootstrap Lasso

- The bootstrap is a popular method for assessing the statistical properties of complex estimators.
- How do we obtain the sampling distribution of β̂ through the bootstrap?
- The nonparametric bootstrap is one method for approximating this sampling distribution; the parametric bootstrap is another.
Bootstrap Lasso

Suppose that we have obtained an estimate β̂(λ̂_CV) for a lasso problem via the following cross-validation procedure:

- Fit a lasso path to (X, y) over a dense grid of values Λ = {λ_ℓ}, ℓ = 1, ..., L.
- Divide the training samples into 10 groups at random.
- With the k-th group left out, fit a lasso path to the remaining 9/10ths, using the same grid Λ.
Bootstrap Lasso

- For each λ ∈ Λ, compute the mean-squared prediction error on the left-out group.
- Average these errors to obtain a prediction error curve over the grid Λ.
- Find the value λ̂_CV that minimizes this curve, and return the coefficient vector from the original full-data fit at that value of λ (see the sketch below).
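A minimal sketch of this procedure using scikit-learn's LassoCV, which implements essentially these steps (note that scikit-learn calls λ "alpha" and scales the penalty by 1/n); the wrapper function is our own:

```python
from sklearn.linear_model import LassoCV

def fit_lasso_cv(X, y, n_lambdas=100):
    # 10-fold CV over a shared penalty grid, as in the steps above
    model = LassoCV(n_alphas=n_lambdas, cv=10).fit(X, y)
    return model.alpha_, model.coef_   # lambda-hat_CV and beta-hat(lambda-hat_CV)
```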
Nonparametric Bootstrap

- It approximates the cumulative distribution function F of the random pair (X, Y) by the empirical CDF F̂_N defined by the N samples.
- Draw N samples with replacement from the given dataset (see the sketch below).
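A minimal nonparametric-bootstrap sketch (our own illustration): resample (x_i, y_i) pairs with replacement and refit the lasso at the penalty chosen by CV; "alpha" here is scikit-learn's penalty parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nonparametric_bootstrap_lasso(X, y, alpha, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((B, p))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # N draws with replacement
        coefs[b] = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
    return np.percentile(coefs, [2.5, 97.5], axis=0)     # 95% percentile intervals
```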
Parametric Bootstrap

- Here we have a parametric estimate of F, or of its corresponding density function f.
- We can resample the residuals, or sample from the fitted Gaussian regression model (see the sketch below).
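A minimal residual-bootstrap sketch, one common parametric-bootstrap variant (our own illustration): fix the design, resample centered residuals, and refit at the same penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

def residual_bootstrap_lasso(X, y, alpha, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    base = Lasso(alpha=alpha).fit(X, y)
    fitted = base.predict(X)
    resid = y - fitted
    resid -= resid.mean()                                # center the residuals
    coefs = np.empty((B, p))
    for b in range(B):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        coefs[b] = Lasso(alpha=alpha).fit(X, y_star).coef_
    return coefs
```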
Bayesian Lasso vs Bootstrap Lasso

- In a general sense, results for the Bayesian lasso and the lasso/bootstrap are similar.
- The nonparametric bootstrap can be viewed as a kind of posterior-Bayes estimate under a non-informative prior in the multinomial model (Rubin 1981; Efron 1982).
- The Bayesian lasso leans more heavily on parametric assumptions.
- The bootstrap scales better.
Bayesian Lasso vs Bootstrap Lasso

Table: Timing for the Bayesian lasso and the bootstrapped lasso

  p     Bayesian Lasso    Lasso/Bootstrap
  10    3.3 secs          163.8 secs
  50    184.8 secs        374.6 secs
  100   28.6 mins         14.7 mins
  200   4.5 hours         18.1 mins
The Covariance Test

- Bayesian methods and the bootstrap are two "traditional" approaches; we would now like to present some newer ones.
- We describe two methods proposed for assigning p-values or confidence intervals to predictors as they are successively entered by the lasso and by forward stepwise regression.
- The two methods give different results.
The Covariance Test

We start from the usual linear regression setup:

  y = Xβ + ε,  ε ∼ N(0, σ²I_{N×N})    (4)

Consider forward stepwise regression. Defining RSS_k to be the residual sum of squares for the model containing k predictors, we can use the change in residual sum of squares to form the test statistic

  R_k = (1/σ²)(RSS_{k−1} − RSS_k)    (5)

and compare it to a χ²₁ distribution (a naive sketch follows).
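A small sketch of the naive test based on (5) (illustrative only; as the slide after next explains, this ignores the adaptive selection effect):

```python
from scipy.stats import chi2

def rss_drop_pvalue(rss_prev, rss_curr, sigma2):
    R_k = (rss_prev - rss_curr) / sigma2
    return R_k, chi2.sf(R_k, df=1)       # upper-tail p-value against chi^2_1
```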
The Covariance Test

- The forward stepwise procedure has chosen the strongest predictor among all available choices, so it yields a larger drop in training error than would be expected under the null.
- It is therefore difficult to derive an appropriate p-value for forward stepwise regression if we want to properly account for the adaptive nature of the fitting.
- For the lasso, a simple test can be derived that properly accounts for this adaptivity.
The Covariance Test

Suppose that we wish to test the significance of the predictor entered by LAR at λ_k. Let A_{k−1} be the set of predictors with nonzero coefficients before this predictor was added, and let the estimate at the end of this step be β̂(λ_{k+1}). We refit the lasso, keeping λ = λ_{k+1} but using just the variables in A_{k−1}; this yields the estimate β̂_{A_{k−1}}(λ_{k+1}). The covariance test statistic is

  T_k = (1/σ²)(⟨y, Xβ̂(λ_{k+1})⟩ − ⟨y, Xβ̂_{A_{k−1}}(λ_{k+1})⟩)    (6)

  T_k → Exp(1)    (7)

This statistic measures how much of the covariance between the outcome and the fitted model can be attributed to the predictor that has just entered the model. We can present a quantile-quantile plot for T_1 versus Exp(1):
[Figure: Quantile-quantile plot of T_1 versus its Exp(1) null distribution.]
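A sketch of computing T_k in (6) (our own illustration): centered data and no intercept are assumed, and note that scikit-learn's Lasso uses the alpha = λ/n penalty scaling.

```python
import numpy as np
from scipy.stats import expon
from sklearn.linear_model import Lasso

def covariance_test(X, y, active_prev, lam_next, sigma2):
    n = X.shape[0]
    full = Lasso(alpha=lam_next / n, fit_intercept=False).fit(X, y)
    cov_full = y @ (X @ full.coef_)
    if len(active_prev) > 0:   # refit using only variables active before this step
        Xa = X[:, active_prev]
        sub = Lasso(alpha=lam_next / n, fit_intercept=False).fit(Xa, y)
        cov_sub = y @ (Xa @ sub.coef_)
    else:
        cov_sub = 0.0          # at the first step, A_0 is empty
    T_k = (cov_full - cov_sub) / sigma2
    return T_k, expon.sf(T_k)  # p-value under the Exp(1) null
```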
The Covariance Test

- The exponential limiting distribution for the covariance test requires certain conditions on the data: the signal variables must not be too correlated with the noise variables.
- It also assumes linearity of the underlying model.
- In the next section, Libo will present a more general scheme that yields the spacing test, which works for any data matrix X and whose null distribution holds exactly for finite N and p.
References

Trevor Hastie, Robert Tibshirani, Martin Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC.

Trevor Park and George Casella (2008). The Bayesian Lasso. Journal of the American Statistical Association, June 2008, Vol. 103, No. 482, Theory and Methods.
Post-selection Inference and Bayesian Inference

Libo Wang
Department of Statistics, Florida State University
Oct 19th, 2016
Motivation

- Assigning significance in high-dimensional regression is challenging.
- Most computationally efficient selection algorithms cannot guard against the inclusion of noise variables.
- Asymptotically valid p-values are not available.
- Statistics versus machine learning.
What is post-selection inference?

Inference the old way (pre-1980?):
1. Devise a model
2. Collect data
3. Test hypotheses
(classical inference)

Inference the new way:
1. Collect data
2. Select a model
3. Test hypotheses
(post-selection inference)

Classical tools cannot be used post-selection because they do not yield valid inferences: they are generally too optimistic, understating the p-values.

The reason? For a parametric model with full column rank (p < n), the true parameter β is well defined as the target of statistical inference. When we allow p ≥ n, we must add a selection step before working on inference, so that it involves only submodels of full column rank.
Example: Lasso with fixed λ

- HIV data: mutations that predict response to a drug.
- Selection intervals for the lasso with fixed tuning parameter λ.
Formal goal of post-selective inference [Lee et al.; Fithian, Sun, Taylor]

Having selected a model M̂ based on our data y, we'd like to test a hypothesis Ĥ₀.

Note that Ĥ₀ is random: it is a function of the selected model and hence of y.

If our rejection region is {T(y) ∈ R}, we want to control the selected type I error:

  Prob(T(y) ∈ R | M̂, Ĥ₀) ≤ α
Existing Approaches

- Data splitting: fit on one half of the data, do inference on the other half. Problem: the fitted model varies with the random choice of halves, and power is lost. Reference: Meinshausen N, Meier L, Buhlmann P (2009). P-Values for High-Dimensional Regression.
- Permutations and related methods: it is not clear how to use these beyond the global null.
A key mathematical result

Polyhedral lemma: provides a good solution for forward stepwise (FS), and an optimal solution for the fixed-λ lasso.

Polyhedral selection events: let the response vector be y ∼ N(µ, Σ). Suppose we make a selection that can be written as {y : Ay ≤ b} with A, b not depending on y. This is true for forward stepwise regression, the lasso with fixed λ, least angle regression, and other procedures.
Some intuition from forward stepwise regression

- Suppose that we run forward stepwise regression for k steps.
- {y : Ay ≤ b} is the set of y vectors that would yield the same predictors, and their signs, entered at each step.
- Each step represents a competition involving the inner products between each x_j and y; the polyhedron Ay ≤ b summarizes the results of the competition after k steps.
- A similar result holds for the lasso (fixed λ or LAR).
Example: The lasso and its selection event

Lasso estimation:

  β̂ ∈ argmin_β ‖y − Xβ‖²₂ / 2 + λ‖β‖₁

Re-write the KKT conditions by partitioning them according to the active set (M̂) and the inactive set (−M̂) (to be continued):

  X_M̂ᵀ(X_M̂ β̂_M̂ − y) + λŝ_M̂ = 0
  X_{−M̂}ᵀ(X_M̂ β̂_M̂ − y) + λŝ_{−M̂} = 0
  sign(β̂_M̂) = ŝ_M̂,  ‖ŝ_{−M̂}‖_∞ < 1
Example: The lasso and its selection event

The selection event {M̂ = M, ŝ_M̂ = s} occurs if and only if there exist w, u with

  X_Mᵀ(X_M w − y) + λs = 0
  X_{−M}ᵀ(X_M w − y) + λu = 0
  sign(w) = s,  ‖u‖_∞ < 1

Solve for w and u from the first two equations; substituting these expressions into the sign and norm constraints yields a set of affine inequalities in y, i.e., an event of the form {Ay ≤ b}.
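As a quick numerical illustration (our own sketch, not from the slides), one can verify these KKT conditions on a lasso fit; recall that scikit-learn's Lasso minimizes (1/2n)‖y − Xβ‖² + α‖β‖₁, so α = λ/n, and centered data are assumed.

```python
import numpy as np
from sklearn.linear_model import Lasso

def kkt_check(X, y, lam, tol=1e-4):
    n = X.shape[0]
    beta = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
    g = X.T @ (X @ beta - y)             # gradient of the quadratic part
    active = np.abs(beta) > tol
    # Active set: g_j = -lam * sign(beta_j); inactive set: |g_j| <= lam
    ok_active = np.allclose(g[active], -lam * np.sign(beta[active]), atol=lam * tol)
    ok_inactive = np.all(np.abs(g[~active]) <= lam * (1 + tol))
    return ok_active, ok_inactive
```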
The polyhedral lemma [Lee et al.; Ryan Tibshirani et al.]

For a fixed vector η,

  F_{ηᵀµ, σ²‖η‖²₂}^{[ν⁻, ν⁺]}(ηᵀy) | {Ay ≤ b} ∼ Unif(0, 1)

(a truncated Gaussian distribution), where ν⁻, ν⁺ are (computable) values that are functions of η, A, b. Here F_{µ,σ²}^{[a,b]} denotes the CDF of a N(µ, σ²) random variable truncated to the interval [a, b], that is,

  F_{µ,σ²}^{[a,b]}(x) = [Φ((x − µ)/σ) − Φ((a − µ)/σ)] / [Φ((b − µ)/σ) − Φ((a − µ)/σ)]

Typically we choose η so that ηᵀy is the partial least squares estimate for a selected variable.
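A minimal sketch of using this pivot to compute a post-selection p-value, assuming ν⁻ and ν⁺ have already been computed from (η, A, b) (that computation is omitted here; names are illustrative):

```python
from scipy.stats import norm

def truncated_gaussian_cdf(x, mu, sigma, lo, hi):
    """CDF of N(mu, sigma^2) truncated to [lo, hi], as in the lemma."""
    num = norm.cdf((x - mu) / sigma) - norm.cdf((lo - mu) / sigma)
    den = norm.cdf((hi - mu) / sigma) - norm.cdf((lo - mu) / sigma)
    return num / den

def selective_pvalue(eta_y, sigma_eta, v_minus, v_plus):
    """Two-sided post-selection p-value for H0: eta' mu = 0."""
    u = truncated_gaussian_cdf(eta_y, 0.0, sigma_eta, v_minus, v_plus)
    return 2.0 * min(u, 1.0 - u)
```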
[Figure: Schematic illustrating the polyhedral lemma for the case N = 2.]
Example: Fixed-λ inference for the lasso

- The lasso intervals are computed conditional on the selection procedure, which is adaptive to the data (we choose a probabilistic model for the data first, then formulate the test).
- The OLS model is pre-specified, using only the 7 variables selected by the lasso.
- Selection-adjusted intervals are similar for strong signals, but wider for weak signals: those variables lie close to the boundary of the selection region, so a wide range of values of µ would be consistent with the observations.
Current work: the lasso with λ estimated by cross-validation, and unknown σ

- "Selective inference with a randomized response" by Tian and Taylor: one can condition on the selection of λ by CV, in addition to the selection of the model. It is not yet clear how inference differs between the lasso with fixed λ and the lasso with λ estimated by cross-validation.
- "Selective inference with unknown variance via the square-root LASSO" by Tian, Loftus and Taylor: focuses on adapting post-selection inference to the case of unknown σ and the choice of tuning parameter. The square-root lasso is attractive because its λ can be chosen independently of the noise level σ.
Improving the power

- The preceding approach conditions on the part of y orthogonal to the direction of interest η.
- This is for computational convenience, as it yields an analytic solution.
- Conditioning on less gives more power. Are we conditioning on too much?
Data splitting, carving, and adding noise

Further improvements in power [Fithian, Sun, Taylor, Tian]. Selective inference yields the correct post-selection type I error, but confidence intervals are sometimes quite large. How can we do better? (Roughly: make the randomness used in selection independent of the data used for inference.)

- Data carving: withhold a small proportion (say 10%) of the data in the selection stage, then use all the data for inference (conditioning via the theory outlined above).
- Randomized response: add noise to y in the selection stage. This is like withholding data, but smoother. Then use the noise-free data in the inference stage. Related to differential privacy techniques. (A toy sketch follows.)
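A toy sketch (our own illustration) of the randomized-response idea: run the selection on a noise-perturbed copy of y; inference would then use the original y, conditioning on this randomized selection (the inference step is not shown here).

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_selection(X, y, lam, noise_sd, seed=0):
    rng = np.random.default_rng(seed)
    y_noisy = y + rng.normal(0.0, noise_sd, size=y.shape)   # perturb before selecting
    beta = Lasso(alpha=lam / len(y), fit_intercept=False).fit(X, y_noisy).coef_
    return np.flatnonzero(np.abs(beta) > 1e-8)              # indices of selected variables
```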
Alternative method: Bayesian quantile regression inference

Setting: y_i = β₀ + β₁X_{1i} + β₂X_{2i} + ε_i with β₀ = 1/3, β₁ = β₂ = 1, and ε_i ∼ Exp(1) − log(2), so that the errors have median zero.

Table: Simulation results of Bayesian quantile regression

  Method   β̂₀      β̂₁      β̂₂      Err(β̂₀)×100   Err(β̂)×100   Err(ŷ)×100
  BQR      0.333    0.998    1.010    0.228          0.717         23.154
  QR       0.325    0.999    1.010    0.249          0.821         27.567
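A sketch of this simulation setting (our own code, assuming the Exp(1) − log 2 reading of the error distribution). The QR row corresponds to classical median regression, available via statsmodels' QuantReg; the BQR row would require an MCMC sampler (e.g., with an asymmetric-Laplace likelihood), not shown here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
eps = rng.exponential(1.0, size=n) - np.log(2)       # median-zero errors
y = 1.0 / 3.0 + X @ np.array([1.0, 1.0]) + eps
fit = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)  # classical median regression
print(fit.params)                                    # estimates of (beta0, beta1, beta2)
```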
Difference between Bayesian and classical frequentist inference

Frequentist:
1. Point estimates and standard errors or 95% confidence intervals.
2. Deduction from P(data | H₀), setting α in advance.
3. Accept H₁ if P(data | H₀) < α.

Bayesian:
1. Induction from P(θ | data), starting with P(θ).
2. Broad descriptions of the posterior distribution, such as means and quantiles.

Frequentist: P(data | H₀) is the sampling distribution of the data given the parameter.
Bayesian: P(θ) is the prior distribution; P(θ | data) is the posterior distribution of the parameter.
Bayesian feature selection methods

- Laplacian shrinkage: the Bayesian lasso.
- Adaptive shrinkage: spike and slab.
- Obtain the selection inference P(θ | data) by running 5000 or more MCMC iterations.
Conclusions

- Post-selection inference is an exciting new area, with lots of potential research problems and generalizations.
- Bayesian and frequentist methods both have drawbacks in finite-sample settings.
- R package on CRAN: selectiveinference (forward stepwise regression, lasso, LARS).
References

Hastie, Tibshirani, Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations, Chapter 6. Chapman and Hall/CRC.

Lee, Sun, Sun, Taylor (2013). Exact post-selection inference with the lasso. arXiv; to appear.

Tian, X. and Taylor, J. (2015). Selective inference with a randomized response. arXiv.
Thank you!