Statistical Inference


Statistical Inference. Liu Yang, Florida State University. October 27, 2016.

Outline:
The Bayesian Lasso. Trevor Park and George Casella (2008), "The Bayesian Lasso," Journal of the American Statistical Association, Vol. 103, No. 482, Theory and Methods.
Bootstrap Lasso. Trevor Hastie, Robert Tibshirani and Martin Wainwright (2015), Statistical Learning with Sparsity: The Lasso and Generalizations, pp. 142-147, Chapman and Hall/CRC.
Post-Selection Inference for the Lasso: The Covariance Test. Hastie, Tibshirani and Wainwright (2015), Statistical Learning with Sparsity: The Lasso and Generalizations, pp. 147-150, Chapman and Hall/CRC.

Why Statistical Inference? Statistical inference provides confidence intervals and measures of the statistical strength of variables, such as p-values, in models. Statistical inference is well developed in low-dimensional settings; for high-dimensional problems, traditional methods may not be applicable. We need statistical inference methods suited to high-dimensional studies.

The Bayesian Lasso. We adopt the approach of Park and Casella (2008), involving a hierarchical model of the form
y | µ, X, β, σ² ~ N_n(µ 1_n + X β, σ² I_n)   (1)
β | λ, σ ~ ∏_{j=1}^p (λ / (2σ)) exp(−λ |β_j| / σ)   (2)
For a complete Bayesian model, we use the improper prior density for σ² and a hyperprior for λ² from the class of gamma priors:
π(σ²) = 1/σ²   and   π(λ²) = (δ^r / Γ(r)) (λ²)^{r−1} exp(−δ λ²)   (3)
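A minimal sketch of a Gibbs sampler for this hierarchy, following the standard Park and Casella full conditionals (this assumes y has been centered and the columns of X standardized; the function name and hyperparameter defaults are illustrative, not taken from the slides):

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, n_iter=5000, r=1.0, delta=1.0, seed=0):
    """Gibbs sampler for the Bayesian lasso hierarchy above.

    Assumes y is centered and the columns of X are standardized.
    r, delta are the Gamma(r, delta) hyperparameters on lambda^2.
    Returns posterior draws of beta.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y

    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2, tau2, lam2 = 1.0, np.ones(p), 1.0
    draws = np.empty((n_iter, p))

    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + D_tau^{-1}
        A = XtX + np.diag(1.0 / tau2)
        A_inv = np.linalg.inv(A)
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)

        # sigma2 | rest ~ Inv-Gamma((n-1+p)/2, ||y - Xb||^2/2 + b' D_tau^{-1} b / 2)
        resid = y - X @ beta
        shape = 0.5 * (n - 1 + p)
        scale = 0.5 * (resid @ resid + beta @ (beta / tau2))
        sigma2 = scale / rng.gamma(shape, 1.0)

        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam2*sigma2/beta_j^2), lam2)
        mu_ig = np.sqrt(lam2 * sigma2 / np.maximum(beta**2, 1e-12))
        tau2 = 1.0 / rng.wald(mu_ig, lam2)

        # lam2 | rest ~ Gamma(p + r, rate = sum(tau_j^2)/2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (tau2.sum() / 2.0 + delta))

        draws[it] = beta
    return draws
```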

Bayesian Lasso. (Figure: prior and posterior distribution for the seventh variable in the diabetes example.)

Bayesian Lasso. (Figures only.)

Bootstrap Lasso. The bootstrap is popular for assessing the statistical properties of complex estimators. How do we obtain the sampling distribution of β̂ by bootstrapping? The nonparametric bootstrap is one way of approximating this sampling distribution; the parametric bootstrap is another.

Bootstrap Lasso. Suppose that we have obtained an estimate β̂(λ̂_CV) for a lasso problem by the following cross-validation procedure: Fit a lasso path to (X, y) over a dense grid of values Λ = {λ_ℓ}, ℓ = 1, ..., L. Divide the training samples into 10 groups at random. With the k-th group left out, fit a lasso path to the remaining 9/10ths, using the same grid Λ.

Bootstrap Lasso (continued). For each λ ∈ Λ, compute the mean-squared prediction error on the left-out group, and average these errors over the folds to obtain a prediction-error curve over the grid Λ. Find the value λ̂_CV that minimizes this curve, and return the coefficient vector from the original full-data fit at that value of λ (a code sketch follows).
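A minimal sketch of this cross-validation step using scikit-learn (an assumption of these notes, not the software used in the slides; LassoCV's internal grid of alphas plays the role of Λ):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_cv_fit(X, y, n_lambdas=100, seed=0):
    """Fit the lasso with the penalty chosen by 10-fold cross-validation."""
    # LassoCV builds a dense grid of n_lambdas penalty values, averages the
    # per-fold MSE curves, and refits on the full data at the minimizer.
    model = LassoCV(n_alphas=n_lambdas, cv=10, random_state=seed).fit(X, y)
    # Note: scikit-learn's "alpha" is lambda / n in the usual lasso notation.
    return model.alpha_, model.coef_

# Example with synthetic data:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 2.0 * X[:, 0] + rng.standard_normal(100)
lam_cv, beta_hat = lasso_cv_fit(X, y)
```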

Nonparametric Bootstrap. It approximates the cumulative distribution function F of the random pair (X, Y) by the empirical CDF F̂_N defined by the N samples: draw N samples with replacement from the given dataset and recompute the estimate on each resample.
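A sketch of the nonparametric bootstrap for the lasso coefficients, reusing the lasso_cv_fit helper defined above (the percentile intervals at the end are one common summary; the slides do not prescribe a particular one):

```python
import numpy as np

def bootstrap_lasso(X, y, n_boot=500, seed=0):
    """Nonparametric bootstrap: resample (x_i, y_i) pairs with replacement
    and refit the cross-validated lasso on each resample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    boot_coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # sample rows with replacement
        _, coef = lasso_cv_fit(X[idx], y[idx])
        boot_coefs.append(coef)
    boot_coefs = np.array(boot_coefs)
    # 95% percentile intervals for each coefficient
    lower, upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
    return boot_coefs, lower, upper
```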

Bootstrap Lasso. (Figures only.)

Parametric Bootstrap. Here we have a parametric estimate of F, or of its corresponding density f. We can resample from the residuals, or draw new responses from a fitted Gaussian regression model.
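A minimal sketch of one parametric version, drawing new responses from the fitted Gaussian model y* = X β̂ + ε*, ε* ~ N(0, σ̂²); the particular noise-level estimate used below is an assumption of these notes:

```python
import numpy as np

def parametric_bootstrap_lasso(X, y, n_boot=500, seed=0):
    """Parametric bootstrap: keep X fixed, simulate y* from the fitted
    Gaussian model, and refit the cross-validated lasso each time."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    _, beta_hat = lasso_cv_fit(X, y)
    fitted = X @ beta_hat
    sigma_hat = np.std(y - fitted, ddof=1)     # crude noise-level estimate
    boot_coefs = []
    for _ in range(n_boot):
        y_star = fitted + rng.normal(0.0, sigma_hat, size=n)
        _, coef = lasso_cv_fit(X, y_star)
        boot_coefs.append(coef)
    return np.array(boot_coefs)
```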

Bootstrap Lasso. (Figure only.)

Bayesian Lasso vs Bootstrap Lasso. In a general sense, results for the Bayesian lasso and the lasso/bootstrap are similar. The nonparametric bootstrap can be viewed as a kind of posterior Bayes estimate under a noninformative prior in the multinomial model (Rubin 1981, Efron 1982). The Bayesian lasso leans more on parametric assumptions; the bootstrap scales better.

Bayesian Lasso vs Bootstrap Lasso. Timing for the Bayesian lasso and the bootstrapped lasso:

p     Bayesian Lasso   Lasso/Bootstrap
10    3.3 secs         163.8 secs
50    184.8 secs       374.6 secs
100   28.6 mins        14.7 mins
200   4.5 hours        18.1 mins

The Covariance Test. Bayesian methods and the bootstrap are two "traditional" approaches, and we would now like to present some newer ones. We describe two methods for assigning p-values or confidence intervals to predictors as they are successively entered by the lasso and by forward stepwise regression. The two methods give different results.


The Covariance Test. We start from the usual linear regression setup:
y = Xβ + ε,   ε ~ N(0, σ² I_{N×N})   (4)
Consider forward stepwise regression. Defining RSS_k to be the residual sum of squares for the model containing k predictors, we can use this change in residual sum of squares to form a test statistic and compare it to a χ²_1 distribution:
R_k = (1/σ²)(RSS_{k−1} − RSS_k)   (5)
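A small illustration of this naive chi-squared comparison, assuming the RSS values and σ² are already in hand (function name is illustrative):

```python
from scipy import stats

def stepwise_chi2_pvalue(rss_prev, rss_curr, sigma2):
    """Naive test for the k-th entered variable in forward stepwise:
    R_k = (RSS_{k-1} - RSS_k) / sigma^2, compared to chi-square(1).
    As the next slides note, this ignores the adaptive selection and
    is therefore anti-conservative."""
    r_k = (rss_prev - rss_curr) / sigma2
    p_value = stats.chi2.sf(r_k, df=1)
    return r_k, p_value
```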


The Covariance Test. The forward stepwise procedure has chosen the strongest predictor among all available choices, so it yields a larger drop in training error than the χ²_1 reference expects. It is difficult to derive an appropriate p-value for forward stepwise regression if we want to properly account for the adaptive nature of the fitting. For the lasso, a simple test can be derived that does account for this adaptivity.

The Covariance Test. Suppose that we wish to test the significance of the predictor entered by LAR at λ_k. Let A_{k−1} be the set of predictors with nonzero coefficients before this predictor was added, and let the estimate at the end of this step be β̂(λ_{k+1}). We refit the lasso, keeping λ = λ_{k+1} but using just the variables in A_{k−1}; this yields the estimate β̂_{A_{k−1}}(λ_{k+1}). The covariance test statistic is
T_k = (1/σ²) ( ⟨y, X β̂(λ_{k+1})⟩ − ⟨y, X β̂_{A_{k−1}}(λ_{k+1})⟩ )   (6)
and, under the null, asymptotically
T_k → Exp(1)   (7)
This statistic measures how much of the covariance between the outcome and the fitted model can be attributed to the predictor that has just entered the model. We can examine a quantile-quantile plot of T_1 against Exp(1) (figure below); a computational sketch also follows.
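A hedged sketch of computing T_k with scikit-learn's Lasso. This is an assumption of these notes rather than the slides' implementation: the LAR knot λ_{k+1} is taken as given, scikit-learn's alpha equals λ/n (hence the conversion below), and σ² is treated as known:

```python
import numpy as np
from sklearn.linear_model import Lasso

def covariance_test_stat(X, y, active_prev, lam_next, sigma2):
    """Covariance test statistic T_k at the knot lambda_{k+1}.

    active_prev : indices of predictors active before the k-th variable entered
    lam_next    : the knot lambda_{k+1} on the (1/2)||y - Xb||^2 + lam*||b||_1 scale
    """
    n = X.shape[0]
    alpha = lam_next / n                  # scikit-learn's penalty scaling
    full = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    fit_full = X @ full.coef_

    if len(active_prev) > 0:
        reduced = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, active_prev], y)
        fit_reduced = X[:, active_prev] @ reduced.coef_
    else:
        fit_reduced = np.zeros(n)

    t_k = (y @ fit_full - y @ fit_reduced) / sigma2
    # Under the null, T_k is asymptotically Exp(1), so p-value = exp(-T_k)
    p_value = np.exp(-t_k)
    return t_k, p_value
```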

The Covariance Test. (Figure: quantile-quantile plot of T_1 versus Exp(1) quantiles.)

The exponential limiting distribution for the covariance test requires certain conditions on the data: the signal variables must not be too correlated with the noise variables, and the underlying model is assumed to be linear. In the next section, Libo will present a more general scheme that gives the spacing test, which works for any data matrix X and whose null distribution holds exactly for finite N and p.

References. Trevor Hastie, Robert Tibshirani and Martin Wainwright (2015), Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC. Trevor Park and George Casella (2008), "The Bayesian Lasso," Journal of the American Statistical Association, Vol. 103, No. 482, Theory and Methods.

Post-selection Inference and Bayesian Inference. Libo Wang, Department of Statistics, Florida State University. October 19, 2016.

Motivation. Assigning significance in high-dimensional regression is challenging: most computationally efficient selection algorithms cannot guard against the inclusion of noise variables, and asymptotically valid p-values are not available. (Statistics versus machine learning.)

What is post-selection inference? Inference the old way (pre-1980?): 1. devise a model, 2. collect data, 3. test hypotheses (classical inference). Inference the new way: 1. collect data, 2. select a model, 3. test hypotheses (post-selection inference). Classical tools cannot be used post-selection because they do not yield valid inferences (generally too optimistic: p-values are underestimated). Why? For a parametric model with full column rank and p < n, the true parameter β is well defined as the target of statistical inference. When we allow p ≥ n, we must add a selection step before inference, restricting attention to submodels of full column rank.

Example: lasso with fixed λ. HIV data: mutations that predict response to a drug. Selection intervals for the lasso with fixed tuning parameter λ.

Formal goal of post-selection inference [Lee et al.; Fithian, Sun, Taylor]. Having selected a model M̂ based on our data y, we would like to test a hypothesis Ĥ_0. Note that Ĥ_0 is random: it is a function of the selected model and hence of y. If our rejection region is {T(y) ∈ R}, we want to control the selected type I error:
Prob(T(y) ∈ R | M̂, Ĥ_0) ≤ α

Existing Approaches. Data splitting: fit on one half of the data, do inference on the other half. Problem: the fitted model varies with the random choice of half, and there is a loss of power. Reference: "P-Values for High-Dimensional Regression," Meinshausen N, Meier L, Buhlmann P, 2009. Permutations and related methods: it is not clear how to use these beyond the global null.

A key mathematical result. Polyhedral lemma: provides a good solution for forward stepwise and an optimal solution for the fixed-λ lasso. Polyhedral selection events: response vector y ~ N(µ, Σ). Suppose we make a selection that can be written as {y : Ay ≤ b} with A, b not depending on y. This is true for forward stepwise regression, the lasso with fixed λ, least angle regression and other procedures.

Some intuition for forward stepwise regression. Suppose that we run forward stepwise regression for k steps. Then {y : Ay ≤ b} is the set of y vectors that would yield the same predictors, with the same signs, entered at each step. Each step is a competition among the inner products between each x_j and y; the polyhedron Ay ≤ b summarizes the results of that competition after k steps (a sketch for the first step is given below). A similar result holds for the lasso (fixed λ or LAR).
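A minimal sketch, for the first step only, of how the winning variable and its sign translate into linear inequalities Ay ≤ b. The construction and names are illustrative, and the columns of X are assumed standardized:

```python
import numpy as np

def first_step_polyhedron(X, y):
    """First forward-stepwise step: variable j* with sign s* wins when
    s* <x_{j*}, y>  >=  +/- <x_k, y>  for every other k.
    Each such inequality becomes one row of A in  A y <= b  (here b = 0)."""
    scores = X.T @ y
    j_star = int(np.argmax(np.abs(scores)))
    s_star = np.sign(scores[j_star])

    rows = []
    for k in range(X.shape[1]):
        if k == j_star:
            continue
        # both +x_k and -x_k must lose to s* x_{j*}
        rows.append(X[:, k] - s_star * X[:, j_star])
        rows.append(-X[:, k] - s_star * X[:, j_star])
    A = np.array(rows)
    b = np.zeros(A.shape[0])
    assert np.all(A @ y <= b + 1e-10)   # the observed y satisfies its own event
    return j_star, s_star, A, b
```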

Example: the lasso and its selection event. Lasso estimation:
β̂ ∈ argmin_β ‖y − Xβ‖²₂ / 2 + λ ‖β‖₁
Rewrite the KKT conditions by partitioning them according to the active set M̂ and the inactive set M̂ᶜ (to be continued):
X_M̂ᵀ (X_M̂ β̂_M̂ − y) + λ ŝ_M̂ = 0
X_{M̂ᶜ}ᵀ (X_M̂ β̂_M̂ − y) + λ ŝ_{M̂ᶜ} = 0
sign(β̂_M̂) = ŝ_M̂,   ‖ŝ_{M̂ᶜ}‖_∞ < 1

Example: the lasso and its selection event (continued). The selection event holds if and only if there exist w and u with
X_M̂ᵀ (X_M̂ w − y) + λ s = 0
X_{M̂ᶜ}ᵀ (X_M̂ w − y) + λ u = 0
sign(w) = s,   ‖u‖_∞ < 1
Solve for w and u from the first two equations, then substitute those expressions into the remaining constraints; this yields a set of linear inequalities in y, i.e. Ay ≤ b.

The polyhedral lemma [Lee et al.; Ryan Tibshirani et al.]. For a vector η,
F^{[ν⁻, ν⁺]}_{ηᵀµ, σ²‖η‖²₂}(ηᵀ y)  |  {Ay ≤ b}   ~   Unif(0, 1)
(a truncated Gaussian distribution), where ν⁻, ν⁺ are computable values that are functions of η, A, b. Here F^{[a,b]}_{µ,σ²} denotes the CDF of a N(µ, σ²) random variable truncated to the interval [a, b], that is,
F^{[a,b]}_{µ,σ²}(x) = [Φ((x − µ)/σ) − Φ((a − µ)/σ)] / [Φ((b − µ)/σ) − Φ((a − µ)/σ)]
Typically we choose η so that ηᵀ y is the partial least-squares estimate for a selected variable.
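A sketch of the standard computation of ν⁻, ν⁺ and the resulting selection-adjusted one-sided p-value, following the Lee et al. formulas as I recall them with Σ = σ² I; treat it as illustrative, not as the reference implementation in the selectiveinference package:

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_pvalue(y, eta, A, b, sigma, mu0=0.0):
    """Selection-adjusted p-value for eta'y given the event {Ay <= b}.

    Computes the truncation limits nu_minus, nu_plus and evaluates the
    truncated-normal CDF at the observed eta'y under the null eta'mu = mu0.
    """
    eta_y = eta @ y
    eta_norm2 = eta @ eta
    c = eta / eta_norm2                    # Sigma = sigma^2 I cancels here
    z = y - c * eta_y                      # part of y orthogonal to eta
    Ac = A @ c
    Az = A @ z

    # nu- is the largest lower bound on eta'y, nu+ the smallest upper bound
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = (b - Az) / Ac
    nu_minus = np.max(ratios[Ac < 0], initial=-np.inf)
    nu_plus = np.min(ratios[Ac > 0], initial=np.inf)

    sd = sigma * np.sqrt(eta_norm2)
    num = norm.cdf((eta_y - mu0) / sd) - norm.cdf((nu_minus - mu0) / sd)
    den = norm.cdf((nu_plus - mu0) / sd) - norm.cdf((nu_minus - mu0) / sd)
    return 1.0 - num / den                 # one-sided p-value against eta'mu > mu0
```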

(Figure: schematic illustrating the polyhedral lemma for the case N = 2.)

Example: fixed-λ inference for the lasso. The lasso intervals are computed conditionally on the selection procedure, which is adaptive to the data (a probabilistic model for the data is chosen first, and the test is then formulated). The OLS model is pre-specified, using only the 7 variables selected by the lasso. Selection-adjusted intervals are similar for strong signals, but wider for weak signals: those variables lie close to the boundary of the selection region, so a wide range of values of µ would be consistent with the observations.

Current work: the lasso with λ estimated by cross-validation, and unknown σ. "Selective inference with a randomized response" by Tian and Taylor: one can condition on the selection of λ by cross-validation, in addition to the selection of the model; the difference between the fixed-λ lasso and the cross-validated lasso is not yet clear. "Selective inference with unknown variance via the square-root lasso" by Tian, Loftus and Taylor: focuses on adapting post-selection inference to the case of unknown σ and to the choice of tuning parameter; the square-root lasso is attractive because its λ does not depend on the noise level σ.

Improving the power. The preceding approach conditions on the part of y orthogonal to the direction of interest η. This is done for computational convenience, since it yields an analytic solution. Conditioning on less gives more power: are we conditioning on too much?

Data splitting, carving, and adding noise: further improvements in power (Fithian, Sun, Taylor, Tian). Selective inference yields the correct post-selection type I error, but the confidence intervals are sometimes quite wide. How can we do better? (Say, by making the randomness in selection independent of the data used for inference.) Data carving: withhold a small proportion (say 10%) of the data at the selection stage, then use all of the data for inference (conditioning via the theory outlined above). Randomized response: add noise to y at the selection stage, which is like withholding data but smoother, then use the un-noised data at the inference stage. Related to differential privacy techniques.


Alternative method: Bayesian quantile regression inference. Setting: y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ε_i with β_0 = 1/3, β_1 = β_2 = 1, ε_i ~ Exp(1) − log(2).

Table: simulation results for Bayesian quantile regression (BQR) versus quantile regression (QR).

Method   β̂_0     β̂_1     β̂_2     Err(β̂_0)×100   Err(β̂)×100   Err(ŷ)×100
BQR      0.333    0.998    1.010    0.228           0.717         23.154
QR       0.325    0.999    1.010    0.249           0.821         27.567
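A hedged sketch of the frequentist QR baseline from that table, simulating the stated setting with statsmodels (the Bayesian fit is not reproduced here, and the sample size and seed are illustrative choices):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
eps = rng.exponential(1.0, size=n) - np.log(2.0)   # errors with median zero
y = 1.0 / 3.0 + X1 + X2 + eps

X = sm.add_constant(np.column_stack([X1, X2]))
qr_fit = sm.QuantReg(y, X).fit(q=0.5)              # median regression
print(qr_fit.params)                               # estimates of beta_0, beta_1, beta_2
```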

Difference between Bayesian and classical frequentist inference. Frequentist: 1. point estimates and standard errors or 95% confidence intervals; 2. deduction from P(data | H_0), with α set in advance; 3. accept H_1 if P(data | H_0) < α. Bayesian: 1. induction from P(θ | data), starting from a prior P(θ); 2. broad descriptions of the posterior distribution, such as means and quantiles. Frequentist: P(data | H_0) is the sampling distribution of the data given the parameter. Bayesian: P(θ) is the prior distribution and P(θ | data) is the posterior distribution of the parameter.

Bayesian feature selection methods. Laplacian shrinkage (the Bayesian lasso); adaptive shrinkage; spike and slab. Selection inference comes from the posterior P(θ | data), obtained by running 5000 or more iterations of the sampler.

Conclusions. Post-selection inference is an exciting new area, with many potential research problems and generalizations. Bayesian and frequentist methods both have drawbacks in finite-sample settings. R package on CRAN: selectiveinference (covers forward stepwise regression, the lasso, and LAR).

References. Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity, Chapter 6. Lee, Sun, Sun and Taylor (2013), "Exact post-selection inference with the lasso," arXiv (to appear). Tian, X. and Taylor, J. (2015), "Selective inference with a randomized response," arXiv.

Thank you!