Median Cross-Validation

Median Cross-Validation. Chi-Wai Yu (Department of Mathematics, Hong Kong University of Science and Technology) and Bertrand Clarke (Department of Medicine, University of Miami). IISA 2011.

Outline: 1. Motivational Example; 2. Forms of CV; 3. Median CV; 4. Simulation comparisons; 5. Conclusions and Future Work.

1. Motivational Example

Consider a data set with n = 30, p = 2, one response. Fit Y = β_1 X_1 + β_2 X_2 + ε, regression through the origin. Do the usual frequentist analysis with normal error... much the same as a Bayes analysis with a flat prior.

Param.   Estimate   SE       p-value
β_1      3.7988     1.1524   0.0027
β_2      1.0014     0.0497   10^(-16)

It is not reasonable to take either estimate as 0. A Q-Q plot will confirm this.

Q-Q plot:
[Figure: Normal quantile plot of the standardized residuals from the model of Y on X_1 and X_2, plotted against the quantiles of N(0,1).]

Here is the data:

Obs      Y        X_1       X_2   |  Obs      Y        X_1       X_2
 1   -5.3033   -0.4101   -3.2433  |   16   83.2148    0.3581   73.9643
 2   -3.6200   -2.0153    4.5951  |   17    1.1077    0.0698   -0.4236
 3   -5.5315   -0.3860    1.9783  |   18    0.3226    0.0524    0.5338
 4   -7.9936    0.9578  -18.4266  |   19   -0.0885    0.1625   -1.6023
 5    0.9535    1.2655   -5.5287  |   20    1.9883   -0.3533    6.2485
 6   -3.7250   -1.3486   -2.6474  |   21   -0.4950   -0.1338   -0.1414
 7   -0.2690   -0.0403   12.0591  |   22    1.7951   -0.4801   12.7869
 8   -9.0174   -1.3326   -0.0708  |   23   -1.3429    0.6400   -1.0665
 9   72.6267    0.1414   68.4117  |   24   -3.0678   -0.8689    0.4740
10    1.7789    0.4701    7.1028  |   25   -2.6882   -1.1170   -8.1612
11    5.5903    0.1628    3.8304  |   26   -2.7023   -1.3752   -3.8681
12    6.0789    1.1990    5.8940  |   27    4.8959    1.4115    1.4447
13    2.0914    0.2894    2.0439  |   28    0.0509   -0.9251   12.0864
14    6.9873   -0.7600   13.8218  |   29   -2.1155   -0.3520    5.8767
15   -1.8672   -0.7837   -1.3106  |   30   -0.8377    0.2684   -0.9703

The Problem:

The model we used fits nearly perfectly; the natural regression equation is Ŷ = 3.8 X_1 + X_2.

PROBLEM: This is dead wrong. The data generator for Y was

  Y_i = 2 X_{1i} + E_i, for i = 1, ..., n,   (1)

where {E_i : i = 1, ..., n} are IID standard Cauchy and the X_1 values are generated IID N(0, 1). That is, X_2 is not in the correct model! (So what did the p-value mean???)

In fact, X_2 was constructed by setting X_2 = ê + normal noise, where ê is the vector of residuals from the LSE in (1) using only Y and X_1. The heavy-tailed Cauchy is built into X_2.
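To make the construction concrete, here is a minimal numpy sketch of the same recipe. The seed and the scale of the added normal noise are arbitrary choices for illustration, not the values behind the data set above: generate Y from X_1 with Cauchy error, build X_2 from the residuals of the Y-on-X_1 fit, and watch the two-variable fit look clean.

```python
# A minimal sketch (hypothetical seed and noise scale) of how an "explanatory"
# variable can absorb a heavy-tailed error term.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.standard_normal(n)                 # X1 ~ N(0, 1), IID
e = rng.standard_cauchy(n)                  # heavy-tailed error
y = 2.0 * x1 + e                            # true data generator: Y = 2*X1 + E

# Build X2 from the residuals of the regression of Y on X1 alone (through the
# origin) plus a little normal noise (scale 0.5 is an arbitrary illustration value).
b1_only = np.sum(x1 * y) / np.sum(x1 ** 2)  # LSE of Y on X1, no intercept
resid = y - b1_only * x1
x2 = resid + 0.5 * rng.standard_normal(n)

# Fit Y = b1*X1 + b2*X2 through the origin: b2 comes out near 1, b1 absorbs the
# Cauchy noise (it can land far from the true 2), and the residuals are
# essentially the added normal noise, so they look Gaussian.
X = np.column_stack([x1, x2])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted_resid = y - X @ bhat
print("estimated (b1, b2):", np.round(bhat, 3))
print("residual SD (roughly the added noise scale):", round(fitted_resid.std(), 3))
```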

Implications:

The technical term for this is cheating. However: the standard checks on normality can be misleading when a heavy tail is built into the explanatory variables. Otherwise put: we must not search naively through variables to select the ones that create a normal noise term when the noise term is not normal.

Could this have been detected? Yes... but you use up data as you estimate and perform tests. With n = 30 we have already estimated β_1, β_2, σ and done 3 hypothesis tests (β_1 = 0, β_2 = 0, and H_0: normal error, with the plot as test statistic, Di Cook), using up degrees of freedom (≥ 4?). If we sphere our data or use other transformations we also use degrees of freedom.

Often must deal with heavy tails:

With small n and not-small p, you can't be sure you're not just constructing a normal noise term in place of a heavy-tailed noise term by variable selection.

In addition, there is a lot of work on heavy-tailed distributions that deserves to be better known. The Normal-Independent (NI) class from Lange and Sinsheimer (1992) was used in Lachos, Bandyopadhyay, and Dey (2011) to analyse viral loads in HIV; the NI class includes the t_m distributions among others. Tressou (2008) used a (heavy-tailed) Pareto in a nonparametric Bayes setting for a clustering step in estimating dietary risks. In general, Cauchy = N(0,1)/N(0,1), so ratios are often heavy tailed.
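As a quick sanity check of the "ratio of normals is Cauchy" fact, the small numpy snippet below compares sample quantiles of Z_1/Z_2 with the exact standard Cauchy quantiles tan(π(p − 1/2)); the sample size and seed are arbitrary.

```python
# Ratio of two independent standard normals vs. exact standard Cauchy quantiles.
import numpy as np

rng = np.random.default_rng(1)
z1, z2 = rng.standard_normal(10**6), rng.standard_normal(10**6)
ratio = z1 / z2
for p in (0.75, 0.90, 0.95):
    print(p, round(np.quantile(ratio, p), 3), round(np.tan(np.pi * (p - 0.5)), 3))
```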

More generally:

Lévy alpha-stable distributions: like the Cauchy, they usually don't have a mean. They occur in Brownian motions; the inverse Gaussian has a Lévy distribution limit in some cases; there are applications in finance. Also the log-normal, the Weibull (shape parameter < 1), ...

Our Message: For heavy-tailed error, and other contexts, we propose a median version of CV that seems to work better than mean-squared-error CV. Simple: just replace the sum in regular CV by a median.

2. Forms of CV

Consider a regression model Y_i = f_λ(X_i; β) + E_i, where {E_i : i = 1, ..., n} are IID with median 0. Suppose f̂_λ(X) = f_λ(X, β̂) is an estimate of the regression function from M = {f_λ : λ ∈ Λ}. Assume λ varies over a finite set; then estimate β.

To choose λ, one could use information criteria (AIC, BIC, ...) or shrinkage methods (SCAD, ALASSO, AEN, ...).

Forms of CV:

Here we'll use CV approaches based on within-sample predictive accuracy. Benefits of CV: it does not require choosing a penalty term, and it combines both (internal) prediction and fit.

LOO-CV: find the model with the smallest value of

  CV(λ) = (1/n) Σ_{i=1}^n (y_i − f_λ(x_i; β̂_n^{(−i)}))²,

where β̂_n^{(−i)} is the estimate computed with observation i left out. Usually one uses leave-k-out CV, with k increasing with n. How should k depend on n? There is a vast literature on CV... see the review by Arlot (2010) and the theorem of Yang (2007). CV needs second moments and is very sensitive to outliers.
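As a concrete illustration, here is a minimal numpy sketch of the LOO-CV score above for a linear model fit by least squares; the function and variable names are ours, not from the talk.

```python
# Leave-one-out CV with squared error for an OLS fit on the columns of X.
import numpy as np

def loo_cv_score(X, y):
    """Mean squared leave-one-out prediction error for OLS on the columns of X."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return errs.mean()

# Model selection: compute the score for each candidate column subset and keep
# the subset with the smallest score.
```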

Robust CV:

Robust CV (Huber 1964, 1973) finds the λ that minimizes

  (1/n) Σ_{i=1}^n ρ(y_i − f_λ(x_i; β̂_n^{(−i)})),

where ρ is subjectively chosen. If we choose ρ(t) so that ρ increases more slowly than t² as t → ∞, then the minimum is less sensitive to extreme residuals than regular CV. (This still requires a subjective choice of ρ.)

In nonparametric regression, Leung (2005) shows the minimum is asymptotically independent of the choice of ρ. For small or moderate sample sizes, the minimum may depend on ρ.
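For reference, one common choice of ρ is Huber's function, which grows linearly rather than quadratically beyond a cutoff δ; a small sketch follows (δ = 1.345 is a conventional tuning constant, not a value taken from the talk).

```python
# Huber's rho: quadratic near zero, linear in the tails, so a CV score built
# from it is less dominated by a few huge residuals.
import numpy as np

def huber_rho(t, delta=1.345):
    """rho(t) = t^2/2 for |t| <= delta, delta*(|t| - delta/2) otherwise."""
    a = np.abs(t)
    return np.where(a <= delta, 0.5 * t**2, delta * (a - 0.5 * delta))
```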

Median CV:

Use the sample median in place of the mean in CV:

  λ̂ = argmin_λ med_{1≤i≤n} (y_i − f_λ(x_i; β̂^{(−i)}))².

Three advantages: 1) the median automatically gives invariance of the estimators up to increasing functions; 2) the minimum always exists; 3) it is resistant to outliers, more stable than moments.

Loss functions are nonnegative and therefore right-skewed, and the median is a better location for skewed distributions than the mean is.

Zheng and Yang (1998) used an MCV to choose the k in k-nearest-neighbor regression. More generally: model selection, smoothing parameters (Yu 2009), decay parameters, anywhere you might use CV.
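The change relative to ordinary LOO-CV is literally one line: take the median of the squared leave-one-out errors instead of their mean. A minimal self-contained sketch with least-squares fits (the talk also pairs MCV with LMS fits, discussed on the next slide):

```python
# Median CV score: median of squared leave-one-out prediction errors.
import numpy as np

def mcv_score(X, y):
    """Median squared leave-one-out prediction error for OLS on the columns of X."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return np.median(errs)

# As with CV, pick the candidate model (column subset) with the smallest score.
```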

Another Median Criterion:

Rousseeuw (1984) least median of squares (LMS):

  β̂_LMS = argmin_β med_{1≤i≤n} [y_i − f(x_i; β)]².

LMS is an alternative to the LSE and to other robustified estimators. LMS has a cube-root rate and a non-standard asymptotic distribution; see Kim and Pollard (1987).

When the error term in a regression model is heavy tailed, LSEs tend to do poorly because there will be outliers. By contrast, LMS is highly resistant to outliers.

MCV is an alternative to CV like LMS is an alternative to the LSE. CV works better with LSEs than with LMS; MCV works better with LMS than with LSEs.
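Exact LMS minimization is combinatorial, so in practice it is approximated; the sketch below uses a generic random elemental-subset strategy (exactly fit p points, score the fit by its median squared residual, keep the best candidate). This is a standard approximation written for illustration, not the authors' implementation.

```python
# Approximate least-median-of-squares fit by random elemental subsets.
import numpy as np

def lms_fit(X, y, n_trials=2000, rng=None):
    """Return (beta, criterion) minimizing the median squared residual over random trials."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit to p points
        except np.linalg.LinAlgError:
            continue                                 # singular subset; skip it
        crit = np.median((y - X @ beta) ** 2)        # LMS criterion
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit
```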

Consistency

Suppose three nested models:
Model 1: Y = 2(1 + X_1) + E,
Model 2: Y = 2(1 + X_1 + X_2) + E,
Model 3: Y = 2(1 + X_1 + X_2 + X_3) + E.

If Model 2 is true, Model 1 underfits and Model 3 overfits.

Generate samples of size n = 50, take the covariates IID Unif[0,1], and use three noise distributions: i) standard normal, ii) standard Cauchy, and iii) contamination, 0.8 N(0,1) + 0.2 N(15,1).

Over 1000 replications, find P_{M2}(MSP chooses M_k), k = 1, 2, 3, for MSP = CV-LS and MCV-LMS.
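A compact Monte Carlo sketch of this comparison under Cauchy error is below. For brevity it pairs both the mean and the median criterion with least-squares fits (the talk pairs MCV with LMS) and uses fewer replications, so it only indicates the qualitative pattern.

```python
# 5-fold CV (mean criterion) vs. MCV (median criterion) on three nested models.
import numpy as np

def kfold_scores(X, y, k=5, rng=None):
    """Return (mean, median) of squared k-fold prediction errors for OLS on X."""
    rng = rng or np.random.default_rng()
    n = len(y)
    idx = rng.permutation(n)
    errs = np.empty(n)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs[fold] = (y[fold] - X[fold] @ beta) ** 2
    return errs.mean(), np.median(errs)

rng = np.random.default_rng(2)
wins = {"CV": np.zeros(3, int), "MCV": np.zeros(3, int)}
for _ in range(200):                                  # 1000 replications on the slide
    U = rng.uniform(0, 1, size=(50, 3))               # X1, X2, X3 ~ Unif[0,1]
    y = 2 * (1 + U[:, 0] + U[:, 1]) + rng.standard_cauchy(50)   # Model 2 is true
    designs = [np.column_stack([np.ones(50), U[:, :j + 1]]) for j in range(3)]
    scores = [kfold_scores(X, y, rng=rng) for X in designs]
    wins["CV"][int(np.argmin([s[0] for s in scores]))] += 1
    wins["MCV"][int(np.argmin([s[1] for s in scores]))] += 1
print(wins)   # counts of Model 1/2/3 chosen by the mean (CV) and median (MCV) criteria
```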

N(0,1) error, 5-fold:
[Figure: Standard normal N(0,1) error, 5-fold. Proportion of times each model (1, 2, 3) is chosen by MCV and by usual CV.]

Cauchy error, 5-fold:
[Figure: Standard Cauchy error, 5-fold. Proportion of times each model (1, 2, 3) is selected by MCV and by usual CV.]

Contamination, 5-fold:
[Figure: Contaminated error 4/5 N(0,1) + 1/5 N(10,1), 5-fold. Proportion of times each model (1, 2, 3) is selected by MCV and by usual CV.]

Size of β_2

Suppose we use CV and MCV to compare
Model 1: Y = X_1 β_1 + E,
Model 2: Y = X_1 β_1 + X_2 β_2 + E.

When β_2 = 0, Model 2 reduces to Model 1. So for each value of β_2 ≥ 0 taken as true, we can look at how well CV and MCV distinguish the two models. Note the difference when β_2 = 0!

Cauchy error, LOO:
[Figure: LOO CV and MCV with sample size 30 and Cauchy error; % choosing the true model vs. value of β_2 (0 to 6).]

Cauchy error, 5-fold:
[Figure: 5-fold CV and MCV with sample size 30 and Cauchy error; % choosing the true model vs. value of β_2 (0 to 6).]

Two models, df vs. β_2, black dots = MCV better:
[Figure: Comparison of MCV and CV under t error distributions; degrees of freedom (0.5 to 3.0) plotted against β_2 (0.0 to 3.0), with black dots marking where MCV does better.]

Tentative Inferences:

With normal error, or high df, CV wins (for β_2 not too large). As the error becomes heavier-tailed, MCV is better able than CV to identify the correct model. MCV seems able to ignore residuals that are too big because of large noise components, while CV focuses on them.

CV tends to "sparsify": it puts too much mass, incorrectly, on smaller models, preferring models that are too small even when they're wrong. Comment: we need the non-sparse case for prediction... Tianxi Cai: can't just look at the top SNPs.

If n increases, the probability that CV and MCV choose the right model increases (rises to well over 0.5), but the same qualitative properties hold: there is a range of β's, depending on n and df, for which MCV wins.

4. No theory... let the pictures do the talking.

Imagine 5 explanatory variables in a linear regression model. Generate the X_j's from a Unif[−c, c] where c is the 5th percentile of a Cauchy. Consider the 2^5 models... all (non-nested) submodels of

  Y = β_0 + Σ_{j=1}^5 β_j X_j + ε.

Which model classes do 5-fold CV-LS and MCV-LMS choose when ε is Cauchy, n = 70, and various models are taken as true? We'll see that CV misses all the small terms.
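For concreteness, a small snippet that sets up this design: it enumerates the non-empty submodels (31 of them, matching the model axis on the plots that follow) and draws the covariates from a symmetric uniform whose half-width is the magnitude of the 5th percentile of a standard Cauchy; this reading of "Unif[c, c]" as Unif[−|c|, |c|] is our assumption.

```python
# Candidate submodels and covariate design for the 5-variable example.
import itertools
import numpy as np

c = abs(np.tan(np.pi * (0.05 - 0.5)))      # 5th percentile of a standard Cauchy, about 6.31
submodels = [s for r in range(1, 6) for s in itertools.combinations(range(5), r)]
print(len(submodels), "candidate submodels; half-width c =", round(c, 3))

rng = np.random.default_rng(3)
n = 70
X = rng.uniform(-c, c, size=(n, 5))        # columns X_1, ..., X_5
print("design:", X.shape)
```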

Model β = (5, 5, 5, 5, 5):
[Figure: CV vs MCV, proportion of times each of the 31 candidate models is chosen (curves Eprop, Mprop); the last model on the RHS is correct.]

Model β = (5, 5, 5, 5, 0.5):
[Figure: CV vs MCV over the 31 candidate models (curves Eprop, Mprop); the last model on the RHS is correct, the 2nd-last model has β_5 = 0.]

β = (5, 5, 5, 0.5, 0.5):
[Figure: CV vs MCV over the 31 candidate models (curves Eprop, Mprop); the last model on the RHS is correct.]

β = (5, 5, 0.5, 0.5, 0.5):
[Figure: CV vs MCV over the 31 candidate models (curves Eprop, Mprop); the last model on the RHS is correct.]

β = (5, 0.5, 0.5, 0.5, 0.5):
[Figure: CV vs MCV over the 31 candidate models (curves Eprop, Mprop); the last model on the RHS is correct.]

β = (0.5, 0.5, 0.5, 0.5, 0.5):
[Figure: CV vs MCV over the 31 candidate models (curves Eprop, Mprop); the last model on the RHS is correct.]

β = (5, 5, 0.5, 0, 0):
[Figure: CV vs MCV over the candidate models (curves Eprop, Mprop); CV splits its weight between the sparse model (purple, β_3 = β_4 = β_5 = 0) and the true model with β_3 ≠ 0.]

Percent of time the correct choice is made, as a function of the df of the t error

Again, let's look at how the probability of correct selection of the true model depends on the df of the error, given a true model. For comparison purposes, we look both at exact correct selection and at selection of a model whose symmetric difference from the true model is at most 1 (it may miss or add one term). Again, n = 70, 5-fold (M)CV.

β = (5, 5, 5, 5, 5) true:
[Figure: t error with varying df, true model β = (5, 5, 5, 5, 5); % that CV/MCV chose the true model (0.80 to 1.00) vs. degrees of freedom (1 to 5), curves SD-MCV, SD-CV, MCV, CV. MCV and SD-MCV coincide; as df increases, CV and SD-CV catch up to MCV and SD-MCV by about df = 1.7.]

β = (5, 5, 5, 0.5, 0.5) true:
[Figure: t error with varying df, true model β = (5, 5, 5, 0.5, 0.5); % that CV/MCV chose the true model vs. degrees of freedom (1 to 5). MCV and SD-MCV deteriorate as df increases; CV and SD-CV improve as df increases. Crossover near df = 2.]

β = (5, 0.5, 0.5, 0.5, 0.5) true:
[Figure: t error with varying df, true model β = (5, 0.5, 0.5, 0.5, 0.5); % that CV/MCV chose the true model vs. degrees of freedom (1 to 5). MCV and SD-MCV deteriorate as df increases; CV and SD-CV improve as df increases. Crossover near df = 1.8.]

True Model Inside the List

New simulation: consider 25 nested linear regression models, where the J-th model is Y = β_0 + Σ_{j=1}^J β_j X_j + E, E is Cauchy, and J = 1, ..., 25. Suppose J = 20 is the true model, for which β_j = 2/(j 1), decreasing in j. Let's compare the sampling distributions of CV (with LS and LMS) and MCV (with LS and LMS) over the model list.

We'll see that MCV-LMS is best at detecting small terms.

Cauchy error, 5-fold, nonsparse:
[Figure: CV vs MCV; proportion of times each of the 25 models is chosen (curves Eprop, Mprop, EMprop3, EMprop4); model 21 is correct, β's decreasing, n = 300.]

Cauchy error, 5-fold, nonsparse:
[Figure: CV vs MCV; proportion of times each of the 25 models is chosen (curves Eprop, Mprop, EMprop3, EMprop4); model 21 is correct, β's decreasing, n = 1500.]

5. Conclusions and Future Work

For heavy-tailed errors, especially where the leading terms are not enough, use MCV, not CV. When you have light-tailed (normal) errors, use regular CV unless β_j/σ is comparable to a threshold c_n (simulations not shown).

Diagnostic for using MCV rather than CV: a histogram of the residuals. (If not normal, use MCV. If normal, rule out having constructed the normal error, and use CV.)

Maybe the Bahadur representation can help quantify these findings... Bahadur (1966), J. K. Ghosh (1971), Mazumder and Serfling (2009)...

An empirical process approach, Kim and Pollard (1987) style, as for LMS? Random effects models???