Median Cross-Validation
Chi-Wai Yu (1) and Bertrand Clarke (2)
(1) Department of Mathematics, Hong Kong University of Science and Technology
(2) Department of Medicine, University of Miami
IISA 2011
Outline
1. Motivational Example
2. Forms of CV
3. Simulations
4. No theory: let the pictures do the talking
5. Conclusions and Future Work
1. Motivational Example
Consider a data set with n = 30, p = 2, and one response. Fit Y = β_1 X_1 + β_2 X_2 + ε, regression through the origin. Do the usual frequentist analysis with normal error... much the same as a Bayes analysis with a flat prior.

Param.   Estimate   SE       p-value
β_1      3.7988     1.1524   0.0027
β_2      1.0014     0.0497   10^{-16}

It is not reasonable to take either estimate as 0. A Q-Q plot will confirm this.
Q-Q plot
Figure: The normal quantile plot of the model of Y on X_1 and X_2 (standardized residual quantiles vs. N(0,1) quantiles).
Here is the data:

Obs      Y       X_1       X_2    Obs      Y       X_1       X_2
  1  -5.3033  -0.4101   -3.2433   16  83.2148   0.3581   73.9643
  2  -3.6200  -2.0153    4.5951   17   1.1077   0.0698   -0.4236
  3  -5.5315  -0.3860    1.9783   18   0.3226   0.0524    0.5338
  4  -7.9936   0.9578  -18.4266   19  -0.0885   0.1625   -1.6023
  5   0.9535   1.2655   -5.5287   20   1.9883  -0.3533    6.2485
  6  -3.7250  -1.3486   -2.6474   21  -0.4950  -0.1338   -0.1414
  7  -0.2690  -0.0403   12.0591   22   1.7951  -0.4801   12.7869
  8  -9.0174  -1.3326   -0.0708   23  -1.3429   0.6400   -1.0665
  9  72.6267   0.1414   68.4117   24  -3.0678  -0.8689    0.4740
 10   1.7789   0.4701    7.1028   25  -2.6882  -1.1170   -8.1612
 11   5.5903   0.1628    3.8304   26  -2.7023  -1.3752   -3.8681
 12   6.0789   1.1990    5.8940   27   4.8959   1.4115    1.4447
 13   2.0914   0.2894    2.0439   28   0.0509  -0.9251   12.0864
 14   6.9873  -0.7600   13.8218   29  -2.1155  -0.3520    5.8767
 15  -1.8672  -0.7837   -1.3106   30  -0.8377   0.2684   -0.9703
The Problem:
The model we used fits nearly perfectly; the natural regression equation is Ŷ = 3.8 X_1 + X_2.

PROBLEM: This is dead wrong. The data generator for Y was

    Y_i = 2 X_{1i} + E_i,  for i = 1, ..., n,    (1)

where {E_i : i = 1, ..., n} are IID standard Cauchy and X_1 is generated IID N(0,1). That is, X_2 is not in the correct model! (So what did the p-value mean???) In fact, X_2 was constructed by setting X_2 = ê + normal noise, where ê is the residual from the LSEs in (1) using only Y and X_1. The heavy-tailed Cauchy is built into X_2.
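The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the seed, the 0.1 noise scale on the cheating covariate, and the numpy calls are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Generator (1): Y depends only on X1; the noise is standard Cauchy.
x1 = rng.standard_normal(n)
y = 2 * x1 + rng.standard_cauchy(n)

# LSE of Y on X1 alone (regression through the origin).
b1 = (x1 @ y) / (x1 @ x1)
resid = y - b1 * x1

# The "cheating" covariate: residuals plus a little normal noise,
# so the heavy Cauchy tail is smuggled into X2.
x2 = resid + 0.1 * rng.standard_normal(n)

# Refitting on (X1, X2) now looks nearly perfect: the residual
# sum of squares drops sharply once X2 enters.
X = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_full = float(np.sum((y - X @ beta) ** 2))
rss_x1_only = float(np.sum(resid ** 2))
print(rss_x1_only, rss_full)
```

The Q-Q plot of the second fit's residuals looks normal, because by construction the Cauchy part of the noise has been moved into X_2.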
Implications: The technical term for this is cheating. However: the standard checks on normality can be misleading when a heavy tail is built into the explanatory variables. Otherwise put: we must not search naively through variables to select the ones that create a normal noise term when the noise term is not normal. Could this have been detected? Yes... but you use up data as you estimate and perform tests. With n = 30 we've already estimated β_1, β_2, σ and done 3 hypothesis tests (β_1 = 0, β_2 = 0, and H_0: normal error; plot as test statistic, Di Cook), using up degrees of freedom (≥ 4?). If we sphere our data or use other transformations, we also use degrees of freedom.
Often must deal with heavy tails: With small n and not-small p, you can't be sure you're not just constructing a normal noise in place of a heavy-tailed noise by variable selection. In addition, there is a lot of work on heavy-tailed distributions that deserves to be better known. The Normal/Independent (NI) class of Lange and Sinsheimer (1992) was used by Lachos, Bandyopadhyay, and Dey (2011) to analyze viral loads in HIV; the NI class includes the t_m distributions, among others. Tressou (2008) used a (heavy-tailed) Pareto in a nonparametric Bayes setting for a clustering step in estimating dietary risks. In general, Cauchy = N(0,1)/N(0,1), so ratios are often heavy tailed.
More generally:
Lévy alpha-stable distributions: like the Cauchy, they usually don't have a mean. They occur in Brownian motions; the inverse Gaussian has a Lévy distribution limit in some cases; there are applications in finance. Also log-normal, Weibull (shape parameter < 1)...

Our Message: For heavy-tailed error, and other contexts, we propose a median version of CV that seems to work better than mean-squared-error CV. Simple: just replace the sum in regular CV by a median.
2. Forms of CV
Consider a regression model Y_i = f_λ(X_i; β) + E_i, where {E_i : i = 1, ..., n} are IID with median 0. Suppose f̂_λ(X) = f_λ(X, β̂) is an estimate of the regression function from M = {f_λ : λ ∈ Λ}. Assume λ varies over a finite set, then estimate β. To choose λ, one could use information criteria (AIC, BIC, ...) or shrinkage methods (SCAD, ALASSO, AEN, ...).
Forms of CV:
Here we'll use CV approaches based on within-sample predictive accuracy. Benefits of CV: it does not require choosing a penalty term, and it combines both (internal) prediction and fit.

LOO-CV: Find the model with the smallest value of

    CV(λ) = (1/n) Σ_{i=1}^{n} (y_i − f_λ(x_i; β̂_n^{(i)}))².

Usually one uses leave-k-out CV, with k increasing with n. Choose k_n? There is a vast literature on CV... see the review by Arlot (2010) and the theorem of Yang (2007). CV needs second moments and is very sensitive to outliers.
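A direct, if O(n²), sketch of the LOO-CV score for least-squares fits. The function name is ours, and for simplicity λ is taken to index which columns of the design matrix enter the model.

```python
import numpy as np

def loo_cv_score(X, y):
    """Mean squared leave-one-out prediction error CV(lambda),
    for least-squares fits of y on the columns of X."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2   # predict the held-out point
    return errs.mean()

# Toy use: compare lambda = {X1} against lambda = {X1, X2}.
rng = np.random.default_rng(0)
n = 40
X = rng.standard_normal((n, 2))
y = 2 * X[:, 0] + rng.standard_normal(n)      # X2 is irrelevant
score_small = loo_cv_score(X[:, :1], y)
score_big = loo_cv_score(X, y)
print(score_small, score_big)
```

The squared-error average is exactly the term that breaks down under heavy tails: a single Cauchy-sized residual can dominate the whole sum.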
Robust CV:
Robust CV (Huber 1964, 1973) is to find λ to minimize

    (1/n) Σ_{i=1}^{n} ρ(y_i − f_λ(x_i; β̂_n^{(i)})),

where ρ is subjectively chosen. If we choose ρ(t) so that ρ increases more slowly than t² as t → ∞, then the minimum is less sensitive to extreme residuals than regular CV. (This still requires a subjective choice of ρ.) In nonparametric regression, Leung (2005) shows the minimum is asymptotically independent of the choice of ρ; for small or moderate sample sizes, the minimizer may depend on ρ.
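For concreteness, one standard choice of ρ is Huber's loss; the tuning constant 1.345 is the common default, our choice here rather than anything from the talk.

```python
def huber_rho(t, c=1.345):
    """Huber's rho: quadratic for |t| <= c, linear beyond, so extreme
    residuals contribute far less than they would under t**2."""
    a = abs(t)
    return 0.5 * t * t if a <= c else c * (a - 0.5 * c)

print(huber_rho(0.5))   # 0.125, same as 0.5 * t**2
print(huber_rho(10.0))  # about 12.5, far below 10**2 / 2 = 50
```

Different c (or a different ρ entirely) can change which λ minimizes the criterion at small n, which is the subjectivity noted above.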
Median CV:
Use the sample median in place of the mean in CV:

    λ̂ = argmin_λ med_{1≤i≤n} (y_i − f_λ(x_i; β̂^{(i)}))².

Three advantages: 1) the median automatically gives invariance of the estimators up to increasing functions, 2) the minimum always exists, 3) it is resistant to outliers, being more stable than moments. Loss functions are nonnegative and therefore right-skewed, and the median is a better location for skewed distributions than the mean is. Zheng and Yang (1998) used an MCV to choose k in k-NN regression. More generally: model selection, smoothing parameters (Yu 2009), decay parameters, anywhere you might use CV.
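The only change from the LOO-CV sketch is the aggregation step: a median instead of a mean. Again a sketch with our own names, not the authors' code.

```python
import numpy as np

def loo_mcv_score(X, y):
    """Median of squared leave-one-out prediction errors: the MCV
    criterion, here with least-squares refits."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return np.median(errs)    # <- the mean in CV becomes a median

# Sanity check: with an exact linear relation, every LOO error is 0.
X0 = np.array([[1.0], [2.0], [3.0], [4.0]])
y0 = np.array([2.0, 4.0, 6.0, 8.0])
print(loo_mcv_score(X0, y0))
```

Advantage 1) is visible in the code: the median depends only on the ranks of the errors, so composing the squared loss with any increasing function leaves the chosen λ unchanged.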
Another Median Criterion:
Rousseeuw (1984): least median of squares (LMS),

    β̂_LMS = argmin_β med_{1≤i≤n} [y_i − f(x_i; β)]².

LMS is an alternative to the LSEs and to other robustified estimators. LMS has a cube-root rate and a nonstandard asymptotic distribution; see Kim and Pollard (1987). When the error term in a regression model is heavy tailed, LSEs tend to do poorly because there will be outliers; by contrast, LMS is highly resistant to outliers. MCV is an alternative to CV in the way LMS is an alternative to the LSE: CV works better with LSEs than with LMS; MCV works better with LMS than with LSEs.
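LMS has no closed form. A common approximation, sketched here under our own assumptions, draws many candidate slopes from single data points (elemental fits, for a through-the-origin line) and keeps the one with the smallest median squared residual.

```python
import numpy as np

def lms_line(x, y, n_trials=500, seed=0):
    """Approximate LMS slope for y = b*x + e (through the origin)
    by random search over one-point candidate slopes."""
    rng = np.random.default_rng(seed)
    best_b, best_crit = 0.0, np.inf
    for _ in range(n_trials):
        i = rng.integers(len(x))
        if x[i] == 0:
            continue
        b = y[i] / x[i]                       # candidate slope
        crit = np.median((y - b * x) ** 2)    # median, not sum, of squares
        if crit < best_crit:
            best_b, best_crit = b, crit
    return best_b

# Heavy-tailed demo: LMS shrugs off the Cauchy outliers.
rng = np.random.default_rng(3)
x = rng.standard_normal(200)
y = 2 * x + rng.standard_cauchy(200)
print(lms_line(x, y), (x @ y) / (x @ x))   # LMS vs. LSE
```

The LSE can be dragged arbitrarily far by one Cauchy draw; the LMS criterion ignores the largest half of the squared residuals entirely.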
Consistency:
Suppose three nested models:
Model 1: Y = 2(1 + X_1) + E,
Model 2: Y = 2(1 + X_1 + X_2) + E,
Model 3: Y = 2(1 + X_1 + X_2 + X_3) + E.
If Model 2 is true, Model 1 underfits and Model 3 overfits. Generate samples of size n = 50, take the covariates IID Unif[0,1], and use three noise distributions: i) standard normal, ii) standard Cauchy, and iii) contamination, 0.8 N(0,1) + 0.2 N(15,1). Over 1000 reps, find P_{M2}(MSP chooses M_k), k = 1, 2, 3, for MSP = CV-LS and MCV-LMS.
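The three noise generators for this simulation can be sketched as follows; the numpy calls and componentwise mixing are our assumptions about an otherwise standard setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise(kind, n):
    """The three error distributions used in the consistency simulation."""
    if kind == "normal":
        return rng.standard_normal(n)
    if kind == "cauchy":
        return rng.standard_cauchy(n)
    if kind == "contaminated":
        # 80% N(0,1), 20% N(15,1), mixed componentwise
        z = rng.random(n) < 0.8
        return np.where(z, rng.normal(0.0, 1.0, n), rng.normal(15.0, 1.0, n))
    raise ValueError(kind)

e = noise("contaminated", 10_000)
print(e.mean())   # roughly 0.2 * 15 = 3
```

The contaminated case mimics occasional gross outliers; the Cauchy case has no mean at all, which is exactly where squared-error CV loses its footing.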
N(0,1) error, 5-fold:
Figure: Proportion of times each of Models 1-3 is chosen by MCV and CV; 5-fold, standard normal error.
Cauchy error, 5-fold:
Figure: Proportion of times each of Models 1-3 is selected by MCV and CV; 5-fold, standard Cauchy error.
Contamination, 5-fold:
Figure: Proportion of times each of Models 1-3 is selected by MCV and CV; 5-fold, contaminated normal error.
Size of β_2:
Suppose we use CV and MCV to compare
Model 1: Y = X_1 β_1 + E,
Model 2: Y = X_1 β_1 + X_2 β_2 + E.
When β_2 = 0, Model 2 reduces to Model 1. So, for each value of β_2 ≥ 0 taken as true, we can look at how well CV and MCV distinguish the two models. Note the difference when β_2 = 0!
Cauchy error, LOO:
Figure: Proportion of times the true model is chosen by LOO MCV and CV, as a function of β_2; n = 30, Cauchy errors.
Cauchy error, 5-fold:
Figure: Proportion of times the true model is chosen by 5-fold MCV and CV, as a function of β_2; n = 30, Cauchy errors.
Two models, df vs. β_2, black dots = MCV better:
Figure: Comparison of MCV and CV under t error distributions, as a function of the degrees of freedom (0.5 to 3.0) and β_2 (0.0 to 3.0).
Tentative Inferences:
With normal error, or high dfs, CV wins (for β_2 not too large). As the error becomes heavier-tailed, MCV is better able than CV to identify the correct model. MCV seems able to ignore residuals that are too big because of large noise components, while CV focusses on them. CV tends to "sparsify": it puts too much mass on smaller models, preferring models that are too small even when they're wrong. Comment: we need the non-sparse case for prediction... Tianxi Cai: can't just look at the top SNPs. If n increases, the probability that CV and MCV choose the right model increases (rises to well over .5), but the same qualitative properties hold: there is a range of β's, depending on n and df, for which MCV wins.
4. No theory... let the pictures do the talking.
Imagine 5 explanatory variables in a linear regression model. Generate the X_j's from Unif[−c, c], where c is the 5-th percentile of a Cauchy. Consider the 2^5 models... all (non-nested) submodels of

    Y = β_0 + Σ_{j=1}^{5} β_j X_j + ε.

Which model classes do 5-fold CV-LS and MCV-LMS choose when ε is Cauchy, n = 70, and various models are taken as true? We'll see that CV misses all the small terms.
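Enumerating the 2^5 submodels takes a few lines; a sketch (itertools-based, names ours):

```python
from itertools import combinations

p = 5
submodels = []
for k in range(p + 1):
    # each submodel is the tuple of indices of the included X_j's
    submodels.extend(combinations(range(p), k))

print(len(submodels))   # 2**5 = 32, including the intercept-only model
```

Each tuple would then be fitted (by LS and LMS) and scored by 5-fold CV and MCV, with the winning submodel recorded over the simulation replicates.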
Model β = (5, 5, 5, 5, 5): CV vs. MCV
Figure: Proportion of times each of the 2^5 submodels is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct.
Model β = (5, 5, 5, 5, 0.5): CV vs. MCV
Figure: Proportion of times each submodel is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct, and the second-to-last has β_5 = 0.
Model β = (5, 5, 5, 0.5, 0.5): CV vs. MCV
Figure: Proportion of times each submodel is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct.
Model β = (5, 5, 0.5, 0.5, 0.5): CV vs. MCV
Figure: Proportion of times each submodel is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct.
Model β = (5, 0.5, 0.5, 0.5, 0.5): CV vs. MCV
Figure: Proportion of times each submodel is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct.
Model β = (0.5, 0.5, 0.5, 0.5, 0.5): CV vs. MCV
Figure: Proportion of times each submodel is chosen by CV (Eprop) and MCV (Mprop); the last model on the RHS is correct.
Model β = (5, 5, 0.5, 0, 0): CV vs. MCV
Figure: CV splits its weight between the sparse model (purple, β_3 = β_4 = β_5 = 0) and the true model with β_3 ≠ 0.
Percent of time the correct choice is made, as a function of df in t:
Again, let's look at how the probability of correctly selecting the true model depends on the df of the error, given a true model. For comparison purposes, we look both at correct selection and at selection of a model whose symmetric difference from the true model is at most 1 (it may miss or add one term). Again, n = 70, 5-fold (M)CV.
β = (5, 5, 5, 5, 5) true:
Figure: Proportion of correct selection vs. df of the t error. MCV and SD-MCV coincide. As df increases, CV and SD-CV catch up to MCV and SD-MCV, by df ≈ 1.7.
β = (5, 5, 5, 0.5, 0.5) true:
Figure: Proportion of correct selection vs. df of the t error. MCV and SD-MCV deteriorate as df increases; CV and SD-CV improve. Crossover at df ≈ 2.
β = (5, 0.5, 0.5, 0.5, 0.5) true:
Figure: Proportion of correct selection vs. df of the t error. MCV and SD-MCV deteriorate as df increases; CV and SD-CV improve. Crossover at df ≈ 1.8.
True Model Inside the List:
New simulation: consider 25 nested linear regression models, where model J is

    Y = β_0 + Σ_{j=1}^{J} β_j X_j + E,  J = 1, ..., 25,

and E is Cauchy. Suppose J = 20 is the true model, with β_j = 2/(j + 1). Let's compare the sampling distributions of CV (with LS and LMS) and MCV (with LS and LMS) over the model list. We see that MCV-LMS is best at detecting small terms.
Cauchy error, 5-fold, nonsparse: CV vs. MCV
Figure: 25 models (curves Eprop, Mprop, EMprop3, EMprop4); model 21 is correct, β's decreasing, n = 300.
Cauchy error, 5-fold, nonsparse: CV vs. MCV
Figure: 25 models (curves Eprop, Mprop, EMprop3, EMprop4); model 21 is correct, β's decreasing, n = 1500.
5. Conclusions and Future Work
For heavy-tailed errors, especially where the leading terms are not enough, use MCV, not CV. When you have light-tailed (normal) errors, use regular CV unless β_j/σ exceeds a cutoff c_n (simulations not shown). Diagnostic for using MCV rather than CV: a histogram of residuals. (If not normal, use MCV. If normal, rule out having constructed the normal error, and use CV.) Maybe the Bahadur representation can help quantify these findings... Bahadur (1966), JKG (1971), Mazumder and Serfling (2009)... An empirical process approach in the style of Kim and Pollard (1987), as for LMS? Random effects models???