Model-free prediction intervals for regression and autoregression. Dimitris N. Politis, University of California, San Diego

To explain or to predict? Models are indispensable for exploring and utilizing relationships between variables: explaining the world. Use of models for prediction can be problematic when: a model is overspecified; parameter inference is highly model-specific (and sensitive to model mis-specification); prediction is carried out by plugging in the estimated parameters and treating the model as exactly true. "All models are wrong but some are useful" (George Box).

A Toy Example. Assume the regression model $Y = \beta_0 + \beta_1 X + \beta_2 X^{20} + \text{error}$. If $\hat\beta_2$ is barely statistically significant, do you still use it in prediction? If the true value of $\beta_2$ is close to zero and $\mathrm{var}(\hat\beta_2)$ is large, it may be advantageous to omit $\beta_2$: allow a nonzero bias but minimize MSE. A mis-specified model can be optimal for prediction!

Prediction Framework: (a) point predictors; (b) interval predictors; (c) predictive distribution. There is an abundant Bayesian literature in the parametric framework: Cox (1975), Geisser (1993), etc. The frequentist and/or nonparametric literature is scarce.

I.i.d. set-up. Let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. from the (unknown) cdf $F_\varepsilon$. GOAL: prediction of the future $\varepsilon_{n+1}$ based on the data. $F_\varepsilon$ is the predictive distribution, and its quantiles can be used to form prediction intervals. The mean and median of $F_\varepsilon$ are the optimal point predictors under an $L_2$ and an $L_1$ criterion, respectively.

I.i.d. data. $F_\varepsilon$ is unknown but can be estimated by the empirical distribution function (edf) $\hat F_\varepsilon$. Practical model-free prediction intervals will be based on quantiles of $\hat F_\varepsilon$, and the $L_2$- and $L_1$-optimal predictors will be approximated by the mean and median of $\hat F_\varepsilon$, respectively.

Non-i.i.d. data. In general, the data $\mathbf{Y}_n = (Y_1,\ldots,Y_n)$ are not i.i.d. So the predictive distribution of $Y_{n+1}$ given the data will depend on $\mathbf{Y}_n$ and on $X_{n+1}$, a matrix of observable, explanatory (predictor) variables. Key examples: regression and time series.

Models. Regression: $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1). Time series: $Y_t = \mu(Y_{t-1},\ldots,Y_{t-p};\,x_t) + \sigma(Y_{t-1},\ldots,Y_{t-p};\,x_t)\,\varepsilon_t$. The above are flexible, nonparametric models. Given one of these models, optimal model-based predictors of a future $Y$-value can be constructed. Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering at the very least robustness against model mis-specification.

Transformation vs. modeling. DATA: $\mathbf{Y}_n = (Y_1,\ldots,Y_n)$. GOAL: predict the future value $Y_{n+1}$ given the data. Find an invertible transformation $H_m$ so that (for all $m$) the vector $\epsilon_m = H_m(\mathbf{Y}_m)$ has i.i.d. components $\epsilon_k$, where $\epsilon_m = (\epsilon_1,\ldots,\epsilon_m)$; schematically, $\mathbf{Y} \xrightarrow{H_m} \epsilon$ and $\mathbf{Y} \xleftarrow{H_m^{-1}} \epsilon$.

Transformation. (i) $(Y_1,\ldots,Y_m) \xrightarrow{H_m} (\epsilon_1,\ldots,\epsilon_m)$; (ii) $(Y_1,\ldots,Y_m) \xleftarrow{H_m^{-1}} (\epsilon_1,\ldots,\epsilon_m)$. (i) implies that $\epsilon_1,\ldots,\epsilon_n$ are known given the data $Y_1,\ldots,Y_n$. (ii) implies that $Y_{n+1}$ is a function of $\epsilon_1,\ldots,\epsilon_n$ and $\epsilon_{n+1}$. So, given the data $\mathbf{Y}_n$, $Y_{n+1}$ is a function of $\epsilon_{n+1}$ only, i.e., $Y_{n+1} = h(\epsilon_{n+1})$.

Model-free prediction principle. $Y_{n+1} = h(\epsilon_{n+1})$. Suppose $\epsilon_1,\ldots,\epsilon_n \sim$ cdf $F_\epsilon$. The mean and median of $h(\epsilon)$, where $\epsilon \sim F_\epsilon$, are the optimal point predictors of $Y_{n+1}$ under the $L_2$ and $L_1$ criterion, respectively. The whole predictive distribution of $Y_{n+1}$ is the distribution of $h(\epsilon)$ when $\epsilon \sim F_\epsilon$. To predict $Y_{n+1}^2$, replace $h$ by $h^2$; to predict $g(Y_{n+1})$, replace $h$ by $g \circ h$. The unknown $F_\epsilon$ can be estimated by $\hat F_\epsilon$, the edf of $\epsilon_1,\ldots,\epsilon_n$. But the predictive distribution also needs bootstrapping, because $h$ is estimated from the data.

Nonparametric Regression. MODEL (*): $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$, where $x_t$ is univariate and deterministic; $Y_t$ data are available for $t = 1,\ldots,n$; $\varepsilon_t$ are i.i.d. (0,1) from the (unknown) cdf $F$; the functions $\mu(\cdot)$ and $\sigma(\cdot)$ are unknown but smooth.

Nonparametric Regression. Note: $\mu(x) = E(Y \mid x)$ and $\sigma^2(x) = \mathrm{Var}(Y \mid x)$. Let $m_x$, $s_x$ be smoothing estimators of $\mu(x)$, $\sigma(x)$; examples: kernel smoothers, local linear fitting, wavelets, etc. E.g., the Nadaraya-Watson estimator $m_x = \sum_{i=1}^n Y_i \tilde K\big(\frac{x - x_i}{h}\big)$, where $K(x)$ is the kernel, $h$ the bandwidth, and $\tilde K\big(\frac{x - x_i}{h}\big) = K\big(\frac{x - x_i}{h}\big) \big/ \sum_{k=1}^n K\big(\frac{x - x_k}{h}\big)$. Similarly, $s_x^2 = M_x - m_x^2$, where $M_x = \sum_{i=1}^n Y_i^2 \tilde K\big(\frac{x - x_i}{q}\big)$ with bandwidth $q$.
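For concreteness, here is a minimal sketch of the smoothers $m_x$ and $s_x$ described above; the function names, the Gaussian kernel, and the numpy interface are illustrative assumptions, not part of the talk.

```python
# Sketch of the Nadaraya-Watson estimators m_x and s_x (Gaussian kernel assumed).
import numpy as np

def nw_weights(x, xs, h):
    """Normalized kernel weights K~((x - x_i)/h) at the point x."""
    k = np.exp(-0.5 * ((x - xs) / h) ** 2)   # Gaussian kernel K
    return k / k.sum()

def nw_mean(x, xs, ys, h):
    """Nadaraya-Watson smoother m_x = sum_i Y_i K~((x - x_i)/h)."""
    return nw_weights(x, xs, h) @ ys

def nw_sd(x, xs, ys, h, q):
    """s_x from s_x^2 = M_x - m_x^2, with M_x a smoother of Y_i^2 (bandwidth q)."""
    m = nw_mean(x, xs, ys, h)
    M = nw_weights(x, xs, q) @ (ys ** 2)
    return np.sqrt(max(M - m ** 2, 0.0))
```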

FIGURE 1: (a) Log-wage vs. age data with fitted kernel smoother $m_x$. (b) Unstudentized residuals $Y - m_x$ with $s_x$ superimposed. Data: 1971 Canadian Census data cps71 from the np package of R; wage vs. age for 205 male individuals with common education. The kernel smoother is problematic at the left boundary; local linear fitting (Fan and Gijbels, 1996) or reflection (Hall and Wehrly, 1991) is better.

Residuals. (*): $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$. Fitted residuals: $e_t = (Y_t - m_{x_t})/s_{x_t}$. Predictive residuals: $\tilde e_t = (Y_t - m^{(t)}_{x_t})/s^{(t)}_{x_t}$, where $m^{(t)}_x$ and $s^{(t)}_x$ are the estimators $m$ and $s$ computed from the delete-$Y_t$ dataset $\{(Y_i, x_i),\ i \neq t\}$. $\tilde e_t$ is the (standardized) error in trying to predict $Y_t$ from the delete-$Y_t$ dataset. Selection of the bandwidth parameters $h$ and $q$ is often done by cross-validation, i.e., pick $h, q$ to minimize PRESS $= \sum_{t=1}^n \tilde e_t^2$. BETTER: $L_1$ cross-validation, i.e., pick $h, q$ to minimize $\sum_{t=1}^n |\tilde e_t|$.
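A minimal sketch of the predictive (delete-one) residuals and the $L_1$ cross-validation criterion, assuming the nw_mean and nw_sd helpers from the previous sketch; the names and the grid-search interface are illustrative.

```python
# Delete-Y_t residuals e~_t and an L1 cross-validation bandwidth search.
import numpy as np

def predictive_residuals(xs, ys, h, q):
    """Standardized delete-Y_t prediction errors e~_t."""
    res = np.empty(len(ys))
    for t in range(len(ys)):
        keep = np.arange(len(ys)) != t
        m = nw_mean(xs[t], xs[keep], ys[keep], h)
        s = nw_sd(xs[t], xs[keep], ys[keep], h, q)
        res[t] = (ys[t] - m) / s
    return res

def l1_cross_validation(xs, ys, grid):
    """Pick (h, q) minimizing sum_t |e~_t| over a grid of candidate bandwidths."""
    return min(grid, key=lambda hq: np.abs(predictive_residuals(xs, ys, *hq)).sum())
```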

Model-based (MB) point predictors. (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. GOAL: predict a future response $Y_f$ associated with the point $x_f$. The $L_2$-optimal predictor of $Y_f$ is $E(Y_f \mid x_f)$, i.e., $\mu(x_f)$, which is approximated by $m_{x_f}$. The $L_1$-optimal predictor of $Y_f$ is the conditional median, i.e., $\mu(x_f) + \sigma(x_f)\,\mathrm{median}(F)$, which is approximated by $m_{x_f} + s_{x_f}\,\mathrm{median}(\hat F_e)$, where $\hat F_e$ is the edf of the residuals $e_1,\ldots,e_n$.

Model-based (MB) point predictors. (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. DATASET cps71: salaries are logarithmically transformed, i.e., $Y_t$ = log-salary. To predict the salary at age $x_f$ we need to predict $g(Y_f)$, where $g(x) = \exp(x)$. The MB $L_2$-optimal predictor of $g(Y_f)$ is $E(g(Y_f) \mid x_f)$, estimated by $n^{-1}\sum_{i=1}^n g(m_{x_f} + s_{x_f} e_i)$. The naive predictor $g(m_{x_f})$ is grossly suboptimal when $g$ is nonlinear. The MB $L_1$-optimal predictor of $g(Y_f)$ is estimated by the sample median of the set $\{g(m_{x_f} + s_{x_f} e_i),\ i = 1,\ldots,n\}$; the naive plug-in is OK iff $g$ is monotone!

Resampling Algorithm for the predictive distribution of $g(Y_f)$. Prediction root: $g(Y_f) - \Pi$, where $\Pi$ is the point predictor.
1. Bootstrap the (fitted or predictive) residuals $r_1,\ldots,r_n$ to create pseudo-residuals $r_1^*,\ldots,r_n^*$ whose edf is denoted by $\hat F_n^*$.
2. Create pseudo-data $Y_i^* = m_{x_i} + s_{x_i} r_i^*$ for $i = 1,\ldots,n$.
3. Calculate a bootstrap pseudo-response $Y_f^* = m_{x_f} + s_{x_f} r$, where $r$ is drawn randomly from $(r_1,\ldots,r_n)$.
4. Based on the pseudo-data $Y_1^*,\ldots,Y_n^*$, re-estimate the functions $\mu(x)$ and $\sigma(x)$ by $m_x^*$ and $s_x^*$.
5. Calculate the bootstrap root $g(Y_f^*) - \Pi(g, m_x^*, s_x^*, \mathbf{Y}_n, X_{n+1}, \hat F_n^*)$.
6. Repeat the above B times, and collect the B bootstrap roots in an empirical distribution with $\alpha$-quantile denoted $q(\alpha)$.
The estimate of the predictive distribution of $g(Y_f)$ is the empirical df of the bootstrap roots shifted to the right by $\Pi$. Then a $(1-\alpha)100\%$ equal-tailed prediction interval for $g(Y_f)$ is $[\Pi + q(\alpha/2),\ \Pi + q(1-\alpha/2)]$.
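A minimal sketch of the model-based residual-bootstrap interval following the steps above, assuming the nw_mean, nw_sd and predictive_residuals helpers from the earlier sketches; B, alpha, the seed and the use of predictive residuals are illustrative defaults.

```python
# Model-based (MB) bootstrap prediction interval for g(Y_f) at the point xf.
import numpy as np

def mb_prediction_interval(xs, ys, xf, h, q, g=lambda y: y, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    m = np.array([nw_mean(x, xs, ys, h) for x in xs])
    s = np.array([nw_sd(x, xs, ys, h, q) for x in xs])
    r = predictive_residuals(xs, ys, h, q)                    # residuals r_1,...,r_n
    mf, sf = nw_mean(xf, xs, ys, h), nw_sd(xf, xs, ys, h, q)
    Pi = np.mean(g(mf + sf * r))                              # L2-optimal point predictor
    roots = np.empty(B)
    for b in range(B):
        r_star = rng.choice(r, size=len(ys), replace=True)    # pseudo-residuals r*_i
        ys_star = m + s * r_star                              # pseudo-data Y*_i
        yf_star = mf + sf * rng.choice(r)                     # pseudo-response Y*_f
        mf_star = nw_mean(xf, xs, ys_star, h)                 # re-estimated mu, sigma at xf
        sf_star = nw_sd(xf, xs, ys_star, h, q)
        Pi_star = np.mean(g(mf_star + sf_star * r_star))      # bootstrap point predictor
        roots[b] = g(yf_star) - Pi_star                       # bootstrap root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi                                   # equal-tailed interval
```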

Model-free prediction in regression. The previous discussion hinged on the model (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) from cdf $F$. What happens if model (*) does not hold true? E.g., the skewness and/or kurtosis of $Y_t$ may depend on $x_t$. cps71 data: the skewness and kurtosis of salary depend on age.

FIGURE: (a) Log-wage skewness vs. age. (b) Log-wage kurtosis vs. age. Both skewness and kurtosis are nonconstant!

General background. Could try skewness-reducing transformations, but the log already does that. Could try ACE, AVAS, etc. There is a simpler, more general solution!

General background. The $Y_t$'s are still independent but not identically distributed. Denote the conditional distribution of $Y_f$ given $x_f$ by $D_x(y) = P\{Y_f \le y \mid x_f = x\}$, and assume that $D_x(y)$ is continuous in both $x$ and $y$. (With a categorical response, standard methods like generalized linear models can be invoked, e.g., logistic regression, Poisson regression, etc.) Since $D_x(\cdot)$ depends on $x$ in a smooth way, we can estimate $D_x(y)$ by the local empirical distribution $N_{x,h}^{-1} \sum_{t:\,|x_t - x| < h/2} 1\{Y_t \le y\}$, where $1\{\cdot\}$ is the indicator and $N_{x,h}$ is the number of summands, i.e., $N_{x,h} = \#\{t : |x_t - x| < h/2\}$.

Constructing the transformation. A more general estimator is $\hat D_x(y) = \sum_{i=1}^n 1\{Y_i \le y\}\, \tilde K\big(\frac{x - x_i}{h}\big)$. $\hat D_x(y)$ is just a Nadaraya-Watson smoother of the variables $1\{Y_t \le y\}$, $t = 1,\ldots,n$; one can also use a local linear smoother of these variables. The estimator $\hat D_x(y)$ enjoys many good properties, including asymptotic consistency; see e.g. Li and Racine (2007). But $\hat D_x(y)$ is discontinuous in $y$, and therefore unacceptable for our purposes! One could smooth it by kernel methods, or use piecewise-linear interpolation; call the resulting continuous estimator $\bar D_x(\cdot)$ (see the next figure).
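A minimal sketch of $\hat D_x(y)$ as a Nadaraya-Watson smoother of the indicators $1\{Y_i \le y\}$, made continuous in $y$ by piecewise-linear interpolation (one simple way to implement the interpolation idea); it assumes the nw_weights helper from the earlier sketch, and the function names are illustrative.

```python
# Smoothed conditional cdf estimate and its inverse, interpolated over sorted Y's.
import numpy as np

def D_hat(y, x, xs, ys, h):
    """Continuous (piecewise-linear) version of D^_x(y)."""
    w = nw_weights(x, xs, h)                 # kernel weights at x
    order = np.argsort(ys)
    y_sorted, w_sorted = ys[order], w[order]
    cdf = np.cumsum(w_sorted)                # step-function values at the sorted Y's
    return float(np.interp(y, y_sorted, cdf, left=0.0, right=1.0))

def D_hat_inv(u, x, xs, ys, h):
    """Inverse D^_x^{-1}(u), interpolating the (cdf, Y) pairs the other way."""
    w = nw_weights(x, xs, h)
    order = np.argsort(ys)
    y_sorted, w_sorted = ys[order], w[order]
    cdf = np.cumsum(w_sorted)
    return float(np.interp(u, cdf, y_sorted))
```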

Linear interpolation. FIGURE 2: (a) Empirical distribution of a test sample consisting of five N(0,1) and five N(1/2,1) independent r.v.'s, with the piecewise-linear estimator $\bar D(\cdot)$ superimposed; the vertical/horizontal lines indicate the inversion process, i.e., finding $\bar D^{-1}(0.75)$. (b) Q-Q plot of the transformed variables $u_i$ vs. the quantiles of Uniform(0,1).

Constructing the transformation. Since the $Y_t$'s are continuous r.v.'s, the probability integral transform is the key idea for transforming them to i.i.d.-ness. To see why, note that if we let $\eta_i = D_{x_i}(Y_i)$ for $i = 1,\ldots,n$, our transformation objective would be achieved exactly, since $\eta_1,\ldots,\eta_n$ would be i.i.d. Uniform(0,1). $D_x(\cdot)$ is not known, but we have the estimator $\bar D_x(\cdot)$ as its proxy. Therefore, our proposed transformation for the MF prediction principle is $u_i = \bar D_{x_i}(Y_i)$ for $i = 1,\ldots,n$. $\bar D_x(\cdot)$ is consistent, so $u_1,\ldots,u_n$ are approximately i.i.d. The probability integral transform has been used in the past for building better density estimators; see Ruppert and Cline (1994).

Model-free optimal predictors. Transformation: $u_i = \bar D_{x_i}(Y_i)$ for $i = 1,\ldots,n$. The inverse transformation $\bar D_x^{-1}$ is well-defined since $\bar D_x(\cdot)$ is strictly increasing. Let $u_f = \bar D_{x_f}(Y_f)$ and $Y_f = \bar D_{x_f}^{-1}(u_f)$. Then $\bar D_{x_f}^{-1}(u_i)$ has (approximately) the same distribution as $Y_f$ (conditionally on $x_f$) for any $i$. So $\{\bar D_{x_f}^{-1}(u_i),\ i = 1,\ldots,n\}$ is a set of bona fide potential responses that can be used as proxies for $Y_f$.

These $n$ valid potential responses $\{\bar D_{x_f}^{-1}(u_i),\ i = 1,\ldots,n\}$, gathered together, give an approximate empirical distribution for $Y_f$ from which our predictors will be derived. The $L_2$-optimal predictor of $g(Y_f)$ is the expected value of $g(Y_f)$, approximated by $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$. The $L_1$-optimal predictor of $g(Y_f)$ is approximated by the sample median of the set $\{g\big(\bar D_{x_f}^{-1}(u_i)\big),\ i = 1,\ldots,n\}$.

Model-free optimal point predictors (model-free method):
$L_2$ predictor of $Y_f$: $n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}\big(\bar D_{x_i}(Y_i)\big)$
$L_1$ predictor of $Y_f$: $\mathrm{median}\{\bar D_{x_f}^{-1}\big(\bar D_{x_i}(Y_i)\big)\}$
$L_2$ predictor of $g(Y_f)$: $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(\bar D_{x_i}(Y_i))\big)$
$L_1$ predictor of $g(Y_f)$: $\mathrm{median}\{g\big(\bar D_{x_f}^{-1}(\bar D_{x_i}(Y_i))\big)\}$
TABLE: The model-free (MF$^2$) optimal point predictors.
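A minimal sketch of the tabulated MF$^2$ point predictors, assuming the D_hat and D_hat_inv helpers from the earlier sketch; the function name is illustrative.

```python
# Model-free L2- and L1-optimal point predictors of g(Y_f) at the point xf.
import numpy as np

def mf_point_predictors(xs, ys, xf, h, g=lambda y: y):
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(len(ys))])
    proxies = np.array([g(D_hat_inv(ui, xf, xs, ys, h)) for ui in u])
    return proxies.mean(), np.median(proxies)   # L2 predictor, L1 predictor
```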

Model-free model-fitting. The MF predictors (mean or median) can be used to give the equivalent of a model fit. Focus on the $L_2$-optimal case with $g(x) = x$. Calculating the MF predictor $\Pi(x_f) = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i)$ for many different $x_f$ values, say on a grid, constructs the equivalent of a nonparametric smoother of the regression function: Model-Free Model-Fitting (MF$^2$).

M.o.a.T. MF$^2$ relieves the practitioner from the need to find the optimal transformation for additivity and variance stabilization (Box/Cox, ACE, AVAS, etc.); see Figures 3 and 4. No need for a log-transformation of salaries! MF$^2$ is totally automatic!

FIGURE 3: (a) Wage vs. age scatterplot. (b) Circles indicate the salary predictor $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$ calculated from the log-wage data with $g$ exponential. In both panels, the superimposed solid line is the MF$^2$ salary predictor calculated from the raw data (without the log).

FIGURE 4: Q-Q plots of the $u_i$ vs. the quantiles of Uniform(0,1). (a) The $u_i$'s are obtained from the log-wage vs. age dataset of Figure 1 using bandwidth 5.5; (b) the $u_i$'s are obtained from the raw (untransformed) dataset of Figure 3 using bandwidth 7.3.

MF$^2$ predictive distributions. For MF$^2$ we can always take $g(x) = x$; there is no need for other preliminary transformations. Let $g(Y_f) - \Pi$ be the prediction root, where $\Pi$ is either the $L_2$- or the $L_1$-optimal predictor, i.e., $\Pi = n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$ or $\Pi = \mathrm{median}\{g\big(\bar D_{x_f}^{-1}(u_i)\big)\}$. Based on the $Y$ data, estimate the conditional distribution $D_x(\cdot)$ by $\bar D_x(\cdot)$, and let $u_i = \bar D_{x_i}(Y_i)$ to obtain the transformed data $u_1,\ldots,u_n$ that are approximately i.i.d.

Resampling Algorithm: MF$^2$ predictive distribution of $g(Y_f)$.
1. Bootstrap $u_1,\ldots,u_n$ to create bootstrap pseudo-data $u_1^*,\ldots,u_n^*$ whose empirical distribution is denoted $\hat F_n^*$.
2. Use the inverse transformation $\bar D_x^{-1}$ to create pseudo-data in the $Y$ domain, i.e., $Y_t^* = \bar D_{x_t}^{-1}(u_t^*)$ for $t = 1,\ldots,n$.
3. Generate a bootstrap pseudo-response $Y_f^* = \bar D_{x_f}^{-1}(u)$, where $u$ is drawn randomly from the set $(u_1,\ldots,u_n)$.
4. Based on the pseudo-data $Y_t^*$, re-estimate the conditional distribution $D_x(\cdot)$; denote the bootstrap estimator by $\bar D_x^*(\cdot)$.
5. Calculate the bootstrap root $g(Y_f^*) - \Pi^*$, where $\Pi^* = n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{*\,-1}(u_i^*)\big)$ or $\Pi^* = \mathrm{median}\{g\big(\bar D_{x_f}^{*\,-1}(u_i^*)\big)\}$.
6. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with $\alpha$-quantile denoted $q(\alpha)$.
The predictive distribution of $g(Y_f)$ is this edf shifted to the right by $\Pi$, and the MF$^2$ $(1-\alpha)100\%$ equal-tailed prediction interval for $g(Y_f)$ is $[\Pi + q(\alpha/2),\ \Pi + q(1-\alpha/2)]$.
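A minimal sketch of the MF$^2$ resampling algorithm above for the $L_2$ case, assuming D_hat and D_hat_inv from the earlier sketch; B, alpha and the seed are illustrative.

```python
# MF2 bootstrap prediction interval for g(Y_f) at the point xf.
import numpy as np

def mf2_prediction_interval(xs, ys, xf, h, g=lambda y: y, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(ys)
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(n)])   # u_i = D^_{x_i}(Y_i)
    Pi = np.mean([g(D_hat_inv(ui, xf, xs, ys, h)) for ui in u])        # L2 point predictor
    roots = np.empty(B)
    for b in range(B):
        u_star = rng.choice(u, size=n, replace=True)                   # bootstrap u*_t
        ys_star = np.array([D_hat_inv(u_star[t], xs[t], xs, ys, h) for t in range(n)])
        yf_star = D_hat_inv(rng.choice(u), xf, xs, ys, h)              # pseudo-response Y*_f
        # Re-estimated conditional cdf: D_hat_inv evaluated on the pseudo-data ys_star.
        Pi_star = np.mean([g(D_hat_inv(us, xf, xs, ys_star, h)) for us in u_star])
        roots[b] = g(yf_star) - Pi_star                                # bootstrap root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi
```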

MF$^2$: Confidence intervals for the regression function without an additive model. To fix ideas, assume $g(x) = x$. The MF$^2$ $L_2$-optimal point predictor of $Y_f$ is $\Pi_{x_f} = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i)$, where $u_i = \bar D_{x_i}(Y_i)$. $\Pi_{x_f}$ is an estimate of the regression function $\mu(x_f) = E(Y_f \mid x_f)$. How does $\Pi_{x_f}$ compare to the kernel smoother $m_{x_f}$? Can we construct confidence intervals for the regression function $\mu(x_f)$ without the additive model (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1)?

MF$^2$: Confidence intervals for the regression function without an additive model. Note that $\Pi_{x_f}$ is defined as a function of the approximately i.i.d. variables $u_1,\ldots,u_n$, so $\Pi_{x_f}$ can be bootstrapped by resampling $u_1,\ldots,u_n$. In fact, $\Pi_{x_f}$ is asymptotically equivalent to the kernel smoother $m_{x_f}$, i.e., $\Pi_{x_f} = m_{x_f} + o_P(1/\sqrt{nh})$. Recall that the $u_i$ are approximately i.i.d. Uniform(0,1). So, for large $n$, $\Pi_{x_f} = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i) \approx \int_0^1 \bar D_{x_f}^{-1}(u)\,du = m_{x_f}$. This is just the identity $\int y\,dF(y) = \int_0^1 F^{-1}(u)\,du$.

Resampling Algorithm: MF$^2$ confidence intervals for $\mu(x_f)$. Define the confidence root $\mu(x_f) - \Pi_{x_f}$.
1. Bootstrap $u_1,\ldots,u_n$ to create bootstrap pseudo-data $u_1^*,\ldots,u_n^*$ whose empirical distribution is denoted $\hat F_n^*$.
2. Use the inverse transformation $\bar D_x^{-1}$ to create pseudo-data in the $Y$ domain, i.e., $Y_t^* = \bar D_{x_t}^{-1}(u_t^*)$ for $t = 1,\ldots,n$.
3. Based on the pseudo-data $Y_t^*$, re-estimate the conditional distribution $D_x(\cdot)$; denote the bootstrap estimator by $\bar D_x^*(\cdot)$.
4. Calculate the bootstrap confidence root $\Pi_{x_f} - \Pi^*$, where $\Pi^* = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{*\,-1}(u_i^*)$.
5. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with $\alpha$-quantile denoted $q(\alpha)$.
The MF$^2$ $(1-\alpha)100\%$ equal-tailed confidence interval for $\mu(x_f)$ is $[\Pi_{x_f} + q(\alpha/2),\ \Pi_{x_f} + q(1-\alpha/2)]$.
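A minimal sketch of the MF$^2$ confidence interval; it differs from the prediction-interval sketch above only in the root ($\Pi_{x_f} - \Pi^*$, with no pseudo-response drawn). Same assumptions as before (D_hat, D_hat_inv from the earlier sketch; illustrative B, alpha, seed).

```python
# MF2 bootstrap confidence interval for the regression function mu(x_f).
import numpy as np

def mf2_confidence_interval(xs, ys, xf, h, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(ys)
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(n)])
    Pi = np.mean([D_hat_inv(ui, xf, xs, ys, h) for ui in u])      # Pi_{x_f}
    roots = np.empty(B)
    for b in range(B):
        u_star = rng.choice(u, size=n, replace=True)
        ys_star = np.array([D_hat_inv(u_star[t], xs[t], xs, ys, h) for t in range(n)])
        Pi_star = np.mean([D_hat_inv(us, xf, xs, ys_star, h) for us in u_star])
        roots[b] = Pi - Pi_star                                    # confidence root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi
```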

Simulation: regression under model (*). (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. Design points $x_1,\ldots,x_n$ for $n = 100$ drawn from a uniform distribution on $(0, 2\pi)$; $\mu(x) = \sin(x)$, $\sigma(x) = (\cos(x/2) + 2)/7$, and errors N(0,1) or Laplace. Prediction points: $x_f = \pi$, where $\mu(x)$ has high slope but zero curvature (an easy case for estimation); $x_f = \pi/2$ and $x_f = 3\pi/2$, where $\mu(x)$ has zero slope but high curvature (peak and valley), so the bias of $m_x$ is large.
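A minimal sketch of this simulation design; the function name, the seed and the Laplace scaling are illustrative.

```python
# Generate one dataset from the simulation design under model (*).
import numpy as np

def simulate_regression(n=100, laplace=False, seed=0):
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 2 * np.pi, size=n)               # design points on (0, 2*pi)
    mu = np.sin(xs)                                         # mu(x) = sin(x)
    sigma = (np.cos(xs / 2) + 2) / 7                        # sigma(x) = (cos(x/2) + 2)/7
    if laplace:
        eps = rng.laplace(0.0, 1 / np.sqrt(2), size=n)      # Laplace scaled to variance 1
    else:
        eps = rng.standard_normal(n)                        # N(0,1) errors
    return xs, mu + sigma * eps
```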

FIGURE 6: Typical scatterplots with superimposed kernel smoothers; (a) Normal data; (b) Laplace data.

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.845  0.832  0.822  0.838  0.847  0.865  0.798
MF/MB      0.901  0.912  0.874  0.897  0.921  0.908  0.861
MF^2       0.834  0.836  0.829  0.831  0.849  0.852  0.804
MF/MF^2    0.897  0.906  0.886  0.886  0.912  0.895  0.868
Normal     0.874  0.876  0.879  0.860  0.863  0.867  0.856
Table 4.2(a). Empirical coverage levels (CVR) of prediction intervals by method at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Normal. [Standard error of CVRs = 0.013]

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.870  0.820  0.841  0.894  0.883  0.872  0.831
MF/MB      0.899  0.870  0.886  0.910  0.912  0.905  0.868
MF^2       0.868  0.805  0.834  0.867  0.874  0.859  0.825
MF/MF^2    0.905  0.876  0.885  0.916  0.910  0.910  0.865
Normal     0.874  0.849  0.876  0.888  0.885  0.879  0.874
Table 4.3(a). Empirical coverage levels (CVR) of prediction intervals by method at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Laplace.

The NORMAL intervals exhibit under-coverage even when the true distribution is Normal; they need explicit bias correction. Bootstrap methods are more variable due to the extra randomization. The MF/MB intervals are more accurate than their MB analogs in the case x_f = π/2. When x_f = π, the MB intervals are most accurate, and the MF/MB intervals seem to over-correct (and over-cover). Over-coverage can be attributed to bias leakage and should be alleviated by a larger sample size, by higher-order kernels, or by a bandwidth trick such as undersmoothing.

The performance of the MF^2 (resp. MF/MF^2) intervals resembles that of the MB (resp. MF/MB) intervals. The MF/MF^2 intervals are best at x_f = π/2; this is quite surprising, since one would expect the MB and MF/MB intervals to have a distinct advantage when model (*) is true. The price to pay for using the generally valid MF/MF^2 intervals instead of the model-specific MF/MB ones is the increased variability of the interval length.

Simulation: regression without model (*). Instead: $Y = \mu(x) + \sigma(x)\,\varepsilon_x$ with $\varepsilon_x = \dfrac{c_x Z + (1 - c_x) W}{\sqrt{c_x^2 + (1 - c_x)^2}}$, where $Z \sim N(0,1)$ is independent of $W$, which also has mean 0 and variance 1. $W$ is either exponential, i.e., $\tfrac{1}{2}\chi_2^2 - 1$, to capture skewness, or Student's $t$ with 5 d.f., i.e., $\sqrt{3/5}\,t_5$, to capture kurtosis. The $\varepsilon_x$ are independent but not i.i.d.: $c_x = x/(2\pi)$ for $x \in [0, 2\pi]$. For large $x$, $\varepsilon_x$ is close to Normal; for small $x$, $\varepsilon_x$ is skewed/kurtotic.
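A minimal sketch of the non-i.i.d. error design above (skewed case shown; the kurtotic case would draw $W$ as $\sqrt{3/5}\,t_5$ instead); names and the seed are illustrative.

```python
# Generate one dataset with independent but non-identically-distributed errors.
import numpy as np

def simulate_noniid(n=100, seed=0):
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 2 * np.pi, size=n)
    c = xs / (2 * np.pi)                                    # c_x = x/(2*pi)
    Z = rng.standard_normal(n)
    W = 0.5 * rng.chisquare(2, size=n) - 1.0                # (1/2)*chi2_2 - 1: mean 0, var 1
    eps = (c * Z + (1 - c) * W) / np.sqrt(c ** 2 + (1 - c) ** 2)
    mu, sigma = np.sin(xs), (np.cos(xs / 2) + 2) / 7
    return xs, mu + sigma * eps
```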

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.886  0.893  0.840  0.840  0.888  0.816  0.780
MF/MB      0.917  0.927  0.877  0.917  0.928  0.892  0.849
MF^2       0.894  0.876  0.823  0.843  0.876  0.813  0.782
MF/MF^2    0.917  0.932  0.897  0.894  0.930  0.874  0.836
Normal     0.903  0.920  0.876  0.888  0.883  0.838  0.834
Table 4.4(a). CVRs with error distribution non-i.i.d. skewed.

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.836  0.849  0.804  0.868  0.872  0.831  0.805
MF/MB      0.883  0.885  0.850  0.921  0.923  0.908  0.861
MF^2       0.814  0.843  0.798  0.861  0.863  0.829  0.816
MF/MF^2    0.876  0.899  0.858  0.923  0.921  0.894  0.877
Normal     0.858  0.876  0.859  0.883  0.872  0.845  0.865
Table 4.5(a). CVRs with error distribution non-i.i.d. kurtotic.

The NORMAL intervals are totally unreliable, which is to be expected given the non-normal error distributions. The MF/MF^2 intervals are best (by far) in the cases x_f = π/2 and x_f = 3π/2, attaining close-to-nominal coverage even with a sample size as low as n = 100. The case x_f = π remains problematic, for the same reasons discussed previously.

Conclusions. Model-free prediction principle. A novel viewpoint for inference: estimation and prediction. Regression: point predictors and MF$^2$. Prediction intervals with good coverage properties. Totally automatic: no need to search for optimal transformations in regression... but very computationally intensive!

Bootstrap prediction intervals for time series [joint work with Li Pan, UCSD]. TIME SERIES DATA: $X_1,\ldots,X_n$. GOAL: predict $X_{n+1}$ given the data (new notation). Problem: we cannot choose $x_f$; it is given by the recent history of the time series, e.g., $X_{n-1},\ldots,X_{n-p}$ for some $p$. The bootstrap series $X_1^*, X_2^*,\ldots,X_{n-1}^*, X_n^*, X_{n+1}^*,\ldots$ must be such that it has the same values for the recent history, i.e., $X_{n-1}^* = X_{n-1}$, $X_{n-2}^* = X_{n-2}$, ..., $X_{n-p}^* = X_{n-p}$.

Forward" bootstrap: Linear AR model: φ(b)x t = ɛ t with ɛ t iid Let ˆφ be the LS (scatterplot) estimate of φ based on X 1,..., X n Let φ be the LS (scatterplot) estimate of φ based on X 1,..., X n p, X n p+1,..., X n i.e., the last p values have been changed/corrupted/omitted. ˆφ φ = O p (1/n) Hence, we can artificially change the last p values in the bootstrap world and the effect will be negligible. Bootstrap series: X1,..., X n p, X n p+1,..., X n where Xt = ˆφ(B) 1 ɛ t (forward recursion) and X n p+1,..., X n are the same values as in the original dataset.