Model-free prediction intervals for regression and autoregression. Dimitris N. Politis, University of California, San Diego

To explain or to predict? Models are indispensable for exploring and utilizing relationships between variables: explaining the world. Use of models for prediction can be problematic when: a model is overspecified; parameter inference is highly model-specific (and sensitive to model mis-specification); prediction is carried out by plugging in the estimated parameters and treating the model as exactly true. "All models are wrong but some are useful" (George Box).

A Toy Example. Assume the regression model $Y = \beta_0 + \beta_1 X + \beta_2 X^{20} + \text{error}$. If $\hat\beta_2$ is barely statistically significant, do you still use it in prediction? If the true value of $\beta_2$ is close to zero and $\mathrm{var}(\hat\beta_2)$ is large, it may be advantageous to omit $\beta_2$: allow a nonzero bias but minimize MSE. A mis-specified model can be optimal for prediction!

Prediction Framework: (a) point predictors; (b) interval predictors; (c) predictive distribution. There is an abundant Bayesian literature in the parametric framework: Cox (1975), Geisser (1993), etc. The frequentist and/or nonparametric literature is scarce.

I.i.d. set-up. Let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. from the (unknown) cdf $F_\varepsilon$. GOAL: prediction of the future $\varepsilon_{n+1}$ based on the data. $F_\varepsilon$ is the predictive distribution, and its quantiles can be used to form prediction intervals. The mean and median of $F_\varepsilon$ are the optimal point predictors under an $L_2$ and an $L_1$ criterion, respectively.

I.i.d. data. $F_\varepsilon$ is unknown but can be estimated by the empirical distribution function (edf) $\hat F_\varepsilon$. Practical model-free prediction intervals will be based on quantiles of $\hat F_\varepsilon$, and the $L_2$- and $L_1$-optimal predictors will be approximated by the mean and median of $\hat F_\varepsilon$, respectively.

Non-i.i.d. data. In general, the data $\mathbf{Y}_n = (Y_1,\ldots,Y_n)$ are not i.i.d. So the predictive distribution of $Y_{n+1}$ given the data will depend on $\mathbf{Y}_n$ and on $X_{n+1}$, a matrix of observable, explanatory (predictor) variables. Key examples: regression and time series.

Models. Regression: $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1). Time series: $Y_t = \mu(Y_{t-1},\ldots,Y_{t-p};\,x_t) + \sigma(Y_{t-1},\ldots,Y_{t-p};\,x_t)\,\varepsilon_t$. The above are flexible, nonparametric models. Given one of these models, optimal model-based predictors of a future $Y$-value can be constructed. Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering at the very least robustness against model mis-specification.

Transformation vs. modeling. DATA: $\mathbf{Y}_n = (Y_1,\ldots,Y_n)$. GOAL: predict the future value $Y_{n+1}$ given the data. Find an invertible transformation $H_m$ so that (for all $m$) the vector $\epsilon_m = H_m(\mathbf{Y}_m)$ has i.i.d. components $\epsilon_k$, where $\epsilon_m = (\epsilon_1,\ldots,\epsilon_m)$; schematically, $\mathbf{Y} \xrightarrow{H_m} \epsilon$ and $\mathbf{Y} \xleftarrow{H_m^{-1}} \epsilon$.

Transformation. (i) $(Y_1,\ldots,Y_m) \xrightarrow{H_m} (\epsilon_1,\ldots,\epsilon_m)$; (ii) $(Y_1,\ldots,Y_m) \xleftarrow{H_m^{-1}} (\epsilon_1,\ldots,\epsilon_m)$. (i) implies that $\epsilon_1,\ldots,\epsilon_n$ are known given the data $Y_1,\ldots,Y_n$. (ii) implies that $Y_{n+1}$ is a function of $\epsilon_1,\ldots,\epsilon_n$ and $\epsilon_{n+1}$. So, given the data $\mathbf{Y}_n$, $Y_{n+1}$ is a function of $\epsilon_{n+1}$ only, i.e., $Y_{n+1} = h(\epsilon_{n+1})$.

Model-free prediction principle. $Y_{n+1} = h(\epsilon_{n+1})$. Suppose $\epsilon_1,\ldots,\epsilon_n \sim$ cdf $F_\epsilon$. The mean and median of $h(\epsilon)$, where $\epsilon \sim F_\epsilon$, are the optimal point predictors of $Y_{n+1}$ under the $L_2$ and $L_1$ criterion, respectively. The whole predictive distribution of $Y_{n+1}$ is the distribution of $h(\epsilon)$ when $\epsilon \sim F_\epsilon$. To predict $Y_{n+1}^2$, replace $h$ by $h^2$; to predict $g(Y_{n+1})$, replace $h$ by $g \circ h$. The unknown $F_\epsilon$ can be estimated by $\hat F_\epsilon$, the edf of $\epsilon_1,\ldots,\epsilon_n$. But the predictive distribution also needs bootstrapping, because $h$ is estimated from the data.

Nonparametric Regression. MODEL (*): $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$, where $x_t$ is univariate and deterministic; $Y_t$ data are available for $t = 1,\ldots,n$; $\varepsilon_t$ are i.i.d. (0,1) from the (unknown) cdf $F$; the functions $\mu(\cdot)$ and $\sigma(\cdot)$ are unknown but smooth.

Nonparametric Regression. Note: $\mu(x) = E(Y \mid x)$ and $\sigma^2(x) = \mathrm{Var}(Y \mid x)$. Let $m_x$, $s_x$ be smoothing estimators of $\mu(x)$, $\sigma(x)$; examples: kernel smoothers, local linear fitting, wavelets, etc. E.g., the Nadaraya-Watson estimator $m_x = \sum_{i=1}^n Y_i \tilde K\big(\frac{x - x_i}{h}\big)$, where $K(x)$ is the kernel, $h$ the bandwidth, and $\tilde K\big(\frac{x - x_i}{h}\big) = K\big(\frac{x - x_i}{h}\big) \big/ \sum_{k=1}^n K\big(\frac{x - x_k}{h}\big)$. Similarly, $s_x^2 = M_x - m_x^2$, where $M_x = \sum_{i=1}^n Y_i^2 \tilde K\big(\frac{x - x_i}{q}\big)$ with bandwidth $q$.
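For concreteness, here is a minimal sketch of the smoothers $m_x$ and $s_x$ described above; the function names, the Gaussian kernel, and the numpy interface are illustrative assumptions, not part of the talk.

```python
# Sketch of the Nadaraya-Watson estimators m_x and s_x (Gaussian kernel assumed).
import numpy as np

def nw_weights(x, xs, h):
    """Normalized kernel weights K~((x - x_i)/h) at the point x."""
    k = np.exp(-0.5 * ((x - xs) / h) ** 2)   # Gaussian kernel K
    return k / k.sum()

def nw_mean(x, xs, ys, h):
    """Nadaraya-Watson smoother m_x = sum_i Y_i K~((x - x_i)/h)."""
    return nw_weights(x, xs, h) @ ys

def nw_sd(x, xs, ys, h, q):
    """s_x from s_x^2 = M_x - m_x^2, with M_x a smoother of Y_i^2 (bandwidth q)."""
    m = nw_mean(x, xs, ys, h)
    M = nw_weights(x, xs, q) @ (ys ** 2)
    return np.sqrt(max(M - m ** 2, 0.0))
```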

FIGURE 1: (a) Log-wage vs. age data with fitted kernel smoother $m_x$. (b) Unstudentized residuals $Y - m_x$ with $s_x$ superimposed. Data: 1971 Canadian Census data cps71 from the np package of R; wage vs. age for 205 male individuals with common education. The kernel smoother is problematic at the left boundary; local linear fitting (Fan and Gijbels, 1996) or reflection (Hall and Wehrly, 1991) is better.

Residuals. (*): $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$. Fitted residuals: $e_t = (Y_t - m_{x_t})/s_{x_t}$. Predictive residuals: $\tilde e_t = (Y_t - m^{(t)}_{x_t})/s^{(t)}_{x_t}$, where $m^{(t)}_x$ and $s^{(t)}_x$ are the estimators $m$ and $s$ computed from the delete-$Y_t$ dataset $\{(Y_i, x_i),\ i \neq t\}$. $\tilde e_t$ is the (standardized) error in trying to predict $Y_t$ from the delete-$Y_t$ dataset. Selection of the bandwidth parameters $h$ and $q$ is often done by cross-validation, i.e., pick $h, q$ to minimize PRESS $= \sum_{t=1}^n \tilde e_t^2$. BETTER: $L_1$ cross-validation, i.e., pick $h, q$ to minimize $\sum_{t=1}^n |\tilde e_t|$.
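A minimal sketch of the predictive (delete-one) residuals and the $L_1$ cross-validation criterion, assuming the nw_mean and nw_sd helpers from the previous sketch; the names and the grid-search interface are illustrative.

```python
# Delete-Y_t residuals e~_t and an L1 cross-validation bandwidth search.
import numpy as np

def predictive_residuals(xs, ys, h, q):
    """Standardized delete-Y_t prediction errors e~_t."""
    res = np.empty(len(ys))
    for t in range(len(ys)):
        keep = np.arange(len(ys)) != t
        m = nw_mean(xs[t], xs[keep], ys[keep], h)
        s = nw_sd(xs[t], xs[keep], ys[keep], h, q)
        res[t] = (ys[t] - m) / s
    return res

def l1_cross_validation(xs, ys, grid):
    """Pick (h, q) minimizing sum_t |e~_t| over a grid of candidate bandwidths."""
    return min(grid, key=lambda hq: np.abs(predictive_residuals(xs, ys, *hq)).sum())
```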

Model-based (MB) point predictors. (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. GOAL: predict a future response $Y_f$ associated with the point $x_f$. The $L_2$-optimal predictor of $Y_f$ is $E(Y_f \mid x_f)$, i.e., $\mu(x_f)$, which is approximated by $m_{x_f}$. The $L_1$-optimal predictor of $Y_f$ is the conditional median, i.e., $\mu(x_f) + \sigma(x_f)\,\mathrm{median}(F)$, which is approximated by $m_{x_f} + s_{x_f}\,\mathrm{median}(\hat F_e)$, where $\hat F_e$ is the edf of the residuals $e_1,\ldots,e_n$.

Model-based (MB) point predictors. (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. DATASET cps71: salaries are logarithmically transformed, i.e., $Y_t$ = log-salary. To predict the salary at age $x_f$ we need to predict $g(Y_f)$, where $g(x) = \exp(x)$. The MB $L_2$-optimal predictor of $g(Y_f)$ is $E(g(Y_f) \mid x_f)$, estimated by $n^{-1}\sum_{i=1}^n g(m_{x_f} + s_{x_f} e_i)$. The naive predictor $g(m_{x_f})$ is grossly suboptimal when $g$ is nonlinear. The MB $L_1$-optimal predictor of $g(Y_f)$ is estimated by the sample median of the set $\{g(m_{x_f} + s_{x_f} e_i),\ i = 1,\ldots,n\}$; the naive plug-in is OK iff $g$ is monotone!

Resampling Algorithm for the predictive distribution of $g(Y_f)$. Prediction root: $g(Y_f) - \Pi$, where $\Pi$ is the point predictor.
1. Bootstrap the (fitted or predictive) residuals $r_1,\ldots,r_n$ to create pseudo-residuals $r_1^*,\ldots,r_n^*$ whose edf is denoted by $\hat F_n^*$.
2. Create pseudo-data $Y_i^* = m_{x_i} + s_{x_i} r_i^*$ for $i = 1,\ldots,n$.
3. Calculate a bootstrap pseudo-response $Y_f^* = m_{x_f} + s_{x_f} r$, where $r$ is drawn randomly from $(r_1,\ldots,r_n)$.
4. Based on the pseudo-data $Y_1^*,\ldots,Y_n^*$, re-estimate the functions $\mu(x)$ and $\sigma(x)$ by $m_x^*$ and $s_x^*$.
5. Calculate the bootstrap root $g(Y_f^*) - \Pi(g, m_x^*, s_x^*, \mathbf{Y}_n, X_{n+1}, \hat F_n^*)$.
6. Repeat the above B times, and collect the B bootstrap roots in an empirical distribution with $\alpha$-quantile denoted $q(\alpha)$.
The estimate of the predictive distribution of $g(Y_f)$ is the empirical df of the bootstrap roots shifted to the right by $\Pi$. Then a $(1-\alpha)100\%$ equal-tailed prediction interval for $g(Y_f)$ is $[\Pi + q(\alpha/2),\ \Pi + q(1-\alpha/2)]$.
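A minimal sketch of the model-based residual-bootstrap interval following the steps above, assuming the nw_mean, nw_sd and predictive_residuals helpers from the earlier sketches; B, alpha, the seed and the use of predictive residuals are illustrative defaults.

```python
# Model-based (MB) bootstrap prediction interval for g(Y_f) at the point xf.
import numpy as np

def mb_prediction_interval(xs, ys, xf, h, q, g=lambda y: y, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    m = np.array([nw_mean(x, xs, ys, h) for x in xs])
    s = np.array([nw_sd(x, xs, ys, h, q) for x in xs])
    r = predictive_residuals(xs, ys, h, q)                    # residuals r_1,...,r_n
    mf, sf = nw_mean(xf, xs, ys, h), nw_sd(xf, xs, ys, h, q)
    Pi = np.mean(g(mf + sf * r))                              # L2-optimal point predictor
    roots = np.empty(B)
    for b in range(B):
        r_star = rng.choice(r, size=len(ys), replace=True)    # pseudo-residuals r*_i
        ys_star = m + s * r_star                              # pseudo-data Y*_i
        yf_star = mf + sf * rng.choice(r)                     # pseudo-response Y*_f
        mf_star = nw_mean(xf, xs, ys_star, h)                 # re-estimated mu, sigma at xf
        sf_star = nw_sd(xf, xs, ys_star, h, q)
        Pi_star = np.mean(g(mf_star + sf_star * r_star))      # bootstrap point predictor
        roots[b] = g(yf_star) - Pi_star                       # bootstrap root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi                                   # equal-tailed interval
```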

Model-free prediction in regression. The previous discussion hinged on the model (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) from cdf $F$. What happens if model (*) does not hold true? E.g., the skewness and/or kurtosis of $Y_t$ may depend on $x_t$. cps71 data: the skewness and kurtosis of salary depend on age.

FIGURE: (a) Log-wage skewness vs. age. (b) Log-wage kurtosis vs. age. Both skewness and kurtosis are nonconstant!

General background. Could try skewness-reducing transformations, but the log already does that. Could try ACE, AVAS, etc. There is a simpler, more general solution!

General background. The $Y_t$'s are still independent but not identically distributed. Denote the conditional distribution of $Y_f$ given $x_f$ by $D_x(y) = P\{Y_f \le y \mid x_f = x\}$, and assume that $D_x(y)$ is continuous in both $x$ and $y$. (With a categorical response, standard methods like generalized linear models can be invoked, e.g., logistic regression, Poisson regression, etc.) Since $D_x(\cdot)$ depends on $x$ in a smooth way, we can estimate $D_x(y)$ by the local empirical distribution $N_{x,h}^{-1} \sum_{t:\,|x_t - x| < h/2} 1\{Y_t \le y\}$, where $1\{\cdot\}$ is the indicator and $N_{x,h}$ is the number of summands, i.e., $N_{x,h} = \#\{t : |x_t - x| < h/2\}$.

Constructing the transformation. A more general estimator is $\hat D_x(y) = \sum_{i=1}^n 1\{Y_i \le y\}\, \tilde K\big(\frac{x - x_i}{h}\big)$. $\hat D_x(y)$ is just a Nadaraya-Watson smoother of the variables $1\{Y_t \le y\}$, $t = 1,\ldots,n$; one can also use a local linear smoother of these variables. The estimator $\hat D_x(y)$ enjoys many good properties, including asymptotic consistency; see e.g. Li and Racine (2007). But $\hat D_x(y)$ is discontinuous in $y$, and therefore unacceptable for our purposes! One could smooth it by kernel methods, or use piecewise-linear interpolation; call the resulting continuous estimator $\bar D_x(\cdot)$ (see the next figure).
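A minimal sketch of $\hat D_x(y)$ as a Nadaraya-Watson smoother of the indicators $1\{Y_i \le y\}$, made continuous in $y$ by piecewise-linear interpolation (one simple way to implement the interpolation idea); it assumes the nw_weights helper from the earlier sketch, and the function names are illustrative.

```python
# Smoothed conditional cdf estimate and its inverse, interpolated over sorted Y's.
import numpy as np

def D_hat(y, x, xs, ys, h):
    """Continuous (piecewise-linear) version of D^_x(y)."""
    w = nw_weights(x, xs, h)                 # kernel weights at x
    order = np.argsort(ys)
    y_sorted, w_sorted = ys[order], w[order]
    cdf = np.cumsum(w_sorted)                # step-function values at the sorted Y's
    return float(np.interp(y, y_sorted, cdf, left=0.0, right=1.0))

def D_hat_inv(u, x, xs, ys, h):
    """Inverse D^_x^{-1}(u), interpolating the (cdf, Y) pairs the other way."""
    w = nw_weights(x, xs, h)
    order = np.argsort(ys)
    y_sorted, w_sorted = ys[order], w[order]
    cdf = np.cumsum(w_sorted)
    return float(np.interp(u, cdf, y_sorted))
```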

Linear interpolation. FIGURE 2: (a) Empirical distribution of a test sample consisting of five N(0,1) and five N(1/2,1) independent r.v.'s, with the piecewise-linear estimator $\bar D(\cdot)$ superimposed; the vertical/horizontal lines indicate the inversion process, i.e., finding $\bar D^{-1}(0.75)$. (b) Q-Q plot of the transformed variables $u_i$ vs. the quantiles of Uniform(0,1).

Constructing the transformation. Since the $Y_t$'s are continuous r.v.'s, the probability integral transform is the key idea for transforming them to i.i.d.-ness. To see why, note that if we let $\eta_i = D_{x_i}(Y_i)$ for $i = 1,\ldots,n$, our transformation objective would be achieved exactly, since $\eta_1,\ldots,\eta_n$ would be i.i.d. Uniform(0,1). $D_x(\cdot)$ is not known, but we have the estimator $\bar D_x(\cdot)$ as its proxy. Therefore, our proposed transformation for the MF prediction principle is $u_i = \bar D_{x_i}(Y_i)$ for $i = 1,\ldots,n$. $\bar D_x(\cdot)$ is consistent, so $u_1,\ldots,u_n$ are approximately i.i.d. The probability integral transform has been used in the past for building better density estimators; see Ruppert and Cline (1994).

Model-free optimal predictors. Transformation: $u_i = \bar D_{x_i}(Y_i)$ for $i = 1,\ldots,n$. The inverse transformation $\bar D_x^{-1}$ is well-defined since $\bar D_x(\cdot)$ is strictly increasing. Let $u_f = \bar D_{x_f}(Y_f)$ and $Y_f = \bar D_{x_f}^{-1}(u_f)$. Then $\bar D_{x_f}^{-1}(u_i)$ has (approximately) the same distribution as $Y_f$ (conditionally on $x_f$) for any $i$. So $\{\bar D_{x_f}^{-1}(u_i),\ i = 1,\ldots,n\}$ is a set of bona fide potential responses that can be used as proxies for $Y_f$.

These $n$ valid potential responses $\{\bar D_{x_f}^{-1}(u_i),\ i = 1,\ldots,n\}$, gathered together, give an approximate empirical distribution for $Y_f$ from which our predictors will be derived. The $L_2$-optimal predictor of $g(Y_f)$ is the expected value of $g(Y_f)$, approximated by $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$. The $L_1$-optimal predictor of $g(Y_f)$ is approximated by the sample median of the set $\{g\big(\bar D_{x_f}^{-1}(u_i)\big),\ i = 1,\ldots,n\}$.

Model-free optimal point predictors (model-free method):
$L_2$ predictor of $Y_f$: $n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}\big(\bar D_{x_i}(Y_i)\big)$
$L_1$ predictor of $Y_f$: $\mathrm{median}\{\bar D_{x_f}^{-1}\big(\bar D_{x_i}(Y_i)\big)\}$
$L_2$ predictor of $g(Y_f)$: $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(\bar D_{x_i}(Y_i))\big)$
$L_1$ predictor of $g(Y_f)$: $\mathrm{median}\{g\big(\bar D_{x_f}^{-1}(\bar D_{x_i}(Y_i))\big)\}$
TABLE: The model-free (MF$^2$) optimal point predictors.
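A minimal sketch of the tabulated MF$^2$ point predictors, assuming the D_hat and D_hat_inv helpers from the earlier sketch; the function name is illustrative.

```python
# Model-free L2- and L1-optimal point predictors of g(Y_f) at the point xf.
import numpy as np

def mf_point_predictors(xs, ys, xf, h, g=lambda y: y):
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(len(ys))])
    proxies = np.array([g(D_hat_inv(ui, xf, xs, ys, h)) for ui in u])
    return proxies.mean(), np.median(proxies)   # L2 predictor, L1 predictor
```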

Model-free model-fitting. The MF predictors (mean or median) can be used to give the equivalent of a model fit. Focus on the $L_2$-optimal case with $g(x) = x$. Calculating the MF predictor $\Pi(x_f) = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i)$ for many different $x_f$ values, say on a grid, constructs the equivalent of a nonparametric smoother of the regression function: Model-Free Model-Fitting (MF$^2$).

M.o.a.T. MF$^2$ relieves the practitioner from the need to find the optimal transformation for additivity and variance stabilization (Box/Cox, ACE, AVAS, etc.); see Figures 3 and 4. No need for a log-transformation of salaries! MF$^2$ is totally automatic!

FIGURE 3: (a) Wage vs. age scatterplot. (b) Circles indicate the salary predictor $n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$ calculated from the log-wage data with $g$ exponential. In both panels, the superimposed solid line is the MF$^2$ salary predictor calculated from the raw data (without the log).

FIGURE 4: Q-Q plots of the $u_i$ vs. the quantiles of Uniform(0,1). (a) The $u_i$'s are obtained from the log-wage vs. age dataset of Figure 1 using bandwidth 5.5; (b) the $u_i$'s are obtained from the raw (untransformed) dataset of Figure 3 using bandwidth 7.3.

MF$^2$ predictive distributions. For MF$^2$ we can always take $g(x) = x$; there is no need for other preliminary transformations. Let $g(Y_f) - \Pi$ be the prediction root, where $\Pi$ is either the $L_2$- or the $L_1$-optimal predictor, i.e., $\Pi = n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{-1}(u_i)\big)$ or $\Pi = \mathrm{median}\{g\big(\bar D_{x_f}^{-1}(u_i)\big)\}$. Based on the $Y$ data, estimate the conditional distribution $D_x(\cdot)$ by $\bar D_x(\cdot)$, and let $u_i = \bar D_{x_i}(Y_i)$ to obtain the transformed data $u_1,\ldots,u_n$ that are approximately i.i.d.

Resampling Algorithm: MF$^2$ predictive distribution of $g(Y_f)$.
1. Bootstrap $u_1,\ldots,u_n$ to create bootstrap pseudo-data $u_1^*,\ldots,u_n^*$ whose empirical distribution is denoted $\hat F_n^*$.
2. Use the inverse transformation $\bar D_x^{-1}$ to create pseudo-data in the $Y$ domain, i.e., $Y_t^* = \bar D_{x_t}^{-1}(u_t^*)$ for $t = 1,\ldots,n$.
3. Generate a bootstrap pseudo-response $Y_f^* = \bar D_{x_f}^{-1}(u)$, where $u$ is drawn randomly from the set $(u_1,\ldots,u_n)$.
4. Based on the pseudo-data $Y_t^*$, re-estimate the conditional distribution $D_x(\cdot)$; denote the bootstrap estimator by $\bar D_x^*(\cdot)$.
5. Calculate the bootstrap root $g(Y_f^*) - \Pi^*$, where $\Pi^* = n^{-1}\sum_{i=1}^n g\big(\bar D_{x_f}^{*\,-1}(u_i^*)\big)$ or $\Pi^* = \mathrm{median}\{g\big(\bar D_{x_f}^{*\,-1}(u_i^*)\big)\}$.
6. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with $\alpha$-quantile denoted $q(\alpha)$.
The predictive distribution of $g(Y_f)$ is this edf shifted to the right by $\Pi$, and the MF$^2$ $(1-\alpha)100\%$ equal-tailed prediction interval for $g(Y_f)$ is $[\Pi + q(\alpha/2),\ \Pi + q(1-\alpha/2)]$.
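A minimal sketch of the MF$^2$ resampling algorithm above for the $L_2$ case, assuming D_hat and D_hat_inv from the earlier sketch; B, alpha and the seed are illustrative.

```python
# MF2 bootstrap prediction interval for g(Y_f) at the point xf.
import numpy as np

def mf2_prediction_interval(xs, ys, xf, h, g=lambda y: y, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(ys)
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(n)])   # u_i = D^_{x_i}(Y_i)
    Pi = np.mean([g(D_hat_inv(ui, xf, xs, ys, h)) for ui in u])        # L2 point predictor
    roots = np.empty(B)
    for b in range(B):
        u_star = rng.choice(u, size=n, replace=True)                   # bootstrap u*_t
        ys_star = np.array([D_hat_inv(u_star[t], xs[t], xs, ys, h) for t in range(n)])
        yf_star = D_hat_inv(rng.choice(u), xf, xs, ys, h)              # pseudo-response Y*_f
        # Re-estimated conditional cdf: D_hat_inv evaluated on the pseudo-data ys_star.
        Pi_star = np.mean([g(D_hat_inv(us, xf, xs, ys_star, h)) for us in u_star])
        roots[b] = g(yf_star) - Pi_star                                # bootstrap root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi
```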

MF$^2$: Confidence intervals for the regression function without an additive model. To fix ideas, assume $g(x) = x$. The MF$^2$ $L_2$-optimal point predictor of $Y_f$ is $\Pi_{x_f} = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i)$, where $u_i = \bar D_{x_i}(Y_i)$. $\Pi_{x_f}$ is an estimate of the regression function $\mu(x_f) = E(Y_f \mid x_f)$. How does $\Pi_{x_f}$ compare to the kernel smoother $m_{x_f}$? Can we construct confidence intervals for the regression function $\mu(x_f)$ without the additive model (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1)?

MF$^2$: Confidence intervals for the regression function without an additive model. Note that $\Pi_{x_f}$ is defined as a function of the approximately i.i.d. variables $u_1,\ldots,u_n$, so $\Pi_{x_f}$ can be bootstrapped by resampling $u_1,\ldots,u_n$. In fact, $\Pi_{x_f}$ is asymptotically equivalent to the kernel smoother $m_{x_f}$, i.e., $\Pi_{x_f} = m_{x_f} + o_P(1/\sqrt{nh})$. Recall that the $u_i$ are approximately i.i.d. Uniform(0,1). So, for large $n$, $\Pi_{x_f} = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{-1}(u_i) \approx \int_0^1 \bar D_{x_f}^{-1}(u)\,du = m_{x_f}$. This is just the identity $\int y\,dF(y) = \int_0^1 F^{-1}(u)\,du$.

Resampling Algorithm: MF$^2$ confidence intervals for $\mu(x_f)$. Define the confidence root $\mu(x_f) - \Pi_{x_f}$.
1. Bootstrap $u_1,\ldots,u_n$ to create bootstrap pseudo-data $u_1^*,\ldots,u_n^*$ whose empirical distribution is denoted $\hat F_n^*$.
2. Use the inverse transformation $\bar D_x^{-1}$ to create pseudo-data in the $Y$ domain, i.e., $Y_t^* = \bar D_{x_t}^{-1}(u_t^*)$ for $t = 1,\ldots,n$.
3. Based on the pseudo-data $Y_t^*$, re-estimate the conditional distribution $D_x(\cdot)$; denote the bootstrap estimator by $\bar D_x^*(\cdot)$.
4. Calculate the bootstrap confidence root $\Pi_{x_f} - \Pi^*$, where $\Pi^* = n^{-1}\sum_{i=1}^n \bar D_{x_f}^{*\,-1}(u_i^*)$.
5. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with $\alpha$-quantile denoted $q(\alpha)$.
The MF$^2$ $(1-\alpha)100\%$ equal-tailed confidence interval for $\mu(x_f)$ is $[\Pi_{x_f} + q(\alpha/2),\ \Pi_{x_f} + q(1-\alpha/2)]$.
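A minimal sketch of the MF$^2$ confidence interval; it differs from the prediction-interval sketch above only in the root ($\Pi_{x_f} - \Pi^*$, with no pseudo-response drawn). Same assumptions as before (D_hat, D_hat_inv from the earlier sketch; illustrative B, alpha, seed).

```python
# MF2 bootstrap confidence interval for the regression function mu(x_f).
import numpy as np

def mf2_confidence_interval(xs, ys, xf, h, B=1000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(ys)
    u = np.array([D_hat(ys[i], xs[i], xs, ys, h) for i in range(n)])
    Pi = np.mean([D_hat_inv(ui, xf, xs, ys, h) for ui in u])      # Pi_{x_f}
    roots = np.empty(B)
    for b in range(B):
        u_star = rng.choice(u, size=n, replace=True)
        ys_star = np.array([D_hat_inv(u_star[t], xs[t], xs, ys, h) for t in range(n)])
        Pi_star = np.mean([D_hat_inv(us, xf, xs, ys_star, h) for us in u_star])
        roots[b] = Pi - Pi_star                                    # confidence root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + lo, Pi + hi
```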

Simulation: regression under model (*). (*) $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t$ i.i.d. (0,1) with cdf $F$. Design points $x_1,\ldots,x_n$ for $n = 100$ drawn from a uniform distribution on $(0, 2\pi)$; $\mu(x) = \sin(x)$, $\sigma(x) = (\cos(x/2) + 2)/7$, and errors N(0,1) or Laplace. Prediction points: $x_f = \pi$, where $\mu(x)$ has high slope but zero curvature (an easy case for estimation); $x_f = \pi/2$ and $x_f = 3\pi/2$, where $\mu(x)$ has zero slope but high curvature (peak and valley), so the bias of $m_x$ is large.
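A minimal sketch of this simulation design; the function name, the seed and the Laplace scaling are illustrative.

```python
# Generate one dataset from the simulation design under model (*).
import numpy as np

def simulate_regression(n=100, laplace=False, seed=0):
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 2 * np.pi, size=n)               # design points on (0, 2*pi)
    mu = np.sin(xs)                                         # mu(x) = sin(x)
    sigma = (np.cos(xs / 2) + 2) / 7                        # sigma(x) = (cos(x/2) + 2)/7
    if laplace:
        eps = rng.laplace(0.0, 1 / np.sqrt(2), size=n)      # Laplace scaled to variance 1
    else:
        eps = rng.standard_normal(n)                        # N(0,1) errors
    return xs, mu + sigma * eps
```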

FIGURE 6: Typical scatterplots with superimposed kernel smoothers; (a) Normal data; (b) Laplace data.

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.845  0.832  0.822  0.838  0.847  0.865  0.798
MF/MB      0.901  0.912  0.874  0.897  0.921  0.908  0.861
MF^2       0.834  0.836  0.829  0.831  0.849  0.852  0.804
MF/MF^2    0.897  0.906  0.886  0.886  0.912  0.895  0.868
Normal     0.874  0.876  0.879  0.860  0.863  0.867  0.856
Table 4.2(a). Empirical coverage levels (CVR) of prediction intervals by method at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Normal. [Standard error of CVRs = 0.013]

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.870  0.820  0.841  0.894  0.883  0.872  0.831
MF/MB      0.899  0.870  0.886  0.910  0.912  0.905  0.868
MF^2       0.868  0.805  0.834  0.867  0.874  0.859  0.825
MF/MF^2    0.905  0.876  0.885  0.916  0.910  0.910  0.865
Normal     0.874  0.849  0.876  0.888  0.885  0.879  0.874
Table 4.3(a). Empirical coverage levels (CVR) of prediction intervals by method at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Laplace.

The NORMAL intervals exhibit under-coverage even when the true distribution is Normal; they need explicit bias correction. Bootstrap methods are more variable due to the extra randomization. The MF/MB intervals are more accurate than their MB analogs in the case x_f = π/2. When x_f = π, the MB intervals are most accurate, and the MF/MB intervals seem to over-correct (and over-cover). Over-coverage can be attributed to bias leakage and should be alleviated by a larger sample size, by higher-order kernels, or by a bandwidth trick such as undersmoothing.

The performance of the MF^2 (resp. MF/MF^2) intervals resembles that of the MB (resp. MF/MB) intervals. The MF/MF^2 intervals are best at x_f = π/2; this is quite surprising, since one would expect the MB and MF/MB intervals to have a distinct advantage when model (*) is true. The price to pay for using the generally valid MF/MF^2 intervals instead of the model-specific MF/MB ones is the increased variability of the interval length.

Simulation: regression without model (*). Instead: $Y = \mu(x) + \sigma(x)\,\varepsilon_x$ with $\varepsilon_x = \dfrac{c_x Z + (1 - c_x) W}{\sqrt{c_x^2 + (1 - c_x)^2}}$, where $Z \sim N(0,1)$ is independent of $W$, which also has mean 0 and variance 1. $W$ is either exponential, i.e., $\tfrac{1}{2}\chi_2^2 - 1$, to capture skewness, or Student's $t$ with 5 d.f., i.e., $\sqrt{3/5}\,t_5$, to capture kurtosis. The $\varepsilon_x$ are independent but not i.i.d.: $c_x = x/(2\pi)$ for $x \in [0, 2\pi]$. For large $x$, $\varepsilon_x$ is close to Normal; for small $x$, $\varepsilon_x$ is skewed/kurtotic.
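A minimal sketch of the non-i.i.d. error design above (skewed case shown; the kurtotic case would draw $W$ as $\sqrt{3/5}\,t_5$ instead); names and the seed are illustrative.

```python
# Generate one dataset with independent but non-identically-distributed errors.
import numpy as np

def simulate_noniid(n=100, seed=0):
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 2 * np.pi, size=n)
    c = xs / (2 * np.pi)                                    # c_x = x/(2*pi)
    Z = rng.standard_normal(n)
    W = 0.5 * rng.chisquare(2, size=n) - 1.0                # (1/2)*chi2_2 - 1: mean 0, var 1
    eps = (c * Z + (1 - c) * W) / np.sqrt(c ** 2 + (1 - c) ** 2)
    mu, sigma = np.sin(xs), (np.cos(xs / 2) + 2) / 7
    return xs, mu + sigma * eps
```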

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.886  0.893  0.840  0.840  0.888  0.816  0.780
MF/MB      0.917  0.927  0.877  0.917  0.928  0.892  0.849
MF^2       0.894  0.876  0.823  0.843  0.876  0.813  0.782
MF/MF^2    0.917  0.932  0.897  0.894  0.930  0.874  0.836
Normal     0.903  0.920  0.876  0.888  0.883  0.838  0.834
Table 4.4(a). CVRs with error distribution non-i.i.d. skewed.

x_f/π      0.15   0.3    0.5    0.75   1      1.25   1.5
MB         0.836  0.849  0.804  0.868  0.872  0.831  0.805
MF/MB      0.883  0.885  0.850  0.921  0.923  0.908  0.861
MF^2       0.814  0.843  0.798  0.861  0.863  0.829  0.816
MF/MF^2    0.876  0.899  0.858  0.923  0.921  0.894  0.877
Normal     0.858  0.876  0.859  0.883  0.872  0.845  0.865
Table 4.5(a). CVRs with error distribution non-i.i.d. kurtotic.

The NORMAL intervals are totally unreliable, which is to be expected given the non-normal error distributions. The MF/MF^2 intervals are best (by far) in the cases x_f = π/2 and x_f = 3π/2, attaining close-to-nominal coverage even with a sample size as low as n = 100. The case x_f = π remains problematic, for the same reasons discussed previously.

Conclusions. Model-free prediction principle. A novel viewpoint for inference: estimation and prediction. Regression: point predictors and MF$^2$. Prediction intervals with good coverage properties. Totally automatic: no need to search for optimal transformations in regression... but very computationally intensive!

Bootstrap prediction intervals for time series [joint work with Li Pan, UCSD]. TIME SERIES DATA: $X_1,\ldots,X_n$. GOAL: predict $X_{n+1}$ given the data (new notation). Problem: we cannot choose $x_f$; it is given by the recent history of the time series, e.g., $X_{n-1},\ldots,X_{n-p}$ for some $p$. The bootstrap series $X_1^*, X_2^*,\ldots,X_{n-1}^*, X_n^*, X_{n+1}^*,\ldots$ must be such that it has the same values for the recent history, i.e., $X_{n-1}^* = X_{n-1}$, $X_{n-2}^* = X_{n-2}$, ..., $X_{n-p}^* = X_{n-p}$.

Forward" bootstrap: Linear AR model: φ(b)x t = ɛ t with ɛ t iid Let ˆφ be the LS (scatterplot) estimate of φ based on X 1,..., X n Let φ be the LS (scatterplot) estimate of φ based on X 1,..., X n p, X n p+1,..., X n i.e., the last p values have been changed/corrupted/omitted. ˆφ φ = O p (1/n) Hence, we can artificially change the last p values in the bootstrap world and the effect will be negligible. Bootstrap series: X1,..., X n p, X n p+1,..., X n where Xt = ˆφ(B) 1 ɛ t (forward recursion) and X n p+1,..., X n are the same values as in the original dataset.