Model-free prediction intervals for regression and autoregression
Dimitris N. Politis, University of California, San Diego


To explain or to predict?
Models are indispensable for exploring and utilizing relationships between variables: explaining the world.
Use of models for prediction can be problematic when:
- a model is overspecified;
- parameter inference is highly model-specific (and sensitive to model mis-specification);
- prediction is carried out by plugging in the estimated parameters and treating the model as exactly true.
"All models are wrong but some are useful" (George Box).

A Toy Example
Assume the regression model: Y = β_0 + β_1 X + β_2 X^20 + error.
If β̂_2 is barely statistically significant, do you still use it in prediction?
If the true value of β_2 is close to zero and var(β̂_2) is large, it may be advantageous to omit β_2: allow a nonzero bias but minimize MSE.
A mis-specified model can be optimal for prediction!

Prediction Framework
a. Point predictors
b. Interval predictors
c. Predictive distribution
There is an abundant Bayesian literature in the parametric framework: Cox (1975), Geisser (1993), etc.
The frequentist and/or nonparametric literature is scarce.

I.i.d. set-up
Let ε_1, ..., ε_n be i.i.d. from the (unknown) cdf F_ε.
GOAL: prediction of the future ε_{n+1} based on the data.
F_ε is the predictive distribution, and its quantiles can be used to form prediction intervals.
The mean and median of F_ε are the optimal point predictors under an L_2 and an L_1 criterion, respectively.

I.i.d. data
F_ε is unknown but can be estimated by the empirical distribution function (edf) F̂_ε.
Practical model-free prediction intervals will be based on quantiles of F̂_ε, and the L_2 and L_1 optimal predictors will be approximated by the mean and median of F̂_ε, respectively.

Non-i.i.d. data
In general, the data Y_n = (Y_1, ..., Y_n) are not i.i.d.
So the predictive distribution of Y_{n+1} given the data will depend on Y_n and on X_{n+1}, a matrix of observable, explanatory (predictor) variables.
Key examples: regression and time series.

Models
Regression: Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1).
Time series: Y_t = µ(Y_{t-1}, ..., Y_{t-p}; x_t) + σ(Y_{t-1}, ..., Y_{t-p}; x_t) ε_t.
The above are flexible, nonparametric models.
Given one of the above models, optimal model-based predictors of a future Y-value can be constructed.
Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering at the very least robustness against model mis-specification.

Transformation vs. modeling
DATA: Y_n = (Y_1, ..., Y_n).
GOAL: predict the future value Y_{n+1} given the data.
Find an invertible transformation H_m such that (for all m) the vector ε_m = H_m(Y_m) has i.i.d. components ε_k, where ε_m = (ε_1, ..., ε_m):
Y_m --H_m--> ε_m and ε_m --H_m^{-1}--> Y_m.

Transformation
(i) (Y_1, ..., Y_m) --H_m--> (ε_1, ..., ε_m)
(ii) (ε_1, ..., ε_m) --H_m^{-1}--> (Y_1, ..., Y_m)
(i) implies that ε_1, ..., ε_n are known given the data Y_1, ..., Y_n.
(ii) implies that Y_{n+1} is a function of ε_1, ..., ε_n, and ε_{n+1}.
So, given the data Y_n, Y_{n+1} is a function of ε_{n+1} only, i.e., Y_{n+1} = h(ε_{n+1}).

Model-free prediction principle
Y_{n+1} = h(ε_{n+1}). Suppose ε_1, ..., ε_n ~ cdf F_ε.
The mean and median of h(ε), where ε ~ F_ε, are the optimal point predictors of Y_{n+1} under the L_2 and L_1 criterion, respectively.
The whole predictive distribution of Y_{n+1} is the distribution of h(ε) when ε ~ F_ε.
To predict Y²_{n+1}, replace h by h²; to predict g(Y_{n+1}), replace h by g ∘ h.
The unknown F_ε can be estimated by F̂_ε, the edf of ε_1, ..., ε_n.
But the predictive distribution needs bootstrapping as well, because h is estimated from the data.

Nonparametric Regression
MODEL (*): Y_t = µ(x_t) + σ(x_t) ε_t
- x_t univariate and deterministic
- Y_t data available for t = 1, ..., n
- ε_t i.i.d. (0,1) from the (unknown) cdf F
- the functions µ(·) and σ(·) unknown but smooth

Nonparametric Regression
Note: µ(x) = E(Y|x) and σ²(x) = Var(Y|x). Let m_x, s_x be smoothing estimators of µ(x), σ(x).
Examples: kernel smoothers, local linear fitting, wavelets, etc.
E.g., the Nadaraya-Watson estimator m_x = Σ_{i=1}^n Y_i K̃((x − x_i)/h), where K(x) is the kernel, h the bandwidth, and K̃((x − x_i)/h) = K((x − x_i)/h) / Σ_{k=1}^n K((x − x_k)/h).
Similarly, s_x² = M_x − m_x², where M_x = Σ_{i=1}^n Y_i² K̃((x − x_i)/q) with its own bandwidth q.
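A minimal R sketch of these two estimators, using a normal kernel; the function names nw.mean and nw.sd are illustrative, not from any package.

# Nadaraya-Watson estimators of mu(x) and sigma(x); K is the standard normal kernel.
nw.mean <- function(x, xdat, ydat, h) {
  w <- dnorm((x - xdat) / h)          # kernel weights K((x - x_i)/h)
  sum(ydat * w) / sum(w)              # weighted average of the Y_i
}
nw.sd <- function(x, xdat, ydat, h, q) {
  m <- nw.mean(x, xdat, ydat, h)      # m_x
  M <- nw.mean(x, xdat, ydat^2, q)    # M_x: smoother of Y_i^2 with its own bandwidth q
  sqrt(max(M - m^2, 0))               # s_x = sqrt(M_x - m_x^2), guarded against negatives
}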

FIGURE 1: (a) Log-wage vs. age data with fitted kernel smoother m_x. (b) Unstudentized residuals Y − m_x with superimposed s_x.
Canadian Census data cps71 from the np package of R: wage vs. age for 205 male individuals with common education.
The kernel smoother is problematic at the left boundary; local linear fitting (Fan and Gijbels, 1996) or reflection (Hall and Wehrly, 1991) does better there.

Residuals
(*): Y_t = µ(x_t) + σ(x_t) ε_t
Fitted residuals: e_t = (Y_t − m_{x_t}) / s_{x_t}.
Predictive residuals: ẽ_t = (Y_t − m^{(t)}_{x_t}) / s^{(t)}_{x_t}, where m^{(t)}_x and s^{(t)}_x are the estimators m and s computed from the delete-Y_t dataset {(Y_i, x_i), for all i ≠ t}.
ẽ_t is the (standardized) error in trying to predict Y_t from the delete-Y_t dataset.
Selection of the bandwidth parameters h and q is often done by cross-validation, i.e., pick h, q to minimize PRESS = Σ_{t=1}^n ẽ_t².
BETTER: L_1 cross-validation: pick h, q to minimize Σ_{t=1}^n |ẽ_t|.
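A sketch of the predictive (delete-one) residuals and of L_1 cross-validation in R, reusing the illustrative nw.mean and nw.sd above; x and y denote hypothetical data vectors.

pred.resid <- function(xdat, ydat, h, q) {
  sapply(seq_along(ydat), function(t) {
    m.t <- nw.mean(xdat[t], xdat[-t], ydat[-t], h)      # delete-Y_t estimators
    s.t <- nw.sd(xdat[t], xdat[-t], ydat[-t], h, q)
    (ydat[t] - m.t) / s.t                               # predictive residual
  })
}
# L1 cross-validation: choose (h, q) minimizing the sum of |predictive residuals|,
# e.g. via  optim(c(1, 1), function(p) sum(abs(pred.resid(x, y, p[1], p[2]))))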

Model-based (MB) point predictors
(*) Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1) with cdf F.
GOAL: predict a future response Y_f associated with the point x_f.
The L_2 optimal predictor of Y_f is E(Y_f | x_f), i.e., µ(x_f), which is approximated by m_{x_f}.
The L_1 optimal predictor of Y_f is the conditional median, i.e., µ(x_f) + σ(x_f) median(F), which is approximated by m_{x_f} + s_{x_f} median(F̂_e), where F̂_e is the edf of the residuals e_1, ..., e_n.

Model-based (MB) point predictors
(*) Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1) with cdf F.
DATASET cps71: salaries are logarithmically transformed, i.e., Y_t = log-salary.
To predict salary at age x_f we need to predict g(Y_f) where g(x) = exp(x).
The MB L_2 optimal predictor of g(Y_f) is E(g(Y_f) | x_f), estimated by n^{-1} Σ_{i=1}^n g(m_{x_f} + s_{x_f} e_i).
The naive predictor g(m_{x_f}) is grossly suboptimal when g is nonlinear.
The MB L_1 optimal predictor of g(Y_f) is estimated by the sample median of the set {g(m_{x_f} + s_{x_f} e_i), i = 1, ..., n}; the naive plug-in is OK iff g is monotone!
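As an illustration, a sketch of these two model-based predictors of g(Y_f) in R, again built on the hypothetical nw.mean and nw.sd above (g = exp for the salary example):

mb.predict <- function(xf, xdat, ydat, h, q, g = exp) {
  m <- sapply(xdat, nw.mean, xdat = xdat, ydat = ydat, h = h)        # m_{x_t}
  s <- sapply(xdat, nw.sd, xdat = xdat, ydat = ydat, h = h, q = q)   # s_{x_t}
  e <- (ydat - m) / s                                                # fitted residuals
  vals <- g(nw.mean(xf, xdat, ydat, h) + nw.sd(xf, xdat, ydat, h, q) * e)
  c(L2 = mean(vals), L1 = median(vals))                              # MB L2 and L1 predictors
}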

Resampling Algorithm for the predictive distribution of g(Y_f)
Prediction root: g(Y_f) − Π, where Π is the point predictor.
1. Bootstrap the (fitted or predictive) residuals r_1, ..., r_n to create pseudo-residuals r*_1, ..., r*_n, whose edf is denoted by F̂*_n.
2. Create pseudo-data Y*_i = m_{x_i} + s_{x_i} r*_i, for i = 1, ..., n.
3. Calculate a bootstrap pseudo-response Y*_f = m_{x_f} + s_{x_f} r, where r is drawn randomly from (r_1, ..., r_n).
4. Based on the pseudo-data Y*_1, ..., Y*_n, re-estimate the functions µ(x) and σ(x) by m*_x and s*_x.
5. Calculate the bootstrap root: g(Y*_f) − Π(g, m*_x, s*_x, Y_n, X_{n+1}, F̂*_n).
6. Repeat the above B times, and collect the B bootstrap roots in an empirical distribution with α-quantile denoted q(α).
Our estimate of the predictive distribution of g(Y_f) is the empirical df of the bootstrap roots shifted to the right by Π.
Then, a (1 − α)100% equal-tailed prediction interval for g(Y_f) is given by [Π + q(α/2), Π + q(1 − α/2)].
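The loop below is a minimal R sketch of this algorithm for the L_2 predictor; the residuals r (fitted or predictive) are passed in, and all function names are the illustrative ones introduced earlier.

mb.pred.int <- function(xf, xdat, ydat, r, h, q, g = identity, B = 1000, alpha = 0.1) {
  m  <- sapply(xdat, nw.mean, xdat = xdat, ydat = ydat, h = h)
  s  <- sapply(xdat, nw.sd,   xdat = xdat, ydat = ydat, h = h, q = q)
  mf <- nw.mean(xf, xdat, ydat, h); sf <- nw.sd(xf, xdat, ydat, h, q)
  Pi <- mean(g(mf + sf * r))                                   # point predictor
  roots <- replicate(B, {
    ystar  <- m + s * sample(r, length(r), replace = TRUE)     # pseudo-data Y*_i
    yfstar <- mf + sf * sample(r, 1)                           # pseudo-response Y*_f
    mstar  <- sapply(xdat, nw.mean, xdat = xdat, ydat = ystar, h = h)
    sstar  <- sapply(xdat, nw.sd,   xdat = xdat, ydat = ystar, h = h, q = q)
    estar  <- (ystar - mstar) / sstar                          # residuals in the bootstrap world
    Pistar <- mean(g(nw.mean(xf, xdat, ystar, h) + nw.sd(xf, xdat, ystar, h, q) * estar))
    g(yfstar) - Pistar                                         # bootstrap root
  })
  Pi + quantile(roots, c(alpha / 2, 1 - alpha / 2))            # equal-tailed interval
}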

Model-free prediction in regression
The previous discussion hinged on the model (*): Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1) from cdf F.
What happens if model (*) does not hold true?
E.g., the skewness and/or kurtosis of Y_t may depend on x_t.
cps71 data: the skewness and kurtosis of salary depend on age.

FIGURE: (a) Log-wage SKEWNESS vs. age. (b) Log-wage KURTOSIS vs. age. Both skewness and kurtosis are nonconstant!

General background
Could try skewness-reducing transformations, but the log already does that.
Could try ACE, AVAS, etc.
There is a simpler, more general solution!

General background
The Y_t's are still independent but not identically distributed.
We denote the conditional distribution of Y_f given x_f by D_x(y) = P{Y_f ≤ y | x_f = x}.
Assume the quantity D_x(y) is continuous in both x and y.
(With a categorical response, standard methods like Generalized Linear Models can be invoked, e.g., logistic regression, Poisson regression, etc.)
Since D_x(·) depends in a smooth way on x, we can estimate D_x(y) by the local empirical N_{x,h}^{-1} Σ_{t: |x_t − x| < h/2} 1{Y_t ≤ y}, where 1{·} is the indicator and N_{x,h} is the number of summands, i.e., N_{x,h} = #{t : |x_t − x| < h/2}.

Constructing the transformation
A more general estimator: D̂_x(y) = Σ_{i=1}^n 1{Y_i ≤ y} K̃((x − x_i)/h).
D̂_x(y) is just a Nadaraya-Watson smoother of the variables 1{Y_t ≤ y}, t = 1, ..., n.
One can also use a local linear smoother of 1{Y_t ≤ y}, t = 1, ..., n.
The estimator D̂_x(y) enjoys many good properties, including asymptotic consistency; see e.g. Li and Racine (2007).
But D̂_x(y) is discontinuous in y, and therefore unacceptable! It could be smoothed by kernel methods, or...
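A one-function R sketch of this estimator (normal kernel, no smoothing in y; the name Dhat is illustrative):

Dhat <- function(y, x, xdat, ydat, h) {
  w <- dnorm((x - xdat) / h)               # kernel weights in the x direction
  sum((ydat <= y) * w) / sum(w)            # NW smoother of the indicators 1{Y_i <= y}
}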

Linear interpolation
FIGURE 2: (a) Empirical distribution of a test sample consisting of five N(0,1) and five N(1/2,1) independent r.v.'s with the piecewise-linear estimator D̄(·) superimposed; the vertical/horizontal lines indicate the inversion process, i.e., finding D̄^{-1}(0.75). (b) Q-Q plot of the transformed variables u_i vs. the quantiles of Uniform(0,1).

Constructing the transformation
Since the Y_t's are continuous r.v.'s, the probability integral transform is the key idea for transforming them to i.i.d.-ness.
To see why, note that if we let η_i = D_{x_i}(Y_i) for i = 1, ..., n, our transformation objective would be achieved exactly, since η_1, ..., η_n would be i.i.d. Uniform(0,1).
D_x(·) is not known, but we have the estimator D̄_x(·) as its proxy.
Therefore, our proposed transformation for the MF prediction principle is u_i = D̄_{x_i}(Y_i) for i = 1, ..., n.
D̄_x(·) is consistent, so u_1, ..., u_n are approximately i.i.d.
The probability integral transform has been used in the past for building better density estimators; see Ruppert and Cline (1994).

Model-free optimal predictors
Transformation: u_i = D̄_{x_i}(Y_i) for i = 1, ..., n.
The inverse transformation D̄_x^{-1} is well defined since D̄_x(·) is strictly increasing.
Let u_f = D̄_{x_f}(Y_f) and Y_f = D̄_{x_f}^{-1}(u_f).
D̄_{x_f}^{-1}(u_i) has (approximately) the same distribution as Y_f (conditionally on x_f) for any i.
So {D̄_{x_f}^{-1}(u_i), i = 1, ..., n} is a set of bona fide potential responses that can be used as proxies for Y_f.
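A rough R sketch of the transformation and the resulting model-free predictors. The strictly increasing estimator D̄ is replaced here by a crude grid inversion of the hypothetical Dhat above, so this is only an approximation to the construction described in the talk.

mf.predict <- function(xf, xdat, ydat, h, g = identity) {
  u <- mapply(function(xi, yi) Dhat(yi, xi, xdat, ydat, h), xdat, ydat)   # u_i
  ygrid <- seq(min(ydat), max(ydat), length.out = 500)
  Dxf <- sapply(ygrid, Dhat, x = xf, xdat = xdat, ydat = ydat, h = h)     # D at x_f on a grid
  yf  <- sapply(u, function(ui) ygrid[which.min(abs(Dxf - ui))])          # ~ inverse of D at x_f
  c(L2 = mean(g(yf)), L1 = median(g(yf)))                                 # MF optimal predictors
}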

These n valid potential responses {D̄_{x_f}^{-1}(u_i), i = 1, ..., n}, gathered together, give an approximate empirical distribution for Y_f from which our predictors will be derived.
The L_2 optimal predictor of g(Y_f) is the expected value of g(Y_f), approximated by n^{-1} Σ_{i=1}^n g(D̄_{x_f}^{-1}(u_i)).
The L_1 optimal predictor of g(Y_f) is approximated by the sample median of the set {g(D̄_{x_f}^{-1}(u_i)), i = 1, ..., n}.

Model-free optimal point predictors
TABLE: The model-free (MF²) optimal point predictors.
L_2 predictor of Y_f: n^{-1} Σ_{i=1}^n D̄_{x_f}^{-1}(D̄_{x_i}(Y_i))
L_1 predictor of Y_f: median{ D̄_{x_f}^{-1}(D̄_{x_i}(Y_i)) }
L_2 predictor of g(Y_f): n^{-1} Σ_{i=1}^n g(D̄_{x_f}^{-1}(D̄_{x_i}(Y_i)))
L_1 predictor of g(Y_f): median{ g(D̄_{x_f}^{-1}(D̄_{x_i}(Y_i))) }

Model-free model-fitting
The MF predictors (mean or median) can be used to give the equivalent of a model fit.
Focus on the L_2 optimal case with g(x) = x.
Calculating the MF predictor Π(x_f) = n^{-1} Σ_{i=1}^n g(D̄_{x_f}^{-1}(u_i)) for many different x_f values, say on a grid, yields the equivalent of a nonparametric smoother of a regression function: Model-Free Model-Fitting (MF²).
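For instance, evaluating the hypothetical mf.predict above over a grid of x_f values gives such a fit; x, y and the bandwidth below are placeholders.

xgrid  <- seq(min(x), max(x), length.out = 100)
mf.fit <- sapply(xgrid, function(xf) mf.predict(xf, x, y, h = 5.5)["L2"])
# plot(x, y); lines(xgrid, mf.fit)   # the equivalent of a nonparametric regression fit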

M.o.a.T.
MF² relieves the practitioner from the need to find the optimal transformation for additivity and variance stabilization, such as Box-Cox, ACE, AVAS, etc.; see Figures 3 and 4.
No need for a log-transformation of the salaries!
MF² is totally automatic!!

FIGURE 3: (a) Wage vs. age scatterplot. (b) Circles indicate the salary predictor n^{-1} Σ_{i=1}^n g(D̄_{x_f}^{-1}(u_i)) calculated from the log-wage data with g(x) exponential. In both panels, the superimposed solid line is the MF² salary predictor calculated from the raw data (without the log).

FIGURE 4: Q-Q plots of the u_i vs. the quantiles of Uniform(0,1). (a) The u_i's are obtained from the log-wage vs. age dataset of Figure 1 using bandwidth 5.5; (b) the u_i's are obtained from the raw (untransformed) dataset of Figure 3 using bandwidth 7.3.

MF² predictive distributions
For MF² we can always take g(x) = x; no need for other preliminary transformations.
Let g(Y_f) − Π be the prediction root, where Π is either the L_2 or the L_1 optimal predictor, i.e., Π = n^{-1} Σ_{i=1}^n g(D̄_{x_f}^{-1}(u_i)) or Π = median{ g(D̄_{x_f}^{-1}(u_i)) }.
Based on the Y data, estimate the conditional distribution D_x(·) by D̄_x(·), and let u_i = D̄_{x_i}(Y_i) to obtain the transformed data u_1, ..., u_n that are approximately i.i.d.

Resampling Algorithm: MF² predictive distribution of g(Y_f)
1. Bootstrap u_1, ..., u_n to create bootstrap pseudo-data u*_1, ..., u*_n, whose empirical distribution is denoted F̂*_n.
2. Use the inverse transformation D̄_x^{-1} to create pseudo-data in the Y domain, i.e., Y*_t = D̄_{x_t}^{-1}(u*_t) for t = 1, ..., n.
3. Generate a bootstrap pseudo-response Y*_f = D̄_{x_f}^{-1}(u), where u is drawn randomly from the set (u_1, ..., u_n).
4. Based on the pseudo-data Y*_t, re-estimate the conditional distribution D_x(·); denote the bootstrap estimator by D̄*_x(·).
5. Calculate the bootstrap root g(Y*_f) − Π*, where Π* = n^{-1} Σ_{i=1}^n g(D̄*_{x_f}^{-1}(u*_i)) or Π* = median{ g(D̄*_{x_f}^{-1}(u*_i)) }.
6. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α-quantile denoted q(α).
The predictive distribution of g(Y_f) is the above edf shifted to the right by Π, and the MF² (1 − α)100% equal-tailed prediction interval for g(Y_f) is [Π + q(α/2), Π + q(1 − α/2)].
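A condensed R sketch of this algorithm, reusing the hypothetical Dhat and the grid-inversion shortcut from above in place of the piecewise-linear D̄:

Dinv <- function(u, x, xdat, ydat, h, ygrid) {                  # approximate inverse of Dhat at x
  Dx <- sapply(ygrid, Dhat, x = x, xdat = xdat, ydat = ydat, h = h)
  sapply(u, function(ui) ygrid[which.min(abs(Dx - ui))])
}
mf.pred.int <- function(xf, xdat, ydat, h, g = identity, B = 1000, alpha = 0.1) {
  ygrid <- seq(min(ydat), max(ydat), length.out = 500)
  u  <- mapply(function(xi, yi) Dhat(yi, xi, xdat, ydat, h), xdat, ydat)
  Pi <- mean(g(Dinv(u, xf, xdat, ydat, h, ygrid)))              # L2 point predictor
  roots <- replicate(B, {
    ustar  <- sample(u, length(u), replace = TRUE)              # bootstrap the u_i
    ystar  <- mapply(function(xi, ui) Dinv(ui, xi, xdat, ydat, h, ygrid), xdat, ustar)
    yfstar <- Dinv(sample(u, 1), xf, xdat, ydat, h, ygrid)      # pseudo-response Y*_f
    Pistar <- mean(g(Dinv(ustar, xf, xdat, ystar, h, ygrid)))   # predictor re-estimated from Y*
    g(yfstar) - Pistar                                          # bootstrap root
  })
  Pi + quantile(roots, c(alpha / 2, 1 - alpha / 2))
}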

MF²: Confidence intervals for the regression function without an additive model
To fix ideas, assume g(x) = x. The MF² L_2 optimal point predictor of Y_f is Π_{x_f} = n^{-1} Σ_{i=1}^n D̄_{x_f}^{-1}(u_i), where u_i = D̄_{x_i}(Y_i).
Π_{x_f} is an estimate of the regression function µ(x_f) = E(Y_f | x_f).
How does Π_{x_f} compare to the kernel smoother m_{x_f}?
Can we construct confidence intervals for the regression function µ(x_f) without the additive model (*): Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1)?

MF²: Confidence intervals for the regression function without an additive model
Note that Π_{x_f} is defined as a function of the approximately i.i.d. variables u_1, ..., u_n.
So Π_{x_f} can be bootstrapped by resampling the i.i.d. variables u_1, ..., u_n.
In fact, Π_{x_f} is asymptotically equivalent to the kernel smoother m_{x_f}, i.e., Π_{x_f} = m_{x_f} + o_P(1/√(nh)).
Recall that the u_i are approximately i.i.d. Uniform(0,1). So, for large n, Π_{x_f} = n^{-1} Σ_{i=1}^n D̄_{x_f}^{-1}(u_i) ≈ ∫_0^1 D̄_{x_f}^{-1}(u) du = m_{x_f}.
This is just the identity ∫ y dF(y) = ∫_0^1 F^{-1}(u) du.

Resampling Algorithm: MF² confidence intervals for µ(x_f)
Define the confidence root µ(x_f) − Π_{x_f}.
1. Bootstrap u_1, ..., u_n to create bootstrap pseudo-data u*_1, ..., u*_n, whose empirical distribution is denoted F̂*_n.
2. Use the inverse transformation D̄_x^{-1} to create pseudo-data in the Y domain, i.e., Y*_t = D̄_{x_t}^{-1}(u*_t) for t = 1, ..., n.
3. Based on the pseudo-data Y*_t, re-estimate the conditional distribution D_x(·); denote the bootstrap estimator by D̄*_x(·).
4. Calculate the bootstrap confidence root Π_{x_f} − Π*, where Π* = n^{-1} Σ_{i=1}^n D̄*_{x_f}^{-1}(u*_i).
5. Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α-quantile denoted q(α).
The MF² (1 − α)100% equal-tailed confidence interval for µ(x_f) is [Π_{x_f} + q(α/2), Π_{x_f} + q(1 − α/2)].

Simulation: regression under model (*)
(*) Y_t = µ(x_t) + σ(x_t) ε_t with ε_t i.i.d. (0,1) with cdf F.
Design points x_1, ..., x_n for n = 100 drawn from a uniform distribution on (0, 2π).
µ(x) = sin(x), σ(x) = (cos(x/2) + 2)/7, and errors N(0,1) or Laplace.
Prediction points: x_f = π, where µ(x) has high slope but zero curvature (an easy case for estimation); and x_f = π/2 and x_f = 3π/2, where µ(x) has zero slope but high curvature (peak and valley), so m_x has large bias.
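A sketch of this data-generating mechanism in R; the standardized Laplace draw shown in the comment is one possible way to obtain variance-one errors.

n <- 100
x <- runif(n, 0, 2 * pi)                               # design points
mu    <- function(x) sin(x)
sigma <- function(x) (cos(x / 2) + 2) / 7
eps <- rnorm(n)                                        # N(0,1) errors, or standardized Laplace:
# eps <- (rexp(n) * sample(c(-1, 1), n, replace = TRUE)) / sqrt(2)
y <- mu(x) + sigma(x) * eps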

FIGURE 6: Typical scatterplots with superimposed kernel smoothers; (a) Normal data; (b) Laplace data.

[Table 4.2(a): Empirical coverage levels (CVR) of prediction intervals for the MB, MF/MB, MF and MF/MF methods and the NORMAL intervals, at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Normal. St. error of the CVRs ≈ 0.013.]

[Table 4.3(a): Empirical coverage levels (CVR) of prediction intervals for the MB, MF/MB, MF and MF/MF methods and the NORMAL intervals, at several x_f points spanning the interval (0, 2π). Nominal coverage 0.90, sample size n = 100, bandwidths chosen by L_1 cross-validation. Error distribution: i.i.d. Laplace.]

The NORMAL intervals exhibit under-coverage even when the true distribution is Normal; they need explicit bias correction.
Bootstrap methods are more variable due to the extra randomization.
The MF/MB intervals are more accurate than their MB analogs in the case x_f = π/2.
When x_f = π, the MB intervals are most accurate, and the MF/MB intervals seem to over-correct (and over-cover). Over-coverage can be attributed to bias leakage and should be alleviated with a larger sample size, higher-order kernels, or a bandwidth trick such as undersmoothing.

The performance of the MF² (resp. MF/MF²) intervals resembles that of the MB (resp. MF/MB) intervals.
The MF/MF² intervals are best at x_f = π/2; this is quite surprising, since one would expect the MB and MF/MB intervals to have a distinct advantage when model (*) is true.
The price to pay for using the generally valid MF/MF² intervals instead of the model-specific MF/MB ones is the increased variability of the interval length.

Simulation: regression without model (*)
Instead: Y = µ(x) + σ(x) ε_x with ε_x = (c_x Z + (1 − c_x) W) / √(c_x² + (1 − c_x)²), where Z ~ N(0,1) is independent of W, which also has mean 0 and variance 1.
W is either a centered exponential, i.e., (1/2)χ²_2 − 1, to capture skewness, or a rescaled Student's t with 5 d.f., i.e., √(3/5) t_5, to capture kurtosis.
The ε_x are independent but not i.i.d.: c_x = x/(2π) for x ∈ [0, 2π]. For large x, ε_x is close to Normal; for small x, ε_x is skewed/kurtotic.
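A sketch of this error mechanism in R, reusing mu, sigma and x from the previous sketch; the centering and scaling of W follow the standardization described above.

make.eps <- function(x, type = c("skewed", "kurtotic")) {
  type <- match.arg(type)
  Z <- rnorm(length(x))
  W <- if (type == "skewed") rchisq(length(x), df = 2) / 2 - 1    # (1/2) chi^2_2 - 1
       else sqrt(3 / 5) * rt(length(x), df = 5)                    # sqrt(3/5) t_5
  cx <- x / (2 * pi)                                               # mixing weight c_x
  (cx * Z + (1 - cx) * W) / sqrt(cx^2 + (1 - cx)^2)                # epsilon_x: mean 0, variance 1
}
y <- mu(x) + sigma(x) * make.eps(x, "skewed")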

[Table 4.4(a): CVRs of the MB, MF/MB, MF and MF/MF intervals and the NORMAL intervals at several x_f points spanning (0, 2π); error distribution non-i.i.d. skewed. Table 4.5(a): same layout; error distribution non-i.i.d. kurtotic.]

The NORMAL intervals are totally unreliable, which is to be expected due to the non-normal error distributions.
The MF/MF² intervals are best (by far) in the cases x_f = π/2 and x_f = 3π/2, attaining close to nominal coverage even with a sample size as low as n = 100.
The case x_f = π remains problematic, for the same reasons discussed previously.

Conclusions
Model-free prediction principle.
A novel viewpoint for inference: estimation and prediction.
Regression: point predictors and MF².
Prediction intervals with good coverage properties.
Totally automatic: no need to search for optimal transformations in regression...
...but very computationally intensive!

Bootstrap prediction intervals for time series [joint work with Li Pan, UCSD]
TIME SERIES DATA: X_1, ..., X_n. GOAL: predict X_{n+1} given the data (new notation).
Problem: we cannot choose x_f; it is given by the recent history of the time series, e.g., X_{n-1}, ..., X_{n-p} for some p.
Bootstrap series X*_1, X*_2, ..., X*_{n-1}, X*_n, X*_{n+1}, ... must be such that they have the same values for the recent history, i.e., X*_{n-1} = X_{n-1}, X*_{n-2} = X_{n-2}, ..., X*_{n-p} = X_{n-p}.

"Forward" bootstrap
Linear AR model: φ(B) X_t = ε_t with ε_t i.i.d.
Let φ̂ be the LS (scatterplot) estimate of φ based on X_1, ..., X_n.
Let φ̃ be the LS (scatterplot) estimate of φ based on X_1, ..., X_{n-p} together with altered values in place of X_{n-p+1}, ..., X_n, i.e., the last p values have been changed/corrupted/omitted.
Then φ̂ − φ̃ = O_p(1/n). Hence, we can artificially change the last p values in the bootstrap world and the effect will be negligible.
Bootstrap series: X*_1, ..., X*_{n-p}, X_{n-p+1}, ..., X_n, where X*_t = φ̂(B)^{-1} ε*_t (forward recursion) and X_{n-p+1}, ..., X_n are the same values as in the original dataset.
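A self-contained R sketch of the forward-bootstrap idea for an AR(p) prediction interval; the AR fit is done by ordinary least squares via lm, and all function names are illustrative rather than the exact algorithm of the Pan and Politis paper.

fit.ar <- function(X, p) {                       # OLS fit of an AR(p) with intercept
  Z   <- embed(X, p + 1)                         # columns: X_t, X_{t-1}, ..., X_{t-p}
  fit <- lm(Z[, 1] ~ Z[, -1])
  list(coef = coef(fit), resid = resid(fit) - mean(resid(fit)))
}
ar.pred <- function(coefs, lastp) unname(coefs[1] + sum(coefs[-1] * lastp))

forward.boot.pi <- function(X, p, B = 1000, alpha = 0.1) {
  n     <- length(X)
  fit   <- fit.ar(X, p)
  lastp <- X[n:(n - p + 1)]                      # recent history X_n, ..., X_{n-p+1}
  Xhat  <- ar.pred(fit$coef, lastp)              # point predictor of X_{n+1}
  roots <- replicate(B, {
    e  <- sample(fit$resid, n, replace = TRUE)
    Xs <- numeric(n); Xs[1:p] <- X[1:p]
    for (t in (p + 1):n)                         # forward recursion with phi-hat
      Xs[t] <- ar.pred(fit$coef, Xs[(t - 1):(t - p)]) + e[t]
    Xs[(n - p + 1):n] <- X[(n - p + 1):n]        # reset the last p values to the observed ones
    fits <- fit.ar(Xs, p)                        # re-estimate phi in the bootstrap world
    Xf   <- ar.pred(fit$coef, lastp) + sample(fit$resid, 1)   # future pseudo-value X*_{n+1}
    Xf - ar.pred(fits$coef, lastp)               # bootstrap root
  })
  Xhat + quantile(roots, c(alpha / 2, 1 - alpha / 2))
}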


More information

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October

More information

What to do if Assumptions are Violated?

What to do if Assumptions are Violated? What to do if Assumptions are Violated? Abandon simple linear regression for something else (usually more complicated). Some examples of alternative models: weighted least square appropriate model if the

More information

A New Method for Varying Adaptive Bandwidth Selection

A New Method for Varying Adaptive Bandwidth Selection IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. 47, O. 9, SEPTEMBER 1999 2567 TABLE I SQUARE ROOT MEA SQUARED ERRORS (SRMSE) OF ESTIMATIO USIG THE LPA AD VARIOUS WAVELET METHODS A ew Method for Varying Adaptive

More information

A nonparametric method of multi-step ahead forecasting in diffusion processes

A nonparametric method of multi-step ahead forecasting in diffusion processes A nonparametric method of multi-step ahead forecasting in diffusion processes Mariko Yamamura a, Isao Shoji b a School of Pharmacy, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan. b Graduate School

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Confidence intervals for kernel density estimation

Confidence intervals for kernel density estimation Stata User Group - 9th UK meeting - 19/20 May 2003 Confidence intervals for kernel density estimation Carlo Fiorio c.fiorio@lse.ac.uk London School of Economics and STICERD Stata User Group - 9th UK meeting

More information

Introduction to Nonparametric Regression

Introduction to Nonparametric Regression Introduction to Nonparametric Regression Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 04-Jan-2017 Nathaniel E. Helwig (U of Minnesota)

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Inference on distributions and quantiles using a finite-sample Dirichlet process

Inference on distributions and quantiles using a finite-sample Dirichlet process Dirichlet IDEAL Theory/methods Simulations Inference on distributions and quantiles using a finite-sample Dirichlet process David M. Kaplan University of Missouri Matt Goldman UC San Diego Midwest Econometrics

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Lecture Notes 15 Prediction Chapters 13, 22, 20.4.

Lecture Notes 15 Prediction Chapters 13, 22, 20.4. Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data

More information

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Quantile methods Class Notes Manuel Arellano December 1, 2009 1 Unconditional quantiles Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Q τ (Y ) q τ F 1 (τ) =inf{r : F

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β Introduction - Introduction -2 Introduction Linear Regression E(Y X) = X β +...+X d β d = X β Example: Wage equation Y = log wages, X = schooling (measured in years), labor market experience (measured

More information

Basis Penalty Smoothers. Simon Wood Mathematical Sciences, University of Bath, U.K.

Basis Penalty Smoothers. Simon Wood Mathematical Sciences, University of Bath, U.K. Basis Penalty Smoothers Simon Wood Mathematical Sciences, University of Bath, U.K. Estimating functions It is sometimes useful to estimate smooth functions from data, without being too precise about the

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University JSM, 2015 E. Christou, M. G. Akritas (PSU) SIQR JSM, 2015

More information

Quantile Regression for Residual Life and Empirical Likelihood

Quantile Regression for Residual Life and Empirical Likelihood Quantile Regression for Residual Life and Empirical Likelihood Mai Zhou email: mai@ms.uky.edu Department of Statistics, University of Kentucky, Lexington, KY 40506-0027, USA Jong-Hyeon Jeong email: jeong@nsabp.pitt.edu

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants

Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Collaborators:

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

Statistics 3657 : Moment Approximations

Statistics 3657 : Moment Approximations Statistics 3657 : Moment Approximations Preliminaries Suppose that we have a r.v. and that we wish to calculate the expectation of g) for some function g. Of course we could calculate it as Eg)) by the

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Sociology 740 John Fox. Lecture Notes. 1. Introduction. Copyright 2014 by John Fox. Introduction 1

Sociology 740 John Fox. Lecture Notes. 1. Introduction. Copyright 2014 by John Fox. Introduction 1 Sociology 740 John Fox Lecture Notes 1. Introduction Copyright 2014 by John Fox Introduction 1 1. Goals I To introduce the notion of regression analysis as a description of how the average value of a response

More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics Lecture 3 : Regression: CEF and Simple OLS Zhaopeng Qu Business School,Nanjing University Oct 9th, 2017 Zhaopeng Qu (Nanjing University) Introduction to Econometrics Oct 9th,

More information

Chapter 4: Imputation

Chapter 4: Imputation Chapter 4: Imputation Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Basic Theory for imputation 3 Variance estimation after imputation 4 Replication variance estimation

More information

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Statistica Sinica 19 (2009), 71-81 SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Song Xi Chen 1,2 and Chiu Min Wong 3 1 Iowa State University, 2 Peking University and

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend

More information

Small Sample Corrections for LTS and MCD

Small Sample Corrections for LTS and MCD myjournal manuscript No. (will be inserted by the editor) Small Sample Corrections for LTS and MCD G. Pison, S. Van Aelst, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail

More information

Error distribution function for parametrically truncated and censored data

Error distribution function for parametrically truncated and censored data Error distribution function for parametrically truncated and censored data Géraldine LAURENT Jointly with Cédric HEUCHENNE QuantOM, HEC-ULg Management School - University of Liège Friday, 14 September

More information

Better Bootstrap Confidence Intervals

Better Bootstrap Confidence Intervals by Bradley Efron University of Washington, Department of Statistics April 12, 2012 An example Suppose we wish to make inference on some parameter θ T (F ) (e.g. θ = E F X ), based on data We might suppose

More information

2 Functions of random variables

2 Functions of random variables 2 Functions of random variables A basic statistical model for sample data is a collection of random variables X 1,..., X n. The data are summarised in terms of certain sample statistics, calculated as

More information

Nonparametric Regression. Changliang Zou

Nonparametric Regression. Changliang Zou Nonparametric Regression Institute of Statistics, Nankai University Email: nk.chlzou@gmail.com Smoothing parameter selection An overall measure of how well m h (x) performs in estimating m(x) over x (0,

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Uncertainty Quantification for Inverse Problems. November 7, 2011

Uncertainty Quantification for Inverse Problems. November 7, 2011 Uncertainty Quantification for Inverse Problems November 7, 2011 Outline UQ and inverse problems Review: least-squares Review: Gaussian Bayesian linear model Parametric reductions for IP Bias, variance

More information

Bayesian spatial quantile regression

Bayesian spatial quantile regression Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse

More information

Why experimenters should not randomize, and what they should do instead

Why experimenters should not randomize, and what they should do instead Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction

More information

On the Robust Modal Local Polynomial Regression

On the Robust Modal Local Polynomial Regression International Journal of Statistical Sciences ISSN 683 5603 Vol. 9(Special Issue), 2009, pp 27-23 c 2009 Dept. of Statistics, Univ. of Rajshahi, Bangladesh On the Robust Modal Local Polynomial Regression

More information

Web-based Supplementary Material for. Dependence Calibration in Conditional Copulas: A Nonparametric Approach

Web-based Supplementary Material for. Dependence Calibration in Conditional Copulas: A Nonparametric Approach 1 Web-based Supplementary Material for Dependence Calibration in Conditional Copulas: A Nonparametric Approach Elif F. Acar, Radu V. Craiu, and Fang Yao Web Appendix A: Technical Details The score and

More information

Statistical View of Least Squares

Statistical View of Least Squares Basic Ideas Some Examples Least Squares May 22, 2007 Basic Ideas Simple Linear Regression Basic Ideas Some Examples Least Squares Suppose we have two variables x and y Basic Ideas Simple Linear Regression

More information

The Simple Regression Model. Part II. The Simple Regression Model

The Simple Regression Model. Part II. The Simple Regression Model Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square

More information

Section 7: Local linear regression (loess) and regression discontinuity designs

Section 7: Local linear regression (loess) and regression discontinuity designs Section 7: Local linear regression (loess) and regression discontinuity designs Yotam Shem-Tov Fall 2015 Yotam Shem-Tov STAT 239/ PS 236A October 26, 2015 1 / 57 Motivation We will focus on local linear

More information

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference 1 / 172 Bootstrap inference Francisco Cribari-Neto Departamento de Estatística Universidade Federal de Pernambuco Recife / PE, Brazil email: cribari@gmail.com October 2014 2 / 172 Unpaid advertisement

More information

Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data

Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data Géraldine Laurent 1 and Cédric Heuchenne 2 1 QuantOM, HEC-Management School of

More information

STAT Section 2.1: Basic Inference. Basic Definitions

STAT Section 2.1: Basic Inference. Basic Definitions STAT 518 --- Section 2.1: Basic Inference Basic Definitions Population: The collection of all the individuals of interest. This collection may be or even. Sample: A collection of elements of the population.

More information

F9 F10: Autocorrelation

F9 F10: Autocorrelation F9 F10: Autocorrelation Feng Li Department of Statistics, Stockholm University Introduction In the classic regression model we assume cov(u i, u j x i, x k ) = E(u i, u j ) = 0 What if we break the assumption?

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Post-exam 2 practice questions 18.05, Spring 2014

Post-exam 2 practice questions 18.05, Spring 2014 Post-exam 2 practice questions 18.05, Spring 2014 Note: This is a set of practice problems for the material that came after exam 2. In preparing for the final you should use the previous review materials,

More information

Independent and conditionally independent counterfactual distributions

Independent and conditionally independent counterfactual distributions Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views

More information

Comments on: Model-free model-fitting and predictive distributions : Applications to Small Area Statistics and Treatment Effect Estimation

Comments on: Model-free model-fitting and predictive distributions : Applications to Small Area Statistics and Treatment Effect Estimation Article Comments on: Model-free model-fitting and predictive distributions : Applications to Small Area Statistics and Treatment Effect Estimation SPERLICH, Stefan Andréas Reference SPERLICH, Stefan Andréas.

More information

Modelling Non-linear and Non-stationary Time Series

Modelling Non-linear and Non-stationary Time Series Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

1 Empirical Likelihood

1 Empirical Likelihood Empirical Likelihood February 3, 2016 Debdeep Pati 1 Empirical Likelihood Empirical likelihood a nonparametric method without having to assume the form of the underlying distribution. It retains some of

More information

Nonparametric confidence intervals. for receiver operating characteristic curves

Nonparametric confidence intervals. for receiver operating characteristic curves Nonparametric confidence intervals for receiver operating characteristic curves Peter G. Hall 1, Rob J. Hyndman 2, and Yanan Fan 3 5 December 2003 Abstract: We study methods for constructing confidence

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information