Bootstrap inference
Francisco Cribari-Neto
Departamento de Estatística, Universidade Federal de Pernambuco, Recife/PE, Brazil
email: cribari@gmail.com
October 2014
Unpaid advertisement
Graduate program in Statistics (Master's and PhD) at the Federal University of Pernambuco: http://www.ufpe.br/ppge (CAPES grade: 5).
Research areas: asymptotic theory, econometrics, game theory, multivariate analysis, probability theory, regression analysis, signal processing, time series.
Figure 1: Boa Viagem beach.
Figure 2: Boa Viagem beach (at night).
Figure 3: Recife (the "Brazilian Venice").
Figure 4: Porto de Galinhas beach (near Recife).
"In a world in which the price of calculation continues to decrease rapidly, but the price of theorem proving continues to hold steady or increase, elementary economics indicates that we ought to spend a larger fraction of our time on calculation." (John W. Tukey, 1986)
Figure 5: This is the man (John W. Tukey).
Some references: General
1. Chernick, M.R. (1999). Bootstrap Methods: A Practitioner's Guide. New York: Wiley.
2. Davison, A.C. & Hinkley, D.V. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.
3. Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.
4. Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
Some references: General (cont.)
5. Godfrey, L. (2009). Bootstrap Tests for Regression Models. New York: Palgrave Macmillan.
6. Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer-Verlag.
7. Shao, J. & Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer.
Some references: Specific
1. Booth, J.G. & Hall, P. (1994). Monte Carlo approximation and the iterated bootstrap. Biometrika, 81, 331-340.
2. Cribari-Neto, F. & Zarkos, S.G. (1999). Bootstrap methods for heteroskedastic regression models: evidence on estimation and testing. Econometric Reviews, 18, 465-476.
3. Cribari-Neto, F. & Zarkos, S.G. (2001). Heteroskedasticity-consistent covariance matrix estimation: White's estimator and the bootstrap. Journal of Statistical Computation and Simulation, 68, 391-411.
Some references: Specific (cont.)
4. Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form. Computational Statistics and Data Analysis, 45, 215-233.
5. Cribari-Neto, F.; Frery, A.C. & Silva, M.F. (2002). Improved estimation of clutter properties in speckled imagery. Computational Statistics and Data Analysis, 40, 801-824.
6. Davidson, R. & MacKinnon, J.G. (2000). Bootstrap tests: how many bootstraps? Econometric Reviews, 19, 55-68.
Some references: Specific (cont.)
7. Ferrari, S.L.P. & Cribari-Neto, F. (1997). On bootstrap and analytical bias corrections. Economics Letters, 58, 7-15.
8. Ferrari, S.L.P. & Cribari-Neto, F. (1999). On the robustness of analytical and bootstrap corrections to score tests in regression models. Journal of Statistical Computation and Simulation, 64, 177-191.
Some references: Specific (cont.)
9. Lemonte, A.J.; Simas, A.B. & Cribari-Neto, F. (2008). Bootstrap-based improved estimators for the two-parameter Birnbaum-Saunders distribution. Journal of Statistical Computation and Simulation, 78, 37-49.
10. MacKinnon, J.G. & Smith, Jr., A.A. (1998). Approximate bias correction in econometrics. Journal of Econometrics, 85, 205-230.
11. Wu, C.F.J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Annals of Statistics, 14, 1261-1295.
Some references: Specific (cont.)
Cordeiro, G.M. & Cribari-Neto, F. (2014). An Introduction to Bartlett Correction and Bias Reduction. New York: Springer.
A fundamental equation
Figure 6: C.R. Rao.
C.R. Rao: uncertain knowledge + knowledge about the uncertainty = useful knowledge.
As Anthony Davison and David Hinkley remind us...
"The explicit recognition of uncertainty is central to the statistical sciences. Notions such as prior information, probability models, likelihood, standard errors and confidence limits are all intended to formalize uncertainty and thereby make allowance for it." (Davison & Hinkley)
The big picture (the grand scheme of things)
POPULATION --(sampling)--> DATA, with model = f(parameters).
What is the bootstrap?
The bootstrap is a computer-based method for assessing the accuracy of statistical estimates and tests. It was first proposed by Bradley Efron in a 1979 Annals of Statistics paper.
Main idea: treat the data as if they were the (true, unknown) population, and draw samples (with replacement) from the data as if you were sampling from the population. Repeat the procedure a large number of times (say, B), each time computing the quantity of interest. Then use the B values of the quantity of interest to estimate its unknown distribution.
In a nutshell...
population -> sample (the real world); sample -> bootstrap samples (the virtual world).
Does it work?
Question: Does it work well? Answer: Yes (most of the time).
"In the simplest nonparametric problems we do literally sample from the data, and a common initial reaction is that this is a fraud. In fact it is not." (Davison and Hinkley, 1997)
Asymptotic refinement
Question: When does the bootstrap provide an asymptotic refinement?
Answer: The quantity being bootstrapped must be asymptotically pivotal, that is, it must have a limiting distribution that is free of unknown parameters.
Point estimation in a nutshell
Suppose that the model representing the population is indexed by the parameter \(\theta = (\theta_1, \ldots, \theta_p) \in \Theta\), where \(\Theta\) is the parameter space.
Estimator: a statistic used to estimate \(\theta\). The estimator, say \(\hat\theta\), is typically obtained by minimizing some undesirable quantity (e.g., the sum of squared errors) or by maximizing some desirable quantity (e.g., the likelihood).
Point estimation in a nutshell (cont.)
Some of the most important properties an estimator can enjoy are:
Unbiasedness: \(\mathrm{E}(\hat\theta) = \theta\) for all \(\theta \in \Theta\);
Consistency: \(\hat\theta \stackrel{p}{\to} \theta\);
Asymptotic normality: when \(n\) is large, \(\hat\theta\) is approximately normally distributed;
Efficiency (more generally, optimality in some class; e.g., the Gauss-Markov theorem).
Setup
\(Y_1, \ldots, Y_n\) i.i.d. \(F_0(\theta)\), where \(\theta \in \Theta \subseteq \mathbb{R}^p\).
We can write the unknown parameter \(\theta\) as a functional of \(F_0\): \(\theta = \theta(F_0)\).
We can write an estimator of \(\theta\) (say, the MLE), denoted \(\hat\theta\), as the functional \(\hat\theta = \theta(\hat F)\), where \(\hat F\) is the empirical c.d.f. of \(Y_1, \ldots, Y_n\).
Plug-in principle
\(Y_1, \ldots, Y_n\) i.i.d. \(F_0\).
Plug-in: write the parameter as \(\theta = \theta(F_0)\); the estimator is \(\hat\theta = \theta(\hat F)\).
Example (mean): parameter \(\theta(F_0) = \int y \, dF_0 = \mathrm{E}(Y)\); estimator \(\hat\theta = \int y \, d\hat F = n^{-1} \sum_{i=1}^{n} y_i = \bar y\).
Main idea: the plug-in principle.
Example: let \(\bar Y = n^{-1} \sum_{i=1}^{n} Y_i\). We know that if \(Y_i \sim (\mu, \sigma^2)\), then \(\bar Y \sim (\mu, n^{-1}\sigma^2)\), so that \(n^{-1}\sigma^2\) gives us an indication of the accuracy of the estimate \(\bar Y\). In particular, the standard error of the estimate can be obtained as
\[ \mathrm{s.e.}(\bar Y) = \sqrt{\hat\sigma^2 / n}, \qquad \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar y)^2, \qquad (*) \]
for a given observed sample \(Y_1 = y_1, \ldots, Y_n = y_n\).
Bootstrap approach: write \(\sigma^2 = \sigma^2(F_0)\) and replace \(F_0\) by \(\hat F\) to obtain
\[ \mathrm{b.s.e.}(\bar Y) = \sqrt{\hat\sigma^2 / n}, \qquad \hat\sigma^2 = \sigma^2(\hat F) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar y)^2. \]
This is the bootstrap estimate.
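A quick numerical check of the two formulas above, as a minimal Python sketch (the data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2014)
y = rng.normal(loc=5.0, scale=3.0, size=30)   # illustrative sample
n = len(y)

se_usual = np.sqrt(y.var(ddof=1) / n)   # formula (*): divisor n - 1
se_boot  = np.sqrt(y.var(ddof=0) / n)   # plug-in / bootstrap estimate: divisor n
print(se_usual, se_boot)                # the two differ only by the factor sqrt((n-1)/n)
```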
Noteworthy
Note: (i) the difference between the two estimates is minor and vanishes as \(n \to \infty\); (ii) \(\hat F\) places probability mass \(1/n\) on each of \(y_1, \ldots, y_n\).
Problem: we are usually interested in estimates more complicated than the sample mean, and for such statistics we may not have a directly available formula like (*). (E.g., we may be interested in the correlation coefficient, the median, a given quantile, the coefficients of a quantile regression, etc.)
Solution: the bootstrap approach allows us to evaluate \(\hat\sigma^2 = \sigma^2(\hat F)\) numerically.
Bootstrap standard error
Question: How can we use bootstrap resampling to obtain a standard error for a given estimate?
What's in a number?
The basic bootstrap algorithm
Suppose we wish to obtain a standard error for \(\hat\theta = \theta(\hat F)\), an estimate of \(\theta = \theta(F_0)\), from an i.i.d. sample of size n. Here is how we proceed:
1. Compute \(\hat\theta\) for our sample.
2. Sample from the data with replacement to construct a new sample of size n, say \(y^* = (y_1^*, \ldots, y_n^*)\).
3. Compute \(\hat\theta^*\) for the bootstrap sample obtained in step 2.
4. Repeat steps 2 and 3 B times.
5. Use the B realizations of \(\hat\theta^*\) to obtain an estimate of the standard error of \(\hat\theta\).
The basic bootstrap algorithm (cont.)
That is,
\[ \mathrm{b.s.e.}(\hat\theta) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \bigl\{\hat\theta_b^* - \hat\theta^{*(\cdot)}\bigr\}^2}, \qquad \text{where} \qquad \hat\theta^{*(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b^*. \]
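The algorithm translates directly into code. The sketch below (Python; data, statistic and B are illustrative) estimates the standard error of the sample median, a statistic with no simple analytic formula:

```python
import numpy as np

def boot_se(y, stat, B=2000, seed=None):
    """Basic nonparametric bootstrap standard error of stat(y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample indices with replacement
        reps[b] = stat(y[idx])             # theta*_b
    return reps.std(ddof=1)                # sqrt of (1/(B-1)) * sum (theta*_b - theta*(.))^2

y = np.random.default_rng(1).exponential(size=50)
print(boot_se(y, np.median))
```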
It is important to notice that...
The bootstrap generalizes the jackknife in the sense that resampling is carried out in a random fashion, and not in a deterministic and systematic ("leave one out") way.
Parametric versus nonparametric bootstrap
The bootstrap may be performed parametrically or nonparametrically.
Nonparametric bootstrap: resample from \(\hat F\), i.e., sample from the data (with replacement).
Parametric bootstrap: sample from \(F(\hat\theta)\).
The nonparametric bootstrap is more robust against distributional assumptions, whereas the parametric bootstrap is expected to be more efficient when the parametric assumptions are true. A sketch of both schemes is given below.
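The two resampling schemes side by side, as a minimal sketch (the normal model used for the parametric case is an assumption made only for this example):

```python
import numpy as np

rng = np.random.default_rng(123)
y = rng.normal(loc=10.0, scale=2.0, size=40)      # observed sample

# Nonparametric bootstrap sample: draw from F-hat, i.e., from the data with replacement.
y_star_np = rng.choice(y, size=y.size, replace=True)

# Parametric bootstrap sample: estimate theta-hat = (mu-hat, sigma-hat), then draw from F(theta-hat).
mu_hat, sigma_hat = y.mean(), y.std(ddof=1)
y_star_p = rng.normal(mu_hat, sigma_hat, size=y.size)
```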
Nonparametric bootstrap sampling
Empirical distribution: puts equal probability weights \(n^{-1}\) on each sample value \(y_i\).
Empirical distribution function (EDF): \(\hat F(y) = \#\{y_i \le y\} / n\).
Notice that the values of the EDF are fixed: \((0, 1/n, 2/n, \ldots, n/n)\). Hence, the EDF is equivalent to its points of increase: \(y_{(1)} \le \cdots \le y_{(n)}\) (the ordered sample values).
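A one-line implementation of the EDF, just to make the definition concrete (illustrative values only):

```python
import numpy as np

def edf(y):
    """EDF: F-hat(t) = #{y_i <= t} / n."""
    y = np.asarray(y)
    return lambda t: np.mean(y <= t)

F = edf([3.1, 1.2, 2.7, 5.0])
print(F(2.7))   # 0.5: the EDF puts mass 1/n on each sample value
```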
Nonparametric bootstrap sampling (cont.)
Since the EDF puts equal probabilities on the data values \(y_1, \ldots, y_n\), each \(Y^*\) is independently sampled at random from those data values. Hence, the bootstrap sample is a random sample taken with replacement from the original data.
Smoothed bootstrap
The empirical distribution function \(\hat F\) is discrete, and sampling from it boils down to sampling from the data with replacement.
An interesting idea: sample from a smoothed distribution function instead. We replace \(\hat F\) by a smooth distribution based on, e.g., a kernel density estimate (an estimate of \(f\), the derivative of \(F\) with respect to \(y\)).
An example using the correlation coefficient can be found in Efron's 1982 monograph: Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia: SIAM.
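A minimal sketch of the smoothed bootstrap with a Gaussian kernel; the bandwidth h is left to the user (h = 0 recovers the ordinary nonparametric bootstrap), and the value used below is purely illustrative:

```python
import numpy as np

def smoothed_boot_sample(y, h, rng):
    """Resample from the data, then perturb with Gaussian kernel noise of bandwidth h."""
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return y[idx] + h * rng.standard_normal(n)

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, size=50)
y_star = smoothed_boot_sample(y, h=0.2, rng=rng)
```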
Bayesian bootstrap
Suppose \(y_1, \ldots, y_n\) are i.i.d. realizations of \(Y\), which has distribution function \(F(\theta)\), where \(\theta\) is scalar. Let \(\hat\theta\) be an estimator of \(\theta\). We know that the bootstrap can be used to construct an estimate of the distribution of such an estimator.
Instead of sampling from the data with replacement (i.e., sampling each \(y_i\) with probability \(1/n\)), the Bayesian bootstrap uses a posterior probability distribution for \(y_i\). The posterior probability distribution is centered at \(1/n\) but varies for each \(y_i\). How is that done?
Bayesian bootstrap (cont.)
Draw a random sample of size \(n-1\) from the standard uniform distribution. Order the sampled values: \(u_{(1)}, \ldots, u_{(n-1)}\). Let \(u_{(0)} = 0\) and \(u_{(n)} = 1\).
Compute \(g_i = u_{(i)} - u_{(i-1)}\), \(i = 1, \ldots, n\). The \(g_i\)'s are called the gaps between the uniform order statistics.
The vector \(g = (g_1, \ldots, g_n)\) is used to assign probabilities in the Bayesian bootstrap: sample \(y_i\) with probability \(g_i\) (not \(1/n\)). Note that we obtain a different \(g\) in each bootstrap replication. A sketch of this scheme appears below.
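A sketch of the weight construction and its use for the mean (data and the number of replications are illustrative):

```python
import numpy as np

def bb_weights(n, rng):
    """Bayesian bootstrap weights: gaps between n-1 ordered U(0,1) draws (they sum to 1)."""
    u = np.sort(rng.uniform(size=n - 1))
    u = np.concatenate(([0.0], u, [1.0]))   # u_(0) = 0, u_(n) = 1
    return np.diff(u)                       # g_i = u_(i) - u_(i-1)

rng = np.random.default_rng(7)
y = rng.exponential(size=30)
reps = np.array([bb_weights(len(y), rng) @ y for _ in range(2000)])
# 'reps' approximates the Bayesian bootstrap posterior distribution of the mean
```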
Bayesian bootstrap (cont.)
Advantage: it can be used to make Bayesian inference on \(\theta\) based on the estimated posterior distribution for \(\theta\). The bootstrap distribution of \(\hat\theta\) and the Bayesian bootstrap posterior distribution for \(\theta\) will be similar in many applications.
Reference: Rubin, D.B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.
Revisiting the big picture
POPULATION --(sampling)--> DATA, with model = f(parameters).
Software and programming
Programming bootstrap resampling is easy: (i) parametric: sample from \(F(\hat\theta)\); (ii) nonparametric: sample from \(\hat F\) (the empirical distribution function).
Sampling from \(\hat F\):
1. Obtain a standard uniform draw, i.e., obtain \(u\) from \(\mathcal{U}(0, 1)\).
2. Generate a random integer (say, \(i^*\)) from \(\{1, \ldots, n\}\) as \(i^* = \lfloor u\,n \rfloor + 1\). The \(i\)-th observation in the bootstrap sample is the \(i^*\)-th observation in the original sample.
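The index trick in code (in 0-based languages the "+ 1" is dropped):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 25
u = rng.uniform(size=n)                       # standard uniform draws
i_star = np.floor(u * n).astype(int) + 1      # random integers in {1, ..., n}
# With 0-based indexing, y_star = y[np.floor(u * n).astype(int)] gives the bootstrap sample.
```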
Software
R package boot: functions and datasets for bootstrapping from the book Bootstrap Methods and their Application by A.C. Davison and D.V. Hinkley (1997, Cambridge University Press).
Bootstrap bias correction
Suppose that \(\hat\theta\) is biased (although consistent) for \(\theta\), and that we would like to obtain a new estimate which is bias-corrected up to some order of accuracy.
The bias of \(\hat\theta\) is \(\mathrm{E}(\hat\theta) - \theta\) (the systematic error).
Ideally, we would like to compute \(\hat\theta - \mathrm{bias}\), but this is not feasible (since the bias depends on \(\theta\)).
Define the bias-corrected estimate as \(\tilde\theta = \hat\theta - \widehat{\mathrm{bias}}\). We then take \(\widehat{\mathrm{bias}}\) to be \(\widehat{\mathrm{bias}}_B = \hat\theta^{*(\cdot)} - \hat\theta\), which implies that
\[ \tilde\theta = \hat\theta - \bigl\{\hat\theta^{*(\cdot)} - \hat\theta\bigr\} = 2\hat\theta - \hat\theta^{*(\cdot)}. \]
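This correction in code, as a minimal sketch using nonparametric resampling (the statistic and data below are illustrative):

```python
import numpy as np

def bc1(y, stat, B=1000, seed=None):
    """BC1: theta-tilde = 2*theta-hat - theta*(.)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta_hat = stat(y)
    reps = np.array([stat(y[rng.integers(0, n, size=n)]) for _ in range(B)])
    bias_hat = reps.mean() - theta_hat      # bias_B = theta*(.) - theta-hat
    return theta_hat - bias_hat             # = 2*theta-hat - theta*(.)

y = np.random.default_rng(9).lognormal(size=40)
print(bc1(y, lambda x: x.var(ddof=0)))      # bias-corrected version of a biased variance estimate
```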
We shall call the above bias correction BC1.
Note: \(\hat\theta^{*(\cdot)}\) is not itself the bootstrap bias-corrected estimate. Let's look into that. (For further details, see, e.g., MacKinnon & Smith, Journal of Econometrics, 1998.)
Assuming that \(\mathrm{E}(\hat\theta)\) exists, write \(\hat\theta = \theta_0 + B(\theta_0, n) + R(\theta_0, n)\), where \(B(\theta_0, n) = \mathrm{E}(\hat\theta) - \theta_0\) (i.e., \(B(\cdot, \cdot)\) is the bias function) and \(R(\theta_0, n)\) is defined so that the above equation holds.
Assume we know the distribution of \(Y_i\) up to the unknown parameter \(\theta\) (so that we can use the parametric bootstrap).
Noteworthy
If \(\hat\theta\) is \(\sqrt{n}\)-consistent and asymptotically normal, the bias will typically be \(O(n^{-1})\). (Otherwise, \(\sqrt{n}(\hat\theta - \theta_0)\) would not have mean zero asymptotically.)
Suppose that \(B(\theta, n) = B(n)\) for all \(\theta\), i.e., suppose the bias function is flat. In that case it does not matter at which value of \(\theta\) we evaluate the bias function, since it is flat. An obvious candidate, however, is \(\hat\theta\), the MLE. And what we get here is exactly our bias correction BC1:
\[ \tilde\theta = \hat\theta - B(\hat\theta, n) = \hat\theta - \bigl\{\hat\theta^{*(\cdot)} - \hat\theta\bigr\} = 2\hat\theta - \hat\theta^{*(\cdot)}. \]
Note: in many (most?) cases, however, the bias function is not flat.
Suppose now that the bias function is linear in \(\theta\), that is,
\[ B(\theta, n) = \alpha_0 + \alpha_1 \theta. \qquad (**) \]
The main idea is to evaluate (**) at two points and then solve for \(\alpha_0\) and \(\alpha_1\). (Note that this will require two sets of bootstrap simulations.) Obvious choices for the two points at which to set the DGP are the original estimate \(\hat\theta\) and a second point \(\bar\theta\) (e.g., a preliminary bias-corrected estimate). With bootstrap bias estimates \(\hat B\) and \(\bar B\) obtained at \(\hat\theta\) and \(\bar\theta\), the solution is
\[ \hat\alpha_1 = \frac{\hat B - \bar B}{\hat\theta - \bar\theta}, \qquad \hat\alpha_0 = \hat B - \hat\alpha_1 \hat\theta. \]
(Note: this is shorthand notation.) The estimated \(\alpha\)'s will converge to the true ones as the number of bootstrap replications increases.
The bias-corrected estimator can then be defined through \(\tilde\theta = \hat\theta - \hat\alpha_0 - \hat\alpha_1 \tilde\theta\) (here we are evaluating the bias function at \(\tilde\theta\) itself). Solving for \(\tilde\theta\) gives
\[ \tilde\theta = \frac{1}{1 + \hat\alpha_1}\,(\hat\theta - \hat\alpha_0). \]
We can call the above bias correction BC2. A sketch of this construction is given below.
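A sketch of BC2 under the stated linearity assumption. Here `bias_at(theta)` stands for any user-supplied routine that returns a (parametric) bootstrap estimate of B(theta, n) with the DGP set at theta; the function name and the choice of the second evaluation point are hypothetical:

```python
def bc2(theta_hat, theta2, bias_at):
    """Linear-bias correction: fit B(theta, n) = a0 + a1*theta from two evaluations,
    then solve theta-tilde = (theta-hat - a0) / (1 + a1)."""
    B1, B2 = bias_at(theta_hat), bias_at(theta2)
    a1 = (B1 - B2) / (theta_hat - theta2)
    a0 = B1 - a1 * theta_hat
    return (theta_hat - a0) / (1.0 + a1)
```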
What if the bias function is nonlinear? In that case, we define a bias-corrected estimator through \(\tilde\theta = \hat\theta - B(\tilde\theta, n)\). One way of implementing this is as follows (a code sketch follows the description). Start with \(\hat B\) obtained as in BC1, i.e., \(\hat B = \hat\theta^{*(\cdot)} - \hat\theta\). Now compute, sequentially,
\[ \tilde\theta^{(j)} = (1 - \lambda)\,\tilde\theta^{(j-1)} + \lambda\,\bigl(\hat\theta - \hat B(\tilde\theta^{(j-1)}, n)\bigr), \]
where \(\tilde\theta^{(0)} = \hat\theta\) and \(0 < \lambda \le 1\). Stop when \(|\tilde\theta^{(j)} - \tilde\theta^{(j-1)}| < \epsilon\) for a sufficiently small \(\epsilon\).
Suggestion: start with \(\lambda = 1\); if the procedure does not converge, try smaller values of \(\lambda\).
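The iterative scheme in code. As above, `bias_at(theta)` is a placeholder for a bootstrap bias estimate evaluated with the DGP set at theta (it is recomputed at each iteration, so the procedure can be expensive):

```python
def bc_nonlinear(theta_hat, bias_at, lam=1.0, tol=1e-6, max_iter=100):
    """Fixed-point iteration: theta^(j) = (1-lam)*theta^(j-1) + lam*(theta_hat - B(theta^(j-1), n))."""
    theta = theta_hat                     # theta^(0) = theta-hat
    for _ in range(max_iter):
        theta_new = (1.0 - lam) * theta + lam * (theta_hat - bias_at(theta))
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    raise RuntimeError("no convergence; try a smaller lambda")
```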
An alternative bootstrap bias estimate was introduced by Efron (1990). It is computed nonparametrically and uses an auxiliary \(n \times 1\) resampling vector, whose elements are the proportions of the observations in the original sample \(y = (y_1, \ldots, y_n)\) that were included in the bootstrap sample.
Let \(P^* = (P_1^*, P_2^*, \ldots, P_n^*)\) be the resampling vector. Its \(j\)-th element (\(j = 1, 2, \ldots, n\)), \(P_j^*\), is defined with respect to a given bootstrap sample \(y^* = (y_1^*, \ldots, y_n^*)\) as
\[ P_j^* = n^{-1}\,\#\{y_k^* = y_j\}. \]
It is important to note that the vector \(P^0 = (1/n, 1/n, \ldots, 1/n)\) corresponds to the original sample.
Also, any bootstrap replicate \(\hat\theta^*\) can be written as a function of the resampling vector. For example, if \(\hat\theta = s(y) = \bar y = n^{-1} \sum_{i=1}^{n} y_i\), then
\[ \hat\theta^* = \frac{y_1^* + y_2^* + \cdots + y_n^*}{n} = \frac{(nP_1^*)\,y_1 + \cdots + (nP_n^*)\,y_n}{n} = P^{*\top} y. \]
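Computing the resampling vector for one bootstrap sample and checking the identity for the mean (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=10)
n = len(y)

idx = rng.integers(0, n, size=n)                # one nonparametric bootstrap sample
P_star = np.bincount(idx, minlength=n) / n      # P*_j = #{y*_k = y_j} / n
print(np.isclose(P_star @ y, y[idx].mean()))    # True: the replicate of the mean is P*' y
```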
Suppose we can write the estimate of interest, obtained from the original sample \(y\), as \(G(P^0)\). It is then possible to write the bootstrap estimates \(\hat\theta_b^*\), obtained using the resampling vectors \(P_b^*\), \(b = 1, 2, \ldots, R\), as \(G(P_b^*)\).
Efron's (1990) bootstrap bias estimate, \(\bar B_{\hat F}(\hat\theta, \theta)\), is defined as
\[ \bar B_{\hat F}(\hat\theta, \theta) = \hat\theta^{*(\cdot)} - G(\bar P^{*(\cdot)}), \qquad \text{where} \qquad \bar P^{*(\cdot)} = \frac{1}{R} \sum_{b=1}^{R} P_b^*, \]
which differs from \(\hat B_{\hat F}(\hat\theta, \theta)\), since \(\hat B_{\hat F}(\hat\theta, \theta) = \hat\theta^{*(\cdot)} - G(P^0)\).
Notice that this bias estimate uses additional information, namely the proportions of the \(n\) observations that were selected in each nonparametric resampling.
After obtaining an estimate of the bias, it is easy to obtain a bias-adjusted estimator:
\[ \tilde\theta = s(y) - \bar B_{\hat F}(\hat\theta, \theta) = \hat\theta - \hat\theta^{*(\cdot)} + G(\bar P^{*(\cdot)}). \]
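A sketch of the calculation for the sample mean, where G(P) = P'y. Since the mean is linear in P, G of the averaged resampling vector equals the average of the replicates, so the alternative bias estimate is exactly zero (the true bias), while the usual estimate carries only Monte Carlo noise; this illustrates the variance reduction the extra information buys:

```python
import numpy as np

rng = np.random.default_rng(11)
y = rng.exponential(size=20)
n, R = len(y), 2000

reps, P_bar = np.empty(R), np.zeros(n)
for b in range(R):
    idx = rng.integers(0, n, size=n)
    P = np.bincount(idx, minlength=n) / n
    reps[b] = P @ y                   # G(P*_b); here G(P) = P' y (the sample mean)
    P_bar += P / R                    # running average, P-bar*(.)

B_usual = reps.mean() - y.mean()      # theta*(.) - G(P^0): pure Monte Carlo noise here
B_efron = reps.mean() - P_bar @ y     # theta*(.) - G(P-bar*): exactly 0 for a linear G
theta_adj = y.mean() - B_efron        # bias-adjusted estimate
```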
It is important to note that the bias estimation procedure proposed by Efron (1990) requires the estimator \(\hat\theta\) to have a closed form. However, oftentimes the maximum likelihood estimator of \(\theta\), the parameter that indexes the model used to represent the population, does not have a closed form. Rather, it must be obtained by numerically maximizing the log-likelihood function using a nonlinear optimization algorithm, such as a Newton or quasi-Newton algorithm.
Cribari-Neto, Frery and Silva (2002) proposed an adaptation of Efron's method that can be used with estimators that cannot be written in closed form.
They use the resampling vector to modify the log-likelihood function and then maximize the modified log-likelihood. The main idea is to write the log-likelihood function in terms of \(P^0\), replace this vector by \(\bar P^{*(\cdot)}\), and then maximize the resulting (modified) log-likelihood function. The maximizer of such a function is a bias-corrected maximum likelihood estimator.
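A minimal sketch of the idea, not the authors' implementation: write the log-likelihood as n times a weighted sum of log-densities with weights 1/n (the elements of P^0), then swap in the averaged resampling vector as the weights and maximize numerically. The gamma model, the weight vector and all names below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

def neg_modified_loglik(params, y, w):
    """-n * sum_j w_j * log f(y_j; alpha); w = P^0 gives the usual MLE, w = P-bar*(.) the corrected fit."""
    alpha = params[0]
    if alpha <= 0:
        return np.inf
    return -len(y) * np.sum(w * gamma.logpdf(y, a=alpha))

rng = np.random.default_rng(5)
y = rng.gamma(shape=2.0, size=30)
w = np.full(len(y), 1.0 / len(y))   # replace by P-bar*(.) computed from the bootstrap replications
fit = minimize(neg_modified_loglik, x0=[1.0], args=(y, w), method="Nelder-Mead")
print(fit.x[0])                     # maximizer of the (modified) log-likelihood
```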