Lecture 13: 2011
Bootstrap

Root: $R_n(X^n, \theta(P)) = \tau_n(\hat\theta_n - \theta(P))$.

Examples: $\hat\theta_n = \bar X_n$, $\tau_n = \sqrt{n}$, $\theta(P) = EX = \mu(P)$; or $\hat\theta_n = \min X^n$, $\tau_n = n$, $\theta(P) = \sup\{x : F(x) = 0\}$.

Define $J_n(P)$, the distribution of $\tau_n(\hat\theta_n - \theta(P))$ under $P$. For real-valued $\hat\theta_n$,
$$J_n(x, P) = \mathrm{Prob}_P\left(\tau_n(\hat\theta_n - \theta(P)) \le x\right).$$
Since $P$ is unknown, $\theta(P)$ is unknown, and $J_n(x, P)$ is also unknown.
The bootstrap estimates $J_n(x, P)$ by $J_n(x, \hat P_n)$, where $\hat P_n$ is a consistent estimate of $P$ in some sense. For example, take the empirical distribution $\hat P_n(x) = \frac{1}{n}\sum_{i=1}^n 1(X_i \le x)$, for which $\sup_x |\hat P_n(x) - P(x)| \to 0$ a.s. Similarly, estimate the $(1-\alpha)$th quantile of $J_n(x, P)$ by that of $J_n(x, \hat P_n)$: i.e. estimate $J_n^{-1}(1-\alpha, P)$ by $J_n^{-1}(1-\alpha, \hat P_n)$. Usually $J_n(x, \hat P_n)$ cannot be calculated explicitly, so use Monte Carlo:
$$J_n(x, \hat P_n) \approx \frac{1}{B}\sum_{i=1}^B 1\left(\tau_n(\hat\theta^*_{n,i} - \hat\theta_n) \le x\right)$$
for $\hat\theta^*_{n,i} = \hat\theta(X^*_{1,i}, \dots, X^*_{n,i})$, where each $(X^*_{1,i}, \dots, X^*_{n,i})$ is an i.i.d. draw from $\hat P_n$.
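The Monte Carlo step above can be sketched in a few lines. This is an illustrative sketch, not code from the notes: it assumes numpy, uses the sample mean as the statistic (so $\tau_n = \sqrt{n}$), and invents the helper name `bootstrap_root_distribution`.

```python
import numpy as np

def bootstrap_root_distribution(x, theta_hat, tau, rng, B=2000):
    """Monte Carlo approximation of J_n(., P_hat_n): draw B resamples of
    size n from the empirical distribution P_hat_n, recompute the statistic,
    and collect the roots tau_n * (theta*_{n,i} - theta_hat_n)."""
    n = len(x)
    theta_n = theta_hat(x)
    roots = np.empty(B)
    for i in range(B):
        x_star = rng.choice(x, size=n, replace=True)  # X*_{1,i},...,X*_{n,i} ~ P_hat_n
        roots[i] = tau(n) * (theta_hat(x_star) - theta_n)
    return roots

rng = np.random.default_rng(0)
x = rng.exponential(size=200)                     # some i.i.d. sample
roots = bootstrap_root_distribution(x, np.mean, np.sqrt, rng)
q95 = np.quantile(roots, 0.95)                    # estimates J_n^{-1}(0.95, P_hat_n)
```

The empirical CDF of `roots` plays the role of $J_n(x, \hat P_n)$, and its quantiles estimate $J_n^{-1}(\cdot, \hat P_n)$.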
When the bootstrap works, for each $x$,
$$J_n(x, \hat P_n) - J_n(x, P) \xrightarrow{p} 0 \;\Longrightarrow\; J_n^{-1}(1-\alpha, \hat P_n) - J_n^{-1}(1-\alpha, P) \xrightarrow{p} 0.$$
When should the bootstrap work? We need local uniformity in weak convergence. Usually $J_n(x, P) \to J(x, P)$, and usually $\hat P_n \to P$ a.s. in some sense, say $\sup_x |\hat P_n(x) - P(x)| \xrightarrow{a.s.} 0$. Suppose that for each sequence $P_n$ such that $P_n \to P$, say $\sup_x |P_n(x) - P(x)| \to 0$, it is also true that $J_n(x, P_n) \to J(x, P)$; then it must be true that $J_n(x, \hat P_n) \xrightarrow{a.s.} J(x, P)$. So it ends up having to be shown that for $P_n \to P$, $J_n(x, P_n) \to J(x, P)$; use a triangular array formulation.
Case When Bootstrap Works

Sample mean with finite variance: $\sup_x |\hat F_n(x) - F(x)| \xrightarrow{a.s.} 0$, $\theta(\hat F_n) = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{a.s.} \theta(F) = EX$, and $\sigma^2(\hat F_n) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 \xrightarrow{a.s.} \sigma^2(F) = \mathrm{Var}(X)$. Using Lindeberg--Feller for the triangular array, applied to a deterministic sequence $P_n$ such that (1) $\sup_x |P_n(x) - P(x)| \to 0$; (2) $\theta(P_n) \to \theta(P)$; (3) $\sigma^2(P_n) \to \sigma^2(P)$, it can be shown that $\sqrt{n}(\bar X_n - \theta(P_n)) \xrightarrow{d} N(0, \sigma^2)$ under $P_n$. Since $\hat P_n$ satisfies (1)--(3) a.s., $J_n(x, \hat P_n) \xrightarrow{a.s.} J(x, P)$, so local uniformity of weak convergence is satisfied here.
Cases When Bootstrap Fails

Order statistics: $F \sim U(0, \theta)$, and $X_{(1)}, \dots, X_{(n)}$ are the order statistics of the sample, so $X_{(n)}$ is the maximum:
$$P\left(\frac{n(\theta - X_{(n)})}{\theta} > x\right) = P\left(X_{(n)} < \theta - \frac{\theta x}{n}\right) = P\left(X_1 < \theta - \frac{\theta x}{n}\right)^n = \left(\frac{1}{\theta}\left(\theta - \frac{\theta x}{n}\right)\right)^n = \left(1 - \frac{x}{n}\right)^n \to e^{-x}.$$
The bootstrap version puts an atom at zero:
$$P^*\left(n(X_{(n)} - X^*_{(n)}) = 0\right) = 1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \approx 0.63,$$
while the limiting exponential distribution is continuous, so the bootstrap fails.
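A quick simulation makes the failure visible. This is an illustrative sketch (assuming numpy, $\theta = 1$): the bootstrap root picks up an atom at zero of mass roughly $1 - e^{-1}$, because the resampled maximum equals the sample maximum whenever $X_{(n)}$ is drawn at least once.

```python
import numpy as np

# X_i ~ U(0, 1); bootstrap root n*(X_(n) - X*_(n)). The resampled maximum
# equals X_(n) with probability 1 - (1 - 1/n)^n -> 1 - 1/e, so the bootstrap
# distribution has an atom at 0, unlike the continuous Exp(1) limit.
rng = np.random.default_rng(1)
n, B = 500, 4000
x = rng.uniform(0.0, 1.0, size=n)
x_max = x.max()
boot_roots = np.array([n * (x_max - rng.choice(x, size=n, replace=True).max())
                       for _ in range(B)])
mass_at_zero = np.mean(boot_roots == 0.0)  # should be close to 0.63
```
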
Degenerate U-statistics: take $w(x, y) = xy$, so $\theta(F) = \int\!\!\int w(x, y)\,dF(x)\,dF(y) = \mu(F)^2$. If $\mu(F) \ne 0$, it is known that for
$$\hat\theta_n = \theta(\hat F_n) = \frac{1}{n(n-1)}\sum_{i \ne j} X_i X_j, \qquad S(x) = \int xy\,dF(y) = x\,\mu(F),$$
$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N\left(0,\, 4\,\mathrm{Var}(S(X))\right) = N\left(0,\, 4(\mu^2 EX^2 - \mu^4)\right).$$
The bootstrap works.
But if $\mu(F) = 0$, then $\theta(F) = 0$ and the U-statistic is degenerate. With $S_n^2 = \frac{1}{n-1}\sum_i (X_i - \bar X_n)^2$,
$$\theta(\hat F_n) = \frac{1}{n(n-1)}\sum_{i \ne j} X_i X_j = \bar X_n^2 - \frac{S_n^2}{n},$$
so the rate is now $\tau_n = n$:
$$n\left(\theta(\hat F_n) - \theta(F)\right) = n\bar X_n^2 - S_n^2 = \left(\sqrt{n}\,\bar X_n\right)^2 - S_n^2 \xrightarrow{d} N(0, \sigma^2)^2 - \sigma^2.$$
However, the bootstrap version of $n\left[\theta(\hat F_n^*) - \theta(\hat F_n)\right]$ is
$$n\left[\bar X_n^{*2} - \frac{S_n^{*2}}{n}\right] - n\left[\bar X_n^2 - \frac{S_n^2}{n}\right] = n\bar X_n^{*2} - S_n^{*2} - n\bar X_n^2 + S_n^2 = \left[\sqrt{n}\,(\bar X_n^* - \bar X_n)\right]^2 + 2\sqrt{n}\,(\bar X_n^* - \bar X_n)\cdot\sqrt{n}\,\bar X_n - S_n^{*2} + S_n^2$$
$$\xrightarrow{d} N(0, \sigma^2)^2 + 2\,N(0, \sigma^2)\cdot\sqrt{n}\,\bar X_n.$$
The cross term $2\,N(0,\sigma^2)\cdot\sqrt{n}\,\bar X_n$ does not settle down, since $\sqrt{n}\,\bar X_n = O_p(1)$ but is not convergent to a constant, so the bootstrap fails.
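The algebraic identity behind the rate change can be checked numerically. This is a small sketch (assuming numpy); it verifies $\frac{1}{n(n-1)}\sum_{i\ne j}X_iX_j = \bar X_n^2 - S_n^2/n$, which follows from $\sum_{i\ne j}X_iX_j = (\sum_i X_i)^2 - \sum_i X_i^2$.

```python
import numpy as np

# Identity check: (1/(n(n-1))) * sum_{i != j} X_i X_j == Xbar^2 - S_n^2/n,
# with S_n^2 the unbiased sample variance.
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
s = x.sum()
u_direct = (s * s - np.dot(x, x)) / (n * (n - 1))  # U-statistic, computed directly
u_closed = x.mean() ** 2 - x.var(ddof=1) / n       # closed form used in the notes
```
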
Subsampling

i.i.d. case: let $Y_i$ be a block of size $b$ from $(X_1, \dots, X_n)$, $i = 1, \dots, q$, for $q = \binom{n}{b}$. Let $\hat\theta_{n,b,i} = \hat\theta(Y_i)$ be the estimate calculated with the $i$th block of data. Use the empirical distribution of $\tau_b(\hat\theta_{n,b,i} - \hat\theta_n)$ over the $q$ pseudo-estimates to approximate the distribution of $\tau_n(\hat\theta_n - \theta)$: approximate
$$J_n(x, P) = P\left(\tau_n(\hat\theta_n - \theta) \le x\right) \quad\text{by}\quad L_{n,b}(x) = q^{-1}\sum_{i=1}^q 1\left(\tau_b(\hat\theta_{n,b,i} - \hat\theta_n) \le x\right).$$
Claim: if $b \to \infty$, $b/n \to 0$, $\tau_b/\tau_n \to 0$, then as long as $\tau_n(\hat\theta_n - \theta) \xrightarrow{d}$ something, $J_n(x, P) - L_{n,b}(x) \xrightarrow{p} 0$.
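A sketch of $L_{n,b}$ for the sample mean (hypothetical code, assuming numpy). Since $q = \binom{n}{b}$ is astronomically large, a random subset of blocks stands in for the full set, which is a standard stochastic approximation; the helper name `subsampling_roots` is invented here.

```python
import numpy as np

def subsampling_roots(x, theta_hat, tau, b, rng, n_sub=2000):
    """Approximate L_{n,b}: recompute the statistic on size-b blocks drawn
    without replacement and collect tau_b * (theta_{n,b,i} - theta_n)."""
    theta_n = theta_hat(x)
    roots = np.empty(n_sub)
    for i in range(n_sub):
        y = rng.choice(x, size=b, replace=False)  # each block comes from the TRUE model
        roots[i] = tau(b) * (theta_hat(y) - theta_n)
    return roots

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
roots = subsampling_roots(x, np.mean, np.sqrt, b=100, rng=rng)
# for the mean of N(0,1) data, the roots should look approximately N(0, 1)
```
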
Different Motivation for Subsampling vs. Bootstrap

Subsampling: each subset of size $b$ comes from the TRUE model. Since $\tau_n(\hat\theta_n - \theta) \xrightarrow{d} J(x, P)$, as long as $b \to \infty$ we also have $\tau_b(\hat\theta_b - \theta) \xrightarrow{d} J(x, P)$, so the distributions of $\tau_n(\hat\theta_n - \theta)$ and $\tau_b(\hat\theta_b - \theta)$ should be close. But $\tau_b(\hat\theta_b - \theta) = \tau_b(\hat\theta_b - \hat\theta_n) + \tau_b(\hat\theta_n - \theta)$, and since
$$\tau_b(\hat\theta_n - \theta) = O_p\left(\frac{\tau_b}{\tau_n}\right) = o_p(1),$$
the distributions of $\tau_b(\hat\theta_b - \theta)$ and $\tau_b(\hat\theta_b - \hat\theta_n)$ should be close. The distribution of $\tau_b(\hat\theta_b - \hat\theta_n)$ is estimated by the empirical distribution over the $q = \binom{n}{b}$ pseudo-estimates.
Bootstrap: recalculate the statistic from the ESTIMATED model $\hat P_n$. Given that $\hat P_n$ is close to $P$, hopefully $J_n(x, \hat P_n)$ is close to $J_n(x, P)$ (or to $J(x, P)$, the limit distribution). But when the bootstrap fails, $\hat P_n \to P$ yet $J_n(x, \hat P_n) \nrightarrow J(x, P)$.
Formal Proof of Consistency of Subsampling

Assumptions: $\tau_n(\hat\theta_n - \theta) \xrightarrow{d} J(x, P)$, $b \to \infty$, $b/n \to 0$, $\tau_b/\tau_n \to 0$. Need to show: $L_{n,b}(x) - J(x, P) \xrightarrow{p} 0$. Since $\tau_b(\hat\theta_n - \theta) \xrightarrow{p} 0$, it is enough to show
$$U_{n,b}(x) = q^{-1}\sum_{i=1}^q 1\left(\tau_b(\hat\theta_{n,b,i} - \theta) \le x\right) \xrightarrow{p} J(x, P).$$
$U_{n,b}(x)$ is a $b$th-order U-statistic whose kernel is an indicator, hence bounded. Writing $U_{n,b}(x) - J(x, P) = [U_{n,b}(x) - EU_{n,b}(x)] + [EU_{n,b}(x) - J(x, P)]$, it is enough to show $U_{n,b}(x) - EU_{n,b}(x) \xrightarrow{p} 0$ and $EU_{n,b}(x) - J(x, P) \to 0$.
But $EU_{n,b}(x) - J(x, P) = J_b(x, P) - J(x, P) \to 0$. For the first term, use Hoeffding's exponential-type inequality for U-statistics (Serfling 1980, Thm. A, p. 201), which for a kernel taking values in $[0, 1]$ gives
$$P\left(\left|U_{n,b}(x) - J_b(x, P)\right| \ge \epsilon\right) \le 2\exp\left(-2\lfloor n/b\rfloor\,\epsilon^2\right) \to 0 \quad\text{as } n/b \to \infty.$$
So
$$L_{n,b}(x) - J(x, P) = \left[L_{n,b}(x) - U_{n,b}(x)\right] + \left[U_{n,b}(x) - J_b(x, P)\right] + \left[J_b(x, P) - J(x, P)\right] \xrightarrow{p} 0. \quad\text{Q.E.D.}$$
Time Series

Respect the ordering of the data to preserve correlation: use overlapping blocks $\hat\theta_{n,b,t} = \hat\theta_b(X_t, \dots, X_{t+b-1})$, $q = T - b + 1$, and
$$L_{n,b}(x) = \frac{1}{q}\sum_{t=1}^q 1\left(\tau_b(\hat\theta_{n,b,t} - \hat\theta_n) \le x\right).$$
Assumptions: $\tau_n(\hat\theta_n - \theta) \xrightarrow{d} J(x, P)$, $b \to \infty$, $b/n \to 0$, $\tau_b/\tau_n \to 0$, and mixing $\alpha(m) \to 0$. Result: $L_{n,b}(x) - J(x, P) \xrightarrow{p} 0$. The most difficult part is to show $\tau_n(\hat\theta_n - \theta) \xrightarrow{d} J(x, P)$.
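The overlapping-blocks scheme can be sketched as follows (an illustrative example, assuming numpy; the AR(1) series with $\phi = 0.5$ is an assumed test case, not from the notes).

```python
import numpy as np

def subsampling_ts_roots(x, theta_hat, tau, b):
    """Time-series subsampling: all q = T - b + 1 overlapping blocks
    (X_t, ..., X_{t+b-1}), preserving the ordering of the data."""
    T = len(x)
    theta_n = theta_hat(x)
    q = T - b + 1
    return np.array([tau(b) * (theta_hat(x[t:t + b]) - theta_n) for t in range(q)])

# illustration on an AR(1) series x_t = phi * x_{t-1} + eps_t
rng = np.random.default_rng(4)
T, phi = 3000, 0.5
eps = rng.normal(size=T)
x = np.empty(T)
x[0] = eps[0]
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]
roots = subsampling_ts_roots(x, np.mean, np.sqrt, b=150)
# the roots approximate the law of sqrt(T)*(mean - mu), whose long-run
# variance here is 1/(1 - phi)^2 = 4, not the i.i.d. variance
```
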
One can treat i.i.d. data as a time series, or even use $k = \lfloor n/b \rfloor$ non-overlapping blocks, but using all $\binom{n}{b}$ blocks is more efficient. For example, if
$$\bar U_n(x) = k^{-1}\sum_{j=1}^k 1\left(\tau_b\left[R_{n,b,j} - \theta(P)\right] \le x\right),$$
then
$$U_{n,b}(x) = E\left[\bar U_n(x) \mid \tilde X_n\right] = E\left[1\left(\tau_b\left[R_{n,b,j} - \theta(P)\right] \le x\right) \mid \tilde X_n\right]$$
for $\tilde X_n = (X_{(1)}, \dots, X_{(n)})$. $U_{n,b}(x)$ is better than $\bar U_n(x)$ since $\tilde X_n$ is a sufficient statistic for i.i.d. data (a Rao--Blackwell argument).
Hypothesis Testing

Let $T_n = \tau_n t_n(X_1, \dots, X_n)$, with $G_n(x, P) = \mathrm{Prob}_P(T_n \le x) \to G(x, P)$ under $P \in P_0$, and
$$\hat G_{n,b}(x) = q^{-1}\sum_{i=1}^q 1\left(T_{n,b,i} \le x\right) = q^{-1}\sum_{i=1}^q 1\left(\tau_b\,t_{n,b,i} \le x\right).$$
As long as $b \to \infty$, $b/n \to 0$, then under $P \in P_0$: $\hat G_{n,b}(x) \to G(x, P)$. If under $P \in P_1$, $T_n \to \infty$, then for all $x$, $\hat G_{n,b}(x) \to 0$. Key difference with confidence intervals: we do not need $\tau_b/\tau_n \to 0$, because we do not need to estimate $\theta_0$; it is assumed known under the null hypothesis.
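A concrete sketch (hypothetical code, assuming numpy): testing $H_0: EX = 0$ with root $T_n = \sqrt{n}\,|\bar X_n|$. Note the subsample statistics are not recentered at the full-sample estimate, since $\theta_0 = 0$ is known under the null; random blocks again stand in for all $q$ subsets.

```python
import numpy as np

def subsampling_test(x, b, alpha=0.05, n_sub=2000, rng=None):
    """Subsampling test of H0: E[X] = 0 with root T_n = sqrt(n)*|mean(X)|.
    Subsample statistics are NOT recentered at the full-sample estimate:
    theta_0 = 0 is known under the null, so tau_b/tau_n -> 0 is not needed."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    T_n = np.sqrt(n) * abs(x.mean())
    T_b = np.array([np.sqrt(b) * abs(rng.choice(x, size=b, replace=False).mean())
                    for _ in range(n_sub)])
    crit = np.quantile(T_b, 1 - alpha)  # (1 - alpha) quantile of G_hat_{n,b}
    return T_n > crit

rng = np.random.default_rng(5)
reject_null = subsampling_test(rng.normal(0.0, 1.0, size=1000), b=50, rng=rng)  # rejects with prob ~ alpha
reject_alt = subsampling_test(rng.normal(0.5, 1.0, size=1000), b=50, rng=rng)   # rejects with prob -> 1
```
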
Estimating the Unknown Rate of Convergence

Assume that $\tau_n = n^\beta$ for some unknown $\beta > 0$. Estimate $\beta$ using subsampling distributions at different block sizes. Key idea: compare the shapes of the empirical distributions of $\hat\theta_b - \hat\theta_n$ for different values of $b$ to infer the value of $\beta$. Let $q = \binom{n}{b}$ for i.i.d. data, or $q = T - b + 1$ for time series data, and define
$$L_{n,b}(x \mid \tau_b) = q^{-1}\sum_{a=1}^q 1\left(\tau_b(\hat\theta_{n,b,a} - \hat\theta_n) \le x\right), \qquad L_{n,b}(x \mid 1) = q^{-1}\sum_{a=1}^q 1\left(\hat\theta_{n,b,a} - \hat\theta_n \le x\right).$$
This implies $L_{n,b}(x \mid \tau_b) = L_{n,b}(\tau_b^{-1} x \mid 1)$.
Setting $t = L_{n,b}(x \mid \tau_b)$ gives
$$x = L_{n,b}^{-1}(t \mid \tau_b) = \tau_b\left(\tau_b^{-1} x\right) = \tau_b\,L_{n,b}^{-1}(t \mid 1).$$
Since $L_{n,b}(x \mid \tau_b) \xrightarrow{p} J(x, P)$, if $J(x, P)$ is continuous and increasing, it can be inferred that
$$L_{n,b}^{-1}(t \mid \tau_b) = J^{-1}(t, P) + o_p(1),$$
which is the same as
$$\tau_b\,L_{n,b}^{-1}(t \mid 1) = J^{-1}(t, P) + o_p(1), \qquad\text{i.e.}\qquad b^\beta\,L_{n,b}^{-1}(t \mid 1) = J^{-1}(t, P) + o_p(1).$$
Assuming $J^{-1}(t, P) > 0$, i.e. $t > J(0, P)$, take logs.
For two different block sizes $b_1$ and $b_2$, this becomes
$$\beta \log b_1 + \log L_{n,b_1}^{-1}(t \mid 1) = \log J^{-1}(t, P) + o_p(1),$$
$$\beta \log b_2 + \log L_{n,b_2}^{-1}(t \mid 1) = \log J^{-1}(t, P) + o_p(1).$$
Differencing out the fixed effect,
$$\beta\left(\log b_1 - \log b_2\right) = \log L_{n,b_2}^{-1}(t \mid 1) - \log L_{n,b_1}^{-1}(t \mid 1) + o_p(1).$$
So estimate $\beta$ by
$$\hat\beta = \left(\log b_1 - \log b_2\right)^{-1}\left(\log L_{n,b_2}^{-1}(t \mid 1) - \log L_{n,b_1}^{-1}(t \mid 1)\right) = \beta + \left(\log b_1 - \log b_2\right)^{-1} o_p(1).$$
Taking $b_1 = n^{\gamma_1}$, $b_2 = n^{\gamma_2}$ with $1 > \gamma_1 > \gamma_2 > 0$,
$$\hat\beta - \beta = \left(\gamma_1 - \gamma_2\right)^{-1}\left(\log n\right)^{-1} o_p(1) = o_p\left((\log n)^{-1}\right).$$
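The rate estimator can be sketched for the sample mean, where the true $\beta = 1/2$ (an illustrative example, assuming numpy; random blocks stand in for all $q$ subsets, and the quantile level $t = 0.9$ and exponents $\gamma_1 = 0.6$, $\gamma_2 = 0.4$ are assumed choices).

```python
import numpy as np

def unscaled_quantile(x, theta_hat, b, t, rng, n_sub=1500):
    """L_{n,b}^{-1}(t | 1): the t-quantile of the UNSCALED subsampling
    distribution of theta_{n,b,a} - theta_n."""
    theta_n = theta_hat(x)
    diffs = np.array([theta_hat(rng.choice(x, size=b, replace=False)) - theta_n
                      for _ in range(n_sub)])
    return np.quantile(diffs, t)

# estimate beta in tau_n = n^beta for the sample mean (true beta = 1/2)
rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=n)
t = 0.9                                  # any t with J^{-1}(t, P) > 0
b1, b2 = int(n ** 0.6), int(n ** 0.4)    # b_i = n^{gamma_i}, 1 > gamma_1 > gamma_2 > 0
q1 = unscaled_quantile(x, np.mean, b1, t, rng)
q2 = unscaled_quantile(x, np.mean, b2, t, rng)
beta_hat = (np.log(q2) - np.log(q1)) / (np.log(b1) - np.log(b2))
```

Because the unscaled quantiles shrink like $b^{-\beta} J^{-1}(t, P)$, the slope across two block sizes recovers $\beta$.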
How do we know $t > J(0, P)$? Since $L_{n,b}(0 \mid \tau_b) = L_{n,b}(0 \mid 1) = J(0, P) + o_p(1)$, estimating $J(0, P)$ is not a problem. Alternatively, take $t_2 \in (0.5, 1)$ and $t_1 \in (0, 0.5)$:
$$b^\beta\left(L_{n,b}^{-1}(t_2 \mid 1) - L_{n,b}^{-1}(t_1 \mid 1)\right) = J^{-1}(t_2, P) - J^{-1}(t_1, P) + o_p(1),$$
$$\beta \log b + \log\left(L_{n,b}^{-1}(t_2 \mid 1) - L_{n,b}^{-1}(t_1 \mid 1)\right) = \log\left(J^{-1}(t_2, P) - J^{-1}(t_1, P)\right) + o_p(1),$$
$$\hat\beta = \left(\log b_1 - \log b_2\right)^{-1}\left[\log\left(L_{n,b_2}^{-1}(t_2 \mid 1) - L_{n,b_2}^{-1}(t_1 \mid 1)\right) - \log\left(L_{n,b_1}^{-1}(t_2 \mid 1) - L_{n,b_1}^{-1}(t_1 \mid 1)\right)\right].$$
Taking $b_1 = n^{\gamma_1}$, $b_2 = n^{\gamma_2}$ ($1 > \gamma_1 > \gamma_2 > 0$), $\hat\beta - \beta = o_p\left((\log n)^{-1}\right)$.
Two-Step Subsampling

With $\hat\tau_n = n^{\hat\beta}$, form
$$L_{n,b}(x \mid \hat\tau_b) = q^{-1}\sum_{a=1}^q 1\left(\hat\tau_b(\hat\theta_{n,b,a} - \hat\theta_n) \le x\right).$$
It can be shown that $\sup_x \left|L_{n,b}(x \mid \hat\tau_b) - J(x, P)\right| \xrightarrow{p} 0$. Problem: imprecise in small samples. In variance estimation, the best choice of $b$ gives an $O(n^{-1/3})$ error rate; parameter estimates, if the model is true, give an $O(n^{-1/2})$ error rate; bootstrapping a pivotal statistic, when applicable, gives an error rate even better than $O(n^{-1/2})$.