OBSERVATIONS ON BAGGING


Andreas Buja and Werner Stuetzle

University of Pennsylvania and University of Washington

Abstract: Bagging is a device intended for reducing the prediction error of learning algorithms. In its simplest form, bagging draws bootstrap samples from the training sample, applies the learning algorithm to each bootstrap sample, and then averages the resulting prediction rules. More generally, the resample size M may be different from the original sample size N, and resampling can be done with or without replacement. We investigate bagging in a simplified situation: the prediction rule produced by a learning algorithm is replaced by a simple real-valued U-statistic of i.i.d. data. U-statistics of high order can describe complex dependencies, and yet they admit a rigorous asymptotic analysis. We show that bagging U-statistics often but not always decreases variance, whereas it always increases bias. The most striking finding, however, is an equivalence between bagging based on resampling with and without replacement: the respective resample sizes M_with = α_with N and M_w/o = α_w/o N produce very similar bagged statistics if α_with = α_w/o/(1 − α_w/o). While our derivation is limited to U-statistics, the equivalence seems to be universal. We illustrate this point in simulations where bagging is applied to CART trees.

Key words and phrases: Bagging, U-statistics, CART.

1. Introduction

Bagging, short for bootstrap aggregation, was introduced by Breiman (1996) as a device for reducing the prediction error of learning algorithms. Bagging is performed by drawing bootstrap samples from the training sample, applying the learning algorithm to each bootstrap sample, and averaging/aggregating the resulting prediction rules, that is, averaging or otherwise aggregating the predicted values for test observations. Breiman presents empirical evidence that bagging can indeed reduce prediction error. It appears to be most effective for CART trees (Breiman et al. (1984)).

Breiman's heuristic explanation is that CART trees are highly unstable functions of the data (a small change in the training sample can result in a very different tree) and that averaging over bootstrap samples reduces the variance component of the prediction error.

The goal of the present article is to contribute to the theoretical understanding of bagging. We investigate bagging in a simplified situation: the prediction rule produced by a learning algorithm is replaced by a simple real-valued statistic of i.i.d. data. While this simplification does not capture some characteristics of function fitting, it still enables us, for example, to analyze the conditions under which variance reduction occurs. The claim that bagging always reduces variance is in fact not true.

We start by describing bagging in operational terms. Bagging a statistic θ(X_1, ..., X_N) is defined as averaging it over bootstrap samples X*_1, ..., X*_N:

θ_B(X_1, ..., X_N) = ave_{X*_1,...,X*_N} θ(X*_1, ..., X*_N),

where the observations X*_i are i.i.d. draws from {X_1, ..., X_N}. The bagged statistic can also be written as

θ_B(X_1, ..., X_N) = N^{-N} Σ_{i_1,...,i_N} θ(X_{i_1}, ..., X_{i_N}),

because there are N^N bootstrap samples, each having probability 1/N^N. For realistic sample sizes N, the N^N bootstrap samples cannot be enumerated in actual computations, hence one resorts to sampling a smaller number, often as few as 50.

Our analysis covers several variations on bagging. Instead of averaging the values of a statistic over bootstrap samples of the same size N as the original sample, we may choose the resample size M to be smaller, or even larger, than N. Another alternative covered by our analysis is resampling without replacement.

The specific class of statistics we consider is that of finite sums of U-statistics. While they do not capture the statistical properties of CART trees, U-statistics can model complex interactions, and yet they allow for a rigorous second order analysis. (For an approach tailored to tree-based methods, see Bühlmann and Yu (2002).)
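In code, the operational definition reads as follows. This is a minimal sketch under our own naming (it is not the authors' implementation): a statistic is bagged by Monte Carlo over resamples, with the resample size M and the resampling mode exposed as parameters.

```python
import numpy as np

def bag_statistic(theta, x, n_resamples=50, m=None, replace=True, seed=None):
    """Monte Carlo approximation of the bagged version of a statistic:
    draw resamples of size m from x, with or without replacement,
    evaluate theta on each resample, and average the results."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    m = len(x) if m is None else m
    values = [theta(rng.choice(x, size=m, replace=replace))
              for _ in range(n_resamples)]
    return float(np.mean(values))

# Toy usage: bag the empirical variance (a second-order U-statistic).
x = np.random.default_rng(0).normal(size=100)
print(x.var(), bag_statistic(np.var, x, n_resamples=200))
```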

The most striking effect we observe, both theoretically and in simulations, is a correspondence between bagging based on resampling with and without replacement. The two modes of resampling produce very similar bagged statistics if the resampling fractions α_w/o = M_w/o/N for sampling without replacement and α_with = M_with/N for sampling with replacement satisfy the relation

α_with = α_w/o/(1 − α_w/o),   or equivalently   1/α_with = 1/α_w/o − 1.

This equivalence holds to order N^{-1} under regularity assumptions. The equivalence is implicit in one form or another in previous work: Friedman and Hall (2000, Sec. 2.6) notice it for a type of polynomial expansions, but they do not make use of it other than noting that half-sampling without replacement (α_w/o = 1/2) and standard bootstrap sampling (α_with = 1) yield very similar bagged statistics. Knight and Bassett (2002, Sec. 4) note the equivalence for half-sampling and bootstrap in the case of quantile estimators. In the present article we show the equivalence for U-statistics of fixed but arbitrary order. We also illustrate it in simulations for bagged trees, where it holds with surprising accuracy, hinting at a much greater range of validity.

Other observations about the effects of bagging concern the variance, squared bias, and mean squared error (MSE) of bagged U-statistics. Similar to Chen and Hall (2003) and Knight and Bassett (2002), we obtain effects that are only of order O(N^{-2}). We also find that, with decreasing resample size, squared bias always increases and variance often but not always decreases. More precisely, the difference between bagged and unbagged in squared bias is an increasing quadratic function of

g := 1/α_with = 1/α_w/o − 1,

and for the variance it is an often but not always decreasing linear function of g. Therefore, the only possible beneficial effect of bagging stems from variance reduction. In those situations where variance is reduced, the combined effect of bagging is to reduce the MSE in an interval of g near zero; equivalently, the MSE is reduced for α_with near infinity and correspondingly for α_w/o near 1. For the standard bootstrap (α_with = 1) and half-sampling (α_w/o = 1/2), improvements in MSE are obtained only if these resample sizes fall in the respective critical intervals. However, there can arise odd situations in which the MSE is improved only for α_with > 1 and α_w/o > 1/2.

We finish this article with some illustrative simulations of bagged CART trees. A purpose of these illustrations is to gain some understanding of the peculiarities

of trees in light of the fact that bagging often shows dramatic improvements that apparently go beyond the effects described by O(N^{-2}) asymptotics.

An important point to keep in mind is that there are two notions of bias: If we regard θ(F_N) as a plug-in estimate of θ(F), then the plug-in bias is E θ(F_N) − θ(F). If we regard θ(F_N) as an estimate of some parameter µ, then the estimation bias is E θ(F_N) − µ. The second notion of bias, estimation bias, is the one commonly used in function estimation. Our theory of bagged U-statistics, however, is concerned with plug-in bias, not with estimation bias. The same applies to Chen and Hall's (2003) theory of bagging estimating equations, as well as Knight and Bassett's (2002) theory of bagged quantiles. This point even applies to Bühlmann and Yu's (2002) treatment of bagged stumps and trees because their notion of bias refers not to the true underlying function but to the optimal asymptotic target, that is, the asymptotically best fitting stump or tree. Their theory therefore explains bagging's effect on the variance of stumps and trees (better than any of the other theories, including ours), but it, too, has nothing to say about bias in the usual sense of function fitting.

An interesting observation we make in the simulations is that for smooth underlying f(x) bagging not only decreases variance, but it can reduce estimation bias as well. This should not be too surprising because according to Bühlmann and Yu's theory the effect of bagging is essentially to replace fitting a stump with fitting a stump convolved with a narrow-bandwidth kernel. The convolved stump is smooth and has a chance to reduce estimation bias when the underlying f(x) is smooth.

Note: Unless otherwise noted, proofs of propositions can be found in the appendix.

2. Resampling U-Statistics

Let X, X_1, ..., X_N be i.i.d. random variables. We consider statistics of X_1, ..., X_N that are finite sums

U = (1/N) Σ_i A_{X_i} + (1/N^2) Σ_{i,j} B_{X_i,X_j} + (1/N^3) Σ_{i,j,k} C_{X_i,X_j,X_k} + ...,

where the kernels B, C, ... are permutation symmetric in their arguments. [We put the arguments in subscripts in order to avoid the clutter caused by frequent parentheses.] The normalizations of the sums are such that under common assumptions limits for N → ∞ exist.

Strictly speaking, only the off-diagonal part Σ_{i≠j} B_{X_i,X_j} (for example) is a proper U-statistic. Because we include the diagonal i = j in the double sum, this is strictly speaking a V-statistic or von Mises statistic (Serfling (1980), Sec. 5.1.2), but we use the better known term U-statistic anyway. The reason for including the diagonal is that only in this way can U be interpreted as the plug-in estimate U(F_N) of a statistical functional

U(F) = E A_X + E B_{X,Y} + E C_{X,Y,Z} + ...,

where X, Y, Z, ... are i.i.d. Knowing what the statistic U estimates is a necessity for bias calculations. A second reason for including the diagonal is that bagging has the effect of creating terms such as B_{X_i,X_i}, so we may as well include such terms from the outset.

It is possible to explicitly calculate the bagged version U_bag of a sum of U-statistics U. We can allow bagging based on resampling with and without replacement as well as arbitrary resample sizes M. Let W = (W_1, ..., W_N) ≥ 0 be integer-valued random variables counting the multiplicities of X_1, ..., X_N in a resample. For resampling with replacement, that is, bootstrap, the distribution of W is Multinomial(1/N, ..., 1/N; M). Conventional bootstrap corresponds to M = N, but we allow M to range between 1 and ∞. Although M > N is computationally undesirable, infinity is the conceptually plausible upper bound on M: for M = ∞ no averaging takes place because with an infinite resample one has F_M = F_N. For resampling without replacement, that is, subsampling, the distribution of W is Hypergeometric(M, N). Half-sampling, for example, is M = N/2, but here the resample size M can only range between 1 and N. For the upper bound M = N no averaging takes place because the resample is just a permutation of the data, hence F_M = F_N.

With these facts we can write down the resampled and the bagged version of a U explicitly. We illustrate this for a statistic U with kernels A_{X_i} and B_{X_i,X_j}. For

a resample of size M with multiplicities W_1, ..., W_N, the value of U is

U_resample = (1/M) Σ_i W_i A_{X_i} + (1/M^2) Σ_{i,j} W_i W_j B_{X_i,X_j}.

The bagged version of U under either mode of resampling is the expected value with respect to W:

U_bag = E_W [ (1/M) Σ_i W_i A_{X_i} + (1/M^2) Σ_{i,j} W_i W_j B_{X_i,X_j} ]
      = (1/M) Σ_i E[W_i] A_{X_i} + (1/M^2) Σ_{i,j} E[W_i W_j] B_{X_i,X_j}.

From the form of U_bag it is apparent that the only relevant quantities are moments of W:

E[W_i]     = M/N                        (with and w/o),
E[W_i^2]   = M/N + M(M−1)/N^2   (with),   M/N   (w/o),
E[W_i W_j] = M(M−1)/N^2         (with),   M(M−1)/(N(N−1))   (w/o),   (i ≠ j).

The bagged functional can now be written down explicitly. It is necessary to distinguish between the two resampling modes: we denote U_bag by U_with and U_w/o for resampling with and without replacement, respectively.

U_with = (1/N) Σ_i ( A_{X_i} + (1/M_with) B_{X_i,X_i} ) + (1/N^2) Σ_{i,j} (1 − 1/M_with) B_{X_i,X_j},

U_w/o = (1/N) Σ_i ( A_{X_i} + (N − M_w/o)/(M_w/o (N−1)) B_{X_i,X_i} ) + (1/N^2) Σ_{i,j} N(M_w/o − 1)/(M_w/o (N−1)) B_{X_i,X_j}.

Analogous calculations can be carried out for statistics with U-terms of orders higher than two. We summarize:

Proposition 1: A bagged sum of U-statistics is also a sum of U-statistics. For a statistic with kernels A_x and B_{x,y} only, the bagged terms A^with_x, B^with_{x,y} and A^w/o_x, B^w/o_{x,y}, respectively, depend on A_x and B_{x,y} as follows:

A^with_x = A_x + (1/M_with) B_{x,x},      B^with_{x,y} = (1 − 1/M_with) B_{x,y},

A^w/o_x = A_x + (N − M_w/o)/(M_w/o (N−1)) B_{x,x},      B^w/o_{x,y} = N(M_w/o − 1)/(M_w/o (N−1)) B_{x,y}.

For U-statistics with terms of first and second order, the proposition is a direct result of the preceding calculations. For general U-statistics of arbitrary order, the proposition is a consequence of the proofs in the Appendix.

We see from the proposition that the effect of bagging is to remove mass from the proper U-part of B (i ≠ j) and shift it to the diagonal (i = j), thus increasing the importance of the linear part. Similar effects take place in higher orders, where variability is shifted to lower orders.

3. Equivalence of Resampling With and Without Replacement

Proposition 1 yields a heuristic for an important fact: bagging based on resampling with replacement yields results very similar to bagging based on resampling without replacement if the resample sizes M_with and M_w/o are suitably matched up. The required correspondence can be derived by equating A^with = A^w/o and/or B^with = B^w/o in Proposition 1; both equations yield the identical condition:

Corollary: Bagging a sum of U-statistics of first and second order yields identical results under the two resampling modes if

(N−1)/M_with = N/M_w/o − 1.

For a general finite sum of U-statistics of arbitrary order, we do not obtain an identity but an approximate equivalence:

Proposition 2: Bagging a finite sum of U-statistics of arbitrary order under either resampling mode yields the same results up to order O(N^{-2}) if

N/M_with = N/M_w/o − 1,

assuming the kernels are bounded. If the kernels are not bounded but have moments of order q, the approximation is to order O(N^{-2/p}), where 1/p + 1/q = 1.
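Proposition 1 and the corollary are easy to check numerically. The following sketch is our own code and kernel choice, not part of the original paper: it evaluates a first-plus-second-order U-statistic and its exactly bagged versions under both resampling modes via the closed forms above, with resample sizes matched as in Proposition 2.

```python
import numpy as np

def u_stat(x, A, B):
    # (1/N) sum_i A(x_i) + (1/N^2) sum_{i,j} B(x_i, x_j), diagonal included
    return A(x).mean() + B(x[:, None], x[None, :]).mean()

def bagged_with(x, A, B, m):
    # closed form of Proposition 1, bootstrap resamples of size m
    return u_stat(x, lambda t: A(t) + B(t, t) / m,
                     lambda s, t: (1 - 1 / m) * B(s, t))

def bagged_wo(x, A, B, m):
    # closed form of Proposition 1, subsamples of size m, no replacement
    n = len(x)
    return u_stat(x, lambda t: A(t) + (n - m) / (m * (n - 1)) * B(t, t),
                     lambda s, t: n * (m - 1) / (m * (n - 1)) * B(s, t))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
A, B = (lambda t: t**2), (lambda s, t: -s * t)  # empirical-variance kernels

n, m_wo = len(x), len(x) // 2                   # half-sampling ...
m_with = round(n * m_wo / (n - m_wo))           # ... matched to the bootstrap
print(u_stat(x, A, B), bagged_with(x, A, B, m_with), bagged_wo(x, A, B, m_wo))
```

With m_with chosen from N/M_with = N/M_w/o − 1 the two bagged values differ only at order N^{-2}; using the corollary's exact condition (N−1)/M_with = N/M_w/o − 1 (m_with = n − 1 here), they coincide up to floating point error.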

We will similarly see that variance, squared bias and hence MSE of bagged U-statistics all agree in the N^{-2} term in the two resampling modes under corresponding resample sizes.

The correspondence between the two resampling modes is more intuitive if one expresses the resample sizes M_with and M_w/o as fractions/multiples of the sample size N:

α_with = M_with/N (> 0, < ∞)   and   α_w/o = M_w/o/N (> 0, < 1).

The condition of Proposition 2 above is equivalent to

α_with = α_w/o/(1 − α_w/o).

It equates, for example, half-sampling without replacement, α_w/o = 1/2, with conventional bootstrap, α_with = 1. Subsampling without replacement with α_w/o > 1/2 corresponds to bootstrap with α_with > 1, that is, bootstrap resamples larger than the original sample. The correspondence also maps α_w/o = 1 to α_with = ∞, both of which mean that the bagged and the unbagged statistic are identical.

4. The Effect of Bagging on Variance, Bias and MSE

4.1 Variance

Variances of U-statistics can be calculated explicitly. For example, for a statistic that has only terms A_X and B_{X,Y}, the variance is

Var(U) = (1/N) Var(A_X + 2 B̄_X)
       + (1/N^2) ( 2 Cov(A_X, B_{X,X}) + 4 Cov(B_{X,X}, B̄_X) − 4 Cov(A_X, B̄_X) + 2 Var(B_{X,Y}) − 12 Var(B̄_X) )
       + (1/N^3) ( Var(B_{X,X}) − 2 Var(B_{X,Y}) + 8 Var(B̄_X) − 4 Cov(B_{X,X}, B̄_X) ),

where B̄_x := E B_{x,Y} denotes the kernel with the second argument averaged out. We are, however, primarily interested not in variances but in differences between variances of bagged and unbagged statistics:

Proposition 3: Let g = N/M for sampling with replacement and g = N/M − 1 for sampling without replacement. Assume g is fixed and 0 < g < ∞ as N → ∞. Let U be a finite sum of U-statistics of arbitrary order; then:

Var(U_bag) − Var(U) = (2/N^2) S_Var g + O(N^{-3}),

for both sampling with and without replacement. If U has only terms A_X and B_{X,Y}, then:

S_Var = Cov(A_X + 2 B̄_X, B_{X,X} − 2 B̄_X).

The effect of bagging on variance is of order O(N^{-2}). The proof is in the appendix; we also show how to calculate S_Var for statistics with U-terms of any order.

The assumption about g is essential. If it is not satisfied, the order of the asymptotics will be affected. The jackknife is a case in point: it is obtained for M = N − 1 and resampling without replacement. This implies g → 0, which violates the assumption of the proposition. It would be easy to cover this type of asymptotics because the calculations can be performed exactly.

There exist situations in which bagging increases the variance, namely, when S_Var > 0. If S_Var < 0, variance is reduced, and the beneficial effect becomes the more pronounced the smaller the resample size. Therefore, the fact that bagging may reduce variance cannot be the whole story: if variance were the only criterion of interest, one should choose the resample size M as low as operationally feasible for maximal variance reduction. Obviously, one has to take into account bias as well.

4.2 Bias

We show that bagging U-statistics always increases squared bias (except for linear statistics, where the bias vanishes). Recall that the statistic U = U(F_N) is the plug-in estimator for the functional U(F), so the bias is E U(F_N) − U(F).

Proposition 4: Under the same assumptions as in Proposition 3, we have:

Bias^2(U_bag) − Bias^2(U) = (1/N^2) (g^2 + 2g) S_Bias + O(N^{-3}),

for both sampling with and without replacement. If U has only terms A_X and B_{X,Y}, then

S_Bias = (E B_{X,X} − E B_{X,Y})^2.

The appendix has proofs and a general formula for S_Bias for statistics with U-terms of any order.
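For kernels given in closed form, the constants of Propositions 3 and 4 can be estimated by straightforward Monte Carlo. The sketch below is ours (names and sampling sizes are arbitrary choices, not from the paper); it approximates S_Var and S_Bias by sample moments, and the final comparison anticipates the MSE criterion S_Var + S_Bias < 0 derived in Section 4.4.

```python
import numpy as np

def s_var_s_bias(A, B, sampler, n=100_000, n_panel=1_000, seed=0):
    """Monte Carlo estimates of S_Var = Cov(A_X + 2*Bbar_X, B_{X,X} - 2*Bbar_X)
    and S_Bias = (E B_{X,X} - E B_{X,Y})^2, where Bbar_x = E_Y B(x, Y)."""
    rng = np.random.default_rng(seed)
    x, y = sampler(rng, n), sampler(rng, n)
    bbar = np.zeros(n)
    for yv in sampler(rng, n_panel):        # average over a panel of Y draws
        bbar += B(x, yv)
    bbar /= n_panel
    s_var = np.cov(A(x) + 2 * bbar, B(x, x) - 2 * bbar)[0, 1]
    s_bias = (B(x, x).mean() - B(x, y).mean()) ** 2
    return s_var, s_bias

# Empirical-variance functional A_x = x^2, B_{x,y} = -x*y, X standard normal:
sv, sb = s_var_s_bias(lambda t: t**2, lambda s, t: -s * t,
                      lambda rng, k: rng.normal(size=k))
print(sv, sb, "MSE improvable:", sv + sb < 0)   # approx. -2, 1, True
```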

[Figure: curves showing the dependence of Bias^2, Var, and MSE on g.]

Figure 4.1: Dependence of variance, squared bias and MSE on g. The graph shows the situation for S_Var/S_Bias = −4. Bagging is beneficial for g < 6, that is, for resample sizes M_with > N/6 and M_w/o > N/7. Optimal is g = 3, that is, M_with = N/3 and M_w/o = N/4.

Just as in the comparison of variances, sampling with and without replacement agree in the N^{-2} term, modulo the differing interpretation of g in the two resampling modes.

4.3 Mean Squared Error

The mean squared error of U = U(F_N) is

MSE(U) = E( [U(F_N) − U(F)]^2 ) = Var(U) + Bias^2(U).

The difference between MSEs of bagged and unbagged functionals is as follows:

Proposition 5: Under the same assumptions as in Propositions 3 and 4, we have:

MSE(U_bag) − MSE(U) = (1/N^2) ( S_Bias g^2 + 2 (S_Var + S_Bias) g ) + O(N^{-3}),

for both sampling with and without replacement.

4.4 Choice of Resample Size

In some situations one may obtain a reduction in MSE for some resample sizes M but not for others, while in other situations bagging may never lead to an improvement. The critical factor is the dependence of the MSE difference on g:

S_Bias g^2 + 2 (S_Var + S_Bias) g.

One immediately reads off the following condition for MSE improvement:

Corollary 5: There exist resample sizes for which bagging improves the MSE to order N^{-2} iff S_Var + S_Bias < 0. Under this condition the range of beneficial resample sizes is characterized by

g < −2 ( S_Var/S_Bias + 1 ).

The resample size with optimal MSE improvement is

g_opt = −( S_Var/S_Bias + 1 ).

Conventional bootstrap, M_with = N, and half-sampling, M_w/o = N/2 (both characterized by g = 1), are beneficial iff

S_Var/S_Bias < −3/2,

and they are optimal iff

S_Var/S_Bias = −2.

Recall from Proposition 3 that the resample sizes M_with and M_w/o are expressed in terms of g through g = N/M_with and g = N/M_w/o − 1. The corollary therefore prescribes a minimum resample size to achieve MSE reduction. See Figure 4.1 for an illustration.

The intuition that the benefits of bagging arise from variance reduction is thus correct, although it must be qualified: bagging is not always beneficial, but if it is, the reduction in MSE is due to reduction in variance. This follows from the fact that S_Bias is always positive, hence bagging always increases bias, but if the variance dips sufficiently strongly, an overall benefit results.
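In code, Corollary 5 translates into a small calculator; this sketch (function and variable names are ours) maps given constants S_Var and S_Bias to the beneficial range of g and the optimal resample sizes.

```python
def bagging_benefit(s_var, s_bias, n):
    """Second-order MSE analysis of bagging per Corollary 5.
    g = N/M_with = N/M_w/o - 1; returns None if no resample size helps."""
    if s_var + s_bias >= 0:
        return None
    g_max = -2 * (s_var / s_bias + 1)    # beneficial for 0 < g < g_max
    g_opt = -(s_var / s_bias + 1)        # minimizes S_Bias*g^2 + 2*(S_Var + S_Bias)*g
    return {"g_max": g_max, "g_opt": g_opt,
            "M_with_opt": n / g_opt, "M_wo_opt": n / (g_opt + 1)}

# The situation of Figure 4.1, S_Var/S_Bias = -4 (the sample size is hypothetical):
print(bagging_benefit(-4.0, 1.0, n=1200))
# g_max = 6 (M_with > N/6, M_w/o > N/7); g_opt = 3 (M_with = N/3, M_w/o = N/4)
```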

Recall that the above statements should be limited to values of g bounded away from zero and infinity. Near either boundary a different type of asymptotics sets in.

4.5 An Example: Quadratic Functionals

Consider as a concrete example of U-statistics the case of quadratic functions: A_X = a X^2 and B_{X,Y} = b XY, that is,

U = a (1/N) Σ_i X_i^2 + b ( (1/N) Σ_i X_i )^2.

In order to determine the terms S_Var and S_Bias, we need the first four moments of X: Let µ = E X, σ^2 = E[(X−µ)^2], γ = E[(X−µ)^3]/σ^3 and κ = E[(X−µ)^4]/σ^4 be expectation, variance, skewness and kurtosis, respectively. Then:

S_Var = ( 2µγσ^3 + (κ−1)σ^4 ) ab + 2µγσ^3 b^2   and   S_Bias = b^2 σ^4.

It is convenient to write the criterion for the existence of resample sizes with beneficial effect on the MSE as S_Var/S_Bias + 1 < 0:

( 2(µ/σ)γ + (κ−1) ) (a/b) + ( 2(µ/σ)γ + 1 ) < 0.

If µ = 0 or γ = 0, this simplifies to

(κ−1) (a/b) + 1 < 0.

Since κ > 1 for all distributions except a balanced 2-point mass, the condition becomes a/b < −1/(κ−1). For a = 1, b = −1, that is, the empirical variance U = mean(x^2) − mean(x)^2, beneficial effects of bagging exist iff κ > 2. For a = 0, that is, the squared mean U = mean(x)^2, no beneficial effects exist.
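The closed forms of this example are easily coded; the following sketch (ours) returns S_Var and S_Bias from the four moments and reproduces the κ > 2 threshold for the empirical variance.

```python
def s_var_s_bias_quadratic(a, b, mu, sigma, gamma, kappa):
    """S_Var and S_Bias for U = a*mean(x^2) + b*mean(x)^2 (Section 4.5):
    A_x = a*x^2, B_{x,y} = b*x*y."""
    s_var = (2*mu*gamma*sigma**3 + (kappa - 1)*sigma**4) * a * b \
            + 2*mu*gamma*sigma**3 * b**2
    s_bias = b**2 * sigma**4
    return s_var, s_bias

# Empirical variance (a = 1, b = -1): beneficial iff kappa > 2.
for kappa in (1.8, 3.0):    # kurtosis of the uniform vs. the normal
    sv, sb = s_var_s_bias_quadratic(1, -1, mu=0, sigma=1, gamma=0, kappa=kappa)
    print(kappa, "beneficial:", sv + sb < 0)   # False for 1.8, True for 3.0
```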

5. Simulation Experiment

The principal purpose of the experiments presented here is to demonstrate the correspondence between resampling with and without replacement in the non-trivial setting of bagging CART trees.

Scenarios. We consider four scenarios, differing in the size N of the training sample, the dimension p of the predictor space, the noise variance σ^2, the number K of leaves of the CART tree, and the true regression function f. The scenarios are adapted from Friedman and Hall (2000).

Scenario   N     p    X           σ^2   K    f(x)
1          800   1    U[0,1]      1     2    I(x > 0.5)
2          800   1    U[0,1]      1     2    x
3                10   U[0,1]^10         50   Σ_i I(x_i > 0.3)
4                10   U[0,1]^10         50   Σ_i i x_i

We grew all trees in Scenarios 3 and 4 in a stagewise forward manner without pruning; at each stage we split the node that resulted in the largest reduction of the residual sum of squares, until the desired number of leaves was reached.

Performance Measures. Let T^w/o_α(·; L) be the bagged tree obtained by averaging CART trees grown on resamples of size α N drawn without replacement from a training sample L = (x_1, y_1), ..., (x_N, y_N), and let T^with_α(·; L) be the bagged tree obtained by averaging over resamples of size α N/(1−α) drawn with replacement. The mean squared error (MSE) of T^w/o_α is

MSE(T^w/o_α) = E_X [ E_L ( (T^w/o_α(X; L) − f(X))^2 ) ]
             = E_X [ E_L ( (T^w/o_α(X; L) − E_L T^w/o_α(X; L))^2 ) ] + E_X [ ( E_L T^w/o_α(X; L) − f(X) )^2 ]
             = Var(T^w/o_α) + Bias^2_est(T^w/o_α).

The MSE of T^with_α is defined analogously. Recall that the definition of bias used here is estimation bias, the expected difference between the estimated regression function for a finite sample size and the true regression function. As pointed out in the introduction, this is different from plug-in bias, the expected difference between the value of a statistic for a finite sample size and its value for infinite sample size, which was analyzed in the earlier sections of the article. CART trees with a fixed number of leaves and their bagging averages are not in general consistent estimates of the true regression function, and in cases where they are not, as in Scenarios (2) and (4) above, the two notions of bias differ.
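The data-generating side of these scenarios can be sketched as follows. The code is ours; in particular, taking the sums in Scenarios 3 and 4 over all p = 10 coordinates is our assumption, since the sum ranges are not recoverable from the source.

```python
import numpy as np

def make_training_sample(f, n, p, sigma=1.0, seed=None):
    """Draw X ~ U[0,1]^p and y = f(X) + sigma * Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    y = f(X) + sigma * rng.normal(size=n)
    return X, y

f1 = lambda X: (X[:, 0] > 0.5).astype(float)         # Scenario 1: step
f2 = lambda X: X[:, 0]                               # Scenario 2: smooth ramp
f3 = lambda X: (X > 0.3).sum(axis=1).astype(float)   # Scenario 3 (our reading)
f4 = lambda X: (np.arange(1, X.shape[1] + 1) * X).sum(axis=1)  # Scenario 4 (our reading)

X, y = make_training_sample(f1, n=800, p=1, seed=0)
```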

Operational details of the experiment. We estimated plug-in bias, estimation bias, variance, and MSE for α = 0.1, 0.2, ..., 0.9, 0.95, 0.99, 1; α = 1 corresponds to unbagged CART. Estimates were obtained by averaging over 100 training samples and 10,000 test observations. We approximated the bagged trees T^w/o_α(·; L) and T^with_α(·; L) by averaging over 50 resamples. A finite number of resamples adds a significant variance component to the Monte Carlo estimates of Var(T^w/o_α) and Var(T^with_α). We adjusted the estimates by removing this component. The influence on bias is of smaller order. To calculate the plug-in bias we need to know the CART tree for infinite training sample size. In Scenarios 1 and 3 this is not a problem because the trees are consistent estimates of the true regression functions. In Scenarios 2 and 4 we approximated the tree for infinite training sample size by a tree grown on a training sample of size n = 100,000.

Simulation results. Figure 5.2 summarizes the simulation results for Scenario 1. The top panels show variance, squared plug-in bias, and squared estimation bias as functions of the resampling fraction α, for resampling with and without replacement. The bottom panel shows the MSE for both resampling modes, and variance and squared estimation bias for sampling with replacement only. To make the tick mark labels more readable, vertical scales in all the panels are relative to the MSE of the unbagged tree.

We note that variance decreases monotonically with decreasing resampling fraction, which confirms the intuition that smaller resample size means more averaging. Estimation bias and plug-in bias agree because a tree with two leaves is a consistent estimate of the true regression function, which in this scenario is a step function. Squared plug-in bias increases with decreasing resampling fraction, as predicted by the theory presented in Sections 4.1 through 4.3.

Figure 5.3 shows the corresponding results for Scenario 2. Again, variance is decreasing with decreasing resampling fraction, and squared plug-in bias is increasing, as predicted by the theory. Squared estimation bias, however, is decreasing with decreasing resampling fraction. Bagging therefore conveys a double benefit, decreasing both variance and squared (estimation) bias. The explanation is simple: a bagged CART tree is smoother than the corresponding unbagged tree, because bagging smoothes out the discontinuities of a piecewise constant model. If the true regression function is smooth, smoothing the estimate can be expected to be beneficial. Admittedly, the scenario considered here is highly unrealistic, but the beneficial effect can also be expected in more realistic situations, like Scenario 4 discussed below.

Scenario 3 is analogous to Scenario 1, with 10-dimensional instead of one-dimensional predictor space. The true regression function is piecewise constant and can be consistently estimated by a CART tree with 50 leaves. The results, shown in Figure 5.4, parallel those for Scenario 1.

The results for Scenario 4, shown in Figure 5.5, closely parallel those for Scenario 2. Again, both variance and squared (estimation) bias decrease with decreasing resampling fraction.

The experiments confirm the agreement between bagging with and without replacement predicted by the theory developed in Section 3: bagging without replacement with resample size N α gives almost the same results in terms of bias, variance, and MSE as bagging with replacement with resample size N α/(1−α). The experiments also confirm that bagging does increase squared plug-in bias. However, the relevant quantity in a regression context is estimation bias: if the true regression function is smooth, bagging can in fact reduce estimation bias as well as variance and therefore yield a double benefit.
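The resampling-mode comparison in these experiments can be sketched as follows, here using scikit-learn's best-first tree growing as a stand-in for the stagewise forward CART fitting described above (the code, the names, and this substitution are ours, not the authors' implementation).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(X_train, y_train, X_test, alpha, replace,
                        n_leaves=2, n_resamples=50, seed=None):
    """Average trees over resamples of size alpha*N (without replacement)
    or alpha/(1-alpha)*N (with replacement), the matched sizes of Section 3."""
    assert 0 < alpha < 1
    rng = np.random.default_rng(seed)
    n = len(X_train)
    m = int(round(alpha / (1 - alpha) * n)) if replace else int(round(alpha * n))
    preds = np.zeros(len(X_test))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=m, replace=replace)
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
        preds += tree.fit(X_train[idx], y_train[idx]).predict(X_test)
    return preds / n_resamples
```

For matched α, the two calls (replace=True versus replace=False) should produce nearly identical bias, variance, and MSE curves, which is the equivalence the figures below display.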

[Figure: four panels plotted against α. Top: "Scenario 1, Variance", "Scenario 1, Squared Plug-in Bias", "Scenario 1, Squared Estimation Bias"; bottom: "Scenario 1, Mean Squared Error"; legends distinguish resampling with and without replacement.]

Figure 5.2: Simulation results for Scenario 1. Top panels: variance, squared plug-in bias, and squared estimation bias for resampling with and without replacement. Bottom panel: MSE for both resampling modes, and variance and squared estimation bias for resampling with replacement.

[Figure: four panels plotted against α. Top: "Scenario 2, Variance", "Scenario 2, Squared Plug-in Bias", "Scenario 2, Squared Estimation Bias"; bottom: "Scenario 2, Mean Squared Error"; legends distinguish resampling with and without replacement.]

Figure 5.3: Simulation results for Scenario 2. Top panels: variance, squared plug-in bias, and squared estimation bias for resampling with and without replacement. Bottom panel: MSE for both resampling modes, and variance and squared estimation bias for resampling with replacement.

[Figure: four panels plotted against α. Top: "Scenario 3, Variance", "Scenario 3, Squared Plug-in Bias", "Scenario 3, Squared Estimation Bias"; bottom: "Scenario 3, Mean Squared Error"; legends distinguish resampling with and without replacement.]

Figure 5.4: Simulation results for Scenario 3. Top panels: variance, squared plug-in bias, and squared estimation bias for resampling with and without replacement. Bottom panel: MSE for both resampling modes, and variance and squared estimation bias for resampling with replacement.

[Figure: four panels plotted against α. Top: "Scenario 4, Variance", "Scenario 4, Squared Plug-in Bias", "Scenario 4, Squared Estimation Bias"; bottom: "Scenario 4, Mean Squared Error"; legends distinguish resampling with and without replacement.]

Figure 5.5: Simulation results for Scenario 4. Top panels: variance, squared plug-in bias, and squared estimation bias for resampling with and without replacement. Bottom panel: MSE for both resampling modes, and variance and squared estimation bias for resampling with replacement.

6. Summary and Conclusions

We studied the effects of bagging for U-statistics of any order and finite sums thereof. U-statistics of high order can describe complex data dependencies, and yet they admit a rigorous asymptotic analysis. The findings are as follows:

The effects of bagging on variance, squared bias and mean squared error are of order N^{-2}. (The following statements are all meant to second order.)

If one allows bootstrap samples with or without replacement and arbitrary resample sizes, then bagging based on sampling with replacement for resample size M_with is equivalent to bagging based on sampling without replacement for resample size M_w/o if N/M_with = N/M_w/o − 1 = g (> 0, < ∞). While our derivation is limited to U-statistics and sums thereof, the equivalence seems to hold more widely, as illustrated by our experiments with bagging CART trees.

Var(bagged) − Var(raw) is a linear function of g; bagging improves variance if the slope is negative.

Bias^2(bagged) − Bias^2(raw) is a positive quadratic function of g; bagging hence always increases squared bias.

MSE(bagged) − MSE(raw) is a quadratic function of g; bagging may or may not improve mean squared error, and if it does, it is for sufficiently small g, that is, sufficiently large resample sizes M.

While our theory does not explain the striking magnitude of the improvements seen when bagging CART trees, there is qualitative agreement between our theoretical findings and the experimental results. Plug-in bias increases and variance decreases with decreasing resample size, as predicted by the theory, and the theoretically predicted equivalence between bagging with and without replacement is indeed observed in the experiments.

7. Appendix

7.1 Summation Patterns for U-Statistics

The calculations for U-statistics in this and the following sections are reminiscent of those found in Hoeffding (1948). We introduce notation for statistical functionals that are interactions of order J and K, respectively:

B = N^{-J} Σ_µ B_µ,      C = N^{-K} Σ_ν C_ν,

where µ = (µ_1, ..., µ_J) ∈ {1, ..., N}^J, B_µ = B_{X_{µ_1},...,X_{µ_J}}, and ν = (ν_1, ..., ν_K) ∈ {1, ..., N}^K, C_ν = C_{X_{ν_1},...,X_{ν_K}}.

We assume the functions B_{x_1,...,x_J} and C_{y_1,...,y_K} to be permutation symmetric in their arguments, the random variables X_1, ..., X_N to be i.i.d., and the second moments of B_µ and C_ν to exist for all µ and ν. We do not limit the summations to distinct indices, as is usual in the context of von Mises expansions rather than that of U-statistics. One reason is that we wish B and C to be plug-in estimates of the functionals E B_{1,...,J} and E C_{1,...,K}. Another reason is that bagging produces lower order interactions from higher order ones, as we will see.

In what follows we will need to partition sums such as Σ_µ according to how many indexes appear multiple times in µ = (µ_1, ..., µ_J). To this end, we introduce t(µ) as the number of essential ties in µ:

t(µ) = #{ (i, j) | i < j, µ_i = µ_j, µ_i ∉ {µ_1, ..., µ_{i−1}} }.

The sub-index i marks the first appearance of the index value µ_i, and all other µ_j equal to µ_i are counted relative to i. For example, µ = (1, 1, 2, 1, 2) has three essential ties: µ_1 = µ_2, µ_1 = µ_4, and µ_3 = µ_5; the tie µ_2 = µ_4 is inessential because it can be inferred from the essential ties.
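The count t(µ) of essential ties is simple to compute; this small sketch (ours) implements the definition directly and reproduces the example.

```python
def essential_ties(mu):
    """t(mu): number of pairs (i, j), i < j, with mu_i == mu_j such that
    position i is the first occurrence of that index value."""
    t = 0
    for i, v in enumerate(mu):
        if v not in mu[:i]:                       # i is a first occurrence
            t += sum(1 for w in mu[i + 1:] if w == v)
    return t

print(essential_ties((1, 1, 2, 1, 2)))  # 3: the ties (1,2), (1,4), (3,5)
```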

An important observation concerns the counts of indexes with a given number of essential ties. The following will be used repeatedly:

#{ µ | t(µ) = 0 } = (N)_J = O(N^J),
#{ µ | t(µ) = 1 } = \binom{J}{2} (N)_{J−1} = O(N^{J−1}),
#{ µ | t(µ) ≥ 2 } = O(N^{J−2}),

where (N)_L := N!/(N−L)! = N(N−1)···(N−L+1) denotes the falling factorial.

Another notation we need is the number c(µ, ν) of essential cross-ties between µ and ν:

c(µ, ν) = #{ (i, j) | µ_i = ν_j, µ_i ∉ {µ_1, ..., µ_{i−1}}, ν_j ∉ {ν_1, ..., ν_{j−1}} }.

We exclude inessential cross-ties that can be inferred from the ties within µ and ν. For example, for µ = (1, 2, 1) and ν = (3, 1) the only essential cross-tie is µ_1 = ν_2; the remaining cross-tie µ_3 = ν_2 is inessential because it can be inferred from the essential tie µ_1 = µ_3 within µ. With these definitions we have the following fact for the number of essential ties of the concatenated sequence (µ, ν):

t((µ, ν)) = t(µ) + t(ν) + c(µ, ν).

7.2 Covariance of General Interactions

In expanding the covariance between B and C, we note that the terms with zero cross-ties between µ and ν vanish due to independence. Thus:

Cov(B, C) = N^{-(J+K)} Σ_{c(µ,ν)>0} Cov(B_µ, C_ν).

Because #{ (µ, ν) | c(µ, ν) > 0 } is of order O(N^{J+K−1}) (a crude upper bound is JK N^{J+K−1}), it follows that Cov(B, C) is of order O(N^{-1}), as it should be.

We now show that in order to capture the terms of order N^{-1} and N^{-2} in Cov(B, C) it is sufficient to limit the summation to those (µ, ν) that satisfy either t(µ) = 0, t(ν) = 0 and c(µ, ν) = 1, or t(µ) = 1, t(ν) = 0 and c(µ, ν) = 1, or t(µ) = 0, t(ν) = 1 and c(µ, ν) = 1, or t(µ) + t(ν) = 0 and c(µ, ν) = 2; that is, t((µ, ν)) ≤ 2 for short. To this end, we note that the number of terms with t((µ, ν)) ≥ 3 is of order N^{J+K−3}. This

is seen from the following crude upper bound:

#{ (µ, ν) | t(µ) + t(ν) + c(µ, ν) ≥ 3 } = #{ (µ, ν) | t((µ, ν)) ≥ 3 }
 ≤ [ \binom{K+J}{4} + \binom{K+J}{3, 2} + \binom{K+J}{2, 2, 2} ] N^{J+K−3},

where the binomial and multinomial coefficients arise from choosing the index patterns (1,1,1,1), (1,1,1,2,2) and (1,1,2,2,3,3) in all possible ways in a sequence (µ, ν) of length K+J; these three patterns are necessary and sufficient for t((µ, ν)) ≥ 3. Using N^{J+K−3} instead of N(N−1)···(N−(J+K−4)) makes this an upper bound.

With the assumption of finite second moments of B_µ and C_ν for all µ and ν, it follows that the sum of the terms with t((µ, ν)) ≥ 3 is of order O(N^{-3}). Abbreviating, we have:

Cov(B, C) = N^{-(J+K)} Σ_{t((µ,ν))≤2, c(µ,ν)≥1} Cov(B_µ, C_ν) + O(N^{-3})

 = N^{-(J+K)} [ Σ_{t(µ)=0, t(ν)=0, c=1} + Σ_{t(µ)=1, t(ν)=0, c=1} + Σ_{t(µ)=0, t(ν)=1, c=1} + Σ_{t(µ)=t(ν)=0, c=2} ] Cov(B_µ, C_ν) + O(N^{-3})

 = N^{-(J+K)} JK (N)_{J+K−1} Cov(B_{(1,2,...)}, C_{(1,...)})
 + N^{-(J+K)} \binom{J}{2} K (N)_{J+K−2} [ Cov(B_{(1,1,2,...)}, C_{(1,...)}) + (J−2) Cov(B_{(1,1,2,...)}, C_{(2,...)}) ]
 + N^{-(J+K)} \binom{K}{2} J (N)_{J+K−2} [ Cov(B_{(1,...)}, C_{(1,1,2,...)}) + (K−2) Cov(B_{(2,...)}, C_{(1,1,2,...)}) ]
 + N^{-(J+K)} 2 \binom{J}{2} \binom{K}{2} (N)_{J+K−2} Cov(B_{(1,2,...)}, C_{(1,2,...)})
 + O(N^{-3}),

where "..." inside a covariance stands for as many distinct other indices as necessary.

Using (N)_L = N^L − \binom{L}{2} N^{L−1} + O(N^{L−2}), and collecting terms of order O(N^{-3}), the above can be written in a more sightly manner as

Cov(B, C) = (1/N) ( 1 − \binom{J+K−1}{2} (1/N) ) JK Cov(B̄_X, C̄_X)
 + (1/N^2) \binom{J}{2} K ( Cov(B_{X,X}, C̄_X) + (J−2) Cov(B_{X,X,Y}, C̄_Y) )
 + (1/N^2) \binom{K}{2} J ( Cov(B̄_X, C_{X,X}) + (K−2) Cov(B̄_X, C_{X,Y,Y}) )
 + (2/N^2) \binom{J}{2} \binom{K}{2} Cov(B_{X,Y}, C_{X,Y})
 + O(N^{-3})
 =: a/N + b/N^2 + O(N^{-3}).

Here B̄_X := E[ B_{X,Y_2,...,Y_J} | X ] is the kernel with all but one argument integrated out against independent copies; B_{X,X} := E[ B_{X,X,Y_3,...,Y_J} | X ] has a tied pair; B_{X,X,Y} (equivalently written B_{X,Y,Y}, up to naming of the free argument) has a tied pair and one free argument; and B_{X,Y} := E[ B_{X,Y,Z_3,...,Z_J} | X, Y ] has two free arguments, all remaining arguments integrated out. Analogous notation applies to C.

7.3 Moments of Resampling Coefficients

We consider sampling in terms of M draws from N objects {1, ..., N}, with and without replacement. The draws are M exchangeable random variables R_1, ..., R_M, where R_i ∈ {1, ..., N}. Each draw is equally likely: P[R_i = n] = 1/N. For sampling with replacement the draws are independent; for sampling w/o replacement they are dependent, and the joint probabilities are P[R_1 = n_1, R_2 = n_2, ..., R_J = n_J] = 1/(N)_J for distinct n_i's (J ≤ M), and = 0 if ties exist among the n_i's. For resampling one is interested in the count variables

W_{n,M,N} = W_n = Σ_{µ=1,...,M} 1_{[R_µ = n]},

where we drop M and N from the subscripts if they are fixed. We let W = W_{M,N} = (W_1, ..., W_N) and recall:

For resampling with replacement: W ∼ Multinomial(1/N, ..., 1/N; M).
For resampling w/o replacement: W ∼ Hypergeometric(M, N).

For bagging one needs the moments of W. Because of the exchangeability of W for fixed M and N, it is sufficient to consider moments of the form

E[ W_{n=1,M,N}^{i_1} W_{n=2,M,N}^{i_2} ··· W_{n=L,M,N}^{i_L} ].

The following recursion formulae hold for i_1 ≥ 1:

E[ W_{n=1,M,N}^{i_1} W_{n=2,M,N}^{i_2} ··· W_{n=L,M,N}^{i_L} ]
 = (with)   (M/N) E[ (W_{n=1,M−1,N} + 1)^{i_1−1} W_{n=2,M−1,N}^{i_2} ··· W_{n=L,M−1,N}^{i_L} ],
 = (w/o)    (M/N) E[ W_{n=2,M−1,N−1}^{i_2} ··· W_{n=L,M−1,N−1}^{i_L} ].

From these we derive the moments that will be needed below. Recall α = M/N, and g = 1/α for resampling with, g = 1/α − 1 for resampling without, replacement.
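The moment approximations derived next can be checked by simulation. The sketch below is ours (sizes arbitrary); it compares empirical moments of multinomial and hypergeometric counts with the limiting expressions α^L (1 − binom(L,2) g/N) and α^L (g + 1).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
N, M, reps = 40, 20, 100_000
alpha = M / N

def counts(with_replacement):
    if with_replacement:
        return rng.multinomial(M, np.full(N, 1.0 / N))
    w = np.zeros(N)
    w[rng.choice(N, size=M, replace=False)] = 1
    return w

for mode, g in (("with", N / M), ("w/o", N / M - 1)):
    W = np.array([counts(mode == "with") for _ in range(reps)])
    e_w1w2 = (W[:, 0] * W[:, 1]).mean()          # L = 2, no squared factor
    e_w1sq_w2 = (W[:, 0]**2 * W[:, 1]).mean()    # L = 3, one squared factor
    print(mode,
          round(e_w1w2, 4), round(alpha**2 * (1 - comb(2, 2) * g / N), 4),
          round(e_w1sq_w2, 4), round(alpha**3 * (g + 1), 4))
```

The first pair of numbers per mode should agree up to O(N^{-2}) and Monte Carlo error, the second pair up to O(N^{-1}).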

Using repeatedly approximations such as (N)_L = N^L − \binom{L}{2} N^{L−1} + O(N^{L−2}), we obtain:

E[ W_1^{i_1} W_2^{i_2} ··· W_L^{i_L} ] = O(1),

E[ W_1 W_2 ··· W_L ]
 = (with)   (M)_L / N^L,      (w/o)   (M)_L / (N)_L
 = (with)   α^L ( 1 − \binom{L}{2} (1/α) (1/N) ) + O(N^{-2}),      (w/o)   α^L ( 1 − \binom{L}{2} (1/α − 1) (1/N) ) + O(N^{-2})
 = α^L ( 1 − \binom{L}{2} g/N ) + O(N^{-2}),

E[ W_1^2 W_2 ··· W_{L−1} ]
 = (with)   (M)_L / N^L + (M)_{L−1} / N^{L−1},      (w/o)   (M)_{L−1} / (N)_{L−1}
 = (with)   α^L + α^{L−1} + O(N^{-1}),      (w/o)   α^{L−1} + O(N^{-1})

 = α^L (g + 1) + O(N^{-1}).

7.4 Equivalence of Resampling With and Without Replacement

We show the equivalence of resampling with and without replacement to order N^{-2}. To this end we need to distinguish between the resampling sizes M_with and M_w/o, and the corresponding resampling fractions α_with = M_with/N and α_w/o = M_w/o/N. The equivalence holds under the condition

1/α_with = 1/α_w/o − 1 (=: g).

The two types of bagged U-statistics are denoted, respectively, by

B^with = M_with^{-J} Σ_µ E[ W^with_{µ_1} ··· W^with_{µ_J} ] B_µ,
B^w/o  = M_w/o^{-J} Σ_µ E[ W^w/o_{µ_1} ··· W^w/o_{µ_J} ] B_µ.

Bagging differentially reweights the parts of a general interaction in terms of moments of the resampling vector W. The result of bagging is no longer a pure interaction but a general U-statistic, because bagging creates lower-order interactions from higher orders. Recall two facts about the bagging weights, that is, the moments of W: 1) They depend on the structure of the ties in the index vectors µ = (µ_1, ..., µ_J) only; for example, µ = (1, 1, 2) and µ = (3, 1, 3) have the same weights, E[W_1^2 W_2] = E[W_3^2 W_1], due to exchangeability. 2) The moments of W are of order O(1) in N (Subsection 7.3) and hence preserve the orders O(N^{-1}), O(N^{-2}), O(N^{-3}) of the terms considered in Subsection 7.2.

We derive a crude bound on the difference using B_bound = max_µ |B_µ|. Writing w^with_µ = M_with^{-J} E[W^with_{µ_1} ··· W^with_{µ_J}] and w^w/o_µ = M_w/o^{-J} E[W^w/o_{µ_1} ··· W^w/o_{µ_J}] for the bagging weights, we assume the above condition on α_with and α_w/o and obtain:

|B^with − B^w/o| ≤ [ Σ_{t(µ)=0} |w^with_µ − w^w/o_µ| + Σ_{t(µ)=1} |w^with_µ − w^w/o_µ| + Σ_{t(µ)>1} |w^with_µ − w^w/o_µ| ] B_bound

By the moment calculations of Subsection 7.3, and using M^{-J} α^J = N^{-J} for both modes,

w^with_µ, w^w/o_µ = N^{-J} ( 1 − \binom{J}{2} g/N + O(N^{-2}) )   for t(µ) = 0,
w^with_µ, w^w/o_µ = N^{-J} ( (g+1) + O(N^{-1}) )                  for t(µ) = 1,
w^with_µ, w^w/o_µ = N^{-J} O(1)                                   for t(µ) > 1.

Hence

|B^with − B^w/o| ≤ [ Σ_{t(µ)=0} N^{-J} O(N^{-2}) + Σ_{t(µ)=1} N^{-J} O(N^{-1}) + Σ_{t(µ)>1} N^{-J} O(1) ] B_bound
 = [ O(N^J) N^{-J} O(N^{-2}) + O(N^{J−1}) N^{-J} O(N^{-1}) + O(N^{J−2}) N^{-J} O(1) ] B_bound
 = O(N^{-2}) B_bound.

This proves the per-sample equivalence of bagging based on resampling with and without replacement up to order O(N^{-2}). The result is somewhat unsatisfactory because the bound depends on the extremes of the U-terms B_µ, which tend to infinity as N → ∞, unless B_µ is bounded. Other bounds at a weaker rate can be obtained with the Hölder inequality:

|B^with − B^w/o| ≤ O(N^{-2/p}) ( N^{-J} Σ_µ |B_µ|^q )^{1/q}   for 1/p + 1/q = 1.

This specializes to the previously derived bound when p = 1 and q = ∞, for which the best rate of O(N^{-2}) is obtained, albeit under the strongest assumptions on B_µ.

7.5 Covariances of Bagged Interactions

Resuming the calculations begun in Subsection 7.2 for covariances of unbagged interaction terms, we now derive the covariance of their M-bagged versions:

B_bag = M^{-J} Σ_µ E[W_{µ_1} ··· W_{µ_J}] B_µ,      C_bag = M^{-K} Σ_ν E[W_{ν_1} ··· W_{ν_K}] C_ν.

The moment calculations of Subsection 7.3 yield the following:

Cov(B_bag, C_bag)
 = M^{-(J+K)} Σ_{t((µ,ν))≤2, c(µ,ν)≥1} E[W_{µ_1} ··· W_{µ_J}] E[W_{ν_1} ··· W_{ν_K}] Cov(B_µ, C_ν) + O(N^{-3})
 = N^{-(J+K)} JK (N)_{J+K−1} ( 1 − \binom{J}{2} g/N ) ( 1 − \binom{K}{2} g/N ) Cov(B̄_X, C̄_X)
 + N^{-(J+K)} \binom{J}{2} K (N)_{J+K−2} (g+1) [ Cov(B_{X,X}, C̄_X) + (J−2) Cov(B_{X,X,Y}, C̄_Y) ]
 + N^{-(J+K)} \binom{K}{2} J (N)_{J+K−2} (g+1) [ Cov(B̄_X, C_{X,X}) + (K−2) Cov(B̄_X, C_{X,Y,Y}) ]
 + N^{-(J+K)} 2 \binom{J}{2} \binom{K}{2} (N)_{J+K−2} Cov(B_{X,Y}, C_{X,Y})
 + O(N^{-3})

 = (1/N) JK Cov(B̄_X, C̄_X)
 − (1/N^2) [ \binom{J+K−1}{2} + ( \binom{J}{2} + \binom{K}{2} ) g ] JK Cov(B̄_X, C̄_X)
 + (1/N^2) (g+1) \binom{J}{2} K [ Cov(B_{X,X}, C̄_X) + (J−2) Cov(B_{X,X,Y}, C̄_Y) ]
 + (1/N^2) (g+1) \binom{K}{2} J [ Cov(B̄_X, C_{X,X}) + (K−2) Cov(B̄_X, C_{X,Y,Y}) ]
 + (2/N^2) \binom{J}{2} \binom{K}{2} Cov(B_{X,Y}, C_{X,Y})
 + O(N^{-3}).

The last lines form the final result of these calculations.

7.6 Difference Between Variances of Bagged and Unbagged Statistics

Comparing the results of Subsections 7.2 and 7.5, we get:

Cov(B_bag, C_bag) − Cov(B, C)
 = − (1/N^2) g ( \binom{J}{2} + \binom{K}{2} ) JK Cov(B̄_X, C̄_X)
 + (1/N^2) g \binom{J}{2} K ( Cov(B_{X,X}, C̄_X) + (J−2) Cov(B_{X,X,Y}, C̄_Y) )
 + (1/N^2) g \binom{K}{2} J ( Cov(B̄_X, C_{X,X}) + (K−2) Cov(B̄_X, C_{X,Y,Y}) )
 + O(N^{-3})
 = (2/N^2) g S_Var(B, C) + O(N^{-3}),

where

S_Var(B, C) = \binom{J}{2} (K/2) Cov( C̄_X, B_{X,X} + (J−2) B_{X,Y,Y} − J B̄_X )
            + \binom{K}{2} (J/2) Cov( B̄_X, C_{X,X} + (K−2) C_{X,Y,Y} − K C̄_X ).

The expression for S_Var(B, C) remains correct for J and K as low as 1, in which case one interprets \binom{J}{2} = 0 and B_{X,X} = 0 when J = 1, and B_{X,Y,Y} = 0 when J ≤ 2, and similarly for C when K = 1 or 2.

The result generalizes to arbitrary finite sums of interactions

U = A + B + C + ... = (1/N) Σ_i A_i + (1/N^2) Σ_{i,j} B_{i,j} + (1/N^3) Σ_{i,j,k} C_{i,j,k} + ....

Because S_Var(B, C) is a bilinear form in its arguments, the corresponding constant S_Var(U) for sums of U-statistics can be expanded as follows:

S_Var(U) = S_Var(A, A) + 2 S_Var(A, B) + S_Var(B, B) + 2 S_Var(A, C) + 2 S_Var(B, C) + S_Var(C, C) + ...,

so that

Var(U_bag) − Var(U) = (2/N^2) g S_Var(U) + O(N^{-3}).

For example, a functional consisting of first and second order terms,

U = A + B = (1/N) Σ_i A_i + (1/N^2) Σ_{i,j} B_{i,j},

yields

S_Var(U) = S_Var(A, A) + 2 S_Var(A, B) + S_Var(B, B)
         = Cov(A_X, B_{X,X} − 2 B̄_X) + Cov(2 B̄_X, B_{X,X} − 2 B̄_X)
         = Cov(A_X + 2 B̄_X, B_{X,X} − 2 B̄_X).

Note that S_Var(A, A) = 0 because bagging leaves additive statistics unchanged.

7.7 Difference Between Squared Biases of Bagged and Unbagged Statistics

We consider a single K-th order interaction first, with functional and plug-in

U(F) = E C_{(1,2,...,K)},      U(F_N) = N^{-K} Σ_{ν_1,...,ν_K=1}^{N} C_{(ν_1,...,ν_K)}.

[Recall that C_ν and C_{(ν_1,...,ν_K)} are short for C_{X_{ν_1},...,X_{ν_K}}.] The functional U(F) plays the role of the parameter to be estimated by the statistic U = U(F_N), so that the notion of bias applies. We first calculate the bias for the unbagged statistic U and second for the bagged statistic U_bag. Note that E C̄_X = E C_{1,...,K} = U(F).

E[U(F_N)] = N^{-K} Σ_{ν_1,...,ν_K} E C_{(ν_1,...,ν_K)}
 = N^{-K} [ (N)_K E C_{(1,...,K)} + \binom{K}{2} (N)_{K−1} E C_{(1,1,2,...,K−1)} ] + O(N^{-2})
 = U(F) + (1/N) \binom{K}{2} ( E C_{X,X} − E C_{X,Y} ) + O(N^{-2}).

Now for the bias of the bagged statistic:

E U_bag = M^{-K} Σ_{ν_1,...,ν_K} E[W_{ν_1} ··· W_{ν_K}] E C_{(ν_1,...,ν_K)}
 = N^{-K} (N)_K ( 1 − \binom{K}{2} g/N ) E C_{(1,...,K)} + N^{-K} \binom{K}{2} (N)_{K−1} (g+1) E C_{(1,1,2,...,K−1)} + O(N^{-2})

 = U(F) − (1/N) \binom{K}{2} (g+1) E C_{(1,...,K)} + (1/N) \binom{K}{2} (g+1) E C_{(1,1,2,...,K−1)} + O(N^{-2})
 = U(F) + (1/N) \binom{K}{2} (g+1) ( E C_{X,X} − E C_{X,Y} ) + O(N^{-2}),

where E C_{X,X} := E C_{(1,1,2,...,K−1)} and E C_{X,Y} := E C_{(1,...,K)}. Thus:

Bias(U_bag) = (1/N) \binom{K}{2} (g+1) ( E C_{X,X} − E C_{X,Y} ) + O(N^{-2}).

As for variances, we can now consider more general statistics that are finite sums of interactions:

U = A + B + C + ... = (1/N) Σ_i A_i + (1/N^2) Σ_{i,j} B_{i,j} + (1/N^3) Σ_{i,j,k} C_{i,j,k} + ....

The final result is:

Bias^2(U_bag) − Bias^2(U)
 = (1/N^2) ( (g+1)^2 − 1 ) [ ( E B_{X,X} − E B_{X,Y} ) + 3 ( E C_{X,X,Y} − E C_{X,Y,Z} ) + ... ]^2 + O(N^{-3}).

As usual, g = 1/α for sampling with, and g = 1/α − 1 for sampling w/o, replacement.

References

Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123-140.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, California.

Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30, 927-961.

Chen, S. X. and Hall, P. (2003). Effects of bagging and bias correction on estimators defined by estimating equations. Statistica Sinica 13, 97-109.

Friedman, J. H. and Hall, P. (2000). On bagging and nonlinear estimation. Technical report, Stanford University (available from the first author's web page, ~jhf/#reports).

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293-325.

Knight, K. and Bassett, G. W., Jr. (2002). Second order improvements of sample quantiles using subsamples. Unpublished manuscript.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, U.S.A.
E-mail: buja@wharton.upenn.edu

Department of Statistics, Box 354322, University of Washington, Seattle, WA, U.S.A.
E-mail: wxs@stat.washington.edu


More information

v v Downloaded 01/11/16 to Redistribution subject to SEG license or copyright; see Terms of Use at

v v Downloaded 01/11/16 to Redistribution subject to SEG license or copyright; see Terms of Use at The pseudo-analytical method: application of pseudo-laplacians to acoustic and acoustic anisotropic wae propagation John T. Etgen* and Serre Brandsberg-Dahl Summary We generalize the pseudo-spectral method

More information

Inelastic Collapse in One Dimensional Systems with Ordered Collisions

Inelastic Collapse in One Dimensional Systems with Ordered Collisions Lake Forest College Lake Forest College Publications Senior Theses Student Publications 4-4-05 Inelastic Collapse in One Dimensional Systems with Ordered Collisions Brandon R. Bauerly Lake Forest College,

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

4. A Physical Model for an Electron with Angular Momentum. An Electron in a Bohr Orbit. The Quantum Magnet Resulting from Orbital Motion.

4. A Physical Model for an Electron with Angular Momentum. An Electron in a Bohr Orbit. The Quantum Magnet Resulting from Orbital Motion. 4. A Physical Model for an Electron with Angular Momentum. An Electron in a Bohr Orbit. The Quantum Magnet Resulting from Orbital Motion. We now hae deeloped a ector model that allows the ready isualization

More information

Isoperimetric problems

Isoperimetric problems CHAPTER 2 Isoperimetric problems 2.1. History One of the earliest problems in geometry was the isoperimetric problem, which was considered by the ancient Greeks. The problem is to find, among all closed

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

The optimal pebbling number of the complete m-ary tree

The optimal pebbling number of the complete m-ary tree Discrete Mathematics 222 (2000) 89 00 www.elseier.com/locate/disc The optimal pebbling number of the complete m-ary tree Hung-Lin Fu, Chin-Lin Shiue Department of Applied Mathematics, National Chiao Tung

More information

Towards Green Distributed Storage Systems

Towards Green Distributed Storage Systems Towards Green Distributed Storage Systems Abdelrahman M. Ibrahim, Ahmed A. Zewail, and Aylin Yener Wireless Communications and Networking Laboratory (WCAN) Electrical Engineering Department The Pennsylania

More information

A Regularization Framework for Learning from Graph Data

A Regularization Framework for Learning from Graph Data A Regularization Framework for Learning from Graph Data Dengyong Zhou Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7076 Tuebingen, Germany Bernhard Schölkopf Max Planck Institute for

More information

A spectral Turán theorem

A spectral Turán theorem A spectral Turán theorem Fan Chung Abstract If all nonzero eigenalues of the (normalized) Laplacian of a graph G are close to, then G is t-turán in the sense that any subgraph of G containing no K t+ contains

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

RANDOMIZING OUTPUTS TO INCREASE PREDICTION ACCURACY

RANDOMIZING OUTPUTS TO INCREASE PREDICTION ACCURACY 1 RANDOMIZING OUTPUTS TO INCREASE PREDICTION ACCURACY Leo Breiman Statistics Department University of California Berkeley, CA. 94720 leo@stat.berkeley.edu Technical Report 518, May 1, 1998 abstract Bagging

More information

Introduction to Thermodynamic Cycles Part 1 1 st Law of Thermodynamics and Gas Power Cycles

Introduction to Thermodynamic Cycles Part 1 1 st Law of Thermodynamics and Gas Power Cycles Introduction to Thermodynamic Cycles Part 1 1 st Law of Thermodynamics and Gas Power Cycles by James Doane, PhD, PE Contents 1.0 Course Oeriew... 4.0 Basic Concepts of Thermodynamics... 4.1 Temperature

More information

A comparative study of ordinary cross-validation, r-fold cross-validation and the repeated learning-testing methods

A comparative study of ordinary cross-validation, r-fold cross-validation and the repeated learning-testing methods Biometrika (1989), 76, 3, pp. 503-14 Printed in Great Britain A comparative study of ordinary cross-validation, r-fold cross-validation and the repeated learning-testing methods BY PRABIR BURMAN Division

More information

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets Econometrics II - EXAM Outline Solutions All questions hae 5pts Answer each question in separate sheets. Consider the two linear simultaneous equations G with two exogeneous ariables K, y γ + y γ + x δ

More information

A-level Mathematics. MM03 Mark scheme June Version 1.0: Final

A-level Mathematics. MM03 Mark scheme June Version 1.0: Final -leel Mathematics MM0 Mark scheme 660 June 0 Version.0: Final Mark schemes are prepared by the Lead ssessment Writer and considered, together with the releant questions, by a panel of subject teachers.

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Bc. Dominik Lachman. Bruhat-Tits buildings

Bc. Dominik Lachman. Bruhat-Tits buildings MASTER THESIS Bc. Dominik Lachman Bruhat-Tits buildings Department of Algebra Superisor of the master thesis: Study programme: Study branch: Mgr. Vítězsla Kala, Ph.D. Mathematics Mathematical structure

More information

6.3 Vectors in a Plane

6.3 Vectors in a Plane 6.3 Vectors in a Plane Plan: Represent ectors as directed line segments. Write the component form of ectors. Perform basic ector operations and represent ectors graphically. Find the direction angles of

More information

ERAD THE SEVENTH EUROPEAN CONFERENCE ON RADAR IN METEOROLOGY AND HYDROLOGY

ERAD THE SEVENTH EUROPEAN CONFERENCE ON RADAR IN METEOROLOGY AND HYDROLOGY Multi-beam raindrop size distribution retrieals on the oppler spectra Christine Unal Geoscience and Remote Sensing, TU-elft Climate Institute, Steinweg 1, 68 CN elft, Netherlands, c.m.h.unal@tudelft.nl

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

BAGGING PREDICTORS AND RANDOM FOREST

BAGGING PREDICTORS AND RANDOM FOREST BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS

More information

Optimal Joint Detection and Estimation in Linear Models

Optimal Joint Detection and Estimation in Linear Models Optimal Joint Detection and Estimation in Linear Models Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor Abstract The problem of optimal joint detection and estimation in linear models with

More information

A possible mechanism to explain wave-particle duality L D HOWE No current affiliation PACS Numbers: r, w, k

A possible mechanism to explain wave-particle duality L D HOWE No current affiliation PACS Numbers: r, w, k A possible mechanism to explain wae-particle duality L D HOWE No current affiliation PACS Numbers: 0.50.-r, 03.65.-w, 05.60.-k Abstract The relationship between light speed energy and the kinetic energy

More information

Cases of integrability corresponding to the motion of a pendulum on the two-dimensional plane

Cases of integrability corresponding to the motion of a pendulum on the two-dimensional plane Cases of integrability corresponding to the motion of a pendulum on the two-dimensional plane MAXIM V. SHAMOLIN Lomonoso Moscow State Uniersity Institute of Mechanics Michurinskii Ae.,, 99 Moscow RUSSIAN

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Boundary Element Method Calculation of Moment Transfer Parameters in Slab-Column Connections

Boundary Element Method Calculation of Moment Transfer Parameters in Slab-Column Connections ACI STRUCTURAL JOURNAL Title no. 107-S16 TECHNICAL PAPER Boundary Element Method Calculation of Moment Transfer Parameters in Slab-Column Connections by Mohammed A. Nazief, Youssef F. Rashed, and Wael

More information

An Optimal Split-Plot Design for Performing a Mixture-Process Experiment

An Optimal Split-Plot Design for Performing a Mixture-Process Experiment Science Journal of Applied Mathematics and Statistics 217; 5(1): 15-23 http://www.sciencepublishinggroup.com/j/sjams doi: 1.11648/j.sjams.21751.13 ISSN: 2376-9491 (Print); ISSN: 2376-9513 (Online) An Optimal

More information

(a) During the first part of the motion, the displacement is x 1 = 40 km and the time interval is t 1 (30 km / h) (80 km) 40 km/h. t. (2.

(a) During the first part of the motion, the displacement is x 1 = 40 km and the time interval is t 1 (30 km / h) (80 km) 40 km/h. t. (2. Chapter 3. Since the trip consists of two parts, let the displacements during first and second parts of the motion be x and x, and the corresponding time interals be t and t, respectiely. Now, because

More information

Integer Parameter Synthesis for Real-time Systems

Integer Parameter Synthesis for Real-time Systems 1 Integer Parameter Synthesis for Real-time Systems Aleksandra Joanoić, Didier Lime and Oliier H. Roux École Centrale de Nantes - IRCCyN UMR CNRS 6597 Nantes, France Abstract We proide a subclass of parametric

More information

Dyadic Classification Trees via Structural Risk Minimization

Dyadic Classification Trees via Structural Risk Minimization Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract

More information

REGRESSION TREE CREDIBILITY MODEL

REGRESSION TREE CREDIBILITY MODEL LIQUN DIAO AND CHENGGUO WENG Department of Statistics and Actuarial Science, University of Waterloo Advances in Predictive Analytics Conference, Waterloo, Ontario Dec 1, 2017 Overview Statistical }{{ Method

More information

A Geometric Review of Linear Algebra

A Geometric Review of Linear Algebra A Geometric Reiew of Linear Algebra The following is a compact reiew of the primary concepts of linear algebra. I assume the reader is familiar with basic (i.e., high school) algebra and trigonometry.

More information

Lectures on Spectral Graph Theory. Fan R. K. Chung

Lectures on Spectral Graph Theory. Fan R. K. Chung Lectures on Spectral Graph Theory Fan R. K. Chung Author address: Uniersity of Pennsylania, Philadelphia, Pennsylania 19104 E-mail address: chung@math.upenn.edu Contents Chapter 1. Eigenalues and the Laplacian

More information

On computing Gaussian curvature of some well known distribution

On computing Gaussian curvature of some well known distribution Theoretical Mathematics & Applications, ol.3, no.4, 03, 85-04 ISSN: 79-9687 (print), 79-9709 (online) Scienpress Ltd, 03 On computing Gaussian curature of some well known distribution William W.S. Chen

More information

The Riemann-Roch Theorem: a Proof, an Extension, and an Application MATH 780, Spring 2019

The Riemann-Roch Theorem: a Proof, an Extension, and an Application MATH 780, Spring 2019 The Riemann-Roch Theorem: a Proof, an Extension, and an Application MATH 780, Spring 2019 This handout continues the notational conentions of the preious one on the Riemann-Roch Theorem, with one slight

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

9. Integral Ring Extensions

9. Integral Ring Extensions 80 Andreas Gathmann 9. Integral ing Extensions In this chapter we want to discuss a concept in commutative algebra that has its original motivation in algebra, but turns out to have surprisingly many applications

More information

The Full-rank Linear Least Squares Problem

The Full-rank Linear Least Squares Problem Jim Lambers COS 7 Spring Semeseter 1-11 Lecture 3 Notes The Full-rank Linear Least Squares Problem Gien an m n matrix A, with m n, and an m-ector b, we consider the oerdetermined system of equations Ax

More information

Variance Reduction for Stochastic Gradient Optimization

Variance Reduction for Stochastic Gradient Optimization Variance Reduction for Stochastic Gradient Optimization Chong Wang Xi Chen Alex Smola Eric P. Xing Carnegie Mellon Uniersity, Uniersity of California, Berkeley {chongw,xichen,epxing}@cs.cmu.edu alex@smola.org

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 1 2 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 2 An experimental bias variance analysis of SVM ensembles based on resampling

More information

Optimization Problems in Multiple Subtree Graphs

Optimization Problems in Multiple Subtree Graphs Optimization Problems in Multiple Subtree Graphs Danny Hermelin Dror Rawitz February 28, 2010 Abstract We study arious optimization problems in t-subtree graphs, the intersection graphs of t- subtrees,

More information

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

PRODUCTS IN CONDITIONAL EXTREME VALUE MODEL

PRODUCTS IN CONDITIONAL EXTREME VALUE MODEL PRODUCTS IN CONDITIONAL ETREME VALUE MODEL RAJAT SUBHRA HAZRA AND KRISHANU MAULIK Abstract. The classical multiariate extreme alue theory tries to capture the extremal dependence between the components

More information

International Journal of Scientific & Engineering Research, Volume 7, Issue 11, November-2016 ISSN Nullity of Expanded Smith graphs

International Journal of Scientific & Engineering Research, Volume 7, Issue 11, November-2016 ISSN Nullity of Expanded Smith graphs 85 Nullity of Expanded Smith graphs Usha Sharma and Renu Naresh Department of Mathematics and Statistics, Banasthali Uniersity, Banasthali Rajasthan : usha.sharma94@yahoo.com, : renunaresh@gmail.com Abstract

More information

Chapter 14 Thermal Physics: A Microscopic View

Chapter 14 Thermal Physics: A Microscopic View Chapter 14 Thermal Physics: A Microscopic View The main focus of this chapter is the application of some of the basic principles we learned earlier to thermal physics. This will gie us some important insights

More information

ISSN BCAM Do Quality Ladders Rationalise the Observed Engel Curves? Vincenzo Merella University of Cagliari.

ISSN BCAM Do Quality Ladders Rationalise the Observed Engel Curves? Vincenzo Merella University of Cagliari. Birkbeck Centre for Applied Macroeconomics ISSN 1745-8587 BCAM 1401 Do Quality Ladders Rationalise the Obsered Engel Cures? incenzo Merella Uniersity of Cagliari March 2014 Birkbeck, Uniersity of London

More information

A Geometric Review of Linear Algebra

A Geometric Review of Linear Algebra A Geometric Reiew of Linear Algebra The following is a compact reiew of the primary concepts of linear algebra. The order of presentation is unconentional, with emphasis on geometric intuition rather than

More information

Management of Power Systems With In-line Renewable Energy Sources

Management of Power Systems With In-line Renewable Energy Sources Management of Power Systems With In-line Renewable Energy Sources Haoliang Shi, Xiaoyu Shi, Haozhe Xu, Ziwei Yu Department of Mathematics, Uniersity of Arizona May 14, 018 Abstract Power flowing along

More information

Hölder-type inequalities and their applications to concentration and correlation bounds

Hölder-type inequalities and their applications to concentration and correlation bounds Hölder-type inequalities and their applications to concentration and correlation bounds Christos Pelekis * Jan Ramon Yuyi Wang Noember 29, 2016 Abstract Let Y, V, be real-alued random ariables haing a

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information