Using Additive Expert Ensembles to Cope with Concept Drift

Jeremy Z. Kolter and Marcus A. Maloof
{jzk, maloof}@cs.georgetown.edu
Department of Computer Science, Georgetown University, Washington, DC 20057-1232, USA

Abstract

We consider online learning where the target concept can change over time. Previous work on expert prediction algorithms has bounded the worst-case performance on any subsequence of the training data relative to the performance of the best expert. However, because these experts may be difficult to implement, we take a more general approach and bound performance relative to the actual performance of any online learner on this single subsequence. We present the additive expert ensemble algorithm AddExp, a new, general method for using any online learner for drifting concepts. We adapt techniques for analyzing expert prediction algorithms to prove mistake and loss bounds for a discrete and a continuous version of AddExp. Finally, we present pruning methods and empirical results for data sets with concept drift.

1. Introduction

We consider the online learning scenario where an algorithm is presented with a series of labeled examples, $S = \{(\mathbf{x}_t, y_t)\}$ for $t = 1, 2, \ldots, T$. At each time step, the algorithm outputs a class prediction, $\hat{y}_t$, for the given feature vector $\mathbf{x}_t$, then updates its hypothesis based on the true class $y_t$. In the discrete case, $y_t$ and $\hat{y}_t$ are in a finite set of classes $Y$, and under the mistake bound model (Littlestone, 1988), one bounds the number of mistakes an algorithm makes on some target concept. In the continuous case, $y_t, \hat{y}_t \in [0, 1]$, and one bounds the maximum loss suffered by the algorithm; we focus on the absolute loss function $L(\hat{y}_t, y_t) = |\hat{y}_t - y_t|$. In addition, we allow for the target concept to change over time.

Certain expert prediction algorithms (Littlestone & Warmuth, 1994; Herbster & Warmuth, 1998) provide theoretical performance guarantees under concept drift. These algorithms predict based on the advice of several prediction strategies, called experts. When trained on a series of different concepts, these algorithms perform almost as well as the best expert on each concept, the assumption being that different experts perform best on different concepts. In practice, however, such algorithms can be difficult to implement. If the experts are fixed prediction strategies, then they are limited to those that one can delineate prior to any training. Alternatively, if the experts are learning algorithms, then they must be able to adapt to concept drift individually, since all are trained over the entire example series.

In this paper, we take a different, more adaptive approach: we bound our algorithm's performance over changing concepts, not relative to the performance of any abstract expert, but relative to the actual performance of an online learner trained on each concept individually. Our algorithm, called Additive Expert (AddExp), maintains what we call an additive expert ensemble. It is similar to expert prediction algorithms, except that new experts can be added during the online learning process. We adapt techniques from Littlestone and Warmuth (1994) and Freund and Schapire (1997) to prove performance bounds under the mistake bound and online loss models.

The primary result of the paper is that, when trained on a series of changing concepts, AddExp makes $O(m)$ mistakes on any concept, where $m$ is the number of mistakes that the base learner makes when trained only on this single concept. Similarly, for the continuous case, AddExp suffers a loss of $O(l)$ on any concept, where $l$ is the loss of the base learner when trained only on this single concept. We present pruning methods so the algorithm can be efficient on complex data sets. Finally, we present empirical results.

2. Background and Related Work

Much work on expert prediction can be traced back to the Weighted Majority (WM) algorithm (Littlestone & Warmuth, 1994) and Vovk's (1990) work on aggregating strategies.

Cesa-Bianchi et al. (1997) propose a general algorithm for randomized predictions and prove lower bounds on the net loss of any expert prediction algorithm. Freund and Schapire (1997) propose the Hedge algorithm, which uses these methods for an online-allocation problem. Haussler et al. (1998) and Vovk (1998) generalize previous results to general-case loss functions. Separately, several algorithms have been proposed to deal with concept drift (e.g., Kolter & Maloof, 2003; Schlimmer & Granger, 1986; Street & Kim, 2001; Widmer & Kubat, 1996). There has also been theoretical analysis (e.g., Helmbold & Long, 1994).

Expert prediction algorithms have been applied specifically to the problem of concept drift. A variant of the original WM algorithm, WML (Littlestone & Warmuth, 1994), adds mechanisms that allow new experts to become dominant in constant time. Blum (1997) implemented a version of WM for a calendar scheduling task that exhibits concept drift. Herbster and Warmuth (1998) generalize the loss bounds of such expert prediction algorithms to general-case loss functions, and Monteleoni and Jaakkola (2004) have recently extended this work by allowing the algorithm to adjust parameters during online learning to improve performance. Bousquet and Warmuth (2003) analyze a special case where the best experts come from a small subset of the ensemble.

Additive expert ensembles were introduced by Kolter and Maloof (2003) with the Dynamic Weighted Majority (DWM) algorithm, which shares many similarities with the AddExp algorithm we present in this paper. DWM was shown to perform well in situations with concept drift, but the evidence for this performance was entirely empirical, and in fact DWM's weighting scheme makes it impossible to bound performance in the worst case. DWM handles only discrete classes, can become unstable in noisy settings, and has a pruning mechanism that provides no guarantee as to the number of experts created. The AddExp algorithm, as we will describe, allows for theoretical analysis, generalizes to regression tasks, provides pruning mechanisms that cap the number of experts, and achieves better empirical results than DWM.

3. AddExp for Discrete Classes

In this section, we present the AddExp.D algorithm (see Figure 1) for handling discrete class predictions. As its concept description, AddExp.D maintains an ensemble of predictive models, referred to as experts, each with an associated weight. Experts use the same algorithm for training and prediction, but are created at different time steps.

Algorithm AddExp.D($\{\mathbf{x}, y\}^T$, $\beta$, $\gamma$)
Parameters:
  $\{\mathbf{x}, y\}^T$: training data with class $y \in Y$
  $\beta \in [0, 1]$: factor for decreasing weights
  $\gamma \in [0, 1]$: factor for new expert weight
Initialization:
  1. Set the initial number of experts $N_1 = 1$.
  2. Set the initial expert weight $w_{1,1} = 1$.
For $t = 1, 2, \ldots, T$:
  1. Get expert predictions $\xi_{t,1}, \ldots, \xi_{t,N_t} \in Y$.
  2. Output prediction: $\hat{y}_t = \arg\max_{c \in Y} \sum_{i=1}^{N_t} w_{t,i} [c = \xi_{t,i}]$.
  3. Update expert weights: $w_{t+1,i} = w_{t,i} \beta^{[y_t \neq \xi_{t,i}]}$.
  4. If $\hat{y}_t \neq y_t$, add a new expert: $N_{t+1} = N_t + 1$ and $w_{t+1,N_{t+1}} = \gamma \sum_{i=1}^{N_t} w_{t,i}$.
  5. Train each expert on example $(\mathbf{x}_t, y_t)$.

Figure 1. AddExp for discrete class predictions.

The performance element of the algorithm uses a weighted vote of all the experts. For each possible classification, the algorithm sums the weights of all the experts predicting that classification, and predicts the classification with the greatest weight. The learning element of the algorithm first predicts the classification of the training example. The weights of all the experts that misclassified the example are decreased by a multiplicative constant $\beta$. If the overall prediction was incorrect, a new expert is added to the ensemble with weight equal to the total weight of the ensemble times some constant $\gamma$. Finally, all experts are trained on the example.

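The procedure can be sketched compactly in Python. The following is our minimal reading of Figure 1, not the authors' code; the expert interface (a make_expert factory producing objects with partial_fit and predict methods) is an assumption for illustration. Figure 1 sets the new expert's weight to gamma times the total ensemble weight; following the analysis in Section 3.1, we take this to be the pre-update total W_t.

class AddExpD:
    """Sketch of AddExp.D (Figure 1). `make_expert` is any factory for an
    online learner exposing partial_fit(x, y) and predict(x); this
    interface is illustrative, not specified in the paper."""

    def __init__(self, make_expert, beta=0.5, gamma=0.1):
        self.make_expert = make_expert
        self.beta, self.gamma = beta, gamma
        self.experts = [make_expert()]   # N_1 = 1
        self.weights = [1.0]             # w_{1,1} = 1

    def learn(self, x, y):
        preds = [e.predict(x) for e in self.experts]
        # Step 2: weighted vote; sum the weight behind each class.
        votes = {}
        for w, c in zip(self.weights, preds):
            votes[c] = votes.get(c, 0.0) + w
        y_hat = max(votes, key=votes.get)
        W = sum(self.weights)  # total weight W_t before any update
        # Step 3: multiply the weight of every expert that erred by beta.
        self.weights = [w * (self.beta if c != y else 1.0)
                        for w, c in zip(self.weights, preds)]
        # Step 4: on a global mistake, add a new expert with weight gamma * W_t.
        if y_hat != y:
            self.experts.append(self.make_expert())
            self.weights.append(self.gamma * W)
        # Step 5: train every expert on the example.
        for e in self.experts:
            e.partial_fit(x, y)
        return y_hat
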
If trained on a large amount of data, AddExp.D has the potential to create a large number of experts. However, since this does not affect the theoretical mistake or loss bounds of the algorithm, we will not deal with this issue until Section 5, where we discuss pruning.

3.1. Analysis

Here we analyze the performance of AddExp.D within the mistake bound model (Littlestone, 1988) by adapting the methods used by Littlestone and Warmuth (1994) to analyze the WM algorithm. As this is a worst-case analysis, we allow for an adversary to choose, at any time step, whether any given expert is going to make a mistake. This may seem odd because, after all, not only do these experts all employ the same learning algorithm, they are also all trained from the same data stream, the only difference being that some experts are trained on a larger subset of the stream than others.

However, if we consider training noise and the sampling effects present in real-world problems, it becomes difficult to conclude anything about the performance of the different experts. Therefore we perform this analysis in the most general case.

We denote the total weight of the ensemble at time step $t$ by $W_t = \sum_{i=1}^{N_t} w_{t,i}$. We also let $M_t$ be the number of mistakes that AddExp.D has made through time steps $1, 2, \ldots, t - 1$. The bound rests on the fact that the total weight of the ensemble decreases exponentially with the number of mistakes.

Theorem 3.1 For any time steps $t_1 < t_2$, if we stipulate that $\beta + 2\gamma < 1$, then the number of mistakes that AddExp.D will make between time steps $t_1$ and $t_2$ can be bounded by
$$M_{t_2} - M_{t_1} \leq \frac{\log(W_{t_1}/W_{t_2})}{\log(2/(1 + \beta + 2\gamma))}.$$

Proof. AddExp.D operates by multiplying the weights of the experts that predicted incorrectly by $\beta$. If we assume that a mistake is made at time step $t$, then we know that at least half of the weight in the ensemble predicted incorrectly. In addition, a new expert will be added with weight $\gamma W_t$. So we can bound the weight with
$$W_{t+1} \leq \frac{1}{2} W_t + \frac{\beta}{2} W_t + \gamma W_t = \frac{1 + \beta + 2\gamma}{2} W_t.$$
Applying this function recursively leads to
$$W_{t_2} \leq \left(\frac{1 + \beta + 2\gamma}{2}\right)^{M_{t_2} - M_{t_1}} W_{t_1}.$$
Taking the logarithm and rearranging terms gives the desired bound, which will hold since the requirement that $\beta + 2\gamma < 1$ ensures that $(1 + \beta + 2\gamma)/2 < 1$.

Now suppose that the algorithm makes a mistake at time $t_1$, and so a new expert is added. The initial weight of this expert will be $\gamma W_{t_1}$. Let $m$ be the number of mistakes made by the new expert by time step $t_2$. The weight of the expert is multiplied by $\beta$ for each mistake it makes, so its weight at time $t_2$ will be $\gamma W_{t_1} \beta^m$. Since the total weight of the ensemble must be greater than that of any individual member,
$$W_{t_2} \geq \gamma W_{t_1} \beta^m. \quad (1)$$

Corollary 3.2 For any time steps $t_1 < t_2$, if AddExp.D makes a mistake at time $t_1$ and therefore adds a new expert to the pool, then the number of mistakes that AddExp.D makes between time steps $t_1$ and $t_2$ can be bounded by
$$M_{t_2} - M_{t_1} \leq \frac{m \log(1/\beta) + \log(1/\gamma)}{\log(2/(1 + \beta + 2\gamma))},$$
where $m$ is the number of mistakes that the new expert makes in this time frame.

Proof. The bound is obtained by substituting (1) into the result of Theorem 3.1.

Finally, we show how this analysis applies to concept drift.

Corollary 3.3 Let $S_1, S_2, \ldots, S_k$ be any arbitrary partitioning of the input sequence $S$ into $k$ subsequences, such that AddExp.D makes a mistake on the first example of each subsequence. Then for each $S_i$, AddExp.D makes no more than
$$\min\left\{|S_i|, \; \frac{m_i \log(1/\beta) + \log(1/\gamma)}{\log(2/(1 + \beta + 2\gamma))}\right\}$$
mistakes, where $m_i$ is the number of mistakes made by the base learner when trained only on $S_i$.

Proof. The result follows by applying Corollary 3.2 to each subsequence $S_i$.

It is important to note that the requirement that AddExp.D makes a mistake on the first example of each subsequence is not a practical limitation, since subsequences need not correspond directly to concepts. Consider the subsequence that begins whenever AddExp.D makes its first mistake on some concept, and ends at the last example of this concept. We can bound the performance of AddExp.D on this subsequence relative to the performance of the base learner.

And since we know that AddExp.D made no mistakes on the concept before the subsequence begins, this also serves as a bound for the performance of AddExp.D over the entire concept. Additionally, while this bound is most applicable to scenarios of sudden drift, it is by no means limited to these situations. It only requires that at some point during online training, a new expert will perform better than experts trained on old data, which can also happen in situations of gradual drift.

However, there is another difficulty when considering the theoretical performance of AddExp.D. It can be shown that the coefficient on the $m$ term in Corollary 3.2 can never be reduced lower than 2, which can be a significant problem if, for example, the base learner makes a mistake 20% of the time.

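As a quick numeric illustration (ours, not the paper's), a grid search over admissible parameter values confirms that this coefficient stays above 2:

from math import log

def m_coeff(beta, gamma):
    # Coefficient of m in the Corollary 3.2 bound.
    return log(1 / beta) / log(2 / (1 + beta + 2 * gamma))

grid = [i / 100 for i in range(1, 100)]
print(min(m_coeff(b, g) for b in grid for g in grid if b + 2 * g < 1))
# Prints a value above 2; the infimum of 2 is approached only as
# gamma -> 0 and beta -> 1, which Theorem 3.4 below makes precise.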

Theorem 3.4 For any $\beta, \gamma \in [0, 1]$,
$$\frac{m \log(1/\beta) + \log(1/\gamma)}{\log(2/(1 + \beta + 2\gamma))} \geq 2m.$$

Proof (Sketch). The coefficient of $m$ is minimized as $\gamma \to 0$, so it will always be greater than $\log(1/\beta)/\log(2/(1 + \beta))$. It can be shown that $\ln(1/x) \geq 2(1 - x)/(1 + x)$ for $x \in [0, 1]$. Using this inequality to bound the numerator, and the inequality $\ln(1 + x) \leq x$ to bound the denominator, gives the desired result. This can also be verified by plotting the function for $\beta \in [0, 1]$.

A similar situation occurs in the analysis of the original WM algorithm. The problem is that since only half the weight of the ensemble needs to predict incorrectly in order to make a mistake, an adversary will be able to let the experts conspire to drive up the number of global mistakes. For this reason, we shift our focus to a continuous version of the algorithm analyzed under the online loss model. In the case of binary classes, it is possible to use randomized prediction to obtain a bound on the expected number of mistakes equivalent to the loss bound.

4. AddExp for Continuous Classes

In this section, we present and analyze the AddExp.C algorithm (see Figure 2) for predicting values in the interval [0, 1]. The general operation of the AddExp.C algorithm is the same as that of AddExp.D, except that class predictions, both the overall prediction of AddExp.C and the predictions of each expert, are in [0, 1] rather than in a discrete set of classes $Y$. This necessitates other changes to the algorithm. Rather than predicting with a weighted majority vote, AddExp.C predicts the sum of all expert predictions, weighted by their relative expert weight. When updating weights, the algorithm uses the rule
$$w_{t+1,i} = w_{t,i} \, \beta^{|\xi_{t,i} - y_t|}. \quad (2)$$
As with other expert prediction algorithms, the following analysis also holds for a more general update function $w_{t+1,i} = w_{t,i} \, U_\beta(|\xi_{t,i} - y_t|)$, for any $U_\beta : [0, 1] \to [0, 1]$ with $\beta^r \leq U_\beta(r) \leq 1 - (1 - \beta)r$ for $\beta \geq 0$ and $r \in [0, 1]$. Whereas we previously added a new member to the ensemble whenever the global prediction was incorrect, we now add a new member whenever the global loss of the algorithm on that time step is greater than some threshold $\tau \in [0, 1]$. The weight of this new expert is calculated by
$$w_{t+1,N_{t+1}} = \gamma \sum_{i=1}^{N_t} w_{t,i} \, |\xi_{t,i} - y_t|. \quad (3)$$

Algorithm AddExp.C($\{\mathbf{x}, y\}^T$, $\beta$, $\gamma$, $\tau$)
Parameters:
  $\{\mathbf{x}, y\}^T$: training data with class $y \in [0, 1]$
  $\beta \in [0, 1]$: factor for decreasing weights
  $\gamma \in [0, 1]$: factor for new expert weight
  $\tau \in [0, 1]$: loss required to add a new expert
Initialization:
  1. Set the initial number of experts $N_1 = 1$.
  2. Set the initial expert weight $w_{1,1} = 1$.
For $t = 1, 2, \ldots, T$:
  1. Get expert predictions $\xi_{t,1}, \ldots, \xi_{t,N_t} \in [0, 1]$.
  2. Output prediction: $\hat{y}_t = \sum_{i=1}^{N_t} w_{t,i} \xi_{t,i} / \sum_{i=1}^{N_t} w_{t,i}$.
  3. Suffer loss $|\hat{y}_t - y_t|$.
  4. Update expert weights: $w_{t+1,i} = w_{t,i} \beta^{|\xi_{t,i} - y_t|}$.
  5. If $|\hat{y}_t - y_t| \geq \tau$, add a new expert: $N_{t+1} = N_t + 1$ and $w_{t+1,N_{t+1}} = \gamma \sum_{i=1}^{N_t} w_{t,i} |\xi_{t,i} - y_t|$.
  6. Train each expert on example $(\mathbf{x}_t, y_t)$.

Figure 2. AddExp for continuous class predictions.

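In the same illustrative style as the discrete sketch above, one round of AddExp.C might look as follows. Again the expert interface is our assumption, and we read equation (3) as using the time-t weights, before the decay of step 4; the transcription does not settle this, so treat it as a sketch:

class AddExpC:
    """Sketch of AddExp.C (Figure 2) for predictions in [0, 1]."""

    def __init__(self, make_expert, beta=0.5, gamma=0.1, tau=0.05):
        self.make_expert = make_expert
        self.beta, self.gamma, self.tau = beta, gamma, tau
        self.experts = [make_expert()]
        self.weights = [1.0]

    def learn(self, x, y):
        preds = [e.predict(x) for e in self.experts]
        W = sum(self.weights)
        # Step 2: prediction is the weighted average of expert predictions.
        y_hat = sum(w * p for w, p in zip(self.weights, preds)) / W
        # Equation (3): candidate weight for a new expert, from the
        # time-t weights w_{t,i} and the expert losses |xi_{t,i} - y_t|.
        new_w = self.gamma * sum(w * abs(p - y)
                                 for w, p in zip(self.weights, preds))
        # Step 4, update rule (2): decay each weight by beta ** |xi - y|.
        self.weights = [w * self.beta ** abs(p - y)
                        for w, p in zip(self.weights, preds)]
        # Step 5: add a new expert when the global loss reaches tau.
        if abs(y_hat - y) >= self.tau:
            self.experts.append(self.make_expert())
            self.weights.append(new_w)
        # Step 6: train every expert on the example.
        for e in self.experts:
            e.partial_fit(x, y)
        return y_hat
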
4.1. Analysis

We adapt techniques from Littlestone and Warmuth (1994) and Freund and Schapire (1997) to analyze AddExp.C. Again we use the notation $W_t$ to denote the total weight of the expert pool at time $t$. We let $L_t$ refer to the total absolute loss suffered by the algorithm before time $t$, with $L_t = \sum_{i=1}^{t-1} |\hat{y}_i - y_i|$.

Theorem 4.1 For any time steps $t_1 < t_2$, if we stipulate that $\beta + \gamma < 1$, then the absolute loss of AddExp.C between time steps $t_1$ and $t_2$ can be bounded by
$$L_{t_2} - L_{t_1} \leq \frac{\ln(W_{t_1}/W_{t_2})}{1 - \beta - \gamma}.$$

Proof. It can be shown by a convexity argument (e.g., Littlestone & Warmuth, 1994, Lemma 5.1) that
$$\beta^r \leq 1 - (1 - \beta)r \quad (4)$$
for $\beta \geq 0$ and $r \in [0, 1]$. At each time step, AddExp.C updates the weights according to (2) and may add a new expert with weight determined by (3).

Combining these equations with (4) gives
$$W_{t+1} \leq \sum_{i=1}^{N_t} w_{t,i} \beta^{|\xi_{t,i} - y_t|} + \gamma \sum_{i=1}^{N_t} w_{t,i} |\xi_{t,i} - y_t| \leq \sum_{i=1}^{N_t} w_{t,i} \left(1 - (1 - \beta - \gamma)|\xi_{t,i} - y_t|\right) \leq W_t \left(1 - (1 - \beta - \gamma)|\hat{y}_t - y_t|\right),$$
where the last step uses $\sum_i w_{t,i}|\xi_{t,i} - y_t| \geq W_t |\hat{y}_t - y_t|$, which holds by convexity of the absolute loss. Applying this repeatedly for $t = t_1, \ldots, t_2 - 1$ gives
$$W_{t_2} \leq W_{t_1} \prod_{t=t_1}^{t_2 - 1} \left(1 - (1 - \beta - \gamma)|\hat{y}_t - y_t|\right).$$
The theorem follows by taking the natural logarithm and using the inequality $\ln(1 + x) \leq x$.

Now suppose that the loss at time step $t_1$ is greater than $\tau$, so a new expert is added with initial weight $\gamma \sum_{i=1}^{N_{t_1}} w_{t_1,i} |\xi_{t_1,i} - y_{t_1}| \geq \gamma |\hat{y}_{t_1} - y_{t_1}| W_{t_1}$. Let $l$ be the loss that the new expert has suffered by time $t_2$. At time step $t_2$, the weight of this expert will be at least $\beta^l \gamma |\hat{y}_{t_1} - y_{t_1}| W_{t_1}$. And since the total weight of the ensemble must be greater than that of any individual member, we have
$$W_{t_2} \geq \beta^l \gamma |\hat{y}_{t_1} - y_{t_1}| W_{t_1}. \quad (5)$$
We use this to prove corollaries corresponding to 3.2 and 3.3 for the continuous case.

Corollary 4.2 For any time steps $t_1 < t_2$, if AddExp.C suffers a loss greater than $\tau$ at time $t_1$ and therefore adds a new expert to the pool, then the loss suffered by AddExp.C between time steps $t_1$ and $t_2$ can be bounded by
$$L_{t_2} - L_{t_1} \leq \frac{l \ln(1/\beta) + \ln(1/\gamma) + \ln(1/\tau)}{1 - \beta - \gamma},$$
where $l$ is the loss suffered by the new expert in this time frame.

Proof. The bound is obtained immediately by substituting (5) into the result of Theorem 4.1 and noticing that $|\hat{y}_{t_1} - y_{t_1}| \geq \tau$.

Corollary 4.3 Let $S_1, S_2, \ldots, S_k$ be any arbitrary partitioning of the input sequence $S$ such that at the first example in each subsequence, AddExp.C suffers loss greater than $\tau$. Then for each subsequence $S_i$, AddExp.C suffers loss no greater than
$$\min\left\{|S_i|, \; \frac{l_i \ln(1/\beta) + \ln(1/\gamma) + \ln(1/\tau)}{1 - \beta - \gamma}\right\},$$
where $l_i$ is the loss suffered by the base learner when trained only on $S_i$.

Proof. The bound is obtained by applying Corollary 4.2 to each partition $S_i$.

As with the discrete case, the requirement that AddExp.C suffer loss greater than $\tau$ at the first example of each subsequence is not a significant practical limitation. It is slightly more difficult in the continuous case, since AddExp.C may continuously suffer loss less than $\tau$, and never add a new expert. However, since $\tau$ can be made arbitrarily small, with only a $\ln(1/\tau)$ penalty, this extra loss suffered can be reduced arbitrarily close to zero.

4.2. Optimizing Parameters

In this section, we present methods for choosing parameters $\beta$ and $\gamma$ based on previous knowledge of the problem. Specifically, if we have an upper bound $\tilde{l}$ on the loss of a given expert, then we can choose $\beta$ and $\gamma$ so as to minimize the loss bound for AddExp.C. However, due to the nature of the bound, it is very difficult to optimize the parameters precisely, and so approximations are used. Well-chosen values for $\beta$ and $\gamma$ result in analysis that is remarkably similar to the analysis of WM-style algorithms. We make use of the following lemma from Freund and Schapire (1997), slightly rewritten:

Lemma 4.4 (Freund & Schapire, 1997) Suppose $0 \leq A \leq \tilde{A}$ and $0 \leq B \leq \tilde{B}$. Let $\alpha = 1 + \sqrt{2\tilde{B}/\tilde{A}}$. Then
$$\frac{A \ln \alpha + B}{1 - (1/\alpha)} \leq A + B + \sqrt{2\tilde{A}\tilde{B}}.$$

Empirical optimization of the loss bound suggests that the bound reaches a minimum when $\gamma = \beta/\tilde{l}$. Using this conjecture, we can choose values for $\beta$ and $\gamma$ such that the resulting equation is an instance of Lemma 4.4. Specifically, we choose the values
$$\beta = \frac{\tilde{l}}{(\tilde{l} + 1)\, g(\tilde{l})} \quad \text{and} \quad \gamma = \frac{1}{(\tilde{l} + 1)\, g(\tilde{l})}, \quad (6)$$
where
$$g(\tilde{l}) = 1 + \sqrt{2\left(\ln^*(\tilde{l} + 1) + \ln(1/\tau)\right)/(\tilde{l} + 1)}. \quad (7)$$
(Here we define the function $\ln^*(x) = x \ln x - (x - 1)\ln(x - 1)$.)

We use the name $\ln^*$ since this function is asymptotically similar to the $\ln$ function, and can be bounded by $\ln^*(x) \leq \ln x + 1$ for $x > 1$. The difference is that $\ln^*(x)$ approaches 0 as $x$ approaches 1, so the bound is stronger.

Substituting these values into the bound from Corollary 4.2 satisfies the conditions of Lemma 4.4 with $\tilde{A} = \tilde{l} + 1$, $\tilde{B} = \ln^*(\tilde{l} + 1) + \ln(1/\tau)$, and $\alpha = g(\tilde{l})$. So we can obtain the new loss bound
$$L_{t_2} - L_{t_1} \leq l + 1 + \ln^*(\tilde{l} + 1) + \ln(1/\tau) + \sqrt{2(\tilde{l} + 1)\left(\ln^*(\tilde{l} + 1) + \ln(1/\tau)\right)}.$$
If we choose $\tilde{l}$ and $\tau$ such that $\tilde{l} \geq 1/\tau$, then the net loss of AddExp.C over this interval, $L_{t_2} - L_{t_1} - l$, is $O(\sqrt{\tilde{l} \ln \tilde{l}})$. This bears a strong resemblance to the bound proven by Cesa-Bianchi et al. (1997) and others for the worst-case net loss of WM-style algorithms, $O(\sqrt{l \ln n})$, where $n$ is the number of experts.

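For concreteness, a small helper (our sketch, not the paper's code) that evaluates (6) and (7) for a given loss bound and threshold:

from math import log, sqrt

def ln_star(x):
    # ln*(x) = x ln x - (x - 1) ln(x - 1), defined for x > 1.
    return x * log(x) - (x - 1) * log(x - 1)

def tuned_params(l_bound, tau):
    # Equations (6) and (7); l_bound plays the role of the upper
    # bound on the loss of a given expert.
    g = 1 + sqrt(2 * (ln_star(l_bound + 1) + log(1 / tau)) / (l_bound + 1))
    beta = l_bound / ((l_bound + 1) * g)
    gamma = 1 / ((l_bound + 1) * g)
    return beta, gamma

beta, gamma = tuned_params(100, 0.05)
# Note gamma equals beta / l_bound, matching the conjectured optimum,
# and beta + gamma = 1 / g < 1, as required by Theorem 4.1.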

5. Pruning

Until now, we have ignored issues of efficiency. AddExp can potentially add a new expert every time step, leading to inefficiency as the length of the input grows. In this section, we present two pruning techniques: one that has theoretical guarantees but which may be impractical, and one that performs well in practice but which has no theoretical guarantees.

Ideally, pruning removes unnecessary experts from the ensemble while keeping the best experts. If done correctly, it will not affect the mistake or loss bounds proven in Sections 3 and 4. These bounds depended only on the fact that the weight of the ensemble decreases over time, and pruning does not interfere with this. If we are concerned with the bound between time steps $t_1$ and $t_2$, the only requirement is that we do not remove the best expert during this time period.

Pruning Method 1 (Oldest First). When a new expert is added to the ensemble, if the number of experts is greater than some constant $K$, remove the oldest expert before adding the new member.

Theorem 5.1 Let $S_1, S_2, \ldots, S_k$ be a partitioning of the input series as discussed in Corollaries 3.3 and 4.3. Suppose that we have an upper bound $\tilde{m}$ or $\tilde{l}$ on the maximum mistakes or loss of the base learner on any partition. Then AddExp trained on $S$, using the Oldest First pruning method with suitably chosen $K$, will have the same mistake or loss bound as AddExp without pruning.

Proof. For the discrete case, let
$$K = \frac{\tilde{m} \log(1/\beta) + \log(1/\gamma)}{\log(2/(1 + \beta + 2\gamma))}.$$
By Corollary 3.3, AddExp.D will make no more than $K$ mistakes when trained on any partition $S_i$. Since a new expert is added only when a mistake is made, this also serves as a bound for the number of experts that can be created when trained on one partition. Therefore, if more than $K$ experts are created, the oldest expert must span more than one partition, and we can remove it, since the bound depends only on the expert added at the beginning of each partition. In the continuous case, we select
$$K = \frac{\tilde{l} \ln(1/\beta) + \ln(1/\gamma) + \ln(1/\tau)}{1 - \beta - \gamma} \cdot \frac{1}{\tau},$$
as this is a worst-case bound on the number of new experts that can be created. After this, the proof parallels that of the discrete case.

Unfortunately, this method may be difficult to use in practice, since we may not be able to obtain $\tilde{l}$ or $\tilde{m}$. Even if we could, the number of experts required might still be prohibitively large. Additionally, if we select a $K$ that is too small, Oldest First may seriously degrade the performance of AddExp since, when trained on a static concept, it is precisely the oldest expert that is usually the most valuable. For this reason, we introduce a more practical pruning method.

Pruning Method 2 (Weakest First). When a new expert is added to the ensemble, if the number of experts is greater than a constant $K$, remove the expert with the lowest weight before adding the new member.

Weakest First performs well in practice, because it removes the worst performing experts. However, it is difficult to determine performance in the worst case, since an adversary can always degrade the initial performance of the optimal expert such that its weight falls below other experts. We must also be careful in choosing $K$ in conjunction with $\gamma$, as we do not want new experts to be pruned before they are given a chance to learn. However, as we show in the next section, Weakest First works well in practice, even with small values of $K$.

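As a sketch of how Weakest First would slot into the AddExp.D sketch of Section 3 (our illustration; Oldest First differs only in evicting index 0, the earliest-created expert):

def add_expert_weakest_first(self, new_weight, K):
    # Before adding a new expert, evict the lowest-weight expert
    # whenever the ensemble has already reached the cap K.
    if len(self.experts) >= K:
        i = min(range(len(self.weights)), key=self.weights.__getitem__)
        del self.experts[i]
        del self.weights[i]
    self.experts.append(self.make_expert())
    self.weights.append(new_weight)
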
6. Empirical Results

To provide empirical support for AddExp, we evaluated it on the STAGGER concepts (Schlimmer & Granger, 1986), a standard benchmark for concept drift, and on classification and regression tasks involving a drifting hyperplane.

For the STAGGER concepts, each example consists of three attributes, color ∈ {green, blue, red}, shape ∈ {triangle, circle, rectangle}, and size ∈ {small, medium, large}, and a class label $y \in \{+, -\}$. Training lasts for 120 time steps, and consists of three target concepts that last 40 time steps each: (1) color = red ∧ size = small, (2) color = green ∨ shape = circle, and (3) size = medium ∨ size = large. At each time step, the learner is trained on one example, and tested on 100 examples generated randomly according to the current concept.

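A generator for this stream is straightforward to reconstruct from the description (our sketch; uniform sampling of the attributes is implied by the text but not stated):

import random

COLORS = ["green", "blue", "red"]
SHAPES = ["triangle", "circle", "rectangle"]
SIZES = ["small", "medium", "large"]
CONCEPTS = [
    lambda c, s, z: c == "red" and z == "small",    # steps 1-40
    lambda c, s, z: c == "green" or s == "circle",  # steps 41-80
    lambda c, s, z: z in ("medium", "large"),       # steps 81-120
]

def stagger_stream(T=120):
    for t in range(T):
        target = CONCEPTS[t // 40]  # each target concept lasts 40 steps
        c = random.choice(COLORS)
        s = random.choice(SHAPES)
        z = random.choice(SIZES)
        yield (c, s, z), ("+" if target(c, s, z) else "-")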

To test performance on a larger problem involving noise and more gradual drift, and to evaluate AddExp on a regression task, we created a synthetic problem using a drifting hyperplane. Feature vectors consist of ten variables, $\mathbf{x} \in [0, 1]^{10}$, each with a uniform distribution over this range. Training lasts for 2000 time steps, with 4 target concepts lasting for 500 examples each. At every time step, the learners are trained on one example and tested on 100. For the classification task, $y \in \{+, -\}$, and the general target concept is $(x_i + x_{i+1} + x_{i+2})/3 > 0.5$, with $i = 1, 2, 4,$ and $7$ for each of the 4 concepts, respectively. We also introduced 10% random class noise into the training data. The regression task is identical except that $y \in [0, 1]$ and we eliminated the threshold, i.e., $y = (x_i + x_{i+1} + x_{i+2})/3$ with $i = 1, 2, 4,$ and $7$ for each of the 4 concepts. Instead of class noise, we introduced a ±0.1 random variate to $y$, clipping $y$ if it was not in [0, 1].

We evaluated AddExp.D on the STAGGER and hyperplane classification tasks, and AddExp.C on the hyperplane regression task. Based on pilot studies, we chose parameters of $\beta = 0.5$, $\gamma = 0.1$ for STAGGER; $\beta = 0.5$, $\gamma = 0.01$ for the hyperplane classification task; and $\beta = 0.5$, $\gamma = 0.1$, $\tau = 0.05$ for hyperplane regression. We used an incremental version of naive Bayes as the base learner for AddExp.D and an online batch version of least squares regression as the base learner for AddExp.C. For the sake of comparison, we evaluated the performance of the base learner over the entire data set, and on each concept individually. These serve as worst-case and best-case comparisons, respectively, since the base learners have no mechanisms for adjusting to concept drift, and conversely, the base learners trained on each concept individually are not hindered by previous data. We also evaluated DWM on the classification tasks using naive Bayes as a base learner, $\beta = 0.5$, and allowing new experts to be added every time step, mirroring AddExp.D. For all simulations, we averaged accuracies over 50 runs, and calculated 95% confidence intervals.

[Figure 3. Predictive accuracy on the STAGGER concepts.]
[Figure 4. Predictive accuracy on the hyperplane problem.]
[Figure 5. Average absolute loss on a continuous version of the hyperplane problem.]

Figure 3 shows the performance of AddExp.D on the STAGGER data set. All algorithms except DWM perform virtually the same on the first concept, since no drift takes place. Once the target concept changes, naive Bayes cannot adapt quickly to the new concept, but AddExp.D is able to quickly converge to performance similar to naive Bayes trained only on the new target concept. Figure 4 shows a similar result for the hyperplane classification task. AddExp.D converges quickly to all concepts.

However, it converges slowest to the second concept, the most gradual of the three drifts, since only one relevant variable changes, and $x_2$ and $x_3$ are relevant for both concepts. After the drift occurs, AddExp.D's performance is initially better than naive Bayes', because of the old experts and the overlap of the two concepts. Convergence is slower because it takes longer for the new experts to outweigh the old ones.

This shows AddExp.D balancing the predictions of its old and new experts, which is important for real-world applications, since it is not known in advance whether drift will be gradual or sudden. Finally, Figure 5 shows similar behavior for AddExp.C on the hyperplane regression task.

DWM performs slightly worse than AddExp.D on the first STAGGER concept (Figure 3), but outperforms it on the last. This is because DWM gives more weight to a new expert (set initially to the highest weight in the ensemble), which can accelerate convergence, but it can also let new experts become dominant too quickly, thereby decreasing performance. This is particularly harmful in noisy scenarios, as shown by Figure 4. To deal with this problem, DWM uses a parameter such that new experts can only be added periodically. However, this prevents DWM from recognizing drift during this period, and does not alleviate the problem entirely. Therefore, we argue that AddExp is a more robust solution to this problem, since the $\gamma$ parameter allows one to precisely dictate the weight of new experts, and the bounds we have proven show that new experts can never dominate entirely if an older expert performs better.

By the end of the 2000 training examples on the hyperplane classification task, AddExp.D created an average of 385 experts, causing the algorithm to be much slower than naive Bayes. This is especially a problem in noisy domains, since new experts will continually be created when the algorithm misclassifies noisy examples. Fortunately, the Weakest First method of pruning works remarkably well on this task. The learning curves for AddExp.D with a maximum of 10 or 20 experts are virtually indistinguishable from the unpruned learning curve. The average accuracy over all time steps for unpruned AddExp.D is 87.34% ± 0.06, while Weakest First achieves average accuracies of 87.27% ± 0.06 for $K = 20$ and 87.08% ± 0.06 for $K = 10$.

7. Concluding Remarks

We presented the additive expert ensemble algorithm AddExp, a general method for using any online learner to handle concept drift. We proved mistake and loss bounds on a subsequence of the training data relative to the actual performance of the base learner, the first theoretical analysis of this type. We proposed two pruning methods that limit the number of experts in the ensemble, and evaluated the performance of the algorithm on two synthetic data sets. In future work, we would like to conduct more empirical analysis, analyze additive expert ensembles under other loss functions, and investigate methods for dynamically adjusting the parameters of the algorithm during online learning to maximize performance.

Acknowledgements

The authors thank the Georgetown Undergraduate Research Opportunities Program for funding, and thank Lisa Singh and the anonymous reviewers for helpful comments on earlier drafts of this article.

References

Blum, A. (1997). Empirical support for Winnow and Weighted-Majority algorithms. Machine Learning, 26, 5-23.
Bousquet, O., & Warmuth, M. K. (2003). Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3, 363-396.
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM, 44, 427-485.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Haussler, D., Kivinen, J., & Warmuth, M. K. (1998). Sequential predictions of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44, 1906-1925.
Helmbold, D. P., & Long, P. M. (1994). Tracking drifting concepts by minimizing disagreements. Machine Learning, 14, 27-45.
Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32, 151-178.
Kolter, J. Z., & Maloof, M. A. (2003). Dynamic weighted majority. Proceedings of the 3rd IEEE ICDM Conference (pp. 123-130). IEEE Press.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318.
Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108, 212-261.
Monteleoni, C., & Jaakkola, T. S. (2004). Online learning of non-stationary sequences. Proceedings of the 16th NIPS Conference. MIT Press.
Schlimmer, J., & Granger, R. (1986). Beyond incremental processing: Tracking concept drift. Proceedings of the 5th AAAI Conference (pp. 502-507). AAAI Press.
Street, W., & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the 7th SIGKDD Conference (pp. 377-382). ACM Press.
Vovk, V. (1990). Aggregating strategies. Proceedings of the 3rd COLT Workshop (pp. 371-386). Morgan Kaufmann.
Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences, 56, 153-173.
Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69-101.