AdaBoost

AdaBoost, which stands for "Adaptive Boosting", is an ensemble learning algorithm that uses the boosting paradigm [1]. We will discuss AdaBoost for binary classification. That is, we assume that we are given a training set

$$S := \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

where $\forall i, \ y_i \in \{-1, 1\}$, and a pool of hypothesis functions $\mathcal{H}$ from which we are to pick $T$ hypotheses in order to form an ensemble $H$. $H$ then makes a decision using the individual hypotheses $h_1, \ldots, h_T$ in the ensemble as follows:

$$H(x) := \sum_{i=1}^{T} \alpha_i h_i(x) \tag{1}$$

That is, $H$ uses a linear combination of the decisions of each of the $h_i$ hypotheses in the ensemble. The AdaBoost algorithm sequentially chooses each $h_i$ from $\mathcal{H}$ and assigns this hypothesis a weight $\alpha_i$. We let $H_t$ be the classifier formed by the first $t$ hypotheses. That is,

$$H_t(x) := \sum_{i=1}^{t} \alpha_i h_i(x) = H_{t-1}(x) + \alpha_t h_t(x)$$

where $H_0(x) := 0$. That is, the empty ensemble always outputs 0. The idea behind the AdaBoost algorithm is that the $t$th hypothesis will correct for the errors that the first $t-1$ hypotheses make on the training set. More specifically, after we select the first $t-1$ hypotheses, we determine which instances in $S$ these $t-1$ hypotheses perform poorly on and make sure that the $t$th hypothesis performs well on these instances. The pseudocode for AdaBoost is described in Algorithm 1. A high-level overview of the algorithm is described below:

1. Initialize a training set distribution

At each iteration $t = 1, \ldots, T$ of the AdaBoost algorithm, we define a probability distribution $D_t$ over the training instances in $S$. We let $D_t$ be the probability distribution at the $t$th iteration and $D_t(i)$ be the probability assigned to the $i$th training instance, $(x_i, y_i) \in S$, according to $D_t$. As the algorithm proceeds, each iteration designs $D_t$ so that it assigns higher probability mass to instances that the first $t-1$ hypotheses performed poorly on. That is, the worse the performance on $x_i$, the higher $D_t(i)$ will be. At the onset of the algorithm, we set $D_1$ to be the uniform distribution over the instances. That is,

$$\forall i \in \{1, 2, \ldots, n\}, \quad D_1(i) := \frac{1}{n}$$

where $n$ is the size of $S$.
Algorithm 1: AdaBoost for binary classification

Precondition: a training set $S := \{(x_1, y_1), \ldots, (x_n, y_n)\}$, a hypothesis space $\mathcal{H}$, and a number of iterations $T$.

    for $i \in \{1, 2, \ldots, n\}$ do
        $D_1(i) \leftarrow \frac{1}{n}$
    end for
    $H \leftarrow \emptyset$
    for $t = 1, \ldots, T$ do
        $h_t \leftarrow \arg\min_{h \in \mathcal{H}} P_{i \sim D_t}(h(x_i) \neq y_i)$    ▷ find a good hypothesis on the weighted training set
        $\epsilon_t \leftarrow P_{i \sim D_t}(h_t(x_i) \neq y_i)$    ▷ compute the hypothesis's error
        $\alpha_t \leftarrow \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$    ▷ compute the hypothesis's weight
        $H \leftarrow H \cup \{(\alpha_t, h_t)\}$    ▷ add the hypothesis to the ensemble
        for $i \in \{1, 2, \ldots, n\}$ do    ▷ update the training set distribution
            $D_{t+1}(i) \leftarrow \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_j D_t(j)\, e^{-\alpha_t y_j h_t(x_j)}}$
        end for
    end for
    return $H$

2. Find a new hypothesis to add to the ensemble

At the $t$th iteration, we search for a new hypothesis, $h_t$, that performs well on $S$ assuming that instances are drawn from $D_t$. By "performs well", we mean that $h_t$ should have a low expected 0-1 loss on $S$ under $D_t$. That is,

$$h_t := \arg\min_{h \in \mathcal{H}} E_{i \sim D_t}\left[\ell_{0\text{-}1}(h, x_i, y_i)\right] = \arg\min_{h \in \mathcal{H}} P_{i \sim D_t}(y_i \neq h(x_i))$$

We call this expected loss the "weighted loss" because the 0-1 loss is not computed on the instances in the training set directly, but rather on the weighted instances in the training set.
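As a concrete illustration of this search, here is a minimal Python sketch of how one might pick the hypothesis with the lowest weighted 0-1 loss when the hypothesis pool is taken to be axis-aligned decision stumps. The stump pool, the NumPy representation of $S$ and $D_t$, and the helper names `weighted_error` and `best_stump` are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def weighted_error(h, X, y, D):
    """Weighted 0-1 loss of hypothesis h under the distribution D over S."""
    preds = np.array([h(x) for x in X])
    return float(np.sum(D[preds != y]))

def best_stump(X, y, D):
    """Search an assumed pool of axis-aligned decision stumps for the
    hypothesis minimizing the weighted 0-1 loss under D (step 2)."""
    best_h, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1, -1):
                # stump: predict +1 on one side of the threshold, -1 on the other
                h = lambda x, f=feature, t=threshold, s=sign: s * (1 if x[f] > t else -1)
                err = weighted_error(h, X, y, D)
                if err < best_err:
                    best_h, best_err = h, err
    return best_h, best_err
```

Any other weak learner that accepts instance weights (for example, a shallow decision tree) could play the same role; stumps are used here only because they make the search explicit.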
3. Assign the new hypothesis a weight

Once we compute $h_t$, we assign $h_t$ a weight $\alpha_t$ based on its performance. More specifically, we give it the weight

$$\alpha_t := \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \tag{2}$$

where $\epsilon_t := P_{i \sim D_t}(y_i \neq h_t(x_i))$. We will soon explain the theoretical justification for this precise weight assignment, but intuitively we see that the higher $\epsilon_t$ is, the larger the denominator and the smaller the numerator of $\frac{1 - \epsilon_t}{\epsilon_t}$ will be; thus, the smaller $\frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ will be. Thus, if the new hypothesis, $h_t$, has a high error, $\epsilon_t$, then we assign this hypothesis a smaller weight. That is, $h_t$ will contribute less to the output of the ensemble $H$.

4. Recompute the training set distribution

Once the new hypothesis is added to the ensemble, we recompute the training set distribution to assign each instance a probability that reflects how poorly the current ensemble $H_t$ performs on it. We compute $D_{t+1}$ as follows:

$$D_{t+1}(i) := \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_j D_t(j)\, e^{-\alpha_t y_j h_t(x_j)}} \tag{3}$$

We will soon explain a theoretical justification for this precise probability assignment, but for now we can gain an intuitive understanding. Note the term $e^{-\alpha_t y_i h_t(x_i)}$. If $h_t(x_i) = y_i$, then $y_i h_t(x_i) = 1$, which means that $e^{-\alpha_t y_i h_t(x_i)} = e^{-\alpha_t}$. If, on the other hand, $h_t(x_i) \neq y_i$, then $y_i h_t(x_i) = -1$, which means that $e^{-\alpha_t y_i h_t(x_i)} = e^{\alpha_t}$. Thus, we see that $e^{-\alpha_t y_i h_t(x_i)}$ is smaller when the hypothesis's prediction agrees with the true label. That is, we assign higher probability to the $i$th instance if $h_t$ was wrong on $x_i$.

5. Repeat steps 2 through 4

Repeat steps 2 through 4 for the remaining $T - 1$ iterations.

Derivation of AdaBoost from first principles

The AdaBoost algorithm can be viewed as an algorithm that searches for hypotheses of the form of Equation 1 in order to minimize the empirical loss under the exponential loss function:

$$\ell_{\exp}(h, x, y) := e^{-y h(x)}$$
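To connect the four steps above with this exponential-loss view, the sketch below (which reuses the hypothetical `best_stump` helper from the earlier sketch) runs the boosting loop and exposes the empirical exponential loss of the growing ensemble. The toy dataset and the small clamp on $\epsilon_t$ are assumptions made purely so that the code runs; they are not part of the algorithm's definition.

```python
import numpy as np

def adaboost(X, y, T, find_hypothesis):
    """Train an AdaBoost ensemble of T hypotheses; returns a list of (alpha_t, h_t)."""
    n = len(y)
    D = np.full(n, 1.0 / n)                      # step 1: D_1 is uniform
    ensemble = []
    for _ in range(T):
        h, eps = find_hypothesis(X, y, D)        # step 2: minimize weighted 0-1 loss
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # numerical guard (assumption, not in Eq. 2)
        alpha = 0.5 * np.log((1 - eps) / eps)    # step 3: hypothesis weight (Equation 2)
        ensemble.append((alpha, h))
        margins = y * np.array([h(x) for x in X])
        D = D * np.exp(-alpha * margins)         # step 4: re-weight instances (Equation 3)
        D = D / D.sum()
    return ensemble

def exp_loss(ensemble, X, y):
    """Empirical exponential loss L_S(H) of the ensemble's real-valued output (Equation 1)."""
    scores = np.array([sum(a * h(x) for a, h in ensemble) for x in X])
    return float(np.mean(np.exp(-y * scores)))

# Toy usage (assumed data): the exponential loss should shrink as T grows.
X = np.array([[0.1], [0.35], [0.4], [0.6], [0.8], [0.9]])
y = np.array([1, -1, 1, 1, -1, -1])
for T in (1, 3, 5):
    print(T, exp_loss(adaboost(X, y, T, best_stump), X, y))
```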
We note that there are many ways in which one might search for a hypothesis of the form of Equation 1 in order to minimize the exponential loss function. The AdaBoost algorithm performs this minimization using a sequential procedure such that, at iteration $t$, we are given $H_{t-1}$ and our goal is to produce $H_t = H_{t-1} + \alpha_t h_t$ where the new $h_t$ and $\alpha_t$ minimize the exponential loss of $H_t$ on the training data. Theorem 1 shows that AdaBoost's choice of $h_t$ minimizes the exponential loss of $H_t$ over the training data. That is,

$$h_t = \arg\min_{h \in \mathcal{H}} L_S(H_{t-1} + Ch)$$

where

$$L_S(H_{t-1} + Ch) := \frac{1}{n} \sum_{i=1}^{n} \ell_{\exp}(H_{t-1} + Ch, x_i, y_i)$$

and $C$ is an arbitrary positive constant. Theorem 2 shows that once $h_t$ is chosen, AdaBoost's choice of $\alpha_t$ then further minimizes the exponential loss of $H_t$ over the training set. That is,

$$\alpha_t := \arg\min_{\alpha} L_S(H_{t-1} + \alpha h_t)$$

Theorem 1. The choice of $h_t$ under AdaBoost,

$$h_t := \arg\min_{h \in \mathcal{H}} P_{i \sim D_t}(y_i \neq h(x_i)),$$

minimizes the exponential loss of $H_t$ over the training set. That is, given an arbitrary positive constant $C$,

$$h_t = \arg\min_{h \in \mathcal{H}} L_S(H_{t-1} + Ch)$$
Proof:

$$
\begin{aligned}
h_t &= \arg\min_{h \in \mathcal{H}} L_S(H_{t-1} + Ch) \\
&= \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i [H_{t-1}(x_i) + C h(x_i)]} \\
&= \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i H_{t-1}(x_i)}\, e^{-C y_i h(x_i)} \\
&= \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} w_{t,i}\, e^{-C y_i h(x_i)} && \text{let } w_{t,i} := e^{-y_i H_{t-1}(x_i)} \\
&= \arg\min_{h \in \mathcal{H}} e^{-C} \sum_{i : h(x_i) = y_i} w_{t,i} + e^{C} \sum_{i : h(x_i) \neq y_i} w_{t,i} && \text{split the summation} \\
&= \arg\min_{h \in \mathcal{H}} e^{-C} \left( \sum_{i} w_{t,i} - \sum_{i : h(x_i) \neq y_i} w_{t,i} \right) + e^{C} \sum_{i : h(x_i) \neq y_i} w_{t,i} \\
&= \arg\min_{h \in \mathcal{H}} e^{-C} \sum_{i} w_{t,i} + \left( e^{C} - e^{-C} \right) \sum_{i : h(x_i) \neq y_i} w_{t,i} \\
&= \arg\min_{h \in \mathcal{H}} K + \left( e^{C} - e^{-C} \right) \sum_{i : h(x_i) \neq y_i} w_{t,i} && K := e^{-C} \sum_{i} w_{t,i} \text{ is a constant} \\
&= \arg\min_{h \in \mathcal{H}} \sum_{i : h(x_i) \neq y_i} w_{t,i} && \left( e^{C} - e^{-C} \right) > 0 \text{ is a constant} \\
&= \arg\min_{h \in \mathcal{H}} \frac{\sum_{i : h(x_i) \neq y_i} w_{t,i}}{\sum_{j} w_{t,j}} && \sum_{j} w_{t,j} > 0 \text{ is a constant} \\
&= \arg\min_{h \in \mathcal{H}} P_{i \sim D_t}(y_i \neq h(x_i)) && \text{see Lemma 1}
\end{aligned}
$$

∎
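The claim of Theorem 1 also lends itself to a quick numerical sanity check: for a fixed partial ensemble and any positive constant $C$, ranking a finite pool of candidate hypotheses by $L_S(H_{t-1} + Ch)$ should pick out the same hypothesis as ranking them by weighted 0-1 error under $D_t$ (up to ties). The sketch below is only illustrative; the candidate pool, the value of $C$, and the function name are assumptions.

```python
import numpy as np

def theorem_1_check(X, y, prev_scores, candidates, C=0.7):
    """Compare the argmin of the exponential loss L_S(H_{t-1} + C h) with the argmin
    of the weighted 0-1 error under D_t over a finite candidate pool (Theorem 1).
    `prev_scores` holds H_{t-1}(x_i) for each training instance."""
    w = np.exp(-y * prev_scores)        # w_{t,i} := exp(-y_i H_{t-1}(x_i))
    D = w / w.sum()                     # D_t(i) = w_{t,i} / sum_j w_{t,j}  (Lemma 1)
    exp_losses, weighted_errs = [], []
    for h in candidates:
        preds = np.array([h(x) for x in X])
        exp_losses.append(np.mean(np.exp(-y * (prev_scores + C * preds))))
        weighted_errs.append(np.sum(D[preds != y]))
    # On generic data (no exact ties) the two rankings share the same minimizer.
    return int(np.argmin(exp_losses)) == int(np.argmin(weighted_errs))
```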
Lemma 1.

$$P_{i \sim D_t}(y_i \neq h(x_i)) = \frac{\sum_{i : h(x_i) \neq y_i} w_{t,i}}{\sum_{j} w_{t,j}}$$

where $w_{t,i} := e^{-y_i H_{t-1}(x_i)}$.

Proof: First, we show that

$$D_t(i) = \frac{w_{t,i}}{\sum_{j} w_{t,j}} \tag{4}$$

We show this fact by induction. First, we prove the base case:

$$\frac{w_{1,i}}{\sum_{j} w_{1,j}} = \frac{e^{-y_i H_0(x_i)}}{\sum_{j} e^{-y_j H_0(x_j)}} = \frac{1}{n} = D_1(i) \quad \text{for all } i, \text{ because } H_0(x_i) = 0$$

Next, we need to prove the inductive step. That is, we prove that

$$D_t(i) = \frac{w_{t,i}}{\sum_{j} w_{t,j}} \implies D_{t+1}(i) = \frac{w_{t+1,i}}{\sum_{j} w_{t+1,j}}$$
This is proven as follows:

$$
\begin{aligned}
D_{t+1}(i) &:= \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j} D_t(j)\, e^{-\alpha_t y_j h_t(x_j)}} && \text{by Equation 3} \\
&= \frac{\frac{w_{t,i}}{\sum_{j} w_{t,j}}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j} \frac{w_{t,j}}{\sum_{k} w_{t,k}}\, e^{-\alpha_t y_j h_t(x_j)}} && \text{by the inductive hypothesis} \\
&= \frac{\frac{e^{-y_i H_{t-1}(x_i)}}{\sum_{j} e^{-y_j H_{t-1}(x_j)}}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j} \frac{e^{-y_j H_{t-1}(x_j)}}{\sum_{k} e^{-y_k H_{t-1}(x_k)}}\, e^{-\alpha_t y_j h_t(x_j)}} && \text{by the fact that } w_{t,i} := e^{-y_i H_{t-1}(x_i)} \\
&= \frac{e^{-y_i H_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j} e^{-y_j H_{t-1}(x_j)}\, e^{-\alpha_t y_j h_t(x_j)}} \\
&= \frac{e^{-y_i H_{t-1}(x_i) - \alpha_t y_i h_t(x_i)}}{\sum_{j} e^{-y_j H_{t-1}(x_j) - \alpha_t y_j h_t(x_j)}} \\
&= \frac{e^{-y_i H_t(x_i)}}{\sum_{j} e^{-y_j H_t(x_j)}} \\
&= \frac{w_{t+1,i}}{\sum_{j} w_{t+1,j}}
\end{aligned}
$$

Now that we have proven Equation 4, it follows that

$$P_{i \sim D_t}(y_i \neq h(x_i)) = \sum_{i : h(x_i) \neq y_i} D_t(i) = \frac{\sum_{i : h(x_i) \neq y_i} w_{t,i}}{\sum_{j} w_{t,j}}$$

∎

Theorem 2. The choice of $\alpha_t$ under AdaBoost,

$$\alpha_t := \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

where $\epsilon_t := P_{i \sim D_t}(y_i \neq h_t(x_i)),$
minimizes the exponential loss of $H_t$ over the training set. That is,

$$\alpha_t = \arg\min_{\alpha} L_S(H_{t-1} + \alpha h_t)$$

Proof: Our goal is to solve

$$\alpha_t := \arg\min_{\alpha} L_S(H_{t-1} + \alpha h_t) = \arg\min_{\alpha} \left[ e^{-\alpha} \sum_{i : h_t(x_i) = y_i} w_{t,i} + e^{\alpha} \sum_{i : h_t(x_i) \neq y_i} w_{t,i} \right]$$

where the right-hand side follows from the split of the summation in the proof of Theorem 1, with $C = \alpha$ and $h = h_t$. To do so, we set the derivative of the function in the argmin to zero and solve for $\alpha$ (the function is convex, though we do not prove it here):

$$
\begin{aligned}
& \frac{d}{d\alpha} \left[ e^{-\alpha} \sum_{i : h_t(x_i) = y_i} w_{t,i} + e^{\alpha} \sum_{i : h_t(x_i) \neq y_i} w_{t,i} \right] = 0 \\
\implies & -e^{-\alpha} \sum_{i : h_t(x_i) = y_i} w_{t,i} + e^{\alpha} \sum_{i : h_t(x_i) \neq y_i} w_{t,i} = 0 \\
\implies & e^{2\alpha} = \frac{\sum_{i : h_t(x_i) = y_i} w_{t,i}}{\sum_{i : h_t(x_i) \neq y_i} w_{t,i}} \\
\implies & 2\alpha = \ln\left( \frac{\sum_{i : h_t(x_i) = y_i} w_{t,i}}{\sum_{i : h_t(x_i) \neq y_i} w_{t,i}} \right) \\
\implies & \alpha = \frac{1}{2} \ln\left( \frac{\sum_{i : h_t(x_i) = y_i} w_{t,i} \,/\, \sum_{j} w_{t,j}}{\sum_{i : h_t(x_i) \neq y_i} w_{t,i} \,/\, \sum_{j} w_{t,j}} \right) \\
\implies & \alpha = \frac{1}{2} \ln\left( \frac{P_{i \sim D_t}(y_i = h_t(x_i))}{P_{i \sim D_t}(y_i \neq h_t(x_i))} \right) \\
\implies & \alpha = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
\end{aligned}
$$

∎
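As a small numerical illustration of Theorem 2: for a fixed weighted error $\epsilon_t$, a dense grid search over $\alpha$ of the objective $(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$ (the split-summation form above, normalized by $\sum_j w_{t,j}$) should land on the closed-form value $\frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}$. The particular $\epsilon_t$ below is an arbitrary choice for illustration.

```python
import numpy as np

eps_t = 0.2                                             # illustrative weighted error
objective = lambda a: (1 - eps_t) * np.exp(-a) + eps_t * np.exp(a)

alphas = np.linspace(-3.0, 3.0, 600001)                 # dense grid over alpha
alpha_grid = alphas[np.argmin(objective(alphas))]
alpha_closed_form = 0.5 * np.log((1 - eps_t) / eps_t)   # Theorem 2

print(alpha_grid, alpha_closed_form)                    # both approximately 0.693
```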
Bibliography

[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1996.