Notation
$\mathbb{R}^n$: feature space
$Y$: class label set
$m$: number of training examples
$D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$): training data set
$\mathcal{H}$: hypothesis space
$h$: base classifier
$H$: ensemble classifier
Joint probability (product rule): $P(X, Y) = P(Y \mid X)P(X) = P(X \mid Y)P(Y)$
Total probability (sum rule): $P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y)P(Y)$
Total probability (sum rule): $P(Y) = \sum_{X} P(X, Y) = \sum_{X} P(Y \mid X)P(X)$
Bayes' theorem: $P(Y \mid X) = \dfrac{P(X \mid Y)P(Y)}{P(X)} = \dfrac{P(X \mid Y)P(Y)}{\sum_{Y} P(X \mid Y)P(Y)}$
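As a quick numerical check of the sum rule and Bayes' theorem, here is a minimal sketch; the class priors and likelihoods below are invented for illustration and are not from the slides.

```python
# Minimal numerical illustration of the sum rule and Bayes' theorem.
# The priors and likelihoods are made up for illustration only.

prior = {"spam": 0.3, "ham": 0.7}         # P(Y)
likelihood = {"spam": 0.8, "ham": 0.1}    # P(X | Y) for some fixed observation X

# Sum rule: P(X) = sum_Y P(X | Y) P(Y)
p_x = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
posterior = {y: likelihood[y] * prior[y] / p_x for y in prior}

print(p_x)        # 0.31
print(posterior)  # {'spam': ~0.774, 'ham': ~0.226}
```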
Bayes optimal classifier
Ensemble methods try to combine several models into one model; each base model is assigned a weight based on its contribution to the classification task.
Key questions: which models should be considered? how should they be chosen from the hypothesis space? how should a weight be assigned to each model?
Given a training data set $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$), the goal is to obtain an ensemble classifier that assigns a label to an unseen $x$ as:
$H(x) = y^* = \arg\max_{y \in Y} \sum_{h \in \mathcal{H}} p(y \mid x, h)\, p(h \mid D)$
Bayes optimal classifier
Assume the training examples are drawn independently:
$p(h \mid D) = \dfrac{p(h)\,p(D \mid h)}{p(D)} = \dfrac{p(h)\prod_{i=1}^{m} p(x_i \mid h)}{p(D)}$
Since $p(D)$ does not depend on $h$, it follows that:
$H(x) = y^* = \arg\max_{y \in Y} \sum_{h \in \mathcal{H}} p(y \mid x, h)\, p(D \mid h)\, p(h)$
The Bayes optimal classifier is the ideal ensemble method. However, it cannot be practically implemented, since it requires summing over the entire hypothesis space.
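The combination rule itself is easy to illustrate on a toy, hand-enumerated hypothesis space; the hypotheses and posterior values below are assumptions for illustration only, and the impracticality comes precisely from the fact that real hypothesis spaces cannot be enumerated like this.

```python
# Toy illustration of the Bayes optimal combination rule on a tiny,
# hand-made hypothesis space with hard (0/1) predictions.

labels = [-1, +1]

# Each entry is (hypothesis: x -> label, assumed posterior p(h | D)),
# with the posteriors already normalized to sum to 1.
hypotheses = [
    (lambda x: +1 if x[0] > 0 else -1, 0.5),
    (lambda x: +1 if x[1] > 0 else -1, 0.3),
    (lambda x: -1,                      0.2),
]

def bayes_optimal_predict(x):
    # score(y) = sum_h p(y | x, h) * p(h | D); with hard predictions,
    # p(y | x, h) is 1 if h predicts y and 0 otherwise.
    scores = {y: sum(p for h, p in hypotheses if h(x) == y) for y in labels}
    return max(scores, key=scores.get)

print(bayes_optimal_predict((1.0, 2.0)))    # +1 (total weight 0.8 vs 0.2)
print(bayes_optimal_predict((-1.0, 2.0)))   # -1 (total weight 0.7 vs 0.3)
```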
Bayesian Model Averaging
Ref.: J. Hoeting et al., Bayesian model averaging: a tutorial, Statistical Science, 14(4): 382-417, 1999.
Models are sampled using a Monte Carlo sampling technique. A simpler way: treat a model trained on a random subset of the training data as a sampled model.
Computation of $p(h)$: when no prior knowledge is available, use a uniform distribution without normalization, $p(h) = 1$ for all hypotheses.
Computation of $p(D \mid h)$: let the error rate of hypothesis $h$ on the training data be $\varepsilon(h)$. Then $p(x_i \mid h)$ is computed as
$p(x_i \mid h) = \exp\{\varepsilon(h)\ln(\varepsilon(h)) + (1 - \varepsilon(h))\ln(1 - \varepsilon(h))\}$
and, since all $m$ factors are identical,
$p(D \mid h) = \prod_{i=1}^{m} p(x_i \mid h) = \exp\{m[\varepsilon(h)\ln(\varepsilon(h)) + (1 - \varepsilon(h))\ln(1 - \varepsilon(h))]\}$
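To get a feel for how sharply this likelihood weight concentrates on low-error hypotheses, here is a small worked computation; the sample size and error rates are made up for illustration.

```python
import math

# Hypothetical example (numbers not from the slides): compare the BMA
# likelihood weight p(D | h) for two classifiers on m = 100 examples.
def bma_log_likelihood(eps, m):
    # log p(D | h) = m * [eps*ln(eps) + (1 - eps)*ln(1 - eps)]
    return m * (eps * math.log(eps) + (1 - eps) * math.log(1 - eps))

m = 100
for eps in (0.1, 0.2):
    print(eps, bma_log_likelihood(eps, m))
# eps = 0.1 -> log p(D|h) ~ -32.5;  eps = 0.2 -> ~ -50.0
# After exponentiation the gap is a factor of roughly e^17, which is why BMA
# concentrates almost all of its weight on the lowest-training-error classifier
# (the over-fitting issue noted on the next slide).
```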
Bayesian Model Averaging

Algorithm 1 Bayesian Model Averaging
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ with size $m$ by randomly sampling from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4:   Set $p(h_t) = 1$
5:   Calculate $\varepsilon(h_t)$ on $D_t$
6:   Calculate $p(D_t \mid h_t) = \exp\{m[\varepsilon(h_t)\ln(\varepsilon(h_t)) + (1 - \varepsilon(h_t))\ln(1 - \varepsilon(h_t))]\}$
7:   Set $\mathrm{weight}(h_t) = p(D_t \mid h_t)\,p(h_t)$
8: end for
9: Normalize all the weights to sum to 1
10: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^{T} p(y \mid x, h_t)\,\mathrm{weight}(h_t)$

Expected error of Bayesian model averaging: at most twice the expected error of the Bayes optimal classifier.
Over-fitting problems: it prefers the hypothesis with the lowest error on the training data rather than the hypothesis with the lowest generalization error, so it effectively conducts a selection of classifiers instead of combining them.
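A minimal sketch of Algorithm 1, assuming scikit-learn decision trees as base classifiers; the function names are our own, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bayesian_model_averaging(X, y, T=10, rng=np.random.default_rng(0)):
    m = len(X)
    classifiers, log_weights = [], []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # random sample of size m
        h = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        eps = np.clip(np.mean(h.predict(X[idx]) != y[idx]), 1e-6, 1 - 1e-6)
        # log p(D_t | h_t); p(h_t) = 1, so this is also the (log) weight
        log_weights.append(m * (eps * np.log(eps) + (1 - eps) * np.log(1 - eps)))
        classifiers.append(h)
    # Normalize the weights to sum to 1 (subtract the max for numerical stability)
    w = np.exp(np.array(log_weights) - np.max(log_weights))
    return classifiers, w / w.sum()

def bma_predict(classifiers, w, X, labels):
    # H(x) = argmax_y sum_t 1(h_t(x) = y) * weight(h_t)
    votes = np.zeros((len(X), len(labels)))
    for h, wt in zip(classifiers, w):
        pred = h.predict(X)
        for j, lab in enumerate(labels):
            votes[:, j] += wt * (pred == lab)
    return np.array(labels)[np.argmax(votes, axis=1)]
```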
Bayesian Model Combination

Algorithm 2 Bayesian Model Combination
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ with size $m$ by randomly sampling from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4:   Set $\mathrm{weight}(h_t) = 0$
5: end for
6: $\mathrm{SumWeight} = 0$
7: $z = -\infty$
8: Set the iteration number for computing the weights: $\mathrm{iteration}$

To overcome the over-fitting problem, it directly samples from the space of possible ensemble hypotheses. It regards all the base classifiers as one model and iteratively calculates their weights simultaneously.
Bayesian Model Combination

Algorithm 3 Bayesian Model Combination (cont.)
1: for $\mathrm{iter} \leftarrow 1$ to $\mathrm{iteration}$ do
2:   for each base classifier do
3:     Draw a temporary weight:
4:     $\mathrm{TempWeight}(h_t) = -\ln(\mathrm{randuniform}(0, 1))$
5:   end for
6:   Normalize TempWeight to sum to 1
7:   Combine the base classifiers as $H' = \sum_{t=1}^{T} \mathrm{TempWeight}(h_t)\, h_t$
8:   Calculate $\varepsilon(H')$ on $D$
9:   Calculate $p(D \mid H') = \exp\{m[\varepsilon(H')\ln(\varepsilon(H')) + (1 - \varepsilon(H'))\ln(1 - \varepsilon(H'))]\}$
10:  if $p(D \mid H') > z$ then
11:    for each base classifier do
12:      $\mathrm{weight}(h_t) = \mathrm{weight}(h_t)\exp\{z - p(D \mid H')\}$
13:    end for
14:    $z = p(D \mid H')$
15:  end if
16:  $w = \exp\{p(D \mid H') - z\}$
17:  for each base classifier do
18:    $\mathrm{weight}(h_t) = \mathrm{weight}(h_t)\dfrac{\mathrm{SumWeight}}{\mathrm{SumWeight} + w} + w\,\mathrm{TempWeight}(h_t)$
19:  end for
20:  $\mathrm{SumWeight} = \mathrm{SumWeight} + w$
21: end for
22: Normalize all the weights to sum to 1
23: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^{T} p(y \mid x, h_t)\,\mathrm{weight}(h_t)$
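A minimal sketch of Algorithms 2-3, assuming pre-trained base classifiers with a scikit-learn-style `predict` and binary labels in $\{-1, +1\}$ for the weighted vote; the helper names are ours, not from the slides. As in the update rules above, the comparisons are done on the log of $p(D \mid H')$.

```python
import numpy as np

def bmc_weights(classifiers, X, y, iterations=100, rng=np.random.default_rng(0)):
    m, T = len(X), len(classifiers)
    preds = np.array([h.predict(X) for h in classifiers])    # T x m predictions in {-1,+1}
    weight = np.zeros(T)
    sum_weight, z = 0.0, -np.inf
    for _ in range(iterations):
        # Temporary weights ~ Dirichlet(1,...,1), drawn via -ln(U(0,1)) and normalized
        temp = -np.log(rng.uniform(size=T))
        temp /= temp.sum()
        # Weighted-vote ensemble H' and its error eps(H') on D
        vote = np.sign(temp @ (preds * 1.0))
        eps = np.clip(np.mean(vote != y), 1e-6, 1 - 1e-6)
        log_lik = m * (eps * np.log(eps) + (1 - eps) * np.log(1 - eps))  # log p(D | H')
        if log_lik > z:
            weight *= np.exp(z - log_lik)   # rescale old weights to the new reference
            z = log_lik
        w = np.exp(log_lik - z)
        weight = weight * sum_weight / (sum_weight + w) + w * temp
        sum_weight += w
    return weight / weight.sum()
```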
Bagging (bootstrap aggregation)

Algorithm 4 Bagging
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ by randomly sampling with replacement from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4: end for
5: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^{T} \mathbb{1}(h_t(x) = y)$

It adopts the bootstrap sampling technique to construct base models. It generates new data sets by sampling from the original data set with replacement. It trains base classifiers on the sampled data sets. It combines all the base classifiers by majority voting (see the sketch below).
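A minimal sketch of Algorithm 4, assuming scikit-learn decision trees as base classifiers; the function names are ours, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, rng=np.random.default_rng(0)):
    m = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample (with replacement)
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X, labels):
    # Majority vote: H(x) = argmax_y sum_t 1(h_t(x) = y)
    votes = np.zeros((len(X), len(labels)), dtype=int)
    for h in classifiers:
        pred = h.predict(X)
        for j, lab in enumerate(labels):
            votes[:, j] += (pred == lab)
    return np.array(labels)[np.argmax(votes, axis=1)]
```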
Boosting

Algorithm 5 Boosting
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: Initialize the weight distribution $W_1$
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a weak classifier $h_t$ based on $D$ and $W_t$
4:   Evaluate the weak classifier: $\varepsilon(h_t)$
5:   Update the weight distribution $W_{t+1}$ based on $\varepsilon(h_t)$
6: end for
7: return $H = \mathrm{Combination}(\{h_1, \ldots, h_T\})$

Boosting converts weak classifiers into a strong one. It iteratively adjusts the importance of the examples in the training set, gradually correcting the mistakes made by the weak classifiers. It learns base classifiers based on the weight distribution and then combines the learned classifiers.
AdaBoost

Algorithm 6 AdaBoost in Binary Classification
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in \{+1, -1\}$)
Ensure: An ensemble classifier $H$
1: Initialize the weight distribution $W_1(i) = \frac{1}{m}$
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a base classifier $h_t = \arg\min_{h} \varepsilon(h)$, where $\varepsilon(h) = \sum_{i=1}^{m} W_t(i)\,\mathbb{1}(h(x_i) \neq y_i)$
4:   Calculate the weight of $h_t$: $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \varepsilon(h_t)}{\varepsilon(h_t)}\right)$
5:   Update the weight distribution of the training examples: $W_{t+1}(i) = \dfrac{W_t(i)\exp\{-\alpha_t h_t(x_i) y_i\}}{\sum_{i'=1}^{m} W_t(i')\exp\{-\alpha_t h_t(x_{i'}) y_{i'}\}}$
6: end for
7: return $H = \sum_{t=1}^{T} \alpha_t h_t$

If $\varepsilon(h_1) \leq \varepsilon(h_2)$ then $\alpha_{h_1} \geq \alpha_{h_2}$.
For $\varepsilon(h_t) < 0.5$, $\alpha_t > 0$.
If $x_i$ is wrongly classified, $h_t(x_i) y_i = -1$ and $\alpha_t > 0$, so $-\alpha_t h_t(x_i) y_i > 0$. As $\exp\{-\alpha_t h_t(x_i) y_i\} > 1$, the new weight $W_{t+1}(i) > W_t(i)$.
If $x_i$ is correctly classified, $W_{t+1}(i) < W_t(i)$.
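A minimal sketch of Algorithm 6, instantiating the generic boosting loop with scikit-learn decision stumps as weak learners and labels in $\{-1, +1\}$; the function names are ours, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    W = np.full(m, 1.0 / m)                 # W_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        # Weak learner trained on the weighted data (approximates argmin_h eps(h))
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
        pred = h.predict(X)
        eps = np.clip(np.sum(W * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Increase the weights of misclassified examples, decrease the others
        W = W * np.exp(-alpha * pred * y)
        W /= W.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # H(x) = sign( sum_t alpha_t h_t(x) )
    score = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(score)
```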
Stacking

Algorithm 7 Stacking
Require: Training data $D = \{x_i, y_i\}_{i=1}^{m}$ ($x_i \in \mathbb{R}^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: Step 1: Learn the first-level classifiers
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a base classifier $h_t$ based on $D$
4: end for
5: Step 2: Construct a new data set from $D$
6: for $i \leftarrow 1$ to $m$ do
7:   Add the example $(x'_i, y_i)$ to the new data set, where $x'_i = (h_1(x_i), \ldots, h_T(x_i))$
8: end for
9: Step 3: Learn a second-level classifier
10: Learn a new classifier $h'$ based on the newly constructed data set
11: return $H(x) = h'(h_1(x), \ldots, h_T(x))$

Stacking learns a high-level classifier on top of the base classifiers; it can be seen as a meta-learning approach. The base classifiers are called first-level classifiers, and a second-level classifier is learnt to combine them.
Stacking
Step 1: Learn first-level classifiers based on the original training data set.
We can apply the bootstrap sampling technique to learn independent classifiers.
We can adopt the strategy used in Boosting: adaptively learn base classifiers on data with a weight distribution.
We can tune the parameters of a learning algorithm to generate diverse base classifiers (homogeneous classifiers).
We can apply different classification and/or sampling methods to generate the base classifiers (heterogeneous classifiers).
Step 2: Construct a new data set based on the output of the base classifiers.
Step 3: Learn a second-level classifier based on the newly constructed data set. Any learning method can be applied to learn the second-level classifier (see the sketch below).
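A minimal sketch of the three steps with heterogeneous first-level classifiers; the particular scikit-learn models chosen here are an assumption for illustration, not prescribed by the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y):
    # Step 1: learn first-level classifiers on the original training data
    first_level = [DecisionTreeClassifier(max_depth=3), GaussianNB(),
                   KNeighborsClassifier(n_neighbors=5)]
    for h in first_level:
        h.fit(X, y)
    # Step 2: new examples x'_i = (h_1(x_i), ..., h_T(x_i))
    X_meta = np.column_stack([h.predict(X) for h in first_level])
    # Step 3: learn the second-level classifier on the new data set
    second_level = LogisticRegression().fit(X_meta, y)
    return first_level, second_level

def stacking_predict(first_level, second_level, X):
    # H(x) = h'(h_1(x), ..., h_T(x))
    X_meta = np.column_stack([h.predict(X) for h in first_level])
    return second_level.predict(X_meta)
```

In practice, out-of-fold (cross-validated) predictions are often used in Step 2 to reduce over-fitting of the second-level classifier; the sketch above follows the simpler formulation in Algorithm 7.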