Chapter 14: Combining Models
T-61.6020 Special Course II: Pattern Recognition and Machine Learning, Spring 2007
Laboratory of Computer and Information Science, TKK
April 3rd, 2007
Outline
- Independent Mixing Coefficients
- Mixture of Experts
- Summary
Chapter 14
- shortest chapter in the book
- examples in regression and classification
- Bishop style: exponential error functions are introduced, with which boosting can be expressed in a flexible way, etc.
- BiShop-Bingo: three crosses in a row, column or diagonal erase the counters
- BS-Bingo starts... now!
- an ensemble of statistical classifiers is often more accurate than a single classifier
- weak learner or weak classifier: only slightly better than chance
- final result by voting (classification) or averaging (regression)
- some techniques: bagging, boosting
Bagging
- a committee technique based on bootstrapping the data set and model averaging
- bootstrapping: given a data set of size N, create M data sets of size N by sampling with replacement (see the sketch below)
- averaging low-bias models produces accurate predictions
- bias-variance decomposition (Section 3.5)
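To make the bootstrap step concrete, here is a minimal sketch (not from the slides) of creating M bootstrap copies of a data set; the function name and the NumPy-based setup are illustrative only.

```python
import numpy as np

def bootstrap_datasets(data, M, seed=0):
    """Create M bootstrap data sets, each of size N, drawn with replacement."""
    rng = np.random.default_rng(seed)
    N = len(data)
    return [data[rng.integers(0, N, size=N)] for _ in range(M)]

# Example: 5 bootstrap copies of a toy data set of 10 points
data = np.linspace(0.0, 1.0, 10)
copies = bootstrap_datasets(data, M=5)
```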
Bagging in Regression
- example on regression: $y(x) = h(x) + \epsilon(x)$, estimated from a single data set $D$
- M bootstrap data sets $D_m$, from which regressors $y_m(x)$ with errors $\epsilon_m(x)$
- sum-of-squares error of an individual regressor: $\mathbb{E}_x\{(y_m(x) - h(x))^2\} = \mathbb{E}_x\{\epsilon_m(x)^2\}$
- average of the individual errors: $E_{AV} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_x\{\epsilon_m(x)^2\}$
Averaging Gives Better Performance
- the committee prediction is the average of the $y_m$: $y_{COM}(x) = \frac{1}{M}\sum_{m=1}^{M} y_m(x)$
- expected error of the committee: $E_{COM} = \mathbb{E}_x\{(y_{COM}(x) - h(x))^2\} = \mathbb{E}_x\{(\frac{1}{M}\sum_{m=1}^{M}\epsilon_m(x))^2\}$
- under the assumptions that the errors $\epsilon_m(x)$ are zero-mean and uncorrelated, we obtain $E_{COM} = \frac{1}{M} E_{AV}$
Not As Good As in Theory
- the assumptions do not hold in general
- however, it can be shown that $E_{COM} \leq E_{AV}$: for a fixed x, expand the average squared error of the members,
  $\frac{1}{M}\sum_{m}(h(x) - y_m(x))^2 = h(x)^2 - 2\,h(x)\,\frac{1}{M}\sum_{m} y_m(x) + \frac{1}{M}\sum_{m} y_m(x)^2$
- using the inequality $\frac{1}{M}\sum_m y_m(x)^2 \geq \big(\frac{1}{M}\sum_m y_m(x)\big)^2$ and $\frac{1}{M}\sum_m y_m(x) = y_{COM}(x)$, we get
  $(h(x) - y_{COM}(x))^2 \leq \frac{1}{M}\sum_{m}(h(x) - y_m(x))^2$, and taking $\mathbb{E}_x$ gives $E_{COM} \leq E_{AV}$
- a small numerical check follows below
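As a sanity check on the result above, the following small simulation (an illustrative sketch, not part of the slides) compares E_COM and E_AV for M simulated regressors with independent zero-mean errors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
h = np.sin(np.pi * x)                            # true function h(x)
M = 10
eps = 0.3 * rng.standard_normal((M, x.size))     # member errors eps_m(x)
y_members = h + eps                              # individual predictions y_m(x)
y_com = y_members.mean(axis=0)                   # committee prediction y_COM(x)

E_av = np.mean((y_members - h) ** 2)             # average member error E_AV
E_com = np.mean((y_com - h) ** 2)                # committee error E_COM
print(E_com <= E_av, E_com, E_av)                # E_COM <= E_AV, roughly E_AV / M here
```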
Boosting
- classifiers are trained in sequence
- a misclassified data point gets more weight in the following classifier
- the final prediction is given by a weighted majority voting scheme
- example on a two-class classification problem with the most widely used algorithm, AdaBoost
Adaptive Boosting (AdaBoost)
- weights $w_n$ for each training sample
- M weak classifiers trained in sequence
- indicator function $I(y_m(x_n) \neq t_n)$, which equals 1 if the argument is true, i.e., in case of misclassification
- misclassified data points will have more weight in the following classifier
- weights $\alpha_m$ for each classifier
[Figure: the weighted data sets $\{w_n^{(1)}\}, \{w_n^{(2)}\}, \dots, \{w_n^{(M)}\}$ feed the classifiers $y_1(x), y_2(x), \dots, y_M(x)$, which are combined as $Y_M(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m y_m(x)\big)$]
AdaBoost: Algorithm
1. Initialize the data weights $w_n^{(1)} = 1/N$
2. For $m = 1, \dots, M$:
   2.1 Fit a classifier $y_m(x)$ by minimizing $J_m = \sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)$
   2.2 Evaluate the quantity $\epsilon_m$ (weighted ratio of misclassified points),
       $\epsilon_m = \frac{\sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}$,
       and the quantity $\alpha_m$ (weight for classifier m),
       $\alpha_m = \ln\Big(\frac{1 - \epsilon_m}{\epsilon_m}\Big)$
   2.3 Update the data weighting coefficients $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \neq t_n)\}$
3. Make predictions by $Y_M(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m y_m(x)\big)$
(a minimal implementation sketch follows below)
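The following is a minimal, illustrative sketch of the algorithm above, using decision stumps (a threshold on a single input variable) as the weak learners; all names are made up and the code favours clarity over efficiency.

```python
import numpy as np

def stump_predict(X, d, theta, s):
    """Stump prediction in {-1, +1}: sign s applied to the threshold test."""
    return s * np.where(X[:, d] > theta, 1.0, -1.0)

def fit_stump(X, t, w):
    """Pick the (feature d, threshold theta, sign s) minimizing the weighted error J_m."""
    best = None
    for d in range(X.shape[1]):
        for theta in np.unique(X[:, d]):
            for s in (+1.0, -1.0):
                err = np.sum(w * (stump_predict(X, d, theta, s) != t))
                if best is None or err < best[0]:
                    best = (err, d, theta, s)
    return best[1:]

def adaboost(X, t, M):
    N = len(t)
    w = np.full(N, 1.0 / N)                        # step 1: w_n^(1) = 1/N
    ensemble = []
    for _ in range(M):
        d, theta, s = fit_stump(X, t, w)           # step 2.1: minimize J_m
        miss = stump_predict(X, d, theta, s) != t
        eps = np.sum(w * miss) / np.sum(w)         # step 2.2: epsilon_m (assumed 0 < eps < 1)
        alpha = np.log((1.0 - eps) / eps)          #           alpha_m
        w = w * np.exp(alpha * miss)               # step 2.3: reweight misclassified points
        ensemble.append((alpha, d, theta, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, d, theta, s) for a, d, theta, s in ensemble)
    return np.sign(score)                          # step 3: Y_M(x)
```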
AdaBoost: Example
- the base learners consist of a threshold on one of the input variables (decision stumps)
- samples misclassified by the classifier at m = 1 get greater weight for m = 2
- final classification: $Y_M(x) = \mathrm{sign}\big(\sum_m \alpha_m y_m(x)\big)$
[Figure: weighted data points and the combined decision boundary after successive rounds m]
Boosting as Sequential Minimization
- boosting was originally motivated by statistical learning theory
- here: sequential optimization of an exponential error function (in a Bishop style)
- error function $E = \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\}$
- combined classifier $f_m(x) = \frac{1}{2}\sum_{l} \alpha_l y_l(x)$
- keeping the base classifiers $y_1(x), \dots, y_{m-1}(x)$ and the corresponding $\alpha_l$ fixed and minimizing only with respect to $\alpha_m$ and $y_m(x)$ leads to the same equations as in AdaBoost (sketched below)
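A compact sketch of the omitted intermediate steps, following the exponential-error argument in the book (notation as above):

```latex
% Separate the newest term of the exponent; w_n^(m) collects the fixed part.
E = \sum_{n=1}^{N} w_n^{(m)} \exp\Big\{-\tfrac{1}{2}\,\alpha_m t_n y_m(x_n)\Big\},
\qquad w_n^{(m)} = \exp\{-t_n f_{m-1}(x_n)\}.

% Splitting the sum into correctly and incorrectly classified points:
E = \big(e^{\alpha_m/2} - e^{-\alpha_m/2}\big)\sum_{n} w_n^{(m)}\, I\big(y_m(x_n) \neq t_n\big)
  + e^{-\alpha_m/2}\sum_{n} w_n^{(m)}.

% Minimizing over y_m(x) is equivalent to minimizing the weighted error J_m,
% and setting dE/d\alpha_m = 0 gives \alpha_m = \ln\{(1-\epsilon_m)/\epsilon_m\},
% exactly the quantities used in the AdaBoost algorithm.
```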
Error Functions for Boosting
- lots of boosting-like algorithms can be obtained by altering the error function
- exponential error function: sequential minimization leads to the simple AdaBoost scheme; it heavily penalizes large negative values of $t\,y(x)$
- cross-entropy error function for $t \in \{-1, 1\}$: $\log(1 + e^{-y t})$; more robust to outliers; log likelihoods exist for any distribution; multi-class problems possible to solve (a small comparison follows below)
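A tiny illustrative snippet (not from the slides) comparing the two error functions as a function of z = t y(x), which makes the robustness difference visible.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)                 # z = t * y(x)
exp_err = np.exp(-z)                          # exponential error, used by AdaBoost
logistic_err = np.log1p(np.exp(-z))           # cross-entropy error for t in {-1, +1}
# The exponential error grows much faster for large negative z,
# which is why it is less robust to outliers and mislabelled points.
print(np.column_stack([z, exp_err, logistic_err]))
```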
Classification and Regression Trees (CART)
- the input space is split into cuboid regions; axis-aligned boundaries only
- one model, e.g., a constant, in each region
- human interpretation is easy
[Figure: the $(x_1, x_2)$ input space partitioned by thresholds $\theta_1, \dots, \theta_4$ into regions A-E, together with the corresponding binary tree of splits such as $x_1 > \theta_1$ and $x_2 > \theta_3$]
CART: Learning from Data
- determine from data: the structure of the tree, the input variable for each node, the threshold value $\theta_i$ for each split, and the values of the predictions in the leaves
- exhaustive search is combinatorially infeasible
- greedy algorithm: start from a single node and grow the tree, using a stopping criterion and a pruning criterion (a minimal split-selection sketch follows below)
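The greedy step can be sketched as follows for regression: choose the axis-aligned split that minimizes the sum-of-squares error of constant (mean) predictions in the two resulting regions. This is an illustrative sketch only; the function name and the exhaustive threshold loop are assumptions.

```python
import numpy as np

def best_split(X, t):
    """Return (feature d, threshold theta, error) of the best axis-aligned split."""
    best = (None, None, np.inf)
    for d in range(X.shape[1]):
        for theta in np.unique(X[:, d]):
            left, right = t[X[:, d] <= theta], t[X[:, d] > theta]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts its mean (the constant model per region)
            err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if err < best[2]:
                best = (d, theta, err)
    return best

# Growing a tree repeats this recursively until a stopping criterion is met,
# after which the tree is typically pruned back using a complexity criterion.
```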
Drawbacks of CART
- learning of a tree is sensitive to the data
- splits are aligned with the axes of the feature space
- hard splitting: each region of input space belongs to one and only one node
- the piecewise-constant predictions of a tree are not smooth
- remedy: hierarchical mixture of experts
Mixture of Linear Regression Models
- simple probabilistic cases for regression and classification: mixtures of linear regression models, mixtures of logistic models
- K Gaussians with mixing coefficients independent of the input variables:
  $p(t \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(t \mid \mathbf{w}_k^T \boldsymbol{\phi}, \beta^{-1})$
EM for Maximizing the Log Likelihood
- log likelihood function given a data set $\{\boldsymbol{\phi}_n, t_n\}$:
  $\ln p(\mathbf{t} \mid \theta) = \sum_{n=1}^{N} \ln\Big(\sum_{k=1}^{K} \pi_k \, \mathcal{N}(t_n \mid \mathbf{w}_k^T \boldsymbol{\phi}_n, \beta^{-1})\Big)$
- complete-data log likelihood function with binary latent variables $z_{nk}$:
  $\ln p(\mathbf{t}, \mathbf{Z} \mid \theta) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk} \ln\big(\pi_k \, \mathcal{N}(t_n \mid \mathbf{w}_k^T \boldsymbol{\phi}_n, \beta^{-1})\big)$
- EM: evaluate the responsibilities $\gamma_{nk}$ and $Q(\theta, \theta^{old})$, then update $\pi_k$, $\mathbf{w}_k$ and $\beta$ (a sketch follows below)
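A minimal EM sketch for this model, assuming a design matrix Phi and targets t; the initialization, iteration count and numerical details are illustrative, not prescriptive.

```python
import numpy as np

def em_mix_linreg(Phi, t, K, n_iter=50, seed=0):
    """EM for a mixture of K linear regression models with shared precision beta."""
    rng = np.random.default_rng(seed)
    N, D = Phi.shape
    pi = np.full(K, 1.0 / K)
    W = rng.standard_normal((K, D))              # one weight vector per component
    beta = 1.0
    for _ in range(n_iter):
        # E step: responsibilities gamma_nk proportional to pi_k * N(t_n | w_k^T phi_n, 1/beta)
        means = Phi @ W.T                        # (N, K) component means
        logp = -0.5 * beta * (t[:, None] - means) ** 2 + 0.5 * np.log(beta / (2 * np.pi))
        gamma = pi * np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: update pi_k, w_k (weighted least squares) and beta
        pi = gamma.mean(axis=0)
        for k in range(K):
            R = gamma[:, k]
            W[k] = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * t))
        beta = N / np.sum(gamma * (t[:, None] - Phi @ W.T) ** 2)
    return pi, W, beta
```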
Example
- a mixture of two linear regressors fitted to data
- drawback: a lot of probability mass is placed in regions with no data
- solution: input-dependent mixing coefficients
[Figure: plots of the data, the two fitted regression lines, and the resulting predictive density]
Mixture of Experts
- mixture of linear regression models: $p(t \mid \theta) = \sum_{k=1}^{K} \pi_k \, p_k(t \mid \theta)$
- mixture of experts model: $p(t \mid x, \theta) = \sum_{k=1}^{K} \pi_k(x) \, p_k(t \mid x, \theta)$
- the mixing coefficients, called gating functions, are functions of the input
- the individual component densities are called experts (a small predictive-density sketch follows below)
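A small illustrative sketch of evaluating the mixture-of-experts predictive density with softmax gating functions and linear-Gaussian experts; the parameter names V (gating weights), W (expert weights) and beta are assumptions made for the example.

```python
import numpy as np

def moe_density(t, x, V, W, beta):
    """p(t | x) = sum_k softmax(V x)_k * N(t | w_k^T x, beta^-1)."""
    a = V @ x                                    # gating activations, one per expert
    gate = np.exp(a - a.max())
    gate /= gate.sum()                           # pi_k(x): softmax gating functions
    means = W @ x                                # expert means w_k^T x
    dens = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (t - means) ** 2)
    return float(gate @ dens)
```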
Hierarchical Mixture of Experts
- a probabilistic version of decision trees
- each component in the mixture is itself a mixture distribution
- nodes: probabilistic splits on all input variables
- leaves: probabilistic models
- compare with the mixture density network (Section 5.6)
Summary
- multiple models are combined to increase the capabilities of a regressor or classifier
- the basic methods, bagging and boosting, improve results compared to a single learner
- decision trees are easy to interpret
- probabilistic mixture-of-experts models extend the tree models
Course Feedback
http://www.cs.hut.fi/opinnot/palaute/kurssipalaute.html
Spring 2007 course questionnaires, T-61.6020