Support Vector Machines, Random Forests, Boosting

Based in part on slides from the textbook and slides of Susan Holmes

December 2, 2012
Neural networks

Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x. It tries to find

\hat{f} = \operatorname{argmin}_{f \in C} \sum_{i=1}^n (Y_i - f(X_i))^2

where C = {neural net functions}.
Artificial Neural Networks (ANN): single layer [figure]
Neural networks: single layer neural nets

With p inputs, there are p + 1 weights:

f(x_1, \ldots, x_p) = S(\alpha + x^T \beta).

Here, S is a sigmoid function like the logistic curve

S(\eta) = \frac{e^\eta}{1 + e^\eta}.

The algorithm must estimate (\alpha, \beta) from data. The classifier assigns +1 to anything with S(\alpha + x^T \beta) > t for some threshold like 0.5.
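The single-layer rule is simple enough to sketch directly. The snippet below (Python rather than the course's R, just to keep the arithmetic explicit; the function names and weight values are illustrative, not estimated) evaluates S(α + xᵀβ) and thresholds it:

```python
import math

def sigmoid(eta):
    # S(eta) = e^eta / (1 + e^eta), the logistic curve
    return 1.0 / (1.0 + math.exp(-eta))

def single_layer(x, alpha, beta):
    # f(x) = S(alpha + x^T beta): p inputs, p + 1 weights (alpha and beta)
    return sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))

def classify(x, alpha, beta, t=0.5):
    # assign +1 when S(alpha + x^T beta) exceeds the threshold t, else -1
    return 1 if single_layer(x, alpha, beta) > t else -1
```

In practice (α, β) would come from the nonlinear least squares fit described on the following slides.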
Neural networks: double layer [figure]
Neural networks: double layer neural nets

In the previous figure, there are 15 weights and 3 intercept terms (α's) for the input-to-hidden layer, and 3 weights and one intercept term for the hidden-to-output layer. The outputs from the hidden layer have the form

S\left(\alpha_j + \sum_{i=1}^5 w_{ij} x_i\right), \quad 1 \le j \le 3.

So, the final output is of the form

S\left(\alpha_0 + \sum_{j=1}^3 v_j \, S\left(\alpha_j + \sum_{i=1}^5 w_{ij} x_i\right)\right).
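The composition above is just two nested sigmoid layers, and can be sketched as a forward pass (Python; the weights passed in are placeholders for values a fitting routine would estimate):

```python
import math

def sigmoid(eta):
    # S(eta) = e^eta / (1 + e^eta)
    return 1.0 / (1.0 + math.exp(-eta))

def forward(x, alphas, W, alpha0, v):
    """Forward pass for the 5-input, 3-hidden-unit network on the slide.
    x: 5 inputs; W: 3 rows of 5 weights (the 15 input->hidden weights);
    alphas: 3 hidden intercepts; v: 3 hidden->output weights; alpha0: output intercept."""
    # hidden units: h_j = S(alpha_j + sum_i w_ij x_i), j = 1, 2, 3
    hidden = [sigmoid(a_j + sum(w * xi for w, xi in zip(row, x)))
              for a_j, row in zip(alphas, W)]
    # output: S(alpha_0 + sum_j v_j h_j)
    return sigmoid(alpha0 + sum(v_j * h_j for v_j, h_j in zip(v, hidden)))
```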
Neural networks: fitting

Fitting this model is done via nonlinear least squares. It is computationally difficult because the regression function is highly nonlinear in the parameters.
Support vector machines

Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x using a linear function. A linear function with an intercept α determines an affine function

f_{\alpha,\beta}(x) = x^T \beta + \alpha

and a hyperplane

H_{\alpha,\beta} = \{ x : x^T \beta + \alpha = 0 \}.

A good classifier will separate the classes into the sides where f_{\alpha,\beta}(x) > 0 and f_{\alpha,\beta}(x) < 0. It seeks the hyperplane that separates the classes the most.
Support vector machines [sequence of figures]
Support vector machines

The reciprocal maximum margin can be written as

\inf_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T \beta + \alpha) \ge 1.

Therefore, the support vector machine problem is

\operatorname{minimize}_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T \beta + \alpha) \ge 1.
Support vector machines: what if the problem is not linearly separable?
Support vector machines: non-separable problems

If we cannot completely separate the classes, it is common to modify the problem to

\operatorname{minimize}_{\beta,\alpha,\xi} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T \beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \sum_{i=1}^n \xi_i \le C.

Or the Lagrange version

\operatorname{minimize}_{\beta,\alpha,\xi} \|\beta\|_2^2 + \gamma \sum_{i=1}^n \xi_i \quad \text{subject to } y_i(x_i^T \beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0.
Support vector machines: non-separable problems

The ξ_i's can be removed from this problem, yielding

\operatorname{minimize}_{\beta,\alpha} \|\beta\|_2^2 + \gamma \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+

where (z)_+ = \max(z, 0) is the positive part function. Or,

\operatorname{minimize}_{\beta,\alpha} \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+ + \lambda \|\beta\|_2^2.
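Once the ξ's are removed, the objective is an unconstrained penalized hinge loss and can be attacked directly. The following is a rough Python sketch by subgradient descent (an illustrative choice: real SVM solvers, like the svm function in R, work on the dual quadratic program instead):

```python
import math

def svm_subgradient(X, y, lam=0.01, steps=2000, lr=0.1):
    """Subgradient descent on sum_i (1 - y_i(x_i^T beta + alpha))_+ + lam * ||beta||_2^2.
    X: list of feature vectors; y: labels in {-1, +1}."""
    p = len(X[0])
    beta, alpha = [0.0] * p, 0.0
    for t in range(steps):
        gb = [2.0 * lam * b for b in beta]   # gradient of the penalty lam * ||beta||^2
        ga = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(b * v for b, v in zip(beta, xi)) + alpha)
            if margin < 1:                   # hinge term active: subgradient is -y_i x_i
                for j in range(p):
                    gb[j] -= yi * xi[j]
                ga -= yi
        step = lr / math.sqrt(t + 1)         # decaying step size
        beta = [b - step * g for b, g in zip(beta, gb)]
        alpha -= step * ga
    return alpha, beta
```

On linearly separable toy data this drives all margins y_i f(x_i) positive; the classifier is then sign(xᵀβ + α).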
Logistic vs. SVM [figure: logistic loss and SVM hinge loss curves]
Iris data: sepal.length, sepal.width [sequence of figures]
Support vector machines: kernel trick

The term \|\beta\|_2^2 can be thought of as a complexity penalty on the function f_{\alpha,\beta}, so we could write

\operatorname{minimize}_{f_\beta,\alpha} \sum_{i=1}^n (1 - y_i(\alpha + f_\beta(x_i)))_+ + \lambda \|f_\beta\|_2^2.

We could also add such a complexity penalty to logistic regression:

\operatorname{minimize}_{f_\beta,\alpha} \sum_{i=1}^n \text{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.

So, in both SVM and penalized logistic regression, we are looking for the best linear function, where best is determined by either the logistic loss (deviance) or the hinge loss (SVM).
Support vector machines: kernel trick

If the decision boundary is nonlinear, one might look for functions other than linear ones. There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve

\operatorname{minimize}_{f \in H} \sum_{i=1}^n (1 - y_i(\alpha + f(x_i)))_+ + \lambda \|f\|_H^2

or

\operatorname{minimize}_{f \in H} \sum_{i=1}^n \text{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_H^2.

The svm function in R allows some such kernels for SVM.
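What makes the RKHS problem tractable is that its minimizer is a finite kernel expansion f(x) = Σ_i c_i K(x_i, x) over the training points (the representer theorem), so fitted functions are cheap to evaluate. A Python sketch with the common RBF kernel (the coefficients c_i here are placeholders that a solver such as R's svm would actually fit):

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2), a standard RKHS kernel
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def kernel_decision(x, support, coefs, alpha, gamma=1.0):
    # f(x) = alpha + sum_i c_i K(x_i, x): the form the representer theorem gives
    return alpha + sum(c * rbf_kernel(s, x, gamma)
                       for c, s in zip(coefs, support))
```

The classifier is then sign(kernel_decision(x, ...)), exactly as in the linear case but with a nonlinear decision boundary in the original feature space.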
Iris data: sepal.length, sepal.width [sequence of figures with kernel SVM decision boundaries]
Ensemble methods: general idea
Ensemble methods: why use ensemble methods?

Basic idea: weak learners / classifiers are noisy learners. Aggregating over a large number of weak learners should improve the output. This assumes some independence across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should return this same classifier. Ensembles are often some of the best classifiers in terms of performance. Downside: they lose interpretability in the aggregation stage.
Ensemble methods: bagging

In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample S_b, 1 ≤ b ≤ B, fit a model, retaining the classifier \hat{f}_b. After all models have been fit, use the majority vote

\hat{f}(x) = \text{majority vote of } (\hat{f}_b(x))_{1 \le b \le B}.
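The bagging recipe fits in a few lines. A Python sketch, where `fit` stands in for any weak-learner trainer (e.g. a tree grower) that maps data to a ±1 classifier:

```python
import random

def bagging_fit(X, y, fit, B=50, seed=0):
    """Fit B classifiers on bootstrap samples of (X, y).
    `fit(X, y)` must return a classifier g with g(x) in {-1, +1}."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]     # sample with replacement
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    # majority vote over the B bootstrap classifiers
    votes = sum(g(x) for g in models)
    return 1 if votes >= 0 else -1
```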
Ensemble methods: bagging

Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule). Using this observation, we can define the out-of-bag (OOB) estimate of accuracy based on the OOB prediction

\hat{f}_{\text{OOB}}(x_i) = \text{majority vote of } (\hat{f}_b(x_i))_{b \in B(i)}

where B(i) = {all trees not built with x_i}. The total OOB error is

\text{OOB} = \frac{1}{n} \sum_{i=1}^n 1(\hat{f}_{\text{OOB}}(x_i) \ne y_i).
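The OOB error can be accumulated in the same loop that fits the bootstrap models, since we know exactly which points each model never saw. A sketch (Python, with the same hypothetical `fit` interface as the bagging step):

```python
import random

def oob_error(X, y, fit, B=100, seed=0):
    """Out-of-bag error: each x_i is predicted only by the models whose
    bootstrap sample did not contain it (roughly 1/3 of the B models)."""
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * n                      # running OOB vote totals per point
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        g = fit([X[i] for i in idx], [y[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:          # x_i is out of bag for this model
                votes[i] += g(X[i])
    wrong = sum(1 for i in range(n) if (1 if votes[i] >= 0 else -1) != y[i])
    return wrong / n
```

Because OOB predictions use only models that never saw the point, the OOB error behaves like a built-in cross-validation estimate.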
Ensemble methods: random forests (bagging trees)

If the weak learners are trees, this method is called a random forest. Some variations on random forests inject more randomness than just bagging, for example:
- taking a random subset of the features;
- taking random linear combinations of features.
Iris data: sepal.length, sepal.width, using randomForest [figure]
Ensemble methods: boosting

Unlike bagging, boosting does not insert randomization into the algorithm. Boosting works by iteratively reweighting each observation and refitting a given classifier. To begin, suppose we have a classification algorithm that accepts case weights w = (w_1, \ldots, w_n),

\hat{f}_w = \operatorname{argmin}_f \sum_{i=1}^n w_i L(y_i, f(x_i)),

and returns labels in {+1, -1}.
Ensemble methods: boosting algorithm (AdaBoost.M1)

Step 0: Initialize weights w_i^{(0)} = 1/n, 1 ≤ i ≤ n.

Step 1: For 1 ≤ t ≤ T, fit a classifier with weights w^{(t-1)}, yielding

\hat{g}^{(t)} = \hat{f}_{w^{(t-1)}} = \operatorname{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)).

Step 2: Compute the total error rate

\text{err}^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)} 1(y_i \ne \hat{g}^{(t)}(x_i))}{\sum_{i=1}^n w_i^{(t-1)}}, \qquad \alpha^{(t)} = \log\left(\frac{1 - \text{err}^{(t)}}{\text{err}^{(t)}}\right).
Ensemble methods: boosting algorithm (AdaBoost.M1)

Step 3: Produce new weights

w_i^{(t)} = w_i^{(t-1)} \exp\left(\alpha^{(t)} 1(y_i \ne \hat{g}^{(t)}(x_i))\right).

Step 4: Repeat steps 1-3 until termination.

Step 5: Output the classifier

\hat{f}(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha^{(t)} \hat{g}^{(t)}(x)\right).
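Steps 0-5 transcribe almost directly into code. A Python sketch, assuming a hypothetical `weak_fit(X, y, w)` that returns a ±1 classifier fit under case weights w (the early-exit guards for err = 0 and err ≥ 1/2 are practical additions, not part of the slide's statement):

```python
import math

def adaboost(X, y, weak_fit, T=20):
    """AdaBoost.M1 sketch. y in {-1, +1}; weak_fit(X, y, w) -> classifier g."""
    n = len(X)
    w = [1.0 / n] * n                         # step 0: uniform weights
    models, alphas = [], []
    for _ in range(T):
        g = weak_fit(X, y, w)                 # step 1: weighted fit
        err = sum(wi for wi, xi, yi in zip(w, X, y) if g(xi) != yi) / sum(w)
        if err >= 0.5:                        # no better than chance: stop
            break
        # step 2: alpha_t = log((1 - err) / err); guard the err = 0 case
        a = math.log((1 - err) / err) if err > 0 else 1.0
        models.append(g)
        alphas.append(a)
        if err == 0:                          # perfect fit: nothing to reweight
            break
        # step 3: multiply weights of misclassified points by e^alpha
        w = [wi * (math.exp(a) if g(xi) != yi else 1.0)
             for wi, xi, yi in zip(w, X, y)]
    def f(x):                                 # step 5: weighted majority vote
        s = sum(a * g(x) for a, g in zip(alphas, models))
        return 1 if s >= 0 else -1
    return f
```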
Ensemble methods: illustrating AdaBoost [figure: data points for training, with the initial weights for each data point]
Ensemble methods: illustrating AdaBoost [figure, from Tan, Steinbach, Kumar, Introduction to Data Mining]
Ensemble methods: boosting as gradient descent

It turns out that boosting can be thought of as something like gradient descent. Above, we assumed we were solving

\operatorname{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)),

but we didn't say what type of f we considered.
Ensemble methods: boosting as gradient descent

Call the space of all possible classifiers above F_0, and all linear combinations of such classifiers

F = \left\{ \sum_j \alpha_j f_j : f_j \in F_0 \right\}.

In some sense, the boosting algorithm is a steepest descent algorithm to find

\operatorname{argmin}_{f \in F} \sum_{i=1}^n L(y_i, f(x_i)).

For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.
Ensemble methods: boosting as gradient descent

Consider the problem

\operatorname{minimize}_{f \in F} \sum_{i=1}^n L(y_i, f(x_i)) = L(f).

This can be solved by gradient descent, where the "gradient" at f is the element of F_0 in the direction of steepest descent for some stepsize α:

\operatorname{argmin}_{g \in F_0} L(f + \alpha g).
Ensemble methods: boosting as gradient descent

We might also optimize the stepsize:

\operatorname{argmin}_{g \in F_0, \, \alpha} L(f + \alpha g).
Ensemble methods: boosting as gradient descent

In this context, the forward stagewise algorithm for steepest descent is:

1. Initialize: f^{(0)}(x) = 0.
2. For t = 1, \ldots, T: solve

(\alpha^{(t)}, \hat{g}^{(t)}) = \operatorname{argmin}_{\alpha, \, g \in F_0} L(f^{(t-1)} + \alpha g)

and update f^{(t)} = f^{(t-1)} + \alpha^{(t)} \hat{g}^{(t)}.
3. Return the final classifier f^{(T)}.

When L(y, f(x)) = e^{-y f(x)}, this corresponds to AdaBoost.M1. Changing the loss function yields new gradient boosting algorithms.
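To make "changing the loss" concrete, here is a sketch of forward stagewise with squared loss (L2 boosting for regression, in the spirit of what the gbm package does with a Gaussian loss). The one-split regression stump and the small fixed shrinkage factor are illustrative choices, not from the slides; with L(y, f) = (y - f)^2, the negative gradient is just the residual y - f, so each step fits the weak learner to the current residuals:

```python
def fit_stump(X, r):
    """Regression stump on feature 0: best single split by squared error."""
    best = None
    for s in sorted(set(x[0] for x in X)):
        left = [ri for xi, ri in zip(X, r) if xi[0] <= s]
        right = [ri for xi, ri in zip(X, r) if xi[0] > s]
        ml = sum(left) / len(left) if left else 0.0
        mr = sum(right) / len(right) if right else 0.0
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x[0] <= s else mr

def l2_boost(X, y, T=200, nu=0.1):
    """Forward stagewise with squared loss: fit the weak learner to the
    residuals (the negative gradient) and take a small step of size nu."""
    f_train = [0.0] * len(X)                      # f^(0) = 0 on the training points
    models = []
    for _ in range(T):
        resid = [yi - fi for yi, fi in zip(y, f_train)]
        g = fit_stump(X, resid)
        models.append(g)
        f_train = [fi + nu * g(xi) for fi, xi in zip(f_train, X)]
    return lambda x: sum(nu * g(x) for g in models)
```

Swapping the residual step for the gradient of another loss (exponential, deviance, absolute error) recovers the other members of the gradient boosting family.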
Iris data: sepal.length, sepal.width [figures with 1000, 5000, and 10000 trees]
Boosting fits an additive model (by default) [figures]
Out-of-bag error improvement [figure]