Part I, Week 7: Support Vector Machines, Random Forests, Boosting
Based in part on slides from the textbook and slides of Susan Holmes
December 2, 2012

Neural networks

Artificial neural networks (ANN)
Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x. It tries to find
\hat{f} = \mathrm{argmin}_{f \in C} \sum_i (Y_i - f(X_i))^2,
where C = \{ \text{neural net functions} \}.
Neural networks

Single layer neural nets
With p inputs, there are p + 1 weights:
f(x_1, \ldots, x_p) = S(\alpha + x^T\beta).
Here S is a sigmoid function like the logistic curve,
S(\eta) = \frac{e^\eta}{1 + e^\eta}.
The algorithm must estimate (\alpha, \beta) from data. The classifier assigns +1 to anything with S(\alpha + x^T\beta) > t for some threshold t, such as 0.5.

In the previous figure (5 inputs, 3 hidden units), there are 15 weights and 3 intercept terms (\alpha's) for the input-to-hidden layer, and 3 weights and one intercept term for the hidden-to-output layer. The outputs from the hidden layer have the form
S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right), \quad 1 \le j \le 3,
so the final output is of the form
S\left(\alpha_0 + \sum_{j=1}^{3} v_j \, S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right)\right).

Fitting this model is done via nonlinear least squares, which is computationally difficult due to the highly nonlinear form of the regression function.
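As an illustrative sketch (not part of the original slides), a small network like this can be fit in R with the nnet package; the call below assumes nnet is installed and uses a hypothetical derived indicator is.virginica to turn the iris data into a 2-class problem.

# Sketch: single-hidden-layer neural net via the nnet package.
# size = 3 gives 3 hidden units; decay adds a small weight penalty
# to stabilize the nonlinear fit.
library(nnet)

iris2 <- iris
iris2$is.virginica <- factor(iris2$Species == "virginica")

set.seed(1)
fit <- nnet(is.virginica ~ Sepal.Length + Sepal.Width,
            data = iris2, size = 3, decay = 1e-3, maxit = 500)

# Predicted class labels on the training data
table(predict(fit, iris2, type = "class"), iris2$is.virginica)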
Support vector machines

Support vector machine
Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x, using a linear function.

A linear function with an intercept \alpha determines an affine function
f_{\alpha,\beta}(x) = x^T\beta + \alpha
and a hyperplane
H_{\alpha,\beta} = \{ x : x^T\beta + \alpha = 0 \}.
A good classifier will separate the classes into the sides where f_{\alpha,\beta}(x) > 0 and where f_{\alpha,\beta}(x) < 0. The SVM seeks the hyperplane that separates the classes the most.

[Figure: find a linear hyperplane (decision boundary) that will separate the data. (Tan, Steinbach & Kumar)]
Support vector machines

[Figures: other possible separating hyperplanes; which one is better, B1 or B2, and how do you define "better"? (Tan, Steinbach & Kumar)]

Support vector machine
The reciprocal of the maximum margin can be written as
\inf_{\beta, \alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.
Therefore, the support vector machine problem is
\text{minimize}_{\beta, \alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.
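As an illustrative sketch (not from the original slides), the maximum-margin problem can be approximated in R with the svm function from the e1071 package by using a linear kernel and a very large cost, which penalizes margin violations heavily; the 2-class subset of iris used here is an assumption for illustration.

# Sketch: (near) maximum-margin linear SVM via e1071::svm.
# A very large `cost` approximates the hard-margin problem when the
# classes are linearly separable.
library(e1071)

iris2 <- droplevels(subset(iris, Species != "virginica"))  # separable 2-class subset
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2,
           kernel = "linear", cost = 1e4, scale = FALSE)

fit$index                           # indices of the support vectors
table(predict(fit), iris2$Species)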
Support vector machines

Non-separable problems
What if the problem is not linearly separable? If we cannot completely separate the classes, it is common to modify the problem to
\text{minimize}_{\beta, \alpha, \xi} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \sum_{i=1}^n \xi_i \le C,
or the Lagrange version
\text{minimize}_{\beta, \alpha, \xi} \|\beta\|_2^2 + \gamma \sum_{i=1}^n \xi_i \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0.

The \xi_i's can be removed from this problem, yielding
\text{minimize}_{\beta, \alpha} \|\beta\|_2^2 + \gamma \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+,
where (z)_+ = \max(z, 0) is the positive part function. Equivalently,
\text{minimize}_{\beta, \alpha} \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+ + \lambda \|\beta\|_2^2.

[Figure: logistic loss vs. SVM (hinge) loss, plotted as functions of the margin y f(x).]
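An illustrative sketch (not from the original slides) reproducing the flavor of the loss comparison above: the hinge loss (1 - y f)_+ and a logistic-type loss plotted against the margin y f(x); the log-base-2 scaling is an assumption chosen so both losses equal 1 at a margin of 0.

# Sketch: hinge loss vs. logistic (deviance-type) loss as functions
# of the margin y * f(x).
margin <- seq(-3, 3, length.out = 200)

hinge    <- pmax(1 - margin, 0)      # SVM loss: (1 - y f)_+
logistic <- log2(1 + exp(-margin))   # logistic loss, scaled to equal 1 at 0

plot(margin, hinge, type = "l", col = "blue", ylim = c(0, 4),
     xlab = "y * f(x)", ylab = "loss")
lines(margin, logistic, col = "red")
legend("topright", legend = c("SVM (hinge)", "Logistic"),
       col = c("blue", "red"), lty = 1)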
[Figures: iris data, sepal.length vs. sepal.width, with fitted SVM decision boundaries.]

Support vector machines

Kernel trick
The term \|\beta\|_2^2 can be thought of as a complexity penalty on the function f_{\alpha,\beta}, so we could write
\text{minimize}_{f_\beta, \alpha} \sum_{i=1}^n (1 - y_i(\alpha + f_\beta(x_i)))_+ + \lambda \|f_\beta\|_2^2.
We could also add such a complexity penalty to logistic regression:
\text{minimize}_{f_\beta, \alpha} \sum_{i=1}^n \text{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.
So, in both the SVM and penalized logistic regression, we are looking for the best linear function, where "best" is determined either by the logistic loss (deviance) or by the hinge loss (SVM).
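As a hedged aside (not in the slides), the penalized logistic criterion above is ridge-penalized logistic regression, which can be fit in R with the glmnet package; the feature and response choices below are assumptions for illustration.

# Sketch: l2 (ridge)-penalized logistic regression with glmnet.
# alpha = 0 selects the pure l2 penalty, matching lambda * ||beta||_2^2.
library(glmnet)

x <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
y <- factor(iris$Species == "virginica")

fit <- glmnet(x, y, family = "binomial", alpha = 0)
coef(fit, s = 0.1)   # coefficients at lambda = 0.1; larger lambda shrinks beta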
Support vector machines

Kernel trick
If the decision boundary is nonlinear, one might look for functions other than linear ones. There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve
\text{minimize}_{f \in H} \sum_{i=1}^n (1 - y_i(\alpha + f(x_i)))_+ + \lambda \|f\|_H^2
or
\text{minimize}_{f \in H} \sum_{i=1}^n \text{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_H^2.
The svm function in R allows some such kernels for SVM.

[Figures: iris data, sepal.length vs. sepal.width, with kernel SVM decision boundaries.]
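A minimal sketch of a kernel SVM using the svm function the slide refers to (from the e1071 package); the kernel and tuning parameters below are illustrative assumptions, not values from the slides.

# Sketch: nonlinear (radial-kernel) SVM on two iris features.
library(e1071)

fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris,
           kernel = "radial", cost = 1, gamma = 1)

# Plot the nonlinear decision regions in the two-feature space.
plot(fit, iris, Sepal.Width ~ Sepal.Length)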
[Figures: iris data, sepal.length vs. sepal.width.]

Ensemble methods

General idea
[Figure.]
Why use ensemble methods?
Basic idea: weak learners / classifiers are noisy learners. Aggregating over a large number of weak learners should improve the output.
This assumes some independence across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should simply return that classifier.
These are often some of the best classifiers in terms of performance. Downside: they lose interpretability in the aggregation stage.

Bagging
In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample S_b, 1 \le b \le B, fit a model, retaining the classifier \hat{f}_b. After all models have been fit, use the majority vote
\hat{f}(x) = \text{majority vote of } (\hat{f}_b(x))_{1 \le b \le B}.

Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule). Using this observation, we can define the out-of-bag (OOB) estimate of accuracy, based on the OOB prediction
\hat{f}_{\text{OOB}}(x_i) = \text{majority vote of } (\hat{f}_b(x_i))_{b \in B(i)},
where B(i) = \{\text{all trees not built with } x_i\}. The total OOB error is
\text{OOB} = \frac{1}{n} \sum_{i=1}^n 1(\hat{f}_{\text{OOB}}(x_i) \ne y_i).

Random forests: bagging trees
If the weak learners are trees, this method is called a random forest. Some variations on random forests inject more randomness than just bagging, for example:
- taking a random subset of features;
- taking random linear combinations of features.
(A sketch using the randomForest package follows below.)
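An illustrative sketch (not part of the original slides), assuming the randomForest package: bagged classification trees on two iris features, with the OOB error estimate described above.

# Sketch: random forest (bagged trees with random feature subsets)
# via the randomForest package. The printed error rate is the
# out-of-bag (OOB) estimate.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris,
                    ntree = 500)

fit                   # prints the OOB estimate of the error rate
head(fit$err.rate)    # OOB error as trees are added, overall and per class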
[Figure: iris data, sepal.length vs. sepal.width, classified using randomForest.]

Boosting
Unlike bagging, boosting does not insert randomization into the algorithm. Boosting works by iteratively reweighting each observation and refitting a given classifier.
To begin, suppose we have a classification algorithm that accepts case weights w = (w_1, \ldots, w_n),
\hat{f}_w = \mathrm{argmin}_f \sum_{i=1}^n w_i L(y_i, f(x_i)),
and returns labels in \{+1, -1\}.

Boosting algorithm (AdaBoost.M1)
Step 0: Initialize the weights w_i^{(0)} = 1/n, 1 \le i \le n.
Step 1: For 1 \le t \le T, fit a classifier with weights w^{(t-1)}, yielding
\hat{g}^{(t)} = \hat{f}_{w^{(t-1)}} = \mathrm{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)).
Step 2: Compute the total error rate
\text{err}^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)} 1(y_i \ne \hat{g}^{(t)}(x_i))}{\sum_{i=1}^n w_i^{(t-1)}},
\qquad \alpha^{(t)} = \log\left(\frac{1 - \text{err}^{(t)}}{\text{err}^{(t)}}\right).
Step 3: Produce new weights
w_i^{(t)} = w_i^{(t-1)} \exp\left(\alpha^{(t)} 1(y_i \ne \hat{g}^{(t)}(x_i))\right).
Step 4: Repeat steps 1-3 until termination.
Step 5: Output the classifier
\hat{f}(x) = \mathrm{sign}\left(\sum_{t=1}^T \alpha^{(t)} \hat{g}^{(t)}(x)\right).
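As an illustrative sketch (not from the original slides), the algorithm above can be implemented directly in R with decision stumps as the weighted base classifier; this assumes the rpart package and labels y coded in {-1, +1}, and the helper names are hypothetical.

# Sketch: AdaBoost.M1 with decision stumps (rpart, maxdepth = 1) as the
# weighted base learner. Assumes labels y in {-1, +1}.
library(rpart)

adaboost_m1 <- function(x, y, T = 50) {
  n <- nrow(x); w <- rep(1 / n, n)                  # Step 0: uniform weights
  stumps <- vector("list", T); alpha <- numeric(T)
  dat <- data.frame(x, y = factor(y))
  for (t in 1:T) {
    # Step 1: fit a stump with the current case weights
    stumps[[t]] <- rpart(y ~ ., data = dat, weights = w, method = "class",
                         control = rpart.control(maxdepth = 1, cp = 0,
                                                 minsplit = 2))
    pred <- as.numeric(as.character(predict(stumps[[t]], dat, type = "class")))
    miss <- as.numeric(pred != y)
    err  <- sum(w * miss) / sum(w)                  # Step 2: weighted error rate
    alpha[t] <- log((1 - err) / err)
    w <- w * exp(alpha[t] * miss)                   # Step 3: reweight observations
  }
  list(stumps = stumps, alpha = alpha)
}

predict_adaboost <- function(fit, x) {
  dat <- data.frame(x)
  scores <- sapply(seq_along(fit$stumps), function(t) {
    fit$alpha[t] *
      as.numeric(as.character(predict(fit$stumps[[t]], dat, type = "class")))
  })
  sign(rowSums(scores))                             # Step 5: weighted vote
}

# Usage: 2-class iris problem with labels in {-1, +1}
x <- iris[, c("Sepal.Length", "Sepal.Width")]
y <- ifelse(iris$Species == "virginica", 1, -1)
fit <- adaboost_m1(x, y, T = 50)
mean(predict_adaboost(fit, x) == y)                 # training accuracy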
Illustrating AdaBoost
[Figures: data points for training and the initial weights for each data point. (Tan, Steinbach & Kumar)]

Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent. Above, we assumed we were solving
\mathrm{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)),
but we didn't say what type of f we considered.

Call the space of all possible base classifiers F_0, and let F be the set of all linear combinations of such classifiers,
F = \left\{ \sum_j \alpha_j f_j : f_j \in F_0 \right\}.
In some sense, the boosting algorithm is a steepest descent algorithm for finding
\mathrm{argmin}_{f \in F} \sum_{i=1}^n L(y_i, f(x_i)).
For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.
Boosting as gradient descent
Consider the problem
\text{minimize}_{f \in F} \sum_{i=1}^n L(y_i, f(x_i)) = L(f).
This can be solved by gradient descent, where the "gradient" \nabla L(f) is restricted to lie in F_0 and is the direction of steepest descent for some stepsize \alpha:
\mathrm{argmin}_{g \in F_0} L(f + \alpha \cdot g).
We might also optimize the stepsize:
\mathrm{argmin}_{g \in F_0, \, \alpha} L(f + \alpha \cdot g).

[Figure: iris data, sepal.length vs. sepal.width, boosting with 1000 trees.]

In this context, the forward stagewise algorithm for steepest descent is:
1. Initialize: f^{(0)}(x) = 0.
2. For t = 1, \ldots, T: solve
(\alpha^{(t)}, \hat{g}^{(t)}) = \mathrm{argmin}_{\alpha, \, g \in F_0} L(f^{(t-1)} + \alpha \cdot g)
and update f^{(t)} = f^{(t-1)} + \alpha^{(t)} \hat{g}^{(t)}.
3. Return the final classifier f^{(T)}.
When L(y, f(x)) = e^{-y f(x)}, this corresponds to AdaBoost.M1. Changing the loss function yields new gradient boosting algorithms.
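A minimal sketch (not from the original slides) of gradient boosting with the gbm package mentioned earlier, using the exponential (AdaBoost) loss; the 0/1 coding of the response and the tuning values are assumptions for illustration.

# Sketch: gradient boosting via gbm with the exponential (AdaBoost) loss.
# For distribution = "adaboost", the response must be coded 0/1.
library(gbm)

iris2 <- iris
iris2$is.virginica <- as.numeric(iris2$Species == "virginica")

set.seed(1)
fit <- gbm(is.virginica ~ Sepal.Length + Sepal.Width, data = iris2,
           distribution = "adaboost",
           n.trees = 1000, interaction.depth = 1, shrinkage = 0.01,
           bag.fraction = 0.5)

# Out-of-bag estimate of the best number of trees.
best <- gbm.perf(fit, method = "OOB")
prob <- predict(fit, iris2, n.trees = best, type = "response")
table(prob > 0.5, iris2$is.virginica)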
[Figures: iris data, sepal.length vs. sepal.width, boosting with 5000 and 10000 trees; {Versicolor} vs. {Setosa, Virginica}.]

Boosting fits an additive model (by default)
[Figures.]
Out-of-bag error improvement
[Figure.]