Support Vector Machine, Random Forests, Boosting
Based in part on slides from textbook, slides of Susan Holmes
December 2, 2012

Neural networks
Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels $y$ from $x$. It tries to find
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{C}} \sum_{i=1}^{n} (Y_i - f(X_i))^2$$
where $\mathcal{C} = \{\text{neural net functions}\}$.

[Figure: Artificial Neural Networks (ANN): single layer]

Single layer neural nets
With $p$ inputs, there are $p + 1$ weights:
$$f(x_1, \ldots, x_p) = S\left(\alpha + x^T \beta\right).$$
Here, $S$ is a sigmoid function like the logistic curve,
$$S(\eta) = \frac{e^\eta}{1 + e^\eta}.$$
The algorithm must estimate $(\alpha, \beta)$ from the data. The classifier assigns $+1$ to anything with $S(\hat\alpha + x^T \hat\beta) > t$ for some threshold like 0.5.
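As a concrete illustration, the single layer classifier can be written directly in base R. This is a minimal sketch, not code from the slides; the weights alpha and beta are assumed to have been estimated already.

S <- function(eta) exp(eta) / (1 + exp(eta))   # logistic sigmoid

single_layer_classify <- function(x, alpha, beta, t = 0.5) {
  # x: n x p matrix of inputs; alpha: intercept; beta: length-p weight vector
  p <- S(alpha + x %*% beta)
  ifelse(p > t, +1, -1)
}

# Toy usage with made-up weights
x <- matrix(rnorm(20), ncol = 2)
single_layer_classify(x, alpha = 0.1, beta = c(1, -2))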

[Figure: Neural networks: double layer]

Double layer neural nets
In the previous figure, there are 15 weights and 3 intercept terms ($\alpha$'s) from the input layer to the hidden layer, and 3 weights and one intercept term from the hidden layer to the output layer. The outputs from the hidden layer have the form
$$S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right), \qquad 1 \le j \le 3.$$
So, the final output is of the form
$$S\left(\alpha_0 + \sum_{j=1}^{3} v_j\, S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right)\right).$$

Fitting this model is done via nonlinear least squares. This is computationally difficult due to the highly nonlinear form of the regression function.
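A hedged sketch of such a fit, not the original code: with a numeric 0/1 response and the default logistic output unit, the nnet package minimizes the squared error, converging only to a local minimum. The data set, feature choice, and settings below are illustrative assumptions.

library(nnet)

set.seed(1)
d <- droplevels(subset(iris, Species != "setosa"))    # a 2-class example
d$y <- as.numeric(d$Species == "versicolor")          # code the labels as 0/1

fit <- nnet(y ~ Sepal.Length + Sepal.Width, data = d,
            size = 3,                 # 3 hidden units, as in the figure
            maxit = 500, trace = FALSE)

# Classify with a 0.5 threshold on the fitted sigmoid output.
table(predicted = as.numeric(predict(fit, d) > 0.5), observed = d$y)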

Support vector machines
Another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels $y$ from $x$ using a linear function. A linear function with an intercept $\alpha$ determines an affine function
$$f_{\alpha,\beta}(x) = x^T\beta + \alpha$$
and a hyperplane
$$H_{\alpha,\beta} = \{x : x^T\beta + \alpha = 0\}.$$
A good classifier will separate the classes into the sides where $f_{\alpha,\beta}(x) > 0$ and $f_{\alpha,\beta}(x) < 0$. The SVM seeks the hyperplane that separates the classes the most (i.e., the maximum margin hyperplane).

[Figures: Support vector machines (six illustration slides)]

The maximum margin problem
The reciprocal of the maximum margin can be written as
$$\inf_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.$$
Therefore, the support vector machine problem is
$$\operatorname*{minimize}_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.$$
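A hedged sketch, not the original code: the hard-margin problem can be approximated with the svm() function from the e1071 package by using a linear kernel and a very large cost on margin violations. The class pair, features, and cost value are illustrative assumptions.

library(e1071)

d <- droplevels(subset(iris, Species != "virginica"))   # keep two of the three classes
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = d,
           kernel = "linear", cost = 1e3, scale = FALSE)

fit$index                    # indices of the support vectors
table(predict(fit, d), d$Species)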

[Figure: What if the problem is not linearly separable?]

Non-separable problems
If we cannot completely separate the classes, it is common to modify the problem to
$$\operatorname*{minimize}_{\beta,\alpha,\xi} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \sum_{i=1}^{n}\xi_i \le C,$$
or the Lagrange version
$$\operatorname*{minimize}_{\beta,\alpha,\xi} \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i, \; \xi_i \ge 0.$$

The $\xi_i$'s can be removed from this problem, yielding
$$\operatorname*{minimize}_{\beta,\alpha} \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+$$
where $(z)_+ = \max(z, 0)$ is the positive part function. Or, equivalently,
$$\operatorname*{minimize}_{\beta,\alpha} \sum_{i=1}^{n} \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+ + \lambda \|\beta\|_2^2.$$

[Figure: Logistic vs. SVM loss functions]
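A plot like this can be reproduced in base R. The sketch below is an assumption about how the two curves were drawn; the logistic curve may have been scaled differently in the original figure.

z <- seq(-3, 3, length.out = 200)
hinge_loss    <- pmax(1 - z, 0)      # SVM hinge loss (1 - z)_+
logistic_loss <- log(1 + exp(-z))    # logistic (deviance-type) loss, up to scaling

plot(z, hinge_loss, type = "l", lty = 1, xlab = "margin  y * f(x)",
     ylab = "loss", main = "Logistic vs. SVM")
lines(z, logistic_loss, lty = 2)
legend("topright", legend = c("SVM (hinge)", "Logistic"), lty = c(1, 2))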

[Figures: Iris data, sepal.length vs. sepal.width (three slides)]

Kernel trick
The term $\|\beta\|_2^2$ can be thought of as a complexity penalty on the function $f_\beta$, so we could write
$$\operatorname*{minimize}_{f_\beta,\alpha} \sum_{i=1}^{n} \bigl(1 - y_i(\alpha + f_\beta(x_i))\bigr)_+ + \lambda \|f_\beta\|_2^2.$$
We could also add such a complexity penalty to logistic regression:
$$\operatorname*{minimize}_{f_\beta,\alpha} \sum_{i=1}^{n} \mathrm{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.$$
So, in both the SVM and penalized logistic regression, we are looking for the best linear function, where "best" is determined by either the logistic loss (deviance) or the hinge loss (SVM).
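As an aside, not from the slides: the ridge-penalized logistic regression above can be fit with the glmnet package (alpha = 0 selects the pure squared-norm penalty; the lambda value below is an arbitrary illustration).

library(glmnet)

d <- droplevels(subset(iris, Species != "setosa"))
x <- as.matrix(d[, c("Sepal.Length", "Sepal.Width")])
y <- d$Species

ridge_logit <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.1)
coef(ridge_logit)                       # intercept (alpha) and coefficients (beta)
head(predict(ridge_logit, x, type = "class"))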

If the decision boundary is nonlinear, one might look for functions other than linear ones. There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve
$$\operatorname*{minimize}_{f \in \mathcal{H}} \sum_{i=1}^{n} \bigl(1 - y_i(\alpha + f(x_i))\bigr)_+ + \lambda \|f\|_{\mathcal{H}}^2$$
or
$$\operatorname*{minimize}_{f \in \mathcal{H}} \sum_{i=1}^{n} \mathrm{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2.$$
The svm function in R allows some such kernels for the SVM.
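A hedged sketch, not the original code, of a kernel SVM fit with the svm() function from the e1071 package; the class pair, features, and tuning values are illustrative assumptions.

library(e1071)

# Two classes and two features, so the decision boundary can be drawn.
iris2 <- droplevels(subset(iris, Species != "setosa",
                           select = c(Sepal.Length, Sepal.Width, Species)))

# Radial (Gaussian) kernel; "cost" corresponds to gamma in the Lagrange
# form above (roughly 1/lambda).
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2,
           kernel = "radial", cost = 1)

table(predicted = predict(fit, iris2), observed = iris2$Species)

# e1071 provides a plot method that shades the two decision regions.
plot(fit, iris2, Sepal.Width ~ Sepal.Length)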

[Figures: Iris data, sepal.length vs. sepal.width (six slides)]

[Figure: Ensemble methods, general idea]

Why use ensemble methods?
Basic idea: weak learners / classifiers are noisy learners. Aggregating over a large number of weak learners should improve the output. This assumes some independence across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should return that same classifier. Ensembles are often some of the best classifiers in terms of performance. Downside: interpretability is lost in the aggregation stage.

Bagging
In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample $S_b$, $1 \le b \le B$, fit a model, retaining the classifier $\hat f_{*,b}$. After all models have been fit, use majority vote:
$$\hat f(x) = \text{majority vote of } \bigl(\hat f_{*,b}(x)\bigr)_{1 \le b \le B}.$$

Out-of-bag error
Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule). Using this observation, we can define the out-of-bag (OOB) estimate of accuracy based on the OOB prediction
$$\hat f_{\mathrm{OOB}}(x_i) = \text{majority vote of } \bigl(\hat f_b(x_i)\bigr)_{b \in B(i)}$$
where $B(i) = \{\text{all trees not built with } x_i\}$. The total OOB error is
$$\mathrm{OOB} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\bigl(\hat f_{\mathrm{OOB}}(x_i) \ne y_i\bigr).$$
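A hedged sketch, not from the slides: bagging classification trees by hand with rpart, including the out-of-bag error estimate defined above. The data subset, features, and number of bootstrap samples are illustrative assumptions.

library(rpart)

set.seed(1)
dat <- droplevels(subset(iris, Species != "setosa"))     # two-class example
n <- nrow(dat)
B <- 100

votes     <- matrix(NA_character_, nrow = n, ncol = B)   # all predictions
oob_votes <- matrix(NA_character_, nrow = n, ncol = B)   # OOB-only predictions

for (b in seq_len(B)) {
  idx  <- sample(n, n, replace = TRUE)                   # bootstrap sample S_b
  tree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = dat[idx, ],
                method = "class")
  pred <- as.character(predict(tree, dat, type = "class"))
  votes[, b] <- pred
  oob <- setdiff(seq_len(n), unique(idx))                # cases not in S_b
  oob_votes[oob, b] <- pred[oob]
}

majority <- function(v) names(which.max(table(v[!is.na(v)])))

f_bag <- apply(votes, 1, majority)          # bagged (majority vote) classifier
f_oob <- apply(oob_votes, 1, majority)      # OOB prediction for each case
mean(f_bag != dat$Species)                  # training error (optimistic)
mean(f_oob != dat$Species)                  # total OOB error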

Random forests: bagging trees
If the weak learners are trees, this method is called a random forest. Some variations on random forests inject more randomness than just bagging, for example:
- taking a random subset of the features at each split;
- taking random linear combinations of the features.

[Figure: Iris data, sepal.length vs. sepal.width, using randomForest]
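A hedged sketch of how such a fit could be produced, not the original code: the randomForest package grows the bagged trees and reports the OOB error automatically. The settings below are illustrative assumptions.

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris,
                   ntree = 500)
rf                        # printed summary includes the OOB error estimate
head(predict(rf, iris))   # class predictions by majority vote over the trees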

Boosting
Unlike bagging, boosting does not insert randomization into the algorithm. Boosting works by iteratively reweighting each observation and refitting a given classifier. To begin, suppose we have a classification algorithm that accepts case weights $w = (w_1, \ldots, w_n)$,
$$\hat f_w = \operatorname*{argmin}_f \sum_{i=1}^{n} w_i L(y_i, f(x_i)),$$
and returns labels in $\{+1, -1\}$.

Boosting algorithm (AdaBoost.M1)
Step 0: Initialize weights $w_i^{(0)} = \frac{1}{n}$, $1 \le i \le n$.
Step 1: For $1 \le t \le T$, fit a classifier with weights $w^{(t-1)}$, yielding
$$\hat g^{(t)} = \hat f_{w^{(t-1)}} = \operatorname*{argmin}_f \sum_{i=1}^{n} w_i^{(t-1)} L(y_i, f(x_i)).$$
Step 2: Compute the total error rate and the classifier weight
$$\mathrm{err}^{(t)} = \frac{\sum_{i=1}^{n} w_i^{(t-1)} \mathbb{1}\bigl(y_i \ne \hat g^{(t)}(x_i)\bigr)}{\sum_{i=1}^{n} w_i^{(t-1)}}, \qquad \alpha^{(t)} = \log\left(\frac{1 - \mathrm{err}^{(t)}}{\mathrm{err}^{(t)}}\right).$$

Step 3: Produce new weights
$$w_i^{(t)} = w_i^{(t-1)} \exp\Bigl(\alpha^{(t)}\, \mathbb{1}\bigl(y_i \ne \hat g^{(t)}(x_i)\bigr)\Bigr).$$
Step 4: Repeat steps 1-3 until termination.
Step 5: Output the classifier
$$\hat f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha^{(t)} \hat g^{(t)}(x)\right).$$
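A hedged sketch of these steps, not from the slides: AdaBoost.M1 with decision stumps, using rpart as the weighted weak learner. It assumes labels coded as +1/-1 and that every stump has weighted error strictly between 0 and 1; the function and variable names are my own.

library(rpart)

adaboost_m1 <- function(x, y, n_rounds = 25) {   # n_rounds is T in the slides
  n <- nrow(x)
  dat <- data.frame(x, y = factor(y))
  w <- rep(1 / n, n)                             # Step 0: uniform weights
  stumps <- vector("list", n_rounds)
  alpha  <- numeric(n_rounds)
  for (t in seq_len(n_rounds)) {
    dat$w <- w
    # Step 1: weighted weak learner (a depth-1 tree, i.e. a stump)
    stumps[[t]] <- rpart(y ~ . - w, data = dat, weights = w, method = "class",
                         control = rpart.control(maxdepth = 1, cp = 0,
                                                 minsplit = 2))
    g <- as.numeric(as.character(predict(stumps[[t]], dat, type = "class")))
    miss <- as.numeric(g != y)
    # Step 2: weighted error rate and classifier weight
    err <- sum(w * miss) / sum(w)
    alpha[t] <- log((1 - err) / err)
    # Step 3: upweight the misclassified observations
    w <- w * exp(alpha[t] * miss)
  }
  list(stumps = stumps, alpha = alpha)
}

# Step 5: the boosted classifier is the sign of the weighted vote.
adaboost_predict <- function(fit, newx) {
  score <- 0
  for (t in seq_along(fit$alpha)) {
    g <- as.numeric(as.character(predict(fit$stumps[[t]], data.frame(newx),
                                         type = "class")))
    score <- score + fit$alpha[t] * g
  }
  sign(score)
}

# Hypothetical usage on a two-class subset of the iris data:
d <- droplevels(subset(iris, Species != "setosa"))
y <- ifelse(d$Species == "versicolor", 1, -1)
fit <- adaboost_m1(d[, c("Sepal.Length", "Sepal.Width")], y)
mean(adaboost_predict(fit, d[, c("Sepal.Length", "Sepal.Width")]) == y)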

[Figures: Illustrating AdaBoost (from Tan, Steinbach, Kumar, Introduction to Data Mining): initial weights for each data point, data points for training]

Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent. Above, we assumed we were solving
$$\operatorname*{argmin}_f \sum_{i=1}^{n} w_i^{(t-1)} L(y_i, f(x_i)),$$
but we didn't say what type of $f$ we considered.

Call the space of all possible classifiers above $\mathcal{F}_0$, and the set of all linear combinations of such classifiers
$$\mathcal{F} = \Bigl\{ \sum_j \alpha_j f_j : f_j \in \mathcal{F}_0 \Bigr\}.$$
In some sense, the boosting algorithm is a steepest descent algorithm to find
$$\operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} L(y_i, f(x_i)).$$
For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.

Consider the problem
$$\operatorname*{minimize}_{f \in \mathcal{F}} \; \mathcal{L}(f) = \sum_{i=1}^{n} L(y_i, f(x_i)).$$
This can be solved by gradient descent, where the "gradient" is restricted to lie in $\mathcal{F}_0$: it is the direction of steepest descent for some stepsize $\alpha$,
$$\operatorname*{argmin}_{g \in \mathcal{F}_0} \mathcal{L}(f + \alpha g).$$

We might also optimize over the stepsize:
$$\operatorname*{argmin}_{g \in \mathcal{F}_0,\, \alpha} \mathcal{L}(f + \alpha g).$$

In this context, the forward stagewise algorithm for steepest descent is:
1. Initialize: $f^{(0)}(x) = 0$.
2. For $t = 1, \ldots, T$: solve
$$\bigl(\alpha^{(t)}, \hat g^{(t)}\bigr) = \operatorname*{argmin}_{\alpha,\, g \in \mathcal{F}_0} \mathcal{L}\bigl(f^{(t-1)} + \alpha g\bigr)$$
and update $f^{(t)} = f^{(t-1)} + \alpha^{(t)} \hat g^{(t)}$.
3. Return the final classifier $f^{(T)}$.
When $L(y, f(x)) = e^{-y f(x)}$ (the exponential loss), this corresponds to AdaBoost.M1. Changing the loss function yields new gradient boosting algorithms.
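A hedged sketch, not the original code, of gradient boosting with the gbm package mentioned above. distribution = "adaboost" uses the exponential loss; "bernoulli" would use the logistic deviance instead. The data subset and tuning values are illustrative assumptions.

library(gbm)

set.seed(1)
d <- droplevels(subset(iris, Species != "setosa"))
d$y <- as.numeric(d$Species == "versicolor")   # gbm expects labels in {0, 1}

boost <- gbm(y ~ Sepal.Length + Sepal.Width, data = d,
             distribution = "adaboost",
             n.trees = 1000,          # T, the number of boosting rounds
             interaction.depth = 1,   # stumps as the weak learners
             shrinkage = 0.01)        # stepsize

# Predicted scores on the link scale; classify by the sign of the score.
score <- predict(boost, d, n.trees = 1000)
table(predicted = as.numeric(score > 0), observed = d$y)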

[Figures: Iris data, sepal.length vs. sepal.width, with 1000, 5000, and 10000 trees]

[Figures: Boosting fits an additive model (by default)]

[Figure: Out-of-bag error improvement]
