Part I, Week 7: Support Vector Machines, Random Forests, Boosting
Based in part on slides from the textbook and slides of Susan Holmes
December 2, 2012


Neural networks

Artificial neural networks (ANN): single layer

A neural network is another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x. It does so by finding
\[
\hat f = \operatorname*{argmin}_{f \in \mathcal{C}} \sum_{i} \bigl(Y_i - f(X_i)\bigr)^2,
\qquad \mathcal{C} = \{\text{neural net functions}\}.
\]

Single layer neural nets

With p inputs there are p + 1 weights:
\[
f(x_1, \dots, x_p) = S\bigl(\alpha + x^T\beta\bigr),
\]
where S is a sigmoid function such as the logistic curve
\[
S(\eta) = \frac{e^\eta}{1 + e^\eta}.
\]
The algorithm must estimate $(\alpha, \beta)$ from data. The classifier assigns +1 to anything with $S(\hat\alpha + x^T\hat\beta) > t$ for some threshold such as t = 0.5.

Neural networks: double layer (one hidden layer)

In the previous figure (5 inputs, 3 hidden units), there are 15 weights and 3 intercept terms ($\alpha$'s) for the input-to-hidden layer, and 3 weights and one intercept term for the hidden-to-output layer. The outputs of the hidden layer have the form
\[
S\Bigl(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\Bigr), \qquad 1 \le j \le 3,
\]
so the final output is of the form
\[
S\Bigl(\alpha_0 + \sum_{j=1}^{3} v_j\, S\bigl(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\bigr)\Bigr).
\]
Fitting this model is done via nonlinear least squares, which is computationally difficult because of the highly nonlinear form of the regression function.
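The slides do not tie this to a particular R function, but a network of exactly this form (one hidden layer, sigmoid units, fit by nonlinear least squares) can be fit with the nnet package. The snippet below is a minimal sketch under that assumption; the two sepal measurements, the 3-unit hidden layer, and the tuning values are illustrative choices, not from the slides.

```r
## Sketch (not from the slides): a single-hidden-layer network with nnet.
library(nnet)

set.seed(1)
iris2 <- iris[iris$Species != "setosa", ]            # keep a 2-class problem
iris2$y <- as.numeric(iris2$Species == "versicolor") # numeric 0/1 response

## size = 3 gives a 3-unit hidden layer (the figure above used 5 inputs, 3 hidden units);
## decay adds a small weight penalty to help the nonlinear least-squares fit converge
fit <- nnet(y ~ Sepal.Length + Sepal.Width, data = iris2,
            size = 3, decay = 0.01, maxit = 500, trace = FALSE)

## classify with threshold t = 0.5 on the sigmoid output
yhat <- as.numeric(predict(fit) > 0.5)
mean(yhat == iris2$y)                                 # training accuracy
```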

Support vector machines

The support vector machine (SVM) is another classifier (or regression technique) for 2-class problems. Like logistic regression, it tries to predict labels y from x using a linear function. A linear function with an intercept $\alpha$ determines an affine function
\[
f_{\alpha,\beta}(x) = x^T\beta + \alpha
\]
and a hyperplane
\[
H_{\alpha,\beta} = \{x : x^T\beta + \alpha = 0\}.
\]
A good classifier will separate the classes into the sides where $f_{\alpha,\beta}(x) > 0$ and $f_{\alpha,\beta}(x) < 0$. The SVM seeks the hyperplane that separates the classes the most.

(Figure: find a linear hyperplane, a decision boundary, that will separate the data.)

(Figures: several other possible separating hyperplanes, e.g. B1 and B2. Which one is better, and how do you define "better"?)

The reciprocal of the maximum margin can be written as
\[
\inf_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.
\]
Therefore the support vector machine problem is
\[
\operatorname*{minimize}_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 \text{ for all } i.
\]
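As an aside (not from the slides), here is the usual one-step justification for why the reciprocal of the margin is $\|\beta\|_2$, written out under the standard normalization of $(\beta, \alpha)$.

```latex
% Sketch, not from the slides. Rescale (\beta, \alpha) so that the closest
% correctly classified points satisfy y_i (x_i^T \beta + \alpha) = 1.
% The distance from x_i to the hyperplane is then
\[
  d\bigl(x_i, H_{\alpha,\beta}\bigr)
    = \frac{\lvert x_i^T \beta + \alpha \rvert}{\lVert \beta \rVert_2}
    \;\ge\; \frac{1}{\lVert \beta \rVert_2},
\]
% with equality at the closest (support) points, so the gap between the two
% classes is 2 / \|\beta\|_2 and
\[
  \max_{\beta,\alpha}\ \text{margin}
  \;\Longleftrightarrow\;
  \min_{\beta,\alpha}\ \lVert \beta \rVert_2
  \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 .
\]
```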

Non-separable problems

What if the problem is not linearly separable? If we cannot completely separate the classes, it is common to modify the problem to
\[
\operatorname*{minimize}_{\beta,\alpha,\xi} \|\beta\|_2
\quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \sum_{i=1}^n \xi_i \le C,
\]
or the Lagrange version
\[
\operatorname*{minimize}_{\beta,\alpha,\xi} \|\beta\|_2^2 + \gamma \sum_{i=1}^n \xi_i
\quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i,\ \xi_i \ge 0.
\]

(Figure: logistic loss vs. SVM hinge loss, plotted as functions of $y\,f(x)$.)

The $\xi_i$'s can be removed from this problem, yielding
\[
\operatorname*{minimize}_{\beta,\alpha} \|\beta\|_2^2 + \gamma \sum_{i=1}^n \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+,
\]
where $(z)_+ = \max(z, 0)$ is the positive part function. Equivalently,
\[
\operatorname*{minimize}_{\beta,\alpha} \sum_{i=1}^n \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+ + \lambda \|\beta\|_2^2.
\]
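As a concrete illustration (my example, not from the slides): this soft-margin problem is what svm() from the e1071 package solves with a linear kernel, and its cost argument plays the role of the tuning parameter trading off margin width against total slack.

```r
## Hedged sketch: linear soft-margin SVM via e1071::svm(); data choice and
## cost values are illustrative, not taken from the slides.
library(e1071)

iris2 <- droplevels(iris[iris$Species != "setosa", ])   # a 2-class problem

fit_small <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2,
                 kernel = "linear", cost = 0.1)   # wide margin, more slack allowed
fit_large <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2,
                 kernel = "linear", cost = 100)   # narrow margin, less slack allowed

table(predict(fit_large, iris2), iris2$Species)   # training confusion matrix
```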

(Figures: iris data, sepal.length vs. sepal.width, with fitted SVM decision boundaries.)

Kernel trick

The term $\|\beta\|_2^2$ can be thought of as a complexity penalty on the function $f_{\alpha,\beta}$, so we could write
\[
\operatorname*{minimize}_{f_\beta,\alpha} \sum_{i=1}^n \bigl(1 - y_i(\alpha + f_\beta(x_i))\bigr)_+ + \lambda \|f_\beta\|_2^2.
\]
We could also add such a complexity penalty to logistic regression:
\[
\operatorname*{minimize}_{f_\beta,\alpha} \sum_{i=1}^n \mathrm{DEV}\bigl(\alpha + f_\beta(x_i), y_i\bigr) + \lambda \|f_\beta\|_2^2.
\]
So in both the SVM and penalized logistic regression we are looking for the best linear function, where "best" is determined either by the logistic loss (deviance) or by the hinge loss (SVM).
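The "Logistic vs. SVM" figure above compares exactly these two losses as functions of the margin $y f(x)$. A small sketch that reproduces the comparison; the scaling of the deviance (so that it equals 1 at a margin of 0) is one common convention and not necessarily the one used in the original figure.

```r
## Sketch: hinge loss vs. logistic deviance as functions of the margin m = y * f(x)
m        <- seq(-3, 3, length.out = 200)
hinge    <- pmax(1 - m, 0)              # (1 - y f(x))_+
logistic <- log(1 + exp(-m)) / log(2)   # logistic deviance, scaled to 1 at m = 0

matplot(m, cbind(logistic, hinge), type = "l", lty = 1,
        col = c("red", "blue"), xlab = "y f(x)", ylab = "loss")
legend("topright", legend = c("Logistic", "SVM (hinge)"),
       col = c("red", "blue"), lty = 1)
```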

If the decision boundary is nonlinear, one might look for functions other than linear ones. There are classes of functions called reproducing kernel Hilbert spaces (RKHS) for which it is theoretically (and computationally) possible to solve
\[
\operatorname*{minimize}_{f \in \mathcal{H}} \sum_{i=1}^n \bigl(1 - y_i(\alpha + f(x_i))\bigr)_+ + \lambda \|f\|_{\mathcal{H}}^2
\]
or
\[
\operatorname*{minimize}_{f \in \mathcal{H}} \sum_{i=1}^n \mathrm{DEV}\bigl(\alpha + f(x_i), y_i\bigr) + \lambda \|f\|_{\mathcal{H}}^2.
\]
The svm function in R allows some such kernels for SVM.

(Figures: iris data, sepal.length vs. sepal.width.)
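For example, a hedged sketch of a nonlinear (radial-kernel) fit with svm(); the kernel choice and tuning values are illustrative guesses, not the settings used for the iris figures.

```r
## Sketch: a nonlinear decision boundary via the kernel trick, using the
## radial basis kernel in e1071::svm() on the two sepal measurements.
library(e1071)

fit_rbf <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris,
               kernel = "radial", gamma = 1, cost = 1)

## plot() for svm objects displays the (nonlinear) decision regions
plot(fit_rbf, iris, Sepal.Width ~ Sepal.Length)
```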

(Figures: iris data, sepal.length vs. sepal.width.)

Ensemble methods: general idea

Why use ensemble methods?

Basic idea: weak learners / classifiers are noisy learners. Aggregating over a large number of weak learners should improve the output. This assumes some independence across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should simply return that classifier. Ensembles are often some of the best classifiers in terms of performance; the downside is that they lose interpretability in the aggregation stage.

Bagging

In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample $S_b$, $1 \le b \le B$, fit a model, retaining the classifier $\hat f_b$. After all models have been fit, use a majority vote:
\[
\hat f(x) = \text{majority vote of } \bigl(\hat f_b(x)\bigr)_{1 \le b \le B}.
\]

Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule). Using this observation, we can define the out-of-bag (OOB) estimate of accuracy based on the OOB prediction
\[
\hat f_{\mathrm{OOB}}(x_i) = \text{majority vote of } \bigl(\hat f_b(x_i)\bigr)_{b \in B(i)},
\]
where $B(i) = \{\text{all trees not built with } x_i\}$. The total OOB error is
\[
\mathrm{OOB} = \frac{1}{n} \sum_{i=1}^n 1\bigl(\hat f_{\mathrm{OOB}}(x_i) \ne y_i\bigr).
\]

Random forests: bagging trees

If the weak learners are trees, this method is called a random forest. Some variations on random forests inject more randomness than just bagging, for example taking a random subset of features, or using random linear combinations of features.
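A minimal sketch of bagged trees with the randomForest package (which the iris figure below refers to); the specific formula and arguments are illustrative, not taken from the slides.

```r
## Sketch: a random forest on the two sepal measurements. The printed
## confusion matrix and error rate are the out-of-bag (OOB) estimates
## described above.
library(randomForest)

set.seed(1)
fit_rf <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris,
                       ntree = 500,   # B bootstrap samples / trees
                       mtry = 1)      # random subset of features at each split
print(fit_rf)           # includes the OOB error estimate
head(predict(fit_rf))   # OOB predictions: majority vote over trees not built with x_i
```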

(Figure: iris data, sepal.length vs. sepal.width, classified using randomForest.)

Boosting

Unlike bagging, boosting does not insert randomization into the algorithm. Boosting works by iteratively reweighting each observation and refitting a given classifier. To begin, suppose we have a classification algorithm that accepts case weights $w = (w_1, \dots, w_n)$,
\[
\hat f_w = \operatorname*{argmin}_f \sum_{i=1}^n w_i\, L\bigl(y_i, f(x_i)\bigr),
\]
and returns labels in $\{+1, -1\}$.

Boosting algorithm (AdaBoost.M1)

Step 0: Initialize the weights $w_i^{(0)} = 1/n$, $1 \le i \le n$.

Step 1: For $1 \le t \le T$, fit a classifier with weights $w^{(t-1)}$, yielding
\[
\hat g^{(t)} = \hat f_{w^{(t-1)}} = \operatorname*{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L\bigl(y_i, f(x_i)\bigr).
\]

Step 2: Compute the total error rate and step size
\[
\mathrm{err}^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)}\, 1\bigl(y_i \ne \hat g^{(t)}(x_i)\bigr)}{\sum_{i=1}^n w_i^{(t-1)}},
\qquad
\alpha^{(t)} = \log\frac{1 - \mathrm{err}^{(t)}}{\mathrm{err}^{(t)}}.
\]

Step 3: Produce new weights
\[
w_i^{(t)} = w_i^{(t-1)} \exp\Bigl(\alpha^{(t)}\, 1\bigl(y_i \ne \hat g^{(t)}(x_i)\bigr)\Bigr).
\]

Step 4: Repeat steps 1-3 until termination.

Step 5: Output the classifier
\[
\hat f(x) = \mathrm{sign}\Bigl(\sum_{t=1}^T \alpha^{(t)} \hat g^{(t)}(x)\Bigr).
\]
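To make the steps concrete, here is a small, self-contained implementation sketch of AdaBoost.M1 using depth-1 rpart trees ("stumps") as the weight-accepting weak learner. The function names, the two-class coding ({versicolor} vs. the rest, as in the later figures), and the tuning values are mine, not from the slides, and no guard is included for the degenerate case err = 0.

```r
## Sketch of AdaBoost.M1 (Steps 0-5 above) with rpart stumps; labels y in {-1, +1}.
library(rpart)

adaboost_m1 <- function(X, y, n_rounds = 100) {   # n_rounds corresponds to T
  n <- nrow(X)
  w <- rep(1 / n, n)                              # Step 0: w_i^(0) = 1/n
  trees <- vector("list", n_rounds)
  alpha <- numeric(n_rounds)
  for (t in 1:n_rounds) {
    d <- data.frame(y = factor(y), X)
    ## Step 1: fit the weak classifier with case weights w^(t-1)
    trees[[t]] <- rpart(y ~ ., data = d, weights = w, method = "class",
                        control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    ghat <- ifelse(predict(trees[[t]], d, type = "class") == "1", 1, -1)
    miss <- as.numeric(ghat != y)
    ## Step 2: weighted error rate and alpha^(t)
    err <- sum(w * miss) / sum(w)
    alpha[t] <- log((1 - err) / err)
    ## Step 3: upweight the misclassified observations
    w <- w * exp(alpha[t] * miss)
  }
  list(trees = trees, alpha = alpha)
}

predict_adaboost <- function(fit, X) {
  d <- data.frame(X)
  scores <- sapply(seq_along(fit$trees), function(t) {
    ghat <- ifelse(predict(fit$trees[[t]], d, type = "class") == "1", 1, -1)
    fit$alpha[t] * ghat
  })
  sign(rowSums(scores))                           # Step 5: sign of the weighted vote
}

## Usage: {versicolor} vs. {setosa, virginica} on the two sepal measurements
X <- iris[, c("Sepal.Length", "Sepal.Width")]
y <- ifelse(iris$Species == "versicolor", 1, -1)
fit <- adaboost_m1(X, y, n_rounds = 100)
mean(predict_adaboost(fit, X) == y)               # training accuracy
```

Using stumps keeps each weak learner genuinely weak, so the reweighting in Step 3 has something left to improve on at each round.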

(Figures: illustrating AdaBoost; the data points used for training and the initial weights for each data point.)

Boosting as gradient descent

It turns out that boosting can be thought of as something like gradient descent. Above, we assumed we were solving
\[
\operatorname*{argmin}_f \sum_{i=1}^n w_i^{(t-1)} L\bigl(y_i, f(x_i)\bigr),
\]
but we didn't say what type of f we considered. Call the space of all possible classifiers above $\mathcal{F}_0$, and let $\mathcal{F}$ be all linear combinations of such classifiers,
\[
\mathcal{F} = \Bigl\{\, \textstyle\sum_j \alpha_j f_j : f_j \in \mathcal{F}_0 \Bigr\}.
\]
In some sense, the boosting algorithm is a steepest descent algorithm to find
\[
\operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n L\bigl(y_i, f(x_i)\bigr).
\]
For more details, see ESL; we will discuss some of these ideas here. In R, this idea is implemented in the gbm package.

Consider the problem
\[
\operatorname*{minimize}_{f \in \mathcal{F}} \sum_{i=1}^n L\bigl(y_i, f(x_i)\bigr) = L(f).
\]
This can be solved by a form of gradient descent in which the "gradient" is restricted to lie in $\mathcal{F}_0$: at each step we take the direction of steepest descent for some fixed stepsize $\alpha$,
\[
\operatorname*{argmin}_{g \in \mathcal{F}_0} L(f + \alpha\, g),
\]
and we might also optimize the stepsize,
\[
\operatorname*{argmin}_{g \in \mathcal{F}_0,\ \alpha} L(f + \alpha\, g).
\]

(Figure: iris data, sepal.length vs. sepal.width, boosting with 1000 trees.)

In this context, the forward stagewise algorithm for steepest descent is:

1. Initialize $f^{(0)}(x) = 0$.
2. For $t = 1, \dots, T$: solve
\[
\bigl(\alpha^{(t)}, \hat g^{(t)}\bigr) = \operatorname*{argmin}_{\alpha,\ g \in \mathcal{F}_0} L\bigl(f^{(t-1)} + \alpha\, g\bigr)
\]
and update $f^{(t)} = f^{(t-1)} + \alpha^{(t)} \hat g^{(t)}$.
3. Return the final classifier $f^{(T)}$.

When $L(y, f(x)) = e^{-y f(x)}$, this corresponds to AdaBoost.M1. Changing the loss function yields new gradient boosting algorithms.
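A hedged sketch of the same idea with the gbm package mentioned above; the formula, number of trees, and shrinkage are illustrative choices, not those used for the figures. The two-class coding matches the later {Versicolor} vs. {Setosa, Virginica} plots.

```r
## Sketch: gradient boosting with gbm. With distribution = "adaboost" the loss
## is exp(-y f(x)), matching AdaBoost.M1 above; "bernoulli" would give the
## logistic deviance instead.
library(gbm)

set.seed(1)
iris_b <- data.frame(y = as.numeric(iris$Species == "versicolor"),  # {0, 1} labels
                     iris[, c("Sepal.Length", "Sepal.Width")])

fit_gbm <- gbm(y ~ ., data = iris_b, distribution = "adaboost",
               n.trees = 1000, interaction.depth = 1, shrinkage = 0.01,
               bag.fraction = 1)                   # no extra randomization

## predictions on the f(x) scale; classify by the sign of f, i.e. threshold at 0
f_hat <- predict(fit_gbm, iris_b, n.trees = 1000)
mean((f_hat > 0) == (iris_b$y == 1))               # training accuracy
```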

(Figures: iris data, sepal.length vs. sepal.width, boosting with 5000 and 10000 trees, {Versicolor} vs. {Setosa, Virginica}; and two slides titled "Boosting fits an additive model (by default)".)

(Figure: out-of-bag error improvement.)