# CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

Size: px
Start display at page:

Download "CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )"

## Transcription

1 CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive algorithm for basis function selection. A recursive, binary partition (a tree) is given by a list of splits {(t 01 ), (t 11, t 12 ), (t 21, t 22, t 23, t 24 ),..., (t n1,..., t n2 n)} and corresponding split variable indices {(i 01 ), (i 11, i 12 ), (i 21, i 22, i 23, i 24 ),..., (i n1,..., i n2 n)} R 1 = (x i01 < t 01 ) (x i11 < t 11 )... (x in1 < t n1 ) and we can determine if x R 1 in n steps M = 2 n. All the remaining sets in the partition corresponding to the 2 n leafs are determined similarly and recursively. It is by far easier to draw a picture of the corresponding tree than to write down the precise mathematical recursion in terms of indices. A point is that the algorithmic complexity of determining which of the 2 n sets in the partition an x belongs to scales with n. Thus even for extremely large partitions we can very rapidly determine where a concrete x belongs. In practice not all of the 2 n partitions need to be present. The tree can be pruned so that some of the leaf are not at depth n. Figure 9.2 Recursive Binary Partitions The recursive partition of [0, 1] 2 above and the representation of the partition by a tree. A binary tree of depth n can represent up to 2 n partitions/basis functions. We can determine which R j an x belongs to by n recursive yes/no questions. Figure 9.2 General Partitions A general partition that can not be represented as binary splits. With M sets in a general partition we would in general need of the order M yes/no questions to determine which of the sets an x belongs to. Figure 9.2 Recursive Binary Partitions For a fixed partition R 1,..., R M the least squares estimates are ĉ m = ȳ(r m ) = 1 N m i:x i R m y i 1

2 N m = {i x i R m }. The recursive partion allows for rapid computation of the estimates and rapid predition of new observations. Greedy Splitting Algorithm With squared error loss and an unknown partition R 1,..., R M we would seek to minimize (y i ȳ(r m(i) )) 2 over the possible binary, recursive partitions. But this is computationally difficult. An optimal single split on a region R is determined by min min (y i ȳ(r(j, s))) 2 + (y i ȳ(r(j, s) c )) 2 j s i:x i R(j,s) i:x i R(j,s) }{{ c } univariate optimization problem with R(j, s) = {x R x j < s} The tree growing algorithm recursively does single, optimal splits on each of the partitions obtained in the previous step. Note that the complements above are taken within the region R, that is, R(j, s) c = R\R(j, s). Tree Pruning The full binary tree, T 0, representing the partitions R 1,..., R M with M = 2 n may be too large. We prune it by snipping of leafs or subtrees. For any subtree T of T 0 with T leafs and partition R 1 (T ),..., R T (T ) the cost-complexity of T is C α (T ) = (y i ȳ(r m(i) (T ))) 2 + α T. Theorem 1. There is a finite set of subtrees T 0 T α1 T α2... T αr with 0 α 1 < α 2 <... < α r such that T αi minimizes C α (T ) for α [α i, α i+1 ) A proof of the theorem above can be found in Section 7.2 in Pattern Recognition and Neural Networks by Brian D. Ripley. The practical consequence of the theorem above is that tree algorithms find this sequence of pruned trees in the estimation process and then the choice of α can be determined by validation or cross-validation. We do not need to consider other choices of α than those corresponding to the precomputed pruned trees. 2

3 Node Impurities and Classification Trees Define the node impurity as the average loss for the node R The greedy split of R is found by Q(R) = 1 N(R) i:x i R (y i ȳ(r)) 2 min min (N(R(j, s))q(r(j, s)) + N(R(j, s) c )Q(R(j, s) c )) j s with R(j, s) = {x R x j < s} and we have C α (T ) = T m=1 N(R m (T ))Q(R m (T )) + α T. If Y takes K discrete values we focus on the node estimate for R m (T ) in tree T as being ˆp m (T )(k) = 1 N m i:x i R m(t ) 1(y i = k) Node Impurities and Classification Trees The loss functions for classification enter in the specification of the node impurities used for splitting an cost-complexity computations. Examples of 0-1 loss gives misclassification error impurity: Q(R m (T )) = 1 max{ˆp(r m (T ))(1),..., ˆp(R m (T ))(K)} likelihood loss gives entropy impurity: The Gini index impurity: Q(R m (T )) = Q(R m (T )) = K ˆp(R m (T ))(k) log ˆp(R m (T ))(k) k=1 K ˆp(R m (T ))(k)(1 ˆp(R m (T ))(k)) k=1 Note that it is certainly not always the case that all K different cases are present in a single set R m, hence some of the estimated conditional probabilities are 0 and we have to remember that 0 log 0 = 0 yields a continuous extension of p log p in 0. The Gini index can be thought of in the following way: With the 0-1 loss... 3

4 Figure 9.3 Node Impurities Spam Example Spam Example Spensitivity and Specificity The sensitivity is the probability of predicting 1 given that the true value is 1 (predict a case given that there is a case). sensitivity = Pr(f(X) = 1 Y = 1) = Pr(Y = 1, f(x) = 1) Pr(Y = 1, f(x) = 1) + Pr(Y = 1, f(x) = 0) The specificity is the probability of predicting 0 given that the true value is 0 (predict that there is no case given that there is no case). specificity = Pr(f(X) = 0 Y = 0) = Pr(Y = 0, f(x) = 0) Pr(Y = 0, f(x) = 0) + Pr(Y = 0, f(x) = 1) ROC curves The reciever operating characteristic or ROC curve. Ensembles of Weak Predictors A weak predictor is a predictor that performs only a little better than random guessing. With an ensemble or collection of weak predictors ˆf 1,..., ˆf B predictions, e.g. as ˆf B = 1 B ˆf b B hoping to improve performance. b=1 we seek to combine their 4

5 Bootstrap aggregation or Bagging is an example where the ensemble of preditors are obtained by estimation of the predictor on bootstrapped data sets. Bagging is treated in Section 8.7 in the book. Note the weak predictor is often taken to mean simple predictor where again simple refers to a predictor with few parameters, which has low variance and besides from special cases considerable bias. There is, however, nothing that prevent us from considering ensembles of weak predictors where weak refers to high variance as opposed to high bias. Combining Weak Predictors Recall that V ( ˆf B (x)) = 1 B B 2 V ( ˆf b (x)) + 1 B 2 b=1 b b cov( ˆf b (x), ˆf b(x)) hence bagging can be improved if the preditors can be de-correlated. Random Forests (Chapter 15) is a modification of bagging for trees where the bagged trees are de-correlated. The problem of ensemble learning is broken down into the The selection of base learners. The combination of the base learners. Trees as Ensemble Learners The basis expansion techniques can be seen as ensemble learning (with or without regularization) where we have specified the base learners a priori. For trees we build and combine sequentially and recursively the simplest base learners; the stumps or single splits. Are there general ways to search the space of learners and combinations of simple learners? Stagewise Additive Modeling With b(, γ) for γ Γ a parameterized family of basis functions we can seek expansions of the form M β m b(x, γ m ) m=1 With fixed γ m this is standard, with unrestricted γ m this is in general very difficult numerically. Suggetion: Evolve the expansions in stages where (β m, γ m ) are estimated in step m and then fixed forever. 5

6 Boosting With any loss function L the Forward Stagewise Additive Model is estimated by the algorithm: 1. Set m = 1 and initialize with ˆf 0 (x) = Compute ( ˆβ m, ˆγ m ) = argmin β,γ L(y i, ˆf m 1 (x i ) + βb(x i, γ)) 3. Set f m = f m 1 + ˆβ m b(, ˆγ m ), m = m + 1 and return to 2. Note that with squared error loss L(y i, ˆf m 1 (x i ) + βb(x i, γ)) = ((y i ˆf m 1 (x i )) βb(x i, γ)) 2 every estimation step is a reestimation on the residuals. Base Classifiers With Y { 1, 1} and any classifier G(x) { 1, 1} the misclassification error is err(g) = 1 N 1(y i G(x i )) = 1 2N (1 y i G(x i )) With G a class of classifiers the (unweigted) optimal classifier is Ĝ = argmin err(g) G G With w 1,..., w N 0 the weighted optimal classifier is Ĝ = argmin G G w i 1(y i G(x i )) A simple class of classifiers is the class of stumps trees with only two leafs. A stump is given simply by a pair (i, t) of the splitting variable and the split point, and if we use the misclassification node impurity the optimization over (i, t) is precisely the optimization we carry out when we make each greedy step in the algorithm for estimation of trees. Surrogate Loss Functions Most important property of the surrogate loss functions is that they are convexifications of the 0-1-loss. 6

7 AdaBoost Classification with Exponential Loss With exponential loss L(y, f(x)) = exp( yf(x)) and with w (m) i = exp( y i f m 1 (x i )) L(y i, ˆf m 1 (x i ) + βg(x i )) = = (e β e β ) w (m) i exp( y i βg(x i )) w (m) i 1(y i G(x i )) + e β The minimizer is Ĝm = argmin G G N w(m) i 1(y i G(x i )), The updated weights in step m + 1 are ˆβ m = 1 2 log 1 err m err m w (m+1) i = w (m) i exp( y i ˆβm Ĝ m (x i )) = w (m) i exp(2 ˆβ m 1(y i Ĝm(x i ))) exp( ˆβ m ). w (m) i The weights in the AdaBoost algorithm evolve over time multiplying larger weights in the m th step on those pairs (x i, y i ) that are misclassified by the classifier in the m th step. Figure 10.1 Schematic AdaBoost M G(x) = α m G m (x) m=1 1. Initialize with weights w i = 1/N and set m = 1 and fix M. 2. Fit a classifier G m using weights w i. 3. Recompute weights as where α m = log((1 err m )/err m ) and w i w i exp(α m 1(y i G m (x i ))) err m = 1 N w i w i 1(y i G(x i )). 4. Stop if m = M or set m m + 1 and return to 2 7

8 Figure 10.2 and 10.3 Boosting using stumps only can outperform even large trees in terms of test error (simulation). Even when the misclassification error is 0 on the training data it can pay to continue the boosting and the exponential loss will continue to decrease. More Boosting The computational problem in boosting is minimization of L(y i, ˆf m 1 (x i ) + βb(x i, γ)). For classification with exponential loss this simplifies to weighted optimal classification. For regression and squared error loss this is re-estimation based on the residuals. With the notation L(f) = L(y i, f i ) for f = (f 1,..., f N ) R N we aim at finding steps h 1,..., h M and with h 0 the initial guess an approximate minimizer of the form f M = M h m. m=0 Gradient Boosting The gradient of L : R N R is L(f) = ( z L(y 1, f 1 ),..., z L(y N, f N )) T Gradient descent algorithms suggest steps from f m in the direction of L(f m ); h m = ρ m L(f m ). Problem: ρ m L(f m ) is most likely not obtainable as a prediction within the class of base learners it is not of the form β(b(x 1, γ),..., b(x N, γ)) T. Solution: Fit a base learner ĥm to L(f m ) and compute by iteration the expansion M ˆf M = ρ m ĥ m. m=0 8

9 This is gradient boosting as implemented in the mboost library. More information on gradient boosting and the mboost R-package can be found in: Peter Bühlmann and Torsten Hothorn. Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science 2007, Vol. 22, No. 4, There is also an interesting discussion following the paper. 9

### CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions f (x) = M c m 1(x R m ) m=1 with R 1,..., R m R p disjoint. The CART algorithm is a heuristic, adaptive

### Kernel Density Estimation

Kernel Density Estimation If Y {1,..., K} and g k denotes the density for the conditional distribution of X given Y = k the Bayes classifier is f (x) = argmax π k g k (x) k If ĝ k for k = 1,..., K are

### Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13

Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y

### Chapter 6. Ensemble Methods

Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction

### Chapter 14 Combining Models

Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

### Support Vector Machine, Random Forests, Boosting Based in part on slides from textbook, slides of Susan Holmes. December 2, 2012

Support Vector Machine, Random Forests, Boosting Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Neural networks Neural network Another classifier (or regression technique)

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

### Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes

Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes Support Vector Machine, Random Forests, Boosting December 2, 2012 1 / 1 2 / 1 Neural networks Artificial Neural networks: Networks

### Lecture 13: Ensemble Methods

Lecture 13: Ensemble Methods Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu 1 / 71 Outline 1 Bootstrap

### Ensemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12

Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose

### Statistics and learning: Big Data

Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

### EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

### Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class

### ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

### Decision Trees: Overfitting

Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

### Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

### STK-IN4300 Statistical Learning Methods in Data Science

Outline of the lecture STK-IN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no AdaBoost Introduction algorithm Statistical Boosting Boosting as a forward stagewise additive

### Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24

Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

### Boosting Methods: Why They Can Be Useful for High-Dimensional Data

New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

### Recitation 9. Gradient Boosting. Brett Bernstein. March 30, CDS at NYU. Brett Bernstein (CDS at NYU) Recitation 9 March 30, / 14

Brett Bernstein CDS at NYU March 30, 2017 Brett Bernstein (CDS at NYU) Recitation 9 March 30, 2017 1 / 14 Initial Question Intro Question Question Suppose 10 different meteorologists have produced functions

### Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

### Day 3: Classification, logistic regression

Day 3: Classification, logistic regression Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 20 June 2018 Topics so far Supervised

### CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#\$

### VBM683 Machine Learning

VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

### Machine Learning. Ensemble Methods. Manfred Huber

Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected

### Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each

### SF2930 Regression Analysis

SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression

### BAGGING PREDICTORS AND RANDOM FOREST

BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS

### Announcements Kevin Jamieson

Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

### Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

### mboost - Componentwise Boosting for Generalised Regression Models

mboost - Componentwise Boosting for Generalised Regression Models Thomas Kneib & Torsten Hothorn Department of Statistics Ludwig-Maximilians-University Munich 13.8.2008 Boosting in a Nutshell Boosting

### Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

### Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive

### Informal Definition: Telling things apart

9. Decision Trees Informal Definition: Telling things apart 2 Nominal data No numeric feature vector Just a list or properties: Banana: longish, yellow Apple: round, medium sized, different colors like

### Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

### EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

### Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

### Decision Tree Learning Lecture 2

Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

### Decision trees COMS 4771

Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

### Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

### CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

### the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

### Boosting. March 30, 2009

Boosting Peter Bühlmann buhlmann@stat.math.ethz.ch Seminar für Statistik ETH Zürich Zürich, CH-8092, Switzerland Bin Yu binyu@stat.berkeley.edu Department of Statistics University of California Berkeley,

### The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

### Why does boosting work from a statistical view

Why does boosting work from a statistical view Jialin Yi Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 939 jialinyi@sas.upenn.edu Abstract We review boosting

### Importance Sampling: An Alternative View of Ensemble Learning. Jerome H. Friedman Bogdan Popescu Stanford University

Importance Sampling: An Alternative View of Ensemble Learning Jerome H. Friedman Bogdan Popescu Stanford University 1 PREDICTIVE LEARNING Given data: {z i } N 1 = {y i, x i } N 1 q(z) y = output or response

### Voting (Ensemble Methods)

1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

### Regularization Paths

December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

### Neural Networks and Deep Learning

Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

### ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

### CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

### Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

### Oliver Dürr. Statistisches Data Mining (StDM) Woche 11. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

Statistisches Data Mining (StDM) Woche 11 Oliver Dürr Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften oliver.duerr@zhaw.ch Winterthur, 29 November 2016 1 Multitasking

### Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

### Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

### SPECIAL INVITED PAPER

The Annals of Statistics 2000, Vol. 28, No. 2, 337 407 SPECIAL INVITED PAPER ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING By Jerome Friedman, 1 Trevor Hastie 2 3 and Robert Tibshirani 2

### Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

### 10701/15781 Machine Learning, Spring 2007: Homework 2

070/578 Machine Learning, Spring 2007: Homework 2 Due: Wednesday, February 2, beginning of the class Instructions There are 4 questions on this assignment The second question involves coding Do not attach

### Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7

Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector

### Learning with multiple models. Boosting.

CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

### day month year documentname/initials 1

ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

### Machine Learning 2nd Edi7on

Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e

### PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

### TDT4173 Machine Learning

TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

### TDT4173 Machine Learning

TDT4173 Machine Learning Lecture 9 Learning Classifiers: Bagging & Boosting Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline

### Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

### A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

### FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

### CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

### CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

### Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

### Harrison B. Prosper. Bari Lectures

Harrison B. Prosper Florida State University Bari Lectures 30, 31 May, 1 June 2016 Lectures on Multivariate Methods Harrison B. Prosper Bari, 2016 1 h Lecture 1 h Introduction h Classification h Grid Searches

### Ensemble learning 11/19/13. The wisdom of the crowds. Chapter 11. Ensemble methods. Ensemble methods

The wisdom of the crowds Ensemble learning Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions! Chapter 11 The paper in which

### Application of machine learning in manufacturing industry

Application of machine learning in manufacturing industry MSC Degree Thesis Written by: Hsinyi Lin Master of Science in Mathematics Supervisor: Lukács András Institute of Mathematics Eötvös Loránd University

### 15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Decision trees and interpretable models J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Decision trees Training (classification) decision trees Interpreting

### Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

### 10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

### Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

### Learning Decision Trees

Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

### Combining Classifiers

Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

### Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

### Learning Decision Trees

Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

### CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Dimension Reduction Using Rule Ensemble Machine Learning Methods: A Numerical Study of Three Ensemble Methods

Dimension Reduction Using Rule Ensemble Machine Learning Methods: A Numerical Study of Three Ensemble Methods Orianna DeMasi, Juan Meza, David H. Bailey Lawrence Berkeley National Laboratory 1 Cyclotron

### Ensembles of Classifiers.

Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom

### JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA

1 SEPARATING SIGNAL FROM BACKGROUND USING ENSEMBLES OF RULES JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 E-mail: jhf@stanford.edu

### Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

### Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets

Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Nan Zhou, Wen Cheng, Ph.D. Associate, Quantitative Research, J.P. Morgan nan.zhou@jpmorgan.com The 4th Annual

### ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

### Lecture 3: Decision Trees

Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

### Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

### StatPatternRecognition: A C++ Package for Multivariate Classification of HEP Data. Ilya Narsky, Caltech

StatPatternRecognition: A C++ Package for Multivariate Classification of HEP Data Ilya Narsky, Caltech Motivation Introduction advanced classification tools in a convenient C++ package for HEP researchers

### Constructing Prediction Intervals for Random Forests

Senior Thesis in Mathematics Constructing Prediction Intervals for Random Forests Author: Benjamin Lu Advisor: Dr. Jo Hardin Submitted to Pomona College in Partial Fulfillment of the Degree of Bachelor