SF2930 Regression Analysis


SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classification 20 February 2017 1 / 30

Today Overview Regression trees Pruning Bagging, random forests 2 / 30

Today Overview Regression trees Pruning Bagging, random forests 3 / 30

Setup Let Y be a real-valued random variable. Given x ∈ X ⊆ R^d, we want to predict the value of Y. To understand how different values of x influence Y, we have N ∈ ℕ examples (x_i, y_i), i ∈ [1..N]. The variable Y can take discrete or continuous values, corresponding respectively to classification and regression problems. We consider here binary decision trees. A decision tree partitions X into simple regions (R_j), j ∈ [1..J]. For x = (x_1, ..., x_d) ∈ R^d and j such that x ∈ R_j, the value predicted by the tree is the most represented class among {y_i : x_i ∈ R_j} in classification problems, and the average of {y_i : x_i ∈ R_j, i ∈ [1..N]} in regression problems. 4 / 30

Setup Each node n of the tree is associated with a subset R_n of X, which is further split into two regions through a test comparing the coordinate x_k, for some k ∈ [1..d], with a value c ∈ R. Then, for l and r the left and right children of n, R_l := {x ∈ R_n : x_k < c} and R_r := {x ∈ R_n : x_k ≥ c}. Figure: A regression tree for predicting the log salary (in 1,000$) of a baseball player, based on the number of years played and the number of hits made in the previous year; the tree splits on Years < 4.5 (leaf value 5.11) and then on Hits < 117.5 (leaf values 6.00 and 6.74). 5 / 30
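
As a small illustration (not part of the original slides), a sketch of how the tree in the figure turns its two splits into a prediction; the thresholds 4.5 and 117.5 and the leaf values 5.11, 6.00, 6.74 are read off the figure.

```python
import math

def predict_log_salary(years, hits):
    """Follow the splits of the small baseball tree: Years < 4.5 first, then Hits < 117.5."""
    if years < 4.5:
        return 5.11
    if hits < 117.5:
        return 6.00
    return 6.74

# A player with 6 years and 150 hits falls in the right-most leaf:
log_salary = predict_log_salary(6, 150)     # 6.74
print(1000 * math.exp(log_salary))          # about 846,000 dollars, since log salary is in 1,000$
```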

Figure: The three-region partition (R_1, R_2, R_3) of the predictor space, with Years on the horizontal axis (cut at 4.5) and Hits on the vertical axis (cut at 117.5), provided by the regression tree in the previous figure. 6 / 30

Example: spam classification Figure: Tree for spam classification. Split variables are shown in blue (e.g. ch$, remove, hp, ch!, george, free, business, receive, edu, our, 1999, CAPAVE, CAPMAX). CAPMAX and CAPAVE = maximal and average length of the longest uninterrupted sequences of capital letters, respectively. 7 / 30

Questions How do we construct such trees? How large should a tree be? What are the pros and cons of tree-based methods? 8 / 30

Today Overview Regression trees Pruning Bagging, random forests 9 / 30

Regression trees vs. linear regression We generally want to approximate a relation Y = f(x) + ε. Linear regression involves linear functions f(x) = β_0 + Σ_{j=1}^p β_j x_j. We will instead consider f(x) = Σ_{j=1}^J m_j 1_{R_j}(x), for a partition {R_j}, j ∈ [1..J], of X. 10 / 30

Building a regression tree: cost function We will construct the tree by minimizing RSS(m_1, ..., m_J, R_1, ..., R_J) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{j=1}^J Σ_{i: x_i ∈ R_j} (y_i − m_j)². For each j, the m_j minimizing the sum of squares Σ_{i: x_i ∈ R_j} (y_i − m_j)² is always m_j = ȳ_{R_j} = the average over all y_i such that x_i ∈ R_j. Finding the minimizing regions R_j is however computationally infeasible in general. 11 / 30

Building a regression tree: top-down minimization Thus, in order to find regions {R_j}_{j=1}^J minimizing Σ_{j=1}^J Q_{R_j}, where Q_{R_j} = Σ_{i: x_i ∈ R_j} (y_i − ȳ_{R_j})², we proceed with a top-down, greedy approach. In the first step, when splitting X into R_1(k, s) = {x ∈ X : x_k < s} and R_2(k, s) = {x ∈ X : x_k ≥ s}, we minimize Q_{R_1(k,s)} + Q_{R_2(k,s)} with respect to k and s. 12 / 30

Building a regression tree: top-down minimization (cont.) For a given k, the optimal s is quite easily found by scanning through the data. Repeating for all k ∈ [1..d] yields an optimal pair (k, s). Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of these, and so forth. We stop growing the tree when some stopping criterion is reached, e.g., when no region contains more than five training points. 13 / 30
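
A minimal sketch of this greedy split search, assuming numpy (the names below are mine, not from the slides): for each predictor k it scans candidate thresholds s and keeps the pair minimizing Q_{R_1(k,s)} + Q_{R_2(k,s)}.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (k, s) minimizing Q_{R1} + Q_{R2},
    where Q_R is the sum of squared deviations from the region mean."""
    d = X.shape[1]
    best_k, best_s, best_cost = None, None, np.inf
    for k in range(d):
        for s in np.unique(X[:, k]):
            left, right = y[X[:, k] < s], y[X[:, k] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if cost < best_cost:
                best_k, best_s, best_cost = k, s, cost
    return best_k, best_s, best_cost

# Toy usage: y jumps when the first predictor crosses 0.3, which the split should recover.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] < 0.3, 1.0, 5.0) + 0.1 * rng.normal(size=50)
print(best_split(X, y))
```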

Bias versus variance The tree size is a tuning parameter governing the model's complexity. Too large a tree leads to overfitting (high variance on test sets), while too small a tree leads to high bias. The optimal tree size should be determined adaptively from the data. One possibility is to split a node only if the split yields an RSS reduction exceeding a certain threshold; this is however treacherous in top-down minimization, as an apparently worthless split may enable a very good split further down the tree. 14 / 30

Today Overview Regression trees Pruning Bagging, random forests 15 / 30

Cost complexity pruning Grow a large tree T_0 while storing the different Q_{R_j} associated with each split. We define a subtree T ⊆ T_0 as one obtained by pruning T_0 in a bottom-up fashion. Let |T| denote the number of leaves R_m(T) in T. Define, for some α ≥ 0, the cost function C_α(T) = Σ_{m=1}^{|T|} Q_{R_m(T)} + α|T|, where the Q_{R_m(T)} are available from the first step. The idea is to find a subtree T_α minimizing C_α(T). 16 / 30

Cost complexity pruning (cont.) Large values of α lead to smaller trees; for α = 0, the solution is T_0. To find T_α, we create a sequence of subtrees by successively removing the leaves whose elimination leads to the smallest increase in Σ_{m=1}^{|T|} Q_{R_m(T)}. To find the optimal α, apply K-fold cross-validation. 17 / 30
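
In practice one rarely codes the weakest-link search by hand; a hedged sketch, assuming scikit-learn is acceptable here, of choosing α by K-fold cross-validation over the pruning path of a large tree:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Candidate alphas from the pruning path of a large, unpruned tree T_0
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# K-fold cross-validation over the candidate alphas
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha)
```

The selected best_alpha would then be passed as ccp_alpha when refitting the pruned tree on the full training set.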

Example: baseball player salaries (cont.) Figure: Unpruned tree for the log salary (in 1,000$) of a baseball player; the splits use Years, Hits, RBI, Putouts, Walks and Runs. 18 / 30

Trees vs. linear models Figure: True decision boundary indicated by the shaded regions (axes X_1 and X_2); fit by a linear boundary vs. by a decision tree (boxes). 19 / 30

Node impurity The task of growing a classification tree is very similar to that of growing a regression tree. The only difference is that the RSS is not meaningful any longer, so we need alternative measures Q_m(T) of node impurity. For this purpose, define p̂_{j,k} = (1/|R_j|) Σ_{i: x_i ∈ R_j} 1{y_i = k} and k(j) = argmax_k p̂_{j,k}, and let, e.g., Q_m(T) = 1 − p̂_{j,k(j)} (misclassification error), Q_m(T) = Σ_{k=1}^K p̂_{j,k}(1 − p̂_{j,k}) (Gini index), or Q_m(T) = −Σ_{k=1}^K p̂_{j,k} log p̂_{j,k} (cross-entropy). 20 / 30
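
A small sketch (assuming numpy; not from the slides) that computes the three impurity measures for one node from the labels of the training points falling into it:

```python
import numpy as np

def node_impurities(labels):
    """Misclassification error, Gini index and cross-entropy for one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()               # estimated class proportions p_hat
    misclass = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    entropy = -np.sum(p * np.log(p))        # p > 0 by construction, so log is safe
    return misclass, gini, entropy

print(node_impurities([0, 0, 0, 1, 1, 2]))
```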

Node impurity measures Figure: The node impurity measures (misclassification error, Gini index, entropy) for two-class classification, as a function of the proportion p in class 2. (Cross-entropy has been scaled.) 21 / 30

Pros and cons of tree-based methods Pros: easy to explain; able to handle categorical as well as numerical data; scale well with N. Cons: do not, in their most basic form, reach the same level of predictive accuracy as some other regression and classification approaches; can be very non-robust (i.e., small changes in the data lead to completely different trees); highly sensitive to the data representation, making them ill-suited to some problems. The predictive accuracy can, however, be drastically improved using ideas such as bagging and random forests, discussed in the following. 22 / 30

Today Overview Regression trees Pruning Bagging, random forests 23 / 30

Prelude: the bootstrap Given an i.i.d. sample {x_i}_{i=1}^N from a probability density p, the bootstrap algorithm is based on the approximation p̂(x) = (1/N) Σ_{i=1}^N δ_{x_i}(x) of p. Given p̂, a new sample set {x*_i}_{i=1}^N with approximately the same distribution as {x_i}_{i=1}^N can be formed by sampling from p̂, i.e., by drawing N times among {x_i}_{i=1}^N with replacement. In machine learning applications, this can be used for creating artificial replicates of the training set. 24 / 30
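
A minimal numpy sketch (my own illustration) of drawing one bootstrap replicate of a training set, i.e., sampling N indices with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # original training inputs
y = rng.normal(size=10)               # original training responses

# Draw N indices with replacement, i.e. sample N points from p_hat
idx = rng.integers(0, len(y), size=len(y))
X_star, y_star = X[idx], y[idx]       # bootstrap replicate of the training set
```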

Bagging In the case of regression, we have approximated the target function using f(x) = Σ_{j=1}^J ȳ_{R_j} 1_{R_j}(x), where the partition {R_j}_{j=1}^J is formed using a decision tree. The decision trees discussed above generally suffer from high variance, i.e., building the tree on a different training set may lead to a quite different f. Better would be to use B independent training sets of similar size, each yielding a tree f_b, and use the model f_avg(x) = (1/B) Σ_{b=1}^B f_b(x). 25 / 30

Bagging (cont.) We use bootstrap sampling to artificially create B such training sets, each yielding a tree f*_b, and use the model f_bag(x) = (1/B) Σ_{b=1}^B f*_b(x). The trees f*_b are grown deep without pruning, implying that each tree has high variance but low bias. In the classification case, we instead record the class predicted by each of the B trees and take the majority vote. 26 / 30
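
A sketch of bagging regression trees by hand, assuming scikit-learn decision trees as base learners (the library's own BaggingRegressor wraps the same idea):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

B = 100
trees = []
for _ in range(B):
    idx = rng.integers(0, len(y), size=len(y))     # bootstrap replicate of the training set
    tree = DecisionTreeRegressor()                 # grown deep, no pruning
    trees.append(tree.fit(X[idx], y[idx]))

def f_bag(x):
    """Average the predictions of the B deep trees."""
    x = np.atleast_2d(x)
    return np.mean([t.predict(x) for t in trees], axis=0)

print(f_bag(X[:3]))
```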

Bagging (cont.) The number of trees B is not a critical parameter with bagging; using a very large value of B will not lead to overfitting. The reason is that the probability of making an error converges as B → ∞. This probability also depends on the expected correlation between the errors of different trees f_b(x) and f_{b'}(x) as (x, Y) varies. 27 / 30

Some theory Indeed, in the classification case, let the margin function be m(x, y) = (1/B) Σ_{b=1}^B 1{f_b(x) = y} − max_{k ≠ y} (1/B) Σ_{b=1}^B 1{f_b(x) = k}, and define the generalization error by PE = P(m(x, Y) < 0). Breiman (2001) shows that PE converges as B → ∞. In addition, PE can be bounded as PE ≤ (1 − s²) ρ̄ / s², where, roughly, s is the limiting margin function (the "strength") and ρ̄ describes the expected correlation between the trees' errors as (x, Y) varies. 28 / 30

Random forests In order to decorrelate the trees, it is desirable to build them differently. Say that there is one strong predictor x_k along with a number of other moderately strong predictors: most or all of the trees will then use this strong predictor in the top split, all trees will look quite similar, and their predictions will be highly correlated. Thus, at each split, we consider only a randomly chosen subset of m ≤ d predictors. On average, (d − m)/d of the splits will not even consider the strong predictor, so the other predictors get more of a chance. Typically, one uses m ≈ √d; m = d corresponds to standard bagging. 29 / 30
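
A short sketch, assuming scikit-learn, of the m-out-of-d trick via the max_features parameter; using all predictors at every split recovers plain bagging of deep trees:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)

# m approx. sqrt(d): only a random subset of predictors is considered at each split
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)

# m = d: every split sees all predictors, i.e. standard bagging of deep trees
bag = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)
bag.fit(X, y)
```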

Example: cancer-type prediction Figure: Test classification error as a function of the number of trees (up to 500), for m = p, m = p/2 and m = √p, with p = 500. Each colored line corresponds to a different value of m. A single classification tree has an error rate of 45.7%. 30 / 30