Statistics and learning: Big Data

Learning Decision Trees and an Introduction to Boosting
Sébastien Gadat, Toulouse School of Economics, February 2017

Keywords: decision trees; divide and conquer; impurity measures (Gini index, information gain); pruning and overfitting; CART and C4.5.

Contents of this class:
The general idea of learning decision trees
Regression trees
Classification trees
Boosting and trees
Random Forests and trees

Introductory example

      Alt  Bar  F/S  Hun  Pat   Pri   Rai  Res  Typ      Dur  Wai
x1    Y    N    N    Y    0.38  $$$   N    Y    French    8   Y
x2    Y    N    N    Y    0.83  $     N    N    Thai     41   N
x3    N    Y    N    N    0.12  $     N    N    Burger    4   Y
x4    Y    N    Y    Y    0.75  $     Y    N    Thai     12   Y
x5    Y    N    Y    N    0.91  $$$   N    Y    French   75   N
x6    N    Y    N    Y    0.34  $$    Y    Y    Italian   8   Y
x7    N    Y    N    N    0.09  $     Y    N    Burger    7   N
x8    N    N    N    Y    0.15  $$    Y    Y    Thai     10   Y
x9    N    Y    Y    N    0.84  $     Y    N    Burger   80   N
x10   Y    Y    Y    Y    0.78  $$$   N    Y    Italian  25   N
x11   N    N    N    N    0.05  $     N    N    Thai      3   N
x12   Y    Y    Y    Y    0.89  $     N    N    Burger   38   Y

Please describe this dataset without any calculation.

Introductory example (same dataset as above)

Why is Pat a better indicator than Typ?

Deciding to wait... or not

(Tree built step by step on the examples above; positive examples {1, 3, 4, 6, 8, 12}, negative examples {2, 5, 7, 9, 10, 11}.)

Split on Pat:
  Pat in [0; 0.1]: examples {7, 11}, leaf No
  Pat in [0.1; 0.5]: examples {1, 3, 6, 8}, leaf Yes
  Pat in [0.5; 1]: examples {4, 12, 2, 5, 9, 10}, split again on Dur:
    Dur < 40: examples {4, 12, 10}, leaf Yes
    Dur > 40: examples {2, 5, 9}, leaf No

The general idea of learning decision trees

Decision tree ingredients:
Nodes: each node contains a test on the features which partitions the data.
Edges: the outcome of a node's test leads to one of its child edges.
Leaves: a terminal node, or leaf, holds a decision value for the output variable.

We will look at binary trees (binary tests) and single-variable tests.
Binary attribute: node = attribute.
Continuous attribute: node = (attribute, threshold).

How does one build a good decision tree? For a regression problem? For a classification problem?

A little more formally

A tree with M leaves describes a covering set of M hypercubes R_m in X. Each R_m holds a decision value ŷ_m, and the tree computes the piecewise-constant function

\hat{f}(x) = \sum_{m=1}^{M} \hat{y}_m \, I_{R_m}(x)

Notation: N_m = \#\{x_i \in R_m\} = \sum_{i=1}^{q} I_{R_m}(x_i)
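To make the piecewise-constant form concrete, here is a minimal R sketch (my own illustration, not from the slides) that evaluates f̂ for a hand-chosen one-dimensional covering of X = [0, 1] by three intervals; the regions and leaf values are purely illustrative assumptions.

# Hypothetical 1-D regions R_m with constant leaf values y_hat_m:
# f_hat(x) = sum_m y_hat_m * I(x in R_m).
regions <- list(c(0, 0.1), c(0.1, 0.5), c(0.5, 1))   # R_1, R_2, R_3
y_hat   <- c(0, 1, 0)                                 # leaf decision values

f_hat <- function(x) {
  for (m in seq_along(regions)) {
    if (x >= regions[[m]][1] && x <= regions[[m]][2]) return(y_hat[m])
  }
  NA  # x outside the covered set
}

sapply(c(0.05, 0.3, 0.7), f_hat)   # piecewise-constant predictions: 0 1 0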

The general idea: divide and conquer

Example: training set T, attributes x_1, ..., x_p.

FormTree(T)
1. Find the best split (j, s) over T                      // Which criterion?
2. If no valid split is found ((j, s) = ∅), node = FormLeaf(T)   // Which value for the leaf?
3. Else node = (j, s)
     split T according to (j, s) into (T_1, T_2)
     append FormTree(T_1) to node                          // Recursive call
     append FormTree(T_2) to node
4. Return node

Remark: this is a greedy algorithm, performing local search.

The R point of view

Two packages for tree-based methods: tree and rpart.
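As a quick illustration (not from the slides), here is a minimal rpart sketch on R's built-in iris data; the dataset and settings are assumptions made only to show the calls.

library(rpart)

# Grow a classification tree on the built-in iris data (illustrative choice).
fit <- rpart(Species ~ ., data = iris, method = "class")

print(fit)                          # text display of the splits
plot(fit); text(fit, use.n = TRUE)  # simple plot of the fitted tree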

Regression trees: criterion

We want to fit a tree to the data {(x_i, y_i)}_{i=1..q} with y_i ∈ R. Which criterion?

Sum of squares: \sum_{i=1}^{q} \left( y_i - \hat{f}(x_i) \right)^2

Inside region R_m, best ŷ_m? The empirical mean:

\hat{y}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i = \overline{Y}_{R_m}

Node impurity measure: Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{y}_m)^2

Regression trees: the best split

The best overall partition is hard to find. But locally, the best split (j, s) minimizes

C(j, s) = \min_{\hat{y}_1} \sum_{x_i \in R_1(j,s)} (y_i - \hat{y}_1)^2 + \min_{\hat{y}_2} \sum_{x_i \in R_2(j,s)} (y_i - \hat{y}_2)^2
        = \sum_{x_i \in R_1(j,s)} \left( y_i - \overline{Y}_{R_1(j,s)} \right)^2 + \sum_{x_i \in R_2(j,s)} \left( y_i - \overline{Y}_{R_2(j,s)} \right)^2
        = N_1 Q_1 + N_2 Q_2

Solve \operatorname{argmin}_{j,s} C(j, s).
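A minimal R sketch of this local search (my own illustration, not from the slides) for a single continuous attribute: it scans candidate thresholds s and returns the one minimizing N_1 Q_1 + N_2 Q_2. The function name and toy data are assumptions.

# Best threshold s for one continuous feature x, squared-error impurity.
best_split_1d <- function(x, y) {
  thresholds <- sort(unique(x))
  cost <- function(s) {
    left <- y[x <= s]; right <- y[x > s]
    if (length(left) == 0 || length(right) == 0) return(Inf)
    sum((left - mean(left))^2) + sum((right - mean(right))^2)  # N1*Q1 + N2*Q2
  }
  costs <- sapply(thresholds, cost)
  list(threshold = thresholds[which.min(costs)], cost = min(costs))
}

set.seed(1)
x <- runif(50); y <- ifelse(x > 0.6, 2, 0) + rnorm(50, sd = 0.3)
best_split_1d(x, y)   # the selected threshold should be close to 0.6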

Regression trees: overgrowing the tree?

Too small a tree gives a rough average; too large a tree overfits (think of growing a full tree on the restaurant dataset above).

Stopping criterion?
Stop if \min_{j,s} C(j, s) > \kappa? Not good, because a good split might be hidden in deeper nodes.
Stop if N_m < n? Good to avoid overspecialization.
Better: prune the tree after growing, using cost-complexity pruning.

Cost-complexity criterion: C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M

Once a tree is grown, prune it to minimize C_α. Each α corresponds to a unique cost-complexity-optimal tree.
Pruning method: weakest-link pruning, left to your curiosity.
Best α? Chosen through cross-validation.
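To make cost-complexity pruning concrete, here is a small hedged R example using rpart, whose complexity parameter cp plays the role of α (up to scaling); the dataset, growing parameters and cp value are illustrative assumptions.

library(rpart)

# Grow a deliberately large regression tree, then prune it back.
fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova",
             control = rpart.control(cp = 0, minsplit = 5))

printcp(fit)                      # cross-validated error for each value of cp
pruned <- prune(fit, cp = 0.02)   # keep the subtree optimal for this cp (illustrative value)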

Regression trees in a nutshell

Constant values on the leaves.
Growing phase: greedy splits that minimize the squared-error impurity measure.
Pruning phase: weakest-link pruning that minimizes the cost-complexity criterion.

Further reading on regression trees:
MARS (Multivariate Adaptive Regression Splines): linear functions on the leaves.
PRIM (Patient Rule Induction Method): focuses on extrema rather than averages.

A bit of R before classification tasks

Let's load the Optical Recognition of Handwritten Digits database.

> optical <- read.csv("optdigits.tra", sep=",", header=FALSE)
> colnames(optical)[65] <- "class"

Classification trees

Suppose y_i ∈ {True, False}. Let's fit a tree to {(x_i, y_i)}_{i=1..q}. Best ŷ_m in node m?

Proportion of class-k observations in node m: \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)

Class of node m: \hat{y}_m = \operatorname{argmax}_k \hat{p}_{mk}

Classification trees: node impurity measures

Misclassification error: Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \neq \hat{y}_m) = 1 - \hat{p}_{m\hat{y}_m}

Gini index (CART): Q_m = \sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})

Information or deviance (C4.5): Q_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

Splitting criterion? Minimize N_1 Q_1 + N_2 Q_2.

Pruning? Cost-complexity criterion (often using the misclassification error):

C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M
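As a quick numerical illustration (my own sketch, not from the slides), the three impurity measures can be computed from a node's class proportions; the proportions below are an arbitrary example.

# Node impurity measures from a vector of class proportions p (summing to 1).
misclass_error <- function(p) 1 - max(p)
gini_index     <- function(p) sum(p * (1 - p))
deviance_info  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

p <- c(0.7, 0.2, 0.1)   # illustrative proportions for a 3-class node
c(misclass = misclass_error(p), gini = gini_index(p), deviance = deviance_info(p))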

Classification trees in a nutshell

Class values on the leaves.
Growing phase: greedy splits that maximize the Gini index reduction (CART) or the information gain (C4.5).
Pruning phase: weakest-link pruning that minimizes the cost-complexity criterion.

Further reading on classification trees:
EC4.5 and YaDT: implementation improvements over C4.5.
C5.0: C4.5 with additional features (loss matrix, handling of missing values).

Classification trees: a bit of R

> help(tree)
> optical.tree <- tree(factor(class) ~ ., optical, split="deviance")
> optical.tree.gini <- tree(factor(class) ~ ., optical, split="gini")
> plot(optical.tree); text(optical.tree)
> help(prune.tree)
> optical.tree.pruned <- prune.tree(optical.tree, method="misclass", k=10)
> help(cv.tree)
> optical.tree.cv <- cv.tree(optical.tree, , prune.misclass)
> plot(optical.tree.cv)
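A possible follow-up (not in the slides) to check how the pruned tree does on the training sample; the function names come from the tree package, the rest is an illustrative assumption.

# Training-set confusion matrix and misclassification rate of the pruned tree.
pred <- predict(optical.tree.pruned, optical, type = "class")
table(predicted = pred, actual = optical$class)
mean(pred != optical$class)   # training misclassification rate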

Why should you use decision trees?

Advantages:
Easy to read and interpret.
Learning the tree has complexity linear in p.
Can be rather efficient on well pre-processed data (in conjunction with PCA, for instance).

However:
No margin or performance guarantees.
Lack of smoothness in the regression case.
Strong assumption that the data can be fitted by hypercubes.
Strong sensitivity to the data set.

But... these weaknesses can be compensated by ensemble methods such as Boosting or Bagging, and by the very efficient Random Forest extension.

Boosting and trees: motivation

"AdaBoost with trees is the best off-the-shelf classifier in the world." (Breiman, 1998)

Not so true today, but still accurate enough.

What is Boosting?

Key idea: Boosting is a procedure that combines several weak classifiers into a powerful committee.
It belongs to the committee-based (or ensemble) methods literature in Machine Learning.
The most popular boosting algorithm (Freund & Schapire, 1997) is AdaBoost.M1.

Warning: in this part we take a very practical approach. For a more thorough and rigorous presentation, see (for instance) the reference below.

R. E. Schapire. The boosting approach to machine learning: An overview. Nonlinear Estimation and Classification, 2002.

The main picture

Weak classifiers: h(x) = y is said to be a weak (or PAC-weak) classifier if it performs better than random guessing on the training data.

AdaBoost constructs a strong classifier as a linear combination of weak classifiers h_t(x):

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

The AdaBoost algorithm

Given {(x_i, y_i)}, x_i ∈ X, y_i ∈ {-1, +1}.

Initialize weights D_1(i) = 1/q.

For t = 1 to T:

  Find h_t = \operatorname{argmin}_{h \in H} \sum_{i=1}^{q} D_t(i) \, I(y_i \neq h(x_i))

  If \epsilon_t = \sum_{i=1}^{q} D_t(i) \, I(y_i \neq h_t(x_i)) \geq 1/2, then stop.

  Set \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)

  Update D_{t+1}(i) = \frac{D_t(i) \, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}, where Z_t is a normalisation factor.

Return the classifier H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)
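As an illustration of the loop above, here is a compact R sketch (my own code, not the authors') using depth-one rpart trees, i.e. decision stumps, as the weak classifiers; the toy data, the value of T and the helper names (adaboost_stumps, adaboost_predict) are assumptions made for the example.

library(rpart)

# Minimal AdaBoost.M1 with decision stumps; labels y must be in {-1, +1}.
adaboost_stumps <- function(X, y, T = 25) {
  q <- nrow(X)
  D <- rep(1 / q, q)                        # D_1(i) = 1/q
  stumps <- list(); alpha <- numeric(0)
  for (t in 1:T) {
    dat <- data.frame(X, y = factor(y))
    h <- rpart(y ~ ., data = dat, weights = D,
               control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
    pred <- ifelse(predict(h, dat, type = "class") == "1", 1, -1)
    eps <- sum(D * (pred != y))             # weighted training error
    if (eps <= 0 || eps >= 0.5) break       # stop if no longer a weak learner
    a <- 0.5 * log((1 - eps) / eps)         # alpha_t
    D <- D * exp(-a * y * pred)
    D <- D / sum(D)                         # divide by Z_t
    stumps[[t]] <- h; alpha <- c(alpha, a)
  }
  list(stumps = stumps, alpha = alpha)
}

adaboost_predict <- function(model, X) {
  scores <- Reduce(`+`, lapply(seq_along(model$alpha), function(t) {
    p <- predict(model$stumps[[t]], data.frame(X), type = "class")
    model$alpha[t] * ifelse(p == "1", 1, -1)
  }))
  sign(scores)                              # H(x) = sign(sum_t alpha_t h_t(x))
}

# Toy usage on simulated data
set.seed(42)
X <- data.frame(x1 = runif(200), x2 = runif(200))
y <- ifelse(X$x1 + X$x2 > 1, 1, -1)
fit <- adaboost_stumps(X, y)
mean(adaboost_predict(fit, X) == y)         # training accuracy of the committee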

Iterative reweighting

D_{t+1}(i) = \frac{D_t(i) \, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}

Increase the weight of incorrectly classified samples.
Decrease the weight of correctly classified samples.
Memory effect: a sample misclassified several times gets a large D(i).
h_t focuses on samples that were misclassified by h_1, ..., h_{t-1}.

Properties

The training error satisfies the bound

\frac{1}{q} \sum_{i=1}^{q} I(H(x_i) \neq y_i) \leq \prod_{t=1}^{T} Z_t

To minimize the training error, minimize this upper bound at each step t: this is where \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) comes from.

This is equivalent to maximizing the margin!
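For completeness, here is the standard one-line derivation (not spelled out on the slide) of why this choice of α_t minimizes Z_t at step t:

Z_t = \sum_{i=1}^{q} D_t(i) \, e^{-\alpha_t y_i h_t(x_i)}
    = (1 - \epsilon_t) \, e^{-\alpha_t} + \epsilon_t \, e^{\alpha_t}

\frac{\partial Z_t}{\partial \alpha_t} = 0
\;\Longrightarrow\;
\alpha_t = \tfrac{1}{2} \log\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right),
\qquad
Z_t^{\min} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)} < 1 \text{ whenever } \epsilon_t < 1/2.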

AdaBoost is not Boosting

There are many variants of AdaBoost:
Binary classification: AdaBoost.M1, AdaBoost.M2, ...
Multiclass: AdaBoost.MH.
Regression: AdaBoost.R.
Online variants, ...
And there are other Boosting algorithms as well.

Why should you use Boosting?

AdaBoost is a meta-algorithm: it boosts a weak classification algorithm into a committee that forms a strong classifier.
AdaBoost maximizes the margin.
It is very simple to implement.
It can be seen as a feature selection algorithm.
In practice, AdaBoost often avoids overfitting.

AdaBoost with trees

Your turn to play: will you be able to implement AdaBoost with trees in R?

Random Forests and trees: motivation

Aggregation stabilizes tree inference; introducing independence between trees makes the aggregation step robust.

Simple remark: the variance of an average of B i.i.d. random variables is \sigma^2 / B. When the variables are pairwise correlated with coefficient ρ, the variance of the average becomes

\rho \sigma^2 + \frac{(1 - \rho) \sigma^2}{B}

Idea of bagging: sub-sample the training set to obtain an aggregation with ρ small and B large.
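A quick numerical check of this variance formula (my own illustration, not from the slides): simulate B equicorrelated Gaussian variables and compare the empirical variance of their average with ρσ² + (1-ρ)σ²/B. All values below are arbitrary choices.

# Empirical check: variance of the average of B equicorrelated N(0, sigma^2) variables.
set.seed(1)
B <- 20; rho <- 0.3; sigma <- 1; nrep <- 20000

# Build equicorrelated columns as sqrt(rho)*Z0 + sqrt(1-rho)*Zb.
Z0 <- rnorm(nrep, sd = sigma)
Zb <- matrix(rnorm(nrep * B, sd = sigma), nrep, B)
Xcor <- sqrt(rho) * Z0 + sqrt(1 - rho) * Zb   # each column: variance sigma^2, correlation rho

var(rowMeans(Xcor))                        # empirical variance of the average
rho * sigma^2 + (1 - rho) * sigma^2 / B    # theoretical value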

Random Forests and trees: the algorithm

To avoid pruning trees and to bypass over-fitting, a common approach is to randomly subsample both the training set and the set of variables, and then average. This leads to the so-called Random Forest algorithm.

Algorithm 1: Random forest
Input: training set D, number of bags B, integer m.
For b = 1 to B:
  Sample a bootstrap training set D_b among the n observations.
  Sample a subset of m variables among the p variables.
  Compute a classification tree T_b.
Output: prediction with the average decision rule \frac{1}{B} \sum_{b=1}^{B} T_b.
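In R this is available off the shelf in the randomForest package; here is a hedged sketch on the optical digits data loaded earlier (the ntree and mtry values are illustrative assumptions).

library(randomForest)

# Random forest on the optical digits data: B = ntree bagged trees,
# m = mtry variables tried at each split.
optical.rf <- randomForest(factor(class) ~ ., data = optical,
                           ntree = 500, mtry = 8, importance = TRUE)

print(optical.rf)        # out-of-bag error estimate and confusion matrix
varImpPlot(optical.rf)   # variable importance, cf. the next slide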

Random Forests and trees: important parameters and features

m: the number of variables sampled to build each individual tree. If p is the total number of variables, it is typically chosen of the order of m ≈ √p.

Important remark: Random Forests can also produce a selection of the good variables. Important variables are the ones that are most often selected at the nodes across the whole population of trees.

Everything is possible with Breiman's randomForest package...