M1 Apprentissage. Michèle Sebag, Benoit Barbot. LRI, LSV. 13 January 2014


Empirical assessment. Motivations: which algorithm for which data? Which hyper-parameters? (Caruana et al., 08: crossing the chasm.) Note on meta-learning: programming by optimization, http://www.prog-by-opt.net/. Needed: descriptive features.

Empirical assessment, 2. Empirical study at large scale: 700,000 dimensions (curse of dimensionality). Scores: accuracy, area under the ROC curve, square loss. Algorithms: SVM (adjusting C), stochastic gradient descent (Bordes et al. 05), perceptrons, logistic regression, artificial neural nets, naive Bayes, k-nearest neighbors, boosted decision trees, random forests.

Empirical assessment, 3 Methodology Normalize data Normalize indicators (median) Datasets

Datasets. TIS1 is from the Kent Ridge Bio-medical Data Repository. The problem is to find Translation Initiation Sites (TIS) at which translation from mRNA to proteins initiates. CRYST2 is a protein crystallography diffraction pattern analysis dataset from the X6A beamline at Brookhaven National Laboratory. STURN and CALAM are ornithology datasets; the task is to predict the appearance of two bird species, Sturnella neglecta and Calamospiza melanocorys. KDD98 is from the 1998 KDD-Cup; the task is to predict whether a person donates money. This is the only dataset with missing values. DIGITS4 is the MNIST database of handwritten digits by Cortes and LeCun; it was converted from a 10-class problem to a hard binary problem by treating digits less than 5 as one class and the rest as the other class.

Datasets, 2. IMDB and CITE are link prediction datasets. For IMDB each attribute represents an actor, director, etc.; for CITE attributes are the authors of a paper in the CiteSeer digital library. For IMDB the task is to predict whether Mel Blanc was involved in the film or television program, and for CITE the task is to predict whether J. Lee was a coauthor of the paper. We created SPAM from the TREC 2005 Spam Public Corpora. Features take binary values indicating whether a word appears in the document; words that appear less than three times in the whole corpus were removed. Real-Sim (R-S) is a compilation of Usenet articles from four discussion groups: simulated auto racing, simulated aviation, real autos and real aviation; the task is to distinguish real from simulated. DSE7 is newswire text with annotated opinion expressions; the task is to find Subjective Expressions, i.e. whether a particular word expresses an opinion.

Results

Results w.r.t. dimension

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

Ensemble learning: Example (AT&T), Schapire 09. Categorizing customers' queries (Collect, CallingCard, PersonToPerson, ...). Example queries: "yes I'd like to place a collect call long distance please" (Collect); "operator I need to make a call but I need to bill it to my office" (ThirdNumber); "yes I'd like to place a call on my master card please" (CallingCard); "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit). Remark: it is easy to find rules of thumb with good accuracy (IF "card" THEN CallingCard), but hard to find a single good rule.

Ensemble learning: Procedure. Take a learner (fast, with reasonable accuracy, i.e. accuracy > random); learn from (a subset of) the training set; find a hypothesis; do this a zillion times (T rounds); aggregate the hypotheses. Critical issues: enforcing the diversity of the hypotheses; how to aggregate the hypotheses. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. J. Surowiecki, 2004.

Ensemble learning. $E = \{(x_i, y_i)\},\ x_i \in X,\ y_i \in \{-1, 1\}$, $(x_i, y_i) \sim D(x, y)$. Loop: for $t = 1 \ldots T$, learn $h_t$ from $E_t$. Result: $H = \mathrm{sign}\left(\sum_t \alpha_t h_t\right)$. Requisite: the classifiers $h_t$ must be diverse. Enforcing diversity through: using different training sets $E_t$ (bagging, boosting); using diverse feature sets (bagging); enforcing decorrelation of the $h_t$ (boosting).

Diversity of the $h_t$, stability of $H$. Stability: slight changes of $E$ hardly modify $h$. Stable learners: k-nearest neighbors, linear discriminant analysis (LDA). Unstable learners: neural nets, decision trees.

Instability: why is it useful? Bias of $\mathcal{H}$: $\mathrm{Err}(h^*)$ with $h^* = \arg\min\{\mathrm{Err}(h),\ h \in \mathcal{H}\}$; it decreases with the size/complexity of $\mathcal{H}$ (LDA: poor). Variance: given $h_1, \ldots, h_T$ and $H = \mathrm{average}\{h_t\}$, $\mathrm{Variance}(H) = \frac{1}{T}\sum_t \|H - h_t\|^2$. Unstable learners: large variance, small bias (e.g. decision trees and NNs are universal approximators). The variance of the ensemble decreases with the size of the ensemble if the ensemble elements are not correlated: $\mathrm{Variance}(H) \approx \frac{1}{T}\,\mathrm{Variance}(h_t)$.

Why does it work: basics. Suppose there are 25 base classifiers, each with error rate $\epsilon = 0.35$, and assume the classifiers are independent. Then the probability that the (majority-vote) ensemble classifier makes a wrong prediction is $\Pr(\text{ensemble makes error}) = \sum_{i=13}^{25} \binom{25}{i}\, \epsilon^i (1-\epsilon)^{25-i} \approx 0.06$.
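The small Python sketch below (not part of the original slides) evaluates this binomial sum for the values on the slide: 25 independent classifiers, each with error rate 0.35.

```python
# Sketch: probability that a majority vote of T independent classifiers errs,
# each classifier having the same error rate eps (values from the slide).
from math import comb

def majority_vote_error(T: int = 25, eps: float = 0.35) -> float:
    # The ensemble is wrong when at least ceil(T/2) = 13 of the 25 members are wrong.
    return sum(comb(T, i) * eps**i * (1 - eps)**(T - i)
               for i in range(T // 2 + 1, T + 1))

print(round(majority_vote_error(), 3))  # ~0.06, as on the slide
```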

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

PAC Learning: Probably Approximately Correct Valiant 84 Turing award, 2011.

PAC Learning: Probably Approximately Correct. Setting: iid samples drawn after distribution $D(x, y)$, $E = \{(x_i, y_i)\},\ x_i \in X,\ y_i \in \{-1, 1\}$, $(x_i, y_i) \sim D(x, y)$. Strong learnability: a language (= set of target concepts) $C$ is PAC learnable if there exists an algorithm $A$ s.t. for all $D(x, y)$, for all $0 < \delta < 1$ (with probability $1 - \delta$: probably), for all error rates $\varepsilon > 0$ (approximately correct), there exists a number of samples $n = \mathrm{Poly}(\frac{1}{\delta}, \frac{1}{\varepsilon})$ s.t. $A(E_n) = \hat{h}_n$ with $\Pr(\mathrm{Err}(\hat{h}_n) < \varepsilon) > 1 - \delta$. $C$ is polynomially PAC-learnable if the computational cost of learning $\hat{h}_n$ is $\mathrm{Poly}(\frac{1}{\delta}, \frac{1}{\varepsilon})$.

PAC Learning: Probably Approximately Correct, 2. Weak learnability: same as strong learnability, except that one only requires the error to be $< 1/2$ (just better than random guessing): $\varepsilon = \frac{1}{2} - \gamma$. Question (Kearns & Valiant 88): strong learnability $\Rightarrow$ weak learnability; does weak learnability $\Rightarrow$ some stronger learnability?? Yes! Strong learnability $\Leftrightarrow$ weak learnability. PhD of Rob Schapire 89; Yoav Freund, MLJ 1990: The strength of weak learnability; Adaboost: Freund & Schapire 95.

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

Weak learnability $\Rightarrow$ strong learnability. Thm (Schapire, MLJ 1990): given an algorithm able to learn with error $\eta = 1/2 - \gamma$ with complexity $c$ under any distribution $D(x, y)$, there exists an algorithm with complexity $\mathrm{Poly}(c)$ able to learn with error $\varepsilon$. Proof (sketch): learn $h$ under $D(x, y)$; define $D'(x, y)$: $D(x, y)$ reweighted so that $\Pr(h(x) \neq y) = \frac{1}{2}$; learn $h'$ under $D'(x, y)$; define $D''(x, y)$: $D(x, y)$ restricted to the region where $h(x) \neq h'(x)$; learn $h''$ under $D''(x, y)$; use $\mathrm{Vote}(h, h', h'')$.

Proof (sketch). $\mathrm{Vote}(h, h', h'')$ is correct if $h$ and $h'$ are both correct, or if exactly one of $h, h'$ is wrong and $h''$ is correct. $\Pr(\mathrm{Vote}(h, h', h'')\ \text{OK}) = \Pr(h\ \text{OK and}\ h'\ \text{OK}) + \Pr(\text{exactly one of}\ h, h'\ \text{OK}) \cdot \Pr(h''\ \text{OK}) \geq 1 - (3\eta^2 - 2\eta^3)$. Hence $\mathrm{Err}(h) < \eta \Rightarrow \mathrm{Err}(\mathrm{Vote}(h, h', h'')) < 3\eta^2 - 2\eta^3$. [Figure: $3\eta^2 - 2\eta^3$ as a function of $\eta$ on $[0, 0.5]$, below the diagonal.]
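As a quick numeric check of this error reduction, a minimal Python sketch (the function name is illustrative, not from the slides):

```python
# Sketch: error of Vote(h, h', h'') as a function of the base error eta,
# under the independence assumptions of the proof sketch above.
def vote_error(eta: float) -> float:
    # both h and h' wrong, or exactly one of them wrong and h'' wrong
    return 3 * eta**2 - 2 * eta**3

for eta in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"eta={eta:.1f}  vote error={vote_error(eta):.3f}")  # strictly below eta for eta < 1/2
```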

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

Adaboost (Freund & Schapire 95). http://videolectures.net/mlss09us schapire tab/ Given: an algorithm $A$ (weak learner) and $E = \{(x_i, y_i)\},\ x_i \in X,\ y_i \in \{-1, 1\},\ i = 1 \ldots n$. Iterate: for $t = 1, \ldots, T$: define a distribution $D_t$ on $\{1, \ldots, n\}$ focusing on the examples misclassified by $h_{t-1}$; draw $E_t$ after $D_t$ (or use example weights); learn $h_t$, with $\varepsilon_t = \Pr_{x_i \sim D_t}(h_t(x_i) \neq y_i)$. Return: a weighted vote of the $h_t$.

Adaboost. Init: $D_1$ uniform distribution, $D_1(i) = \frac{1}{n}$. Define $D_{t+1}$ as follows: $D_{t+1}(i) = \frac{1}{Z_t} D_t(i)\, \exp(-\alpha_t)$ if $h_t(x_i) = y_i$, and $\frac{1}{Z_t} D_t(i)\, \exp(\alpha_t)$ if $h_t(x_i) \neq y_i$; i.e. $D_{t+1}(i) = \frac{1}{Z_t} D_t(i)\, \exp(-\alpha_t\, y_i\, h_t(x_i))$, with $Z_t$ a normalisation term, $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) > 0$ and $\varepsilon_t = \Pr_{x_i \sim D_t}(h_t(x_i) \neq y_i)$. Final hypothesis: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
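To make the update rules concrete, here is a minimal AdaBoost sketch in Python with decision stumps as the weak learner $A$ (a sketch under the assumption of NumPy arrays and labels in $\{-1, +1\}$; names such as stump_learn are illustrative, not from the slides):

```python
# Minimal AdaBoost sketch with decision stumps, following the update rules above.
import numpy as np

def stump_learn(X, y, D):
    """Weak learner: best single-feature threshold under the weights D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                eps = D[pred != y].sum()                 # weighted error
                if best is None or eps < best[0]:
                    best = (eps, j, thr, sign)
    return best                                          # (eps_t, feature, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=20):
    n = len(y)
    D = np.full(n, 1.0 / n)                               # D_1: uniform
    stumps, alphas = [], []
    for _ in range(T):
        stump = stump_learn(X, y, D)
        eps = max(stump[0], 1e-12)                        # avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)             # alpha_t
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)                    # up-weight misclassified examples
        D /= D.sum()                                      # Z_t normalisation
        stumps.append(stump); alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(F)                                     # H(x) = sign(sum_t alpha_t h_t(x))
```

The exponential reweighting line is exactly the $D_{t+1}$ update above, and predict implements the final weighted vote $H$.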

Toy Example. $D_1$: uniform; weak classifiers = vertical or horizontal half-planes.

Round 1: $h_1$, $D_2$; $\varepsilon_1 = 0.30$, $\alpha_1 = 0.42$.

Round 2: $h_2$, $D_3$; $\varepsilon_2 = 0.21$, $\alpha_2 = 0.65$.

Round 3: $h_3$; $\varepsilon_3 = 0.14$, $\alpha_3 = 0.92$.

Final Classifier: $H_{\text{final}} = \mathrm{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$. [Figure: the weighted combination of the three half-plane classifiers.]

Bounding the training error. Thm (Freund & Schapire 95): let $\varepsilon_t = \frac{1}{2} - \gamma_t$ ($\gamma_t$: the edge). Then $\mathrm{Err}_{\text{train}}(H) \leq \prod_t \left[ 2\sqrt{\varepsilon_t (1 - \varepsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2 \sum_t \gamma_t^2\right)$. Analysis: if $A$ is a weak learner, there exists $\gamma$ s.t. $\forall t,\ \gamma_t > \gamma > 0$, hence $\mathrm{Err}_{\text{train}}(H) < \exp(-2\gamma^2 T)$. This does not require $\gamma$ or $T$ to be known a priori.
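A quick numeric check of the bound (plain Python; the helper name is illustrative, and the example errors 0.30, 0.21, 0.14 are taken from the toy rounds above):

```python
# Sketch: training-error bound from the per-round weighted errors eps_t.
from math import sqrt, exp

def training_error_bound(eps_list):
    prod, gamma_sq = 1.0, 0.0
    for eps in eps_list:
        prod *= 2 * sqrt(eps * (1 - eps))    # prod_t 2 sqrt(eps_t (1 - eps_t))
        gamma_sq += (0.5 - eps) ** 2         # gamma_t = 1/2 - eps_t
    return prod, exp(-2 * gamma_sq)          # tight product bound, looser exponential bound

print(training_error_bound([0.30, 0.21, 0.14]))
```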

Proof. Note $F(x) = \sum_t \alpha_t h_t(x)$. Step 1: the final distribution satisfies $D_T(i) = D_1(i) \prod_t \left[\frac{1}{Z_t}\right] \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right) = \frac{1}{n} \left[\prod_t \frac{1}{Z_t}\right] \exp(-y_i F(x_i))$.

Proof. Step 2: $\mathrm{Err}_{\text{train}}(H) \leq \prod_t Z_t$, as $\mathrm{Err}_{\text{train}}(H) = \frac{1}{n} \sum_i \mathbb{1}[H(x_i) \neq y_i] = \frac{1}{n} \sum_i \mathbb{1}[y_i F(x_i) < 0] \leq \frac{1}{n} \sum_i \exp(-y_i F(x_i)) = \sum_i D_T(i) \prod_t Z_t = \prod_t Z_t$.

Proof. Step 3: $Z_t = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$, because $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)\, e^{\alpha_t} + \sum_{i:\, h_t(x_i) = y_i} D_t(i)\, e^{-\alpha_t} = \varepsilon_t e^{\alpha_t} + (1 - \varepsilon_t) e^{-\alpha_t} = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$.

Training error vs. test error (overfitting?). Observed: the test error keeps decreasing even after the training error reaches zero. Why? Explanation based on the margin.

The margin

Analysis. 1. Boosting $\Rightarrow$ larger margin. 2. Larger margin $\Rightarrow$ lower generalization error. Why: if the margin is large, the hypothesis can be approximated by a simple one.
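As a sketch of how one would inspect margins empirically (reusing stump_predict and the (stumps, alphas) pair from the AdaBoost sketch above; this pairing is an assumption, not part of the slides):

```python
# Sketch: normalized margins y_i * F(x_i) / sum_t alpha_t, which lie in [-1, 1];
# boosting tends to push these margins up even after the training error is zero.
import numpy as np

def normalized_margins(stumps, alphas, X, y):
    F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return y * F / np.sum(alphas)
```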

Intuition about the margin. [Figure: example images (infant, man, elderly woman) classified with varying confidence; slide from the 1st SU-VLPR 09 tutorial, Beijing.]

Partial conclusion. Adaboost is: a way of boosting a weak learner; a margin optimizer (other interpretations relate it to the minimization of an exponential loss). However... if Adaboost minimizes a criterion, it would make sense to directly minimize this criterion... but direct minimization degrades performance. Main weakness: sample noise; noisy examples are rewarded (their weights keep increasing).

Application: Boosting for Text Categorization [with Singer]. Weak classifiers: very simple classifiers that test simple patterns, namely (sparse) n-grams; find the parameter $\alpha_t$ and rule $h_t$ of the given form which minimize $Z_t$; use an efficiently implemented exhaustive search. "How may I help you" data: 7844 training examples, 1000 test examples. Categories: AreaCode, AttService, BillingCredit, CallingCard, Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.
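A minimal sketch of this weak-learner search in the binary case, assuming NumPy, a 0/1 term-document matrix, and labels in $\{-1, +1\}$ (the actual system of Schapire & Singer handles multi-class, confidence-rated rules):

```python
# Sketch: exhaustive search for the term-presence stump minimizing Z_t,
# using Z_t = 2 sqrt(eps (1 - eps)) for the alpha_t that is optimal for that stump.
import numpy as np

def best_term(X, y, D):
    best_j, best_Z = None, float("inf")
    for j in range(X.shape[1]):
        pred = np.where(X[:, j] > 0, 1, -1)          # rule: "term present => class +1"
        eps = D[pred != y].sum()                     # weighted error of the rule
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        Z = 2 * np.sqrt(eps * (1 - eps))             # symmetric in eps <-> 1 - eps
        if Z < best_Z:
            best_j, best_Z = j, Z
    return best_j, best_Z                            # alpha_t then follows from eps
```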

Weak Classifiers (rounds 1-6; the columns AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT give the per-category votes, shown graphically in the original slide): round 1: collect; 2: card; 3: my home; 4: person? person; 5: code; 6: I.

More Weak Classifiers: round 7: time; 8: wrong number; 9: how; 10: call; 11: seven; 12: trying to; 13: and.

More Weak Classifiers: round 14: third; 15: to; 16: for; 17: charges; 18: dial; 19: just.

Finding Outliers. Examples with the most weight are often outliers (mislabeled and/or ambiguous): "I'm trying to make a credit card call" (Collect); "hello" (Rate); "yes I'd like to make a long distance collect call please" (CallingCard); "calling card please" (Collect); "yeah I'd like to use my calling card number" (Collect); "can I get a collect call" (CallingCard); "yes I would like to make a long distant telephone call and have the charges billed to another number" (CallingCard DialForMe); "yeah I can not stand it this morning I did oversea call is so bad" (BillingCredit); "yeah special offers going on for long distance" (AttService Rate); "mister allen please william allen" (PersonToPerson); "yes ma'am I I'm trying to make a long distance call to a non dialable point in san miguel philippines" (AttService Other).

Application: Human-Computer Spoken Dialogue [with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]. Application: an automatic store front / help desk for the AT&T Labs Natural Voices business; the caller can request a demo, pricing information, technical support, a sales agent, etc., in an interactive dialogue.

How It Works. [Diagram: human raw utterance → automatic speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → text-to-speech → computer speech.] The NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.). Weak classifiers: test for the presence of a word or phrase.

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

Bagging (Breiman 96). Enforcing diversity through the bootstrap. Iterate: draw $E_t$ = $n$ examples uniformly drawn with replacement from $E$; learn $h_t$ from $E_t$. Finally: $H(x) = \mathrm{Vote}(\{h_t(x)\})$.
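A minimal bagging sketch, assuming NumPy, scikit-learn's DecisionTreeClassifier as the unstable base learner, and labels in $\{-1, +1\}$ (any learner with a fit/predict interface would do):

```python
# Sketch: bagging = bootstrap resampling + majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # n examples drawn with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([h.predict(X) for h in models])   # shape (T, n_samples)
    s = votes.sum(axis=0)
    return np.where(s >= 0, 1, -1)                     # majority vote
```

Each bootstrap model is learned independently, which is why bagging parallelizes trivially (contrary to boosting, as noted on the next slide).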

Bagging vs Boosting. Bagging: the $h_t$ are independent, so parallelisation is possible. Boosting: $h_t$ depends on the previous hypotheses ($h_t$ makes up for the mistakes of $h_1 \ldots h_{t-1}$). Visualization (Dietterich & Margineantu 97), in the 2D plane (diversity/kappa, error): [Figure: kappa-error diagrams for Boosting and Bagging; error rate (0 to 0.5) vs. kappa (0.5 to 1).]

Analysis. Assume $E$ is drawn after distribution $P$, $E_t$ is uniformly sampled from $E$, and $h_t$ is learned from $E_t$. $H$ is the average of the $h_t$: $H(x) = \mathbb{E}_{E_t}[h_t(x)]$. Direct error: $e = \mathbb{E}_E\, \mathbb{E}_{X,Y}[(Y - h(X))^2]$. Bagging error: $e_B = \mathbb{E}_{X,Y}[(Y - H(X))^2]$. Rewriting $e$: $e = \mathbb{E}_{X,Y}[Y^2] - 2\,\mathbb{E}_{X,Y}[Y H] + \mathbb{E}_{X,Y}\,\mathbb{E}_E[h^2]$, and with Jensen's inequality, $\mathbb{E}[Z^2] \geq \mathbb{E}[Z]^2$, hence $e \geq e_B$.

Overview Introduction Boosting PAC Learning Boosting Adaboost Bagging General bagging Random Forests

Random Forests Breiman 00 http://videolectures.net/sip08 biau corfao/

Classification / Regression

Splitting space

Build a tree

Random forests (Breiman 00). stat.berkeley.edu/users/breiman/randomforests Principle: randomized trees, averaged. Properties: fast, easy to implement; excellent accuracy; no overfitting w.r.t. the number of features; theoretical analysis difficult.

KDD 2009 (Orange). Targets: 1. Churn; 2. Appetency; 3. Up-selling. Core techniques: 1. Feature selection; 2. Bounded resources; 3. Parameterless methods.

Random tree: in each node, uniformly select a subset of the attributes and compute the best split among them; repeat until reaching the maximum depth.

Tree aggregation: each $h_t$ is a random tree; fast, and parallelization is straightforward.
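A minimal random-forest sketch in this spirit, assuming scikit-learn's DecisionTreeClassifier with max_features to emulate the per-node random attribute subset, and labels in $\{-1, +1\}$; since the trees are independent, the loop parallelizes trivially:

```python
# Sketch: random forest = bagged trees with a random feature subset at each node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, max_depth=None, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features="sqrt",                          # random attribute subset per node
            max_depth=max_depth,
            random_state=int(rng.integers(1 << 30)),
        ).fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    s = votes.sum(axis=0)
    return np.where(s >= 0, 1, -1)                        # majority vote of the trees
```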

Analysis (Biau et al. 10). $E = \{(x_i, y_i)\},\ x_i \in [0, 1]^d,\ y_i \in \mathbb{R},\ i = 1..n$, $(x_i, y_i) \sim P(x, y)$. Goal: estimate $r(x) = \mathbb{E}[Y \mid X = x]$. Criterion: $\mathbb{E}[(r_n(X) - r(X))^2] \to 0$ (consistency).

Simplified algorithm. Set $k_n \geq 2$ and iterate $\log_2 k_n$ times: in each node, select a feature s.t. $\Pr(\text{feature } j \text{ selected}) = p_{n,j}$; split: feature $j$ < its median.
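A sketch of this simplified randomized tree, assuming NumPy, $p_{n,j}$ uniform over the $d$ features (one admissible choice), and splits at the empirical median of the selected feature within the node:

```python
# Sketch of the simplified randomized tree: depth log2(k_n); at each node a feature j
# is drawn (uniformly here, i.e. p_{n,j} = 1/d) and the node is split at its median.
import numpy as np

def build_simplified_tree(X, y, k_n, rng):
    depth = int(np.log2(k_n))

    def grow(idx, level):
        if level == depth or len(idx) == 0:
            return {"leaf": True, "value": y[idx].mean() if len(idx) else 0.0}
        j = rng.integers(X.shape[1])                  # feature drawn with prob p_{n,j}
        med = np.median(X[idx, j])
        return {"leaf": False, "j": j, "med": med,
                "left": grow(idx[X[idx, j] < med], level + 1),
                "right": grow(idx[X[idx, j] >= med], level + 1)}

    return grow(np.arange(len(y)), 0)

def tree_predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["j"]] < node["med"] else node["right"]
    return node["value"]
```

The forest estimate $r_n(x)$ is then obtained by averaging tree_predict over many independently grown trees.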

Simplified algorithm: Analysis. Each tree has $2^{\log_2 k_n} = k_n$ leaves; each leaf covers a volume $2^{-\log_2 k_n} = 1/k_n$. Assuming the $x_i$ are uniformly drawn in $[0, 1]^d$, the number of examples per leaf is $\approx n / k_n$; if $k_n = n$, very few examples per leaf. $r_n(x) = \mathbb{E}\left[ \frac{\sum_i y_i\, \mathbb{1}[x_i, x \text{ in the same leaf}]}{\sum_i \mathbb{1}[x_i, x \text{ in the same leaf}]} \right]$.

Consistency. Thm: $r_n$ is consistent if $p_{n,j} \log k_n \to \infty$ for all $j$ and $k_n / n \to 0$ when $n \to \infty$.