Statistical Machine Learning from Data


Statistical Machine Learning from Data: Ensembles
Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch
http://www.idiap.ch/~bengio
January 27, 2006

Outline
1 Introduction
2 Bagging
3 AdaBoost

Basics of Ensembles
When trying to solve a problem, we generally make some choices: the family of functions, the range of the hyper-parameters, the input representation and preprocessing, the precise dataset, etc.
Idea: instead of committing to these choices, let us provide not one but many solutions to the same problem, and combine them.
Why should this be a good idea? These choices imply a variance in the expected performance (an implicit capacity). In general, combining estimates reduces this variance and hence enhances the expected performance.

Ensemble - Why Does it Work?
It can be shown that the expected risk of the average of a set of models is never worse than the average of the expected risks of these models.
Consider the simplest ensemble $g$ over models $f_i$: $g(x) = \sum_i \alpha_i f_i(x)$ with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$.
The MSE risk of $f_i$ at $x$ is $e_i(x) = E_y[(y - f_i(x))^2]$.
The average risk of a single model is $\bar{e}(x) = \sum_i \alpha_i e_i(x)$.
The risk of the ensemble is $e(x) = E_y[(y - g(x))^2]$.
Define the diversity $d_i(x) = (f_i(x) - g(x))^2$ and the average diversity $\bar{d}(x) = \sum_i \alpha_i d_i(x)$.
It can then be shown that $e(x) = \bar{e}(x) - \bar{d}(x)$: the more the models disagree, the larger the gain of the ensemble.
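
This identity is easy to check numerically. A minimal sketch (illustrative numbers, not from the slides), evaluating $e(x) = \bar{e}(x) - \bar{d}(x)$ at a single input $x$ for a fixed target $y$:

```python
# Numerical check of e(x) = e_bar(x) - d_bar(x) at one input x.
import numpy as np

y = 1.0                            # target value at x
f = np.array([0.4, 0.9, 1.3])      # predictions f_i(x) of three models
alpha = np.array([0.2, 0.5, 0.3])  # weights: non-negative, summing to 1

g = np.sum(alpha * f)                  # ensemble prediction g(x)
e_bar = np.sum(alpha * (y - f) ** 2)   # average model risk     e_bar(x)
d_bar = np.sum(alpha * (f - g) ** 2)   # average diversity      d_bar(x)
e_ens = (y - g) ** 2                   # risk of the ensemble   e(x)

print(e_ens, e_bar - d_bar)            # both are ~0.0064: the identity holds
```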

Bagging
Outline: Introduction, Bootstrap, Algorithm

Bagging: bootstrap aggregating
Underlying idea: part of the variance is due to the specific choice of the training data set.
So let us create many similar training data sets and, for each of them, train a new function; the final function is the average of the individual functions' outputs.
How similar? They are obtained by the bootstrap.

Bootstrap
Given a data set $D_n$ with $n$ examples drawn from $p(z)$, a bootstrap replicate $B_i$ of $D_n$ also contains $n$ examples: for $j = 1, \dots, n$, the $j$-th example of $B_i$ is drawn independently, with replacement, from $D_n$.
Hence some examples of $D_n$ appear several times in $B_i$, while others do not appear at all.
Hypothesis: the examples were drawn i.i.d. from $p(z)$. The datasets $B_i$ are then as plausible as $D_n$, except that they are drawn from $D_n$ instead of $p(z)$.
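
Drawing a bootstrap replicate takes a couple of lines; the sketch below (an illustration, not part of the slides) samples the indices of one replicate and shows that roughly a third of the original examples are left out:

```python
# One bootstrap replicate B_i of a dataset of size n: n indices drawn
# independently with replacement, so some examples repeat and some are missing.
import numpy as np

def bootstrap_indices(n, rng):
    return rng.integers(0, n, size=n)

rng = np.random.default_rng(0)
n = 1000
idx = bootstrap_indices(n, rng)
print(len(np.unique(idx)) / n)   # about 0.63, i.e. roughly 1 - 1/e of D_n appears in B_i
```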

Bagging - Algorithm
Training:
1. Given a training set $D_n$, create $T$ bootstrap replicates $B_i$ of $D_n$.
2. For each bootstrap $B_i$, select $f(B_i) = \arg\min_{f \in \mathcal{F}} \hat{R}(f, B_i)$.
Testing: given an input $x$, the corresponding output is $\hat{y} = \frac{1}{T} \sum_{i=1}^{T} f(B_i)(x)$.
Analysis: if the generalization error is decomposed into bias and variance terms, then bagging reduces the variance.
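
A minimal sketch of these training and testing steps; `fit_base` stands for any base learner that returns a model with a `predict` method (the names are placeholders, not from the slides):

```python
# Bagging sketch: train T base models on bootstrap replicates, average their outputs.
import numpy as np

def bag_train(X, y, fit_base, T, rng):
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)         # bootstrap B_i of D_n
        models.append(fit_base(X[idx], y[idx]))  # empirical risk minimizer on B_i
    return models

def bag_predict(models, X):
    # Average of the T model outputs (regression); use a majority vote for classification.
    return np.mean([m.predict(X) for m in models], axis=0)
```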

Bias + Variance for Bagging
[Figure: error as a function of capacity, decomposed into bias and variance; the variance of a single ("normal") model is compared with the reduced variance obtained by bagging, while the bias is essentially unchanged.]

Random Forests
A random forest is an ensemble of decision trees. Each decision tree is trained as follows:
create a bootstrap replicate of the training set;
at each split, consider only a random subset of $m \le d$ input variables as candidate split variables ($m$ is constant over all trees);
do not prune the trees.
A decision is taken by voting amongst the trees. Somehow, $m$ controls the capacity.
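
For illustration only (the slides do not refer to any particular library), this recipe corresponds closely to scikit-learn's RandomForestClassifier:

```python
# Random forest sketch: bootstrapped, unpruned trees that consider only m randomly
# chosen variables at each split, combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees, each grown on its own bootstrap
    max_features=4,     # m <= d candidate variables per split, constant over trees
    max_depth=None,     # no pruning: trees are grown out fully
    bootstrap=True,     # each tree sees a bootstrap replicate of the training set
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # the forest combines the votes of its trees
```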

AdaBoost

AdaBoost - Introduction
AdaBoost is the most popular algorithm in the family of boosting algorithms.
Boosting: the performance of simple (weak) classifiers is boosted by combining them iteratively.
General combined classifier: $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$.
Simplest framework: binary classification, with targets in $\{-1, +1\}$.
What can we do with the following minimal requirement: each weak classifier $f_t$ should perform better than chance?

AdaBoost - Concepts
AdaBoost is an iterative algorithm: select $f_t$ given the performance obtained by the previous weak classifiers $f_1, \dots, f_{t-1}$.
At each time step $t$:
modify the training sample distribution in order to favor difficult examples (according to the previous weak classifiers);
train a new weak classifier on this distribution;
select its weight $\alpha_t$ by optimizing a global criterion.
Stop when it becomes impossible to find a weak classifier satisfying the minimal condition (being better than chance).
The final solution is the weighted sum of all weak classifiers.

AdaBoost - Algorithm
1. Inputs: $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
2. Initialize: $w_i^{(1)} = 1/n$ for all $i = 1, \dots, n$.
3. For $t = 1, \dots, T$:
   1. $D^{(t)}$: sample $n$ examples from $D_n$ according to the weights $w^{(t)}$;
   2. train a classifier $f_t$ on $D^{(t)}$;
   3. compute its weighted training error $\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} I(y_i \ne f_t(x_i))$, where $I(z) = 1$ if $z$ is true and $0$ otherwise;
   4. compute the weight of the weak classifier $f_t$: $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$;
   5. update the example weights for the next iteration: $w_i^{(t+1)} = \frac{w_i^{(t)} \exp(-\alpha_t y_i f_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor such that $\sum_i w_i^{(t+1)} = 1$;
   6. if $\epsilon_t = 0$ or $\epsilon_t \ge \frac{1}{2}$, break: set $T = t - 1$.
4. Final output: $g(x) = \sum_{t} \frac{\alpha_t}{\sum_r \alpha_r} f_t(x)$.
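
A compact sketch of this algorithm, using decision stumps as weak classifiers and example weighting instead of the resampling step (a substitution the next slide notes is often possible); the stump learner and all names are illustrative, not from the slides:

```python
# AdaBoost sketch with decision stumps as weak classifiers; labels in {-1, +1}.
import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the smallest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best

def predict_stump(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # w_i^(1) = 1/n
    ensemble = []
    for t in range(T):
        stump = train_stump(X, y, w)           # weak classifier f_t
        pred = predict_stump(stump, X)
        eps = np.sum(w[pred != y])             # weighted training error
        if eps == 0 or eps >= 0.5:             # no longer better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # weight of f_t
        ensemble.append((alpha, stump))
        w = w * np.exp(-alpha * y * pred)      # favor misclassified examples
        w = w / np.sum(w)                      # normalization factor Z_t
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(a * predict_stump(s, X) for a, s in ensemble)
    return np.sign(score)                      # sign of the weighted sum g(x)
```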

AdaBoost - Analysis
The selection of $\alpha_t$ comes from minimizing
$\alpha_t = \arg\min_{\alpha_t} \sum_{i=1}^{n} \exp\left(-y_i \left(\alpha_t f_t(x_i) + \sum_{s=1}^{t-1} \alpha_s f_s(x_i)\right)\right)$
Other cost functions have been proposed (as in LogitBoost or arcing).
Sampling can often be replaced by weighting.
If each weak classifier is always better than chance, then AdaBoost can be proven to drive the training error to 0.
Even after the training error reaches 0, the generalization error often continues to improve: the margin continues to grow.
Early claims: AdaBoost does not overfit! This is of course false.
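
A step the slide leaves implicit: the minimization above has a closed form. Up to a positive factor, defining $w_i^{(t)} \propto \exp(-y_i \sum_{s=1}^{t-1} \alpha_s f_s(x_i))$ (normalized to sum to 1) and splitting the sum into correctly and incorrectly classified examples gives
$$\sum_{i=1}^{n} w_i^{(t)} e^{-\alpha_t y_i f_t(x_i)} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t},$$
whose derivative in $\alpha_t$ vanishes when $e^{2\alpha_t} = (1 - \epsilon_t)/\epsilon_t$, i.e. $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$, exactly the weight used in the algorithm.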

AdaBoost - Cost Functions
Comparison of various cost functions related to AdaBoost, as functions of the margin $m = y\, g(x)$: $\exp(-m)$ [AdaBoost], $\log(1 + \exp(-m))$ [LogitBoost], $1 - \tanh(m)$ [Doom II], and $(1 - m)_+$ [SVM].
[Figure: the four losses plotted against the margin $m$.]
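
The comparison is easy to reproduce (a sketch, not part of the slides):

```python
# Plot the four margin-based losses from the slide against the margin m = y * g(x).
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 4, 400)
plt.plot(m, np.exp(-m), label="exp(-m)  [AdaBoost]")
plt.plot(m, np.log1p(np.exp(-m)), label="log(1+exp(-m))  [LogitBoost]")
plt.plot(m, 1 - np.tanh(m), label="1 - tanh(m)  [Doom II]")
plt.plot(m, np.maximum(0.0, 1 - m), label="(1 - m)+  [SVM]")
plt.xlabel("margin m = y g(x)")
plt.ylabel("loss(m)")
plt.ylim(0, 3)
plt.legend()
plt.show()
```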

AdaBoost - Margin
The margin of an example is $y\, g(x)$; we study its distribution over a test set.
[Figure: cumulative distribution of the test margin after various numbers of boosting iterations.]

AdaBoost - Extensions
Multi-class classification.
Single-class classification: estimating quantiles.
Regression: transform the problem into a binary classification task.
Localized boosting, similar to mixtures of experts: $g(x) = \sum_{t=1}^{T} \alpha_t(x) f_t(x)$.
Examples of weak classifiers: decision trees and stumps, neural networks.