Machine Learning. Ensemble Methods. Manfred Huber

Bias, Variance, Noise
Classification errors have different sources:
- Choice of hypothesis space and algorithm
- Training set
- Noise in the data
The expected error is often characterized using:
- Bias: error due to the algorithm and hypothesis space
- Variance: error due to the training data
- Noise: noise in the data

Bias, Variance, Noise
In regression we can show the decomposition of the error into these components.
Assume that a data point (x, y) and training sets D are drawn randomly from a data distribution. The expected squared regression error at a point x is

    E_{D,y}[ (y - h(x))^2 ] = E_{D,y}[ y^2 - 2 y h(x) + h(x)^2 ]
                            = E_y[y^2] - 2 E_y[y] E_D[h(x)] + E_D[h(x)^2]
                            = E_y[y^2] - E_y[y]^2
                              + E_y[y]^2 - 2 E_y[y] E_D[h(x)] + E_D[h(x)]^2
                              + E_D[h(x)^2] - E_D[h(x)]^2

Bias, Variance, Noise
Using the relation E[(x - E[x])^2] = E[x^2] - E[x]^2, and writing f(x) = E_y[y] for the true function, we can transform this:

    E_{D,y}[ (y - h(x))^2 ] = E_y[y^2] - E_y[y]^2
                              + E_y[y]^2 - 2 E_y[y] E_D[h(x)] + E_D[h(x)]^2
                              + E_D[h(x)^2] - E_D[h(x)]^2
                            = E_y[ (y - f(x))^2 ]                <- Noise
                              + ( f(x) - E_D[h(x)] )^2           <- Bias^2
                              + E_D[ (h(x) - E_D[h(x)])^2 ]      <- Variance
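The decomposition can also be checked numerically. The following sketch is not part of the slides; the toy target function, noise level, and all parameter values are illustrative assumptions. It estimates noise, bias^2, and variance for a linear fit to a noisy sine at a single query point and compares their sum with a direct Monte Carlo estimate of the expected squared error.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function f(x)
sigma = 0.3                           # assumed noise standard deviation
x0 = 0.35                             # query point x

def fit_and_predict(x0, n=20, degree=1):
    """Draw one training set D, fit a polynomial h of the given degree, return h(x0)."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_and_predict(x0) for _ in range(5000)])   # h(x0) across many D

noise = sigma ** 2                       # E_y[(y - f(x))^2]
bias_sq = (f(x0) - preds.mean()) ** 2    # (f(x) - E_D[h(x)])^2
variance = preds.var()                   # E_D[(h(x) - E_D[h(x)])^2]

# Direct Monte Carlo estimate of E_{D,y}[(y - h(x))^2] at x0
y_samples = f(x0) + rng.normal(0, sigma, preds.size)
direct = np.mean((y_samples - preds) ** 2)

print("noise + bias^2 + variance:", noise + bias_sq + variance)
print("direct estimate:          ", direct)

The two printed numbers agree up to Monte Carlo error, matching the decomposition above.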

Bias, Variance, Noise
Each of the terms explains a different part of the expected error:
- Noise describes how much the target value varies from the true function value
- Bias describes how different the average (best) learned hypothesis is from the true function
- Variance describes how much the learned hypotheses vary with changes in the training data

Linear Regression Example: 50 datasets with 20 data points each (figure from Dietterich and Ng)

Linear Regression Example: Noise (figure from Dietterich and Ng)

Linear Regression Example: Bias (figure from Dietterich and Ng)

Linear Regression Example: Variance (figure from Dietterich and Ng)

Bias Variance Tradeoff
- Bias can be minimized by an appropriate hypothesis space (one containing the true function) and algorithm
  - Requires knowledge of the function or a more complex hypothesis space
- Variance can be minimized by a hypothesis space that requires little data and does not overfit
  - Requires knowledge of the function or use of a simpler hypothesis space
- Bias and variance often trade off against each other and are hard to optimize at the same time (illustrated numerically in the sketch below)
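To make the tradeoff concrete, here is a small follow-up to the earlier decomposition sketch, using the same illustrative toy problem (the target function, noise level, and query point are again assumptions, not from the slides): it compares bias^2 and variance of a linear fit against a high-degree polynomial fit at a single query point.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # assumed true function
sigma, x0, n, runs = 0.3, 0.35, 20, 5000

for degree in (1, 9):
    preds = np.empty(runs)
    for r in range(runs):                # one training set D per run
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    print("degree", degree,
          "bias^2:", round((f(x0) - preds.mean()) ** 2, 4),
          "variance:", round(preds.var(), 4))

The simple model shows large bias^2 and small variance, the flexible model the opposite, mirroring the typical tradeoff curve on the next slide.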

Bias Variance Tradeoff
Typical bias/variance tradeoff (figure from Hastie, Tibshirani, and Friedman).

Influencing Bias and/or Variance: Ensemble Methods
Ensemble methods can change the bias and/or variance of existing classifiers.
(Diagram: the input features are fed to Classifier 1, Classifier 2, ..., Classifier N; their class predictions are merged by a combiner into the final class prediction.)

Bagging
- One way to address variance is by averaging over multiple learned hypotheses
- Bagging samples the initial n training examples with replacement to generate k bootstrap sample sets (usually of the same size, n)
- A classifier/regression function is learned on each of the k training sets
- The final classification/regression function is determined through majority vote or averaging (a minimal code sketch follows below)
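A minimal sketch of the bagging procedure just described, assuming scikit-learn-style base learners with fit/predict and integer class labels; the decision-tree base learner, the number of bootstrap sets k, and the helper names are illustrative choices, not part of the slides.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base=None, k=25, seed=0):
    """Learn k classifiers, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(seed)
    base = base if base is not None else DecisionTreeClassifier()
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)               # bootstrap sample indices
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the k classifiers by unweighted majority vote (assumes integer labels >= 0)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (k, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

For regression one would average the k predictions instead of voting; scikit-learn also provides ready-made BaggingClassifier and BaggingRegressor estimators based on the same idea.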

Bagging
- The ensemble classifier/regression function can have lower variance than the individual hypotheses
- The resulting variance depends on the correlation between the hypotheses (made precise below)
  - If all classifiers are the same there is no gain
  - If the classifiers vary strongly the variance can drop by up to a factor of 1/k
- Bootstrap sampling could lead to a small degradation in the learned classifiers/functions
- Bagging mainly helps with methods that are very sensitive to changes in the data set
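The dependence on the correlation can be made precise with a standard result that is not spelled out on the slide: for k identically distributed hypotheses, each with variance σ^2 and pairwise correlation ρ, the variance of their average is

    Var( (1/k) Σ_{i=1}^k h_i(x) ) = ρ σ^2 + ((1 - ρ)/k) σ^2

which falls to σ^2/k for uncorrelated hypotheses (ρ = 0) and stays at σ^2 when all hypotheses are identical (ρ = 1), matching the two extreme cases above.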

Bagging
- Bagging generally has no influence on bias
  - It only averages the hypotheses
- It reduces variance for classifiers/regression approaches that are sensitive to the training data
  - Averaging the hypotheses can reduce the variance of the final classifier/regression function
- Can we also influence bias using ensemble classification?

Boosting
- Boosting takes a different approach: it modifies the training set systematically so that subsequent classifiers can focus on misclassified examples
- Originally proposed in the theory of weak learners
  - Even learners that perform only minimally better than random can be combined into a better ensemble
- Only used for classification
- The whole ensemble is formed by weighted voting

Boosting
- Start with equally weighted samples
- Learn a classifier on the weighted data
  - The weight indicates each sample's contribution to the error function
- Compute the error on the training set
- Increase the weight of the samples that were misclassified
- Go back to learning a new classifier until sufficiently many classifiers are learned (often more than 100)
- Perform classification as the weighted sum of the predictions of all the models

Boosting
(Diagram: weighted training sets D_1 with {w_1^(i) = 1/n}, D_2 with {w_2^(i)}, D_3 with {w_3^(i)}, ..., D_k with {w_k^(i)} yield classifiers h_1(x), h_2(x), h_3(x), ..., h_k(x), which are combined into h(x) = sign( Σ_i α_i h_i(x) ).)

Boosting Example: AdaBoost
Assuming the classes are 1 and -1, the normalized error of the m-th classifier is

    ε_m = ( Σ_{i=1}^n w_m^(i) (1 - δ_{h_m(x^(i)), y^(i)}) ) / ( Σ_{i=1}^n w_m^(i) )

Learn the m-th classifier as the minimizer of this weighted error, h_m = argmin_h ε_m, and continue only if ε_m < 1/2.
From ε_m we can compute the classifier weight

    α_m = 1/2 ln( (1 - ε_m) / ε_m )

and adjust the weights of the data items:

    w_{m+1}^(i) = w_m^(i) e^(-α_m y^(i) h_m(x^(i)))

Boosting Example: AdaBoost
The final classifier produces

    h(x) = sign( Σ_{j=1}^k α_j h_j(x) )

The construction iteratively minimizes an exponential loss function over the misclassifications, and with it the misclassification rate.
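A compact sketch of the AdaBoost loop defined on the last few slides, assuming labels in {+1, -1} and numpy arrays for X and y; the depth-1 decision trees (stumps) from scikit-learn as weak learners, the cap of k rounds, and the explicit renormalization of the weights are implementation choices, not prescribed by the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=100):
    """Fit up to k weak classifiers on re-weighted data; y must hold labels +1/-1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # w_1^(i) = 1/n
    classifiers, alphas = [], []
    for _ in range(k):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)    # normalized weighted error ε_m
        if eps >= 0.5 or eps == 0.0:                 # stop if no better than chance (or perfect)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # α_m = 1/2 ln((1 - ε_m) / ε_m)
        w = w * np.exp(-alpha * y * pred)            # w_{m+1}^(i) = w_m^(i) e^(-α_m y^(i) h_m(x^(i)))
        w = w / w.sum()                              # renormalize (optional; the error is normalized anyway)
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote: h(x) = sign( Σ_j α_j h_j(x) )."""
    scores = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(scores)

For practical use, scikit-learn's AdaBoostClassifier provides a ready-made implementation of this family of algorithms.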

Boosting Example (figure)

Boosting
- Boosting can improve bias and variance
  - It usually leads to larger improvements than bagging
  - Boosting can lead to improvements even with stable classifiers (as opposed to bagging)
- But: boosting can hurt performance on very noisy data
  - Boosting is also more likely than bagging to degrade performance
- Instead of weights on data samples, boosting can also use resampling

Ensemble Methods
- Ensemble methods can be used to improve the performance of existing classifiers
- Bagging reduces variance by averaging solutions
- Boosting can improve bias and variance by weighting the data samples for each classifier so that it focuses on misclassified items
- A range of other ensemble methods have been proposed to achieve better performance than a single classifier, e.g., Mixture of Experts