
CS7267 MACHINE LEARNING
ENSEMBLE LEARNING
Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU
Mingon Kang, Ph.D., Computer Science, Kennesaw State University

Definition of Ensemble Learning
Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods construct a set of hypotheses and combine them for use.

Why ensemble learning?
- Accuracy: a more reliable mapping can be obtained by combining the outputs of multiple experts
- Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach)
- There is no single model that works for all ML problems


Bias-variance tradeoff
Simple (a.k.a. weak or base) learners, e.g., SVM, logistic regression, simple discriminant functions:
- Low variance: don't usually overfit
- High bias: can't solve hard learning problems
Can we make weak learners always good? No, but often yes.

Ensemble methods
Learn multiple weak learners that are good at different parts of the input space.
Weak learners:
- Homogeneous or heterogeneous weak learners
- Parallel or sequential style
- Should be as accurate as possible and as diverse as possible
How to combine the weak learners?
- Majority voting (classification)
- Weighted averaging for regression
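The two combination rules can be written in a few lines. The sketch below is my own illustration in NumPy, with made-up toy votes and function names; it is not code from the lecture.

import numpy as np

def majority_vote(predictions):
    """predictions: (n_learners, n_samples) array of integer class labels."""
    predictions = np.asarray(predictions)
    # Most frequent label across learners, for each sample.
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def weighted_average(predictions, weights):
    """predictions: (n_learners, n_samples) real-valued outputs; weights: (n_learners,)."""
    weights = np.asarray(weights, dtype=float)
    return weights @ np.asarray(predictions, dtype=float) / weights.sum()

# Three learners, four samples (toy example).
votes = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 0]]
print(majority_vote(votes))                      # -> [0 1 1 0]
print(weighted_average(votes, [0.5, 0.2, 0.3]))  # weighted mean per sample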

Accuracy/diversity for learners
Accuracy of the learners is easy to estimate; how about diversity? There is no rigorous definition for measuring it.
The diversity of the base learners can be introduced through different channels, such as:
- Subsampling the training examples
- Manipulating the attributes
- Manipulating the outputs
- Injecting randomness into the learning algorithms

Subsampling the training examples
Multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set.
Manipulating the input features
Multiple hypotheses are generated by training individual classifiers on different representations or different subsets of a common feature vector.
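A small sketch of these two channels; the toy data and variable names are my own illustration, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))             # 100 examples, 10 features (toy data)
y = rng.integers(0, 2, size=100)

# (1) Subsampling the training examples: a bootstrap replicate (sampling with replacement)
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]

# (2) Manipulating the input features: keep a random subset of the columns
feat = rng.choice(X.shape[1], size=5, replace=False)
X_sub = X[:, feat]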

Manipulating the output targets
The output targets for C classes are encoded with an L-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword.
Modifying the learning parameters of the classifier
A number of classifiers are built with different learning parameters, such as the number of neighbors in a k-Nearest Neighbor rule.
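scikit-learn ships an implementation of this codeword scheme (OutputCodeClassifier). The snippet below is a hedged sketch with an assumed dataset and parameters, not course-provided code.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)
# code_size controls L: roughly code_size * n_classes bits, one binary classifier per bit.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000), code_size=2.0, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))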

Ensemble methods
- Boosting
  - AdaBoost
- Bagging
  - Bootstrap sampling
  - Random Forests: a variant of bagging
- Stacking
  - A number of first-level individual learners are generated from the training data set by different learning algorithms
  - The individual learners are combined by a second-level learner
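For the stacking branch, a minimal sketch: first-level learners from different algorithms combined by a second-level learner. The base learners, dataset, and parameters are my assumptions, not the lecture's.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
first_level = [('tree', DecisionTreeClassifier(max_depth=3)),   # first-level learners
               ('svm', SVC())]
stack = StackingClassifier(estimators=first_level,
                           final_estimator=LogisticRegression(max_iter=1000))  # second-level learner
stack.fit(X, y)
print(stack.score(X, y))   # training accuracy of the stacked ensemble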

Boosting [Schapire 89]
Output class: weighted vote of each learner.
Let h_t(x) be the output of the t-th classifier, which learns about different parts of the input space.
The decision H(x) can be made by a weighted linear combination of the h_t(x).

Boosting [Schapire 89]
Given a weak learner, run it multiple times on the training data, then let the learned classifiers vote.
On each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a weak hypothesis: h_t
- Compute a strength for this hypothesis: α_t
Final classifier: H(x) = sign(Σ_t α_t h_t(x))

Learning from weighted data
Consider a weighted dataset:
- D(i): weight of the i-th training example (x_i, y_i)
When resampling the data, points with higher weight may be drawn multiple times.
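One way a weak learner can account for the weights D(i) is to resample the training set in proportion to them. A toy sketch (the weight values are chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 8
D = np.full(n, 1.0 / n)     # uniform initial weights D_1(i) = 1/n
D[3] = 0.5                  # pretend example 3 has been misclassified repeatedly
D /= D.sum()                # keep D a probability distribution

idx = rng.choice(n, size=n, replace=True, p=D)
print(idx)                  # index 3 tends to appear several times in the resample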

AdaBoost [Freund & Schapire 95]
Given: (x_1, y_1), ..., (x_n, y_n) where y_i ∈ {-1, +1}
Initialize D_1(i) = 1/n   // initially equal weights
For t = 1, ..., T:
- Train a weak learner using distribution D_t
- Get a weak classifier h_t: X → R
- Choose α_t ∈ R
- Update:
  D_{t+1}(i) = D_t(i)/Z_t * e^{-α_t}   if y_i = h_t(x_i)
             = D_t(i)/Z_t * e^{α_t}    if y_i ≠ h_t(x_i)
             = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t   // increase weight if wrong

AdaBoost [Freund & Schapire 95]
Z_t is a normalization factor:
Z_t = Σ_{i=1}^{n} D_t(i) exp(-α_t y_i h_t(x_i))
Final classifier:
H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

How to choose α_t?
Weight update rule: D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
Weighted training error: ε_t = Σ_{i=1}^{n} D_t(i) δ(h_t(x_i) ≠ y_i)
Choose: α_t = (1/2) ln((1 - ε_t) / ε_t)
- ε_t = 0 if h_t perfectly classifies all weighted data (α_t = ∞)
- ε_t = 1 if h_t is perfectly wrong (α_t = -∞)
- ε_t = 0.5, i.e., no better than random guessing (α_t = 0)
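Putting the weight update, ε_t, and α_t together, here is a compact from-scratch sketch using decision stumps as the weak learners. The stump choice, toy data, and function names are my assumptions; the course does not supply code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must take values in {-1, +1}. Returns a list of (alpha_t, h_t) pairs."""
    n = len(X)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/n
    ensemble = []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)          # train weak learner using distribution D_t
        pred = stump.predict(X)
        eps = np.sum(D * (pred != y))             # weighted training error ε_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against log(0) / divide-by-zero
        alpha = 0.5 * np.log((1 - eps) / eps)     # α_t = ½ ln((1 - ε_t)/ε_t)
        D = D * np.exp(-alpha * y * pred)         # D_{t+1}(i) ∝ D_t(i) exp(-α_t y_i h_t(x_i))
        D /= D.sum()                              # normalize by Z_t
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    f = sum(alpha * h.predict(X) for alpha, h in ensemble)   # f(x) = Σ_t α_t h_t(x)
    return np.sign(f)                                        # H(x) = sign(f(x))

# Toy usage with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = adaboost_fit(X, y, T=20)
print((adaboost_predict(model, X) == y).mean())   # training accuracy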

Boosting Example (figure-only slides, not transcribed)

Hard & Soft Decision
Weighted average of weak learners: f(x) = Σ_t α_t h_t(x)
Hard decision / predicted label: H(x) = sign(f(x))
Soft decision, based on the analogy with logistic regression: P(Y = 1 | X) = 1 / (1 + exp(-f(x)))
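Reusing the ensemble list from the AdaBoost sketch above, the soft decision is just the margin f(x) squashed through a logistic function; the sign convention follows the formula above and is an assumption on my part.

import numpy as np

def soft_decision(ensemble, X):
    f = sum(alpha * h.predict(X) for alpha, h in ensemble)   # f(x) = Σ_t α_t h_t(x)
    return 1.0 / (1.0 + np.exp(-f))                          # P(Y = 1 | x); > 0.5 iff sign(f(x)) = +1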

Effect of Outliers
Boosting can identify outliers, since it focuses on examples that are hard to categorize.
Too many outliers can:
- degrade classification performance
- dramatically increase the time to convergence

Bagging [Breiman 96]
Run independent weak learners on bootstrap replicates (sampling with replacement) of the training set.
Average/vote over the weak hypotheses.
Differences from Boosting:
- Bagging: resamples the data; gives each classifier the same weight; reduces variance only
- Boosting: reweights the data; each classifier's weight depends on its accuracy; reduces both bias and variance
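A minimal bagging sketch in the spirit of this slide; the bootstrap loop, tree weak learners, and toy data are my own choices, not course code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=25, random_state=0):
    rng = np.random.default_rng(random_state)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap replicate (with replacement)
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    preds = np.array([h.predict(X) for h in learners])        # (n_learners, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])   # equal-weight majority vote

# Usage with integer labels in {0, 1}
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
models = bagging_fit(X, y)
print((bagging_predict(models, X) == y).mean())

scikit-learn's BaggingClassifier and RandomForestClassifier provide library implementations of the same idea.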