CIn/UFPE - Prof. Francisco de A. T. de Carvalho


Notation
- $R^n$: feature space
- $Y$: class label set
- $m$: number of training examples
- $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$): training data set
- $\mathcal{H}$: hypothesis space
- $h$: base classifier
- $H$: ensemble classifier

Joint probability (product rule): $P(X, Y) = P(Y \mid X)P(X) = P(X \mid Y)P(Y)$
Total probability (sum rule): $P(X) = \sum_Y P(X, Y) = \sum_Y P(X \mid Y)P(Y)$
Total probability (sum rule): $P(Y) = \sum_X P(X, Y) = \sum_X P(Y \mid X)P(X)$
Bayes' theorem: $P(Y \mid X) = \dfrac{P(X \mid Y)P(Y)}{P(X)} = \dfrac{P(X \mid Y)P(Y)}{\sum_Y P(X \mid Y)P(Y)}$
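As a quick numeric illustration of the product, sum, and Bayes rules above, the sketch below recovers $P(Y \mid X)$ from a tabulated joint distribution with NumPy; the joint table is made up purely for illustration and is not from the slides.

```python
import numpy as np

# Toy joint distribution P(X, Y): rows index X (2 values), columns index Y (3 classes).
# The numbers are arbitrary, chosen only so that the table sums to 1.
P_XY = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

P_X = P_XY.sum(axis=1)             # sum rule: P(X) = sum_Y P(X, Y)
P_Y = P_XY.sum(axis=0)             # sum rule: P(Y) = sum_X P(X, Y)
P_Y_given_X = P_XY / P_X[:, None]  # product rule rearranged: P(Y|X) = P(X, Y) / P(X)
P_X_given_Y = P_XY / P_Y[None, :]

# Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X); both routes agree.
bayes = (P_X_given_Y * P_Y[None, :]) / P_X[:, None]
assert np.allclose(bayes, P_Y_given_X)
print(P_Y_given_X)
```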

Bayes optimal classifier
Ensemble methods try to combine several models into one model; each base model is assigned a weight based on its contribution to the classification task. Three questions arise: which models should be considered? how to choose them from the hypothesis space? how to assign a weight to each model?
Given a training data set $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$), the goal is to obtain an ensemble classifier that assigns a label to an unseen $x$ as:
$H(x) = y^* = \arg\max_{y \in Y} \sum_{h \in \mathcal{H}} p(y \mid x, h)\, p(h \mid D)$

Bayes optimal classifier
Assume the training examples are drawn independently:
$p(h \mid D) = \dfrac{p(h)\, p(D \mid h)}{p(D)} = \dfrac{p(h) \prod_{i=1}^m p(x_i \mid h)}{p(D)}$
Therefore:
$H(x) = y^* = \arg\max_{y \in Y} \sum_{h \in \mathcal{H}} p(y \mid x, h)\, p(D \mid h)\, p(h)$
The Bayes optimal classifier is the ideal ensemble method; however, it cannot be practically implemented, since summing over the whole hypothesis space is intractable.
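The Bayes optimal rule is only computable when the hypothesis space can be enumerated. The sketch below does exactly that for three hypothetical hypotheses over a binary label; the per-hypothesis predictions and the posteriors $p(h \mid D)$ are made up for illustration and are not from the slides.

```python
import numpy as np

# Three hypothetical hypotheses, each giving p(y=+1 | x, h) for a single test point x.
p_pos_given_h = np.array([0.9, 0.4, 0.6])   # p(y=+1 | x, h) for h1, h2, h3 (made up)
posterior_h   = np.array([0.5, 0.3, 0.2])   # p(h | D), already normalized (made up)

# Bayes optimal prediction: arg max_y  sum_h p(y | x, h) p(h | D)
score_pos = np.sum(p_pos_given_h * posterior_h)
score_neg = np.sum((1.0 - p_pos_given_h) * posterior_h)
y_star = +1 if score_pos > score_neg else -1
print(score_pos, score_neg, y_star)   # 0.69, 0.31, +1
```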

Bayesian Model Averaging
Ref.: J. Hoeting et al., "Bayesian model averaging: a tutorial", Statistical Science, 14(4): 382-417, 1999.
Models are sampled using a Monte Carlo sampling technique. A simpler way: treat a model trained on a random subset of the training data as a sampled model.
Computation of $p(h)$: when no prior knowledge is available, use a uniform distribution; without normalization, set $p(h) = 1$ for every hypothesis.
Computation of $p(D \mid h)$: let $\varepsilon(h)$ be the error rate of hypothesis $h$ on the training data; $p(x_i \mid h)$ is computed as
$p(x_i \mid h) = \exp\{\varepsilon(h)\ln \varepsilon(h) + (1 - \varepsilon(h))\ln(1 - \varepsilon(h))\}$
so that
$p(D \mid h) = \prod_{i=1}^m p(x_i \mid h) = \exp\{m[\varepsilon(h)\ln \varepsilon(h) + (1 - \varepsilon(h))\ln(1 - \varepsilon(h))]\}$
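A quick numeric check of the likelihood formula above, with arbitrary values of $\varepsilon$ and $m$ chosen for illustration: a lower training error yields a larger $p(D \mid h)$, which is what turns the error rate into a model weight.

```python
import numpy as np

def likelihood(eps: float, m: int) -> float:
    """p(D | h) as defined on the slide: exp{m [eps*ln(eps) + (1-eps)*ln(1-eps)]}."""
    return np.exp(m * (eps * np.log(eps) + (1.0 - eps) * np.log(1.0 - eps)))

m = 20  # number of training examples (arbitrary)
for eps in (0.1, 0.3, 0.5):
    print(eps, likelihood(eps, m))
# eps = 0.1 gives the largest likelihood, eps = 0.5 the smallest.
```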

Bayesian Model Averaging

Algorithm 1 Bayesian Model Averaging
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ of size $m$ by randomly sampling from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4:   Set $p(h_t) = 1$
5:   Calculate $\varepsilon(h_t)$ on $D_t$
6:   Calculate $p(D_t \mid h_t) = \exp\{m[\varepsilon(h_t)\ln \varepsilon(h_t) + (1 - \varepsilon(h_t))\ln(1 - \varepsilon(h_t))]\}$
7:   Set $weight(h_t) = p(D_t \mid h_t)\, p(h_t)$
8: end for
9: Normalize all the weights to sum to 1
10: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^T p(y \mid x, h_t)\, weight(h_t)$

Expected error of Bayesian model averaging: at most twice the expected error of the Bayes optimal classifier.
Over-fitting problems: it prefers the hypothesis with the lowest error on the training data rather than the hypothesis with the lowest true error; in practice it conducts a selection of classifiers instead of combining them.
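A minimal sketch of Algorithm 1, assuming scikit-learn decision stumps as base classifiers, a synthetic data set, and a small floor on the error rate to avoid $\ln 0$; none of these choices come from the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary data set (illustrative only).
m, n = 200, 5
X = rng.normal(size=(m, n))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

T = 10
classifiers, weights = [], []
for t in range(T):
    idx = rng.integers(0, m, size=m)                       # random sample of size m from D
    Xt, yt = X[idx], y[idx]
    h = DecisionTreeClassifier(max_depth=1).fit(Xt, yt)    # base classifier h_t
    eps = np.clip(np.mean(h.predict(Xt) != yt), 1e-6, 1 - 1e-6)   # eps(h_t) on D_t, floored
    log_like = m * (eps * np.log(eps) + (1 - eps) * np.log(1 - eps))  # ln p(D_t | h_t)
    classifiers.append(h)
    weights.append(np.exp(log_like))                       # p(h_t) = 1, so weight = p(D_t | h_t)

weights = np.array(weights)
weights /= weights.sum()                                   # normalize weights to sum to 1
print("weights:", np.round(weights, 3))
# The normalized weights concentrate on the lowest-error stump, illustrating the
# "selection rather than combination" issue noted above.

def H(Xnew):
    """Weighted vote: arg max_y sum_t weight(h_t) * 1(h_t(x) = y)."""
    votes = np.zeros((len(Xnew), 2))
    for w, h in zip(weights, classifiers):
        votes[np.arange(len(Xnew)), h.predict(Xnew)] += w
    return votes.argmax(axis=1)

print("training accuracy:", np.mean(H(X) == y))
```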

Bayesian Model Combination

Algorithm 2 Bayesian Model Combination
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ of size $m$ by randomly sampling from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4:   Set $weight(h_t) = 0$
5: end for
6: $SumWeight = 0$
7: $z = -\infty$
8: Set the number of iterations used to compute the weights: $iteration$

To overcome the over-fitting problem, Bayesian model combination samples directly from the space of possible ensemble hypotheses: it regards all the base classifiers as one model and iteratively calculates their weights simultaneously.

Bayesian Model Combination

Algorithm 3 Bayesian Model Combination (cont.)
1: for $iter \leftarrow 1$ to $iteration$ do
2:   for each base classifier $h_t$ do
3:     Draw a temporary weight:
4:     $TempWeight(h_t) = -\ln(rand_{uniform}(0, 1))$
5:   end for
6:   Normalize $TempWeight$ to sum to 1
7:   Combine the base classifiers as $H' = \sum_{t=1}^T TempWeight(h_t)\, h_t$
8:   Calculate $\varepsilon(H')$ on $D$
9:   Calculate $p(D \mid H') = \exp\{m[\varepsilon(H')\ln \varepsilon(H') + (1 - \varepsilon(H'))\ln(1 - \varepsilon(H'))]\}$
10:  if $p(D \mid H') > z$ then
11:    for each base classifier do
12:      $weight(h_t) = weight(h_t)\, \exp\{z - p(D \mid H')\}$
13:    end for
14:    $z = p(D \mid H')$
15:  end if
16:  $w = \exp\{p(D \mid H') - z\}$
17:  for each base classifier do
18:    $weight(h_t) = weight(h_t)\,\dfrac{SumWeight}{SumWeight + w} + w\, TempWeight(h_t)$
19:  end for
20:  $SumWeight = SumWeight + w$
21: end for
22: Normalize all the weights to sum to 1
23: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^T p(y \mid x, h_t)\, weight(h_t)$
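A sketch of Algorithms 2-3, reusing the synthetic data and scikit-learn stumps assumed in the previous sketch. Following the usual formulation of this algorithm, the quantity compared against $z$ is kept in log space (the slide's $p(D \mid H')$ plays that role in the updates), and the $-\ln(\text{uniform})$ draws, once normalized, are the standard way to sample the temporary weights from a flat Dirichlet distribution.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic binary data set (illustrative only).
m, n = 200, 5
X = rng.normal(size=(m, n))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Algorithm 2: train T base classifiers on random size-m samples of D.
T = 10
classifiers = []
for t in range(T):
    idx = rng.integers(0, m, size=m)
    classifiers.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
preds = np.array([h.predict(X) for h in classifiers])   # T x m predictions on D

weight = np.zeros(T)
sum_weight = 0.0
z = -np.inf
iterations = 100

# Algorithm 3: repeatedly sample candidate weightings and accumulate them.
for _ in range(iterations):
    temp = -np.log(rng.uniform(size=T))       # flat-Dirichlet sample via -ln(U)
    temp /= temp.sum()                        # normalize TempWeight to sum to 1
    votes_pos = temp @ (preds == 1)           # combine base classifiers (weighted vote)
    combined = (votes_pos > 0.5).astype(int)
    eps = np.clip(np.mean(combined != y), 1e-6, 1 - 1e-6)            # eps(H') on D
    log_lik = m * (eps * np.log(eps) + (1 - eps) * np.log(1 - eps))  # ln p(D | H')
    if log_lik > z:
        weight *= np.exp(z - log_lik)         # rescale old weights to the new reference
        z = log_lik
    w = np.exp(log_lik - z)
    weight = weight * (sum_weight / (sum_weight + w)) + w * temp
    sum_weight += w

weight /= weight.sum()                        # normalize all the weights to sum to 1

def H(Xnew):
    votes = np.zeros((len(Xnew), 2))
    for wt, h in zip(weight, classifiers):
        votes[np.arange(len(Xnew)), h.predict(Xnew)] += wt
    return votes.argmax(axis=1)

print("training accuracy:", np.mean(H(X) == y))
```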

Bagging (bootstrap aggregation)

Algorithm 4 Bagging
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: for $t \leftarrow 1$ to $T$ do
2:   Construct $D_t$ by randomly sampling with replacement from $D$
3:   Learn a base classifier $h_t$ based on $D_t$
4: end for
5: return $H(x) = y^* = \arg\max_{y \in Y} \sum_{t=1}^T \mathbb{1}(h_t(x) = y)$

Bagging adopts the bootstrap sampling technique to construct base models:
- it generates new data sets by sampling from the original data set with replacement;
- it trains a base classifier on each sampled data set;
- it combines all the base classifiers by majority voting.
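A minimal bagging sketch, again assuming scikit-learn decision trees and a synthetic data set (both are illustration choices, not part of the slide). In practice, `sklearn.ensemble.BaggingClassifier` packages the same procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Synthetic binary data set (illustrative only).
m, n = 300, 5
X = rng.normal(size=(m, n))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

T = 25
classifiers = []
for t in range(T):
    idx = rng.integers(0, m, size=m)   # bootstrap sample: draw m indices with replacement
    classifiers.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

def H(Xnew):
    """Majority vote over the T base classifiers."""
    all_preds = np.array([h.predict(Xnew) for h in classifiers])  # T x len(Xnew)
    return (all_preds.mean(axis=0) > 0.5).astype(int)

print("training accuracy:", np.mean(H(X) == y))
```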

Boosting

Algorithm 5 Boosting
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: Initialize the weight distribution $W_1$
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a weak classifier $h_t$ based on $D$ and the weight distribution $W_t$
4:   Evaluate the weak classifier: $\varepsilon(h_t)$
5:   Update the weight distribution $W_{t+1}$ based on $\varepsilon(h_t)$
6: end for
7: return $H = Combination(\{h_1, \ldots, h_T\})$

Boosting converts weak classifiers into a strong one:
- it iteratively adjusts the importance of the examples in the training set;
- it gradually corrects the mistakes made by the weak classifiers;
- it learns each base classifier based on the current weight distribution;
- it combines the learned classifiers.

AdaBoost

Algorithm 6 AdaBoost in Binary Classification
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in \{+1, -1\}$)
Ensure: An ensemble classifier $H$
1: Initialize the weight distribution $W_1(i) = \frac{1}{m}$
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a base classifier $h_t = \arg\min_h \varepsilon(h)$, where $\varepsilon(h) = \sum_{i=1}^m W_t(i)\, \mathbb{1}(h(x_i) \neq y_i)$
4:   Calculate the weight of $h_t$: $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - \varepsilon(h_t)}{\varepsilon(h_t)}\right)$
5:   Update the weight distribution over the training examples: $W_{t+1}(i) = \dfrac{W_t(i)\exp\{-\alpha_t h_t(x_i) y_i\}}{\sum_{i'=1}^m W_t(i')\exp\{-\alpha_t h_t(x_{i'}) y_{i'}\}}$
6: end for
7: return $H = \sum_{t=1}^T \alpha_t h_t$

If $\varepsilon(h_1) \le \varepsilon(h_2)$ then $\alpha_{h_1} \ge \alpha_{h_2}$: the more accurate classifier receives the larger weight.
For $\varepsilon(h_t) < 0.5$, $\alpha_t > 0$.
If $x_i$ is wrongly classified, $h_t(x_i)y_i = -1$ and, since $\alpha_t > 0$, $-\alpha_t h_t(x_i)y_i > 0$; as $\exp\{-\alpha_t h_t(x_i)y_i\} > 1$, the new weight $W_{t+1}(i) > W_t(i)$ (before normalization).
If $x_i$ is correctly classified, $W_{t+1}(i) < W_t(i)$.
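A compact sketch of Algorithm 6, assuming weighted scikit-learn decision stumps (their `fit` accepts a `sample_weight` argument) and a synthetic data set with labels in $\{-1, +1\}$; the small clip on the weighted error is an added safeguard against $\ln 0$. A packaged version exists as `sklearn.ensemble.AdaBoostClassifier`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Synthetic binary data set with labels in {-1, +1} (illustrative only).
m, n = 300, 5
X = rng.normal(size=(m, n))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

T = 30
W = np.full(m, 1.0 / m)                                # W_1(i) = 1/m
alphas, stumps = [], []
for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
    pred = h.predict(X)
    eps = np.clip(np.sum(W * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error eps(h_t)
    alpha = 0.5 * np.log((1 - eps) / eps)              # alpha_t
    W = W * np.exp(-alpha * pred * y)                  # up-weight misclassified examples
    W /= W.sum()                                       # renormalize W_{t+1}
    stumps.append(h)
    alphas.append(alpha)

def H(Xnew):
    """Sign of the weighted sum: H(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * h.predict(Xnew) for a, h in zip(alphas, stumps))
    return np.sign(score)

print("training accuracy:", np.mean(H(X) == y))
```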

Stacking

Algorithm 7 Stacking
Require: Training data $D = \{(x_i, y_i)\}_{i=1}^m$ ($x_i \in R^n$; $y_i \in Y$)
Ensure: An ensemble classifier $H$
1: Step 1: learn the first-level classifiers
2: for $t \leftarrow 1$ to $T$ do
3:   Learn a base classifier $h_t$ based on $D$
4: end for
5: Step 2: construct a new data set from $D$
6: for $i \leftarrow 1$ to $m$ do
7:   Construct a new data set containing $\{(x_i', y_i)\}$, where $x_i' = (h_1(x_i), \ldots, h_T(x_i))$
8: end for
9: Step 3: learn a second-level classifier
10: Learn a new classifier $h'$ based on the newly constructed data set
11: return $H(x) = h'(h_1(x), \ldots, h_T(x))$

Stacking learns a high-level classifier on top of the base classifiers:
- it can be seen as a meta-learning approach;
- the base classifiers are called first-level classifiers;
- a second-level classifier is learnt to combine the first-level classifiers.

Stacking
Step 1: learn the first-level classifiers based on the original training data set. Several strategies are possible:
- apply the bootstrap sampling technique to learn independent classifiers;
- adopt the strategy used in Boosting: adaptively learn base classifiers on data with a weight distribution;
- tune the parameters of a single learning algorithm to generate diverse base classifiers (homogeneous classifiers);
- apply different classification and/or sampling methods to generate the base classifiers (heterogeneous classifiers).
Step 2: construct a new data set based on the outputs of the base classifiers.
Step 3: learn a second-level classifier based on the newly constructed data set. Any learning method can be applied to learn the second-level classifier.
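A minimal stacking sketch following Algorithm 7, assuming heterogeneous scikit-learn first-level classifiers and a logistic-regression second-level classifier; these particular model choices and the synthetic data are illustration, not prescribed by the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Synthetic binary data set (illustrative only).
m, n = 300, 5
X = rng.normal(size=(m, n))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Step 1: learn heterogeneous first-level classifiers on D.
first_level = [
    DecisionTreeClassifier(max_depth=3).fit(X, y),
    GaussianNB().fit(X, y),
    KNeighborsClassifier(n_neighbors=5).fit(X, y),
]

# Step 2: construct the new data set x_i' = (h_1(x_i), ..., h_T(x_i)).
X_new = np.column_stack([h.predict(X) for h in first_level])

# Step 3: learn the second-level classifier h' on the new data set.
second_level = LogisticRegression().fit(X_new, y)

def H(Xq):
    """H(x) = h'(h_1(x), ..., h_T(x))."""
    features = np.column_stack([h.predict(Xq) for h in first_level])
    return second_level.predict(features)

print("training accuracy:", np.mean(H(X) == y))
```

In practice the second-level training features are usually produced with cross-validation rather than by predicting on the same data the first-level classifiers were trained on; the sketch skips that to stay close to the pseudocode.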