Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Boosting

Boosting Ideas
Main idea: use a weak learner to build a strong learner.
Ensemble method: combine the base classifiers returned by the weak learner.
Finding simple, relatively accurate base classifiers is often not hard. But how should the base classifiers be combined?

AdaBoost (Freund and Schapire, 1997)
Hypothesis set: H ⊆ {−1, +1}^X.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
  for i ← 1 to m do
      D_1(i) ← 1/m
  for t ← 1 to T do
      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
      α_t ← (1/2) log((1 − ε_t)/ε_t)
      Z_t ← 2[ε_t(1 − ε_t)]^(1/2)    ▷ normalization factor
      for i ← 1 to m do
          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
  f ← Σ_{t=1}^T α_t h_t
  return h = sgn(f)
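The following is a minimal NumPy sketch of the pseudocode above, using axis-aligned decision stumps as base classifiers (an illustration, not the lecture's reference implementation; the helper names fit_stump and stump_predict are made up here):

```python
import numpy as np

def fit_stump(X, y, D):
    """Brute-force search for the decision stump (feature, threshold, sign)
    with the smallest weighted error under the distribution D."""
    m, n = X.shape
    best = (0, 0.0, 1, np.inf)            # (feature, threshold, sign, error)
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()  # weighted error epsilon
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def stump_predict(X, j, thr, s):
    return s * np.where(X[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T):
    """AdaBoost as in the pseudocode above; labels y must be in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        j, thr, s, eps = fit_stump(X, y, D)    # h_t with small weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against eps in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D = D * np.exp(-alpha * y * stump_predict(X, j, thr, s))
        D = D / D.sum()                        # dividing by Z_t
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    """h = sgn(f) with f = sum_t alpha_t h_t."""
    f = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(f)
```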

Notes
Distributions D_t: originally uniform over the training sample; at each round, the weight of a misclassified example is increased.
Observation: D_{t+1}(i) = exp(−y_i f_t(x_i)) / (m ∏_{s=1}^t Z_s), where f_t = Σ_{s=1}^t α_s h_s, since
D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t = D_{t−1}(i) exp(−α_{t−1} y_i h_{t−1}(x_i)) exp(−α_t y_i h_t(x_i)) / (Z_{t−1} Z_t) = ⋯ = (1/m) exp(−y_i Σ_{s=1}^t α_s h_s(x_i)) / ∏_{s=1}^t Z_s.
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Illustration: rounds t = 1, t = 2, t = 3 of AdaBoost on a small sample, and the combined classifier with weights α_1 + α_2 + α_3. [Figures not included in this transcription.]

Bound on Empirical Error (Freund and Schapire, 1997)
Theorem: the empirical error of the classifier output by AdaBoost verifies
R̂(h) ≤ exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
If furthermore γ ≤ (1/2 − ε_t) for all t ∈ [1, T], then
R̂(h) ≤ exp(−2γ²T).
γ does not need to be known in advance: adaptive boosting.
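As a quick numerical illustration (not from the slides): with edge γ = 0.1, the second bound exp(−2γ²T) falls below 1% once T ≥ log(100)/(2γ²) ≈ 231.

```python
import math

def adaboost_error_bound(gamma, T):
    """exp(-2 * gamma**2 * T), the bound on the empirical error above."""
    return math.exp(-2 * gamma**2 * T)

for T in (10, 50, 100, 231):
    print(T, adaboost_error_bound(0.1, T))   # 0.980..., 0.904..., 0.818..., <0.01
```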

Proof: since, as we saw, D_{T+1}(i) = exp(−y_i f(x_i)) / (m ∏_{t=1}^T Z_t),
R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i f(x_i) ≤ 0} ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i)) = (1/m) Σ_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i) = ∏_{t=1}^T Z_t.
Now, since Z_t is a normalization factor,
Z_t = Σ_{i=1}^m D_t(i) exp(−α_t y_i h_t(x_i))
    = Σ_{i: y_i h_t(x_i) ≥ 0} D_t(i) e^{−α_t} + Σ_{i: y_i h_t(x_i) < 0} D_t(i) e^{α_t}
    = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}
    = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t)
    = 2 √(ε_t (1 − ε_t)).

Thus,
∏_{t=1}^T Z_t = ∏_{t=1}^T 2√(ε_t(1 − ε_t)) = ∏_{t=1}^T √(1 − 4(1/2 − ε_t)²) ≤ ∏_{t=1}^T exp(−2(1/2 − ε_t)²) = exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
Notes:
α_t is the minimizer of α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}.
Since (1 − ε_t)e^{−α_t} = ε_t e^{α_t}, at each round AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
For base classifiers taking values in [−1, +1], α_t can be similarly chosen to minimize Z_t.
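A small sanity check of the second note (a sketch on made-up data, not part of the lecture): with the closed-form α_t, the renormalized distribution D_{t+1} puts total mass exactly 1/2 on the points h_t misclassifies.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
D = rng.random(m); D /= D.sum()           # current distribution D_t
correct = rng.random(m) < 0.8             # pretend h_t is right on ~80% of points
margins = np.where(correct, 1.0, -1.0)    # y_i h_t(x_i) in {-1, +1}

eps = D[~correct].sum()                   # weighted error epsilon_t
alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t

D_next = D * np.exp(-alpha * margins)
D_next /= D_next.sum()                    # normalization by Z_t

print(D_next[~correct].sum(), D_next[correct].sum())   # 0.5 0.5 (up to rounding)
```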

AdaBoost = Coordinate Descent
Objective function: convex and differentiable,
F(α) = Σ_{i=1}^m exp(−y_i f(x_i)) = Σ_{i=1}^m exp(−y_i Σ_{t=1}^T α_t h_t(x_i)).
[Plot omitted: the exponential loss x ↦ e^{−x} upper-bounding the zero-one loss.]

Direction: unit vector e_t with
e_t = argmin_t dF(α_{t−1} + η e_t)/dη |_{η=0}.
Since F(α_{t−1} + η e_t) = Σ_{i=1}^m exp(−y_i Σ_{s=1}^{t−1} α_s h_s(x_i)) exp(−y_i η h_t(x_i)),
dF(α_{t−1} + η e_t)/dη |_{η=0} = −Σ_{i=1}^m y_i h_t(x_i) exp(−y_i Σ_{s=1}^{t−1} α_s h_s(x_i))
    = −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s]
    = −[(1 − ε_t) − ε_t] [m ∏_{s=1}^{t−1} Z_s]
    = [2ε_t − 1] [m ∏_{s=1}^{t−1} Z_s].
Thus, the direction corresponds to the base classifier with the smallest error.

Step size: obtained via dF(α_{t−1} + η e_t)/dη = 0:
−Σ_{i=1}^m y_i h_t(x_i) exp(−y_i Σ_{s=1}^{t−1} α_s h_s(x_i)) exp(−y_i η h_t(x_i)) = 0
⇔ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s] exp(−y_i η h_t(x_i)) = 0
⇔ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) exp(−y_i η h_t(x_i)) = 0
⇔ −[(1 − ε_t) e^{−η} − ε_t e^{η}] = 0
⇔ η = (1/2) log((1 − ε_t)/ε_t).
Thus, the step size matches the base-classifier weight α_t of AdaBoost.
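Numerically, minimizing F along the chosen coordinate gives back the same step. A sketch (assuming SciPy is available): up to the positive constant m ∏_{s=1}^{t−1} Z_s, the restricted objective is η ↦ (1 − ε_t)e^{−η} + ε_t e^{η}.

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                  # weighted error of the chosen h_t

def restricted_objective(eta):
    """F(alpha_{t-1} + eta * e_t), up to a positive multiplicative constant."""
    return (1 - eps) * np.exp(-eta) + eps * np.exp(eta)

numeric = minimize_scalar(restricted_objective).x
closed_form = 0.5 * np.log((1 - eps) / eps)
print(numeric, closed_form)                # both about 0.4236
```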

Alternative Loss Functions
boosting (exponential) loss: x ↦ e^{−x}
logistic loss: x ↦ log_2(1 + e^{−x})
square loss: x ↦ (1 − x)² 1_{x ≤ 1}
hinge loss: x ↦ max(1 − x, 0)
zero-one loss: x ↦ 1_{x < 0}
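The same losses written as NumPy functions of the margin value x = y f(x) (a sketch for side-by-side comparison):

```python
import numpy as np

boosting_loss = lambda x: np.exp(-x)                  # exponential loss
logistic_loss = lambda x: np.log2(1 + np.exp(-x))     # logistic loss
square_loss   = lambda x: (1 - x) ** 2 * (x <= 1)     # squared loss, zero for x > 1
hinge_loss    = lambda x: np.maximum(1 - x, 0)        # hinge loss
zero_one_loss = lambda x: (x < 0).astype(float)       # zero-one loss

x = np.linspace(-2, 2, 9)
for name, loss in [("boosting", boosting_loss), ("logistic", logistic_loss),
                   ("square", square_loss), ("hinge", hinge_loss),
                   ("zero-one", zero_one_loss)]:
    print(f"{name:>9}: {np.round(loss(x), 3)}")
```

All but the zero-one loss are convex upper bounds on it, which is what makes them usable as surrogate objectives.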

Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).
Boosting stumps:
data in R^N, e.g., N = 2, (height(x), weight(x)).
associate a stump to each component.
pre-sort each component: O(Nm log m).
at each round, find the best component and threshold (see the sketch below).
total complexity: O((m log m)N + mNT).
stumps are not always weak learners: think of the XOR example!
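A sketch of the stump search with pre-sorting (the function names are illustrative): each component is sorted once, and at every round the best threshold per component is found with a single cumulative-sum sweep, matching the O((m log m)N + mNT) total complexity quoted above.

```python
import numpy as np

def presort(X):
    """One-time pre-sorting of each component: O(N m log m)."""
    return np.argsort(X, axis=0)

def best_stump(X, y, D, order):
    """Per-round O(mN) search for the stump with smallest weighted error.
    Assumes D sums to 1 and y is in {-1, +1}; ties between equal feature
    values are ignored for brevity.  Returns (feature, threshold, sign, error)."""
    m, n = X.shape
    pos_total = D[y == 1].sum()            # total weight of positive points
    best = (None, None, 1, np.inf)
    for j in range(n):
        idx = order[:, j]                  # indices sorting component j
        cum = np.cumsum((D * y)[idx])      # sum of y_i D(i) over x_j <= threshold
        err_plus = pos_total - cum         # stump predicting +1 left of threshold
        err_minus = 1.0 - err_plus         # the sign-flipped stump
        for errs, s in ((err_plus, 1), (err_minus, -1)):
            k = int(np.argmin(errs))
            if errs[k] < best[3]:
                best = (j, float(X[idx[k], j]), s, float(errs[k]))
    return best
```

Calling presort once and best_stump inside each boosting round replaces the brute-force search of the earlier sketch.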

Overfitting?
We could expect AdaBoost to overfit for large values of T, and that is in fact observed in some cases, but in various others it is not! Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error often continues to decrease as more rounds are added. [Figure omitted: training and test error (in %) versus the number of rounds (10 to 1000) for boosted C4.5 decision trees (Schapire et al., 1998).]

L1 Margin Definitions
Definition: the margin of a point x with label y, i.e., the algebraic distance of x = [h_1(x), ..., h_T(x)]^T to the hyperplane α·x = 0, is
ρ(x) = y f(x)/‖α‖_1 = y Σ_{t=1}^T α_t h_t(x) / ‖α‖_1 = y (α·x)/‖α‖_1.
Definition: the margin of the classifier for a sample S = (x_1, ..., x_m) is the minimum margin of the points in that sample:
ρ = min_{i∈[1,m]} y_i (α·x_i)/‖α‖_1.
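These definitions in code, for a hypothetical matrix H with H[i, t] = h_t(x_i) (a sketch with made-up numbers):

```python
import numpy as np

def pointwise_margins(H, y, alpha):
    """rho(x_i) = y_i (alpha . H(x_i)) / ||alpha||_1 for every sample point."""
    return y * (H @ alpha) / np.abs(alpha).sum()

def sample_margin(H, y, alpha):
    """Margin of the classifier on the sample: the minimum pointwise margin."""
    return pointwise_margins(H, y, alpha).min()

H = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1, -1, -1]])              # rows: points, columns: base classifiers
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.3, 0.2])
print(sample_margin(H, y, alpha))         # 0.4, attained by the second point
```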

Note:
SVM margin: ρ = min_{i∈[1,m]} y_i (w·Φ(x_i))/‖w‖_2.
Boosting margin: ρ = min_{i∈[1,m]} y_i (α·H(x_i))/‖α‖_1, with H(x) = [h_1(x), ..., h_T(x)]^T.
Distances: the ℓ_q distance of a point x to the hyperplane w·x + b = 0 is |w·x + b|/‖w‖_p, with 1/p + 1/q = 1.

Convex Hull of Hypothesis Set
Definition: let H be a set of functions mapping from X to R. The convex hull of H is defined by
conv(H) = { Σ_{k=1}^p μ_k h_k : p ≥ 1, μ_k ≥ 0, Σ_{k=1}^p μ_k ≤ 1, h_k ∈ H }.
Ensemble methods are often based on such convex combinations of hypotheses.

Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002)
Theorem: let H be a set of real-valued functions. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √(log(1/δ)/(2m)),
where R_m(H) is a measure of the complexity of H.

Notes
For AdaBoost, the bound applies to the functions
x ↦ f(x)/‖α‖_1 = Σ_{t=1}^T α_t h_t(x)/‖α‖_1 ∈ conv(H).
Note that T does not appear in the bound.

But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
Lower bound: AdaBoost can asymptotically achieve a margin that is at least ρ_max/2 if the data are separable, under some conditions on the base learners (Rätsch and Warmuth, 2002).
Several boosting-type margin-maximization algorithms have been proposed, but their performance in practice is not clear or not reported.

Outliers
AdaBoost assigns larger weights to harder examples.
Application: detecting mislabeled examples.
Dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).

Advantages of AdaBoost
Simple: straightforward implementation.
Efficient: complexity O(mNT) for stumps; when N and T are not too large, the algorithm is quite fast.
Theoretical guarantees: but still many open questions; AdaBoost is not designed to maximize the margin; regularized versions of AdaBoost have been proposed.

Weaker Aspects
Parameters:
need to determine T, the number of rounds of boosting: stopping criterion.
need to determine the base learners: risk of overfitting or low margins.
Noise:
noise severely damages the accuracy of AdaBoost (Dietterich, 2000).
boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with L1 regularization or early stopping (Long and Servedio, 2010).

References
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.
Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.
Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.

References
Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), pages 334-350, Sydney, Australia, July 2002.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.