Deep Boosting. Joint work with Corinna Cortes (Google Research) and Umar Syed (Google Research). MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH.

Ensemble Methods in ML Combining several base classifiers to create a more accurate one. Bagging (Breiman 1996). AdaBoost (Freund and Schapire 1997). Stacking (Smyth and Wolpert 1999). Bayesian averaging (MacKay 1996). Other averaging schemes e.g., (Freund et al. 2004). Often very effective in practice. Benefit of favorable learning guarantees. page #2

Convex Combinations. Base classifier set H: boosting stumps; decision trees with limited depth or number of leaves. Ensemble combinations: convex hull of the base classifier set, $\mathrm{conv}(H) = \left\{ \sum_{t=1}^{T} \alpha_t h_t : \alpha_t \ge 0;\ \sum_{t=1}^{T} \alpha_t \le 1;\ \forall t,\ h_t \in H \right\}$. page #3

Rademacher Complexity. Definitions: let $H$ be a family of functions mapping from $X$ to $\mathbb{R}$ and let $S = (x_1, \ldots, x_m)$ be a sample of size $m$. Empirical Rademacher complexity of $H$: $\widehat{\mathfrak{R}}_S(H) = \frac{1}{m}\, \mathbb{E}_{\sigma}\!\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$. Rademacher complexity of $H$: $\mathfrak{R}_m(H) = \mathbb{E}_{S \sim D^m}\!\left[\widehat{\mathfrak{R}}_S(H)\right]$. page #4
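As an aside, the empirical Rademacher complexity of a finite hypothesis pool can be estimated directly by Monte-Carlo sampling of the sign vector $\sigma$; the sketch below is illustrative only (the array layout and the function name are my own assumptions, not part of the talk).

```python
import numpy as np

def empirical_rademacher(predictions, n_trials=1000, seed=0):
    """Monte-Carlo estimate of (1/m) E_sigma[ sup_h sum_i sigma_i h(x_i) ].

    predictions: array of shape (n_hypotheses, m) holding h(x_i) for each
    hypothesis h in a finite family H and each point x_i of the sample S.
    """
    rng = np.random.default_rng(seed)
    _, m = predictions.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(predictions @ sigma) / m  # sup over h of the correlation
    return total / n_trials
```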

Ensembles - Margin Bound. Theorem (Koltchinskii and Panchenko, 2002): let $H$ be a family of real-valued functions. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in \mathrm{conv}(H)$: $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$, where $\widehat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}$. page #5

Questions. Can we use a much richer or deeper base classifier set? Richer families are needed for difficult tasks in speech and image processing, but the generalization bound indicates a risk of overfitting. page #6

AdaBoost (Freund and Schapire, 1997). Description: coordinate descent applied to $F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} \exp\!\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)$. Guarantees: ensemble margin bound, but AdaBoost does not maximize the margin! Some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006) page #7
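For readers who prefer code, here is a compact sketch of AdaBoost viewed as coordinate descent on the exponential loss over a finite pool of base classifiers; the finite-pool representation and the variable names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def adaboost(preds, y, n_rounds=100):
    """AdaBoost as coordinate descent on F(alpha) = sum_i exp(-y_i f(x_i)).

    preds: (n_hypotheses, m) array of base predictions h_j(x_i) in {-1, +1}.
    y:     (m,) array of labels in {-1, +1}.
    Returns the weight vector alpha over the hypothesis pool.
    """
    n_hyp, m = preds.shape
    alpha = np.zeros(n_hyp)
    D = np.full(m, 1.0 / m)                     # distribution over examples
    for _ in range(n_rounds):
        errors = 0.5 * (1.0 - (preds * y) @ D)  # weighted errors eps_j
        j = int(np.argmin(errors))              # coordinate with the best edge
        eps = np.clip(errors[j], 1e-12, 1 - 1e-12)
        step = 0.5 * np.log((1 - eps) / eps)    # closed-form step size
        alpha[j] += step
        D *= np.exp(-step * y * preds[j])       # exponential reweighting
        D /= D.sum()
    return alpha
```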

Suspicions Complexity of hypotheses used: arc-gv tends to use deeper decision trees to achieve a larger margin. Notion of margin: minimal margin perhaps not the appropriate notion. margin distribution is key. can we shed more light on these questions? page #8

This Talk Main question: how can we design ensemble algorithms that can succeed even with very deep decision trees or other complex sets? theory. practical algorithms. experimental results. page #9

Theory.

Base Classifier Set H. Decomposition in terms of sub-families $H_1, H_2, \ldots, H_p$ or their unions $H_1,\ H_1 \cup H_2,\ \ldots,\ H_1 \cup \cdots \cup H_p$. page #11

Ensemble Family. Non-negative linear ensembles $F = \mathrm{conv}\!\left(\bigcup_{k=1}^{p} H_k\right)$: $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha_t \ge 0$, $\sum_{t=1}^{T} \alpha_t \le 1$, $h_t \in H_{k_t}$. page #12

Ideas. Use hypotheses drawn from the $H_k$s with larger $k$, but allocate more weight to hypotheses drawn from those with smaller $k$. How can we determine quantitatively the amount of mixture weight apportioned to the different families? Can we provide learning guarantees guiding these choices? Connection with SRM? page #13

New Learning Guarantee. Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in F$: $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}_m(H_{k_t}) + O\!\left(\sqrt{\frac{\log p}{\rho^2 m}}\right)$. page #14
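A quick sanity check (my remark, not from the slides): when all hypotheses come from a single family $H$, the weighted complexity term collapses and the bound matches the Koltchinskii-Panchenko margin bound of page #5 up to a constant factor, since $\sum_{t=1}^{T} \alpha_t \le 1$ gives
\[
\frac{4}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}_m(H_{k_t})
= \frac{4}{\rho}\, \mathfrak{R}_m(H) \sum_{t=1}^{T} \alpha_t
\le \frac{4}{\rho}\, \mathfrak{R}_m(H).
\]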

Consequences Complexity term with explicit dependency on mixture weights. quantitative guide for controlling weights assigned to more complex sub-families. bound can be used to inspire, or directly define an ensemble algorithm. page #15

Algorithms.

Set-Up. $H_1, \ldots, H_p$: disjoint sub-families of functions taking values in $[-1, +1]$. Further assumption (not necessary): symmetric sub-families, i.e. $h \in H_k \Rightarrow -h \in H_k$. Notation: for any $h \in \bigcup_{k=1}^{p} H_k$, $d(h)$ denotes the index of the sub-family containing $h$ ($h \in H_{d(h)}$), and $r_t = \mathfrak{R}_m(H_{d(h_t)})$. page #17

Derivation (1). The learning bound suggests seeking $\alpha \ge 0$ with $\sum_{t=1}^{T} \alpha_t \le 1$ to minimize $\frac{1}{m} \sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t r_t$. page #18

Derivation (2). Since $f = \sum_{t=1}^{T} \alpha_t h_t$ and $f/\rho$ have the same error, we can instead search for $\alpha \ge 0$ with $\sum_{t=1}^{T} \alpha_t \le \frac{1}{\rho}$ to minimize $\frac{1}{m} \sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le 1} + 4 \sum_{t=1}^{T} \alpha_t r_t$. page #19

Derivation (3). Let $u \mapsto \Phi(-u)$ be a decreasing convex function upper bounding $u \mapsto 1_{u \le 0}$, with $\Phi$ differentiable. Minimizing the convex upper bound (convex optimization): $\min_{\alpha \ge 0}\ \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) + \lambda \sum_{t=1}^{T} \alpha_t r_t$ s.t. $\sum_{t=1}^{T} \alpha_t \le \frac{1}{\rho}$, where the parameter $\lambda \ge 0$ is introduced. page #20

Derivation (4). Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to: $\min_{\alpha \in \mathbb{R}^N,\ \alpha \ge 0}\ \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\right) + \sum_{j=1}^{N} (\lambda r_j + \beta)\, \alpha_j$, where $\lambda, \beta \ge 0$ and, for each hypothesis, we keep either $h$ or $-h$. page #21

Convex Surrogates. Two principal choices: exponential loss, $\Phi(-u) = \exp(-u)$; logistic loss, $\Phi(-u) = \log_2(1 + \exp(-u))$. page #22
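For concreteness (a simple substitution, not an additional result from the slides): with the exponential surrogate, $\Phi(v) = e^{v}$, so the objective of Derivation (4) becomes
\[
\min_{\alpha \ge 0}\ \frac{e}{m} \sum_{i=1}^{m} \exp\!\Big(-y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N} (\lambda r_j + \beta)\, \alpha_j,
\]
i.e. the AdaBoost exponential loss plus a complexity-weighted L1-type penalty on the mixture weights.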

DeepBoost Algorithm Coordinate descent applied to convex objective. non-differentiable function. definition of maximum coordinate descent. page #23

Direction & Step. Maximum direction: definition based on the error $\epsilon_{t,j} = \frac{1}{2}\left(1 - \mathbb{E}_{i \sim D_t}[y_i h_j(x_i)]\right)$, where $D_t$ is the distribution over the sample at iteration $t$. Step: closed-form expressions for the exponential and logistic losses; general case: line search. page #24

Pseudocode. Uses the per-hypothesis penalty $\lambda r_j + \beta$ for each coordinate $j$. page #25
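The slide's pseudocode figure is not reproduced here; as a rough stand-in, the sketch below runs a greedy maximum-coordinate-descent loop on the regularized exponential-loss objective over a finite pool of base classifiers with known penalties $r_j$. It simplifies the actual DeepBoost pseudocode (the direction and step are found by a crude grid line search rather than the closed-form expressions), so treat it as illustrative only; all names are my own.

```python
import numpy as np

def deepboost_exp(preds, y, r, lam=1e-4, beta=1e-4, n_rounds=100):
    """Greedy coordinate descent on
       F(a) = (1/m) sum_i exp(1 - y_i sum_j a_j h_j(x_i)) + sum_j (lam*r_j + beta) a_j,
    with a >= 0 (a simplified stand-in for the DeepBoost pseudocode).

    preds: (n_hyp, m) array of base predictions h_j(x_i) in [-1, +1].
    y:     (m,) labels in {-1, +1}.
    r:     (n_hyp,) complexity penalty r_j of the sub-family containing h_j.
    """
    n_hyp, m = preds.shape
    alpha = np.zeros(n_hyp)
    penalty = lam * np.asarray(r, dtype=float) + beta

    def objective(a):
        margins = y * (a @ preds)                 # y_i * f(x_i)
        return np.exp(1.0 - margins).mean() + penalty @ a

    for _ in range(n_rounds):
        current = objective(alpha)
        best_gain, best_j, best_eta = 0.0, None, 0.0
        for j in range(n_hyp):
            # Crude 1-D line search; eta >= -alpha_j keeps alpha_j + eta >= 0.
            for eta in np.linspace(-alpha[j], 1.0, 50):
                a_try = alpha.copy()
                a_try[j] += eta
                gain = current - objective(a_try)
                if gain > best_gain:
                    best_gain, best_j, best_eta = gain, j, eta
        if best_j is None:                        # no coordinate improves: stop
            break
        alpha[best_j] += best_eta                 # maximum-descent coordinate update
    return alpha
```

In the real algorithm, the direction is chosen from the weighted errors $\epsilon_{t,j}$ and the penalties $\lambda r_j + \beta$, and the step has a closed form for the exponential and logistic losses, which makes each round far cheaper than the grid search above.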

Notes Straightforward updates. Sparsity step. Parallel implementation. page #26

Connections with Previous Work. For $\lambda = \beta = 0$, DeepBoost coincides with AdaBoost (Freund and Schapire 1997), run with the union of the sub-families, for the exponential loss, and with the algorithm of (Friedman et al., 1998), run with the union of the sub-families, for the logistic loss. For $\lambda = 0$ and $\beta \ne 0$, DeepBoost coincides with L1-regularized AdaBoost (Duchi and Singer 2009); it is close to unnormalized Arcing (Breiman 1999) and also related to AdaBoost (Raetsch and Warmuth 2005). page #27

Experiments.

Rad. Complexity Estimates. Benefit of the data-dependent analysis: empirical estimates of each $\mathfrak{R}_m(H_k)$. Example: for a kernel function $K_k$, $\widehat{\mathfrak{R}}_S(H_k) \le \frac{\sqrt{\mathrm{Tr}[K_k]}}{m}$. Alternatively, upper bounds in terms of growth functions: $\mathfrak{R}_m(H_k) \le \sqrt{\frac{2 \log \Pi_{H_k}(m)}{m}}$. page #29

Experiments (1). Family of base classifiers defined by boosting stumps. $H_1^{\mathrm{stumps}}$: boosting stumps (threshold functions); in dimension $d$, $\Pi_{H_1^{\mathrm{stumps}}}(m) \le 2md$, thus $\mathfrak{R}_m(H_1^{\mathrm{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}$. $H_2^{\mathrm{stumps}}$: decision trees of depth 2 with the same question at the internal nodes of depth 1; in dimension $d$, $\Pi_{H_2^{\mathrm{stumps}}}(m) \le (2m)^2 \frac{d(d-1)}{2}$, thus $\mathfrak{R}_m(H_2^{\mathrm{stumps}}) \le \sqrt{\frac{2\log(2m^2 d(d-1))}{m}}$. page #30
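A small helper (the function names are mine) that evaluates the two growth-function bounds above; numbers like these can serve as the penalties $r_j$ in the stump experiments.

```python
import numpy as np

def rademacher_stumps_depth1(m, d):
    """sqrt(2 log(2 m d) / m): bound for boosting stumps in dimension d."""
    return np.sqrt(2.0 * np.log(2.0 * m * d) / m)

def rademacher_stumps_depth2(m, d):
    """sqrt(2 log(2 m^2 d (d-1)) / m): bound for the depth-2 stump trees."""
    return np.sqrt(2.0 * np.log(2.0 * m**2 * d * (d - 1)) / m)

# Example with the ocr17 sizes from the data statistics (m = 2000, d = 196).
print(rademacher_stumps_depth1(2000, 196), rademacher_stumps_depth2(2000, 196))
```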

Experiments (1). Base classifier set: $H_1^{\mathrm{stumps}} \cup H_2^{\mathrm{stumps}}$. Data sets: same UCI Irvine data sets as (Breiman 1999) and (Reyzin and Schapire 2006); OCR data sets used by (Reyzin and Schapire 2006): ocr17, ocr49; MNIST data sets: ocr17-mnist, ocr49-mnist. Experiments with the exponential loss. Comparison with AdaBoost and AdaBoost-L1. page #31

Data Statistics. breastcancer: 699 examples, 9 attributes; ionosphere: 351 examples, 34 attributes; german (numeric): 1000 examples, 24 attributes; diabetes: 768 examples, 8 attributes; ocr17: 2000 examples, 196 attributes; ocr49: 2000 examples, 196 attributes; ocr17-mnist: 15170 examples, 400 attributes; ocr49-mnist: 13782 examples, 400 attributes. page #32

Experiments - Stumps Exp Loss page #33

Experiments (2). Family of base classifiers defined by decision trees of depth $k$. Since $\mathrm{VC\text{-}dim}(H_k^{\mathrm{trees}}) \le (2^k + 1)\log_2(d+1)$, $\mathfrak{R}_m(H_k^{\mathrm{trees}}) \le \sqrt{\frac{(2^{k+1} + 2)\log_2(d+1)\log(m)}{m}}$. Base classifier set: $\bigcup_{k=1}^{K} H_k^{\mathrm{trees}}$. Same data sets as in Experiments (1). Both exponential and logistic loss. Comparison with AdaBoost and AdaBoost-L1. page #34
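The same kind of helper for the tree families (again, the function name and the example sizes are mine): it turns the VC-dimension bound above into the per-depth penalties $r_1, \ldots, r_K$.

```python
import numpy as np

def rademacher_trees(m, d, k):
    """sqrt((2^(k+1) + 2) * log2(d+1) * log(m) / m): bound for depth-k trees."""
    return np.sqrt((2.0 ** (k + 1) + 2.0) * np.log2(d + 1) * np.log(m) / m)

# Example: penalties for tree depths k = 1..5 on a data set with the
# german (numeric) sizes, m = 1000 examples in dimension d = 24.
r = [rademacher_trees(1000, 24, k) for k in range(1, 6)]
```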

Experiments - Trees Exp Loss page #35

Experiments - Trees Log Loss page #36

Margin Distribution page #37

Extension to Multi-Class Similar data-dependent learning guarantee proven for the multi-class setting. bound depending on mixture weights and complexity of sub-families. Deep Boosting algorithm for multi-class: similar extension taking into account the complexities of sub-families. several variants depending on number of classes. different possible loss functions for each variant. page #38

Multi-Class Learning Guarantee. Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in F$: $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4c^2}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}_m(\Pi_1(H_{k_t})) + O\!\left(\sqrt{\frac{\log p}{\rho^2 m} \log\!\left[\frac{\rho^2 m}{\log p}\right]}\right)$, with $c$ the number of classes and $\Pi_1(H_k) = \{x \mapsto h(x, y) : y \in Y,\ h \in H_k\}$. page #39

Conclusion Deep Boosting: ensemble learning with increasingly complex families. data-dependent theoretical analysis. algorithm grounded in theory. extension to multi-class. ranking and other losses. enhancement of many existing algorithms. compares favorably to AdaBoost and AdaBoost-L1 in preliminary experiments. page #40