Advanced Machine Learning

Advanced Machine Learning
Deep Boosting
MEHRYAR MOHRI
MOHRI@
COURANT INSTITUTE & GOOGLE RESEARCH

Outline
Model selection.
Deep boosting: theory, algorithm, experiments.

Model Selection
Problem: how to select the hypothesis set H?
H too complex: no generalization bound, overfitting.
H too simple: generalization bound holds, but underfitting.
Balance needed between estimation and approximation errors.
[Figure: hypothesis set H, a hypothesis h, the best-in-class hypothesis h*, and the Bayes hypothesis h_Bayes.]

Estimation and Approximation
General equality: for any $h \in H$, with $h^*$ the best-in-class hypothesis and $R^*$ the Bayes error,
$$R(h) - R^* = \underbrace{[R(h) - R(h^*)]}_{\text{estimation}} + \underbrace{[R(h^*) - R^*]}_{\text{approximation}}.$$
Approximation: not a random variable, only depends on $H$.
Estimation: only term we can hope to bound; for ERM, bounded by two times the generalization bound:
$$R(h_{\mathrm{ERM}}) - R(h^*) = R(h_{\mathrm{ERM}}) - \widehat R_S(h_{\mathrm{ERM}}) + \widehat R_S(h_{\mathrm{ERM}}) - R(h^*) \le R(h_{\mathrm{ERM}}) - \widehat R_S(h_{\mathrm{ERM}}) + \widehat R_S(h^*) - R(h^*) \le 2\sup_{h \in H}\big|R(h) - \widehat R_S(h)\big|.$$

Structural Risk Minimization
(Vapnik and Chervonenkis, 1974; Vapnik, 1995)
SRM: $H = \bigcup_{k=1}^{\infty} H_k$ with $H_1 \subset H_2 \subset \cdots \subset H_k \subset \cdots$
Solution: $f = \operatorname*{argmin}_{h \in H_k,\, k \ge 1} \widehat R_S(h) + \mathrm{pen}(k, m)$, i.e., training error + penalty.
[Figure: error as a function of complexity, showing the training error, the penalty, and their sum.]
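To make the SRM rule concrete, here is a minimal numpy sketch (my own illustration, not part of the slides) of the selection step: given per-family training errors and complexity estimates, it returns the index minimizing training error plus penalty. The toy numbers are assumptions chosen only to show the trade-off.

```python
import numpy as np

def srm_select(train_errors, complexities, m):
    """Pick the family k minimizing training error + penalty, as in SRM.

    train_errors[k]: empirical error of the ERM solution in H_k.
    complexities[k]: a (Rademacher-style) complexity estimate for H_k.
    The sqrt(log k / m) term mirrors the penalty in the SRM guarantee.
    """
    ks = np.arange(1, len(train_errors) + 1)
    scores = (np.asarray(train_errors)
              + np.asarray(complexities)
              + np.sqrt(np.log(ks) / m))
    return int(np.argmin(scores))  # 0-based index of the selected hypothesis set

# Toy usage: richer sets fit better but pay a larger penalty.
errs = [0.30, 0.18, 0.12, 0.11, 0.105]
caps = [0.01, 0.03, 0.08, 0.15, 0.30]
print(srm_select(errs, caps, m=1000))
```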

SRM Guarantee
Definitions:
$H_{k(h)}$: the simplest hypothesis set containing $h$.
$f$: the hypothesis returned by SRM, $f = \operatorname*{argmin}_{h \in H_k,\, k \ge 1} \widehat R_S(h) + \mathfrak{R}_m(H_k) + \sqrt{\tfrac{\log k}{m}} = \operatorname*{argmin}_{h} F_{k(h)}(h)$.
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$,
$$R(f) \le R(h^*) + 2\mathfrak{R}_m(H_{k(h^*)}) + \sqrt{\frac{\log k(h^*)}{m}} + \sqrt{\frac{2}{m}\log\frac{3}{\delta}}.$$

Proof
General bound, for all $h$:
$$\Pr\Big[\sup_{h\in H} R(h) - F_{k(h)}(h) > \epsilon\Big] = \Pr\Big[\sup_{k\ge 1}\sup_{h\in H_k} R(h) - F_k(h) > \epsilon\Big] \le \sum_{k=1}^{\infty}\Pr\Big[\sup_{h\in H_k} R(h) - F_k(h) > \epsilon\Big]$$
$$\le \sum_{k=1}^{\infty}\Pr\Big[\sup_{h\in H_k} R(h) - \widehat R_S(h) - \mathfrak{R}_m(H_k) > \epsilon + \sqrt{\tfrac{\log k}{m}}\Big] \le \sum_{k=1}^{\infty}\exp\Big(-2m\Big(\epsilon + \sqrt{\tfrac{\log k}{m}}\Big)^2\Big)$$
$$\le \sum_{k=1}^{\infty} e^{-2m\epsilon^2}\,e^{-2\log k} = e^{-2m\epsilon^2}\sum_{k=1}^{\infty}\frac{1}{k^2} = \frac{\pi^2}{6}\,e^{-2m\epsilon^2} \le 2e^{-2m\epsilon^2}.$$

Proof
Using the union bound and the bound just derived (together with $F_{k(f)}(f) \le F_{k(h^*)}(h^*)$, which holds by definition of the SRM solution $f$) gives:
$$\Pr\Big[R(f) - R(h^*) - 2\mathfrak{R}_m(H_{k(h^*)}) - \sqrt{\tfrac{\log k(h^*)}{m}} > \epsilon\Big]$$
$$\le \Pr\Big[R(f) - F_{k(f)}(f) > \tfrac{\epsilon}{2}\Big] + \Pr\Big[F_{k(f)}(f) - R(h^*) - 2\mathfrak{R}_m(H_{k(h^*)}) - \sqrt{\tfrac{\log k(h^*)}{m}} > \tfrac{\epsilon}{2}\Big]$$
$$\le 2e^{-m\epsilon^2/2} + \Pr\Big[F_{k(h^*)}(h^*) - R(h^*) - 2\mathfrak{R}_m(H_{k(h^*)}) - \sqrt{\tfrac{\log k(h^*)}{m}} > \tfrac{\epsilon}{2}\Big]$$
$$= 2e^{-m\epsilon^2/2} + \Pr\Big[\widehat R_S(h^*) - R(h^*) - \mathfrak{R}_m(H_{k(h^*)}) > \tfrac{\epsilon}{2}\Big] \le 2e^{-m\epsilon^2/2} + e^{-m\epsilon^2/2} = 3e^{-m\epsilon^2/2}.$$

Remarks
SRM bound: similar to the learning bound when $k(h^*)$ is known!
Can be extended if the approximation error is assumed to be small or zero: if $H$ contains the Bayes classifier, only finitely many hypothesis sets need to be considered in practice.
Restriction: $H$ must be decomposed as a countable union of families with converging Rademacher complexity.
Issues: (1) SRM is typically computationally intractable; (2) how should we choose the $H_k$'s?

Voted Risk Minimization
Ideas:
no selection of a specific $H_k$.
instead, use all the $H_k$'s: $h = \sum_{k=1}^{p}\alpha_k h_k$, with $h_k \in H_k$.
hypothesis-dependent penalty: $\sum_{k=1}^{p}\alpha_k\,\mathfrak{R}_m(H_k)$.
Deep ensembles.

Outline
Model selection.
Deep boosting: theory, algorithm, experiments.

Ensemble Methods in ML
Combining several base classifiers to create a more accurate one.
Bagging (Breiman 1996). AdaBoost (Freund and Schapire 1997). Stacking (Smyth and Wolpert 1999). Bayesian averaging (MacKay 1996). Other averaging schemes, e.g., (Freund et al. 2004).
Often very effective in practice.
Benefit of favorable learning guarantees.

Convex Combinations
Base classifier set $H$:
boosting stumps.
decision trees with limited depth or number of leaves.
Ensemble combinations: convex hull of the base classifier set,
$$\operatorname{conv}(H) = \Big\{\sum_{t=1}^{T}\alpha_t h_t : \alpha_t \ge 0;\ \sum_{t=1}^{T}\alpha_t \le 1;\ \forall t,\ h_t \in H\Big\}.$$
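As a small illustration of such convex combinations (my own sketch, not from the slides), the following evaluates $f = \sum_t \alpha_t h_t$ over a couple of hypothetical threshold stumps; the stump parametrization is an assumption made only for the example.

```python
import numpy as np

def stump(threshold, feature, sign=1):
    """Threshold function x -> sign * (+1 if x[feature] > threshold else -1)."""
    return lambda X: sign * np.where(X[:, feature] > threshold, 1.0, -1.0)

def convex_ensemble(alphas, hs):
    """f = sum_t alpha_t h_t with alpha_t >= 0 and sum_t alpha_t <= 1."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and alphas.sum() <= 1 + 1e-12
    return lambda X: sum(a * h(X) for a, h in zip(alphas, hs))

X = np.array([[0.2, 1.5], [0.9, -0.3], [0.4, 0.1]])
f = convex_ensemble([0.5, 0.3], [stump(0.5, 0), stump(0.0, 1, sign=-1)])
print(np.sign(f(X)))  # ensemble predictions in {-1, +1}
```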

Ensembles - Margin Bound
(Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002)
Theorem: let $H$ be a family of real-valued functions. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T}\alpha_t h_t \in \operatorname{conv}(H)$:
$$R(f) \le \widehat R_{S,\rho}(f) + \frac{2}{\rho}\,\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
where $\widehat R_{S,\rho}(f) = \frac{1}{m}\sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}$.
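The empirical margin loss appearing in the bound is straightforward to compute; a minimal sketch, assuming labels in {-1, +1} and precomputed real-valued scores f(x_i):

```python
import numpy as np

def empirical_margin_loss(scores, y, rho):
    """R_{S,rho}(f): fraction of points with margin y_i * f(x_i) <= rho."""
    margins = y * scores
    return float(np.mean(margins <= rho))

y = np.array([1, -1, 1, 1, -1])
scores = np.array([0.9, -0.2, 0.05, 0.6, 0.4])
print(empirical_margin_loss(scores, y, rho=0.1))  # 0.4: two of five margins are <= rho
```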

Questions
Can we use a much richer or deeper base classifier set?
richer families are needed for difficult tasks in speech and image processing.
but the generalization bound indicates a risk of overfitting.

AdaBoost
(Freund and Schapire, 1997)
Description: coordinate descent applied to
$$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m}\exp\Big(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\Big).$$
Guarantees: ensemble margin bound.
but AdaBoost does not maximize the margin!
some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006)
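A compact sketch of AdaBoost viewed as coordinate descent on this objective, over a finite dictionary of base classifiers whose predictions are precomputed (my own simplified illustration, not the lecture's code):

```python
import numpy as np

def adaboost(H, y, T):
    """H: (n_hyp, m) matrix with H[j, i] = h_j(x_i) in {-1, +1};
    y: labels in {-1, +1}; returns the weight vector alpha over the hypotheses."""
    n_hyp, m = H.shape
    alpha = np.zeros(n_hyp)
    D = np.full(m, 1.0 / m)                   # distribution over examples
    for _ in range(T):
        errs = 0.5 * (1.0 - H @ (D * y))      # eps_j = (1 - E_D[y h_j]) / 2
        j = int(np.argmin(errs))              # best coordinate (direction)
        eps = np.clip(errs[j], 1e-12, 1 - 1e-12)
        step = 0.5 * np.log((1 - eps) / eps)  # closed-form step for the exp loss
        alpha[j] += step                      # (a negative step amounts to using -h_j)
        D *= np.exp(-step * y * H[j])         # reweight examples
        D /= D.sum()
    return alpha

# Toy usage: 3 candidate classifiers on 4 points.
y = np.array([1, 1, -1, -1])
H = np.array([[1, 1, -1, 1], [1, -1, -1, -1], [1, 1, 1, -1]])
print(adaboost(H, y, T=10))
```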

Suspicions
Complexity of the hypotheses used: arc-gv tends to use deeper decision trees to achieve a larger margin.
Notion of margin: the minimal margin is perhaps not the appropriate notion; the margin distribution is key.
Can we shed more light on these questions?

Question
Main question: how can we design ensemble algorithms that succeed even with very deep decision trees or other complex hypothesis sets?
theory. algorithms. experimental results.

Base Classifier Set H
Decomposition in terms of sub-families $H_1, H_2, \ldots, H_p$ or their unions $H_1,\ H_1 \cup H_2,\ \ldots,\ H_1 \cup \cdots \cup H_p$.
[Figure: nested unions of the sub-families $H_1, \ldots, H_5$.]

Ensemble Family
Non-negative linear ensembles $F = \operatorname{conv}\big(\bigcup_{k=1}^{p} H_k\big)$:
$$f = \sum_{t=1}^{T}\alpha_t h_t, \quad\text{with } \alpha_t \ge 0,\ \sum_{t=1}^{T}\alpha_t \le 1,\ h_t \in H_{k_t}.$$
[Figure: the sub-families $H_1, \ldots, H_5$.]

Ideas
Use hypotheses drawn from the $H_k$'s with larger $k$, but allocate more weight to hypotheses drawn from the $H_k$'s with smaller $k$.
How can we determine quantitatively the amount of mixture weight apportioned to the different families?
Can we provide learning guarantees guiding these choices?

Learning Guarantee
(Cortes, MM, and Syed, 2014)
Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T}\alpha_t h_t \in F$:
$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(H_{k_t}) + O\Big(\sqrt{\frac{\log p}{\rho^2 m}}\Big).$$

Consequences
Complexity term with an explicit dependency on the mixture weights:
a quantitative guide for controlling the weights assigned to more complex sub-families.
the bound can be used to inspire, or directly define, an ensemble algorithm.

Set-Up
$H_1, \ldots, H_p$: disjoint sub-families of functions taking values in $[-1, +1]$.
Further assumption (not necessary): symmetric sub-families, i.e., $h \in H_k \Rightarrow -h \in H_k$.
Notation: $r_j = \mathfrak{R}_m(H_{k_j})$, with $h_j \in H_{k_j}$.

Derivation
The learning bound suggests seeking $\alpha \ge 0$, with $\sum_{t=1}^{T}\alpha_t \le 1$, to minimize
$$\frac{1}{m}\sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T}\alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho}\sum_{t=1}^{T}\alpha_t r_t.$$

Convex Surrogates
Let $\Phi$ be such that $u \mapsto \Phi(-u)$ is a decreasing convex function upper bounding $u \mapsto 1_{u \le 0}$, with $\Phi$ differentiable. Two principal choices:
Exponential loss: $\Phi(-u) = \exp(-u)$.
Logistic loss: $\Phi(-u) = \log_2(1 + \exp(-u))$.
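For reference, the two surrogates as plain functions of the margin-type argument u (a trivial sketch, not from the slides):

```python
import numpy as np

def exp_loss(u):
    """Phi(-u) = exp(-u): upper bounds the 0/1 indicator 1_{u <= 0}."""
    return np.exp(-u)

def logistic_loss(u):
    """Phi(-u) = log_2(1 + exp(-u)): also upper bounds 1_{u <= 0}."""
    return np.log2(1.0 + np.exp(-u))

u = np.array([-1.0, 0.0, 2.0])
print(exp_loss(u), logistic_loss(u))  # both equal 1 at u = 0 and decay for u > 0
```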

Optimization Problem
(Cortes, MM, and Syed, 2014)
Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to:
$$\min_{\alpha \in \mathbb{R}^N_+}\ \frac{1}{m}\sum_{i=1}^{m}\Phi\Big(1 - y_i\sum_{j=1}^{N}\alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N}(\lambda r_j + \beta)\,\alpha_j,$$
where $\lambda, \beta \ge 0$, and for each hypothesis, either $h$ or $-h$ is kept.
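A minimal sketch of evaluating this objective for the exponential surrogate, with hypothesis predictions assumed precomputed in a matrix and the $r_j$ given; this is my own illustration of the formula above, not the paper's code:

```python
import numpy as np

def deepboost_objective(alpha, H, y, r, lam, beta):
    """(1/m) sum_i exp(1 - y_i sum_j alpha_j h_j(x_i)) + sum_j (lam*r_j + beta) alpha_j.

    H: (N, m) matrix with H[j, i] = h_j(x_i) in [-1, +1]; alpha >= 0.
    r: per-hypothesis complexity estimates r_j = R_m(H_{k_j}).
    """
    margins = y * (alpha @ H)                   # y_i * sum_j alpha_j h_j(x_i)
    surrogate = np.mean(np.exp(1.0 - margins))  # Phi(1 - y_i f(x_i)) with Phi(-u) = e^{-u}
    penalty = np.sum((lam * np.asarray(r) + beta) * alpha)
    return surrogate + penalty
```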

DeepBoost Algorithm
Coordinate descent applied to the convex objective:
non-differentiable function.
definition of maximum coordinate descent.

Direction & Step
Maximum direction: definition based on the error
$$\epsilon_{t,j} = \frac{1}{2}\Big(1 - \operatorname*{E}_{i \sim D_t}[y_i h_j(x_i)]\Big),$$
where $D_t$ is the distribution over the sample at iteration $t$.
Step: closed-form expressions for the exponential and logistic losses; line search in the general case.

Pseudocode j = r j +. page 30

Connections with Previous Work
For $\lambda = \beta = 0$, DeepBoost coincides with:
AdaBoost (Freund and Schapire 1997), run with the union of the sub-families, for the exponential loss.
additive Logistic Regression (Friedman et al., 1998), run with the union of the sub-families, for the logistic loss.
For $\lambda = 0$ and $\beta \neq 0$, DeepBoost:
coincides with L1-regularized AdaBoost (Raetsch, Mika, and Warmuth 2001), for the exponential loss.
coincides with L1-regularized Logistic Regression (Duchi and Singer 2009), for the logistic loss.

Rad. Complexity Estimates
Benefit of the data-dependent analysis: empirical estimates of each $\mathfrak{R}_m(H_k)$.
example: for a kernel function $K_k$, $\widehat{\mathfrak{R}}_S(H_k) \le \frac{\sqrt{\operatorname{Tr}[K_k]}}{m}$.
alternatively, upper bounds in terms of the growth function: $\mathfrak{R}_m(H_k) \le \sqrt{\frac{2\log \Pi_{H_k}(m)}{m}}$.
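Both estimates are one-liners; a small sketch assuming the kernel matrix and the growth-function value are available (the linear kernel and random data are only for the usage example):

```python
import numpy as np

def kernel_rademacher_estimate(K):
    """Empirical Rademacher complexity bound sqrt(Tr[K]) / m for a kernel class."""
    m = K.shape[0]
    return np.sqrt(np.trace(K)) / m

def growth_function_bound(growth_value, m):
    """Bound R_m(H_k) <= sqrt(2 log(growth) / m) given a growth-function value."""
    return np.sqrt(2.0 * np.log(growth_value) / m)

# Toy usage with a linear kernel on random points.
X = np.random.RandomState(0).randn(100, 5)
K = X @ X.T
print(kernel_rademacher_estimate(K), growth_function_bound(2 * 100 * 5, 100))
```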

Experiments (1)
Family of base classifiers defined by boosting stumps:
$H_1^{\text{stumps}}$: boosting stumps (threshold functions); in dimension $d$, $\Pi_{H_1^{\text{stumps}}}(m) \le 2md$, thus $\mathfrak{R}_m(H_1^{\text{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}$.
$H_2^{\text{stumps}}$: decision trees of depth 2, with the same question at the internal nodes of depth 1; in dimension $d$, $\Pi_{H_2^{\text{stumps}}}(m) \le \frac{(2m)^2 d(d-1)}{2}$, thus $\mathfrak{R}_m(H_2^{\text{stumps}}) \le \sqrt{\frac{2\log(2m^2 d(d-1))}{m}}$.
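Plugging sample size and dimension into these two bounds (a simple sketch that just evaluates the displayed formulas; the breastcancer numbers are taken from the data-statistics slide):

```python
import numpy as np

def rad_stumps1(m, d):
    """R_m(H_1^stumps) <= sqrt(2 log(2 m d) / m)."""
    return np.sqrt(2.0 * np.log(2 * m * d) / m)

def rad_stumps2(m, d):
    """R_m(H_2^stumps) <= sqrt(2 log(2 m^2 d (d - 1)) / m)."""
    return np.sqrt(2.0 * np.log(2 * m**2 * d * (d - 1)) / m)

# E.g. the breastcancer setting: m = 699 examples, d = 9 attributes.
print(rad_stumps1(699, 9), rad_stumps2(699, 9))
```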

Experiments (1)
Base classifier set: $H_1^{\text{stumps}} \cup H_2^{\text{stumps}}$.
Data sets: same UCI Irvine data sets as (Breiman 1999) and (Reyzin and Schapire 2006); OCR data sets used by (Reyzin and Schapire 2006): ocr17, ocr49; MNIST data sets: ocr17-mnist, ocr49-mnist.
Experiments with the exponential loss.
Comparison with AdaBoost and AdaBoost-L1.

Data Statistics
Data set           Examples   Attributes
breastcancer       699        9
ionosphere         351        34
german (numeric)   1000       24
diabetes           768        8
ocr17              2000       196
ocr49              2000       196
ocr17-mnist        15170      400
ocr49-mnist        13782      400

Experiments - Stumps, Exponential Loss
(Cortes, MM, and Syed, 2014)
[Results table not reproduced in the transcription.]

Experiments (2)
Family of base classifiers defined by decision trees of depth $k$. For trees with at most $n$ nodes:
$$\mathfrak{R}_m(T_n) \le \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}.$$
Base classifier set: $\bigcup_{k=1}^{K} H_k^{\text{trees}}$.
Same data sets as in Experiments (1).
Both exponential and logistic loss.
Comparison with AdaBoost and AdaBoost-L1.
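And the corresponding bound for trees with at most n nodes, again just evaluating the displayed formula (the choice n = 7 and the breastcancer dimensions are an arbitrary example):

```python
import numpy as np

def rad_trees(n_nodes, d, m):
    """R_m(T_n) <= sqrt((4 n + 2) log_2(d + 2) log(m + 1) / m)."""
    return np.sqrt((4 * n_nodes + 2) * np.log2(d + 2) * np.log(m + 1) / m)

print(rad_trees(n_nodes=7, d=9, m=699))
```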

Experiments - Trees, Exponential Loss
(Cortes, MM, and Syed, 2014)

breastcancer        AdaBoost    AdaBoost-L1  DeepBoost
Error               0.0267      0.0264       0.0243
(std dev)           (0.00841)   (0.0098)     (0.00797)
Avg tree size       29.1        28.9         20.9
Avg no. of trees    67.1        51.7         55.9

ocr17               AdaBoost    AdaBoost-L1  DeepBoost
Error               0.004       0.003        0.002
(std dev)           (0.00316)   (0.00100)    (0.00100)
Avg tree size       15.0        30.4         26.0
Avg no. of trees    88.3        65.3         61.8

ionosphere          AdaBoost    AdaBoost-L1  DeepBoost
Error               0.0661      0.0657       0.0501
(std dev)           (0.0315)    (0.0257)     (0.0316)
Avg tree size       29.8        31.4         26.1
Avg no. of trees    75.0        69.4         50.0

ocr49               AdaBoost    AdaBoost-L1  DeepBoost
Error               0.0180      0.0175       0.0175
(std dev)           (0.00555)   (0.00357)    (0.00510)
Avg tree size       30.9        62.1         30.2
Avg no. of trees    92.4        89.0         83.0

german              AdaBoost    AdaBoost-L1  DeepBoost
Error               0.239       0.239        0.234
(std dev)           (0.0165)    (0.0201)     (0.0148)
Avg tree size       3           7            16.0
Avg no. of trees    91.3        87.5         14.1

ocr17-mnist         AdaBoost    AdaBoost-L1  DeepBoost
Error               0.00471     0.00471      0.00409
(std dev)           (0.0022)    (0.0021)     (0.0021)
Avg tree size       15          33.4         22.1
Avg no. of trees    88.7        66.8         59.2

diabetes            AdaBoost    AdaBoost-L1  DeepBoost
Error               0.249       0.240        0.230
(std dev)           (0.0272)    (0.0313)     (0.0399)
Avg tree size       3           3            5.37
Avg no. of trees    45.2        28.0         19.0

ocr49-mnist         AdaBoost    AdaBoost-L1  DeepBoost
Error               0.0198      0.0197       0.0182
(std dev)           (0.00500)   (0.00512)    (0.00551)
Avg tree size       29.9        66.3         30.1
Avg no. of trees    82.4        81.1         80.9

Experiments - Trees, Logistic Loss
(Cortes, MM, and Syed, 2014)

breastcancer        LogReg      LogReg-L1    DeepBoost
Error               0.0351      0.0264       0.0264
(std dev)           (0.0101)    (0.0120)     (0.00876)
Avg tree size       15          59.9         14.0
Avg no. of trees    65.3        16.0         23.8

ocr17               LogReg      LogReg-L1    DeepBoost
Error               0.00300     0.00400      0.00250
(std dev)           (0.00100)   (0.00141)    (0.000866)
Avg tree size       15.0        7            22.1
Avg no. of trees    75.3        53.8         25.8

ionosphere          LogReg      LogReg-L1    DeepBoost
Error               0.074       0.060        0.043
(std dev)           (0.0236)    (0.0219)     (0.0188)
Avg tree size       7           30.0         18.4
Avg no. of trees    44.7        25.3         29.5

ocr49               LogReg      LogReg-L1    DeepBoost
Error               0.0205      0.0200       0.0170
(std dev)           (0.00654)   (0.00245)    (0.00361)
Avg tree size       31.0        31.0         63.2
Avg no. of trees    63.5        54.0         37.0

german              LogReg      LogReg-L1    DeepBoost
Error               0.233       0.232        0.225
(std dev)           (0.0114)    (0.0123)     (0.0103)
Avg tree size       7           7            14.4
Avg no. of trees    72.8        66.8         67.8

ocr17-mnist         LogReg      LogReg-L1    DeepBoost
Error               0.00422     0.00417      0.00399
(std dev)           (0.00191)   (0.00188)    (0.00211)
Avg tree size       15          15           25.9
Avg no. of trees    71.4        55.6         27.6

diabetes            LogReg      LogReg-L1    DeepBoost
Error               0.250       0.246        0.246
(std dev)           (0.0374)    (0.0356)     (0.0356)
Avg tree size       3           3            3
Avg no. of trees    46.0        45.5         45.5

ocr49-mnist         LogReg      LogReg-L1    DeepBoost
Error               0.0211      0.0201       0.0201
(std dev)           (0.00412)   (0.00433)    (0.00411)
Avg tree size       28.7        33.5         72.8
Avg no. of trees    79.3        61.7         41.9

Margin Distribution
[Figure: margin distributions; not reproduced in the transcription.]

Multi-Class Learning Guarantee
(Kuznetsov, MM, and Syed, 2014)
Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T}\alpha_t h_t \in F$:
$$R(f) \le \widehat R_{S,\rho}(f) + \frac{8c}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(\Pi_1(H_{k_t})) + O\Bigg(\sqrt{\frac{\log p}{\rho^2 m}\log\Big[\frac{\rho^2 c^2 m}{4\log p}\Big]}\Bigg),$$
with $\Pi_1(H_k) = \{x \mapsto h(x, y) : y \in \mathcal{Y},\ h \in H_k\}$ and $c$ the number of classes.

Extension to Multi-Class
A similar data-dependent learning guarantee is proven for the multi-class setting: a bound depending on the mixture weights and the complexity of the sub-families.
Deep Boosting algorithm for multi-class: a similar extension taking into account the complexities of the sub-families; several variants depending on the number of classes; different possible loss functions for each variant.

Other Related Algorithms
Structural Maxent models (Cortes, Kuznetsov, MM, and Syed, ICML 2015): feature functions chosen from a union of very complex families.

Other Related Algorithms
Deep Cascades (DeSalvo, MM, and Syed, ALT 2015): a cascade of predictors, with leaf predictors and node questions selected from very rich families.
[Figure: cascade with node questions $q_k(x)$, leaf predictors $h_k(x)$, and fractions $\mu_k$ of points reaching each node or leaf.]

Conclusion
Deep Boosting: ensemble learning with increasingly complex families.
data-dependent theoretical analysis.
algorithm based on the learning bound.
extension to multi-class, ranking, and other losses.
enhancement of many existing algorithms.
compares favorably to AdaBoost and additive Logistic Regression, or their L1-regularized variants, in experiments.

References
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.
Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.
Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, 2014.
John C. Duchi and Yoram Singer. Boosting with structural sparsity. In ICML, 2009.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Multi-class deep boosting. In NIPS, 2014.
Gunnar Raetsch, Takashi Onoda, and Klaus-Robert Mueller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.
Gunnar Raetsch, Sebastian Mika, and Manfred K. Warmuth. On the convergence of leveraging. In NIPS, pp. 487-494, 2001.

Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, pp. 753-760, 2006.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pp. 322-330, 1997.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
Vladimir N. Vapnik and Alexey Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.