Computational and Statistical Learning Theory


Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes)

Boosting the Error. If we have an efficient learning algorithm that, for any distribution realizable by $\mathcal{H}_n$, returns a predictor guaranteed to be slightly better than chance, i.e. with error $\varepsilon_0 = \frac{1}{2} - \gamma < \frac{1}{2}$ (for some $\gamma > 0$), can we construct a new efficient algorithm that, for any distribution realizable by $\mathcal{H}_n$, achieves arbitrarily low error $\varepsilon$? Posed (as a theoretical question) by Valiant and Kearns, Harvard 1988. Solved by MIT student Robert Schapire, 1990. AdaBoost algorithm by Schapire and Yoav Freund, AT&T 1995. (Pictured: Leslie Valiant, Michael Kearns, Rob Schapire, Yoav Freund)

AdaBoost. Input: training set $S = \{(x_1,y_1),(x_2,y_2),\dots,(x_m,y_m)\}$ and a weak learner $A$, which will be applied to distributions $D$ over $S$. If thinking of $A(S')$ as accepting a sample $S'$: each $(x,y) \in S'$ is set to $(x_i,y_i)$ w.p. $D_i$ (independently and with replacement). Can often think of $A$ as operating on a weighted sample, with weights $D_i$. Output: hypothesis $h$.

Initialize $D^1 = (\tfrac{1}{m}, \tfrac{1}{m}, \dots, \tfrac{1}{m})$.
For $t = 1, \dots, T$:
  $h_t = A(D^t)$
  $\varepsilon_t = L_{D^t}(h_t) = \sum_i D^t_i\,\mathbf{1}[h_t(x_i) \ne y_i]$
  $\alpha_t = \tfrac{1}{2}\log\!\big(\tfrac{1}{\varepsilon_t} - 1\big)$
  $D^{t+1}_i = \dfrac{D^t_i \exp(-\alpha_t y_i h_t(x_i))}{\sum_j D^t_j \exp(-\alpha_t y_j h_t(x_j))}$
Output: $h_T(x) = \mathrm{sign}\!\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$
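As a concrete reference, here is a minimal NumPy sketch of the loop above. It assumes the weak learner is given as a function weak_learner(X, y, D) that accepts the weight vector D and returns a callable hypothesis with values in {-1, +1}; these names are illustrative, not from the slides.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch; labels y take values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D^1 = (1/m, ..., 1/m)
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # h_t = A(D^t)
        pred = h(X)
        eps = float(np.sum(D * (pred != y)))     # eps_t = weighted 0/1 error under D^t
        alpha = 0.5 * np.log(1.0 / eps - 1.0)    # alpha_t = (1/2) log(1/eps_t - 1)
        D = D * np.exp(-alpha * y * pred)        # reweight: up on mistakes, down on correct points
        D = D / D.sum()                          # normalize to get D^{t+1}
        hypotheses.append(h)
        alphas.append(alpha)
    def h_T(X_new):                              # h_T(x) = sign(sum_t alpha_t h_t(x))
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))
    return h_T
```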

AdaBoost: Weight Update. (Algorithm as above; the update step is the focus here.) The update increases the weight on errors and decreases it on correct points:
$D^{t+1}_i \propto D^t_i \sqrt{\tfrac{1-\varepsilon_t}{\varepsilon_t}}$ if $h_t(x_i) \ne y_i$, and $D^{t+1}_i \propto D^t_i \sqrt{\tfrac{\varepsilon_t}{1-\varepsilon_t}}$ if $h_t(x_i) = y_i$.

AdaBoost: Weight Update.
$D^{t+1}_i = \dfrac{D^t_i \exp(-\alpha_t y_i h_t(x_i))}{Z_t} = \dfrac{1}{Z_t}\begin{cases} D^t_i \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}} & \text{if } h_t(x_i) \ne y_i \\ D^t_i \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} & \text{if } h_t(x_i) = y_i \end{cases} \;=\; \begin{cases} \frac{D^t_i}{2\varepsilon_t} & \text{if } h_t(x_i) \ne y_i \\ \frac{D^t_i}{2(1-\varepsilon_t)} & \text{if } h_t(x_i) = y_i \end{cases}$

$Z_t = \sum_{i:\,h_t(x_i)\ne y_i} D^t_i \sqrt{\tfrac{1-\varepsilon_t}{\varepsilon_t}} + \sum_{i:\,h_t(x_i)=y_i} D^t_i \sqrt{\tfrac{\varepsilon_t}{1-\varepsilon_t}} = \varepsilon_t\sqrt{\tfrac{1-\varepsilon_t}{\varepsilon_t}} + (1-\varepsilon_t)\sqrt{\tfrac{\varepsilon_t}{1-\varepsilon_t}} = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$

$L_{D^{t+1}}(h_t) = \sum_{i:\,h_t(x_i)\ne y_i} D^{t+1}_i = \sum_{i:\,h_t(x_i)\ne y_i} \dfrac{D^t_i}{2\varepsilon_t} = \dfrac{\varepsilon_t}{2\varepsilon_t} = \dfrac{1}{2}$
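To make the update concrete, here is a worked instance with illustrative numbers (not from the slides), taking $\varepsilon_t = 1/4$:

```latex
% Worked example with eps_t = 1/4 (illustrative numbers):
\alpha_t = \tfrac{1}{2}\log\!\big(\tfrac{1}{1/4}-1\big) = \tfrac{1}{2}\log 3,
\qquad
Z_t = 2\sqrt{\tfrac{1}{4}\cdot\tfrac{3}{4}} = \tfrac{\sqrt{3}}{2}.
% Misclassified points are scaled up, correct points scaled down:
D_i^{t+1} = \frac{D_i^t\,\sqrt{3}}{\sqrt{3}/2} = 2\,D_i^t \ \ \text{(errors)},
\qquad
D_i^{t+1} = \frac{D_i^t/\sqrt{3}}{\sqrt{3}/2} = \tfrac{2}{3}\,D_i^t \ \ \text{(correct)} .
% Total weight on the mistakes of h_t under the new distribution:
L_{D^{t+1}}(h_t) = 2\cdot\varepsilon_t = \tfrac{1}{2}.
```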

AdaBoost: Weight Update (recap). With the algorithm as above, each round multiplies the weight on errors by $\sqrt{(1-\varepsilon_t)/\varepsilon_t}$ and the weight on correct points by $\sqrt{\varepsilon_t/(1-\varepsilon_t)}$, so that under the new distribution the current hypothesis is exactly uninformative: $L_{D^{t+1}}(h_t) = \tfrac{1}{2}$.

AdaBoost as Learning a Linear Classifier. Recall: $h_T(x) = \mathrm{sign}\!\big(\sum_{t=1}^T \alpha_t h_t(x)\big)$. Let $B$ = all hypotheses output by $A$ (the "base class", e.g. decision stumps), let $\phi(x) = (h(x))_{h \in B}$, and define $w[h] = \sum_{t:\,h_t = h} \alpha_t$. Then $h_T \in \{\, x \mapsto \mathrm{sign}\langle w, \phi(x)\rangle \;|\; w \in \mathbb{R}^B \,\}$, the class of halfspaces $L_B$.

$L_S^{\exp}(w) = \tfrac{1}{m}\sum_i \ell^{\exp}(h_w(x_i); y_i)$, where $h_w(x) = \langle w, \phi(x)\rangle$ and $\ell^{\exp}(h_w(x), y) = e^{-y\langle w, \phi(x)\rangle}$.

Each step of AdaBoost is coordinate descent on $L_S^{\exp}(w)$: choose a coordinate $h$ of $\phi(x)$ (i.e. a base hypothesis, $\phi(x)[h] = h(x)$) such that $\tfrac{\partial L_S^{\exp}(w)}{\partial w[h]}$ is very negative, then update $w[h] = \arg\min L_S^{\exp}(w)$ while keeping $w[h']$ unchanged for all $h' \ne h$.

Coordinate Descent on $L_S^{\exp}(w)$.
$\dfrac{\partial L_S^{\exp}(w)}{\partial w[h]} = \dfrac{\partial}{\partial w[h]}\,\tfrac{1}{m}\sum_i e^{-y_i h_w(x_i)} = -\tfrac{1}{m}\sum_i e^{-y_i h_w(x_i)}\, y_i \dfrac{\partial h_w(x_i)}{\partial w[h]} = -\tfrac{1}{m}\sum_i e^{-y_i h_w(x_i)}\, y_i h(x_i)$
$= -\tfrac{1}{m}\sum_i e^{-y_i \sum_{t=1}^{T-1}\alpha_t h_t(x_i)}\, y_i h(x_i) \;\propto\; 2 L_{D^T}(h) - 1$, where $D^T_i \propto \prod_{t=1}^{T-1} e^{-y_i \alpha_t h_t(x_i)}$.

So minimizing $L_{D^T}(h)$ over the base class is the same as making $\tfrac{\partial L_S^{\exp}(w)}{\partial w[h]}$ as negative as possible.

Updating $w[h]$: set $w^t[h_t] = w^{t-1}[h_t] + \alpha$ with $\alpha = \arg\min_\alpha L_S^{\exp}(w^t)$; the optimality condition $0 = \tfrac{\partial}{\partial\alpha} L_S^{\exp}(w^t) = \tfrac{\partial L_S^{\exp}(w^t)}{\partial w[h_t]} \propto 2 L_{D^{t+1}}(h_t) - 1$ means choosing $\alpha$ such that $L_{D^{t+1}}(h_t) = \tfrac{1}{2}$.
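Solving that optimality condition explicitly (a step the slide leaves implicit) recovers exactly the AdaBoost step size:

```latex
% Exact line search along coordinate h_t, writing c_i = e^{-y_i \sum_{s<t} \alpha_s h_s(x_i)}:
\frac{\partial}{\partial\alpha}\,\frac{1}{m}\sum_i c_i\, e^{-\alpha y_i h_t(x_i)}
  = \frac{1}{m}\Big( e^{\alpha}\!\!\sum_{i:\,h_t(x_i)\ne y_i}\!\! c_i
      \;-\; e^{-\alpha}\!\!\sum_{i:\,h_t(x_i)= y_i}\!\! c_i \Big) = 0
\;\Longrightarrow\;
e^{2\alpha} = \frac{\sum_{i:\,h_t(x_i)=y_i} c_i}{\sum_{i:\,h_t(x_i)\ne y_i} c_i}
            = \frac{1-\varepsilon_t}{\varepsilon_t}
\;\Longrightarrow\;
\alpha = \tfrac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t} = \alpha_t .
% The last ratio equals (1-eps_t)/eps_t because D^t_i is proportional to c_i.
```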

AdaBoost: Minimizing $L_S(h)$. Theorem: if for all $t$, $\varepsilon_t \le \tfrac{1}{2} - \gamma$, then $L_S^{01}(h_T) \le L_S^{\exp}(h_T) \le e^{-2\gamma^2 T}$.

Proof: $D^{T+1}_i = \tfrac{1}{m}\prod_{t=1}^{T} e^{-y_i \alpha_t h_t(x_i)} \big/ \prod_{t=1}^T Z_t$, so
$L_S^{\exp}(h_T) = \tfrac{1}{m}\sum_i e^{-y_i \sum_{t=1}^T \alpha_t h_t(x_i)} = \tfrac{1}{m}\sum_i m\, D^{T+1}_i \prod_{t=1}^T Z_t = \prod_{t=1}^T Z_t$ (using $\sum_i D^{T+1}_i = 1$), and
$\prod_{t=1}^T Z_t = \prod_{t=1}^T 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \le \prod_{t=1}^T \sqrt{(1-2\gamma)(1+2\gamma)} = (1-4\gamma^2)^{T/2} \le e^{-2\gamma^2 T}$.

If $A(\cdot)$ is a weak learner with parameters $\delta_0$, $\varepsilon_0 = \tfrac{1}{2}-\gamma$, and if $L_D(h^\star) = 0$ for some $h^\star \in H$: then $L_S(h^\star) = 0$, hence $L_{D^t}(h^\star) = 0$ for every reweighting $D^t$, so w.p. $\ge 1-\delta_0$ each call returns $\varepsilon_t = L_{D^t}(h_t) \le \tfrac{1}{2}-\gamma$, and w.p. $\ge 1-\delta_0 T$, $L_S(h_T) \le e^{-2\gamma^2 T}$.

To get any $\varepsilon > 0$, run AdaBoost for $T = \frac{\log(1/\varepsilon)}{2\gamma^2}$ rounds. Setting $\varepsilon = \tfrac{1}{2m}$, after $T = \frac{\log(2m)}{2\gamma^2}$ rounds: $L_S(h_T) = 0$ (since the empirical error is a multiple of $1/m$)! What about $L_D(h_T)$?
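As a quick sanity check with illustrative numbers (not from the slides), take edge $\gamma = 0.1$ and $m = 1000$ training points:

```latex
T = \frac{\log(2m)}{2\gamma^2} = \frac{\ln 2000}{2\cdot 0.01} \approx \frac{7.6}{0.02} \approx 380
\quad\text{rounds give}\quad
L_S^{01}(h_T) \le e^{-2\gamma^2 T} = \frac{1}{2m} < \frac{1}{m},
\quad\text{hence } L_S^{01}(h_T) = 0 .
```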

Sparse Linear Classifiers. Recall: $h_T(x) = \mathrm{sign}\!\big(\sum_{t=1}^T \alpha_t h_t(x)\big)$. Let $B$ = all hypotheses output by $A$ (the base class, e.g. decision stumps). Then $h_T \in L_{B,T} = \{\, x \mapsto \mathrm{sign}\langle w, \phi(x)\rangle \;|\; w \in \mathbb{R}^B,\ \|w\|_0 \le T \,\}$, the class of $T$-sparse halfspaces over $B$.

You already saw: $\mathrm{VCdim}(L_{B,T}) \le O(T \log|B|)$. Even if $B$ is infinite (e.g. in the case of decision stumps): $\mathrm{VCdim}(L_{B,T}) \le \tilde{O}(T \cdot \mathrm{VCdim}(B))$. With $T = \frac{\log(2m)}{2\gamma^2}$, the sample complexity is $m = O\!\big(\frac{\log m}{\gamma^2}\cdot\frac{\mathrm{VCdim}(B)}{\varepsilon}\big) = \tilde{O}\!\big(\frac{\mathrm{VCdim}(B)}{\gamma^2\varepsilon}\big)$. But what if the weak learner is improper and $\mathrm{VCdim}(B) = \infty$?

Compression Bounds. Focus on the realizable case, and on learning rules such that $L_S(A(S)) = 0$. Suppose $A(S)$ depends only on the first $r < m$ examples, $A(x_1,y_1,\dots,x_m,y_m) = \tilde{A}(x_1,y_1,\dots,x_r,y_r)$: since $L_{S[r+1:m]}(\tilde{A}(S[1:r])) = 0$, we get that w.p. $\ge 1-\delta$ over $S \sim D^m$,
$L_D(A(S)) \le \dfrac{\log(1/\delta)}{m-r}$.
In fact, the same holds for any predetermined indices $i_1,\dots,i_r$, if $A(S)$ depends only on $(x_{i_1},y_{i_1}),\dots,(x_{i_r},y_{i_r})$.

Now consider $A(S) = \tilde{A}(S[I(S)])$ with $I: (X\times Y)^m \to \{1..m\}^r$. That is, $A(S)$ can be represented using $r$ training points, but we need to choose which ones. Taking a union bound over the $m^r$ choices of indices:
$L_D(A(S)) \le \dfrac{r\log m + \log(1/\delta)}{m-r}$.
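Filling in the standard argument behind the first bound (the slide states it without proof): a fixed hypothesis with true error above $\epsilon$ makes zero mistakes on $m-r$ fresh examples only with exponentially small probability,

```latex
\Pr_{S\sim D^m}\!\Big[\, L_D\big(\tilde A(S[1\!:\!r])\big) > \epsilon
      \ \text{and}\ L_{S[r+1:m]}\big(\tilde A(S[1\!:\!r])\big) = 0 \,\Big]
  \;\le\; (1-\epsilon)^{\,m-r} \;\le\; e^{-\epsilon(m-r)} .
% Setting the right-hand side to delta and solving for epsilon gives
e^{-\epsilon(m-r)} = \delta
  \;\Longrightarrow\;
\epsilon = \frac{\log(1/\delta)}{m-r} ,
% and the union bound over the m^r possible index sets adds r*log(m) to the numerator.
```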

Compression Schemes. $A$ is $r$-compressing if $A(S) = \tilde{A}(S[I(S)])$ for some $I: (X\times Y)^m \to \{1..m\}^r$.

Axis-aligned rectangles: $I(S)$ = {leftmost positive, rightmost positive, topmost positive, bottommost positive}, so $r = 4$.
Halfspaces in $\mathbb{R}^d$: a bit trickier, but can be done with $r = d+1$ (for non-homogeneous halfspaces).

If $A$ is $r$-compressing and $L_S(A(S)) = 0$, then for $m > 2r$, w.p. $\ge 1-\delta$ over $S \sim D^m$:
$L_D(A(S)) \le 2\,\dfrac{r\log m + \log(1/\delta)}{m}$.

By the VC lower bound: if $\mathrm{FINDCONS}_H$ is $r$-compressing, then $\mathrm{VCdim}(H) \le O(r)$; in fact, $\mathrm{VCdim}(H) \le r$. Conjecture: every $H$ has a $\mathrm{VCdim}(H)$-compressing $\mathrm{FINDCONS}_H$. (Manfred Warmuth)
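A minimal sketch of the rectangle compression scheme mentioned above, assuming 2-D points in a NumPy array X, labels y in {-1, +1} with at least one positive example, and data realizable by some axis-aligned rectangle; the function names are illustrative.

```python
import numpy as np

def compress_rectangle(X, y):
    """I(S): indices of the leftmost, rightmost, bottom-most and top-most positive points (r = 4)."""
    pos = np.where(y == 1)[0]
    return [pos[np.argmin(X[pos, 0])], pos[np.argmax(X[pos, 0])],
            pos[np.argmin(X[pos, 1])], pos[np.argmax(X[pos, 1])]]

def decompress_rectangle(X_kept):
    """A~: the tightest axis-aligned rectangle containing the (at most 4) kept points."""
    lo, hi = X_kept.min(axis=0), X_kept.max(axis=0)
    return lambda Z: np.where(np.all((Z >= lo) & (Z <= hi), axis=1), 1, -1)

# Usage: idx = compress_rectangle(X, y); h = decompress_rectangle(X[idx])
# If the data are realizable by an axis-aligned rectangle, h is consistent with all of S.
```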

Back to Boosting. $A(S)$ is an $(\varepsilon_0 = \tfrac{1}{2}-\gamma,\ \delta_0)$ weak learner that uses $m_0$ samples. Boost the confidence to get a $(\tfrac{1}{2}-\tfrac{\gamma}{2},\ \delta)$ learner that uses
$m_1(\delta) = O\!\Big( m_0\,\frac{\log(1/\delta)}{\log(1/\delta_0)} + \frac{\log(1/\delta)\,\log\log(1/\delta_0)}{\gamma^2} \Big)$ samples.

Run AdaBoost on $m$ samples for $T = \frac{\log(2m)}{2\gamma^2}$ iterations, using $m' = m_1\!\big(\tfrac{\delta}{2T}\big)$ samples for the weak learner at each iteration, to get $L_S(h_T) = 0$, where $h_T = \sum_{t=1}^T \alpha_t h_t$ and $h_t = A(\text{subsample of size } m')$.

$(h_1,\dots,h_T)$ has a compression scheme with $r = T\,m'$ points. But what about the $\alpha_t$???

Partial Compression. Instead of $r$ training points specifying $A(S)$ exactly, suppose they only specify a limited set of hypotheses in which $A(S)$ lies:
$I: (X\times Y)^m \to \{1..m\}^r$, and $F: (X\times Y)^r \to$ hypothesis classes, each with $\mathrm{VCdim}(F(\cdot)) \le D$, such that $A(S) \in F(S[I(S)])$.

Theorem: if $A(S)$ has a compression scheme as above and $L_S(A(S)) = 0$, then for $m \ge 2r + D$, w.p. $\ge 1-\delta$ over $S \sim D^m$:
$L_D(A(S)) \le O\!\Big( \dfrac{D + r\log m + \log(2/\delta)}{m} \Big)$.

Proof outline: take a union bound, over the choices of indices $I(S)$, of the VC-based uniform convergence bounds, each time using just the points outside $I(S)$.

Back to Boosting. $A(S)$ is an $(\varepsilon_0 = \tfrac{1}{2}-\gamma,\ \delta_0)$ weak learner that uses $m_0$ samples. Boost the confidence to get a $(\tfrac{1}{2}-\tfrac{\gamma}{2},\ \delta)$ learner that uses $m_1(\delta) = O\!\big( m_0\,\frac{\log(1/\delta)}{\log(1/\delta_0)} + \frac{\log(1/\delta)\,\log\log(1/\delta_0)}{\gamma^2} \big)$ samples. Run AdaBoost on $m$ samples for $T = \frac{\log(2m)}{2\gamma^2}$ iterations, using $m' = m_1\!\big(\tfrac{\delta}{2T}\big)$ samples for the weak learner at each iteration, to get $L_S(h_T) = 0$ with $h_T = \sum_{t=1}^T \alpha_t h_t \in L(h_1,\dots,h_T) = F(I(S))$.

Conclusion (for fixed $\varepsilon_0, \delta_0$):
$L_D(h_T) \le O\!\Big( \dfrac{T + T m' \log m + \log(1/\delta)}{m} \Big) = O\!\Big( \dfrac{m_0 \log^2 m\,\log(1/\delta)}{m} \Big)$,
i.e. sample complexity $m(\varepsilon,\delta) = O\!\Big( \dfrac{m_0 \log^2(1/\varepsilon)\,\log(1/\delta)}{\varepsilon} \Big)$, where the $O(\cdot)$ hides factors depending on $\tfrac{1}{\gamma^2}$ and $\log(1/\delta_0)$.

AdaBoost in Practice. Complexity control is in terms of sparsity (the number of iterations $T$). Realizable case (MDL): use the first $T$ such that $L_S(h_T) = 0$. More realistically (SRM): use validation / cross-validation to select $T$. [Figure: training and test error as a function of the number of rounds, with curves marked T = 5, 100, 1000.] Even after $L_S(h_T) = 0$, AdaBoost keeps improving the $\ell_1$ margin.
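A small sketch of the SRM-style selection of $T$ by validation error, assuming alphas and hyps are the per-round coefficients and hypotheses (for example, the lists accumulated inside the earlier AdaBoost sketch); the names are illustrative.

```python
import numpy as np

def select_T(alphas, hyps, X_val, y_val):
    """Pick the number of rounds T with the lowest validation 0/1 error,
    scoring the prefix ensembles h_1..h_T incrementally."""
    score = np.zeros(len(y_val))
    best_T, best_err = 0, np.inf
    for T, (a, h) in enumerate(zip(alphas, hyps), start=1):
        score += a * h(X_val)                   # running sum_t alpha_t h_t(x)
        err = np.mean(np.sign(score) != y_val)  # validation 0/1 error of h_T
        if err < best_err:
            best_T, best_err = T, err
    return best_T
```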

Interpretations of AdaBoost.
- Boosting: turning weak learning into arbitrarily small error. The theory is for the realizable case; it shows that efficient weak and strong learning are equivalent.
- An ensemble method for combining many simpler predictors, e.g. combining decision stumps or decision trees. Other ensemble methods: bagging, averaging, gating networks.
- A method for learning with sparse linear predictors over a large (infinite?) dimensional feature space: sparsity controls complexity, and the number of iterations controls sparsity.
- Coordinate-wise optimization of $L_S^{\exp}(w)$. We'll get back to this when we talk about real-valued loss.
- Learning (in high dimensions) with large $\ell_1$ margin: learning guarantees in terms of the $\ell_1$ margin. We'll get back to this when we talk about the $\ell_1$ margin.

Just one more thing

Back to Hardness of Agnostic Learning. $H = \{\, x \mapsto \mathbf{1}[\langle w, x\rangle > 0] \;|\; w \in \mathbb{R}^n \,\}$ and $H_n^k = \{\, h_1 \wedge h_2 \wedge \dots \wedge h_k \;|\; h_i \in H \,\}$.

Lemma: if there exists $h \in H_n^k$ with $L_D(h) = 0$, then there exists $h \in H$ with $L_D(h) < \tfrac{1}{2} - \tfrac{1}{2k^2}$.

Want to show: halfspaces are hard to learn agnostically. $H_n^k$ is hard to learn even in the realizable case. But if $H$ were efficiently agnostically learnable, the lemma would give an efficient weak learner for $H_n^k$ with $\gamma = \tfrac{1}{2k^2}$, and hence (by boosting) $H_n^k$ would be efficiently learnable in the realizable case, e.g. for $k(n) = n$. Conclusion: subject to lattice-based crypto hardness, halfspaces are not efficiently agnostically learnable (even improperly).