Littlestone s Dimension and Online Learnability

Size: px

Start display at page:

Download "Littlestone s Dimension and Online Learnability"

Janel Potter
5 years ago
Views:

1 Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David and David Pal Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 1 / 21

2 Online Learning For t = 1,..., T Environment presents input x t X Learner predicts label ŷ t {0, 1} Environment reveals true label y t {0, 1} Learner pays 1 if ŷ t y t and 0 otherwise Goal: Make few mistakes Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 2 / 21

3 Online Learning For t = 1,..., T Environment presents input x t X Learner predicts label ŷ t {0, 1} Environment reveals true label y t {0, 1} Learner pays 1 if ŷ t y t and 0 otherwise Goal: Make few mistakes Online Learnability: When can we guarantee to make few mistakes? Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 2 / 21

4 Online Learning For t = 1,..., T Environment presents input x t X Learner predicts label ŷ t {0, 1} Environment reveals true label y t {0, 1} Learner pays 1 if ŷ t y t and 0 otherwise Goal: Make few mistakes Online Learnability: When can we guarantee to make few mistakes? PAC Learnability: well understood (VC theory) Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 2 / 21

5 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

6 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Finite H Infinite H margin-based H No noise Halving Arbitrary noise Stochastic noise Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

7 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Finite H Infinite H margin-based H No noise Halving L 88 Arbitrary noise Stochastic noise Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

8 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Finite H Infinite H margin-based H No noise Halving L 88 Arbitrary noise LW 94 Stochastic noise Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

9 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Finite H Infinite H margin-based H No noise Halving L 88 Arbitrary noise LW 94 Stochastic noise Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

10 Outline Online Learnability: Can we be almost as good as the best predictor in a reference class H? Finite H Infinite H margin-based H No noise Halving L 88 Arbitrary noise LW 94 Stochastic noise Upper and (almost) matching lower bounds Seamlessly deriving new algorithms/bounds Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 3 / 21

11 Realizable Case (no noise) Realizable Assumption: Environment answers y t = h(x t ), where h H and the hypothesis class, H, is known to the learner Theorem (Littlestone 88) A combinatorial dimension, Ldim(H), characterizes online learnability: Any algorithm might make at least Ldim(H) mistakes Exists algorithm that makes at most Ldim(H) mistakes But, only in the realizable case... Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 4 / 21

12 Littlestone s dimension Motivation h 1 h 2 h 8 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 5 / 21

13 Littlestone s dimension Motivation h 1 h h h 1 h 2 h 5 h 6 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 6 / 21

14 Littlestone s dimension Definition Ldim(H) is the maximal depth of a full binary tree such that each path is explained by some h H Lemma Any learner can be forced to make at least Ldim(H) mistakes Proof. Adversarial environment will walk on the tree, while on each round setting y t = ŷ t. Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 7 / 21

15 Standard Optimal Algorithm (SOA) initialize: V 1 = H for t = 1, 2,... receive x t for r {0, 1} let V (r) t = {h V t : h(x t ) = r} predict ŷ t = arg max r Ldim(V (r) t ) receive true answer y t update V t+1 = V (yt) t Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 8 / 21

16 Standard Optimal Algorithm (SOA) initialize: V 1 = H for t = 1, 2,... receive x t for r {0, 1} let V (r) t = {h V t : h(x t ) = r} predict ŷ t = arg max r Ldim(V (r) t ) receive true answer y t update V t+1 = V (yt) t Theorem SOA makes at most Ldim(H) mistakes. Proof. Whenever SOA errs we have Ldim(V t+1 ) Ldim(V t ) 1. Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 8 / 21

17 Intermediate Summary Littlestone s dimension characterizes online learnability Example: H = { all 100 characters long C++ functions } Ldim(H) 500 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 9 / 21

18 Intermediate Summary Littlestone s dimension characterizes online learnability Example: H = { all 100 characters long C++ functions } Ldim(H) 500 Received relatively little attention by researchers Maybe due to: Non-realistic realizable assumption Lack of interesting examples Lack of margin-based theory Coming Next Generalizing to: Agnostic case (noise is allowed) Fat dimension and margin-based bounds Linear separators new algorithms/bounds Shai Shalev-Shwartz (TTI-C) Online Learnability Feb 09 9 / 21

19 Agnostic Online Learning and Regret Analysis Make no assumptions on origin of labels Analyze regret of not following best predictor in H: T t=1 ŷ t y t min h H When can we guarantee low regret? T h(x t ) y t t=1 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

20 Cover s impossibility result H = {h(x) = 1, h(x) = 0} Ldim(H) = 1 Environment will output y t = ŷ t Learner makes T mistakes Best in H makes at most T/2 mistakes Regret is at least T/2 Corollary: Online learning in the non-realizable case is impossible?!? Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

21 Randomized Prediction and Expected Regret Let s weaken the environment it should decide on y t before seeing ŷ t For deterministic learner, environment can simulate learner so there s no difference For learner that randomizes his predictions big difference We analyze expected regret T t=1 E[ ŷ t y t ] min h H T h(x t ) y t t=1 This enables to sidestep Cover s impossibility result Online learning in the non-realizable case becomes possible! Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

22 Weighed Majority WM for learning with d experts initialize: assign weight w i = 1 for each expert for t = 1, 2,..., T each expert predicts f i {0, 1} environment determines y t without revealing it to the learner predict ŷ t = 1 w.p. i:f i =1 w i receive label y t foreach wrong expert: w i η w i Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

23 Weighed Majority WM for learning with d experts initialize: assign weight w i = 1 for each expert for t = 1, 2,..., T each expert predicts f i {0, 1} environment determines y t without revealing it to the learner predict ŷ t = 1 w.p. i:f i =1 w i receive label y t foreach wrong expert: w i η w i Theorem WM achieves expected regret of at most: ln(d) T Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

24 WM and Online Learnability WM regret bound a finite H is learnable with regret ln( H ) T Is this the best we can do? And, what if H is infinite? Solution: Combing WM with SOA Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

25 WM and Online Learnability WM regret bound a finite H is learnable with regret ln( H ) T Is this the best we can do? And, what if H is infinite? Solution: Combing WM with SOA Theorem Exists learner with expected regret Ldim(H) T log(t ) No learner can have expected regret smaller than Ldim(H) T Therefore: H is agnostic online learnable Ldim(H) < Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

26 Proof idea Expert(i 1,..., i L ) initialize: V 1 = H for t = 1, 2,... receive x t for r {0, 1} let V (r) t = {h V t : h(x t ) = r} define ŷ t = arg max r Ldim(V (r) t ) if t {i 1,..., i L } flip prediction: ŷ t ŷ t update V t+1 = V (ŷt) t Lemma If Ldim(H) <, then for any h H exists i 1,..., i L, L < Ldim(H), s.t. Expert(i 1,..., i L ) agrees with h on the entire sequence. Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

27 Agnostic Online Learning Bounded Stochastic Noise Previous theorem holds for any noise For stochastic noise better results Assume: y t = h(x t ) + 2 ν t, where P[ν t = 1] γ < 1 2 Then, there exists learner with: [ T ] 1 E ŷ t h(x t ) Ldim(H) ln(t ) 1 2 γ(1 γ) t=1 Learner is better than teacher: Learner makes O(ln(T )) mistakes while teacher makes γ T mistakes Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

28 Fat Littlestone s dimension Consider hypotheses of the form h : X R, where actual prediction is sign(h(x)) Fat Littlestone s dimension: Maximal depth of tree such that each path is explained by some h H with margin γ Importance: Can apply analysis tools for bounding a combinatorial object Theorem Let M be expected #mistakes of online learner Let M γ (H) be #margin-mistakes of optimal h H M M γ (H) + Ldim γ (H) ln(t ) T Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

29 Fat Littlestone s dimension of linear separators Linear predictors: H = {x w, x : w 1} Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

30 Fat Littlestone s dimension of linear separators Linear predictors: H = {x w, x : w 1} Lemma If X is the unit ball of a σ-regular Banach space (B, ), then Ldim γ (H) σ γ 2 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

31 Fat Littlestone s dimension of linear separators Linear predictors: H = {x w, x : w 1} Lemma If X is the unit ball of a σ-regular Banach space (B, ), then Ldim γ (H) σ γ 2 Examples: X H Ldim γ (H) {x : x 2 1} {x w, x : w 2 1} 1 γ 2 {x : x 1} {x w, x : w 1 1} log(n) γ 2 Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

32 (Surprising) Corollary: Regret with non-convex loss M M γ (H) + 1 γ ln(t ) T l(a, y) Freund and Schapire 99 Quadratic loss Gentile 02 hinge loss No result with non-convex loss a Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

33 Summary Online Learning PAC Learning Dimension Ldim(H) VCdim(H) Realizable case: dim T Agnostic case: dim T Low noise: dim T Margin: Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

34 Some Open Problems Ldim and fat-ldim calculus Bridging the log(t ) gap between lower and upper bounds Other noise conditions (Tsybakov, Steinwart) Multiclass prediction with bandit feedback: Efficient algorithms? Lower bounds? Low Ldim Compression scheme Low VCdim Low Ldim? Compression scheme? Low VCdim Shai Shalev-Shwartz (TTI-C) Online Learnability Feb / 21

Agnostic Online learnability

Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes