Learning with Large Number of Experts: Component Hedge Algorithm


1 Learning with Large Number of Experts: Component Hedge Algorithm
Giulia DeSalvo and Vitaly Kuznetsov, Courant Institute, March 24th

2 Learning with Large Number of Experts
- Regret of RWM is $O(\sqrt{T \ln N})$, which is informative even for a very large number of experts $N$.
- What if there is overlap between experts?
  - RWM with path experts
  - FPL with path experts
  - Can we do better?
[Littlestone and Warmuth, 1989; Kalai and Vempala, 2004]

3 Better Bounds in Structured Case?
- Can overlap between experts lead to better regret guarantees?
- What are the lower bounds in the structured setting?
- Are there computationally efficient solutions that realize these bounds?

4 Outline
- Learning Scenario
- Component Hedge Algorithm
- Regret Bounds
- Applications & Lower Bounds
- Conclusion & Open Problems

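5 Learning Scenario
Assumptions:
- Structured concept class $\mathcal{C} \subseteq \{0, 1\}^d$, composed of $d$ components: $C^t \in \mathcal{C}$ indicates which components are used at trial $t$.
- Additive loss $\ell^t \in [0, 1]^d$ incurred at each trial $t$.
- Loss of each concept $C$ is $C \cdot \ell^t \le M := \max_{C \in \mathcal{C}} |C|$, where $|C|$ is the number of components used by $C$.
Goal:
- Minimize the expected regret after $T$ trials: $R_T = \sum_{t=1}^{T} \mathbb{E}[C^t] \cdot \ell^t - \min_{C \in \mathcal{C}} \sum_{t=1}^{T} C \cdot \ell^t$.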

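6 Component Hedge Algorithm [Koolen, Warmuth, and Kivinen, 2010]
CH maintains weights $w^t \in \mathrm{conv}(\mathcal{C}) \subseteq [0, 1]^d$ over the components at each round $t$.
Update:
1. Weights: $\hat{w}^t_i = w^{t-1}_i e^{-\eta \ell^t_i}$
2. Relative entropy projection: $w^t := \operatorname{argmin}_{w \in \mathrm{conv}(\mathcal{C})} \Delta(w \,\|\, \hat{w}^t)$, where $\Delta(w \,\|\, v) = \sum_{i=1}^{d} \left( w_i \ln \frac{w_i}{v_i} + v_i - w_i \right)$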
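
The two update steps can be sketched for the simplest possible case. This is a minimal illustration, not the general algorithm: it assumes the unstructured expert setting ($M = 1$), where $\mathrm{conv}(\mathcal{C})$ is the probability simplex and the relative entropy projection reduces to rescaling the weights so they sum to one. The function names `relative_entropy` and `ch_update_simplex` are hypothetical.

```python
import math

def relative_entropy(w, v):
    """Unnormalized relative entropy: sum_i (w_i ln(w_i / v_i) + v_i - w_i)."""
    total = 0.0
    for wi, vi in zip(w, v):
        if wi > 0.0:  # convention: 0 * ln 0 = 0
            total += wi * math.log(wi / vi)
        total += vi - wi
    return total

def ch_update_simplex(w_prev, loss, eta):
    """One CH round, ASSUMING conv(C) is the probability simplex (M = 1)."""
    # Step 1: multiplicative update on every component.
    w_hat = [wi * math.exp(-eta * li) for wi, li in zip(w_prev, loss)]
    # Step 2: on the simplex, the relative entropy projection is a rescaling.
    z = sum(w_hat)
    return [wi / z for wi in w_hat]

w_new = ch_update_simplex([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.5, 0.0], eta=0.5)
```

For structured classes the second step is a genuine convex optimization problem over $\mathrm{conv}(\mathcal{C})$ and does not reduce to a rescaling.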

7 Component Hedge Algorithm
Prediction:
1. Decomposition of weights: $w^t = \sum_{C \in \mathcal{C}} \alpha_C C$, where $\alpha$ is a distribution over $\mathcal{C}$.
2. Sample $C^{t+1}$ according to $\alpha$.

8 Efficiency
Need an efficient implementation of:
- Decomposition (not unique) of the weights over the concepts
- Entropy projection step (a convex problem)
Sufficient condition: $\mathrm{conv}(\mathcal{C})$ is described by a number of constraints polynomial in $d$.

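9 Regret Bounds
Theorem (Regret Bounds for CH): Let $\ell^* = \min_{C \in \mathcal{C}} C \cdot (\ell^1 + \dots + \ell^T)$ be the loss of the best concept in hindsight. Then, by choosing $\eta = \sqrt{\frac{2M \ln(d/M)}{\ell^*}}$,
$R_T \le \sqrt{2 \ell^* M \ln(d/M)} + M \ln(d/M)$.
- Since $\ell^* \le MT$, the regret satisfies $R_T \in O(M \sqrt{T \ln d})$.
- Matching lower bounds in applications.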
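
The step from the theorem to the $O(M \sqrt{T \ln d})$ rate can be written out as follows. This is a worked substitution rather than part of the original slides; the condition $T \ge \ln(d/M)$ used for the second term is an added assumption.

```latex
\begin{align*}
R_T &\le \sqrt{2 \ell^* M \ln(d/M)} + M \ln(d/M) \\
    &\le \sqrt{2 (MT) M \ln(d/M)} + M \ln(d/M)
      && \text{since } \ell^* \le MT \\
    &= M \sqrt{2 T \ln(d/M)} + M \ln(d/M) \\
    &\le (\sqrt{2} + 1)\, M \sqrt{T \ln(d/M)}
      && \text{assuming } T \ge \ln(d/M) \\
    &\le (\sqrt{2} + 1)\, M \sqrt{T \ln d}.
\end{align*}
```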

10 Comparison of CH, RWM and FPL
1. CH has significantly better regret bounds:
   - CH: $R_T \in O(M \sqrt{T \ln d})$
   - RWM: $R_T \in O(M \sqrt{M T \ln d})$
   - FPL: $R_T \in O(M \sqrt{d T \ln d})$
2. CH is optimal with respect to regret bounds, while RWM and FPL are not.
3. Standard expert setting (no structure): CH, RWM and FPL reduce to the same algorithm.
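
To get a feel for the gap between the three rates, the leading-order terms can be evaluated numerically. This is only an illustrative sketch: constants hidden by the $O(\cdot)$ notation are dropped, and the values of `M`, `d` and `T` below are arbitrary choices, not taken from the slides.

```python
import math

def regret_bound_orders(M, d, T):
    """Leading-order terms of the three bounds, with all constants dropped."""
    return {
        "CH": M * math.sqrt(T * math.log(d)),
        "RWM": M * math.sqrt(M * T * math.log(d)),
        "FPL": M * math.sqrt(d * T * math.log(d)),
    }

# Hypothetical problem size, chosen only for illustration.
orders = regret_bound_orders(M=10, d=1000, T=10_000)
ratios = {name: value / orders["CH"] for name, value in orders.items()}
```

Under these expressions the RWM rate exceeds the CH rate by a factor of $\sqrt{M}$ and the FPL rate by a factor of $\sqrt{d}$, independently of $T$.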

11 Applications
- On-line shortest path problems
- On-line PCA (k-sets)
- On-line ranking (k-permutations)
- Spanning trees

12 On-line Shortest Path Problem (SPP)
- $G = (V, E)$ is a directed graph.
- $s$ is the source and $t$ is the destination.
- Each path from $s$ to $t$ is an expert.
- The loss is additive over edges.

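13 Unit Flow Polytope
- In general, the convex hull of paths cannot be captured by a few linear constraints.
- The unit flow polytope relaxation is used instead:
  - $w_{u,v} \ge 0$ for all $(u, v) \in E$
  - $\sum_{v \in V} w_{s,v} = 1$
  - $\sum_{v \in V} w_{v,u} = \sum_{v \in V} w_{u,v}$ for all $u \in V \setminus \{s, t\}$
- The relaxation does not hurt the regret bounds.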
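
The three families of constraints can be checked directly for a given weight vector. The sketch below is an illustration under stated assumptions: edge weights are stored in a dictionary keyed by `(u, v)` pairs, flow conservation is enforced at every node other than `s` and `t`, and the function name `in_unit_flow_polytope` is hypothetical.

```python
def in_unit_flow_polytope(w, s, t, tol=1e-9):
    """Return True if the edge weights w form a unit flow from s to t."""
    # Constraint 1: non-negative weight on every edge.
    if any(x < -tol for x in w.values()):
        return False
    outflow, inflow = {}, {}
    for (u, v), x in w.items():
        outflow[u] = outflow.get(u, 0.0) + x
        inflow[v] = inflow.get(v, 0.0) + x
    # Constraint 2: exactly one unit of flow leaves the source.
    if abs(outflow.get(s, 0.0) - 1.0) > tol:
        return False
    # Constraint 3: flow conservation at every internal node.
    nodes = set(outflow) | set(inflow)
    return all(abs(inflow.get(u, 0.0) - outflow.get(u, 0.0)) <= tol
               for u in nodes - {s, t})

flow = {("s", "a"): 0.7, ("s", "b"): 0.3, ("a", "b"): 0.2,
        ("a", "t"): 0.5, ("b", "t"): 0.5}
leaky = dict(flow)
leaky[("a", "t")] = 0.4  # node "a" now receives more than it sends
```

Here `flow` satisfies all constraints, while `leaky` violates conservation at node `a`.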

14 Example of Unit Flow Polytope
[Figure: example of a unit flow polytope; image not recovered]

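15 Entropy Projection on Unit Flow Polytope
$\min_{w} \sum_{(u,v) \in E} \left( w_{u,v} \ln \frac{w_{u,v}}{\hat{w}_{u,v}} + \hat{w}_{u,v} - w_{u,v} \right)$
subject to:
- $w_{u,v} \ge 0$ for all $(u, v) \in E$
- $\sum_{v \in V} w_{s,v} = 1$
- $\sum_{v \in V} w_{v,u} = \sum_{v \in V} w_{u,v}$ for all $u \in V \setminus \{s, t\}$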

16 Dual problem
$\max_{\lambda} \left\{ \lambda_s - \sum_{(u,v) \in E} \hat{w}_{u,v} e^{\lambda_u - \lambda_v} \right\}$
- No constraints.
- Only $|V|$ variables.
- Primal solution: $w_{u,v} = \hat{w}_{u,v} e^{\lambda_u - \lambda_v}$
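
One possible way to solve the dual numerically is exact coordinate ascent, sketched below. The slides do not say which solver is used, so this choice is an assumption made for illustration. The sketch further assumes that $\lambda_t$ is fixed to 0, that all $\hat{w}_{u,v}$ are strictly positive, that $s$ has no incoming edges and $t$ no outgoing ones, and that every other node has at least one incoming and one outgoing edge. The name `project_onto_unit_flow` is hypothetical.

```python
import math

def project_onto_unit_flow(w_hat, s, t, sweeps=500):
    """Approximate the relative entropy projection of w_hat via the dual.

    Maximizes lam_s - sum over edges of w_hat[u, v] * exp(lam_u - lam_v)
    by exact coordinate ascent, keeping lam_t fixed at 0 (an assumption).
    """
    nodes = {u for edge in w_hat for u in edge}
    lam = {u: 0.0 for u in nodes}
    for _ in range(sweeps):
        for u in nodes - {t}:
            out_sum = sum(x * math.exp(-lam[v])
                          for (a, v), x in w_hat.items() if a == u)
            in_sum = sum(x * math.exp(lam[a])
                         for (a, v), x in w_hat.items() if v == u)
            if u == s:
                # Stationarity in lam_s: 1 - exp(lam_s) * out_sum = 0.
                lam[u] = -math.log(out_sum)
            else:
                # Stationarity: exp(lam_u) * out_sum = exp(-lam_u) * in_sum.
                lam[u] = 0.5 * math.log(in_sum / out_sum)
    # Recover the primal solution from the dual variables.
    return {(u, v): x * math.exp(lam[u] - lam[v])
            for (u, v), x in w_hat.items()}

w_hat = {("s", "a"): 0.5, ("s", "b"): 0.2, ("a", "b"): 0.3,
         ("a", "t"): 0.4, ("b", "t"): 0.6}
projected = project_onto_unit_flow(w_hat, "s", "t")
```

Each coordinate update enforces one constraint of the primal exactly (unit outflow at `s`, or conservation at an internal node), so at convergence the recovered weights lie in the unit flow polytope.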

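17 Convex Decomposition
1. Find any path from $s$ to $t$ whose edges all have non-zero weight.
2. Subtract the smallest edge weight on that path from each of its edges.
3. Repeat until no such path is found.
Each iteration sets at least one edge weight to zero, so at most $|E|$ iterations are needed.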
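
The three steps above can be sketched as a short path-peeling routine. This is a minimal illustration that assumes the graph is acyclic and that the input is a valid unit flow stored as a dictionary keyed by `(u, v)` pairs; the name `decompose_flow` and the tolerance `tol` are hypothetical choices.

```python
def decompose_flow(w, s, t, tol=1e-12):
    """Split a unit flow on a DAG into a list of (alpha, path) pairs."""
    residual = dict(w)

    def find_path(u, path):
        # Depth-first search that only follows edges with weight above tol.
        if u == t:
            return path
        for (a, b), x in residual.items():
            if a == u and x > tol:
                found = find_path(b, path + [b])
                if found is not None:
                    return found
        return None

    paths = []
    while True:
        path = find_path(s, [s])
        if path is None:
            break
        edges = list(zip(path, path[1:]))
        # The smallest weight on the path becomes the path's coefficient.
        alpha = min(residual[e] for e in edges)
        for e in edges:
            residual[e] -= alpha  # at least one edge drops to zero
        paths.append((alpha, path))
    return paths

flow = {("s", "a"): 0.7, ("s", "b"): 0.3, ("a", "b"): 0.2,
        ("a", "t"): 0.5, ("b", "t"): 0.5}
paths = decompose_flow(flow, "s", "t")
```

The coefficients `alpha` sum to one for a unit flow, so sampling a path with probability `alpha` gives the randomized prediction. The decomposition is not unique: a different search order can return different paths.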

18-22 Example of Convex Decomposition
[Figure sequence: step-by-step example of the convex decomposition; images not recovered]

23 Regret Bounds for SPP
- Expected regret is bounded by $2 \sqrt{\ell^* k \ln |V|} + 2k \ln |V| \in O(M \sqrt{T \ln |V|})$.
- The bound holds for arbitrary graphs.
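
The shortest-path bound follows from the general CH theorem by substitution. The derivation below is a worked sketch under the assumptions that the components are the edges ($d = |E|$), that every path uses $k$ edges ($M = k \ge 1$), and that the graph has no parallel edges, so $|E| \le |V|^2$.

```latex
\begin{align*}
\ln(d/M) &= \ln(|E|/k) \le \ln |V|^2 = 2 \ln |V|, \\
R_T &\le \sqrt{2 \ell^* M \ln(d/M)} + M \ln(d/M) \\
    &\le \sqrt{2 \ell^* k \cdot 2 \ln |V|} + 2k \ln |V| \\
    &= 2 \sqrt{\ell^* k \ln |V|} + 2k \ln |V|.
\end{align*}
```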

24 Lower Bounds
Any algorithm can be forced to have expected regret on the order of $\sqrt{\ell^* k \ln(|V|/k)}$.
Idea of the proof:
- Minimize the overlap.
- Create $|V|/k$ disjoint paths of length $k$.
- Apply lower bounds for the standard expert setting.

25 Conclusions
- Regret of CH is often better than that of RWM or FPL in the structured setting.
- Regret of CH often matches lower bounds in applications.
- Efficient solutions exist for a wide range of applications: on-line shortest path, on-line PCA, on-line ranking, spanning trees.

26 References
- Wouter Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging Structured Concepts. In COLT, 2010.
- Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. In FOCS, 1989.
- Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 2003.
- Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4:773-818, 2003.
- T. van Erven, W. Kotlowski, and M. K. Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.
- Nicolo Cesa-Bianchi and Gabor Lugosi. Combinatorial bandits. In Proceedings of the 22nd Annual Conference on Learning Theory.

27 Regret Bounds
Theorem (Regret Bounds for CH): Let $\ell^* = \min_{C \in \mathcal{C}} C \cdot (\ell^1 + \dots + \ell^T)$ be the loss of the best concept in hindsight. Then, by choosing $\eta = \sqrt{\frac{2M \ln(d/M)}{\ell^*}}$,
$R_T \le \sqrt{2 \ell^* M \ln(d/M)} + M \ln(d/M)$.

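28 Proof of CH Regret Bound
1. Bound: $(1 - e^{-\eta})\, w^{t-1} \cdot \ell^t \le \Delta(C \,\|\, w^{t-1}) - \Delta(C \,\|\, w^t) + \eta\, C \cdot \ell^t$, using
   - $1 - e^{-\eta x} \ge (1 - e^{-\eta})\, x$ for $x \in [0, 1]$
   - the Generalized Pythagorean Theorem
2. Sum over trials $t$: $(1 - e^{-\eta}) \sum_{t=1}^{T} w^{t-1} \cdot \ell^t \le \Delta(C \,\|\, w^0) - \Delta(C \,\|\, w^T) + \eta\, C \cdot \ell^{\le T}$, where $\ell^{\le T} = \ell^1 + \dots + \ell^T$.
3. Use $w^{t-1} \cdot \ell^t = \mathbb{E}[C^t] \cdot \ell^t$: $\sum_{t=1}^{T} \mathbb{E}[C^t] \cdot \ell^t \le \frac{\Delta(C \,\|\, w^0) - \Delta(C \,\|\, w^T) + \eta\, C \cdot \ell^{\le T}}{1 - e^{-\eta}}$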

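29 Proof of CH Regret Bound
4. $w^0$ assumes the uniform distribution over concepts: $w^0_i = \frac{M}{d} \;\Rightarrow\; \Delta(C \,\|\, w^0) = M \ln\left(\frac{d}{M}\right)$
5. Let $\ell^*$ be the loss of the best concept in hindsight. Choosing $\eta = \sqrt{\frac{2M \ln(d/M)}{\ell^*}}$ gives the regret bound on $R_T$.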
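
The value of $\Delta(C \,\|\, w^0)$ in step 4 can be checked by direct computation. The calculation below is a worked sketch that assumes every concept uses exactly $M$ components, so $|C| = M$, and uses the convention $0 \ln 0 = 0$.

```latex
\begin{align*}
\Delta(C \,\|\, w^0)
  &= \sum_{i=1}^{d} \left( C_i \ln \frac{C_i}{w^0_i} + w^0_i - C_i \right) \\
  &= \sum_{i : C_i = 1} \ln \frac{d}{M} + \sum_{i=1}^{d} \frac{M}{d} - \sum_{i=1}^{d} C_i \\
  &= |C| \ln \frac{d}{M} + M - |C| \\
  &= M \ln \frac{d}{M} \qquad \text{when } |C| = M.
\end{align*}
```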
