CS281B/Stat241B. Statistical Learning Theory. Lecture 14.

Size: px

Start display at page:

Download "CS281B/Stat241B. Statistical Learning Theory. Lecture 14."

Charleen Rose
5 years ago
Views:

1 CS281B/Stat241B. Statistical Learning Theory. Lecture 14. Wouter M. Koolen Convex losses Exp-concave losses Mixable losses The gradient trick Specialists 1

2 Menu Today we solve new online learning problems by reducing them to problems/algorithms/analyses we already cracked before. 2

3 Prediction with Expert Advice Prediction with expert advice: Protocol: For t = 1,2,... Experts announce actions a 1 t,...,a K t A. Learner chooses an action a t A. Adversary reveals outcome x t X. Learner incurs lossl(a t,x t ). Goal: small regret w.r.t. best expert. 3

4 Convex: Reduction to dot loss SayL(a,x) is[0,1]-bounded and convex inafor each x: ( K K ) w k L(a k,x) L w k a k,x k=1 Then we can feed Hedge l k t = L(a k t,x t ). k=1 Hedge outputs w t. Play the mean action a t = K k=1 wk ta k t. K wtl(a k k t,x t ) k=1 } {{ } Dot lossw t l t L(a t,x t ) }{{} actual loss Dot-loss bound translates to convex bounded lossl. R T T/2lnK 4

5 Exp-concave: Reduction to mix loss Definition: We say L(a,x) isη-exp-concave inafor each x if K k=1 w k e ηl(ak,x) e ηl ( K k=1 wk a k,x) Then we can feed the AAl k t = ηl(a k t,x t ). The AA outputs w t. Play the mean action a t = K k=1 wk ta k t. ( K ) ( K ) ln wte k ηl(ak t,x t) ηl wta k k t,x t k=1 } {{ } Mix loss of w t on ηl t k=1 } {{ } actual loss 5

6 Mix-loss bound translates to exp-concave lossl: R T lnk η 6

7 Example: square loss is exp-concave Let s consider L(a,x) = (a x) 2 where A = X = [ 1,+1]. Findη such that L isη-exp-concave by testing negative second derivative: 2 a 2e η(a x)2 = a 2e η(a x)2 η(a x) = e η(a x)2 η ( 4η(a x) 2 2 ) Highest η such that 4η(a x) for all a,x. η = 1/8. For X = A = [ Y,+Y] we findη = 1 8Y 2. 7

8 Mixable loss: Reduction to mix loss Crux: exp-concavity is convenient but too strong. Definition: We say L(a,x) isη-mixable if w a 1,...,a K a x L(a,x) 1 η ln ( K ) w k e ηl(ak,x) k=1 Mapping from w,a 1,...,a K to witnessacalled substitution function. Mixable losses behave just enough like the mix loss to carry the AA regret bound through. R T lnk η 8

9 Square loss is mixable Square loss is mixable withη = 1 2. The substitution function is w,a 1,...,a K m1 2 ( 1) m1 2 (+1) 4 where m η (x) = 1 η ln K k=1 wk e η(ak x) 2 See (Vovk 1990, Haussler, Kivinen, Warmuth, 1998) 9

10 Mixable loss list Popular mixable losses: mix loss, log loss, entropic loss square loss, Brier loss Hellinger lossa = X = [0,1]: L(a,x) := 1 ( ( 1 x 1 a) 2 +( x a) ) 2 Characterisation of mixability: (Van Erven, Reid, Williamson 2012). 10

11 Gradient trick 11

12 Gradient trick Abusing an algorithm that can compete with the best expert to in fact compete with the best convex combination (cf portfolios). Assume convex lossl(w,x): L(w,x) L(w t,x)+(w w t ) w L(w t,x) }{{} First-order expansion around algorithm s actionw t Idea: feed Hedge l t = w L(w t,x) (may need restriction + translation + scaling to make this[0,1] bounded). Get w t. Playw t. 12

13 Imaginary regret upper bounds the actual regret: T t=1 w tl t min k T t=1 l k t = max k = max w max w = T t=1 T t=1 ( w ) tl t l k t T (w t w) w l(w t,x) t=1 T (L(w t,x) L(w,x)) t=1 L(w t,x) max w T L(w, x) t=1 Caveat: even if original loss was nice (mixable/curved/... ), the imagined lossw w w L(w t,x) is linear. Regret of order T. 13

14 Specialists 14

15 Motivation Not all expert predictions/actions available every round. Missing data Noise Too expensive ($/time/memory) How to model missingness? Adversarial. How to redefine the objective? New variant of regret. How to still do something optimal? Upgrade of AA. 15

16 Mix loss game with specialists Protocol: For t = 1,2,... Adversary picks the subset A t [K] of awake specialists. Learner chooses a distributionw t on awake specialists A t. Adversary reveals loss vector l t (, ] A t. Learner s loss is the mix loss ln ( ) k A t w t,k e l t,k 16

17 Objective There is no loss when a specialist is asleep. Regret w.r.t. specialist j: only measured during rounds where j is awake R j T = ( ) wte k l k t j t k A t ln t [T] j A t l t [T] j A t 17

18 Specialist AA Definition: The Specialist Aggregating Algorithm (SAA) maintains a distributionu t. It starts uniform u k 1 = 1/K. In round t with awake experts A t, SAA predict with w k t = u t (k A t ) = uk t1 {k At } j A t u j t Update: u k t+1 = u k t e lk t j A t u k t e lk t u k t j A t u k t k A t k / A t AA update relative to awake set A t 18

19 What makes this tick Consider the sequence l 1,l 2 obtained by completing l 1,l 2 by assigning in each round the SAA mix loss to all the asleep specialists. Theorem: SAA onland AA onl produce identical weights u t = w t and suffer identical mix loss. Proof: (homework) 19

20 Specialist regret bound for SAA The AA has small regret w.r.t. expert j: ( T K ) lnk ln w tk e l k t = t=1 t=1 = = R j T k=1 T t=1 ( ) T ln wte k l k t k A t ln t=[t] t A j ( wte k l k t k A t ) l j t l t=[t] t A j j lt t=[t] t A j j t t=[t] t/ A j ( ) ln wte k l k t k A t Adversary more power (sleeping) but regret stilllnk: SAA minimax for specialist mix-loss regret game. 20

21 Discussion Convex bounded losses are easier than dot loss. Mixable losses are easier than mix loss. Gradient trick allows us to compete with mixtures (at a cost) Specialists extension deals with missing data (at no cost). 21

Putting Bayes to sleep

Putting Bayes to sleep Wouter M. Koolen Dmitry Adamskiy Manfred Warmuth Thursday 30 th August, 2012 Menu How a beautiful and intriguing piece of technology provides new insights in existing methods and