Advanced Machine Learning
Follow-the-Perturbed-Leader
Mehryar Mohri (MOHRI@)
Courant Institute & Google Research
General Ideas
Linear loss: decomposition as a sum over substructures:
- sum of edge losses along a path;
- sum of edge losses in a tree;
- sum of losses of other substructures in a discrete problem;
- includes the expert setting.
FPL
General linear decision problem:
- at each round, the player selects $w_t \in W \subseteq \mathbb{R}^N$, receives $x_t \in X \subseteq \mathbb{R}^N$, and incurs loss $w_t \cdot x_t$, with $\sup_{w \in W,\, x \in X} w \cdot x \le R$.
- Objective: minimize cumulative loss or regret.
- Notation: $M(x) = \operatorname{argmin}_{w \in W} w \cdot x$; $X \subseteq \{x : \|x\|_1 \le X_1\}$; $\ell_1\text{-diam}(W) \le W_1$.
(Kalai and Vempala, 2004)
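As an illustration (not on the original slide), the expert setting already mentioned above fits this template by taking $W$ to be the set of standard basis vectors, so that $M$ simply selects the expert with the smallest cumulative loss:
\[
W = \{e_1, \ldots, e_N\}, \qquad
w_t \cdot x_t = x_{t, i_t} \ \text{(loss of the selected expert $i_t$)}, \qquad
M(x_{1:t-1}) = e_{i^*}, \ \ i^* \in \operatorname{argmin}_{i} \textstyle\sum_{s < t} x_{s, i}.
\]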
FTL
Follow the Leader (FTL): at every round, play $w_t = M(x_{1:t-1})$ (aka fictitious play).
FTL problem: suppose $N = 2$ and consider a sequence starting with $(1/2, 0)$ and then alternating $(0, 1)$ and $(1, 0)$. Then, FTL incurs loss 1 at every round, $T$ overall; any single expert incurs loss about $T/2$ overall.
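A quick numerical check of this failure mode (my own sketch, not from the slides; it assumes the two-expert loss sequence just described, with ties broken toward the first expert):

```python
import numpy as np

# FTL failure case: N = 2 experts, first loss vector (1/2, 0),
# then alternating (0, 1), (1, 0), (0, 1), ...
T = 10
losses = [np.array([0.5, 0.0])] + [
    np.array([0.0, 1.0]) if t % 2 == 1 else np.array([1.0, 0.0])
    for t in range(1, T)
]

cum = np.zeros(2)
ftl_loss = 0.0
for x_t in losses:
    leader = int(np.argmin(cum))   # FTL plays the current leader (ties -> expert 0)
    ftl_loss += x_t[leader]
    cum += x_t

print(ftl_loss)     # close to T: FTL is fooled at every round
print(cum.min())    # close to T/2: loss of the best single expert
```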
FPL Algorithms
Follow the Perturbed Leader (FPL), additive bound:
$p_t \sim U([0, 1/\epsilon]^N)$, and $w_t = \operatorname{argmin}_{w \in W} \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t = M(x_{1:t-1} + p_t)$.
Follow the Perturbed Leader (FPL*), multiplicative bound:
$p_t$ Laplacian with density $f(x) = \left(\tfrac{\epsilon}{2}\right)^{N} e^{-\epsilon \|x\|_1}$, and $w_t = \operatorname{argmin}_{w \in W} \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t = M(x_{1:t-1} + p_t)$.
(Hannan, 1957; Kalai and Vempala, 2004)
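A minimal runnable sketch of both rules (mine, not from the slides), assuming the expert setting so that $W$ is the set of standard basis vectors and $M(\cdot)$ reduces to an argmin over coordinates; the random-loss adversary and all parameter values are made up for illustration:

```python
import numpy as np

def fpl_additive(cum_loss, eps, rng):
    """Additive FPL: perturb the cumulative losses with uniform noise drawn
    from [0, 1/eps]^N, then play the minimizer M(x_{1:t-1} + p_t)."""
    p = rng.uniform(0.0, 1.0 / eps, size=cum_loss.shape)
    return int(np.argmin(cum_loss + p))

def fpl_star(cum_loss, eps, rng):
    """FPL*: perturb with i.i.d. two-sided (Laplacian) noise of density
    proportional to exp(-eps * |x|) per coordinate, then play the minimizer."""
    p = rng.laplace(loc=0.0, scale=1.0 / eps, size=cum_loss.shape)
    return int(np.argmin(cum_loss + p))

# Usage sketch in the expert setting: N experts, loss vectors x_t in [0, 1]^N.
rng = np.random.default_rng(0)
N, T, eps = 5, 1000, 0.1          # illustrative values, not tuned as in the theorems
cum_loss = np.zeros(N)
alg_loss = 0.0
for t in range(T):
    i = fpl_star(cum_loss, eps, rng)      # or: fpl_additive(cum_loss, eps, rng)
    x_t = rng.uniform(size=N)             # hypothetical adversary: random losses
    alg_loss += x_t[i]
    cum_loss += x_t
print(alg_loss, cum_loss.min())           # algorithm's loss vs. best single expert
```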
FPL - Bound
Theorem: fix $\epsilon > 0$. Then, the expected cumulative loss of additive FPL($\epsilon$) is bounded as follows:
\[
\mathbb{E}[L_T] \le L^{\min}_T + \epsilon R X_1 T + \frac{W_1}{\epsilon}.
\]
For $\epsilon = \sqrt{\frac{W_1}{R X_1 T}}$,
\[
\mathbb{E}[L_T] \le L^{\min}_T + 2\sqrt{X_1 W_1 R T}.
\]
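The tuned bound is the standard balancing of the two $\epsilon$-dependent terms; spelled out (this short derivation is mine, not on the slide):
\[
\min_{\epsilon > 0}\Big(\epsilon A + \frac{B}{\epsilon}\Big) = 2\sqrt{AB}, \ \text{attained at } \epsilon = \sqrt{B/A};
\qquad
\text{with } A = R X_1 T \text{ and } B = W_1:\ \ \epsilon R X_1 T + \frac{W_1}{\epsilon} = 2\sqrt{W_1 R X_1 T}.
\]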
FPL* - Bound
Theorem: fix $\epsilon > 0$ and assume that $W, X \subseteq \mathbb{R}^N_+$. Then, the expected cumulative loss of (multiplicative) FPL*($\epsilon/(2X_1)$) is bounded as follows:
\[
\mathbb{E}[L_T] \le (1 + \epsilon)\, L^{\min}_T + \frac{2 X_1 W_1 (1 + \log N)}{\epsilon}.
\]
For $\epsilon = \min\left\{ \sqrt{\tfrac{2 X_1 W_1 (1 + \log N)}{L^{\min}_T}},\, 1 \right\}$,
\[
\mathbb{E}[L_T] \le L^{\min}_T + 4\sqrt{L^{\min}_T X_1 W_1 (1 + \log N)} + 4 X_1 W_1 (1 + \log N).
\]
Proof Outline
Be the perturbed leader (BPL): $w_t = M(x_{1:t} + p_t)$.
1. Bound on the regret of BPL: $\mathbb{E}[R_T(\text{BPL})] \le \frac{W_1}{\epsilon}$.
2. Bound on the difference of the regrets of FPL and BPL: compare $\mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t]$ and $\mathbb{E}[M(x_{1:t} + p_1) \cdot x_t]$.
3. The difference of expectations is small because the two distributions are similar.
Proof: BTL Regret
Lemma 1: $\sum_{t=1}^{T} M(x_{1:t}) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T}$.
Proof: the case $T = 1$ is clear. By induction,
\begin{align*}
\sum_{t=1}^{T+1} M(x_{1:t}) \cdot x_t
&\le M(x_{1:T}) \cdot x_{1:T} + M(x_{1:T+1}) \cdot x_{T+1} && \text{(induction hypothesis)}\\
&\le M(x_{1:T+1}) \cdot x_{1:T} + M(x_{1:T+1}) \cdot x_{T+1} && \text{(def. of $M(x_{1:T})$ as minimizer)}\\
&= M(x_{1:T+1}) \cdot x_{1:T+1}.
\end{align*}
Proof: BPL Regret
Lemma 2: let $p_0 = 0$. Then, the following holds:
\[
\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + W_1 \sum_{t=1}^{T} \|p_t - p_{t-1}\|_\infty.
\]
Proof: use Lemma 1 with $x'_t = x_t + p_t - p_{t-1}$, then
\[
\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot (x_t + p_t - p_{t-1})
\le M(x_{1:T} + p_T) \cdot (x_{1:T} + p_T)
\le M(x_{1:T}) \cdot (x_{1:T} + p_T)
= M(x_{1:T}) \cdot x_{1:T} + M(x_{1:T}) \cdot p_T.
\]
Thus,
\[
\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot x_t
\le M(x_{1:T}) \cdot x_{1:T} + \sum_{t=1}^{T} \big(M(x_{1:T}) - M(x_{1:t} + p_t)\big) \cdot (p_t - p_{t-1})
\le M(x_{1:T}) \cdot x_{1:T} + W_1 \sum_{t=1}^{T} \|p_t - p_{t-1}\|_\infty.
\]
Proof: FPL vs. BPL Regrets
Proof: since only the expected loss matters, we can choose $p_t = p_1$ for all $t > 0$, which, by Lemma 2, yields:
\[
\sum_{t=1}^{T} M(x_{1:t} + p_1) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + W_1 \|p_1\|_\infty.
\]
Thus,
\begin{align*}
\sum_{t=1}^{T} \mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t]
&= \sum_{t=1}^{T} \Big( \mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t] - \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t] + \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t] \Big)\\
&\le \sum_{t=1}^{T} \Big( \mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t] - \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t] \Big) + L^{\min}_T + W_1\, \mathbb{E}[\|p_1\|_\infty].
\end{align*}
Proof: FPL
By definition of the perturbation, $\|p_1\|_\infty \le 1/\epsilon$. Now, $x_{1:t-1} + p_1$ and $x_{1:t} + p_1$ both follow a uniform distribution over a cube. Thus,
\[
\mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t] - \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t] \le R\, (1 - \text{fraction of overlap}).
\]
The two cubes $[0, 1/\epsilon]^N$ and $v + [0, 1/\epsilon]^N$ (with $0 \le v_i \le 1/\epsilon$) overlap over at least the fraction $(1 - \epsilon \|v\|_1)$: if $x \in [0, 1/\epsilon]^N$ but $x \notin v + [0, 1/\epsilon]^N$, then for at least one $i$, $x_i \notin [v_i, v_i + 1/\epsilon]$, that is $x_i \in [0, v_i)$, which has probability mass at most $\epsilon v_i$.
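Why the overlap controls the difference of expectations (my own spelling-out, assuming as usual that per-round losses $w \cdot x_t$ lie in $[0, R]$): if $P$ and $Q$ denote the two uniform distributions and $f(u) = M(u) \cdot x_t$, then
\[
\mathbb{E}_P[f] - \mathbb{E}_Q[f]
= \int_{\{dP > dQ\}} f\,(dP - dQ) + \int_{\{dP < dQ\}} f\,(dP - dQ)
\le R \int_{\{dP > dQ\}} (dP - dQ) + 0
= R\, d_{\mathrm{TV}}(P, Q)
\le R\,(1 - \text{fraction of overlap}).
\]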
Proof: FPL
Thus,
\[
\mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t] - \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t] \le \epsilon R \|x_t\|_1 \le \epsilon R X_1.
\]
And,
\[
\mathbb{E}[R_T] \le \epsilon R X_1 T + \frac{W_1}{\epsilon}.
\]
Proof: FPL*
Lemma 3: $\mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t] \le e^{\epsilon X_1}\, \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t]$.
Proof:
\begin{align*}
\mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t]
&= \int_{\mathbb{R}^N} M(x_{1:t-1} + u) \cdot x_t \, d\mu(u)\\
&= \int_{\mathbb{R}^N} M(x_{1:t} + v) \cdot x_t \, d\mu(x_t + v) && \text{(change of variable $v = u - x_t$)}\\
&= \int_{\mathbb{R}^N} M(x_{1:t} + v) \cdot x_t \, \underbrace{e^{-\epsilon(\|x_t + v\|_1 - \|v\|_1)}}_{\le\, e^{\epsilon X_1}} \, d\mu(v)\\
&\le e^{\epsilon X_1}\, \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t],
\end{align*}
using $M(x_{1:t} + v) \cdot x_t \ge 0$ (since $W, X \subseteq \mathbb{R}^N_+$) and $\|v\|_1 - \|x_t + v\|_1 \le \|x_t\|_1 \le X_1$.
Proof: FPL*
For $\epsilon \le 1/X_1$, $e^{\epsilon X_1} \le 1 + 2\epsilon X_1$, thus
\[
\sum_{t=1}^{T} \mathbb{E}[M(x_{1:t-1} + p_1) \cdot x_t]
\le (1 + 2\epsilon X_1) \sum_{t=1}^{T} \mathbb{E}[M(x_{1:t} + p_1) \cdot x_t]
\le (1 + 2\epsilon X_1)\big(L^{\min}_T + W_1\, \mathbb{E}[\|p_1\|_\infty]\big).
\]
Thus,
\begin{align*}
\mathbb{E}[\|p_1\|_\infty] = \mathbb{E}\Big[\max_{i \in [1, N]} |p_{1,i}|\Big]
&= \int_0^{+\infty} \Pr\Big[\max_{i \in [1, N]} |p_{1,i}| > t\Big]\, dt\\
&= \int_0^{u} \Pr\Big[\max_{i \in [1, N]} |p_{1,i}| > t\Big]\, dt + \int_u^{+\infty} \Pr\Big[\max_{i \in [1, N]} |p_{1,i}| > t\Big]\, dt\\
&\le 2u + N \int_u^{+\infty} \Pr\big[|p_{1,1}| > t\big]\, dt
= 2u + \frac{N}{\epsilon} e^{-\epsilon u}
\le \frac{2(1 + \log N)}{\epsilon} && \text{(best choice of $u$)}.
\end{align*}
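For the last step, each coordinate of the Laplacian perturbation has two-sided tail $\Pr[|p_{1,i}| > t] = e^{-\epsilon t}$, so (my own spelling-out of the integral and of one admissible choice of $u$):
\[
N \int_u^{+\infty} e^{-\epsilon t}\, dt = \frac{N}{\epsilon} e^{-\epsilon u},
\qquad
u = \frac{\log N}{\epsilon} \;\Longrightarrow\;
2u + \frac{N}{\epsilon} e^{-\epsilon u} = \frac{2\log N + 1}{\epsilon} \le \frac{2(1 + \log N)}{\epsilon}.
\]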
Expert Setting
$W_1 = 1$, $X_1 = N$, and $R = 1$; for FPL*($\epsilon$),
\[
\mathbb{E}[L_T] \le (1 + 2N\epsilon)\, L^{\min}_T + \frac{2(1 + \log N)}{\epsilon}.
\]
More favorable bound: replace each $x_t$ by the sequence $x_{t,1} e_1, \ldots, x_{t,N} e_N$. Then the new $L^{\min}_{NT}$ equals the old $L^{\min}_T$, and $\mathbb{E}[L^{\text{old}}_T] \le \mathbb{E}[L^{\text{new}}_{NT}]$.
New guarantee: for FPL*($\epsilon$),
\[
\mathbb{E}[L_T] \le (1 + 2\epsilon)\, L^{\min}_T + \frac{2(1 + \log N)}{\epsilon},
\qquad
\mathbb{E}[R_T] \le 2\sqrt{2 L^{\min}_T (1 + \log N)}.
\]
RWM = FPL
Let FPL($\eta$) be an instance of the general FPL algorithm with a perturbation defined by
\[
p_1 = \frac{1}{\eta}\big(\log(-\log(u_1)), \ldots, \log(-\log(u_N))\big)^\top,
\]
where each $u_j$ is drawn according to the uniform distribution over $[0, 1]$. Then, FPL($\eta$) and RWM($\eta$) coincide.
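A quick empirical check of this equivalence (my own sketch; the cumulative losses and the value of $\eta$ below are arbitrary): the perturbation $\log(-\log u_i)/\eta$ is a negated, scaled Gumbel variable, so by the Gumbel-max trick the distribution of the FPL argmin matches the RWM($\eta$) weights proportional to $e^{-\eta L_i}$:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.5                              # perturbation / learning-rate parameter (assumed)
L = np.array([3.0, 1.0, 2.5, 0.5])     # hypothetical cumulative losses of N = 4 experts

# RWM(eta) distribution: weights proportional to exp(-eta * L_i).
rwm = np.exp(-eta * L)
rwm /= rwm.sum()

# FPL with perturbation p_i = log(-log u_i) / eta, u_i ~ U(0, 1):
# play argmin_i (L_i + p_i); estimate the play distribution by Monte Carlo.
num_samples = 200_000
u = rng.uniform(size=(num_samples, L.size))
p = np.log(-np.log(u)) / eta
plays = np.argmin(L + p, axis=1)
fpl = np.bincount(plays, minlength=L.size) / num_samples

print("RWM(eta) weights:", np.round(rwm, 4))
print("FPL empirical   :", np.round(fpl, 4))   # should agree up to sampling noise
```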
References
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9): 2050-2057, 2004.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Yoav Freund and Robert Schapire. Large Margin Classification Using the Perceptron Algorithm. In Proceedings of COLT 1998. ACM Press, 1998.
Adam T. Kalai, Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 71(3): 291-307, 2005.
Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.
Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning, 2(4): 285-318, 1988.
References
Nick Littlestone, Manfred K. Warmuth. The Weighted Majority Algorithm. FOCS 1989: 256-261.
Tom Mitchell. Machine Learning. McGraw Hill, 1997.
Albert B. Novikoff. On Convergence Proofs on Perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 12: 615-622. Polytechnic Institute of Brooklyn, 1962.