Advanced Machine Learning


1 Advanced Machine Learning: Follow-the-Perturbed Leader. MEHRYAR MOHRI, COURANT INSTITUTE & GOOGLE RESEARCH.

2 General Ideas

Linear loss: decomposition as a sum along substructures.
- sum of edge losses in a tree.
- sum of edge losses along a path.
- sum of other substructure losses in a discrete problem.

This framework includes the expert setting.

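3 FPL

General linear decision problem (Kalai and Vempala, 2004). At each round $t$:
- player selects $w_t \in \mathcal{W} \subseteq \mathbb{R}^N$.
- player receives $x_t \in \mathcal{X} \subseteq \mathbb{R}^N$.
- player incurs loss $w_t \cdot x_t$.

Objective: minimize cumulative loss or regret.

Notation:
- $M(x) = \operatorname{argmin}_{w \in \mathcal{W}} w \cdot x$.
- $\mathcal{X} \subseteq \{x : \|x\|_1 \le X_1\}$.
- $\sup_{w \in \mathcal{W},\, x \in \mathcal{X}} |w \cdot x| \le R$.
- $\ell_1$-diam$(\mathcal{W}) \le W_1$.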
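The protocol above can be sketched for a finite decision set. This is only an illustration: the names `dot`, `oracle`, and `regret` are hypothetical, the decision set is assumed to be a small explicit list of vectors, and ties in the argmin are broken by taking the first minimizer.

```python
from typing import List, Sequence

Vector = Sequence[float]

def dot(w: Vector, x: Vector) -> float:
    """Inner product w . x, the loss of decision w against loss vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def oracle(decisions: List[Vector], x: Vector) -> Vector:
    """M(x): a decision minimizing w . x (first minimizer on ties)."""
    return min(decisions, key=lambda w: dot(w, x))

def regret(decisions: List[Vector], plays: List[Vector], losses: List[Vector]) -> float:
    """Cumulative loss of the plays minus the loss of the best fixed decision."""
    cumulative = [sum(col) for col in zip(*losses)]  # x_{1:T}
    incurred = sum(dot(w, x) for w, x in zip(plays, losses))
    best = dot(oracle(decisions, cumulative), cumulative)
    return incurred - best

# Two experts encoded as basis vectors (the expert setting).
experts = [(1, 0), (0, 1)]
losses = [(0, 1), (1, 0), (0, 1)]
plays = [(1, 0), (1, 0), (0, 1)]
example_regret = regret(experts, plays, losses)
print(example_regret)
```

Here the player incurs total loss 2 while the best fixed expert incurs 1, so the regret is 1.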

4 FTL

Follow the Leader (FTL): use $w_t = M(x_{1:t-1})$ at every round (aka fictitious play).

FTL problem: suppose $N = 2$ and consider a sequence starting with $(0, 1/2)^\top$ and then alternating $(1, 0)^\top$ and $(0, 1)^\top$. Then,
- FTL incurs loss 1 at every round after the first, about $T$ overall.
- any single expert incurs loss about $T/2$ overall.
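The failure of Follow the Leader on this sequence is easy to reproduce. The sketch below assumes the loss sequence starts with (0, 1/2) and then alternates (1, 0) and (0, 1), and that ties are broken toward the first expert; the function names are illustrative only.

```python
def follow_the_leader(losses):
    """Play the expert with the smallest cumulative loss so far."""
    n = len(losses[0])
    cumulative = [0.0] * n
    total = 0.0
    for x in losses:
        # M(x_{1:t-1}): index of the current leader (first index on ties).
        leader = min(range(n), key=lambda i: cumulative[i])
        total += x[leader]
        cumulative = [c + xi for c, xi in zip(cumulative, x)]
    return total, cumulative

def bad_sequence(T):
    """(0, 1/2) followed by alternating (1, 0) and (0, 1)."""
    seq = [(0.0, 0.5)]
    for t in range(1, T):
        seq.append((1.0, 0.0) if t % 2 == 1 else (0.0, 1.0))
    return seq

T = 100
fl_loss, cumulative = follow_the_leader(bad_sequence(T))
best_expert_loss = min(cumulative)
print(fl_loss, best_expert_loss)
```

With T = 100 the leader is wrong at every round after the first, while each expert accumulates only about half as much loss.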

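5 FPL Algorithms (Hannan 1957; Kalai and Vempala, 2004)

Additive bound. Follow the Perturbed Leader (FPL):
- $p_t \sim U([0, 1/\epsilon]^N)$.
- $w_t = \operatorname{argmin}_{w \in \mathcal{W}} \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t = M(x_{1:t-1} + p_t)$, where $x_{1:t-1} = \sum_{s=1}^{t-1} x_s$.

Multiplicative bound. Follow the Perturbed Leader (FPL*):
- $p_t$ Laplacian: independent coordinates, each with density $f(z) = \frac{\epsilon}{2} e^{-\epsilon |z|}$, so that the joint density is proportional to $e^{-\epsilon \|x\|_1}$.
- $w_t = \operatorname{argmin}_{w \in \mathcal{W}} \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t = M(x_{1:t-1} + p_t)$.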

6 FPL - Bound

Theorem: fix $\epsilon > 0$. Then, the expected cumulative loss of additive FPL($\epsilon$) is bounded as follows:
$$E[L_T] \le L_T^{\min} + \epsilon R X_1 T + \frac{W_1}{\epsilon}.$$

For $\epsilon = \sqrt{\frac{W_1}{R X_1 T}}$,
$$E[L_T] \le L_T^{\min} + 2\sqrt{X_1 W_1 R T}.$$
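The tuning of $\epsilon$ in the additive bound is a plain balancing of the two regret terms, which can be checked numerically. The helper names and the numeric values of $R$, $X_1$, $W_1$, $T$ below are arbitrary choices for illustration.

```python
import math

def fpl_regret_bound(eps, R, X1, W1, T):
    """Regret part of the additive bound: eps * R * X1 * T + W1 / eps."""
    return eps * R * X1 * T + W1 / eps

def tuned_eps(R, X1, W1, T):
    """Value of eps that equalizes the two terms of the bound."""
    return math.sqrt(W1 / (R * X1 * T))

R, X1, W1, T = 1.0, 1.0, 1.0, 100
eps = tuned_eps(R, X1, W1, T)
bound = fpl_regret_bound(eps, R, X1, W1, T)
closed_form = 2.0 * math.sqrt(X1 * W1 * R * T)
print(eps, bound, closed_form)
```

Any other value of `eps` gives a larger value of `fpl_regret_bound`, since the sum of `a * eps` and `b / eps` is minimized where the two terms are equal.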

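7 FPL* - Bound

Theorem: fix $\epsilon > 0$ and assume that $\mathcal{W}, \mathcal{X} \subseteq \mathbb{R}^N_+$. Then, the expected cumulative loss of (multiplicative) FPL*($\epsilon / 2X_1$) is bounded as follows:
$$E[L_T] \le (1 + \epsilon) L_T^{\min} + \frac{2 X_1 W_1 (1 + \log N)}{\epsilon}.$$

For $\epsilon = \min\left\{ \frac{1}{2X_1},\ \sqrt{\frac{W_1 (1 + \log N)}{X_1 L_T^{\min}}} \right\}$,
$$E[L_T] \le L_T^{\min} + 4\sqrt{L_T^{\min} X_1 W_1 (1 + \log N)} + 4 X_1 W_1 (1 + \log N).$$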

8 Proof Outline

Be the perturbed leader (BPL): $w_t = M(x_{1:t} + p_t)$.

1. Bound on regret of BPL: $E[R_T(\text{BPL})] \le W_1 / \epsilon$.
2. Bound on difference of regrets of FPL and BPL: $E[M(x_{1:t-1} + p_1) \cdot x_t] - E[M(x_{1:t} + p_1) \cdot x_t]$.
3. Difference of expectations small because similar distributions.

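9 Proof: BTL Regret

Lemma 1: $\sum_{t=1}^{T} M(x_{1:t}) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T}$.

Proof: case $T = 1$ is clear. By induction,
$$\sum_{t=1}^{T+1} M(x_{1:t}) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + M(x_{1:T+1}) \cdot x_{T+1} \quad \text{(induction)}$$
$$\le M(x_{1:T+1}) \cdot x_{1:T} + M(x_{1:T+1}) \cdot x_{T+1} \quad \text{(def. of } M(x_{1:T}) \text{ as minimizer)}$$
$$= M(x_{1:T+1}) \cdot x_{1:T+1}.$$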
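Lemma 1 can be sanity-checked numerically in the expert setting, where $M(x)$ picks the coordinate with the smallest cumulative loss. The sketch below uses integer losses so that the comparison is exact; the function name and the random instances are assumptions made for the check, which illustrates the lemma but does not prove it.

```python
import random

def be_the_leader_gap(losses):
    """Return M(x_{1:T}) . x_{1:T} minus the loss of Be-the-Leader (Lemma 1: >= 0)."""
    n = len(losses[0])
    cumulative = [0] * n
    btl_loss = 0
    for x in losses:
        cumulative = [c + xi for c, xi in zip(cumulative, x)]
        # M(x_{1:t}): leader computed AFTER seeing the current loss vector.
        leader = min(range(n), key=lambda i: cumulative[i])
        btl_loss += x[leader]
    return min(cumulative) - btl_loss

rng = random.Random(0)
gaps = []
for _ in range(200):
    losses = [[rng.randint(0, 9) for _ in range(4)] for _ in range(30)]
    gaps.append(be_the_leader_gap(losses))
print(min(gaps))
```

Every gap is nonnegative: the hypothetical player who already knows the current loss vector never does worse than the best fixed expert.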

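10 Proof: BPL Regret

Lemma 2: let $p_0 = 0$. Then, the following holds:
$$\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + W_1 \sum_{t=1}^{T} \|p_t - p_{t-1}\|_\infty.$$

Proof: use Lemma 1 with $x'_t = x_t + p_t - p_{t-1}$; then
$$\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot (x_t + p_t - p_{t-1}) \le M(x_{1:T} + p_T) \cdot (x_{1:T} + p_T)$$
$$\le M(x_{1:T}) \cdot (x_{1:T} + p_T)$$
$$= M(x_{1:T}) \cdot x_{1:T} + \sum_{t=1}^{T} M(x_{1:T}) \cdot (p_t - p_{t-1}).$$

Thus,
$$\sum_{t=1}^{T} M(x_{1:t} + p_t) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + \sum_{t=1}^{T} \big[ M(x_{1:T}) - M(x_{1:t} + p_t) \big] \cdot (p_t - p_{t-1})$$
$$\le M(x_{1:T}) \cdot x_{1:T} + W_1 \sum_{t=1}^{T} \|p_t - p_{t-1}\|_\infty.$$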
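A numerical sanity check of Lemma 2 in the expert setting, under stated assumptions: the decisions are basis vectors, whose $\ell_1$ diameter is 2, so `w1=2.0` is used as a safe value of $W_1$; the perturbations are arbitrary Laplace draws; and the function name is hypothetical.

```python
import random

def lemma2_sides(losses, perturbations, w1=2.0):
    """Return both sides of Lemma 2 for experts, with p_0 = 0."""
    n = len(losses[0])
    cumulative = [0.0] * n
    prev = [0.0] * n  # p_0 = 0
    lhs = 0.0
    drift = 0.0  # sum_t ||p_t - p_{t-1}||_inf
    for x, p in zip(losses, perturbations):
        cumulative = [c + xi for c, xi in zip(cumulative, x)]
        # M(x_{1:t} + p_t): the perturbed leader, including the current round.
        choice = min(range(n), key=lambda i: cumulative[i] + p[i])
        lhs += x[choice]
        drift += max(abs(a - b) for a, b in zip(p, prev))
        prev = p
    rhs = min(cumulative) + w1 * drift
    return lhs, rhs

rng = random.Random(1)
worst_slack = float("inf")
for _ in range(200):
    losses = [[rng.random() for _ in range(3)] for _ in range(25)]
    perturbations = [[rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(3)]
                     for _ in range(25)]
    lhs, rhs = lemma2_sides(losses, perturbations)
    worst_slack = min(worst_slack, rhs - lhs)
print(worst_slack)
```

When the same perturbation is reused at every round, the drift term collapses to the norm of that single perturbation, which is the case used in the rest of the proof.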

11 Proof: FPL vs. BPL Regrets

Proof: for the expected loss, we can just choose all $p_t = p_1$ for $t > 0$, which yields:
$$\sum_{t=1}^{T} M(x_{1:t} + p_1) \cdot x_t \le M(x_{1:T}) \cdot x_{1:T} + W_1 \|p_1\|_\infty.$$

Thus,
$$\sum_{t=1}^{T} E[M(x_{1:t-1} + p_1) \cdot x_t] = \sum_{t=1}^{T} \Big( E[M(x_{1:t-1} + p_1) \cdot x_t] - E[M(x_{1:t} + p_1) \cdot x_t] + E[M(x_{1:t} + p_1) \cdot x_t] \Big)$$
$$\le \sum_{t=1}^{T} \Big[ E[M(x_{1:t-1} + p_1) \cdot x_t] - E[M(x_{1:t} + p_1) \cdot x_t] \Big] + L_T^{\min} + W_1 E[\|p_1\|_\infty].$$

12 Proof: FPL

By definition of the perturbation, $\|p_1\|_\infty \le 1/\epsilon$.

Now, $x_{1:t} + p_1$ and $x_{1:t-1} + p_1$ both follow a uniform distribution over a cube. Thus,
$$E[M(x_{1:t-1} + p_1) \cdot x_t] - E[M(x_{1:t} + p_1) \cdot x_t] \le R\,(1 - \text{fraction of overlap}).$$

Two cubes $[0, 1/\epsilon]^N$ and $v + [0, 1/\epsilon]^N$ overlap over at least the fraction $(1 - \epsilon \|v\|_1)$: if $x \in [0, 1/\epsilon]^N$ but $x \notin v + [0, 1/\epsilon]^N$, then for at least one $i$, $x_i \notin v_i + [0, 1/\epsilon]$, which has probability at most $\epsilon |v_i|$.
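The overlap claim can be checked directly, since the overlap of two axis-aligned cubes factorizes over coordinates. In the sketch below, `overlap_fraction` computes the exact fraction of the cube $[0, 1/\epsilon]^N$ covered by its translate by $v$, and `lower_bound` is the quantity $1 - \epsilon \|v\|_1$; both names and the test vectors are illustrative assumptions.

```python
import random

def overlap_fraction(v, eps):
    """Exact fraction of [0, 1/eps]^N that also lies in v + [0, 1/eps]^N."""
    frac = 1.0
    for vi in v:
        # Along each axis the two intervals share a (1 - eps * |v_i|) fraction, or nothing.
        frac *= max(0.0, 1.0 - eps * abs(vi))
    return frac

def lower_bound(v, eps):
    """The union-bound estimate 1 - eps * ||v||_1."""
    return 1.0 - eps * sum(abs(vi) for vi in v)

example = overlap_fraction([1.0, 2.0, 0.5], eps=0.1)  # 0.9 * 0.8 * 0.95
example_bound = lower_bound([1.0, 2.0, 0.5], eps=0.1)  # 1 - 0.35

rng = random.Random(0)
violations = 0
for _ in range(1000):
    v = [rng.uniform(-3.0, 3.0) for _ in range(5)]
    if overlap_fraction(v, 0.2) < lower_bound(v, 0.2) - 1e-12:
        violations += 1
print(example, example_bound, violations)
```

The exact product is never smaller than the union-bound estimate, which is what the proof needs.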

13 Proof: FPL

Thus,
$$E[M(x_{1:t-1} + p_1) \cdot x_t] - E[M(x_{1:t} + p_1) \cdot x_t] \le \epsilon R \|x_t\|_1 \le \epsilon R X_1.$$

And,
$$E[R_T] \le \epsilon R X_1 T + \frac{W_1}{\epsilon}.$$

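14 Proof: FPL*

Lemma 3: $E[M(x_{1:t-1} + p_1) \cdot x_t] \le e^{\epsilon X_1} E[M(x_{1:t} + p_1) \cdot x_t]$.

Proof:
$$E[M(x_{1:t-1} + p_1) \cdot x_t] = \int_{\mathbb{R}^N} M(x_{1:t-1} + u) \cdot x_t \, d\mu(u)$$
$$= \int_{\mathbb{R}^N} M(x_{1:t} + v) \cdot x_t \, d\mu(x_t + v) \quad \text{(change of var. } v = u - x_t)$$
$$= \int_{\mathbb{R}^N} M(x_{1:t} + v) \cdot x_t \, \underbrace{e^{-\epsilon (\|x_t + v\|_1 - \|v\|_1)}}_{\le\, e^{\epsilon X_1}} \, d\mu(v)$$
$$\le e^{\epsilon X_1} E[M(x_{1:t} + p_1) \cdot x_t].$$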

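15 Proof: FPL*

For $\epsilon \le 1/X_1$, $e^{\epsilon X_1} \le 1 + 2\epsilon X_1$, thus,
$$\sum_{t=1}^{T} E[M(x_{1:t-1} + p_1) \cdot x_t] \le (1 + 2\epsilon X_1) \sum_{t=1}^{T} E[M(x_{1:t} + p_1) \cdot x_t] \le (1 + 2\epsilon X_1)\big(L_T^{\min} + W_1 E[\|p_1\|_\infty]\big).$$

The expected norm of the perturbation is bounded as follows, for any $u \ge 0$:
$$E[\|p_1\|_\infty] = E\Big[\max_{i \in [1, N]} |p_{1,i}|\Big] \le 2 \int_0^{+\infty} \Pr\Big[\max_{i \in [1, N]} p_{1,i} > t\Big] dt$$
$$= 2 \int_0^{u} \Pr\Big[\max_{i \in [1, N]} p_{1,i} > t\Big] dt + 2 \int_u^{+\infty} \Pr\Big[\max_{i \in [1, N]} p_{1,i} > t\Big] dt$$
$$\le 2u + 2N \int_u^{+\infty} \Pr[p_{1,1} > t] \, dt$$
$$= 2u + \frac{N}{\epsilon} e^{-\epsilon u}$$
$$\le \frac{2(1 + \log N)}{\epsilon} \quad \text{(choosing } u = \log N / \epsilon).$$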
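The bound on the expected sup-norm of a Laplace perturbation can be compared with the exact value. The sketch relies on two standard facts: the absolute value of a Laplace variable with density $\frac{\epsilon}{2} e^{-\epsilon|z|}$ is exponential with rate $\epsilon$, and the expected maximum of $N$ independent such exponentials is $H_N / \epsilon$, with $H_N$ the $N$-th harmonic number. The function names and sample sizes are arbitrary choices for the example.

```python
import math
import random

def exact_expected_sup_norm(n, eps):
    """E[max_i |p_i|] for i.i.d. Laplace coordinates: H_n / eps."""
    return sum(1.0 / k for k in range(1, n + 1)) / eps

def stated_bound(n, eps):
    """The bound 2 (1 + log n) / eps used in the proof."""
    return 2.0 * (1.0 + math.log(n)) / eps

def monte_carlo_sup_norm(n, eps, samples=5000, seed=0):
    """Empirical mean of max_i |p_i|, Laplace drawn as a difference of exponentials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += max(abs(rng.expovariate(eps) - rng.expovariate(eps)) for _ in range(n))
    return total / samples

n, eps = 10, 1.0
exact = exact_expected_sup_norm(n, eps)
estimate = monte_carlo_sup_norm(n, eps)
bound = stated_bound(n, eps)
print(exact, estimate, bound)
```

The exact value is well below the stated bound, so the constant in the proof is not tight, which does not affect the form of the final guarantee.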

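16 Expert Setting

$W_1 = 1$, $X_1 = N$, and $R = 1$; for FPL*($\epsilon$),
$$E[L_T] \le (1 + 2N\epsilon) L_T^{\min} + \frac{2(1 + \log N)}{\epsilon}.$$

More favorable bound:
- $x_t \to x_{t,1} e_1, \ldots, x_{t,N} e_N$ (each round is split into $N$ rounds).
- new $L_{NT}^{\min}$ = old $L_T^{\min}$.
- $E[L_T^{\text{old}}] \le E[L_{NT}^{\text{new}}]$.
- new guarantee: for FPL*($\epsilon$), $E[L_T] \le (1 + 2\epsilon) L_T^{\min} + \frac{2(1 + \log N)}{\epsilon}$.
- $E[R_T] \le 2\sqrt{2 L_T^{\min} (1 + \log N)}$.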

17 RWM = FPL

Let FPL($\beta$) be an instance of the general FPL algorithm with a perturbation defined by
$$p_1 = \left[ \frac{\log(-\log(u_1))}{\beta}, \ldots, \frac{\log(-\log(u_N))}{\beta} \right]^\top,$$
where $u_j$ is drawn according to the uniform distribution over $[0, 1]$. Then, FPL($\beta$) and RWM($\beta$) coincide.
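The equivalence can be illustrated by simulation. The sketch below assumes the perturbation $p_i = \log(-\log(u_i)) / \beta$ and assumes that RWM($\beta$) selects expert $i$ with probability proportional to $e^{-\beta L_i}$, where $L_i$ is the cumulative loss of expert $i$; under these assumptions the match follows from the Gumbel-max argument. Function names and the example losses are hypothetical.

```python
import math
import random

def sample_fpl_choice(cum_losses, beta, rng):
    """Pick argmin_i of L_i + log(-log(u_i)) / beta with u_i uniform on (0, 1)."""
    perturbed = []
    for loss in cum_losses:
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() never returns 1.0
            u = rng.random()
        perturbed.append(loss + math.log(-math.log(u)) / beta)
    return min(range(len(cum_losses)), key=lambda i: perturbed[i])

def rwm_distribution(cum_losses, beta):
    """Exponential-weights distribution: probabilities proportional to exp(-beta * L_i)."""
    weights = [math.exp(-beta * loss) for loss in cum_losses]
    z = sum(weights)
    return [w / z for w in weights]

cum_losses = [0.0, 1.0, 2.0]
beta = 1.0
rng = random.Random(0)
samples = 20000
counts = [0] * len(cum_losses)
for _ in range(samples):
    counts[sample_fpl_choice(cum_losses, beta, rng)] += 1
frequencies = [c / samples for c in counts]
target = rwm_distribution(cum_losses, beta)
print(frequencies, target)
```

The empirical selection frequencies of the perturbed leader should agree with the exponential-weights probabilities up to sampling noise.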

18 References

- Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory 50(9).
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press.
- Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT. ACM Press.
- Adam T. Kalai, Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci. 71(3).
- Nick Littlestone. From On-Line to Batch Learning. COLT 1989.
- Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning (2).

19 References

- Nick Littlestone, Manfred K. Warmuth. The Weighted Majority Algorithm. FOCS 1989.
- Tom Mitchell. Machine Learning. McGraw Hill.
- Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12. Polytechnic Institute of Brooklyn.
