On-Line Learning with Path Experts and Non-Additive Losses
Mehryar Mohri (Courant Institute & Google Research)
Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Manfred Warmuth (UC Santa Cruz).

Structured Prediction
- Structured output: $Y = Y_1 \times \cdots \times Y_l$.
- Loss function: $L\colon Y \times Y \to \mathbb{R}_+$, decomposable.
- Example: Hamming loss, $L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$.
- Example: edit-distance loss, $L(y, y') = \frac{1}{l}\, d_{\text{edit}}(y_1 \cdots y_l,\, y'_1 \cdots y'_l)$.

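To make the two example losses concrete, here is a minimal Python sketch; the function names and the unit edit costs are illustrative choices, not from the slides:

```python
def hamming_loss(y, y_prime):
    """Hamming loss: fraction of positions where the two sequences disagree."""
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def edit_loss(y, y_prime):
    """Edit-distance loss: Levenshtein distance normalized by l = len(y)."""
    prev = list(range(len(y_prime) + 1))            # distances from the empty prefix
    for i, a in enumerate(y, 1):
        cur = [i]
        for j, b in enumerate(y_prime, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (a != b))) # substitution / match
        prev = cur
    return prev[-1] / len(y)

print(hamming_loss("abca", "abcb"))   # 0.25
print(edit_loss("kitten", "sitting")) # 3 / 6 = 0.5
```
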
Examples
- Pronunciation modeling.
- Part-of-speech tagging.
- Named-entity recognition.
- Context-free parsing.
- Dependency parsing.
- Machine translation.
- Image segmentation.

Examples: NLP Tasks
- Pronunciation: "I have formulated a" mapped to the phoneme sequence "ay hh ae v f ow r m y ax l ey t ih d ax".
- POS tagging: "The thief stole a car" tagged "D N V D N".
- Context-free parsing/dependency parsing: (figure: a constituency tree with S, VP, and NP nodes and a dependency tree for the sentence "The thief stole a car".)

Examples: Image Segmentation
(figure: example segmented images.)

Ensemble Structured Prediction
- Input: labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$; access to $p$ predictors $h_1, \ldots, h_p \colon X \to Y$; each expert decomposes: $h_j(x) = (h_j^1(x), \ldots, h_j^l(x))$; multiple path experts.
- Problem: how do we learn to combine path predictors to devise a more accurate predictor?

Path Experts
(figure: the path-expert graph; each path from the initial to the final vertex defines one expert.)

On-Line Formulation (Cortes, Kuznetsov, and MM, 2014)
- The learner maintains a distribution $p_t$ over path experts.
- At each round $t$: the learner receives input $x_t$; incurs loss $\mathbb{E}_{h \sim p_t}[L(h(x_t), y_t)] = \sum_{h} p_t(h)\, L(h(x_t), y_t)$; updates the distribution: $p_t \to p_{t+1}$.
- On-line-to-batch conversion and guarantees.

Problem
- Learning: regret guarantees for the best algorithms of the form $R_T = O(\sqrt{T \log N})$; informative even for very large $N$.
- Problem: computational complexity of the algorithm in $O(N)$.
- Can we derive more efficient algorithms when the experts admit some structure and when the loss is decomposable?

This Talk
Can we devise efficient on-line learning algorithms for path experts with non-additive losses?
Examples:
- machine translation (BLEU score).
- computational biology (sequence similarities).
- speech recognition (edit-distance).

Two Solution Families
- Extension of the Randomized Weighted Majority (RWM) algorithm: rational losses; tropical losses.
- Extension of the Follow-the-Perturbed-Leader (FPL) algorithm: rational losses; tropical losses.

Outline
- Additive loss.
- Rational loss.
- Tropical loss.

Randomized Weighted Majority (Littlestone and Warmuth, 1988)

Randomized-Weighted-Majority($\eta$)
 1  for $i \leftarrow 1$ to $N$ do
 2      $w_{1,i} \leftarrow 1$
 3      $p_{1,i} \leftarrow \frac{1}{N}$
 4  for $t \leftarrow 1$ to $T$ do
 5      for $i \leftarrow 1$ to $N$ do
 6          $w_{t+1,i} \leftarrow e^{-\eta\, l_{t,i}}\, w_{t,i}$
 7          $p_{t+1,i} \leftarrow \frac{w_{t+1,i}}{\sum_{j=1}^{N} w_{t+1,j}}$
 8  return $p_{T+1}$

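A direct Python transcription of the pseudocode above; the interface (losses supplied as a $T \times N$ matrix) is an assumed convenience, not from the slide:

```python
import math

def randomized_weighted_majority(losses, eta):
    """RWM (Littlestone & Warmuth, 1988).

    losses: T x N matrix; losses[t][i] is the loss of expert i at round t.
    eta:    learning rate (> 0).
    Returns the final distribution p_{T+1} over the N experts.
    """
    N = len(losses[0])
    w = [1.0] * N                      # w_{1,i} <- 1
    p = [1.0 / N] * N                  # p_{1,i} <- 1/N
    for loss_t in losses:              # for t <- 1 to T
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]
        total = sum(w)                 # normalization constant
        p = [wi / total for wi in w]   # p_{t+1,i} <- w_{t+1,i} / sum_j w_{t+1,j}
    return p
```
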
Example: Online Shortest Path Problems
Path experts:
- sending packets along paths of a network with routers (vertices); delays (losses).
- car route selection in the presence of traffic (loss).

Additive Loss
For path $\pi = e_{02}\, e_{23}\, e_{34}$, $l_t(\pi) = l_t(e_{02}) + l_t(e_{23}) + l_t(e_{34})$.
(figure: a directed graph over vertices 0 to 4 with edges $e_{01}, e_{02}, e_{03}, e_{14}, e_{23}, e_{24}, e_{34}$; each path from 0 to 4 is a path expert.)

RWM + Path Experts (Takimoto and Warmuth, 2002)
- Weight update: at each round, update the weight of each path expert $\pi = e_1 \cdots e_n$: $w_t[\pi] \leftarrow w_{t-1}[\pi]\, e^{-\eta\, l_t(\pi)}$; equivalent to the edge-level update $w_t[e_i] \leftarrow w_{t-1}[e_i]\, e^{-\eta\, l_t(e_i)}$, e.g., $w_{t-1}[e_{14}]\, e^{-\eta\, l_t(e_{14})}$ on edge $e_{14}$.
- Sampling: need to make the graph/automaton stochastic.

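In code, the edge-level update is just a multiplicative decay of a dictionary of edge weights; the edge names and loss values below are illustrative:

```python
import math

def update_edge_weights(w, edge_losses, eta):
    """Edge-level RWM update: instead of updating each of the exponentially
    many path experts, decay each edge weight by exp(-eta * edge loss)."""
    return {e: w[e] * math.exp(-eta * edge_losses[e]) for e in w}

w = {e: 1.0 for e in ("e01", "e02", "e03", "e14", "e23", "e24", "e34")}
w = update_edge_weights(w, {"e01": 0.3, "e02": 0.1, "e03": 0.5, "e14": 0.2,
                            "e23": 0.4, "e24": 0.0, "e34": 0.6}, eta=0.5)
# The weight of a path is then the product of its edge weights, e.g.:
w_path = w["e02"] * w["e23"] * w["e34"]
```
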
Weight Pushing Algorithm (MM 1997; MM, 2009)
Weighted directed graph $G = (Q, E, w)$ with a set of initial vertices $I \subseteq Q$ and final vertices $F \subseteq Q$:
- for any $q \in Q$, $d[q] = \sum_{\pi \in P(q, F)} w[\pi]$;
- for any $e \in E$ with $d[\mathrm{orig}(e)] \neq 0$, $w'[e] \leftarrow d[\mathrm{orig}(e)]^{-1}\, w[e]\, d[\mathrm{dest}(e)]$;
- for any $q \in I$, initial weight $\lambda(q) \leftarrow d[q]$.

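A minimal sketch of weight pushing over $(+, \times)$ on an acyclic graph, computing $d[q]$ by a reverse topological pass and then normalizing each edge; the graph encoding (topologically ordered states, edge triples) is an assumption:

```python
from collections import defaultdict

def weight_push(states, edges, initial, finals):
    """Weight pushing over (+, x) on an acyclic graph.

    states: states listed in topological order.
    edges:  list of (origin, destination, weight) triples.
    Returns (initial_weight, pushed_edges) with pushed weight
    w'[e] = d[orig(e)]^{-1} * w[e] * d[dest(e)].
    """
    out = defaultdict(list)
    for o, dst, w in edges:
        out[o].append((dst, w))
    d = {q: (1.0 if q in finals else 0.0) for q in states}
    for q in reversed(states):          # d[q] = total weight of paths q -> F
        for dst, w in out[q]:
            d[q] += w * d[dst]
    pushed = [(o, dst, w * d[dst] / d[o]) for o, dst, w in edges if d[o] != 0]
    return d[initial], pushed

# Example on a tiny acyclic graph (illustrative weights):
lam, pushed = weight_push([0, 1, 2], [(0, 1, 2.0), (0, 2, 1.0), (1, 2, 3.0)],
                          initial=0, finals={2})
# d[2]=1, d[1]=3, d[0]=2*3+1=7; lam=7, and at every non-final state the
# pushed outgoing weights sum to 1 (here 6/7 + 1/7 at state 0).
```
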
Illustration
(figure: an automaton before and after weight pushing; the total weight $d[0] = 15$ moves to the initial state (0/15), and arc weights are normalized, e.g., $c/5 \to c/(5/15)$, $e/4 \to e/(4/9)$, $f/5 \to f/(5/9)$.)

Properties
- Stochasticity: for any $q \in Q$ with $d[q] \neq 0$,
  $\sum_{e \in E[q]} w'[e] = \sum_{e \in E[q]} \frac{w[e]\, d[\mathrm{dest}(e)]}{d[q]} = \frac{d[q]}{d[q]} = 1$.
- Invariance: path weight preserved. For a path $\pi = e_1 \cdots e_n$ from $I$ to $F$:
  $\lambda(\mathrm{orig}(e_1))\, w'[e_1] \cdots w'[e_n] = d[\mathrm{orig}(e_1)]\, \frac{w[e_1]\, d[\mathrm{dest}(e_1)]}{d[\mathrm{orig}(e_1)]}\, \frac{w[e_2]\, d[\mathrm{dest}(e_2)]}{d[\mathrm{dest}(e_1)]} \cdots = w[e_1] \cdots w[e_n]\, d[\mathrm{dest}(e_n)] = w[e_1] \cdots w[e_n] = w[\pi]$,
  since the product telescopes and $d[\mathrm{dest}(e_n)] = 1$ at the final state.

Shortest-Distance Computation
- Acyclic case: special instance of a generic single-source shortest-distance algorithm working with an arbitrary queue discipline and any $k$-closed semiring (MM, 2002).
- Linear-time algorithm, $O(|Q| + |E|)$, with the topological-order queue discipline.

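A sketch of the acyclic special case, with the semiring operations passed in as parameters so the same code covers the tropical and probability semirings; the graph encoding is an assumption:

```python
from collections import defaultdict

def shortest_distance(states, edges, source, plus, times, zero, one):
    """Generic single-source shortest distance in a semiring, acyclic case.

    states must be listed in topological order, so each state is relaxed
    exactly once and the algorithm runs in O(|Q| + |E|).
    """
    out = defaultdict(list)
    for o, dst, w in edges:
        out[o].append((dst, w))
    d = {q: zero for q in states}
    d[source] = one
    for q in states:                    # topological-order queue discipline
        for dst, w in out[q]:
            d[dst] = plus(d[dst], times(d[q], w))
    return d

edges = [(0, 1, 0.3), (0, 2, 0.7), (1, 2, 0.4), (2, 3, 1.0)]
# Tropical semiring (min, +): classical single-source shortest path.
d_trop = shortest_distance([0, 1, 2, 3], edges, 0, min, lambda a, b: a + b,
                           float("inf"), 0.0)
# Probability semiring (+, x): total weight of all paths from the source.
d_prob = shortest_distance([0, 1, 2, 3], edges, 0, lambda a, b: a + b,
                           lambda a, b: a * b, 0.0, 1.0)
```
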
Outline
- Additive loss.
- Rational loss.
- Tropical loss.

Weighted Transducers
$T(x, y)$ = sum of the weights of all accepting paths with input $x$ and output $y$.
Example: $T(abb, baa) = .1 \times .2 \times .3 \times .1 + .5 \times .3 \times .6 \times .1$.
(figure: a four-state weighted transducer with arcs such as $a{:}b/0.1$, $b{:}a/0.2$, $b{:}a/0.3$ and final weight $0.1$.)

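The definition can be checked directly by summing over accepting paths. A minimal sketch for epsilon-free transducers over the probability semiring; the dictionary encoding and the toy machine below are illustrative, not the figure's machine:

```python
from functools import lru_cache

def transducer_weight(arcs, finals, start, x, y):
    """T(x, y): sum over all accepting paths mapping input x to output y.

    arcs:   dict state -> list of (in_symbol, out_symbol, weight, next_state).
    finals: dict state -> final weight.
    Epsilon-free transducer, probability semiring (+, x).
    """
    @lru_cache(maxsize=None)
    def w(state, i, j):
        # Final weight counts only when both strings are fully consumed.
        total = finals.get(state, 0.0) if i == len(x) and j == len(y) else 0.0
        for a, b, wt, nxt in arcs.get(state, []):
            if i < len(x) and j < len(y) and a == x[i] and b == y[j]:
                total += wt * w(nxt, i + 1, j + 1)
        return total
    return w(start, 0, 0)

arcs = {0: [("a", "b", 0.5, 1)], 1: [("b", "a", 0.4, 1)]}
print(transducer_weight(arcs, {1: 1.0}, 0, "ab", "ba"))  # 0.5 * 0.4 = 0.2
```
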
Weighted Determinization (MM 1997)
(figure: a nondeterministic weighted automaton and its determinized equivalent, whose states are weighted subsets of the original states such as $\{(1,0),(2,1)\}$.)

Composition (Pereira and Riley, 1997; MM et al. 1996)
Composition of two weighted transducers $T_1$ and $T_2$:
$(T_1 \circ T_2)(x, y) = \bigoplus_{z \in \Delta^*} T_1(x, z) \otimes T_2(z, y)$.
(figure: two weighted transducers and their composition; composed states are pairs such as $(0,0)$, $(1,1)$, $(3,2)$, and arc weights multiply, e.g., $.04$.)

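A sketch of epsilon-free composition under the same dictionary encoding as above; an arc $(a{:}b/w_1)$ of $T_1$ matches an arc $(b{:}c/w_2)$ of $T_2$ and yields $(a{:}c/w_1 w_2)$ between pair states. Real implementations (e.g., OpenFst) only build reachable pairs, while this naive version enumerates all of them:

```python
from collections import defaultdict

def compose(arcs1, finals1, arcs2, finals2):
    """Epsilon-free composition of two weighted transducers over (+, x).

    arcs_i:   dict state -> list of (in_sym, out_sym, weight, next_state).
    finals_i: dict state -> final weight.
    Mirrors (T1 o T2)(x, y) = sum_z T1(x, z) * T2(z, y).
    """
    arcs = defaultdict(list)
    for q1, out1 in arcs1.items():
        for q2, out2 in arcs2.items():
            for a, b, w1, n1 in out1:
                for b2, c, w2, n2 in out2:
                    if b == b2:  # output of T1 matches input of T2
                        arcs[(q1, q2)].append((a, c, w1 * w2, (n1, n2)))
    finals = {(f1, f2): w1 * w2 for f1, w1 in finals1.items()
              for f2, w2 in finals2.items()}
    return dict(arcs), finals
```
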
Rational Losses
- Definition: a rational kernel $K$ is a kernel computed by a weighted transducer $U$: $K(x, y) = U(x, y)$.
- Theorem: any weighted transducer of the form $U = T \circ T^{-1}$ over the semiring $(\mathbb{R}_+, +, \times, 0, 1)$ defines a PDS rational kernel.
- Definition: the rational loss defined by a weighted transducer $U$ over the semiring $(\mathbb{R}_+, +, \times, 0, 1)$: $\forall x, y \in \Delta^*$, $L_U(x, y) = -\log U(x, y)$.

Bigram Transducer
Bigram transducer $T_{\text{bigram}}$ defined over the probability semiring:
(figure: a three-state transducer; states 0 and 2 carry deletion self-loops $a{:}\epsilon/1$, $b{:}\epsilon/1$, and arcs $a{:}a/1$, $b{:}b/1$ advance from state 0 to 1 to the final state 2, so the output is one bigram of the input.)
- Property: $\forall x \in \Sigma^*, u \in \Sigma^2$, $T_{\text{bigram}}(x, u) = |x|_u$.
- Bigram kernel transducer: $U_{\text{bigram}} = T_{\text{bigram}} \circ T_{\text{bigram}}^{-1}$; $U_{\text{bigram}}(x, y) = \sum_{u \in \Sigma^2} |x|_u\, |y|_u$.

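Since $T_{\text{bigram}}$ simply counts bigram occurrences, the kernel it induces can be checked in a few lines of Python (for strings; the function names are illustrative):

```python
from collections import Counter

def bigram_counts(x):
    """|x|_u for every bigram u of the string x (what T_bigram computes)."""
    return Counter(x[i:i + 2] for i in range(len(x) - 1))

def bigram_kernel(x, y):
    """U_bigram(x, y) = sum over bigrams u of |x|_u * |y|_u."""
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(n * cy[u] for u, n in cx.items())

print(bigram_kernel("abab", "abb"))  # "ab" occurs 2x and 1x -> 2*1 = 2
```
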
Path Expert Automata
(figures: the path-expert graph over vertices 0 to 4; the prediction automaton $A_t$ at time $t$, labeling each edge with a prediction symbol ($a$ or $b$); and the expert-to-prediction transducer $T_t$ at time $t$, with transitions such as $e_{02}{:}b$ and $e_{23}{:}a$ pairing each edge with its prediction.)

Questions
- Weight update (with a rational loss): how to compute, for each path expert $\pi$, $\exp\big({-\eta} \sum_{t=1}^{T} L(\pi(t), y_t)\big)$?
- Sampling: how can we sample according to the distribution defined by these weights?

η-power Semiring
For any $\eta > 0$, the system $S_\eta = (\mathbb{R}_+ \cup \{+\infty\}, \oplus_\eta, \times, 0, 1)$ where $\forall x, y \in \mathbb{R}_+ \cup \{+\infty\}$, $x \oplus_\eta y = \big(x^{1/\eta} + y^{1/\eta}\big)^\eta$.
Semiring morphism:
$\Phi_\eta \colon (\mathbb{R}_+ \cup \{+\infty\}, +, \times, 0, 1) \to S_\eta$, $x \mapsto x^\eta$.

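A quick numerical sanity check of the morphism property, with $\oplus_\eta$ and $\Phi_\eta$ written out (a sketch added for illustration, not part of the original slides):

```python
def oplus(x, y, eta):
    """x (+)_eta y = (x**(1/eta) + y**(1/eta))**eta."""
    return (x ** (1.0 / eta) + y ** (1.0 / eta)) ** eta

def phi(x, eta):
    """Semiring morphism x -> x**eta from (R+, +, x) into S_eta."""
    return x ** eta

# phi maps + to (+)_eta and preserves x: phi(x+y) = phi(x) (+)_eta phi(y).
eta, x, y = 0.5, 2.0, 3.0
assert abs(phi(x + y, eta) - oplus(phi(x, eta), phi(y, eta), eta)) < 1e-9
assert abs(phi(x * y, eta) - phi(x, eta) * phi(y, eta)) < 1e-9
```
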
η-power Semiring WFSTs
- $Y_t$: WFA over $S_\eta$ accepting only $y_t$, with all weights set to 1.
- $T_t$: expert-to-prediction WFST over $S_\eta$ with all weights set to 1.
- $\widetilde{U}$: WFST over $S_\eta$ derived from $U$ by changing each weight $x$ into $x^\eta$.

Path Weights
Proposition: define the following WFAs over $S_\eta$: $V_t = \mathrm{Det}(\Pi(Y_t \circ \widetilde{U} \circ T_t))$, where $\Pi$ denotes projection onto the path-expert side, and $W_t = V_1 \circ \cdots \circ V_t$. Then, for any $t \in [1, T]$ and path expert $\pi$,
$W_t(\pi) = e^{-\eta \sum_{s=1}^{t} L_U(y_s, \pi(s))}$.
Proof: observe that
$V_t(\pi) = \Pi(Y_t \circ \widetilde{U} \circ T_t)(\pi) = \bigoplus_{z_1, z_2} Y_t(z_1)\, \widetilde{U}(z_1, z_2)\, T_t(z_2, \pi) = U(y_t, \pi(t))^\eta = e^{-\eta\, L_U(y_t, \pi(t))}$.

Rational Randomized Weighted Majority Algorithm

RRWM($T$)
 1  $W_0 \leftarrow \mathbf{1}$  (the deterministic one-state WFA over the semiring $S_\eta$)
 2  for $t \leftarrow 1$ to $T$ do
 3      $x_t \leftarrow$ Receive()
 4      $T_t \leftarrow$ PathExpertPredictionTransducer($x_t$)
 5      $V_t \leftarrow \mathrm{Det}(\Pi(Y_t \circ \widetilde{U} \circ T_t))$
 6      $W_t \leftarrow W_{t-1} \circ V_t$
 7      $W_t \leftarrow$ WeightPush($W_t, (+, \times)$)
 8      $\widehat{y}_{t+1} \leftarrow$ Sample($W_t$)
 9  return $W_T$

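Line 8 samples a path expert from $W_t$; after the weight pushing on line 7, $W_t$ is stochastic, so sampling reduces to a random walk from the initial state. A minimal sketch, assuming the automaton is given as plain dictionaries (this is not the OpenFst API):

```python
import random

def sample_path(arcs, finals, start):
    """Sample a path expert from a weight-pushed (stochastic) automaton.

    arcs:   dict state -> list of (label, prob, next_state); after pushing,
            the outgoing probabilities plus the stopping probability are
            assumed to sum to 1 at every state.
    finals: dict state -> stopping probability.
    Returns the sampled path as a list of labels.
    """
    path, q = [], start
    while True:
        r = random.random() - finals.get(q, 0.0)
        if r < 0 or not arcs.get(q):
            return path                 # stop at a final (or dead-end) state
        for label, p, nxt in arcs[q]:
            r -= p
            if r < 0:
                break
        # (if floating-point slack keeps r >= 0, the last arc is taken)
        path.append(label)
        q = nxt
```
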
Time Complexity
Polynomial-time overall complexity:
- worst-case complexity of determinization: exponential;
- but the complexity is only polynomial in this context;
- proof based on new string-combinatorics arguments.

Regret Guarantee
Theorem: let $N$ be the total number of path experts and $M$ an upper bound on the loss of any path expert. Then the following upper bound holds for the regret of RRWM:
$\mathbb{E}[R_T(\mathrm{RRWM})] \leq 2M \sqrt{T \log N}$.

Outline
- Additive loss.
- Rational loss.
- Tropical loss.

Tropical Losses
Definition: the tropical loss defined by a weighted transducer $U$ over the semiring $(\mathbb{R} \cup \{-\infty, +\infty\}, \min, +, +\infty, 0)$: $\forall x, y \in \Delta^*$, $L_U(x, y) = U(x, y)$.

Edit-Distance Transducer
Edit-distance weighted transducer $U_{\text{edit}}$ defined over the tropical semiring $(\mathbb{R} \cup \{-\infty, +\infty\}, \min, +, +\infty, 0)$.

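The edit distance $U_{\text{edit}}(x, y)$ is exactly a tropical shortest distance through the alignment lattice whose states are pairs $(i, j)$ of string positions. A sketch with the usual unit costs exposed as parameters (the parametrization is an assumption for illustration):

```python
def edit_distance_tropical(x, y, sub=1.0, ins=1.0, dele=1.0):
    """U_edit(x, y) as a tropical (min, +) shortest distance.

    States of the alignment lattice are pairs (i, j); arcs are
    matches/substitutions, insertions, and deletions.
    """
    INF = float("inf")
    d = [[INF] * (len(y) + 1) for _ in range(len(x) + 1)]
    d[0][0] = 0.0
    for i in range(len(x) + 1):              # topological order on (i, j)
        for j in range(len(y) + 1):
            if i < len(x):                   # deletion arc (i, j) -> (i+1, j)
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + dele)
            if j < len(y):                   # insertion arc (i, j) -> (i, j+1)
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + ins)
            if i < len(x) and j < len(y):    # match / substitution arc
                cost = 0.0 if x[i] == y[j] else sub
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
    return d[len(x)][len(y)]

print(edit_distance_tropical("kitten", "sitting"))  # 3.0
```
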
Algorithm
Syntactically the same algorithm! Only change the semiring from $S_\eta$ to $([0, 1], \max, \times, 0, 1)$: since $x \mapsto e^{-\eta x}$ maps $\min$ to $\max$ and $+$ to $\times$, the weight updates for a tropical loss are carried out in this semiring.

Conclusion
On-line learning algorithms for path experts with rational or tropical losses:
- Rational and Tropical Randomized Weighted Majority.
- Rational and Tropical Follow-the-Perturbed-Leader.
- Polynomial-time algorithms for rational losses.
- Applications to MT, ASR, computational biology.
- Implementation using OpenFst.