Learning Weighted Automata


1 Learning Weighted Automata
Joint work with Borja Balle (Amazon Research).
Mehryar Mohri, Courant Institute & Google Research.

2 Weighted Automata (WFAs)

3 Motivation
Weighted automata (WFAs) are used in:
- image processing (Kari, 1993).
- automatic speech recognition (MM, Pereira, Riley, 1996, 2008).
- speech synthesis (Sproat, 1995; Allauzen, MM, Riley, 2004).
- machine translation (e.g., Iglesias et al., 2011).
- many other NLP tasks (very long list of refs).
- bioinformatics (Durbin et al., 1998).
- optical character recognition (Breuel, 2008).
- model checking (Baier et al., 2009; Aminof et al., 2011).
- machine learning (Cortes, Kuznetsov, MM, Warmuth, 2015).

4 Motivation
Theory: rational power series, extensively studied (Eilenberg, 1993; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986; Berstel and Reutenauer, 1988).
Algorithms (see survey chapter: MM, 2009): rational operations, intersection or composition, epsilon-removal, determinization, minimization, disambiguation.

5 Learning Automata
Classical results:
- passive learning (Gold, 1978; Angluin, 1978; Pitt and Warmuth, 1993).
- active learning (model with membership and equivalence queries) (Angluin, 1987; Bergadano and Varricchio, 1994, 1996, 2000).
Spectral learning:
- algorithms (Hsu et al., 2009; Bailly et al., 2009; Balle and MM, 2012).
- natural language processing (Balle et al., 2014).
- reinforcement learning (Boots et al., 2009; Hamilton et al., 2013).

6 Learning Guarantees
Existing analyses:
- (Hsu et al., 2009; Denis et al., 2016): statistical consistency, finite-sample guarantees in the realizable case.
- (Balle and MM, 2012): algorithm-dependent finite-sample guarantee based on a stability analysis.
- (Kulesza et al., 2014): algorithm-dependent guarantees with a distributional assumption (data drawn from some WFA).
Can we derive general theoretical guarantees for learning WFAs?

7 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

8 Learning Scenario
Training data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ drawn i.i.d. according to some distribution $D$.
Problem: find a WFA $A$ in a hypothesis set $H$ with small expected loss
$$L(A) = \mathbb{E}_{(x, y) \sim D}[L(A(x), y)].$$
Note: the problem is not assumed realizable (the distribution need not be generated by a probabilistic WFA).

9 Emp. Rademacher Complexity
Definition: $G$ family of functions mapping from a set $Z$ to $[a, b]$; sample $S = (z_1, \ldots, z_m)$; $\sigma_i$'s (Rademacher variables): independent uniform random variables taking values in $\{-1, +1\}$.
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_\sigma\Big[\sup_{g \in G} \frac{1}{m}\,\boldsymbol{\sigma} \cdot (g(z_1), \ldots, g(z_m))\Big] = \mathbb{E}_\sigma\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
which measures the correlation of $G$ with random noise.

10 Emp. Rademacher Complexity (cont.)
Rademacher complexity of $G$: $\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big]$.
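To make the definition concrete, here is a minimal Monte Carlo sketch (not from the slides; the finite function class, represented by the hypothetical matrix G_vals, is illustrative) that estimates the empirical Rademacher complexity:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class G on a sample S; G_vals[j, i] = g_j(z_i), all names illustrative.
def empirical_rademacher(G_vals, n_trials=10000, seed=0):
    rng = np.random.default_rng(seed)
    n_funcs, m = G_vals.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(G_vals @ sigma) / m       # sup_g (1/m) sum_i sigma_i g(z_i)
    return total / n_trials

# Example: two constant functions g(z) = 0 and g(z) = 1 on a sample of size 50.
G_vals = np.vstack([np.zeros(50), np.ones(50)])
print(empirical_rademacher(G_vals))  # E[max(0, mean(sigma))] = O(1/sqrt(m))
```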

11 Rademacher Complexity Bound
Theorem: let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in G$:
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big)$ (Koltchinskii and Panchenko, 2002; MM et al., 2012).

12 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

13 Learning Automata
Classical formulation: given a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\Sigma^* \times \{0, 1\})^m$, find the smallest automaton $A$ consistent with the sample:
$$\min_A \|A\|_0 \quad \text{s.t.} \quad \forall i \in [m],\ A(x_i) = y_i.$$
NP-complete problem (Gold, 1978; Angluin, 1978); even polynomial approximation is NP-hard (Pitt and Warmuth, 1993).
Not the right formulation.

14 Analogy: Linear Classifiers
Sparse learning formulation:
$$\min_{w \in \mathbb{R}^N} \|w\|_0 \quad \text{s.t.} \quad Aw = b.$$
Non-convex, NP-hard optimization problem; not the right formulation.
Alternative: relax to a convex norm (e.g., the $l_1$ norm), as in the sketch below.
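As a hedged illustration of this relaxation (the problem instance is synthetic and SciPy is assumed available), the $l_1$ version, basis pursuit, can be solved as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

# Convex relaxation of min ||w||_0 s.t. Aw = b: basis pursuit,
# min ||w||_1 s.t. Aw = b, as an LP with w = w_plus - w_minus >= 0.
rng = np.random.default_rng(0)
n, N = 10, 30
A = rng.standard_normal((n, N))
w_true = np.zeros(N)
w_true[[2, 17]] = [1.5, -2.0]            # 2-sparse ground truth
b = A @ w_true

c = np.ones(2 * N)                       # sum(w_plus) + sum(w_minus) = ||w||_1
A_eq = np.hstack([A, -A])                # A (w_plus - w_minus) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * N))
w = res.x[:N] - res.x[N:]
print(np.round(w, 3))                    # recovers sparse w_true (w.h.p. for random A)
```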

15 Questions
What is the appropriate norm to use for learning WFAs? Which hypothesis sets should we consider?
- description in terms of the Hankel matrix.
- description in terms of transition matrices.
- description in terms of a function norm.

16 WFA - Definition
A WFA $A$ over a semiring $(\mathbb{S}, \oplus, \otimes, \bar{0}, \bar{1})$ and alphabet $\Sigma$, with a finite set of states $Q_A$, is defined by:
- an initial weight vector $\alpha_A \in \mathbb{S}^{Q_A}$;
- a final weight vector $\beta_A \in \mathbb{S}^{Q_A}$;
- transition weight matrices $A_a \in \mathbb{S}^{Q_A \times Q_A}$, $a \in \Sigma$.
Function defined: for any $x = x_1 \cdots x_k \in \Sigma^*$,
$$A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k}\, \beta_A.$$
Notation: $A_x = A_{x_1} \cdots A_{x_k}$.
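A minimal sketch (not from the slides, names illustrative) of WFA evaluation over the real semiring $(+, \times)$, computing $A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k} \beta_A$ by left-multiplying the state vector:

```python
import numpy as np

def wfa_eval(alpha, beta, trans, x):
    """Compute A(x) = alpha^T A_{x_1} ... A_{x_k} beta over the real semiring."""
    v = alpha
    for symbol in x:
        v = trans[symbol].T @ v      # accumulates (alpha^T A_{x_1} ... A_{x_j})^T
    return float(v @ beta)

# Toy 2-state WFA over {a, b} computing A(x) = number of a's in x.
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
trans = {
    "a": np.array([[1.0, 1.0], [0.0, 1.0]]),  # A_a
    "b": np.eye(2),                           # A_b
}
print(wfa_eval(alpha, beta, trans, "aab"))    # alpha^T A_a A_a A_b beta = 2.0
```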

17 WFA - Illustration
[Figure: an example WFA, showing state numbers, initial and final weights, together with its vectors $\alpha_A$, $\beta_A$ and transition matrices $A_a$, $A_b$.]

18 Hankel Matrix
Definition: the Hankel matrix $H_f$ of a function $f: \Sigma^* \to \mathbb{R}$ is the infinite matrix defined by
$$\forall u, v \in \Sigma^*, \quad H_f(u, v) = f(uv).$$
Redundancy: $f(x)$ appears in all entries $(u, v)$ with $x = uv$.
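A small sketch (illustrative) that materializes a finite sub-block of $H_f$ for chosen prefix and suffix sets, using the "count the a's" function of the toy WFA above:

```python
import numpy as np

# Build the finite sub-block of H_f(u, v) = f(uv) indexed by the given
# prefixes u (rows) and suffixes v (columns).
def hankel_block(f, prefixes, suffixes):
    return np.array([[f(u + v) for v in suffixes] for u in prefixes])

f = lambda x: float(x.count("a"))   # f(x) = number of 'a's in x
H = hankel_block(f, ["", "a", "b", "ab"], ["", "a", "b", "ab"])
print(H)
# Redundancy: f("ab") appears at ("", "ab"), ("a", "b"), and ("ab", "").
```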

19 Theorem of Fliess
Theorem (Fliess, 1974): $\mathrm{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\mathrm{rank}(H_f)$ states.

20 Theorem of Fliess
Theorem (Fliess, 1974): $\mathrm{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\mathrm{rank}(H_f)$ states.
Proof: for any $u, v \in \Sigma^*$, if $H$ is the Hankel matrix of $A$, then
$$H(u, v) = A(uv) = (\alpha_A^\top A_u)(A_v \beta_A).$$
Thus, $H = P_A S_A^\top$, where $P_A \in \mathbb{R}^{\Sigma^* \times Q_A}$ has rows $\alpha_A^\top A_u$ and $S_A \in \mathbb{R}^{\Sigma^* \times Q_A}$ has rows $(A_v \beta_A)^\top$.
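Continuing the toy example, a sketch (all names illustrative) of the rank factorization $H = P_A S_A^\top$ from the proof, restricted to finite prefix/suffix sets:

```python
import numpy as np

# 2-state "count the a's" WFA from the earlier sketch.
alpha = np.array([1.0, 0.0]); beta = np.array([0.0, 1.0])
trans = {"a": np.array([[1.0, 1.0], [0.0, 1.0]]), "b": np.eye(2)}

def matprod(x):
    M = np.eye(2)
    for a in x:
        M = M @ trans[a]
    return M                                            # A_x = A_{x_1} ... A_{x_k}

prefixes = ["", "a", "b", "aa"]
suffixes = ["", "a", "b"]
P = np.array([alpha @ matprod(u) for u in prefixes])    # rows alpha^T A_u
S = np.array([matprod(v) @ beta for v in suffixes])     # rows (A_v beta)^T
H = P @ S.T                                             # H(u, v) = A(uv)
assert np.allclose(H, [[alpha @ matprod(u + v) @ beta for v in suffixes]
                       for u in prefixes])
print(np.linalg.matrix_rank(H))   # <= 2, the number of states (Fliess)
```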

21 Standardization
(Schützenberger, 1961; Cardon and Crochemore, 1980)

22 Hypothesis Sets
In view of the theorem of Fliess, a natural choice is
$$H_0 = \{A : \mathrm{rank}(H_A) < r\}$$
for some $r < +\infty$. But rank does not define a convex function (it is the equivalent of norm-0 for column vectors). Instead, definitions based on the nuclear norm and, more generally, Schatten $p$-norms:
$$H_p = \{A : \|H_A\|_p < r\}, \quad \text{with} \quad \|H_A\|_p = \Big[\sum_i \sigma_i^p(H_A)\Big]^{\frac{1}{p}}.$$

23 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

24 Schatten Norms
Common choices for $p$:
- $p = 1$: nuclear norm (or trace norm) $\|A\|_1 = \mathrm{Tr}\big[\sqrt{A^\top A}\big]$.
- $p = 2$: Frobenius norm $\|A\|_2 = \sqrt{\mathrm{Tr}[A^\top A]}$.
- $p = +\infty$: spectral norm $\|A\|_{+\infty} = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)$.
Properties:
- Hölder's inequality: for $p, p^* \ge 1$ with $\frac{1}{p} + \frac{1}{p^*} = 1$, $\langle A, B\rangle \le \|A\|_p \|B\|_{p^*}$.
- von Neumann's trace inequality: $\langle A, B\rangle \le \sum_i \sigma_i(A)\, \sigma_i(B)$.
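A minimal sketch (illustrative) computing Schatten $p$-norms from singular values, covering the three special cases above and spot-checking Hölder's inequality for $(p, p^*) = (1, \infty)$:

```python
import numpy as np

def schatten_norm(A, p):
    s = np.linalg.svd(A, compute_uv=False)  # singular values of A
    if np.isinf(p):
        return s.max()                       # spectral norm
    return (s ** p).sum() ** (1.0 / p)

A = np.array([[3.0, 0.0], [4.0, 5.0]])
print(schatten_norm(A, 1))        # nuclear norm: sum of singular values
print(schatten_norm(A, 2))        # Frobenius norm: equals np.linalg.norm(A)
print(schatten_norm(A, np.inf))   # spectral norm: largest singular value

# Hoelder check: <A, B> <= ||A||_1 ||B||_inf (Frobenius inner product).
B = np.array([[1.0, 2.0], [0.0, 1.0]])
assert np.tensordot(A, B) <= schatten_norm(A, 1) * schatten_norm(B, np.inf)
```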

25 Emp. Rademacher Complexity
By definition of the dual norm (or Hölder's inequality), for a sample $S = (x_1, \ldots, x_m)$ and any decomposition $x_i = u_i v_i$,
$$\widehat{\mathfrak{R}}_S(H_p) = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_p} \sum_{i=1}^m \sigma_i\, e_{u_i}^\top H_A\, e_{v_i}\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\|H_A\|_p \le r} \Big\langle \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top,\ H_A \Big\rangle\Big] \le \frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_{p^*}\Big].$$

26 Rad. Complexity for p = 2
Lemma: $\widehat{\mathfrak{R}}_S(H_2) \le \frac{r}{\sqrt{m}}$.
Proof: since $p^* = 2$ for $p = 2$,
$$\widehat{\mathfrak{R}}_S(H_2) \le \frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big] \le \frac{r}{m}\sqrt{\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]} = \frac{r}{m}\sqrt{\sum_{i,j=1}^m \mathbb{E}[\sigma_i \sigma_j]\,\big\langle e_{u_i} e_{v_i}^\top, e_{u_j} e_{v_j}^\top\big\rangle} = \frac{r}{m}\sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top, e_{u_i} e_{v_i}^\top\big\rangle} = \frac{r}{\sqrt{m}}.$$

27 Lower Bound
By the Khintchine-Kahane inequality,
$$\frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_{p^*}\Big] \ge \frac{r}{\sqrt{2}\,m}\sqrt{\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]} = \frac{r}{\sqrt{2}\,m}\sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top, e_{u_i} e_{v_i}^\top\big\rangle} = \frac{1}{\sqrt{2}}\,\frac{r}{\sqrt{m}}.$$
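A quick numerical sanity check of the key step (illustrative, not from the slides): with $m$ distinct pairs $(u_i, v_i)$, the matrix $\sum_i \sigma_i e_{u_i} e_{v_i}^\top$ has exactly $m$ entries equal to $\pm 1$, so its Frobenius norm is $\sqrt{m}$ for every sign vector, matching the upper bound and, up to the $1/\sqrt{2}$ factor, the lower bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
for _ in range(5):
    sigma = rng.choice([-1.0, 1.0], size=m)
    M = np.zeros((m, m))
    M[np.arange(m), np.arange(m)] = sigma   # pairs (u_i, v_i) = (i, i), all distinct
    assert np.isclose(np.linalg.norm(M), np.sqrt(m))  # Frobenius norm
print("Frobenius norm is sqrt(m) =", np.sqrt(m))
```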

28 Generalization Bound
Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_2$,
$$L(A) \le \widehat{L}_S(A) + \frac{2\mu_p\, r}{\sqrt{m}} + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$.
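As a worked example (with made-up numbers), the bound can be evaluated directly:

```python
import numpy as np

# H_2 generalization bound from the slide:
# L(A) <= L_S(A) + 2 mu_p r / sqrt(m) + M sqrt(log(1/delta) / (2m)),
# with mu_p = p M^(p-1).
def h2_bound(emp_loss, p, M, r, m, delta):
    mu_p = p * M ** (p - 1)
    return emp_loss + 2 * mu_p * r / np.sqrt(m) + M * np.sqrt(np.log(1 / delta) / (2 * m))

# E.g., squared loss (p = 2) bounded by M = 1, norm radius r = 5:
print(h2_bound(emp_loss=0.10, p=2, M=1.0, r=5.0, m=10**5, delta=0.05))  # ~0.17
```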

29 Proof
By Talagrand's contraction lemma,
$$\widehat{\mathfrak{R}}_S\big(\{(x, y) \mapsto L(A(x), y) : A \in H_2\}\big) = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, |A(x_i) - y_i|^p\Big] \le \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i \big(A(x_i) - y_i\big)\Big] \quad (x \mapsto |x|^p \text{ is } \mu_p\text{-Lipschitz})$$
$$= \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big] + \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sum_{i=1}^m \sigma_i\, y_i\Big] = \mu_p\, \widehat{\mathfrak{R}}_S(H_2),$$
since $\mathbb{E}_\sigma\big[\sum_i \sigma_i y_i\big] = 0$.

30 Rad. Complexity for p = 1
Lemma:
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{r}{m}\Big[c_1 \log(2m + 1) + c_2 \sqrt{W_S \log(2m + 1)}\Big],$$
where $W_S = \min_{\text{decomp.}} \max\{U_S, V_S\}$, with
$$U_S = \max_{u \in \Sigma^*} |\{i : u_i = u\}|, \qquad V_S = \max_{v \in \Sigma^*} |\{i : v_i = v\}|.$$
Proof: apply the matrix Bernstein bound with $M = 1$, $d \le 2m$, $V_1 = \sum_i e_{u_i} e_{u_i}^\top$, $V_2 = \sum_i e_{v_i} e_{v_i}^\top$, $\|V_1\|_{op} = U_S$, $\|V_2\|_{op} = V_S$.
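A small sketch (illustrative only) of the combinatorial quantities: $U_S$ and $V_S$ for one fixed decomposition $x_i = u_i v_i$, and a brute-force $W_S$ that minimizes $\max\{U_S, V_S\}$ over all split points (exponential in $m$, for intuition only):

```python
from collections import Counter
from itertools import product

def u_v(decomp):
    # U_S, V_S: maximum multiplicity among the prefixes u_i and suffixes v_i.
    us, vs = zip(*decomp)
    return max(Counter(us).values()), max(Counter(vs).values())

def w_s(sample):
    best = len(sample)   # trivial bound: U_S, V_S <= m
    for cuts in product(*[range(len(x) + 1) for x in sample]):
        decomp = [(x[:c], x[c:]) for x, c in zip(sample, cuts)]
        best = min(best, max(u_v(decomp)))
    return best

sample = ["ab", "aab", "ba", "b"]
print(w_s(sample))   # 1: a decomposition with all-distinct prefixes and suffixes exists
```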

31 Matrix Bernstein Bound
Corollary: let $M = \sum_i M_i$ be a finite sum of i.i.d. random matrices with $\mathbb{E}[M] = 0$ and $\|M_i\|_{op} \le M$ for all $i$; $\sum_i \mathbb{E}[M_i M_i^\top] \preceq V_1$ and $\sum_i \mathbb{E}[M_i^\top M_i] \preceq V_2$. Then,
$$\mathbb{E}\big[\|M\|_{op}\big] \le c_1 M \log(d + 1) + c_2 \sqrt{\nu \log(d + 1)},$$
where $c_1 = \frac{2 + 8/\log(2)}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log(2)}$; $V = \mathrm{diag}(V_1, V_2)$, $\nu = \|V\|_{op}$, $d = \frac{\mathrm{Tr}(V)}{\|V\|_{op}}$ (Minsker, 2011; Tropp, 2015).

32 Generalization Bound
Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_1$,
$$L(A) \le \widehat{L}_S(A) + \frac{2\mu_p c_1 r \log(2m + 1)}{m} + \frac{2\mu_p c_2 r \sqrt{W_S \log(2m + 1)}}{m} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$
where $c_1 = \frac{2 + 8/\log(2)}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log(2)}$, and $\mu_p = p M^{p-1}$.

33 Conclusion
Theory of learning WFAs:
- data-dependent learning guarantees.
- can help guide the design of algorithms.
- key role of the notion of Hankel matrix (spectral methods; Hsu et al., 2009; Balle and MM, 2012).
- data-dependent combinatorial quantities (e.g., $W_S$).

34 Questions
- Can we use the learning bounds (e.g., $W_S$) to select the prefixes/suffixes defining sub-blocks of the Hankel matrix?
- Can we derive learning guarantees for more general algorithms than (Balle and MM, 2012)?
- Computational challenges.

35 Hypothesis Sets
Definition based on the matrix representation:
$$\mathcal{A}_{n,p,r} = \Big\{A : |Q_A| = n,\ \|\alpha\|_p \le r_\alpha,\ \|\beta\|_p \le r_\beta,\ \max_{a} \|A_a\|_p \le r\Big\}.$$
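A sketch of membership checking for this class, with illustrative norm bounds; whether $\|A_a\|_p$ denotes the Schatten norm of the transition matrices is an assumption of this sketch:

```python
import numpy as np

def in_class(alpha, beta, trans, n, p, r_alpha, r_beta, r):
    # Vector p-norms for alpha, beta; Schatten p-norm for each A_a (assumed).
    schatten = lambda M: np.linalg.norm(np.linalg.svd(M, compute_uv=False), p)
    return (len(alpha) == n
            and np.linalg.norm(alpha, p) <= r_alpha
            and np.linalg.norm(beta, p) <= r_beta
            and max(schatten(A) for A in trans.values()) <= r)

# Toy 2-state WFA from the earlier sketches.
alpha = np.array([1.0, 0.0]); beta = np.array([0.0, 1.0])
trans = {"a": np.array([[1.0, 1.0], [0.0, 1.0]]), "b": np.eye(2)}
print(in_class(alpha, beta, trans, n=2, p=2, r_alpha=1.0, r_beta=1.0, r=2.0))  # True
```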

36 Rademacher Complexities
Corollary: let $L_S = \max_i |x_i|$ and $L_m = \mathbb{E}_{S \sim D^m}[L_S]$. Then,
$$\widehat{\mathfrak{R}}_S(\mathcal{A}_{n,p,r}) \le 6\Big(C + \sqrt{\log(L_S + 2)}\Big)\sqrt{\frac{n(n+2)\, r\tilde{r}}{m}},$$
$$\mathfrak{R}_m(\mathcal{A}_{n,p,r}) \le 6\Big(C + \sqrt{\log(L_m + 2)}\Big)\sqrt{\frac{n(n+2)\, r\tilde{r}}{m}},$$
where $\tilde{r} = \max\{r_\alpha/r,\ r_\beta/r,\ \sqrt{r_\alpha r_\beta}/r\}$ and $C$ is an explicit constant collecting $\sqrt{\log_+(\cdot)}$ terms in $\tilde{r}$, $r_\alpha$, and $r_\beta$, plus $3\sqrt{\log(2)}$.
