Learning Automata with Hankel Matrices

1 Learning Automata with Hankel Matrices Borja Balle Amazon Research Cambridge Highlights London, September 2017

2 Weighted Finite Automata (WFA) (over R)

Graphical representation: [figure: a two-state automaton with initial and final weights and weighted transitions labelled a and b]

Algebraic representation: A = ⟨α, β, {A_a}⟩, where α, β ∈ R^n are the initial and final weight vectors and each A_a ∈ R^{n×n} is the transition matrix of a symbol a ∈ Σ (the automaton in the figure corresponds to one such triple with n = 2)

Behavioral representation: Each WFA A computes a function A : Σ* → R given by A(x_1 ⋯ x_T) = α^⊤ A_{x_1} ⋯ A_{x_T} β
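
To make the behavioral representation concrete, here is a minimal numpy sketch of how a WFA evaluates a string. The two-state weights below are illustrative assumptions and are not meant to reproduce the automaton drawn on the slide.

```python
import numpy as np

# Hypothetical two-state WFA over the alphabet {a, b} (illustrative weights only).
alpha = np.array([1.0, 0.0])                    # initial weight vector
beta = np.array([0.0, 1.0])                     # final weight vector
A = {"a": np.array([[0.5, 0.2], [0.0, 0.3]]),   # one transition matrix per symbol
     "b": np.array([[0.1, 0.0], [0.4, 0.6]])}

def wfa_value(x, alpha, beta, A):
    """Compute A(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta."""
    v = alpha
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ beta)

print(wfa_value("abba", alpha, beta, A))
```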

3 In This Talk...

- Describe a core algorithm common to many algorithms for learning weighted automata
- Explain the role this core plays in three learning problems in different setups
- Survey extensions to more complex models and some applications

4 Outline 1. From Hankel Matrices to Weighted Automata 2. From Data to Hankel Matrices 3. From Theory to Practice

5 Outline 1. From Hankel Matrices to Weighted Automata 2. From Data to Hankel Matrices 3. From Theory to Practice

6 Hankel Matrices and Fliess' Theorem

Given f : Σ* → R, define its Hankel matrix H_f ∈ R^{Σ*×Σ*} by H_f(p, s) = f(p·s); for instance, the rows and columns indexed by ε, a, b, ... start

        ε      a      b     ⋯
  ε   f(ε)   f(a)   f(b)
  a   f(a)   f(aa)  f(ab)
  b   f(b)   f(ba)  f(bb)
  ⋮

Theorem [Fli74]
1. The rank of H_f is finite if and only if f is computed by a WFA
2. The rank rank(f) := rank(H_f) equals the number of states of a minimal WFA computing f
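
To make the definition concrete, the short sketch below builds a finite Hankel block for a toy function (an assumption chosen for illustration, f(x) = 0.5^|x|) and checks the rank statement of Fliess' theorem numerically.

```python
import numpy as np

def f(x):
    # Toy function: f(x) = 0.5 ** |x|, computed by a one-state WFA
    # with alpha = beta = 1 and A_a = A_b = 0.5.
    return 0.5 ** len(x)

prefixes = ["", "a", "b", "ab"]
suffixes = ["", "a", "b"]
H = np.array([[f(p + s) for s in suffixes] for p in prefixes])  # H_f(p, s) = f(p s)
print(np.linalg.matrix_rank(H))  # prints 1, matching the one-state WFA
```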

7 The Structure of Hankel Matrices

A(p_1 ⋯ p_T s_1 ⋯ s_{T'}) = (α^⊤ A_{p_1} ⋯ A_{p_T}) (A_{s_1} ⋯ A_{s_{T'}} β), so H(p, s) = f(p·s) factors as H = P S, where row p of P is α^⊤ A_{p_1} ⋯ A_{p_T} and column s of S is A_{s_1} ⋯ A_{s_{T'}} β

A(p_1 ⋯ p_T a s_1 ⋯ s_{T'}) = (α^⊤ A_{p_1} ⋯ A_{p_T}) A_a (A_{s_1} ⋯ A_{s_{T'}} β), so H_a(p, s) = f(p·a·s) factors as H_a = P A_a S

Algebraically, factorizing H lets us solve for A_a:
H = P S  ⟹  H_a = P A_a S  ⟹  A_a = P^+ H_a S^+
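
The sketch below illustrates this factorization on the same toy WFA as in the earlier snippet: it builds P from forward vectors and S from backward vectors over small, arbitrarily chosen prefix and suffix sets, checks H = P S and H_a = P A_a S, and recovers A_a with pseudo-inverses.

```python
import numpy as np

# Same toy two-state WFA as before (illustrative weights).
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
A = {"a": np.array([[0.5, 0.2], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.0], [0.4, 0.6]])}

def forward(p):                      # row of P: alpha^T A_{p_1} ... A_{p_T}
    v = alpha
    for c in p:
        v = v @ A[c]
    return v

def backward(s):                     # column of S: A_{s_1} ... A_{s_T'} beta
    v = beta
    for c in reversed(s):
        v = A[c] @ v
    return v

prefixes, suffixes = ["", "a", "b"], ["", "a", "b"]
P = np.stack([forward(p) for p in prefixes])            # |P| x n
S = np.stack([backward(s) for s in suffixes], axis=1)   # n x |S|

H = P @ S                    # H(p, s)   = f(p s)
H_a = P @ A["a"] @ S         # H_a(p, s) = f(p a s)

A_a = np.linalg.pinv(P) @ H_a @ np.linalg.pinv(S)       # A_a = P^+ H_a S^+
print(np.allclose(A_a, A["a"]))                         # True when P and S have rank n
```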

8 SVD-based Reconstruction [HKZ09; Bal+14]

Inputs:
- Desired number of states r
- Basis B = (P, S) with P, S ⊂ Σ*, ε ∈ P ∩ S
- Finite Hankel blocks indexed by the prefixes and suffixes in B: H_B ∈ R^{P×S} and H_B^Σ = {H_B^a ∈ R^{P×S} : a ∈ Σ}

Algorithm Spectral(H_B, H_B^Σ, r):
1. Compute the rank-r SVD H_B ≈ U D V^⊤
2. Let A_a = D^{-1} U^⊤ H_B^a V
3. Let α = V^⊤ H_B(ε, :) and β = D^{-1} U^⊤ H_B(:, ε)
4. Return A = ⟨α, β, {A_a}⟩

Running time:
1. SVD takes O(|P| |S| r)
2. Matrix multiplications take O(|Σ| |P| |S| r)
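
A minimal numpy sketch of the Spectral routine described above follows; it assumes ε is indexed by row/column 0 of the blocks and uses only the SVD and matrix products from the slide.

```python
import numpy as np

def spectral(H_B, H_B_Sigma, r, eps_row=0, eps_col=0):
    """Spectral(H_B, H_B_Sigma, r): rank-r SVD reconstruction of a WFA.

    H_B is a |P| x |S| array, H_B_Sigma maps each symbol a to its |P| x |S|
    block, and the empty string is assumed to sit at the given row/column.
    """
    U, d, Vt = np.linalg.svd(H_B, full_matrices=False)
    U, d, Vt = U[:, :r], d[:r], Vt[:r, :]              # rank-r truncation: H_B ~ U D V^T
    D_inv = np.diag(1.0 / d)
    A = {a: D_inv @ U.T @ H_a @ Vt.T                   # A_a = D^-1 U^T H_B^a V
         for a, H_a in H_B_Sigma.items()}
    alpha = Vt @ H_B[eps_row, :]                       # alpha = V^T H_B(eps, :)
    beta = D_inv @ U.T @ H_B[:, eps_col]               # beta  = D^-1 U^T H_B(:, eps)
    return alpha, beta, A
```

Feeding in exact Hankel blocks of rank r built from a WFA (one block per symbol, as in the previous snippet) recovers an equivalent automaton up to a change of basis, which is the recovery property stated on the next slide.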

9 Properties of Spectral [HKZ09; Bal13; BM15a]

Consistency: If P is prefix-closed, S is suffix-closed, and r = rank(H_B) = rank([H_B H_B^Σ]), then the WFA A = Spectral(H_B, H_B^Σ, r) satisfies A(p·s) = H_B(p, s) and A(p·a·s) = H_B^a(p, s) for all p ∈ P, a ∈ Σ, s ∈ S

Recovery: If H_B and H_B^Σ are sub-blocks of H_f with r = rank(f) = rank(H_B), then the WFA A = Spectral(H_B, H_B^Σ, r) satisfies A = f

Robustness: If r = rank(H_B) = rank([H_B H_B^Σ]), ‖H_B − Ĥ_B‖ ≤ ε, and ‖H_B^a − Ĥ_B^a‖ ≤ ε for all a ∈ Σ, then ⟨α, β, {A_a}⟩ = Spectral(H_B, H_B^Σ, r) and ⟨α̂, β̂, {Â_a}⟩ = Spectral(Ĥ_B, Ĥ_B^Σ, r) satisfy ‖α − α̂‖, ‖β − β̂‖, ‖A_a − Â_a‖ = O(ε)

10 Outline 1. From Hankel Matrices to Weighted Automata 2. From Data to Hankel Matrices 3. From Theory to Practice

11 Learning Models

1. Exact query learning: membership + equivalence queries [BV96; BBM06; BM15a]
2. Distributional PAC learning: samples from a stochastic WFA [HKZ09; BDR09; Bal+14]
3. Statistical learning: optimize output predictions with respect to a loss function [BM12; BM15b]

12 Exact Learning of WFA with Queries

Setup:
- Unknown f : Σ* → R with rank(f) = n
- Membership oracle: MQ_f(x) returns f(x) for any x ∈ Σ*
- Equivalence oracle: EQ_f(A) returns true if f = A, and (false, z) with f(z) ≠ A(z) otherwise

Algorithm:
1. Initialize P = S = {ε} and maintain B = (P, S)
2. Let A = Spectral(H_B, H_B^Σ, rank(H_B))
3. While EQ(A) = (false, z):
   3.1 Let z = p·a·s with p the longest prefix of z in P
   3.2 Let S = S ∪ suffixes(s)
   3.3 While there exist p ∈ P and a ∈ Σ such that H_B^a(p, ·) ∉ rowspan(H_B), add p·a to P
   3.4 Let A = Spectral(H_B, H_B^Σ, rank(H_B))

Analysis:
- At most n + 1 calls to EQ_f and O(|Σ| n² L) calls to MQ_f, where L = max |z| is the length of the longest counterexample
- Can be improved to O((|Σ| + log L) n²) calls to MQ_f; calls to EQ_f can be reduced by increasing calls to MQ_f
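
Below is a hedged sketch of this loop, not the paper's implementation: `mq` and `eq` stand for the membership and equivalence oracles (both assumptions), basis closure is tested numerically with least squares, and `spectral` is the routine sketched after slide 8.

```python
import numpy as np

def hankel_blocks(P, S, alphabet, mq):
    """Fill H_B and the blocks H_B^a by querying the membership oracle."""
    H = np.array([[mq(p + s) for s in S] for p in P])
    H_sigma = {a: np.array([[mq(p + a + s) for s in S] for p in P]) for a in alphabet}
    return H, H_sigma

def in_rowspan(row, M, tol=1e-8):
    """Numerically test whether `row` lies in the row span of M."""
    coeffs, *_ = np.linalg.lstsq(M.T, row, rcond=None)
    return np.linalg.norm(M.T @ coeffs - row) <= tol

def learn_with_queries(alphabet, mq, eq, max_rounds=100):
    P, S = [""], [""]
    for _ in range(max_rounds):
        H, H_sigma = hankel_blocks(P, S, alphabet, mq)
        A = spectral(H, H_sigma, np.linalg.matrix_rank(H))  # routine from the earlier sketch
        ok, z = eq(A)
        if ok:
            return A
        # 3.1-3.2: split z = p a s with p the longest prefix of z already in P,
        # then add all suffixes of s to S
        p = max((q for q in P if z.startswith(q)), key=len)
        s = z[len(p) + 1:]
        S += [s[i:] for i in range(len(s) + 1) if s[i:] not in S]
        # 3.3: close the basis, adding p a while some row H_a(p, .) leaves rowspan(H_B)
        changed = True
        while changed:
            changed = False
            H, H_sigma = hankel_blocks(P, S, alphabet, mq)
            for p in list(P):
                for a in alphabet:
                    if p + a not in P and not in_rowspan(H_sigma[a][P.index(p), :], H):
                        P.append(p + a)
                        changed = True
    raise RuntimeError("no equivalence reached within max_rounds")
```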

13 PAC Learning Stochastic WFA

Setup: Unknown f : Σ* → R with rank(f) = n defining a probability distribution on Σ*
Data: x^(1), ..., x^(m) i.i.d. strings sampled from f
Parameters: n and B = (P, S) such that rank(H_B) = n and ε ∈ P ∩ S

Algorithm:
1. Estimate the Hankel matrices Ĥ_B and Ĥ_B^a for all a ∈ Σ using the empirical probabilities f̂(x) = (1/m) Σ_{i=1}^m 1[x^(i) = x]
2. Return Â = Spectral(Ĥ_B, Ĥ_B^Σ, n)

Analysis:
- Running time is O(|P| |S| m + |Σ| |P| |S| n)
- With high probability, Σ_{|x| ≤ L} |f(x) − Â(x)| = O( L² |Σ| √n / (σ_n(H_B^f)² √m) )
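
A short sketch of step 1, building the empirical Hankel blocks from a sample; the toy sample and basis below are made-up assumptions, and the estimated blocks can be fed directly to the `spectral` routine sketched earlier.

```python
from collections import Counter
import numpy as np

def empirical_hankel(sample, prefixes, suffixes, alphabet):
    """Estimate H_B and the blocks H_B^a with f_hat(x) = (1/m) * #{i : x^(i) = x}."""
    m = len(sample)
    counts = Counter(sample)

    def f_hat(x):
        return counts[x] / m

    H = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
    H_sigma = {a: np.array([[f_hat(p + a + s) for s in suffixes] for p in prefixes])
               for a in alphabet}
    return H, H_sigma

sample = ["ab", "a", "ab", "b", "aab", "ab"]        # toy i.i.d. draws (illustration only)
H_hat, H_hat_sigma = empirical_hankel(sample, ["", "a"], ["", "b"], "ab")
# alpha_hat, beta_hat, A_hat = spectral(H_hat, H_hat_sigma, r=2)
```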

14 Statistical Learning of WFA

Setup: Unknown distribution D over Σ* × R
Data: (x^(1), y^(1)), ..., (x^(m), y^(m)) i.i.d. string-label pairs sampled from D
Parameters: n, convex loss function ℓ : R × R → R_+, convex regularizer R, regularization parameter λ > 0, and B = (P, S) with ε ∈ P ∩ S

Algorithm:
1. Build B' = (P', S) with P' = P ∪ PΣ
2. Find the Hankel matrix Ĥ_B' solving min_H (1/m) Σ_{i=1}^m ℓ(H(x^(i)), y^(i)) + λ R(H)
3. Return Â = Spectral(Ĥ_B, Ĥ_B^Σ, n), where Ĥ_B and Ĥ_B^Σ are submatrices of Ĥ_B'

Analysis:
- Running time is polynomial in n, m, |Σ|, |P|, and |S|
- With high probability, E_{(x,y)∼D}[ℓ(Â(x), y)] ≤ (1/m) Σ_{i=1}^m ℓ(Â(x^(i)), y^(i)) + O(1/√m)
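
As a rough illustration of step 2, the sketch below solves a heavily simplified instance of the optimization: squared loss, nuclear-norm regularizer, and each training string mapped to a single (prefix, suffix) cell of H, whereas the actual formulation in [BM12] ties together all the Hankel cells spelling the same string. It uses proximal gradient descent with singular-value thresholding; all names and constants are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - tau, 0.0)) @ Vt

def hankel_regression(cells, values, shape, lam=0.1, step=0.1, iters=500):
    """Sketch of min_H (1/m) sum_i (H[cell_i] - y_i)^2 + lam * ||H||_* via proximal gradient."""
    H = np.zeros(shape)
    m = len(values)
    for _ in range(iters):
        grad = np.zeros(shape)
        for (i, j), y in zip(cells, values):
            grad[i, j] += 2.0 * (H[i, j] - y) / m
        H = svt(H - step * grad, step * lam)
    return H

# Toy usage: three labelled strings mapped to cells of a 3 x 3 Hankel block (assumption).
H_hat = hankel_regression(cells=[(0, 1), (1, 0), (1, 2)],
                          values=[0.5, 0.5, 0.25], shape=(3, 3))
```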

15 Outline 1. From Hankel Matrices to Weighted Automata 2. From Data to Hankel Matrices 3. From Theory to Practice

16 Extensions

1. More complex models: transducers and taggers [BQC11; Qua+14]; grammars and tree automata [Luq+12; Bal+14; RBC16]; reactive models [BBP15; LBP16; BM17a]
2. More realistic setups: multiple related tasks [RBP17]; timing data [BBP15; LBP16]; single trajectory [BM17a]; probabilistic models [BHP14]
3. Deeper theory: convex relaxations [BQC12]; generalization bounds [BM15b; BM17b]; approximate minimisation [BPP15]; bisimulation metrics [BGP17]

17 And It Works Too!

Spectral methods are competitive against traditional methods: expectation maximization, conditional random fields, tensor decompositions

In a variety of problems: sequence tagging, constituency and dependency parsing, timing and geometry learning, POS-level language modelling

[figures: experimental plots comparing spectral learning against EM, CRF, and tensor-decomposition baselines on these tasks, reporting L1 distance and word error rate versus sample size, Hamming accuracy versus training set size, runtime, relative error versus Hankel rank and sequence length, and accuracy versus number of states for several choices of basis]

18 Open Problems and Current Trends

- Optimal selection of P and S from data
- Scalable convex optimization over sets of Hankel matrices
- Constraining the output WFA (e.g. probabilistic automata)
- Relations between learning and approximate minimisation
- How much of this can be extended to WFA over semi-rings?
- Spectral methods for initializing non-convex gradient-based learning algorithms

19 Conclusion

Take-home points:
- A single building block based on SVD of Hankel matrices
- Implementation only requires linear algebra
- Analysis involves linear algebra, probability, convex optimization
- Can be made practical for a variety of models and applications

Want to know more?
- EMNLP 2014 tutorial (with slides, video, and code)
- Survey papers [BM15a; TJ15]
- Python toolkit Sp2Learn [Arr+16]
- Neighbouring literature: predictive state representations (PSR) [LSS02] and observable operator models (OOM) [Jae00]

20 Thanks To All My Collaborators! Guillaume Rabusseau Franco M. Luque Xavier Carreras Mehryar Mohri Prakash Panangaden Pierre-Luc Bacon Pascale Gourdeau Odalric-Ambrym Maillard Will Hamilton Lucas Langer Shay Cohen Amir Globerson Joelle Pineau Doina Precup Ariadna Quattoni

21 Bibliography I

[Arr+16] D. Arrivault, D. Benielli, F. Denis, and R. Eyraud. Sp2Learn: A Toolbox for the Spectral Learning of Weighted Automata. In: ICGI.
[Bal+14] B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. Spectral learning of weighted automata: A forward-backward perspective. In: Machine Learning (2014).
[Bal13] B. Balle. Learning Finite-State Machines: Algorithmic and Statistical Aspects. PhD thesis. Universitat Politècnica de Catalunya.
[BBM06] L. Bisht, N. H. Bshouty, and H. Mazzawi. On Optimal Learning Algorithms for Multiplicity Automata. In: COLT.
[BBP15] P.-L. Bacon, B. Balle, and D. Precup. Learning and Planning with Timing Information in Markov Decision Processes. In: UAI.
[BDR09] R. Bailly, F. Denis, and L. Ralaivola. Grammatical inference as a principal component analysis problem. In: ICML.

22 Bibliography II

[BGP17] B. Balle, P. Gourdeau, and P. Panangaden. Bisimulation Metrics for Weighted Automata. In: ICALP.
[BHP14] B. Balle, W. L. Hamilton, and J. Pineau. Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison. In: ICML.
[BM12] B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In: NIPS.
[BM15a] B. Balle and M. Mohri. Learning Weighted Automata (invited paper). In: CAI.
[BM15b] B. Balle and M. Mohri. On the Rademacher complexity of weighted automata. In: ALT.
[BM17a] B. Balle and O.-A. Maillard. Spectral Learning from a Single Trajectory under Finite-State Policies. In: ICML.

23 Bibliography III

[BM17b] B. Balle and M. Mohri. Generalization Bounds for Learning Weighted Automata. In: Theor. Comput. Sci. (to appear) (2017).
[BPP15] B. Balle, P. Panangaden, and D. Precup. A Canonical Form for Weighted Automata and Applications to Approximate Minimization. In: LICS.
[BQC11] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for finite state transducers. In: ECML-PKDD.
[BQC12] B. Balle, A. Quattoni, and X. Carreras. Local loss optimization in operator models: A new insight into spectral learning. In: ICML.
[BV96] F. Bergadano and S. Varricchio. Learning behaviors of automata from multiplicity and equivalence queries. In: SIAM Journal on Computing (1996).
[Fli74] M. Fliess. Matrices de Hankel. In: Journal de Mathématiques Pures et Appliquées (1974).
[HKZ09] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In: COLT.

24 Bibliography IV

[Jae00] H. Jaeger. Observable operator models for discrete stochastic time series. In: Neural Computation (2000).
[LBP16] L. Langer, B. Balle, and D. Precup. Learning Multi-Step Predictive State Representations. In: IJCAI.
[LSS02] M. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In: NIPS.
[Luq+12] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning in non-deterministic dependency parsing. In: EACL.
[Qua+14] A. Quattoni, B. Balle, X. Carreras, and A. Globerson. Spectral Regularization for Max-Margin Sequence Tagging. In: ICML.
[RBC16] G. Rabusseau, B. Balle, and S. B. Cohen. Low-Rank Approximation of Weighted Tree Automata. In: AISTATS.
[RBP17] G. Rabusseau, B. Balle, and J. Pineau. Multitask Spectral Learning of Weighted Automata. In: NIPS.

25 Bibliography V

[TJ15] M. R. Thon and H. Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. In: Journal of Machine Learning Research (2015).

26 Learning Automata with Hankel Matrices Borja Balle Amazon Research Cambridge Highlights London, September 2017
