arxiv: v1 [cs.lg] 4 Jul 2018 Doina Precup McGill University / DeepMind Montreal

Connecting Weighted Automata and Recurrent Neura Networks through Spectra Learning Guiaume Rabusseau McGi University guiaume.rabusseau@mcgi.ca Tianyu Li McGi University tianyu.i@mai.mcgi.ca arxiv:807.0406v [cs.lg] 4 Ju 08 Doina Precup McGi University / DeepMind Montrea dprecup@cs.mcgi.ca Abstract In this paper, we unrave a undamenta connection between weighted inite automata (WFAs) and second-order recurrent neura networks (-RNNs): in the case o sequences o discrete symbos, WFAs and -RNNs with inear activation unctions are expressivey equivaent. Motivated by this resut, we buid upon a recent extension o the spectra earning agorithm to vector-vaued WFAs and propose the irst provabe earning agorithm or inear -RNNs deined over sequences o continuous input vectors. This agorithm reies on estimating ow rank sub-bocks o the so-caed Hanke tensor, rom which the parameters o a inear -RNN can be provaby recovered. The perormances o the proposed method are assessed in a simuation study. Introduction Many tasks in natura anguage processing, computationa bioogy, reinorcement earning, and time series anaysis rey on earning with sequentia data, i.e. estimating unctions deined over sequences o observations rom training data. Weighted inite automata (WFAs) and recurrent neura networks (RNNs) are two poweru and exibe casses o modes which can eicienty represent such unctions. On the one hand, WFAs are tractabe, they encompass a wide range o machine earning modes (they can or exampe compute any probabiity distribution deined by a hidden Markov mode (HMM) [0] and can mode the transition and observation behavior o partiay observabe Markov decision processes []) and they oer appeaing theoretica guarantees. In particuar, the so-caed spectra methods or earning HMMs [9], WFAs [, 4] and reated modes [6, 6], provide an aternative to EM based agorithms that is both computationay eicient and consistent. On the other hand, RNNs are remarkaby expressive modes they can represent any computabe unction [] and they have successuy tacked many practica probems in speech and audio recognition [7, 4, ], but their theoretica anaysis is diicut. Even though recent work provides interesting resuts on their expressive power [, 7] as we as aternative training agorithms coming with earning guarantees [], the theoretica understanding o RNNs is sti imited. In this work, we bridge a gap between these two casses o modes by unraveing a undamenta connection between WFAs and second-order RNNs (-RNNs): when considering input sequences o discrete symbos, -RNNs with inear activation unctions and WFAs are one and the same, i.e. they are expressivey equivaent and there exists a one-to-one mapping between the two casses (moreover, this mapping conserves mode sizes). Whie connections between inite state machines (e.g. deterministic Preprint. Work in progress.

inite automata) and recurrent neura networks have been noticed and investigated in the past (see e.g. [4, 5]), to the best o our knowedge this is the irst time that such a cear equivaence between inear -RNNs and weighted automata is expicity ormaized. This resut naturay eads to the observation that inear -RNNs are a natura generaization o WFAs (which take sequences o discrete observations as inputs) to sequences o continuous vectors, and raises the question o whether the spectra earning agorithm or WFAs can be extended to inear -RNNs. The second contribution o this paper is to show that the answer is in the positive: buiding upon the spectra earning agorithm or vector-vaued WFAs introduced recenty in [9], we propose the irst provabe earning agorithm or second-order RNNs with inear activation unctions. Our earning agorithm reies on estimating sub-bocks o the so-caed Hanke tensor, rom which the parameters o a -inear RNN can be recovered using basic inear agebra operations. One o the key technica diicuties in designing this agorithm resides in estimating these sub-bocks rom training data where the inputs are sequences o continuous vectors. We everage mutiinear properties o inear -RNNs and the act that the Hanke sub-bocks can be reshaped into higher-order tensors o ow tensor train rank (a resut we beieve is o independent interest) to perorm this estimation eicienty using matrix sensing and tensor recovery techniques. As a proo o concept, we vaidate our theoretica indings in a simuation study on toy exampes where we experimentay compare dierent recovery methods and investigate the robustness o our agorithm to noise and rank mis-speciication. Reated work. Combining the spectra earning agorithm or WFAs with matrix competion techniques (a probem which is cosey reated to matrix sensing) has been theoreticay investigated in [5]. The connections between tensors and RNNs have been previousy everaged to study the expressive power o RNNs in [] and to achieve mode compression in [7, 6, 4]. Exporing the reationship between RNNs and automata or interpretabiity purposes has been expored in [5] and the abiity o RNNs to earn casses o orma anguages has been investigated in []. The predictive state RNN mode introduced in [] is cosey reated to -RNNs and the authors propose to use the spectra earning agorithm or predictive state representations to initiaize a gradient based agorithm; their approach however comes without theoretica guarantees. Lasty, a provabe agorithm or RNNs reying on the tensor method o moments has been proposed in [] but it is imited to irst-order RNNs with quadratic activation unctions (which do not encompass inear -RNNs). The proos o the resuts given in the paper can be ound in the suppementary materia. Preiminaries In this section, we irst present basic notions o tensor agebra beore introducing second-order recurrent neura network, weighted inite automata and the spectra earning agorithm. We start by introducing some notation. For any integer k we use [k] to denote the set o integers rom to k. We use to denote the smaest integer greater or equa to. For any set S, we denote by S = k N Sk the set o a inite-ength sequences o eements o S (in particuar, Σ wi denote the set o strings on a inite aphabet Σ). We use ower case bod etters or vectors (e.g. v R d ), upper case bod etters or matrices (e.g. M R d d ) and bod caigraphic etters or higher order tensors (e.g. T R d d d ). We use e i to denote the ith canonica basis vector o R d (where the dimension d wi aways appear ceary rom context). The d d identity matrix wi be written as I d. The ith row (resp. coumn) o a matrix M wi be denoted by M i,: (resp. M :,i ). This notation is extended to sices o a tensor in the straightorward way. I v R d and v R d, we use v v R d d to denote the Kronecker product between vectors, and its straightorward extension to matrices and tensors. Given a matrix M R d d, we use vec(m) R d d to denote the coumn vector obtained by concatenating the coumns o M. The inverse o M is denoted by M, its Moore-Penrose pseudo-inverse by M, and the transpose o its inverse by M ; the Frobenius norm is denoted by M F and the nucear norm by M. Tensors. We irst reca basic deinitions o tensor agebra; more detais can be ound in []. A tensor T R d dp can simpy be seen as a mutidimensiona array (T i,,i p : i n [d n ], n [p]). The mode-n ibers o T are the vectors obtained by ixing a indices except the nth one, e.g. T :,i,,i p R d. The nth mode matricization o T is the matrix having the mode-n ibers o T or coumns and is denoted by T (n) R dn d dn dn+ dp. The vectorization o a tensor is deined by vec(t ) = vec(t () ). In the oowing T aways denotes a tensor o size d d p.

T 4 = d d d d 4 d G R G R G R G 4 d d d 4 Figure : Tensor network representation o a rank R tensor train decomposition (nodes represent tensors and an edge between two nodes represents a contraction between the corresponding modes o the two tensors). The mode-n matrix product o the tensor T and a matrix X R m dn is a tensor denoted by T n X. It is o size d d n m d n+ d p and is deined by the reation Y = T n X Y (n) = XT (n). The mode-n vector product o the tensor T and a vector v R dn is a tensor deined by T n v = T n v R d dn dn+ dp. It is easy to check that the n-mode product satisies (T n A) n B = T n BA where we assume compatibe dimensions o the tensor T and the matrices A and B. Given stricty positive integers n,, n k satisying i n i = p, we use the notation (T ) n,n,,n k to denote the kth order tensor obtained by reshaping T into a tensor o size ( n i d = i ) ( n i d = n +i ) ( n k i k = d n + +n k +i k ). In particuar we have (T ) p = vec(t ) and (T ),p = T (). A rank R tensor train (TT) decomposition [6] o a tensor T R d dp consists in actorizing T into the product o p core tensors G R d R, G R R d R,, G p R R dp R, G p R R dp, and is deined by R T i,,i p = (G ) i,r (G ) r,i,r (G ) r,i,r (G p ) rp,i p,r p (G p ) rp,i p r,,r p = or a indices i [d ],, i p [d p ]; we wi use the notation T = G,, G p to denote such a decomposition. A tensor network representation o this decomposition is shown in Figure. Whie the probem o inding the best approximation o TT-rank R o a given tensor is NP-hard [8], a quasi-optima SVD based compression agorithm (TT-SVD) has been proposed in [6]. It is worth mentioning that the TT decomposition is invariant under change o basis in the sense that, or any invertibe matrix M and any core tensors G, G,, G p, we have G,, G p = G M, G M M,, G p M M, G p M. Second-order RNNs. A second-order recurrent neura network (-RNN) [5, 7, ] with n hidden units can be deined as a tupe M = (h 0, A, Ω) where h 0 R n is the initia state, A R n d n is the transition tensor, and Ω R p n is the output matrix, with d and p being the input and output dimensions respectivey. A -RNN maps any sequence o inputs x,, x k R d to a sequence o outputs y,, y k R p deined by y t = z (Ωh t ) with h t = z (A x t h t ) or t =,, k () where z : R n R n and z : R p R p are activation unctions. Aternativey, one can think o a -RNN as computing a unction M : (R d ) R p mapping each input sequence x,, x k to the corresponding ina output y k. Whie z and z are usuay non-inear component-wise unctions, we consider in this paper the case where both z and z are the identity, and we reer to the resuting mode as a inear -RNN. For a inear -RNN M, the unction M is mutiinear in the sense that, or any integer, its restriction to the domain (R d ) is mutiinear. Another useu observation is that inear -RNNs are invariant under change o basis: or any invertibe matrix P, the inear -RNN M = (P h 0, A P P, PΩ) is such that M = M. We wi say that a inear -RNN M with n states is minima i its number o hidden units is minima (i.e. any inear -RNN computing M has at east n hidden units). Note that the speciic ordering used to perorm matricization, vectorization and such a reshaping is not reevant as ong as it is consistent across a operations. The cassica deinition o the TT-decomposition aows the rank R to be dierent or each mode, but this deinition is suicient or the purpose o this paper.

Weighted automata and spectra earning. Vector-vaued weighted inite automaton (vv-wfa) have been introduced in [9] as a natura generaization o weighted automata rom scaar-vaued unctions to vector-vaued ones. A p-dimensiona vv-wfa with n states is a tupe A = (α, {A σ } σ Σ, Ω) where α R n is the initia weights vector, Ω R p n is the matrix o ina weights, and A σ R n n is the transition matrix or each symbo σ in a inite aphabet Σ. A vv-wfa A computes a unction A : Σ R p deined by A (x) = Ω(A x A x A x k ) α or each word x = x x x k Σ. We ca a vv-wfa minima i its number o states is minima. Given a unction : Σ R p we denote by rank() the number o states o a minima vv-wfa computing (which is set to i cannot be computed by a vv-wfa). The spectra earning agorithm or vv-wfas reies on the oowing undamenta theorem reating the rank o a unction : Σ R d to its Hanke tensor H R Σ Σ p, which is deined by H u,v,: = (uv) or a u, v Σ. Theorem ([9]). Let : Σ R d and et H be its Hanke tensor. Then rank() = rank(h () ). The vv-wfa earning agorithm everages the act that the proo o this theorem is constructive: one can recover a vv-wfa computing rom any ow rank actorization o H (). In practice, a inite subbock H R P S p o the Hanke tensor is used to recover the vv-wfa, where P, S Σ are inite sets o preixes and suixes orming a compete basis or, i.e. such that rank( H () ) = rank(h () ). More detais can be ound in [9]. A Fundamenta Reation between WFAs and Linear -RNNs We start by unraveing a undamenta connection between vv-wfas and inear -RNNs: vv-wfas and inear -RNNs are expressivey equivaent or representing unctions deined over sequences o discrete symbos. More ormay, we have the oowing theorem. Theorem. Any unction that can be computed by a vv-wfa with n states can be computed by a inear -RNN with n hidden units. Conversey, any unction that can be computed by a inear -RNN with n hidden units on sequences o one-hot vectors (i.e. canonica basis vectors) can be computed by a WFA with n states. More precisey, the WFA A = (α, {A σ } σ Σ, Ω) with n states and the inear -RNN M = (α, A, Ω) with n hidden units, where A R n Σ n is deined by A :,σ,: = A σ or a σ Σ, are such that A (σ σ σ k ) = M (x, x,, x k ) or a sequences o input symbos σ,, σ k Σ, where or each i [k] the input vector x i R Σ is the one-hot encoding o the symbo σ i. This resut irst impies that inear -RNNs deined over sequence o discrete symbos (using one-hot encoding) can be provaby earned using the spectra earning agorithm or WFAs/vv-WFAs; indeed, these agorithms have been proved to return consistent estimators. Furthermore, Theorem reveas that inear -RNNs are a natura generaization o cassica weighted automata to unctions deined over sequences o continuous vectors instead o discrete symbos. This spontaneousy raises the question o whether the spectra earning agorithms or WFAs and vv-wfas can be extended to the genera setting o inear -RNNs; we show that the answer is in the positive in the next section. 4 Spectra Learning o Linear -RNNs In this section, we extend the earning agorithm or vv-wfas to inear -RNNs, thus providing the irst provabe earning agorithm or inear second-order RNNs. 4. Recovering a -RNN rom Hanke Tensors We irst present an identiiabiity resut showing how one can recover a inear -RNN computing a unction : (R d ) R p rom observabe tensors extracted rom some Hanke tensor associated with. Intuitivey, we obtain this resut by reducing the probem to the one o earning a vv-wfa. This is done by considering the restriction o to canonica basis vectors; oosey speaking, since the domain o this restricted unction is isomorphic to [d], this aows us to a back onto the setting o sequences o discrete symbos. 4

α = n A n A n A n A n Ω H (4) d d d d p Figure : Tensor network representation o the TT decomposition o the Hanke tensor H (4) by a inear -RNN (α, A, Ω). induced Given a unction : (R d ) R p, we deine its Hanke tensor H R [d] [d] p by (H ) i i s,j j t,: = (e i,, e is, e j,, e jt ) or a i,, i s, j,, j t [d], which is ininite in two o its modes. It is easy to see that H is aso the Hanke tensor associated with the unction : [d] R p mapping any sequence i i i k [d] to (e i,, e ik ). Moreover, in the specia case where can be computed by a inear -RNN, one can use the mutiinearity o to show that (x,, x k ) = d i,,i k = (x ) i (x ) ik (i i k ), making it intuitivey cear that one can earn by earning a vv-wfa computing using the spectra earning agorithm. That is, given a arge enough sub-bock H R P S p o H or some preix and suix sets P, S [d], one shoud be abe to recover a vv-wfa computing and consequenty a inear -RNN computing using Theorem. For the sake o carity, we present the earning agorithm or the particuar case where there exists an L such that the preix and suix sets consisting o a sequences o ength L, i.e. P = S = [d] L, orms a compete basis or. This assumption aows us to present in a simper way a the key eements o the agorithm, the technica detais needed to it this assumption are given in the suppementary materia. For any integer, we deine the inite tensor H () R d d p o order + by (H () ) i,,i,: = (e i,, e i ) or a i,, i [d]. Observe that or any integer, the tensor H () can be obtained by reshaping a inite sub-bock o the Hanke tensor H. When is computed by a inear -RNN, we have the useu property that, or any integer, (x,, x ) = H () x x () or any sequence o inputs x,, x R d (which can easiy be shown using the mutiinearity o ). Beore ormaizing the identiiabiity resut or inear -RNNs in Theorem beow, we remark that the tensors H () are o ow tensor train rank. Indeed, or any, it is easy to check that H () = A α, A,, A, Ω (the tensor network representation o this decomposition is shown }{{} times in Figure ). This property wi be particuary reevant to the earning agorithm we design in the oowing section, but it is aso a undamenta reation that deserves some attention on its own: it impies in particuar that, beyond the cassica reation between the rank o the Hanke matrix H and the number states o a minima WFA computing, the Hanke matrix possesses a deeper structure intrinsicay connecting weighted automata to the tensor train decomposition. Whie this reation have been noticed previousy (see e.g. [8, 9, 8]), we beieve that it coud be successuy everaged to design more eicient earning agorithms (both in terms o sampe and computationa compexity) or WFAs and reated modes. Theorem. Let : (R d ) R p be a unction computed by a minima inear -RNN with n hidden units and et L be an integer such that rank((h (L) ) L,L+ ) = n. Then, or any P R dl n and S R n dlp such that (H (L) ) L,L+ = PS, the inear -RNN M = (α, A, Ω) deined by α = (S ) (H (L) ) L+, A = ((H (L+) ) L,,L+ ) P (S ), Ω = P (H (L) is a minima inear -RNN computing. ) L, 5

First observe that such an integer L exists under the assumption that P = S = [d] L orms a compete basis or. It is aso worth mentioning that a necessary condition or rank((h (L) ) L,L+ ) = n is that d L n, i.e. L must be o the order og d (n). 4. Hanke Tensors Recovery rom Linear Measurements We showed in the previous section that, given the Hanke tensors H (L), H (L) and H (L+), one can recover a inear -RNN computing i it exists. This readiy impies that the cass o unctions that can be computed by inear -RNNs is earnabe in Anguin s exact earning mode [] where one has access to an orace that can answer membership queries (e.g. what is the vaue computed by the target on (x,, x k )?) and equivaence queries (e.g. is my current hypothesis h equa to the target?). Whie this undamenta resut is o signiicant theoretica interest, assuming access to such an orace is unreaistic. In this section, we show that a stronger earnabiity resut can be obtained in a more reaistic setting, where we ony assume access to randomy generated input/output exampes ((x (n), x(n),, x(n) ), y n ) (R d ) R p where y n = (x (n), x(n),, x(n) ). The key observation is that such an input/output exampe ((x (n), x(n),, x(n) ), y n ) can be seen as a inear measurement o the Hanke tensor H (). Indeed, we have y n = (x (n), x(n),, x(n) ) = H () x x = (H () ), x(n) where x (n) = x (n) x (n) R d. Hence, by regrouping N output exampes y n into the matrix Y R N p and the corresponding input vectors x (n) into the matrix X R N d, one can recover H () by soving the inear system Y = X(H () ),, which has a unique soution whenever X is o u coumn rank. This naturay eads to the oowing theorem, whose proo reies on the act that the matrix X wi be o u coumn rank whenever N d and the components o each x (n) i or i [], n [N] are drawn independenty rom a continuous distribution over R d (w.r.t. the Lebesgue measure). Theorem 4. Let (h 0, A, Ω) be a minima inear -RNN with n hidden units computing a unction : (R d ) R p, and et L be an integer such that rank((h (L) ) L,L+ ) = n. Suppose we have access to datasets D = {((x (n), x(n) {L, L, L + } where the entries o each x (n) i norma distribution and where each y n = (x (n), x(n),, x(n) ).,, x(n) ), y n )} N n= (Rd ) R p or are drawn independenty rom the standard Then, whenever N d or each {L, L, L + }, the inear -RNN M returned by Agorithm with the east-squares method satisies M = with probabiity one. A ew remarks on this theorem are in order. The irst observation is that the datasets D L, D L and D L+ can either be drawn independenty or not (e.g. the sequences in D L can be preixes o the sequences in D L but it is not necessary). In particuar, the resut sti hods when the datasets D are constructed rom a unique dataset S = {((x (n), x(n),, x(n) T ), (y(n), y(n),, y(n) T ))}N n= o input/output sequences with T L +, where y (n) t = (x (n), x(n),, x(n) T ) or any t [T ]. Observe that having access to such input/output training sequences is not an unreaistic assumption: or exampe when training RNNs or anguage modeing the output y t is the next symbo conditiona probabiity vector, and or cassiication tasks the output is the one-hot encoded abe or a time steps. Lasty, when the outputs y n are noisy, one can sove the east-squares probem Y X(H () ), F to approximate the Hanke tensors; we wi empiricay evauate this approach in Section 5 and we deer its theoretica anaysis in the noisy setting to uture work. Whie the east-squares method is suicient to obtain the theoretica guarantees o Theorem, it does not everage the ow rank structure o the Hanke tensors H (). We now propose three aternative recovery methods to everage this structure, whose eiciency wi be assessed in a simuation study in Section 5 (deriving improved sampe compexity guarantees using these methods is et Note that the theorem can be adapted i such an integer L does not exists (see suppementary materia). 6

Agorithm RNN-SL: Spectra Learning o inear -RNNs Input: Three training datasets D L, D L, D L+ with input sequences o ength L, L and L + respectivey, a recovery_method, rank R and earning rate γ (or IHT/TIHT). : or {L, L, L + } do : From the dataset D = {((x (n), x(n) with rows x (n) x (n) x (n),, x(n) ), y n )} N n= (Rd ) R p, buid X R N d or n [N ] and Y R N p with rows y n or n [N ]. : i recovery_method = "Least-Squares" then 4: H () = arg min T R d d p X(T ), Y F. 5: ese i recovery_method = "Nucear Norm Minimization" then 6: H () = arg min T R d d p (T ) /, / + subject to X(T ), = Y. 7: ese i recovery_method = "IHT" or "TIHT" then 8: Initiaize H () R d d p to 0. 9: repeat 0: (H () ), = (H () ), + γx (Y X(H () ), ) : H () = project(h (), R) (using either SVD or IHT or TT-SVD or TIHT) : unti convergence : end i 4: end or 5: Let (H () ),+ = PS be a rank R actorization. 6: Return the inear -RNN (h 0, A, Ω) where α = (S ) (H () ), A = + ((H(+) ),,+ ) P (S ), Ω = P (H () ), or uture work). In the noiseess setting, we irst propose to repace soving the inear system Y = X(H () ), with a nucear norm minimization probem (see ine 6 o Agorithm ), thus everaging the act that (H () ) /, / + is potentiay o ow matrix rank. We aso propose to use iterative hard threshoding (IHT) [0] and its tensor counterpart TIHT [0], which are based on the cassica projected gradient descent agorithm and have shown to be robust to noise in practice. These two methods are impemented in ines 8- o Agorithm. There, the project method either projects (H () ) /, / + onto the maniod o ow rank matrices using SVD (IHT) or projects H () onto the maniod o tensors with TT-rank R (TIHT). 5 Experiments In this section, we perorm experiments on synthetic data to compare how the choice o the recovery method (LeastSquares, NucearNorm, IHT and TIHT) aects the sampe eiciency o the earning agorithm 4, and to assess the robustness o RNN-SL to noise and the robustness o the IHT and TIHT recovery methods to rank mis-speciication. We perorm two experiments. In the irst one, we randomy generated a inear -RNN with 5 units computing a unction : R R by drawing the entries o a parameters (h 0, A, Ω) independenty rom the norma distribution N (0, 0.). The training data consists o independenty drawn sets D = {((x (n), x(n),, x(n) ), y n )} N n= (Rd ) R p or {L, L, L + } with L =, where each x (n) i N (0, I) and where the outputs can be noisy, i.e. y n = (x (n), x(n),, x(n) ) + ξ n where ξ n N (0, σ ) or some noise variance σ. In the second experiment, the goa is to earn a simpe arithmetic unction that computes the sum o the running dierences between the two components o a sequence o -dimensiona vectors, i.e. (x,, x k ) = k i= v x i where v = ( ). The training datasets are generated using the same process as above and a constant entry equa to one is added to a the input vectors to encode a bias term (one can check that the resuting unction can be computed by a -RNN with hidden units). 4 The NucearNorm method was ony used in the noiseess setting (though it coud be extended to the noisy recovery setting in practice). 7

MSE MSE = 0 = 0. = 0.5 0 0 0 0 TIHT(R = 5) 0 0 TIHT(R = 6) 0 0 6 0 IHT(R = 5) 0 IHT(R = 6) LeastSquares 0 9 0 0 0 NucearNorm 0 0 0 0 0 0 0 0 Training Size Figure : Average MSE as a unction o the training set size or the irst experiment (earning a random inear -RNN) or dierent vaues o output noise. 0 0 0 0 6 0 9 0 = 0 0 = 0. 0 0 0 0 0 0 0 0 0 0 0 0 Training Size Figure 4: Average MSE as a unction o the training set size or the second experiment (earning a simpe arithmetic unction) or dierent vaues o output noise. 0 = TIHT(R = ) TIHT(R = 4) IHT(R = ) IHT(R = 4) LeastSquares NucearNorm We run the experiments or dierent sizes o datasets N ranging rom 0 to 5, 000 (we set N L = N L = N L+ = N) and we compare the dierent methods in terms o mean squared error 5 (MSE) on a test set o size 5, 000 containing sequences o ength 6 generated in the same way as the training data (note that the training data ony contains sequences o engths up to 5). The resuts are reported in Figure and 4 where we see that a recovery methods ead to consistent estimates o the target unction given enough training data. This is the case even in the presence o noise (in which case more sampes are needed to achieve the same accuracy, as expected). We can aso see that IHT and TIHT are overa more sampe eicient than the other methods (especiay with noisy data), showing that taking the ow rank structure o the Hanke tensors into account is proitabe. Moreover, TIHT tends to perorm better than its matrix counterpart, conirming our intuition that everaging the tensor train structure is beneicia. Lasty, as one can expect, when the rank parameter R is over-estimated IHT and TIHT sti converge to the target but they need more sampes (when the rank parameter was underestimated both agorithms did not earn at a). The running time o a methods were overa o the same order. 6 Concusion and Future Directions In this paper, we proposed the irst provabe earning agorithm or second-order RNNs with inear activation unctions. The eaboration o this meaningu earnabiity resut reied on showing that inear -RNNs are a natura extension o vv-wfas, and on extending the vv-wfa spectra earning agorithm to inear -RNNs. We beieve that the resuts presented in this paper open a number o exciting and promising research directions on both the theoretica and practica perspectives. We irst pan to use the spectra earning estimate as a starting point or gradient based methods to train - RNNs. More precisey, since inear -RNNs can be thought o as -RNN using LeakyReu activation unctions with negative sope, one can use a inear -RNN as initiaization beore graduay reducing the negative sope parameter during training. The extension o spectra earning to inear -RNNs aso opens the door to scaing up the cassica spectra method to probem with arge size discrete aphabets (which is a known caveat o the spectra agorithm or WFAs) since it aows one to use ow dimensiona embeddings o arge vocabuaries (using e.g. wordvec or atent semantic anaysis). From the theoretica perspective, we pan on deriving earning guarantees or inear -RNNs in the noisy setting (e.g. using the PAC earnabiity ramework). Even though it is intuitive that such guarantees shoud hod (given the continuity o a operations used in our agorithm), we beieve 5 Using the IHT/TIHT methods, it sedom happens on sma datasets that RNN-SL returns aberrant modes (due to numerica instabiities). We used the oowing scheme to circumvent this issue: when the training MSE o the hypothesis was greater than the one o the zero unction, the zero unction was returned instead. 8

that such an anaysis may entai resuts o independent interest. In particuar, anaogousy to the matrix case studied in [7], obtaining rate optima convergence rates or the recovery o the ow TT-rank Hanke tensors rom rank one measurements is an interesting direction; such a resut coud or exampe aow us to improve the generaization bounds provided in [5] or spectra earning o genera WFAs. 9

Reerences [] Dana Anguin. Queries and concept earning. Machine earning, (4):9 4, 988. [] Enes Avcu, Chihiro Shibata, and Jerey Heinz. Subreguar compexity and deep earning. arxiv preprint arxiv:705.05940, 07. [] Raphaë Baiy, François Denis, and Liva Raaivoa. Grammatica inerence as a principa component anaysis probem. In ICML, pages 40, 009. [4] Borja Bae, Xavier Carreras, Franco M Luque, and Ariadna Quattoni. Spectra earning o weighted automata. Machine earning, 96(-): 6, 04. [5] Borja Bae and Mehryar Mohri. Spectra earning o genera weighted automata via constrained matrix competion. In NIPS, pages 59 67, 0. [6] Byron Boots, Sajid M. Siddiqi, and Georey J. Gordon. Cosing the earning-panning oop with predictive state representations. Internationa Journa o Robotics Research, 0(7):954 966, 0. [7] T Tony Cai, Anru Zhang, et a. Rop: Matrix recovery via rank-one projections. The Annas o Statistics, 4():0 8, 05. [8] Andrew Critch. Agebraic geometry o hidden Markov and reated modes. PhD thesis, University o Caiornia, Berkeey, 0. [9] Andrew Critch and Jason Morton. Agebraic geometry o matrix product states. SIGMA, 0(095):095, 04. [0] François Denis and Yann Esposito. On rationa stochastic anguages. Fundamenta Inormaticae, 86(, ):4 77, 008. [] Carton Downey, Ahmed Heny, Byron Boots, Georey J Gordon, and Boyue Li. Predictive state recurrent neura networks. In NIPS, pages 6055 6066, 07. [] Herbert Federer. Geometric measure theory. Springer, 04. [] Feix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to orget: Continua prediction with LSTM. Neura Computation, (0):45 47, 000. [4] C Lee Gies, Ciord B Mier, Dong Chen, Hsing-Hen Chen, Guo-Zheng Sun, and Yee-Chun Lee. Learning and extracting inite state automata with second-order recurrent neura networks. Neura Computation, 4():9 405, 99. [5] C Lee Gies, Guo-Zheng Sun, Hsing-Hen Chen, Yee-Chun Lee, and Dong Chen. Higher order recurrent networks and grammatica inerence. In NIPS, pages 80 87, 990. [6] Hadrien Gaude and Oivier Pietquin. Pac earning o probabiistic automaton based on the method o moments. In ICML, pages 80 89, 06. [7] Aex Graves, Abde-rahman Mohamed, and Georey Hinton. Speech recognition with deep recurrent neura networks. In ICASSP, pages 6645 6649. IEEE, 0. [8] Christopher J Hiar and Lek-Heng Lim. Most tensor probems are np-hard. Journa o the ACM (JACM), 60(6):45, 0. [9] Danie J. Hsu, Sham M. Kakade, and Tong Zhang. A spectra agorithm or earning hidden markov modes. In COLT, 009. [0] Prateek Jain, Raghu Meka, and Inderjit S Dhion. Guaranteed rank minimization via singuar vaue projection. In NIPS, pages 97 945, 00. [] Vaentin Khrukov, Aexander Novikov, and Ivan Oseedets. Expressive power o recurrent neura networks. arxiv preprint arxiv:7.008, 07. [] Tamara G Koda and Brett W Bader. Tensor decompositions and appications. SIAM review, 5():455 500, 009. [] YC Lee, Gary Dooen, HH Chen, GZ Sun, Tom Maxwe, HY Lee, and C Lee Gies. Machine earning using a higher order correation network. Physica D: Noninear Phenomena, (- ):76 06, 986. [4] Tomáš Mikoov, Stean Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Extensions o recurrent neura network anguage mode. In ICASSP, pages 558 55. IEEE, 0. 0

[5] Christian W Omin and C Lee Gies. Constructing deterministic inite-state automata in recurrent neura networks. Journa o the ACM (JACM), 4(6):97 97, 996. [6] Ivan V Oseedets. Tensor-train decomposition. SIAM Journa on Scientiic Computing, (5):95 7, 0. [7] Jordan B Poack. The induction o dynamica recognizers. In Connectionist Approaches to Language Learning, pages 48. Springer, 99. [8] Guiaume Rabusseau. A Tensor Perspective on Weighted Automata, Low-Rank Regression and Agebraic Mixtures. PhD thesis, Aix-Marseie Université, 06. [9] Guiaume Rabusseau, Borja Bae, and Joee Pineau. Mutitask spectra earning o weighted automata. In NIPS, pages 585 594, 07. [0] Hoger Rauhut, Reinhod Schneider, and Žejka Stojanac. Low rank tensor recovery via iterative hard threshoding. Linear Agebra and its Appications, 5:0 6, 07. [] Hanie Sedghi and Anima Anandkumar. Training input-output recurrent neura networks through spectra methods. arxiv preprint arxiv:60.00954, 06. [] Hava T Siegemann and Eduardo D Sontag. On the computationa power o neura nets. In COLT, pages 440 449. ACM, 99. [] Michae Thon and Herbert Jaeger. Links between mutipicity automata, observabe operator modes and predictive state representations: a uniied earning ramework. Journa o Machine Learning Research, 6:0 47, 05. [4] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neura network with tensor train. In IJCNN, pages 445 4458. IEEE, 07. [5] Gai Weiss, Yoav Godberg, and Eran Yahav. Extracting automata rom recurrent neura networks using queries and counterexampes. arxiv preprint arxiv:7.09576, 07. [6] Yinchong Yang, Denis Krompass, and Voker Tresp. Tensor-train recurrent neura networks or video cassiication. arxiv preprint arxiv:707.0786, 07. [7] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term orecasting using tensor-train rnns. arxiv preprint arxiv:7.0007, 07.

A Proos A. Proo o Theorem Theorem. Any unction that can be computed by a vv-wfa with n states can be computed by a inear -RNN with n hidden units. Conversey, any unction that can be computed by a inear -RNN with n hidden units on sequences o one-hot vectors (i.e. canonica basis vectors) can be computed by a WFA with n states. More precisey, the WFA A = (α, {A σ } σ Σ, Ω) with n states and the inear -RNN M = (α, A, Ω) with n hidden units, where A R n Σ n is deined by A :,σ,: = A σ or a σ Σ, are such that A (σ σ σ k ) = M (x, x,, x k ) or a sequences o input symbos σ,, σ k Σ, where or each i [k] the input vector x i R Σ is the one-hot encoding o the symbo σ i. Proo. We irst show by induction on k that, or any sequence σ σ k Σ, the hidden state h k computed by M (see Eq. ()) on the corresponding one-hot encoded sequence x,, x k R d satisies h k = (A σ A σ k ) α. The case k = 0 is immediate. Suppose the resut true or sequences o ength up to k. One can check easiy check that A x i = A σi or any index i. Using the induction hypothesis it then oows that To concude, we thus have h k+ = A h k x k+ = A σ k+ h k = (A σ k+ ) h k = (A σ k+ ) (A σ A σ k ) α = (A σ A σ k+ ) α. M (x, x,, x k ) = Ωh k = Ω(A σ A σ k ) α = A (σ σ σ k ). A. Proo o Theorem Theorem. Let : (R d ) R p be a unction computed by a minima inear -RNN with n hidden units and et L be an integer such that rank((h (L) ) L,L+ ) = n. Then, or any P R dl n and S R n dlp such that (H (L) ) L,L+ = PS, the inear -RNN M = (α, A, Ω) deined by α = (S ) (H (L) ) L+, A = ((H (L+) ) L,,L+ ) P (S ), Ω = P (H (L) is a minima inear -RNN computing. Proo. Let P R dl n and S R n dlp be such that (H (L) ) L,L+ = PS Deine the tensors ) L, P = A α, A,, A, I }{{} n R d d n and S = I n, A,, A, Ω R n d d p }{{} L times L times o order L + and L + respectivey, and et P = (P ), R d n and S = (S ),L+ R n dp. Using the identity H (j) oowing identities: = A α, A,, A, Ω or any j, one can easiy check the }{{} j times (H (L) ) L,L+ = P S, (H (L+) ) L,,L+ = A P (S ), (H (L) ) L, = P (Ω ), (H (L) ) L+ = (S ) α. Let M = P P. We wi show that α = M α, A = A M M and Ω = MΩ, which wi entai the resuts since inear -RNN are invariant under change o basis (see Section ). First observe that M = S S. Indeed, we have P P S S = P (H () ),+ S = P PSS = I where we used the act that P (resp. S) is o u coumn rank (resp. row rank) or the ast equaity.

The oowing derivations then oow rom basic tensor agebra: α = (S ) (H (L) ) L+ = (S ) (S ) α = (S S ) = M α, A = ((H (L+) ) L,,L+ ) P (S ) = (A P (S ) ) P (S ) = A P P (S S ) = A M M, which concudes the proo. A. Proo o Theorem 4 Ω = P (H (L) ) L, = P P (Ω ) = MΩ, Theorem. Let (h 0, A, Ω) be a minima inear -RNN with n hidden units computing a unction : (R d ) R p, and et L be an integer 6 such that rank((h (L) ) L,L+ ) = n. Suppose we have access to datasets D = {((x (n), x(n) {L, L, L + } where the entries o each x (n) i norma distribution and where each y n = (x (n), x(n),, x(n) ).,, x(n) ), y n )} N n= (Rd ) R p or are drawn independenty rom the standard Then, whenever N d or each {L, L, L + }, the inear -RNN M returned by Agorithm with the east-squares method satisies M = with probabiity one. Proo. We just need to show or each {L, L, L + } that, under the hypothesis o the Theorem, the Hanke tensors Ĥ() computed in ine 4 o Agorithm are equa to the true Hanke tensors H () with probabiity one. Reca that these tensors are computed by soving the east-squares probem Ĥ () = arg min T R d d p X(T ), Y F where X R N d is the matrix with rows x (n) x (n) x (n) or each n [N ]. Since X(H () ), = Y and since the soution o the east-squares probem is unique as soon as X is o u coumn rank, we just need to show that this is the case with probabiity one when the entries o the vectors x (n) i are drawn at random rom a standard norma distribution. The resut wi then directy oow by appying Theorem. We wi show that the set S = {(x (n),, x(n) ) n [N ], dim(span({x (n) x (n) x (n) })) < d } has Lebesgue measure 0 in ((R d ) ) N R dn as soon as N d, which wi impy that it has probabiity 0 under any continuous probabiity, hence the resut. For any S = {(x (n),, x(n) )} N n=, we denote by X S R N d the matrix with rows x (n) x (n) x (n). One can easiy check that S S i and ony i X S is o rank stricty ess than d, which is equivaent to the determinant o X S X S being equa to 0. Since this determinant is a poynomia in the entries o the vectors x (n) i, S is an agebraic subvariety o R dn. It is then easy to check that the poynomia det(x S X S) is not uniormy 0 when N d. Indeed, it suices to choose the vectors x (n) i such that the amiy (x (n) x (n) x (n) ) N n= spans the whoe space Rd (which is possibe since we can choose arbitrariy any o the N d eements o this amiy), hence the resut. In concusion, S is a proper agebraic subvariety o R dn and hence has Lebesgue measure zero [, Section.6.5]. 6 Note that the theorem can be adapted i such an integer L does not exists (see suppementary materia).

B Liting the simpiying assumption We now show how a our resuts sti hod when there does not exist an L such that rank((h (L) ) L,L+ ) = n. Reca that this simpiying assumption oowed rom assuming that the sets P = S = [d] L orm a compete basis or the unction : [d] R p deined by (i i i k ) = (e i, e i,, e ik ). We irst show that there aways exists an integer L such that P = S = i L [d] i orms a compete basis or. Let M = (α, A, Ω ) be a inear -RNN with n hidden units computing (i.e. such that M = ). It oows rom Theorem and rom the discussion at the beginning o Section 4. that there exists a vv-wfa computing and it is easy to check that rank( ) = n. This impies rank((h ) () ) = n by Theorem. Since P = S = i [d] i converges to [d] as grows to ininity, there exists an L such that the inite sub-bock H R P S p o H R [d] [d] p satisies rank(( H ) () ) = n, i.e. such that P = S = i L [d] i orms a compete basis or. Now consider the inite sub-bocks H + R P [d] S p and H RP p o H deined by (H (L) ( H + ) u,i,v,: = (uiv), and( H ) u,: = (u) or any u P = S and any i [d]. One can check that Theorem hods by repacing mutatis mutandi ) L,L+ by ( H ) (), (H (+) ),,+ by H +, (H (L) ) L, by H by vec( H ). and (H(L) ) L+ To concude, it suices to observe that both H + and H can be constructed rom the entries or the tensors H () or L +, which can be recovered (or estimated in the noisy setting) using the techniques described in Section 4. o the main paper. We thus showed that inear -RNNs can be provaby earned even when there does not exist an L such that rank((h (L) ) L,L+ ) = n. In this setting, one needs to estimate enough o the tensors H () to reconstruct a compete sub-bock H o the Hanke tensor H (aong with the corresponding tensor H + and matrix H ) and recover the inear -RNN by appying Theorem. In addition, one needs to have access to suicienty arge datasets D or each [L + ] rather than ony the three datasets mentioned in Theorem 4. However the data requirement remains the same in the case where we assume that each o the datasets D is constructed rom a unique training dataset S = {((x (n), x(n),, x(n) ), (y(n), y(n),, y(n) ))}N n= o input/output sequences. T T 4