ON THE COMPRESSION OF RECURRENT NEURAL NETWORKS WITH AN APPLICATION TO LVCSR ACOUSTIC MODELING FOR EMBEDDED SPEECH RECOGNITION


Rohit Prabhavalkar*, Ouais Alsharif*, Antoine Bruguier, Ian McGraw

Google Inc.

ABSTRACT

We study the problem of compressing recurrent neural networks (RNNs). In particular, we focus on the compression of RNN acoustic models, which are motivated by the goal of building compact and accurate speech recognition systems which can be run efficiently on mobile devices. In this work, we present a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices. We find that the proposed technique allows us to reduce the size of our Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.

Index Terms: model compression, LSTM, RNN, SVD, embedded speech recognition

1. INTRODUCTION

Neural networks (NNs) with multiple feed-forward [1, 2] or recurrent hidden layers [3, 4] have emerged as state-of-the-art acoustic models (AMs) for automatic speech recognition (ASR) tasks. Advances in computational capabilities, coupled with the availability of large annotated speech corpora, have made it possible to train NN-based AMs with a large number of parameters [5] with great success. As speech recognition technologies continue to improve, they are becoming increasingly ubiquitous on mobile devices: voice assistants such as Apple's Siri, Microsoft's Cortana, Amazon's Alexa and Google Now [6] enable users to search for information using their voice. Although the traditional mode for these applications has been to recognize speech remotely on large servers, there has been growing interest in developing ASR technologies that can recognize the input speech directly on-device [7]. This has the promise of reducing latency while enabling user interaction even in cases where a mobile data connection is either unavailable, slow or unreliable. Some of the main challenges in this regard are the disk, memory and computational constraints imposed by these devices. Since the number of operations in neural networks is proportional to the number of model parameters, compressing the model is desirable from the point of view of reducing memory usage and power consumption.

* Equal contribution. The authors would like to thank Haşim Sak and Raziel Alvarez for helpful comments and suggestions on this work, and Chris Thornton and Yu-hsin Chen for comments on an earlier draft.

In this paper, we study techniques for compressing recurrent neural networks (RNNs), specifically RNN acoustic models. We demonstrate how a generalization of conventional inter-layer matrix factorization techniques (e.g., [8, 9]), where we jointly compress both recurrent and inter-layer weight matrices, allows us to compress acoustic models to up to a third of their original size with negligible loss in accuracy. While we focus on acoustic modeling, the techniques presented here can be applied to RNNs in other domains, e.g., handwriting recognition [10] and machine translation [11], inter alia. The technique presented in this paper encompasses both traditional recurrent neural networks (RNNs) as well as Long Short-Term Memory (LSTM) neural networks.

In Section 2, we review previous work that has focused on techniques for compressing neural networks. Our proposed compression technique is presented in Section 3. We examine the effectiveness of the proposed techniques in Sections 4 and 5. Finally, we conclude with a discussion of our findings in Section 6.

2. RELATED WORK

There have been a number of previous proposals to compress neural networks, both in the context of ASR as well as in the broader field of machine learning.
We summarize a number of proposed approaches in this section. It has been noted in previous work that there is a large amount of redundancy in the parameters of a neural network. For example, Denil et al. [12] show that the entire neural network can be reconstructed given the values of a small number of its parameters. Caruana and colleagues show that the output distribution learned by a larger neural network can be approximated by a neural network with fewer parameters, by training the smaller network to directly predict the outputs of the larger network [13, 14]. This approach, termed model compression [13], is closely related to the recent distillation approach proposed by Hinton et al. [15]. The redundancy in a neural network has also been exploited in the HashedNets approach of Chen et al. [16], which imposes parameter tying in the network based on a set of hash functions.

In the context of ASR, previous approaches to acoustic model compression have focused mainly on the case of feed-forward DNNs. One popular technique is based on sparsifying the weight matrices in the neural network, for example by setting weights whose magnitude falls below a certain threshold to zero [1], or based on the second derivative of the loss function, as in the optimal brain damage procedure [17]. In fact, Seide et al. [1] demonstrate that up to two-thirds of the weights of the feed-forward network can be set to zero without incurring any loss in performance. Although techniques based on sparsification do decrease the number of effective weights, encoding the subset of weights which can be zeroed out requires additional memory. Further, if the weight matrices are represented as dense matrices for efficient computation, then the parameter savings on disk will not translate into savings of runtime memory. Other techniques reduce the number of model parameters by changing the neural network architecture, e.g., by introducing bottleneck layers [18] or through a low-rank matrix factorization layer [19]. We also note recent work by Wang et al. [20], which uses a combination of singular value decomposition (SVD) and vector quantization to compress acoustic models.

The methods investigated in our work are most similar to previous work that has examined using SVD to reduce the number of parameters in the network in the context of feed-forward DNNs [8, 9, 21]. As we describe in Section 3, our methods can be thought of as an extension of the techniques proposed by Xue et al. [8], wherein we jointly factorize both recurrent and (non-recurrent) inter-layer weight matrices in the network.

3. MODEL COMPRESSION

In this section, we present a general technique for compressing individual recurrent layers in a recurrent neural network, thus generalizing the methods proposed by Xue et al. [8]. We describe our approach in the most general setting of a standard RNN. We denote the activations of the l-th hidden layer, consisting of N_l nodes, at time t by h_t^l ∈ R^{N_l}. The inputs to this layer at time t (which are in turn the activations from the previous layer, or the input features) are denoted by h_t^{l-1} ∈ R^{N_{l-1}}. We can then write the following equations, which define the output activations of the l-th and (l+1)-th layers in a standard RNN:

  h_t^l = σ(W_x^{l-1} h_t^{l-1} + W_h^l h_{t-1}^l + b^l)                    (1)
  h_t^{l+1} = σ(W_x^l h_t^l + W_h^{l+1} h_{t-1}^{l+1} + b^{l+1})            (2)

where b^l ∈ R^{N_l} and b^{l+1} ∈ R^{N_{l+1}} represent bias vectors, σ(·) denotes a non-linear activation function, and W_x^l ∈ R^{N_{l+1} × N_l} and W_h^l ∈ R^{N_l × N_l} denote weight matrices that we refer to as the inter-layer and the recurrent weight matrices, respectively.¹

¹ The equations are slightly more complicated when using LSTM cells in the recurrent layer, but the basic form remains the same. See Section 3.1.

Fig. 1. The initial model (Figure (a)) is compressed by jointly factorizing recurrent (W_h^l) and inter-layer (W_x^l) matrices, using a shared recurrent projection matrix (P^l) [3] (Figure (b)).

Since our proposed approach can be applied independently for each recurrent hidden layer, we only describe the compression operations for a particular layer. We jointly compress the recurrent and inter-layer matrices corresponding to a specific layer by determining a suitable recurrent projection matrix [3], denoted by P^l ∈ R^{r_l × N_l}, of rank r_l < N_l, such that W_h^l = Z_h^l P^l and W_x^l = Z_x^l P^l, thus allowing us to re-write (1) and (2) as:

  h_t^l = σ(W_x^{l-1} h_t^{l-1} + Z_h^l P^l h_{t-1}^l + b^l)                (3)
  h_t^{l+1} = σ(Z_x^l P^l h_t^l + W_h^{l+1} h_{t-1}^{l+1} + b^{l+1})        (4)

where Z_h^l ∈ R^{N_l × r_l} and Z_x^l ∈ R^{N_{l+1} × r_l}. This compression process is depicted graphically in Figure 1. We note that sharing P^l across the recurrent and inter-layer matrices allows for a more efficient parameterization of the weight matrices; as shown in Section 5, this does not result in a significant loss of performance.
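Before turning to how P^l is determined, the parameter accounting behind (1)-(4) can be made concrete in code. The sketch below is our own illustration rather than the authors' implementation; the layer width, the rank, and the use of tanh for σ are hypothetical choices:

```python
import numpy as np

N, r = 500, 100                 # hypothetical layer width N_l and rank r_l
rng = np.random.default_rng(0)

h_in = rng.standard_normal(N)   # h_t^{l-1}: activations from the layer below
h_rec = rng.standard_normal(N)  # h_{t-1}^l: this layer's previous time step
b = np.zeros(N)

# Uncompressed step, eq. (1): a full N x N recurrent matrix W_h^l.
W_x_in = rng.standard_normal((N, N))  # W_x^{l-1}
W_h = rng.standard_normal((N, N))     # W_h^l
h_t = np.tanh(W_x_in @ h_in + W_h @ h_rec + b)

# Compressed step, eq. (3): W_h^l = Z_h^l P^l, with P^l also shared by this
# layer's inter-layer matrix W_x^l = Z_x^l P^l (eq. (4), not shown).
Z_h = rng.standard_normal((N, r))     # Z_h^l
P = rng.standard_normal((r, N))       # P^l
h_t_c = np.tanh(W_x_in @ h_in + Z_h @ (P @ h_rec) + b)

# Recurrent parameters: N*N = 250000 versus 2*N*r = 100000 at r = 100;
# compressing W_x^l with the same P^l saves a further N_{l+1}*(N_l - r_l).
print(W_h.size, Z_h.size + P.size)
```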
Thus, the degree of compression in the model can be controlled by setting the ranks r_l of the projection matrices in each of the layers of the network. We determine the recurrent projection matrix P^l by first computing an SVD of the recurrent weight matrix, which we then truncate, retaining only the top r_l singular values (denoted by Σ̃^l) and the corresponding singular vectors from U^l and V^l (denoted by Ũ^l and Ṽ^l, respectively):

  W_h^l = U^l Σ^l (V^l)^T ≈ (Ũ^l Σ̃^l) (Ṽ^l)^T = Z_h^l P^l                  (5)

where Z_h^l = Ũ^l Σ̃^l and P^l = (Ṽ^l)^T. Finally, we determine Z_x^l as the solution to the following least-squares problem:

  Z_x^l = argmin_Y ||Y P^l - W_x^l||_F^2                                    (6)

where ||X||_F denotes the Frobenius norm of the matrix X.
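A minimal sketch of the compression itself (again ours, not code released with the paper): NumPy returns singular values in non-increasing order, so the truncation in (5) is a slice, and since P^l has orthonormal rows, the least-squares problem (6) has the closed-form solution W_x^l (P^l)^T:

```python
import numpy as np

def compress_layer(W_h, W_x, r):
    """Jointly factorize the recurrent matrix W_h (N x N) and the inter-layer
    matrix W_x (N_next x N) through a shared rank-r projection P, per (5)-(6).
    For an LSTM layer, W_h would first be built by stacking the four gate
    matrices vertically, e.g. np.vstack([W_im, W_om, W_fm, W_cm])."""
    # Eq. (5): truncated SVD; NumPy returns singular values non-increasing.
    U, s, Vt = np.linalg.svd(W_h, full_matrices=False)
    Z_h = U[:, :r] * s[:r]   # Ũ Σ̃, shape (N, r)
    P = Vt[:r, :]            # Ṽ^T, shape (r, N)
    # Eq. (6): Z_x = argmin_Y ||Y P - W_x||_F^2. Because P P^T = I (the rows
    # of V^T are orthonormal), the minimizer has the closed form W_x P^T.
    Z_x = W_x @ P.T          # shape (N_next, r)
    return Z_h, P, Z_x

rng = np.random.default_rng(0)
W_h, W_x = rng.standard_normal((500, 500)), rng.standard_normal((500, 500))
Z_h, P, Z_x = compress_layer(W_h, W_x, r=100)
# Relative reconstruction error of the recurrent matrix at rank 100:
print(np.linalg.norm(Z_h @ P - W_h) / np.linalg.norm(W_h))
```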

In pilot experiments, we found that the proposed SVD-based initialization performed better than training a model with recurrent projection matrices (i.e., the same model architecture) but with random initialization of the network weights.

3.1. Applying our technique to LSTM RNNs

Generalizing the procedure described above in the context of standard RNNs to the case of LSTM RNNs [3, 22, 23] is straightforward. Using the notation in [3], note that the recurrent weight matrix W_h^l in the case of the LSTM is the concatenation of the four gate weight matrices, obtained by stacking them vertically: [W_im, W_om, W_fm, W_cm]^T, which represent, respectively, the recurrent connections to the input gate, the output gate, the forget gate and the cell state. Similarly, the inter-layer matrix W_x^l is the concatenation of the matrices [W_ix, W_fx, W_ox, W_cx]^T, which correspond to the input gate, the forget gate, the output gate and the cell state (of the next layer). With these definitions, compression can be applied as described in Section 3. Note that we do not compress the peephole weights, since they are already narrow, single-column matrices and do not contribute significantly to the total number of parameters in the network.

4. EXPERIMENTAL SETUP

In order to determine the effectiveness of the proposed RNN compression technique, we conduct experiments on an open-ended large-vocabulary dictation task. As mentioned in Section 1, one of our primary motivations for investigating acoustic model compression is to build compact acoustic models that can be deployed on mobile devices.

In recent work, Sak et al. have demonstrated that deep LSTM-based AMs trained to predict either context-independent (CI) phoneme targets [22] or context-dependent (CD) phoneme targets [23] approach state-of-the-art performance on speech tasks. These systems have two important characteristics: in addition to the CI or CD phoneme labels, the system can also hypothesize a blank label if it is unsure of the identity of the current phoneme, and the systems are trained to optimize the connectionist temporal classification (CTC) criterion [24], which maximizes the total probability of the correct label sequence conditioned on the input sequence. More details can be found in [22, 23]. Following [22], our baseline model is thus a CTC model: a five-hidden-layer RNN with 500 LSTM cells in each layer, which predicts 41 CI phonemes (plus blank). As a point of comparison, we also present results obtained using a much larger state-of-the-art server-sized model, which is too large to deploy on embedded devices but nonetheless serves as an upper bound on performance for our models on this dataset. This model consists of five hidden layers with 600 LSTM cells per layer, and is trained to predict one of 9287 context-dependent phonemes (plus blank).

Our systems are trained using distributed asynchronous stochastic gradient descent with a parameter server [25]. The systems are first trained to convergence to optimize the CTC criterion, after which they are discriminatively sequence trained to optimize the state-level minimum Bayes risk (sMBR) criterion [26, 27]. As discussed in Section 5, after applying the proposed compression scheme, we further fine-tune the network: first with the CTC criterion, followed by sequence discriminative training with the sMBR criterion. This additional fine-tuning step was found to be necessary to achieve good performance, particularly as the amount of compression was increased.
The language model used in this work is a 5-gram model trained on 100M sentences of in-domain data, with entropy-based pruning applied to reduce the size of the LM down to roughly 1.5M n-grams (mainly bigrams) with a 64K vocabulary. Since our goal is to build a recognizer that runs efficiently on mobile devices, we minimize the size of the decoder graph used for recognition, following the approach outlined in [7]: we perform an additional pruning step to generate a much smaller first-pass language model (69.5K n-grams; mainly unigrams), which is composed with the lexicon transducer to construct the decoder graph. We then perform on-the-fly rescoring with the larger LM. The resulting models, when compressed for use on-device, total about 20.3 MB, thus enabling them to be run many times faster than real-time on recent mobile devices [28].

We parameterize the input acoustics by computing 40-dimensional log mel-filterbank energies over the 8kHz range, which are computed every 10ms over 25ms windowed speech segments. The server-sized system uses 80-dimensional features computed over the same range, since this resulted in slightly improved performance. Following [23], we stabilize CTC training by stacking together 8 consecutive speech frames (7 right-context frames); only every third stacked frame is presented as an input to the network.
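A short sketch of this frame-stacking step as we read it (our illustration; the feature shapes follow the text, but the end-of-utterance padding policy is an assumption the paper does not specify):

```python
import numpy as np

def stack_frames(feats, right_context=7, stride=3):
    """feats: (T, 40) array of log mel-filterbank features.
    Stacks each frame with 7 frames of right context (8 frames, 320 dims)
    and keeps only every third stacked frame, as described above."""
    T, _ = feats.shape
    # Assumed padding: repeat the final frame so late frames keep full context.
    padded = np.concatenate([feats, np.repeat(feats[-1:], right_context, axis=0)])
    stacked = np.stack([padded[t:t + right_context + 1].ravel() for t in range(T)])
    return stacked[::stride]

utt = np.random.randn(100, 40)   # a 1-second utterance at 10 ms frame rate
print(stack_frames(utt).shape)   # (34, 320)
```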

4.1. Training and Evaluation Data

Our systems are trained on 3M hand-transcribed anonymized utterances extracted from Google voice search traffic (~2000 hours). We create multi-style training data by synthetically distorting utterances to simulate background noise and reverberation, using a room simulator with noise samples extracted from YouTube videos and environmental recordings of everyday events; 20 distorted examples are created for each utterance in the training set. Systems are additionally adapted using the sMBR criterion [26, 27] on a set of 1M anonymized hand-transcribed (in-domain) dictation utterances extracted from Google traffic, processed to generate multi-style training data as described above, which improves performance on our dictation task. All results are reported on a set of 13.3K hand-transcribed anonymized utterances extracted from Google traffic from an open-ended dictation domain.

5. RESULTS

In our experiments, we seek to determine the impact of the proposed joint SVD-based compression technique on system performance. In particular, we are interested in determining how system performance varies as a function of the degree of compression, which is controlled by setting the ranks r_l of the recurrent projection matrices as described in Section 3. Notice that since the proposed compression scheme is applied to all hidden layers of the baseline system, there are numerous settings of the ranks r_l for the projection matrices in each layer which result in the same number of total parameters in the compressed network. In order to avoid this ambiguity, we set the various projection ranks using the following criterion: given a threshold τ, for each layer we set the rank r_l of the corresponding projection matrix such that it corresponds to retaining a fraction of at most τ of the explained variance after the truncated SVD of W_h^l. More specifically, if the singular values in Σ^l in (5) are sorted in non-increasing order as σ_1^l ≥ σ_2^l ≥ ... ≥ σ_{N_l}^l, we set each r_l as:

  r_l = argmax_{1 ≤ k ≤ N_l} { k : ( Σ_{j=1}^{k} (σ_j^l)^2 ) / ( Σ_{j=1}^{N_l} (σ_j^l)^2 ) ≤ τ }    (7)

Choosing the projection ranks using (7) allows us to control the degree of compression, and thus the compressed model size, by varying a single parameter, τ. In pilot experiments, we found that this scheme performed better than setting the ranks to be equal for all layers (given the same total parameter budget). Once the projection ranks r_l have been determined for the various projection matrices, we fine-tune the compressed models by first optimizing the CTC criterion, followed by sequence training with the sMBR criterion and adaptation on in-domain data as described in Section 4.1.

The results of our experiments are presented in Table 1. As can be seen in Table 1, the baseline system which predicts CI phoneme targets is only 10% relative worse than the larger server-sized system, although it has half as many parameters. Since the ranks r_l are all chosen to retain a given fraction of the explained variance in the SVD operation, we also note that earlier hidden layers in the network appear to have lower ranks than later layers, since most of the variance is accounted for by a smaller number of singular values. It can be seen from Table 1 that word error rates increase as the amount of compression is increased, although the performance of the compressed systems is close to the baseline for moderate compression (τ ≥ 0.7). Using a value of τ = 0.6 enables the model to be compressed to a third of its original size, with only a small degradation in accuracy. However, performance begins to degrade significantly for τ ≤ 0.5. Future work will consider alternative techniques for setting the projection ranks r_l in order to examine their impact on system performance.

  System      Projection ranks, r_l       Params   WER
  server      -                           ...M     11.3
  baseline    -                           9.7M     12.4
  τ = 0.9     ..., 375, 395, 405, ...     ...M     12.3
  τ = 0.8     ..., 305, 335, 345, ...     ...M     12.5
  τ = 0.7     ..., 215, 245, 260, ...     ...M     12.5
  τ = 0.6     ..., 150, 180, 195, ...     ...M     12.6
  τ = 0.5     ..., 105, 130, 145, ...     ...M     12.9
  τ = 0.4     ..., 70, 90, 100, ...       ...M     13.2
  τ = 0.3     ..., 45, 55, 65, ...        ...M     14.4

Table 1. Word error rates (%) on the test set as a function of the fraction of explained variance retained (τ) after the SVDs of the recurrent weight matrices W_h^l in the hidden layers of the RNN.
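Equation (7) reduces to a cumulative sum over squared singular values; a sketch (ours) of how the per-layer rank could be selected for a given τ:

```python
import numpy as np

def choose_rank(s, tau):
    """Per eq. (7): the largest k whose top-k squared singular values retain
    at most a fraction tau of the total explained variance.
    s: singular values sorted in non-increasing order."""
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)  # non-decreasing in k
    return max(1, int(np.searchsorted(frac, tau, side='right')))

rng = np.random.default_rng(0)
W_h = rng.standard_normal((500, 500))
s = np.linalg.svd(W_h, compute_uv=False)       # sorted non-increasing
for tau in (0.9, 0.7, 0.5):
    print(tau, choose_rank(s, tau))            # rank shrinks with tau
```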
6. CONCLUSIONS

We presented a technique to compress RNNs using a joint factorization of recurrent and inter-layer weight matrices, generalizing previous work [8]. The proposed technique was applied to the task of compressing LSTM RNN acoustic models for embedded speech recognition, where we found that we could compress our baseline acoustic model to a third of its original size with negligible loss in accuracy. The proposed techniques, in combination with weight quantization, allow us to build a small and efficient speech recognizer that runs many times faster than real-time on recent mobile devices [28].

7. REFERENCES

[1] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. of Interspeech, 2011.

[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.

[3] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. of Interspeech, 2014.

[4] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. of ICASSP, 2015.

[5] L. Deng and D. Yu, "Deep learning: Methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, 2014.

[6] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, "'Your Word is my Command': Google search by voice: A case study," in Advances in Speech Recognition, Springer US, 2010.

[7] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," in Proc. of Interspeech, 2013.

[8] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. of Interspeech, 2013.

[9] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in Proc. of ICASSP, 2014.

[10] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.

[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. of NIPS, 2014.

[12] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, "Predicting parameters in deep learning," in Proc. of NIPS, 2013.

[13] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[14] L. J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Proc. of NIPS, 2014.

[15] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint, 2015.

[16] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. of ICML, 2015.

[17] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. of NIPS, 1989.

[18] F. Grézl and P. Fousek, "Optimizing bottle-neck features for LVCSR," in Proc. of ICASSP, March 2008.

[19] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. of ICASSP, 2013.

[20] Y. Wang, J. Li, and Y. Gong, "Small-footprint high-performance deep neural network-based speech recognition using split-VQ," in Proc. of ICASSP, 2015.

[21] P. Nakkiran, R. Alvarez, R. Prabhavalkar, and C. Parada, "Compressing deep neural networks using a rank-constrained topology," in Proc. of Interspeech, 2015.

[22] H. Sak, A. Senior, K. Rao, O. İrsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in Proc. of ICASSP, 2015.

[23] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," in Proc. of Interspeech, 2015.

[24] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. of ICML, 2006.

[25] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in Proc. of NIPS, 2012.

[26] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. of ICASSP, 2009.
[27] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence discriminative distributed training of long short-term memory recurrent neural networks," in Proc. of Interspeech, 2014.

[28] I. McGraw, R. Prabhavalkar, R. Alvarez, M. Gonzalez Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, and C. Parada, "Personalized speech recognition on mobile devices," in Proc. of ICASSP, 2016.
