RECURRENT SUPPORT VECTOR MACHINES FOR SPEECH RECOGNITION

Shi-Xiong Zhang, Rui Zhao, Chaojun Liu, Jinyu Li and Yifan Gong


Microsoft Corporation, One Microsoft Way, Redmond, WA 98052
{zhashi, ruzhao, chaojunl, jinyli, ygong}@microsoft.com

ABSTRACT

Recurrent Neural Networks (RNNs) using the Long Short-Term Memory (LSTM) architecture have demonstrated state-of-the-art performance on speech recognition. Most deep RNNs use the softmax activation function in the last layer for classification. This paper illustrates a small but consistent advantage of replacing the softmax layer in the RNN with a Support Vector Machine (SVM). The parameters of the RNN and the SVM are jointly learned using a sequence-level max-margin criterion, instead of cross-entropy. The resulting model is termed the Recurrent SVM. Conventional SVMs need a predefined feature space and have no internal state to deal with arbitrarily long-term dependencies in sequences. The proposed Recurrent SVM uses an LSTM to learn the feature space and to capture temporal dependencies, while using the SVM (in the last layer) for sequence classification. The model is evaluated on the Windows Phone task for large vocabulary continuous speech recognition.

Index Terms — Deep learning, LSTM, SVM, maximum margin, sequence training

1. INTRODUCTION

Recurrent Neural Networks (RNNs) [1] and Structured SVMs [2, 3] are two successful models for sequence classification tasks such as speech recognition [3, 4]. RNNs are universal models in the sense that they can approximate any sequence-to-sequence mapping to arbitrary accuracy [5, 6]. However, RNNs have three major drawbacks. First, training usually requires solving a highly nonlinear optimization problem that has many local minima. Second, they tend to overfit limited data if training goes on too long [1]. Third, the number of neurons in an RNN is fixed empirically. Alternatively, the Support Vector Machine (SVM), which embodies the maximum-margin classifier idea [7], has attracted extensive research interest [8-10]. The SVM was originally proposed for binary classification; it can be extended to handle multiclass classification or sequence recognition. The SVMs modified for multiclass classification and sequence recognition are known as the multiclass SVM [11] and the structured SVM [2, 3, 12], respectively. The SVM has several prominent features. First, it has been proven that maximizing the margin is equivalent to minimizing an upper bound on the generalization error [7]. Second, the SVM optimization problem is convex, so it is guaranteed to have a globally optimal solution. Third, the size of an SVM model is determined by the number of support vectors [7], which is learned from the training data, instead of being fixed by design as in RNNs. However, SVMs are shallow architectures, whereas deep architectures of feedforward and recurrent neural networks have shown state-of-the-art performance in speech recognition [4, 13-15]. Recently, deep learning using SVMs in combination with CNNs/DNNs has shown promising results [16, 17].

In this work, a novel model using an SVM in combination with an RNN is proposed. Conventional RNNs use the softmax activation function at the last layer for classification. This work illustrates the advantages of replacing the softmax layer with an SVM. Two training algorithms, at frame and sequence level, are proposed to jointly learn the parameters of the SVM and the RNN under the max-margin criterion. The resulting model is termed the Recurrent SVM. Conventional SVMs use a predefined feature space and have no internal state to deal with arbitrarily long-term dependencies. The proposed Recurrent SVM uses an RNN to learn the feature space and to capture temporal dependencies, while using the SVM (in the last layer) for classification.
Its decoding process is similar to that of the RNN-HMM hybrid system, but with the frame-level posterior probabilities replaced by scores from the SVM. We verify its effectiveness on the Windows Phone task for speech recognition.

2. RECURRENT NEURAL NETWORKS

Given an input sequence X_{1:T} = x_1, ..., x_T, the simple RNN computes the hidden vectors h_1, ..., h_T by iterating the following equation for t = 1, ..., T,

    h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)

where W denotes a weight matrix (e.g., W_{xh} is the input-hidden weight matrix), b denotes a bias vector (e.g., b_h is the hidden bias vector), and \mathcal{H}(\cdot) is the recurrent hidden layer function. Denoting the corresponding state labels as s_1, ..., s_T, the state-label posterior for frame t is computed by the softmax

    P(s_t | X_{1:t}) = \frac{\exp(w_{s_t}^T h_t + b_{s_t})}{\sum_{s=1}^{N} \exp(w_s^T h_t + b_s)}    (2)
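As a concrete illustration of equations (1) and (2), the following numpy sketch runs the simple-RNN recursion with a sigmoid \mathcal{H}(\cdot) and turns the hidden states into per-frame posteriors. The function name, toy dimensions and random weights are assumptions made for illustration, not details taken from the paper.

```python
# A minimal sketch of equations (1) and (2): the simple RNN recursion and
# the per-frame softmax posterior over N state labels.
import numpy as np

def rnn_softmax_forward(X, W_xh, W_hh, b_h, W_out, b_out):
    """X: (T, d) input sequence; returns hidden states (T, h) and posteriors (T, N)."""
    T = X.shape[0]
    H = np.zeros((T, W_hh.shape[0]))
    h_prev = np.zeros(W_hh.shape[0])
    for t in range(T):
        # Equation (1): h_t = H(W_xh x_t + W_hh h_{t-1} + b_h), sigmoid H(.)
        h_prev = 1.0 / (1.0 + np.exp(-(W_xh @ X[t] + W_hh @ h_prev + b_h)))
        H[t] = h_prev
    # Equation (2): softmax over the N state labels for each frame
    logits = H @ W_out.T + b_out                 # (T, N)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return H, P

# Toy usage with assumed sizes: 5 frames, 3-dim features, 4 hidden units, 2 states.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
H, P = rnn_softmax_forward(X, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)),
                           np.zeros(4), rng.normal(size=(2, 4)), np.zeros(2))
```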

where N is the total number of labels and w_{s_t} is the weight vector connecting the hidden layer to the output state s_t (for simplicity, the state bias b_{s_t} is ignored for the rest of the paper).

\mathcal{H}(\cdot) in equation (1) is typically an element-wise sigmoid function. However, [18] found that using the Long Short-Term Memory (LSTM) architecture can alleviate the gradient vanishing/exploding issue. A simple LSTM memory block is illustrated in Figure 1. In the LSTM-RNN, the recurrent hidden layer function \mathcal{H}(\cdot) is implemented by the following composite function,

    z_t = \tanh(W_{zx} x_t + W_{zh} h_{t-1} + b_z)                      (block input)
    i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i)    (input gate)
    f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f)    (forget gate)
    c_t = i_t \odot z_t + f_t \odot c_{t-1}                             (cell state)      (3)
    o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o)        (output gate)
    h_t = o_t \odot \tanh(c_t)                                          (block output)

where the vectors i_t, f_t and o_t are the activations of the input, forget and output gates respectively, the W_{*x} are the weight matrices for the input, the W_{*h} are the weight matrices for the recurrent input, and the W_{*c} are the diagonal weight matrices for the peephole connections. \tanh(\cdot), \sigma(\cdot) and \odot are the element-wise hyperbolic tangent, logistic sigmoid and multiplication functions, respectively.
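A minimal numpy sketch of the composite function in equation (3), including the diagonal peephole connections, is given below. All weights here are random placeholders with assumed toy sizes; the sketch illustrates the recurrence only, not the paper's trained model.

```python
# One LSTM memory-block update following equation (3); peepholes are diagonal,
# so they act as element-wise multiplications by a vector.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W):
    """W is a dict of weight matrices/vectors keyed by the subscripts in (3)."""
    z = np.tanh(W['zx'] @ x + W['zh'] @ h_prev + W['bz'])                    # block input
    i = sigmoid(W['ix'] @ x + W['ih'] @ h_prev + W['ic'] * c_prev + W['bi']) # input gate
    f = sigmoid(W['fx'] @ x + W['fh'] @ h_prev + W['fc'] * c_prev + W['bf']) # forget gate
    c = i * z + f * c_prev                                                   # cell state
    o = sigmoid(W['ox'] @ x + W['oh'] @ h_prev + W['oc'] * c + W['bo'])      # output gate
    return o * np.tanh(c), c                                                 # block output

# Toy usage with assumed input size d=3 and hidden size n=4.
rng = np.random.default_rng(1)
d, n = 3, 4
W = {k: rng.normal(scale=0.1, size=(n, d)) for k in ('zx', 'ix', 'fx', 'ox')}
W.update({k: rng.normal(scale=0.1, size=(n, n)) for k in ('zh', 'ih', 'fh', 'oh')})
W.update({k: rng.normal(scale=0.1, size=n) for k in ('ic', 'fc', 'oc', 'bz', 'bi', 'bf', 'bo')})
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W)
```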
3. RECURRENT SVM

In the LSTM-RNN, the softmax normalization term in Eq. (2) is a constant across all labels; thus, it can be ignored during decoding. For example, given an input sequence X_{1:t}, the most likely state label at time t can be inferred by

    arg max_s log P(s | X_{1:t}) = arg max_s w_s^T h_t    (4)

For the multiclass SVM [11], the classification function is

    arg max_s w_s^T \phi(x_t)    (5)

where \phi(\cdot) is the predefined feature space of the SVM and w_s is the weight parameter for class/state s. If an RNN is used to derive the feature space, e.g., \phi(x_t) = h_t, the decoding of the multiclass SVM and of the RNN is the same. This inspires us to replace the softmax layer in the RNN with an SVM. The resulting model is named the Recurrent SVM. Its architecture is illustrated in Figure 1.

Fig. 1. The architecture of the Recurrent SVM with LSTM units. The dashed arrows illustrate the connections with a time lag. The SVM in the last layer can be a multiclass/structured SVM.

Note that RNNs can be trained using frame-level cross-entropy, connectionist temporal classification (CTC) [19] or sequence-level MMI/MBR criteria [15, 20]. In this work, two training algorithms using the max-margin criterion, at frame level (Section 3.1) and sequence level (Section 3.2), are proposed. Each algorithm consists of two iterative steps. The first step estimates the parameters of the SVM in the last layer using quadratic programming. The second step updates the parameters of the RNN in all previous layers using subgradient approaches.

3.1. Frame-level training (Recurrent multiclass SVM)

In the frame-level max-margin training, given training inputs and state labels {(x_t, s_t)}_{t=1}^T, let \phi(x_t) = h_t be the feature space derived from the RNN recurrent states. The parameters of the last layer are first estimated using the multiclass SVM training algorithm [11],

    min_{w, \xi}  \frac{1}{2} \sum_{s=1}^{N} ||w_s||_2^2 + C \sum_{t=1}^{T} \xi_t^2    (6)
    s.t. for every training frame t = 1, ..., T and every competing state s \in {1, ..., N}, s \neq s_t:
         w_{s_t}^T h_t - w_s^T h_t \geq 1 - \xi_t

where \xi_t \geq 0 is the slack variable which penalizes the data points that violate the margin requirement. Equation (6) basically says that the score of the correct state label, w_{s_t}^T h_t, has to be greater than the score of any other state, w_s^T h_t, by a margin. Substituting \xi_t from the constraints into the objective function, equation (6) can be reformulated as minimizing

    F_{frm}(w, W) = \frac{1}{2} \sum_{s=1}^{N} ||w_s||_2^2 + C \sum_{t=1}^{T} [1 - w_{s_t}^T h_t + \max_{s \neq s_t} w_s^T h_t]_+^2    (7)

where h_t depends on all the weights W in the previous layers shown in equation (3), and [x]_+ = max(0, x) is the hinge function.
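The following sketch evaluates the frame-level objective (7) in numpy, assuming the hidden states H have already been computed by the LSTM. The function name and inputs are illustrative assumptions; minimizing this objective in w (step one) is the convex quadratic program of [11], while the RNN weights enter only through H.

```python
# Toy evaluation of the squared-hinge frame-level objective in equation (7).
import numpy as np

def frame_objective(W_svm, H, labels, C=1.0):
    """W_svm: (N, h) last-layer weights; H: (T, h) RNN states; labels: (T,) ints."""
    T = len(labels)
    scores = H @ W_svm.T                            # (T, N): w_s^T h_t for all states s
    correct = scores[np.arange(T), labels]          # w_{s_t}^T h_t for the reference state
    scores[np.arange(T), labels] = -np.inf          # exclude the correct state from the max
    hinge = np.maximum(0.0, 1.0 - correct + scores.max(axis=1))
    return 0.5 * np.sum(W_svm ** 2) + C * np.sum(hinge ** 2)
```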

Note that the maximum of a set of linear functions is convex; thus equation (7) is convex w.r.t. w. The first step of Recurrent SVM training is to optimize w by minimizing (7). The second step is to optimize the parameters in the previous layers, e.g., W_{ih} in equation (3). These parameters can be updated by back-propagating the gradients from the top layer,

    \frac{\partial F_{frm}}{\partial W} = \sum_t \frac{\partial F_{frm}}{\partial h_t} \frac{\partial h_t}{\partial W}    (8)

Note that \partial h_t / \partial W is the same as in the standard RNN and can be computed using the BPTT algorithm. The key here is to compute the derivative of F_{frm} w.r.t. h_t. However, equation (7) is not differentiable because of the max(\cdot). To handle this, the subgradient method [21] is applied. Given the current SVM parameters w in the last layer, the subgradient of the objective function (7) w.r.t. h_t can be derived as

    \frac{\partial F_{frm}}{\partial h_t} = -2C [1 - w_{s_t}^T h_t + w_{\bar{s}_t}^T h_t]_+ (w_{s_t} - w_{\bar{s}_t})    (9)

where \bar{s}_t = arg max_{s \neq s_t} w_s^T h_t is the most competing state label. After this point, the BPTT algorithm works exactly the same as for the standard RNN. Note that, after the SVM training of the current iteration, most of the data may be classified correctly and lie beyond the margin, i.e., [1 - w_{s_t}^T h_t + w_{\bar{s}_t}^T h_t]_+ = 0 for these frames. Interestingly, this means that only the data located within the margin (known as the support vectors) have non-zero gradients.

3.2. Sequence-level training (Recurrent structured SVM)

In the max-margin sequence training, for simplicity, let us consider one training utterance (X, S), where X = x_1, ..., x_T is the input and S = s_1, ..., s_T is the reference state sequence. The parameters of the model can be estimated by maximizing the margin

    \min_{S' \neq S} \log \frac{P(S | X)}{P(S' | X)}

Here the margin is defined as the minimum distance between the reference state sequence and the competing state sequences in the log-posterior domain, as illustrated in Fig. 2. Note that, unlike in MMI/MBR sequence training, the normalization term of the posterior probability cancels out, as it appears in both numerator and denominator. For clarity, the language model probability is not shown here.

Fig. 2. The sequence-level max-margin criterion. The margin is defined in the log-posterior domain between the reference state sequence S and the most competing state sequence.

To generalize the above objective function, a loss function L(S, S') is introduced to control the size of the margin, a hinge function [\cdot]_+ is applied to ignore the data beyond the margin, and a prior P(w) is incorporated to further reduce the generalization error. Thus the criterion becomes minimizing

    -\log P(w) + \left[\max_{S' \neq S}\left(L(S, S') - \log \frac{P(S | X)}{P(S' | X)}\right)\right]_+^2    (10)

For the Recurrent SVM, the unnormalized log posterior of a state sequence can be computed via

    \sum_t \left(w_{s_t}^T h_t - \log P(s_t) + \log P(s_t | s_{t-1})\right) = w^T \phi(X, S)    (11)

where \phi(X, S) is the joint feature [22], which characterizes the dependencies between the two sequences X and S,

    \phi(X, S) = \left[\sum_t \delta(s_t = 1) h_t ; \ldots ; \sum_t \delta(s_t = N) h_t ; \sum_t \left(\log P(s_t | s_{t-1}) - \log P(s_t)\right)\right],
    w = [w_1 ; \ldots ; w_N ; 1]    (12)

where \delta(\cdot) is the Kronecker delta (indicator) function. Here the prior P(w) is assumed to be a Gaussian with zero mean and a scaled identity covariance matrix CI; thus -\log P(w) = -\log N(0, CI) \propto \frac{1}{2C} w^T w. Substituting the prior and equation (11) into criterion (10), the parameters of the Recurrent SVM can be estimated by minimizing (for simplicity, only one training utterance is shown)

    F_{seq}(w, W) = \frac{1}{2} ||w||_2^2 + C \left[\max_{S' \neq S}\left(L(S, S') + w^T \phi(X, S')\right) - w^T \phi(X, S)\right]_+^2    (13)

where \phi(\cdot, \cdot) depends on h_t, and h_t depends on all the weights W in the previous layers shown in equation (3). The first step of sequence training is to optimize w = [w_1, ..., w_s, ..., w_N] in the last layer by minimizing (13). Similar to F_{frm}, F_{seq} is also convex w.r.t. w. Interestingly, given the feature space \phi, equation (13) is the same as the training criterion of the structured SVM [12]. This optimization can be solved using the cutting plane algorithm [2]. The second step of sequence training is to optimize the parameters W in the previous layers, e.g., W_{ih} in equation (3). These parameters can be updated by back-propagating the gradients from the top layer,

    \frac{\partial F_{seq}}{\partial W} = \sum_t \frac{\partial F_{seq}}{\partial h_t} \frac{\partial h_t}{\partial W}    (14)

Again, \partial h_t / \partial W is the same as in the standard RNN. The key is to calculate the subgradient of F_{seq} w.r.t. h_t for each frame t,

    \frac{\partial F_{seq}}{\partial h_t} = -2C \left[L(S, \bar{S}) + w^T \bar{\phi} - w^T \phi\right]_+ (w_{s_t} - w_{\bar{s}_t})    (15)

where L is the loss between the reference S and its most competing state sequence \bar{S} = arg max_{S' \neq S} (L(S, S') + w^T \phi(X, S')), and \bar{\phi} is short for the feature \phi(X, \bar{S}).
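A sketch of the per-frame subgradient in equation (15) is given below, assuming the most competing sequence \bar{S} has already been found (e.g., by the lattice search of Section 3.4) and that its loss and the two sequence scores are passed in as scalars; all names are illustrative.

```python
# Per-frame subgradient of F_seq w.r.t. h_t, following equation (15).
import numpy as np

def seq_subgradient_wrt_h(w_states, H, S, S_bar, loss, score_ref, score_comp, C):
    """w_states: (N, h); H: (T, h); S, S_bar: (T,) int state sequences.
    score_ref = w^T phi(X, S); score_comp = w^T phi(X, S_bar).
    Returns dF_seq/dh_t for every frame t, stacked as a (T, h) array."""
    violation = max(0.0, loss + score_comp - score_ref)  # structured hinge term
    if violation == 0.0:
        return np.zeros_like(H)                          # utterance outside the margin
    # d(w^T phi)/dh_t = w_{s_t}, so the reference/competing difference appears per frame
    return -2.0 * C * violation * (w_states[S] - w_states[S_bar])
```

The early return makes the support-vector property explicit: an utterance whose reference score beats the loss-augmented competitor contributes nothing to the gradient.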
After this point, the BPTT algorithm works exactly the same as for the standard RNN. The hinge term [L(S, \bar{S}) + w^T \bar{\phi} - w^T \phi]_+ in equation (15) indicates that only the utterances located within the margin (the support vectors) have non-zero gradients.
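For completeness, the joint feature of equation (12) and the linear sequence score of equation (11) can be sketched as follows. The state priors and transition probabilities here are made-up placeholders, and the layout of \phi (per-state sums of h_t plus one prior/transition component with its weight fixed to 1) follows the reconstruction of (12) above.

```python
# Joint feature phi(X, S) of equation (12) and the score w^T phi(X, S) of (11).
import numpy as np

def joint_feature(H, S, log_prior, log_trans):
    """H: (T, h) hidden states; S: (T,) states; returns phi of size N*h + 1."""
    T, h = H.shape
    N = log_prior.shape[0]
    phi = np.zeros(N * h + 1)
    for t in range(T):
        phi[S[t] * h:(S[t] + 1) * h] += H[t]          # h_t routed to the slot of state s_t
        lm = log_trans[S[t - 1], S[t]] if t > 0 else 0.0
        phi[-1] += lm - log_prior[S[t]]               # transition minus prior term
    return phi

def sequence_score(w_states, H, S, log_prior, log_trans):
    """Equation (11): sum_t (w_{s_t}^T h_t - log P(s_t) + log P(s_t | s_{t-1}))."""
    w = np.concatenate([w_states.ravel(), [1.0]])     # last weight fixed to 1, as in (12)
    return w @ joint_feature(H, S, log_prior, log_trans)
```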

3.3. Inference

The decoding process is similar to that of the RNN-HMM hybrid system, but with the log posterior probabilities, log P(s_t | X_{1:t}), replaced by the scores from the Recurrent SVM, w_{s_t}^T h_t. Note that searching for the most likely state sequence during decoding is essentially the same as searching for the most competing state sequence in equation (13), except for the loss L(S, S'). Both can be solved using the same Viterbi algorithm.

3.4. Practical Issues

An efficient implementation of the algorithm is important for speech recognition. In this section, several design options for Recurrent SVM training are described that have a substantial influence on computational efficiency.

Form of prior. Previously, a zero-mean Gaussian prior was introduced. However, a more proper mean of the prior is one that can yield the LSTM performance, log P(w) = log N(w_RNN, CI). Thus the term \frac{1}{2} ||w||_2^2 in (7) and (13) becomes \frac{1}{2} ||w - w_RNN||_2^2. Since a better mean is applied, a smaller variance C can be used, which reduces the number of training iterations [23].

Lattices. Solving the optimization (13) requires searching for the most competing state sequence efficiently. The computational load during training is dominated by this search process. To speed up the training, denominator lattices with state alignments are used to constrain the search space. A lattice-based forward-backward search [3, 23] is then applied to find the most competing state sequence.

Caching. To further reduce the cost of searching for \bar{S} in (13), the five most recently used state sequences for each utterance are cached during the training iterations. This reduces the number of calls to the lattice search.
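To make the inference of Section 3.3 concrete, here is a toy Viterbi sketch in which the frame scores are the Recurrent SVM scores w_s^T h_t rather than log posteriors. The transition matrix stands in for the HMM/LM terms that the paper folds into the same search; the lattice constraints of Section 3.4 are omitted for brevity.

```python
# Viterbi search over SVM frame scores: frame_scores = H @ W_svm.T.
import numpy as np

def viterbi(frame_scores, log_trans):
    """frame_scores: (T, N); log_trans: (N, N). Returns the best state path."""
    T, N = frame_scores.shape
    delta = frame_scores[0].copy()                  # best score ending in each state
    back = np.zeros((T, N), dtype=int)              # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans           # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + frame_scores[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                   # trace back the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Adding the loss L(S, S') to the frame scores of non-reference states turns this same routine into the loss-augmented search needed for equation (13).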
4. EXPERIMENTS

The Windows Phone short message dictation task is used to evaluate the effectiveness of the Recurrent SVM proposed in Section 3. The training data consist of 60 hours of transcribed US-English speech. The test set consists of 3 hours of data and 16k words. All the experiments were implemented based on the Computational Network Toolkit (CNTK) [24]. 87-dimensional log-filter-bank features are used for the LSTM and the Recurrent SVM. The feature context window is 11 frames. The baseline DNN has 5 hidden layers; each layer includes 2048 hidden nodes. All the systems use 1812 tied triphone states. The baseline LSTM has four layers; each has 1024 hidden nodes, and the output size of each layer is reduced to 512 using a linear projection layer [25]. The state label is delayed by 5 frames for the LSTM, as described in [25]. No frame stacking was applied. During LSTM training, the back-propagation-through-time (BPTT) step is set to 20. The forget gate bias b_f in (3) is initialized to 1, following [26].

The Recurrent SVM is constructed based on the baseline LSTM (four LSTM layers), with an SVM in place of the last softmax layer. In the frame-level training, a scaled 0/1 loss is used in equation (6) and the CE-trained LSTM is used as the prior; the scalar is tuned on the cross-validation set. In the sequence training, the state loss [15] is applied in equation (13) and the MMI-trained LSTM is used as the prior.

Model          | Frame-level training | Sequence training
DNN            | 23.06% (CE)          | - (MMI)
LSTM-RNN       | 21.14% (CE)          | 20.40% (MMI)
Recurrent SVM  | 20.69% (MM)          | 19.83% (MM)

Table 1. The results (in word error rate) for the DNN, LSTM-RNN and Recurrent SVM systems, trained using frame-level and sequence-level criteria. CE, MMI and MM are short for the cross-entropy, maximum mutual information and maximum margin criteria.

Training       | Recurrent SVM: softmax layer | Recurrent SVM: previous layers | LSTM baseline
frame-level    | -                            | 20.69%                         | 21.14%
sequence-level | 20.19%                       | 19.83%                         | 20.40%

Table 2. The impact of each layer in the Recurrent SVM under the frame-level and sequence-level max-margin training.

The results of all systems are shown in Table 1. In frame-level training, the proposed Recurrent SVM improved over the LSTM (CE) by 2.1% and over the DNN (CE) by 10.3% relative. In sequence-level training, the Recurrent SVM outperformed the LSTM (MMI) by 2.8% and the DNN (MMI) by 6% relative. Table 2 suggests that, under both training criteria, more than 65% of the gains of the Recurrent SVM come from updating the recurrent layers. Note that, although the improvement of the Recurrent SVM over the LSTM is not large, it does not add any latency or computation cost to runtime decoding.

5. CONCLUSION AND FUTURE WORK

A new type of recurrent model is described in this work. Traditional RNNs use the softmax activation at the last layer for classification. The proposed model instead uses an SVM at the last layer. Two training algorithms, at frame and sequence level, are proposed to learn the parameters of the SVM and the RNN jointly under the maximum-margin criterion. In frame-level training, the new model is shown to be related to the multiclass SVM with RNN features; in sequence-level training, it is related to the structured SVM with RNN features and state-transition features. The proposed model, named the Recurrent SVM, yields a 2.8% improvement over the MMI-trained LSTM on the Windows Phone task without increasing runtime latency. Unlike conventional SVMs, which need a predefined feature space, the proposed Recurrent SVM uses an LSTM to learn the feature space and to capture temporal dependencies, while using a linear SVM in the last layer for classification. Future work will investigate the Recurrent SVM with non-linear kernels.

6. REFERENCES

[1] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] T. Joachims, T. Finley, and C.-N. J. Yu, "Cutting-plane training of structural SVMs," Machine Learning, vol. 77, no. 1, 2009.
[3] S.-X. Zhang and M. J. F. Gales, "Structured SVMs for automatic speech recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 3, 2013.
[4] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP. IEEE, 2013.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, 1986.
[6] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, 1989.
[7] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[8] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, 1998.
[9] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proceedings of the 10th European Conference on Machine Learning, 1998.
[10] J. Salomon, S. King, and M. Osborne, "Framewise phone classification using support vector machines," in Proceedings of Interspeech, 2002.
[11] K. Crammer and Y. Singer, "On the algorithmic implementation of multiclass kernel-based vector machines," Journal of Machine Learning Research, vol. 2, 2001.
[12] S.-X. Zhang and M. J. F. Gales, "Structured support vector machines for noise robust continuous speech recognition," in Proceedings of Interspeech, Florence, Italy, 2011.
[13] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, 2012.
[14] G. Hinton, L. Deng, D. Yu, G. E. Dahl, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
[15] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in INTERSPEECH, 2013.
[16] Y. Tang, "Deep learning using linear support vector machines," in International Conference on Machine Learning, 2013.
[17] S.-X. Zhang, C. Liu, K. Yao, and Y. Gong, "Deep neural support vector machines for speech recognition," in ICASSP. IEEE, April 2015.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, Nov. 1997.
[19] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[20] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in ICASSP. IEEE, 2015.
[21] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-gradient solver for SVM," in Proceedings of ICML, 2007.
[22] S.-X. Zhang, A. Ragni, and M. J. F. Gales, "Structured log linear models for noise robust speech recognition," IEEE Signal Processing Letters, vol. 17, 2010.
[23] S.-X. Zhang, Structured Support Vector Machines for Speech Recognition, Ph.D. thesis, Cambridge University, March 2014.
[24] D. Yu, A. Eversole, M. Seltzer, K. Yao, et al., "An introduction to computational networks and the computational network toolkit," Tech. Rep., Microsoft Research, 2014.
[25] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014.
[26] R. Jozefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in ICML, 2015.
