Fast Tree-Structured Recursive Neural Tensor Networks


Anand Avati, Nai-Chia Chen
Stanford University
Project TA: Youssef Ahres

1 Introduction

In this project we explore different ways in which we can optimize the computation of training a Tree-structured RNTN, in particular batching techniques for combining many matrix-vector multiplications into matrix-matrix multiplications, and many tensor-vector operations into tensor-matrix operations. We assume that training is performed using the mini-batch AdaGrad algorithm, and explore how we can exploit the presence of multiple examples to batch computation across the mini-batch as a whole. We apply our optimization techniques to the forward propagation phase and the backpropagation phase, and run the batched operations on GPUs. Our goal is to speed up the execution of Tree-structured RNNs so that runtime performance is no longer a limiting factor for their adoption. We use the Stanford CoreNLP project, which has a Java implementation of the RNTN, as our baseline; all our implementation and experiments build on it.

2 Background: Recursive Neural Tensor Networks

The Recursive Neural Tensor Network (RNTN) is a model for semantic compositionality proposed by Socher et al. [1]. This network has been successfully applied to sentiment analysis, where the input is a sentence in its parse tree structure, and the output is a classification of the input sentence, i.e., whether the meaning is very negative, negative, neutral, positive, or very positive.

2.1 Forward propagation

In the forward phase, each input is a sentence in its parse tree structure (see Figure 1). Each word in the input sentence is converted to a $d$-dimensional word vector through a word embedding matrix $L \in \mathbb{R}^{d \times |V|}$, where $|V|$ is the size of the vocabulary. Furthermore, each word vector is converted to a $d$-dimensional node vector through the element-wise tanh function and stored in the leaf node of the tree. The node vectors of internal nodes are then computed in a bottom-up fashion as follows. Let $x_i$ be the parent node of its left and right child nodes $x_l, x_r$; then the node vector at $x_i$ is defined as

$$x_i = \tanh\!\left(\begin{bmatrix} x_l \\ x_r \end{bmatrix}^T V \begin{bmatrix} x_l \\ x_r \end{bmatrix} + W \begin{bmatrix} x_l \\ x_r \end{bmatrix}\right) \in \mathbb{R}^d, \quad (1)$$

where $W \in \mathbb{R}^{d \times 2d}$ is the weight matrix, $V \in \mathbb{R}^{2d \times 2d \times d}$ is the weight tensor (the bilinear form is applied slice by slice), and tanh applies to vectors element-wise. Each node in the tree is now a $d$-dimensional vector. The network predicts the meaning (very negative, negative, neutral, positive, or very positive) of every node by producing a probability vector $y_i \in \mathbb{R}^5$ defined as

$$y_i = \mathrm{softmax}(W_s x_i) \in \mathbb{R}^5, \quad (2)$$

where $W_s \in \mathbb{R}^{5 \times d}$ is the sentiment classification matrix. Moreover, each node $x_i$ is associated with a ground truth (or target) vector $t_i \in \mathbb{R}^5$, a binary vector whose $j$-th component is 1 if $j$ is the correct label and 0 in all other components. In order to minimize the KL-divergence between the predicted distribution $y_i$ and the true distribution $t_i$, the error function with regularization is defined as

$$E(\theta) = -\sum_i \langle t_i, \log y_i \rangle + \lambda \|\theta\|^2,$$

where the model parameters are $\theta = (L, W, W_s, V)$, and the log function applies to $y_i$ element-wise.

[Figure 1: Each training example is a sentence in its parse tree structure; the example shown is the parse of "The movie is fantastic".]
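To make the forward pass concrete, here is a minimal NumPy sketch of equations (1) and (2) for a single internal node. This is an illustration only, not the authors' Java/Nd4j implementation; the dimension $d = 4$ and the random parameter values are made up for the example.

```python
# Illustrative sketch of equations (1) and (2) for one node.
import numpy as np

rng = np.random.default_rng(0)
d = 4
W  = rng.standard_normal((d, 2 * d)) * 0.01        # weight matrix, d x 2d
V  = rng.standard_normal((2 * d, 2 * d, d)) * 0.01 # weight tensor, 2d x 2d x d
Ws = rng.standard_normal((5, d)) * 0.01            # sentiment classification matrix

def compose(x_left, x_right):
    """Equation (1): parent node vector from its two children."""
    u = np.concatenate([x_left, x_right])        # [x_l; x_r] in R^{2d}
    bilinear = np.einsum('i,ijl,j->l', u, V, u)  # u^T V^{[l]} u for every slice l
    return np.tanh(bilinear + W @ u)

def predict(x):
    """Equation (2): probability vector over the 5 sentiment classes."""
    z = Ws @ x
    z -= z.max()                                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

x_l = np.tanh(rng.standard_normal(d))            # leaf node vectors
x_r = np.tanh(rng.standard_normal(d))
y = predict(compose(x_l, x_r))                   # probability vector, sums to 1
```

The einsum evaluates $u^T V^{[l]} u$ for all $d$ slices at once; Section 3.2 below batches this same computation across many nodes.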

2.2 Backpropagation [2]

The formulae for the errors at node $x_i$ are as follows. Let $\delta_{i,s}$ denote the softmax error and $\delta_{i,com}$ denote the complete incoming error; then

$$\delta_{i,s} = W_s^T (y_i - t_i),$$

$$\delta_{i,com} = \begin{cases} \delta_{i,s}, & \text{if } x_i \text{ is the root} \\ \delta_{i,s} + \delta_{p(i),down}[1:d], & \text{if } x_i \text{ is the left child of } x_{p(i)} \\ \delta_{i,s} + \delta_{p(i),down}[d+1:2d], & \text{if } x_i \text{ is the right child of } x_{p(i)} \end{cases}$$

$$\delta_{i,down} = \left(W^T \delta_{i,com} + S_i\right) \circ f'\!\left(\begin{bmatrix} x_l \\ x_r \end{bmatrix}\right), \qquad S_i = \sum_{l=1}^{d} \delta_{i,com}^{l} \left(V^{[l]} + (V^{[l]})^T\right) \begin{bmatrix} x_l \\ x_r \end{bmatrix}, \quad (3)$$

where $\circ$ denotes the Hadamard product and $f'(x) = 1 - x^2$, applied element-wise. The gradients of the error with respect to $W$, $W_s$, $V^{[l]}$ are

$$\frac{\partial E}{\partial W} = \delta_{i,com} \begin{bmatrix} x_l \\ x_r \end{bmatrix}^T, \qquad \frac{\partial E}{\partial W_s} = (y_i - t_i)\, x_i^T, \qquad \frac{\partial E}{\partial V^{[l]}} = \delta_{i,com}^{l} \begin{bmatrix} x_l \\ x_r \end{bmatrix} \begin{bmatrix} x_l \\ x_r \end{bmatrix}^T. \quad (4)$$
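The following NumPy sketch implements equations (3) and (4) for one node, in the same illustrative spirit as the sketch above. The function name and the `is_left`/`delta_parent_down` arguments are ours, not CoreNLP's; the parent's downward error is assumed to have been computed already.

```python
# Illustrative per-node errors and gradients, equations (3) and (4).
import numpy as np

def node_errors_and_grads(W, V, Ws, x, x_l, x_r, y, t,
                          delta_parent_down=None, is_left=True):
    d = x.shape[0]
    u = np.concatenate([x_l, x_r])                  # [x_l; x_r]
    delta_s = Ws.T @ (y - t)                        # softmax error
    if delta_parent_down is None:                   # root node
        delta_com = delta_s
    elif is_left:                                   # left child: first d entries
        delta_com = delta_s + delta_parent_down[:d]
    else:                                           # right child: last d entries
        delta_com = delta_s + delta_parent_down[d:]
    A = V + np.transpose(V, (1, 0, 2))              # A^{[l]} = V^{[l]} + (V^{[l]})^T
    S = np.einsum('l,ijl,j->i', delta_com, A, u)    # S_i from equation (3)
    delta_down = (W.T @ delta_com + S) * (1.0 - u ** 2)  # Hadamard with f'(u)
    # Per-node gradients of equation (4):
    dW  = np.outer(delta_com, u)                    # delta_com [x_l; x_r]^T
    dWs = np.outer(y - t, x)                        # (y - t) x^T
    dV  = np.einsum('l,i,j->ijl', delta_com, u, u)  # delta_com^l u u^T per slice
    return delta_down, dW, dWs, dV
```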

3 Techniques for Batching Computation

The existing code in Stanford CoreNLP trains the RNTN with mini-batch adaptive gradient descent (AdaGrad) [3], where the gradients are computed one training example at a time, and forward/backward propagation is computed one node at a time. For these computations, we observe that in formulae (1), (2), and (3) the parameters $(W, W_s, V)$ are shared across training examples in a mini-batch and across nodes. For forward propagation, we say a node $x_i$ is ready to compute if both $x_l$ and $x_r$ have been computed. For backward propagation, we say $\delta_i$ is ready to compute if the errors of its parent node, i.e., $\delta_{p(i),s}$, $\delta_{p(i),com}$, and $\delta_{p(i),down}$, have been computed. We may group ready nodes across trees and compute them all together at once by using the following formulae.

3.1 Grouping matrix-vector multiplications

After a rearrangement of indices, let $x_i$, $i = 1, \ldots, k$, be the nodes that are ready to compute, and let

$$u_i = \begin{bmatrix} x_l \\ x_r \end{bmatrix}, \qquad U = \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}.$$

To compute the matrix-vector multiplication part of (1) over all ready nodes at once, we have

$$\begin{bmatrix} Wu_1 & Wu_2 & \cdots & Wu_k \end{bmatrix} = W \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix} = WU.$$

3.2 Grouping bilinear operations

To compute the tensor part of (1) over all ready nodes, we have

$$\begin{bmatrix} u_1^T V u_1 & u_2^T V u_2 & \cdots & u_k^T V u_k \end{bmatrix} = \mathrm{Flatten}_2(V)\,(U \odot U),$$

where $\odot$ is the Khatri-Rao product (defined in the appendix), and the matrix $\mathrm{Flatten}_2(V) \in \mathbb{R}^{d \times 4d^2}$ is obtained by taking lateral slices of the tensor $V$ and ordering these slices from left to right. See Figure 2(b).

[Figure 2: Slices of a 3rd-order tensor: (a) horizontal, (b) lateral, (c) frontal.]
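To make the two groupings concrete, the NumPy sketch below checks the batched bilinear formula $\mathrm{Flatten}_2(V)(U \odot U)$ against a per-node loop. The `khatri_rao` helper is written here for illustration, and the layout chosen for $\mathrm{Flatten}_2$ is our reading of the lateral-slice flattening matched to NumPy's Kronecker ordering; none of this is code from the paper.

```python
# Batched matrix-vector products (Sec. 3.1) and bilinear forms (Sec. 3.2).
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: [a_1 (x) b_1 | ... | a_n (x) b_n]."""
    return np.einsum('in,jn->ijn', A, B).reshape(A.shape[0] * B.shape[0],
                                                 A.shape[1])

rng = np.random.default_rng(1)
d, k = 4, 7                                  # dimension and number of ready nodes
W = rng.standard_normal((d, 2 * d))
V = rng.standard_normal((2 * d, 2 * d, d))
U = rng.standard_normal((2 * d, k))          # columns u_i = [x_l; x_r]

# Section 3.1: k matrix-vector products become one matrix-matrix product.
batched_linear = W @ U                       # column i equals W @ U[:, i]

# Section 3.2: lay out the d lateral slices so that
# Flatten2[l, i*2d + j] = V[i, j, l], matching the u (x) u column layout.
Flatten2 = V.transpose(2, 0, 1).reshape(d, 4 * d * d)
batched_bilinear = Flatten2 @ khatri_rao(U, U)

# Sanity check against the per-node computation of u_i^T V u_i.
loop = np.stack([np.einsum('i,ijl,j->l', U[:, n], V, U[:, n])
                 for n in range(k)], axis=1)
assert np.allclose(batched_bilinear, loop)
```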

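Putting Sections 3.1 and 3.2 together, one possible forward-pass driver (a sketch under our own assumptions, not the CoreNLP scheduler) repeatedly collects the ready nodes across all trees in the mini-batch and evaluates each wave with a single batched call. `Tree` is a hypothetical minimal structure assumed for illustration; trees are taken to be binary, with a node having either two children or none.

```python
# Hypothetical level-synchronous forward pass over a mini-batch of trees.
import numpy as np
from dataclasses import dataclass

@dataclass
class Tree:
    left: "Tree | None" = None
    right: "Tree | None" = None
    vec: "np.ndarray | None" = None  # set for leaves, filled for internal nodes

def forward_batch(roots, W, Flatten2, khatri_rao):
    # Gather every internal node of every tree in the mini-batch.
    pending, stack = [], list(roots)
    while stack:
        node = stack.pop()
        if node.left is not None:
            pending.append(node)
            stack += [node.left, node.right]
    # Repeatedly evaluate all nodes whose children are both computed.
    while pending:
        ready = [n for n in pending
                 if n.left.vec is not None and n.right.vec is not None]
        U = np.stack([np.concatenate([n.left.vec, n.right.vec])
                      for n in ready], axis=1)            # U = [u_1 ... u_k]
        X = np.tanh(Flatten2 @ khatri_rao(U, U) + W @ U)  # equation (1), batched
        for i, n in enumerate(ready):
            n.vec = X[:, i]
        pending = [n for n in pending if n.vec is None]
    return roots
```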
3.3 Grouping the errors $\delta_{i,com}$ and $\delta_{i,down}$

To compute the $S_i$ and $\delta_{i,down}$ over all ready nodes using (3), let $\Delta_{com} = \begin{bmatrix} \delta_{1,com} & \delta_{2,com} & \cdots & \delta_{k,com} \end{bmatrix} \in \mathbb{R}^{d \times k}$; then

$$\begin{bmatrix} S_1 & \cdots & S_k \end{bmatrix} = \begin{bmatrix} A^{[1]} & \cdots & A^{[d]} \end{bmatrix} \begin{bmatrix} \delta_{1,com}^{1} u_1 & \cdots & \delta_{k,com}^{1} u_k \\ \vdots & & \vdots \\ \delta_{1,com}^{d} u_1 & \cdots & \delta_{k,com}^{d} u_k \end{bmatrix} = \mathrm{Flatten}_1(A)\,(\Delta_{com} \odot U),$$

where $A^{[l]} = V^{[l]} + (V^{[l]})^T$, and the matrix $\mathrm{Flatten}_1(A) \in \mathbb{R}^{2d \times 2d^2}$ is obtained by taking horizontal slices of the tensor $A$ and ordering these slices from left to right. See Figure 2(a). As for $\delta_{i,down}$, we have

$$\begin{bmatrix} \delta_{1,down} & \delta_{2,down} & \cdots & \delta_{k,down} \end{bmatrix} = \left(W^T \Delta_{com} + \begin{bmatrix} S_1 & S_2 & \cdots & S_k \end{bmatrix}\right) \circ f'\!\left(\begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}\right).$$

3.4 Grouping gradients

We rewrite equations (4), summed over all ready nodes, as follows:

$$\frac{\partial E}{\partial W} = \sum_i \delta_{i,com}\, u_i^T = \begin{bmatrix} \delta_{1,com} & \cdots & \delta_{k,com} \end{bmatrix} \begin{bmatrix} u_1^T \\ \vdots \\ u_k^T \end{bmatrix} = \Delta_{com} U^T, \qquad \frac{\partial E}{\partial W_s} = \sum_i (y_i - t_i)\, x_i^T = (Y - T)\, X^T,$$

where $Y$, $T$, and $X$ collect the $y_i$, $t_i$, and $x_i$ as columns. As for the gradients of the tensor $V$, we have

$$\begin{bmatrix} \dfrac{\partial E}{\partial V^{[1]}} \\ \vdots \\ \dfrac{\partial E}{\partial V^{[d]}} \end{bmatrix} = \begin{bmatrix} \sum_i \delta_{i,com}^{1}\, u_i u_i^T \\ \vdots \\ \sum_i \delta_{i,com}^{d}\, u_i u_i^T \end{bmatrix} = \begin{bmatrix} \delta_{1,com}^{1} u_1 & \cdots & \delta_{k,com}^{1} u_k \\ \vdots & & \vdots \\ \delta_{1,com}^{d} u_1 & \cdots & \delta_{k,com}^{d} u_k \end{bmatrix} \begin{bmatrix} u_1^T \\ \vdots \\ u_k^T \end{bmatrix} = (\Delta_{com} \odot U)\, U^T.$$
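A corresponding NumPy sketch of Sections 3.3 and 3.4 follows (again an illustration of the formulas, not the authors' Nd4j code). `Dcom` stacks the complete incoming errors of the $k$ ready nodes as columns; the Khatri-Rao helper is repeated so the block is self-contained.

```python
# Batched backward-pass errors and gradients (Secs. 3.3 and 3.4).
import numpy as np

def khatri_rao(A, B):  # column-wise Kronecker product, as in the appendix
    return np.einsum('in,jn->ijn', A, B).reshape(A.shape[0] * B.shape[0],
                                                 A.shape[1])

def batched_backward(W, V, Dcom, U):
    d, k = Dcom.shape
    A = V + np.transpose(V, (1, 0, 2))         # A^{[l]} = V^{[l]} + (V^{[l]})^T
    # Flatten_1(A) orders horizontal slices: Flatten1[i, l*2d + j] = A[i, j, l],
    # matching the column layout of the Khatri-Rao product Dcom (x) U.
    Flatten1 = A.transpose(0, 2, 1).reshape(2 * d, 2 * d * d)
    S = Flatten1 @ khatri_rao(Dcom, U)         # [S_1 ... S_k] in one product
    Ddown = (W.T @ Dcom + S) * (1.0 - U ** 2)  # Hadamard with f'(U) = 1 - U^2
    dW = Dcom @ U.T                            # Section 3.4: batched dE/dW
    # All d slices of dE/dV at once: (Dcom (x) U) U^T, reshaped to 2d x 2d x d.
    dV = (khatri_rao(Dcom, U) @ U.T).reshape(d, 2 * d, 2 * d).transpose(1, 2, 0)
    return Ddown, dW, dV
    # dE/dWs is simply (Y - T) @ X.T over the columns of predictions,
    # targets, and node vectors.
```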

4 Experiments and Results

We have implemented the above batching techniques as modifications to the sentiment module in the Stanford CoreNLP project. We use INDArrays [5] from the Nd4j [6] project to represent matrices and tensors; this makes it easy to run the batched operations on a GPU. We ran similar workloads on three configurations (the unmodified CoreNLP baseline, the batched implementation on CPU, and the batched implementation on GPU) and share the results below.

The specifics of our experiments are as follows:

- Baseline: git commit d 55bbcb of corenlp.git
- Nd4j version: 0.4-rc3.6
- CUDA/jcublas version: 6.5
- CPU: Intel(R) Xeon(R) CPU E... @ ... GHz (8 cores)
- RAM: 128GB DDR3
- GPU: GeForce GTX 680
- Dataset: Stanford Sentiment Treebank
- Workload: 1 epoch of training on the dataset, varying batch size (25, 50, 100) and word-vector dimensions (25, 50, 100) in each of the three modes (CPU, CPU-batch, GPU)
- Code: all our code is at ...

5 Conclusion and Future Work

Based on the results shown above, we conclude the following:

- Batching computation is always better.
- GPUs offer significant speed-up (up to 4x in our tests) when word-vector dimensions and batch sizes are large enough.
- At lower batch sizes and word-vector dimensions, the overheads of managing data on the GPU surpass the benefits of faster computation.

Testing with larger word-vector sizes was interrupted due to a known bug in Nd4j. We are eager to resume testing once the bug is fixed, as we expect greater speed-up.

6 Appendix

In this section, we review the definitions of three matrix products [4].

Definitions: Let $A = (a_{ij})$ and $B = (b_{ij})$ be $m \times n$ matrices, $C = (c_{ij})$ be a $p \times q$ matrix, and $D = (d_{ij})$ be an $r \times n$ matrix. Then:

1. Hadamard product:
$$A \circ B = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & & & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

2. Kronecker product:
$$A \otimes C = \begin{bmatrix} a_{11} C & a_{12} C & \cdots & a_{1n} C \\ a_{21} C & a_{22} C & \cdots & a_{2n} C \\ \vdots & & & \vdots \\ a_{m1} C & a_{m2} C & \cdots & a_{mn} C \end{bmatrix} \in \mathbb{R}^{mp \times nq}.$$

3. Khatri-Rao product:
$$A \odot D = \begin{bmatrix} a_1 \otimes d_1 & a_2 \otimes d_2 & \cdots & a_n \otimes d_n \end{bmatrix} \in \mathbb{R}^{mr \times n},$$
where $a_j$ and $d_j$ denote the $j$-th columns of $A$ and $D$.
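For reference, the three products are short expressions in NumPy (`np.kron` is the built-in Kronecker product; the Khatri-Rao column loop mirrors the definition above; the concrete dimensions are made up for the example):

```python
# NumPy illustrations of the three matrix products defined above.
import numpy as np

A = np.arange(6).reshape(2, 3)           # m x n, with m=2, n=3
B = np.ones((2, 3))                      # m x n
C = np.arange(4).reshape(2, 2)           # p x q
D = np.arange(12).reshape(4, 3)          # r x n, with r=4

hadamard = A * B                         # entries a_ij * b_ij, shape m x n
kronecker = np.kron(A, C)                # blocks a_ij * C, shape mp x nq
khatri_rao = np.stack([np.kron(A[:, j], D[:, j])   # column-wise Kronecker
                       for j in range(A.shape[1])], axis=1)
assert khatri_rao.shape == (2 * 4, 3)    # mr x n
```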

Acknowledgements

We thank Sam Bowman for introducing the problem and for many helpful discussions, Youssef Ahres for mentoring us and offering his expertise whenever we needed it, and finally Prof. Andrew Ng for the education and for conducting this wonderful course.

References

[1] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP (2013).

[2] Christoph Goller and Andreas Küchler. Learning Task-Dependent Distributed Representations by Backpropagation Through Structure. ICNN (1996).

[3] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR (2011).

[4] Shuangzhe Liu and Götz Trenkler. Hadamard, Khatri-Rao, Kronecker and Other Matrix Products. Int. J. Inform. Syst. Sci. (2008).

[5] INDArray.

[6] Nd4j.
