Seq2Tree: A Tree-Structured Extension of LSTM Network


Weicheng Ma
Computer Science Department, Boston University
111 Cummington Mall, Boston, MA
wcma@bu.edu

Kai Cao
Cambia Health Solutions
kai.cao@cambiahealth.com

Zhaoheng Ni
Graduate Center, City University of New York
365 5th Ave., New York, NY
zni@gradcenter.cuny.edu

Xiang Li
Cambia Health Solutions
xiang.li@cambiahealth.com

Sang Chin
Computer Science Department, Boston University
111 Cummington Mall, Boston, MA
spchin@cs.bu.edu

Abstract

The Long Short-Term Memory (LSTM) network has attracted much attention for sequence modeling tasks because of its ability to preserve longer-term information in a sequence than ordinary Recurrent Neural Networks (RNNs). The basic LSTM structure assumes a chain structure over the input sequence. However, audio streams tend to combine phonemes into meaningful units, which could be words in a speech processing task, or a certain type of noise in a signal and noise separation task. We introduce the Seq2Tree network, a modification of the LSTM network that constructs a tree structure from an input sequence. Experiments show that the Seq2Tree network outperforms the state-of-the-art Bidirectional LSTM (BLSTM) model on the signal and noise separation task, namely the CHiME Speech Separation and Recognition Challenge.

1 Introduction

Recent RNN-based approaches achieve high performance on speech processing tasks, including but not limited to the signal and noise separation task (Erdogan et al. (2015); Zhu and Vogel-Heuser (2014); Wu et al. (2015); Barker et al. (2015)). The underlying hypothesis is that the energy in each frequency bin over a period of time is continuous and predictable. However, in real-life scenes noise can break in at any time and intertwine with the speech signal with no predictable pattern, which undermines these models' ability to predict the distribution of noise over frequency bins.

To address the problem of finding the correct boundaries of noises, several variants of the original LSTM network have been used. The current state-of-the-art system on this task applies a BLSTM, which tries to bound noises by foreseeing future information (Erdogan et al. (2015); Weninger et al. (2015); Grais et al. (2014)). Nevertheless, information from the future also contains sound signals from further away, so it does not solve the signal superposition problem. Furthermore, we believe the future of speech processing will be dominated by real-time techniques, which BLSTM models cannot handle.

More importantly, phonemes in speech signals make no sense unless they are combined into "words", and neither do those in noise signals. A sound waveform should therefore not be understood as a chain of phonemes but at the "word" level, which leads to a natural choice of tree-structured modeling of the waveform. In this paper we introduce a novel architecture that extends LSTM networks to parse sequential input into a tree structure, and we show its superiority in decomposing speech and noise signals. We call our new RNN architecture Seq2Tree. Our architecture differs from the standard LSTM in that each node inherits its hidden state not from the previous state in the time sequence but from its parent in the tree structure, based on its position as predicted by the network itself. Our evaluation compares the Seq2Tree network with a BLSTM baseline on the signal and noise separation task (Barker et al. (2015)). Experiments show that our system achieves comparable performance to the BLSTM implementation overall, while outperforming it in more complex scenarios. Further optimization and adjustment to this task will follow.

2 LSTM Network

RNNs have the advantage of processing input sequences regardless of their lengths. The sequence elements can be of arbitrary type, for example phonemes in a piece of audio in a speech processing task. However, RNNs are easily trapped by explosive growth or rapid vanishing of the gradient over long distances (Hochreiter (1998); Hochreiter et al. (2001)), which makes it difficult for them to represent long-term information. The LSTM network was introduced to deal with this problem (Hochreiter (1998); Hochreiter et al. (2001); Zaremba and Sutskever (2014); Zaremba et al. (2014)). Instead of directly passing the previous state and the current input to the transition function on which the gradient is calculated, an LSTM uses a memory cell to preserve longer-term information. Using the settings in (Zaremba and Sutskever (2014); Zaremba et al. (2014)), the LSTM transition functions are as follows:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}),
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}),
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}),
u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}),
c_t = i_t \odot u_t + f_t \odot c_{t-1},
h_t = o_t \odot \tanh(c_t),    (1)

where i_t, f_t, o_t, c_t are the input gate, forget gate, output gate and the memory cell respectively, and \odot denotes element-wise multiplication. As the equations show, the input gate decides how much information from the new input is added to the memory cell. Similarly, the forget gate f_t controls how much information to forget from the previous state, and the output gate limits the amount of information to expose. By balancing the amount of incoming and outgoing information, the LSTM prevents the gradient vanishing and explosion problems.

An ordinary LSTM operates on chain-structured sequences. There exist two common structural variants of the LSTM network, namely the BLSTM and the Multilayer LSTM, which combine multiple LSTM networks to provide additional information for the prediction at each time step. The Tree LSTM (Tai et al. (2015)) can be regarded as a variation of the Multilayer LSTM with the dependency relation reversed.
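For concreteness, Eq. (1) can be written as a single NumPy step as follows; the function name and parameter layout are illustrative choices rather than part of the model definition.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM transition following Eq. (1). W, U and b are dicts keyed by
    # 'i', 'f', 'o', 'u'; each W[k] has shape (hidden, input), each U[k]
    # has shape (hidden, hidden) and each b[k] has shape (hidden,).
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    u_t = np.tanh(W['u'] @ x_t + U['u'] @ h_prev + b['u'])  # candidate update
    c_t = i_t * u_t + f_t * c_prev                          # memory cell
    h_t = o_t * np.tanh(c_t)                                # hidden state
    return h_t, c_t

Iterating this step over time recovers the chain-structured LSTM that the Seq2Tree network generalizes in the next section.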
3 Seq2Tree Network

The LSTM architectures described in the previous section are all limited in discovering tree structure from sequential input. Though the Multilayer LSTM and Tree LSTM networks are able to maintain multi-level dependencies, the Multilayer LSTM exposes child cells to all the other units, and the Tree LSTM requires tree-structured input. These characteristics limit their use in speech processing tasks where no reliable parser exists, especially in the case of online speech processing. Here we propose two LSTM variants: the Single Level Seq2Tree and Multilayer Seq2Tree architectures.

Figure 1: Seq2Tree Network Architecture

Both variants are able to find dependencies among adjacent signals, while the Multilayer Seq2Tree architecture captures deeper, more weakly bounded correlations. The most significant difference between the original LSTM and the Seq2Tree model is that, instead of taking the previous state as the preceding state, Seq2Tree networks use an additional direction gate d_t to choose the direction to go at time step t. The path selection gate is implemented differently in the Single Level Seq2Tree and Multilayer Seq2Tree architectures.

When the network receives a new input at each step, there are three possible directions to go: up, right and down. Before moving, the previous hidden state h_{t-1} is added to the parent state list and is set to be active. The strategy is as follows: if the predicted direction is "up", the parent node of the current active state becomes the parent state; if the direction is "down", the network chooses h_{t-1} as the parent node; otherwise the new unit becomes a child of the parent state of the current active node. Since signal patterns can overlap with each other, multiple jumps towards the root in one time step are also allowed. Moreover, at each jump an update gate controls the amount of change applied to the parent layer, and the remainder is passed to even higher states in the tree if there are further jumps. After each state is processed, its parent node's information is updated: the forget gate of the child state, f_t, controls the amount of change given to its parent state. This mechanism is inspired by the Tree LSTM networks. The transition functions of the Seq2Tree network are as follows:

d_t = \sigma(W^{(d)} x_t + U^{(d)} h_{t-1} + b^{(d)}),
h_{parent} = d_t \odot (h_{parent} - h_{t-1}),
i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{parent} + b^{(i)}),
f_t = \sigma(W^{(f)} x_t + U^{(f)} (d_t \odot h_{parent}) + b^{(f)}),
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{parent} + b^{(o)}),
u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{parent} + b^{(u)}),
c_t = i_t \odot u_t + f_t \odot c_{parent},
h_t = o_t \odot \tanh(c_t),
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_t + b^{(f)}),
c_{parent} = c_{parent} + f_t \odot c_t,
h_{parent} = o_{parent} \odot \tanh(c_{parent}),    (2)

where d_t is a selection gate that chooses an h from the recorded parent states with the largest d_t gate value, \sigma is the sigmoid function, and \odot denotes element-wise multiplication.
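To make the transition concrete, the following NumPy sketch shows one possible reading of Eq. (2) for a single node, given a parent state that has already been selected; the parent-selection policy (the up/right/down moves), the variable names and the data layout are our assumptions, not the authors' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def seq2tree_step(x_t, h_prev, parent, W, U, b):
    # One Seq2Tree transition, loosely following Eq. (2). `parent` is a dict
    # holding the selected parent's hidden state 'h', memory cell 'c' and
    # output gate 'o'; choosing that parent from the recorded parent list
    # is left out of this sketch.
    d_t = sigmoid(W['d'] @ x_t + U['d'] @ h_prev + b['d'])   # direction gate
    h_par = d_t * (parent['h'] - h_prev)                     # inherited parent state

    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_par + b['i'])
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ (d_t * h_par) + b['f'])
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_par + b['o'])
    u_t = np.tanh(W['u'] @ x_t + U['u'] @ h_par + b['u'])

    c_t = i_t * u_t + f_t * parent['c']                      # child memory cell
    h_t = o_t * np.tanh(c_t)                                 # child hidden state

    # Child-to-parent update: a second forget gate controls how much of the
    # new cell is written back into the parent state.
    f_up = sigmoid(W['f'] @ x_t + U['f'] @ h_t + b['f'])
    c_par = parent['c'] + f_up * c_t
    h_par = parent['o'] * np.tanh(c_par)

    return (h_t, c_t, o_t), {'h': h_par, 'c': c_par, 'o': parent['o']}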

4 Task and Model

4.1 Signal and Noise Separation Task

We test our Seq2Tree architecture on the signal and noise separation task, the goal of which is to predict a mask that weakens the energy of the noise when applied to the input audio. The task is defined in the Second CHiME Speech Separation and Recognition Challenge (Vincent et al. (2013)).

4.2 Seq2Tree Signal Masking Model

For this task, at each time step t we want to predict a mask over all frequency bins. We achieve this by training a softmax regression matrix applied to the current hidden state:

mask_t = softmax(U^{(R)} h_t),

where U^{(R)} is the regression matrix. We train our signal masking model in two stages using two different loss functions, as suggested in Weninger et al. (2014); Erdogan et al. (2015). The two losses are:

J_1(t) = \frac{1}{c} \sum_{i=1}^{c} (mask_i - label_i)^2,
J_2(t) = \frac{1}{c} \sum_{i=1}^{c} (x_{t,i} (mask_i - label_i))^2,

where c is the number of frequency bins, mask_i is the predicted mask at time t for bin i, label_i is the labeled mask for bin i at time t, and x_{t,i} is the input energy at time t in bin i.
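A minimal NumPy sketch of the masking head and the two losses follows, under the assumption that x_t holds the input energies of the c frequency bins at time t; the function and variable names are illustrative.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def predict_mask(h_t, U_R):
    # mask_t = softmax(U^(R) h_t): one weight per frequency bin.
    return softmax(U_R @ h_t)

def loss_stage1(mask, label):
    # J_1(t): mean squared error between predicted and labeled masks.
    return np.mean((mask - label) ** 2)

def loss_stage2(mask, label, x_t):
    # J_2(t): mask error weighted by the input energy in each bin.
    return np.mean((x_t * (mask - label)) ** 2)

In the two-stage schedule this suggests, the model would presumably be trained first with loss_stage1 and then fine-tuned with loss_stage2.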

5 Experiments

We evaluate our Seq2Tree architecture on the signal and noise separation task. The data is a subset of 1,500 audio files from the CHiME dataset (Vincent et al. (2016)), of which 10% is used for testing and the rest for training. Each input file is Fourier transformed and fed to the models, and every model predicts a mask given the input matrix. The quality of the mask is evaluated in terms of the Overall Perceptual Score (OPS) by applying the mask to the source waveform, given the noise-removed gold-standard audio (Emiya et al. (2011)).

In our experiments each training example has shape 50 x 513, representing the energy at 50 time steps in 513 frequency bins. The test data has variable length over time steps, taking advantage of LSTM models' ability to handle variable-length inputs. We compare the results of our Seq2Tree model with those of the BLSTM baseline. The hidden layer size of our Seq2Tree network is set to 256, and we report results across different numbers of training iterations. The BLSTM baseline also uses a hidden layer size of 256, and both models are trained for 100 epochs. The best and worst scores of our model are also included.

Implementation           OPS (dB)
BLSTM                     25.01
Seq2Tree                  24.41
Seq2Tree (Worst Case)     24.23
Seq2Tree (Best Case)      25.75

Table 1: Evaluation results.

As shown in Table 1, our Seq2Tree model performs comparably to the BLSTM implementation. More importantly, when we look into specific wav files, in more complex cases where noises overlap with each other the Seq2Tree model largely outperforms the BLSTM model, which agrees with our expectation. The Seq2Tree model shows an advantage both in stability and in its ability to deal with more complex situations. We expect the performance of our Seq2Tree architecture to grow further when it is trained more thoroughly and when more challenging cases are included. Further experiments are ongoing to demonstrate the effectiveness of the Seq2Tree architecture on the noise separation task and in other fields.

6 Conclusion

In this paper, we introduced a tree-structured LSTM network architecture. The Seq2Tree architecture can be applied to arbitrary sequential input with potential local dependencies among nodes. We demonstrated its effectiveness by evaluating a Seq2Tree-based model on the signal and noise separation task. Experiments show that our Seq2Tree model outperforms the baseline in that it correctly represents a large portion of the CHiME data. Further evaluations will be done to prove the correctness and advancement of our Seq2Tree architecture in more general cases.

References

Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe. 2015. The third CHiME speech separation and recognition challenge: Dataset, task and baselines. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 504-511.

Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and Volker Hohmann. 2011. Subjective and objective quality assessment of audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 19, 7 (2011), 2046-2057.

Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 708-712.

Emad M. Grais, Mehmet Umut Sen, and Hakan Erdogan. 2014. Deep neural networks for single channel source separation. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 3734-3738.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6, 02 (1998), 107-116.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. (2001).

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).

Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. The second 'CHiME' speech separation and recognition challenge: An overview of challenge systems and outcomes. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 162-167.

Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer. 2016. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language (2016).

Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn Schuller. 2015. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation. Springer, 91-99.

Felix Weninger, John R. Hershey, Jonathan Le Roux, and Björn Schuller. 2014. Discriminatively trained recurrent neural networks for single-channel speech separation. In Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on. IEEE, 577-581.

Minshun Wu, Zhiqiang Liu, Li Xu, and Degang Chen. 2015. Accurate and cost-effective technique for jitter and noise separation based on single-frequency measurement. Electronics Letters 52, 2 (2015), 106-107.

Wojciech Zaremba and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615 (2014).

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).

Kunpeng Zhu and Birgit Vogel-Heuser. 2014. Sparse representation and its applications in micro-milling condition monitoring: noise separation and tool condition monitoring. The International Journal of Advanced Manufacturing Technology 70, 1-4 (2014), 185-199.