Speech recognition. Lecture 14: Neural Networks. Andrew Senior, Google NYC, December 12, 2013
1 Speech recognition. Lecture 14: Neural Networks. Andrew Senior, Google NYC, December 12, 2013.
2 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
3 The perceptron. [Diagram: inputs $x_1, \dots, x_5$ with weights $w$ feeding a single output unit.] A perceptron is a linear classifier: $f(x) = 1$ if $w \cdot x > 0$ (1); $f(x) = 0$ otherwise (2). Add an extra always-one input to provide an offset or bias. The weights $w$ can be learned for a given task with the perceptron algorithm.
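To make the decision rule concrete, here is a minimal sketch (not from the original slides) in numpy, using an appended always-one input to carry the bias; the example weights are illustrative.

```python
import numpy as np

def perceptron_predict(x, w):
    """Linear classifier: returns 1 if w.x > 0, else 0.

    x is augmented with a constant 1 so that the last weight acts as the bias.
    """
    x_aug = np.append(x, 1.0)          # add the always-one input
    return 1 if np.dot(w, x_aug) > 0 else 0

# Example: a 2-input perceptron implementing a simple threshold.
w = np.array([1.0, 1.0, -1.5])         # weights for x1, x2 and the bias input
print(perceptron_predict(np.array([1.0, 1.0]), w))  # 1, since 1 + 1 - 1.5 > 0
print(perceptron_predict(np.array([1.0, 0.0]), w))  # 0, since 1 + 0 - 1.5 <= 0
```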
4 Perceptron algorithm (Rosenblatt, 1957). Adapt the weights $w$ example by example:
1. Initialise the weights and the threshold.
2. For each example $j$ in our training set $D$, perform the following steps over the input $x_j$ and desired output $\hat{y}_j$:
   (a) Calculate the actual output: $y_j(t) = f[w(t) \cdot x_j] = f[w_0(t) + w_1(t)x_{j,1} + w_2(t)x_{j,2} + \dots + w_n(t)x_{j,n}]$
   (b) Update the weights: $w_i(t+1) = w_i(t) + \alpha(\hat{y}_j - y_j(t))x_{j,i}$, for all nodes $0 \le i \le n$.
3. Repeat step 2 until the iteration error $\frac{1}{s}\sum_j |\hat{y}_j - y_j(t)|$ is less than a user-specified error threshold $\gamma$, or a predetermined number of iterations have been completed.
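A compact sketch of the update loop described above (not from the original slides), assuming numpy arrays and a prepended always-one input column; the learning rate alpha, threshold gamma and the logical-AND example are illustrative.

```python
import numpy as np

def train_perceptron(X, y_hat, alpha=0.1, gamma=0.01, max_iters=100):
    """Rosenblatt perceptron training on inputs X (s x n) and 0/1 targets y_hat (s,)."""
    s, n = X.shape
    X_aug = np.hstack([np.ones((s, 1)), X])      # x_{j,0} = 1 provides the bias
    w = np.zeros(n + 1)                          # initialise the weights
    for _ in range(max_iters):
        for j in range(s):                       # example-by-example updates
            y_j = 1.0 if np.dot(w, X_aug[j]) > 0 else 0.0
            w += alpha * (y_hat[j] - y_j) * X_aug[j]
        preds = (X_aug @ w > 0).astype(float)
        if np.mean(np.abs(y_hat - preds)) < gamma:   # iteration error below threshold
            break
    return w

# Learn a simple linearly separable function (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
print(train_perceptron(X, y))
```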
5 Nonlinear perceptrons. Introduce a nonlinearity: $y_i = \sigma(\sum_j w_{ij} x_j)$. Each unit is a simple nonlinear function of a linear combination of its inputs. Typically the logistic sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ or $\sigma(z) = \tanh z$.
6 Multilayer perceptrons. Extend the network to multiple layers. Now a hidden layer of nodes computes a function of the inputs, and output nodes compute a function of the hidden nodes' activations. [Diagram: input layer $x_1, \dots, x_4$, hidden layer, output layer $y_1, y_2, y_3$.]
7 Cost function. Such networks can be optimized ("trained") to minimize a cost function (loss function or objective function) that is a numerical score of the network's performance with respect to targets $\hat{y}_i(t)$.
Squared error: $L_{SE} = \frac{1}{2}\sum_t \sum_i (y_i(t) - \hat{y}_i(t))^2$
Cross entropy: $L_{CE} = -\sum_t \sum_i \hat{y}_i(t) \log y_i(t)$
This is a frame-based criterion, where $t$ would ideally range over the entire space of decoding frames, but in practice ranges over the training set, and we measure it on a development set.
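As an illustrative sketch (not part of the original slides), the two losses for a single frame can be written directly in numpy; the arrays y and y_hat are hypothetical network outputs and targets.

```python
import numpy as np

def squared_error(y, y_hat):
    """L_SE = 1/2 * sum_i (y_i - y_hat_i)^2 for one frame."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """L_CE = -sum_i y_hat_i * log(y_i) for one frame (eps guards against log(0))."""
    return -np.sum(y_hat * np.log(y + eps))

y = np.array([0.7, 0.2, 0.1])       # network outputs (e.g. softmax posteriors)
y_hat = np.array([1.0, 0.0, 0.0])   # Viterbi one-hot target for this frame
print(squared_error(y, y_hat), cross_entropy(y, y_hat))
```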
8 Targets. We need targets/labels $\hat{y}_i(t)$ for each frame, usually provided by forced alignment (Lecture 8). Viterbi alignment gives one target class for each frame $t$. Baum-Welch soft alignment gives a target distribution over $\hat{y}_i(t)$ for each $t$.
9 Softmax output layer. If the output units are logistic, then they are suitable for representing multivariate Bernoulli random variables $P(\hat{y}_i = 1 \mid x)$. To model a multi-class categorical distribution we use the softmax: $y_i = P(c_i \mid x) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$, which is normalized to sum to one. This reduces to the logistic sigmoid when there are two output classes.
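A minimal, numerically stable softmax sketch (the max subtraction is a standard trick, not mentioned on the slide, and cancels in the ratio).

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())   # posteriors P(c_i|x), summing to one
```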
10 Gradient descent. To minimize the loss $L$, compute a gradient $\frac{\partial L}{\partial w}$ for each parameter $w$ and update it using simple gradient descent: $w' = w - \eta \frac{\partial L}{\partial w}$. Here $\eta$ is a learning rate, which is chosen (typically by cross-validation) but may be set automatically. We can apply the chain rule to compute $\frac{\partial L}{\partial w}$ for parameters deep in the network.
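The update itself is one line per parameter; here is a sketch with a hypothetical gradient function grad_L and a toy quadratic loss, purely for illustration.

```python
import numpy as np

def gradient_descent_step(w, grad_L, eta=0.01):
    """Simple gradient descent: w <- w - eta * dL/dw."""
    return w - eta * grad_L(w)

# Example: minimise L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda w: 2 * w, eta=0.1)
print(w)   # approaches the minimiser [0, 0]
```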
11 Back-propagation 0. Derivatives of the loss functions:
$\frac{\partial L_{CE}}{\partial y_i} = \frac{\partial}{\partial y_i}\left(-\sum_j \hat{y}_j(t) \log y_j(t)\right)$ (3)
$= -\frac{\hat{y}_i(t)}{y_i(t)}$ (4)
$\frac{\partial L_{SE}}{\partial y_i} = \frac{\partial}{\partial y_i}\,\frac{1}{2}\sum_j (y_j(t) - \hat{y}_j(t))^2$ (5)
$= y_i(t) - \hat{y}_i(t)$ (6)
12 Back-propagation 0. Derivative of the logistic activation function:
$\frac{\partial y_i}{\partial z_i} = \frac{\partial}{\partial z_i}\,\frac{1}{1 + e^{-z_i}}$ (7)
$= \frac{e^{-z_i}}{(1 + e^{-z_i})^2}$ (8)
$= y_i(1 - y_i)$ (9)
because $1 - y = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}$. (10)-(12)
13 Back-propagation 0. Derivative of the softmax activation function:
$\frac{\partial y_k}{\partial z_i} = \frac{\partial}{\partial z_i}\,\frac{e^{z_k}}{\sum_j e^{z_j}}$ (13)
$= \frac{\delta_{ik}(\sum_j e^{z_j})e^{z_k} - e^{z_k}e^{z_i}}{(\sum_j e^{z_j})^2}$ (14)
$= \frac{e^{z_i}}{\sum_j e^{z_j}} \cdot \frac{(\sum_j e^{z_j})\delta_{ik} - e^{z_k}}{\sum_j e^{z_j}}$ (15)
$= y_i(\delta_{ik} - y_k)$ (16)
14 Back-propagation I. For a weight in the final layer, by the chain rule, for one example:
$\frac{\partial L}{\partial w_{ij}} = \sum_k \frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}}$ (17)
For softmax and $L_{CE}$:
$\frac{\partial L_{CE}}{\partial y_k} = -\frac{\hat{y}_k}{y_k}$ [Outer gradient.] (18)
$\frac{\partial y_k}{\partial z_i} = y_i(\delta_{ik} - y_k)$ [Derivative of the softmax activation function.] (19)
$\frac{\partial z_i}{\partial w_{ij}} = x_j$ (20)
$\frac{\partial L}{\partial w_{ij}} = -\sum_k \frac{\hat{y}_k}{y_k}\, y_k(\delta_{ik} - y_i)\, x_j$ [using the equivalent form $y_k(\delta_{ik} - y_i)$ of eqn 19] (21)
$= -x_j \sum_k \hat{y}_k(\delta_{ik} - y_i)$ (22)
$= x_j(y_i - \hat{y}_i)$ [since $\sum_k \hat{y}_k = 1$] (23)
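The end result of eqns (17)-(23) is the familiar output-layer gradient. A sketch (not from the slides), assuming a single example with input activations x, a weight matrix W with one row per output, and a one-hot target y_hat; all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def output_layer_gradient(x, W, y_hat):
    """dL_CE/dW for a softmax output layer: dL/dw_ij = x_j * (y_i - y_hat_i), eqn (23)."""
    z = W @ x                      # pre-activations z_i = sum_j w_ij x_j
    y = softmax(z)                 # posteriors y_i
    return np.outer(y - y_hat, x)  # gradient, same shape as W

x = np.array([0.5, -1.0, 2.0])
W = np.zeros((4, 3))
y_hat = np.array([0.0, 1.0, 0.0, 0.0])
print(output_layer_gradient(x, W, y_hat))
```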
15 Back-propagation II. Back-propagating (Rumelhart et al., 1986) to an earlier hidden layer with weights $w_{jk}$, activations $x_j$ and inputs $x_k$:
$x_j = \sigma(z_j) = \sigma(\sum_k w_{jk} x_k)$ (24)
First find the gradient w.r.t. the hidden-layer activation $x_j$:
$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial x_j}$ (25)
$\frac{\partial z_i}{\partial x_j} = w_{ij}$ (26)
i.e. we pass the vector of gradients $\frac{\partial L}{\partial y_i}$ back through the nonlinearity and then project it back through the weight matrix $W$.
16 Back-propagation III.
$\frac{\partial L}{\partial w_{jk}} = \frac{\partial L}{\partial x_j}\frac{\partial x_j}{\partial z_j}\frac{\partial z_j}{\partial w_{jk}}$ [Same form as eqn. 17.] (27)
$\frac{\partial x_j}{\partial z_j} = x_j(1 - x_j)$ [Derivative of the sigmoid activation function.] (28)
$\frac{\partial z_j}{\partial w_{jk}} = x_k$ (29)
Continue to arbitrary depth: compute activation gradients and then weight gradients for each layer.
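Putting eqns (17)-(29) together for one hidden layer, here is a sketch (not from the slides) of the full forward and backward pass with a sigmoid hidden layer, softmax output and cross-entropy loss; the array names loosely follow the slides and everything else is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def backprop_two_layer(x_in, W1, W2, y_hat):
    """Return (dL/dW1, dL/dW2) for one example under cross-entropy loss."""
    # Forward pass.
    z_hid = W1 @ x_in              # hidden pre-activations z_j
    x_hid = sigmoid(z_hid)         # hidden activations x_j (eqn 24)
    z_out = W2 @ x_hid             # output pre-activations z_i
    y = softmax(z_out)             # outputs y_i
    # Backward pass.
    d_out = y - y_hat              # dL/dz_i for softmax + cross-entropy (eqn 23)
    dW2 = np.outer(d_out, x_hid)   # dL/dw_ij = x_j * (y_i - y_hat_i)
    dx_hid = W2.T @ d_out          # dL/dx_j: project back through W (eqns 25-26)
    d_hid = dx_hid * x_hid * (1 - x_hid)   # through the sigmoid derivative (eqn 28)
    dW1 = np.outer(d_hid, x_in)    # dL/dw_jk = x_k * dL/dz_j (eqns 27, 29)
    return dW1, dW2

rng = np.random.default_rng(0)
x_in = rng.normal(size=5)
W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(3, 4))
y_hat = np.array([0.0, 1.0, 0.0])
dW1, dW2 = backprop_two_layer(x_in, W1, W2, y_hat)
print(dW1.shape, dW2.shape)
```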
17 Stochastic Gradient Descent. Since $L$ is typically defined on the entire training set, it takes a long time to compute it and its derivatives (summed across all exemplars), and it is only an approximation to the true loss on the theoretical set of all utterances. We can compute a noisy estimate of $\frac{\partial L}{\partial w}$ on a small subset of the training set, and make a Stochastic Gradient Descent (SGD) update very quickly. In the limit, we could update on every frame, but a useful compromise is to use a minibatch of around 200 frames.
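A sketch of the minibatch SGD loop described above (not from the slides); the model is abstracted as a hypothetical per-minibatch gradient function, the least-squares example is an illustrative stand-in for the network gradient, and 200 frames per minibatch follows the slide.

```python
import numpy as np

def sgd(params, grad_minibatch, frames, targets, eta=0.01, batch_size=200, epochs=5):
    """Minibatch SGD: noisy gradient estimates from small random subsets of frames."""
    n = len(frames)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                 # visit frames in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one minibatch of ~200 frames
            g = grad_minibatch(params, frames[idx], targets[idx])
            params = params - eta * g              # noisy but cheap update
    return params

# Example: fit a linear model y = w.x by least squares with minibatch SGD.
def grad_lsq(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(np.zeros(3), grad_lsq, X, y, eta=0.05))   # approaches [1, -2, 0.5]
```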
18 Second-order optimization. Compute the second derivative and optimize a second-order approximation to the error surface. More computation per step. Requires less-noisy estimates of gradient and curvature (bigger batches). Each step is more effective. Variants: Newton-Raphson, Quickprop, L-BFGS, Hessian-free, conjugate gradient.
19 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
20 Two main paradigms for neural networks for speech. (1) Use neural networks to compute nonlinear feature representations: bottleneck or tandem features (Hermansky et al., 2000); the low-dimensional representation is modelled conventionally with GMMs, which allows all the GMM machinery and tricks to be exploited. (2) Use neural networks to estimate CD (context-dependent) state probabilities.
21 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
22 Neural network features. Train a neural network to discriminate classes. Use the output or a low-dimensional bottleneck-layer representation as features. [Diagram: input layer $x_1, \dots, x_4$, hidden layers, bottleneck layer, output layer $y_1, \dots, y_5$.]
23 Neural network features. TRAP: concatenate PLP-HLDA features and NN features. Bottleneck features outperform posterior features (Grezl et al., 2007). Generally DNN features + GMMs reach about the same performance as hybrid DNN-HMM systems, but are much more complex.
24 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
25 Hybrid networks: decoding (recap). Recall (Lecture 1) that we choose the decoder output as the optimal word sequence $\hat{w}$ for an observation sequence $o$:
$\hat{w} = \arg\max_{w \in \Sigma} \Pr[w \mid o]$ (30)
$= \arg\max_{w \in \Sigma} \Pr[o \mid w]\Pr[w]$ (31)
and
$\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid c)\Pr(c \mid p)\Pr(p \mid w)$ (32)
where $p$ is the phone sequence and $c$ is the CD state sequence.
26 Hybrid neural network decoding. Now we model $P(o \mid c)$ with a neural network instead of a Gaussian mixture model. Everything else stays the same.
$P(o \mid c) = \prod_t P(o_t \mid c_t)$ (33)
$P(o_t \mid c_t) = \frac{P(c_t \mid o_t)P(o_t)}{P(c_t)}$ (34)
$\propto \frac{P(c_t \mid o_t)}{P(c_t)}$ (35)
for observations $o_t$ at time $t$ and a CD state sequence $c_t$. We can ignore $P(o_t)$ since it is the same for all decoding paths. The last term is called the "scaled posterior":
$\log P(o_t \mid c_t) = \log P(c_t \mid o_t) - \alpha \log P(c_t)$ (36)
Empirically (by cross-validation) we actually find better results with a prior smoothing term $\alpha \approx 0.8$.
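A sketch of eqn (36), converting network posteriors into scaled (pseudo-)likelihood scores for decoding, with the prior smoothing term alpha = 0.8 from the slide; the posterior and prior arrays are hypothetical, and the priors would typically be estimated from the frame counts of the training alignment.

```python
import numpy as np

def scaled_log_likelihood(log_posteriors, state_priors, alpha=0.8, eps=1e-12):
    """log P(o_t|c_t) ~ log P(c_t|o_t) - alpha * log P(c_t)  (eqn 36).

    log_posteriors: (T x N) frame-level log P(c|o_t) from the network's softmax.
    state_priors:   (N,) CD-state priors P(c).
    """
    return log_posteriors - alpha * np.log(state_priors + eps)

# Toy example with 2 frames and 3 CD states.
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8]]))
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihood(log_post, priors))
```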
27 Input features. Neural networks can handle high-dimensional, correlated features. Use 26 stacked frames of filterbank inputs (40-dimensional mel-spaced filterbanks). [Figure: example filters learned in the first layer.]
28 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
29 Rough history. Multi-layer perceptron: 1986. Speech recognition with neural networks: superseded by GMMs. Neural network features: 2002. Deep networks: 2006 (Hinton, 2002). Deep networks for speech recognition: good results on TIMIT (Mohamed et al., 2009). Results on large vocabulary systems: 2010 (Dahl et al., 2011). Google launches DNN ASR product: 2011. Dominant paradigm for ASR: 2012 (Hinton et al., 2012).
30 What is new? Fast GPU-based training (distributed CPU-based training is even faster). Pretraining (turns out not to be important). Deeper networks, enabled by faster training. Large datasets. Better machine-learning understanding.
31 State of the art. Google's current speech production systems: 26 frames of 40-dimensional filterbank inputs; 8 hidden layers of 2560 hidden units; rectified linear nonlinearity (Zeiler et al., 2013); 14,000 outputs; 85 million parameters, trained on 2,000 hours of speech data; running quantized with 8-bit integer weights. On Android phones we run a smaller model with 2.7M parameters.
32 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
33 Sequence training for neural networks. Neural networks are trained with a frame-level discriminative criterion (cross-entropy $L_{CE}$), far from the minimum-WER criterion we care about. GMM-HMMs trained with sequence-level discriminative training (MMI, bMMI (Povey et al., 2008), MPE, MBR, etc.) outperform maximum-likelihood models. Kingsbury (2009) shows how to compute a gradient for back-propagation from the numerator and denominator statistics of the truth / alternative-hypothesis lattices. Given this outer gradient we use back-propagation to compute parameter updates for the neural network.
34 Pretraining. If we have a small amount of supervised data, we can use unlabelled data to get the parameters into reasonable places by modelling the distribution of the inputs, without knowing the labels. Pretraining is done layer by layer, so it is faster than supervised training. There are several methods: contrastive-divergence RBM training; autoencoders; greedy layerwise training [actually supervised]; but none seems necessary for large speech corpora.
35 Alternative nonlinearities.
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$ (37)
Tanh: $\sigma(z) = \tanh(z)$ (38)
ReLU: $\sigma(z) = \max(z, 0)$ (39)
Softsign: $\sigma(z) = \frac{z}{1 + |z|}$ (40)
Softplus: $\sigma(z) = \log(1 + e^z)$ (41)
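The five nonlinearities of eqns (37)-(41), written out as a quick numpy reference sketch (not from the slides).

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))   # (37)
def tanh_(z):    return np.tanh(z)                  # (38)
def relu(z):     return np.maximum(z, 0.0)          # (39)
def softsign(z): return z / (1.0 + np.abs(z))       # (40)
def softplus(z): return np.log1p(np.exp(z))         # (41)

z = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh_, relu, softsign, softplus):
    print(f.__name__, np.round(f(z), 3))
```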
36 Alternative nonlinearities. Note: ReLU gives sparse activations. The ReLU gradient is zero for $z < 0$ and one for $z > 0$, so propagated gradients don't attenuate as much as with other nonlinearities. ReLU and softplus are unbounded; gradients asymptote differently for the other nonlinearities.
37 Neural network variants. Many variations. Convolutional neural networks (Abdel-Hamid et al., 2012): convolve a filter with the input; weight sharing saves parameters and gives invariance to frequency shifts. Recurrent neural networks: take one frame at a time but store a history of the previous frames, so they could theoretically model long-term context. Long Short-Term Memory (Graves et al., 2013): a successful specialization of the recurrent neural network, with complex memory cells.
38 Recurrent neural networks. A recurrent neural network has additional output nodes which are copied back to its inputs with a time delay (Robinson et al., 1993). Training is with back-propagation through time. [Diagram: inputs $x_1, \dots, x_4$, outputs $y_1, \dots, y_5$, recurrent units $r_1, \dots, r_6$.]
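A sketch (not from the slides) of a simple recurrent forward pass. Note it feeds a hidden state back rather than the output nodes the slide describes, so it is an Elman-style illustration of the same idea of carrying context across frames; all sizes and weights are illustrative.

```python
import numpy as np

def rnn_forward(X, W_in, W_rec, W_out):
    """Run a simple recurrent network over a sequence of frames X (T x d).
    The recurrent state from frame t-1 is combined with the input at frame t."""
    T = X.shape[0]
    r = np.zeros(W_rec.shape[0])                   # recurrent state, initially zero
    outputs = []
    for t in range(T):
        r = np.tanh(W_in @ X[t] + W_rec @ r)       # combine frame and delayed state
        outputs.append(W_out @ r)                  # per-frame output y(t)
    return np.array(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 frames, 3 features each
W_in, W_rec, W_out = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
print(rnn_forward(X, W_in, W_rec, W_out).shape)    # (5, 2): one output vector per frame
```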
39 Neural network language modelling. Model $P(w_n \mid w_{n-1}, w_{n-2}, w_{n-3}, \dots)$ with a neural network instead of an n-gram (pure frequency counts with back-off). Simply train a softmax over each $w_n$, using an input representation of $w_{n-1}, w_{n-2}, w_{n-3}, \dots$. Even more effectively, train a recurrent neural network (Mikolov et al., 2010). This leads to word embeddings: a linear projection of sparse word identities (O(millions)) into a lower-dimensional (O(hundreds)) dense vector space. It is easy to add other features (class, part-of-speech). Best performance comes from combination with an n-gram. Real-time decoding is hard, though much of the performance can be retained when the knowledge is extracted and stored in a WFST (Arisoy et al., 2013).
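A minimal sketch (not from the slides) of the feed-forward case: embed the previous three words, concatenate the embeddings, and apply a softmax over the vocabulary. All sizes, names and parameters are illustrative; a real system would train them by back-propagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def nnlm_probs(context_ids, E, W_h, W_out):
    """P(w_n | w_{n-1}, w_{n-2}, w_{n-3}) from a tiny feed-forward NN LM.

    E:     (V x d) word-embedding matrix (dense projection of sparse word ids).
    W_h:   hidden weights over the concatenated context embeddings.
    W_out: (V x h) output weights feeding the softmax over the vocabulary.
    """
    h_in = np.concatenate([E[i] for i in context_ids])   # embed and concatenate
    h = np.tanh(W_h @ h_in)                              # hidden layer
    return softmax(W_out @ h)                            # distribution over next word

V, d, h = 10, 4, 8                                       # toy vocabulary and layer sizes
rng = np.random.default_rng(0)
E, W_h, W_out = rng.normal(size=(V, d)), rng.normal(size=(h, 3 * d)), rng.normal(size=(V, h))
print(nnlm_probs([3, 7, 1], E, W_h, W_out).sum())        # sums to one
```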
40 Bibliography.
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., and Penn, G. (2012). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP. IEEE.
Arisoy, E., Chen, S. F., Ramabhadran, B., and Sethy, A. (2013). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. In ICASSP. IEEE.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In ICASSP.
Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.
Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.
Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In ICASSP.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.
Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP.
Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech.
Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K. (2008). Boosted MMI for model and feature-space discriminative training. In Proc. ICASSP.
Robinson, A. J., Almeida, L., Boite, J.-M., Bourlard, H., Fallside, F., Hochberg, M., Kershaw, D., Kohn, P., Konig, Y., Morgan, N., Neto, J. P., Renals, S., Saerens, M., and Wooters, C. (1993). A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project. In Proc. Eurospeech '93.
Rosenblatt, F. (1957). The perceptron: a perceiving and recognizing automaton. Technical report, Cornell Aeronautical Laboratory.
Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088).
Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. (2013). On rectified linear units for speech processing. In ICASSP.