Automatic Speech Recognition (CS753)
Lecture 8: Tied state HMMs + DNNs in ASR
Instructor: Preethi Jyothi
Aug 17, 2017

Final Project Landscape
Voice conversion using GANs; Musical note extraction; Keystroke detection from keyboard acoustics; End-to-end speech-to-text translation; End-to-end speaker recognition; Code by voice; Text-independent speaker verification; Stenograph; ASR for speech with stutters; Script generator for conversations; Voice assistant; I know who!; Transcribing TED Talks; Swapping instruments in recordings; Guitar note recognition; Audio classification; Emotion recognition using multimodal cues; Sign language to speech conversion; InfoGAN for music; Multi-source speech extraction in noisy environments

Recap: Tied state HMMs
Four main steps in building a tied-state HMM system:
1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities.
2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.
3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together. Which states should be tied together? Use decision trees.
4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch.
Image from: Young et al., Tree-based state tying for high accuracy acoustic modeling, ACL-HLT, 1994

How do we build these phone DTs?
1. What questions are used?
Linguistically-inspired binary questions: Does the left or right phone come from a broad class of phones such as vowels, stops, etc.? Is the left or right phone [k] or [m]?
2. What is the training data for each phone state p_j (the root node of its DT)?

Training data for DT nodes
Align the training data x_i = (x_i1, ..., x_iT_i), i = 1...N, where x_it ∈ R^d, against a set of triphone HMMs.
Use the Viterbi algorithm to find the best HMM state sequence corresponding to each x_i.
Tag each x_it with the ID of the current phone along with its left context and right context.
Example: for a stretch of frames aligned to the triphones sil/b/aa, b/aa/g, aa/g/sil, a frame x_it aligned with the second state of the 3-state HMM for the triphone b/aa/g is tagged with the ID aa2[b/g].
For a state j in phone p, collect all the x_it's that are tagged with ID pj[?/?].

How do we build these phone DTs?
1. What questions are used?
Linguistically-inspired binary questions: Does the left or right phone come from a broad class of phones such as vowels, stops, etc.? Is the left or right phone [k] or [m]?
2. What is the training data for each phone state p_j (the root node of its DT)?
All speech frames that align with the j-th state of every triphone HMM that has p as the middle phone.
3. What criterion is used at each node to find the best question to split the data on?
Find the question which partitions the states in the parent node so as to give the maximum increase in log likelihood.

Likelihood of a cluster of states
If a cluster of HMM states S = {s_1, s_2, ..., s_M} consists of M states, and a total of K acoustic observation vectors {x_1, x_2, ..., x_K} are associated with S, then the log likelihood associated with S is:
L(S) = Σ_{i=1..K} Σ_{s ∈ S} log Pr(x_i; μ_S, Σ_S) γ_s(x_i)
For a question q that splits S into S_yes and S_no, compute the following quantity:
Δ_q = L(S_yes^q) + L(S_no^q) - L(S)
Go through all questions, find Δ_q for each question q, and choose the question for which Δ_q is the largest.
Terminate when the final Δ_q is below a threshold, or when the data associated with a split falls below a threshold.
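
A minimal sketch of this splitting criterion in Python/numpy, assuming hard frame-to-state assignments and a single diagonal-covariance Gaussian per cluster (the criterion above uses soft occupation counts γ_s(x_i)); all function and variable names are illustrative, not from any toolkit:

import numpy as np

def cluster_log_likelihood(frames):
    # frames: (K, d) array of acoustic vectors pooled from all states in the cluster.
    # Fit a single diagonal-covariance Gaussian and return its total log likelihood L(S).
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6                  # variance floor to avoid log(0)
    ll_per_frame = -0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var).sum(axis=1)
    return ll_per_frame.sum()

def best_question(frames, contexts, questions):
    # contexts: list of (left_phone, right_phone) tags, one per frame
    # questions: dict mapping a question name to a predicate over a context
    parent_ll = cluster_log_likelihood(frames)
    best = None
    for name, q in questions.items():
        mask = np.array([q(c) for c in contexts])
        if mask.all() or not mask.any():             # question does not split the data
            continue
        gain = (cluster_log_likelihood(frames[mask]) +
                cluster_log_likelihood(frames[~mask]) - parent_ll)   # Δ_q
        if best is None or gain > best[1]:
            best = (name, gain)
    return best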

Likelihood criterion
Given a phonetic question, let the initial set of untied states S be split into two partitions S_yes and S_no.
Each partition is clustered to form a single Gaussian output distribution with mean μ_{S_yes} and covariance Σ_{S_yes} (and likewise for S_no).
Use the likelihood of the parent state and of the subsequent split states to determine which question a node should be split on.
Image from: Young et al., Tree-based state tying for high accuracy acoustic modeling, ACL-HLT, 1994

Example: Phonetic Decision Tree (DT)
One tree is constructed for each state of each phone to cluster all the corresponding triphone states.
DT for the centre state of [ow]; uses all training data tagged as ow2[?/?].
Head node: aa/ow2/f, aa/ow2/s, aa/ow2/d, h/ow2/p, aa/ow2/n, aa/ow2/g, ...
- Is left ctxt a vowel?
  - Yes: Is right ctxt a fricative?
    - Yes: Leaf A (aa/ow2/f, aa/ow2/s, ...)
    - No: Is right ctxt a nasal?
      - Yes: Leaf E (aa/ow2/n, aa/ow2/m, ...)
      - No: Leaf B (aa/ow2/d, aa/ow2/g, ...)
  - No: Is right ctxt a glide?
    - Yes: Leaf C (h/ow2/l, b/ow2/r, ...)
    - No: Leaf D (h/ow2/p, b/ow2/k, ...)

For an unseen triphone at test time
Transition matrix: use the transition matrix common to all triphones of the phoneme.
State observation densities: use the triphone identity to traverse all the way to a leaf of the decision tree, and use the state observation probabilities associated with that leaf.
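
For illustration, looking up the tied state for an unseen triphone is just a walk down the tree. A toy sketch (the tree encoding, phone classes, and leaf names are made up for this example):

VOWELS = {'aa', 'iy', 'ow', 'eh'}
FRICATIVES = {'f', 's', 'sh', 'z'}

def find_leaf(node, left, right):
    # Internal nodes are dicts with a question and yes/no children; leaves are tied-state IDs.
    while isinstance(node, dict):
        node = node['yes'] if node['q'](left, right) else node['no']
    return node

# Toy tree for the centre state of [ow]:
ow2_tree = {'q': lambda l, r: l in VOWELS,
            'yes': {'q': lambda l, r: r in FRICATIVES, 'yes': 'ow2_A', 'no': 'ow2_B'},
            'no': 'ow2_D'}

print(find_leaf(ow2_tree, 'aa', 'f'))   # -> 'ow2_A': use this leaf's observation density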

That's a wrap on HMM-based acoustic models
Decoding cascade: Acoustic Indices → (Acoustic Models, H) → Triphones → (Context Transducer) → Monophones → (Pronunciation Model) → Words → (Language Model) → Word Sequence
The resulting FST H is the union and closure of one 3-state HMM FST for each tied-state triphone (with arcs such as f0:a/a_b, f1:ε, f2:ε, ...); its parameters are estimated using the Baum-Welch algorithm.

DNN-based acoustic models?
Same decoding cascade: Acoustic Indices → (Acoustic Models, H) → Triphones → (Context Transducer) → Monophones → (Pronunciation Model) → Words → (Language Model) → Word Sequence
Can we use deep neural networks, instead of HMMs, to learn mappings between acoustics and phones (i.e., to produce phone posteriors)?

Brief Introduction to Neural Networks

Feed-forward Neural Network
(Figure: input layer → hidden layer → output layer.)

Feed-forward Neural Network: Brain Metaphor
A single neuron computes y_i = g(Σ_i w_i x_i), where the x_i are inputs, the w_i are weights, and g is the activation function.
Image from: https://upload.wikimedia.org/wikipedia/commons/1/10/blausen_0657_multipolarneuron.png

Feed-forward Neural Network: Parameterized Model
(Figure: a small network with inputs x1, x2 at nodes 1 and 2, hidden nodes 3 and 4, and an output node 5 with activation a5; weights w13, w23, w14, w24, w35, w45.)
a5 = g(w35 a3 + w45 a4) = g(w35 g(w13 a1 + w23 a2) + w45 g(w14 a1 + w24 a2))
Parameters of the network: all the weights wij (and biases, not shown here).
If x is a 2-dimensional vector and the layer above it is a 2-dimensional vector h, the fully-connected layer is h = xW + b, where Wij is the weight of the connection between the i-th neuron in the input row and the j-th neuron in the first hidden layer, and b is the bias vector.

Feed-forward Neural Network: Parameterized Model (contd.)
The simplest neural network is the perceptron: Perceptron(x) = xW + b
A 1-layer feedforward neural network (multi-layer perceptron) has the form: MLP(x) = g(xW1 + b1) W2 + b2
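
As a concrete sketch (hypothetical dimensions, and sigmoid chosen as g), the 1-layer feedforward network above is only a few lines of numpy:

import numpy as np

def g(x):                       # activation function, e.g. sigmoid
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_hid, d_out = 2, 4, 3
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid, d_out)), np.zeros(d_out)

def mlp(x):
    # MLP(x) = g(x W1 + b1) W2 + b2
    return g(x @ W1 + b1) @ W2 + b2

print(mlp(np.array([0.5, -1.0])))    # a 3-dimensional output vector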

Common Activation Functions (g)
Sigmoid: σ(x) = 1/(1 + e^(-x))
Hyperbolic tangent: tanh(x) = (e^(2x) - 1)/(e^(2x) + 1)
Rectified Linear Unit: ReLU(x) = max(0, x)
(Plot: the three nonlinear activation functions over x ∈ [-10, 10].)
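
The three activations are easy to implement and compare numerically; a small numpy sketch:

import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)
def relu(x):    return np.maximum(0.0, x)

x = np.linspace(-10, 10, 5)
print(sigmoid(x))   # values in (0, 1)
print(tanh(x))      # values in (-1, 1)
print(relu(x))      # values in [0, inf)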

Optimization Problem
To train a neural network, define a loss function L(y, ỹ): a function of the true output y and the predicted output ỹ.
L(y, ỹ) assigns a non-negative numerical score to the neural network's output ỹ.
The parameters of the network are set to minimise L over the training examples (i.e., a sum of losses over the different training samples).
L is typically minimised using a gradient-based method.

Stochastic Gradient Descent (SGD)
SGD Algorithm
Inputs: function NN(x; θ), training examples x_1 ... x_n with outputs y_1 ... y_n, and loss function L.
do until stopping criterion:
  Pick a training example (x_i, y_i)
  Compute the loss L(NN(x_i; θ), y_i)
  Compute the gradient ∇L of L with respect to θ
  θ ← θ - η ∇L
done
Return: θ
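
A minimal runnable version of this loop, using a linear model for NN(x; θ) and a squared-error loss purely for illustration (data, dimensions, and learning rate are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                       # training inputs x_1..x_n
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)     # outputs y_1..y_n

theta = np.zeros(3)                           # parameters θ of NN(x; θ) = x·θ
eta = 0.01                                    # learning rate η

for step in range(2000):                      # stopping criterion: fixed number of steps
    i = rng.integers(len(X))                  # pick a training example (x_i, y_i)
    pred = X[i] @ theta                       # compute NN(x_i; θ)
    grad = 2 * (pred - y[i]) * X[i]           # ∇L of the squared-error loss w.r.t. θ
    theta -= eta * grad                       # θ ← θ - η ∇L

print(theta)                                  # approaches [1.0, -2.0, 0.5]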

Training a Neural Network
Define the loss function to be minimised as a node L.
Goal: learn weights for the neural network which minimise L.
Gradient descent: find ∂L/∂w for every weight w, and update it as w ← w - η ∂L/∂w.
How do we efficiently compute ∂L/∂w for all w? We will compute ∂L/∂u for every node u in the network!
∂L/∂w = ∂L/∂u ∂u/∂w, where u is the node which uses w.

Training a Neural Network
New goal: compute ∂L/∂u for every node u in the network.
Simple algorithm: backpropagation.
Key fact: the chain rule of differentiation. If L can be written as a function of variables v1, ..., vn, which in turn depend (partially) on another variable u, then ∂L/∂u = Σ_i ∂L/∂v_i ∂v_i/∂u.

Backpropagation
If L can be written as a function of variables v1, ..., vn, which in turn depend (partially) on another variable u, then ∂L/∂u = Σ_i ∂L/∂v_i ∂v_i/∂u.
Consider v1, ..., vn as the layer above u, Γ(u). (Figure: u → v → ... → L.)
Then the chain rule gives ∂L/∂u = Σ_{v ∈ Γ(u)} ∂L/∂v ∂v/∂u.

Backpropagation
∂L/∂u = Σ_{v ∈ Γ(u)} ∂L/∂v ∂v/∂u
Forward pass: first, compute the values of all nodes given an input (these values will be needed during backprop).
Backpropagation:
Base case: ∂L/∂L = 1
For each u (top to bottom):
  For each v ∈ Γ(u): inductively, we have computed ∂L/∂v; directly compute ∂v/∂u.
  Compute ∂L/∂u.
Compute ∂L/∂w, where ∂L/∂w = ∂L/∂u ∂u/∂w (values computed in the forward pass may be needed here).
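
A worked sketch of these two passes for a 1-hidden-layer network with a sigmoid hidden layer and squared-error loss (all shapes and names are illustrative, not from the lecture):

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass: compute and store the values of all nodes.
    h = sigmoid(x @ W1 + b1)                  # hidden layer
    yhat = h @ W2 + b2                        # output layer (linear)
    L = 0.5 * np.sum((yhat - y) ** 2)         # loss node

    # Backward pass: ∂L/∂node, from the top down, reusing forward-pass values.
    d_yhat = yhat - y                         # ∂L/∂ŷ
    d_W2 = np.outer(h, d_yhat)                # ∂L/∂W2 = ∂L/∂ŷ ∂ŷ/∂W2
    d_b2 = d_yhat
    d_h = d_yhat @ W2.T                       # ∂L/∂h
    d_pre = d_h * h * (1 - h)                 # through the sigmoid: σ' = σ(1-σ)
    d_W1 = np.outer(x, d_pre)
    d_b1 = d_pre
    return L, (d_W1, d_b1, d_W2, d_b2)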

History of Neural Networks in ASR
Neural networks for speech recognition were explored as early as 1987.
Deep neural networks for speech: beat the state of the art on the TIMIT corpus [M09]; significant improvements shown on large-vocabulary systems [D11]; now the dominant ASR paradigm [H12].
[M09] A. Mohamed, G. Dahl, and G. Hinton, Deep belief networks for phone recognition, NIPS Workshop on Deep Learning for Speech Recognition, 2009.
[D11] G. Dahl, D. Yu, L. Deng, and A. Acero, Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, TASL 20(1), pp. 30-42, 2012.
[H12] G. Hinton, et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 2012.

What's new? Why have NN-based systems come back to prominence?
Important developments:
- Vast quantities of data available for ASR training
- Fast GPU-based training
- Improvements in optimization/initialization techniques
- Deeper networks enabled by fast training
- Larger output spaces enabled by fast training and availability of data

Neural Networks for ASR Two main categories of approaches have been explored: 1. Hybrid neural network-hmm systems: Use DNNs to estimate HMM observation probabilities 2. Tandem system: NNs used to generate input features that are fed to an HMM-GMM acoustic model

Decoding an ASR system
Recall how we decode the most likely word sequence W for an acoustic sequence O:
W* = arg max_W Pr(O|W) Pr(W)
The acoustic model Pr(O|W) can be further decomposed as (here, Q and M represent triphone and monophone sequences respectively):
Pr(O|W) = Σ_{Q,M} Pr(O, Q, M|W)
        = Σ_{Q,M} Pr(O|Q, M, W) Pr(Q|M, W) Pr(M|W)
        ≈ Σ_{Q,M} Pr(O|Q) Pr(Q|M) Pr(M|W)

Hybrid system decoding
Pr(O|W) ≈ Σ_{Q,M} Pr(O|Q) Pr(Q|M) Pr(M|W)
You've seen Pr(O|Q) estimated using a Gaussian mixture model. Let's use a neural network instead to model Pr(O|Q):
Pr(O|Q) = Π_t Pr(o_t|q_t)
Pr(o_t|q_t) = Pr(q_t|o_t) Pr(o_t) / Pr(q_t) ∝ Pr(q_t|o_t) / Pr(q_t)
where o_t is the acoustic vector at time t and q_t is a triphone HMM state.
Here, Pr(q_t|o_t) are posteriors from a trained neural network; Pr(o_t|q_t) is then a scaled posterior.
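
In practice this conversion is applied per frame in the log domain. A sketch, assuming the NN posteriors and the state priors are already available as arrays (names are illustrative):

import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    # posteriors: (T, S) NN softmax outputs Pr(q_t | o_t) for T frames and S tied states
    # priors:     (S,)  state priors Pr(q_t) estimated from the forced alignments
    # Returns log Pr(o_t | q_t) up to the additive constant log Pr(o_t), which can be
    # dropped since it does not affect the arg max over word sequences.
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))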

Computing Pr(q_t|o_t) using a deep NN
(Figure: a DNN whose input is a fixed window of 5 speech frames, with 39 features in each frame, and whose outputs are triphone state labels.)
How do we get these labels in order to train the NN?

Triphone labels
Forced alignment: use the current acoustic model to find the most likely sequence of HMM states given a sequence of acoustic vectors. (Which algorithm helps compute this?)
The Viterbi paths for the training data are referred to as the forced alignment.
(Figure: training word sequence w1, ..., wn → dictionary → phone sequence p1, ..., pn → triphone HMMs → Viterbi alignment of acoustic vectors o1, o2, ..., oT to states, e.g. frames tagged sil1/b/aa, sil1/b/aa, sil2/b/aa, sil2/b/aa, ..., ee3/k/sil.)

Computing Pr(q_t|o_t) using a deep NN
The triphone state labels used to train the NN come from (Viterbi) forced alignment. As before, the input is a fixed window of 5 speech frames with 39 features in each frame.
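
Assembling the fixed 5-frame input window is simple frame splicing. A sketch, assuming 39-dimensional feature frames (e.g. MFCCs with deltas):

import numpy as np

def splice(frames, context=2):
    # frames: (T, 39) array; returns (T, 39 * (2*context + 1)) spliced inputs,
    # replicating the first/last frame at the utterance boundaries.
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

X = splice(np.random.randn(100, 39))   # (100, 195): 5 frames of 39 features each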

Computing priors Pr(q_t)
To compute HMM observation probabilities Pr(o_t|q_t), we need both Pr(q_t|o_t) and Pr(q_t).
The posterior probabilities Pr(q_t|o_t) are computed using a trained neural network.
The priors Pr(q_t) are the relative frequencies of each triphone state, as determined by the forced Viterbi alignment of the training data.
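
A sketch of this prior estimation from the forced alignments (tied-state IDs are assumed to be integers 0 .. num_states-1; names are illustrative):

import numpy as np
from collections import Counter

def state_priors(alignments, num_states):
    # alignments: list of state-ID sequences, one per training utterance,
    # produced by forced Viterbi alignment.
    counts = Counter(s for utt in alignments for s in utt)
    total = sum(counts.values())
    return np.array([counts[s] / total for s in range(num_states)])   # Pr(q) per tied state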

Hybrid Networks
The hybrid networks are trained to minimise a cross-entropy criterion: L(y, ŷ) = -Σ_i y_i log(ŷ_i)
Advantages of hybrid systems:
1. No assumption that acoustic vectors are uncorrelated: multiple inputs are used from a window of time steps.
2. Discriminative objective function.
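
The per-frame criterion, with y a one-hot (or soft) target over tied states and ŷ the network's softmax output; a small numpy sketch:

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y, ŷ) = -Σ_i y_i log(ŷ_i), averaged over the frames in a minibatch.
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=-1))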

Summary of DNN-HMM acoustic models
Comparison against HMM-GMM on different tasks. Table 3: a comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large-vocabulary tasks.

TASK                                     | HOURS OF TRAINING DATA | DNN-HMM | GMM-HMM (SAME DATA) | GMM-HMM (MORE DATA)
SWITCHBOARD (TEST SET 1)                 | 309                    | 18.5    | 27.4                | 18.6 (2,000 h)
SWITCHBOARD (TEST SET 2)                 | 309                    | 16.1    | 23.6                | 17.1 (2,000 h)
ENGLISH BROADCAST NEWS                   | 50                     | 17.5    | 18.8                |
BING VOICE SEARCH (SENTENCE ERROR RATES) | 24                     | 30.4    | 36.2                |
GOOGLE VOICE INPUT                       | 5,870                  | 12.3    |                     | 16.0 (>>5,870 h)
YOUTUBE                                  | 1,400                  | 47.6    | 52.3                |

Hybrid DNN-HMM systems consistently outperform GMM-HMM systems (sometimes even when the latter is trained with lots more data).
Table copied from G. Hinton, et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 2012.

Neural Networks for ASR Two main categories of approaches have been explored: 1. Hybrid neural network-hmm systems: Use DNNs to estimate HMM observation probabilities 2. Tandem system: NNs used to generate input features that are fed to an HMM-GMM acoustic model

Tandem system
First, train a DNN to estimate the posterior probabilities of each subword unit (monophone, triphone state, etc.).
In a hybrid system, these posteriors (after scaling) would be used as observation probabilities for the HMM acoustic models.
In the tandem system, the DNN outputs are instead used as feature inputs to HMM-GMM models.

Bottleneck Features
(Network: input layer → hidden layers → low-dimensional bottleneck layer → output layer.)
Use the low-dimensional bottleneck-layer representation to extract features. These bottleneck features are in turn used as inputs to HMM-GMM models.
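
Extracting the features then just means stopping the forward pass at the bottleneck layer. A sketch with made-up layer sizes and tanh activations (the trained output layer to the tied states is omitted):

import numpy as np

def bottleneck_features(x, layers, bottleneck_index, g=np.tanh):
    # layers: list of (W, b) pairs from the trained DNN, ordered input to output.
    h = x
    for i, (W, b) in enumerate(layers):
        h = g(h @ W + b)
        if i == bottleneck_index:
            return h            # low-dimensional representation fed to the HMM-GMM system

rng = np.random.default_rng(0)
sizes = [195, 1024, 39, 1024]   # hypothetical: spliced input -> hidden -> bottleneck -> hidden
layers = [(0.01 * rng.normal(size=(a, b)), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]
feats = bottleneck_features(rng.normal(size=195), layers, bottleneck_index=1)
print(feats.shape)              # (39,)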