CS 136 Lecture 5: Acoustic Modeling and Phoneme Modeling

+ CS 136 Lecture 5: Acoustic Modeling and Phoneme Modeling. September 9, 2016. Professor Meteer. Thanks to Dan Jurafsky for these slides.

+ Directly Modeling Continuous Observations
- Gaussians: univariate Gaussians; Baum-Welch for univariate Gaussians
- Multivariate Gaussians; Baum-Welch for multivariate Gaussians
- Gaussian Mixture Models (GMMs); Baum-Welch for GMMs
- Multivariate Gaussian Mixture Models
- Gaussian: determine the likelihood of a real value being emitted in a particular state
- Multivariate: we have a vector of values, not just one
- Mixture models: the values may not be well modeled by a single Gaussian
- Learning the parameters (means and variances) uses the forward-backward algorithm

+ Gaussians are parameterized by mean and variance

+ Reminder: means and variances
- For a discrete random variable X:
- The mean is the expected value of X: a weighted sum over the values of X
- The variance is the average squared deviation from the mean
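For reference, a standard way to write these two definitions (the slide's own formulas did not survive the transcription):

$$E[X] = \sum_{x} x\,P(X{=}x), \qquad \mathrm{Var}(X) = E\big[(X - E[X])^2\big] = \sum_{x} (x - E[X])^2\,P(X{=}x)$$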

+ Gaussian models
- Suppose you wanted to know the likely nationality of a student and all you knew was their height
- Data: height and nationality
- Collect data: each row is a student, with their height and their nationality
- Learn the mean and variance of height for each nationality
- Decode: for a new student, what is the likelihood of each nationality?
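To make the train/decode steps concrete, here is a minimal Python sketch of this example. The heights and nationality labels below are made-up illustrative data, not from the lecture.

```python
# A minimal sketch: fit one univariate Gaussian per nationality,
# then score a new student's height under each class.
import math
from collections import defaultdict

data = [("US", 70.0), ("US", 72.5), ("US", 69.0),
        ("JP", 66.0), ("JP", 67.5), ("JP", 65.0)]   # illustrative only

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# "Train": estimate mean and variance of height for each nationality.
by_label = defaultdict(list)
for label, height in data:
    by_label[label].append(height)

params = {}
for label, hs in by_label.items():
    mu = sum(hs) / len(hs)
    var = sum((h - mu) ** 2 for h in hs) / len(hs)
    params[label] = (mu, var)

# "Decode": likelihood of each nationality for a new student of height 68.
for label, (mu, var) in params.items():
    print(label, gaussian_pdf(68.0, mu, var))
```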

+ Mixture models
- Suppose you find that the data does not fall into a nice Gaussian, but that if you model males and females separately you have a better model
- E.g., 5'8" is tall for a female but short for a male
- You can build a mixture model that better fits the data

+ Old Faithful Data

[Figure 1: A scatterplot of height and weight of various men (blue crosses) and women (red circles), with two 2D Gaussians superimposed; the 2Σ ellipsoids represent the 95% confidence interval. Produced by biometric_plot.m.]

[Figure 2: The Old Faithful data. The horizontal axis is the duration of the eruption in minutes; the vertical axis is the time until the next eruption in minutes. (a) A single Gaussian. (b) A mixture of two Gaussians. Source: [Bis06] Figure 2.21.]

The corresponding incomplete-data log likelihood is

$$\ell(\theta) = \log p(x_{1:N}\mid\theta) = \sum_{n}\log p(x_n\mid\theta) = \sum_{n}\log\sum_{z_n} p(x_n, z_n\mid\theta) = \sum_{n}\log\sum_{z_n=1}^{K}\pi(z_n)\,\mathcal{N}\big(x_n\mid\mu(z_n),\Sigma(z_n)\big)$$

+ EM Algorithm applied to Old Faithful data

[Figure: Illustration of EM for a GMM applied to the Old Faithful data, showing the fit after L = 1, 2, 5, and 20 iterations. Source: [Bis06] Figure 9.8. The slide also shows the beginning of a MATLAB k-means routine, "function mu = kmeans(data, K, maxiter, thresh)", whose comments note that the component centers mu(k,:) are initialized by picking random data points.]
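For concreteness, here is a compact sketch (my own, not from the slides) of EM for a one-dimensional Gaussian mixture, the algorithm the figure illustrates. The random initialization and variable names are assumptions.

```python
# A compact sketch of EM for a 1-D Gaussian mixture with K components.
import numpy as np

def em_gmm_1d(x, K=2, iters=20):
    N = len(x)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = np.random.choice(x, K, replace=False)    # init means at random data points
    var = np.full(K, x.var())                     # start from the global variance

    for _ in range(iters):
        # E-step: responsibilities r[n, k] = P(component k | x_n)
        log_w = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        r = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, variances from the soft counts
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```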

+ Multivariate Gaussians
- Gray-scale is a single real value, e.g. on a scale from 0 (black) through 50 (gray) to 100 (white)
- Color is a combination of 3 values, a vector: (255, 0, 0) = red, (0, 255, 0) = green, (0, 0, 255) = blue

+ Training data

  Observation (R, G, B)    Label
  243  142   53            Orange
  255  244  102            Yellow
  170  109   27            Brown
  255   64  160            Pink
  133  207  255            Blue

+ Learning "Purple" using Multivariate Gaussians
- Collect all the observations labeled purple:

    R    G    B
  135   38  224
  104   74  141
  128   28  177
   66   47  133
  167    0  255

- Means: 120, 37.4, 186
- Covariance matrix:

        R      G      B
  R   Var R    RG     RB
  G    RG    Var G    GB
  B    RB     GB    Var B

- Are the elements of the vector independent? Compare: a lottery of 3 digits.
- If independent: each observation is modeled with two vectors, the mean and the diagonal of the covariance matrix.
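A minimal numpy sketch of this estimate, using the five purple observations from the table above. The choice of normalizing by N (ddof=0) is an assumption; the slide does not say whether it divides by N or N−1.

```python
# Estimate a multivariate Gaussian (mean vector + diagonal covariance)
# from the five RGB observations labeled purple.
import numpy as np

purple = np.array([
    [135,  38, 224],
    [104,  74, 141],
    [128,  28, 177],
    [ 66,  47, 133],
    [167,   0, 255],
], dtype=float)

mean = purple.mean(axis=0)                       # -> [120., 37.4, 186.]
cov = np.cov(purple, rowvar=False, ddof=0)       # full 3x3 covariance matrix
diag = purple.var(axis=0)                        # diagonal only: per-dimension variances

print(mean)
print(diag)
```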

+ Back to Acoustic Modeling
- Acoustic model: increasingly sophisticated models
- Acoustic likelihood for each state:
  - Gaussians
  - Multivariate Gaussians
  - Mixtures of multivariate Gaussians
- Where a state is, progressively:
  - Context-independent subphone (3-ish per phone)
  - Context-dependent phone (triphones)
  - State-tying of context-dependent phones

+ Reminder: VQ
- To compute p(o_t | q_j):
- Compute the distance between the feature vector o_t and each codeword (prototype vector) in a preclustered codebook, where the distance is either Euclidean or Mahalanobis
- Choose the prototype vector that is closest to o_t and take its codeword v_k
- Then look up the likelihood of v_k given HMM state j in the B matrix
- b_j(o_t) = b_j(v_k) s.t. v_k is the codeword of the closest prototype vector to o_t
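A minimal sketch of this lookup, assuming a precomputed codebook array and a B matrix indexed by state and codeword; the variable names are mine, not from any particular toolkit.

```python
# VQ observation likelihood: nearest codeword, then a B-matrix lookup.
import numpy as np

def vq_likelihood(o_t, codebook, B, state_j):
    """codebook: (K, D) prototype vectors; B: (num_states, K) codeword likelihoods."""
    dists = np.linalg.norm(codebook - o_t, axis=1)   # Euclidean distance to each codeword
    k = int(np.argmin(dists))                        # index of the closest prototype
    return B[state_j, k]                             # b_j(o_t) = b_j(v_k)
```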

+ Better than VQ
- VQ is insufficient for real ASR
- Instead: assume the possible values of the observation feature vector o_t are normally distributed
- Represent the observation likelihood function b_j(o_t) as a Gaussian with mean µ_j and variance σ_j²:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

+ Gaussian as Probability Density Function

+ Gaussian PDFs
- A Gaussian is a probability density function; probability is the area under the curve
- To make it a probability, we constrain the area under the curve to equal 1
- BUT we will be using "point estimates": the value of the Gaussian at a point
- Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx
- As we will see later, this is OK since the same factor is omitted from all Gaussians, so the argmax is still correct

+ Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance; different means shift the curve
- P(o | q) is highest at the mean and low for observations o very far from the mean

+ Using a (univariate) Gaussian as an acoustic likelihood estimator
- Let's suppose our observation was a single real-valued feature (instead of a 39-dimensional vector)
- Then, if we had learned a Gaussian over the distribution of values of this feature,
- We could compute the likelihood of any given observation o_t as follows:

$$b_j(o_t) = \frac{1}{\sigma_j\sqrt{2\pi}} \exp\!\left(-\frac{(o_t-\mu_j)^2}{2\sigma_j^2}\right)$$

  where o_t is the observation, µ_j is the state's mean, and σ_j² is its variance.

+ Training a Univariate Gaussian (just a teaser: we'll come back to this)
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each state was labeled
- We could just compute the mean and variance from the data:

$$\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t \text{ is state } i \qquad\qquad \sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i$$
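A minimal sketch of this supervised case, assuming the observations and their state labels are numpy arrays (the function and variable names are illustrative):

```python
# If every observation o_t carried a state label q_t, each state's mean and
# variance would just be sample statistics over the frames assigned to it.
import numpy as np

def train_univariate_gaussians(obs, states, num_states):
    """obs: (T,) scalar observations; states: (T,) integer array of state labels."""
    means = np.zeros(num_states)
    variances = np.zeros(num_states)
    for i in range(num_states):
        o_i = obs[states == i]          # frames labeled with state i
        means[i] = o_i.mean()
        variances[i] = o_i.var()
    return means, variances
```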

+ Training Univariate Gaussians (just a teaser: we'll come back to this)
- But we don't know which observation was produced by which state!
- What we want: to assign each observation vector o_t to every possible state i, prorated by the probability that the HMM was in state i at time t
- The probability of being in state i at time t is ξ_t(i):

$$\mu_i = \frac{\sum_{t=1}^{T}\xi_t(i)\,o_t}{\sum_{t=1}^{T}\xi_t(i)} \qquad\qquad \sigma_i^2 = \frac{\sum_{t=1}^{T}\xi_t(i)\,(o_t-\mu_i)^2}{\sum_{t=1}^{T}\xi_t(i)}$$
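A minimal sketch of this soft re-estimation, assuming the state posteriors ξ_t(i) have already been computed by the forward-backward algorithm (not shown here):

```python
# Baum-Welch-style soft re-estimation of per-state means and variances.
import numpy as np

def reestimate_gaussians(obs, xi):
    """obs: (T,) scalar observations; xi: (T, N) posteriors P(state i at time t | O)."""
    weights = xi.sum(axis=0)                                      # denominators, per state
    means = (xi * obs[:, None]).sum(axis=0) / weights
    variances = (xi * (obs[:, None] - means) ** 2).sum(axis=0) / weights
    return means, variances
```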

+ Multivariate Gaussians
- Instead of a single mean µ and variance σ:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

- A vector of observations x is modeled by a vector of means µ and a covariance matrix Σ:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
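To make the density formula concrete, a small numpy implementation of it (scipy.stats.multivariate_normal computes the same quantity in one call):

```python
# Direct evaluation of the multivariate Gaussian density above.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """x, mu: (D,) vectors; Sigma: (D, D) covariance matrix."""
    D = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ inv @ diff)
```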

+ Correlated vs. Uncorrelated Values. From a Chen, Picheny et al. lecture slide

+ In two dimensions. From a Chen, Picheny et al. lecture slide

+ Gaussian Intuitions: Off-diagonal
- As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y
Text and figures from Andrew Ng's lecture notes

+ But: assume diagonal covariance
- Assume that the features in the feature vector are uncorrelated
- This isn't true for FFT features, but is true for MFCC features, as we saw previously
- Computation and storage are much cheaper with a diagonal covariance
- Only the diagonal entries are non-zero; the diagonal contains the variance of each dimension, σ_ii²
- So this means we consider the variance of each acoustic feature (dimension) separately

+ Diagonal covariance
- The diagonal contains the variance of each dimension, σ_jd²
- So we consider the variance of each acoustic feature (dimension) separately
- For state j, time t, dimension d:

$$b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\,\sigma_{jd}^2}} \exp\!\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jd}}{\sigma_{jd}}\right)^{2}\right)$$

$$b_j(o_t) = \frac{1}{(2\pi)^{D/2}\left(\prod_{d=1}^{D}\sigma_{jd}^{2}\right)^{1/2}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^{2}}{\sigma_{jd}^{2}}\right)$$
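A minimal sketch of this diagonal-covariance likelihood, computed in the log domain (as practical systems do, to avoid underflow); the function and argument names are mine.

```python
# Log of b_j(o_t) for a single diagonal-covariance Gaussian state.
import numpy as np

def log_b(o_t, mu_j, var_j):
    """o_t, mu_j, var_j: (D,) arrays; var_j holds the per-dimension variances."""
    D = len(o_t)
    return -0.5 * (D * np.log(2 * np.pi)
                   + np.sum(np.log(var_j))
                   + np.sum((o_t - mu_j) ** 2 / var_j))
```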

+ But we're not there yet
- A single Gaussian may do a bad job of modeling the distribution in any one dimension
- Solution: mixtures of Gaussians
Figure from a Chen, Picheny et al. slide

+ Mixture of Gaussians to model a function

+ Mixtures of Gaussians
- M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{D/2}\,|\Sigma_{jk}|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_{jk})^{T}\Sigma_{jk}^{-1}(x-\mu_{jk})\right)$$

- For diagonal covariance:

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\,\sigma_{jkd}^2}} \exp\!\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jkd}}{\sigma_{jkd}}\right)^{2}\right)$$

$$b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2}\left(\prod_{d=1}^{D}\sigma_{jkd}^{2}\right)^{1/2}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^{2}}{\sigma_{jkd}^{2}}\right)$$
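A minimal sketch of the diagonal-covariance GMM state likelihood above, again in the log domain, using a log-sum-exp over the M components for numerical stability; the parameter layout is an assumption.

```python
# Log of b_j(o_t) for a state with an M-component diagonal-covariance GMM.
import numpy as np
from scipy.special import logsumexp

def log_b_gmm(o_t, weights, means, variances):
    """weights: (M,) mixture weights c_jk; means, variances: (M, D) per-component."""
    D = means.shape[1]
    log_comp = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((o_t - means) ** 2 / variances, axis=1))
    return logsumexp(np.log(weights) + log_comp)
```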

+ GMMs
- Summary: each state has a likelihood function parameterized by:
- M mixture weights
- M mean vectors of dimensionality D
- Either M covariance matrices of size D×D,
- Or, more likely, M diagonal covariance matrices of size D×D, which is equivalent to M variance vectors of dimensionality D

+ Context-Dependent Acoustic Models: Triphones
- Our phoneme models represent each phone with 3 states: beginning, middle, and end
- But rather than just modeling the phonemes, we model the phonemes in context
- A triphone model represents a phone with a particular right and left context
Thanks to Dan Jurafsky for these slides

+ Phoneme Variation Thanks to Dan Jurafsky for these slides

+ Sparse data problem
- For a 50-phoneme set we would need 50³ = 125,000 triphones
- In practice, not all combinations occur
- 55K triphones are needed for a 20K-word WSJ corpus
- Only 18.5K occurred in the training data
- Attempting to train all of these triphones would leave many of them without enough samples to train adequately
Thanks to Dan Jurafsky for these slides

+ Reducing triphone parameters
- Cluster similar contexts together
- Tie subphones whose clusters fall into the same contexts
- States that are tied share the same Gaussians
- This significantly cuts down on the number of parameters to be trained
Thanks to Dan Jurafsky for these slides

+ Phonemes with similar contexts Thanks to Dan Jurafsky for these slides

+ How to determine which contexts to cluster?
- Use a decision tree based on phonetic features
- The root is the phone with all of its contexts
- Each level of the tree splits the cluster based on a set of questions about particular phonetic features
- The questions are generally based on articulatory features: stop, nasal, fricative; front vowel, rounded, reduced
Thanks to Dan Jurafsky for these slides
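As a toy illustration of how such questions route a context down the tree to a tied state, here is a hypothetical sketch: the phone sets, questions, and leaf names are invented for illustration, and in real systems the questions at each node are chosen from a fixed question set so as to maximize the likelihood of the training data.

```python
# Toy routing of a triphone's left/right context through yes/no phonetic questions.
NASALS = {"m", "n", "ng"}
FRICATIVES = {"f", "v", "s", "z", "sh", "zh", "th", "dh", "h"}

def tied_state(left, right):
    """Route a (left, right) context for some center phone to a tied leaf state."""
    if left in NASALS:              # Q1: is the left context a nasal?
        return "leaf_A"
    if right in FRICATIVES:         # Q2: is the right context a fricative?
        return "leaf_B"
    return "leaf_C"                 # all remaining contexts share one leaf

print(tied_state("n", "iy"))        # -> leaf_A
```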

+ Feature Sets Thanks to Dan Jurafsky for these slides

+ Decision Tree Thanks to Dan Jurafsky for these slides

+ Tied states Thanks to Dan Jurafsky for these slides

+ Steps to train CD state-tied models Thanks to Dan Jurafsky for these slides