Lecture 3. Gaussian Mixture Models and Introduction to HMMs. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom


1 Lecture 3 Gaussian Mixture Models and Introduction to HMMs Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com 3 February 2016

2 Administrivia 2 / 106 Lab 1 due Friday, February 12th at 6pm. Should have received username and password. Courseworks discussion has been started. TA office hours: Minda, Wed 2-4pm, EE lounge, 13th floor Mudd; Srihari, Thu 2-4pm, 122 Mudd.

3 Feedback Muddiest topic: MFCC(7), DTW(5), DSP (4), PLP (2) Comments (2+ votes): More examples/spend more time on examples (8) Want slides before class (4) More explanation of equations (3) Engage students more; ask more questions to class (2) 3 / 106

4 Where Are We? 4 / 106 Can extract features over time (MFCC, PLP, others) that... Characterize info in speech signal in compact form. Vector of features extracted 100 times a second

5 DTW Recap Training: record audio A w for each word w in vocab. Generate sequence of MFCC features A w (template for w). Test time: record audio A test, generate sequence of MFCC features A test. For each w, compute distance(A test, A w) using DTW. Return w with smallest distance. DTW computes distance between words represented as sequences of feature vectors... While accounting for nonlinear time alignment. Learned basic concepts (e.g., distances, shortest paths)... That will reappear throughout course. 5 / 106
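
To make the recap concrete, here is a minimal sketch of DTW template matching over numpy feature arrays. It is not the lab code; dtw_distance and recognize are illustrative names, and the step set (diagonal, vertical, horizontal with no extra weights) is one simple assumption.

```python
import numpy as np

def dtw_distance(A, B):
    """DTW distance between feature sequences A (T1 x d) and B (T2 x d),
    using Euclidean frame distances and (diagonal, vertical, horizontal) steps."""
    T1, T2 = len(A), len(B)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T1, T2]

def recognize(A_test, templates):
    """templates: dict mapping each word w to its template feature sequence A_w."""
    return min(templates, key=lambda w: dtw_distance(A_test, templates[w]))
```

Usage would look like recognize(mfcc_test, {"yes": A_yes, "no": A_no}), returning the word whose template is closest under DTW.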

6 What are Pros and Cons of DTW? 6 / 106

7 Pros 7 / 106 Easy to implement.

8 Cons: It's Ad Hoc Distance measures completely heuristic. Why Euclidean? Weight all dimensions of feature vector equally? - ugh!! Warping paths heuristic. Human-derived constraints on warping paths like weights, etc. - ugh!! Doesn't scale well: run DTW for each template in training data - what if large vocabulary? - ugh!! Plenty of other issues. 8 / 106

9 Can We Do Better? Key insight 1: Learn as much as possible from data. e.g., distance measure; warping functions? Key insight 2: Use probabilistic modeling. Use well-described theories and models from... Probability, statistics, and computer science... Rather than arbitrary heuristics with ill-defined properties. 9 / 106

10 Next Two Main Topics Gaussian Mixture models (today) A probabilistic model of... Feature vectors associated with a speech sound. Principled distance between test frame... And set of template frames. Hidden Markov models (next week) A probabilistic model of... Time evolution of feature vectors for a speech sound. Principled generalization of DTW. 10 / 106

11 Part I 11 / 106 Gaussian Distributions

12 Gauss 12 / 106 Johann Carl Friedrich Gauss (1777–1855) "Greatest Mathematician since Antiquity"

13 Gauss's Dog 13 / 106

14 The Scenario 14 / 106 Compute distance between test frame and frame of template. Imagine 2d feature vectors instead of 40d for visualization.

15 Problem Formulation 15 / 106 What if instead of one training sample, have many?

16 Ideas 16 / 106 Average training samples; compute Euclidean distance. Find best match over all training samples. Make probabilistic model of training samples.

17 Where Are We? 17 / 106 1 Gaussians in One Dimension 2 Gaussians in Multiple Dimensions 3 Estimating Gaussians From Data

18 Problem Formulation, Two Dimensions 18 / 106 Estimate P(x 1, x 2 ), the frequency... That training sample occurs at location (x 1, x 2 ).

19 Let's Start With One Dimension 19 / 106 Estimate P(x), the frequency... That training sample occurs at location x.

20 The Gaussian or Normal Distribution 20 / 106
P_{µ,σ²}(x) = N(µ, σ²) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))
Parametric distribution with two parameters: µ = mean (the center of the data). σ² = variance (how wide data is spread).

21 Visualization 21 / 106 Density function, and a sample drawn from the distribution, plotted over the range µ − 4σ to µ + 4σ (ticks at µ, µ ± 2σ, µ ± 4σ).

22 Properties of Gaussian Distributions Is valid probability distribution:
∫ (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)) dx = 1
Central Limit Theorem: Sums of large numbers of identically distributed random variables tend to Gaussian. Lots of different types of data look bell-shaped. Sums and differences of Gaussian random variables... Are Gaussian. If X is distributed as N(µ, σ²)... aX + b is distributed as N(aµ + b, (aσ)²). Negative log looks like weighted Euclidean distance!
ln(√(2π) σ) + (x − µ)² / (2σ²)
22 / 106
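
To see the "weighted Euclidean distance" point numerically, a small sketch (function and variable names are illustrative):

```python
import numpy as np

def neg_log_gaussian(x, mu, var):
    """-ln N(x; mu, var) = ln(sqrt(2*pi)*sigma) + (x - mu)**2 / (2*var):
    a constant plus a squared distance weighted by 1/(2*var)."""
    return 0.5 * np.log(2 * np.pi * var) + (x - mu) ** 2 / (2 * var)

# Frames one and two standard deviations from the mean pay a quadratic penalty.
print(neg_log_gaussian(np.array([0.0, 1.0, 2.0]), mu=0.0, var=1.0))
```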

23 Where Are We? 23 / 106 1 Gaussians in One Dimension 2 Gaussians in Multiple Dimensions 3 Estimating Gaussians From Data

24 Gaussians in Two Dimensions 24 / 106
N(µ1, µ2, σ1², σ2²) = (1 / (2π σ1 σ2 √(1 − r²))) exp( −(1 / (2(1 − r²))) [ (x1 − µ1)²/σ1² − 2r(x1 − µ1)(x2 − µ2)/(σ1 σ2) + (x2 − µ2)²/σ2² ] )
If r = 0, simplifies to
(1 / (√(2π) σ1)) exp(−(x1 − µ1)²/(2σ1²)) · (1 / (√(2π) σ2)) exp(−(x2 − µ2)²/(2σ2²)) = N(µ1, σ1²) N(µ2, σ2²)
i.e., like generating each dimension independently.

25 Example: r = 0, σ1 = σ2 25 / 106 x1, x2 uncorrelated. Knowing x1 tells you nothing about x2.

26 Example: r = 0, σ1 ≠ σ2 26 / 106 x1, x2 can be uncorrelated and have unequal variance.

27 Example: r > 0, σ1 ≠ σ2 27 / 106 x1, x2 correlated. Knowing x1 tells you something about x2.

28 Generalizing to More Dimensions 28 / 106 If we write following matrix:
Σ = [[σ1², r σ1 σ2], [r σ1 σ2, σ2²]]
then another way to write two-dimensional Gaussian is:
N(µ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ))
where x = (x1, x2), µ = (µ1, µ2). More generally, µ and Σ can have arbitrary numbers of components. Multivariate Gaussians.

29 Diagonal and Full Covariance Gaussians Let's say we have a 40d feature vector. How many parameters in covariance matrix Σ? The more parameters,... The more data you need to estimate them. In ASR, usually assume Σ is diagonal ⇒ d params. This is why we like having uncorrelated features! 29 / 106

30 Computing Gaussian Log Likelihoods Why log likelihoods?
Full covariance: log P(x) = −(d/2) ln(2π) − (1/2) ln |Σ| − (1/2) (x − µ)^T Σ^(−1) (x − µ)
Diagonal covariance: log P(x) = −(d/2) ln(2π) − ∑_{i=1}^d ln σ_i − (1/2) ∑_{i=1}^d (x_i − µ_i)² / σ_i²
Again, note similarity to weighted Euclidean distance. Terms on left independent of x; precompute. A few multiplies/adds per dimension. 30 / 106
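
A sketch of the diagonal-covariance case, with the x-independent terms precomputed once per Gaussian (class and variable names are illustrative):

```python
import numpy as np

class DiagGaussian:
    """Diagonal-covariance Gaussian with the x-independent terms precomputed."""

    def __init__(self, mu, var):
        self.mu = np.asarray(mu, dtype=float)
        self.var = np.asarray(var, dtype=float)
        d = len(self.mu)
        # -(d/2) ln(2*pi) - sum_i ln(sigma_i), computed once per Gaussian.
        self.const = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(np.log(self.var))

    def log_prob(self, x):
        # Only a few multiplies/adds per dimension at test time.
        return self.const - 0.5 * np.sum((np.asarray(x) - self.mu) ** 2 / self.var)

g = DiagGaussian(mu=[0.0, 0.0], var=[1.0, 4.0])
print(g.log_prob([1.0, 2.0]))
```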

31 Where Are We? 31 / 106 1 Gaussians in One Dimension 2 Gaussians in Multiple Dimensions 3 Estimating Gaussians From Data

32 Estimating Gaussians Given training data, how to choose parameters µ, Σ? Find parameters so that resulting distribution... Matches data as well as possible. Sample data: height, weight of baseball players. 32 / 106

33 Maximum-Likelihood Estimation (Univariate) 33 / 106 One criterion: data matches distribution well... If distribution assigns high likelihood to data. Likelihood of string of observations x1, x2,..., xN is... Product of individual likelihoods.
L(x_1^N | µ, σ) = ∏_{i=1}^N (1 / (√(2π) σ)) exp(−(x_i − µ)² / (2σ²))
Maximum likelihood estimation: choose µ, σ... That maximizes likelihood of training data.
(µ, σ)_MLE = arg max_{µ,σ} L(x_1^N | µ, σ)

34 Why Maximum-Likelihood Estimation? Assume we have correct model form. Then, as the number of training samples increases... ML estimates approach true parameter values (consistent) ML estimators are the best! (efficient) ML estimation is easy for many types of models. Count and normalize! 34 / 106

35 What is ML Estimate for Gaussians? Much easier to work with log likelihood L = ln L:
L(x_1^N | µ, σ) = −(N/2) ln 2πσ² − (1/2) ∑_{i=1}^N (x_i − µ)² / σ²
Take partial derivatives w.r.t. µ, σ²:
∂L(x_1^N | µ, σ)/∂µ = ∑_{i=1}^N (x_i − µ) / σ²
∂L(x_1^N | µ, σ)/∂σ² = −N / (2σ²) + ∑_{i=1}^N (x_i − µ)² / (2σ⁴)
Set equal to zero; solve for µ, σ²:
µ = (1/N) ∑_{i=1}^N x_i    σ² = (1/N) ∑_{i=1}^N (x_i − µ)²
35 / 106

36 What is ML Estimate for Gaussians? 36 / 106 Multivariate case:
µ = (1/N) ∑_{i=1}^N x_i
Σ = (1/N) ∑_{i=1}^N (x_i − µ)(x_i − µ)^T
What if diagonal covariance? Estimate params for each dimension independently.
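
A minimal count-and-normalize sketch of these ML estimates; the data below is synthetic stand-in data, not the baseball set, and the names are illustrative:

```python
import numpy as np

def mle_gaussian(X, diagonal=True):
    """Count-and-normalize ML estimates from an N x d data matrix X."""
    mu = X.mean(axis=0)                        # mu = (1/N) sum_i x_i
    if diagonal:
        cov = ((X - mu) ** 2).mean(axis=0)     # per-dimension variances
    else:
        cov = (X - mu).T @ (X - mu) / len(X)   # (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, cov

# Synthetic stand-in for the height/weight data (real values not reproduced here).
X = np.random.randn(1033, 2) * [2.0, 20.0] + [73.0, 200.0]
print(mle_gaussian(X))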

37 Example: ML Estimation 37 / 106 Heights (in.) and weights (lb.) of 1033 pro baseball players. Noise added to hide discretization effects. Data file: stanchen/e6870/data/mlb_data.dat. Scatter plot of height vs. weight.

38 Example: ML Estimation 38 / 106

39 Example: Diagonal Covariance 39 / 106
µ1 = (1/1033)(height_1 + height_2 + ···) = ···
µ2 = (1/1033)(weight_1 + weight_2 + ···) = ···
σ1² = (1/1033)[(height_1 − µ1)² + (height_2 − µ1)² + ···] = 5.43
σ2² = (1/1033)[(weight_1 − µ2)² + (weight_2 − µ2)² + ···] = ···

40 Example: Diagonal Covariance 40 / 106

41 Example: Full Covariance 41 / 106 Mean and diagonal elements of covariance matrix are the same as before.
Σ12 = Σ21 = (1/1033)[(height_1 − µ1)(weight_1 − µ2) + (height_2 − µ1)(weight_2 − µ2) + ···] = ···
µ = [µ1 µ2]    Σ = [[σ1², Σ12], [Σ21, σ2²]]

42 Example: Full Covariance 42 / 106

43 Recap: Gaussians Lots of data looks Gaussian. Central limit theorem. ML estimation of Gaussians is easy. Count and normalize. In ASR, mostly use diagonal covariance Gaussians. Full covariance matrices have too many parameters. 43 / 106

44 Part II 44 / 106 Gaussian Mixture Models

45 Problems with Gaussian Assumption 45 / 106

46 Problems with Gaussian Assumption Sample from MLE Gaussian trained on data on last slide. Not all data is Gaussian! 46 / 106

47 Problems with Gaussian Assumption 47 / 106 What can we do? What about two Gaussians?
P(x) = p_1 N(µ_1, Σ_1) + p_2 N(µ_2, Σ_2)
where p_1 + p_2 = 1.

48 Gaussian Mixture Models (GMMs) More generally, can use arbitrary number of Gaussians:
P(x) = ∑_j p_j (1 / ((2π)^(d/2) |Σ_j|^(1/2))) exp(−(1/2) (x − µ_j)^T Σ_j^(−1) (x − µ_j))
where ∑_j p_j = 1 and all p_j ≥ 0. Also called mixture of Gaussians. Can approximate any distribution of interest pretty well... If just use enough component Gaussians. 48 / 106
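
A sketch of evaluating a diagonal-covariance GMM log density; the log-sum-exp trick keeps the sum over components numerically stable. Names and example values are illustrative.

```python
import numpy as np

def gmm_log_prob(x, weights, means, variances):
    """log P(x) = log sum_j p_j N(x; mu_j, Sigma_j) for diagonal covariances,
    computed with log-sum-exp for numerical stability."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    log_norm = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(np.log(variances), axis=1)
    log_comp = np.log(weights) + log_norm - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

print(gmm_log_prob([0.5, 0.5],
                   weights=np.array([0.5, 0.5]),
                   means=np.array([[0.0, 0.0], [3.0, 3.0]]),
                   variances=np.array([[1.0, 1.0], [1.0, 1.0]])))
```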

49 Example: Some Real Acoustic Data 49 / 106

50 Example: 10-component GMM (Sample) 50 / 106

51 Example: 10-component GMM (µ s, σ s) 51 / 106

52 ML Estimation For GMMs Given training data, how to estimate parameters... i.e., the µ_j, Σ_j, and mixture weights p_j... To maximize likelihood of data? No closed-form solution. Can't just count and normalize. Instead, must use an optimization technique... To find good local optimum in likelihood. Gradient search, Newton's method. Tool of choice: The Expectation-Maximization Algorithm. 52 / 106

53 Where Are We? 53 / 106 1 The Expectation-Maximization Algorithm 2 Applying the EM Algorithm to GMMs

54 Wake Up! 54 / 106 This is another key thing to remember from this course. Used to train GMMs, HMMs, and lots of other things. Key paper in 1977 by Dempster, Laird, and Rubin [2]; cited tens of thousands of times to date. "The innovative Dempster-Laird-Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the Royal Statistical Society meeting... calling the paper 'brilliant'."

55 What Does The EM Algorithm Do? Finds ML parameter estimates for models... With hidden variables. Iterative hill-climbing method. Adjusts parameter estimates in each iteration... Such that likelihood of data... Increases (weakly) with each iteration. Actually, finds local optimum for parameters in likelihood. 55 / 106

56 What is a Hidden Variable? A random variable that isn't observed. Example: in GMMs, output prob depends on... The mixture component that generated the observation. But you can't observe it. Important concept. Let's discuss!!!! 56 / 106

57 Mixtures and Hidden Variables 57 / 106 So, to compute prob of observed x, need to sum over... All possible values of hidden variable h:
P(x) = ∑_h P(h, x) = ∑_h P(h) P(x | h)
Consider probability distribution that is a mixture of Gaussians:
P(x) = ∑_j p_j N(µ_j, Σ_j)
Can be viewed as hidden model. h ⇔ which component generated sample. P(h = j) = p_j; P(x | h = j) = N(µ_j, Σ_j).
P(x) = ∑_h P(h) P(x | h)

58 The Basic Idea If nail down hidden value for each x_i,... Model is no longer hidden! e.g., data partitioned among GMM components. So for each data point x_i, assign single hidden value h_i. Take h_i = arg max_h P(h) P(x_i | h). e.g., identify GMM component generating each point. Easy to train parameters in non-hidden models. Update parameters in P(h), P(x | h). e.g., count and normalize to get MLE for µ_j, Σ_j, p_j. Repeat! 58 / 106

59 The Basic Idea 59 / 106 Hard decision: For each x_i, assign single h_i = arg max_h P(h, x_i)... With count 1. Test: what is P(h, x_i) for Gaussian distribution? Soft decision: For each x_i, compute for every h... The posterior prob P(h | x_i) = P(h, x_i) / ∑_{h'} P(h', x_i). Also called the fractional count. e.g., partition event across every GMM component. Rest of algorithm unchanged.

60 The Basic Idea, using more Formal Terminology Initialize parameter values somehow. For each iteration... Expectation step: compute posterior (count) of h for each x_i:
P(h | x_i) = P(h, x_i) / ∑_{h'} P(h', x_i)
Maximization step: update parameters. Instead of data x_i with hidden h, pretend... Non-hidden data where... (Fractional) count of each (h, x_i) is P(h | x_i). 60 / 106
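
The E and M steps written out for a univariate GMM, as a sketch of the soft-count recipe above; the data and starting point below are synthetic, not the values from the worked example that follows, and all names are illustrative.

```python
import numpy as np

def em_step(x, p, mu, var):
    """One EM iteration for a univariate K-component GMM.
    x: (N,) data; p, mu, var: (K,) mixture weights, means, variances."""
    # E step: posterior (fractional count) P(h | x_i) for every component h.
    log_joint = (np.log(p)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)           # N x K
    m = log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint - m)
    post /= post.sum(axis=1, keepdims=True)                      # P(h | x_i)

    # M step: re-estimate parameters from the fractional counts.
    counts = post.sum(axis=0)                                    # sum_i P(h | x_i)
    p_new = counts / len(x)
    mu_new = (post * x[:, None]).sum(axis=0) / counts
    var_new = (post * (x[:, None] - mu_new) ** 2).sum(axis=0) / counts
    return p_new, mu_new, var_new

# Synthetic data and an arbitrary starting point (not the slide's example).
x = np.concatenate([np.random.normal(4.0, 1.0, 50), np.random.normal(7.0, 1.0, 50)])
p, mu, var = np.array([0.5, 0.5]), np.array([3.0, 8.0]), np.array([1.0, 1.0])
for _ in range(10):
    p, mu, var = em_step(x, p, mu, var)
print(p, mu, var)
```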

61 Example: Training a 2-component GMM 61 / 106 Two-component univariate GMM; 10 data points. The data x_1,..., x_10: ···, 7.6, 4.2, 2.6, 5.1, 4.0, 7.8, 3.0, 4.8, 5.8. Initial parameter values: p_1, µ_1, σ_1²; p_2, µ_2, σ_2². Training data; densities of initial Gaussians.

62 The E Step 62 / 106 For each x_i, tabulate p_1 N_1(x_i), p_2 N_2(x_i), P(x_i), P(1 | x_i), P(2 | x_i).
P(h | x_i) = P(h, x_i) / ∑_{h'} P(h', x_i) = p_h N_h(x_i) / P(x_i),  h ∈ {1, 2}

63 The M Step View: have non-hidden corpus for each GMM component. For hth component, have P(h | x_i) counts for event x_i. Estimating µ: fractional events.
µ = (1/N) ∑_{i=1}^N x_i  ⟶  µ_h = ∑_{i=1}^N P(h | x_i) x_i / ∑_{i=1}^N P(h | x_i)
µ_1 = (···) = 3.98
Similarly, can estimate σ_h² with fractional events. 63 / 106

64 The M Step (cont'd) 64 / 106 What about the mixture weights p_h? To find MLE, count and normalize!
p_1 = (1/N) ∑_i P(1 | x_i) = ··· = 0.59

65 The End Result 65 / 106 Parameter values by iteration: iter, p_1, µ_1, σ_1², p_2, µ_2, σ_2².

66 First Few Iterations of EM 66 / 106 iter 0 iter 1 iter 2

67 Later Iterations of EM 67 / 106 iter 2 iter 3 iter 10

68 Why the EM Algorithm Works 68 / 106 x = (x_1, x_2,...) = whole training set; h = hidden. θ = parameters of model. Objective function for MLE: (log) likelihood.
L(θ) = log P_θ(x) = log P_θ(x, h) − log P_θ(h | x)
Form expectation with respect to θ_n, the estimate of θ on the nth estimation iteration:
∑_h P_{θ_n}(h | x) log P_θ(x) = ∑_h P_{θ_n}(h | x) log P_θ(x, h) − ∑_h P_{θ_n}(h | x) log P_θ(h | x)
Rewrite as: log P_θ(x) = Q(θ | θ_n) + H(θ | θ_n)

69 Why the EM Algorithm Works 69 / 106
log P_θ(x) = Q(θ | θ_n) + H(θ | θ_n)
What is Q? In the Gaussian example above, Q is just
∑_h P_{θ_n}(h | x) log [ p_h N(x; µ_h, Σ_h) ]
It can be shown (using Gibbs' inequality) that H(θ | θ_n) ≥ H(θ_n | θ_n) for any θ. So that means that any choice of θ that increases Q (over Q(θ_n | θ_n)) will increase log P_θ(x). Typically we just pick θ to maximize Q altogether; this can often be done in closed form.

70 The E Step 70 / 106 Compute Q.

71 The M Step 71 / 106 Maximize Q with respect to θ. Then repeat - E/M, E/M - till likelihood stops improving significantly. That's the E-M algorithm in a nutshell!

72 Discussion EM algorithm is elegant and general way to... Train parameters in hidden models... To optimize likelihood. Only finds local optimum. Seeding is of paramount importance. 72 / 106

73 Where Are We? 73 / 106 1 The Expectation-Maximization Algorithm 2 Applying the EM Algorithm to GMMs

74 Another Example Data Set 74 / 106

75 Question: How Many Gaussians? Method 1 (most common): Guess! Method 2: Bayesian Information Criterion (BIC) [1]. Penalize likelihood by number of parameters.
BIC(C_k) = ∑_{j=1}^k { −(1/2) n_j log |Σ_j| } − (1/2) k (d + (1/2) d(d + 1)) log N
k = Gaussian components. d = dimension of feature vector. n_j = data points for Gaussian j; N = total data points. Discuss! 75 / 106
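
A sketch of scoring a candidate set of Gaussians with BIC, assuming the standard penalty of one half the parameter count times log N for k full-covariance components; function name and inputs are illustrative.

```python
import numpy as np

def bic_score(cluster_covs, cluster_sizes, d):
    """Data term sum_j { -1/2 n_j log|Sigma_j| } minus a penalty of
    1/2 * #params * log N, with k*(d + d*(d+1)/2) parameters for k
    full-covariance Gaussians."""
    k, N = len(cluster_covs), sum(cluster_sizes)
    data_term = sum(-0.5 * n_j * np.linalg.slogdet(S_j)[1]
                    for S_j, n_j in zip(cluster_covs, cluster_sizes))
    n_params = k * (d + 0.5 * d * (d + 1))
    return data_term - 0.5 * n_params * np.log(N)

# Pick the number of Gaussians k whose clustering gives the highest BIC score.
```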

76 The Bayesian Information Criterion 76 / 106 View GMM as way of coding data for transmission. Cost of transmitting model ∝ number of params. Cost of transmitting data ∝ negative log likelihood of data. Choose number of Gaussians to minimize total cost.

77 Question: How To Initialize Parameters? Set mixture weights p_j to 1/k (for k Gaussians). Pick N data points at random and... Use them to seed initial values of µ_j. Set all σ's to an arbitrary value... Or to global variance of data. Extension: generate multiple starting points. Pick one with highest likelihood. 77 / 106

78 Another Way: Splitting Start with single Gaussian, MLE. Repeat until hit desired number of Gaussians: Double number of Gaussians by perturbing means... Of existing Gaussians by ±ɛ. Run several iterations of EM. 78 / 106
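
A sketch of the splitting step for diagonal-covariance GMMs; the perturbation size eps is illustrative, and the EM re-estimation between splits is left to the caller.

```python
import numpy as np

def split_gmm(weights, means, variances, eps=0.2):
    """Double the number of (diagonal-covariance) Gaussians: each mean is
    perturbed by +/- eps standard deviations; weights are halved, variances copied."""
    sigma = np.sqrt(variances)
    new_weights = np.concatenate([weights, weights]) / 2.0
    new_means = np.concatenate([means + eps * sigma, means - eps * sigma])
    new_variances = np.concatenate([variances, variances])
    return new_weights, new_means, new_variances

# Typical use: start from the single-Gaussian ML estimate, then alternate
# splitting with several EM iterations until the desired number of Gaussians.
```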

79 Question: How Long To Train? i.e., how many iterations of EM? Guess. Look at performance on training data. Stop when change in log likelihood per event... Is below fixed threshold. Look at performance on held-out data. Stop when performance no longer improves. 79 / 106

80 The Data Set 80 / 106

81 Sample From Best 1-Component GMM 81 / 106

82 The Data Set, Again 82 / 106

83 20-Component GMM Trained on Data 83 / 106

84 20-Component GMM µ's, σ's 84 / 106

85 Acoustic Feature Data Set 85 / 106

86 5-Component GMM; Starting Point A 86 / 106

87 5-Component GMM; Starting Point B 87 / 106

88 5-Component GMM; Starting Point C 88 / 106

89 Solutions With Infinite Likelihood 89 / 106 Consider log likelihood; two-component 1d Gaussian:
∑_{i=1}^N ln( p_1 (1/(√(2π) σ_1)) exp(−(x_i − µ_1)²/(2σ_1²)) + p_2 (1/(√(2π) σ_2)) exp(−(x_i − µ_2)²/(2σ_2²)) )
If µ_1 = x_1, above reduces to
ln( p_1 / (√(2π) σ_1) + (p_2 / (√(2π) σ_2)) exp(−(x_1 − µ_2)²/(2σ_2²)) ) + ∑_{i=2}^N ln(···)
which goes to ∞ as σ_1 → 0. Only consider finite local maxima of likelihood function. Variance flooring. Throw away Gaussians with count below threshold.
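
A sketch of the two safeguards just mentioned, with illustrative threshold values and names:

```python
import numpy as np

def prune_and_floor(weights, means, variances, counts, var_floor=1e-3, min_count=1.0):
    """Drop components whose fractional count fell below a threshold and clamp
    variances from below, so no Gaussian can collapse onto a single point."""
    keep = counts >= min_count
    weights = weights[keep] / weights[keep].sum()   # renormalize mixture weights
    return weights, means[keep], np.maximum(variances[keep], var_floor)
```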

90 Recap GMMs are effective for modeling arbitrary distributions. State-of-the-art in ASR for decades (though may be superseded by NNs at some point; discussed later in course). The EM algorithm is the primary tool for training GMMs (and lots of other things). Very sensitive to starting point. Initializing GMMs is an art. 90 / 106

91 References 91 / 106 [1] S. Chen and P.S. Gopalakrishnan, "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition," ICASSP, vol. 2, 1998. [2] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, 1977.

92 What's Next: Hidden Markov Models Replace DTW with probabilistic counterpart. Together, GMMs and HMMs comprise... Unified probabilistic framework.
Old paradigm: w = arg min_{w ∈ vocab} distance(A_test, A_w)
New paradigm: w = arg max_{w ∈ vocab} P(A_test | w)
92 / 106

93 Part III 93 / 106 Introduction to Hidden Markov Models

94 Introduction to Hidden Markov Models 94 / 106 The issue of weights in DTW. Interpretation of DTW grid as Directed Graph. Adding Transition and Output Probabilities to the Graph gives us an HMM! The three main HMM operations.

95 Another Issue with Dynamic Time Warping 95 / 106 Weights are completely heuristic! Maybe we can learn weights from data? Take many utterances...

96 Learning Weights From Data For each node in DP path, count number of times we move up, right, and diagonally. Normalize number of times each direction is taken by total number of times node was actually visited (= C/N). Take some constant times the reciprocal as weight (αN/C). Example: particular node visited 100 times; the three directions taken 40 times, 20 times, and 40 times. Set weights to 2.5, 5, and 2.5 (or 1, 2, and 1). Point: weight distribution should reflect... Which directions are taken more frequently at a node. Weight estimation not addressed in DTW... But central part of Hidden Markov models. 96 / 106
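
The count-to-weight arithmetic from the example as a tiny sketch; the slide does not say which direction received which count, so the direction labels below are illustrative.

```python
def weights_from_counts(counts, alpha=1.0):
    """Weight for each direction = alpha * N / C: the reciprocal of the relative
    frequency with which that direction is taken at the node."""
    N = sum(counts.values())
    return {direction: alpha * N / C for direction, C in counts.items()}

# Node visited 100 times, directions taken 40, 20, and 40 times:
# weights 2.5, 5.0, 2.5 (alpha = 1), or 1, 2, 1 after rescaling.
print(weights_from_counts({"diagonal": 40, "right": 20, "up": 40}))
```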

97 DTW and Directed Graphs 97 / 106 Take the following Dynamic Time Warping setup: Let's look at the representation of this as a directed graph:

98 DTW and Directed Graphs 98 / 106 Another common DTW structure: As a directed graph: Can represent even more complex DTW structures... Resultant directed graphs can get quite bizarre.

99 Path Probabilities Let's assign probabilities to transitions in directed graph: a_ij is the transition probability going from state i to state j, where ∑_j a_ij = 1. Can compute probability P of individual path just using transition probabilities a_ij. 99 / 106

100 Path Probabilities It is common to reorient typical DTW pictures: Above only describes path probabilities associated with transitions. Also need to include likelihoods associated with observations. 100 / 106

101 Path Probabilities As in GMM discussion, let us define likelihood of producing observation x_i from state j as
b_j(x_i) = ∑_m c_{jm} (1 / ((2π)^(d/2) |Σ_{jm}|^(1/2))) exp(−(1/2) (x_i − µ_{jm})^T Σ_{jm}^(−1) (x_i − µ_{jm}))
where c_{jm} are mixture weights associated with state j. This state likelihood is also called the output probability associated with state. 101 / 106

102 Path Probabilities 102 / 106 In this case, likelihood of entire path can be written as the product of the transition probabilities along the path and the output probabilities of the observations:
P(path) = ∏_t a_{s_{t−1} s_t} b_{s_t}(x_t)
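
A sketch of that path likelihood in the log domain, assuming the initial state is given (the upper-left corner of the trellis) and that log_output_prob returns log b_j(x), e.g. from a GMM as on the previous slide; names are illustrative.

```python
import numpy as np

def path_log_likelihood(states, observations, log_A, log_output_prob):
    """Log likelihood of one path through the trellis: sum of log transition
    probabilities a_ij along the path plus log output probabilities b_j(x_t)."""
    total = log_output_prob(states[0], observations[0])
    for t in range(1, len(states)):
        total += log_A[states[t - 1], states[t]]               # log a_{s_{t-1}, s_t}
        total += log_output_prob(states[t], observations[t])   # log b_{s_t}(x_t)
    return total
```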

103 Hidden Markov Models The output and transition probabilities define a Hidden Markov Model or HMM. Since probabilities of moving from state to state only depend on current and previous state, model is Markov. Since only see observations and have to infer states after the fact, model is hidden. One may consider HMM to be generative model of speech. Starting at upper left corner of trellis, generate observations according to permissible transitions and output probabilities. Not only can compute likelihood of single path... Can compute overall likelihood of observation string... As sum over all paths in trellis. 103 / 106

104 HMM: The Three Main Tasks Compute likelihood of generating string of observations from HMM (Forward algorithm). Compute best path from HMM (Viterbi algorithm). Learn parameters (output and transition probabilities) of HMM from data (Baum-Welch a.k.a. Forward-Backward algorithm). 104 / 106

105 Part IV 105 / 106 Epilogue

106 Course Feedback 1 Was this lecture mostly clear or unclear? What was the muddiest topic? 2 Other feedback (pace, content, atmosphere)? 106 / 106
