Statistical sequence recognition


Deterministic sequence recognition. Last time: temporal integration of local distances via DP. Integrates local matches over time; normalizes time variations; for continuous speech, segments as well as classifies. Limitations: end-point detection required; choosing the local distance (effect on the global result); doesn't model context effects between words.

Statistical vs deterministic sequence recognition. Statistical models can also be used (rather than examples) with DP: still integrate local matches over time; normalize time variations; for continuous speech, segment as well as classify. Helping with the DTW limitations: end-point detection is not as critical; the local distance comes from the model; cross-word modeling is straightforward (though it does require enough data to train good models).

Statistical sequence recognition. Powerful tools exist for density estimation, training data alignment, and recognition given the models. Generality increases over the deterministic approach: any distance choice is equivalent to an implicit model, and sufficiently detailed statistics can model any distribution. In recognition, find the MAP choice for the sequence; in practice, approximations are used.

Probabilistic problem statement. Bayes relation for models:
P(M_j | X) = P(X | M_j) P(M_j) / P(X)
where M_j is the j-th statistical model for a sequence of speech units and X is the sequence of feature vectors. Minimum probability of error if j is chosen to maximize P(M_j | X).

Decision rule. Bayes relation for models:
j_best = argmax_j P(M_j | X) = argmax_j P(X | M_j) P(M_j)
since X (and hence P(X)) is fixed for all choices of j.

Bayes decision rule for models. [Diagram: the input X is scored against each model M_1, ..., M_J by computing P(X | M_1) P(M_1), P(X | M_2) P(M_2), ..., P(X | M_J) P(M_J), and the highest-scoring model is chosen.]
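As a minimal sketch of this decision rule (the per-model likelihoods P(X | M_j) and priors P(M_j) below are made-up placeholder values, not anything from the lecture):

```python
# Bayes decision rule over models: choose the j that maximizes
# P(X | M_j) * P(M_j). The likelihoods and priors are illustrative only.

def bayes_decision(likelihoods, priors):
    """Return the index j maximizing P(X | M_j) * P(M_j)."""
    scores = [lik * pri for lik, pri in zip(likelihoods, priors)]
    return max(range(len(scores)), key=lambda j: scores[j])

likelihoods = [1e-8, 4e-7, 2e-7]   # P(X | M_1), P(X | M_2), P(X | M_3)
priors      = [0.5,  0.2,  0.3]    # P(M_1),     P(M_2),     P(M_3)
print(bayes_decision(likelihoods, priors))   # -> 1, i.e. M_2 wins here
```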

Acoustic and language models. So far, no assumptions. But how do we get the probabilities? Estimate them from training data. Then, a first assumption: acoustic parameters are independent of language parameters:
j_best = argmax_j P(M_j | X, Θ) = argmax_j P(X | M_j, Θ_A) P(M_j | Θ_L)
where Θ are parameters estimated from training data.

The three problems: (1) How should P(X | M_j, Θ_A) be computed? (2) How should the parameters Θ_A be determined? (3) Given the model and parameters, how can we find the best sequence to classify the input sequence? Today's focus is on problem 1, with some on problem 3; problem 2 will be next time.

Generative model for speech. [Diagram: the model M_j with parameters Θ_A generates the observation sequence X, with likelihood P(X | M_j, Θ_A).]

Composition of a model. Could collect statistics for whole words; more typically, densities for subword units. Models consist of states: states have possible transitions, observed outputs (feature vectors), and density functions. General statistical formulations are hard to estimate; to simplify, use Markov assumptions.

Markov assumptions. Assume a finite state automaton with stochastic transitions. Each random variable depends only on the previous n variables (typically n = 1). HMMs have another layer of indeterminacy; let's start with Markov models per se.

Example: Markov Model. [Diagram: three states a, b, c, emitting the outputs sunny, cloudy, and rainy respectively; the numbers on the arcs (1/2, 1/3, 1/4, ...) are the probabilities of transitions between states.]

Markov Model. By the definition of joint and conditional probability, if Q = (q_1, q_2, q_3, ..., q_N) then
P(Q) = P(q_1) ∏_{i=2}^{N} P(q_i | q_{i-1}, q_{i-2}, ..., q_1)
And with the 1st-order Markov assumption,
P(Q) = P(q_1) ∏_{i=2}^{N} P(q_i | q_{i-1})

Example: Markov Model (states a, b, c with outputs sunny, cloudy, rainy).
P(abc) = P(c | b) P(b | a) P(a)
If we assume that we start with a, so that P(a) = 1, then
P(abc) = (1/4)(1/3) = 1/12
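A minimal sketch of this computation, assuming (purely for illustration) P(b | a) = 1/3 and P(c | b) = 1/4, one assignment of the diagram's arc values that reproduces the 1/12 result:

```python
# First-order Markov chain probability P(Q) = P(q_1) * prod P(q_i | q_{i-1}).
# The transition values below are one assignment consistent with the worked
# example (P(abc) = 1/12 when P(a) = 1); the full matrix is on the diagram.

from fractions import Fraction as F

initial = {"a": F(1)}                      # assume we start in state a
transitions = {("a", "b"): F(1, 3),        # P(b | a), assumed
               ("b", "c"): F(1, 4)}        # P(c | b), assumed

def markov_prob(states):
    p = initial.get(states[0], F(0))
    for prev, curr in zip(states, states[1:]):
        p *= transitions.get((prev, curr), F(0))
    return p

print(markov_prob(["a", "b", "c"]))   # -> 1/12
```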

Hidden Markov Model (HMM). The outputs of a Markov Model are deterministic; for an HMM, the outputs are stochastic: instead of a fixed value, a pdf. The generating state sequence is hidden. Doubly stochastic: transitions and emissions.

Line-up: ASR researchers and Swedish basketball players. [Cartoon: three people in a line-up, each sighing, with second formants F2 = 1400, 1450, and 1650 Hz.]

MM vs HMM. MM: bbplayer -> 1400 Hz; researcher -> 1600 Hz. HMM: bbplayer -> 1400 Hz mean, 200 Hz std dev, etc. (Gaussian). Outputs are observed in both cases, but only one output is possible per state for the MM, and more than one for the HMM. For the MM we then directly know the state; for the HMM, we need probabilistic inference. In both cases there are two states, with two possible transitions from each.

Two-state HMM. [Diagram: states q_1 and q_2, with self-loops and transitions between them.] Associated functions: P(x_n | q_i) and P(q_j | q_i).

Emissions and transitions. P(x_n | q_1) could be the density for F2 of a bbplayer (emission probability); P(x_n | q_2) could be the density for F2 of a researcher. P(q_2 | q_1) could be the probability of a bbplayer -> researcher transition in the line-up; P(q_1 | q_2) the probability of a researcher -> bbplayer transition; P(q_1 | q_1) the probability of a bbplayer -> bbplayer transition; P(q_2 | q_2) the probability of a researcher -> researcher transition.
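A minimal sketch of such a two-state HMM with scalar Gaussian emission densities. The 1400 Hz and 1600 Hz means and the 200 Hz standard deviation follow the line-up example above (the researcher's standard deviation is assumed equal to the bbplayer's), and the transition probabilities are made-up placeholders:

```python
# Two-state HMM from the line-up example: scalar Gaussian emission densities
# P(x_n | q_i) plus transition probabilities P(q_j | q_i). Means/std follow
# the slides (1400 Hz, 1600 Hz, 200 Hz); transition values are placeholders.

import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

states = ["bbplayer", "researcher"]

emission = {                               # P(x_n | q_i): mean and std of F2
    "bbplayer":   (1400.0, 200.0),
    "researcher": (1600.0, 200.0),         # std assumed equal for both
}

transition = {                             # P(q_j | q_i): assumed values
    ("bbplayer", "bbplayer"):     0.6,
    ("bbplayer", "researcher"):   0.4,
    ("researcher", "bbplayer"):   0.3,
    ("researcher", "researcher"): 0.7,
}

# Emission likelihoods for an observed F2 of 1450 Hz.
for q in states:
    mean, std = emission[q]
    print(q, gaussian_pdf(1450.0, mean, std))
```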

States vs speech classes. A state is just a model component; it could be tied (the same class as a state in a different model). States are often parts of a speech sound (e.g., three to a phone). States can also correspond to specific contexts (e.g., 'uh' with a left context of 'b'). States can also just be repeated versions of the same speech class, which enforces a minimum duration. In practice, state identities are learned.

Temporal constraints. Minimum duration comes from the shortest path. Self-loop probability vs. forward transition probability. Transitions not in the model are not permitted. Sometimes explicit duration models are used.
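A sketch of how such constraints show up in a left-to-right topology: disallowed transitions simply get probability zero, and the shortest path through the three states enforces a minimum duration of three frames (the 0.6/0.4 self-loop vs. forward split is an assumed value):

```python
# Left-to-right, three-state topology: each state has a self-loop and a
# forward transition only. Transitions not in the model get probability 0,
# so any path must visit s1, s2, s3 in order (minimum duration: 3 frames).
# The 0.6/0.4 self-loop vs. forward-transition split is illustrative.

states = ["s1", "s2", "s3"]
trans = {
    ("s1", "s1"): 0.6, ("s1", "s2"): 0.4, ("s1", "s3"): 0.0,
    ("s2", "s1"): 0.0, ("s2", "s2"): 0.6, ("s2", "s3"): 0.4,
    ("s3", "s1"): 0.0, ("s3", "s2"): 0.0, ("s3", "s3"): 1.0,
}

# Every row sums to 1, as required for a stochastic transition matrix.
for s in states:
    assert abs(sum(trans[(s, t)] for t in states) - 1.0) < 1e-9
```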

Estimation of P(X | M). Given states, topology, and probabilities, assume 1st-order Markov. For now, assume the transition probabilities are known and the emission probabilities are estimated. How do we estimate the likelihood function? Hint: at the end, one version will look like DTW.

Total likelihood estimation. Let Q_i be the i-th path through model M, N the number of frames in the input, X the data sequence, and L(M) the number of states in the model. Then expand to
P(X | M) = Σ_{all Q_i in M of length N} P(Q_i, X | M)
But this requires O(N L(M)^N) steps. Fortunately, we can reuse intermediate results.
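A brute-force sketch of this sum over all length-N state sequences, using a tiny two-state model with discrete emissions (all values are illustrative); it is only meant to make the exponential cost concrete before the forward recurrence replaces it:

```python
# Brute-force total likelihood: sum P(Q, X | M) over every length-N state
# sequence Q in the model. This is O(L(M)^N) and is shown only to motivate
# the forward recurrence. The tiny two-state model below is illustrative.

from itertools import product

def total_likelihood(X, states, initial, trans, emit):
    total = 0.0
    for Q in product(states, repeat=len(X)):       # all L(M)^N paths
        p = initial[Q[0]] * emit[Q[0]][X[0]]
        for n in range(1, len(X)):
            p *= trans[(Q[n - 1], Q[n])] * emit[Q[n]][X[n]]
        total += p
    return total

states  = ["q1", "q2"]
initial = {"q1": 0.7, "q2": 0.3}
trans   = {("q1", "q1"): 0.8, ("q1", "q2"): 0.2,
           ("q2", "q1"): 0.4, ("q2", "q2"): 0.6}
emit    = {"q1": {"lo": 0.9, "hi": 0.1},           # discrete emissions for brevity
           "q2": {"lo": 0.2, "hi": 0.8}}

print(total_likelihood(["lo", "hi", "hi"], states, initial, trans, emit))
```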

Forward recurrence (1). Expand to the joint probability at the last frame only:
P(X | M) = Σ_{l=1}^{L(M)} P(q_l^N, X | M)
Decompose into local and cumulative terms, where X can be expressed as X_1^N. Using P(a, b | c) = P(a | b, c) P(b | c), get
P(q_l^n, X_1^n | M) = Σ_{k=1}^{L(M)} P(q_k^{n-1}, X_1^{n-1} | M) P(q_l^n, x_n | q_k^{n-1}, X_1^{n-1}, M)

Forward recurrence (2). Now define a joint probability for the state at time n being q_l, together with the observation sequence:
α_n(l | M) = P(q_l^n, X_1^n | M)
Then, restating the forward recurrence,
α_n(l | M) = Σ_{k=1}^{L(M)} α_{n-1}(k) P(q_l^n, x_n | q_k^{n-1}, X_1^{n-1}, M)

Forward recurrence (3). The local term can be decomposed further:
P(q_l^n, x_n | q_k^{n-1}, X_1^{n-1}, M) = P(q_l^n | q_k^{n-1}, X_1^{n-1}, M) P(x_n | q_l^n, q_k^{n-1}, X_1^{n-1}, M)
But these are very hard to estimate, so we need to make two assumptions of conditional independence.

Assumption 1: 1st-order Markov state chain. The state of the Markov chain at time n depends only on the state at time n-1, and is conditionally independent of the past:
P(q_l^n | q_k^{n-1}, X_1^{n-1}, M) = P(q_l^n | q_k^{n-1}, M)

Assumption 2: conditional independence of the data. Given the state, the observations are independent of the past states and observations:
P(x_n | q_l^n, q_k^{n-1}, X_1^{n-1}, M) = P(x_n | q_l^n, M)

Forward recurrence (4). Given those assumptions, the local term is P(q_l^n | q_k^{n-1}, M) P(x_n | q_l^n, M), and the forward recurrence is
α_n(l | M) = Σ_{k=1}^{L(M)} α_{n-1}(k | M) P(q_l^n | q_k^{n-1}, M) P(x_n | q_l^n, M)
Or, suppressing the dependence on M,
α_n(l) = Σ_{k=1}^{L} α_{n-1}(k) P(q_l^n | q_k^{n-1}) P(x_n | q_l^n)

How do we start it? Recall the definition α_n(l | M) = P(q_l^n, X_1^n | M). Set n = 1:
α_1(l) = P(q_l^1, X_1^1) = P(q_l^1) P(x_1 | q_l^1)

How do we finish it? Recall the definition α_n(l | M) = P(q_l^n, X_1^n | M). Sum over all states in the model for n = N:
Σ_{l=1}^{L} α_N(l | M) = Σ_{l=1}^{L} P(q_l^N, X_1^N | M) = P(X | M)
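A sketch of the full forward recurrence with this initialization and termination, on the same illustrative two-state model used in the brute-force example; the result matches the brute-force sum but takes O(N L^2) operations instead of the exponential enumeration:

```python
# Forward recurrence: alpha_n(l) = sum_k alpha_{n-1}(k) P(q_l | q_k) P(x_n | q_l),
# initialized with alpha_1(l) = P(q_l) P(x_1 | q_l) and terminated by summing
# alpha_N(l) over states. Same illustrative discrete two-state model as before.

def forward_likelihood(X, states, initial, trans, emit):
    # Initialization: alpha_1(l) = P(q_l) P(x_1 | q_l)
    alpha = {l: initial[l] * emit[l][X[0]] for l in states}
    # Recurrence over frames n = 2 .. N
    for x in X[1:]:
        alpha = {l: sum(alpha[k] * trans[(k, l)] for k in states) * emit[l][x]
                 for l in states}
    # Termination: P(X | M) = sum_l alpha_N(l)
    return sum(alpha.values())

states  = ["q1", "q2"]
initial = {"q1": 0.7, "q2": 0.3}
trans   = {("q1", "q1"): 0.8, ("q1", "q2"): 0.2,
           ("q2", "q1"): 0.4, ("q2", "q2"): 0.6}
emit    = {"q1": {"lo": 0.9, "hi": 0.1},
           "q2": {"lo": 0.2, "hi": 0.8}}

print(forward_likelihood(["lo", "hi", "hi"], states, initial, trans, emit))
# Matches the brute-force sum over all paths, reusing intermediate results.
```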

Forward recurrence summary. Decompose the data likelihood into a sum (over predecessor states) of products of local and global probabilities, using the conditional independence assumptions. The local probability is the product of emission and transition probabilities; the global probability is a cumulative value. A lot like DTW!

Forward recurrence vs DTW. The terms are probabilistic. Predecessors are model states, not observation frames, and predecessors are always in the previous frame. Local and global factors are combined by product, not sum (but you could take the log). The combination of terms over predecessors is done by summation rather than by finding the maximum (or the minimum, for distance/distortion).

Viterbi approximation. Summing very small products is tricky (numerically). Instead of the total likelihood for the model, we can find the best path through the states: summation is replaced by max. Probabilities can be replaced by log probabilities; then summations can be replaced by a min over sums of negative log probabilities:
-log P(q_l^n, X_1^n) = min_k [ -log P(q_k^{n-1}, X_1^{n-1}) - log P(q_l^n | q_k^{n-1}) - log P(x_n | q_l^n) ]

Similarity to DTW.
-log P(q_l^n, X_1^n) = min_k [ -log P(q_k^{n-1}, X_1^{n-1}) - log P(q_l^n | q_k^{n-1}) - log P(x_n | q_l^n) ]
Negative log probabilities are the distance! But instead of a frame in the reference, we compare to a state in a model, i.e.,
D(n, q_l^n) = min_k [ D(n-1, q_k^{n-1}) + d(n, q_l^n) + T(q_l^n, q_k^{n-1}) ]
Note that we also now have explicit transition costs.
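A sketch of the Viterbi approximation in this negative-log (distance) form, again on the illustrative two-state model; the summation of the forward recurrence is replaced by a minimum, and a backpointer table recovers the best state sequence:

```python
# Viterbi approximation in negative-log form: D(n, q_l) =
#   min_k [ D(n-1, q_k) - log P(q_l | q_k) ] - log P(x_n | q_l),
# i.e. the summation in the forward recurrence is replaced by a minimum over
# negative log probabilities. Same illustrative two-state model as before.

import math

def viterbi_best_path(X, states, initial, trans, emit):
    nlog = lambda p: -math.log(p)
    # Initialization: D(1, q_l) = -log P(q_l) - log P(x_1 | q_l)
    D = {l: nlog(initial[l]) + nlog(emit[l][X[0]]) for l in states}
    back = []
    for x in X[1:]:
        # Best predecessor for each state, then update the cumulative cost.
        prev = {l: min(states, key=lambda k: D[k] + nlog(trans[(k, l)]))
                for l in states}
        D = {l: D[prev[l]] + nlog(trans[(prev[l], l)]) + nlog(emit[l][x])
             for l in states}
        back.append(prev)
    # Backtrack from the best final state.
    best = min(states, key=lambda l: D[l])
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path)), D[best]

states  = ["q1", "q2"]
initial = {"q1": 0.7, "q2": 0.3}
trans   = {("q1", "q1"): 0.8, ("q1", "q2"): 0.2,
           ("q2", "q1"): 0.4, ("q2", "q2"): 0.6}
emit    = {"q1": {"lo": 0.9, "hi": 0.1},
           "q2": {"lo": 0.2, "hi": 0.8}}

path, cost = viterbi_best_path(["lo", "hi", "hi"], states, initial, trans, emit)
print(path, cost)   # best state sequence and its negative log probability
```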

Viterbi vs DTW. Models, not examples. The distance measure now depends on estimating probabilities (good tools exist). We now have an explicit way to specify priors: for state sequences, the transition probabilities; for word sequences, the P(M) priors come from a language model.

Assumptions required. Language and acoustic model parameters are separable. The state chain is first-order Markov. Observations are independent of the past. Recognition via the best path (best state sequence) is good enough; we don't need to sum over all paths to get the best model. If all of these were true, then the resulting inference would be optimal.