6.864 (Fall 2006), Lecture 18: The EM Algorithm

An Experiment/Some Intuition

I have three coins in my pocket:
Coin 0 has probability λ of heads;
Coin 1 has probability p_1 of heads;
Coin 2 has probability p_2 of heads.

For each trial I do the following:
First I toss Coin 0.
If Coin 0 turns up heads, I toss Coin 1 three times.
If Coin 0 turns up tails, I toss Coin 2 three times.

I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or Coin 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial. You see the following sequence:

HHH, TTT, HHH, TTT, HHH

What would you estimate as the values for λ, p_1 and p_2?

Overview

The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

Maximum Likelihood Estimation

We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ.
We have a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω, such that Σ_{x ∈ X} P(x | Θ) = 1 and P(x | Θ) ≥ 0 for all x.
We assume that our data points x_1, x_2, ..., x_n are drawn at random (independently, identically distributed) from a distribution P(x | Θ*) for some Θ* ∈ Ω.
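Before turning to likelihoods, here is a minimal simulation of the three-coins experiment above (a sketch, not code from the lecture; the parameter values passed in are hypothetical, chosen only for illustration). It makes concrete what is observed and what stays hidden.

```python
# A minimal sketch (not from the lecture) of the three-coins generative story.
import random

def toss_trial(lam, p1, p2):
    """One trial: toss Coin 0, then toss Coin 1 or Coin 2 three times."""
    use_coin_1 = random.random() < lam            # Coin 0 came up heads
    p = p1 if use_coin_1 else p2
    return "".join("H" if random.random() < p else "T" for _ in range(3))

random.seed(0)
# Only the H/T triples are observed; the outcome of Coin 0 stays hidden.
print([toss_trial(lam=0.4, p1=0.1, p2=0.9) for _ in range(5)])
```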

Log-Likelihood

We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ, and a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω.
The likelihood is
    Likelihood(Θ) = P(x_1, x_2, ..., x_n | Θ) = Π_{i=1}^n P(x_i | Θ)
The log-likelihood is
    L(Θ) = log Likelihood(Θ) = Σ_{i=1}^n log P(x_i | Θ)

A First Example: Coin Tossing

X = {H, T}. Our data points x_1, x_2, ..., x_n are a sequence of heads and tails, e.g. HHTTHHHTHH.
The parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads.
The parameter space is Ω = [0, 1].
The distribution P(x | Θ) is defined as
    P(x | Θ) = Θ        if x = H
    P(x | Θ) = 1 - Θ    if x = T

Maximum Likelihood Estimation

Given a sample x_1, x_2, ..., x_n, choose
    Θ_ML = argmax_{Θ ∈ Ω} L(Θ) = argmax_{Θ ∈ Ω} Σ_{i=1}^n log P(x_i | Θ)
For example, take the coin example: say x_1 ... x_n has Count(H) heads and (n - Count(H)) tails. Then
    L(Θ) = log ( Θ^Count(H) (1 - Θ)^(n - Count(H)) )
         = Count(H) log Θ + (n - Count(H)) log(1 - Θ)
We now have
    Θ_ML = Count(H) / n

A Second Example: Probabilistic Context-Free Grammars

X is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ X.
R is the set of rules in the context-free grammar.
N is the set of non-terminals in the grammar.
Θ_r for r ∈ R is the parameter for rule r.
Let R(α) ⊆ R be the rules of the form α → β for some β.
The parameter space Ω is the set of Θ ∈ [0, 1]^|R| such that for all α ∈ N,
    Σ_{r ∈ R(α)} Θ_r = 1
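For the coin-tossing example the ML estimate has the closed form Θ_ML = Count(H)/n. A minimal sketch of that computation (not code from the lecture):

```python
# Closed-form ML estimate for the single-coin example: Θ_ML = Count(H) / n.
def coin_mle(tosses):
    """tosses is a string of observed coin flips such as 'HHTTHHHTHH'."""
    return tosses.count("H") / len(tosses)

print(coin_mle("HHTTHHHTHH"))   # 7 heads out of 10 tosses -> 0.7
```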

We have
    P(T | Θ) = Π_{r ∈ R} Θ_r^Count(T, r)
where Count(T, r) is the number of times rule r is seen in the tree T, and so
    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

Maximum Likelihood Estimation for PCFGs

We have
    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r
where Count(T, r) is the number of times rule r is seen in the tree T.
And
    L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r ∈ R} Count(T_i, r) log Θ_r
Solving Θ_ML = argmax_{Θ ∈ Ω} L(Θ) gives
    Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s ∈ R(α)} Count(T_i, s)
where r is of the form α → β for some β.

Multinomial Distributions

X is a finite set, e.g., X = {dog, cat, the, saw}.
Our sample x_1, x_2, ..., x_n is drawn from X, e.g., x_1, x_2, x_3 = dog, the, saw.
The parameter Θ is a vector in R^m where m = |X|, e.g., Θ_1 = P(dog), Θ_2 = P(cat), Θ_3 = P(the), Θ_4 = P(saw).
The parameter space is
    Ω = {Θ : Σ_{i=1}^m Θ_i = 1 and Θ_i ≥ 0 for all i}
If our sample is x_1, x_2, x_3 = dog, the, saw, then
    L(Θ) = log P(x_1, x_2, x_3 = dog, the, saw | Θ) = log Θ_1 + log Θ_3 + log Θ_4

Models with Hidden Variables

Now say we have two sets X and Y, and a joint distribution P(x, y | Θ).
If we had fully observed data, (x_i, y_i) pairs, then
    L(Θ) = Σ_i log P(x_i, y_i | Θ)
If we have partially observed data, x_i examples only, then
    L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)
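The PCFG estimate above is just a ratio of rule counts grouped by left-hand side. A minimal sketch of that computation (the rule counts below are made up purely for illustration; they are not from the lecture):

```python
# Closed-form PCFG estimate:
#   Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s in R(α)} Count(T_i, s),
# where α is the left-hand side of rule r.
from collections import defaultdict

# Total rule counts over a hypothetical treebank, keyed by (lhs, rhs).
rule_counts = {("S", ("NP", "VP")): 10,
               ("NP", ("DT", "NN")): 6,
               ("NP", ("NN",)): 4}

lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

theta = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}
print(theta[("NP", ("NN",))])   # 4 / (6 + 4) = 0.4
```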

The EM (Expectation Maximization) algorithm is a method for finding
    Θ_ML = argmax_Θ Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

For example, in the three coins example:
    Y = {H, T}
    X = {HHH, TTT, HTT, THH, HHT, TTH, HTH, THT}
    Θ = {λ, p_1, p_2}
and
    P(x, y | Θ) = P(y | Θ) P(x | y, Θ)
where
    P(y | Θ) = λ        if y = H
    P(y | Θ) = 1 - λ    if y = T
and
    P(x | y, Θ) = p_1^h (1 - p_1)^t    if y = H
    P(x | y, Θ) = p_2^h (1 - p_2)^t    if y = T
where h = number of heads in x, t = number of tails in x.

Various probabilities can be calculated, for example:
    P(x = THT, y = H | Θ) = λ p_1 (1 - p_1)^2
    P(x = THT, y = T | Θ) = (1 - λ) p_2 (1 - p_2)^2
    P(x = THT | Θ) = P(x = THT, y = H | Θ) + P(x = THT, y = T | Θ)
                   = λ p_1 (1 - p_1)^2 + (1 - λ) p_2 (1 - p_2)^2
    P(y = H | x = THT, Θ) = P(x = THT, y = H | Θ) / P(x = THT | Θ)
                          = λ p_1 (1 - p_1)^2 / ( λ p_1 (1 - p_1)^2 + (1 - λ) p_2 (1 - p_2)^2 )
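The joint and posterior probabilities above translate directly into code. A minimal sketch (not the lecture's code); the parameter values passed in are hypothetical, used only to exercise the formulas:

```python
# The three-coins probabilities written out directly.
def joint(x, y, lam, p1, p2):
    """P(x, y | Θ): x is a string of H/T, y is 'H' (Coin 1 used) or 'T' (Coin 2 used)."""
    h, t = x.count("H"), x.count("T")
    if y == "H":
        return lam * p1**h * (1 - p1)**t
    return (1 - lam) * p2**h * (1 - p2)**t

def posterior_H(x, lam, p1, p2):
    """P(y = H | x, Θ) = P(x, H | Θ) / (P(x, H | Θ) + P(x, T | Θ))."""
    num = joint(x, "H", lam, p1, p2)
    return num / (num + joint(x, "T", lam, p1, p2))

print(posterior_H("THT", lam=0.3, p1=0.3, p2=0.6))
```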

Fully observed data might look like:
    (HHH, H), (TTT, T), (HHH, H), (TTT, T), (HHH, H)
In this case the maximum likelihood estimates are:
    λ = 3/5
    p_1 = 9/9
    p_2 = 0/6

Partially observed data might look like:
    HHH, TTT, HHH, TTT, HHH
How do we find the maximum likelihood parameters?
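For the fully observed case above, the estimates really are just relative counts. A minimal sketch (not lecture code) that reproduces λ = 3/5, p_1 = 9/9, p_2 = 0/6:

```python
# Fully observed case: the coin identities are visible, so counting is enough.
data = [("HHH", "H"), ("TTT", "T"), ("HHH", "H"), ("TTT", "T"), ("HHH", "H")]

lam = sum(1 for _, y in data if y == "H") / len(data)
heads_1 = sum(x.count("H") for x, y in data if y == "H")
tosses_1 = sum(len(x) for x, y in data if y == "H")
heads_2 = sum(x.count("H") for x, y in data if y == "T")
tosses_2 = sum(len(x) for x, y in data if y == "T")
print(lam, heads_1 / tosses_1, heads_2 / tosses_2)   # 0.6 1.0 0.0
```

The partially observed case, taken up next, is harder because the coin identities must be inferred.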

Partially observed data might look like:
    HHH, TTT, HHH, TTT, HHH
If the current parameters are λ, p_1, p_2:
    P(y = H | x = HHH) = P(HHH, H) / ( P(HHH, H) + P(HHH, T) )
                       = λ p_1^3 / ( λ p_1^3 + (1 - λ) p_2^3 )
    P(y = H | x = TTT) = P(TTT, H) / ( P(TTT, H) + P(TTT, T) )
                       = λ (1 - p_1)^3 / ( λ (1 - p_1)^3 + (1 - λ) (1 - p_2)^3 )
If λ = 0.3, p_1 = 0.3, p_2 = 0.6:
    P(y = H | x = HHH) = 0.0508
    P(y = H | x = TTT) = 0.6967

After filling in the hidden variables for each example, the partially observed data might look like:
    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492
    (TTT, H)    P(y = H | TTT) = 0.6967
    (TTT, T)    P(y = T | TTT) = 0.3033
    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492
    (TTT, H)    P(y = H | TTT) = 0.6967
    (TTT, T)    P(y = T | TTT) = 0.3033
    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492

New Estimates:
    λ   = (3 × 0.0508 + 2 × 0.6967) / 5 = 0.3092
    p_1 = (3 × 3 × 0.0508 + 0 × 2 × 0.6967) / (3 × 3 × 0.0508 + 3 × 2 × 0.6967) = 0.0987
    p_2 = (3 × 3 × 0.9492 + 0 × 2 × 0.3033) / (3 × 3 × 0.9492 + 3 × 2 × 0.3033) = 0.8244
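A minimal sketch of one such update (not the lecture's code): the E-step fills in P(y = H | x) for each triple, and the M-step re-estimates λ, p_1, p_2 from the resulting weighted counts.

```python
# One EM update for the three-coins example.
def em_step(data, lam, p1, p2):
    lam_num = h1 = n1 = h2 = n2 = 0.0
    for x in data:
        h, t = x.count("H"), x.count("T")
        a = lam * p1**h * (1 - p1)**t            # P(x, y = H | Θ)
        b = (1 - lam) * p2**h * (1 - p2)**t      # P(x, y = T | Θ)
        pH = a / (a + b)                         # posterior P(y = H | x, Θ)
        lam_num += pH
        h1 += pH * h;        n1 += pH * len(x)
        h2 += (1 - pH) * h;  n2 += (1 - pH) * len(x)
    return lam_num / len(data), h1 / n1, h2 / n2

print(em_step(["HHH", "TTT", "HHH", "TTT", "HHH"], 0.3, 0.3, 0.6))
# roughly (0.3092, 0.0987, 0.8244), matching the new estimates above
```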

Summary

Begin with parameters λ = 0.3, p_1 = 0.3, p_2 = 0.6.
Fill in the hidden variables, using
    P(y = H | x = HHH) = 0.0508
    P(y = H | x = TTT) = 0.6967
Re-estimate the parameters to be λ = 0.3092, p_1 = 0.0987, p_2 = 0.8244.

Iteration   λ        p_1      p_2      p̃_1     p̃_2     p̃_3     p̃_4     p̃_5
0           0.3000   0.3000   0.6000   0.0508  0.6967  0.0508  0.6967  0.0508
1           0.3092   0.0987   0.8244   0.0008  0.9837  0.0008  0.9837  0.0008
2           0.3940   0.0012   0.9893   0.0000  1.0000  0.0000  1.0000  0.0000
3           0.4000   0.0000   1.0000   0.0000  1.0000  0.0000  1.0000  0.0000

The coin example for {HHH, TTT, HHH, TTT, HHH}; here p̃_i = P(y = H | x_i) under that iteration's parameters. λ is now 0.4, indicating that the coin-tosser has probability 0.4 of selecting the tails-based coin.

Iteration   λ        p_1      p_2      p̃_1     p̃_2     p̃_3     p̃_4
0           0.3000   0.3000   0.6000   0.0508  0.6967  0.0508  0.6967
1           0.3738   0.0680   0.7578   0.0004  0.9714  0.0004  0.9714
2           0.4859   0.0004   0.9722   0.0000  1.0000  0.0000  1.0000
3           0.5000   0.0000   1.0000   0.0000  1.0000  0.0000  1.0000

The coin example for {HHH, TTT, HHH, TTT}. The solution that EM reaches is intuitively correct: the coin-tosser has two coins, one which always shows heads, the other which always shows tails, and is picking between them with equal probability (λ = 0.5). The posterior probabilities p̃_i show that we are certain that coin 1 (the tails-based coin) generated the second and fourth sequences, whereas coin 2 generated the first and third.

Iteration   λ        p_1      p_2      p̃_1     p̃_2     p̃_3     p̃_4
0           0.3000   0.3000   0.6000   0.1579  0.6967  0.0508  0.6967
1           0.4005   0.0974   0.6300   0.0375  0.9065  0.0025  0.9065
2           0.4632   0.0148   0.7635   0.0014  0.9842  0.0000  0.9842
3           0.4924   0.0005   0.8205   0.0000  0.9941  0.0000  0.9941
4           0.4970   0.0000   0.8284   0.0000  0.9949  0.0000  0.9949

The coin example for {HHT, TTT, HHH, TTT}. EM selects a tails-only coin, and a coin which is heavily biased towards heads (p_2 = 0.8284). It is certain that the first and third sequences were generated by coin 2, as they contain heads. The second and fourth could have been generated by either coin, but coin 1 is far more likely.
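The iteration tables can be reproduced with a short loop. Below is a self-contained sketch (not the lecture's implementation); run on the five-sequence data from λ = 0.3, p_1 = 0.3, p_2 = 0.6 it prints the λ, p_1, p_2 columns of the first table above.

```python
# A sketch of the full EM loop for the three-coins example.
def em(data, lam, p1, p2, iterations=4):
    for it in range(iterations):
        print(it, round(lam, 4), round(p1, 4), round(p2, 4))
        lam_num = h1 = n1 = h2 = n2 = 0.0
        for x in data:
            h, t = x.count("H"), x.count("T")
            a = lam * p1**h * (1 - p1)**t          # P(x, y = H | Θ)
            b = (1 - lam) * p2**h * (1 - p2)**t    # P(x, y = T | Θ)
            pH = a / (a + b)                       # posterior P(y = H | x, Θ)
            lam_num += pH
            h1 += pH * h;        n1 += pH * len(x)
            h2 += (1 - pH) * h;  n2 += (1 - pH) * len(x)
        lam, p1, p2 = lam_num / len(data), h1 / n1, h2 / n2

em(["HHH", "TTT", "HHH", "TTT", "HHH"], 0.3, 0.3, 0.6)
```

Starting the same loop from other data sets and initialisations reproduces the remaining tables, including the stuck run shown next.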

Iteration   λ        p_1      p_2      p̃_1     p̃_2     p̃_3     p̃_4
0           0.3000   0.7000   0.7000   0.3000  0.3000  0.3000  0.3000
1           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000
2           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000
3           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000
4           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000
5           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000
6           0.3000   0.5000   0.5000   0.3000  0.3000  0.3000  0.3000

The coin example for {HHH, TTT, HHH, TTT}, with p_1 and p_2 initialised to the same value. EM is stuck at a saddle point.

Iteration   λ        p_1      p_2      p̃_1     p̃_2     p̃_3     p̃_4
0           0.3000   0.6999   0.7000   0.2999  0.3002  0.2999  0.3002
1           0.3001   0.4998   0.5001   0.2996  0.3005  0.2996  0.3005
2           0.3001   0.4993   0.5003   0.2987  0.3014  0.2987  0.3014
3           0.3001   0.4978   0.5010   0.2960  0.3041  0.2960  0.3041
4           0.3001   0.4933   0.5029   0.2880  0.3123  0.2880  0.3123
5           0.3002   0.4798   0.5087   0.2646  0.3374  0.2646  0.3374
6           0.3010   0.4396   0.5260   0.2008  0.4158  0.2008  0.4158
7           0.3083   0.3257   0.5777   0.0739  0.6448  0.0739  0.6448
8           0.3594   0.1029   0.7228   0.0016  0.9500  0.0016  0.9500
9           0.4758   0.0017   0.9523   0.0000  0.9999  0.0000  0.9999
10          0.4999   0.0000   0.9999   0.0000  1.0000  0.0000  1.0000
11          0.5000   0.0000   1.0000   0.0000  1.0000  0.0000  1.0000

The coin example for {HHH, TTT, HHH, TTT}. If we initialise p_1 and p_2 to be a small amount away from the saddle point p_1 = p_2, the algorithm diverges from the saddle point and eventually reaches the global maximum.
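As an added illustration (not from the lecture), the saddle point has a simple explanation: when p_1 = p_2 the two coins are indistinguishable, so every posterior collapses to λ and the M-step produces parameters that again satisfy p_1 = p_2.

```python
# With p1 == p2 the joint probabilities differ only by the factor λ vs (1 - λ),
# so P(y = H | x, Θ) = λ for every x and the re-estimated p_1, p_2 stay equal.
lam, p = 0.3, 0.7
for x in ["HHH", "TTT"]:
    h, t = x.count("H"), x.count("T")
    joint_H = lam * p**h * (1 - p)**t
    joint_T = (1 - lam) * p**h * (1 - p)**t
    print(x, joint_H / (joint_H + joint_T))   # prints 0.3 for both sequences
```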

The EM Algorthm Iteratve procedure s defned as Θ t argmax Θ Q(Θ, Θ t 1 ), where Q(Θ, Θ t 1 ) P (y x, Θ t 1 ) log P (x, y Θ) Key ponts: y Y Intuton: fll n hdden varables y accordng to P (y x, Θ) EM s guaranteed to converge to a local maxmum, or saddle-pont, of the lkelhood functon In general, f argmax Θ log P (x, y Θ) has a smple (analytc) soluton, then argmax Θ P (y x, Θ) log P (x, y Θ) also has a smple (analytc) soluton. y The Structure of Hdden Markov Models Have N states, states 1... N Wthout loss of generalty, take N to be the fnal or stop state Have an alphabet K. For example K {a, b} Parameter π for 1... N s probablty of startng n state Parameter a,j for 1... (N 1), and j 1... N s probablty of state j followng state Parameter b (o) for 1... (N 1), and o K s probablty of state emttng symbol o 33 35 Overvew The EM algorthm n general form The EM algorthm for hdden markov models (brute force) The EM algorthm for hdden markov models (dynamc programmng) An Example Take N 3 states. States are {1, 2, 3}. Fnal state s state 3. Alphabet K {the, dog}. Dstrbuton over ntal state s π 1 1.0, π 2 0, π 3 0. Parameters a,j are j1 j2 j3 1 0.5 0.5 0 2 0 0.5 0.5 Parameters b (o) are othe odog 1 0.9 0.1 2 0.1 0.9 34 36

A Generatve Process A Hdden Varable Problem Pck the start state s 1 probablty π. to be state for 1... N wth We have an HMM wth N 3, K {e, f, g, h} We see the followng output sequences n tranng data Set t 1 Repeat whle current state s t s not the stop state (N): Emt a symbol o t K wth probablty b st (o t ) Pck the next state s t+1 as state j wth probablty a st,j. t t + 1 e e f f g h h g How would you choose the parameter values for π, a,j, and b (o)? 37 39 Probabltes Over Sequences An output sequence s a sequence of observatons o 1... o T where each o K e.g. the dog the dog dog the A state sequence s a sequence of states s 1... s T where each s {1... N} e.g. 1 2 1 2 2 1 HMM defnes a probablty for each state/output sequence par e.g. the/1 dog/2 the/1 dog/2 the/2 dog/1 has probablty Another Hdden Varable Problem We have an HMM wth N 3, K {e, f, g, h} We see the followng output sequences n tranng data e g h e h f h g f g g e h π 1 b 1 (the) a 1,2 b 2 (dog) a 2,1 b 1 (the) a 1,2 b 2 (dog) a 2,2 b 2 (the) a 2,1 b 1 (dog)a 1,3 Formally: ( T ) ( T ) P (s 1... s T, o 1... o T ) π s1 P (s s 1 ) P (o s ) P (N s T ) 2 1 38 How would you choose the parameter values for π, a,j, and b (o)? 40