Transcription:

FIGURE 3.1. The top graph shows several training points in one dimension, known or assumed to be drawn from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|θ) as a function of the mean. If we had a very large number of training points, this likelihood would be very narrow. The value that maximizes the likelihood is marked θ̂; it also maximizes the logarithm of the likelihood, that is, the log-likelihood l(θ), shown at the bottom. Note that even though they look similar, the likelihood p(D|θ) is shown as a function of θ whereas the conditional density p(x|θ) is shown as a function of x. Furthermore, as a function of θ, the likelihood p(D|θ) is not a probability density function and its area has no significance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
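
The maximization described in this caption is easy to reproduce numerically. The sketch below (all numbers are illustrative, not taken from the book) evaluates the Gaussian log-likelihood l(θ) on a grid of candidate means and confirms that the maximizer θ̂ coincides, up to grid resolution, with the sample mean.

```python
import numpy as np

# Minimal sketch: log-likelihood of a 1-D Gaussian with known variance sigma^2
# and unknown mean theta, evaluated on a grid of candidate means.
rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(loc=4.0, scale=sigma, size=8)   # hypothetical training points

thetas = np.linspace(1.0, 7.0, 601)               # candidate means
# l(theta) = sum_k log N(x_k | theta, sigma^2)
log_lik = np.array([
    np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
           - (data - th) ** 2 / (2 * sigma**2))
    for th in thetas
])

theta_hat = thetas[np.argmax(log_lik)]
print(theta_hat, data.mean())   # the maximizer matches the sample mean (to grid precision)
```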

FIGURE 3.2. Bayesian learning of the mean of normal distributions in one and two dimensions. The posterior distribution estimates p(µ|x_1, x_2, ..., x_n) are labeled by the number of training samples used in the estimation. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
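
A minimal sketch of the corresponding Bayesian update, assuming the standard conjugate case of a Gaussian prior on µ with known data variance; the prior parameters and data below are invented for illustration. It shows the posterior p(µ|x_1, ..., x_n) sharpening as n grows, as in the figure.

```python
import numpy as np

# Bayesian learning of a Gaussian mean with known variance sigma^2 and a
# Gaussian prior N(mu_0, sigma_0^2); all numbers here are illustrative.
rng = np.random.default_rng(1)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
true_mu = 1.5
samples = rng.normal(true_mu, sigma, size=50)

for n in (0, 1, 5, 10, 50):
    x_bar = samples[:n].mean() if n else 0.0
    # Standard conjugate-update formulas for the posterior p(mu | x_1..x_n)
    var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
    mu_n = var_n * (n * x_bar / sigma**2 + mu0 / sigma0**2)
    print(f"n={n:2d}  posterior mean={mu_n:+.3f}  posterior std={np.sqrt(var_n):.3f}")
```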

FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x_1-x_2 subspace or a one-dimensional x_1 subspace) there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
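
The projection effect can be illustrated with a toy experiment; the class means, spread, and the minimum-distance-to-mean rule below are arbitrary choices, not the book's example. Two classes that are cleanly separated along x_3 become nearly indistinguishable once restricted to the x_1-x_2 subspace.

```python
import numpy as np

# Illustrative sketch: two classes well separated along x3 become heavily
# overlapped once projected onto the x1-x2 subspace.
rng = np.random.default_rng(2)
n = 5000
class0 = rng.normal([0.0, 0.0, 0.0], 0.5, size=(n, 3))
class1 = rng.normal([0.0, 0.0, 4.0], 0.5, size=(n, 3))

def error_rate(a, b):
    # Error of a simple minimum-distance-to-class-mean rule, equal priors.
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    def predict_is_b(x):
        return np.linalg.norm(x - mu_a, axis=1) > np.linalg.norm(x - mu_b, axis=1)
    return 0.5 * (predict_is_b(a).mean() + (~predict_is_b(b)).mean())

print("error in 3-D:           ", error_rate(class0, class1))
print("error in x1-x2 subspace:", error_rate(class0[:, :2], class1[:, :2]))
```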

FIGURE 3.4. The training data (black dots) were selected from a quadratic function plus Gaussian noise, i.e., f(x) = ax^2 + bx + c + ε where p(ε) ~ N(0, σ^2). The 10th-degree polynomial shown fits the data perfectly, but we desire instead the second-order function f(x), because it would lead to better predictions for new samples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
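
A short sketch of the same overfitting phenomenon, using made-up coefficients and NumPy's polynomial fitting rather than anything from the text: the 10th-degree fit drives the training error to (nearly) zero while the quadratic generalizes far better on fresh samples.

```python
import numpy as np

# Data from a quadratic plus Gaussian noise, fit with a 2nd-degree and a
# 10th-degree polynomial; coefficients and noise level are illustrative.
rng = np.random.default_rng(3)
a, b, c, noise_std = 0.5, -2.0, 1.0, 2.0

def sample(n):
    x = rng.uniform(0.0, 9.0, n)
    return x, a * x**2 + b * x + c + rng.normal(0.0, noise_std, n)

x_train, y_train = sample(11)       # few training points, as in the figure
x_test, y_test = sample(1000)       # fresh samples to judge generalization

for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:10.3f}   test MSE {test_err:10.3f}")
```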

FIGURE 3.5. Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
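
One standard way to pick a well-separating direction w, consistent with the figure, is Fisher's linear discriminant. The sketch below uses invented class statistics; the key step is solving S_W w = m_1 - m_2.

```python
import numpy as np

# Fisher's linear discriminant direction for two invented 2-D classes.
rng = np.random.default_rng(4)
mean1, mean2 = np.array([1.0, 1.0]), np.array([2.0, 0.0])
cov = np.array([[0.3, 0.2], [0.2, 0.3]])
x1 = rng.multivariate_normal(mean1, cov, 100)
x2 = rng.multivariate_normal(mean2, cov, 100)

m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
# Within-class scatter matrix S_W = S_1 + S_2
s_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
w = np.linalg.solve(s_w, m1 - m2)   # Fisher direction, up to scale
w /= np.linalg.norm(w)

proj1, proj2 = x1 @ w, x2 @ w
print("projected class means:", proj1.mean(), proj2.mean())
print("projected class stds: ", proj1.std(), proj2.std())
```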

FIGURE 3.6. Three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W_1 and W_2. Informally, multiple discriminant methods seek the optimum such subspace, that is, the one with the greatest separation of the projected distributions for a given total within-scatter matrix, here the subspace associated with W_1. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
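
A rough sketch of the multiple discriminant computation for c = 3 classes in three dimensions, with invented data: the projection matrix is formed from the leading eigenvectors of S_W^{-1} S_B, giving the c - 1 = 2 directions with the greatest between-class separation relative to within-class scatter.

```python
import numpy as np

# Multiple discriminant analysis sketch: project 3-D data from 3 classes onto
# the two leading eigenvectors of S_W^{-1} S_B. Data are illustrative.
rng = np.random.default_rng(5)
means = np.array([[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [1.0, 3.0, 1.0]])
classes = [rng.normal(m, 0.6, size=(80, 3)) for m in means]

overall_mean = np.vstack(classes).mean(axis=0)
s_w = sum((x - x.mean(axis=0)).T @ (x - x.mean(axis=0)) for x in classes)
s_b = sum(len(x) * np.outer(x.mean(axis=0) - overall_mean,
                            x.mean(axis=0) - overall_mean) for x in classes)

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real          # 3x2 projection matrix
projected = [x @ W for x in classes]
print("projected class means:\n", np.array([p.mean(axis=0) for p in projected]))
```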

FIGURE 3.7. The search for the best model via the EM algorithm starts with some initial value of the model parameters, θ^0. Then, via the M step, the optimal θ^1 is found. Next, θ^1 is held constant and the value θ^2 is found that optimizes Q(θ^2; θ^1). This process iterates until no value of θ can be found that will increase Q(θ; θ^i). Note in particular that this is different from a gradient search. For example, here θ^1 is the global optimum (given fixed θ^0), and would not necessarily have been found via gradient search. (In this illustration, Q(·; ·) is shown symmetric in its arguments; this need not be the case in general, however.) From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
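
As a concrete, simplified instance of the iteration sketched in this figure, here is EM for a two-component one-dimensional Gaussian mixture on made-up data; each pass performs an E step (responsibilities) followed by an M step (parameter reestimation), and the data log-likelihood never decreases.

```python
import numpy as np

# Compact EM sketch for a two-component 1-D Gaussian mixture (illustrative data).
rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_gauss(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x[:, None] - m) ** 2 / (2 * s**2)

for it in range(50):
    # E step: responsibilities gamma_nk = p(component k | x_n, current params)
    log_joint = np.log(pi) + log_gauss(data, mu, sigma)
    log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
    gamma = np.exp(log_joint - log_norm[:, None])
    # M step: weighted maximum-likelihood updates
    n_k = gamma.sum(axis=0)
    pi = n_k / len(data)
    mu = (gamma * data[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((gamma * (data[:, None] - mu) ** 2).sum(axis=0) / n_k)

print("weights:", pi, "means:", mu, "stds:", sigma)
```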

FIGURE 3.8. The discrete states, ω_i, in a basic Markov model are represented by nodes, and the transition probabilities, a_ij, are represented by links. In a first-order discrete-time Markov model, at any step t the full system is in a particular state ω(t). The state at step t + 1 is a random function that depends solely on the state at step t and the transition probabilities. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
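
A minimal sketch of such a first-order Markov model, with an arbitrary 3x3 transition matrix: the next state is drawn using only the current state's row of A = [a_ij].

```python
import numpy as np

# First-order discrete-time Markov model: the state at t+1 depends only on the
# state at t through the transition matrix A (numbers are illustrative).
rng = np.random.default_rng(7)
A = np.array([[0.7, 0.2, 0.1],      # a_ij = P(omega_j at t+1 | omega_i at t)
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

state, path = 0, [0]
for t in range(20):
    state = rng.choice(3, p=A[state])   # next state drawn from the current row
    path.append(int(state))
print("state sequence omega(1..T):", path)
```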

FIGURE 3.9. Three hidden units in an HMM and the transitions between them are shown in black, while the visible states and the emission probabilities of visible states are shown in red. This model shows all transitions as being possible; in other HMMs, some such candidate transitions are not allowed. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
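
Extending the previous sketch to an HMM with an illustrative transition matrix A and emission matrix B = [b_jk]: at each step the current hidden state first emits a visible symbol and then transitions.

```python
import numpy as np

# HMM sketch: 3 hidden states, 4 visible symbols; A and B are illustrative.
rng = np.random.default_rng(8)
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
B = np.array([[0.5, 0.2, 0.2, 0.1],    # b_jk = P(visible v_k | hidden omega_j)
              [0.1, 0.6, 0.2, 0.1],
              [0.2, 0.1, 0.2, 0.5]])

state, hidden, visible = 0, [], []
for t in range(15):
    hidden.append(int(state))
    visible.append(int(rng.choice(4, p=B[state])))   # emit a visible symbol
    state = rng.choice(3, p=A[state])                # then transition
print("hidden: ", hidden)
print("visible:", visible)
```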

FIGURE 3.10. The computation of probabilities by the Forward algorithm can be visualized by means of a trellis, a sort of unfolding of the HMM through time. Suppose we seek the probability that the HMM was in state ω_2 at t = 3 and generated the observed visible symbol up through that step (including the observed visible symbol v_k). The probability the HMM was in state ω_j(t = 2) and generated the observed sequence through t = 2 is α_j(2) for j = 1, 2, ..., c. To find α_2(3) we must sum these and multiply by the probability that state ω_2 emitted the observed symbol v_k. Formally, for this particular illustration we have α_2(3) = b_2k Σ_{j=1}^{c} α_j(2) a_j2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
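
The trellis recursion in this caption translates directly into a few lines of code. The model below (A, B, initial distribution, observation indices) is invented for illustration; the loop implements α_j(t) = b_jk Σ_i α_i(t-1) a_ij, where v_k is the symbol observed at step t, and summing the final α gives the probability of the whole observation sequence.

```python
import numpy as np

# Forward algorithm sketch on an illustrative 3-state, 4-symbol HMM.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
B = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.2, 0.1, 0.2, 0.5]])
initial = np.array([1.0, 0.0, 0.0])       # start in state omega_1
observations = [0, 2, 3, 1, 1]            # indices of the visible symbols v_k

alpha = initial * B[:, observations[0]]
for v in observations[1:]:
    alpha = B[:, v] * (alpha @ A)         # alpha_j(t) = b_jv * sum_i alpha_i(t-1) a_ij
print("P(observation sequence | model) =", alpha.sum())
```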

FIGURE 3.11. A left-to-right HMM commonly used in speech recognition. For instance, such a model could describe the utterance "viterbi," where ω_1 represents the phoneme /v/, ω_2 represents /i/, ..., and a final silent state /-/. Such a left-to-right model is more restrictive than the general HMM in Fig. 3.9 because it precludes transitions back in time. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
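
The left-to-right restriction amounts to a structural constraint on the transition matrix. A small sketch with arbitrary numbers: keep only the upper triangle (self-loops and forward moves) and renormalize each row.

```python
import numpy as np

# Left-to-right constraint sketch: a_ij = 0 for j < i, rows renormalized.
# The state count and random values are illustrative only.
n_states = 8                               # e.g. 7 phoneme states plus a final silent state
rng = np.random.default_rng(9)
A = rng.random((n_states, n_states))
A = np.triu(A)                             # only self-loops and forward transitions remain
A /= A.sum(axis=1, keepdims=True)
print(np.round(A, 2))
```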

FIGURE 3.12. The decoding algorithm finds at each time step t the state that has the highest probability of having come from the previous step and generated the observed visible state v_k. The full path is the sequence of such states. Because this is a local optimization (dependent only upon the single previous time step, not the full sequence), the algorithm does not guarantee that the path is indeed allowable. For instance, it might be possible that the maximum at t = 5 is ω_1 and at t = 6 is ω_2, and thus these would appear in the path. This can even occur if a_12 = P(ω_2(t + 1) | ω_1(t)) = 0, precluding that transition. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
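
The locally optimal decoding described here is easy to reproduce, and a small invented model exhibits exactly the failure mode in the caption: the per-step argmax path contains the transition ω_1 → ω_2 even though a_12 = 0. A Viterbi search over whole paths would assign zero probability to any path using that transition and so would avoid it.

```python
import numpy as np

# Greedy per-step decoding sketch (states 0-indexed: state 0 = omega_1, etc.).
# With these illustrative numbers the greedy path is [0, 0, 1], i.e. it ends
# with omega_1 -> omega_2 even though a_12 = 0 forbids that transition.
A = np.array([[0.9, 0.0, 0.1],     # a_12 = 0: omega_1 -> omega_2 is forbidden
              [0.1, 0.8, 0.1],
              [0.1, 0.8, 0.1]])
B = np.array([[0.9, 0.1],
              [0.1, 0.9],
              [0.5, 0.5]])
initial = np.array([1 / 3, 1 / 3, 1 / 3])
observations = [0, 0, 1]

alpha = initial * B[:, observations[0]]
path = [int(np.argmax(alpha))]
for v in observations[1:]:
    alpha = B[:, v] * (alpha @ A)          # same forward recursion as before
    path.append(int(np.argmax(alpha)))     # purely local choice at each step
print("greedy state path:", path)
```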