Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.


February 17, 2017

Outline
1. Toy Models
2. Standard Mathematical Models
3. Realistic Hidden Markov Models

"A good stack of examples, as large as possible, is indispensable for a thorough understanding of any concept, and when I want to learn something new, I make it my first job to build one." -- Paul Halmos

The Variable Factory 1

Production process in a factory: if in good state 0 then, independent of the past, with probability 0.9 it will be in state 0 during the next period and with probability 0.1 it will move to bad state 1; once in state 1, it remains in that state forever. Classic Markov chain with transition matrix (rows: current state, columns: next state)

         0     1
    0   0.9   0.1
    1   0     1

The Variable Factory 2

Each item produced in state 0 is of acceptable quality with probability 0.99; each item produced in state 1 is of acceptable quality with probability 0.96. Writing a for acceptable and u for unacceptable:

    P[a | 0] = 0.99    P[u | 0] = 0.01
    P[a | 1] = 0.96    P[u | 1] = 0.04

The Variable Factory 3

The probability of starting in state 0 is 0.8. A sequence of three observed articles is (a, u, a). Then, for each possible state sequence, the joint probability of that state sequence together with the observed sequence is:

    000   (0.8)(0.99)(0.9)(0.01)(0.9)(0.99) = 0.006351048
    001   (0.8)(0.99)(0.9)(0.01)(0.1)(0.96) = 0.000684288
    010   (0.8)(0.99)(0.1)(0.04)(0.0)(0.99) = 0
    011   (0.8)(0.99)(0.1)(0.04)(1.0)(0.96) = 0.00304128
    100   (0.2)(0.96)(0.0)(0.01)(0.9)(0.99) = 0
    101   (0.2)(0.96)(0.0)(0.01)(0.1)(0.96) = 0
    110   (0.2)(0.96)(1.0)(0.04)(0.0)(0.99) = 0
    111   (0.2)(0.96)(1.0)(0.04)(1.0)(0.96) = 0.0073728

The Variable Factory 4

Although we can't observe the state of the factory (that would be expensive), we can observe the product quality. Then

    P[X3 = 0 | (a, u, a)] = (0.006351048 + 0 + 0 + 0) / Σ_seq P[seq, (a, u, a)] ≈ 0.364.

Notice that even without the zero table entries, many partial products are shared between state sequences and can be combined to make the calculation more efficient.
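The table and the conditional probability above can be checked by brute force in Python (the dictionary and variable names here are mine), multiplying the initial, transition, and emission factors for each of the 2^3 state sequences:

```python
from itertools import product

# Variable-factory model: brute-force check of the table above.
init = {0: 0.8, 1: 0.2}                         # P[X1 = state]
trans = {(0, 0): 0.9, (0, 1): 0.1,
         (1, 0): 0.0, (1, 1): 1.0}              # P[next state | current state]
emit = {(0, 'a'): 0.99, (0, 'u'): 0.01,
        (1, 'a'): 0.96, (1, 'u'): 0.04}         # P[quality | state]

def joint(states_seq, obs):
    """P[state sequence and observation sequence], multiplied factor by factor."""
    p = init[states_seq[0]] * emit[(states_seq[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(states_seq[t - 1], states_seq[t])] * emit[(states_seq[t], obs[t])]
    return p

obs = ('a', 'u', 'a')
table = {seq: joint(seq, obs) for seq in product((0, 1), repeat=3)}
total = sum(table.values())                     # P[(a, u, a)]
p_x3_0 = sum(p for seq, p in table.items() if seq[2] == 0) / total
```

Under these parameters, `p_x3_0` works out to about 0.364; only the sequences 000, 001, 011, and 111 contribute, since any transition from state 1 back to state 0 has probability zero.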

A Paleontological Temperature Model 1

Determine the average annual temperature at a particular location over a sequence of years in the distant past. Assume only two annual average temperatures, hot and cold, and use the correlation between the size of tree growth rings and temperature.

Temperature Model 2

The probability of a hot year followed by a hot year is 0.7, and the probability of a cold year followed by a cold year is 0.6, independent of the temperature in prior years. Classic Markov chain with transition matrix

         H     C
    H   0.7   0.3
    C   0.4   0.6

Temperature Model 3

There are 3 tree ring sizes, small S, medium M, and large L, the observable signal of the average annual temperature. The probabilistic relationship between annual temperature and tree ring size is

         S     M     L
    H   0.1   0.4   0.5
    C   0.7   0.2   0.1

Temperature Model 4

The annual average temperatures are hidden states, since we can't directly observe the temperature in the past. Although we can't observe the state or temperature in the past, we can observe the sequence of tree ring sizes. From this evidence, we would like to determine the most likely temperature states in past years.
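Finding the most likely hidden state sequence is exactly what the Viterbi algorithm does. A minimal sketch using the temperature model's matrices follows; the uniform initial distribution is my assumption, since the slides do not specify one:

```python
# Viterbi sketch for the temperature model. The uniform initial distribution
# is an assumption; the slides do not give one.
states = ('H', 'C')
init = {'H': 0.5, 'C': 0.5}
trans = {'H': {'H': 0.7, 'C': 0.3},
         'C': {'H': 0.4, 'C': 0.6}}
emit = {'H': {'S': 0.1, 'M': 0.4, 'L': 0.5},
        'C': {'S': 0.7, 'M': 0.2, 'L': 0.1}}

def viterbi(obs):
    """Return the most likely hidden temperature sequence for tree-ring observations."""
    best = {s: init[s] * emit[s][obs[0]] for s in states}  # best path prob ending in s
    back = []                                              # back-pointers per step
    for o in obs[1:]:
        prev, best, choices = best, {}, {}
        for s in states:
            r = max(states, key=lambda r: prev[r] * trans[r][s])
            best[s] = prev[r] * trans[r][s] * emit[s][o]
            choices[s] = r
        back.append(choices)
    last = max(best, key=best.get)                         # best final state
    path = [last]
    for choices in reversed(back):                         # follow back-pointers
        path.append(choices[path[-1]])
    return list(reversed(path))
```

For instance, a run of small rings decodes to all cold years, and a run of large rings to all hot years, as one would expect from the emission matrix.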

Ball and Urn Model 1

There are N urns, each with a number of colored balls, and M possible colors for the balls. Initially choose one urn from an initial probability distribution; select a ball, record its color, and replace the ball; then choose a new urn according to a transition probability distribution, and repeat. The signal or observation is the color of the selected ball; the hidden states are the urns.

Ball and Urn Model 2 (figure)
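The generative process just described can be sketched directly as a sampler. All of the specific numbers below (2 urns, 3 colors, and the probabilities) are made up for illustration:

```python
import random

random.seed(1)  # reproducible sampling

# A hypothetical 2-urn, 3-color instance of the ball-and-urn model; the
# specific probabilities are illustrative only.
colors = ('red', 'green', 'blue')
init = [0.6, 0.4]                               # initial urn distribution
trans = [[0.7, 0.3], [0.2, 0.8]]                # urn-to-urn transition probabilities
emit = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]       # color distribution within each urn

def generate(length):
    """Return (hidden urn sequence, observed color sequence) of the given length."""
    urn = random.choices((0, 1), weights=init)[0]
    urns, obs = [], []
    for _ in range(length):
        urns.append(urn)                        # hidden state: which urn
        obs.append(random.choices(colors, weights=emit[urn])[0])  # observed color
        urn = random.choices((0, 1), weights=trans[urn])[0]
    return urns, obs
```

An observer sees only the color sequence returned second; the urn sequence returned first is exactly the hidden part of the model.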

Coin Flip Model

Coin flipping experiment: you are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain is a person flipping a coin, or maybe one of several coins. The other person will not tell us exactly what is happening, only the result of each coin flip. How do we build a model to best explain the observed sequence of heads and tails?

Simplest one-coin model

Special case: the proportion of heads in the observation sequence is 0.5. There are two states, each state solely associated with heads or tails. This model is not hidden because the observations directly define the state. This is a degenerate example (standard Bernoulli trials) of a hidden Markov model.

Simplest one-coin model (diagram)

Two states 0 and 1. State 0 emits heads with certainty (P[H] = 1, P[T] = 0) and state 1 emits tails with certainty (P[H] = 0, P[T] = 1). From each state the chain moves to the other state with probability a and stays put with probability 1 - a.

Two-fair-coins Model

There are again two states in the model. Special case: each state has a fair coin, so the probability of a head in each state is 0.5. The probabilities associated with remaining in or leaving each of the two states form a probability transition matrix whose entries are unimportant, because the observable sequences from the 2-fair-coins model are identical in each of the states. That means this special case of the 2-fair-coins model is indistinguishable from the 1-fair-coin model in a statistical sense, so this is another degenerate example of a hidden Markov model.

2-biased-coins Model

The person behind the curtain has two biased coins and a third fair coin, with the biased coins associated to the two faces of the fair coin respectively. The person behind the barrier flips the fair coin to decide which biased coin to use, and then flips the chosen biased coin to generate the observed outcome. If p0 = 1 - p1 (e.g. p0 = 0.6 and p1 = 0.4), the long-term average of heads would be statistically indistinguishable from either the 1-fair-coin model or the 2-fair-coins model. However, higher-order statistics of the 2-biased-coins model, such as the probability of runs of heads, should be distinguishable from the 1-fair-coin or the 2-fair-coins model.

2-biased-coins Model (diagram)

Two states 0 and 1, with self-transition probabilities a00 and a11 and cross-transition probabilities 1 - a00 and 1 - a11. Emissions: P[H | 0] = p0, P[T | 0] = 1 - p0; P[H | 1] = p1, P[T | 1] = 1 - p1.

2-biased-coins Model

Now suppose the observed sequence has a very long run of heads, then a long run of tails of some random length, interrupted by a head, followed again by yet another long random run of tails. A 2-coin model with two biased coins, with biased switching between the states of the coins, would be a possible model for the observed sequence. Such a sequence of many heads followed by many tails could also conceivably come from one fair coin. The choice between a 1-fair-coin or 2-biased-coins model would be justified by the likelihoods of the observations under the models, or possibly by other external modeling considerations.
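The distinguishability point is easy to see in simulation. The sketch below (all parameters are mine) draws flips from a single fair coin and from a sticky 2-biased-coins source whose overall head frequency is still 1/2, and measures head runs:

```python
import random

random.seed(0)  # fix the sample for reproducibility

def flips_one_fair(n):
    """n flips of a single fair coin."""
    return [random.random() < 0.5 for _ in range(n)]

def flips_two_biased(n, p0=0.9, p1=0.1, stay=0.95):
    """n flips from a 2-biased-coins source with sticky state switching.

    Because p0 = 1 - p1 and the switching is symmetric, the long-run head
    frequency is still 1/2, matching the fair coin's first-order statistics.
    """
    state, out = 0, []
    for _ in range(n):
        out.append(random.random() < (p0 if state == 0 else p1))
        if random.random() > stay:           # rarely switch coins
            state = 1 - state
    return out

def longest_head_run(seq):
    """Length of the longest consecutive run of heads (True values)."""
    best = cur = 0
    for h in seq:
        cur = cur + 1 if h else 0
        best = max(best, cur)
    return best
```

Comparing `longest_head_run` over, say, 10,000 flips of each source typically shows runs several times longer for the biased-coins source, even though both show about half heads overall.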

3-biased-coins Model (diagram)

Three states 0, 1, 2 with transition probabilities a_ij between each ordered pair of states (a00, a01, a02, a10, a11, a12, a20, a21, a22). Emissions: P[H | 0] = p0, P[T | 0] = 1 - p0; P[H | 1] = p1, P[T | 1] = 1 - p1; P[H | 2] = p2, P[T | 2] = 1 - p2.

3-biased-coins Model

Another model could be the 3-biased-coins model. In the first state the coin is slightly biased to heads, in the second state the coin is slightly biased toward tails, and in the third state the coin has some other distribution, maybe fair. A Markov chain gives the transition probabilities among the three states. The simple statistics and higher-order statistics of the observations would be correspondingly influenced and would suggest the appropriateness of this choice of model.

Modeling Considerations 1

There is no reason to stop at 1-, 2-, or even 3-coin models. Part of the modeling process is to decide on the number of states N for the model. Without some a priori information, this choice is often difficult to make and may involve trial and error before settling on the appropriate model size.

Modeling Considerations 2

One goal is optimal estimation of the model parameters from the observations, that is, the probabilities of heads and tails in each state and the transition probabilities between states. The choice of what optimal means is a mathematical modeling choice:
- maximize the expected number of states that are correctly predicted (forward-backward algorithm);
- choose the sequence of states whose conditional probability, given the observations, is maximal (Viterbi algorithm).
After the choice of optimality criterion, parameter estimation is a statistical problem.
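The forward recursion underlying the forward-backward algorithm organizes exactly the kind of shared-product savings noted in the Variable Factory example: it computes P[observations] in O(N^2 T) operations instead of summing over all N^T state sequences. A sketch, reusing the variable-factory numbers from the earlier slides:

```python
# Forward-algorithm sketch with the variable-factory parameters.
init = [0.8, 0.2]                                # P[X1 = state]
trans = [[0.9, 0.1], [0.0, 1.0]]                 # trans[i][j] = P[j | i]
emit = {'a': [0.99, 0.96], 'u': [0.01, 0.04]}    # emit[obs][state]

def forward(obs):
    """alpha[s] = P[observations up to time t and state s at t]; returns P[obs]."""
    alpha = [init[s] * emit[obs[0]][s] for s in range(2)]
    for o in obs[1:]:
        # Each new alpha reuses the previous alpha, so partial products
        # shared by many state sequences are computed only once.
        alpha = [sum(alpha[r] * trans[r][s] for r in range(2)) * emit[o][s]
                 for s in range(2)]
    return sum(alpha)
```

For the observation sequence (a, u, a) this returns the same total probability as summing the eight-row brute-force table, about 0.01745.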

Modeling Considerations 3

Another consideration is the length of the observation sequence. With too short an observation sequence, we may not be able to reliably estimate the model parameters, or even the number of states. With insufficient data, some candidate models may not be statistically distinguishable.

Modeling Considerations 4

Finally, this emphasizes the word Hidden in the title of the subject. If the model is completely specified, then one might as well make a larger-state ordinary Markov process from it. The completely specified 2-biased-coins model could be a 4-state Markov process, the 3-coin model could be a 6-state Markov process, and the classic results for Markov processes would apply.

Modeling Considerations 5

Here we have only the observations or signals, not all the necessary information, not even the number of states. From that, we wish to best determine the underlying states and probabilities. The word best indicates choices of measures of optimality, so this is a modeling and statistical problem. That accounts for calling these hidden Markov models, not considering them from the viewpoint of Markov processes.

CpG Islands 1

In the human genome the dinucleotide CG (written CpG to distinguish it from the C-G base pair across two strands) is rarer than would be expected from the independent probabilities of C and G, for reasons of chemistry.

CpG Islands 2

For biologically important reasons, this chemistry is suppressed in short regions of the genome, such as around the promoters or start regions of many genes. In these regions we see many more CpG dinucleotides than elsewhere. Such regions are called CpG islands. They are typically a few hundred to a few thousand bases long.

CpG Islands 3

Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island or not? Second, given a long piece of sequence, how would we find the CpG islands in it, if there are any? Note the similarity to the coin toss example.
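The first question is commonly formalized with two Markov chains over {A, C, G, T}, one for islands and one for background, scored by a log-odds ratio. The sketch below is mine, and its transition probabilities are illustrative placeholders, not estimates from real genomic data; note only that the island model makes the C-to-G transition much more likely than the background model does:

```python
import math

bases = 'ACGT'
# Illustrative (made-up) transition probabilities P[next base | current base]
# for an island model and a background model; each row sums to 1. A real
# model would estimate these from annotated islands and background sequence.
island = {'A': [0.18, 0.27, 0.43, 0.12], 'C': [0.17, 0.37, 0.27, 0.19],
          'G': [0.16, 0.34, 0.38, 0.12], 'T': [0.08, 0.36, 0.38, 0.18]}
background = {'A': [0.30, 0.21, 0.28, 0.21], 'C': [0.32, 0.30, 0.08, 0.30],
              'G': [0.25, 0.24, 0.30, 0.21], 'T': [0.18, 0.24, 0.29, 0.29]}

def log_odds(seq):
    """Sum of per-transition log2 ratios; positive scores favor 'CpG island'."""
    score = 0.0
    for x, y in zip(seq, seq[1:]):
        score += math.log2(island[x][bases.index(y)] /
                           background[x][bases.index(y)])
    return score
```

Sliding this score along a long sequence in windows gives a crude answer to the second question; the hidden Markov model treatment handles both questions within a single model.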

Language Analysis and Translation

Language analysis and translation are classic applications of hidden Markov models. You obtain a large body of English text, such as the Brown Corpus, with 500 samples of English-language text totaling roughly one million words, compiled from works published in the United States in 1961. With knowledge of hidden Markov models, but no knowledge of English, you would like to determine some basic properties of this mysterious writing system. Can you partition the characters into sets so that the characters in each set are different in some statistically significant way?

Language Analysis 2 (Experiment)

Remove punctuation and numbers; convert all letters to lower case. This leaves 26 distinct letters and the space, a total of 27 symbols. Test the hypothesis that English text has an underlying Markov chain with two states. For each of these two hidden states, assume that the 27 symbols appear according to a fixed probability distribution (not the same in the two states). This sets up a model with N = 2 and M = 27, where the state transition probabilities and the observation probabilities from each state are unknown.

Language Analysis 3

            State 0   State 1
    a       0.13845   0.00075
    b       0.00000   0.02311
    c       0.00062   0.05614
    d       0.00000   0.06937
    e       0.21404   0.00000
    f       0.00000   0.03559
    g       0.00081   0.02724
    h       0.00066   0.07278
    i       0.12275   0.00000
    j       0.00000   0.00365
    k       0.00182   0.00703
    l       0.00049   0.07231
    m       0.00000   0.03889
    n       0.00000   0.11461
    o       0.13156   0.00000
    p       0.00040   0.03674
    q       0.00000   0.00153
    r       0.00000   0.10225
    s       0.00000   0.11042
    t       0.01102   0.14392
    u       0.04508   0.00000
    v       0.00000   0.01621
    w       0.00000   0.02303
    x       0.00000   0.00447
    y       0.00019   0.02587
    z       0.00000   0.00110
    space   0.33211   0.01298

Language Analysis 4

These are the results of a case study using the first 50,000 observations of letters from the Brown Corpus. Without making any assumption about the nature of the two states, the probabilities tell us that one hidden state contains the vowels while the other hidden state contains the consonants. Space is more like a vowel, and y is more like a consonant. The model deduces the statistically significant difference between vowels and consonants without knowing anything about the English language.

Speech Recognition 1

Hidden Markov models were developed in the 1960s and 1970s for satellite communication (Viterbi, 1967). They were later adapted for language analysis and translation and for speech recognition in the 1970s and 1980s (Bell Labs, IBM). Interest in HMMs for speech recognition seems to have peaked in the late 1980s. It is not clear whether current (2017) speech recognition relies on HMMs at any level or in any way.

Speech Recognition 2

1. Feature analysis: spectral or temporal analysis of the speech signal.
2. Unit matching: the speech signal parts are matched to words or phonemes with an HMM.
3. Lexical analysis: if the units are phonemes, combine them into recognized words.
4. Syntactic analysis: with a grammar, group words into proper sequences (for a single word like "yes" or "no", or digit sequences, this is minimal).
5. Semantic analysis: interpret the words for the task model.

Speech Recognition 3

Example, isolated word recognition: with the Viterbi algorithm, a vocabulary of V = 100 words, an N = 5 state model, and T = 40 observations, it takes about 10^5 computations (additions/multiplications) for a single word recognition.

Speech Recognition 4

It is hard to determine what current (2017) speech recognition is based on. Explanations are clouded with buzzwords and hype, with no theory. There does not seem to be a connection to HMMs.

Speech Recognition 5

A nested hierarchy: Artificial Intelligence ⊃ Machine Learning ⊃ Neural Networks ⊃ Deep Learning. In a nutshell, neural networks take a high-dimensional problem and turn it into a low-dimensional problem.