Hidden Markov Models

Hidden Markov Models

[Title slide: trellis diagram with hidden states 1…K at each position and observations x_1, x_2, x_3, …, x_N]

Example: The dishonest casino
A casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches between the fair and loaded die with probability 1/20 at each turn.
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2

Question # 1 Evaluation
GIVEN: a sequence of rolls by the casino player
14556461461461361366616646616366163661636165156151151461356344
QUESTION: How likely is this sequence, given our model of how the casino works? (Prob = 1.3 × 10^-35)
This is the EVALUATION problem in HMMs.

Question # 2 Decoding
GIVEN: a sequence of rolls by the casino player
14556461461461361366616646616366163661636165156151151461356344
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die? (FAIR / LOADED / FAIR)
This is the DECODING question in HMMs.

Question # 3 Learning
GIVEN: a sequence of rolls by the casino player
14556461461461361366616646616366163661636165156151151461356344
QUESTION: How loaded is the loaded die (e.g. Prob(6) = 64%)? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.

The dishonest casino model
[State diagram: FAIR and LOADED states, each with self-transition probability 0.95 and switching probability 0.05]
Emissions:
FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
A Hidden Markov Model: we never observe the state; we only observe output that depends (probabilistically) on the state.

A parse of a sequence
Observation sequence x = x_1 … x_N, state sequence π = π_1, …, π_N
[Trellis: states 1…K at each position, emitting x_1, x_2, x_3, …, x_N]

An HMM is memoryless
At each time step t, the only thing that affects future states is the current state π_t:
P(π_{t+1} = k | whatever happened so far)
= P(π_{t+1} = k | π_1, π_2, …, π_t, x_1, x_2, …, x_t)
= P(π_{t+1} = k | π_t)

An HMM is memoryless
At each time step t, the only thing that affects x_t is the current state π_t:
P(x_t = b | whatever happened so far)
= P(x_t = b | π_1, π_2, …, π_t, x_1, x_2, …, x_{t-1})
= P(x_t = b | π_t)

Definition of a hidden Markov model
Definition: a hidden Markov model (HMM) consists of
Alphabet = { b_1, b_2, …, b_M } = Val(x_t)
Set of states Q = { 1, …, K } = Val(π_t)
Transition probabilities between any two states: a_ij = transition prob from state i to state j, with a_i1 + … + a_iK = 1 for all states i = 1…K
Start probabilities a_0i, with a_01 + … + a_0K = 1
Emission probabilities within each state: e_k(b) = P(x_t = b | π_t = k), with e_k(b_1) + … + e_k(b_M) = 1 for all states k = 1…K
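
To make the definition concrete, here is a minimal sketch (not from the slides) of the dishonest-casino HMM in Python/NumPy; the variable names a0, A, and E are my own choices for the start, transition, and emission tables. Later sketches repeat this small setup so that each one runs on its own.

import numpy as np

# States Q = {Fair, Loaded}; alphabet = die faces {1, ..., 6}
states = ["Fair", "Loaded"]
a0 = np.array([0.5, 0.5])                      # start probabilities a_0i
A = np.array([[0.95, 0.05],                    # transition probabilities a_ij
              [0.05, 0.95]])
E = np.array([[1/6] * 6,                       # emission probabilities e_k(b)
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

# Each row of A and E is a distribution, and a0 sums to 1
assert np.allclose(A.sum(axis=1), 1) and np.allclose(E.sum(axis=1), 1)
assert np.isclose(a0.sum(), 1)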

Hidden Markov Models as Bayes Nets
[Graphical model: hidden states π_i, π_{i+1}, π_{i+2}, π_{i+3} (unobserved), each emitting an observed roll x_i, x_{i+1}, x_{i+2}, x_{i+3}]
Transition probabilities between any two states: a_jk = transition prob from state j to state k = P(π_{i+1} = k | π_i = j)
Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k)

How special a case are HMM BNs?
Template-based representation: all hidden nodes share the same CPTs, as do all output nodes
Limited connectivity: the junction tree has clique nodes of size <= 2
Special-purpose algorithms are simple and efficient

HMMs are good for:
Speech recognition
Gene sequence matching
Text processing: part-of-speech tagging, information extraction
Handwriting recognition

Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to prob a_{0 π_1}
2. Emit letter x_1 according to prob e_{π_1}(x_1)
3. Go to state π_2 according to prob a_{π_1 π_2}
4. … until emitting x_n
What did we call this earlier, for general Bayes Nets?
[Trellis: start state 0, then states 1…K at each position, emitting x_1, x_2, x_3, …, x_n]
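
The generation procedure above is ancestral (forward) sampling in the corresponding Bayes net. A small sketch, assuming the casino parameters from the earlier slides (the function name sample_sequence is mine):

import numpy as np

rng = np.random.default_rng(0)
a0 = np.array([0.5, 0.5])                      # start probs (Fair, Loaded)
A = np.array([[0.95, 0.05], [0.05, 0.95]])     # transition probs
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])   # emission probs over faces 1..6

def sample_sequence(n):
    """Generate (states, rolls) of length n by ancestral sampling."""
    states, rolls = [], []
    k = rng.choice(2, p=a0)                    # step 1: pick pi_1 ~ a_0k
    for _ in range(n):
        states.append(int(k))
        rolls.append(int(rng.choice(6, p=E[k])) + 1)   # step 2: emit x_i ~ e_{pi_i}
        k = rng.choice(2, p=A[k])              # step 3: move to the next state ~ a_{pi_i, .}
    return states, rolls

print(sample_sequence(10))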

A parse of a sequence
Given a sequence x = x_1 … x_N, a parse of x is a sequence of states π = π_1, …, π_N
[Trellis: states 1…K at each position, emitting x_1, x_2, x_3, …, x_N]

Likelihood of a parse
Given a sequence x = x_1 … x_N and a parse π = π_1, …, π_N, how likely is this scenario (given our HMM)?
P(x, π) = P(x_1, …, x_N, π_1, …, π_N)
= P(x_N | π_N) P(π_N | π_{N-1}) ··· P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= a_{0 π_1} a_{π_1 π_2} ··· a_{π_{N-1} π_N} × e_{π_1}(x_1) ··· e_{π_N}(x_N)

Example: the dishonest casino
Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4
Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a_0Fair = ½, a_0Loaded = ½)
½ × P(1|Fair) P(Fair|Fair) P(2|Fair) P(Fair|Fair) ··· P(4|Fair)
= ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 0.5 × 10^-8

Example: the dishonest casino
So, the likelihood that the die was fair throughout this run is about 0.5 × 10^-8.
What is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
½ × P(1|Loaded) P(Loaded|Loaded) ··· P(4|Loaded)
= ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 = 0.00000000015756235243 ≈ 0.16 × 10^-9
Therefore, it is considerably (about 30 times) more likely that all the rolls were made with the fair die than with the loaded die.

Example: the dishonest casino
Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood of π = F, F, …, F?
½ × (1/6)^10 × (0.95)^9 ≈ 0.5 × 10^-8, same as before
What is the likelihood of π = L, L, …, L?
½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 0.5 × 10^-6
So, it is about 100 times more likely that the die is loaded.
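
As a sanity check on the arithmetic above, here is a small sketch (mine, not from the slides) that multiplies out P(x, π) for the three parses just discussed; the expected outputs in the comments are approximate.

import numpy as np

a0 = np.array([0.5, 0.5])                      # (Fair, Loaded)
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])

def joint(x, pi):
    """P(x, pi) = a_{0 pi_1} e_{pi_1}(x_1) * prod_i a_{pi_{i-1} pi_i} e_{pi_i}(x_i)."""
    p = a0[pi[0]] * E[pi[0], x[0] - 1]
    for i in range(1, len(x)):
        p *= A[pi[i - 1], pi[i]] * E[pi[i], x[i] - 1]
    return p

x1 = [1, 2, 1, 5, 6, 2, 1, 5, 2, 4]
print(joint(x1, [0] * 10))    # all Fair:   ~5.2e-09
print(joint(x1, [1] * 10))    # all Loaded: ~1.6e-10
x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(joint(x2, [1] * 10))    # all Loaded: ~4.9e-07, about 100x the all-Fair value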

The three main questions for HMMs
1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M]
2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]
3. Learning: GIVEN an HMM M with unspecified transition/emission probs. and a sequence x, FIND parameters θ = (e_i(·), a_ij) that maximize P[x | θ]

Problem 1: Evaluation Compute the likelihood that a sequence is generated by the model

Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to prob a_{0 π_1}
2. Emit letter x_1 according to prob e_{π_1}(x_1)
3. Go to state π_2 according to prob a_{π_1 π_2}
4. … until emitting x_n
[Trellis: start state 0, then states 1…K at each position, emitting x_1, x_2, x_3, …, x_n]

Evaluation
We will develop algorithms that allow us to compute:
P(x): the probability of x given the model
P(x_i … x_j): the probability of a substring of x given the model
P(π_i = k | x): the posterior probability that the i-th state is k, given x

The Forward Algorithm
We want to calculate P(x), the probability of x given the HMM.
Sum over all possible ways of generating x:
P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
To avoid summing over exponentially many paths π, use variable elimination. In HMMs, VE has the same form at each position i.
Define: f_k(i) = P(x_1 … x_i, π_i = k)   (the forward probability)
"generate the first i observations and end up in state k"

The Forward Algorithm derivation
Define the forward probability:
f_k(i) = P(x_1 … x_i, π_i = k)
= Σ_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1, …, π_{i-1}, π_i = k) e_k(x_i)
= Σ_j Σ_{π_1 … π_{i-2}} P(x_1 … x_{i-1}, π_1, …, π_{i-2}, π_{i-1} = j) a_jk e_k(x_i)
= Σ_j P(x_1 … x_{i-1}, π_{i-1} = j) a_jk e_k(x_i)
= e_k(x_i) Σ_j f_j(i-1) a_jk

The Forward Algorithm
We can compute f_k(i) for all k, i, using dynamic programming!
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk
Termination: P(x) = Σ_k f_k(N)
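
A minimal NumPy sketch of the forward recursion, assuming the casino parameters above; unlike the slide, it folds the start probabilities a_0k into position 1 rather than using an imaginary position 0.

import numpy as np

a0 = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])

def forward(x):
    """Return (f, P(x)), where f[i, k] = P(x_1..x_i, pi_i = k)."""
    N, K = len(x), len(a0)
    f = np.zeros((N, K))
    f[0] = a0 * E[:, x[0] - 1]                   # initialization (start probs folded in)
    for i in range(1, N):
        f[i] = E[:, x[i] - 1] * (f[i - 1] @ A)   # f_k(i) = e_k(x_i) sum_j f_j(i-1) a_jk
    return f, f[-1].sum()                        # termination: P(x) = sum_k f_k(N)

f, px = forward([1, 2, 1, 5, 6, 2, 1, 5, 2, 4])
print(px)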

Backward Algorithm
The forward algorithm can compute P(x). But we would also like P(π_i = k | x), the probability distribution over the i-th state, given x.
Again, we will use variable elimination. We start by computing:
P(π_i = k, x) = P(x_1 … x_i, π_i = k, x_{i+1} … x_N)
= P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | x_1 … x_i, π_i = k)
= P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | π_i = k)
The first factor is the forward probability f_k(i); the second is the backward probability b_k(i).
Then, P(π_i = k | x) = P(π_i = k, x) / P(x)

The Backward Algorithm derivation
Define the backward probability:
b_k(i) = P(x_{i+1} … x_N | π_i = k)   ("starting from the i-th state π_i = k, generate the rest of x")
= Σ_{π_{i+1} … π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1}, …, π_N | π_i = k)
= Σ_j Σ_{π_{i+2} … π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1} = j, π_{i+2}, …, π_N | π_i = k)
= Σ_j e_j(x_{i+1}) a_kj Σ_{π_{i+2} … π_N} P(x_{i+2}, …, x_N, π_{i+2}, …, π_N | π_{i+1} = j)
= Σ_j e_j(x_{i+1}) a_kj b_j(i+1)

The Backward Algorithm
We can compute b_k(i) for all k, i, using dynamic programming.
Initialization: b_k(N) = 1, for all k
Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
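
A matching backward-pass sketch under the same assumptions; the final line checks that the backward termination formula reproduces the same P(x) as the forward pass.

import numpy as np

a0 = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])

def backward(x):
    """Return b, where b[i, k] = P(x_{i+1}..x_N | pi_i = k)."""
    N, K = len(x), len(a0)
    b = np.ones((N, K))                          # initialization: b_k(N) = 1
    for i in range(N - 2, -1, -1):
        # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
        b[i] = A @ (E[:, x[i + 1] - 1] * b[i + 1])
    return b

x = [1, 2, 1, 5, 6, 2, 1, 5, 2, 4]
b = backward(x)
print(np.sum(a0 * E[:, x[0] - 1] * b[0]))        # P(x) = sum_l a_0l e_l(x_1) b_l(1)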

Computational Complexity
What is the running time, and space required, for Forward and Backward?
Time: O(K² N)
Space: O(KN)
Useful implementation technique to avoid underflows: rescale every few positions by multiplying by a constant.
(This assumes we want to save all forward and backward messages; otherwise the space can be O(K).)
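
One way to implement the rescaling trick mentioned above is to normalize the forward message at each position and accumulate the logs of the scaling constants; a sketch under the same casino assumptions (rescaling every position rather than every few):

import numpy as np

a0 = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])

def forward_scaled(x):
    """Scaled forward pass: returns log P(x) without underflow."""
    f = a0 * E[:, x[0] - 1]
    log_px = 0.0
    for i in range(len(x)):
        if i > 0:
            f = E[:, x[i] - 1] * (f @ A)
        c = f.sum()                  # scaling constant for position i
        f = f / c                    # rescale the message so it sums to 1
        log_px += np.log(c)          # log P(x) is the sum of the log scaling constants
    return log_px

print(forward_scaled([1, 2, 1, 5, 6, 2, 1, 5, 2, 4] * 100))   # fine even for 1000 rolls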

Posterior Decoding
We can now calculate
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
Derivation:
P(π_i = k | x) = P(π_i = k, x) / P(x)
= P(x_1, …, x_i, π_i = k, x_{i+1}, …, x_N) / P(x)
= P(x_1, …, x_i, π_i = k) P(x_{i+1}, …, x_N | π_i = k) / P(x)
= f_k(i) b_k(i) / P(x)
What is the most likely state at position i of sequence x? Define π̂ by posterior decoding:
π̂_i = argmax_k P(π_i = k | x)
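
Putting the two passes together, a compact sketch of posterior decoding for the 6-heavy roll sequence from the earlier example (my code, same assumed parameters):

import numpy as np

a0 = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
x = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
N, K = len(x), 2

f = np.zeros((N, K))
b = np.ones((N, K))
f[0] = a0 * E[:, x[0] - 1]
for i in range(1, N):
    f[i] = E[:, x[i] - 1] * (f[i - 1] @ A)       # forward pass
for i in range(N - 2, -1, -1):
    b[i] = A @ (E[:, x[i + 1] - 1] * b[i + 1])   # backward pass
px = f[-1].sum()

gamma = f * b / px                               # gamma[i, k] = P(pi_i = k | x)
print(np.round(gamma[:, 1], 2))                  # posterior prob each roll came from the Loaded die
print(gamma.argmax(axis=1))                      # posterior decoding: pi_hat_i = argmax_k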

Posterior Decoding
[Figure: trellis with states 1…K (rows) and positions x_1, x_2, x_3, …, x_N (columns); cell (j, i) shows P(π_i = j | x)]

Problem 2: Decoding Find the most likely parse of a sequence

Most likely sequence vs. individually most likely states
Given a sequence x, the most likely sequence of states can be very different from the sequence of individually most likely states.
Example: the dishonest casino. Say x = 134131661636461634111341, with a boxed stretch of eleven rolls rich in 6s.
P(box: FFFFFFFFFFF) = (1/6)^11 × 0.95^12 = 2.76 × 10^-9 × 0.54 = 1.49 × 10^-9
P(box: LLLLLLLLLLL) = [(1/2)^6 × (1/10)^5] × 0.95^10 × 0.05^2 = 1.56 × 10^-7 × 1.5 × 10^-3 = 0.23 × 10^-9
Most likely path: π = FFF…F (it is too unlikely to transition F → L → F).
However, the marked (6-rich) letters are individually more likely to be L than the unmarked letters.

Decoding
GIVEN x = x_1 x_2 … x_N, find π = π_1, …, π_N to maximize P[x, π]:
π* = argmax_π P[x, π]
This maximizes a_{0 π_1} e_{π_1}(x_1) a_{π_1 π_2} ··· a_{π_{N-1} π_N} e_{π_N}(x_N)
Like the forward algorithm, but with max instead of sum:
V_k(i) = max_{π_1 … π_{i-1}} P[x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k]
= probability of the most likely sequence of states ending at state π_i = k
[Trellis: states 1…K at each position, emitting x_1, x_2, x_3, …, x_N]

Decoding main idea
Induction: given that, for all states k and for a fixed position i,
V_k(i) = max_{π_1 … π_{i-1}} P[x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k]
what is V_j(i+1)? From the definition,
V_j(i+1) = max_{π_1 … π_i} P[x_1 … x_i, π_1, …, π_i, x_{i+1}, π_{i+1} = j]
= max_{π_1 … π_i} P(x_{i+1}, π_{i+1} = j | x_1 … x_i, π_1, …, π_i) P[x_1 … x_i, π_1, …, π_i]
= max_{π_1 … π_i} P(x_{i+1}, π_{i+1} = j | π_i) P[x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = j | π_i = k) max_{π_1 … π_{i-1}} P[x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k] ]
= max_k [ P(x_{i+1} | π_{i+1} = j) P(π_{i+1} = j | π_i = k) V_k(i) ]
= e_j(x_{i+1}) max_k a_kj V_k(i)

The Viterbi Algorithm
Input: x = x_1 … x_N
Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0 (0 is the imaginary first position)
Iteration:
V_j(i) = e_j(x_i) max_k a_kj V_k(i-1)
Ptr_j(i) = argmax_k a_kj V_k(i-1)
Termination: P(x, π*) = max_k V_k(N)
Traceback:
π*_N = argmax_k V_k(N)
π*_{i-1} = Ptr_{π*_i}(i)
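
A small Viterbi sketch with pointer traceback, under the same casino assumptions; for long sequences one would add the log-space trick discussed two slides below.

import numpy as np

a0 = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])

def viterbi(x):
    """Return (pi_star, P(x, pi_star)) for the most likely state path."""
    N, K = len(x), len(a0)
    V = np.zeros((N, K))
    ptr = np.zeros((N, K), dtype=int)
    V[0] = a0 * E[:, x[0] - 1]
    for i in range(1, N):
        scores = V[i - 1][:, None] * A           # scores[k, j] = V_k(i-1) a_kj
        ptr[i] = scores.argmax(axis=0)           # best predecessor k for each state j
        V[i] = E[:, x[i] - 1] * scores.max(axis=0)
    path = [int(V[-1].argmax())]                 # pi*_N = argmax_k V_k(N)
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))       # pi*_{i-1} = Ptr_{pi*_i}(i)
    return path[::-1], V[-1].max()

print(viterbi([1, 2, 1, 5, 6, 2, 1, 5, 2, 4]))   # expect an all-Fair path (all 0s)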

The Viterbi Algorithm
[DP matrix: rows are states 1…K, columns are positions x_1 x_2 x_3 … x_N; entry (j, i) holds V_j(i)]
Time: O(K² N)
Space: O(KN)

Viterbi Algorithm: a practical detail
Underflows are a significant problem (as with forward and backward):
P[x_1, …, x_i, π_1, …, π_i] = a_{0 π_1} a_{π_1 π_2} ··· a_{π_{i-1} π_i} e_{π_1}(x_1) ··· e_{π_i}(x_i)
These numbers become extremely small, causing underflow.
Solution: take the logs of all values:
V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]

Example
Let x be a long sequence with a portion of ~1/6 6s, followed by a portion of ~1/2 6s:
x = 123456123456…12345 6626364656…1626364656
Then, it is not hard to show that the optimal parse is:
FFF…F LLL…L
Six-character stretches:
123456 parsed as F contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5; parsed as L contributes 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4 × 10^-5
162636 parsed as F contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5; parsed as L contributes 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0 × 10^-5

Viterbi, Forward, Backward

VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0
Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
Termination: P(x, π*) = max_k V_k(N)

FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N)

BACKWARD
Initialization: b_k(N) = 1, for all k
Iteration: b_l(i) = Σ_k e_k(x_{i+1}) a_lk b_k(i+1)
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)

Problem 3: Learning Find the parameters that maximize the likelihood of the observed sequence

Estimating HMM parameters
Easy if we know the sequence of hidden states:
Count the # of times each transition occurs
Count the # of times each observation occurs in each state
Given an HMM and an observed sequence, we can compute the distribution over paths, and therefore the expected counts.
A chicken-and-egg problem.

Solution: use the EM algorithm
Guess initial HMM parameters
E step: compute the distribution over paths
M step: compute the maximum-likelihood parameters
But how do we do this efficiently?

The forward-backward algorithm
Also known as the Baum-Welch algorithm.
Compute the probability of each state at each position, using the forward and backward probabilities: (expected) observation counts.
Compute the probability of each pair of states at each pair of consecutive positions i and i+1, using forward(i) and backward(i+1): (expected) transition counts.
Count(j → k) ∝ Σ_i f_j(i) a_jk e_k(x_{i+1}) b_k(i+1)
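
A sketch of one Baum-Welch iteration (E-step expected counts, then an M-step renormalization) for a single observed roll sequence; this is my own illustration under the casino parameterization, not code from the slides, and it keeps the start probabilities fixed for simplicity.

import numpy as np

a0 = np.array([0.5, 0.5])                        # held fixed in this sketch
A = np.array([[0.95, 0.05], [0.05, 0.95]])       # current guesses
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
x = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
N, K = len(x), 2

# Forward and backward messages
f = np.zeros((N, K))
b = np.ones((N, K))
f[0] = a0 * E[:, x[0] - 1]
for i in range(1, N):
    f[i] = E[:, x[i] - 1] * (f[i - 1] @ A)
for i in range(N - 2, -1, -1):
    b[i] = A @ (E[:, x[i + 1] - 1] * b[i + 1])
px = f[-1].sum()

# E-step: expected counts
gamma = f * b / px                               # P(pi_i = k | x): expected state occupancy
trans = np.zeros((K, K))                         # expected Count(j -> k)
for i in range(N - 1):
    trans += np.outer(f[i], E[:, x[i + 1] - 1] * b[i + 1]) * A / px
emit = np.zeros((K, 6))                          # expected observation counts per state
for i in range(N):
    emit[:, x[i] - 1] += gamma[i]

# M-step: renormalize the expected counts into new parameters
A_new = trans / trans.sum(axis=1, keepdims=True)
E_new = emit / emit.sum(axis=1, keepdims=True)
print(np.round(A_new, 3))
print(np.round(E_new, 3))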

Application: HMMs for Information Extraction (IE)
IE: text → machine-understandable data
"Paris, the capital of France, …" → (Paris, France) ∈ CapitalOf, p = 0.85
Applied to the Web: better search engines, the Semantic Web, a step toward human-level AI

IE Automatically?
It is intractable to get human labels for every concept expressed on the Web.
Idea: extract from semantically tractable sentences
"Edison invented the light bulb" → (Edison, light bulb) ∈ Invented, via the pattern "x V y => (x, y) ∈ V"
"Bloomberg, mayor of New York City" → (Bloomberg, New York City) ∈ Mayor, via the pattern "x, C of y => (x, y) ∈ C"

But…
Extraction patterns make errors: "Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and …"
Empirical fact: extractions you see over and over tend to be correct.
The problem is the long tail.

Challenge: the long tail
[Plot: number of times an extraction appears in a pattern (y-axis, 0 to ~500) vs. frequency rank of the extraction (x-axis, 0 to 100000). High-frequency extractions, e.g. (Bloomberg, New York City), tend to be correct; the long tail, e.g. (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland), is a mixture of correct and incorrect.]

Mayor McCheese

Assessing Sparse Extractions
Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by their fit to the model

The Distributional Hypothesis
Terms in the same class tend to appear in similar contexts.

Context             Hits with Chicago   Hits with Twisp
cities including    4,000               1
and other cities    37,900              0
hotels              2,000,000           1,670
mayor of            657,000             8

HMM Language Models
Precomputed: scalable
Handle sparsity

Baseline: context vectors
Corpus snippets mentioning Chicago, e.g. "cities such as Chicago, Boston, …", "But Chicago isn't the best …", "… Los Angeles and Chicago." Each distinct context becomes a dimension of a context vector.
Compute dot products between the context vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Hidden Markov Model (HMM)
[Graphical model: hidden states t_i, t_{i+1}, t_{i+2}, t_{i+3} (unobserved), each emitting a word w_i, w_{i+1}, w_{i+2}, w_{i+3} (observed), e.g. "cities such as Seattle"]
Hidden states t_i ∈ {1, …, N} (N fairly small)
Train on unlabeled data
P(t_i | w_i = w) is an N-dimensional distributional summary of w
Compare extractions using KL divergence
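
To illustrate the comparison step, here is a tiny sketch of a KL-divergence test between distributional summaries P(t | w); the summaries used here are made-up numbers for a hypothetical N = 4 state model, purely for illustration.

import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two distributional summaries P(t | w)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical summaries over N = 4 hidden states (illustrative values only)
p_chicago = [0.70, 0.20, 0.05, 0.05]
p_pickerington = [0.60, 0.25, 0.10, 0.05]
p_mccheese = [0.05, 0.10, 0.15, 0.70]

print(kl(p_pickerington, p_chicago))   # small divergence: plausibly the same type
print(kl(p_mccheese, p_chicago))       # large divergence: probably not a real city/mayor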

HMM Compresses Context Vectors
Context vector for Twisp: <… 0 0 0 1 …>
Distributional summary P(t | Twisp), t = 1…N: e.g. 0.14, 0.01, …, 0.06
Compact (efficient: 10-50x less data retrieved)
Dense (accurate: 3-46% error reduction)

Example
Is Pickerington of the same type as Chicago? ("Chicago, Illinois" vs. "Pickerington, Ohio")
Context vectors (e.g. Chicago: <…, 91, 0, …>; Pickerington: <…, 0, 1, …>) say no: the dot product is 0!

Example
The HMM generalizes across "Chicago, Illinois" and "Pickerington, Ohio".

Experimental Results
Task: ranking sparse TextRunner extractions.
Metric: area under the precision-recall curve.

           Headquartered   Merged   Average
Frequency  0.710           0.784    0.713
PL         0.651           0.851    0.785
LM         0.810           0.908    0.851

Language models reduce the missing area by 39% over the nearest competitor.

Example word distributions (1 of 2)
P(word | state 3):
unk0 0.044415, new 0.035757, more 0.013496, unk1 0.0119841, few 0.01144, small 0.00858043, good 0.0080634, large 0.0073657, great 0.0078838, important 0.00710116, other 0.0067399, major 0.006844, little 0.00545736
P(word | state 4):
, 0.49014   . 0.433618   ; 0.0079789   -- 0.00365591   - 0.0030614   ! 0.003575   : 0.001859

Example word distributions (2 of 2)
P(word | state 1):
unk1 0.11654, United+States 0.01609, world 0.0091, U.S 0.007950, University 0.00743, Internet 0.00715, time 0.005167, end 0.00498, unk0 0.004818, war 0.00460, country 0.003774, way 0.00358, city 0.00349, US 0.00369, Sun 0.0098, Earth 0.0068
P(word | state 3):
the 0.863846, a 0.0131049, an 0.00960474, its 0.008541, our 0.00650477, this 0.00366675, unk1 0.00313899, your 0.0065876

Correlation between LM and IE accuracy
[Plots of correlation coefficients]
As LM error decreases, IE accuracy increases.

What this suggests
Better HMM language models => better information extraction
Better language models => … => human-level AI?
Consider: a good enough LM could do question answering, pass the Turing Test, etc.
There are lots of paths to human-level AI, but LMs have:
Well-defined progress
Ridiculous amounts of training data

Also: active learning
Today, people train language models by taking whatever text comes along.
Larger corpora => better language models, but corpus size is limited by the # of humans typing.
What if we asked for the most informative sentences? (active learning)

What have we learned?
In HMMs, general Bayes Net algorithms take a simple and efficient form:
1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M]. Solved by the Forward and Backward Algorithms (variable elimination).
2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]. Solved by the Viterbi Algorithm (MAP query).
3. Learning: GIVEN a sequence x, FIND HMM parameters θ = (e_i(·), a_ij) that maximize P[x | θ]. Solved by the Baum-Welch / Forward-Backward algorithm (EM).

What have we learned? Unsupervised Learning of HMMs can power more scalable, accurate unsupervised IE Today, unsupervised learning of neural network language models is much more popular Deep networks to be discussed in future weeks