Web Search and Text Mining. Lecture 23: Hidden Markov Models

Size: px

Start display at page:

Download "Web Search and Text Mining. Lecture 23: Hidden Markov Models"

Ethan Williamson
5 years ago
Views:

1 Web Search and Text Mining Lecture 23: Hidden Markov Models

2 Markov Models Observable states: 1, 2,..., N (discrete states) Observed sequence: X 1, X 2,..., X T (discrete time) Markov assumption, P (X t = i X t 1 = j, X t 2 = k,...) = P (X t = i X t 1 = j) dependence on preceding time not any before that. Stationarity, probability law is time-invariant (homogeneous). P (X t = i X t 1 = j) = P (X t+l = i X t+l 1 = j)

3 Probability law can be characterized by a state transition matrix, A = [a ij ] N i,j=1 with a ij = P (X t = j X t 1 = i) Constraints on a ij, a ij 0 N j=1 a ij = 1

4 Example Modeling weather with three states: 1. Rainy (R) 2. Cloudy (C) 3. Sunny (S) State transition matrix,

5 Weather predictor example of a Markov model State Transition Diagram State 1: State 2: State 3: rain cloud sun State-transition probabilities, Sequence of observations, X 1, X 2,... A = { a ij } = (12)

6 Compute the probability of observing the sequence, given that today is S. SSRRSCS Product rule P (AB) = P (A B)P (A) The Markov chain rule P (X 1, X 2,..., X T ) = P (X T X 1, X 2,..., X T 1 )P (X 1, X 2,..., X T 1 ) = P (X T X T 1 )P (X 1, X 2,..., X T 1 ) = P (X T X T 1 )P (X T1 X T 2 ) P (X 2 X 1 )P (X 1 )

7 The observation sequence O = [S, S, R, R, S, C, S] Using the chain rule we have, P (O model) = P (S)P (S S) 2 P (R S)P (R R)P (S R)P (C S)P (S C) = π 3 a 2 33 a 31a 11 a 13 a 32 a 23 = We used the prior probability (state distribution at time 1) π i = P (X 1 = i)

8 A Warm-Up Example Suppose we want to compute P (X T = i), (brute-force) P (X T = i) = sum over all sequences ending with state i Complexity, O(T N T ). But notice P (O) O ends with i P (X t+1 = i) = N i=j P (X t+1 = i, X t = j) = N i=j P (X t = j)p (X t+1 = i X t = j) = N i=j P (X t = j)a ji Complexity, O(T N 2 ). (same idea applied to many other calculations)

9 )? Clever answer ime t Computation is simple. Just fill in this table in this order: ition art state ise t 0 1 : p t (1) 0 p t (2) 1 p t (N) 0 t final ) $ p t (i) = P (X t = i) (Andrew Moore) N

10 Hidden Markov Models States are not observable Observations are probabilistic functions of state State transitions are still probabilistic

11 Occasionally Dishonest Casinos Use a fair die most of the time but occasionally switch to a loaded die. Loaded die: 6/.5, 1-5/.1. Transition matrix (between fair die/loaded die) [ 0.95 ] The observations is a sequence of rolls (1-6), and which die is used is hidden.

12 Hidden Markov Models Hidden states: 1, 2,..., N (discrete states) Observable symbols: 1, 2,..., M (discrete observations) Hidden state sequence: X 1, X 2,..., X T Observed sequence: Y 1, Y 2,..., Y T State transition matrix A = [a ij ] N i,j=1, a ij = P (X t = j X t 1 = i)

13 Observation probability distribution (time-invariant) B j (k) = P (Y t = k X t = j), k = 1,..., M Initial state distribution π i = P (X 1 = i), i = 1,..., N

14 Joint Probability P (X, Y ) = P (X 1,..., X T, Y 1,..., Y T ) = P (X)P (Y X) = = P (X T X T 1 ) P (X 2 X 1 )P (X 1 )P (Y 1 X 1 )... P (Y T X T )

15 Three Fundamental Problems in HMMs 1. Evaluate P (O) for a given observation sequence O 2. Find the most probable state sequence for a given observation sequence O 3. Estimate the HMM model parameters given observation sequence(s).

16 Problem I Let the observation be O = y 1, y 2,..., y T. Again using brute force, P (O) = Summing over N T terms. x 1,...,x T P (x 1,..., x T, y 1, y 2,..., y T ) Forward-Backward Procedure Consider forward variable, α t (i) = P (y 1, y 2,..., y t, X t = i), t = 1,..., T, i = 1,..., N

17 1. Initialization α 1 (i) = π i B i (y 1 ), i = 1,..., N 2. Induction α t+1 (j) = N i=1 α t (i)a ij B j (y t+1 ) 3. Termination, P (O) = N i=1 α T (i)

18 Similarly, define the backward variables, β t (i) = P (y t+1,..., y T X t = i) 1. Initialization β T (i) = 1 2. Induction β t (i) = N j=1 a ij B j (y t+1 )β t+1 (j)

19 Most Probable State Path: Viterbi Algorithm Given an observation sequence, choose the most likely state sequence. {x 1,..., x T } = argmax P ({x 1,..., x T }, {y 1,..., y T }) One solution, optimal individual state, Then choose γ t (i) = P (X t = i O) = α t(i)β t (i) Ni=1 α t (i)β t (i) x t = argmax 1 i N γ t (i) Why not a good idea?

20 Solution by DP. Define v t (i) = max P (x x 1,...,x 1,..., x t 1, x t = i, y 1,..., y t ) t 1 Highest probability path ending at state i at time t. Notice that Then P (x 1,..., x t+1, y 1,..., y t+1 ) = P (x 1,..., x t, y 1,..., y t )P (x t+1 x t )P (y t+1 x t+1 ) v t+1 (j) = max x 1,...,x t P (x 1,..., x t, x t+1 = j, y 1,..., y t+1 ) = max i (v t (i)a ij )B j (y t+1 )

21 Viterbi Algorithm Initialization v 1 (i) = π i B i (y 1 ), i = 1,..., N ψ 1 (i) = 0, i = 1,..., N Recursion Termination P = max i v T (i), v t+1 (j) = max(v t (i)a ij )B j (y t+1 ) j ψ t (j) = argmax i (v t (i)a ij ) x T = argmax i v T (i) x t = ψ t+1(x t+1 )

22 Word Segmentation Icanreadwordswithoutspaces I can read words without spaces For position i in the string, 1) best[i] the probability of the most probable segmentation from start to i. 2) words[i], the word ending at i with the best probability

23 Viterbi Word Segmentation For a string text with length n + 1 for i = 0 to n do for j = 0 to i-1 do word = text[j:i] w = length(word) if P(word)*best[i-w] > best[i] then best[i] = P(word)*best[i-w] words[i] = word Last word is words[n], and let k = length(words[n]), and next to the last is words[n-k], etc.

24 Bigram PoS tagger Outline Recall: HMM PoS tagging Viterbi decoding Trigram PoS tagging Summary ˆt N 1 = arg max P(t N t1 n 1 w1 N ) N P(w i t i )P(t i t i 1 ) i=1 P(t i t i 1 ) P(t i+1 t i ) t i 1 t i+1 t i P(w i 1 t i 1 ) P(w i t i ) P(w i+1 t i+1 ) w i 1 w i w i+1 Steve Renals s.renals@ed.ac.uk Part-of-speech tagging (3) (S. Renals)

25 Outline Recall: HMM PoS tagging Viterbi decoding Trigram PoS tagging Summary Transition and observation probabilities Transition probabilities: P(t i t i 1 ) VB TO NN PPSS start VB TO NN PPSS Observation likelihoods: P(w i t i ) I want to race VB TO NN PPSS Steve Renals s.renals@ed.ac.uk Part-of-speech tagging (3)

26 Named Entity Extraction Mr. Jones eats. Mr. <ENAMEX TYPE=PERSON>Jones</ENAMEX> eats. Probability computation, Pr(NOT-A-NAME START-OF-SENTENCE, +end+) * Pr(Mr. NOT-A-NAME, START-OF-SENTENCE) * Pr(+end+ Mr., NOT-A-NAME) * Pr(PERSON NOT-A-NAME, Mr.) * Pr(Jones PERSON, NOT-A-NAME) * Pr(+end+ Jones, PERSON) * Pr(NOT-A-NAME PERSON, Jones) * Pr(eats NOT-A-NAME, PERSON) * Pr(. eats, NOT-A-NAME) * Pr(+end+., NOT-A-NAME) * Pr(END-OF-SENTENCE NOT-A-NAME,.)

27 Algorithms for NER Decision trees Hidden Markov models Maximum entropy models SVM Boosting Conditional random fields

28 Generative Model Generation of words and name-classes: 1. Select a name-class NC, conditioning on the previous nameclass and the previous word. 2. Generate the first word inside that name-class, conditioning on the current and previous name-classes. 3. Generate all subsequent words inside the current name-class, where each subsequent word is conditioned on its immediate predecessor.

29 Top-level Model 1. The probability for generating the first word of a name-class, P (NC NC 1, w 1 )P ([w, f] first NC, NC 1 ) 2. Generating all but the first word in a name-class, P ([w, f] [w, f] 1, NC) 3. Distinguished +end+ word, the final word of its name-class, P ([ + end+, other] [w, f] fina, NC)

30 Dealing with low-frequency words Split words into two sets: 1) frequent words and rare words; Map rare words to a small, finite set

31 features in Table 3.1.! semantic classes can be defined by lists of words having a semantic feature.! special character sets such as the ones used for transliterating names in Chinese or in Japanese can be identified. Throughout most of the model, we consider words to be ordered pairs (or twoelement vectors), composed of word and word-feature, denoted w, f. The word feature is a simple, deterministic computation performed on each word as it is added to or looked up in the vocabulary. It produces one of the fourteen values in Table 3.1. Table 3.1 Word features, examples and intuition behind them. 2 Word Feature Example Text Intuition twodigitnum 90 Two-digit year fourdigitnum 1990 Four digit year containsdigitandalpha A Product code containsdigitanddash Date containsdigitandslash 11/9/89 Date containsdigitandcomma 23, Monetary amount containsdigitandperiod 1.00 Monetary amount, percentage othernum Other number allcaps BBN Organization capperiod M. Person name initial firstword first word of No useful capitalization sentence information initcap Sally Capitalized word lowercase can Uncapitalized word other, Punctuation marks, all other words These values are computed in the order listed, so that in the case of non-disjoint feature-classes, such as containsdigitandalpha and containsdigitanddash, the former will take precedence. The first eight features arise from the need to distinguish and annotate monetary amounts, percentages, times and dates. The rest of the features distinguish types of capitalization and all other words (such as punctuation marks, which are separate tokens). In particular, the firstword feature arises from the fact that if a word is capitalized and is the first word of the sentence, we have no good information as to why it is

32 Dealing with Low-Frequency Words: An Example Profi ts/na soared/na at/na Boeing/SC Co./CC,/NA easily/na topping/na forecasts/na on/na Wall/SL Street/CL,/NA as/na their/na CEO/NA Alan/SP Mulally/CP announced/na fi rst/na quarter/na results/na./na firstword/na soared/na at/na initcap/sc Co./CC,/NA easily/na lowercase/na forecasts/na on/na initcap/sl Street/CL,/NA as/na their/na CEO/NA Alan/SP initcap/cp announced/na fi rst/na quarter/na results/na./na NA SC CC SL CL... = No entity = Start Company = Continue Company = Start Location = Continue Location (Regina Barzilay) 20

Natural Language Processing Winter 2013

Natural Language Processing Winter 2013 Hidden Markov Models Luke Zettlemoyer - University of Washington [Many slides from Dan Klein and Michael Collins] Overview Hidden Markov Models Learning Supervised: