Hidden Markov Models (HMMs) for Information Extraction
Daniel S. Weld, CSE 454


Extraction with Finite State Machines, e.g. Hidden Markov Models (HMMs): the standard sequence model in genomics, speech, NLP, ...

What's an HMM?
- Set of states
- Initial probabilities
- Transition probabilities
- Set of potential observations
- Emission probabilities
An HMM is a finite state machine: a hidden state sequence generates an observation sequence o_1 o_2 o_3 o_4 o_5 ...
Adapted from Cohen & McCallum

Graphical Model
- Hidden states ... y_{t-2}, y_{t-1}, y_t ...  The random variable y_t takes values from {s_1, s_2, s_3, s_4}.
- Observations ... x_{t-2}, x_{t-1}, x_t ...  The random variable x_t takes values from {o_1, o_2, o_3, o_4, o_5, ...}.
Adapted from Cohen & McCallum
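Since the slides stress that "an HMM generates the observation sequence", a short generative sketch may help. This is an illustrative sketch only; the function name sample_hmm and the start_p / trans_p / emit_p array layout are assumptions, not part of the original slides.

```python
import numpy as np

def sample_hmm(start_p, trans_p, emit_p, length, rng=np.random.default_rng(0)):
    """Sample a hidden state sequence and the observation sequence it generates.

    start_p : (S,)    initial state probabilities
    trans_p : (S, S)  trans_p[i, j] = P(y_t = s_j | y_{t-1} = s_i)
    emit_p  : (S, O)  emit_p[i, k]  = P(x_t = o_k | y_t = s_i)
    """
    states, obs = [], []
    y = rng.choice(len(start_p), p=start_p)                   # draw the initial hidden state
    for _ in range(length):
        states.append(y)
        obs.append(rng.choice(emit_p.shape[1], p=emit_p[y]))  # emit an observation from state y
        y = rng.choice(trans_p.shape[1], p=trans_p[y])        # transition to the next hidden state
    return states, obs
```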

HMM Graphical Model: Parameters
The graphical model above needs three sets of parameters:
- Start state probabilities: P(y_1 = s_k)
- Transition probabilities: P(y_t = s_i | y_{t-1} = s_k)
- Observation (emission) probabilities: P(x_t = o_j | y_t = s_k), usually multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations.

Example: The Dishonest Casino
A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The dealer switches back and forth between the fair and the loaded die about once every 20 turns.
Game:
1. You bet $1.
2. You roll (always with a fair die).
3. The casino player rolls (maybe with the fair die, maybe with the loaded die).
4. Highest number wins $2.
Slides from Serafim Batzoglou

The Dishonest Casino HMM
Two states, FAIR and LOADED. Each state stays put with probability 0.95 and switches to the other state with probability 0.05. Emissions are the die probabilities above: P(1|F) = ... = P(6|F) = 1/6, and P(1|L) = ... = P(5|L) = 1/10, P(6|L) = 1/2.

Question # 1: Evaluation
GIVEN a sequence of rolls by the casino player, e.g.
  45564646463636666646663666366636
QUESTION: How likely is this sequence, given our model of how the casino works?
This is the EVALUATION problem in HMMs.

Question # 2: Decoding
GIVEN a sequence of rolls by the casino player, e.g.
  4556464646363666664666366636663
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs.

Question # 3: Learning
GIVEN a sequence of rolls by the casino player, e.g.
  4556464646363666664666366636663665
QUESTION: How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
Slides from Serafim Batzoglou
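As a concrete encoding (illustrative, not from the slides), the dishonest-casino HMM can be written as parameter arrays in the layout assumed by the sample_hmm sketch above. The 50/50 start distribution is an assumption, since the slides do not give one.

```python
import numpy as np

# States: 0 = FAIR, 1 = LOADED.  Observations: die faces 1..6, stored as indices 0..5.
start_p = np.array([0.5, 0.5])              # assumed: either die equally likely at the start
trans_p = np.array([[0.95, 0.05],           # FAIR   -> FAIR, LOADED
                    [0.05, 0.95]])          # LOADED -> FAIR, LOADED
emit_p = np.array([[1/6] * 6,               # fair die: uniform over 1..6
                   [1/10] * 5 + [1/2]])     # loaded die: 6 comes up half the time

_, rolls = sample_hmm(start_p, trans_p, emit_p, length=30)
print("".join(str(r + 1) for r in rolls))   # a synthetic roll sequence like the ones above
```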

What's this have to do with Info Extraction?
The same machinery carries over from dice to text: replace the FAIR and LOADED states (which emit die faces) with a TEXT state and a NAME state that emit words, again with 0.95 self-transitions and 0.05 switching probability:
- TEXT: P(the | T) = 0.003, P(from | T) = 0.00..., ...
- NAME: P(Dan | N) = 0.005, P(Sue | N) = 0.003, ...

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states such as person name, location name, and background, find the most likely state sequence (Viterbi):
  s* = argmax_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name, e.g. Person name: Pedro Domingos.
Slide by Cohen & McCallum

IE with Hidden Markov Models (continued)
For sparse extraction tasks:
- Use a separate HMM for each type of target.
- Each HMM should model the entire document, consist of target and non-target states, and need not be fully connected.
Or use a combined HMM. Example: research paper headers.
Slide by Okan Basegmez

HMM Example: Nymble
Task: named entity extraction [Bikel et al. 1998], [BBN IdentiFinder]
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Train on ~500k words of newswire text.
Results (F1 by case and language): mixed-case English 93%, upper-case English 91%, mixed-case Spanish 90%.
Slide adapted from Cohen & McCallum

Finite State Model vs. Path
The finite state model (Person, Org, five other name classes, Other, start-of-sentence, end-of-sentence) can be unrolled into a path: a state sequence y_1 y_2 y_3 y_4 y_5 y_6 generating an observation sequence x_1 x_2 x_3 x_4 x_5 x_6.

Question # 1: Evaluation
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N and a trained HMM θ (its initial, transition, and emission probabilities)
QUESTION: How likely is this sequence, given our HMM? That is, compute P(x | θ).
Why do we care? We need it for learning, to choose among competing models!

A parse of a sequence
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states y = y_1, ..., y_N (e.g. person, other, location).
Slide by Serafim Batzoglou

Question # 2: Decoding
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N and a trained HMM θ
QUESTION: How do we choose the parse (state sequence) y_1 y_2 y_3 ... y_N which best explains x_1 x_2 x_3 ... x_N?
There are several reasonable optimality criteria: the single optimal sequence, average statistics for individual states, ...

Question # 3: Learning
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N
QUESTION: How do we learn the model parameters θ which maximize P(x | θ)?

Three Questions
- Evaluation: Forward algorithm
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Naive Solution to #1: Evaluation
Given observations x = x_1 ... x_N and HMM θ, what is P(x)?
- Enumerate every possible state sequence y = y_1 ... y_N.
- Compute the probability of x together with that particular y (a product of transition and emission probabilities), and the probability of that particular y: O(T) multiplications per sequence.
- Sum over all possible state sequences.
But even for a small HMM with T = 10 time steps and N = 10 states there are N^T = 10^10, i.e. 10 billion, state sequences!

Many of the calculations are repeated: use dynamic programming. Cache and reuse the inner sums ("forward variables").

Solution to #1: Evaluation via the Forward Variable α_t(i)
Use dynamic programming. Define the forward probability α_t(i): the probability that at time t the state is S_i and the partial observation sequence x_1 ... x_t has been emitted.

Base case (t = 1):
  α_1(i) = P(y_1 = S_i) P(x_1 = o_1 | y_1 = S_i)

Inductive case:
  α_t(i) = [ Σ_j α_{t-1}(j) P(y_t = S_i | y_{t-1} = S_j) ] P(x_t | y_t = S_i)
i.e. sum the forward variables of all states at time t-1, each weighted by its transition probability into S_i, then multiply by the probability of emitting x_t from S_i.
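A minimal sketch of this forward recursion in code (a hypothetical forward helper, using the same assumed array layout as the earlier sketches):

```python
import numpy as np

def forward(start_p, trans_p, emit_p, obs):
    """Forward algorithm: return alpha (T x S) and P(obs | model).

    alpha[t, i] = P(x_1..x_t, y_t = S_i), the forward variable from the slides.
    """
    T, S = len(obs), len(start_p)
    alpha = np.zeros((T, S))
    alpha[0] = start_p * emit_p[:, obs[0]]                       # base case: alpha_1(i)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans_p) * emit_p[:, obs[t]]  # inductive case
    return alpha, alpha[-1].sum()                                # termination: sum over final states
```

The returned total is exactly the evaluation quantity P(x | θ).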

The Forward Algorithm
INITIALIZATION:  α_1(i) = P(y_1 = S_i) P(x_1 | y_1 = S_i)
INDUCTION:       α_{t+1}(i) = [ Σ_j α_t(j) P(y_{t+1} = S_i | y_t = S_j) ] P(x_{t+1} | y_{t+1} = S_i)
TERMINATION:     P(x | θ) = Σ_i α_N(i)
Time: O(S^2 N), Space: O(SN), where S = number of states and N = length of the sequence.

The Backward Algorithm
Define β_t(i) = the probability of emitting the rest of the sequence, x_{t+1} ... x_N, given that the state at time t is S_i.
INITIALIZATION:  β_N(i) = 1
INDUCTION:       β_t(i) = Σ_j P(y_{t+1} = S_j | y_t = S_i) P(x_{t+1} | y_{t+1} = S_j) β_{t+1}(j)
TERMINATION:     P(x | θ) = Σ_i P(y_1 = S_i) P(x_1 | y_1 = S_i) β_1(i)
Time: O(S^2 N), Space: O(SN)

Three Questions
- Evaluation: Forward algorithm (also the Backward algorithm)
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

#2: The Decoding Problem
Given x = x_1 ... x_N and HMM θ, what is the best parse y_1 ... y_N? "Best" has several possible meanings:
1. The states which are individually most likely: the most likely state y*_t at time t is y*_t = argmax_i P(y_t = S_i | x), computable from the forward and backward variables.
2. The single best state sequence: we want the sequence y_1 ... y_N such that P(x, y) is maximized, y* = argmax_y P(x, y).
Again, we can use dynamic programming!
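For completeness, a matching sketch of the backward recursion (same assumed array conventions; β defined as above):

```python
import numpy as np

def backward(start_p, trans_p, emit_p, obs):
    """Backward algorithm: return beta (T x S) and P(obs | model).

    beta[t, i] = P(x_{t+1}..x_T | y_t = S_i).
    """
    T, S = len(obs), len(start_p)
    beta = np.ones((T, S))                                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = trans_p @ (emit_p[:, obs[t + 1]] * beta[t + 1])   # induction
    return beta, (start_p * emit_p[:, obs[0]] * beta[0]).sum()      # termination
```

As a sanity check, forward(...)[1] and backward(...)[1] should return the same value of P(x | θ).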

The Viterbi Variable δ_t(i)
Like α_t(i) (the probability that the state at time t is S_i and the partial observation sequence x_1 ... x_t has been seen), but with a max in place of the sum.
Define δ_t(i) = the probability of the most likely state sequence ending in state S_i, given the observations x_1, ..., x_t:
  δ_t(i) = max_{y_1, ..., y_{t-1}} P(y_1, ..., y_{t-1}, y_t = S_i, x_1, ..., x_t | Θ)

Base case (t = 1):
  δ_1(i) = P(y_1 = S_i) P(x_1 = o_1 | y_1 = S_i)

Inductive step (take the max over predecessor states):
  δ_t(i) = [ max_j δ_{t-1}(j) P(y_t = S_i | y_{t-1} = S_j) ] P(x_t | y_t = S_i)

The Viterbi Algorithm
DEFINE δ_t(i) as above. INITIALIZATION: the base case. INDUCTION: the step above, also recording which predecessor j achieved the max. TERMINATION: take the best final state, then backtrack through the recorded predecessors to get the state sequence y*.
Picture it as filling a table with one row per state and one column per observation x_1 x_2 ... x_T: cell δ_j(i) = max_{i'} δ_{j-1}(i') * P_trans * P_obs.
Remember: δ_t(i) = the probability of the most likely state sequence ending with y_t = S_i.
Slides from Serafim Batzoglou
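A sketch of the Viterbi recursion with backpointers (a hypothetical viterbi helper, same assumed parameter arrays as before):

```python
import numpy as np

def viterbi(start_p, trans_p, emit_p, obs):
    """Return the most likely state sequence y* and its probability P(x, y*)."""
    T, S = len(obs), len(start_p)
    delta = np.zeros((T, S))                 # delta[t, i] = best score of a path ending in state i at time t
    back = np.zeros((T, S), dtype=int)       # backpointers: which predecessor achieved the max
    delta[0] = start_p * emit_p[:, obs[0]]                      # base case
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans_p                # scores[j, i] = delta_{t-1}(j) * P(i | j)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit_p[:, obs[t]]       # inductive step
    # Termination and backtracking
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], delta[-1].max()
```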

Terminating Viterbi
δ* is the maximum over the final column: each final entry was computed as δ_T(i) = max_j δ_{T-1}(j) * P_trans * P_obs, and δ* = max_i δ_T(i). Now backchain through the stored backpointers to recover the final state sequence.
Time: O(S^2 T), Space: O(ST); linear in the length of the sequence.

Three Questions
- Evaluation: Forward algorithm (could also go the other direction, i.e. Backward)
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Solution to #3: Learning (if we have labeled training data!)
Input: the states & edges (person name, location name, background), but no probabilities, plus many labeled sentences such as
  Yesterday Pedro Domingos spoke this example sentence.
Output:
- Initial state & transition probabilities: p(y_1), p(y_t | y_{t-1})
- Emission probabilities: p(x_t | y_t)

Supervised Learning
Input: states & edges, but no probabilities, plus labeled sentences:
  Yesterday Pedro Domingos spoke this example sentence.
  Daniel Weld gave his talk in Mueller 53.
  Sieg 8 is a nasty lecture hall, don't you think?
  The next distinguished lecture is by Oren Etzioni on Thursday.
Output (estimated by counting):
- Initial state probabilities p(y_1): P(y_1 = name) = 1/4, P(y_1 = location) = 1/4, P(y_1 = background) = 2/4
- State transition probabilities p(y_t | y_{t-1}): P(y_t = name | y_{t-1} = name) = 3/6, P(y_t = name | y_{t-1} = background) = ..., etc.
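Supervised HMM training really is just counting and normalizing, as the fractions above suggest. A minimal sketch (a hypothetical helper; it assumes tagged sentences given as lists of (word, state) pairs and omits smoothing):

```python
from collections import Counter, defaultdict

def train_supervised(tagged_sentences):
    """Estimate start, transition, and emission probabilities from labeled sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        states = [s for _, s in sent]
        start[states[0]] += 1                      # which state begins each sentence
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1                  # state-to-state transition counts
        for word, state in sent:
            emit[state][word] += 1                 # word emission counts per state
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

# e.g. P(y_1 = "name"), P(name | name), and P("Pedro" | name) all come out as relative counts
```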

Supervised Learning (continued)
Input: states & edges, but no probabilities, plus many labeled sentences.
Output: initial state and transition probabilities p(y_1), p(y_t | y_{t-1}), and emission probabilities p(x_t | y_t), all estimated the same way, by counting over the labeled data.

Solution to #3: Learning (without labels)
Given x_1 ... x_N, how do we learn θ to maximize P(x)?
Unfortunately, there is no known way to analytically find a global maximum θ*, i.e. θ* = argmax_θ P(o | θ).
But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(o | θ') ≥ P(o | θ).

Chicken & Egg Problem
- If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities. But we can't observe the states, so we don't!
- If we knew the transition & emission probabilities, it would be easy to estimate the sequence of states (Viterbi). But we don't know them!

Simplest Version: Mixture of Two Distributions
The input looks like a set of unlabeled points drawn from two distributions. We know the form of each distribution and its variance; we just need the mean of each distribution.

We Want to Predict
Given a new instance (the "?" on the slide), which distribution did it come from?

Chicken & Egg (again)
- Note that coloring the instances (assigning each to a distribution) would be easy if we knew the Gaussians.
- And finding the Gaussians would be easy if we knew the coloring.

Expectation Maximization (EM)
Pretend we do know the parameters: initialize randomly, e.g. set θ_1 and θ_2 to arbitrary values.
[E step] Compute the probability that each instance belongs to each distribution.

Expectation Maximization (EM), continued
[E step] Compute the probability that each instance belongs to each distribution.
[M step] Treating each instance as fractionally having both values, compute the new parameter values.

ML Mean of a Single Gaussian
  μ_ML = argmin_μ Σ_i (x_i - μ)^2 = (1/n) Σ_i x_i
i.e. the maximum-likelihood mean is just the average of the points; in the M step each point simply contributes in proportion to its fractional membership.

Iterate: E step, M step, E step, M step, ... until the estimates converge.
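A compact sketch of this two-Gaussian EM loop, with known equal variances so only the means are re-estimated. The function name, the fixed sigma, and the equal-prior assumption are illustrative choices, not from the slides:

```python
import numpy as np

def em_two_gaussians(x, sigma=1.0, iters=50, rng=np.random.default_rng(0)):
    """Fit the means of a 2-component Gaussian mixture with known variance sigma^2."""
    mu = rng.choice(x, size=2, replace=False).astype(float)       # initialize randomly
    for _ in range(iters):
        # E step: responsibility of each component for each point (equal priors assumed)
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M step: each point counts fractionally toward both means
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```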

EM for HMMs
[E step] Compute the probability of each instance: compute the forward and backward probabilities for the given model parameters and our observations.
[M step] Treating each instance as fractionally having every value, compute the new parameter values: re-estimate the model parameters by simple counting of the expected counts.

Summary: Learning
Use hill-climbing, called the Baum-Welch algorithm (also the forward-backward algorithm).
Idea: start from an initial parameter instantiation, then loop:
- Compute the forward and backward probabilities for the given model parameters and our observations.
- Re-estimate the parameters.
Until the estimates don't change much.

The Problem with HMMs
We want more than an atomic view of words; we want many arbitrary, overlapping features of words:
- identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; the last person name was female; the next two words are "and Associates"
Slide by Cohen & McCallum

Problems with the Joint Model
These arbitrary features are not independent:
- multiple levels of granularity (chars, words, phrases)
- multiple dependent modalities (words, formatting, layout)
- past & future
Two choices:
1. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
2. Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes), a big problem when combining evidence, as in Viterbi!
Slide by Cohen & McCallum

Discriminative vs. Generative Models
So far, all our models have been generative. Generative models model P(y, x); discriminative models model P(y | x). Discriminative models are often better: eventually, what we care about is p(y | x)!
A Bayes net describes a family of joint distributions whose conditionals take a certain form, but there are many other joint models whose conditionals also have that form. We want to make independence assumptions among y, but not among x. P(y | x) does not include a model of P(x), so it does not need to model the dependencies between features!
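Before moving on to CRFs, here is a sketch of one Baum-Welch iteration as summarized above: the E step reuses the hypothetical forward/backward helpers sketched earlier, and the M step re-estimates parameters from expected counts. It assumes a single observation sequence and omits smoothing and numerical scaling; purely illustrative:

```python
import numpy as np

def baum_welch_step(start_p, trans_p, emit_p, obs):
    """One EM iteration: return re-estimated (start_p, trans_p, emit_p)."""
    alpha, prob = forward(start_p, trans_p, emit_p, obs)
    beta, _ = backward(start_p, trans_p, emit_p, obs)

    gamma = alpha * beta / prob                          # gamma[t, i] = P(y_t = i | obs)
    # xi[t, i, j] = P(y_t = i, y_{t+1} = j | obs)
    xi = (alpha[:-1, :, None] * trans_p[None, :, :] *
          emit_p[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / prob

    new_start = gamma[0]
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit_p)
    for t, o in enumerate(obs):                          # expected emission counts
        new_emit[:, o] += gamma[t]
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_start, new_trans, new_emit
```

In practice one would add log-space scaling and iterate until the likelihood stops improving, as the summary slide says.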

Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(y | x) instead of P(y, x):
- It can examine features, but is not responsible for generating them.
- It doesn't have to explicitly model their dependencies.
- It doesn't waste modeling effort trying to generate what we are given at test time anyway.
Slide by Cohen & McCallum

Finite State Models: generative vs. conditional
  Naive Bayes          -- sequence -->  HMMs               -- general graphs -->  Generative directed models
  Logistic Regression  -- sequence -->  Linear-chain CRFs  -- general graphs -->  General CRFs
Each model in the bottom row is the conditional counterpart of the generative model above it.

Linear-Chain Conditional Random Fields: from HMMs to CRFs
The HMM joint distribution can also be written as
  p(y, x) = (1/Z) exp( Σ_t Σ_{i,j} λ_{ij} 1{y_t = i} 1{y_{t-1} = j} + Σ_t Σ_{i,o} μ_{io} 1{y_t = i} 1{x_t = o} )
(set λ_{ij} = log P(y_t = i | y_{t-1} = j) and μ_{io} = log P(x_t = o | y_t = i)).
If we let the new parameters vary freely, we need the normalization constant Z.

Introduce feature functions: one feature per transition, f_{ij}(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{y_{t-1} = j}, and one feature per state-observation pair, f_{io}(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{x_t = o}. Then the conditional distribution p(y | x) takes the linear-chain form defined next. This is a linear-chain CRF, but one that includes only the current word's identity as a feature.

The conditional p(y | x) that follows from the joint p(y, x) of an HMM is a linear-chain CRF with certain feature functions!

Linear-Chain Conditional Random Fields
Definition: a linear-chain CRF is a distribution p(y | x) that takes the form
  p(y | x) = (1/Z(x)) exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x_t) )
with parameters λ_k and feature functions f_k, where
  Z(x) = Σ_y exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x_t) )
is a normalization function.

Graphically, an HMM-like linear-chain CRF connects each y_t to y_{t-1} and to x_t; a richer linear-chain CRF lets the transition score itself depend on the current observation.
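To make the definition concrete, here is a tiny sketch of the unnormalized score and brute-force normalization for an HMM-like linear-chain CRF (one weight per transition and one per state-word pair). All names and the feature layout are illustrative assumptions, not part of the slides:

```python
import numpy as np
from itertools import product

def crf_score(lam_trans, lam_emit, y, x):
    """Unnormalized log-score: sum of transition and state-observation feature weights."""
    s = lam_emit[y[0], x[0]]
    for t in range(1, len(x)):
        s += lam_trans[y[t - 1], y[t]] + lam_emit[y[t], x[t]]
    return s

def crf_prob(lam_trans, lam_emit, y, x, n_states):
    """p(y | x) via brute-force normalization over all label sequences (fine for tiny examples)."""
    scores = [crf_score(lam_trans, lam_emit, yp, x)
              for yp in product(range(n_states), repeat=len(x))]
    logZ = np.logaddexp.reduce(scores)               # log Z(x)
    return np.exp(crf_score(lam_trans, lam_emit, y, x) - logZ)
```

In practice Z(x) is computed with a forward-style dynamic program over the chain, analogous to the HMM forward algorithm, rather than by enumerating all label sequences.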