Hidden Markov Models (HMMs) for Information Extraction
Daniel S. Weld, CSE 454


Extraction with Finite State Machines, e.g. Hidden Markov Models (HMMs): the standard sequence model in genomics, speech, NLP, ...

What's an HMM?
- Set of states
- Initial probabilities
- Transition probabilities
- Set of potential observations
- Emission probabilities
An HMM is a finite state machine: a hidden state sequence generates an observation sequence o_1 o_2 o_3 o_4 o_5 ...
Adapted from Cohen & McCallum

Graphical Model
- Hidden states ... y_{t-2}, y_{t-1}, y_t ...  The random variable y_t takes values from {s_1, s_2, s_3, s_4}.
- Observations ... x_{t-2}, x_{t-1}, x_t ...  The random variable x_t takes values from {o_1, o_2, o_3, o_4, o_5, ...}.
Adapted from Cohen & McCallum
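Since the slides stress that "an HMM generates the observation sequence", a short generative sketch may help. This is an illustrative sketch only; the function name sample_hmm and the start_p / trans_p / emit_p array layout are assumptions, not part of the original slides.

```python
import numpy as np

def sample_hmm(start_p, trans_p, emit_p, length, rng=np.random.default_rng(0)):
    """Sample a hidden state sequence and the observation sequence it generates.

    start_p : (S,)    initial state probabilities
    trans_p : (S, S)  trans_p[i, j] = P(y_t = s_j | y_{t-1} = s_i)
    emit_p  : (S, O)  emit_p[i, k]  = P(x_t = o_k | y_t = s_i)
    """
    states, obs = [], []
    y = rng.choice(len(start_p), p=start_p)                   # draw the initial hidden state
    for _ in range(length):
        states.append(y)
        obs.append(rng.choice(emit_p.shape[1], p=emit_p[y]))  # emit an observation from state y
        y = rng.choice(trans_p.shape[1], p=trans_p[y])        # transition to the next hidden state
    return states, obs
```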

HMM Graphical Model: Parameters
The graphical model above needs three sets of parameters:
- Start state probabilities: P(y_1 = s_k)
- Transition probabilities: P(y_t = s_i | y_{t-1} = s_k)
- Observation (emission) probabilities: P(x_t = o_j | y_t = s_k), usually multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations.

Example: The Dishonest Casino
A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The dealer switches back and forth between the fair and the loaded die about once every 20 turns.
Game:
1. You bet $1.
2. You roll (always with a fair die).
3. The casino player rolls (maybe with the fair die, maybe with the loaded die).
4. Highest number wins $2.
Slides from Serafim Batzoglou

The Dishonest Casino HMM
Two states, FAIR and LOADED. Each state stays put with probability 0.95 and switches to the other state with probability 0.05. Emissions are the die probabilities above: P(1|F) = ... = P(6|F) = 1/6, and P(1|L) = ... = P(5|L) = 1/10, P(6|L) = 1/2.

Question # 1: Evaluation
GIVEN a sequence of rolls by the casino player, e.g.
  45564646463636666646663666366636
QUESTION: How likely is this sequence, given our model of how the casino works?
This is the EVALUATION problem in HMMs.

Question # 2: Decoding
GIVEN a sequence of rolls by the casino player, e.g.
  4556464646363666664666366636663
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs.

Question # 3: Learning
GIVEN a sequence of rolls by the casino player, e.g.
  4556464646363666664666366636663665
QUESTION: How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
Slides from Serafim Batzoglou
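As a concrete encoding (illustrative, not from the slides), the dishonest-casino HMM can be written as parameter arrays in the layout assumed by the sample_hmm sketch above. The 50/50 start distribution is an assumption, since the slides do not give one.

```python
import numpy as np

# States: 0 = FAIR, 1 = LOADED.  Observations: die faces 1..6, stored as indices 0..5.
start_p = np.array([0.5, 0.5])              # assumed: either die equally likely at the start
trans_p = np.array([[0.95, 0.05],           # FAIR   -> FAIR, LOADED
                    [0.05, 0.95]])          # LOADED -> FAIR, LOADED
emit_p = np.array([[1/6] * 6,               # fair die: uniform over 1..6
                   [1/10] * 5 + [1/2]])     # loaded die: 6 comes up half the time

_, rolls = sample_hmm(start_p, trans_p, emit_p, length=30)
print("".join(str(r + 1) for r in rolls))   # a synthetic roll sequence like the ones above
```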

What's this have to do with Info Extraction?
The same machinery carries over from dice to text: replace the FAIR and LOADED states (which emit die faces) with a TEXT state and a NAME state that emit words, again with 0.95 self-transitions and 0.05 switching probability:
- TEXT: P(the | T) = 0.003, P(from | T) = 0.00..., ...
- NAME: P(Dan | N) = 0.005, P(Sue | N) = 0.003, ...

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states such as person name, location name, and background, find the most likely state sequence (Viterbi):
  s* = argmax_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name, e.g. Person name: Pedro Domingos.
Slide by Cohen & McCallum

IE with Hidden Markov Models (continued)
For sparse extraction tasks:
- Use a separate HMM for each type of target.
- Each HMM should model the entire document, consist of target and non-target states, and need not be fully connected.
Or use a combined HMM. Example: research paper headers.
Slide by Okan Basegmez

HMM Example: Nymble
Task: named entity extraction [Bikel et al. 1998], [BBN IdentiFinder]
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Train on ~500k words of newswire text.
Results (F1 by case and language): mixed-case English 93%, upper-case English 91%, mixed-case Spanish 90%.
Slide adapted from Cohen & McCallum

Finite State Model vs. Path
The finite state model (Person, Org, five other name classes, Other, start-of-sentence, end-of-sentence) can be unrolled into a path: a state sequence y_1 y_2 y_3 y_4 y_5 y_6 generating an observation sequence x_1 x_2 x_3 x_4 x_5 x_6.

Question # 1: Evaluation
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N and a trained HMM θ (its initial, transition, and emission probabilities)
QUESTION: How likely is this sequence, given our HMM? That is, compute P(x | θ).
Why do we care? We need it for learning, to choose among competing models!

A parse of a sequence
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states y = y_1, ..., y_N (e.g. person, other, location).
Slide by Serafim Batzoglou

Question # 2: Decoding
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N and a trained HMM θ
QUESTION: How do we choose the parse (state sequence) y_1 y_2 y_3 ... y_N which best explains x_1 x_2 x_3 ... x_N?
There are several reasonable optimality criteria: the single optimal sequence, average statistics for individual states, ...

Question # 3: Learning
GIVEN a sequence of observations x_1 x_2 x_3 x_4 ... x_N
QUESTION: How do we learn the model parameters θ which maximize P(x | θ)?

Three Questions
- Evaluation: Forward algorithm
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Naive Solution to #1: Evaluation
Given observations x = x_1 ... x_N and HMM θ, what is P(x)?
- Enumerate every possible state sequence y = y_1 ... y_N.
- Compute the probability of x together with that particular y (a product of transition and emission probabilities), and the probability of that particular y: O(T) multiplications per sequence.
- Sum over all possible state sequences.
But even for a small HMM with T = 10 time steps and N = 10 states there are N^T = 10^10, i.e. 10 billion, state sequences!

Many of the calculations are repeated: use dynamic programming. Cache and reuse the inner sums ("forward variables").

Solution to #1: Evaluation via the Forward Variable α_t(i)
Use dynamic programming. Define the forward probability α_t(i): the probability that at time t the state is S_i and the partial observation sequence x_1 ... x_t has been emitted.

Base case (t = 1):
  α_1(i) = P(y_1 = S_i) P(x_1 = o_1 | y_1 = S_i)

Inductive case:
  α_t(i) = [ Σ_j α_{t-1}(j) P(y_t = S_i | y_{t-1} = S_j) ] P(x_t | y_t = S_i)
i.e. sum the forward variables of all states at time t-1, each weighted by its transition probability into S_i, then multiply by the probability of emitting x_t from S_i.
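A minimal sketch of this forward recursion in code (a hypothetical forward helper, using the same assumed array layout as the earlier sketches):

```python
import numpy as np

def forward(start_p, trans_p, emit_p, obs):
    """Forward algorithm: return alpha (T x S) and P(obs | model).

    alpha[t, i] = P(x_1..x_t, y_t = S_i), the forward variable from the slides.
    """
    T, S = len(obs), len(start_p)
    alpha = np.zeros((T, S))
    alpha[0] = start_p * emit_p[:, obs[0]]                       # base case: alpha_1(i)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans_p) * emit_p[:, obs[t]]  # inductive case
    return alpha, alpha[-1].sum()                                # termination: sum over final states
```

The returned total is exactly the evaluation quantity P(x | θ).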

The Forward Algorithm
INITIALIZATION:  α_1(i) = P(y_1 = S_i) P(x_1 | y_1 = S_i)
INDUCTION:       α_{t+1}(i) = [ Σ_j α_t(j) P(y_{t+1} = S_i | y_t = S_j) ] P(x_{t+1} | y_{t+1} = S_i)
TERMINATION:     P(x | θ) = Σ_i α_N(i)
Time: O(S^2 N), Space: O(SN), where S = number of states and N = length of the sequence.

The Backward Algorithm
Define β_t(i) = the probability of emitting the rest of the sequence, x_{t+1} ... x_N, given that the state at time t is S_i.
INITIALIZATION:  β_N(i) = 1
INDUCTION:       β_t(i) = Σ_j P(y_{t+1} = S_j | y_t = S_i) P(x_{t+1} | y_{t+1} = S_j) β_{t+1}(j)
TERMINATION:     P(x | θ) = Σ_i P(y_1 = S_i) P(x_1 | y_1 = S_i) β_1(i)
Time: O(S^2 N), Space: O(SN)

Three Questions
- Evaluation: Forward algorithm (also the Backward algorithm)
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

#2: The Decoding Problem
Given x = x_1 ... x_N and HMM θ, what is the best parse y_1 ... y_N? "Best" has several possible meanings:
1. The states which are individually most likely: the most likely state y*_t at time t is y*_t = argmax_i P(y_t = S_i | x), computable from the forward and backward variables.
2. The single best state sequence: we want the sequence y_1 ... y_N such that P(x, y) is maximized, y* = argmax_y P(x, y).
Again, we can use dynamic programming!
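For completeness, a matching sketch of the backward recursion (same assumed array conventions; β defined as above):

```python
import numpy as np

def backward(start_p, trans_p, emit_p, obs):
    """Backward algorithm: return beta (T x S) and P(obs | model).

    beta[t, i] = P(x_{t+1}..x_T | y_t = S_i).
    """
    T, S = len(obs), len(start_p)
    beta = np.ones((T, S))                                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = trans_p @ (emit_p[:, obs[t + 1]] * beta[t + 1])   # induction
    return beta, (start_p * emit_p[:, obs[0]] * beta[0]).sum()      # termination
```

As a sanity check, forward(...)[1] and backward(...)[1] should return the same value of P(x | θ).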

The Viterbi Variable δ_t(i)
Like α_t(i) (the probability that the state at time t is S_i and the partial observation sequence x_1 ... x_t has been seen), but with a max in place of the sum.
Define δ_t(i) = the probability of the most likely state sequence ending in state S_i, given the observations x_1, ..., x_t:
  δ_t(i) = max_{y_1, ..., y_{t-1}} P(y_1, ..., y_{t-1}, y_t = S_i, x_1, ..., x_t | Θ)

Base case (t = 1):
  δ_1(i) = P(y_1 = S_i) P(x_1 = o_1 | y_1 = S_i)

Inductive step (take the max over predecessor states):
  δ_t(i) = [ max_j δ_{t-1}(j) P(y_t = S_i | y_{t-1} = S_j) ] P(x_t | y_t = S_i)

The Viterbi Algorithm
DEFINE δ_t(i) as above. INITIALIZATION: the base case. INDUCTION: the step above, also recording which predecessor j achieved the max. TERMINATION: take the best final state, then backtrack through the recorded predecessors to get the state sequence y*.
Picture it as filling a table with one row per state and one column per observation x_1 x_2 ... x_T: cell δ_j(i) = max_{i'} δ_{j-1}(i') * P_trans * P_obs.
Remember: δ_t(i) = the probability of the most likely state sequence ending with y_t = S_i.
Slides from Serafim Batzoglou
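A sketch of the Viterbi recursion with backpointers (a hypothetical viterbi helper, same assumed parameter arrays as before):

```python
import numpy as np

def viterbi(start_p, trans_p, emit_p, obs):
    """Return the most likely state sequence y* and its probability P(x, y*)."""
    T, S = len(obs), len(start_p)
    delta = np.zeros((T, S))                 # delta[t, i] = best score of a path ending in state i at time t
    back = np.zeros((T, S), dtype=int)       # backpointers: which predecessor achieved the max
    delta[0] = start_p * emit_p[:, obs[0]]                      # base case
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans_p                # scores[j, i] = delta_{t-1}(j) * P(i | j)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit_p[:, obs[t]]       # inductive step
    # Termination and backtracking
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], delta[-1].max()
```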

Terminating Viterbi
δ* is the maximum over the final column: each final entry was computed as δ_T(i) = max_j δ_{T-1}(j) * P_trans * P_obs, and δ* = max_i δ_T(i). Now backchain through the stored backpointers to recover the final state sequence.
Time: O(S^2 T), Space: O(ST); linear in the length of the sequence.

Three Questions
- Evaluation: Forward algorithm (could also go the other direction, i.e. Backward)
- Decoding: Viterbi algorithm
- Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Solution to #3: Learning (if we have labeled training data!)
Input: the states & edges (person name, location name, background), but no probabilities, plus many labeled sentences such as
  Yesterday Pedro Domingos spoke this example sentence.
Output:
- Initial state & transition probabilities: p(y_1), p(y_t | y_{t-1})
- Emission probabilities: p(x_t | y_t)

Supervised Learning
Input: states & edges, but no probabilities, plus labeled sentences:
  Yesterday Pedro Domingos spoke this example sentence.
  Daniel Weld gave his talk in Mueller 53.
  Sieg 8 is a nasty lecture hall, don't you think?
  The next distinguished lecture is by Oren Etzioni on Thursday.
Output (estimated by counting):
- Initial state probabilities p(y_1): P(y_1 = name) = 1/4, P(y_1 = location) = 1/4, P(y_1 = background) = 2/4
- State transition probabilities p(y_t | y_{t-1}): P(y_t = name | y_{t-1} = name) = 3/6, P(y_t = name | y_{t-1} = background) = ..., etc.
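Supervised HMM training really is just counting and normalizing, as the fractions above suggest. A minimal sketch (a hypothetical helper; it assumes tagged sentences given as lists of (word, state) pairs and omits smoothing):

```python
from collections import Counter, defaultdict

def train_supervised(tagged_sentences):
    """Estimate start, transition, and emission probabilities from labeled sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        states = [s for _, s in sent]
        start[states[0]] += 1                      # which state begins each sentence
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1                  # state-to-state transition counts
        for word, state in sent:
            emit[state][word] += 1                 # word emission counts per state
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

# e.g. P(y_1 = "name"), P(name | name), and P("Pedro" | name) all come out as relative counts
```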

Supervised Learning (continued)
Input: states & edges, but no probabilities, plus many labeled sentences.
Output: initial state and transition probabilities p(y_1), p(y_t | y_{t-1}), and emission probabilities p(x_t | y_t), all estimated the same way, by counting over the labeled data.

Solution to #3: Learning (without labels)
Given x_1 ... x_N, how do we learn θ to maximize P(x)?
Unfortunately, there is no known way to analytically find a global maximum θ*, i.e. θ* = argmax_θ P(o | θ).
But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(o | θ') ≥ P(o | θ).

Chicken & Egg Problem
- If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities. But we can't observe the states, so we don't!
- If we knew the transition & emission probabilities, it would be easy to estimate the sequence of states (Viterbi). But we don't know them!

Simplest Version: Mixture of Two Distributions
The input looks like a set of unlabeled points drawn from two distributions. We know the form of each distribution and its variance; we just need the mean of each distribution.

We Want to Predict
Given a new instance (the "?" on the slide), which distribution did it come from?

Chicken & Egg (again)
- Note that coloring the instances (assigning each to a distribution) would be easy if we knew the Gaussians.
- And finding the Gaussians would be easy if we knew the coloring.

Expectation Maximization (EM)
Pretend we do know the parameters: initialize randomly, e.g. set θ_1 and θ_2 to arbitrary values.
[E step] Compute the probability that each instance belongs to each distribution.

Expectation Maximization (EM), continued
[E step] Compute the probability that each instance belongs to each distribution.
[M step] Treating each instance as fractionally having both values, compute the new parameter values.

ML Mean of a Single Gaussian
  μ_ML = argmin_μ Σ_i (x_i - μ)^2 = (1/n) Σ_i x_i
i.e. the maximum-likelihood mean is just the average of the points; in the M step each point simply contributes in proportion to its fractional membership.

Iterate: E step, M step, E step, M step, ... until the estimates converge.
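A compact sketch of this two-Gaussian EM loop, with known equal variances so only the means are re-estimated. The function name, the fixed sigma, and the equal-prior assumption are illustrative choices, not from the slides:

```python
import numpy as np

def em_two_gaussians(x, sigma=1.0, iters=50, rng=np.random.default_rng(0)):
    """Fit the means of a 2-component Gaussian mixture with known variance sigma^2."""
    mu = rng.choice(x, size=2, replace=False).astype(float)       # initialize randomly
    for _ in range(iters):
        # E step: responsibility of each component for each point (equal priors assumed)
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M step: each point counts fractionally toward both means
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```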

EM for HMMs
[E step] Compute the probability of each instance: compute the forward and backward probabilities for the given model parameters and our observations.
[M step] Treating each instance as fractionally having every value, compute the new parameter values: re-estimate the model parameters by simple counting of the expected counts.

Summary: Learning
Use hill-climbing, called the Baum-Welch algorithm (also the forward-backward algorithm).
Idea: start from an initial parameter instantiation, then loop:
- Compute the forward and backward probabilities for the given model parameters and our observations.
- Re-estimate the parameters.
Until the estimates don't change much.

The Problem with HMMs
We want more than an atomic view of words; we want many arbitrary, overlapping features of words:
- identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; the last person name was female; the next two words are "and Associates"
Slide by Cohen & McCallum

Problems with the Joint Model
These arbitrary features are not independent:
- multiple levels of granularity (chars, words, phrases)
- multiple dependent modalities (words, formatting, layout)
- past & future
Two choices:
1. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
2. Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes), a big problem when combining evidence, as in Viterbi!
Slide by Cohen & McCallum

Discriminative vs. Generative Models
So far, all our models have been generative. Generative models model P(y, x); discriminative models model P(y | x). Discriminative models are often better: eventually, what we care about is p(y | x)!
A Bayes net describes a family of joint distributions whose conditionals take a certain form, but there are many other joint models whose conditionals also have that form. We want to make independence assumptions among y, but not among x. P(y | x) does not include a model of P(x), so it does not need to model the dependencies between features!
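Before moving on to CRFs, here is a sketch of one Baum-Welch iteration as summarized above: the E step reuses the hypothetical forward/backward helpers sketched earlier, and the M step re-estimates parameters from expected counts. It assumes a single observation sequence and omits smoothing and numerical scaling; purely illustrative:

```python
import numpy as np

def baum_welch_step(start_p, trans_p, emit_p, obs):
    """One EM iteration: return re-estimated (start_p, trans_p, emit_p)."""
    alpha, prob = forward(start_p, trans_p, emit_p, obs)
    beta, _ = backward(start_p, trans_p, emit_p, obs)

    gamma = alpha * beta / prob                          # gamma[t, i] = P(y_t = i | obs)
    # xi[t, i, j] = P(y_t = i, y_{t+1} = j | obs)
    xi = (alpha[:-1, :, None] * trans_p[None, :, :] *
          emit_p[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / prob

    new_start = gamma[0]
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit_p)
    for t, o in enumerate(obs):                          # expected emission counts
        new_emit[:, o] += gamma[t]
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_start, new_trans, new_emit
```

In practice one would add log-space scaling and iterate until the likelihood stops improving, as the summary slide says.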

Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(y | x) instead of P(y, x):
- It can examine features, but is not responsible for generating them.
- It doesn't have to explicitly model their dependencies.
- It doesn't waste modeling effort trying to generate what we are given at test time anyway.
Slide by Cohen & McCallum

Finite State Models: generative vs. conditional
  Naive Bayes          -- sequence -->  HMMs               -- general graphs -->  Generative directed models
  Logistic Regression  -- sequence -->  Linear-chain CRFs  -- general graphs -->  General CRFs
Each model in the bottom row is the conditional counterpart of the generative model above it.

Linear-Chain Conditional Random Fields: from HMMs to CRFs
The HMM joint distribution can also be written as
  p(y, x) = (1/Z) exp( Σ_t Σ_{i,j} λ_{ij} 1{y_t = i} 1{y_{t-1} = j} + Σ_t Σ_{i,o} μ_{io} 1{y_t = i} 1{x_t = o} )
(set λ_{ij} = log P(y_t = i | y_{t-1} = j) and μ_{io} = log P(x_t = o | y_t = i)).
If we let the new parameters vary freely, we need the normalization constant Z.

Introduce feature functions: one feature per transition, f_{ij}(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{y_{t-1} = j}, and one feature per state-observation pair, f_{io}(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{x_t = o}. Then the conditional distribution p(y | x) takes the linear-chain form defined next. This is a linear-chain CRF, but one that includes only the current word's identity as a feature.

The conditional p(y | x) that follows from the joint p(y, x) of an HMM is a linear-chain CRF with certain feature functions!

Linear-Chain Conditional Random Fields
Definition: a linear-chain CRF is a distribution p(y | x) that takes the form
  p(y | x) = (1/Z(x)) exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x_t) )
with parameters λ_k and feature functions f_k, where
  Z(x) = Σ_y exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x_t) )
is a normalization function.

Graphically, an HMM-like linear-chain CRF connects each y_t to y_{t-1} and to x_t; a richer linear-chain CRF lets the transition score itself depend on the current observation.
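To make the definition concrete, here is a tiny sketch of the unnormalized score and brute-force normalization for an HMM-like linear-chain CRF (one weight per transition and one per state-word pair). All names and the feature layout are illustrative assumptions, not part of the slides:

```python
import numpy as np
from itertools import product

def crf_score(lam_trans, lam_emit, y, x):
    """Unnormalized log-score: sum of transition and state-observation feature weights."""
    s = lam_emit[y[0], x[0]]
    for t in range(1, len(x)):
        s += lam_trans[y[t - 1], y[t]] + lam_emit[y[t], x[t]]
    return s

def crf_prob(lam_trans, lam_emit, y, x, n_states):
    """p(y | x) via brute-force normalization over all label sequences (fine for tiny examples)."""
    scores = [crf_score(lam_trans, lam_emit, yp, x)
              for yp in product(range(n_states), repeat=len(x))]
    logZ = np.logaddexp.reduce(scores)               # log Z(x)
    return np.exp(crf_score(lam_trans, lam_emit, y, x) - logZ)
```

In practice Z(x) is computed with a forward-style dynamic program over the chain, analogous to the HMM forward algorithm, rather than by enumerating all label sequences.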