Administrivia. What is Information Extraction. Finite State Models. Graphical Models. Hidden Markov Models (HMMs) for Information Extraction


Transcription:

Administrivia
Hidden Markov Models (HMMs) for Information Extraction
Daniel S. Weld, CSE 454
- Group meetings next week
- Feel free to revise proposals through the weekend

What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
Example input (news story, October 14, 2002, 4:00 a.m. PT):
"For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels, the coveted code behind the Windows operating system, to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying..."
IE output (filled slots):
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Landscape of IE Techniques: Models
Each model can capture words, formatting, or both. Running example: "Abraham Lincoln was born in Kentucky."
- Lexicons: is the candidate a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
- Classify pre-segmented candidates: a classifier asks "which class?"
- Sliding window: a classifier asks "which class?"; try alternate window sizes.
- Boundary models: BEGIN / END classifiers mark entity boundaries.
- Finite state machines: what is the most likely state sequence?
- Context-free grammars: what is the most likely parse (NNP NNP V V P NP; S, NP, VP, PP)?
Slides from Cohen & McCallum

Finite State Models
                          (single)              Sequence              General graphs
Generative, directed:     Naïve Bayes      ->   HMMs             ->   generative directed models
Conditional:              logistic regression -> linear-chain CRFs -> general CRFs

Graphical Models
A family of probability distributions that factorize in a certain way.
- Directed (Bayes nets): a node is independent of its non-descendants given its parents.
- Undirected (Markov random fields): a node is independent of all other nodes given its neighbors.
- Factor graphs.
(Figures: small example directed, undirected, and factor graphs over nodes x0 ... x5.)

Warning
Graphical models add another set of arcs between nodes, where the arcs mean something completely different: confusing. (Skip for 454; the new slides are too abstract.)

Recap: Naïve Bayes Classifier
- Hidden state: a random variable (Boolean), e.g. Spam? (y)
- Observables: x1, x2, x3, ..., e.g. Nigeria? Widow? CSE 454?
- Causal (probabilistic) dependencies: P(x_i | y = spam), P(x_i | y != spam)

Recap: Naïve Bayes
- Assumption: features are independent given the label.
- Generative classifier: models the joint distribution p(x, y).
- Inference and learning: counting.

Can we use it for IE directly?
"The article appeared in the Seattle Times." City? Features: capitalization, length, suffix, ...
The labels of neighboring words are dependent! We need to consider the sequence.

Hidden Markov Models
A finite state model; a generative sequence model whose assumptions make the joint distribution tractable.
- State sequence: y1 y2 y3 y4 y5 y6 y7 y8 (e.g. other, person, person, location, ...)
- Observation sequence: x1 x2 x3 x4 x5 x6 x7 x8 (e.g. "Yesterday Pedro ...")
Assumptions:
1. Each state depends only on its immediate predecessor.
2. Each observation depends only on the current state.
Graphical model: transitions between states, observations emitted from states.
Model parameters: start state probabilities, transition probabilities, observation (emission) probabilities.

HMM, Formally
- A set of states {y_i}
- A set of possible observations {x_i}
- Initial state probabilities
- Transition probabilities
- Emission probabilities
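To make the generative story concrete, here is a minimal sketch (not from the slides) of an HMM's parameters and the joint probability P(y, x) they define. The two-state tagger (other / person), the tiny vocabulary, and all numbers are hypothetical.

```python
import numpy as np

# Hypothetical two-state HMM: states and a tiny observation vocabulary.
states = ["other", "person"]                # y values
vocab = ["yesterday", "pedro", "spoke"]     # x values

pi = np.array([0.8, 0.2])                   # start state probabilities P(y_1)
A = np.array([[0.7, 0.3],                   # transition probabilities P(y_t | y_{t-1})
              [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3],              # emission probabilities P(x_t | y_t)
              [0.1, 0.8, 0.1]])

def joint_prob(y, x):
    """P(y, x) = P(y_1) P(x_1|y_1) * prod_t P(y_t|y_{t-1}) P(x_t|y_t)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

# "Yesterday Pedro spoke" tagged other, person, other
x = [vocab.index(w) for w in ["yesterday", "pedro", "spoke"]]
print(joint_prob([0, 1, 0], x))
```

The two independence assumptions above are exactly what lets the joint factor into this product of local terms.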

Example: Dishonest Casino
The casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches dice approximately once every 20 turns.
Game:
1. You bet $1.
2. You roll (always with a fair die).
3. The casino player rolls (maybe with the fair die, maybe with the loaded die).
4. The highest number wins $2.
Slides from Serafim Batzoglou

The Dishonest Casino Model
Two states, FAIR and LOADED.
- Transitions: FAIR -> FAIR 0.95, FAIR -> LOADED 0.05, LOADED -> FAIR 0.05, LOADED -> LOADED 0.95.
- Emissions: P(1|F) = ... = P(6|F) = 1/6; P(1|L) = ... = P(5|L) = 1/10, P(6|L) = 1/2.
Slides from Serafim Batzoglou

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states such as person name, location name, and background, find the most likely state sequence (Viterbi):
  s* = argmax_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Pedro Domingos
Slide by Cohen & McCallum

IE with Hidden Markov Models
For sparse extraction tasks: train a separate HMM for each type of target. Each HMM should
- model the entire document,
- consist of target and non-target states,
- not necessarily be fully connected.
Or use a combined HMM. Example: research paper headers.
Slide by Okan Basegmez

HMM Example: Nymble
Task: named entity extraction [Bikel et al. 1998], [BBN IdentiFinder].
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Transition and observation probabilities, each with back-off.
Trained on ~500k words of newswire text. Results (F1 by case and language): Mixed English 93%, Upper English 91%, Mixed Spanish 90%.
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99].
Slide by Cohen & McCallum
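The casino model translates directly into HMM parameters. A sketch (state order FAIR, LOADED; the uniform start distribution is an assumption, since the slide does not give one):

```python
import numpy as np

states = ["FAIR", "LOADED"]
pi = np.array([0.5, 0.5])                  # start probabilities (assumed uniform)
A = np.array([[0.95, 0.05],                # P(next state | current state)
              [0.05, 0.95]])
B = np.array([[1/6] * 6,                   # fair die: P(1..6 | FAIR) = 1/6
              [1/10] * 5 + [1/2]])         # loaded die: P(1..5 | LOADED) = 1/10, P(6 | LOADED) = 1/2

# Probability that the casino stays on the fair die and rolls 6, 6, 6:
rolls = [6, 6, 6]
p = pi[0] * B[0, rolls[0] - 1]
for r in rolls[1:]:
    p *= A[0, 0] * B[0, r - 1]
print(p)
```

The three questions below (evaluation, decoding, learning) are all asked of exactly this kind of parameter set.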

HMM Example: Nymble (continued)
The finite state model (start-of-sentence, Person, Org, five other name classes, Other, end-of-sentence) versus a path: the same states unrolled over time as y1 y2 y3 y4 y5 y6 above observations x1 x2 x3 x4 x5 x6.
Results (F1): Mixed English 93%, Upper English 91%, Mixed Spanish 90%.
Slide adapted from Cohen & McCallum

Question #1: Evaluation
GIVEN a sequence of observations x1 x2 x3 ... xN and a trained HMM θ = (transition, emission, initial-state probabilities).
QUESTION: How likely is this sequence, given our HMM: P(x | θ)?
Why do we care? We need it for learning, to choose among competing models.

Question #2: Decoding
GIVEN a sequence of observations x1 x2 x3 ... xN and a trained HMM θ = (transition, emission, initial-state probabilities).
QUESTION: How do we choose the corresponding parse (state sequence) y1 y2 y3 ... yN that best explains x1 x2 x3 ... xN?
There are several reasonable optimality criteria: the single optimal sequence, average statistics for individual states, ...

A Parse of a Sequence
Given a sequence x = x1 ... xN, a parse of x is a sequence of states y = y1, ..., yN.

Question #3: Learning
GIVEN a sequence of observations x1 x2 x3 ... xN (e.g. labeled person / other / location in the figure).
QUESTION: How do we learn the model parameters θ = (transition, emission, initial-state probabilities) that maximize P(x | θ)?
Slide by Serafim Batzoglou

Three Questions
- Evaluation: the forward algorithm (could also go the other direction).
- Decoding: the Viterbi algorithm.
- Learning: the Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).

A Solution to #1: Evaluation
Given observations x = x1 ... xN and an HMM θ, what is P(x)?
- Enumerate every possible state sequence y = y1 ... yN.
- For each, compute the probability of x given that particular y times the probability of that particular y: about 2N multiplications per sequence.
- Sum over all possible state sequences.
But even for a small HMM (N = 10 observations, 10 states) there are 10^10 state sequences, ten billion of them!

Solution to #1: Evaluation with Dynamic Programming
Cache and reuse the inner sums. Define the forward variable
  α_t(i) = the probability that at time t the state is S_i and the partial observation sequence x1 ... xt has been emitted.
(Trellis figure over states person / other / location and observations x1 x2 x3 ... xt.)
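The brute-force evaluation can be written directly, which makes the exponential blow-up obvious. A self-contained sketch with hypothetical two-state parameters:

```python
import itertools
import numpy as np

# Tiny 2-state HMM (hypothetical numbers) and a short observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
x = [0, 1, 1, 0]                            # observation indices

def joint(y):
    """P(x, y) for one particular state sequence y."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

# Brute force: sum P(x, y) over all (#states)^(sequence length) state sequences.
p_x = sum(joint(y) for y in itertools.product(range(2), repeat=len(x)))
print(p_x)   # the forward algorithm below computes the same number without enumeration
```

With 10 states and 10 observations this loop would already run ten billion times, which is exactly why the forward recursion is needed.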

The Forward Algorithm
(Writing π_i for the initial-state probabilities, a_ij for the transition probabilities, and b_i(x) for the emission probabilities.)
INITIALIZATION:  α_1(i) = π_i b_i(x_1)
INDUCTION:       α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(x_{t+1})
TERMINATION:     P(x | θ) = Σ_i α_N(i)
Time: O(K^2 N); space: O(K N), where K = #states and N = length of the sequence.

The Backward Algorithm
Define β_t(i) = the probability of the partial observation sequence x_{t+1} ... x_N given that the state at time t is S_i.
INITIALIZATION:  β_N(i) = 1
INDUCTION:       β_t(i) = Σ_j a_ij b_j(x_{t+1}) β_{t+1}(j)
TERMINATION:     P(x | θ) = Σ_i π_i b_i(x_1) β_1(i)
Time: O(K^2 N); space: O(K N).

Three Questions
- Evaluation: the forward algorithm (could also go the other direction).
- Decoding: the Viterbi algorithm.
- Learning: the Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).

Solution to #2: Decoding
Given x = x1 ... xN and an HMM θ, what is the best parse y1 ... yN? There are several optimal solutions:
1. States that are individually most likely: the most likely state y*_t at time t is argmax_i P(y_t = S_i | x). But some transitions in the resulting sequence may have probability 0!
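A sketch of the forward and backward recursions just described, in the same π / a_ij / b_i(x) notation (the example parameters at the bottom are hypothetical):

```python
import numpy as np

def forward(pi, A, B, x):
    """alpha[t, i] = P(x_1..x_t, y_t = S_i)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                       # initialization
    for t in range(1, T):                            # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha                                     # P(x) = alpha[-1].sum()

def backward(pi, A, B, x):
    """beta[t, i] = P(x_{t+1}..x_T | y_t = S_i)."""
    T, K = len(x), len(pi)
    beta = np.zeros((T, K))
    beta[-1] = 1.0                                   # initialization
    for t in range(T - 2, -1, -1):                   # induction
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta                                      # P(x) = (pi * B[:, x[0]] * beta[0]).sum()

# Both directions give the same P(x), here on hypothetical parameters:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
x = [0, 1, 1, 0]
print(forward(pi, A, B, x)[-1].sum(),
      (pi * B[:, x[0]] * backward(pi, A, B, x)[0]).sum())
```

Each time step touches every (state, state) pair once, which is where the O(K^2 N) running time comes from.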

[Note on the slides: need a new slide here; it looks like the forward trellis, but with δ's.]

Solution to #2: Decoding
Given x = x1 ... xN and an HMM θ, what is the best parse y1 ... yN? Several optimal solutions:
1. States that are individually most likely.
2. The single best state sequence: we want to find the sequence y1 ... yN such that P(x, y) is maximized,
     y* = argmax_y P(x, y).
Again, we can use dynamic programming!

The Viterbi Algorithm
DEFINE:          δ_t(i) = the probability of the most likely state sequence ending in state S_i at time t, having emitted x_1 ... x_t.
INITIALIZATION:  δ_1(i) = π_i b_i(x_1)
INDUCTION:       δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(x_{t+1}), keeping a backpointer to the maximizing i
TERMINATION:     take max_i δ_N(i), then backtrack through the pointers to recover the state sequence y*.
(Trellis figure: for each state j at time t, take the max over the predecessor states i.)
Time: O(K^2 N); space: O(K N): linear in the length of the sequence.
Slides from Serafim Batzoglou

Three Questions
- Evaluation: the forward algorithm (could also go the other direction).
- Decoding: the Viterbi algorithm.
- Learning: the Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).
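A sketch of Viterbi decoding with backpointers, applied to the dishonest-casino parameters from earlier (the roll sequence is made up):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Return the most likely state sequence argmax_y P(y, x)."""
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))      # delta[t, i]: best score of any path ending in state i at t
    back = np.zeros((T, K), dtype=int)
    delta[0] = pi * B[:, x[0]]                          # initialization
    for t in range(1, T):                               # induction
        scores = delta[t - 1][:, None] * A              # scores[i, j] = delta[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    y = [int(delta[-1].argmax())]                       # termination + backtracking
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Decode a short run of casino rolls (0 = FAIR, 1 = LOADED):
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
print(viterbi(pi, A, B, [r - 1 for r in [1, 6, 6, 6, 1]]))
```

The structure mirrors the forward algorithm exactly; only the sum over predecessors is replaced by a max plus a backpointer.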

Solution to #3: Learning
Given x1 ... xN, how do we learn θ = (transition, emission, initial-state probabilities) to maximize P(x)?
- Unfortunately, there is no known way to analytically find a global maximum θ* = argmax_θ P(o | θ).
- But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(o | θ') >= P(o | θ).

Chicken & Egg Problem
- If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities. But we can't observe the states, so we don't!
- If we knew the transition and emission probabilities, it would be easy to estimate the sequence of states (Viterbi). But we don't know them!

Simplest Version: a Mixture of Two Distributions
The input looks like a cloud of 1-D points (roughly between 0.01 and 0.09 in the figure). We know the form of the distributions and their variance; we just need the mean of each distribution. We want to predict which distribution each point came from.

Chicken & Egg
Note that coloring the instances (assigning each point to a Gaussian) would be easy if we knew the Gaussians...

Chicken & Egg
...and finding the Gaussians would be easy if we knew the coloring.

Expectation Maximization, by Example
- Initialize randomly: set θ1 = ?, θ2 = ?, and pretend we do know the parameters.
- [E step] Treating the current parameters as correct, compute how much each instance belongs to each distribution.
- [M step] Treating each instance as fractionally having both values, compute the new parameter values.
- Repeat.
(The sequence of histogram figures showing the two estimated Gaussians shifting toward the data is omitted.)

ML Mean of a Single Gaussian
  μ_ML = argmin_μ Σ_i (x_i - μ)^2, i.e. the average of the x_i.
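A sketch of this E/M loop for the two-Gaussian toy problem. Known, equal variance and equal mixing weights are assumptions here (the slides only say the means are unknown), and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.03, 0.01, 50),   # hypothetical 1-D points
                       rng.normal(0.07, 0.01, 50)])
sigma = 0.01                                          # known standard deviation (assumed)
mu = np.array([0.02, 0.09])                           # rough random initialization

def gauss(x, m):
    # Unnormalized density; the constant cancels when variances and weights are equal.
    return np.exp(-(x - m) ** 2 / (2 * sigma ** 2))

for _ in range(20):
    # E step: fractional ("colored") assignment of each point to each Gaussian
    r0, r1 = gauss(data, mu[0]), gauss(data, mu[1])
    w = r0 / (r0 + r1)                                # responsibility of Gaussian 0
    # M step: re-estimate each mean as the responsibility-weighted average
    mu = np.array([(w * data).sum() / w.sum(),
                   ((1 - w) * data).sum() / (1 - w).sum()])
print(mu)
```

Each iteration can only improve the likelihood, which is the hill-climbing behavior the learning summary below refers to.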

EM for HMMs
- [E step] Compute the forward and backward probabilities for the given model parameters and our observations.
- [M step] Treating each instance as fractionally having every value, re-estimate the model parameters: simple counting of expected transitions and emissions.

Summary: Learning
Use hill-climbing, called the forward-backward (or Baum-Welch) algorithm.
Idea:
- Start with an initial parameter instantiation.
- Loop:
  - Compute the forward and backward probabilities for the given model parameters and our observations.
  - Re-estimate the parameters.
- Until the estimates don't change much.
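One re-estimation step can be sketched as follows. It reuses the forward() and backward() functions from the sketch above (so it is runnable only together with that block); gamma and xi are the usual expected state and transition counts:

```python
import numpy as np
# Assumes forward(pi, A, B, x) and backward(pi, A, B, x) from the earlier sketch.

def baum_welch_step(pi, A, B, x):
    """One EM (Baum-Welch) re-estimation step; returns updated (pi, A, B)."""
    T, K = len(x), len(pi)
    alpha, beta = forward(pi, A, B, x), backward(pi, A, B, x)
    p_x = alpha[-1].sum()
    gamma = alpha * beta / p_x                          # gamma[t, i] = P(y_t = i | x)
    # xi[t, i, j] = P(y_t = i, y_{t+1} = j | x)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, x[1:]].T * beta[1:])[:, None, :]) / p_x
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t, o in enumerate(x):                           # expected emission counts
        new_B[:, o] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```

Looping this step and stopping when P(x) = alpha[-1].sum() stops improving is exactly the hill-climbing loop in the summary.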

The Problem with HMMs
(The following slides are unclear.)
We want more than an atomic view of words: we want many arbitrary, overlapping features of a word (see the feature-function sketch below), e.g.
- the identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet;
- formatting features: is in bold font, is indented, is in a hyperlink anchor;
- context features: the last person name was female, the next two words are "and Associates".
Slide by Cohen & McCallum

Finite State Models
                          (single)              Sequence              General graphs
Generative, directed:     Naïve Bayes      ->   HMMs             ->   generative directed models
Conditional:              logistic regression -> linear-chain CRFs -> general CRFs

Problems with a Richer Representation and a Joint Model
These arbitrary features are not independent:
- multiple levels of granularity (characters, words, phrases),
- multiple dependent modalities (words, formatting, layout),
- past and future.
Two choices:
1. Model the dependencies. Each state would have its own Bayes net, but we are already starved for training data!
2. Ignore the dependencies. This causes over-counting of evidence (as in naïve Bayes), a big problem when combining evidence, as in Viterbi!
Slide by Cohen & McCallum

Discriminative and Generative Models
So far, all of the models have been generative.
- Generative models model P(x, y).
- Discriminative models model P(y | x).
Discriminative models are often better: eventually, what we care about is p(y | x)!
- A Bayes net describes a family of joint distributions whose conditionals take a certain form, but there are many other joint models whose conditionals also have that form.
- We want to make independence assumptions among y, but not among x.
- P(y | x) does not include a model of P(x), so it does not need to model the dependencies between features!
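The kind of arbitrary, overlapping features listed above (suffixes, capitalization, gazetteer membership, context words) are easy to write down as a feature function. A hypothetical sketch; the gazetteer and feature names are made up:

```python
CITY_NAMES = {"seattle", "madison"}          # hypothetical gazetteer

def word_features(words, t):
    """Overlapping, non-independent features of the word at position t."""
    w = words[t]
    return {
        "identity=" + w.lower(): 1.0,
        "is_capitalized": float(w[0].isupper()),
        "ends_in_ski": float(w.lower().endswith("ski")),
        "in_city_list": float(w.lower() in CITY_NAMES),
        "next_word=" + (words[t + 1].lower() if t + 1 < len(words) else "<END>"): 1.0,
    }

print(word_features(["Mr.", "Wisniewski", "visited", "Seattle"], 1))
```

A generative model would have to explain how all of these correlated features were produced; a conditional model, as argued above, only has to consult them.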

Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y | x) instead of P(y, x).
- It can examine features without being responsible for generating them.
- It doesn't have to explicitly model their dependencies.
- It doesn't waste modeling effort trying to generate what we are given at test time anyway.
Slide by Cohen & McCallum

Linear-Chain Conditional Random Fields
From HMMs to CRFs: the HMM joint p(y, x) can also be written as an exponentiated sum of weighted indicator features (setting the weights to the log transition and log emission probabilities). If we then let the new parameters vary freely, we need a normalization constant Z.

Introduce feature functions:
- one feature per transition: f_ij(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{y_{t-1} = j},
- one feature per state-observation pair: f_io(y_t, y_{t-1}, x_t) = 1{y_t = i} 1{x_t = o}.
Then the conditional distribution p(y | x) is a linear-chain CRF, but one that includes only the current word's identity as a feature.

The conditional p(y | x) that follows from the joint p(y, x) of an HMM is a linear-chain CRF with these particular feature functions!

Definition: a linear-chain CRF is a distribution of the form
  p(y | x) = (1 / Z(x)) Π_t exp( Σ_k λ_k f_k(y_t, y_{t-1}, x_t) ),
where the λ_k are parameters, the f_k are feature functions, and Z(x) is a normalization function.
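A sketch of the definition above, using only the two HMM-like feature families (transition indicators and state-observation indicators) and a brute-force Z(x) for a tiny example; all weights and sizes are made up:

```python
import itertools
import math

K, V = 2, 3                                      # number of labels, vocabulary size
lam_trans = [[0.5, -0.2], [0.1, 0.3]]            # one weight per (y_{t-1}, y_t) transition
lam_emit = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]  # one weight per (y_t, x_t) pair

def score(y, x):
    """Unnormalized log-score: sum of the active feature weights along the chain."""
    s = sum(lam_emit[y[t]][x[t]] for t in range(len(x)))
    s += sum(lam_trans[y[t - 1]][y[t]] for t in range(1, len(x)))
    return s

def p_y_given_x(y, x):
    z = sum(math.exp(score(yy, x))               # brute-force Z(x) over all K^T label sequences
            for yy in itertools.product(range(K), repeat=len(x)))
    return math.exp(score(y, x)) / z

x = [0, 1, 2]
print(p_y_given_x([0, 1, 1], x))
```

In practice Z(x) is computed with a forward-style recursion over the chain, just as for HMMs, rather than by enumeration; the brute force here is only to make the definition concrete.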

Linear-Chain Conditional Random Fields
(Figures: an HMM-like linear-chain CRF, a chain of labels y with observations x; and a linear-chain CRF in which the transition score depends on the current observation.)