Probability and Structure in Natural Language Processing Noah Smith, Carnegie Mellon University 2012 International Summer School in Language and Speech Technologies

Quick Recap Yesterday: Bayesian networks and some formal properties; Markov networks and some formal properties; exact marginal inference using variable elimination (sum-product version); beginnings of the max-product version.

Variable Elimination for Conditional Probabilities
Input: graphical model on V, set of query variables Y, evidence X = x
Output: factor ϕ and scalar α
1. Φ = factors in the model
2. Reduce factors in Φ by X = x
3. Choose a variable ordering on Z = V \ Y \ X
4. ϕ = Variable-Elimination(Φ, Z)
5. α = Σ_{y ∈ Val(Y)} ϕ(y)
6. Return ϕ, α
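To make the procedure above concrete, here is a minimal Python sketch of conditional-probability VE (not from the lecture; the factor representation and the helper names multiply, marginalize, reduce_evidence, and conditional_prob_ve are my own). A factor is a (scope, table) pair: a tuple of variable names plus a dict mapping value tuples to nonnegative reals.

```python
from itertools import product

def multiply(f1, f2, domains):
    """Factor product: the scope is the union of the two scopes."""
    v1, t1 = f1
    v2, t2 = f2
    scope = tuple(dict.fromkeys(v1 + v2))  # union, preserving order
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, vals))
        table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return scope, table

def marginalize(f, var):
    """Sum out `var` from factor f."""
    scope, table = f
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return new_scope, new_table

def reduce_evidence(f, evidence):
    """Drop rows inconsistent with the evidence, then drop the evidence variables."""
    scope, table = f
    keep = [i for i, v in enumerate(scope) if v not in evidence]
    new_scope = tuple(scope[i] for i in keep)
    new_table = {}
    for vals, p in table.items():
        if all(vals[i] == evidence[v] for i, v in enumerate(scope) if v in evidence):
            new_table[tuple(vals[i] for i in keep)] = p
    return new_scope, new_table

def conditional_prob_ve(factors, evidence, ordering, domains):
    """Steps 1-6 above: returns the factor phi over the query variables and alpha."""
    phis = [reduce_evidence(f, evidence) for f in factors]          # steps 1-2
    for z in ordering:                                              # steps 3-4
        relevant = [f for f in phis if z in f[0]]
        phis = [f for f in phis if z not in f[0]]
        if not relevant:
            continue
        prod = relevant[0]
        for f in relevant[1:]:
            prod = multiply(prod, f, domains)
        phis.append(marginalize(prod, z))
    phi = phis[0]
    for f in phis[1:]:
        phi = multiply(phi, f, domains)
    alpha = sum(phi[1].values())                                    # step 5
    return phi, alpha       # P(Y = y | X = x) = phi(y) / alpha
```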

From Marginals to MAP Replace factor marginalization steps with maximization. Add bookkeeping to keep track of the maximizing values. Add a traceback at the end to recover the solution. This is analogous to the connection between the forward algorithm and the Viterbi algorithm. The ordering challenge is the same.

Factor Maximization Given variables X and Y (with Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via maximization: ψ(X) = max_Y ϕ(X, Y). We can refer to this new factor as max_Y ϕ.

Factor Maximization Given variables X and Y (with Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via maximization: ψ(X) = max_Y ϕ(X, Y). Example, maximizing out B:
ϕ(A, B, C):
A B C ϕ
0 0 0 0.9
0 0 1 0.3
0 1 0 1.1
0 1 1 1.7
1 0 0 0.4
1 0 1 0.7
1 1 0 1.1
1 1 1 0.2
ψ(A, C) = max_B ϕ(A, B, C):
A C ψ
0 0 1.1 (B=1)
0 1 1.7 (B=1)
1 0 1.1 (B=1)
1 1 0.7 (B=0)
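In code, the same max-out with argmax bookkeeping is a few lines; this sketch uses the table above (the name max_out_b is mine):

```python
# phi(A, B, C) from the table above, keyed by (a, b, c).
phi = {(0, 0, 0): 0.9, (0, 0, 1): 0.3, (0, 1, 0): 1.1, (0, 1, 1): 1.7,
       (1, 0, 0): 0.4, (1, 0, 1): 0.7, (1, 1, 0): 1.1, (1, 1, 1): 0.2}

def max_out_b(table):
    """Maximize out B, remembering which value of B achieved each max (bookkeeping)."""
    psi, argmax_b = {}, {}
    for (a, b, c), val in table.items():
        if (a, c) not in psi or val > psi[(a, c)]:
            psi[(a, c)], argmax_b[(a, c)] = val, b
    return psi, argmax_b

psi, argmax_b = max_out_b(phi)
print(psi[(1, 1)], argmax_b[(1, 1)])   # 0.7, achieved at B=0
```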

Distributive Property A useful property we exploited in variable elimination: if X ∉ Scope(ϕ_1), then Σ_X (ϕ_1 · ϕ_2) = ϕ_1 · Σ_X ϕ_2. Under the same condition, factor multiplication distributes over max, too: max_X (ϕ_1 · ϕ_2) = ϕ_1 · max_X ϕ_2.
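A quick numerical check of both identities (toy tables of my own; factors are nonnegative, which is what lets multiplication distribute over max):

```python
# phi1 has scope {A}; phi2 has scope {A, X}; X is not in Scope(phi1).
phi1 = {0: 2.0, 1: 0.5}                                      # phi1(A)
phi2 = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 0.1}  # phi2(A, X)

for a in (0, 1):
    # sum_X (phi1 * phi2) == phi1 * sum_X phi2
    assert abs(sum(phi1[a] * phi2[(a, x)] for x in (0, 1))
               - phi1[a] * sum(phi2[(a, x)] for x in (0, 1))) < 1e-12
    # max_X (phi1 * phi2) == phi1 * max_X phi2
    assert abs(max(phi1[a] * phi2[(a, x)] for x in (0, 1))
               - phi1[a] * max(phi2[(a, x)] for x in (0, 1))) < 1e-12
```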

Traceback
Input: sequence of factors with associated variables (ψ_{Z_1}, ..., ψ_{Z_k})
Output: z*
Each ψ_{Z_i} is a factor whose scope includes Z_i and the variables eliminated after Z_i.
Work backwards from i = k to 1: let z*_i = arg max_z ψ_{Z_i}(z, z*_{i+1}, z*_{i+2}, ..., z*_k).
Return z*.

About the Traceback No extra (asymptotic) expense; a linear traversal over the intermediate factors. The factor operations for both sum-product VE and max-product VE can be generalized. Example: get the K most likely assignments.

Eliminating One Variable (Max-Product Version with Bookkeeping)
Input: set of factors Φ, variable Z to eliminate
Output: new set of factors Ψ
1. Let Φ' = {ϕ ∈ Φ : Z ∈ Scope(ϕ)}
2. Let Ψ = {ϕ ∈ Φ : Z ∉ Scope(ϕ)}
3. Let τ = max_Z Π_{ϕ ∈ Φ'} ϕ, and let ψ = Π_{ϕ ∈ Φ'} ϕ (bookkeeping)
4. Return Ψ ∪ {τ}, ψ

Variable Elimination (Max-Product Version with Decoding)
Input: set of factors Φ, ordered list of variables Z to eliminate
Output: new factor and decoded assignment
1. For each Z_i ∈ Z (in order): let (Φ, ψ_{Z_i}) = Eliminate-One(Φ, Z_i)
2. Return Π_{ϕ ∈ Φ} ϕ, Traceback({ψ_{Z_i}})
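Here is a sketch of the max-product version end to end, reusing the dict-based factor representation and the multiply helper from the sum-product sketch above (all other names are mine, and the sketch assumes every variable is eliminated):

```python
def multiply_all(factors, domains):
    prod = factors[0]
    for f in factors[1:]:
        prod = multiply(prod, f, domains)
    return prod

def eliminate_one_max(factors, z, domains):
    """Max-product Eliminate-One with bookkeeping: returns (new factor set, psi_Z)."""
    relevant = [f for f in factors if z in f[0]]
    rest = [f for f in factors if z not in f[0]]
    psi = multiply_all(relevant, domains)            # bookkeeping factor; scope includes z
    scope, table = psi
    i = scope.index(z)
    tau_scope = scope[:i] + scope[i + 1:]
    tau_table = {}
    for vals, v in table.items():                    # tau = max_z psi
        key = vals[:i] + vals[i + 1:]
        if key not in tau_table or v > tau_table[key]:
            tau_table[key] = v
    return rest + [(tau_scope, tau_table)], psi

def traceback(psis, order):
    """Work backwards through the bookkeeping factors to recover the maximizing assignment."""
    assignment = {}
    for z, (scope, table) in zip(reversed(order), reversed(psis)):
        best_val, best_z = None, None
        for vals, v in table.items():
            a = dict(zip(scope, vals))
            # keep only rows consistent with the variables decided later
            if all(a[u] == assignment[u] for u in scope if u in assignment):
                if best_val is None or v > best_val:
                    best_val, best_z = v, a[z]
        assignment[z] = best_z
    return assignment

def max_product_ve(factors, order, domains):
    """Eliminate every variable in `order`, then decode via the traceback."""
    psis = []
    for z in order:
        factors, psi = eliminate_one_max(factors, z, domains)
        psis.append(psi)
    best_score = multiply_all(factors, domains)[1][()]   # all remaining scopes are empty
    return best_score, traceback(psis, order)
```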

Variable Elimination Tips Any ordering will be correct. Most orderings will be too expensive. There are heuristics for choosing an ordering (you are welcome to find them and test them out).

(Rocket Science: True MAP) Evidence: X = x. Query: Y. Other variables: Z = V \ X \ Y.
y* = arg max_{y ∈ Val(Y)} P(Y = y | X = x) = arg max_{y ∈ Val(Y)} Σ_{z ∈ Val(Z)} P(Y = y, Z = z | X = x)
First marginalize out Z, then do MAP inference over Y given X = x. This is not usually attempted in NLP, with some exceptions.

Parting Shots You will probably never implement the general variable elimination algorithm. You will rarely use exact inference. There is value in understanding the problem that approximation methods are trying to solve, and what an exact (if intractable) solution would look like!

Lecture 3: Structures and Decoding

Two Meanings of Structure Yesterday: structure of a graph for modeling a collection of random variables together. Today: linguistic structure. Sequence labelings (POS, IOB chunkings, ...), parse trees (phrase-structure, dependency, ...), alignments (word, phrase, tree, ...), predicate-argument structures, text-to-text (translation, paraphrase, answers, ...).

A Useful Abstraction? I think so. Brings out commonalities: modeling formalisms (e.g., linear models with features), learning algorithms (lectures 4-6), generic inference algorithms. Permits sharing across a wider space of problems. Disadvantage: hides engineering details.

Familiar Example: Hidden Markov Models

Hidden Markov Model X and Y are both sequences of symbols: X is a sequence from the vocabulary Σ, Y is a sequence from the state space Λ.
p(X = x, Y = y) = Π_{i=1}^{n} p(x_i | y_i) p(y_i | y_{i-1}) · p(stop | y_n)
Parameters: transitions p(y' | y), including p(stop | y) and p(y | start); emissions p(x | y).

Hidden Markov Model The joint model's independence assumptions are easy to capture with a Bayesian network.
p(X = x, Y = y) = Π_{i=1}^{n} p(x_i | y_i) p(y_i | y_{i-1}) · p(stop | y_n)
[Bayesian network: Y_0 → Y_1 → Y_2 → Y_3 → ... → Y_n → stop, with each Y_i → X_i.]
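As a concrete check of this factorization, here is a tiny Python sketch that evaluates the joint probability from transition and emission tables (the tables and the tag/word inventory are toy values of my own, not from the lecture):

```python
def hmm_joint(x, y, transition, emission):
    """p(X = x, Y = y) = prod_i p(x_i | y_i) p(y_i | y_{i-1}) * p(stop | y_n)."""
    p, prev = 1.0, "start"
    for word, tag in zip(x, y):
        p *= emission[tag][word] * transition[prev][tag]
        prev = tag
    return p * transition[prev]["stop"]

# Toy tables (illustrative only): states Lambda = {D, N}, vocabulary Sigma = {the, a, dog, cat}.
transition = {"start": {"D": 0.6, "N": 0.4},
              "D": {"D": 0.05, "N": 0.9, "stop": 0.05},
              "N": {"D": 0.2, "N": 0.3, "stop": 0.5}}
emission = {"D": {"the": 0.7, "a": 0.3},
            "N": {"dog": 0.4, "cat": 0.4, "the": 0.2}}

print(hmm_joint(["the", "dog"], ["D", "N"], transition, emission))  # 0.6*0.7*0.9*0.4*0.5 = 0.0756
```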

Hidden Markov Model The joint model instantiates a dynamic Bayesian network.
p(X = x, Y = y) = Π_{i=1}^{n} p(x_i | y_i) p(y_i | y_{i-1}) · p(stop | y_n)
[Template: Y_{i-1} → Y_i → X_i, copied as many times as needed.]

Hidden Markov Model Given X's value as evidence, the dynamic part becomes unnecessary, since we know n.
p(X = x, Y = y) = Π_{i=1}^{n} p(x_i | y_i) p(y_i | y_{i-1}) · p(stop | y_n)
[Bayesian network: Y_0 → Y_1 → ... → Y_n → stop, with observed X_1 = x_1, ..., X_n = x_n.]

Hidden Markov Model The usual inference problem is to find the most probable value of Y given X = x.
[Bayesian network: Y_0 → Y_1 → ... → Y_n → stop, with observed X_1 = x_1, ..., X_n = x_n.]

Hidden Markov Model The usual inference problem is to find the most probable value of Y given X = x.
[Factor graph over Y_0, Y_1, ..., Y_n, stop, and the observed X_1 = x_1, ..., X_n = x_n.]

Hidden Markov Model The usual inference problem is to find the most probable value of Y given X = x.
[Factor graph after reducing factors to respect the evidence: a chain over Y_1, Y_2, ..., Y_n.]

Hidden Markov Model The usual inference problem is to find the most probable value of Y given X = x. Clever ordering should be apparent!
[Chain over Y_1, Y_2, ..., Y_n.]

Hidden Markov Model When we eliminate Y_1, we take a product of three relevant factors: p(Y_1 | start), η(Y_1) = the reduced p(x_1 | Y_1), and p(Y_2 | Y_1).

Hidden Markov Model When we eliminate Y_1, we first take a product of two factors that only involve Y_1: p(Y_1 | start) and η(Y_1) = the reduced p(x_1 | Y_1), each a vector indexed by y_1, y_2, ..., y_{|Λ|}.

Hidden Markov Model When we eliminate Y_1, we first take a product of two factors that only involve Y_1. This is the Viterbi probability vector for Y_1, φ_1(Y_1).

Hidden Markov Model When we eliminate Y_1, we first take a product of two factors that only involve Y_1. This is the Viterbi probability vector for Y_1, φ_1(Y_1). Eliminating Y_1 equates to solving the Viterbi probabilities for Y_2, combining φ_1(Y_1) with p(Y_2 | Y_1).

Hidden Markov Model Take the product of all factors involving Y_1, then maximize out Y_1: φ_2(Y_2) = max_{y ∈ Val(Y_1)} (φ_1(y) · p(Y_2 | y)). This factor holds the Viterbi probabilities for Y_2.

Hidden Markov Model When we eliminate Y_2, we take a product of the analogous two relevant factors, then maximize out Y_2: φ_3(Y_3) = max_{y ∈ Val(Y_2)} (φ_2(y) · p(Y_3 | y)).

Hidden Markov Model At the end, we have one final factor with one row, φ_{n+1}. This is the score of the best sequence. Use the traceback to recover the values.
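This recurrence is exactly the Viterbi algorithm. Here is a compact Python sketch in those terms, reusing the toy transition and emission tables from the earlier joint-probability sketch (it folds each emission factor in as soon as its state variable appears, which changes nothing about the argmax):

```python
def viterbi(x, states, transition, emission):
    """Max-product VE on the HMM chain: phi vectors with backpointers, then a traceback."""
    # phi_1(Y_1): product of the two factors that only involve Y_1.
    phi = {y: transition["start"].get(y, 0.0) * emission[y].get(x[0], 0.0) for y in states}
    backptrs = []
    for word in x[1:]:
        new_phi, bp = {}, {}
        for y in states:
            # maximize out the previous state, keeping the argmax for the traceback
            best_prev = max(states, key=lambda yp: phi[yp] * transition[yp].get(y, 0.0))
            new_phi[y] = phi[best_prev] * transition[best_prev].get(y, 0.0) * emission[y].get(word, 0.0)
            bp[y] = best_prev
        phi, backptrs = new_phi, backptrs + [bp]
    # Final factor folds in p(stop | y_n); its max is the score of the best sequence.
    best_last = max(states, key=lambda y: phi[y] * transition[y].get("stop", 0.0))
    tags = [best_last]
    for bp in reversed(backptrs):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "dog"], ["D", "N"], transition, emission))   # ['D', 'N']
```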

Why Think This Way? Easy to see how to generalize HMMs: more evidence, more factors, more hidden structure, more dependencies. The probabilistic interpretation of factors is not central to finding the best Y; many factors are not conditional probability tables.

Generalization Example 1 [Graph over Y_1, ..., Y_5 and X_1, ..., X_5.] Each word also depends on the previous state.

Generalization Example 2 [Graph over Y_1, ..., Y_5 and X_1, ..., X_5.] Trigram HMM.

Generalization Example 3 [Graph over Y_1, ..., Y_5 and X_1, ..., X_5.] Aggregate bigram model (Saul and Pereira, 1997).

General Decoding Problem Two structured random variables, X and Y, sometimes described as collections of random variables. Decode the observed value X = x into some value of Y. Usually, we seek to maximize some score, e.g., MAP inference from yesterday.

Linear Models Define a feature vector function g that maps (x, y) pairs into d-dimensional real space. The score is linear in g(x, y):
score(x, y) = w · g(x, y)
ŷ = arg max_{y ∈ Y_x} w · g(x, y)
Results: decoding seeks y to maximize the score; learning seeks w to do something we'll talk about later. Extremely general!
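A minimal sketch of this setup, with sparse feature vectors as dicts and brute-force decoding over an explicitly enumerated candidate set Y_x (the names and the toy feature function are illustrative only; real decoders exploit structure rather than enumerating):

```python
def score(w, g, x, y):
    """score(x, y) = w . g(x, y), with w and g(x, y) stored as sparse dicts."""
    return sum(w.get(feat, 0.0) * val for feat, val in g(x, y).items())

def decode(w, g, x, candidates):
    """Decoding: choose the candidate y that maximizes the linear score."""
    return max(candidates, key=lambda y: score(w, g, x, y))

# Toy feature function for sequence labeling: counts of (tag, word) and (tag, next tag) events.
def g(x, y):
    feats, prev = {}, "start"
    for word, tag in zip(x, y):
        feats[("emit", tag, word)] = feats.get(("emit", tag, word), 0) + 1
        feats[("trans", prev, tag)] = feats.get(("trans", prev, tag), 0) + 1
        prev = tag
    return feats
```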

Generic Noisy Channel as Linear Model
ŷ = arg max_y log (p(y) · p(x | y))
  = arg max_y log p(y) + log p(x | y)
  = arg max_y w_y + w_{x|y}
  = arg max_y w · g(x, y)
Of course, the two probability terms are typically composed of smaller factors; each can be understood as an exponentiated weight.

Max Ent Models as Linear Models
ŷ = arg max_y log p(y | x)
  = arg max_y log ( exp(w · g(x, y)) / z(x) )
  = arg max_y w · g(x, y) − log z(x)
  = arg max_y w · g(x, y)

HMMs as Linear Models
ŷ = arg max_y log p(x, y)
  = arg max_y Σ_{i=1}^{n} [ log p(x_i | y_i) + log p(y_i | y_{i-1}) ] + log p(stop | y_n)
  = arg max_y Σ_{i=1}^{n} [ w_{y_i → x_i} + w_{y_{i-1} → y_i} ] + w_{y_n → stop}
  = arg max_y Σ_{y, x} w_{y → x} freq(y → x; y, x) + Σ_{y, y'} w_{y → y'} freq(y → y'; y)
  = arg max_y w · g(x, y)
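The correspondence can be checked numerically against the toy HMM defined earlier: set each weight to the matching log-probability and w · g(x, y) reproduces log p(x, y). The feature function below extends the earlier toy g with a stop-transition feature (again, the names are my own):

```python
import math

def g_hmm(x, y):
    """Counts of emission and transition events, including the final stop transition."""
    feats, prev = {}, "start"
    for word, tag in zip(x, y):
        feats[("emit", tag, word)] = feats.get(("emit", tag, word), 0) + 1
        feats[("trans", prev, tag)] = feats.get(("trans", prev, tag), 0) + 1
        prev = tag
    feats[("trans", prev, "stop")] = feats.get(("trans", prev, "stop"), 0) + 1
    return feats

# Weights = log-probabilities from the toy transition and emission tables defined earlier.
w = {("emit", t, word): math.log(p) for t, row in emission.items() for word, p in row.items()}
w.update({("trans", a, b): math.log(p) for a, row in transition.items() for b, p in row.items()})

x, y = ["the", "dog"], ["D", "N"]
linear_score = sum(w[f] * c for f, c in g_hmm(x, y).items())
print(abs(linear_score - math.log(hmm_joint(x, y, transition, emission))) < 1e-9)   # True
```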

Running Example IOB sequence labeling, here applied to NER. Often solved with HMMs, CRFs, M³Ns.
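As a small illustrative example of the IOB encoding for NER (sentence and tags invented for illustration): B marks the first token of an entity, I a continuation, and O a token outside any entity.

```python
tagged = [("Noah", "B"), ("Smith", "I"), ("teaches", "O"), ("at", "O"),
          ("Carnegie", "B"), ("Mellon", "I"), ("University", "I")]
```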

(What is Not A Linear Model?) Models with hidden variables: arg max_y p(y | x) = arg max_y Σ_z p(y, z | x). Models based on non-linear kernels: arg max_y w · g(x, y) = arg max_y Σ_{i=1}^{N} α_i K((x_i, y_i), (x, y)).

Decoding For HMMs, the decoding algorithm we usually think of first is the Viterbi algorithm. This is just one example. We will view decoding in five different ways, with sequence models as a running example. These views are not just for HMMs. Sometimes they will lead us back to Viterbi!

Five Views of Decoding

1. Probabilistic Graphical Models View the linguistic structure as a collection of random variables that are interdependent. Represent interdependencies as a directed or undirected graphical model. Conditional probability tables (BNs) or factors (MNs) encode the probability distribution.

Inference in Graphical Models General algorithm for exact MAP inference: variable elimination. Iteratively solve for the best values of each variable conditioned on values of preceding neighbors, then trace back. The Viterbi algorithm is an instance of max-product variable elimination!

MAP is Linear Decoding
Bayesian network: Σ_i log p(x_i | parents(x_i)) + Σ_j log p(y_j | parents(y_j))
Markov network: Σ_C log ϕ_C({x_i}_{i ∈ C}, {y_j}_{j ∈ C})
This only works if every variable is in X or Y.

Inference in Graphical Models Remember: more edges make inference more expensive; fewer edges mean stronger independence. Really pleasant: [diagram of a sparsely connected graph].

Inference in Graphical Models Remember: more edges make inference more expensive; fewer edges mean stronger independence. Really unpleasant: [diagram of a densely connected graph].

2. Polytopes

Parts Assume that the feature function g breaks down into local parts:
g(x, y) = Σ_{i=1}^{#parts(x)} f(π_i(x, y))
Each part has an alphabet of possible values. Decoding is choosing values for all parts, with consistency constraints. (In the graphical models view, a part is a clique.)

Example One part per word, each in {B, I, O}. No features look at multiple parts. Fast inference; not very expressive.

Example One part per bigram, each in {BB, BI, BO, IB, II, IO, OB, OO}. Features and constraints can look at pairs. Slower inference; a bit more expressive.

Geometric View Let z_{i,π} be 1 if part i takes value π and 0 otherwise. z is a vector in {0, 1}^N, where N = the total number of localized part values. Each z is a vertex of the unit cube.

Score is Linear in z
arg max_y w · g(x, y) = arg max_y w · Σ_{i=1}^{#parts(x)} f(π_i(x, y))
  = arg max_y w · Σ_{i=1}^{#parts(x)} Σ_{π ∈ Values(π_i)} f(π) 1{π_i(x, y) = π}
  "=" arg max_{z ∈ Z_x} w · Σ_{i=1}^{#parts(x)} Σ_{π ∈ Values(π_i)} f(π) z_{i,π}   (not really equal; need to transform back to get y)
  = arg max_{z ∈ Z_x} w · F_x z
  = arg max_{z ∈ Z_x} (w^⊤ F_x) z

Polyhedra Not all vertices of the N-dimensional unit cube satisfy the constraints. E.g., we can't have z_{1,BI} = 1 and z_{2,BI} = 1. Sometimes we can write down a small (polynomial) number of linear constraints on z. Result: linear objective, linear constraints, integer constraints.

Integer Linear Programming Very easy to add new constraints and non-local features. Many decoding problems have been mapped to ILP (sequence labeling, parsing, ...), but it's not always trivial. NP-hard in general, but there are packages that often work well in practice (e.g., CPLEX), specialized algorithms in some cases, and LP relaxation for approximate solutions.
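As an illustration of the mapping (not from the lecture), here is a sketch of bigram IOB decoding as an ILP using the PuLP package, assuming PuLP and its bundled CBC solver are available. Indicator variables z[i, (a, b)] pick one tag bigram per adjacent pair of positions, with agreement constraints between neighboring parts; part_score stands in for whatever scoring model supplies the weights.

```python
import pulp

def ilp_decode(n, tags, part_score):
    """Decode a length-n tag sequence by ILP over bigram parts.
    part_score(i, a, b): score of assigning tags (a, b) to positions (i, i+1)."""
    bigrams = [(a, b) for a in tags for b in tags]
    prob = pulp.LpProblem("decode", pulp.LpMaximize)
    z = pulp.LpVariable.dicts("z", [(i, ab) for i in range(n - 1) for ab in bigrams], cat="Binary")

    # Linear objective: total score of the chosen parts.
    prob += pulp.lpSum(part_score(i, a, b) * z[(i, (a, b))]
                       for i in range(n - 1) for (a, b) in bigrams)
    # Exactly one bigram value per adjacent pair of positions.
    for i in range(n - 1):
        prob += pulp.lpSum(z[(i, ab)] for ab in bigrams) == 1
    # Consistency: the second tag of part i must equal the first tag of part i+1.
    for i in range(n - 2):
        for t in tags:
            prob += (pulp.lpSum(z[(i, (a, t))] for a in tags)
                     == pulp.lpSum(z[(i + 1, (t, b))] for b in tags))

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    chosen = {i: ab for i in range(n - 1) for ab in bigrams if z[(i, ab)].value() > 0.5}
    return [chosen[0][0]] + [chosen[i][1] for i in range(n - 1)]
```

Extra constraints are one-liners in this view; for example, forbidding the invalid O-to-I transition just means dropping (or zeroing out) the ("O", "I") bigram.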

Remark Graphical models assumed a probabilistic interpretation (though they are not always learned using a probabilistic interpretation!). The polytope view is agnostic about how you interpret the weights; it only says that the decoding problem is an ILP.