Soft Inference and Posterior Marginals. September 19, 2013

Soft vs. Hard Inference

Hard inference: give me a single solution. Examples: the Viterbi algorithm; maximum spanning tree (Chu-Liu-Edmonds algorithm).

Soft inference:
Task 1: compute a distribution over outputs.
Task 2: compute functions of that distribution: marginal probabilities, expected values, entropies, divergences.

Why Soft Inference?

Useful applications of posterior distributions:
- Entropy: how confused is the model? How confused is the model about its prediction at time i?
- Expectations: what is the expected number of words in a translation of this sentence? What is the expected number of times a word ending in -ed was tagged as something other than a verb?
- Posterior marginals: given some input, how likely is it that some (latent) event of interest happened?

String Marginals

Inference question for HMMs: what is the probability of a string w?
Answer: generate all possible tag sequences and explicitly marginalize. O(|Λ|^|w|) time.

Example HMM (tag set Λ = {DET, ADJ, NN, V}):

Initial probabilities:
DET 0.5   ADJ 0.1   NN 0.3   V 0.1

Transition probabilities p(next tag | current tag), one row per current tag (STOP ends the sequence):

current   →DET   →ADJ   →NN   →V    →STOP
DET       0.0    0.3    0.7   0.0   0.0
ADJ       0.0    0.2    0.7   0.1   0.0
NN        0.0    0.1    0.3   0.4   0.2
V         0.5    0.1    0.2   0.1   0.1

Emission probabilities:
DET: the 0.7, a 0.3
ADJ: green 0.1, big 0.4, old 0.4, might 0.1
NN: book 0.3, plants 0.2, people 0.2, person 0.1, John 0.1, watch 0.1
V: might 0.2, watch 0.3, watches 0.2, loves 0.1, reads 0.19, books 0.01

Examples:
John/NN might/V watch/V
the/DET old/ADJ person/NN loves/V big/ADJ books/NN

Pr(x, y) for John might watch, enumerated over all 4³ = 64 tag sequences y. Every sequence has probability 0.0 except:

NN ADJ NN   0.0000042
NN ADJ V    0.0000009
NN V NN     0.0000096
NN V V      0.0000072

Summing over all 64 tag sequences gives the string marginal: p = 0.0000219.
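
The brute-force computation above can be reproduced directly. The sketch below (Python; the dictionaries transcribe the HMM tables as reconstructed above, and all function and variable names are my own) enumerates every tagging of "John might watch" and sums the joint probabilities.

```python
from itertools import product

TAGS = ["DET", "ADJ", "NN", "V"]

initial = {"DET": 0.5, "ADJ": 0.1, "NN": 0.3, "V": 0.1}

# trans[q][r] = p(next tag r | current tag q); "STOP" ends the sequence.
trans = {
    "DET": {"DET": 0.0, "ADJ": 0.3, "NN": 0.7, "V": 0.0, "STOP": 0.0},
    "ADJ": {"DET": 0.0, "ADJ": 0.2, "NN": 0.7, "V": 0.1, "STOP": 0.0},
    "NN":  {"DET": 0.0, "ADJ": 0.1, "NN": 0.3, "V": 0.4, "STOP": 0.2},
    "V":   {"DET": 0.5, "ADJ": 0.1, "NN": 0.2, "V": 0.1, "STOP": 0.1},
}

# emit[q][w] = p(word w | tag q); unlisted words have probability 0.
emit = {
    "DET": {"the": 0.7, "a": 0.3},
    "ADJ": {"green": 0.1, "big": 0.4, "old": 0.4, "might": 0.1},
    "NN":  {"book": 0.3, "plants": 0.2, "people": 0.2,
            "person": 0.1, "John": 0.1, "watch": 0.1},
    "V":   {"might": 0.2, "watch": 0.3, "watches": 0.2,
            "loves": 0.1, "reads": 0.19, "books": 0.01},
}

def joint(words, tags):
    """p(x, y): initial * emissions * transitions * final stop transition."""
    p = initial[tags[0]] * emit[tags[0]].get(words[0], 0.0)
    for t in range(1, len(words)):
        p *= trans[tags[t - 1]][tags[t]] * emit[tags[t]].get(words[t], 0.0)
    return p * trans[tags[-1]]["STOP"]

words = ["John", "might", "watch"]
total = sum(joint(words, tags) for tags in product(TAGS, repeat=len(words)))
print(total)  # ~2.19e-05, matching the marginal computed above
```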

Weighted Logic Programming

Slightly different notation than the textbook, but you will see it in the literature. WLP is useful here because it lets us build hypergraphs. A weighted inference rule derives a consequent item I : w from antecedent items I_1 : w_1, ..., I_k : w_k:

  I_1 : w_1    I_2 : w_2    ...    I_k : w_k
  --------------------------------------------
                    I : w

Weighted Logic Programming (figure): the same rule drawn as a hyperedge whose tail nodes are I_1, I_2, ..., I_k and whose head node is I : w.

Hypergraphs (figures): inference rules drawn as hyperedges. Each hyperedge connects a set of tail nodes (e.g., A, B, C) to a head node I; heads can in turn serve as tails of further hyperedges leading to nodes X, Y, and U.

Viterbi Algorithm (as a weighted deduction system)

Item form: [q, i]
Axiom: [start, 0] : 1
Goal: [stop, |x| + 1]

Inference rules:

  [q, i] : w
  ----------------------------------------
  [r, i+1] : w ⊗ (q → r) ⊗ (r ↓ x_{i+1})

  [q, |x|] : w
  ------------------------------
  [stop, |x|+1] : w ⊗ (q → stop)

where (q → r) is the transition weight and (r ↓ x_i) is the weight of state r emitting x_i.

Viterbi Algorithm, run on w = (John, might, watch). Goal: [stop, 4]
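
A dynamic-programming sketch of this deduction system in Python, reusing the example HMM tables defined in the earlier code block (helper names are mine):

```python
def viterbi(words):
    """Best tag sequence under the example HMM (max in place of sum)."""
    # delta[t][q] = score of the best path ending in state q after words[t]
    delta = [{q: initial[q] * emit[q].get(words[0], 0.0) for q in TAGS}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for r in TAGS:
            best_q = max(TAGS, key=lambda q: delta[t - 1][q] * trans[q][r])
            delta[t][r] = (delta[t - 1][best_q] * trans[best_q][r]
                           * emit[r].get(words[t], 0.0))
            back[t][r] = best_q
    # Goal item [stop, |x|+1]: pick the best final state and follow backpointers.
    last = max(TAGS, key=lambda q: delta[-1][q] * trans[q]["STOP"])
    best_score = delta[-1][last] * trans[last]["STOP"]
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return list(reversed(tags)), best_score

print(viterbi(["John", "might", "watch"]))  # (['NN', 'V', 'NN'], ~9.6e-06)
```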

String Marginals

Inference question for HMMs: what is the probability of a string w?
Answer: generate all possible tag sequences and explicitly marginalize. O(|Λ|^|w|) time.
Answer: use the forward algorithm. O(|Λ|² · |w|) time and O(|Λ| · |w|) space.

Forward Algorithm

Instead of computing a max over the inputs at each node, use addition. Same run time, same space requirements.
Viterbi cell interpretation: what is the score of the best path through the lattice ending in state q at time i?
What does a forward node weight correspond to?

Forward Algorithm: Recurrence

α_0(start) = 1
α_t(y) = Σ_{q ∈ Λ} (q → y) · (y ↓ x_t) · α_{t−1}(q)
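
The recurrence in code, as a sketch reusing the example HMM tables above (names are mine); it returns both the chart and the string marginal:

```python
def forward(words):
    """Forward chart: alpha[t][q] = total probability of the prefix up to
    position t (0-based) with the HMM in state q after emitting words[t]."""
    alpha = [{q: initial[q] * emit[q].get(words[0], 0.0) for q in TAGS}]
    for t in range(1, len(words)):
        alpha.append({
            r: sum(alpha[t - 1][q] * trans[q][r] for q in TAGS)
               * emit[r].get(words[t], 0.0)
            for r in TAGS
        })
    # String marginal: send every final state to STOP and sum.
    z = sum(alpha[-1][q] * trans[q]["STOP"] for q in TAGS)
    return alpha, z

alpha, z = forward(["John", "might", "watch"])
print(z)  # ~2.19e-05: the same marginal as brute-force enumeration, in O(|Λ|² |w|) time
```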

Forward Chart (figure: the trellis is filled one column at a time, left to right, for i = 1, 2, ...)

α_t(q) = p(start, x_1, ..., x_t, y_t = q)

Forward chart for John might watch:

        John (1)   might (2)   watch (3)
DET     0.0        0.0         0.0
ADJ     0.0        0.0003      0.0
NN      0.03       0.0         0.000069
V       0.0        0.0024      0.000081

Combining the final column with the stop transitions gives p = 0.0000219.

Posterior Marginals

Marginal inference questions for HMMs:

Given x, what is the probability of being in state q at time i?
  p(x_1, ..., x_i, y_i = q | y_0 = start) · p(x_{i+1}, ..., x_{|x|} | y_i = q)

Given x, what is the probability of transitioning from state q to state r at time i?
  p(x_1, ..., x_i, y_i = q | y_0 = start) · (q → r) · (r ↓ x_{i+1}) · p(x_{i+2}, ..., x_{|x|} | y_{i+1} = r)

Backward Algorithm

Start at the goal node(s) and work backwards through the hypergraph.
What is the probability in the goal node's cell? What if there is more than one goal cell? What is the value of the axiom cell?

Backward Algorithm: Recurrence

β_{|x|+1}(stop) = 1
β_i(q) = Σ_{r ∈ Λ} β_{i+1}(r) · (r ↓ x_{i+1}) · (q → r)
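
A sketch of the backward chart for the same example HMM (the base case folds in the stop transition, matching the definition of β used on the following slides; names are mine):

```python
def backward(words):
    """Backward chart: beta[t][q] = p(words[t+1:], stop | state q at position t)."""
    n = len(words)
    beta = [dict() for _ in range(n)]
    # Base case: after the last word, the only remaining event is the stop transition.
    beta[n - 1] = {q: trans[q]["STOP"] for q in TAGS}
    for t in range(n - 2, -1, -1):
        beta[t] = {
            q: sum(trans[q][r] * emit[r].get(words[t + 1], 0.0) * beta[t + 1][r]
                   for r in TAGS)
            for q in TAGS
        }
    return beta

beta = backward(["John", "might", "watch"])
# The same string marginal, now assembled from the left edge:
print(sum(initial[q] * emit[q].get("John", 0.0) * beta[0][q] for q in TAGS))  # ~2.19e-05
```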

Backward Chart (figure: the trellis is filled one column at a time, right to left, starting from i = 5)

β_t(q) = p(x_{t+1}, ..., x_{|x|}, stop | y_t = q)

Forward-Backward

Compute the forward chart: α_t(q) = p(start, x_1, ..., x_t, y_t = q)
Compute the backward chart: β_t(q) = p(x_{t+1}, ..., x_{|x|}, stop | y_t = q)
What is α_t(q) · β_t(q)?

Forward-Backward

Answer: α_t(q) · β_t(q) = p(x, y_t = q), the joint probability of the whole observed string and of being in state q at time t.
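
A quick numerical check of this identity using the earlier forward() and backward() sketches; summing the product over q at any time step recovers the same string marginal:

```python
words = ["John", "might", "watch"]
alpha, z = forward(words)
beta = backward(words)
for t in range(len(words)):
    # p(x, y_t = q) = alpha_t(q) * beta_t(q); summing over q at any t recovers p(x).
    assert abs(sum(alpha[t][q] * beta[t][q] for q in TAGS) - z) < 1e-12
print(alpha[1]["V"] * beta[1]["V"] / z)  # posterior p(y_2 = V | x), ~0.767
```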

Edge Marginals

What is the probability that x was generated and the transition q → r happened at time t?
  p(x_1, ..., x_t, y_t = q | y_0 = start) · (q → r) · (r ↓ x_{t+1}) · p(x_{t+2}, ..., x_{|x|} | y_{t+1} = r)

Edge Marginals

In chart notation: α_t(q) · (q → r) · (r ↓ x_{t+1}) · β_{t+1}(r)
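
Dividing by p(x) turns this quantity into a posterior marginal. A sketch, again reusing the earlier forward() and backward() (names mine):

```python
def edge_marginals(words):
    """Posterior edge marginals p(y_t = q, y_{t+1} = r | x), one dict per position t."""
    alpha, z = forward(words)
    beta = backward(words)
    return [
        {(q, r): alpha[t][q] * trans[q][r] * emit[r].get(words[t + 1], 0.0)
                 * beta[t + 1][r] / z
         for q in TAGS for r in TAGS}
        for t in range(len(words) - 1)
    ]

edges = edge_marginals(["John", "might", "watch"])
print(edges[1][("V", "NN")])  # posterior probability of a V -> NN transition at might -> watch, ~0.438
```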

Forward-Backward (figure: the forward chart, filled from the left, and the backward chart, filled from the right, meet over the columns i = 1, ..., 5)

Generic Inference

Semirings are useful structures in abstract algebra:
- a set of values
- addition, with additive identity 0: a + 0 = a
- multiplication, with multiplicative identity 1: a * 1 = a
- also: a * 0 = 0
- distributivity: a * (b + c) = a * b + a * c
Not required: commutativity, inverses

So What?

You can unify Forward and Viterbi by changing the semiring:

  forward(G) = ⊕_{π ∈ G} ⊗_{e ∈ π} w[e]

the semiring sum, over every path π through the lattice G, of the semiring product of the edge weights along π.

Semiring Inside
- Probability semiring: marginal probability of the output
- Counting semiring: number of paths ("taggings")
- Viterbi semiring: best-scoring derivation
- Log semiring, with w[e] = wᵀ f(e): log(Z), the log partition function
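
A sketch of the same point in code: one chart-filling routine parameterized by the semiring operations (⊕, ⊗, 0). Instantiating it with (+, ×) recovers the forward marginal; with (max, ×) it recovers the Viterbi score. Function and parameter names are mine, and the example HMM tables come from the earlier sketches.

```python
def generic_forward(words, plus, times, zero):
    """One chart recursion, parameterized by a semiring (plus, times, zero)."""
    chart = [{q: times(initial[q], emit[q].get(words[0], zero)) for q in TAGS}]
    for t in range(1, len(words)):
        row = {}
        for r in TAGS:
            acc = zero
            for q in TAGS:
                acc = plus(acc, times(chart[t - 1][q], trans[q][r]))
            row[r] = times(acc, emit[r].get(words[t], zero))
        chart.append(row)
    goal = zero
    for q in TAGS:
        goal = plus(goal, times(chart[-1][q], trans[q]["STOP"]))
    return goal

w = ["John", "might", "watch"]
print(generic_forward(w, lambda a, b: a + b, lambda a, b: a * b, 0.0))  # forward: ~2.19e-05
print(generic_forward(w, max, lambda a, b: a * b, 0.0))                 # Viterbi: ~9.6e-06
```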

Semiring Edge-Marginals
- Probability semiring: posterior marginal probability of each edge
- Counting semiring: number of paths going through each edge
- Viterbi semiring: score of the best path going through each edge
- Log semiring: log of the sum of exponentiated weights of all paths containing e = log(posterior marginal probability) + log(Z)

Max-Marginal Pruning
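
The transcription carries no text for this slide; as a hedged sketch of the idea: running forward and backward in the Viterbi (max, ×) semiring gives, for each state and time, the score of the best complete path constrained to pass through it (its max-marginal), and states whose max-marginal falls far below the global best can be pruned. The threshold and all names below are my own.

```python
def max_marginals(words):
    """Viterbi-semiring analogue of forward-backward: for each (t, q), the score
    of the best complete path that passes through state q at position t."""
    n = len(words)
    vf = [{q: initial[q] * emit[q].get(words[0], 0.0) for q in TAGS}]        # max-forward
    for t in range(1, n):
        vf.append({r: max(vf[t - 1][q] * trans[q][r] for q in TAGS)
                      * emit[r].get(words[t], 0.0)
                   for r in TAGS})
    vb = [dict() for _ in range(n)]                                          # max-backward
    vb[n - 1] = {q: trans[q]["STOP"] for q in TAGS}
    for t in range(n - 2, -1, -1):
        vb[t] = {q: max(trans[q][r] * emit[r].get(words[t + 1], 0.0) * vb[t + 1][r]
                        for r in TAGS)
                 for q in TAGS}
    return [{q: vf[t][q] * vb[t][q] for q in TAGS} for t in range(n)]

mm = max_marginals(["John", "might", "watch"])
best = max(mm[0].values())  # equals the Viterbi score, ~9.6e-06
# Prune states whose max-marginal falls below the best path by, say, a factor of 100.
kept = [{q for q in TAGS if mm[t][q] >= best / 100} for t in range(len(mm))]
print(kept)  # e.g. [{'NN'}, {'ADJ', 'V'}, {'NN', 'V'}]; DET is pruned everywhere
```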

Generalizing Forward-Backward

Forward/backward algorithms are a special case of inside/outside algorithms.
It's helpful to think of inside/outside as algorithms on PCFG parse forests, but they are more general.
Recall the five views of decoding: decoding is parsing. More specifically, decoding builds a weighted proof forest.

CKY Algorithm (as a weighted deduction system)

Item form: [X, i, j]
Axioms: [N, i, i+1] : w for each rule (N → x_i) ∈ G with weight w
Goal: [S, 1, |x| + 1]

Inference rule:

  [X, i, k] : u    [Y, k, j] : v
  --------------------------------   for each rule (Z → X Y) ∈ G with weight w
  [Z, i, j] : u ⊗ v ⊗ w

Posterior Marginals

Marginal inference questions for PCFGs:
Given w, what is the probability of having a constituent of type Z from i to j?
Given w, what is the probability of having a constituent of any type from i to j?
Given w, what is the probability of using rule Z → X Y to derive the span from i to j?

Inside Algorithm (figure: a parse tree with S spanning x_1 ... x_|x| and the nonterminal N spanning x_i ... x_j)

β_{[i,j]}(N) = p(x_i, x_{i+1}, ..., x_j | N; G)

CKY Inside Algorithm

Base case(s): β_{[i,i+1]}(Z) = p(Z → x_i)
Recurrence: β_{[i,j]}(Z) = Σ_{k=i+1}^{j−1} Σ_{(Z → X Y) ∈ G} β_{[i,k]}(X) · β_{[k,j]}(Y) · p(Z → X Y)
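
A sketch of this recurrence in Python. The toy CNF grammar below is my own invention (the slides do not give one), and the code uses 0-based fencepost spans, so the item [X, i, j] covers words[i:j].

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (probabilities conditioned on the left-hand side).
unary = {                         # preterminal rules N -> word
    ("NN", "John"): 0.5, ("NN", "watch"): 0.5,
    ("V", "might"): 0.4, ("V", "watch"): 0.6,
}
binary = {                        # rules Z -> X Y
    ("S", "NN", "VP"): 1.0,
    ("VP", "V", "V"): 0.3, ("VP", "V", "NN"): 0.7,
}

def inside(words):
    """beta[(X, i, j)] = probability that X derives words[i:j] under the toy grammar."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):                 # base case: items [N, i, i+1]
        for (N, word), p in unary.items():
            if word == w:
                beta[(N, i, i + 1)] += p
    for width in range(2, n + 1):                 # wider spans from narrower ones
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # split point
                for (Z, X, Y), p in binary.items():
                    beta[(Z, i, j)] += beta[(X, i, k)] * beta[(Y, k, j)] * p
    return beta

beta = inside(["John", "might", "watch"])
print(beta[("S", 0, 3)])  # inside probability of the whole sentence under the toy PCFG
```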

Generic Inside (figure: a hypergraph in which hyperedges e_1, e_2, e_3 all point into node I, with tail nodes drawn from A, B, C, X, Y, U)

B(I) = {e_1, e_2, e_3} is the set of hyperedges whose head is I; t(e_1) = ⟨A, B, C⟩ is the tail of e_1.

Questions for Generic Inside
- Probability semiring: marginal probability of the input
- Counting semiring: number of paths (parses, labelings, etc.)
- Viterbi semiring: Viterbi probability (max joint probability)
- Log semiring: log Z(input)

Outside probabilities: decomposing the problem

(figure: a parse over w_1 ... w_m in which N^f spans w_p ... w_e, N^j spans w_p ... w_q, and N^g spans w_{q+1} ... w_e)

The shaded area represents the outside probability α_j(p, q), which we need to calculate. How can this be decomposed?

Outside probabilities: decomposing the problem

Step 1: We assume that N^f_{pe} is the parent of N^j_{pq}. Its outside probability, α_f(p, e) (the yellow shading), is available recursively. How do we calculate the cross-hatched probability?

Outside probabilities: decomposing the problem

Step 2: The red shaded area is the inside probability of N^g_{(q+1)e}, which is available as β_g(q+1, e).

Outside probabilities: decomposing the problem

Step 3: The blue shaded part corresponds to the production N^f → N^j N^g, which, because of the context-freeness of the grammar, does not depend on the positions of the words. Its probability is simply P(N^f → N^j N^g | N^f, G) and is available from the PCFG without calculation.
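
Assembling the three steps yields the standard outside recurrence in this notation (the second term, not drawn in the figures, is the symmetric case in which N^j is the right child of its parent):

α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q+1, e)
          + Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p−1)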

Generic Outside

Generic Inside-Outside

Inside-Outside

Inside probabilities are required to compute outside probabilities.
Inside-outside works where forward-backward does, but not vice versa.
Implementation considerations: building a hypergraph explicitly simplifies code, but it can be expensive in terms of memory.