Maschinelle Sprachverarbeitung: Parsing with Probabilistic Context-Free Grammar. Ulf Leser

Content of this Lecture: Phrase-Structure Parse Trees, Probabilistic Context-Free Grammars, Parsing with PCFG, Other Issues in Parsing.

Parsing Sentences. POS tagging studies the plain sequence of words in a sentence, but sentences have richer, non-consecutive structures. Plenty of linguistic theories exist about the nature and representation of these structures / units / phrases. Here: phrase structure grammars.
[Figure: parse tree of "The astronomer saw the star with a telescope" with phrase labels S, NP, VP, V, P, PP]

Phrase Structure Grammar. Builds on the following assumptions: sentences consist of nested structures; there is a fixed set of different structures (phrase types); nesting can be described by a context-free grammar.
1: S → NP VP      7: NP → NP PP
2: PP → P NP      8: NP → astronomer
3: VP → V NP      9: NP → telescope
4: VP → VP PP     10: NP → star
5: P → with
6: V → saw
[Figure: parse tree of "astronomer saw star with telescope", with each node annotated by the number of the rule that produced it]

Ambiguity? [Figure: two parse trees of "astronomer saw star with ring", one attaching the PP "with ring" to the NP "star", one to the VP]

Problem 1: Ambiguity! [Figure: two parse trees of "astronomer saw man with telescope": one attaches the PP "with telescope" to the NP "man", the other to the VP]

Problem 2: Syntax versus Semantics. Phrase structure grammars only capture syntax.
1: S → NP VP, 2: PP → P NP, 3: VP → V NP, 4: VP → VP PP, 5: P → with, 6: V → saw, 7: NP → NP PP, 8: NP → astronomer, 9: NP → telescope, 10: NP → star, plus V → ate and NP → moon, NP → cat, NP → cream.
The grammar also derives semantically meaningless sentences such as "moon ate cat with cream", "telescope ate moon with cat", or "telescope saw cream with star".

Content of this Lecture: Phrase-Structure Parse Trees, Probabilistic Context-Free Grammars, Parsing with PCFG, Other Issues in Parsing.

Probabilistic Context-Free Grammars (PCFG). Also called stochastic context-free grammars. Idea: context-free grammars with rule probabilities. Every rule gets a non-zero probability of firing; the grammar still recognizes the same language, but different parses usually have different probabilities. Usages: find the parse with the highest probability (most probable meaning); detect ambiguous sentences (>1 parses with similar probability); compute the overall probability of a sentence given a grammar (the sum of the probabilities of all derivations producing the sentence); language models: predict the most probable next token allowed by the grammar in an incomplete sentence.

POS Tagging versus Parsing. Example: "The velocity of the seismic waves rises to ...". Difficult for a POS tagger: "waves" is plural but the adjacent verb "rises" is singular (agreement is with the distant head "velocity"). Simple for a PCFG.

More Formal Definition. A PCFG is a 5-tuple (W, N, S, R, p) with:
W, a set of terminals (words) w_1, w_2, ...
N, a set of non-terminals (phrase types) N_1, N_2, ...
S, a designated start symbol
R, a set of rules N_i → ϕ, where ϕ is a sequence of terminals and/or non-terminals
p, a function assigning a non-zero probability to every rule such that ∀i: Σ_j p(N_i → ϕ_j) = 1
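
To make the definition concrete, here is a minimal sketch (not from the lecture) of how such a PCFG could be represented in code; the rule list `RULES` encodes the example grammar of the next slide, and `check_probabilities` verifies the constraint that the probabilities of every non-terminal's rules sum to 1. All names are illustrative.

```python
from collections import defaultdict

# Illustrative encoding: one (lhs, rhs, probability) triple per rule,
# with rhs a tuple of symbols (non-terminals or terminals).
RULES = [
    ("S",  ("NP", "VP"), 1.00),
    ("PP", ("P", "NP"),  1.00),
    ("VP", ("V", "NP"),  0.30),
    ("VP", ("VP", "PP"), 0.70),
    ("P",  ("with",),    1.00),
    ("V",  ("saw",),     1.00),
    ("NP", ("NP", "PP"), 0.80),
    ("NP", ("astronomer",), 0.10),
    ("NP", ("telescope",),  0.05),
    ("NP", ("man",),        0.05),
]

def check_probabilities(rules, tol=1e-9):
    """Check that for every non-terminal N_i the probabilities p(N_i -> phi_j) sum to 1."""
    totals = defaultdict(float)
    for lhs, _, prob in rules:
        totals[lhs] += prob
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_probabilities(RULES))   # every non-terminal should map to True
```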

Example. Rules with probabilities:
1: S → NP VP (1.00)      6: V → saw (1.00)
2: PP → P NP (1.00)      7: NP → NP PP (0.80)
3: VP → V NP (0.30)      8: NP → astronomer (0.10)
4: VP → VP PP (0.70)     9: NP → telescope (0.05)
5: P → with (1.00)       10: NP → man (0.05)
[Figure: the two parse trees of "astronomer saw man with telescope"; their probabilities p are computed on the next two slides]

Example 1. [Figure: parse tree t1 of "astronomer saw man with telescope" in which the PP "with telescope" attaches to the object NP "man"; the nodes are annotated with the rules used: 1, 8, 3, 6, 7, 10, 2, 5, 9]
p(t1) = 1 * 0.1 * 0.3 * 1 * 0.8 * 0.05 * 1 * 1 * 0.05 = 0.00006

Example 2. [Figure: parse tree t2 of "astronomer saw man with telescope" in which the PP "with telescope" attaches to the VP; rules used: 1, 8, 4, 3, 6, 10, 2, 5, 9]
p(t2) = 1 * 0.1 * 0.7 * 0.3 * 1 * 0.05 * 1 * 1 * 0.05 = 0.0000525
Hence t1, which attaches the PP to the NP, is the more probable parse.
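
The probability of a whole parse tree is simply the product of the probabilities of all rules used in its derivation. A small sketch of this computation, reusing the `RULES` list from the sketch above and assuming trees are encoded as nested (label, children) tuples (these conventions are mine, not the lecture's):

```python
RULE_PROB = {(lhs, rhs): prob for lhs, rhs, prob in RULES}

def tree_probability(tree):
    """Multiply the probabilities of all rules applied in the tree."""
    label, children = tree
    if isinstance(children, str):                 # pre-terminal: N -> word
        return RULE_PROB[(label, (children,))]
    rhs = tuple(child[0] for child in children)   # inner node: N -> labels of its children
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_probability(child)
    return prob

# Tree t2: the PP "with telescope" attaches to the VP.
t2 = ("S", [("NP", "astronomer"),
            ("VP", [("VP", [("V", "saw"), ("NP", "man")]),
                    ("PP", [("P", "with"), ("NP", "telescope")])])])
print(tree_probability(t2))   # ~5.25e-05, as computed for p(t2) above
```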

Implicit Assumptions. Context-free: the probability of the derivation of a subtree under a non-terminal N is independent of everything else in the tree (above N, left of N, right of N). Place-invariant: the probability of a given rule r is the same anywhere in the tree, i.e., the probability of a subtree is independent of its position in the sentence. Semantics-unaware: the probabilities of terminals do not depend on the co-occurring terminals in the sentence; semantic validity is not considered.

Usefulness (of a good PCFG). Trigram models are the better language models: they work at the word level with conditional probabilities of word sequences. PCFGs are a step towards resolving ambiguity, but not a complete solution due to the lack of semantics. PCFGs can produce robust parsers: when learned on a corpus with a few rare errors, these errors are cast into rules with low probability. PCFGs have some implicit biases (work-arounds are known), e.g., small trees get higher probabilities. State-of-the-art parsers combine PCFGs with additional formalized (semantic) knowledge.

Three Issues. Given a PCFG G and a sentence s ∈ L(G):
Issue 1, Decoding (or parsing): which chain of rules (derivation) from G produces s with the highest probability?
Issue 2, Evaluation: what is the overall probability of s given G?
Given a context-free grammar G and a set of sentences S with their derivations in G:
Issue 3, Learning: which PCFG G' with the same rule set as G produces S with the highest probability?
We make our life simple: (1) G is given, (2) the sentences are already parsed. Removing assumption (2) leads to an EM algorithm; removing (1) is hard (structure learning). There are obvious relationships to the corresponding problems in HMMs.

Chomsky Normal Form. We only consider PCFGs with rules of the following forms (Chomsky Normal Form, CNF): N → w (non-terminal to terminal) and N → N N (non-terminal to two non-terminals). Note: for any CFG G, there exists a CFG G' in Chomsky Normal Form such that G and G' are weakly equivalent, i.e., they accept the same language (but with different derivations). Accordingly, a PCFG in CNF has at most |N|³ + |N|*|W| parameters.
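
A quick illustrative check (again using the `RULES` encoding from above; not part of the lecture) that a rule set has only the two CNF shapes:

```python
def is_cnf(rules):
    """True if every rule is either N -> terminal or N -> N1 N2 (with N1, N2 non-terminals)."""
    nonterminals = {lhs for lhs, _, _ in rules}
    for _, rhs, _ in rules:
        unary_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
        binary_nonterminals = len(rhs) == 2 and all(sym in nonterminals for sym in rhs)
        if not (unary_terminal or binary_nonterminals):
            return False
    return True

print(is_cnf(RULES))   # True for the example grammar
```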

Issue 3: Learning. Given a context-free grammar G and a set of sentences S with their derivations in G: which PCFG G' with the same rule set as G produces S with the highest probability? A simple Maximum Likelihood approach will do:
p(N_i → ϕ_j) = c(N_i → ϕ_j) / c(N_i → *)
where c(·) counts the occurrences of a rule in the set of derivations and * stands for any right-hand side.
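
A minimal sketch of this Maximum Likelihood estimate, under the assumption that the treebank has already been flattened into a list of (lhs, rhs) rule occurrences (the function name and data layout are mine):

```python
from collections import Counter

def estimate_pcfg(rule_occurrences):
    """p(N_i -> phi_j) = c(N_i -> phi_j) / c(N_i -> *), counted over all derivations."""
    rule_counts = Counter(rule_occurrences)                      # c(N_i -> phi_j)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)     # c(N_i -> *)
    return {(lhs, rhs): count / lhs_counts[lhs]
            for (lhs, rhs), count in rule_counts.items()}

# Toy input: rule occurrences read off two tiny derivations.
occurrences = [("VP", ("V", "NP")), ("VP", ("V", "NP")), ("VP", ("VP", "PP")),
               ("NP", ("astronomer",)), ("NP", ("telescope",))]
print(estimate_pcfg(occurrences))
# e.g. p(VP -> V NP) = 2/3, p(VP -> VP PP) = 1/3, p(NP -> astronomer) = 1/2, ...
```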

Content of this Lecture: Phrase-Structure Parse Trees, Probabilistic Context-Free Grammars, Parsing with PCFG, Other Issues in Parsing.

Issue 2: Evaluation. Given a PCFG G and a sentence s ∈ L(G): what is the overall probability of s given G? We did not discuss this problem for HMMs, but for PCFGs it is simpler to derive parsing from evaluation. Naïve approach: find all derivations of s and sum up their probabilities. Problem: there can be exponentially many derivations. We give a dynamic-programming-based algorithm.

Idea. Recall that our PCFG builds on a CFG in CNF. Definition: the inside probability of a sub-sentence w_p ... w_q being produced by a non-terminal N_i is defined as
β_i(p,q) = P(w_pq | N_i,pq, G)
where w_pq is the sub-sentence of s from token w_p at position p to token w_q at position q, and N_i,pq is the non-terminal N_i producing w_pq. From now on, we omit the G. For a sentence with n tokens we search β_S(1,n).

Induction. We compute β_S(1,n) by induction over the length of the sub-sentences.
Start: assume p = q (sub-sentence of length 1). Since the grammar is in CNF, the rule producing w_pp must have the form N_i,pp → w_pp, so β_i(p,p) = P(w_pp | N_i,pp) = p(N_i → w_p). This is a parameter of G and can be looked up for all (i,p).
Induction: assume p < q. Since we are in CNF, the derivation must split N_i,pq into two non-terminals N_r,pd and N_s,(d+1)q for some d with p ≤ d < q, where N_r,pd produces w_p ... w_d and N_s,(d+1)q produces w_(d+1) ... w_q. And we already know all β_i(a,b) with (b-a) < (q-p).

Derivation.
β_i(p,q) = P(w_pq | N_i,pq)
= Σ_{r,s} Σ_{d=p..q-1} P(w_pd, N_r,pd, w_(d+1)q, N_s,(d+1)q | N_i,pq)
= Σ_{r,s} Σ_{d=p..q-1} P(N_r,pd, N_s,(d+1)q | N_i,pq) * P(w_pd | N_r,pd, N_s,(d+1)q, N_i,pq) * P(w_(d+1)q | w_pd, N_r,pd, N_s,(d+1)q, N_i,pq)
= Σ_{r,s} Σ_{d=p..q-1} P(N_r,pd, N_s,(d+1)q | N_i,pq) * P(w_pd | N_r,pd) * P(w_(d+1)q | N_s,(d+1)q)
= Σ_{r,s} Σ_{d=p..q-1} p(N_i → N_r N_s) * β_r(p,d) * β_s(d+1,q)
[Figure: N_i,pq expands into N_r,pd covering w_p ... w_d and N_s,(d+1)q covering w_(d+1) ... w_q]

Derivation, annotated. The outer sum runs over all possible combinations of N_r/N_s producing the two subtrees; the inner sum over the split point d lets each subtree span between 1 and q-p tokens, so that together they cover w_p ... w_q. The second step is the chain rule of conditional probabilities. The third step uses the independence (context-freeness) assumptions of the CFG. The last step substitutes the rule probability p(N_i → N_r N_s) and the inside probabilities β_r(p,d) and β_s(d+1,q).
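
A compact sketch of this inside computation, essentially a CYK chart filled with sums instead of maxima. It assumes the (lhs, rhs, prob) rule encoding from the earlier sketches and uses 1-based, inclusive indices as on the slides; `inside` and `beta` are illustrative names.

```python
from collections import defaultdict

def inside(rules, sentence):
    """Inside probabilities beta[(i, p, q)] = P(w_p..w_q | N_i) for a PCFG in CNF."""
    n = len(sentence)
    beta = defaultdict(float)
    # Start: spans of length 1, rules N -> w.
    for p, word in enumerate(sentence, start=1):
        for lhs, rhs, prob in rules:
            if rhs == (word,):
                beta[(lhs, p, p)] += prob
    # Induction over span length: binary rules N_i -> N_r N_s, split point d.
    for length in range(2, n + 1):
        for p in range(1, n - length + 2):
            q = p + length - 1
            for lhs, rhs, prob in rules:
                if len(rhs) != 2:
                    continue
                r, s = rhs
                for d in range(p, q):
                    beta[(lhs, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta
```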

Example: "astronomer saw man with telescope". Rules with probabilities:
1: S → NP VP (1.00)      7: NP → NP PP (0.40)
2: PP → P NP (1.00)      8: NP → astronomer (0.10)
3: VP → V NP (0.70)      9: NP → telescope (0.18)
4: VP → VP PP (0.30)     10: NP → man (0.18)
5: P → with (1.00)       11: NP → saw (0.04)
6: V → saw (1.00)        12: NP → ears (0.10)
Initialization (spans of length 1, the diagonal of the chart):
β_NP(1,1) = 0.1; β_V(2,2) = 1, β_NP(2,2) = 0.04; β_NP(3,3) = 0.18; β_P(4,4) = 1; β_NP(5,5) = 0.18.

Example (continued). Spans of length 2: the cell (1,2) "astronomer saw" stays empty, since there is no rule X → NP V or X → NP NP. The cell (2,3) "saw man" must use VP → V NP with p = 0.7: β_VP(2,3) = 0.7 * 1 * 0.18 = 0.126. The cell (3,4) "man with" stays empty. The cell (4,5) "with telescope": β_PP(4,5) = 1 * 1 * 0.18 = 0.18.

Example (continued). Spans of length 3: the cell (1,3) "astronomer saw man" uses S → NP VP: β_S(1,3) = 1 * 0.1 * 0.126 = 0.0126. The cell (3,5) "man with telescope" uses NP → NP PP: β_NP(3,5) = 0.4 * 0.18 * 0.18 = 0.01296. The cell (2,4) stays empty.

Example (continued). Spans of length 4 and 5: the cell (2,5) "saw man with telescope" collects two derivations of VP, one via VP → V NP with the PP inside the object NP, one via VP → VP PP:
β_VP(2,5) = β1 + β2 = 0.7 * 1 * 0.01296 + 0.3 * 0.126 * 0.18 = 0.009072 + 0.006804 = 0.015876.
The cell (1,5) then gives the overall probability of the sentence: β_S(1,5) = 1 * 0.1 * 0.015876 = 0.0015876.
[Figure: the two VP subtrees over "saw man with telescope", one attaching "with telescope" to the NP "man", one to the VP]
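
Running the `inside` sketch from above on this grammar reproduces the chart entries; `EX_RULES` below re-encodes the twelve rules of this example (names and layout are again illustrative):

```python
EX_RULES = [
    ("S",  ("NP", "VP"), 1.00), ("PP", ("P", "NP"),  1.00),
    ("VP", ("V", "NP"),  0.70), ("VP", ("VP", "PP"), 0.30),
    ("P",  ("with",),    1.00), ("V",  ("saw",),     1.00),
    ("NP", ("NP", "PP"), 0.40), ("NP", ("astronomer",), 0.10),
    ("NP", ("telescope",), 0.18), ("NP", ("man",), 0.18),
    ("NP", ("saw",), 0.04), ("NP", ("ears",), 0.10),
]

beta = inside(EX_RULES, ["astronomer", "saw", "man", "with", "telescope"])
print(round(beta[("VP", 2, 3)], 6))   # 0.126
print(round(beta[("NP", 3, 5)], 6))   # 0.01296
print(round(beta[("VP", 2, 5)], 6))   # 0.015876
print(round(beta[("S", 1, 5)], 7))    # 0.0015876
```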

Note. This is the Cocke-Younger-Kasami (CYK) algorithm for parsing with context-free grammars, enriched with summations and multiplications for computing probabilities. Same complexity: O(n³ * |G|), where n is the sentence length and |G| is the number of rules in the grammar G.

Issue 1: Decoding / Parsing. Once evaluation is solved, parsing is simple: instead of summing over all derivations, we only keep the most probable derivation of a sub-sentence for each possible root. Let δ_i(p,q) be the probability of the most probable derivation of the sub-sentence p..q from a non-terminal root N_i. This gives
δ_i(p,q) = max_{r,s; d=p..q-1} P(w_pd, N_r,pd, w_(d+1)q, N_s,(d+1)q | N_i,pq)
= max_{r,s; d=p..q-1} p(N_i → N_r N_s) * δ_r(p,d) * δ_s(d+1,q)
We omit the induction start and the backtracing.
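
A sketch of this Viterbi-style variant of the chart computation, following the conventions of the `inside` sketch; `best_parse` additionally records backpointers so that the most probable tree can be read off (function names, the backpointer layout, and the tuple tree encoding are my assumptions, not the lecture's).

```python
def best_parse(rules, sentence):
    """delta[(i, p, q)]: probability of the best derivation of w_p..w_q from N_i, plus backpointers."""
    n = len(sentence)
    delta, back = {}, {}
    # Induction start: spans of length 1.
    for p, word in enumerate(sentence, start=1):
        for lhs, rhs, prob in rules:
            if rhs == (word,) and prob > delta.get((lhs, p, p), 0.0):
                delta[(lhs, p, p)] = prob
                back[(lhs, p, p)] = word
    # Larger spans: keep only the best rule and split point per (lhs, p, q).
    for length in range(2, n + 1):
        for p in range(1, n - length + 2):
            q = p + length - 1
            for lhs, rhs, prob in rules:
                if len(rhs) != 2:
                    continue
                r, s = rhs
                for d in range(p, q):
                    cand = prob * delta.get((r, p, d), 0.0) * delta.get((s, d + 1, q), 0.0)
                    if cand > delta.get((lhs, p, q), 0.0):
                        delta[(lhs, p, q)] = cand
                        back[(lhs, p, q)] = (r, s, d)
    return delta, back

def read_tree(back, label, p, q):
    """Backtracing: follow the backpointers to rebuild the best tree as nested tuples."""
    entry = back[(label, p, q)]
    if isinstance(entry, str):                    # pre-terminal
        return (label, entry)
    r, s, d = entry
    return (label, [read_tree(back, r, p, d), read_tree(back, s, d + 1, q)])

delta, back = best_parse(EX_RULES, ["astronomer", "saw", "man", "with", "telescope"])
print(delta[("S", 1, 5)])           # ~0.0009072: the NP-attachment reading wins
print(read_tree(back, "S", 1, 5))
```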

Content of this Lecture: Phrase-Structure Parse Trees, Probabilistic Context-Free Grammars, Parsing with PCFG, Other Issues in Parsing.

Treebanks. A treebank is a set of sentences (corpus) whose phrase structures are annotated; it serves as a training corpus for a PCFG. Not many exist; building one is a very costly, manual task. Most prominent: the Penn Treebank (Marcus, Marcinkiewicz, Santorini: "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330; ~5,500 citations), 2,499 stories from a three-year Wall Street Journal (WSJ) collection, roughly 1 million tokens, freely available. German treebanks: Deutsche Diachrone Baumbank (three historical periods, small); Tübinger Baumbank (38,000 sentences, 345,000 tokens).

Using Derivation History. Phrase structure grammars as described here are kind of simplistic. One idea for improvement: incorporate dependencies between non-terminals, since rule probabilities are not identical across all positions in a sentence. Trick: annotate the derivation of a non-terminal in its name (read: "generated from a ...") and learn different probabilities for the different derivations. [Figure: example rules with annotated non-terminals. Source: MS99; from the Penn Treebank]
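
A common instance of this trick is parent annotation: every non-terminal is relabeled with the category that generated it (e.g. NP^S vs. NP^VP), and rule probabilities are then estimated separately per annotated label. A minimal sketch, again assuming trees as nested (label, children) tuples; this illustrates the general idea, not the exact scheme of MS99:

```python
def annotate_parents(tree, parent="ROOT"):
    """Relabel each inner non-terminal with its parent category, e.g. VP below S becomes VP^S."""
    label, children = tree
    if isinstance(children, str):        # pre-terminal: keep N -> word unchanged
        return (label, children)
    return (f"{label}^{parent}", [annotate_parents(child, parent=label) for child in children])

t = ("S", [("NP", "astronomer"),
           ("VP", [("V", "saw"), ("NP", "man")])])
print(annotate_parents(t))
# ('S^ROOT', [('NP', 'astronomer'), ('VP^S', [('V', 'saw'), ('NP', 'man')])])
```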

Lexicalization. Second idea: incorporate word semantics (lexicalization). Clearly, different verbs take different arguments, leading to different structures (similar for other word types). Trick: learn a model for each head word of a non-terminal (walk, take, eat, ...). This requires a much larger training corpus and sophisticated smoothing. [Source: MS99; from the Penn Treebank]

Dependency Grammars. Phrase structure grammars are not the only way to represent structural information within sentences. Popular alternative: dependency trees. Every word forms exactly one node; edges describe the syntactic relationships between words: object-of, subject-of, modifier-of, preposition-of, ... Different tag sets exist. [Figure: example dependency tree with edge labels such as subject-of, aux-of, det-of, obj-of, mod-of. Source: Wikipedia]

Self-Assessment. Which assumptions are behind PCFGs for parsing? What is the complexity of the parsing problem for PCFGs? Assume the following rule set: derive all derivations of the sentence together with their probabilities and mark the most probable derivation. Derive the complexity of the decoding algorithm for PCFGs. What is the head word of a phrase in a phrase structure grammar? When are two grammars weakly equivalent?