
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser

Content of this Lecture: Phrase-Structure Parse Trees; Probabilistic Context-Free Grammars; Parsing with PCFG; Other Issues in Parsing

Parsing Sentences. POS tagging studies the plain sequence of words in a sentence, but sentences have more, and non-consecutive, structure. Plenty of linguistic theories exist about the nature and representation of these structures / units / phrases. Here: phrase structure grammars. [Figure: phrase-structure parse tree for "The astronomer saw the star with a telescope" with node labels S, NP, VP, V, P, PP.]


Phrase Structure Grammar. Builds on assumptions: sentences consist of nested structures; there is a fixed set of different structures (phrase types); the nesting can be described by a context-free grammar. Rules: 1: S → NP VP; 2: PP → P NP; 3: VP → V NP; 4: VP → VP PP; 5: P → with; 6: V → saw; 7: NP → NP PP; 8: NP → astronomer; 9: NP → telescope; 10: NP → star. [Figure: parse tree for "astronomer saw star with telescope", each node annotated with the number of the rule that expands it.]

Ambiguity? [Figure: two alternative parse trees for "astronomer saw star with ring", differing in whether the PP "with ring" attaches to the object NP or to the VP.]

Problem 1: Ambiguity! [Figure: two alternative parse trees for "astronomer saw man with telescope", again differing in the attachment of the PP "with telescope".]

Problem 2: Syntax versus Semantics. Phrase structure grammars only capture syntax. With the rule set above extended by V → ate and NP → moon, cat, cream, the grammar also accepts semantically nonsensical sentences such as "moon ate cat with cream", "telescope ate moon with cat", or "telescope saw cream with star".

Content of this Lecture: Phrase-Structure Parse Trees; Probabilistic Context-Free Grammars; Parsing with PCFG; Other Issues in Parsing

Probabilistic Context-Free Grammars (PCFG). Also called Stochastic Context-Free Grammars. Idea: extend context-free grammars by transition probabilities; every rule gets a non-zero probability of firing. The grammar still recognizes the same language, but every parse can be assigned a probability. Usages: find the parse with the highest probability (the "true" meaning); detect ambiguous sentences (>1 parse with similar probability); compute the overall probability of a sentence given a grammar, i.e., the sum of the probabilities of all derivations producing the sentence; language models: predict the most probable next token in an incomplete sentence among those allowed by the grammar rules.

POS Tagging versus Parsing. "The velocity of the seismic waves rises to ..." This is difficult for a POS tagger: the plural noun waves directly precedes the singular verb rises (the agreement is with the distant head velocity), but simple for a PCFG.

More Formal Definition. A PCFG is a 5-tuple (W, N, S, R, p) where: W is a set of terminals (words) w_1, w_2, …; N is a set of non-terminals (phrase types) N_1, N_2, …; S is a designated start symbol; R is a set of rules N_i → φ_j, where φ_j is a sequence of terminals and/or non-terminals; p is a function assigning a non-zero probability to every rule such that ∀i: Σ_j p(N_i → φ_j) = 1.

Example. Rules: 1: S → NP VP 1,00; 2: PP → P NP 1,00; 3: VP → V NP 0,30; 4: VP → VP PP 0,70; 5: P → with 1,00; 6: V → saw 1,00; 7: NP → NP PP 0,80; 8: NP → astronomer 0,10; 9: NP → telescope 0,05; 10: NP → man 0,05. [Figure: the two parse trees for "astronomer saw man with telescope", each to be assigned a probability p.]
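As a concrete illustration, a minimal sketch (the Python data layout is mine, not the lecture's) of this rule set as a dictionary, together with a check of the constraint from the definition that all rules with the same left-hand side sum to 1:

# Minimal sketch (data layout is mine): the rule set of this slide as a
# Python dictionary, plus a check of the PCFG constraint that the
# probabilities of all rules with the same left-hand side sum to 1.
from collections import defaultdict

rules = {
    ("S", ("NP", "VP")): 1.00, ("PP", ("P", "NP")): 1.00,
    ("VP", ("V", "NP")): 0.30, ("VP", ("VP", "PP")): 0.70,
    ("NP", ("NP", "PP")): 0.80, ("NP", ("astronomer",)): 0.10,
    ("NP", ("telescope",)): 0.05, ("NP", ("man",)): 0.05,
    ("P", ("with",)): 1.00, ("V", ("saw",)): 1.00,
}

totals = defaultdict(float)
for (lhs, rhs), prob in rules.items():
    totals[lhs] += prob
assert all(abs(total - 1.0) < 1e-9 for total in totals.values())
print(dict(totals))   # every left-hand side sums to 1.0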

Example, tree t_1. With the rule probabilities from the previous slide, the parse in which the PP attaches to the object NP uses rules 1, 8, 3, 6, 7, 10, 2, 5, 9: p(t_1) = 1 * 0,1 * 0,3 * 1 * 0,8 * 0,05 * 1 * 1 * 0,05 = 0,00006. [Figure: parse tree t_1 for "astronomer saw man with telescope", nodes annotated with these rule numbers.]

Example, tree t_2. The parse in which the PP attaches to the VP uses rules 1, 8, 4, 3, 6, 10, 2, 5, 9: p(t_2) = 1 * 0,1 * 0,7 * 0,3 * 1 * 0,05 * 1 * 1 * 0,05 = 0,0000525, slightly less than p(t_1). [Figure: parse tree t_2 with the PP attached to the VP.]
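A quick way to check these two numbers (rule numbering as on the slides; the two lists of rule applications are read off the trees t_1 and t_2):

# Sketch of the worked check: a parse tree's probability is the product
# of the probabilities of all rules used in its derivation
# (rule numbering as on the slides above).
from math import prod

rule_prob = {1: 1.0, 2: 1.0, 3: 0.3, 4: 0.7, 5: 1.0,
             6: 1.0, 7: 0.8, 8: 0.1, 9: 0.05, 10: 0.05}

t1 = [1, 8, 3, 6, 7, 10, 2, 5, 9]   # PP attached to the object NP
t2 = [1, 8, 4, 3, 6, 10, 2, 5, 9]   # PP attached to the VP

p_t1 = prod(rule_prob[r] for r in t1)   # ~0.00006
p_t2 = prod(rule_prob[r] for r in t2)   # ~0.0000525
print(p_t1, p_t2)                       # t1 is the more probable parse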

Implicit Assumptions. Context-free: the probability of a derivation of a subtree under a non-terminal N is independent of anything else in the tree (above N, left of N, right of N). Place-invariant: the probability of a given rule r is the same anywhere in the tree; the probability of a subtree is independent of its position in the sentence. Semantics-unaware: the probabilities of terminals do not depend on the co-occurring terminals in the sentence; semantic validity is not considered.

Usefulness (of a good PCFG). Tri-gram models are the better language models: they work at the word level with conditional probabilities of word sequences. PCFGs are a step towards resolving ambiguity, but not a solution, due to the lack of semantics. PCFGs can produce robust parsers: when learned on a corpus with a few rare errors, these errors are cast into rules with low probability. PCFGs have some implicit biases (work-arounds are known), e.g. small trees get higher probabilities. State-of-the-art parsers combine PCFGs with additional formalized knowledge.

Three Issues. Given a PCFG G and a sentence s ∈ L(G): Issue 1, Decoding (or parsing): which chain of rules (derivation) from G produced s with the highest probability? Issue 2, Evaluation: what is the overall probability of s given G? Given a context-free grammar G and a set of sentences S with their derivations in G: Issue 3, Learning: which PCFG G′ with the same rule set as G produces S with the highest probability? We make our life simple: (1) G is given, (2) sentences come already parsed. Removing assumption (2) leads to an EM algorithm; removing (1) is hard (structure learning). There is a very close relationship to the same problems in HMMs.

Chomsky Normal Form. We only consider PCFGs with rules of the following form (Chomsky Normal Form, CNF): N_i → w (non-terminal to terminal) and N_i → N_j N_k (non-terminal to two non-terminals). Note: for any CFG G, there exists a CFG G′ in Chomsky Normal Form such that G and G′ are weakly equivalent, i.e., accept the same language (but with different derivations). Accordingly, a PCFG in CNF has at most |N|³ + |N|*|W| parameters.
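For illustration, a minimal sketch of the standard binarization step used when bringing a grammar into CNF (the helper non-terminal naming @VP_1 is my own convention; unit productions and terminals inside long rules are not handled here):

# Minimal sketch of binarizing long rules for CNF (hypothetical helper,
# not part of the lecture): VP -> V NP PP becomes VP -> V @VP_1 and
# @VP_1 -> NP PP, keeping the rule probability on the first new rule
# and probability 1.0 on the helper rules.

def binarize(lhs, rhs, prob):
    """Turn (lhs -> rhs) with len(rhs) > 2 into a chain of binary rules."""
    new_rules = []
    current_lhs, current_prob = lhs, prob
    for i in range(len(rhs) - 2):
        helper = f"@{lhs}_{i+1}"          # fresh intermediate non-terminal
        new_rules.append((current_lhs, (rhs[i], helper), current_prob))
        current_lhs, current_prob = helper, 1.0
    new_rules.append((current_lhs, tuple(rhs[-2:]), current_prob))
    return new_rules

print(binarize("VP", ["V", "NP", "PP"], 0.2))
# [('VP', ('V', '@VP_1'), 0.2), ('@VP_1', ('NP', 'PP'), 1.0)]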

Issue 3: Learning. Given a context-free grammar G and a set of sentences S with their derivations in G: which PCFG G′ with the same rule set as G produces S with the highest probability? A simple Maximum Likelihood approach will do: ∀i,j: p(N_i → φ_j) = c(N_i → φ_j) / c(N_i → *), where c(·) counts the occurrences of a rule in the set of derivations and * stands for any rule consequent.
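A minimal sketch of this Maximum Likelihood estimate, assuming (my assumption, not the lecture's format) that each treebank derivation is available as a list of (lhs, rhs) rule applications:

# Minimal sketch of the ML estimate (hypothetical input format, not from
# the lecture): each derivation is a list of (lhs, rhs) rule applications.
from collections import Counter

def mle_pcfg(derivations):
    rule_counts = Counter()   # c(N_i -> phi_j)
    lhs_counts = Counter()    # c(N_i -> *)
    for derivation in derivations:
        for lhs, rhs in derivation:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Toy usage: two derivations sharing the same VP expansions.
trees = [
    [("S", ("NP", "VP")), ("VP", ("V", "NP"))],
    [("S", ("NP", "VP")), ("VP", ("VP", "PP")), ("VP", ("V", "NP"))],
]
print(mle_pcfg(trees))
# {('S', ('NP', 'VP')): 1.0, ('VP', ('V', 'NP')): 0.66..., ('VP', ('VP', 'PP')): 0.33...}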

Content of this Lecture: Phrase-Structure Parse Trees; Probabilistic Context-Free Grammars; Parsing with PCFG; Other Issues in Parsing

Issue 2: Evaluation. Given a PCFG G and a sentence s ∈ L(G): what is the overall probability of s given G? We did not discuss this problem for HMMs, but for PCFGs it is simpler to derive parsing from evaluation. Naïve approach: find all derivations of s and sum up their probabilities; problem: there can be exponentially many derivations. We give a Dynamic Programming based algorithm.

Idea. Recall that our PCFG builds on a context-free grammar in CNF. Definition: the inside probability of a sub-sentence w_p…w_q being produced by a non-terminal N_i is defined as β_i(p,q) = P(w_pq | N_i,pq, G), where w_pq is the sub-sentence of s from token w_p at position p to token w_q at position q, and N_i,pq is the non-terminal N_i producing w_pq. From now on, we omit G and s. We search β_S(1,n) for a sentence with n tokens.

Induction. We compute β_S(1,n) by induction over the length of the sub-sentences. Start: assume p = q. Since we are in CNF, the rule producing w_pp must have the form N_i,pp → w_pp, so β_i(p,p) = P(w_pp | N_i,pp) = p(N_i → w_pp). This is a parameter of G and can be looked up for all (i,p). Induction: assume p < q. Since we are in CNF, the derivation must split as N_i,pq → N_r,pd N_s,(d+1)q for some d with p ≤ d < q, where N_r,pd produces w_p…w_d and N_s,(d+1)q produces w_(d+1)…w_q.

Derivation:
β_i(p,q) = P(w_pq | N_i,pq)
= Σ_{r,s} Σ_{d=p..q−1} P(w_pd, N_r,pd, w_(d+1)q, N_s,(d+1)q | N_i,pq)
= Σ_{r,s} Σ_{d=p..q−1} P(N_r,pd, N_s,(d+1)q | N_i,pq) * P(w_pd | N_r,pd, N_s,(d+1)q, N_i,pq) * P(w_(d+1)q | w_pd, N_r,pd, N_s,(d+1)q, N_i,pq)
= Σ_{r,s} Σ_{d=p..q−1} P(N_r,pd, N_s,(d+1)q | N_i,pq) * P(w_pd | N_r,pd) * P(w_(d+1)q | N_s,(d+1)q)
= Σ_{r,s} Σ_{d=p..q−1} p(N_i → N_r N_s) * β_r(p,d) * β_s(d+1,q)

Derivation, annotated: the outer sum runs over all possible combinations of N_r/N_s producing the two subtrees; the split point d runs from p to q−1, so each subtree covers between 1 and q−p tokens and together they cover exactly w_p…w_q. The second line applies the chain rule of conditional probabilities; the last two lines use the independence assumptions of the (P)CFG: w_pd depends only on N_r,pd, w_(d+1)q depends only on N_s,(d+1)q, and P(N_r,pd, N_s,(d+1)q | N_i,pq) = p(N_i → N_r N_s).
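Reading the final recursion as a dynamic program over span lengths gives the inside algorithm. Below is a minimal sketch in Python (data layout, 0-based indices and variable names are mine, not the lecture's), run with the rule probabilities of the first example grammar (the 0,30/0,70/0,80 variant); the result for the whole sentence equals p(t_1) + p(t_2) computed above.

# Minimal sketch of the inside algorithm (variable names, 0-based spans
# and data layout are mine, not the lecture's). Grammar in CNF, given as
# binary rules p(A -> B C) and lexical rules p(A -> w).
from collections import defaultdict

binary = {
    ("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0,
    ("VP", "V", "NP"): 0.3, ("VP", "VP", "PP"): 0.7,
    ("NP", "NP", "PP"): 0.8,
}
lexical = {
    ("P", "with"): 1.0, ("V", "saw"): 1.0,
    ("NP", "astronomer"): 0.1, ("NP", "telescope"): 0.05, ("NP", "man"): 0.05,
}

def inside(words, binary, lexical):
    n = len(words)
    beta = defaultdict(float)              # beta[(i, p, q)] over 0-based spans
    for p, w in enumerate(words):          # base case: N_i -> w_p
        for (nt, word), prob in lexical.items():
            if word == w:
                beta[(nt, p, p)] = prob
    for length in range(2, n + 1):         # induction over span length
        for p in range(0, n - length + 1):
            q = p + length - 1
            for (i, r, s), prob in binary.items():
                for d in range(p, q):      # all split points
                    beta[(i, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta

sentence = "astronomer saw man with telescope".split()
beta = inside(sentence, binary, lexical)
print(beta[("S", 0, 4)])   # ~0.0001125 = p(t_1) + p(t_2) from the earlier slides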

Example: "astronomer saw man with telescope". Rules: 1: S → NP VP 1,00; 2: PP → P NP 1,00; 3: VP → V NP 0,70; 4: VP → VP PP 0,30; 5: P → with 1,00; 6: V → saw 1,00; 7: NP → NP PP 0,40; 8: NP → astronomer 0,10; 9: NP → telescope 0,18; 10: NP → man 0,18; 11: NP → saw 0,04; 12: NP → ears 0,10. Initialization (diagonal of the table, spans of length 1): β_NP(1,1) = 0,1; β_V(2,2) = 1 and β_NP(2,2) = 0,04; β_NP(3,3) = 0,18; β_P(4,4) = 1; β_NP(5,5) = 0,18.

Example, spans of length 2: β_VP(2,3) = 0,7 * 1 * 0,18 = 0,126 (the only applicable rule is VP → V NP with p = 0,7); β_PP(4,5) = 1 * 1 * 0,18 = 0,18 (PP → P NP). Cell (1,2) stays empty since there is no rule X → NP V or X → NP NP; the remaining length-2 cells are 0 as well.

Example, spans of length 3: β_S(1,3) = 1 * 0,1 * 0,126 = 0,0126 (S → NP VP over "astronomer saw man"); β_NP(3,5) = 0,4 * 0,18 * 0,18 = 0,01296 (NP → NP PP over "man with telescope"); the other length-3 cells are 0.

Example, completed. Spans of length 4: β_VP(2,5) = β_1 + β_2 = 0,7 * 1 * 0,01296 + 0,3 * 0,126 * 0,18 = 0,015876, since the VP over "saw man with telescope" can be built in two ways (VP → V NP with the PP inside the object NP, or VP → VP PP); cell (1,4) stays 0. Span of length 5: β_S(1,5) = 1 * 0,1 * 0,015876 = 0,0015876, the overall probability of the sentence. [Figure: the two VP subtrees over "saw man with telescope" corresponding to β_1 and β_2.]

Note: this is the Cocke–Younger–Kasami (CYK) algorithm for parsing with context-free grammars, enriched with aggregations / multiplications for computing probabilities. Same complexity: O(n³ * |G|), where n is the sentence length and |G| the number of rules in the grammar G.


Issue 1: Decoding / Parsing. Once evaluation is solved, parsing is simple: instead of summing over all derivations, we only keep the most probable derivation of a sub-sentence for each possible root. Let δ_i(p,q) be the probability of the most probable derivation of sub-sentence p..q from a non-terminal root N_i,pq. This gives
δ_i(p,q) = max_{r,s} max_{d=p..q−1} P(w_pd, N_r,pd, w_(d+1)q, N_s,(d+1)q | N_i,pq)
= max_{r,s, d=p..q−1} p(N_i → N_r N_s) * δ_r(p,d) * δ_s(d+1,q).
We omit the induction start and the backtracing (store the maximizing r, s, d per cell to reconstruct the parse tree).
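A minimal sketch of the corresponding decoding variant (max instead of sum, with backpointers for the backtracing the slide omits); grammar, 0-based indices and names as in the inside sketch above, repeated here so the snippet runs on its own.

# Minimal sketch of Viterbi-style CYK decoding (names and data layout
# are mine, not the lecture's); grammar as in the inside sketch.
binary = {
    ("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0,
    ("VP", "V", "NP"): 0.3, ("VP", "VP", "PP"): 0.7,
    ("NP", "NP", "PP"): 0.8,
}
lexical = {
    ("P", "with"): 1.0, ("V", "saw"): 1.0,
    ("NP", "astronomer"): 0.1, ("NP", "telescope"): 0.05, ("NP", "man"): 0.05,
}

def viterbi_parse(words, binary, lexical):
    n = len(words)
    delta, back = {}, {}                      # best probability and backpointer per cell
    for p, w in enumerate(words):             # length-1 spans: lexical rules
        for (nt, word), prob in lexical.items():
            if word == w and prob > delta.get((nt, p, p), 0.0):
                delta[(nt, p, p)] = prob
                back[(nt, p, p)] = w
    for length in range(2, n + 1):            # longer spans: binary rules
        for p in range(0, n - length + 1):
            q = p + length - 1
            for (i, r, s), prob in binary.items():
                for d in range(p, q):
                    cand = prob * delta.get((r, p, d), 0.0) * delta.get((s, d + 1, q), 0.0)
                    if cand > delta.get((i, p, q), 0.0):
                        delta[(i, p, q)] = cand
                        back[(i, p, q)] = (r, s, d)
    def build(i, p, q):                       # backtrace into a bracketed tree
        if p == q:
            return f"({i} {back[(i, p, p)]})"
        r, s, d = back[(i, p, q)]
        return f"({i} {build(r, p, d)} {build(s, d + 1, q)})"
    return delta[("S", 0, n - 1)], build("S", 0, n - 1)

prob, tree = viterbi_parse("astronomer saw man with telescope".split(), binary, lexical)
print(prob)   # ~6e-05: probability of the best parse (the NP attachment, t_1)
print(tree)   # (S (NP astronomer) (VP (V saw) (NP (NP man) (PP (P with) (NP telescope)))))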

Content of this Lecture: Phrase-Structure Parse Trees; Probabilistic Context-Free Grammars; Parsing with PCFG; Other Issues in Parsing

Treebanks. A treebank is a set of sentences (corpus) whose phrase structures are annotated; it serves as a training corpus for a PCFG. Not many exist; building one is a very costly, manual task. Most prominent: the Penn Treebank (Marcus, Marcinkiewicz, Santorini: "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330; ~5,500 citations): 2,499 stories from a three-year Wall Street Journal (WSJ) collection, roughly 1 million tokens, freely available. German treebanks: Deutsche Diachrone Baumbank (3 historical periods, small); Tübinger Baumbank (38,000 sentences, 345,000 tokens).

Using Derivation History. Phrase structure grammars as described here are somewhat simplistic. One idea for improvement: incorporate dependencies between non-terminals, since the probability of a rule is not identical across all positions in a sentence. Trick: annotate the derivation of a non-terminal (read: "generated from a ...") in its name and learn different probabilities for the different variants, e.g. separate statistics for rules 1, 2, 3, 7 and a derived variant 7a. [Table of rule probabilities per variant; source: MS99, from the Penn Treebank.]
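A minimal sketch of this annotation trick for the common case of parent annotation (the tree format and helper name are mine, not the lecture's):

# Minimal sketch of parent annotation (tree format and names are mine):
# every non-terminal is renamed to "X^Parent", so that e.g. an NP directly
# under S and an NP directly under VP get separate rule statistics.

def annotate_parents(tree, parent="ROOT"):
    """tree = (label, children...) for non-terminals, a plain string for words."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return (f"{label}^{parent}",
            *[annotate_parents(c, parent=label) for c in children])

t = ("S", ("NP", "astronomer"), ("VP", ("V", "saw"), ("NP", "star")))
print(annotate_parents(t))
# ('S^ROOT', ('NP^S', 'astronomer'), ('VP^S', ('V^VP', 'saw'), ('NP^VP', 'star')))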

Lexicalization. Second idea: incorporate word semantics (lexicalization). Clearly, different verbs take different arguments, leading to different structures (similarly for other word types). Trick: learn a model for each head word of a non-terminal (walk, take, eat, …). This requires a much larger training corpus and sophisticated smoothing. [Source: MS99; figures from the Penn Treebank.]

Dependency Grammars. Phrase structure grammars are not the only way to represent structural information within sentences. A popular alternative: dependency trees. Every word forms exactly one node; edges describe the syntactic relationship between words: object-of, subject-of, modifier-of, preposition-of, … Different tag sets exist. [Figure: example dependency tree with edge labels subject-of, aux-of, det-of, obj-of, mod-of; source: Wikipedia.]
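For contrast with the phrase-structure trees above, a minimal sketch of how such a dependency analysis can be stored as (dependent, head, relation) triples; the sentence and the exact relation labels here are illustrative, not the lecture's example.

# Minimal sketch of a dependency analysis as (dependent, head, relation)
# triples; sentence and relation labels are illustrative, not the
# lecture's example. Index 0 denotes the artificial root.
sentence = ["astronomer", "saw", "star"]
dependencies = [
    (1, 2, "subject-of"),   # astronomer is the subject of saw
    (3, 2, "object-of"),    # star is the object of saw
    (2, 0, "root"),         # saw is the root of the sentence
]
for dep, head, rel in dependencies:
    head_word = "ROOT" if head == 0 else sentence[head - 1]
    print(f"{sentence[dep - 1]:>10} --{rel}--> {head_word}")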

Self-Assessment. Which assumptions are behind PCFGs for parsing? What is the complexity of the parsing problem for PCFGs? Assume the following rule set: derive all derivations for the sentence together with their probabilities and mark the most probable derivation. Derive the complexity of the decoding algorithm for PCFGs. What is the head word of a phrase in a phrase structure grammar? When are two grammars weakly equivalent?