Introduction to Probabilistic Natural Language Processing

Alexis Nasr, Laboratoire d'Informatique Fondamentale de Marseille

Natural Language Processing: use computers to process human languages. Applications: machine translation, automatic speech recognition, optical character recognition, question answering, information extraction, ...

What do we need to process natural language? Represent linguistic knowledge: phonetics, morphology, lexicon, syntax, semantics. Automatically learn linguistic models: machine learning, corpora. Efficient algorithms: dynamic programming, linear programming.

Probabilistic Methods for NLP: using statistics in order to choose between several alternatives. Ambiguity can be real: "buying books for children" has two readings (buying [books for children] vs. buying books [for children]), or it may be due to a lack of knowledge: "buying books for money" is ambiguous only for a parser that lacks the knowledge ruling out the [books for money] reading.

Ambiguity is everywhere in NLP: speech recognition, machine translation, optical character recognition, part-of-speech tagging, syntactic parsing, ...

Speech Recognition. Homophones: the tail of the dog / the tale of the dog. Word boundaries: I scream / ice cream.

Part of Speech Tagging: time flies like an arrow. Each word admits several possible tags: time N/V, flies N/V, like V/Prep, an Det, arrow N.

Syntactic parsing: part-of-speech ambiguity and attachment ambiguity. John saw a man with a telescope: the PP "with a telescope" can attach to the verb or to "a man", giving two parses.

Machine Translation: The river banks → les berges de la rivière; The banks closed → les banques ont fermé.

How to resolve ambiguity? Add knowledge: phonetic / phonological, lexical, syntactic, semantic, pragmatic, world knowledge. Compute a score for every possibility: S(the tail of the dog) > S(the tale of the dog). Such scores can be probabilities; they are computed with a probabilistic or stochastic model.

Example: computing the probability of a sequence of words. This is needed in many applications: speech recognition, optical character recognition, machine translation. A model that associates a probability to any sequence of words is called a language model. What do we expect from the language model? It should assign a higher probability to correct sequences (les traits très tirés) than to incorrect ones (les traits très tirées, les très traient tiraient, les très très tiraient), and a higher probability to likely sequences (les traits très tirés) than to unlikely ones (lettrés très tirés).

Unigram model. We use the occurrence probabilities of the words of the sequence:
P(w_1 ... w_n) = ∏_{i=1}^{n} P(w_i)
P(les traits très tirés) = P(les) P(traits) P(très) P(tirés)
The model is based on completely wrong independence assumptions!
sequence               -log P
les très très tirés    14.265
les trait très tiré    15.3102
les traits très tiré   15.4028
les trait très tirés   15.8386
les traits très tirés  15.9312

Unigram model: probability estimation. How to estimate P(w_i), with 1 ≤ i ≤ V? Take a very large corpus o_1 ... o_n (each o_i is a word occurrence). Compute the number of occurrences of word w: C(w) = ∑_{i=1}^{n} δ_{o_i,w}, where δ_{o_i,w} = 1 if o_i = w and 0 otherwise. Then take the relative frequency of w: P(w) = C(w) / ∑_{i=1}^{V} C(w_i). There are V probabilities to estimate.
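A minimal sketch of this estimation and of the unigram score of a sequence (my own illustration, not the lecture's code; the corpus and the log base are placeholders):

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Relative-frequency estimate P(w) = C(w) / total number of occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def neg_log_prob_unigram(model, sequence):
    """-log P(w_1 ... w_n) under the unigram model (assumes every word was seen)."""
    return -sum(math.log(model[w]) for w in sequence)

# Toy illustration (placeholder corpus, not the Le Monde data of the lecture)
tokens = "les traits très tirés les traits du visage sont très tirés".split()
model = train_unigram(tokens)
print(neg_log_prob_unigram(model, "les traits très tirés".split()))
```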

Training data: the newspaper Le Monde, 1986-2002; 16 479 270 sentences; 370 005 285 word occurrences.

[Figure: number of distinct unigrams (y axis, up to 800 000) as a function of the number of sentences (x axis, up to 1.2e+07).]

Rank, frequency and frequency of frequencies (most frequent unigrams):
rank   freq        f2f   unigram
1      13 286 304  1     de
2      6 964 863   1     la
3      5 900 839   1     le
4      5 599 010   1     l
5      5 017 018   1     et
6      4 762 293   1     les
7      4 208 264   1     des
8      3 856 293   1     d
9      3 695 434   1     un
10     3 425 787   1     en

Rank, frequency and frequency of frequencies (rare unigrams):
rank   freq   f2f      unigrams (examples)
8532   10     7 992    abaisseront, Abott, antihypertenseur
8533   9      9 369
8534   8      11 140
8535   7      13 671
8536   6      17 351
8537   5      22 684
8538   4      31 980
8539   3      50 311
8540   2      99 581
8541   1      356 780  absolumen, absorbable, Ambrozic

[Figure: frequency of frequencies for unigrams, lin-lin scale (y up to 350 000, x up to 1.2e+07).]

[Figure: frequency of frequencies for unigrams, lin-lin scale, zoomed on frequencies 1-50.]

[Figure: frequency of frequencies for unigrams, log-lin scale.]

[Figure: frequency of frequencies for unigrams, log-log scale.]

Zipf's Law. In many human phenomena, there is a linear relation between frequency (f) and the inverse of rank (r): f(r) = k · 1/r. Zipf's interpretation: this is a way to minimize the effort of the speaker and the hearer: a small number of frequent words minimizes the speaker's effort, and a large number of rare words (low ambiguity) minimizes the hearer's effort. Roughly: a few very frequent words and many rare words.
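As a quick check of this relation (my own illustration, using the top of the unigram table above), the product rank × frequency should stay roughly constant if f(r) ≈ k / r:

```python
# Top unigram ranks and frequencies from the table above (Le Monde corpus)
top_unigrams = [(1, 13_286_304), (2, 6_964_863), (3, 5_900_839),
                (4, 5_599_010), (5, 5_017_018), (6, 4_762_293)]

# Under Zipf's law f(r) = k / r, the product r * f(r) approximates the constant k
for rank, freq in top_unigrams:
    print(rank, freq, rank * freq)
# The products grow slowly (about 1.3e7 to 2.9e7) rather than staying flat:
# Zipf's law is only a rough idealization of the data.
```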

Bigram Model. We use the probability that word a follows word b: P(w_{i+1} = a | w_i = b).
P(w_1 ... w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-1})
P(les traits très tirés) = P(les) P(traits | les) P(très | traits) P(tirés | très)
sequence               -log P
les très très tirés    13.4767
les traits très tiré   13.8494
les traits très tirés  14.1414
les trait très tirés   18.2462
les trait très tiré    18.5381

Bigram model: estimation. Compute the number of occurrences of the sequence ab: C(a, b) = ∑_{i=1}^{N-1} δ_{o_i,a} δ_{o_{i+1},b}. Then take the relative frequency: P(b | a) = C(a, b) / C(a). There are V² probabilities to estimate.
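A minimal sketch of this estimation (my own illustration, with a toy placeholder corpus; events unseen in training simply get no entry here):

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Relative-frequency estimates: P(w_1) and P(b | a) = C(a, b) / C(a)."""
    unigram_counts = Counter(tokens)
    total = sum(unigram_counts.values())
    p_uni = {w: c / total for w, c in unigram_counts.items()}
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    p_bi = {(a, b): c / unigram_counts[a] for (a, b), c in bigram_counts.items()}
    return p_uni, p_bi

def neg_log_prob(p_uni, p_bi, seq):
    """-log [ P(w_1) * prod_i P(w_i | w_{i-1}) ]; assumes every event was observed."""
    score = -math.log(p_uni[seq[0]])
    for a, b in zip(seq, seq[1:]):
        score -= math.log(p_bi[(a, b)])
    return score

# Toy illustration (placeholder corpus)
tokens = "les traits très tirés les traits du visage sont très tirés".split()
p_uni, p_bi = train_bigram(tokens)
print(neg_log_prob(p_uni, p_bi, "les traits très tirés".split()))
```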

[Figure: number of distinct bigrams and distinct unigrams as a function of the number of sentences (y up to 2.5e+07, x up to 1.2e+07).]

Rank, frequency and frequency of frequencies (most frequent bigrams):
rank   freq       f2f   bigram
1      2 091 936  1     de la
2      1 563 496  1     de l
3      1 156 551  1     neuf cent
4      1 139 331  1     mille neuf
5      921 010    1     cent quatre vingt
6      744 220    1     d un
7      578 140    1     d une
8      571 205    1     c est
9      556 919    1     et de
10     541 528    1     en mille

Rank, frequency and frequency of frequencies (rare bigrams):
rank   freq   f2f
8817   10     121 362
8818   9      146 471
8819   8      181 721
8820   7      231 738
8821   6      307 461
8822   5      429 606
8823   4      655 244
8824   3      1 148 479
8825   2      2 680 674
8826   1      13 808 569

[Figure: frequency of frequencies for bigrams, lin-lin scale, frequencies 1-50 (y up to 1.2e+07).]

[Figure: frequency of frequencies for bigrams, log-lin scale.]

[Figure: frequency of frequencies for bigrams, log-log scale.]

Trigram Model. We use the probability that c follows the sequence ab: P(w_{i+2} = c | w_i = a, w_{i+1} = b).
P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) ∏_{i=3}^{n} P(w_i | w_{i-2} w_{i-1})
P(les traits très tirés) = P(les) P(traits | les) P(très | les, traits) P(tirés | traits, très)
sequence               -log P
les très très tirés    24.0764
les traits très tirés  28.134
les traits très tiré   28.2369
les trait très tirés   31.0759
les trait très tiré    31.1788

Trigram model: estimation. Compute the number of occurrences of the word sequence abc: C(a, b, c) = ∑_{i=1}^{N-2} δ_{o_i,a} δ_{o_{i+1},b} δ_{o_{i+2},c}. Then take the relative frequencies: P(c | a, b) = C(a, b, c) / C(a, b). There are V³ probabilities to estimate.

[Figure: number of distinct trigrams, bigrams and unigrams as a function of the number of sentences (y up to 9e+07, x up to 1.2e+07).]

Rank, frequency and frequency of frequencies (most frequent trigrams):
rank   freq       f2f   trigram
1      1 135 994  1     mille neuf cent
2      792 656    1     en mille neuf
3      379 902    1     cent quatre vingt dix
4      191 324    1     n est pas
5      184 626    1     neuf cent soixante
6      167 909    1     il y a
7      160 077    1     de mille neuf
8      145 121    1     n a pas
9      95 683     1     et de la
10     95 201     1     cent soixante dix

Rank, frequency and frequency of frequencies (rare trigrams):
rank   freq   f2f
5193   10     242 472
5194   9      299 876
5195   8      382 270
5196   7      502 171
5197   6      691 755
5198   5      1 013 295
5199   4      1 641 586
5200   3      3 124 196
5201   2      8 523 984
5202   1      64 425 232

[Figure: frequency of frequencies for trigrams, lin-lin scale, frequencies 1-50 (y up to 6e+07).]

[Figure: frequency of frequencies for trigrams, log-lin scale.]

[Figure: frequency of frequencies for trigrams, log-log scale.]

Limits of n-gram models. They fail to model arbitrarily long-distance phenomena:
the speed reaches ...
the speed of the waves reaches ...
the speed of the seismic waves reaches ...
the speed of the large seismic waves reaches ...

Syntactic structure of the sentence can help: [S [NPs [D the] [Ns speed]] [VP3s [V3s reaches]]]

Syntactic structure of the sentence can help: [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [Np waves]]]] [VP3s [V3s reaches]]]

Syntactic structure of the sentence can help: [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [NPp [AP seismic] [Np waves]]]]] [VP3s [V3s reaches]]]

Use grammars to compute the probability of a sentence. Agreement between the subject and the verb is modeled by the same rule in the different sentences, and this rule is independent of the length of the noun phrase. The grammar can easily assign a higher probability to the correct sentence: P(S → NPs VPs) > P(S → NPs VPp). The grammar can be used as a language model: for any sentence S ∈ L(G), it allows us to compute P(S).

Context-Free Grammars. A CFG is made of: a non-terminal alphabet N = {N^1, ..., N^n}; a terminal alphabet T = {t_1, ..., t_m}; a start symbol N^1; and a set of rewrite rules N^i → α with α ∈ (N ∪ T)*. Without loss of generality we can consider that the rules are in Chomsky Normal Form, i.e. of the form N^i → N^j N^k or N^i → t, with N^i, N^j, N^k ∈ N and t ∈ T.

Language and syntactic structure. A CFG G defines a language L(G) ⊆ T* and associates one or more syntactic structures (more than one when G is ambiguous) to every string w ∈ L(G).

Example. Grammar: A → BC, A → CB, B → DE, B → ED, C → DE, C → ED, D → a, E → a. A sample derivation: A ⇒ BC ⇒ DEC ⇒ aEC ⇒ aaC ⇒ aaED ⇒ aaaD ⇒ aaaa. L(G) = {aaaa}.

Ambiguity: the grammar assigns eight distinct parse trees to the string aaaa (A rewritten as BC or CB, and each of B and C rewritten as DE or ED).

Probabilistic Context-Free Grammar. A PCFG is made of: a non-terminal alphabet N = {N^1, ..., N^n}; a terminal alphabet T = {t_1, ..., t_m}; a start symbol N^1; a set of rewrite rules N^i → α with α ∈ (N ∪ T)*; and a probability distribution over the rules of every N^i: ∑_j P(N^i → α_j) = 1. Without loss of generality we can consider that the rules are in Chomsky Normal Form: N^i → N^j N^k or N^i → t, with N^i, N^j, N^k ∈ N and t ∈ T.

Probabilities. The probability of a rule is the probability of choosing this rule to rewrite its left-hand-side symbol: P(N^i → α_j) = P(N^i → α_j | N^i). Probability of a tree T: P(T) = ∏_{N ∈ T} P(r(N)), where N ∈ N and r(N) is the rule used to rewrite N in T. Probability of a sentence S: P(S) = ∑_{T ∈ T(S)} P(T).

Example:
A → BC 0.4    C → DE 0.3
A → CB 0.6    C → ED 0.7
B → DE 0.2    D → a 1.0
B → ED 0.8    E → a 1.0

Parse probabilities: the eight parse trees of aaaa have probabilities 0.024, 0.056, 0.096 and 0.224 (for A → BC, with B and C rewritten as DE or ED) and 0.036, 0.084, 0.144 and 0.336 (for A → CB).
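A small sketch of this computation (my own illustration): each tree is represented by the list of rules it uses, and P(T) is the product of the corresponding rule probabilities from the example grammar above.

```python
from math import prod

# Rule probabilities of the example PCFG above
rule_prob = {
    ("A", ("B", "C")): 0.4, ("A", ("C", "B")): 0.6,
    ("B", ("D", "E")): 0.2, ("B", ("E", "D")): 0.8,
    ("C", ("D", "E")): 0.3, ("C", ("E", "D")): 0.7,
    ("D", ("a",)): 1.0, ("E", ("a",)): 1.0,
}

def tree_prob(rules_used):
    """P(T) = product of the probabilities of the rules used in the tree."""
    return prod(rule_prob[r] for r in rules_used)

# The tree A -> BC, B -> DE, C -> DE (plus four lexical rules) over aaaa
tree_rules = [("A", ("B", "C")), ("B", ("D", "E")), ("C", ("D", "E")),
              ("D", ("a",)), ("E", ("a",)), ("D", ("a",)), ("E", ("a",))]
print(tree_prob(tree_rules))  # 0.4 * 0.2 * 0.3 = 0.024
```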

Probabilities we want to compute. We would like to compute P(S) efficiently: P(S) = ∑_{T ∈ T(S)} P(T), where T(S) is the set of all the syntactic structures that the grammar G associates to the sentence S. We would also like to find the most likely analysis of the sentence: T̂ = argmax_{T ∈ T(S)} P(T). T̂ is useful for many other tasks, such as text understanding, machine translation, question answering, ...

But grammars of natural languages are ambiguous. [Figure: number of parses (log scale, up to 1e+08) as a function of sentence length (up to 55 words), maximum (max) and mean (moy) values.]

Efficiently building the set T(S) with the CYK Algorithm.
Input: a CFG G in Chomsky Normal Form; a sentence w_1 ... w_n.
Output: a table t such that N ∈ t_{i,j} iff N ⇒* w_i ... w_j.
Algorithm:
for i = 1 to n do  { initialisation }
  t_{i,i} = {A | A → w_i}
for j = 1 to n do
  for i = j-1 downto 1 do
    for k = i to j-1 do
      t_{i,j} = t_{i,j} ∪ {A | A → BC with B ∈ t_{i,k} and C ∈ t_{k+1,j}}
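A compact Python sketch of this recognizer (my own illustration, not the lecture's code): the grammar is given as sets of CNF rules and the table is indexed by 1-based (i, j) spans as in the pseudocode.

```python
from collections import defaultdict

def cyk(words, lexical_rules, binary_rules):
    """CYK recognition table: t[(i, j)] = set of non-terminals deriving w_i ... w_j (1-based)."""
    n = len(words)
    t = defaultdict(set)
    # Initialisation: diagonal cells from the lexical rules A -> w_i
    for i, w in enumerate(words, start=1):
        t[(i, i)] = {a for (a, term) in lexical_rules if term == w}
    # Fill longer spans bottom-up
    for j in range(1, n + 1):
        for i in range(j - 1, 0, -1):
            for k in range(i, j):
                for (a, b, c) in binary_rules:
                    if b in t[(i, k)] and c in t[(k + 1, j)]:
                        t[(i, j)].add(a)
    return t

# Example grammar from the slides: L(G) = {aaaa}
lexical = {("D", "a"), ("E", "a")}
binary = {("A", "B", "C"), ("A", "C", "B"), ("B", "D", "E"),
          ("B", "E", "D"), ("C", "D", "E"), ("C", "E", "D")}
table = cyk("a a a a".split(), lexical, binary)
print(table[(1, 4)])  # {'A'}: the start symbol spans the whole sentence
```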

[CYK trace on the input a a a a with the example grammar: the diagonal cells t_{i,i} are first filled with {D, E} (since D → a and E → a), the cells for spans of length two then receive {B, C}, and finally the symbol A appears in the cell covering the whole sentence.]

Packed parse forest: each node is an instantiated symbol (e.g. A_{1..5}) and each instantiated production connects a symbol to the instantiated symbols it is rewritten into, e.g. A_{1..5} → B_{1..3} C_{3..5} and A_{1..5} → C_{1..3} B_{3..5}, down to the lexical level (D_{1..2} → a_{1..2}, ..., E_{4..5} → a_{4..5}).

Three problems to solve.
Compute the probability of a sentence S: P(S) = ∑_{T ∈ T(S)} P(T).
Build the most probable syntactic tree for S: T̂ = argmax_{T ∈ T(S)} P(T).
Estimate the probabilities of G using data D: Ĝ = argmax_G P(D | G).

Notations:
w_1 ... w_n   sentence to parse
w_{p,q}       segment w_p ... w_q of the sentence
m_i           symbol of the terminal alphabet
N^j           symbol of the non-terminal alphabet
N^j_{p,q}     symbol N^j derives the segment w_{p,q} (N^j ⇒* w_{p,q})

Inside Probabilities: β_j(p, q) =def P(w_{p,q} | N^j_{p,q}). [Figure: the subtree rooted in N^j dominating the terminals m_p ... m_q.]

Probability of a sentence: P(w_{1,n}) = P(N^1 ⇒* w_{1,n}) = P(w_{1,n} | N^1) = β_1(1, n).

Recursive computation of inside probabilities. [Figure: a node N^j split into N^r (dominating m_p ... m_d) and N^s (dominating m_{d+1} ... m_q).] The computation is bottom-up: we compute β_j(p, q) after β_r(p, d) and β_s(d+1, q) have been computed.

Recursive computation of inside probabilities. Recursive formula:
β_j(p, q) = P(w_{p,q} | N^j_{p,q})
          = ∑_{r,s} ∑_{d=p}^{q-1} P(w_{p,d}, N^r_{p,d}, w_{d+1,q}, N^s_{d+1,q} | N^j_{p,q})
          = ∑_{r,s} ∑_{d=p}^{q-1} P(N^r_{p,d}, N^s_{d+1,q} | N^j_{p,q}) P(w_{p,d} | N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q}) P(w_{d+1,q} | N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q})
          = ∑_{r,s} ∑_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q)

Recursive computation of inside probabilities. Base case: β_j(k, k) = P(w_k | N^j_{k,k}) = P(N^j → w_k).

Relation with the CYK algorithm: N^j_{p,q} corresponds to the presence of symbol N^j in cell t_{p,q}; β_j(p, q) can be computed while filling the table t.

Computing P(S) with CYK:
for q = 1 to n do  { initialisation }
  for p = q downto 1 do
    if p == q then β_j(p, p) = P(N^j → w_p), otherwise β_j(p, q) = 0
for q = 1 to n do
  for p = q-1 downto 1 do
    for d = p to q-1 do
      β_j(p, q) = β_j(p, q) + P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q), with N^r ∈ t_{p,d} and N^s ∈ t_{d+1,q}
P(S) = β_1(1, n)
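A sketch of this computation in Python (my own illustration): rule probabilities are stored in dictionaries, beta[(X, p, q)] plays the role of β_j(p, q) with 1-based indices, and the loop goes over span lengths, which realizes the same bottom-up order as the pseudocode.

```python
from collections import defaultdict

def inside(words, lex_prob, bin_prob, start="A"):
    """Inside probabilities beta[(X, p, q)] = P(w_p ... w_q | X); returns P(sentence).

    lex_prob: {(X, word): P(X -> word)}; bin_prob: {(X, Y, Z): P(X -> Y Z)}.
    """
    n = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(p, p) = P(N^j -> w_p)
    for p, w in enumerate(words, start=1):
        for (x, term), prob in lex_prob.items():
            if term == w:
                beta[(x, p, p)] = prob
    # Recursion: sum over split points d and rule choices
    for length in range(2, n + 1):
        for p in range(1, n - length + 2):
            q = p + length - 1
            for (x, y, z), prob in bin_prob.items():
                for d in range(p, q):
                    beta[(x, p, q)] += prob * beta[(y, p, d)] * beta[(z, d + 1, q)]
    return beta[(start, 1, n)]

# Example PCFG from the slides: L(G) = {aaaa}, so P(aaaa) should be 1
lex = {("D", "a"): 1.0, ("E", "a"): 1.0}
rules = {("A", "B", "C"): 0.4, ("A", "C", "B"): 0.6, ("B", "D", "E"): 0.2,
         ("B", "E", "D"): 0.8, ("C", "D", "E"): 0.3, ("C", "E", "D"): 0.7}
print(inside("a a a a".split(), lex, rules))  # 1.0 (up to floating point)
```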

Computing P(T̂). δ_i(p, q) = probability of the most probable subtree rooted in N^i_{p,q}.
1. Initialisation: δ_i(p, p) = P(N^i → w_p)
2. Recursive step: δ_i(p, q) = max_{1 ≤ j,k ≤ n, p ≤ d < q} P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q)
3. End: P(T̂) = δ_1(1, n)

Computing P(T̂) with CYK:
for q = 1 to n do  { initialisation }
  for p = q downto 1 do
    if p == q then δ_i(p, p) = P(N^i → w_p), otherwise δ_i(p, q) = 0
for q = 1 to n do
  for p = q-1 downto 1 do
    for d = p to q-1 do
      δ_i(p, q) = max(δ_i(p, q), P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q)), with N^j ∈ t_{p,d} and N^k ∈ t_{d+1,q}
P(T̂) = δ_1(1, n)

Building T̂. ψ_i(p, q) = ⟨j, k, d⟩, where j, k, d identify the rule application that achieved the maximum δ_i(p, q):
ψ_i(p, q) = argmax_{j,k,d} P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q)
root(T̂) = N^1_{1,n}
if ψ_i(p, q) = ⟨j, k, d⟩ then left_child(N^i_{p,q}) = N^j_{p,d} and right_child(N^i_{p,q}) = N^k_{d+1,q}
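A sketch of the Viterbi variant with backpointers (my own illustration, reusing the dictionary-based grammar representation of the inside sketch above): delta plays the role of δ_i(p, q), psi stores the triples ⟨j, k, d⟩, and the most probable tree is rebuilt recursively from the root.

```python
from collections import defaultdict

def viterbi_cyk(words, lex_prob, bin_prob, start="A"):
    """Most probable parse: returns (P(best tree), tree as nested tuples)."""
    n = len(words)
    delta = defaultdict(float)   # delta[(X, p, q)] = best subtree probability
    psi = {}                     # psi[(X, p, q)] = (Y, Z, d) achieving the max
    for p, w in enumerate(words, start=1):
        for (x, term), prob in lex_prob.items():
            if term == w:
                delta[(x, p, p)] = prob
    for length in range(2, n + 1):
        for p in range(1, n - length + 2):
            q = p + length - 1
            for (x, y, z), prob in bin_prob.items():
                for d in range(p, q):
                    score = prob * delta[(y, p, d)] * delta[(z, d + 1, q)]
                    if score > delta[(x, p, q)]:
                        delta[(x, p, q)] = score
                        psi[(x, p, q)] = (y, z, d)

    def build(x, p, q):
        if p == q:
            return (x, words[p - 1])
        y, z, d = psi[(x, p, q)]
        return (x, build(y, p, d), build(z, d + 1, q))

    return delta[(start, 1, n)], build(start, 1, n)

# With the example PCFG, the best parse of aaaa uses A -> CB, C -> ED, B -> ED (prob 0.336)
lex = {("D", "a"): 1.0, ("E", "a"): 1.0}
rules = {("A", "B", "C"): 0.4, ("A", "C", "B"): 0.6, ("B", "D", "E"): 0.2,
         ("B", "E", "D"): 0.8, ("C", "D", "E"): 0.3, ("C", "E", "D"): 0.7}
print(viterbi_cyk("a a a a".split(), lex, rules))
```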

Estimating grammar probabilities. Two situations: observable syntactic data, where we have a set of sentences with the correct syntactic tree for every sentence (a treebank); and hidden syntactic structure, where we just have the sentences without the syntactic structure.

Treebank examples:
name                  Penn Treebank   Corpus Paris 7
words                 1 000 000       400 000
sentences             45 000          15 000
syntactic categories  26              13
part-of-speech tags   36              14
rules                 9 657

Building the grammar. From the treebank tree [S [GN [Det le] [N chat]] [GV [V dort]]] we read off the rules: S → GN GV, GN → Det N, GV → V, Det → le, N → chat, V → dort.

Estimating rule probabilities. Count the number of occurrences of the non-terminal symbol A in the treebank: C(A). Count the number of occurrences of the rule A → α: C(A → α). Estimate P(A → α) with relative frequencies: P(A → α) = C(A → α) / C(A).
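A sketch of this relative-frequency estimation (my own illustration): trees are nested tuples as in the parsing sketches above, here instantiated with the toy le chat dort tree.

```python
from collections import Counter

def extract_rules(tree):
    """Yield the rules (lhs, rhs) used in a tree given as nested tuples."""
    lhs, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (lhs, (children[0],))              # lexical rule, e.g. Det -> le
    else:
        yield (lhs, tuple(child[0] for child in children))
        for child in children:
            yield from extract_rules(child)

def estimate_pcfg(treebank):
    """P(A -> alpha) = C(A -> alpha) / C(A), by relative frequency."""
    rule_counts = Counter(r for tree in treebank for r in extract_rules(tree))
    lhs_counts = Counter()
    for (lhs, _), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Toy one-tree treebank: [S [GN [Det le] [N chat]] [GV [V dort]]]
tree = ("S", ("GN", ("Det", "le"), ("N", "chat")), ("GV", ("V", "dort")))
print(estimate_pcfg([tree]))  # every rule gets probability 1.0 here
```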

PCFG Limits: lexical independence. Rewriting a pre-terminal symbol X is independent of the context of X. Problem: the preposition that introduces the complement of a verb depends on the lexical nature of the verb. Example: The farmer gave an apple to John. This dependency is not modeled: VP → V NP PP, V → gave, PP → P NP, P → to.

PCFG Limits: structural independence. The choice of a rule for rewriting symbol X is independent of the context of X. Problem: subjects are realized as pronouns much more often than objects, but there is a single rule of the form NP → Pro, and hence a single probability.

Evaluation measures (Black et al., 1991). Given a sentence S, a candidate tree C and the correct tree R, a constituent in C is correct if there exists a constituent in R with the same span and the same category label.
labeled recall (LR) = # correct constituents in C / # constituents in R
labeled precision (LP) = # correct constituents in C / # constituents in C
F1: the harmonic mean of LR and LP, F1 = 2 · LR · LP / (LR + LP)
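A sketch of these measures (my own illustration): constituents are represented as (label, start, end) triples and compared as multisets; the spans in the example are hypothetical.

```python
from collections import Counter

def parseval(candidate, reference):
    """Labeled precision, recall and F1 over (label, start, end) constituents."""
    c, r = Counter(candidate), Counter(reference)
    correct = sum((c & r).values())          # multiset intersection
    lp = correct / sum(c.values())
    lr = correct / sum(r.values())
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

# Hypothetical example: the candidate misses one NP of the reference
reference = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7), ("NP", 2, 4), ("PP", 4, 7)]
candidate = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 4), ("PP", 4, 7)]
print(parseval(candidate, reference))  # LP = 1.0, LR ~ 0.83, F1 ~ 0.91
```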

Some results on the Penn Treebank:
                 LP    LR    F1
Baseline                     72.6
Magerman 1995    84.9  84.6  84.7
Collins 1996     86.3  85.8  85.9
Charniak 1997    87.4  87.5  87.4
Collins 1999     88.7  88.6  88.6
Petrov 2010                  91.8