Introduction to Probabilistic Natural Language Processing Alexis Nasr Laboratoire d'Informatique Fondamentale de Marseille
Natural Language Processing Use computers to process human languages Machine Translation Automatic Speech Recognition Optical Character Recognition Question Answering Information Extraction...
What do we need to process natural language? Represent linguistic knowledge: phonetics, morphology, lexicon, syntax, semantics. Automatically learn linguistic models: machine learning, corpora. Efficient algorithms: dynamic programming, linear programming.
Probabilistic Methods for NLP Using statistics in order to choose between several alternatives. Ambiguity can be real: buying books for children (two readings: the books are intended for children, or the buying is done on behalf of children), or may be due to a lack of knowledge: buying books for money (only one reading is plausible).
Ambiguity Everywhere in NLP : Speech Recognition Machine Translation Optical Character Recognition Part of Speech Tagging Syntactic Parsing...
Speech Recognition Homophones: the tail of the dog / the tale of the dog. Word boundaries: I scream / ice cream.
Part of Speech Tagging: time flies like an arrow, with possible tags: time N/V, flies N/V, like V/Prep, an Det, arrow N.
Syntactic parsing Part of speech ambiguity, attachment ambiguity: John saw a man with a telescope (two parses: the prepositional phrase attaches to the verb or to the noun).
Machine Translation The river banks → les berges de la rivière; The banks closed → les banques ont fermé.
How to resolve ambiguity? Add knowledge: phonetic/phonological, lexical, syntactic, semantic, pragmatic, world knowledge. Compute a score for every possibility: S(the tail of the dog) > S(the tale of the dog). Such scores can be probabilities; they are then computed with a probabilistic or stochastic model.
Example: computing the probability of a sequence of words. Needed in many applications: speech recognition, optical character recognition, machine translation. A model that associates a probability to any sequence of words is also called a language model. What do we expect from the language model? Give a higher probability to correct sequences: les traits très tirés, rather than les traits très tirées, les très traient tiraient, les très très tiraient. Give a higher probability to likely sequences: les traits très tirés, rather than lettrés très tirés.
Unigram model We use the occurrence probability of the words of the sequence: P(w_1 ... w_n) = ∏_{i=1}^{n} P(w_i), e.g. P(les traits très tirés) = P(les) P(traits) P(très) P(tirés). The model is based on completely wrong independence assumptions!
sequence | log P
les très très tirés | 14.265
les trait très tiré | 15.3102
les traits très tiré | 15.4028
les trait très tirés | 15.8386
les traits très tirés | 15.9312
Unigram model: probability estimation How to estimate P(w_i) with 1 ≤ i ≤ V? Take a very large corpus o_1 ... o_n (each o_i is a word occurrence). Compute the number of occurrences of word w: C(w) = ∑_{i=1}^{n} δ_{o_i,w}, where δ_{o_i,w} = 1 if o_i = w and 0 otherwise. Then take the relative frequency of w: P(w) = C(w) / ∑_{i=1}^{V} C(w_i). V probabilities to estimate.
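As an illustration of this relative-frequency estimation, here is a minimal Python sketch (the function names and the toy corpus are illustrative, not part of the course material):

    from collections import Counter
    import math

    def train_unigram(tokens):
        """Relative-frequency estimates P(w) = C(w) / sum_i C(w_i)."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def unigram_logprob(model, sentence, floor=1e-12):
        """Sum of log P(w_i); unseen words get a tiny floor probability (no real smoothing)."""
        return sum(math.log(model.get(w, floor)) for w in sentence)

    # Toy usage on a hypothetical miniature corpus:
    corpus = "les traits très tirés . les très grands traits .".split()
    model = train_unigram(corpus)
    print(unigram_logprob(model, "les traits très tirés".split()))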
Training Data Newspaper Le Monde, 1986-2002: 16 479 270 sentences, 370 005 285 word occurrences.
Number of different unigrams: plot of the number of distinct unigrams (y-axis, up to about 800 000) as a function of the number of sentences (x-axis, up to about 1.2e+07).
Rank, frequency and frequency of frequencies
rank | freq | f2f | unigram
1 | 13 286 304 | 1 | de
2 | 6 964 863 | 1 | la
3 | 5 900 839 | 1 | le
4 | 5 599 010 | 1 | l
5 | 5 017 018 | 1 | et
6 | 4 762 293 | 1 | les
7 | 4 208 264 | 1 | des
8 | 3 856 293 | 1 | d
9 | 3 695 434 | 1 | un
10 | 3 425 787 | 1 | en
Rank, frequency and frequency of frequencies
rank | freq | f2f | unigram
8532 | 10 | 7 992 | abaisseront, Abott, antihypertenseur
8533 | 9 | 9 369 |
8534 | 8 | 11 140 |
8535 | 7 | 13 671 |
8536 | 6 | 17 351 |
8537 | 5 | 22 684 |
8538 | 4 | 31 980 |
8539 | 3 | 50 311 |
8540 | 2 | 99 581 |
8541 | 1 | 356 780 | absolumen, absorbable, Ambrozic
Frequency of frequencies for unigrams: four plots of the frequency of frequency (y-axis) against frequency (x-axis), on lin-lin scales over the whole range, lin-lin zoomed on frequencies up to 50, log-lin, and log-log scales.
Zipf's Law In many human phenomena there is a linear relation between frequency f and the inverse of rank r: f(r) = k · 1/r. Zipf's interpretation: a way to minimize the effort of the speaker and the hearer: a small number of frequent words minimizes speaker effort, a large number of rare words (low ambiguity) minimizes hearer effort. Roughly: few very frequent words and many rare words.
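A quick way to check the law on a corpus is to tabulate rank times frequency, which f(r) = k/r predicts to stay roughly constant; a small illustrative sketch (names are not from the course material):

    from collections import Counter

    def zipf_table(tokens, top=10):
        """For the `top` most frequent words, report rank, word, frequency and rank * frequency;
        under Zipf's law f(r) = k / r, the last column should stay roughly constant."""
        ranked = Counter(tokens).most_common(top)
        return [(rank, word, freq, rank * freq)
                for rank, (word, freq) in enumerate(ranked, start=1)]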
Bigram Model We use the probability that word a follows word b: P(w_{i+1} = a | w_i = b). P(w_1 ... w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-1}), e.g. P(les traits très tirés) = P(les) P(traits | les) P(très | traits) P(tirés | très).
les très très tirés | 13.4767
les traits très tiré | 13.8494
les traits très tirés | 14.1414
les trait très tirés | 18.2462
les trait très tiré | 18.5381
Bigram Models: estimation Compute the number of occurrences of the sequence ab: C(a, b) = ∑_{i=1}^{N-1} δ_{o_i,a} δ_{o_{i+1},b}. Then take the relative frequency: P(b | a) = C(a, b) / C(a). V² probabilities to estimate.
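A minimal sketch of bigram estimation and scoring, assuming the corpus is already a list of tokens (function names are illustrative):

    from collections import Counter
    import math

    def train_bigram(tokens):
        """Counts needed for the relative-frequency estimate P(b | a) = C(a, b) / C(a)."""
        unigram = Counter(tokens)
        bigram = Counter(zip(tokens, tokens[1:]))
        return unigram, bigram

    def bigram_logprob(unigram, bigram, sentence):
        """log P(w_1) + sum_i log P(w_i | w_{i-1}); unseen events give -inf (no smoothing)."""
        total = sum(unigram.values())
        def log_or_ninf(x):
            return math.log(x) if x > 0 else float("-inf")
        lp = log_or_ninf(unigram[sentence[0]] / total)
        for a, b in zip(sentence, sentence[1:]):
            lp += log_or_ninf(bigram[(a, b)] / unigram[a]) if unigram[a] else float("-inf")
        return lp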
Number of different bigrams: plot of the number of distinct bigrams and of distinct unigrams (y-axis, up to about 2.5e+07) as a function of the number of sentences (x-axis).
Rank, frequency and frequency of frequencies
rank | freq | f2f | bigram
1 | 2 091 936 | 1 | de la
2 | 1 563 496 | 1 | de l
3 | 1 156 551 | 1 | neuf cent
4 | 1 139 331 | 1 | mille neuf
5 | 921 010 | 1 | cent quatre vingt
6 | 744 220 | 1 | d un
7 | 578 140 | 1 | d une
8 | 571 205 | 1 | c est
9 | 556 919 | 1 | et de
10 | 541 528 | 1 | en mille
Rank, frequency and frequency of frequencies
rank | freq | f2f
8817 | 10 | 121 362
8818 | 9 | 146 471
8819 | 8 | 181 721
8820 | 7 | 231 738
8821 | 6 | 307 461
8822 | 5 | 429 606
8823 | 4 | 655 244
8824 | 3 | 1 148 479
8825 | 2 | 2 680 674
8826 | 1 | 13 808 569
Frequency of frequencies for bigrams: three plots of the frequency of frequency (y-axis) against frequency (x-axis), on lin-lin (frequencies up to 50), log-lin, and log-log scales.
Trigram Model We use the probability that c follows the sequence ab: P(w_{i+2} = c | w_i = a, w_{i+1} = b). P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) ∏_{i=3}^{n} P(w_i | w_{i-2}, w_{i-1}), e.g. P(les traits très tirés) = P(les) P(traits | les) P(très | les, traits) P(tirés | traits, très).
les très très tirés | 24.0764
les traits très tirés | 28.134
les traits très tiré | 28.2369
les trait très tirés | 31.0759
les trait très tiré | 31.1788
Trigram Model: estimation Compute the number of occurrences of the word sequence abc: C(a, b, c) = ∑_{i=1}^{N-2} δ_{o_i,a} δ_{o_{i+1},b} δ_{o_{i+2},c}. Then take the relative frequency: P(c | a, b) = C(a, b, c) / C(a, b). V³ probabilities to estimate.
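The same counting scheme generalizes to any order; a small illustrative sketch that covers the trigram case P(c | a, b) = C(a, b, c) / C(a, b):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all n-grams o_i ... o_{i+n-1} in the token stream."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def cond_prob(tokens, context, word):
        """P(word | context) = C(context + word) / C(context), e.g. P(c | a, b)."""
        n = len(context) + 1
        num = ngram_counts(tokens, n)[tuple(context) + (word,)]
        den = ngram_counts(tokens, n - 1)[tuple(context)]
        return num / den if den else 0.0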
Number of different trigrams: plot of the number of distinct trigrams, bigrams and unigrams (y-axis, up to about 9e+07) as a function of the number of sentences (x-axis).
Rank, frequency and frequency of frequencies
rank | freq | f2f | trigram
1 | 1 135 994 | 1 | mille neuf cent
2 | 792 656 | 1 | en mille neuf
3 | 379 902 | 1 | cent quatre vingt dix
4 | 191 324 | 1 | n est pas
5 | 184 626 | 1 | neuf cent soixante
6 | 167 909 | 1 | il y a
7 | 160 077 | 1 | de mille neuf
8 | 145 121 | 1 | n a pas
9 | 95 683 | 1 | et de la
10 | 95 201 | 1 | cent soixante dix
Rank, frequency and frequency of frequencies
rank | freq | f2f
5193 | 10 | 242 472
5194 | 9 | 299 876
5195 | 8 | 382 270
5196 | 7 | 502 171
5197 | 6 | 691 755
5198 | 5 | 1 013 295
5199 | 4 | 1 641 586
5200 | 3 | 3 124 196
5201 | 2 | 8 523 984
5202 | 1 | 64 425 232
Frequency of frequencies for trigrams: three plots of the frequency of frequency (y-axis) against frequency (x-axis), on lin-lin (frequencies up to 50), log-lin, and log-log scales.
Limits of n-gram models They fail to model arbitrarily long-distance phenomena: the speed reaches..., the speed of the waves reaches..., the speed of the seismic waves reaches..., the speed of the large seismic waves reaches...
Syntactic structure of the sentence can help [S [NPs [D the] [Ns speed]] [VP3s [V3s reaches]]]
Syntactic structure of the sentence can help [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [Np waves]]]] [VP3s [V3s reaches]]]
Syntactic structure of the sentence can help [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [NPp [AP seismic] [Np waves]]]]] [VP3s [V3s reaches]]]
Use grammars to compute the probability of a sentence Agreement between the subject and the verb is modeled by the same rule in the different sentences. The rule is independent of the length of the noun phrase. The grammar can easily assign a higher probability to the correct sentence: P(S → NPs VPs) > P(S → NPs VPp). The grammar can be used as a language model: for any sentence S ∈ L(G) it allows us to compute P(S).
Context Free Grammars A CFG is made of: a non-terminal alphabet N = {N^1 ... N^n}, a terminal alphabet T = {t_1 ... t_m}, a start symbol N^1, and a set of rewrite rules N^i → α with α ∈ (N ∪ T)*. Without loss of generality we can consider that the rules are in Chomsky Normal Form: rules are of the form N^i → N^j N^k or N^i → t, with N^i, N^j, N^k ∈ N and t ∈ T.
Language and syntactic structure A CFG G defines a language L(G) ⊆ T* and associates to every string w ∈ L(G) one syntactic structure or more (more than one when G is ambiguous).
Example A → BC, A → CB, B → DE, B → ED, C → DE, C → ED, D → a, E → a. A sample derivation: A ⇒ BC ⇒ DEC ⇒ aEC ⇒ aaC ⇒ aaED ⇒ aaaD ⇒ aaaa. L(G) = {aaaa}.
Ambiguity The sentence aaaa has several distinct parse trees: the root A rewrites as BC or CB, and each of B and C rewrites as DE or ED, giving eight trees in total.
Probabilistic Context Free Grammar A PCFG is made of: a non-terminal alphabet N = {N^1 ... N^n}, a terminal alphabet T = {t_1 ... t_m}, a start symbol N^1, and a set of rewrite rules N^i → α with α ∈ (N ∪ T)*. Without loss of generality we can consider that the rules are in Chomsky Normal Form: rules are of the form N^i → N^j N^k or N^i → t, with N^i, N^j, N^k ∈ N and t ∈ T. A probability distribution for every N^i: ∑_j P(N^i → α_j) = 1.
Probabilities The probability of a rule is the probability of choosing this rule to rewrite its left-hand-side symbol: P(N^i → α_j) = P(N^i → α_j | N^i). Probability of a tree T: P(T) = ∏_{N ∈ T} P(r(N)), where N ∈ N and r(N) is the rule used to rewrite N in T. Probability of a sentence S: P(S) = ∑_{T ∈ T(S)} P(T).
Example A → BC 0.4, A → CB 0.6, B → DE 0.2, B → ED 0.8, C → DE 0.3, C → ED 0.7, D → a 1.0, E → a 1.0.
Parse probabilities The eight parse trees of aaaa have probabilities 0.024, 0.056, 0.096 and 0.224 for the trees rooted in A → BC, and 0.036, 0.144, 0.084 and 0.336 for the trees rooted in A → CB.
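A small sketch, using the example grammar above, of P(T) computed as a product of rule probabilities (the nested-tuple tree encoding is an illustrative choice, not the course's notation):

    import math

    # Rule probabilities of the example grammar above.
    RULES = {
        ("A", ("B", "C")): 0.4, ("A", ("C", "B")): 0.6,
        ("B", ("D", "E")): 0.2, ("B", ("E", "D")): 0.8,
        ("C", ("D", "E")): 0.3, ("C", ("E", "D")): 0.7,
        ("D", ("a",)): 1.0,     ("E", ("a",)): 1.0,
    }

    def tree_prob(tree):
        """P(T) = product over the nodes N of T of P(r(N)), where r(N) rewrites N."""
        label, *children = tree
        if not children:                      # terminal leaf: no rule applied
            return 1.0
        rhs = tuple(child[0] for child in children)
        return RULES[(label, rhs)] * math.prod(tree_prob(c) for c in children)

    # First tree above: A -> B C, B -> D E, C -> D E, all leaves rewritten as a.
    t = ("A", ("B", ("D", ("a",)), ("E", ("a",))), ("C", ("D", ("a",)), ("E", ("a",))))
    print(tree_prob(t))   # 0.4 * 0.2 * 0.3 = 0.024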
Probabilities we want to compute We would like to compute efficiently P(S) = ∑_{T ∈ T(S)} P(T), where T(S) is the set of all the syntactic structures that the grammar G associates to the sentence S. We would also like to find the most likely analysis of the sentence: T̂ = argmax_{T ∈ T(S)} P(T). T̂ is useful for many other tasks such as text understanding, machine translation, question answering...
But grammars of Natural Languages are ambiguous: plot of the number of analyses (y-axis, log scale up to 1e+08, maximum and mean curves) as a function of sentence length (x-axis, up to 55 words).
Efficiently building the set T(S) with the CYK Algorithm
input: a CFG G in Chomsky Normal Form, a sentence w_1 ... w_n
output: a table t such that N^j ∈ t_{i,j} iff N^j ⇒* w_i ... w_j
algorithm:
for i = 1 to n do { INITIALISATION } t_{i,i} = {A | A → w_i}
for j = 1 to n do
  for i = j-1 down to 1 do
    for k = i to j-1 do
      t_{i,j} = t_{i,j} ∪ {A | A → BC with B ∈ t_{i,k} and C ∈ t_{k+1,j}}
(a runnable sketch is given after the worked example below)
CYK worked example: parsing a a a a with the grammar A → BC, A → CB, B → DE, B → ED, C → DE, C → ED, D → a, E → a. The diagonal cells t_{i,i} are filled with {D, E} (both D → a and E → a), the cells spanning two words receive {B, C}, the cells spanning three words stay empty, and the top cell t_{1,4} receives A: the sentence belongs to L(G).
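Here is a runnable sketch of the recognition algorithm on this worked example (the dictionary encoding of the grammar is an illustrative choice):

    def cyk(words, lexical, binary):
        """CYK recognition: t[(i, j)] holds the non-terminals deriving w_i ... w_j (1-based, inclusive).

        lexical: dict terminal -> set of non-terminals A with A -> terminal
        binary:  dict (B, C)   -> set of non-terminals A with A -> B C
        """
        n = len(words)
        t = {(i, j): set() for i in range(1, n + 1) for j in range(i, n + 1)}
        for i, w in enumerate(words, start=1):              # initialisation of t[i][i]
            t[(i, i)] = set(lexical.get(w, ()))
        for j in range(1, n + 1):
            for i in range(j - 1, 0, -1):
                for k in range(i, j):                       # split point: [i..k] + [k+1..j]
                    for B in t[(i, k)]:
                        for C in t[(k + 1, j)]:
                            t[(i, j)] |= binary.get((B, C), set())
        return t

    # The example grammar of the previous slides:
    lexical = {"a": {"D", "E"}}
    binary = {("B", "C"): {"A"}, ("C", "B"): {"A"},
              ("D", "E"): {"B", "C"}, ("E", "D"): {"B", "C"}}
    table = cyk("a a a a".split(), lexical, binary)
    print("A" in table[(1, 4)])   # True: the sentence is in L(G)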
Packed parse forest Instantiated symbol: e.g. A_{1..5}. Instantiated production: e.g. A_{1..5} → B_{1..3} C_{3..5} or A_{1..5} → C_{1..3} B_{3..5}. The forest for a a a a contains the instantiated symbols A_{1..5}, B_{1..3}, C_{1..3}, C_{3..5}, B_{3..5}, D_{1..2}, E_{1..2}, D_{2..3}, E_{2..3}, D_{3..4}, E_{3..4}, D_{4..5}, E_{4..5} and the terminals a_{1..2}, a_{2..3}, a_{3..4}, a_{4..5}.
Three problems to solve Compute the probability of a sentence S: P(S) = ∑_{T ∈ T(S)} P(T). Build the most probable syntactic tree for S: T̂ = argmax_{T ∈ T(S)} P(T). Estimate the probabilities of G using data D: Ĝ = argmax_G P(D | G).
Notations w_1 ... w_n: the sentence to parse. w_{p,q}: the segment w_p ... w_q of the sentence. m_i: a symbol of the terminal alphabet. N^j: a symbol of the non-terminal alphabet. N^j_{p,q}: the symbol N^j derives the segment w_{p,q} (N^j ⇒* w_{p,q}).
Inside Probabilities β_j(p, q) ≝ P(w_{p,q} | N^j_{p,q}), the probability that N^j derives the words w_p ... w_q.
Probability of a sentence P(w_{1,n}) = P(N^1 ⇒* w_{1,n}) = P(w_{1,n} | N^1_{1,n}) = β_1(1, n)
Recursive computation of inside probabilities The rule N^j → N^r N^s splits the segment: N^r derives w_p ... w_d and N^s derives w_{d+1} ... w_q. Bottom up: we compute β_j(p, q) after β_r(p, d) and β_s(d+1, q) have been computed.
Recursive computation of inside probabilities Recursive formula:
β_j(p, q) = P(w_{p,q} | N^j_{p,q})
= ∑_{r,s} ∑_{d=p}^{q-1} P(w_{p,d}, N^r_{p,d}, w_{d+1,q}, N^s_{d+1,q} | N^j_{p,q})
= ∑_{r,s} ∑_{d=p}^{q-1} P(N^r_{p,d}, N^s_{d+1,q} | N^j_{p,q}) P(w_{p,d} | N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q}) P(w_{d+1,q} | N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q})
= ∑_{r,s} ∑_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q)
Recursive computation of inside probabilities Base case: β_j(k, k) = P(w_k | N^j_{k,k}) = P(N^j → w_k)
Relation with the CYK algorithm N^j_{p,q} corresponds to the presence of symbol N^j in cell t_{p,q}; β_j(p, q) can be computed while filling the table t.
Computing P(S) with CYK
for q = 1 to n do { INITIALISATION }
  for p = q down to 1 do
    if (p == q) β_j(p, p) = P(N^j → w_p) otherwise β_j(p, q) = 0
for q = 1 to n do
  for p = q-1 down to 1 do
    for d = p to q-1 do
      β_j(p, q) = β_j(p, q) + P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q), with N^r ∈ t_{p,d} and N^s ∈ t_{d+1,q}
P(S) = β_1(1, n)
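A minimal sketch of this inside computation, with the grammar given as dictionaries of rule probabilities (an illustrative encoding):

    from collections import defaultdict

    def inside(words, lexical, binary):
        """Inside probabilities beta[(A, p, q)] = P(w_p..q | A), computed bottom-up as in CYK.

        lexical: dict (A, terminal) -> P(A -> terminal)
        binary:  dict (A, B, C)     -> P(A -> B C)
        """
        n = len(words)
        beta = defaultdict(float)
        for p, w in enumerate(words, start=1):                 # base case beta_j(p, p)
            for (A, term), prob in lexical.items():
                if term == w:
                    beta[(A, p, p)] = prob
        for q in range(1, n + 1):
            for p in range(q - 1, 0, -1):
                for d in range(p, q):
                    for (A, B, C), prob in binary.items():
                        beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
        return beta

    # P(S) = beta_1(1, n) for the start symbol, e.g. beta[("A", 1, len(words))].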
Computing P(T̂) δ_i(p, q) = probability of the most probable subtree rooted in N^i_{p,q}. 1 Initialisation: δ_i(p, p) = P(N^i → w_p). 2 Recursive step: δ_i(p, q) = max_{1 ≤ j,k ≤ n, p ≤ d < q} P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q). 3 End: P(T̂) = δ_1(1, n).
Computing P(T̂) with CYK
for q = 1 to n do { INITIALISATION }
  for p = q down to 1 do
    if (p == q) δ_i(p, p) = P(N^i → w_p) otherwise δ_i(p, q) = 0
for q = 1 to n do
  for p = q-1 down to 1 do
    for d = p to q-1 do
      δ_i(p, q) = max(δ_i(p, q), P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q)), with N^j ∈ t_{p,d} and N^k ∈ t_{d+1,q}
P(T̂) = δ_1(1, n)
Building T̂ ψ_i(p, q) = (j, k, d) records the rule and split point that achieved the maximum δ_i(p, q): ψ_i(p, q) = argmax_{j,k,d} P(N^i → N^j N^k) δ_j(p, d) δ_k(d+1, q). root(T̂) = N^1_{1,n}. If ψ_i(p, q) = (j, k, d) then left_child(N^i_{p,q}) = N^j_{p,d} and right_child(N^i_{p,q}) = N^k_{d+1,q}.
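A sketch of the corresponding Viterbi computation with back-pointers ψ, using the same illustrative grammar encoding as the inside sketch above:

    def viterbi_cyk(words, lexical, binary):
        """delta[(A, p, q)] = probability of the best subtree rooted in A over w_p..w_q;
        psi stores the (B, C, d) achieving the maximum, used to rebuild the best tree."""
        n = len(words)
        delta, psi = {}, {}
        for p, w in enumerate(words, start=1):
            for (A, term), prob in lexical.items():
                if term == w and prob > delta.get((A, p, p), 0.0):
                    delta[(A, p, p)] = prob
        for q in range(1, n + 1):
            for p in range(q - 1, 0, -1):
                for d in range(p, q):
                    for (A, B, C), prob in binary.items():
                        cand = prob * delta.get((B, p, d), 0.0) * delta.get((C, d + 1, q), 0.0)
                        if cand > delta.get((A, p, q), 0.0):
                            delta[(A, p, q)] = cand
                            psi[(A, p, q)] = (B, C, d)
        return delta, psi

    def best_tree(psi, words, A, p, q):
        """Follow the back-pointers psi to rebuild the most probable tree rooted in A over w_p..w_q."""
        if p == q:                                # pre-terminal: A -> w_p
            return (A, words[p - 1])
        B, C, d = psi[(A, p, q)]
        return (A, best_tree(psi, words, B, p, d), best_tree(psi, words, C, d + 1, q))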
Estimating grammar probabilities Two situations: Observable syntactic data: we have a set of sentences with the correct syntactic tree for every sentence (a treebank). Hidden syntactic structure: we just have the sentences, without the syntactic structure.
Treebank examples
name | Penn Treebank | Corpus Paris 7
words | 1 000 000 | 400 000
sentences | 45 000 | 15 000
synt. cat. | 26 | 13
part of speech tags | 36 | 14
rules | 9 657 |
Building the grammar From the treebank tree [S [GN [Det le] [N chat]] [GV [V dort]]] we extract the rules S → GN GV, GN → Det N, GV → V, Det → le, N → chat, V → dort.
Estimating rule probabilities Count the number of occurrences of the non-terminal symbol A in the treebank: C(A). Count the number of occurrences of the rule A → α: C(A → α). Estimate P(A → α) with relative frequencies: P(A → α) = C(A → α) / C(A).
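A minimal sketch of this relative-frequency estimation from a treebank of nested-tuple trees (the encoding and names are illustrative):

    from collections import Counter

    def count_rules(tree, rule_counts, lhs_counts):
        """Collect C(A -> alpha) and C(A) from one tree given as nested tuples (label, children...)."""
        label, *children = tree
        if not children:                          # should not happen for well-formed trees
            return
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for c in children:
            if not isinstance(c, str):            # recurse on non-terminal children only
                count_rules(c, rule_counts, lhs_counts)

    def estimate_pcfg(treebank):
        """P(A -> alpha) = C(A -> alpha) / C(A), relative frequencies over the treebank."""
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in treebank:
            count_rules(tree, rule_counts, lhs_counts)
        return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

    # The example tree of the previous slide, [S [GN [Det le] [N chat]] [GV [V dort]]]:
    tree = ("S", ("GN", ("Det", "le"), ("N", "chat")), ("GV", ("V", "dort")))
    print(estimate_pcfg([tree]))   # every rule gets probability 1.0 on this one-tree treebank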
PCFG Limits: lexical independence Rewriting a pre-terminal symbol X is independent of the context of X. Problem: the preposition that introduces the complement of a verb depends on the lexical nature of the verb. Example: The farmer gave an apple to John. This dependency is not modeled: VP → V NP PP, V → gave, PP → P NP, P → to.
PCFG Limits: structural independence The choice of a rule for rewriting symbol X is independent of the context of X. Problem: subjects are realized as pronouns much more often than objects, but there is a single rule of the form NP → Pro and hence a single probability.
Evaluation measures (Black et al., 1991) Given a sentence S, a candidate tree C and the correct tree R, a syntagm in C is correct if there exists a syntagm in R with the same span and labeled with the same category. Labeled recall: LR = (# correct syntagms in C) / (# syntagms in R). Labeled precision: LP = (# correct syntagms in C) / (# syntagms in C). F1: the harmonic mean of LR and LP, F1 = 2·LR·LP / (LR + LP).
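A small sketch of these measures over constituents represented as (label, start, end) spans (the example spans below are hypothetical):

    def prf(candidate, reference):
        """Labeled precision, recall and F1 over constituents given as sets of (label, start, end) spans."""
        correct = len(candidate & reference)
        lp = correct / len(candidate) if candidate else 0.0
        lr = correct / len(reference) if reference else 0.0
        f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
        return lp, lr, f1

    # Hypothetical spans (label, first word, last word):
    C = {("S", 1, 5), ("NP", 1, 2), ("VP", 3, 5), ("PP", 4, 5)}
    R = {("S", 1, 5), ("NP", 1, 2), ("VP", 3, 5), ("NP", 4, 5)}
    print(prf(C, R))   # (0.75, 0.75, 0.75)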
Some results on the Penn Treebank
| LP | LR | F1
Baseline | | | 72.6
Magerman 1995 | 84.9 | 84.6 | 84.7
Collins 1996 | 86.3 | 85.8 | 85.9
Charniak 1997 | 87.4 | 87.5 | 87.4
Collins 1999 | 88.7 | 88.6 | 88.6
Petrov 2010 | | | 91.8