Introduction to Probabilistic Natural Language Processing


1 Introduction to Probabilistic Natural Language Processing Alexis Nasr Laboratoire d'Informatique Fondamentale de Marseille

2 Natural Language Processing Use computers to process human languages: Machine Translation, Automatic Speech Recognition, Optical Character Recognition, Question Answering, Information Extraction...

3 What do we need to Process Natural Language? Represent linguistic knowledge: phonetics, morphology, lexicon, syntax, semantics. Automatically learn linguistic models: machine learning, corpora. Efficient algorithms: dynamic programming, linear programming.

4 Probabilistic Methods for NLP Using statistics in order to choose between several alternatives. Ambiguity can be real: buying books for children (the buying is done for children, or the books are intended for children), or may only be due to a lack of knowledge: buying books for money.

5 Ambiguity Everywhere in NLP: speech recognition, machine translation, optical character recognition, part of speech tagging, syntactic parsing...

6 Speech Recognition Homophones: the tail of the dog / the tale of the dog. Word boundaries: I scream / ice cream.

7 Part of Speech Tagging time flies like an arrow: each word admits several tags, e.g. time (N, V), flies (N, V), like (V, Prep), an (Det), arrow (N).

8 Syntactic parsing Part of speech ambiguity. Attachment ambiguity: John saw a man with a telescope (the prepositional phrase attaches either to saw or to man).

9 Machine Translation The river banks → les berges de la rivière. The banks closed → les banques ont fermé.

10 How to resolve ambiguity? Add knowledge: phonetics/phonology, lexical, syntactic, semantic, pragmatic, world knowledge. Or compute a score for every possibility: S(the tail of the dog) > S(the tale of the dog). Such scores can be probabilities; they are computed with a probabilistic or stochastic model.

11 Example: computing the probability of a sequence of words. Needed in many applications: speech recognition, optical character recognition, machine translation. A model that assigns a probability to any sequence of words is called a language model. What do we expect from the language model? Give a higher probability to correct sequences than to incorrect ones: les traits très tirés vs. les traits très tirées, les très traient tiraient, les très très tiraient. Give a higher probability to likely sequences than to unlikely ones: les traits très tirés vs. lettrés très tirés.

12 Unigram model We use the occurrence probability of the words of the sequence: $P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i)$, e.g. $P(\text{les traits très tirés}) = P(\text{les}) \cdot P(\text{traits}) \cdot P(\text{très}) \cdot P(\text{tirés})$. The model is based on completely wrong hypotheses! Candidate sequences ranked by log P: les très très tirés, les trait très tiré, les traits très tiré, les trait très tirés, les traits très tirés (the numeric values were lost in transcription).

13 Unigram model: probability estimation How to estimate $P(w_i)$ with $1 \le i \le V$? Take a very large corpus $o_1 \dots o_N$ (each $o_i$ is a word occurrence). Compute the number of occurrences of word $w$: $C(w) = \sum_{i=1}^{N} \delta_{o_i,w}$, where $\delta_{o_i,w} = 1$ if $o_i = w$ and $0$ otherwise. Then the relative frequency of $w$: $P(w) = \frac{C(w)}{\sum_{i=1}^{V} C(w_i)}$. There are $V$ probabilities to estimate.
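A minimal sketch of this relative-frequency estimation and of the unigram sentence score, assuming a corpus given as a list of tokenised sentences (the toy corpus and function names are illustrative, not from the slides):

```python
from collections import Counter
from math import log

def train_unigram(corpus):
    """Relative-frequency (maximum likelihood) estimate of P(w)."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_logprob(model, sentence):
    """log P(w_1 ... w_n) = sum_i log P(w_i); -inf if a word was never seen."""
    if any(w not in model for w in sentence):
        return float("-inf")
    return sum(log(model[w]) for w in sentence)

# Toy corpus, purely illustrative
corpus = [["les", "traits", "très", "tirés"], ["les", "traits"], ["très", "tirés"]]
model = train_unigram(corpus)
print(unigram_logprob(model, ["les", "traits", "très", "tirés"]))
```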

14 Training Data Newspaper Le Monde: number of sentences and number of word occurrences (the figures were lost in transcription).

15 Number of different unigrams [plot: number of distinct unigrams as a function of the number of sentences]

16 Rank, frequency and frequency of frequencies [table with columns rank, freq, f2f, unigram; the most frequent unigrams are de, la, le, l', et, les, des, d', un, en; numeric columns lost in transcription]

17 Rank, frequency and frequency of frequencies [same table at the bottom of the ranking: rare unigrams such as abaisseront, Abott, antihypertenseur, absolumen, absorbable, Ambrozic; numeric columns lost in transcription]

18 Frequency of frequencies, lin-lin [plot: frequency of frequencies vs. frequency, linear scales]

19 Frequency of frequencies, lin-lin [plot: frequency of frequencies vs. frequency, linear scales]

20 Frequency of frequencies, log-lin [plot: frequency of frequencies vs. frequency, log-lin scales]

21 Frequency of frequencies, log-log [plot: frequency of frequencies vs. frequency, log-log scales]

22 Zipf Law In many human phenomena, there is a linear relation between frequency ($f$) and the inverse of rank ($r$): $f(r) = k \cdot \frac{1}{r}$. Zipf's interpretation: a way to minimize the effort of the speaker and the hearer: a small number of frequent words minimizes speaker effort, a large number of rare words (low ambiguity) minimizes hearer effort. Roughly: few very frequent words and many rare words.
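As a rough check of Zipf's law on any tokenised text, one can compare the observed frequency at rank r with k/r, taking k as the frequency of the most frequent word. A small sketch (the file name in the commented call is hypothetical):

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Print rank, word, observed frequency and the Zipf prediction k/r."""
    ranked = Counter(tokens).most_common()
    k = ranked[0][1]                      # frequency of the most frequent word
    for r, (word, freq) in enumerate(ranked[:top], start=1):
        print(f"{r:>4} {word:<15} observed={freq:<8} k/r={k / r:.1f}")

# zipf_table(open("lemonde.txt", encoding="utf-8").read().split())
```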

23 Bigram Model We use the probability that word $a$ follows word $b$: $P(w_{i+1} = a \mid w_i = b)$. $P(w_1 \dots w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$, e.g. $P(\text{les traits très tirés}) = P(\text{les}) \cdot P(\text{traits} \mid \text{les}) \cdot P(\text{très} \mid \text{traits}) \cdot P(\text{tirés} \mid \text{très})$. Candidate sequences ranked by this model: les très très tirés, les traits très tiré, les traits très tirés, les trait très tirés, les trait très tiré (numeric values lost in transcription).

24 Bigram Models: estimation Compute the number of occurrences of the sequence $ab$: $C(a,b) = \sum_{i=1}^{N-1} \delta_{o_i,a}\,\delta_{o_{i+1},b}$. Then the relative frequency: $P(b \mid a) = \frac{C(a,b)}{C(a)}$. There are $V^2$ probabilities to estimate.
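A sketch of the same relative-frequency estimation for bigrams, reusing the list-of-sentences corpus format from the unigram sketch; unseen events simply get probability 0 here, since smoothing is not covered on these slides:

```python
from collections import Counter
from math import log

def train_bigram(corpus):
    """Counts C(a) and C(a, b) needed for P(b | a) = C(a, b) / C(a)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def bigram_logprob(unigrams, bigrams, sentence):
    """log P(w_1) + sum_{i>1} log P(w_i | w_{i-1}); -inf for unseen events."""
    total = sum(unigrams.values())
    if unigrams[sentence[0]] == 0:
        return float("-inf")
    lp = log(unigrams[sentence[0]] / total)
    for a, b in zip(sentence, sentence[1:]):
        if bigrams[(a, b)] == 0:
            return float("-inf")
        lp += log(bigrams[(a, b)] / unigrams[a])
    return lp

corpus = [["les", "traits", "très", "tirés"], ["les", "traits", "très", "tirées"]]
print(bigram_logprob(*train_bigram(corpus), ["les", "traits", "très", "tirés"]))
```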

25 Number of different bigrams [plot: number of distinct bigrams and unigrams as a function of the number of sentences]

26 Rank, frequency and frequency of frequencies [table with columns rank, freq, f2f, bigram; the most frequent bigrams include de la, de l', neuf cent, mille neuf, d'un, d'une, c'est, et de, en mille, and fragments of spelled-out numbers such as cent quatre vingt; numeric columns lost in transcription]

27 Rank, frequency and frequency of frequencies [same table at the bottom of the bigram ranking; content lost in transcription]

28 Frequency of frequencies, lin-lin [plot: frequency of frequencies vs. frequency for bigrams, linear scales]

29 Frequency of frequencies, log-lin [plot: frequency of frequencies vs. frequency for bigrams, log-lin scales]

30 Frequency of frequencies, log-log [plot: frequency of frequencies vs. frequency for bigrams, log-log scales]

31 Trigram Model We use the probability that $c$ follows the sequence $ab$: $P(w_{i+2} = c \mid w_i = a, w_{i+1} = b)$. $P(w_1 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}\,w_{i-1})$, e.g. $P(\text{les traits très tirés}) = P(\text{les}) \cdot P(\text{traits} \mid \text{les}) \cdot P(\text{très} \mid \text{les, traits}) \cdot P(\text{tirés} \mid \text{traits, très})$. Candidate sequences ranked by this model: les très très tirés, les traits très tirés, les traits très tiré, les trait très tirés, les trait très tiré (numeric values lost in transcription).

32 Trigram Model: estimation Compute the number of occurrences of the word sequence $abc$: $C(a,b,c) = \sum_{i=1}^{N-2} \delta_{o_i,a}\,\delta_{o_{i+1},b}\,\delta_{o_{i+2},c}$. Then the relative frequencies: $P(c \mid a,b) = \frac{C(a,b,c)}{C(a,b)}$. There are $V^3$ probabilities to estimate.

33 Number of different trigrams [plot: number of distinct trigrams, bigrams and unigrams as a function of the number of sentences]

34 Rank, frequency and frequency of frequencies [table with columns rank, freq, f2f, trigram; the most frequent trigrams are dominated by spelled-out numbers (mille neuf cent, en mille neuf, cent quatre vingt, neuf cent soixante, de mille neuf, cent soixante dix) together with n'est pas, il y a, n'a pas, et de la; numeric columns lost in transcription]

35 Rank, frequency and frequency of frequencies [same table at the bottom of the trigram ranking; content lost in transcription]

36 Frequency of frequencies, lin-lin [plot: frequency of frequencies vs. frequency for trigrams, linear scales]

37 Frequency of frequencies, log-lin [plot: frequency of frequencies vs. frequency for trigrams, log-lin scales]

38 Frequency of frequencies, log-log [plot: frequency of frequencies vs. frequency for trigrams, log-log scales]

39 Limits of n-gram models They fail to model arbitrarily long-distance phenomena: the speed reaches... the speed of the waves reaches... the speed of the seismic waves reaches... the speed of the large seismic waves reaches...

40 Syntactic structure of the sentence can help [tree: [S [NPs [D the] [Ns speed]] [VP3s [V3s reaches]]]]

41 Syntactic structure of the sentence can help [tree: [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [Np waves]]]] [VP3s [V3s reaches]]]]

42 Syntactic structure of the sentence can help [tree: [S [NPs [NPs [D the] [Ns speed]] [PP [Prep of] [NPp [D the] [NPp [AP seismic] [Np waves]]]]] [VP3s [V3s reaches]]]]

43 Use grammars to compute the probability of a sentence Agreement between the subject and the verb is modeled by the same rule in the different sentences. The rule is independent of the length of the noun phrase. The grammar can easily associate a higher probability with the correct sentence: $P(S \to NPs\ VPs) > P(S \to NPs\ VPp)$. The grammar can be used as a language model: for any sentence $S \in L(G)$ it allows us to compute $P(S)$.

44 Context Free Grammars A CFG is made of: a non-terminal alphabet $N = \{N^1 \dots N^n\}$, a terminal alphabet $T = \{t_1 \dots t_m\}$, a start symbol $N^1$, and a set of rewrite rules $N^i \to \alpha$ with $\alpha \in (N \cup T)^*$. Without loss of generality we can consider that the rules are in Chomsky Normal Form: rules are of the form $N^i \to N^j N^k$ or $N^i \to t$, with $N^i, N^j, N^k \in N$ and $t \in T$.
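One possible way to hold such a grammar in memory, keeping binary and lexical rules separate so that a CYK-style parser can look them up; the class and method names are an assumption of this sketch, not notation from the slides:

```python
from collections import defaultdict

class CNFGrammar:
    """A CFG in Chomsky Normal Form: binary rules A -> B C and lexical rules A -> t."""
    def __init__(self, start):
        self.start = start
        self.binary = set()               # triples (A, B, C) for A -> B C
        self.lexical = defaultdict(set)   # terminal t -> {A | A -> t}

    def add_binary(self, lhs, left, right):
        self.binary.add((lhs, left, right))

    def add_lexical(self, lhs, terminal):
        self.lexical[terminal].add(lhs)

# A toy grammar (the one used as an example a couple of slides below)
g = CNFGrammar(start="A")
for lhs, left, right in [("A", "B", "C"), ("A", "C", "B"), ("B", "D", "E"),
                         ("B", "E", "D"), ("C", "D", "E"), ("C", "E", "D")]:
    g.add_binary(lhs, left, right)
g.add_lexical("D", "a")
g.add_lexical("E", "a")
```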

45 Language and syntactic structure A CFG $G$ defines a language $L(G) \subseteq T^*$ and associates to every string $w \in L(G)$ one syntactic structure or more (when $G$ is ambiguous).

46 Example Rules: $A \to BC$, $A \to CB$, $B \to DE$, $B \to ED$, $C \to DE$, $C \to ED$, $D \to a$, $E \to a$. Example derivation: $A \Rightarrow BC \Rightarrow DEC \Rightarrow aEC \Rightarrow aaC \Rightarrow aaED \Rightarrow aaaD \Rightarrow aaaa$. $L(G) = \{aaaa\}$.

47 Ambiguity [figure: the eight parse trees that the grammar assigns to a a a a, obtained by combining $A \to BC$ or $A \to CB$ with $B \to DE$ or $ED$ and $C \to DE$ or $ED$]

48 Probabilistic Context Free Grammar A PCFG is made of: a non-terminal alphabet $N = \{N^1 \dots N^n\}$, a terminal alphabet $T = \{t_1 \dots t_m\}$, a start symbol $N^1$, a set of rewrite rules $N^i \to \alpha$ with $\alpha \in (N \cup T)^*$ (without loss of generality in Chomsky Normal Form: $N^i \to N^j N^k$ or $N^i \to t$, with $N^i, N^j, N^k \in N$ and $t \in T$), and a probability distribution for every $N^i$: $\sum_j P(N^i \to \alpha_j) = 1$.

49 Probabilities The probability of a rule is the probability of choosing this rule to rewrite its left-hand-side symbol: $P(N^i \to \alpha_j) = P(N^i \to \alpha_j \mid N^i)$. Probability of a tree $T$: $P(T) = \prod_{N \in T} P(r(N))$, where $N \in \mathcal{N}$ and $r(N)$ is the rule used to rewrite $N$ in $T$. Probability of a sentence $S$: $P(S) = \sum_{T \in \mathcal{T}(S)} P(T)$.

50 Example $A \to BC$ 0.4, $A \to CB$ 0.6, $B \to DE$ 0.2, $B \to ED$ 0.8, $C \to DE$ 0.3, $C \to ED$ 0.7, $D \to a$ 1.0, $E \to a$ 1.0.
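A small sketch that multiplies the rule probabilities along one tree, using the grammar on this slide; the nested-tuple tree encoding is an assumption of the sketch:

```python
rule_prob = {
    ("A", ("B", "C")): 0.4, ("A", ("C", "B")): 0.6,
    ("B", ("D", "E")): 0.2, ("B", ("E", "D")): 0.8,
    ("C", ("D", "E")): 0.3, ("C", ("E", "D")): 0.7,
    ("D", ("a",)): 1.0,     ("E", ("a",)): 1.0,
}

def tree_prob(tree):
    """tree = (label, child, ...) where a child is a subtree or a terminal string."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# One of the eight parses of "a a a a": A -> BC, B -> DE, C -> DE
t = ("A", ("B", ("D", "a"), ("E", "a")), ("C", ("D", "a"), ("E", "a")))
print(tree_prob(t))   # 0.4 * 0.2 * 0.3 = 0.024
```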

51 Parse probabilities [figure: the eight parse trees of a a a a again, each carrying the probability obtained by multiplying the probabilities of its rules]

52 Probabilities we want to compute We would like to compute efficiently $P(S) = \sum_{T \in \mathcal{T}(S)} P(T)$, where $\mathcal{T}(S)$ is the set of all the syntactic structures that the grammar $G$ associates to sentence $S$. We would also like to find the most likely analysis of the sentence, $\hat{T} = \arg\max_{T \in \mathcal{T}(S)} P(T)$. $\hat{T}$ is useful for many other tasks such as text understanding, machine translation, question answering...

53 But grammars of Natural Languages are ambiguous [plot: number of analyses (maximum and mean) as a function of sentence length, logarithmic scale]

54 Efficiently building the set $\mathcal{T}(S)$ with the CYK Algorithm Input: a CFG $G$ in Chomsky Normal Form and a sentence $w_1 \dots w_n$. Output: a table $t$ such that $N \in t_{i,j} \iff N \Rightarrow^* w_i \dots w_j$. Algorithm: for $i = 1$ to $n$ do { initialisation } $t_{i,i} = \{A \mid A \to w_i\}$; for $j = 1$ to $n$ do, for $i = j-1$ downto $1$ do, for $k = i$ to $j-1$ do: $t_{i,j} = t_{i,j} \cup \{A \mid A \to BC \text{ with } B \in t_{i,k} \text{ and } C \in t_{k+1,j}\}$.
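A sketch of the recognizer described by this pseudocode, over the rule representation used in the CNF grammar sketch earlier; cell indices are 1-based to follow the slide:

```python
def cyk(words, binary, lexical, start):
    """Fill the CYK table t[(i, j)] = {A | A =>* w_i ... w_j} and test the start symbol."""
    n = len(words)
    t = {(i, j): set() for i in range(1, n + 1) for j in range(1, n + 1)}
    for i, w in enumerate(words, start=1):          # initialisation: t[i, i] = {A | A -> w_i}
        t[(i, i)] = set(lexical.get(w, set()))
    for j in range(1, n + 1):
        for i in range(j - 1, 0, -1):
            for k in range(i, j):
                for (a, b, c) in binary:
                    if b in t[(i, k)] and c in t[(k + 1, j)]:
                        t[(i, j)].add(a)
    return start in t[(1, n)], t

binary = {("A", "B", "C"), ("A", "C", "B"), ("B", "D", "E"),
          ("B", "E", "D"), ("C", "D", "E"), ("C", "E", "D")}
lexical = {"a": {"D", "E"}}
print(cyk(["a", "a", "a", "a"], binary, lexical, "A")[0])   # True: aaaa is in L(G)
```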

55 CYK [chart for the input a a a a, all cells still empty; grammar: $A \to BC \mid CB$, $B \to DE \mid ED$, $C \to DE \mid ED$, $D \to a$, $E \to a$]

56 CYK [after initialisation, every diagonal cell $t_{i,i}$ contains D and E]

57 CYK [the cells for spans of length 2 now contain B and C]

58 CYK [the cell covering the whole string contains A: the sentence is recognized]

59 Packed parse forest Instantiated symbols such as $A_{1..5}$; instantiated productions such as $A_{1..5} \to B_{1..3}\,C_{3..5}$ and $A_{1..5} \to C_{1..3}\,B_{3..5}$. [figure: the packed forest for a a a a, with B and C over the spans $1..3$ and $3..5$, and D and E over every one-word span $a_{1..2}$, $a_{2..3}$, $a_{3..4}$, $a_{4..5}$]
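A sketch of one way to store such a forest: an instantiated symbol is a (label, span) triple, and every way of building it is recorded as an instantiated production under the same node; the class and field names are illustrative, not from the slides:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Symbol:
    label: str   # e.g. "A"
    start: int   # e.g. 1
    end: int     # e.g. 5, so this object stands for A_{1..5}

@dataclass
class Forest:
    # parent symbol -> list of (left child, right child) pairs
    productions: dict = field(default_factory=dict)

    def add(self, parent, left, right):
        self.productions.setdefault(parent, []).append((left, right))

# A_{1..5} can be built in two ways; both are packed under the same node
f = Forest()
f.add(Symbol("A", 1, 5), Symbol("B", 1, 3), Symbol("C", 3, 5))
f.add(Symbol("A", 1, 5), Symbol("C", 1, 3), Symbol("B", 3, 5))
```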

60 Three problems to solve Compute the probability of sentence $S$: $P(S) = \sum_{T \in \mathcal{T}(S)} P(T)$. Build the most probable syntactic tree for $S$: $\hat{T} = \arg\max_{T \in \mathcal{T}(S)} P(T)$. Estimate the probabilities of $G$ using data $D$: $\hat{G} = \arg\max_G P(D \mid G)$.

61 Notations $w_1 \dots w_n$: the sentence to parse. $w_{p,q}$: the segment $w_p \dots w_q$ of the sentence. $m_i$: a symbol of the terminal alphabet. $N^j$: a symbol of the non-terminal alphabet. $N^j_{p,q}$: symbol $N^j$ derives the segment $w_{p,q}$ ($N^j \Rightarrow^* w_{p,q}$).

62 Inside Probabilities $\beta_j(p,q) \stackrel{\text{def}}{=} P(w_{p,q} \mid N^j_{p,q})$ [figure: $N^j$ dominating the terminals $m_p \dots m_q$]

63 Probability of a sentence $P(w_{1,n}) = P(N^1 \Rightarrow^* w_{1,n}) = P(w_{1,n} \mid N^1_{1,n}) = \beta_1(1,n)$

64 Recursive computation of inside probabilities [figure: $N^j$ rewritten as $N^r N^s$, with $N^r$ dominating $m_p \dots m_d$ and $N^s$ dominating $m_{d+1} \dots m_q$] Bottom up: we compute $\beta_j(p,q)$ after $\beta_r(p,d)$ and $\beta_s(d+1,q)$ have been computed.

65 Recursive computation of inside probabilities Recursive formula:
$\beta_j(p,q) = P(w_{p,q} \mid N^j_{p,q})$
$= \sum_{r,s} \sum_{d=p}^{q-1} P(w_{p,d}, N^r_{p,d}, w_{d+1,q}, N^s_{d+1,q} \mid N^j_{p,q})$
$= \sum_{r,s} \sum_{d=p}^{q-1} P(N^r_{p,d}, N^s_{d+1,q} \mid N^j_{p,q})\, P(w_{p,d} \mid N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q})\, P(w_{d+1,q} \mid N^r_{p,d}, N^s_{d+1,q}, N^j_{p,q})$
$= \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q)$

66 Recursive computation of inside probabilities Base case: $\beta_j(k,k) = P(w_k \mid N^j_{k,k}) = P(N^j \to w_k)$

67 Relation with the CYK algorithm $N^j_{p,q}$ corresponds to the presence of symbol $N^j$ in cell $t_{p,q}$; $\beta_j(p,q)$ can be computed while filling the table $t$.

68 Computing P(S) with CYK for $q = 1$ to $n$ do { initialisation } for $p = q$ downto $1$ do: if $p = q$ then $\beta_j(p,p) = P(N^j \to w_p)$, otherwise $\beta_j(p,q) = 0$. for $q = 1$ to $n$ do, for $p = q-1$ downto $1$ do, for $d = p$ to $q-1$ do: $\beta_j(p,q) = \beta_j(p,q) + P(N^j \to N^r N^s)\,\beta_r(p,d)\,\beta_s(d+1,q)$, with $N^r \in t_{p,d}$ and $N^s \in t_{d+1,q}$. $P(S) = \beta_1(1,n)$
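A sketch of this inside computation in Python, with the example PCFG of slide 50; beta[(j, p, q)] stands for β_j(p, q) and indexing is 1-based like the pseudocode:

```python
from collections import defaultdict

def inside(words, lex_prob, bin_prob, start):
    """Return P(S) = beta_start(1, n) for a PCFG in Chomsky Normal Form."""
    n = len(words)
    beta = defaultdict(float)
    for p, w in enumerate(words, start=1):             # beta_j(p, p) = P(N^j -> w_p)
        for (a, t), prob in lex_prob.items():
            if t == w:
                beta[(a, p, p)] = prob
    for length in range(2, n + 1):                     # spans of increasing length
        for p in range(1, n - length + 2):
            q = p + length - 1
            for (a, b, c), prob in bin_prob.items():
                for d in range(p, q):
                    beta[(a, p, q)] += prob * beta[(b, p, d)] * beta[(c, d + 1, q)]
    return beta[(start, 1, n)]

lex_prob = {("D", "a"): 1.0, ("E", "a"): 1.0}
bin_prob = {("A", "B", "C"): 0.4, ("A", "C", "B"): 0.6, ("B", "D", "E"): 0.2,
            ("B", "E", "D"): 0.8, ("C", "D", "E"): 0.3, ("C", "E", "D"): 0.7}
print(inside(["a"] * 4, lex_prob, bin_prob, "A"))      # sums the eight parse probabilities
```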

69 Computing $P(\hat{T})$ $\delta_i(p,q)$ = probability of the most probable subtree rooted in $N^i_{p,q}$. 1 Initialisation: $\delta_i(p,p) = P(N^i \to w_p)$. 2 Recursive step: $\delta_i(p,q) = \max_{1 \le j,k \le n,\ p \le d < q} P(N^i \to N^j N^k)\,\delta_j(p,d)\,\delta_k(d+1,q)$. 3 End: $P(\hat{T}) = \delta_1(1,n)$.

70 Computing $P(\hat{T})$ with CYK for $q = 1$ to $n$ do { initialisation } for $p = q$ downto $1$ do: if $p = q$ then $\delta_j(p,p) = P(N^j \to w_p)$, otherwise $\delta_j(p,q) = 0$. for $q = 1$ to $n$ do, for $p = q-1$ downto $1$ do, for $d = p$ to $q-1$ do: $\delta_i(p,q) = \max(\delta_i(p,q),\ P(N^i \to N^j N^k)\,\delta_j(p,d)\,\delta_k(d+1,q))$, with $N^j \in t_{p,d}$ and $N^k \in t_{d+1,q}$. $P(\hat{T}) = \delta_1(1,n)$

71 Building $\hat{T}$ $\psi_i(p,q) = \langle j, k, d \rangle$, where $j, k, d$ identify the rule application that achieved the maximum $\delta_i(p,q)$: $\psi_i(p,q) = \arg\max_{j,k,d} P(N^i \to N^j N^k)\,\delta_j(p,d)\,\delta_k(d+1,q)$. root($\hat{T}$) = $N^1_{1,n}$; if $\psi_i(p,q) = \langle j, k, d \rangle$ then left-child($N^i_{p,q}$) = $N^j_{p,d}$ and right-child($N^i_{p,q}$) = $N^k_{d+1,q}$.
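A sketch that combines the δ and ψ tables: delta keeps the best probability for each (symbol, span) and psi the ⟨j, k, d⟩ backpointer, from which the most probable tree is rebuilt; the nested-tuple tree output mirrors the earlier sketches:

```python
from collections import defaultdict

def viterbi_cky(words, lex_prob, bin_prob, start):
    """Return P(T_hat) and the most probable tree for a CNF PCFG (assumes a parse exists)."""
    n = len(words)
    delta, psi = defaultdict(float), {}
    for p, w in enumerate(words, start=1):             # delta_i(p, p) = P(N^i -> w_p)
        for (a, t), prob in lex_prob.items():
            if t == w:
                delta[(a, p, p)] = prob
    for length in range(2, n + 1):
        for p in range(1, n - length + 2):
            q = p + length - 1
            for (a, b, c), prob in bin_prob.items():
                for d in range(p, q):
                    score = prob * delta[(b, p, d)] * delta[(c, d + 1, q)]
                    if score > delta[(a, p, q)]:
                        delta[(a, p, q)] = score
                        psi[(a, p, q)] = (b, c, d)     # the <j, k, d> backpointer

    def build(a, p, q):                                # follow psi to rebuild the tree
        if p == q:
            return (a, words[p - 1])
        b, c, d = psi[(a, p, q)]
        return (a, build(b, p, d), build(c, d + 1, q))

    return delta[(start, 1, n)], build(start, 1, n)

lex_prob = {("D", "a"): 1.0, ("E", "a"): 1.0}
bin_prob = {("A", "B", "C"): 0.4, ("A", "C", "B"): 0.6, ("B", "D", "E"): 0.2,
            ("B", "E", "D"): 0.8, ("C", "D", "E"): 0.3, ("C", "E", "D"): 0.7}
print(viterbi_cky(["a"] * 4, lex_prob, bin_prob, "A"))  # best tree uses A->CB, C->ED, B->ED: 0.336
```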

72 Estimating grammar probabilities Two situations: observable syntactic data: we have a set of sentences with the correct syntactic tree for every sentence (a treebank). Hidden syntactic structure: we just have the sentences, without the syntactic structures.

73 Treebank examples [table comparing the Penn Treebank and the Corpus Paris 7: number of words, sentences, syntactic categories, part of speech tags and rules; most figures lost in transcription, one of the grammars has 9 657 rules]

74 Building the grammar From the tree [S [GN [Det le] [N chat]] [GV [V dort]]] we extract the rules: S → GN GV, GN → Det N, GV → V, Det → le, N → chat, V → dort.

75 Estimating rule probabilities Count the number of occurrences of non-terminal symbol $A$ in the treebank: $C(A)$. Count the number of occurrences of the rule $A \to \alpha$: $C(A \to \alpha)$. Estimate $P(A \to \alpha)$ with relative frequencies: $P(A \to \alpha) = \frac{C(A \to \alpha)}{C(A)}$.
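A sketch of this relative-frequency estimation from trees encoded as nested tuples (the same encoding as in the earlier sketches; the one-tree treebank is purely illustrative):

```python
from collections import Counter

def count_rules(tree, rules, nonterms):
    """Add the rule used at every node of the tree to the counters."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[(label, rhs)] += 1
    nonterms[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, rules, nonterms)

def estimate(treebank):
    """P(A -> alpha) = C(A -> alpha) / C(A), estimated over all trees."""
    rules, nonterms = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, rules, nonterms)
    return {rule: c / nonterms[rule[0]] for rule, c in rules.items()}

# The tree of the previous slide (le chat dort)
treebank = [("S", ("GN", ("Det", "le"), ("N", "chat")), ("GV", ("V", "dort")))]
print(estimate(treebank))   # every rule occurs once here, so each probability is 1.0
```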

76 PCFG Limits Lexical independence: rewriting a preterminal symbol X is independent of the context of X. Problem: the preposition that introduces the complement of a verb depends on the lexical nature of the verb. Example: The farmer gave an apple to John. This dependency is not modeled: VP → V NP PP, V → gave, PP → P NP, P → to.

77 PCFG Limits Structural independence: the choice of a rule for rewriting symbol X is independent of the context of X. Problem: subjects are realized as pronouns much more often than objects, but there is a single rule of the form NP → Pro and hence a single probability.

78 Evaluation measures (Black et al., 1991) Given a sentence S, a candidate tree C and the correct tree R, a syntagm in C is correct if there exists a syntagm in R with the same span and labeled with the same category. Labeled recall (LR) = # correct syntagms in C / # syntagms in R. Labeled precision (LP) = # correct syntagms in C / # syntagms in C. F1: harmonic mean of LR and LP ($F_1 = \frac{2 \cdot LR \cdot LP}{LR + LP}$).
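A sketch of these measures over constituents ("syntagms") represented as (label, start, end) triples; the candidate and reference sets below are made up for illustration:

```python
def parseval(candidate, reference):
    """Labeled precision, labeled recall and F1 over (label, start, end) constituents."""
    cand, ref = set(candidate), set(reference)
    correct = len(cand & ref)
    lp = correct / len(cand)
    lr = correct / len(ref)
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

candidate = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
reference = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
print(parseval(candidate, reference))   # (0.75, 0.75, 0.75)
```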

79 Some results on the Penn Treebank [table of LP, LR and F1 for a baseline PCFG and the parsers of Magerman, Collins, Charniak, Collins and Petrov; only the baseline figure of 72.6 survived transcription]
