Probabilistic Linguistics


1 Matilde Marcolli MAT1509HS: Mathematical and Computational Linguistics University of Toronto, Winter 2019, T 4-6 and W 4, BA6180

2 Bernoulli measures

finite set $A$ = alphabet; strings of arbitrary (finite) length: $A^* = \cup_m A^m$

Alphabet: letters, phonemes, lexical list of words, ...

$\Lambda_A$ = infinite strings in the alphabet $A$; cylinder sets
$$\Lambda_A(w) = \{\alpha \in \Lambda_A \,:\, \alpha_i = w_i, \ i = 1, \dots, m\}$$
with $w \in A^m$ ($\Lambda_A$ is also called the $\omega$-language $A^\omega$)

$\Lambda_A$ is a Cantor set with the topology generated by the cylinder sets

Bernoulli measure: $P = (p_a)_{a \in A}$ a probability distribution, $p_a \geq 0$ for all $a \in A$ and $\sum_{a \in A} p_a = 1$

Gives a measure $\mu_P$ on $\Lambda_A$ with
$$\mu_P(\Lambda_A(w)) = p_{w_1} \cdots p_{w_m}$$

meaning: in a word $w = w_1 \cdots w_m$ each letter $w_i \in A$ is an independent random variable drawn with probabilities $P = (p_a)_{a \in A}$
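A minimal sketch of a Bernoulli measure in code (the alphabet, probabilities, and function names are illustrative, not from the lecture): it computes the measure of a cylinder set and draws a finite prefix of an infinite string.

```python
import random

# Toy Bernoulli measure on the full shift over A = {a, b, c}:
# each letter is an independent draw with probabilities P.
P = {"a": 0.5, "b": 0.3, "c": 0.2}

def cylinder_probability(w, P):
    """mu_P(Lambda_A(w)) = p_{w_1} * ... * p_{w_m}."""
    prob = 1.0
    for letter in w:
        prob *= P[letter]
    return prob

def sample_prefix(P, m):
    """Draw the first m letters of an infinite Bernoulli string."""
    letters, weights = zip(*P.items())
    return "".join(random.choices(letters, weights=weights, k=m))

print(cylinder_probability("ab", P))  # 0.5 * 0.3 = 0.15
print(sample_prefix(P, 10))
```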

3 Markov measures

same as above: $A$ alphabet, $A^*$ finite strings, $\Lambda_A$ infinite strings

$\pi = (\pi_a)_{a \in A}$ probability distribution: $\pi_a \geq 0$ and $\sum_a \pi_a = 1$

$P = (p_{ab})_{a,b \in A}$ stochastic matrix: $p_{ab} \geq 0$ and $\sum_b p_{ab} = 1$

Perron-Frobenius eigenvector: $\pi P = \pi$

Markov measure $\mu_{\pi,P}$ on $\Lambda_A$ with
$$\mu_{\pi,P}(\Lambda_A(w)) = \pi_{w_1} \, p_{w_1 w_2} \cdots p_{w_{m-1} w_m}$$

meaning: in a word $w_1 \cdots w_m$ letters follow one another according to a Markov chain model, with probability $p_{ab}$ of having $a$ and $b$ as consecutive letters
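The same kind of sketch for a Markov measure, assuming a toy two-letter alphabet; the stationary vector below is computed by hand to satisfy $\pi P = \pi$ for this particular matrix.

```python
# Row-stochastic transition matrix P over A = {a, b}, stored as nested dicts.
Pm = {"a": {"a": 0.9, "b": 0.1},
      "b": {"a": 0.2, "b": 0.8}}
pi = {"a": 2/3, "b": 1/3}   # solves pi P = pi for this P

def markov_cylinder_probability(w, pi, Pm):
    """mu_{pi,P}(Lambda_A(w)) = pi_{w_1} p_{w_1 w_2} ... p_{w_{m-1} w_m}."""
    prob = pi[w[0]]
    for a, b in zip(w, w[1:]):
        prob *= Pm[a][b]
    return prob

print(markov_cylinder_probability("aab", pi, Pm))  # (2/3) * 0.9 * 0.1
```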

4 Example of Markov Chain

5 support of the Markov measure

support of $\mu_{\pi,P}$: the subshift $\Lambda_{A,\mathsf{A}} \subset \Lambda_A$, where $\mathsf{A} = (\mathsf{A}_{ab})$ is the 0/1 matrix with entries $\mathsf{A}_{ab} = 0$ if $p_{ab} = 0$ and $\mathsf{A}_{ab} = 1$ if $p_{ab} \neq 0$:
$$\Lambda_{A,\mathsf{A}} = \{\alpha \in \Lambda_A \,:\, \mathsf{A}_{\alpha_i \alpha_{i+1}} = 1, \ \forall i\}$$

both Bernoulli and Markov measures are invariant under the shift map
$$\sigma: \Lambda_A \to \Lambda_A, \qquad \sigma(a_1 a_2 a_3 \cdots) = a_2 a_3 a_4 \cdots$$

the shift map is a good model for many properties in the theory of dynamical systems: a widely studied class of examples

Markov's original aim was modeling natural languages:
A. A. Markov (1913), Ein Beispiel statistischer Forschung am Text "Eugen Onegin" zur Verbindung von Proben in Ketten

6 Shannon Entropy

Claude E. Shannon, Warren Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1949 (reprinted 1998)

Entropy measures the uncertainty associated to the prediction of the result of an experiment (equivalently, the amount of information one gains from performing the experiment)

Shannon entropy of a Bernoulli measure:
$$S(\mu_P) = -\sum_{a \in A} p_a \log(p_a)$$

Entropy of a Markov measure (Kolmogorov-Sinai entropy):
$$S(\mu_{\pi,P}) = -\sum_{a,b \in A} \pi_a \, p_{ab} \log(p_{ab})$$
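Both formulas in a short sketch, reusing toy distributions like those above (the numbers are illustrative; natural log is used, matching the slides' notation):

```python
import math

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Pm = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
pi = {"a": 2/3, "b": 1/3}

def bernoulli_entropy(P):
    """S(mu_P) = -sum_a p_a log(p_a); zero-probability letters contribute 0."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def markov_entropy(pi, Pm):
    """Kolmogorov-Sinai entropy: -sum_{a,b} pi_a p_ab log(p_ab)."""
    return -sum(pi[a] * p * math.log(p)
                for a, row in Pm.items() for p in row.values() if p > 0)

print(bernoulli_entropy(P), markov_entropy(pi, Pm))
```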

7 Relative entropy and cross-entropy

Relative entropy (Kullback-Leibler divergence) of $P = (p_a)$ and $Q = (q_a)$:
$$KL(P \| Q) = \sum_{a \in A} p_a \log\!\left(\frac{p_a}{q_a}\right)$$

cross-entropy of the probabilities $P = (p_a)$ and $Q = (q_a)$:
$$S(P, Q) = S(P) + KL(P \| Q) = -\sum_{a \in A} p_a \log(q_a)$$

meaning: expected message-length per datum when a wrong distribution $Q$ is assumed while the data follow the distribution $P$
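A sketch checking the identity $S(P,Q) = S(P) + KL(P\|Q)$ numerically on two toy distributions over the same alphabet:

```python
import math

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}

def entropy(P):
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def kl(P, Q):
    """KL(P || Q) = sum_a p_a log(p_a / q_a)."""
    return sum(p * math.log(p / Q[a]) for a, p in P.items() if p > 0)

def cross_entropy(P, Q):
    """S(P, Q) = -sum_a p_a log(q_a)."""
    return -sum(p * math.log(Q[a]) for a, p in P.items() if p > 0)

print(abs(cross_entropy(P, Q) - (entropy(P) + kl(P, Q))) < 1e-12)  # True
```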

8 Entropy and Cross-entropy

for a message $W$ (thought of as a random variable) with $V(W)$ the set of possible values:
$$S(W) = -\sum_{w \in V(W)} P(w) \log P(w) = \mathrm{Average}_{w \in V(W)}(\#\text{Bits Required for } w)$$

$P_M$ = probabilities estimated using a model; cross-entropy:
$$S(W, P_M) = -\sum_{w \in V(W)} P(w) \log P_M(w)$$

Cross-entropy gives a model evaluator

9 Entropy and Cross-entropy of languages

asymptotic limit of the per-word entropy as the length grows, for a language $L \subset A^*$ with $W_n(L) = L \cap A^n$:
$$S(L, P) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w \in W_n(L)} P(w) \log P(w)$$

cross-entropy: the same with respect to a model $P_M$:
$$S(L, P, P_M) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w \in W_n(L)} P(w) \log P_M(w)$$

10 Other notions of Entropy of languages

language structure functions: $s_L(m) = \# W_m(L)$ (the number of strings of length $m$ in the language)

Generating function for the language structure functions:
$$G_L(t) = \sum_m s_L(m) \, t^m$$

$\rho$ = radius of convergence of the series $G_L(t)$

Entropy: $S(L) = -\log \rho$

Example: for $L = A^*$ with $\#A = N$, and $P$ the uniform distribution, both $S(L, P) = S(L) = \log N$
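By Cauchy-Hadamard, $1/\rho = \limsup_m s_L(m)^{1/m}$, so $S(L) = -\log \rho$ is the exponential growth rate of $s_L(m)$. A sketch estimating it from the structure function (the full language $L = A^*$ is used as the worked example from the slide):

```python
import math

def structure_function_entropy(s_L, m):
    """Estimate S(L) = lim (1/m) log s_L(m) from the count at length m."""
    return math.log(s_L(m)) / m

N = 4                                  # alphabet size, L = A*
s_full = lambda m: N ** m              # s_L(m) = N^m for the full language
print(structure_function_entropy(s_full, 50))  # -> log 4 ~ 1.386
```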

11 Trigram model

can consider further dependencies between letters, beyond consecutive ones

Example trigram: how the next letter in a word depends on the previous two
$$P(w) = P(w_0) \, P(w_1 | w_0) \, P(w_2 | w_0 w_1) \cdots P(w_m | w_{m-2} w_{m-1})$$

build the model probabilities by counting frequencies of sequences $w_{j-2} w_{j-1} w_j$ of specific choices of three words over corpora of texts...

problem: the model suppresses grammatical but unlikely combinations of words, which occur infrequently (the problem of sparse data)

possible solution: smoothing out the probabilities
$$P(w_m | w_{m-2} w_{m-1}) = \lambda_1 P_f(w_m) + \lambda_2 P_f(w_m | w_{m-1}) + \lambda_3 P_f(w_m | w_{m-2} w_{m-1})$$
with $P_f$ respectively the frequencies of single word, pair, and triple
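A sketch of this linear-interpolation smoothing on a toy corpus (corpus, counts, and the $\lambda_i$ values are illustrative assumptions):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
uni = Counter(corpus)
bi  = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
lam1, lam2, lam3 = 0.2, 0.3, 0.5          # must sum to 1

def p_smoothed(w, u, v):
    """lambda1 P_f(w) + lambda2 P_f(w|v) + lambda3 P_f(w|u v)."""
    p1 = uni[w] / len(corpus)
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return lam1 * p1 + lam2 * p2 + lam3 * p3

print(p_smoothed("sat", "the", "cat"))
```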

12 Hidden Markov Models

these more general types of dependencies (like the trigram) extend Markov Chains to Hidden Markov Models

first step: construct a Markov Chain for
$$P(w) = \prod_{j=0}^{m} P(w_j | w_{j-2} w_{j-1})$$
by taking states consisting of pairs of consecutive letters $ab \in A \times A$ and, for each $c \in A$, an edge from $ab$ to $bc$ with transition probability $P(c | ab)$
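A sketch of the pair-state construction; the conditional probabilities `p_tri` are an assumed input (e.g. from smoothed corpus counts as above):

```python
# Encode a trigram model as a Markov chain on pair states (a, b),
# with an edge (a, b) -> (b, c) of probability P(c | ab).
p_tri = {("the", "cat"): {"sat": 0.6, "ran": 0.4}}

def pair_chain_edges(p_tri):
    """List the edges of the pair-state Markov chain."""
    edges = []
    for (a, b), dist in p_tri.items():
        for c, p in dist.items():
            edges.append(((a, b), (b, c), p))
    return edges

for src, dst, p in pair_chain_edges(p_tri):
    print(src, "->", dst, "with prob", p)
```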

13 Example (figure)

14 second step: introduce the combination (with $\sum_i \lambda_i = 1$)
$$\lambda_1 P(w_m) + \lambda_2 P(w_m | w_{m-1}) + \lambda_3 P(w_m | w_{m-2} w_{m-1})$$
in the diagram, by replacing the edges out of every node of the Markov chain with a diagram with additional states marked by $\lambda_i$ and additional edges corresponding to all the probabilities contributing: $P(a)$, $P(a|b)$ and $P(c|ab)$, with the edges into the $\lambda_i$ states marked by the empty output symbol $\epsilon$

... the presence of $\epsilon$-transitions with no output symbol implies this is a Hidden Markov Model, with hidden states and visible states

15 Example: replacing the part of the diagram connecting node ab to nodes ba and bb

16 Shannon's series approximations to English

Random texts composed according to probability distributions that better and better approximate English texts

Step 0: random text from the English alphabet (plus a blank symbol): letters drawn with the uniform Bernoulli distribution

Step 1: random text with the Bernoulli distribution based on the frequency of letters in English

17 Step 2: random text with a Markov distribution over two consecutive letters from English frequencies (digram model)

Step 3: trigram model

18 Can also do the same on an alphabet of words instead of letters

Step 1: random text with a Bernoulli distribution based on the frequency of words in sample texts of English

Step 2: Markov distribution for consecutive words (digram)
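A sketch of the word-level Steps 1 and 2 on a toy corpus (the corpus and start word are illustrative; Shannon used English letter and word frequency tables):

```python
import random
from collections import Counter, defaultdict

words = "the cat sat on the mat and the cat ran off".split()

# Step 1: Bernoulli text from word frequencies.
freq = Counter(words)
def bernoulli_text(n):
    toks, wts = zip(*freq.items())
    return " ".join(random.choices(toks, weights=wts, k=n))

# Step 2: digram (Markov) text from consecutive-word counts.
digram = defaultdict(Counter)
for a, b in zip(words, words[1:]):
    digram[a][b] += 1

def markov_text(n, start="the"):
    out, cur = [start], start
    for _ in range(n - 1):
        if not digram[cur]:            # dead end: no observed successor
            break
        toks, wts = zip(*digram[cur].items())
        cur = random.choices(toks, weights=wts)[0]
        out.append(cur)
    return " ".join(out)

print(bernoulli_text(8))
print(markov_text(8))
```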

19 "Stone house poet Mr. Shih who ate ten lions"

Text presented at the tenth Macy Conference on Cybernetics (1953) by the linguist Yuen Ren Chao

Shannon information as a written text very different from Shannon information as a spoken text

20 for an interesting example in English, consider the word buffalo:
- buffalo! = bully! (VP)
- Buffalo buffalo = bison bully (NP VP)
- Buffalo buffalo buffalo = those bison that are from the city of Buffalo bully
- Buffalo buffalo buffalo buffalo = those bison that are from the city of Buffalo bully other bison

in fact, arbitrarily long sentences consisting solely of the word buffalo are grammatical

(Example by Thomas Tymoczko, logician and philosopher of mathematics)

21 "Bison from Buffalo, that bison from Buffalo bully, themselves bully bison from Buffalo"

Where is the information content? as a string (in an alphabet of words) it has Shannon information zero... the information is in the parse trees

22 Probabilistic Context Free Grammars

$G = (V_N, V_T, R, S, P)$

$V_N$ and $V_T$ disjoint finite sets: non-terminal and terminal symbols

$S \in V_N$ start symbol

$R$ finite rewriting system on $V_N \cup V_T$: production rules $A \to \alpha$ with $A \in V_N$ and $\alpha \in (V_N \cup V_T)^*$

Probabilities $P(A \to \alpha)$ with
$$\sum_\alpha P(A \to \alpha) = 1$$
the probabilities of the ways to expand the same non-terminal $A$ add up to one
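A minimal sketch of a PCFG as a data structure, with a check of the normalization condition; the grammar and its rule probabilities are hypothetical toy values:

```python
# Each non-terminal maps to a list of (right-hand side, probability) pairs.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.7), (("NP", "PP"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("VP", "PP"), 0.4)],
}

def check_normalized(pcfg, tol=1e-9):
    """Verify sum_alpha P(A -> alpha) = 1 for every non-terminal A."""
    for lhs, rules in pcfg.items():
        total = sum(p for _, p in rules)
        assert abs(total - 1.0) < tol, f"{lhs} rules sum to {total}"

check_normalized(pcfg)
```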

23 Probabilities of parse trees

$T_G = \{T\}$ family of parse trees $T$ for a context-free grammar $G$

if $G$ is probabilistic, one can assign probabilities to all the possible parse trees $T(w)$ for a given string $w$ in $L_G$:
$$P(w) = \sum_{T = T(w)} P(w, T) = \sum_{T = T(w)} P(T) \, P(w | T) = \sum_{T = T(w)} P(T)$$
the last equality because the tree includes the terminals (the labels of the leaves), so $P(w | T(w)) = 1$

Probabilities account for syntactic ambiguities of parse trees in context-free languages

24 Subtree independence assumption

a vertex $v$ in an oriented rooted planar tree $T$ spans a subset $\Omega(v)$ of the set of leaves of $T$ if $\Omega(v)$ is the set of leaves reached by an oriented path in $T$ starting at $v$

denote by $A_{kl}$ a non-terminal labeling a vertex in a parse tree $T$ that spans the subset $w_k \ldots w_l$ of the string $w = w_1 \ldots w_n$ parsed by $T = T(w)$

1. $P(A_{kl} \to w_k \ldots w_l \mid \text{anything outside of } k \leq j \leq l) = P(A_{kl} \to w_k \ldots w_l)$
2. $P(A_{kl} \to w_k \ldots w_l \mid \text{anything above } A_{kl} \text{ in the tree}) = P(A_{kl} \to w_k \ldots w_l)$

25 Example: a tree with root $A$ and children $B$, $C$, where $B$ spans $w_1 w_2 w_3$ and $C$ spans $w_4 w_5$:
$$P(T) = P(A, B, C, w_1, w_2, w_3, w_4, w_5 | A)$$
$$= P(B, C | A) \, P(w_1, w_2, w_3 | A, B, C) \, P(w_4, w_5 | A, B, C, w_1, w_2, w_3)$$
$$= P(B, C | A) \, P(w_1, w_2, w_3 | B) \, P(w_4, w_5 | C)$$
$$= P(A \to BC) \, P(B \to w_1 w_2 w_3) \, P(C \to w_4 w_5)$$

26 Sentence probabilities in PCFGs

Fact: context-free grammars can always be put in Chomsky normal form, where all the production rules are of the form
$$N \to w, \qquad N \to N_1 N_2$$
where $N, N_1, N_2$ are non-terminal and $w$ is terminal

Parse trees for a CFG in Chomsky normal form have either an internal node marked with a non-terminal $N$ and one output to a leaf with a terminal $w$, or a node with a non-terminal $N$ and two outputs with non-terminals $N_1$ and $N_2$

27 (figure)

28 assume the CFG is in Chomsky normal form

inside probabilities:
$$\beta_j(k, l) := P(w_{k,l} | N^j_{k,l})$$
probability of the string of terminals inside (the outputs of) the oriented tree with root $N^j_{k,l}$

outside probabilities:
$$\alpha_j(k, l) := P(w_{1,k-1}, N^j_{k,l}, w_{l+1,n})$$
probability of everything that is outside the tree with root $N^j_{k,l}$

29 (figure: inside and outside probabilities)

30 Recursive formula for inside probabilities

$$\beta_j(k, l) = P(w_{k,l} | N^j_{k,l}) = \sum_{p,q,m} P(w_{k,m}, N^p_{k,m}, w_{m+1,l}, N^q_{m+1,l} | N^j_{k,l})$$
$$= \sum_{p,q,m} P(N^p_{k,m}, N^q_{m+1,l} | N^j_{k,l}) \, P(w_{k,m} | N^j_{k,l}, N^p_{k,m}, N^q_{m+1,l}) \, P(w_{m+1,l} | w_{k,m}, N^j_{k,l}, N^p_{k,m}, N^q_{m+1,l})$$
$$= \sum_{p,q,m} P(N^p_{k,m}, N^q_{m+1,l} | N^j_{k,l}) \, P(w_{k,m} | N^p_{k,m}) \, P(w_{m+1,l} | N^q_{m+1,l})$$
$$= \sum_{p,q,m} P(N^j \to N^p N^q) \, \beta_p(k, m) \, \beta_q(m+1, l)$$
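A minimal sketch of this recursion as a CKY-style dynamic program; the toy grammar in Chomsky normal form and the sentence are illustrative assumptions, not from the slides:

```python
binary = {("S", "NP", "VP"): 1.0,    # P(N -> N1 N2)
          ("VP", "V", "NP"): 1.0}
lexical = {("NP", "John"): 0.5, ("NP", "Mary"): 0.5,  # P(N -> w)
           ("V", "saw"): 1.0}

def inside(words):
    n = len(words)
    beta = {}  # beta[(j, k, l)] = P(w_{k..l} | N^j spanning positions k..l)
    # Base case: lexical rules N -> w on single positions.
    for k, w in enumerate(words):
        for (N, term), p in lexical.items():
            if term == w:
                beta[(N, k, k)] = beta.get((N, k, k), 0.0) + p
    # Recursion: sum over rules N -> N1 N2 and split points m.
    for span in range(2, n + 1):
        for k in range(n - span + 1):
            l = k + span - 1
            for (N, N1, N2), p in binary.items():
                total = sum(p * beta.get((N1, k, m), 0.0)
                              * beta.get((N2, m + 1, l), 0.0)
                            for m in range(k, l))
                if total > 0.0:
                    beta[(N, k, l)] = beta.get((N, k, l), 0.0) + total
    return beta

beta = inside("John saw Mary".split())
print(beta.get(("S", 0, 2)))   # P(sentence) = 0.5 * 1.0 * 0.5 = 0.25
```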

31 Training Probabilistic Context-Free Grammars

simpler case of a Markov chain: consider a transition $s_i \xrightarrow{w_k} s_j$ from state $s_i$ to state $s_j$ labeled by $w_k$

given a large training corpus, count the number of times the given transition occurs: counting function $C(s_i \xrightarrow{w_k} s_j)$

model the probabilities on the frequencies obtained from these counting functions:
$$P_M(s_i \xrightarrow{w_k} s_j) = \frac{C(s_i \xrightarrow{w_k} s_j)}{\sum_{l,m} C(s_i \xrightarrow{w_m} s_l)}$$

a similar procedure exists for Hidden Markov Models
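A sketch of this count-and-normalize estimation; the corpus of labeled transitions is a toy example:

```python
from collections import Counter

# Training corpus: (state, label, next_state) triples.
transitions = [("s0", "a", "s1"), ("s0", "a", "s1"),
               ("s0", "b", "s0"), ("s1", "a", "s0")]

C = Counter(transitions)
out_of = Counter(t[0] for t in transitions)  # total transitions leaving each state

def P_M(si, wk, sj):
    """C(si -w_k-> sj) divided by the count of all transitions out of si."""
    return C[(si, wk, sj)] / out_of[si]

print(P_M("s0", "a", "s1"))   # 2/3
```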

32 in the case of Probabilistic Context Free Grammars: use a training corpus to estimate the probabilities of the production rules
$$P_M(N^i \to w^j) = \frac{C(N^i \to w^j)}{\sum_k C(N^i \to w^k)}$$

At the internal (hidden) nodes the counting function is related to the probabilities by
$$C(N^j \to N^p N^q) := \sum_{k,l,m} P(N^j_{k,l}, N^p_{k,m}, N^q_{m+1,l} | w_{1,n}) = \frac{1}{P(w_{1,n})} \sum_{k,l,m} P(N^j_{k,l}, N^p_{k,m}, N^q_{m+1,l}, w_{1,n})$$
$$= \frac{1}{P(w_{1,n})} \sum_{k,l,m} \alpha_j(k, l) \, P(N^j \to N^p N^q) \, \beta_p(k, m) \, \beta_q(m+1, l)$$

33 Training corpora and the syntactic-semantic interface

the existence of different parse trees for the same sentence is a sign of semantic ambiguity

training a Probabilistic Context Free Grammar over a large corpus can (sometimes) resolve ambiguities by assigning different probabilities

Example: two parsings of the sentence: They are flying planes
They (are flying) planes or They are (flying planes). This type of ambiguity might not be resolved by training over a corpus

Example: the three following sentences all have different parsings, but the most likely parsing can be picked out by training over a corpus:
- Time flies like an arrow
- Fruit flies like a banana
- Time flies like a stopwatch

34 Probabilistic Tree Adjoining Grammars

$I$ = set of initial trees of the TAG; $A$ = set of auxiliary trees of the TAG

each tree has a subset of leaf nodes marked as nodes where substitution can occur

adjunction can occur at any node marked by a non-terminal (other than those marked for substitution)

$s(\tau)$ = set of substitution nodes of the tree $\tau$; $\alpha(\tau)$ = set of adjunction nodes of $\tau$

$S(\tau, \tau', \eta)$ = substitution of the tree $\tau'$ into $\tau$ at the node $\eta$; $A(\tau, \tau', \eta)$ = adjunction of the tree $\tau'$ into the tree $\tau$ at the node $\eta$; $A(\tau, \emptyset, \eta)$ = no adjunction performed at the node $\eta$

$\Omega$ = set of all substitution and adjunction events

35 Probabilistic TAG (PTAG): $(I, A, P_I, P_S, P_A)$

$P_I : I \to \mathbb{R}$, $P_S : \Omega \to \mathbb{R}$, $P_A : \Omega \to \mathbb{R}$, with
$$\sum_{\tau \in I} P_I(\tau) = 1$$
$$\sum_{\tau' \in I} P_S(S(\tau, \tau', \eta)) = 1, \qquad \eta \in s(\tau), \ \tau \in I \cup A$$
$$\sum_{\tau' \in A \cup \{\emptyset\}} P_A(A(\tau, \tau', \eta)) = 1, \qquad \eta \in \alpha(\tau), \ \tau \in I \cup A$$

$P_I(\tau)$ = probability that a derivation begins with the tree $\tau$
$P_S(S(\tau, \tau', \eta))$ = probability of substituting $\tau'$ into $\tau$ at $\eta$
$P_A(A(\tau, \tau', \eta))$ = probability of adjoining $\tau'$ to $\tau$ at $\eta$

36 Probability of a derivation in a PTAG:
$$P(\tau) = P_I(\tau_0) \prod_{i=1}^{N} P_{op_i}(op_i(\tau_{i-1}, \tau'_i, \eta_i))$$
if $\tau$ is obtained from an initial tree $\tau_0$ through a sequence of $N$ substitutions and adjunctions $op_i$

similar to probabilistic context-free grammars
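A sketch of this product formula; the tree names, node names, and probabilities below are illustrative placeholders, not an actual PTAG:

```python
P_I = {"t0": 1.0}                       # initial-tree probabilities
P_op = {("subst", "t0", "t1", "n1"): 0.4,    # P_S and P_A, keyed by operation
        ("adjoin", "t0", "t2", "n2"): 0.25}

def derivation_probability(initial, ops):
    """P(tau) = P_I(tau_0) * prod_i P_op(op_i)."""
    prob = P_I[initial]
    for op in ops:
        prob *= P_op[op]
    return prob

print(derivation_probability("t0", [("subst", "t0", "t1", "n1"),
                                    ("adjoin", "t0", "t2", "n2")]))  # 0.1
```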

37 Some References:

1. Eugene Charniak, Statistical Language Learning, MIT Press, 1993.
2. Rens Bod, Jennifer Hay, Stefanie Jannedy (Eds.), Probabilistic Linguistics, MIT Press, 2003.
3. Philip Resnik, Probabilistic tree-adjoining grammar as a framework for statistical natural language processing, Proc. of COLING-92 (1992).
