Language Technology. Unit 1: Sequence Models. CUNY Graduate Center. Lecture 4a: Probabilities and Estimations

Size: px

Start display at page:

Download "Language Technology. Unit 1: Sequence Models. CUNY Graduate Center. Lecture 4a: Probabilities and Estimations"

Allan Parsons
5 years ago
Views:

1 Language Technology CUNY Graduate Center Unit 1: Sequence Models Lecture 4a: Probabilities and Estimations Lecture 4b: Weighted Finite-State Machines required hard optional Liang Huang

2 Probabilities experiment (e.g., toss a coin 3 times ) basic outcomes Ω (e.g., Ω={HHH, HHT, HTH,..., TTT} ) event: some subset A of Ω (e.g., A = heads twice ) probability distribution a function p from Ω to [0, 1] e Ω p(e) = 1 probability of events (marginals) p (A) = e A p(e) 2

3 Joint and Conditional Probs 3

4 Multiplication Rule 4

5 Independence P(A, B) = P(A) P(B) or P(A) = P(A B) disjoint events are always dependent! P(A,B) = 0 unless one of them is impossible : P(A)=0 conditional independence: P(A, B C) = P(A C) P(B C) P(A C) = P (A B, C) 5

6 Marginalization compute marginal probs from joint/conditional probs 6

7 Bayes Rules alternative bayes rule by partition 7

8 Most Likely Event 8

9 Most Likely Given... 9

10 Estimating Probabilities how to get probabilities for basic outcomes? do experiments count stuff e.g. how often do people start a sentence with the? P (A) = (# of sentences like the... in the sample) / (# of all sentences in the sample) P (A B) = (count of A, B) / (count of B) we will show that this is Maximum Likelihood Estimation 10

11 Probabilistic Finite-State Machines adding probabilities into finite-state acceptors (FSAs) FSA: a set of strings; WFSA: a distribution of strings 11

12 WFSA normalization: transitions leaving each state sum up to 1 defines a distribution over strings? or a distribution over paths? => also induces a distribution over strings 12

13 WFSTs FST: a relation over strings (a set of string pairs) WFST: a probabilistic relation over strings (a set of <s, t, p>: strings pair <s, t> with probability p) what is p representing? 13

14 Edit Distance as WFST this is simplified edit distance real edit distance as an example of WFST, but not PFST WFST: real edit distance... a:a/0 costs: replacement: 1 insertion: 2 deletion: 2 b:a/1 b:b/0 0 *e*:a/2 a:b/1 a:*e*/2 14

15 Normalization if transitions leaving each state and each input symbol sum up to 1, then... WFST defines conditional prob p(y x) for x => y what if we want to define a joint prob p(x, y) for x=>y? what if we want p(x y)? 15

16 Questions of WFSTs given x, y, what is p(y x)? for a given x, what s the y that maximizes p(y x)? for a given y, what s the x that maximizes p(y x)? for a given x, supply all output y w/ respective p(y x) for a given y, supply all input x w/ respective p(y x) 16

17 Answer: Composition p (z x) = p (y x) p (z y)??? = sumy p (y x) p (z y) have to sum up y given y, z & x are independent in this cascade - Why? how to build a composed WFST C out of WFSTs A, B? again, like intersection sum up the products (+, x) semiring 17

18 Example 18

19 Example from M. Mohri and J. Eisner they use (min, +), we use (+, *) 19

20 Example they use (min, +), we use (+, *) 20

21 Example they use (min, +), we use (+, *) 21

22 Given x, supply all output y no longer normalized! 22

23 Given x, y, what s p(y x) 23

24 Given x, what s max p(y x) 24

25 Part-of-Speech Tagging Again 25

26 Part-of-Speech Tagging Again 26

27 Adding a Tag Bigram Model (again) FST C: POS bigram LM p(w...w) p(t...t w...w) p(t...t) p(???) wait, is that right (mathematically)? 27

28 Noisy-Channel Model 28

29 Noisy-Channel Model p(t...t) 29

30 Applications of Noisy-Channel 30

31 Example: Edit Distance O(k) deletion arcs from J. Eisner a:ε" b:ε" a:b ε:a b:a O(k) insertion arcs ε:b b:b a:a O(k) identity arcs 31

32 Example: Edit Distance clara.o. Best path (by Dijkstra s algorithm) c:ε" l:ε" a:ε" r:ε" a:ε" c:c l:c a:c r:c a:c a:ε" b:ε" a:b ε:c ε:c ε:c ε:c ε:c ε:c c:ε" l:ε" a:ε" r:ε" a:ε" ε:a b:a = ε:a c:a l:a a:a r:a a:a ε:a ε:a c:ε" l:ε" a:ε" ε:a r:ε" ε:a a:ε" ε:a ε:b.o. caca b:b a:a ε:c ε:a c:c l:c a:c r:c a:c ε:c ε:c c:ε" ε:c ε:c ε:c l:ε" a:ε" r:ε" a:ε" c:a l:a a:a r:a a:a ε:a ε:a ε:a ε:a ε:a c:ε" l:ε" a:ε" r:ε" a:ε" 32

Max / Sum Probs in a WFSA, which string x has the greatest p(x)? graph search (shortest path) problem Dijkstra; or Viterbi if the FSA is acyclic does it work for NFA?

33 Max / Sum Probs in a WFSA, which string x has the greatest p(x)? graph search (shortest path) problem Dijkstra; or Viterbi if the FSA is acyclic does it work for NFA? best path much easier than best string you can determinize it (with exponential cost!) popular work-around: n-best list crunching Edsger Dijkstra ( ) GOTO considered harmful (b. 1932) Viterbi Alg. (1967) CMDA, Qualcomm 33

34 Dijkstra 1959 vs. Viterbi 1967 Edsger Dijkstra ( ) GOTO considered harmful that s min. spanning tree! Jarnik (1930) - Prim (1957) - Dijkstra (1959) 34

35 Dijkstra 1959 vs. Viterbi 1967 that s shortest-path Moore (1957) - Dijkstra (1959) 35

36 Dijkstra 1959 vs. Viterbi 1967 special case of dynamic programming (Bellman, 1957) 36

37 Sum Probs what is P(x) for some particular x? for DFA, just follow x for NFA, get a subgraph (by composition), then sum?? acyclic => Viterbi cyclic => compute strongly connected components SCC-DAG cluster graph (cyclic locally, acyclic globally) do infinite sum (matrix inversion) locally, Viterbi globally refer to extra readings on course website 37

Natural Language Processing

Natural Language Processing Spring 2017 Unit 1: Sequence Models Lecture 4a: Probabilities and Estimations Lecture 4b: Weighted Finite-State Machines required optional Liang Huang Probabilities experiment