Statistical Processing of Natural Language and DMKM - Universitat Politècnica de Catalunya
Outline
1. Markov Models
2. Hidden Markov Models
3. Fundamental Questions: 1. Observation Probability, 2. Best State Sequence, 3. Parameter Estimation
4. References
Generative vs. Conditional Models
Generative models: based on Bayes' rule and independence assumptions; able to generate data. Conditional models: no independence assumptions; unable to generate data. Most algorithms of both kinds make assumptions about the nature of the data-generating process, predefining a fixed model structure and acquiring only the distributional information from the data.
Usual Statistical Models in NLP
Generative models:
- Graphical: HMM (Rabiner 1990), IOHMM (Bengio 1996).
- Automata-learning algorithms (no assumptions about model structure): VLMM (Rissanen 1983), Suffix Trees (Galil & Giancarlo 1988), CSSR (Shalizi & Shalizi 2004).
- Non-graphical: Stochastic Grammars (Lari & Young 1990).
Conditional models:
- Graphical: discriminative MM (Bottou 1991), MEMM (McCallum et al. 2000), CRF (Lafferty et al. 2001).
- Non-graphical: Maximum Entropy (Berger et al. 1996).
(Visible) Markov Models
$X = (X_1, \dots, X_T)$: sequence of random variables taking values in $S = \{s_1, \dots, s_N\}$.
Properties:
- Limited Horizon: $P(X_{t+1} = s_k \mid X_1, \dots, X_t) = P(X_{t+1} = s_k \mid X_t)$
- Time Invariant (Stationary): $P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)$
Transition matrix: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, with $a_{ij} \ge 0$ $\forall i,j$ and $\sum_{j=1}^{N} a_{ij} = 1$ $\forall i$.
Initial probabilities (or an extra state $s_0$): $\pi_i = P(X_1 = s_i)$, with $\sum_{i=1}^{N} \pi_i = 1$.
MM Example
[Figure: state-transition diagram over states h, a, p, e, t, i with transition probabilities on the arcs.]
Sequence probability:
$P(X_1, \dots, X_T) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1 X_2) \cdots P(X_T \mid X_1 \dots X_{T-1})$
$= P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1})$
$= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$
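A minimal Python sketch of this sequence-probability computation (not part of the original slides): the state names mimic the h/a/p/e/t/i diagram above, but the concrete values of pi and A are assumptions made up for illustration.

```python
# Sketch: P(X_1..X_T) = pi_{X_1} * prod_t a_{X_t X_{t+1}} for a visible Markov model.
import numpy as np

states = ["h", "a", "p", "e", "t", "i"]          # hypothetical state set
idx = {s: i for i, s in enumerate(states)}

# Hypothetical initial and transition probabilities (each row of A sums to 1).
pi = np.array([0.6, 0.0, 0.0, 0.3, 0.1, 0.0])
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # h -> a
    [0.0, 0.0, 0.4, 0.0, 0.6, 0.0],   # a -> p / t
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # p -> i
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],   # e -> t
    [0.3, 0.4, 0.0, 0.3, 0.0, 0.0],   # t -> h / a / e
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],   # i -> p
])

def sequence_probability(seq):
    """P(X_1..X_T) = pi[X_1] * prod_{t=1}^{T-1} A[X_t, X_{t+1}]."""
    s = [idx[x] for x in seq]
    p = pi[s[0]]
    for t in range(len(s) - 1):
        p *= A[s[t], s[t + 1]]
    return p

print(sequence_probability(["h", "a", "p", "i"]))
```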
Hidden Markov Models (HMM): States and Observations
Emission probability: $b_{ik} = P(O_t = k \mid X_t = s_i)$
Used when underlying events probabilistically generate surface events:
- PoS tagging (hidden states: PoS tags; observations: words)
- ASR (hidden states: phonemes; observations: sounds)
- ...
Trainable with unannotated data: Expectation Maximization (EM) algorithm.
Arc-emission vs. state-emission.
Example: PoS Tagging. Emission probabilities:
  <FF>: <FF> 1.0
  Dt:   the 0.6, this 0.4
  N:    cat 0.6, kid 0.1, fish 0.3
  V:    eats 0.7, runs 0.3
  Adj:  fresh 0.3, little 0.3, big 0.4
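As a rough illustration of how such an HMM could be encoded, the sketch below stores the emission table above as a matrix B; the transition matrix A and the initial vector pi are not given in the slides and are purely hypothetical values used for illustration.

```python
# Sketch: encoding the PoS-tagging HMM parameters (pi, A, B) as numpy arrays.
import numpy as np

tags = ["Dt", "N", "V", "Adj"]
words = ["the", "this", "cat", "kid", "eats", "runs", "fish", "fresh", "little", "big"]

# B[i, k] = P(O_t = word_k | X_t = tag_i); mirrors the emission table above.
B = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # Dt
    [0.0, 0.0, 0.6, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0],  # N
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3, 0.0, 0.0, 0.0, 0.0],  # V
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # Adj
])

# Hypothetical transition and initial probabilities (NOT in the slides).
A = np.array([
    [0.0, 0.7, 0.0, 0.3],   # Dt  -> N / Adj
    [0.1, 0.1, 0.7, 0.1],   # N   -> mostly V
    [0.5, 0.3, 0.0, 0.2],   # V   -> Dt / N / Adj
    [0.0, 0.8, 0.0, 0.2],   # Adj -> N / Adj
])
pi = np.array([0.8, 0.1, 0.05, 0.05])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
print(B[tags.index("N"), words.index("fish")])   # P(fish | N) = 0.3
```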
The three fundamental questions:
1. Observation probability (decoding): Given a model $\mu = (A, B, \pi)$, how do we efficiently compute how likely a certain observation sequence is? That is, $P_\mu(O)$.
2. Classification: Given an observed sequence $O$ and a model $\mu$, how do we choose the state sequence $(X_1, \dots, X_T)$ that best explains the observations?
3. Parameter estimation: Given an observed sequence $O$ and a space of possible models, each with different parameters $(A, B, \pi)$, how do we find the model that best explains the observed data?
Question 1. Observation Probability
Let $O = (o_1, \dots, o_T)$ be an observation sequence. For any state sequence $X = (X_1, \dots, X_T)$ we have:
$P_\mu(O \mid X) = \prod_{t=1}^{T} P_\mu(o_t \mid X_t) = b_{X_1 o_1} b_{X_2 o_2} \cdots b_{X_T o_T}$
$P_\mu(X) = \pi_{X_1} a_{X_1 X_2} a_{X_2 X_3} \cdots a_{X_{T-1} X_T}$
$P_\mu(O) = \sum_X P_\mu(O, X) = \sum_X P_\mu(O \mid X) P_\mu(X) = \sum_{X_1 \dots X_T} \pi_{X_1} b_{X_1 o_1} \prod_{t=2}^{T} a_{X_{t-1} X_t} b_{X_t o_t}$
Complexity: $O(T N^T)$. Dynamic programming (trellis/lattice): $O(T N^2)$.
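A brute-force sketch of this summation over all state sequences (toy parameters, not from the slides); it makes the $O(T N^T)$ cost explicit, which is exactly what the trellis-based algorithms below avoid.

```python
# Sketch: P(O) = sum over all state sequences X of P(O|X) P(X), enumerated exhaustively.
from itertools import product
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],          # emission probabilities per state
              [0.1, 0.9]])

def brute_force_prob(obs):
    N, T = len(pi), len(obs)
    total = 0.0
    for X in product(range(N), repeat=T):          # all N^T state sequences
        p = pi[X[0]] * B[X[0], obs[0]]
        for t in range(1, T):
            p *= A[X[t - 1], X[t]] * B[X[t], obs[t]]
        total += p
    return total

print(brute_force_prob([0, 1, 1, 0]))
```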
Trellis
[Figure: trellis with states $s_1, \dots, s_N$ on the vertical axis and time steps $1, 2, 3, \dots, T+1$ on the horizontal axis.]
The trellis is fully connected: one can move from any state to any other at each step. A node $\{s_i, t\}$ of the trellis stores information about state sequences that include $X_t = i$.
Forward & Backward Computation
Forward procedure, $O(T N^2)$. We store $\alpha_i(t)$ at each trellis node $\{s_i, t\}$:
$\alpha_i(t) = P_\mu(o_1 \dots o_t, X_t = i)$ : probability of emitting $o_1 \dots o_t$ and reaching state $s_i$ at time $t$.
1. Initialization: $\alpha_i(1) = \pi_i b_{i o_1}$, $i = 1 \dots N$
2. Induction: for $t : 1 \le t < T$: $\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_{j o_{t+1}}$, $j = 1 \dots N$
3. Total: $P_\mu(O) = \sum_{i=1}^{N} \alpha_i(T)$
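A minimal sketch of the forward procedure, reusing the same toy model parameters as the brute-force example; names follow the slide notation (pi, A, B, alpha).

```python
# Sketch: forward procedure, alpha[t, i] = P(o_1..o_t, X_t = s_i), at O(T N^2) cost.
import numpy as np

def forward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(T - 1):                          # induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()                   # total: P(O) = sum_i alpha_i(T)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
alpha, prob = forward(pi, A, B, [0, 1, 1, 0])
print(prob)   # same value as the brute-force sum, at O(T N^2) cost
```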
Forward Computation
[Figure: closeup of the computation of forward probabilities at one node. The forward probability $\alpha_j(t+1)$ is calculated by summing the product of the probability on each incoming arc with the forward probability of the originating node.]
Forward & Backward Computation
Backward procedure, $O(T N^2)$. We store $\beta_i(t)$ at each trellis node $\{s_i, t\}$:
$\beta_i(t) = P_\mu(o_{t+1} \dots o_T \mid X_t = i)$ : probability of emitting $o_{t+1} \dots o_T$ given we are in state $s_i$ at time $t$.
1. Initialization: $\beta_i(T) = 1$, $i = 1 \dots N$
2. Induction: for $t : 1 \le t < T$: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)$, $i = 1 \dots N$
3. Total: $P_\mu(O) = \sum_{i=1}^{N} \pi_i\, b_{i o_1}\, \beta_i(1)$
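A matching sketch of the backward procedure under the same toy assumptions; the final print shows that the backward total agrees with the forward one.

```python
# Sketch: backward procedure, beta[t, i] = P(o_{t+1}..o_T | X_t = s_i).
import numpy as np

def backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization
    for t in range(T - 2, -1, -1):                   # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum() # total: sum_i pi_i b_{i o_1} beta_i(1)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
beta, prob = backward(pi, A, B, [0, 1, 1, 0])
print(prob)   # equals the forward total; also sum_i alpha_i(t) beta_i(t) for any t
```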
Forward & Backward Computation: Combination
$P_\mu(O, X_t = i) = P_\mu(o_1 \dots o_t, X_t = i)\, P_\mu(o_{t+1} \dots o_T \mid X_t = i) = \alpha_i(t)\, \beta_i(t)$
$P_\mu(O) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$, for any $t : 1 \le t \le T$
The Forward and Backward totals are particular cases of this equation for $t = T$ and $t = 1$, respectively.
Question 2. Best State Sequence
Most likely path for a given observation $O$:
$\operatorname{argmax}_X P_\mu(X \mid O) = \operatorname{argmax}_X \dfrac{P_\mu(X, O)}{P_\mu(O)} = \operatorname{argmax}_X P_\mu(X, O)$ (since $O$ is fixed)
Compute the best sequence with the same recursive approach as in the Forward-Backward computation: the Viterbi algorithm, $O(T N^2)$.
$\delta_j(t) = \max_{X_1 \dots X_{t-1}} P_\mu(X_1 \dots X_{t-1}\, s_j, o_1 \dots o_t)$ : highest probability of any sequence reaching state $s_j$ at time $t$ after emitting $o_1 \dots o_t$.
$\psi_j(t) = \operatorname{last}\bigl(\operatorname{argmax}_{X_1 \dots X_{t-1}} P_\mu(X_1 \dots X_{t-1}\, s_j, o_1 \dots o_t)\bigr)$ : last state ($X_{t-1}$) in the highest-probability sequence reaching state $s_j$ at time $t$ after emitting $o_1 \dots o_t$.
Viterbi Algorithm
1. Initialization: for $j = 1 \dots N$: $\delta_j(1) = \pi_j b_{j o_1}$; $\psi_j(1) = 0$
2. Induction: for $t : 1 \le t < T$, $j = 1 \dots N$:
   $\delta_j(t+1) = \left[ \max_{1 \le i \le N} \delta_i(t)\, a_{ij} \right] b_{j o_{t+1}}$
   $\psi_j(t+1) = \operatorname{argmax}_{1 \le i \le N} \delta_i(t)\, a_{ij}$
3. Termination: backwards path readout.
   $\hat{X}_T = \operatorname{argmax}_{1 \le i \le N} \delta_i(T)$
   $\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
   $P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T)$
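A minimal sketch of the Viterbi recursion and the backwards path readout, again with the same toy parameters as the earlier examples.

```python
# Sketch: Viterbi algorithm with delta/psi tables and backwards path readout.
import numpy as np

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(T - 1):                            # induction
        scores = delta[t][:, None] * A                # scores[i, j] = delta_i(t) * a_ij
        psi[t + 1] = scores.argmax(axis=0)
        delta[t + 1] = scores.max(axis=0) * B[:, obs[t + 1]]
    path = np.zeros(T, dtype=int)                     # termination: backwards readout
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 1, 1, 0]))
```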
Question 3. Parameter Estimation
Obtain the parameters $(A, B, \pi)$ of the model $\mu$ that maximizes the probability of the given observation $O$:
$(A, B, \pi) = \operatorname{argmax}_\mu P_\mu(O)$
Baum-Welch Algorithm
Baum-Welch algorithm (a.k.a. Forward-Backward):
1. Start with an initial model $\mu_0$ (uniform, random, MLE, ...).
2. Compute the observation probability (Forward & Backward computation) using the current model $\mu$.
3. Use the obtained probabilities as data to re-estimate the model, computing $\hat\mu$.
4. Let $\mu = \hat\mu$ and repeat until no significant improvement.
Iterative hill-climbing: may converge to a local maximum. A particular application of the Expectation Maximization (EM) algorithm.
EM property: $P_{\hat\mu}(O) \ge P_\mu(O)$
Definitions
$\gamma_i(t) = P_\mu(X_t = i \mid O) = \dfrac{P_\mu(X_t = i, O)}{P_\mu(O)} = \dfrac{\alpha_i(t)\,\beta_i(t)}{\sum_{k=1}^{N} \alpha_k(t)\,\beta_k(t)}$ : probability of being in state $s_i$ at time $t$ given observation $O$.
$\varphi_t(i, j) = P_\mu(X_t = i, X_{t+1} = j \mid O) = \dfrac{P_\mu(X_t = i, X_{t+1} = j, O)}{P_\mu(O)} = \dfrac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{\sum_{k=1}^{N} \alpha_k(t)\,\beta_k(t)}$ : probability of moving from state $s_i$ at time $t$ to state $s_j$ at time $t+1$, given observation sequence $O$.
Note that $\gamma_i(t) = \sum_{j=1}^{N} \varphi_t(i, j)$.
$\sum_{t=1}^{T-1} \gamma_i(t)$ : expected number of transitions from state $s_i$ in $O$.
$\sum_{t=1}^{T-1} \varphi_t(i, j)$ : expected number of transitions from state $s_i$ to $s_j$ in $O$.
Arc Probability
[Figure: given an observation $O$ and the model $\mu$, the probability $\varphi_t(i, j)$ of moving from state $s_i$ at time $t$ to state $s_j$ at time $t+1$, given observation $O$.]
Re-estimation
Iterative re-estimation:
$\hat\pi_i = \text{expected frequency in state } s_i \text{ at time } t = 1 = \gamma_i(1)$
$\hat a_{ij} = \dfrac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i} = \dfrac{\sum_{t=1}^{T-1} \varphi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_i(t)}$
$\hat b_{jk} = \dfrac{\text{expected number of emissions of } k \text{ from } s_j}{\text{expected number of visits to } s_j} = \dfrac{\sum_{\{t:\ 1 \le t \le T,\ o_t = k\}} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}$
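A sketch of one Baum-Welch re-estimation step, combining the gamma/phi definitions with the re-estimation formulas above; parameters are toy values, and the loop simply runs a fixed number of iterations instead of testing for convergence.

```python
# Sketch: one Baum-Welch (EM) re-estimation step for an HMM, repeated a few times.
import numpy as np

def forward_backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_step(pi, A, B, obs):
    N, T = len(pi), len(obs)
    alpha, beta, prob = forward_backward(pi, A, B, obs)
    gamma = alpha * beta / prob                                   # gamma[t, i] = P(X_t = i | O)
    xi = np.zeros((T - 1, N, N))                                  # xi[t, i, j] = P(X_t=i, X_{t+1}=j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob
    new_pi = gamma[0]                                             # expected frequency in s_i at t = 1
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]      # expected transitions i->j / from i
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                                   # expected emissions of k from j / visits to j
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, prob

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
obs = [0, 1, 1, 0, 0, 1]
for _ in range(10):                                               # iterate until convergence in practice
    pi, A, B, prob = baum_welch_step(pi, A, B, obs)
print(prob)
```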
References
C. Manning & H. Schütze, Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, May 1999.
S. L. Lauritzen, Graphical Models. Oxford University Press, 1996.
L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.