
Stochastic Parsing Roberto Basili Department of Computer Science, System and Production University of Roma, Tor Vergata Via Della Ricerca Scientifica s.n.c., 00133, Roma, ITALY e-mail: basili@info.uniroma2.it

Outline:
- Stochastic Processes and Methods
- POS tagging as a stochastic process
- Probabilistic Approaches to Syntactic Analysis
  - CF grammar-based approaches (e.g. PCFG)
  - Other approaches (lexicalized or dependency-based)
- Further Relevant Issues

The role of Quantitative Approaches. Weighted grammars are models of the degree of grammaticality, able to deal with disambiguation:
1. S -> NP V
2. S -> NP
3. NP -> PN
4. NP -> N
5. NP -> Adj N
6. N -> imposta
7. V -> imposta
8. Adj -> Pesante
9. PN -> Pesante
...

The role of Quantitative Approaches. [Figure: the two derivation trees for the ambiguous phrase "Pesante imposta": (S (NP (PN Pesante)) (V imposta)) and (S (NP (Adj Pesante) (N imposta))).]

The role of Quantitative Approaches. Weighted grammars are models of the degree of grammaticality, able to deal with disambiguation:
1. S -> NP V .7
2. S -> NP .3
3. NP -> PN .1
4. NP -> N .6
5. NP -> Adj N .3
6. N -> imposta .6
7. V -> imposta .4
8. Adj -> Pesante .8
9. PN -> Pesante .2
...

The role of Quantitative Approaches. [Figure: the same two derivation trees for "Pesante imposta", annotated with rule probabilities: S -> NP V (.7), NP -> PN (.1), PN -> Pesante (.2), V -> imposta (.4) in the first tree; S -> NP (.3), NP -> Adj N (.3), Adj -> Pesante (.8), N -> imposta (.6) in the second.]

The role of Quantitative Approaches. Weighted grammars are models of the degree of grammaticality, able to deal with disambiguation:
1. S -> NP V .7
2. S -> NP .3
3. NP -> PN .1
4. NP -> N .6
5. NP -> Adj N .3
6. N -> imposta .6
7. V -> imposta .4
8. Adj -> Pesante .8
9. PN -> Pesante .2
...
prob(((Pesante)_PN (imposta)_V)_S) = .7 * .1 * .2 * .4 = 0.0056
prob(((Pesante)_Adj (imposta)_N)_S) = .3 * .3 * .8 * .6 = 0.0432
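As a sanity check on these numbers, here is a minimal Python sketch (not part of the original slides) that multiplies the probabilities of the rules used by each derivation:

```python
# Toy weighted grammar from the slide: rule -> probability
rules = {
    ("S", ("NP", "V")): 0.7,
    ("S", ("NP",)): 0.3,
    ("NP", ("PN",)): 0.1,
    ("NP", ("N",)): 0.6,
    ("NP", ("Adj", "N")): 0.3,
    ("N", ("imposta",)): 0.6,
    ("V", ("imposta",)): 0.4,
    ("Adj", ("Pesante",)): 0.8,
    ("PN", ("Pesante",)): 0.2,
}

def derivation_prob(used_rules):
    """Probability of a derivation = product of the probabilities of its rules."""
    p = 1.0
    for r in used_rules:
        p *= rules[r]
    return p

# Derivation 1: (S (NP (PN Pesante)) (V imposta))
d1 = [("S", ("NP", "V")), ("NP", ("PN",)), ("PN", ("Pesante",)), ("V", ("imposta",))]
# Derivation 2: (S (NP (Adj Pesante) (N imposta)))
d2 = [("S", ("NP",)), ("NP", ("Adj", "N")), ("Adj", ("Pesante",)), ("N", ("imposta",))]

print(derivation_prob(d1))  # ~0.0056
print(derivation_prob(d2))  # ~0.0432 -> the Adj N reading is preferred
```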

Structural Disambiguation. [Figure: two derivation trees for "portare borsa in pelle", one attaching the PP "in pelle" to the NP "borsa", the other attaching it to the VP headed by "portare".] Derivation trees for a structurally ambiguous sentence.

Structural Disambiguation (cont'd). [Figure: two derivation trees for "portare borsa in mano", again with the PP "in mano" attached either to the NP "borsa" or to the VP headed by "portare".] Derivation trees for a second structurally ambiguous sentence.

Structural Disambiguation (cont'd). [Figure: the preferred tree for each sentence: in "portare borsa in pelle" the PP attaches to the noun "borsa"; in "portare borsa in mano" it attaches to the verb "portare".] The preferences are captured by lexical statistics: p(portare, in, pelle) << p(borsa, in, pelle) and p(borsa, in, mano) << p(portare, in, mano). Disambiguation of structural ambiguity.

Error tolerance. [Figure: the NP "vendita di articoli da regalo" with its full parse, next to the telegraphic variant "vendita articoli regalo", for which the grammar provides no complete analysis.] An example of an ungrammatical but meaningful sentence.

Error tolerance (cont'd). [Figure: an extended grammar assigns a derivation Γ to "vendita di articoli da regalo" and a parallel structure to the telegraphic "vendita articoli regalo", so that both receive non-zero probability (p(Γ) > 0).] Modeling of ungrammatical phenomena.

Probability and Language Modeling. Aims: to extend existing models with predictive and disambiguation capabilities; to offer theoretically well-founded inductive methods; to develop (not-so) quantitative models of linguistic phenomena. Methods and resources: mathematical theories (e.g. Markov models); systematic testing/evaluation frameworks; extended repositories of language-use instances; traditional linguistic resources (e.g. models like dictionaries).

Probability and the Empiricist Renaissance (2). Differences: the amount of knowledge available a priori; the target (competence vs. performance); the methods (deduction vs. induction). The role of probability in NLP is also related to: the difficulty of categorical statements in language study (e.g. grammaticality or syntactic categorization); the cognitive nature of language understanding; the role of uncertainty.

Probability and Language Modeling. Signals are abstracted via symbols, not known in advance. Emitted signals belong to an alphabet A. Time is discrete: each time point corresponds to an emitted signal. Sequences of symbols (w_1, ..., w_n) correspond to sequences of time points (1, ..., n). [Figure: a timeline of time points 1, ..., 8 with the corresponding emitted symbols w_{i1}, ..., w_{i8}.]

Probability and Language Modeling. A random variable X can be introduced so that: it assumes values w_i in the alphabet A; probability is used to describe the uncertainty on the emitted signal, p(X = w_i), w_i ∈ A.

Probability and Language Modeling. A random variable X_i can be introduced for each step i, so that: X_i assumes values in A, i.e. X_i = w_j, with probability p(X_i = w_j). Constraint: the total probability sums to one at each step, Σ_j p(X_i = w_j) = 1 for every i.

Probability and Language Modeling. Notice that time points can be represented as states of the emitting source. An output w_i can be considered as emitted by the source in a given state X_i, given a certain history.

Probability and Language Modeling. Formally:
P(X_i = w_i, X_{i-1} = w_{i-1}, ..., X_1 = w_1) = P(X_i = w_i | X_{i-1} = w_{i-1}, X_{i-2} = w_{i-2}, ..., X_1 = w_1) P(X_{i-1} = w_{i-1}, ..., X_1 = w_1)

Probability and Language Modeling. What's in a state? Preceding words: n-gram models (states 1, 2, 3 labelled the, black, dog). p(the, black, dog) = p(dog | the, black) ...

Probability and Language Modeling. What's in a state? Preceding words: n-gram models. p(the, black, dog) = p(dog | the, black) p(black | the) p(the)
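A small illustration of this chain-rule decomposition in Python; the conditional probability values below are purely hypothetical and exist only to make the arithmetic concrete:

```python
# Purely hypothetical conditional probabilities, for illustration only.
p_unigram = {"the": 0.06}                      # p(the)
p_bigram = {("black", "the"): 0.001}           # p(black | the)
p_trigram = {("dog", ("the", "black")): 0.2}   # p(dog | the, black)

def sequence_prob(w1, w2, w3):
    """Chain rule: p(w1, w2, w3) = p(w1) * p(w2 | w1) * p(w3 | w1, w2)."""
    return p_unigram[w1] * p_bigram[(w2, w1)] * p_trigram[(w3, (w1, w2))]

print(sequence_prob("the", "black", "dog"))  # 0.06 * 0.001 * 0.2 = 1.2e-05
```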

Probability and Language Modeling. What's in a state? Preceding POS tags: stochastic taggers (e.g. POS_1 = Det, POS_2 = Adj for "the black ..."). p(the_DT, black_ADJ, dog_N) = p(dog_N | the_DT, black_ADJ) ...

Probability and Language Modeling. What's in a state? Preceding parses: stochastic grammars. Here the state records the partial parse built so far, e.g. p((the_Det, (black_ADJ, dog_N)_NP)_NP) = p(dog_N | ((the_Det), (black_ADJ, ·))) ...

Probability and Language Modeling (2). Expressivity: the predictivity of a statistical device can be a very good explanatory model of the source information; simpler and systematic induction; simpler and augmented description (e.g. grammatical preference); optimized coverage (better on the more important phenomena). Integrating linguistic description: either start with poor assumptions and approximate as much as possible what is known (evaluating performance only), or bias the statistical model from the beginning and check the results on linguistic grounds.

Probability and Language Modeling (3). Performance: faster processing; faster design. Linguistic adequacy: acceptance; psychological plausibility; explanatory power; tools for further analysis of linguistic data.

Markov Models. Suppose X_1, X_2, ..., X_T form a sequence of random variables taking values in a countable set W = {p_1, p_2, ..., p_N} (the state space).
Limited horizon property: P(X_{t+1} = p_k | X_1, ..., X_t) = P(X_{t+1} = p_k | X_t)
Time invariance: P(X_{t+1} = p_k | X_t = p_l) = P(X_2 = p_k | X_1 = p_l) for all t (> 1)
It follows that the sequence X_1, X_2, ..., X_T is a Markov chain.

Representation of a Markov Chain. Matrix representation: a (transition) matrix A with a_{ij} = P(X_{t+1} = p_j | X_t = p_i). Note that a_{ij} >= 0 for all i, j and Σ_j a_{ij} = 1 for all i. Initial state description (i.e. probabilities of the initial states): π_i = P(X_1 = p_i), with Σ_{i=1}^{N} π_i = 1.
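A minimal numpy sketch of this representation (illustrative only; the transition values are chosen to match the coffee-machine example introduced below):

```python
import numpy as np

# Transition matrix a_ij = P(X_{t+1} = p_j | X_t = p_i) over states (p_1, p_2).
A = np.array([[0.7, 0.3],
              [0.5, 0.5]])
# Initial distribution pi_i = P(X_1 = p_i).
pi = np.array([0.0, 1.0])

# Constraints: every row of A sums to 1, and pi sums to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

# The distribution of X_t is pi @ A^(t-1).
def state_distribution(pi, A, t):
    return pi @ np.linalg.matrix_power(A, t - 1)

print(state_distribution(pi, A, 3))  # [0.6 0.4] -> P(X_3 = p_1) = 0.6
```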

Representation of a Markov Chain. Graphical representation (i.e. automata): states as nodes with names; transitions from the i-th to the j-th state as arcs labelled by the conditional probabilities P(X_{t+1} = p_j | X_t = p_i). Note that zero-probability arcs are omitted from the graph.

Representation of a Markov Chain. Graphical representation. [Figure: a four-state automaton over p_1, ..., p_4 with p_1 as start state (P(X_1 = p_1) = 1) and arcs labelled with transition probabilities (0.2, 1.0, 0.8, 0.7, 0.3); it illustrates, for all k, P(X_k = p_3 | X_{k-1} = p_2) = 0.7 and P(X_k = p_4 | X_{k-1} = p_1) = 0, the zero-probability arc being omitted.]

A Simple Example of Hidden Markov Model: the Crazy Coffee Machine. Two states: Tea Preferring (TP) and Coffee Preferring (CP); the machine switches from one state to the other randomly. It is a simple (or visible) Markov model iff the machine outputs Tea in TP and Coffee in CP. What we need is a description of the random event of switching from one state to another. More formally, for each time step n and each pair of states p_i and p_j, we need to determine the conditional probabilities P(X_{n+1} = p_j | X_n = p_i), where p_i, p_j range over the two states TP, CP.

A Simple Example of Hidden Markov Model: the Crazy Coffee Machine. Assume, for example, the following state transition model:

        TP    CP
  TP    0.70  0.30
  CP    0.50  0.50

and let CP be the starting state (i.e. π_CP = 1, π_TP = 0). Potential uses: What is the probability of being in state TP at time step 3? What is the probability of being in state TP at time step n? What is the probability of the output sequence (Coffee, Tea, Coffee)?

Crazy Coffee Machine: graphical representation. [Figure: a two-state automaton with arcs TP -> TP 0.7, TP -> CP 0.3, CP -> TP 0.5, CP -> CP 0.5; CP is the start state.]

Crazy Coffee Machine. Solutions:
P(X_3 = TP) (given by the paths (CP, CP, TP) and (CP, TP, TP))
= P(X_1 = CP) P(X_2 = CP | X_1 = CP) P(X_3 = TP | X_1 = CP, X_2 = CP) + P(X_1 = CP) P(X_2 = TP | X_1 = CP) P(X_3 = TP | X_1 = CP, X_2 = TP) =
= P(CP) P(CP | CP) P(TP | CP, CP) + P(CP) P(TP | CP) P(TP | CP, TP) =
= P(CP) P(CP | CP) P(TP | CP) + P(CP) P(TP | CP) P(TP | TP) =
= 1 * 0.50 * 0.50 + 1 * 0.50 * 0.70 = 0.25 + 0.35 = 0.60
In the general case,
P(X_n = TP) = Σ_{p_2, ..., p_{n-1}} P(X_1 = CP) P(X_2 = p_2 | X_1 = CP) P(X_3 = p_3 | X_1 = CP, X_2 = p_2) ... P(X_n = TP | X_1 = CP, X_2 = p_2, ..., X_{n-1} = p_{n-1}) =
= Σ_{p_2, ..., p_{n-1}} P(CP) P(p_2 | CP) P(p_3 | p_2) ... P(TP | p_{n-1}) =
= Σ_{p_1, ..., p_n} P(p_1) Π_{t=1}^{n-1} P(p_{t+1} | p_t)   (with p_1 = CP and p_n = TP fixed)
P(Cof, Tea, Cof) = P(Cof) P(Tea | Cof) P(Cof | Tea) = 1 * 0.5 * 0.3 = 0.15
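Both results can be checked with a few lines of Python (a brute-force sketch over state paths, using the transition table above):

```python
from itertools import product

states = ["TP", "CP"]
P_init = {"TP": 0.0, "CP": 1.0}
P_trans = {("TP", "TP"): 0.7, ("TP", "CP"): 0.3,   # P(next | current)
           ("CP", "TP"): 0.5, ("CP", "CP"): 0.5}

def path_prob(path):
    """Probability of a state sequence under the Markov chain."""
    p = P_init[path[0]]
    for prev, nxt in zip(path, path[1:]):
        p *= P_trans[(prev, nxt)]
    return p

# P(X_3 = TP): sum over all length-3 state paths ending in TP.
print(sum(path_prob((p1, p2, "TP")) for p1, p2 in product(states, repeat=2)))  # ~0.6

# Visible model: Coffee <-> CP and Tea <-> TP, so P(Cof, Tea, Cof) = P(CP, TP, CP).
print(path_prob(("CP", "TP", "CP")))  # 0.15
```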

A Simple Example of Hidden Markov Model (2): the Crazy Coffee Machine. Hidden Markov model: the machine outputs Tea, Coffee or Cappuccino, independently of whether it is in CP or TP. What we need is a description of the random event of outputting a drink.

Crazy Coffee Machine. A description of the random event of outputting a drink: more formally, we need, for each time step n and for each kind of output O = {Tea, Cof, Cap}, the conditional probabilities P(O_n = k | X_n = p_i, X_{n+1} = p_j), where k is one of the values Tea, Coffee or Cappuccino. This matrix is called the output matrix of the machine (or of its Hidden Markov Model).

A Simple Example of Hidden Markov Model (2): the Crazy Coffee Machine. Given the following output probabilities for the machine

        Tea   Coffee  Cappuccino
  TP    0.8   0.2     0.0
  CP    0.15  0.65    0.2

and let CP be the starting state (i.e. π_CP = 1, π_TP = 0), find the following probabilities of output from the machine:
1. (Cappuccino, Coffee), given that the state sequence is (CP, TP, TP)
2. (Tea, Coffee), for any state sequence
3. a generic output O = (o_1, ..., o_n), for any state sequence

A Simple Example of Hidden Markov Model (2). Solution for problem 1. For the given state sequence X = (CP, TP, TP):
P(O_1 = Cap, O_2 = Cof, X_1 = CP, X_2 = TP, X_3 = TP) =
= P(O_1 = Cap, O_2 = Cof | X_1 = CP, X_2 = TP, X_3 = TP) P(X_1 = CP, X_2 = TP, X_3 = TP) =
= P(Cap, Cof | CP, TP, TP) P(CP, TP, TP)
Now, P(Cap, Cof | CP, TP, TP) is the probability of outputting Cap and Cof during the transitions from CP to TP and from TP to TP, while P(CP, TP, TP) is the probability of the transition chain. Therefore
P(Cap, Cof | CP, TP, TP) = P(Cap | CP, TP) P(Cof | TP, TP) = (in our simplified model) = P(Cap | CP) P(Cof | TP) = 0.2 * 0.2 = 0.04

A Simple Example of Hidden Markov Model (2). Solution for problem 2. In general, for any sequence of three states X = (X_1, X_2, X_3):
P(Tea, Cof) = (as the state sequences form a partition of the sample space) = Σ_{X_1, X_2, X_3} P(Tea, Cof | X_1, X_2, X_3) P(X_1, X_2, X_3)
where P(Tea, Cof | X_1, X_2, X_3) = P(Tea | X_1, X_2) P(Cof | X_2, X_3) = (for the simplified model of the coffee machine) = P(Tea | X_1) P(Cof | X_2)
and (by the Markov constraint) P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_2).
Since CP is the starting state, only the following transition chains contribute: (CP, CP, CP), (CP, TP, CP), (CP, CP, TP), (CP, TP, TP), so that the following probability is obtained:

P(Tea, Cof) =
= P(Tea | CP) P(Cof | CP) P(CP) P(CP | CP) P(CP | CP)   [chain (CP, CP, CP)]
+ P(Tea | CP) P(Cof | TP) P(CP) P(TP | CP) P(CP | TP)   [chain (CP, TP, CP)]
+ P(Tea | CP) P(Cof | CP) P(CP) P(CP | CP) P(TP | CP)   [chain (CP, CP, TP)]
+ P(Tea | CP) P(Cof | TP) P(CP) P(TP | CP) P(TP | TP)   [chain (CP, TP, TP)]
= 0.15 * 0.65 * 1 * 0.5 * 0.5 + 0.15 * 0.2 * 1 * 0.5 * 0.3 + 0.15 * 0.65 * 1 * 0.5 * 0.5 + 0.15 * 0.2 * 1 * 0.5 * 0.7 =
= 0.024375 + 0.0045 + 0.024375 + 0.0105 = 0.06375

A Simple Example of Hidden Markov Model (2). Solution to problem 3 (Likelihood). In the general case, a sequence of n symbols O = (o_1, ..., o_n) emitted along any sequence of n+1 states X = (p_1, ..., p_{n+1}) can be predicted by the following probability:
P(O) = Σ_{p_1, ..., p_{n+1}} P(O | X) P(X) = Σ_{p_1, ..., p_{n+1}} P(p_1) Π_{t=1}^{n} P(O_t | p_t, p_{t+1}) P(p_{t+1} | p_t)
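This brute-force sum is easy to check in code; the following sketch (illustrative, using the simplified emission model P(O_t | p_t) adopted in the slides) reproduces the 0.06375 computed above:

```python
from itertools import product

states = ["TP", "CP"]
P_init = {"TP": 0.0, "CP": 1.0}
P_trans = {("TP", "TP"): 0.7, ("TP", "CP"): 0.3,
           ("CP", "TP"): 0.5, ("CP", "CP"): 0.5}
P_out = {("TP", "Tea"): 0.8,  ("TP", "Cof"): 0.2,  ("TP", "Cap"): 0.0,
         ("CP", "Tea"): 0.15, ("CP", "Cof"): 0.65, ("CP", "Cap"): 0.2}

def likelihood(obs):
    """P(O): brute-force sum over all state sequences of length len(obs) + 1."""
    total = 0.0
    for seq in product(states, repeat=len(obs) + 1):
        p = P_init[seq[0]]
        for t, o in enumerate(obs):
            p *= P_out[(seq[t], o)] * P_trans[(seq[t], seq[t + 1])]
        total += p
    return total

print(likelihood(("Tea", "Cof")))  # ~0.06375
```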

Modeling Linguistic Tasks as Stochastic Processes. Outline: a powerful mathematical framework; fundamental questions for stochastic models; modeling POS tagging via HMM.

Mathematical Methods for HMM. The complexity of training and decoding can be limited by the use of optimization techniques: parameter estimation via entropy minimization (EM); language modeling via dynamic programming (the Forward algorithm, O(n)); tagging/decoding via dynamic programming (the Viterbi algorithm, O(n)). A relevant issue is the availability of source data: supervised training cannot always be applied.

HMM and POS tagging. Given a sequence of morphemes w_1, ..., w_n with ambiguous syntactic descriptions (i.e. part-of-speech tags), derive the sequence of n POS tags t_1, ..., t_n that maximizes the probability P(w_1, ..., w_n, t_1, ..., t_n), that is:
(t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(w_1, ..., w_n, pos_1, ..., pos_n)
Note that this is equivalent to:
(t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(pos_1, ..., pos_n | w_1, ..., w_n)
as P(w_1, ..., w_n, pos_1, ..., pos_n) / P(w_1, ..., w_n) = P(pos_1, ..., pos_n | w_1, ..., w_n) and P(w_1, ..., w_n) is the same for all sequences (pos_1, ..., pos_n).

HMM and POS tagging. How to map a POS tagging problem into an HMM. The problem
(t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(pos_1, ..., pos_n | w_1, ..., w_n)
can also be written (by Bayes' law) as:
(t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(w_1, ..., w_n | pos_1, ..., pos_n) P(pos_1, ..., pos_n)

HMM and POS tagging. The HMM model of POS tagging: HMM states are mapped to POS tags (t_i), so that P(t_1, ..., t_n) = P(t_1) P(t_2 | t_1) ... P(t_n | t_{n-1}); HMM output symbols are words, so that P(w_1, ..., w_n | t_1, ..., t_n) = Π_{i=1}^{n} P(w_i | t_i); transitions represent moves from one word to the next. Note that the Markov assumption is used to model the probability of a tag in position i (i.e. t_i) only by means of the preceding part-of-speech (i.e. t_{i-1}), and to model the probabilities of words (i.e. w_i) based only on the tag (t_i) appearing in that position (i).

HMM and POS tagging. The final equation is thus:
(t_1, ..., t_n) = argmax_{t_1, ..., t_n} P(t_1, ..., t_n | w_1, ..., w_n) = argmax_{t_1, ..., t_n} Π_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

Fundamental Questions for HMM in POS tagging:
1. Given a sample of output sequences and a space of possible models, how to find the best model, i.e. the model that best explains the data: how to estimate the parameters?
2. Given a model and an observable output sequence O (i.e. words), how to determine the sequence of states (t_1, ..., t_n) that best explains the observation O: the Decoding problem.
3. Given a model, what is the probability of an output sequence O: computing the Likelihood.

Fundamental Questions for HMM in POS tagging (answers):
1. Baum-Welch (or Forward-Backward), a special case of Expectation Maximization estimation; weakly supervised or even unsupervised. Problems: local optima can be reached when the initial data are poor.
2. The Viterbi algorithm, for evaluating the most probable state sequence given O. Linear in the sequence length.
3. Not central for POS tagging, where (w_1, ..., w_n) are always known; solved with a trellis and dynamic programming technique.

HMM and POS tagging. Advantages of adopting HMMs for POS tagging: an elegant and well-founded theory; training algorithms (estimation via EM, Baum-Welch) that are unsupervised or possibly weakly supervised; fast inference algorithms (the Viterbi algorithm), linear with respect to the sequence length (O(n)); sound methods for comparing different models and estimations (e.g. cross-entropy).

Forward algorithm. In computing the likelihood P(O) of an observation we need to sum the probabilities of all paths through the Markov model. Brute-force computation is not applicable in most cases. The forward algorithm is an application of dynamic programming.

HMM and POS tagging: Forward Algorithm
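The algorithm itself appears as a figure on the original slide; as a stand-in, here is a minimal sketch of the forward recursion, with pi, A and B denoting the initial, transition and output probabilities of the preceding sections (illustrative code, not the author's):

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: P(obs) summed over all state paths, linear in len(obs).

    pi[s]      : initial probability of state s
    A[(s, s2)] : transition probability P(s2 | s)
    B[(s, o)]  : output probability P(o | s)
    """
    # Initialisation: alpha_1(s) = pi(s) * P(o_1 | s)
    alpha = {s: pi[s] * B[(s, obs[0])] for s in states}
    # Recursion: alpha_{t+1}(s) = (sum_s' alpha_t(s') * A(s', s)) * P(o_{t+1} | s)
    for o in obs[1:]:
        alpha = {s: sum(alpha[s2] * A[(s2, s)] for s2 in states) * B[(s, o)]
                 for s in states}
    # Termination: P(obs) = sum_s alpha_T(s)
    return sum(alpha.values())
```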

Viterbi algorithm. In decoding we need to find the most likely state sequence given an observation O. Brute-force computation does not apply. The Viterbi algorithm follows the same approach (dynamic programming) as the Forward algorithm; Viterbi scores are attached to each possible state in the sequence.

HMM and POS tagging: the Viterbi Algorithm
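Again, the slide's content is a figure; a compact sketch of the Viterbi recursion, with the same data structures as the forward sketch above but using max and back-pointers instead of sums, might look like this (illustrative code, not the author's):

```python
def viterbi(obs, states, pi, A, B):
    """Return the most likely state (tag) sequence for obs and its probability."""
    # Initialisation: delta_1(s) = pi(s) * P(o_1 | s); no predecessors yet.
    delta = {s: pi[s] * B[(s, obs[0])] for s in states}
    backptrs = []
    # Recursion: delta_{t+1}(s) = max_s' delta_t(s') * A(s', s) * P(o_{t+1} | s)
    for o in obs[1:]:
        bp, new_delta = {}, {}
        for s in states:
            best_prev = max(states, key=lambda s2: delta[s2] * A[(s2, s)])
            bp[s] = best_prev
            new_delta[s] = delta[best_prev] * A[(best_prev, s)] * B[(s, o)]
        backptrs.append(bp)
        delta = new_delta
    # Termination: pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[last]
```

With the tag-transition probabilities P(t_i | t_{i-1}) as A and the lexical probabilities P(w_i | t_i) as B, this computes the argmax of the final equation given earlier.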

HMM and POS tagging: Parameter Estimation. Supervised methods on tagged data sets:
Output probabilities: P(w_i | p_j) = C(w_i, p_j) / C(p_j)
Transition probabilities: P(p_i | p_j) = C(p_i follows p_j) / C(p_j)
Smoothing: P(w_i | p_j) = (C(w_i, p_j) + 1) / (C(p_j) + K_i)
(see Manning & Schütze, Chapter 6)
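A short sketch of this supervised estimation from a tagged corpus (illustrative; since the slide does not define the smoothing constant K_i, the code assumes one common choice, the vocabulary size):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """ML estimates of output and transition probabilities from tagged data.

    tagged_sentences: list of [(word, tag), ...] lists.
    """
    tag_counts, word_tag, tag_bigrams = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for w, t in sent:
            tag_counts[t] += 1
            word_tag[(w, t)] += 1
        tag_bigrams.update(zip(tags, tags[1:]))

    vocab = {w for s in tagged_sentences for w, _ in s}
    K = len(vocab)  # assumed smoothing constant (vocabulary size)

    def p_out(w, t):      # P(w | t), add-one smoothed
        return (word_tag[(w, t)] + 1) / (tag_counts[t] + K)

    def p_trans(t2, t1):  # P(t2 | t1) = C(t2 follows t1) / C(t1)
        return tag_bigrams[(t1, t2)] / tag_counts[t1]

    return p_out, p_trans

# Toy usage with a two-sentence "corpus":
corpus = [[("the", "DT"), ("dog", "N"), ("barks", "V")],
          [("the", "DT"), ("black", "ADJ"), ("dog", "N")]]
p_out, p_trans = estimate_hmm(corpus)
print(p_trans("N", "DT"), p_out("dog", "N"))  # 0.5 0.5
```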

HMM and POS tagging: Parameter Estimation. Unsupervised (few tagged data available):
With a dictionary D: the P(w_i | p_j) are initially estimated from D, while the P(p_i | p_j) are randomly assigned.
With equivalence classes u_L (Kupiec, 1992): words are grouped by the set L of tags they admit, and all members of a class share the class-based estimate, e.g. P(w_i | p_L) = (1/|u_L|) C(u_L) / Σ_{L'} C(u_{L'}). For example, if L = {noun, verb} then u_L = {cross, drive, ...}.

Other Approaches to POS tagging.
Church (1988): Π_i P(w_i | t_i) P(t_{i-2} | t_{i-1}, t_i) (backward); estimation from a tagged corpus (Brown); no HMM training; performance > 95%.
DeRose (1988): Π_{i=1}^{n} P(w_i | t_i) P(t_{i-1} | t_i) (forward); estimation from a tagged corpus (Brown); no HMM training; performance 95%.
Merialdo et al. (1992): ML estimation vs. Viterbi training; they propose an incremental approach: a small amount of tagging and then Viterbi training, Π_{i=1}^{n} P(w_i | t_i) P(t_{i+1} | t_i, w_i) ???

POS tagging: References
Jelinek, F., Statistical Methods for Speech Recognition, Cambridge, Mass.: MIT Press, 1997.
Manning, C. and Schütze, H., Foundations of Statistical Natural Language Processing, MIT Press, Chapter 6.
Church, K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. http://acl.ldc.upenn.edu/a/a88/a88-1019.pdf
Rabiner, L. R. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2), 260-269.
Parameter Estimation (slides): http://jan.stanford.edu/fsnlp/statest/henke-ch6.ppt

Statistical Parsing. Outline:
- Parsing and statistical modeling
- Parsing models
- Probabilistic Context-Free Grammars
- Dependency-based assumptions
- Statistical disambiguation

Parsing and Statistical Modeling. Main issues in the statistical modeling of grammatical recognition: modeling the structure vs. the language; how context can be taken into account; modeling the structure vs. the derivation; which parsing, given a model.

Probabilistic Context-Free Grammars, PCFG. Basic steps: start from a CF grammar; attach probabilities to the rules; tune them (i.e. adjust them) against training data (e.g. a treebank).

Probabilistic Context-Free Grammars, PCFG. Syntactic sugar: N_{rl} => Γ denotes that the nonterminal N_{rl} covers the span w_r, ..., w_l of the sentence w_1, w_2, ..., w_{r-1}, w_r, ..., w_l, w_{l+1}, ..., w_n, and P(N_{rl} -> Γ) is the probability of that rewriting. P(s) = Σ_t P(t), where the t are the parse trees for s.

Probabilistic Context-Free Grammars, PCFG. Main assumptions:
- Total probability of each continuation: for every i, Σ_j P(N^i -> Γ^j) = 1
- Context independence:
  - Place invariance: for all k, P(N^j_{k(k+const)} -> Γ) is the same
  - Context-freeness: for all r, l, P(N^j_{rl} -> Γ) is independent of w_1, ..., w_{r-1} and w_{l+1}, ..., w_n
  - Derivation (or "ancestor") independence: P(N^j_{rl} -> Γ) is independent of any ancestor node outside N^j_{rl}

PCFG. Main questions: How to develop the best model from observable data (training)? Which is the best parse for a sentence? Which training data?

Probabilistic Context-Free Grammars, PCFG. Probabilities of trees:
P(t) = P(S_{13} -> NP_{12} VP_{33}, NP_{12} -> the man, VP_{33} -> snores) =
= P(α) P(δ | α) P(φ | α, δ) =
= P(α) P(δ) P(φ) =
= P(S -> NP VP) P(NP -> the man) P(VP -> snores)
[Figure: the tree for "the man snores", with α labelling the expansion of S_{13}, δ the expansion of NP_{12} and φ the expansion of VP_{33}.]
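The same computation in a few lines of Python (a sketch; the numeric rule probabilities are invented for illustration, since the slide gives none):

```python
# Hypothetical rule probabilities, for illustration only.
P_rule = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("the", "man")): 0.01,
    ("VP", ("snores",)): 0.005,
}

def tree_probability(tree):
    """P(t) = product of the probabilities of the rules used in the tree.

    A tree is (label, children), where each child is a sub-tree or a word.
    """
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = P_rule[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_probability(c)
    return p

t = ("S", [("NP", ["the", "man"]), ("VP", ["snores"])])
print(tree_probability(t))  # 1.0 * 0.01 * 0.005 = 5e-05
```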

PCFG Training. Supervised case: treebanks are used; simple ML estimation by counting subtrees. Basic problems are low counts for some structures and local maxima. When the target grammatical model is different (e.g. dependency-based), rewriting is required before counting.

PCFG Training. Unsupervised: given a grammar G and a set of sentences (parsed but not validated), the best model is the one maximizing the probability of the useful structures in the observable data set. Parameters are adjusted via EM algorithms. Ranking of data according to small training data (Pereira and Schabes, 1992).

Limitations of PCFGs. Poor modeling of structural dependencies, due to the context-freeness enforced through the independence assumptions. Example: NPs in subject vs. object position: NP -> DT NN and NP -> PRP always have the same probability, which does not distinguish between the S -> NP VP (subject) and VP -> V NP (object) contexts. Lack of lexical constraints, as lexical preferences are not taken into account in the grammar.

Lexical Constraints and PP-attachment. Compare with: "fishermen caught tons of herring".

Statistical Lexicalized Parsing. A form of lexicalized PCFG is introduced in (Collins, 99). Each grammar rule is lexicalized by marking each constituent in the RHS with its grammatical head: VP(dumped) -> V(dumped) NP(sacks) PP(into). Probabilities are not estimated directly by counting occurrences in treebanks; further independence assumptions are made to estimate complex RHS statistics effectively: each rule is computed by scanning the RHS left-to-right, treating this process as a generative (Markovian) model.

Collins' PCFGs. [Figure: a lexicalized tree in the Collins (1999) model.]

Statistical Lexicalized Parsing. In (Collins, 96) a dependency-based statistical parser is proposed. The system also uses chunks (Abney, 1991) for complex NPs. Probabilities are assigned to head-modifier dependencies (R, <hw_j, ht_j>, <w_k, t_k>):
P(...) = P_H(H(hw, ht) | R, hw, ht) Π_i P_L(L_i(lw_i, lt_i) | R, H, hw, ht) Π_i P_R(R_i(rw_i, rt_i) | R, H, hw, ht)
Dependency relationships are considered independent. Supervised training from treebanks is much more effective.

Evaluation of Parsing Systems

Results. Performance of different probabilistic parsers (sentences with at most 40 words):

                   %LR    %LP    CB     %0 CB
  Charniak (1996)  80.4   78.8   -      -
  Collins (1996)   85.8   86.3   1.14   59.9
  Charniak (1997)  87.5   87.4   1.00   62.1
  Collins (1997)   88.1   88.6   0.91   66.5

(LR = labeled recall, LP = labeled precision, CB = average number of crossing brackets, 0 CB = percentage of sentences with zero crossing brackets.)
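For reference, a minimal sketch (not from the slides) of how labeled precision and recall are computed from gold and predicted constituent sets, each constituent represented as a (label, start, end) triple:

```python
def labeled_pr(gold, predicted):
    """Labeled recall and precision over (label, start, end) constituents."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    recall = correct / len(gold)
    precision = correct / len(predicted)
    return recall, precision

gold = {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)}
pred = {("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)}
print(labeled_pr(gold, pred))  # recall and precision, both 2/3 here
```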

Further Issues in Statistical Grammatical Recognition. Syntactic disambiguation: PP-attachment; subcategorization frames (induction and use). Semantic disambiguation. Word clustering for better estimation (e.g. against data sparseness).

Disambiguation of Prepositional Phrase Attachment. A purely probabilistic model (Hindle and Rooth, 1991, 1993): V NP PP sequences (i.e. verb vs. noun attachment); the bigram probabilities P(prep | noun) and P(prep | verb) are compared (via a χ² test); smoothing of low counts is applied against sparse-data problems.

Disambiguation of Prepositional Phrase Attachment. A hybrid model (Basili et al., 1993, 1995): multiple attachment sites in a dependency framework; semantic classes for words are available; semantic trigram probabilities, i.e. P(prep, Γ | noun) and P(prep, Γ | verb), are compared; smoothing is not required, as the Γ are classes and low counts are less problematic.

Statistical Parsing: References Abney, S. P. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23(4), 597-618. Basili, R., M.T. Pazienza, P. Velardi, Semiautomatic extraction of linguistic information for syntactic disambiguation, Applied Artificial Intelligence, vol 7, pp. 339-364, 1993. Bikel, D., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, pp. 194-201. Black, E., Abney, S. P., Flickinger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M. P., Roukos, S., Santorini, B., and Strzalkowski, T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 306-311. Morgan Kaufmann. Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL 00), Seattle, Washington, pp. 132-139. Chelba, C. and Jelinek, F. (2000). Structured language modeling. Computer Speech and Language, 14, 283-332. Collins, M. J. (1999). Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia. Hindle, D. and Rooth, M. (1991). Structural ambiguity and lexical relations. In Proceedings of the 29th ACL, Berkeley, CA, pp. 229-236. ACL.