Statistical Methods for NLP
Stochastic Grammars

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se

Structured Classification

In many NLP tasks, the output (and input) is structured:
  Part-of-speech tagging
    Input: sequence X_1, ..., X_n of words
    Output: sequence Y_1, ..., Y_n of tags
  Syntactic parsing
    Input: sequence X_1, ..., X_n of words
    Output: parse tree Y consisting of nodes, edges and labels
Models for structured classification:
  Sequence models (lecture 5)
  Stochastic grammars (today)

Syntactic Parsing

Given a word sequence w_1, ..., w_n, determine the corresponding syntactic analysis Y.
Probabilistic view of the problem:

  f(w_1, ..., w_n) = argmax_Y P(Y | w_1, ..., w_n)
                   = argmax_Y P(Y, w_1, ..., w_n) / P(w_1, ..., w_n)
                   = argmax_Y P(Y, w_1, ..., w_n)
                   = argmax_Y P(Y) · P(w_1, ..., w_n | Y)

We will assume that Y is a context-free parse tree, but the same reasoning applies to any choice of syntactic analysis.

Context-Free Parse Trees

Since a context-free parse tree T for the string w_1, ..., w_n includes the string itself, it follows that:

  P(w_1, ..., w_n | T) = 1  if yield(T) = w_1, ..., w_n
                         0  otherwise

Hence, if we restrict attention to trees with the right yield, we can simply search for the most probable tree T:

  argmax_T P(T)

Probabilistic Context-Free Grammar

A PCFG G is a 5-tuple G = (Σ, N, S, R, D):
  Σ is a finite (terminal) alphabet.
  N is a finite (non-terminal) alphabet.
  S ∈ N is the start symbol.
  R is a finite set of rules A → α (A ∈ N, α ∈ (Σ ∪ N)*).
  D is a function from R to the real numbers in [0, 1] such that:

    for all A ∈ N:  Σ_{α : A → α ∈ R} D(A → α) = 1
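
To make the definition concrete, here is a small sketch (my own illustration; the names D and check_normalization are hypothetical) of a PCFG stored as a mapping from rules to probabilities, together with the normalization check on D, using a fragment of the example grammar on the next slide:

# Minimal sketch of a PCFG rule table and the per-nonterminal normalization check.
from collections import defaultdict

# Fragment of the example grammar (rule -> probability).
D = {
    ("S",  ("NP", "VP", "PU")): 1.00,
    ("VP", ("VBD", "NP")):      0.67,
    ("VP", ("VP", "PP")):       0.33,
    ("NP", ("JJ", "NN")):       0.57,
    ("NP", ("JJ", "NNS")):      0.29,
    ("NP", ("NP", "PP")):       0.14,
}

def check_normalization(rules, tol=1e-9):
    """Verify that D(A -> .) sums to 1 for every nonterminal A."""
    totals = defaultdict(float)
    for (lhs, _), p in rules.items():
        totals[lhs] += p
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_normalization(D))   # e.g. {'S': True, 'VP': True, 'NP': True}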

Example: Grammar

  S   → NP VP PU  1.00      JJ  → Economic   0.33
  VP  → VP PP     0.33      JJ  → little     0.33
  VP  → VBD NP    0.67      JJ  → financial  0.33
  NP  → NP PP     0.14      NN  → news       0.50
  NP  → JJ NN     0.57      NN  → effect     0.50
  NP  → JJ NNS    0.29      NNS → markets    1.00
  PP  → IN NP     1.00      VBD → had        1.00
  PU  → .         1.00      IN  → on         1.00

Two parse trees for "Economic news had little effect on financial markets.":

  (S (NP (JJ Economic) (NN news))
     (VP (VBD had)
         (NP (NP (JJ little) (NN effect))
             (PP (IN on) (NP (JJ financial) (NNS markets)))))
     (PU .))

  (S (NP (JJ Economic) (NN news))
     (VP (VP (VBD had) (NP (JJ little) (NN effect)))
         (PP (IN on) (NP (JJ financial) (NNS markets))))
     (PU .))

Independence Assumptions

The probability of a rule A → α represents the probability of using the rule to expand a node labeled A:

  D(A → α) = P(α | A)

The probability of using the rule in a derivation S ⇒* w_1, ..., w_n is independent of anything before or after:

  P(A → α | S ⇒* βAγ, βαγ ⇒* w_1, ..., w_n) = D(A → α)

Probabilities for Parse Trees and Strings

The probability of a parse tree is the product of the probabilities of all its independent subtrees:

  P(T) = Π_{t(A, α) ∈ T} D(A → α)

where t(A, α) ∈ T signifies that T contains a local tree with root labeled A and children labeled α.

The probability of a string is the sum of the probabilities of all its parse trees:

  P(w_1, ..., w_n) = Σ_{T : yield(T) = w_1, ..., w_n} P(T)
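
A short sketch (assumed tree encoding and helper names, not from the slides) of how P(T) is computed as a product over local trees:

# P(T) as the product of D(A -> alpha) over all local trees in T.
# A tree is a (label, children) pair; a leaf is just a string.

def local_trees(tree):
    """Yield (A, alpha) for every internal node of the tree."""
    label, children = tree
    alpha = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, alpha)
    for c in children:
        if not isinstance(c, str):
            yield from local_trees(c)

def tree_probability(tree, D):
    """P(T) = product of D(A -> alpha) over local trees t(A, alpha) in T."""
    p = 1.0
    for rule in local_trees(tree):
        p *= D[rule]
    return p

# Tiny example with a two-level fragment (probabilities from the example grammar):
D = {("NP", ("JJ", "NN")): 0.57, ("JJ", ("little",)): 0.33, ("NN", ("effect",)): 0.50}
t = ("NP", [("JJ", ["little"]), ("NN", ["effect"])])
print(tree_probability(t, D))   # ≈ 0.094 (= 0.57 · 0.33 · 0.50)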

Example

(Grammar repeated from the Example: Grammar slide.)

  P(Economic news had little effect on financial markets .) = 0.0002665

The two parse trees and their probabilities:

  (S (NP (JJ Economic) (NN news))
     (VP (VBD had)
         (NP (NP (JJ little) (NN effect))
             (PP (IN on) (NP (JJ financial) (NNS markets)))))
     (PU .))
  P(T) = 0.0000794

  (S (NP (JJ Economic) (NN news))
     (VP (VP (VBD had) (NP (JJ little) (NN effect)))
         (PP (IN on) (NP (JJ financial) (NNS markets))))
     (PU .))
  P(T) = 0.0001871
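
The figures on this slide can be checked by multiplying out the rule probabilities used in each tree; a quick sketch:

from math import prod

# Lexical rules, shared by both trees (Economic, news, had, little, effect, on, financial, markets, .):
lexical = 0.33 * 0.50 * 1.00 * 0.33 * 0.50 * 1.00 * 0.33 * 1.00 * 1.00

# Structural rules of the first tree (PP attached to the object NP, via NP -> NP PP, 0.14):
p1 = prod([1.00, 0.57, 0.67, 0.14, 0.57, 1.00, 0.29]) * lexical
# Structural rules of the second tree (PP attached to the VP, via VP -> VP PP, 0.33):
p2 = prod([1.00, 0.57, 0.33, 0.67, 0.57, 1.00, 0.29]) * lexical

print(f"{p1:.3g} {p2:.3g} {p1 + p2:.3g}")
# 7.94e-05 0.000187 0.000267 (the slide's 0.0000794, 0.0001871 and 0.0002665, up to rounding)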

Inference (Parsing)

Inference problem:
  Finding the most probable tree T for a string w_1, ..., w_n
Difficulty:
  Maximizing over all possible trees
  The number of trees grows exponentially
Key observation:
  A solution of size n contains solutions of smaller sizes
  For a binarized grammar:

    max_T P(T, w_{1,n}) = max_i [ max_{T'} P(T', w_{1,i}) · max_{T''} P(T'', w_{i+1,n}) · D(r(T) → r(T') r(T'')) ]

  where r(T) denotes the label of the root of T.
Sounds familiar?
  Dynamic programming algorithms are applicable
  Probabilistic versions of algorithms like CKY and Earley

Probabilistic CKY

  PARSE(G, w_1, ..., w_n)
    for j from 1 to n do
      for all A : A → a ∈ R_G and a = w_j do
        C[j-1, j, A] := D_G(A → a)
      for i from j-2 downto 0 do
        for k from i+1 to j-1 do
          for all A : A → B C ∈ R_G and C[i, k, B] > 0 and C[k, j, C] > 0 do
            if C[i, j, A] < D_G(A → B C) · C[i, k, B] · C[k, j, C] then
              C[i, j, A] := D_G(A → B C) · C[i, k, B] · C[k, j, C]
              B[i, j, A] := {k, B, C}
    return BUILD-TREE(B[0, n, S]), C[0, n, S]
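
For readers who want to run it, the following is a plain-Python rendering of the pseudocode above (my own sketch, not code from the lecture; it assumes a grammar in Chomsky normal form given as a rule table keyed by (lhs, rhs), and all function and variable names are mine):

def pcky_parse(words, rules, start="S"):
    """Return (best_tree, probability) for `words`, or (None, 0.0) if there is no parse.

    `rules` maps (lhs, rhs_tuple) -> probability, with binary rules such as
    ("NP", ("JJ", "NN")) and lexical rules such as ("JJ", ("little",)).
    """
    n = len(words)
    # C[i][j] maps nonterminal A -> best probability of A spanning words[i:j];
    # B[i][j] maps A -> backpointer (k, B_label, C_label), or the word itself at the leaves.
    C = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    B = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    binary = [(a, b, c, p) for (a, rhs), p in rules.items() if len(rhs) == 2 for b, c in [rhs]]
    lexical = [(a, rhs[0], p) for (a, rhs), p in rules.items() if len(rhs) == 1]

    for j in range(1, n + 1):
        for a, w, p in lexical:                      # C[j-1, j, A] := D(A -> w_j)
            if w == words[j - 1]:
                C[j - 1][j][a] = p
                B[j - 1][j][a] = w
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for a, b, c, p in binary:
                    if b in C[i][k] and c in C[k][j]:
                        q = p * C[i][k][b] * C[k][j][c]
                        if q > C[i][j].get(a, 0.0):  # keep the best split and rule
                            C[i][j][a] = q
                            B[i][j][a] = (k, b, c)

    if start not in C[0][n]:
        return None, 0.0

    def build(i, j, a):
        bp = B[i][j][a]
        if isinstance(bp, str):
            return (a, [bp])
        k, b, c = bp
        return (a, [build(i, k, b), build(k, j, c)])

    return build(0, n, start), C[0][n][start]

# Tiny usage with an illustrative binary grammar (probabilities are made up):
rules = {
    ("S",  ("NP", "VP")):  1.0,
    ("VP", ("VBD", "NP")): 1.0,
    ("NP", ("JJ", "NN")):  0.5,
    ("NP", ("news",)):     0.5,
    ("JJ", ("little",)):   1.0,
    ("NN", ("effect",)):   1.0,
    ("VBD", ("had",)):     1.0,
}
tree, p = pcky_parse("news had little effect".split(), rules)
print(p)      # 0.25
print(tree)   # ('S', [('NP', ['news']), ('VP', [('VBD', ['had']), ('NP', [('JJ', ['little']), ('NN', ['effect'])])])])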

Learning

Two parts:
  Learn CFG G = (Σ, N, S, R)
  Learn rule probability distribution D
Supervised learning of G and D:
  Treebank grammars (more in a minute)
Unsupervised learning of D given G:
  Expectation-Maximization (EM)
    Guess initial distribution D_0
    Iteratively improve distribution D_i until convergence:
      E-step: Compute expected rule frequencies E_i given D_i
      M-step: Compute D_{i+1} to maximize probability given E_i
  The Inside-Outside algorithm for computing expectations is similar to the Forward-Backward algorithm for HMMs
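
For reference (standard material, not spelled out on the slide), the inside probabilities computed by the Inside-Outside algorithm for a grammar in Chomsky normal form satisfy the recursion below; they play the same role as the backward probabilities in Forward-Backward:

  β_A(i, i) = D(A → w_i)
  β_A(i, j) = Σ_{A → B C ∈ R} Σ_{k=i}^{j-1} D(A → B C) · β_B(i, k) · β_C(k+1, j)

where β_A(i, j) = P(A ⇒* w_i ... w_j), so that P(w_1, ..., w_n) = β_S(1, n).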

Treebank Grammar

Training set: Treebank T = {T_1, ..., T_m}
Extract grammar G = (Σ, N, S, R):
  Σ = the set of all terminals occurring in some T_i ∈ T
  N = the set of all nonterminals occurring in some T_i ∈ T
  S = the nonterminal at the root of every T_i ∈ T
  R = the set of all rules needed to derive some T_i ∈ T
Estimate D using relative frequencies (MLE):

  D(A → α) = Σ_{i=1}^{m} C(A → α, T_i) / Σ_{i=1}^{m} Σ_{β : A → β ∈ R} C(A → β, T_i)
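
A minimal sketch (assumed helper names, same tree encoding as earlier) of treebank grammar estimation by relative frequencies:

# MLE estimation of D from a treebank: count local trees, then normalize per left-hand side.
from collections import Counter, defaultdict

def estimate_pcfg(treebank):
    """D(A -> alpha) = C(A -> alpha) / sum_beta C(A -> beta), counted over all trees."""
    counts = Counter()
    for tree in treebank:
        stack = [tree]
        while stack:
            label, children = stack.pop()
            alpha = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[(label, alpha)] += 1
            stack.extend(c for c in children if not isinstance(c, str))
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

# Tiny usage with a two-tree "treebank":
treebank = [("NP", [("JJ", ["little"]), ("NN", ["effect"])]),
            ("NP", [("JJ", ["financial"]), ("NNS", ["markets"])])]
print(estimate_pcfg(treebank)[("NP", ("JJ", "NN"))])   # 0.5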

Example: Treebank Grammar

Extracting a grammar from a treebank consisting of the two parse trees from the earlier example ("Economic news had little effect on financial markets.") and estimating D by relative frequencies yields exactly the example grammar:

  S   → NP VP PU  1.00      JJ  → Economic   0.33
  VP  → VP PP     0.33      JJ  → little     0.33
  VP  → VBD NP    0.67      JJ  → financial  0.33
  NP  → NP PP     0.14      NN  → news       0.50
  NP  → JJ NN     0.57      NN  → effect     0.50
  NP  → JJ NNS    0.29      NNS → markets    1.00
  PP  → IN NP     1.00      VBD → had        1.00
  PU  → .         1.00      IN  → on         1.00

Pros and Cons of Treebank Grammars

Pros:
  Guaranteed to produce a consistent probability model
  Learning is simple, efficient and well understood (MLE)
  Inference is simple and (relatively) efficient
Cons:
  Not guaranteed to be robust: parsing new sentences may require rules not seen in the treebank
  Not optimal for disambiguation: treebank annotation may not fit the independence assumptions enforced by the PCFG model

Example: NP Expansions in Penn Treebank

  Tree context     NP → NP PP   NP → DT NN   NP → PRP
  Anywhere             11%          9%          6%
  NP under S            9%          9%         21%
  NP under VP          23%          7%          4%

Pronouns (PRP) are more frequent under S (subject).
Prepositional modifiers are more frequent under VP (object).

PCFG Transformations

Early research on statistical parsing abandoned PCFGs in favor of richer history-based models.
More recent research has shown that the same effect can be achieved by transforming PCFGs (or treebanks).
Three common techniques:
  Markovization
  Parent annotation
  Lexicalization

Markovization

Idea:
  Replace an n-ary rule by a set of unary and binary rules
  Encode a Markov process in new nonterminals
Example: VP → VB NP PP becomes:
  VP → VP:NP_PP
  VP:NP_PP → VP:VB_NP PP
  VP:VB_NP → VP:VB NP
  VP:VB → VB
Benefits:
  Reduces the number of unique rules
  Improves robustness
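
A small sketch (hypothetical helper, not from the lecture) of a second-order horizontal markovization that reproduces the rules in the example above:

# Binarize A -> X1 ... Xk left-to-right; new nonterminals are named after the parent
# plus the last (up to) two children already covered, as in VP:VB_NP and VP:NP_PP.

def markovize(lhs, rhs, order=2):
    """Return the list of unary/binary rules replacing lhs -> rhs."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    covered = [rhs[0]]
    prev = f"{lhs}:{rhs[0]}"
    rules = [(prev, (rhs[0],))]                      # e.g. VP:VB -> VB
    for child in rhs[1:]:
        covered.append(child)
        new = f"{lhs}:{'_'.join(covered[-order:])}"  # e.g. VP:VB_NP, VP:NP_PP
        rules.append((new, (prev, child)))           # e.g. VP:VB_NP -> VP:VB NP
        prev = new
    rules.append((lhs, (prev,)))                     # e.g. VP -> VP:NP_PP
    return rules

for r in markovize("VP", ["VB", "NP", "PP"]):
    print(r)
# ('VP:VB', ('VB',))
# ('VP:VB_NP', ('VP:VB', 'NP'))
# ('VP:NP_PP', ('VP:VB_NP', 'PP'))
# ('VP', ('VP:NP_PP',))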

Parent Annotation

Idea: Replace nonterminal A with A^B when A is a child of B
Example (the first parse tree from the earlier example, parent-annotated):

  (S^ROOT (NP^S (JJ Economic) (NN news))
          (VP^S (VBD had)
                (NP^VP (NP^NP (JJ little) (NN effect))
                       (PP^NP (IN on) (NP^PP (JJ financial) (NNS markets)))))
          (PU .))

Benefit: Differentiates structural contexts (e.g. NP^S vs. NP^VP).
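
A small sketch (hypothetical helper) of parent annotation on trees in the same encoding as earlier; preterminals are left unannotated, as in the example:

# Relabel every phrasal nonterminal A as A^B, where B is the label of its parent.

def parent_annotate(tree, parent="ROOT"):
    """Trees are (label, children) pairs; leaves are strings."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children)                     # preterminal, e.g. (JJ Economic)
    new_label = f"{label}^{parent}"
    return (new_label, [parent_annotate(c, label) for c in children])

t = ("S", [("NP", [("JJ", ["Economic"]), ("NN", ["news"])]),
           ("VP", [("VBD", ["had"]), ("NP", [("JJ", ["little"]), ("NN", ["effect"])])])])
print(parent_annotate(t))
# ('S^ROOT', [('NP^S', [('JJ', ['Economic']), ('NN', ['news'])]),
#             ('VP^S', [('VBD', ['had']), ('NP^VP', [('JJ', ['little']), ('NN', ['effect'])])])])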

Horizontal and Vertical Markovization

Markovization is often called horizontal markovization:
  Conditioning history for siblings
  Standard PCFG: infinite order (any number of siblings)
  Example above: second order (at most two siblings)
Parent annotation can be seen as vertical markovization:
  Conditioning history for descendants
  Standard PCFG: first order (only the parent)
  Example above: second order (the grandparent as well)
Many different combinations are possible.

Lexicalization

Idea: Index nonterminals by lexical heads (terminals)
Example:
  VP → VBD NP   becomes   VP(had) → VBD(had) NP(effect)
Consequences:
  Increases sensitivity to lexical properties (good)
  Increases the size of the grammar drastically (bad)
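
A small sketch (my own illustration, with a hypothetical head-child table; real parsers use more elaborate head-finding rules) of lexicalization:

# Annotate each nonterminal with the head word of its subtree.

# Hypothetical head rules: which child position supplies the head, per parent label.
HEAD_CHILD = {"S": 1, "VP": 0, "NP": 1, "PP": 0}

def lexicalize(tree):
    """Trees are (label, children); leaves are strings. Returns (annotated_tree, head_word)."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]
        return (f"{label}({word})", children), word
    annotated, heads = zip(*(lexicalize(c) for c in children))
    head = heads[HEAD_CHILD.get(label, 0)]
    return (f"{label}({head})", list(annotated)), head

t = ("VP", [("VBD", ["had"]), ("NP", [("JJ", ["little"]), ("NN", ["effect"])])])
print(lexicalize(t)[0])
# ('VP(had)', [('VBD(had)', ['had']), ('NP(effect)', [('JJ(little)', ['little']), ('NN(effect)', ['effect'])])])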

The State of the Art

PCFG models:
  Markovization
    Limited horizontal markovization
    Extended vertical markovization
  Fine-grained nonterminals:
    Completely or partially lexicalized models
    Latent variable models that learn splits using EM
Alternative models:
  Discriminative (log-linear) models for (re)ranking
  Dependency parsing (graph-based, transition-based)