CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 25: Probabilistic Parsing)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
14th March, 2011

Bracketed Structure: Treebank Corpus
[S1 [S [S [VP [VB Come] [NP [NNP July]]]] [, ,] [CC and] [S [NP [DT the] [JJ IIT] [NN campus]] [VP [AUX is] [ADJP [JJ abuzz] [PP [IN with] [NP [ADJP [JJ new] [CC and] [VBG returning]] [NNS students]]]]]] [. .]]]

Noisy Channel Modeling
Source sentence → Noisy Channel → Target parse
T* = argmax_T P(T|S)
   = argmax_T P(T) . P(S|T)
   = argmax_T P(T), since given the parse the sentence is completely determined and P(S|T) = 1

Corpus
A collection of text, called a corpus, is used for collecting various language data.
With annotation: more information, but manual-labor intensive.
Practice: label automatically, then correct manually.
The famous Brown Corpus contains 1 million tagged words.
Switchboard: a very famous corpus of 2,400 conversations with 543 speakers and many US dialects, annotated with orthography and phonetics.

Discriminative vs. Generative Model
W* = argmax_W P(W|SS)
Discriminative model: compute P(W|SS) directly.
Generative model: compute it from P(W) . P(SS|W).

Language Models
N-grams: sequences of n consecutive words/characters.
Probabilistic/Stochastic Context-Free Grammars: simple probabilistic models capable of handling recursion; a CFG with probabilities attached to rules.
Rule probabilities: how likely is it that a particular rewrite rule is used?

PCFGs
Why PCFGs?
- Intuitive probabilistic models for tree-structured languages
- Algorithms are extensions of HMM algorithms
- Better than the n-gram model for language modeling

Formal Definition of PCFG
A PCFG consists of:
- A set of terminals {w_k}, k = 1, ..., V, e.g. {w_k} = {child, teddy, bear, played}
- A set of non-terminals {N^i}, i = 1, ..., n, e.g. {N^i} = {NP, VP, DT}
- A designated start symbol N^1
- A set of rules {N^i → ζ^j}, where ζ^j is a sequence of terminals and non-terminals, e.g. NP → DT NN
- A corresponding set of rule probabilities

Rule Probabilities
Rule probabilities are such that, for every i, Σ_j P(N^i → ζ^j) = 1.
E.g., P(NP → DT NN) = 0.2, P(NP → NN) = 0.5, P(NP → NP PP) = 0.3.
P(NP → DT NN) = 0.2 means that 20% of the parses in the training data use the rule NP → DT NN.

Probabilistic Context-Free Grammar (example)
S → NP VP      1.0
NP → DT NN     0.5
NP → NNS       0.3
NP → NP PP     0.2
PP → P NP      1.0
VP → VP PP     0.6
VP → VBD NP    0.4
DT → the       1.0
NN → gunman    0.5
NN → building  0.5
VBD → sprayed  1.0
NNS → bullets  1.0
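To make the examples that follow easy to check, this toy grammar can be written down as a small lookup table and verified against the constraint of the previous slide (each non-terminal's rule probabilities summing to 1). This is a minimal sketch, not part of the slides; the dict layout is an illustrative choice, and the lexical rule P → with 1.0 is an assumption implied by the example parses rather than listed above.

```python
# Minimal sketch: the toy PCFG as a dict mapping (LHS, RHS) -> probability.
# P -> with 1.0 is assumed from the example parses; it is not on the slide.
from collections import defaultdict

PCFG_RULES = {
    ("S",   ("NP", "VP")):  1.0,
    ("NP",  ("DT", "NN")):  0.5,
    ("NP",  ("NNS",)):      0.3,
    ("NP",  ("NP", "PP")):  0.2,
    ("PP",  ("P", "NP")):   1.0,
    ("VP",  ("VP", "PP")):  0.6,
    ("VP",  ("VBD", "NP")): 0.4,
    ("DT",  ("the",)):      1.0,
    ("NN",  ("gunman",)):   0.5,
    ("NN",  ("building",)): 0.5,
    ("VBD", ("sprayed",)):  1.0,
    ("P",   ("with",)):     1.0,
    ("NNS", ("bullets",)):  1.0,
}

# For every non-terminal i, sum_j P(N^i -> zeta^j) must equal 1.
totals = defaultdict(float)
for (lhs, _rhs), prob in PCFG_RULES.items():
    totals[lhs] += prob
assert all(abs(total - 1.0) < 1e-9 for total in totals.values())
```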

Example Parse t1
The gunman sprayed the building with bullets.
[S 1.0 [NP 0.5 [DT 1.0 The] [NN 0.5 gunman]] [VP 0.6 [VP 0.4 [VBD 1.0 sprayed] [NP 0.5 [DT 1.0 the] [NN 0.5 building]]] [PP 1.0 [P 1.0 with] [NP 0.3 [NNS 1.0 bullets]]]]]
P(t1) = 1.0 × 0.5 × 1.0 × 0.5 × 0.6 × 0.4 × 1.0 × 0.5 × 1.0 × 0.5 × 1.0 × 1.0 × 0.3 × 1.0 = 0.0045

Another Parse t2
The gunman sprayed the building with bullets.
[S 1.0 [NP 0.5 [DT 1.0 The] [NN 0.5 gunman]] [VP 0.4 [VBD 1.0 sprayed] [NP 0.2 [NP 0.5 [DT 1.0 the] [NN 0.5 building]] [PP 1.0 [P 1.0 with] [NP 0.3 [NNS 1.0 bullets]]]]]]
P(t2) = 1.0 × 0.5 × 1.0 × 0.5 × 0.4 × 1.0 × 0.2 × 0.5 × 1.0 × 0.5 × 1.0 × 1.0 × 0.3 × 1.0 = 0.0015
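The two parse probabilities can be reproduced by multiplying the probabilities of the rules used at each node of the tree. The sketch below is illustrative only: the nested-tuple tree encoding and the tree_prob helper are assumptions, and it reuses the PCFG_RULES dict from the earlier sketch.

```python
def tree_prob(tree):
    """Probability of a parse tree: product of the rule probabilities used
    at every internal node.  Trees are nested tuples; leaves are strings."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return PCFG_RULES[(label, (children[0],))]          # lexical rule
    prob = PCFG_RULES[(label, tuple(c[0] for c in children))]
    for child in children:
        prob *= tree_prob(child)
    return prob

# t1: VP -> VP PP, with the PP attached to the verb phrase.
t1 = ("S",
      ("NP", ("DT", "the"), ("NN", "gunman")),
      ("VP", ("VP", ("VBD", "sprayed"),
                    ("NP", ("DT", "the"), ("NN", "building"))),
             ("PP", ("P", "with"), ("NP", ("NNS", "bullets")))))

# t2: VP -> VBD NP, with the PP attached inside the object NP.
t2 = ("S",
      ("NP", ("DT", "the"), ("NN", "gunman")),
      ("VP", ("VBD", "sprayed"),
             ("NP", ("NP", ("DT", "the"), ("NN", "building")),
                    ("PP", ("P", "with"), ("NP", ("NNS", "bullets"))))))

print(round(tree_prob(t1), 6))   # 0.0045
print(round(tree_prob(t2), 6))   # 0.0015
```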

Probability of a sentence
Notation: w_ab is the subsequence w_a ... w_b; N^j dominates w_a ... w_b, i.e. yield(N^j) = w_a ... w_b (e.g. an NP dominating "the sweet teddy bear").
Probability of a sentence:
P(w_1m) = Σ_t P(w_1m, t)
        = Σ_t P(t) P(w_1m | t), where t is a parse tree of the sentence
        = Σ_{t : yield(t) = w_1m} P(t)
since P(w_1m | t) = 1 if t is a parse tree for the sentence w_1m.

Assumptions of the PCFG model
(Consider a tree in which S expands to an NP at location 1 deriving "The child" and a VP containing an NP at location 2 deriving "The toy".)
Place invariance: P(NP → DT NN) is the same in locations 1 and 2.
Context-free: P(NP → DT NN | anything outside "The child") = P(NP → DT NN).
Ancestor-free: at 2, P(NP → DT NN | its ancestor is VP) = P(NP → DT NN).

Probability of a parse tree
Domination: we say N^j dominates words k to l if w_{k,l} is derived from N^j.
P(tree | sentence) = P(tree | S_{1,l}), where S_{1,l} means that the start symbol S dominates the word sequence w_{1,l}.
P(t | s) approximately equals the joint probability of the constituent non-terminals dominating the sentence fragments (next slide).

Probability of a parse tree (cont.)
For a tree with structure S_{1,l} → NP_{1,2} VP_{3,l}; NP_{1,2} → DT_{1,1} N_{2,2}; VP_{3,l} → V_{3,3} PP_{4,l}; PP_{4,l} → P_{4,4} NP_{5,l}, where NP_{5,l} yields w_5 ... w_l:
P(t | s) = P(t | S_{1,l})
= P(NP_{1,2}, DT_{1,1}, w_1, N_{2,2}, w_2, VP_{3,l}, V_{3,3}, w_3, PP_{4,l}, P_{4,4}, w_4, NP_{5,l}, w_{5...l} | S_{1,l})
= P(NP_{1,2}, VP_{3,l} | S_{1,l}) × P(DT_{1,1}, N_{2,2} | NP_{1,2}) × P(w_1 | DT_{1,1}) × P(w_2 | N_{2,2}) × P(V_{3,3}, PP_{4,l} | VP_{3,l}) × P(w_3 | V_{3,3}) × P(P_{4,4}, NP_{5,l} | PP_{4,l}) × P(w_4 | P_{4,4}) × P(w_{5...l} | NP_{5,l})
(using the chain rule, context-freeness and ancestor-freeness)

HMM vs. PCFG
O (observed sequence)  ↔  w_1m (sentence)
X (state sequence)     ↔  t (parse tree)
μ (model)              ↔  G (grammar)
Three fundamental questions:

HMM vs. PCFG
How likely is a certain observation given the model?  P(O | μ)
How likely is a sentence given the grammar?  P(w_1m | G)
How to choose a state sequence which best explains the observations?  argmax_X P(X | O, μ)
How to choose a parse which best supports the sentence?  argmax_t P(t | w_1m, G)

HMM vs. PCFG
How to choose the model parameters that best explain the observed data?  argmax_μ P(O | μ)
How to choose rule probabilities which maximize the probabilities of the observed sentences?  argmax_G P(w_1m | G)

Interesting Probabilities
For the sentence: The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)
Inside probability: what is the probability of having an NP at this position such that it will derive "the building"?  β_NP(4,5)
Outside probability: what is the probability of starting from N^1 and deriving "The gunman sprayed", an NP, and "with bullets"?  α_NP(4,5)

Interesting Probabilities
Random variables to be considered:
- The non-terminal being expanded, e.g. NP
- The word-span covered by the non-terminal, e.g. (4,5) refers to the words "the building"
While calculating probabilities, consider:
- The rule to be used for expansion, e.g. NP → DT NN
- The probabilities associated with the RHS non-terminals, e.g. the DT subtree's and the NN subtree's inside/outside probabilities

Outside Probability
α_j(p,q): the probability of beginning with N^1 and generating the non-terminal N^j_pq and all the words outside w_p ... w_q:
α_j(p,q) = P(w_{1(p-1)}, N^j_pq, w_{(q+1)m} | G)
(Figure: N^1 at the root, N^j spanning w_p ... w_q, with w_1 ... w_{p-1} to its left and w_{q+1} ... w_m to its right.)

Inside Probabilities
β_j(p,q): the probability of generating the words w_p ... w_q starting with the non-terminal N^j_pq:
β_j(p,q) = P(w_pq | N^j_pq, G)
(Figure: the same tree, with α covering the material outside N^j and β covering the subtree below N^j.)

Outside & Inside Probabilities: example
α_NP(4,5) for "the building" = P(The gunman sprayed, NP_{4,5}, with bullets | G)
β_NP(4,5) for "the building" = P(the building | NP_{4,5}, G)
The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)

Calculating Inside Probabilities β_j(p,q)
Base case:
β_j(k,k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G)
The base case is used for rules which derive the words (terminals) directly.
E.g., suppose N^j = NN is being considered and NN → building is one of its rules, with probability 0.5. Then
β_NN(5,5) = P(building | NN_{5,5}, G) = P(NN → building | G) = 0.5

Induction Step (assuming a grammar in Chomsky Normal Form)
β_j(p,q) = P(w_pq | N^j_pq, G)
         = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) × β_r(p,d) × β_s(d+1,q)
Consider the different splits of the words, indicated by d: e.g. "the huge building" can be split after "the" or after "huge".
Consider the different non-terminals that can be used in the rule: e.g. NP → DT NN and NP → DT NNS are available options.
Sum over all of these.
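This induction translates directly into a bottom-up chart computation, which is what the parse triangle on the following slides tabulates. The sketch below is a minimal illustration under the toy grammar above; the LEXICAL/UNARY/BINARY tables, the inside_chart name and the defaultdict chart are illustrative choices, and the single unary rule NP → NNS is applied only over one-word spans, which is all this grammar needs.

```python
from collections import defaultdict

LEXICAL = {("DT", "the"): 1.0, ("NN", "gunman"): 0.5, ("NN", "building"): 0.5,
           ("VBD", "sprayed"): 1.0, ("P", "with"): 1.0, ("NNS", "bullets"): 1.0}
UNARY   = {("NP", "NNS"): 0.3}                         # NP -> NNS
BINARY  = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 0.5,
           ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
           ("VP", "VP", "PP"): 0.6, ("VP", "VBD", "NP"): 0.4}

def inside_chart(words):
    """beta[(N, p, q)] = P(w_p..w_q | N_pq, G); word positions are 1-based."""
    m = len(words)
    beta = defaultdict(float)
    for k, w in enumerate(words, start=1):             # base case: beta_j(k, k)
        for (label, word), prob in LEXICAL.items():
            if word == w:
                beta[(label, k, k)] += prob
        for (parent, child), prob in UNARY.items():    # NP -> NNS over one word
            beta[(parent, k, k)] += prob * beta[(child, k, k)]
    for span in range(2, m + 1):                       # induction on span length
        for p in range(1, m - span + 2):
            q = p + span - 1
            for d in range(p, q):                      # split: w_p..w_d | w_{d+1}..w_q
                for (parent, left, right), prob in BINARY.items():
                    beta[(parent, p, q)] += (prob * beta[(left, p, d)]
                                                  * beta[(right, d + 1, q)])
    return beta

beta = inside_chart("the gunman sprayed the building with bullets".split())
print(round(beta[("NP", 1, 2)], 6))   # 0.25, matching the bottom-up example
```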

The Bottom-Up Approach
The idea of induction: consider "the gunman".
Base case: apply the unary (lexical) rules
DT → the, prob = 1.0
NN → gunman, prob = 0.5
Induction: the probability that an NP covers these 2 words
= P(NP → DT NN) × P(DT deriving the word "the") × P(NN deriving the word "gunman")
= 0.5 × 1.0 × 0.5 = 0.25

Parse Triangle
A parse triangle is constructed for calculating β_j(p,q).
Probability of a sentence using β_j(p,q):
P(w_1m | G) = P(N^1 ⇒* w_1m | G) = P(w_1m | N^1_1m, G) = β_1(1,m)

Parse Triangle: filling the diagonal
The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)
Fill the diagonal with β_j(k,k):
β_DT(1,1) = 1.0, β_NN(2,2) = 0.5, β_VBD(3,3) = 1.0, β_DT(4,4) = 1.0, β_NN(5,5) = 0.5, β_P(6,6) = 1.0, β_NNS(7,7) = 1.0

Parse Triangle: applying the induction formula
Calculate β_NP(1,2) using the induction formula:
β_NP(1,2) = P(the gunman | NP_{1,2}, G)
          = P(NP → DT NN) × β_DT(1,1) × β_NN(2,2)
          = 0.5 × 1.0 × 0.5 = 0.25
The cell (1,2) of the triangle now holds β_NP(1,2) = 0.25.

Example Parse t1
The gunman sprayed the building with bullets.
[S 1.0 [NP 0.5 [DT 1.0 The] [NN 0.5 gunman]] [VP 0.6 [VP 0.4 [VBD 1.0 sprayed] [NP 0.5 [DT 1.0 the] [NN 0.5 building]]] [PP 1.0 [P 1.0 with] [NP 0.3 [NNS 1.0 bullets]]]]]
The rule used at the top of the VP here is VP → VP PP.

Another Parse t2
The gunman sprayed the building with bullets.
[S 1.0 [NP 0.5 [DT 1.0 The] [NN 0.5 gunman]] [VP 0.4 [VBD 1.0 sprayed] [NP 0.2 [NP 0.5 [DT 1.0 the] [NN 0.5 building]] [PP 1.0 [P 1.0 with] [NP 0.3 [NNS 1.0 bullets]]]]]]
The rule used at the top of the VP here is VP → VBD NP.

Parse Triangle: completed chart
The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)
Row 1: β_DT(1,1) = 1.0, β_NP(1,2) = 0.25, β_S(1,7) = 0.006
Row 2: β_NN(2,2) = 0.5
Row 3: β_VBD(3,3) = 1.0, β_VP(3,5) = 0.1, β_VP(3,7) = 0.024
Row 4: β_DT(4,4) = 1.0, β_NP(4,5) = 0.25, β_NP(4,7) = 0.015
Row 5: β_NN(5,5) = 0.5
Row 6: β_P(6,6) = 1.0, β_PP(6,7) = 0.3
Row 7: β_NNS(7,7) = 1.0
β_VP(3,7) = P(sprayed the building with bullets | VP_{3,7}, G)
          = P(VP → VP PP) × β_VP(3,5) × β_PP(6,7) + P(VP → VBD NP) × β_VBD(3,3) × β_NP(4,7)
          = 0.6 × 0.1 × 0.3 + 0.4 × 1.0 × 0.015 = 0.024
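Reusing the inside_chart sketch from above, the remaining parse-triangle entries can be cross-checked numerically (again an illustrative sketch, not part of the slides):

```python
beta = inside_chart("the gunman sprayed the building with bullets".split())
print(round(beta[("VP", 3, 5)], 6))   # 0.1   = 0.4 * 1.0 * 0.25
print(round(beta[("NP", 4, 7)], 6))   # 0.015 = 0.2 * 0.25 * 0.3
print(round(beta[("VP", 3, 7)], 6))   # 0.024 = 0.6*0.1*0.3 + 0.4*1.0*0.015
print(round(beta[("S", 1, 7)], 6))    # 0.006 = P(t1) + P(t2) = 0.0045 + 0.0015
```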

Different Parses
Consider:
- Different splitting points, e.g. the 5th and the 3rd positions
- Different rules for the VP expansion, e.g. VP → VP PP and VP → VBD NP
Different parses for the VP "sprayed the building with bullets" can be constructed this way.

Outside Probabilities α_j(p,q)
Base case:
α_1(1,m) = 1 for the start symbol
α_j(1,m) = 0 for j ≠ 1
Inductive step for calculating α_j(p,q), for the configuration shown in the figure where N^j is the left child of its parent N^f:
α_j(p,q) = Σ_{f,g} Σ_{e>q} α_f(p,e) × P(N^f → N^j N^g) × β_g(q+1,e)
(Figure: N^f spans w_p ... w_e and rewrites as N^j N^g, with N^j spanning w_p ... w_q and N^g spanning w_{q+1} ... w_e; the summation is over f, g and e.)
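The same chart style works for the outside recursion. The sketch below reuses the defaultdict import, the BINARY table and the inside chart from the earlier sketches; it adds, for completeness, the symmetric case in which N^j is the right child of its parent (not shown in the figure above), and it ignores unary rules, which is sufficient for the check on the next slide. It is an illustrative sketch, not the slides' own algorithm.

```python
def outside_chart(words, beta, start="S"):
    """alpha[(N, p, q)] = P(w_1..w_{p-1}, N_pq, w_{q+1}..w_m | G)."""
    m = len(words)
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0                         # base case
    for span in range(m, 0, -1):                       # widest spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (parent, left, right), prob in BINARY.items():
                for e in range(q + 1, m + 1):          # N^j = left child of (p, e)
                    alpha[(left, p, q)] += (alpha[(parent, p, e)] * prob
                                            * beta[(right, q + 1, e)])
                for e in range(1, p):                  # N^j = right child of (e, q)
                    alpha[(right, p, q)] += (alpha[(parent, e, q)] * prob
                                             * beta[(left, e, p - 1)])
    return alpha
```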

Probability of a Sentence
P(w_1m, N_pq | G) = Σ_j P(w_1m, N^j_pq | G) = Σ_j α_j(p,q) β_j(p,q)
The joint probability of the sentence w_1m and of there being a constituent spanning words w_p to w_q is given by:
P(The gunman ... bullets, N_{4,5} | G) = Σ_j P(The gunman ... bullets, N^j_{4,5} | G) = α_NP(4,5) β_NP(4,5) + α_VP(4,5) β_VP(4,5) + ...
The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)
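Combining the two sketches reproduces this identity for the span (4,5): under the toy grammar only the NP has non-zero inside and outside probability there, and since both parses of the sentence contain that NP, the product equals the total sentence probability. A hypothetical check, reusing inside_chart and outside_chart from above:

```python
words = "the gunman sprayed the building with bullets".split()
beta  = inside_chart(words)
alpha = outside_chart(words, beta)
joint = alpha[("NP", 4, 5)] * beta[("NP", 4, 5)]   # alpha_NP(4,5) * beta_NP(4,5)
print(round(joint, 6))                             # 0.006 = P(w_1m | G) = beta_S(1,7)
```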