CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 27: SMT Assignment; HMM recap; Probabilistic Parsing contd.) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 17th March, 2011

CMU Pronunciation Dictionary Assignment

Data: The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. The current phoneme set contains 39 phonemes.

Parallel Corpus

Phoneme   Example   Translation
-------   -------   -----------
AA        odd       AA D
AE        at        AE T
AH        hut       HH AH T
AO        ought     AO T
AW        cow       K AW
AY        hide      HH AY D
B         be        B IY

Parallel Corpus (contd.)

Phoneme   Example   Translation
-------   -------   -----------
CH        cheese    CH IY Z
D         dee       D IY
DH        thee      DH IY
EH        Ed        EH D
ER        hurt      HH ER T
EY        ate       EY T
F         fee       F IY
G         green     G R IY N
HH        he        HH IY
IH        it        IH T
IY        eat       IY T
JH        gee       JH IY

The tasks:
- First obtain the Carnegie Mellon University Pronouncing Dictionary
- Create the phrase table using GIZA++
- For language modeling use SRILM
- For decoding use Moses
- Calculate precision, recall and F-score (a small scoring sketch follows)
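The last step, scoring predicted phoneme sequences against the reference transcriptions, can be made concrete with a minimal sketch. This is only an illustration under assumed conventions, not part of the assignment specification: it treats each word's transcription as a sequence of phoneme tokens and computes token-level precision, recall and F-score.

```python
# Hypothetical scoring sketch: token-level precision/recall/F-score for one word,
# comparing a predicted phoneme string against the CMUdict reference.
from collections import Counter

def prf(reference: str, predicted: str):
    ref, pred = Counter(reference.split()), Counter(predicted.split())
    overlap = sum((ref & pred).values())              # phoneme tokens found in both
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(prf("HH AH T", "HH AH D"))   # reference for 'hut' vs. an imperfect prediction
```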

Probabilistic Parsing

Bridging Classical and Probabilistic Parsing
The bridge between probabilistic parsing and classical parsing is the concept of domination.
Frequency: P(NP -> DT NN) = 0.5 means that in the corpus, 50% of noun phrases are composed of a determiner followed by a noun.
Phenomenon: P(NP -> DT NN) is actually P(DT NN | NP), i.e., the probability of joint domination by DT and NN, given domination by NP.
The concept of domination is thus the bridge between frequency (probabilistic parsing) and phenomenon (classical parsing).

Calculating Probability of a Sentence

$$P(s = w_{1m}) = \sum_{t:\ \mathrm{yield}(t) = w_{1m}} P(t)$$

We can calculate P(s = w_{1m}) either by a naive n-gram based approach or by summing P(t) over all parse trees t whose yield is w_{1m}. Which approach should we choose? Consider "The velocity of waves rises near the shore." A plural noun immediately followed by a singular verb ("waves rises") is unlikely in the corpus, so the n-gram model gives this sentence a low probability value.

Parse Tree

[S [NP [NP [DT The] [NN velocity]] [PP [P of] [NP [NNS waves]]]] [VP [V rises] [PP [P near] [NP [DT the] [N shore]]]]]

No other parse tree is possible for this sentence.

Various ways to calculate the probability of a sentence:
- Naive n-gram based
- Syntactic level (parse tree)
- Semantic level
- Pragmatics
- Discourse

Probabilistic Context Free Grammars

S   -> NP VP     1.0        DT  -> the       1.0
NP  -> DT NN     0.5        NN  -> gunman    0.5
NP  -> NNS       0.3        NN  -> building  0.5
NP  -> NP PP     0.2        VBD -> sprayed   1.0
PP  -> P NP      1.0        NNS -> bullets   1.0
VP  -> VP PP     0.6
VP  -> VBD NP    0.4
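As a quick sanity check (not part of the lecture), the grammar above can be stored as a rule-to-probability mapping and verified to be a proper PCFG: the probabilities of all rules sharing a left-hand side must sum to 1. A minimal sketch:

```python
# The toy PCFG from the slide, as (LHS, RHS) -> probability.
PCFG = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.5, ("NP", ("NNS",)): 0.3, ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("VP", "PP")): 0.6, ("VP", ("VBD", "NP")): 0.4,
    ("DT", ("the",)): 1.0, ("NN", ("gunman",)): 0.5, ("NN", ("building",)): 0.5,
    ("VBD", ("sprayed",)): 1.0, ("NNS", ("bullets",)): 1.0,
}

totals = {}
for (lhs, _), p in PCFG.items():
    totals[lhs] = totals.get(lhs, 0.0) + p
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())   # each LHS sums to 1
print(totals)
```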

Example Parse t1: "The gunman sprayed the building with bullets." (the PP attached to the VP)

[S_1.0 [NP_0.5 [DT_1.0 The] [NN_0.5 gunman]] [VP_0.6 [VP_0.4 [VBD_1.0 sprayed] [NP_0.5 [DT_1.0 the] [NN_0.5 building]]] [PP_1.0 [P_1.0 with] [NP_0.3 [NNS_1.0 bullets]]]]]

P(t1) = 1.0 * 0.5 * 1.0 * 0.5 * 0.6 * 0.4 * 1.0 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0045

Another Parse t2: "The gunman sprayed the building with bullets." (the PP attached to the object NP)

[S_1.0 [NP_0.5 [DT_1.0 The] [NN_0.5 gunman]] [VP_0.4 [VBD_1.0 sprayed] [NP_0.2 [NP_0.5 [DT_1.0 the] [NN_0.5 building]] [PP_1.0 [P_1.0 with] [NP_0.3 [NNS_1.0 bullets]]]]]]

P(t2) = 1.0 * 0.5 * 1.0 * 0.5 * 0.4 * 1.0 * 0.2 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0015
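Each tree probability is simply the product of the probabilities of the rules used in its derivation. A minimal sketch (the lists below enumerate the productions appearing in t1 and t2; P -> with is taken to be 1.0, as in the slide's product):

```python
from math import prod

t1_rules = [1.0, 0.5, 1.0, 0.5,   # S->NP VP, NP->DT NN, DT->the, NN->gunman
            0.6, 0.4, 1.0,        # VP->VP PP, VP->VBD NP, VBD->sprayed
            0.5, 1.0, 0.5,        # NP->DT NN, DT->the, NN->building
            1.0, 1.0, 0.3, 1.0]   # PP->P NP, P->with, NP->NNS, NNS->bullets
t2_rules = [1.0, 0.5, 1.0, 0.5,   # S->NP VP, NP->DT NN, DT->the, NN->gunman
            0.4, 1.0, 0.2,        # VP->VBD NP, VBD->sprayed, NP->NP PP
            0.5, 1.0, 0.5,        # NP->DT NN, DT->the, NN->building
            1.0, 1.0, 0.3, 1.0]   # PP->P NP, P->with, NP->NNS, NNS->bullets

print(prod(t1_rules), prod(t2_rules))   # about 0.0045 and 0.0015
```

Their sum, 0.006, is the total probability of the sentence under this grammar; the inside algorithm recovers the same value later as β_S(1,7).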

Probability of a parse tree (contd.)

For the tree S_{1,l} -> NP_{1,2} VP_{3,l}; NP_{1,2} -> DT_{1,1} N_{2,2}; VP_{3,l} -> V_{3,3} PP_{4,l}; PP_{4,l} -> P_{4,4} NP_{5,l}, covering words w_1 w_2 ... w_l:

$P(t \mid s) = P(t \mid S_{1,l})$
$= P(NP_{1,2}, DT_{1,1}, w_1, N_{2,2}, w_2, VP_{3,l}, V_{3,3}, w_3, PP_{4,l}, P_{4,4}, w_4, NP_{5,l}, w_{5 \ldots l} \mid S_{1,l})$
$= P(NP_{1,2}, VP_{3,l} \mid S_{1,l}) \cdot P(DT_{1,1}, N_{2,2} \mid NP_{1,2}) \cdot P(w_1 \mid DT_{1,1}) \cdot P(w_2 \mid N_{2,2}) \cdot P(V_{3,3}, PP_{4,l} \mid VP_{3,l}) \cdot P(w_3 \mid V_{3,3}) \cdot P(P_{4,4}, NP_{5,l} \mid PP_{4,l}) \cdot P(w_4 \mid P_{4,4}) \cdot P(w_{5 \ldots l} \mid NP_{5,l})$

(using the chain rule, context-freeness and ancestor-freeness)

HMM vs. PCFG

HMM                     PCFG
O   observed sequence   w_{1m}  sentence
X   state sequence      t       parse tree
μ   model               G       grammar

Three fundamental questions

HMM: How likely is a certain observation given the model?  $P(O \mid \mu)$
PCFG: How likely is a sentence given the grammar?  $P(w_{1m} \mid G)$

HMM: How to choose a state sequence which best explains the observations?  $\arg\max_X P(X \mid O, \mu)$
PCFG: How to choose a parse which best supports the sentence?  $\arg\max_t P(t \mid w_{1m}, G)$

HMM: How to choose the model parameters that best explain the observed data?  $\arg\max_\mu P(O \mid \mu)$
PCFG: How to choose rule probabilities which maximize the probabilities of the observed sentences?  $\arg\max_G P(w_{1m} \mid G)$

Recap of HMM

HMM Definition
- Set of states: S, where |S| = N
- Start state S_0 /* P(S_0) = 1 */
- Output alphabet: O, where |O| = M
- Transition probabilities: A = {a_ij} /* state i to state j */
- Emission probabilities: B = {b_j(o_k)} /* prob. of emitting or absorbing o_k from state j */
- Initial state probabilities: Π = {p_1, p_2, ..., p_N}, where each p_i = P(o_0 = ε, S_i | S_0)

Markov Processes: Properties
- Limited horizon: given the previous k states, the state at time t is independent of the states before them (an order-k Markov process):
  $P(X_t = i \mid X_{t-1}, X_{t-2}, \ldots, X_0) = P(X_t = i \mid X_{t-1}, X_{t-2}, \ldots, X_{t-k})$
- Time invariance (shown for k = 1):
  $P(X_t = i \mid X_{t-1} = j) = P(X_1 = i \mid X_0 = j) = P(X_n = i \mid X_{n-1} = j)$

Three basic problems (contd.)
- Problem 1: Likelihood of a sequence (Forward procedure, Backward procedure)
- Problem 2: Best state sequence (Viterbi algorithm)
- Problem 3: Re-estimation (Baum-Welch, i.e., the Forward-Backward algorithm)

Probabilistic Inference
O: observation sequence. S: state sequence.
Given O, find S* where $S^* = \arg\max_S P(S \mid O)$; this is called probabilistic inference.
Infer the hidden from the observed. How is this inference different from logical inference based on propositional or predicate calculus?

Essentials of Hidden Markov Models
1. Markov + Naive Bayes
2. Uses both transition and observation probabilities:
   $P(S_k \xrightarrow{O_k} S_{k+1}) = P(O_k \mid S_k)\, P(S_{k+1} \mid S_k)$
3. Effectively makes the Hidden Markov Model a Finite State Machine (FSM) with probabilities

Probability of an Observation Sequence
$P(O) = \sum_S P(O, S) = \sum_S P(S)\, P(O \mid S)$
Without any restriction, the search space size is $|S|^{|O|}$.

Continuing with the Urn example (colored ball choosing):
- Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
- Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
- Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Example (contd.) Given:

Transition probabilities:
       U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

Observation/output probabilities:
       R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

Observation: RRGGBRGR
What is the corresponding state sequence?
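To make P(O) = Σ_S P(S) P(O | S) concrete, here is a small sketch (not from the lecture) that encodes the urn HMM and sums the joint probability over every state sequence by brute force, illustrating the |S|^|O| search space. It uses the standard convention that state S_k emits O_k, and the initial distribution (0.5, 0.3, 0.2) shown on the Viterbi slide below.

```python
from itertools import product

states = ["U1", "U2", "U3"]
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},      # transition probabilities
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},         # observation probabilities
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 0.5, "U2": 0.3, "U3": 0.2}             # initial distribution (from S0)

obs = "RRGGBRGR"

def joint(seq):
    """P(O, S) = pi(s_1) b(o_1|s_1) * prod_k a(s_k, s_{k+1}) b(o_{k+1}|s_{k+1})."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= A[seq[k - 1]][seq[k]] * B[seq[k]][obs[k]]
    return p

# Brute force over all 3^8 = 6561 state sequences.
print(sum(joint(seq) for seq in product(states, repeat=len(obs))))
```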

Diagrammatic representation (1/2): [state-transition diagram over U1, U2, U3 with the transition probabilities above on the arcs and the emission probabilities for R, G, B attached to each state]

Diagrammatic representation (2/2): [the same diagram with each arc labelled by the combined transition x emission probability, e.g., the U1 -> U1 arc carries R,0.03 G,0.05 B,0.02, i.e., 0.1 times U1's emission probabilities]

Probabilistic FSM: [a two-state machine S1, S2 whose arcs are labelled with (symbol : probability) pairs such as (a1 : 0.3), (a2 : 0.2)]
The question here is: what is the most likely state sequence given the output sequence seen?

Developing the tree: [the Viterbi trellis for the probabilistic FSM, starting with probability 1.0 at S1 and 0.0 at S2. After reading a1 the path probabilities are 0.1 (to S1) and 0.3 (to S2); after reading a2 the candidate extensions are 0.1*0.2 = 0.02, 0.1*0.4 = 0.04, 0.3*0.3 = 0.09 and 0.3*0.2 = 0.06.]
Choose the winning sequence per state per iteration.

Tree structure (contd.): [the trellis continues from 0.09 (S1) and 0.06 (S2); after the next a1 the candidate values include 0.09*0.1 = 0.009, 0.027, 0.012 and 0.018, and after the next a2 the surviving values are 0.0081, 0.0054, 0.0024 and 0.0048.]
The problem being addressed by this tree is $S^* = \arg\max_S P(S \mid a_1 a_2 a_1 a_2, \mu)$, where a1-a2-a1-a2 is the output sequence and μ is the model (the machine).

Viterbi Algorithm for the Urn problem (first two symbols): [trellis starting at S0, with ε-transition probabilities 0.5, 0.3 and 0.2 into U1, U2 and U3; after the first R, each of the nine extensions is scored and the winner sequence per state (marked *) is kept, e.g., 0.075* and 0.048*.]
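The trellis above can be computed with a compact Viterbi sketch (a standard dynamic-programming formulation, reusing the states, A, B and pi dictionaries from the urn sketch earlier): for each position and state it keeps the best path probability and a back-pointer, i.e., the winner sequence per state per iteration.

```python
def viterbi(obs, states, A, B, pi):
    delta = {s: pi[s] * B[s][obs[0]] for s in states}    # best prob of paths ending in s
    backptrs = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * A[p][s])
            delta[s] = prev[best] * A[best][s] * B[s][o]
            ptr[s] = best
        backptrs.append(ptr)
    last = max(states, key=lambda s: delta[s])            # best final state
    path = [last]
    for ptr in reversed(backptrs):                        # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

print(viterbi("RRGGBRGR", states, A, B, pi))   # most likely urn sequence and its probability
```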

Markov process of order > 1 (say 2)

Obs:   ε   R   R   G   G   B   R   G   R       (O_0 ... O_8)
State: S_0 S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9

The same theory works; we need P(S) · P(O | S):
$P(O_0 \mid S_0) \cdot P(S_1 \mid S_0) \cdot [P(O_1 \mid S_1) \cdot P(S_2 \mid S_1 S_0)] \cdot [P(O_2 \mid S_2) \cdot P(S_3 \mid S_2 S_1)] \cdot [P(O_3 \mid S_3) \cdot P(S_4 \mid S_3 S_2)] \cdot [P(O_4 \mid S_4) \cdot P(S_5 \mid S_4 S_3)] \cdot [P(O_5 \mid S_5) \cdot P(S_6 \mid S_5 S_4)] \cdot [P(O_6 \mid S_6) \cdot P(S_7 \mid S_6 S_5)] \cdot [P(O_7 \mid S_7) \cdot P(S_8 \mid S_7 S_6)] \cdot [P(O_8 \mid S_8) \cdot P(S_9 \mid S_8 S_7)]$

We introduce S_0 and S_9 as the initial and final states, respectively. After S_8 the next state is S_9 with probability 1, i.e., P(S_9 | S_8 S_7) = 1. O_0 is the ε-transition.

Adjustments
- The transition probability table will have tuples on rows and states on columns
- The output probability table will remain the same
- In the Viterbi tree, the (second-order) Markov process will take effect from the 3rd input symbol (ε R R)
- There will be 27 leaves, out of which only 9 will remain
- Sequences ending in the same tuple will be compared
- Instead of U1, U2 and U3, the rows are U1U1, U1U2, U1U3, U2U1, U2U2, U2U3, U3U1, U3U2, U3U3

Forward and Backward Probability Calculation

Forward probability F(k, i)
Define F(k, i) = probability of being in state S_i having seen o_0 o_1 o_2 ... o_k:
$F(k, i) = P(o_0 o_1 o_2 \ldots o_k, S_i)$
With m as the length of the observed sequence:
$P(\text{observed sequence}) = P(o_0 o_1 o_2 \ldots o_m) = \sum_{p=0}^{N} P(o_0 o_1 o_2 \ldots o_m, S_p) = \sum_{p=0}^{N} F(m, p)$

Forward probability (contd.)
$F(k, q) = P(o_0 o_1 o_2 \ldots o_k, S_q)$
$= P(o_0 o_1 o_2 \ldots o_{k-1}, o_k, S_q)$
$= \sum_{p=0}^{N} P(o_0 o_1 o_2 \ldots o_{k-1}, S_p, o_k, S_q)$
$= \sum_{p=0}^{N} P(o_0 o_1 o_2 \ldots o_{k-1}, S_p) \cdot P(o_k, S_q \mid o_0 o_1 o_2 \ldots o_{k-1}, S_p)$
$= \sum_{p=0}^{N} F(k-1, p) \cdot P(o_k, S_q \mid S_p)$
$= \sum_{p=0}^{N} F(k-1, p) \cdot P(S_p \xrightarrow{o_k} S_q)$
[trellis: observations O_0 O_1 ... O_m over states S_0 S_1 ... S_final]
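The recurrence above translates directly into code. A minimal forward-pass sketch (again reusing states, A, B and pi from the urn sketch; states emit the symbols directly, so the slide's ε initial step is folded into pi):

```python
def forward(obs, states, A, B, pi):
    # F[k][q] = P(o_0 .. o_k, state at step k is q)
    F = [{q: pi[q] * B[q][obs[0]] for q in states}]
    for o in obs[1:]:
        F.append({q: sum(F[-1][p] * A[p][q] for p in states) * B[q][o] for q in states})
    return F

F = forward("RRGGBRGR", states, A, B, pi)
print(sum(F[-1][q] for q in states))   # P(O), matching the brute-force sum computed earlier
```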

Backward probability B(k, i)
Define B(k, i) = probability of seeing o_k o_{k+1} o_{k+2} ... o_m given that the state was S_i:
$B(k, i) = P(o_k o_{k+1} o_{k+2} \ldots o_m \mid S_i)$
With m as the length of the observed sequence:
$P(\text{observed sequence}) = P(o_0 o_1 o_2 \ldots o_m) = P(o_0 o_1 o_2 \ldots o_m \mid S_0) = B(0, 0)$

Backward probability (contd.)
$B(k, p) = P(o_k o_{k+1} o_{k+2} \ldots o_m \mid S_p)$
$= P(o_{k+1} o_{k+2} \ldots o_m, o_k \mid S_p)$
$= \sum_{q=0}^{N} P(o_{k+1} o_{k+2} \ldots o_m, o_k, S_q \mid S_p)$
$= \sum_{q=0}^{N} P(o_k, S_q \mid S_p) \cdot P(o_{k+1} o_{k+2} \ldots o_m \mid o_k, S_q, S_p)$
$= \sum_{q=0}^{N} P(o_{k+1} o_{k+2} \ldots o_m \mid S_q) \cdot P(o_k, S_q \mid S_p)$
$= \sum_{q=0}^{N} B(k+1, q) \cdot P(S_p \xrightarrow{o_k} S_q)$
[trellis: observations O_0 O_1 ... O_m over states S_0 S_1 ... S_final]
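A matching backward-pass sketch (same assumptions; here the backward value at step k covers o_{k+1} .. o_m, a standard indexing that differs from the slide's B(k, i) only by including o_k one step later):

```python
def backward(obs, states, A, B):
    # bwd[k][p] = P(o_{k+1} .. o_m | state at step k is p), with bwd[last][p] = 1
    bwd = [{p: 1.0 for p in states}]
    for o in reversed(obs[1:]):
        bwd.insert(0, {p: sum(A[p][q] * B[q][o] * bwd[0][q] for q in states) for p in states})
    return bwd

obs = "RRGGBRGR"
bwd = backward(obs, states, A, B)
print(sum(pi[q] * B[q][obs[0]] * bwd[0][q] for q in states))   # same P(O) as the forward pass
```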

Back to PCFG

Interesting Probabilities
For the sentence "The gunman sprayed the building with bullets" (word positions 1-7):
- Inside probability: what is the probability of having an NP at this position such that it will derive "the building"? β_NP(4,5)
- Outside probability: what is the probability of starting from N^1 and deriving "The gunman sprayed", an NP, and "with bullets"? α_NP(4,5)

Interesting Probabilities (contd.)
Random variables to be considered:
- The non-terminal being expanded, e.g., NP
- The word-span covered by the non-terminal, e.g., (4,5) refers to the words "the building"
While calculating probabilities, consider:
- The rule to be used for expansion, e.g., NP -> DT NN
- The probabilities associated with the RHS non-terminals, e.g., the DT subtree's inside/outside probabilities and the NN subtree's inside/outside probabilities

Outside Probability
α_j(p,q): the probability of beginning with N^1 and generating the non-terminal N^j_{pq} and all words outside w_p ... w_q:
$\alpha_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} \mid G)$
[figure: N^1 spanning w_1 ... w_m, with N^j spanning w_p ... w_q inside it]

Inside Probability
β_j(p,q): the probability of generating the words w_p ... w_q starting with the non-terminal N^j_{pq}:
$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq}, G)$
[figure: α is the probability mass outside N^j (the rest of the tree under N^1), β the mass inside N^j over w_p ... w_q]

Outside & Inside Probabilities: example
α_NP(4,5) for "the building" = P(The gunman sprayed, NP_{4,5}, with bullets | G)
β_NP(4,5) for "the building" = P(the building | NP_{4,5}, G)
(Sentence: The gunman sprayed the building with bullets; positions 1-7.)

Calculating Inside Probabilities β_j(p,q)
Base case:
$\beta_j(k,k) = P(w_k \mid N^j_{kk}, G) = P(N^j \to w_k \mid G)$
The base case is used for rules which derive the words (terminals) directly.
E.g., suppose N^j = NN is being considered and NN -> building is one of the rules, with probability 0.5:
$\beta_{NN}(5,5) = P(\text{building} \mid NN_{5,5}, G) = P(NN \to \text{building} \mid G) = 0.5$

Induction Step (assuming the grammar is in Chomsky Normal Form):
$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq}, G) = \sum_{r,s}\ \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q)$
- Consider different splits of the words, indicated by d: e.g., for "the huge building", the split can come after "the" or after "huge" (d = 2 or d = 3).
- Consider the different non-terminals that can be used in the rule: NP -> DT NN and NP -> DT NNS are the available options.
- Sum over all of these.

The Bottom-Up Approach
The idea of induction: consider "the gunman".
Base cases: apply the unary rules
- DT -> the, prob = 1.0
- NN -> gunman, prob = 0.5
Induction: the probability that an NP covers these two words
= P(NP -> DT NN) * P(DT deriving the word "the") * P(NN deriving the word "gunman")
= 0.5 * 1.0 * 0.5 = 0.25

Parse Triangle
A parse triangle is constructed for calculating β_j(p,q).
Probability of a sentence using β_j(p,q):
$P(w_{1m} \mid G) = P(N^1 \Rightarrow w_{1m} \mid G) = P(w_{1m} \mid N^1_{1m}, G) = \beta_1(1,m)$

Parse Triangle: fill the diagonal with β_j(k,k)
The gunman sprayed the building with bullets (positions 1-7)
- (1,1): β_DT = 1.0
- (2,2): β_NN = 0.5
- (3,3): β_VBD = 1.0
- (4,4): β_DT = 1.0
- (5,5): β_NN = 0.5
- (6,6): β_P = 1.0
- (7,7): β_NNS = 1.0

Parse Triangle (contd.): non-diagonal entries are calculated using the induction formula, e.g.
$\beta_{NP}(1,2) = P(\text{the gunman} \mid NP_{1,2}, G) = P(NP \to DT\ NN) \cdot \beta_{DT}(1,1) \cdot \beta_{NN}(2,2) = 0.5 \cdot 1.0 \cdot 0.5 = 0.25$
so the triangle now contains β_NP(1,2) = 0.25 in addition to the diagonal entries above.
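Filling the whole triangle mechanically is a CKY-style computation. The sketch below is only an illustration, not the lecture's code: the lexical rules fill the diagonal, the binary rules drive the induction, and the rule P -> with (probability 1.0) is assumed, since the slide's grammar table omits it.

```python
from collections import defaultdict

binary = {("S",  ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.5,
          ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")):  1.0,
          ("VP", ("VP", "PP")): 0.6, ("VP", ("VBD", "NP")): 0.4}
lexical = {("DT", "the"): 1.0, ("NN", "gunman"): 0.5, ("NN", "building"): 0.5,
           ("VBD", "sprayed"): 1.0, ("NNS", "bullets"): 1.0, ("P", "with"): 1.0}
unary = {("NP", "NNS"): 0.3}                       # the one non-branching nonterminal rule

words = "the gunman sprayed the building with bullets".split()
n = len(words)
beta = defaultdict(float)                          # beta[(A, p, q)], 1-based spans

for k, w in enumerate(words, start=1):             # base case: the diagonal
    for (lhs, word), prob in lexical.items():
        if word == w:
            beta[(lhs, k, k)] = prob
    for (lhs, child), prob in unary.items():       # e.g. NP -> NNS over "bullets"
        beta[(lhs, k, k)] += prob * beta[(child, k, k)]

for span in range(2, n + 1):                       # induction: widen the span
    for p in range(1, n - span + 2):
        q = p + span - 1
        for d in range(p, q):                      # split point
            for (lhs, (left, right)), prob in binary.items():
                beta[(lhs, p, q)] += prob * beta[(left, p, d)] * beta[(right, d + 1, q)]

print(beta[("NP", 1, 2)], beta[("VP", 3, 7)], beta[("S", 1, 7)])  # 0.25, 0.024, 0.006
```

The resulting β_S(1,7) equals P(t1) + P(t2) = 0.0045 + 0.0015 = 0.006, the sentence probability computed earlier.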

Example Parse t1: "The gunman sprayed the building with bullets."
[the same tree as t1 above, with the PP attached to the VP]
The rule used here (at the top of the VP) is VP -> VP PP.

Another Parse t2: "The gunman sprayed the building with bullets."
[the same tree as t2 above, with the PP attached to the object NP]
The rule used here (at the top of the VP) is VP -> VBD NP.

Parse Triangle (completed):
The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7)
- Row 1: β_DT(1,1) = 1.0, β_NP(1,2) = 0.25, β_S(1,7) = 0.006
- Row 2: β_NN(2,2) = 0.5
- Row 3: β_VBD(3,3) = 1.0, β_VP(3,5) = 0.1, β_VP(3,7) = 0.024
- Row 4: β_DT(4,4) = 1.0, β_NP(4,5) = 0.25, β_NP(4,7) = 0.015
- Row 5: β_NN(5,5) = 0.5
- Row 6: β_P(6,6) = 1.0, β_PP(6,7) = 0.3
- Row 7: β_NNS(7,7) = 1.0

$\beta_{VP}(3,7) = P(\text{sprayed the building with bullets} \mid VP_{3,7}, G)$
$= P(VP \to VP\ PP) \cdot \beta_{VP}(3,5) \cdot \beta_{PP}(6,7) + P(VP \to VBD\ NP) \cdot \beta_{VBD}(3,3) \cdot \beta_{NP}(4,7)$
$= 0.6 \cdot 0.1 \cdot 0.3 + 0.4 \cdot 1.0 \cdot 0.015 = 0.018 + 0.006 = 0.024$

Finally, $\beta_S(1,7) = P(S \to NP\ VP) \cdot \beta_{NP}(1,2) \cdot \beta_{VP}(3,7) = 1.0 \cdot 0.25 \cdot 0.024 = 0.006 = P(t_1) + P(t_2)$.

Different Parses
Consider:
- Different splitting points, e.g., the 5th and 3rd positions
- Different rules for VP expansion, e.g., VP -> VP PP and VP -> VBD NP
Different parses for the VP "sprayed the building with bullets" can be constructed this way.

Outside Probabilities α_j(p,q)
Base case:
$\alpha_1(1,m) = 1$ for the start symbol; $\alpha_j(1,m) = 0$ for $j \neq 1$.
Inductive step for calculating α_j(p,q) (shown for the case where N^j is the left child; a symmetric term handles the right-child case):
$\alpha_j(p,q) = \sum_{f,g}\ \sum_{e=q+1}^{m} \alpha_f(p,e)\, P(N^f \to N^j N^g)\, \beta_g(q+1,e)$
[figure: N^f_{pe} expanding into N^j_{pq} and N^g_{(q+1)e} under N^1]
Summation is over f, g and e.

Probability of a Sentence
$P(w_{1m}, N_{pq} \mid G) = \sum_j P(w_{1m}, N^j_{pq} \mid G) = \sum_j \alpha_j(p,q)\, \beta_j(p,q)$
The joint probability of a sentence w_{1m} and of there being a constituent spanning words w_p to w_q is given as:
$P(\text{The gunman} \ldots \text{bullets}, N_{4,5} \mid G) = \sum_j P(\text{The gunman} \ldots \text{bullets}, N^j_{4,5} \mid G) = \alpha_{NP}(4,5)\, \beta_{NP}(4,5) + \alpha_{VP}(4,5)\, \beta_{VP}(4,5) + \ldots$
(Sentence: The gunman sprayed the building with bullets; positions 1-7.)