Stochastic models for complex machine learning


1 Stochastic models for complex machine learning R. Basili Corso di Web Mining e Retrieval a.a May 16, 2009

2 Outline: Natural language as a stochastic process; Predictive models for stochastic processes: Markov models, Hidden Markov models; POS tagging as a stochastic process; Probabilistic approaches to syntactic analysis: CF grammar-based approaches (e.g. PCFG), other approaches (lexicalized or dependency based)

3 Quantitative Models of language structures. Linguistic structures are examples of structures where syntagmatic information is crucial for machine learning. The most widely used models here are grammars: 1. S -> NP V; 2. S -> NP; 3. NP -> PN; 4. NP -> N; 5. NP -> Adj N; 6. N -> "imposta"; 7. V -> "imposta"; 8. Adj -> "pesante"; 9. PN -> "Pesante"; ...

4 The role of Quantitative Approaches. Two derivation trees for "Pesante imposta": one parses it as (S (NP (PN Pesante)) (V imposta)), the other as (S (NP (Adj Pesante) (N imposta))).

5 The role of Quantitative Approaches. Weighted grammars model (possibly limited) degrees of grammaticality. They are meant to deal with a wide range of ambiguity problems: 1. S -> NP V .7; 2. S -> NP .3; 3. NP -> PN .1; 4. NP -> N .6; 5. NP -> Adj N .3; 6. N -> imposta .6; 7. V -> imposta .4; 8. Adj -> Pesante .8; 9. PN -> Pesante .2

6 The role of Quantitative Approaches. The two derivation trees for "Pesante imposta" with rule weights attached: (S .7 (NP .1 (PN Pesante .2)) (V imposta .4)) versus (S .3 (NP .3 (Adj Pesante .8) (N imposta .6))).

7 The role of Quantitative Approaches. Weighted grammars allow us to compute the degree of grammaticality of different ambiguous derivations, thus supporting disambiguation:

8 The role of Quantitative Approaches. Weighted grammars allow us to compute the degree of grammaticality of different ambiguous derivations, thus supporting disambiguation: 1. S -> NP V .7; 2. S -> NP .3; 3. NP -> PN .1; 4. NP -> N .6; 5. NP -> Adj N .3; 6. N -> imposta .6; 7. V -> imposta .4; 8. Adj -> Pesante .8; 9. PN -> Pesante .2; ... prob(((pesante)_PN (imposta)_V)_S) = (.7 · .1 · .2 · .4) = 0.0056

9 The role of Quantitative Approaches. Weighted grammars allow us to compute the degree of grammaticality of different ambiguous derivations, thus supporting disambiguation: 1. S -> NP V .7; 2. S -> NP .3; 3. NP -> PN .1; 4. NP -> N .6; 5. NP -> Adj N .3; 6. N -> imposta .6; 7. V -> imposta .4; 8. Adj -> Pesante .8; 9. PN -> Pesante .2; ... prob(((pesante)_PN (imposta)_V)_S) = (.7 · .1 · .2 · .4) = 0.0056; prob(((pesante)_Adj (imposta)_N)_S) = (.3 · .3 · .8 · .6) = 0.0432
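The two derivation probabilities above can be reproduced mechanically: under a weighted CF grammar, the score of a derivation is the product of the weights of the rules it uses. The sketch below (not part of the original slides) hard-codes the rule weights listed above; the encoding of rules as tuples and the function name are illustrative choices.

```python
# Minimal sketch: probability of a derivation as the product of the weights
# of the rules it uses, with the rule weights given in the slides above.

RULE_PROB = {
    ("S", ("NP", "V")): 0.7,
    ("S", ("NP",)): 0.3,
    ("NP", ("PN",)): 0.1,
    ("NP", ("N",)): 0.6,
    ("NP", ("Adj", "N")): 0.3,
    ("N", ("imposta",)): 0.6,
    ("V", ("imposta",)): 0.4,
    ("Adj", ("Pesante",)): 0.8,
    ("PN", ("Pesante",)): 0.2,
}

def derivation_prob(rules):
    """Multiply the weights of the rules used in one derivation."""
    p = 1.0
    for r in rules:
        p *= RULE_PROB[r]
    return p

# Derivation 1: Pesante as proper noun, imposta as verb.
d1 = [("S", ("NP", "V")), ("NP", ("PN",)), ("PN", ("Pesante",)), ("V", ("imposta",))]
# Derivation 2: Pesante as adjective, imposta as noun.
d2 = [("S", ("NP",)), ("NP", ("Adj", "N")), ("Adj", ("Pesante",)), ("N", ("imposta",))]

print(derivation_prob(d1))  # 0.7 * 0.1 * 0.2 * 0.4 = 0.0056
print(derivation_prob(d2))  # 0.3 * 0.3 * 0.8 * 0.6 = 0.0432
```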

10 Structural Disambiguation. "portare borsa in pelle" ("carry a leather bag"): two derivation trees, one attaching the PP "in pelle" inside the object NP headed by "borsa", the other attaching it to the VP headed by "portare". Derivation trees for a structurally ambiguous sentence.

11 Structural Disambiguation (cont'd). "portare borsa in mano" ("carry a bag in one's hand"): again two derivation trees, with the PP "in mano" attached either to the NP "borsa" or to the VP headed by "portare". Derivation trees for a second structurally ambiguous sentence.

12 Structural Disambiguation (cont'd). For the two sentences above, lexical statistics suggest the correct attachments: p(portare, in, pelle) << p(borsa, in, pelle) and p(borsa, in, mano) << p(portare, in, mano). Disambiguation of structural ambiguity.

13 Tolerance to errors. "vendita di articoli da regalo" vs. "vendita articoli regalo": the first NP has a full parse (N plus prepositional modifiers), while the telegraphic variant "vendita articoli regalo" is not covered by the grammar (the prepositions are missing). An example of an ungrammatical but meaningful sentence.

14 Error tolerance (cont'd). The ungrammatical variant can be covered by extending the grammar with non-standard rules Γ, assigned small but non-zero probabilities, p(Γ) > 0, so that "vendita articoli regalo" also receives an analysis. Modeling of ungrammatical phenomena.

15 Probability and Language Modeling. Aims: to extend grammatical (i.e. rule-based) models with predictive and disambiguation capabilities; to offer theoretically well-founded inductive methods to develop (not merely) quantitative models of linguistic phenomena.

16 Probability and Language Modeling. Aims: to extend grammatical (i.e. rule-based) models with predictive and disambiguation capabilities; to offer theoretically well-founded inductive methods to develop (not merely) quantitative models of linguistic phenomena. Methods and resources: mathematical theories (e.g. Markov models); systematic testing/evaluation frameworks; extended repositories of examples of language in use; traditional linguistic resources (e.g. "models" like dictionaries).

17 Probability and the Empiricist Renaissance (2). Differences: amount of knowledge available a priori; target: competence vs. performance; methods: deduction vs. induction.

18 Probability and the Empiricist Renaissance (2). Differences: amount of knowledge available a priori; target: competence vs. performance; methods: deduction vs. induction. The role of probability in NLP is also related to: the difficulty of categorical statements in language study (e.g. grammaticality or syntactic categorization); the cognitive nature of language understanding; the role of uncertainty.

19 Probability and Language Modeling Signals are abstracted via symbols that are not known in advance

20 Probability and Language Modeling Signals are abstracted via symbols that are not known in advance Emitted signals belong to an alphabet A

21 Probability and Language Modeling Signals are abstracted via symbols that are not known in advance Emitted signals belong to an alphabet A Time is discrete: each time point corresponds to an emitted signal

22 Probability and Language Modeling. Signals are abstracted via symbols that are not known in advance. Emitted signals belong to an alphabet A. Time is discrete: each time point corresponds to an emitted signal. Sequences of symbols (w_1, ..., w_n) correspond to sequences of time points (1, ..., n)

23 Probability and Language Modeling. Signals are abstracted via symbols that are not known in advance. Emitted signals belong to an alphabet A. Time is discrete: each time point corresponds to an emitted signal. Sequences of symbols (w_1, ..., w_n) correspond to sequences of time points (1, ..., n), e.g. time points ..., 3, 2, 1 emitting the symbols ..., w_i3, w_i2, w_i1.

24 Probability and Language Modeling A generative language model A random variable X can be introduced so that

25 Probability and Language Modeling. A generative language model: a random variable X can be introduced so that it assumes values w_i in the alphabet A, and probability is used to describe the uncertainty about the emitted signal: p(X = w_i), for w_i in A.

26 Probability and Language Modeling. A random variable X can be introduced so that X assumes values in A at each step i, i.e. X_i = w_j; the corresponding probability is p(X_i = w_j).

27 Probability and Language Modeling. A random variable X can be introduced so that X assumes values in A at each step i, i.e. X_i = w_j; the corresponding probability is p(X_i = w_j). Constraint: total probability is 1 at each step i, i.e. Σ_j p(X_i = w_j) = 1.

28 Probability and Language Modeling. Notice that time points can be represented as states of the emitting source. An output w_i can be considered as emitted by the source in a given state X_i, given a certain history.

29 Probability and Language Modeling. Notice that time points can be represented as states of the emitting source. An output w_i can be considered as emitted by the source in a given state X_i, given a certain history.

30 Probability and Language Modeling. Formally: P(X_i = w_i, X_{i-1} = w_{i-1}, ..., X_1 = w_1) =

31 Probability and Language Modeling. Formally: P(X_i = w_i, X_{i-1} = w_{i-1}, ..., X_1 = w_1) = P(X_i = w_i | X_{i-1} = w_{i-1}, X_{i-2} = w_{i-2}, ..., X_1 = w_1) · P(X_{i-1} = w_{i-1}, X_{i-2} = w_{i-2}, ..., X_1 = w_1)

32 Probability and Language Modeling. What's in a state? The n-1 preceding words: n-gram language models (states ..., 3, 2, 1 holding "the", "black", "dog"). p(the, black, dog) = p(dog | the, black) ...

33 Probability and Language Modeling. What's in a state? The n-1 preceding words: n-gram language models. p(the, black, dog) = p(dog | the, black) p(black | the) p(the)
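A minimal sketch of the chain-rule factorization above. The probability values below are hypothetical placeholders rather than corpus estimates; only the factorization itself follows the slide.

```python
# Sketch of the chain-rule factorization shown above:
#   p(the, black, dog) = p(the) * p(black | the) * p(dog | the, black).
# The probability values are hypothetical placeholders, not corpus estimates.

p_hist = {
    ("the",): 0.1,                 # p(the)
    ("black", "the"): 0.02,        # p(black | the)
    ("dog", "the", "black"): 0.3,  # p(dog | the, black)
}

def chain_rule_prob(words):
    """Multiply the conditional probabilities of each word given its full history."""
    p = 1.0
    for i, w in enumerate(words):
        history = tuple(words[:i])
        p *= p_hist[(w,) + history]
    return p

print(chain_rule_prob(["the", "black", "dog"]))  # 0.1 * 0.02 * 0.3 = 0.0006
```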

34 Probability and Language Modeling. What's in a state? The preceding POS tags: stochastic taggers (e.g. a state recording POS1 = Det, POS2 = Adj for the history "the black", with "dog" to be predicted).

35 Probability and Language Modeling. What's in a state? The preceding POS tags: stochastic taggers. p(the_DT, black_ADJ, dog_N) = p(dog_N | the_DT, black_ADJ) ...

36 Probability and Language Modeling. What's in a state? The preceding partial parses: stochastic grammars (e.g. a state recording the partial NP structure built over "the black" before "dog" is predicted).

37 Probability and Language Modeling. What's in a state? The preceding partial parses: stochastic grammars. p((the_Det, (black_ADJ, dog_N)_NP)_NP) =

38 Probability and Language Modeling. What's in a state? The preceding partial parses: stochastic grammars. p((the_Det, (black_ADJ, dog_N)_NP)_NP) = p(dog_N | ((the_Det), (black_ADJ, _))) ...

39 Probability and Language Modeling (2). Expressivity: the predictivity of a statistical device can provide a very good explanatory model of the source information; simpler and systematic induction; simpler and augmented descriptions (e.g. grammatical preferences); optimized coverage (i.e. better on more important phenomena).

40 Probability and Language Modeling (2). Expressivity: the predictivity of a statistical device can provide a very good explanatory model of the source information; simpler and systematic induction; simpler and augmented descriptions (e.g. grammatical preferences); optimized coverage (i.e. better on more important phenomena). Integrating linguistic description: either start with poor assumptions and approximate as much as possible what is known (evaluate performance only), or bias the statistical model from the beginning and check the results on linguistic grounds.

41 Probability and Language Modeling (3) Advantages: Performances Faster Processing

42 Probability and Language Modeling (3) Advantages: Performances Faster Processing Faster Design

43 Probability and Language Modeling (3) Advantages: Performances Faster Processing Faster Design Linguistic Adequacy Acceptance Psychological Plausibility Explanatory power

44 Probability and Language Modeling (3) Advantages: Performances Faster Processing Faster Design Linguistic Adequacy Acceptance Psychological Plausibility Explanatory power Tools for further analysis of Linguistic Data

45 Markov Models. Suppose X_1, X_2, ..., X_T form a sequence of random variables taking values in a countable set W = {p_1, p_2, ..., p_N} (the state space). Limited Horizon property: P(X_{t+1} = p_k | X_1, ..., X_t) = P(X_{t+1} = p_k | X_t)

46 Markov Models. Suppose X_1, X_2, ..., X_T form a sequence of random variables taking values in a countable set W = {p_1, p_2, ..., p_N} (the state space). Limited Horizon property: P(X_{t+1} = p_k | X_1, ..., X_t) = P(X_{t+1} = p_k | X_t). Time invariance: P(X_{t+1} = p_k | X_t = p_l) = P(X_2 = p_k | X_1 = p_l) for all t (≥ 1)

47 Markov Models. Suppose X_1, X_2, ..., X_T form a sequence of random variables taking values in a countable set W = {p_1, p_2, ..., p_N} (the state space). Limited Horizon property: P(X_{t+1} = p_k | X_1, ..., X_t) = P(X_{t+1} = p_k | X_t). Time invariance: P(X_{t+1} = p_k | X_t = p_l) = P(X_2 = p_k | X_1 = p_l) for all t (≥ 1). It follows that the sequence X_1, X_2, ..., X_T is a Markov chain.

48 Representation of a Markov Chain. Matrix representation: a (transition) matrix A with a_ij = P(X_{t+1} = p_j | X_t = p_i). Note that a_ij ≥ 0 for all i, j and Σ_j a_ij = 1 for all i.

49 Representation of a Markov Chain. Matrix representation: a (transition) matrix A with a_ij = P(X_{t+1} = p_j | X_t = p_i). Note that a_ij ≥ 0 for all i, j and Σ_j a_ij = 1 for all i. Initial state description (i.e. probabilities of the initial states): π_i = P(X_1 = p_i), with Σ_{i=1}^{N} π_i = 1.
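A minimal sketch of this matrix representation: a transition matrix A whose rows sum to 1 and an initial vector π. The 3-state chain below is a hypothetical example; sequence probabilities and the marginal state distribution at step t follow directly from A and π.

```python
import numpy as np

# Matrix representation of a Markov chain: a_ij = P(X_{t+1}=p_j | X_t=p_i),
# rows of A sum to 1, and pi_i = P(X_1 = p_i). The numbers are hypothetical.

A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
pi = np.array([1.0, 0.0, 0.0])

assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

def sequence_prob(states):
    """P(X_1=s1, ..., X_n=sn) = pi[s1] * prod_t A[s_t, s_{t+1}]."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

def state_distribution(t):
    """Marginal distribution over states at time step t (t >= 1): pi @ A^(t-1)."""
    return pi @ np.linalg.matrix_power(A, t - 1)

print(sequence_prob([0, 1, 1, 2]))   # pi[0] * A[0,1] * A[1,1] * A[1,2]
print(state_distribution(3))         # P(X_3 = p_j) for each j
```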

50 Representation of a Markov Chain. Graphical representation (i.e. automata): states as nodes with names; transitions from state i to state j as arcs labelled by the conditional probabilities P(X_{t+1} = p_j | X_t = p_i). Note that zero-probability arcs are omitted from the graph. (Example automaton with states S_1, S_2, ...)

51 Representation of a Markov Chain. Graphical representation (example with states p_1, ..., p_4, p_1 being the start state): P(X_1 = p_1) = 1; P(X_k = p_3 | X_{k-1} = p_2) = 0.7 for all k; P(X_k = p_4 | X_{k-1} = p_1) = 0 for all k (so that arc is omitted from the graph).

52 Hidden Markov Models. A simple example of Hidden Markov Model: the Crazy Coffee Machine. Two states: Tea Preferring (TP) and Coffee Preferring (CP); the machine switches from one state to the other randomly. It is a simple (or visible) Markov model if and only if the machine outputs Tea while in TP and Coffee while in CP. What we need is a description of the random event of switching from one state to another. More formally, for each time step n and pair of states p_i and p_j we need the conditional probabilities P(X_{n+1} = p_j | X_n = p_i), where p_i and p_j range over the two states TP and CP.

53 Hidden Markov Models. Crazy Coffee Machine: assume, for example, the following state transition model: from TP, P(TP | TP) = 0.7 and P(CP | TP) = 0.3; from CP, P(TP | CP) = 0.5 and P(CP | CP) = 0.5; and let CP be the starting state (i.e. π_CP = 1, π_TP = 0). Potential use: 1. What is the probability of being in state TP at time step 3? 2. What is the probability of being in state TP at time step n? 3. What is the probability of the output sequence (Coffee, Tea, Coffee)?

54 Hidden Markov Models. Crazy Coffee Machine, graphical representation: states TP and CP, with CP as the start state; arcs labelled with the transition probabilities above (e.g. 0.3 from TP to CP and 0.5 from CP).

55 Hidden Markov Models. Crazy Coffee Machine, solution to Problem 1: P(X_3 = TP) (given by the chains (CP,CP,TP) and (CP,TP,TP)) = P(X_1 = CP) P(X_2 = CP | X_1 = CP) P(X_3 = TP | X_1 = CP, X_2 = CP) + P(X_1 = CP) P(X_2 = TP | X_1 = CP) P(X_3 = TP | X_1 = CP, X_2 = TP) = P(CP) P(CP | CP) P(TP | CP, CP) + P(CP) P(TP | CP) P(TP | CP, TP) = P(CP) P(CP | CP) P(TP | CP) + P(CP) P(TP | CP) P(TP | TP) = 1 · 0.5 · 0.5 + 1 · 0.5 · 0.7 = 0.25 + 0.35 = 0.60

56 Hidden Markov Models. Crazy Coffee Machine, solution to Problem 2: P(X_n = TP) = Σ_{(CP, p_2, ..., p_{n-1}, TP)} P(X_1 = CP) P(X_2 = p_2 | X_1 = CP) P(X_3 = p_3 | X_1 = CP, X_2 = p_2) ... P(X_n = TP | X_1 = CP, X_2 = p_2, ..., X_{n-1} = p_{n-1}) = Σ_{(CP, p_2, ..., TP)} P(CP) P(p_2 | CP) P(p_3 | p_2) ... P(TP | p_{n-1}) = Σ_{(CP, p_2, ..., TP)} P(CP) Π_{t=1}^{n-1} P(p_{t+1} | p_t) = Σ_{(p_1, ..., p_n)} P(p_1) Π_{t=1}^{n-1} P(p_{t+1} | p_t)

57 Hidden Markov Models. Crazy Coffee Machine, solution to Problem 3: P(Cof, Tea, Cof) = P(Cof) · P(Tea | Cof) · P(Cof | Tea) = 1 · 0.5 · 0.3 = 0.15
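A short numeric check of the three problems above, assuming the transition model given earlier (P(TP|TP)=0.7, P(CP|TP)=0.3, P(TP|CP)=0.5, P(CP|CP)=0.5) and CP as the start state.

```python
import numpy as np

# Visible-Markov-model questions for the coffee machine, assuming the
# transition values given in the slides above (start state CP).

TP, CP = 0, 1
A = np.array([[0.7, 0.3],   # from TP: to TP, to CP
              [0.5, 0.5]])  # from CP: to TP, to CP
pi = np.array([0.0, 1.0])   # pi_CP = 1

# Problems 1 and 2: P(X_t = TP) = (pi @ A^(t-1))[TP]
print((pi @ np.linalg.matrix_power(A, 2))[TP])   # t = 3  -> 0.60

# Problem 3: in the visible model, output Coffee <-> state CP and Tea <-> TP,
# so P(Cof, Tea, Cof) is the probability of the state chain (CP, TP, CP).
print(pi[CP] * A[CP, TP] * A[TP, CP])            # 1 * 0.5 * 0.3 = 0.15
```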

58 Hidden Markov Models. A simple example of Hidden Markov Model (2): Crazy Coffee Machine. Hidden Markov model: the machine may output Tea, Coffee or Cappuccino in either state, independently of whether it is in CP or TP. What we need is a description of the random event of outputting a drink.

59 Hidden Markov Models. Crazy Coffee Machine: a description of the random event of outputting a drink. Formally, for each time step n and each kind of output in O = {Tea, Cof, Cap}, we need the conditional probabilities P(O_n = k | X_n = p_i, X_{n+1} = p_j), where k is one of the values Tea, Coffee or Cappuccino. This matrix is called the output matrix of the machine (or of its Hidden Markov Model).

60 Hidden Markov Models. A simple example of Hidden Markov Model (2): Crazy Coffee Machine. Given the following output probability matrix for the machine (rows TP and CP; columns Tea, Coffee, Cappuccino), and letting CP be the starting state (i.e. π_CP = 1, π_TP = 0), find the following output probabilities:

61 Hidden Markov Models. A simple example of Hidden Markov Model (2): Crazy Coffee Machine. Given the following output probability matrix for the machine (rows TP and CP; columns Tea, Coffee, Cappuccino), and letting CP be the starting state (i.e. π_CP = 1, π_TP = 0), find the following output probabilities: 1. (Cappuccino, Coffee), given that the state sequence is (CP, TP, TP)

62 Hidden Markov Models. A simple example of Hidden Markov Model (2): Crazy Coffee Machine. Given the following output probability matrix for the machine (rows TP and CP; columns Tea, Coffee, Cappuccino), and letting CP be the starting state (i.e. π_CP = 1, π_TP = 0), find the following output probabilities: 1. (Cappuccino, Coffee), given that the state sequence is (CP, TP, TP); 2. (Tea, Coffee), for any state sequence

63 Hidden Markov Models. A simple example of Hidden Markov Model (2): Crazy Coffee Machine. Given the following output probability matrix for the machine (rows TP and CP; columns Tea, Coffee, Cappuccino), and letting CP be the starting state (i.e. π_CP = 1, π_TP = 0), find the following output probabilities: 1. (Cappuccino, Coffee), given that the state sequence is (CP, TP, TP); 2. (Tea, Coffee), for any state sequence; 3. a generic output O = (o_1, ..., o_n), for any state sequence

64 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 1: for the given state sequence X = (CP, TP, TP), P(O_1 = Cap, O_2 = Cof, X_1 = CP, X_2 = TP, X_3 = TP) = P(O_1 = Cap, O_2 = Cof | X_1 = CP, X_2 = TP, X_3 = TP) P(X_1 = CP, X_2 = TP, X_3 = TP) = P(Cap, Cof | CP, TP, TP) P(CP, TP, TP)

65 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 1: for the given state sequence X = (CP, TP, TP), P(O_1 = Cap, O_2 = Cof, X_1 = CP, X_2 = TP, X_3 = TP) = P(O_1 = Cap, O_2 = Cof | X_1 = CP, X_2 = TP, X_3 = TP) P(X_1 = CP, X_2 = TP, X_3 = TP) = P(Cap, Cof | CP, TP, TP) P(CP, TP, TP). Now: P(Cap, Cof | CP, TP, TP) is the probability of outputting Cap, Cof during the transitions from CP to TP and from TP to TP, and

66 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 1: for the given state sequence X = (CP, TP, TP), P(O_1 = Cap, O_2 = Cof, X_1 = CP, X_2 = TP, X_3 = TP) = P(O_1 = Cap, O_2 = Cof | X_1 = CP, X_2 = TP, X_3 = TP) P(X_1 = CP, X_2 = TP, X_3 = TP) = P(Cap, Cof | CP, TP, TP) P(CP, TP, TP). Now: P(Cap, Cof | CP, TP, TP) is the probability of outputting Cap, Cof during the transitions from CP to TP and from TP to TP, and P(CP, TP, TP) is the probability of the transition chain. Therefore,

67 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 1: for the given state sequence X = (CP, TP, TP), P(O_1 = Cap, O_2 = Cof, X_1 = CP, X_2 = TP, X_3 = TP) = P(O_1 = Cap, O_2 = Cof | X_1 = CP, X_2 = TP, X_3 = TP) P(X_1 = CP, X_2 = TP, X_3 = TP) = P(Cap, Cof | CP, TP, TP) P(CP, TP, TP). Now: P(Cap, Cof | CP, TP, TP) is the probability of outputting Cap, Cof during the transitions from CP to TP and from TP to TP, and P(CP, TP, TP) is the probability of the transition chain. Therefore, = P(Cap | CP, TP) P(Cof | TP, TP) = (in our simplified model) = P(Cap | CP) P(Cof | TP) = 0.04

68 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2: summing over all sequences of three states X = (X_1, X_2, X_3), P(Tea, Cof) = (as the state sequences form a partition of the sample space) = Σ_{X_1, X_2, X_3} P(Tea, Cof | X_1, X_2, X_3) P(X_1, X_2, X_3), where

69 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2: summing over all sequences of three states X = (X_1, X_2, X_3), P(Tea, Cof) = (as the state sequences form a partition of the sample space) = Σ_{X_1, X_2, X_3} P(Tea, Cof | X_1, X_2, X_3) P(X_1, X_2, X_3), where P(Tea, Cof | X_1, X_2, X_3) = P(Tea | X_1, X_2) P(Cof | X_2, X_3) = (for the simplified model of the coffee machine) = P(Tea | X_1) P(Cof | X_2)

70 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2: summing over all sequences of three states X = (X_1, X_2, X_3), P(Tea, Cof) = Σ_{X_1, X_2, X_3} P(Tea, Cof | X_1, X_2, X_3) P(X_1, X_2, X_3), where P(Tea, Cof | X_1, X_2, X_3) = P(Tea | X_1, X_2) P(Cof | X_2, X_3) = (for the simplified model of the coffee machine) = P(Tea | X_1) P(Cof | X_2), and (by the Markov assumption) P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)

71 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2 (cont'd): since CP is the starting state, only the following transition chains contribute: (CP,CP,CP), (CP,TP,CP), (CP,CP,TP), (CP,TP,TP)

72 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2 (cont'd), summing over the four chains: P(Tea, Cof) = P(Tea | CP) P(Cof | CP) P(CP) P(CP | CP) P(CP | CP) [chain (CP,CP,CP)] + P(Tea | CP) P(Cof | TP) P(CP) P(TP | CP) P(CP | TP) [chain (CP,TP,CP)] + P(Tea | CP) P(Cof | CP) P(CP) P(CP | CP) P(TP | CP) [chain (CP,CP,TP)] + P(Tea | CP) P(Cof | TP) P(CP) P(TP | CP) P(TP | TP) [chain (CP,TP,TP)]

73 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2 (cont'd): the same sum, with the transition probabilities of the machine substituted into each term.

74 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution for Problem 2 (cont'd): the fully evaluated sum, obtained by also substituting the output probabilities of the machine.

75 Hidden Markov Models. A simple example of Hidden Markov Model (2). Solution to Problem 3 (Likelihood): in the general case, the probability of a sequence of n symbols O = (o_1, ..., o_n) emitted along any sequence of n+1 states X = (p_1, ..., p_{n+1}) is: P(O) = Σ_{p_1, ..., p_{n+1}} P(O | X) P(X) = Σ_{p_1, ..., p_{n+1}} P(p_1) Π_{t=1}^{n} P(O_t | p_t, p_{t+1}) P(p_{t+1} | p_t)
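The likelihood formula above can be evaluated by brute force for short sequences, enumerating every state chain. In the sketch below the transition probabilities are those of the coffee machine, while the emission values are hypothetical placeholders (the example's output matrix is not fixed here); the simplified model in which the output depends only on the current state is used, as in the slides.

```python
from itertools import product

# Brute-force sketch of the likelihood formula above:
#   P(O) = sum over state chains of  P(p_1) * prod_t P(O_t | p_t) * P(p_{t+1} | p_t),
# with the simplified model in which the output depends on the current state only.
# The emission values are hypothetical placeholders; the transitions are the
# coffee-machine values used earlier.

states = ["TP", "CP"]
pi = {"TP": 0.0, "CP": 1.0}
trans = {("TP", "TP"): 0.7, ("TP", "CP"): 0.3,
         ("CP", "TP"): 0.5, ("CP", "CP"): 0.5}
emit = {("TP", "Tea"): 0.7, ("TP", "Cof"): 0.2, ("TP", "Cap"): 0.1,   # hypothetical
        ("CP", "Tea"): 0.1, ("CP", "Cof"): 0.6, ("CP", "Cap"): 0.3}   # hypothetical

def likelihood(obs):
    total = 0.0
    # n observations are emitted along the n transitions between n+1 states
    for chain in product(states, repeat=len(obs) + 1):
        p = pi[chain[0]]
        for t, o in enumerate(obs):
            p *= emit[(chain[t], o)] * trans[(chain[t], chain[t + 1])]
        total += p
    return total

print(likelihood(["Tea", "Cof"]))
```

This enumeration grows exponentially with the sequence length, which is exactly the motivation for the Forward algorithm discussed below.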

76 Advantages. Modeling linguistic tasks as stochastic processes. Outline: a powerful mathematical framework; fundamental questions for stochastic models; modeling POS tagging via HMMs.

77 Advantages. Mathematical methods for HMMs: the complexity of training and decoding can be limited by the use of optimization techniques; parameter estimation via Expectation Maximization (EM); language modeling (likelihood computation) via dynamic programming (Forward algorithm, O(n)); tagging/decoding via dynamic programming (Viterbi, O(n)). A relevant issue is the availability of source data: supervised training cannot always be applied.

78 HMM and POS tagging. The task of POS tagging: given a sequence of morphemes w_1, ..., w_n with ambiguous syntactic descriptions (i.e. part-of-speech tags) t_j, compute the sequence of n POS tags t_{j_1}, ..., t_{j_n} that correspondingly characterizes all the words w_i.

79 HMM and POS tagging. The task of POS tagging: given a sequence of morphemes w_1, ..., w_n with ambiguous syntactic descriptions (i.e. part-of-speech tags) t_j, compute the sequence of n POS tags t_{j_1}, ..., t_{j_n} that correspondingly characterizes all the words w_i. Example: "Secretariat is expected to race tomorrow" can be tagged NNP VBZ VBN TO VB NR or NNP VBZ VBN TO NN NR (the ambiguity is on "race": VB vs. NN).

80 HMM and POS tagging. Given a sequence of morphemes w_1, ..., w_n with ambiguous syntactic descriptions (i.e. part-of-speech tags), derive the sequence of n POS tags t_1, ..., t_n that maximizes the probability P(w_1, ..., w_n, t_1, ..., t_n), that is: (t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(w_1, ..., w_n, pos_1, ..., pos_n)

81 HMM and POS tagging. Given a sequence of morphemes w_1, ..., w_n with ambiguous syntactic descriptions (i.e. part-of-speech tags), derive the sequence of n POS tags t_1, ..., t_n that maximizes the probability P(w_1, ..., w_n, t_1, ..., t_n), that is: (t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(w_1, ..., w_n, pos_1, ..., pos_n). Note that this is equivalent to: (t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(pos_1, ..., pos_n | w_1, ..., w_n), as P(w_1, ..., w_n, pos_1, ..., pos_n) / P(w_1, ..., w_n) = P(pos_1, ..., pos_n | w_1, ..., w_n) and P(w_1, ..., w_n) is the same for all the sequences (pos_1, ..., pos_n).

82 HMM and POS tagging. How to map a POS tagging problem into an HMM: the above problem (t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(pos_1, ..., pos_n | w_1, ..., w_n) can also be written (by Bayes' law) as: (t_1, ..., t_n) = argmax_{pos_1, ..., pos_n} P(w_1, ..., w_n | pos_1, ..., pos_n) P(pos_1, ..., pos_n)

83 HMM and POS tagging. The HMM model of POS tagging: HMM states are mapped into POS tags (t_i), so that P(t_1, ..., t_n) = P(t_1) P(t_2 | t_1) ... P(t_n | t_{n-1}); HMM output symbols are words, so that P(w_1, ..., w_n | t_1, ..., t_n) = Π_{i=1}^{n} P(w_i | t_i); transitions correspond to moving from one word position to the next. Note that the Markov assumption is used to model the probability of the tag in position i (i.e. t_i) only by means of the preceding part-of-speech (i.e. t_{i-1}), and to model the probabilities of words (i.e. w_i) based only on the tag (t_i) appearing in that position i.

84 HMM and POS tagging. The final equation is thus: (t_1, ..., t_n) = argmax_{t_1, ..., t_n} P(t_1, ..., t_n | w_1, ..., w_n) = argmax_{t_1, ..., t_n} Π_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

85 HMM and POS tagging. Fundamental questions for HMMs in POS tagging: 1. Given a sample of the output sequences and a space of possible models, how to find the best model, i.e. the model that best explains the data (how to estimate the parameters?). 2. Given a model and an observable output sequence O (i.e. the words), how to determine the sequence of states (t_1, ..., t_n) that best explains the observation O (the Decoding problem). 3. Given a model, what is the probability of an output sequence O (Computing the Likelihood).

86 HMM and POS tagging. Fundamental questions for HMMs in POS tagging: 1. The Baum-Welch (or Forward-Backward) algorithm, a special case of Expectation Maximization estimation; weakly supervised or even unsupervised; problem: local maxima can be reached when the initial data are poor. 2. The Viterbi algorithm, which finds the most probable tag sequence given the observation; linear in the sequence length. 3. Less relevant for POS tagging, where (w_1, ..., w_n) are always known; solved with a trellis and a dynamic programming technique.

87 HMM and POS tagging HMM and POS tagging Advantages for adopting HMM in POS tagging An elegant and well founded theory Training algorithms: Estimation via EM (Baum-Welch) Unsupervised (or possibly weakly supervised) Fast Inference algorithms: Viterbi algorithm Linear wrt the sequence length (O(n)) Sound methods for comparing different models and estimations (e.g. cross-entropy)

88 Viterbi. Forward algorithm: in computing the likelihood P(O) of an observation we need to sum the probabilities of all paths in a Markov model. Brute-force computation is not applicable in most cases. The Forward algorithm is an application of dynamic programming.

89 Viterbi Forward algorithm

90 Viterbi HMM and POS tagging: Forward Algorithm
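A sketch of the Forward algorithm in its standard dynamic-programming (alpha-recursion) form, where alpha_t(j) = P(o_1, ..., o_t, X_t = j); the toy model parameters at the bottom are hypothetical and only illustrate the call.

```python
# Forward algorithm (dynamic programming over the trellis):
#   alpha_1(j) = pi_j * b_j(o_1)
#   alpha_t(j) = ( sum_i alpha_{t-1}(i) * a_ij ) * b_j(o_t)
#   P(O)       = sum_j alpha_n(j)

def forward(obs, states, pi, trans, emit):
    alpha = [{s: pi[s] * emit[s].get(obs[0], 0.0) for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * trans[i][j] for i in states) * emit[j].get(o, 0.0)
            for j in states
        })
    return sum(alpha[-1].values())   # P(O)

# Tiny hypothetical model, just to show the call signature.
states = ["TP", "CP"]
pi = {"TP": 0.0, "CP": 1.0}
trans = {"TP": {"TP": 0.7, "CP": 0.3}, "CP": {"TP": 0.5, "CP": 0.5}}
emit = {"TP": {"Tea": 0.7, "Cof": 0.2, "Cap": 0.1},   # hypothetical emissions
        "CP": {"Tea": 0.1, "Cof": 0.6, "Cap": 0.3}}   # hypothetical emissions
print(forward(["Tea", "Cof", "Cap"], states, pi, trans, emit))
```

The recursion visits each trellis column once, so the cost is O(n) in the sequence length (and quadratic in the number of states), instead of the exponential cost of enumerating all paths.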

91 Viterbi. Viterbi algorithm: in decoding we need to find the most likely state sequence given an observation O. The Viterbi algorithm follows the same approach (dynamic programming) as the Forward algorithm. Viterbi scores are attached to each possible state at each position in the sequence.

92 Viterbi HMM and POS tagging: the Viterbi Algorithm
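A sketch of Viterbi decoding for the tagging model argmax Π_i P(w_i | t_i) P(t_i | t_{i-1}) discussed above, with back-pointers to recover the best tag sequence; the toy transition and emission tables are hypothetical values chosen only to make the example run.

```python
# Viterbi decoding for the HMM tagging model:
#   (t_1, ..., t_n) = argmax  prod_i P(w_i | t_i) * P(t_i | t_{i-1}),
# keeping, for each position and tag, the best score and a back-pointer.

def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    V = [{t: p_trans[start].get(t, 0.0) * p_emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda u: V[-1][u] * p_trans[u].get(t, 0.0))
            scores[t] = V[-1][best_prev] * p_trans[best_prev].get(t, 0.0) * p_emit[t].get(w, 0.0)
            ptrs[t] = best_prev
        V.append(scores)
        back.append(ptrs)
    # follow back-pointers from the best final tag
    best = max(tags, key=lambda t: V[-1][t])
    seq = [best]
    for ptrs in reversed(back[1:]):
        seq.append(ptrs[seq[-1]])
    return list(reversed(seq))

# Hypothetical toy parameters, only to illustrate the recursion.
tags = ["DT", "NN", "VBZ"]
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2}, "DT": {"NN": 0.9, "VBZ": 0.1},
           "NN": {"VBZ": 0.6, "NN": 0.3, "DT": 0.1}, "VBZ": {"DT": 0.5, "NN": 0.5}}
p_emit = {"DT": {"the": 0.7}, "NN": {"dog": 0.3, "barks": 0.05}, "VBZ": {"barks": 0.4}}
print(viterbi(["the", "dog", "barks"], tags, p_trans, p_emit))  # ['DT', 'NN', 'VBZ']
```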

93 About Parameter Estimation for POS. HMM and POS tagging: parameter estimation. Supervised methods on tagged data sets: output probabilities P(w_i | p_j) = C(w_i, p_j) / C(p_j); transition probabilities P(p_i | p_j) = C(p_i follows p_j) / C(p_j); smoothing P(w_i | p_j) = (C(w_i, p_j) + 1) / (C(p_j) + K_i) (see Manning & Schütze, Chapter 6).
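A sketch of these supervised counting estimates, computed from a tiny hypothetical tagged corpus; add-one smoothing is applied to the output probabilities with K set to the vocabulary size (one possible reading of the K_i term in the slide).

```python
from collections import Counter

# Supervised estimates from a tagged corpus:
#   P(w|t)  = C(w, t) / C(t)                 (output probabilities)
#   P(t|t') = C(t' followed by t) / C(t')    (transition probabilities)
#   add-one smoothing: P(w|t) = (C(w, t) + 1) / (C(t) + K), K = vocabulary size.
# The two-sentence corpus below is hypothetical.

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

emit_c, tag_c, trans_c = Counter(), Counter(), Counter()
vocab = set()
for sent in corpus:
    prev = "<s>"
    for w, t in sent:
        emit_c[(w, t)] += 1
        tag_c[t] += 1
        trans_c[(prev, t)] += 1
        vocab.add(w)
        prev = t
tag_c["<s>"] = len(corpus)

K = len(vocab)
def p_emit(w, t):       # add-one smoothed output probability
    return (emit_c[(w, t)] + 1) / (tag_c[t] + K)
def p_trans(t, prev):   # unsmoothed transition probability
    return trans_c[(prev, t)] / tag_c[prev]

print(p_emit("dog", "NN"), p_trans("NN", "DT"))
```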

94 About Parameter Estimation for POS. HMM and POS tagging: parameter estimation, unsupervised (few tagged data available). With a dictionary D: the P(w_i | p_j) are initially estimated from D, while the P(p_i | p_j) are randomly assigned. With equivalence classes u_L, L a set of tags (Kupiec 92): P(w_i | p_L) = (1 / |u_L|) · C(u_L) / Σ_{u_L'} C(u_L'). For example, if L = {noun, verb} then u_L = {cross, drive, ...}.

95 About Parameter Estimation for POS. Other approaches to POS tagging. Church (1988): Π_{i=n}^{3} P(w_i | t_i) P(t_{i-2} | t_{i-1}, t_i) (backward); estimation from a tagged corpus (Brown); no HMM training; performance > 95%. DeRose (1988): Π_{i=1}^{n} P(w_i | t_i) P(t_{i-1} | t_i) (forward); estimation from a tagged corpus (Brown); no HMM training; performance 95%. Merialdo et al. (1992): ML estimation vs. Viterbi training; proposes an incremental approach, a small amount of tagging followed by Viterbi training of Π_{i=1}^{n} P(w_i | t_i) P(t_{i+1} | t_i, w_i) ???

96 About Parameter Estimation for POS. POS tagging: references.
Jelinek, F. (1997). Statistical Methods for Speech Recognition. Cambridge, Mass.: MIT Press.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. Chapter 6.
Church, K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text.
Rabiner, L. R. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2), 260-269.
Parameter estimation (slides): Manning, C. D., Raghavan, P. and Schütze, H., Introduction to Information Retrieval, Cambridge University Press.

97 Statistical Parsing. Outline: parsing and statistical modeling; parsing models; probabilistic context-free grammars; dependency-based assumptions; statistical disambiguation.

98 Parsing and Statistical modeling. Main issues in the statistical modeling of grammatical recognition: modelling the structure vs. the language; how context can be taken into account; modelling the structure vs. the derivation; which parsing algorithm, given a model.

99 Probabilistic Context-Free Grammars, PCFG. Basic steps: start from a CF grammar; attach probabilities to the rules; tune (i.e. adjust) them against training data (e.g. a treebank).

100 Probabilistic Context-Free Grammars, PCFG. Syntactic sugar (notation): N_rl denotes a nonterminal N spanning the words w_r, ..., w_l of the sentence w_1, w_2, ..., w_n; a rule N_rl → Γ has probability P(N_rl → Γ). P(s) = Σ_t P(t), where the t are the parse trees for s.

101 Probabilistic Context-Free Grammars, PCFG. Main assumptions: total probability of each nonterminal's continuations, i.e. Σ_j P(N^i → Γ_j) = 1; context independence: place invariance, i.e. P(N^j_{k(k+c)} → Γ) is the same for every position k; context-freeness, i.e. P(N^j_{rl} → Γ) is independent of w_1, ..., w_{r-1} and w_{l+1}, ..., w_n; derivation (or ancestor) independence, i.e. P(N^j_{rl} → Γ) is independent of any ancestor node outside N^j_{rl}.

102 Probabilistic Context-Free Grammars, PCFG. Main questions: how to develop the best model from observable data (training); which is the best parse for a sentence; which training data?

103 Probabilistic Context-Free Grammars, PCFG. Probabilities of trees: for the tree t built from S_13 → NP_12 VP_33, NP_12 → "the man", VP_33 → "snores", P(t) = P(S_13 → NP_12 VP_33, NP_12 → the man, VP_33 → snores) = P(α_1) P(δ_1 | α_1) P(φ_3 | α_1, δ_1) = (by the independence assumptions) P(α_1) P(δ_1) P(φ_3) = P(S → NP VP) P(NP → the man) P(VP → snores), where α_1, δ_1, φ_3 denote the three rule applications in the tree.

104 PCFG Training Supervised case: Treebanks are used. Simple ML estimation by counting subtrees Basic problems are low counts for some structures and local maxima When the target grammatical model is different (e.g. dependency based), rewriting is required before counting

105 PCFG Training. Unsupervised: given a grammar G and a set of sentences (parsed but not validated), the best model is the one maximizing the probability of the useful structures in the observable data set; parameters are adjusted via EM algorithms; ranking data according to small training data sets (Pereira and Schabes 1992).

106 Limitations of PCFGs. Poor modeling of structural dependencies, due to the context-freeness enforced through the independence assumptions. Example: NPs in subject vs. object position: NP → DT NN and NP → PRP always have the same probabilities, which does not distinguish between S → NP VP (subject) and VP → V NP (object). Lack of lexical constraints, as lexical preferences are not taken into account in the grammar.

107 Lexical Constraints and PP-attachment. Compare with: "fishermen caught tons of herring".

108 Statistical Lexicalized Parsing. A form of lexicalized PCFG is introduced in (Collins, 99): each grammar rule is lexicalized by marking the grammatical head in the RHS, e.g. VP(dumped) → V(dumped) NP(sacks) PP(into). Probabilities are not estimated directly by counting rule occurrences in treebanks; further independence assumptions are made to estimate complex RHS statistics effectively: each rule is generated by scanning the RHS left-to-right, treating this process as a generative (Markovian) model.

109 Collins' PCFGs. Lexicalized tree in the Collins (1999) model.

110 Statistical Lexicalized Parsing. In (Collins 96) a dependency-based statistical parser is proposed: the system also uses chunks (Abney, 1991) for complex NPs; probabilities are assigned to head-modifier dependencies (R, <hw_j, ht_j>, <w_k, t_k>): P(...) = P_H(H(hw, ht) | R, hw, ht) · Π_i P_L(L_i(lw_i, lt_i) | R, H, hw, ht) · Π_i P_R(R_i(rw_i, rt_i) | R, H, hw, ht); dependency relationships are considered independent; supervised training from treebanks is much more effective.

111 Evaluation of Parsing Systems
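Parser evaluation in the results table that follows is based on PARSEVAL-style labelled precision and recall over constituent spans; the sketch below shows how these two figures can be computed from gold and predicted (label, start, end) spans, using made-up spans as input.

```python
# PARSEVAL-style labelled precision / recall over labelled constituent spans,
# each span encoded as (label, start, end). The spans below are made up.

def labelled_pr(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
lp, lr = labelled_pr(gold, pred)
print(f"LP={lp:.2f}  LR={lr:.2f}")   # 3 of 4 spans correct -> 0.75, 0.75
```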

112 Results. Performances of different probabilistic parsers (sentences of at most 40 words), reported in terms of labelled recall (%LR), labelled precision (%LP), average crossing brackets (CB) and percentage of sentences with zero crossing brackets (% 0 CB), for Charniak (1996), Collins (1996), Charniak (1997) and Collins (1997).

113 Further Issues in statistical grammatical recognition. Syntactic disambiguation: PP-attachment; subcategorization frames, their induction and use. Semantic disambiguation. Word clustering for better estimation (e.g. against data sparseness).

114 Disambiguation of Prepositional Phrase attachment. Purely probabilistic model (Hindle and Rooth, 1991, 1993): VP NP PP sequences (i.e. verb vs. noun attachment); the bigram probabilities P(prep | noun) and P(prep | verb) are compared (via a χ² test); smoothing of low counts is applied against sparse-data problems.
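A sketch of the bigram comparison just described: the PP is attached to the noun when P(prep | noun) exceeds P(prep | verb), both estimated from attachment counts. The counts below are hypothetical; note that, because only (noun, prep) and (verb, prep) pairs are used, the same decision is made whatever the object of the preposition is, which is the limitation addressed by the class-based model on the next slide. Hindle and Rooth also test how reliable the difference is; only the raw comparison is shown here.

```python
# Bigram comparison for PP attachment: attach to the noun if
# P(prep | noun) > P(prep | verb), both estimated from counts.
# All counts below are hypothetical placeholders.

counts = {
    ("borsa", "in"): 40, ("borsa", "*any*"): 200,      # hypothetical counts
    ("portare", "in"): 15, ("portare", "*any*"): 500,  # hypothetical counts
}

def p_prep_given(head, prep):
    """Relative frequency of the preposition among all PPs seen with this head."""
    return counts.get((head, prep), 0) / counts[(head, "*any*")]

def attach(verb, noun, prep):
    p_noun, p_verb = p_prep_given(noun, prep), p_prep_given(verb, prep)
    return "noun attachment" if p_noun > p_verb else "verb attachment"

print(attach("portare", "borsa", "in"))   # with these counts: noun attachment
```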

115 Disambiguation of Prepositional Phrase attachment. A hybrid model (Basili et al., 1993, 1995): multiple attachment sites in a dependency framework; semantic classes Γ for words are available; semantic trigram probabilities, i.e. P(prep, Γ | noun) and P(prep, Γ | verb), are compared; smoothing is not required, as the Γ are classes and low counts are less of a problem.

116 Statistical Parsing: References.
Abney, S. P. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23(4).
Basili, R., Pazienza, M. T., Velardi, P. (1993). Semi-automatic extraction of linguistic information for syntactic disambiguation. Applied Artificial Intelligence, 7, 339-364.
Bikel, D., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97.
Black, E., Abney, S. P., Flickinger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M. P., Roukos, S., Santorini, B., and Strzalkowski, T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, Pacific Grove, CA. Morgan Kaufmann.
Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL 00), Seattle, Washington.
Chelba, C. and Jelinek, F. (2000). Structured language modeling. Computer Speech and Language, 14.
Collins, M. J. (1999). Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.
Hindle, D. and Rooth, M. (1991). Structural ambiguity and lexical relations. In Proceedings of the 29th ACL, Berkeley, CA. ACL.

Stochastic Parsing. Roberto Basili

Stochastic Parsing. Roberto Basili Stochastic Parsing Roberto Basili Department of Computer Science, System and Production University of Roma, Tor Vergata Via Della Ricerca Scientifica s.n.c., 00133, Roma, ITALY e-mail: basili@info.uniroma2.it

More information

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Statistical NLP: Hidden Markov Models. Updated 12/15

Statistical NLP: Hidden Markov Models. Updated 12/15 Statistical NLP: Hidden Markov Models Updated 12/15 Markov Models Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications Their first

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09 Natural Language Processing : Probabilistic Context Free Grammars Updated 5/09 Motivation N-gram models and HMM Tagging only allowed us to process sentences linearly. However, even simple sentences require

More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

LECTURER: BURCU CAN Spring

LECTURER: BURCU CAN Spring LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Dept. of Linguistics, Indiana University Fall 2009

Dept. of Linguistics, Indiana University Fall 2009 1 / 14 Markov L645 Dept. of Linguistics, Indiana University Fall 2009 2 / 14 Markov (1) (review) Markov A Markov Model consists of: a finite set of statesω={s 1,...,s n }; an signal alphabetσ={σ 1,...,σ

More information

CS460/626 : Natural Language

CS460/626 : Natural Language CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 23, 24 Parsing Algorithms; Parsing in case of Ambiguity; Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th,

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a

More information

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are

More information

CS460/626 : Natural Language

CS460/626 : Natural Language CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 27 SMT Assignment; HMM recap; Probabilistic Parsing cntd) Pushpak Bhattacharyya CSE Dept., IIT Bombay 17 th March, 2011 CMU Pronunciation

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Stochastic Grammars Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(22) Structured Classification

More information

Processing/Speech, NLP and the Web

Processing/Speech, NLP and the Web CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 25 Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 14 th March, 2011 Bracketed Structure: Treebank Corpus [ S1[

More information

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 18, 2017 Recap: Probabilistic Language

More information

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 27, 2016 Recap: Probabilistic Language

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University September 27, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

Features of Statistical Parsers

Features of Statistical Parsers Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming? HMMs are dynamic

More information

Probabilistic Context-free Grammars

Probabilistic Context-free Grammars Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John

More information

Sequence Labeling: HMMs & Structured Perceptron

Sequence Labeling: HMMs & Structured Perceptron Sequence Labeling: HMMs & Structured Perceptron CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu HMM: Formal Specification Q: a finite set of N states Q = {q 0, q 1, q 2, q 3, } N N Transition

More information

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9 CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification

More information

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

Statistical Processing of Natural Language

Statistical Processing of Natural Language Statistical Processing of Natural Language and DMKM - Universitat Politècnica de Catalunya and 1 2 and 3 1. Observation Probability 2. Best State Sequence 3. Parameter Estimation 4 Graphical and Generative

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins

Probabilistic Context Free Grammars. Many slides from Michael Collins Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018. Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities

More information

Hidden Markov Models

Hidden Markov Models CS 2750: Machine Learning Hidden Markov Models Prof. Adriana Kovashka University of Pittsburgh March 21, 2016 All slides are from Ray Mooney Motivating Example: Part Of Speech Tagging Annotate each word

More information

A gentle introduction to Hidden Markov Models

A gentle introduction to Hidden Markov Models A gentle introduction to Hidden Markov Models Mark Johnson Brown University November 2009 1 / 27 Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What

More information

Probabilistic Context-Free Grammar

Probabilistic Context-Free Grammar Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, 612

More information

Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) Hidden Markov Models HMMs Raymond J. Mooney University of Texas at Austin 1 Part Of Speech Tagging Annotate each word in a sentence with a part-of-speech marker. Lowest level of syntactic analysis. John

More information

CS838-1 Advanced NLP: Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn

More information

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 CS626: NLP, Speech and the Web Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 Parsing Problem Semantics Part of Speech Tagging NLP Trinity Morph Analysis

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

A Context-Free Grammar

A Context-Free Grammar Statistical Parsing A Context-Free Grammar S VP VP Vi VP Vt VP VP PP DT NN PP PP P Vi sleeps Vt saw NN man NN dog NN telescope DT the IN with IN in Ambiguity A sentence of reasonable length can easily

More information

Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing

Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing Natural Language Processing CS 4120/6120 Spring 2017 Northeastern University David Smith with some slides from Jason Eisner & Andrew

More information

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees Recap: Lexicalized PCFGs We now need to estimate rule probabilities such as P rob(s(questioned,vt) NP(lawyer,NN) VP(questioned,Vt) S(questioned,Vt)) 6.864 (Fall 2007): Lecture 5 Parsing and Syntax III

More information

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Language Models & Hidden Markov Models

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Language Models & Hidden Markov Models 1 University of Oslo : Department of Informatics INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Language Models & Hidden Markov Models Stephan Oepen & Erik Velldal Language

More information

Lecture 9: Hidden Markov Model

Lecture 9: Hidden Markov Model Lecture 9: Hidden Markov Model Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501 Natural Language Processing 1 This lecture v Hidden Markov

More information

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course)

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course) 10. Hidden Markov Models (HMM) for Speech Processing (some slides taken from Glass and Zue course) Definition of an HMM The HMM are powerful statistical methods to characterize the observed samples of

More information

POS-Tagging. Fabian M. Suchanek

POS-Tagging. Fabian M. Suchanek POS-Tagging Fabian M. Suchanek 100 Def: POS A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role. Alizée wrote a really great

More information

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Probabilistic Context-Free Grammars. Michael Collins, Columbia University Probabilistic Context-Free Grammars Michael Collins, Columbia University Overview Probabilistic Context-Free Grammars (PCFGs) The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Statistical methods in NLP, lecture 7 Tagging and parsing

Statistical methods in NLP, lecture 7 Tagging and parsing Statistical methods in NLP, lecture 7 Tagging and parsing Richard Johansson February 25, 2014 overview of today's lecture HMM tagging recap assignment 3 PCFG recap dependency parsing VG assignment 1 overview

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Parsing. Probabilistic CFG (PCFG) Laura Kallmeyer. Winter 2017/18. Heinrich-Heine-Universität Düsseldorf 1 / 22

Parsing. Probabilistic CFG (PCFG) Laura Kallmeyer. Winter 2017/18. Heinrich-Heine-Universität Düsseldorf 1 / 22 Parsing Probabilistic CFG (PCFG) Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Winter 2017/18 1 / 22 Table of contents 1 Introduction 2 PCFG 3 Inside and outside probability 4 Parsing Jurafsky

More information

Structured Output Prediction: Generative Models

Structured Output Prediction: Generative Models Structured Output Prediction: Generative Models CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 17.3, 17.4, 17.5.1 Structured Output Prediction Supervised

More information

Hidden Markov Models (CS769 Advanced Natural Language Processing). Xiaojin Zhu.
Collapsed Variational Bayesian Inference for Hidden Markov Models. Pengyu Wang and Phil Blunsom, University of Oxford.
CS626-449: Speech, NLP and the Web / Topics in AI, Lecture 17: Probabilistic Parsing and Inside-Outside Probabilities. Pushpak Bhattacharyya, IIT Bombay.
Penn Treebank Parsing (Advanced Topics in Language Processing). Stephen Clark.
Probabilistic Context-Free Grammars and Beyond. Mark Johnson, Microsoft Research / Brown University.
Probabilistic Graphical Models: MRFs and CRFs (CSE628: Natural Language Processing). Guest lecturer: Veselin Stoyanov.
CSE 490 U Natural Language Processing, Spring 2016: Feature-Rich Models. Yejin Choi, University of Washington.
8: Hidden Markov Models (Machine Learning and Real-world Data). Helen Yannakoudakis, Computer Laboratory, University of Cambridge.
Lecture 11: Hidden Markov Models (Cognitive Systems: Machine Learning). Applied Computer Science, Bamberg University; slides by Philip Jackson.
CSC401/2511 Natural Language Computing, Spring 2019, Lecture 5. Frank Rudzicz and Chloé Pou-Prom, University of Toronto.
Hidden Markov Models: the dishonest casino example, a fair die with P(i) = 1/6 for every face versus a loaded die; a minimal illustrative sketch of this model follows the list.
Hidden Markov Models: The Three Basic HMM Problems. Mitch Marcus, CSE 391.
Natural Language Processing 1, Lecture 7: Constituent Parsing. Ivan Titov, Institute for Logic, Language and Computation.
Today's Agenda: Introduction to Statistical Models, Hidden Markov Models, Applying HMMs to POS Tagging, and Expectation-Maximization (EM).
Lecture 15: Probabilistic Models on Graph. Alan Yuille.
Chapter 4: Dynamic Bayesian Networks. Jin Gu and Michael Zhang.
CSCI-567: Machine Learning, Spring 2019. Victor Adamchik, University of Southern California.
Expectation Maximization (EM): training models with latent variables on data in which the latent variables are not observed.
Sequence Modelling. Marco Saerens, UCL.
Machine Learning for Natural Language Processing: Hidden Markov Models. Laura Kallmeyer, Heinrich-Heine-Universität Düsseldorf.
Hidden Markov Models in Language Processing. Dustin Hillard, lecture notes courtesy of Mari Ostendorf.
Lecture 7: Sequence Labeling. Julia Hockenmaier (CS447, http://courses.engr.illinois.edu/cs447).
Conditional Random Fields: An Introduction. Hanna M. Wallach, University of Pennsylvania.
Log-Linear Models, MEMMs, and CRFs. Michael Collins.
Cross-Entropy and Estimation of Probabilistic Context-Free Grammars. Anna Corazza and Giorgio Satta.
Latent Variable Models in NLP. Aria Haghighi, with Slav Petrov, John DeNero, and Dan Klein, UC Berkeley.
Brief Introduction of Machine Learning Techniques for Content Analysis: GMMs, HMMs, and SVMs. Wei-Ta Chu.
8: Hidden Markov Models (Machine Learning and Real-world Data). Simone Teufel and Ann Copestake, Computer Laboratory, University of Cambridge.
Spectral Learning for Non-Deterministic Dependency Parsing. Franco M. Luque, Ariadna Quattoni, Borja Balle, and Xavier Carreras.
1. Markov Models; 1.1 Markov Chains.
Language and Statistics II, Lecture 19: EM for Models of Structure. Noah Smith.
Graphical Models for Part-of-Speech Tagging: HMMs, Maximum Entropy Markov Models, and Conditional Random Fields. Indian Institute of Technology Bombay and India Research Lab.
Introduction to Probabilistic Natural Language Processing. Alexis Nasr, Laboratoire d'Informatique Fondamentale de Marseille.
TnT Part of Speech Tagger, by Thorsten Brants. Presented by Arghya Roy Chaudhuri, Kevin Patel, and Satyam.
Parsing with Context-Free Grammars. Berlin Chen.
HMM and Part of Speech Tagging: tagsets, rule-based, HMM-based, and transformation-based tagging. Adam Meyers, New York University.
CMPT-825 Natural Language Processing: Why Are Parsing Algorithms Important? Anoop Sarkar (http://www.cs.sfu.ca/~anoop).
Hidden Markov Model and Speech Recognition.
Maxent Models and Discriminative Estimation: generative vs. discriminative models.
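
Several of the readings above (the dishonest casino, the three basic HMM problems, the Viterbi search material) center on the same two-state construction. Purely as an illustration, here is a minimal sketch of the dishonest-casino HMM with Viterbi decoding, written in plain Python. It is not taken from any of the referenced lectures, and since the preview above is cut off, the transition probabilities and the loaded-die distribution are assumptions chosen only to make the example run.

import math

# Two hidden states emitting die faces 1..6. All numbers below are assumed
# for illustration: the fair die is uniform, the loaded die favours face 6.
states = ["Fair", "Loaded"]
start = {"Fair": 0.5, "Loaded": 0.5}
trans = {
    "Fair":   {"Fair": 0.95, "Loaded": 0.05},
    "Loaded": {"Fair": 0.10, "Loaded": 0.90},
}
emit = {
    "Fair":   {face: 1.0 / 6.0 for face in range(1, 7)},
    "Loaded": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def viterbi(rolls):
    # delta[t][s]: log-probability of the best state sequence ending in s at time t.
    delta = [{s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(rolls)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans[p][s]))
            back[t][s] = prev
            delta[t][s] = (delta[t - 1][prev] + math.log(trans[prev][s])
                           + math.log(emit[s][rolls[t]]))
    # Backtrack from the best final state to recover the full sequence.
    best = max(states, key=lambda s: delta[-1][s])
    path = [best]
    for t in range(len(rolls) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

rolls = [3, 1, 6, 6, 6, 2, 6, 6, 5, 6]
print(list(zip(rolls, viterbi(rolls))))

Log-probabilities are used so that long observation sequences do not underflow; with these assumed parameters the decoder labels the stretch dominated by sixes as Loaded, the same disambiguation-by-probability that the taggers and probabilistic grammars in these readings apply to words and parse trees.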