lecture 6: modeling sequences (final part)


1 Natural Language Processing, lecture 6: modeling sequences (final part). Ivan Titov, Institute for Logic, Language and Computation.

2 Outline. After a recap:
- a few more words about unsupervised estimation of HMMs (forward-backward)
- more on discriminative estimation (CRFs / MEMMs)
- recurrent neural networks / encoder-decoder
- syntactic parsing (PCFGs)

3 Examples of structured prediction problems (our input x and our output y):
- syntactic parsing
- protein structure prediction
- visual scene parsing

4 Examples of structured prediction problems (continued). We cannot estimate a distinct set of parameters for each y; we need to understand (1) how to break y into parts, (2) how to predict these parts, and (3) how these parts interact with each other.

5 Structured Prediction. Given a training dataset D = {x_i, y_i}_{i=1}^l, with x_i ∈ X, y_i ∈ Y, where the output is now a structured object (e.g., a graph):
- represent input-output pairs in some feature space, f : X × Y → R^n
- assume the classification rule ŷ = argmax_{y ∈ Y} w^T f(x, y), with w ∈ R^n
- estimate the parameters by optimizing some objective on the training data, for example the hinge loss (max-margin classification / SVMs):

  argmin_w (1/2) ||w||^2   s.t.   w^T f(x_i, y_i) − max_{y ∈ Y \ {y_i}} w^T f(x_i, y) ≥ 1 for all i

Here w^T f(x_i, y_i) is the score of the "gold" structure and the max term is the highest score among the remaining structures. Classification is computationally challenging, as we search through the space Y.

6 Consider the sequence labeling example. The feature space: f maps a labeled sentence to counts of the corresponding fragments, e.g. for the sentence "dogs can fly" with tags from {N, M, V}:

  f(dogs:N can:M fly:V) = (0, 1, 0, 0, ..., 1, 1, ..., 0, 1, 1, ...)^T
  f(dogs:N can:N fly:N) = (0, 1, 1, 1, ..., 0, 0, ..., 2, 0, 0, ...)^T   (features of another sequence)
  w                     = (5, 4, 2, 2, ..., 10, 5, ..., -1, 3, 3, ...)^T

We want to find weights which score all the wrong sequences below the correct one:

  w^T f(dogs:N can:M fly:V) > w^T f(dogs:N can:N fly:N)

and similarly for all (M^3 − 1) other 'wrong' sequences for this sentence.
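
To make the feature map concrete, here is a minimal Python sketch of a count-based feature function over word-tag ("emission") and tag-tag ("transition") fragments, together with the linear score w^T f(x, y). The sparse-dict representation and the feature names are illustrative choices for this sketch, not notation from the lecture.

from collections import Counter

def features(words, tags):
    """Counts of word-tag (emission) and tag-tag (transition) fragments."""
    f = Counter()
    prev = "$"                              # start symbol
    for word, tag in zip(words, tags):
        f[("emit", tag, word)] += 1
        f[("trans", prev, tag)] += 1
        prev = tag
    return f

def score(w, feats):
    """w^T f(x, y), with the weight vector w stored as a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

# The learning goal of the slide, in these terms:
# score(w, features("dogs can fly".split(), ["N", "M", "V"]))
#   should exceed score(w, features("dogs can fly".split(), ["N", "N", "N"]))
# and likewise for every other wrong tag sequence.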

7 Structured Perceptron. Return to structured prediction: ŷ = argmax_{y ∈ Y} w^T f(x, y). The perceptron algorithm, given a training set D = {x_i, y_i}_{i=1}^l, pushes the correct sequence up and the incorrectly predicted one down; it runs Viterbi during training to compute the model prediction.

  w = 0                                        // initialize
  do
    err = 0
    for i = 1 to l                             // over the training examples
      ŷ = argmax_y w^T f(x_i, y)               // model prediction
      if ( w^T f(x_i, ŷ) > w^T f(x_i, y_i) )   // if mistake
        w += f(x_i, y_i) − f(x_i, ŷ)           // update
        err++                                  // count errors
      endif
    endfor
  while ( err > 0 )                            // repeat until no errors
  return w
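
A runnable toy version of the loop above, reusing features() and score() from the previous sketch. For brevity the argmax enumerates all tag sequences instead of running Viterbi, so it is only practical for very short sentences; the data and tag set are made up for the illustration.

from itertools import product
from collections import Counter

def predict(w, words, tagset):
    """Brute-force argmax over all tag sequences (Viterbi would do this efficiently)."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: score(w, features(words, tags)))

def train_perceptron(data, tagset, max_epochs=20):
    w = Counter()                                    # sparse weight vector, w = 0
    for _ in range(max_epochs):
        errors = 0
        for words, gold in data:                     # over the training examples
            pred = predict(w, words, tagset)         # model prediction
            if list(pred) != list(gold):             # if mistake
                for k, v in features(words, gold).items():
                    w[k] += v                        # push the correct sequence up
                for k, v in features(words, pred).items():
                    w[k] -= v                        # push the predicted sequence down
                errors += 1
        if errors == 0:                              # repeat until no errors
            break
    return w

data = [("dogs can fly".split(), ("N", "M", "V")),
        ("John loves Mary".split(), ("N", "V", "N"))]
w = train_perceptron(data, tagset=["N", "V", "M"])
print(predict(w, "dogs can fly".split(), ["N", "V", "M"]))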

8 Averaged Structured Perceptron. Same loop as the structured perceptron, with two changes: we do not run until convergence (just T iterations), and we keep storing copies w_k of the weight vector as we go (w_k = w; k++). Instead of the final w we return the average (1/k) Σ_{k'=1}^{k} w_{k'}. Question: how can we compute the average (much) more memory efficiently? (One answer is sketched below.)
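
One standard answer to the question on the slide, sketched here under the same toy setup (and reusing features() and predict() from above): keep a second accumulator that records when each update happened, instead of storing every intermediate weight vector.

from collections import Counter

def train_averaged_perceptron(data, tagset, epochs=5):
    """Averaged structured perceptron without storing all the intermediate w's.

    At the update made on step c we add c * delta to `total`; at the end,
    w - total / c approximates the average of the intermediate weight vectors
    (the usual "averaging trick"), using only two vectors of memory.
    """
    w, total = Counter(), Counter()
    c = 1                                            # global step counter
    for _ in range(epochs):                          # just T iterations
        for words, gold in data:
            pred = predict(w, words, tagset)
            if list(pred) != list(gold):             # if mistake
                delta = Counter(features(words, gold))
                delta.subtract(features(words, pred))
                for k, v in delta.items():
                    w[k] += v
                    total[k] += c * v                # remember *when* the update happened
            c += 1
    return Counter({k: w[k] - total[k] / c for k in w})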

9 Generative vs discriminative on smaller datasets. For smaller training sets:
- Theoretical results: generative classifiers converge faster to their asymptotic error [Ng & Jordan, NIPS 01].
- Empirical: [figure comparing the error rates of a generative model and a discriminative classifier as a function of the number of training examples, on predicting housing prices in the Boston area].

10 Hidden Markov Models: unsupervised estimation. Let N be the number of tags and M the vocabulary size (note the change in notation from y to s). Parameters (to be estimated from the training set):
- transition probabilities a_ji = P(s_t = i | s_{t-1} = j), A an [N x N] matrix
- emission probabilities b_ik = P(x_t = k | s_t = i), B an [N x M] matrix
Training corpus:
- x^(1) = (In, an, Oct., 19, review, of, ...), y^(1) = (IN, DT, NNP, CD, NN, IN, ...)
- x^(2) = (Ms., Haag, plays, Elianti, .), y^(2) = (NNP, NNP, VBZ, NNP, .)
- ...
- x^(L) = (The, company, said, ...), y^(L) = (DT, NN, VBD, NNP, ...)
For notational convenience, let's assume that all sentences have length n. How do we estimate the parameters using maximum likelihood estimation? (You might have guessed what these estimates are.) And how do we estimate such models in the unsupervised set-up?

11 EM (intuition). Define the state posteriors γ_t^(l)(i) = P(s_t = i | x_1^(l), ..., x_n^(l)). [Figure: the sentences "John loves Mary", "Mary loves books", "John hates books", with the posterior probabilities of tags N and V at each position.] The expected emission counts and the resulting re-estimated emission probabilities are

  Ĉ_E(x, s) = Σ_{l=1}^L Σ_{t=1}^n γ_t^(l)(s) I(x_t^(l) = x)

  P(x | s) = Ĉ_E(x, s) / Σ_{x'} Ĉ_E(x', s)

where I(·) is an indicator function: it equals 1 if the condition is true and 0 otherwise. For example, P(x_t = books | s_t = N) is re-estimated from the expected count of "books" occurring under tag N.

12 EM (intuition). Analogously, define the pairwise posteriors ξ_t^(l)(i, j) = P(s_t = i, s_{t+1} = j | x_1^(l), ..., x_n^(l)). [Figure: the same three sentences, annotated with posterior probabilities for tag pairs such as $-N, $-V, N-V, V-N, N-N, V-V, N-$, V-$ at each position.] The expected transition counts and the re-estimated transition probabilities are

  Ĉ_T(i, j) = Σ_{l=1}^L Σ_{t=1}^{n-1} ξ_t^(l)(i, j)

  P(j | i) = Ĉ_T(i, j) / Σ_{j'} Ĉ_T(i, j')

for example, a_ji = P(s_t = N | s_{t-1} = V) is re-estimated from the expected count of V-N transitions. Disclaimer: the posterior distributions in this example may not satisfy natural consistency conditions; this is just an example.

13 Intuitive conclusion: we need to figure out how to compute (1) probabilistic predictions about states, γ_t(i) = P(s_t = i | x_1, ..., x_n), and (2) probabilistic predictions about pairs of states, ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n).

14 Forward-backward probabilities (picture based on one from Tommi Jaakkola).
- Forward probabilities: α_t(i) = P(x_1, ..., x_t, s_t = i)
- Backward probabilities: β_t(i) = P(x_{t+1}, ..., x_n | s_t = i)
We can think of the backward probability as evidence about the current state coming from the future observations.

15 Forward-backward probabilities. Recursion for calculating the forward probabilities α_t(i) = P(x_1, ..., x_t, s_t = i):

  α_1(i) = P(i | $) P(x_1 | i)
  α_t(i) = ( Σ_j α_{t-1}(j) P(i | j) ) P(x_t | i)

16 Forward-backward probabilities. Analogously, the recursion for calculating the backward probabilities β_t(i) = P(x_{t+1}, ..., x_n | s_t = i):

  β_n(i) = 1
  β_t(i) = Σ_j P(j | i) P(x_{t+1} | j) β_{t+1}(j)

(We assume here that the n-th symbol is $ or </s>.)
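
As a concrete illustration of the two recursions, here is a minimal numpy sketch. It assumes an HMM given by an initial distribution pi (the P(i | $) terms), a transition matrix A with A[j, i] = P(i | j), and an emission matrix B with B[i, k] = P(x = k | s = i); the names are illustrative, and a practical implementation would work with scaled or log-space values to avoid underflow.

import numpy as np

def forward_backward(x, pi, A, B):
    """Forward and backward probabilities for one observation sequence (0-indexed):
    alpha[t, i] = P(x_0, ..., x_t, s_t = i)
    beta[t, i]  = P(x_{t+1}, ..., x_{n-1} | s_t = i)
    """
    n, N = len(x), len(pi)
    alpha = np.zeros((n, N))
    beta = np.zeros((n, N))
    alpha[0] = pi * B[:, x[0]]                        # alpha_1(i) = P(i | $) P(x_1 | i)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]    # (sum_j alpha_{t-1}(j) P(i | j)) P(x_t | i)
    beta[n - 1] = 1.0                                 # beta_n(i) = 1
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # sum_j P(j | i) P(x_{t+1} | j) beta_{t+1}(j)
    return alpha, beta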

17 Forward-backward probabilities. The forward and backward probabilities, α_t(i) = P(x_1, ..., x_t, s_t = i) and β_t(i) = P(x_{t+1}, ..., x_n | s_t = i), are complementary and permit us to evaluate various probabilities: P(x_1, x_2, ..., x_n), γ_t(i) = P(s_t = i | x_1, ..., x_n), and ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n). The last two are the ones we need for EM.

18 Forward-backward probabilities. The probability of an observed sequence:

  P(x_1, ..., x_n) = Σ_i P(x_1, ..., x_n, s_t = i)
                   = Σ_i P(x_1, ..., x_t, s_t = i) P(x_{t+1}, ..., x_n | s_t = i)
                   = Σ_i α_t(i) β_t(i)        (for any t)

19 Forward-backward probabilities. The posterior probability that the HMM was in state i at time t:

  γ_t(i) = P(s_t = i | x_1, ..., x_n) = P(x_1, ..., x_n, s_t = i) / P(x_1, ..., x_n)
         = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j)

20 Forward-backward probabilities. The posterior probability of being in states i and j at times t and t+1, respectively:

  ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n)
            = α_t(i) P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) β_{t+1}(j) / Σ_{j'} α_t(j') β_t(j')

21 Putting everything together. We needed to compute (1) the state posteriors γ_t(i) = P(s_t = i | x_1, ..., x_n) and (2) the pairwise posteriors ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n). Now we know how: via the forward and backward probabilities (see also chapter 6.5 of J&M). We thus have all the components in place for EM:

  Ĉ_T(i, j) = Σ_{l=1}^L Σ_{t=1}^{n-1} ξ_t^(l)(i, j)            =>   P(j | i) = Ĉ_T(i, j) / Σ_{j'} Ĉ_T(i, j')
  Ĉ_E(x, s) = Σ_{l=1}^L Σ_{t=1}^n γ_t^(l)(s) I(x_t^(l) = x)    =>   P(x | s) = Ĉ_E(x, s) / Σ_{x'} Ĉ_E(x', s)
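
Putting the pieces into code: a minimal sketch of one EM (Baum-Welch) re-estimation step, reusing forward_backward() and the pi / A / B conventions from the sketch above. Re-estimation of the initial distribution, smoothing of zero counts and the convergence loop are omitted for brevity.

import numpy as np

def baum_welch_step(corpus, pi, A, B):
    """One EM iteration: accumulate expected counts (E-step), renormalize (M-step)."""
    N, M = B.shape
    C_T = np.zeros((N, N))                   # expected transition counts C_T(i, j)
    C_E = np.zeros((N, M))                   # expected emission counts  C_E(x, s)
    for x in corpus:                         # x: a sequence of observation indices
        alpha, beta = forward_backward(x, pi, A, B)
        Z = alpha[-1].sum()                  # P(x_1, ..., x_n)
        gamma = alpha * beta / Z             # gamma[t, i] = P(s_t = i | x)
        for t in range(len(x)):
            C_E[:, x[t]] += gamma[t]
        for t in range(len(x) - 1):
            # xi[i, j] = alpha_t(i) P(j | i) P(x_{t+1} | j) beta_{t+1}(j) / P(x)
            xi = alpha[t][:, None] * A * B[:, x[t + 1]][None, :] * beta[t + 1][None, :] / Z
            C_T += xi
    A_new = C_T / C_T.sum(axis=1, keepdims=True)     # P(j | i)
    B_new = C_E / C_E.sum(axis=1, keepdims=True)     # P(x | s)
    return A_new, B_new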

22 Viterbi / forward-backward / Baum-Welch. Have you noticed a relation between the forward algorithm and Viterbi? Roughly speaking, Viterbi selects the best previous state, whereas the forward algorithm sums over all possibilities:

  Forward:  α_t(i) = P(x_1, ..., x_t, s_t = i) = Σ_{s_1, ..., s_{t-1}} P(x_1, ..., x_t, s_1, ..., s_{t-1}, s_t = i)
            α_t(i) = P(x_t | i) ( Σ_j α_{t-1}(j) P(i | j) )                (a sum-product algorithm)

  Viterbi:  v_t(i) = max_{s_1, ..., s_{t-1}} P(x_1, ..., x_t, s_1, ..., s_{t-1}, s_t = i)
            v_t(i) = P(x_t | i) ( max_j v_{t-1}(j) P(i | j) )              (a max-product algorithm)
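
The correspondence is easy to see in code: the Viterbi recursion below is the forward recursion of the earlier sketch with the sum over previous states replaced by a max (plus backpointers). Same pi / A / B conventions as before.

import numpy as np

def viterbi(x, pi, A, B):
    """Most likely state sequence (max-product counterpart of the forward pass)."""
    n, N = len(x), len(pi)
    v = np.zeros((n, N))
    back = np.zeros((n, N), dtype=int)
    v[0] = pi * B[:, x[0]]
    for t in range(1, n):
        scores = v[t - 1][:, None] * A        # scores[j, i] = v_{t-1}(j) P(i | j)
        back[t] = scores.argmax(axis=0)       # best previous state for each i
        v[t] = scores.max(axis=0) * B[:, x[t]]
    states = [int(v[-1].argmax())]            # best final state
    for t in range(n - 1, 0, -1):             # follow the backpointers
        states.append(int(back[t, states[-1]]))
    return list(reversed(states))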

23 Viterbi / forward-backward / Baum-Welch. The forward-backward computation for HMMs passes forward probabilities and backward probabilities, which can be viewed as beliefs; the idea can be generalized to more general graphs ("belief propagation").

24 Summary so far.
- Supervised estimation: generative (ML) and feature-rich discriminative modeling (structured perceptron, conditional random fields, MEMMs).
- Unsupervised estimation: generative modeling (EM). Recently, there has also been some work on estimating feature-rich models in the unsupervised setting.

25 Outline. After a recap:
- a few more words about unsupervised estimation of HMMs (forward-backward)
- more on discriminative estimation (CRFs / MEMMs)
- recurrent neural networks / encoder-decoder
- syntactic parsing (PCFGs)

26 Discriminative estimation. Generative models not only consider the labeling but also score how likely the input sequence x_1, ..., x_n is. Discriminative models instead concentrate only on modeling how likely an output sequence y_1, ..., y_n is for a given input x_1, ..., x_n. (I switched back to using y instead of s.)

27 Discriminative estimation. We have learnt one discriminative method, the structured perceptron, but what if we want to get probabilities P(y | x)?

28 Recap: logistic regression. Logistic ("softmax") regression (aka the max-entropy classifier):

  P(y | x, w) = exp(w^T f(x, y)) / Z(x)

where Z(x) is the partition function, Z(x) = Σ_{y' ∈ Y} exp(w^T f(x, y')). (Notation abuse: we will drop w from the conditioning.) The "conditional" log-likelihood, with an l2 regularization term λ||w||_2^2 that we will ignore in the following discussion:

  L(w) = Σ_{l=1}^L log P(y^(l) | x^(l), w)
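
A minimal numpy sketch of this classifier, assuming the feature vectors f(x, y) have been precomputed for every candidate label; the function and variable names are illustrative.

import numpy as np

def softmax_probs(w, feats):
    """P(y | x, w) for a small label set.

    feats: dict mapping each label y to its feature vector f(x, y) (numpy arrays).
    """
    labels = list(feats)
    scores = np.array([w @ feats[y] for y in labels])
    scores -= scores.max()                   # for numerical stability
    exp_scores = np.exp(scores)
    Z = exp_scores.sum()                     # the partition function Z(x)
    return {y: float(s / Z) for y, s in zip(labels, exp_scores)}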

29 Recap: stochastic gradient descent. We compute the gradient ∇L(w) of the conditional log-likelihood and update the parameter vector based on it:

  w := w + η ∇L(w)

where η is the learning rate. In practice, slightly "smarter" gradient methods are normally used.

30 Recap: gradient for multi-class. Let's derive the gradient:

  ∂L(w)/∂w_i = ∂/∂w_i Σ_{l=1}^L log P(y^(l) | x^(l), w)
             = ∂/∂w_i Σ_{l=1}^L [ w^T f(x^(l), y^(l)) − log Σ_{y' ∈ Y} exp(w^T f(x^(l), y')) ]
             = Σ_{l=1}^L [ f_i(x^(l), y^(l)) − Σ_{ỹ} f_i(x^(l), ỹ) exp(w^T f(x^(l), ỹ)) / Σ_{y' ∈ Y} exp(w^T f(x^(l), y')) ]
             = Σ_{l=1}^L [ f_i(x^(l), y^(l)) − Σ_{ỹ} f_i(x^(l), ỹ) P(ỹ | x^(l), w) ]

Intuition: we are trying to find a model such that the feature expectations computed according to the model are similar to their estimates from the data.
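
The last line of the derivation translates directly into code. A small sketch reusing softmax_probs() from above: the gradient for one example is the gold feature vector minus the model's feature expectation, and one stochastic gradient (ascent) step moves w in that direction.

import numpy as np

def log_likelihood_grad(w, feats, gold):
    """Gradient of log P(gold | x, w): f(x, gold) minus the expected feature vector."""
    probs = softmax_probs(w, feats)
    expected = sum(p * feats[y] for y, p in probs.items())
    return feats[gold] - expected

def sgd_step(w, feats, gold, lr=0.1):
    """One stochastic gradient step on the conditional log-likelihood."""
    return w + lr * log_likelihood_grad(w, feats, gold)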

31 Structured case. The softmax classifier P(y | x, w) = exp(w^T f(x, y)) / Z(x): how do we generalize it to sequences?

32 Idea 1: Max-Entropy Markov Models (MEMMs). [Diagram: a chain y_1 → y_2 → y_3 → ... → y_n, where each y_t is also conditioned on the inputs x_1, ..., x_n.]

  P(y | x, w) = P(y_1 | x, w) Π_{t=2}^n P(y_t | y_{t-1}, x, w)

  P(y_t | y_{t-1}, x, w) = exp(w^T f(y_{t-1}, y_t, x, t)) / Z(y_{t-1}, x, t)

(Notation abuse: we will drop w.) Essentially, the prediction of each next label is just a classification decision.


34 Idea 1: Max-Entropy Markov Models (MEMMs), continued. The same model, now drawn with a single input node x. What are the training examples? How do we search?
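
A minimal sketch of the MEMM as code: each local distribution P(y_t | y_{t-1}, x, t) is an ordinary softmax classifier over the candidate tags, and decoding chains these decisions left to right. The feature function feat_fn (returning a vector the same length as w) is a hypothetical user-supplied component; greedy decoding is used here for brevity, whereas Viterbi or beam search over the local distributions is the usual choice.

import numpy as np

def memm_local_probs(w, x, t, prev_tag, tagset, feat_fn):
    """P(y_t | y_{t-1}, x, t): a softmax over candidate tags with features f(y_{t-1}, y_t, x, t)."""
    scores = np.array([w @ feat_fn(prev_tag, tag, x, t) for tag in tagset])
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # Z(y_{t-1}, x, t) is the denominator

def memm_greedy_decode(w, x, tagset, feat_fn, start="$"):
    """Chain the local classification decisions left to right."""
    tags, prev = [], start
    for t in range(len(x)):
        probs = memm_local_probs(w, x, t, prev, tagset, feat_fn)
        prev = tagset[int(np.argmax(probs))]
        tags.append(prev)
    return tags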

35 MEMMs: Problem 1.

  P(y_t | y_{t-1}, x, w) = exp(w^T f(y_{t-1}, y_t, x, t)) / Z(y_{t-1}, x, t)

Since the model is trained using the correct label as y_{t-1}, it will end up over-relying on it. At test time, however, y_{t-1} (unlike x) is predicted, and often incorrect. Formally, the distribution of the features at training and test time is different, and the examples are interdependent (the i.i.d. assumption standard in machine learning is broken). This is not only a theoretical problem.

36 MEMMs: Problem 1. Consider a sequence labeling problem {0,1}^3 → {0,1}^3: the input vectors are uniformly distributed; if the input is (1,1,1) the output is also (1,1,1), otherwise the output is (0,0,0). (Let's forget for now that the probabilities are computed with a softmax.) With the MEMM factorization P(y | x, w) = P(y_1 | x, w) Π_{t=2}^n P(y_t | y_{t-1}, x, w), the model learns

  P(y_1 = 1 | x_1 = 1) = 1/4,   P(y_1 = 0 | x_1 = 1) = 3/4
  P(y_2 = 0 | y_1 = 0, x_2) = 1 for all x_2        (ignores the input!)
  P(y_2 = 1 | y_1 = 1, x_2 = 1) = 1

so that P((1,1,1) | (1,1,1)) = 1/4 and P((0,0,0) | (1,1,1)) = 3/4: the wrong labeling wins.

37 MEMMs: Problem 1 (continued). One can say that this all happened because we factorized the model badly (i.e., the features are not appropriate). Yes, but:
1. We do not know if we will do better with (complex) real problems.
2. The structured perceptron (and a CRF) would learn a perfect classifier with this factorization.

38 MEMMs: Problem 2, the "label bias" problem. Suppose our set of labels is small (say 2) and imagine we perform Viterbi. [Trellis figure over times t-2, t-1, t.] Hypothesis 1 has been doing well ("it is often true"), but at time t an input arrives that is completely inconsistent with it; Hypothesis 2 is the competing path. Can the model (in principle) ensure that the red (inconsistent) state is not included in the winning path?

39 MEMMs: Problem 2 (continued). [The same trellis, now with the accumulated path probabilities filled in: the path through the red (inconsistent) state still comes out as the winner.]

40 MEMMs: Problem 2 (continued). [The same trellis; at time t each path's local probability is multiplied by 0.5.] Can the model (in principle) ensure that the red state is not included in the winning path? No, because we need to preserve the probability mass (the local probabilities sum to 1), so the inconsistent observation cannot penalize that path enough. Why is this not a serious problem for HMMs?

41 MEMMs: conclusions.
- MEMMs are easy to estimate and use; in practice, they are very often used in NLP (we will see them again in the context of parsing).
- But they have serious problems and are, consequently, brittle.
- These problems are not unique to MEMMs: they arise for any "piecewise" estimation approach (our neural / deep models as well).
- Problems very similar to Problem 1 come up when "pipelines" are used, for example text -> PoS tagging -> syntactic parsing -> semantic analysis -> dialog state prediction -> ...: errors propagate across the stages, and we should be 'careful' when training the models for the individual stages.

42 Idea 2: conditional random fields (CRFs).

  P(y | x, w) = exp(w^T Σ_{t=1}^n f(y_{t-1}, y_t, x)) / Z(x)

with the partition function

  Z(x) = Σ_{y' ∈ Y^n} exp(w^T Σ_{t=1}^n f(y'_{t-1}, y'_t, x))

where the summation is over the set of all potential labelings of the entire sentence (exponential in size). (I am not careful with START and STOP to simplify notation.) [Factor graph: y_1 - y_2 - y_3 - ... - y_n with factors w^T f(y_1, y_2, x), w^T f(y_2, y_3, x), ...] We score the entire sequence in one shot. What are the training examples? How do we search?
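
A minimal sketch of how the quantities above are computed, assuming the factor scores w^T f(y_{t-1}, y_t, x) have already been collected into arrays: init[j] is the score of y_1 = j (with START folded in) and trans[t, i, j] is the score of the factor linking positions t and t+1 (0-indexed); the names are illustrative. The partition function is computed with a forward pass in log space, so the exponentially large sum never has to be enumerated.

import numpy as np
from scipy.special import logsumexp

def crf_sequence_score(init, trans, tags):
    """w^T sum_t f(y_{t-1}, y_t, x) for one labeling, from the precomputed factor scores."""
    s = init[tags[0]]
    for t in range(len(tags) - 1):
        s += trans[t, tags[t], tags[t + 1]]
    return s

def crf_log_partition(init, trans):
    """log Z(x): a forward pass over the factor scores, in log space."""
    alpha = init.copy()
    for t in range(trans.shape[0]):
        # alpha_new[j] = logsumexp_i( alpha[i] + trans[t, i, j] )
        alpha = logsumexp(alpha[:, None] + trans[t], axis=0)
    return logsumexp(alpha)

def crf_log_prob(init, trans, tags):
    return crf_sequence_score(init, trans, tags) - crf_log_partition(init, trans)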

43 CRF (Lafferty, McCallum and Pereira, 2001)

44 Idea 2: chain conditional random field (CRF).

  P(y | x, w) = exp(w^T Σ_{t=1}^n f(y_{t-1}, y_t, x)) / Z(x)

We score the entire sequence in one shot. Do we have the i.i.d. problem (Problem 1)? Do we have the label bias problem (Problem 2)?

45 How do we estimate the model?

  P(y | x, w) = exp(w^T Σ_{t=1}^n f(y_{t-1}, y_t, x)) / Z(x),   Z(x) = Σ_{y' ∈ Y^n} exp(w^T Σ_{t=1}^n f(y'_{t-1}, y'_t, x))

Do we have any hope of computing the gradient?

  L(w) = Σ_{l=1}^L [ Σ_{t=1}^n w^T f(y^(l)_{t-1}, y^(l)_t, x^(l)) − log Z(x^(l), w) ]

  ∇L(w) = Σ_{l=1}^L Σ_{t=1}^n [ f(y^(l)_{t-1}, y^(l)_t, x^(l)) − Σ_{y, y' ∈ Y} P(y_{t-1} = y, y_t = y' | x^(l), w) f(y_{t-1} = y, y_t = y', x^(l)) ]

Again: matching the data and model expectations. How do we compute these probabilities?

46 Summary: linear models for (supervised) sequence labeling.
- HMMs. Pros: very easy to estimate; easy to generalize to un- and semi-supervised learning. Cons: low asymptotic performance.
- Structured perceptron / MEMMs. Pros: simple to implement; fast to estimate (for MEMMs, no decoding at training time). Cons: do not yield probabilities; do not optimize an objective (kind of); can be brittle (recall the problems mentioned).
- CRFs. Pros: give probabilities, are motivated by a clear objective, stable performance. Cons: harder to implement (forward-backward), expensive training (especially beyond a 1st-order Markov model).

47 Outline. After a recap:
- a few more words about unsupervised estimation of HMMs (forward-backward)
- more on discriminative estimation (CRFs / MEMMs)
- recurrent neural networks / encoder-decoder
- syntactic parsing (PCFGs)

48 Recurrent neural networks (RNNs). [Diagram: inputs x_1, ..., x_n feed a chain of hidden states which predict y_1, ..., y_n.]
- Lots of similarities to MEMMs.
- What's nice about them? They can perform fairly well with minimal feature engineering.
- What's problematic about them (vs. MEMMs)? "Vanishing gradients": gradient information is not propagated from the observations (y_t) to states far away in the past. The vanishing gradient problem is mitigated by Long Short-Term Memory networks (LSTMs), but is still a major issue.
- Decoding?
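
For concreteness, a minimal numpy sketch of the forward pass of a simple (Elman-style) RNN tagger: at each position the hidden state combines the current input with the previous hidden state, and a softmax over tags is read off the hidden state. All parameter names and shapes are illustrative.

import numpy as np

def rnn_tagger_forward(x_embeds, W_xh, W_hh, W_hy, b_h, b_y):
    """x_embeds: list of input vectors; returns a list of tag distributions, one per position."""
    h = np.zeros(W_hh.shape[0])                       # initial hidden state
    tag_probs = []
    for x_t in x_embeds:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)      # new hidden state
        scores = W_hy @ h + b_y                       # unnormalized tag scores
        scores -= scores.max()
        p = np.exp(scores)
        tag_probs.append(p / p.sum())                 # softmax over tags
    return tag_probs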

49 Encoder-decoder. [Diagram: an encoder reads x_1, ..., x_n into a single vector, from which a decoder generates y_1, ..., y_n.]
- Sounds like a crazy idea: compressing the entire sentence down to a single vector.
- More general than sequence labeling (why?).
- Decoding is approximate.
- Brittle (but some "better" ideas are around: attention models).

50 NN models vs linear models.
Linear models:
- often, exact decoding is possible
- some degree of interpretability
- easier to encode prior knowledge (?)
- convex optimization (for supervised learning)
- feature engineering is crucial
- they can be expensive (!) in practice
Representation learning models (incl. deep / neural):
- feature induction
- parameter sharing across multiple tasks / features (!!)
- (related) modeling the compositionality of language
- can be efficient (e.g., on GPUs)
- non-convex optimization
- not very interpretable
- exact decoding is not possible
- they can be more expensive (for some tasks)

51 Overview of NLP problems, structures, models, set-ups and modeling frameworks.
- NLP problems: document classification, topic analysis, shallow syntactic parsing / tagging, syntactic parsing, relation extraction, semantic parsing, models of inference, machine translation, question answering, opinion analysis, summarization, dialogue systems.
- Types of structures: bags, sequences / chains, spanning trees, hierarchical trees / DAGs, bipartite graphs.
- Models / views: Naive Bayes, topic models, HMMs, history- / transition-based models, PCFGs, DOP, global scoring (e.g., MST), "IBM" models.
- Set-ups: supervised, unsupervised, partially/semi-supervised estimation.
- Modeling frameworks: generative ML, generative Bayes, discriminative, discriminative Bayes, representation learning (factorizations / NNs).
Many problems in NLP can be cast as, or approximated with, sequence models. What about dialog systems (e.g., Siri)?
