Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

1 Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

2 Why PGMs? PGMs can model joint probabilities of many events. "Many techniques commonly used as part of a speech recognition system can be described by a graph; this includes Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models... [including] many advanced models at the acoustic-, pronunciation-, and language-modeling levels." (Bilmes, 2004)

3 ACL 2010 Papers
1. Conditional Random Fields for Word Hyphenation
2. Practical Very Large Scale CRFs
3. Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm
4. Cross-Lingual Latent Topic Extraction
5. Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
6. PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names
7. A Latent Dirichlet Allocation Method for Selectional Preferences
8. Latent Variable Models of Selectional Preference
9. A Bayesian Method for Robust Estimation of Distributional Similarities
10. Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
11. Blocked Inference in Bayesian Tree Substitution Grammars
12. Decision Detection Using Hierarchical Graphical Models

4 Today's Lecture A very brief overview of Probabilistic Graphical Models (PGMs). Undirected models: MRFs and CRFs. Learning and decoding in MRFs and CRFs. Case study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables. For a concise introduction to PGMs and their uses in NLP:

5 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

6 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. (Random variable: a variable whose value results from a measurement on some type of random process, e.g., A = Alice's grade for homework 2.) Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

7 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. (Conditional dependencies: for instance, Alice and Bob worked on the homework together, while Alice and Charlie did not.) Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

8 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. For each configuration of RVs X_1, X_2, ..., X_n, ϕ(X_1, X_2, ..., X_n) gives an affinity score. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

9 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. Example factor over grades: ϕ_1(A,B) with ϕ_1(a,a) = 30, ϕ_1(a,b) = 5, ϕ_1(a,c) = 1, ..., ϕ_1(c,c) = 10. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

10 Joint Probability via Local Factors
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

11 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

12 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors. Factors don't have to be tables; they could be functions (that could be used to produce tables), and often depend on features.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

13 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors. Factors don't have to be tables; they could be functions (that could be used to produce tables), and often depend on features.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Z is just a constant, so it is often left out of the discussion. We don't even need to calculate it if we just want to decode: find the most likely Y given X.
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).
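
To make the running example concrete, here is a minimal sketch in plain Python of computing Z, the joint, and a marginal by brute-force enumeration. The ϕ_1 and ϕ_2 values follow the tables above; ϕ_3 and ϕ_4 are not shown on the slides, so the values below are illustrative placeholders.

```python
import itertools

# Pairwise factor tables over binary variables. phi1/phi2 follow the
# slides; phi3/phi4 are illustrative placeholders (not from the slides).
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi_1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi_3(C,D), assumed
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_4(D,A), assumed

def score(a, b, c, d):
    """Unnormalized score: the product of the local factors."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function Z: sum the score over all 2^4 configurations.
Z = sum(score(*cfg) for cfg in itertools.product([0, 1], repeat=4))

def p(a, b, c, d):
    """Joint probability P(a,b,c,d) = score / Z."""
    return score(a, b, c, d) / Z

# Marginal query P(A=0): sum out B, C, D.
p_a0 = sum(p(0, b, c, d) for b, c, d in itertools.product([0, 1], repeat=3))
print(f"Z = {Z}, P(0,0,0,0) = {p(0, 0, 0, 0):.4f}, P(A=0) = {p_a0:.4f}")
```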

14 Factorization implies Independencies
P(a,b,c,d) ∝ ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a) = [ϕ_1(a,b) ϕ_2(b,c)] [ϕ_3(c,d) ϕ_4(d,a)] = F(b; a,c) G(d; a,c)
Intuition: if (a,c) are given, knowing b tells us nothing about d.
Theorem: X ⊥ Y | W iff P(X_1 ... X_N) ∝ F(X, W) G(Y, W)

15 Factorization implies Independencies
P(a,b,c,d) ∝ ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a) = [ϕ_1(a,b) ϕ_2(b,c)] [ϕ_3(c,d) ϕ_4(d,a)] = F(b; a,c) G(d; a,c)
Intuition: if (a,c) are given, knowing b tells us nothing about d.
Theorem: X ⊥ Y | W iff P(X_1 ... X_N) ∝ F(X, W) G(Y, W)
This is how we get the independencies from the factors. Can we tell just by looking at the graph?

16 Formal MN Independencies X ⊥ Y | W (X and Y are indep. given W) if we can NOT go from any X ∈ X to any Y ∈ Y without passing through W. W blocks the interaction of X and Y. Unshaded: unobserved (X, Y). Shaded: observed (W).

17 Formal MN Independencies X ⊥ Y | Z (X and Y are indep. given Z) if we can NOT go from any X ∈ X to any Y ∈ Y without touching Z. Z blocks the interaction of X and Y. To make an MN: connect X and Y if they should never be rendered independent via other RVs. Consequence: X is independent of everyone else given its neighbors (its Markov blanket). Simpler than the global independencies in BNs (d-separation; see Chapter 3.3 in KF09). You show me a graph, I can tell you what is independent of what given what. Unshaded: unobserved (X, Y). Shaded: observed (Z).

18 More on Factors: Factor Product ϕ_1(A,B) × ϕ_2(B,C) = ψ(A,B,C), where every entry ψ(a,b,c) is ϕ_1(a,b) · ϕ_2(b,c). (Figure 4.3, KF09)
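
A small sketch of the factor product in code (the dict representation is mine; the numbers reuse the toy tables above, not KF09's Figure 4.3):

```python
# Factor product: psi(A,B,C) = phi_1(A,B) * phi_2(B,C).
# Factors are dicts keyed by value tuples; entries multiply wherever
# the shared variable B agrees.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # over (A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # over (B, C)

psi = {
    (a, b, c): phi1[a, b] * phi2[b, c]
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
print(psi[0, 0, 0])  # 30 * 100 = 3000
```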

19 Log-Linear Models Factor graphs are more explicit, but still require bulky tables for all the factor values. Can use features to capture patterns that we'd like reflected in clique potentials:
P_Φ(X_1, ..., X_N) ∝ ϕ_1(D_1) ϕ_2(D_2) ... ϕ_K(D_K)
Define ϕ_i(D_i) = exp(-w_i f_i(D_i)). f_i(D_i) tells us something indicative about some RVs, e.g., f_i(Y_1, Y_2) = 1 if the values of Y_1 and Y_2 are DT and NN, else f_i(Y_1, Y_2) = 0.
So P_Φ(X_1, ..., X_N) ∝ exp(-w_1 f_1(D_1)) ... exp(-w_K f_K(D_K)) = exp(-Σ_i w_i f_i(D_i))

20 Log-Linear Models Factor graphs are more explicit, but still require bulky tables for all the factor values. Can use features to capture patterns that we'd like reflected in clique potentials:
P_Φ(X_1, ..., X_N) ∝ ϕ_1(D_1) ϕ_2(D_2) ... ϕ_K(D_K)
Define ϕ_i(D_i) = exp(-w_i f_i(D_i)). f_i(D_i) tells us something indicative about some RVs, e.g., f_i(Y_1, Y_2) = 1 if the values of Y_1 and Y_2 are DT and NN, else f_i(Y_1, Y_2) = 0. We can have many feature functions for each D_i that each indicate different properties of our RVs.
So P_Φ(X_1, ..., X_N) ∝ exp(-w_1 f_1(D_1)) ... exp(-w_K f_K(D_K)) = exp(-Σ_i w_i f_i(D_i))

21 Features in Markov Random Fields Conditional tables vs. features. Table: ϕ_1(A,B) with ϕ_1(a,a) = 30, ϕ_1(a,b) = 5, ϕ_1(a,c) = 1, ..., ϕ_1(c,c) = 10. Features: ϕ_1(X,Y) = 1(X = Y); ϕ_2(X,Y) = 1 if X and Y are one grade apart, 0 otherwise.
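
A sketch of how a log-linear factor can stand in for a table: two indicator features over grades, with illustrative weights chosen so that, under the slide's convention ϕ = exp(-w·f), the factor roughly reproduces the table entries (exp(3.4) ≈ 30, exp(1.6) ≈ 5):

```python
import math

def f_same(x, y):
    """Indicator feature: 1 if the two grades are equal."""
    return 1.0 if x == y else 0.0

def f_adjacent(x, y):
    """Indicator feature: 1 if the grades are one apart (a-b or b-c)."""
    grades = "abc"
    return 1.0 if abs(grades.index(x) - grades.index(y)) == 1 else 0.0

# Illustrative weights, not from the slides.
w_same, w_adjacent = -3.4, -1.6

def phi(x, y):
    """Log-linear factor: exp(-(w_same*f_same + w_adjacent*f_adjacent))."""
    return math.exp(-(w_same * f_same(x, y) + w_adjacent * f_adjacent(x, y)))

print(phi("a", "a"), phi("a", "b"), phi("a", "c"))  # ~30, ~5, 1
```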

22 Conditional Random Fields CRFs: an MRF over X (input) and Y (output). Model the conditional probability P(Y | X). Uses features.

23 Conditional Random Fields

24 Conditional Random Fields Does not need to model input interactions. Fewer parameters, so easier to learn.

25 3906 citations on Google Scholar

26 "We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models." 3906 citations on Google Scholar

27 Conditional Random Fields for Sequences [Figure: linear chain Y_1 - Y_2 - Y_3 - Y_4, with each label Y_t connected to its input x_t] A Markov random field, with features, trained and used for conditional probabilities. Useful for modeling sequences.

28 Learning MRF parameters Easy when exact inference is possible. Given some data D = {X_i}, i = 1..n, we want to learn the MRF parameters Θ. Maximum likelihood principle: pick the parameters that maximize the (log) probability of the data: argmax_Θ L(D, Θ)

29 MLE for MRF parameters Pick the parameters Θ that maximize the (log) probability of the data:
log p(D | Θ) = Σ_i log p_Θ(X_i)
p_Θ(X) = (1/Z) Π_A exp(Σ_j θ_j f_j(x_A))
log p_Θ(X_i) = Σ_A Σ_j θ_j f_j(x_A) - log Z, where log Z = log Σ_X Π_A exp(Σ_j θ_j f_j(x_A))

30 MLE for MRF parameters Compute the gradient of the log-likelihood, then run a gradient-based optimization method (e.g., stochastic gradient descent) to find the parameters that maximize the log-likelihood.
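
A minimal sketch of this loop for a toy MRF with one feature over two binary RVs; Z and the expected feature count are computed by brute-force enumeration, which only works for tiny models. The data and learning rate are made up.

```python
import math
import itertools

# Toy MRF: p_w(x) = exp(w * f(x)) / Z(w), with a single feature
# f(x) = 1 if x1 == x2. Training data and learning rate are made up.
def f(x):
    return 1.0 if x[0] == x[1] else 0.0

configs = list(itertools.product([0, 1], repeat=2))
data = [(0, 0), (1, 1), (0, 0), (0, 1)]

w, lr = 0.0, 0.5
for step in range(200):
    scores = {x: math.exp(w * f(x)) for x in configs}
    Z = sum(scores.values())
    empirical = sum(f(x) for x in data) / len(data)            # counts in the data
    expected = sum((s / Z) * f(x) for x, s in scores.items())  # model expectation
    w += lr * (empirical - expected)                           # gradient ascent step

print(f"learned w = {w:.3f}")  # ~ln 3 = 1.099, matching the 3/4 agreement rate
```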

31 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))

32 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Compare to MRF: log p_Θ(X_i) = Σ_A Σ_j θ_j f_j(x_A) - log Z
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))

33 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Compare to MRF: log Z = log Σ_X Π_A exp(Σ_j θ_j f_j(x_A))
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))
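
For a linear-chain CRF, the input-dependent Z_X does not require enumerating all label sequences: the forward recursion computes it in O(T L^2). A sketch with made-up potentials (real implementations work in log space to avoid underflow); the brute-force check at the end confirms the recursion:

```python
import numpy as np
from itertools import product

# Made-up potentials for one input sentence x of length T with L labels:
# emit[t, y] = exp(weighted emission features), trans[y, y'] = exp(weighted
# transition features). In a real CRF these come from the feature weights.
rng = np.random.default_rng(0)
T, L = 4, 3
emit = rng.uniform(0.5, 2.0, size=(T, L))
trans = rng.uniform(0.5, 2.0, size=(L, L))

# Forward recursion: alpha[y] = total score of all label prefixes ending in y.
alpha = emit[0].copy()
for t in range(1, T):
    alpha = (alpha @ trans) * emit[t]  # sum_{y'} alpha[y'] trans[y',y] emit[t,y]
Z_x = alpha.sum()

# Sanity check: brute-force sum over all L^T label sequences.
brute = sum(
    np.prod([emit[t, ys[t]] for t in range(T)])
    * np.prod([trans[ys[t - 1], ys[t]] for t in range(1, T)])
    for ys in product(range(L), repeat=T)
)
print(Z_x, np.isclose(Z_x, brute))  # the two agree
```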

34 Training a CRF Use a gradient-based method to optimize the conditional log-likelihood of the data

35 Training a CRF Use a gradient-based method to optimize the conditional log-likelihood of the data. The derivative for a single parameter θ_k and training example {x_i, y_i}:
∂L(x_i, y_i)/∂θ_k = Σ_A f_k({x_i, y_i}_A) - Σ_A Σ_{y'} p(y' | x_i) f_k({x_i, y'}_A)
i.e., feature counts minus expected feature counts. Easy to compute when exact inference can be performed: we need to count features in the data and compute marginals.
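
A sketch of this gradient for a single training pair, with the expected counts obtained by enumerating every labeling of a short sequence. Real linear-chain CRFs get these expectations from forward-backward marginals instead; the features, words, and labels here are made up.

```python
import math
from itertools import product

# Toy CRF over a word sequence x with binary labels y: one feature per
# (word, label) pair plus a label-bigram feature.
def features(x, y):
    feats = {}
    for t, (word, label) in enumerate(zip(x, y)):
        feats[(word, label)] = feats.get((word, label), 0.0) + 1.0
        if t > 0:
            key = ("bigram", y[t - 1], label)
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def gradient(theta, x, y_gold):
    """d log p(y_gold|x) / d theta_k = feature counts - expected counts."""
    labelings = list(product([0, 1], repeat=len(x)))
    scores = [math.exp(sum(theta.get(k, 0.0) * v
                           for k, v in features(x, y).items()))
              for y in labelings]
    Z = sum(scores)
    grad = dict(features(x, y_gold))     # empirical feature counts
    for y, s in zip(labelings, scores):  # minus expected feature counts
        for k, v in features(x, y).items():
            grad[k] = grad.get(k, 0.0) - (s / Z) * v
    return grad

g = gradient({}, x=("the", "dog", "barks"), y_gold=(0, 1, 1))
print(g[("dog", 1)])  # 0.5: gold count 1 minus expected count 0.5 under theta=0
```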

36 CRFs for Classification CRFs are models for predicting probabilities: p(Y | X) or p(y_k | X). In ML we care about decisions (predictions): Is this spam? What is the POS tag of this word? How do we turn probabilities into decisions? For a sequence model: the sequence of states with max probability, via the max-product algorithm (Viterbi decoding).

38 Decision Theory Concerned with converting probabilities into decisions. Use the generalized Bayes rule, or minimum Bayes risk (MBR) rule: select the output y' that minimizes the risk, i.e., the expected loss:
risk(y') = E[l(y', y)] = Σ_{y''} p(y'' | x) l(y', y'')
MBR decoding depends on the loss function. Examples: accuracy (output the label with the max marginal probability); F-measure (can be intractable).
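
A sketch of max-product (Viterbi) decoding for a linear-chain model, using the same made-up emission/transition potential layout as the forward-recursion sketch above:

```python
import numpy as np

# Made-up potentials: emit[t, y] and trans[y, y'], as in the forward sketch.
rng = np.random.default_rng(1)
T, L = 5, 3
emit = rng.uniform(0.5, 2.0, size=(T, L))
trans = rng.uniform(0.5, 2.0, size=(L, L))

# Viterbi in log space: delta[y] = best score of any prefix ending in y.
delta = np.log(emit[0])
back = np.zeros((T, L), dtype=int)  # back[t, y] = best previous label
log_trans = np.log(trans)
for t in range(1, T):
    cand = delta[:, None] + log_trans  # cand[y_prev, y]
    back[t] = cand.argmax(axis=0)
    delta = cand.max(axis=0) + np.log(emit[t])

# Backtrace the highest-scoring label sequence.
best = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    best.append(int(back[t, best[-1]]))
best.reverse()
print(best)  # most probable label sequence under the model
```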

39 Case Study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables [Nakagawa, Inui, Kurohashi, HLT-NAACL 2010]

40 Sentiment Classification Classify documents / sentences / phrases / words as subjective/objective or positive/negative: "The picture quality is unparalleled" → Positive. "Watching this movie is like watching paint dry" → Negative. "The design is really kewl but this thing is expensive" → ?

41 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "great", "killed terrorist"

42 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "great at taking your money", "killed terrorist"

43 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "not great at taking your money", "killed terrorist targets"

44 Sentiment Compositionality Annotations only at the sentence (or phrase) level. Need to model hidden structure, and the model needs to be compositional. This paper: utilize a CRF to model dependencies between words, model hidden structure, and use a dependency parse to indicate sentential structure.

45 Dependency Subtrees and Compositionality

46 CRF Based on the Dependency Tree [Figure: dependency tree of "<root> It prevents cancer and heart disease." with polarity variables S_0 ... S_4 on its nodes]

48 CRF Based on the Dependency Tree

49 Model Specifics Inference: Belief propagation Decision: Argmax over the root marginal Learning: MAP estimation of the parameters

50 Parameter Estimation Gradient computation notation: p_l = label of the entire sentence; w_l = sequence of labels; h_l = sequence of heads; s_l = sequence of polarities

51 Features
s_i = polarity, q_i = prior polarity, r_i = polarity reversal, m_i = number of words (i ranges over all phrases)
u_{i,k} = surface form, b_{i,k} = base form, c_{i,k} = coarse POS tag, f_{i,k} = fine POS tag (k ranges over all words in phrase i)

52 Results

53 Summary Probabilistic graphical models: models of joint probability. CRF: a discriminative model for conditional probabilities. Training through MLE. Decisions using CRFs.
