Graphical Models. Mark Gales. Lent 2011. Machine Learning for Language Processing: Lecture 3. MPhil in Advanced Computer Science


Graphical Models

Graphical models have their origin in several areas of research:
- a union of graph theory and probability theory
- a framework for representing, reasoning with, and learning complex problems
Used for multivariate (multiple-variable) probabilistic systems, they encompass: language models (Markov chains); mixture models; factor analysis; hidden Markov models; Kalman filters.
These 4 lectures will examine forms, training and inference with these systems.

Basic Notation

A graph consists of a collection of nodes and edges.
- Nodes, or vertices, are usually associated with the variables; the distinction between discrete and continuous variables is ignored in this initial discussion.
- Edges connect nodes to one another. For undirected graphs the absence of an edge between nodes indicates conditional independence, so the graph can be considered as representing dependencies in the system.
(Figure: example undirected graph with 5 nodes, {A,B,C,D,E}, and 6 edges.)
Various operations on sets of these nodes, e.g. C_1 = {A,C}, C_2 = {B,C,D}, C_3 = {C,D,E}:
- union: S = C_1 ∪ C_2 = {A,B,C,D}
- intersection: S = C_1 ∩ C_2 = {C}
- removal: C_1 \ S = {A}

Conditional Independence

A fundamental concept in graphical models is conditional independence. Consider three variables, A, B and C. We can always write
    P(A,B,C) = P(A) P(B|A) P(C|B,A)
If C is conditionally independent of A given B, then we can write
    P(A,B,C) = P(A) P(B|A) P(C|B)
i.e. the value of A does not affect the distribution of C if B is known. Graphically this can be described as the chain A → B → C.
Conditional independence is important when modelling highly complex systems.

Forms of Graphical Model

(Figure: the same five variables {A,B,C,D,E} drawn as an undirected graph, a factor graph and a Bayesian network.)
For the undirected graph the probability calculation is based on
    P(A,B,C,D,E) = (1/Z) P(A,C) P(B,C,D) P(C,D,E)
where Z is the appropriate normalisation term; this is the same as the product of the three factors in the factor graph.
This course will concentrate on Bayesian networks.

Bayesian Networks

A specific form of graphical model is the Bayesian network: a directed acyclic graph (DAG).
- directed: all connections have arrows associated with them;
- acyclic: following the arrows around, it is not possible to complete a loop.
The main problems that need to be addressed are:
- inference (from the observation "it's cloudy" infer the probability of wet grass);
- training the models;
- determining the structure of the network (i.e. what is connected to what).
The first two issues will be addressed in these lectures; the final problem is an area of ongoing research.

Notation

In general the variables (nodes) may be split into two groups:
- observed (shaded) variables are the ones we have knowledge about;
- unobserved (unshaded) variables are ones we don't know about and therefore have to infer the probability of.
The observed/unobserved variables may differ between training and testing, e.g. for supervised training the class of interest is known.
We need to find efficient algorithms that allow rapid inference to be made, preferably a general scheme that allows inference over any Bayesian network.
First, three basic structures are described; the next slides detail the effect of observing one of the variables on the probability.

Standard Structures

Structure 1 (head-to-tail chain, A → C → B):
- C not observed: P(A,B) = Σ_C P(A,B,C) = P(A) Σ_C P(C|A) P(B|C), so A and B are dependent on each other.
- C = T observed: P(A,B|C=T) = P(A|C=T) P(B|C=T), so A and B are then independent. The path is sometimes called blocked.

Structure 2 (common parent, A ← C → B):
- C not observed: P(A,B) = Σ_C P(A,B,C) = Σ_C P(C) P(A|C) P(B|C), so A and B are dependent on each other.
- C = T observed: P(A,B|C=T) = P(A|C=T) P(B|C=T), so A and B are then independent.

Standard Structures (cont)

Structure 3 (common child, A → C ← B):
- C not observed: P(A,B) = Σ_C P(A,B,C) = P(A) P(B) Σ_C P(C|A,B) = P(A) P(B), so A and B are independent of each other.
- C = T observed: P(A,B|C=T) = P(A,B,C=T) / P(C=T) = P(C=T|A,B) P(A) P(B) / P(C=T), so A and B are not independent of each other if C is observed.
Two variables become dependent if a common child is observed: this is known as explaining away.
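
As a numerical illustration of explaining away, the sketch below enumerates a toy head-to-head network A → C ← B. The CPT values are made up for illustration and are not taken from the lecture; the point is only that P(A=T|C=T) changes once B is also observed.

```python
# Minimal numeric check of "explaining away" (structure 3: A -> C <- B).
# The probability values below are illustrative assumptions, not lecture values.
from itertools import product

P_A = {True: 0.1, False: 0.9}          # P(A)
P_B = {True: 0.2, False: 0.8}          # P(B)
P_C_given = {                          # P(C=T | A, B), a noisy-OR style table
    (True, True): 0.95, (True, False): 0.80,
    (False, True): 0.70, (False, False): 0.05,
}

def joint(a, b, c):
    p_c = P_C_given[(a, b)]
    return P_A[a] * P_B[b] * (p_c if c else 1.0 - p_c)

# P(A=T | C=T) versus P(A=T | C=T, B=T): conditioning on B changes the answer,
# so A and B are dependent once the common child C is observed.
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print("P(A=T | C=T)      =", num / den)

num2 = joint(True, True, True)
den2 = sum(joint(a, True, True) for a in (True, False))
print("P(A=T | C=T, B=T) =", num2 / den2)
```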

Simple Example

Consider the Bayesian network for the classic wet-grass example, with variables:
- whether it is cloudy, C
- whether the sprinkler has been used, S
- whether it has rained, R
- whether the grass is wet, W
Associated with each node is a conditional probability table (CPT):
    P(C=T) = 0.8
    P(S=T|C=T) = 0.1,   P(S=T|C=F) = 0.5
    P(R=T|C=T) = 0.8,   P(R=T|C=F) = 0.1
    P(W=T|S=T,R=T) = 0.99,   P(W=T|S=T,R=F) = 0.90,   P(W=T|S=F,R=T) = 0.90,   P(W=T|S=F,R=F) = 0.00
This yields a set of conditional independence assumptions so that
    P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)
It is possible to use the CPTs for inference. Given C = T, what is the probability that the grass is wet?
    P(W=T|C=T) = Σ_{S∈{T,F}} Σ_{R∈{T,F}} P(C=T,S,R,W=T) / P(C=T) = 0.7452
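
A minimal sketch of this inference by direct enumeration, using the CPT values above (the helper names are mine, not from the lecture):

```python
# Exact inference by enumeration for the cloudy/sprinkler/rain/wet-grass network.
P_S = {True: 0.1, False: 0.5}                       # P(S=T | C)
P_R = {True: 0.8, False: 0.1}                       # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.90,     # P(W=T | S, R)
       (False, True): 0.90, (False, False): 0.00}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

# P(W=T | C=T) = sum_{S,R} P(S|C=T) P(R|C=T) P(W=T|S,R)
total = 0.0
for s in (True, False):
    for r in (True, False):
        total += bern(P_S[True], s) * bern(P_R[True], r) * P_W[(s, r)]
print(total)   # approximately 0.7452, matching the slide
```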

General Inference

A general approach for inference with Bayesian networks is message passing; there is no time in this course for a detailed analysis of the general case, so only a very brief overview is given here.
The process involves identifying:
- Cliques C: fully connected (every node is connected to every other node) subsets of all the nodes.
- Separators S: the subset of the nodes of a clique that are connected to nodes outside the clique.
- Neighbours N: the set of neighbours for a particular clique.
Thus, given the value of the separators for a clique, it is conditionally independent of all other variables.

Simple Inference Example

(Figure: the wet-grass Bayesian network, its moral graph, and the resulting junction tree.)
Two cliques: C_1 = {C,S,R}, C_2 = {S,R,W}; one separator: S_12 = {S,R}.
Pass a message between the cliques: φ_12(S_12) = Σ_C P(C_1).
With C = T observed the message is φ_12(S_12) = P(S|C=T) P(R|C=T), giving the CPT associated with the message:
    S=T,R=T: 0.08    S=T,R=F: 0.02    S=F,R=T: 0.72    S=F,R=F: 0.18
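
A small sketch computing that message table, again with the slide's CPT values (the helper names are mine):

```python
# The single clique-to-clique message for the junction tree above, with C = True observed.
P_S = {True: 0.1, False: 0.5}   # P(S=T | C)
P_R = {True: 0.8, False: 0.1}   # P(R=T | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

# phi_12(S, R) = P(S | C=T) * P(R | C=T)
message = {(s, r): bern(P_S[True], s) * bern(P_R[True], r)
           for s in (True, False) for r in (True, False)}
print(message)
# approximately {(T,T): 0.08, (T,F): 0.02, (F,T): 0.72, (F,F): 0.18}
```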

Beyond Naive Bayes Classifier

Consider classifiers for the class ω given the sequence x_1, x_2, x_3.
(Figure: two generative models; on the left the class generates x_1, x_2, x_3 independently, on the right each x_i also depends on x_{i-1}, with start marker x_0 and end marker x_4.)
Consider the simple generative classifiers above (with joint distribution):
- naive Bayes classifier (left):  P(ω_j) Π_{i=1}^{3} P(x_i|ω_j)   (conditionally independent features given the class)
- bigram model (right):  P(ω_j) P(x_0|ω_j) Π_{i=1}^{4} P(x_i|x_{i-1}, ω_j)
  - addition of a sequence start feature x_0 (note P(x_0|ω_j) = 1)
  - addition of a sequence end feature x_{d+1} (handles variable-length sequences)
The decision is now based on a more complex model; this is the approach used for generating (class-specific) language models.
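
The two decision rules can be written as log-probability scores. The sketch below is illustrative only: the function names and table contents are assumptions, not part of the lecture.

```python
import math

def naive_bayes_score(x, prior, p_feat):
    """log P(omega_j) + sum_i log P(x_i | omega_j) for one class."""
    return math.log(prior) + sum(math.log(p_feat[xi]) for xi in x)

def bigram_score(x, prior, p_bigram, start="<s>", end="</s>"):
    """log P(omega_j) + sum_i log P(x_i | x_{i-1}, omega_j), with start/end markers."""
    seq = [start] + list(x) + [end]
    return math.log(prior) + sum(math.log(p_bigram[(prev, cur)])
                                 for prev, cur in zip(seq, seq[1:]))

# Classification then picks the class with the highest score:
#   argmax_j score(x, P(omega_j), model_j)
```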

Language Modelling

In order to use Bayes' decision rule we need the prior of a class; in many speech and language processing tasks this is the sentence probability P(w). Examples include speech recognition and machine translation.
    P(w) = P(w_0, w_1, ..., w_K, w_{K+1}) = Π_{k=1}^{K+1} P(w_k | w_0, ..., w_{k-2}, w_{k-1})
- K words in the sentence, w_1, ..., w_K
- w_0 is the sentence start marker and w_{K+1} is the sentence end marker
- requires word-by-word probabilities of partial strings given a history
Can be class-specific, e.g. topic classification (select topic τ given text w):
    τ̂ = argmax_τ {P(τ|w)} = argmax_τ {P(w|τ) P(τ)}

N-Gram Language Models

Consider a task with a vocabulary of V words (65K+ for LVCSR):
- 10-word sentences yield (in theory) V^10 probabilities to compute
- not every sequence is valid, but the number is still vast for LVCSR systems
Need to partition histories into appropriate equivalence classes.
Assume words are conditionally independent given the previous N−1 words, e.g. for N = 2:
    P(bank | I, robbed, the) ≈ P(bank | I, fished, from, the) ≈ P(bank | the)
This simple form of equivalence mapping is a bigram language model:
    P(w) = Π_{k=1}^{K+1} P(w_k | w_0, ..., w_{k-2}, w_{k-1}) ≈ Π_{k=1}^{K+1} P(w_k | w_{k-1})
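
A sketch of applying the bigram decomposition to a sentence, with explicit start and end markers; the toy probability table is an assumption for illustration only:

```python
import math

# Toy bigram probabilities P(w_k | w_{k-1}); not from the lecture.
bigram = {("<s>", "the"): 0.4, ("the", "cat"): 0.1,
          ("cat", "sat"): 0.3, ("sat", "</s>"): 0.2}

def sentence_log2prob(words, bigram, start="<s>", end="</s>"):
    # log2 P(w) = sum_k log2 P(w_k | w_{k-1}), including the end marker
    seq = [start] + words + [end]
    return sum(math.log2(bigram[(prev, cur)]) for prev, cur in zip(seq, seq[1:]))

print(sentence_log2prob(["the", "cat", "sat"], bigram))   # log2 of 0.4*0.1*0.3*0.2
```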

N-Gram Language Models

The simple bigram can be extended to general N-grams:
    P(w) = Π_{k=1}^{K+1} P(w_k | w_0, ..., w_{k-2}, w_{k-1}) ≈ Π_{k=1}^{K+1} P(w_k | w_{k-N+1}, ..., w_{k-1})
The number of model parameters scales with N (consider V = 65K):
- unigram (N=1): 65K^1 = 6.5 × 10^4
- bigram (N=2): 65K^2 = 4.225 × 10^9
- trigram (N=3): 65K^3 = 2.746 × 10^14
- 4-gram (N=4): 65K^4 = 1.785 × 10^19
The web comprises about 20 billion pages - not enough data!
Long-span models should be more accurate, but have large numbers of parameters. A central problem is how to get robust estimates with long spans.
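
The parameter counts above can be checked directly:

```python
# Quick arithmetic check of the parameter counts quoted above, with V = 65,000.
V = 65_000
for n in range(1, 5):
    print(f"{n}-gram: V^{n} = {V**n:.3e}")
# 1-gram: 6.500e+04   2-gram: 4.225e+09   3-gram: 2.746e+14   4-gram: 1.785e+19
```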

Modelling Shakespeare

Jurafsky & Martin: N-grams trained on the complete works of Shakespeare, then used to generate text.
- Unigram: Every enter now severally so, let Will rash been and by I the me loves gentle me not slavish page, the and hour; ill let
- Bigram: What means, sir. I confess she? then all sorts, he is trim, captain. The world shall- my lord!
- Trigram: Indeed the duke; and had a very good friend. Sweet prince, Falstaff shall die. Harry of Monmouth's grave.
- 4-gram: It cannot be but so. Enter Leonato's brother Antonio, and the rest, but seek the weary beds of people sick.

Assessing Language Models

Entropy, H, or perplexity, PP, is often used to assess an LM:
    H = −Σ_{w∈V} P(w) log_2 P(w),    PP = 2^H
where V is the set of all possible events. This is difficult when incorporating word history into LMs, and is not useful for assessing how well specific text is modelled with a given LM.
The quality of an LM is usually measured by the test-set perplexity. Compute the average value of the negative sentence log-probability (LP):
    LP = −lim_{K→∞} 1/(K+1) Σ_{k=1}^{K+1} log_2 P(w_k | w_0, ..., w_{k-2}, w_{k-1})
In practice LP must be estimated from a (finite-sized) portion of test text; this is a (finite-set) estimate of the entropy. The test-set perplexity can then be found as PP = 2^LP.
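
A sketch of estimating the test-set perplexity from a finite test set under a bigram LM; `get_prob` is a hypothetical lookup for P(w_k | w_{k-1}) and is not something defined in the lecture:

```python
import math

def test_set_perplexity(sentences, get_prob, start="<s>", end="</s>"):
    """PP = 2^LP, with LP the average negative log2 probability per token."""
    log_prob_sum, n_tokens = 0.0, 0
    for words in sentences:
        seq = [start] + words + [end]
        for prev, cur in zip(seq, seq[1:]):
            log_prob_sum += math.log2(get_prob(cur, prev))   # log2 P(w_k | w_{k-1})
            n_tokens += 1
    lp = -log_prob_sum / n_tokens
    return 2.0 ** lp
```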

Language Model Estimation

The simplest approach to estimating N-grams is to count occurrences:
    P̂(w_k | w_i, w_j) = f(w_i, w_j, w_k) / Σ_{k=1}^{V} f(w_i, w_j, w_k) = f(w_i, w_j, w_k) / f(w_i, w_j)
where f(a, b, c, ...) is the number of times that the word sequence (event) a b c ... occurs in the training data.
This is the maximum likelihood estimate: an excellent model of the training data, but
- many possible events will not be seen; zero counts give zero probability
- for rare events, f(w_i, w_j) is small and the estimates are unreliable
Two solutions are discussed here:
- discounting: allocating some counts to unseen events
- backing-off: for rare events, reduce the size of N
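
A minimal sketch of the counting step for a trigram model; the sentence markers and helper name are my own. Note that any trigram unseen in training simply gets no entry, i.e. zero probability, which is exactly the problem discounting and backing-off address.

```python
from collections import Counter

def ml_trigram_estimates(sentences):
    """Maximum likelihood P_hat(w_k | w_i, w_j) = f(w_i, w_j, w_k) / f(w_i, w_j)."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        seq = ["<s>", "<s>"] + words + ["</s>"]
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            tri[(a, b, c)] += 1    # trigram counts f(w_i, w_j, w_k)
            bi[(a, b)] += 1        # history counts f(w_i, w_j)
    return {(a, b, c): n / bi[(a, b)] for (a, b, c), n in tri.items()}
```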

Maximum Likelihood Training - Example

As an example take University telephone numbers. Let's assume that:
1. All telephone numbers are 6 digits long
2. All numbers start (equally likely) with 33, 74 or 76
3. All other digits are equally likely
What are the resulting perplexities for various N-grams? Experiment using 10,000 or 100 numbers to train (ML) and 1,000 to test. Perplexity numbers are given below (11 tokens including sentence end):

    Language model    10,000: Train   Test     100: Train   Test
    equal             11.00           11.00    11.00        11.00
    unigram           10.04           10.01    10.04        10.04
    bigram             7.12            7.13     6.56

Discounting

Need to reallocate some counts to unseen events, while still satisfying (valid PMF)
    Σ_{k=1}^{V} P̂(w_k | w_i, w_j) = 1
(Figure: bar charts of maximum likelihood probabilities, P(E) = r/R, and discounted probabilities, P(E) = d(r) r/R, plotted against the count r.)
General form of discounting:
    P̂(w_k | w_i, w_j) = d(f(w_i, w_j, w_k)) f(w_i, w_j, w_k) / f(w_i, w_j)
We need to decide the form of d(f(w_i, w_j, w_k)) (and ensure the sum-to-one constraint).

Forms of Discounting

Notation: r = count for an event, n_r = number of N-grams with count r.
Various forms of discounting are used (Kneser-Ney is also popular):
- Absolute discounting: subtract a constant from each count, d(r) = (r − b)/r. Typically b = n_1/(n_1 + 2 n_2); often applied to all counts.
- Linear discounting: d(r) = 1 − (n_1/T_c), where T_c is the total number of events; often applied to all counts.
- Good-Turing discounting (mass of events observed once = n_1, mass of events observed r times = r n_r):
    r* = (r + 1) n_{r+1} / n_r
  with probability estimates based on r*: unseen events get the same total mass as events observed once, events observed once get the same mass as those observed twice, etc.
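
A sketch of the Good-Turing adjusted counts r* computed from a count-of-counts table; the toy counts are made up, and a real implementation would need to smooth the n_r tail where counts of counts are zero:

```python
from collections import Counter

# Toy bigram counts (illustrative only, not from the lecture)
counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1,
                  ("c", "a"): 2, ("c", "b"): 1})
n = Counter(counts.values())          # n_r: number of events seen exactly r times

def good_turing_r_star(r, n):
    # r* = (r + 1) n_{r+1} / n_r ; zero when no event was seen r+1 times
    return (r + 1) * n.get(r + 1, 0) / n[r]

for r in sorted(n):
    print(r, good_turing_r_star(r, n))
# The mass reserved for unseen events equals n_1 / (total count of training events).
```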

Backing-Off

An alternative to using discounting is to use lower-order N-grams for rare events; the lower-order N-gram will yield more reliable estimates. For the example of a bigram:
    P̂(w_j | w_i) = d(f(w_i, w_j)) f(w_i, w_j) / f(w_i)    if f(w_i, w_j) > C
    P̂(w_j | w_i) = α(w_i) P̂(w_j)                          otherwise
- α(w_i) is the back-off weight, chosen to ensure that Σ_{j=1}^{V} P̂(w_j | w_i) = 1
- C is the N-gram cut-off point (can be set for each value of N); the value of C also controls the size of the resulting language model
Note that the back-off weight is computed separately for each history and uses the (N−1)th-order N-gram counts.
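
A sketch of the back-off rule for a bigram; `discount`, `alpha` and the unigram model `p_unigram` are hypothetical helpers that would come from the chosen discounting scheme and the normalisation constraint above:

```python
def backoff_bigram_prob(w_j, w_i, f_bigram, f_unigram, p_unigram,
                        discount, alpha, cutoff=1):
    """Backed-off bigram probability P_hat(w_j | w_i)."""
    count = f_bigram.get((w_i, w_j), 0)
    if count > cutoff:
        # discounted maximum likelihood estimate for frequent bigrams
        return discount(count) * count / f_unigram[w_i]
    # back off to the unigram, scaled so the distribution still sums to one
    return alpha(w_i) * p_unigram[w_j]
```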

Graphical Model Lectures

The remaining lectures on graphical models will cover:
- Latent Variable Models and Hidden Markov Models: mixture models, hidden Markov models, the Viterbi algorithm
- Expectation Maximisation and Variational Approaches: EM for mixture models and HMMs, extension to variational approaches
- Conditional Random Fields: discriminative sequence models, form of features, parameter estimation