Directed Probabilistic Graphical Models CMSC 678 UMBC


Announcement 1: Assignment 3 is due Wednesday, April 11th, 11:59 AM. Any questions?

Announcement 2: Progress Report on Project is due Monday, April 16th, 11:59 AM. Build on the proposal: update to address comments, discuss the progress you've made, discuss what remains to be done, and discuss any new blocks you've experienced (or anticipate experiencing). Any questions?

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Recap from last time

Expectation Maximization (EM): E-step. Two-step, iterative algorithm. 0. Assume some value for your parameters. 1. E-step: count under uncertainty, assuming these parameters: p(z_i) → count(z_i, w_i). 2. M-step: maximize log-likelihood, assuming these uncertain counts: p^(t)(z) → estimated counts → p^(t+1)(z).

EM Math. E-step (count under uncertainty): compute E_{z ~ p^(t)(·|w)}[log p_θ(z, w)], the expected complete-data log-likelihood under the posterior distribution given the old parameters. M-step (maximize log-likelihood): maximizing this expectation over θ gives the new parameters. Writing M(θ) = marginal log-likelihood of the observed data X, C(θ) = log-likelihood of the complete data (X, Y), and P(θ) = posterior log-likelihood of the incomplete data Y, we have M(θ) = E_{Y|θ^(t)}[C(θ) | X] − E_{Y|θ^(t)}[P(θ) | X]. EM does not decrease the marginal log-likelihood.

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Lagrange multipliers Assume an original optimization problem

Lagrange multipliers. Assume an original optimization problem. We convert it to a new optimization problem:
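As a minimal sketch of the construction (the symbols f, g, and λ below are generic placeholders, not taken from the slides): to maximize an objective f(θ) subject to an equality constraint g(θ) = 0, form the Lagrangian and look for stationary points:

\max_{\theta} f(\theta) \;\; \text{s.t.} \;\; g(\theta) = 0
\quad \longrightarrow \quad
\mathcal{L}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta),
\qquad
\nabla_{\theta}\,\mathcal{L}(\theta, \lambda) = 0,
\;\;
\frac{\partial \mathcal{L}(\theta, \lambda)}{\partial \lambda} = 0.

Setting the derivative with respect to λ to zero simply recovers the constraint g(θ) = 0; this is exactly the construction applied to the die-rolling likelihood below.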

Lagrange multipliers: an equivalent problem?

Lagrange multipliers: an equivalent problem?

Lagrange multipliers: an equivalent problem?

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Probabilistic Estimation of Rolling a Die. N different (independent) rolls: p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i). Example: w_1 = 1, w_2 = 5, w_3 = 4. Generative Story: for roll i = 1 to N: w_i ~ Cat(θ), where θ is a probability distribution over the 6 sides of the die, with ∑_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k.

Probabilistic Estimation of Rolling a Die. N different (independent) rolls: p(w_1, …, w_N) = ∏_i p(w_i). Example: w_1 = 1, w_2 = 5, w_3 = 4. Generative Story: for roll i = 1 to N: w_i ~ Cat(θ). Maximize the log-likelihood: L(θ) = ∑_i log p_θ(w_i) = ∑_i log θ_{w_i}.

Probabilistic Estimation of Rolling a Die. Generative Story: for roll i = 1 to N: w_i ~ Cat(θ). Maximize the log-likelihood L(θ) = ∑_i log θ_{w_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)?

Probabilistic Estimation of Rolling a Die. Q: What's an easy way to maximize L(θ) = ∑_i log θ_{w_i}, as written exactly (even without calculus)? A: Just keep increasing each θ_k (we know θ must be a distribution, but that constraint is not specified in the objective as written).

Probabilistic Estimation of Rolling a Die. Maximize the log-likelihood with distribution constraints: L(θ) = ∑_i log θ_{w_i}, subject to ∑_{k=1}^{6} θ_k = 1. (We can include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.) Solve using Lagrange multipliers.

Probabilistic Estimation of Rolling a Die. The Lagrangian is F(θ, λ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1), with partial derivatives ∂F/∂θ_k = ∑_{i: w_i = k} 1/θ_k − λ and ∂F/∂λ = −∑_{k=1}^{6} θ_k + 1.

Probabilistic Estimation of Rolling a Die. Setting ∂F/∂θ_k = 0 gives θ_k = (∑_{i: w_i = k} 1) / λ; the optimal λ is the one for which ∑_{k=1}^{6} θ_k = 1.

Probabilistic Estimation of Rolling a Die. That value of λ gives θ_k = (∑_{i: w_i = k} 1) / (∑_k ∑_{i: w_i = k} 1) = N_k / N: the maximum-likelihood estimate is the empirical frequency of side k.
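A minimal Python sketch of this closed-form result (the roll data here is made up for illustration): the constrained maximum-likelihood estimate is just the empirical frequency N_k / N.

from collections import Counter

rolls = [1, 5, 4, 1, 6, 1, 3, 5, 1, 2]   # hypothetical observed rolls w_1..w_N
N = len(rolls)
counts = Counter(rolls)                   # N_k = number of rolls showing side k

# theta_k = N_k / N, the MLE under the sum-to-one constraint
theta = {k: counts.get(k, 0) / N for k in range(1, 7)}
print(theta)
assert abs(sum(theta.values()) - 1.0) < 1e-12   # the Lagrange constraint holds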

Example: Conditionally Rolling a Die. Before: p(w_1, w_2, …, w_N) = ∏_i p(w_i). Add complexity to better explain what we see: p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i). Example: z_1 = H, w_1 = 1; z_2 = T, w_2 = 5. Each coin has its own distribution: p(heads) = λ, p(tails) = 1 − λ; p(heads) = γ, p(tails) = 1 − γ; p(heads) = ψ, p(tails) = 1 − ψ.

Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i): add complexity to better explain what we see. Generative Story: λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime. For item i = 1 to N: z_i ~ Bernoulli(λ); if z_i = H: w_i ~ Bernoulli(γ); else: w_i ~ Bernoulli(ψ).
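A small Python sketch of this generative story, with made-up values for λ, γ, and ψ (the Bernoulli parameters named on the slide):

import random

lam, gamma, psi = 0.5, 0.9, 0.2   # hypothetical p(heads) for penny, dollar coin, dime
random.seed(0)

def flip(p_heads):
    return "H" if random.random() < p_heads else "T"

samples = []
for i in range(10):                                   # for item i = 1 to N
    z_i = flip(lam)                                   # z_i ~ Bernoulli(lambda)
    w_i = flip(gamma) if z_i == "H" else flip(psi)    # w_i ~ Bernoulli(gamma) or Bernoulli(psi)
    samples.append((z_i, w_i))
print(samples)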

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Classify with Bayes Rule: argmax_Y p(Y | X) = argmax_Y [log p(X | Y) + log p(Y)], the (log-)likelihood plus the (log-)prior.

The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

Bag of Words Representation: a document is reduced to word counts (seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …), which are fed to a classifier γ(·) = c. Adapted from Jurafsky & Martin (draft)

Naïve Bayes: A Generative Story. Generative Story (global parameters): φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters.

Naïve Bayes: A Generative Story. Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters. For item i = 1 to N: y_i ~ Cat(φ).

Naïve Bayes: A Generative Story. Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters. For item i = 1 to N (local variables): y_i ~ Cat(φ); for each feature j: x_ij ~ F_j(θ_{y_i}). (Graph: label y with children x_i1, x_i2, x_i3, x_i4, x_i5.)

Naïve Bayes: A Generative Story. Generative Story as above; each x_ij is conditionally independent of the others (given the label).

Naïve Bayes: A Generative Story. Maximize the log-likelihood: L(θ) = ∑_i ∑_j log F_j(x_ij; θ_{y_i}) + ∑_i log φ_{y_i}, subject to ∑_k φ_k = 1 and each θ_k being valid for F_k.

Multinomial Naïve Bayes: A Generative Story. Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = distribution over J feature values. For item i = 1 to N: y_i ~ Cat(φ); for each feature j: x_ij ~ Cat(θ_{y_i}). Maximize the log-likelihood: L(θ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i}, subject to ∑_k φ_k = 1 and ∑_j θ_kj = 1 for every k.

Multinomial Naïve Bayes: A Generative Story. Maximize the log-likelihood via Lagrange multipliers: L(θ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i} − μ (∑_k φ_k − 1) − ∑_k λ_k (∑_j θ_kj − 1).

Multinomial Naïve Bayes: Learning. Calculate class priors: for each k, let items_k = all items with class = k; then p_k = |items_k| / (# items). Calculate feature generation terms: for each k, let obs_k = a single object containing all items labeled as k; for each feature j, let n_kj = # of occurrences of j in obs_k; then p(j | k) = n_kj / ∑_{j'} n_{kj'}.
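A hedged Python sketch of these counting equations (the toy documents and labels are made up; no smoothing, exactly as on the slide, with a tiny floor for unseen words added only so the log does not blow up):

from collections import Counter, defaultdict
import math

# hypothetical labeled items: (label, list of feature occurrences / tokens)
data = [("pos", ["great", "great", "fun"]),
        ("pos", ["fun", "sweet"]),
        ("neg", ["boring", "dull", "dull"])]

n_items = len(data)
items_k = Counter(label for label, _ in data)          # |items_k|
prior = {k: items_k[k] / n_items for k in items_k}     # p_k = |items_k| / # items

obs_k = defaultdict(Counter)                           # all tokens of items labeled k
for label, tokens in data:
    obs_k[label].update(tokens)                        # n_kj = # occurrences of j in obs_k

p_j_given_k = {k: {j: n / sum(cnt.values()) for j, n in cnt.items()}
               for k, cnt in obs_k.items()}            # p(j|k) = n_kj / sum_j n_kj

# classify with Bayes rule: argmax_k log p(k) + sum_j log p(j|k)
def classify(tokens):
    def score(k):
        return math.log(prior[k]) + sum(math.log(p_j_given_k[k].get(j, 1e-12)) for j in tokens)
    return max(prior, key=score)

print(classify(["fun", "great"]))   # "pos" on this toy data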

With enough data, the classifier may not matter (Brill and Banko, 2001). Adapted from Jurafsky & Martin (draft)

Summary: Naïve Bayes is not so naïve, but not without issues. Pros: very fast, with low storage requirements; robust to irrelevant features; very good in domains with many equally important features; optimal if the independence assumptions hold; a dependable baseline for text classification (but often not the best). Cons: why not model the posterior in one go (e.g., use conditional maxent)? Are the features really uncorrelated? Are plain counts always appropriate? Are there better (automated, more principled) ways of handling missing/noisy data? Adapted from Jurafsky & Martin (draft)

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Hidden Markov Models. Two candidate tag sequences for p(British Left Waffles on Falkland Islands): (i) Adjective Noun Verb Prep Noun Noun; (ii) Noun Verb Noun Prep Noun Noun. A class-based model with a bigram model of the classes: model all class sequences z_1, …, z_N via p(z_1, w_1, z_2, w_2, …, z_N, w_N), built from p(w_i | z_i) and p(z_i | z_{i−1}).

Hidden Markov Model. p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1}). Goal: maximize the (log-)likelihood. In practice we don't actually observe these z values; we just see the words w.

Hidden Markov Model. Same factorization: ∏_i p(w_i | z_i) p(z_i | z_{i−1}). If we did observe z, estimating the probability parameters would be easy, but we don't. If we knew the probability parameters, we could estimate z and evaluate the likelihood, but we don't know them either.

Hidden Markov Model Terminology. p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1}). Each z_i can take the value of one of K latent states.

Hidden Markov Model Terminology. The factors p(z_i | z_{i−1}) are the transition probabilities/parameters. Each z_i can take the value of one of K latent states.

Hidden Markov Model Terminology. The factors p(w_i | z_i) are the emission probabilities/parameters; the factors p(z_i | z_{i−1}) are the transition probabilities/parameters. Each z_i can take the value of one of K latent states.

Hidden Markov Model Terminology. Emission parameters p(w_i | z_i), transition parameters p(z_i | z_{i−1}). Each z_i can take the value of one of K latent states. The transition and emission distributions do not change across positions.

Hidden Markov Model Terminology. Q: How many different probability values are there with K states and V vocab items?

Hidden Markov Model Terminology. A: VK emission values and K² transition values.

Hidden Markov Model Representation. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1}), with emission parameters p(w_i | z_i) and transition parameters p(z_i | z_{i−1}). Represent the probabilities and independence assumptions in a graph: a chain z_1 → z_2 → z_3 → z_4, with each z_i emitting w_i.

Hidden Markov Model Representation. The first factor p(z_1 | z_0) is the initial starting distribution (from a special start symbol); the chain edges carry p(z_2 | z_1), p(z_3 | z_2), p(z_4 | z_3), and the downward edges carry p(w_1 | z_1), …, p(w_4 | z_4). Each z_i can take the value of one of K latent states, and the transition and emission distributions do not change.

Example: 2-state Hidden Markov Model as a Lattice. Four columns of states z_1, …, z_4, each of which can be N or V, sit above the observations w_1, …, w_4. Emission table p(w | state): for N: w_1 = .7, w_2 = .2, w_3 = .05, w_4 = .05; for V: w_1 = .2, w_2 = .6, w_3 = .1, w_4 = .1. Transition table p(next | previous): from start: N = .7, V = .2, end = .1; from N: N = .15, V = .8, end = .05; from V: N = .6, V = .35, end = .05.

Example: 2-state Hidden Markov Model as a Lattice. The same lattice, with the emission probabilities p(w_1 | N), p(w_1 | V), …, p(w_4 | N), p(w_4 | V) drawn on the edges from states to observations (same tables as above).

Example: 2-state Hidden Markov Model as a Lattice. Adding the start and same-state transitions p(V | start), p(N | start), p(V | V), p(N | N) between adjacent state columns (same tables as above).

Example: 2-state Hidden Markov Model as a Lattice. Adding the cross transitions p(N | V) and p(V | N), so every state in one column connects to every state in the next (same tables as above).

A Latent Sequence is a Path through the Graph. The path (N, w_1), (V, w_2), (V, w_3), (N, w_4) uses the edges p(N | start), p(V | N), p(V | V), p(N | V) and the emissions p(w_1 | N), p(w_2 | V), p(w_3 | V), p(w_4 | N), with the same transition and emission tables as above. Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)? A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247
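A quick Python check of this path probability, using the transition and emission tables from the lattice slides above:

trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {("N", "w1"): .7, ("N", "w2"): .2, ("N", "w3"): .05, ("N", "w4"): .05,
        ("V", "w1"): .2, ("V", "w2"): .6, ("V", "w3"): .1, ("V", "w4"): .1}

path = ["N", "V", "V", "N"]
words = ["w1", "w2", "w3", "w4"]

p = 1.0
prev = "start"
for z, w in zip(path, words):
    p *= trans[(prev, z)] * emit[(z, w)]   # p(z_i | z_{i-1}) * p(w_i | z_i)
    prev = z
print(p)   # 0.00024696, i.e., about 0.000247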

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Message Passing: Count the Soldiers If you are the front soldier in the line, say the number one to the soldier behind you. If you are the rearmost soldier in the line, say the number one to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16

Message Passing: Count the Soldiers If you are the front soldier in the line, say the number one to the soldier behind you. If you are the rearmost soldier in the line, say the number one to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16

Message Passing: Count the Soldiers If you are the front soldier in the line, say the number one to the soldier behind you. If you are the rearmost soldier in the line, say the number one to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16

Message Passing: Count the Soldiers If you are the front soldier in the line, say the number one to the soldier behind you. If you are the rearmost soldier in the line, say the number one to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16

What's the Maximum Weighted Path? (Figure, built up over several slides: a small graph whose nodes carry the weights 3, 7, 4, 9, 6, 32, and 1. Message passing propagates the best partial sums along the edges: adding 3 to the nodes 7 and 4 yields 10 and 7, and adding the best value so far, 10, to the nodes 9, 6, 32, and 1 yields 19, 16, 42, and 11. Each node only needs the best incoming sum to continue.)

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

What's the Maximum Value? (Lattice columns z_{i−1} and z_i, each with states A, B, C.) Consider any path ending in B at step i (via AB, BB, or CB) and maximize across the previous hidden state values: v(i, B) = max_s v(i−1, s) · p(B | s) · p(obs at i | B). Here v(i, B) is the maximum probability of any path from the beginning to state B at step i (emitting the observation).

What's the Maximum Value? The same recursion, drawn with one more column (z_{i−2}, z_{i−1}, z_i): v(i, B) = max_s v(i−1, s) · p(B | s) · p(obs at i | B).

What's the Maximum Value? Computing v at time i−1 correctly incorporates (maximizes over) all paths through time i−2: we obey the Markov property. v(i, B) = max_s v(i−1, s) · p(B | s) · p(obs at i | B). This is the Viterbi algorithm; v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation).

v = double[n+2][k*] b = int[n+2][k*] Viterbi Algorithm v[*][*] = 0 v[0][start] = 1 backpointers/ book-keeping for(i = 1; i N+1; ++i) { for(state = 0; state < K*; ++state) { p obs = p emission (obs i state) for(old = 0; old < K*; ++old) { p move = p transition (state old) if(v[i-1][old] * p obs * p move > v[i][state]) { } } } } v[i][state] = v[i-1][old] * p obs * p move b[i][state] = old

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Forward Probability. (Lattice columns z_{i−2}, z_{i−1}, z_i, each with states A, B, C.) α(i, B) is the total probability of all paths from the beginning to state B at step i.

Forward Probability. Marginalize across the previous hidden state values: α(i, B) = ∑_s α(i−1, s) · p(B | s) · p(obs at i | B). Computing α at time i−1 correctly incorporates all paths through time i−2: we obey the Markov property.

Forward Probability. In general, α(i, s) = ∑_{s'} α(i−1, s') · p(s | s') · p(obs at i | s). α(i, s) is the total probability of all paths that (1) start from the beginning, (2) end (currently) in s at step i, and (3) emit the observation obs at i.

Forward Probability. Reading the recursion: the sum is over the immediate ways to get into state s; α(i−1, s') is the total probability up until now; and p(s | s') · p(obs at i | s) is how likely it is to get into state s this way.

Forward Probability. Q: What do we return, i.e., how do we report the likelihood of the sequence? A: α[N+1][END].
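A minimal Python sketch of the forward recursion, using the same dictionary-of-probabilities convention as the Viterbi sketch above (an illustrative implementation, not a required interface):

def forward(words, states, trans, emit):
    # alpha[i][s]: total probability of all paths ending in state s after emitting words[:i]
    alpha = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for i, w in enumerate(words, start=1):
        for s in states:
            if i == 1:
                alpha[i][s] = trans[("start", s)] * emit[(s, w)]
            else:
                alpha[i][s] = sum(alpha[i - 1][old] * trans[(old, s)] for old in states) * emit[(s, w)]
    # total probability of the observed sequence ("alpha[N+1][END]")
    Z = sum(alpha[len(words)][s] * trans[(s, "end")] for s in states)
    return alpha, Z

# example usage (trans and emit as in the earlier sketch):
# alpha, Z = forward(["w1", "w2", "w3", "w4"], ["N", "V"], trans, emit)

Replacing the sum with a max (and keeping backpointers) recovers the Viterbi recursion.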

Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs

Forward & Backward Message Passing. (Lattice columns z_{i−1}, z_i, z_{i+1}, each with states A, B, C.) α(i, s) is the total probability of all paths that (1) start from the beginning, (2) end (currently) in s at step i, and (3) emit the observation obs at i. β(i, s) is the total probability of all paths that (1) start at step i in state s, (2) terminate at the end, and (3) (emit the observation obs at i+1).

Forward & Backward Message Passing. α(i, B) collects the paths arriving at B from the left; β(i, B) collects the paths leaving B toward the right.

Forward & Backward Message Passing. α(i, B) · β(i, B) = total probability of paths through state B at step i.

Forward & Backward Message Passing. Pairing α(i, B) with β(i+1, s') instead looks one step ahead.

Forward & Backward Message Passing. α(i, B) · p(s' | B) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the B → s' arc (at time i).

With Both Forward and Backward Values. α(i, s) · β(i, s) = total probability of paths through state s at step i, so p(z_i = s | w_1, …, w_N) = α(i, s) β(i, s) / α(N+1, END). Likewise α(i, s) · p(s' | s) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the s → s' arc (at time i), so p(z_i = s, z_{i+1} = s' | w_1, …, w_N) = α(i, s) p(s' | s) p(obs_{i+1} | s') β(i+1, s') / α(N+1, END).
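Given matching backward values, the two posteriors on this slide can be computed directly. A hedged sketch under the same conventions as the forward and Viterbi sketches above, with beta[i][s] defined as the total probability of emitting the remaining words and reaching the end from state s at step i:

def backward(words, states, trans, emit):
    # beta[i][s]: total probability of emitting words[i:] and ending, from state s at step i
    beta = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for s in states:
        beta[len(words)][s] = trans[(s, "end")]
    for i in range(len(words) - 1, 0, -1):
        w_next = words[i]                 # observation emitted at step i+1 (0-indexed list)
        for s in states:
            beta[i][s] = sum(trans[(s, s2)] * emit[(s2, w_next)] * beta[i + 1][s2] for s2 in states)
    return beta

def state_posterior(i, s, alpha, beta, Z):
    # p(z_i = s | w_1..w_N) = alpha(i, s) * beta(i, s) / alpha(N+1, END)
    return alpha[i][s] * beta[i][s] / Z

def edge_posterior(i, s, s2, words, alpha, beta, Z, trans, emit):
    # p(z_i = s, z_{i+1} = s2 | w) = alpha(i, s) p(s2 | s) p(obs_{i+1} | s2) beta(i+1, s2) / Z
    return alpha[i][s] * trans[(s, s2)] * emit[(s2, words[i])] * beta[i + 1][s2] / Z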

Expectation Maximization (EM). 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty (estimated counts), assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

Expectation Maximization (EM). For an HMM, the parameters are p_obs(w | s) and p_trans(s' | s). 1. E-step: count under uncertainty, assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts.

Expectation Maximization (EM). 1. E-step: count under uncertainty, assuming these parameters, using p(z_i = s | w_1, …, w_N) = α(i, s) β(i, s) / α(N+1, END) and p(z_i = s, z_{i+1} = s' | w_1, …, w_N) = α(i, s) p(s' | s) p(obs_{i+1} | s') β(i+1, s') / α(N+1, END). 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

M-Step: maximize the log-likelihood, assuming these uncertain counts. If we observed the hidden transitions, we would set p_new(s' | s) = c(s → s') / ∑_x c(s → x).

M-Step: we don't observe the hidden transitions, but we can approximately count them: p_new(s' | s) = E[c(s → s')] / ∑_x E[c(s → x)].

M-Step: p_new(s' | s) = E[c(s → s')] / ∑_x E[c(s → x)]; we compute these expected counts in the E-step, with our α and β values.

EM For HMMs (Baum-Welch Algorithm):
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i >= 0; --i) {
  for(next = 0; next < K*; ++next) {
    c_obs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      u = p_obs(obs_{i+1} | next) * p_trans(next | state)
      c_trans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
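Once the expected counts c_obs and c_trans have been accumulated as in the pseudocode above, the M-step is just normalization. A small Python sketch, assuming count dictionaries keyed by (state, word) and (prev_state, next_state):

from collections import defaultdict

def m_step(c_obs, c_trans):
    # c_obs[(s, w)]: expected number of times state s emitted word w
    # c_trans[(s, s2)]: expected number of times s transitioned to s2
    obs_totals, trans_totals = defaultdict(float), defaultdict(float)
    for (s, w), c in c_obs.items():
        obs_totals[s] += c
    for (s, s2), c in c_trans.items():
        trans_totals[s] += c
    p_obs = {(s, w): c / obs_totals[s] for (s, w), c in c_obs.items()}        # p_new(w | s)
    p_trans = {(s, s2): c / trans_totals[s] for (s, s2), c in c_trans.items()}  # p_new(s2 | s)
    return p_obs, p_trans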

Bayesian Networks: Directed Acyclic Graphs. p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i)), where π(x_i) denotes the parents of x_i and the product follows a topological sort of the graph.

Bayesian Networks: Directed Acyclic Graphs. p(x_1, …, x_N) = ∏_i p(x_i | π(x_i)). Exact inference in general DAGs is NP-hard; inference in trees can be exact.
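A tiny Python illustration of this factorization for a hypothetical three-node network Cloudy → Rain → WetGrass (the conditional probability tables are invented for the example):

# p(c, r, g) = p(c) * p(r | c) * p(g | r), following a topological order
p_c = {True: 0.5, False: 0.5}
p_r_given_c = {(True, True): 0.8, (False, True): 0.2,     # keyed by (rain, cloudy)
               (True, False): 0.1, (False, False): 0.9}
p_g_given_r = {(True, True): 0.9, (False, True): 0.1,     # keyed by (wet, rain)
               (True, False): 0.0, (False, False): 1.0}

def joint(c, r, g):
    return p_c[c] * p_r_given_c[(r, c)] * p_g_given_r[(g, r)]

# the joint sums to 1 over all assignments
total = sum(joint(c, r, g) for c in (True, False) for r in (True, False) for g in (True, False))
print(joint(True, True, True), total)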

D-Separation: Testing for Conditional Independence. X and Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z. X and Y are d-separated if, for every path P, one of the following is true: (1) P has a chain (X → Z → Y) with an observed middle node; (2) P has a fork (X ← Z → Y) with an observed parent node; (3) P includes a v-structure or collider (X → Z ← Y) in which Z and all of its descendants are unobserved.

D-Separation: Testing for Conditional Independence. For the chain X → Z → Y, observing Z blocks the path from X to Y. For the fork X ← Z → Y, observing Z blocks the path from X to Y. For the v-structure or collider X → Z ← Y, not observing Z blocks the path from X to Y.

D-Separation: Testing for Conditional Independence. For the collider case, p(x, y, z) = p(x) p(y) p(z | x, y), so marginalizing gives p(x, y) = ∑_z p(x) p(y) p(z | x, y) = p(x) p(y): X and Y are independent when Z is unobserved.

Markov Blanket. The Markov blanket is the set of nodes needed to form the complete conditional for a variable x_i: p(x_i | x_{j≠i}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_i = ∏_k p(x_k | π(x_k)) / ∫ ∏_k p(x_k | π(x_k)) dx_i. Factoring out the terms that do not depend on x_i leaves ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) / ∫ ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) dx_i. The Markov blanket of a node x is its parents, children, and children's parents.