Bayesian networks. Lecture 18. David Sontag, New York University


Outline for today: Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs); Bayesian networks: independence properties, examples, learning and inference.

Example application: Tracking. We observe noisy radar measurements of a missile's location: Y_1, Y_2, ... Where is the missile now? Where will it be in 10 seconds?

Probabilistic approach. Our measurements of the missile location were Y_1, Y_2, ..., Y_n. Let X_t be the true <missile location, velocity> at time t. To keep this simple, suppose that everything is discrete, i.e., X_t takes the values 1, ..., k (grid the space).

Probabilistic approach. First, we specify the conditional distribution Pr(X_t | X_{t-1}): from basic physics, we can bound the distance that the missile can have traveled in one time step. Then, we specify Pr(Y_t | X_t = <(10,20), 200 mph toward the northeast>): with probability 1/2, Y_t = X_t (ignoring the velocity); otherwise, Y_t is a uniformly chosen grid location.
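To make these two CPDs concrete, here is a minimal sketch that builds them as tables for a hypothetical 10x10 location-only grid; the grid size, the one-cell-per-step distance bound, and the array layout are my own simplifications (the lecture's state also includes velocity).

```python
import numpy as np

# Hypothetical discretization: a 10x10 grid, so k = 100 states
# (location only, ignoring velocity to keep the sketch small).
SIDE = 10
K = SIDE * SIDE

def neighbors(s, max_step=1):
    """States reachable from state s in one time step (the physics bound on distance)."""
    r, c = divmod(s, SIDE)
    out = []
    for dr in range(-max_step, max_step + 1):
        for dc in range(-max_step, max_step + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < SIDE and 0 <= cc < SIDE:
                out.append(rr * SIDE + cc)
    return out

# Transition model Pr(X_t | X_{t-1}): uniform over the reachable neighborhood (an assumption).
trans = np.zeros((K, K))
for s in range(K):
    nbrs = neighbors(s)
    trans[s, nbrs] = 1.0 / len(nbrs)

# Observation model Pr(Y_t | X_t): with probability 1/2 the radar reports the true cell,
# otherwise a uniformly chosen grid cell, as on the slide.
emit = np.full((K, K), 0.5 / K)
emit[np.arange(K), np.arange(K)] += 0.5

assert np.allclose(trans.sum(axis=1), 1.0) and np.allclose(emit.sum(axis=1), 1.0)
```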

Hidden Markov models (1960s). Assume that the joint distribution on X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n factors as follows:

Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

To find out where the missile is now, we do marginal inference: Pr(x_n | y_1, ..., y_n). To find the most likely trajectory, we do MAP (maximum a posteriori) inference: arg max_x Pr(x_1, ..., x_n | y_1, ..., y_n).
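A minimal sketch of evaluating this factorized joint for one concrete state/observation sequence, assuming the CPDs are stored as arrays init, trans, and emit (names and layout are mine, matching the sketch above):

```python
def hmm_joint_prob(x, y, init, trans, emit):
    """Pr(x_1..x_n, y_1..y_n) = Pr(x_1) Pr(y_1|x_1) * prod_t Pr(x_t|x_{t-1}) Pr(y_t|x_t).

    x, y        : sequences of state / observation indices (0-based)
    init[i]     : Pr(X_1 = i)
    trans[i, j] : Pr(X_t = j | X_{t-1} = i)
    emit[i, j]  : Pr(Y_t = j | X_t = i)
    """
    p = init[x[0]] * emit[x[0], y[0]]
    for t in range(1, len(x)):
        p *= trans[x[t - 1], x[t]] * emit[x[t], y[t]]
    return p
```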

Inference. Recall, to find out where the missile is now, we do marginal inference: Pr(x_n | y_1, ..., y_n). How does one compute this? Applying the rule of conditional probability, we have:

Pr(x_n | y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / Pr(y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / ∑_{x̂_n=1}^{k} Pr(x̂_n, y_1, ..., y_n)

Naively, computing the numerator would seem to require k^{n-1} summations:

Pr(x_n, y_1, ..., y_n) = ∑_{x_1, ..., x_{n-1}} Pr(x_1, ..., x_n, y_1, ..., y_n)

Is there a more efficient algorithm?
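For contrast with the efficient recursion on the next slide, here is a brute-force sketch that literally performs the k^{n-1} summations, reusing the hmm_joint_prob sketch above; it is only feasible for tiny n and k.

```python
from itertools import product

def marginal_brute_force(xn, y, init, trans, emit):
    """Pr(x_n = xn, y_1..y_n) by summing over all k^(n-1) settings of x_1..x_{n-1}."""
    k, n = len(init), len(y)
    total = 0.0
    for prefix in product(range(k), repeat=n - 1):
        x = list(prefix) + [xn]
        total += hmm_joint_prob(x, y, init, trans, emit)  # sketch from the previous slide
    return total
```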

Marginal inference in HMMs. Use dynamic programming:

Pr(x_n, y_1, ..., y_n)
  = ∑_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, ..., y_n)    [marginalization: Pr(A=a) = ∑_b Pr(A=a, B=b)]
  = ∑_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, ..., y_{n-1})    [product rule: Pr(A=a, B=b) = Pr(A=a) Pr(B=b | A=a)]
  = ∑_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1})    [conditional independence in HMMs]
  = ∑_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})    [product rule]
  = ∑_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)    [conditional independence in HMMs]

For n = 1, initialize Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1). Total running time is O(nk^2), i.e., linear in n. Easy to do filtering.
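A minimal NumPy sketch of this forward recursion (and of filtering by normalizing the final step), again assuming the init/trans/emit arrays from the earlier sketches:

```python
import numpy as np

def forward(y, init, trans, emit):
    """alpha[t, i] = Pr(x_{t+1} = i, y_1..y_{t+1}), computed by the recursion above (0-based t)."""
    n, k = len(y), len(init)
    alpha = np.zeros((n, k))
    alpha[0] = init * emit[:, y[0]]                     # Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # sum_{x_{t-1}} alpha[t-1, x_{t-1}] Pr(x_t | x_{t-1}) Pr(y_t | x_t)
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, y[t]]
    return alpha

def filtering(y, init, trans, emit):
    """Pr(x_n | y_1..y_n), obtained by normalizing the last row of alpha."""
    alpha = forward(y, init, trans, emit)
    return alpha[-1] / alpha[-1].sum()
```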

MAP inference in HMMs. MAP inference in HMMs can also be solved in linear time!

arg max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
  = arg max_x Pr(x_1, ..., x_n, y_1, ..., y_n)
  = arg max_x log Pr(x_1, ..., x_n, y_1, ..., y_n)
  = arg max_x [ log(Pr(x_1) Pr(y_1 | x_1)) + ∑_{i=2}^{n} log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)) ]

Formulate this as a shortest-paths problem on a layered graph with k nodes per variable X_1, X_2, ..., X_{n-1}, X_n, plus a source s and a sink t. The weight of edge (s, x_1) is -log(Pr(x_1) Pr(y_1 | x_1)); the weight of edge (x_{i-1}, x_i) is -log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)); the weight of edge (x_n, t) is 0. The shortest path from s to t gives the MAP assignment. This is called the Viterbi algorithm.
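A minimal sketch of the Viterbi algorithm in this negative-log-weight (shortest-path) formulation, with the same init/trans/emit conventions as above:

```python
import numpy as np

def viterbi(y, init, trans, emit):
    """MAP assignment arg max_x Pr(x | y), via shortest paths on -log edge weights."""
    n, k = len(y), len(init)
    with np.errstate(divide="ignore"):                  # allow log(0) = -inf for zero-probability edges
        log_init, log_trans, log_emit = np.log(init), np.log(trans), np.log(emit)
    cost = np.zeros((n, k))                             # shortest -log path from s to each node
    back = np.zeros((n, k), dtype=int)                  # backpointers for path recovery
    cost[0] = -(log_init + log_emit[:, y[0]])           # edge (s, x_1)
    for t in range(1, n):
        # edge weight (x_{t-1}, x_t): -log Pr(x_t | x_{t-1}) Pr(y_t | x_t)
        step = cost[t - 1][:, None] - log_trans - log_emit[:, y[t]][None, :]
        back[t] = step.argmin(axis=0)
        cost[t] = step.min(axis=0)
    # trace the shortest s -> t path backwards to recover the MAP assignment
    x = [int(cost[-1].argmin())]
    for t in range(n - 1, 0, -1):
        x.append(int(back[t, x[-1]]))
    return x[::-1]
```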

Applications of HMMs. Speech recognition: predict phonemes from the sounds forming words (i.e., the actual signals). Natural language processing: predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence. Computational biology: predict intron/exon regions from DNA; predict protein structure from DNA (locally). And many, many more!

HMMs as a graphical model. We can represent a hidden Markov model with a graph: a chain X_1 → X_2 → X_3 → X_4 → X_5 → X_6, with each X_t having an observed child Y_t. Shading denotes observed variables (i.e., what is available at test time); here the Y_t are shaded.

Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

There is a 1-1 mapping between the graph structure and the factorization of the joint distribution.

Naïve Bayes as a graphical model. We can represent a naïve Bayes model with a graph: a label node Y with a directed edge to each feature node X_1, X_2, X_3, ..., X_n. Shading denotes observed variables (i.e., what is available at test time); here the features X_i are shaded.

Pr(y, x_1, ..., x_n) = Pr(y) ∏_{i=1}^{n} Pr(x_i | y)

There is a 1-1 mapping between the graph structure and the factorization of the joint distribution.
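A minimal sketch of this factorization, and of using it for prediction by normalizing the joint over the label (Bayes' rule); the array layout for the CPDs is my own, not the lecture's:

```python
import numpy as np

def naive_bayes_joint(y, x, prior, cond):
    """Pr(y, x_1..x_n) = Pr(y) * prod_i Pr(x_i | y).

    prior : array, prior[y] = Pr(Y = y)
    cond  : array of shape (n_features, n_labels, n_values), cond[i, y, v] = Pr(X_i = v | Y = y)
    """
    p = prior[y]
    for i, v in enumerate(x):
        p *= cond[i, y, v]
    return p

def naive_bayes_posterior(x, prior, cond):
    """Pr(Y = y | x) for every label y, by normalizing the joint over the label."""
    joint = np.array([naive_bayes_joint(y, x, prior, cond) for y in range(len(prior))])
    return joint / joint.sum()
```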

Bayesian networks. A Bayesian network is specified by a directed acyclic graph G = (V, E) with: one node i for each random variable X_i, and one conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values. It corresponds 1-1 with a particular factorization of the joint distribution:

p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

This is a powerful framework for designing algorithms to perform probability computations.
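A minimal, generic sketch of evaluating this factorization for one full assignment; the parents/cpds data layout (parent index tuples and dicts keyed by value tuples) is my own choice for illustration:

```python
def bn_joint_prob(assignment, parents, cpds):
    """p(x_1..x_n) = prod_i p(x_i | x_Pa(i)) for one full assignment.

    assignment[i] : value of variable i
    parents[i]    : tuple of parent indices Pa(i)
    cpds[i]       : dict mapping (x_i, parent value 1, parent value 2, ...) -> probability
    """
    p = 1.0
    for i, xi in enumerate(assignment):
        pa_vals = tuple(assignment[j] for j in parents[i])
        p *= cpds[i][(xi,) + pa_vals]
    return p
```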

The 2011 Turing Award (to Judea Pearl) was for Bayesian networks.

Example. Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with CPDs:

p(D): d^0 = 0.6, d^1 = 0.4
p(I): i^0 = 0.7, i^1 = 0.3

p(G | I, D):
            g^1    g^2    g^3
i^0, d^0    0.3    0.4    0.3
i^0, d^1    0.05   0.25   0.7
i^1, d^0    0.9    0.08   0.02
i^1, d^1    0.5    0.3    0.2

p(S | I):
       s^0    s^1
i^0    0.95   0.05
i^1    0.2    0.8

p(L | G):
       l^0    l^1
g^1    0.1    0.9
g^2    0.4    0.6
g^3    0.99   0.01

What is its joint distribution? Using p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i)):

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

Example from Koller & Friedman, Probabilistic Graphical Models, 2009.
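Since all of the CPDs are given in the tables above, the factorized joint can be written out directly. A minimal sketch (the 0/1/2 value encodings for the superscripts d^0/d^1, i^0/i^1, g^1/g^2/g^3, s^0/s^1, l^0/l^1 are mine):

```python
# The five CPDs from the tables above, written as Python dicts.
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g = {  # p(g | i, d)
    (0, 0): {0: 0.30, 1: 0.40, 2: 0.30},
    (0, 1): {0: 0.05, 1: 0.25, 2: 0.70},
    (1, 0): {0: 0.90, 1: 0.08, 2: 0.02},
    (1, 1): {0: 0.50, 1: 0.30, 2: 0.20},
}
p_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}                        # p(s | i)
p_l = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.4, 1: 0.6}, 2: {0: 0.99, 1: 0.01}}   # p(l | g)

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# e.g. an intelligent student (i^1) in an easy class (d^0) with grade g^1, high SAT (s^1), strong letter (l^1):
print(joint(d=0, i=1, g=0, s=1, l=1))  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 = 0.11664
```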

Example. Consider the same Bayesian network (CPDs as in the tables above). What is this model assuming? SAT ⊥̸ Grade (SAT and Grade are dependent), but SAT ⊥ Grade | Intelligence. Example from Koller & Friedman, Probabilistic Graphical Models, 2009.

Example. Consider the same Bayesian network (CPDs as in the tables above). Compared to a simple log-linear model to predict intelligence, it: captures the non-linearity between grade, course difficulty, and intelligence; is modular, so training data can come from different sources; and has built-in feature selection: the letter of recommendation is irrelevant given the grade. Example from Koller & Friedman, Probabilistic Graphical Models, 2009.

Bayesian networks enable the use of domain knowledge. Will my car start this morning? p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i)). (Heckerman et al., "Decision-Theoretic Troubleshooting", 1995.)

Bayesian networks enable the use of domain knowledge. What is the differential diagnosis? p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i)). (Beinlich et al., "The ALARM Monitoring System", 1989.)

Bayesian networks are generative models. We can sample from the joint distribution, top-down. Suppose Y can be spam or not spam, and X_i is a binary indicator of whether word i is present in the e-mail (label Y, features X_1, X_2, X_3, ..., X_n). Let's try generating a few emails! It often helps to think about a Bayesian network as a generative model when constructing the structure and thinking about the model assumptions.
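A toy sketch of this top-down (ancestral) sampling for the spam model; the vocabulary and all parameter values are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: Pr(Y = spam) and Pr(X_i = 1 | Y) for a tiny vocabulary.
vocab = ["viagra", "meeting", "free", "lecture"]
p_spam = 0.4
p_word = {  # Pr(word i present | label)
    "spam":     [0.30, 0.02, 0.40, 0.01],
    "not spam": [0.001, 0.20, 0.05, 0.15],
}

def sample_email():
    """Ancestral (top-down) sampling: first sample the label Y, then each word X_i given Y."""
    label = "spam" if rng.random() < p_spam else "not spam"
    words = [w for w, p in zip(vocab, p_word[label]) if rng.random() < p]
    return label, words

for _ in range(3):
    print(sample_email())
```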

Inference in Bayesian networks. Computing marginal probabilities in tree-structured Bayesian networks is easy: the algorithm called belief propagation generalizes what we showed for hidden Markov models to arbitrary trees. (The slide shows two tree-structured examples: the HMM chain X_1, ..., X_6 with observations Y_1, ..., Y_6, and the naïve Bayes model with label Y and features X_1, ..., X_n.) Wait, this isn't a tree! What can we do?

Inference in Bayesian networks. In some cases (such as this one) we can transform the network into what is called a junction tree, and then run belief propagation.

Approximate inference. There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these. Markov chain Monte Carlo (MCMC) algorithms repeatedly sample assignments to estimate marginals. Variational inference algorithms (deterministic) find a simpler distribution which is close to the original, then compute marginals using the simpler distribution.
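As one concrete illustration of the MCMC idea (not an algorithm from the lecture), here is a minimal Gibbs sampler that estimates the filtering marginal Pr(x_n | y_1, ..., y_n) in the HMM from earlier, reusing the init/trans/emit array conventions of the sketches above; a real implementation would add burn-in, thinning, and convergence checks.

```python
import numpy as np

def gibbs_filtering(y, init, trans, emit, n_sweeps=5000, rng=None):
    """Estimate Pr(x_n | y_1..y_n) by Gibbs sampling over the hidden states."""
    rng = rng or np.random.default_rng(0)
    n, k = len(y), len(init)
    x = rng.integers(k, size=n)                      # arbitrary initial assignment
    counts = np.zeros(k)
    for sweep in range(n_sweeps):
        for t in range(n):
            # full conditional: p(x_t | rest) ∝ p(x_t | x_{t-1}) p(x_{t+1} | x_t) p(y_t | x_t)
            p = (init if t == 0 else trans[x[t - 1]]).copy()
            if t + 1 < n:
                p *= trans[:, x[t + 1]]
            p *= emit[:, y[t]]
            x[t] = rng.choice(k, p=p / p.sum())
        counts[x[-1]] += 1                           # accumulate samples of x_n
    return counts / counts.sum()
```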

Maximum likelihood estimation in Bayesian networks. Suppose that we know the Bayesian network structure G. Let θ_{x_i | x_pa(i)} be the parameter giving the value of the CPD p(x_i | x_pa(i)). Given M fully observed samples x^1, ..., x^M, maximum likelihood estimation corresponds to solving

max_θ (1/M) ∑_{m=1}^{M} log p(x^m; θ)

subject to the non-negativity and normalization constraints. This is equal to

max_θ (1/M) ∑_{m=1}^{M} log p(x^m; θ) = max_θ (1/M) ∑_{m=1}^{M} ∑_{i=1}^{N} log p(x_i^m | x_pa(i)^m; θ) = max_θ ∑_{i=1}^{N} (1/M) ∑_{m=1}^{M} log p(x_i^m | x_pa(i)^m; θ)

The optimization problem decomposes into an independent optimization problem for each CPD, and each has a simple closed-form solution.
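The closed-form solution is just "count and normalize". A minimal sketch for estimating a single CPD from fully observed data; the data layout (one value tuple per sample) and the function name are my own.

```python
from collections import defaultdict

def mle_cpd(data, child, parents):
    """Closed-form MLE for one CPD p(x_child | x_parents): normalized counts.

    data    : list of fully observed assignments (sequences of variable values)
    child   : index of the variable whose CPD we estimate
    parents : tuple of indices Pa(child)
    """
    counts = defaultdict(float)        # (parent values, child value) -> count
    totals = defaultdict(float)        # parent values -> count
    for x in data:
        pa = tuple(x[j] for j in parents)
        counts[(pa, x[child])] += 1
        totals[pa] += 1
    # theta_{x_i | x_pa(i)} = count(x_i, x_pa(i)) / count(x_pa(i))
    return {key: c / totals[key[0]] for key, c in counts.items()}

# Hypothetical usage: with samples ordered as (d, i, g, s, l) from the student network,
# mle_cpd(samples, child=2, parents=(1, 0)) would estimate p(g | i, d).
```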