Probabilistic graphical models. CS 535c (Topics in AI) / Stat 521a (Topics in multivariate analysis). Lecture 1. Kevin Murphy. Monday 13 September, 2004

Administrivia
Lectures: MW 9.30-10.50, CISR 304.
Regular homeworks: 40% of grade. Simple theory exercises; simple Matlab exercises.
Final project: 60% of grade. Apply PGMs to your research area (e.g., vision, language, bioinformatics), add new features to my software package for PGMs, or do theoretical work.
No exams.

Administrivia
Please send email to majordomo@cs.ubc.ca with the contents "subscribe cpsc535c" to get on the class mailing list.
URL: www.cs.ubc.ca/~murphyk/teaching/cs532c_Fall04/index.html
Class on Wed 15th starts at 10am!
No textbook, but some draft chapters may be handed out in class: "Introduction to Probabilistic Graphical Models", Michael Jordan; "Bayesian Networks and Beyond", Daphne Koller and Nir Friedman.

Probabilistic graphical models
Combination of graph theory and probability theory.
Informally: the graph structure specifies which parts of the system are directly dependent, and local functions at each node specify how the parts interact.
More formally: the graph encodes conditional independence assumptions, and the local functions at each node are factors in the joint probability distribution.
Bayesian networks = PGMs based on directed acyclic graphs (DAGs).
Markov networks (Markov random fields) = PGMs based on undirected graphs.

Applications of PGMs: machine learning, statistics, speech recognition, natural language processing, computer vision, error-control codes, bioinformatics, medical diagnosis, etc.

Bayesian networks (aka belief networks, directed graphical models)
Nodes are random variables.
Informally, edges represent causation (no directed cycles allowed: the graph is a DAG).
Formally, the local Markov property says: a node is conditionally independent of its non-descendants given its parents.
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and the children's other parents Z_1j, ..., Z_nj]

Chain rule for Bayesian networks
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and co-parents Z_1j, ..., Z_nj]
P(X_{1:N}) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) \cdots = \prod_{i=1}^N P(X_i | X_{1:i-1}) = \prod_{i=1}^N P(X_i | X_{\pi_i})
where \pi_i denotes the parents of node i.

Water sprinkler Bayes net
[Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
P(C,S,R,W) = P(C) P(S|C) P(R|S,C) P(W|S,R,C)   (chain rule)
           = P(C) P(S|C) P(R|C) P(W|S,R,C)     (since S ⊥ R | C)
           = P(C) P(S|C) P(R|C) P(W|S,R)       (since W ⊥ C | S,R)

Conditional Probability Distributions (CPDs)
Associated with every node is a probability distribution over its values given its parents' values. If the variables are discrete, these distributions can be represented as tables (CPTs).

Cloudy:           P(C=F)=0.5,  P(C=T)=0.5
Sprinkler:  C=F:  P(S=F)=0.5,  P(S=T)=0.5
            C=T:  P(S=F)=0.9,  P(S=T)=0.1
Rain:       C=F:  P(R=F)=0.8,  P(R=T)=0.2
            C=T:  P(R=F)=0.2,  P(R=T)=0.8
WetGrass:   S=F,R=F:  P(W=F)=1.0,   P(W=T)=0.0
            S=T,R=F:  P(W=F)=0.1,   P(W=T)=0.9
            S=F,R=T:  P(W=F)=0.1,   P(W=T)=0.9
            S=T,R=T:  P(W=F)=0.01,  P(W=T)=0.99
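As a minimal sketch, the CPTs above can be written as Python dictionaries, and any joint probability computed directly from the BN factorization P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R):

```python
from itertools import product

# CPT numbers from the sprinkler slide; keys are truth values of each variable.
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: {True: 0.1, False: 0.9},    # outer key c, inner key s
               False: {True: 0.5, False: 0.5}}
P_R_given_C = {True: {True: 0.8, False: 0.2},
               False: {True: 0.2, False: 0.8}}
P_W_given_SR = {(False, False): {True: 0.0, False: 1.0},
                (True, False):  {True: 0.9, False: 0.1},
                (False, True):  {True: 0.9, False: 0.1},
                (True, True):   {True: 0.99, False: 0.01}}

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) via the Bayes net factorization."""
    return (P_C[c] * P_S_given_C[c][s] * P_R_given_C[c][r]
            * P_W_given_SR[(s, r)][w])

# Sanity check: the joint sums to 1 over all 16 assignments.
total = sum(joint(c, s, r, w) for c, s, r, w in product([True, False], repeat=4))
```

Because each CPT row sums to 1, the product automatically defines a valid joint distribution.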

Bayes nets provide a compact representation of joint probability distributions
For N binary nodes, we need 2^N - 1 parameters to specify P(X_1, ..., X_N). For a BN, we need only O(N 2^K) parameters, where K = max. number of parents (fan-in) per node.
e.g., for the sprinkler network, 2^4 - 1 = 15 free parameters for the full joint vs 1 + 2 + 2 + 4 = 9 free CPT parameters (the tables above list 2 + 4 + 4 + 8 = 18 entries, but half are redundant since each row sums to 1).
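The counting argument can be checked in a few lines; this sketch uses the sprinkler network's fan-ins and counts one free parameter per parent configuration (since each CPT row sums to 1):

```python
# Fan-in (number of parents) of each node in the sprinkler network.
fan_in = {"Cloudy": 0, "Sprinkler": 1, "Rain": 1, "WetGrass": 2}

N = len(fan_in)
full_joint = 2 ** N - 1                            # free params of the full joint table
bn_params = sum(2 ** k for k in fan_in.values())   # free params summed over CPTs
```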

Alarm network

Intensive Care Unit monitoring

Bayes net for genetic pedigree analysis
G_i ∈ {a,b,o} × {a,b,o} = genotype (pair of alleles) of person i
B_i ∈ {a,b,o,ab} = phenotype (blood type) of person i
[Figure: pedigree over Harry, Jackie, Homer, Marge, Selma, Bart, Lisa, Maggie]

Bayes net for genetic pedigree analysis - CPDs
[Figure: genotype nodes G_Harry, G_Betty, G_Homer, G_Marge, G_Peggy, G_Bart, G_Lisa, G_Maggie, each with a phenotype child B_i]
Mendel's laws define P(G | G_p, G_m). Phenotypic expression specifies P(B | G):

G     P(B=a)  P(B=b)  P(B=o)  P(B=ab)
a a   1       0       0       0
a b   0       0       0       1
a o   1       0       0       0
b a   0       0       0       1
b b   0       1       0       0
b o   0       1       0       0
o a   1       0       0       0
o b   0       1       0       0
o o   0       0       1       0

Inference (state estimation)
Inference = estimating hidden quantities from observed ones.
Causal reasoning / prediction (from causes to effects): how likely is it that the clouds cause the grass to be wet? P(W=1 | C=1)
[Figure: sprinkler network with Cloudy observed and WetGrass queried]

Inference (state estimation)
Inference = estimating hidden quantities from observed ones.
Diagnostic reasoning (from effects to causes): the grass is wet; was it caused by the sprinkler or the rain? P(S=1 | W=1) vs P(R=1 | W=1)
Most Probable Explanation: arg max_{s,r} P(S=s, R=r | W=1)
[Figure: sprinkler network with WetGrass observed and Cloudy, Sprinkler, Rain queried]

Explaining away
Explaining away (inter-causal reasoning): P(S=1 | W=1, R=1) < P(S=1 | W=1)
[Figure: sprinkler network with WetGrass and Rain observed, Sprinkler queried]
Coins 1 and 2 are marginally independent, but become dependent when we observe their sum.
[Figure: Coin1 → C1+C2 ← Coin2]

Naive inference
We can compute any query we want by marginalizing the joint, e.g.,
P(S=1 | W=1) = P(S=1, W=1) / P(W=1)
P(S=1, W=1) = \sum_{c,r} P(C=c, S=1, R=r, W=1) = \sum_{c,r} P(C=c) P(S=1 | C=c) P(R=r | C=c) P(W=1 | S=1, R=r)
P(W=1) = \sum_{c,r,s} P(C=c, S=s, R=r, W=1)
This takes O(2^N) time. (Homework 1, question 3.)
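The enumeration above can be carried out directly; this sketch uses the CPT numbers from the sprinkler slide (variables encoded as 0/1):

```python
from itertools import product

P_C = {1: 0.5, 0: 0.5}
P_S = {(1, 1): 0.1, (1, 0): 0.9, (0, 1): 0.5, (0, 0): 0.5}  # P(S=s | C=c), keyed (c, s)
P_R = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.8}  # P(R=r | C=c), keyed (c, r)
P_W1 = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | s, r)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) via the factorization."""
    pw1 = P_W1[(s, r)]
    return P_C[c] * P_S[(c, s)] * P_R[(c, r)] * (pw1 if w == 1 else 1 - pw1)

# Marginalize out the hidden variables by brute-force enumeration.
num = sum(joint(c, 1, r, 1) for c, r in product([0, 1], repeat=2))   # P(S=1, W=1)
den = sum(joint(c, s, r, 1) for c, s, r in product([0, 1], repeat=3))  # P(W=1)
posterior = num / den   # P(S=1 | W=1), roughly 0.43 with these numbers
```

The double/triple sums enumerate every assignment of the hidden variables, which is exactly the O(2^N) cost the slide mentions.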

Simple inference: classification
Example: medical diagnosis. Given a list of observed findings (evidence), such as
e_1: sex = male
e_2: abdomen pain = high
e_3: shortness of breath = false
infer the most likely cause: c* = arg max_c P(c | e_{1:N})

Approach 1: learn a discriminative classifier
We can try to fit a function to approximate P(c | e_{1:N}) using labeled training data (a set of (c, e_{1:N}) pairs). This is the standard approach in supervised machine learning.
Possible functional forms: support vector machine (SVM), neural network, decision tree, boosted decision tree.
See the classes by Nando de Freitas: CPSC 340, Fall 2004 (undergrad machine learning); CPSC 540, Spring 2005 (grad machine learning).

Approach 2: build a generative model and use Bayes' rule to invert it
We can build a causal model of how diseases cause symptoms, and use Bayes' rule to invert it:
P(c | e_{1:N}) = P(e_{1:N} | c) P(c) / P(e_{1:N}) = P(e_{1:N} | c) P(c) / \sum_{c'} P(e_{1:N} | c') P(c')
In words: posterior = (class-conditional likelihood × prior) / marginal likelihood.

Naive Bayes classifier
Simplest generative model: assume the effects are conditionally independent given the cause, E_i ⊥ E_j | C, so
P(E_{1:N} | C) = \prod_{i=1}^N P(E_i | C)
Hence P(c | e_{1:N}) ∝ P(e_{1:N} | c) P(c) = P(c) \prod_{i=1}^N P(e_i | c)
[Figure: class node C with children E_1, ..., E_N]

Naive Bayes classifier
[Figure: class node C with children E_1, ..., E_N]
This model is extremely widely used (e.g., for document classification, spam filtering, etc.) even when the observations are not independent.
P(c | e_{1:N}) ∝ P(e_{1:N} | c) P(c) = P(c) \prod_{i=1}^N P(e_i | c)
e.g., P(C=cancer | E_1=spots, E_2=vomiting, E_3=fever) ∝ P(spots | cancer) P(vomiting | cancer) P(fever | cancer) P(C=cancer)
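A minimal naive Bayes sketch: the posterior score is the prior times the product of per-finding likelihoods, computed in log space for numerical stability. All the probability numbers below are made up for illustration:

```python
import math

prior = {"cancer": 0.01, "flu": 0.99}
# P(finding | class); findings assumed conditionally independent given the class.
likelihood = {
    "cancer": {"spots": 0.3, "vomiting": 0.4, "fever": 0.5},
    "flu":    {"spots": 0.01, "vomiting": 0.3, "fever": 0.8},
}

def classify(findings):
    """Return arg max_c log P(c) + sum_i log P(e_i | c)."""
    scores = {c: math.log(prior[c]) + sum(math.log(likelihood[c][e]) for e in findings)
              for c in prior}
    return max(scores, key=scores.get)

label = classify(["spots", "vomiting", "fever"])
```

Note that the denominator P(e_{1:N}) is the same for every class, so it can be dropped when only the arg max is needed.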

QMR-DT Bayes net (Quick Medical Reference, decision theoretic)
[Figure: bipartite graph with 570 disease nodes (e.g., flu, heart disease, botulism) above 4075 symptom/finding nodes (e.g., sex=f, WBC count, abdomen pain)]

Decision theory
Decision theory = probability theory + utility theory.
Decision (influence) diagrams = Bayes nets + action (decision) nodes + utility (value) nodes.
See David Poole's class, CS 522.
[Figure: medical influence diagram with chance nodes (Sex, Age, Postcoarctectomy Syndrome, Tachypnea, Dyspnea, Tachycardia, Failure To Thrive, Intercostal Recession, Paradoxical Hypertension, Aortic Aneurysm, Aortic Dissection, Paraplegia, Heart Failure, Hepatomegaly, CVA, Pulmonary Crepitations, Cardiomegaly, Myocardial Infarction), a Treatment decision node, Intermediate Result and Late Result nodes, and a utility node U]

POMDPs
POMDP = Partially observed Markov decision process. Special case of influence diagram (infinite horizon).
[Figure: action nodes A_{t-2}, ..., A_{t+2}, hidden state nodes X_{t-1}, ..., X_{t+3}, evidence nodes E_{t-1}, ..., E_{t+3}, reward nodes R_{t-1}, ..., R_{t+2}, and utility node U_{t+3}]

Hidden Markov Model (HMM)
[Figure: chain X_1 → X_2 → X_3 → X_4 → ..., with an observation Y_t below each X_t]
HMM = POMDP − actions − utilities.
Inference goals:
Online state estimation: P(X_t | y_{1:t})
Viterbi decoding (most probable explanation): arg max_{x_{1:t}} P(x_{1:t} | y_{1:t})

Domain                  Hidden state X            Observation Y
Speech                  Words                     Spectrogram
Part-of-speech tagging  Noun/verb/etc.            Words
Gene finding            Intron/exon/non-coding    DNA
Sequence alignment      Insert/delete/match       Amino acids
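Online state estimation P(X_t | y_{1:t}) is computed by the forward algorithm: predict through the transition matrix, weight by the emission likelihood, renormalize. A sketch for a toy 2-state HMM (all the numbers below are hypothetical):

```python
def filter_states(ys, A, B, pi):
    """Forward algorithm: return alpha_t(i) = P(X_t=i | y_{1:t}) for each t."""
    nstates = len(pi)
    alpha = [pi[i] * B[i][ys[0]] for i in range(nstates)]
    s = sum(alpha)
    alpha = [a / s for a in alpha]  # normalize to get a posterior over states
    out = [alpha]
    for y in ys[1:]:
        # Predict: push the current belief through the transition model.
        pred = [sum(alpha[i] * A[i][j] for i in range(nstates)) for j in range(nstates)]
        # Update: weight by the emission likelihood and renormalize.
        alpha = [pred[j] * B[j][y] for j in range(nstates)]
        s = sum(alpha)
        alpha = [a / s for a in alpha]
        out.append(alpha)
    return out

A = [[0.7, 0.3], [0.4, 0.6]]   # transition P(X_t=j | X_{t-1}=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # emission P(Y_t=k | X_t=i)
pi = [0.5, 0.5]                # initial distribution P(X_1)
posteriors = filter_states([0, 0, 1], A, B, pi)
```

Viterbi decoding has the same recursion with the sum over predecessors replaced by a max (plus back-pointers).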

Biosequence analysis using HMMs

Learning
Structure learning (model selection): where does the graph come from?
Parameter learning (parameter estimation): where do the numbers come from?
[Figure: the water sprinkler network with CPT parameters φ_c, φ_s, φ_r, φ_w attached to the four nodes]

Parameter learning
Assume we have iid training cases in which each node is fully observed: D = {(c_i, s_i, r_i, w_i)}.
Bayesian approach: treat the parameters as random variables and compute the posterior distribution P(φ | D) (inference).
Frequentist approach: treat the parameters as unknown constants and find the best estimate, e.g., penalized maximum likelihood (optimization):
φ* = arg max_φ log P(D | φ) − λ C(φ)
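With fully observed iid data, the (unpenalized) maximum-likelihood estimate of a discrete CPT is just a count ratio, e.g., P̂(S=1 | C=c) = N(S=1, C=c) / N(C=c). A sketch on made-up (c_i, s_i) cases:

```python
from collections import Counter

# Fully observed training cases (c_i, s_i) for the Cloudy and Sprinkler nodes.
data = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0), (1, 0), (0, 1)]

pair_counts = Counter(data)          # N(C=c, S=s)
c_counts = Counter(c for c, _ in data)  # N(C=c)

def mle_sprinkler_cpt():
    """MLE of P(S=1 | C=c) for each observed value of c."""
    return {c: pair_counts[(c, 1)] / c_counts[c] for c in c_counts}

cpt = mle_sprinkler_cpt()
```

The Bayesian alternative would put a Dirichlet prior on each row and report the posterior rather than a point estimate.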

Structure learning
Assume we have iid training cases in which each node is fully observed: D = {(c_i, s_i, r_i, w_i)}.
Bayesian approach: treat the graph as a random variable and compute the posterior distribution P(G | D).
Frequentist approach: treat the graph as an unknown constant and find the best estimate, e.g., maximum penalized likelihood:
G* = arg max_G log P(D | G) − λ C(G)

Outline of class
Representation: undirected graphical models; Markov properties of graphs.
Inference:
  Models with discrete hidden nodes: exact (e.g., forwards-backwards for HMMs); approximate (e.g., loopy belief propagation).
  Models with continuous hidden nodes: exact (e.g., Kalman filtering); approximate (e.g., sampling).
Learning: parameters (e.g., EM); structure (e.g., structural EM, causality).

Review: Representation
Graphical models encode conditional independence assumptions. Bayesian networks are based on DAGs.
[Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)

Review: Inference (state estimation)
Inference = estimating hidden quantities from observed ones.
[Figure: three copies of the sprinkler network illustrating causal, diagnostic, and inter-causal queries]
The naive method takes O(2^N) time.

Review: Learning
Structure learning (model selection): where does the graph come from?
Parameter learning (parameter estimation): where do the numbers come from?
[Figure: the water sprinkler network with CPT parameters φ_c, φ_s, φ_r, φ_w attached to the four nodes]

Outline Conditional independence properties of DAGs

Local Markov property
A node is conditionally independent of its non-descendants given its parents.
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and co-parents Z_1j, ..., Z_nj]
P(X_{1:N}) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) \cdots = \prod_{i=1}^N P(X_i | X_{1:i-1}) = \prod_{i=1}^N P(X_i | X_{\pi_i})

Topological ordering
If we get the ordering wrong, the graph will be more complicated, because the parents may not include the relevant variables to screen off the child from its irrelevant ancestors.
[Figure: the burglar-alarm network, Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls, with CPTs:
P(B) = .001; P(E) = .002
P(A | B,E): B=t,E=t: .95; B=t,E=f: .94; B=f,E=t: .29; B=f,E=f: .001
P(J | A): A=t: .90; A=f: .05
P(M | A): A=t: .70; A=f: .01
alongside (a) and (b): the denser graphs obtained from bad node orderings such as MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]

Local Markov property, version 2
A node is conditionally independent of all other nodes given its Markov blanket. The Markov blanket consists of the node's parents, children, and children's parents (co-parents).
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and co-parents Z_1j, ..., Z_nj]

Global Markov properties of DAGs
By chaining together local independencies, we can infer more global independencies.
Defn: X_1 − X_2 − ... − X_n is an active path in a DAG G given evidence E if
1. whenever we have a v-structure X_{i−1} → X_i ← X_{i+1}, then X_i or one of its descendants is in E; and
2. no other node along the path is in E.
Defn: X is d-separated (directed-separated) from Y given E if there is no active path from any x ∈ X to any y ∈ Y given E.
Theorem: X_A ⊥ X_B | X_C if every variable in A is d-separated from every variable in B conditioned on all the variables in C.

Chain
X → Y → Z
Q: When we condition on y, are x and z independent?
P(x, y, z) = P(x) P(y|x) P(z|y), which implies
P(x, z | y) = P(x) P(y|x) P(z|y) / P(y) = P(x, y) P(z|y) / P(y) = P(x|y) P(z|y),
and therefore x ⊥ z | y.
Think of x as the past, y as the present, and z as the future.

Common cause
X ← Y → Z: y is the common cause of the two independent effects x and z.
Q: When we condition on y, are x and z independent?
P(x, y, z) = P(y) P(x|y) P(z|y), which implies
P(x, z | y) = P(x, y, z) / P(y) = P(y) P(x|y) P(z|y) / P(y) = P(x|y) P(z|y),
and therefore x ⊥ z | y.

Explaining Away
X → Y ← Z
Q: When we condition on y, are x and z independent?
P(x, y, z) = P(x) P(z) P(y | x, z)
x and z are marginally independent, but given y they are conditionally dependent. This important effect is called explaining away (Berkson's paradox).
For example, flip two coins independently; let x = coin 1 and z = coin 2, and let y = 1 if the coins come up the same and y = 0 if different. x and z are independent, but if I tell you y, they become coupled!
y is at the bottom of a v-structure, so the path from x to z is active given y (information flows through).
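The coin example can be checked exactly by enumerating the joint P(x, y, z) = P(x) P(z) P(y | x, z), where y is a deterministic function of the coins:

```python
from itertools import product

joint = {}  # P(x, y, z) for two fair coins x, z and y = 1 iff they match
for x, z in product([0, 1], repeat=2):
    y = 1 if x == z else 0
    joint[(x, y, z)] = 0.5 * 0.5

def p(**fixed):
    """Marginal probability of the given partial assignment."""
    return sum(pr for (x, y, z), pr in joint.items()
               if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items()))

# Marginal independence: P(x=1, z=1) == P(x=1) P(z=1).
marg_indep = abs(p(x=1, z=1) - p(x=1) * p(z=1)) < 1e-12

# Conditional dependence: P(x=1, z=1 | y=1) != P(x=1 | y=1) P(z=1 | y=1).
pxz_y = p(x=1, z=1, y=1) / p(y=1)
px_y = p(x=1, y=1) / p(y=1)
pz_y = p(z=1, y=1) / p(y=1)
cond_dep = abs(pxz_y - px_y * pz_y) > 1e-6
```

With these numbers, P(x=1, z=1 | y=1) = 0.5 while P(x=1 | y=1) P(z=1 | y=1) = 0.25, so conditioning on y couples the coins.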

Bayes Ball Algorithm
To check if X_A ⊥ X_B | X_C, we need to check whether every variable in A is d-separated from every variable in B conditioned on all variables in C. In other words, given that all the nodes in X_C are clamped, when we wiggle the nodes in X_A, can we change any of the nodes in X_B?
The Bayes Ball algorithm is such a d-separation test. We shade all nodes in X_C, place balls at each node in X_A (or X_B), let them bounce around according to some rules, and then ask whether any of the balls reach any of the nodes in X_B (or X_A).
[Figure: example graph with nodes X, W, V, Z, Y]
So we need to know what happens when a ball arrives at a node Y on its way from X to Z.

Bayes-Ball Rules
The three cases we considered give the rules:
Chain (X → Y → Z) and common cause (X ← Y → Z): the path is open when Y is unshaded (unobserved) and blocked when Y is shaded (observed).
V-structure (X → Y ← Z): the opposite: the path is blocked when Y is unshaded and open when Y is shaded.

Bayes-Ball Boundary Rules
We also need the boundary conditions: what happens when the ball reaches a root or leaf node.
[Figure: boundary-condition diagrams for balls arriving at X → Y with Y shaded or unshaded]
Here's a trick for the explaining-away case: if y or any of its descendants is shaded, the ball passes through.
Notice that balls can travel opposite to edge directions.
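The rules above (including the descendant trick) can be implemented as a reachability search over (node, direction) pairs; this is a sketch of the standard Bayes-ball/d-separation test, not the course's own code:

```python
from collections import deque

def d_separated(dag, x, y, z):
    """Return True iff x is d-separated from y given the evidence set z.

    dag maps each node to a list of its children; z is a set of observed nodes.
    """
    parents = {n: set() for n in dag}
    for n, children in dag.items():
        for c in children:
            parents[c].add(n)

    # Phase 1: ancestors of the evidence (evidence nodes included), so we can
    # apply the trick: a v-structure is open iff its node has evidence at or below it.
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])

    # Phase 2: BFS over (node, direction); "up" = ball arrived from a child,
    # "down" = ball arrived from a parent. Balls may travel against edge directions.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        n, d = queue.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n == y and n not in z:
            return False  # an active path reaches y
        if d == "up" and n not in z:
            queue.extend((p, "up") for p in parents[n])
            queue.extend((c, "down") for c in dag[n])
        elif d == "down":
            if n not in z:
                queue.extend((c, "down") for c in dag[n])
            if n in anc:  # v-structure opened by evidence at or below n
                queue.extend((p, "up") for p in parents[n])
    return True

# Sprinkler network: C -> S, C -> R, S -> W, R -> W.
sprinkler = {"C": ["S", "R"], "S": ["W"], "R": ["W"], "W": []}
blocked = d_separated(sprinkler, "S", "R", {"C"})       # True: C blocks both paths
opened = d_separated(sprinkler, "S", "R", {"C", "W"})   # False: observing W opens the v-structure
```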

Examples of Bayes-Ball Algorithm
Is x_1 ⊥ x_6 | {x_2, x_3}?
[Figure: example DAG on X_1, ..., X_6]

Examples of Bayes-Ball Algorithm
Is x_2 ⊥ x_3 | {x_1, x_6}?
[Figure: example DAG on X_1, ..., X_6]
Notice: balls can travel opposite to edge directions.

I-equivalence
Defn: Let I(G) be the set of conditional independencies encoded by DAG G (for any parameterization of the CPDs):
I(G) = {(X ⊥ Y | Z) : Z d-separates X from Y}
Defn: G_1 and G_2 are I-equivalent if I(G_1) = I(G_2).
e.g., X → Y is I-equivalent to X ← Y.
Thm: If G_1 and G_2 have the same undirected skeleton and the same set of v-structures, then they are I-equivalent.
[Figure: two I-equivalent DAGs on nodes v, w, x, y, z]

I-equivalence
If G_1 is I-equivalent to G_2, they do not necessarily have the same set of v-structures, e.g., I(G_1) = I(G_2) = ∅ for two fully connected DAGs on X_1, X_2, X_3 with different edge orientations.
Corollary: we can only identify graph structure up to I-equivalence, i.e., we cannot always tell the direction of all the arrows from observational data. We will return to this issue when we discuss structure learning and causality.