
Machine Learning Summer School. Lecture 1: Introduction to Graphical Models. Zoubin Ghahramani (zoubin@eng.cam.ac.uk, http://learning.eng.cam.ac.uk/zoubin/). Department of Engineering, University of Cambridge, UK; Machine Learning Department, Carnegie Mellon University, USA. August 2009.

Three main kinds of graphical models: factor graphs, undirected graphs, and directed graphs. Nodes correspond to random variables; edges represent statistical dependencies between the variables.

Why do we need graphical models? Graphs are an intuitive way of representing and visualising the relationships between many variables. (Examples: family trees, electric circuit diagrams, neural networks.) A graph allows us to abstract out the conditional independence relationships between the variables from the details of their parametric forms. Thus we can answer questions like "Is A dependent on B given that we know the value of C?" just by looking at the graph. Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently. Thus we can answer queries like "What is p(A | C = c)?" without enumerating all settings of all variables in the model. Graphical models = statistics × graph theory × computer science.

Conditional Independence
Conditional Independence: X ⊥ Y | V ⟺ p(X | Y, V) = p(X | V), whenever p(Y, V) > 0.
Also: X ⊥ Y | V ⟺ p(X, Y | V) = p(X | V) p(Y | V).
In general we can think of conditional independence between sets of variables: X ⊥ Y | V ⟺ p(X, Y | V) = p(X | V) p(Y | V).
Marginal Independence: X ⊥ Y ⟺ X ⊥ Y | ∅ ⟺ p(X, Y) = p(X) p(Y).

Conditional and Marginal Independence (Examples)
Amount of Speeding Fine ⊥ Type of Car | Speed
Lung Cancer ⊥ Yellow Teeth | Smoking
(Position, Velocity)_{t+1} ⊥ (Position, Velocity)_{t-1} | (Position, Velocity)_t, Acceleration_t
Child's Genes ⊥ Grandparents' Genes | Parents' Genes
Ability of Team A ⊥ Ability of Team B
not (Ability of Team A ⊥ Ability of Team B | Outcome of A vs B Game)

Factor Graphs
Two types of nodes: the circles in a factor graph represent random variables (e.g. A), and the filled dots represent factors in the joint distribution (e.g. g1(A, C)).
(a) p(A, B, C, D, E) = (1/Z) g1(A, C) g2(B, C, D) g3(C, D, E)
(b) p(A, B, C, D, E) = (1/Z) g1(A, C) g2(B, C) g3(B, D) g4(C, D) g5(C, E) g6(D, E)
The g_i are non-negative functions of their arguments, and Z is a normalization constant. E.g. in (a), if all variables are discrete and take values in a finite set 𝒳:
Z = Σ_a Σ_b Σ_c Σ_d Σ_e g1(A = a, C = c) g2(B = b, C = c, D = d) g3(C = c, D = d, E = e)
Two nodes are neighbors if they share a common factor.
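To make the normalization constant concrete, here is a minimal sketch that computes Z for factorization (a) by brute-force enumeration, assuming binary variables and arbitrary illustrative potential functions (the particular g values are made up, not from the lecture):

```python
import itertools

# Illustrative (made-up) non-negative potentials for factorization (a):
#   p(A,B,C,D,E) = (1/Z) g1(A,C) g2(B,C,D) g3(C,D,E), with binary variables.
def g1(a, c):
    return 1.0 + 2.0 * (a == c)

def g2(b, c, d):
    return 0.5 + b + c * d

def g3(c, d, e):
    return 1.0 if (c + d + e) % 2 == 0 else 0.3

# Z is the sum of the unnormalized product over all joint settings.
Z = sum(g1(a, c) * g2(b, c, d) * g3(c, d, e)
        for a, b, c, d, e in itertools.product([0, 1], repeat=5))

def joint(a, b, c, d, e):
    """Normalized joint probability under factorization (a)."""
    return g1(a, c) * g2(b, c, d) * g3(c, d, e) / Z

# Sanity check: the normalized joint sums to 1.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=5))
print(Z, total)
```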

Factor Graphs
(a) (b) The circles in a factor graph represent random variables; the filled dots represent factors in the joint distribution.
(a) p(A, B, C, D, E) = (1/Z) g1(A, C) g2(B, C, D) g3(C, D, E)
(b) p(A, B, C, D, E) = (1/Z) g1(A, C) g2(B, C) g3(B, D) g4(C, D) g5(C, E) g6(D, E)
Two nodes are neighbors if they share a common factor.
Definition: a path is a sequence of neighboring nodes.
Fact: X ⊥ Y | V if every path between X and Y contains some node in V.
Corollary: given its neighbors, the variable X is conditionally independent of all other variables: X ⊥ Y | ne(X), for all Y ∉ {X} ∪ ne(X).

Proving Conditional Independence
Assume:
p(X, Y, V) = (1/Z) g1(X, V) g2(Y, V)    (1)
(Figure: the factor graph X - g1 - V - g2 - Y.)
We want to show conditional independence:
X ⊥ Y | V ⟺ p(X | Y, V) = p(X | V)    (2)
Summing (1) over X we get:
p(Y, V) = (1/Z) [ Σ_X g1(X, V) ] g2(Y, V)    (3)
Dividing (1) by (3) we get:
p(X | Y, V) = g1(X, V) / Σ_X g1(X, V)    (4)
Since the r.h.s. of (4) does not depend on Y, it follows that X is independent of Y given V. Therefore the factorization (1) implies the conditional independence (2).
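As a quick numerical check of this argument, the following sketch builds a joint of the form (1) over binary X, Y, V using arbitrary made-up potentials (any non-negative g1, g2 would do) and verifies property (2) directly:

```python
import itertools

# Arbitrary strictly positive potentials; any choice gives X ⊥ Y | V.
def g1(x, v):
    return 0.3 + 1.7 * x * v + 0.2 * (1 - x)

def g2(y, v):
    return 1.0 + 0.5 * y + 2.0 * y * v

vals = [0, 1]
Z = sum(g1(x, v) * g2(y, v) for x, y, v in itertools.product(vals, repeat=3))
p = {(x, y, v): g1(x, v) * g2(y, v) / Z
     for x, y, v in itertools.product(vals, repeat=3)}

def p_x1_given(y=None, v=None):
    """p(X = 1 | Y = y, V = v); y = None means Y is not conditioned on."""
    num = sum(pr for (x, yy, vv), pr in p.items()
              if x == 1 and vv == v and (y is None or yy == y))
    den = sum(pr for (x, yy, vv), pr in p.items()
              if vv == v and (y is None or yy == y))
    return num / den

for v in vals:
    # p(X=1 | Y=0, V=v), p(X=1 | Y=1, V=v) and p(X=1 | V=v) all coincide.
    print(v, p_x1_given(y=0, v=v), p_x1_given(y=1, v=v), p_x1_given(v=v))
```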

Undirected Graphical Models
In an Undirected Graphical Model, the joint probability over all variables can be written in a factored form:
p(x) = (1/Z) Π_j g_j(x_{C_j})
where x = (x_1, ..., x_K), the C_j ⊆ {1, ..., K} are subsets of the set of all variables, and x_S ≡ (x_k : k ∈ S).
Graph Specification: create a node for each variable. Connect nodes i and k if there exists a set C_j such that both i ∈ C_j and k ∈ C_j. These sets form the cliques of the graph (fully connected subgraphs).
Note: Undirected Graphical Models are also called Markov Networks. They are very similar to factor graphs.
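A minimal sketch of the graph-specification step, using the clique sets from factorization (a) above (the variable names are those of the running example; the code itself is illustrative, not from the lecture):

```python
from itertools import combinations

# Clique sets C_j from factorization (a): {A,C}, {B,C,D}, {C,D,E}.
cliques = [{"A", "C"}, {"B", "C", "D"}, {"C", "D", "E"}]

# Graph specification: connect nodes i and k whenever they co-occur in some C_j.
edges = set()
for C in cliques:
    for i, k in combinations(sorted(C), 2):
        edges.add((i, k))

# Neighbor lists of the resulting Markov network.
neighbors = {}
for i, k in edges:
    neighbors.setdefault(i, set()).add(k)
    neighbors.setdefault(k, set()).add(i)

print(sorted(edges))
print({v: sorted(ns) for v, ns in sorted(neighbors.items())})
```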

Undirected Graphical Models
p(A, B, C, D, E) = (1/Z) g1(A, C) g2(B, C, D) g3(C, D, E)
Fact: X ⊥ Y | V if every path between X and Y contains some node in V.
Corollary: given its neighbors, the variable X is conditionally independent of all other variables: X ⊥ Y | ne(X), for all Y ∉ {X} ∪ ne(X).
Markov Blanket: V is a Markov Blanket for X iff X ⊥ Y | V for all Y ∉ {X} ∪ V.
Markov Boundary: the minimal Markov Blanket, which is ne(X) for undirected graphs and factor graphs.

Comparing Undirected Graphs and Factor Graphs
(a) (b) (c) All nodes in (a), (b), and (c) have exactly the same neighbors, and therefore these three graphs represent exactly the same conditional independence relationships. (c) also represents the fact that the probability factors into a product of pairwise functions. Consider the case where each variable is discrete and can take on K possible values. Then the functions in (a) and (b) are tables with O(K^3) cells, whereas in (c) they are O(K^2).

Problems with Undirected Graphs and Factor Graphs
In UGs and FGs, many useful independencies are not represented: two variables are connected merely because some other variable depends on them. (Figure: the directed graph Rain → Ground wet ← Sprinkler, and the corresponding undirected graph, in which Rain and Sprinkler are also linked.) This highlights the difference between marginal independence and conditional independence. R and S are marginally independent (i.e. given nothing), but they are conditionally dependent given G.
Explaining away: observing that the sprinkler is on would explain away the observation that the ground was wet, making it less probable that it rained.
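A small numerical illustration of explaining away for the Rain/Sprinkler/Ground-wet network (the probability-table values below are made up for illustration; they are not from the lecture):

```python
import itertools

# Hypothetical probability tables for the Rain / Sprinkler / Ground-wet network.
pR = {1: 0.2, 0: 0.8}                                        # p(Rain)
pS = {1: 0.3, 0: 0.7}                                        # p(Sprinkler)
pG = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.05}  # p(G = 1 | R, S)

def joint(r, s, g):
    pg = pG[(r, s)] if g == 1 else 1.0 - pG[(r, s)]
    return pR[r] * pS[s] * pg

def p_rain(evidence):
    """p(Rain = 1 | evidence), evidence being a dict over {'S', 'G'}."""
    def ok(s, g):
        return all({"S": s, "G": g}[k] == v for k, v in evidence.items())
    num = sum(joint(1, s, g) for s, g in itertools.product([0, 1], repeat=2) if ok(s, g))
    den = sum(joint(r, s, g) for r, s, g in itertools.product([0, 1], repeat=3) if ok(s, g))
    return num / den

print(p_rain({}))                # prior belief in rain: 0.2
print(p_rain({"G": 1}))          # seeing wet ground raises it (about 0.46 here)
print(p_rain({"G": 1, "S": 1}))  # also seeing the sprinkler on lowers it again (about 0.24)
```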

Directed Acyclic Graphical Models (Bayesian Networks)
A DAG Model / Bayesian network (1) corresponds to a factorization of the joint probability distribution:
p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)
In general:
p(X_1, ..., X_n) = Π_{i=1}^{n} p(X_i | X_{pa(i)})
where pa(i) are the parents of node i.
(1) Bayesian networks can be, and often are, learned using non-Bayesian (i.e. frequentist) methods; Bayesian networks (i.e. DAGs) do not require parameter or structure learning using Bayesian methods. Also called belief networks.
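A minimal sketch of this factorization in code, using the parent sets from the five-node example above and hypothetical conditional probability tables (the CPT values are illustrative, not from the lecture):

```python
import itertools

# Parent sets matching p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}

# Hypothetical CPTs over binary variables: prob_one[v](parent values) = p(v = 1 | parents).
prob_one = {
    "A": lambda: 0.6,
    "B": lambda: 0.3,
    "C": lambda a, b: 0.1 + 0.4 * a + 0.4 * b,
    "D": lambda b, c: 0.2 + 0.6 * b * c,
    "E": lambda c, d: 0.5 * c + 0.3 * d,
}

def joint(assign):
    """p(X_1, ..., X_n) = prod_i p(X_i | X_pa(i)) for a full assignment dict."""
    prob = 1.0
    for v, pa in parents.items():
        q = prob_one[v](*(assign[u] for u in pa))
        prob *= q if assign[v] == 1 else 1.0 - q
    return prob

# A Bayesian network is automatically normalized: the joint sums to 1.
total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=5))
print(total)
```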

Directed Acyclic Graphical Models (Bayesian Networks)
Semantics: X ⊥ Y | V if V d-separates X from Y (2).
Definition: V d-separates X from Y if every undirected path (3) between X and Y is blocked by V. A path is blocked by V if there is a node W on the path such that either:
1. W has converging arrows along the path (→ W ←) (4) and neither W nor its descendants are observed (in V), or
2. W does not have converging arrows along the path (→ W → or ← W →) and W is observed (W ∈ V).
Corollary: the Markov Boundary for X is {parents(X) ∪ children(X) ∪ parents-of-children(X)}.
(2) See also the Bayes Ball algorithm in the Appendix. (3) An undirected path ignores the direction of the edges. (4) Note that converging arrows along the path refers only to what happens on that path; such a node is also called a collider.
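D-separation can be checked mechanically. The sketch below is a generic implementation (not code from the lecture) of the standard ancestral-moral-graph criterion: restrict the DAG to the ancestors of X ∪ Y ∪ V, moralize (connect parents of a common child and drop edge directions), delete V, and test whether X and Y are still connected:

```python
from itertools import combinations

def d_separated(parents, X, Y, V):
    """True iff the set V d-separates X from Y in the DAG given by `parents`."""
    # 1. Keep only X ∪ Y ∪ V and their ancestors.
    keep, stack = set(), list(X | Y | V)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Delete V and test whether X can still reach Y.
    frontier, seen = list(X - V), set(X - V)
    while frontier:
        n = frontier.pop()
        if n in Y:
            return False          # an unblocked path exists
        for m in adj[n] - V:
            if m not in seen:
                seen.add(m)
                frontier.append(m)
    return True

# Rain → GroundWet ← Sprinkler (the collider from the previous slide):
parents = {"G": ("R", "S")}
print(d_separated(parents, {"R"}, {"S"}, set()))  # True: marginally independent
print(d_separated(parents, {"R"}, {"S"}, {"G"}))  # False: dependent once G is observed
```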

Examples of D-Separation in DAGs
(Worked examples on the five-node DAG: several conditional independencies hold because every path between the two variables is blocked, by rule 1 at an unobserved collider or by rule 2 at an observed non-collider, while others fail because some path is left unblocked.) Note that it is the absence of edges that conveys conditional independence.

From Directed Trees to Undirected Trees
(Figure: a directed tree on nodes 1-7, rooted at node 3, with children 1, 2, 4 of node 3 and children 5, 6, 7 of node 4.)
p(x_1, x_2, ..., x_7) = p(x_3) p(x_1 | x_3) p(x_2 | x_3) p(x_4 | x_3) p(x_5 | x_4) p(x_6 | x_4) p(x_7 | x_4)
= [ p(x_1, x_3) p(x_2, x_3) p(x_3, x_4) p(x_4, x_5) p(x_4, x_6) p(x_4, x_7) ] / [ p(x_3) p(x_3) p(x_4) p(x_4) p(x_4) ]
= product of cliques / product of clique intersections
= g_1(x_1, x_3) g_2(x_2, x_3) g_3(x_3, x_4) g_4(x_4, x_5) g_5(x_4, x_6) g_6(x_4, x_7)
= Π_i g_i(x_{C_i})
Any directed tree can be converted into an undirected tree representing the same conditional independence relationships, and vice versa.
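A quick numerical check of this identity on the same seven-node tree (the conditional probability tables are randomly generated for illustration; any valid CPTs would do): the joint computed from the directed factorization matches the ratio of pairwise clique marginals to clique-intersection marginals:

```python
import itertools
import random

random.seed(0)
vals = [0, 1]

# The tree from the slide: (parent, child) edges, rooted at node 3.
edges = [(3, 1), (3, 2), (3, 4), (4, 5), (4, 6), (4, 7)]
nodes = sorted({n for e in edges for n in e})

# Made-up CPTs: p(x_3 = 1) and p(x_child = 1 | x_parent) for each edge.
p_root = 0.4
cond = {(c, pv): random.uniform(0.1, 0.9) for (_, c) in edges for pv in vals}

def directed_joint(x):
    """Joint from the directed factorization p(x3) prod p(x_child | x_parent)."""
    prob = p_root if x[3] == 1 else 1.0 - p_root
    for p, c in edges:
        q = cond[(c, x[p])]
        prob *= q if x[c] == 1 else 1.0 - q
    return prob

table = {xs: directed_joint(dict(zip(nodes, xs)))
         for xs in itertools.product(vals, repeat=len(nodes))}

def marginal(query):
    """Marginal probability of a partial assignment (dict node -> value)."""
    return sum(pr for xs, pr in table.items()
               if all(xs[nodes.index(n)] == v for n, v in query.items()))

def undirected_joint(x):
    """Product of edge-clique marginals over clique-intersection marginals."""
    num = 1.0
    for p, c in edges:
        num *= marginal({p: x[p], c: x[c]})
    den = marginal({3: x[3]}) ** 2 * marginal({4: x[4]}) ** 3
    return num / den

x = dict(zip(nodes, (1, 0, 1, 1, 0, 1, 0)))
print(directed_joint(x), undirected_joint(x))  # the two values agree
```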

Directed Graphs for Statistical Models: Plate Notation
Consider the following simple model. A data set of N points is generated i.i.d. from a Gaussian with mean μ and standard deviation σ:
p(x_1, ..., x_N, μ, σ) = p(μ) p(σ) Π_{n=1}^{N} p(x_n | μ, σ)
This can be represented graphically as follows: (Figure: on the left, plate notation with nodes μ and σ pointing into x_n inside a plate labelled N; on the right, the unrolled graph with μ and σ pointing into each of x_1, x_2, ..., x_N.)
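A minimal generative sketch of the model in the plate above (ancestral sampling; the priors on μ and σ are illustrative choices, since the slide does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(N):
    """Ancestral sampling from p(mu) p(sigma) prod_n p(x_n | mu, sigma)."""
    mu = rng.normal(0.0, 10.0)         # illustrative prior p(mu)
    sigma = rng.gamma(2.0, 1.0)        # illustrative prior p(sigma), positive support
    x = rng.normal(mu, sigma, size=N)  # the N i.i.d. points inside the plate
    return mu, sigma, x

mu, sigma, x = sample_dataset(N=1000)
print(mu, sigma, x.mean(), x.std())    # sample statistics track mu and sigma
```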

Expressive Power of Directed and Undirected Graphs
No Directed Graph (Bayesian network) can represent these and only these independencies: no matter how we direct the arrows there will always be two non-adjacent parents sharing a common child, giving a dependence in the Directed Graph where there is an independence in the Undirected Graph. No Undirected Graph or Factor Graph can represent these and only these independencies. Directed graphs are better at expressing causal generative models; undirected graphs are better at representing soft constraints between variables.

Summary
Three kinds of graphical models: directed, undirected, and factor graphs (there are other important classes, e.g. directed mixed graphs). Marginal and conditional independence. Markov boundaries and d-separation. Differences between directed and undirected graphs.
Next lectures: exact inference and the propagation algorithm; parameter and structure learning in graphs; nonparametric approaches to graph learning.

Appendix: Some Examples of Directed Graphical Models
(Figures: factor analysis / probabilistic PCA, with latent variables X_1, ..., X_K, loading matrix Λ, and observations Y_1, Y_2, ...; hidden Markov models and linear dynamical systems, with a chain of states X_1, X_2, X_3, ..., X_T emitting observations Y_1, ..., Y_T; switching state-space models, with switch states S_1, S_2, S_3, continuous states U_1, U_2, U_3 and X_1, X_2, X_3, and observations Y_1, Y_2, Y_3.)

Appendix: Examples of Undirected Graphical Models
Markov Random Fields (used in Computer Vision).
Exponential Language Models (used in Speech and Language Modelling):
p(s) = (1/Z) p_0(s) exp{ Σ_i λ_i f_i(s) }
Products of Experts (widely applicable):
p(x) = (1/Z) Π_j p_j(x | θ_j)
Boltzmann Machines (a kind of Neural Network / Ising Model).

Appendix: Clique Potentials and Undirected Graphs
Definition: a clique is a fully connected subgraph. By clique we usually mean a maximal clique (i.e. one not contained within another clique). x_{C_i} denotes the set of variables in the i-th clique.
p(x_1, ..., x_K) = (1/Z) Π_i g_i(x_{C_i})
where Z = Σ_{x_1} ... Σ_{x_K} Π_i g_i(x_{C_i}) is the normalization. Associated with each clique C_i is a non-negative function g_i(x_{C_i}) which measures compatibility between settings of the variables.
Example: let C_1 = {A, B}, with A ∈ {0, 1} and B ∈ {0, 1}:
A  B  g1(A, B)
0  0  0.2
0  1  0.6
1  0  0.0
1  1  1.2
What does this mean?
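To make the compatibility interpretation concrete, here is a small sketch that treats the single clique above as the whole model (an illustrative simplification): normalizing the potential table then gives the probability of each setting, so larger potential values mean more probable configurations.

```python
# Potential table for the single clique C1 = {A, B} from the example above.
g1 = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.0, (1, 1): 1.2}

# Treating this one clique as the whole model, p(A, B) = g1(A, B) / Z.
Z = sum(g1.values())
p = {ab: v / Z for ab, v in g1.items()}

for (a, b), prob in sorted(p.items()):
    print(f"p(A={a}, B={b}) = {prob:.2f}")
# (A=1, B=0) is impossible (potential 0.0); (A=1, B=1) is the most compatible, hence most probable.
```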

Appendix: Hammersley-Clifford Theorem (1971)
Theorem: a probability function p formed by a normalized product of positive functions on cliques of G is a Markov Field relative to G.
Definition: the distribution p is a Markov Field relative to G if all conditional independence relations represented by G are true of p. G represents the following CI relations: if every path between X and Y in G passes through V, then X ⊥ Y | V.
Proof: we need to show that if p is a product of functions on cliques of G then a variable is conditionally independent of its non-neighbors in G given its neighbors in G; that is, ne(x_l) is a Markov Blanket for x_l. Let x_m ∉ {x_l} ∪ ne(x_l). Then
p(x_l, x_m, ...) = (1/Z) Π_i g_i(x_{C_i}) = (1/Z) [ Π_{i: l ∈ C_i} g_i(x_{C_i}) ] [ Π_{j: l ∉ C_j} g_j(x_{C_j}) ] = (1/Z) f_1(x_l, ne(x_l)) f_2(ne(x_l), x_m, ...)
Since this factors into one term involving x_l and its neighbors and another term not involving x_l, it follows that
p(x_l, x_m | ne(x_l)) = p(x_l | ne(x_l)) p(x_m | ne(x_l)),
i.e. x_l ⊥ x_m | ne(x_l).

Appendix: The Bayes Ball Algorithm
Game: can you get a ball from X to Y without being blocked by V? Depending on the direction the ball came from and the type of node, the ball can pass through (from a parent to all children, from a child to all parents), bounce back (from any parent to all parents, or from any child to all children), or be blocked.
An unobserved (hidden) node (W ∉ V) passes balls through, but also bounces back balls from children.
An observed (given) node (W ∈ V) bounces back balls from parents, but blocks balls from children.