Machine Learning Summer School, Lecture 1: Introduction to Graphical Models
Zoubin Ghahramani
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin/
Department of Engineering, University of Cambridge, UK
Machine Learning Department, Carnegie Mellon University, USA
August 2009
Three main kinds of graphical models: factor graphs, undirected graphs, and directed graphs.

Nodes correspond to random variables.
Edges represent statistical dependencies between the variables.
Why do we need graphical models?

Graphs are an intuitive way of representing and visualising the relationships between many variables. (Examples: family trees, electric circuit diagrams, neural networks.)

A graph allows us to abstract out the conditional independence relationships between the variables from the details of their parametric forms. Thus we can answer questions like "Is A dependent on B given that we know the value of C?" just by looking at the graph.

Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently. Thus we can answer queries like "What is p(A | C = c)?" without enumerating all settings of all variables in the model.

Graphical models = statistics × graph theory × computer science.
Conditional Independence

Conditional Independence:
X ⊥ Y | V means p(X | Y, V) = p(X | V), when p(Y, V) > 0.

Also:
X ⊥ Y | V ⟺ p(X, Y | V) = p(X | V) p(Y | V)

In general we can think of conditional independence between sets of variables:
𝒳 ⊥ 𝒴 | 𝒱 ⟺ p(𝒳, 𝒴 | 𝒱) = p(𝒳 | 𝒱) p(𝒴 | 𝒱)

Marginal Independence:
X ⊥ Y ⟺ X ⊥ Y | ∅ ⟺ p(X, Y) = p(X) p(Y)
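These definitions can be made concrete by brute-force enumeration over a small discrete joint. A minimal Python sketch (the toy distribution below is invented for illustration; it is built as p(v) p(x|v) p(y|v), so X ⊥ Y | V holds by design):

```python
import itertools

# Toy joint over binary X, Y, V, constructed as p(v) p(x|v) p(y|v)
# so that X is conditionally independent of Y given V (numbers invented).
p_v = {0: 0.4, 1: 0.6}
p_x_given_v = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_v[v][x]
p_y_given_v = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p_y_given_v[v][y]

joint = {(x, y, v): p_v[v] * p_x_given_v[v][x] * p_y_given_v[v][y]
         for x, y, v in itertools.product((0, 1), repeat=3)}

def marginal(keep):
    """Sum the joint over the positions not in `keep` (0 = x, 1 = y, 2 = v)."""
    out = {}
    for assignment, pr in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

# Check the definition: p(x, y | v) = p(x | v) p(y | v) for every assignment
p_xv, p_yv, p_vm = marginal([0, 2]), marginal([1, 2]), marginal([2])
cond_indep = all(
    abs(joint[(x, y, v)] / p_vm[(v,)]
        - (p_xv[(x, v)] / p_vm[(v,)]) * (p_yv[(y, v)] / p_vm[(v,)])) < 1e-12
    for (x, y, v) in joint)
```

Enumerating the full joint table is exponential in the number of variables, so this is only practical for tiny models; it is used here purely to make the definition concrete.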
Conditional and Marginal Independence (Examples)

Amount of Speeding Fine ⊥ Type of Car | Speed
Lung Cancer ⊥ Yellow Teeth | Smoking
(Position, Velocity)_{t+1} ⊥ (Position, Velocity)_{t-1} | (Position, Velocity)_t, Acceleration_t
Child's Genes ⊥ Grandparents' Genes | Parents' Genes
Ability of Team A ⊥ Ability of Team B
not (Ability of Team A ⊥ Ability of Team B | Outcome of A vs B Game)
Factor Graphs

Two types of nodes:
The circles in a factor graph represent random variables (e.g. A).
The filled dots represent factors in the joint distribution (e.g. g_1(A, C)).

(a) p(A, B, C, D, E) = (1/Z) g_1(A, C) g_2(B, C, D) g_3(C, D, E)
(b) p(A, B, C, D, E) = (1/Z) g_1(A, C) g_2(B, C) g_3(B, D) g_4(C, D) g_5(C, E) g_6(D, E)

The g_i are non-negative functions of their arguments, and Z is a normalization constant. E.g. in (a), if all variables are discrete and take values in a common finite set 𝒜:

Z = Σ_a Σ_b Σ_c Σ_d Σ_e g_1(A = a, C = c) g_2(B = b, C = c, D = d) g_3(C = c, D = d, E = e)

Two nodes are neighbors if they share a common factor.
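The brute-force sum for Z above translates directly into code. A Python sketch for factor graph (a), with random positive tables standing in for the unspecified g_i:

```python
import itertools
import random

random.seed(0)
K = 3  # each variable takes values in {0, ..., K-1}

# Arbitrary non-negative potentials for factor graph (a):
# p(A,B,C,D,E) is proportional to g1(A,C) g2(B,C,D) g3(C,D,E)
g1 = {(a, c): random.random() for a in range(K) for c in range(K)}
g2 = {(b, c, d): random.random()
      for b in range(K) for c in range(K) for d in range(K)}
g3 = {(c, d, e): random.random()
      for c in range(K) for d in range(K) for e in range(K)}

# Z = sum over all K^5 joint assignments of the product of factors
Z = sum(g1[(a, c)] * g2[(b, c, d)] * g3[(c, d, e)]
        for a, b, c, d, e in itertools.product(range(K), repeat=5))

def p(a, b, c, d, e):
    """Normalized joint probability of one assignment."""
    return g1[(a, c)] * g2[(b, c, d)] * g3[(c, d, e)] / Z
```

The cost of this naive sum grows as K^5 here, and exponentially in general; later lectures cover message-passing algorithms that exploit the factorization to avoid it.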
Factor Graphs

The circles in a factor graph represent random variables.
The filled dots represent factors in the joint distribution.

(a) p(A, B, C, D, E) = (1/Z) g_1(A, C) g_2(B, C, D) g_3(C, D, E)
(b) p(A, B, C, D, E) = (1/Z) g_1(A, C) g_2(B, C) g_3(B, D) g_4(C, D) g_5(C, E) g_6(D, E)

Two nodes are neighbors if they share a common factor.

Definition: a path is a sequence of neighboring nodes.

Fact: X ⊥ Y | 𝒱 if every path between X and Y contains some node V ∈ 𝒱.

Corollary: given the neighbors of X, the variable X is conditionally independent of all other variables: X ⊥ Y | ne(X), for all Y ∉ {X} ∪ ne(X).
Proving Conditional Independence

Assume:
p(X, Y, V) = (1/Z) g_1(X, V) g_2(Y, V)    (1)

We want to show conditional independence:
X ⊥ Y | V, i.e. p(X | Y, V) = p(X | V)    (2)

Summing (1) over X we get:
p(Y, V) = (1/Z) [Σ_X g_1(X, V)] g_2(Y, V)    (3)

Dividing (1) by (3) we get:
p(X | Y, V) = g_1(X, V) / Σ_X g_1(X, V)    (4)

Since the r.h.s. of (4) doesn't depend on Y, it follows that X is independent of Y given V. Therefore factorization (1) implies conditional independence (2).
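The derivation can be checked numerically: build a joint of the form (1) from arbitrary positive factors and verify that p(x | y, v) matches the right-hand side of (4) for every assignment. A Python sketch (the factors are random placeholders):

```python
import itertools
import random

random.seed(1)
vals = [0, 1, 2]

# Arbitrary strictly positive factors g1(x, v) and g2(y, v), as in equation (1)
g1 = {(x, v): random.random() + 0.1 for x in vals for v in vals}
g2 = {(y, v): random.random() + 0.1 for y in vals for v in vals}

Z = sum(g1[(x, v)] * g2[(y, v)] for x, y, v in itertools.product(vals, repeat=3))
joint = {(x, y, v): g1[(x, v)] * g2[(y, v)] / Z
         for x, y, v in itertools.product(vals, repeat=3)}

def p_x_given_yv(x, y, v):
    """Conditional p(x | y, v) computed directly from the joint."""
    p_yv = sum(joint[(xp, y, v)] for xp in vals)
    return joint[(x, y, v)] / p_yv

# Equation (4): p(x | y, v) = g1(x, v) / sum_x' g1(x', v), independent of y
matches_eq4 = all(
    abs(p_x_given_yv(x, y, v)
        - g1[(x, v)] / sum(g1[(xp, v)] for xp in vals)) < 1e-12
    for x, y, v in itertools.product(vals, repeat=3))
```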
Undirected Graphical Models

In an undirected graphical model, the joint probability over all variables can be written in a factored form:

p(x) = (1/Z) Π_j g_j(x_{C_j})

where x = (x_1, ..., x_K), the C_j ⊆ {1, ..., K} are subsets of the set of all variables, and x_S ≡ (x_k : k ∈ S).

Graph specification: create a node for each variable; connect nodes i and k if there exists a set C_j such that both i ∈ C_j and k ∈ C_j. These sets form the cliques of the graph (fully connected subgraphs).

Note: undirected graphical models are also called Markov networks. They are very similar to factor graphs.
Undirected Graphical Models

p(A, B, C, D, E) = (1/Z) g_1(A, C) g_2(B, C, D) g_3(C, D, E)

Fact: X ⊥ Y | 𝒱 if every path between X and Y contains some node V ∈ 𝒱.

Corollary: given the neighbors of X, the variable X is conditionally independent of all other variables: X ⊥ Y | ne(X), for all Y ∉ {X} ∪ ne(X).

Markov Blanket: 𝒱 is a Markov blanket for X iff X ⊥ Y | 𝒱 for all Y ∉ {X} ∪ 𝒱.

Markov Boundary: the minimal Markov blanket, which equals ne(X) for undirected and factor graphs.
Comparing Undirected Graphs and Factor Graphs

All nodes in (a), (b), and (c) have exactly the same neighbors, and therefore these three graphs represent exactly the same conditional independence relationships. (c) also represents the fact that the probability factors into a product of pairwise functions.

Consider the case where each variable is discrete and can take on K possible values. Then the functions in (a) and (b) are tables with O(K^3) cells, whereas in (c) they are O(K^2).
Problems with Undirected Graphs and Factor Graphs

In UGs and FGs, many useful independencies are not represented: two variables end up connected merely because some other variable depends on both of them.

Rain ⊥ Sprinkler, but not (Rain ⊥ Sprinkler | Ground wet)

This highlights the difference between marginal independence and conditional independence. R and S are marginally independent (i.e. given nothing), but they are conditionally dependent given G.

"Explaining away": observing that the sprinkler is on would explain away the observation that the ground was wet, making it less probable that it rained.
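Explaining away is easy to demonstrate with numbers. A Python sketch with invented CPTs for the Rain/Sprinkler/Ground-wet network (the probabilities are illustrative only):

```python
# Invented CPTs: R (rain) and S (sprinkler) are marginally independent;
# G (ground wet) depends on both.  p(R, S, G) = p(R) p(S) p(G | R, S)
p_R = {0: 0.8, 1: 0.2}
p_S = {0: 0.9, 1: 0.1}
p_G1 = {(0, 0): 0.05, (0, 1): 0.90, (1, 0): 0.80, (1, 1): 0.95}  # p(G=1 | r, s)

def joint(r, s, g):
    pg1 = p_G1[(r, s)]
    return p_R[r] * p_S[s] * (pg1 if g == 1 else 1.0 - pg1)

# Posterior probability of rain given wet ground ...
p_r_given_g = (sum(joint(1, s, 1) for s in (0, 1)) /
               sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1)))
# ... and given wet ground AND the sprinkler being on
p_r_given_gs = joint(1, 1, 1) / sum(joint(r, 1, 1) for r in (0, 1))
```

With these numbers, p(R=1 | G=1) ≈ 0.60 while p(R=1 | G=1, S=1) ≈ 0.21: observing the sprinkler is on "explains away" the wet ground, lowering the probability of rain.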
Directed Acyclic Graphical Models (Bayesian Networks)

A DAG model / Bayesian network(*) corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)

In general:

p(X_1, ..., X_n) = Π_{i=1}^n p(X_i | X_{pa(i)})

where pa(i) are the parents of node i.

(*) Bayesian networks can be, and often are, learned using non-Bayesian (i.e. frequentist) methods; Bayesian networks (i.e. DAGs) do not require parameter or structure learning using Bayesian methods. They are also called belief networks.
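A DAG factorization translates directly into code: give each node a conditional probability table (CPT) given its parents and multiply. A Python sketch for the five-node example above, with random binary CPTs standing in for unspecified ones:

```python
import itertools
import random

random.seed(2)

def cpt(n_parents):
    """Random CPT p(X=1 | parents) over binary parents (illustrative values)."""
    return {pa: random.random()
            for pa in itertools.product((0, 1), repeat=n_parents)}

pA, pB = random.random(), random.random()   # p(A=1), p(B=1)
pC, pD, pE = cpt(2), cpt(2), cpt(2)         # p(C=1|A,B), p(D=1|B,C), p(E=1|C,D)

def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)"""
    return (bern(pA, a) * bern(pB, b) * bern(pC[(a, b)], c)
            * bern(pD[(b, c)], d) * bern(pE[(c, d)], e))

total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=5))
```

Because every factor is a properly normalized conditional distribution, the product sums to one over all assignments with no explicit normalization constant Z, unlike the undirected case.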
Directed Acyclic Graphical Models (Bayesian Networks)

Semantics: X ⊥ Y | 𝒱 if 𝒱 d-separates X from Y.

Definition: 𝒱 d-separates X from Y if every undirected path(*) between X and Y is blocked by 𝒱. A path is blocked by 𝒱 if there is a node W on the path such that either:
1. W has converging arrows along the path (→ W ←)(**) and neither W nor its descendants are observed (in 𝒱), or
2. W does not have converging arrows along the path (→ W → or ← W →) and W is observed (W ∈ 𝒱).

Corollary: the Markov boundary for X is {parents(X) ∪ children(X) ∪ parents-of-children(X)}.

See also the Bayes Ball algorithm in the Appendix.
(*) An undirected path ignores the direction of the edges.
(**) Note that "converging arrows along the path" only refers to what happens on that path. Such a node is also called a collider.
Examples of D-Separation in DAGs

Examples (for the DAG with edges A → C, B → C, B → D, C → D, C → E, D → E):

A ⊥ B, since A → C ← B is blocked (rule 1) by C, A → C → D ← B is blocked (rule 1) by D, etc.
not (A ⊥ B | C), since A → C ← B is not blocked.
A ⊥ E | {C, D}, since A → C → E is blocked (rule 2) by C, A → C → D → E is blocked (rule 2) by C, and A → C ← B → D → E is blocked (rule 2) by D.
not (A ⊥ E | C), since A → C ← B → D → E is not blocked.

Note that it is the absence of edges that conveys conditional independence.
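These d-separation claims can be verified numerically: parameterize the DAG with random CPTs and test each independence statement by enumerating the joint table. A Python sketch (the CPTs are random placeholders; variables are indexed A=0, B=1, C=2, D=3, E=4):

```python
import itertools
import random

random.seed(3)
vals = (0, 1)

def cpt(k):
    """Random CPT p(X=1 | k binary parents) (illustrative values)."""
    return {pa: random.random() for pa in itertools.product(vals, repeat=k)}

pA, pB = random.random(), random.random()   # p(A=1), p(B=1)
pC, pD, pE = cpt(2), cpt(2), cpt(2)         # p(C=1|A,B), p(D=1|B,C), p(E=1|C,D)

def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    return (bern(pA, a) * bern(pB, b) * bern(pC[(a, b)], c)
            * bern(pD[(b, c)], d) * bern(pE[(c, d)], e))

table = {xs: joint(*xs) for xs in itertools.product(vals, repeat=5)}

def indep(i, j, cond):
    """Test X_i independent of X_j given the variables indexed by `cond`."""
    for ctx in itertools.product(vals, repeat=len(cond)):
        sel = {xs: pr for xs, pr in table.items()
               if all(xs[k] == v for k, v in zip(cond, ctx))}
        z = sum(sel.values())
        for xi, xj in itertools.product(vals, repeat=2):
            pij = sum(pr for xs, pr in sel.items()
                      if xs[i] == xi and xs[j] == xj) / z
            pi = sum(pr for xs, pr in sel.items() if xs[i] == xi) / z
            pj = sum(pr for xs, pr in sel.items() if xs[j] == xj) / z
            if abs(pij - pi * pj) > 1e-9:
                return False
    return True

a_indep_b = indep(0, 1, [])               # A indep B: guaranteed by the DAG
a_indep_e_given_cd = indep(0, 4, [2, 3])  # A indep E | {C, D}: guaranteed
a_dep_b_given_c = not indep(0, 1, [2])    # dependence holds for generic CPTs
```

The first two independencies hold for every parameterization of this DAG, which is exactly what d-separation guarantees; the conditional dependence given C holds for generic (almost all) CPT values.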
From Directed Trees to Undirected Trees

(Figure: a directed tree on x_1, ..., x_7, rooted at x_3, with children x_1, x_2, x_4 of x_3 and children x_5, x_6, x_7 of x_4.)

p(x_1, x_2, ..., x_7)
= p(x_3) p(x_1 | x_3) p(x_2 | x_3) p(x_4 | x_3) p(x_5 | x_4) p(x_6 | x_4) p(x_7 | x_4)
= [p(x_1, x_3) p(x_2, x_3) p(x_3, x_4) p(x_4, x_5) p(x_4, x_6) p(x_4, x_7)] / [p(x_3) p(x_3) p(x_4) p(x_4) p(x_4)]
= product of cliques / product of clique intersections
= g_1(x_1, x_3) g_2(x_2, x_3) g_3(x_3, x_4) g_4(x_4, x_5) g_5(x_4, x_6) g_6(x_4, x_7)
= Π_i g_i(x_{C_i})

Any directed tree can be converted into an undirected tree representing the same conditional independence relationships, and vice versa.
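The identity above (product of clique marginals over product of clique-intersection marginals) can be verified numerically for a random parameterization of this tree. A Python sketch with binary variables:

```python
import itertools
import random

random.seed(4)

# Random binary CPTs for the directed tree:
# p(x) = p(x3) p(x1|x3) p(x2|x3) p(x4|x3) p(x5|x4) p(x6|x4) p(x7|x4)
parent = {1: 3, 2: 3, 4: 3, 5: 4, 6: 4, 7: 4}
p3 = random.random()                      # p(x3 = 1)
cond = {(c, pv): random.random()          # p(x_c = 1 | x_parent = pv)
        for c in parent for pv in (0, 1)}

def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1

def joint(xs):  # xs is the tuple (x1, ..., x7)
    p = bern(p3, xs[2])
    for c, par in parent.items():
        p *= bern(cond[(c, xs[par - 1])], xs[c - 1])
    return p

table = {xs: joint(xs) for xs in itertools.product((0, 1), repeat=7)}

def marg(keep):
    """Marginal over the (1-based) variable indices in `keep`."""
    out = {}
    for xs, pr in table.items():
        key = tuple(xs[i - 1] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

edges = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (4, 7)]
pair = {e: marg(list(e)) for e in edges}
m3, m4 = marg([3]), marg([4])

def undirected(xs):
    """Product of pairwise clique marginals over intersection marginals."""
    num = 1.0
    for i, j in edges:
        num *= pair[(i, j)][(xs[i - 1], xs[j - 1])]
    return num / (m3[(xs[2],)] ** 2 * m4[(xs[3],)] ** 3)  # p(x3)^2 p(x4)^3

agrees = all(abs(undirected(xs) - pr) < 1e-9 for xs, pr in table.items())
```

The denominator p(x3)^2 p(x4)^3 is exactly the "product of clique intersections" from the slide: x_3 is shared by three cliques (two intersections) and x_4 by four (three intersections).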
Directed Graphs for Statistical Models: Plate Notation

Consider the following simple model. A data set of N points is generated i.i.d. from a Gaussian with mean μ and standard deviation σ:

p(x_1, ..., x_N, μ, σ) = p(μ) p(σ) Π_{n=1}^N p(x_n | μ, σ)

This can be represented graphically as follows. (Figure: on the left, plate notation, in which μ and σ point to a single node x_n enclosed in a plate labelled N; on the right, the equivalent unrolled graph, in which μ and σ point to each of x_1, x_2, ..., x_N.)
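The log joint of this plate model is a direct transcription of the factorization. A Python sketch, assuming (purely for illustration, since the slide leaves the priors unspecified) a N(0, 1) prior on μ and an Exponential(1) prior on σ:

```python
import math
import random

random.seed(5)

def log_gaussian(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def log_joint(xs, mu, sigma):
    """log p(x_1..x_N, mu, sigma) = log p(mu) + log p(sigma) + sum_n log p(x_n | mu, sigma)."""
    lp = log_gaussian(mu, 0.0, 1.0)   # assumed prior p(mu) = N(0, 1)
    lp += -sigma                      # assumed prior p(sigma) = Exp(1), sigma > 0
    lp += sum(log_gaussian(x, mu, sigma) for x in xs)
    return lp

data = [random.gauss(1.0, 0.5) for _ in range(10)]  # simulated data set, N = 10
lp = log_joint(data, mu=1.0, sigma=0.5)
```

The plate corresponds precisely to the `sum` over n: one line of code per repeated factor, however large N is.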
Expressive Power of Directed and Undirected Graphs

No directed graph (Bayesian network) can represent these and only these independencies: no matter how we direct the arrows, there will always be two non-adjacent parents sharing a common child, giving a dependence in the directed graph where the undirected graph has an independence.

No undirected graph or factor graph can represent these and only these independencies.

Directed graphs are better at expressing causal generative models; undirected graphs are better at representing soft constraints between variables.
Summary

Three kinds of graphical models: directed, undirected, factor (there are other important classes, e.g. directed mixed graphs).
Marginal and conditional independence.
Markov boundaries and d-separation.
Differences between directed and undirected graphs.

Next lectures:
exact inference and propagation algorithms
parameter and structure learning in graphs
nonparametric approaches to graph learning
Appendix: Some Examples of Directed Graphical Models

factor analysis / probabilistic PCA: latent factors X_1, ..., X_K generate observations Y_1, ..., Y_D through loadings Λ.
hidden Markov models / linear dynamical systems: a state chain X_1 → X_2 → ... → X_T, with each state X_t emitting an observation Y_t.
switching state-space models: state chains S_1, S_2, S_3, ... and U_1, U_2, U_3, ... jointly generate the states X_t and observations Y_t.
Appendix: Examples of Undirected Graphical Models

Markov random fields (used in computer vision).

Exponential language models (used in speech and language modelling):
p(s) = (1/Z) p_0(s) exp{Σ_i λ_i f_i(s)}

Products of experts (widely applicable):
p(x) = (1/Z) Π_j p_j(x | θ_j)

Boltzmann machines (a kind of neural network / Ising model).
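A product of experts can be sketched in a few lines: multiply the experts' unnormalized densities pointwise and renormalize. The two Gaussian-shaped "experts" over a small discrete domain below are invented for illustration:

```python
import math

# Two invented "experts" over x in {0, ..., 9}: unnormalized Gaussian-shaped scores
expert1 = [math.exp(-0.5 * (x - 3) ** 2) for x in range(10)]       # peaked at 3
expert2 = [math.exp(-0.5 * (x - 5) ** 2 / 4) for x in range(10)]   # broad, peaked at 5

# Product of experts: multiply pointwise, then renormalize
unnorm = [a * b for a, b in zip(expert1, expert2)]
Z = sum(unnorm)
p = [u / Z for u in unnorm]
```

The product concentrates where all experts assign high density, so it is sharper than any individual expert; for large or continuous domains Z is generally intractable, which is why such models need approximate inference.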
Appendix: Clique Potentials and Undirected Graphs

Definition: a clique is a fully connected subgraph. By clique we usually mean maximal clique (i.e. one not contained within another clique). C_i denotes the set of variables in the i-th clique.

p(x_1, ..., x_K) = (1/Z) Π_i g_i(x_{C_i})

where Z = Σ_{x_1} ... Σ_{x_K} Π_i g_i(x_{C_i}) is the normalization.

Associated with each clique C_i is a non-negative function g_i(x_{C_i}) which measures compatibility between settings of the variables.

Example: let C_1 = {A, B}, with A ∈ {0, 1} and B ∈ {0, 1}:

A  B  g_1(A, B)
0  0  0.2
0  1  0.6
1  0  0.0
1  1  1.2

What does this mean?
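One way to answer "what does this mean?": with a single clique, normalizing the potential table gives the joint distribution directly. A Python sketch using the table above:

```python
# The example clique potential g1(A, B) from the table above
g1 = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.0, (1, 1): 1.2}

# With a single clique, Z is just the sum of the potential over all settings
Z = sum(g1.values())
p = {ab: v / Z for ab, v in g1.items()}  # the joint p(A, B)

# The most compatible setting has the highest probability
most_likely = max(p, key=p.get)
```

The zero entry acts as a hard constraint (the setting A=1, B=0 has probability zero), while the other entries express soft preferences, with (A=1, B=1) the most compatible setting.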
Appendix: Hammersley-Clifford Theorem (1971)

Theorem: a probability function p formed by a normalized product of positive functions on cliques of G is a Markov field relative to G.

Definition: the distribution p is a Markov field relative to G if all conditional independence relations represented by G are true of p. G represents the following CI relations: if 𝒱 lies on all paths between X and Y in G, then X ⊥ Y | 𝒱.

Proof: we need to show that if p is a product of functions on cliques of G, then a variable is conditionally independent of its non-neighbors in G given its neighbors in G; that is, ne(x_l) is a Markov blanket for x_l. Let x_m ∉ {x_l} ∪ ne(x_l). Then:

p(x_l, x_m, ...) = (1/Z) Π_i g_i(x_{C_i}) = (1/Z) Π_{i: l ∈ C_i} g_i(x_{C_i}) Π_{j: l ∉ C_j} g_j(x_{C_j}) = (1/Z) f_1(x_l, ne(x_l)) f_2(ne(x_l), x_m, ...)

since every clique containing x_l consists only of x_l and its neighbors, while x_m appears only in cliques not containing x_l. It follows that:

p(x_l, x_m | ne(x_l)) = p(x_l | ne(x_l)) p(x_m | ne(x_l)), i.e. x_l ⊥ x_m | ne(x_l).
Appendix: The Bayes-Ball Algorithm

Game: can you get a ball from X to Y without being blocked by 𝒱?

Depending on the direction the ball came from and the type of node, the ball can pass through (from a parent to all children, or from a child to all parents), bounce back (from any parent to all parents, or from any child to all children), or be blocked.

An unobserved (hidden) node (W ∉ 𝒱) passes balls through, but also bounces back balls from children.

An observed (given) node (W ∈ 𝒱) bounces back balls from parents, but blocks balls from children.