Probabilistic Graphical Models
COMP 790-90 Seminar, Spring 2011
The University of North Carolina at Chapel Hill
Outline
  Introduction
  Representation
  Bayesian network
    Conditional independence
    Inference: variable elimination
    Learning
  Markov random field
    Clique
    Pair-wise MRF
    Inference: belief propagation
  Conclusion
Introduction
  Graphical model: probability theory + graph theory.
  Probability theory ensures consistency and provides a way to interface models to data.
  Graph theory provides an intuitively appealing interface for humans and efficient general-purpose algorithms.
Introduction
  Modularity: a complex system is built by combining simpler parts.
  Graphical models provide a natural tool for two problems: uncertainty and complexity.
  They play an important role in the design and analysis of machine learning algorithms.
Introduction
  Many of the classical multivariate probabilistic systems are special cases of the general graphical model formalism:
    Mixture models
    Factor analysis
    Hidden Markov models
    Kalman filters
  The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.
    Techniques that have been developed in one field can be transferred to other fields.
    It is a framework for the design of new systems.
Representation
  A graphical model represents probabilistic relationships between a set of random variables.
  Variables are represented by nodes: binary events, discrete variables, continuous variables.
  Conditional (in)dependency is represented by the (absence of) edges.
  Directed graphical model: Bayesian network.
  Undirected graphical model: Markov random field.
Bayesian Network
  Directed acyclic graph (DAG).
  Directed edges give causality relationships between variables.
  For each variable X with parents pa(X) there exists a conditional probability $P(X \mid \mathrm{pa}(X))$.
  Discrete variables: conditional probability table (CPT).
  A CPT is a description of a noisy causal process.
An Example: What Causes Wet Grass?
A More Complex Example
  Diagnose an engine start problem.
A More Complex Example
  Computer-based Patient Case Simulation system (CPCS-PM), developed by Parker and Miller.
  422 nodes and 867 arcs: 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.
Joint Distribution
  $P(X_1, \ldots, X_n)$
  If the variables are binary, we need $O(2^n)$ parameters to describe P.
  For the wet grass example, we need $2^4 - 1 = 15$ parameters.
  Can we do better?
  Key idea: use properties of independence.
Independent Random Variables
  X is independent of Y iff $P(X = x \mid Y = y) = P(X = x)$ for all values x, y.
  If X and Y are independent, then $P(X, Y) = P(X \mid Y) P(Y) = P(X) P(Y)$.
  For n mutually independent variables, $P(X_1, \ldots, X_n) = P(X_1) P(X_2) \cdots P(X_n)$.
  Unfortunately, most random variables of interest are not independent of each other, as in the wet grass example.
Conditional Independence
  A more suitable notion is that of conditional independence.
  X and Y are conditionally independent given Z iff
    $P(X \mid Z, Y) = P(X \mid Z)$, or equivalently $P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z)$.
  Notation: $I(X, Y \mid Z)$.
  The conditional independence structure in the grass example (network over the nodes C, S, R, W):
    $I(S, R \mid C)$
    $I(C, W \mid S, R)$
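The two conditional independence statements for the grass example can be checked numerically by brute force. The sketch below is only illustrative: the CPT numbers are the ones implied by the products in the deck's bottom-up inference table, and the helper names (`joint`, `prob`, `cond_indep`) are hypothetical.

```python
from itertools import product

# Assumed CPTs, read off the products in the inference table of the slides.
P_C_TRUE = 0.5
P_S_TRUE = {True: 0.1, False: 0.5}                     # P(S=T | C)
P_R_TRUE = {True: 0.8, False: 0.2}                     # P(R=T | C)
P_W_TRUE = {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)
NAMES = ("C", "S", "R", "W")

def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """P(C, S, R, W) as the product of the local conditional distributions."""
    return (bern(P_C_TRUE, c) * bern(P_S_TRUE[c], s)
            * bern(P_R_TRUE[c], r) * bern(P_W_TRUE[(s, r)], w))

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(S=True, C=False)."""
    total = 0.0
    for values in product([True, False], repeat=4):
        assignment = dict(zip(NAMES, values))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(*values)
    return total

def cond_indep(x, y, given, tol=1e-12):
    """Check P(x, y | z) = P(x | z) P(y | z) for every value combination."""
    for vals in product([True, False], repeat=2 + len(given)):
        xv, yv, zv = vals[0], vals[1], dict(zip(given, vals[2:]))
        pz = prob(**zv)
        if pz == 0:
            continue  # conditioning event has zero probability
        lhs = prob(**{x: xv, y: yv}, **zv) / pz
        rhs = (prob(**{x: xv}, **zv) / pz) * (prob(**{y: yv}, **zv) / pz)
        if abs(lhs - rhs) > tol:
            return False
    return True

print(cond_indep("S", "R", ["C"]))       # I(S, R | C)
print(cond_indep("C", "W", ["S", "R"]))  # I(C, W | S, R)
print(cond_indep("S", "R", []))          # S and R without conditioning
```

Under these assumed numbers the first two checks hold, while S and R are not independent once C is left unobserved, which is the point of the slide: conditional independence holds where plain independence does not.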
Conditional Independence
  Directed Markov property: each random variable X is conditionally independent of its non-descendants, given its parents Pa(X).
  Formally, $P(X \mid \mathrm{NonDesc}(X), \mathrm{Pa}(X)) = P(X \mid \mathrm{Pa}(X))$.
  Notation: $I(X, \mathrm{NonDesc}(X) \mid \mathrm{Pa}(X))$.
  [Figure: node X with nodes Y_1 to Y_4, labeled as parent, descendant, and non-descendant]
Factorized Representation
  The full joint distribution is defined in terms of local conditional distributions (obtained via the chain rule):
    $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i))$
  The graphical structure encodes conditional independences among random variables.
  It represents the full joint distribution over the variables more compactly.
  Complexity reduction:
    Joint probability of n binary variables: $O(2^n)$
    Factorized form: $O(n \cdot 2^k)$, where k is the maximal number of parents of a node.
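To make the complexity reduction concrete, the following sketch counts free parameters for binary variables, with and without the factorization. The function names are hypothetical, and the parent lists encode the wet grass structure used in the example (C is the parent of S and R, which are the parents of W).

```python
def full_joint_params(n):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n - 1

def factorized_params(parents):
    """Free parameters of a Bayesian network over binary variables.

    parents maps each node to the list of its parents; a node with k parents
    needs one number P(node = T | parent values) per parent configuration.
    """
    return sum(2 ** len(pa) for pa in parents.values())

# Wet grass structure: C is the parent of S and R; S and R are the parents of W.
wet_grass = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}

n = len(wet_grass)
k = max(len(pa) for pa in wet_grass.values())
print(full_joint_params(n))          # unrestricted joint
print(factorized_params(wet_grass))  # factorized form
print(n * 2 ** k)                    # the O(n * 2^k) upper bound
```

For this four-node network the script prints 15, 9 and 16: the factorized count stays below the n * 2^k bound, and the gap to 2^n - 1 widens quickly as n grows while k stays small.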
Factorized Representation
  The wet grass example:
    $P(C, S, R, W) = P(W \mid S, R) P(R \mid C) P(S \mid C) P(C)$
  Only 1 + 2 + 2 + 4 = 9 parameters are needed.
Inference
  Computation of the conditional probability distribution of one set of nodes, given a model and another set of nodes.
  Bottom-up
    Given observations (leaves), the probabilities of the causes can be calculated accordingly.
    Diagnosis: from effects to causes.
  Top-down
    Knowledge influences the probability of the outcome.
    Prediction: from causes to effects.
Basic Computation
  The value of x depends on y.
    Dependency: conditional probability $P(x \mid y)$
    Knowledge about y: prior probability $P(y)$
  Product rule: $P(x, y) = P(x \mid y) P(y)$
  Sum rule (marginalization): $P(x) = \sum_y P(x, y)$ and $P(y) = \sum_x P(x, y)$
  Bayes' rule: $P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$
    posterior = conditional likelihood × prior / likelihood
Inference: Bottom-Up
  Observe: wet grass (denoted by W = T).
  Two possible causes: rain or sprinkler. Which is more likely?
  Apply Bayes' rule. The evidence term is
    $P(W = T) = \sum_{c, s, r} P(C = c, S = s, R = r, W = T)$
    $= 0.0396 + 0.009 + 0.324 + 0 + 0.0495 + 0.18 + 0.045 + 0 = 0.6471$
Inference: Bottom-Up
  Joint probabilities for W = T, each computed as $P(W \mid S, R) \cdot P(R \mid C) \cdot P(S \mid C) \cdot P(C)$:
    C  S  R  W   P(C, S, R, W)
    T  T  T  T   0.99 * 0.8 * 0.1 * 0.5 = 0.0396
    T  T  F  T   0.9  * 0.2 * 0.1 * 0.5 = 0.009
    T  F  T  T   0.9  * 0.8 * 0.9 * 0.5 = 0.324
    T  F  F  T   0    * 0.2 * 0.9 * 0.5 = 0
    F  T  T  T   0.99 * 0.2 * 0.5 * 0.5 = 0.0495
    F  T  F  T   0.9  * 0.8 * 0.5 * 0.5 = 0.18
    F  F  T  T   0.9  * 0.2 * 0.5 * 0.5 = 0.045
    F  F  F  T   0    * 0.8 * 0.5 * 0.5 = 0
Inference: Bottom-Up
  Apply Bayes' rule for the sprinkler:
    $P(S = T \mid W = T) = \frac{P(S = T, W = T)}{P(W = T)} = \frac{\sum_{c, r} P(C = c, S = T, R = r, W = T)}{P(W = T)}$
    $= \frac{0.0396 + 0.009 + 0.0495 + 0.18}{0.6471} = \frac{0.2781}{0.6471} \approx 0.43$
Inference: Bottom-Up
  Apply Bayes' rule for rain:
    $P(R = T \mid W = T) = \frac{P(R = T, W = T)}{P(W = T)} = \frac{\sum_{c, s} P(C = c, S = s, R = T, W = T)}{P(W = T)}$
    $= \frac{0.0396 + 0.324 + 0.0495 + 0.045}{0.6471} = \frac{0.4581}{0.6471} \approx 0.708$
  Rain is therefore the more likely cause of the wet grass (0.708 versus 0.43 for the sprinkler).
Inference: Top-Down
  The probability that the grass will be wet given that it is cloudy:
    $P(W = T \mid C = T) = \frac{P(W = T, C = T)}{P(C = T)} = \frac{\sum_{S, R} P(C = T, S, R, W = T)}{\sum_{S, R, W} P(C = T, S, R, W)}$
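The bottom-up and top-down queries above can be reproduced by summing the factorized joint over the unobserved variables. This is a minimal enumeration sketch, assuming the CPT values implied by the products in the joint probability table; `joint` and `marginal` are hypothetical helper names.

```python
from itertools import product

# Assumed CPTs, read off the products in the joint probability table.
P_C_TRUE = 0.5
P_S_TRUE = {True: 0.1, False: 0.5}                     # P(S=T | C)
P_R_TRUE = {True: 0.8, False: 0.2}                     # P(R=T | C)
P_W_TRUE = {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)
NAMES = ("C", "S", "R", "W")

def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """P(C, S, R, W) = P(W | S, R) P(R | C) P(S | C) P(C)."""
    return (bern(P_W_TRUE[(s, r)], w) * bern(P_R_TRUE[c], r)
            * bern(P_S_TRUE[c], s) * bern(P_C_TRUE, c))

def marginal(**fixed):
    """Sum the joint over every variable that is not fixed (sum rule)."""
    total = 0.0
    for values in product([True, False], repeat=4):
        assignment = dict(zip(NAMES, values))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(*values)
    return total

# Bottom-up (diagnosis): which cause is more likely given wet grass?
p_w = marginal(W=True)
p_s_given_w = marginal(S=True, W=True) / p_w
p_r_given_w = marginal(R=True, W=True) / p_w

# Top-down (prediction): how likely is wet grass given that it is cloudy?
p_w_given_c = marginal(W=True, C=True) / marginal(C=True)

print(round(p_w, 4), round(p_s_given_w, 3), round(p_r_given_w, 3))
print(round(p_w_given_c, 4))
```

Under these assumptions the bottom-up values agree with the hand computation (0.6471, about 0.43 and about 0.708), and the top-down query evaluates to 0.3726 / 0.5 = 0.7452.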
Inference Algorithms
  The exact inference problem in a general graphical model is NP-hard.
  Exact inference
    Variable elimination
    Message passing algorithm
    Clustering and junction tree approach
  Approximate inference
    Loopy belief propagation
    Sampling (Monte Carlo) methods
    Variational methods
Variable Elimination
  Computing P(W = T).
  Approach 1: blind approach
    Sum out all uninstantiated variables from the full joint.
    Computation cost: $O(2^n)$
    The wet grass example:
      Number of additions: 14
      Number of products: ?
  Solution: explore the graph structure.
Variable Elimination
  Approach 2: interleave sums and products.
  The key idea is to push the sums in as far as possible.
  In computation: first compute the innermost sum, then the next sum outward, and so on.
  Computation cost: $O(n \cdot 2^k)$
  For the wet grass example:
    Number of additions: ?
    Number of products: ?
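Pushing the sums inward can be sketched for P(W = T) in the wet grass network. The elimination order used here (R first, then S, then C) is an assumption chosen for illustration, since the slide's own intermediate formulas are not available; the CPT numbers are again those implied by the inference table, and the function names are hypothetical.

```python
from itertools import product

# Assumed CPTs, read off the products in the inference table.
P_C_TRUE = 0.5
P_S_TRUE = {True: 0.1, False: 0.5}                     # P(S=T | C)
P_R_TRUE = {True: 0.8, False: 0.2}                     # P(R=T | C)
P_W_TRUE = {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)
VALUES = (True, False)

def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def p_w_blind():
    """Approach 1: build every joint entry, then add them all up."""
    return sum(bern(P_C_TRUE, c) * bern(P_S_TRUE[c], s)
               * bern(P_R_TRUE[c], r) * P_W_TRUE[(s, r)]
               for c, s, r in product(VALUES, repeat=3))

def p_w_pushed():
    """Approach 2: sum_c P(c) sum_s P(s|c) sum_r P(r|c) P(W=T|s,r)."""
    total = 0.0
    for c in VALUES:
        over_s = 0.0
        for s in VALUES:
            # Innermost sum: eliminate R for this (c, s) pair.
            over_r = sum(bern(P_R_TRUE[c], r) * P_W_TRUE[(s, r)] for r in VALUES)
            over_s += bern(P_S_TRUE[c], s) * over_r
        total += bern(P_C_TRUE, c) * over_s
    return total

print(p_w_blind(), p_w_pushed())
```

Both functions return the same marginal (0.6471 under the assumed CPTs); the second one multiplies each factor only after the sums that do not depend on it have been collapsed, which is where the savings come from on larger networks.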
Learning
  Learn parameters or structure from data.
  Structure learning: find the correct connectivity between existing nodes.
  Parameter learning: find maximum likelihood estimates of the parameters of each conditional probability distribution.
  In practice, a lot of knowledge (structures and probabilities) comes from domain experts.
Learning
  Structure   Observation   Method
  Known       Full          Maximum likelihood (ML) estimation
  Known       Partial       Expectation-maximization (EM) algorithm
  Unknown     Full          Model selection
  Unknown     Partial       EM + model selection
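For the first row of the table (known structure, full observation), maximum likelihood estimation of a discrete CPT reduces to counting. The sketch below assumes binary variables and uses a small made-up data set; `ml_cpt` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

def ml_cpt(samples, child, parents):
    """ML estimate of P(child = True | parents) from fully observed samples.

    samples is a list of dicts mapping variable names to booleans. The result
    maps each observed tuple of parent values to a relative frequency.
    """
    count_true = Counter()
    count_all = Counter()
    for row in samples:
        key = tuple(row[p] for p in parents)
        count_all[key] += 1
        if row[child]:
            count_true[key] += 1
    return {key: count_true[key] / n for key, n in count_all.items()}

# Hypothetical fully observed data for the edge from C to R.
data = [
    {"C": True, "R": True},
    {"C": True, "R": True},
    {"C": True, "R": True},
    {"C": True, "R": False},
    {"C": False, "R": True},
    {"C": False, "R": False},
    {"C": False, "R": False},
    {"C": False, "R": False},
]

cpt_r = ml_cpt(data, "R", ["C"])   # P(R = T | C)
cpt_c = ml_cpt(data, "C", [])      # P(C = T), keyed by the empty tuple
print(cpt_r, cpt_c)
```

With this toy data the estimates are P(R = T | C = T) = 0.75, P(R = T | C = F) = 0.25 and P(C = T) = 0.5. Parent configurations that never occur in the data get no entry, which is one reason MAP estimates with a prior are often preferred on small data sets.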
Model Selection Method
  Select a "good" model from all possible models and use it as if it were the correct model.
  Having defined a scoring function, a search algorithm is then used to find the network structure that receives the highest score, i.e. the one that best fits the prior knowledge and the data.
  Unfortunately, the number of DAGs on n variables is super-exponential in n. The usual approach is therefore to use local search algorithms (e.g., greedy hill climbing) to search through the space of graphs.
EM Algorithm
  Expectation (E) step: use the current parameters to estimate the unobserved data.
  Maximization (M) step: use the estimated data to do ML/MAP estimation of the parameters.
  Repeat the E and M steps until convergence.
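The E and M steps can be illustrated on a very small model with one hidden variable. The sketch below is not taken from the slides: it assumes a hypothetical two-coin setup in which each trial uses one of two coins with unknown bias, the coin identity is hidden with an equal prior, and only the number of heads is observed. The data and starting values are made up.

```python
import math

def em_two_coins(heads, flips, theta=(0.6, 0.5), iters=50):
    """EM for two coin biases when the coin used in each trial is hidden.

    heads[i] is the number of heads among `flips` tosses in trial i.
    Returns the two estimated biases and the log-likelihood per iteration
    (up to the constant binomial coefficients).
    """
    theta_a, theta_b = theta
    history = []
    for _ in range(iters):
        # E step: posterior probability that coin A produced each trial.
        resp = []
        loglik = 0.0
        for h in heads:
            like_a = theta_a ** h * (1 - theta_a) ** (flips - h)
            like_b = theta_b ** h * (1 - theta_b) ** (flips - h)
            resp.append(like_a / (like_a + like_b))
            loglik += math.log(0.5 * like_a + 0.5 * like_b)
        history.append(loglik)
        # M step: ML estimates from the expected (soft) head counts.
        weight_a = sum(resp)
        weight_b = len(heads) - weight_a
        theta_a = sum(r * h for r, h in zip(resp, heads)) / (weight_a * flips)
        theta_b = sum((1 - r) * h for r, h in zip(resp, heads)) / (weight_b * flips)
    return theta_a, theta_b, history

# Hypothetical data: heads observed in five trials of ten tosses each.
theta_a, theta_b, history = em_two_coins([5, 9, 8, 4, 7], flips=10)
print(theta_a, theta_b)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes "repeat until convergence" a sensible stopping rule. EM only finds a local optimum, so the result depends on the starting values.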
Markov Random Fields
  Undirected edges simply give correlations between variables.
  The joint distribution is a product of local functions over the cliques of the graph:
    $P(x) = \frac{1}{Z} \prod_C P_C(x_C)$
  where $P_C(x_C)$ are the clique potentials and Z is a normalization constant.
  Example (graph over the nodes x, y, z, w):
    $P(x, y, z, w) = \frac{1}{Z} P_A(x, y, w) P_B(x, y, z)$
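The role of the normalization constant Z can be seen by brute force on the four-variable example. The potentials below are made-up values, since the slides do not give any, and the variables are assumed to be binary.

```python
from itertools import product

STATES = (0, 1)

def p_a(x, y, w):
    """Hypothetical clique potential over (x, y, w): favors full agreement."""
    return 2.0 if x == y == w else 1.0

def p_b(x, y, z):
    """Hypothetical clique potential over (x, y, z): favors x equal to z."""
    return 3.0 if x == z else 1.0

# Z sums the unnormalized product of potentials over all joint states.
Z = sum(p_a(x, y, w) * p_b(x, y, z)
        for x, y, z, w in product(STATES, repeat=4))

def prob(x, y, z, w):
    """P(x, y, z, w) = (1 / Z) * P_A(x, y, w) * P_B(x, y, z)."""
    return p_a(x, y, w) * p_b(x, y, z) / Z

total = sum(prob(*state) for state in product(STATES, repeat=4))
print(Z, total)
```

Unlike the CPTs of a Bayesian network, the potentials need not sum to one; only the normalized product is a probability distribution. Computing Z this way costs time exponential in the number of variables, which is why the inference methods discussed later matter.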
The Clique
  A clique: a set of variables which are the arguments of a local function.
  The order of a clique: the number of variables in the clique.
  Example:
    $P(x_1, \ldots, x_5) = P_A(x_1) P_B(x_2) P_C(x_1, x_2, x_3) P_D(x_3, x_4) P_E(x_3, x_5)$
    $P_A$, $P_B$: first-order cliques; $P_C$: third-order clique; $P_D$, $P_E$: second-order cliques.
Regular and Arbitrary Graph
  [Figure omitted]
Pair-wise MRF
  The order of the cliques is at most two.
  Commonly used in computer vision applications.
  Infer the underlying unknown variables through local observations and the smoothness prior.
  [Figure: 3 x 3 grid. Each observed image node o_1, ..., o_9 is linked to a hidden node i_1, ..., i_9 (the underlying truth) by a local evidence term φ_k(i_k). Neighboring hidden nodes are linked by compatibility terms: ψ_12, ψ_23, ψ_45, ψ_56, ψ_78, ψ_89 along the rows and ψ_14, ψ_25, ψ_36, ψ_47, ψ_58, ψ_69 along the columns.]
Pair-wise MRF
  $\psi_{xy}(i_x, i_y)$ is an $n_x \times n_y$ matrix.
  $\phi_x(i_x)$ is a vector of length $n_x$, where $n_x$ is the number of states of $i_x$.
Pair-wise MRF
  [Figure: 3 x 3 pair-wise MRF grid with observed nodes o_1, ..., o_9 and hidden nodes i_1, ..., i_9]
  Given all the evidence nodes $y_i$, we want to find the most likely state for all the hidden nodes $x_i$, which is equivalent to maximizing
    $P(\{x\}) = \frac{1}{Z} \prod_{ij} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i)$
Belief Propagation
  Beliefs are used to approximate this probability:
    $b_x(i_x) \propto \phi_x(i_x) \prod_{z} m_{z \to x}(i_x)$, where the product runs over the neighbors z of x
    $m_{x \to y}(i_y) = \sum_{i_x} \phi_x(i_x) \psi_{xy}(i_x, i_y) \prod_{z \neq y} m_{z \to x}(i_x)$, where the product runs over the neighbors z of x other than y
Belief Propagation
  [Figure: hidden node i_5 with local evidence φ_5(i_5) from o_5 and incoming messages from its neighbors i_2, i_4, i_6, i_8]
  Beliefs are used to approximate this probability:
    $b_5(i_5) \propto \phi_5(i_5) m_{2 \to 5}(i_5) m_{4 \to 5}(i_5) m_{6 \to 5}(i_5) m_{8 \to 5}(i_5)$
Belief Propagation
  [Figure: the message from i_4 to i_5 is built from φ_4(i_4), ψ_45(i_4, i_5) and the messages that i_4 receives from i_1 and i_7]
  $b_5(i_5) \propto \phi_5(i_5) m_{2 \to 5}(i_5) m_{4 \to 5}(i_5) m_{6 \to 5}(i_5) m_{8 \to 5}(i_5)$
  $m_{4 \to 5}(i_5) = \sum_{i_4} \phi_4(i_4) \psi_{45}(i_4, i_5) m_{1 \to 4}(i_4) m_{7 \to 4}(i_4)$
Belief Propagation
  Flowchart of the algorithm:
    Input: $\phi_x(i_x)$ and $\psi_{xy}(i_x, i_y)$.
    For every node $i_x$, compute $m_{z \to x}(i_x)$ for each neighbor $i_z$.
    Compute $b_x(i_x)$.
    Does $b_x(i_x)$ converge? If not, recompute the messages; if yes, continue.
    Output the most likely state for every node $i_x$.
Application: Learning-Based Image Super-Resolution
  Extrapolate higher-resolution images from low-resolution inputs.
  The basic assumption: there are correlations between low-frequency and high-frequency information.
  A node corresponds to an image patch.
    $\phi_x(x_p)$: the probability of the high-frequency content given the observed low-frequency content
    $\psi_{xy}(x_p, x_q)$: the smoothness prior between neighboring patches
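The message and belief updates can be sketched in a few lines for a small pair-wise MRF. This is an illustrative sum-product implementation with synchronous updates and normalized messages; it stops when the messages (rather than the beliefs) stop changing, and the three-node chain with its potentials is a made-up example. On a tree-structured graph such as this chain the beliefs equal the exact marginals, which the brute-force helper is there to check; on graphs with loops, such as the 3 x 3 grid, the result is only an approximation.

```python
import math
from itertools import product

def belief_propagation(phi, psi, max_iters=50, tol=1e-12):
    """phi[x] is the local evidence vector; psi[(x, y)] the compatibility matrix."""
    nbrs = {x: set() for x in phi}
    for x, y in psi:
        nbrs[x].add(y)
        nbrs[y].add(x)

    def compat(x, y, ix, iy):
        # psi is stored once per undirected edge; transpose when read backwards.
        return psi[(x, y)][ix][iy] if (x, y) in psi else psi[(y, x)][iy][ix]

    # msgs[(x, y)][iy] is the message from x to y, initialised to uniform.
    msgs = {(x, y): [1.0 / len(phi[y])] * len(phi[y])
            for x in phi for y in nbrs[x]}
    for _ in range(max_iters):
        new = {}
        for (x, y) in msgs:
            out = [sum(phi[x][ix] * compat(x, y, ix, iy)
                       * math.prod(msgs[(z, x)][ix] for z in nbrs[x] if z != y)
                       for ix in range(len(phi[x])))
                   for iy in range(len(phi[y]))]
            total = sum(out)
            new[(x, y)] = [v / total for v in out]
        delta = max(abs(a - b) for key in msgs for a, b in zip(new[key], msgs[key]))
        msgs = new
        if delta < tol:
            break
    beliefs = {}
    for x in phi:
        b = [phi[x][ix] * math.prod(msgs[(z, x)][ix] for z in nbrs[x])
             for ix in range(len(phi[x]))]
        total = sum(b)
        beliefs[x] = [v / total for v in b]
    return beliefs

def exact_marginals(phi, psi):
    """Brute-force marginals of (1/Z) prod psi prod phi, for checking only."""
    nodes = list(phi)
    marg = {x: [0.0] * len(phi[x]) for x in nodes}
    for states in product(*(range(len(phi[x])) for x in nodes)):
        s = dict(zip(nodes, states))
        weight = math.prod(phi[x][s[x]] for x in nodes)
        weight *= math.prod(m[s[x]][s[y]] for (x, y), m in psi.items())
        for x in nodes:
            marg[x][s[x]] += weight
    return {x: [v / sum(m) for v in m] for x, m in marg.items()}

# Hypothetical chain 1 - 2 - 3 with binary states and a smoothness prior.
smooth = [[0.9, 0.1], [0.1, 0.9]]
phi = {1: [0.7, 0.3], 2: [0.5, 0.5], 3: [0.2, 0.8]}
psi = {(1, 2): smooth, (2, 3): smooth}

beliefs = belief_propagation(phi, psi)
exact = exact_marginals(phi, psi)
most_likely = {x: max(range(len(b)), key=b.__getitem__) for x, b in beliefs.items()}
print(beliefs, most_likely)
```

Normalizing each message is a common numerical safeguard and does not change the normalized beliefs. Picking the largest entry of each belief gives the most likely state per node, as in the last step of the flowchart.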
Image Super-Resolution
  [Figure panels: (a) images from a "generic" example set; (b) input (magnified ×4); (c) cubic spline; (d) super-resolution result; (e) actual full-resolution image]
Conclusion
  A graphical model is a graphical representation of the probabilistic structure of a set of random variables, along with functions that can be used to derive the joint probability distribution.
  Intuitive interface for modeling.
  Modular: a useful tool for managing complexity.
  Common formalism for many models.