Probabilistic Graphical Models
Lecture Notes, Fall 2009
November 2009
Byoung-Tak Zhang
School of Computer Science and Engineering & Cognitive Science, Brain Science, and Bioinformatics
Seoul National University
http://bi.snu.ac.kr/~btzhang/

Chapter 7. Latent Variable Models

7.1 Factor Graphs

Directed vs. Undirected Graphs
Both graphical models
- Specify a factorization (how to express the joint distribution)
- Define a set of conditional independence properties
Fig. 7-1. Parent-child local conditional distribution
Fig. 7-2. Maximal clique potential function

Relation of Directed and Undirected Graphs
Converting a directed graph to an undirected graph
Case 1: a chain (straight line). In this case, the partition function Z = 1.
Case 2: general case. Moralization = "marrying the parents"
- Add additional undirected links between all pairs of parents
- Drop the arrows
- The result is the moral graph
A fully connected moral graph has no conditional independence properties, in contrast to the original directed graph. We should therefore add the fewest extra links needed, so as to retain the maximum number of independence properties.
Factor Graphs
A factor graph is a bipartite graph representing a joint distribution in the form of a product of factors.
Factors in directed/undirected graphs
- Introduce additional nodes for the factors themselves
- Explicit decomposition/factorization, e.g. \psi(x_1, x_2, x_3) = \psi_a(x_1, x_2)\, \psi_b(x_1, x_2)\, \psi_c(x_2, x_3)
(Factor graphs are bipartite.)

Definition. Given a factorization of a function g(X_1, \ldots, X_n),
g(X_1, \ldots, X_n) = \prod_{j=1}^{m} f_j(S_j),
where S_j \subseteq \{X_1, \ldots, X_n\}, the corresponding factor graph G = (X, F, E) consists of variable vertices X = \{X_1, \ldots, X_n\}, factor vertices F = \{f_1, \ldots, f_m\}, and edges E. The edges depend on the factorization as follows: there is an undirected edge between factor vertex f_j and variable vertex X_i when X_i \in S_j. The function g is assumed to be real-valued.

Factor graphs can be combined with message passing algorithms to efficiently compute certain characteristics of the function, such as the marginals.

Example. Consider a function that factorizes as follows:
g(X_1, X_2, X_3) = f_1(X_1)\, f_2(X_1, X_2)\, f_3(X_1, X_2)\, f_4(X_2, X_3),
with a corresponding factor graph.
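The example factorization above can be represented as a simple bipartite data structure. A minimal sketch (the dictionary representation is illustrative, not from the notes):

```python
# Factor graph for g(X1, X2, X3) = f1(X1) f2(X1, X2) f3(X1, X2) f4(X2, X3):
# each factor vertex maps to the variable vertices it touches.
factors = {
    "f1": ["X1"],
    "f2": ["X1", "X2"],
    "f3": ["X1", "X2"],
    "f4": ["X2", "X3"],
}

# Edge set E of the bipartite graph: one undirected edge (f_j, X_i)
# whenever X_i appears in the argument list of f_j.
edges = [(f, x) for f, xs in factors.items() for x in xs]

# The variable vertices are recovered from the factor scopes.
variables = sorted({x for xs in factors.values() for x in xs})

print(variables)   # ['X1', 'X2', 'X3']
print(len(edges))  # 7
```

Note that the graph is bipartite by construction: every edge connects a factor vertex to a variable vertex, never two vertices of the same kind.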
This factor graph has a cycle. If we merge f_2(X_1, X_2) and f_3(X_1, X_2) into a single factor, the resulting factor graph is a tree. This is an important distinction, as message passing algorithms are usually exact for trees but only approximate for graphs with cycles.

Inference on factor graphs.
Sum-product algorithm: evaluating local marginals over nodes or subsets of nodes

p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})
p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)
p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)
     = \prod_{s \in \mathrm{ne}(x)} \Big[ \sum_{X_s} F_s(x, X_s) \Big]
     = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)

Here ne(x) denotes the factor nodes neighboring x, and X_s is the set of variables in the subtree connected to x via factor f_s. The messages \mu_{f_s \to x}(x) from the factor node f_s to the variable node x are computed in the vertices and passed along the edges.

Max-sum algorithm: finding the most probable state.

The Hammersley-Clifford theorem shows that other probabilistic models such as Markov networks and Bayesian networks can be represented as factor graphs. The factor graph representation is frequently used when performing inference over such networks using belief propagation. On the other hand, Bayesian networks are more naturally suited for generative models, as they can directly represent the causalities of the model.

Properties of Factor Graphs
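The key identity behind the sum-product algorithm — pushing sums inside the product over subtrees — can be checked numerically on the smallest nontrivial case, a two-factor chain over binary variables. This is an illustrative sketch, not from the notes; the factor tables are random:

```python
import numpy as np

# Chain p(x1, x2, x3) ∝ f_a(x1, x2) f_b(x2, x3) over binary variables.
# The marginal of x2 via sum-product is the normalized product of the
# two incoming messages:
#   mu_a(x2) = sum_{x1} f_a(x1, x2),  mu_b(x2) = sum_{x3} f_b(x2, x3).
rng = np.random.default_rng(0)
fa = rng.random((2, 2))   # f_a(x1, x2), arbitrary positive factor
fb = rng.random((2, 2))   # f_b(x2, x3)

mu_a = fa.sum(axis=0)          # message f_a -> x2
mu_b = fb.sum(axis=1)          # message f_b -> x2
p2_msg = mu_a * mu_b
p2_msg /= p2_msg.sum()         # normalize (divide by Z)

# Brute-force marginal over the full joint, for comparison.
joint = np.einsum('ij,jk->ijk', fa, fb)   # joint[x1, x2, x3]
p2_bf = joint.sum(axis=(0, 2))
p2_bf /= p2_bf.sum()

print(np.allclose(p2_msg, p2_bf))   # True: messages give the exact marginal
```

On a tree every marginal can be computed this way, with total cost linear in the number of edges rather than exponential in the number of variables.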
Converting directed and undirected graphs into factor graphs
- Undirected graph -> factor graph. Note: for a given fully connected undirected graph, two (or more) different factor graphs are possible; factor graphs are more specific than undirected graphs.
- Directed graph -> factor graph. For the same directed graph, two or more factor graphs are possible.
There can be multiple factor graphs, all of which correspond to the same undirected/directed graph.
Converting a directed/undirected tree to a factor graph: the result is again a tree (no loops; one and only one path connects any two nodes).
Converting a directed polytree to a factor graph: the result is again a tree. Cf. converting a directed polytree into an undirected graph creates loops, due to the moralization step.
Fig. Polytree, the corresponding undirected graph (moral graph), and a factor graph
- Local cycles in a directed graph can be removed on conversion to a factor graph
- Factor graphs are more specific about the precise form of the factorization
- For a fully connected undirected graph, two (or more) factor graphs are possible
Directed and undirected graphs can express different conditional independence properties.
Fig. A directed graph and an undirected graph expressing different conditional independence properties
7.2 Probabilistic Latent Semantic Analysis

Latent Variable Models
Latent variables
- Variables that are not directly observed but are rather inferred from other variables that are observed and directly measured
Latent variable models
- Explain the statistical properties of the observed variables in terms of the latent variables
General formulation:
p(x) = \int p(z, x) \, dz = \int p(z) \, p(x \mid z) \, dz

PLSA
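For a discrete latent variable the integral in the general formulation becomes a sum, p(x) = \sum_z p(z) p(x|z). A minimal numeric sketch (both distributions below are made up for illustration):

```python
import numpy as np

# Discrete analogue of p(x) = sum_z p(z) p(x|z):
# z has 2 states, x has 3 states (sizes are illustrative).
p_z = np.array([0.4, 0.6])                  # prior over the latent variable
p_x_given_z = np.array([[0.2, 0.5, 0.3],    # p(x | z=0)
                        [0.7, 0.1, 0.2]])   # p(x | z=1)

# Marginalize out the latent variable: p(x) = sum_z p(z) p(x|z).
p_x = p_z @ p_x_given_z

print(p_x)   # [0.5, 0.26, 0.24] -- a valid distribution over x
```

The observed marginal p(x) is a convex combination of the conditionals, weighted by the latent prior, which is exactly the structure exploited below by mixture models.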
7.3 Gaussian Mixture Models
Graphical representation of a mixture model
A K-dimensional binary random variable z having a 1-of-K representation
A Gaussian mixture distribution can be written as a linear superposition of Gaussians:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)   (*)

An equivalent formulation of the Gaussian mixture involving an explicit latent variable:

p(z_k = 1) = \pi_k, \quad p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}
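As a quick numerical sanity check that the superposition \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) is itself a valid density, here is a 1-D sketch (all parameter values are illustrative, not from the text):

```python
import numpy as np

# Illustrative 1-D mixture parameters, K = 3.
pi  = np.array([0.3, 0.5, 0.2])   # mixing coefficients, sum to 1
mu  = np.array([-2.0, 0.0, 3.0])  # component means
sig = np.array([0.5, 1.0, 0.8])   # component standard deviations

def gauss(x, m, s):
    """Univariate Gaussian density N(x | m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return np.sum(pi * gauss(x, mu, sig))

# Riemann sum over a grid wide enough to cover all components.
grid = np.arange(-10.0, 10.0, 0.01)
total = sum(mixture_pdf(x) for x in grid) * 0.01

print(total)   # integrates to ~1, as a density must
```

Because each \mathcal{N}(\cdot) integrates to 1 and the \pi_k sum to 1, the superposition integrates to 1 as well.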
p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

with \sum_k z_k = 1, \; p(z_k = 1) = \pi_k, \; 0 \le \pi_k \le 1, \; \sum_{k=1}^{K} \pi_k = 1.

The marginal distribution of x is a Gaussian mixture of the form (*); for every observed data point x_n there is a corresponding latent variable z_n.

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

\gamma(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.

Generating random samples distributed according to the Gaussian mixture model: generate a value for z, denoted \hat{z}, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x \mid \hat{z}).
a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue.
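Both steps above — ancestral sampling from the mixture and evaluating responsibilities — fit in a few lines of NumPy. This is an illustrative sketch for a 1-D, three-component mixture; all parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
pi  = np.array([0.3, 0.5, 0.2])   # mixing coefficients
mu  = np.array([-2.0, 0.0, 3.0])  # component means
sig = np.array([0.5, 1.0, 0.8])   # component standard deviations

def gauss(x, m, s):
    """Univariate Gaussian density N(x | m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Ancestral sampling: draw z_hat ~ p(z), then x ~ p(x | z_hat).
z_hat = rng.choice(3, size=1000, p=pi)
x = rng.normal(mu[z_hat], sig[z_hat])

# Responsibilities: gamma(z_k) = pi_k N(x|mu_k) / sum_j pi_j N(x|mu_j),
# computed for every sample at once (rows index samples, columns components).
num = pi * gauss(x[:, None], mu, sig)
gamma = num / num.sum(axis=1, keepdims=True)

print(gamma.shape)   # (1000, 3); each row sums to 1
```

Each row of `gamma` is a posterior distribution over the components for one sample, which is exactly the quantity the E-step of EM computes below.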
b. The corresponding samples from the marginal distribution p(x).
c. The same samples in which the colors represent the value of the responsibilities \gamma(z_{nk}) associated with each data point x_n, illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated.

Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points \{x_n\}, with corresponding latent points \{z_n\}
Data set: X (an N x D matrix) with n-th row x_n^T
Latent variables: Z (an N x K matrix) with rows z_n^T
The log of the likelihood function:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big\}

7.4 Learning Gaussian Mixtures by EM
Gaussian mixture models can be learned by the expectation-maximization (EM) algorithm.
Repeat:
- Expectation step: calculate the posterior probabilities (responsibilities) using the current parameters
- Maximization step: re-estimate the parameters based on the responsibilities
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters.
1. Initialize the means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k
2. E-step: evaluate the posterior probabilities (responsibilities) using the current parameter values:

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

3. M-step: re-estimate the means, covariances, and mixing coefficients using the result of the E-step.
\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
\pi_k^{new} = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

4. Evaluate the log likelihood

\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big\}

If converged, terminate; otherwise, go to Step 2.

The General EM Algorithm
In maximizing the log likelihood function
\ln p(X \mid \theta) = \ln \Big\{ \sum_Z p(X, Z \mid \theta) \Big\},
the summation inside the logarithm prevents the logarithm from acting directly on the joint distribution. In contrast, the log likelihood function for the complete data set \{X, Z\} is straightforward to maximize. In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z \mid X, \Theta) of the latent variables.

General EM Algorithm
1. Choose an initial setting for the parameters \Theta^{old}
2. E step: evaluate p(Z \mid X, \Theta^{old})
3. M step: evaluate \Theta^{new} given by
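Steps 1-4 above fit in a short loop. The following is an illustrative sketch for a 1-D, two-component mixture; the data, initial values, and convergence threshold are all made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 1-D data: two well-separated Gaussian clusters (illustrative).
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.8, 300)])
N, K = len(x), 2

# Step 1: initialize means, variances, and mixing coefficients.
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
var = np.ones(K)

def log_lik():
    pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(pdf @ pi).sum()

prev = -np.inf
for _ in range(100):
    # Step 2 (E-step): responsibilities gamma(z_nk), one row per data point.
    pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = pi * pdf
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Step 3 (M-step): re-estimate parameters from the responsibilities.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N

    # Step 4: evaluate the log likelihood and test for convergence.
    cur = log_lik()
    if cur - prev < 1e-6:
        break
    prev = cur

print(np.sort(mu))   # close to the true means -2 and 2
```

The log likelihood increases monotonically with each EM iteration, which makes the difference `cur - prev` a natural convergence criterion.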
\Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^{old})
Q(\Theta, \Theta^{old}) = \sum_Z p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)

4. If the convergence criterion is not satisfied, let \Theta^{old} \leftarrow \Theta^{new} and return to Step 2.

The EM algorithm can also be used for finding MAP (maximum a posteriori) solutions, using a modified M-step that maximizes
Q(\theta, \theta^{old}) + \ln p(\theta)