
Probabilistic Graphical Models. Guest Lecture by Narges Razavian, Machine Learning Class, April 14, 2017

Today
- What is a probabilistic graphical model, and why is it useful?
- Bayesian Networks
- Basic inference
- Generative models
- Fancier inference (when some variables are unobserved)
- How to learn model parameters from data
- Undirected graphical models
- Inference (belief propagation)
- New directions in PGM research & wrapping up

"What I cannot create, I do not understand." - Richard Feynman

Generative models vs. discriminative models. Discriminative models learn P(Y | X). It's easier and requires less data, but it is only useful for one particular task: given X, what is P(Y | X)? [Examples: logistic regression, feed-forward or convolutional neural networks, etc.] Generative models instead learn P(Y, X) completely. Once they do that, they can compute everything:
P(X) = Σy P(X, Y)
P(Y) = Σx P(X, Y)
P(Y | X) = P(Y, X) / Σy P(Y, X)
[Caveat: no free lunch! You want to answer every question under the sun? You need more data!]
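As a minimal sketch of the "compute everything from the joint" point (the 2x2 joint table below is invented for illustration, not from the lecture), P(X), P(Y), and P(Y | X) all fall out of P(X, Y) by summing and dividing:

```python
import numpy as np

# Hypothetical joint P(X, Y) over binary X (rows) and binary Y (columns).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)              # P(X)     = sum_y P(X, Y)
p_y = joint.sum(axis=0)              # P(Y)     = sum_x P(X, Y)
p_y_given_x = joint / p_x[:, None]   # P(Y | X) = P(X, Y) / sum_y P(X, Y)

print(p_x, p_y, p_y_given_x, sep="\n")
```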

Probabilistic Graphical Models: the main classic approach to modeling P(Y, X) = P(Y1, ..., YM, X1, ..., XD).

Some calculations on space: imagine each variable is binary and we model P(Y1, ..., YM, X1, ..., XD). How many parameters do we need to estimate from data to specify P(Y, X)? 2^(M+D) - 1.

Too many parameters! What can be done?
1) Look for conditional independences.
2) Use the chain rule for probabilities to break P(Y, X) into smaller pieces.
3) Rewrite P(Y, X) as a product of smaller factors.
   a) Maybe you have more data for a subset of variables.
4) Simplify some of the modeling assumptions to cut parameters.
   a) E.g., assume the data is multivariate Gaussian.
   b) E.g., assume conditional independencies even if they don't really always apply.

Bayesian Networks: use the chain rule for probabilities, P(X1, ..., XD) = Πi P(Xi | X1, ..., Xi-1). This is always true, with no approximations or assumptions, so there is no reduction in the number of parameters either. The BN conditional independence assumption: for some variables, P(Xi | X1, ..., Xi-1) is approximated with P(Xi | subset of (X1, ..., Xi-1)). This subset of (X1, ..., Xi-1) is referred to as Parents(Xi). This reduces the parameters (in the binary case, for instance) from 2^(i-1) to 2^|Parents(Xi)|.

Bayesian Networks: number of parameters in the binary case, for the student network (X1: Difficulty, X2: Intelligence, X3: Grade with parents X1 and X2, X4: SAT with parent X2, X5: Letter of recommendation with parent X3).

Variable and assumption | Raw chain rule | BN chain rule
X1 (Difficulty): P(X1) | P(X1): 1 | P(X1): 1
X2 (Intelligence): P(X2 | X1) = P(X2) | P(X2 | X1): 2 | P(X2): 1
X3 (Grade): P(X3 | X1, X2) = P(X3 | X1, X2) | P(X3 | X1, X2): 4 | P(X3 | X1, X2): 4
X4 (SAT score): P(X4 | X1, X2, X3) = P(X4 | X2) | P(X4 | X1, X2, X3): 8 | P(X4 | X2): 2
X5 (Letter): P(X5 | X1, X2, X3, X4) = P(X5 | X3) | P(X5 | X1, X2, X3, X4): 16 | P(X5 | X3): 2
Total: P(X1, X2, X3, X4, X5) | 1+2+4+8+16 = 31 | 1+1+4+2+2 = 10
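A small sketch of the same bookkeeping (all CPT numbers below are invented, not from the lecture): with the BN factorization, the 10 parameters counted in the table above are enough to evaluate the full joint over all 2^5 outcomes.

```python
# Hypothetical CPTs for the binary student network (numbers invented).
# Each table stores P(variable = 1 | parent values); P(= 0) is the complement.
p_x1 = 0.4                                   # P(Difficulty = 1)
p_x2 = 0.7                                   # P(Intelligence = 1)
p_x3 = {(0, 0): 0.3, (0, 1): 0.8,            # P(Grade = 1 | Difficulty, Intelligence)
        (1, 0): 0.1, (1, 1): 0.5}
p_x4 = {0: 0.2, 1: 0.9}                      # P(SAT = 1 | Intelligence)
p_x5 = {0: 0.1, 1: 0.6}                      # P(Letter = 1 | Grade)

def bernoulli(p, v):                         # P(value v) for a Bernoulli(p)
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5):
    """P(X1, ..., X5) as the product of the five local BN factors."""
    return (bernoulli(p_x1, x1) * bernoulli(p_x2, x2) *
            bernoulli(p_x3[(x1, x2)], x3) *
            bernoulli(p_x4[x2], x4) * bernoulli(p_x5[x3], x5))

# 1 + 1 + 4 + 2 + 2 = 10 numbers specify the whole 2^5-outcome joint.
print(joint(1, 1, 1, 0, 1))
```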

An example of a BN for SNPs.

Benefits of Bayesian Networks
1) Once estimated, they can answer any conditional or marginal query!
   a) This is called inference.
2) Fewer parameters to estimate!
3) We can start putting prior information into the network.
4) We can incorporate LATENT (hidden/unobserved) variables based on how we/domain experts think variables might be related.
5) Generating samples from the distribution becomes super easy.

Inference in Bayesian Networks. Query types:
1) Conditional probabilities: P(Y | X) = ?  P(Xi = a | X\i = B, Y = C) = ?
2) Maximum a posteriori estimates: argmax_xi P(Xi | X\i) = ?  argmax_yi P(Yi | X) = ?
(Running example: the student network with X1: Difficulty, X2: Intelligence, X3: Grade, X4: SAT, X5: Letter of recommendation.)

Key operation: marginalization, P(X) = Σy P(X, Y).
P(X5 | X2 = a) = ?
P(X5 | X2 = a) = P(X5, X2 = a) / P(X2 = a)
P(X5, X2 = a) = Σ_{X1, X3, X4} P(X1, X2 = a, X3, X4, X5)
P(X2 = a) = Σ_{X1, X3, X4, X5} P(X1, X2 = a, X3, X4, X5)

Marginalize from the first parents (the roots) down to the query variable. This method is called sum-product or variable elimination.

Marginalization when P(X) = Σy P(X, Y): computing P(X5 | X2 = a) in the student network.

P(X5 | X2 = a) = P(X5, X2 = a) / P(X2 = a)

P(X5, X2 = a) = Σ_{X1, X3, X4} P(X1, X2 = a, X3, X4, X5)
= Σ_{X1, X3, X4} P(X1) P(X2 = a) P(X3 | X1, X2 = a) P(X4 | X2 = a) P(X5 | X3)
= P(X2 = a) Σ_{X1, X3, X4} P(X1) P(X3 | X1, X2 = a) P(X4 | X2 = a) P(X5 | X3)
= P(X2 = a) Σ_{X1, X3} P(X1) P(X3 | X1, X2 = a) P(X5 | X3) Σ_{X4} P(X4 | X2 = a)
= P(X2 = a) Σ_{X1, X3} P(X1) P(X3 | X1, X2 = a) P(X5 | X3)          (the sum over X4 equals 1)
= P(X2 = a) Σ_{X3} P(X5 | X3) Σ_{X1} P(X3 | X1, X2 = a) P(X1)
= P(X2 = a) Σ_{X3} P(X5 | X3) f_{X2=a}(X3)
= P(X2 = a) g_{X2=a}(X5)

So P(X5 | X2 = a) = g_{X2=a}(X5).
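A minimal sketch of this elimination, with invented CPT numbers for the binary student network (the names f and g mirror the intermediate factors in the derivation above):

```python
# Hypothetical CPTs: P(var = 1 | parents); P(var = 0) is the complement.
p_x1 = 0.4                                    # P(Difficulty = 1)
p_x3 = {(0, 0): 0.3, (0, 1): 0.8,             # P(Grade = 1 | Difficulty, Intelligence)
        (1, 0): 0.1, (1, 1): 0.5}
p_x5 = {0: 0.1, 1: 0.6}                       # P(Letter = 1 | Grade)

def p(prob_one, value):
    return prob_one if value == 1 else 1.0 - prob_one

a = 1  # evidence: Intelligence X2 = a

# f_{X2=a}(x3) = sum_{x1} P(x1) P(x3 | x1, X2=a)   -- eliminates X1
f = {x3: sum(p(p_x1, x1) * p(p_x3[(x1, a)], x3) for x1 in (0, 1))
     for x3 in (0, 1)}

# g_{X2=a}(x5) = sum_{x3} P(x5 | x3) f(x3)          -- eliminates X3
g = {x5: sum(p(p_x5[x3], x5) * f[x3] for x3 in (0, 1)) for x5 in (0, 1)}

# P(X5 | X2=a) = g / sum(g); X4 sums to 1 and P(X2=a) cancels in the ratio.
norm = sum(g.values())
print({x5: g[x5] / norm for x5 in (0, 1)})
```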

Estimating parameters of a Bayesian Network: maximum likelihood estimation (and sometimes maximum pseudolikelihood estimation).

How to estimate the parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known.

If you remember from other lectures:
Likelihood(D; parameters) = Π_{Dj in data} P(Dj | parameters)
= Π_{Dj in data} Π_{Xij in Dj} P(Xij | Par(Xij), parameters{Par(Xij) -> Xij})
= Π_{i in variable set} Π_{Dj in data} P(Xij | Par(Xij), parameters{Par(Xij) -> Xij})
= Π_{i in variable set} (independent local terms, each a function only of the observed Xij and Par(Xij))

MLE parameters{Par(Xij) -> Xij} = argmax (local likelihood of the observed Xij and Par(Xij) in the data!)

How to estimate the parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known.

If variables are discrete:
P(Xi = a | Parents(Xi) = B) = Count(Xi = a & Pa(Xi) = B) / Count(Pa(Xi) = B)

If variables are continuous:
P(Xi = a | Parents(Xi) = B) = fit some PDF function(a, B)
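A minimal sketch of the counting estimator above for one discrete CPT, on a tiny invented dataset (the variable names are hypothetical):

```python
from collections import Counter

# Hypothetical fully observed data: rows are (difficulty, intelligence, grade).
data = [(0, 1, 1), (0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]

# MLE for P(Grade = g | Difficulty = d, Intelligence = i) by counting,
# exactly as in the formula above.
joint_counts = Counter((d, i, g) for d, i, g in data)
parent_counts = Counter((d, i) for d, i, _ in data)

cpt = {(d, i, g): joint_counts[(d, i, g)] / parent_counts[(d, i)]
       for (d, i, g) in joint_counts}
print(cpt)   # e.g. P(Grade=1 | Difficulty=0, Intelligence=1) = 3/3
```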

How to estimate the parameters of a Bayesian Network? (1) Observed variables and known structure, continuous case: P(Xi = a | Parents(Xi) = B) = Some_PDF_Function(a, B), for example:
- a single multivariate Gaussian
- a mixture of multivariate Gaussians
- non-parametric density functions

How to estimate parameters of a Bayesian Network? (2) You have observed all Y,X variables, but dependency structure is NOT known

Structure learning when all variables are observed:
1) Neighborhood selection with the Lasso: L1-regularized regression per variable, learned using the other variables. The result is not necessarily a tree structure.
2) Tree learning via the Chow-Liu method: for each variable pair, find the empirical distribution P(Xi, Xj) = Count(Xi, Xj) / M (M = number of samples); compute the mutual information I(Xi, Xj); use I(Xi, Xj) as the edge weight in a graph and learn a maximum spanning tree (a rough sketch follows below).
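A rough sketch of the Chow-Liu idea under the assumptions above (binary data, empirical mutual information as edge weights, a greedy Kruskal-style maximum spanning tree); the data here is random and purely illustrative:

```python
import numpy as np
from itertools import combinations

def mutual_information(xi, xj):
    """Empirical I(Xi; Xj) for two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Maximum spanning tree over pairwise MI (greedy Kruskal with union-find)."""
    d = data.shape[1]
    edges = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))      # hypothetical binary samples
print(chow_liu_tree(data))
```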

How to estimate the parameters of a Bayesian Network? (3) You have unobserved variables, but the dependency structure is known. These are the most commonly used Bayesian Networks these days!

In practice, Bayes Nets are mostly used to inject priors and structure into the task. Example: modeling documents as a collection of topics, where each topic is a distribution over words, i.e. topic modeling via Latent Dirichlet Allocation.

In practice, Bayes Nets are mostly used to inject priors and structure. Example: correcting for hidden confounders in expression data.

Estimation/inference when there are missing values:
1) Sometimes P(observed) = Σ_unobserved P(observed & unobserved) has a closed form!
   a) Combining Gaussian conditionals and priors usually leads to Gaussian marginals (closed form).
   b) If your prior distribution on the latent variables is conjugate to the conditional distribution, you get a closed form.
      i) There are lots of known pairs of distributions: Gaussian and Gaussian; Dirichlet and Multinomial; Gamma and Gamma; etc.
2) Expectation maximization (EM) (a small sketch follows this list):
   a) Initialize the parameters randomly.
   b) Do inference (E step): MAP-estimate the most likely values of the unobserved variables.
   c) Re-estimate (M step): MLE-estimate the parameters.
   d) Iterate (b) and (c) until the parameters converge.
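A generic EM sketch, not the lecture's example: a two-component 1-D Gaussian mixture, where the latent variable is each point's cluster. Note that it uses the standard soft responsibilities in the E step rather than the hard MAP assignment described in step (b) above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data from two hidden clusters (the latent variable is which
# cluster each point came from).
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

# Random-ish initialisation of the parameters.
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E step: posterior responsibility of cluster 1 for each point.
    like0 = (1 - pi) * np.exp(-0.5 * ((x - mu[0]) / sigma[0]) ** 2) / sigma[0]
    like1 = pi * np.exp(-0.5 * ((x - mu[1]) / sigma[1]) ** 2) / sigma[1]
    r = like1 / (like0 + like1)
    # M step: re-estimate the parameters from the (soft) completed data.
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])
    sigma = np.sqrt(np.array([np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r),
                              np.sum(r * (x - mu[1]) ** 2) / np.sum(r)]))

print(pi, mu, sigma)   # should recover roughly 0.4, (-2, 3), (1, 1)
```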

Estimation/inference when there are missing values (continued):
3) Gibbs sampling or MCMC (a tiny sketch follows this list):
   a) Initialize randomly.
   b) Sample each variable in turn from P(xi | everything else).
   c) Burn-in: repeat over the variables and draw thousands of samples sequentially.
   d) Eventually (it's proven), you'll be sampling from the true distribution! Use those samples to compute anything you want. (Note that in those samples all variables are observed.)
4) Variational inference (approximate with another model that HAS a closed form):
   a) Find a functional mapping from the probability under the original Bayesian model to the probability under a simpler model (per data point).
   b) Estimation = minimize the gap between the two distributions.
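A tiny Gibbs sampling sketch for a target that is easy to check: a bivariate Gaussian with correlation 0.8 (chosen here for illustration, not from the lecture). Each step resamples one variable from P(variable | everything else):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                       # hypothetical target: bivariate Gaussian, corr 0.8
x, y = 0.0, 0.0                 # arbitrary initialisation
samples = []

for t in range(20000):
    # Sample each variable from its conditional given the other.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if t > 2000:                # discard burn-in samples
        samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # close to 0.8
```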

Example of EM for estimating Hidden Markov Model parameters. (Figure: a chain of hidden states Y1, ..., Y6 with observations X1, ..., X6.)

P(X, Y) = P(Y1) P(X1 | Y1) Π_{i=2..n} P(Yi | Yi-1) P(Xi | Yi)
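EM for an HMM (Baum-Welch) builds its E step on the forward-backward recursions. Below is a minimal sketch of just the forward pass, with invented initial, transition, and emission tables, computing the likelihood P(X1, ..., X6) that EM increases:

```python
import numpy as np

# Hypothetical HMM with 2 hidden states and 2 observation symbols (numbers invented).
pi = np.array([0.6, 0.4])                 # P(Y1)
A = np.array([[0.7, 0.3],                 # A[i, j] = P(Y_t = j | Y_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                 # B[i, k] = P(X_t = k | Y_t = i)
              [0.3, 0.7]])

obs = [0, 1, 1, 0, 1, 1]                  # X1..X6

# Forward pass: alpha[i] = P(X1..Xt, Yt = i), updated left to right.
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

print(alpha.sum())                        # P(X1..X6), the likelihood EM maximises
```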

Gibbs sampling works for all variants of models. Let your imagination go wild!

Problems with Bayesian Networks:
- The prior has to have the form of a conditional probability. What if the variables are symmetric?
- Bayes Nets can't have loops (e.g., a cycle between two nodes A and B).
- What if the relationship can only be described in an un-normalized way (i.e., as an energy)?

Undirected Graphical Models (aka Markov Random Fields) come from the world of statistical physics, where they model energies and electron spins. Define the joint probability as a normalized product of factors (i.e., energies) over cliques of variables:

P(X1, ..., XD) = (1/Z) Π_{Ci = cliques of X1..XD} f(Ci)
Z = Σ_{x1, x2, ..., xD} Π_{Ci} f(Ci)

In practice people often use only pairwise and node-wise factors, often called edge and node potentials. The main problem with these models: how do we estimate Z?!
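A brute-force sketch of Z for a tiny pairwise MRF (three binary variables, two invented edge potentials), just to make the definition concrete; the sum over all configurations is exactly the part that becomes intractable for large models:

```python
import numpy as np
from itertools import product

# A tiny pairwise MRF on 3 binary variables with edges (0,1) and (1,2);
# the edge potentials below are invented for illustration.
edge_pot = {(0, 1): np.array([[4.0, 1.0], [1.0, 4.0]]),   # favours agreement
            (1, 2): np.array([[4.0, 1.0], [1.0, 4.0]])}

def unnormalised(x):
    """Product of edge potentials f(Ci) for one configuration x."""
    return np.prod([pot[x[i], x[j]] for (i, j), pot in edge_pot.items()])

# Z sums the unnormalised score over all 2^3 configurations.
Z = sum(unnormalised(x) for x in product((0, 1), repeat=3))
print(Z, unnormalised((0, 0, 0)) / Z)     # P(0, 0, 0) = f / Z
```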

Conditional independencies in Markov Random Fields. We assume one edge for every pairwise potential. By the definition of undirected graphical models, a variable Xi is conditionally independent of another variable Xj if, on every path from Xi to Xj, at least one variable is observed.

Example: Gaussian graphical models. They are equivalent to a multivariate Gaussian distribution whose precision (inverse covariance) matrix has nonzero entries only where the graph has an edge, so a missing edge between Xi and Xj means they are conditionally independent given all the other variables. They easily allow conditional independence decisions, especially during inference.
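A small numerical sketch of this equivalence, with an invented 3-variable precision matrix: the missing edge (zero precision entry) between X1 and X3 gives zero partial correlation, i.e. conditional independence given X2, even though X1 and X3 are marginally correlated:

```python
import numpy as np

# Hypothetical 3-variable Gaussian graphical model: edges X1 - X2 and X2 - X3,
# but no edge X1 - X3, encoded as a zero in the precision matrix.
precision = np.array([[1.0, 0.4, 0.0],
                      [0.4, 1.0, 0.4],
                      [0.0, 0.4, 1.0]])

cov = np.linalg.inv(precision)
print(cov[0, 2])          # marginally, X1 and X3 are still correlated...

# ...but their partial correlation given X2 is zero, exactly as the missing
# edge says: partial_corr(i, j | rest) = -Lambda_ij / sqrt(Lambda_ii Lambda_jj).
partial = -precision[0, 2] / np.sqrt(precision[0, 0] * precision[2, 2])
print(partial)
```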

Computing Z (the normalization factor). Note: Z is a function of the parameters, not of the samples. So without Z you can still compute some conditional probabilities, but you need Z to compute MAP estimates and actual probabilities. Just like with Bayes Nets, you can use the sum-product method to compute Z.

Factor graph representation of MRFs:
P(X) = (1/Z) f1(x1, x2) f2(x2, x3, x4) f3(x3, x5) f4(x4, x6)
Z = Σ_{x1, x2, x3, x4, x5, x6} f1(x1, x2) f2(x2, x3, x4) f3(x3, x5) f4(x4, x6)
  = Σ_{x1, x2} f1(x1, x2) [ Σ_{x3, x4} f2(x2, x3, x4) (Σ_{x5} f3(x3, x5)) (Σ_{x6} f4(x4, x6)) ]
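A quick check of this rearrangement with invented factor tables: distributing the sums inward gives the same Z as the brute-force sum over all 2^6 configurations:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
# Invented positive factors over binary variables, matching the factor graph above.
f1 = rng.random((2, 2))          # f1(x1, x2)
f2 = rng.random((2, 2, 2))       # f2(x2, x3, x4)
f3 = rng.random((2, 2))          # f3(x3, x5)
f4 = rng.random((2, 2))          # f4(x4, x6)

# Brute force: 2^6 terms.
Z_brute = sum(f1[x1, x2] * f2[x2, x3, x4] * f3[x3, x5] * f4[x4, x6]
              for x1, x2, x3, x4, x5, x6 in product((0, 1), repeat=6))

# Distribute the sums inward, as on the slide: eliminate x5 and x6, then x3 and x4, then x1, x2.
m3 = f3.sum(axis=1)                              # sum_x5 f3(x3, x5)  -> function of x3
m4 = f4.sum(axis=1)                              # sum_x6 f4(x4, x6)  -> function of x4
m2 = np.einsum('ijk,j,k->i', f2, m3, m4)         # sum_{x3,x4} f2(x2,x3,x4) m3(x3) m4(x4)
Z_factored = float(np.einsum('ij,j->', f1, m2))  # sum_{x1,x2} f1(x1,x2) m2(x2)

print(np.isclose(Z_brute, Z_factored))   # True
```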

Belief Propagation Algorithm. Kschischang, Frank R., Brendan J. Frey, and H.-A. Loeliger. "Factor graphs and the sum-product algorithm." IEEE Transactions on Information Theory (2001).

Some notes on belief propagation / inference in MRFs:
- If the structure doesn't have a loop, the results are exact.
- If the structure is loopy, people still use loopy BP for inferring Z: keep passing messages until the messages converge. Some theoretical properties of the convergence exist.
- Sometimes messages don't have a closed form. Use approximations to keep them within closed form: e.g., if the D incoming messages are mixtures of K Gaussians, the outgoing message would be a mixture of D·K Gaussians, so re-approximate it with K new Gaussians. Variants of this method exist, such as expectation propagation.
- If you replace the sum with a max, you can get MAP estimates at the same time complexity.

Related topics (no time to cover):
- Generative adversarial networks: another method to generate samples, but without factorizing the probability. Useful when conditional independencies are bad assumptions, e.g., for highly correlated data like images, sounds, etc.
- Deep variational inference: make the function that maps between the two distributions more powerful and optimize it via gradient descent.
- Probabilistic programming! http://probabilistic-programming.org/wiki/home
- Nonparametric models (Dirichlet processes) and kernel-based graphical models.
- Causal inference and Bayesian Networks.

Back to the big picture. PGMs give you a full model of the task:
- You can inject prior information into your model.
- You can use partial data for better estimation.
- They give you justifications for your results: they are easy to interpret and allow humans to form hypotheses.
- If your data changes, you can adjust parts of the model while re-estimating the other parts.

This comes with costs:
- You're making independence assumptions, which are often wrong.
- You're multiplying a ton of factors, so errors can grow exponentially.
- Inference can be slow if you need sampling.