
Probabilistic Graphical Models. Guest Lecture by Narges Razavian, Machine Learning Class, April 14, 2017

Today
- What is a probabilistic graphical model, and why is it useful?
- Bayesian Networks
- Basic inference
- Generative models
- Fancier inference (when some variables are unobserved)
- How to learn model parameters from data
- Undirected graphical models
- Inference (belief propagation)
- New directions in PGM research & wrapping up

"What I cannot create, I do not understand." - Richard Feynman

Generative models vs. discriminative models. Discriminative models learn P(Y | X). It's easier and requires less data, but it is only useful for one particular task: given X, what is P(Y | X)? [Examples: logistic regression, feed-forward or convolutional neural networks, etc.] Generative models instead learn P(Y, X) completely. Once they do that, they can compute everything:
P(X) = Σy P(X, Y)
P(Y) = Σx P(X, Y)
P(Y | X) = P(Y, X) / Σy P(Y, X)
[Caveat: no free lunch! You want to answer every question under the sun? You need more data!]
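As a minimal sketch of the "compute everything from the joint" point (the 2x2 joint table below is invented for illustration, not from the lecture), P(X), P(Y), and P(Y | X) all fall out of P(X, Y) by summing and dividing:

```python
import numpy as np

# Hypothetical joint P(X, Y) over binary X (rows) and binary Y (columns).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)              # P(X)     = sum_y P(X, Y)
p_y = joint.sum(axis=0)              # P(Y)     = sum_x P(X, Y)
p_y_given_x = joint / p_x[:, None]   # P(Y | X) = P(X, Y) / sum_y P(X, Y)

print(p_x, p_y, p_y_given_x, sep="\n")
```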

Probabilistic Graphical Models: the main classic approach to modeling P(Y, X) = P(Y1, ..., YM, X1, ..., XD).

Some calculations on space: imagine each variable is binary and we model P(Y1, ..., YM, X1, ..., XD). How many parameters do we need to estimate from data to specify P(Y, X)? 2^(M+D) - 1.

Too many parameters! What can be done?
1) Look for conditional independences.
2) Use the chain rule for probabilities to break P(Y, X) into smaller pieces.
3) Rewrite P(Y, X) as a product of smaller factors.
   a) Maybe you have more data for a subset of variables.
4) Simplify some of the modeling assumptions to cut parameters.
   a) E.g., assume the data is multivariate Gaussian.
   b) E.g., assume conditional independencies even if they don't really always apply.

Bayesian Networks: use the chain rule for probabilities, P(X1, ..., XD) = Πi P(Xi | X1, ..., Xi-1). This is always true, with no approximations or assumptions, so there is no reduction in the number of parameters either. The BN conditional independence assumption: for some variables, P(Xi | X1, ..., Xi-1) is approximated with P(Xi | subset of (X1, ..., Xi-1)). This subset of (X1, ..., Xi-1) is referred to as Parents(Xi). This reduces the parameters (in the binary case, for instance) from 2^(i-1) to 2^|Parents(Xi)|.

Bayesian Networks: number of parameters in the binary case, for the student network (X1: Difficulty, X2: Intelligence, X3: Grade with parents X1 and X2, X4: SAT with parent X2, X5: Letter of recommendation with parent X3).

Variable and assumption | Raw chain rule | BN chain rule
X1 (Difficulty): P(X1) | P(X1): 1 | P(X1): 1
X2 (Intelligence): P(X2 | X1) = P(X2) | P(X2 | X1): 2 | P(X2): 1
X3 (Grade): P(X3 | X1, X2) = P(X3 | X1, X2) | P(X3 | X1, X2): 4 | P(X3 | X1, X2): 4
X4 (SAT score): P(X4 | X1, X2, X3) = P(X4 | X2) | P(X4 | X1, X2, X3): 8 | P(X4 | X2): 2
X5 (Letter): P(X5 | X1, X2, X3, X4) = P(X5 | X3) | P(X5 | X1, X2, X3, X4): 16 | P(X5 | X3): 2
Total: P(X1, X2, X3, X4, X5) | 1+2+4+8+16 = 31 | 1+1+4+2+2 = 10
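A small sketch of the same bookkeeping (all CPT numbers below are invented, not from the lecture): with the BN factorization, the 10 parameters counted in the table above are enough to evaluate the full joint over all 2^5 outcomes.

```python
# Hypothetical CPTs for the binary student network (numbers invented).
# Each table stores P(variable = 1 | parent values); P(= 0) is the complement.
p_x1 = 0.4                                   # P(Difficulty = 1)
p_x2 = 0.7                                   # P(Intelligence = 1)
p_x3 = {(0, 0): 0.3, (0, 1): 0.8,            # P(Grade = 1 | Difficulty, Intelligence)
        (1, 0): 0.1, (1, 1): 0.5}
p_x4 = {0: 0.2, 1: 0.9}                      # P(SAT = 1 | Intelligence)
p_x5 = {0: 0.1, 1: 0.6}                      # P(Letter = 1 | Grade)

def bernoulli(p, v):                         # P(value v) for a Bernoulli(p)
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5):
    """P(X1, ..., X5) as the product of the five local BN factors."""
    return (bernoulli(p_x1, x1) * bernoulli(p_x2, x2) *
            bernoulli(p_x3[(x1, x2)], x3) *
            bernoulli(p_x4[x2], x4) * bernoulli(p_x5[x3], x5))

# 1 + 1 + 4 + 2 + 2 = 10 numbers specify the whole 2^5-outcome joint.
print(joint(1, 1, 1, 0, 1))
```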

An example of a BN for SNPs.

Benefits of Bayesian Networks
1) Once estimated, they can answer any conditional or marginal query!
   a) This is called inference.
2) Fewer parameters to estimate!
3) We can start putting prior information into the network.
4) We can incorporate LATENT (hidden/unobserved) variables based on how we/domain experts think variables might be related.
5) Generating samples from the distribution becomes super easy.

Inference in Bayesian Networks. Query types:
1) Conditional probabilities: P(Y | X) = ?  P(Xi = a | X\i = B, Y = C) = ?
2) Maximum a posteriori estimates: argmax_xi P(Xi | X\i) = ?  argmax_yi P(Yi | X) = ?
(Running example: the student network with X1: Difficulty, X2: Intelligence, X3: Grade, X4: SAT, X5: Letter of recommendation.)

Key operation: marginalization, P(X) = Σy P(X, Y).
P(X5 | X2 = a) = ?
P(X5 | X2 = a) = P(X5, X2 = a) / P(X2 = a)
P(X5, X2 = a) = Σ_{X1, X3, X4} P(X1, X2 = a, X3, X4, X5)
P(X2 = a) = Σ_{X1, X3, X4, X5} P(X1, X2 = a, X3, X4, X5)

Marginalize from the first parents (the roots) down to the query variable. This method is called sum-product or variable elimination.

Marginalization when P(X) = Σy P(X, Y): computing P(X5 | X2 = a) in the student network.

P(X5 | X2 = a) = P(X5, X2 = a) / P(X2 = a)

P(X5, X2 = a) = Σ_{X1, X3, X4} P(X1, X2 = a, X3, X4, X5)
= Σ_{X1, X3, X4} P(X1) P(X2 = a) P(X3 | X1, X2 = a) P(X4 | X2 = a) P(X5 | X3)
= P(X2 = a) Σ_{X1, X3, X4} P(X1) P(X3 | X1, X2 = a) P(X4 | X2 = a) P(X5 | X3)
= P(X2 = a) Σ_{X1, X3} P(X1) P(X3 | X1, X2 = a) P(X5 | X3) Σ_{X4} P(X4 | X2 = a)
= P(X2 = a) Σ_{X1, X3} P(X1) P(X3 | X1, X2 = a) P(X5 | X3)          (the sum over X4 equals 1)
= P(X2 = a) Σ_{X3} P(X5 | X3) Σ_{X1} P(X3 | X1, X2 = a) P(X1)
= P(X2 = a) Σ_{X3} P(X5 | X3) f_{X2=a}(X3)
= P(X2 = a) g_{X2=a}(X5)

So P(X5 | X2 = a) = g_{X2=a}(X5).
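A minimal sketch of this elimination, with invented CPT numbers for the binary student network (the names f and g mirror the intermediate factors in the derivation above):

```python
# Hypothetical CPTs: P(var = 1 | parents); P(var = 0) is the complement.
p_x1 = 0.4                                    # P(Difficulty = 1)
p_x3 = {(0, 0): 0.3, (0, 1): 0.8,             # P(Grade = 1 | Difficulty, Intelligence)
        (1, 0): 0.1, (1, 1): 0.5}
p_x5 = {0: 0.1, 1: 0.6}                       # P(Letter = 1 | Grade)

def p(prob_one, value):
    return prob_one if value == 1 else 1.0 - prob_one

a = 1  # evidence: Intelligence X2 = a

# f_{X2=a}(x3) = sum_{x1} P(x1) P(x3 | x1, X2=a)   -- eliminates X1
f = {x3: sum(p(p_x1, x1) * p(p_x3[(x1, a)], x3) for x1 in (0, 1))
     for x3 in (0, 1)}

# g_{X2=a}(x5) = sum_{x3} P(x5 | x3) f(x3)          -- eliminates X3
g = {x5: sum(p(p_x5[x3], x5) * f[x3] for x3 in (0, 1)) for x5 in (0, 1)}

# P(X5 | X2=a) = g / sum(g); X4 sums to 1 and P(X2=a) cancels in the ratio.
norm = sum(g.values())
print({x5: g[x5] / norm for x5 in (0, 1)})
```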

Estimating parameters of a Bayesian Network: maximum likelihood estimation (and sometimes maximum pseudolikelihood estimation).

How to estimate the parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known.

If you remember from other lectures:
Likelihood(D; parameters) = Π_{Dj in data} P(Dj | parameters)
= Π_{Dj in data} Π_{Xij in Dj} P(Xij | Par(Xij), parameters{Par(Xij) -> Xij})
= Π_{i in variable set} Π_{Dj in data} P(Xij | Par(Xij), parameters{Par(Xij) -> Xij})
= Π_{i in variable set} (independent local terms, each a function only of the observed Xij and Par(Xij))

MLE parameters{Par(Xij) -> Xij} = argmax (local likelihood of the observed Xij and Par(Xij) in the data!)

How to estimate the parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known.

If variables are discrete:
P(Xi = a | Parents(Xi) = B) = Count(Xi = a & Pa(Xi) = B) / Count(Pa(Xi) = B)

If variables are continuous:
P(Xi = a | Parents(Xi) = B) = fit some PDF function(a, B)
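A minimal sketch of the counting estimator above for one discrete CPT, on a tiny invented dataset (the variable names are hypothetical):

```python
from collections import Counter

# Hypothetical fully observed data: rows are (difficulty, intelligence, grade).
data = [(0, 1, 1), (0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]

# MLE for P(Grade = g | Difficulty = d, Intelligence = i) by counting,
# exactly as in the formula above.
joint_counts = Counter((d, i, g) for d, i, g in data)
parent_counts = Counter((d, i) for d, i, _ in data)

cpt = {(d, i, g): joint_counts[(d, i, g)] / parent_counts[(d, i)]
       for (d, i, g) in joint_counts}
print(cpt)   # e.g. P(Grade=1 | Difficulty=0, Intelligence=1) = 3/3
```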

How to estimate the parameters of a Bayesian Network? (1) Observed variables and known structure, continuous case: P(Xi = a | Parents(Xi) = B) = Some_PDF_Function(a, B), for example:
- a single multivariate Gaussian
- a mixture of multivariate Gaussians
- non-parametric density functions

How to estimate parameters of a Bayesian Network? (2) You have observed all Y,X variables, but dependency structure is NOT known

Structure learning when all variables are observed:
1) Neighborhood selection with the Lasso: L1-regularized regression per variable, learned using the other variables. The result is not necessarily a tree structure.
2) Tree learning via the Chow-Liu method: for each variable pair, find the empirical distribution P(Xi, Xj) = Count(Xi, Xj) / M (M = number of samples); compute the mutual information I(Xi, Xj); use I(Xi, Xj) as the edge weight in a graph and learn a maximum spanning tree (a rough sketch follows below).
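A rough sketch of the Chow-Liu idea under the assumptions above (binary data, empirical mutual information as edge weights, a greedy Kruskal-style maximum spanning tree); the data here is random and purely illustrative:

```python
import numpy as np
from itertools import combinations

def mutual_information(xi, xj):
    """Empirical I(Xi; Xj) for two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Maximum spanning tree over pairwise MI (greedy Kruskal with union-find)."""
    d = data.shape[1]
    edges = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))      # hypothetical binary samples
print(chow_liu_tree(data))
```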

How to estimate the parameters of a Bayesian Network? (3) You have unobserved variables, but the dependency structure is known. These are the most commonly used Bayesian Networks these days!

In practice, Bayes Nets are mostly used to inject priors and structure into the task. Example: modeling documents as a collection of topics, where each topic is a distribution over words, i.e. topic modeling via Latent Dirichlet Allocation.

In practice, Bayes Nets are mostly used to inject priors and structure. Example: correcting for hidden confounders in expression data.

Estimation/inference when there are missing values:
1) Sometimes P(observed) = Σ_unobserved P(observed & unobserved) has a closed form!
   a) Combining Gaussian conditionals and priors usually leads to Gaussian marginals (closed form).
   b) If your prior distribution on the latent variables is conjugate to the conditional distribution, you get a closed form.
      i) There are lots of known pairs of distributions: Gaussian and Gaussian; Dirichlet and Multinomial; Gamma and Gamma; etc.
2) Expectation maximization (EM) (a small sketch follows this list):
   a) Initialize the parameters randomly.
   b) Do inference (E step): MAP-estimate the most likely values of the unobserved variables.
   c) Re-estimate (M step): MLE-estimate the parameters.
   d) Iterate (b) and (c) until the parameters converge.
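A generic EM sketch, not the lecture's example: a two-component 1-D Gaussian mixture, where the latent variable is each point's cluster. Note that it uses the standard soft responsibilities in the E step rather than the hard MAP assignment described in step (b) above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data from two hidden clusters (the latent variable is which
# cluster each point came from).
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

# Random-ish initialisation of the parameters.
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E step: posterior responsibility of cluster 1 for each point.
    like0 = (1 - pi) * np.exp(-0.5 * ((x - mu[0]) / sigma[0]) ** 2) / sigma[0]
    like1 = pi * np.exp(-0.5 * ((x - mu[1]) / sigma[1]) ** 2) / sigma[1]
    r = like1 / (like0 + like1)
    # M step: re-estimate the parameters from the (soft) completed data.
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])
    sigma = np.sqrt(np.array([np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r),
                              np.sum(r * (x - mu[1]) ** 2) / np.sum(r)]))

print(pi, mu, sigma)   # should recover roughly 0.4, (-2, 3), (1, 1)
```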

Estimation/inference when there are missing values (continued):
3) Gibbs sampling or MCMC (a tiny sketch follows this list):
   a) Initialize randomly.
   b) Sample each variable in turn from P(xi | everything else).
   c) Burn-in: repeat over the variables and draw thousands of samples sequentially.
   d) Eventually (it's proven), you'll be sampling from the true distribution! Use those samples to compute anything you want. (Note that in those samples all variables are observed.)
4) Variational inference (approximate with another model that HAS a closed form):
   a) Find a functional mapping from the probability under the original Bayesian model to the probability under a simpler model (per data point).
   b) Estimation = minimize the gap between the two distributions.
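A tiny Gibbs sampling sketch for a target that is easy to check: a bivariate Gaussian with correlation 0.8 (chosen here for illustration, not from the lecture). Each step resamples one variable from P(variable | everything else):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                       # hypothetical target: bivariate Gaussian, corr 0.8
x, y = 0.0, 0.0                 # arbitrary initialisation
samples = []

for t in range(20000):
    # Sample each variable from its conditional given the other.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if t > 2000:                # discard burn-in samples
        samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # close to 0.8
```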

Example of EM for estimating Hidden Markov Model parameters. (Figure: a chain of hidden states Y1, ..., Y6 with observations X1, ..., X6.)

P(X, Y) = P(Y1) P(X1 | Y1) Π_{i=2..n} P(Yi | Yi-1) P(Xi | Yi)
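EM for an HMM (Baum-Welch) builds its E step on the forward-backward recursions. Below is a minimal sketch of just the forward pass, with invented initial, transition, and emission tables, computing the likelihood P(X1, ..., X6) that EM increases:

```python
import numpy as np

# Hypothetical HMM with 2 hidden states and 2 observation symbols (numbers invented).
pi = np.array([0.6, 0.4])                 # P(Y1)
A = np.array([[0.7, 0.3],                 # A[i, j] = P(Y_t = j | Y_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                 # B[i, k] = P(X_t = k | Y_t = i)
              [0.3, 0.7]])

obs = [0, 1, 1, 0, 1, 1]                  # X1..X6

# Forward pass: alpha[i] = P(X1..Xt, Yt = i), updated left to right.
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

print(alpha.sum())                        # P(X1..X6), the likelihood EM maximises
```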

Gibbs sampling works for all variants of models. Let your imagination go wild!

Problems with Bayesian Networks:
- The prior has to have the form of a conditional probability. What if the variables are symmetric?
- Bayes Nets can't have loops (e.g., a cycle between two nodes A and B).
- What if the relationship can only be described in an un-normalized way (i.e., as an energy)?

Undirected Graphical Models (aka Markov Random Fields) come from the world of statistical physics, where they model energies and electron spins. Define the joint probability as a normalized product of factors (i.e., energies) over cliques of variables:

P(X1, ..., XD) = (1/Z) Π_{Ci = cliques of X1..XD} f(Ci)
Z = Σ_{x1, x2, ..., xD} Π_{Ci} f(Ci)

In practice people often use only pairwise and node-wise factors, often called edge and node potentials. The main problem with these models: how do we estimate Z?!
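A brute-force sketch of Z for a tiny pairwise MRF (three binary variables, two invented edge potentials), just to make the definition concrete; the sum over all configurations is exactly the part that becomes intractable for large models:

```python
import numpy as np
from itertools import product

# A tiny pairwise MRF on 3 binary variables with edges (0,1) and (1,2);
# the edge potentials below are invented for illustration.
edge_pot = {(0, 1): np.array([[4.0, 1.0], [1.0, 4.0]]),   # favours agreement
            (1, 2): np.array([[4.0, 1.0], [1.0, 4.0]])}

def unnormalised(x):
    """Product of edge potentials f(Ci) for one configuration x."""
    return np.prod([pot[x[i], x[j]] for (i, j), pot in edge_pot.items()])

# Z sums the unnormalised score over all 2^3 configurations.
Z = sum(unnormalised(x) for x in product((0, 1), repeat=3))
print(Z, unnormalised((0, 0, 0)) / Z)     # P(0, 0, 0) = f / Z
```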

Conditional independencies in Markov Random Fields. We assume one edge for every pairwise potential. By the definition of undirected graphical models, a variable Xi is conditionally independent of another variable Xj if, on every path from Xi to Xj, at least one variable is observed.

Example: Gaussian graphical models. They are equivalent to a multivariate Gaussian distribution whose precision (inverse covariance) matrix has nonzero entries only where the graph has an edge, so a missing edge between Xi and Xj means they are conditionally independent given all the other variables. They easily allow conditional independence decisions, especially during inference.
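A small numerical sketch of this equivalence, with an invented 3-variable precision matrix: the missing edge (zero precision entry) between X1 and X3 gives zero partial correlation, i.e. conditional independence given X2, even though X1 and X3 are marginally correlated:

```python
import numpy as np

# Hypothetical 3-variable Gaussian graphical model: edges X1 - X2 and X2 - X3,
# but no edge X1 - X3, encoded as a zero in the precision matrix.
precision = np.array([[1.0, 0.4, 0.0],
                      [0.4, 1.0, 0.4],
                      [0.0, 0.4, 1.0]])

cov = np.linalg.inv(precision)
print(cov[0, 2])          # marginally, X1 and X3 are still correlated...

# ...but their partial correlation given X2 is zero, exactly as the missing
# edge says: partial_corr(i, j | rest) = -Lambda_ij / sqrt(Lambda_ii Lambda_jj).
partial = -precision[0, 2] / np.sqrt(precision[0, 0] * precision[2, 2])
print(partial)
```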

Computing Z (the normalization factor). Note: Z is a function of the parameters, not of the samples. So without Z you can still compute some conditional probabilities, but you need Z to compute MAP estimates and actual probabilities. Just like with Bayes Nets, you can use the sum-product method to compute Z.

Factor graph representation of MRFs:
P(X) = (1/Z) f1(x1, x2) f2(x2, x3, x4) f3(x3, x5) f4(x4, x6)
Z = Σ_{x1, x2, x3, x4, x5, x6} f1(x1, x2) f2(x2, x3, x4) f3(x3, x5) f4(x4, x6)
  = Σ_{x1, x2} f1(x1, x2) [ Σ_{x3, x4} f2(x2, x3, x4) (Σ_{x5} f3(x3, x5)) (Σ_{x6} f4(x4, x6)) ]
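A quick check of this rearrangement with invented factor tables: distributing the sums inward gives the same Z as the brute-force sum over all 2^6 configurations:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
# Invented positive factors over binary variables, matching the factor graph above.
f1 = rng.random((2, 2))          # f1(x1, x2)
f2 = rng.random((2, 2, 2))       # f2(x2, x3, x4)
f3 = rng.random((2, 2))          # f3(x3, x5)
f4 = rng.random((2, 2))          # f4(x4, x6)

# Brute force: 2^6 terms.
Z_brute = sum(f1[x1, x2] * f2[x2, x3, x4] * f3[x3, x5] * f4[x4, x6]
              for x1, x2, x3, x4, x5, x6 in product((0, 1), repeat=6))

# Distribute the sums inward, as on the slide: eliminate x5 and x6, then x3 and x4, then x1, x2.
m3 = f3.sum(axis=1)                              # sum_x5 f3(x3, x5)  -> function of x3
m4 = f4.sum(axis=1)                              # sum_x6 f4(x4, x6)  -> function of x4
m2 = np.einsum('ijk,j,k->i', f2, m3, m4)         # sum_{x3,x4} f2(x2,x3,x4) m3(x3) m4(x4)
Z_factored = float(np.einsum('ij,j->', f1, m2))  # sum_{x1,x2} f1(x1,x2) m2(x2)

print(np.isclose(Z_brute, Z_factored))   # True
```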

Belief Propagation Algorithm. Kschischang, Frank R., Brendan J. Frey, and H.-A. Loeliger. "Factor graphs and the sum-product algorithm." IEEE Transactions on Information Theory (2001).

Some notes on belief propagation / inference in MRFs:
- If the structure doesn't have a loop, the results are exact.
- If the structure is loopy, people still use loopy BP for inferring Z: keep passing messages until the messages converge. Some theoretical properties of the convergence exist.
- Sometimes messages don't have a closed form. Use approximations to keep them within closed form: e.g., if the D incoming messages are mixtures of K Gaussians, the outgoing message would be a mixture of D·K Gaussians, so re-approximate it with K new Gaussians. Variants of this method exist, such as expectation propagation.
- If you replace the sum with a max, you can get MAP estimates at the same time complexity.

Related topics (no time to cover):
- Generative adversarial networks: another method to generate samples, but without factorizing the probability. Useful when conditional independencies are bad assumptions, e.g., for highly correlated data like images, sounds, etc.
- Deep variational inference: make the function that maps between the two distributions more powerful and optimize it via gradient descent.
- Probabilistic programming! http://probabilistic-programming.org/wiki/home
- Nonparametric models (Dirichlet processes) and kernel-based graphical models.
- Causal inference and Bayesian Networks.

Back to the big picture. PGMs give you a full model of the task:
- You can inject prior information into your model.
- You can use partial data for better estimation.
- They give you justifications for your results: they are easy to interpret and allow humans to form hypotheses.
- If your data changes, you can adjust parts of the model while re-estimating the other parts.

This comes with costs:
- You're making independence assumptions, which are often wrong.
- You're multiplying a ton of factors, so errors can grow exponentially.
- Inference can be slow if you need sampling.