CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs


CS242: Probabilistic Graphical Models
Lecture 4B: Learning Tree-Structured and Directed Graphs
Professor Erik Sudderth, Brown University Computer Science
October 6, 2016
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

A Prototypical Course Project
Ø Find a problem where learning with graphical models seems promising. Typically this needs to be an area where data is easily accessible.
Ø Identify a baseline approach someone has used to analyze related data. Could involve graphical models, or could involve alternative ML methods.
Ø Propose, derive, implement, and validate a new approach requiring: a different graphical model structure than previous work, and/or different inference/learning algorithms than those tested in previous work.
Ø Validate your graphical model via careful experiments: not only performance metrics for your application, but also confirmation of learning algorithm correctness, and exploration of learned model properties.
Ø Interested in a different project style, perhaps more theoretical? Talk to me.

Course Project Timeline
Details online: http://cs.brown.edu/courses/cs242/projects/
Ø Oct. 26: Identify project teams, send brief description (2-3 paragraphs). Team size? 3 students preferred, 2 is okay, 1 requires special permission.
Ø Nov. 9: Project proposal due (2-3 pages). Requires an in-depth plan and preliminary results, not just a problem area.
Ø Dec. 13: Project presentations (~10 minutes per team). All team members must contribute.
Ø Dec. 20: Project reports due (6-10 pages). Although your results need not be sufficiently novel for publication, the presentation and experimental protocols should be of high quality.

CS242: Lecture 4B Outline
Ø Learning Tree-Structured Graphical Models
Ø Learning Directed Graphical Models

A Little Information Theory
The entropy is a natural measure of the inherent uncertainty (difficulty of compression) of some random variable:
H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \geq 0
The relative entropy or Kullback-Leibler (KL) divergence is a non-negative, but asymmetric, distance between a given pair of probability distributions:
D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \geq 0
The KL divergence equals zero if and only if p(x) = q(x) for all x.
The mutual information measures dependence between a pair of random variables:
I(p_{xy}) = D(p_{xy} \| p_x p_y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \geq 0
It equals zero if and only if the variables X and Y are independent.
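To make these definitions concrete, here is a minimal NumPy sketch (not from the slides; the function names are our own) that evaluates the entropy, KL divergence, and mutual information of discrete distributions stored as arrays:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X;Y) = D(p_xy || p_x p_y), for a joint distribution given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over rows
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over columns
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

# Example: a dependent pair of binary variables has positive mutual information.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(entropy(p_xy.sum(axis=1)), mutual_information(p_xy))
```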

Factorizations for Pairwise Markov Trees
(Figure: a tree-structured graph over variables x_1, ..., x_5.)
Pick any node s as the root and write the directed factorization out to the leaves, where \Gamma(t) denotes the parent of node t in the rooted tree:
p(x) = p(x_s) \prod_{t \neq s} p(x_t \mid x_{\Gamma(t)}) = p(x_s) \prod_{t \neq s} \frac{p(x_t, x_{\Gamma(t)})}{p(x_{\Gamma(t)})} = p(x_s) \prod_{t \neq s} \frac{p(x_t, x_{\Gamma(t)})}{p(x_{\Gamma(t)}) p(x_t)} p(x_t)
Grouping terms gives the canonical undirected factorizations:
p(x) = \prod_{s \in \mathcal{V}} p(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{p(x_s, x_t)}{p(x_s) p(x_t)} = \prod_{s \in \mathcal{V}} p(x_s)^{1 - d_s} \prod_{(s,t) \in \mathcal{E}} p(x_s, x_t)
where d_s is the number of neighbors of node s.

Factorizations for Pairwise Markov Trees
(Figure: the same tree over x_1, ..., x_5.)
Pairwise MRFs are defined by non-negative potentials:
p(x) = \frac{1}{Z} \prod_{s \in \mathcal{V}} \psi_s(x_s) \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t)
A canonical parameterization via marginals:
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}
Suppose we choose a self-consistent set of non-negative discrete marginals:
\sum_{x_s} q_s(x_s) = 1, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
The canonical parameterization then exactly matches these marginals:
q_s(a) = \sum_{x : x_s = a} p(x), \qquad q_{st}(a, b) = \sum_{x : x_s = a, x_t = b} p(x)
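As a quick numerical sanity check of this canonical parameterization (a sketch of our own, not part of the lecture), the following builds a random joint distribution that is Markov on the chain x_1 - x_2 - x_3 and verifies that the product of node marginals and edge ratios reproduces the joint exactly:

```python
import numpy as np
rng = np.random.default_rng(0)

K = 3  # number of states per variable (arbitrary choice)
p1 = rng.dirichlet(np.ones(K))                     # p(x1)
p2_1 = rng.dirichlet(np.ones(K), size=K)           # p(x2 | x1), rows sum to 1
p3_2 = rng.dirichlet(np.ones(K), size=K)           # p(x3 | x2)
joint = np.einsum('i,ij,jk->ijk', p1, p2_1, p3_2)  # p(x1, x2, x3), Markov on the chain

# Singleton and pairwise marginals on the tree's nodes and edges.
q1, q2, q3 = joint.sum((1, 2)), joint.sum((0, 2)), joint.sum((0, 1))
q12, q23 = joint.sum(2), joint.sum(0)

# Canonical parameterization: product of node marginals times edge ratios.
recon = (q1[:, None, None] * q2[None, :, None] * q3[None, None, :]
         * (q12 / (q1[:, None] * q2[None, :]))[:, :, None]
         * (q23 / (q2[:, None] * q3[None, :]))[None, :, :])
assert np.allclose(recon, joint)   # exact match for any tree-structured joint
```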

Learning with a Fixed Tree Structure
(Figure: the same tree over x_1, ..., x_5.)
A canonical parameterization via valid marginals:
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
Suppose we have L observations \{x^{(\ell)}\}_{\ell=1}^{L} with empirical distribution:
\hat{p}_s(a) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}), \qquad \hat{p}_{st}(a, b) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}) \delta_b(x_t^{(\ell)})
The (normalized) maximum likelihood parameter estimation objective is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = \frac{1}{L} \sum_{\ell=1}^{L} \left[ \sum_{s \in \mathcal{V}} \log q_s(x_s^{(\ell)}) + \sum_{(s,t) \in \mathcal{E}} \log \frac{q_{st}(x_s^{(\ell)}, x_t^{(\ell)})}{q_s(x_s^{(\ell)}) q_t(x_t^{(\ell)})} \right]

Learning with a Fixed Tree Structure
(Figure: the same tree over x_1, ..., x_5.)
A canonical parameterization via valid marginals:
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
This is an exponential family distribution with statistics (features):
\phi_{s;a}(x) = \delta_a(x_s) \text{ for all } s \in \mathcal{V},\ a = 1, \ldots, K_s
\phi_{st;ab}(x) = \delta_a(x_s) \delta_b(x_t) \text{ for all } (s,t) \in \mathcal{E},\ a = 1, \ldots, K_s,\ b = 1, \ldots, K_t
Maximum likelihood estimates match expected values of the statistics, and thus exactly match the empirical marginal distributions:
q_s(x_s) = \hat{p}_s(x_s) \text{ for all } s \in \mathcal{V}, \qquad q_{st}(x_s, x_t) = \hat{p}_{st}(x_s, x_t) \text{ for all } (s,t) \in \mathcal{E}.

Learning with a Fixed Tree Structure
(Figure: the same tree over x_1, ..., x_5.)
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}
q_s(a) = \hat{p}_s(a) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}), \qquad q_{st}(a, b) = \hat{p}_{st}(a, b) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}) \delta_b(x_t^{(\ell)})
The optimal ML model matches the empirical marginals for all single variables, and the empirical pairwise marginals for all neighboring variables.
For these optimal parameters, the corresponding training log-likelihood is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in \mathcal{V}} \hat{H}_s + \sum_{(s,t) \in \mathcal{E}} \hat{I}_{st}
where the empirical entropy and empirical mutual information are
\hat{H}_s = -\sum_{x_s} \hat{p}_s(x_s) \log \hat{p}_s(x_s), \qquad \hat{I}_{st} = \sum_{x_s, x_t} \hat{p}_{st}(x_s, x_t) \log \frac{\hat{p}_{st}(x_s, x_t)}{\hat{p}_s(x_s) \hat{p}_t(x_t)}

Learning with a Fixed Tree Structure
(Figure: the same tree over x_1, ..., x_5.)
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}
q_s(a) = \hat{p}_s(a) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}), \qquad q_{st}(a, b) = \hat{p}_{st}(a, b) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}) \delta_b(x_t^{(\ell)})
The optimal ML model matches the empirical marginals for all single variables, and the empirical pairwise marginals for all neighboring variables.
For these optimal parameters, the corresponding training log-likelihood is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in \mathcal{V}} \hat{H}_s + \sum_{(s,t) \in \mathcal{E}} \hat{I}_{st}
(empirical entropy and empirical mutual information, as defined above)
Data that is more random (higher entropy) is harder to predict. The likelihood increases when edges link strongly dependent variables.
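This identity is easy to verify numerically. The sketch below (a toy of our own, not course code) samples binary data respecting the chain x_1 - x_2 - x_3, plugs the empirical marginals into that fixed tree, and checks that the average training log-likelihood equals the negated empirical entropies plus the empirical mutual informations on the edges:

```python
import numpy as np
rng = np.random.default_rng(1)

# Sample L binary observations that respect the chain x1 - x2 - x3.
L = 5000
x1 = rng.integers(0, 2, L)
x2 = x1 ^ (rng.random(L) < 0.3)        # noisy copy of x1
x3 = x2 ^ (rng.random(L) < 0.3)        # noisy copy of x2
X = np.stack([x1, x2, x3], axis=1).astype(int)

def marg1(s):                          # empirical singleton marginal p_hat_s
    return np.bincount(X[:, s], minlength=2) / L

def marg2(s, t):                       # empirical pairwise marginal p_hat_st
    q = np.zeros((2, 2))
    np.add.at(q, (X[:, s], X[:, t]), 1.0)
    return q / L

def H(p):                              # empirical entropy
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def I(q):                              # empirical mutual information
    outer = np.outer(q.sum(1), q.sum(0))
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / outer[nz]))

edges = [(0, 1), (1, 2)]               # the fixed tree structure
# Left-hand side: average log-likelihood of the ML tree model (empirical marginals).
loglik = sum(np.log(marg1(s)[X[:, s]]).sum() for s in range(3))
for s, t in edges:
    q, ps, pt = marg2(s, t), marg1(s), marg1(t)
    loglik += np.log(q[X[:, s], X[:, t]] / (ps[X[:, s]] * pt[X[:, t]])).sum()
# Right-hand side: minus the empirical entropies plus the edge mutual informations.
rhs = -sum(H(marg1(s)) for s in range(3)) + sum(I(marg2(s, t)) for s, t in edges)
print(loglik / L, rhs)                 # the two values agree to floating-point precision
```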

Optimizing Tree Structures
p(x) = \prod_{s \in \mathcal{V}} q_s(x_s) \prod_{(s,t) \in \mathcal{E}} \frac{q_{st}(x_s, x_t)}{q_s(x_s) q_t(x_t)}
Picking the maximum-likelihood marginals for any tree:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in \mathcal{V}} \hat{H}_s + \sum_{(s,t) \in \mathcal{E}} \hat{I}_{st}
\hat{H}_s = -\sum_{x_s} \hat{p}_s(x_s) \log \hat{p}_s(x_s), \qquad \hat{I}_{st} = \sum_{x_s, x_t} \hat{p}_{st}(x_s, x_t) \log \frac{\hat{p}_{st}(x_s, x_t)}{\hat{p}_s(x_s) \hat{p}_t(x_t)}
Since the node entropies do not depend on which edges are chosen, we need to maximize the following objective over acyclic edge patterns:
f(\mathcal{E}) = \sum_{(s,t) \in \mathcal{E}} \hat{I}_{st}
Weight each pair of nodes by its empirical mutual information, and find the tree which maximizes the sum of edge weights!
Example greedy algorithm (Kruskal):
Ø Initialize with the highest-weight edge (the most-dependent pair of variables).
Ø Iteratively add the highest-weight edge that is not yet included in the tree and does not introduce any cycles. Stop when all variables are connected.
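A compact sketch of this procedure (often called the Chow-Liu algorithm) appears below; the data layout and function name are assumptions of ours, not the course implementation. Empirical mutual information weights every candidate edge, and Kruskal's algorithm with a small union-find structure grows the maximum-weight spanning tree:

```python
import numpy as np
from itertools import combinations

def chow_liu_edges(X, K):
    """X: (L, N) array of discrete observations in {0, ..., K-1}; returns tree edges."""
    L, N = X.shape

    def mutual_info(s, t):
        q = np.zeros((K, K))
        np.add.at(q, (X[:, s], X[:, t]), 1.0)
        q /= L                                    # empirical pairwise marginal
        nz = q > 0
        return np.sum(q[nz] * np.log(q[nz] / np.outer(q.sum(1), q.sum(0))[nz]))

    weights = sorted(((mutual_info(s, t), s, t)
                      for s, t in combinations(range(N), 2)), reverse=True)
    parent = list(range(N))                       # union-find roots, to detect cycles

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]         # path halving
            u = parent[u]
        return u

    tree = []
    for w, s, t in weights:                       # greedily add the heaviest legal edge
        rs, rt = find(s), find(t)
        if rs != rt:
            parent[rs] = rt
            tree.append((s, t))
        if len(tree) == N - 1:                    # all variables connected
            break
    return tree
```

For binary word-presence data like the newsgroups example that follows, X would be a documents-by-words {0, 1} matrix and K = 2.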

Example: Word Usage in Newsgroups
(Figure: the maximum-likelihood tree structure learned over indicators for the presence of 100 words, e.g. "email", "windows", "hockey", "god", "nasa", across 16,242 postings to 20 newsgroups. Schmidt 2010 PhD Thesis.)

Example: Word Usage in Newsgroups
(Figure: the graph learned with L1 regularization (regularization strength 256); isolated nodes are not plotted. Same data: presence of 100 words across 16,242 postings to 20 newsgroups. Schmidt 2010 PhD Thesis.)

CS242: Lecture 4B Outline
Ø Learning Tree-Structured Graphical Models
Ø Learning Directed Graphical Models
(Figure: a directed graphical model over variables indexed x_{1,1}, ..., x_{1,N_1}, x_{2,1}, ..., x_{2,N_2}, x_{3,1}, ..., x_{3,N_3}.)

Learning Directed Graphical Models
(Figure: a four-node directed graph, with parameters \theta_i attached to each node x_i.)
p(x \mid \theta) = \prod_{i \in \mathcal{V}} p(x_i \mid x_{\Gamma(i)}, \theta_i), \text{ where } \Gamma(i) \text{ denotes the parents of node } i.
Intuition: we must learn a good predictive model of each node, given its parent nodes.
The directed factorization allows the likelihood to decompose locally:
p(x \mid \theta) = p(x_1 \mid \theta_1) p(x_2 \mid x_1, \theta_2) p(x_3 \mid x_1, \theta_3) p(x_4 \mid x_2, x_3, \theta_4)
\log p(x \mid \theta) = \log p(x_1 \mid \theta_1) + \log p(x_2 \mid x_1, \theta_2) + \log p(x_3 \mid x_1, \theta_3) + \log p(x_4 \mid x_2, x_3, \theta_4)
We often assume a similarly factorized (meta-independent) prior:
p(\theta) = p(\theta_1) p(\theta_2) p(\theta_3) p(\theta_4), \qquad \log p(\theta) = \log p(\theta_1) + \log p(\theta_2) + \log p(\theta_3) + \log p(\theta_4)
We thus have independent ML or Bayesian learning problems at each node.

Learning with Complete Observations
(Figure: (a) the directed graphical model for a single training example; (b) the same model replicated across L independent, identically distributed training examples, and the equivalent plate notation.)
The directed graph encodes the statistical structure of single training examples:
p(x \mid \theta) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i)
Given L independent, identically distributed, completely observed samples:
\log p(x \mid \theta) = \sum_{\ell=1}^{L} \sum_{i=1}^{N} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) = \sum_{i=1}^{N} \left[ \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) \right]

Prior Distributions and Tied Parameters
\log p(x \mid \theta) = \sum_{\ell=1}^{L} \sum_{i=1}^{N} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) = \sum_{i=1}^{N} \left[ \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) \right]
A meta-independent, factorized prior: \log p(\theta) = \sum_{i=1}^{N} \log p(\theta_i)
The factorized posterior allows independent MAP learning for each node:
\log p(\theta \mid x) = C + \sum_{i=1}^{N} \left[ \log p(\theta_i) + \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) \right]
Learning remains easy when groups of nodes are tied to use identical, shared parameters, where b_i denotes the type or class of variable node i:
\log p(x \mid \theta) = \sum_{i=1}^{N} \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_{b_i})
\log p(\theta_b \mid x) = C + \log p(\theta_b) + \sum_{i : b_i = b} \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_b)

Example: Tied Temporal Parameters
(Figure: three hidden Markov chains of different lengths, Sequence 1, Sequence 2, and Sequence 3; each has an initial state z_0, hidden states z_t, and observations x_t, and all sequences share the same parameters.)
p(x, z \mid \theta) = \prod_{\ell=1}^{L} \prod_{t=1}^{T_\ell} p(z_t^{(\ell)} \mid z_{t-1}^{(\ell)}, \theta_{\text{time}}) \, p(x_t^{(\ell)} \mid z_t^{(\ell)}, \theta_{\text{obs}})
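With fully observed states, fitting the tied parameters amounts to pooling counts across all sequences, whatever their lengths. The sketch below uses a data layout and helper name of our own (not course code), with a symmetric Dirichlet pseudo-count alpha so that alpha = 0 recovers plain maximum likelihood:

```python
import numpy as np

def fit_tied_params(sequences, K, V, alpha=1.0):
    """sequences: list of (z, x) pairs, with z of length T+1 (z[0] is the initial
    state) and x of length T; states in {0..K-1}, observations in {0..V-1}.
    alpha is a symmetric Dirichlet pseudo-count added to every cell."""
    trans = np.full((K, K), alpha)       # counts of transitions z_{t-1} -> z_t
    emit = np.full((K, V), alpha)        # counts of emissions z_t -> x_t
    for z, x in sequences:
        np.add.at(trans, (z[:-1], z[1:]), 1.0)   # pool transition counts
        np.add.at(emit, (z[1:], x), 1.0)         # pool emission counts
    # Normalize each row into a conditional distribution.
    theta_time = trans / trans.sum(axis=1, keepdims=True)
    theta_obs = emit / emit.sum(axis=1, keepdims=True)
    return theta_time, theta_obs

# Three sequences of different lengths share the same tied parameters.
seqs = [(np.array([0, 0, 1, 1, 2, 2]), np.array([1, 0, 2, 2, 1])),
        (np.array([0, 1, 2, 0]), np.array([0, 2, 1])),
        (np.array([0, 2, 2, 1, 0]), np.array([2, 2, 0, 1]))]
theta_time, theta_obs = fit_tied_params(seqs, K=3, V=3)
```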

Learning (Discrete) Directed Graphical Models
\log p(\theta \mid x) = C + \sum_{i=1}^{N} \left[ \log p(\theta_i) + \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\Gamma(i)}^{(\ell)}, \theta_i) \right]
For nodes with no parents, the parameters define some categorical distribution:
Ø MAP or ML learning of standard Bernoulli/categorical parameters.
More generally, there are multiple categorical distributions per node, one for every combination of parent variables:
Ø The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration.
Ø Apply independent MAP or ML learning to each.
Concerns for nodes with many parents:
Ø Computation: a large number of parameters to estimate.
Ø Sparsity: there may be little (or even no) data for some configurations of the parent variables.
Ø Priors can help, but may still be inadequate.
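The per-parent-configuration counting described above, with an optional Dirichlet pseudo-count to smooth sparse configurations, might look like the following sketch (a hypothetical helper, not from the course):

```python
import numpy as np

def fit_cpt(child, parents, K_child, K_parents, alpha=1.0):
    """Estimate p(x_i | parents) for a discrete node.
    child: (L,) array; parents: (L, d) array; K_parents: list of parent cardinalities.
    alpha is a symmetric Dirichlet pseudo-count; alpha = 0 gives plain ML, which
    leaves unseen parent configurations undefined (set to zero here)."""
    counts = np.full(tuple(K_parents) + (K_child,), float(alpha))
    for x_i, pa in zip(child, parents):
        counts[tuple(pa) + (x_i,)] += 1.0        # one categorical per parent config
    totals = counts.sum(axis=-1, keepdims=True)
    cpt = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return cpt   # cpt[pa_1, ..., pa_d, a] = p(x_i = a | parents = (pa_1, ..., pa_d))

# Example: a binary child with two ternary parents -> 9 categorical distributions.
rng = np.random.default_rng(0)
parents = rng.integers(0, 3, size=(500, 2))
child = (parents.sum(axis=1) % 2 == 0).astype(int) ^ (rng.random(500) < 0.1)
cpt = fit_cpt(child, parents, K_child=2, K_parents=[3, 3], alpha=1.0)
```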

Generalized Linear Models
A general framework for modeling non-Gaussian data with linear prediction, using exponential families:
Ø Choose features of both inputs y and outputs x: \phi(x_i) \in \mathbb{R}^M, \psi(y_i) \in \mathbb{R}^D.
Ø Construct a context-specific vector of parameters: \theta_i = W \psi(y_i), \quad W \in \mathbb{R}^{M \times D}.
Ø The observation comes from an exponential family: p(x_i \mid y_i, W) = \exp\{\theta_i^T \phi(x_i) - \Phi(\theta_i)\}.
Special cases: linear regression (Gaussian) and logistic regression (discrete).
Application to directed graphical models: set y_i = x_{\Gamma(i)}, and learn to predict each node based on features of its parents.

Logistic Regression & Binary Graphical Models
p(x_i \mid x_{\Gamma(i)}, w) = \mathrm{Ber}(x_i \mid \sigma(w^T \phi(x_{\Gamma(i)}))), \qquad \sigma(\gamma) = \frac{1}{1 + e^{-\gamma}} = \frac{e^{\gamma}}{1 + e^{\gamma}}
(Figure: a plot of the logistic function \sigma(\gamma) for \gamma between -10 and 10.)
Suppose that node i has d binary parent nodes.
An arbitrary conditional distribution can be written using 2^d binary features: indicators for all joint configurations of the parents.
Alternatively, we can use d+1 binary features to independently model the relationship between node i and each of its d parents:
\phi(x_{\Gamma(i)}) = [1, x_{i_1}, x_{i_2}, \ldots, x_{i_d}] \text{ for } i_j \in \Gamma(i)
Or, we can pick feature sets in between these extremes, for example O(d^2) binary features to capture pairwise relationships among the parents.
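The three feature choices can be written out directly; the helpers below are our own illustrations (not course code) of the 2^d indicator features, the d+1 linear features, and the O(d^2) pairwise features for a node with d binary parents. Each returns a vector phi to be used in a logistic-regression CPD p(x_i = 1 | parents) = sigma(w @ phi):

```python
import numpy as np
from itertools import combinations

def features_full(pa):
    """2^d indicator features, one per joint parent configuration."""
    d = len(pa)
    idx = int("".join(map(str, pa)), 2)      # index of this parent configuration
    phi = np.zeros(2 ** d)
    phi[idx] = 1.0
    return phi

def features_linear(pa):
    """d + 1 features: a bias plus one indicator per parent."""
    return np.concatenate(([1.0], np.asarray(pa, dtype=float)))

def features_pairwise(pa):
    """O(d^2) features: bias, singleton indicators, and products of parent pairs."""
    pa = np.asarray(pa, dtype=float)
    pairs = [pa[j] * pa[k] for j, k in combinations(range(len(pa)), 2)]
    return np.concatenate(([1.0], pa, pairs))

pa = [1, 0, 1]                               # an example parent configuration (d = 3)
print(features_full(pa).size, features_linear(pa).size, features_pairwise(pa).size)
# -> 8, 4, 7
```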

Example: Medical Diagnosis
Learning independent distributions for every combination of diseases may be computationally intractable and lead to poor generalization.
Instead, assume restricted parameterizations in which child distributions depend on some features of their parents: use logistic regression for binary symptoms, or more generally some appropriate generalized linear model.

EM for Learning with Partial Observations
(Figure: a hidden Markov model with an initial state z_0, hidden states z_1, z_2, z_3, and observations x_1, x_2, x_3.)
Ø Initialization: Randomly select starting parameters \theta^{(0)}.
Ø E-Step: Given parameters, infer the posterior of the hidden data:
q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})
Find the posterior distribution of the factor/clique linking each variable to its parents, as in EM for undirected models.
Ø M-Step: Given posterior distributions, learn parameters that fit the data:
\theta^{(t)} = \arg\max_{\theta} \left[ \log p(\theta) + \sum_z q^{(t)}(z) \ln p(x, z \mid \theta) \right]
Independently for every node in the graphical model, use expected statistics from the E-step to estimate the parameters.
Ø Iteration: Alternate the E-step and M-step until convergence.
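To illustrate these steps on the smallest possible directed model with hidden variables (a toy of our own, not the course code), the sketch below runs EM for a model with one hidden discrete node z and two observed binary children; the E-step is exact because the posterior over z is a small categorical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 2, 1000

# Generate synthetic data from a "true" model: z -> x1, z -> x2.
true_pz = np.array([0.6, 0.4])
true_px_z = np.array([[0.9, 0.2],    # p(x1 = 1 | z)
                      [0.8, 0.1]])   # p(x2 = 1 | z)
z_true = rng.choice(K, size=L, p=true_pz)
X = (rng.random((L, 2)) < true_px_z[:, z_true].T).astype(int)

# Initialization: random starting parameters theta^(0).
pz = rng.dirichlet(np.ones(K))
px_z = rng.random((2, K))            # px_z[j, k] = p(x_j = 1 | z = k)

for it in range(100):
    # E-step: exact posterior q(z) = p(z | x, theta) for every training case.
    lik = pz * np.prod(np.where(X[:, :, None] == 1, px_z, 1 - px_z), axis=1)
    q = lik / lik.sum(axis=1, keepdims=True)          # shape (L, K)
    # M-step: expected counts give independent updates for each node's CPD.
    pz = q.mean(axis=0)
    px_z = (X.T @ q) / q.sum(axis=0)

# Up to a relabeling of z, the estimates approach the true parameters.
print(pz, px_z)
```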