CS242: Probabilistic Graphical Models
Lecture 4B: Learning Tree-Structured and Directed Graphs
Professor Erik Sudderth, Brown University Computer Science
October 6, 2016
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning: http://www.cs.ucl.ac.uk/staff/d.barber/brml/
A Prototypical Course Project Ø Find a problem where learning with graphical models seems promising. Typically this needs to be an area where data is easily accessible. Ø Identify a baseline approach someone has used to analyze related data. Could involve graphical models, or could involve alternative ML methods. Ø Propose, derive, implement, and validate a new approach requiring: A different graphical model structure than previous work, and/or different inference/learning algorithms than those tested in previous work. Ø Validate your graphical model via careful experiments. Not only performance metrics for your application, but also confirmation of learning algorithm correctness, and exploration of learned model properties. Ø Interested in a different project style, perhaps more theoretical? Talk to me.
Course Project Timeline Details online: http://cs.brown.edu/courses/cs242/projects/ Ø Oct. 26: Identify project teams, send brief description (2-3 paragraphs). Team size? 3 students preferred, 2 is okay, 1 requires special permission. Ø Nov. 9: Project proposal due (2-3 pages). Requires an in-depth plan and preliminary results, not just problem area. Ø Dec. 13: Project presentations (~10 minutes per team). All team members must contribute. Ø Dec. 20: Project reports due (6-10 pages). Although your results need not be sufficiently novel for publication, the presentation and experimental protocols should be of high quality.
CS242: Lecture 4B Outline Ø Learning Tree-Structured Graphical Models Ø Learning Directed Graphical Models
A Little Information Theory
The entropy is a natural measure of the inherent uncertainty (difficulty of compression) of a random variable:
H(p) = -\sum_{x} p(x) \log p(x) \geq 0
The relative entropy or Kullback-Leibler (KL) divergence is a non-negative, but asymmetric, distance between a given pair of probability distributions:
D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \geq 0
The KL divergence equals zero if and only if p(x) = q(x) for all x.
The mutual information measures dependence between a pair of random variables:
I(p_{xy}) = D(p_{xy} \| p_x p_y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \geq 0
It is zero if and only if the variables X and Y are independent.
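As a concrete check on these definitions, here is a small numerical sketch (the function names and the natural-log convention are illustrative choices, not from the slides):

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_x p(x) log p(x), in nats; uses the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL divergence D(p || q); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X;Y) = D(p_xy || p_x p_y), for a joint distribution given as a 2D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)  # marginal over rows
    p_y = p_xy.sum(axis=0)  # marginal over columns
    return kl_divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())
```

For example, a joint distribution that factorizes as an outer product of its marginals has zero mutual information, matching the "if and only if" claim above.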
Factorizations for Pairwise Markov Trees
Pick any node s as root, and write the directed factorization out to the leaves:
p(x) = p(x_s) \prod_{t \neq s} p(x_t \mid x_{\pi(t)}) = p(x_s) \prod_{t \neq s} \frac{p(x_t, x_{\pi(t)})}{p(x_{\pi(t)})} = p(x_s) \prod_{t \neq s} \frac{p(x_t, x_{\pi(t)})}{p(x_t)\,p(x_{\pi(t)})} \, p(x_t)
Grouping terms gives canonical undirected factorizations:
p(x) = \prod_{s \in V} p(x_s) \prod_{(s,t) \in E} \frac{p(x_s, x_t)}{p(x_s)\,p(x_t)} = \prod_{s \in V} p(x_s)^{1 - d_s} \prod_{(s,t) \in E} p(x_s, x_t)
where d_s is the number of neighbors of node s.
Factorizations for Pairwise Markov Trees
Pairwise MRFs are defined by non-negative potentials:
p(x) = \frac{1}{Z} \prod_{s \in V} \psi_s(x_s) \prod_{(s,t) \in E} \psi_{st}(x_s, x_t)
A canonical parameterization via marginals:
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}
Suppose we choose a self-consistent set of non-negative discrete marginals:
\sum_{x_s} q_s(x_s) = 1, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
The canonical parameterization then exactly matches these marginals:
q_s(a) = \sum_{x : x_s = a} p(x), \qquad q_{st}(a, b) = \sum_{x : x_s = a,\, x_t = b} p(x)
Learning with a Fixed Tree Structure
A canonical parameterization via valid marginals:
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
Suppose we have L observations \{x^{(\ell)}\}_{\ell=1}^{L} with empirical distribution:
\hat{p}_s(a) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}), \qquad \hat{p}_{st}(a, b) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)})\,\delta_b(x_t^{(\ell)})
The (normalized) maximum likelihood parameter estimation objective is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = \frac{1}{L} \sum_{\ell=1}^{L} \left[ \sum_{s \in V} \log q_s(x_s^{(\ell)}) + \sum_{(s,t) \in E} \log \frac{q_{st}(x_s^{(\ell)}, x_t^{(\ell)})}{q_s(x_s^{(\ell)})\,q_t(x_t^{(\ell)})} \right]
Learning with a Fixed Tree Structure
A canonical parameterization via valid marginals:
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}, \qquad \sum_{x_t} q_{st}(x_s, x_t) = q_s(x_s) \text{ for all } x_s.
This is an exponential family distribution with statistics (features):
\phi_{s;a}(x) = \delta_a(x_s) \text{ for all } s \in V, \; a = 1, \ldots, K_s.
\phi_{st;ab}(x) = \delta_a(x_s)\,\delta_b(x_t) \text{ for all } (s,t) \in E, \; a = 1, \ldots, K_s, \; b = 1, \ldots, K_t.
Maximum likelihood estimates match expected values of the statistics, and thus exactly match the empirical marginal distributions:
q_s(x_s) = \hat{p}_s(x_s) \text{ for all } s \in V, \qquad q_{st}(x_s, x_t) = \hat{p}_{st}(x_s, x_t) \text{ for all } (s,t) \in E.
Learning with a Fixed Tree Structure
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}
q_s(a) = \hat{p}_s(a) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)}), \qquad q_{st}(a, b) = \hat{p}_{st}(a, b) = \frac{1}{L} \sum_{\ell=1}^{L} \delta_a(x_s^{(\ell)})\,\delta_b(x_t^{(\ell)})
The optimal ML model matches empirical marginals for all single variables, and empirical pairwise marginals for all neighboring variables. For these optimal parameters, the corresponding training log-likelihood is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in V} \hat{H}_s + \sum_{(s,t) \in E} \hat{I}_{st}
Empirical entropy and empirical mutual information:
\hat{H}_s = -\sum_{x_s} \hat{p}_s(x_s) \log \hat{p}_s(x_s), \qquad \hat{I}_{st} = \sum_{x_s, x_t} \hat{p}_{st}(x_s, x_t) \log \frac{\hat{p}_{st}(x_s, x_t)}{\hat{p}_s(x_s)\,\hat{p}_t(x_t)}
Learning with a Fixed Tree Structure
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}, \qquad q_s(a) = \hat{p}_s(a), \qquad q_{st}(a, b) = \hat{p}_{st}(a, b)
For these optimal parameters, the corresponding training log-likelihood is:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in V} \hat{H}_s + \sum_{(s,t) \in E} \hat{I}_{st}
Intuition, via the empirical entropy and empirical mutual information terms: data that is more random is harder to predict, and the likelihood increases when edges link strongly dependent variables.
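The decomposition above can be checked numerically. The following sketch (function names are hypothetical) computes the average training log-likelihood of the ML tree model directly from empirical entropies, using the identity \hat{I}_{st} = \hat{H}_s + \hat{H}_t - \hat{H}_{st}:

```python
import numpy as np

def empirical_entropy(p_hat):
    """Entropy of an empirical distribution, with the convention 0 log 0 = 0."""
    nz = p_hat > 0
    return -np.sum(p_hat[nz] * np.log(p_hat[nz]))

def tree_log_likelihood(data, edges, K):
    """Average training log-likelihood of the ML tree model:
    -sum_s H_hat_s + sum_(s,t) I_hat_st, from discrete data in {0,...,K-1}.
    data: (L, N) integer array; edges: list of (s, t) pairs forming a tree."""
    L, N = data.shape
    H = []   # per-node empirical entropies
    ll = 0.0
    for s in range(N):
        p_s = np.bincount(data[:, s], minlength=K) / L
        H.append(empirical_entropy(p_s))
        ll -= H[s]
    for s, t in edges:
        p_st = np.zeros((K, K))  # empirical pairwise marginal
        for a, b in zip(data[:, s], data[:, t]):
            p_st[a, b] += 1.0 / L
        ll += H[s] + H[t] - empirical_entropy(p_st)  # add I_hat_st
    return ll
```

For perfectly correlated binary data on a single edge, this returns -log 2, the average log-probability of the two observed configurations.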
Optimizing Tree Structures
p(x) = \prod_{s \in V} q_s(x_s) \prod_{(s,t) \in E} \frac{q_{st}(x_s, x_t)}{q_s(x_s)\,q_t(x_t)}
Picking the max-likelihood marginals for any tree:
\frac{1}{L} \sum_{\ell=1}^{L} \log p(x^{(\ell)}) = -\sum_{s \in V} \hat{H}_s + \sum_{(s,t) \in E} \hat{I}_{st}
\hat{H}_s = -\sum_{x_s} \hat{p}_s(x_s) \log \hat{p}_s(x_s), \qquad \hat{I}_{st} = \sum_{x_s, x_t} \hat{p}_{st}(x_s, x_t) \log \frac{\hat{p}_{st}(x_s, x_t)}{\hat{p}_s(x_s)\,\hat{p}_t(x_t)}
Because the vertex entropies do not depend on the edge set, we need only maximize the following objective over acyclic edge patterns:
f(E) = \sum_{(s,t) \in E} \hat{I}_{st}
Weight each pair of nodes by its empirical mutual information, and find the tree which maximizes the sum of edge weights!
Example greedy algorithm (Kruskal):
Ø Initialize with the highest-weight edge (most-dependent pair of variables).
Ø Iteratively add the highest-weight edge that is not yet included in the tree, and does not introduce any cycles. Stop when all variables are connected.
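The Kruskal-style greedy algorithm above might be sketched as follows, assuming the empirical mutual informations have already been computed into a symmetric matrix (the function name and union-find details are illustrative choices):

```python
import numpy as np

def chow_liu_tree(mi):
    """Maximum-weight spanning tree over nodes 0..N-1 via Kruskal's algorithm.
    mi: symmetric (N, N) array of empirical mutual informations.
    Returns a list of N-1 edges (s, t)."""
    N = mi.shape[0]
    # Sort candidate edges by decreasing mutual information.
    edges = sorted(((mi[s, t], s, t) for s in range(N) for t in range(s + 1, N)),
                   reverse=True)
    parent = list(range(N))  # union-find structure for cycle detection

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    tree = []
    for w, s, t in edges:
        rs, rt = find(s), find(t)
        if rs != rt:               # adding (s, t) does not introduce a cycle
            parent[rs] = rt
            tree.append((s, t))
        if len(tree) == N - 1:     # all variables connected; stop
            break
    return tree
```

On three variables where the (0,1) and (1,2) dependencies are strongest, the returned tree is the chain through node 1, as the greedy argument predicts.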
Example: Word Usage in Newsgroups
[Figure: ML tree structure over 100 words, e.g. linking car-engine-dealer, god-bible-christian-jesus, and nasa-shuttle-orbit-launch.]
Schmidt 2010 PhD Thesis: Presence of 100 words across 16,242 postings to 20 newsgroups.
Example: Word Usage in Newsgroups
[Figure: graph structure learned with L1 regularization (λ = 256); isolated nodes are not plotted.]
Schmidt 2010 PhD Thesis: Presence of 100 words across 16,242 postings to 20 newsgroups.
CS242: Lecture 4B Outline
Ø Learning Tree-Structured Graphical Models
Ø Learning Directed Graphical Models
Learning Directed Graphical Models
p(x \mid \theta) = \prod_{i \in V} p(x_i \mid x_{\pi(i)}, \theta_i)
Intuition: we must learn a good predictive model of each node, given its parent nodes.
The directed factorization allows the likelihood to locally decompose:
p(x \mid \theta) = p(x_1 \mid \theta_1)\,p(x_2 \mid x_1, \theta_2)\,p(x_3 \mid x_1, \theta_3)\,p(x_4 \mid x_2, x_3, \theta_4)
\log p(x \mid \theta) = \log p(x_1 \mid \theta_1) + \log p(x_2 \mid x_1, \theta_2) + \log p(x_3 \mid x_1, \theta_3) + \log p(x_4 \mid x_2, x_3, \theta_4)
We often assume a similarly factorized (meta-independent) prior:
p(\theta) = p(\theta_1)\,p(\theta_2)\,p(\theta_3)\,p(\theta_4)
\log p(\theta) = \log p(\theta_1) + \log p(\theta_2) + \log p(\theta_3) + \log p(\theta_4)
We thus have independent ML or Bayesian learning problems at each node.
Learning with Complete Observations
The directed graph encodes the statistical structure of single training examples, replicated in plate notation over L independent, identically distributed training examples:
p(x \mid \theta) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i)
Given L independent, identically distributed, completely observed samples:
\log p(x \mid \theta) = \sum_{\ell=1}^{L} \sum_{i=1}^{N} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i) = \sum_{i=1}^{N} \left[ \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i) \right]
Prior Distributions and Tied Parameters
\log p(x \mid \theta) = \sum_{\ell=1}^{L} \sum_{i=1}^{N} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i)
A meta-independent, factorized prior:
\log p(\theta) = \sum_{i=1}^{N} \log p(\theta_i)
The factorized posterior allows independent MAP learning for each node:
\log p(\theta \mid x) = C + \sum_{i=1}^{N} \left[ \log p(\theta_i) + \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i) \right]
Learning remains easy when groups of nodes are tied to use identical, shared parameters:
\log p(x \mid \theta) = \sum_{i=1}^{N} \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_{b_i})
\log p(\theta_b \mid x) = C + \log p(\theta_b) + \sum_{i : b_i = b} \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_b)
Here b_i denotes the type or class of variable node i.
Example: Tied Temporal Parameters
p(x, z \mid \theta) = \prod_{\ell=1}^{L} \prod_{t=1}^{T_\ell} p(z_t^{(\ell)} \mid z_{t-1}^{(\ell)}, \theta_{\text{time}})\,p(x_t^{(\ell)} \mid z_t^{(\ell)}, \theta_{\text{obs}})
[Figure: three hidden Markov chains (Sequences 1-3) of different lengths, all sharing the tied transition parameters θ_time and observation parameters θ_obs.]
Learning (Discrete) Directed Graphical Models
\log p(\theta \mid x) = C + \sum_{i=1}^{N} \left[ \log p(\theta_i) + \sum_{\ell=1}^{L} \log p(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}, \theta_i) \right]
For nodes with no parents, the parameters define some categorical distribution:
Ø MAP or ML learning of standard Bernoulli/categorical parameters
More generally, there are multiple categorical distributions per node, one for every combination of parent variables:
Ø The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration
Ø Apply independent MAP or ML learning to each
Concerns for nodes with many parents:
Ø Computation: large number of parameters to estimate
Ø Sparsity: may have little (or even no) data for some configurations of the parent variables
Ø Priors can help, but may still be inadequate
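A minimal sketch of this per-configuration estimation for one node's conditional probability table (the function name and Dirichlet pseudo-count interface are illustrative, not from the lecture):

```python
import numpy as np

def fit_cpt(child, parents, K_child, K_parents, alpha=1.0):
    """MAP estimate of the conditional probability table p(x_i | x_parents).
    child: (L,) integer array of child values in {0,...,K_child-1}.
    parents: (L, P) integer array of parent values.
    K_parents: tuple of cardinalities for the P parents.
    alpha: Dirichlet pseudo-count per cell (alpha=0 gives the plain ML estimate,
    which is undefined for parent configurations with no training data).
    Returns a table of shape K_parents + (K_child,) whose last axis sums to 1."""
    counts = np.full(tuple(K_parents) + (K_child,), float(alpha))
    for x_i, pa in zip(child, parents):
        counts[tuple(pa) + (x_i,)] += 1.0  # one categorical count per parent config
    return counts / counts.sum(axis=-1, keepdims=True)
```

The table grows exponentially in the number of parents, which is exactly the computation and sparsity concern on this slide.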
Generalized Linear Models
A general framework for modeling non-Gaussian data with linear prediction, using exponential families:
Ø Choose features of both inputs y and outputs x: \phi(x_i) \in \mathbb{R}^M, \psi(y_i) \in \mathbb{R}^D
Ø Construct a context-specific vector of parameters: \theta_i = W \psi(y_i), W \in \mathbb{R}^{M \times D}
Ø The observation comes from an exponential family: p(x_i \mid y_i, W) = \exp\{\theta_i^T \phi(x_i) - \Phi(\theta_i)\}
Special cases: linear regression (Gaussian) and logistic regression (discrete).
Application to directed graphical models: set y_i = x_{\pi(i)}, and learn to predict each node based on features of its parents.
Logistic Regression & Binary Graphical Models
p(x_i \mid x_{\pi(i)}, w) = \text{Ber}(x_i \mid \sigma(w^T \phi(x_{\pi(i)}))), \qquad \sigma(\gamma) = \frac{1}{1 + e^{-\gamma}} = \frac{e^{\gamma}}{1 + e^{\gamma}}
Suppose that node i has d binary parent nodes.
An arbitrary conditional distribution can be written using 2^d binary features: indicators for all joint configurations of the parents.
Alternatively, we can use d+1 binary features to independently model the relationship between node i and each of its d parents:
\phi(x_{\pi(i)}) = [1, x_{i_1}, x_{i_2}, \ldots, x_{i_d}] \text{ for } i_j \in \pi(i)
Or, we can pick feature sets in between these extremes, for example O(d^2) binary features to capture pairwise relationships among parents.
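The two extreme feature choices above can be sketched as follows for binary parents (helper names are hypothetical illustrations):

```python
from itertools import product
import numpy as np

def linear_features(pa):
    """d+1 features: a bias term plus one indicator per parent.
    Models each parent's influence on node i independently."""
    return np.concatenate(([1.0], np.asarray(pa, dtype=float)))

def full_features(pa):
    """2^d features: one indicator per joint parent configuration.
    Can represent an arbitrary conditional distribution."""
    configs = list(product([0, 1], repeat=len(pa)))
    return np.array([1.0 if tuple(pa) == c else 0.0 for c in configs])

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def bernoulli_cpd(pa, w, features):
    """p(x_i = 1 | parents, w) under the chosen feature map."""
    return sigmoid(w @ features(pa))
```

With `full_features` the weight vector has one entry per parent configuration (a reparameterized lookup table), while `linear_features` trades expressiveness for only d+1 parameters.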
Example: Medical Diagnosis
Learning independent distributions for every combination of diseases may be computationally intractable and lead to poor generalization. Instead, assume restricted parameterizations in which child distributions depend on some features of their parents: use logistic regression for binary symptoms, or more generally some appropriate generalized linear model.
EM for Learning with Partial Observations
Ø Initialization: Randomly select starting parameters \theta^{(0)}
Ø E-Step: Given parameters, infer the posterior of the hidden data:
q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})
Find the posterior distribution of the factor/clique linking each variable to its parents, as in EM for undirected models.
Ø M-Step: Given posterior distributions, learn parameters that fit the data:
\theta^{(t)} = \arg\max_{\theta} \log p(\theta) + \sum_z q^{(t)}(z) \ln p(x, z \mid \theta)
Independently for every node in the graphical model, use the expected statistics from the E-step to estimate parameters.
Ø Iteration: Alternate E-step and M-step until convergence.
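As an illustration of these EM steps, here is a sketch for perhaps the simplest partially observed directed model: a hidden discrete root z with binary children (naive Bayes). The choice of this particular model, the random initialization range, and the smoothing constants are all illustrative assumptions, not specifics from the lecture:

```python
import numpy as np

def em_naive_bayes(X, K, n_iters=50, seed=0):
    """EM for a hidden discrete root z in {0..K-1} with N binary children.
    X: (L, N) binary array of observed children.
    Returns mixing weights pi (K,) and Bernoulli parameters theta (K, N),
    where theta[k, i] = p(x_i = 1 | z = k)."""
    rng = np.random.default_rng(seed)
    L, N = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, size=(K, N))  # random starting parameters
    for _ in range(n_iters):
        # E-step: posterior q(z | x^(l)) for every training example.
        log_post = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                    + np.log(pi))                     # (L, K), unnormalized
        log_post -= log_post.max(axis=1, keepdims=True)
        q = np.exp(log_post)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: expected statistics give independent updates per node.
        Nk = q.sum(axis=0)
        pi = Nk / L
        theta = (q.T @ X + 1e-3) / (Nk[:, None] + 2e-3)  # tiny smoothing
    return pi, theta
```

The M-step updates mirror the complete-data estimates from earlier slides, with hard counts replaced by the expected counts computed in the E-step.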