11. Learning graphical models

1 Learning graphical models: maximum likelihood; parameter learning; structural learning; learning partially observed graphical models

2 Learning graphical models 11-2 statistical inference addresses how to overcome computational challenges given a graphical model; learning addresses how to overcome both computational and statistical challenges to learn graphical models from data; the ultimate goal of learning could be (a) to uncover structure, (b) to solve a particular problem such as estimation, classification, or decision, or (c) to learn the joint distribution in order to make predictions, and the right learning algorithm depends on the particular goal

3 Learning graphical models 11-3 choosing the right model to learn is crucial, in the fundamental tradeoff between overfitting and underfitting, often called the bias-variance tradeoff; consider the task of learning a function over $x \in \mathbb{R}^d$ where $Y = f(x) + Z$ with i.i.d. noise $Z$; the goal is to predict $y$ from $x$ by learning the function $f$ from training data $\{(x_1, y_1), \dots, (x_n, y_n)\}$; the bias-variance tradeoff for $\mathrm{Err}(x)$ at a given $x$ is
$$ \mathbb{E}\big[(Y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}} + \underbrace{\mathbb{E}\big[(Y - f(x))^2\big]}_{\text{Irreducible error}} $$
where $\hat f$ denotes the estimate learned from the $n$ training samples
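
Not part of the original slides: a minimal Monte Carlo sketch of the decomposition above, fitting polynomials of different degrees to repeated draws of the training set and estimating the bias and variance terms at a fixed query point. The target function, noise level, degrees, and query point are arbitrary choices for the example.

```python
import numpy as np

# hypothetical setup: f(x) = sin(2*pi*x), Gaussian noise, polynomial fits of varying degree
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma, n_train, n_repeats = 0.3, 30, 500
x0 = 0.35                                        # the fixed query point x in Err(x)

for degree in (1, 3, 9):
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + sigma * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds[r] = np.polyval(coeffs, x0)        # \hat f(x0) for this training set
    bias2 = (preds.mean() - f(x0)) ** 2          # (E[\hat f(x0)] - f(x0))^2
    var = preds.var()                            # E[(\hat f(x0) - E[\hat f(x0)])^2]
    print(f"degree {degree}: bias^2 ~ {bias2:.4f}, variance ~ {var:.4f}, "
          f"irreducible ~ {sigma**2:.4f}")
```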

4 Learning graphical models 11-4 when the model is too complex, the model overfits the training data and does not generalize to new data, which is referred to as high variance; when the model is too simple, the model underfits the data and has high bias; a common strategy to combat this is to (a) constrain the class of models or (b) penalize the model complexity; graphical models provide a hierarchy of model complexity to avoid overfitting/underfitting; in increasing level of difficulty, learning tasks can be classified as follows: we know the graph, or we impose a particular graph structure, and we need to learn the parameters; we observe all the variables, and we need to infer the structure of the graph and the parameters; the variables are partially observed, and we need to infer the structure

5 Maximum likelihood we want to learn the graphical model consisting of $p$ nodes from $n$ i.i.d. samples $x^{(1)}, \dots, x^{(n)} \in \mathcal{X}^p$ drawn from
$$ \mu(x) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{\partial a}) $$
it is more convenient to parametrize the compatibility functions as $\psi_a(x_{\partial a}) = e^{\theta_a(x_{\partial a})}$, and one approach is to use the maximum likelihood estimator
$$ L_n(\theta; G, \{x^{(\ell)}\}_{\ell \in [n]}) = \frac{1}{n} \sum_{\ell \in [n]} \log \mu(x^{(\ell)}) = \frac{1}{n} \sum_{\ell \in [n]} \sum_{a \in F} \theta_a(x^{(\ell)}_{\partial a}) - \log Z(\theta; G) $$
this is a strictly concave function over $\theta = [\theta_a]_{a \in F} \in \mathbb{R}^{\sum_{a \in F} |\mathcal{X}|^{|\partial a|}}$; however, evaluating this function requires computing the log partition function, which is in general #P-hard, even if $G$ is given; if inference is efficient then learning is efficient Learning graphical models 11-5

6 Learning graphical models 11-6 in the case of Ising models,
$$ \mu(x) = \frac{1}{Z} \prod_{(i,j) \in E} e^{\theta_{ij} x_i x_j} \prod_{i \in V} e^{\theta_i x_i} $$
the log likelihood function is
$$ L_n(\theta; G, \{x^{(\ell)}\}_{\ell \in [n]}) = \sum_{(i,j) \in E} \theta_{ij} \underbrace{\frac{1}{n} \sum_{\ell=1}^n x_i^{(\ell)} x_j^{(\ell)}}_{\hat M_{ij}} + \sum_{i \in V} \theta_i \underbrace{\frac{1}{n} \sum_{\ell=1}^n x_i^{(\ell)}}_{\hat M_i} - \underbrace{\log Z(\theta; G)}_{\Phi(\theta)} = \langle \hat M, \theta \rangle - \Phi(\theta) $$
where $\hat M = [\hat M_{ij}, \dots, \hat M_i, \dots] \in \mathbb{R}^{|E|+p}$ with $\hat M_{ij} = \frac{1}{n} \sum_\ell x_i^{(\ell)} x_j^{(\ell)}$ and $\hat M_i = \frac{1}{n} \sum_\ell x_i^{(\ell)}$; this is a strictly concave function over $\theta = [\theta_{ij}, \dots, \theta_i, \dots] \in \mathbb{R}^{|E|+p}$
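
For concreteness, a small sketch (not from the slides) that computes the sufficient statistics $\hat M$ and evaluates the Ising log-likelihood $\langle \hat M, \theta \rangle - \log Z(\theta; G)$ by brute-force enumeration of the partition function; the graph, parameters, and data below are placeholders, and the enumeration is only feasible for very small $p$.

```python
import itertools
import numpy as np

def ising_log_likelihood(samples, edges, theta_edge, theta_node):
    """L_n(theta) = sum_ij theta_ij * Mhat_ij + sum_i theta_i * Mhat_i - log Z(theta).

    samples: (n, p) array with entries in {-1, +1}; Z is computed by enumeration.
    """
    n, p = samples.shape
    # empirical moments (sufficient statistics)
    M_edge = {(i, j): np.mean(samples[:, i] * samples[:, j]) for (i, j) in edges}
    M_node = samples.mean(axis=0)
    energy = sum(theta_edge[e] * M_edge[e] for e in edges) + float(theta_node @ M_node)
    # log partition function by exhaustive enumeration over {-1,+1}^p
    logZ_terms = []
    for x in itertools.product([-1, 1], repeat=p):
        x = np.array(x)
        s = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges) + float(theta_node @ x)
        logZ_terms.append(s)
    logZ = np.log(np.sum(np.exp(logZ_terms)))
    return energy - logZ

# toy example: a 3-node chain with arbitrary parameters and random "data"
edges = [(0, 1), (1, 2)]
theta_edge = {(0, 1): 0.5, (1, 2): -0.3}
theta_node = np.array([0.1, 0.0, -0.2])
samples = np.random.default_rng(1).choice([-1, 1], size=(100, 3))
print(ising_log_likelihood(samples, edges, theta_edge, theta_node))
```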

7 Learning graphical models 11-7 this maximum likelihood estimator is consistent, i.e. let $\hat\theta = \arg\max_\theta L_n(\theta; G, \{x^{(\ell)}\})$, and let $\theta^*$ denote the true parameter; then $\lim_{n \to \infty} \hat\theta = \theta^*$
Proof. by optimality of $\hat\theta$,
$$ \langle \hat M, \hat\theta \rangle - \Phi(\hat\theta) \;\geq\; \langle \hat M, \theta^* \rangle - \Phi(\theta^*) $$
by strong convexity of $\Phi(\theta)$, there exists $\gamma > 0$ such that
$$ \Phi(\hat\theta) \;\geq\; \Phi(\theta^*) + \langle M, \hat\theta - \theta^* \rangle + \frac{\gamma}{2} \|\hat\theta - \theta^*\|^2 $$
which follows from the fact that $\nabla \Phi(\theta^*) = M$, where $M_{ij} = \mathbb{E}[x_i x_j]$ and $M_i = \mathbb{E}[x_i]$, and together we get
$$ \langle \hat M, \hat\theta - \theta^* \rangle \;\geq\; \Phi(\hat\theta) - \Phi(\theta^*) \;\geq\; \langle M, \hat\theta - \theta^* \rangle + \frac{\gamma}{2} \|\hat\theta - \theta^*\|^2 $$
it follows that
$$ \|\hat M - M\| \, \|\hat\theta - \theta^*\| \;\geq\; \langle \hat M - M, \hat\theta - \theta^* \rangle \;\geq\; \frac{\gamma}{2} \|\hat\theta - \theta^*\|^2 $$

8 Learning graphical models 11-8 we have
$$ \|\hat\theta - \theta^*\| \;\leq\; \frac{2}{\gamma} \|\hat M - M\| $$
which gives, with high probability,
$$ \|\hat\theta - \theta^*\| \;\lesssim\; \frac{1}{\gamma} \sqrt{\frac{|E| + p}{n}} $$
where $\gamma = \inf_\theta \lambda_{\min}(\nabla^2 \Phi(\theta))$; computing the gradient of $\Phi$ is straightforward:
$$ \frac{\partial \Phi(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta; G)} \sum_x \prod_{(i,j) \in E} e^{\theta_{ij} x_i x_j} \prod_{i \in V} e^{\theta_i x_i} \, x_i x_j = \mathbb{E}_\theta[x_i x_j] = M_{ij} $$
again, this is computationally challenging in general, but if such inference is efficient, then learning can also be efficient
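
Since the gradient of the log-likelihood is $\hat M - \mathbb{E}_\theta[\text{statistics}]$, maximum likelihood fitting can be done by gradient ascent whenever the model expectations can be computed. A rough sketch for a tiny Ising model is below, with expectations computed by enumeration; the step size, iteration count, and data are arbitrary assumptions for the example.

```python
import itertools
import numpy as np

def fit_ising_by_gradient_ascent(samples, edges, lr=0.1, iters=500):
    """Gradient ascent on the Ising log-likelihood: grad = Mhat - E_theta[stats].

    Model expectations are computed by enumerating {-1,+1}^p, so p must be small.
    """
    n, p = samples.shape
    stats = lambda x: np.concatenate(
        [[x[i] * x[j] for (i, j) in edges], x])            # [x_i x_j for edges] + [x_i]
    M_hat = np.mean([stats(x) for x in samples], axis=0)    # empirical moments
    theta = np.zeros(len(edges) + p)
    configs = np.array(list(itertools.product([-1, 1], repeat=p)))
    T = np.array([stats(x) for x in configs])               # statistics of every configuration
    for _ in range(iters):
        w = np.exp(T @ theta)
        probs = w / w.sum()                                  # model distribution mu_theta
        M_model = probs @ T                                  # E_theta[stats]
        theta += lr * (M_hat - M_model)                      # ascend the concave log-likelihood
    return theta

edges = [(0, 1), (1, 2)]
samples = np.random.default_rng(2).choice([-1, 1], size=(200, 3))
print(fit_ising_by_gradient_ascent(samples, edges))
```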

9 Learning graphical models 11-9 recall that
$$ \mu(x) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{\partial a}) $$
let $\hat\mu(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(x^{(i)} = x)$ and $\hat\mu(x_{\partial a}) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(x^{(i)}_{\partial a} = x_{\partial a})$ denote the empirical distribution and its marginals; then the log-likelihood is
$$ L(\psi; x^{(1)}, \dots, x^{(n)}) = \frac{1}{n} \sum_{i \in [n]} \sum_{a \in F} \log \psi_a(x^{(i)}_{\partial a}) - \log Z = \sum_{a \in F} \sum_{x_{\partial a}} \hat\mu(x_{\partial a}) \log \psi_a(x_{\partial a}) - \log Z $$

10 Learning graphical models taking the derivative of $L(\psi; x^{(1)}, \dots, x^{(n)})$ with respect to $\psi_a(x_{\partial a})$ and setting it to zero gives the fixed point condition that the model marginal matches the empirical marginal, $\mu_\psi(x_{\partial a}) = \hat\mu(x_{\partial a})$; if the graph is a tree, then one can find the ML estimate using the above equation; in general, it is computationally challenging, and one resorts to approximate solutions; iterative proportional fitting (IPF) updates the estimates iteratively using the above fixed point equation:
$$ \psi_a^{(t+1)}(x_{\partial a}) = \psi_a^{(t)}(x_{\partial a}) \, \frac{\hat\mu(x_{\partial a})}{\mu^{(t)}(x_{\partial a})} $$
although $\mu^{(t)}(x_{\partial a})$ is difficult to compute, IPF has the nice property that the partition function $Z^{(t)}$ in $\hat\mu^{(t)} = \frac{1}{Z^{(t)}} \prod_{a \in F} \psi_a^{(t)}$ is independent of $t$, so we can start with a distribution with known $Z^{(0)}$
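
A compact IPF sketch on a fully enumerable toy model, assuming binary variables and explicitly tabulated factors: each factor $\psi_a$ is rescaled so that the model marginal on its scope matches the empirical marginal. This is only an illustration; the scopes and synthetic data are placeholders, and model marginals are computed exactly by enumeration.

```python
import itertools
import numpy as np

def ipf(factors, scopes, empirical_marginals, p, iters=50):
    """Iterative proportional fitting for a binary MRF over p variables.

    factors[a] is a table indexed by the configuration of scopes[a];
    empirical_marginals[a] is the empirical marginal on scopes[a].
    Model marginals are computed by exhaustive enumeration (small p only).
    """
    configs = list(itertools.product([0, 1], repeat=p))
    for _ in range(iters):
        for a, scope in enumerate(scopes):
            # unnormalized model distribution and its marginal on this scope
            weights = np.array([np.prod([factors[b][tuple(x[v] for v in sb)]
                                         for b, sb in enumerate(scopes)])
                                for x in configs])
            mu = weights / weights.sum()
            model_marg = np.zeros_like(factors[a])
            for x, m in zip(configs, mu):
                model_marg[tuple(x[v] for v in scope)] += m
            # IPF update: match the empirical marginal on scope a
            factors[a] = factors[a] * empirical_marginals[a] / np.maximum(model_marg, 1e-12)
    return factors

# toy example: two overlapping pairwise factors on 3 binary variables,
# with empirical marginals computed from synthetic data
rng = np.random.default_rng(3)
data = rng.integers(0, 2, size=(500, 3))
scopes = [(0, 1), (1, 2)]
emp = []
for scope in scopes:
    marg = np.zeros((2, 2))
    for row in data:
        marg[tuple(row[v] for v in scope)] += 1
    emp.append(marg / len(data))
factors = ipf([np.ones((2, 2)), np.ones((2, 2))], scopes, emp, p=3)
print(factors[0])
```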

11 Parameter learning suppose $G$ is given, and we want to estimate the compatibility functions $\{\psi_a\}_{a \in F}$ from i.i.d. samples $x^{(1)}, \dots, x^{(n)} \in \mathcal{X}^p$ drawn from $\mu(x) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{\partial a})$
Example. a single node with $x \in \{0, 1\}$, and let $\mu(x) = \theta$ for $x = 1$ and $1 - \theta$ for $x = 0$; consider the estimator $\hat\theta = \frac{1}{n} \sum_{\ell=1}^n x^{(\ell)}$
Claim. For any $\varepsilon, \delta > 0$, let $n(\varepsilon, \delta)$ denote the minimum number of samples required to ensure that $\mathbb{P}(|\hat\theta - \theta| > \varepsilon) \leq \delta$; then
$$ n(\varepsilon, \delta) \;\leq\; \frac{1}{2\varepsilon^2} \log \frac{2}{\delta} $$
this follows directly from Hoeffding's inequality:
$$ \mathbb{P}\Big( \Big| \frac{1}{n} \sum_{\ell=1}^n x^{(\ell)} - \theta \Big| > \varepsilon \Big) \;\leq\; 2 \exp\{-2 n \varepsilon^2\} $$
but it does not depend on the value of $\theta$ Learning graphical models 11-11
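
A quick simulation (not from the slides; parameter values chosen arbitrarily) illustrates the claim: with $n = \lceil \log(2/\delta) / (2\varepsilon^2) \rceil$ samples, the empirical frequency of $|\hat\theta - \theta| > \varepsilon$ should fall below $\delta$.

```python
import numpy as np

theta, eps, delta = 0.3, 0.1, 0.05
n = int(np.ceil(np.log(2 / delta) / (2 * eps ** 2)))   # Hoeffding-based sample size
rng = np.random.default_rng(4)
trials = 2000
failures = 0
for _ in range(trials):
    x = rng.random(n) < theta          # n Bernoulli(theta) samples
    theta_hat = x.mean()
    failures += abs(theta_hat - theta) > eps
print(f"n = {n}, empirical P(|theta_hat - theta| > eps) = {failures / trials:.4f} "
      f"(target <= {delta})")
```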

12 Learning graphical models often we want the error to be in the same range as $\theta$, and letting the target error be $\varepsilon\theta$ gives $n(\varepsilon\theta, \delta) \leq \frac{1}{2\varepsilon^2\theta^2} \log\frac{2}{\delta}$; but applying Chernoff's bound:
$$ \mathbb{P}\Big( \frac{1}{n} \sum_{\ell=1}^n x^{(\ell)} > (1+\varepsilon)\theta \Big) \;\leq\; \exp\{- D_{\mathrm{KL}}((1+\varepsilon)\theta \,\|\, \theta)\, n\}, \quad \text{and} \quad \mathbb{P}\Big( \frac{1}{n} \sum_{\ell=1}^n x^{(\ell)} < (1-\varepsilon)\theta \Big) \;\leq\; \exp\{- D_{\mathrm{KL}}((1-\varepsilon)\theta \,\|\, \theta)\, n\} $$
and using the fact that $D_{\mathrm{KL}}((1\pm\varepsilon)\theta \,\|\, \theta) \geq \frac{1}{4}\varepsilon^2 \theta$ for $|\varepsilon| < 1/2$, this can be tightened to
$$ n(\varepsilon\theta, \delta) \;\leq\; \frac{4}{\varepsilon^2 \theta} \log \frac{2}{\delta} $$

13 Learning graphical models Example. consider now a joint distribution $\mu(x)$ (satisfying $\mu(x) > 0$ for all $x$) over $x \in \{0,1\}^p$ with a large $p$, and the estimator $\hat\mu(x) = \frac{1}{n} \sum_{\ell=1}^n \mathbb{I}(x^{(\ell)} = x)$; it follows that the minimum number of samples required to ensure that $\mathbb{P}(\exists x \text{ such that } |\hat\mu(x) - \mu(x)| > \varepsilon\,\mu(x)) \leq \delta$ is
$$ n(\varepsilon, \delta) \;\leq\; \frac{4}{\varepsilon^2 \min_x \{\mu(x)\}} \log \frac{2^{p+1}}{\delta} $$
which follows from Chernoff's bound with a union bound over $x \in \{0,1\}^p$; since $\min_x \{\mu(x)\} \leq 2^{-p}$, the sample complexity is exponential in the dimension $p$; this is unavoidable in general, but when $\mu(\cdot)$ factorizes according to a graphical model with a sparse graph, the sample complexity decreases significantly

14 Learning graphical models we want to learn the compatibility functions of a graphical model using local data, but compatibility functions are not uniquely defined in $\mu(x) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{\partial a})$; we first need to define a canonical form of the compatibility functions; recall that a positive joint distribution that satisfies all conditional independencies according to a graph $G$ has a factorization according to the graph $G$, which is formally shown in the following
Hammersley-Clifford Theorem. Suppose $\mu(x) > 0$ satisfies all independencies implied by a graphical model $G(V, F, E)$; then
$$ \mu(x) = \mu(0) \prod_{a \in F} \psi_a(x_{\partial a}), \quad \text{where} \quad \psi_a(x_{\partial a}) = \prod_{U \subseteq \partial a} \mu(x_U, 0_{V \setminus U})^{(-1)^{|\partial a \setminus U|}} = \prod_{U \subseteq \partial a} \left( \frac{\mu(x_U, 0_{V \setminus U})}{\mu(0_U, 0_{V \setminus U})} \right)^{(-1)^{|\partial a \setminus U|}} $$
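
As an illustrative sketch (not from the slides) of the canonical form above, the code below evaluates $\psi_a(x_{\partial a}) = \prod_{U \subseteq \partial a} \mu(x_U, 0_{V \setminus U})^{(-1)^{|\partial a \setminus U|}}$ by inclusion-exclusion for a small, strictly positive distribution given as a table; the distribution and the factor scope are placeholders.

```python
import itertools
import numpy as np

def canonical_factor(mu, scope, x, p):
    """psi_a(x_scope) = prod over U subset of scope of mu(x_U, 0 elsewhere)^((-1)^{|scope \\ U|}).

    mu: table of a strictly positive distribution over {0,1}^p (shape (2,)*p);
    scope: tuple of variable indices (the neighborhood of factor a);
    x: full configuration; only its entries on `scope` matter.
    """
    log_psi = 0.0
    for r in range(len(scope) + 1):
        for U in itertools.combinations(scope, r):
            z = [0] * p
            for v in U:
                z[v] = x[v]                      # keep x on U, zeros elsewhere
            sign = (-1) ** (len(scope) - len(U))
            log_psi += sign * np.log(mu[tuple(z)])
    return np.exp(log_psi)

# placeholder: a strictly positive distribution over 3 binary variables
rng = np.random.default_rng(5)
table = rng.random((2, 2, 2)) + 0.1
mu = table / table.sum()
print(canonical_factor(mu, scope=(0, 1), x=(1, 1, 0), p=3))
```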

15 Learning graphical models to estimate each term from samples, notice that
$$ \frac{\mu(x_U, 0_{V \setminus U})}{\mu(0_U, 0_{V \setminus U})} = \frac{\mu(x_U \mid 0_{\partial U})}{\mu(0_U \mid 0_{\partial U})} $$
where $\partial U$ denotes the neighborhood of $U$, since by the Markov property the ratio depends only on $U$ and its neighborhood; if the degree is bounded by $k$, then this involves only on the order of $k^3$ variable nodes

16 Learning graphical models Structural learning consider $p$ random variables $x_1, \dots, x_p$ represented as a graph with $p$ nodes; the number of potential edges is $\binom{p}{2} = p(p-1)/2$, and each is either present or not, resulting in $2^{p(p-1)/2}$ possible graphs; given $n$ i.i.d. observations $x^{(1)}, \dots, x^{(n)} \in \mathcal{X}^p$ (assuming there are no hidden nodes and all variables are observed), we want to select the best graph among the $2^{p(p-1)/2}$, i.e. how do we give scores to each model (i.e. graph) given data? and how do we find the model with the highest score?

17 Learning graphical models in statistics there are two schools of thought: frequentist and Bayesian; frequentists assume the parameters of interest $\theta$ are deterministic but unknown, in particular there is no distribution over $\theta$; maximum likelihood (ML) estimation finds the $\theta$ that maximizes the score of a model parameter, with the score being the log likelihood $\log \mathbb{P}_\theta(x^{(1)}, \dots, x^{(n)})$; Bayesians assume there exists a distribution $\mu(\theta)$ over the parameter of interest; maximum a posteriori (MAP) estimation finds the $\theta$ maximizing the posterior distribution evaluated on the given data, $\mathbb{P}(\theta \mid x^{(1)}, \dots, x^{(n)})$; the frequentist's approach to graph learning is
$$ \arg\max_G \; \max_{\theta_G} \; \underbrace{\frac{1}{n} \sum_{i=1}^n \log \mu_{G, \theta_G}(x^{(i)})}_{L(G;\, x^{(1)}, \dots, x^{(n)})} $$

18 Learning graphical models likelihood scores for graphical models are closely related to mutual information and entropy; recall that
$$ I(X;Y) = \sum_{x,y} \mu(x,y) \log \frac{\mu(x,y)}{\mu(x)\mu(y)}, \qquad H(X) = -\sum_x \mu(x) \log \mu(x), \qquad H(Y \mid X) = -\sum_{x,y} \mu(x,y) \log \mu(y \mid x), $$
and
$$ H(Y) = I(X;Y) + H(Y \mid X) $$

19 Learning graphical models consider a directed acyclic graphical model (DAG) where
$$ \mu_{\theta_G}(x) = \prod_{i=1}^p \mu_{\theta^{(i)}}(x_i \mid x_{\pi(i)}), \qquad \mu(x_i \mid x_{\pi(i)}) = [\theta_G^{(i)}]_{x_i, x_{\pi(i)}} $$
where $\theta_G$ represents all the parameters required to describe the conditional distributions (e.g. the entries of the table describing each conditional distribution) and $\pi(i)$ denotes the parents of node $i$; in computing
$$ L(G; x^{(1)}, \dots, x^{(n)}) = \max_{\theta_G} \frac{1}{n} \sum_{i=1}^n \log \mu_{G, \theta_G}(x^{(i)}) $$
the ML estimate $\hat\theta_G$ for given data is the empirical distribution, i.e.
$$ [\hat\theta_G^{(i)}]_{x_i, x_{\pi(i)}} = \underbrace{\hat\mu(x_i \mid x_{\pi(i)})}_{\text{empirical distribution}} = \frac{\hat\mu(x_i, x_{\pi(i)})}{\hat\mu(x_{\pi(i)})} $$

20 Learning graphical models it follows that the log likelihood decomposes as
$$ L(G; x^{(1)}, \dots, x^{(n)}) = \sum_{i=1}^p L_i(G; x^{(1)}, \dots, x^{(n)}) $$
where
$$ L_i(G; x^{(1)}, \dots, x^{(n)}) = \frac{1}{n} \sum_{\ell \in [n]} \log \hat\mu(x_i^{(\ell)} \mid x_{\pi(i)}^{(\ell)}) = \sum_{x_i, x_{\pi(i)}} \hat\mu(x_i, x_{\pi(i)}) \log \hat\mu(x_i \mid x_{\pi(i)}) = -H(\hat X_i \mid \hat X_{\pi(i)}) = I(\hat X_i; \hat X_{\pi(i)}) - H(\hat X_i) $$
together, this gives
$$ L(G; x^{(1)}, \dots, x^{(n)}) = \sum_{i=1}^p I(\hat X_i; \hat X_{\pi(i)}) - \underbrace{\sum_{i=1}^p H(\hat X_i)}_{\text{independent of } G} $$
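
The decomposition above suggests the following sketch (not from the slides) for scoring a candidate DAG: plug the empirical conditionals into the likelihood, which up to the node entropies reduces to summing the empirical mutual informations $I(\hat X_i; \hat X_{\pi(i)})$. Binary variables and the example parent sets are assumptions.

```python
from collections import Counter
import numpy as np

def dag_log_likelihood(data, parents):
    """L(G) = (1/n) sum_l sum_i log muhat(x_i^(l) | x_pa(i)^(l)) for discrete data.

    data: (n, p) integer array; parents: dict node -> tuple of parent indices.
    """
    n, p = data.shape
    total = 0.0
    for i in range(p):
        pa = parents.get(i, ())
        joint = Counter(tuple(row[[i] + list(pa)]) for row in data)       # counts of (x_i, x_pa)
        marg = Counter(tuple(row[list(pa)]) for row in data)              # counts of x_pa
        for row in data:
            key = tuple(row[[i] + list(pa)])
            total += np.log(joint[key] / marg[tuple(row[list(pa)])])      # log muhat(x_i | x_pa)
    return total / n

data = np.random.default_rng(6).integers(0, 2, size=(300, 3))
# two candidate structures: empty graph vs a chain 0 -> 1 -> 2
print(dag_log_likelihood(data, parents={}))
print(dag_log_likelihood(data, parents={1: (0,), 2: (1,)}))
```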

21 Learning graphical models given a graph $G$, this gives the likelihood; all graphs are penalized by the same node-entropy term, and we only need to compare the sum of mutual information terms per graph. Chow-Liu 1968 addressed this for graphs restricted to trees, where each node has only one parent. In this case, the pairwise mutual information can be computed for all $\binom{p}{2}$ pairs, and finding the ML graph amounts to finding the maximum-weight spanning tree on this weighted complete graph. This can be done efficiently using, for example, Kruskal's algorithm: sort the edges according to the mutual information and add edges iteratively starting from the largest, adding an edge each time if it does not introduce a cycle. One needs to be careful with the direction in order to avoid a node having two parents.
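
A minimal Chow-Liu sketch, assuming binary data: estimate pairwise mutual information from empirical counts for all pairs and run Kruskal's algorithm over the edges sorted in decreasing MI order, using a simple union-find to reject edges that would create a cycle.

```python
from collections import Counter
import itertools
import numpy as np

def empirical_mi(data, i, j):
    """Empirical mutual information I(X_i; X_j) from samples (rows of data)."""
    n = len(data)
    pij = Counter(zip(data[:, i], data[:, j]))
    pi, pj = Counter(data[:, i]), Counter(data[:, j])
    return sum((c / n) * np.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(data):
    """Return the edges of the maximum-weight spanning tree under empirical MI (Kruskal)."""
    p = data.shape[1]
    weights = sorted(((empirical_mi(data, i, j), i, j)
                      for i, j in itertools.combinations(range(p), 2)), reverse=True)
    parent = list(range(p))                      # union-find to detect cycles
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                             # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

data = np.random.default_rng(7).integers(0, 2, size=(500, 4))
print(chow_liu_tree(data))
```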

22 Learning graphical models the ML score always favors more complex models, i.e. by adding more edges one can easily increase the likelihood (this follows from the fact that complex models include simpler models as special cases); without a penalty on complexity or a restriction to a class of models, ML will end up giving a complete graph; one way to address this issue is to introduce a prior, which we do not address in this lecture

23 Parameter estimation from partial observations Learning graphical models 11-23
