Joint Gaussian Graphical Model Review Series I


Joint Gaussian Graphical Model Review Series I: Probability Foundations. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. June 23rd, 2017.

Outline: 1. Notation 2. Probability 3. Dependence and Correlation 4. Conditional Dependence and Partial Correlation

Notation

Notation: P, the probability measure. Ω, the sample space. F, the set of events. X, Y, Z, random variables.

Probability

Probability Space. Let (Ω, F, P) be a probability space: Ω is an arbitrary non-empty set; F ⊆ 2^Ω is the set of events; P is the probability measure, i.e., a function from F to [0, 1].

Events. F contains Ω; F is closed under complements; F is closed under countable unions.

Probability Measure. A probability measure P satisfies P(Ω) = 1 and is countably additive: the probability of a countable union of disjoint events is the sum of their probabilities.

Random Variable. Let X : Ω → R be a random variable; X is a measurable function.

Probability Distribution. The probability distribution function F : R → [0, 1] is defined by F(x) = P[X < x] for x ∈ R. If X = Y, do they follow the same distribution? (Yes.) If F_X = F_Y, does X = Y? (Not necessarily: equal distributions do not imply equal random variables.)

Joint Probability. The probability distribution of the random vector (X, Y).

Marginal Probability. Given a pair of random variables (X, Y), the marginal probability is the probability distribution of X alone.

Conditional Distribution. Given the information of Y, the probability distribution of X. Notation: X | Y.

Relationship. P(X = x, Y = y) = P(Y = y) P(X = x | Y = y).

Dependence and Correlation

Independence. X and Y are independent if and only if p_{X,Y}(x, y) = p_X(x) p_Y(y), where p is the probability density function. Equivalently, the conditional distribution Y | X equals the distribution of Y. Example: two independent coin flips. Independence is closely tied to the absence of a causal relationship.

Correlation. Covariance: Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)], where µ_X and µ_Y are the means of X and Y. Correlation: ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y). Correlation measures the linear dependency between X and Y: ρ(X, Y) = 1 means X and Y lie in the same linear direction (as X increases, Y increases, with all points on the same line); ρ(X, Y) = −1 means they lie in the reverse linear direction; ρ(X, Y) = 0 means X and Y are orthogonal (uncorrelated).
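A minimal numerical sketch of these definitions (assuming NumPy is available; the data below are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)     # roughly linear in x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y)
rho = cov_xy / (x.std() * y.std())                 # rho(X, Y) = Cov / (sigma_X * sigma_Y)
print(rho, np.corrcoef(x, y)[0, 1])                # the two values agree
```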

Dependence and Correlation. Correlation is a value that is easy to estimate, while independence is a relationship that has to be inferred. Dependence is a stronger relationship than correlation: if X and Y are independent, then ρ(X, Y) = 0, but the reverse does not hold. For example, let X be symmetrically distributed about zero and Y = X²: they are uncorrelated yet dependent.
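A quick numerical check of this counterexample (a sketch assuming NumPy): X is symmetric about zero and Y = X² is fully determined by X, yet their sample correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # symmetric about zero
y = x ** 2                       # a deterministic function of x, hence dependent on x

print(np.corrcoef(x, y)[0, 1])   # approximately 0: uncorrelated but not independent
```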

Gaussian Example. The density of the bivariate Gaussian is: f(x, y) = 1 / (2π σ_X σ_Y √(1 − ρ²)) · exp( −1/(2(1 − ρ²)) [ (x − µ_X)²/σ_X² + (y − µ_Y)²/σ_Y² − 2ρ(x − µ_X)(y − µ_Y)/(σ_X σ_Y) ] ).   (3.1)

Gaussian Example. Suppose X and Y are uncorrelated, i.e., (X, Y) ~ N(µ, diag(σ_X², σ_Y²)). Then f(x, y) = 1/(2π σ_X σ_Y) exp( −(1/2) [ (x − µ_X)²/σ_X² + (y − µ_Y)²/σ_Y² ] ) = [1/(√(2π) σ_X) exp( −(x − µ_X)²/(2σ_X²) )] · [1/(√(2π) σ_Y) exp( −(y − µ_Y)²/(2σ_Y²) )] = f(x) f(y).   (3.2) Therefore, if (X, Y) follows a bivariate Gaussian, X and Y are uncorrelated if and only if they are independent.

Summary. Correlation is a value that is easy to estimate, while independence is a relationship that has to be inferred. In the Gaussian case, the two are equivalent. From the structure-learning angle, dependence is about the causal relationship, while correlation is, more specifically, the linear relationship.

Conditional Dependence and Partial Correlation

Conditional Dependence. Consider a more complicated case with a third random variable Z. There are two ways to view conditional independence: X and Y are independent conditional on Z; equivalently, X | Z and Y | Z are independent. Formally, X and Y are conditionally independent given Z if and only if p_{X,Y|Z}(x, y) = p_{X|Z}(x) p_{Y|Z}(y), where p is the probability density function.

Partial Correlation. Formally, the partial correlation between X and Y given the random variable Z, written ρ_{XY·Z}, is the correlation between the residuals R_X and R_Y resulting from the linear regression of X on Z and of Y on Z, respectively.

Partial Correlation Calculation. Suppose P = Σ^(-1) (Σ is the covariance matrix or correlation matrix). Then ρ_{X_i X_j · V\{X_i, X_j}} = −p_ij / √(p_ii p_jj). The value is determined exactly by the precision matrix (the inverse of the covariance matrix)!
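A small sketch of this calculation (assuming NumPy; the data are synthetic): invert the covariance matrix and read the partial correlation off the precision matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))        # 500 samples, 4 variables
S = np.cov(X, rowvar=False)          # sample covariance matrix (Sigma)
P = np.linalg.inv(S)                 # precision matrix

def partial_corr(P, i, j):
    # partial correlation between variables i and j given all the others
    return -P[i, j] / np.sqrt(P[i, i] * P[j, j])

print(partial_corr(P, 0, 1))
```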

Conditional Dependence and Partial Correlation. Similarly, in the Gaussian case they are equivalent. A detailed derivation is in the next talk.

Gaussian Case. Partial correlation is a value that is easy to estimate, while conditional independence is a relationship that has to be inferred. Conditional dependence is a stronger relationship than partial correlation: if X | Z and Y | Z are independent, then ρ(X, Y | Z) = 0, but the reverse does not hold.

Summary. Partial correlation is a value that is easy to estimate, while conditional independence is a relationship that has to be inferred. In the Gaussian case, the two are equivalent. From the structure-learning angle, conditional dependence is about the causal relationship, while partial correlation is, more specifically, the linear relationship.

Joint Gaussian Graphical Model Review Series II: Gaussian Graphical Model Basics. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. June 30th, 2017.

Outline: 1. Notation 2. Reviews 3. Why are partial correlation and conditional dependence equivalent in the Gaussian case? 4. Maximum Likelihood Method 5. Regression Method

Notation

Notation: Σ, the covariance matrix. Ω, the precision matrix. µ, the mean vector. x_i, the i-th sample, drawn from a multivariate normal distribution.

Reviews

Reviews: probability basics; dependency vs. correlation; conditional dependency vs. partial correlation.

Summary from last talk. Partial correlation is a value that is easy to estimate, while conditional independence is a relationship that has to be inferred. In the Gaussian case, the two are equivalent. From the structure-learning angle, conditional dependence is about the causal relationship, while partial correlation is, more specifically, the linear relationship. So the remaining questions are why they are equivalent in the Gaussian case and how to infer this relationship.

Review: Gaussian Example. Suppose X and Y are uncorrelated, i.e., (X, Y) ~ N(µ, diag(σ_X², σ_Y²)). Then f(x, y) = 1/(2π σ_X σ_Y) exp( −(1/2) [ (x − µ_X)²/σ_X² + (y − µ_Y)²/σ_Y² ] ) = [1/(√(2π) σ_X) exp( −(x − µ_X)²/(2σ_X²) )] · [1/(√(2π) σ_Y) exp( −(y − µ_Y)²/(2σ_Y²) )] = f(x) f(y).   (2.1) Therefore, if (X, Y) follows a bivariate Gaussian, X and Y are uncorrelated if and only if they are independent.

Why are partial correlation and conditional dependence equivalent in the Gaussian case?

Multivariate Gaussian Distribution. Density function: let X ~ N(µ, Σ). Then f(x) = (2π)^(−p/2) det(Σ)^(−1/2) exp( −(1/2)(x − µ)^T Σ^(-1) (x − µ) ).
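A short sketch checking this density formula against SciPy (assuming NumPy and SciPy; the numbers are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

# closed-form density: (2*pi)^(-p/2) * det(Sigma)^(-1/2) * exp(-0.5 * (x-mu)' Sigma^-1 (x-mu))
p = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
f = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

print(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values match
```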

Partition X, µ, Σ, and Ω: µ = [µ_1; µ_2], Σ = [Σ_11, Σ_12; Σ_21, Σ_22], Ω = Σ^(-1) = [Ω_11, Ω_12; Ω_21, Ω_22], X = [X_1; X_2].

Conditional Distribution of the Multivariate Gaussian. If X ~ N(µ, Σ), it holds that X_2 ~ N(µ_2, Σ_22). If Σ_22 is regular (invertible), it further holds that X_1 | (X_2 = a) ~ N(µ_{1|2}, Σ_{1|2}), where µ_{1|2} = µ_1 + Σ_12 Σ_22^(-1) (a − µ_2) and Σ_{1|2} = Σ_11 − Σ_12 Σ_22^(-1) Σ_21 = (Ω_11)^(-1).
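A minimal sketch of these block formulas (assuming NumPy; the indices and numbers are illustrative): compute µ_{1|2} and Σ_{1|2}, and check that Σ_{1|2} equals the inverse of the corresponding block of the precision matrix.

```python
import numpy as np

mu = np.array([0.0, 0.0, 0.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])

idx1, idx2 = [0], [1, 2]                  # X1 = first variable, X2 = the remaining two
a = np.array([1.0, -0.5])                 # observed value of X2

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

mu_cond = mu[idx1] + S12 @ np.linalg.solve(S22, a - mu[idx2])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)

Omega = np.linalg.inv(Sigma)              # precision matrix
print(mu_cond, Sigma_cond, 1.0 / Omega[0, 0])   # Sigma_cond equals (Omega_11)^-1
```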

Partial correlation and conditional dependence are equivalent in the Gaussian case. Since X_1 | X_2 = a ~ N(µ_{1|2}, (Ω_11)^(-1)): if X_1 contains only x_i and x_j, then x_i and x_j are conditionally independent given all other variables if and only if Ω_ij = 0.

Estimate the conditional dependence graph / partial correlations. The only thing left is to estimate Ω = Σ^(-1); we call this problem the Gaussian Graphical Model. There are three potential approaches: (1) directly invert the sample covariance matrix Σ̂ — not possible when the sample covariance matrix is not invertible (e.g., when p > n); (2) the maximum likelihood method; (3) the regression method.

Maximum Likelihood Method

The MLE of µ. L(µ, Ω) = (2π)^(−np/2) ∏_{i=1}^n det(Ω^(-1))^(−1/2) exp( −(1/2)(x_i − µ)^T Ω (x_i − µ) ). Taking the first derivative with respect to µ and setting it to zero, it is easy to show that the MLE is the sample mean x̄ = (x_1 + ··· + x_n)/n.

The Likelihood of Ω. L(x̄, Ω) = (2π)^(−np/2) ∏_{i=1}^n det(Ω^(-1))^(−1/2) exp( −(1/2)(x_i − x̄)^T Ω (x_i − x̄) ). Notice that (x_i − x̄)^T Ω (x_i − x̄) is a scalar, so (x_i − x̄)^T Ω (x_i − x̄) = trace((x_i − x̄)^T Ω (x_i − x̄)).

The Likelihood of Ω. Since tr(AB) = tr(BA):
L(x̄, Ω) ∝ det(Ω^(-1))^(−n/2) exp( −(1/2) Σ_{i=1}^n tr( (x_i − x̄)^T Ω (x_i − x̄) ) )   (4.1)
= det(Ω^(-1))^(−n/2) exp( −(1/2) Σ_{i=1}^n tr( (x_i − x̄)(x_i − x̄)^T Ω ) )   (4.2)
= det(Ω^(-1))^(−n/2) exp( −(1/2) tr( (Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T) Ω ) )   (4.3)
= det(Ω^(-1))^(−n/2) exp( −(1/2) tr(SΩ) ),   (4.4)
where S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T ∈ R^{p×p}.

The Log-Likelihood of Ω. ln L(x̄, Ω) = const − (n/2) ln det(Ω^(-1)) − (1/2) tr( Ω Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T ). Since det(A^(-1)) = 1/det(A),
ln L(x̄, Ω) ∝ ln det(Ω) − tr( Ω · (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T )   (4.5)
= ln det(Ω) − tr(Ω Ŝ),   (4.6)
where Ŝ is the sample covariance matrix.
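A small sketch evaluating this profile log-likelihood (assuming NumPy; constants are dropped as on the slide):

```python
import numpy as np

def precision_log_likelihood(Omega, S_hat):
    # ln det(Omega) - tr(Omega @ S_hat), up to constants
    sign, logdet = np.linalg.slogdet(Omega)
    return logdet - np.trace(Omega @ S_hat)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
S_hat = np.cov(X, rowvar=False)                     # sample covariance
print(precision_log_likelihood(np.linalg.inv(S_hat), S_hat))
```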

Regression Method

Partial Correlation. As we know, partial correlations can also be obtained by linear regression. In the Gaussian case, we can use the so-called neighborhood approach.

Conditional Distribution of the Multivariate Gaussian. If X ~ N(µ, Σ), it holds that X_2 ~ N(µ_2, Σ_22). If Σ_22 is regular (invertible), X_1 | X_2 = a ~ N(µ_{1|2}, Σ_{1|2}), where µ_{1|2} = µ_1 + Σ_12 Σ_22^(-1) (a − µ_2) and Σ_{1|2} = Σ_11 − Σ_12 Σ_22^(-1) Σ_21 = (Ω_11)^(-1).

Neighborhood approach. If X ~ N(0, Σ), take X_1 = X_j. Then X_j | X_{\j} ~ N( α_j^T X_{\j}, Σ_jj − Σ_{j,\j} Σ_{\j,\j}^(-1) Σ_{\j,j} ), with α_j := Σ_{\j,\j}^(-1) Σ_{\j,j} and σ_j² := Σ_jj − Σ_{j,\j} Σ_{\j,\j}^(-1) Σ_{\j,j}. We have X_j = α_j^T X_{\j} + ε_j,   (5.1) where ε_j ~ N(0, σ_j²) is independent of X_{\j}.

Neighborhood approach. We can estimate the α_j by solving p simple linear regressions. If the i-th entry of α_j equals 0, then X_i and X_j are partially uncorrelated and conditionally independent. Often we want additional assumptions on α_j, such as sparsity.

Summary. In the Gaussian case, partial correlation and conditional dependence are equivalent. We have two ways to estimate them: first, directly estimate the precision matrix by MLE; second, solve p linear regression problems via the neighborhood approach. Neither imposes any assumption on the partial correlation coefficients. In the next talk, we introduce the solutions of these two estimators.

Joint Gaussian Graphical Model Review Series III: Markov Random Field and Log-Linear Model. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. July 7th, 2017.

Outline: 1. Why do we need Graphical Models? 2. Graphical Model 3. Markov Random Field

Road Map

Review: Gaussian Case. In the Gaussian case, we know that conditional dependence and partial correlation are equivalent. This pairwise relationship can be naturally represented by a graph G = (V, E): the precision matrix Ω ≻ 0 acts as a natural adjacency matrix. We call this pairwise conditional dependence relationship among variables an undirected graphical model.

Why do we need Graphical Models?

A Toy Example

A Toy Example. Suppose X = (X_1, X_2, X_3, X_4, X_5, X_6), where each variable takes value 0 or 1. To estimate the joint probability p(X) directly, you need to estimate 2^6 values. However, if we know the conditional independence graph and p(X) = p(X_1, X_2, X_3) p(X_4, X_5, X_6), you only need to estimate 2 · 2^3 = 2^4 values.

Proof of the decomposition. First, let us prove that if X_1 ⊥ X_3 | X_2, then p(X_1 | X_3, X_2) = p(X_1 | X_2). Indeed, p(X_1 | X_2) p(X_3 | X_2) = p(X_1, X_3 | X_2) = p(X_1 | X_3, X_2) p(X_3 | X_2); cancelling p(X_3 | X_2) on both sides gives the conclusion. It is easy to obtain the analogous result under the local Markov property: p(X_v | X_{V\{v}}) = p(X_v | X_{N(v)}).

Proof of the decomposition. By the chain rule, p(X_1, X_2, X_3, X_4, X_5, X_6) = p(X_1 | X_2, ..., X_6) p(X_2 | X_3, ..., X_6) p(X_3 | X_4, X_5, X_6) p(X_4, X_5, X_6). By the conclusion on the previous slide, this equals p(X_1 | X_2, X_3) p(X_2 | X_3) p(X_3) p(X_4, X_5, X_6)   (1.1) = p(X_1, X_2, X_3) p(X_4, X_5, X_6).   (1.2)

Graphical Model

Graphical Model. Probability inference: estimate joint, marginal, and conditional probabilities. Structure learning: given a dataset X, learn the graph structure from X (i.e., learn the edge patterns between variables).

A Toy Example

Probability Inference: calculating a marginal probability. We know that p(X) = p(X_1, X_2, X_3) p(X_4, X_5, X_6). Traditionally, p(X_1, X_2 = a) = Σ_{X_3, X_4, X_5, X_6} p(X_1, X_2 = a, X_3, X_4, X_5, X_6): 16 operations. Using the graph, p(X_1, X_2 = a) = Σ_{X_3} p(X_1, X_2 = a, X_3) · Σ_{X_4, X_5, X_6} p(X_4, X_5, X_6): 10 operations.

Markov Random Field

Markov Random Field. Given an undirected graph G = (V, E), a set of random variables X = (X_v)_{v∈V} indexed by V forms a Markov random field with respect to G if they satisfy the local Markov property: a variable is conditionally independent of all other variables given its neighbors, X_v ⊥ X_{V\({v}∪N(v))} | X_{N(v)}. This property is stronger than the pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables, X_u ⊥ X_v | X_{V\{u,v}} whenever {u, v} ∉ E.

A Toy Example

Clique factorization. If the joint density can be factorized over the cliques of G, p(X = x) = ∏_{C ∈ cl(G)} φ_C(x_C), then X forms a Markov random field with respect to G. Here cl(G) is the set of cliques of G.

A Toy Example

Log-linear Model. Any Markov random field can be written as a log-linear model with feature functions f_k, such that the full joint distribution is P(X = x) = (1/Z) exp( Σ_k w_k f_k(x) ). Notice that the reverse does not hold.

Example I: Pairwise Model. P(X = x) = (1/Z(Θ)) exp( Σ_{s∈V} θ_s x_s² + Σ_{(s,t)∈E} θ_st x_s x_t ). Examples: the Gaussian graphical model and the Ising model. These two models have good estimators for inferring the MRF. In general, estimating Θ is difficult, since it involves computing Z(Θ) or its derivatives.

Example I: Pairwise Model, Gaussian Case. f(x_1, ..., x_k) = exp( −(1/2)(x − µ)^T Σ^(-1) (x − µ) ) / √( (2π)^k det(Σ) ). Solution: ln L(x̄, Ω) ∝ ln det(Ω) − tr( Ω · (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T )   (3.1) = ln det(Ω) − tr(Ω Ŝ),   (3.2) where Ŝ is the sample covariance matrix. For the Ising model, a generalized covariance matrix is used to avoid the normalization term.

Example II: Non-pairwise Model, the Nonparanormal Graphical Model. Are there any non-pairwise models that are easy to estimate? Nonparanormal graphical model: P(X = x) = (1/Z) exp( −(1/2)(f(x) − µ)^T Σ^(-1) (f(x) − µ) ), where f(X) = (f_1(X_1), f_2(X_2), ..., f_p(X_p)), each f_i is a univariate monotone function, and f(X) ~ N(µ, Σ).

Summary. The formal definition of the Markov random field (undirected graphical model); the general formulations: clique factorization and the log-linear model; two examples: the pairwise model and the nonparanormal graphical model. In the next talk, we introduce the solutions of these two estimators for sGGM.

Joint Gaussian Graphical Model Review Series IV: A Unified Framework for M-estimators and Elementary Estimators. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. July 21st, 2017.

Road Map

Outline: 1. Notation 2. Review 3. Regularized M-estimator 4. A unified framework 5. Elementary Estimator

Notation

Notation: L, the loss function. R, the regularization function (a norm). R*, the dual norm of R.

Review

Review from last talk: the likelihood of the precision matrix in the Gaussian case; graphical model basics.

Regularized M-estimator

Regularized M-estimator. In statistics, M-estimators are a broad class of estimators obtained as the minima of sums of functions of the data; the parameters are estimated as the argmin of such sums. Target: L(X, θ), the loss function. Conditions: R(θ), the regularization function. Therefore, the whole objective function is argmin_θ L(X, θ) + λ_n R(θ).   (3.1)

Example: Linear Model. Let us use the linear regression model as an example. Target: find β such that Xβ ≈ y. Constraints: sparsity. Prediction accuracy: sacrifice a little bias to reduce the variance and improve the overall performance. Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. argmin_β ||y − Xβ||²   (3.2), subject to ||β||_0 ≤ t.   (3.3)

Example: Lasso. Since the ℓ_0-norm is not a convex function, we replace it with its closest convex surrogate, the ℓ_1-norm: argmin_β ||y − Xβ||²   (3.4), subject to ||β||_1 ≤ t.   (3.5) Lasso: argmin_β ||y − Xβ||² + λ_n ||β||_1.
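A usage sketch of the lasso with scikit-learn (sklearn.linear_model.Lasso; the synthetic data and penalty value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)                # alpha plays the role of lambda_n
print(np.nonzero(model.coef_)[0])                 # recovered support
```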

Other equivalent formulations. argmin_β ||β||_1   (3.6), subject to y = Xβ.   (3.7) Dantzig selector: argmin_β ||β||_1   (3.8), subject to ||X^T (Xβ − y)||_∞ ≤ λ_n.   (3.9)

A unified framework

Three major criteria. Statistical convergence rate: how close the estimated parameter is to the true parameter; it corresponds to estimation error and approximation error. Computational complexity: how fast the algorithm is with respect to certain parameters, e.g., n and p. Optimization rate of convergence: how fast each optimization step moves toward the estimated parameter (e.g., linear or quadratic). Traditional statisticians focus on the statistical convergence rate (accuracy).

High dimension vs. low dimension. Low dimension: when n is large, the error is asymptotically 0 by the law of large numbers. High dimension (i.e., p/n → c > 0): the error is not asymptotically 0. High-dimensional analysis is relatively hard; traditionally, we need a careful proof for every estimator.

A unified framework for M-estimators [Negahban et al. (2009)]. Decomposability of R: given a subspace M ⊆ R^p, a norm-based regularizer R is decomposable with respect to (M, M^⊥) if R(θ + γ) = R(θ) + R(γ) for all θ ∈ M and γ ∈ M^⊥, where M^⊥ := {v ∈ R^p : ⟨u, v⟩ = 0 for all u ∈ M}. Subspace compatibility constant: Φ(M) := sup_{u ∈ M\{0}} R(u)/||u|| with respect to the pair (R, ||·||).

A unified framework for M-estimators [Negahban et al. (2009)]. Example: the ℓ_1 norm is decomposable, and Φ(M) = √s with respect to the pair (ℓ_1, ℓ_2), where s is the size of the support.

Example: Lasso. ||θ̂_{λ_n} − θ*||_2² ≤ O(s log p / n). In the high-dimensional setting, the sparsity assumption actually improves the convergence rate a lot.

Elementary Estimator

We now have a very powerful tool to easily prove convergence rates, and a similar process proves the convergence rate for estimators like the Dantzig selector. However, many statistical methods are slow when p and n are large and are not scalable at all. Are there estimators with a closed-form solution that also achieve the optimal convergence rate?

Elementary Estimator [Yang et al. (2014b)]. Here B*(φ̂) is a backward mapping for φ. argmin_θ R(θ)   (5.1), subject to R*(θ − B*(φ̂)) ≤ λ_n.   (5.2)

Example: sparse linear regression [Yang et al. (2014a)]. argmin_θ ||θ||_1   (5.3), subject to ||θ − (X^T X + εI)^(-1) X^T y||_∞ ≤ λ_n.   (5.4)
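A minimal sketch of this estimator (assuming NumPy; since the constraint in (5.3)–(5.4) is an ℓ∞ ball around the ridge-like backward mapping, the ℓ_1-minimal feasible point is obtained by elementwise soft-thresholding):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def elementary_sparse_regression(X, y, eps=0.1, lam=0.05):
    # backward mapping: a ridge-regularized least-squares estimate
    p = X.shape[1]
    theta_init = np.linalg.solve(X.T @ X + eps * np.eye(p), X.T @ y)
    # closed form: soft-threshold the backward mapping at lambda_n
    return soft_threshold(theta_init, lam)
```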

Summary. We reviewed the unified framework for M-estimators, which can be applied to most regularized M-estimation problems. Following a similar proof strategy, we obtain the family of elementary estimators.

References I
S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, 2009.
E. Yang, A. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014a.
E. Yang, A. C. Lozano, and P. K. Ravikumar. Elementary estimators for graphical models. In Advances in Neural Information Processing Systems, 2014b.

Joint Gaussian Graphical Model Series V: Sparse Gaussian Graphical Model Estimators. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. July 28th, 2017.

Road Map

Outline: 1. Notation 2. Review 3. Neighborhood Method 4. Graphical Lasso 5. CLIME 6. Elementary Estimator for the Gaussian Graphical Model

Notation

Notation: Σ, the covariance matrix. Ω, the precision matrix. p, the number of features. n, the number of samples.

Review

Review from last talk. Regularized M-estimator: argmin_θ L(θ) + λ_n R(θ); a unified framework to analyze the statistical convergence rate for high-dimensional statistics; elementary estimators.

Review of the Gaussian Graphical Model. Suppose the precision matrix is Ω = Σ^(-1). The log-likelihood of Ω is proportional to ln det(Ω) − tr(Ω Ŝ). In this talk, we use this likelihood to derive several estimators of the sparse Gaussian Graphical Model (sGGM).

Neighborhood Method

Neighborhood approach. If X ~ N(0, Σ), take X_1 = X_j. Then X_j | X_{·,\j} ~ N( α_j^T X_{·,\j}, Σ_jj − Σ_{j,\j} Σ_{\j,\j}^(-1) Σ_{\j,j} ), with α_j := Σ_{\j,\j}^(-1) Σ_{\j,j} and σ_j² := Σ_jj − Σ_{j,\j} Σ_{\j,\j}^(-1) Σ_{\j,j}. We have X_j = α_j^T X_{·,\j} + ε_j,   (3.1) where ε_j ~ N(0, σ_j²) is independent of X_{·,\j}.

Neighborhood approach with the sparsity assumption. Under the sparsity assumption, we estimate each α_j by a lasso estimator: α̂_j = argmin_{α_j} ||X_{·,\j} α_j − X_{·,j}||² + λ ||α_j||_1.   (3.2)
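A sketch of this node-wise lasso (assuming NumPy and scikit-learn; the function name and penalty value are illustrative): one lasso regression per variable, regressing X_j on all the other columns.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam=0.1):
    """Regress each variable on all the others with an l1 penalty.
    Returns a p x p coefficient matrix with a zero diagonal."""
    n, p = X.shape
    coefs = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam).fit(X[:, others], X[:, j])
        coefs[j, others] = fit.coef_
    # a nonzero entry (j, i) suggests an edge between variables j and i
    return coefs
```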

Review of the lasso solution: subgradient method. β̂ = argmin_β ||y − Xβ||² + λ||β||_1.   (3.3) The subgradient is g(β; λ) = −2X^T (y − Xβ) + λ sgn(β).   (3.4)

Review of the lasso solution: state of the art. The proximity operator is important because x* is a minimizer of min_{x∈H} F(x) + R(x) if and only if x* = prox_{γR}(x* − γ∇F(x*)), where γ is any positive real number. Proximal gradient method: for the ℓ_1 penalty, the prox is soft-thresholding, (prox_{γR}(x))_i = x_i − γ if x_i > γ; 0 if |x_i| ≤ γ; x_i + γ if x_i < −γ.   (3.5) By using the fixed-point iteration, you can obtain the estimate of β.
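A minimal sketch of this fixed-point (proximal gradient / ISTA) iteration for the lasso (assuming NumPy; the step size uses the standard Lipschitz bound):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam=0.1, n_iter=500):
    """Proximal gradient iteration for  ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    gamma = 1.0 / L
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - gamma * grad, gamma * lam)   # prox step
    return beta
```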

Graphical Lasso

Graphical Lasso [Friedman et al. (2008)]. We already have the log-likelihood as the loss function. Can we use it to obtain an estimator similar to the lasso? argmin_Ω −ln det(Ω) + tr(Ω Ŝ) + λ_n ||Ω||_1.   (4.1)
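In practice, scikit-learn ships a solver for this objective (sklearn.covariance.GraphicalLasso); a usage sketch with synthetic data and an illustrative penalty value:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

model = GraphicalLasso(alpha=0.1).fit(X)    # alpha is the l1 penalty (lambda_n)
Omega_hat = model.precision_                # estimated sparse precision matrix
print(np.count_nonzero(Omega_hat))
```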

Proximal gradient method to solve it. Let us do a practice on the whiteboard. This is a superlinear algorithm: lim_{k→∞} ||x_{k+1} − x*|| / ||x_k − x*|| = 0.

State-of-the-art method: BIG & QUIC [Hsieh et al. (2011)]. Parallelized coordinate descent; an approximate quadratic (second-order) algorithm: lim_{k→∞} ||x_{k+1} − x*|| / ||x_k − x*||² < M.

CLIME

CLIME [Cai et al. (2011)]. argmin_Ω ||Ω||_1, subject to ||Σ̂ Ω − I||_∞ ≤ λ.   (5.1) Here λ > 0 is the tuning parameter.

By taking the first derivative of Eq. (4.1) and setting it equal to zero, the solution Ω̂_glasso also satisfies Ω̂_glasso^(-1) − Σ̂ = λ Ẑ,   (5.2) where Ẑ is an element of the subdifferential ∂||Ω̂_glasso||_1.

Column-wise estimator. argmin_β ||β||_1, subject to ||Σ̂ β − e_j||_∞ ≤ λ. CLIME can be estimated column by column.
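A hedged sketch of one CLIME column as a linear program (assuming NumPy and SciPy; the variable split β = u − v with u, v ≥ 0 turns the ℓ_1 objective and ℓ∞ constraint into linear ones):

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(Sigma_hat, j, lam):
    """min ||beta||_1  s.t.  ||Sigma_hat @ beta - e_j||_inf <= lam."""
    p = Sigma_hat.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    c = np.ones(2 * p)                                    # objective: sum(u) + sum(v)
    A = np.vstack([np.hstack([ Sigma_hat, -Sigma_hat]),   #  Sigma_hat beta - e_j <= lam
                   np.hstack([-Sigma_hat,  Sigma_hat])])  # -(Sigma_hat beta - e_j) <= lam
    b = np.concatenate([lam + e_j, lam - e_j])
    res = linprog(c, A_ub=A, b_ub=b)                      # default bounds give u, v >= 0
    u, v = res.x[:p], res.x[p:]
    return u - v
```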

Elementary Estimator for the Gaussian Graphical Model

Elementary Estimator [Yang et al. (2014b)]. Here B*(φ̂) is a backward mapping for φ. argmin_θ R(θ)   (6.1), subject to R*(θ − B*(φ̂)) ≤ λ_n.   (6.2)

Example: sparse linear regression [Yang et al. (2014a)]. argmin_θ ||θ||_1   (6.3), subject to ||θ − (X^T X + εI)^(-1) X^T y||_∞ ≤ λ_n.   (6.4)

Elementary Estimator for sGGM. argmin_Ω ||Ω||_{1,off}   (6.5), subject to ||Ω − [T_v(Σ̂)]^(-1)||_{∞,off} ≤ λ_n.
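A minimal sketch of this estimator (assuming NumPy; here T_v is taken to be elementwise soft-thresholding of the off-diagonal entries of the sample covariance, and the constrained problem is solved by soft-thresholding the off-diagonal of the backward mapping, following the elementary-estimator recipe above):

```python
import numpy as np

def soft_threshold_off_diag(A, t):
    """Soft-threshold the off-diagonal entries of A; keep the diagonal."""
    out = np.sign(A) * np.maximum(np.abs(A) - t, 0.0)
    np.fill_diagonal(out, np.diag(A))
    return out

def elementary_sggm(Sigma_hat, v=0.1, lam=0.05):
    # backward mapping: invert a thresholded (better-conditioned) covariance
    B = np.linalg.inv(soft_threshold_off_diag(Sigma_hat, v))
    # closed form of (6.5): soft-threshold its off-diagonal entries at lambda_n
    return soft_threshold_off_diag(B, lam)
```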

Summary. We reviewed the main sGGM estimators.

References I
T. Cai, W. Liu, and X. Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 2011.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 2008.
C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. D. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, 2011.

References II
E. Yang, A. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014a.
E. Yang, A. C. Lozano, and P. K. Ravikumar. Elementary estimators for graphical models. In Advances in Neural Information Processing Systems, 2014b.

Joint Gaussian Graphical Model Series VI: Multi-task sGGMs and Their Optimization Challenges. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. August 4th, 2017.

Road Map

Outline: 1. Notation 2. Review 3. Multi-task Learning 4. Multi-task sGGMs 5. Optimization Challenges of Multi-task sGGMs 6. Joint Graphical Lasso Example

Notation

Notation: X^(i), the i-th data matrix. Σ^(i), the i-th covariance matrix. Ω^(i), the i-th precision matrix. p, the number of features. n_i, the number of samples in the i-th data matrix. K, the number of tasks.

Review

Review from last talk. We introduced four estimators of the sparse Gaussian Graphical Model, which completes most of the material on the single-task sparse Gaussian Graphical Model covered in the last five talks.

Review of the Gaussian Graphical Model. Suppose the precision matrix is Ω = Σ^(-1). The log-likelihood of Ω is proportional to ln det(Ω) − tr(Ω Ŝ).

Multi-task Learning

Multi-task Learning. Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, compared to training the models separately.

Multi-task Learning: Linear Classifier Example. Linear classifier: f(x) = sgn(w^T x + b).   (3.1) Multi-task linear classifiers: for the i-th task, f_i(x) = sgn((w_S + w_i)^T x + b).   (3.2)

Multi-task sGGMs

Multi-task sGGMs: Problem. Input: {X^(i)}. Output: {Ω^(i)}. Assumption I: sparsity. Assumption II: commonalities and differences across tasks.

Multi-task sGGMs. Likelihood: Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) ).   (4.1) Likelihood with the sparsity assumption: argmax_{Ω^(i)} Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) )   (4.2), subject to ||Ω^(i)||_1 ≤ t.   (4.3)

Multi-task sGGMs. Likelihood with the multi-task setting: argmax_{Ω^(i)} Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) )   (4.4), subject to ||Ω^(i)||_1 ≤ t   (4.5) and P(Ω^(1), Ω^(2), ..., Ω^(K)) ≤ t_2.   (4.6) Joint Graphical Lasso [Danaher et al. (2013)]: argmin_{Ω^(i)} Σ_i n_i ( −ln det(Ω^(i)) + tr(Ω^(i) Ŝ^(i)) ) + λ_1 Σ_i ||Ω^(i)||_1 + λ_2 P(Ω^(1), Ω^(2), ..., Ω^(K)).   (4.7)

Optimization Challenges of Multi-task sGGMs

General formulation. Likelihood with the multi-task setting: argmin_{Ω^(i)} Σ_i n_i ( −ln det(Ω^(i)) + tr(Ω^(i) Ŝ^(i)) )   (5.1), subject to ||Ω^(i)||_1 ≤ t   (5.2) and P(Ω^(1), Ω^(2), ..., Ω^(K)) ≤ t_2.   (5.3) General formulation: min_{x,z} f(x) + g(z)   (5.4), subject to Ax + Bz = c.   (5.5)
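A generic ADMM skeleton for this general formulation (a sketch only: the subproblem solvers prox_f and prox_g and their signatures are assumptions, since they are problem-specific; for JGL, prox_f would correspond to the per-task Θ-update via the SVD step and prox_g to the penalized Z-update described in the JGL slides below):

```python
import numpy as np

def admm(prox_f, prox_g, A, B, c, n_iter=100):
    """Scaled-form ADMM for  min f(x) + g(z)  s.t.  Ax + Bz = c.
    prox_f(z, u) / prox_g(x, u) solve the x- and z-subproblems (assumed given)."""
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    u = np.zeros(len(c))               # scaled dual variable
    for _ in range(n_iter):
        x = prox_f(z, u)               # x-update: minimize over x with z, u fixed
        z = prox_g(x, u)               # z-update: minimize over z with x, u fixed
        u = u + A @ x + B @ z - c      # dual ascent on the residual
    return x, z
```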

Optimization Challenge

Solution: distributed optimization (ADMM).

Optimization Challenges. For K > 2 tasks, you need to carefully derive the whole optimization solution. Each step in each iteration is itself a sub-optimization problem, which is sometimes already difficult to solve. This method achieves at most linear convergence.

Joint Graphical Lasso Example

JGL: group-lasso example

JGL solution: updating Θ^(i). Setting the gradient to 0, we obtain the SVD part of the solution.

JGL solution: updating Z^(i).

An example of the difficulty of ADMM

Summary. We introduced the multi-task sGGM problem, the optimization challenges of this problem, and the ADMM method and its drawbacks.

References I
P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.

Joint Gaussian Graphical Model Series VII: Multi-task sGGM Estimators. Beilun Wang, Advisor: Yanjun Qi. Department of Computer Science, University of Virginia. August 18th, 2017.

Road Map

Outline: 1. Notation 2. Review 3. Multi-task Learning 4. Multi-task sGGMs 5. Multi-task sGGM estimators: Joint Graphical Lasso; directly learning the commonalities and differences among tasks; directly learning the differences between case and control.

Notation

Notation: X^(i), the i-th data matrix. Σ^(i), the i-th covariance matrix. Ω^(i), the i-th precision matrix. p, the number of features. n_i, the number of samples in the i-th data matrix. K, the number of tasks.

Review

Review from last talk. We introduced multi-task learning of sparse Gaussian Graphical Models (sGGMs), the optimization challenges in multi-task sGGMs, and the ADMM method together with the solution of the Joint Graphical Lasso.

Review of the Gaussian Graphical Model. Suppose the precision matrix is Ω = Σ^(-1). The log-likelihood of Ω is proportional to ln det(Ω) − tr(Ω Ŝ).

Multi-task Learning

Multi-task Learning. Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, compared to training the models separately.

Multi-task Learning: Linear Classifier Example. Linear classifier: f(x) = sgn(w^T x + b).   (3.1) Multi-task linear classifiers: for the i-th task, f_i(x) = sgn((w_S + w_i)^T x + b).   (3.2)

Multi-task sGGMs

Multi-task sGGMs: Problem. Input: {X^(i)}. Output: {Ω^(i)}. Assumption I: sparsity. Assumption II: commonalities and differences across tasks.

Multi-task sGGMs. Likelihood: Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) ).   (4.1) Likelihood with the sparsity assumption: argmax_{Ω^(i)} Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) )   (4.2), subject to ||Ω^(i)||_1 ≤ t.   (4.3)

Multi-task sGGMs. Likelihood with the multi-task setting: argmax_{Ω^(i)} Σ_i n_i ( ln det(Ω^(i)) − tr(Ω^(i) Ŝ^(i)) )   (4.4), subject to ||Ω^(i)||_1 ≤ t   (4.5) and P(Ω^(1), Ω^(2), ..., Ω^(K)) ≤ t_2.   (4.6) Joint Graphical Lasso [Danaher et al. (2013)]: argmin_{Ω^(i)} Σ_i n_i ( −ln det(Ω^(i)) + tr(Ω^(i) Ŝ^(i)) ) + λ_1 Σ_i ||Ω^(i)||_1 + λ_2 P(Ω^(1), Ω^(2), ..., Ω^(K)).   (4.7)

Multi-task sGGM estimators

Multi-task sGGM estimators: Joint Graphical Lasso-type estimators; directly learning the commonalities and differences among tasks; directly learning the differences between case and control.

Joint Graphical Lasso estimators. Different joint graphical lassos: in the end, different multi-task sGGM estimators choose different penalties P(Ω^(1), Ω^(2), ..., Ω^(K)). Solutions: most methods use ADMM to solve the estimators.

JGL: Problem. Input: {X^(i)}. Output: {Ω^(i)}. Assumption I: sparsity. Assumption II: commonalities and differences.

Multi-task sGGM estimators. Group lasso [Danaher et al. (2013)]: P(Ω^(1), ..., Ω^(K)) = ||Ω^(1), Ω^(2), ..., Ω^(K)||_{G,2}, a group norm over tasks. SIMONE [Chiquet et al. (2011)]: a cooperative-lasso-style penalty built from the positive and negative parts of the entries across tasks, P(Ω^(1), ..., Ω^(K)) = Σ_{i≠j} [ ( Σ_{k=1}^K ((Ω^(k)_ij)_+)² )^{1/2} + ( Σ_{k=1}^K ((−Ω^(k)_ij)_+)² )^{1/2} ]. Node JGL [Mohan et al. (2013)]: P(Ω^(1), ..., Ω^(K)) = Σ_{i>j} RCON(Ω^(i) − Ω^(j)).

Node JGL [Mohan et al. (2013)]

Directly learning the commonalities and differences among tasks: Problem. Input: {X^(i)}. Output: {Ω^(i)_I, Ω_S}. Assumption I: sparsity. Assumption II: commonalities and differences.

Multi-task sGGM estimators: direct modeling. The second penalty function is still an indirect way to model the commonalities and differences among tasks; some works try to model this relationship directly. Mixed Neighborhood Selection (MNS) [Monti et al. (2015)]: the neighborhood edges of a given node v in the i-th task are modeled as β_v + b^(i)_v, where b^(i)_v ~ N(0, Φ_v). Starting from the CLIME estimator, we can directly model each graph as the sum of a commonality and a difference (SIMULE): Ω^(i) = εΩ_S + Ω^(i)_I.

Directly modeling commonalities and differences: SIMULE. (Ω^(1)_I, Ω^(2)_I, ..., Ω^(K)_I, Ω_S) = argmin_{Ω^(i)_I, Ω_S} Σ_i ||Ω^(i)_I||_1 + εK ||Ω_S||_1, subject to ||Σ̂^(i) (Ω^(i)_I + Ω_S) − I||_∞ ≤ λ_n, i = 1, ..., K.

Multi-task sGGM estimators: directly modeling the differential network. Problem. Input: {X^(i)}. Output: the differential network Δ. Assumption I: a sparse differential network.

Directly modeling the differential network I: Fused GLasso. By adding a regularization term to enforce the sparsity of Δ = Ω_c − Ω_d, we have the following formulation: argmin_{Ω_c, Ω_d ≻ 0} L(Ω_c) + L(Ω_d) + λ_n (||Ω_c||_1 + ||Ω_d||_1) + λ_2 ||Ω_c − Ω_d||_1.   (5.1) The fused lasso assumes that Ω_case and Ω_control are themselves sparse. However, many real-world applications, such as brain imaging data, only assume that the differential network is sparse.

Directly modeling the differential network II: Differential CLIME. A recent study proposes the following model, which only assumes sparsity of the differential network Δ. Differential CLIME: argmin_Δ ||Δ||_1, subject to ||Σ̂_c Δ Σ̂_d − (Σ̂_c − Σ̂_d)||_∞ ≤ λ_n.   (5.2) However, this method is solved by linear programming with p² variables, so the time complexity is at least O(p⁸). In practice, it takes more than 2 days to finish running when p = 120.

201 Directly modeling the differential network III: Density Ratio
The methods above all make the Gaussian assumption. This method relaxes the model to the exponential-family distribution.
Density Ratio:
p_c(x; θ_c) / p_d(x; θ_d) ∝ exp(\sum_t θ_t f_t(x))    (5.3)
Here θ_t encodes the difference between the two networks for factor f_t.
Density Ratio:
r(x; θ) = (1 / N(θ)) exp(\sum_t θ_t f_t(x))    (5.4)
Here θ_t encodes the difference between the two networks for factor f_t, and N(θ) is a normalization term.

202 Directly modeling the differential network III: Density Ratio
Density Ratio for Markov Random Fields:
KL[p_c || p] = Const. - \int p_c(x) log r(x; θ) dx, where p(x) = p_d(x) r(x; θ).    (5.5)
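One way to make (5.4)-(5.5) operational is the usual empirical version: replace the integral with an average over the case sample and estimate N(θ) on the control (denominator) sample. The sketch below assumes pairwise factors f_t(x) = x_s x_u; both that factor choice and the function names are illustrative assumptions, not the exact construction in the cited work.

```python
import numpy as np

def pairwise_features(X):
    """Pairwise factors f_t(x) = x_s * x_u for s <= u, one row per sample."""
    iu = np.triu_indices(X.shape[1])
    return X[:, iu[0]] * X[:, iu[1]]          # shape (n, p*(p+1)/2)

def neg_log_ratio_objective(theta, X_c, X_d):
    """Empirical version of (5.5): minus the mean, over case samples, of
    log r(x; theta), with the normalizer N(theta) estimated on the control
    (denominator) sample."""
    F_c, F_d = pairwise_features(X_c), pairwise_features(X_d)
    log_N = np.log(np.mean(np.exp(F_d @ theta)))   # empirical normalizer
    return -(np.mean(F_c @ theta) - log_N)
```

Minimizing this objective with an ℓ1 penalty on θ would then yield a sparse estimate of the differential parameters.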

203 Summary
We introduce the multi-task sGGM estimators.
We introduce the multi-task sGGM estimators which directly model the commonalities and differences.
We introduce the multi-task sGGM estimators which directly model the differences.

204 References I
J. Chiquet, Y. Grandvalet, and C. Ambroise. Inferring multiple graphical structures. Statistics and Computing, 21(4), 2011.
P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.
K. Mohan, P. London, M. Fazel, S.-I. Lee, and D. Witten. Node-based learning of multiple Gaussian graphical models. arXiv preprint, 2013.

205 References II
R. P. Monti, C. Anagnostopoulos, and G. Montana. Learning population and subject-specific brain connectivity networks via mixed neighborhood selection. arXiv preprint, 2015.

206 Joint Gaussian Graphical Model Series VIII
A deep introduction to the metrics for evaluating an estimator/learner
Beilun Wang
Advisor: Yanjun Qi
Department of Computer Science, University of Virginia
Sep 22nd, 2017

207 Road Map
[Figure slide; only the title is recoverable from the transcription.]

208 Outline
1 Notation
2 Review
3 The metrics for evaluating an estimator
4 Statistical Convergence Rate
5 Optimization Convergence Rate
6 Computational Complexity

209 Notation

210 Notation
X: the data matrix.
Σ: the covariance matrix.
Ω: the precision matrix.
p: the number of features.
n: the number of samples in the data matrix.
s: the number of non-zero entries in the precision matrix.

211 Review

212 Review from last talk
We introduce different sGGM estimators and their solutions.

213 Review from last talk
We introduce different sGGM estimators and their solutions.
We briefly introduce the three metrics used in evaluating an estimator.

214 Review from last talk
We introduce different sGGM estimators and their solutions.
We briefly introduce the three metrics used in evaluating an estimator.
We introduce different multi-task sGGM estimators and their optimization challenges.

215 The metrics for evaluating an estimator

216 Motivation I: Select a proper estimator
There may be a lot of similar estimators.

217 Motivation I: Select a proper estimator
There may be a lot of similar estimators.
You need to decide which one to use.

218 Motivation I: Select a proper estimator
There may be a lot of similar estimators.
You need to decide which one to use.
You need some metrics to make the decision.
