Learning graphs from data:
Learning graphs from data: A signal processing perspective
Xiaowen Dong, MIT Media Lab
Graph Signal Processing Workshop, Pittsburgh, PA, May 2017
Introduction: What is the problem of graph learning?
- Given observations on a number of variables (a data matrix X with M samples of N variables) and some prior knowledge (distribution, model, etc.)
- Build/learn a measure of relations between the variables (correlation/covariance, or a graph topology/operator or equivalent)
[Figure: an M × N data matrix X, and a graph carrying a signal x : V → R^N, with edge weights ranging from negative to positive]
Introduction: Why is it important?
- Learning relations between entities benefits numerous application domains
- The learned relations can help us predict future observations
[Examples: (1) objective: functional connectivity between brain regions; input: fMRI recordings in these regions. (2) objective: behavioral similarity/influence between people; input: individual history of activities]
How do we build/learn the graph?
Outline
- A (partial) historical overview
- A signal processing perspective: the GSP idea for graph learning; three signal/graph models
- Perspective
A (partial) historical overview
Simple and intuitive methods
- Sample correlation
- Similarity function (e.g., Gaussian RBF)
Learning graphical models
- Undirected graphical models: Markov random fields (MRF)
- Directed graphical models: Bayesian networks (BN)
- Factor graphs
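The Gaussian RBF similarity function mentioned above can be sketched in a few lines. This is a minimal illustration; the function name, σ, and the pruning threshold are our own choices, not from the talk:

```python
import numpy as np

def rbf_similarity_graph(X, sigma=1.0, threshold=1e-3):
    """Build a weighted graph from data with a Gaussian RBF kernel.

    X : (N, d) array, one row of observations per variable/node.
    W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); near-zero weights pruned.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # no self-loops
    W[W < threshold] = 0.0     # prune near-zero edges
    return W

# Two similar nodes and one distant node
X = np.array([[0.0, 0.1], [0.0, 0.2], [5.0, 5.0]])
W = rbf_similarity_graph(X, sigma=1.0)
```

Note that this method always yields non-negative weights, which foreshadows the non-negativity discussion later in the talk.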
A (partial) historical overview: learning pairwise MRFs
- Conditional independence: (i, j) ∉ E  ⇔  x_i ⊥ x_j | x \ {x_i, x_j}
[Figure: a five-node graph x_1, ..., x_5]
- Probability parameterized by θ:
  P(x | θ) = (1 / Z(θ)) exp( Σ_{i ∈ V} θ_ii x_i² + Σ_{(i,j) ∈ E} θ_ij x_i x_j )
- Gaussian graphical models with precision matrix Θ:
  P(x | Θ) = ( |Θ|^{1/2} / (2π)^{N/2} ) exp( −½ x^T Θ x )
- Learning a sparse Θ:
  - interactions are mostly local
  - computationally more tractable
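The conditional-independence statement can be checked numerically for a Gaussian graphical model: a zero in the precision matrix encodes a vanishing partial correlation, even though the covariance (its inverse) is dense. A small sketch; the chain graph and its values are illustrative:

```python
import numpy as np

# Chain graph x1 - x2 - x3: the precision matrix has a zero for the
# missing edge (1, 3), encoding x1 ⊥ x3 | x2.
Theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Theta)

# The covariance is dense (x1 and x3 are marginally dependent), but the
# partial correlation of x1 and x3 given x2 is exactly zero:
partial_corr_13 = -Theta[0, 2] / np.sqrt(Theta[0, 0] * Theta[2, 2])
```

This is why sparsity is imposed on the precision matrix rather than the covariance: zeros in Θ directly encode missing edges.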
A (partial) historical overview: covariance selection (Dempster, 1972)
- Prune the smallest elements in the precision (inverse covariance) matrix
- For X ~ N(0, Σ): from the data matrix, compute the inverse of the sample covariance, S⁻¹, and compare with the ground-truth precision
- Not applicable when the sample covariance is not invertible!
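Dempster's covariance selection amounts to inverting the sample covariance and zeroing its small entries, which requires an invertible S (many more samples than variables). A sketch on synthetic Gaussian data; the toy precision matrix and the pruning threshold 0.3 are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 2000
Theta_true = 2.0 * np.eye(N)
Theta_true[0, 1] = Theta_true[1, 0] = -0.8   # a single edge between nodes 0 and 1
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(N), Sigma_true, size=M)  # (M, N) data matrix

S = np.cov(X, rowvar=False)        # sample covariance; invertible since M >> N
Theta_hat = np.linalg.inv(S)

# Dempster-style pruning: zero out small off-diagonal entries
P = Theta_hat.copy()
P[np.abs(P) < 0.3] = 0.0
```

With M on the order of N or smaller, S becomes singular, which is exactly the limitation the slide points out.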
A (partial) historical overview: ℓ1-regularized neighborhood regression (Meinshausen & Bühlmann)
- Learning a graph = learning the neighborhood of each node
- LASSO regression of node 1 on all the others:
  min_β  ‖X_1 − X_{\1} β‖₂² + λ‖β‖₁
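The Meinshausen–Bühlmann idea can be sketched with scikit-learn's `Lasso`: regress each node on all the others and read its neighborhood off the nonzero coefficients. The data, dependence structure, and `alpha` below are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
M, N = 2000, 4
X = rng.standard_normal((M, N))
X[:, 0] = 0.9 * X[:, 1] + 0.1 * rng.standard_normal(M)  # node 0 connects only to node 1

# l1-penalized regression of node 0 on the remaining nodes; nonzero
# coefficients estimate the neighborhood of node 0.
lasso = Lasso(alpha=0.1).fit(X[:, 1:], X[:, 0])
neighborhood = (np.flatnonzero(np.abs(lasso.coef_) > 1e-6) + 1).tolist()
```

In the full method this regression is repeated for every node, and the per-node neighborhoods are combined (by AND or OR rules) into one graph.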
A (partial) historical overview: ℓ1-regularized log-determinant (Banerjee; Friedman)
- Estimation of a sparse precision matrix
- The graphical LASSO maximizes the penalized likelihood of the precision matrix Θ, where
  P(X | Θ) ∝ |Θ|^{M/2} exp( −½ Σ_{m=1}^{M} X(m)^T Θ X(m) ),
  giving the ℓ1-penalized log-likelihood problem
  max_Θ  log det Θ − tr(SΘ) − λ‖Θ‖₁
- Later: a quadratic approximation of the Gaussian negative log-likelihood (Hsieh)
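The penalized objective above is what scikit-learn's `GraphicalLasso` solves; a minimal sketch on synthetic data (the toy precision matrix and `alpha` are our own choices):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
N, M = 4, 5000
Theta_true = 2.0 * np.eye(N)
Theta_true[0, 1] = Theta_true[1, 0] = -0.9   # one true edge
X = rng.multivariate_normal(np.zeros(N), np.linalg.inv(Theta_true), size=M)

# Solves max_Theta  log det(Theta) - tr(S Theta) - alpha * ||Theta||_1
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_
```

Unlike covariance selection, this works even when the sample covariance is singular, since the ℓ1 penalty regularizes the problem.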
A (partial) historical overview: ℓ1-regularized logistic regression (Ravikumar)
- Neighborhood learning for discrete (e.g., binary) variables
- Regularized logistic regression per node:
  max  Σ_m log P(X_{1m} | X_{\1,m}) − λ‖θ‖₁,
  with the conditional probability given by the logistic function
A (partial) historical overview
Learning graphical models
- Classical learning approaches lead to both positive and negative relations
- What about learning a graph topology with non-negative weights?
Learning topologies with non-negative weights
- M-matrices (symmetric, positive definite, non-positive off-diagonal) have been used as precision matrices, leading to attractive GMRFs (Slawski and Hein, 2015)
- The combinatorial graph Laplacian L = Deg − W is an M-matrix and is equivalent to the graph topology
From an arbitrary precision matrix to a graph Laplacian!
A (partial) historical overview: from precision matrix to Laplacian
- The graph Laplacian L can serve as the precision matrix, BUT it is singular
- Lake: ℓ1-regularized log-determinant on a generalized Laplacian:
  max_Θ  log det Θ − tr(SΘ) − λ‖Θ‖₁   s.t.  Θ = L + (1/σ²) I
[Figure: precision matrix learned by the graphical LASSO vs. Laplacian learned by Lake et al.; cf. Slawski and Hein (2015)]
A (partial) historical overview: quadratic-form approaches
- Daitch (related to locally linear embedding [Roweis00]): minimize a quadratic form of a power of L, ‖LX‖²_F = tr(X^T L² X)
- Lake: ℓ1-regularized log-determinant on a generalized Laplacian
- Hu: minimize the quadratic form tr(X^T L^s X) of a power of L
(cf. Slawski and Hein, 2015)
A (partial) historical overview: the GSP branch
[Figure: timeline of the classical methods above, now joined by graph-learning work from the signal processing perspective: Dong, Egilmez, Thanou, Mei, Pasdeloup, Kalofolias, Segarra, Baingana, Chepuri]
A signal processing perspective
Existing approaches have limitations
- Simple correlation or similarity functions are not enough
- Most classical methods for learning graphical models do not directly lead to topologies with non-negative weights
- There is no strong emphasis on signal/graph interaction with a spectral/frequency-domain interpretation
Opportunity and challenge for graph signal processing
- GSP tools such as frequency analysis and filtering can contribute to the graph learning problem
- Filtering-based approaches can provide generative models for signals with complex non-Gaussian behavior
A signal processing perspective
- Signal processing is about a dictionary model: x = D c (signal x, dictionary D, coefficients c)
- Graph signal processing is about x = D(G) c, where the dictionary depends on the graph G
- Forward problem: given G and x, design D to study c
  - Fourier/wavelet atoms → graph Fourier/wavelet coefficients [Coifman06, Narang09, Hammond11, Shuman13, Sandryhaila13]
  - trained dictionary atoms → graph dictionary coefficients [Zhang12, Thanou14]
- Backward problem (graph learning): given x, design D and c to infer G
  - The key is the signal/graph model behind D
  - D is designed around graph operators (adjacency/Laplacian matrices, shift operators)
  - The choice of/assumption on c often determines the signal characteristics
Model 1: Global smoothness
- Signal values vary smoothly between all pairs of nodes that are connected
- Example: temperature at different locations in a flat geographical region
- Usually quantified by the Laplacian quadratic form:
  x^T L x = ½ Σ_{i,j} W_ij (x(i) − x(j))²
[Figure: two signals on the same graph, one with x^T L x = 1 (smooth) and one with x^T L x = 21 (non-smooth)]
- Similar to previous approaches:
  - Lake (2010): max_{Θ = L + (1/σ²)I}  log det Θ − (1/M) tr(XX^T Θ) − λ‖Θ‖₁
  - Daitch (2009): min_L ‖LX‖²_F = min_L tr(X^T L² X)
  - Hu (2013): min_L tr(X^T L^s X)
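The two forms of the smoothness measure — the quadratic form x^T L x and the weighted sum of squared edge differences — can be checked against each other on a small path graph. A sketch; the graph and the signals are illustrative:

```python
import numpy as np

# Path graph 0 - 1 - 2 - 3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W   # combinatorial Laplacian L = Deg - W

def smoothness(x):
    """Laplacian quadratic form x^T L x."""
    return float(x @ L @ x)

def pairwise(x):
    """Equivalent form: (1/2) * sum_ij W_ij (x(i) - x(j))^2."""
    return 0.5 * float(np.sum(W * (x[:, None] - x[None, :]) ** 2))

x_smooth = np.array([1.0, 1.1, 1.2, 1.3])   # varies slowly along edges
x_rough = np.array([1.0, -1.0, 1.0, -1.0])  # alternates along edges
```

Smooth signals score low and oscillating signals score high, which is what the learning objectives below exploit.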
Model 1: Global smoothness — Dong et al. (2015) & Kalofolias (2016)
- D(G) = χ, the eigenvector matrix of L
- Gaussian assumption on c: c ~ N(0, Σ_c) for a prior covariance tied to the spectrum of L
- Maximum a posteriori (MAP) estimation of c leads to minimization of the Laplacian quadratic form:
  min_c  ‖x − χc‖₂² − λ log P_c(c)
  ⇒ min_{L,Y}  ‖X − Y‖²_F + α tr(Y^T L Y) + β‖L‖²_F
  (data fidelity + smoothness on Y + regularization)
[Figure: a noisy observed signal x and its smooth estimate y on the graph]
- Learning enforces the signal property (global smoothness)!
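With L fixed, the Y-step of the alternating minimization above has a closed form, Y = (I + αL)^{-1} X, i.e., a graph-smoothing filter. A sketch of that single step (the full method also alternates an update of L, omitted here; the graph, signal, and α are illustrative):

```python
import numpy as np

def denoise_step(X, L, alpha=1.0):
    """Closed-form Y-step: minimize ||X - Y||_F^2 + alpha * tr(Y^T L Y)."""
    N = L.shape[0]
    return np.linalg.solve(np.eye(N) + alpha * L, X)

# Path graph 0 - 1 - 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
X = np.array([[1.0], [5.0], [1.0]])   # noisy spike at the middle node
Y = denoise_step(X, L, alpha=2.0)
```

The spike at the middle node is pulled towards its neighbors, so Y is smoother on the graph than X.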
Model 1: Global smoothness — Egilmez et al. (2016)
- min_Θ  tr(ΘK) − log det Θ   s.t.  K = S + α(I − 11^T)
  (the ℓ1 prior on the edge weights is absorbed into K)
- Solve for Θ as three different graph Laplacian matrices:
  - generalized Laplacian (non-positive off-diagonal entries): Θ = L + V = Deg − W + V
  - diagonally dominant generalized Laplacian: Θ = L + V = Deg − W + V (V ≥ 0)
  - combinatorial Laplacian: Θ = L = Deg − W
- Generalizes the graphical LASSO and Lake
- Adding priors on the edge weights leads to an interpretation as MAP estimation
Model 1: Global smoothness — Chepuri et al. (2016)
- An edge selection mechanism based on the same smoothness measure
[Figure: edges are selected and added one by one on the graph]
- Similar in spirit to Dempster
- Good for learning unweighted graphs
- An explicit edge-handling mechanism is desirable in some applications
Model 2: Diffusion process
- Signals are the outcome of diffusion processes on the graph (local smoothness rather than global!)
- Example: movement of people/vehicles in a transportation network
- Characterized by diffusion operators:
  - heat diffusion
  - a general graph shift operator (e.g., the adjacency matrix A)
[Figure: an initial signal concentrated on a few nodes diffuses over the graph into the observed signal]
Model 2: Diffusion process — Pasdeloup et al. (2015, 2016)
- D(G) = T_{k(m)} = W_norm^{k(m)}, a power of the normalized adjacency matrix
- {c_m} are i.i.d. samples with independent entries
- Two-step approach:
  - Estimate the eigenvector matrix from the sample covariance (if the covariance is unknown):
    Σ = E[ Σ_{m=1}^{M} X(m) X(m)^T ] = Σ_{m=1}^{M} W_norm^{2k(m)}   (a polynomial of W_norm)
  - Optimize for the eigenvalues given the constraints on W_norm (mainly non-negativity of the off-diagonal entries and the eigenvalue range) and some priors (e.g., sparsity)
- More a graph-centric learning framework: the cost function is on graph components instead of signals
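The key observation — that the covariance of diffused white signals is a polynomial of W_norm and therefore shares its eigenvectors — can be verified on a toy graph. This sketches only the first step of the method; the graph and normalization below are our own choices:

```python
import numpy as np

# Path graph 0 - 1 - 2 with symmetric normalization D^{-1/2} W D^{-1/2}
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
d = W.sum(axis=1)
W_norm = W / np.outer(np.sqrt(d), np.sqrt(d))

# For x = W_norm^k c with white c, E[x x^T] = W_norm^{2k}: a polynomial of
# W_norm, hence simultaneously diagonalizable with it.
Sigma = np.linalg.matrix_power(W_norm, 2)   # one diffusion step, k = 1
commutator = Sigma @ W_norm - W_norm @ Sigma
```

A vanishing commutator means the covariance and the graph operator share an eigenbasis; recovering the eigenvalues (whose signs the squaring destroys) is what the second optimization step is for.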
Model 2: Diffusion process — Segarra et al. (2016)
- D(G) = H(S_G) = Σ_{l=0}^{L−1} h_l S_G^l, a diffusion filter defined by a graph shift operator S_G that can be arbitrary, but is practically W or L
- c is a white signal
- Two-step approach:
  - Estimate the eigenvector matrix: Σ = H H^T shares eigenvectors with S_G
  - Select eigenvalues λ that satisfy the constraints on S_G:
    min_{S_G, λ}  ‖S_G‖₀   s.t.  S_G = Σ_{n=1}^{N} λ_n v_n v_n^T
    (the v_n are the spectral templates, i.e., the eigenvectors)
- Similar in spirit to Pasdeloup: the same stationarity assumption, but a different inference framework due to a different D
- Can handle noisy or incomplete information on the spectral templates
Model 2: Diffusion process — Thanou et al. (2016)
- D(G) = e^{−τL} (localization in the vertex domain)
- Sparsity assumption on c
- Each signal is a combination of several heat diffusion processes at times τ_s:
  min_{L,C}  ‖X − D(L)C‖²_F + λ Σ_{m=1}^{M} ‖c_m‖₁ + β‖L‖²_F   s.t.  D = [e^{−τ₁L}, ..., e^{−τ_S L}]
  (data fidelity + sparsity on c + regularization)
- Still a diffusion-based model, but more signal-centric
- No assumption on eigenvectors/stationarity, but on signal structure and sparsity
- Can be extended to the general polynomial case (Maretic et al. 2017)
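The heat-kernel dictionary D = [e^{−τ₁L}, ..., e^{−τ_S L}] can be built directly with `scipy.linalg.expm`; a sketch with two arbitrary diffusion times (the graph and τ values are our own):

```python
import numpy as np
from scipy.linalg import expm

# Path graph 0 - 1 - 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

taus = [0.5, 2.0]
D = np.hstack([expm(-tau * L) for tau in taus])   # dictionary, shape (3, 6)

# A sparse code c with a single nonzero entry models unit heat placed at
# node 0 and diffused for tau = 0.5:
c = np.zeros(6)
c[0] = 1.0
x = D @ c
```

Each atom is a heat bump localized around one node, which is what makes the sparse-coding interpretation natural.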
Model 3: Time-varying observations
- Signals are time-varying observations that are the causal outcome of current or past values (mixed degree of smoothness depending on previous states)
- Example: evolution of individual behavior due to the influence of different friends at different timestamps
- Characterized by an autoregressive model or a structural equation model (SEM)
Model 3: Time-varying observations — Mei and Moura (2015)
- x[t] = Σ_{s=1}^{S} P_s(W) x[t−s]: D_s(G) = P_s(W) is a polynomial of W of degree s, and x[t−s] plays the role of c_s
- Estimation:
  min_{W,a}  ½ Σ_{k=S+1}^{K} ‖x[k] − Σ_{s=1}^{S} P_s(W) x[k−s]‖₂² + λ₁‖vec(W)‖₁ + λ₂‖a‖₁
  (data fidelity + sparsity on W + sparsity on the polynomial coefficients a)
- Polynomial design similar in spirit to Pasdeloup and Segarra
- Good for inferring causal relations between signals
- Kernelized (nonlinear) version: Shen et al. (2016)
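One step of the causal model x[t] = Σ_s P_s(W) x[t−s] can be written out directly; the graph, lags, and polynomial coefficients below are illustrative, and fitting them (the optimization above) is omitted:

```python
import numpy as np

# Path graph 0 - 1 - 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

def poly_filter(W, coeffs):
    """P(W) = c0 * I + c1 * W + c2 * W^2 + ..."""
    P = np.zeros_like(W)
    for k, c in enumerate(coeffs):
        P += c * np.linalg.matrix_power(W, k)
    return P

P1 = poly_filter(W, [0.2, 0.3])   # acts on x[t-1]
P2 = poly_filter(W, [0.1])        # acts on x[t-2]

x_lag1 = np.array([1., 0., 0.])   # x[t-1]
x_lag2 = np.array([0., 1., 0.])   # x[t-2]
x_t = P1 @ x_lag1 + P2 @ x_lag2   # predicted x[t]
```

Because P_s(W) only mixes values along graph edges, a sparse W constrains which past node values can causally influence which present ones.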
Model 3: Time-varying observations — Baingana and Giannakis (2016)
- D(G) = W_{s(t)}, the graph at time t (topologies switch at each time between S discrete states), plus an external input term; x plays the role of c:
  x[t] = W_{s(t)} x[t] + B_{s(t)} y[t]   (internal influence from neighbors + external influence)
- Solve for all states of W:
  min_{W_{s(t)}, B_{s(t)}}  ½ Σ_{t=1}^{T} ‖x[t] − W_{s(t)} x[t] − B_{s(t)} y[t]‖²_F + Σ_{s=1}^{S} λ_s ‖W_{s(t)}‖₁
  (data fidelity + sparsity on W)
- Good for inferring causal relations between signals as well as dynamic topologies
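For a fixed topology, the structural equation x[t] = W x[t] + B y[t] can be solved for the equilibrium signal, x = (I − W)^{-1} B y. A sketch; W, B, and y below are illustrative values, not from the talk:

```python
import numpy as np

# Path graph 0 - 1 - 2 with internal influence weights 0.4
W = np.array([[0. , 0.4, 0. ],
              [0.4, 0. , 0.4],
              [0. , 0.4, 0. ]])
B = 0.5 * np.eye(3)          # external influence strengths
y = np.array([1., 0., 0.])   # external input hits node 0 only

# x = W x + B y  =>  (I - W) x = B y
x = np.linalg.solve(np.eye(3) - W, B @ y)
```

The external input at node 0 propagates through the network, so its effect decays with distance from the source.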
Comparison of different methods

| Method            | Signal model                | Assumption                        | Learning output                 | Edge direction | Inference      |
|-------------------|-----------------------------|-----------------------------------|---------------------------------|----------------|----------------|
| Dong (2015)       | Global smoothness           | Gaussian                          | Laplacian                       | Undirected     | Signal-centric |
| Kalofolias (2016) | Global smoothness           | Gaussian                          | Adjacency                       | Undirected     | Signal-centric |
| Egilmez (2016)    | Global smoothness           | Gaussian                          | Generalized Laplacian           | Undirected     | Signal-centric |
| Chepuri (2016)    | Global smoothness           | Gaussian                          | Adjacency                       | Undirected     | Graph-centric  |
| Pasdeloup (2015)  | Diffusion by adjacency      | Stationary                        | Normalized adjacency/Laplacian  | Undirected     | Graph-centric  |
| Segarra (2016)    | Diffusion by shift operator | Stationary                        | Graph shift operator            | Undirected     | Graph-centric  |
| Thanou (2016)     | Heat diffusion              | Sparsity                          | Laplacian                       | Undirected     | Signal-centric |
| Mei (2015)        | Time-varying                | Dependent on previous states      | Adjacency                       | Directed       | Signal-centric |
| Baingana (2016)   | Time-varying                | Dependent on current int/ext info | Time-varying adjacency          | Directed       | Signal-centric |
101 Perspective: GSP for graph learning
(figure: observed signals and two candidate graph topologies)

Learning input
- missing observations
- partial observations, e.g., by sampling

Signal/graph model
- beyond smoothness: localization in the vertex-frequency domain, bandlimited signals (Sardellitti 2017)

Learning output
- directed graphs (Shen 2017)
- time-varying graphs (Kalofolias 2017)
- multi-layer graphs
- subgraphs or ego-networks
- intermediate graph representations

Theoretical considerations
- performance guarantees (Rabbat 2017)
- computational efficiency

Learning objective
- for what SP applications? e.g., classification (Yankelevsky 2016), coding and compression (Rotondo 2015, Fracastoro 2016)
- for traditional graph-based learning, e.g., clustering, dimensionality reduction, ranking

29/34
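The "bandlimited" signal model mentioned above is easy to state concretely: a signal is (approximately) bandlimited if its energy is concentrated in the eigenvectors of the Laplacian with the smallest eigenvalues, i.e., the low end of the graph Fourier spectrum. A sketch of checking this (helper names are illustrative):

```python
import numpy as np

def graph_fourier_basis(L):
    """Eigendecomposition of a symmetric Laplacian; the eigenvectors,
    ordered by increasing eigenvalue, act as the graph Fourier basis."""
    eigvals, U = np.linalg.eigh(L)
    return eigvals, U

def low_frequency_energy(L, x, k):
    """Fraction of the signal energy carried by the k lowest graph
    frequencies; close to 1 means x is (approximately) k-bandlimited."""
    _, U = graph_fourier_basis(L)
    x_hat = U.T @ x                 # graph Fourier transform of x
    return float(np.sum(x_hat[:k] ** 2) / np.sum(x_hat ** 2))
```

A signal synthesized from the two lowest eigenvectors of a path-graph Laplacian has essentially all of its energy in the first two graph frequencies, while a random signal spreads its energy across the whole spectrum.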
107 Graph learning at GSPW 2017: Thursday June 1st and Friday June 2nd 30/34
108 References

- B. Baingana and G. B. Giannakis. Tracking switching network topologies from propagating graph signals. In Graph Signal Processing Workshop.
- B. Baingana and G. B. Giannakis. Tracking switched dynamic network topologies from information cascades. IEEE Transactions on Signal Processing, 65(4).
- O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9.
- S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero III. Learning sparse graphs under smoothness prior. arXiv preprint.
- D. I Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83-98.
- S. I. Daitch, J. A. Kelner, and D. A. Spielman. Fitting a graph to vector data. In Proceedings of the International Conference on Machine Learning.
- A. P. Dempster. Covariance selection. Biometrics, 28(1).
- X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. Laplacian matrix learning for smooth graph signal representation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23).

31/34
109 References

- H. E. Egilmez, E. Pavez, and A. Ortega. Graph learning from data under structural and Laplacian constraints. arXiv preprint.
- G. Fracastoro, D. Thanou, and P. Frossard. Graph transform learning for image compression. In Proceedings of the Picture Coding Symposium.
- J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3).
- D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2).
- C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems 24.
- C. Hu, L. Cheng, J. Sepulcre, G. E. Fakhri, Y. M. Lu, and Q. Li. A graph theoretical regression model for brain connectivity learning of Alzheimer's disease. In Proceedings of the IEEE International Symposium on Biomedical Imaging.
- V. Kalofolias. How to learn a graph from smooth signals. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
- V. Kalofolias, A. Loukas, D. Thanou, and P. Frossard. Learning time varying graphs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- B. Lake and J. Tenenbaum. Discovering structure by learning sparse graphs. In Proceedings of the Annual Cognitive Science Conference.

32/34
110 References

- H. P. Maretic, D. Thanou, and P. Frossard. Graph learning under sparsity priors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- J. Mei and J. M. F. Moura. Signal processing on graphs: Estimating the structure of a graph. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3).
- S. K. Narang and A. Ortega. Lifting based wavelet transforms on graphs. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.
- B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat. Characterization and inference of weighted graph topologies from observations of diffused signals. arXiv preprint.
- B. Pasdeloup, M. Rabbat, V. Gripon, D. Pastor, and G. Mercier. Graph reconstruction from the observation of diffused signals. In Proceedings of the Annual Allerton Conference.
- P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 38(3).
- I. Rotondo, G. Cheung, A. Ortega, and H. E. Egilmez. Designing sparse graphs via structure tensor for block transform coding of images. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.
- A. Sandryhaila and J. M. F. Moura. Discrete signal processing on graphs. IEEE Transactions on Signal Processing, 61(7).

33/34
111 References

- S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro. Network topology inference from spectral templates. arXiv preprint.
- Y. Shen, B. Baingana, and G. B. Giannakis. Nonlinear structural vector autoregressive models for inferring effective brain network connectivity. arXiv preprint.
- Y. Shen, B. Baingana, and G. B. Giannakis. Kernel-based structural equation models for topology identification of directed networks. IEEE Transactions on Signal Processing, 65(10).
- M. Slawski and M. Hein. Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields. Linear Algebra and its Applications, 473.
- D. Thanou, D. I Shuman, and P. Frossard. Learning parametric dictionaries for signals on graphs. IEEE Transactions on Signal Processing, 62(15).
- D. Thanou, X. Dong, D. Kressner, and P. Frossard. Learning heat diffusion graphs. arXiv preprint.
- Y. Yankelevsky and M. Elad. Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks, 2(4).
- X. Zhang, X. Dong, and P. Frossard. Learning of structured graph dictionaries. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

34/34
Blind Identification of Invertible Graph Filters with Multiple Sparse Inputs Chang Ye Dept. of ECE and Goergen Institute for Data Science University of Rochester cye7@ur.rochester.edu http://www.ece.rochester.edu/~cye7/
More informationExploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability
Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Nuntanut Raksasri Jitkomut Songsiri Department of Electrical Engineering, Faculty of Engineering,
More informationSemi-Supervised Graph Classifier Learning with Negative Edge Weights
Gene Cheung National Institute of Informatics 15 th May, 217 Semi-Supervised Graph Classifier Learning with Negative Edge Weights. Torino Visit 5/15/217 1 Acknowledgement Collaborators: Y. Ji, M. Kaneko
More informationGraphs, Geometry and Semi-supervised Learning
Graphs, Geometry and Semi-supervised Learning Mikhail Belkin The Ohio State University, Dept of Computer Science and Engineering and Dept of Statistics Collaborators: Partha Niyogi, Vikas Sindhwani In
More informationNEAR-OPTIMALITY OF GREEDY SET SELECTION IN THE SAMPLING OF GRAPH SIGNALS. Luiz F. O. Chamon and Alejandro Ribeiro
NEAR-OPTIMALITY OF GREEDY SET SELECTION IN THE SAMPLING OF GRAPH SIGNALS Luiz F. O. Chamon and Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania e-mail: luizf@seas.upenn.edu,
More information