School of Computer Science
Probabilistic Graphical Models
Gaussian graphical models and Ising models: modeling networks
Eric Xing, February 2016
Reading: see class website
Eric Xing @ CMU, 2005-2016
Network Research: social networks, the Internet, regulatory networks
Where do networks come from? The Jesus network
Evolving networks. Can I get his vote? Cooperativity, antagonism, cliques: how do they change over time? (March 2005, January 2006, August 2006)
Evolving networks: snapshots at t = 1, 2, 3, ..., T
Recall: ML structural learning for completely observed GMs. [Figure: a candidate network over nodes E, R, B, A, C, learned from a dataset of M iid samples (x_1^(m), ..., x_n^(m)), m = 1, ..., M.]
Two optimal approaches
"Optimal" here means the employed algorithms are guaranteed to return a structure that maximizes the objective (e.g., log-likelihood).
Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, or do not even have an explicit objective; e.g., structured EM, module networks, greedy structural search, etc.
We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they apply only to certain families of graphs:
Trees: the Chow-Liu algorithm (this lecture)
Pairwise MRFs: covariance selection, neighborhood selection (later)
Key Idea: network inference as parameter estimation
Model: pairwise Markov random fields
Nodal states can be discrete (Ising/Potts model), continuous (Gaussian graphical model), or heterogeneous.
The parameter matrix encodes the graph structure.
Recall: multivariate Gaussian
Multivariate Gaussian density:
p(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
WLOG, let \mu = (0, \dots, 0) and Q = \Sigma^{-1}; then
p(x | \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\frac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}
We can view this as a continuous Markov random field with potentials defined on every node and edge.
Gaussian graphical model. [Figure: microarray samples across cell types.] Encodes dependencies among genes.
Precision matrix encodes non-zero edges in a Gaussian graphical model: an edge corresponds to a nonzero element of the precision matrix.
Markov versus correlation networks
A correlation network is based on the covariance matrix; a GGM is a Markov network based on the precision matrix.
Conditional independence / partial correlation coefficients are a more sophisticated dependence measure.
With small sample size, the empirical covariance matrix cannot be inverted.
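The difference is easy to see numerically. A minimal sketch (NumPy; the 3-node chain precision matrix below is an arbitrary illustrative choice): the two end nodes of the chain are marginally correlated through the middle node, yet their partial correlation, read off the precision matrix, is exactly zero.

```python
import numpy as np

# Illustrative precision matrix for a 3-node chain 1 -- 2 -- 3:
# the (1,3) entry is zero because that edge is absent.
Q = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Q)  # covariance matrix of the implied Gaussian

# Correlation network: the marginal covariance between nodes 1 and 3
# is nonzero (dependence flows through node 2).
cov_13 = Sigma[0, 2]

# Markov network: partial correlation rho_ij = -q_ij / sqrt(q_ii q_jj)
# is zero, i.e. nodes 1 and 3 are independent given node 2.
rho_13 = -Q[0, 2] / np.sqrt(Q[0, 0] * Q[2, 2])

print(cov_13, rho_13)
```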
Sparsity
One common assumption to make: sparsity.
Makes empirical sense: genes are assumed to interact with only small groups of other genes.
Makes statistical sense: learning is now feasible in high dimensions with small sample size.
Network learning with the LASSO
Assume the network is a Gaussian graphical model.
Perform a LASSO regression of all other nodes onto a target node.
Network Learning with the LASSO LASSO can select the neighborhood of each node
L1 regularization (LASSO): a convex relaxation.
Constrained form: \min_\beta \|y - X\beta\|_2^2 subject to \|\beta\|_1 \le t
Lagrangian form: \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
Enforces sparsity!
Theoretical guarantees. Assumptions:
Dependency condition: relevant covariates are not overly dependent.
Incoherence condition: the large number of irrelevant covariates cannot be too correlated with the relevant covariates.
Strong concentration bounds: sample quantities converge to their expected values quickly.
If these assumptions are met, the LASSO will asymptotically recover the correct subset of relevant covariates.
Network learning with the LASSO: repeat this for every node; form the total edge set.
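A minimal sketch of the whole procedure (NumPy plus scikit-learn's `Lasso`; the 4-node chain precision matrix, sample size, and `alpha=0.1` are illustrative choices, not tuned values):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ground-truth GGM: a 4-node chain 0 -- 1 -- 2 -- 3 (illustrative values)
Q = np.diag([2.0] * 4) + np.diag([-0.9] * 3, 1) + np.diag([-0.9] * 3, -1)
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Q), size=2000)

# Lasso-regress each node onto all the others; the nonzero coefficients
# define its estimated neighborhood.  Union the neighborhoods (OR rule).
p = X.shape[1]
edges = set()
for i in range(p):
    rest = [j for j in range(p) if j != i]
    coef = Lasso(alpha=0.1).fit(X[:, rest], X[:, i]).coef_
    for j, w in zip(rest, coef):
        if abs(w) > 1e-8:
            edges.add(tuple(sorted((i, j))))

print(sorted(edges))  # with enough data, this recovers the chain edges
```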
Consistent structure recovery [Meinshausen and Bühlmann 2006, Wainwright 2009]: if the sample size n grows sufficiently fast relative to d log p (with d the maximum node degree) and the regularization parameter is chosen appropriately, then with high probability the estimated edge set equals the true edge set.
Why does this algorithm work? What is the intuition behind graphical regression? Continuous nodal attributes; discrete nodal attributes. Are there other algorithms? More general scenarios: non-iid samples and evolving networks. Case study.
Multivariate Gaussian
Multivariate Gaussian density:
p(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
A joint Gaussian: p(x_1, x_2) = N(x | \mu, \Sigma), with \mu = (\mu_1, \mu_2) and \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
How do we write down p(x_2), p(x_1), or p(x_1 | x_2) using the block elements of \mu and \Sigma? Formulas to remember:
Marginal: p(x_2) = N(x_2 | \mu_2, \Sigma_{22})
Conditional: p(x_1 | x_2) = N(x_1 | m_{1|2}, V_{1|2}), where
m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \qquad V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
The matrix inverse lemma
Consider a block-partitioned matrix M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}.
First we diagonalize M:
\begin{pmatrix} I & -F H^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} \begin{pmatrix} I & 0 \\ -H^{-1} G & I \end{pmatrix} = \begin{pmatrix} E - F H^{-1} G & 0 \\ 0 & H \end{pmatrix}
Schur complement: M/H = E - F H^{-1} G.
Then we invert, using (XYZ)^{-1} = Z^{-1} Y^{-1} X^{-1}:
M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix}
Matrix inverse lemma: (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
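These identities are easy to sanity-check numerically; the sketch below (NumPy, a random symmetric positive-definite M) verifies that the top-left block of M^{-1} equals the inverse of the Schur complement M/H:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric positive-definite matrix, block-partitioned as
# M = [[E, F], [G, H]] with a 2x2 top-left block.
A = rng.standard_normal((5, 5))
M = A @ A.T + 5.0 * np.eye(5)
E, F = M[:2, :2], M[:2, 2:]
G, H = M[2:, :2], M[2:, 2:]

# Schur complement of H in M
S = E - F @ np.linalg.inv(H) @ G   # M/H

# Block-inversion formula: the top-left block of M^{-1} is (M/H)^{-1}
Minv = np.linalg.inv(M)
ok = np.allclose(Minv[:2, :2], np.linalg.inv(S))
print(ok)
```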
The covariance and the precision matrices
Apply the block-inversion formula to the covariance matrix: partition
\Sigma = \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{12}^\top & \sigma_{22} \end{pmatrix}, \qquad Q = \Sigma^{-1} = \begin{pmatrix} Q_{11} & q_{12} \\ q_{12}^\top & q_{22} \end{pmatrix}
The blocks of Q are then given by the Schur-complement expressions, e.g.
q_{22} = (\sigma_{22} - \sigma_{12}^\top \Sigma_{11}^{-1} \sigma_{12})^{-1}, \qquad q_{12} = -\Sigma_{11}^{-1} \sigma_{12} \, q_{22}
Single-node conditional
The conditional distribution of a single node x_i given the rest of the nodes x_{-i} can be written as p(x_i | x_{-i}) = N(m_i, v_i).
WLOG let \mu = 0, and partition Q so that q_{ii} is its (i,i) entry and q_i = Q_{i,-i} collects the off-diagonal entries of row i; then
m_i = -\frac{1}{q_{ii}} q_i^\top x_{-i}, \qquad v_i = \frac{1}{q_{ii}}
Conditional auto-regression
From the single-node conditional, we can write the following conditional auto-regression function for each node:
x_i = \sum_{j \ne i} \beta_{ij} x_j + \text{noise}, \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}}
Neighborhood estimation is then based on the auto-regression coefficients.
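The identity \beta_{ij} = -q_{ij}/q_{ii} can be verified directly from an illustrative precision matrix: the population regression coefficients of x_i on the remaining variables, computed from \Sigma, coincide with -q_{ij}/q_{ii} and vanish exactly outside node i's neighborhood.

```python
import numpy as np

# Illustrative 4-node chain precision matrix (arbitrary values)
Q = np.diag([2.0] * 4) + np.diag([-0.8] * 3, 1) + np.diag([-0.8] * 3, -1)
Sigma = np.linalg.inv(Q)

i, rest = 1, [0, 2, 3]
# Population regression of x_i on x_rest: beta = Sigma_rr^{-1} Sigma_ri
beta = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, i])

# Auto-regression view: beta_j = -q_ij / q_ii
beta_from_Q = -Q[i, rest] / Q[i, i]

print(beta, beta_from_Q)  # equal; zero for the non-neighbor node 3
```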
Conditional independence
From the auto-regression view: given an estimate of the neighborhood S_i, we have
x_i \perp x_{V \setminus (S_i \cup \{i\})} \mid x_{S_i}
Thus the neighborhood S_i defines the Markov blanket of node i.
Recent trends in GGM:
Covariance selection (classical method)
Dempster [1972]: sequentially pruning the smallest elements of the precision matrix
Drton and Perlman [2008]: improved statistical tests for pruning
Serious limitation in practice: breaks down when the covariance matrix is not invertible
L1-regularization-based methods (hot!)
Meinshausen and Bühlmann [Ann. Stat. 2006]: LASSO regression for neighborhood selection
Banerjee et al. [JMLR 2008]: block sub-gradient algorithm for finding the precision matrix
Friedman et al. [Biostatistics 2008]: efficient fixed-point equations based on a sub-gradient algorithm
Structure learning is possible even when #variables > #samples
The Meinshausen-Bühlmann (MB) algorithm: solve a separate Lasso for every single variable.
Step 1: pick one variable.
Step 2: think of it as y, and the rest as z.
Step 3: solve the Lasso regression problem between y and z. (The resulting coefficients w do not correspond to the values of Q element-wise.)
Step 4: connect the k-th node to those nodes having nonzero weight in w.
L1-regularized maximum likelihood learning
Input: sample covariance matrix S.
Assumes standardized data (mean = 0, variance = 1).
S is generally rank-deficient, so its inverse does not exist.
Output: sparse precision matrix Q.
Originally Q is defined as the inverse of S, but S is not directly invertible; we need to find a sparse matrix that can be thought of as an inverse of S.
Approach: solve an L1-regularized maximum likelihood equation (log-likelihood + regularizer):
\hat{Q} = \arg\max_{Q \succ 0} \; \log \det Q - \mathrm{tr}(S Q) - \lambda \|Q\|_1
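scikit-learn ships an implementation of this estimator as `sklearn.covariance.GraphicalLasso`; below is a small sketch on data simulated from a hypothetical chain-structured GGM (the penalty `alpha=0.05` is an arbitrary illustrative value, not a tuned one):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)

# Simulate from a 5-node chain GGM (illustrative precision matrix)
Q_true = np.diag([2.0] * 5) + np.diag([-0.8] * 4, 1) + np.diag([-0.8] * 4, -1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q_true), size=500)

# L1-regularized maximum likelihood:
#   max_Q  log det Q - tr(S Q) - alpha * ||Q||_1
model = GraphicalLasso(alpha=0.05).fit(X)
Q_hat = model.precision_

print(np.round(Q_hat, 2))  # large entries on the chain, near zero elsewhere
```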
From matrix optimization to vector optimization: coupled Lassos, one per variable.
Focus on one row (column) at a time, keeping the others constant.
The optimization problem for the blue vector can be shown to be a Lasso (L1-regularized quadratic program).
Difference from MB's method: the resulting Lasso problems are coupled. The gray part is not actually constant; it changes after each Lasso problem is solved, because it is the entire Q that optimizes a single loss function, whereas in MB each Lasso has its own loss function. This coupling is essential for stability under noise.
Learning Ising models (i.e., pairwise MRFs)
Assuming the nodes are discrete (e.g., the voting outcome of a person) and the edges are weighted, then for a sample x \in \{-1, +1\}^p we have
p(x | \Theta) = \frac{1}{Z(\Theta)} \exp\left\{ \sum_{i < j} \theta_{ij} x_i x_j \right\}
It can be shown that the pseudo-conditional likelihood for node k is logistic:
p(x_k | x_{-k}) = \frac{1}{1 + \exp\left( -2 x_k \sum_{j \ne k} \theta_{kj} x_j \right)}
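The logistic form of the node conditional can be checked exactly on a tiny Ising model by enumerating the two states of one spin (the 3-node weights below are arbitrary illustrative values):

```python
import numpy as np

# Tiny Ising model over spins x in {-1, +1}^3:
#   p(x) ∝ exp( sum_{i<j} theta_ij x_i x_j )   (illustrative weights)
Theta = np.array([[0.0,  0.8,  0.0],
                  [0.8,  0.0, -0.5],
                  [0.0, -0.5,  0.0]])

def score(x):
    # sum_{i<j} theta_ij x_i x_j (each pair counted once)
    return 0.5 * x @ Theta @ x

def cond_exact(k, x):
    # p(x_k = +1 | x_{-k}) by enumerating the two states of x_k
    xp, xm = x.copy(), x.copy()
    xp[k], xm[k] = 1.0, -1.0
    return np.exp(score(xp)) / (np.exp(score(xp)) + np.exp(score(xm)))

def cond_logistic(k, x):
    # Pseudo-likelihood form: logistic in the neighbors' field
    field = Theta[k] @ x            # diagonal of Theta is zero
    return 1.0 / (1.0 + np.exp(-2.0 * field))

x = np.array([1.0, -1.0, 1.0])
match = np.isclose(cond_exact(1, x), cond_logistic(1, x))
print(match)
```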
Question: vector-valued nodes
New problem: evolving social networks. Can I get his vote? Cooperativity, antagonism, cliques: how do they change over time? (March 2005, January 2006, August 2006)
Reverse engineering time-specific "rewiring" networks. [Figure: a time series t = 1, ..., t*, ..., T with n = 1 or some small number of samples per time point; Drosophila development.]
Inference I [Song, Kolar and Xing, Bioinformatics 2009]
KELLER: kernel-weighted L1-regularized logistic regression, a constrained convex optimization (Lasso).
Estimates time-specific networks one by one, based on "virtual iid" samples.
Can scale to ~10^4 genes, but under stronger smoothness assumptions.
Algorithm: nonparametric neighborhood selection
Conditional likelihood; neighborhood selection via time-specific graph regression: estimate the neighborhood of each node at the target time t* with an L1-penalized conditional likelihood in which each sample at time t is weighted by a kernel K centered at t*, normalized over the observed time points.
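A toy sketch of the kernel-weighting idea (not the paper's code): weight each time-stamped sample by a Gaussian kernel centered at the query time t*, then fit an L1-penalized logistic regression with those sample weights. The synthetic drift below (the response depends on feature 0 early and on feature 1 late) is a made-up example, and the bandwidth and regularization values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic time-stamped binary data with a made-up drift:
# the response depends on feature 0 early on and feature 1 later.
T = 400
t = np.linspace(0.0, 1.0, T)
X = rng.choice([-1.0, 1.0], size=(T, 2))
h = np.where(t < 0.5, 1.5 * X[:, 0], 1.5 * X[:, 1])
y = (rng.random(T) < 1.0 / (1.0 + np.exp(-2.0 * h))).astype(int)

def time_specific_fit(t_star, bandwidth=0.1, C=1.0):
    # Kernel weights: treat samples near t_star as "virtual iid" samples
    w = np.exp(-0.5 * ((t - t_star) / bandwidth) ** 2)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y, sample_weight=w)
    return clf.coef_.ravel()

coef_early = time_specific_fit(0.1)
coef_late = time_specific_fit(0.9)
print(coef_early, coef_late)  # the dominant feature switches over time
```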
Structural consistency of KELLER. Assumptions:
A1: dependency condition
A2: incoherence condition
A3: smoothness condition
A4: bounded kernel
Theorem [Kolar and Xing, 2009]
Inference II [Ahmed and Xing, PNAS 2009; AOAS 2009]
TESLA: temporally smoothed L1-regularized logistic regression, a constrained convex optimization.
Scales to ~5000 nodes; does not need a smoothness assumption; can accommodate abrupt changes.
Temporally Smoothed Graph Regression TESLA:
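The objective on this slide did not survive extraction; the TESLA estimator is commonly written as a per-node graph regression with an extra total-variation penalty coupling adjacent time points (the notation below is a reconstruction under that reading, not the slide's own):

```latex
\hat{\theta}_i^{\,1:T} \;=\; \operatorname*{argmin}_{\theta_i^1,\dots,\theta_i^T}\;
  \sum_{t=1}^{T} \ell\!\left(\theta_i^t;\, x^t\right)
  \;+\; \lambda_1 \sum_{t=1}^{T} \bigl\|\theta_i^t\bigr\|_1
  \;+\; \lambda_2 \sum_{t=2}^{T} \bigl\|\theta_i^t - \theta_i^{t-1}\bigr\|_1
```

where \ell is the logistic loss of regressing node i on the remaining nodes at time t; \lambda_1 controls sparsity and \lambda_2 controls the number of change points.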
Modified estimation procedure:
Step 1: estimate the block partition on which the coefficient functions are constant (*).
Step 2: estimate the coefficient functions on each block of the partition (**).
Structural consistency of TESLA [Kolar and Xing, 2009]
I. It can be shown that, by applying the model-selection results for the Lasso to a temporal-difference transformation of (*), the blocks are estimated consistently.
II. It can then be further shown that, by applying the Lasso to (**), the neighborhood of each node on each of the estimated blocks is recovered consistently.
Further advantages of the two-step procedure: choosing the tuning parameters is easier, and the optimization procedure is faster.
Senate network, 109th congress. Voting records from the 109th congress (2005-2006): 100 senators whose votes were recorded on 542 bills; each vote is a binary outcome.
Senate network, 109th congress: March 2005, January 2006, August 2006
Senator Chafee
Senator Ben Nelson at T = 0.2 and T = 0.8
Progression and reversion of breast cancer cells: S (normal), T4 (malignant), T4 revert 1, T4 revert 2, T4 revert 3
Estimate neighborhoods jointly across all cell types (S, T4, T4R1, T4R2, T4R3): how to share information across the cell types?
Sparsity of difference: penalize differences between the networks of adjacent cell types (S, T4, T4R1, T4R2, T4R3)
Tree-guided graphical lasso (Treegl): RSS for all cell types + sparsity + sparsity of difference
Network Overview T4 EGFR-ITGB S MMP PI3K-MAPKK
Interactions between biological processes. S cells vs. T4 cells: increased cell proliferation, growth, signaling, locomotion
Interactions between biological processes. T4 cells vs. MMP-T4R cells: significantly reduced interactions
Interactions between biological processes. T4 cells vs. PI3K-MAPKK-T4R cells: reduced growth, locomotion and signaling
Fancier network estimation scenarios:
Dynamic directed (auto-regressive) networks [Song, Kolar and Xing, NIPS 2009]
Missing data [Kolar and Xing, ICML 2012]
Multi-attribute data [Kolar, Liu and Xing, JMLR 2013]
Summary
Gaussian graphical models: the precision matrix encodes the structure; not estimable when p >> n.
Neighborhood selection: conditional distributions under a GGM/MRF; graphical lasso; sparsistency.
Time-varying Markov networks: kernel-reweighting estimators; total-variation estimators.