Probabilistic Graphical Models

School of Computer Science, Carnegie Mellon University
Probabilistic Graphical Models
Gaussian graphical models and Ising models: modeling networks
Eric Xing, February 2016
Reading: see class website

Network research: social networks, the Internet, regulatory networks.

Where do networks come from? Example: the "Jesus network".

Evolving networks: Can I get his vote? Corporativity, antagonism, cliques, and how they change over time (snapshots: March 2005, January 2006, August 2006).

Evolving networks: a sequence of snapshots at times t = 1, 2, 3, ..., T.

Recall: ML structural learning for completely observed GMs. Given data, i.e., M i.i.d. samples $(x_1^{(1)},\ldots,x_n^{(1)}),\;\ldots,\;(x_1^{(M)},\ldots,x_n^{(M)})$, learn the graph structure (e.g., a network over nodes E, R, B, A, C) by maximum likelihood.

Two "optimal" approaches
"Optimal" here means the employed algorithms are guaranteed to return a structure that maximizes the objective (e.g., log-likelihood).
Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, and some do not even have an explicit objective, e.g., structured EM, module networks, greedy structural search, etc.
We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they only apply to certain families of graphs:
- Trees: the Chow-Liu algorithm (this lecture)
- Pairwise MRFs: covariance selection, neighborhood selection (later)

Key idea: network inference as parameter estimation.

Model: pairwise Markov random fields
Nodal states can be either discrete (Ising/Potts model), continuous (Gaussian graphical model), or heterogeneous.
The parameter matrix encodes the graph structure.
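For concreteness (the exact formulas are not reproduced in this transcription), a pairwise MRF over x = (x_1, ..., x_n) can be written as follows; in both cases an edge (i, j) is present exactly when the corresponding off-diagonal parameter is nonzero:

```latex
% Pairwise MRF family (sketch). Discrete case (Ising model), x_i in {-1,+1}:
p(x;\Theta) \;\propto\; \exp\Big(\sum_i \theta_i x_i \;+\; \sum_{(i,j)\in E} \theta_{ij}\, x_i x_j\Big)
% Continuous case (Gaussian graphical model), x in R^n with precision matrix Q = Sigma^{-1}:
p(x;Q) \;\propto\; \exp\Big(-\tfrac{1}{2}\, x^{\top} Q\, x\Big)
% The parameter matrix (theta_{ij}, resp. q_{ij}) encodes the graph:
% edge (i,j) present  <=>  off-diagonal parameter nonzero.
```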

Recall: multivariate Gaussian
Multivariate Gaussian density:
$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\Big)$$
WLOG let $\mu = 0$ and write $Q = \Sigma^{-1}$. We can view this as a continuous Markov random field with potentials defined on every node and edge:
$$p(x \mid 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\Big(-\tfrac{1}{2}\sum_i q_{ii} x_i^2 \;-\; \sum_{i<j} q_{ij} x_i x_j\Big)$$

Gaussian graphical model: encodes dependencies among genes (illustrated with microarray samples across cell types).

The precision matrix encodes the non-zero edges in a Gaussian graphical model: an edge between two nodes corresponds to a nonzero element of the precision matrix.
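A small numerical illustration of this correspondence (not from the slides; the 3-node chain and all values are made up): the precision matrix of a chain 1-2-3 has a zero in position (1,3), the covariance Sigma = Q^{-1} is nonetheless dense, and conditioning on node 2 makes nodes 1 and 3 uncorrelated.

```python
import numpy as np

# Precision matrix of a 3-node chain 1-2-3: entry (1,3) is zero (no edge between 1 and 3).
Q = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Q)            # covariance is dense, so marginal correlations are all nonzero
print(np.round(Sigma, 3))

# Conditional covariance of (x1, x3) given x2: Schur complement of Sigma.
idx_a, idx_b = [0, 2], [1]
S_aa = Sigma[np.ix_(idx_a, idx_a)]
S_ab = Sigma[np.ix_(idx_a, idx_b)]
S_bb = Sigma[np.ix_(idx_b, idx_b)]
cond_cov = S_aa - S_ab @ np.linalg.inv(S_bb) @ S_ab.T
print(np.round(cond_cov, 3))        # off-diagonal is 0: x1 independent of x3 given x2, matching Q[0,2] = 0
```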

Markov versus correlation network
- A correlation network is based on the covariance matrix.
- A GGM is a Markov network based on the precision matrix; conditional independence / partial correlation coefficients are a more refined dependence measure.
- With small sample size, the empirical covariance matrix cannot be inverted.

Sparsity
One common assumption to make: sparsity.
- Makes empirical sense: genes are assumed to interact with only small groups of other genes.
- Makes statistical sense: learning becomes feasible in high dimensions with small sample size.

Network learning with the LASSO
- Assume the network is a Gaussian graphical model.
- Perform LASSO regression of all other nodes onto a target node.

Network learning with the LASSO: the LASSO selects the neighborhood of each node.

L1 regularization (LASSO): a convex relaxation.
Constrained form: $\min_\beta \|y - X\beta\|_2^2 \;\text{ s.t. }\; \|\beta\|_1 \le t$
Lagrangian form: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Enforces sparsity!

Theoretical guarantees. Assumptions:
- Dependency condition: relevant covariates are not overly dependent.
- Incoherence condition: the large number of irrelevant covariates cannot be too correlated with the relevant covariates.
- Strong concentration bounds: sample quantities converge to their expected values quickly.
If these assumptions are met, the LASSO will asymptotically recover the correct subset of relevant covariates.

Network learning with the LASSO: repeat this regression for every node, then form the total edge set (a code sketch follows below).
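A minimal sketch of this per-node procedure (assuming rows of X are i.i.d. samples over p nodes; the penalty level lam, the exact-zero thresholding, and the OR rule for combining the two regressions touching an edge are illustrative choices, not prescriptions from the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_select_edges(X, lam=0.1):
    """Lasso-regress each node on all others; return the union ("OR" rule) of selected edges."""
    n, p = X.shape
    edges = set()
    for j in range(p):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        coef = Lasso(alpha=lam, max_iter=10000).fit(Z, y).coef_
        for k in np.flatnonzero(coef):
            other = k if k < j else k + 1       # map back to the original node index
            edges.add((min(j, other), max(j, other)))
    return edges

# toy usage: data from a known sparse 3-node chain GGM; expect roughly {(0, 1), (1, 2)}
rng = np.random.default_rng(0)
Q_true = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Q_true), size=500)
print(neighborhood_select_edges(X))
```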

Consistent structure recovery [Meinshausen and Bühlmann 2006, Wainwright 2009]: if the sample size is large enough relative to the dimension and the maximum node degree (roughly, $n \gtrsim d \log p$ under the conditions above), then with high probability the estimated edge set equals the true edge set.

Why does this algorithm work? What is the intuition behind graphical regression?
- Continuous nodal attributes
- Discrete nodal attributes
- Are there other algorithms?
- More general scenarios: non-i.i.d. samples and evolving networks
- Case study

Multivariate Gaussian
Multivariate Gaussian density:
$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\Big)$$
A joint Gaussian over $(x_1, x_2)$: $p(x_1, x_2) = \mathcal{N}\!\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right)$.
How do we write down $p(x_1)$, $p(x_2)$ or $p(x_1 \mid x_2)$ using the block elements of $\mu$ and $\Sigma$? Formulas to remember:
$$p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), \quad m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \quad V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

The matrix inverse lemma
Consider a block-partitioned matrix $M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$.
First we diagonalize $M$:
$$\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix}\begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}$$
Schur complement: $M/H = E - FH^{-1}G$.
Then we invert, using $(XYZ)^{-1} = Z^{-1}Y^{-1}X^{-1}$:
$$M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{pmatrix}$$
Matrix inverse lemma: $(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}$

The covariance and the precision matrices
Applying the block inversion to the covariance matrix relates the blocks of $\Sigma$ to the blocks of the precision matrix $Q = \Sigma^{-1}$:
$$\Sigma^{-1} = \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & -(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix}$$
In particular, when the first block is a single variable, the precision matrix can be partitioned as $Q = \begin{pmatrix} q_{11} & q_{1\cdot}^\top \\ q_{1\cdot} & Q_{\cdot\cdot} \end{pmatrix}$.

Single-node conditional
The conditional distribution of a single node $x_1$ given the rest of the nodes $x_2$ can be written as $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2})$. WLOG let $\mu = 0$; expressing $m_{1|2}$ and $V_{1|2}$ through the blocks of the precision matrix $Q = \begin{pmatrix} q_{11} & q_{1\cdot}^\top \\ q_{1\cdot} & Q_{\cdot\cdot} \end{pmatrix}$ gives
$$p(x_1 \mid x_2) = \mathcal{N}\!\Big(x_1 \;\Big|\; -\tfrac{1}{q_{11}}\, q_{1\cdot}^\top x_2,\; \tfrac{1}{q_{11}}\Big)$$

Conditional auto-regression
From the single-node conditional above, we can write a conditional auto-regression function for each node (written out below); neighborhood estimation is then based on the auto-regression coefficients.
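Written out explicitly (a standard reconstruction consistent with the single-node conditional above; the formula itself is not legible in this transcription):

```latex
% Conditional auto-regression implied by the GGM, for each node i (with mu = 0):
x_i \mid x_{-i} \;\sim\; \mathcal{N}\!\Big(\textstyle\sum_{j \neq i} \beta_{ij}\, x_j,\; q_{ii}^{-1}\Big),
\qquad \beta_{ij} \;=\; -\,\frac{q_{ij}}{q_{ii}}.
% beta_ij is nonzero exactly when q_ij is nonzero, i.e., when j is a neighbor of i,
% so the neighborhood of node i can be read off the auto-regression coefficients.
```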

Conditional independence
From this auto-regression, given an estimate of the neighborhood $s_i$ of node $i$, we have $x_i \perp x_j \mid x_{s_i}$ for every node $j \notin s_i \cup \{i\}$. Thus the neighborhood $s_i$ defines the Markov blanket of node $i$.

Recent trends in GGM learning
Covariance selection (classical method):
- Dempster [1972]: sequentially pruning the smallest elements of the precision matrix
- Drton and Perlman [2008]: improved statistical tests for pruning
- Serious limitation in practice: breaks down when the covariance matrix is not invertible
L1-regularization based methods (hot!):
- Meinshausen and Bühlmann [Ann. Stat. 06]: LASSO regression for neighborhood selection
- Banerjee et al. [JMLR 08]: block sub-gradient algorithm for finding the precision matrix
- Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm
- Structure learning is possible even when #variables > #samples

The Meinshausen-Bühlmann (MB) algorithm: solve a separate Lasso for every single variable.
- Step 1: pick one variable (the k-th node).
- Step 2: think of it as y, and the rest as Z.
- Step 3: solve the Lasso regression problem of y on Z. The resulting coefficients do not correspond to the Q values element-wise.
- Step 4: connect the k-th node to those nodes having nonzero weight in the estimated coefficient vector w.

L1-regularized maximum likelihood learning
Input: sample covariance matrix S (assumes standardized data: mean 0, variance 1). S is generally rank-deficient, so its inverse does not exist.
Output: sparse precision matrix Q. Originally Q is defined as the inverse of S, but S is not directly invertible; we need to find a sparse matrix that can be thought of as an inverse of S.
Approach: solve an L1-regularized maximum likelihood problem,
$$\hat{Q} = \arg\max_{Q \succ 0}\; \underbrace{\log\det Q - \mathrm{tr}(SQ)}_{\text{log-likelihood}} \;-\; \underbrace{\lambda \|Q\|_1}_{\text{regularizer}}$$
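A minimal sketch using scikit-learn's GraphicalLasso, one off-the-shelf solver for this penalized-likelihood problem (the toy 4-node chain, the sample size, and the value of alpha are illustrative):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# toy data from a sparse 4-node chain GGM (edges 0-1, 1-2, 2-3 only)
rng = np.random.default_rng(0)
Q_true = np.array([[ 2., -1.,  0.,  0.],
                   [-1.,  2., -1.,  0.],
                   [ 0., -1.,  2., -1.],
                   [ 0.,  0., -1.,  2.]])
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Q_true), size=1000)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize: mean 0, variance 1

# fits the L1-penalized Gaussian maximum-likelihood (graphical lasso) objective
model = GraphicalLasso(alpha=0.05).fit(X)
print(np.round(model.precision_, 2))          # non-chain entries, e.g. (0, 2) and (0, 3), should be near zero
```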

From matrix optimization to vector optimization: coupled Lasso problems for every single variable.
- Focus on one row (column), keeping the others constant.
- The optimization problem for that row/column can be shown to be a Lasso problem (L1-regularized quadratic programming).
- Difference from MB's approach: the resulting Lasso problems are coupled. The remaining part of the matrix is not actually constant; it changes after solving each Lasso problem (because it is the entire Q that optimizes a single loss function, whereas in MB each Lasso has its own loss function). This coupling is essential for stability under noise.

Learning Ising models (i.e., pairwise MRFs over discrete nodes)
Assuming the nodes are discrete (e.g., the voting outcome of a person) and the edges are weighted, then for a sample $x \in \{-1,+1\}^p$ we have
$$p(x \mid \Theta) \propto \exp\Big(\sum_i \theta_i x_i + \sum_{(i,j)} \theta_{ij} x_i x_j\Big)$$
It can be shown that the pseudo-conditional likelihood for node $k$ is a logistic function of the other nodes,
$$p(x_k \mid x_{-k}) = \frac{1}{1 + \exp\big(-2 x_k (\theta_k + \sum_{j \neq k} \theta_{kj} x_j)\big)}$$
so each node's neighborhood can be estimated by an L1-regularized logistic regression (a sketch follows below).
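A minimal sketch of this neighborhood-selection idea for ±1-valued nodes via per-node L1-regularized logistic regression (in the spirit of the l1-regularized logistic regression approach to Ising model selection; the toy chain data, the value of C, and the zero threshold are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, C=0.5):
    """X: n x p matrix with entries in {-1,+1}. Returns an estimated neighbor list per node."""
    n, p = X.shape
    neighbors = {}
    for k in range(p):
        y = X[:, k]
        Z = np.delete(X, k, axis=1)
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(Z, y)
        w = clf.coef_.ravel()
        neighbors[k] = [(j if j < k else j + 1) for j in np.flatnonzero(np.abs(w) > 1e-6)]
    return neighbors

# toy usage: a Markov-chain sample (equivalent to an Ising chain 0-1-2) to exercise the code
rng = np.random.default_rng(0)
x1 = rng.choice([-1, 1], size=2000)
x2 = np.where(rng.random(2000) < 0.8, x1, -x1)   # x2 agrees with x1 80% of the time
x3 = np.where(rng.random(2000) < 0.8, x2, -x2)   # x3 agrees with x2 80% of the time
X = np.column_stack([x1, x2, x3])
print(ising_neighborhoods(X))                     # expect roughly {0: [1], 1: [0, 2], 2: [1]}
```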

Question: what about vector-valued nodes?

New problem: evolving social networks. Can I get his vote? Corporativity, antagonism, cliques, and how they change over time (snapshots: March 2005, January 2006, August 2006).

Reverse engineering time-specific "rewiring" networks: at each time point $t^* \in [0, T]$ only n = 1 (or some other small number of) samples are available. Example: networks across Drosophila development.

Inference I [Song, Kolar and Xing, Bioinformatics 09]
KELLER: kernel-weighted L1-regularized logistic regression; a constrained convex (Lasso-style) optimization. Estimate time-specific networks one by one, based on "virtual i.i.d." samples. Can scale to ~10^4 genes, but under stronger smoothness assumptions.

Algorithm: nonparametric neighborhood selection
Estimate a time-specific graph by solving, for each node, a kernel-reweighted L1-regularized regression of that node's conditional likelihood at the target time, where each observation is weighted by a kernel centered at that time (a sketch follows below).
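A minimal sketch of the kernel-reweighting step for one node at one target time t* (assumptions: X holds ±1-valued observations with timestamps `times` in [0, 1]; the Gaussian kernel, bandwidth h, and penalty level C are illustrative; the helper name keller_neighborhood is hypothetical, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def keller_neighborhood(X, times, node, t_star, h=0.1, C=1.0):
    """Kernel-weighted L1-regularized logistic regression for one node at time t_star.

    X: n x p array with entries in {-1,+1}; times: length-n array of observation times.
    Returns the regression weights over the other p-1 nodes at time t_star.
    """
    w = np.exp(-0.5 * ((times - t_star) / h) ** 2)   # Gaussian kernel weights ("virtual iid" sample)
    y = X[:, node]
    Z = np.delete(X, node, axis=1)
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(Z, y, sample_weight=w)                    # weighted pseudo-likelihood + L1 penalty
    return clf.coef_.ravel()

# usage sketch: nonzero entries of the returned vector give the node's neighbors at time t_star
# coefs = keller_neighborhood(X, times, node=0, t_star=0.5)
```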

Structural consistency of KELLER. Assumptions (define):
- A1: dependency condition
- A2: incoherence condition
- A3: smoothness condition
- A4: bounded kernel

Theorem [Kolar and Xing, 09].

Inference II [Amr and Xing, PNAS 2009; AOAS 2009]
TESLA: temporally smoothed L1-regularized logistic regression; a constrained convex optimization. Scales to ~5000 nodes, does not need a smoothness assumption, and can accommodate abrupt changes.

Temporally smoothed graph regression (TESLA); the objective is sketched below.
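One hedged way to write the per-node TESLA objective as described on this and the previous slide (a sketch with illustrative notation, not the paper's exact formulation):

```latex
% Temporally smoothed, L1-regularized neighborhood estimation (sketch), node i:
\hat{\theta}_i^{1:T} \;=\; \arg\min_{\theta_i^{1:T}}\;
  \sum_{t=1}^{T} \ell\big(\theta_i^{t};\, x^{t}\big)
  \;+\; \lambda_1 \sum_{t=1}^{T} \big\|\theta_i^{t}\big\|_1
  \;+\; \lambda_2 \sum_{t=2}^{T} \big\|\theta_i^{t} - \theta_i^{t-1}\big\|_1
% \ell is the (logistic) negative log pseudo-likelihood at epoch t;
% the third term (total variation in time) keeps adjacent networks similar
% while still allowing abrupt changes.
```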

Modified estimation procedure:
(*) estimate the block partition on which the coefficient functions are constant;
(**) estimate the coefficient functions on each block of the partition.

Structural consistency of TESLA [Kolar and Xing, 2009]
I. It can be shown that, by applying the model-selection results for the Lasso to a temporal-difference transformation of (*), the blocks are estimated consistently.
II. It can then be further shown that, by applying the Lasso to (**), the neighborhood of each node on each of the estimated blocks is recovered consistently.
Further advantages of the two-step procedure: choosing the tuning parameters is easier, and the optimization procedure is faster.

Senate network, 109th Congress
Voting records from the 109th Congress (2005-2006). There are 100 senators whose votes were recorded on 542 bills; each vote is a binary outcome.

Senate network, 109th Congress: estimated networks for March 2005, January 2006, and August 2006.

Senator Chafee.

Senator Ben Nelson: estimated neighborhoods at T = 0.2 and T = 0.8.

Progression and reversion of breast cancer cells: S1 (normal), T4 (malignant), T4 revert 1, T4 revert 2, T4 revert 3.

Estimate neighborhoods jointly across all cell types (S1, T4, T4R1, T4R2, T4R3): how can we share information across the cell types?

Sparsity of difference: penalize differences between the networks of adjacent cell types (S1, T4, T4R1, T4R2, T4R3).

Tree-guided graphical lasso (Treegl): the objective combines the residual sum of squares for all cell types, a sparsity penalty, and a sparsity-of-difference penalty (sketched below).
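A hedged sketch of an objective combining the three terms named on the slide (loss per cell type, sparsity, and sparsity of differences along the cell-lineage tree); the notation is illustrative rather than the paper's:

```latex
% Tree-guided graphical lasso (sketch): one regression problem per node,
% solved jointly over cell types c in {S1, T4, T4R1, T4R2, T4R3} arranged in a tree.
\min_{\{\beta^{c}\}}\;
  \sum_{c} \big\| y^{c} - X^{c}\beta^{c} \big\|_2^2
  \;+\; \lambda_1 \sum_{c} \big\|\beta^{c}\big\|_1
  \;+\; \lambda_2 \sum_{(c,c') \in \mathrm{tree}} \big\|\beta^{c} - \beta^{c'}\big\|_1
% first term: residual sum of squares for all cell types;
% second term: sparsity of each network;
% third term: sparsity of the difference between networks of adjacent cell types.
```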

Network overview: estimated networks for the S1 and T4 cells and for the revertants (MMP, EGFR-ITGB1, PI3K-MAPKK).

Interactions between biological processes, S1 cells to T4 cells: increased cell proliferation, growth, signaling, and locomotion.

Interactions between biological processes, T4 cells to MMP-T4R cells: significantly reduced interactions.

Interactions between biological processes, T4 cells to PI3K-MAPKK-T4R cells: reduced growth, locomotion, and signaling.

Fancier network estimation scenarios:
- Dynamic directed (auto-regressive) networks [Song, Kolar and Xing, NIPS 2009]
- Missing data [Kolar and Xing, ICML 2012]
- Multi-attribute data [Kolar, Liu and Xing, JMLR 2013]

Summary
- Gaussian graphical model: the precision matrix encodes the structure; not estimable when p >> n.
- Neighborhood selection: conditional distributions under a GGM/MRF.
- Graphical lasso.
- Sparsistency.
- Time-varying Markov networks: kernel-reweighting estimators, total-variation estimators.