Regulatory Inference from Gene Expression
CMSC858P Spring 2012
Hector Corrada Bravo
Graphical Model

Let y be a vector-valued random variable. Suppose some conditional independence properties hold for some variables in y. Example: variables y_2 and y_3 are independent given the remaining variables in y. We can encode these conditional independence properties in a graph.
Graphical Model

Hammersley-Clifford theorem: all probability distributions that satisfy the conditional independence properties in a graph can be written as

    P(y) = (1/Z) exp( Σ_{c ∈ C} f_c(y_c) )

where C is the set of cliques in the graph, f_c is a potential function, and y_c are the variables in clique c.
Graphical Models

The probability distribution is determined by the choice of potential functions. Example:

1. f_c({y_i}) = -τ_ii² y_i² / 2
2. f_c({y_i, y_j}) = -τ_ij y_i y_j
3. f_c(y_c) = 0 for cliques of size ≥ 3.
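As a sanity check, the sum of these clique potentials is exactly the quadratic form -½ y^T Σ⁻¹ y of the Gaussian case that appears below. A minimal numerical sketch, with made-up τ values on a 4-node graph with edges (1,2), (2,3), (3,4), (1,4):

```python
import numpy as np

# Hypothetical tau values (chosen only for illustration)
tau_diag = np.array([1.2, 0.9, 1.1, 1.0])                      # tau_ii
edges = {(0, 1): 0.4, (1, 2): -0.3, (2, 3): 0.2, (0, 3): 0.5}  # tau_ij on graph edges

y = np.array([0.7, -1.0, 0.3, 2.0])

# Sum of clique potentials: singleton terms plus edge terms
# (potentials on cliques of size >= 3 are zero)
pot = -np.sum(tau_diag**2 * y**2) / 2
pot += -sum(t * y[i] * y[j] for (i, j), t in edges.items())

# The same value via the quadratic form with Sigma^{-1} built from the taus
P = np.diag(tau_diag**2)
for (i, j), t in edges.items():
    P[i, j] = P[j, i] = t
quad = -0.5 * y @ P @ y
```

The cross term works out because each edge {i, j} appears twice in y^T Σ⁻¹ y, cancelling the factor ½.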
Gaussian Graphical Models

Define the matrix Σ⁻¹ as:

1. Σ⁻¹_ii = τ_ii²
2. Σ⁻¹_ij = τ_ij if there is an edge between y_i and y_j, and 0 otherwise.

    Σ⁻¹ = [ τ11²  τ12   0     τ14
            τ12   τ22²  τ23   0
            0     τ23   τ33²  τ34
            τ14   0     τ34   τ44² ]
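To see the encoding at work: a zero entry in Σ⁻¹ means the corresponding pair of variables is conditionally independent given the rest, even though their marginal covariance (the matching entry of Σ) is generally nonzero. A small sketch with hypothetical values for the 4-node graph above:

```python
import numpy as np

# Hypothetical precision matrix for the slide's graph:
# edges (1,2), (2,3), (3,4), (1,4); no edge 1-3 or 2-4.
P = np.array([
    [2.0, 0.6, 0.0, 0.3],
    [0.6, 2.0, 0.5, 0.0],
    [0.0, 0.5, 2.0, 0.4],
    [0.3, 0.0, 0.4, 2.0],
])  # plays the role of Sigma^{-1}; diagonally dominant, hence positive definite

Sigma = np.linalg.inv(P)

# Zero precision entry: y1 and y3 are conditionally independent given y2, y4...
cond_indep_13 = abs(P[0, 2]) < 1e-12
# ...but their marginal covariance is still nonzero (paths 1-2-3 and 1-4-3 exist)
marginal_cov_13 = Sigma[0, 2]
```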
Network Discovery

The gene expression vector y ∈ R^p is assumed to be distributed as y ~ N(0, Σ). The likelihood given the data (n arrays) is then

    l(Σ) = ((2π)^p det(Σ))^(-n/2) exp( -½ Σ_{i=1}^n y_i^T Σ⁻¹ y_i )
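This likelihood can be checked numerically against scipy's multivariate normal density; the Σ and sample size here are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p, n = 3, 50
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Log-likelihood written out directly from the formula above
Prec = np.linalg.inv(Sigma)
quad = sum(y @ Prec @ y for y in Y)
loglik = -0.5 * n * np.log((2 * np.pi) ** p * np.linalg.det(Sigma)) - 0.5 * quad

# Cross-check against scipy's multivariate normal log-density
ref = multivariate_normal(mean=np.zeros(p), cov=Sigma).logpdf(Y).sum()
```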
BIOINFORMATICS Vol. 19 Suppl. 1 2003, pages i273-i282, DOI: 10.1093/bioinformatics/btg1038
Genome-wide discovery of transcriptional modules from DNA sequence and gene expression
E. Segal, R. Yelensky and D. Koller, Computer Science Department, Stanford University, Stanford, CA 94305-9010, USA
Module Networks: Segal 2003
Module Networks: Segal 2003

[Figure: module network model with sequence variables g.S1-g.S3, regulator variables g.R1, g.R2, module assignment g.M, and expression variables g.E1-g.E3, related through CPDs P(g.R | g.S) and P(g.E | g.M) (CPD 1-3).]
Module Networks: Segal 2003
Segal and Widom, 2009
Network Discovery

Recent efforts discover networks directly from expression data. Main idea: treat the entire vector of gene expression for a given sample as multivariate normal. A sparse (inverse) covariance matrix induces a network. This is related to Gaussian graphical models. Banerjee et al. [ICML 2006, JMLR 2008], Friedman et al. [Biostatistics 2007].
Gaussian Graphical Model

We can write P as

    P(y) = (1/Z) exp( -y^T Σ⁻¹ y / 2 )

By a nice property of exponential families, this makes y multivariate normal, which implies Z = ((2π)^p det(Σ))^(1/2).
ML Estimation

The maximum likelihood estimate is given by the inverse of the solution to

    max_{X ≻ 0}  log det X - tr(SX)

where S is the sample covariance, S = (1/n) Σ_{i=1}^n y_i y_i^T.
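A quick numerical sketch (with simulated data) that X = S⁻¹ maximizes this objective: the gradient is X⁻¹ - S, which vanishes at S⁻¹, and by strict concavity any symmetric perturbation lowers the objective:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 200
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n  # sample covariance (zero-mean data)

def objective(X):
    # log det X - tr(S X): the log-likelihood up to constants and scaling
    return np.linalg.slogdet(X)[1] - np.trace(S @ X)

# Stationary point: gradient X^{-1} - S = 0 at X = S^{-1},
# so the ML estimate of Sigma (the inverse of the maximizer) is S itself.
X_hat = np.linalg.inv(S)
best = objective(X_hat)

# A small symmetric perturbation should strictly decrease the objective
D = rng.standard_normal((p, p)) * 0.01
D = (D + D.T) / 2
perturbed = objective(X_hat + D)
```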
Conditional Independence

Conditional independence is given by a sparse inverse covariance:

    Σ⁻¹ = [ τ11²  τ12   0     τ14
            τ12   τ22²  τ23   0
            0     τ23   τ33²  τ34
            τ14   0     τ34   τ44² ]
Sparsity-inducing penalty

Use a penalized likelihood method to induce sparsity:

    max_{X ≻ 0}  log det X - tr(SX) - λ‖X‖₁

where ‖X‖₁ = Σ_ij |X_ij|.
Block-coordinate ascent

Solve by maximizing over one column of the matrix at a time. Partition the current solution W and the sample covariance S as

    W = [ W11    w12 ]      S = [ S11    s12 ]
        [ w12^T  w22 ]          [ s12^T  s22 ]

It turns out this is equivalent to solving

    min_β  ½ ‖ W11^(1/2) β - z ‖² + λ‖β‖₁,    z = W11^(-1/2) s12

The solution is then w12 = W11 β.
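One consistency check on this subproblem (with hypothetical simulated data): with no penalty (λ = 0) and current solution W = S, the minimizer is β = S11⁻¹ s12, so the update w12 = S11 β simply returns s12 and the unpenalized MLE W = S is a fixed point of the block updates:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 100
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n

# Partition as on the slide: split off the last row/column
S11, s12 = S[:-1, :-1], S[:-1, -1]

# With lambda = 0 and W11 = S11, minimizing 0.5*||S11^{1/2} b - z||^2
# with z = S11^{-1/2} s12 gives b = S11^{-1} s12 exactly
b = np.linalg.solve(S11, s12)

# The column update then reproduces s12
w12 = S11 @ b
```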
Block-coordinate ascent

So we have an l1-regularized regression (lasso) problem at each step:

    min_β  ½ ‖ W11^(1/2) β - z ‖² + λ‖β‖₁

Each of these can be solved via coordinate descent as before. Recall: soft-thresholding at each step.
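A minimal coordinate-descent lasso solver built on soft-thresholding (a generic sketch, not the exact inner loop of any particular glasso implementation; function and variable names are ours):

```python
import numpy as np

def soft_threshold(x, t):
    # S(x, t) = sign(x) * max(|x| - t, 0)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(A, z, lam, n_iter=200):
    # Coordinate descent for min_b 0.5*||A b - z||^2 + lam*||b||_1
    p = A.shape[1]
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j's contribution removed
            r = z - A @ b + A[:, j] * b[j]
            b[j] = soft_threshold(A[:, j] @ r, lam) / (A[:, j] @ A[:, j])
    return b

# With an orthonormal design the lasso solution is soft-thresholding of A^T z,
# so the answer here is known in closed form: (2, 0, -1)
b_hat = lasso_cd(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
```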
[Figure: subnetworks recovered from expression data; one group of genes associated with iron homeostasis, another with cellular membrane fusion.]
Summary

- Using the Gaussian graphical model representation (a multivariate normal probability over a sparse graph), we can take the resulting graph as a gene network
- Use sparsity-inducing regularization (the l1 norm)
- The block-coordinate ascent method leads to an l1-regularized regression at each step
- Each regression can be solved with efficient coordinate descent (soft-thresholding)