Priors in Dependency network learning


1 Priors in Dependency network learning Sushmita Roy Computational Network Biology Biostatistics & Medical Informatics 826 / Computer Sciences 838 https://compnetbiocourse.discovery.wisc.edu

2 Plan for this section We will look at three approaches to integrating other types of data to better learn regulatory networks: Physical Module Networks (Sep 27th, Oct 3rd), Bayesian network structure prior distributions (Oct 4th, 6th), and dependency network parameter prior distributions (Oct 6th, 11th).

3 Goals for this lecture Incorporating priors in dependency networks: linear regression; regularized linear regression; using the regularized linear regression framework to incorporate prior knowledge.

4 Problem overview How do we learn a transcriptional regulatory network (gene regulatory network, GRN) from gene expression data together with complementary data that supports the presence of an edge, such as the presence of a sequence motif on a gene's promoter, or ChIP-seq binding of factor X on gene Y's promoter? Place a prior on the graph, where the prior is obtained from the complementary data.

5 Recall: Dependency network A type of probabilistic graphical model. Dependency networks approximate Markov networks and are much easier to learn from data. As in Bayesian networks, a dependency network has a graph structure and parameters capturing dependencies between a variable and its parents. Unlike a Bayesian network, it can have cyclic dependencies, and computing a joint probability is harder: it is approximated with a pseudo-likelihood. (Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)

6 Learning dependency networks One can think of this problem as estimating the Markov blanket of each random variable. Let $B_j$ denote the Markov blanket of a variable $X_j$, with $f_j = P(X_j \mid B_j)$. $B_j$ is the set of variables that makes $X_j$ independent of all other variables $X_{-j}$: $P(X_j \mid X_{-j}) = P(X_j \mid B_j)$. $B_j$ can be estimated by finding the set of variables that best predict $X_j$. This requires us to specify the form of $P(X_j \mid B_j)$.
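
As an illustration of this recipe, here is a minimal sketch (not the lecture's own code) that estimates each variable's Markov blanket by sparse per-gene regression, using scikit-learn's LassoCV; the function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def learn_dependency_network(X, gene_names):
    """Estimate each gene's Markov blanket by regressing it on all
    other genes with an L1-penalized linear model.

    X: (samples x genes) expression matrix.
    Returns a dict mapping each gene to its selected neighbors.
    """
    n_samples, n_genes = X.shape
    blankets = {}
    for j in range(n_genes):
        y = X[:, j]
        others = np.delete(X, j, axis=1)          # all variables except X_j
        model = LassoCV(cv=5).fit(others, y)      # sparse predictive model
        selected = np.nonzero(model.coef_)[0]
        # Map column indices back to gene names (gene j itself was dropped)
        names = [g for i, g in enumerate(gene_names) if i != j]
        blankets[gene_names[j]] = [names[i] for i in selected]
    return blankets
```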

7 Linear regression with one predictor Suppose we have an output variable $y$ that we wish to predict with one input $x$. Linear regression assumes that $y$ is a linear function of $x$: $y = \beta_1 x + \beta_0$, where $\beta_1$ is the slope and $\beta_0$ the intercept.

8 Linear regression with p predictors Suppose we have $N$ samples of input-output pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where each $x_i = (x_{i1}, \ldots, x_{ip})$ is $p$-dimensional; that is, we have $p$ different features/predictors. A linear regression model with $p$ features is $y_i = \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i$, where $\beta_0$ is the intercept and the $\beta_j$ are the regression coefficients. Learning the linear regression model requires us to find the parameters that minimize prediction error.

9 Linear regression with p predictors Learning a regression model requires us to find the regression weights that minimize the prediction error, the residual sum of squared errors (RSS): $\min_{\beta_0, \beta_j} \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2$, with $\beta = \{\beta_0, \beta_1, \ldots, \beta_p\}$. To find $\beta$ we differentiate the RSS with respect to each parameter, set the derivative to 0, and solve. OLS estimate: $\hat{\beta}_j = \frac{\sum_{i=1}^{N} (y_i - \beta_0) x_{ij}}{\sum_{i=1}^{N} x_{ij}^2}$.

10 In matrix notation It is easier to think in matrix form: stack the outputs into a vector $y = (y_1, \ldots, y_N)^T$ and the inputs into the design matrix $X$, whose $i$-th row is $(1, x_{i1}, x_{i2}, \ldots, x_{ip})$, with $\beta = (\beta_0, \ldots, \beta_p)^T$, so that $Y = X\beta$. Then $\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$, the matrix analogue of the square of $y - X\beta$.

11 Estimating $\beta$ by minimizing RSS $\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta} = -2 X^T (y - X\beta)$. Setting $-2 X^T (y - X\beta) = 0$ gives $X^T y - X^T X \beta = 0$, so $X^T X \beta = X^T y$ and $\beta = (X^T X)^{-1} X^T y$. This works well when $X^T X$ is invertible, but often it is not. Then we need to regularize or add a prior.
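
In code, the closed-form estimate is essentially one line; a NumPy sketch (using lstsq, which avoids explicitly inverting $X^T X$):

```python
import numpy as np

def ols(X, y):
    # beta = (X^T X)^{-1} X^T y; lstsq computes a least-squares solution
    # without forming the inverse, so it also copes with the near-singular
    # case that motivates regularization below.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```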

12 Regularized regression The least squares solution is often not satisfactory. Prediction accuracy has high variance: small variations in the training set can result in very different answers. Interpretation is not easy: ideally, we would like a model that is both predictive and interpretable. The regularized regression framework can be generally described as follows: $\min_{\beta_0, \beta} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2 + \lambda f(\beta)$, where $\lambda f(\beta)$ is the regularization term. Depending upon $f$ we may have different types of regularized regression frameworks.

13 Regularized regression $f(\beta)$ takes the form of some norm of $\beta$: the L1 norm $\sum_{j=1}^{p} |\beta_j|$ or the L2 norm $\sum_{j=1}^{p} \beta_j^2$.

14 Ridge regression The simplest type of regularized regression is called ridge regression; it has the effect of smoothing out the regression weights: $\min_{\beta_0, \beta_j} \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$. It is often convenient to center the output (mean = 0) and standardize the predictors (mean = 0, variance = 1), which drops the intercept: $\min_{\beta_j} \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$.
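
The ridge objective also has a closed-form minimizer, $\beta = (X^T X + \lambda' I)^{-1} X^T y$; a minimal NumPy sketch, where the penalty argument absorbs the $1/(2N)$ scaling of the objective:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.
    # Assumes y is centered and the columns of X are standardized,
    # so no intercept is fit; lam absorbs the 1/(2N) factor above.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```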

15 LASSO regression Ridge regression handles the high-variance case and is suitable when there are correlated predictors, but it does not give an interpretable (sparse) model. The LASSO regression model was developed to learn a sparse model: $\min_{\beta_0, \beta_j} \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$, or after standardization: $\min_{\beta_j} \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$.

16 Learning the regression weights in LASSO Due to the absolute value in the objective function, the derivative is not defined at 0; that is, the derivative of $|\beta_j|$ at $\beta_j = 0$ is not defined. To address this, we consider the possible signs of the regression weight.

17 Learning the regression weights in LASSO To handle the discontinuity in the L1 norm, we consider the possible signs of $\beta_j$: $\beta_j = \frac{1}{N} \sum_{i=1}^{N} y_i x_{ij} - \lambda$ if $\beta_j > 0$; $\beta_j = \frac{1}{N} \sum_{i=1}^{N} y_i x_{ij} + \lambda$ if $\beta_j < 0$; $\beta_j = 0$ otherwise. Notice that the regularization term $\lambda$ controls the extent to which $\beta_j$ is pushed to 0.
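
The three sign cases collapse into the soft-thresholding operator; a one-line sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    # sign(z) * max(|z| - lam, 0); here z plays the role of
    # (1/N) * sum_i y_i * x_ij in the update above.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```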

18 Cyclic coordinate descent to learn LASSO regression weights To estimate the regression weights in LASSO, we cycle through each regression weight, setting it to its optimal value while keeping the others constant. That is, we rewrite the objective as $\frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \sum_{k \neq j} x_{ik} \beta_k - x_{ij} \beta_j \big)^2 + \lambda \sum_{k \neq j} |\beta_k| + \lambda |\beta_j|$. We take the derivative with respect to one $\beta_j$ at a time and set it to its optimal value.
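
Putting the soft-threshold update together with the cycling, here is a minimal sketch of coordinate descent for the lasso, assuming the columns of X are standardized (unit variance) and y is centered; names are illustrative:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_passes=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_passes):
        for j in range(p):
            # Partial residual: remove every predictor's fit except x_j's
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            # Soft-threshold the univariate OLS update
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0)
    return beta
```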

19 Goals for this lecture Incorporating priors in dependency networks: linear regression; regularized linear regression; using the regularized linear regression framework to incorporate prior knowledge.

20 Approach Extend a previous approach (Inferelator), a network inference approach that can incorporate both time course and single time point data, to integrate an existing prior. Two approaches to integrate prior graph structure: Modified Elastic Net (MEN) and Bayesian Best Subset Regression (BBSR). Both approaches rely on a linear regression model.

21 An integrative network inference method must possess the following key properties: The method must only include the part of the prior with support from the data (robustness to a noisy prior). Using a structure prior must not limit the ability to learn the part of the network for which no prior information exists. The user must be able to control the weight given to the prior.

22 Modeling the relationship between regulator and target Time series: $y_i(t_k + m) = \sum_{p \in P_i} \beta_{i,p} x_p(t_k)$, where $m$ is the time lag, $i = 1, \ldots, N$, $k = 1, \ldots, K-1$. Steady state: $x_i(e_l) = \sum_{p \in P_i} \beta_{i,p} x_p(e_l)$, $i = 1, \ldots, N$, $l = 1, \ldots, L$, where $N$ is the number of genes and $L$ the number of samples. Network inference: estimate the coefficients $\beta_{i,p}$.
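
A small sketch of how time-lagged response and design pairs could be assembled from a time course, assuming here that the lag m is expressed in numbers of timepoints (the function name is ours):

```python
import numpy as np

def time_lagged_pairs(expr, m=1):
    # expr: (genes x timepoints) matrix sampled at t_1..t_K.
    # Row k of X holds expression at t_k; row k of Y holds expression
    # at t_{k+m}, so regressing a target row of Y on regulator columns
    # of X fits y_i(t_k + m) = sum_p beta_{i,p} x_p(t_k).
    X = expr[:, :-m].T
    Y = expr[:, m:].T
    return Y, X
```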

23 RECAP Using the regularized linear regression framework to incorporate prior knowledge. Regularized regression adds a regularization term to the objective function; the regularization term is some type of norm on the regression weights. LASSO uses an L1 norm, suitable for finding sparse models. RIDGE uses an L2 norm and can handle correlated predictors. ELASTIC NET combines the LASSO and RIDGE penalties. Inferelator approaches to integrating prior knowledge: Modified Elastic Net and Bayesian Best Subset Regression.

24 Modified Elastic Net (MEN) Elastic Net regression: minimize the sum of squared errors $E_i(\beta) = \sum_{r=1}^{R} \big( y_i(r) - \sum_{p \in P_i} \beta_{i,p} x_p(r) \big)^2$ subject to the constraint $(1 - \omega) \sum_{p \in P_i} |\beta_{i,p}| + \omega \sum_{p \in P_i} \beta_{i,p}^2 \le s_i \sum_{p \in P_i} |\beta_{i,p}^{ols}|$, which combines an L1 and an L2 norm. The constraint parameters are estimated via cross-validation.

25 MEN continued The modification to Elastic Net: $(1 - \omega) \sum_{p \in P_i} \theta_{i,p} |\beta_{i,p}| + \omega \sum_{p \in P_i} \beta_{i,p}^2 \le s_i \sum_{p \in P_i} |\beta_{i,p}^{ols}|$. Set $\theta_{i,p} < 1$ so that if there is a prior edge $x_p \to y_i$, the regression coefficient will be penalized less.
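
One way to realize per-edge penalty weights like $\theta_{i,p}$ in practice is the standard feature-rescaling (adaptive lasso) trick; this sketch covers only the L1 part of the MEN constraint (the L2 term would be rescaled too) and uses scikit-learn's Lasso, with illustrative names:

```python
import numpy as np
from sklearn.linear_model import Lasso

def prior_weighted_lasso(X, y, theta, alpha=0.1):
    # theta[p] < 1 for predictors with a prior edge, 1 otherwise, so
    # prior-supported coefficients are penalized less. Solving a standard
    # lasso on X/theta and rescaling the coefficients is equivalent to
    # minimizing RSS + alpha * sum_p theta[p] * |beta[p]|.
    X_scaled = X / theta                 # broadcasts over columns
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    return model.coef_ / theta           # map back to the original scale
```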

26 Probabilistic interpretation for the one-predictor case Recall our linear model for one predictor, $y_i = x_i \beta_1 + \epsilon_i$, where $\epsilon_i$ is noise. Assume the noise is distributed according to a Gaussian with mean 0 and variance $\sigma^2$, so $y_i \sim \mathcal{N}(x_i \beta_1, \sigma^2)$. How do we estimate $\beta_1$ from $N$ datapoints? Maximize the likelihood of the data given the model.

27 Maximum likelihood estimate of $\beta_1$ Likelihood of the data: $L = \prod_{i=1}^{N} P(y_i \mid x_i, \beta_1, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big( -\frac{(y_i - x_i \beta_1)^2}{2\sigma^2} \big)$. Taking the log: $LL = \sum_{i=1}^{N} \big[ \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - x_i \beta_1)^2}{2\sigma^2} \big]$. Differentiating with respect to $\beta_1$ and setting to 0: $\hat{\beta}_1 = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}$. We would get the same answer by minimizing the RSS.

28 Probabilistic interpretation in case of p inputs Assume the output is distributed as $y_i \sim \mathcal{N}(x_i^T \beta, \sigma^2)$. Again we can compute the likelihood and maximize it to find $\beta$; the ML estimate is the same as the one derived by minimizing the RSS.

29 Bayesian framework to estimate parameters Instead of optimizing the likelihood, we put a prior on the parameters and optimize the posterior probability of the parameters: $P(\beta \mid D) \propto P(D \mid \beta) P(\beta)$, the product of the Gaussian data likelihood and the parameter prior. What types of priors can we use?

30 Priors on parameters in regression Gaussian prior: $P(\beta) = \mathcal{N}(0, \tau^2 I)$, i.e. $P(\beta) \propto \exp\big( -\frac{\beta^T \beta}{2\tau^2} \big)$; the resulting MAP estimate is also called ridge regression. Laplace prior: $P(\beta_j) = \mathrm{Laplace}(0, t)$, i.e. $P(\beta_j) \propto \exp\big( -\frac{|\beta_j|}{t} \big)$; the resulting MAP estimate is also called LASSO regression.
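
To see the correspondence, take the negative log of the posterior: with a Gaussian prior the penalty is quadratic (ridge), and with a Laplace prior it is an absolute value (lasso). A sketch, dropping constants:

```latex
-\log P(\beta \mid D)
  = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - x_i^\top\beta\bigr)^2
    + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2 + \text{const}
    \quad \text{(Gaussian prior: ridge with } \lambda = \sigma^2/\tau^2 \text{)}

-\log P(\beta \mid D)
  = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - x_i^\top\beta\bigr)^2
    + \frac{1}{t}\sum_{j=1}^{p}|\beta_j| + \text{const}
    \quad \text{(Laplace prior: lasso with } \lambda = \sigma^2/t \text{)}
```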

31 Bayesian Best Subset Regression (BBSR) Based on a Bayesian framework of model selection: search among all subsets of regulators and pick the best one, trading off data fit against model complexity. Assume that the expression level of the response $y$ is distributed according to a Gaussian given the regulators $X$: $y \mid \beta, \sigma^2, X \sim \mathcal{N}_n(X\beta, \sigma^2 I)$. The prior over the parameters is a Gaussian; place a prior distribution on the parameters, and incorporate prior knowledge of interactions in the parameters: $p(\beta \mid \sigma^2) \propto \mathcal{N}_n(\beta^0, g (X^T X)^{-1} \sigma^2)$, where $g \in (0, \infty)$ is a number between 0 and infinity.

32 BBSR continued The posterior distribution over the parameters is given as: $p(\beta \mid y, \sigma^2) \propto \mathcal{N}\big( \frac{\beta^0}{g+1} + \frac{g}{g+1} \beta^{OLS},\ \frac{g}{g+1} \sigma^2 (X^T X)^{-1} \big)$. $g$ can be tuned to provide a trade-off between the prior and the OLS solution: when $g$ is larger, $\beta$ is closer to the OLS solution; when it is smaller, $\beta$ is closer to the prior. The prior mean $\beta^0$ is set to a vector of all 0s.
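
A small sketch of the resulting shrinkage estimator for a scalar g (the per-predictor vector of the next slide works analogously); the function name is ours:

```python
import numpy as np

def bbsr_posterior_mean(X, y, g, beta0=None):
    # Posterior mean under the g-prior:
    #   E[beta | y] = beta0/(g+1) + (g/(g+1)) * beta_OLS.
    # Larger g pulls the estimate toward OLS; smaller g toward the prior.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    if beta0 is None:
        beta0 = np.zeros_like(beta_ols)   # prior mean: all zeros
    return beta0 / (g + 1) + (g / (g + 1)) * beta_ols
```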

33 BBSR continued Inferelator uses a $p$-dimensional vector $g$ for the $p$ predictors. Predictors with prior support have their $g$ entries set higher (pushing $\beta$ more towards the OLS solution).

34 BBSR model selection The final step in BBSR is to determine the best model out of the $2^p$ possible predictor subsets. $p$ cannot be very high: the approach sets $p$ to 10. The best model is the one that minimizes prediction error with the lowest model complexity.
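
A sketch of exhaustive subset search, using BIC as a stand-in for BBSR's Bayesian fit-versus-complexity score (the actual method scores models within its g-prior framework):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    # Enumerate all 2^p predictor subsets (feasible only for small p,
    # e.g. p <= 10) and keep the one with the lowest BIC.
    n, p = X.shape
    best_score, best_cols = np.inf, ()
    for k in range(p + 1):
        for subset in combinations(range(p), k):
            if subset:
                Xs = X[:, list(subset)]
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = np.sum((y - Xs @ beta) ** 2)
            else:
                rss = np.sum((y - y.mean()) ** 2)  # intercept-only model
            bic = n * np.log(rss / n) + len(subset) * np.log(n)
            if bic < best_score:
                best_score, best_cols = bic, subset
    return best_cols, best_score
```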

35 Experimental setup Three datasets: DREAM4, an in silico dataset with 100 nodes; an E. coli dataset from DREAM5; and a B. subtilis dataset. Evaluation is based on AUPR, with the ranking of edges obtained from a bootstrapping strategy.

36 Workflow of experiments Fig. 1. Method flow chart (time-lagged response and design variables; resampled data matrices; prior known interactions; MEN/BBSR; rank-combined ensemble). Our method takes as input an expression dataset. To build a mechanistic model of gene expression, we create time-lagged response and design variables, such that the expression of the TF is time-lagged with respect to the expression of the target. We then resample the response and design matrices, running model selection (using either MEN or BBSR) for each resample. This generates an ensemble of networks, which we rank-combine into one final network.

37 How does the prior weight parameter affect performance? Fig. 2. Effect of the weight parameter on performance. We use all gold-standard interactions (GSIs) as the set of prior known interactions (PKIs), and evaluate performance (in terms of AUPR) against the set of GSIs, for a variety of choices of the weight parameter for both methods.

38 Can the data discriminate between different types of prior edges? Low-ranked interactions do not have a strong positive or negative correlation. Fig. 3. Incorporation of prior interactions is data-driven. For all three datasets, we used all GSIs as PKIs. Here, we display the distribution of time-lagged correlation of predicted TF-target pairs at a recall level of 0.5 (higher ranked, blue), and low-ranked interactions that are in the gold standard (lower ranked, red). Note that high-ranked interactions are less likely to have low absolute time-lagged correlation, and the low-ranked GSIs are centred around 0.

39 Ability to recover new edges is not hampered by adding a prior (DREAM4, E. coli, B. subtilis) Fig. 4. Performance change on the leave-out set. PKIs were sampled randomly from 20%, 40%, 60% and 80% of the GSIs in five repetitions. We define the leave-out set as the set of GSIs that are not PKIs. Here, we compare the AUPR of the leave-out set when using PKIs (y-axis) to the AUPR when not using PKIs (x-axis). Points above the line indicate a performance increase when PKIs are used (prior helps); points below indicate a decrease (prior does not help).

40 What happens when one adds noisy priors? (High-noise regime.) Fig. 5. Robustness to incorrect prior information. For each dataset, we considered half of the GSIs as true prior interactions (TPIs), and added varying numbers of false prior interactions (FPIs) that were not GSIs. We show the AUPR of both methods for multiple choices of the respective weight parameters, as well as methods that do not use any PKIs (horizontal lines). Additionally, we show the performance of a naive interaction ranking method, which places all PKIs at the top of the list (gray bars). Low and high weight-parameter values in BBSR and MEN correspond to more or less sparse solutions.

41 Inferelator key points Based on linear regression models. Handles time series and steady-state data. The prior is incorporated at the level of the edge weight: Modified Elastic Net aims to reduce the penalty for edges with prior support, while BBSR aims to put a prior distribution on the regression weights.

42 Discussion How does one pick the prior parameter values? How does one evaluate the network? How does one identify novel edges rather than only recovering the prior?

43 iRafNet Extends the GENIE3 Random Forests-based approach to incorporate priors. iRafNet uses a weighted sampling scheme to incorporate information from different sources of data. (Petralia et al. 2015, Bioinformatics)

44 Weighted sampling algorithm in iRafNet Each data source $d$ provides a score for a regulator $k$ and target $j$. Convert these scores to sampling weights $w_{k \to j}$ in a data source- and score-specific way. For each node split, instead of sampling uniformly from $N$ potential regulators, select a dataset $d$ at random and sample $N$ regulators based on their weights in $d$.
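
A minimal sketch of one such node-split step (illustrative names; this is our rendering of the scheme, not the authors' code):

```python
import numpy as np

def sample_candidate_regulators(weights_by_source, n_candidates, rng=None):
    # weights_by_source: dict mapping each data source d to an array of
    # sampling weights w_{k->j} over candidate regulators for target j.
    rng = rng or np.random.default_rng()
    # Pick a data source at random, then sample regulators (without
    # replacement) in proportion to their weights in that source.
    source = rng.choice(list(weights_by_source))
    w = np.asarray(weights_by_source[source], dtype=float)
    return rng.choice(len(w), size=n_candidates, replace=False, p=w / w.sum())
```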

45 iRafNet overview (Petralia et al. 2015, Bioinformatics)

46 Constructing sampling weights The prior knowledge is described as a set of weighted networks; weights for selecting a regulator are derived in a dataset-specific manner. Undirected protein-protein interactions: weights derived from a diffusion process over graphs (we will see this in later lectures). Time-series expression data: the weight $w_{j \to k}$ assesses how predictive $g_j$'s expression at time $t$ is of $g_k$'s expression at the future time point $t+1$. Knockout data: $w_{j \to k}$ is derived in multiple ways: if $g_k$'s expression changes significantly when $g_j$ is knocked out, $w_{j \to k}$ is derived from the P-value; otherwise it is derived from the overlap of $g_j$'s and $g_k$'s knockout targets or knockout regulators.

47 iRafNet application to real data Ground truth: significant interactions identified from ChIP-chip experiments in yeast. Expression dataset: a large study measuring gene expression in multiple yeast strains. Prior datasets (including other expression datasets): an expression time course during the cell cycle, expression data of genetic knockouts of TFs, and protein-protein interactions from public databases (BioGRID, MINT, DIP).

48 Does adding a prior help iRafNet?

Method   | Data               | AUC             | AUPR
GENIE3   | Expression         | (0.537, 0.566)  | (0.537, 0.548)
iRafNet  | Multiple weights   | (0.613, 0.636)  | (0.561, 0.569)
iRafNet  | Expression and KO  | (0.645, 0.673)  | (0.562, 0.574)
iRafNet  | Expression and TS  | (0.528, 0.557)  | (0.530, 0.541)
iRafNet  | Expression and PPI | (0.562, 0.591)  | (0.551, 0.561)

Evaluated on the ChIP-chip network of yeast. Expression dataset: a large study measuring gene expression. Prior datasets (including other expression datasets): an expression time course during the cell cycle, knockout data from Hu et al., and protein-protein interactions from public databases (BioGRID, MINT, DIP).

49 Concluding remarks We have seen different ways to incorporate other data types to improve the quality of the inferred network. Physical Module Networks use noisy observations of physical edges to constrain the regulatory network structure, yield few false-positive regulatory relationships, and allow integration of unmeasured nodes. Bayesian networks with a structure prior use an energy function to assess concordance, but are sensitive to incorrect prior information. Dependency networks with priors: the linear regression approach aims to reduce the penalty on edges with prior support, and the tree-based approach enables a biased selection of regulators.
