Lecture Network analysis for biological systems

Lecture 11, 2014
Network analysis for biological systems
Anja Bråthen Kristoffersen

Biological Networks
- Gene regulatory network: two genes are connected if the expression of one gene modulates the expression of the other, by either activation or inhibition.
- Protein interaction network: proteins are connected if they interact physically or take part in the same metabolic or signaling pathways of the cell.
- Metabolic network: metabolic products and substrates are connected if they participate in the same reaction.
Statistical bioinformatics

What is a Gene Regulatory Network?
- Gene regulatory networks (GRNs) are the on-off switches of a cell, operating at the gene level.
- Two genes are connected if the expression of one gene modulates the expression of the other, by either activation or inhibition.

Simplified Representation of a Gene Regulatory Network
- A gene regulatory network can be represented by a directed graph.
- A node represents a gene.
- A directed edge stands for the modulation (regulation) of one node by another; e.g. an arrow from gene X to gene Y means that gene X affects the expression of gene Y.

Why study Gene Regulatory Networks?
- Genes are not independent: they regulate each other and act collectively.
- This collective behavior can be observed using microarrays.
- Some genes control the response of the cell to changes in the environment by regulating other genes.
- Potential discovery of triggering mechanisms and treatments for disease.

Network Modeling Techniques
- Boolean network (BN)
- Bayesian belief network
- Metabolic network modeling methods

Boolean network modeling
- Boolean: either true or false (1 or 0).
- Binarization
  - reduces the noise in biological data
  - captures the dynamic behavior of complex systems
  - needs a threshold value
  - leads to loss of information
- Genes are modeled as switch-like dynamic elements that are either on or off.

A Boolean network consists of
- A set of genes V = {x_1, x_2, ..., x_n}.
- A set of Boolean functions F = {f_1, f_2, ..., f_n}, where f_i = f_i(x_1, x_2, ..., x_n).
- Each function is described with three Boolean operators:
  - AND: && or &
  - OR: || or |
  - NOT: ! or ~
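A small sketch of how the three operators behave on 0/1 values in R. The vectors x1 and x2 are made-up inputs chosen to cover all four combinations; they are not data from the lecture. In R, & and | work element by element, while && and || compare single values only.

```r
# Hypothetical 0/1 inputs, chosen only to cover all four combinations.
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)

# Element-wise operators return logical vectors; as.integer() maps them to 0/1.
and_result <- as.integer(x1 & x2)
or_result  <- as.integer(x1 | x2)
not_result <- as.integer(!x1)

# The scalar forms && and || compare single values only.
scalar_and <- as.integer(x1[4] && x2[4])

print(rbind(x1, x2, AND = and_result, OR = or_result, NOT_x1 = not_result))
```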

Example: graph G with 3 genes
Given the network G(V, F) with
    V = {x_1, x_2, x_3}
    F = {f_1 = x_2 & x_3, f_2 = x_1, f_3 = x_2}
[Figure: wiring diagram of x_1, x_2 and x_3]
Truth table: the input states at time t-1 are 000, 001, 010, 011, 100, 101, 110 and 111; the output at time t is computed for each of them from F.


Example: graph G with 3 genes, complete truth table
Given the network G(V, F) with
    V = {x_1, x_2, x_3}
    F = {f_1 = x_2 & x_3, f_2 = x_1, f_3 = x_2}
[Figure: wiring diagram of x_1, x_2 and x_3]

    Input (t-1)      Output (t)
    x_1 x_2 x_3      x_1 x_2 x_3
     0   0   0        0   0   0
     0   0   1        0   0   0
     0   1   0        0   0   1
     0   1   1        1   0   1
     1   0   0        0   1   0
     1   0   1        0   1   0
     1   1   0        0   1   1
     1   1   1        1   1   1

Example: graph G with 3 genes
[Figure: the complete truth table next to the wiring diagram of x_1, x_2 and x_3, with the state 110 marked]

R code: truth table
G(V, F), V = {x_1, x_2, x_3}, F = {f_1 = x_2 & x_3, f_2 = x_1, f_3 = x_2}

    # The state x(t-1) is a vector x with three elements.
    # We want to find the state x(t), here called y, also with three elements.
    # Test it by letting x <- c(0,1,0); according to the truth table
    # y should then be 0,0,1.
    x <- c(0,1,0)
    y <- rep(NA, 3)         # allocate space for three elements in y
    y[1] <- x[2] && x[3]    # f_1
    y[2] <- x[1]            # f_2
    y[3] <- x[2]            # f_3
    y

Make a function of the truth table: R code

    output <- function(x){
        y <- rep(NA, 3)
        y[1] <- x[2] && x[3]
        y[2] <- x[1]
        y[3] <- x[2]
        return(y)
    }
    output(c(0,0,1))
    [1] 0 0 0

R code: make the output table based on an input table

    input <- rbind(c(0,0,0), c(0,0,1), c(0,1,0), c(0,1,1),
                   c(1,0,0), c(1,0,1), c(1,1,0), c(1,1,1))
    y <- matrix(NA, ncol = ncol(input), nrow = nrow(input))
    for(i in 1:nrow(y)){
        y[i,] <- output(input[i,])
    }

Search for Boolean functions
Chi-square testing-based search:
Kim H, Lee JK, Park T. (2007). Boolean networks using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics, 8:37.

Binary data set from a simple network: four nodes (G1, G2, G3, G4) and 10 time points

           G1  G2  G3  G4
    T1      0   1   0   1
    T2      1   0   0   1
    T3      1   1   1   0
    T4      0   1   0   0
    T5      0   0   0   1
    T6      1   0   1   0
    T7      0   1   1   0
    T8      0   0   0   0
    T9      0   0   1   0
    T10     0   0   1   0

11 January 2014

Observe which genes are on/off at the previous time point
- To find a Boolean function f_4 for node G4, we have to test the independence between G4 at time t and all nodes (G1, G2, G3, G4) at time t-1.
- We use a chi-square test to compare the expected with the observed counts.

2 x 2 contingency table
Find both the expected and the observed contingency tables of G4(t) against each of G1(t-1), G2(t-1), G3(t-1) and G4(t-1). Each table has the layout

                  Gi(t-1) = 0   Gi(t-1) = 1
    G4(t) = 0
    G4(t) = 1

2 x 2 contingency table, expected
- Assume independence between Gi and Gj.
- The expected table then depends only on the background distribution.

                  G1(t-1) = 0         G1(t-1) = 1
    G4(t) = 0     P(G4=0)*P(G1=0)     P(G4=0)*P(G1=1)
    G4(t) = 1     P(G4=1)*P(G1=0)     P(G4=1)*P(G1=1)

Similarly for the other pairs of Gi and Gj.
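The expected table can be sketched in a few lines of R. The observed counts (4, 3, 2, 0) are those of G4(t) against G1(t-1) in the lecture's data set. Estimating the background distribution from the margins of that same table is an assumption of this sketch, and the object names are illustrative.

```r
# Observed counts of G4(t) (rows) against G1(t-1) (columns),
# counted over the 9 transitions of the lecture's data set.
observed <- matrix(c(4, 3,
                     2, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(G4_t = c("0", "1"), G1_prev = c("0", "1")))

n <- sum(observed)
p_row <- rowSums(observed) / n   # background distribution P(G4 = 0), P(G4 = 1)
p_col <- colSums(observed) / n   # background distribution P(G1 = 0), P(G1 = 1)

expected_prob  <- outer(p_row, p_col)   # P(G4 = i) * P(G1 = j)
expected_count <- n * expected_prob     # expected number of transitions per cell

print(round(expected_count, 3))
```

Multiplying the probabilities by the number of transitions gives expected counts on the same scale as the observed table.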


2 x 2 contingency table, observed
Counted from the data set over the 9 transitions from time t-1 to time t:

                  G1(t-1) = 0   G1(t-1) = 1
    G4(t) = 0          4             3
    G4(t) = 1          2             0

                  G2(t-1) = 0   G2(t-1) = 1
    G4(t) = 0          5             2
    G4(t) = 1          0             2

                  G3(t-1) = 0   G3(t-1) = 1
    G4(t) = 0          3             4
    G4(t) = 1          2             0

                  G4(t-1) = 0   G4(t-1) = 1
    G4(t) = 0          5             2
    G4(t) = 1          1             1

Chi-square distribution
Compare the expected with the observed counts.

Chi-square test result of independence for node G4(t)

                  G1(t-1)   G2(t-1)   G3(t-1)   G4(t-1)
    chi-square     1.286     3.214     2.057     0.321
    p-value        0.511     0.170     0.430     1
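As a check on the statistics above, the Pearson statistic can be computed by hand from the four observed tables. This is a minimal sketch; the helper name chisq_stat is made up for illustration, and no continuity correction is applied.

```r
# Observed 2 x 2 tables from the lecture: rows are G4(t) = 0, 1,
# columns are Gi(t-1) = 0, 1.
observed_tables <- list(
  G1 = matrix(c(4, 3, 2, 0), nrow = 2, byrow = TRUE),
  G2 = matrix(c(5, 2, 0, 2), nrow = 2, byrow = TRUE),
  G3 = matrix(c(3, 4, 2, 0), nrow = 2, byrow = TRUE),
  G4 = matrix(c(5, 2, 1, 1), nrow = 2, byrow = TRUE)
)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected,
# with expected counts computed from the row and column sums.
chisq_stat <- function(obs) {
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expected)^2 / expected)
}

stats <- sapply(observed_tables, chisq_stat)
print(round(stats, 3))
```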

R code, 2 x 2 contingency table

    d1 <- read.table("c:/users/anjab/desktop/infstk/simpledataset4genes10timepoints.txt", header = TRUE, sep = "\t")
    # The data set was read in with the row names in the first column,
    # so they are moved to the row names here.
    d2 <- d1[,2:5]
    rownames(d2) <- d1[,1]
    # Always check that the data were read in correctly.
    head(d2)
       G1 G2 G3 G4
    T1  0  1  0  1
    T2  1  0  0  1
    T3  1  1  1  0
    T4  0  1  0  0
    T5  0  0  0  1
    T6  1  0  1  0

R code, 2 x 2 contingency table

                  G1(t-1) = 0   G1(t-1) = 1
    G4(t) = 0          a             b
    G4(t) = 1          c             d

    # Start with gene 4 and make the 2 x 2 contingency table for gene 1.
    # Find the number a.
    n <- nrow(d2)
    # d2[2:n, 4] is G4 at time t, so index k refers to row k + 1 of d2.
    # Used directly as a row index, k therefore points to time t-1.
    posible <- which(d2[2:n,4] == 0)
    posible
    [1] 2 3 5 6 7 8 9
    a <- sum(d2[posible, 1] == 0)
    a
    [1] 4

The G1 and G4 columns of the data set:

           G1  G4
    T1      0   1
    T2      1   1
    T3      1   0
    T4      0   0
    T5      0   1
    T6      1   0
    T7      0   0
    T8      0   0
    T9      0   0
    T10     0   0

R code, 2 x 2 contingency table

    # Find the numbers a, b, c and d.
    n <- nrow(d2)
    posible0 <- which(d2[2:n,4] == 0)    # time points t-1 followed by G4(t) = 0
    posible1 <- which(d2[2:n,4] == 1)    # time points t-1 followed by G4(t) = 1
    a  <- sum(d2[posible0, 1] == 0)
    b  <- sum(d2[posible0, 1] == 1)
    c1 <- sum(d2[posible1, 1] == 0)      # named c1 to avoid masking the function c()
    d  <- sum(d2[posible1, 1] == 1)
    conttable <- matrix(c(a,b,c1,d), ncol = 2, byrow = TRUE)
    conttable
         [,1] [,2]
    [1,]    4    3
    [2,]    2    0

chisq.test(conttable)

    > chisq.test(conttable)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  conttable
    X-squared = 0.0804, df = 1, p-value = 0.7768

    Warning message:
    In chisq.test(conttable) : Chi-squared approximation may be incorrect

NB: we have a very small data set. How can we get rid of the warning message? Look at help(chisq.test).

help(chisq.test)
The help page shows that the p-value can be computed by Monte Carlo simulation. For a small data set like ours this is a good option.

chisq.test(conttable, simulate.p.value = TRUE)

    > chisq.test(conttable, simulate.p.value = T)

            Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

    data:  conttable
    X-squared = 1.2857, df = NA, p-value = 0.5097

    > chisq.test(conttable, simulate.p.value = T)

            Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

    data:  conttable
    X-squared = 1.2857, df = NA, p-value = 0.5002

The p-value differs slightly between runs because it is simulated. Yates' continuity correction is not applied when the p-value is simulated, which is why X-squared changes from 0.0804 to 1.2857.
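To make the simulation less of a black box, here is a rough sketch of a Monte Carlo p-value computed by hand. It draws random tables with the same row and column sums using r2dtable() from the stats package and counts how often the simulated statistic is at least as large as the observed one. The seed and the helper name chisq_stat are arbitrary choices for this sketch.

```r
# Observed table for G4(t) (rows) against G1(t-1) (columns).
observed <- matrix(c(4, 3, 2, 0), nrow = 2, byrow = TRUE)

# Pearson statistic without continuity correction.
chisq_stat <- function(obs) {
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expected)^2 / expected)
}

set.seed(1)   # arbitrary seed, only to make the sketch reproducible
B <- 2000     # number of simulated tables, the default used by chisq.test

# r2dtable() draws random tables with the given row and column sums.
simulated <- r2dtable(B, rowSums(observed), colSums(observed))
sim_stats <- sapply(simulated, chisq_stat)

obs_stat <- chisq_stat(observed)
# Small tolerance so that ties with the observed statistic count as ">=".
p_value <- (1 + sum(sim_stats >= obs_stat * (1 - 1e-7))) / (B + 1)
print(p_value)
```

With only 9 transitions there are just three possible tables with these margins, so the simulated p-value fluctuates around a single exact value.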

Chi-square test result of independence for all genes

                      G1(t-1)   G2(t-1)   G3(t-1)   G4(t-1)
    p-value G1(t)
    p-value G2(t)
    p-value G3(t)
    p-value G4(t)      0.511     0.170     0.430     1

Make a function that takes two vectors, x (gene at time t) and y (gene at time t-1)

    chisqres <- function(x, y){
        n <- length(x)
        xt    <- x[2:n]          # gene x at time t
        yprev <- y[1:(n - 1)]    # gene y at time t-1
        # Rows of the table: x(t) = 0, 1. Columns: y(t-1) = 0, 1.
        a  <- sum(xt == 0 & yprev == 0)
        b  <- sum(xt == 0 & yprev == 1)
        c1 <- sum(xt == 1 & yprev == 0)
        d  <- sum(xt == 1 & yprev == 1)
        counttable <- matrix(c(a,b,c1,d), ncol = 2, byrow = TRUE)
        chisq.test(counttable, simulate.p.value = TRUE)$p.value
    }

Use chisqres()

    res <- matrix(NA, ncol(d2), ncol(d2))
    for(i in 1:ncol(d2)){
        for(j in 1:ncol(d2)){
            res[i,j] <- chisqres(d2[,i], d2[,j])
        }
    }
    colnames(res) <- paste(colnames(d2), "t-1", sep = "")
    rownames(res) <- paste(colnames(d2), "t", sep = "")
    round(res, 3)
        G1t-1 G2t-1 G3t-1 G4t-1
    G1t 1.000 1.000 0.168 0.012
    G2t 0.009 1.000 0.512 1.000
    G3t 1.000 0.007 1.000 1.000
    G4t 0.491 0.161 0.428 1.000

This only tells us that G1t depends on the state of G4t-1, not which type of dependence it is.

Going further: calculate 2 x 2 x 2 tables with three genes, one gene at time t and two genes at time t-1, e.g. G1(t), G3(t-1) and G4(t-1).
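A possible way to build such a 2 x 2 x 2 table in R is with table(). The sketch below re-enters the lecture's data set by hand so that it runs on its own; the helper as_state and the dimension names are illustrative choices.

```r
# The lecture's binary data set: 4 genes, 10 time points.
d2 <- data.frame(
  G1 = c(0, 1, 1, 0, 0, 1, 0, 0, 0, 0),
  G2 = c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0),
  G3 = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1),
  G4 = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0),
  row.names = paste0("T", 1:10)
)
n <- nrow(d2)

# Fix the levels so that every dimension has both 0 and 1, even if unobserved.
as_state <- function(v) factor(v, levels = c(0, 1))

table3 <- table(
  G1_t    = as_state(d2$G1[2:n]),        # target gene at time t
  G3_prev = as_state(d2$G3[1:(n - 1)]),  # first candidate input at time t-1
  G4_prev = as_state(d2$G4[1:(n - 1)])   # second candidate input at time t-1
)
print(ftable(table3))
```

ftable() prints the three-way table in a flat layout, which is easier to read than the default slice-by-slice output.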

Probabilistic Boolean Network
- Allows multiple Boolean functions at each node, with different probabilities:
    F = {{(f_{11}, c_{11}), ..., (f_{1k_1}, c_{1k_1})}, ..., {(f_{n1}, c_{n1}), ..., (f_{nk_n}, c_{nk_n})}}
- k_1 is the number of different transition functions for gene 1.
- The transition probabilities c_{11} to c_{1k_1} always sum to 1.

Example: Probabilistic Boolean Network
Given three genes and the transition functions
    F = {F_1 = {(f_{11}, c_{11}), (f_{12}, c_{12})}, F_2 = {(f_{21}, 1)}, F_3 = {(f_{31}, c_{31}), (f_{32}, c_{32})}}
and given the truth table and the probability of each transition function, there are 2*1*2 = 4 different sets of transition functions that can be used. They are all listed in table K.
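One step of such a network can be sketched as follows: for each gene one of its functions is drawn with the given probabilities and then applied to the current state. The truth table values and the probabilities c11 = 0.6 and c31 = 0.5 are the ones used in the lecture's R code for this example; the function and object names are made up for this sketch.

```r
# Truth table of the five transition functions, indexed by the current state
# x1x2x3 (values as in the truth table of the lecture's example).
truthtable <- data.frame(
  f11 = c(0, 1, 1, 1, 0, 1, 1, 1),
  f12 = c(0, 1, 1, 0, 0, 1, 1, 1),
  f21 = c(0, 1, 1, 0, 1, 1, 0, 1),
  f31 = c(0, 0, 0, 1, 0, 1, 1, 1),
  f32 = c(0, 0, 0, 0, 0, 0, 0, 1),
  row.names = c("000", "001", "010", "011", "100", "101", "110", "111")
)

# Candidate functions per gene and their selection probabilities c_ij.
functions <- list(c("f11", "f12"), "f21", c("f31", "f32"))
probs     <- list(c(0.6, 0.4), 1, c(0.5, 0.5))

# One step: draw one function per gene, then look up its output for the state.
pbn_step <- function(state) {
  next_bits <- sapply(seq_along(functions), function(g) {
    # Draw an index rather than a name, so that genes with a single
    # function are handled the same way as the others.
    k <- sample.int(length(functions[[g]]), size = 1, prob = probs[[g]])
    truthtable[state, functions[[g]][k]]
  })
  paste(next_bits, collapse = "")
}

set.seed(42)  # arbitrary seed for reproducibility
print(replicate(5, pbn_step("011")))
```

Repeating the step many times from the same state approximates one row of the transition probability matrix.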


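Transition probability matrix
The four possible networks have the probabilities P_1, P_2, P_3 and P_4, where for example P_1 = c_{11} * 1 * c_{31}.
[Figure: 8 x 8 transition probability matrix with rows and columns labeled by the states 000, 001, 010, 011, 100, 101, 110 and 111]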

R code
The truth table, as stored in the file:

    x1x2x3    f11 f12 f21 f31 f32
    State000    0   0   0   0   0
    State001    1   1   1   0   0
    State010    1   1   1   0   0
    State011    1   0   0   1   0
    State100    0   0   1   0   0
    State101    1   1   1   1   0
    State110    1   1   0   1   0
    State111    1   1   1   1   1

    truthtable1 <- read.table("m:/undervisning/statistical bioinformatics/datasets used/truthtableex.txt", header = TRUE)
    truthtable <- truthtable1[,2:6]
    # Keep only the three state digits, e.g. "State011" becomes "011".
    rownames(truthtable) <- substr(truthtable1[,1], 6, 8)
    possiblecomb <- nrow(truthtable)
    c11 <- 0.6        # probability of function f11 being used
    c12 <- 1 - c11
    c21 <- 1
    c31 <- 0.5
    c32 <- 1 - c31
    A <- matrix(0, nrow = possiblecomb, ncol = possiblecomb)
    rownames(A) <- rownames(truthtable)
    colnames(A) <- rownames(truthtable)

    # Columns of the truth table used by each of the four possible models.
    possibleModelsTruthTableColumn <- rbind(c(1,3,4), c(1,3,5), c(2,3,4), c(2,3,5))
    # Probability that each of the possible models is used.
    probmodels <- c(c11*c21*c31, c11*c21*c32, c12*c21*c31, c12*c21*c32)
    for(i in 1:nrow(A)){
        from <- rownames(A)[i]
        for(j in 1:nrow(possibleModelsTruthTableColumn)){
            modelj <- possibleModelsTruthTableColumn[j,]
            to <- paste(truthtable[i,modelj[1]], truthtable[i,modelj[2]],
                        truthtable[i,modelj[3]], sep = "")
            probmodelsj <- probmodels[j]
            # Several models can lead to the same state, so add up.
            A[from, to] <- A[from, to] + probmodelsj
        }
    }

Synchronous Boolean networks
- Assume that all genes are updated at the same time.
- This simplification facilitates the analysis of the networks.
- Until now we have looked at such simplified networks.

Asynchronous Boolean networks
- At each point of time t, only one of the transition functions f_i in F is chosen at random, and the corresponding Boolean variable is updated.
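The difference between the two update schemes can be sketched with the three-gene example network (f1 = x2 & x3, f2 = x1, f3 = x2). The function names sync_step and async_step are made up for this sketch, and choosing the gene uniformly at random is an assumption.

```r
# Transition functions of the three-gene example:
# f1 = x2 & x3, f2 = x1, f3 = x2.
transition <- list(
  function(x) as.integer(x[2] && x[3]),
  function(x) as.integer(x[1]),
  function(x) as.integer(x[2])
)

# Synchronous update: all genes are updated at once.
sync_step <- function(x) sapply(transition, function(f) f(x))

# Asynchronous update: one randomly chosen gene is updated, the rest are kept.
async_step <- function(x) {
  i <- sample.int(length(x), size = 1)
  x[i] <- transition[[i]](x)
  x
}

set.seed(7)  # arbitrary seed
x <- c(1L, 1L, 0L)
print(sync_step(x))
print(async_step(x))
```

An asynchronous step changes at most one gene, so a state can have several possible successors.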

BoolNet
Provides tools for
- assembling
- analyzing
- visualizing
synchronous, asynchronous and probabilistic Boolean networks.

    install.packages("BoolNet")
    library(BoolNet)

BoolNet, syntax

    targets, factors

or

    targets, factors, probabilities

- The target is the gene that is affected.
- The factors are the genes affecting it.
- Probabilities are given when there is more than one transition function.

BoolNet, syntax: example
CycD is an input, considered as constant.
Translated into a transition rule:

    CycD, CycD

BoolNet, syntax: example
Rb is expressed if all of the genes CycA, CycB, CycD and CycE are absent; it can also be expressed in the presence of CycE or CycA if their inhibitory activity is blocked by p27.
Translated into a transition rule:

    First part:  ! CycA & ! CycB & ! CycD & ! CycE
    Second part: p27 & ! CycB & ! CycD
    Together:    Rb, (! CycA & ! CycB & ! CycD & ! CycE) | (p27 & ! CycB & ! CycD)
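To see what the rule means, it can be written directly as a plain R function that combines the first and the second part with OR. This sketch does not use BoolNet; the function name rb_next and the three test states are made up for illustration.

```r
# Next state of Rb, written directly from the transition rule.
# `s` is a named logical vector with the current state of the genes involved.
rb_next <- function(s) {
  unname((!s["CycA"] & !s["CycB"] & !s["CycD"] & !s["CycE"]) |
           (s["p27"] & !s["CycB"] & !s["CycD"]))
}

genes <- c("CycA", "CycB", "CycD", "CycE", "p27")
all_off <- setNames(rep(FALSE, 5), genes)

# CycE present: Rb is only expressed if p27 blocks the inhibition.
cyce_only    <- replace(all_off, "CycE", TRUE)
cyce_and_p27 <- replace(all_off, c("CycE", "p27"), TRUE)

print(c(all_off = rb_next(all_off),
        cyce_only = rb_next(cyce_only),
        cyce_and_p27 = rb_next(cyce_and_p27)))
```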

Read a network into R
Assume that we have the file cellcycle.txt:

    targets, factors
    CycD, CycD
    Rb, (! CycA & ! CycB & ! CycD & ! CycE) | (p27 & ! CycB & ! CycD)
    E2F, (! Rb & ! CycA & ! CycB) | (p27 & ! Rb & ! CycB)
    CycE, (E2F & ! Rb)
    CycA, (E2F & ! Rb & ! Cdc20 & ! (Cdh1 & UbcH10)) | (CycA & ! Rb & ! Cdc20 & ! (Cdh1 & UbcH10))
    p27, (! CycD & ! CycE & ! CycA & ! CycB) | (p27 & ! (CycE & CycA) & ! CycB & ! CycD)
    Cdc20, CycB
    Cdh1, (! CycA & ! CycB) | (Cdc20) | (p27 & ! CycB)
    UbcH10, ! Cdh1 | (Cdh1 & UbcH10 & (Cdc20 | CycA | CycB))
    CycB, ! Cdc20 & ! Cdh1

Read it into R by:

    cellcycle <- loadNetwork("cellcycle.txt")
    cellcycle

Reconstruct a network from time series
A data set that is already included in BoolNet is yeastTimeSeries.
To use these data, they have to be binarized.

Binarization can be done in many ways. BoolNet supports three methods:
- k-means clustering: for each gene, k-means clustering is performed to determine a good separation of groups.
- Edge detector: this approach first sorts the measurements for each gene. In the sorted measurements, the algorithm searches for differences of two successive values that satisfy a predefined condition.
- Scan statistic: the scan statistic assumes that the measurements for each gene are uniformly and independently distributed. It shifts a scanning window across the data and decides for each window position whether there is an unusual accumulation of data points, based on an approximated test statistic (see Glaz et al.).
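The idea behind the k-means method can be sketched for a single gene with kmeans() from base R. The measurements below are invented for illustration, and starting the two centers at the smallest and largest value is a choice made here to keep the result deterministic. It is not a description of how BoolNet implements the method.

```r
# Hypothetical expression measurements for one gene over eight time points.
expr <- c(0.12, 0.30, 1.95, 2.10, 0.25, 1.80, 0.18, 2.30)

binarize_kmeans <- function(x) {
  # Two clusters, started from the smallest and the largest value.
  km <- kmeans(x, centers = matrix(range(x), ncol = 1))
  # The cluster with the higher center is coded as 1 (gene "on").
  high_cluster <- which.max(km$centers[, 1])
  as.integer(km$cluster == high_cluster)
}

binary <- binarize_kmeans(expr)
print(rbind(expr, binary))
```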

Reconstruct network.


How to read the output
Fkh2 = <f(Clb1){01}> means

    Clb1(t)   Fkh2(t+1)
       0          0
       1          1

Sic1 = <f(Sic1,Clb1){0001}> means

    Sic1(t)   Clb1(t)   Sic1(t+1)
       0         0          0
       0         1          0
       1         0          0
       1         1          1
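The expansion from the bit string to the truth table can be sketched in base R. The helper decode_function is hypothetical, and it assumes the row order shown on the slide, with the inputs counting upwards in binary and the last listed input changing fastest.

```r
# Expand an output string such as "0001" into a truth table.
# Assumption: rows are ordered as on the slide, i.e. the inputs count
# upwards in binary with the last listed input changing fastest.
decode_function <- function(inputs, bits, target) {
  k <- length(inputs)
  out <- as.integer(strsplit(bits, "")[[1]])
  stopifnot(length(out) == 2^k)
  # expand.grid varies its first column fastest, so reverse the columns.
  grid <- expand.grid(rep(list(0:1), k))[, k:1, drop = FALSE]
  colnames(grid) <- paste0(inputs, "(t)")
  grid[[paste0(target, "(t+1)")]] <- out
  grid
}

print(decode_function("Clb1", "01", target = "Fkh2"))
print(decode_function(c("Sic1", "Clb1"), "0001", target = "Sic1"))
```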

    plotNetworkWiring(net)

[Figure: wiring diagram of the reconstructed network]

Creating random networks
It is desirable to generate artificial networks
- to study structural properties of Boolean networks
- to determine the specific properties of biological networks in comparison to arbitrary networks

    net <- generateRandomNKNetwork(n=10, k=3)
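What a random N-K network consists of can be sketched in base R: every gene gets k randomly chosen inputs and a random truth table with 2^k output bits. The functions random_nk_network and nk_step are made up for this sketch and are not BoolNet's implementation. Drawing inputs and output bits uniformly is an assumption.

```r
# Generate a random N-K network: every gene gets k randomly chosen inputs
# and a random truth table with 2^k output bits.
random_nk_network <- function(n, k) {
  lapply(seq_len(n), function(gene) {
    list(inputs = sort(sample.int(n, k)),
         output = sample(0:1, 2^k, replace = TRUE))
  })
}

# Synchronous update of a 0/1 state vector under such a network.
nk_step <- function(net, state) {
  sapply(net, function(g) {
    # Read the input bits as a binary number to find the truth-table row.
    row <- sum(state[g$inputs] * 2^(rev(seq_along(g$inputs)) - 1)) + 1
    g$output[row]
  })
}

set.seed(2014)  # arbitrary seed
net <- random_nk_network(n = 10, k = 3)
state <- sample(0:1, 10, replace = TRUE)
print(nk_step(net, state))
```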

Attractors
- Attractors are stable cycles of states in a Boolean network.
- Attractors in models of gene-regulatory networks are expected to be linked to phenotypes.
- All states that lead to a certain attractor form its basin of attraction.

- Simple attractors
  - occur in synchronous Boolean networks
  - consist of a set of states whose synchronous transitions form a cycle
- Complex or loose attractors
  - occur in asynchronous networks
  - in an asynchronous network there is usually more than one possible transition for each state
  - a complex attractor is formed by two or more overlapping loops
- Steady-state attractors
  - are attractors that consist of only one state
  - all transitions from this state result in the state itself
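For a small synchronous network the attractors and their basins can be found by simply following every state until a state repeats. The sketch below does this for the three-gene example network (f1 = x2 & x3, f2 = x1, f3 = x2) in base R; the helper names are made up for illustration.

```r
# Synchronous update of the three-gene example: f1 = x2 & x3, f2 = x1, f3 = x2.
step <- function(x) c(as.integer(x[2] && x[3]), x[1], x[2])

# All 2^3 states as rows, in the order 000, 001, ..., 111.
states <- as.matrix(expand.grid(x3 = 0:1, x2 = 0:1, x1 = 0:1)[, 3:1])
label <- function(x) paste(x, collapse = "")

# Follow the trajectory from a start state until a state repeats;
# the states from the first repeated one onwards form the attractor.
find_attractor <- function(x) {
  visited <- character(0)
  while (!(label(x) %in% visited)) {
    visited <- c(visited, label(x))
    x <- step(x)
  }
  cycle <- visited[match(label(x), visited):length(visited)]
  paste(sort(cycle), collapse = ",")
}

attractor_of <- apply(states, 1, find_attractor)
names(attractor_of) <- apply(states, 1, label)
basin_sizes <- table(attractor_of)   # number of states leading to each attractor
print(attractor_of)
print(basin_sizes)
```

This brute-force search visits all 2^n states, so it is only practical for small networks.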


Perturbation experiments
- Generating perturbed copies of a network is a way to test the robustness of structural properties of the network to noise and mismeasurements.
- For example, you could assess the relevance of an attractor by checking whether the same attractor is still found when small random changes are applied to the network. If this is the case, it is less likely that the attractor is an artifact of mismeasurements.

    perturbednet <- perturbNetwork(cellcycle, perturb="functions", method="bitflip")
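The idea of a bitflip perturbation can be sketched on a single truth table: one randomly chosen output bit of a transition function is inverted. This is only an illustration in base R; the helper bitflip is hypothetical, and BoolNet's own implementation may differ in its details.

```r
# A transition function stored as its truth-table output column.
# Example: f1 = x2 & x3 over the inputs (x2, x3) = 00, 01, 10, 11.
f1 <- c(0L, 0L, 0L, 1L)

# Flip one randomly chosen bit of the truth table.
bitflip <- function(f) {
  i <- sample.int(length(f), size = 1)
  f[i] <- 1L - f[i]
  f
}

set.seed(65)  # arbitrary seed
f1_perturbed <- bitflip(f1)
print(rbind(original = f1, perturbed = f1_perturbed))
```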

Generate random networks
- Generate a random network with as many nodes and edges as your original network.
- Find the attractors in this network.
- Perturb the random network 1000 times: how many times are the original attractors of the random network found in the perturbed networks?
- Repeat 1000 times.

    data(cellcycle)
    perturbednet <- perturbNetwork(cellcycle, perturb="functions", method="bitflip")

Boolean network modeling: positive
- Explains the dynamic behavior of living systems efficiently (with possible loops!).
- Boolean algebra provides a rich set of algorithms already available for supervised learning in the binary domain, such as logical analysis of data and Boolean-based classification algorithms.
- Dichotomization to binary values improves the accuracy of classification and simplifies the obtained models by reducing the noise level in experimental data.

Boolean network modeling: negative
- Requires heavy computing time to construct a network structure.
- Needs specific time-course data that capture the pathway interactions well, but it is often uncertain whether such time points exist and, if so, whether they were captured well by the time-course experiment.
- Needs a relatively large number of time points.
- Loses quantitative information through dichotomization to binary values.