Supplementary Information


Supplementary Information for the article "The organization of transcriptional activity in the yeast, S. cerevisiae" by I. J. Farkas, H. Jeong, T. Vicsek, A.-L. Barabási, and Z. N. Oltvai. For the Referees' convenience, we are attaching the supplementary material containing additional results (including numerical values) and details on the methodology that are necessary to reproduce our findings. Upon publication, this material will also be available on our designated website.

Web Note A - Additional Results

Table of Contents:

p. 4 Web Fig. A (Statistical analysis of data sets) The averages and standard deviations of the genes' and transcriptomes' expression level values in the control and the perturbation data sets.
p. 5 Web Fig. B (Statistical analysis of the control data set) Analysis of the distributions displayed by the genes' expression level values in the control data set.
p. 6 Web Note A1 (Statistical analysis of data sets) The average extent by which the different gene deletions alter the genomic expression program of S. cerevisiae cells.
p. 6 Web Fig. C (Control analysis to Figs. 1&2) Analysis of numerical artifacts in the stepwise correlation search method.
p. 7 Web Fig. D (Control analysis to Figs. 1&2) Analysis of the structural properties of the similarity graph obtained by the correlation search method when changing the empirical parameters in the algorithm.
p. 8 Web Fig. E (Control analysis to Figs. 1&2) Analysis of transcriptome similarities in the perturbation data set using a random reordering of genes.
p. 9 Web Fig. F (Control analysis to Figs. 1&2) Analysis of transcriptome similarities in the perturbation data set using the hierarchically clustered order of genes.
p. 10 Web Fig. G (Control analysis to Figs. 1&2) Analysis of transcriptome similarities in the perturbation data set when genes are listed in the descending order of the standard deviations of their expression level values.

p. 12 Web Fig. H (Control analysis to Figs. 1&2) Analysis of transcriptome similarities in the perturbation data set with a modified version of the correlation search algorithm containing overlapping gene regions.
p. 13 Web Fig. I (Control analysis to Figs. 1&2) The similarity graph predicted for the wild-type (control) data set by the correlation search algorithm.
p. 14 Web Fig. J (Additional analysis to Figs. 1&2) Analysis of the microarray data set published by Kim et al.1 reveals a scale-free structure of the transcriptome similarity graph in Caenorhabditis elegans.
p. 15 Web Fig. K (Control analysis to Fig. 3b) After randomly reordering the genes of the perturbation data set, the list of transcriptomes/deleted genes with the highest numbers of connections on the similarity graph is highly similar to the original result of Fig. 3b in the manuscript, where genes were listed in alphabetical order.

[Web Fig. A panels: (a, c) average expression level and (b, d) standard deviation of expression level for each gene in the control and perturbation data sets, with genes in alphabetical order (a, c) or sorted to obtain the descending order of values (b, d); (e-h) the same quantities for each transcriptome.]

Fig. A The averages and standard deviations of the genes' and the transcriptomes' expression level values in the control and the perturbation data sets. Averages ("offsets") of the genes' expression level values and the descending sequence of the standard deviations of the genes' expression levels in the control (a, b) and the perturbation (c, d) data sets. Observe that ~ genes display significantly higher standard deviations of expression level than the remaining ones (subfigure b); for the reader's convenience, an arrow indicates the dividing line. Using (b) as a control for subfigure (d), we find that the number of genes with strongly varying expression values in the perturbation data set is ~15, which is well below the size of the complete yeast transcriptome. The average expression level values and the descending sequence of expression level standard deviations in transcriptomes of the control (e, f) and the perturbation (g, h) data sets. Using (f) as a control for (h), we find that the amount by which the expression level in individual perturbation transcriptomes varies is highly different. These results indicate that in the perturbation experiments, transcriptional responses are localized, but the level of localization varies strongly. A detailed quantitative analysis is presented in Web Fig. C.

[Web Fig. B panels (a, b): χ²measured/χ²normal vs. the sorted order of genes, with the ratio = 1 line indicated.]

Fig. B Analysis of the distributions displayed by the genes' expression level values in the control data set. (a) Using a standard χ²-test with a confidence level of 95%, the hypothesis that a given gene's expression level values are normally distributed was rejected for 3.7% of all genes. Shown are the ratios of the measured vs. normal χ² values for the genes, with the χ² values in ascending order. (The χ²-test rejects the hypothesis if the χ² ratio for that gene is above 1.) (b) To remove suspected experimental errors, we have filtered the control data set. After filtering, the statistical test still rejected the hypothesis of a normal distribution of the steady-state expression level values for 15.7% of all genes. (During filtering, we first measured the average expression level of each gene. Next, for each gene, we removed points farther from the average expression level than 3 times the radius of the major population, 90%, of the points.)
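The test and filter of Fig. B can be sketched as follows. This is our own illustration, not the authors' code: the function names, the bin count, and the χ² critical value (95%, taken here for 7 degrees of freedom) are assumptions.

```python
import math
import numpy as np

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi2_normality_ratio(x, bins=8, crit=14.07):
    """Chi-squared goodness-of-fit ratio against a fitted normal.
    crit is an assumed 95% critical value (bins-1 = 7 d.o.f.);
    a ratio above 1 rejects the hypothesis of normality."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    # transform through the fitted normal CDF: normal data become uniform
    u = np.array([normal_cdf((v - mu) / sd) for v in x])
    obs, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    expected = len(x) / bins                 # equal-probability bins
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return chi2 / crit

def filter_gene(x, frac=0.90, k=3.0):
    """Drop points farther from the gene's average than k = 3 times the
    radius of the central 90% of the points, as described in Fig. B(b)."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x - x.mean())
    return x[d <= k * np.quantile(d, frac)]
```

With the expression vector of each gene in hand, one would compute the ratio for every gene before and after filtering and report the fraction of genes with a ratio above 1.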

Note A1 The average extent by which the different gene deletions alter the genomic expression program of S. cerevisiae cells. We used the wild-type average expression level (A_i^wt) and standard deviation (Sigma_i^wt) of gene i as obtained from the filtered data set analyzed on panel b of Web Fig. B. Next, we computed for all e_ij expression level values in both the complete control and perturbation data sets the normalized difference, e~_ij = (e_ij - A_i^wt) / Sigma_i^wt, of the expression level value from the gene's wild-type average. Considering a relative 3-fold up- or downregulation to be significant, we found that the average percentage of significantly up- or downregulated genes per transcriptome is 1.3 ± 1.87% in the control data set vs. 10.1 ± 11.4% in the perturbation data set. Thus, while the ratio of up- or downregulated genes per transcriptome in the perturbation data set is significantly higher than in the control set, on average transcriptional responses and genetic noise combined involve only about one tenth of all genes.

[Web Fig. C: similarity graph of the scrambled perturbation data set; the two labeled nodes are erg3 and ymr58c.]

Fig. C Analysis of numerical artifacts in the stepwise correlation search method. To test the effect of numerical artifacts on our results, we performed the stepwise correlation search method on a modified version of the perturbation data set, in which all transcriptomes were scrambled independently to remove all possible similar patterns among transcriptomes. In the resulting data matrix, the e_ij values of any row contained data points measured for different genes. The two strongest edges in the graph predicted for this case reached the similarity scores C=.59 and C=.5 (both colored in yellow). Comparing these results to the similarity graph displayed in the top layer of Fig. 1c of the paper, where 16 connections above C=.9 were found, we conclude that the effect of numerical artifacts on our results is negligible.
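The per-transcriptome statistic of Note A1 can be computed directly. The sketch below is ours (variable names are assumptions), reading the 3-fold criterion as |e~_ij| >= 3:

```python
import numpy as np

def regulated_percent(e, wt_mean, wt_std, thresh=3.0):
    """Percentage of significantly regulated genes per transcriptome.
    e: (N genes x m experiments) matrix of expression values;
    wt_mean, wt_std: per-gene wild-type averages A_i^wt and standard
    deviations Sigma_i^wt from the filtered control set.
    A gene counts as regulated in experiment j when the normalized
    difference |e~_ij| = |(e_ij - A_i^wt) / Sigma_i^wt| exceeds thresh."""
    e = np.asarray(e, dtype=float)
    a = np.asarray(wt_mean, dtype=float)[:, None]
    s = np.asarray(wt_std, dtype=float)[:, None]
    z = (e - a) / s                       # normalized differences e~_ij
    return 100.0 * (np.abs(z) >= thresh).mean(axis=0)
```

Averaging the returned per-column percentages over the control and perturbation matrices separately gives the two figures quoted in the note.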

[Web Fig. D panels (a-f): the fit measures X and Y vs. s (window size), u (u-th largest value picked), and C (similarity threshold).]

Fig. D Irrespective of the set of empirical parameters, the similarity graph supplied by the correlation search algorithm is most closely described by a scale-free graph. We compared the similarity graph obtained by the correlation search method to the three idealized test graphs using the spectral tests outlined below. Results for the three test graphs are given in the same colors as below: blue (random graph), green (small-world graph) and red (scale-free graph). For both quantities characterizing the quality of fit in the spectral comparison - X, testing the localization of the first eigenvector, and Y, testing the closeness of eigenvalues - smaller values indicate a better agreement between the structural properties of the similarity graph and the given test graph. The parameters used on Fig. 2 of the paper were s=3, u=1 and C=.7. Compared to the original parameter set, here only one parameter was changed for each column of subfigures. In the first column of subfigures (a, d) the size of the gene segment, s, was varied. For the analysis in the second column (b, e), u was changed. For the third column of subfigures (c, f), the similarity threshold, C, was changed. The best fit is given everywhere by the scale-free graph.
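The localization measure used in the spectral tests (the inverse participation ratio plotted in panels d-f of several figures below) can be sketched as follows; this is our own minimal version, assuming unit-normalized eigenvectors of the adjacency matrix:

```python
import numpy as np

def spectrum_and_ipr(adj):
    """Return the eigenvalues of a symmetric adjacency matrix and the
    inverse participation ratio I = sum_i v_i^4 of each unit-normalized
    eigenvector v.  I is near 1/N for an eigenvector spread evenly over
    the N vertices, and near 1 for one localized on a few vertices."""
    vals, vecs = np.linalg.eigh(np.asarray(adj, dtype=float))
    return vals, (vecs ** 4).sum(axis=0)   # eigh returns unit columns
```

For example, on the complete graph with three vertices the largest eigenvalue is 2 and its eigenvector is uniform, so its inverse participation ratio is 1/3.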

[Web Fig. E panels: (a) similarity graphs at C=.9, .84, .88 and .8; (b) enlarged graph at C=.8; (c) number of links vs. sorted order of vertices; (d-f) inverse participation ratio vs. eigenvalue for the random, small-world and scale-free test graphs; (g) number of occurrences vs. sorted order of gene windows.]

Fig. E Analysis of transcriptome similarities in the perturbation data set using a random reordering of genes. As an additional test we randomly reordered the genes in the perturbation set prior to the analysis. Note that the similarity graph obtained by the correlation search method on the reordered data is not altered significantly (compare to Figs. 1b and 1c of the paper). (a) Each node (vertex) of the graph represents a transcriptome, and two transcriptomes are connected if they were found to contain region(s) of similarity in their expression patterns. The similarity graph is obtained for increasing similarity score thresholds C=.8, C=.84, C=.88 and C=.9. (b) Enlarged view of the graph obtained for C=.8. The graph is rich in loops and strongly interconnected groups of experiments. The close similarity between this graph and the one shown on Fig. 1c indicates that the random reordering of genes has no significant effect on our results. (c) The degree sequence of the largest component of

the measured graph (black), and an idealized random graph (blue), small-world graph (green), or scale-free graph (red) are shown at C=.7. (d, e, f) Spectral comparison of the data graph and the three idealized test graphs. Each plot shows the inverse participation ratios of the graphs' eigenvectors vs. the corresponding eigenvalues of the graph. (g) The frequency at which a given segment shows similarity between two transcriptomes vs. the sorted order of gene windows (C=.7). The fitted line is a power-law with the exponent .37.

[Web Fig. F panels: (a) similarity graphs at C=.9, .84, .88 and .8; (b) the enlarged graph at C=.8, with nodes including pet111, msu1, cyt1, ymr93c, rml, imp, ecm18, erg4, ERG11, vps8, fus, sir3, sir, ymr141c, pfd, sir4, ste, yor51c, yor78w, rps7b and ymr14w.]

Fig. F Analysis of transcriptome similarities in the perturbation data set using the hierarchically clustered order of genes. To test whether grouping similar expression level values together in each experiment affects our results, we performed the correlation search algorithm on the perturbation data set after a hierarchical clustering of genes. Compared to any other ordering of the genes tested here, hierarchical clustering produces a very small and weak network structure. Since the correlation search method compares the expression level values of adjacent genes in the experiments, the smoothing of expression levels by ordering similar expression values close to each other makes it harder for the algorithm to detect characteristic changes. Shown are (a) the network at different similarity thresholds and (b) enlarged for C=.8. (Note that the network is too small for a meaningful structural analysis.) Observe also that the main hubs of the graph predicted previously (see Fig. 1c of the paper) are still present, suggesting that the only effect of the hierarchical clustering of genes on the correlation search technique was a shift in the value of the similarity threshold, C.

[Web Fig. G panels: (a) similarity graphs at C=.9, .84, .88 and .8; (b) enlarged graph at C=.8; (c) number of links vs. sorted order of vertices; (d-f) inverse participation ratio vs. eigenvalue for the random, small-world and scale-free test graphs; (g) number of occurrences vs. sorted order of gene windows.]

Fig. G Analysis of transcriptome similarities in the perturbation data set when genes are listed in the descending order of the standard deviations of their expression levels. To test whether the similarities detected by the correlation search method are due to a small number of genes strongly up- or downregulated on the given transcriptome segment, or are due to a broad range of similarly regulated genes, we listed the genes of the perturbation data set in the descending order of their expression levels' standard deviations measured across the 87 experiments. (a) The transcriptome similarity graph at different C similarity thresholds and (b) enlarged for C=.8. Note the strong similarity of the obtained graph to the one shown in the manuscript. Both the analysis of the (c) degree sequence and the spectral comparison (d, e, f) show that the closest description for the transcriptome network is given by the scale-free model. In addition, the frequencies at which individual transcriptome segments hold similar expression patterns (g) display a scale-free distribution. The fitted line is a power-law with the exponent .3.

[Web Fig. H panels: (a) similarity graphs at C=.9, .84, .88 and .8; (b) enlarged graph at C=.8; (c) number of links vs. sorted order of vertices; (d-f) inverse participation ratio vs. eigenvalue for the random, small-world and scale-free test graphs; (g) number of occurrences vs. sorted order of gene windows.]

Fig. H Analysis of transcriptome similarities in the perturbation data set with a modified version of the correlation search algorithm containing overlapping gene regions. As an additional test we increased the segment size and allowed the segments to overlap in the perturbation set prior to the analysis (see Web Note B for details). Note that the similarity graph obtained by this slightly altered correlation search method provides results that are comparable to those in Fig. 1 of the paper. (a) Each node (vertex) of the graph represents a transcriptome, and two transcriptomes are connected if they were found to contain region(s) of similarity in their expression patterns. The similarity graph is obtained for increasing similarity score thresholds C=.8, C=.84, C=.88 and C=.9. (b) Enlarged view of the graph obtained for C=.8. Note that the graph is rich in loops and strongly interconnected groups of experiments. Strongly connected experiments of the original graph (see Fig. 1c of the paper) are usually strongly connected here, too. (c) The degree sequence of the largest component of the measured graph (black), and an idealized random graph (blue), small-world graph (green), or scale-free graph (red) is shown at C=.7. (d, e, f) Spectral comparison of the data graph and the three idealized test graphs at C=.7. Each plot shows the inverse participation ratios of the eigenvectors vs. the corresponding eigenvalues of the graphs. (g) The frequency at which a given transcriptome segment shows similarity between two transcriptomes vs. the sorted order of the segments (C=.7). The fitted line is a power-law with the exponent .37.

[Web Fig. I: similarity graph of the control data set; the labeled nodes are EXP14 and EXP.]

Fig. I The similarity graph predicted for the control data set by the correlation search algorithm, obtained with the same parameters as Fig. 1c of the paper (s=3, u=1 and C=.8). Labels show the indices of wild-type transcriptomes. Observe that the number and strength of similarities is well below those predicted for the perturbation data set. On the other hand, finding a connection above the C=.8 threshold indicates that the strongest similarities in the control data set are still much stronger than numerical artifacts, which usually do not yield connections stronger than C=.5-.6 (see Web Fig. D).

[Web Fig. J panels: (a) similarity graphs at increasing thresholds; (b) enlarged graph; (c) number of links vs. sorted order of vertices; (d-f) inverse participation ratio vs. eigenvalue for the random, small-world and scale-free test graphs; (g) number of occurrences vs. sorted order of gene windows.]

Fig. J Analysis of the microarray data set published by Kim et al.1 reveals a scale-free structure of the transcriptome similarity graph of Caenorhabditis elegans. (a) Each node (vertex) of the graph represents a transcriptome, and two transcriptomes are connected if they were found to contain genes strongly up- or downregulated in both experiments. The similarity graph is obtained for increasing similarity score thresholds C=.9, C=.92, C=.94 and C=.96. (b) Enlarged view of the graph obtained for C=.9. Note that the graph is rich in loops and strongly interconnected groups of experiments. (c) The degree sequence of the largest component of the measured graph (black), and an idealized random graph (blue), small-world graph (green), or scale-free graph (red) is shown at C=.8. (d-f) Spectral comparison of the data graph and the three idealized test graphs at C=.8. Each plot shows the inverse participation ratios of the eigenvectors vs. the corresponding eigenvalues of the graphs. (g) The frequency at which a given gene shows similarity between two transcriptomes vs. the sorted order of genes (C=.8). The fitted line is a power-law with the exponent .1.

[Fig. K table: columns list the transcriptomes/deleted genes with the highest numbers of connections k, for the original and for the randomly reordered data sets; the leading entries include yel8w (36), gcn4 (34), sir (34), swi4 (34), jnm1 (33), yer83c (3), vps8 (31), ubr1 (31), ste4 (3), hda1 (9), ...]

Fig. K After randomly reordering the rows (genes) of the perturbation data set, the list of transcriptomes/deleted genes with the highest numbers of connections on the similarity graph is highly similar to the original result of Fig. 3b in the paper.

Web Note B - Detailed Methods

Table of Contents:

1. Data preparation
p. 16 Data source and construction of microarray matrices
2. Stepwise correlation search method
p. 18 Description of the stepwise correlation search algorithm
p. 20 Testing the stepwise correlation search technique by applying it to reordered versions of the perturbation data matrix
p. 21 Testing the stepwise correlation search technique by allowing overlapping transcriptome segments
3. Analysis of the similarity graph provided by the stepwise correlation search method
p. 22 Description of the random graph models used to test the structure of the predicted similarity graph
p. 22 Tools of the spectral analysis
p. 23 Quality of fit in the spectral comparison of the data graph and the test graphs
p. 23 Testing the structure of the similarity graph by using alternative parameter sets

1. Data preparation

Data source and construction of microarray matrices

We used the publicly available microarray data set of Friend and colleagues3; the files control_expts1-63_ratios.txt and data_expts1-3_ratios.txt were used from the downloaded data package. As a first step, we arranged the 63 control and the 87 perturbation microarray data sets into two separate matrices, with the expression level values of gene i listed in the ith row, and the expression level values of the jth measurement listed in the jth column. (The 87-experiment data set was obtained by keeping only the data of single-gene deletion mutant strains out of the 3 set.) In both cases, we ordered rows and columns as they were listed in the original data files. Genes (rows) were listed alphabetically; transcriptomes (columns) were listed in the temporal order of the experiments in the control data set and alphabetically in the perturbation set (see Web Fig. A). For the statistical characterization of the two matrices we use the following notation. In either case, the data matrix, e, has N rows (each containing the expression levels of one gene) and m columns (each containing the expression levels of all genes in one experiment, i.e., one measured transcriptome). The expression level of the ith gene in the jth array is e_ij, and the average expression level of this gene throughout the m arrays is

A_i = (1/m) sum_{j=1..m} e_ij.

The standard deviation of the expression level of the same gene is

Sigma_i = sqrt[ (1/m) sum_{j=1..m} (e_ij - A_i)^2 ].

The average expression level of genes in the jth array is

a_j = (1/N) sum_{i=1..N} e_ij,

and the standard deviation of the expression level values in the same array is

sigma_j = sqrt[ (1/N) sum_{i=1..N} (e_ij - a_j)^2 ].

The raw data files contain base 10 logarithmic values. A value of, e.g., .5 indicates the upregulation of a gene's expression level by a factor of 10^.5 = 3.16, and a value of -.7 means downregulation to the 10^-.7 = .2 part of the expression level.
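The conversion between the logged values and linear fold changes is simply a power of 10; a quick illustration (our own, not part of the data package):

```python
def fold_change(log10_ratio):
    """Linear fold change corresponding to a base-10 logged ratio."""
    return 10.0 ** log10_ratio

# a logged value of .5 is ~3.16-fold upregulation, and -.7 is
# downregulation to ~.2 of the original level
print(round(fold_change(0.5), 2), round(fold_change(-0.7), 2))  # prints: 3.16 0.2
```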
From the mathematical point of view, there are several possible scales that could be used for the analysis of these data sets. As an example, the linear data scale would mean using the value 3.16 instead of .5, and the value .2 instead of -.7. We decided to use the original, base 10 logarithmic data scale3 for three reasons. First, in most biological systems, where experimental data span several orders of magnitude, the data scale most readily applicable for description is the logarithmic scale. Second, in both raw data sets (the 63 control and the 87 perturbation subsets) the data are approximately centered around 0, which minimizes the accumulation rate of data error throughout the computational analysis shown below. Third, microarray

data are produced by measuring light intensities, for which the physically relevant data scale is again logarithmic.

It is important to note that the raw data matrices are not complete. First, in both downloaded data sets, fewer than m data points are available for many genes. Another discrepancy observed in the data files was that upregulation by more than a base 10 logarithmic ratio of 2 (i.e., more than 100-fold upregulation) is always indicated by +2, and downregulation by more than -2 (i.e., more than 100-fold downregulation) is always indicated by -2. In other words, the data set contains an experimental cutoff at +2 and -2. If a row contains a missing e_ij, an e_ij = +2 or an e_ij = -2 value, then the simplest approach is to remove this row (i.e., to remove this gene from the data set). An alternative approach is to treat each e_ij = +/-2 value as unknown, too, and to keep track of all unknown values throughout the analysis. For the 87-experiment perturbation data set, the first approach (i.e., the removal of genes for which fewer than 87 measured values are available) would discard more than one fourth of the complete transcriptome. As our aim was to carry out global analyses of transcriptomes, we decided to use the second approach (i.e., to keep all rows of the matrix and to keep track of missing values) throughout our analyses. However, in all cases, we removed repeated open reading frame (ORF) names and those ORF names that did not follow the common 'Y number' nomenclature used to designate individual yeast ORFs. Following this selection, the number of rows (i.e., individual genes) was 687 for the control data matrix, and 68 for the perturbation data set.

2. Stepwise correlation search method

Description of the stepwise correlation search algorithm

The stepwise correlation search technique in this paper uses full-transcriptome data sets as input, and searches the columns (i.e., the transcriptomes) of the data set for groups of expressed genes displaying similar expression patterns.
For each pair of compared transcriptomes, the groups of similarly expressed genes are allowed to be different. Since the suggested algorithm compares each pair of transcriptomes individually, the results are independent of the order in which the transcriptomes are listed. For the analyses shown in the paper (see Figs. 1-4), genes were listed in the alphabetical order of their open reading frame (ORF) names. We have also shown that the predicted similarity graph changes only slightly, and that the graph's statistical properties are identical, when the genes of the data set are randomly reordered. When searching for groups of similarly expressed genes in two transcriptomes, ideally, one should test all possible subsets of the N genes. Unfortunately, the number of all possible subsets of

size s in a set of N genes is the binomial coefficient (N choose s) = N! / (s!(N-s)!), which grows too rapidly with N to enable the testing of all subsets. However, in practice, the number of all co-regulated subsets of genes is usually far below this number. Here we introduce a method that reduces the computational time from the ideally necessary binomial to linear in N. The basic tool of the algorithm is a sliding transcriptome segment (i.e., a small group of sequentially listed genes) that is used to select a small number of genes and check whether they show a similar expression profile in the two transcriptomes being compared. To search for correlations among transcriptomes, we compare each pair of transcriptomes individually. For one transcriptome pair, we first find the list of genes with known expression level values in both transcriptomes (in the downloaded data files, we called a value known if it was not missing and was not +2 or -2). Next, we define a sliding segment with size s, and place this segment on the first s genes with known expression values in both transcriptomes. The two data sets to be compared are now the 1st, 2nd, ..., s-th gene expression level values of the first transcriptome and the 1st, 2nd, ..., s-th gene expression level values of the second transcriptome. We label these two sets (two vectors) by e_1 = {e_{1,1}, e_{1,2}, ..., e_{1,s}} and e_2 = {e_{2,1}, e_{2,2}, ..., e_{2,s}}, respectively. Next, we compute the mean values (m_1 and m_2) and standard deviations (sigma_1 and sigma_2) of these two vectors:

m_1 = (1/s) sum_{j=1..s} e_{1,j},    sigma_1 = sqrt[ (1/s) sum_{j=1..s} (e_{1,j} - m_1)^2 ],

and similarly for m_2 and sigma_2. For the measure of similarity, C_{1,2}, between the vectors e_1 and e_2 we used the absolute value of the (Pearson) correlation:

C_{1,2} = | (s sigma_1 sigma_2)^{-1} sum_{j=1..s} (e_{1,j} - m_1)(e_{2,j} - m_2) |.

We used the absolute value because two biological signals with the mathematical correlation -1 (changing in exactly the opposite way) are coupled with the same strength as two with the correlation +1 (changing in exactly the same way).
After saving the obtained value for C_{1,2}, we move the segment with a step size of s (in the paper, we used s=3); therefore, the second segment contained the (s+1)-th, (s+2)-th, ..., (2s)-th genes, the third segment contained the (2s+1)-th, (2s+2)-th, ..., (3s)-th genes, etc. Note that the segments cover the entire genome, but they do not overlap. We defined the similarity score of the two transcriptomes as the u-th (in the paper, u=1) largest C_{1,2} value obtained for the given two transcriptomes. Having computed the similarity score for each transcriptome pair (an m x m symmetrical matrix), we used a constant threshold, C, to decide which transcriptome pairs are coupled strongly enough (in the paper, C is varied between .8 and .9). If the similarity score for a given pair of transcriptomes was above C, then the two points of the graph corresponding to these two transcriptomes were connected. Fig. 1c in the paper shows the graph for the parameters s=3, u=1, C=.8. (Note that only transcriptomes with at least one connection are shown.)
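The scoring step described above can be condensed into a short routine. This is our own sketch, not the authors' code: NaN stands in for missing/cutoff values, and names and tie handling are assumptions.

```python
import numpy as np

def similarity_score(t1, t2, s=3, u=1):
    """Similarity score of two transcriptomes (expression vectors) by the
    stepwise correlation search: drop genes unknown in either vector,
    slide a non-overlapping window of s genes, compute the absolute
    Pearson correlation in each window, and return the u-th largest value."""
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    ok = ~np.isnan(t1) & ~np.isnan(t2)          # genes known in both
    x, y = t1[ok], t2[ok]
    scores = []
    for start in range(0, len(x) - s + 1, s):   # step size s: no overlap
        a, b = x[start:start + s], y[start:start + s]
        sa, sb = a.std(), b.std()
        if sa == 0.0 or sb == 0.0:              # flat window: skip
            continue
        c = abs(((a - a.mean()) * (b - b.mean())).mean() / (sa * sb))
        scores.append(c)
    scores.sort(reverse=True)
    return scores[u - 1] if len(scores) >= u else 0.0
```

Running this for every transcriptome pair fills the m x m score matrix, which is then thresholded at C to draw the similarity graph.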

The correlation search algorithm performs a subspace search in the space of transcriptome vectors independently for each pair of transcriptomes, with the aim of finding groups of similarly expressed genes in the two transcriptomes. Thus, we expect it to find response patterns displayed by a small number of genes under a small number of conditions more easily than methods targeted at the block-diagonalization of distance matrices derived from full-genome transcriptomes 4. Further, the correlation search algorithm is a direct subspace search method, as opposed to biclustering 5, which is an iterative algorithm that searches for local minima in the space of all possible submatrices by making short steps along the steepest descent. For data sets containing a high number of characteristic expression patterns extending over almost all genes used in the analysis, we expect the correlation search technique to be comparable in its results and speed to biclustering combined with a refined method for localizing minima, e.g., simulated annealing 6, 7. However, for the analysis of microarray data sets containing a small number of similarities displayed under a small number of conditions (e.g., localized transcriptional responses in large data sets), we expect the correlation search algorithm to be more suitable. In summary, the major strength of the stepwise correlation search method is that it compares each pair of transcriptomes individually and, for each transcriptome pair, it allows similar patterns to appear on different groups of expressed genes. Thus, it is able to detect a high variety of shared similarities among experiments.

Testing the stepwise correlation search technique by applying it to reordered versions of the perturbation data matrix

To analyze the effect of the order of genes on our results, we have performed four tests. In each test, the stepwise correlation search algorithm was applied to a modified version of the matrix, e, of the perturbation data set.
First, we intended to test the role of numerical artifacts in the predicted similarity graph shown in Fig. 1c of the paper, and scrambled the expression values in each transcriptome of the perturbation data set independently. In the resulting matrix, the e_ij values in any of the rows were expression values experimentally measured for different genes. Having performed the correlation search algorithm on this modified data set with unchanged parameter values (s=30, u=1 and C=0.8), we found that no pair of transcriptomes reached the similarity score C=0.8. The three highest similarity scores measured were C=0.59, C=0.50 and C=0.49 (see Web Fig. D). In comparison, the original data set yielded 16 pairs of transcriptomes with a similarity score above C=0.9. We conclude that the similarities detected by the correlation search method are not numerical artifacts, but genuine similarities between transcriptomes.
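The independent-scrambling control can be mimicked on a toy matrix (made-up data with rows as genes and columns as transcriptomes; not the actual perturbation data set):

```python
import numpy as np

# Toy matrix (made-up data): rows are genes, columns are transcriptomes.
rng = np.random.default_rng(1)
e = rng.normal(size=(300, 4))
e[:, 1] = e[:, 0] + 0.05 * rng.normal(size=300)   # two genuinely similar columns

scrambled = e.copy()
for j in range(scrambled.shape[1]):                     # scramble each transcriptome
    scrambled[:, j] = rng.permutation(scrambled[:, j])  # independently of the others

corr_real = abs(np.corrcoef(e[:, 0], e[:, 1])[0, 1])
corr_scr = abs(np.corrcoef(scrambled[:, 0], scrambled[:, 1])[0, 1])
assert corr_real > 0.9   # the genuine similarity is strong
assert corr_scr < 0.5    # scrambling reduces it to chance level
```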

In the second, third and fourth tests, we set out to analyze whether the results obtained by the stepwise correlation search method are influenced by the order of genes used in the data set. In the second test, we created another modified version of the perturbation data set by scrambling the expression values in each transcriptome identically. In the resulting matrix, each row still contained the expression values of a single gene; only the order of the genes was randomized. According to the analyses shown in Web Fig. F, the predicted similarity graph changes only slightly upon a random reordering of genes, and the statistical properties of the graph remain identical. In the third test (see Web Fig. G), we listed genes in the hierarchically clustered order. Since the hierarchical clustering technique groups similar expression values close to each other in each transcriptome, the significant differences between neighboring expression levels in each column of the microarray matrix are smoothed, and the resulting similarity scores are much weaker than before. Observe, however, that the main hubs of the graph predicted previously (see Fig. 1e of the paper) are still present, suggesting that the only effect of the hierarchical clustering of genes on the correlation search technique was a shift in the value of the similarity threshold, C. In the fourth test (see Web Fig. H), we listed genes in the descending order of their expression level variances. The similarity graph predicted for this case was again almost identical to the graph predicted for the original case (see Fig. 1e of the paper). In other words, the similarity graph found by the correlation search method represents not merely the effect of a few genes with large expression level changes, but rather the combined effect of genes with large, medium and small expression level variances forming a seamless continuum.

Testing the stepwise correlation search technique by allowing overlapping transcriptome segments

In Web Fig.
I, we used a slightly modified version of the correlation search technique, where transcriptome segments are allowed to overlap. When selecting two transcriptomes to be compared, the first segment is placed on the 1., 2., ..., s. genes, as before; however, the second segment contains the (t+1)., (t+2)., ..., (t+s). genes, the third segment contains the (2t+1)., (2t+2)., ..., (2t+s). genes, etc. Note that if t<s, then adjacent gene segments will overlap: the (t+1)., (t+2)., ..., s. genes will be contained by both the first and the second segments. In Web Fig. I we used s=60, t=15, u=1 and C=0.8, and found that transcriptomes strongly connected in the original graph (Fig. 1e of the paper) are usually strongly connected here, too. In addition, we have performed all the analyses shown in the paper and found that the statistical properties of the similarity graph remained identical.
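The only change relative to the non-overlapping version is that segment start positions advance by t instead of s. A minimal sketch of the window bookkeeping (illustrative helper window_starts; the values s=60 and t=15 follow the text as read from this copy):

```python
def window_starts(n_genes, s, t):
    """Start indices of length-s segments advancing by step t; with t < s,
    adjacent segments overlap in s - t genes."""
    return list(range(0, n_genes - s + 1, t))

starts = window_starts(100, s=60, t=15)
assert starts == [0, 15, 30]
# The first two windows, [0, 60) and [15, 75), share genes 15..59:
shared = set(range(0, 60)) & set(range(15, 75))
assert len(shared) == 60 - 15
```

Setting t = s recovers the non-overlapping scheme used in the main analysis.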

3. Analysis of the similarity graph provided by the stepwise correlation search method

Description of the random graph models used to test the structure of the predicted similarity graph

To test the structure of the similarity graph computed for a certain set of parameters, we compared it to three random graph models. In the uncorrelated random graph (also called the random graph 8), n_e edges connect randomly chosen pairs of the graph's n_v vertices. For the small-world graph 9, one starts with n_v vertices placed along the perimeter of a circle, connects each vertex to its z nearest neighbors, with z being the closest integer to 2 n_e / n_v, and then rewires a randomly chosen proportion, p_r, of all edges. (We used p_r = 0.1 everywhere.) For the scale-free graph 10, we performed the iteration mimicking the growth of the network for t time steps. During one time step, a new vertex was added with probability p_v, and a new edge was added with probability 1 - p_v. In the case of a new edge, the first vertex, i, to be connected was chosen randomly, and the probability, Π_j, of connecting vertex i to another vertex, j, was defined using the degree, k_j (i.e., the number of links), of vertex j as Π_j = k_j / Σ_l k_l, representing a linear preference for vertices with a higher number of connections. The number of edges and the number of vertices had to be identical to those in the data graph; thus, the two parameters of the model were t = n_e + n_v and p_v = n_v / (n_e + n_v).

Tools of the spectral analysis

Consider a simple graph with N vertices (nodes), i.e., a graph where none of the N vertices is connected to itself, and all connections are undirected and have the same weight, 1. The adjacency matrix of this graph is a symmetrical N x N square matrix, A, with A_ij = 1 if the ith and jth vertices of the graph are connected, and A_ij = 0 if they are not. The diagonal entries of the adjacency matrix are all zeroes: A_ii = 0 for each i = 1, 2, ..., N.
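A sketch of the scale-free growth model and of the adjacency-matrix representation just defined; the single seed edge used as the initial condition is an assumption, since the text does not specify one:

```python
import random
import numpy as np

def scale_free_graph(n_v, n_e, seed=0):
    """Grow a graph step by step: with probability p_v = n_v/(n_v + n_e) add
    a new vertex, otherwise add an edge i--j with i uniform and j drawn in
    proportion to its degree k_j (linear preferential attachment).
    Starts from a single seed edge (an assumed initial condition)."""
    rng = random.Random(seed)
    p_v = n_v / (n_v + n_e)
    vertices, edges, stubs = [0, 1], [(0, 1)], [0, 1]
    while len(vertices) + len(edges) < n_v + n_e:
        if rng.random() < p_v:
            vertices.append(len(vertices))   # new, initially isolated vertex
        else:
            i = rng.choice(vertices)
            j = rng.choice(stubs)            # probability proportional to degree
            while j == i:                    # keep the graph simple: no self-loops
                j = rng.choice(stubs)
            edges.append((i, j))
            stubs += [i, j]
    return vertices, edges

def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix with a zero diagonal."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

verts, edges = scale_free_graph(n_v=50, n_e=100)
A = adjacency_matrix(len(verts), edges)
assert len(verts) + len(edges) == 150   # the vertex + edge total is exact
assert np.all(A == A.T) and np.trace(A) == 0.0
```

In this sketch only the total number of vertices plus edges is matched exactly; the individual counts fluctuate around n_v and n_e, and any duplicate edges are collapsed by the 0/1 adjacency matrix.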
The spectrum of the graph is the set of eigenvalues of the graph's adjacency matrix. For a simple graph, the eigenvalues are all real numbers and the eigenvectors can all be chosen to have real components. According to a recent study 11, the eigenvalues of a graph and the inverse participation ratios of the graph's eigenvectors are well suited for the structural analysis of the graph. The eigenvalues and eigenvectors of a graph are the eigenvalues and eigenvectors of the graph's adjacency matrix, A. The largest eigenvalue of a graph is called the graph's first eigenvalue, and the first eigenvector is the eigenvector belonging to the first eigenvalue. The inverse participation ratio of a normalized eigenvector is the sum of the fourth powers of the eigenvector's components. If N is the number of

components in the eigenvector e_j, and e_{j,k} is the kth component of e_j, then the inverse participation ratio, I_j, of this eigenvector is

I_j = Σ_{k=1}^{N} (e_{j,k})^4.

Since each eigenvector is normalized, i.e., Σ_{k=1}^{N} (e_{j,k})^2 = 1, the inverse participation ratio can be used to measure the number of those components of e_j that are significantly different from 0. If N-1 components of e_j are zeroes and only one differs from zero (i.e., it is equal to 1), then I_j will be 1. On the other hand, if all components of e_j are different from zero, e.g., they are all equal to 1/sqrt(N), then I_j will be 1/N. Note that any eigenvector can be treated as a set of N numbers, with the ith component of the eigenvector being written on the ith vertex of the graph. Thus, the inverse participation ratio of a given eigenvector can be used to determine whether that eigenvector is localized on a small number of vertices of the graph or not localized at all. If the inverse participation ratio of an eigenvector is high (close to 1), then the eigenvector is localized; if it is low (close to 1/N), then the eigenvector is non-localized. If an eigenvector is localized on a small number of vertices, then only those few vertices are significant for the eigenvalue of that eigenvector. On the other hand, a non-localized eigenvector shows that all vertices of the graph have approximately the same significance in determining the eigenvalue of that eigenvector.

Quality of fit in the spectral comparison of the data graph and the test graphs

In Fig. 2b-d of the paper, we plot the inverse participation ratio as a function of the corresponding eigenvalue for the similarity graph and the three test graphs: the uncorrelated random graph 8, the small-world graph 9, and the scale-free graph 10. To analyze how well the inverse participation ratio vs. eigenvalue function computed for a test graph fits the same function obtained for the similarity graph, we used the following two quantities.
X = ln(i (data) (test 1 / I ) 1 ) compares the inverse participation ratios of the data graph s and the test graph s first eigenvectors, i.e., it compares the level of structural dominance of the most highly connected vertices in the two graphs. N { j=1[ ] } / N λ (data) j =1 j Y = λ j (data) λ j (test ) ( ) compares the test graph s eigenvalues to those of the data graph. For both quantities, the lowest scores indicating the best agreement between data and test is given by the scale-free model. With the parameters used for Fig. of the manuscript, the uncorrelated random test graph gives X=.8 and Y=15., for the small-world test graph X=1.9 and Y=16.14, and for the scale-free test graph X=.5 and Y=6.5. Testing the structure of the similarity graph by using alternative parameter sets To analyze how the empirical parameters s, u and C affect the structural changes of the similarity 3

graph, we have computed the similarity graph for a broad range of parameters. For each similarity graph, we used three test graphs (with the closest possible numbers of vertices and edges), and for each test graph we computed the values of X and Y characterizing the quality of fit in the spectral comparison. Consider a 3-dimensional parameter space with the coordinates s, u and C. Starting from the point s=30, u=1, C=0.8, we scanned the parameter space in all three directions. First (see Web Fig. E a, d), we constructed the similarity graph and its test graphs with s varied between 13 and 50, and u=1, C=0.8. Next (Web Fig. E b, e), the similarity graph and its three test graphs were analyzed with u varied between 3 and 20, and s=30, C=0.8. Finally (Web Fig. E c, f), we examined the results when the parameter C was varied between 0.6 and 0.95, while s=30 and u=1 were kept constant. For each investigated point of the parameter space, the spectral comparison of the three test graphs with the data graph is shown in Web Fig. E. All subfigures analyze the quality of fit (see above) of the inverse participation ratio vs. eigenvalue plot of the graph. The first and second rows of subfigures in Web Fig. E compare the quality of fit via the values of X and Y computed for the test graphs, respectively. In all cases, a lower score means closer agreement between the structural properties of the test graph and the data graph.
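The spectral quantities used in this comparison (the spectrum, the inverse participation ratios, and the fit scores X and Y) can be sketched as follows; the function names are illustrative, and Y is computed as the mean squared difference between the sorted eigenvalue lists, as read from this copy:

```python
import numpy as np

def spectrum_and_ipr(adj):
    """Eigenvalues of the symmetric adjacency matrix and the inverse
    participation ratio I_j = sum_k (e_{j,k})^4 of each eigenvector
    (numpy's eigh returns normalized eigenvectors as columns)."""
    vals, vecs = np.linalg.eigh(np.asarray(adj, dtype=float))
    return vals, np.sum(vecs ** 4, axis=0)

def fit_quality(ipr1_data, ipr1_test, vals_data, vals_test):
    """X = |ln(I_1_data / I_1_test)|; Y is taken as the mean squared
    difference between the sorted eigenvalue lists."""
    X = abs(np.log(ipr1_data / ipr1_test))
    diff = np.sort(vals_data) - np.sort(vals_test)
    return X, float(np.mean(diff ** 2))

# Complete graph on 4 vertices: the largest eigenvalue is 3, and its
# eigenvector has all components equal to 1/sqrt(4), i.e., it is fully
# delocalized with IPR = 1/N.
K4 = np.ones((4, 4)) - np.eye(4)
vals, ipr = spectrum_and_ipr(K4)
first = np.argmax(vals)
assert np.isclose(vals[first], 3.0) and np.isclose(ipr[first], 0.25)

# A graph compared against itself fits perfectly: X = Y = 0.
X, Y = fit_quality(ipr[first], ipr[first], vals, vals)
assert X == 0.0 and Y == 0.0
```

As in the text, lower values of X and Y indicate closer structural agreement between the test graph and the data graph.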


More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Application of random matrix theory to microarray data for discovering functional gene modules

Application of random matrix theory to microarray data for discovering functional gene modules Application of random matrix theory to microarray data for discovering functional gene modules Feng Luo, 1 Jianxin Zhong, 2,3, * Yunfeng Yang, 4 and Jizhong Zhou 4,5, 1 Department of Computer Science,

More information

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky Monte Carlo Lecture 15 4/9/18 1 Sampling with dynamics In Molecular Dynamics we simulate evolution of a system over time according to Newton s equations, conserving energy Averages (thermodynamic properties)

More information

UNIVERSITY OF NORTH CAROLINA CHARLOTTE 1995 HIGH SCHOOL MATHEMATICS CONTEST March 13, 1995 (C) 10 3 (D) = 1011 (10 1) 9

UNIVERSITY OF NORTH CAROLINA CHARLOTTE 1995 HIGH SCHOOL MATHEMATICS CONTEST March 13, 1995 (C) 10 3 (D) = 1011 (10 1) 9 UNIVERSITY OF NORTH CAROLINA CHARLOTTE 5 HIGH SCHOOL MATHEMATICS CONTEST March, 5. 0 2 0 = (A) (B) 0 (C) 0 (D) 0 (E) 0 (E) 0 2 0 = 0 (0 ) = 0 2. If z = x, what are all the values of y for which (x + y)

More information

Detecting temporal protein complexes from dynamic protein-protein interaction networks

Detecting temporal protein complexes from dynamic protein-protein interaction networks Detecting temporal protein complexes from dynamic protein-protein interaction networks Le Ou-Yang, Dao-Qing Dai, Xiao-Li Li, Min Wu, Xiao-Fei Zhang and Peng Yang 1 Supplementary Table Table S1: Comparative

More information

networks in molecular biology Wolfgang Huber

networks in molecular biology Wolfgang Huber networks in molecular biology Wolfgang Huber networks in molecular biology Regulatory networks: components = gene products interactions = regulation of transcription, translation, phosphorylation... Metabolic

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors

On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors Ali Jadbabaie Department of Electrical and Systems Engineering University of Pennsylvania Philadelphia, PA 19104 jadbabai@seas.upenn.edu

More information

A Dimensionality Reduction Framework for Detection of Multiscale Structure in Heterogeneous Networks

A Dimensionality Reduction Framework for Detection of Multiscale Structure in Heterogeneous Networks Shen HW, Cheng XQ, Wang YZ et al. A dimensionality reduction framework for detection of multiscale structure in heterogeneous networks. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(2): 341 357 Mar. 2012.

More information

9. Nuclear Magnetic Resonance

9. Nuclear Magnetic Resonance 9. Nuclear Magnetic Resonance Nuclear Magnetic Resonance (NMR) is a method that can be used to find structures of proteins. NMR spectroscopy is the observation of spins of atoms and electrons in a molecule

More information

Communities, Spectral Clustering, and Random Walks

Communities, Spectral Clustering, and Random Walks Communities, Spectral Clustering, and Random Walks David Bindel Department of Computer Science Cornell University 26 Sep 2011 20 21 19 16 22 28 17 18 29 26 27 30 23 1 25 5 8 24 2 4 14 3 9 13 15 11 10 12

More information

Discovering molecular pathways from protein interaction and ge

Discovering molecular pathways from protein interaction and ge Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why

More information

Supplementary Information

Supplementary Information Supplementary Information 1 List of Figures 1 Models of circular chromosomes. 2 Distribution of distances between core genes in Escherichia coli K12, arc based model. 3 Distribution of distances between

More information

Relational Nonlinear FIR Filters. Ronald K. Pearson

Relational Nonlinear FIR Filters. Ronald K. Pearson Relational Nonlinear FIR Filters Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Thomas Jefferson University Philadelphia, PA Moncef Gabbouj Institute of Signal

More information

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013 Weighted gene co-expression analysis Yuehua Cui June 7, 2013 Weighted gene co-expression network (WGCNA) A type of scale-free network: A scale-free network is a network whose degree distribution follows

More information

Clustering compiled by Alvin Wan from Professor Benjamin Recht s lecture, Samaneh s discussion

Clustering compiled by Alvin Wan from Professor Benjamin Recht s lecture, Samaneh s discussion Clustering compiled by Alvin Wan from Professor Benjamin Recht s lecture, Samaneh s discussion 1 Overview With clustering, we have several key motivations: archetypes (factor analysis) segmentation hierarchy

More information

x y = 1, 2x y + z = 2, and 3w + x + y + 2z = 0

x y = 1, 2x y + z = 2, and 3w + x + y + 2z = 0 Section. Systems of Linear Equations The equations x + 3 y =, x y + z =, and 3w + x + y + z = 0 have a common feature: each describes a geometric shape that is linear. Upon rewriting the first equation

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms : Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA 2 Department of Computer

More information

Statistical Inference on Large Contingency Tables: Convergence, Testability, Stability. COMPSTAT 2010 Paris, August 23, 2010

Statistical Inference on Large Contingency Tables: Convergence, Testability, Stability. COMPSTAT 2010 Paris, August 23, 2010 Statistical Inference on Large Contingency Tables: Convergence, Testability, Stability Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu COMPSTAT

More information

Singular value decomposition for genome-wide expression data processing and modeling. Presented by Jing Qiu

Singular value decomposition for genome-wide expression data processing and modeling. Presented by Jing Qiu Singular value decomposition for genome-wide expression data processing and modeling Presented by Jing Qiu April 23, 2002 Outline Biological Background Mathematical Framework:Singular Value Decomposition

More information

Correlation Networks

Correlation Networks QuickTime decompressor and a are needed to see this picture. Correlation Networks Analysis of Biological Networks April 24, 2010 Correlation Networks - Analysis of Biological Networks 1 Review We have

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Using DERIVE to Interpret an Algorithmic Method for Finding Hamiltonian Circuits (and Rooted Paths) in Network Graphs

Using DERIVE to Interpret an Algorithmic Method for Finding Hamiltonian Circuits (and Rooted Paths) in Network Graphs Liverpool John Moores University, July 12 15, 2000 Using DERIVE to Interpret an Algorithmic Method for Finding Hamiltonian Circuits (and Rooted Paths) in Network Graphs Introduction Peter Schofield Trinity

More information

A sequence of triangle-free pseudorandom graphs

A sequence of triangle-free pseudorandom graphs A sequence of triangle-free pseudorandom graphs David Conlon Abstract A construction of Alon yields a sequence of highly pseudorandom triangle-free graphs with edge density significantly higher than one

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Principal Component Analysis, A Powerful Scoring Technique

Principal Component Analysis, A Powerful Scoring Technique Principal Component Analysis, A Powerful Scoring Technique George C. J. Fernandez, University of Nevada - Reno, Reno NV 89557 ABSTRACT Data mining is a collection of analytical techniques to uncover new

More information

Grouping of correlated feature vectors using treelets

Grouping of correlated feature vectors using treelets Grouping of correlated feature vectors using treelets Jing Xiang Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213 jingx@cs.cmu.edu Abstract In many applications, features

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

7.06 Problem Set #4, Spring 2005

7.06 Problem Set #4, Spring 2005 7.06 Problem Set #4, Spring 2005 1. You re doing a mutant hunt in S. cerevisiae (budding yeast), looking for temperaturesensitive mutants that are defective in the cell cycle. You discover a mutant strain

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Machine Learning - MT Clustering

Machine Learning - MT Clustering Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016 Announcements No new practical this week All practicals must be signed off in sessions this week Firm Deadline:

More information

Interaction Network Analysis

Interaction Network Analysis CSI/BIF 5330 Interaction etwork Analsis Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Biological etworks Definition Maps of biochemical reactions, interactions, regulations

More information

arxiv: v1 [q-bio.mn] 7 Nov 2018

arxiv: v1 [q-bio.mn] 7 Nov 2018 Role of self-loop in cell-cycle network of budding yeast Shu-ichi Kinoshita a, Hiroaki S. Yamada b a Department of Mathematical Engineering, Faculty of Engeneering, Musashino University, -- Ariake Koutou-ku,

More information

Markov Chains and Spectral Clustering

Markov Chains and Spectral Clustering Markov Chains and Spectral Clustering Ning Liu 1,2 and William J. Stewart 1,3 1 Department of Computer Science North Carolina State University, Raleigh, NC 27695-8206, USA. 2 nliu@ncsu.edu, 3 billy@ncsu.edu

More information

A Random Dot Product Model for Weighted Networks arxiv: v1 [stat.ap] 8 Nov 2016

A Random Dot Product Model for Weighted Networks arxiv: v1 [stat.ap] 8 Nov 2016 A Random Dot Product Model for Weighted Networks arxiv:1611.02530v1 [stat.ap] 8 Nov 2016 Daryl R. DeFord 1 Daniel N. Rockmore 1,2,3 1 Department of Mathematics, Dartmouth College, Hanover, NH, USA 03755

More information

Hotspots and Causal Inference For Yeast Data

Hotspots and Causal Inference For Yeast Data Hotspots and Causal Inference For Yeast Data Elias Chaibub Neto and Brian S Yandell October 24, 2012 Here we reproduce the analysis of the budding yeast genetical genomics data-set presented in Chaibub

More information

Towards Detecting Protein Complexes from Protein Interaction Data

Towards Detecting Protein Complexes from Protein Interaction Data Towards Detecting Protein Complexes from Protein Interaction Data Pengjun Pei 1 and Aidong Zhang 1 Department of Computer Science and Engineering State University of New York at Buffalo Buffalo NY 14260,

More information

EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES

EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES Charless Fowlkes 1, Qun Shan 2, Serge Belongie 3, and Jitendra Malik 1 Departments of Computer Science 1 and Molecular Cell Biology 2, University

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information