Supplemental Information for Pramila et al.

Periodic Normal Mixture Model (PNM)

The data sets alpha30 and alpha38 were analyzed with PNM (Lu et al. 2004). The first two time points were deleted to alleviate block/release artifacts. Data points at 105 min were deleted from both data sets because of unsatisfactory hybridization. From each data set, the 1000 least variable genes and genes with more than 25% missing data were discarded. The remaining genes were centered and scaled to a mean of 0 and a standard deviation of 1. The gene expression profiles were fitted with a linear combination of six sinusoidal functions. The gene-specific Fourier decomposition coefficients and the cell cycle rate were estimated iteratively until both the set of 100 genes with the smallest fitting residuals and the estimated cell cycle rate became stable.

Permutation Based Statistical Method (PBM)

For PBM analysis, the alpha30 and alpha38 data were combined with three other data sets previously used to identify periodic transcripts in the budding yeast genome (Cho et al. 1998; Spellman et al. 1998). These data sets involve three different induced synchrony protocols: two temperature-sensitive mutations that arrest cells in G1 or late M phase at 37 °C (cdc28 and cdc15, respectively) and the mating pheromone alpha factor. The method ranks each transcript by combining two permutation-based statistical tests, one for periodicity and one for magnitude of oscillation (de Lichtenberg 2005). The scoring penalizes genes that display only one of these properties, i.e. high-amplitude fluctuations with no periodicity or periodic oscillation of very low amplitude.

The method also computes a gene-specific number, the "peak time", for each transcript in each data set, describing when in the cell cycle the gene is maximally expressed. To compare peak timing across experiments, the time scales are transformed to percent of the cell cycle by dividing by the cell cycle span calculated by Zhao et al. (2001) for each type of synchrony, and the experiments are then aligned relative to each other (de Lichtenberg 2005). Zero was set based on the peak time of 19 M/G1-specific genes (de Lichtenberg 2005). In addition, a combined peak time is calculated as a weighted average over all five data sets, together with the error associated with that calculation. Peak times are expressed as percent of the cell cycle, with zero set at the M/G1 boundary. Ranks, peak times for the individual data sets, and the weighted average peak time with its associated error are provided at our website.

Promoter region alignments

Upstream sequences from seven yeast species were collected from two previous studies (Cliften et al. 2003; Kellis et al. 2003). We wanted to generate high-quality alignments of these upstream regions and were willing to discard unalignable sequences to achieve this goal. Accordingly, we designed an iterative procedure that produces upstream region alignments with pairwise percent sequence identity above 40%. The procedure first removes leading single-sequence columns from the alignment, which occur frequently because the upstream regions often vary widely in length. Thereafter, if any sequence matches poorly to the rest of the alignment, that sequence is removed and the alignment is recomputed. The resulting collection of alignments is available upon request.

Yeast phylogeny

We inferred a phylogenetic tree for the seven yeast species from alignments of the coding sequences of three proteins. We selected the Mcm proteins and used only those that occur unambiguously in all seven species: MCM2, CDC47 and CDC54. The concatenated alignment, consisting of 3201 columns, was analyzed using fastdnaml (olsen:fastdnaml) with the default parameters.

PhyME

We searched for motifs using PhyME (Sinha et al. 2004), which takes into account aligned orthologous sequences as well as the phylogenetic species tree. For each gene, we extracted 800 bp upstream of the start codon from each sensu stricto species (cerevisiae, mikatae, kudriavzevii, bayanus, and paradoxus). These sequences were then aligned using Lagan (Brudno et al. 2003), which is distributed with PhyME. For a given group of genes with similar peak times, we searched the corresponding alignments for the top three motifs, using the following options: revcompw ot 0.3 maxsites 6.

Motiph

For the purposes of this study, we implemented a program called Motiph that scans a multiple alignment for occurrences of a given motif, taking into account the phylogenetic tree relating the species in the alignment. Motiph was motivated by the notion of phylogenetic shadowing (Boffelli et al. 2003), and the program is thus similar to Monkey (Moses et al. 2004). Given an alignment, a motif matrix and a tree, Motiph calculates, for each position in the alignment, the probability of the given motif, taking into account the phylogenetic tree.
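As an illustration only, the following Python sketch shows one way a phylogeny-aware motif score of this kind could be computed for a single candidate site. It is not the Motiph implementation. The substitution model (a simplified F81-style model whose equilibrium frequencies are taken from the motif column or from the background), the nested-list tree encoding, the toy tree, and all function names are assumptions made for the example. The default rates of 1 and 1.2 are the functional and non-functional evolutionary rates quoted in this section. The sum over nucleotide assignments at the internal nodes is carried out with Felsenstein's pruning recursion.

```python
import math

NUC = "ACGT"

def f81_transition(pi, t, rate):
    """Simplified F81-style matrix P[i][j] for branch length t scaled by rate.

    Assumption for this sketch: with probability exp(-rate * t) the base is
    unchanged, otherwise it is redrawn from the frequencies pi.
    """
    stay = math.exp(-rate * t)
    return [[stay * (i == j) + (1.0 - stay) * pi[j] for j in range(4)]
            for i in range(4)]

def partial_likelihood(node, column, pi, rate):
    """Felsenstein pruning: P(observed leaves below node | state x at node).

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        obs = column[node]
        if obs not in NUC:            # gap or ambiguous base: treat as missing
            return [1.0] * 4
        return [1.0 if NUC[x] == obs else 0.0 for x in range(4)]
    like = [1.0] * 4
    for child, t in node:
        P = f81_transition(pi, t, rate)
        below = partial_likelihood(child, column, pi, rate)
        for x in range(4):
            like[x] *= sum(P[x][y] * below[y] for y in range(4))
    return like

def column_probability(tree, column, pi, rate):
    """Probability of one alignment column, with the root base drawn from pi."""
    root = partial_likelihood(tree, column, pi, rate)
    return sum(pi[x] * root[x] for x in range(4))

def site_log_odds(tree, columns, motif, background, fg_rate=1.0, bg_rate=1.2):
    """Sum over motif positions of log(motif model / background model).

    Motif entries should be nonzero (e.g. after pseudocounts) so that the
    logarithm is defined.
    """
    score = 0.0
    for column, pi in zip(columns, motif):
        num = column_probability(tree, column, pi, fg_rate)
        den = column_probability(tree, column, background, bg_rate)
        score += math.log(num / den)
    return score

if __name__ == "__main__":
    # Toy three-species tree with made-up branch lengths.
    tree = [("cer", 0.1), ([("par", 0.05), ("mik", 0.05)], 0.05)]
    motif = [[0.85, 0.05, 0.05, 0.05], [0.05, 0.05, 0.85, 0.05]]
    background = [0.25, 0.25, 0.25, 0.25]
    site = [{"cer": "A", "par": "A", "mik": "A"},
            {"cer": "G", "par": "G", "mik": "-"}]
    print(site_log_odds(tree, site, motif, background))
```

In this sketch a site whose columns are conserved and match the motif receives a positive score, and a conserved mismatch receives a negative one. Scanning an alignment would amount to calling site_log_odds on every window of motif width.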
The probability that Motiph computes is the sum over all possible evolutionary histories (i.e., all possible assignments of nucleotides to the internal nodes of the tree), with the given motif at the root of the tree. Motiph reports a log-odds score, in which the numerator is this probability (computed using a functional evolutionary rate of 1), and the denominator is a similar probability computed using a motif of background probabilities and a non-functional evolutionary rate of 1.2.

Supplemental Fig. 1

In Figure S1A, the percentage of identified known periodic genes (total 127) is plotted against the percentage of total genes as we lower the posterior probability of periodicity calculated by PNM (Lu et al. 2004). Both the alpha30 and the alpha38 data sets identify known periodic transcripts slightly better than the first alpha factor data set (Spellman et al. 1998). Figure S1B demonstrates that adding microarray data sets improves the rate at which known periodic genes are identified. These results also show that it is beneficial to include data sets generated with different synchronization methods. The combination of all five available data sets (PNM5) performs best. PNM2 uses the alpha30 and alpha38 data. PNM3 adds the third alpha factor set (Spellman et al. 1998). PNM4 adds the cdc28 arrest synchrony (Cho et al. 1998), and PNM5 analyzes those four plus the cdc15 data set (Spellman et al. 1998). Diamond and cross symbols indicate the positions on each plot representing the probability thresholds of and 0.95, respectively.

Supplemental Fig. 2

Figure S2 plots the enrichment of known periodic genes within the ranked lists of genes identified by PBM5 and PNM5, as in Fig. S1. This result is consistent with the previous comparison made between PNM and PBM using three data sets (de Lichtenberg 2005).

Table S1: Position-specific probability matrix used to search for the presence of Hcm1 binding sites. Rows are positions, and columns correspond to A, C, G and T.

A C G T

Supplemental Fig. 3

Fig. S3. Cell cycle microarray data for 180 genes that have conserved Hcm1 binding sites in their promoters and are ranked as periodic by both PNM5 and PBM5 were extracted from the alpha30 data set and visualized in PRISM (Wu and Noble 2004). Each row represents one gene, named to the right. The transcript profiles are ordered by their average peak time, which is also indicated (peak(combined)). Peaks are magenta, troughs are cyan. Common names and short descriptions were downloaded from the Saccharomyces Genome Database.

Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299:

Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., and Batzoglou, S. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13:

Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., and Davis, R.W. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2:

Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B.A., and Johnston, M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:

de Lichtenberg, U., Jensen, L.J., Fausboll, A., Jensen, T.S., Bork, P., and Brunak, S. 2005. Comparison of computational methods for the identification of cell cycle-regulated genes. Bioinformatics 21:

Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E.S. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:

Lu, X., Zhang, W., Qin, Z.S., Kwast, K.E., and Liu, J.S. 2004. Statistical resynchronization and Bayesian detection of periodically expressed genes. Nucleic Acids Res 32:

Moses, A.M., Chiang, D.Y., Pollard, D.A., Iyer, V.N., and Eisen, M.B. 2004. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol 5: R98.

Sinha, S., Blanchette, M., and Tompa, M. 2004. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 5: 170.

Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9:

Wu, W. and Noble, W.S. 2004. Genomic data visualization on the Web. Bioinformatics 20:

Zhao, L.P., Prentice, R., and Breeden, L. 2001. Statistical modeling of large microarray data sets to identify stimulus-response profiles. Proc. Natl. Acad. Sci. USA 98:
