microRNA Discovery Sequencing Data Report for Dr. Researcher, prepared by LC Sciences, LLC, Jan. 1, 2009


Sequencing Data Report
microRNA Discovery Sequencing Service
On sample_
For Dr. Researcher, Life Sciences, University of USA
Prepared by LC Sciences, LLC, Jan. 1, 2009
LC Sciences, LLC, 2575 W. Bellfort, Suite 270, Houston, Texas

microRNA Discovery Sequencing Data Report: sample_

I. PROJECT INFORMATION

Project-related information is listed in Table 1.

Table 1. Sample, service, and project tracking information

Customer Sample Name: sample_
Sample Type: Human total RNA
Date Sample Received: 12/15/2009
Service Requested: microRNA Discovery Sequencing Service
Data Analysis Requested: Standard Data Analysis
LCS Project Number:
LCS Sample ID: sample1

II. DATA REPORT

The received RNA sample was processed to generate a cDNA library, which was then used for deep sequencing. The data generated were analyzed, and the full data files (2-3 Gb) were saved onto a DVD, which is included with this report. Experimental procedures and analysis methods are described in Section III of this report. The statistics of the data analysis are given in the file Data_summary_sample1.xls, and a summary is presented in Table 2. The detailed data files, which may be tens of Mb in size, and the recommended software programs for reviewing the data are listed in Table 3.

Terminologies Used

Sequ Seq: Raw sequencing read generated after image extraction and base-calling
Unique Seq: Family of sequ seqs with the same sequence
Copy Number: Number of sequ seqs in the same unique seq family
Count: Number of sequ seqs in the same unique seq family
Mapping: Aligning a sequence to a reference database
mir: pre-miRNA registered in miRBase
miR: mature miRNA registered in miRBase

LC Sciences, LLC, support@lcsciences.com, 2575 W. Bellfort, Suite 270, Houston, Texas

Table 2. A summary of standard data analysis results

Category | #SequSeq | %SequSeq | #UniqueSeq | %UniqueSeq
Raw | 6,870,, | % | 1,445,905 | %
Mappable | 4,764,, | % | 40,890 | 2.83%
Mapped to miRBase (including nohit 1) | 4,200,, | % | 8,971 | 0.62%
Mapped to Cluster I | 4,181,, | % | 7, | 0.55%
Mapped to Cluster II | | % | | 0.00%
Mapped to Cluster III | 2, | % | | 0.01%
Mapped to Cluster IV | 13, | % | ,128 | 0.08%
Mapped (total) | 4,196,, | % | 9,201 | 0.64%
Nohit (including nohit 1 and nohit 2) | 567, | % | 31,689 | 2.19%

Figure: Flow chart of sequencing data analysis of a single sequencing reaction through various filters and the number of miRNAs detected. Filters shown: mRNA, RFam, Repbase filter (7.42%); ADT filter (1.35%); junk filter (0.05%); sequence pattern filter (0.07%); length <15 or above the upper limit (6.07%); copy# <3. In the chart, 9,031,007 reads (91%) are mappable; 7,944,551 reads (88%) passed the optional filter; 4,722,478 reads (59%) are mapped to miRNAs or are miRNA candidates, and 41% are unmapped. Group 1: 39%, 300 miRNAs detected. Group 2: 4%, 27 miRNAs detected. Group 3: 19%, 149 miRNAs detected. Group 4: 38%, 292 miRNAs detected.
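The %UniqueSeq column of Table 2 appears to be each #UniqueSeq count expressed as a percentage of the raw unique seq count. The short Python sketch below checks that reading against the cells that are fully legible in Table 2; the function name `percent_of_raw` is hypothetical and not part of the report or of ACGT101-miR.

```python
# Legible #UniqueSeq values copied from Table 2; partially legible cells are omitted.
unique_counts = {
    "Raw": 1_445_905,
    "Mappable": 40_890,
    "Mapped to miRBase (including nohit 1)": 8_971,
    "Mapped (total)": 9_201,
    "Nohit (including nohit 1 and nohit 2)": 31_689,
}


def percent_of_raw(counts, raw_key="Raw"):
    """Express every count as a percentage of the raw count, rounded to 2 decimals."""
    raw = counts[raw_key]
    return {name: round(100.0 * n / raw, 2) for name, n in counts.items()}


percentages = percent_of_raw(unique_counts)
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")

# Mapped (total) plus Nohit should account for every mappable unique seq.
mapped_plus_nohit = (
    unique_counts["Mapped (total)"]
    + unique_counts["Nohit (including nohit 1 and nohit 2)"]
)
print(mapped_plus_nohit == unique_counts["Mappable"])
```

Under this reading the computed percentages agree with the legible %UniqueSeq entries (2.83%, 0.62%, 0.64%, 2.19%), and the mapped and nohit unique seqs sum to the mappable total.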

Table 3. Data files delivered and programs recommended for reviewing

Data File | Description | Reviewing Program

Folder: Raw Data
sample1_rawdata.txt | Sequenced sequences (sequ seqs) as obtained from the sequencer | Wordpad

Folder: Filtered Data
sample1_fhg_unique.txt | Sequ seqs listed by family (unique seqs) | Wordpad
sample1_fhg_pass.txt | Unique seqs passed digital filters | Wordpad
sample1_fhg_long.txt | Unique seqs with length >= 15 | Wordpad
sample1_fhg_short.txt | Unique seqs with length < 15 | Wordpad
sample1_fhg_hc.txt | Unique seqs with copy number >= 3 | Wordpad
sample1_fhg_lc.txt | Unique seqs with copy number < 3 | Wordpad
sample1_fhg_db.fa | Final mappable unique seqs | Wordpad

Folder: Mapped Data
sample1_fhg_gp1_align.txt | Cluster I: see Table 4 | Wordpad
sample1_fhg_gp1_mirlist.txt | Cluster I: see Table 4 | Excel
sample1_fhg_gp1_sum.txt | Cluster I: see Table 4 | Excel
sample1_fhg_gp2_align.txt | Cluster II: see Table 4 | Wordpad
sample1_fhg_gp2_mirlist.txt | Cluster II: see Table 4 | Excel
sample1_fhg_gp2_sum.txt | Cluster II: see Table 4 | Excel
sample1_fhg_gp3_align.txt | Cluster III: see Table 4 | Wordpad
sample1_fhg_gp3_mirlist.txt | Cluster III: see Table 4 | Excel
sample1_fhg_gp3_sum.txt | Cluster III: see Table 4 | Excel
sample1_fhg_gp4_align.txt | Cluster IV: see Table 4 | Wordpad
sample1_fhg_gp4_mirlist.txt | Cluster IV: see Table 4 | Excel
sample1_fhg_gp4_sum.txt | Cluster IV: see Table 4 | Excel
sample1_fhg_uni_mirs.txt | The list of all unique seqs from Cluster I to IV | Wordpad
sample1_fhg_nohit.txt | Unique seqs having no hit with reference libraries or the genome | Wordpad

Folder: Summary
sample1_fhg_clusterposition.txt | Genomic chromosomal positions of the mapped unique seqs | Excel
sample1_fhg_mirdistribution.png | Plot of position of mapped unique seqs inside genome | Paint
Data_summary_sample1.xls | Statistics of data analysis at the various steps and the final results | Excel

III. METHODS AND EXPERIMENTS

A. Small RNA Library Construction

A small RNA library was generated from the customer sample according to Illumina's sample preparation instructions [1]. The procedures performed are summarized below.

1. Small RNA Isolation by Denaturing PAGE Gel

The received total RNA sample was size-fractionated on a 15% tris-borate-EDTA-urea polyacrylamide gel. The RNA fragments in the target size range were isolated, quantified following gel elution, and ethanol precipitated.

2. 5' and 3' Adapter Ligation

The SRA 5' adapter (Illumina) was ligated to the aforementioned RNA fragments with T4 RNA ligase (Promega). The ligated RNAs were size-fractionated on a 15% tris-borate-EDTA-urea polyacrylamide gel, and the RNA fragments of ~41-76 nts were isolated. The SRA 3' adapter (Illumina) ligation was then performed, followed by a second size-fractionation under the same gel conditions. The RNA fragments of ~64-99 nts were isolated through gel elution and ethanol precipitation.

3. Reverse Transcription and PCR Amplification

The ligated RNA fragments were reverse transcribed to single-stranded cDNAs using M-MLV (Invitrogen) with the RT primers recommended by Illumina. The cDNAs were amplified with Pfx DNA polymerase (Invitrogen) in 20 cycles of PCR using Illumina's small RNA primer set.

4. Purification of Amplified cDNA Library for Sequencing

The PCR products were purified on a 12% TBE polyacrylamide gel, and the gel slice containing the target size fraction was excised. This fraction was eluted, and the recovered cDNAs were precipitated and quantified on a Nanodrop (Thermo Scientific) and on a TBS-380 mini-fluorometer (Turner Biosystems) using PicoGreen dsDNA quantitation reagent (Invitrogen). The concentration of the sample was adjusted to ~10 nM, and a total of 10 µL was used in the sequencing reaction.

B. Deep Sequencing

The purified cDNA library was used for cluster generation on Illumina's Cluster Station and then sequenced on an Illumina GAIIx following the vendor's instructions for running the instrument. Raw sequencing reads were obtained using Illumina's Pipeline v1.5 software following sequencing image analysis by the Pipeline Firecrest module and base-calling by the Pipeline Bustard module. The extracted sequencing reads were stored in the file sample1_rawdata.txt and were then used in the standard data analysis, which is described in the next section.

C. Standard Data Analysis

A proprietary software package, ACGT101-miR v3.x (LC Sciences), was used for standard data analysis. The key functions performed by this software and the relevant analysis results are described here.

5. Obtaining Mappable Sequences from Raw Sequencing Data

After the raw sequence reads, or sequenced sequences (sequ seqs), were extracted from image data, a series of digital filters (LC Sciences) was employed to remove various un-mappable sequencing reads. A FASTA file named sample1_fhg_db.fa was generated and used for mapping.

a. Generating Unique Families of Sequ Seqs by Sorting Raw Sequencing Reads

In this step, identical sequ seqs in the raw data file were counted, and a file of unique families of sequences (unique seqs), sample1_fhg_unique.txt, was generated. A typical entry of this file is shown below:

23 TTTGTCGG GTCTTTGGATATGCCGTGTGACAATGGTGG 1,8560

where 23 is the index of the sequence, followed by the sequ seq and then the count (copy number) of the sequ seq.

b. Generating Mappable Sequ Seqs

In this step, impurity sequences arising from sample preparation, sequencing chemistry and processes, and the optical digital resolution of the sequencer detector were removed. The remaining sequ seqs, which were used for mapping against the reference database files, were grouped by family (unique seqs) and stored in the file sample1_fhg_pass.txt.

c. Filtering Unique Seqs by Length

In this step, unique seqs were separated into two groups based on their sequence lengths. Unique seqs at or above a cut-off length (default = 15 nts for microRNA discovery; see Table 3) were saved in the file named sample1_fhg_long.txt, while shorter ones were saved in the file named sample1_fhg_short.txt.
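Steps a and c above can be illustrated with a minimal Python sketch. It is not the ACGT101-miR implementation, which is proprietary; the function names `collapse_reads` and `split_by_length` and the toy reads are hypothetical, and the cut-off is applied as "length >= 15" on the assumption that Table 3 gives the intended boundary.

```python
from collections import Counter


def collapse_reads(reads):
    """Group identical sequ seqs into unique seqs with copy numbers.

    Returns (sequence, count) pairs, most abundant first; ties are
    broken alphabetically so the order is deterministic.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def split_by_length(unique_seqs, cutoff=15):
    """Separate unique seqs into long (>= cutoff) and short (< cutoff) groups."""
    long_seqs = [(seq, n) for seq, n in unique_seqs if len(seq) >= cutoff]
    short_seqs = [(seq, n) for seq, n in unique_seqs if len(seq) < cutoff]
    return long_seqs, short_seqs


# Toy reads for illustration only (not taken from the sample data).
toy_reads = [
    "TAGCACTCAAGTGTTTTGCACTGG",
    "TAGCACTCAAGTGTTTTGCACTGG",
    "ACGTACGT",
    "TAGCACTCAAGTGTTTTGCACTGG",
    "ACGTACGT",
    "GGCAAGTCTTCCTGCGA",
]

unique_seqs = collapse_reads(toy_reads)
long_seqs, short_seqs = split_by_length(unique_seqs)

# Print in the "index, sequence, count" layout described for the unique seqs file.
for index, (seq, count) in enumerate(unique_seqs, start=1):
    print(index, seq, count)
```

Counting with a dictionary-like structure keeps the collapse step linear in the number of reads, which matters when several million sequ seqs are reduced to unique seqs.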

d. Filtering Unique Seqs by Copy Number

In this step, unique seqs were further sorted based on their copy numbers. Those with copy numbers at or above a predefined cut-off (default = 3; see Table 3) were stored in the file named sample1_fhg_hc.txt, while those with fewer copies were stored in the file named sample1_fhg_lc.txt, where hc means high copy and lc means low copy.

e. Removing Unique Seqs Mapped to Certain Known RNA Reference Databases

Standard procedures were employed to remove unique seqs that mapped to mRNA, RFam, and Repbase. The unique seqs that passed the filter at this step were saved in the file sample1_fhg_db.fa.

6. Mapping Mappable Unique Seqs to Mirs and Genome

In this section, unique seqs were mapped against pre-miRNA (mir) and mature miRNA (miR) sequences listed in the latest release of miRBase [2, 3, 4], or against the genome, based on the public releases for the appropriate species. The mirs of interest were also mapped against the genome sequence. The methods and criteria used for the various mappings are documented in the ACGT-101 User's Manual [5]. Brief descriptions of the analyses are presented below, and the characteristics of the various groups of unique seqs are summarized in Table 4.

a. Mapping Unique Seqs to Mirs in miRBase

The cleaned unique seqs in sample1_fhg_db.fa were blasted against the mirs in miRBase. The mapped unique seqs were grouped as "unique seqs mapped to mirs in miRBase", while the remaining ones were grouped as "unique seqs un-mapped to mirs in miRBase".

b. Mapping Mirs Mapped by Unique Seqs to Genome

The mirs to which the unique seqs in the "unique seqs mapped to mirs in miRBase" group were mapped were further blasted against the genome. The mirs mapped to the genome were sorted out, and the unique seqs associated with these mirs were grouped as "unique seqs mapped to mirs that further mapped to genome". This group of unique seqs was categorized as Cluster I and saved in the file sample1_fhg_gp1_mirlist.txt. Their alignments are presented in the file sample1_fhg_gp1_align.txt. A summary file was also generated and saved as sample1_fhg_gp1_sum.txt. The unique seqs mapped to mirs that were not mapped to the genome were grouped as "unique seqs mapped to mirs that un-mapped to genome".

c. Mapping Unique Seqs with Mirs Un-mapped to Genome to Genome

The unique seqs in the group "unique seqs mapped to mirs that un-mapped to genome" were blasted against the genome. The unique seqs that mapped to the genome here were grouped as "unique seqs mapped to mirs and genome but mirs un-mapped to genome", categorized as Cluster II, and saved in the file sample1_fhg_gp2_mirlist.txt. Their alignments are presented in the file sample1_fhg_gp2_align.txt. A summary file was also generated and saved as sample1_fhg_gp2_sum.txt. The remaining unique seqs were grouped as "unique seqs mapped to mirs but neither unique seqs nor their mirs mapped to genome".

d. Mapping Unique Seqs to miRs

The unique seqs in the group "unique seqs mapped to mirs but neither unique seqs nor their mirs mapped to genome" were further categorized based on whether they mapped to any mature miRNAs (miRs) within the mirs to which they were mapped. All unique seqs in this group that mapped to miRs were grouped as "unique seqs mapped to mirs and miRs but neither unique seqs nor their mirs mapped to genome", categorized as Cluster III, and saved in the file sample1_fhg_gp3_mirlist.txt. Their alignments are presented in the file sample1_fhg_gp3_align.txt. A summary file was also generated and saved as sample1_fhg_gp3_sum.txt. The rest of the unique seqs in the group, which were un-mapped to miRs, were grouped as "unique seqs mapped to mirs but not to miR and neither unique seqs nor their mirs mapped to genome" and termed "unique seqs nohit 1".

e. Mapping Unique Seqs Un-mapped to Mirs to Genome

Unique seqs in the group "unique seqs un-mapped to mirs in miRBase" were blasted against the genome directly, and those mapped to the genome were identified. The extended sequences of the mapped genome sequences were tested for possible formation of stable hairpins. When stable hairpins were predicted, the associated unique seqs were grouped as "unique seqs un-mapped to mirs but mapped to genome with possible hairpin formation". These unique seqs were categorized as Cluster IV and saved in the file sample1_fhg_gp4_mirlist.txt. Their alignments are presented in the file sample1_fhg_gp4_align.txt. A summary file was also generated and saved as sample1_fhg_gp4_sum.txt.

All unique seqs in Clusters I to IV are listed in sample1_fhg_uni_mirs.txt as mapped miRs or predicted miRs.

The unique seqs that mapped neither to mirs in miRBase nor to the genome were grouped as "unique seqs un-mapped to mirs and genome" and termed "unique seqs nohit 2". The unique seqs in both the "unique seqs nohit 1" and "unique seqs nohit 2" groups were combined as "unique seqs nohit" and saved in the file sample1_fhg_nohit.txt.

f. Plot of the Chromosomal Genomic Positions of the Mapped Unique Seqs

The genomic positions of the Cluster I to IV sequences were mapped to chromosomes; the results were saved in the file named sample1_fhg_clusterposition.txt and displayed in the plot file named sample1_fhg_mirdistribution.png.

Table 4. Summary of mapping of unique seqs to mirs, miRs, and genome*

Clusters | Group Description | Unique seqs mapped* to: mir | Genome | miR | Comments
Cluster I | Unique seqs mapped to mirs that further mapped to genome | | | |
Cluster II | Unique seqs mapped to mirs and genome but mirs un-mapped to genome | | | | mirs un-mapped to genome
Cluster III | Unique seqs mapped to mirs and miRs but neither unique seqs nor their mirs mapped to genome | | | | mirs un-mapped to genome
Cluster IV | Unique seqs un-mapped to mirs but mapped to genome with possible hairpin formation | | | |
Unique seqs nohit | Unique seqs mapped to mirs but not to miR and neither unique seqs nor their mirs mapped to genome (unique seqs nohit 1) | | | | mirs un-mapped to genome
Unique seqs nohit | Unique seqs un-mapped to mirs and genome (unique seqs nohit 2) | | | |

* Note: the marks in the mapping columns indicate whether a unique seq was mapped or was not mapped to the mir, miR, or genome.
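The grouping logic of steps 6a to 6e and Table 4 can be read as a decision tree over a handful of yes/no mapping outcomes. The Python sketch below encodes that reading; the `MappingFlags` fields and the `classify` function are hypothetical names, and the mapping and hairpin-prediction steps themselves are not modelled. The report does not say how a unique seq is handled when it maps to the genome without a predicted hairpin, so the sketch labels that case explicitly rather than guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingFlags:
    """Mapping outcomes for one unique seq (hypothetical field names)."""
    seq_to_mir: bool      # unique seq maps to a pre-miRNA (mir) in miRBase
    mir_to_genome: bool   # that mir maps to the genome
    seq_to_genome: bool   # unique seq itself maps to the genome
    seq_to_miR: bool      # unique seq maps to a mature miRNA (miR) within the mir
    hairpin: bool         # stable hairpin predicted around the genome hit


def classify(flags):
    """Assign a unique seq to Cluster I-IV or a nohit group, following steps 6a-6e."""
    if flags.seq_to_mir:
        if flags.mir_to_genome:
            return "Cluster I"
        if flags.seq_to_genome:
            return "Cluster II"
        if flags.seq_to_miR:
            return "Cluster III"
        return "nohit 1"
    if flags.seq_to_genome:
        if flags.hairpin:
            return "Cluster IV"
        return "not specified in this report"
    return "nohit 2"


examples = {
    "known miRNA with genomic precursor": MappingFlags(True, True, True, True, False),
    "seq on genome, mir not on genome": MappingFlags(True, False, True, False, False),
    "mature miR hit only": MappingFlags(True, False, False, True, False),
    "novel candidate with hairpin": MappingFlags(False, False, True, False, True),
    "no hit anywhere": MappingFlags(False, False, False, False, False),
}
for label, flags in examples.items():
    print(f"{label}: {classify(flags)}")
```

Ordering the checks this way mirrors the order in which the groups are split off in the text, so each unique seq falls into exactly one group.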

Figure: Length Distribution of Mappable Data. Bar chart of the total number of reads by length (nt). The accompanying table gives, for each length, the total # of reads, the % of total reads, the # of unique seqs, and the reads # / unique seqs. The two largest bars are 3,533,567 and 2,203,539 reads; the total is 7,944,551 reads.

Chromosomal location of pre-miRNAs. The relative locations of individual pre-miRNAs (mir) are shown across the 19 chromosomes. MID (Maximum Inter-Distance) is the maximum distance between any two pre-miRNAs on the same chromosome that are considered to be in the same cluster. Fifty-nine clusters (black dots) are obtained when the MID is limited to 50 kb.
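The MID rule in the caption above can be sketched in Python as follows. This is an illustration, not the procedure used to produce the figure: the function name `cluster_by_mid` and the toy loci are hypothetical, and the sketch assumes that the MID is applied to neighbouring pre-miRNAs after sorting by position, so that a cluster grows as long as each next locus lies within the MID of the previous one.

```python
from collections import defaultdict


def cluster_by_mid(loci, mid=50_000):
    """Group (chromosome, position) loci into clusters.

    Loci on the same chromosome are sorted by position; a new cluster
    starts whenever the gap to the previous locus exceeds `mid` bases.
    Returns a list of (chromosome, [positions]) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in loci:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= mid:
                current.append(pos)
            else:
                clusters.append((chrom, current))
                current = [pos]
        clusters.append((chrom, current))
    return clusters


# Toy loci for illustration only (not taken from the sample data).
toy_loci = [
    ("chr1", 1_000), ("chr1", 30_000), ("chr1", 200_000),
    ("chr2", 500), ("chr2", 40_000), ("chr2", 95_000),
]

clusters = cluster_by_mid(toy_loci)
multi_member = [c for c in clusters if len(c[1]) > 1]
for chrom, positions in clusters:
    print(chrom, positions)
```

Whether a single isolated pre-miRNA counts as a cluster is not stated in the caption; the sketch returns every group and leaves that filtering, as in `multi_member`, to the caller.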

IV. REFERENCES

1. Preparing Samples for Analysis of Small RNA, Illumina Inc., Part # Rev. A, 2008.
2. Griffiths-Jones, S., Saini, H.K., van Dongen, S., Enright, A.J., miRBase: tools for microRNA genomics, 2008, Nucleic Acids Research, 36, D154-D158.
3. Griffiths-Jones, S., Grocock, R.J., van Dongen, S., Bateman, A., Enright, A.J., miRBase: microRNA sequences, targets and gene nomenclature, 2006, Nucleic Acids Research, 34, D140-D144.
4. Griffiths-Jones, S., The microRNA Registry, 2004, Nucleic Acids Research, 32, D109-D111.
5. LC Sciences ACGT 101 manual.

Example printouts of the included sequence data files are attached. The data files included with each sample report are listed in Table 3. These printouts represent truncated sample data files.

sample1_rawdata

>ILLUMINA-57021F:7:1:3:1127#0/1
ACATTGGGTTCTCATTCAAATATACTTTTGAAGTATGTGC
>ILLUMINA-57021F:7:1:3:352#0/1
ACGAAGAGGGAGCGCAATNNNTCAGTATATATTGAAGGAC
>ILLUMINA-57021F:7:1:3:804#0/1
TCTGAGCAGTGACTAGNACCCGTAATANGAGGTGAGCAGC
>ILLUMINA-57021F:7:1:3:2003#0/1
CAACAAGGGTAAGTTAATGCAATCGCCCCTCCNNAAAGGG
>ILLUMINA-57021F:7:1:3:399#0/1
TTAAGGTAATTAGCGTGGGCGGTAGCGCTCTGTATAAGCT
>ILLUMINA-57021F:7:1:3:921#0/1
TAAGCAGGCCATGCCGCTACGNNGGGGATAAATCTGGCTG
>ILLUMINA-57021F:7:1:3:1598#0/1
TATTGGGAGGGAATAGATCGCTGNCCAGCCTGATNTAAGG
>ILLUMINA-57021F:7:1:3:1981#0/1
GCCGCCTGCCTAGGTCTTCTTATCTTGAGAATGAGTCAAG
>ILLUMINA-57021F:7:1:3:118#0/1
TGAGCACACATTACATAATGCGGCTACTGTTTGACAAAGT
>ILLUMINA-57021F:7:1:3:1347#0/1
CTCGATGTAAGAGATCACTATTTCGCCACATGGTATTCCG
>ILLUMINA-57021F:7:1:3:185#0/1
TAAGGGACCTATTGTCAGCGGATCACAATGTCTTGAAGGA
>ILLUMINA-57021F:7:1:3:824#0/1
TTTGCAAACCGATGTCAGTGGCACTGCAAATGTCCACTGT
>ILLUMINA-57021F:7:1:3:1603#0/1
CAAAACAAACGTAATACGGCGGTATCCCACTAGAGTTGTC
>ILLUMINA-57021F:7:1:3:613#0/1
TCCCATGAATCAGGCCNGACTAGGGAAANTTCAATCAGAC
>ILLUMINA-57021F:7:1:3:1459#0/1
GCCCGCCGTTATGGACAATAAGTAAATTGCTACAATTGAC
>ILLUMINA-57021F:7:1:3:494#0/1
GCTGTTCTAGAAAATGGTTTATCTATTCCTGCGTCAATCT
>ILLUMINA-57021F:7:1:3:1747#0/1
TGGAAAACTCTATTAGAGTCTAACTTATCCAATGCGCACG
>ILLUMINA-57021F:7:1:3:1574#0/1
AGTAAATNATANAAATAAAATTAAAAAAAAAANAAAAAAA
>ILLUMINA-57021F:7:1:3:1448#0/1
CTATGTACAGCCACTCTCTTGATGGCGGGAAATATTTATT
>ILLUMINA-57021F:7:1:3:149#0/1
GGTGCTGGATTCCCGTTTTGCGTATTTTGGGAGAGGTCCA
>ILLUMINA-57021F:7:1:3:460#0/1
TAGGCTGTTTGCTACATTTTGAGACAAACTGTATAGAGTG
>ILLUMINA-57021F:7:1:3:1436#0/1
AATGGCGGAGCGATTTATAGGGAGAGGGGCGATTGGCTCG
>ILLUMINA-57021F:7:1:3:1847#0/1
GACTATCTGCCTGTAGCGGATAAGGCAGCATCCAACCTAA
>ILLUMINA-57021F:7:1:3:909#0/1
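The raw data printout above uses a FASTA-style layout, with a ">" header line followed by the read. A minimal Python sketch for reading such records and counting reads that contain uncalled bases (N) is given below; the function name `parse_fasta` is hypothetical, and the two records are copied from the start of the printout for illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Sequence lines belonging to one record are concatenated; blank
    lines are ignored.
    """
    header = None
    chunks = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# First two records of the printout above.
sample_text = """\
>ILLUMINA-57021F:7:1:3:1127#0/1
ACATTGGGTTCTCATTCAAATATACTTTTGAAGTATGTGC
>ILLUMINA-57021F:7:1:3:352#0/1
ACGAAGAGGGAGCGCAATNNNTCAGTATATATTGAAGGAC
"""

records = list(parse_fasta(sample_text.splitlines()))
reads_with_n = sum(1 for _, seq in records if "N" in seq)

for header, seq in records:
    print(header, len(seq), "contains N" if "N" in seq else "no N")
```

To process the delivered file, the same generator can be fed an open file handle instead of `sample_text.splitlines()`, since it only iterates over lines.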

sample1_fhg_unique

# of input sequences:
# of families: ; / =21.0%
Index seq count
1 ATTATGGTACTTGTATTTAACAGGCTCACT
CCTCTTATGTAGACCGTTGTCCAGTGGTGA
ATTTTTATAGTACCAAGAGGCTACGCAGT
GTAAAAAGTCTATCGCCGCACTGTCGTCA
CTTAACGGTTCTAACTATTCACCGGTAAAG
CAAAGAAGCCATAGGCGCCCGGGAACACC
AACCGTGAGGCGCTGGAAAGACGCTAGAAG
AGTAGTACTGGCGTACACATTCTCCACGG
CATCCTATTCTAGCAATCAGGAGAACATTC
AGGCCCATATCAAGAAGTAGAACTATCGA
ACATGAATGGCGAATGCTTCCCGTGATA
GTTAGCTAGTGCCCGGTTTTATCAAGCCC
CGCGAATGTCTTCGTATGCTCAGGTAGC
CATGAGAGGTCGAGGGACTTGATTCCTAC
CTTCTGGCACCGTGGGCCAGCGGAAGGACA
TGCCCCAACGACGCGGAAAATCAAGCGAGC
CTTCTGGAAGCCAACGCTCTGGCGGGATCC
AAACAAATTAACTCGACGACCTCTCCTCT
AAACTATGCATTATTTCCCCCTAAGATCT
CATATCTGTCTCCTACCAGATTATCACCCC
TGTTTGCTGTGGCATTTTTCCCATGGATTA
ATTATTAACGTGGTGTGGTAAATAGAGGGT
GGTTCATGCCTAAATTGCATCTATAATA
GAGTCGTAACGCTACCCTATACGAAGCG
CCTTGAATGTGACCCTGAGGCTTCTATTAG
AAGAAGCGTCAACCCCCACGTCAAGACGT
TCCGTTGCTAGCCGGAGGACCTCCTGGT
CCGCAGAGCCAGGCACTATGTCAGGGGCTA
CATATGGCAAAAGACCGGACTGGACGCGA
ACATCAAGAATCCTCAATCCTACGTGGACG
AAGAGCCGGAGAACACATGATGGAGGCGAC
TGCGATAAAAACGGTGATGACCAAAGAACA
TGATCACGAAAAGTTGCTTGACAAGGTT
CTGGTTAAGCACCCCCTGGTGGTGCTGCCT
TGGGGCTACCGGGGCCTACACGCACCCAT
GTTTATACTATTAATATGCAATGGTGACT
GCCCCAAGCGTAGGTTGGGGGTCCGTTCG
GCCCGGAGTCAGATGACTCGCTTACGTG
CACGCTAGAAGGTGCTAGGGCTAGCTCTTT
GTCATGGGAGCATTCATGCCGCGACGCAC
CGCCTTTACTCCTGGAAGATATGACATGA
GTAAACCCCGGGCGGTGCAACCACAGGCGG
TGGTCTGGGCATTGTGCTTGAGCACACTTA 67972

sample1_fhg_pass

# of input seq:
# of input family:
# of seq after filter: ; / =95.0%
# of family after filter: ; /987972=83.6%
# of seq with repeat fragment: ; / =5.0%
# of family with repeat fragment: ; /987972=16.4%
# of seq of >=7A,>=8C,>=6G,>=7T: ; / =5.0%
# of family of >=7A,>=8C,>=6G,>=7T: ; /987972=16.4%
# of seq of >10 dimer: 5; 5/ =0.0%
# of family of >10 dimer: 5; 5/987972=0.0%
# of seq of >6 trimer: 18; 18/ =0.0%
# of family of >6 trimer: 18; 18/987972=0.0%
# of seq of >5 tetramer: 8; 8/ =0.0%
# of family of >5 tetramer: 8; 8/987972=0.0%
Index seq count
1 TAGCACTCAAGTGTTTTGCACTGG
AGAACGGTTTGCTATTTCTG
TAACGGGGGTCACCTTCGGCAG
CTCGAATTTTTCCAATCAC
GTATGCATAATTGCAAGCACAT
TTGCCTCACTCGTACAAAAGGCC
GGATCAGGACCCGACTCCACATTAG
AGAGGCCACCAAGATCTTAGGCC
AAACAGCGTCTCAGTGTAATTG
TCTGTCTATCTTCTTAAT
CTGTCCCCCTTGTCTGATACA
ACAAATGGCGGACGCGAATC
GGCAAGTCTTCCTGCGA
TGATGGAGGAACGTAAACGT
GTCCACATCATCCCGGGGTCG
TCGCGTATTGCTATTTAGGA
ATCAACAACCAATCTCGATAT
TATCAGTAACGTACATGCCCCCGAT
ACTCCTGGAGGTCTCGCTCGTCTA
AATCGAATCTTCGATACGTCGT
CGTGGAGGGAGGCCGTCAGTTT
GTCCATCTAGCCAATAGGC
AGTAACCAACCGTGAGAGTGTTGGC
GCTGGTTTTTGGAGCATG
AGTAGGTCTGTAAGGGGT
CAGTAAACGAAAGGACCGAGACT
TTACACAAACATATCAGCGAT
CTCGTTATTAGTAATACTC
TAAGTGGTAAATTAACCGTTACACC
TCGCGAGCTGACCAGTATCACG 23074
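The filter statistics in the printout above name several "repeat fragment" categories: base runs (>=7A, >=8C, >=6G, >=7T) and tandem repeats (>10 dimer, >6 trimer, >5 tetramer). The Python sketch below shows one way such patterns could be detected. It is an assumption-laden illustration, not the LC Sciences digital filter: the thresholds are read as lengths of consecutive runs and as numbers of consecutive copies of a repeat unit, and the function names are hypothetical.

```python
import re

# Minimum homopolymer run length per base, as read from the printout labels.
HOMOPOLYMER_MIN_RUN = {"A": 7, "C": 8, "G": 6, "T": 7}

# Repeat-unit length -> number of consecutive copies that must be exceeded.
TANDEM_MAX_COPIES = {2: 10, 3: 6, 4: 5}


def has_homopolymer(seq):
    """True if the sequence contains a base run at or above its threshold."""
    return any(base * run in seq for base, run in HOMOPOLYMER_MIN_RUN.items())


def has_tandem_repeat(seq):
    """True if a 2-, 3- or 4-mer unit is repeated more often than allowed."""
    for unit_len, max_copies in TANDEM_MAX_COPIES.items():
        # One captured unit followed by at least `max_copies` further copies,
        # i.e. more than `max_copies` consecutive copies in total.
        pattern = r"([ACGT]{%d})\1{%d,}" % (unit_len, max_copies)
        if re.search(pattern, seq):
            return True
    return False


def has_repeat_fragment(seq):
    """True if either repeat criterion applies."""
    return has_homopolymer(seq) or has_tandem_repeat(seq)


# Toy sequences for illustration only.
for toy_seq in ["ACGT" + "A" * 7 + "CG", "AC" * 11, "TAGCACTCAAGTGTTTTGCACTGG"]:
    print(toy_seq, has_repeat_fragment(toy_seq))
```

Regular-expression backreferences keep the tandem-repeat check compact; for very large files a dedicated low-complexity filter would likely be faster.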

sample1_fhg_gp1_align

input file: sample1/miralign/sample1_fhg_align.txt for mirs
input file: sample1/db/sample1_fhg_db.fa for sequ seq
input file: sample1/output/sample1_fhg_gp1.txt for cluster & genomeseq
Conventions:
1. the . in alignments means that the base is same as that in reference.
2. the * in alignments means that the base is same as that in reference, but the * is the mature part of precursor. The capital bases in * region also belong to mature.
3. the * in #error means that this sequseq has deletion compared with reference.
4. the + in #error means that this sequseq is without 3ADT cut, and the previous part of the sequseq is mapped to the reference and the other part is removed.
5. full length precursor and sequseq (except for the sequseq without 3ADT cut, which is indicated by + in #error) are listed below.

clusterno=1 chr=1 gi=nt_ strand=1 #mirs=7 #copy(all)=3832 #family(all)=37 #copy(0error)=1167 #family(0error)=12 #copy(1error)=2665 #family(1error)=25
genome GTATGCCTTAACAGCAAGCGCAGTAGCGTAGCGACTGGGCATGAACGCGACGTTGATGAACTCGTAAGTTCTTCCACAAGTCTGACCGTCGTATAAG #error
hsa-mir-xxe 1...**********************...********************** hsa-mir-xxe,hsa-mir-xxe*
ptr-mir-xxe 1...********************** ptr-mir-xxe
mml-mir-xxe 1...a...********************** mml-mir-xxe
mmu-mir-xxe 1...**********************...c...********************** mmu-mir-xxe,mmu-mir-xxe*
bta-mir-xxe 1...************************...c...cga bta-mir-xxe-5p
oan-mir-xxe 1.t...***********************..g... c.ga...**********************...a.. 94 oan-mir-xxe,oan-mir-xxe*
cfa-mir-xxe 1...a...g.********************** 64 cfa-mir-xxe
275_count= _count= _count= _count= _count= _count= _count= _count= c _count= c _count= c _count= t _count= c _count= a _count= c _count=8 1...c _count=7 1...c _count=5 1...a _count=4 1...t _count=3 1...t _count=3 1...c _count=3 1...t _count=3 1...c _count= _count= _count= _count= _count= _count= t _count= g _count= g _count= g _count=9 1...a 23 1

sample1_fhg_gp1_align (continued)

19877_count=6 1...g _count=3 1...t 21 1

clusterno=2 chr=1 gi=nt_ strand=1 #mirs=5 #copy(all)=156 #family(all)=7 #copy(0error)=137 #family(0error)=4 #copy(1error)=19 #family(1error)=3
genome CTCTTGCGAAAAATAAATAAACGCTCAATTAGATGGCGGCGGATTGGGTCCCCCCTAGAAGCGACAGGGTTGCTGCTGAACTCGGTGGTTCTGTGAG #error
bta-mir-xxc 4 a..g...c...***********************...g...g. 104 bta-mir-xxc
hsa-mir-xxc ***********************...********************** hsa-mir-xxc
ptr-mir-xxc *********************** ptr-mir-xxc
mmu-mir-xxc t...***********************...********************** mmu-mir-xxc
cfa-mir-xxc-1 1 ************************ cfa-mir-xxc
1630_count= _count= _count= _count= a _count=6 1...a _count=3 1...t _count=

clusterno=3 chr=1 gi=nt_ strand=1 #mirs=5 #copy(all)=22 #family(all)=2 #copy(0error)=19 #family(0error)=1 #copy(1error)=3 #family(1error)=1
genome TCATAAAAATGTCGAGGAATGGCGGCTCGCGTAGACCCGCACCCCACCCCTTCGAAGCTCATTGCGTCAGTTCCACGATTC #error
mmu-mir-xxx 1...********************** mmu-mir-xxx
bta-mir-xxx 1...**********************.g...a.. 80 bta-mir-xxx
hsa-mir-xxx 1...********************** hsa-mir-xxx
mne-mir-xxx 1...*******************G** mne-mir-xxx
cfa-mir-xxx 1...********************** 61 cfa-mir-xxx
6971_count= _count=3 1...a 24 1

clusterno=4 chr=1 gi=nt_ strand=1 #mirs=2 #copy(all)=33402 #family(all)=49 #copy(0error)=481 #family(0error)=12 #copy(1error)=32921 #family(1error)=37
genome GCTCCCCTATAAGAAGCGCGGAAGCCGGCTTATATGTTTCCCCATTATCATCGAACTTTCGATTGGGCCCCGTAACTCT #error
ptr-mir-xxx ********************** ptr-mir-xxx
hsa-mir-xxx ********************** hsa-mir-xxx
1326_count= _count= _count= _count= _count= _count= _count= _count= _count= g _count= g _count= g _count= g _count= t _count= t _count= g _count= a _count= t _count= g _count= g _count= t 21 1

sample1_fhg_gp1_mirlist

#sequ_seq_id seq length

clusterno=1: mir 40e 3p
_count=837 AAATTGTCGTCCGAACGACCCA 24 CTAAATTGTCGTCCGAACGACCCA
_count=62 AAATTGTCGTCCGAACGACCCA 23 CTAAATTGTCGTCCGAACGACCCA
_count=32 AAATTGTCGTCCGAACGACCCA 23 CTAAATTGTCGTCCGAACGACCCA
_count=12 CTAAATTGTCGTCCGAACGACC 22 CTAAATTGTCGTCCGAACGACCCA
_count=7 CTAAATTGTCGTCCGAACGACCCA 25 CTAAATTGTCGTCCGAACGACCCA
_count=7 CTAAATTGTCGTCCGAACGACC 22 CTAAATTGTCGTCCGAACGACCCA
_count=5 TAAATTGTCGTCCGAACGACCCA 24 CTAAATTGTCGTCCGAACGACCCA
_count=4 AAATTGTCGTCCGAACGACCCA 24 CTAAATTGTCGTCCGAACGACCCA
_count=3 TAAATTGTCGTCCGAACGACCCA 24 CTAAATTGTCGTCCGAACGACCCA
_count=3 AAATTGTCGTCCGAACGAC 19 CTAAATTGTCGTCCGAACGACCCA
_count=3 AAATTGTCGTCCGAACGACCCA 24 CTAAATTGTCGTCCGAACGACCCA
_count=3 CTAAATTGTCGTCCGAACGA 20 CTAAATTGTCGTCCGAACGACCCA
1 3

clusterno=1: mir 40e 5p
_count=133 AAGAGTGCGTTGATTGTGGGTA 22 TCAAGAGTGCGTTGATTGTGGGTA
_count=61 CAAGAGTGCGTTGATTGTGGG 21 TCAAGAGTGCGTTGATTGTGGGTA
_count=4 AAGAGTGCGTTGATTGTGG 19 TCAAGAGTGCGTTGATTGTGGGTA
_count=4 AAGAGTGCGTTGATTGTGGGT 21 TCAAGAGTGCGTTGATTGTGGGTA
_count=4 AAGAGTGCGTTGATTGTGGG 20 TCAAGAGTGCGTTGATTGTGGGTA
_count=168 CAAGAGTGCGTTGATTGTGGGT 22 TCAAGAGTGCGTTGATTGTGGGTA
_count=80 AAGAGTGCGTTGATTGTGGGTA 22 TCAAGAGTGCGTTGATTGTGGGTA
_count=19 AAGAGTGCGTTGATTGTGGGT 21 TCAAGAGTGCGTTGATTGTGGGTA
_count=12 TCAAGAGTGCGTTGATTGTGG 21 TCAAGAGTGCGTTGATTGTGGGTA
_count=9 TCAAGAGTGCGTTGATTGTGGGT 23 TCAAGAGTGCGTTGATTGTGGGTA
_count=6 AAGAGTGCGTTGATTGTGGG 20 TCAAGAGTGCGTTGATTGTGGGTA
_count=3 CAAGAGTGCGTTGATTGTGGG 21 TCAAGAGTGCGTTGATTGTGGGTA
_count=3 AAGAGTGCGTTGATTGTGGGTA 22 TCAAGAGTGCGTTGATTGTGGGTA
_count=3 AAGAGTGCGTTGATTGTGGGTA 23 TCAAGAGTGCGTTGATTGTGGGTA
_count=3 AAGAGTGCGTTGATTGTGGGTA 23 TCAAGAGTGCGTTGATTGTGGGTA
3 3

sample1_fhg_gp1_mirlist (continued)

clusterno=2: mir 40c 5p
_count=106 AGTGGAGAGTGCCGCGTGTCTCG 24 GAGTGGAGAGTGCCGCGTGTCTCG
_count=19 GAGTGGAGAGTGCCGCGTGTCTC 23 GAGTGGAGAGTGCCGCGTGTCTCG
_count=7 AGTGGAGAGTGCCGCGTGTCTC 22 GAGTGGAGAGTGCCGCGTGTCTCG
_count=10 AGTGGAGAGTGCCGCGTGTCTCG 24 GAGTGGAGAGTGCCGCGTGTCTCG
_count=6 AGTGGAGAGTGCCGCGTGTCTCG 25 GAGTGGAGAGTGCCGCGTGTCTCG
_count=3 GAGTGGAGAGTGCCGCGTGTCTCG 25 GAGTGGAGAGTGCCGCGTGTCTCG
1 2

clusterno=2: mir 40c 3p
_count=5 ATCGCAGAATGCGCCTTGAT 22 CATCGCAGAATGCGCCTTGAT
2 1

clusterno=3: mir p
_count=9 AACTAGCGGTCTCTTTCGCGT 21 AACTAGCGGTCTCTTTCGCGTGGA
_count=13 ACTAGCGGTCTCTTTCGCGTGG 22 AACTAGCGGTCTCTTTCGCGTGGA
2

sample1_fhg_gp1_sum

input file: sample1/miralign/sample1_fhg_align.txt for mirs
input file: sample1/db/sample1_fhg_db.fa for sequ seq
input file: sample1/output/sample1_fhg_gp1.txt for cluster & genomeseq
Title: Position clusters of mirs mapping to genome
input file: sample1/lists/sample1_fhg_align_chry.txt for sequence start position
input file: sample1/miralign/sample1_fhg_matchedmirs.fa for mapped mir IDs
input file: sample1/lists/sample1_fhg_align_chr1.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr2.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr3.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr4.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr5.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr6.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr7.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr8.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr9.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr10.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr11.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr12.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr13.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr14.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr15.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr16.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr17.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr18.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr19.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr20.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr21.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chr22.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chrx.txt for alignment data
input file: sample1/lists/sample1_fhg_align_chry.txt for alignment data

unique mammalian mirs mapped by sequ seq: 6456
# of unique mammalian mirs mapped by sequ seq & genome: 5256; 5256/6456=85.6%
# of position clusters of the mapped mirs: 574
length of the genome:
position cluster: the distance of two near positions in a cluster is < 50

For the mirs mapped by sequ seq and genome after re alignment:
# of position clusters: 457
# of mammalian mirs mapped to sequ seq and genome: 4839
# of unique sequ seq:
# of unique sequ family: 5330

Note:
represent mir at 5p or 3p is the sequenced sequence of lowest error# in each cluster and highest copy#
the mir_name is composed of 4 parts, mir, extension number, 5p or 3p, index of sequ seq. The extension number is the extension number in mirids of highest occurency
# of unique mirs detected: 563; # of unique mirs is counted based on mir_name.

sample1_fhg_gp1_sum (continued)

Columns: Index | clusterno | mir_name | mir_seq | mir_len | copy# of the isoform | copy# (all isoforms in 5p or 3p) | family# (all isoforms in 5p or 3p) | chr# | chr_seqid | strand | mir_start | mir_end | mir_start | mir_end | #mirs | mirids

1 1 mir 30e 5p 275 TGTAAACATCCTTGACTGGAAGCT NT_ hsa mir XXX
2 20 mir 28 3p 2238 CACTAGATTGTGAGCTCCTGGA NT_ hsa mir XXX
3 57 mir 27b 3p 127 TTCACAGTGGCTAAGTTCTGC NT_ hsa mir XXX
4 87 mir 625 3p GACTATAGAACTTTCCCCCTCA NT_ hsa mir XXX
5 98 mir p AGGGTAGATAGAACAGGTCTTG NT_ hsa mir XXX
mir 21 3p 413 CAACACCAGTCGATGGGCTGTC NT_ hsa mir XXX
mir 7e 3p CTATACGGCCTCCTAGCTTTCC NT_ hsa mir XXX
mir 101 3p 121 GTACAGTACTGTGATAACTGAA NT_ hsa mir XXX
mir 135b 3p ATGTAGGGCTAAAAGCCATGGG NT_ hsa mir XXX
mir 425 3p 5257 ATCGGGAATGTCGTGTCCGCC NT_ hsa mir XXX
mir 548p 3p CCAAAACTGCAGTTACTTTTGC NT_ hsa mir XXX
mir 31 5p 412 AGGCAAGATGCTGGCATAGCTG NT_ hsa mir XXX
mir 548l 5p AAAAGTATTTGCGGGTTTTGTC NT_ hsa mir XXX
mir 125b 5p 90 TCCCTGAGACCCTAACTTGTGA NT_ hsa mir XXX
mir p 5124 TCTGGGTGGTCTGGAGATTTGTG NT_ hsa mir XXX
mir 16 3p 9924 CCAGTATTAACTGTGCTGCTGA NT_ hsa mir XXX
mir p GAGGCAGAAGCAGGATGACAA NT_ hsa mir XXX
mir 33b 5p 3282 GTGCATTGCTGTTGCATTGCA NT_ hsa mir XXX
mir 454 3p 8695 TAGTGCAATATTGCTTATAGGGTTT NT_ hsa mir XXX
mir 320 3p 1624 AAAAGCTGGGTTGAGAGGG NT_ hsa mir XXX
mir 548j 5p AAAAGTAATTGCGGTCTTTGGT NT_ hsa mir XXX
mir 659 5p AGGACCTTCCCTGAACCAAGGA NT_ hsa mir XXX
mir p ACGCCCTTCCCCCCCTTCTTCA NT_ hsa mir XXX
mir 221 5p 816 ACCTGGCATACAATGTAGATTTCT X NT_ hsa mir XXX
mir 221 3p 5 AGCTACATTGTCTGCTGGGTTTC X NT_ hsa mir XXX
mir 222 5p 4253 CTCAGTAGCCAGTGTAGATCC X NT_ hsa mir XXX
mir 222 3p 23 AGCTACATCTGGCTACTGGGTCTC X NT_ hsa mir XXX
mir 548i 5p AAAAGTACTTGCGGATTTTGC X NT_ hsa mir XXX
mir 361 5p 1854 TTATCAGAATCTCCAGGGGTAC X NT_ hsa mir XXX
mir 361 3p TCCCCCAGGTGTGATTCTGATTT X NT_ hsa mir XXX
mir 421 3p 3255 ATCAACAGACATTAATTGGGCGC X NT_ hsa mir XXX

sample1_fhg_clusterposition
input from sample1/0finalreport/3_sample1_fhg_gp1_sum.txt
input from sample1/0finalreport/4_sample1_fhg_gp2_sum.txt
input from sample1/0finalreport/6_sample1_fhg_gp4_sum.txt
The position refers to the start position of mir in human genome. ESTs are added one by one after chromosome X.
The PositionInSeq refers to the position of mir in its own contig sequence or EST.
Refer to above three input files to get definitions of mir_name & mir_seq.
The clusterdistance is the difference of the positions of the current and previous clusters.
Minimum c 1 Maximum c
Index Position clusterDistance #Copy (all isoforms in 5p or 3p) #family (all isoforms in 5p or 3p) Chr# Strand StartPositionInSeq EndPositionInSeq Type unique_mirs mir_name mir_seq
predict (gp4) PC 5p XXXXX GCGGCACTGAGGCTTATAGCGGAA
predict (gp4) PC 3p XXXXX TGTACGGCCATCCAGCTCTAGGCC
predict (gp4) PC 5p XXXXX GGAATAGCACATCAAGTAGGT
predict (gp4) PC 5p XXXXX CACGGCCATTAGACGACGCCGGG
predict (gp4) PC 3p XXXXX TGTACGTAAAGTGACTCCACTAA
predict (gp4) PC 5p XXXXX TGATAGGCCCTACTGTCCATGTT
known (gp1) hsa mir XXX mir XXX 5p XXX TATGCCAGGCAGTTATACCAT
known (gp1) hsa mir XXX mir XXX 5p XXX CAAGCGTTTGTCAACAAAGTGTTGA
predict (gp4) PC 5p XXXXX TTGTCCGATTATGTGCTCG
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 5p XXX CCGCGACGTTTTCGGGACCGA
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 3p XXX CCAAACTCGCGAACTAG
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 5p XXX GAACTCTACGAATCATCCTAGTATG
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 3p XXX GCTGCCTCCGTACGATGCTA
predict (gp4) PC 5p XXXXX ATCTAATGTGGGTGACACTGGT
predict (gp4) PC 5p XXXXX GGGGTTTAGGGTACCCGCTTCTG
predict (gp4) PC 5p XXXXX TGTTGAGCGATTGCATGCAACTTA
predict (gp4) PC 5p XXXXX TGGGTGCGTGTGGTCACGTC
predict (gp4) PC 5p XXXXX TGCGCGTCTTTATTATC
predict (gp4) PC 5p XXXXX CCGTGATTGGACCGTCGCGTTCGT
predict (gp4) PC 5p XXXXX CACTGCCGAACGATCTGTGATTCC
predict (gp4) PC 3p XXXXX TTCACGCTGGGTTATATCTCTCGC
predict (gp4) PC 5p XXXXX CCTCTCCTGGTTAGTCCA
predict (gp4) PC 3p XXXXX CCAAGCAGTCTGGCATCTTATGC
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 3p XXX GTATCGCTATCGCCCAGAGCGTCG
predict (gp4) PC 5p XXXXX CACCCAAGGACCCCGCC
known (gp1) ssc mir XXX mir XXX 3p XXX GCGCCTTCCGCCGATTTTGT
predict (gp4) PC 3p XXXXX ACGATACTGTACTCGGG
predict (gp4) PC 5p XXXXX CCAAGAGGTGTGTTGAGCA
predict (gp4) PC 5p XXXXX AGTTTTCGCACGGCGTGTCAT
predict (gp4) PC 5p XXXXX TGCTTATGCAGCTTTGTAGCCT
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 5p XXX AAGGCGGGTCTACTAAGGGGAGC
predict (gp4) PC 5p XXXXX GAAACCAGCTAAGCAATGC
known (gp1) mmu mir XXX;mml mir XXX;b mir XXX 5p XXX GGGCATAACTGTGGGCTGAC
predict (gp4) PC 5p XXXXX TTGAGGTCCGTTCCTCAGTCGACCT
predict (gp4) PC 3p XXXXX TAGAGGTAGCCACAAGGATAGCG
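The clusterdistance column described above (the difference between the positions of the current and the previous cluster) could be computed along these lines. This is only a sketch: `cluster_distances` is a hypothetical name, and the code assumes one start position per cluster, all on the same coordinate axis.

```python
from typing import List, Optional


def cluster_distances(cluster_positions: List[int]) -> List[Optional[int]]:
    """Return, for each cluster in position order, its distance to the previous cluster.

    The first cluster has no predecessor, so its distance is None.
    Hypothetical helper for illustration only.
    """
    ordered = sorted(cluster_positions)
    if not ordered:
        return []
    return [None] + [cur - prev for prev, cur in zip(ordered, ordered[1:])]


# Made-up cluster start positions, for illustration only.
distances = cluster_distances([500, 100, 130])
known = [d for d in distances if d is not None]
smallest, largest = min(known), max(known)
```

The minimum and maximum of the non-empty distances correspond to the kind of summary values that the file header reports before the table.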

Table 1 - Data summary
Raw Data #SequSeq #UniqueSeq
# Raw Sequ seq 9,922,513 1,592,666
Data Processing #SequSeq %SequSeq #UniqueSeq %UniqueSeq
1. impurity sequences filtered 948, % 623, %
2. Copy#<3 filtered 913, % 845, %
3. Length < 15 filtered 204, % 56, %
4. mrna,rfam,repbase filtered 362, % 11, %
5. Final Mappable 7,493, % 55, %
Total 9,922, % 1,592, %
Table 2 - Length distribution of mappable data
Length #SequSeq %FinalMappable SequSeq #UniqueSeq %FinalMappable UniqueSeq #SequSeq/ #UniqueSeq
15 52, % % , % 1, % , % % , % 1, % , % % , % 1, % , % 3, % ,438, % 16, % ,810, % 8, % , % 2, % , % % , % % , % % , % % , % % , % % , % 15, % 7.4
Final Mappable 7,493, % 55, % 135.1
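Two of the filtering steps in Table 1 (copy# < 3 and length < 15) and the #SequSeq/#UniqueSeq ratio from Table 2 are simple enough to sketch. The code below is an illustration under assumptions, not the vendor's implementation: the function names and the input layout (a mapping from unique sequence to copy number) are hypothetical, and the impurity and mrna/rfam/repbase filters are omitted because they need reference databases.

```python
from typing import Dict


def filter_unique_seqs(copy_numbers: Dict[str, int],
                       min_copy: int = 3,
                       min_length: int = 15) -> Dict[str, int]:
    """Drop unique seqs with copy# < min_copy or length < min_length."""
    return {
        seq: copies
        for seq, copies in copy_numbers.items()
        if copies >= min_copy and len(seq) >= min_length
    }


def reads_per_unique(copy_numbers: Dict[str, int]) -> float:
    """#SequSeq / #UniqueSeq: average number of reads per unique sequence."""
    if not copy_numbers:
        raise ValueError("no sequences left after filtering")
    return sum(copy_numbers.values()) / len(copy_numbers)


# Copy numbers below are made up for illustration.
raw = {
    "TGTAAACATCCTTGACTGGAAGCT": 120,  # kept
    "ACGTACGTAC": 40,                 # removed: length 10 < 15
    "TTCACAGTGGCTAAGTTCTGC": 2,       # removed: copy# 2 < 3
    "CACTAGATTGTGAGCTCCTGGA": 30,     # kept
}
kept = filter_unique_seqs(raw)
ratio = reads_per_unique(kept)
```

Applying both thresholds in one pass gives the same surviving set as applying them one after the other, although the per-step counts reported in Table 1 depend on the order of the steps.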

Table 3 - Unique seq mapped to unique mammalian mirs
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
# Known mammalian unique mir in mirbase v14.0 3,924
# Known mammalian unique mir in mirbase v14.0 2,656
# Unique mir in mirbase mapped 1,599
# Unique mir in mirbase mapped 2,273
Mapped to mirbase 6,013, % 80.25% 12, % 22.46%
# Known hsa mir in mirbase v
# Known hsa mir in mirbase v
# Unique hsa mir in mirbase mapped 286
# Unique hsa mir in mirbase mapped 399
Mapped to hsa of mirbase 5,937, % 79.24% 8, % 15.13%
Table 4 - Cluster I: Sequ seq mapped to mammalian mirs that further mapped to genome
Mapping to mammalian:
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster I 5,734, % 76.53% 8, % 15.03%
# Alignment-cluster in Cluster I 389
# Unique mir in Cluster I 1,456
# Unique mir in Cluster I 309
FileName Mapped_Data/sample1_FHG_gp1_Align.txt
FileName Mapped_Data/sample1_FHG_gp1_Sum.txt
FileName Mapped_Data/sample1_FHG_gp1_miRlist.txt
Mapping to species hsa:
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster I 5,734, % 76.53% 8, % 15.01%
#Unique hsa mir in Cluster I 295
#Unique hsa mir in Cluster I 399

Table 5 - Cluster II: Sequ seq mapped to both mammalian mirs and genome, but the mirs unmapped to genome:
Mapping to mammalian:
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster II % 0.00% % 0.02%
# Alignment-cluster in Cluster II 7
# Unique mir in Cluster II 3
# Unique mir in Cluster II 4
FileName Mapped_Data/sample1_FHG_gp2_Align.txt
FileName Mapped_Data/sample1_FHG_gp2_Sum.txt
FileName Mapped_Data/sample1_FHG_gp2_miRlist.txt
Table 6 - Cluster III: Sequ seq mapped to mammalian mirs, but the mirs unmapped to genome (sequence cluster):
Mapping to mammalian:
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster III 2, % 0.04% % 0.26%
# Alignment-cluster in Cluster III 23
# Unique mir in Cluster III 37
# Unique mir in Cluster III 18
FileName Mapped_Data/sample1_FHG_gp3_Align.txt
FileName Mapped_Data/sample1_FHG_gp3_Sum.txt
FileName Mapped_Data/sample1_FHG_gp3_miRlist.txt
Mapping to species hsa:
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster III % 0.00% % 0.00%
#Unique hsa mir in Cluster III 0
#Unique hsa mir in Cluster III 0
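The grouping logic behind Clusters I to IV in Tables 4 to 7 can be read as a small decision tree over three yes/no mapping results. The sketch below is one plausible reading of the table titles, not a confirmed description of the analysis software: `assign_cluster` and its three boolean parameters are hypothetical, Cluster III is assumed to mean that the read itself does not map to the genome, and the mfold hairpin prediction required for Cluster IV is left out.

```python
def assign_cluster(seq_maps_to_mir: bool,
                   seq_maps_to_genome: bool,
                   mir_maps_to_genome: bool) -> str:
    """Assign a sequ seq to a cluster group from three mapping outcomes.

    Simplified, assumed reading of the table titles; the hairpin
    prediction step for Cluster IV is not modeled.
    """
    if seq_maps_to_mir:
        if mir_maps_to_genome:
            return "Cluster I"    # mapped mir further maps to the genome
        if seq_maps_to_genome:
            return "Cluster II"   # read maps to mir and genome, mir does not map to genome
        return "Cluster III"      # read maps to mir only
    if seq_maps_to_genome:
        return "Cluster IV"       # genome hit without a known mir
    return "Nohit"


labels = [
    assign_cluster(True, True, True),
    assign_cluster(True, True, False),
    assign_cluster(True, False, False),
    assign_cluster(False, True, False),
    assign_cluster(False, False, False),
]
```

The report's mapping summary also mentions "nohit 1" and "nohit 2" categories, which this sketch does not distinguish.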

Table 7 - Cluster IV: Sequ seq mapped to genome, but unmapped to mammalian mirs (predict new hairpin by mfold):
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Cluster IV 18, % 0.25% 1, % 2.53%
# Alignment-cluster in Cluster IV 1,095
# Unique mir in Cluster IV 703
FileName Mapped_Data/sample1_FHG_gp4_Align.txt
FileName Mapped_Data/sample1_FHG_gp4_Sum.txt
FileName Mapped_Data/sample1_FHG_gp4_miRlist.txt
Table 8 - Unmapped Sequ Seq
Sequ Seq Unique Seq
#SequSeq %SequSeq %FinalMappable #UniqueSeq %UniqueSeq %FinalMappable
Nohit 847, % 11.31% 40, % 72.15%
FileName Mapped_Data/sample1_FHG_nohit.txt
Table 9 - Mapping summary
#SequSeq %SequSeq #UniqueSeq %UniqueSeq
Raw 9,922, % 1,592, %
Mappable 7,493, % 55, %
Mapped to mirbase (including nohit 1) 6,013, % 12, %
Mapped to Cluster I 5,734, % 8, %
Mapped to Cluster II % %
Mapped to Cluster III 2, % %
Mapped to Cluster IV 18, % 1, %
Mapped (total) 5,756, % 9, %
Nohit (including nohit 1 and nohit 2) 847, % 40, %
Note: Mapped (total) + Nohit should equal Mappable.
Table 10 - Detected mir summary
# of unique mirs detected in Cluster I 309
# of unique mirs detected in Cluster II 4
# of unique mirs detected in Cluster III 18
# of unique mirs detected in Cluster IV 703
Total 1,034
FileName Mapped_Data/sample1_FHG_uni_miRs.txt
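The two bookkeeping rules in this summary (the Table 10 total is the sum of the four cluster counts, and Mapped (total) + Nohit should equal Mappable) are easy to verify programmatically. The snippet below is a small sketch: the Table 10 counts are taken from the report, while `mapping_is_consistent` is a hypothetical helper and the numbers passed to it in the usage lines are made up, because the corresponding Table 9 values are incomplete in this copy.

```python
# Counts of unique mirs per cluster, as listed in Table 10.
detected = {
    "Cluster I": 309,
    "Cluster II": 4,
    "Cluster III": 18,
    "Cluster IV": 703,
}
total_detected = sum(detected.values())  # 1034, matching the Total row


def mapping_is_consistent(mapped_total: int, nohit: int, mappable: int) -> bool:
    """Check the Table 9 note: Mapped (total) + Nohit should equal Mappable."""
    return mapped_total + nohit == mappable


# Hypothetical counts, for illustration only.
ok = mapping_is_consistent(mapped_total=90, nohit=10, mappable=100)
bad = mapping_is_consistent(mapped_total=90, nohit=5, mappable=100)
```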

Note: Definition of mir_name:
in cluster group1: the mir_name is composed of 4 parts: mir, extension number, 5p or 3p, index of sequ seq. The extension number is the extension number in mirids of highest occurrence.
in cluster group2: the mir_name is composed of 4 parts: PC, extension number, 5p or 3p, index of sequ seq. PC means 'Predicted Candidate'. The extension number is the extension number in mirids of highest occurrence.
in cluster group3: the mir_name is composed of 4 parts: PN, extension number, 5p or 3p, index of sequ seq. PN means 'Predicted Novel mir'. The extension number is the extension number in mirids of highest occurrence.
in cluster group4: the mir_name is composed of 3 parts: PC, 5p or 3p, index of sequ seq. PC means 'Predicted Candidate'.
The mir sequence is from the isoform of lowest error# and highest copy# in cluster groups 1, 2 & 4.
The mir sequence is from the isoform of highest copy# regardless of its error# in cluster group 3.
# of unique mirs detected in these 4 groups is counted from different mir_seq.
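The naming scheme above lends itself to a small parser. The following is a minimal sketch under assumptions: the separator between the parts is not visible in this copy of the report, so the code accepts hyphens, underscores, or whitespace; the index is assumed to be a plain integer; and `MirName` and `parse_mir_name` are hypothetical names. The cluster group is inferred from the prefix and the number of parts, as the definitions describe.

```python
import re
from typing import NamedTuple, Optional


class MirName(NamedTuple):
    prefix: str               # "mir", "PC" or "PN"
    extension: Optional[str]  # e.g. "30e"; None for cluster group 4
    arm: str                  # "5p" or "3p"
    index: int                # index of the sequ seq
    group: int                # inferred cluster group, 1 to 4


# (upper-cased prefix, has extension number) -> cluster group
_GROUPS = {("MIR", True): 1, ("PC", True): 2, ("PN", True): 3, ("PC", False): 4}


def parse_mir_name(name: str) -> MirName:
    """Split a mir_name into its parts (separator is an assumption)."""
    parts = [p for p in re.split(r"[-_\s]+", name.strip()) if p]
    if len(parts) == 4:
        prefix, extension, arm, index = parts
    elif len(parts) == 3:
        prefix, arm, index = parts
        extension = None
    else:
        raise ValueError(f"expected 3 or 4 parts in {name!r}")
    if arm not in ("5p", "3p"):
        raise ValueError(f"arm must be 5p or 3p in {name!r}")
    if not index.isdigit():
        raise ValueError(f"index must be an integer in {name!r}")
    key = (prefix.upper(), extension is not None)
    if key not in _GROUPS:
        raise ValueError(f"unrecognized prefix/part combination in {name!r}")
    return MirName(prefix, extension, arm, int(index), _GROUPS[key])


known = parse_mir_name("mir-30e-5p-275")  # 4 parts, prefix mir: group 1
candidate = parse_mir_name("PC-3p-12")    # 3 parts, prefix PC: group 4 (index made up)
```

Keeping the group inference in a lookup table makes the difference between the two PC variants explicit: with an extension number the name belongs to group 2, without one to group 4.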
