SUPPLEMENTARY INFORMATION

Size: px
Start display at page:

Download "SUPPLEMENTARY INFORMATION"

Transcription

1 SUPPLEMENTARY INFORMATION Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals Young Seok Ju, Jong-Il Kim, Sheehyun Kim, Dongwan Hong, Hansoo Park, Jong-Yeon Shin, Seungbok Lee, Won-Chul Lee, Sujung Kim, Saet-Byeol Yu, Sung-Soo Park, Seung-Hyun Seo, Ji- Young Yun, Hyun-Jin Kim, Dong-Sung Lee, Maryam Yavartanoo, Hyunseok Peter Kang, Omer Gokcumen, Diddahally. R. Govindaraju, Jung Hee Jung, Hyonyong Chong, Kap-Seok Yang, Hyungtae Kim, Charles Lee and Jeong-Sun Seo 1

2 Table of Contents Supplementary Note 1. Genome Analysis Sensitivity of SNP calling 5 Super nssnp genes 6 Estimating the number of novel variants as the number of personal genomes increases 7 Linkage disequilibrium of novel variants 7 Detection of large deletions 8 Identification of large deletion breakpoints 9 Inferring molecular mechanisms for large deletion formation 9 Microhomology-sequence motifs for NHEJ deletions 10 De novo assembly of short reads Transcriptome analysis Sequence alignment for transcriptome 12 Expression mapping 12 Unknown transcripts 13 Genes escape X-inactivation Comprehensive analysis of genome and transcriptome Transcriptional base modification (TBM) 15 Allelic-specific expression (ASE) 15 References 17 2

3 Supplementary Figures Suppl. Figure 1: Experimental overview of the study 18 Suppl. Figure 2: Sensitivity of SNP detection in population-level 19 Suppl. Figure 3: Description of PRIM2 gene - Example of super nssnp gene 20 Suppl. Figure 4: Super nssnp genes 21 Suppl. Figure 5: Overview of the strategy for detecting large deletion breakpoints 22 Suppl. Figure 6: Comparison of RNA sequence alignments on cdna and genome sequences 23 Suppl. Figure 7: Validation of 4 unknown transcripts 24 Suppl. Figure 8: Distance between unknown transcripts and nearest gene 25 Suppl. Figure 9: Validation of 15 TBMs 26 Supplementary Tables Suppl. Table 1: Whole Genome Sequencing Statistics (10 individuals) 28 Suppl. Table 2: WGS SNP Accuracy from validation using Illumina 610K genotyping array 29 Suppl. Table 3: Primers for validations 30 Suppl. Table 4: Indel list of 10 individuals extracted by whole genome sequencing 31 Suppl. Table 5: Exome sequencing statistics 32 Suppl. Table 6: Non-synonymous SNP list detected from 18 individuals (10 whole genome sequencing, 8 whole exome sequencing) 33 Suppl. Table 7: Funtional assessment of nssnp of 18 individuals 34 Suppl. Table 8: Super nssnp gene list 35 Suppl. Table 9: List of Korean common novel nssnp LD 36 Suppl. Table 10: Total 5,496 large deletion list of 8 individuals including breakpoints information 37 Suppl. Table 11: Validation of large deletions using 24M CGH array data 38 3

4 Suppl. Table 12: Breakpoints list of NA Suppl. Table 13: Motifs on flanking regions of NHEJ large deletions 40 Suppl. Table 14: RNA Sequencing Statistics 41 Suppl. Table 15: Comparison of transcriptome alignment methods using pseudogene expression 42 Suppl. Table 16: Expression map represented in RPKM value on all RefSeq genes 43 Suppl. Table 17: List of Korean common novel transcripts 44 Suppl. Table 18: 23 Genes Escape X-inactivation 45 Suppl. Table 19: 1,809 TBM sites 46 Suppl. Table 20: 580 Allele Specific Expression sites 47 Suppl. Table 21: Contig list generated by de novo assembly 48 Suppl. Table 22: Alignment result of de novo assemble contigs 49 4

5 Supplementary Note 1. Genome Analysis 1.1. Sensitivity of SNP calling The number of SNPs detected in an individual genome can be changed by altering SNP filter conditions. Even in the same individual, number of SNPs detected with different algorithms can be changed extensively 1 (3.84 million SNPs by ELAND, 4.13 million SNPs by MAQ, 3.61 million SNPs in common). In this project, we focused on the identification of rare SNPs. As we may expect, since false positive SNPs tend to appear in an individual specific manner (interpreted as being rare), the number of rare SNPs can be easily overestimated if the SNP filter conditions are modified to allow for more false positives to be included in the SNP list. Therefore, we have attempted to reduce the false positives in our SNP list, attempting to maximize our PPV rather than detection sensitivity. We have achieved a high PPV (>99.94%) for SNP detection, based on the comparison of wholegenome sequencing and microarray data. SNPs we called are accurate. Our experimental validation by PCR and Sanger sequencing also supported our high accuracy of SNPs we called (100%). Given the PPV of SNP detection (~ 99.94%), we expect the number of false positives in each individual genome to be approximately 2,000 (3.5 million SNPs x ). Excluding data from AK1 and AK2 (sequenced by earlier platforms), the sensitivities of SNP detection for the other Korean individuals are ~ 97%. Most of the variants that we could not detect (false negatives) appear to be covered by only a few reads, with which we cannot accurately call SNPs. For example, AK4, 62.4% (n=4,930) of the mismatch-sites between sequencing and 610K microarray were covered less than 10 times. Lower coverage (23.1x) than 30x may account for the part of the insufficient read-depth for SNP calling. Interestingly, 74.3% (n=3,665) of these regions showed high GC (> 55%) or low GC (<25%) contents. The GA technology does not provide robust sequence data in genomic regions of high or low GC contents 1. Because we sequenced 10 individual genomes, the population-level sensitivity of SNP detection is much higher than that of each individual. Compared with microarray-data, we identified ~95% of 5

6 singletons. However, we could identify > 99% of SNPs when more than 2 individuals have the variants in the genome (Supplementary Figure 2) Super nssnp genes The density of nssnps on coding sequences was 0.254/kb on average. During the course of summarizing nssnps, we identified a subset of genes with more nssnps than expected. These super nssnp genes were defined as those in which 1) the average number of nssnps was 2 among all the individuals whole-genome sequenced, and 2) the nssnp density was 4/kb of coding sequence. We found that, in some cases, hidden duplication of genes that are frequently duplicated among the population but are not located in the human reference genome may generate super nssnp genes as an artifact. For example, PRIM2 is a super nssnp gene. However, the read-depth for PRIM2 genes is highly elevated for all Korean individuals whole-genome sequenced, suggesting that copy number gain in PRIM2 gene may be frequent in this population (Supplementary Figure 3). Interestingly, the human reference genome includes only a single copy of PRIM2 on chromosome 6, whereas the C. Venter genome 2 includes PRIM2 homologous DNA segments on chromosome 5. Therefore, during alignments, short reads generated from homologous segments should be mapped to chromosome 6; thus, all mismatches between the segments of chromosome 5 and 6 would appear as SNPs (mostly heterozygous) even though there are actually few variants of PRIM2 in the human genome. Because the reference genome is not a perfect reference, the interpretation of human resequencing should be done carefully. Of the 86 super nssnp genes we identified, 15 showed more than a 30% increase in read-depth (Supplementary Figure 4). In addition, 33 were located on the segmental duplications of human genomes, and 75 overlapped with known CNV regions archived in the Database of Genome Variants (DGV). (Only 9 super nssnp genes were not related to increased read-depth, segmental duplications, or CNV regions in the DGV.) These observations suggest that structural variants could partially account for super nssnp genes Estimating the number of novel variants as the number of personal genomes increases 6

7 We simulated the number of novel variants that would be discovered as the number of personal genomes increases. To accomplish this, we first randomly permuted the order of 10 personal genomes 1000 times. At each step, we obtained the average number of new variants; that is, those that were not archived in genome databases and were not discovered in the previous step. This method was applied to SNPs, short Indels, and nssnps. For nssnps, we performed identical tests using nssnps from 18 individuals (10 whole-genomes and 8 whole-exomes) to further confirm the 10 individual estimates. Then, using the numbers of each novel variant at each incremental step, we obtained trend curves by the least-mean-square method. Surprisingly, the extrapolation of the trend showed that a number of SNPs would be identified as novel, even though many personal genomes are deep sequenced. For example, ~54,000 novel SNPs would be identified after sequencing 100 haploid genomes (50 individuals; < 1% allele frequency). Similarly, ~28,000, ~6000 and ~700 SNPs would be discovered as novel when 100 (< 0.5%), 500 (< 0.1%) and 5,000 (< 0.01%) individuals are sequenced in high depth, respectively. However, this number should be interpreted cautiously, in particular when extensive extrapolation was used, since approximately 2,000 SNPs are expected to be false positives in an individual genome, given the PPV in SNP detection (99.94%). The number of novel nssnp decreases relatively more slowly than SNPs. From these observations, we may conclude that nssnps are relatively more diverse than SNPs Linkage disequilibrium of novel variants We examined the linkage disequilibrium (correlation, r 2 ) between novel but common (allele frequency 2/20 among Koreans) non-synonymous and surrounding SNPs (within 20 kb), both upstream and downstream, among 10 individual whole genomes sequenced. We were particularly interested in the linkage relationship between novel non-synonymous and known variants, or tagging SNPs, for common genotyping arrays, since the novel nssnps are likely to be functional. However, the linkage with known variants should be tight if they are to be detected as candidates for complex diseases in genome-wide association studies (GWAS). 7

8 1.5. Detection of large deletions Because of incompleteness of the human reference genome (e.g., PRIM2), and genomic characteristics of that makes some genomic regions inaccessible by sequencing technologies (e.g., extremely high GC ratio, repetitive sequence, gaps), whole-genome resequencing of single individuals against the human reference genome is not a feasible and accurate approach for identifying structural variations that show real polymorphisms among human populations. In this manuscript, by comparing multiple genomes in parallel, we could identify reliable polymorphisms existing in human populations. To find SV, we used read-depth (RD), paired-end (PE) read, and split-reads information of each personal genomes and performed pairwise comparison between genomes. First, we calculated the normalized (to 25.0x) personal whole-genome read-depth of coverage in a 30 bp window size as follows: normalized RD 30bp window,person = RD 30bp window,person 25.0x average whole genome RD person Then we compared the 30 bp window coverage between two individuals (RD deviation = normalized RD 30bp window,person A normalized RD 30bp window,person B ) using all available combinations (N= C 2 = 28 ) 8. To be defined as a deletion candidate, the read-depth deviation should be increased (4/3 = 1.33x) or decreased (3/4 = 0.75x) for more than 33 windows in a row (>1 kb long). Likewise, to identify regions of homozygous deletion for all individuals considered, we also counted regions with in which the read-depth for all individuals was less than 5x for more than 33 windows in a row (>1 kb long). Because the frequency of CN loss predominates over CN gain in the human genome, we proceeded to next step using the assumption that all candidate regions identified above correspond to CN loss. For the next step, we investigated stretched paired-end reads, aligning each end onto each flanking region of large deletion candidates (defined as <1 kb from the estimated junction). We regarded deletion candidate regions with 2 stretched paired-end reads as suggestive. However, existence of stretched paired-end reads is not an essential prerequisite for large deletion; thus, we did not remove candidates with one or no paired-end reads in this step. 8

9 Thereafter, we checked read-depth changes and paired-end reads near deletion candidate regions for all individuals in parallel. Regions with nearby unstable read depths, which make it difficult to determine deletion, were removed. If an individual region exhibited a remarkable read-depth decline compared to flanking regions or if read depths were variable among individuals, it was considered a large deletion. If any individuals showed fitting of stretched reads to the large deletion regions, all individuals who showed a clear decline in read depth for the regions were regarded as carrying the deletion in their genome. If all individuals showed a read depth of approximately zero (RD < 3), we regarded the regions as unimorphic CN losses Identification of large deletion breakpoints To identify nucleotide-resolution breakpoints of large deletions, we used orphan reads, which are short-reads that only one of paired-end reads is successfully aligned to the reference genome. We collected all the orphan reads which were mapped within 1 kb of an estimated large deletion boundary. Then unmapped ends of the orphan-reads were re-aligned to reference genome using the BLAT 3 program, to check if it could be split and separately aligned ( split reads ) to both side of large deletion region. The alignment information of the split reads provides nucleotide-resolution breakpoints of large deletions. To collect the most reliable breakpoints for each large deletions, we picked a split-reads group that had the greatest number of split reads with the same gap coordinates. When there was more than one split-reads group with different gap coordinates and the same number of split reads, we chose split reads with the closest gap size compared to the estimated deletion size. In this way, we summarized BLAT results of orphan reads to one best split-reads group for each detected large deletion Inferring molecular mechanisms for large deletion formation To infer mechanisms of large deletion formation, we classified 5,496 large deletions we detected into 4 particular mechanisms, such as VNTR, NAHR, TEI and NHEJ, using DNA sequences of large deletions and breakpoint junction through an algorithm slightly modified from BreakSeq 4. If > 80% of a large deletion is covered by simple repeat sequences predicted by Tandem Repeat Finder 5, it is 9

10 categorized as "VNTR". Then, we compared flanking sequences of both ends of a deletion using BLAST. If both the sequences share exact homology accross the breakpoints, we classified it as NAHR. Large deletions not yet categorized as "VNTR" or "NAHR", are annotated by RepeatMasker, and if a deletion is completely covered by any of transposable elements, e.g. Alu or LINE, the deletion is classified as TEI. Finally, remaining large deletion lacking the former patterns are classified as NHEJ 1.8. Microhomology-sequence motifs for NHEJ deletions We assessed the microhomology-sequence motifs 10 bp within 400 bp upstream and downstream from the breakpoints of NHEJ large deletions (or estimated boundaries if breakpoints were not detected). In our large deletion set, there are 3,664 large deletions by NHEJ mechanisms. Large deletions with identical positions were compressed, and finally we obtained a set of 1,022 nonredundant NHEJ deletions, and 2,044 flanking DNA sequences of 400-bp long. We used MEME (Motif-based sequence analysis tools, 6 for detection of the microhomology-sequence. We have shown the most significant 3 motifs (Supplementary Table 13) De novo assembly of short reads not aligned onto human genome reference In order to find novel contigs that might emerge in Korean individuals, we conducted a de novo sequence assembly using read data not aligned to the reference human genome, gathered from eight Korean sequencings. Before merging vertices into contigs, we discarded all reads that contained any ambiguities ( N s) and those with the lowest base quality scores ( B s) and then we obtained about 15G reads of sequence data. De novo sequence assembly of the filtered read data was carried out using ABySS 7 version short-read assembler and the MPI (Message Passing Interface) protocol. To assess assembly performance of overlapping sub-string values (k-mer), we compared assemblies of million paired-end reads for k-values ranging from 25 to 34 bp, and found the optimal size (32 bp) of k-mers with parameters of four coverage depths and two erode bases. We then aligned the assembled contigs greater than 1000 bp in length with the reference human genome (hg19) and the Huref genome sequence using NCBI BLAST version (parameters, e = 1 x 10-20, a = 23, 10

11 F = false, -X = 1000), and chose those contigs that mapped onto these genomes by less than 99% using our own scripts. In addition, we aligned these contigs to the common chimpanzee wholegenome shotgun draft assembly (Pan troglodytes 2), ultimately retaining contigs with a mapping ratio greater than 99%. To analyze the context of the remaining contigs, we aligned DNA and RNA sequencing data from eight Koreans to those contigs. Moreover, we aligned the sequence read data of YH 8, NA , NA , NA , NA , ABT 11, KB1 11, and Eskimo 12 onto each contig using the GSNAP 13 alignment tool, selecting the options exact matches and unique alignment, and then compared the number of aligned contigs in each genome. 11

12 2. Transcriptome Analysis 2.1. Sequence alignment for transcriptome Using the GSNAP alignment tool, we aligned short reads from transcriptome sequencing to a set of constructed mrna sequences instead of the reference human genome to avoid mapping errors resulting from mrna splicing. As introduced in the main manuscript (Figure 5a), if short reads are mapped to genomic sequences, short reads containing splice junctions usually cannot be aligned in situ. The reads appear to be non-mapped, or mapped to ectopic sites, such as pseudogenes, which have sequences that are highly homologous to the in situ sequences but lack introns. As a result, (1) read depths for loci near splice junctions usually decrease; (2) pseudogenes appear to be transcribed, especially near the sequences of splice junctions in real genes; and (3) false-positive variants are detected due to ectopic alignments (Supplementary Figure 6 and Supplementary Table 15). We generated the mrna sequences set using information about exons from RefSeq, UCSC, and Ensembl gene databases. All information was downloaded from the UCSC genome browser. Exons for a total of 161,250 genes were available (33,907 from RefSeq, 65,271 from UCSC and 62,072 from Ensembl). The mrna sequences were generated from human reference genome NCBI Build 36.3 based on their exonic positions. After mapping the short reads from transcriptome sequencing onto the set of 161,250 mrna sequences, the mapping information for each base (i.e., read depth, type and number of mismatches) was transformed into genomic location from mrna-scale. Results of transcriptome sequencing, such as expression level and variants information, were obtained from this mapping information Expression mapping We examined the expression level of human genes (RefSeq gene), normalized using reads per kilobase of exon per million mapped reads (RPKM) values 14, calculated by applying the following equation: RPKM = 10 9 C NL 12

13 where C is the number of reads mapped to a total gene, N is the total number of mapped reads in the experiment, and L is the length in base pairs of a gene. Using threshold > 1 RPKM 14, we identified 11,101 genes in active transcription in lymphoblastoid cell lines Unknown transcripts Unknown transcripts were detected by filtering short reads from transcriptomes aligned outside of currently known genomic regions. First, reads that failed to align to the 161,250 mrna sequences from the transcriptome pipeline were re-aligned with human reference genome NCBI build Then, short reads overlapping with any known gene regions were removed based on information obtained from four different databases: (1) known genes from RefSeq gene (downloaded on 12 Sep. 2010), (2) UCSC (downloaded on 10 May 2009), (3) Ensembl (downloaded 9 Aug. 2009), (4) and known mrna from GenBank (Downloaded 12 Sep. 2010). All database information was downloaded from repositories at the UCSC genome browser ( To be conservative in identifying unknown transcripts, we also removed short reads that overlapped with any human expressed sequence tags (ESTs) (downloaded from the UCSC genome browser). Short reads that overlapped known genic regions by 1 bp were filtered out. As a result, we obtained short reads that did not overlap any known genic regions. These reads were collapsed to construct unknown transcript regions for each individual. Unknown transcript regions with average read-depths < 4 were considered insignificant and were removed. Finally, our interpretations were made more conservative by removing unknown transcript regions found in only single individuals Genes escape X-inactivation To find X-chromosome genes that are expressed at higher levels in females than in males, we compared gene expression levels between genders. First, we removed 585 genes with < 1 RPKM in gene expression from among the 948 genes on the X-chromosome. Then, we removed 186 genes for which the average expression in females was lower than that in males. Using the expression level of the remaining 177 genes, we performed a Wilcoxon rank-sum test to analyze differences between the two groups. This non-parametric method was used because the sample size was not very large for either group (9 males, 6 females). From these tests, we determined that the minimum significance 13

14 level was (e.g. XIST) when the expression level in all six females was greater than that in males. We selected 23 genes as candidate escape from X-inactivation genes at a significance level of To estimate the false discovery rates (FDR) for establishing a cut-off value, we calculated q-values using the p-values of 162 genes tested using QVALUE software 15. The FDR values for the most significant (XIST, p-value = ) and minimally suggestive (ALG13, p-value = ) genes were and 0.151, respectively. 14

15 3. Comprehensive analysis of genome and transcriptome 3.1. Transcriptional base modification (TBM) We compared the sequences of each genome and transcriptome sets of 15 individuals (seven sets of whole-genomes and transcriptomes, and eight sets of whole-exomes and transcriptomes), sequenced using an Illumina Genome Analyzer. To identify modifications of RNA from genomic sequences, we identified loci where variations were present in the transcriptome but not in the genome sequence. SNPs in the transcriptome are defined as loci containing more than three identical mismatches in high-quality reads (Q score > 15) with a mismatch allele frequency 20%. Genomic regions with no variants were defined as loci where (1) at least five high quality reads existed; (2) one or fewer mismatches were found; and (3) the mismatch allele frequency was < 10%, which allows one mismatch for more than 10 reads and supports the existence of a wild-type allele. For sequencing of the eight exomes, we included an additional criterion since many UTR regions were not covered by hybridization capture system. We considered all the known SNPs exist in dbsnp130 are potential genomic variants in the individual sequenced exomes. Conversely, loci not reported as variants in the dbsnp130 were considered to be wild-type. Because most of the errors in transcriptome sequencing would be detected as false positives, we further filtered the results using conservative filter criteria. First, we filtered out singletons; that is, those found in only one of the 15 individuals. Candidate loci found in only exome-sequenced individuals were also removed. Then, we aligned 61-bp long sequences, comprising 30 bp upstream, the variant allele, and 30 bp downstream of each candidate site, with the Human Reference Genome Build 36.3 using BLAT. If the sequence matched perfectly to any region, it was removed, since the modified RNA sequence might be transcribed from the perfectly matched region rather than from the candidate TBM site Allele-specific expression (ASE) We explored the expression level of each allele on the heterozygous nssnps. Among 28,042 nssnps found in 15 individuals, we identified 4,867 loci where two or more individuals had heterozygous SNPs and the corresponding gene is active in transcription. We calculated the read counts for wild- 15

16 type and variant alleles in both genome and transcriptome sequencing results for each individual containing the heterozygous SNPs. If the read depth for either genome or transcriptome was < 5, the individual was regarded as non-informative for the given variant, and the corresponding locus was removed. Using read-counts for each category, we constructed a 2 x 2 table for each individual in each heterozygotic locus, as shown below. Wild type Variant Total Genome A B A+B Transcriptome C D C+D Total A+C B+D A+B+C+D We carried out a Fisher s exact test for testing unbalanced expression. If the resulting p-value was <0.10, the individual was considered suggestive for allele specific expression at the locus. Loci with fewer than two suggestive individuals were considered to be non-informative, and were removed. Preferential expression (PE) was quantified using the following formula: n PE = 1 ( VariantFrequency n 1 individual,transcriptome VariantFrequency individual,genome ), where n is the number of suggestive individuals for the locus. If the PE for a locus was between -0.2 and 0.2, the magnitude of allele-specific expression was regarded as insufficient and the locus was removed. Finally, we constructed a new 2 x 2 table using read counts from all suggestive individuals for all suggestive regions, and then repeated Fisher s exact tests to obtain a significance level. Applying the Bonferroni s correction method for multiple testing (n = 4,867), we removed loci with p-values > x Finally, we determined 580 loci that satisfied the criteria for allele-specific expression.. 16

17 Reference 1 Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59, (2008). 2 Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254, (2007). 3 Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res 12, , (2002). 4 Lam, H. Y. et al. Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol 28, 47-55, (2010). 5 Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, , (1999). 6 Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36, (1994). 7 Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res 19, , (2009). 8 Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65, (2008). 9 Ju, Y. S. et al. Reference-unbiased copy number variant analysis using CGH microarrays. Nucleic Acids Res, (2010). 10 Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, , (2010). 11 Schuster, S. C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 463, , (2010). 12 Rasmussen, M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463, , (2010). 13 Wu, T. D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, , (2010). 14 Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, , (2008). 15 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100, , (2003). 17

18 Supplementary Figures Supplementary Figure 1. Experimental overview of the study. To identify functional genomic variants, we performed whole-genome sequencing, targeted-exome capture sequencing, and transcriptome sequencing. 18

19 Supplementary Figure 2. Sensitivity of SNP detection in population-level 19

20 Supplementary Figure 3. Description of PRIM2 gene - Example of super nssnp gene 20

21 Supplementary Figure 4. Super nssnp genes compared to Segmental Duplication (SD), Increase of Read Depth (RD) and Database of Genome Variants (DGV) 21

22 Supplementary Figure 5. Overview of the strategy for detecting nucleotide-resolution large deletion breakpoints 22

23 Supplementary Figure 6. Aligning RNA short reads on cdna sequence provide better results (higher expression level) than aligning on reference genome sequence. Gene expression level (genes in chromosome 1) assessed by cdna alignment and genome alignment are compared. Genome alignment loses many short-reads, especially short-reads from splice junctions. 23

24 Supplementary Figure 7. Validation result of 4 unknown transcripts using PCR and gel electrophoresis 24

25 Supplementary Figure 8. Distance between each of unknown transcripts and its nearest gene 25

26 Supplementary Figure 9. Validation results of 15 TBM sites by Sanger Sequencing of DNA (above) and RNA (below) 26

27 27

28 Supplementary Tables Supplementary Table 1. Whole Genome Sequencing Statistics (10 individuals) Total short read data Aligned short read data ID Gender Total Bases Aligned Bases Read length Read Bases Read Bases 1 x ,486,218 18,701,503, ,705,892 13,129,412,112 AK1 Male 107,635,576,460 79,278,726,228 2 x 36 1,646,543,336 59,275,560,096 1,360,491,421 48,977,691,156 2 x ,322,768 10,852,403,584 99,363,440 8,743,982,720 2 x ,416,122 18,806,108,932 79,506,040 8,427,640,240 AK2* Female 328,846,011,200 78,911,480,450** 2 x 25 6,371,995, ,299,894,500 1,451,098,162 36,277,454,050 2 x 50 3,390,922, ,546,116, ,680,528 42,634,026,400 AK3 Male 89,154,943,968 73,664,437,785 2 x 76 1,173,091,368 89,154,943, ,340,034 73,664,437,785 AK4 AK5 Female Male 77,201,068,724 89,182,771,792 66,125,950,244 74,703,930,798 2 x 76 2 x ,312, ,272,572 33,767,754,712 10,701,812, ,774, ,105,201 30,684,050,534 9,723,384,303 2 x x ,032,812 1,032,644,200 43,433,314,012 78,480,959, ,934, ,072,031 35,441,899,710 64,980,546,495 2 x 36 55,752,362 2,007,085,032 52,073,655 1,874,563,411 AK6 Female 73,502,467,582 63,796,933,848 2 x ,079,624 41,046,051, ,485,356 36,969,646,182 2 x ,478,526 30,449,331, ,079,721 24,952,724,255 AK7 Male 103,902,771,632 87,557,181,002 2 x 76 1,367,141, ,902,771,632 1,152,156,028 87,557,181,002 AK9 Male 92,883,089,498 77,220,870,831 2 x ,119,798 92,883,089, ,452,344 77,220,870,831 AK14 Female 84,084,604,278 74,670,029,486 2 x ,229,228 21,829,421, ,486,993 19,947,194,343 2 x ,387,950 62,255,182, ,867,353 54,722,835,143 2 x ,810,256 11,549,169, ,486,715 10,889,034,343 AK20 Female 74,985,049,228 64,940,569,874 2 x ,223,192 16,432,962, ,812,391 14,804,360,307 2 x ,375,420 47,002,917, ,626,600 39,247,175,224 * Statistics of AK2 performed by SOLiD platform is based on current SRA data (SRX018824, SRX018829, SRX018830, SRX018821) ** AK2 statics excludes redundant pair for calculation 28

29 Supplementary Table 2. WGS SNP Accuracy from validation using Illumina 610K genotyping array individual Whole Genome Sequencing Illumina 610k Genotype sensitivity PPV Wildtype Hetero SNP Homo SNP accuracy total homo hetero AK1 Wildtype Hetero SNP Homo SNP AK2 Wildtype Hetero SNP Homo SNP AK4 Wildtype Hetero SNP Homo SNP AK5 Wildtype Hetero SNP Homo SNP AK6 Wildtype Hetero SNP Homo SNP AK7 Wildtype Hetero SNP Homo SNP AK14 Wildtype Hetero SNP Homo SNP AK20 Wildtype Hetero SNP Homo SNP Definition Wildtype a b c PPV (e+f+h+i)/(d+e+f+g+h+i) Hetero SNP d e f Genotype accuracy (e+i)/(e+f+h+i) Homo SNP g h i Sensitivity_total Sensitivity_homo Sensitivity_hetero (e+f+h+i)/(b+c+e+f+h+i) (f+i)/(c+f+i) (e+h)/(b+e+h) 29

30 Supplementary Table 3. Primer List used for validation of SNPs, Indels, Novel Transcripts and RNA ModificationGS SNP Accuracy from validation using Illumina 610K genotyping array SuppTable3_Validation_Primer_List.xls Depicted below is a preview of the full version. individual position F primer R primer SNP01 AK3 chr1: ACGTAGATCCTGATTTCGTGGT TGACCATAATGCTTGCTGTTTC SNP02 AK3 chr10: TTGTACATGTTTTAGAGAAAGCAAA TTGAAGGTGCTCCAATTCTACA SNP03 AK3 chr15: AGGAATCTCGGTTGGATATGAA TCAGGACTAACCTGCAAGATCA SNP04 AK3 chr21: ATCCAGTCAAGTCAACGGTTCT CAACTTTAAGGTGGGAAAGGTG SNP05 AK3 chr3: CAGATTCTGGCAATGAAATGTCT TTTTGGAACCAAGATAGCAGGT SNP06 AK3 chr5: ACAGTGTTGGGAGAATGGAGTT GCAGGACCTTGTAAAGAAATGC SNP07 AK5 chr1: GGAGTGAACAAAAAGTCGAACC CGCTGAGGAACAACTGGTATAA SNP08 AK5 chr19: TGAAATGAGAAACCTCGTGATG CAGTAGCTTTTGCAGTTTGCAC SNP09 AK5 chr3: TTTCCTGATGTGCGTTTATGTC TCCAGTCCTCCTCTTTCTTCTG SNP10 AK5 chr5: CTCTCGTTAGACGGGAAAGCTA ATTTCCTTACCCAGGGATGACT SNP11 AK7 chr11: AGTGAGGGTGCAGAGAGAAAAG TAGGTGAGTTGTGTTCCCACAG SNP12 AK7 chr15: CCCTCAGGAAGGACTGTTCATA AGGGAGATGATCGACTTGATGT SNP13 AK7 chr22: CATCTTCATTGTCTCCCATCCT CTTACCTGCACCAGGAGGTTCT SNP14 AK9 chr1: GCCTTCTGCAGATCACTTTTGT AGGACTGGGAAAACATCTGAAA SNP15 AK9 chr7: CTGTCTGGAAAGTGTGATCAGG TTGTAAATGAAAGCTCGCACAT SNP16 AK9 chr10: CCAGTTGGACACCAATCTACAA TCTGTGTCAGACATCACCACTG SNP17 AK9 chr22: AAGGTTACCCTGACCATCTTTG CCCATCACAGACACTTAAGCAG SNP18 AK20 chr12: TGCTATTCCTCTTCGACCTCAT TCTGAGCATCACATTCTCTGCT SNP19 AK20 chr17: TGCACAAGGATAGGAACCAGTA GTTACAGACAGGACTCCCTTGG SNP20 AK20 chr6: CCCTGATCTTCAAGTTGGATTC GGATGTTTTTCTTCCTCCTCCT SNP21 AK20 chr7: GAGAAAGAATTAAAGCCGAGCA CCTTAGCCTTCTCTTCCTCCTC SNP22 AK4 chr11: TGAGCTTCCTGCATGACTACAT TTGTGGGAATTACACCTCCTCT SNP23 AK4 chr17: TGGTGACGATAGAAATTCATGC TCCTTGCATTTACAGAACATGG SNP24 AK4 chr3: TGCTTTCCACTTGATATTGTGC TTCATGTGCAACCCAGATACAT SNP25 AK4 chr6: TGGACTTCTTTTATGTGGCAGA GGGCTGTAAAACAAGTGTCTCC SNP26 AK6 chr14: TGCTTTGTTGGGTATTGTTTTT ACATCAAGCCATCTATCCACAA SNP27 AK6 chr18: GTGGCAAAACTTTCAAAAGGAG GCACAAAGCAAGCTAGACTCAA SNP28 AK6 chr2: AGAGAAGGGAATTCTGGTAGCC ATGGAGGACAAGGAGTGAATGT SNP29 AK6 chr6: TGATTATCCTCTACGGCACAAA GCATTATACCTTTGTGTTTCTGCTT SNP30 AK6 chrx: GCTATGTGGACTTGTCCTTTCC TCTAAGCTCCTTCCAAACAAGC SNP31 AK14 chr1: GTATGGTTTGATGACCCAGGTT AGAGAAGCCATTTGGATGTGAT SNP32 AK14 chr5: AATCTTGGTGATCCAGGCTCTA CTTGATGCATTTGGACCATCTA SNP33 AK14 chr5: CTGGTCCAAGCAGAGTTCTAGG TCTGACCTGTGGTTGAAAAATG Indel01 AK3 chr7: AAAAGGGTGATAGGCAATTGAA TGGCCTAATATGTAACTTCTCTTTG Indel02 AK5 chr19: ACTTGCCTGTGTCCCCAAAG CTCCACCTCTTCACCCCAAT Indel03 AK5 chr20:74155 AAGTGTCTAAACGACGTTGGAA TCGAAGCAGTAGTCATCATCAAA Indel04 AK5 chr3: CAGGGCGTGAGAAAAAGTAAAA TACCTCCAGAGAGTCATCAGCA Indel05 AK7 chr1: TGTTGTCATTGTCCTCGTCTTC TCCCTGGAATTGAGTGAGAAAT Indel06 AK7 chr11: CAGAGTCCCCACAATATTCACA TCCAGGACCGTCTACCTAAAAA Indel07 AK7 chr21: AATCAGGCTACACCAGCTCCT AGACGGACTTAGAGCAGACAGG Indel08 AK7 chr5: TTCCGACTACTCCAGGTATGCT TCATCGGTTACATTCAGACCAC Indel09 AK7 chr6: AAGCCAATTTTGAGTCTTGGAG GATCCCATTGGAGCTCATTATC Indel10 AK9 chr1: GAGGGATCTAAGCACGTTTACAA AATGCTGAAGGTAACAGGAGAAA 30

31 Supplementary Table 4. Indel list of 10 individuals extracted by whole genome sequencing SuppTable4_Indel_Table_1_20.txt Depicted below is a preview of the full version. Chro type position position_altesize allele AK1 AK3 AK5 AK7 AK9 AK2 AK4 AK6 AK14 AK20 total sampmax_t chr1 del chr1 del chr1 ins A chr1 del chr1 del chr1 del chr1 del chr1 del chr1 ins A chr1 del chr1 del chr1 del chr1 del chr1 del chr1 del chr1 del chr1 del chr1 ins CACAC chr1 del chr1 del chr1 del chr1 del chr1 ins T chr1 del chr1 ins CT chr1 ins TG chr1 del chr1 del chr1 del chr1 ins GT chr1 del chr1 del chr1 del chr1 del chr1 del chr1 ins GTGTG chr1 ins GCT chr1 ins GC chr1 del chr1 del chr1 del chr1 ins GGAAT chr1 del chr1 del chr1 del chr1 del chr1 ins T chr1 del chr1 ins AA

32 Supplementary Table 5. Exome Sequencing Statistics Individuals Gender Read length reads bases Aligned reads Aligned bases aligned coverage AK_N1 Male 2 x 78 65,740,784 5,127,781,152 62,465,112 4,872,092, AK_N2 Male 2 x 78 67,124,814 5,235,735,492 63,544,668 4,956,295, AK_N5 Male 2 x 78 68,444,468 5,338,668,504 64,553,546 5,034,976, AK_N6 Male 2 x 78 66,995,750 5,225,668,500 62,752,082 4,894,465, AK_N7 Male 2 x 78 66,884,224 5,216,969,472 62,973,127 4,911,704, AK_N9 Female 2 x 78 70,142,332 5,471,101,896 63,828,875 4,978,499, AK_N14 Female 2 x 78 71,856,586 5,604,813,708 64,188,537 5,006,552, AK_N15 Male 2 x 78 71,876,294 5,606,350,932 64,543,996 5,034,263,

33 Supplementary Table 6. Non-synonymous SNP list detected from 18 individuals (10 whole genome sequencing, 8 whole exome sequencing) SuppTable6_nsSNP_from_18individuals.xls Depicted below is a preview of the full version. chr pos ref allele AK1 AK3 AK5 AK7 AK9 AK2 AK4 AK6 AK14AK20AK_NAK_NAK_NAK_NAK_NAK_NAK_NAK_Nannotations ref_aa snp_aa blosum nssnp chr A G CDS:OR4F5 T A 0 nssnp chr C T CDS:SAMD11 H Y 2 nssnp chr T C CDS:SAMD11 W R -3 nssnp chr T C CDS:NOC2L I V 3 nssnp chr G A CDS:NOC2L A V 0 nssnp chr C T CDS:PLEKHN1 A V 0 nssnp chr G C CDS:PLEKHN1 A P -1 nssnp chr G C CDS:PLEKHN1 R P -2 nssnp chr T C CDS:PLEKHN1 S P -1 nssnp chr C A CDS:HES4::IntroR S -1 nssnp chr G A CDS:ISG15 S N 1 nssnp chr C T CDS:AGRN T I -1 nssnp chr A G CDS:AGRN Q R 1 nssnp chr G C CDS:AGRN S T 1 nssnp chr G A CDS:TTLL10 S N 1 nssnp chr G A CDS:TTLL10 G D -1 nssnp chr G C CDS:SCNN1D R P -2 nssnp chr G C CDS:SCNN1D E Q 2 nssnp chr G A CDS:GLTPD1 R K 2 nssnp 33

34 Supplementary Table 7. Trait-O-Matic results on nssnp of 18 individuals SuppTable7_TraitOMatic_18individuals.xls Depicted below is a preview of the full version. Coordinates Gene, amino acid change chr1: H6PD, R453Q chr1: RHCE, P226A chr1: FAAH, P129T chr1: LEPR, K109R chr1: LEPR, Q223R chr1: SELE, H468Y chr1: CFH, Y402H chr1: CFH, Y402H chr1: PIGR, A580V chr1: EPHX1, Y113H chr1: EPHX1, Y113H chr1: EPHX1, Y113H chr1: EPHX1, Y113H chr2: EDAR, V370A chr2: SP110, L425S chr3: CCR2, V64I chr4: ADD1, G460W chr4: ADH1B, R48H chr4: ADH1B, R48H chr4: ADH1C, I350V chr4: ADH1C, R272Q chr4: BANK1, R61H chr5: MTRR, I22M chr5: MTRR, I22M Genotype Trait-associated allele G/A A C/G C C/A A G/G G G/G G G/A A C/T C C/T C A/A A T/C C T/C C T/C C T/C C G/G G G/G G G/A A T/T T T/C T T/C T T/C C C/T T G/A A A/G G A/G G Associated trait CORTISONE REDUCTASE DEFICIENCY RH E/e POLYMORPHISM DRUG ADDICTION, SUSCEPTIBILITY TO LEPTIN RECEPTOR POLYMORPHISM LEPTIN RECEPTOR POLYMORPHISM IgA NEPHROPATHY, SUSCEPTIBILITY TO BASAL LAMINAR DRUSEN, INCLUDED MACULAR DEGENERATION, AGE-RELATED, 4, SUSCEPTIBILITY TO IgA NEPHROPATHY, SUSCEPTIBILITY TO EMPHYSEMA, SUSCEPTIBILITY TO, INCLUDED LYMPHOPROLIFERATIVE DISORDERS, SUSCEPTIBILITY TO PREECLAMPSIA, SUSCEPTIBILITY TO, INCLUDED PULMONARY DISEASE, CHRONIC OBSTRUCTIVE, SUSCEPTIBILITY TO, INCLUDED HAIR MORPHOLOGY 1, HAIR THICKNESS MYCOBACTERIUM TUBERCULOSIS, SUSCEPTIBILITY TO HUMAN IMMUNODEFICIENCY VIRUS TYPE 1, RESISTANCE TO HYPERTENSION, SALT-SENSITIVE ESSENTIAL, SUSCEPTIBILITY TO AERODIGESTIVE TRACT CANCER, SQUAMOUS CELL, ALCOHOL-RELATED, PROTECTION AGAINST; INCLUDED ALCOHOL DEPENDENCE, PROTECTION AGAINST ALCOHOL DEPENDENCE, PROTECTION AGAINST ALCOHOL DEPENDENCE, PROTECTION AGAINST SYSTEMIC LUPUS ERYTHMATOSUS, ASSOCIATION WITH DOWN SYNDROME, SUSCEPTIBILITY TO, INCLUDED NEURAL TUBE DEFECTS, FOLATE-SENSITIVE, SUSCEPTIBILITY TO 34

35 Supplementary Table 8. Super nssnp gene list SuppTable8_Super_nsSNP_Gene_List.xls Depicted below is a preview of the full version. Gene Code Chr Start of first exon (bp) Stop of last exon (bp) Number of exons Total exon length (bp) Average AK1 AK3 AK5 AK7 # nssnp density (/Kb) # nssnp density (/Kb) # nssnp density (/Kb) # nssnp density (/Kb) # nssnp density (/Kb) ZNF717 NM_ OR4C3 NM_ CDC27 NM_ FRG2C NM_ OR4C45 NM_ OR9G9 NM_ FAM104B NM_ X PRIM2 NM_ HLA-DRB1 NM_ HLA-DPA1 NM_ CTBP2 NM_ SEC22B NM_ HLA-DQB1 NM_ OR13C5 NM_ KCNJ12 NM_ MUC4 NM_ TAS2R31 NM_ HLA-DQA1 NM_ OR51Q1 NM_ HLA-A NM_

36 Supplementary Table 9. List of Korean common novel nssnp LD SuppTable9_KoreanCommonNovel_nsSNP_LD.xls Depicted below is a preview of the full version. chr pos ref_allele var_allele frequency annotation ref_aa snp_aa blosum rssnp_maxld r2 chr A T 2//20 CDS:CLSTN1 F Y 3 rs chr G A 2//20 CDS:C1orf187 G R -2 rs chr A G 2//20 CDS:PLOD1 Y C -2 rs chr G A 2//20 CDS:VPS13D D N 1 rs chr C G 3//20 CDS:PRAMEF12 N K 0 rs chr T A 8//20 CDS:PRAMEF1 L stop -4 rs chr T C 2//20 CDS:PRAMEF1 C R -3 rs chr C G 4//20 CDS:PRAMEF1 R G -2 rs chr C T 2//20 CDS:PRAMEF1 A V 0 rs chr T G 2//20 CDS:PRAMEF1 I M 1 rs chr T G 2//20 CDS:PRAMEF11 Q H 0 rs chr C T 4//20 CDS:PRAMEF11 D N 1 rs chr C T 5//20 CDS:PRAMEF11 A T 0 rs chr T C 7//20 CDS:LOC649330,HNR T A 0 rs chr G C 7//20 CDS:LOC649330,HNR S R -1 rs chr T C 7//20 CDS:LOC649330,HNR E G -2 rs chr C G 2//20 CDS:LOC649330,HNR D H -1 rs chr T C 2//20 CDS:LOC649330,HNR K R 2 rs chr T C 2//20 CDS:LOC649330,HNR I V 3 rs chr A C 5//20 CDS:LOC649330,HNR F L 0 rs chr C T 6//20 CDS:LOC649330,HNR G D -1 rs chr C T 6//20 CDS:LOC649330,HNR G S 0 rs chr G A 2//20 CDS:PRAMEF2 A T 0 rs chr A C 2//20 CDS:PRAMEF2 S R -1 rs chr C A 2//20 CDS:PRAMEF2 S R -1 rs chr T G 2//20 CDS:PRAMEF2 I M 1 rs chr A C 2//20 CDS:PRAMEF4 F C -2 rs chr T C 2//20 CDS:LOC E G -2 rs chr G A 5//20 CDS:LOC P L -3 rs chr A C 6//20 CDS:LOC F L 0 rs chr C T 5//20 CDS:LOC G D -1 rs chr C T 5//20 CDS:LOC G S 0 rs chr A G 2//20 CDS:NBPF1 S P -1 rs chr T C 4//20 CDS:NBPF1 S G 0 rs chr G A 2//20 CDS:PADI1 R H 0 rs chr C T 2//20 CDS:C1orf223 R W -3 rs chr C T 2//20 CDS:CYP4A22 R C -3 rs chr T A 2//20 CDS:NRD1 E V -2 rs chr T G 2//20 CDS:ROR1 S A 1 rs chr C T 2//20 CDS:RBMXL1::Intron:C A T 0 rs chr C G 3//20 CDS:RBMXL1::Intron:C G A 0 rs chr A G 2//20 CDS:GBP1 L P -3 rs chr G A 2//20 CDS:ADORA3 R C -3 rs

37 Supplementary Table 10. Total 5,496 deletion list of 8 individuals including breakpoints information SuppTable10_Large_Deletion_List.xls Depicted below is a preview of the full version. index individual chr size start stop breakpoint start breakpoint stop Gene Annotation inferred #split_reads whole-gene CDS UTR intron promoter(<1kb) mechanism LargeDeletion_1 AK3 chr N/S N/S N/S VNTR LargeDeletion_1 AK5 chr N/S N/S N/S VNTR LargeDeletion_1 AK7 chr N/S N/S N/S VNTR LargeDeletion_1 AK9 chr N/S N/S N/S VNTR LargeDeletion_1 AK4 chr N/S N/S N/S VNTR LargeDeletion_1 AK6 chr N/S N/S N/S VNTR LargeDeletion_1 AK14 chr N/S N/S N/S VNTR LargeDeletion_1 AK20 chr N/S N/S N/S VNTR LargeDeletion_2 AK3 chr N/S N/S N/S SAMD11 - VNTR LargeDeletion_2 AK5 chr SAMD11 - VNTR LargeDeletion_2 AK7 chr SAMD11 - VNTR LargeDeletion_2 AK9 chr N/S N/S N/S SAMD11 - VNTR LargeDeletion_2 AK4 chr N/S N/S N/S SAMD11 - VNTR LargeDeletion_2 AK6 chr SAMD11 - VNTR LargeDeletion_2 AK14 chr SAMD11 - VNTR LargeDeletion_2 AK20 chr N/S N/S N/S SAMD11 - VNTR LargeDeletion_3 AK3 chr N/S N/S N/S AGRN - VNTR LargeDeletion_3 AK5 chr N/S N/S N/S AGRN - VNTR LargeDeletion_3 AK7 chr AGRN - VNTR LargeDeletion_3 AK9 chr AGRN - VNTR LargeDeletion_3 AK4 chr N/S N/S N/S AGRN - VNTR LargeDeletion_3 AK6 chr AGRN - VNTR LargeDeletion_3 AK14 chr N/S N/S N/S AGRN - VNTR LargeDeletion_3 AK20 chr AGRN - VNTR LargeDeletion_4 AK3 chr VNTR LargeDeletion_4 AK5 chr VNTR LargeDeletion_4 AK7 chr N/S N/S N/S VNTR 37

38 Supplementary Table 11. Validation of large deletions using 24M CGH array data Individual Total Subject regions Validated regions Validated regions Accuracy Accuracy deletions ( 5 array probes) (p-value*<0.05) (p-value*<0.01) (p-value<0.05) (p-value<0.01) AK % 77.27% AK % 81.87% AK % 77.04% AK % 77.04% Total 2,750 1,297 1,087 1, % 78.33% * Wilcoxon Rank Sum Test using R statistics 38

39 Supplementary Table 12. Breakpoint list of NA10851 deletions (CNV Loss) SuppTable12_NA10851_Deletion(CNV)_Breakpoints.xls Depicted below is a preview of the full version. cnv_id chr cnv_typecnv_size cnv_start cnv_end gap_size gap_start gap_end NA10851_BP_1 chr1 LOSS NA10851_BP_2 chr1 LOSS NA10851_BP_3 chr1 LOSS NA10851_BP_4 chr1 LOSS NA10851_BP_5 chr1 LOSS NA10851_BP_6 chr1 LOSS NA10851_BP_7 chr1 LOSS NA10851_BP_8 chr1 LOSS NA10851_BP_9 chr1 LOSS NA10851_BP_10 chr1 LOSS NA10851_BP_11 chr1 LOSS NA10851_BP_12 chr1 LOSS NA10851_BP_13 chr1 LOSS NA10851_BP_14 chr1 LOSS NA10851_BP_15 chr1 LOSS NA10851_BP_16 chr1 LOSS NA10851_BP_17 chr1 LOSS NA10851_BP_18 chr1 LOSS NA10851_BP_19 chr1 LOSS NA10851_BP_20 chr1 LOSS NA10851_BP_21 chr1 LOSS NA10851_BP_22 chr1 LOSS NA10851_BP_23 chr1 LOSS NA10851_BP_24 chr1 LOSS NA10851_BP_25 chr1 LOSS NA10851_BP_26 chr1 LOSS NA10851_BP_27 chr1 LOSS NA10851_BP_28 chr1 LOSS NA10851_BP_29 chr1 LOSS NA10851_BP_30 chr1 LOSS NA10851_BP_31 chr1 LOSS NA10851_BP_32 chr1 LOSS NA10851_BP_33 chr1 LOSS NA10851_BP_34 chr1 LOSS NA10851_BP_35 chr1 LOSS

40 Supplementary Table 13. Motif on flanking regions of NHEJ large deletions 40

41 Supplementary Table 14. RNA Sequencing Statistics <8AKs Transcriptome Seq. Run1> Read length reads bases Aligned reads Aligned bases Coverage AK3 2 x ,601,132 5,211,714,332 28,345,352 2,862,410, AK4 2 x ,143,140 5,165,457,140 27,636,273 2,790,800, AK5 2 x ,630,854 5,315,716,254 28,317,447 2,859,606, AK6 2 x ,324,294 5,183,753,761 26,733,252 2,699,610, AK7 2 x ,032,760 5,154,308,760 27,011,183 2,727,650, AK14 2 x ,862,496 5,339,112,096 26,717,457 2,698,007, AK20 2 x ,586,996 5,210,286,596 27,373,384 2,764,293, <8AK_Ns Transcriptome Seq. Run1> Read length reads bases Aligned reads Aligned bases Coverage AK_N1 2 x 78 77,298,628 6,029,292,984 59,347,587 4,606,250, AK_N2 2 x 78 82,591,894 6,442,167,732 63,596,426 4,940,288, AK_N5 2 x 78 90,704,150 7,074,923,700 64,991,714 5,059,103, AK_N6 2 x 78 86,655,342 6,759,116,676 61,578,609 4,797,048, AK_N7 2 x 78 87,381,024 6,815,719,872 61,237,205 4,775,180, AK_N9 2 x 78 81,470,204 6,354,675,912 61,697,228 4,790,953, AK_N14 2 x 78 88,447,086 6,898,872,708 67,369,082 5,228,014, AK_N15 2 x 78 78,300,992 6,107,477,376 61,677,972 4,787,852,

42 Supplementary Table 15. Alignments of RNA short-reads on cdna sequence decreases misalignment of short-reads on pseudogenes. Chr. Gene RPKM_GenomeBased RPKM_cDNABased 1 SUMO1P HSP90B3P TOP1P AURKAPS PA2G4P LYPLA2P CLK2P RPL23P RNF5P NACAP PTTG3P PTENP ANXA2P PIPSL CSNK2A1P NME2P ATP5EP UBE2MP CSDAP

43 Supplementary Table 16. Expression map represented in RPKM value for all Refseq genes SuppTable16_Gene_Expression_Map_in_RPKM.xls Depicted below is a preview of the full version. chr gene_id gene start stop size accession ave_rpkm Gene Expression (RPKM) AK3 AK5 AK7 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N15 AK4 AK6 AK14 AK20 AK_N9 AK_N LOC NR_ WASH5P NR_ FAM138F NR_ FAM138A NR_ FAM138C NR_ OR4F NM_ LOC NR_ LOC NR_ LOC NR_ OR4F NM_ OR4F NM_ OR4F NM_ MIR NR_ OR4F NM_ OR4F NM_ OR4F NM_ LOC NR_ NCRNA NR_ LOC NR_ FAM41C NR_ FLJ NR_ SAMD NM_ NOC2L NM_ KLHL NM_ PLEKHN NM_ PLEKHN NM_ C1orf NR_ HES NM_ HES NM_ ISG NM_ AGRN NM_ C1orf NM_

44 Supplementary Table 17. List of Unknown transcripts SuppTable17_Unknown_Transcripts.xls Depicted below is a preview of the full version. index individuals chr size(bp) start stop # of nearest_gene distance(bp) individuals NovelTranscript_1 AK3,AK_N5,AK7,Achr MRPL NovelTranscript_2 AK_N14,AK_N5,A chr SSU NovelTranscript_3 AK6,AK_N5 chr LOC NovelTranscript_4 AK_N1,AK_N7 chr TNFRSF NovelTranscript_5 AK_N14,AK_N9 chr C1orf NovelTranscript_6 AK_N5,AK_N2 chr C1orf NovelTranscript_7 AK_N15,AK_N6 chr C1orf NovelTranscript_8 AK3,AK_N7,AK4,Achr KIAA NovelTranscript_9 AK4,AK14 chr LOC NovelTranscript_10 AK_N2,AK_N5,AKchr TNFRSF NovelTranscript_11 AK_N2,AK6 chr TNFRSF NovelTranscript_12 AK_N2,AK14,AK6 chr TNFRSF NovelTranscript_13 AK_N5,AK3 chr PARK7 521 NovelTranscript_14 AK4,AK6,AK_N5,Achr SLC2A5 613 NovelTranscript_15 AK_N14,AK_N2 chr SLC2A5 16 NovelTranscript_16 AK_N6,AK_N14,A chr NMNAT NovelTranscript_17 AK_N7,AK_N15 chr NMNAT NovelTranscript_18 AK_N14,AK_N1,A chr APITD1 409 NovelTranscript_19 AK_N15,AK_N14 chr APITD NovelTranscript_20 AK_N7,AK_N9,AKchr SRM 1462 NovelTranscript_21 AK_N5,AK_N7 chr EXOSC NovelTranscript_22 AK_N5,AK14 chr EXOSC NovelTranscript_23 AK_N2,AK20,AK7 chr UBIAD NovelTranscript_24 AK_N7,AK_N14,A chr UBIAD NovelTranscript_25 AK_N2,AK_N9,AKchr TNFRSF NovelTranscript_26 AK_N5,AK_N6,AKchr TNFRSF NovelTranscript_27 AK_N1,AK_N9,AKchr TNFRSF8 2 NovelTranscript_28 AK4,AK_N2,AK_Nchr TNFRSF8 2 NovelTranscript_29 AK_N2,AK_N5,AKchr TNFRSF8 490 NovelTranscript_30 AK_N6,AK_N5,AKchr TNFRSF NovelTranscript_31 AK_N5,AK_N2 chr TNFRSF NovelTranscript_32 AK_N1,AK_N2 chr TNFRSF NovelTranscript_33 AK_N2,AK_N6 chr TNFRSF NovelTranscript_34 AK_N6,AK_N5,AKchr PRDM NovelTranscript_35 AK_N14,AK_N9,A chr PRDM

45 Supplementary Table Genes escape X-inactivation chr gene start stop ave_rpkm male_rpkm female_rpkm pvalue qvalue previous report* X PRKX /9 X HDHD1A /9 X PNPLA /9 X MSL /9 X TMSB4X not analyzed X TRAPPC /9 X GEMIN /9 X CA5BP /9 X ZRSR not analyzed X SYAP /9 X CXorf /9 X EIF1AX /9 X EIF2S /9 X GPR /9 X CDK /7 X KDM5C /9 X SMC1A /9 X FOXO /9 X RPS4X /9 X TSIX not analyzed X XIST /9 X NGFRAP /9 X ALG /9 * Carrel, L & Willard, H.F. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature (2005) The value reflects fraction of hybrids expressing genes in inactivated X chromosomes in the previous paper (out of 9 (or less) tested). 45

46 Supplementary Table 19. 1,809 TBM sites SuppTable19_TBM_List.xls Depicted below is a preview of the full version. Chr position wt snp rna_wt rna_snp rna_var% dna_wt dna_rd rna_a rna_c rna_g rna_t is_dbsnp chr T C rs chr T C rs chr T C rs chr C T rs chr C T rs chr C T rs chr T C rs chr T C rs chr T C rs chr T C rs chr T C rs utr_gene utr_strand allele_change individual is_validated_by_2ndrnaseq uc009viv.1,uc009viw.1,uc009vix.1 -,-,- AG AK5 yes uc009viv.1,uc009viw.1,uc009vix.1 -,-,- AG AK7 yes uc009viv.1,uc009viw.1,uc009vix.1 -,-,- AG AK14 yes LOC ,WASH7P,uc009vit.1,uc009viu.1,uc001aae.2,uc001aab.2,uc009viq.1,uc009vir.1,u-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,- GA AK20 yes LOC ,WASH7P,uc009vit.1,uc009viu.1,uc001aae.2,uc001aab.2,uc009viq.1,uc009vir.1,u-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,- GA AK4 yes LOC ,WASH7P,uc009vit.1,uc009viu.1,uc001aae.2,uc001aab.2,uc009viq.1,uc009vir.1,u-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,- GA AK6 yes uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK3 yes uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK5 yes uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK7 yes uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK20 yes uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK6 yes 46

47 Supplementary Table Allele Specific Expression sites SuppTable20_Allele_Specific_Expression_Sites.xls Depicted below is a preview of the full version. chr pos wt snp dbsnp annotation wtaa snpaa blosum is_nssnp Genome Read-Counts AK3 AK5 AK7 AK4 AK6 AK14 AK20 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N9 AK_N14 AK_N G A rs ATAD3B R Q 1 nssnp 1:0 1:5 3:0 5:5 1:1 4:1 0:3 51:18 45:22 32:12 51:15 37:34 49:0 46:0 18: C T rs atad3b P S -1 nssnp 3:3 3:3 3:4 0:4 3:1 6:6 2:6 10:15 5:2 8:7 7:2 9:4 8:4 9:4 11: T C rs cdk11b,cdkh R 0 nssnp 20:40 26:22 34:29 21:28 30:13 33:18 18:22 1:0 0:0 0:0 0:0 0:1 2:1 3:0 0: A G rs CDK11B,CDKC R -3 nssnp 18:38 27:22 29:23 23:24 29:10 37:17 19:23 1:0 0:0 0:0 0:0 0:1 2:1 2:0 0: C T rs UTS2 S N 1 nssnp 31:0 6:13 31:0 21:8 15:0 23:0 18:0 65:0 64:0 0:62 69:0 30:32 42:24 86:0 77: G A rs UTS2 T M -1 nssnp 9:17 0:33 22:16 16:11 33:0 13:15 16:11 56:62 55:72 138:0 60:61 110:0 52:35 56:49 113: A T novel CLSTN1 F Y 3 nssnp 22:0 10:14 27:0 17:0 11:0 12:0 7:0 31:0 36:0 19:32 20:16 31:0 11:14 39:0 37: C T rs MTHFR R Q 1 nssnp 13:0 18:0 19:0 12:0 9:4 6:9 13:0 118:0 105:0 64:62 101:0 102:0 96:0 52:47 84: T A rs12584 UBR4 M L 2 nssnp 0:37 18:21 0:32 0:14 0:19 6:6 0:14 34:18 0:53 33:24 28:33 0:63 27:17 0:46 0:45 Transcriptome Read-counts AK3 AK5 AK7 AK4 AK6 AK14 AK20 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N9 AK_N14 AK_N15 # of hetero# of inform# of suggeas_score FisherPvalu 0::61 0::49 48::0 0::29 0::40 21::15 0::57 10::11 14::18 5::5 22::21 0::29 23::25 26::0 7:: E-10 52::20 64::11 78::11 29::8 46::7 47::6 37::13 27::11 16::5 6::5 28::11 28::8 38::14 33::9 15:: E-06 24::96 31::72 26::86 13::49 19::64 15::61 14::61 19::31 7::34 13::24 28::35 19::34 7::52 15::48 9:: E-20 48::100 48::78 47::93 36::64 52::68 37::71 46::59 37::38 11::43 27::31 49::51 34::40 20::56 51::48 32:: E-10 0::0 0::0 0::0 0::0 28::0 21::0 14::0 0::0 31::0 0::57 43::0 0::12 7::24 0::0 32:: E-06 0::0 0::0 0::0 0::0 20::0 3::0 17::0 0::0 15::0 40::0 16::8 8::1 8::4 1::0 20:: E-08 51::0 32::24 27::0 57::0 38::0 48::0 38::0 73::0 36::1 19::16 22::12 55::0 45::0 61::0 42:: E-06 13::0 17::0 11::0 18::0 0::8 7::7 11::0 19::0 23::0 6::10 20::1 13::0 29::1 5::23 16:: E-06 0::113 53::65 0::102 0::131 0::99 64::57 0::114 66::74 1::144 58::74 52::45 2::98 0::129 0::98 0:: E-08 47

48 Supplementary Table 21. Contig list generated by de novo assembly SuppTable21_Denovo_Contigs_List.xls Depicted below is a preview of the full version. 48

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) inuscu1-35bp 5 _ 0 _ 5 _ What is Annotation inuscu2-75bp luscu1-75bp 0 _ 5 _ Reconstruction

More information

RNA- seq read mapping

RNA- seq read mapping RNA- seq read mapping Pär Engström SciLifeLab RNA- seq workshop October 216 IniDal steps in RNA- seq data processing 1. Quality checks on reads 2. Trim 3' adapters (opdonal (for species with a reference

More information

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007 -2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open

More information

Supplementary Information for Discovery and characterization of indel and point mutations

Supplementary Information for Discovery and characterization of indel and point mutations Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear Avinash Ramu 1 Michiel J. Noordam 1 Rachel S. Schwartz 2 Arthur Wuster 3 Matthew E. Hurles 3 Reed

More information

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly Comparative Genomics: Human versus chimpanzee 1. Introduction The chimpanzee is the closest living relative to humans. The two species are nearly identical in DNA sequence (>98% identity), yet vastly different

More information

Nature Genetics: doi:0.1038/ng.2768

Nature Genetics: doi:0.1038/ng.2768 Supplementary Figure 1: Graphic representation of the duplicated region at Xq28 in each one of the 31 samples as revealed by acgh. Duplications are represented in red and triplications in blue. Top: Genomic

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014 Going Beyond SNPs with Next Genera5on Sequencing Technology 02-223 Personalized Medicine: Understanding Your Own Genome Fall 2014 Next Genera5on Sequencing Technology (NGS) NGS technology Discover more

More information

Genotype Imputation. Class Discussion for January 19, 2016

Genotype Imputation. Class Discussion for January 19, 2016 Genotype Imputation Class Discussion for January 19, 2016 Intuition Patterns of genetic variation in one individual guide our interpretation of the genomes of other individuals Imputation uses previously

More information

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing So Yeun Kwon, Hwan Young Lee, and Kyoung-Jin Shin Department of Forensic Medicine, Yonsei University College of Medicine, Seoul,

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

GEP Annotation Report

GEP Annotation Report GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:

More information

Fei Lu. Post doctoral Associate Cornell University

Fei Lu. Post doctoral Associate Cornell University Fei Lu Post doctoral Associate Cornell University http://www.maizegenetics.net Genotyping by sequencing (GBS) is simple and cost effective 1. Digest DNA 2. Ligate adapters with barcodes 3. Pool DNAs 4.

More information

GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI

GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI 1 GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI Justin Dailey and Xiaoyu Zhang Department of Computer Science, California State University San Marcos San Marcos, CA 92096 Email: daile005@csusm.edu,

More information

Statistical Inferences for Isoform Expression in RNA-Seq

Statistical Inferences for Isoform Expression in RNA-Seq Statistical Inferences for Isoform Expression in RNA-Seq Hui Jiang and Wing Hung Wong February 25, 2009 Abstract The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

Heterozygous BMN lines

Heterozygous BMN lines Optical density at 80 hours 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 a YPD b YPD + 1µM nystatin c YPD + 2µM nystatin d YPD + 4µM nystatin 1 3 5 6 9 13 16 20 21 22 23 25 28 29 30

More information

1 Introduction. Abstract

1 Introduction. Abstract CBS 530 Assignment No 2 SHUBHRA GUPTA shubhg@asu.edu 993755974 Review of the papers: Construction and Analysis of a Human-Chimpanzee Comparative Clone Map and Intra- and Interspecific Variation in Primate

More information

Lecture Notes: BIOL2007 Molecular Evolution

Lecture Notes: BIOL2007 Molecular Evolution Lecture Notes: BIOL2007 Molecular Evolution Kanchon Dasmahapatra (k.dasmahapatra@ucl.ac.uk) Introduction By now we all are familiar and understand, or think we understand, how evolution works on traits

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Supplemental Information

Supplemental Information Molecular Cell, Volume 52 Supplemental Information The Translational Landscape of the Mammalian Cell Cycle Craig R. Stumpf, Melissa V. Moreno, Adam B. Olshen, Barry S. Taylor, and Davide Ruggero Supplemental

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb Overview Fosmid XAAA112 consists of 34,783 nucleotides. Blat results indicate that this fosmid has significant identity to the 2R chromosome of D.melanogaster. Evidence suggests that fosmid XAAA112 contains

More information

Protocol S1. Replicate Evolution Experiment

Protocol S1. Replicate Evolution Experiment Protocol S Replicate Evolution Experiment 30 lines were initiated from the same ancestral stock (BMN, BMN, BM4N) and were evolved for 58 asexual generations using the same batch culture evolution methodology

More information

Annotation of Drosophila grimashawi Contig12

Annotation of Drosophila grimashawi Contig12 Annotation of Drosophila grimashawi Contig12 Marshall Strother April 27, 2009 Contents 1 Overview 3 2 Genes 3 2.1 Genscan Feature 12.4............................................. 3 2.1.1 Genome Browser:

More information

High-throughput sequencing: Alignment and related topic

High-throughput sequencing: Alignment and related topic High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg HTS Platforms E s ta b lis h e d p la tfo rm s Illu m in a H is e q, A B I S O L id, R o c h e 4 5 4 N e w c o m e rs

More information

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre PhD defense Chromosomal rearrangements in mammalian genomes : characterising the breakpoints Claire Lemaitre Laboratoire de Biométrie et Biologie Évolutive Université Claude Bernard Lyon 1 6 novembre 2008

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Unfixed endogenous retroviral insertions in the human population. Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw

Unfixed endogenous retroviral insertions in the human population. Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw Unfixed endogenous retroviral insertions in the human population Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw Supplementary Methods Common sources of 'false positives' in mining

More information

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Maheshi Dassanayake 1,9, Dong-Ha Oh 1,9, Jeffrey S. Haas 1,2, Alvaro Hernandez 3, Hyewon Hong 1,4, Shahjahan

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

RGP finder: prediction of Genomic Islands

RGP finder: prediction of Genomic Islands Training courses on MicroScope platform RGP finder: prediction of Genomic Islands Dynamics of bacterial genomes Gene gain Horizontal gene transfer Gene loss Deletion of one or several genes Duplication

More information

Supplementary Information

Supplementary Information Supplementary Information LINE-1-like retrotransposons contribute to RNA-based gene duplication in dicots Zhenglin Zhu 1, Shengjun Tan 2, Yaqiong Zhang 2, Yong E. Zhang 2,3 1. School of Life Sciences,

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 8 29, pages 126 132 doi:1.193/bioinformatics/btp113 Gene expression Statistical inferences for isoform expression in RNA-Seq Hui Jiang 1 and Wing Hung Wong 2,

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11 The Eukaryotic Genome and Its Expression Lecture Series 11 The Eukaryotic Genome and Its Expression A. The Eukaryotic Genome B. Repetitive Sequences (rem: teleomeres) C. The Structures of Protein-Coding

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Isoform discovery and quantification from RNA-Seq data

Isoform discovery and quantification from RNA-Seq data Isoform discovery and quantification from RNA-Seq data C. Toffano-Nioche, T. Dayris, Y. Boursin, M. Deloger November 2016 C. Toffano-Nioche, T. Dayris, Y. Boursin, M. Isoform Deloger discovery and quantification

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Classical Selection, Balancing Selection, and Neutral Mutations

Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection Perspective of the Fate of Mutations All mutations are EITHER beneficial or deleterious o Beneficial mutations are selected

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Designer Genes C Test

Designer Genes C Test Northern Regional: January 19 th, 2019 Designer Genes C Test Name(s): Team Name: School Name: Team Number: Rank: Score: Directions: You will have 50 minutes to complete the test. You may not write on the

More information

Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data

Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Yuan Ji 1,, Yanxun Xu 2, Qiong Zhang 3, Kam-Wah Tsui 3, Yuan Yuan 4, Clift Norris 1, Shoudan

More information

Markov Models & DNA Sequence Evolution

Markov Models & DNA Sequence Evolution 7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Ensembl Exercise Answers Adapted from Ensembl tutorials presented by Dr. Bert Overduin, EBI

Ensembl Exercise Answers Adapted from Ensembl tutorials presented by Dr. Bert Overduin, EBI Ensembl Exercise Answers Adapted from Ensembl tutorials presented by Dr. Bert Overduin, EBI Exercise 1 Exploring the human MYH9 gene (a) Go to the Ensembl homepage (http://www.ensembl.org). Select Search:

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Introduction to de novo RNA-seq assembly

Introduction to de novo RNA-seq assembly Introduction to de novo RNA-seq assembly Introduction Ideal day for a molecular biologist Ideal Sequencer Any type of biological material Genetic material with high quality and yield Cutting-Edge Technologies

More information

Variant visualisation and quality control

Variant visualisation and quality control Variant visualisation and quality control You really should be making plots! 25/06/14 Paul Theodor Pyl 1 Classical Sequencing Example DNA.BAM.VCF Aligner Variant Caller A single sample sequencing run 25/06/14

More information

Supporting Information

Supporting Information Supporting Information Weghorn and Lässig 10.1073/pnas.1210887110 SI Text Null Distributions of Nucleosome Affinity and of Regulatory Site Content. Our inference of selection is based on a comparison of

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Pan-genomics: theory & practice

Pan-genomics: theory & practice Pan-genomics: theory & practice Michael Schatz Sept 20, 2014 GRC Assembly Workshop #gi2014 / @mike_schatz Part 1: Theory Advances in Assembly! Perfect Human Assembly First PacBio RS @ CSHL Perfect Microbes

More information

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

More information

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY Command line Training Set First Motif Summary of Motifs Termination Explanation MEME - Motif discovery tool MEME version 3.0 (Release date: 2002/04/02 00:11:59) For further information on how to interpret

More information

A Browser for Pig Genome Data

A Browser for Pig Genome Data A Browser for Pig Genome Data Thomas Mailund January 2, 2004 This report briefly describe the blast and alignment data available at http://www.daimi.au.dk/ mailund/pig-genome/ hits.html. The report describes

More information

Learning gene regulatory networks Statistical methods for haplotype inference Part I

Learning gene regulatory networks Statistical methods for haplotype inference Part I Learning gene regulatory networks Statistical methods for haplotype inference Part I Input: Measurement of mrn levels of all genes from microarray or rna sequencing Samples (e.g. 200 patients with lung

More information

Potato Genome Analysis

Potato Genome Analysis Potato Genome Analysis Xin Liu Deputy director BGI research 2016.1.21 WCRTC 2016 @ Nanning Reference genome construction???????????????????????????????????????? Sequencing HELL RIEND WELCOME BGI ZHEN LLOFRI

More information

Towards More Effective Formulations of the Genome Assembly Problem

Towards More Effective Formulations of the Genome Assembly Problem Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25 2 / 25 CENTRAL DOGMA OF BIOLOGY

More information

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University Genome Annotation Qi Sun Bioinformatics Facility Cornell University Some basic bioinformatics tools BLAST PSI-BLAST - Position-Specific Scoring Matrix HMM - Hidden Markov Model NCBI BLAST How does BLAST

More information

Is KIT locus polymorphism rs related to white belt phenotype in Krškopolje pig?

Is KIT locus polymorphism rs related to white belt phenotype in Krškopolje pig? Is KIT locus polymorphism rs328592739 related to white belt phenotype in Krškopolje pig? Jernej Ogorevc, Minja Zorc, Martin Škrlep, Riccardo Bozzi, Matthias Petig, Luca Fontanesi, Marjeta Čandek-Potokar,

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Our typical RNA quantification pipeline

Our typical RNA quantification pipeline RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality

More information

Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle

Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle Genome 371, Autumn 2018 Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle Goals: To illustrate how molecular tools can be used to track inheritance. In this particular example, we will

More information

Bias in RNA sequencing and what to do about it

Bias in RNA sequencing and what to do about it Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Adaptation in the Human Genome. HapMap. The HapMap is a Resource for Population Genetic Studies. Single Nucleotide Polymorphism (SNP)

Adaptation in the Human Genome. HapMap. The HapMap is a Resource for Population Genetic Studies. Single Nucleotide Polymorphism (SNP) Adaptation in the Human Genome A genome-wide scan for signatures of adaptive evolution using SNP data Single Nucleotide Polymorphism (SNP) A nucleotide difference at a given location in the genome Joanna

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

objective functions...

objective functions... objective functions... COFFEE (Notredame et al. 1998) measures column by column similarity between pairwise and multiple sequence alignments assumes that the pairwise alignments are optimal assumes a set

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/3/11/eaao4709/dc1 Supplementary Materials for Pushing the limits of photoreception in twilight conditions: The rod-like cone retina of the deep-sea pearlsides Fanny

More information

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER SUPPLEMENTARY METHODS FOR THE PAPER TRANSCRIPT ASSEMBLY AND QUANTIFICATION BY RNA-SEQ REVEALS UNANNOTATED TRANSCRIPTS AND ISOFORM SWITCHING DURING CELL DIFFERENTIATION COLE TRAPNELL, BRIAN A WILLIAMS,

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 389; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs06.html 1/12/06 CAP5510/CGS5166 1 Evaluation

More information

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS.

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS. !! www.clutchprep.com CONCEPT: HISTORY OF GENETICS The earliest use of genetics was through of plants and animals (8000-1000 B.C.) Selective breeding (artificial selection) is the process of breeding organisms

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Genotyping By Sequencing (GBS) Method Overview

Genotyping By Sequencing (GBS) Method Overview enotyping By Sequencing (BS) Method Overview RJ Elshire, JC laubitz, Q Sun, JV Harriman ES Buckler, and SE Mitchell http://wwwmaizegeneticsnet/ Topics Presented Background/oals BS lab protocol Illumina

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

Objective 3.01 (DNA, RNA and Protein Synthesis)

Objective 3.01 (DNA, RNA and Protein Synthesis) Objective 3.01 (DNA, RNA and Protein Synthesis) DNA Structure o Discovered by Watson and Crick o Double-stranded o Shape is a double helix (twisted ladder) o Made of chains of nucleotides: o Has four types

More information

Chapter 18 Active Reading Guide Genomes and Their Evolution

Chapter 18 Active Reading Guide Genomes and Their Evolution Name: AP Biology Mr. Croft Chapter 18 Active Reading Guide Genomes and Their Evolution Most AP Biology teachers think this chapter involves an advanced topic. The questions posed here will help you understand

More information

2. Der Dissertation zugrunde liegende Publikationen und Manuskripte. 2.1 Fine scale mapping in the sex locus region of the honey bee (Apis mellifera)

2. Der Dissertation zugrunde liegende Publikationen und Manuskripte. 2.1 Fine scale mapping in the sex locus region of the honey bee (Apis mellifera) 2. Der Dissertation zugrunde liegende Publikationen und Manuskripte 2.1 Fine scale mapping in the sex locus region of the honey bee (Apis mellifera) M. Hasselmann 1, M. K. Fondrk², R. E. Page Jr.² und

More information

Alignment. Peak Detection

Alignment. Peak Detection ChIP seq ChIP Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008 ChIP Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis Alignment ELAND Bowtie

More information

Supporting Information

Supporting Information Supporting Information Hammer et al. 10.1073/pnas.1109300108 SI Materials and Methods Two-Population Model. Estimating demographic parameters. For each pair of sub-saharan African populations we consider

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Figure 1 Principal components analysis (PCA) of all samples analyzed in the discovery phase. Colors represent the phenotype of study populations. a) The first sample

More information

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics. Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor

More information

Matrix-based pattern discovery algorithms

Matrix-based pattern discovery algorithms Regulatory Sequence Analysis Matrix-based pattern discovery algorithms Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

More information

Levels of genetic variation for a single gene, multiple genes or an entire genome

Levels of genetic variation for a single gene, multiple genes or an entire genome From previous lectures: binomial and multinomial probabilities Hardy-Weinberg equilibrium and testing HW proportions (statistical tests) estimation of genotype & allele frequencies within population maximum

More information

Introduction to Advanced Population Genetics

Introduction to Advanced Population Genetics Introduction to Advanced Population Genetics Learning Objectives Describe the basic model of human evolutionary history Describe the key evolutionary forces How demography can influence the site frequency

More information

Darwinian Selection. Chapter 7 Selection I 12/5/14. v evolution vs. natural selection? v evolution. v natural selection

Darwinian Selection. Chapter 7 Selection I 12/5/14. v evolution vs. natural selection? v evolution. v natural selection Chapter 7 Selection I Selection in Haploids Selection in Diploids Mutation-Selection Balance Darwinian Selection v evolution vs. natural selection? v evolution ² descent with modification ² change in allele

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information