Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)


Transcription

1 Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

2 What is Annotation? Reconstruction of gene structure within the genome: Transcriptional Start Site, UTR, Coding Exon, Intron, Transcriptional End Site. [Figure: genome browser view with RNA-seq coverage tracks (MinusCu1-35bp, MinusCu2-75bp, PlusCu1-75bp, PlusCu2-75bp; Solexa transcriptome log10 unique coverage, with -1 for 0) above JGI 3.1 gene models from GFF]

3 How are Genomes Annotated? Traditional approaches use: information from expressed sequence tags (ESTs); conservation across organisms; prior knowledge of sequence motifs (e.g. splice junctions). They do not take advantage of data generated by next-generation sequencers. Challenge: develop data-driven annotation using RNA-seq data.

4 Whole-genome Transcriptome Analysis (WTA). 10 µg total RNA; 10 ng/µl mRNA (poly-A). Random fragmentation. Random-hexamer-primed 1st strand cDNA synthesis + 2nd strand cDNA synthesis. End repair and adaptor ligation. Size selection. (Christian Haudenschild, Illumina)

5 Plant Genomes. Chlamydomonas is a model alga with a sequenced genome and still-incomplete annotation; it is currently being used for biodiesel studies. Arabidopsis is a model plant with a very high quality genome and nearly complete annotation; it is a genetically tractable organism.

6 Limitations of Current Chlamydomonas Annotation from Augustus Models. Even at high coverage, more than 20% of predicted genes have no RNA-seq evidence.

7 Two Approaches for Annotating Genomes using RNA-seq. 1. First approach: align reads to the genome, then concatenate reads that map to overlapping bases on the genome. 2. Second approach: assemble reads directly before mapping to the genome, using assembly tools such as ABySS.

8 Method I - Alignment of Reads to Genome. First perform ungapped alignments using a fast aligner (e.g. Novoalign or Bowtie). The reads that do not map are then mapped using a gapped alignment protocol (e.g. BLAT or TopHat). The gaps identify splice junctions. We compute the number of reads that align to each base in the genome.
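
The per-base read count described on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: it assumes each aligned read has already been reduced to a 0-based, half-open interval on a single sequence, and the function name `per_base_coverage` and the toy reads are made up for the example.

```python
from typing import Iterable, List, Tuple


def per_base_coverage(reads: Iterable[Tuple[int, int]], genome_length: int) -> List[int]:
    """Count how many reads cover each base.

    Each read is a 0-based, half-open interval [start, end) on one sequence.
    A difference array keeps the work at O(reads + genome_length).
    """
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1   # coverage rises where the read begins
            diff[end] -= 1     # and falls just past its last base
    coverage = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        coverage.append(running)
    return coverage


# Toy example: three reads on a 10-base sequence (hypothetical coordinates).
toy_reads = [(0, 4), (2, 6), (8, 10)]
print(per_base_coverage(toy_reads, 10))  # [1, 1, 2, 2, 1, 1, 0, 0, 1, 1]
```

A gapped (spliced) alignment would contribute one interval per aligned block, so that intron bases are not counted.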

9 Read Counts Across Chlamydomonas Genome. Long reads are not aligned across short exons in ungapped alignments. [Figure: C. reinhardtii (May 2006 assembly), a 3,038 bp window on scaffold_1; log10 RNA-seq coverage tracks from the first alignment round for the 35-base reads (JGI MinusCu1 35bp) and the 75-base reads (JGI MinusCu2 75bp, PlusCu1 75bp, PlusCu2 75bp), shown with JGI 3.1 gene models from GFF and the Solexa transcriptome log10 unique coverage track (-1 for 0)]

10 Gapped Alignments in Arabidopsis. Gapped alignments allow us to cover short exons and define splice junctions. [Figure: A. thaliana chr1:64,187-65,986 (1,800 bp); gapped alignments from Solexa GAII, Arabidopsis flowers, lane 1 (scores 90 to 100), shown with TAIR7 annotations (green = protein-coding gene, red = pseudogene/transposon, others = various RNA types) and repeat tracks]

11 Assembly of Mapped Reads. [Diagram labels: RNA reads; ungapped alignments; gapped alignment; RNA counts per base; splice; assembly; transcript models; gene models]

12 Assembly. Reads that map to overlapping bases on the genome are concatenated into contigs (blue). [Figure: 2 kb window on chromosome 1 showing forward- and reverse-strand DGE tags (17-mer tags at NlaIII restriction sites), our models (blue, e.g. chr1model66), JGI models (green) and Augustus v5.0 models (red), with RNA-seq coverage from paired-end 50 bp Solexa GAII reads (Sabeeha Merchant, Madeli Castruita; Copper Plus Minimal; paired end; unique hits)]
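
Concatenating reads that share genomic bases amounts to merging overlapping intervals. The sketch below assumes reads are given as 0-based, half-open intervals on one strand of one sequence; `merge_into_contigs` is a name chosen for illustration, and the handling of strand, splice gaps and paired ends in the actual models is not described on the slide.

```python
from typing import Iterable, List, Tuple


def merge_into_contigs(reads: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge reads that overlap by at least one base into contigs.

    Reads are 0-based, half-open intervals [start, end). Reads that merely
    touch (one ends where the next begins) share no base and stay separate.
    """
    contigs: List[List[int]] = []
    for start, end in sorted(reads):
        if contigs and start < contigs[-1][1]:
            # The read shares at least one base with the current contig.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            contigs.append([start, end])
    return [(s, e) for s, e in contigs]


# Toy reads (hypothetical coordinates).
print(merge_into_contigs([(5, 9), (0, 4), (2, 6), (20, 25), (9, 12)]))
# [(0, 9), (9, 12), (20, 25)]
```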

13 Splicing. The same locus can generate multiple transcripts through differences in splicing, transcription start sites (TSS) and transcription termination sites (TTS). Our assembly generates models that represent different splice sites.

14 Segmentation. Regions where two genes overlap show continuous RNA-seq read counts. Discontinuities in read counts may be used to define the boundaries. We use dynamic programming approaches to efficiently segment count profiles. [Figure: reverse-strand DGE tags and RNA-seq coverage (Sabeeha Merchant, Madeli Castruita; Copper Plus Minimal; paired end; unique hits) over Augustus v5.0 gene models au5.g1528_t1, au5.g1529_t1 and au5.g1530_t1]
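
The slide does not say which objective the dynamic program optimizes. As one possible sketch, the code below segments a count profile by penalized least squares (each segment is fitted by its mean, and each segment costs a fixed penalty). The objective, the `penalty` parameter and the function name `segment_profile` are assumptions made for illustration.

```python
from typing import List, Sequence


def segment_profile(counts: Sequence[float], penalty: float) -> List[int]:
    """Return the boundaries of a piecewise-constant segmentation of counts.

    Minimizes (sum of squared deviations from each segment mean)
    + penalty * (number of segments) with an O(n^2) dynamic program.
    A boundary b means one segment ends at index b - 1 and the next starts at b.
    """
    n = len(counts)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, c in enumerate(counts):
        prefix[i + 1] = prefix[i] + c
        prefix_sq[i + 1] = prefix_sq[i] + c * c

    def cost(i: int, j: int) -> float:
        # Squared error of counts[i:j] around its own mean.
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n   # best[j]: optimal score of counts[:j]
    back = [0] * (n + 1)                # back[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                back[j] = i

    boundaries = []
    j = n
    while j > 0:
        i = back[j]
        if i > 0:
            boundaries.append(i)
        j = i
    return sorted(boundaries)


# Toy profile: two expressed regions separated by a stretch with no reads.
profile = [10, 10, 10, 10, 0, 0, 0, 30, 30, 30]
print(segment_profile(profile, penalty=5.0))  # [4, 7]
```

A larger penalty yields fewer, longer segments; on real coverage the counts would usually be log-transformed first so that highly expressed genes do not dominate the squared error.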

15 Refining Splice Junctions. Splice junction motifs (donor and acceptor) are computed and used to refine ambiguous gapped alignments.
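
One common way to compute a motif and use it to choose between ambiguous junction placements is a log-odds position weight matrix. The slide does not specify the scoring scheme, so the sketch below is an assumption: the uniform background, the pseudocount, the function names and the toy donor sequences are all illustrative.

```python
import math
from typing import Dict, List, Sequence

BASES = "ACGT"


def position_weight_matrix(sites: Sequence[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log2-odds matrix (versus a uniform background) from aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return matrix


def score_site(matrix: List[Dict[str, float]], candidate: str) -> float:
    """Sum the per-position log-odds of a candidate junction sequence."""
    return sum(column[base] for column, base in zip(matrix, candidate))


def best_junction(matrix: List[Dict[str, float]], candidates: Sequence[str]) -> str:
    """Pick the candidate placement whose sequence best matches the motif."""
    return max(candidates, key=lambda c: score_site(matrix, c))


# Toy donor-site sequences (made up; they start with the canonical GT).
donors = ["GTAAGT", "GTGAGT", "GTAAGA", "GTATGT"]
pwm = position_weight_matrix(donors)
print(best_junction(pwm, ["CCAAGT", "GTAAGT"]))
```

In practice the matrix would be trained on junctions supported by unambiguous gapped alignments, with a separate matrix for acceptor sites.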

16 Preliminary Results. 77% of the bases in our models overlap with Augustus models. 76% of Augustus models overlap our models. Our models are often limited by poor RNA-seq coverage of genes, which results in gene fragments rather than complete transcripts.

17 De Novo Transcriptome Assembly with ABySS. Transcriptome assembly: may be used when a genome sequence is not available; not biased by errors in the genome sequence. De novo assembly with the ABySS assembler: Assembly By Short Sequences; assembly basis: de Bruijn graph.

18 Differences Between Genome and Transcript Assemblies. Transcripts have a large dynamic range of abundances.

19 De Novo Transcriptome Assembly with ABySS. K-mers (k = 3). The k-mers are connected if their overlap is k-1 = 2. Blue arrows indicate the order of the k-mers and their overlaps. Remove bubbles and branches. Output: contigs.
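
The k-mer graph on this slide can be illustrated with a toy sketch in which nodes are k-mers and an edge joins two k-mers when the last k-1 bases of one equal the first k-1 bases of the other. This is not ABySS itself: the function names and reads are made up, and bubble and branch removal is replaced by simply stopping the walk at the first ambiguous node.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def kmers(sequence: str, k: int) -> List[str]:
    """All k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each k-mer to the k-mers it overlaps by k-1 bases (suffix to prefix)."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    return {node: sorted(by_prefix.get(node[1:], [])) for node in sorted(nodes)}


def walk_contig(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig from start while exactly one unvisited successor exists."""
    contig = start
    node = start
    seen = {start}
    while len(graph[node]) == 1:
        successor = graph[node][0]
        if successor in seen:   # stop on a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


# Two toy reads overlapping by four bases, k = 3 as on the slide.
graph = build_graph(["ACGTTG", "GTTGCA"], k=3)
print(walk_contig(graph, "ACG"))  # ACGTTGCA
```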

20 De Novo Transcriptome Assembly with ABySS. Generated an RNA-seq library from Arabidopsis flowers. Sequenced 20 million reads, 100 bases long. Reads were assembled using ABySS.

21 ABySS: Parameter Search for k value. Assembly: Birol et al. Arabidopsis transcriptome (14 million 100-mer reads).
k-mer value
#contigs 812,300 1,700,453 40,603 42,365 45,491 47,319 51,115 54,541 59,672
#contigs >= ,080 37,352 29,074 29,468 30,153 30,347 30,891 31,548 32,467
#contigs >= N50 N/A 8,621 5,828 5,818 5,805 5,771 5,775 5,780 5,803
median (bp) N/A
mean (bp) N/A
N50 (bp) ,106 1,116 1,131 1,143 1,148 1,154 1,155
max (bp) 7,386 3,495 8,539 11,911 11,911 11,911 11,911 11,911 8,373
sum (Mbp)
Stats are for contigs >= 100 bp (except #contigs). N50: contigs of size >= N50 make up 50% of the assembly's bases. Opted for k = 56 because it gave the highest N50, max contig, and total Mbp.
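
The N50 definition on this slide (contigs of size >= N50 make up 50% of the assembly's bases) can be computed as in the sketch below. The `min_length` filter mirrors the note that stats are for contigs >= 100 bp; the function name and the toy lengths are illustrative.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int], min_length: int = 100) -> int:
    """Largest length L such that contigs of length >= L hold at least half the bases.

    Contigs shorter than min_length are ignored before computing the statistic.
    """
    ordered = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not ordered:
        raise ValueError("no contigs at or above min_length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Toy contig lengths in bp; the 50 bp contig is filtered out.
print(n50([50, 100, 200, 300, 400, 500]))  # 400
```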

22 ABySS assembly efficiency improved when adding more reads. [Plot: sum of bases in contigs versus contig length]

23 How Much Coverage do We Need to Generate Full Length Transcripts? 15 counts per base are sufficient to assemble full length transcripts for most genes

24 ABySS: Assembly Coverage. To determine ABySS assembly quality, contigs were aligned to TAIR mRNA reference sequences with BLAST. A Perl script calculates coverage, counting only alignments with 98% identity (2% mismatches and 0 gaps).
BLAST of Contigs against Refseqs. #Contigs, Qual Hits, Low Qual Hits, No Hits: 31,548 27, ,053. Coverage (total / covered / % covered): queries 31,548 / 27,530 / 87%; bases 21,124,429 / 19,575,778 / 93%.
BLAST of Refseqs against Contigs. #Refseqs 31,770; Qual Hits 19,189; Low Qual Hits 5,146; No Hits 7,435. Coverage (total / covered / % covered): queries 31,770 / 19,189 / 60%; bases 48,103,124 / 23,494,369 / 49%.
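
The Perl script itself is not shown in the slides. As a rough Python sketch of the same idea, the code below keeps only BLAST hits with at least 98% identity and no gap openings, then counts how many query bases those hits cover. It assumes BLAST's 12-column tabular output (query, subject, % identity, length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the function names and the sample lines are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def quality_hits(tabular_lines: Iterable[str], min_identity: float = 98.0) -> Dict[str, List[Tuple[int, int]]]:
    """Collect query intervals of hits with >= min_identity and zero gap opens."""
    hits: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments and malformed lines
        identity = float(fields[2])
        gap_opens = int(fields[5])
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        if identity >= min_identity and gap_opens == 0:
            hits[fields[0]].append((q_start, q_end))
    return dict(hits)


def covered_bases(intervals: Iterable[Tuple[int, int]]) -> int:
    """Number of distinct bases covered by 1-based, inclusive intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# Hypothetical tabular lines: contig_2 fails on identity, contig_3 on gaps.
sample = [
    "contig_1\tref_A\t99.0\t100\t1\t0\t1\t100\t5\t104\t1e-50\t180",
    "contig_1\tref_A\t98.5\t60\t1\t0\t81\t140\t200\t259\t1e-20\t100",
    "contig_2\tref_B\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t120",
    "contig_3\tref_C\t99.0\t100\t0\t1\t1\t100\t1\t101\t1e-40\t170",
]
hits = quality_hits(sample)
print({query: covered_bases(spans) for query, spans in hits.items()})
```

Dividing the covered bases by the total query length gives the kind of percentage reported on the slide, for example 19,575,778 / 21,124,429, which rounds to 93%.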

25 Example ABySS Contigs (in black). ABySS contigs often underestimate TSS and TTS. ABySS contigs only capture a single transcript.

26 Able to predict new genes that are not in the annotation. We were able to identify 414 novel transcripts with no matches to existing annotation; 90 of these had hits to a protein database.

27 Acknowledgements. Pellegrini Lab: David Casero Diaz-Cano, Stephen Douglass, Darren Kessner. Sabeeha Merchant Lab: Steven Karpowicz, Madeli Castruita, Janette Kropat. Sequencing done at JGI.


More information

SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE

SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE Editor Antonia Gutiérrez-Mora Compilers Benjamín Rodríguez-Garay Silvia Maribel Contreras-Ramos Manuel Reinhart Kirchmayr Marisela González-Ávila Index 1.

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq express: Streaming read deconvolution and abundance estimation applied to RNA-Seq Adam Roberts 1 and Lior Pachter 1,2 1 Department of Computer Science, 2 Departments of Mathematics and Molecular & Cell

More information

Lecture 18 June 2 nd, Gene Expression Regulation Mutations

Lecture 18 June 2 nd, Gene Expression Regulation Mutations Lecture 18 June 2 nd, 2016 Gene Expression Regulation Mutations From Gene to Protein Central Dogma Replication DNA RNA PROTEIN Transcription Translation RNA Viruses: genome is RNA Reverse Transcriptase

More information

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016 New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016 Wikipedia The traditional workflow ALIGNMENT COUNTING ANALYSIS Gene A Gene B... Gene X 7... 13............... The traditional workflow

More information

A Browser for Pig Genome Data

A Browser for Pig Genome Data A Browser for Pig Genome Data Thomas Mailund January 2, 2004 This report briefly describe the blast and alignment data available at http://www.daimi.au.dk/ mailund/pig-genome/ hits.html. The report describes

More information

Systematic comparison of lncrnas with protein coding mrnas in population expression and their response to environmental change

Systematic comparison of lncrnas with protein coding mrnas in population expression and their response to environmental change Xu et al. BMC Plant Biology (2017) 17:42 DOI 10.1186/s12870-017-0984-8 RESEARCH ARTICLE Open Access Systematic comparison of lncrnas with protein coding mrnas in population expression and their response

More information

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Eukaryotic vs. Prokaryotic genes

Eukaryotic vs. Prokaryotic genes BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 18: Eukaryotic genes http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Eukaryotic vs. Prokaryotic genes Like in prokaryotes,

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Alignment Strategies for Large Scale Genome Alignments

Alignment Strategies for Large Scale Genome Alignments Alignment Strategies for Large Scale Genome Alignments CSHL Computational Genomics 9 November 2003 Algorithms for Biological Sequence Comparison algorithm value scoring gap time calculated matrix penalty

More information

Algorithmics and Bioinformatics

Algorithmics and Bioinformatics Algorithmics and Bioinformatics Gregory Kucherov and Philippe Gambette LIGM/CNRS Université Paris-Est Marne-la-Vallée, France Schedule Course webpage: https://wikimpri.dptinfo.ens-cachan.fr/doku.php?id=cours:c-1-32

More information

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS The Saguaro Genome Toward the Ecological Genomics of a Sonoran Desert Icon Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS Why study a genome? - the genome contains the genetic information of an

More information

PG Diploma in Genome Informatics onwards CCII Page 1 of 6

PG Diploma in Genome Informatics onwards CCII Page 1 of 6 PG Diploma in Genome Informatics 2014-15 onwards CCII Page 1 of 6 BHARATHIAR UNIVERSITY, COIMBATORE 641046 CENTRE FOR COLLABORATION OF INDUSTRY AND INSTITUTION(CCII) PG DIPLOMA IN GENOME INFORMATICS (For

More information

High-throughput sequence alignment. November 9, 2017

High-throughput sequence alignment. November 9, 2017 High-throughput sequence alignment November 9, 2017 a little history human genome project #1 (many U.S. government agencies and large institute) started October 1, 1990. Goal: 10x coverage of human genome,

More information

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Promoters and Enhancers Systematic discovery of transcriptional regulatory motifs

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Introduction. SAMStat. QualiMap. Conclusions

Introduction. SAMStat. QualiMap. Conclusions Introduction SAMStat QualiMap Conclusions Introduction SAMStat QualiMap Conclusions Where are we? Why QC on mapped sequences Acknowledgment: Fernando García Alcalde The reads may look OK in QC analyses

More information

Variation in the genetic response to high temperature in Montastraea faveolata from the Florida Keys & Mexico

Variation in the genetic response to high temperature in Montastraea faveolata from the Florida Keys & Mexico Variation in the genetic response to high temperature in Montastraea faveolata from the Florida Keys & Mexico Nicholas R. Polato 1, Christian R. Voolstra 2, Julia Schnetzer 3, Michael K. DeSalvo 4, Carly

More information

Multi-Assembly Problems for RNA Transcripts

Multi-Assembly Problems for RNA Transcripts Multi-Assembly Problems for RNA Transcripts Alexandru Tomescu Department of Computer Science University of Helsinki Joint work with Veli Mäkinen, Anna Kuosmanen, Romeo Rizzi, Travis Gagie, Alex Popa CiE

More information

Regulatory Change in YABBY-like Transcription Factor Led to Evolution of Extreme Fruit Size during Tomato Domestication

Regulatory Change in YABBY-like Transcription Factor Led to Evolution of Extreme Fruit Size during Tomato Domestication SUPPORTING ONLINE MATERIALS Regulatory Change in YABBY-like Transcription Factor Led to Evolution of Extreme Fruit Size during Tomato Domestication Bin Cong, Luz Barrero, & Steven Tanksley 1 SUPPORTING

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq Mark F. Rogers, Christina Boucher, and Asa Ben-Hur Department of Computer Science 1873 Campus Delivery Fort Collins, CO 80523 rogersma@cs.colostate.edu,

More information

Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism

Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism Ling Yin 1,3,, Yunhe An 1,2,, Junjie Qu 3,, Xinlong Li 1, Yali Zhang 1, Ian Dry 5, Huijun Wu 2*, Jiang Lu 1,4** 1 College

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 8 29, pages 126 132 doi:1.193/bioinformatics/btp113 Gene expression Statistical inferences for isoform expression in RNA-Seq Hui Jiang 1 and Wing Hung Wong 2,

More information