Paired-End Read Length Lower Bounds for Genome Re-sequencing

Size: px
Start display at page:

Download "Paired-End Read Length Lower Bounds for Genome Re-sequencing"

Transcription

1 1/11 Paired-End Read Length Lower Bounds for Genome Re-sequencing Rayan Chikhi ENS Cachan Brittany PhD student in the Symbiose team, Irisa, France

2 2/11 NEXT-GENERATION SEQUENCING Next-gen vs. traditional (Sanger) DNA sequencers : Throughput Shorter reads (starting at 25 bp) Cost Paired reads (both ends of a fragment High quality reads of size k nt) ( 0.4% error rate per base on Illumina)

3 2/11 NEXT-GENERATION SEQUENCING Next-gen vs. traditional (Sanger) DNA sequencers : Throughput Cost High quality reads ( 0.4% error rate per base on Illumina) Some bioinformatics applications : Genome re-sequencing De novo (w/out reference genome) and comparative assembly Shorter reads (starting at 25 bp) Paired reads (both ends of a fragment of size k nt) Topic of this talk : Determine when NGS paired read length is too short for re-sequencing.

4 3/11 CONTEXT : GENOME RE-SEQUENCING Re-sequencing : align reads to a reference sequence to improve it and detect SNPs/indels Resequencing ambiguity : map: AACGTATGCA to: -AACGTTTGCA AACGTTTGCA-- Micro-repeated (30-50 nt) regions are : -not re-sequencable with short reads -not fully predicted by simple models (such as BLAST s E-value) nor RepeatMasker Solution : analyze actual genomes [Whiteford et al, 2005] : perfect uniqueness of single reads. Genomes Viral Bacterial Small eukaryote Human Read length for 12 nt 18 nt nt nt max uniqueness (100%) (97%) (90-95%) (85-95%)

5 SEQUENCERS CHART 6 5 Throughput (Gb/run) ABI SOLiD 2 (p) Illumina GA II (p) 1 Illumina GA ABI SOLiD Ti (p) 454 GS FLX Read length (nt) 4/11

6 SEQUENCERS CHART 6 Throughput (Gb/run) E. coli, 97% of reads map uniquely ABI SOLiD 2 (p) Human genome, 95% of reads map uniquely Illumina GA II (p) Human genome chr 1, 80% of genome covered by contigs > 10k nt 1 ABI SOLiD 1 Illumina GA 454 GS Read length (nt) 5/11

7 6/11 METHODS We study the perfect uniqueness of mate-paired reads : (σ, δ)-pair : r 1 r 2 σ ± δ Example values for Illumina [Lee 09] : σ 150, δ 15 a (σ, δ)-pair (r 1, r 2 ) is unique there is no other (r 1, r 2 ) pair distant of σ ± δ in the genome ---AACGT---TTGCA AACGT-----TTGCA-- Here, the (3, 2)-pair AACGT---TTGCA is not unique. U = number of unique (σ, δ)-pairs number of (σ, δ)-pairs

8 7/11 METHODS, ALGORITHM Build a suffix array and lcp of the genome sequence We developed a novel pairs-counting algorithm based on a suffix array. Here, δ = 0 case is shown. Complexity : O(n + nδ) time and memory For each read length l, do: Find duplicate reads For each r such that lcp[r]=l, dupe[r]=1 Determine if each (σ,0)-pair is unique For each r s.t. dupe[r]=1, pair[r,r+l+σ]++ For each r s.t. dupe[r]=1, If (pair[r,r+l+σ]>=2) found_pair(r) } Variant of RepAnalyse

9 7/11 METHODS, ALGORITHM Build a suffix array and lcp of the genome sequence Notes : One-time computation, but high memory usage. RepAnalyse on the human genome : 2 days Our algorithm on Medaka genome, δ = 0 : 12 hrs. For each read length l, do: Find duplicate reads For each r such that lcp[r]=l, dupe[r]=1 Determine if each (σ,0)-pair is unique For each r s.t. dupe[r]=1, pair[r,r+l+σ]++ For each r s.t. dupe[r]=1, If (pair[r,r+l+σ]>=2) found_pair(r) } Variant of RepAnalyse

10 8/11 RESULTS : Comparison between paired and single uniqueness 100 Lambda-phage (50 kbp) E. coli (4.7 Mbp) H. sapiens chr X (154 Mbp) Unique reads (%) Read length (nt) paired unpaired Fugu (408 Mbp) Medaka (886 Mbp) both strands are considered, and (σ, δ)=(300,0)

11 9/11 RESULTS : Evaluation of mate-pair separation vs. uniqueness in the E. coli genome 10% σ = 9000, δ = 5% 0% 10% σ = 5000, δ = 5% 0% 10% σ = 1000, δ = 5% 0% 10% σ = 850, δ = 4% 0% 10% σ = 600, δ = 5% 0% 10% σ = 350, δ = 4% 0% 10% σ = 100, δ = 5% 0% Uniqueness of reads in the E. coli genome Read length (nt) % 99.75% % 99.67% 99.73% % 98.90% 98.99% 99.13% 98.77% 98.85% 98.96% 98.53% 98.57% 98.66% 98.25% 98.26% 98.32% 97.85% 97.85% 97.88%

12 10/11 CONCLUSION Conclusion : Best recipe for paired-end sequencing : 1. increase mate-pair separation 2. keep variability of separation as low as possible 3. use longer read lengths Given perfect separation precision, short (estimate : nt) mate-paired reads should map uniquely to 95% of the human genome. Perspectives : Use statistical models to obtain bounds for approximate uniqueness of reads. Find theoretical lower bounds for de novo assembly. Study the time complexity of paired-end de novo assembly (prelim results : NP-hardness of several models).

13 11/11 ACKNOWLEDGEMENTS Dominique Lavenier, PhD advisor Aurélien Biogenouest genomics center Symbiose team and ENS Cachan Britanny CS dept Any Question?

14 12/11 SUPPL. MATERIAL : Is the right mate always in the same strand? [H. Li et al, 2008] Correct paired-end reads : SOLiD : Right mate always on the same strand Illumina GA : Right mate always on the other strand 100 E. coli (4.7 Mbp) If one wishes to perform structural variation detection, then the uniqueness of both correct and discordant reads matters. Unique reads (%) right mate in forward strang right mate in reverse strand right mate in any strand Read length (nt)

15 SUPPL. MATERIAL : Comparison between paired, 2x paired and single uniqueness 100 Lambda-phage (50 kbp) E. coli (4.7 Mbp) H. sapiens chr X (154 Mbp) Unique reads (%) Read length (nt) paired paired (with read length x2) unpaired Fugu (408 Mbp) Medaka (886 Mbp) both strands are considered, and (σ, δ)=(300,0) 13/11

16 14/11 SUPPL. MATERIAL : Near-perfect micro-repeats Counting only perfect micro-repeats gives a upper bound on unicity, hence a lower bound on read length.

Cycle «Analyse de données de séquençage à haut-débit»

Cycle «Analyse de données de séquençage à haut-débit» Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN Chadi Saad CRIStAL - Équipe BONSAI - Univ Lille, CNRS, INRIA (chadi.saad@univ-lille.fr) Présentation de Sophie Gallina (source:

More information

Mapping-free and Assembly-free Discovery of Inversion Breakpoints from Raw NGS Reads

Mapping-free and Assembly-free Discovery of Inversion Breakpoints from Raw NGS Reads 1st International Conference on Algorithms for Computational Biology AlCoB 2014 Tarragona, Spain, July 1-3, 2014 Mapping-free and Assembly-free Discovery of Inversion Breakpoints from Raw NGS Reads Claire

More information

Genomes Comparision via de Bruijn graphs

Genomes Comparision via de Bruijn graphs Genomes Comparision via de Bruijn graphs Student: Ilya Minkin Advisor: Son Pham St. Petersburg Academic University June 4, 2012 1 / 19 Synteny Blocks: Algorithmic challenge Suppose that we are given two

More information

High-throughput sequence alignment. November 9, 2017

High-throughput sequence alignment. November 9, 2017 High-throughput sequence alignment November 9, 2017 a little history human genome project #1 (many U.S. government agencies and large institute) started October 1, 1990. Goal: 10x coverage of human genome,

More information

Genome Assembly. Sequencing Output. High Throughput Sequencing

Genome Assembly. Sequencing Output. High Throughput Sequencing Genome High Throughput Sequencing Sequencing Output Example applications: Sequencing a genome (DNA) Sequencing a transcriptome and gene expression studies (RNA) ChIP (chromatin immunoprecipitation) Example

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Sequence Assembly

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Sequence Assembly CMPS 6630: Introduction to Computational Biology and Bioinformatics Sequence Assembly Why Genome Sequencing? Sanger (1982) introduced chaintermination sequencing. Main idea: Obtain fragments of all possible

More information

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) inuscu1-35bp 5 _ 0 _ 5 _ What is Annotation inuscu2-75bp luscu1-75bp 0 _ 5 _ Reconstruction

More information

Pan-genomics: theory & practice

Pan-genomics: theory & practice Pan-genomics: theory & practice Michael Schatz Sept 20, 2014 GRC Assembly Workshop #gi2014 / @mike_schatz Part 1: Theory Advances in Assembly! Perfect Human Assembly First PacBio RS @ CSHL Perfect Microbes

More information

Designing and Testing a New DNA Fragment Assembler VEDA-2

Designing and Testing a New DNA Fragment Assembler VEDA-2 Designing and Testing a New DNA Fragment Assembler VEDA-2 Mark K. Goldberg Darren T. Lim Rensselaer Polytechnic Institute Computer Science Department {goldberg, limd}@cs.rpi.edu Abstract We present VEDA-2,

More information

Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Pyrobayes: an improved base caller for SNP discovery in pyrosequences Pyrobayes: an improved base caller for SNP discovery in pyrosequences Aaron R Quinlan, Donald A Stewart, Michael P Strömberg & Gábor T Marth Supplementary figures and text: Supplementary Figure 1. The

More information

General context Anchor-based method Evaluation Discussion. CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment.

General context Anchor-based method Evaluation Discussion. CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment. CoCoGen meeting Accuracy of the anchor-based strategy for genome alignment Raluca Uricaru LIRMM, CNRS Université de Montpellier 2 3 octobre 2008 1 / 31 Summary 1 General context 2 Global alignment : anchor-based

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Introduction to de novo RNA-seq assembly

Introduction to de novo RNA-seq assembly Introduction to de novo RNA-seq assembly Introduction Ideal day for a molecular biologist Ideal Sequencer Any type of biological material Genetic material with high quality and yield Cutting-Edge Technologies

More information

SJÄLVSTÄNDIGA ARBETEN I MATEMATIK

SJÄLVSTÄNDIGA ARBETEN I MATEMATIK SJÄLVSTÄNDIGA ARBETEN I MATEMATIK MATEMATISKA INSTITUTIONEN, STOCKHOLMS UNIVERSITET Analysing k-mer distributions in a genome sequencing project av Josene Röhss 2014 - No 22 MATEMATISKA INSTITUTIONEN,

More information

Single-cell genomics applied to the picobiliphytes using next-generation sequencing

Single-cell genomics applied to the picobiliphytes using next-generation sequencing Department of Ecology, Evolution and Natural Resources and Institute of Marine and Coastal Sciences Rutgers University, NJ 08901 Single-cell genomics applied to the picobiliphytes using next-generation

More information

Discovery of Genomic Structural Variations with Next-Generation Sequencing Data

Discovery of Genomic Structural Variations with Next-Generation Sequencing Data Discovery of Genomic Structural Variations with Next-Generation Sequencing Data Advanced Topics in Computational Genomics Slides from Marcel H. Schulz, Tobias Rausch (EMBL), and Kai Ye (Leiden University)

More information

De novo assembly and genotyping of variants using colored de Bruijn graphs

De novo assembly and genotyping of variants using colored de Bruijn graphs De novo assembly and genotyping of variants using colored de Bruijn graphs Iqbal et al. 2012 Kolmogorov Mikhail 2013 Challenges Detecting genetic variants that are highly divergent from a reference Detecting

More information

Assembly improvement: based on Ragout approach. student: Anna Lioznova scientific advisor: Son Pham

Assembly improvement: based on Ragout approach. student: Anna Lioznova scientific advisor: Son Pham Assembly improvement: based on Ragout approach student: Anna Lioznova scientific advisor: Son Pham Plan Ragout overview Datasets Assembly improvements Quality overlap graph paired-end reads Coverage Plan

More information

Algorithmics and Bioinformatics

Algorithmics and Bioinformatics Algorithmics and Bioinformatics Gregory Kucherov and Philippe Gambette LIGM/CNRS Université Paris-Est Marne-la-Vallée, France Schedule Course webpage: https://wikimpri.dptinfo.ens-cachan.fr/doku.php?id=cours:c-1-32

More information

Transcriptional Circuitry and the Regulatory Conformation of the Genome

Transcriptional Circuitry and the Regulatory Conformation of the Genome Transcriptional Circuitry and the Regulatory Conformation of the Genome I-CORE for Gene Regulation in Complex Human Disease. Feb. 1 st, 2015 The Hebrew University Introduction to Next-generation sequencing

More information

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Ibrahim Ali H Nafisah Department of Statistics University of Leeds Submitted in accordance with the requirments for the

More information

Bio 119 Bacterial Genomics 6/26/10

Bio 119 Bacterial Genomics 6/26/10 BACTERIAL GENOMICS Reading in BOM-12: Sec. 11.1 Genetic Map of the E. coli Chromosome p. 279 Sec. 13.2 Prokaryotic Genomes: Sizes and ORF Contents p. 344 Sec. 13.3 Prokaryotic Genomes: Bioinformatic Analysis

More information

Whole genome sequencing (WGS) as a tool for monitoring purposes. Henrik Hasman DTU - Food

Whole genome sequencing (WGS) as a tool for monitoring purposes. Henrik Hasman DTU - Food Whole genome sequencing (WGS) as a tool for monitoring purposes Henrik Hasman DTU - Food The Challenge Is to: Continue to increase the power of surveillance and diagnostic using molecular tools Develop

More information

Whole genome sequencing (WGS) - there s a new tool in town. Henrik Hasman DTU - Food

Whole genome sequencing (WGS) - there s a new tool in town. Henrik Hasman DTU - Food Whole genome sequencing (WGS) - there s a new tool in town Henrik Hasman DTU - Food Welcome to the NGS world TODAY Welcome Introduction to Next Generation Sequencing DNA purification (Hands-on) Lunch (Sandwishes

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

New Contig Creation Algorithm for the de novo DNA Assembly Problem

New Contig Creation Algorithm for the de novo DNA Assembly Problem New Contig Creation Algorithm for the de novo DNA Assembly Problem Mohammad Goodarzi Computer Science A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in

More information

CHAPTER : Prokaryotic Genetics

CHAPTER : Prokaryotic Genetics CHAPTER 13.3 13.5: Prokaryotic Genetics 1. Most bacteria are not pathogenic. Identify several important roles they play in the ecosystem and human culture. 2. How do variations arise in bacteria considering

More information

High-throughput sequencing: Alignment and related topic

High-throughput sequencing: Alignment and related topic High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg HTS Platforms E s ta b lis h e d p la tfo rm s Illu m in a H is e q, A B I S O L id, R o c h e 4 5 4 N e w c o m e rs

More information

Discrete Probability Distributions: Poisson Applications to Genome Assembly

Discrete Probability Distributions: Poisson Applications to Genome Assembly Discrete Probability Distributions: Poisson Applications to Genome Assembly Binomial Distribution Binomially distributed random variable X equals sum of n independent Bernoulli trials The probability mass

More information

Unfixed endogenous retroviral insertions in the human population. Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw

Unfixed endogenous retroviral insertions in the human population. Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw Unfixed endogenous retroviral insertions in the human population Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis and Robert Belshaw Supplementary Methods Common sources of 'false positives' in mining

More information

G4120: Introduction to Computational Biology

G4120: Introduction to Computational Biology ICB Fall 2009 G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology & Immunology Copyright 2008 Oliver Jovanovic, All Rights Reserved. Genome

More information

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq Institute for Computational Biomedicine ChIP seq peak calling Statistical integration between ChIP seq and RNA seq Olivier Elemento, PhD ChIP-seq to map where transcription factors bind DNA Transcription

More information

Towards More Effective Formulations of the Genome Assembly Problem

Towards More Effective Formulations of the Genome Assembly Problem Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25 2 / 25 CENTRAL DOGMA OF BIOLOGY

More information

Probabilistic Arithmetic Automata

Probabilistic Arithmetic Automata Probabilistic Arithmetic Automata Applications of a Stochastic Computational Framework in Biological Sequence Analysis Inke Herms PhD thesis defense Overview 1 Probabilistic Arithmetic Automata 2 Application

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Multi-Assembly Problems for RNA Transcripts

Multi-Assembly Problems for RNA Transcripts Multi-Assembly Problems for RNA Transcripts Alexandru Tomescu Department of Computer Science University of Helsinki Joint work with Veli Mäkinen, Anna Kuosmanen, Romeo Rizzi, Travis Gagie, Alex Popa CiE

More information

Introduction to Molecular and Cell Biology

Introduction to Molecular and Cell Biology Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the molecular basis of disease? What

More information

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology 2012 Univ. 1301 Aguilera Lecture Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the

More information

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109 CONTENTS ix Preface xv Acknowledgments xxi Editors and contributors xxiv A computational micro primer xxvi P A R T I Genomes 1 1 Identifying the genetic basis of disease 3 Vineet Bafna 2 Pattern identification

More information

Predictive Genome Analysis Using Partial DNA Sequencing Data

Predictive Genome Analysis Using Partial DNA Sequencing Data Predictive Genome Analysis Using Partial DNA Sequencing Data Nauman Ahmed, Koen Bertels and Zaid Al-Ars Computer Engineering Lab, Delft University of Technology, Delft, The Netherlands {n.ahmed, k.l.m.bertels,

More information

23/01/2018. PiRATE: a Pipeline to Retrieve and Annotate TEs of non-model organisms. Transposable elements (TEs) Impact of TEs on genomes

23/01/2018. PiRATE: a Pipeline to Retrieve and Annotate TEs of non-model organisms. Transposable elements (TEs) Impact of TEs on genomes Transposable elements () PiRATE: a Pipeline to Retrieve and Annotate of non-model organisms are DNAsequences able to move (= transposition) into the host genome of eucaryotic and procaryotic organisms

More information

Bioinformatics Practical for Biochemists

Bioinformatics Practical for Biochemists Bioinformatics Practical for Biochemists Andrei Lupas, Birte Höcker, Steffen Schmidt WS 2013/2014 01. DNA & Genomics!! 1 Description Lectures about general topics in Bioinformatics & History Tutorials

More information

Bioinformatics Practical for Biochemists

Bioinformatics Practical for Biochemists Bioinformatics Practical for Biochemists Andrei Lupas, Birte Höcker, Steffen Schmidt WS 2012/2013 01. DNA & Genomics 1 Description Lectures about general topics in Bioinformatics & History Tutorials will

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS The Saguaro Genome Toward the Ecological Genomics of a Sonoran Desert Icon Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS Why study a genome? - the genome contains the genetic information of an

More information

Heterozygous BMN lines

Heterozygous BMN lines Optical density at 80 hours 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 a YPD b YPD + 1µM nystatin c YPD + 2µM nystatin d YPD + 4µM nystatin 1 3 5 6 9 13 16 20 21 22 23 25 28 29 30

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 389; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs06.html 1/12/06 CAP5510/CGS5166 1 Evaluation

More information

Hidden Markov Models and some applications

Hidden Markov Models and some applications Oleg Makhnin New Mexico Tech Dept. of Mathematics November 11, 2011 HMM description Application to genetic analysis Applications to weather and climate modeling Discussion HMM description Application to

More information

Natural Variation, Genomics and Biotechnology

Natural Variation, Genomics and Biotechnology Natural Variation, Genomics and Biotechnology My royal ancestry Stefan Jansson (1959-) - Professor Britt-Greta Jansson (1936-) Office clerk Karl Ljungberg (1911-1998) - Shopkeeper Olof Ljungberg (1884-1944)

More information

Genome Assembly Results, Protocol & Demo

Genome Assembly Results, Protocol & Demo Genome Assembly Results, Protocol & Demo Monday, March 7, 2016 BIOL 7210: Genome Assembly Group Aroon Chande, Cheng Chen, Alicia Francis, Alli Gombolay, Namrata Kalsi, Ellie Kim, Tyrone Lee, Wilson Martin,

More information

K-means-based Feature Learning for Protein Sequence Classification

K-means-based Feature Learning for Protein Sequence Classification K-means-based Feature Learning for Protein Sequence Classification Paul Melman and Usman W. Roshan Department of Computer Science, NJIT Newark, NJ, 07102, USA pm462@njit.edu, usman.w.roshan@njit.edu Abstract

More information

Hapsembler version 2.1 ( + Encore & Scarpa) Manual. Nilgun Donmez Department of Computer Science University of Toronto

Hapsembler version 2.1 ( + Encore & Scarpa) Manual. Nilgun Donmez Department of Computer Science University of Toronto Hapsembler version 2.1 ( + Encore & Scarpa) Manual Nilgun Donmez Department of Computer Science University of Toronto January 13, 2013 Contents 1 Introduction.................................. 2 2 Installation..................................

More information

L3: Blast: Keyword match basics

L3: Blast: Keyword match basics L3: Blast: Keyword match basics Fa05 CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with exactly the same number of hairs! Assignment 1 is online Due 10/6

More information

A DNA Sequence 2017/12/6 1

A DNA Sequence 2017/12/6 1 A DNA Sequence ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgg gtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagc ggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttc gcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgcta

More information

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 1 PILER: identification and classification of genomic repeats Robert C. Edgar 1* and Eugene W. Myers 2 1 195 Roque Moraes Drive, Mill Valley, CA, U.S.A., bob@drive5.com.

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation.

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation. Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation. Inlets are labelled in blue, outlets are labelled in red, and static channels

More information

Topic 3: Genetics (Student) Essential Idea: Chromosomes carry genes in a linear sequence that is shared by members of a species.

Topic 3: Genetics (Student) Essential Idea: Chromosomes carry genes in a linear sequence that is shared by members of a species. Topic 3: Genetics (Student) 3.2 Essential Idea: Chromosomes carry genes in a linear sequence that is shared by members of a species. 3.2 Chromosomes 3.2.U1 Prokaryotes have one chromosome consisting of

More information

Overview of Research at Bioinformatics Lab

Overview of Research at Bioinformatics Lab Overview of Research at Bioinformatics Lab Li Liao Develop new algorithms and (statistical) learning methods that help solve biological problems > Capable of incorporating domain knowledge > Effective,

More information

Hidden Markov Models and some applications

Hidden Markov Models and some applications Oleg Makhnin New Mexico Tech Dept. of Mathematics November 11, 2011 HMM description Application to genetic analysis Applications to weather and climate modeling Discussion HMM description Hidden Markov

More information

Linear-Space Alignment

Linear-Space Alignment Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x

More information

RGP finder: prediction of Genomic Islands

RGP finder: prediction of Genomic Islands Training courses on MicroScope platform RGP finder: prediction of Genomic Islands Dynamics of bacterial genomes Gene gain Horizontal gene transfer Gene loss Deletion of one or several genes Duplication

More information

More Codon Usage Bias

More Codon Usage Bias .. CSC448 Bioinformatics Algorithms Alexander Dehtyar.. DA Sequence Evaluation Part II More Codon Usage Bias Scaled χ 2 χ 2 measure. In statistics, the χ 2 statstic computes how different the distribution

More information

On the Sound Covering Cycle Problem in Paired de Bruijn Graphs

On the Sound Covering Cycle Problem in Paired de Bruijn Graphs On the Sound Covering Cycle Problem in Paired de Bruijn Graphs Christian Komusiewicz 1 and Andreea Radulescu 2 1 Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany christian.komusiewicz@tu-berlin.de

More information

Fitness constraints on horizontal gene transfer

Fitness constraints on horizontal gene transfer Fitness constraints on horizontal gene transfer Dan I Andersson University of Uppsala, Department of Medical Biochemistry and Microbiology, Uppsala, Sweden GMM 3, 30 Aug--2 Sep, Oslo, Norway Acknowledgements:

More information

FINISHING MY FOSMID XAAA113 Andrew Nett

FINISHING MY FOSMID XAAA113 Andrew Nett FINISHING MY FOSMID XAAA113 Andrew Nett The assembly of fosmid XAAA113 initially consisted of three contigs with two gaps (Fig 1). This assembly was based on the sequencing reactions of subclones set up

More information

Introduction to Sequence Alignment. Manpreet S. Katari

Introduction to Sequence Alignment. Manpreet S. Katari Introduction to Sequence Alignment Manpreet S. Katari 1 Outline 1. Global vs. local approaches to aligning sequences 1. Dot Plots 2. BLAST 1. Dynamic Programming 3. Hash Tables 1. BLAT 4. BWT (Burrow Wheeler

More information

Y chromosome dynamics in Drosophila. Amanda Larracuente Department of Biology

Y chromosome dynamics in Drosophila. Amanda Larracuente Department of Biology Y chromosome dynamics in Drosophila Amanda Larracuente Department of Biology Sex chromosomes J. Graves X X X Y Sex chromosome evolution Autosomes Proto-sex chromosomes Sex determining Suppressed recombination

More information

RNA- seq read mapping

RNA- seq read mapping RNA- seq read mapping Pär Engström SciLifeLab RNA- seq workshop October 216 IniDal steps in RNA- seq data processing 1. Quality checks on reads 2. Trim 3' adapters (opdonal (for species with a reference

More information

Genome Sequencing & Assembly Michael Schatz. May 2, 2013 Human Microbiome Consortium

Genome Sequencing & Assembly Michael Schatz. May 2, 2013 Human Microbiome Consortium Genome Sequencing & Assembly Michael Schatz May 2, 2013 Human Microbiome Consortium Outline 1. Assembly theory 1. Assembly by analogy 2. De Bruijn and Overlap graph 3. Coverage, read length, errors, and

More information

A Browser for Pig Genome Data

A Browser for Pig Genome Data A Browser for Pig Genome Data Thomas Mailund January 2, 2004 This report briefly describe the blast and alignment data available at http://www.daimi.au.dk/ mailund/pig-genome/ hits.html. The report describes

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Centrifuge: rapid and sensitive classification of metagenomic sequences

Centrifuge: rapid and sensitive classification of metagenomic sequences Centrifuge: rapid and sensitive classification of metagenomic sequences Daehwan Kim, Li Song, Florian P. Breitwieser, and Steven L. Salzberg Supplementary Material Supplementary Table 1 Supplementary Note

More information

Computer Algebra Systems Activity: Developing the Quadratic Formula

Computer Algebra Systems Activity: Developing the Quadratic Formula Computer Algebra Systems Activity: Developing the Quadratic Formula Topic: Developing the Quadratic Formula R. Meisel, Feb 19, 2005 rollym@vaxxine.com Ontario Expectations: (To be added when finalized

More information

Read Quality Assessment & Improvement. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

Read Quality Assessment & Improvement. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014 Read Quality ssessment & Improvement J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014 Error modes Each technology has unique error modes, depending on the physico-chemical processes involved

More information

Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism

Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism Ling Yin 1,3,, Yunhe An 1,2,, Junjie Qu 3,, Xinlong Li 1, Yali Zhang 1, Ian Dry 5, Huijun Wu 2*, Jiang Lu 1,4** 1 College

More information

Ching-Han Hsu, BMES, National Tsing Hua University c 2015 by Ching-Han Hsu, Ph.D., BMIR Lab. = a + b 2. b a. x a b a = 12

Ching-Han Hsu, BMES, National Tsing Hua University c 2015 by Ching-Han Hsu, Ph.D., BMIR Lab. = a + b 2. b a. x a b a = 12 Lecture 5 Continuous Random Variables BMIR Lecture Series in Probability and Statistics Ching-Han Hsu, BMES, National Tsing Hua University c 215 by Ching-Han Hsu, Ph.D., BMIR Lab 5.1 1 Uniform Distribution

More information

Applications of genome alignment

Applications of genome alignment Applications of genome alignment Comparing different genome assemblies Locating genome duplications and conserved segments Gene finding through comparative genomics Analyzing pathogenic bacteria against

More information

Encoding and Decoding 3D Genome Organization

Encoding and Decoding 3D Genome Organization Encoding and Decoding 3D Genome Organization Noam Kaplan Dekker Lab Program in Systems Biology University of Massachusetts Medical School Chromatin organization 10000nm Human genome: 2x3000 Mbp Human chromosome:

More information

Introduction to microbiota data analysis

Introduction to microbiota data analysis Introduction to microbiota data analysis Natalie Knox, PhD Head Bacterial Genomics, Bioinformatics Core National Microbiology Laboratory, Public Health Agency of Canada 2 National Microbiology Laboratory

More information

Hidden Markov Models. Ron Shamir, CG 08

Hidden Markov Models. Ron Shamir, CG 08 Hidden Markov Models 1 Dr Richard Durbin is a graduate in mathematics from Cambridge University and one of the founder members of the Sanger Institute. He has also held carried out research at the Laboratory

More information

GEP Annotation Report

GEP Annotation Report GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:

More information

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Maheshi Dassanayake 1,9, Dong-Ha Oh 1,9, Jeffrey S. Haas 1,2, Alvaro Hernandez 3, Hyewon Hong 1,4, Shahjahan

More information

Microbiome: 16S rrna Sequencing 3/30/2018

Microbiome: 16S rrna Sequencing 3/30/2018 Microbiome: 16S rrna Sequencing 3/30/2018 Skills from Previous Lectures Central Dogma of Biology Lecture 3: Genetics and Genomics Lecture 4: Microarrays Lecture 12: ChIP-Seq Phylogenetics Lecture 13: Phylogenetics

More information

Is inversion symmetry of chromosomes a law of nature?

Is inversion symmetry of chromosomes a law of nature? Is inversion symmetry of chromosomes a law of nature? David Horn TAU Safra bioinformatics retreat, 28/6/2018 Lecture based on Inversion symmetry of DNA k-mer counts: validity and deviations. Shporer S,

More information

Nature Genetics: doi:0.1038/ng.2768

Nature Genetics: doi:0.1038/ng.2768 Supplementary Figure 1: Graphic representation of the duplicated region at Xq28 in each one of the 31 samples as revealed by acgh. Duplications are represented in red and triplications in blue. Top: Genomic

More information

Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach.

Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach. Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach Valery Kirzhner *1 & Zeev Volkovich 2 1 Institute of Evolution, University

More information

Characterization of Structural Variants with Single Molecule and Hybrid Sequencing Approaches

Characterization of Structural Variants with Single Molecule and Hybrid Sequencing Approaches Bioinformatics Advance Access published October 28, 2014 Characterization of Structural Variants with Single Molecule and Hybrid Sequencing Approaches Anna Ritz 1,,, Ali Bashir 2,3, Suzanne Sindi 4, David

More information

natural development from this collection of knowledge: it is more reliable to predict the property

natural development from this collection of knowledge: it is more reliable to predict the property 1 Chapter 1 Introduction As the basis of all life phenomena, the interaction of biomolecules has been under the scrutiny of scientists and cataloged meticulously [2]. The recent advent of systems biology

More information

PLNT2530 (2018) Unit 5 Genomes: Organization and Comparisons

PLNT2530 (2018) Unit 5 Genomes: Organization and Comparisons PLNT2530 (2018) Unit 5 Genomes: Organization and Comparisons Unless otherwise cited or referenced, all content of this presenataion is licensed under the Creative Commons License Attribution Share-Alike

More information

Bioinformatics course

Bioinformatics course Bioinformatics course Phylogeny and Comparative genomics 10/23/13 1 Contents-phylogeny Introduction-biology, life classificationtaxonomy Phylogenetic-tree of life, tree representation Why study phylogeny?

More information

Title: Field-based species identification of closely-related plants using real-time nanopore sequencing.

Title: Field-based species identification of closely-related plants using real-time nanopore sequencing. Title: Field-based species identification of closely-related plants using real-time nanopore sequencing. Authors: Joe Parker 1 *, Dion Devey 1, Andrew J. Helmstetter 1, Tim Wilkinson 1 & Alexander S.T.

More information

Essentiality in B. subtilis

Essentiality in B. subtilis Essentiality in B. subtilis 100% 75% Essential genes Non-essential genes Lagging 50% 25% Leading 0% non-highly expressed highly expressed non-highly expressed highly expressed 1 http://www.pasteur.fr/recherche/unites/reg/

More information

Supplemental Materials

Supplemental Materials JOURNAL OF MICROBIOLOGY & BIOLOGY EDUCATION, May 2013, p. 107-109 DOI: http://dx.doi.org/10.1128/jmbe.v14i1.496 Supplemental Materials for Engaging Students in a Bioinformatics Activity to Introduce Gene

More information

Biology 105/Summer Bacterial Genetics 8/12/ Bacterial Genomes p Gene Transfer Mechanisms in Bacteria p.

Biology 105/Summer Bacterial Genetics 8/12/ Bacterial Genomes p Gene Transfer Mechanisms in Bacteria p. READING: 14.2 Bacterial Genomes p. 481 14.3 Gene Transfer Mechanisms in Bacteria p. 486 Suggested Problems: 1, 7, 13, 14, 15, 20, 22 BACTERIAL GENETICS AND GENOMICS We still consider the E. coli genome

More information

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem Paola Bonizzoni, Riccardo Dondi, Gunnar W. Klau, Yuri Pirola, Nadia Pisanti and Simone Zaccaria DISCo, computer

More information

Genome Sequencing & DNA Sequence Analysis

Genome Sequencing & DNA Sequence Analysis 7.91 / 7.36 / BE.490 Lecture #1 Feb. 24, 2004 Genome Sequencing & DNA Sequence Analysis Chris Burge What is a Genome? A genome is NOT a bag of proteins What s in the Human Genome? Outline of Unit II: DNA/RNA

More information

Fragment Assembly of DNA

Fragment Assembly of DNA Wright State University CORE Scholar Computer Science and Engineering Faculty Publications Computer Science and Engineering 2003 Fragment Assembly of DNA Dan E. Krane Wright State University - Main Campus,

More information