Paired-End Read Length Lower Bounds for Genome Re-sequencing

Size: px

Start display at page:

Download "Paired-End Read Length Lower Bounds for Genome Re-sequencing"

Madeline Blair
5 years ago
Views:

1 1/11 Paired-End Read Length Lower Bounds for Genome Re-sequencing Rayan Chikhi ENS Cachan Brittany PhD student in the Symbiose team, Irisa, France

2 2/11 NEXT-GENERATION SEQUENCING Next-gen vs. traditional (Sanger) DNA sequencers : Throughput Shorter reads (starting at 25 bp) Cost Paired reads (both ends of a fragment High quality reads of size k nt) ( 0.4% error rate per base on Illumina)

3 2/11 NEXT-GENERATION SEQUENCING Next-gen vs. traditional (Sanger) DNA sequencers : Throughput Cost High quality reads ( 0.4% error rate per base on Illumina) Some bioinformatics applications : Genome re-sequencing De novo (w/out reference genome) and comparative assembly Shorter reads (starting at 25 bp) Paired reads (both ends of a fragment of size k nt) Topic of this talk : Determine when NGS paired read length is too short for re-sequencing.

3/11 CONTEXT : GENOME RE-SEQUENCING Re-sequencing : align reads to a reference sequence to improve it and detect SNPs/indels Resequencing ambiguity : map: AACGTATGCA to:

4 3/11 CONTEXT : GENOME RE-SEQUENCING Re-sequencing : align reads to a reference sequence to improve it and detect SNPs/indels Resequencing ambiguity : map: AACGTATGCA to: -AACGTTTGCA AACGTTTGCA-- Micro-repeated (30-50 nt) regions are : -not re-sequencable with short reads -not fully predicted by simple models (such as BLAST s E-value) nor RepeatMasker Solution : analyze actual genomes [Whiteford et al, 2005] : perfect uniqueness of single reads. Genomes Viral Bacterial Small eukaryote Human Read length for 12 nt 18 nt nt nt max uniqueness (100%) (97%) (90-95%) (85-95%)

5 SEQUENCERS CHART 6 5 Throughput (Gb/run) ABI SOLiD 2 (p) Illumina GA II (p) 1 Illumina GA ABI SOLiD Ti (p) 454 GS FLX Read length (nt) 4/11

6 SEQUENCERS CHART 6 Throughput (Gb/run) E. coli, 97% of reads map uniquely ABI SOLiD 2 (p) Human genome, 95% of reads map uniquely Illumina GA II (p) Human genome chr 1, 80% of genome covered by contigs > 10k nt 1 ABI SOLiD 1 Illumina GA 454 GS Read length (nt) 5/11

7 6/11 METHODS We study the perfect uniqueness of mate-paired reads : (σ, δ)-pair : r 1 r 2 σ ± δ Example values for Illumina [Lee 09] : σ 150, δ 15 a (σ, δ)-pair (r 1, r 2 ) is unique there is no other (r 1, r 2 ) pair distant of σ ± δ in the genome ---AACGT---TTGCA AACGT-----TTGCA-- Here, the (3, 2)-pair AACGT---TTGCA is not unique. U = number of unique (σ, δ)-pairs number of (σ, δ)-pairs

8 7/11 METHODS, ALGORITHM Build a suffix array and lcp of the genome sequence We developed a novel pairs-counting algorithm based on a suffix array. Here, δ = 0 case is shown. Complexity : O(n + nδ) time and memory For each read length l, do: Find duplicate reads For each r such that lcp[r]=l, dupe[r]=1 Determine if each (σ,0)-pair is unique For each r s.t. dupe[r]=1, pair[r,r+l+σ]++ For each r s.t. dupe[r]=1, If (pair[r,r+l+σ]>=2) found_pair(r) } Variant of RepAnalyse

9 7/11 METHODS, ALGORITHM Build a suffix array and lcp of the genome sequence Notes : One-time computation, but high memory usage. RepAnalyse on the human genome : 2 days Our algorithm on Medaka genome, δ = 0 : 12 hrs. For each read length l, do: Find duplicate reads For each r such that lcp[r]=l, dupe[r]=1 Determine if each (σ,0)-pair is unique For each r s.t. dupe[r]=1, pair[r,r+l+σ]++ For each r s.t. dupe[r]=1, If (pair[r,r+l+σ]>=2) found_pair(r) } Variant of RepAnalyse

10 8/11 RESULTS : Comparison between paired and single uniqueness 100 Lambda-phage (50 kbp) E. coli (4.7 Mbp) H. sapiens chr X (154 Mbp) Unique reads (%) Read length (nt) paired unpaired Fugu (408 Mbp) Medaka (886 Mbp) both strands are considered, and (σ, δ)=(300,0)

11 9/11 RESULTS : Evaluation of mate-pair separation vs. uniqueness in the E. coli genome 10% σ = 9000, δ = 5% 0% 10% σ = 5000, δ = 5% 0% 10% σ = 1000, δ = 5% 0% 10% σ = 850, δ = 4% 0% 10% σ = 600, δ = 5% 0% 10% σ = 350, δ = 4% 0% 10% σ = 100, δ = 5% 0% Uniqueness of reads in the E. coli genome Read length (nt) % 99.75% % 99.67% 99.73% % 98.90% 98.99% 99.13% 98.77% 98.85% 98.96% 98.53% 98.57% 98.66% 98.25% 98.26% 98.32% 97.85% 97.85% 97.88%

12 10/11 CONCLUSION Conclusion : Best recipe for paired-end sequencing : 1. increase mate-pair separation 2. keep variability of separation as low as possible 3. use longer read lengths Given perfect separation precision, short (estimate : nt) mate-paired reads should map uniquely to 95% of the human genome. Perspectives : Use statistical models to obtain bounds for approximate uniqueness of reads. Find theoretical lower bounds for de novo assembly. Study the time complexity of paired-end de novo assembly (prelim results : NP-hardness of several models).

13 11/11 ACKNOWLEDGEMENTS Dominique Lavenier, PhD advisor Aurélien Biogenouest genomics center Symbiose team and ENS Cachan Britanny CS dept Any Question?

14 12/11 SUPPL. MATERIAL : Is the right mate always in the same strand? [H. Li et al, 2008] Correct paired-end reads : SOLiD : Right mate always on the same strand Illumina GA : Right mate always on the other strand 100 E. coli (4.7 Mbp) If one wishes to perform structural variation detection, then the uniqueness of both correct and discordant reads matters. Unique reads (%) right mate in forward strang right mate in reverse strand right mate in any strand Read length (nt)

15 SUPPL. MATERIAL : Comparison between paired, 2x paired and single uniqueness 100 Lambda-phage (50 kbp) E. coli (4.7 Mbp) H. sapiens chr X (154 Mbp) Unique reads (%) Read length (nt) paired paired (with read length x2) unpaired Fugu (408 Mbp) Medaka (886 Mbp) both strands are considered, and (σ, δ)=(300,0) 13/11

16 14/11 SUPPL. MATERIAL : Near-perfect micro-repeats Counting only perfect micro-repeats gives a upper bound on unicity, hence a lower bound on read length.

Cycle «Analyse de données de séquençage à haut-débit»

Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN Chadi Saad CRIStAL - Équipe BONSAI - Univ Lille, CNRS, INRIA (chadi.saad@univ-lille.fr) Présentation de Sophie Gallina (source: