RNA-seq read mapping
Pär Engström
SciLifeLab RNA-seq workshop, October 2016
Initial steps in RNA-seq data processing
1. Quality checks on reads
2. Trim 3' adapters (optional)
For species with a reference genome:
3. Index reference genome
4. Map reads to genome (output in SAM or BAM format)
5. Convert results to a sorted, indexed BAM file
6. Quality checks on mapped reads
7. Visualize read mappings on the genome
Followed by further analyses
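Steps 3 to 5 might be driven from Python roughly as follows. This is a minimal sketch, not the workshop's actual commands: it assumes HISAT2 and samtools are installed and on the PATH, and the helper name build_commands and all file names are hypothetical. Quality checks and adapter trimming are left out.

```python
import shlex


def build_commands(genome_fasta, index_prefix, reads_1, reads_2, sample):
    """Return the commands for steps 3-5 as argument lists.

    All file names are placeholders chosen for illustration.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Step 3: index the reference genome.
        ["hisat2-build", genome_fasta, index_prefix],
        # Step 4: map paired-end reads, writing SAM output.
        ["hisat2", "-x", index_prefix, "-1", reads_1, "-2", reads_2, "-S", sam],
        # Step 5: coordinate-sort into BAM, then index the BAM file.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
    ]


commands = build_commands("genome.fa", "genome_index",
                          "reads_1.fastq", "reads_2.fastq", "sample")
for cmd in commands:
    # Only print here; pass each list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the commands as lists keeps the sketch testable without the tools installed, and avoids shell quoting problems when they are eventually run.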
Input: sequence reads (FASTQ format)
@HWI-ST118:7:111:1691:46835#/1
CTTCATTTCCCTCCAGTCCCTGGAGGGGCTTCTAGTATTACTGGGACAATGACCACGCTGCCTGTTTGTCTGTGAGTTACGGGCAACCAGCCTC
+
bbbeeeeefgggghiiiiiiiiiiiiiiiiihihihhiiiihiiiiiiiihiiiiiiiiiggggdeeeebddddcbbbcccccccccccacccc
@HWI-ST118:7:111:2937:53143#/1
CGACCAGCTGATCGTGTCTCCAAGGGCAGAAGCACAAGCGGGGAGGCTGGGGTGGCTGCAGCGAGGTCCTCCCTAAGTAGGGCAGGGGAGCCCC
+
bbbeeeeeggfggihihiiiiiiiiiihiiiihiiiiihihigadcccdccczaa^^_acccc_ac_bcccccbb^byabbcbc]a]aet]aca
@HWI-ST118:7:111:14544:66521#/1
GGTGGCTGCAGCGAGGTCCTCCCTAAGTAGGGCAGGGGAGCCCCCAGGTGGGGAGGGCTCATGGGGGCCAGGGAGTAAGGCTGGCTCCCCTGGT
+
bbaeeeeegggggiifghiiiiiihfhfhihiifhigihhiiihigggdcecc^acccccccccaccccccccac^b_bcbccccbbaacba`y
@HWI-ST118:7:111:1545:122666#/1
CCCACCTGCAACTTTCCTCCAAGTGTGGCTCGGAGAAGAAACATCAACAAGGACCCTGGGCTTCGATTCAAAAACTCCTCTGAAGCCATCCATG
+
bbbeeeeegggggiiiiiiihiigieghiii_eu_^cbceghffdhhiiicg`\xaz`ggcdecebcdbb`bcaw_]bbbb]bbbbcbc^`bbb
@HWI-ST118:7:111:14326:133684#/1
CGCCTGCCCAGCAGTGTTTATCCTGGGATCCTCCTATTGGGGTTGAGGGAGGGGAAGACAGCAGGAAGGTTGAGGGAGCAGCAACTTGGCCAGA
+
^\\cccc^y[ybee^bfcegagx_^aeehhheebzpbf_rzeo^_ea]`ye`[wyy^q_xab]zz^z\_ay[gy^anrow^pqxqx`a`xy`p^...
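Each FASTQ record spans four lines: a header starting with '@', the sequence, a '+' separator, and one quality character per base. The sketch below reads such records and converts quality characters to Phred scores. The function names are hypothetical, and the default offset of 64 is an assumption based on the quality characters shown above (mostly 'b' to 'i'); many current data sets use an offset of 33 instead.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=64):
    """Convert quality characters to Phred scores (offset 64 is an assumption)."""
    return [ord(char) - offset for char in quality]


# A shortened, made-up record in the same layout as the slide.
example = io.StringIO("@read1/1\nCTTCATTT\n+\nbbbeeeee\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```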
Goal: reads mapped to genome (SAM format)
HWI-ST118:7:126:3667:137198# 97 chr1 1581284 255 47M2769N47M7S chr2
HWI-ST118:7:235:11836:132357# 177 chr12 137344 255 11S9M chr2
HWI-ST118:7:125:1818:8988# 97 chr12 5163719 255 96M5S chr2 73325
HWI-ST118:7:113:2457:7159# 129 chr19 4554799 255 11M chr2 733155
HWI-ST118:7:117:1423:14655# 99 chr2 73351 255 11M =
HWI-ST118:7:116:168:6339# 163 chr2 733524 255 11M = 7336
HWI-ST118:7:236:199:6213# 99 chr2 733547 255 11M = 7337
HWI-ST118:7:235:8697:195892# 163 chr2 733561 255 4S97M = 7336
HWI-ST118:7:128:124:5258# 99 chr2 733563 255 98M3S = 7336
HWI-ST118:7:117:1423:14655# 147 chr2 733572 255 11M =
HWI-ST118:7:128:1123:715# 99 chr2 733593 255 11M = 7336
HWI-ST118:7:217:11555:46214# 163 chr2 733593 255 11M = 7336
HWI-ST118:7:112:1213:8767# 73 chr2 733594 255 11M = 7335
HWI-ST118:7:112:1213:8767# 133 chr2 733594 * = 7335
HWI-ST118:7:126:3667:137198# 145 chr2 73362 255 11M chr1 15812
HWI-ST118:7:128:16138:8853# 99 chr2 73363 255 11M = 7337
HWI-ST118:7:226:7742:86872# 163 chr2 733621 255 11M = 7336
HWI-ST118:7:138:1466:19516# 99 chr2 733623 255 1S1M = 7338
HWI-ST118:7:231:14871:8111# 99 chr2 733623 255 11M = 7337
HWI-ST118:7:221:13683:6477# 145 chr2 733623 255 11S9M = 7336...
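The second column of each SAM record is a bitwise FLAG and the sixth is a CIGAR string, in which N marks a skipped reference region (an intron in spliced alignments) and S marks soft-clipped bases. The sketch below decodes both. The function names and the start position 1000 are hypothetical; the flag bits and CIGAR operations follow the SAM specification.

```python
import re

# FLAG bits as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def read_length(cigar):
    """Number of read bases covered by the CIGAR (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


def introns(pos, cigar):
    """Return 1-based inclusive (start, end) reference intervals of each N operation."""
    found = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            found.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += length
    return found


print(decode_flag(99))
print(read_length("47M2769N47M7S"))
print(introns(1000, "47M2769N47M7S"))  # start position 1000 is made up
```

For the first CIGAR string on the slide, this reports a 101-base read with one 2,769-bp intron and 7 soft-clipped bases at the end.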
Visualization of read alignments
[Figure: UCSC Genome Browser view of the CXCR4 locus on chr2 (hg19), with tracks for Gm12878 RNA-seq reads, Gm12878 RNA-seq coverage (minus strand), RefSeq Genes, vertebrate basewise conservation by PhyloP, common SNPs (dbSNP 144, found in >= 1% of samples) and repeating elements by RepeatMasker]
Spliced alignment
[Figure]
Garber et al. Nature Methods 2011
Introns can be very large!
[Figure: cumulative proportion of human introns (Ensembl) as a function of intron size (bp)]
Limited sequence signals at splice sites
[Figure: sequence logos of the 5' splice site (5' ss), branch point (BP) and 3' splice site (3' ss) for H. sapiens, D. rerio, D. melanogaster, P. chrysosporium, S. cerevisiae, P. tricornutum and A. thaliana]
Intron terminal dinucleotides:
GT-AG 98.6%
GC-AG 1.3%
AT-AC 0.1%
Iwata and Gotoh BMC Genomics 2011
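Because the splice site signal is essentially limited to the first and last two bases of the intron, checking a candidate intron against it is simple. The following is a minimal sketch that assumes the genome sequence is available as a Python string and that intron coordinates are 0-based and half-open on the forward strand; the function names and toy sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_motif(genome, start, end, strand="+"):
    """Return the terminal dinucleotides of an intron, e.g. 'GT-AG'.

    start and end are 0-based, half-open forward-strand coordinates.
    """
    intron = genome[start:end].upper()
    if strand == "-":
        # Read the intron in the direction of transcription.
        intron = reverse_complement(intron)
    return f"{intron[:2]}-{intron[-2:]}"


# Toy sequences: exon bases 'AAA'/'TTT' flank a short intron.
print(intron_motif("AAAGTCCCCCAGTTT", 3, 12))            # forward-strand intron
print(intron_motif("AAACTGGGGACTTT", 3, 11, strand="-"))  # reverse-strand intron
```

Both calls report GT-AG. A match to one of these dinucleotide pairs is weak evidence on its own, since such short motifs occur by chance throughout the genome.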
Multi-mapping reads and pseudogenes
[Figure: a spliced read aligned to a functional gene and to its processed pseudogene]
Functional gene: correct read alignment (identical, spliced)
Processed pseudogene: incorrect read alignment (mismatches, not spliced)
Note: An aligner may report both alignments or either one.
Some search strategies and scoring schemes give preference to unspliced alignments.
How important is mapping accuracy?
Depends what you want to do (tasks arranged along an importance scale):
Identify novel genetic variants or RNA editing
Allele-specific expression
Genome annotation
Gene and transcript discovery
Differential expression
Current RNA-seq aligners
TopHat2: Kim et al. Genome Biology 2013
HISAT2: Kim et al. Nature Methods 2015
STAR: Dobin et al. Bioinformatics 2013
GSNAP: Wu and Nacu Bioinformatics 2010
OLego: Wu et al. Nucleic Acids Research 2013
HPG aligner: Medina et al. DNA Research 2016
MapSplice2: http://www.netlab.uky.edu/p/bioinfo/mapsplice2
Compute requirements
Program   Run time (min)   Memory usage (GB)
HISATx1   22.7             4.3
HISATx2   47.7             4.3
HISAT     26.7             4.3
STAR      25               28
STARx2    5.5              28
GSNAP     291.9            2.2
OLego     989.5            3.7
TopHat2   1,17             4.3
Run times and memory usage for HISAT and other spliced aligners to align 19 million 11-bp RNA-seq reads from a lung fibroblast data set. We used three CPU cores to run the programs on a Mac Pro with a 3.7 GHz Quad-Core Intel Xeon E5 processor and 64 GB of RAM.
Kim et al. Nature Methods 2015
The predecessor: BLAT
"In the process of assembling and annotating the human genome, I was faced with two very large-scale alignment problems: aligning three million ESTs and aligning 13 million mouse whole-genome random reads against the human genome. These alignments needed to be done in less than two weeks' time on a moderate-sized (90 CPU) Linux cluster in order to have time to process an updated genome every month or two. To achieve this I developed a very-high-speed mRNA/DNA and translated protein alignment algorithm."
(Kent, Genome Research 2002)
Innovations in RNA-seq alignment software
Read pair alignment
Consider base call quality scores
Sophisticated indexing to decrease CPU and memory usage
Map to genetic variants
Resolve multi-mappers using regional read coverage
Consider junction annotation
Two-step approach (junction discovery & final alignment)
Two-step RNA-seq read mapping
1st run of HISAT to discover splice sites
2nd run of HISAT to align reads by making use of the list of splice sites collected above
[Figure: reads aligned across exons e1, e2 and e3, mapped and unmapped after each run; legend: Read, Exon, Intron, Global search, Local search, Extension, Junction extension]
Kim et al. Nature Methods 2015
Mapping accuracy
[Figure: stacked bars showing the percentage of reads per aligner in four categories: correctly and uniquely mapped, correctly mapped (multimapped), incorrectly mapped, unmapped]
Bar labels (percentage of reads):
HISATx1 92.1 (93.5)
OLego 91.6 (91.6)
GSNAP 92.3 (93.8)
STAR 88.9 (9.5)
STARx2 96.7 (99)
HISAT 97.6 (99.2)
HISATx2 97.6 (99.3)
TopHat2 94.4 (97.4)
Accuracy for 2 million simulated human 1 bp reads with .5% mismatch rate
Kim et al. Nature Methods 2015
Mapping accuracy for reads with small anchors
[Figure: two panels of stacked bars (percentage of reads in categories 2M_8_15 and 2M_1_7) for HISATx1, OLego, GSNAP, STAR, STARx2, HISAT, HISATx2 and TopHat2; bar segments: correctly and uniquely mapped, correctly mapped (multimapped), incorrectly mapped, unmapped]
Read categories in the simulated data: 62.4% (M), 25.1% (2M_gt_15), 3.1% (gt_2M), 4.2% (2M_1_7), 5.1% (2M_8_15)
Kim et al. Nature Methods 2015
Novel junctions are typically supported by few alignments
[Figure: K562 whole cell; panels for annotated junctions and novel junctions; x axis: minimum number of supporting mappings; y axis: number of junctions]
Each curve represents one RNA-seq read mapping protocol (program + settings).
Engström et al. Nature Methods 2013
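A common consequence is to require more supporting alignments for novel junctions than for annotated ones. The sketch below counts how many alignments support each junction and applies separate thresholds. The function names, the thresholds and the toy coordinates are all hypothetical, and suitable cutoffs depend on the data set and the aligner.

```python
from collections import Counter


def count_junction_support(alignments):
    """Count supporting alignments per junction.

    alignments: iterable of lists of (chrom, intron_start, intron_end) tuples,
    one list per alignment.
    """
    support = Counter()
    for junctions in alignments:
        for junction in set(junctions):  # count each alignment once per junction
            support[junction] += 1
    return support


def filter_junctions(support, annotated, min_annotated=1, min_novel=2):
    """Keep junctions whose support meets the threshold for their class."""
    kept = {}
    for junction, count in support.items():
        threshold = min_annotated if junction in annotated else min_novel
        if count >= threshold:
            kept[junction] = count
    return kept


# Toy data: one junction per alignment, made-up coordinates.
alignments = [
    [("chr2", 100, 200)],
    [("chr2", 100, 200)],
    [("chr2", 300, 400)],
    [("chr2", 500, 600)],
]
annotated = {("chr2", 500, 600)}
support = count_junction_support(alignments)
print(filter_junctions(support, annotated))
```

Here the novel junction supported by a single alignment is dropped, while the annotated junction with the same support is kept.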
Several methods show over-confidence in annotation
[Figure: panels c and d, simulation 1 and simulation 2; number of true junctions, false junctions, annotated junctions and all junctions as a function of the minimum number of supporting mappings. Protocols shown: BAGET ann, GEM ann, GEM cons, GEM cons ann, GSNAP, GSNAP ann, GSTRUCT, GSTRUCT ann, MapSplice, MapSplice ann, PALMapper, PALM. ann, PALM. cons, PALM. cons ann, PASS, PASS cons, ReadsMap, SMALT, STAR 1p, STAR 1p ann, STAR 2p, STAR 2p ann, TopHat1, TopHat1 ann, TopHat2, TopHat2 ann]
Engström et al. Nature Methods 2013
Recommendations
Use STAR, HISAT2 or GSNAP
STAR and HISAT2 are the fastest
HISAT2 uses the least memory
If you want to run Cufflinks, use TopHat2 (but don't)
Consider 2-pass read mapping (default in HISAT2 and TopHat2)
No need to supply annotation to mapper
Check that junction discovery criteria are conservative
HISAT2 and GSNAP can use SNP data, which may give higher sensitivity
For long (PacBio) reads, STAR, BLAT or GMAP can be used
Don't trust novel introns supported by single reads
Always check the results!
Unsolved problems in RNA-seq read mapping
Determine correct location of multimapping reads
Accurate alignment of indels
Use gene annotation in an unbiased fashion
Cross-species mapping
Browsing your results
Two main browsers:
Integrative Genomics Viewer (IGV)
+ Fast response (runs locally)
+ Easy to load your data (including custom genomes)
- Limited functionality
- User interface issues
UCSC Genome Browser
- Sluggish (remote web site)
- Need to place data on web server (e.g. UPPMAX webexport)
+ Much public data for comparison
+ Good for sharing your data tracks (e.g. using track hubs)
Important SAM fields
Command: samtools view -X file.bam
Perfectly and uniquely aligned read pair:
HWI-ST118:3:135:219:45397# ppr1 chr1 4426 255 11M = 4435 11 GT C@ NH:i:1 HI:i:1 AS:i:2 nm:i:
HWI-ST118:3:135:219:45397# ppr2 chr1 4435 255 11M = 4426 1 CG 5< NH:i:1 HI:i:1 AS:i:2 nm:i:
Problematic read pair:
HWI-ST118:3:219:617:66353# ppr2s chr1 558 3 65M36S = 558 95 CA B@ NH:i:2 HI:i:2 AS:i:135 nm:i:9
HWI-ST118:3:219:617:66353# ppr1s chr1 558 3 7S73M1D21M = 558 -95 CC ## NH:i:2 HI:i:2 AS:i:135 nm:i:9
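The fields that distinguish the two read pairs can be pulled out of a SAM line with a few lines of Python. This is a minimal sketch: the function names are hypothetical and the example record is made up (complete, tab-separated, and shorter than the truncated records above). NH is the standard SAM tag for the number of reported alignments of the read, and AS is the alignment score.

```python
def parse_tags(fields):
    """Parse optional SAM fields of the form TAG:TYPE:VALUE into a dict."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    return tags


def summarize(sam_line):
    """Extract the fields most useful for judging an alignment."""
    columns = sam_line.rstrip("\n").split("\t")
    tags = parse_tags(columns[11:])  # optional fields follow the 11 mandatory ones
    return {
        "name": columns[0],
        "mapq": int(columns[4]),
        "cigar": columns[5],
        "soft_clipped": "S" in columns[5],
        "hits": tags.get("NH"),
        "multimapped": tags.get("NH", 1) > 1,
        "score": tags.get("AS"),
    }


# Made-up record for illustration only.
line = "\t".join([
    "read1", "99", "chr1", "4426", "255", "10M", "=", "4435", "19",
    "ACGTACGTAC", "IIIIIIIIII", "NH:i:1", "HI:i:1", "AS:i:18",
])
print(summarize(line))
```

In the problematic pair above, the same fields show a low mapping quality, NH:i:2 (two reported alignments) and CIGAR strings with soft clipping.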
Thanks for listening!