Introduction. SAMStat. QualiMap. Conclusions

Similar documents
High-throughput sequencing: Alignment and related topic

RNA- seq read mapping

Variant visualisation and quality control

Isoform discovery and quantification from RNA-Seq data

Comparing whole genomes

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Supplemental Information

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016

Bioinformatics Chapter 1. Introduction

Our typical RNA quantification pipeline

Bias in RNA sequencing and what to do about it

Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Alignment-free RNA-seq workflow. Charlotte Soneson University of Zurich Brixen 2017

Introduction to Bioinformatics Online Course: IBT

Bioinformatics and BLAST

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

objective functions...

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

GEP Annotation Report

1 What s Way Out There? The Hubble Ultra Deep Field

Genome Assembly. Sequencing Output. High Throughput Sequencing

8.4 Scientific Notation

Introduction to Bioinformatics

Whole Genome Alignments and Synteny Maps

1. HyperLogLog algorithm

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Synteny Portal Documentation

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

The Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector.

Read Quality Assessment & Improvement. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

BLAST. Varieties of BLAST

Fundamentals of Mathematics I

Single Cell Sequencing

Introduc)on to RNA- Seq Data Analysis. Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas

Exploring variability in reads from next generation rna-sequencing data. Presented By. Andrew Butler

ncounter PlexSet Data Analysis Guidelines

Practical considerations of working with sequencing data

Quantifying sequence similarity

Large-Scale Genomic Surveys

Package chimeraviz. April 13, 2019

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Package chimeraviz. November 29, 2017

O 3 O 4 O 5. q 3. q 4. Transition

Tree Building Activity

High-throughput sequence alignment. November 9, 2017

How to Run the Analysis: To run a principal components factor analysis, from the menus choose: Analyze Dimension Reduction Factor...

An Introduction to Sequence Similarity ( Homology ) Searching

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

It s about time... The only timeline tool you ll ever need!

Mathematics Level 5 to 6 for Year 8

Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database

Galaxy Zoo. Materials Computer Internet connection

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Introduction to Statistical Data Analysis Lecture 1: Working with Data Sets

Data Analysis and Statistical Methods Statistics 651

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Predictive Genome Analysis Using Partial DNA Sequencing Data

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

New Zealand Population Distribution

Using Bioinformatics to Study Evolutionary Relationships Instructions

Introduction to second-generation sequencing

Riemann Integration Theory

Sequencing alignment Ameer Effat M. Elfarash

Bioinformatics Exercises

Practical Bioinformatics

Radiological Control Technician Training Fundamental Academic Training Study Guide Phase I

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

Expanded View Figures

A100 Exploring the Universe: Introduction. Martin D. Weinberg UMass Astronomy

Taxonomical Classification using:

GAP CLOSING. Algebraic Expressions. Intermediate / Senior Facilitator s Guide

Collected Works of Charles Dickens

Small & Large Card Game

Written Exam Linear and Integer Programming (DM554)

Supplemental Materials

new interface and features

SJÄLVSTÄNDIGA ARBETEN I MATEMATIK

CRISPRseek Workshop Design of target-specific guide RNAs in CRISPR-Cas9 genome-editing systems

Sequence Comparison. mouse human

Written Exam Linear and Integer Programming (DM545)

Cerno Application Note Extending the Limits of Mass Spectrometry

Integration by Parts Logarithms and More Riemann Sums!

Sequencing alignment Ameer Effat M. Elfarash

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

Supplementary Materials for R3P-Loc Web-server

Introduction to Sequence Alignment. Manpreet S. Katari

Destination Math. Scope & Sequence. Grades K 12 solutions

Ligand Scout Tutorials

SUPPLEMENTARY INFORMATION

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Hapsembler version 2.1 ( + Encore & Scarpa) Manual. Nilgun Donmez Department of Computer Science University of Toronto

Transcription:

Introduction SAMStat QualiMap Conclusions

Introduction SAMStat QualiMap Conclusions

Where are we?

Why QC on mapped sequences Acknowledgment: Fernando García Alcalde The reads may look OK in QC analyses of raw reads but some issues only show up once the reads are aligned: low coverage, homopolymer biases, experimental artifacts, etc. These unwanted biases can be introduced by the selected: Sample extraction process Sequencing technology Sample preparation protocol Mapping algorithm

Why QC on mapped sequences SAM/BAM files usually contain information from tens to hundreds of millions of reads The systematic detection of such biases is a non-trivial task that is crucial to to drive appropriate downstream analyses. Look for big biases that really affect the analysis Difficult to provide guidelines: general trends

Introduction SAMStat QualiMap Conclusions

SAMStat Features Facilitates the indentification of sequencing error biases that may disturb the mapping process Provides a concise html page with statistics that highlight problems in the data processing: Reads with an excesive proportion of mapping errors Reads containing contaminants Reads representing novel splice junctions/genomic regions... Easy-to-use command-line tool freely downloadable at: http://samstat.sourceforge.net Timo Lassmann, Yoshihide Hayashizaki, and Carsten O. Daub. SAMStat: monitoring biases in next generation sequencing data Bioinformatics (2011) 27(1): 130-131

Running SAMStat Input: a BAM/SAM file (other sequence files are also accepted such as fasta or fastq) Output: an html report Run SAMStat with a.bam example samstat /home/biouser/mda13/mqc-igv/test1.bam The html report will be saved at /home/biouser/mda13/mqc-igv/test1.bam.html. Use a web browser (e.g. Firefox) to open it

SAMStat report Concepts Mapping quality: an integer in [0,254] representing 10 log 10 P(mapping error) Calculated as a function of the quality of the read, and a score that indicates how well the read is aligned Algorithm-specific The higher it is, the better the alignment. (MAPQ = 30 = 0.001 error rate) 255 indicates that the mapping quality is not available.

SAMStat report Number of aligned reads and mapping quality Proportion of reads mapped in each mapping quality range. The red part should fill most of the pie chart area

SAMStat report Number of aligned reads and mapping quality Proportion of reads mapped in each mapping quality range. The red part should fill most of the pie chart area WARNING: The appearance of a 0% of unmapped reads does not necessarily mean that there all the raw reads were aligned.

SAMStat report Number of aligned reads and mapping quality Proportion of reads mapped in each mapping quality range. The red part should fill most of the pie chart area WARNING: The appearance of a 0% of unmapped reads does not necessarily mean that there all the raw reads were aligned. Why?

SAMStat report Mean base quality Mean quality per read base in each mapping quality range Higher the base quality = higher mapping quality expected.

SAMStat report Mean base quality Mean quality per read base in each mapping quality range Higher the base quality = higher mapping quality expected. Error profiles Number of mismatches at each read position, segregated by the nucleotide causing the mismatch Should be more or less stable across read positions More errors are expected at the end of the reads since base qualities tend to be lower at that positions Nucleotide peaks at different positions may indicate experimental artifacts that disturb read mapping

SAMStat report Over-represented di-nucleotides Over-representation scores for each possible di-nucleotide at each read position. Significant socores (p-value <= 1e-100) appear in bold Over-represented di-nucleotides may indicate experimental artifacts that disturb read mapping

SAMStat report Over-represented di-nucleotides Over-representation scores for each possible di-nucleotide at each read position. Significant socores (p-value <= 1e-100) appear in bold Over-represented di-nucleotides may indicate experimental artifacts that disturb read mapping Error distribution Distribution of the number of errors (mismatches and indels) per read, segregated by mapping quality ranges No more than 2 mismatches should be allowed for short ( 75b) reads

SAMStat report Nucleotide composition Number of As,Cs,Gs and Ts appearing at each read position and segregated by mapping quality The counts and proportions should be almost invariant accross read positions

SAMStat report Nucleotide composition Number of As,Cs,Gs and Ts appearing at each read position and segregated by mapping quality The counts and proportions should be almost invariant accross read positions Length distribution Distribution of the number of bases per read

SAMStat report Top 5 over-represented 2-mers Summary of the Over-represented di-nucleotides, including the top-5 2-mers in each position

SAMStat report Top 5 over-represented 2-mers Summary of the Over-represented di-nucleotides, including the top-5 2-mers in each position Top 20 over-represented 10-mers The 20 most significant 10-mers per quality level

More on SAMStat Hands-on Run SAMstat on /home/biouser/mda13/mqc-igv/test2.bam and /home/biouser/mda13/mqc-igv/test3.bam Interpret the results

Introduction SAMStat QualiMap Conclusions

Qualimap Aim Provide an overall view of the data that helps to the detect biases in the sequencing and/or mapping of the data Run QualiMap qualimap BAM file needs to be sorted: samtools sort <filename> <fileout> File New analysis BAM/SAM file /home/biouser/mda13/mqc-igv/hg00096.chrom20.bam García-Alcalde, et al. Qualimap: evaluating next generation sequencing alignment data. Bioinformatics(2012) 28 (20): 2678-2679

Features Fast analysis accross the reference of genome coverage and nucleotide distribution Easy to interpret summary of the main properties of the alignment data Analysis of the reads mapped inside/outside of the regions provided in GFF format Insert size mean and median value calculation and plotting statistical distribution Analysis of the adequasy of the sequencing depth in RNA-seq experiments Clustering of epigenomic profiles

Hands on Open the online help Go through the examples Run qualimap over /home/biouser/mda13/mqc-igv/igv1.bam (with and without reference annotation http://reports.bioinfomgp.org/external-downloads/chr21.gtf) Run qualimap in the previous data (with and without reference annotation http://reports.bioinfomgp.org/externaldownloads/refseqgenes.gtf) Have a look at http://reports.bioinfomgp.org/externaldownloads/fullbam/qualimapreport.html Drive conclusions from what you get BONUS: Run qualimap via de command line

Introduction SAMStat QualiMap Conclusions

Conclusions One should always perform QC on the mapped data

Conclusions One should always perform QC on the mapped data The correct interpretation of the QC output may save a lot of time (and money) on downstream analyses

Conclusions One should always perform QC on the mapped data The correct interpretation of the QC output may save a lot of time (and money) on downstream analyses The expected results are experiment-specific = Learn from experience