Practical considerations of working with sequencing data

Similar documents
Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Moreover, the circular logic

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Pairwise & Multiple sequence alignments

Algorithms in Bioinformatics

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise sequence alignment

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Scoring Matrices. Shifra Ben-Dor Irit Orr

Single alignment: Substitution Matrix. 16 march 2017

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Quantifying sequence similarity

Motivating the need for optimal sequence alignments...

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence analysis and comparison

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Computational Biology

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Pairwise sequence alignments

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Lecture 5,6 Local sequence alignment

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Bioinformatics and BLAST

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence analysis and Genomics

In-Depth Assessment of Local Sequence Alignment

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Sequence Database Search Techniques I: Blast and PatternHunter tools

Introduction to Bioinformatics

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Pairwise Sequence Alignment

Substitution matrices

Sequence Alignment (chapter 6)

Large-Scale Genomic Surveys

Sequence Alignment Techniques and Their Uses

Introduction to Bioinformatics

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Similarity or Identity? When are molecules similar?

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Local Alignment Statistics

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Scoring Matrices. Shifra Ben Dor Irit Orr

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Basic Local Alignment Search Tool

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Local Alignment: Smith-Waterman algorithm

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

COPIA: A New Software for Finding Consensus Patterns. Chengzhi Liang. A thesis. presented to the University ofwaterloo. in fulfilment of the

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

Sequence comparison: Score matrices

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Bioinformatics for Biologists

Advanced topics in bioinformatics

BLAST. Varieties of BLAST

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Exploring Evolution & Bioinformatics

An Introduction to Sequence Similarity ( Homology ) Searching

Week 10: Homology Modelling (II) - HHpred

O 3 O 4 O 5. q 3. q 4. Transition

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Bio nformatics. Lecture 3. Saad Mneimneh

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

1.5 Sequence alignment

Alignment & BLAST. By: Hadi Mozafari KUMS

Introduction to protein alignments

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Lecture 2: Pairwise Alignment. CG Ron Shamir

EECS730: Introduction to Bioinformatics

BINF 730. DNA Sequence Alignment Why?

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Copyright 2000 N. AYDIN. All rights reserved. 1

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Example of Function Prediction

CSE 549: Computational Biology. Substitution Matrices

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Analysis and Design of Algorithms Dynamic Programming

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:


Sequence Comparison. mouse human

Hidden Markov Models

Practical Bioinformatics

Bioinformatics. Can Keşmir Theoretical Biology/Bioinformatics, UU

Transcription:

Practical considerations of working with sequencing data

File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more! Bedgraph read density along the genome Bed file Read density reported in large continuous intervals Genes/transcript and transcript structure Transcription factor binding regions If someone does a sequencing experiment usually one of these is available and deposited in a public database

SAM/BAM

Viewing genome coordinate files with IGV Integrated Genome Browser Cross-platform application Knows about common genomes Genome version is important!

Different assemblies Genome coordinates different between genome assemblies Differences accumulate over chromosome length You have to know which assembly was used Sequencing files are nonrandomly distributed relative to genes RNAseq should align with exons TF binding sites biased towards promoter regions

Converting coordinates UCSC liftover -- converts genome coordinates Convert from one assembly to another Cross organism conversion Mammals/vertebrates

Sequence Alignment

To do: Global alignment Local alignment Scoring Gaps Scoring matrices Database Search Statistical Significance Multiple Sequence alignment

Why compare sequences Given a new sequence, infer its function based on similarity to another sequence Find important molecular regions conserved across species Determine 3d structure with homology modeling Homologs-sequences that descended from a common ancestral sequence Orthologs- separated by speciation Paralogs separated by duplication in a single genome Basic unit of protein homology is a sufficient functional unit typically much smaller than a whole gene

DNA vs Protein alignments Protein coding Typically compared in amino acid space Amino acid change slower than nucleotides Some nucleotides can change without any change to a.a. sequence Different levels of amino acid similarity can be accounted for Not all a.a. changes are equally disruptive Can detect very remote homology Non-coding regions Smaller alphabet requires more matches to achieve significance No notion of similarity match or nor match Diverge more rapidly though some are very conserved at short evolutionary distances

What is a good sequence alignment Theory: If two sequences are homologous we want to match up the residues such that each residue is descendant from a common ancestral residue Practice: approximate string matching introduce gaps and padding to find best matching between two strings

Efficient alignment What is the best alignment? we need a scoring metric Basic scoring metric (1 for matching, 0 for mismatching, 0 for a gap) Number of possible alignments is exponential in string length Scoring is local we apply dynamic programming dynamic programming solve a large problem in terms of smaller subproblems Requirements There is only a polynomial number of subproblems Align x 1 x i to y 1 y j Original problem is one of the subproblems Align x 1 x M to y 1 y N Each subproblem is easily solved from smaller subproblems

Matrix representation of an alignment

Dynamical programming approach Score the optimal alignment up to every (i,j) F(i,j) Scoring is local so F(i,j) depends only 3 other values

Global alignment S T i-1 i j-1 j M ij F i-1, j-1 + Score(S i,t j ) F i,j = MAX Needleman & Wunsch, 1970 F i,j-1 + gp F i-1,j + gp Gap penalty 15

Example Keep track of the argmax!

Matrix filled out

Finding the optimal alignment

Complete Algorithm Initialization. F(0,0) =0 F(0, j) = - j go F(i, 0) = - i go Main Iteration. Filling-in partial alignments For each i=1...m For each j = 1...N F(i, j) = max(f(i-1,j-1)+s(xi, yj) F(i-1, j) gp, F(i, j-1) gp) Ptr(i,j) =DIAG LEFT UP

Local alignment Given two sequences, S and T, find two subsequences, s and t, whose alignment has the highest score amongst all subsequence pairs. Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence Genes can have local similarity because of variable domain composition EGR4_HUMAN EGR4_RAT EGR3_HUMAN EGR3_RAT EGR1_HUMAN EGR1_MOUSE EGR1_RAT EGR1_BRARE EGR2_RAT EGR2_XENLA EGR2_MOUSE EGR2_HUMAN EGR2_BRARE MIG1_KLULA MIG1_KLUMA MIG1_YEAST MIG2_YEAST KA [FACPVESCVRSFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTSHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH] KA [FACPVESCVRTFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTTHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH] RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH] RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH] RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH] RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH] RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH] RP [YACPVETCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACEI--CGRKFARSDERKRHTKIH] RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH] RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH] RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH] RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH] RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDF--CGRKFARSDERKRHTKIH] -- [-------------------------] ---RP [YVCPICQRGFHRLEHQTRHIRTH] TGERP [HACDFPGCSKRFSRSDELTRHRRIH] -- [-------------------------] ---RP [YMCPICHRGFHRLEHQTRHIRTH] TGERP [HACDFPGCAKRFSRSDELTRHRRIH] -- [-------------------------] ---RP [HACPICHRAFHRLEHQTRHMRIH] TGEKP [HACDFPGCVKRFSRSDELTRHRRIH] -- [-------------------------] ---RP [FRCDTCHRGFHRLEHKKRHLRTH] TGEKP [HHCAFPGCGKSFSRSDELKRHMRTH] [ ] :* [. * * * * * :*. *:* *] ***:* [. * * : *:****.** : *] 20

Local alignment S T i-1 i j-1 j M ij F i,j = MAX 0 F i-1, j-1 + Score(S i,t j ) F i,j-1 + go F i-1,j + go No pointer, we start over Gap penalty Smith & Waterman, 1981 Similarity Scoring Expected value: negative for random alignments positive for highly similar sequences 21

Local alignment Initialization Iteration F(0,0) = F(0,j) = F(i,0) = 0 for i=1,,m for j=1,,n - calculate optimal F(i,j) - store Ptr(i,j) if score is positive Termination Find the end of the best alignment with FOPT = max{i,j} F(i,j) and trace back OR Find all alignments with F(i,j) > threshold and trace back

Local vs. global alignment 23

Local vs. global alignment (cntd) 24

More accurate gap model In nature, gaps often come as a single event rather than a series of single gaps Linear gap penalty is too stringent Convex gap penalty is expensive Have to keep track of the length of gaps This is more likely. γ(n) Normal scoring would give the same score for both alignments This is less likely. γ(n) Compromise Affine gap penalty γ(n) = -d - e * (n-1) d: gap initiation penalty e: gap extension penalty γ(n) d e

Affine gap algorithm Dynamical programming in 3 layers The three recurrences for the scoring algorithm creates a 3- layered graph. The top level creates/extends gaps in the sequence w. The bottom level creates/extends gaps in sequence v. The middle level extends matches and mismatches. Keep track of 3 matrices

Affine Gap Update rule F i,j = max F i,j = max F i-1,j - e F i-1,j -(d+e) F i,j-1 - e F i,j-1 -(d+e) Continue Gap in s (deletion) Start Gap in s (deletion): from middle Continue Gap in t (insertion) Start Gap in t (insertion):from middle F i,j = F i-1,j-1 + s(v i, w j ) max F i,j F i,j Match or Mismatch End deletion: from top End insertion: from bottom

How to decide on the correct scoring metric Scoring metrics should reflect the evolutionary process What are the odds that an alignment is biologically meaningful the proteins are homologous Random model: product of chance events Non-random model: two sequences derived from a common ancestor Things to consider What is the frequency of different mutations Over what time scale?

Log-odds scoring 29

Log-odds scoring 30

Accepted Point Mutation (PAM) model Where do we get q XY Compare closely related proteins Find substitutions that are accepted to natural selection Very likely mutations E to D Very unlikely: involve C and W

Conservative substitutions

PAM1 probability matrix PAM1 probability matrix Dayhoff et al (1978) estimated probability of onestep transitions Used a family of very closely related proteins Corresponds to 1 change per 100 a.a.

PAM1 through PAM250 We can multiply PAM1 by itself to get a probability matrix for longer time scales PAM is measured in number of changes not time Number of changes that occurred is not the same as number of observed changes

PAM250 Only 20% identity 20% identity is close to what you might get aligning random sequences

Choice of scale reflects the result Human and chimp beta globin close orthologs Human beta and alpha globin paralogs further apart

PAM model Assumptions Replacement at any site depends only on the a.a. on that site, given the mutability of the a.a. Sequences in the training set (and those compared) have average a.a. composition. Sources of error Many proteins depart from the average a.a. composition. The a.a. composition can vary even within a protein (e.g. transmembrane proteins). A.a. positions are not mutated equally probably; especially in long evolutionary distances. Rare replacements are observed too infrequently and errors in PAM1 are magnified in PAM250.

Blocks Substitution Matrices (BLOSUM): Log-likelihood matrix (Henikoff & Henikoff, 1992) BLOCKS database of aligned sequences used as primary source set. Different BOLSUMn matrices are calculated independently from BLOCKS (ungapped local alignments) BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45 BLOCKS database contains large number of ungapped multiple local alignments of conserved regions of proteins Alignments include distantly related sequences in which multiple base substitutions at the same position could be observed

PAM vs BLOSUM PAM is based on closely related sequences, thus is biased for short evolutionary distances where number of mutations are scalable PAM is based on globally aligned sequences, thus includes conserved and nonconserved positions; BLOSUM is based on conserved positions only Lower PAM/higher BLOSUM matrices identify shorter local alignments of highly similar sequences Higher PAM/lower BLOSUM matrices identify longer local alignments of more distant sequences Matrices of choice: BLOSUM62: the all-weather matrix PAM250: for distant relatives