Global alignments - review

Similar documents
Pairwise sequence alignment

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Scoring Matrices. Shifra Ben-Dor Irit Orr

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Sequence comparison: Score matrices

Pairwise & Multiple sequence alignments

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sequence Alignment (chapter 6)

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Tools and Algorithms in Bioinformatics

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Computational Biology

Pairwise sequence alignments

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Practical considerations of working with sequencing data

Quantifying sequence similarity

Similarity or Identity? When are molecules similar?

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Algorithms in Bioinformatics

Large-Scale Genomic Surveys

Sequence analysis and Genomics

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Moreover, the circular logic

Local Alignment: Smith-Waterman algorithm

Introduction to Bioinformatics

Scoring Matrices. Shifra Ben Dor Irit Orr

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Tools and Algorithms in Bioinformatics

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Single alignment: Substitution Matrix. 16 march 2017

Bioinformatics and BLAST

Lecture 4: September 19

Bioinformatics for Biologists

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

CSE 549: Computational Biology. Substitution Matrices

An Introduction to Sequence Similarity ( Homology ) Searching

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Ch. 9 Multiple Sequence Alignment (MSA)

Alignment & BLAST. By: Hadi Mozafari KUMS

Getting statistical significance and Bayesian confidence limits for your hidden Markov model or score-maximizing dynamic programming algorithm,

Sequence Analysis and Databases 2: Sequences and Multiple Alignments

Sequence Comparison. mouse human

Lecture 5,6 Local sequence alignment

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Bioinformatics. Molecular Biophysics & Biochemistry 447b3 / 747b3. Class 3, 1/19/98. Mark Gerstein. Yale University

1.5 Sequence alignment

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Sequence analysis and comparison

Motivating the need for optimal sequence alignments...

Sequence Database Search Techniques I: Blast and PatternHunter tools

Substitution matrices

Collected Works of Charles Dickens

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

In-Depth Assessment of Local Sequence Alignment

Advanced topics in bioinformatics

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Local Alignment Statistics

Lecture 2: Pairwise Alignment. CG Ron Shamir

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Week 10: Homology Modelling (II) - HHpred

Chapter 2: Sequence Alignment

Overview Multiple Sequence Alignment

Introduction to Bioinformatics

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

BLAST: Target frequencies and information content Dannie Durand

Introduction to protein alignments

Similarity searching summary (2)

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Copyright 2000 N. AYDIN. All rights reserved. 1

Sequence Alignment Techniques and Their Uses

String Matching Problem

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Genomics and bioinformatics summary. Finding genes -- computer searches

Bio nformatics. Lecture 3. Saad Mneimneh

Introduction to Computation & Pairwise Alignment


11/18/2010. Pairwise sequence. Copyright notice. November 22, Announcements. Outline: pairwise alignment. Learning objectives

Transcription:

Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply the equation Find the alignment with backtracing Y[i] 1 2 3 4 5 6 7 - G G G C G - 1 G 2 3 G 4 T 5 G 6 2

lgorithm time/space complexity - Big-O Notation a simple description of complexity: constant O(1), linear O(n), quadratic O(n 2 ), cubic O(n 3 )... asymptotic upper bound read: order of f n=o gn iff. simple,e.g.n 2 x, c f x cgx x x 2 lnx f x lnx x =17

Big-O Notation Example Time complexity of global alignment: M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 nm 1 init 1nm 1 =Onm calcm print

Global alignment linear space M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 We need O(nm) time, but only O(m) space how? problem with backtracking

Global alignment linear space, recursion LSpacei,i', j, j': i j k 1 j' LSpacei, n 2 1, j, k 1 n/2 LSpacei, n 2 1, j,k 2 i' k 2 space complexity: O(m)

Global alignment linear space, algorithm LSpacei,i ', j, j': i j k 1 j' return if areai,i', j, j'empty h:= i ' i 2 calc.musing Ommemory plusfind pathl h crossing therow h LSpacei,h 1,k 1, j' print L h LSpacei,h1,k 2, j' h=n/2 i' L h k 2 log 2 n time complexity: nm 2nm i= 2 i

lignments: Local alignment

Local alignment: Smith-Waterman algorithm Seq X: What s local? Seq Y: llow only parts of the sequence to match Locally maximal: can not make it better by trimming/extending the alignment

Local alignment Seq X: Seq Y: Why local? Parts of sequence diverge faster evolutionary pressure does not put constraints on the whole sequence seq X: seq Y: Proteins have modular construction sharing domains between sequences seq X: seq Y:

Domains - example Immunoglobulin domain

Global local alignment a) global align Take the global equation Look at the result of the global alignment - s e q - s b) retrieve the result e CGCCTTGGTTCTCG- C-C-----GTTCGT-G c) sum score along the result q u e sum align. pos.

Local alignment breaking the alignment recipe Just don t let the score go below Start the new alignment when it happens Where is the result in the matrix? Before: fter: - s e q u e - s e q u e - - s s e e q q sum sum sum align. pos. align. pos. align. pos.

Local alignment the equation M[i-1, j-1] + score(x[i],y[j]) M[i, j] = max M[i, j-1] g M[i-1, j] g Init the boundaries with s Great contribution to science! Run the algorithm Read the maximal value from anywhere in the matrix - s e q u e Find the result with backtracking - s e q

Finding second best alignment We can find the best local alignment in the sequence But where is the second best one? Scoring: 1 for match -2 for a gap G Best alignment G 18 16 14 16 19 17 14 17 2 15 18 16 G 15 18 clump 13 16

Clump of an alignment lignments sharing at least one aligned pair G 18 16 G 16 19 2 G clump

Clumps gene X gene Y

Finding second best alignment Don t let any matched pair to contribute to the next alignment G Clear the best alignment G 1 1 1 Recalculate the clump G 2 3 1 1

Extraction of local alignments Waterman- Eggert algorithm 1. Repeat a. Calc M without taking cells into account b. Retrieve the highest scoring alignment c. Set it s trace to

Clumps gene X gene Y

Clumps gene X gene Y

Low complexity regions Local composition bias Replication slippage: e.g. triplet repeats Results in spurious hits Breaks down statistical models Different proteins reported as hits due to similar composition Up to ¼ of the sequence can be biased Huntington s disease Huntingtin gene of unknown (!) function Repeats #: 6-35: normal; 36-12: disease

Pitfalls of alignments lignment is not a reconstruction of the evolution (common ancestor is extinct by the time of alignment)

Pitfalls of alignments Matches to the same fragment rbitrary poor alignment regions seq X: seq Y:

What s the number of steps in these algorithms? How much memory is used? Summary 1. Global a.k.a. Needleman- Wunsch algorithm 2. Global-local 3. Local a.k.a. Smith- Waterman algorithm 4. Many local alignments a.k.a. Waterman- Eggert algorithm seq X: seq Y: seq X: seq Y: seq X: seq Y: seq X: seq Y:

mino acid substitution matrices

Percent ccepted Mutations distance and matrices ccepted by natural selection not lethal not silent Def.: S 1 and S 2 are PM 1 distant if on avg. there was one mutation per 1 aa Q.: If the seqs are PM 8 distant, how many residues may be diffent?

PM matrix Created from easy alignments pairwise gapless 85% id f proline f proline, valine i.e. frequency of occurenceof proline frequency of substitution proline with valine,for PM1 counta,b aligns f a,b= aligns c,d!=a,b M symmetricmatr ix, f a,a f a i.e. M=[ f b,a f b,b] countc,d

PM matrix How to calculte M for PM 2 distance? Take more distant seqs or extrapolate... M 2 =[ f a,a f b,a =[ f a f a,af a,af a,bf b,a f a,af a,bf a,bf b,b f b,b]2 ]

PM log odds matrix Making of the PM N matrix observed PMN[a, b] = log 2 f am N [a,b] f af b By chance alone Why log? Mutations and chance: More freq: PM N[a,b] > Less freq: PM N[a,b] < = log 2 M N [a,b] f b odds log odds

PM 25 matrix

BLOSUM matrix BLOcks SUbstitution Matrix Based on gapless alignments More often used than PM

BLOSUM N matrix Cluster together sequences N% identity f(a,b) frequency of occurrence a and b in the same column f a, b f a = f a,a a b 2 e(a,b) chance alone ea,b = 1 a b ea,a = f a 2 ea, b = 2 f af b for a b BLOSUM N [a,b] = log 2 f a,b ea,b log odds

BLOSUM 62 matrix # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Blocks Database = /data/blocks_5./blocks.dat # Cluster Percentage: >= 62 # Entropy =.6979, Expected = -.529 R N D C Q E G H I L K M F P S T W Y V 4-1 -2-2 -1-1 -2-1 -1-1 -1-2 -1 1-3 -2 R -1 5-2 -3 1-2 -3-2 2-1 -3-2 -1-1 -3-2 -3 N -2 6 1-3 1-3 -3-2 -3-2 1-4 -2-3 D -2-2 1 6-3 2-1 -1-3 -4-1 -3-3 -1-1 -4-3 -3 C -3-3 -3 9-3 -4-3 -3-1 -1-3 -1-2 -3-1 -1-2 -2-1 Q -1 1-3 5 2-2 -3-2 1-3 -1-1 -2-1 -2 E -1 2-4 2 5-2 -3-3 1-2 -3-1 -1-3 -2-2 G -2-1 -3-2 -2 6-2 -4-4 -2-3 -3-2 -2-2 -3-3 H -2 1-1 -3-2 8-3 -3-1 -2-1 -2-1 -2-2 2-3 I -1-3 -3-3 -1-3 -3-4 -3 4 2-3 1-3 -2-1 -3-1 3 L -1-2 -3-4 -1-2 -3-4 -3 2 4-2 2-3 -2-1 -2-1 1 K -1 2-1 -3 1 1-2 -1-3 -2 5-1 -3-1 -1-3 -2-2 M -1-1 -2-3 -1-2 -3-2 1 2-1 5-2 -1-1 -1-1 1 F -2-3 -3-3 -2-3 -3-3 -1-3 6-4 -2-2 1 3-1 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7-1 -1-4 -3-2 S 1-1 1-1 -1-2 -2-1 -2-1 4 1-3 -2-2 T -1-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5-2 -2 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 2-3 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 7-1 V -3-3 -3-1 -2-2 -3-3 3 1-2 1-1 -2-2 -3-1 4

PM vs BLOSUM http://www.ncbi.nih.gov/education/blstinfo/scoring2.html PM is extrapolation from closely related seqs We are interested more distant relationships