Sequence Alignment (chapter 6). Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology.

Transcription:

Sequence Alignment (chapter 6)
The biological problem
Global alignment
Local alignment
Multiple alignment

Introduction to bioinformatics, Autumn

Background: comparative genomics
Basic question in biology: what properties are shared among organisms?
Genome sequencing allows comparison of organisms at the DNA and protein levels.
Comparisons can be used to
  find evolutionary relationships between organisms
  identify functionally conserved sequences
  identify corresponding genes in human and model organisms, in order to develop models for human diseases

Homologs
Two genes or characters g_B and g_C that evolved from the same ancestor g_A are called homologs.
Homologs usually exhibit conserved functions.
Close evolutionary relationship => expect a high number of homologs.

  g_A = agtgtccgttaagtgcgttc
  g_B = agtgccgttaaagttgtacgtc
  g_C = ctgactgtttgtggttc

Sequence similarity
Intuitively, the similarity of two sequences refers to the degree of match between corresponding positions in the sequences:

  agtgccgttaaagttgtacgtc
  ctgactgtttgtggttc

What about sequences that differ in length?

Similarity vs homology
Sequence similarity is not sequence homology.
If the two sequences g_B and g_C have accumulated enough mutations, the similarity between them is likely to be low.

[Figure: the sequence agtgtccgttaagtgcgttc shown after an increasing number of mutations (agtgtccgttatagtgcgttc, agtgtccgcttatagtgcgttc, agtgtccgcttaagggcgttc, ...); the most heavily mutated versions bear little resemblance to the original.]

Sequence similarity can occur by chance: similarity does not imply homology.
Similarity is an expected consequence of homology.
Homology is more difficult to detect over greater evolutionary distances.
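The intuitive notion of similarity as "degree of match between corresponding positions" can be sketched in a few lines of Python. This is only an illustration: the function name `fraction_identical` is hypothetical, and the sketch deliberately refuses sequences of different lengths, which is exactly the case that alignment is needed for.

```python
def fraction_identical(x: str, y: str) -> float:
    """Fraction of positions at which two equal-length sequences agree.

    Hypothetical helper for illustration; it does no alignment at all.
    """
    if len(x) != len(y):
        # Sequences of different length have no obvious "corresponding
        # positions"; they must be aligned first.
        raise ValueError("sequences must have equal length; align them first")
    if not x:
        return 0.0
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


if __name__ == "__main__":
    print(fraction_identical("acgt", "aggt"))
```

Calling it on two sequences of unequal length raises an error instead of guessing which positions correspond.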

Orthologs and paralogs
We distinguish between two types of homology:
  Orthologs: homologs from two different species
  Paralogs: homologs within a species
Orthologs typically retain the original function.
In paralogs, one copy is free to mutate and acquire a new function (no selective pressure).

[Figure: a gene is copied within an organism, giving paralogs; the corresponding genes g_B and g_C in the descendant organisms B and C are orthologs.]

Sequence alignment
An alignment specifies which positions in two sequences match. Three possible alignments of acgtctag and actctag:

  acgtctag
  actctag-     2 matches, 5 mismatches, 1 not aligned

  acgtctag
  -actctag     5 matches, 2 mismatches, 1 not aligned

  acgtctag
  ac-tctag     7 matches, 0 mismatches, 1 not aligned

Mutations: insertions, deletions and substitutions
Indel: insertion or deletion of a base with respect to the ancestor sequence. Insertions and/or deletions are collectively called indels.
Mismatch: substitution (point mutation) of a single base.
We can't tell whether the ancestor sequence had a base or not at an indel position.

Problems
What sorts of alignments should be considered?
How to score alignments?
How to find optimal or good scoring alignments?
How to evaluate the statistical significance of scores?
In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.
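The bookkeeping behind "matches, mismatches, not aligned" can be sketched as follows. The function name `count_columns` and the use of `-` as the gap character are assumptions made for this illustration, and the three gap placements tried at the bottom are one plausible reading of the slide's example with acgtctag and actctag.

```python
def count_columns(top: str, bottom: str, gap: str = "-"):
    """Count (matches, mismatches, not_aligned) columns of a gapped alignment.

    Both rows must already contain gap characters and have equal length.
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = not_aligned = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            not_aligned += 1      # indel column
        elif a == b:
            matches += 1          # identical letters
        else:
            mismatches += 1       # substitution
    return matches, mismatches, not_aligned


if __name__ == "__main__":
    for bottom in ("actctag-", "-actctag", "ac-tctag"):
        print(bottom, count_columns("acgtctag", bottom))
```

Placing the single gap in the third position is what turns every other column into a match.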

Global alignment
Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970).

Representing alignments and scores
We give a score for each position in the alignment:
  Identity (match): $+1$
  Substitution (mismatch): $-\mu$
  Indel: $-\delta$
The score of an alignment is the sum of its position scores.

[Figure: an example alignment of two sequences X and Y, drawn as a path through the alignment matrix, with its score S(X/Y).]

Dynamic programming
How to find the optimal alignment?
We use previous solutions for optimal alignments of smaller subsequences.
This general approach is known as dynamic programming.

Filling the alignment matrix
Consider the alignment process at one cell $(i, j)$ of the matrix (the shaded square in the figure), with X indexed by $i$ and Y by $j$. There are three cases:
  Case 1. Align the letter in X against the letter in Y (match or substitution).
  Case 2. Align the letter in Y against an indel in X.
  Case 3. Align the letter in X against an indel in Y.

Scoring the alternatives:
  Case 1. $S_{i,j} = S_{i-1,j-1} + s(i, j)$
  Case 2. $S_{i,j} = S_{i,j-1} - \delta$
  Case 3. $S_{i,j} = S_{i-1,j} - \delta$
where $s(i, j) = 1$ for matching positions and $s(i, j) = -\mu$ for substitutions.
Choose the case (path) that yields the maximum score. Keep track of path choices.
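The three-way choice made at a single matrix cell can be sketched as a small function. Everything here is illustrative: the name `fill_cell`, the default values `mu=1` and `delta=2`, and the orientation (X along rows, so `left` is the cell reached by consuming only a letter of Y and `up` the cell reached by consuming only a letter of X) are assumptions, not taken from the slides.

```python
def fill_cell(diag, up, left, x_char, y_char, mu=1, delta=2):
    """Return (score, case) for one cell given its three neighbours.

    diag = S[i-1][j-1], up = S[i-1][j], left = S[i][j-1]
    (assumed orientation: X along rows, Y along columns).
    """
    s = 1 if x_char == y_char else -mu
    options = [
        (diag + s, "case 1: letter against letter"),
        (left - delta, "case 2: letter in Y against indel in X"),
        (up - delta, "case 3: letter in X against indel in Y"),
    ]
    # max() returns the first best option, so ties favour case 1.
    return max(options, key=lambda option: option[0])


if __name__ == "__main__":
    print(fill_cell(diag=0, up=-2, left=-2, x_char="a", y_char="a"))
```

Returning the chosen case alongside the score is one simple way to "keep track of path choices" for a later traceback.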

Global alignment: formal development
$A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
Any alignment can be written as a unique path through the matrix.
Score for aligning A and B up to positions $i$ and $j$:
  $S_{i,j} = S(a_1 a_2 \ldots a_i,\; b_1 b_2 \ldots b_j)$

Scoring partial alignments
An alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$ can end in three ways:
  Case 1:  $(a_1 a_2 \ldots a_{i-1})\; a_i$
           $(b_1 b_2 \ldots b_{j-1})\; b_j$
  Case 2:  $(a_1 a_2 \ldots a_{i-1})\; a_i$
           $(b_1 b_2 \ldots b_j)\; -$
  Case 3:  $(a_1 a_2 \ldots a_i)\; -$
           $(b_1 b_2 \ldots b_{j-1})\; b_j$

Scoring alignments
Scores for each case:
  $s(a_i, b_j) = +1$ if $a_i = b_j$, and $-\mu$ otherwise
  $s(a_i, -) = s(-, b_j) = -\delta$
The first row and first column correspond to initial alignment against indels:
  $S_{i,0} = -i\delta$, $S_{0,j} = -j\delta$
Optimal global alignment score: $S(A, B) = S_{n,m}$

Algorithm for global alignment
  Input: sequences A, B, with n = |A|, m = |B|
  Set $S_{i,0} := -\delta i$ for all $i$
  Set $S_{0,j} := -\delta j$ for all $j$
  for i := 1 to n
    for j := 1 to m
      $S_{i,j} := \max\{S_{i-1,j} - \delta,\; S_{i-1,j-1} + s(a_i, b_j),\; S_{i,j-1} - \delta\}$
    end
  end
The algorithm takes O(nm) time and space.

[Figure: Global alignment example, the alignment matrix filled in for two input sequences with given values of $\mu$ and $\delta$.]
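The global alignment algorithm above can be written out as a minimal Python sketch. The function name `needleman_wunsch`, the default penalties `mu=1` and `delta=2`, and the tie-breaking order in the traceback (diagonal first) are assumptions made for illustration; integer scores are used so that the traceback can compare cells exactly.

```python
def needleman_wunsch(a: str, b: str, mu: int = 1, delta: int = 2):
    """Global alignment score and one optimal alignment of a and b.

    Scoring: +1 match, -mu mismatch, -delta per indel (illustrative defaults).
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * delta          # prefix of a against indels only
    for j in range(1, m + 1):
        S[0][j] = -j * delta          # prefix of b against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,
                          S[i - 1][j - 1] + s,
                          S[i][j - 1] - delta)

    # Trace back from the bottom-right corner, preferring the diagonal.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = 1 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else -mu
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("acgtctag", "actctag")
    print(score)
    print(top)
    print(bottom)
```

Storing the full matrix is what gives the O(nm) space cost; if only the score is needed, two rows suffice.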

[Figure: Global alignment example (continued), the completed matrix and the optimal path for the given $\mu$ and $\delta$.]

Local alignment: rationale
Otherwise dissimilar proteins may have local regions of similarity -> the proteins may share a function.
Human bone morphogenic protein receptor type II precursor (left) has a region that resembles a region in the TGF-β receptor (right). The shared function here is protein kinase.

[Figure: two sequences A and B with their regions of similarity marked.]

Global alignment would be inadequate here.
Problem: find the highest scoring local alignment between two sequences.
The previous algorithm with minor modifications solves this problem (Smith & Waterman 1981).

From global to local alignment
Modifications to the global alignment algorithm:
  Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix).
  Allow preceding and trailing indels without penalty.

Scoring local alignments
$A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
Let $I$ and $J$ be intervals (substrings) of A and B, respectively: $I \subset A$, $J \subset B$.
Best local alignment score:
  $M(A, B) = \max\{S(I, J) : I \subset A,\; J \subset B\}$
where $S(I, J)$ is the score for substrings $I$ and $J$.

Allowing preceding and trailing indels
First row and column initialised to zero:
  $M_{i,0} = M_{0,j} = 0$

Recursion for local alignment
  $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\; M_{i-1,j} - \delta,\; M_{i,j-1} - \delta,\; 0\}$

Finding best local alignment
The optimal score is the highest value in the matrix:
  $M(A, B) = \max_{i,j} M_{i,j}$
The best local alignment can be found by backtracking from the highest value in M.

[Figure: Local alignment example, the matrix M filled in under a scoring scheme with a positive match score and negative mismatch and indel scores.]

Nonuniform mismatch penalties
We used a uniform penalty for mismatches:
  $s(A, C) = s(A, G) = \ldots = s(G, T) = -\mu$
Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (purine-pyrimidine changes such as A->C or G->T).
-> Use nonuniform mismatch penalties.
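The local alignment recursion and the idea of nonuniform mismatch penalties can be combined in one minimal Python sketch. The names `smith_waterman`, `uniform_score` and `transition_aware_score` are hypothetical, and the penalty values (1 for a transition, 2 for a transversion, `delta=2` per indel) are placeholders chosen for illustration, not values from the slides.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def uniform_score(x, y, mu=1):
    """+1 for a match, -mu for any mismatch."""
    return 1 if x == y else -mu


def transition_aware_score(x, y, transition=1, transversion=2):
    """Penalise transversions more than transitions (illustrative values)."""
    if x == y:
        return 1
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    return -transition if is_transition else -transversion


def smith_waterman(a, b, score=uniform_score, delta=2):
    """Best local alignment score and the two aligned substrings."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          M[i - 1][j] - delta,
                          M[i][j - 1] - delta,
                          0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("ttttacgtgg", "ccacgtaa"))
    print(smith_waterman("acgt", "atgt", score=transition_aware_score))
```

Passing the scoring function as a parameter keeps the recursion unchanged when the mismatch penalties become nonuniform.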

Gaps in alignment
A gap is a succession of indels in an alignment.
The previous model scored a length $k$ gap as $w(k) = -k\delta$.
Replication processes may produce longer stretches of insertions or deletions.
In coding regions, insertions or deletions of whole codons may preserve functionality.

Gap open and extension penalties
We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it:
  $w(k) = -\alpha - \beta(k - 1)$
where $\alpha$ is the gap open cost and $\beta$ the gap extension cost.
Our previous algorithm can be extended to use $w(k)$ (not discussed on this course).

Multiple alignment
Consider the set of n sequences below. They may be
  orthologous sequences from different organisms
  paralogs from multiple duplications
How can we study relationships between these sequences?

  aggcgagctgcgagtgcta
  cgttagattgacgctgac
  ttccggctgcgac
  gacacggcgaacgga
  agtgtgcccgacgagcgaggac
  gcgggctgtgagcgcta
  aagcggcctgtgtgcccta
  atgctgctgccagtgta
  agtcgagccccgagtgc
  agtccgagtcc
  actcggtgc

Optimal alignment of three sequences
An alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end either in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$: $2^2 - 1 = 3$ alternatives.
An alignment of A, B and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$.
Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space.
This generalizes to n sequences but is impractical with even a moderate number of sequences.

Multiple alignment in practice
In practice, real-world multiple alignment problems are usually solved with heuristics.
Progressive multiple alignment:
  Choose two sequences and align them.
  Choose a third sequence with respect to the two previous sequences and align it against them.
  Repeat until all sequences have been aligned.
There are different options for how to choose sequences and how to score alignments.
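Two of the quantities in this part lend themselves to a quick numerical check: the gap penalty $w(k)$ under the linear and the open/extension models, and the number of ways the last column of an alignment of several sequences can look. The sketch below is illustrative only; the function names and the default costs (`delta=2`, `alpha=3`, `beta=1`) are assumptions.

```python
from itertools import product


def linear_gap(k: int, delta: int = 2) -> int:
    """Penalty of a length-k gap when every indel costs delta."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -k * delta


def affine_gap(k: int, alpha: int = 3, beta: int = 1) -> int:
    """Penalty of a length-k gap: alpha to open, beta per extension."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -alpha - beta * (k - 1)


def final_columns(n_sequences: int):
    """All ways the last alignment column can be filled.

    True means "a letter", False means "an indel"; the all-indel column
    is excluded, leaving 2**n_sequences - 1 alternatives.
    """
    return [column
            for column in product((True, False), repeat=n_sequences)
            if any(column)]


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, linear_gap(k), affine_gap(k))
    print(len(final_columns(2)), len(final_columns(3)))
```

With these placeholder costs a single indel is cheaper under the linear model, while a long gap is cheaper under the open/extension model, which is the behaviour the affine score is designed to produce.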

Profile-based progressive multiple alignment: CLUSTALW
  Construct a distance matrix of all pairs of sequences using dynamic programming.
  Progressively align pairs in order of decreasing similarity.
  CLUSTALW uses various heuristics to contribute to accuracy.

Additional material
R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
Course Biological sequence analysis in Spring
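The first step of a progressive scheme, scoring all pairs and ranking them by similarity, can be sketched as below. This is not CLUSTALW's actual procedure: the raw global alignment score is used here as a stand-in for a similarity measure, and the names `global_score` and `pairs_by_similarity` as well as the penalties `mu=1` and `delta=2` are assumptions made for the illustration.

```python
from itertools import combinations


def global_score(a: str, b: str, mu: int = 1, delta: int = 2) -> int:
    """Global alignment score of a and b, keeping only two matrix rows."""
    prev = [-j * delta for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [-i * delta]
        for j, y in enumerate(b, 1):
            s = 1 if x == y else -mu
            cur.append(max(prev[j] - delta,      # x against an indel
                           prev[j - 1] + s,      # x against y
                           cur[j - 1] - delta))  # y against an indel
        prev = cur
    return prev[-1]


def pairs_by_similarity(seqs):
    """All sequence pairs as (score, i, j), most similar pair first."""
    scored = [(global_score(seqs[i], seqs[j]), i, j)
              for i, j in combinations(range(len(seqs)), 2)]
    return sorted(scored, key=lambda item: -item[0])


if __name__ == "__main__":
    sequences = ["agtccgagtcc", "actcggtgc", "agtcgagccccgagtgc"]
    for score, i, j in pairs_by_similarity(sequences):
        print(score, sequences[i], sequences[j])
```

A progressive aligner would start from the first pair in this ranking and add the remaining sequences one at a time.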