Sequence Alignment (chapter )
- The biological problem
- Global alignment
- Local alignment
- Multiple alignment

Introduction to bioinformatics, Autumn

Background: comparative genomics
- Basic question in biology: what properties are shared among organisms?
- Genome sequencing allows comparison of organisms at DNA and protein levels
- Comparisons can be used to
  - Find evolutionary relationships between organisms
  - Identify functionally conserved sequences
  - Identify corresponding genes in human and model organisms: develop models for human diseases

Homologs
- Two genes or characters gB and gC that evolved from the same ancestor gA are called homologs
- Homologs usually exhibit conserved functions
- Close evolutionary relationship => expect a high number of homologs

  gA = agtgtccgttaagtgcgttc
  gB = agtgccgttaaagttgtacgtc
  gC = ctgactgtttgtggttc

Sequence similarity
- Intuitively, the similarity of two sequences refers to the degree of match between corresponding positions in the sequences

  agtgccgttaaagttgtacgtc
  ctgactgtttgtggttc

- What about sequences that differ in length?

Similarity vs homology
- Sequence similarity is not sequence homology
- If the two sequences gB and gC have accumulated enough mutations, the similarity between them is likely to be low

[Figure: the sequence agtgtccgttaagtgcgttc shown after an increasing number of mutations]

Similarity vs homology (2)
- Sequence similarity can occur by chance: similarity does not imply homology
- Similarity is an expected consequence of homology
- Homology is more difficult to detect over greater evolutionary distances
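The intuitive notion of similarity above (degree of match between corresponding positions) can be sketched in a few lines of Python. This is only an illustration: the function name `position_identity` and the choice to reject sequences of different length are assumptions made here, not part of the slides. Rejecting unequal lengths makes the open question explicit: without an alignment, "corresponding positions" are not defined.

```python
def position_identity(x: str, y: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(x) != len(y):
        # Corresponding positions are undefined here; an alignment is needed first.
        raise ValueError("sequences differ in length; align them first")
    if not x:
        return 1.0  # two empty sequences are trivially identical
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


if __name__ == "__main__":
    # Two made-up 10-base sequences that differ at two positions.
    print(position_identity("agtgtccgtt", "agtgcccgta"))  # 0.8
```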
Orthologs and paralogs
- We distinguish between two types of homology
  - Orthologs: homologs from two different species
  - Paralogs: homologs within a species

[Figure: gene duplication within organism A and the resulting genes in organisms B and C]

Orthologs and paralogs (2)
- Orthologs typically retain the original function
- In paralogs, one copy is free to mutate and acquire a new function (no selective pressure)

Sequence alignment
- An alignment specifies which positions in two sequences match

[Figure: three alternative alignments of acgtctag with actctag, each annotated with its number of matches, mismatches and unaligned positions]

Mutations: insertions, deletions and substitutions
- Indel: insertion or deletion of a base with respect to the ancestor sequence
- Insertions and/or deletions are called indels
- Mismatch: substitution (point mutation) of a single base
- We can't tell whether the ancestor sequence had a base or not at an indel position

Problems
- What sorts of alignments should be considered?
- How to score alignments?
- How to find optimal or good scoring alignments?
- How to evaluate the statistical significance of scores?

In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.
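To make the match / mismatch / not-aligned bookkeeping of the alignment slide concrete, here is a minimal Python sketch. The helper name `column_counts`, the use of `-` as the gap symbol, and the three gap placements in the usage example are assumptions chosen for illustration; they are not necessarily the alignments drawn on the slide.

```python
def column_counts(top: str, bottom: str, gap: str = "-") -> dict:
    """Count matching, mismatching and unaligned columns of a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    counts = {"matches": 0, "mismatches": 0, "not_aligned": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            counts["not_aligned"] += 1  # column contains an indel
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


if __name__ == "__main__":
    # Three hypothetical ways to place one gap in actctag.
    for bottom in ("actctag-", "-actctag", "ac-tctag"):
        print(bottom, column_counts("acgtctag", bottom))
```

Moving a single gap changes the counts substantially, which is why a scoring scheme and a search for the best-scoring alignment are needed.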
Global alignment
- Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
- We give a score for each position in the alignment
  - Identity (match): $+1$
  - Substitution (mismatch): $-\mu$
  - Indel: $-\delta$

Representing alignments and scores

[Figure: an example alignment drawn as a path through the alignment matrix, with its score written as a sum of match, mismatch ($\mu$) and indel ($\delta$) terms]

Dynamic programming
- How to find the optimal alignment?
- We use previous solutions for optimal alignments of smaller subsequences
- This general approach is known as dynamic programming

Filling the alignment matrix
- Consider the alignment process at the shaded square $(i, j)$ of the matrix
  - Case 1. Align the current letters of the two sequences against each other (match or substitution)
  - Case 2. Align the current letter of one sequence against an indel in the other
  - Case 3. Align the current letter of the other sequence against an indel in the first

Filling the alignment matrix (2)
- Scoring the alternatives
  - Case 1. $S_{i,j} = S_{i-1,j-1} + s(i, j)$
  - Case 2. $S_{i,j} = S_{i-1,j} - \delta$
  - Case 3. $S_{i,j} = S_{i,j-1} - \delta$
- $s(i, j) = 1$ for matching positions, $s(i, j) = -\mu$ for substitutions
- Choose the case (path) that yields the maximum score
- Keep track of path choices
Global alignment: formal development
- $A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
- Any alignment can be written as a unique path through the matrix
- Score for aligning A and B up to positions i and j: $S_{i,j} = S(a_1 a_2 \ldots a_i,\; b_1 b_2 \ldots b_j)$

Scoring partial alignments
- The alignment of the prefix $a_1 a_2 \ldots a_i$ of A with the prefix $b_1 b_2 \ldots b_j$ of B can end in three ways
  - Case 1: $(a_1 \ldots a_{i-1})\, a_i$ over $(b_1 \ldots b_{j-1})\, b_j$
  - Case 2: $(a_1 \ldots a_{i-1})\, a_i$ over $(b_1 \ldots b_j)\, -$
  - Case 3: $(a_1 \ldots a_i)\, -$ over $(b_1 \ldots b_{j-1})\, b_j$

Scoring alignments
- Scores for each case:
  - $s(a_i, b_j) = +1$ if $a_i = b_j$, and $-\mu$ otherwise
  - $s(a_i, -) = s(-, b_j) = -\delta$

Scoring alignments (2)
- The first row and first column correspond to an initial alignment against indels: $S(i, 0) = -i\delta$, $S(0, j) = -j\delta$
- Optimal global alignment score: $S(A, B) = S_{n,m}$

Algorithm for global alignment
  Input: sequences A and B, n = |A|, m = |B|
  Set $S_{i,0} := -\delta i$ for all i
  Set $S_{0,j} := -\delta j$ for all j
  for i := 1 to n
    for j := 1 to m
      $S_{i,j} := \max\{ S_{i-1,j} - \delta,\; S_{i-1,j-1} + s(a_i, b_j),\; S_{i,j-1} - \delta \}$
    end
  end
- The algorithm takes O(nm) time and space

Global alignment: example

[Figure: alignment matrix being filled for two example sequences with given values of $\mu$ and $\delta$]
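The global alignment recursion and the "keep track of path choices" step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function name `global_alignment`, the default penalties `mu=1` and `delta=2`, and the tie-breaking order in the traceback (diagonal first, then up, then left) are all choices made here.

```python
def global_alignment(a: str, b: str, mu: int = 1, delta: int = 2):
    """Needleman-Wunsch style global alignment.

    Scores: +1 for a match, -mu for a mismatch, -delta per indel.
    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i  # a[:i] against indels only
    for j in range(1, m + 1):
        S[0][j] = -delta * j  # b[:j] against indels only

    def s(i: int, j: int) -> int:
        return 1 if a[i - 1] == b[j - 1] else -mu

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(i, j),  # case 1: a_i over b_j
                          S[i - 1][j] - delta,        # case 2: a_i over indel
                          S[i][j - 1] - delta)        # case 3: indel over b_j

    # Traceback from S[n][m]; ties are broken as diagonal, up, left (assumption).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or S[i][j] == S[i - 1][j] - delta):
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = global_alignment("acgtctag", "actctag")
    print(score)
    print(top)
    print(bottom)
```

The full matrix is kept so that the path can be recovered, which matches the O(nm) time and space stated for the algorithm.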
Global alignment: example (2)

[Figure: completed alignment matrix for the example, with given values of $\mu$ and $\delta$]

Local alignment: rationale
- Otherwise dissimilar proteins may have local regions of similarity => the proteins may share a function
- Human bone morphogenic protein receptor type II precursor (left) has a region that resembles a region in the TGF-β receptor (right). The shared function here is protein kinase.

Local alignment: rationale (2)

[Figure: sequences A and B with their regions of similarity marked]

- Global alignment would be inadequate
- Problem: find the highest scoring local alignment between two sequences
- The previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

From global to local alignment
- Modifications to the global alignment algorithm
  - Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix)
  - Allow preceding and trailing indels without penalty

Scoring local alignments
- $A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
- Let I and J be intervals (substrings) of A and B, respectively: $I \subset A$, $J \subset B$
- Best local alignment score: $M(A, B) = \max\{ S(I, J) : I \subset A,\ J \subset B \}$, where $S(I, J)$ is the score for substrings I and J
Allowing preceding and trailing indels
- First row and column initialised to zero: $M_{i,0} = M_{0,j} = 0$

Recursion for local alignment
- $M_{i,j} = \max\{ M_{i-1,j-1} + s(a_i, b_j),\; M_{i-1,j} - \delta,\; M_{i,j-1} - \delta,\; 0 \}$

Finding best local alignment
- The optimal score is the highest value in the matrix: $M(A, B) = \max_{i,j} M_{i,j}$
- The best local alignment can be found by backtracking from the highest value in M

Local alignment: example

[Figure: dynamic programming matrix for a local alignment example, with a scoring scheme giving values for match, mismatch and indel]

Nonuniform mismatch penalties
- We used a uniform penalty for mismatches: $s(A, C) = s(A, G) = \ldots = s(G, T) = -\mu$
- Transition mutations (A→G, G→A, C→T, T→C) are approximately twice as frequent as transversions (purine to pyrimidine changes or the reverse, such as A→T or A→C)
- => use nonuniform mismatch penalties
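The local alignment recursion and the backtracking step from the highest matrix value can be sketched in Python as below. As before, this is an illustration under assumptions: the name `local_alignment`, the uniform scoring with defaults `mu=1` and `delta=2`, stopping the backtrack at the first zero cell, and returning the first maximum found in row-major order are choices made here rather than details taken from the slides.

```python
def local_alignment(a: str, b: str, mu: int = 1, delta: int = 2):
    """Smith-Waterman style local alignment.

    Scores: +1 for a match, -mu for a mismatch, -delta per indel.
    Returns (score, aligned_part_of_a, aligned_part_of_b).
    """
    n, m = len(a), len(b)
    # First row and column stay zero: leading indels cost nothing.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0

    def s(i: int, j: int) -> int:
        return 1 if a[i - 1] == b[j - 1] else -mu

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j),
                          M[i - 1][j] - delta,
                          M[i][j - 1] - delta,
                          0)  # start a new local alignment here
            if M[i][j] > best:
                best, best_i, best_j = M[i][j], i, j

    # Backtrack from the highest value until a zero cell is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + s(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Made-up sequences sharing only the region "acgt".
    print(local_alignment("ggggacgtcc", "ttacgttt"))
```

A nonuniform mismatch penalty could be added by replacing the helper `s` with a lookup that distinguishes transitions from transversions; that variant is not shown here.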
Gaps in alignment
- A gap is a succession of indels in an alignment
- The previous model scored a gap of length k as $w(k) = -k\delta$
- Replication processes may produce longer stretches of insertions or deletions
- In coding regions, insertions or deletions of codons may preserve functionality

Gap open and extension penalties
- We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it: $w(k) = -\alpha - \beta(k - 1)$
- $\alpha$: gap open cost, $\beta$: gap extension cost
- Our previous algorithm can be extended to use $w(k)$ (not discussed on this course)

Multiple alignment
- Consider the set of n sequences below
  - Orthologous sequences from different organisms
  - Paralogs from multiple duplications
- How can we study relationships between these sequences?

  aggcgagctgcgagtgcta
  cgttagattgacgctgac
  ttccggctgcgac
  gacacggcgaacgga
  agtgtgcccgacgagcgaggac
  gcgggctgtgagcgcta
  aagcggcctgtgtgcccta
  atgctgctgccagtgta
  agtcgagccccgagtgc
  agtccgagtcc
  actcggtgc

Optimal alignment of three sequences
- The alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$: $2^2 - 1 = 3$ alternatives
- The alignment of A, B and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$
- Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space
- Generalizes to n sequences, but is impractical even with a moderate number of sequences

Multiple alignment in practice
- In practice, real-world multiple alignment problems are usually solved with heuristics
- Progressive multiple alignment
  - Choose two sequences and align them
  - Choose a third sequence w.r.t. the two previous sequences and align the third against them
  - Repeat until all sequences have been aligned
  - There are different options for how to choose sequences and score alignments
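The count of ways an alignment can end, stated under "Optimal alignment of three sequences", can be checked by enumeration. The sketch below is an illustration only: the function name `last_column_alternatives` and the encoding (True for "contributes its last letter", False for "indel") are assumptions made here.

```python
from itertools import product


def last_column_alternatives(n_sequences: int) -> list:
    """All possible last columns of an alignment of n sequences.

    Each sequence either contributes its last letter (True) or an indel
    (False); the column consisting only of indels is not allowed.
    """
    return [column
            for column in product((True, False), repeat=n_sequences)
            if any(column)]


if __name__ == "__main__":
    for n in (2, 3, 4, 10):
        # Each dynamic programming cell takes a maximum over this many cases.
        print(n, len(last_column_alternatives(n)))
```

The number of cases per cell grows as $2^n - 1$, and the number of cells grows exponentially with the number of sequences as well, which is why heuristics such as progressive alignment are used in practice.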
Multiple alignment in practice (2)
- Profile-based progressive multiple alignment: CLUSTALW
  - Construct a distance matrix of all pairs of sequences using dynamic programming
  - Progressively align pairs in order of decreasing similarity
  - CLUSTALW uses various heuristics to contribute to accuracy

Additional material
- R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
- Course Biological sequence analysis in Spring