SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA



[Figure: double helix structure]

Codons

Codons are triplets of bases from the RNA sequence. Each triplet specifies an amino acid or a stop signal.

The standard genetic code

(The table is written with DNA bases; in RNA, U replaces T.)

  TTT Phe    TCT Ser    TAT Tyr     TGT Cys
  TTC Phe    TCC Ser    TAC Tyr     TGC Cys
  TTA Leu    TCA Ser    TAA STOP    TGA STOP
  TTG Leu    TCG Ser    TAG STOP    TGG Trp
  CTT Leu    CCT Pro    CAT His     CGT Arg
  CTC Leu    CCC Pro    CAC His     CGC Arg
  CTA Leu    CCA Pro    CAA Gln     CGA Arg
  CTG Leu    CCG Pro    CAG Gln     CGG Arg
  ATT Ile    ACT Thr    AAT Asn     AGT Ser
  ATC Ile    ACC Thr    AAC Asn     AGC Ser
  ATA Ile    ACA Thr    AAA Lys     AGA Arg
  ATG Met    ACG Thr    AAG Lys     AGG Arg
  GTT Val    GCT Ala    GAT Asp     GGT Gly
  GTC Val    GCC Ala    GAC Asp     GGC Gly
  GTA Val    GCA Ala    GAA Glu     GGA Gly
  GTG Val    GCG Ala    GAG Glu     GGG Gly

Nucleotides and amino acids

The four nucleotides in DNA (RNA):

  A  adenine
  G  guanine
  C  cytosine
  T  thymine (U  uracil)

The twenty amino acids in proteins:

  A  alanine          G  glycine       M  methionine    S  serine
  C  cysteine         H  histidine     N  asparagine    T  threonine
  D  aspartic acid    I  isoleucine    P  proline       V  valine
  E  glutamic acid    K  lysine        Q  glutamine     W  tryptophan
  F  phenylalanine    L  leucine       R  arginine      Y  tyrosine
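To make the codon table concrete, here is a minimal Python sketch that builds the standard genetic code as a lookup table and translates a DNA string codon by codon. The names BASES, AMINO_ACIDS, CODON_TABLE and translate are illustrative and not part of the slides; the 64-letter string lists the one-letter amino-acid codes in T, C, A, G order, with '*' standing for STOP.

```python
from itertools import product

# Assumption: codons are enumerated in T, C, A, G order (first base varies slowest).
BASES = "TCAG"
# One-letter amino-acid codes for the 64 codons in that order; '*' marks STOP.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA (or RNA) string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    usable = len(dna) - len(dna) % 3     # drop 1 or 2 leftover bases at the end
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("GCTGAACGATTCGTTACT"))
```

Applied to the 18-base DNA string GCTGAACGATTCGTTACT, the sketch reads six codons (GCT, GAA, CGA, TTC, GTT, ACT). It does not stop at STOP codons; it simply emits '*' for them.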

Sequences

DNA sequence: GCTGAACGATTCGTTACT
Amino-acid sequence: MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYS

Goals of sequence alignment

Given two (nucleotide or amino acid) sequences, we want to:

  measure their similarity;
  determine the correspondences between elements of the sequences;
  observe patterns of sequence conservation and variability over time.

Definition of sequence alignment

Given an alphabet A, a string is a finite sequence of letters from A.
Example: GCTGAACG (DNA alphabet)

Sequence alignment is the assignment of letter-letter correspondences between two or more strings from a given alphabet.

Changes in alphabets

Operations on single letters:

  exchange of a single letter for another (point mutation)
  insertion of a single letter
  deletion of a single letter

Pairwise sequence alignment is the process of transforming one sequence into another by repeated application of these three operations on single letters.

Example alignment / Different alignment notations

Without gaps:

  G C T G A A C G
  C T A T A A T C

[Figure: alignments of GCTGAACG and CTATAATC with gaps, shown in different notations]

Optimal alignment

To decide which alignment is the best of all possibilities, we need:

  1. A way to systematically examine all possible alignments
  2. A score for each possible alignment, which reflects the similarity of the two sequences

The optimal alignment will then be the one(s) with the highest score. Note that there may be more than one optimal alignment.

Dotplot

[Figure: dotplot of DOROTHYCROWFOOTHODGKIN (horizontal) against DOROTHYHODGKIN (vertical)]

Dotplot

[Figure: dotplot involving the sequence ABRACADABRACADABRA]

Dotplot

[Figure: dotplot of the amino acid sequence of SLIT protein of Drosophila melanogaster (fruit fly)]

Web tool Dotlet: http://myhits.isb-sib.ch/cgi-bin/dotlet

Dotplot: filtering

To avoid very short stretches, or many small gaps along stretches of matches, one may use the filtering parameters window and threshold.

A dot will appear in a cell of the dotplot if that cell is in the center of a stretch of characters of length window such that the number of matches is at least threshold.

Another option is to give the cell a color (or grey value), such that the higher the number of matches in the window, the more intense the color becomes.

Dotplot with filtering

[Figure: two dotplots, with w = 1, t = 1 and with w = 11, t = 5] Dotplot with window w and threshold t of the amino acid sequence of the protein pancreatic ribonuclease of the horse.
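The window/threshold rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by Dotlet: the function names dotplot and render are made up here, rows are indexed by the first sequence and columns by the second, the window runs along the diagonal through each cell, and window positions that fall outside either sequence count as non-matches.

```python
def dotplot(s: str, t: str, window: int = 1, threshold: int = 1) -> list[list[bool]]:
    """Return a len(s) x len(t) grid; cell (i, j) is True when the diagonal
    window centred on (i, j) contains at least `threshold` matches."""
    half = window // 2
    plot = []
    for i in range(len(s)):
        row = []
        for j in range(len(t)):
            matches = 0
            # Walk along the diagonal through (i, j), `window` cells in total.
            for k in range(-half, window - half):
                a, b = i + k, j + k
                if 0 <= a < len(s) and 0 <= b < len(t) and s[a] == t[b]:
                    matches += 1
            row.append(matches >= threshold)
        plot.append(row)
    return plot


def render(s: str, t: str, plot: list[list[bool]]) -> str:
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(t)]
    for letter, row in zip(s, plot):
        lines.append(letter + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    s, t = "DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"
    print(render(s, t, dotplot(s, t)))                         # unfiltered: w = 1, t = 1
    print(render(s, t, dotplot(s, t, window=3, threshold=3)))  # filtered
```

With window = 1 and threshold = 1 every single-letter match produces a dot; raising both removes isolated dots and leaves the longer diagonal runs.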

Measures of sequence similarity

Functions that associate a numeric value with a pair of sequences:

  1. similarity measure: higher value ⇒ greater similarity
  2. distance function: larger distance ⇒ smaller similarity (a distance function is a dissimilarity measure)

Hamming distance

For two strings of equal length, their Hamming distance is the number of character positions in which they differ.

  s: A G T C                 s: A G C A C A C A
  t: C G T A                 t: A C A C A C T A
  Hamming distance = 2       Hamming distance = 6

Disadvantage: a shift of just one position leads to a large Hamming distance.

Edit distance

Distance can be based on the number of edit operations required to change one string into the other. Here an edit operation is a deletion, insertion or alteration of a single character in either sequence:

  (a, a)   match (no change from s to t)
  (a, -)   deletion of character a (in s); indicated by - in t
  (a, b)   replacement of a (in s) by b (in t), where a ≠ b
  (-, b)   insertion of character b (in s); indicated by - in s

For the DNA alphabet: a ∈ {A, C, T, G}, b ∈ {A, C, T, G}.

Alignments with gaps

Input:

  s: A G C A C A C A
  t: A C A C A C T A

Alignment 1:

  s: A G C A C A C - A
  t: A - C A C A C T A

Alignment 2:

[Figure: a second alignment of s and t with gaps]
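As a small illustration of the Hamming distance defined above, the following Python sketch counts the positions at which two equal-length strings differ. The function name hamming_distance is an illustrative choice.

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    print(hamming_distance("AGTC", "CGTA"))          # first example pair
    print(hamming_distance("AGCACACA", "ACACACTA"))  # second example pair
```

The second pair shows the disadvantage noted above: the two strings share the long run CACAC, but because it is shifted by one position most columns disagree.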

Protocol of edit operations

Alignment 1:

  s: A G C A C A C - A
  t: A - C A C A C T A

  Match   (A, A)
  Delete  (G, -)
  Match   (C, C)
  Match   (A, A)
  Match   (C, C)
  Match   (A, A)
  Match   (C, C)
  Insert  (-, T)
  Match   (A, A)

Unit cost model

Assign a cost or weight w to each operation. For example:

  match:                w(a, a) = 0
  replacement:          w(a, b) = 1 for a ≠ b
  deletion/insertion:   w(a, -) = w(-, b) = 1

This scheme is known as the Levenshtein distance, also called the unit cost model.

Edit distance

Given a cost function w for single operations:

  1. The cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations needed to transform s to t.
  2. An optimal alignment of s and t is an alignment which has minimal cost among all possible alignments.
  3. The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w. We denote it by d_w(s, t).

Edit distance: examples

Alignment 1: cost = 2

  s: A G C A C A C - A
  t: A - C A C A C T A

Alignment 2: cost = 4

[Figure: a second alignment of s and t with gaps]

Alignment 1 is optimal under the unit cost model ⇒ edit distance d_w(s, t) = 2.
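The cost of a given alignment under the unit cost model can be checked mechanically. The sketch below assumes an alignment is written as two equal-length rows with '-' marking a gap; GAP, unit_cost and alignment_cost are illustrative names, not taken from the slides.

```python
GAP = "-"


def unit_cost(a: str, b: str) -> int:
    """Unit cost model: 0 for a match, 1 for a replacement, deletion or insertion."""
    return 0 if a == b else 1


def alignment_cost(row_s: str, row_t: str, w=unit_cost) -> int:
    """Sum the cost w over the columns of an alignment given as two gapped rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot pair two gaps")
        total += w(a, b)
    return total


if __name__ == "__main__":
    # Alignment 1 from the protocol of edit operations above.
    print(alignment_cost("AGCACAC-A", "A-CACACTA"))
```

Passing a different function as w gives the cost of the same alignment under another cost model.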

Scoring functions

Some changes in nucleotide or amino acid sequences are more likely than others, so we assign variable weights to different edit operations. This leads to the concept of scoring functions or substitution matrices.

A substitution matrix is a square array of values which indicate the scores associated with possible transitions (replacements, insertions, deletions).

One uses either similarity scores or dissimilarity scores (such as edit distance).

Similarity scoring schemes for DNA sequences

Percent identity substitution matrices:

  99% identity                   50% identity

       A    T    G    C               A    T    G    C
  A   +1   -3   -3   -3          A   +3   -2   -2   -2
  T   -3   +1   -3   -3          T   -2   +3   -2   -2
  G   -3   -3   +1   -3          G   -2   -2   +3   -2
  C   -3   -3   -3   +1          C   -2   -2   -2   +3

Substitutions that are more likely get a higher similarity score or, equivalently, a smaller dissimilarity score.

Dotplots and sequence alignment

[Figure: dotplot of DOROTHYCROWFOOTHODGKIN (horizontal) against DOROTHYHODGKIN (vertical), with the corresponding alignment of the two sequences]

Any path through this dotplot from upper left to lower right, moving at each cell only East, South or Southeast, corresponds to an alignment.

Dynamic programming: optimal substructure property

If M is a point on an optimal path π[S → T] (solid line), then π[S → M] and π[M → T] are also optimal paths. The cost of the dotted path from S to M cannot be smaller than the cost of the solid path from S to M.

[Figure: optimal path from S to T through M (solid line) and an alternative path from S to M (dotted line)]
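Returning to the percent identity matrices given earlier in this section, the following Python sketch shows one way to store such a matrix and to score an ungapped pairing of two DNA strings with it. The names identity_matrix and similarity_score are illustrative; gap scores are left out because the matrices above only define scores for replacements.

```python
def identity_matrix(match: int, mismatch: int, alphabet: str = "ATGC") -> dict:
    """Build a substitution matrix with one score for matches and one for mismatches."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


def similarity_score(s: str, t: str, matrix: dict) -> int:
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(s) != len(t):
        raise ValueError("an ungapped alignment needs strings of equal length")
    return sum(matrix[a, b] for a, b in zip(s, t))


if __name__ == "__main__":
    pam_99 = identity_matrix(match=+1, mismatch=-3)  # "99% identity" matrix
    pam_50 = identity_matrix(match=+3, mismatch=-2)  # "50% identity" matrix
    print(similarity_score("GCTGAACG", "CTATAATC", pam_99))
    print(similarity_score("GCTGAACG", "CTATAATC", pam_50))
```

The two strings agree in 2 of 8 columns, so the stricter matrix penalises the pairing much more heavily than the one designed for more divergent sequences.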

Edit distance: recursive computation

Create a matrix D with elements D(i, j), i = 1, 2, ..., n and j = 1, 2, ..., m, such that D(i, j) is the minimal edit distance between the sequences that consist of the first i characters of s and the first j characters of t. (Here i is the row index, running from top to bottom; j is the column index, running from left to right.)

Then D(n, m) will be the minimal edit distance between the full sequences s and t.

For initialization, we need to add an extra row D(0, j), j = 0, 1, 2, ..., m, and an extra column D(i, 0), i = 0, 1, 2, ..., n, to the matrix:

\[
D = \begin{pmatrix}
D_{00} & D_{01} & \cdots & D_{0m} \\
D_{10} & D_{11} & \cdots & D_{1m} \\
\vdots & \vdots & \ddots & \vdots \\
D_{n0} & D_{n1} & \cdots & D_{nm}
\end{pmatrix}
\]

Steps in the dotplot matrix

Each step in the matrix which arrives in cell (i, j) can be of three types:

  East        (previous cell was (i, j-1))
  South       (previous cell was (i-1, j))
  SouthEast   (previous cell was (i-1, j-1))

  edit operation                  step in matrix           cost
  substitution of a_i by b_j      (i-1, j-1) → (i, j)      w(a_i, b_j)
  deletion of a_i from s          (i-1, j) → (i, j)        w(a_i, -)
  deletion of b_j from t          (i, j-1) → (i, j)        w(-, b_j)

(Deleting b_j from t is the same as inserting b_j into s.)

Recursion

Three paths arrive at cell (i, j): the optimal paths from (0, 0) to (i-1, j-1), (i-1, j), or (i, j-1), each followed by one step:

  step                       cost
  (i-1, j-1) → (i, j)        D(i-1, j-1) + w(a_i, b_j)
  (i-1, j) → (i, j)          D(i-1, j) + w(a_i, -)
  (i, j-1) → (i, j)          D(i, j-1) + w(-, b_j)

The minimum of these is the cost D(i, j) of the optimal path from (0, 0) to (i, j). Recursion:

\[
D(i, j) = \min \{\, D(i-1, j-1) + w(a_i, b_j),\; D(i-1, j) + w(a_i, -),\; D(i, j-1) + w(-, b_j) \,\}
\]

Initialization

On the top row and left column of the matrix we have no North or West neighbours, respectively. So here we have to initialize values:

\[
D(i, 0) = \sum_{k=1}^{i} w(a_k, -), \qquad D(0, j) = \sum_{k=1}^{j} w(-, b_k)
\]

which impose the gap penalty on unmatched characters at the beginning of either sequence.

In practice one often uses a constant gap penalty: w(a_k, -) = w(-, b_k) = g.

Retrieving the optimal path(s)

For each cell (i, j), store a pointer (an arrow) to one of the three cells (i-1, j-1), (i-1, j) or (i, j-1) that provided the minimal value. This cell is called the predecessor of (i, j).

If more than one cell provided the minimal value (remember that optimal paths need not be unique), we store a pointer to each of these cells.

Needleman-Wunsch algorithm

  INPUT: two sequences s = a_1 a_2 ... a_n and t = b_1 b_2 ... b_m; cost function w with gap penalty g
  OUTPUT: matrix D containing the minimal edit distance between the sequences s and t

  for i = 0 to n do
      D(i, 0) ← g · i
  end for
  for j = 0 to m do
      D(0, j) ← g · j
  end for
  for i = 1 to n do
      for j = 1 to m do
          Match ← D(i-1, j-1) + w(a_i, b_j)
          Delete ← D(i-1, j) + g
          Insert ← D(i, j-1) + g
          D(i, j) ← min(Match, Insert, Delete)
      end for
  end for

Example

Alignment of sequences s = GGAATGG and t = ATG with scoring scheme:

  w(a, a) = 0               (match)
  w(a, b) = 4 for a ≠ b     (mismatch)
  w(a, -) = w(-, b) = 5     (gap insertion)
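The pseudocode and the trace-back idea above can be sketched in Python as follows. The function names needleman_wunsch and optimal_alignments, and the helper w in the usage part, are illustrative. One implementation choice differs from the description: instead of storing pointers while filling the matrix, the trace-back recomputes which predecessors attain the minimum, which yields the same set of paths.

```python
def needleman_wunsch(s: str, t: str, w, g: int) -> list[list[int]]:
    """Fill the (n+1) x (m+1) matrix D of minimal edit distances between prefixes."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):            # left column: i leading gaps
        D[i][0] = g * i
    for j in range(m + 1):            # top row: j leading gaps
        D[0][j] = g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = D[i - 1][j - 1] + w(s[i - 1], t[j - 1])
            delete = D[i - 1][j] + g
            insert = D[i][j - 1] + g
            D[i][j] = min(match, insert, delete)
    return D


def optimal_alignments(s: str, t: str, w, g: int, D) -> list[tuple[str, str]]:
    """Enumerate every optimal alignment by following all minimal predecessors."""
    results = []

    def walk(i: int, j: int, row_s: str, row_t: str) -> None:
        if i == 0 and j == 0:
            results.append((row_s[::-1], row_t[::-1]))  # rows were built back to front
            return
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, row_s + s[i - 1], row_t + t[j - 1])
        if i > 0 and D[i][j] == D[i - 1][j] + g:
            walk(i - 1, j, row_s + s[i - 1], row_t + "-")
        if j > 0 and D[i][j] == D[i][j - 1] + g:
            walk(i, j - 1, row_s + "-", row_t + t[j - 1])

    walk(len(s), len(t), "", "")
    return results


if __name__ == "__main__":
    def w(a: str, b: str) -> int:
        return 0 if a == b else 4     # match 0, mismatch 4, as in the example

    s, t, g = "GGAATGG", "ATG", 5
    D = needleman_wunsch(s, t, w, g)
    for row in D:
        print(row)
    for row_s, row_t in optimal_alignments(s, t, w, g, D):
        print(row_s, row_t, sep="\n", end="\n\n")
```

For the example s = GGAATGG, t = ATG with mismatch cost 4 and gap penalty 5, the sketch gives D(7, 3) = 20 and four co-optimal alignments. The enumeration is exponential in the worst case, so for long sequences one would normally report a single trace-back path.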

Example (continued)

Matrix D(i, j) after initialization and the first diagonal step:

  s \ t         A     T     G
          0     5    10    15
  G       5     4
  G      10
  A      15
  A      20
  T      25
  G      30
  G      35

NB: to be consistent with the definition of the matrix D(i, j), sequence s is plotted vertically, sequence t horizontally.

Example (continued)

Matrix after termination:

  s \ t         A     T     G
          0     5    10    15
  G       5     4     9    10
  G      10     9     8     9
  A      15    10    13    12
  A      20    15    14    17
  T      25    20    15    18
  G      30    25    20    15
  G      35    30    25    20

Red arrows in the figure indicate the trace-back paths of the optimal alignments.

Example (continued)

There are two cells where the trace-back path branches ⇒ four optimal alignments with equal score:

  G G A A T G G      G G A A T G G      G G A A T G G      G G A A T G G
  - - - A T G -      - - A - T G -      - - - A T - G      - - A - T - G

Sequence logos

Graphical display of a multiple alignment, with colored stacks of letters representing nucleotides or amino acids at successive positions. The height of a letter at a given position increases with the frequency of that nucleotide or amino acid at that position.

[Figure: sequence logo of human exon-intron splice boundaries. © http://weblogo.berkeley.edu]
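The per-position letter frequencies that drive letter heights in a sequence logo can be computed directly from a multiple alignment. The sketch below is only an illustration: column_frequencies is a made-up name, the four short sequences are hypothetical toy data rather than real splice sites, and logo tools such as WebLogo additionally scale each stack by its information content, which is not shown here.

```python
from collections import Counter


def column_frequencies(alignment: list[str]) -> list[dict[str, float]]:
    """For each column of a multiple alignment, return the relative frequency of each letter."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of a multiple alignment must have the same length")
    frequencies = []
    for column in zip(*alignment):    # iterate over columns rather than rows
        counts = Counter(column)
        frequencies.append(
            {letter: count / len(column) for letter, count in counts.items()}
        )
    return frequencies


if __name__ == "__main__":
    toy_alignment = ["ATG", "ACG", "ATG", "TTG"]  # hypothetical example data
    for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
        print(position, freqs)
```

A fully conserved column, such as the third one in the toy data, would appear in a logo as a single tall letter.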