CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE
"Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution." (Wikipedia)

COMPARING SEQUENCES
- a cornerstone of sequence analysis
- aims to identify how sequences are related
- ONLY homologous sequences (derived from the same ancestor) can be meaningfully compared
- homologous sequences should (but need not) have similar functions and similar sequences

HOMOLOGY IN STRUCTURES
- structures can have similar shapes for two reasons: homology and homoplasy
- homology = the structures share the same ancestor
- homoplasy = similar structures that are not derived from the same ancestor

HOMOLOGY IN SEQUENCES
ACTGTACTCGCATCG
ACTATACTCTCATTG  species A
ACTGTTCTCCCATCA  species B

DEGREES OF HOMOLOGY
- Homology is qualitative!
- Paralogs: homologous genes that diverged from each other after gene duplication
- Orthologs: homologous genes that originated from a single ancestral gene and diverged after speciation
- Xenologs: homologous genes acquired via horizontal gene transfer (HGT)
Koonin (2005) Annu. Rev. Genet.

SEQUENCE ALIGNMENT
ACTATACTCTCATTG
ACTGTTCTCCCATCA

DOT PLOT
sequence 1  ACCTCGTGCA
sequence 2  ACTTAGTCCA
[Figure: dot plot with sequence 1 on the horizontal axis and sequence 2 on the vertical axis]
Resulting alignment:
sequence 1  ACCT-CGTGC-A
sequence 2  AC-TTAGT-CCA
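
A dot plot like the one on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used in the lecture: the function names `dot_matrix` and `render` are made up for this example, and a dot is placed wherever two single residues are identical (no window, no threshold).

```python
def dot_matrix(seq1, seq2):
    """Return a grid with True wherever seq1[j] == seq2[i].

    Rows follow seq2 (vertical axis), columns follow seq1 (horizontal axis).
    """
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Draw the dot plot as text: '*' for a match, '.' otherwise."""
    grid = dot_matrix(seq1, seq2)
    lines = ["  " + " ".join(seq1)]  # header row: residues of seq1
    for base, row in zip(seq2, grid):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # The two sequences shown on the slide.
    print(render("ACCTCGTGCA", "ACTTAGTCCA"))
```

Even for these two 10-residue sequences, about a quarter of the cells contain a dot, which hints at the background problem that longer sequences run into.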

DOT PLOT
[Figure: dot plot of seq_1 against seq_2 with a dense background of dots]
- too many dots (high background) = no information
- How can we handle this problem?

GENERAL PARAMETERS FOR DOT PLOT
- Window size = length of the subsequences being compared
- Window sliding = step by which the window is moved
- Threshold (or mismatch limit) = cut-off for drawing a dot (normally a similarity score is used as the cut-off)
Example (the windows marked on the slide score 70%, 70% and 80%):
TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTG
TAAATGGCAGTACAGCTGTTAGGCCCATCGTGGCTAAGATCAGGCTCCAAATAGGTCCAGTTCCC
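
The three parameters can be combined into a windowed dot plot roughly as follows. This is a sketch under assumptions: the function name `windowed_dots`, its default values, and the use of plain percent identity as the window score are choices made for this example, not settings taken from the lecture.

```python
def windowed_dots(seq1, seq2, window=10, step=1, threshold=0.7):
    """Return (i, j) window start positions that reach the identity threshold.

    i indexes seq1 and j indexes seq2; both windows have length `window`.
    `step` is the window sliding rate.
    """
    hits = []
    for i in range(0, len(seq1) - window + 1, step):
        w1 = seq1[i:i + window]
        for j in range(0, len(seq2) - window + 1, step):
            w2 = seq2[j:j + window]
            # Fraction of identical positions inside the two windows.
            matches = sum(a == b for a, b in zip(w1, w2))
            if matches / window >= threshold:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    s1 = "TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTG"
    s2 = "TAAATGGCAGTACAGCTGTTAGGCCCATCGTGGCTAAGATCAGGCTCCAAATAGGTCCAGTTCCC"
    for i, j in windowed_dots(s1, s2, window=10, threshold=0.7):
        print(i, j)
```

Raising `window` or `threshold` removes background dots at the cost of sensitivity. The first 10-residue windows of the two slide sequences share 7 of 10 positions, so they pass a 70% cut-off.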

PRACTICAL HINTS FOR DOT PLOT
- a window of 10-20 residues is a good place to start
- when comparing very large sequences, larger windows (>30 to about 100 residues) may be useful
- a good practical rule is to make plots that have 3-5 times as many dots as the length of the sequences (e.g., 3000-5000 dots for a 1000-base sequence)

DOT PLOT
[Figure: dot plot of sequence 1 against sequence 2]
- horizontal offsets (indels)

INTERPRETATION OF DOT PLOT (1)
[Figure: dot plot of sequence 1 against sequence 2]
- highly similar: single diagonal line
- needs noise (or background) reduction

INTERPRETATION OF DOT PLOT (2)
[Figure: dot plot of sequence 1 against sequence 2]
- domain identification

EXON AND INTRON
http://myhits.isb-sib.ch/util/dotlet

INTERPRETATION OF DOT PLOT (3)
[Figure: dot plot of sequence 1 against sequence 2]
- inversion

INTERPRETATION OF DOT PLOT (4)
[Figure: dot plot of sequence 1 against sequence 2]
- repeat

REPEATED PROTEIN DOMAINS
http://myhits.isb-sib.ch/util/dotlet

INTERPRETATION OF DOT PLOT (5)
[Figure: dot plot of sequence 1 against sequence 2]
- palindromic sequence

TERMINATORS AND OTHER STEM-LOOP STRUCTURES
http://myhits.isb-sib.ch/util/dotlet

INTERPRETATION OF DOT PLOT (6)
[Figure: dot plot of sequence 1 against sequence 2]
- low-complexity regions, e.g. AAAAAAAAAAAAAA

LOW-COMPLEXITY REGIONS
Plasmodium falciparum serine-repeat antigen protein precursor
http://myhits.isb-sib.ch/util/dotlet

GAPS IN ALIGNMENT
- gaps do not exist in nature
- gaps make the comparison difficult
- a gap in a sequence alignment most likely represents an indel
- the accuracy of the alignment determines the accuracy of the inferred indels
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
gap ~ indel (insertion/deletion)

SCORING PAIRWISE SEQUENCE ALIGNMENT FOR DNA SEQUENCES
The easiest scoring method is match scoring:
seq1  ATTCGTCGTAGCTAGGCTAA
seq2  ATTGGCCGTACCATGGATAA
match = 14 positions
similarity score = 14
Normalized score (same two sequences):
match = 14 positions
mismatch = 6 positions
total length = 20 positions
similarity score = 14/20 = 70%
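
The match score and its normalized form can be checked with a short Python sketch. The function names `match_score` and `percent_identity` are chosen for this example, and the sequences are assumed to be already aligned and of equal length.

```python
def match_score(seq1, seq2):
    """Count positions at which two aligned sequences carry the same character."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(a == b for a, b in zip(seq1, seq2))


def percent_identity(seq1, seq2):
    """Normalize the match score by the alignment length (in percent)."""
    return 100.0 * match_score(seq1, seq2) / len(seq1)


if __name__ == "__main__":
    seq1 = "ATTCGTCGTAGCTAGGCTAA"
    seq2 = "ATTGGCCGTACCATGGATAA"
    print(match_score(seq1, seq2))       # 14
    print(percent_identity(seq1, seq2))  # 70.0
```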

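SCORING PAIRWISE SEQUENCE ALIGNMENT FOR PROTEIN SEQUENCES
MAATPTVLLFWKLLDEVFMA
MAVTPLVLFFWKLVDEVFMA
80% identity (16 of 20 positions are identical)
- idea = replacing an amino acid with one that has the same physicochemical properties is unlikely to change the structure of the protein
- two of the four mismatches (marked + on the slide) are substitutions between similar amino acids
90% similarity (18 of 20 positions are identical or similar)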
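
The difference between identity and similarity can be illustrated with a rough Python sketch. The physicochemical grouping below (`GROUPS`) is an assumption made only for this example; the slide does not say which grouping or matrix was used, and in practice similarity is normally read from a substitution matrix rather than from fixed groups.

```python
# Hypothetical grouping of amino acids by physicochemical property
# (an illustrative assumption, not the lecture's definition).
GROUPS = ["AVLIM", "FWY", "STNQC", "KRH", "DE", "GP"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def identity_and_similarity(seq1, seq2):
    """Return (percent identity, percent similarity) for two aligned proteins.

    A position counts as similar when both residues fall in the same group,
    so identical positions are also counted as similar.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    n = len(seq1)
    identical = sum(a == b for a, b in zip(seq1, seq2))
    similar = sum(GROUP_OF[a] == GROUP_OF[b] for a, b in zip(seq1, seq2))
    return 100.0 * identical / n, 100.0 * similar / n


if __name__ == "__main__":
    print(identity_and_similarity("MAATPTVLLFWKLLDEVFMA", "MAVTPLVLFFWKLVDEVFMA"))
```

With this particular grouping, the A/V and L/V substitutions count as similar while T/L and L/F do not, which reproduces the 80% identity and 90% similarity quoted on the slide. A different grouping would give a different similarity value.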

CONFUSING TERMS
Identity
- proportion of pairs of identical characters between 2 sequences
- strongly depends on how the two sequences are aligned
Similarity
- proportion of pairs of similar characters between 2 sequences
- similarity is determined by a substitution matrix
- strongly depends on how the two sequences are aligned and on the matrix used
Homology
- two sequences are homologs if they have the same ancestor
- we cannot score homology (the answer is yes or no ONLY)

ALIGNMENT EVENT AND MUTATION EVENT
- Match -> no mutation
- Mismatch -> substitution
- Gap -> insertion/deletion (InDel)

SUBSTITUTION MUTATION IN DNA
original DNA seq.    TAC CTG AGC CAA    Tyr Leu Ser Gln
silent mutation      TAC CTC AGC CAA    Tyr Leu Ser Gln
missense mutation    TAC CTG CGC CAA    Tyr Leu Arg Gln
nonsense mutation    TAC CTG AGC TAA    Tyr Leu Ser (stop)
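
The three kinds of substitution on this slide can be told apart by translating both sequences and comparing the proteins. The sketch below assumes the standard genetic code and a coding sequence read in frame from its first base; the names `translate` and `classify_substitution` are made up for this example.

```python
# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string codon by codon (one-letter amino acids)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_substitution(original, mutant):
    """Label a substitution as 'silent', 'missense' or 'nonsense'.

    Only the first changed amino acid is inspected; a lost stop codon is
    reported as 'missense' in this simplified sketch.
    """
    if len(original) != len(mutant):
        raise ValueError("substitutions do not change the sequence length")
    if original == mutant:
        return "no mutation"
    before, after = translate(original), translate(mutant)
    if before == after:
        return "silent"
    for old, new in zip(before, after):
        if old != new:
            return "nonsense" if new == "*" else "missense"


if __name__ == "__main__":
    original = "TACCTGAGCCAA"
    for mutant in ("TACCTCAGCCAA", "TACCTGCGCCAA", "TACCTGAGCTAA"):
        print(mutant, classify_substitution(original, mutant))
```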

NUCLEOTIDE SUBSTITUTION
- sequences that share a common ancestor will gradually diverge
- the substitutions themselves are very difficult to observe directly
- sequence divergence = proportion (p) of nucleotide sites at which two sequences differ
ACTGTACTCGCATCG
ACTATACTCTCATTG
ACTGTTCTCCCATCA
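
The proportion p can be computed directly for the three sequences on the slide. This is a minimal sketch: the function name `p_distance` and the labels used for the sequences are assumptions for this example, and the sequences are taken to be aligned without gaps.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of sites at which two aligned, gap-free sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


if __name__ == "__main__":
    # Labels are illustrative; the first sequence is unlabeled on the slide.
    sequences = {
        "first": "ACTGTACTCGCATCG",
        "species A": "ACTATACTCTCATTG",
        "species B": "ACTGTTCTCCCATCA",
    }
    for (name1, s1), (name2, s2) in combinations(sequences.items(), 2):
        print(f"{name1} vs {name2}: p = {p_distance(s1, s2):.3f}")
```

The first sequence differs from each of the other two at 3 of 15 sites (p = 0.2), while species A and species B differ from each other at 5 of 15 sites.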

EMPIRICAL STUDIES OF AMINO ACID SUBSTITUTION
- several studies have examined observed amino acid substitutions
- the results show that amino acid substitution is not random
- amino acids with similar chemical properties substitute for each other more often
- some amino acids (e.g., cysteine, glycine and tryptophan) are rarely changed

POINT ACCEPTED MUTATION (PAM)
- proposed in 1978 by Margaret Oakley Dayhoff
- the first substitution matrix for amino acid changes
- one PAM is a unit of evolutionary divergence in which 1% of amino acids have been changed
- if there were no selection for fitness (impossible!!), substitution would be one of the main factors driving protein sequence change
- in observed sets of related protein sequences, the frequencies of amino acid substitutions are biased toward changes that maintain the function of the protein
- these are the point mutations that have been accepted during evolution

PAM 250 MATRIX
- the 1 PAM matrix was constructed from observed amino acid changes in closely related proteins
- the 1 PAM data were then extrapolated to PAM250
- only PAM250 was published by Dayhoff et al. (1978)
- higher PAM matrices are good for highly divergent sequences; lower PAM matrices are good for conserved sequences
[Figure: PAM250 matrix. Source: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins]
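
The extrapolation step can be pictured as repeated matrix multiplication: the mutation probability matrix for n PAM units is the 1 PAM probability matrix raised to the n-th power. The sketch below uses a made-up 3-letter alphabet with invented probabilities (`TOY_PAM1`) purely for illustration; it is not Dayhoff's data, and it covers only the probability matrix, not the conversion to the log-odds scores that appear in the published table.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 1 PAM matrix for a toy alphabet X, Y, Z: each row gives the
# probability that a residue stays the same (0.99) or changes to another one.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

if __name__ == "__main__":
    toy_pam250 = mat_pow(TOY_PAM1, 250)
    for row in toy_pam250:
        print(" ".join(f"{p:.3f}" for p in row))
```

After 250 steps the diagonal entries are far below 0.99, which is why a high-numbered PAM matrix tolerates the many changes seen between divergent sequences.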

BLOSUM MATRIX
- built from observed amino acid changes, using a different strategy from PAM matrix construction
- sequence data are derived from the BLOCKS database
- unlike PAM, BLOSUM used distantly related sequences (PAM used closely related sequences)
- BLOSUM62 matrix (the first BLOSUM matrix): sequences having at least 62% identity are merged into a single sequence
- higher-numbered BLOSUM matrices (e.g., BLOSUM90) are good for comparing very similar sequences; lower-numbered ones (e.g., BLOSUM30) are for highly divergent sequences

BLOSUM 62 MATRIX
[Figure: BLOSUM62 matrix. Source: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins]

SUGGESTED USES FOR COMMON SUBSTITUTION MATRICES
[Table not reproduced. Source: Menlove, Clement, and Crandall: Similarity Searching Using BLAST]

GAP PENALTY
- assumption = indels are rare (they do not occur easily)
- gap opening = penalty applied when a gap is introduced into the alignment
- gap extension = penalty for lengthening a gap, normally counted from the second position of the gap
CCGTATCGTCTATCTACGTGCACTGAT
CCCAATCTTCAATCTACG---TCTGAT
(in the three-position gap, the first position incurs the gap opening penalty and the following positions the gap extension penalty)
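
The two penalties are usually combined into a single cost per gap. The sketch below follows the convention stated on the slide, where extension is counted from the second gap position; the function names and the penalty values (-10 and -1) are assumptions chosen for illustration only.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for one gap of `length` positions.

    The first position pays the opening penalty; every further position
    pays the extension penalty.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def gap_lengths(aligned):
    """Lengths of each run of '-' characters in one aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    aligned = "CCCAATCTTCAATCTACG---TCTGAT"
    total = sum(affine_gap_penalty(n) for n in gap_lengths(aligned))
    print(gap_lengths(aligned), total)
```

With these example values, one gap of three positions costs -12, whereas three separate single-position gaps would cost -30, reflecting the assumption that opening a new indel is rarer than lengthening an existing one.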

DYNAMIC PROGRAMMING
[Figure not reproduced. Source: Sean R Eddy, Nature Biotechnology 22, 909-910 (2004)]

BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL
[Figure not reproduced. Source: Wishard, Introduction to Bioinformatics: A Theoretical and Practical Approach]

PAIRWISE SEQUENCE ALIGNMENT
- aims to compare 2 sequences
- global alignment: tries to find the best alignment of two sequences across their entire length
- local alignment: tries to find the highly similar region(s) between two sequences
- overlapping alignment: global alignment of two sequences with different sizes

GLOBAL ALIGNMENT
- end-to-end alignment
- may end up with a lot of gaps in the alignment if the 2 sequences differ in size
- not sensitive to the modular nature of proteins
- very sensitive to gap penalties (gap opening and gap extension)
- Needleman-Wunsch algorithm (1970)
5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'
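
A compact sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch, is given below. It is a simplified illustration rather than a reference implementation: it uses a single linear gap penalty instead of separate opening and extension penalties, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Because every residue of both sequences must appear in the result, the traceback always runs from the last cell back to the first, which is what makes the alignment end-to-end. When several alignments share the best score, this sketch returns only one of them.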

LOCAL ALIGNMENT
- finds local regions with a high level of similarity
- more sensitive to the modular nature of proteins
- can be used to search databases
- Smith-Waterman algorithm (1981)
Global alignment:
ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA
ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA
Local alignment:
ACTACTAGATT    ACGGATC    GTACTTTAGAGGCTTGCAACCA
ACTACTAGATT    ACGGATC    GTACTTTAGAGGCTAGCAACCA
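
Local alignment in the style of Smith-Waterman can be sketched by changing two things in the global procedure: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values below (match +2, mismatch -1, linear gap -2) are assumptions for illustration, and only one best local alignment is reported.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_part1, aligned_part2) for one best local alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```

In the usage example the two sequences share only the core ACGT, so the local alignment reports just that region and ignores the unrelated flanks, which is the behavior that makes the approach suitable for modular proteins and database searches.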

MULTIPLE SEQUENCE ALIGNMENT

PROBLEM OF USING PAIRWISE ALIGNMENT
- good for comparing only two sequences
- alignment results are hard to understand and interpret when there are more than 2 sequences
- less evolutionary meaning

ATGCTAGTAAGC
ATTCAA-T--GC

ATTCAA-TGC
-TTCTAGCGC

ATGCTAGTAAGC
ATTCAA-T--GC
-TTCTAGC--GC

ATGCTAGTAAGC
-TTCTAGC--GC

MULTIPLE SEQUENCE ALIGNMENT (MSA)
- the most useful object in sequence analysis
- in the mid-1980s, MSAs were generated by hand because dynamic programming was (at that time) slow when applied to >3 sequences
- idea: arrange the homologous residues (nucleotides or amino acids) in the same column
- provides more biological information than pairwise sequence alignment

MSA METHODS
Multiple sequence alignment methods:
- Exact method
- Progressive methods: Clustal, MUSCLE
- Iterative methods: MAFFT
- Consistency-based methods: T-Coffee, ProbCons
- Structure-based methods: 3D-Coffee

MSA METHODS
[Figure not reproduced. Source: Sviatopolk-Mirsky Pais et al. (2014) Algorithms for Molecular Biology]

PROGRESSIVE ALIGNMENT
[Figure: progressive alignment, built up step by step with dynamic programming]

THE CLUSTAL SERIES
- ClustalW was published by Thompson et al. in 1994; ClustalX is the related graphical version
- the older Clustal programs are obsolete, but their algorithm is good for understanding how MSA works
- a guide tree is generated first, then a progressive alignment is carried out based on that guide tree
- Latest: Clustal Omega
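
The guide-tree step can be sketched as a simple clustering of pairwise distances. The code below is a toy under several assumptions: it is not Clustal's actual procedure, it swaps in average-linkage (UPGMA-style) clustering for simplicity, it measures distance as the proportion of differing sites between equal-length sequences instead of scoring pairwise alignments, and the four example sequences are invented.

```python
def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Build a guide tree as nested tuples from a dict of name -> sequence.

    Each cluster is stored as (subtree, list of member names); the closest
    pair of clusters (average distance between members) is merged each round.
    """
    clusters = [(name, [name]) for name in seqs]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(p_distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical example sequences (not from the lecture).
EXAMPLE = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTGTACCA",
    "s4": "TTGTTCCC",
}

if __name__ == "__main__":
    print(guide_tree(EXAMPLE))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest sequences first and adding the more distant ones (or groups) afterwards.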

MUSCLE ALIGNMENT PROGRAM
- MUltiple Sequence Comparison by Log-Expectation (MUSCLE) was published by Edgar RC in 2004
- step I: progressive alignment
- step II: improved progressive alignment
- step III: refinement
- very easy command line
- improved speed and accuracy (based on the SP method)
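
Assuming "SP" here refers to the sum-of-pairs score, the idea is to score every column of an MSA by adding up the scores of all pairs of characters in that column. The sketch below uses made-up scoring values (match +1, mismatch -1, residue against gap -1, gap against gap 0); real programs use a substitution matrix and their own gap treatment, so this only illustrates the principle.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of characters taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum-of-pairs score of an MSA given as a list of equal-length rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    msa = ["ATGCTAGTAAGC", "ATTCAA-T--GC", "-TTCTAGC--GC"]
    print(sum_of_pairs(msa))
```

A refinement step can use such a score as its objective: a change to the alignment is kept only when the score improves.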

MUSCLE ALIGNMENT PROGRAM
[Figure not reproduced]

CHOOSING THE RIGHT MSA PROGRAM
[Figure not reproduced. Source: Chagoyen M (2013) Sequence Analysis and Structure Prediction Service.]

QUESTIONS?