Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Similar documents
Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Pairwise & Multiple sequence alignments

Sequence analysis and Genomics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Single alignment: Substitution Matrix. 16 march 2017

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Algorithms in Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment (chapter 6)

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Practical considerations of working with sequencing data

Motivating the need for optimal sequence alignments...

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

BLAST. Varieties of BLAST

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Tools and Algorithms in Bioinformatics

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Computational Biology

Homology and Information Gathering and Domain Annotation for Proteins

In-Depth Assessment of Local Sequence Alignment

Bioinformatics and BLAST

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment & BLAST. By: Hadi Mozafari KUMS

Introduction to Bioinformatics Online Course: IBT

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Basic Local Alignment Search Tool

Collected Works of Charles Dickens

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

EECS730: Introduction to Bioinformatics

Tools and Algorithms in Bioinformatics

Sequence analysis and comparison

Computational methods for predicting protein-protein interactions

An Introduction to Sequence Similarity ( Homology ) Searching

Homology. and. Information Gathering and Domain Annotation for Proteins

Evolutionary Tree Analysis. Overview

Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Similarity or Identity? When are molecules similar?

Copyright 2000 N. AYDIN. All rights reserved. 1

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Large-Scale Genomic Surveys

Pairwise sequence alignment

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Biol478/ August

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Pairwise sequence alignments

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Bioinformatics for Biologists

Bioinformatics Exercises

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Cladistics and Bioinformatics Questions 2013

Comparative genomics: Overview & Tools + MUMmer algorithm

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Exploring Evolution & Bioinformatics

Introduction to Bioinformatics

Quantifying sequence similarity

Phylogenetic analysis. Characters

Introduction to Sequence Alignment. Manpreet S. Katari

1.5 Sequence alignment

Phylogenetic inference

Comparative Bioinformatics Midterm II Fall 2004

Analysis and Design of Algorithms Dynamic Programming

Introduction to protein alignments

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Practical Bioinformatics

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Genomics and bioinformatics summary. Finding genes -- computer searches

Introduction to Bioinformatics

Pairwise Sequence Alignment

Week 10: Homology Modelling (II) - HHpred

B I O I N F O R M A T I C S

Sequence Alignment Techniques and Their Uses

Phylogenetic Tree Reconstruction

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Administration. ndrew Torda April /04/2008 [ 1 ]

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Bio nformatics. Lecture 3. Saad Mneimneh

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

8/23/2014. Phylogeny and the Tree of Life

Moreover, the circular logic

Comparing whole genomes

Effects of Gap Open and Gap Extension Penalties

BIOINFORMATICS: An Introduction

Transcription:

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo

Learning Objectives Understand the applications of sequence similarity searching and alignment Understand the concepts of homology, identity, orthologues, paralogues Sequence evolution: introducing concepts of point mutations, deletions, insertions etc. Introduction to pair-wise sequence alignment Overview of the different approaches to sequence alignment - exhaustive vs. heuristic

Learning Outcomes Understand the concept of sequence alignment Understanding the concepts of sequence evolution- mutations, deletions, insertions etc Sequence diversity: understand the difference between homologues, paralogues and orthologues Analyzing similarity and differences between sequences Understanding why and how to find sequences similar to the one of your interest

What is Sequence Alignment Sequence alignment is a way of arranging two or more sequences (DNA, RNA, or A.a.(proteins)) to identify regions of similar character patterns Sequence similarity could be a result of functional, structural, or evolutionary relationships between the sequences Procedure involves searching for series of identical or similar characters/patterns in the same order between the sequences Non identical characters aligned as mismatches or opposite a gap in the other sequence Alignment made between a known sequence and unknown sequence or between two unknown sequences

Why Sequence Alignment (uses)?: Useful in DNA and Protein sequences for: Discovering functional information Predicting molecular structure Discovering evolutionary relationships Sequences that are very much alike probably have: Same function Similar secondary and 3-D structure (if proteins) Shared ancestral sequence (though not always) Sequence alignment enables the following: Annotation of new sequences Modeling of protein structures Phylogenetic analysis

Sequence Evolution

Evolutionary basis of Sequence Alignment One goal of sequence alignment is to enable inference of homology (origin from common ancestor) through observed shared sequence similarity. Changes that occur during sequence divergence from common ancestor include: Substitutions Deletions Insertions

Sequence Relationships-1 Identity/ Similarity: Sequence Identity: Exactly the same Amino acid or nucleotide in the same position Sequence Similarity: Content includes substitutions (A.a residues) with similar chemical properties Similarity: A quantifiable property- Two sequences are similar if order of sequence characters is recognizably the same and they can be aligned

Sequence Relationships-2 How similar is very similar?: Sequences be at least 100 A.a or 100 nucleotides long, then: 25% Amino acid identity required to call protein homology 70% nucleotide identity required to call gene homology Caution: Homology or non-homology is more than just sequence similarity

Sequence Relationships-3 To ascertain homology, also consider other information reported by the sequence comparison/search: Expectation Value (E-value, see local alignments later): tells how likely observed similarity is due to chance Length of segments similar between the two sequences The patterns of A.a. conservation The number of insertions and deletions

Similarity/Identity: Nucleotides Credit Pandam Salifu, IBT 2016

% Similarity/Identity: Nucleotides-1 Equal Length:

Similarity/Identity: Nucleotides-2 Un equal Length:

Similarity/Identity: Amino Acids-1 % Identity and similarity not synonymous for Amino acid sequences: Credit Pandam Salifu, IBT 2016

Similarity/Identity: Amino Acids-2 % Identity and similarity not synonymous for Amino acid sequences: Credit Pandam Salifu, IBT 2016

Sequence Relationships-2a Homology: Homous sequences (related by descent): Two or more sequences, readily aligned,i.e. very similar such that they have a shared ancestry Homous positions TATGATC TATGATC TATGATC TATcATC TTcATC T TcATC

Sequence Relationships-2b Similarity Vs Homology Similarity means likeness or % identity between two sequences Homology refers to shared ancestry Similarity means having statistically significant number of Amino acids or nucleotide base matches Similarity does not imply homology Two sequences are homous if derived from a common ancestral sequence Homology usually implies similarity sequences are either homous or not, so no % homology

Sequence Relationships-2c Orthous sequences: quite similar sequences found in different species (i.e. due to a speciation event), and carrying out a similar biological function Paraus sequences: Sequences related through gene duplication events. Can have variable biological function within a species Sequences may be both orthous and paraus Orthology is a form of homology

Sequence Relationships-2d Credit Pandam Salifu, IBT 2016

Sequence Alignment Example- Homology 10 20 30 40 50 60 HUMAN MNPLLILTFVAAALAAPFDDDDKIVGGYNCEENSVPYQVSLNSGYHFCGGSLINEQWVVS :. ::::..:.::.: :..:::::::::.: :.:::::::::::::::::::::.::::: RAT MSALLILALVGAAVAFPLEDDDKIVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWVVS 10 20 30 40 50 60 70 80 90 100 110 120 HUMAN AGHCYKSRIQVRLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVIN :.::::::::::::::::.::::.::::::::::.::.:. :::::::::::::..: RAT AAHCYKSRIQVRLGEHNINVLEGDEQFINAAKIIKHPNYSSWTLNNDIMLIKLSSPVKLN 70 80 90 100 110 120 130 140 150 160 170 180 HUMAN ARVSTISLPTAPPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPGKIT :::...::.:.::.::::::::: :.:.. :: :::.:::::::: :::.:::.:: RAT ARVAPVALPSACAPAGTQCLISGWGNTLSNGVNNPDLLQCVDAPVLSQADCEAAYPGEIT 130 140 150 160 170 180 190 200 210 220 230 240 HUMAN SNMFCVGFLEGGKDSCQGDSGGPVVCNGQLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIK :.:.::::::::::::::::::::::::::::.:::: :::..::::::: :.: ::. RAT SSMICVGFLEGGKDSCQGDSGGPVVCNGQLQGIVSWGYGCALPDNPGVYTKVCNFVGWIQ 190 200 210 220 230 240 HUMAN NTIAAN.::::: RAT DTIAAN Human (247 aa) vs Rat (246 aa) Trypsin : show 76.4% identity (91.9% similarity) in 246 aa overlap (1-246:1-246), E(1) < 2e-86 The similarity is statistically significant ( > expected by chance), so sequences can be considered homous

Sequence Alignment:- Structure

Sequence Alignment Problems: global & local-1 Sequences can be aligned: Matching as many characters as possible across their entire length (Global alignment) The tool for global alignment is based on the Needleman- Wunsch algorithm Focusing on just the best matching (highest scoring) regions (Local alignment) The tool for local alignment is based on Smith-Waterman algorithm Both algorithms are derivatives from the basic dynamic programming algorithm (see later, session 2).

Sequence Alignment Problems: global & local-2 Global alignment: L G P S S K Q T G K G S S R I W D N L N I T K S A G K G A I MR L G D A Local alignment: - - - - - - - - G K G - - - - - - - - - - - - - - - - G K G - - - - - - - -

Sequence Alignment Problems: global & local-3 Global alignment: Suitable for- Sequences that are quite similar (more closely related) Sequences of approximately same length Global alignment made possible by including gaps either within the alignment or at the ends of the sequences Local alignment: Suitable for- Sequences similar along some of their lengths but dissimilar in others (i.e. sharing several conserved regions of local similarity/domains) Sequences that differ in length Gaps not tolerated within local alignment

Pair-wise Sequence Alignment Pair-wise sequence alignment maps and compares residues between two sequences Aligning two sequences has many distinct alignment options possible The overall goal is to find the alignment that provides the best (optimal) pairing between the two sequences (i.e. maximum residue/character matches, gaps inclusive) Sequence alignments have to be scored to identify the best one/s among them. Scoring system can be simple match/mismatch scheme (DNA) or for protein comparisons, use of a more sensitive scheme by substitution matrix Often there is more than one solution with the same score

Treatment of gaps: Penalties-1 Constant gap penalty, a fixed ve score -a is given as penalty of every gap, irrespective of length. Aligning GCTGATTCAT Vs GCTTCAT GCTGATTCAT GC - - -TTCAT Score rules: Each match +1; The gap -1 Total score = 7-1 = 6

Treatment of gaps: Penalties-2 Linear gap penalty, a penalty of -a per unit length of a gap. Takes into account the length(l) of each insertion / deletion in the gap Aligning GCTGATTCAT Vs GCTTCAT GCTGATTCAT GC - - -TTCAT Score rules: Each match +1; Each gap -1 Total score = 7-3 = 4

Treatment of gaps: Penalties-3 Constant and linear gap penalties do not consider whether gap is opening or extending Gaps at terminal regions treated with no penalty since many true homous sequences can be of different lengths

Treatment of gaps: Penalties-4 Affine gap penalty considers Introducing (opening) and extension of gaps: Total gap penalty G= O+E(L-1) Where O= opening penalty; E= extension penalty; L= length of gap Pros: Opening gap costs more than extending More evolutionary sound Draw back: penalty points are arbitrary chosen

Pair-wise alignment score: Example Data: A G T A C Vs G T A A C Score rules: +1 for match, -2 for mismatch, -3 for gap 2 matches, 0 gaps (-4) A G T A C G T A A C 4 matches, 1 insertion (+1) A G T - A C. G T A A C 3 matches (2 end gaps) (+1) A G T A C.. G T A A C 4 matches, 1 insertion (+1) A G T A - C. G T A A C Scoring scheme rewards matches and punishes mismatches and gaps

Methods of Pair-wise Sequence Alignment Short and very related sequences By hand- slide sequences on two lines of a word processor General initial exploration of your sequence: to discover repeats, insertions, deletions etc Dot plot/matrix methods- simplest comparison method Intensive comparisons to arrive at the optimal alignment..rigorous mathematical approach Dynamic programming (slow, optimal) Extensive comparisons involving long sequences (e.g. entire genomes) or a large set of sequences( e.g. database entries).heuristic methods (fast, approximate) Word search methods e.g. BLAST, FASTA etc Continued in session 2