THEORY: Sequence Alignment


Exp 11 - THEORY

Sequence alignment is the process of arranging sequences so as to achieve the maximum level of identity between them. This helps to derive functional, structural and evolutionary relationships between the sequences. Aligning sequences helps to assign functions to unknown proteins, determine the evolutionary relatedness of organisms and predict 3D structures.

Homology is similarity attributable to descent from a common ancestor: if two sequences from different organisms are similar, they may be homologous, and the structure and function of one can then be predicted from the other.

Types of Sequence Alignment

Based on sequence length - According to the length of the sequences being compared, alignment is of the following two types:

1) Global sequence alignment - The sequences are aligned along their entire length to include as many matching characters as possible.

TAGC-GC-GT
TA-CA-CAGT

2) Local sequence alignment - The sequences are aligned to find a region with a high density of matches, i.e. strong similarity.

CGATAACGTAT
--ATAAAC---

Based on number of sequences - According to the number of sequences being compared, alignment is of the following two types:

1) Pairwise sequence alignment - This involves aligning two sequences to find the best region of similarity.

1 KTSSGNGAEDS 11
1 KTSSGNGAEDS 11
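Since alignment aims at maximum identity, it can help to see how identity is counted for a finished alignment. The following is a minimal sketch in Python; the function name `percent_identity` and the choice to count every alignment column (including gapped ones) in the denominator are assumptions made for illustration, not part of the original text.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both rows hold the same residue.

    Assumption: gapped columns count toward the denominator; other
    conventions (e.g. dividing by the shorter sequence length) also exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows are gaps; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Global alignment example from the text: 6 identical columns out of 10.
print(percent_identity("TAGC-GC-GT", "TA-CA-CAGT"))    # 60.0
# Pairwise example from the text: the two rows are identical.
print(percent_identity("KTSSGNGAEDS", "KTSSGNGAEDS"))  # 100.0
```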

Various methods used for pairwise alignment of nucleotide and protein sequences are:

1) Dot plot - A graphical method for comparing two sequences, in which regions of similarity and dissimilarity are depicted by the presence and absence of dots.

2) Dynamic programming - This method breaks a problem into small sub-problems and uses the solutions of the sub-problems to compute the solution of the larger one. Algorithms such as Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment) are used here.

3) Heuristic method - When a single sequence is to be compared against a whole database, heuristic methods such as BLAST and FASTA are used.

The following parameters are used for producing an optimum alignment:

a) Max target sequences - The maximum number of aligned sequences to display in the results.

b) Expect threshold - A statistical cutoff on the E value, which is the number of alignments with an equal or better score that are expected to occur by random chance. The lower the E value, the more significant the score. The default value is 10, meaning that 10 matches are expected to be found by random chance (stochastic model of Karlin & Altschul, 1990).

c) Query match - Limits the number of matches reported in a query range. This is useful when many strong matches to one part of the query would otherwise keep weaker matches to another part from being shown.

d) Word size - The algorithm works by using word matches between the query and the database sequences. It searches for a word match and then initiates the extension that leads to the full alignment. A word size of 3 is used for standard protein alignment. A word size of 2 is used for short and nearly exact matches.

e) Scoring schemes - Different scoring schemes are devised to obtain an optimum alignment. A substitution matrix assigns a score to every possible pair of aligned residues. Different PAM and BLOSUM matrices are used to score pairwise alignments and to check their quality.

BLOSUM (BLOcks SUbstitution Matrix) - Developed using conserved regions, called blocks, of distantly related protein sequences available from the BLOCKS database. Of all the BLOSUM matrices, BLOSUM62 is best for detecting most protein similarities. BLOSUM45 may be used for longer and weaker alignments.
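To make the link between a scoring scheme and the dynamic programming method listed above more concrete, here is a minimal sketch of Needleman-Wunsch global alignment in Python. The scores (match +1, mismatch -1, gap -2) and the linear gap penalty are illustrative assumptions; a real protein alignment would normally take its pair scores from a substitution matrix such as BLOSUM62 and use separate gap existence and extension costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring values are illustrative assumptions, with a linear gap cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell reuses the solutions of three smaller sub-problems.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Smith-Waterman follows the same recurrence but clamps every cell at zero and starts the traceback from the highest-scoring cell, which is what makes it a local rather than a global method.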

PAM (Point Accepted Mutation) - Developed by counting the amino acid substitutions that were accepted by natural selection during evolution. PAM30 is used for sequences shorter than 35 residues, whereas PAM70 is used for sequences of 35 to 50 residues.

f) Gap costs - A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Too many gaps should be avoided in the alignment, so a gap penalty (gap score) is assigned. Introducing a gap deducts the gap score from the alignment score, and extending the gap to encompass additional nucleotides or amino acids is also penalized. Increasing the gap costs decreases the number of gaps in the alignment. The penalty for creating a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. Commonly used values are existence 10 or 11 and extension 1.

2) Multiple sequence alignment - This involves aligning more than two (protein or DNA) sequences to assess the sequence conservation of protein domains and protein structures. It is an extension of pairwise sequence alignment that aligns similar sequences together and provides a better alignment score.
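The gap costs described under f) can be illustrated with a short Python sketch. It assumes the convention that a gap of length k costs existence + k × extension; some tools instead charge the opening cost for the first residue and the extension cost only for the rest, so treat the formula and the helper names (`gap_runs`, `gap_cost`, `total_gap_penalty`) as assumptions for illustration.

```python
import re


def gap_runs(aligned):
    """Lengths of the consecutive runs of '-' in one aligned row."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def gap_cost(length, existence=11, extension=1):
    """Cost of a single gap, assuming cost = existence + length * extension."""
    return existence + length * extension


def total_gap_penalty(aligned_a, aligned_b, existence=11, extension=1):
    """Sum of the gap costs over both rows of a pairwise alignment."""
    runs = gap_runs(aligned_a) + gap_runs(aligned_b)
    return sum(gap_cost(n, existence, extension) for n in runs)


# The global alignment example has four separate gaps of length 1.
print(total_gap_penalty("TAGC-GC-GT", "TA-CA-CAGT"))  # 48
# One gap of length 3 is much cheaper than three gaps of length 1.
print(gap_cost(3), 3 * gap_cost(1))                    # 14 36
```

The last line shows why separate existence and extension costs favor a few long gaps over many short ones, which matches the observation that insertions and deletions tend to span several residues at a time.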

Various analyses, such as homology modeling for prediction of protein structure, phylogenetic analysis and motif detection, are based on the results of multiple sequence alignment. Many software packages, such as Clustal, T-Coffee, PHYLIP, MSA and MUSCLE, are used for obtaining a multiple sequence alignment.

Example: Seq 3 - Seq 4 - Seq 5 - PQGGGGWGQ

The following parameters should be considered when aligning multiple sequences:

Protein weight matrix - The substitution matrix used to score aligned residues, e.g. PAM and BLOSUM.

Gap open - The penalty to open a gap. The presence of a gap is frequently given more significance than the length of the gap. By default, the gap opening penalty is 10.

Gap extension - The penalty to extend a gap. Each additional amino acid that the gap covers is penalized in the scoring of the alignment. By default, the gap extension penalty is 0.20.

Application of MSA results - Phylogenetic analysis

Phylogenetic analysis is one of the major areas where multiple sequence alignment results are used, in order to find the evolutionary relatedness between sequences. The results are displayed in the form of a phylogenetic tree, which has a set of nodes and branches linking the nodes. Methods used for phylogenetic analysis are:

1) UPGMA (Unweighted Pair Group Method with Arithmetic Mean) - A simple hierarchical clustering method of tree building that uses a distance matrix to find the relatedness between sequences.

2) NJ (Neighbor Joining) - A method related to the clustering methods. It is especially suited for datasets comprising lineages with widely varying rates of evolution. It is a special case of the star decomposition method.

File format view:

PHYLIP - PHYLogeny Inference Package. A format for Joe Felsenstein's phylogenetic applications, with a maximum length of 10 characters for the sequence ID.

Cladogram - In a cladogram, the external taxa line up neatly in a row. The branch lengths are not proportional to the number of evolutionary changes, so no analysis of evolutionary time can be done; only the relative ordering of the taxa can be analyzed.

Phylogram - In a phylogram, the branch lengths represent the amount of evolutionary divergence. Such trees are said to be scaled.

Pearson/FASTA - A text-based format that represents nucleotides or amino acids in single-letter code. Each sequence is preceded by its name, which may be followed by comments.

Jalview - Java alignment editor. A visualization tool for alignments and other database search results.
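As a rough illustration of the UPGMA method described earlier, the Python sketch below clusters a small distance matrix and returns only the tree topology, which corresponds to what a cladogram shows. The Newick-style output string, the function name `upgma` and the four-taxon distance matrix are assumptions chosen for the example; branch lengths are deliberately left out.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA and return the topology as a Newick-style string.

    names:  list of taxon labels
    matrix: symmetric square list of lists of pairwise distances
    Branch lengths are omitted in this sketch (topology only).
    """
    clusters = [(name, 1) for name in names]  # (label, number of taxa)
    d = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Pick the closest pair of clusters (first one found on ties).
        i, j = min(((p, q) for p in range(n) for q in range(p + 1, n)),
                   key=lambda pq: d[pq[0]][pq[1]])
        (label_i, size_i), (label_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted arithmetic mean of the two old distances.
        new_row = [(d[i][k] * size_i + d[j][k] * size_j) / (size_i + size_j)
                   for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[pos]]
             for pos, a in enumerate(keep)]
        d.append(new_row + [0.0])
        merged = ("(%s,%s)" % (label_i, label_j), size_i + size_j)
        clusters = [clusters[k] for k in keep] + [merged]
    return clusters[0][0] + ";"


# Hypothetical distances: A and B are closest, then C, then D.
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(names, matrix))  # (D,(C,(A,B)));
```

Recording the height at which each pair is merged (half the distance between the two clusters) would turn this into a scaled tree of the phylogram kind.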