Multiple Sequence Alignments

Elements of Bioinformatics, Spring 2003
Tom Carter
http://astarte.csustan.edu/~tom/
March 2003

Sequence Alignments

Often, we would like to make direct comparisons between two or more residue sequences. This allows us to see which subsequences are conserved, and what differences there are between the sequences.

In many cases, we would like to make comparisons among more than two sequences. These comparisons can help us understand evolutionary changes in molecules, and identify functionally or structurally important regions of the molecules.

In general, it can be computationally infeasible to look for globally optimal alignments of sequences, particularly when we allow gaps in the sequences. There may also be ambiguities about what should be considered optimal alignments in particular cases.

There are a variety of ways in which two residue sequences (or subsequences of a sequence) might be similar:

- They might be evolutionarily homologous, sharing a (relatively) recent common ancestor.
- They might be structurally similar, contributing in similar ways to the secondary (or tertiary) structure of the molecule (e.g., alpha helices or beta sheets).
- They might be functionally similar, for example by forming binding sites.

In general, we would expect two evolutionarily homologous sequences to match each other fairly well at the residue level, and for similarity scoring via such models as PAM or BLOSUM to work fairly well. On the other hand, these scoring matrices may not do very well in recognizing structural or functional similarities.

For these purposes, we may need more sophisticated methods for building alignments. For example, an algorithm might use a simple PAM or BLOSUM scoring matrix approach in its early phases, and in later phases use a matching model more closely tailored to structural or functional features.

Current implementations of ClustalW, for instance, use a variety of approaches, including:

- Variation of amino acid substitution matrices at different alignment stages according to the divergence of the sequences to be aligned.
- Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
- Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.

(These are from the ClustalW 1.7 descriptive material.)

This is a fairly active area of research. New approaches, such as hidden Markov models (HMMs), are providing new tools and methods.

Basic approach to multiple sequence alignments

The general approach used by many algorithms such as ClustalW for building multiple sequence alignments is as follows:

1. Using a scoring matrix approach, do pairwise comparisons between the sequences. If there are n sequences, we will be doing n(n-1)/2 pairwise comparisons.
2. Using the pairwise comparison scores, build a relatedness tree for the sequences.
3. Starting with the most closely related pair of sequences, build successively larger clusters of sequence alignments until all sequences are aligned.
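The first step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in (percent identity over ungapped positions) for a real matrix-based pairwise alignment score, and the three short sequences are made up rather than taken from the SODC family.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def toy_score(a, b):
    """Placeholder pairwise score: percent of identical positions, no gaps.

    Assumption for illustration only; a real tool would first align the
    pair using a substitution matrix and gap penalties.
    """
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / length


def all_pairwise_scores(seqs):
    """Score every unordered pair of named sequences."""
    return {
        (p, q): toy_score(seqs[p], seqs[q])
        for p, q in combinations(sorted(seqs), 2)
    }


# Hypothetical toy sequences, not real SODC data.
toy = {"s1": "MKVLA", "s2": "MKVLG", "s3": "MRTLA"}
scores = all_pairwise_scores(toy)

for pair, score in scores.items():
    print(pair, score)
print("pairs for 9 sequences:", pair_count(9))
```

The quadratic growth of `pair_count` is why the pairwise stage is often the most expensive part for large sequence sets.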

Here is a brief example, using some sequences from the Copper-Zinc superoxide dismutase (SODC) family:

SOD2_LYCES [Lycopersicon esculentum (Tomato)]
SODC_SPIOL [Spinacia oleracea (Spinach)]
SODC_YEAST [Saccharomyces cerevisiae (Baker's yeast)]
SODC_XENLA [Xenopus laevis (African clawed frog)]
SODC_RAT [Rattus norvegicus (Rat)]
SODC_MOUSE [Mus musculus (Mouse)]
SODC_HUMAN [Homo sapiens (Human)]
SODC_DROVI [Drosophila virilis (Fruit fly)]
SODC_CHICK [Gallus gallus (Chicken)]

The first step is to do pairwise similarity scoring of the sequences. In this case, we use a Gonnet variation of the PAM250 scoring matrix:

                1   2   3   4   5   6   7   8   9
SOD2_LYCES  1      84  55  57  54  56  54  59  55
SODC_SPIOL  2          53  55  52  53  53  59  55
SODC_YEAST  3              56  56  54  54  53  55
SODC_XENLA  4                  66  66  66  58  67
SODC_RAT    5                      96  83  59  71
SODC_MOUSE  6                          83  61  71
SODC_HUMAN  7                              59  72
SODC_DROVI  8                                  63
SODC_CHICK  9

Then we build a relatedness tree:

((LYCES, SPIOL 84), (YEAST, (XENLA, (((RAT, MOUSE 96), HUMAN 83), CHICK 71) 66), DROVI 58))
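To make the step from scores to tree more concrete, here is a minimal sketch that clusters the scores listed above by repeatedly merging the two groups with the highest mean similarity (average linkage). This is a swapped-in, simpler technique chosen for illustration: ClustalW builds its guide tree with neighbor-joining, so the merge order here is not guaranteed to match the tree shown above in every detail. The function names are hypothetical.

```python
from itertools import combinations

NAMES = ["LYCES", "SPIOL", "YEAST", "XENLA", "RAT",
         "MOUSE", "HUMAN", "DROVI", "CHICK"]

# Upper triangle of the pairwise score table, row by row.
UPPER = [
    [84, 55, 57, 54, 56, 54, 59, 55],
    [53, 55, 52, 53, 53, 59, 55],
    [56, 56, 54, 54, 53, 55],
    [66, 66, 66, 58, 67],
    [96, 83, 59, 71],
    [83, 61, 71],
    [59, 72],
    [63],
]


def pair_scores(names, upper):
    """Map each unordered pair of names to its similarity score."""
    scores = {}
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            scores[frozenset((names[i], names[j]))] = value
    return scores


def average_linkage(names, scores):
    """Merge the two clusters with the highest mean similarity until one remains.

    Returns the list of merges as (cluster_a, cluster_b, mean_similarity).
    """
    def mean_sim(pair):
        a, b = pair
        vals = [scores[frozenset((x, y))] for x in a for y in b]
        return sum(vals) / len(vals)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=mean_sim)
        merges.append((a, b, mean_sim((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


SCORES = pair_scores(NAMES, UPPER)
MERGES = average_linkage(NAMES, SCORES)
for a, b, sim in MERGES:
    print(a, b, round(sim, 1))
```

With these scores, the first merges are RAT with MOUSE at 96, LYCES with SPIOL at 84, and HUMAN joining RAT and MOUSE at 83, which agrees with the innermost groupings of the tree.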

Here is a pictorial version of the unrooted relatedness tree:

[Figure: SODC unrooted relatedness tree]

Here is a multiple alignment:

[Figure: SODC multiple alignment (ClustalW 1.81)]

Here is another version of the multiple alignment:

[Figure: SODC multiple alignment (boxshade)]

Notations in ClustalW output

The conservation line output in the Clustal-format alignment file uses three characters:

"*" indicates positions which have a single, fully conserved residue.

":" indicates that one of the following strong groups is fully conserved:
    STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW

"." indicates that one of the following weaker groups is fully conserved:
    CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY

These are all the positively scoring groups that occur in the Gonnet Pam250 matrix. The strong and weak groups are defined as strong score > 0.5 and weak score <= 0.5 respectively.
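The rule for the conservation line can be expressed as a short function. The sketch below is an illustration, not ClustalW's own code: the group lists are copied from the text above, while the function names, the toy alignment rows, and the choice to leave any column containing a gap unmarked are assumptions.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column.

    Assumption: a column that is empty or contains a gap ('-') is unmarked.
    """
    residues = set(column.upper())
    if not residues or "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one weaker group
    return " "


def conservation_line(rows):
    """Build the conservation line for equal-length aligned rows."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*rows))


# Hypothetical aligned rows, for illustration only.
rows = ["MSAC", "MTAS", "MAAW"]
line = conservation_line(rows)
print(repr(line))
```

Checking the strong groups before the weaker ones matters, because a column such as S, T, A fits both STA (strong) and STPA (weaker) and should be marked with ":".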