Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir


Sequence Bioinformatics
Multiple Sequence Alignment
Waqas Nasir
2010-11-12

Multiple Sequence Alignment
"One amino acid sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud." (Lesk, 2008)
Multiple alignments are more reliable than pairwise alignments.
They expose patterns of amino acid conservation.
They help predict secondary structure more accurately.

Multiple Sequence Alignment
For pairwise alignment we have a likelihood-ratio score built from the pair probability $p_{ab}$.
How do we align three or more sequences?
How do we calculate $p_{abc}$ or $p_{abcd}$ in the likelihood ratio? A large amount of data is needed to estimate them.
Is it even possible to obtain an optimal multiple sequence alignment?
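To make the data problem concrete, the sketch below counts how many joint residue probabilities would have to be estimated for a column of n sequences. It assumes a 20-letter amino acid alphabet and counts every ordered combination; the function name `joint_table_size` is hypothetical and only serves the illustration.

```python
def joint_table_size(n_sequences, alphabet_size=20):
    """Number of entries in a joint probability table over n aligned residues.

    Assumption: one parameter per ordered combination of residues,
    with a 20-letter amino acid alphabet by default.
    """
    return alphabet_size ** n_sequences


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, joint_table_size(n))
```

The table grows by a factor of 20 with every added sequence, which is why direct estimation of $p_{abc}$ or $p_{abcd}$ quickly becomes impractical.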

Scoring Schemes
Two important features of a multiple alignment should be reflected in its score:
  position-specific scoring;
  the evolutionary relationship between the sequences.
Not enough data is available to parameterize a full evolutionary model.
Assuming independence between the columns, the alignment score becomes a sum of per-column scores.

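Scoring Scheme: Minimum Entropy
Try to minimize the entropy of each column.
The probability of observing residue a in column $m_i$ can be estimated from the residue counts in that column.
The probability of the column is then the product of these residue probabilities over all sequences.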


Scoring Scheme: Minimum Entropy
The minimum entropy score for column $m_i$ is the negative log-probability of that column.
The more variation in the column, the higher the entropy.
Highly conserved columns give more information.
A completely conserved column scores 0.
A good alignment minimizes the total entropy.
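A minimal sketch of the minimum entropy score, assuming the usual count-based estimate: with $c_{ia}$ the number of times residue a occurs in column i, the residue probability is $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$ and the column score is $S(m_i) = -\sum_a c_{ia} \log p_{ia}$. The function names are hypothetical, and treating the gap character as an ordinary symbol is an assumption made only to keep the example short.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """Entropy score of one alignment column: -sum_a c_a * log(p_a).

    Assumption: p_a is estimated as c_a / (column size), and '-' is
    counted like any other symbol.
    """
    counts = Counter(column)
    total = sum(counts.values())
    # log(total / c) equals -log(c / total); this form avoids returning -0.0.
    return sum(c * math.log(total / c) for c in counts.values())


def alignment_entropy_score(rows):
    """Total entropy of an alignment given as equal-length strings."""
    return sum(column_entropy_score(col) for col in zip(*rows))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ACTT"]
    print(alignment_entropy_score(alignment))
```

A fully conserved column contributes 0, and every additional distinct residue in a column raises the total, so lower totals indicate better-conserved alignments.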

Scoring Scheme: SP Scores (Sum of Pairs)
Columns are again assumed to be independent.
The score of a column is the sum of the scores of all pairs of residues in that column.
The pair score s comes from a PAM or BLOSUM substitution matrix.

Scoring Scheme: SP Scores (Sum of Pairs)
The final alignment score is then the sum of the column scores.
The scheme rests on the unrealistic assumption of the same evolutionary distance between all pairs of sequences.
Not enough data is available to estimate the probabilities of all evolutionary events.
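The sum-of-pairs idea can be sketched as follows, taking the column score to be $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$. The pair score `toy_score` is a hypothetical stand-in (match +1, mismatch -1, residue against gap -2, gap against gap 0); a real application would look s up in a PAM or BLOSUM matrix instead.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical pair score, used here in place of PAM or BLOSUM."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_column_score(column, s=toy_score):
    """Sum of s over all unordered pairs of symbols in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows, s=toy_score):
    """Sum of the column scores over an alignment of equal-length strings."""
    return sum(sp_column_score(col, s) for col in zip(*rows))


if __name__ == "__main__":
    print(sp_alignment_score(["AC", "AC", "A-"]))
```

With N sequences each column contributes N(N-1)/2 pair terms, so a single divergent sequence affects N-1 of them at once.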


Multi-dimensional Dynamic Programming
Dynamic programming (DP) can be extended to multiple sequences.
The DP matrix has as many dimensions as there are sequences.
The algorithms therefore require a lot of memory and have a high computational cost.
Examples of DP-based algorithms include:
  MSA (Multiple Sequence Alignment algorithm);
  progressive alignment (Feng-Doolittle algorithm).
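To illustrate why the cost grows so quickly, here is a rough sketch of full three-dimensional DP for exactly three sequences under sum-of-pairs scoring. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) and the function names are assumptions for the example, and only the optimal score is returned, not the alignment itself.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical pair score with a linear gap cost."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def align3_score(x, y, z):
    """Optimal sum-of-pairs score of three sequences by 3-D DP."""
    seqs = (x, y, z)
    lengths = tuple(len(s) for s in seqs)
    # Each move consumes a residue (1) or inserts a gap (0) per sequence;
    # the all-gap move is excluded, which leaves 2**3 - 1 = 7 moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    best = {(0, 0, 0): 0}
    for i in range(lengths[0] + 1):
        for j in range(lengths[1] + 1):
            for k in range(lengths[2] + 1):
                pos = (i, j, k)
                if pos == (0, 0, 0):
                    continue
                candidates = []
                for m in moves:
                    prev = tuple(p - d for p, d in zip(pos, m))
                    if min(prev) < 0:
                        continue
                    column = [seqs[t][pos[t] - 1] if m[t] else "-"
                              for t in range(3)]
                    col_score = sum(pair_score(a, b)
                                    for a, b in combinations(column, 2))
                    candidates.append(best[prev] + col_score)
                best[pos] = max(candidates)
    return best[lengths]


if __name__ == "__main__":
    print(align3_score("ACGT", "AGT", "ACT"))
```

The table has (L+1)^3 cells for three sequences of length L and each cell looks at 7 predecessors; for N sequences this becomes (L+1)^N cells and 2^N - 1 predecessors, which is the memory and time problem noted on the slide.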

MSA (Multiple Sequence Alignment)
Reduces the volume of the multi-dimensional DP matrix that has to be searched.
Can optimally align up to 7 sequences of 200-300 residues.
Makes use of the SP scoring scheme.
The score of a multiple alignment $a$ is the sum of its pairwise alignment scores, $S(a) = \sum_{k<l} S(a^{kl})$, where $a^{kl}$ is the pairwise alignment between sequences k and l.
MSA uses a lower threshold score $\beta^{kl}$ for each pair of sequences; only pairwise alignments scoring higher than $\beta^{kl}$ are considered.
Instead of passing through all the points in the DP matrix, only those points are added to the search space through which the best pairwise alignment scores more than $\beta^{kl}$.
Finally, the multi-dimensional DP algorithm is performed on this subset of the hypercube.
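The pruning step for a single pair of sequences might be sketched like this: a forward and a backward pairwise DP give, for every cell, the best score of any global alignment passing through it, and only cells reaching the threshold are kept. The scoring values, the function names and the use of `>=` (so that a threshold equal to the optimum still keeps the optimal path) are assumptions of this sketch, and how the threshold itself is derived is not shown.

```python
def nw_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) score matrix with a linear gap cost."""
    rows, cols = len(x) + 1, len(y) + 1
    f = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    return f


def cells_above_threshold(x, y, beta):
    """Cells (i, j) whose best alignment through them scores >= beta."""
    forward = nw_matrix(x, y)
    # Aligning the reversed sequences gives the best suffix scores.
    backward = nw_matrix(x[::-1], y[::-1])
    n, m = len(x), len(y)
    keep = set()
    for i in range(n + 1):
        for j in range(m + 1):
            through = forward[i][j] + backward[n - i][m - j]
            if through >= beta:
                keep.add((i, j))
    return keep


if __name__ == "__main__":
    print(sorted(cells_above_threshold("AC", "AC", beta=2)))
```

Intersecting the kept cells over all sequence pairs gives the reduced region of the hypercube on which the multi-dimensional DP is then run.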

Progressive Alignment
The most commonly used approach.
Uses a DP algorithm to align sequences.
The idea is to start with the most closely related sequences and build on them by adding more sequences and groups.
An optimal alignment is not guaranteed.
It is fast and efficient, and gives reasonable alignments.

The Feng-Doolittle Algorithm
The algorithm works as follows:
1. Perform pairwise alignment of all sequences.
2. Convert the alignment scores to evolutionary distances.
3. Construct a guide tree.
4. Align the most closely related sequences in the guide tree.
5. Then align one of the following:
   the sequence most closely related to the existing alignment; or
   the next most closely related pair of sequences; or
   two sub-alignments (groups).
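The guide-tree step can be illustrated with a small clustering sketch that returns the order in which sequences and groups would be merged. Average-linkage clustering is swapped in here for simplicity and is not necessarily the tree-building method of the original paper; the function name and the dictionary layout of the distances are hypothetical.

```python
def guide_tree_order(names, dist):
    """Return the merge order from average-linkage clustering.

    names: list of sequence names.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Each merge is a pair of tuples naming the two groups that are joined.
    """
    clusters = [(name,) for name in names]

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties broken by position).
        _, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    d = {frozenset(p): v for p, v in {
        ("A", "B"): 1, ("C", "D"): 2,
        ("A", "C"): 5, ("A", "D"): 5, ("B", "C"): 5, ("B", "D"): 5,
    }.items()}
    for left, right in guide_tree_order(["A", "B", "C", "D"], d):
        print(left, "+", right)
```

In the usage example the closest pair is merged first, then the next closest pair, and finally the two groups, which mirrors the three kinds of alignment step listed above.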

The Feng-Doolittle Algorithm
PAM scores and affine gap penalties are used.
"Once a gap, always a gap" (Feng & Doolittle, 1987).
When a sequence or group is added, the highest-scoring pairwise alignment determines how it is aligned to the group.
The distance D is calculated from the normalized pairwise alignment score.
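A small sketch of the score-to-distance conversion, assuming the usual statement of the Feng-Doolittle formula: $D = -\log S_{\text{eff}}$ with $S_{\text{eff}} = (S_{\text{obs}} - S_{\text{rand}}) / (S_{\text{max}} - S_{\text{rand}})$, where $S_{\text{obs}}$ is the observed pairwise score, $S_{\text{max}}$ the average score of aligning each sequence with itself, and $S_{\text{rand}}$ the expected score for shuffled sequences. The function and parameter names are hypothetical.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into a distance D = -log(S_eff).

    Assumption: S_eff = (s_obs - s_rand) / (s_max - s_rand), which must be
    positive for the logarithm to be defined.
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is not above the random background")
    # log(1 / s_eff) equals -log(s_eff); this form avoids returning -0.0.
    return math.log(1.0 / s_eff)


if __name__ == "__main__":
    print(feng_doolittle_distance(s_obs=60, s_max=100, s_rand=20))
```

Identical sequences give a distance of 0, and the distance grows without bound as the observed score approaches the random background.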

Suffix Trees
A data structure that represents all suffixes of a given string S.
Defined as a rooted tree in which:
  every internal node other than the root has at least two children;
  no two edges out of a node begin with the same character;
  every edge is labelled with a non-empty substring of S.
Facilitates fast retrieval of, and operations on, substrings.
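As a simplified illustration, the sketch below builds an uncompressed suffix trie rather than a true suffix tree: every edge carries a single character, so chains of single-child nodes are not merged and the "at least two children" property does not hold. It still shows how storing all suffixes makes substring lookup a simple walk from the root. The function names and the `$` terminator are assumptions of the example.

```python
def build_suffix_trie(s, terminator="$"):
    """Build a nested-dict trie containing every suffix of s + terminator.

    This naive construction takes O(n^2) time and space; real suffix trees
    compress single-child chains and can be built in linear time.
    """
    s = s + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def contains_substring(trie, pattern):
    """True if pattern labels a path from the root, i.e. occurs in s."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("banana")
    print(contains_substring(trie, "nan"), contains_substring(trie, "nab"))
```

Lookup time depends only on the length of the pattern, not on the length of S, which is the property that makes suffix structures useful for fast substring operations.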

Example: Multiple Sequence Alignment
Case study 5.2 (Lesk, 2008)

Structural Inferences from MSA
The most highly conserved regions probably correspond to the active site.
Regions rich in edit operations (insertions and deletions) probably correspond to surface loops.
A conserved Gly/Pro column probably represents a turn.

Structural Inferences from MSA
A conserved pattern of hydrophobicity with spacing 2, with the intervening residues more variable and including hydrophilic residues, suggests a β-strand on the surface (residues 50-60).
A conserved pattern of hydrophobicity with spacing 4 suggests a helix (residues 40-49).
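The spacing rule can be sketched as a simple check on the alignment columns. Everything here is an assumption for illustration: the set of residues counted as hydrophobic, the requirement that a column be hydrophobic in every sequence, and the function names. The sketch also ignores the variability of the intervening positions.

```python
# Assumption: one common choice of residues treated as hydrophobic.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_columns(rows):
    """Flag each column that is hydrophobic in every aligned sequence."""
    return [all(r in HYDROPHOBIC for r in col) for col in zip(*rows)]


def has_spacing(flags, start, spacing, repeats):
    """True if flagged columns occur at start, start+spacing, ... (repeats times)."""
    positions = range(start, start + spacing * repeats, spacing)
    return all(p < len(flags) and flags[p] for p in positions)


if __name__ == "__main__":
    alignment = ["VKVTV", "ISLEI", "LDIRL"]
    flags = hydrophobic_columns(alignment)
    print(flags, has_spacing(flags, start=0, spacing=2, repeats=3))
```

In the toy alignment every other column is conserved-hydrophobic, the kind of spacing-2 pattern that would point to a surface β-strand; a spacing-4 check would be the analogous test for a helix.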

References
Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25(4):351-60.
Lesk AM. Introduction to Bioinformatics. 3rd ed. New York: Oxford University Press Inc.; 2008. ISBN: 978-0-19-920804-3.