Sequence Analysis and Databases 2: Sequences and Multiple Alignments

1 Sequence Analysis and Databases 2: Sequences and Multiple Alignments
Jose María González-Izarzugaza Martínez
CNIO Spanish National Cancer Research Centre (jmgonzalez@cnio.es)

2 Sequence Comparisons: How?
- Pairwise Alignment of 2 Sequences:
  - Aligning a pair of sequences
  - Searching for homologues (BLAST)
- Multiple Sequence Alignments (n>2)
- Advanced Sequence Alignments:
  - Patterns, Profiles, HMMs
  - Distant Homologues with PSI-BLAST

3 Multiple Sequence Alignments
Rationale:
- They align more than 2 homologous sequences at once.
- Conserved regions are likely to be important, because selective pressure keeps them from changing.
- Alignments improve because we can focus on rules shared by the whole family.
Algorithmic Complexity (dynamic programming over sequences of lengths N, M, L, J):
- 2 Sequences: NxM (as seen for Pairwise Alignments)
- 3 Sequences: NxMxL
- 4 Sequences: NxMxLxJ
- The cost is multiplied by one sequence length for every sequence added, so exact alignment is not feasible in general for a set of homologous proteins.
Examples of heuristics that work around this problem:
- T-Coffee: [http://www.ch.embnet.org/software/tcoffee.html]
- ClustalW: [http://www.ebi.ac.uk/clustalw/]
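To get a feel for the growth described above, this small Python sketch multiplies the sequence lengths to estimate how many dynamic-programming cells an exact alignment would fill. The function name `dp_cells` and the length of 300 residues are illustrative assumptions, not values from the slides.

```python
from math import prod

def dp_cells(lengths):
    """Approximate number of table cells for exact alignment: N x M x L x ..."""
    return prod(lengths)

for k in (2, 3, 4, 10):
    # Assumption: every sequence is 300 residues long (an arbitrary length).
    print(k, "sequences:", f"{dp_cells([300] * k):.1e}", "cells")
```

Two sequences need about 90,000 cells, while each extra sequence multiplies that figure by another 300, which is why heuristics are used instead.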

4 ClustalW Algorithm
The algorithm in plain words:
1) Sort the sequences by similarity.
2) Align the two most similar ones.
3) Label the sequences as aligned.
4) Repeat, adding the next most similar sequence, until there are no unlabeled sequences.
Computational Complexity: the alignment stage is now similar to performing N pairwise alignments, so it is feasible for the computer to calculate it for a family of homologues.
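The ordering step described on this slide can be sketched in Python as follows. This is only a rough illustration of the greedy "most similar first" idea: the real ClustalW derives its order from a guide tree and aligns profiles rather than single sequences. The names `nw_score` and `join_order`, the scoring scheme (+1 match, -1 mismatch, -1 gap) and the toy sequences are all assumptions made for the example.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score, used here as the similarity."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def join_order(seqs):
    """Order in which sequences would be added to the growing alignment."""
    names = list(seqs)
    # Start from the two most similar sequences.
    first_pair = max(combinations(names, 2),
                     key=lambda p: nw_score(seqs[p[0]], seqs[p[1]]))
    aligned = list(first_pair)
    unlabeled = [n for n in names if n not in aligned]
    # Repeatedly add the unlabeled sequence closest to anything already aligned.
    while unlabeled:
        nxt = max(unlabeled,
                  key=lambda n: max(nw_score(seqs[n], seqs[m]) for m in aligned))
        aligned.append(nxt)
        unlabeled.remove(nxt)
    return aligned

toy = {  # hypothetical sequences, chosen only to make the order obvious
    "fly": "TTTTTTTTTT",
    "mouse": "ACGTACGTTT",
    "chimp": "ACGTACGTAC",
    "human": "ACGTACGTAC",
}
print(join_order(toy))
```

Note that ranking the sequences still requires comparing every pair once; the saving comes from never filling a table with more than two dimensions.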

5 View of an MSA using Belvu

6 Pairwise vs MSA
- If two homologues have diverged so far that they share less than about 20% sequence identity, the signal between them is so low that BLAST is not able to catch it. In other words, we can NOT use BLAST to find remote homologues.
- A Multiple Sequence Alignment (MSA) is able to spot important regions. Since important regions are under higher selective pressure, they change (evolve) less than other regions: conservation is related to importance.
- If the matches between two sequences occur in the conserved regions, the chances of these 2 sequences being homologues increase.
So how can we use all this information to improve our knowledge? Can an MSA spot remote homologues?

8 Advanced Searches: Consensus
Algorithm: for each position in the alignment, the consensus holds the most frequent monomer in that column.
Pros:
- The most basic way of summarizing an MSA into a single line.
- Easy to implement, easy to understand.
Cons:
- It does not take the frequencies into account, only the most represented monomer.
Example:
ACTGACTACGTACA
ATGCGTACCATACA
ATCAGTATCGTAGA
ATCAGTATCGAACA
--------------
ATCAGTATCGTACA   Consensus Sequence
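The consensus rule on this slide can be written in a few lines of Python. The function name `consensus` is an assumption made for this sketch; the four sequences are the ones shown on the slide. Ties are broken by whichever monomer appears first in the column, which is an arbitrary choice.

```python
from collections import Counter

def consensus(alignment):
    """Return the most frequent symbol of every column of an aligned block."""
    columns = zip(*alignment)  # transpose rows into columns
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

msa = [
    "ACTGACTACGTACA",
    "ATGCGTACCATACA",
    "ATCAGTATCGTAGA",
    "ATCAGTATCGAACA",
]
print(consensus(msa))  # ATCAGTATCGTACA
```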

9 Advanced Searches: Patterns
- Patterns are also called Regular Expressions by bioinformaticians.
- Useful when dealing with motifs.
- It is a complex (but powerful) language, not always easy at first glance.
- Ambiguity can be depicted:
  - [A,B]: either A or B
  - {A,B}: anything but A or B
  - x: any monomer
- Repetitions are easily represented:
  - A{2,4}: AA, AAA or AAAA
  - A+: any number of A's (but at least one)
  - A*: any number of A's (or even none)
- MSAs are reduced to a single line.
- Frequencies are not taken into account.
Example: the aligned sequences AGLV, AGLV and AGIV reduce to the pattern AG[IL]V.
Example: [AC]-x-G-x{4}-{L,I} reads as [Ala or Cys]-any-Gly-any-any-any-any-[any but Leu or Ile].
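As a rough illustration, the notation on this slide can be translated into a regular expression for Python's `re` module. The helper `pattern_to_regex` is a hypothetical name introduced here, and it only understands the dash-separated elements listed above; it is not a parser for any published pattern format. The test sequences are made up.

```python
import re

# One element: a core ([..], {..} or a single letter) plus an optional repeat.
ELEMENT = re.compile(
    r"(\[[A-Z,]+\]|\{[A-Z,]+\}|[A-Za-z])(\{\d+(?:,\d+)?\}|[+*])?"
)

def pattern_to_regex(pattern):
    """Translate the slide's dash-separated notation into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element.replace(" ", ""))
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, repeat = match.group(1), match.group(2) or ""
        letters = re.sub(r"[^A-Z]", "", core)
        if core.startswith("["):
            core = f"[{letters}]"    # any of the listed monomers
        elif core.startswith("{"):
            core = f"[^{letters}]"   # anything but the listed monomers
        elif core in ("x", "X"):
            core = "."               # any monomer
        parts.append(core + repeat)
    return "".join(parts)

regex = pattern_to_regex("[AC]-x-G-x{4}-{L,I}")
print(regex)
print(bool(re.fullmatch(regex, "AKGTTTTV")))  # ends in Val, allowed
print(bool(re.fullmatch(regex, "AKGTTTTL")))  # ends in Leu, excluded
```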

10 Advanced Searches: Profiles
- Profiles are Position Specific Scoring Matrices (PSSMs).
- Same concept as the Scoring Matrices (e.g. BLOSUM), but these are calculated from scratch for each position in the MSA instead of being pre-calculated.
- Thus, PSSMs are 20xN matrices, N being the length of the sequence (the number of alignment columns).
- PSSMs take into account information specific to the family of proteins. They are inferred from the alignment, so few assumptions are needed.
- We can align a sequence and a profile using the Smith-Waterman algorithm.
Workflow: MSA -> PSSM -> Search for homologues
(PSSM figure borrowed from F. Abascal)
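A minimal sketch of how a PSSM could be derived from an alignment is given below. It assumes a uniform background frequency, a pseudocount of 1 per symbol, and no gap characters in the alignment; real tools use weighted sequences and better background models. To keep it short it reuses the 4-letter DNA example from the consensus slide, so the matrix is 4xN rather than 20xN. The names `build_pssm` and `score` are illustrative.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {symbol: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            s: math.log2((counts[s] + pseudocount) / total / background)
            for s in alphabet
        })
    return pssm

def score(sequence, pssm):
    """Ungapped score of a sequence laid along the whole profile."""
    return sum(column[s] for s, column in zip(sequence, pssm))

msa = [
    "ACTGACTACGTACA",
    "ATGCGTACCATACA",
    "ATCAGTATCGTAGA",
    "ATCAGTATCGAACA",
]
pssm = build_pssm(msa, "ACGT")
print(round(score("ATCAGTATCGTACA", pssm), 2))  # consensus sequence
print(round(score("TTTTTTTTTTTTTT", pssm), 2))  # unrelated sequence
```

Unlike the consensus, the matrix keeps the column frequencies, so a sequence that uses the second most common monomer at a position is penalised less than one that uses a monomer never seen there.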

11 Advanced Searches: HMM-profiles
- HMM stands for Hidden Markov Model.
- Originally implemented for Speech Recognition.
- They are much more robust than simple profiles, especially when dealing with gaps. However, they are harder to implement.
- HMMER is a package similar to BLAST but using HMMs.
Figure legend: x = hidden states, y = observable outputs, a = transition probabilities, b = output probabilities
http://hmmer.janelia.org/
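To make the four ingredients in the legend concrete, here is a toy forward-algorithm sketch in Python. It is a generic two-state HMM, not a profile HMM with match, insert and delete states, and it does not reflect how HMMER is implemented. The state names, all probabilities and the extra `start` distribution are made-up assumptions for illustration.

```python
def forward(observations, states, start, a, b):
    """Probability of the outputs y, summed over all hidden-state paths x."""
    # alpha[s] = P(outputs so far, current hidden state = s)
    alpha = {s: start[s] * b[s][observations[0]] for s in states}
    for y in observations[1:]:
        alpha = {
            s: b[s][y] * sum(alpha[prev] * a[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

states = ("conserved", "variable")             # x: hidden states
start = {"conserved": 0.5, "variable": 0.5}    # assumed initial distribution
a = {                                          # a: transition probabilities
    "conserved": {"conserved": 0.9, "variable": 0.1},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
b = {                                          # b: output probabilities
    "conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
print(forward("AAT", states, start, a, b))     # y: observable outputs
```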

13 Remote Homologues: PSI-BLAST
- PSI-BLAST: Position-Specific Iterated BLAST
- PSI-BLAST is useful to retrieve remote homologues (id<20%)
- Algorithm:
  1) Run BLAST [Iteration #0]
  2) Generate a PSSM from the results better than a given threshold (e-value)
  3) Run BLAST again using the PSSM as input [Iterations #1 to #N]
  4) Update the PSSM with the new results
  5) Repeat from 3 until convergence*
*Convergence: when no new results are found
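The control flow of these five steps can be sketched in Python as below. This is not the PSI-BLAST implementation: the `search` callable, the `RELATED` toy "database" and the use of a plain set of names in place of a real PSSM are all stand-ins invented for the example, so that only the iterate-until-convergence logic is shown.

```python
def iterated_search(query, search, max_iterations=10):
    """Repeat a profile search until no new hits appear (convergence)."""
    hits = set(search(frozenset({query})))        # step 1: iteration #0
    for iteration in range(1, max_iterations + 1):
        profile = frozenset(hits | {query})       # steps 2 and 4: (re)build profile
        new_hits = set(search(profile)) - hits - {query}  # step 3: search with it
        if not new_hits:
            return hits, iteration                # step 5: converged
        hits |= new_hits
    return hits, max_iterations

# Hypothetical toy database: which entries each entry is detectably similar to.
RELATED = {"Q": {"A", "B"}, "A": {"C"}, "B": set(), "C": {"D"}, "D": set()}

def toy_search(profile):
    """Stand-in for BLAST: return everything similar to any profile member."""
    found = set()
    for member in profile:
        found |= RELATED[member]
    return found

hits, iterations = iterated_search("Q", toy_search)
print(sorted(hits), iterations)
```

In the toy data, A and B are found directly from the query, while C and D only become reachable once earlier hits have been folded into the profile, which mirrors how remote homologues are picked up in later iterations.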

14 Remote Homologues: PSI-BLAST
[Diagram labels: Target Sequence, BLAST, DataBase, PSI-BLAST, PSSM, Closely Related Homologues, Remotely Related Homologues]

15 Acknowledgements
- Federico Abascal (Original Text)
- Juan Carlos Sanchez
- Alfonso Valencia
