Tools and Algorithms in Bioinformatics

Similar documents
Tools and Algorithms in Bioinformatics

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Bioinformatics for Biologists

Basic Local Alignment Search Tool

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics and BLAST

Algorithms in Bioinformatics

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Sequence Database Search Techniques I: Blast and PatternHunter tools

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Sequence analysis and Genomics

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Practical considerations of working with sequencing data

Alignment & BLAST. By: Hadi Mozafari KUMS

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Single alignment: Substitution Matrix. 16 march 2017

Sequence Alignment Techniques and Their Uses

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

BLAST. Varieties of BLAST

Large-Scale Genomic Surveys

Collected Works of Charles Dickens

In-Depth Assessment of Local Sequence Alignment

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Computational Biology

Pairwise & Multiple sequence alignments

Heuristic Alignment and Searching

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Multiple sequence alignment

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Introduction to Bioinformatics

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Effects of Gap Open and Gap Extension Penalties

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Protein function prediction based on sequence analysis

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Quantifying sequence similarity

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Phylogenetic inference

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

G4120: Introduction to Computational Biology

Introduction to protein alignments

Copyright 2000 N. AYDIN. All rights reserved. 1

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Moreover, the circular logic

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

SUPPLEMENTARY INFORMATION

Similarity searching summary (2)

EECS730: Introduction to Bioinformatics

Week 10: Homology Modelling (II) - HHpred

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Similarity or Identity? When are molecules similar?

Overview Multiple Sequence Alignment

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Fundamentals of database searching

Phylogenetic Tree Reconstruction

EECS730: Introduction to Bioinformatics

Pairwise sequence alignment

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Lecture 5,6 Local sequence alignment

Sequence analysis and comparison

Sequence Alignment (chapter 6)

Practical Bioinformatics

Local Alignment Statistics

Comparative genomics: Overview & Tools + MUMmer algorithm

BLAST: Target frequencies and information content Dannie Durand

Dr. Amira A. AL-Hosary

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Tutorial 4 Substitution matrices and PSI-BLAST

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations

Sequence Comparison. mouse human

Introduction to Bioinformatics Online Course: IBT

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

A profile-based protein sequence alignment algorithm for a domain clustering database

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Homology and Information Gathering and Domain Annotation for Proteins

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Motivating the need for optimal sequence alignments...


Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Transcription:

Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology Core University of Nebraska Medical Center Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) 1

Choice of the Searching Algorithm An ideal algorithm should have Good specificity and sensitivity Should be fast running Should not use too much memory Greedy algorithms are very sensitive, but very slow. Heuristic algorithms are relatively fast, but loose some sensitivity. It s always a challenge for a programmer to develop algorithms that fulfill both of these requirements Needleman-Wunsch algorithm (JMB 48:443-53, 1970) Very greedy algorithm, so very sensitive Implements Dynamic programming Provides global alignment between the two sequences Smith-Waterman algorithm (JMB 147:195-97, 1981) A set of heuristics were applied to the above algorithm to make it less greedy, so it is less sensitive but runs faster Implements Dynamic programming Provide local alignment between two sequences Both BLAST and FASTA use this algorithm with varying heuristics applied in each case 2

FASTA (FAST Algorithm) The fist step is application of heuristics and the second step is using dynamic programming First, the query sequence and the database sequence are cut into defined length words and a word matching is performed in all-to-all combinations Word size is 2 for proteins and 6 for nucleic acids If the initial score is above a threshold, the second score is computed by joining fragments and using gaps of less than some maximum length If this second score is above some threshold, Smith-Waterman alignment is performed within the regions of high identities (known as high-scoring pairs) Protein and nucleotide substitution matrices 3

BLAST (Basic Local Alignment Search Tool) The fist step is application of heuristics and the second step is using dynamic programming First, the query sequence and the database sequence are cut into defined length words and a word matching is performed in all combinations Words that score above a threshold are used to extend the word list 7 words Expanded list- 47 words Word Expanded List ql ql, qm, hl ln ln nf nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf fs fs, fa, fn, fd, fg, fp, ft, ys gw gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw BLAST continued... BLAST is a local alignment algorithm Several High Scoring Segments are found, with the maximum scoring segment used to define a band in the path graph Smith-Waterman algorithm is performed on several possible segments to obtain optimal alignment The word size for Protein is 3 and for Nucleic acid is 11 4

Finding High Scoring Pairs (HSPs) Comparison of BLAST and FASTA BLAST uses an expanded list to compensate for the loss of sensitivity from increased word size BLAST is more sensitive than FASTA for protein searches while FASTA is more sensitive than BLAST for nucleic acid searches Both BLAST and FASTA run faster than the original Needleman-Waunch algorithm at the cost of loss of sensitivity Both algorithms fail to find optimal alignments that fall outside of the defined band width 5

Essential Elements of an Alignment Algorithm Defining the problem (Global, semi-global, local alignment) Scoring scheme (Gap penalties) Distance Matrix (PAM, BLOSUM series) Scoring/Target function (How scores are calculated) Good programming language to test the algorithm Types of Alignments Global - When two sequences are of approximately equal length. Here, the goal is to obtain maximum score by completely aligning them Semi-global - When one sequence matches towards one end of the other. Ex. Searches for 5 or 3 regulatory sequences Local - When one sequence is a sub-string of the other or the goal is to get maximum local score Protein motif searches in a database 6

Local vs Global alignments Pair-wise sequence comparison 7

Different Alignment Paths Sequence T S T S T S T S Whole genome sequence comparison 8

Scoring System for Alignments Scoring Weights Matches +10 -These are arbitrary values, but Mismatches +4 the real values come from distance matrices (PAM, BLOSUM etc.) Gap Penalties Gap Initiation - 10 - Arbitrarily chosen but, optimized Gap Extension - 2 for a particular Distance matrix Rules of Thumb for Affine Gap Penalties Gap Initiation Penalty should be 2 to 3 times the largest negative score in a distance matrix table Gap Extension Penalty should be 0.3 to 0.l times the Initiation Penalty Similarity Score Similarity score is the sum of pair-wise scores for matches/ mismatches and gap penalties Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 Similarity Score = (Match score + Mismatch score) + Gap penalty Global Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (9 x 10) + (0 x 4) + (2 x -10 + 7 x -2) Score = 90 + (- 34) = 56 9

Scoring/Target function The scoring function calculates the similarity score. The goal of the algorithm is to maximize similarity score Global Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (9 x 10) + (2 x -10 + 7 x -2) Score = 90 + (-34) = 56 Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 Local Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (6 x 10) + (2 x 0 + 10 x 0) Score = 60 + 0 = 60 Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 No terminal gap penalty Different types of BLAST programs Program Query Database Comparison blastn nucleotide nucleotide nucleotide blastp protein protein protein blastx nucleotide protein protein tblastn protein nucleotide protein tblastx nucleotide nucleotide protein megablast nucleotide nucleotide nucleotide PHI-Blast protein protein protein PSI-Blast Protein-Profile protein protein RPS-Blast protein Profiles protein 10

Multiple Sequence Alignment 21 Terminology Homologous : Similar Paralogous : Present in the same species, diverged after gene duplication. Orthologous: Present in different species, diverged after speciation. Xenologous: Genes acquired by horizontal gene transfer Analogous: Similarity observed by convergent evolution, but not by common evolutionary origin. 11

Evolution of Genomes and Genetic Variation Variation between species Recombination, Cis-regulatory elements Point mutations/ insertions/deletions Evolutionary selection à speciation Variation within species: Phenotype = Genotype + Environment SNPs, haplotypes, aneuploidy, etc. Variation at cellular level: Spatial state (Tissue/Subcellular location) Temporal state (Stages in life cycle) Physiological state (normal/disease) External stimuli Phylogenetic Tree of Life 12

Multiple Sequence Alignments Aligning homologous residues among a set of sequences, together in columns Multiple alignments exhibit structural and evolutionary information among closely related species i.e., families A column of aligned residues in a conserved region is likely to occupy similar 3-D positions in protein structure Patterns or motifs common to a set of sequences may only be apparent from multiple alignments Necessary for building family profiles, phylogenetic trees and extracting evolutionary relationships Useful for homology modeling if at least one member of the family has a known 3-D structure Phylogenetic Complexity 13

Multiple alignments from different VGC protein families Sodium VGC Calcium VGC Potassium VGC Voltage-gated ion channel proteins Topology of six TMS in VGC proteins 14

Scoring a Multiple Alignment Some positions are more conserved than others Sequences are not independent, but are related by a phylogeny Pair-wise Multiple Most of the multiple alignment algorithms assume that the individual columns of an alignment are statistically independent Total Alignment Score S = Σ i S(m i ) + G where, m i is column i of the multiple alignment m, S(m i ) is the score for column i, G is gap penalty 15

Basic progressive alignment procedure Calculate alignment score for all pair-wise combinations using Dynamic Programming Determine distances from scores for all pairs of sequences and build a distance matrix Use a distance-based method to construct a guide-tree Add sequences to the growing alignment using the order given by the tree Most of the multiple alignment methods differ in the way the guide tree is constructed 16

Feng-Doolittle Method Get a distance matrix Calculate pair-wise scores with DP method, convert raw scores into pair-wise distances using the following formula D = -ln S eff S eff = [S obs - S rand ] / [S max - S rand ] where, -S eff is the effective score -S obs is the similarity score(dp score) between a pair -S max is the max score, average of aligning either sequence to itself -S rand is the background noise, obtained by aligning two random sequences of equal length and composition Feng-Doolittle Method.. From pair-wise distances, build a matrix for all sequence combinations as follows Human Chimp Gorilla Orang Human 0 88 103 160 Chimp 0 106 170 Gorilla 0 166 Orang 0 The number of pair-wise distances are N(N-1)/2 17

Feng-Doolittle Method.. Alignment of sequences The alignment order is determined from the order sequences were added to the guide tree First 2 sequences from the node are added first. In this case, C-H are aligned according to the standard DP algorithm O G C H Next, G is aligned to CH as the best of G(CH) and (CH)G alignments Assuming that of the above two, G(CH) has the best score, O is aligned to G(CH) as the best of O(G(CH)) and G(O(CH)) Again, higher similarity score determines which one is the best. This process is repeated iteratively until all sequences are aligned. As you notice, the order of sequences in the output are not the same as you see in the guide tree. Multiple sequence alignment (MSA) MSA: tools Clustal, Mafft, Muscle, TCoffee., Mview Accessible from Ebi website: http://www.ebi.ac.uk/tools/msa/ Obtains a distance matrix by using scores from pair-wise DP method Guide tree is built using neighbor joining method Sequences are weighed to compensate for biased representation in larger families Substitution matrices change on the fly as the alignment progresses; closely related sequences are aligned with hard matrices (e. g. BLOSUM80) and distant sequences are aligned with soft matrices (e.g. BLOSUM40) Position specific gap penalties are used similar to profiles Guide tree may be adjusted on the fly to defer the alignment of low-scoring sequences 18