# Tools and Algorithms in Bioinformatics

Size: px
Start display at page:

Transcription

1 Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology Core University of Nebraska Medical Center Database Searching Boolean operators AND, OR, NOT 1

2 PubMed keyword search using Boolean operators Search String # of Hits Kinase 585,556 Drosophila 79,923 Kinase OR Drosophila 654,830 Kinase AND Drosophila 10,649 Kinase NOT Drosophila 574,907 Drosophila NOT Kinase 69,274 Drosophila NOT kinase NOT melanogaster 34,189 Searching Biological Databases Searching is done to find the relatedness between the query and the entries in the database For nucleic acids and proteins, the relatedness is defined by homology. A Query sequence is used to search against each entry, a subject in a database Two sequences are said to be homologous when they possess sequence identity above a certain threshold Thresholds can be defined by length, percentage identity, E-value, Bit-score, r.m.s.d. (for structures), etc., or a combination of one or more of these, depending on the objective of the search 2

3 BLAST output Pair-wise Alignments Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) 3

4 Sensitivity vs. Specificity Confusion Matrix Sensitivity: Attempts to report ALL true positives Sensitivity = True Positives / (True Positives + False Negatives) = TP/(TP+FN) (1-sensitivity) gives false negative rate Specificity: Attempts to report ALL true negatives Specificity = True Negatives / (True Negatives + False Positives) = TN/(TN+FP) (1-specificity) gives false positive rate Known positives are used to test sensitivity and known negatives for specificity In database searching, specificity and sensitivity always compete. Depending on the objective of the search, a tradeoff point should be chosen. Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) 4

5 Scoring Scheme Match Match between identical letters or letters of the same group Mismatch Match between letters of different groups Gap Match between a letter and a gap Alignment score is the sum of match, mismatch and gap penalty scores Say, you are aligning two sequences A and B Sequence A : PQVNNTVNRT / Sequence B : PVNRT PQVNNTVNRT PQVNNTVNRT PQVNNTVNRT P-VNRT---- P-V-N---RT P-----VNRT Scoring Scheme Match 5, Mismatch 2 Gaps Match 5, Mismatch 2 Gaps Match 5, Mismatch 2 GapI -5, GapE Proportional gap penalties Penalty P = na, EPAPRAEWVSTVHGSTIEQNEQI EPA----WVST EQI Proportional penalty= -65 Affine penalty = (-10)+(-11) = -21 Gap Penalties where, n is number of gaps and a is gap penalty Affine gap penalties Penalty P = a + mb, where, a is gap initiation penalty b is gap extension penalty, and m represents the number of extended gaps EPAPRAEWVSTVHGSTIEQNEQI E--P-A-WV-----ST-E---QI Proportional penalty= -65 Affine penalty = (-30)+(-7) = -37 Gap a = - 5 Gap b = - 1 In the proportional scheme, both these alignments score the same, while in the affine penalty scheme, the first one scores much higher 5

6 Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) Terminology in Nucleotide Substitutions AGCT: A/G are purines and C/T are pyrimidines Transitions vs Transversions Transition: Aà G or Cà T or vice versa Transversions: Aà T or Aà C or G à T or Gà C or vice versa Synonymous vs Non-synonymous substitutions Synonymous: CCC à CCG, both code for Proline Non-synonymous: UGC à UGG Cysteine à Tryptophan 6

7 Factors Affecting Amino Acid Replacements Each amino acid is coded by a triplet codon Each codon can undergo 9 possible single-base substitutions So in theory, point mutations in the 61 sense codons can lead to 549 (61 x 9) single-base substitutions Of these, 392 are non-synonymous (missense), 134 are synonymous and 23 are nonsense mutations Protein substitution matrices are built based on non-synonymous replacements only Protein substitution matrices reflect evolutionarily tested and accepted substitutions 7

8 Bias in the frequency of non-synonymous substitutions Structure of the genetic code itself Physical and chemical properties of the amino acid (C, W, M) Variation in the gene copy number or pseudogenes Gene function (Histone has the lowest mutation rate) Species life span (Rodents vs Humans) Bias in the frequency of synonymous substitutions Abundance of trna species (Rare vs Abundant trna species) Distance/Substitution Matrices Unitary matrices/minimum distance matrices PAM (Percent Accepted Mutations) BLOSUM (BLOcks SUbstitution Matrix) Terminology Global alignment Alignments with the highest score are found at the expense of local similarity Local alignment Alignments in which the highest scoring subsequences are found anywhere in the alignment 8

9 PAM (Percent Accepted Mutations) Developed by Dayhoff and co-workers PAM 30, 60, 100, 200, 250 Built from globally aligned, closely related sequences (85% similarity) A database of 1572 changes in 71 groups of closely related proteins PAM 1 matrix incorporates amino acid replacements that would be expected if one mutation had occurred per 100 amino acids of sequence i.e., corresponds to roughly one percent divergence in a protein Evolutionary tree is built from aligned sequences and relative frequency (Q ij ) of replacement for each pair of amino acids (i, j) is calculated. PAM (Percent Accepted Mutations) A similarity ratio R ij for each pair is calculated as R ij = Q ij / P i P j where P i and P j are observed (natural) frequencies of amino acids i and j in the set of proteins where replacements are counted Substitution score S ij = log 2 (R ij ) S ij >0 means, replacements are favored during evolution Real value Log value S ij <0 means, evolutionary selection is against the replacement Average relative frequencies of amino acids in animal proteins A C D E F G H I K L M N P Q R S T V W Y

10 BLOSUM (BLOcks SUbstitution Matrix) Developed by Henikoff and Henikoff (1992) Blosum 30, 62, 80 Built from BLOCKS database From the most conserved regions of aligned sequences 2000 blocks from 500 families Blosum 62 is the most popular. Here, 62 means that the sequences used in creating the matrix are at least 62% identical Higher Blosum number - Built from closely related sequences Lower Blosum number - Built from distant sequences 10

11 Major Differences between PAM and BLOSUM PAM Built from global alignments Built from small amout of Data Counting is based on minimum replacement or maximum parsimony Perform better for finding global alignments and remote homologs Higher PAM series means more divergence BLOSUM Built from local alignments Built from vast amout of Data Counting based on groups of related sequences counted as one Better for finding local alignments Lower BLOSUM series means more divergence 11

12 Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) Meaning of E-value in Database Searches E- Value (Expectation value) The number of equal or higher scores expected at random for a given High Scoring Pair (HSP) E-value of 10 for a match means, in a database of current size, one might expect to see 10 matches with a similar or better score, simply by chance E-value is the most commonly used threshold in database searches. Only those matches with E-values smaller than the set threshold will be reported in the output E-value ranges between 0 to higher, lower the E-value, better the reliability of a match. 12

13 Meaning of Bit Score Bit Score Raw scores have no meaning without the knowledge of the scoring scheme used. Raw score are normalized to get Bit scores by incorporating information about the scoring scheme used and the search space used (size of database) Bit score is normalized score and hence it is independent of the size of the database, while E- values are very sensitive to the database size. Generally bit scores of 40 are higher are considered reliable. Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) 13

14 Filtering low complexity sequences Filters out short repeats and low complexity regions from the query sequences before searching the database Filtering helps to obtain statistically significant results and reduce the background noise resulting from matches with repeats and low complexity regions The output shows which regions of the query sequence were masked Filtering low-complex regions Low complex regions in protein sequences are those that have repeating residues of the same or a few different residues in a stretch. In database searches low complex regions on the query sequence are masked (with X) to avoid random hits and to emphasize on significant hits Tools used SEG NSEG (DNA sequences) PSEG (Protein sequences) XNU (For masking repeats <10 in length) DUST ( DNA sequences) CAST (Proteins) 14

15 Basic Elements in Searching Biological Databases Sensitivity versus Specificity/selectivity Scoring Scheme, Gap penalties Distance/Substitution Matrices (PAM, BLOSSUM Series) Search Parameters (E-value, Bit score) Handling Data Quality Issues (Filtering, Clustering) Type of Algorithm (Smith-Waterman, Needleman-Wunsch) Choice of the Searching Algorithm An ideal algorithm should have Good specificity and sensitivity Should be fast running Should not use too much memory Greedy algorithms are very sensitive, but very slow. Heuristic algorithms are relatively fast, but loose some sensitivity. It s always a challenge for a programmer to develop algorithms that fulfill both of these requirements 15

16 Needleman-Wunsch algorithm (JMB 48:443-53, 1970) Very greedy algorithm, so very sensitive Implements Dynamic programming Provides global alignment between the two sequences Smith-Waterman algorithm (JMB 147:195-97, 1981) A set of heuristics were applied to the above algorithm to make it less greedy, so it is less sensitive but runs faster Implements Dynamic programming Provide local alignment between two sequences Both BLAST and FASTA use this algorithm with varying heuristics applied in each case FASTA (FAST Algorithm) The fist step is application of heuristics and the second step is using dynamic programming First, the query sequence and the database sequence are cut into defined length words and a word matching is performed in all-to-all combinations Word size is 2 for proteins and 6 for nucleic acids If the initial score is above a threshold, the second score is computed by joining fragments and using gaps of less than some maximum length If this second score is above some threshold, Smith-Waterman alignment is performed within the regions of high identities (known as high-scoring pairs) 16

17 Protein and nucleotide substitution matrices BLAST (Basic Local Alignment Search Tool) The fist step is application of heuristics and the second step is using dynamic programming First, the query sequence and the database sequence are cut into defined length words and a word matching is performed in all combinations Words that score above a threshold are used to extend the word list 7 words Expanded list- 47 words Word Expanded List ql ql, qm, hl ln ln nf nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf fs fs, fa, fn, fd, fg, fp, ft, ys gw gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw 17

18 BLAST continued... BLAST is a local alignment algorithm Several High Scoring Segments are found, with the maximum scoring segment used to define a band in the path graph Smith-Waterman algorithm is performed on several possible segments to obtain optimal alignment The word size for Protein is 3 and for Nucleic acid is 11 Finding High Scoring Pairs (HSPs) 18

19 Comparison of BLAST and FASTA BLAST uses an expanded list to compensate for the loss of sensitivity from increased word size BLAST is more sensitive than FASTA for protein searches while FASTA is more sensitive than BLAST for nucleic acid searches Both BLAST and FASTA run faster than the original Needleman-Waunch algorithm at the cost of loss of sensitivity Both algorithms fail to find optimal alignments that fall outside of the defined band width Essential Elements of an Alignment Algorithm Defining the problem (Global, semi-global, local alignment) Scoring scheme (Gap penalties) Distance Matrix (PAM, BLOSUM series) Scoring/Target function (How scores are calculated) Good programming language to test the algorithm 19

20 Types of Alignments Global - When two sequences are of approximately equal length. Here, the goal is to obtain maximum score by completely aligning them Semi-global - When one sequence matches towards one end of the other. Ex. Searches for 5 or 3 regulatory sequences Local - When one sequence is a sub-string of the other or the goal is to get maximum local score Protein motif searches in a database Local vs Global alignments 20

21 Pair-wise sequence comparison Different Alignment Paths Sequence T S T S T S T S 21

22 Whole genome sequence comparison Scoring System for Alignments Scoring Weights Matches +10 -These are arbitrary values, but Mismatches +4 the real values come from distance matrices (PAM, BLOSUM etc) Gap Penalties Gap Initiation Arbitrarily chosen but, optimized Gap Extension - 2 for a particular Distance matrix Rules of Thumb for Affine Gap Penalties Gap Initiation Penalty should be 2 to 3 times the largest negative score in a distance matrix table Gap Extension Penalty should be 0.3 to 0.l times the Initiation Penalty 22

23 Similarity Score Similarity score is the sum of pair-wise scores for matches/ mismatches and gap penalties Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 Global Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (9 x 10) + (0 x 4) - (2 x x -2) Score = = 56 Scoring/Target function The scoring function calculates the similarity score. The goal of the algorithm is to maximize similarity score Global Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (9 x 10) + (2 x x -2) Score = 90 + (-34) = 56 Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 Local Alignment EPSGFPAWVSTVHGQEQI E----PAWVST-----QI Score: (6 x 10) + (2 x x 0) Score = = 60 Scoring Weights Matches +10 Mismatches +4 Gap Initiation -10 Gap Extension -2 No terminal gap penalty 23

24 Different types of BLAST programs Program Query Database Comparison blastn nucleotide nucleotide nucleotide blastp protein protein protein blastx nucleotide protein protein tblastn protein nucleotide protein tblastx nucleotide nucleotide protein megablast nucleotide nucleotide nucleotide PHI-Blast protein protein protein PSI-Blast Protein-Profile protein protein RPS-Blast protein Profiles protein 24

### Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

### Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

### Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

### CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

### Computational Biology

Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

### In-Depth Assessment of Local Sequence Alignment

2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

### Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

### Bioinformatics for Biologists

Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

### Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

### Algorithms in Bioinformatics

Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

### 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

### Bioinformatics and BLAST

Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

### Quantifying sequence similarity

Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

### Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

### CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

### Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

### Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

### Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

### Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

### Large-Scale Genomic Surveys

Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

### Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

### Sequence analysis and Genomics

Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

### THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

### Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

### Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

### Introduction to Bioinformatics

Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

### Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

### Sequence analysis and comparison

The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

### BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

### Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

### Practical considerations of working with sequencing data

Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

### Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

### Local Alignment Statistics

Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

### Collected Works of Charles Dickens

Collected Works of Charles Dickens A Random Dickens Quote If there were no bad people, there would be no good lawyers. Original Sentence It was a dark and stormy night; the night was dark except at sunny

### Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

### Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

### CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

### Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

### Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

### Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

### Fundamentals of database searching

Fundamentals of database searching Aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins. The principles

### EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

### Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

### Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

### Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

### Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

### Heuristic Alignment and Searching

3/28/2012 Types of alignments Global Alignment Each letter of each sequence is aligned to a letter or a gap (e.g., Needleman-Wunsch). Local Alignment An optimal pair of subsequences is taken from the two

### Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

### Bioinformatics Exercises

Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

### Substitution matrices

Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM

### BLAST. Varieties of BLAST

BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

### Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

### BLAST: Target frequencies and information content Dannie Durand

Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

### Pairwise sequence alignment

Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

### Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

### Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

### Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

### Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

### Pairwise sequence alignments

Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

### Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis Kumud Joseph Kujur, Sumit Pal Singh, O.P. Vyas, Ruchir Bhatia, Varun Singh* Indian Institute of Information

### Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

### Administration. ndrew Torda April /04/2008 [ 1 ]

ndrew Torda April 2008 Administration 22/04/2008 [ 1 ] Sprache? zu verhandeln (Englisch, Hochdeutsch, Bayerisch) Selection of topics Proteins / DNA / RNA Two halves to course week 1-7 Prof Torda (larger

### 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

### Sequence Comparison. mouse human

Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

### Sequence comparison: Score matrices

Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

### Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

### Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

### Global alignments - review

Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply

### BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

### Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

he universe of biological sequence analysis Word/pattern recognition- Identification of restriction enzyme cleavage sites Sequence alignment methods PstI he universe of biological sequence analysis - prediction

### Sequence Alignment (chapter 6)

Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

### Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

### Motivating the need for optimal sequence alignments...

1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

### Protein function prediction based on sequence analysis

Performing sequence searches Post-Blast analysis, Using profiles and pattern-matching Protein function prediction based on sequence analysis Slides from a lecture on MOL204 - Applied Bioinformatics 18-Oct-2005

### An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

### Scoring Matrices. Shifra Ben Dor Irit Orr

Scoring Matrices Shifra Ben Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

### 7.36/7.91 recitation CB Lecture #4

7.36/7.91 recitation 2-19-2014 CB Lecture #4 1 Announcements / Reminders Homework: - PS#1 due Feb. 20th at noon. - Late policy: ½ credit if received within 24 hrs of due date, otherwise no credit - Answer

Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib

### 8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

### A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

### C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

### RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Molecular Biology-2018 1 Definitions: RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES Heterologues: Genes or proteins that possess different sequences and activities. Homologues: Genes or proteins that

### Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

### 8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

### Introduction to protein alignments

Introduction to protein alignments Comparative Analysis of Proteins Experimental evidence from one or more proteins can be used to infer function of related protein(s). Gene A Gene X Protein A compare

### InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

### First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

### Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Genome Annotation Qi Sun Bioinformatics Facility Cornell University Some basic bioinformatics tools BLAST PSI-BLAST - Position-Specific Scoring Matrix HMM - Hidden Markov Model NCBI BLAST How does BLAST

### 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

### Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

### Introduction to Computation & Pairwise Alignment

Introduction to Computation & Pairwise Alignment Eunok Paek eunokpaek@hanyang.ac.kr Algorithm what you already know about programming Pan-Fried Fish with Spicy Dipping Sauce This spicy fish dish is quick

### BLAST: Basic Local Alignment Search Tool

.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

### CSE 549: Computational Biology. Substitution Matrices

CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are

### 1.5 Sequence alignment

1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

### Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

### Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.