# Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013


1 Sequence Alignments: Dynamic programming approaches, scoring, and significance. Lucy Skrabanek, ICB, WMC. January 31, 2013

2 Sequence alignment
Compare two (or more) sequences to:
- Find regions of conservation (or variability)
  - Common motifs
- Find similar sequences in a database
- Assess likelihood of common evolutionary history

Once we have constructed the best alignment between two sequences, we can assess how similar they are to each other.

3 Methods of sequence alignment
- Dot-matrix plots
- Dynamic programming algorithms (DPA), used to determine evolutionary homology
  - Needleman-Wunsch (global)
  - Smith-Waterman (local)
- Heuristic methods, used to determine similarity to sequences in a database
  - BLAST
- Combination of DPA and heuristic (FASTA)

4 Dot-matrix plot (figure)

5 Dynamic programming (Bellman, 1955)
- Tabular design technique
  - Partition the problem into sub-problems
  - Solve sub-problems recursively
  - Combine solutions to solve the original problem
- Applicable when the sub-problems are not independent

6 Dynamic programming
- Sub-problems are computed once and saved in a table or matrix.
- The table is built from the bottom up.
- The basic ideas behind dynamic programming are:
  - Organizing the work to avoid repeating work already done.
  - Breaking a large problem into smaller incremental steps, each of which is solved optimally.
- DPAs are typically used when a problem has many possible solutions and an optimal one has to be found.
- Hallmark of DPA: an optimal solution to a problem contains optimal solutions to sub-problems.

7 Example alignment: ELVIS, LIVES?

8 Scoring alignments
- Associate a score with every possible alignment
  - A higher score indicates a better alignment
- Need a scoring scheme that will give us
  - a. Values for residue substitution
  - b. Penalties for introducing gaps
- The score for each alignment is then the sum of all the residue substitution scores minus any penalties for gaps introduced
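The scoring rule above (sum of substitution scores minus gap penalties) can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the `SCORES` table is a hand-entered subset intended to match BLOSUM62 for the residues in ELVIS/LIVES, and the gap penalty of 1 and the names `substitution_score` and `alignment_score` are assumptions made for the example.

```python
# Hand-entered subset of substitution scores (assumption: intended to match
# BLOSUM62 for these residue pairs; treat the values as illustrative).
SCORES = {
    ("E", "E"): 5, ("L", "L"): 4, ("V", "V"): 4, ("I", "I"): 4, ("S", "S"): 4,
    ("I", "V"): 3, ("I", "L"): 2, ("L", "V"): 1,
    ("E", "I"): -3, ("E", "L"): -3, ("E", "V"): -2, ("E", "S"): 0,
    ("L", "S"): -2, ("I", "S"): -2, ("S", "V"): -2,
}


def substitution_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def alignment_score(top, bottom, gap_penalty):
    """Sum substitution scores column by column; subtract a penalty per gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += substitution_score(a, b)
    return total


# Two candidate alignments of ELVIS and LIVES, with an assumed gap penalty of 1.
print(alignment_score("EL-VIS", "-LIVES", gap_penalty=1))
print(alignment_score("ELVI-S", "-LIVES", gap_penalty=1))
```

Under these assumed values the alignment with fewer identical residues scores higher, because the I/V substitutions are rewarded while the I/E mismatch is penalized.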

9 BLOSUM62 substitution matrix (figure)

10 N-W mathematical formulation
Global alignment: used when the sequences are of a comparable length, and show similarity along their whole lengths.

$$F(i,j) = \max \begin{cases} F(i-1,\, j-1) + s(x_i, y_j) \\ F(i-1,\, j) - d \\ F(i,\, j-1) - d \end{cases}$$

where $F(i,j)$ is the value in cell $(i,j)$; $s(x_i, y_j)$ is the score for that pair of residues in the substitution matrix; $d$ is the gap penalty.
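A minimal sketch of the Needleman-Wunsch recurrence and its traceback follows. The match/mismatch scorer, the linear gap penalty `d = 1`, and the function names are assumptions chosen for illustration; a substitution matrix such as BLOSUM62 could be passed as `s` instead. Scores are assumed to be integers so that the equality checks in the traceback are exact.

```python
def match_mismatch(a, b):
    """Illustrative scorer (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def needleman_wunsch(x, y, s, d):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps only.
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    # Fill the table with the three-way maximum from the recurrence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("ELVIS", "LIVES", match_mismatch, 1)
print(score)
print(top)
print(bottom)
```

When several alignments share the optimal score, this traceback returns only one of them (it prefers the diagonal move); which one is reported is a design choice, not part of the recurrence.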

11 Simple global alignment (figure: dynamic programming matrix for ELVIS against LIVES)

12 Traceback step (figure: traceback through the matrix for ELVIS against LIVES)
Aligned pairs recovered by the traceback: E/-, L/L, V/I, I/V, -/E, S/S

13 Resulting alignment
Not correct to just align the highest number of identical residues!
- `EL-VIS` over `-LIVES` (3 identical residues)
- `ELVI-S` over `-LIVES` (2 identical residues)
- Score = 12
- Residue properties: isoleucine (I) and valine (V) are hydrophobic; glutamic acid (E) [glutamate] is acidic

14 Local alignment
Differences from global alignments:
- Minimum value allowed = 0
  - Corresponds to starting a new alignment
- The alignment can end anywhere in the matrix, not just the bottom right corner
  - To find the best alignment, find the best score and trace back from there
- Expected score for a random match MUST be negative

15 S-W local alignment formula
Local alignment: used when the sequences are of very different lengths, or when they show only short patches of similarity.

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,\, j-1) + s(x_i, y_j) \\ F(i-1,\, j) - d \\ F(i,\, j-1) - d \end{cases}$$
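The local variant differs from the global one only in the floor at 0 and in where the traceback starts and stops. The sketch below is an illustration under assumptions: a +1/-1 match/mismatch scorer, a linear gap penalty, integer scores, and hypothetical function names.

```python
def match_mismatch(a, b):
    """Illustrative scorer (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def smith_waterman(x, y, s, d):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The extra 0 lets an alignment restart instead of going negative.
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the highest-scoring cell until a cell with value 0.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXABCYY", "ZZABCWW", match_mismatch, 1))
```

If several cells share the best score, this sketch keeps the first one encountered; reporting all co-optimal local alignments would need a list of starting cells.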

16 (figure: local alignment matrix for ALIGNMENT against SMALLISH)

17 (figure: local alignment matrix for ALIGNMENT against SMALLISH)

18 Resulting alignments
- `ALLISH` over `AL-IGN`, score = 11
- `ALLISH` over `A-LIGN`, score = 11

19 Alignment with affine gap scores
- Affine gap scores
  - Affine means that the penalty for a gap is computed as a linear function of its length
  - Gap opening penalty
  - Less prohibitive gap extension penalty
- Have to keep track of multiple values for each pair of residue coefficients i and j, in place of the single value F(i,j)
  - We keep two matrices, M and I
- Time taken: increases to $O(NM^2)$

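20 Affine gap score formula

$$M(i,j) = \max \begin{cases} M(i-1,\, j-1) + s(x_i, y_j) \\ I(i-1,\, j-1) + s(x_i, y_j) \end{cases}$$

$$I(i,j) = \max \begin{cases} M(i,\, j-1) - d \\ I(i,\, j-1) - e \\ M(i-1,\, j) - d \\ I(i-1,\, j) - e \end{cases}$$

where $d$ is the gap opening penalty and $e$ is the gap extension penalty.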
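The two-matrix recurrence can be sketched as a score-only global alignment. The boundary conditions (a leading gap of length k costs d + (k - 1)e, and M is unreachable along the borders) are assumptions added here, since the formula above does not state them; the scorer and function names are likewise illustrative.

```python
NEG_INF = float("-inf")


def match_mismatch(a, b):
    """Illustrative scorer (assumption): +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def affine_global_score(x, y, s, d, e):
    """Best global alignment score with gap-open d and gap-extend e."""
    n, m = len(x), len(y)
    # M: best score with x[i-1] aligned to y[j-1]; I: best score ending in a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Assumed boundaries: a leading gap of length k costs d + (k - 1) * e.
    for i in range(1, n + 1):
        I[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        I[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1]) + sub
            I[i][j] = max(
                M[i][j - 1] - d, I[i][j - 1] - e,
                M[i - 1][j] - d, I[i - 1][j] - e,
            )
    return max(M[n][m], I[n][m])


# Two matches (+4) and one gap of length 2 (5 + 1) under the assumed parameters.
print(affine_global_score("AAAA", "AA", match_mismatch, d=5, e=1))
```

Because a single matrix I covers gaps in either sequence, a gap in one sequence immediately followed by a gap in the other is charged as an extension rather than a new opening; a three-matrix formulation would distinguish the two cases.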

21 Improvements
- Gotoh
  - Simplified algorithm to improve affine gapped alignment performance (from $O(NM^2)$ to $O(NM)$)
  - Although space required: 3 x MN
  - Used linear relationship for gap weight: $w_x = g + rx$
- Myers and Miller
  - Divide-and-conquer, based on Hirschberg, 1975

22 Conclusions
- DPA is mathematically proven to yield the optimal alignment between any two given sequences.
- The final alignment depends on the scoring matrix for similar residues and on the gap penalties.

23 Gap penalty options
Depends on task:
- Maximizing quality of a pairwise alignment
  - Lower gap penalties (biased towards allowing complete global alignment)
- Maximizing sensitivity, searching a database
  - Higher gap penalties to maximize the difference between scores for homologous sequences and unrelated sequences

| Matrix | Gap penalties |
|---|---|
| BLOSUM62 | 11, 1 / 5, 3 |
| BLOSUM80 | 10, 1 / 9, 3 |
| BLOSUM45 | 15, 2 |
| PAM30 | 9, 1 |
| PAM70 | 10, 1 |
| PAM250 | 14, 2 |

Reese and Pearson, Bioinformatics, 2002; Vogt et al., JMB, 1995

24 BLAST - heuristic strategy (Altschul et al, NAR, 1987)
- Basic Local Alignment Search Tool
- Break the query and all database sequences into words
- Look for matches that have at least a given score, given a particular scoring matrix
  - Seed alignment
- Extend the seed alignment in both directions to find an alignment with a score above a threshold (or an E-value below one)
- Refinements: use DPAs for extension; 2-hit strategy
- The word size and threshold score determine the speed of the search

25 BLAST alignment strategy
1. Break sequences up into words
   - ELVIS: ELV, LVI, VIS
   - LIVES: LIV, IVE, VES
2. Find all matching words above a score threshold
   - For LVI: LVI, LVV, LII, LVL, LLI, LIV
3. Locate any perfect matches in the other set of words
   - Use original word!
4. Align and extend, with gaps if necessary
   - `elvi-s` over `-LIVes`
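Steps 1 and 2 of the word strategy can be sketched as follows. This is an illustration only: the `PAIR` table is a hand-entered subset intended to match BLOSUM62 for L, I and V, the neighbourhood is enumerated over that three-letter alphabet rather than all 20 amino acids, and the threshold of 10 and the function names are assumptions.

```python
from itertools import product

# Hand-entered scores for L, I, V (assumption: intended to match BLOSUM62).
PAIR = {
    ("L", "L"): 4, ("I", "I"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("L", "V"): 1, ("I", "V"): 3,
}


def pair_score(a, b):
    """Symmetric lookup in the small score table."""
    return PAIR[(a, b)] if (a, b) in PAIR else PAIR[(b, a)]


def words(seq, w=3):
    """Step 1: all overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_score(u, v):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


def neighbourhood(word, threshold, alphabet="LIV"):
    """Step 2: words over the (restricted) alphabet scoring >= threshold."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= threshold]


print(words("ELVIS"))
print(words("LIVES"))
print(neighbourhood("LVI", threshold=10))
# Step 3: a neighbourhood word of LVI that occurs exactly in LIVES.
print([w for w in neighbourhood("LVI", 10) if w in words("LIVES")])
```

The exact match found in step 3 is only a seed; the extension in step 4 aligns the original word (LVI) rather than its neighbour.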

26 PSI-BLAST: Position-Specific Iterated BLAST
- Uses a PSSM (position-specific scoring matrix) derived from BLAST results
  - Generated by calculating position-specific scores for each position in the alignment
  - Highly conserved positions receive high scores
  - Weakly conserved positions receive low scores
- PSSM used as the scoring matrix for the next BLAST search
- Iterative searching leads to increased sensitivity

27 Scoring matrices
- Strongly influence the outcome of the sequence comparison
- Implicitly represent a particular theory of evolution

Frommlet et al, Comp Stat and Data Analysis, 2006

28 Examples of scoring matrices
- Identity
- BLOSUM
- PAM
- Gonnet (also called GCB: Gonnet, Cohen and Benner)
- JTT (Jones, Thornton, Taylor)
- Distance matrix (minimum number of mutations needed to convert one residue to another), e.g., I -> V = 1 (ATT -> GTT)
- Physicochemical characteristics, e.g., hydrophobicity scoring (George et al., 1990)

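29 DNA matrices
Identity matrix:

|   | A | T | C | G |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| T | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 1 | 0 |
| G | 0 | 0 | 0 | 1 |

(figures: transition/transversion matrix and WU-BLAST matrix, each over A, T, C, G)
- Purines: A, G
- Pyrimidines: C, T
- Transition: A -> G
- Transversion: C -> G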
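The purine/pyrimidine rule behind a transition/transversion matrix can be written down directly. A small sketch, with the function name `classify_substitution` chosen for illustration:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a nucleotide pair as identity, transition or transversion."""
    if a == b:
        return "identity"
    # A transition stays within one chemical class; a transversion crosses.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("C", "G"))
```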

30 PAM matrices
- PAM (Percent/Point Accepted Mutation)
- One PAM unit corresponds to one amino acid change per 100 residues, or 0.01 changes/residue, or roughly 1% divergence
  - PAM30: 0.3 changes/residue
  - PAM250: 2.5 substitutions/residue
- Each matrix gives the changes expected for a given period of evolutionary time
  - Frequency of replacement of one amino acid by another
- Assumes that amino acid frequencies remain constant over time and that substitutions observed over short periods of evolutionary history can be extrapolated to longer distances
- Uses a Markov model of evolution

31 Markov model of evolution
Assumptions:
- Each amino acid site in a protein can change to any of the other amino acids with the probabilities given in the matrix
- Changes that occur at each site are independent of one another
- The probability of an amino acid change is the same regardless of its position in the protein
- The probability of mutation is history-independent

32 PAM matrix construction
1. Align sequences
2. Infer the ancestral sequence
3. Count the number of substitution events
4. Estimate the propensity of an amino acid to be replaced
5. Construct a mutation probability matrix
6. Construct a relatedness odds matrix
7. Construct a log odds matrix

33 PAM matrix construction
Aligned sequences are at least 85% similar:
- To reduce the possibility that observed changes are due to two substitution events rather than one
- To facilitate the reconstruction of phylogenetic trees and the inference of ancestral sequences

34 Ancestral sequences
- Organism A: A W T V A A A V R T S I
- Organism B: A Y T V A A A V R T S I
- Organism C: A W T V A A A V L T S I

(figure: tree relating A, B and C, with the substitutions W to Y and L to R marked on its branches; one of the possible relationships between organisms A, B and C)
- Assume that each position has undergone a maximum of one substitution (high similarity between sequences)
- Implies that the number of changes between sequences is proportional to the evolutionary distance between them

35 Frequency of substitution
- Count the number of changes of each amino acid into every other amino acid
- Compare each sequence with its ancestor, inferred from the phylogenetic trees
- In 71 groups of proteins, they counted 1,572 changes (Dayhoff, 1979)
- Of the 190 possible exchanges among all amino acids, 35 were never observed
- Most common exchange was Asp <-> Glu

36 Accepted point mutations (figure). Atlas of Protein Sequence and Structure. Dayhoff et al, 1978

37 Relative mutability
- Calculate the likelihood that an amino acid will be replaced
- Estimate relative mutability (relative number of changes)

Aligned sequences: A B C and A B A

| Amino acid | A | B | C |
|---|---|---|---|
| Observed changes | 1 | 0 | 1 |
| Frequency of occurrence | 3 | 2 | 1 |
| Relative mutability | 0.33 | 0 | 1 |

38 Scale mutabilities
- Mutabilities were scaled to the number of replacements per occurrence of the given amino acid per 100 residues in each alignment
  - Normalizes data for variations in amino acid composition, mutation rate, sequence length
- Count the total number of changes of each amino acid and divide by the total exposure to mutation
  - Exposure to mutation is the frequency of occurrence of each amino acid multiplied by the total number of amino acid changes per 100 sites, summed for all alignments.

Relative mutabilities (Ala has arbitrarily been set to 100): Ser 149, Met 122, Asn 111, Ile 110, Glu 102, Ala 100*, Gln 98, Asp 90, Thr 90, Gap 84, Val 80, Lys 57, Pro 56, His 50, Gly 48, Phe 45, Arg 44, Leu 38, Tyr 34, Cys 27, Trp 22

39 Mutation probability matrix
- Mutation probability matrix for one PAM
- Each element $M_{ij}$ of this matrix gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval
  - 1 PAM: 1 mutation/100 residues/10 MY
  - 1 PAM: 0.01 changes/aligned residue

Non-diagonal elements:

$$M_{ij} = \frac{\lambda\, m_j\, A_{ij}}{\sum_i A_{ij}}$$

Diagonal elements:

$$M_{jj} = 1 - \lambda\, m_j$$

where:
- $M_{ij}$ = mutation probability
- $m_j$ = relative mutability
- $A_{ij}$ = number of times i replaced by j (from matrix of accepted point mutations)

The proportionality constant ($\lambda$) is chosen such that there is an average of one mutation every hundred residues.

40 PAM1 mutation probability matrix (figure). Atlas of Protein Sequence and Structure. Dayhoff et al, 1978

41 PAM1, PAM10, PAM250
- To get different PAM matrices, the mutation probability matrix is multiplied by itself some integer number of times
  - PAM1 ^ 10 -> PAM10
  - PAM1 ^ 250 -> PAM250
- PAM10 applies to proteins that differ by 0.1 changes per aligned residue
  - 10 amino acid changes per 100 residues
- PAM250 reflects evolution of sequences with an average of 2.5 substitutions per position
  - 250% change in 2500 MY
  - Approximately 20% similarity
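Raising a one-step mutation probability matrix to a power can be illustrated with a toy two-state matrix. The values below are invented for the example and are not Dayhoff's data; columns hold the original residue and sum to 1, and `mat_pow` assumes an exponent of at least 1.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Multiply M by itself so that the result is M to the power n (n >= 1)."""
    result = [row[:] for row in M]
    for _ in range(n - 1):
        result = mat_mul(result, M)
    return result


# Toy one-step matrix (assumption): entry [i][j] is the probability that the
# residue in column j is replaced by the residue in row i in one step.
STEP = [
    [0.99, 0.02],
    [0.01, 0.98],
]

far = mat_pow(STEP, 250)
for row in far:
    print([round(v, 4) for v in row])
```

After many steps the columns become nearly identical: the chance of seeing a residue depends less and less on the starting residue, which is why high-numbered matrices suit distantly related sequences.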

42 PAM250 mutation probability matrix (figure). Atlas of Protein Sequence and Structure. Dayhoff et al, 1978

43 Relatedness odds matrix
- Calculate a relatedness odds matrix:
  - Divide each element of the mutation probability matrix by the frequency of occurrence of the replacing residue
- Gives the odds that the substitution represents a real evolutionary event as opposed to random sequence variation

Frequency of occurrence: Ala 0.096, Gly 0.090, Lys 0.085, Leu 0.085, Val 0.078, Thr 0.062, Ser 0.057, Asp 0.053, Glu 0.053, Phe 0.045, Asn 0.042, Pro 0.041, Ile 0.035, His 0.034, Arg 0.034, Gln 0.032, Tyr 0.030, Cys 0.025, Met 0.012, Trp 0.012

44 Log odds matrix
- Calculate the log odds matrix
  - Take the log of each element
- More convenient when using the matrix to score an alignment
  - Score an alignment by summing the individual log odds scores for each aligned residue pair, rather than multiplying the odds scores
- Assume that each aligned residue is statistically independent of the others
  - Biologically dubious but mathematically convenient
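The two steps (divide by the frequency of the replacing residue, then take the log) can be sketched on a toy two-state example. The matrix and frequencies are invented for illustration, and the scaling by 10 with a base-10 logarithm is an assumption of this sketch; the slide only says to take the log.

```python
import math


def relatedness_odds(M, freq):
    """Divide M[i][j] by the frequency of the replacing residue i (the row)."""
    n = len(M)
    return [[M[i][j] / freq[i] for j in range(n)] for i in range(n)]


def log_odds(R, scale=10.0):
    """Element-wise scaled log10 (scale and base are assumptions)."""
    return [[scale * math.log10(r) for r in row] for row in R]


# Toy data (assumption): column j is the original residue, row i the replacement.
M = [
    [0.7, 0.6],
    [0.3, 0.4],
]
FREQ = [0.5, 0.5]

R = relatedness_odds(M, FREQ)
S = log_odds(R)

# Summing log odds over aligned pairs equals the log of the product of odds.
total_log = S[0][0] + S[1][0]
product_then_log = 10.0 * math.log10(R[0][0] * R[1][0])
print(total_log, product_then_log)
```

Odds above 1 become positive scores and odds below 1 become negative scores, so an alignment's total can be read as evidence for or against relatedness.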

45 PAM250 log odds matrix (figure)

46 Problems with PAM
Assumptions:
- Replacement at a site depends only on the amino acid at that site and the probability given by the matrix
- Sequences that are being compared have average amino acid composition
- All amino acids mutate at the same rate

47 Problems with PAM contd.
Sources of error:
- Many sequences do not have average amino acid composition
- Rare replacements were observed too infrequently to resolve relative probabilities accurately
- Errors in PAM1 are magnified in the extrapolation to PAM250
- Replacement is not equally probable over the whole sequence

48 BLOSUM matrices (BLOCKS substitution matrix)
Henikoff and Henikoff, 1992
- SWISSPROT -> PROSITE -> BLOCKS -> BLOSUM
- Matrix values are based on the observed amino acid substitutions in a large set of conserved amino acid blocks ("signatures" of protein families)
- Frequently occurring highly conserved sequences are collapsed into one consensus sequence to avoid bias
- Patterns that show similarity above a threshold are grouped into a set
- Find the frequencies of each residue pair in the dataset
- 62% threshold = BLOSUM62 matrix

49 BLOSUM vs. PAM

| BLOSUM | PAM |
|---|---|
| Not based on an explicit evolutionary model | Markov-based model of evolution |
| Considers all changes in a family | Based on prediction of first changes in each family |
| Matrices to compare more distant sequences are calculated separately | Matrices to compare more distant sequences are derived |
| Based on substitutions of conserved blocks | Based on substitutions over the whole sequence |
| Designed to find conserved domains in proteins | Tracks evolutionary origins of proteins |

50 BLOSUM vs. PAM (figure). Frommlet et al, Comp Stat and Data Analysis, 2006

51 Z-score
Calculation of statistical significance:
1. Shuffle one of the sequences, keeping the same amino acid or nucleotide composition and local sequence features
2. Align the shuffled sequences to the first sequence
3. Generate scores for each of these alignments
   - Generate an extreme value distribution
4. Calculate the mean score and standard deviation of the distribution
5. Get the Z-score

Z = (RealScore - MeanScore) / StandardDeviation
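The shuffling procedure can be sketched as below. Several simplifications are assumptions of this example: the aligner is a plain Needleman-Wunsch score with +1/-1 match/mismatch and a gap penalty of 1, the shuffle preserves composition only (not local sequence features), and the number of shuffles and the seed are arbitrary.

```python
import random
import statistics


def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping a single previous row."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def z_score(x, y, n_shuffles=100, seed=0):
    """Z = (real score - mean shuffled score) / std of shuffled scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    real = global_score(x, y)
    letters = list(y)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(global_score(x, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (real - mean) / sd


print(z_score("ELVISLIVES", "ELVISLIVES", n_shuffles=200))
```

Because alignment scores follow an extreme value distribution rather than a normal one, a Z-score computed this way is a rough guide; it should not be converted to a probability with normal-distribution tables.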

52 Statistical significance: many other methods
- Many random sequences (with no resemblance to either of the original sequences) are generated and aligned with each other
- Random sequences of different lengths are generated; scores increase as the log of the length
- A sequence may be aligned with all the sequences in a database library; the related sequences are discarded and the remainder are used to calculate the statistics
- Variations: shuffle the sequence or shuffle the library


### Introduction to Computation & Pairwise Alignment

Introduction to Computation & Pairwise Alignment Eunok Paek eunokpaek@hanyang.ac.kr Algorithm what you already know about programming Pan-Fried Fish with Spicy Dipping Sauce This spicy fish dish is quick

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

### Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

### C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

### Motivating the need for optimal sequence alignments...

1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

### 1.5 Sequence alignment

1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

### Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis Kumud Joseph Kujur, Sumit Pal Singh, O.P. Vyas, Ruchir Bhatia, Varun Singh* Indian Institute of Information

### Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

### Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase Cwc27

Acta Cryst. (2014). D70, doi:10.1107/s1399004714021695 Supporting information Volume 70 (2014) Supporting information for article: Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase

### Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

### Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Cell communication channel Bioinformatics Methods Iosif Vaisman Email: ivaisman@gmu.edu SEQUENCE STRUCTURE DNA Sequence Protein Sequence Protein Structure Protein structure ATGAAATTTGGAAACTTCCTTCTCACTTATCAGCCACCT...

### EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ Administrative Term-project In a group of 2 students Think

### BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

### Optimization of a New Score Function for the Detection of Remote Homologs

PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

### Global alignments - review

Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply

### Administration. ndrew Torda April /04/2008 [ 1 ]

ndrew Torda April 2008 Administration 22/04/2008 [ 1 ] Sprache? zu verhandeln (Englisch, Hochdeutsch, Bayerisch) Selection of topics Proteins / DNA / RNA Two halves to course week 1-7 Prof Torda (larger

### Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell Mathematics and Biochemistry University of Wisconsin - Madison 0 There Are Many Kinds Of Proteins The word protein comes

### Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

### SEQUENCE alignment is an underlying application in the

194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

### Local Alignment Statistics

Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

### Bioinformatics and BLAST

Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

### Bioinformatics for Biologists

Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

### Empirical Analysis of Protein Insertions and Deletions Determining Parameters for the Correct Placement of Gaps in Protein Sequence Alignments

doi:10.1016/j.jmb.2004.05.045 J. Mol. Biol. (2004) 341, 617 631 Empirical Analysis of Protein Insertions and Deletions Determining Parameters for the Correct Placement of Gaps in Protein Sequence Alignments

### ... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

1 2 ... and searches for related sequences probably make up the vast bulk of bioinformatics activities. 3 The terms homology and similarity are often confused and used incorrectly. Homology is a quality.

### Bioinformatics Exercises

Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

### Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Christian Sigrist General Definition on Conserved Regions Conserved regions in proteins can be classified into 5 different groups: Domains: specific combination of secondary structures organized into a

### Similarity searching summary (2)

Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

### Viewing and Analyzing Proteins, Ligands and their Complexes 2

2 Viewing and Analyzing Proteins, Ligands and their Complexes 2 Overview Viewing the accessible surface Analyzing the properties of proteins containing thousands of atoms is best accomplished by representing

### 114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 9 Protein tertiary structure Sources for this chapter, which are all recommended reading: D.W. Mount. Bioinformatics: Sequences and Genome