BINF 730. DNA Sequence Alignment Why?

Similar documents
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Pairwise & Multiple sequence alignments

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Algorithms in Bioinformatics

Sequence analysis and comparison

Quantifying sequence similarity

Pairwise sequence alignment

Computational Biology

Advanced topics in bioinformatics

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Practical considerations of working with sequencing data

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

In-Depth Assessment of Local Sequence Alignment

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Sequence Analysis '17 -- lecture 7

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Single alignment: Substitution Matrix. 16 march 2017

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Sequence Comparison. mouse human

Local Alignment: Smith-Waterman algorithm

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Scoring Matrices. Shifra Ben-Dor Irit Orr

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

BIOINFORMATICS: An Introduction

Week 10: Homology Modelling (II) - HHpred

bioinformatics 1 -- lecture 7

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

Practical Bioinformatics

Substitution matrices

Large-Scale Genomic Surveys

Hidden Markov Models

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Hidden Markov Models

Sequence analysis and Genomics

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Graph Alignment and Biological Networks

Bio nformatics. Lecture 3. Saad Mneimneh

CSE 549: Computational Biology. Substitution Matrices

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Local Alignment Statistics

Motivating the need for optimal sequence alignments...

Pairwise sequence alignments

Sequence Alignment (chapter 6)

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Evolutionary Analysis of Viral Genomes

EECS730: Introduction to Bioinformatics

Pairwise sequence alignment and pair hidden Markov models

Copyright 2000 N. AYDIN. All rights reserved. 1

Pairwise Sequence Alignment

Ch. 9 Multiple Sequence Alignment (MSA)

Substitution Matrices

Network alignment and querying

Moreover, the circular logic

Tools and Algorithms in Bioinformatics

Sequence Database Search Techniques I: Blast and PatternHunter tools

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Gibbs Sampling Methods for Multiple Sequence Alignment

Introduction to Computation & Pairwise Alignment

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Bioinformatics for Biologists

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Bioinformatics Exercises

Pattern Matching (Exact Matching) Overview

Phylogenetic inference

HMMs and biological sequence analysis

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Today s Lecture: HMMs

PROTEIN SYNTHESIS INTRO

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Lecture Notes: Markov chains

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty.

An Introduction to Sequence Similarity ( Homology ) Searching

Introduction to Bioinformatics

Sequences, Structures, and Gene Regulatory Networks

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Introduction to Bioinformatics Online Course: IBT

Transcription:

BINF 730 Lecture 2 Seuence Alignment DNA Seuence Alignment Why? Recognition sites might be common restriction enzyme start seuence stop seuence other regulatory seuences Homology evolutionary common progenitor 1

Mutations -Deletions -Insertions -Transitional Substitution (purine-purine A-G, pyr-pyr T-C) -Translational Substitution (pur-pyr, pyr-pur) Example Start with ACGTACGT after 9540 generations with the following probabilities: Deletion 0.0001 Insertion 0.001 Transitional substitution 0.00008 Translational substitution 0.00002 2

Example The two new seuences are: - - ACG T -A - - - CG -T - - - - ACACGGTCCTAATAATGGCC --- AC -GTA-C--G-T-- CAG - GAAGATCTTAGTTC Example However if we align the two seuences by superposition - ACAC- GGTCCTAAT--AATGGCC CAG- GAA- G- AT- - CTTAGTTC- - 3

Example or using Gotoh s algorithm with mismatch penalty 3 and gap penalty function g(k) = 2+2k for length k gap ACACG - - GTCCTAATAATGGCC - CAGGAAGATCT - - TAGTT - - C The alignment depends on algorithm used! Protein seuence alignment A. Homologous proteins i. Evolutionary common origin ii. Structural similarity iii. Functional similarity B. Conserved regions i. Functional domains ii. Evolutionary similarity iii. Structural motif 4

Example 3.2 Choosing the best alignment Every alignment has a score Chose alignment with highest score Must choose appropriate scoring function Scoring function based on evolutionary model with insertion deletion and substitutions Use substitution score matrix contains an entry for every amino acid pair 5

Substitution score matrix A D K S s S,A s S,D s S,K R s R,A s R,D s R,K K s K,A s K,D s K,K Scoring Strategies Ad hoc method a biologist can set up a score matrix that gives a good alignment Use physical/chemical properties Statistical approach 6

Statistical approach - Let s and s be two amino acid seuences of length n that we want to compute an alignment score - Assume only substitutions occur (no insertions or deletions) - Works for local alignment - Odds Ratio and Log Odds Ratio Odds Ratio and Log Odds Ratio The score for aligning s and s is based on the comparison of the hypothesis that the two seuences are generated randomly with the hypothesis that they come from a common ancestor. Assume A is the probability of producing amino acid A in model R (based on the relative freuency at which A is found in proteins). The probability for the null hypothesis (that s and s do not stem from a common ancestor) is P(s R) = i s = i s 7

Odds Ratio and Log Odds Ratio The second hypothesis (homologous hypothesis) that s and s arise from a common ancestor seuence r, of length ns based on the evolutionary model (E). The probability that the amino acids A and B are aligned and hence have been derived from an ancestor amino acid C is given by p A,B is given by P(s E) = p i s How this probability is determined will be explained later. Odds Ratio and Log Odds Ratio The odds ratio compares the homologous hypothesis with the null hypothesis P( s E) = P( s R) p i s i s = p i s 1 i n i s To achieve a scoring function that is additive rather that multiplicative, the log odds ratio can be used pab s A, B = log A B 8

Point Accepted Mutation (PAM) and Amino Acid Pair Probabilities We mentioned that we must choose an appropriate evolutionary model E((p AB ) AB ) for the homologous hypothesi ie we have to find p AB for each pair of amino acids A and B. Since we are using a statistical approach, this has to be estimated from data. If we know that two seuences s and s are homologou we could estimate p AB by finding the value of p AB that would maximize P(E((p AB ) AB ) s ) PAM and Amino Acid Pair Probabilities This can be done by using the maximum likelihood approach (section 2.1.6 pp 52-53, Clote and Backofen) Lagrange Multipliers (Section 2.2 Clote and Backofen) Appendix (Chapter 3 Clote and Backofen) 9

PAM and Amino Acid Pair Probabilities We now have which is the relative freuency of a pair (A,B) in the alignment of s and s where n AB (s ) is the number of times the amino acids A and B are aligned in one column in the alignment of s and s and n is the length of s and s. To find a value for n AB, some homologous seuences are needed. To do this Dayhoff and coworkers used local seuence alignment. PAM and Amino Acid Pair Probabilities Problem They used seuence alignment to find a substitution matrix (substitution score matrix) for seuence alignment which comes first, the chicken or the egg? Answer Use only very closely related seuence (seuences differ in at most 15% of the amino acid. Caveat The substitution matrix is only valid for closely related protein seuences 10