Sequence Analysis '17 -- lecture 7


1 Sequence Analysis '17 -- lecture 7 Significance E-values

2 How significant is that? Please give me a number for... how likely the data would not have been the result of chance... as opposed to... a specific inference. Thanks.

3 Scoring alignments using substitution probabilities For each aligned position (match), we get $P(A \to B)$, which is the substitution probability. Ignoring all but the first letters, the probability of these two sequences being homologs is $P(s_1[1] \to s_2[1])$, the substitution of $s_2[1]$ for $s_1[1]$, where $s_2[1]$ means position 1 of sequence 2. Ignoring all but the first two letters, it is $P(s_1[1] \to s_2[1]) \times P(s_1[2] \to s_2[2])$. Counting all aligned positions: $\prod_i P(s_1[i] \to s_2[i])$. Each position is treated as a different, independent stochastic process.

4 Log space is more convenient $\log \prod_i \frac{P(s_1[i] \to s_2[i])}{P(s_1[i])} = \sum_i S(s_1[i] \to s_2[i])$, where $S(A \to B) = 2\log_2\left(\frac{P(A \to B)}{P(A)}\right)$ = BLOSUM score. Sequence alignment score = $\sum_i S(s_1[i] \to s_2[i])$. This is the form of the substitution score: log-likelihood ratios (alias LLRs, log-odds, lods), usually 2 times log base 2 of the probability ratio (or "half-bits").
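As a rough illustration of the half-bit scale, the sketch below converts a probability ratio into a rounded score and sums the scores over aligned positions. The ratio values are invented for the example; they are not taken from a BLOSUM matrix, and the function names are hypothetical.

```python
import math

def half_bit_score(ratio):
    """Log-odds score in half-bits: 2 * log2(ratio), rounded to an integer."""
    return round(2 * math.log2(ratio))

def alignment_score(ratios):
    """Sum of per-position scores, i.e. the log of the product of the ratios."""
    return sum(half_bit_score(r) for r in ratios)

# Hypothetical probability ratios for three aligned positions (illustrative only).
ratios = [4.0, 1.0, 0.5]
print([half_bit_score(r) for r in ratios])  # [4, 0, -2]
print(alignment_score(ratios))              # 2
```

A ratio above 1 (substitution seen more often than chance) gives a positive score, a ratio below 1 gives a negative one.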

5 Dayhoff's randomization experiment Margaret Dayhoff Aligned scrambled Protein A versus scrambled Protein B 100 times (re-scrambling each time). NOTE: scrambling does not change the AA composition! Results: A Normal Distribution significance of a score is measured as the probability of getting this score in a random alignment
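The randomization experiment can be sketched as follows. This is a minimal illustration, not Dayhoff's actual procedure: the score used here is a toy one (the most identities on any ungapped diagonal) standing in for a DP alignment score, and the two sequences are made up for the example.

```python
import random
import statistics

def scramble(seq, rng):
    """Shuffle the letters of seq; the amino acid composition is unchanged."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def best_diagonal_identities(a, b):
    """Toy score: largest number of identical letters on any ungapped diagonal."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i in range(len(a))
            if 0 <= i - offset < len(b) and a[i] == b[i - offset]
        )
        best = max(best, matches)
    return best

def scrambled_scores(a, b, trials=100, seed=0):
    """Score `trials` pairs of sequences, re-scrambling both each time."""
    rng = random.Random(seed)
    return [
        best_diagonal_identities(scramble(a, rng), scramble(b, rng))
        for _ in range(trials)
    ]

if __name__ == "__main__":
    # Hypothetical sequences, chosen only for illustration.
    a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    b = "MKLAYIAKQRGISFVKAHFSRQAEERLGLIEVQ"
    null = scrambled_scores(a, b)
    mean, sd = statistics.mean(null), statistics.stdev(null)
    real = best_diagonal_identities(a, b)
    print(f"real score = {real}, scrambled mean = {mean:.2f}, sd = {sd:.2f}")
    if sd > 0:
        print(f"z-score under a normal fit = {(real - mean) / sd:.1f}")
```

The z-score at the end corresponds to the normal-distribution reading of the scrambled scores described on this slide.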

6 Why do we need a model for significance? Because random scores end here, and non-random scores start here, and we care about the significance of the difference between them. So we want to be able to calculate the significance. Easy, right? Just fit to a Gaussian (i.e. normal) distribution...?

7 Lipman's randomization experiment (David Lipman) Aligned Protein A to 100 natural sequences, not scrambled. Results: a wider normal distribution (std dev = ~3 times larger). WHY? Because natural sequences are different from random ones. Even unrelated sequences have similar local patterns and uneven amino acid composition. Was the significance over-estimated using Dayhoff's method? Lipman got a similar result when he randomized the sequences by words instead of letters.

8 Why was Lipman's distribution wider? (part 1) Low complexity. Complexity = sequence heterogeneity. A low-complexity sequence is homogeneous in its composition. For example: AAAAAAAAHAAAAAAAAKAAAAAEAA... is a "low-complexity" sequence. Align it to another A-rich sequence, and you get very high scores: AAFAAAAAAAKGAAAAAAAAAAAANA. Align it to a non-A-rich sequence, and you get very low scores: GAGGGGPGGGGGGGGGLGGGGGGGIGG.

9 Why was Lipman's distribution wider? (part 2) Local patterns (words). The three-letter sequence PGD occurs more often than expected by chance because PGD occurs in beta-turns. The amphipathic helix pattern pnnppnn (p=polar, n=non-polar) occurs more often than expected because it folds into an alpha helix. Even sequences that are non-homologous (Lipman's experiment) have word matches. But scrambled sequences (Dayhoff's experiment) do not have recurrent words.

10 Dayhoff & Lipman saw a normal distribution for random alignment scores. Should DP alignment scores fit a normal distribution? um... I guess we need a theoretical foundation...

11 Can we apply the theoretical foundation of coin tosses to sequence alignments? Alignment scores are like coin tosses. Each position either matches (heads) or mismatches (tails) (...actually each "coin" has 20 sides, but we'll deal with that later...) To establish the theoretical foundation, we first go to the simple case...

12 My Erdős number is 4: Paul Erdős > Alan J. Hoffman > Richard M. Karp > Edward C. Thayer > Christopher Bystroff

13 Theoretical distribution of coin tosses. HTHTHTTTHHHTHTTHHTHHHHHTH The Erdős & Rényi equation gives the expected length of the longest sequence of H in n tosses: $E(n) = \log_{1/p}(n)$, where p is P(H). The Erdős & Rényi equation is inherently local: it models scores for a maximal sub-alignment (local DP), not for a global alignment. Note! Score cannot go below zero.
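A quick simulation can be used to compare the leading-order equation $E(n) = \log_{1/p}(n)$ with the observed longest run of heads. This is only a sketch; the function names are hypothetical, and the simulated mean is expected to differ from the formula by a constant of order one.

```python
import math
import random

def longest_run(tosses, symbol="H"):
    """Length of the longest unbroken run of `symbol` in the toss sequence."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t == symbol else 0
        best = max(best, current)
    return best

def erdos_renyi_expectation(n, p):
    """E(n) = log base 1/p of n."""
    return math.log(n) / math.log(1 / p)

def simulate_mean_longest_run(n, p, trials=2000, seed=0):
    """Average longest run of heads over many simulated sequences of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tosses = ["H" if rng.random() < p else "T" for _ in range(n)]
        total += longest_run(tosses)
    return total / trials

if __name__ == "__main__":
    print(longest_run("HTHTHTTTHHHTHTTHHTHHHHHTH"))  # 5
    n, p = 1000, 0.5
    print(f"formula:   {erdos_renyi_expectation(n, p):.2f}")
    print(f"simulated: {simulate_mean_longest_run(n, p):.2f}")
```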

14 Analogy: Heads = match, tails = mismatch Similarly, we can define an expectation value, E(M), for the longest row of matches M in an alignment of length n. E(M) is calculated the same way as for heads/tails, using the Erdős & Rényi equation (p is the probability of a match, 1-p is the probability of a mismatch): $E(M) = \log_{1/p}(n)$, the expectation given an alignment of length n. But over all possible alignments of two sequences of length n, the number is $E(M) = \log_{1/p}(n \cdot n)$. If the two sequences are length n and length m, it is $E(M) = \log_{1/p}(mn)$ [+ some constant terms that don't depend on m and n].

15 Heads/tails = match/mismatch, cont'd Theoretically derived equation for the expectation value for M, the longest block of matches: $E(M) = \log_{1/p}(mn) - \log_{1/p}(1-p) + \gamma\log_{1/p}(e) - 1/2$, where $\gamma \approx 0.577$ is Euler's constant. Now we define a number K such that $\log_{1/p}(K)$ = constant terms: $E(M) = \log_{1/p}(Kmn)$. Now we define $\lambda = \ln(1/p)$, to convert to natural log: $E(M) = \ln(Kmn)/\lambda$.

16 Heads/tails = match/mismatch, cont'd $E(M) = \log_{1/p}(mn) - \log_{1/p}(1-p) + \gamma\log_{1/p}(e) - 1/2$ (with $\gamma \approx 0.577$, Euler's constant), so $E(M) = \ln(Kmn)/\lambda$. Solving, using p=0.25, we get K=0.1237 and λ = ln(4) = 1.386. With m=n=100, $E(M) = \ln(Kmn)/\lambda$ would be the most likely number of matches in a row in an alignment of two random DNA sequences of length 100 and 100. You can try this! See next page.

17 In-class exercise Google "LALIGN server". Get a piece of DNA sequence from NCBI, or use "Long DNA" from the course website. Align two randomly selected 100 bp DNA sequences using local DP alignment. Open the LALIGN server. Settings: Local. DNA. Gap opening penalty=0. Extending gap penalty=0. Count the longest string of matches (:). Write it down. Plot a histogram of scores. What is the expectation value?
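The exercise can also be approximated offline. The sketch below is a simplified stand-in for the LALIGN run: it draws uniformly random DNA (an assumption; real DNA is not uniform) and measures the longest exact run of matches shared by two sequences, without gaps. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def longest_match_run(a, b):
    """Length of the longest run of consecutive matches shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended on the previous diagonal cell.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def random_dna(n, rng):
    """Uniformly random DNA of length n (p = 0.25 for a match)."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def run_length_histogram(trials=200, n=100, seed=0):
    """Histogram of longest match runs over many random sequence pairs."""
    rng = random.Random(seed)
    return Counter(
        longest_match_run(random_dna(n, rng), random_dna(n, rng))
        for _ in range(trials)
    )

if __name__ == "__main__":
    hist = run_length_histogram()
    for length in sorted(hist):
        print(f"{length:2d} {'#' * hist[length]}")
    print(f"log_4(mn) for m = n = 100: {math.log(100 * 100) / math.log(4):.2f}")
```

The printed histogram can be compared with the leading term $\log_{1/p}(mn)$; the constant terms shift the peak slightly.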

18 P(S > x) E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question: how good (i.e. how significant) is the score x? So we need to model the whole distribution of chance scores, then ask how likely it is that my score or greater comes from that model. [Figure: frequency versus score]

19 Distribution -- Definitions Mean = average value (the expected value). Mode = most probable value. Extremes = minimum and maximum values. Standard deviation = a measure of spread that works for one type of decay function, the normal distribution. For a variable whose distribution comes from extreme values, such as random sequence alignment scores, the score must be greater than expected from a normal distribution to achieve the same level of significance.

20 A Normal Distribution Usually, we suppose the likelihood of deviating from the mean by x in the positive direction is the same as the likelihood of deviating by x in the negative direction, and the likelihood of deviating by x decreases as $e^{-x^2}$. Why? Because multiplying probabilities gives this type of curve. This is called a Normal, or Gaussian, distribution. Scores of random alignments have a Normal distribution.

21 An epiphany: Alignment scores from DP are not random, they are maximal. Get scores from lots of optimal alignments (the best score of each). Tally up the scores, and you get... $y = \exp(-x - e^{-x})$ Alignment scores fit an extreme value distribution???? But why????

22 Empirical proof that the EVD fits optimal scores. Suppose you had a Gaussian distribution dart-board. You throw 1000 darts randomly. Score each dart according to the number on the X-axis where it lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)

23 Empirical proof that the EVD fits optimal scores. What if we throw 10 darts at a time and remove all but the highest-scoring dart? Do that 1000 times. What is the distribution of the scores?


25 Empirical proof that the EVD fits optimal scores. The EVD gets sharper as we increase the number of darts thrown
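The dart experiment is easy to simulate. The sketch below assumes a standard normal dart-board (mean 0, standard deviation 1) and keeps only the best of N darts per round; the function name and the choice of N values are illustrative.

```python
import random
import statistics

def best_of_n_darts(n_darts, rounds=1000, seed=0):
    """Throw n_darts Gaussian darts per round and keep only the highest score."""
    rng = random.Random(seed)
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(n_darts))
        for _ in range(rounds)
    ]

if __name__ == "__main__":
    for n_darts in (1, 10, 100):
        scores = best_of_n_darts(n_darts)
        print(
            f"{n_darts:3d} darts: mean of best = {statistics.mean(scores):.2f}, "
            f"sd = {statistics.stdev(scores):.2f}"
        )
```

With one dart per round the scores are simply Gaussian. As the number of darts grows, the distribution of the best score shifts to the right and becomes narrower, with a longer tail on the high side.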

26 Empirical proof that the EVD fits optimal scores.

27 Can we simplify the extreme value distribution? The EVD with mode u and decay parameter λ (up to the normalizing factor λ): $y = \exp(-\lambda(x-u) - e^{-\lambda(x-u)})$ The mode, from the Erdős & Rényi equation: $u = \ln(Kmn)/\lambda$ Substituting gives: $P(x) = \exp(-\lambda(x - \ln(Kmn)/\lambda) - e^{-\lambda(x - \ln(Kmn)/\lambda)})$ Integrating from x to infinity gives: $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$

28 Voodoo mathematics For values of x greater than 1, we can make this approximation: $1 - \exp[-e^{-x}] \approx e^{-x}$!

29 Can we simplify the extreme value distribution? The integral of the EVD is $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. Applying voodoo math, $1 - \exp[-Kmn\,e^{-\lambda x}] \approx Kmn\,e^{-\lambda x}$, we get $P(S \ge x) \approx Kmn\,e^{-\lambda x}$. Taking the log makes it linear: $\log P(S \ge x) \approx \log(Kmn) - \lambda x$. We may now fit the equation to get log(Kmn) and λ.
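To see where the approximation holds, the sketch below evaluates both forms side by side. The parameter values are the coin-toss numbers quoted on slide 16 (K=0.1237, λ=1.386, m=n=100) and serve only as an illustration; the function names are hypothetical.

```python
import math

def p_value_exact(x, K, m, n, lam):
    """P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x)), computed stably with expm1."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def p_value_approx(x, K, m, n, lam):
    """Simplified form: P(S >= x) ~ K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * x)

if __name__ == "__main__":
    K, lam, m, n = 0.1237, 1.386, 100, 100
    for x in (2, 5, 8, 10, 12):
        exact = p_value_exact(x, K, m, n, lam)
        approx = p_value_approx(x, K, m, n, lam)
        print(f"x = {x:2d}  exact = {exact:.3e}  approx = {approx:.3e}")
```

For small x the approximate form exceeds 1, so it is not a probability there; it is only useful in the high-scoring tail, which is the region of interest for significance.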

30 Fitting the linearized EVD $\log P(S \ge x) \approx \log(Kmn) - \lambda x$ To determine K and λ, we plot $\log P(S \ge x)$ versus x:
1. Generate a large number of known false alignment scores S (all with the same query lengths m and n).
2. Calculate $P(S \ge x)$ for all x.
3. Plot $\log P(S \ge x)$ versus x.
4. Fit the data to a line (least squares).
5. The slope is $-\lambda$.
6. The intercept is log(Kmn).
7. Solve for K using the average sequence lengths m, n.
8. Use K and λ to calculate the p-value = $\exp(\log(Kmn) - \lambda x)$.
[Figure: plot of log P(S ≥ x) versus x; the slope gives λ and the intercept is log(Kmn).]
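The fitting steps can be sketched on synthetic data. In this example the "false alignment scores" are not real alignments: they are drawn directly from an EVD with known parameters (the slide-16 values, used only for illustration) so that the recovered λ can be checked. Restricting the fit to the tail, via the hypothetical `p_min` and `p_max` cutoffs, is an assumption made because the linear form only holds where P(S ≥ x) is small.

```python
import math
import random

def sample_evd(u, lam, rng):
    """Draw one score from an EVD with mode u and decay lam (inverse-CDF method)."""
    r = max(rng.random(), 1e-300)  # guard against log(0)
    return u - math.log(-math.log(r)) / lam

def fit_linearized_evd(scores, m, n, p_min=0.005, p_max=0.1):
    """Least-squares fit of log P(S >= x) = log(Kmn) - lam * x on the tail."""
    ranked = sorted(scores, reverse=True)
    total = len(ranked)
    xs, ys = [], []
    for rank, x in enumerate(ranked, start=1):
        p = rank / total  # empirical P(S >= x)
        if p_min <= p <= p_max:
            xs.append(x)
            ys.append(math.log(p))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    lam = -slope                       # the slope of the line is -lambda
    K = math.exp(intercept) / (m * n)  # the intercept is log(Kmn)
    return lam, K

def simulate_and_fit(true_lam=1.386, true_K=0.1237, m=100, n=100,
                     count=20000, seed=0):
    """Generate synthetic null scores and recover (lam, K) from them."""
    rng = random.Random(seed)
    u = math.log(true_K * m * n) / true_lam
    scores = [sample_evd(u, true_lam, rng) for _ in range(count)]
    return fit_linearized_evd(scores, m, n)

if __name__ == "__main__":
    lam, K = simulate_and_fit()
    print(f"fitted lambda = {lam:.3f}, fitted K = {K:.4f}")
```

The estimate of K is much noisier than that of λ, because a small error in the slope is magnified when the line is extrapolated back to the intercept.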

31 λ is tied to the probability of a match (in the coin-toss model, $\lambda = \ln(1/p)$). More rigorously, λ is the positive value of x that satisfies $\sum_{ij} p_i p_j e^{S_{ij} x} = 1$, summed over all amino acid pairs ij. $S_{ij}$ are from the substitution matrix. $p_i$ and $p_j$ are amino acid probabilities. Solving for λ: because $e^{S_{ij}\lambda}$ does not factor into $e^{S_{ij}} e^{\lambda}$, the equation $\sum_{ij} p_i p_j e^{S_{ij}\lambda} = 1$ has no general closed-form solution, so λ is found numerically. OK. We can find λ by fitting. We can calculate λ using the BLOSUM matrix. Which way is better?
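The condition $\sum_{ij} p_i p_j e^{S_{ij}\lambda} = 1$ can be solved numerically for its positive root. The sketch below uses bisection on a deliberately simple, hypothetical scoring scheme (uniform DNA, match = +1, mismatch = -1) instead of a real BLOSUM matrix; the function name and arguments are illustrative.

```python
import math

def solve_lambda(freqs, score, tol=1e-10):
    """Positive root of sum_ij p_i p_j exp(lam * S_ij) = 1, found by bisection."""
    letters = list(freqs)

    def f(lam):
        return sum(
            freqs[a] * freqs[b] * math.exp(lam * score(a, b))
            for a in letters
            for b in letters
        ) - 1.0

    lo = 1e-6
    if f(lo) >= 0:
        raise ValueError("the expected score must be negative")
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
        if hi > 1e3:
            raise ValueError("at least one score must be positive")
    # f is negative just above zero and positive beyond the root.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    dna = {base: 0.25 for base in "ACGT"}
    lam = solve_lambda(dna, lambda a, b: 1 if a == b else -1)
    print(f"lambda = {lam:.4f}")  # about 1.0986, which is ln(3) for this scheme
```

For this toy scheme the equation reduces to $0.25e^{\lambda} + 0.75e^{-\lambda} = 1$, whose positive solution is $\ln 3$, which gives a handy check on the solver.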

32 e-values in BLAST Every BLAST "hit" has a bit-score, x, derived from the substitution matrix. Parameters for the EVD have been previously calculated for m and n, the lengths of the database and query. Applying the EVD to x we get $P(S \ge x)$, which is our p-value. To get the e-value (the expected number of times this score will occur over the whole database) we multiply by the size of the database m: e-value(x) = $P(S \ge x) \times m$, where x is the alignment score, m is the size of the database, and P is calculated from false alignment scores.
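The e-value formula on this slide can be written out directly. This is a sketch of the slide's formula only, not of how BLAST itself computes e-values; the parameter values and the `database_size` argument are hypothetical.

```python
import math

def p_value(x, K, m, n, lam):
    """P(S >= x) from the EVD: 1 - exp(-K*m*n*exp(-lam*x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def e_value(x, K, m, n, lam, database_size):
    """Expected number of chance hits scoring >= x: p-value times database size."""
    return p_value(x, K, m, n, lam) * database_size

if __name__ == "__main__":
    # Illustrative parameters only (the coin-toss values from slide 16).
    K, lam, m, n = 0.1237, 1.386, 100, 100
    for x in (8, 10, 12, 14):
        print(f"x = {x:2d}  e-value = {e_value(x, K, m, n, lam, 1000):.3g}")
```

Unlike a p-value, an e-value can exceed 1, because it counts expected chance hits, not a probability.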

33 Thought experiments -- local DP In local alignment we take a MAX over zero (0) and three other scores (diagonal, across, down). Matrix bias is a single number added to all match scores, so we can control the average match score. What happens if match scores are...?
all negative? : Best alignment is always no alignment.
all positive? : Best alignment is gapless, global-local.
average positive? : Best alignment is local (longer). Typical random alignment is local.
average negative? : Best alignment is local (shorter). Typical random alignment is no alignment.
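The thought experiments can be tried with a small local DP scorer. The sketch below is a minimal Smith-Waterman-style score with a linear gap penalty; the default match, mismatch and gap values are arbitrary choices for illustration, and `bias` plays the role of the matrix bias added to every substitution score.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2, bias=0):
    """Best local alignment score: MAX over zero, diagonal, across and down."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Substitution score for this pair, shifted by the matrix bias.
            s = (match if a[i - 1] == b[j - 1] else mismatch) + bias
            curr[j] = max(
                0,                # start a new alignment here
                prev[j - 1] + s,  # diagonal: align a[i-1] with b[j-1]
                prev[j] + gap,    # down: gap in b
                curr[j - 1] + gap # across: gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

if __name__ == "__main__":
    a, b = "ACGTTGCA", "TGCAACGT"
    for bias in (-2, 0, 2):
        print(f"bias = {bias:+d}  best local score = "
              f"{local_alignment_score(a, b, bias=bias)}")
```

With all scores negative the best score stays at zero (no alignment); raising the bias makes every extension more attractive, so the best alignment grows longer.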

34 Thought experiments -- local DP This leads to Altschul's Principle: for local DP alignment, the match (substitution) scores should be > zero for a match, and < zero for a mismatch, on average! (Some mismatches may have a > 0 score.) Are they? Look at PAM250 and BLOSUM62.

35 Matrix bias* to control alignment length Matrix bias = constant added to each match score. If we add a constant to each value in the substitution matrix, it favors matches over gaps. As we increase matrix bias, longer alignments are more common in random sets, and longer alignments are less significant. [Figure panels: negative matrix bias, no matrix bias, positive matrix bias] *aka: bonus

36 A brief history of significance Significance of a score is measured by the probability of getting that score by chance. History of modeling chance in alignments: 1970s Dayhoff: Gaussian fit to scrambled alignments. 1980s Lipman: Gaussian fit to false alignments. 1990s Altschul: EVD fit to false alignments.

37 How significant is that? Please give me a number for... how likely the data would not have been the result of chance... as opposed to... a specific inference. Thanks.

38 How significant is that?
1. Produce a set of null-model data (such as random or false alignment scores).
2. Define a theoretical function to fit this null-model data.
3. Fit the null-model data.
4. For each experimental data point, integrate the null-model data to get the p-value of the experimental data point.
5. If the experimental data point is the result of n "tries", multiply by n to get the e-value.

39 Summary of significance The expected value for the maximum length of a match between two sequences of lengths n and m, given the probability of a match p, has a theoretical solution, which is $\log_{1/p}(nm)$, the Erdős & Rényi equation. The score of a DP alignment is the maximum score, which is roughly proportional to the length (for local alignments only!). Therefore, the expected value of alignment scores follows the same theoretical distribution.

40 Review What does the Extreme Value Distribution model? How are the parameters (λ, K) of the EVD determined? Once λ and K have been found, how do we calculate the e-value? What is the meaning of the e-value? For a given score x and a given database, which method gives the highest estimate and which gives the lowest estimate of the significance of x? --Dayhoff's method of scrambled sequences. --Lipman's method of false alignments, using a Normal distribution. --False alignments using the EVD.

41 Exercise 6 -- turn in Sep 28 You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none. The BLAST search gave you false hits with the top e-values ranging from 0.4 to 100. You look at them and you notice a pattern in the e-values. What do you see? How many of your hits have e-value ≤ 10? Write a number on a piece of paper. Explain your reasoning. Sign it. Turn in Thursday.


More information

Solving Recurrence Relations 1. Guess and Math Induction Example: Find the solution for a n = 2a n 1 + 1, a 0 = 0 We can try finding each a n : a 0 =

Solving Recurrence Relations 1. Guess and Math Induction Example: Find the solution for a n = 2a n 1 + 1, a 0 = 0 We can try finding each a n : a 0 = Solving Recurrence Relations 1. Guess and Math Induction Example: Find the solution for a n = 2a n 1 + 1, a 0 = 0 We can try finding each a n : a 0 = 0 a 1 = 2 0 + 1 = 1 a 2 = 2 1 + 1 = 3 a 3 = 2 3 + 1

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Theoretical Cryptography, Lecture 10

Theoretical Cryptography, Lecture 10 Theoretical Cryptography, Lecture 0 Instructor: Manuel Blum Scribe: Ryan Williams Feb 20, 2006 Introduction Today we will look at: The String Equality problem, revisited What does a random permutation

More information

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

MATH 118 FINAL EXAM STUDY GUIDE

MATH 118 FINAL EXAM STUDY GUIDE MATH 118 FINAL EXAM STUDY GUIDE Recommendations: 1. Take the Final Practice Exam and take note of questions 2. Use this study guide as you take the tests and cross off what you know well 3. Take the Practice

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Lecture 6: The Pigeonhole Principle and Probability Spaces

Lecture 6: The Pigeonhole Principle and Probability Spaces Lecture 6: The Pigeonhole Principle and Probability Spaces Anup Rao January 17, 2018 We discuss the pigeonhole principle and probability spaces. Pigeonhole Principle The pigeonhole principle is an extremely

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

SEQUENCE alignment is an underlying application in the

SEQUENCE alignment is an underlying application in the 194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

More information

Similarity searching summary (2)

Similarity searching summary (2) Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel ) Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

More information

Probability Rules. MATH 130, Elements of Statistics I. J. Robert Buchanan. Fall Department of Mathematics

Probability Rules. MATH 130, Elements of Statistics I. J. Robert Buchanan. Fall Department of Mathematics Probability Rules MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2018 Introduction Probability is a measure of the likelihood of the occurrence of a certain behavior

More information

besides your solutions of these problems. 1 1 We note, however, that there will be many factors in the admission decision

besides your solutions of these problems. 1 1 We note, however, that there will be many factors in the admission decision The PRIMES 2015 Math Problem Set Dear PRIMES applicant! This is the PRIMES 2015 Math Problem Set. Please send us your solutions as part of your PRIMES application by December 1, 2015. For complete rules,

More information

Sequence-specific sequence comparison using pairwise statistical significance

Sequence-specific sequence comparison using pairwise statistical significance Graduate Theses and Dissertations Graduate College 2009 Sequence-specific sequence comparison using pairwise statistical significance Ankit Agrawal Iowa State University Follow this and additional works

More information

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Introduction to Bioinformatics Algorithms Homework 3 Solution

Introduction to Bioinformatics Algorithms Homework 3 Solution Introduction to Bioinformatics Algorithms Homework 3 Solution Saad Mneimneh Computer Science Hunter College of CUNY Problem 1: Concave penalty function We have seen in class the following recurrence for

More information

Introduction to Computation & Pairwise Alignment

Introduction to Computation & Pairwise Alignment Introduction to Computation & Pairwise Alignment Eunok Paek eunokpaek@hanyang.ac.kr Algorithm what you already know about programming Pan-Fried Fish with Spicy Dipping Sauce This spicy fish dish is quick

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

CS 124 Math Review Section January 29, 2018

CS 124 Math Review Section January 29, 2018 CS 124 Math Review Section CS 124 is more math intensive than most of the introductory courses in the department. You re going to need to be able to do two things: 1. Perform some clever calculations to

More information

The Beginning of Graph Theory. Theory and Applications of Complex Networks. Eulerian paths. Graph Theory. Class Three. College of the Atlantic

The Beginning of Graph Theory. Theory and Applications of Complex Networks. Eulerian paths. Graph Theory. Class Three. College of the Atlantic Theory and Applications of Complex Networs 1 Theory and Applications of Complex Networs 2 Theory and Applications of Complex Networs Class Three The Beginning of Graph Theory Leonhard Euler wonders, can

More information

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM? Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

LEARNING WITH BAYESIAN NETWORKS

LEARNING WITH BAYESIAN NETWORKS LEARNING WITH BAYESIAN NETWORKS Author: David Heckerman Presented by: Dilan Kiley Adapted from slides by: Yan Zhang - 2006, Jeremy Gould 2013, Chip Galusha -2014 Jeremy Gould 2013Chip Galus May 6th, 2016

More information

Pairwise sequence alignments

Pairwise sequence alignments Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

More information

Last few slides from last time

Last few slides from last time Last few slides from last time Example 3: What is the probability that p will fall in a certain range, given p? Flip a coin 50 times. If the coin is fair (p=0.5), what is the probability of getting an

More information