Sequence Analysis '17 -- lecture 7
- Stephany Watson
1 Sequence Analysis '17 -- lecture 7 Significance E-values
2 How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a specific inference. Thanks.
3 Scoring alignments using substitution probabilities. For each aligned position (match), we get $P(A \to B)$, which is the substitution probability. Ignoring all but the first letters, the probability of these two sequences being homologs is $P(s_1[1] \to s_2[1])$, the substitution of $s_2[1]$ for $s_1[1]$ (here $s_2[1]$ means position 1 of sequence 2). Ignoring all but the first two letters, it is $P(s_1[1] \to s_2[1]) \times P(s_1[2] \to s_2[2])$. Counting all aligned positions: $\prod_i P(s_1[i] \to s_2[i])$. Each position is treated as a different, independent stochastic process.
4 Log space is more convenient. $\log \prod_i \left[ P(s_1[i] \to s_2[i]) / p(s_1[i]) \right] = \sum_i S(s_1[i] \to s_2[i])$, where $S(A \to B) = 2 \log_2( P(A \to B) / P(A) )$ = BLOSUM score. Sequence alignment score = $\sum_i S(s_1[i] \to s_2[i])$. This is the form of the substitution score: log-likelihood ratios (alias LLRs, log-odds, lods), usually 2 times $\log_2$ of the probability ratio (i.e. in half-bits).
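A minimal sketch of the half-bit scoring described above, assuming a toy two-letter alphabet. The probability tables `P_SUB` and `P_BG` are hypothetical values made up for illustration; they are not taken from BLOSUM or from the lecture.

```python
import math

# Hypothetical toy tables (NOT real BLOSUM data): substitution
# probabilities P(A -> B) and background probabilities P(A).
P_SUB = {("A", "A"): 0.8, ("A", "G"): 0.2,
         ("G", "G"): 0.8, ("G", "A"): 0.2}
P_BG = {"A": 0.5, "G": 0.5}


def half_bit_score(p_sub, p_background):
    """S = 2 * log2(P(A -> B) / P(A)), the form given on the slide."""
    return 2.0 * math.log2(p_sub / p_background)


def alignment_score(seq1, seq2, p_sub=P_SUB, p_bg=P_BG):
    """Sum the per-position log-odds scores of a gapless alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(half_bit_score(p_sub[(a, b)], p_bg[a])
               for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(alignment_score("AAGG", "AAGA"))
```

Summing logs replaces the product of probabilities, which is why the alignment score is additive over positions.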
5 Dayhoff's randomization experiment (Margaret Dayhoff). Aligned scrambled Protein A versus scrambled Protein B 100 times (re-scrambling each time). NOTE: scrambling does not change the AA composition! Results: a normal distribution. The significance of a score is measured as the probability of getting this score in a random alignment.
6 Why do we need a model for significance? Because random scores end here. And non-random scores start here. And we care about the difference significance here. So we want to be able to calculate the significance. Easy right? Just fit to a Gaussian (i.e. normal) distribution...?
7 Lipman's randomization experiment (David Lipman). Aligned Protein A to 100 natural sequences, not scrambled. Results: a wider normal distribution (std dev = ~3 times larger). WHY? Because natural sequences are different from random. Even unrelated sequences have similar local patterns and uneven amino acid composition. Was the significance over-estimated using Dayhoff's method? Lipman got a similar result if he randomized the sequences by words instead of letters.
8 Why was Lipman's distribution wider? (part 1) Low complexity. Complexity = sequence heterogeneity. A low-complexity sequence is homogeneous in its composition. For example, AAAAAAAAHAAAAAAAAKAAAAAEAA... is a "low-complexity" sequence. Align it to another A-rich sequence (AAFAAAAAAAKGAAAAAAAAAAAANA) and you get very high scores. Align it to a non-A-rich sequence (GAGGGGPGGGGGGGGGLGGGGGGGIGG) and you get very low scores.
9 Why was Lipman's distribution wider? (part 2) Local patterns (words). The three-letter sequence PGD occurs more often than expected by chance because PGD occurs in beta-turns. The amphipathic helix pattern pnnppnn (p=polar, n=non-polar) occurs more often than expected because it folds into an alpha helix. Even sequences that are non-homologous (Lipman's exp) have word matches. But scrambled sequences (Dayhoff's exp) do not have recurrent words.
10 Dayhoff & Lipman saw a normal distribution for random alignment scores. Should DP alignment scores fit a normal distribution? um... I guess we need a theoretical foundation...
11 Can we apply the theoretical foundation of coin tosses to sequence alignments? Alignment scores are like coin tosses. Each position either matches (heads) or mismatches (tails) (...actually each "coin" has 20 sides, but we'll deal with that later...) To establish the theoretical foundation, we first go to the simple case...
12 My Erdös number is 4: Paul Erdös > Alan J. Hoffman > Richard M. Karp > Edward C. Thayer > Christopher Bystroff
13 Theoretical distribution of coin tosses. HTHTHTTTHHHTHTTHHTHHHHHTH. The Erdos & Renyi equation is inherently local: it models scores for a maximal sub-alignment (local DP), not for a global alignment. Erdos & Renyi equation: $E(n) = \log_{1/p}(n)$, where E(n) is the expected length of the longest run of H in n tosses and p is P(H). Note! Score cannot go below zero.
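The coin-toss claim can be checked by simulation. The sketch below is only an illustration (the function names, trial count and seed are arbitrary choices): it measures the longest run of heads in simulated toss sequences and compares the average with $\log_{1/p}(n)$. The simulated mean is expected to differ from the formula by a roughly constant offset, which is what the "constant terms" on the following slides account for.

```python
import math
import random


def longest_run(tosses, symbol="H"):
    """Length of the longest unbroken run of `symbol`."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


def mean_longest_run(n, p, trials=500, seed=0):
    """Average longest run of heads over `trials` sequences of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tosses = ["H" if rng.random() < p else "T" for _ in range(n)]
        total += longest_run(tosses)
    return total / trials


def erdos_renyi_expectation(n, p):
    """E(n) = log base 1/p of n."""
    return math.log(n) / math.log(1.0 / p)


if __name__ == "__main__":
    print(longest_run("HTHTHTTTHHHTHTTHHTHHHHHTH"))
    print(mean_longest_run(1000, 0.5), erdos_renyi_expectation(1000, 0.5))
```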
14 Analogy: Heads = match, tails = mismatch. Similarly, we can define an expectation value, E(M), for the longest row of matches in an alignment of length n. E(M) is calculated the same way as for heads/tails, using the Erdos & Renyi equation (p is the probability of a match, 1-p is the probability of a mismatch): $E(M) = \log_{1/p}(n)$, the expectation given an alignment of length n. But over all possible alignments of two sequences of length n, the number is $E(M) = \log_{1/p}(n \cdot n)$. If the two sequences are length n and length m, it is $E(M) = \log_{1/p}(mn)$ [+ some constant terms that don't depend on m and n].
15 Heads/tails = match/mismatch, cont'd. Theoretically derived equation for the expectation value for M, the longest block of matches: $E(M) = \log_{1/p}(mn) - \log_{1/p}(1-p) \log_{1/p}(e) - 1/2$. Now we define a number K such that $\log_{1/p}(K)$ = constant terms: $E(M) = \log_{1/p}(Kmn)$. Now we define $\lambda = \ln(1/p)$, to convert to natural log: $E(M) = \ln(Kmn)/\lambda$.
16 Heads/tails = match/mismatch, cont'd. $E(M) = \log_{1/p}(mn) - \log_{1/p}(1-p) \log_{1/p}(e) - 1/2$; $E(M) = \ln(Kmn)/\lambda$. Solving, using p = 0.25, we get K = 0.1237 and λ = ln(4) = 1.386. With m = n = 100, $E(M) = \ln(Kmn)/\lambda$ would be the expected length of the longest run of matches in an alignment of two random DNA sequences of length 100 and 100. You can try this! See next page.
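As a quick arithmetic check, the expression $\ln(Kmn)/\lambda$ can be evaluated with the values quoted on the slide (K = 0.1237, p = 0.25, m = n = 100). The function name is an illustrative choice, not something from the lecture.

```python
import math


def expected_longest_match(k, m, n, p):
    """E(M) = ln(K*m*n) / lambda, with lambda = ln(1/p)."""
    lam = math.log(1.0 / p)
    return math.log(k * m * n) / lam


if __name__ == "__main__":
    # Values quoted on the slide for random DNA.
    print(expected_longest_match(0.1237, 100, 100, 0.25))
```

With these inputs the result comes out a little above 5 matches in a row.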
17 In-class exercise. Google: LALIGN server. Get a piece of DNA sequence from NCBI, or use "Long DNA" from the course website. Align two randomly selected 100bp DNA sequences using local DP alignment. Open the LALIGN server. Settings: Local. DNA. Gap opening penalty=0. Extending gap penalty=0. Count the longest string of matches (:). Write it down. Plot a histogram of scores. What is the expectation value?
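For readers without access to the server, the exercise can be approximated offline. The sketch below does not call LALIGN; as a stand-in it generates uniformly random DNA and finds the longest run of consecutive matches over all gapless offsets (the longest common substring). Sequence length, trial count and seed are arbitrary assumptions.

```python
import math
import random
from collections import Counter


def longest_match_run(a, b):
    """Longest run of consecutive matches over all gapless offsets."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run ending at the previous diagonal cell.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def random_dna(length, rng):
    """Uniformly random DNA string (p = 0.25 per base)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate(trials=200, length=100, seed=0):
    """Longest match run for `trials` pairs of random sequences."""
    rng = random.Random(seed)
    return [longest_match_run(random_dna(length, rng),
                              random_dna(length, rng))
            for _ in range(trials)]


if __name__ == "__main__":
    runs = simulate()
    histogram = Counter(runs)
    for run_length in sorted(histogram):
        print(f"{run_length:2d} {'#' * histogram[run_length]}")
    print("mean:", sum(runs) / len(runs))
    print("log_{1/p}(mn):", math.log(100 * 100) / math.log(4))
```

The text histogram plays the role of the plot asked for in the exercise; its peak can be compared with the expectation value from the previous slide.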
18 P(S > x). E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question: How good is the score x (i.e. how significant)? So, we need to model the whole distribution of chance scores, then ask how likely it is that my score or greater comes from that model. [Figure: distribution of chance scores; axes: frequency, score.]
19 Distribution -- Definitions. Mean = average value. Mode = most probable value, expected value. Extremes = minimum and maximum values. Standard deviation = a measure of spread that fully describes only one type of decay function, the normal distribution. For a variable whose distribution comes from extreme values, such as random sequence alignment scores, the score must be greater than expected from a normal distribution to achieve the same level of significance.
20 A Normal Distribution. Usually, we suppose the likelihood of deviating from the mean by x in the positive direction is the same as the likelihood of deviating by x in the negative direction, and the likelihood of deviating by x decreases as the exponential of $-x^2$. Why? Because multiplying probabilities gives this type of curve. This is called a Normal, or Gaussian, distribution. Scores of random alignments have a Normal Distribution.
21 An epiphany: Alignment scores from DP are not random, they are maximal. Get scores from lots of optimal alignments, each the best (optimal) score. Tally up the scores, you get... $y = \exp(-x - e^{-x})$. Alignment scores fit an extreme value distribution???? But why????
22 Empirical proof that the EVD fits optimal scores. Suppose you had a Gaussian distribution dart-board. You throw 1000 darts randomly. Score your darts according to the number on the X-axis where each lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)
23 Empirical proof that the EVD fits optimal scores. What if we throw 10 darts at a time and remove all but the highest-scoring dart? Do that 1000 times. What is the distribution of the scores?
25 Empirical proof that the EVD fits optimal scores. The EVD gets sharper as we increase the number of darts thrown
26 Empirical proof that the EVD fits optimal scores.
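The dart-board experiment is easy to reproduce numerically. This is a rough sketch, assuming a standard normal (mean 0, std dev 1) board; the number of rounds and the seed are arbitrary. It keeps only the best of n darts per round and reports how the mean and spread of the kept scores change with n.

```python
import random
import statistics


def best_of(n_darts, rng):
    """Throw n darts at a standard-normal board, keep the highest score."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n_darts))


def experiment(n_darts, rounds=1000, seed=0):
    """Repeat the best-of-n throw `rounds` times."""
    rng = random.Random(seed)
    return [best_of(n_darts, rng) for _ in range(rounds)]


if __name__ == "__main__":
    for n in (1, 10, 100):
        scores = experiment(n)
        print(n, round(statistics.mean(scores), 2),
              round(statistics.stdev(scores), 2))
```

With one dart per round the scores are simply Gaussian; with more darts the kept scores shift to the right and their spread narrows, which is the sharpening described on the slides.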
27 Can we simplify the extreme value distribution? The EVD with mode u and decay parameter λ: $y = \exp(-x - e^{-\lambda(x-u)})$. The mode, from the Erdos & Renyi equation: $u = \ln(Kmn)/\lambda$. Substituting gives: $P(x) = \exp(-x - e^{-\lambda(x - \ln(Kmn)/\lambda)})$. Integrating from x to infinity gives: $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
28 Voodoo mathematics. For values of x greater than 1, we can make this approximation: $1 - \exp[-e^{-x}] \approx e^{-x}$!
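How good the approximation is can be checked directly. The short sketch below (function names are illustrative) prints the exact tail $1 - \exp[-e^{-x}]$, the approximation $e^{-x}$, and the relative error for a few values of x.

```python
import math


def exact_tail(x):
    """1 - exp(-exp(-x)), computed with expm1 for numerical accuracy."""
    return -math.expm1(-math.exp(-x))


def approx_tail(x):
    """The simplified form exp(-x)."""
    return math.exp(-x)


def relative_error(x):
    """How much the approximation overestimates the exact tail."""
    return (approx_tail(x) - exact_tail(x)) / exact_tail(x)


if __name__ == "__main__":
    for x in (1, 2, 3, 5, 8):
        print(x, exact_tail(x), approx_tail(x), relative_error(x))
```

The error shrinks quickly as x grows, so the approximation is best in the far tail, which is where significant scores live.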
29 Can we simplify the extreme value distribution? The integral of the EVD is $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. Applying voodoo math, $1 - \exp[-Kmn\,e^{-\lambda x}] \approx Kmn\,e^{-\lambda x}$, we get $P(S \ge x) \approx Kmn\,e^{-\lambda x}$. Taking the log makes it linear: $\log P(S \ge x) \approx \log(Kmn) - \lambda x$. We may now fit the equation to get log(Kmn) and λ.
30 Fitting the linearized EVD: $\log P(S \ge x) \approx \log(Kmn) - \lambda x$. To determine K and λ, we plot $\log P(S \ge x)$ versus x. 1. Generate a large number of known false alignment scores S (all with the same query lengths m and n). 2. Calculate $P(S \ge x)$ for all x. 3. Plot $\log P(S \ge x)$ versus x. 4. Fit the data to a line (least squares). 5. The slope is −λ. 6. The intercept is log(Kmn). 7. Solve for K using average sequence lengths m, n. 8. Use K and λ to calculate the p-value = exp(log(Kmn) − λx). [Figure: scatter plot of log P(S ≥ x) versus x with a fitted line; the slope is −λ and the intercept is log(Kmn).]
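The fitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by any particular program: instead of real false alignments, the "scores" are drawn from an EVD with known K and λ (by inverting its cumulative distribution), and the line is fitted only over the tail (the `p_lo` and `p_hi` cutoffs are arbitrary choices), since the linear approximation only holds for small P.

```python
import math
import random


def sample_evd_scores(k, m, n, lam, count, rng):
    """Draw scores with tail P(S >= x) = 1 - exp(-k*m*n*exp(-lam*x))."""
    u = math.log(k * m * n) / lam  # mode of the EVD
    scores = []
    for _ in range(count):
        r = max(rng.random(), 1e-300)  # guard against log(0)
        scores.append(u - math.log(-math.log(r)) / lam)
    return scores


def fit_linearized_evd(scores, m, n, p_lo=0.002, p_hi=0.1):
    """Least-squares fit of log P(S >= x) = log(Kmn) - lam*x on the tail."""
    ordered = sorted(scores)
    total = len(ordered)
    xs, ys = [], []
    for i, x in enumerate(ordered):
        p = (total - i) / total  # empirical fraction of scores >= x
        if p_lo <= p <= p_hi:
            xs.append(x)
            ys.append(math.log(p))
    if len(xs) < 2:
        raise ValueError("not enough tail points to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    lam = -slope                      # the slope of the line is -lambda
    k = math.exp(intercept) / (m * n)  # the intercept is log(Kmn)
    return lam, k


if __name__ == "__main__":
    rng = random.Random(1)
    data = sample_evd_scores(0.1237, 100, 100, 1.386, 20000, rng)
    print(fit_linearized_evd(data, 100, 100))
```

Because the linear form is itself an approximation, the fitted λ is expected to come out close to, but slightly below, the value used to generate the data.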
31 λ is related to the average probability of a match (in the coin-toss model, λ = ln(1/p)). More rigorously, λ is the value of x that satisfies $\sum_{ij} p_i p_j e^{S_{ij} x} = 1$, summed over all amino acid pairs ij. $S_{ij}$ are from the substitution matrix; $p_i$ and $p_j$ are amino acid probabilities. Solving for λ: because x multiplies $S_{ij}$ inside the exponent, the sum does not factor, so λ is found numerically as the positive root of $\sum_{ij} p_i p_j e^{\lambda S_{ij}} = 1$. OK. We can find λ by fitting. We can calculate λ using the BLOSUM matrix. Which way is better?
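A minimal sketch of solving $\sum_{ij} p_i p_j e^{\lambda S_{ij}} = 1$ numerically by bisection. The scoring scheme used here (uniform DNA, match +1, mismatch −1) is a hypothetical stand-in for a real substitution matrix, chosen because its root can be checked by hand: it reduces to $z^2 - 4z + 3 = 0$ with $z = e^{\lambda}$, so λ = ln 3.

```python
import math


def pair_sum(lam, probs, score):
    """Sum over all letter pairs of p_i * p_j * exp(lam * S_ij)."""
    return sum(probs[a] * probs[b] * math.exp(lam * score(a, b))
               for a in probs for b in probs)


def solve_lambda(probs, score, tol=1e-12):
    """Positive root of pair_sum(lam) = 1, found by bisection."""
    lo = 1e-6
    if pair_sum(lo, probs, score) >= 1.0:
        raise ValueError("the expected score must be negative")
    hi = 1.0
    while pair_sum(hi, probs, score) < 1.0:
        hi *= 2.0
        if hi > 1e3:
            raise ValueError("no positive root; is any score positive?")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pair_sum(mid, probs, score) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    uniform_dna = {base: 0.25 for base in "ACGT"}
    match_mismatch = lambda a, b: 1.0 if a == b else -1.0
    print(solve_lambda(uniform_dna, match_mismatch), math.log(3))
```

The trivial root at λ = 0 is skipped by starting the search just above zero, where the sum is below 1 whenever the expected score is negative.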
32 E-values in BLAST. Every BLAST "hit" has a bit-score, x, derived from the substitution matrix. Parameters for the EVD have been previously calculated for m and n, the lengths of the database and query. Applying the EVD to x we get $P(S \ge x)$, which is our p-value. To get the e-value (the expected number of times this score will occur over the whole database) we multiply by the size of the database m: e-value(x) = $P(S \ge x) \cdot m$, where x is the alignment score, m is the size of the database, and P is calculated from false alignment scores.
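The two formulas on this slide can be written down directly. The sketch below follows the slide's definitions as stated (p-value from the integrated EVD, e-value as the p-value times m); it is not a reproduction of how the BLAST program itself computes its statistics, and the parameter values in the example are the random-DNA numbers from the earlier slide, used purely for illustration.

```python
import math


def p_value(x, k, m, n, lam):
    """P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x))."""
    return -math.expm1(-k * m * n * math.exp(-lam * x))


def e_value(x, k, m, n, lam):
    """e-value(x) = P(S >= x) * m, as defined on the slide."""
    return p_value(x, k, m, n, lam) * m


if __name__ == "__main__":
    for score in (5, 8, 11, 14):
        print(score,
              p_value(score, 0.1237, 100, 100, 1.386),
              e_value(score, 0.1237, 100, 100, 1.386))
```

At the mode of the distribution, x = ln(Kmn)/λ, the p-value is 1 − 1/e, and it falls off roughly exponentially for larger scores.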
33 Thought experiments -- local DP. In local alignment we take a MAX over zero (0) and three other scores (diagonal, across, down). Matrix bias is a single number added to all match scores, so we can control the average match score. What happens if match scores are...? All negative?: Best alignment is always no alignment. All positive?: Best alignment is gapless, global-local. Average positive?: Best alignment is local (longer). Typical random alignment is local. Average negative?: Best alignment is local (shorter). Typical random alignment is no alignment.
34 Thought experiments -- local DP. This leads to Altschul's Principle: for local DP alignment, the match (substitution) scores should be > zero for a match and < zero for a mismatch, on average! (Some mismatches may have a > 0 score.) Are they? Look at PAM250 and BLOSUM62.
35 Matrix bias* to control alignment length. Matrix bias = constant added to each match score. If we add a constant to each value in the substitution matrix, it favors matches over gaps. As we increase matrix bias... longer alignments are more common in random sets; longer alignments are less significant. [Figure panels: negative matrix bias, no matrix bias, positive matrix bias.] *aka: bonus
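The thought experiments can be tried with a small local DP scorer. This is a simplified sketch, assuming a flat match/mismatch scheme and a linear gap penalty rather than a full substitution matrix with affine gaps; the `bias` argument is added to every substitution score, as in the slide's definition of matrix bias. All default values are arbitrary.

```python
def local_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0, bias=0.0):
    """Best local alignment score: MAX over zero, diagonal, across, down."""
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Matrix bias is a constant added to every substitution score.
            sub = (match if a[i - 1] == b[j - 1] else mismatch) + bias
            cur[j] = max(0.0,                # start a new alignment
                         prev[j - 1] + sub,  # diagonal
                         prev[j] + gap,      # down
                         cur[j - 1] + gap)   # across
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    for bias in (-1.0, 0.0, 1.0):
        print(bias, local_alignment_score("ACGTTGCA", "ACGATGCA", bias=bias))
```

With all-negative scores the zero in the MAX always wins, so the best alignment is no alignment; raising the bias lets alignments run through mismatches and grow longer.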
36 A brief history of significance. Significance of a score is measured by the probability of getting that score by chance. History of modeling chance in alignments: 1970s Dayhoff: Gaussian fit to scrambled alignments. 1980s Lipman: Gaussian fit to false alignments. 1990s Altschul: EVD fit to false alignments.
37 How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a specific inference. Thanks.
38 How significant is that? 1. Produce a set of null-model data (such as random or false alignment scores). 2. Define a theoretical function to fit this null-model data. 3. Fit the null-model data. 4. For each experimental data point, integrate the null-model data to get the p-value of the experimental data point. 5. If the experimental data point is the result of n "tries", multiply by n to get the e-value.
39 Summary of significance. The expected value for the maximum length of a match between two sequences, lengths n and m, given the probability of a match p, has a theoretical solution, which is $\log_{1/p}(nm)$, the Erdos & Renyi equation. The score of a DP alignment is the maximum score, which is roughly proportional to the length (for local alignments only!). Therefore, the expected value of alignment scores follows the same theoretical distribution.
40 Review. What does the Extreme Value Distribution model? How are the parameters (λ, K) of the EVD determined? Once λ and K have been found, how do we calculate the e-value? What is the meaning of the e-value? For a given score x and a given database, which method gives the highest estimate and which gives the lowest estimate of the significance of x? --Dayhoff's method of scrambled sequences. --Lipman's method of false alignments, using a Normal distribution. --False alignments using the EVD.
41 Exercise 6 -- turn in Sep 28. You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none. The BLAST search gave you false hits with the top e-values ranging from 0.4 to 100. You look at them and you notice a pattern in the e-values. What do you see? How many of your hits have e-value ≤ 10? Write a number on a piece of paper. Explain your reasoning. Sign it. Turn in Thursday.
bioinformatics 1 -- lecture 7
bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment
More informationLocal Alignment Statistics
Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationL3: Blast: Keyword match basics
L3: Blast: Keyword match basics Fa05 CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with exactly the same number of hairs! Assignment 1 is online Due 10/6
More informationSequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationBLAST: Basic Local Alignment Search Tool
.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].
More informationExercise 5. Sequence Profiles & BLAST
Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)
More informationQuantifying sequence similarity
Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity
More informationBLAST: Target frequencies and information content Dannie Durand
Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationIn-Depth Assessment of Local Sequence Alignment
2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.
More informationSequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.
Sequence Analysis, '18 -- lecture 9 Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. How can I represent thousands of homolog sequences in a compact
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More informationLearning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling
Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence
More informationLocal Alignment: Smith-Waterman algorithm
Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.
More informationPairwise sequence alignment
Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
More informationBiochemistry 324 Bioinformatics. Pairwise sequence alignment
Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene
More informationHeuristic Alignment and Searching
3/28/2012 Types of alignments Global Alignment Each letter of each sequence is aligned to a letter or a gap (e.g., Needleman-Wunsch). Local Alignment An optimal pair of subsequences is taken from the two
More informationScoring Matrices. Shifra Ben-Dor Irit Orr
Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More informationSequence Comparison. mouse human
Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity
More information1.5 Sequence alignment
1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)
More informationBINF 730. DNA Sequence Alignment Why?
BINF 730 Lecture 2 Seuence Alignment DNA Seuence Alignment Why? Recognition sites might be common restriction enzyme start seuence stop seuence other regulatory seuences Homology evolutionary common progenitor
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationGrundlagen der Bioinformatik, SS 08, D. Huson, May 2,
Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7
More informationBioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre
Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationSteve Smith Tuition: Maths Notes
Maths Notes : Discrete Random Variables Version. Steve Smith Tuition: Maths Notes e iπ + = 0 a + b = c z n+ = z n + c V E + F = Discrete Random Variables Contents Intro The Distribution of Probabilities
More informationSequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013
Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation
More informationMoreover, the circular logic
Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT
More informationWays to make neural networks generalize better
Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationSequence Analysis '17- lecture 8. Multiple sequence alignment
Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database
More informationHidden Markov Models
Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm
More informationLarge-Scale Genomic Surveys
Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction
More informationWhat is a random variable
OKAN UNIVERSITY FACULTY OF ENGINEERING AND ARCHITECTURE MATH 256 Probability and Random Processes 04 Random Variables Fall 20 Yrd. Doç. Dr. Didem Kivanc Tureli didemk@ieee.org didem.kivanc@okan.edu.tr
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationSubstitution matrices
Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM
More informationTHEORY. Based on sequence Length According to the length of sequence being compared it is of following two types
Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
More informationLecture 1: Probability Fundamentals
Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability
More informationChapter 7: Rapid alignment methods: FASTA and BLAST
Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem Search strategies FASTA BLAST Introduction to bioinformatics, Autumn 2007 117 BLAST: Basic Local Alignment Search Tool BLAST (Altschul
More informationMath 180B Problem Set 3
Math 180B Problem Set 3 Problem 1. (Exercise 3.1.2) Solution. By the definition of conditional probabilities we have Pr{X 2 = 1, X 3 = 1 X 1 = 0} = Pr{X 3 = 1 X 2 = 1, X 1 = 0} Pr{X 2 = 1 X 1 = 0} = P
More informationForeword by. Stephen Altschul. An Essential Guide to the Basic Local Alignment Search Tool BLAST. Ian Korf, Mark Yandell & Joseph Bedell
An Essential Guide to the Basic Local Alignment Search Tool Foreword by Stephen Altschul BLAST Ian Korf, Mark Yandell & Joseph Bedell BLAST Other resources from O Reilly oreilly.com oreilly.com is more
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search
More informationScoring Matrices. Shifra Ben Dor Irit Orr
Scoring Matrices Shifra Ben Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More informationPair Hidden Markov Models
Pair Hidden Markov Models Scribe: Rishi Bedi Lecturer: Serafim Batzoglou January 29, 2015 1 Recap of HMMs alphabet: Σ = {b 1,...b M } set of states: Q = {1,..., K} transition probabilities: A = [a ij ]
More informationMAT Mathematics in Today's World
MAT 1000 Mathematics in Today's World Last Time We discussed the four rules that govern probabilities: 1. Probabilities are numbers between 0 and 1 2. The probability an event does not occur is 1 minus
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block
More informationCS 361: Probability & Statistics
October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationMACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION
MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION THOMAS MAILUND Machine learning means different things to different people, and there is no general agreed upon core set of algorithms that must be
More informationHMMs and biological sequence analysis
HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the
More information(NRH: Sections 2.6, 2.7, 2.11, 2.12 (at this point in the course the sections will be difficult to follow))
Curriculum, second lecture: Niels Richard Hansen November 23, 2011 NRH: Handout pages 1-13 PD: Pages 55-75 (NRH: Sections 2.6, 2.7, 2.11, 2.12 (at this point in the course the sections will be difficult
More informationAn Introduction to Sequence Similarity ( Homology ) Searching
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
More informationPractical considerations of working with sequencing data
Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!