Search sequence databases 3 10/25/2016

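Extreme value distribution
- Suppose $X$ is a random variable with probability density function $p(x)$. We sample a large number $S$ of independent values of $X$ from this distribution, and we repeat this an infinite number of times.
- For each sample of size $S$, we record the largest value, $x_{\max}$, so we have a new random variable taking these values. Let's denote it by $X_{\max}$.
- The probability that a value of $X$ is smaller than a given value $x$ is given by the cumulative probability function,
  $$G(x) = P(X < x) = \int_{-\infty}^{x} p(t)\,dt.$$
- Let $F(x_{\max}) = P(X_{\max} = x_{\max})$, i.e., the probability (density) that the maximum of the $S$ values is equal to $x_{\max}$. Then we have,
  $$F(x_{\max}) = S\,p(x_{\max})\,G(x_{\max})^{S-1}.$$
[Figure: an infinite number of samples of $S$ values of $X$; each sample yields one maximum $x_{1,\max}, x_{2,\max}, \ldots, x_{i,\max}, \ldots$, and these maxima are the values of $X_{\max}$.]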

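Extreme value distribution
- Deriving the explicit form of $F(x_{\max})$ in the general case is difficult, but it is rather easy if we assume that $X$ follows an exponential distribution:
  $$p(x) = \lambda e^{-\lambda x}, \qquad G(x) = P(X < x) = \int_0^x p(t)\,dt = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}.$$
- Substituting into the formula for the maximum,
  $$F(x_{\max}) = S\,p(x_{\max})\,G(x_{\max})^{S-1} = S\lambda e^{-\lambda x_{\max}}\left(1 - e^{-\lambda x_{\max}}\right)^{S-1} \approx S\lambda e^{-\lambda x_{\max}}\, e^{-(S-1)e^{-\lambda x_{\max}}},$$
  since $(1-a)^n \approx e^{-na}$ for small $a$. Because $S \gg 1$, $S - 1 \approx S$, and therefore
  $$F(x_{\max}) \approx \lambda S e^{-\lambda x_{\max}}\, e^{-S e^{-\lambda x_{\max}}}.$$
- Let $u = \ln S/\lambda$, then $S = e^{\lambda u}$, therefore,
  $$F(x_{\max}) = \lambda S e^{-\lambda x_{\max}}\, e^{-S e^{-\lambda x_{\max}}} = \lambda e^{-\lambda(x_{\max}-u)}\, e^{-e^{-\lambda(x_{\max}-u)}}.$$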
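The approximation in the derivation above can be checked numerically. The following is a minimal sketch, assuming illustrative values of $S$ and $\lambda$ that are not taken from the slides; the function names are hypothetical. It compares the exact density of the maximum of $S$ exponential values with the EVD form.

```python
import math

def exact_max_density(x, S, lam):
    """Exact density of the maximum of S independent Exp(lam) values:
    S * p(x) * G(x)^(S-1)."""
    return S * lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * x)) ** (S - 1)

def gumbel_density(x, u, lam):
    """EVD (Gumbel) density with location u and rate lam."""
    z = lam * (x - u)
    return lam * math.exp(-z) * math.exp(-math.exp(-z))

# Illustrative values (assumptions, not from the slides).
S, lam = 2000, 0.5
u = math.log(S) / lam

for x in (u - 2, u, u + 2, u + 5):
    print(f"x={x:6.2f}  exact={exact_max_density(x, S, lam):.5f}  "
          f"evd={gumbel_density(x, u, lam):.5f}")
```

For $S$ in the thousands the two columns should agree closely, which is what the condition $S \gg 1$ in the derivation suggests.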

Extreme value distribution
- A distribution with the density function
  $$F(x_{\max}) = \lambda e^{-\lambda(x_{\max}-u)}\, e^{-e^{-\lambda(x_{\max}-u)}}$$
  is called an extreme value distribution (EVD) or a Gumbel distribution.
- It arises when we consider the maximum values of many independent samples of the same size taken from any distribution.
- Although we derived this formula from the exponential distribution, it is a good approximation for many other distributions of the random variable $X$.
- The distribution has two parameters, $u$ and $\lambda$, and its density has a peak at $X_{\max} = u$.
- The width is controlled by $\lambda$: the width scales as $1/\lambda$, so the larger the value of $\lambda$, the narrower the peak.
[Figure: plots of $F(x_{\max})$ against $x_{\max}$.]

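Extreme value distribution
- When the sample size $S$ changes, $u$ changes, since $u = \ln S/\lambda$. A change in $u$ shifts the distribution curve horizontally without changing the shape of the distribution.
- If we change the sample size from $S_1$ to $S_2$, the peak of the distribution moves from $u_1 = \ln S_1/\lambda$ to $u_2 = \ln S_2/\lambda$.
- The distance moved is given by,
  $$u_2 - u_1 = \frac{\ln S_2 - \ln S_1}{\lambda} = \frac{\ln(S_2/S_1)}{\lambda}.$$
- The probability that $X_{\max}$ takes a value greater than or equal to an observed value $x_{obs}$ can be computed by,
  $$P(X_{\max} \ge x_{obs}) = \int_{x_{obs}}^{\infty} F(x_{\max})\,dx_{\max} = \int_{x_{obs}}^{\infty} \lambda e^{-\lambda(x_{\max}-u)}\, e^{-e^{-\lambda(x_{\max}-u)}}\,dx_{\max} = 1 - e^{-e^{-\lambda(x_{obs}-u)}}.$$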
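The closed form of the tail integral above can be sanity-checked by numerical integration. This is a rough sketch using a plain trapezoid rule; the parameter values and the function names are assumptions chosen only for illustration.

```python
import math

def gumbel_density(x, u, lam):
    """EVD (Gumbel) density with location u and rate lam."""
    z = lam * (x - u)
    return lam * math.exp(-z) * math.exp(-math.exp(-z))

def tail_closed_form(x_obs, u, lam):
    """P(X_max >= x_obs) = 1 - exp(-exp(-lam * (x_obs - u)))."""
    return 1.0 - math.exp(-math.exp(-lam * (x_obs - u)))

def tail_numeric(x_obs, u, lam, span=60.0, steps=20_000):
    """Trapezoid-rule integral of the density from x_obs up to a point
    far enough in the tail that the remainder is negligible."""
    hi = x_obs + span / lam
    h = (hi - x_obs) / steps
    total = 0.5 * (gumbel_density(x_obs, u, lam) + gumbel_density(hi, u, lam))
    for i in range(1, steps):
        total += gumbel_density(x_obs + i * h, u, lam)
    return total * h

# Illustrative parameters (assumptions, not from the slides).
u, lam, x_obs = 10.0, 0.5, 12.0
print(tail_closed_form(x_obs, u, lam), tail_numeric(x_obs, u, lam))
```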

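An idealistic database search scenario
- Let's consider a database search algorithm that returns the sequence in the database with the highest number of matches to the query sequence, i.e., we use the number of matches $m$ between the two sequences to score the alignment. Let it be the random variable $M$.
- In order to know the significance of a returned sequence with a score $m_{\max}$, which is also a random variable, denoted by $M_{\max}$, we need to know the distribution of $M_{\max}$, denoted by $F(m_{\max})$.
- Let's first look at a computer simulation result: search a database of 2,000 random sequences of length 200 bases with another, different set of 2,000 random sequences of the same length.
$$
\begin{array}{c|cccccc|c}
 & s_1 & s_2 & \cdots & s_j & \cdots & s_{2000} & \text{best score} \\
\hline
q_1 & m_{1,1} & m_{1,2} & \cdots & m_{1,j} & \cdots & m_{1,2000} & m_{1,\max} \\
q_2 & m_{2,1} & m_{2,2} & \cdots & m_{2,j} & \cdots & m_{2,2000} & m_{2,\max} \\
\vdots & & & & & & & \vdots \\
q_i & m_{i,1} & m_{i,2} & \cdots & m_{i,j} & \cdots & m_{i,2000} & m_{i,\max} \\
\vdots & & & & & & & \vdots \\
q_{2000} & m_{2000,1} & m_{2000,2} & \cdots & m_{2000,j} & \cdots & m_{2000,2000} & m_{2000,\max}
\end{array}
$$
The entries $m_{i,j}$ are values of the random variable $M$; the row maxima $m_{i,\max}$ are values of the random variable $M_{\max}$.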

An idealistic database search scenario
- Suppose that the sequences are made only of Cs and Gs with the same frequency, i.e., C = G = 50%.
- Clearly, the distribution of the score $m_{i,j} \in M$, i.e., the score with which a query sequence $q_i$ aligns to a sequence $s_j$ in the database, follows a binomial distribution with $N = 200$ and $a = 0.5$.
- However, the distribution of the score $m_{i,\max} \in M_{\max}$, i.e., the best score returned when the database is queried with sequence $q_i$, follows an EVD.
[Figure: histograms of the simulated values of $M$ and $M_{\max}$.]

An idealistic database search scenario
- Specifically, we sample 2,000 values of $M$ 2,000 times, and for each sample of 2,000 $m$ values we obtain an $m_{\max}$.
- Using the formula of the EVD that we derived above, we have,
  $$F(m_{\max}) = \lambda e^{-\lambda(m_{\max}-u)}\, e^{-e^{-\lambda(m_{\max}-u)}}.$$
- Fitting the simulation data to this formula, we have $\lambda = 0.497$ and $u = 123.2$.
- Given a query sequence, if the best hit returned from a database has a match score $m_{obs}$, the statistical significance of this hit can be evaluated by the following probability, which is the p-value under the null hypothesis that the query sequence has no relationship with the sequences in the database:
  $$p\text{-value}(m_{obs}) = P(M_{\max} \ge m_{obs}) = 1 - e^{-e^{-\lambda(m_{obs}-u)}}.$$
- The smaller the p-value, the more significant the hit.
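The simulation described above can be sketched at reduced scale. The slides do not say how the EVD was fitted; the sketch below swaps in a simple method-of-moments estimate (standard deviation $\pi/(\lambda\sqrt{6})$, mean $u + \gamma/\lambda$ for a Gumbel distribution), so its numbers will not reproduce the fitted $\lambda$ and $u$ exactly. The number of queries, the seed, and all function names are assumptions.

```python
import math
import random
import statistics

def match_count(length, rng):
    """Number of matching positions between two random C/G sequences.

    With C = G = 50%, each position matches with probability 0.5, so the
    count equals the number of 1 bits among `length` random bits.
    """
    return bin(rng.getrandbits(length)).count("1")

def simulate_best_scores(n_queries, n_db, length, seed=0):
    """For each query, keep the best score over n_db database sequences."""
    rng = random.Random(seed)
    return [max(match_count(length, rng) for _ in range(n_db))
            for _ in range(n_queries)]

def fit_evd_moments(values):
    """Method-of-moments Gumbel estimate (an assumed fitting method)."""
    sd = statistics.stdev(values)
    lam = math.pi / (sd * math.sqrt(6.0))
    u = statistics.mean(values) - 0.5772156649 / lam  # Euler-Mascheroni constant
    return lam, u

# 200 queries instead of 2,000 to keep the run short (an assumption).
best = simulate_best_scores(n_queries=200, n_db=2000, length=200)
lam, u = fit_evd_moments(best)
print(f"lambda ~ {lam:.3f}, u ~ {u:.1f}")
```

The estimates should land in the neighbourhood of the values quoted above, with differences expected from the smaller sample and the different fitting method.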

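An idealistic database search scenario
- Suppose we have a query sequence of 200 bases, and we use it to search a database of 2,000 sequences. If the best hit returned has 130 matches to the query sequence, then the p-value is,
  $$p\text{-value}(m_{obs}) = P(M_{\max} \ge m_{obs}) = 1 - e^{-e^{-\lambda(m_{obs}-u)}} = 1 - e^{-e^{-0.497(130-123.2)}} = 0.033.$$
[Figure: distributions of $M$ and $M_{\max}$, with the p-value shown as the tail area of $M_{\max}$ beyond $m_{obs}$.]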
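The worked p-value above can be reproduced in a few lines. This is a small sketch using the fitted $\lambda = 0.497$ and $u = 123.2$ from the simulation; the function name is hypothetical.

```python
import math

def evd_p_value(m_obs, u, lam):
    """P(M_max >= m_obs) under the EVD with location u and rate lam."""
    return 1.0 - math.exp(-math.exp(-lam * (m_obs - u)))

p = evd_p_value(m_obs=130, u=123.2, lam=0.497)
print(round(p, 3))  # 0.033
```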

Distribution of the length of matching k-mers in two sequences
- To develop a statistical method for k-mer based database search algorithms such as FASTA and BLAST, we need to consider the distribution of the scores of k-mer matches, rather than the number of matches.
- Let's consider a very simple pairwise local alignment algorithm that finds the longest exactly matching k-mer in two sequences of lengths $N$ and $M$.
- For simplicity, the score of the alignment is the length of the matching k-mer, $k$.
  Sequence 1, N=19: GGATATCCAGCGCTCCTCT
  Sequence 2, M=14: ATCCGATATCTTGG
- Suppose that we align many pairs of unrelated sequences. Then the length of the longest match between two sequences, $L$, is a random variable.
- Clearly, $L$ should follow an EVD: $F(l) = P(L = l) \sim \text{EVD}$.
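The "very simple" alignment algorithm described above can be sketched as a longest-common-substring search. The slides do not specify an implementation; the dynamic-programming version below is one common way to do it, and the function name is an assumption.

```python
def longest_exact_match(a, b):
    """Return (length, substring) of the longest exactly matching k-mer.

    cur[j] holds the length of the exact match ending at a[i-1] and b[j-1].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return best_len, a[best_end - best_len:best_end]

seq1 = "GGATATCCAGCGCTCCTCT"  # N = 19
seq2 = "ATCCGATATCTTGG"       # M = 14
print(longest_exact_match(seq1, seq2))  # (6, 'GATATC')
```

For the two example sequences the score of this alignment is therefore $k = 6$.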

Distribution of the length of matching k-mers in two sequences
- However, the length of exactly matching k-mers between two random sequences follows an exponential distribution, which can be derived as follows.
- As discussed earlier, given two unrelated sequences, the probability that the two sequences have a match at a position is,
  $$a = \pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2.$$
- Let $K$ be the random variable of the length of k-mers found in two random sequences. The probability that two random sequences have at least $k$ consecutive matches is,
  $$P(K \ge k) = P\big((\text{match OR mismatch}) \text{ AND } k \text{ matches AND } (\text{match OR mismatch})\big) = P(\text{match OR mismatch})\,P(k \text{ matches})\,P(\text{match OR mismatch}) = a^k.$$
  Let $\lambda = -\ln a$, then $a = e^{-\lambda}$ and $P(K \ge k) = e^{-\lambda k}$.
- The probability that the two sequences have fewer than $k$ consecutive matches is,
  $$G(k) = P(K < k) = 1 - P(K \ge k) = 1 - e^{-\lambda k}.$$
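As a small illustration of the quantities just defined, the sketch below computes the per-position match probability $a$, the rate $\lambda = -\ln a$, and the tail $P(K \ge k) = a^k = e^{-\lambda k}$ for the C/G composition used in the simulations. The function names are hypothetical.

```python
import math

def match_probability(freqs):
    """a = sum of squared base frequencies (probability of a match
    at one position between two unrelated sequences)."""
    return sum(p * p for p in freqs.values())

def tail_probability(k, lam):
    """P(K >= k) = exp(-lam * k)."""
    return math.exp(-lam * k)

cg_only = {"C": 0.5, "G": 0.5}   # composition used in the simulations
a = match_probability(cg_only)
lam = -math.log(a)
print(a, lam, tail_probability(3, lam), a ** 3)
```

For uniform frequencies over all four bases the same function gives $a = 0.25$, so long exact matches become correspondingly rarer.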

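Distribution of the length of matching k-mers in two sequences
- If we treat $K$ as a continuous variable, then the probability density of $K$ is,
  $$p(k) = \frac{dG(k)}{dk} = \frac{d\left(1 - e^{-\lambda k}\right)}{dk} = \lambda e^{-\lambda k}.$$
  Therefore, $K$ follows an exponential distribution.
- The length of the longest k-mer match between two sequences, $L$, follows an EVD,
  $$F(l) = P(L = l) = \lambda e^{-\lambda(l-u)}\, e^{-e^{-\lambda(l-u)}}.$$
[Figure: all pairwise alignments in random sequence space give all k-mer matches, whose lengths follow $p(k) = \lambda e^{-\lambda k}$ (length space of matching k-mers); finding the longest match in each alignment gives the length space of the longest matching k-mers, which follows $F(l) = \lambda e^{-\lambda(l-u)}\, e^{-e^{-\lambda(l-u)}}$.]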

Distribution of the length of matching k-mers in two sequences
- Here, $u$ is related to the number of k-mer alignments that can be generated between two sequences, i.e., the size of the sampling, $S$, since we define $u = \ln S/\lambda$, assuming that $S$ is a constant.
  Sequence 1, N=19: GGATATCCAGCGCTCCTCT
  Sequence 2, M=14: ATCCGATATCTTGG
- In reality, $S$ is clearly not a constant, but it is close to a constant value.
- There are $NM$ ways to initiate a match between two sequences, but the actual number of k-mer alignments $S$ is much less than $NM$. Let it be $S = \beta MN$, then
  $$u = \frac{\ln(\beta MN)}{\lambda}.$$
- The expected number of matches of length at least $k$ between two sequences (later on we will define this as the E value) is,
  $$E(k) = \beta MN\,P(K \ge k) = \beta MN\,e^{-\lambda k}.$$
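A short sketch of the two formulas above. The slides do not give a value for $\beta$, so the value used here is purely hypothetical, as are the function names. The sketch also illustrates a consequence of the definitions: because $\beta MN = e^{\lambda u}$, the expected count can be written as $E(k) = e^{-\lambda(k-u)}$.

```python
import math

def e_value(k, beta, M, N, lam):
    """Expected number of matches of length >= k: beta*M*N*exp(-lam*k)."""
    return beta * M * N * math.exp(-lam * k)

def evd_location(beta, M, N, lam):
    """u = ln(beta*M*N) / lam."""
    return math.log(beta * M * N) / lam

lam = math.log(2)   # C/G sequences, a = 0.5
beta = 0.3          # hypothetical value; not given in the slides
M, N = 200, 200
u = evd_location(beta, M, N, lam)

k = 15
print(e_value(k, beta, M, N, lam), math.exp(-lam * (k - u)))
```

The two printed numbers should agree up to floating-point rounding, and doubling $M$ (or $N$) doubles the expected count for the same $k$.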

Distribution of the length of matching k-mers in database search
- So far, we have considered only the longest k-mer match between two sequences. During a database search, we return the longest k-mer match between the query sequence and all possible sequences in the database.
- We can extend our analysis of k-mer matches between two sequences to the database search by pretending that we concatenate all the sequences in the database to form one very long sequence.
- Let $L_{\max}$ be the longest k-mer length in a database search; it should follow an EVD,
  $$F(l_{\max}) = \lambda e^{-\lambda(l_{\max}-u)}\, e^{-e^{-\lambda(l_{\max}-u)}}, \qquad u = \frac{\ln(\beta nMN)}{\lambda}.$$
[Figure: a query sequence aligned against a database of $n$ sequences; the k-mer length space follows $p(k) = \lambda e^{-\lambda k}$, and the longest k-mer length space follows $F(l_{\max})$.]

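Distribution of the length of matching k-mers in database search
- Let's look at the result of a computer simulation using 2,000 GC (G = C = 50%) random sequences of length 200 bases, with queries $q_1, q_2, q_3, \ldots, q_{2000}$ and database sequences $s_1, s_2, s_3, \ldots, s_{2000}$.
- Longest k-mers between a pair of sequences; their length is $L$:
  $$P(L = l) = F(l) = \lambda e^{-\lambda(l-u_1)}\, e^{-e^{-\lambda(l-u_1)}}, \qquad u_1 = \frac{\ln(\beta MN)}{\lambda}.$$
- Longest k-mers between a sequence and any sequence in the database; their length is $L_{\max}$:
  $$P(L_{\max} = l_{\max}) = F(l_{\max}) = \lambda e^{-\lambda(l_{\max}-u_2)}\, e^{-e^{-\lambda(l_{\max}-u_2)}}, \qquad u_2 = \frac{\ln(\beta nMN)}{\lambda}.$$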

Distribution of the length of matching k-mers in database search
- For both the distributions of $L$ and $L_{\max}$, we have $\lambda = -\ln a = -\ln 0.5 = \ln 2$.
- However, computing $u$ in either distribution is difficult, because we do not know the value of $\beta$:
  $$u_1 = \frac{\ln(\beta MN)}{\lambda} \text{ for } L, \qquad u_2 = \frac{\ln(\beta nMN)}{\lambda} \text{ for } L_{\max}.$$
- We can find the values of $u_1$ and $u_2$ by fitting the data to an EVD, which yields $u_1 = 13.6$ and $u_2 = 24.5$.
- The difference between $u_1$ and $u_2$ is $u_2 - u_1 = 24.5 - 13.6 = 10.9$.

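Distribution of the length of matching k-mers in database search
- Both the distribution of $L$, the length of the longest k-mer match between two sequences, and that of $L_{\max}$, the length of the longest k-mer match between a query sequence and any sequence in the database, can be fitted to an EVD very nicely.
- The difference between $u_2$ and $u_1$ also meets our expectation:
  $$u_2 - u_1 = \frac{\ln(S_2/S_1)}{\lambda} = \frac{\ln(\beta nMN/\beta MN)}{\lambda} = \frac{\ln n}{\lambda} = \frac{\ln 2000}{\ln 2} = 10.97.$$
[Figure: simulated distributions of $K$, $L$, and $L_{\max}$, with the fitted curves $F(l) = \lambda e^{-\lambda(l-u_1)}\, e^{-e^{-\lambda(l-u_1)}}$ and $F(l_{\max}) = \lambda e^{-\lambda(l_{\max}-u_2)}\, e^{-e^{-\lambda(l_{\max}-u_2)}}$.]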
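The expected shift between the two fitted peaks can be verified directly. This is a minimal check; the only inputs are $n = 2000$, $\lambda = \ln 2$, and the fitted $u_1 = 13.6$ and $u_2 = 24.5$ quoted in the slides. Variable names are illustrative.

```python
import math

n = 2000                 # number of database sequences
lam = math.log(2)        # lambda = -ln(0.5)
u1_fit, u2_fit = 13.6, 24.5

predicted_shift = math.log(n) / lam   # ln(n) / lambda; beta, M and N cancel
observed_shift = u2_fit - u1_fit

print(f"predicted {predicted_shift:.2f}, observed {observed_shift:.1f}")
```

Note that $\beta$, $M$, and $N$ cancel in the ratio $S_2/S_1$, so the comparison does not require knowing $\beta$.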

Statistics in the BLAST algorithm
- BLAST finds the highest-scoring HSP between a query sequence and any sequence in the database.
- If we treat an HSP as a special k-mer match between two sequences, then the length of an HSP should follow an EVD.
- Since the score of an HSP is calculated from the BLOSUM or PAM substitution matrices, it is more informative about the quality of the alignment than the length of the HSP, so the length of an HSP is not used for scoring in BLAST.
- If gaps are not allowed, the score of an HSP is related to its length.
- Karlin and Altschul (1990) showed that the scores of HSPs follow an EVD.
- It has been shown by computer simulations that the scores of gapped local alignments between random sequences generated by algorithms such as Smith-Waterman, FASTA, and BLAST all follow an EVD.

Statistics in the BLAST algorithm
- BLAST used a computer simulation to determine the two parameters in the EVD formula, $\lambda$ and $u$, based on a large number of random sequences.
- In particular, BLAST outputs the score of the HSP between a query sequence and the best hit in the database, as well as the E value of the score.
- The E value is defined as the expected number of HSPs that have a better score than that of the returned HSP, obtained by searching a random sequence database of the same size.
- We have developed the formula for the E value earlier,
  $$E(S) = \beta MN\,e^{-\lambda S},$$
  where $N$ is the length of the query sequence, $M$ is the total length of the sequences in the database, and $S$ is the score of the HSP.
- Therefore, the E value depends on the size of the database being searched. Given a query sequence and a score, the larger the database, the higher the E value.

Statistics in the BLAST algorithm
- Below are examples of database search results from BLASTP using yeast PTP1 as the query.
[Figure: BLASTP hit lists for a search against the entire Swiss-Prot database and for a search against the nr database, which is larger than Swiss-Prot; the hits PTP1, PTP2, and PTP3 are marked.]
- The hits PTP2 (85 vs 84) and PTP3 (49 vs 49) have the same or nearly the same score in both searches, but very different E values (2e-17 vs 2e-15 and 6e-7 vs 9e-5) due to the different sizes of the databases.

Remarks for using BLAST
- The E value depends on the search space $MN$. Therefore, an HSP may be significant when searching against a small database ($M$ is small), but not when searching against a large database.
- An HSP may be significant for a small protein ($N$ is small), but not for a large protein ($N$ is large).
- With the exponential increase in the size of databases, any HSP becomes less and less significant, so new methods need to be developed in the future.
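The first and last remarks can be made concrete by inverting $E(S) = \beta MN\,e^{-\lambda S}$ for the score needed to reach a given E value, $S = \ln(\beta MN/E)/\lambda$. The sketch below is illustrative only: $\beta$, the sequence lengths, and the E-value cutoff are hypothetical, and $\lambda = \ln 2$ is borrowed from the C/G simulations rather than from any real BLAST scoring system.

```python
import math

def score_for_e_value(e_target, beta, M, N, lam):
    """Score S at which beta*M*N*exp(-lam*S) equals e_target."""
    return math.log(beta * M * N / e_target) / lam

lam = math.log(2)   # from the C/G simulations (assumption for this sketch)
beta = 0.3          # hypothetical; not given in the slides
N = 200             # query length
e_target = 1e-3     # hypothetical significance cutoff

small_db = 1_000_000            # total database length M (hypothetical)
large_db = small_db * 1024      # a database 1024 times larger

s_small = score_for_e_value(e_target, beta, small_db, N, lam)
s_large = score_for_e_value(e_target, beta, large_db, N, lam)
print(f"required score grows by {s_large - s_small:.1f}")
```

Under these assumptions the required score grows by $\ln(1024)/\ln 2 = 10$ regardless of $\beta$, $N$, or the cutoff, since only the ratio of database sizes enters the difference.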