Search sequence databases 3
10/25/2016
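Extreme value distribution
- Suppose $X$ is a random variable with probability density function $p(x)$. We sample a large number $S$ of independent values of $X$ from this distribution, and we repeat this sampling an infinite number of times.
- For each sample of size $S$, we record the largest value, $x_{max}$, so we have a new random variable taking these values. Let's denote it by $X_{max}$.
- The probability that a value of $X$ is smaller than a given value $x$ is given by the cumulative probability function,
  $$G(x) = P(X < x) = \int_{-\infty}^{x} p(x')\,dx'.$$
- Let $F(x_{max}) = P(X_{max} = x_{max})$, i.e., the probability (density) that the maximum of the $S$ values is equal to $x_{max}$. Any one of the $S$ values can be the maximum, and the other $S-1$ values must be smaller, so we have
  $$F(x_{max}) = S\,p(x_{max})\,G(x_{max})^{S-1}.$$
[Figure: an infinite number of samples of $S$ values is drawn from $X$; their maxima $x_{1,max}, x_{2,max}, \ldots, x_{i,max}, \ldots$ are the values of the random variable $X_{max}$.]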
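Extreme value distribution
- Deriving the explicit form of $F(x_{max})$ in the general case is difficult, but if we assume that $X$ follows an exponential distribution, it is rather easy:
  $$p(x) = \lambda e^{-\lambda x},$$
  $$G(x) = P(X < x) = \int_0^x p(x')\,dx' = \int_0^x \lambda e^{-\lambda x'}\,dx' = -\int_0^x d\left(e^{-\lambda x'}\right) = 1 - e^{-\lambda x},$$
  $$F(x_{max}) = S\,p(x_{max})\,G(x_{max})^{S-1} = S\lambda e^{-\lambda x_{max}}\left(1 - e^{-\lambda x_{max}}\right)^{S-1} \approx S\lambda e^{-\lambda x_{max}}\,e^{-(S-1)e^{-\lambda x_{max}}},$$
  since $(1-a)^n \approx e^{-na}$ for small $a$. Because $S \gg 1$, we can replace $S-1$ by $S$. Therefore,
  $$F(x_{max}) \approx S\lambda e^{-\lambda x_{max}}\,e^{-S e^{-\lambda x_{max}}}.$$
- Let $u = \frac{\ln S}{\lambda}$, then $S = e^{\lambda u}$, and therefore
  $$F(x_{max}) = \lambda e^{\lambda u} e^{-\lambda x_{max}}\,e^{-e^{\lambda u} e^{-\lambda x_{max}}} = \lambda e^{-\lambda(x_{max}-u)}\,e^{-e^{-\lambda(x_{max}-u)}}.$$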
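The approximation step in this derivation can be checked numerically. The following is a minimal sketch, assuming an exponential density $p(x) = \lambda e^{-\lambda x}$; the function names and the values of `lam` and `S` are illustrative choices, not taken from the slides. It compares the exact density of the maximum, $S\,p(x)\,G(x)^{S-1}$, with the EVD form.

```python
import math

def exact_max_density(x, lam, S):
    """Density of the largest of S independent exponential(lam) values:
    S * p(x) * G(x)**(S - 1)."""
    p = lam * math.exp(-lam * x)
    G = 1.0 - math.exp(-lam * x)
    return S * p * G ** (S - 1)

def gumbel_density(x, lam, u):
    """EVD form: lam * exp(-lam (x - u)) * exp(-exp(-lam (x - u)))."""
    z = math.exp(-lam * (x - u))
    return lam * z * math.exp(-z)

lam, S = 0.5, 2000          # illustrative values (assumption)
u = math.log(S) / lam       # u = ln(S) / lambda

# Compare both densities below, at, and above the peak position u.
for x in (u - 2, u, u + 2, u + 6):
    print(f"x = {x:6.2f}  exact = {exact_max_density(x, lam, S):.5f}  "
          f"EVD = {gumbel_density(x, lam, u):.5f}")
```

For a sample size this large the two columns should agree closely, which is what the step $(1-a)^n \approx e^{-na}$ relies on.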
Extreme value distribution
- A distribution with the density function
  $$F(x_{max}) = \lambda e^{-\lambda(x_{max}-u)}\,e^{-e^{-\lambda(x_{max}-u)}}$$
  is called an extreme value distribution (EVD) or a Gumbel distribution.
- It arises when we consider the maximum values of many independent samples of the same size taken from the same underlying distribution.
- Although we derived this formula from the exponential distribution, it is a good approximation for many other distributions of the random variable $X$.
- The distribution has two parameters, $u$ and $\lambda$, and its density has a peak at $X_{max} = u$.
- The width is controlled by $\lambda$: the larger the value of $\lambda$, the narrower the peak (the width scales as $1/\lambda$).
[Figure: plots of the density $F(x_{max})$ against $x_{max}$.]
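Extreme value distribution
- When the sample size $S$ changes, $u$ changes as well, because $u = \frac{\ln S}{\lambda}$. A change in $u$ shifts the distribution curve horizontally without changing the shape of the distribution.
- If we change the sample size from $S_1$ to $S_2$, the peak of the distribution moves from $u_1 = \frac{\ln S_1}{\lambda}$ to $u_2 = \frac{\ln S_2}{\lambda}$.
- The distance moved is given by
  $$u_2 - u_1 = \frac{\ln S_2 - \ln S_1}{\lambda} = \frac{\ln(S_2/S_1)}{\lambda}.$$
- The probability that $X_{max}$ takes a value greater than or equal to an observed value $x_{obs}$ can be computed by
  $$P(X_{max} \ge x_{obs}) = \int_{x_{obs}}^{\infty} F(x_{max})\,dx_{max} = \int_{x_{obs}}^{\infty} \lambda e^{-\lambda(x_{max}-u)}\,e^{-e^{-\lambda(x_{max}-u)}}\,dx_{max} = 1 - e^{-e^{-\lambda(x_{obs}-u)}}.$$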
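The shift of the peak and the closed-form tail probability can both be checked with a few lines of code. This is a sketch under stated assumptions: the helper names, the value of `lam`, and the two sample sizes are made up for illustration, and the numerical integral uses a simple midpoint rule with an arbitrary far cutoff.

```python
import math

def evd_location(S, lam):
    """Peak position u = ln(S) / lambda for sample size S."""
    return math.log(S) / lam

def evd_density(x, lam, u):
    z = math.exp(-lam * (x - u))
    return lam * z * math.exp(-z)

def evd_tail(x_obs, lam, u):
    """Closed form: P(X_max >= x_obs) = 1 - exp(-exp(-lam (x_obs - u)))."""
    return 1.0 - math.exp(-math.exp(-lam * (x_obs - u)))

def evd_tail_numeric(x_obs, lam, u, steps=100_000):
    """Midpoint-rule integral of the density from x_obs to a far cutoff."""
    upper = max(x_obs, u) + 40.0 / lam   # density is negligible beyond this point
    h = (upper - x_obs) / steps
    return sum(evd_density(x_obs + (i + 0.5) * h, lam, u) for i in range(steps)) * h

lam = 0.7                                 # illustrative value (assumption)
u1 = evd_location(1_000, lam)
u2 = evd_location(50_000, lam)
print("shift of the peak:", u2 - u1, "expected:", math.log(50) / lam)
print("tail, closed form:", evd_tail(u1 + 3, lam, u1))
print("tail, numeric    :", evd_tail_numeric(u1 + 3, lam, u1))
```

Only $u$ depends on the sample size, so enlarging the sample moves the curve to the right while the tail formula keeps the same shape.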
An idealistic database search scenario
- Let's consider a database search algorithm that returns the sequence in the database with the highest number of matches to the query sequence, i.e., we use the number of matches $m$ between the two sequences to score the alignment. Let it be the random variable $M$.
- In order to know the significance of a returned sequence with a score $m_{max}$, which is also a random variable, denoted as $M_{max}$, we need to know the distribution of $M_{max}$, denoted by $F(m_{max})$.
- Let's first look at a computer simulation result: search a database of 2,000 random sequences of length 200 bases with another, different set of 2,000 random sequences of the same length.

          s_1         s_2         ...  s_j         ...  s_2000         best score
q_1       m_{1,1}     m_{1,2}     ...  m_{1,j}     ...  m_{1,2000}     m_{1,max}
q_2       m_{2,1}     m_{2,2}     ...  m_{2,j}     ...  m_{2,2000}     m_{2,max}
...
q_i       m_{i,1}     m_{i,2}     ...  m_{i,j}     ...  m_{i,2000}     m_{i,max}
...
q_2000    m_{2000,1}  m_{2000,2}  ...  m_{2000,j}  ...  m_{2000,2000}  m_{2000,max}

The entries $m_{i,j}$ are values of the random variable $M$; the row maxima $m_{i,max}$ are values of the random variable $M_{max}$.
An idealistic database search scenario
- Suppose that the sequences are made only of Cs and Gs with the same frequency, i.e., C = G = 50%.
- Clearly, the distribution of the score $m_{i,j} \in M$, i.e., the score obtained when a query sequence $q_i$ is aligned with a sequence $s_j$ in the database, follows a binomial distribution with $N = 200$ and $a = 0.5$.
- However, the distribution of the score $m_{i,max} \in M_{max}$, i.e., the best score returned when the database is queried with sequence $q_i$, follows an EVD.
[Figure: distributions of $M$ and $M_{max}$.]
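A scaled-down version of such a simulation can be sketched as follows. It is an assumption-laden shortcut rather than the simulation behind the slides: instead of generating and comparing actual C/G sequences, it draws each score directly as the number of 1-bits among 200 fair random bits, which has the same binomial distribution ($N = 200$, $a = 0.5$). The function name, the seed, and the reduced number of queries (200 instead of 2,000) are illustrative choices.

```python
import random

def simulate_best_scores(n_queries, db_size, length, seed=0):
    """For each query, draw db_size binomial(length, 0.5) match counts
    and keep only the largest one (one value of M_max per query)."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_queries):
        scores = (bin(rng.getrandbits(length)).count("1") for _ in range(db_size))
        best.append(max(scores))
    return best

# 200 queries against a database of 2,000 sequences of length 200.
best = simulate_best_scores(n_queries=200, db_size=2000, length=200)
mean_best = sum(best) / len(best)
print("smallest best score:", min(best))
print("mean best score    :", round(mean_best, 1))
print("largest best score :", max(best))
```

A single score is centered on 100 matches, while the best of 2,000 scores sits far out in the right tail, which is why the two random variables need different distributions.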
An idealistic database search scenario
- Specifically, we sample 2,000 $M$ values 2,000 times, and for each sample of 2,000 $m$ values we obtain an $m_{max}$.
- Using the formula of the EVD that we derived above, we have
  $$F(m_{max}) = \lambda e^{-\lambda(m_{max}-u)}\,e^{-e^{-\lambda(m_{max}-u)}}.$$
- Fitting the simulation data to this formula, we have $\lambda = 0.497$ and $u = 123.2$.
- Given a query sequence, if the best hit returned from a database has a match score of $m_{obs}$, the statistical significance of this hit can be evaluated by the following probability, which is the p-value based on the null hypothesis that the query sequence has no relationship with the sequences in the database:
  $$p\text{-value}(m_{obs}) = P(M_{max} \ge m_{obs}) = 1 - e^{-e^{-\lambda(m_{obs}-u)}}.$$
- The smaller the p-value, the more significant the hit.
An idealistic database search scenario
- Suppose we have a query sequence of 200 bases and we use it to search a database of 2,000 sequences. If the returned best hit has 130 matches to the query sequence, then the p-value is
  $$p\text{-value}(m_{obs}) = P(M_{max} \ge m_{obs}) = 1 - e^{-e^{-\lambda(m_{obs}-u)}} = 1 - e^{-e^{-0.497(130-123.2)}} = 0.033.$$
[Figure: distributions of $M$ and $M_{max}$, with the p-value marked.]
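The worked example can be reproduced directly from the tail formula of the EVD. This is a minimal sketch; the function name `evd_p_value` is illustrative, while the parameter values $\lambda = 0.497$ and $u = 123.2$ are the fitted values quoted in the slides.

```python
import math

def evd_p_value(m_obs, lam, u):
    """P(M_max >= m_obs) = 1 - exp(-exp(-lam (m_obs - u)))."""
    return 1.0 - math.exp(-math.exp(-lam * (m_obs - u)))

# Best hit with 130 matches, fitted parameters from the simulation.
p = evd_p_value(130, lam=0.497, u=123.2)
print(f"p-value = {p:.3f}")
```

Calling the same function with a larger observed score, for example `evd_p_value(140, 0.497, 123.2)`, gives a smaller p-value, in line with the rule that a smaller p-value means a more significant hit.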
Distribution of the length of matching k-mers in two sequences
- To develop the statistical method used in k-mer based database search algorithms such as FASTA and BLAST, we need to consider the distribution of the scores of k-mer matches, rather than the number of matches.
- Let's consider a very simple pairwise local alignment algorithm that finds the longest exactly matching k-mer in two sequences of lengths $N$ and $M$.
- For simplicity, the score of the alignment is the length of the matching k-mer, $k$.

  Sequence 1, N=19: GGATATCCAGCGCTCCTCT
  Sequence 2, M=14: ATCCGATATCTTGG

- Suppose that we align many pairs of unrelated sequences; then the length of the longest match between two sequences, $L$, is a random variable.
- Clearly, $L$ should follow an EVD: $F(l) = P(L = l) \sim \mathrm{EVD}$.
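One simple way to find the longest exactly matching k-mer is the standard dynamic-programming routine for the longest common substring. The sketch below is an illustration of that idea, not the algorithm used by FASTA or BLAST; the function name is an assumption, and the two sequences are the example pair from the slide.

```python
def longest_exact_match(seq1, seq2):
    """Return (length, k-mer) of the longest exactly matching stretch.

    curr[j] holds the length of the match ending at seq1[i-1] and seq2[j-1].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(seq2) + 1)
    for i in range(1, len(seq1) + 1):
        curr = [0] * (len(seq2) + 1)
        for j in range(1, len(seq2) + 1):
            if seq1[i - 1] == seq2[j - 1]:
                curr[j] = prev[j - 1] + 1      # extend the diagonal match
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return best_len, seq1[best_end - best_len:best_end]

seq1 = "GGATATCCAGCGCTCCTCT"   # Sequence 1, N = 19
seq2 = "ATCCGATATCTTGG"        # Sequence 2, M = 14
length, kmer = longest_exact_match(seq1, seq2)
print(length, kmer)
```

For this pair the longest shared k-mer is `GATATC`, so the alignment score under this scheme is 6.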
Distribution of the length of matching k-mers in two sequences
- However, the length of exactly matching k-mers between two random sequences follows an exponential distribution, which can be derived as follows.
- As discussed earlier, given two unrelated sequences, the probability that the two sequences have a match at a position is
  $$a = \pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2.$$
- Let $K$ be the random variable of the length of k-mers found in two random sequences. The probability that two random sequences have at least $k$ consecutive matches is
  $$P(K \ge k) = P\big((\text{match OR mismatch}) \text{ AND } k \text{ matches AND } (\text{match OR mismatch})\big) = P(\text{match OR mismatch})\,P(k \text{ matches})\,P(\text{match OR mismatch}) = a^k.$$
  Let $\lambda = -\ln a$, then $a = e^{-\lambda}$ and $P(K \ge k) = e^{-\lambda k}$.
- The probability that the two sequences have fewer than $k$ consecutive matches is
  $$G(k) = P(K < k) = 1 - P(K \ge k) = 1 - e^{-\lambda k}.$$
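Distribution of the length of matching k-mers in two sequences
- If we treat $K$ as a continuous variable, then the probability density of $K$ is
  $$p(k) = \frac{dG(k)}{dk} = \frac{d\left(1 - e^{-\lambda k}\right)}{dk} = \lambda e^{-\lambda k}.$$
  Therefore, $K$ follows an exponential distribution.
- The length of the longest k-mer match between two sequences, $L$, follows an EVD,
  $$F(l) = P(L = l) = \lambda e^{-\lambda(l-u)}\,e^{-e^{-\lambda(l-u)}}.$$
[Figure: all pairwise alignments in random sequence space give all k-mer matches (length space of matching k-mers, $p(k) = \lambda e^{-\lambda k}$); finding the longest match in each alignment gives the length space of the longest matching k-mers, $F(l) = \lambda e^{-\lambda(l-u)}\,e^{-e^{-\lambda(l-u)}}$.]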
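The quantities in this derivation are easy to compute for concrete base frequencies. The following sketch uses two illustrative compositions (C/G only, and uniform A/C/G/T); the function names are assumptions made for the example. It evaluates the match probability $a$, the rate $\lambda = -\ln a$, the probability of at least $k$ consecutive matches, and the exponential density.

```python
import math

def match_probability(freqs):
    """a = sum of squared base frequencies."""
    return sum(f * f for f in freqs.values())

def prob_run_at_least(k, a):
    """P(K >= k) = a**k, written as exp(-lambda k) with lambda = -ln(a)."""
    lam = -math.log(a)
    return math.exp(-lam * k)

def run_length_density(k, a):
    """p(k) = lambda * exp(-lambda k), treating K as continuous."""
    lam = -math.log(a)
    return lam * math.exp(-lam * k)

gc_only = {"C": 0.5, "G": 0.5}
uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, freqs in (("C/G only", gc_only), ("uniform DNA", uniform_dna)):
    a = match_probability(freqs)
    print(f"{name}: a = {a}, lambda = {-math.log(a):.4f}, "
          f"P(K >= 10) = {prob_run_at_least(10, a):.3e}")
```

With four equally frequent bases a long run of matches is much less likely than with two, so $\lambda$ is larger and the length distribution decays faster.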
Distribution of the length of matching k-mers in two sequences
- Here, $u$ is related to the number of k-mer alignments that can be generated between two sequences, i.e., the sample size $S$, since we defined $u = \frac{\ln S}{\lambda}$, assuming that $S$ is a constant.

  Sequence 1, N=19: GGATATCCAGCGCTCCTCT
  Sequence 2, M=14: ATCCGATATCTTGG

- In reality, $S$ is clearly not a constant, but it is close to a constant value.
- There are $NM$ ways to initiate a match between two sequences, but the actual number of k-mer alignments $S$ is much less than $NM$. Let it be $S = \beta MN$, then
  $$u = \frac{\ln(\beta MN)}{\lambda}.$$
- The expected number of matches of length at least $k$ between two sequences is (later on we will define this as the E value)
  $$E(k) = \beta MN\,P(K \ge k) = \beta MN\,e^{-\lambda k}.$$
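The two formulas $u = \ln(\beta MN)/\lambda$ and $E(k) = \beta MN e^{-\lambda k}$ can be tied together in a short sketch. The value of `beta` below is purely hypothetical (the slides do not give one), the uniform base composition is an assumption, and the function names are illustrative; $N = 19$ and $M = 14$ are the lengths of the example sequences.

```python
import math

def evd_location(beta, M, N, lam):
    """u = ln(beta * M * N) / lambda."""
    return math.log(beta * M * N) / lam

def expected_matches(k, beta, M, N, lam):
    """E(k) = beta * M * N * exp(-lambda k)."""
    return beta * M * N * math.exp(-lam * k)

N, M = 19, 14                 # lengths of the two example sequences
lam = -math.log(0.25)         # assumes uniform A, C, G, T, so a = 0.25
beta = 0.3                    # hypothetical value, for illustration only

u = evd_location(beta, M, N, lam)
print("u =", round(u, 3))
for k in (2, 4, 6):
    print(f"E({k}) = {expected_matches(k, beta, M, N, lam):.4f}")
print("E(u) =", expected_matches(u, beta, M, N, lam))
```

Under these definitions $E(u) = 1$: the peak of the EVD sits at the match length that is expected to occur about once by chance.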
Distribution of the length of matching k-mers in database search
- So far, we have only considered the longest k-mer match between two sequences. During a database search, we return the longest k-mer match between the query sequence and all possible sequences in the database.
- We can extend our analysis of k-mer matches between two sequences to the database search by pretending that we concatenate all the sequences in the database to form one very long sequence.
- Let the longest k-mer length for a database search be $L_{max}$; then it should follow an EVD.
[Figure: a query sequence is compared with the $n$ database sequences. k-mer length space: $p(k) = \lambda e^{-\lambda k}$. Longest k-mer length space: $F(l_{max}) = \lambda e^{-\lambda(l_{max}-u)}\,e^{-e^{-\lambda(l_{max}-u)}}$ with $u = \frac{\ln(\beta nMN)}{\lambda}$.]
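Distribution of the length of matching k-mers in database search
- Let's look at the result of a computer simulation using 2,000 GC (G = C = 50%) random sequences of length 200 bases.
[Figure: queries $q_1, q_2, q_3, \ldots, q_{2000}$ searched against database sequences $s_1, s_2, s_3, \ldots, s_{2000}$.]
- Longest k-mers between a pair of sequences; their length is $L$:
  $$P(L = l) = F(l) = \lambda e^{-\lambda(l-u_1)}\,e^{-e^{-\lambda(l-u_1)}}, \qquad u_1 = \frac{\ln(\beta MN)}{\lambda}.$$
- Longest k-mers between a sequence and any sequence in the database; their length is $L_{max}$:
  $$P(L_{max} = l_{max}) = F(l_{max}) = \lambda e^{-\lambda(l_{max}-u_2)}\,e^{-e^{-\lambda(l_{max}-u_2)}}, \qquad u_2 = \frac{\ln(\beta nMN)}{\lambda}.$$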
Distribution of the length of matching k-mers in database search
- For both the distributions of $L$ and $L_{max}$, we have
  $$\lambda = -\ln a = -\ln 0.5 = \ln 2.$$
- However, computing $u$ in either distribution is difficult, because we do not know the value of $\beta$:
  $$u_1 = \frac{\ln(\beta MN)}{\lambda} \text{ for } L, \qquad u_2 = \frac{\ln(\beta nMN)}{\lambda} \text{ for } L_{max}.$$
- We can find the values of $u_1$ and $u_2$ by fitting the data to an EVD, which yields $u_1 = 13.6$ and $u_2 = 24.5$.
- The difference between $u_1$ and $u_2$ is $u_2 - u_1 = 24.5 - 13.6 = 10.9$.
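Distribution of the length of matching k-mers in database search
- Both the distribution of $L$, the length of the longest k-mer match between two sequences, and that of $L_{max}$, the length of the longest k-mer match between a query sequence and any sequence in the database, can be fitted to an EVD very nicely.
- The difference between $u_2$ and $u_1$ also meets our expectation:
  $$u_2 - u_1 = \frac{\ln(S_2/S_1)}{\lambda} = \frac{\ln(\beta nMN / \beta MN)}{\lambda} = \frac{\ln n}{\lambda} = \frac{\ln 2000}{\ln 2} = 10.97.$$
[Figure: fitted EVD curves $F(l) = \lambda e^{-\lambda(l-u_1)}\,e^{-e^{-\lambda(l-u_1)}}$ for $L$ and $F(l_{max}) = \lambda e^{-\lambda(l_{max}-u_2)}\,e^{-e^{-\lambda(l_{max}-u_2)}}$ for $L_{max}$.]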
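The comparison between the predicted and the fitted shift is a short calculation. The sketch below uses only numbers stated in the slides ($n = 2000$, $\lambda = \ln 2$, $u_1 = 13.6$, $u_2 = 24.5$); the variable names are illustrative.

```python
import math

lam = math.log(2)            # lambda = -ln(0.5) for C = G = 50%
n = 2000                     # number of sequences in the database

# Predicted shift: ln(S2 / S1) / lambda = ln(n) / lambda.
predicted_shift = math.log(n) / lam

# Shift between the two fitted peak positions.
u1_fit, u2_fit = 13.6, 24.5
fitted_shift = u2_fit - u1_fit

print(f"predicted u2 - u1 = {predicted_shift:.2f}")
print(f"fitted    u2 - u1 = {fitted_shift:.2f}")
```

The unknown factor $\beta$ cancels in the ratio $S_2/S_1$, which is why the shift can be predicted even though neither $u_1$ nor $u_2$ can be computed directly.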
Statistics in the BLAST algorithm
- BLAST finds the highest-scoring HSP between a query sequence and any sequence in the database.
- If we treat an HSP as a special k-mer match between two sequences, then the length of the HSP should follow an EVD.
- Since the score of an HSP is calculated from the BLOSUM or PAM substitution matrices, it is more informative about the quality of the alignment than the length of the HSP, so the length of an HSP is not used for scoring in BLAST.
- If gaps are not allowed, the score of an HSP is related to its length.
- Karlin and Altschul (1990) showed that the scores of HSPs follow an EVD.
- It has been shown by computer simulations that the scores of gapped local alignments between random sequences generated by algorithms such as Smith-Waterman, FASTA, and BLAST all follow an EVD.
Statistics in the BLAST algorithm
- BLAST used a computer simulation to determine the two parameters in the EVD formula, $\lambda$ and $u$, based on a large number of random sequences.
- In particular, BLAST outputs the score of the HSP between a query sequence and the best hit in the database, as well as the E value of the score.
- The E value is defined as the expected number of HSPs that have a better score than that of the returned HSP, obtained by searching a random sequence database of the same size.
- We developed the formula for the E value earlier,
  $$E(S) = \beta MN\,e^{-\lambda S},$$
  where $N$ is the length of the query sequence, $M$ the total length of the sequences in the database, and $S$ the score of the HSP.
- Therefore, the E value depends on the size of the database being searched. Given a query sequence and a score, the larger the database, the higher the E value.
Statistics in the BLAST algorithm
- Below are examples of database search results by BLASTP using yeast PTP1 as the query.
[Figure: BLASTP hit lists for two searches, with the hits PTP1, PTP2, and PTP3 marked: (1) search against the entire Swiss-Prot database; (2) search against the nr database, which is larger than Swiss-Prot.]
- The hits PTP2 (85 vs 84) and PTP3 (49 vs 49) have the same or nearly the same score in both searches, but very different E values (2e-17 vs 2e-15, and 6e-7 vs 9e-5) due to the different sizes of the databases.
Remarks for using BLAST
- The E value depends on the search space $MN$. Therefore, when searching against a small database ($M$ is small), the resulting HSP may be significant, but it may not be significant when searching against a large database.
- An HSP may be significant for a small protein ($N$ is small), but it may not be significant for a large protein ($N$ is large).
- With the exponential increase in the size of databases, any given HSP becomes less and less significant, so new methods will need to be developed in the future.
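These remarks can be made concrete by solving $E = \beta MN e^{-\lambda S}$ for the score needed to reach a chosen E-value cutoff, $S = \ln(\beta MN / E)/\lambda$. The sketch below is only an illustration: the values of `beta`, `lam`, the query length, the database sizes, and the cutoff are all hypothetical and are not BLAST's actual parameters, and the function names are assumptions.

```python
import math

def e_value(score, beta, lam, M, N):
    """E(S) = beta * M * N * exp(-lambda S)."""
    return beta * M * N * math.exp(-lam * score)

def min_significant_score(e_cutoff, beta, lam, M, N):
    """Smallest score whose E value does not exceed e_cutoff."""
    return math.log(beta * M * N / e_cutoff) / lam

beta, lam = 0.1, 0.3     # hypothetical parameters, for illustration only
N = 300                  # hypothetical query length
e_cutoff = 0.01          # hypothetical significance cutoff

for M in (1e7, 1e8, 1e9):    # hypothetical total database lengths
    s = min_significant_score(e_cutoff, beta, lam, M, N)
    print(f"M = {M:.0e}: score needed for E <= {e_cutoff} is {s:.1f}")
```

Each tenfold increase in $M$ (or in $N$) raises the required score by the same amount, $\ln(10)/\lambda$, so the same HSP can pass the cutoff in a small search space and fail it in a large one.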