Algorithms in Computational Biology: More on BWT
Today: Last class! Please don't forget to submit your implementation by next week (via email, repo, or share the project). I would like a reading overview: discuss your design, how you tested (show me some tests), how to compile and use it, and any comparisons or lessons learned. All HW is graded! Come get your HWs next week. Don't forget instructor evaluations.
Today: First, more on BWT. Recap: compressible, reversible, useful (and fast) for searching.
[Figure: reversing the BWT. Start with the BWT column, sort it, prepend the BWT column again, sort again, and repeat until the rows are full length. The original string is the row ending in $.]
Code: easy (if slow). Runtime? O(n^2 log n). Important takeaway: 1 line of code is not O(1) time!
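The sort-and-prepend reversal above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function names `bwt` and `inverse_bwt_naive` are made up here, and the test string is the one that appears to be used as the example later in these notes.

```python
def bwt(text):
    """BWT via sorted cyclic rotations; text must end in a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt_naive(last):
    """Reverse the BWT by repeatedly prepending the BWT column and sorting."""
    rows = [""] * len(last)
    for _ in range(len(last)):
        # Each "one line" sort here costs far more than O(1) time.
        rows = sorted(c + row for c, row in zip(last, rows))
    # The original string is the row that ends in '$'.
    return next(row for row in rows if row.endswith("$"))


if __name__ == "__main__":
    text = "GATGCGAGAGATG$"
    print(bwt(text))
    print(inverse_bwt_naive(bwt(text)))
```

The loop runs n sorts, which is where the slow runtime comes from.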
Last time: connection to suffix trees and suffix arrays. [Example figure not recoverable.]
How to reverse more efficiently? Today: LF mapping. Give each character a rank = # of times that character previously appeared in the string. [Example: string annotated with ranks; not recoverable.] Look back at the BWT. Key fact: the a's are in the same relative order (of ranks) in F as in L, and the same holds for the b's.
This is true for any value i. Called the LF mapping: the i-th occurrence of character c in L and the i-th occurrence of character c in F always correspond to the same occurrence in the original string.
Why?? Because we're doing lexicographical (i.e. alphabetical) sorted order! All the a's have the same order in F and in L: in both columns ties are broken by the same string, since it's the suffix of one row and the prefix of the other! Sometimes called the "First-Last property".
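The LF mapping gives a faster way to reverse the BWT: walk from the row starting with $ and jump from L to F using ranks. A minimal sketch under that reading of the notes; the names `inverse_bwt_lf`, `ranks`, and `first_row` are illustrative, not from the lecture.

```python
def inverse_bwt_lf(last):
    """Reverse a BWT (with a unique '$' terminator) using the LF mapping."""
    # ranks[i] = number of times last[i] appeared earlier in the L column.
    counts = {}
    ranks = []
    for c in last:
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    # first_row[c] = first row of the F column that starts with c.
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # Row 0 starts with '$', so its L character is the last real character.
    row = 0
    out = []
    while last[row] != "$":
        out.append(last[row])
        # The i-th c in L is the i-th c in F: jump to that row.
        row = first_row[last[row]] + ranks[row]
    return "".join(reversed(out)) + "$"


if __name__ == "__main__":
    print(inverse_bwt_lf("GGGGGGTCAA$TAA"))
```

Each step is a constant number of lookups, so the whole walk is linear in the length of the string once the rank and first-row tables are built.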
Now: how can we use the BWT to look for all repeats of one string? Let's look at a "biological" data set. String T = GATGCGAGAGATG$. Compute all cyclic permutations (or the suffix array, as we did last time). [Figure: suffix array and BWT of T.] Let's look for all "GAGA" in the text.
Counting: backward search. All matches end with "A". Each of these A's is the 1st letter of some suffix, and these must be stored next to each other in the suffix array (since all start the same). However, only suffixes preceded by G can be options: the BWT stores this! Q: Where is the 1st G in the sorted order? (Remember: sorted order.) Since the 1st G is in row 6, these are rows 7-10.
So we look at "GA": in rows 7-10, only 2 are preceded by an "A". These are the first two A's in the BWT, so they are the 1st two A's in sorted suffix order. Continue: both are preceded by G, so use the sorted order position again to map them to rows 7 to 8: match.
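The backward search walked through above can be sketched as follows. This is an illustrative version that builds the suffix array by plain sorting and stores a full occurrence table; the function name `backward_search` and the half-open row range `[lo, hi)` are choices made here, and the text is the string T as reconstructed from the notes.

```python
def backward_search(text, pattern):
    """Return sorted start positions of pattern in text (text ends in '$')."""
    # Suffix array by direct sorting (fine for a small example).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$').
    last = "".join(text[i - 1] for i in sa)

    alphabet = sorted(set(text))
    # first_row[c] = first row whose suffix starts with c.
    first_row = {}
    total = 0
    for c in alphabet:
        first_row[c] = total
        total += text.count(c)

    # occ[c][i] = number of c's in last[:i].
    occ = {c: [0] for c in alphabet}
    for ch in last:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))

    lo, hi = 0, len(text)  # current range of rows, half-open
    for c in reversed(pattern):
        if c not in first_row:
            return []
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    print(backward_search("GATGCGAGAGATG$", "GAGA"))
```

On this text the row ranges visited for "GAGA" are 1-4, then 7-10, then 1-2, then 7-8, matching the steps in the notes.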
Implementation: need the first and last columns of the sorted rows, plus the index of occurrences (Occ) in the BWT for counting. Space for Occ: one row per alphabet character (σ, or 4 for DNA) and one column per input string character (N). Each entry stores a count (O(log N) bits). Total: O(σ N log N) bits. (For the human genome this was 47.68 GB.)
Searching: O(k) for a query of size k: k steps, each with 2 memory accesses. Independent of the size of the text!!
Space improvements: store 0/1 bits instead of counts (O(log n) bits each). Keep 1 count column per 32 entries, then just count in the binary table from there. Now O(N) bits, plus a 32-bit (log N) count for every 32 entries. (For the human genome, now down to 2.98 GB, not 47.68 GB.)
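One way to picture the checkpoint idea is the sketch below: a 0/1 table per character plus a stored count every `step` entries. The class name `CheckpointedOcc` is hypothetical, and Python lists stand in for packed bit vectors, so this shows the lookup logic only, not the real memory savings.

```python
class CheckpointedOcc:
    """Occ(c, i) = number of c's in last[:i], using sparse stored counts."""

    def __init__(self, last, step=32):
        self.step = step
        self.alphabet = sorted(set(last))
        # One 0/1 row per character (a real index would pack these into bits).
        self.bits = {c: [int(ch == c) for ch in last] for c in self.alphabet}
        # Running counts stored only at positions 0, step, 2*step, ...
        self.checkpoints = {c: [] for c in self.alphabet}
        running = dict.fromkeys(self.alphabet, 0)
        for i, ch in enumerate(last):
            if i % step == 0:
                for c in self.alphabet:
                    self.checkpoints[c].append(running[c])
            running[ch] += 1

    def occ(self, c, i):
        # Nearest checkpoint at or before i (clamped to the last one stored).
        block = min(i // self.step, len(self.checkpoints[c]) - 1)
        start = block * self.step
        # Stored count plus the set bits between the checkpoint and i.
        return self.checkpoints[c][block] + sum(self.bits[c][start:i])


if __name__ == "__main__":
    table = CheckpointedOcc("GGGGGGTCAA$TAA", step=4)
    print(table.occ("G", 5), table.occ("A", 14))
```

With packed bits, the sum over a block becomes a single population count on a machine word, which is why 32 is a natural block size.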
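Also: how to compress the suffix array? Keep 1 value out of every 32. How to compute the missing values? Cool trick! $ is stored at row 0, which contains value 13, and its BWT letter is G. Where is 12? At row C[G] + Occ(G, 0) = 6 + 0 = 6, where C[G] is the first row starting with G. Generally: if value x is stored at row m and c = BWT[m], then x - 1 is stored at row C[c] + Occ(c, m).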
If we do this: just iterate until you reach a multiple of 32, look up that stored value, and compute the position of the suffix from it by adding back the number of steps taken. (2 memory accesses per iteration; at most 31 iterations to reach a multiple of 32.) Saves another factor of 32: for the human genome, now down to ~300 MB or so. (Even more tricks using advanced data structures: a bit beyond our scope.)
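The sampled suffix array lookup can be sketched like this. It assumes sampling by suffix value (keep entries whose value is a multiple of `step`) and, for brevity, precomputes the whole LF array instead of deriving each step from C and Occ; the names `build_sampled_index` and `locate` are illustrative.

```python
def build_sampled_index(text, step=32):
    """Return (lf, samples): the LF array and the kept suffix array entries."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)

    counts = {}
    ranks = []
    for c in last:
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # lf[row] = row holding the suffix that starts one position earlier.
    lf = [first_row[c] + r for c, r in zip(last, ranks)]
    # Keep only suffix array values that are multiples of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return lf, samples


def locate(row, lf, samples):
    """Recover the suffix array value of a row from the sparse samples."""
    steps = 0
    while row not in samples:
        row = lf[row]  # move to the suffix one position to the left
        steps += 1
    return samples[row] + steps


if __name__ == "__main__":
    lf, samples = build_sampled_index("GATGCGAGAGATG$", step=4)
    print([locate(row, lf, samples) for row in range(14)])
```

Each LF step lowers the suffix value by one, so a sampled value is reached in fewer than `step` iterations; value 0 is always sampled, so the walk never wraps around.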
Most famous application: DNA alignment. The seeding step of BWA uses the exact tricks we just looked at. Particularly good in biology, since the alphabet is so small.