The Double Helix. CSE 417: Algorithms and Computational Complexity. The Central Dogma of Molecular Biology: DNA, RNA, Protein.

CSE 417: Algorithms and Computational Complexity
Winter 2009
W. L. Ruzzo
Dynamic Programming, II: RNA Folding

The Double Helix
[Figure: DNA double helix. Sources: http://www.rcsb.org/pdb/explore.do?structureid=1t and Los Alamos Science]

The Central Dogma of Molecular Biology
[Figure: transfers among DNA, RNA, and Protein]
Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.
[Figure labels: gene, DNA (chromosome), cell, RNA (messenger), Protein]

Non-coding RNA
- Messenger RNA: codes for proteins
- Non-coding RNA: all the rest
- Before, say, mid 1990s, 1-2 dozen known (critically important, but narrow roles)
- Since mid 90s, dramatic discoveries
- Regulation, transport, stability/degradation
- E.g. "microRNA": 1s in humans => 5% of genes
- E.g. "riboswitches": 1s in bacteria

DNA structure: dull
[Figure: two complementary strands, one written 5' to 3' and the other 3' to 5']

RNA Structure: Rich
- RNAs fold, and function
- Nature uses what works

RNA Secondary Structure: RNA makes helices too
- Usually single stranded
[Figure: folded RNA with base pairs marked and 5' and 3' ends labeled]

RNA Secondary Structure: Not everything, but important, easier than 3d
[Figure: http://www.rcsb.org/pdb/explore.do?structureid=1evv]

Why is structure important?
- For protein-coding, similarity in sequence is a powerful tool for finding related sequences
  - e.g. "hemoglobin" is easily recognized in all vertebrates
- For non-coding RNA, many different sequences have the same structure, and structure is most important for function.
  - So, using structure plus sequence, can find related sequences at much greater evolutionary distances

6S mimics an open promoter
[Figure: 6S RNA, E. coli]
Barrick et al. RNA 2005
Trotochaud et al. NSMB 2005
Willkomm et al. NAR 2005

[Figure labels: Chloroflexus aurantiacus (Chloroflexi); Geobacter metallireducens, Geobacter sulfurreducens (δ-Proteobacteria)]

Gene Regulation: The MET Repressor
In Bacteria: typical biosynthetic cycle (E. coli) around a critical metabolite (SAM)
[Figure labels: SAM, Protein, DNA. Source: Alberts, et al, 3e.]

The protein way / Riboswitch alternative
[Figure: the MET repressor protein binding DNA (Alberts, et al, 3e.) next to a SAM riboswitch]
Grundy & Henkin, Mol. Microbiol 1998
Epshtein, et al., PNAS 2003
Winkler et al., Nat. Struct. Biol. 2003

The protein way / Riboswitch alternatives
[Figure: the protein mechanism (Alberts, et al, 3e.) next to riboswitch alternatives SAM-I, SAM-II, SAM-III, SAM-IV (+2 more); boxed = confirmed riboswitch]
Grundy, Epshtein, Winkler et al., 1998, 2003
Corbino et al., Genome Biol. 2005
Fuchs et al., NSMB 2006
Weinberg et al., RNA 2008

And many other examples. Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout prokaryotic world.
Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

Vertebrates
- Bigger, more complex genomes
- <2% coding
- But >5% conserved in sequence?
- And 5-9% transcribed?
- And structural conservation, if any, invisible (without proper alignments, etc.)
- What's going on?

Fastest Human Gene?
[Figure not reproduced]

Q: What's so hard?
[Figure: sequence alignment]
A: Structure often more important than sequence

Origin of Life?
Life needs
- information carrier: DNA
- molecular machines, like enzymes: Protein
- making proteins needs DNA + RNA + proteins
- making (duplicating) DNA needs proteins
Horrible circularities! How could it have arisen in an abiotic environment?

Origin of Life?
- RNA can carry information, too (RNA double helix; RNA-directed RNA polymerase)
- RNA can form complex structures
- RNA enzymes exist (ribozymes)
- RNA can control, do logic (riboswitches)
The "RNA world" hypothesis: 1st life was RNA-based

6.5 RNA Secondary Structure
Nussinov's Algorithm: core technology for RNA structure prediction

RNA Secondary Structure
RNA. String $B = b_1 b_2 \ldots b_n$ over alphabet $\{A, C, G, U\}$.
Secondary structure. RNA is usually single-stranded, and tends to loop back and form base pairs with itself. This structure is essential for understanding behavior of molecule.
[Figure: example sequence folded into a secondary structure; complementary base pairs: A-U, C-G]

RNA Secondary Structure (somewhat oversimplified)
Secondary structure. A set of pairs $S = \{ (b_i, b_j) \}$ that satisfy:
- [Watson-Crick.] $S$ is a matching, i.e. each base pairs with at most one other, and each pair in $S$ is a Watson-Crick pair: A-U, U-A, C-G, or G-C.
- [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If $(b_i, b_j) \in S$, then $i < j - 4$.
- [Non-crossing.] If $(b_i, b_j)$ and $(b_k, b_l)$ are two pairs in $S$, then we cannot have $i < k < j < l$. (Violation of this is called a pseudoknot.)

Free energy. Usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy. Approximate by number of base pairs.

Goal. Given an RNA molecule $B = b_1 b_2 \ldots b_n$, find a secondary structure $S$ that maximizes the number of base pairs.
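The three conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_structure` and the representation of $S$ as a list of 1-based index pairs `(i, j)` are assumptions made for illustration.

```python
# Hypothetical helper: checks the three conditions from the slide.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def is_valid_structure(seq, pairs):
    """Return True if `pairs` is a legal secondary structure for `seq`.

    `seq` is a string over {A, C, G, U}; `pairs` is a list of 1-based
    index pairs (i, j) with i < j, meaning b_i pairs with b_j.
    """
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        # [Watson-Crick] each pair must be A-U, U-A, C-G, or G-C ...
        if (seq[i - 1], seq[j - 1]) not in WATSON_CRICK:
            return False
        # ... and S must be a matching: no base appears in two pairs.
        if i in used or j in used:
            return False
        used.update((i, j))
        # [No sharp turns] require i < j - 4.
        if not i < j - 4:
            return False
    # [Non-crossing] no two pairs may satisfy i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True

# Three nested G-C pairs around a loop of four unpaired bases.
print(is_valid_structure("GGGAAAACCC", [(1, 10), (2, 9), (3, 8)]))  # True
```

The crossing check is quadratic in the number of pairs, which is fine for a sanity check of this kind.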

RNA Secondary Structure: Examples
[Figure: three example structures with base pairs marked, labeled "ok", "sharp turn" (≤4), and "crossing"]

RNA Secondary Structure: Subproblems
First attempt. $OPT[j]$ = maximum number of base pairs in a secondary structure of the substring $b_1 b_2 \ldots b_j$.
[Figure: match $b_t$ and $b_j$, with positions 1, t, j marked]
Difficulty. Results in two sub-problems.
- Finding secondary structure in: $b_1 b_2 \ldots b_{t-1}$. This is $OPT(t-1)$.
- Finding secondary structure in: $b_{t+1} b_{t+2} \ldots b_{j-1}$. This is not OPT of anything; need more sub-problems.

Dynamic Programming Over Intervals: (R. Nussinov's algorithm)
Notation. $OPT[i, j]$ = maximum number of base pairs in a secondary structure of the substring $b_i b_{i+1} \ldots b_j$.
- Case 1. If $i \ge j - 4$.
  $OPT[i, j] = 0$ by no-sharp turns condition.
- Case 2. Base $b_j$ is not involved in a pair.
  $OPT[i, j] = OPT[i, j-1]$
- Case 3. Base $b_j$ pairs with $b_t$ for some $i \le t < j - 4$.
  Non-crossing constraint decouples resulting sub-problems.
  $OPT[i, j] = 1 + \max_t \{ OPT[i, t-1] + OPT[t+1, j-1] \}$
  Take max over $t$ such that $i \le t < j-4$ and $b_t$ and $b_j$ are Watson-Crick complements.
Key point: Either last base is unpaired (case 1, 2) or paired (case 3).

Bottom Up Dynamic Programming Over Intervals
Q. What order to solve the sub-problems?
A. Do shortest intervals first.

    RNA(b_1, ..., b_n) {
       for k = 5, 6, ..., n-1
          for i = 1, 2, ..., n-k
             j = i + k
             Compute OPT[i, j] using recurrence
       return OPT[1, n]
    }

[Figure: table of OPT[i, j] indexed by i and j, filled in order of increasing k]
Running time. $O(n^3)$.
Remark. Same core idea in CKY algorithm to parse context-free grammars.
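The bottom-up pseudocode above might be written in Python roughly as follows. This is a sketch under stated assumptions rather than the course's reference code: the function name `max_base_pairs` is hypothetical, and the table is padded by one row and column on each side so that the 1-based indices of the slides, including the empty intervals `OPT[i, i-1]`, can be used directly.

```python
# Hypothetical sketch of the interval DP; indices are 1-based as on the slides.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def max_base_pairs(seq):
    """Maximum number of base pairs in a secondary structure of `seq`."""
    n = len(seq)
    # opt[i][j] defaults to 0, which covers Case 1 (i >= j - 4)
    # and the empty intervals opt[i][i-1] used when t = i.
    opt = [[0] * (n + 2) for _ in range(n + 2)]
    for k in range(5, n):                # interval length j - i, shortest first
        for i in range(1, n - k + 1):
            j = i + k
            best = opt[i][j - 1]         # Case 2: b_j is unpaired
            for t in range(i, j - 4):    # Case 3: b_j pairs with b_t, i <= t < j-4
                if (seq[t - 1], seq[j - 1]) in WATSON_CRICK:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[1][n]

print(max_base_pairs("GGGAAAACCC"))  # 3
print(max_base_pairs("GAAAC"))       # 0: the only candidate pair is a sharp turn
```

The three nested loops over `k`, `i`, and `t` each run at most `n` times, which matches the $O(n^3)$ running time stated above.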

Example (n = 16): ((.(....).)..)..
[Figure: table of OPT[i, j] values for this sequence]
E.g.: OPT[1,6] = 1: (....)
E.g.: OPT[6,16] = 2: ((....)...)

Computing one cell: OPT[2,18] = ? (n = 20)
For $i < j - 4$:

$$
OPT[i, j] = \max \left\{ OPT[i, j-1],\; 1 + \max_t \left( OPT[i, t-1] + OPT[t+1, j-1] \right) \right\}
$$

Structures below are written in dot-bracket form over positions 2 through 18.
Case 1: 2 ≥ 18-4? No.
Case 2: $b_{18}$ unpaired? Always a possibility; then OPT[2,18] ≥ 3, e.g. ((....))(....)...
Case 3, 2 ≤ t < 18-4:
- t = 2: no pair
- t = 3: no pair

Case 3, 2 ≤ t < 18-4 (continued):
- t = 4: yes pair; OPT[2,18] ≥ 1+0+3, e.g. ..(...(((....))))
- t = 5: yes pair; OPT[2,18] ≥ 1+0+3, e.g. ...(..(((....))))
- t = 6: yes pair; OPT[2,18] ≥ 1+0+3, e.g. ....(.(((....))))
- t = 7: yes pair; OPT[2,18] ≥ 1+0+3, e.g. .....((((....))))

Case 3, 2 ≤ t < 18-4 (continued):
- t = 8: no pair
- t = 11: yes pair; OPT[2,18] ≥ 1+2+0, e.g. ((.....))(......)
- (not shown: t = 9, 10, 12, 13)

Computing one cell: OPT[2,18] = 4 (n = 20)

$$
OPT[i, j] =
\begin{cases}
0 & \text{if } i \ge j - 4 \\
\max \left\{ OPT[i, j-1],\; 1 + \max_t \left( OPT[i, t-1] + OPT[t+1, j-1] \right) \right\} & \text{otherwise}
\end{cases}
$$

Overall, Max = 4; several ways, e.g.: ..(...(((....))))
[Figure: table of OPT[i, j] values with a tree showing the trace back; square = case 3, octagon = case 1]

Another Trace Back Example (n = 16): ((.(....).)..)..
[Figure: table of OPT[i, j] values for this sequence]
E.g.: OPT[1,16] = 3: ((.(....).)..)..
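The trace back described above can be sketched in code: after filling the table, walk back from `OPT[1, n]`, at each cell asking which case of the recurrence produced its value. The sketch below is illustrative only. The function name `fold` is hypothetical, the sequence used is a made-up example rather than the one on the slides, and when several optimal structures exist it simply reports the first one it finds (Case 2 before Case 3, smallest `t` first).

```python
# Hypothetical sketch: fill the table, then trace back one optimal structure.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def fold(seq):
    """Return (max number of pairs, one optimal structure in dot-bracket form)."""
    n = len(seq)
    opt = [[0] * (n + 2) for _ in range(n + 2)]
    for k in range(5, n):                      # shortest intervals first
        for i in range(1, n - k + 1):
            j = i + k
            best = opt[i][j - 1]               # Case 2
            for t in range(i, j - 4):          # Case 3
                if (seq[t - 1], seq[j - 1]) in WATSON_CRICK:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best

    pairs = []
    stack = [(1, n)]                           # intervals still to explain
    while stack:
        i, j = stack.pop()
        if i >= j - 4:                         # Case 1: no pairs possible
            continue
        if opt[i][j] == opt[i][j - 1]:         # Case 2: b_j unpaired
            stack.append((i, j - 1))
            continue
        for t in range(i, j - 4):              # Case 3: find a t that achieves OPT
            if ((seq[t - 1], seq[j - 1]) in WATSON_CRICK
                    and opt[i][j] == 1 + opt[i][t - 1] + opt[t + 1][j - 1]):
                pairs.append((t, j))
                stack.append((i, t - 1))       # the two decoupled sub-problems
                stack.append((t + 1, j - 1))
                break

    chars = ["."] * n
    for i, j in pairs:
        chars[i - 1], chars[j - 1] = "(", ")"
    return opt[1][n], "".join(chars)

print(fold("GGGAAAACCC"))  # (3, '(((....)))')
```

An explicit stack replaces recursion here so that long sequences do not hit Python's recursion limit; the trace back itself visits each interval at most once and scans at most `n` values of `t` per interval.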