RNA Search and Motif Discovery



Lecture 9, CSEP 590, Autumn 2008

Outline
- Whirlwind tour of ncRNA search & discovery
- RNA motif description (Covariance Model review)
- Algorithms for searching
- Rigorous & heuristic filtering
- Motif discovery
- Applications

RNA Motif Models: Motif Description
Covariance Models (Eddy & Durbin 1994)
- aka profile stochastic context-free grammars
- aka hidden Markov models on steroids
- Model position-specific nucleotide preferences and base-pair preferences
- Pro: accurate
- Con: model building hard, search sloooow

Example: searching for tRNAs

Profile HMM Structure
- $M_j$: match states (20 emission probabilities)
- $I_j$: insert states (background emission probabilities)
- $D_j$: delete states (silent, no emission)

CM Structure
- A: sequence + structure
- B: the CM "guide tree"
- C: probabilities of letters/pairs & of indels
- Think of each branch as an HMM emitting both sides of a helix (but with the 3' side emitted in reverse order)

CM Viterbi Alignment: notation
- $x_i$ = $i$th letter of input
- $x_{ij}$ = substring $i, \ldots, j$ of input
- $T_{yz}$ = P(transition $y \to z$)
- $E^{y}_{x_i,x_j}$ = P(emission of $x_i, x_j$ from state $y$)
- $S^{y}_{ij} = \max_{\pi} \log P(x_{ij}$ gen'd starting in state $y$ via path $\pi)$


CM Viterbi Alignment: recurrence

$$S^{y}_{ij} = \max_{\pi} \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)$$

$$S^{y}_{ij} = \begin{cases}
\max_z \left[ S^{z}_{i+1,j-1} + \log T_{yz} + \log E^{y}_{x_i,x_j} \right] & \text{match pair} \\
\max_z \left[ S^{z}_{i+1,j} + \log T_{yz} + \log E^{y}_{x_i} \right] & \text{match/insert left} \\
\max_z \left[ S^{z}_{i,j-1} + \log T_{yz} + \log E^{y}_{x_j} \right] & \text{match/insert right} \\
\max_z \left[ S^{z}_{i,j} + \log T_{yz} \right] & \text{delete} \\
\max_{i < k \le j} \left[ S^{y_{\text{left}}}_{i,k} + S^{y_{\text{right}}}_{k+1,j} \right] & \text{bifurcation}
\end{cases}$$

Time $O(qn^3)$, $q$ states, seq len $n$.

Mutual Information

$$M_{ij} = \sum_{x_i,x_j} f_{x_i,x_j} \log_2 \frac{f_{x_i,x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2$$

- Max when no seq conservation but perfect pairing
- MI = expected score gain from using a pair state
- Finding optimal MI (i.e. opt pairing of cols) is hard(?)
- Finding optimal MI without pseudoknots can be done by dynamic programming
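As a rough illustration of the two MI points above, the sketch below computes $M_{ij}$ for two alignment columns and then picks a pseudoknot-free set of column pairs with maximum total MI using a Nussinov-style dynamic program. The function names and the dense matrix representation are assumptions made for this example; gap handling, pseudocounts, and minimum loop length are deliberately left out, so this is not the procedure of any particular tool.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """MI in bits between two alignment columns given as equal-length strings."""
    n = len(col_i)
    f_i = Counter(col_i)
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


def best_nested_mi(M):
    """Maximum total MI over non-crossing column pairs.

    M[k][j] (k < j) is the MI of columns k and j. Illustrative recurrence:
    column j is either unpaired, or paired with some k in i..j-1.
    """
    n = len(M)
    S = [[0.0] * n for _ in range(n)]

    def get(i, j):
        # Empty or out-of-range intervals contribute nothing.
        return S[i][j] if 0 <= i <= j < n else 0.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = get(i, j - 1)  # column j left unpaired
            for k in range(i, j):  # column j paired with column k
                best = max(best, get(i, k - 1) + M[k][j] + get(k + 1, j - 1))
            S[i][j] = best
    return S[0][n - 1] if n else 0.0
```

Two columns with no sequence conservation but perfect pairing, such as "ACGU" against "UGCA", reach the 2-bit maximum, while two fully conserved columns score 0.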

Pseudoknots: disallowed vs. allowed

$$\left( \sum_{i=1}^{n} \max_j M_{i,j} \right) / 2$$

Rfam: an RNA family DB
Griffiths-Jones, et al., NAR 03, 05
- Biggest scientific computing user in Europe: 1000 cpu cluster for a month per release
- Rapidly growing:
  - Rel 1.0, 1/03: 25 families, 55k instances
  - Rel 7.0, 3/05: 503 families, >300k instances

Rfam
- Input (hand-curated):
  - MSA "seed alignment"
  - SS_cons
  - Score Thresh T
  - Window Len W
- Output:
  - CM
  - scan results & "full alignment"

IRE (partial seed alignment): sequences from Hom.sap., Cav.por., Mus.mus., Rat.nor.
SS_cons <<<<<...<<<<<...>>>>>.>>>>>

Faster Search

Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy
Zasha Weinberg & W.L. Ruzzo
Recomb 04, ISMB 04, Bioinfo 06

RaveNnA: Genome Scale RNA Search
Typically 100x speedup over raw CM, w/ no loss in accuracy:
- drop structure from CM to create a (faster) HMM
- use that to pre-filter sequence;
- discard parts where, provably, CM score < threshold;
- actually run CM on the rest (the promising parts)
- assignment of HMM transition/emission scores is key (large convex optimization problem)
Weinberg & Ruzzo, Bioinformatics, 2004

CMs are good, but slow
- Rfam Reality: EMBL -> BLAST -> CM -> hits (plus junk); 1 month, 1000 computers
- Our Work: EMBL -> Ravenna -> CM -> hits; ~2 months, 1000 computers
- Rfam Goal: EMBL -> CM -> hits; 10 years (number of computers not legible)
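The filtering idea can be sketched generically: a cheap score that is guaranteed to be an upper bound on the expensive score lets windows be discarded with no loss of hits. In the sketch below, `filtered_search`, `fast_upper_bound`, and `slow_score` are hypothetical names, and the toy scorers (counting "GC" dinucleotides, bounded by the smaller of the G and C counts) merely stand in for the CM and its HMM filter.

```python
def filtered_search(windows, fast_upper_bound, slow_score, threshold):
    """Run slow_score only on windows whose fast upper bound reaches threshold.

    Rigorous as long as fast_upper_bound(w) >= slow_score(w) for every window w.
    Returns the hits and the number of windows that needed the slow scorer.
    """
    hits = []
    n_slow = 0
    for w in windows:
        if fast_upper_bound(w) < threshold:
            continue  # provably slow_score(w) < threshold, safe to discard
        n_slow += 1
        s = slow_score(w)
        if s >= threshold:
            hits.append((w, s))
    return hits, n_slow


# Toy stand-ins (assumptions for illustration only).
def toy_slow_score(w):
    return w.count("GC")  # "expensive" score: number of GC dinucleotides


def toy_fast_bound(w):
    return min(w.count("G"), w.count("C"))  # each GC needs one G and one C


windows = ["GCGCAT", "AAAAAA", "GGGGCA", "CCCGGG"]
hits, n_slow = filtered_search(windows, toy_fast_bound, toy_slow_score, 2)
```

Here only two of the four windows reach the slow scorer, and the one true hit is retained; the speedup in practice depends on how tight the bound is, which is why the choice of HMM scores matters.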

CM to HMM
- CM: 25 emissions per state
- HMM: 5 emissions per state, 2x states

Key Issue: 25 scores -> 10
- CM pair state P; HMM left and right states L, R
- Need: log Viterbi scores CM $\le$ HMM
- For each nucleotide pair $(a, b)$: $P_{ab} \le L_a + R_b$
- NB: HMM not a prob. model

Viterbi/Forward Scoring
- Path $\pi$ defines transitions/emissions
- Score($\pi$) = product of "probabilities" on $\pi$
- NB: ok if "probs" aren't, e.g. $\sum \ne 1$ (e.g. in CM, emissions are odds ratios vs 0th-order background)
- For any nucleotide sequence $x$:
  - Viterbi-score($x$) = $\max\{\, \text{score}(\pi) \mid \pi \text{ emits } x \,\}$
  - Forward-score($x$) = $\sum\{\, \text{score}(\pi) \mid \pi \text{ emits } x \,\}$

Rigorous Filtering
Any scores satisfying the linear inequalities ($P \le L + R$) give rigorous filtering.
Proof:
CM Viterbi path score
$\le$ "corresponding" HMM path score
$\le$ Viterbi HMM path score (even if it does not correspond to any CM path)

Some scores filter better
$P_a = 1 \le L + R_a$
$P_b = 4 \le L + R_b$
Assuming ACGU $\approx$ 25%:
- Option 1: $L = R_a = R_b = 2$, so $L + (R_a + R_b)/2 = 4$
- Option 2: $L = 0$, $R_a = 1$, $R_b = 4$, so $L + (R_a + R_b)/2 = 2.5$

Optimizing filtering
- For any nucleotide sequence $x$:
  - Viterbi-score($x$) = $\max\{\, \text{score}(\pi) \mid \pi \text{ emits } x \,\}$
  - Forward-score($x$) = $\sum\{\, \text{score}(\pi) \mid \pi \text{ emits } x \,\}$
- Expected Forward Score: $E(L_i, R_i) = \sum_{\text{all sequences } x} \text{Forward-score}(x) \cdot \Pr(x)$
  - NB: E is a function of $L_i, R_i$ only (under 0th-order background model)
- Optimization: minimize $E(L_i, R_i)$ subject to score Lin.Ineq.s
  - This is heuristic ("forward down => Viterbi down => filter down")
  - But still rigorous because "subject to score Lin.Ineq.s"

Calculating $E(L_i, R_i)$
$E(L_i, R_i) = \sum_{x} \text{Forward-score}(x) \cdot \Pr(x)$
Forward-like: for every state, calculate expected score for all paths ending there; easily calculated from expected scores of predecessors & transition/emission probabilities/scores.
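Returning to the "Some scores filter better" example, the arithmetic can be checked directly. The snippet below is only a sanity check of that two-pair toy case: the keys "a" and "b" are placeholder labels for the two right-hand nucleotides, and `satisfies` and `expected_score` are names invented here.

```python
def satisfies(P, L, R):
    """True if every pair score is covered: P[x] <= L + R[x] for all x."""
    return all(P[x] <= L + R[x] for x in P)


def expected_score(L, R):
    """L plus the mean of the R scores, i.e. right-hand letters equally likely."""
    return L + sum(R.values()) / len(R)


P = {"a": 1, "b": 4}  # pair scores from the slide

option1 = (2, {"a": 2, "b": 2})
option2 = (0, {"a": 1, "b": 4})

for L, R in (option1, option2):
    print(satisfies(P, L, R), expected_score(L, R))
```

Both options satisfy the inequalities, so both filter rigorously, but the second gives background sequence a lower expected score and therefore lets less of it through the filter.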


Minimizing $E(L_i, R_i)$
Calculate $E(L_i, R_i)$ symbolically, in terms of emission scores, so we can do partial derivatives $\partial E(L_1, L_2, \ldots) / \partial L_i$ for a numerical convex optimization algorithm.

Convex Optimization
- Convex: local max = global max; simple hill climbing works
- Nonconvex: can be many local maxima, << global max; hill-climbing fails

Estimated Filtering Efficiency (139 Rfam 4.0 families)

Filtering fraction   # families (compact)   # families (expanded)
< 10^-4              105                    110
[lost]               8                      17
[lost]               11                     3
[lost]               2                      2
[lost]               6                      4
[lost]               7                      3

~100x speedup

Results: New ncRNAs?

Name                    # found BLAST + CM   # found rigorous filter + CM   # new
Pyrococcus snoRNA       57                   180                            123
Iron response element   201                  322                            121
Histone 3' element      1004                 1106                           102
Purine riboswitch       69                   123                            54
Retron msr              11                   59                             48
Hammerhead I            167                  193                            26
Hammerhead III          251                  264                            13
U4 snRNA                283                  290                            7
S-box                   128                  131                            3
U6 snRNA                1462                 1464                           2
U5 snRNA                199                  200                            1
U7 snRNA                312                  313                            1

RNA Motif Discovery

Motif Discovery
- Typical problem: given ~10-20 unaligned sequences of ~1kb, most of which contain instances of one RNA motif of, say, 150bp, find it.
- Example: 5' UTRs of orthologous glycine cleavage genes from γ-proteobacteria

Searching for noncoding RNAs
- CMs are great, but where do they come from?
- An approach: comparative genomics
  - Search for motifs with common secondary structure in a set of functionally related sequences.
- Challenges
  - Three related tasks:
    - Locate the motif regions.
    - Align the motif instances.
    - Predict the consensus secondary structure.
  - Motif search space is huge!
    - Motif location space, alignment space, structure space.

CMfinder: A Covariance Model Based RNA Motif Finding Algorithm
Bioinformatics, 2006, 22(4)
Zizhen Yao, Zasha Weinberg, Walter L. Ruzzo
University of Washington, Seattle

Approaches
- Align sequences, then look for common structure
- Predict structures, then try to align them
- Do both together

"Obvious" Approach I: Align First, Predict from Multiple Sequence Alignment
- Compensatory mutations reveal structure (core of "comparative sequence analysis"), but usual alignment algorithms penalize them (twice)

Pfold (KH03) Test Set D
[Figure: structure prediction on the trusted alignment vs. the ClustalW alignment, plotted against evolutionary distance.]
Knudsen & Hein, Pfold: RNA secondary structure prediction using stochastic context-free grammars, Nucleic Acids Research, 2003, v 31

Pitfall for sequence-alignment-first approach
- Structural conservation ≠ sequence conservation
- Alignment without structure information is unreliable
[Figure: CLUSTALW alignment of SECIS elements with flanking regions; same-colored boxes should be aligned.]

Approaches
- Align sequences, then look for common structure
- Predict structures, then try to align them
  - single-seq struct prediction only ~60% accurate; exacerbated by flanking seq; no biologically validated model for structural alignment
- Do both together
  - Sankoff: good but slow
  - Heuristic

Our Approach: CMfinder
Simultaneous alignment, folding and CM-based motif description using an EM-style learning procedure
Yao, Weinberg & Ruzzo, Bioinformatics, 2006

Design Goals
- Find RNA motifs in unaligned sequences
- Seq conservation exploited, but not required
- Robust to inclusion of unrelated sequences
- Robust to inclusion of flanking sequence
- Reasonably fast and scalable
- Produce a probabilistic model of the motif that can be directly used for homolog search

Alignment -> CM -> Alignment
- Similar to HMM, but slower
- Builds on Eddy & Durbin, 94
- But new way to infer which columns to pair, via a principled combination of mutual information and predicted folding energy
- And, it's local, not global, alignment (harder)

CMfinder Outline
[Figure: Folding predictions -> Heuristics -> Candidate alignment -> (M step) CM -> (E step) Realign -> Search]

Initial Alignment Heuristics
- fold sequences separately
- candidates: regions with low folding energy
- compare candidates via "tree edit" algorithm
- find best "central" candidates & align to them
- BLAST anchors

EM: the M-step uses M.I. + folding energy for structure prediction

Structure Inference
- Part of the M-step is to pick a structure that maximizes data likelihood
- We combine:
  - mutual information
  - position-specific priors for paired/unpaired (based on single sequence thermodynamic folding predictions)
  - intuition: for similar seqs, little MI; fall back on single-sequence folding predictions
  - data-dependent, so not strictly Bayesian

CMfinder Accuracy (on Rfam families with flanking sequence)
[Figure: accuracy comparison; plot contents not recoverable.]

Application I
A Computational Pipeline for High Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes. Yao, Barrick, Weinberg, Neph, Breaker, Tompa and Ruzzo. PLoS Computational Biology. 3(7): e126, July 6, 2007.

Predicting New cis-regulatory RNA Elements
- Goal:
  - Given unaligned UTRs of coexpressed or orthologous genes, find common structural motifs
- Difficulties:
  - Low sequence similarity: alignment difficult
  - Varying flanking sequence
  - Motif missing from some input genes

Right Data: Why/How
- We can recognize, say, 5-10 good examples amidst 20 extraneous ones (but not 5 in 200 or 2000) of length 1k or 10k (but not 100k)
- Regulators often near regulatees (protein coding genes), which are usually recognizable cross-species
- So, find similar genes ("homologs"), look at adjacent DNA
- (Not the strategy used in vertebrates, whose genomes are many times larger)

Approach
- Get bacterial genomes
- For each gene, get close orthologs (CDD)
- Find most promising genes, based on conserved sequence motifs (Footprinter)
- From those, find structural motifs (CMfinder)
- Genome-wide search for more instances (Ravenna)
- Expert analyses (Breaker Lab, Yale)

A pipeline for RNA motif genome scans
[Figure: CDD -> Orthologous genes -> Upstream sequences -> Footprinter -> Rank datasets -> Top datasets -> CMfinder -> Motifs -> Search (genome database) -> Homologs]
Yao, Barrick, Weinberg, Neph, Breaker, Tompa and Ruzzo. A Computational Pipeline for High Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes. PLoS Computational Biology. 3(7): e126, July 6, 2007.

Genome Scale Search: Why
- Many riboswitches, e.g., are present in ~5 copies per genome
- In most close relatives
- More examples give better model, hence even more examples, fewer errors
- More examples give more clues to function: critical for wet lab verification
- But inclusion of non-examples can degrade motif

CMfinder is directly usable for/with search
[Figure: Folding predictions -> Smart heuristics -> Candidate alignment -> CM -> Realign -> Search]

Results
Have analyzed most sequenced bacteria (~2005):
- bacillus/clostridia
- gamma proteobacteria
- cyanobacteria
- actinobacteria
- firmicutes

Pipeline stages: Identify CDD group members -> 2946 CDD groups -> Retrieve upstream sequences -> Footprinter ranking -> CMfinder motifs -> Motif postprocessing -> 1740 motifs -> RaveNnA -> CMfinder refinement -> Motif postprocessing -> 1466 motifs
CPU time, in the order listed on the slide: < 10 CPU days; < 10 CPU days; 1-2 CPU months; 10 CPU months; < 1 CPU month

Table 1: Motifs that correspond to Rfam families. "Rank": the three columns show ranks for refined motif clusters after genome scans ("RV"), CMfinder motifs before genome scans ("MF"), and FootPrinter results ("FP"). We used the same ranking scheme for RV and MF. "Score" [caption truncated; the rank, score, and count columns were not recovered]

CDD ID    Gene description                                        Rfam ID   Rfam family
IlvB      Thiamine pyrophosphate-requiring enzymes                RF00230   T-box
COG3859   Predicted membrane protein                              RF00059   THI
MetH      Methionine synthase I specific DNA methylase            RF00162   S_box
COG0116   Predicted N6-adenine-specific DNA methylase             RF00011   RNaseP_bact_b
DHBP      3,4-dihydroxy-2-butanone 4-phosphate synthase           RF00050   RFN
GuaA      GMP synthase                                            RF00167   Purine
GcvP      Glycine cleavage system protein P                       RF00504   Glycine
DUF149    Uncharacterised BCR, YbaB family COG0718                RF00169   SRP_bact
CbiB      Cobalamin biosynthesis protein CobD/CbiB                RF00174   Cobalamin
LysA      Diaminopimelate decarboxylase                           RF00168   Lysine
TerC      Membrane protein TerC                                   RF00080   yybP-ykoY
MgtE      Mg/Co/Ni transporter MgtE                               RF00380   ykoK
GlmS      Glucosamine 6-phosphate synthetase                      RF00234   glmS
OpuBB     ABC-type proline/glycine betaine transport systems      RF00005   tRNA
EmrE      Membrane transporters of cations and cationic drug      RF00442   ykkC-yxkD
COG0398   Uncharacterized conserved protein                       RF00023   tmRNA

Table 2: Motif prediction accuracy vs prokaryotic subset of Rfam full alignments. "Membership": the number of sequences in the overlap between our predictions and Rfam's ("#"), the sensitivity ("Sn") and specificity ("Sp") of our membership predictions. "Overlap": avg length of overlap between our predictions and Rfam's ("nt"), the fractional lengths of the overlapped region in Rfam's predictions ("Sn") and in ours ("Sp"). "Structure": avg number of correctly predicted canonical base pairs (in overlapped regions) and the sensitivity ("Sn") and specificity ("Sp") of our predictions.

Families listed (numeric values not recovered): RF00174 Cobalamin, RF00504 Glycine, RF00234 glmS, RF00168 Lysine, RF00167 Purine, RF00050 RFN, RF00011 RNaseP_bact_b, RF00162 S_box, RF00169 SRP_bact, RF00230 T-box, RF00059 THI, RF00442 ykkC-yxkD, RF00380 ykoK, RF00080 yybP-ykoY; followed by mean and median rows.

Footnote 1: After another iteration of RaveNnA scan and refinement, the membership sensitivities of Glycine and Cobalamin increased to 76% and 98% respectively, while the specificity of Glycine remained the same, and specificity of Cobalamin dropped to 84%.

Candidate motifs (boxed on the slide = confirmed riboswitch, +2 more). Columns: Rank, #, CDD, Gene: Description, Annotation ("-" = value not recovered).

Rank  #   CDD    Gene: Description                                          Annotation
6     69  28178  DHOase IIa: Dihydroorotase                                 PyrR attenuator [22]
15    33  10097  RplL: Ribosomal protein L7/L1                              L10 r-protein leader; see Supp
19    36  10234  RpsF: Ribosomal protein S6                                 S6 r-protein leader
-     -   -      COG1179: Dinucleotide-utilizing enzymes                    6S RNA [25]
-     -   -      RpsJ: Ribosomal protein S10                                S10 r-protein leader; see Supp
-     -   -      Resolvase: N terminal domain
31    31  10164  InfC: Translation initiation factor 3                      IF-3 r-protein leader; see Supp
41    26  10393  RpsD: Ribosomal protein S4 and related proteins            S4 r-protein leader; see Supp [30]
44    30  10332  GroL: Chaperonin GroEL                                     HrcA DNA binding site [46]
-     -   -      Ribosomal L21p: Ribosomal prokaryotic L21 protein          L21 r-protein leader; see Supp
-     -   -      Cad: Cadmium resistance transporter                        [47]
-     -   -      RplB: Ribosomal protein L2                                 S10 r-protein leader
-     -   -      RNA pol Rpb2 1: RNA polymerase beta subunit
-     -   -      COG3830: ACT domain-containing protein
-     -   -      Ribosomal S2: Ribosomal protein S2                         S2 r-protein leader
-     -   -      RpsG: Ribosomal protein S7                                 S12 r-protein leader
-     -   -      COG2984: ABC-type uncharacterized transport system
-     -   -      CtsR: Firmicutes transcriptional repressor of class III    CtsR DNA binding site [48]
-     -   -      Formyl trans N: Formyl transferase
-     -   -      PurE: Phosphoribosylcarboxyaminoimidazole
-     -   -      COG4129: Predicted membrane protein
-     -   -      RplO: Ribosomal protein L15                                L15 r-protein leader
-     -   -      RpmJ: Ribosomal protein L36                                IF-1 r-protein leader
-     -   -      Cna B: Cna protein B-type domain
-     -   -      Ribosomal S12: Ribosomal protein S12                       S12 r-protein leader
-     -   -      Ribosomal L4: Ribosomal protein L4/L1 family               L3 r-protein leader
-     -   -      COG0742: N6-adenine-specific methylase                     ylbH putative RNA motif [4]
-     -   -      Pencillinase R: Penicillinase repressor                    BlaI, MecI DNA binding site [49]
-     -   -      Ribosomal S9: Ribosomal protein S9/S16                     L13 r-protein leader; Fig
-     -   -      Ribosomal L19: Ribosomal protein L19                       L19 r-protein leader; Fig
-     -   -      Gap: Glyceraldehyde-3-phosphate dehydrogenase/erythrose
-     -   -      COG4708: Predicted membrane protein
-     -   -      COG0325: Predicted enzyme with a TIM-barrel fold
-     -   -      RpmF: Ribosomal protein L32                                L32 r-protein leader
-     -   -      LDH: L-lactate dehydrogenases
-     -   -      CspR: Predicted rRNA methylase
-     -   -      FusA: Translation elongation factors                       EF-G r-protein leader

Annotations that could not be matched to a row: mRNA leader; mRNA leader; mRNA leader switch?

Application II
Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline.
Weinberg, Barrick, Yao, Roth, Kim, Gore, Wang, Lee, Block, Sudarsan, Neph, Tompa, Ruzzo and Breaker.
Nucl. Acids Res., July

New Riboswitches
- SAM IV (S-adenosyl methionine)
- SAH (S-adenosyl homocysteine)
- MOCO (Molybdenum Cofactor)
- PreQ1 II (queuosine precursor)
- GEMM (cyclic di-GMP)

GEMM regulated genes
- Pili and flagella
- Secretion
- Chemotaxis
- Signal transduction
- Chitin
- Membrane Peptide
- Other: tfoX, cytochrome c
GEMM senses a metabolite (cyclic di-GMP) produced for signal transduction or for cell-cell communication.

Utility?
- Unknown
- BUT: e.g., there are no known human riboswitches, so potentially fewer side effects from drugs that might target them
- Some such drugs (w/ previously unknown targets) have been known for decades!

ncRNA discovery in Vertebrates
Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions
E. Torarinsson, Z. Yao, E. D. Wiklund, J. B. Bramsen, C. Hansen, J. Kjems, N. Tommerup, W. L. Ruzzo and J. Gorodkin
Genome Research, Jan

ncRNA discovery in Vertebrates
- Previous studies focus on highly conserved regions (Washietl, Pedersen et al. 2007)
  - Evofold (Pedersen et al. 2006)
  - RNAz (Washietl et al. 2005)
- We explore regions with weak sequence conservation

Approach / Search in Vertebrates
- Extract ENCODE Multiz alignments
  - Remove exons, most conserved elements.
  - Result: blocks totaling 8.7M bps.
- Apply CMfinder to both strands.
  - 10,106 predictions, 6,587 clusters.
  - False positive rate: 50% based on a heuristic ranking function.
  - High false positive rate, but still suggests 1000's of RNAs.
- (We've applied CMfinder to the whole human genome: O(1000) CPU years. Analysis in progress.)
- Trust 17-way alignment for orthology, not for detailed alignment

Assoc w/ coding genes
- Many known human ncRNAs lie in introns
- Several of our candidates do, too, including some of the tested ones
  - #6: SYN3 (Synapsin 3)
  - #10: TIMP3, antisense within SYN3 intron
  - #9: GRM8 (glutamate receptor metabotropic 8)

Overlap with known transcripts
- Input regions include only one known ncRNA, hsa-mir-483, and we found it.
- 40% intergenic, 60% overlap with protein coding gene

Overlap w/ Indel Purified Segments
- IPS presumed to signal purifying selection
- Majority (64%) of candidates have >45% G+C
- Strong P-value for their overlap w/ IPS
[Table: G+C range, data set, P, N, expected and observed overlap, P-value; numeric values not recovered.]

Comparison with Evofold, RNAz
[Figure: Venn diagram of CMfinder, Evofold, and RNAz predictions.]
Small overlap (w/ highly significant p-values) emphasizes complementarity.

Alignment Matters
- Neither RNAz nor Evofold are robust on poorly conserved and gappy regions. (Of course, they weren't designed to be.)
[Figure: Realignment; % realigned vs. average pairwise sequence similarity.]

New scoring scheme
- Goal: improve false discovery rate for top ranking motifs
- Current methods cannot improve beyond 50% FDR by using a higher score threshold.
- Test on CMfinder motifs in ENCODE regions
- 10 of 11 top candidates (differentially) expressed
[Figure: False Discovery Rate; FDR vs score ranks in the original alignments.]

Summary
- ncRNA: apparently widespread, much interest
- Covariance Models: powerful but expensive tool for ncRNA motif representation, search, discovery
- Rigorous/Heuristic filtering: typically 100x speedup in search with no/little loss in accuracy
- CMfinder: good CM-based motif discovery in unaligned sequences
- Pipeline integrating comp and bio for riboswitch discovery
- Potentially many ncRNAs with weak sequence conservation in vertebrates.

Summary
- Lots of structurally conserved ncRNA
- Functional significance often unclear
- But high rate of confirmed tissue-specific expression in (small) set of top candidates in humans
- BIG CPU demands
- Still need for further methods development & application

Thanks!

Discovering ncRNAs in prokaryotes through genome-wide clustering
Elizabeth Tseng, UW CSE

Our work
- Goal
  - Clustering for homologous ncRNA prediction
- Our Approach
  - Cluster genomic sequences by homology
  - Incorporate secondary structure information
- Challenges
  - Input: large search space
  - Homology inference: what tools to use?
  - How to evaluate?

Overview
- Motivation
- Approach
  - Clustering based on homology
  - Incorporating secondary structure information
- Evaluation
- Conclusion

Overview of approach: Intergenic Region (IR) extraction
- Input: full genomic sequence for species X (+ strand 5' to 3', - strand 3' to 5')
- GenBank annotated CDS / tRNA / rRNA / repeat regions for X
- remove annotated regions
- discard IRs < 15 bps
- discard IRs adjacent to rRNA
- Output: extracted IRs for species X
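The IR extraction step might look roughly like the sketch below. The coordinate convention (0-based, half-open), the tuple layout of an annotation, and the function name `extract_intergenic` are all assumptions made for illustration; strand handling is omitted, and "adjacent to rRNA" is interpreted here as sharing a boundary with an rRNA annotation.

```python
def extract_intergenic(genome_len, annotations, min_len=15):
    """Return intergenic regions as (start, end) pairs, 0-based half-open.

    annotations: iterable of (start, end, kind), kind e.g. "CDS", "tRNA", "rRNA".
    Regions shorter than min_len, or touching an rRNA annotation, are discarded.
    """
    gaps = []
    pos = 0  # end of the annotated territory seen so far
    for start, end in sorted((s, e) for s, e, _ in annotations):
        if start > pos:
            gaps.append((pos, start))  # uncovered stretch before this annotation
        pos = max(pos, end)  # handles overlapping annotations
    if pos < genome_len:
        gaps.append((pos, genome_len))

    # Boundaries of rRNA annotations; an IR sharing one is "adjacent to rRNA".
    rrna_edges = {x for s, e, kind in annotations if kind == "rRNA" for x in (s, e)}

    return [
        (s, e)
        for s, e in gaps
        if e - s >= min_len and s not in rrna_edges and e not in rrna_edges
    ]


example = [(20, 60, "CDS"), (50, 90, "CDS"), (100, 120, "tRNA"), (150, 170, "rRNA")]
irs = extract_intergenic(200, example)
```

In this toy genome the gap 90-100 is dropped for being shorter than 15 bp, and the gaps on either side of the rRNA are dropped for adjacency, leaving only the region 0-20.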

Overview of approach: Homology search
full genomic sequences -> Intergenic Region (IR) extraction -> pool of IRs -> Homology search -> homology hits -> Hierarchical clustering

Homology search programs
- Query + Subject database (DB of sequences) -> Hits
- Each hit reports: Query segment: x1-x2; Subject segment: y1-y2; Matching score: S
- Popular homology search programs:
  - NCBI-BLAST
  - WU-BLAST
  - FASTA
  - SSEARCH
- SSEARCH uses dynamic programming to find matching regions between two sequences
- SSEARCH is 10 times slower than the rest

Hierarchical clustering
- Homology program produces a list of hits between IR segments
- IR segments -> nodes
- hit between two segments -> connecting edge
- Similarity score -> edge weight
- Example hit: Query segment: x1-x2; Subject segment: y1-y2; Matching score: S gives nodes x1-x2 and y1-y2 joined by an edge of weight S

What if segments overlap?
- Hit 1: Query segment: x1-x2; Subject segment: y1-y2; Matching score: S1
- Hit 2: Query segment: x3-x4; Subject segment: z1-z2; Matching score: S2
- What if (x1,x2) and (x3,x4) overlap by a significant portion?
- The overlapping query segments are merged into one node x1-x4; from hit y1-y2 and hit z1-z2 we infer that the region is highly conserved in sequence.

WPGMA (Weighted Pair Group Method using arithmetic Averaging)
- While there exists a connecting edge:
  - Select the edge with highest weight
  - Replace the two connected nodes with a new internal node
  - Update edge associations
[Figure: example graph on nodes A-F with weighted edges.]
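The WPGMA loop above can be sketched on a sparse hit graph as follows. The representation (a dict from node pairs to weights, trees as nested tuples) and the treatment of a missing edge as similarity 0 when averaging are assumptions of this sketch; the slides do not say how missing edges were handled.

```python
def wpgma(edges):
    """Cluster a weighted graph by repeatedly merging the heaviest edge.

    edges: dict mapping (u, v) pairs to similarity weights.
    Returns the set of resulting trees, each a nested 2-tuple of node labels.
    """
    w = {frozenset(pair): float(weight) for pair, weight in edges.items()}
    roots = {node for pair in w for node in pair}
    while w:
        best = max(w, key=w.get)  # edge with highest weight
        u, v = sorted(best, key=repr)  # deterministic order inside the tuple
        del w[best]
        merged = (u, v)  # new internal node
        roots -= {u, v}
        roots.add(merged)
        for k in roots - {merged}:
            # Update edge associations: average the two old weights,
            # counting a missing edge as 0 (assumption).
            wu = w.pop(frozenset((u, k)), 0.0)
            wv = w.pop(frozenset((v, k)), 0.0)
            if wu or wv:
                w[frozenset((merged, k))] = (wu + wv) / 2.0
    return roots


hits = {("C", "D"): 10, ("A", "C"): 4, ("A", "D"): 6, ("B", "E"): 5.5}
trees = wpgma(hits)
```

On this toy input C and D merge first, the edge from A to the new node gets weight (4 + 6) / 2 = 5, B and E merge next, and A finally joins (C, D); B-E stays a separate tree because no hit connects it to the rest.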

[Figure: WPGMA worked example; the nodes of the example graph are merged step by step, with edge weights re-averaged after each merge, until a tree remains.]

Cluster too big? Use a size threshold to cut down tree size.

Secondary structure info
[Figure: two aligned segments with structures <<<<<<... >>>>>> and <<<<.... >>>>]
- More conserved in structure than sequence
- Can we include secondary structure when searching for homologs?
- Convert 4-alphabet (A, C, G, U) to 12-alphabet {<, >, _} x {A, C, G, U}
- Allow for mismatches between letters that come from different nucleotides but the same structure
- DIY scoring matrix:
  - (<x, <x) -> great (same structure, same nucleotide)
  - (<x, <y) -> good (same structure, different nucleotide)
  - (<x, _y) -> bad (different structure)

Predicting structures on IR
Given an IR:
1. Break into overlapping pieces (prev. slide)
2. Feed each piece to RNAfold -> obtain structure
3. Convert pieces from 4-alphabet to 12-alphabet
4. Use homology program with DIY scoring matrix
5. Same clustering process

INPUT: Genomic sequences -> IR extraction -> Incorporate structure info -> Homology search program -> Hierarchical clustering -> Tree cutting -> OUTPUT: Clusters of IR segments
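Steps 3 and 4 can be illustrated with a small sketch. It assumes the structure for each piece is already available as a dot-bracket string (the notation RNAfold prints) and does not call RNAfold itself; the numeric values for "great", "good", and "bad" are placeholders, since the actual matrix entries are not given in the slides.

```python
from itertools import product

STRUCT = {"(": "<", ")": ">", ".": "_"}  # dot-bracket symbol -> slide notation


def to_structure_alphabet(seq, dotbracket):
    """Pair each nucleotide with its structure symbol, e.g. 'G' and '(' -> '<G'."""
    if len(seq) != len(dotbracket):
        raise ValueError("sequence and structure must have equal length")
    return [STRUCT[s] + n for n, s in zip(seq, dotbracket)]


def pair_score(a, b, great=2, good=1, bad=-2):
    """DIY score for two 12-alphabet letters (numeric values are placeholders)."""
    if a[0] != b[0]:
        return bad  # different structure symbol
    return great if a[1] == b[1] else good  # same structure; same or different base


# Full 12 x 12 matrix, as a homology search program with a custom matrix would need.
LETTERS = [s + n for s, n in product("<>_", "ACGU")]
MATRIX = {(a, b): pair_score(a, b) for a in LETTERS for b in LETTERS}

encoded = to_structure_alphabet("GCAU", "(..)")
```

A practical wrinkle not shown here is that most search programs expect single-character symbols, so each of the 12 two-character letters would need to be mapped onto some single character before use.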

Example of a good scan
[Figure: ncRNA motif within an IR segment.]
CM scan recovered 95% of glmS with NO false hits!

Example of a bad scan
CM scan returned ~5000 false hits with 6 ylbH positive hits.
