ROC TPR FPR

Size: px

Start display at page:

Download "ROC TPR FPR"

Kory Hicks
5 years ago
Views:

1 HW5

2 2

3 3

4 TPR RO TPR = True Positive Rate = Sensitivity = Recall = TP/(TP+FN) FPR = False Positive Rate = 1 Specificity = FP/(FP+TN) Precision, aka PPV = TP/(TP+FP) FPR 4

5 TPR FPR 5

6 * scores * orfs$len 6

7 scores log2(orfs$len) 7

8 FPR FPR scores scores TPR TPR orfs$len log2(orfs$len)

9 RNA Search and Motif Discovery SEP 590 A omputational Biology

10 15

11 Approaches to Structure Prediction Maximum Pairing + works on single sequences + simple - too inaccurate Minimum Energy + works on single sequences - ignores pseudoknots - only finds optimal fold Partition Function + finds all folds - ignores pseudoknots

12 Optimal pairing of r i... r j Two possibilities j Unpaired: Find best pairing of r i... r j-1 j! i! j-1! j Paired (with some k): Find best r i... r k-1 + best r k+1... r j-1 plus 1 i! k-1! k! Why is it slow? Why do pseudoknots matter? j! j-1! k+1!

13 Nussinov: A omputation Order B(i,j) = # pairs in optimal pairing of r i... r j B(i,j) = 0 for all i, j with i j-4; otherwise B(i,j) = max of: B(i,j-1) Or energy max { B(i,k-1)+1+B(k+1,j-1) i k < j-4 and r k -r j may pair} Time: O(n 3 ) Loop-based energy version is better; recurrences similar, slightly messier K=

14 Today Structure prediction via comparative analysis ovariance Models (Ms) represent RNA sequence/structure motifs Fast M search Motif Discovery Applications in prokaryotes & vertebrates 19

15 Approaches, II omparative sequence analysis + handles all pairings (potentially incl. pseudoknots) - requires several (many?) aligned, appropriately diverged sequences Stochastic ontext-free rammars Roughly combines min energy & comparative, but no pseudoknots Physical experiments (x-ray crystalography, NMR) Today

16 ovariation is strong evidence for base pairing 21

17 Example: Ribosomal Autoregulation: Excess L19 represses L19 (RF00556; similar) A B P2 L19 (rpls) mrna leader Y U A U A R R Y U R R U 5' Y 3' P1 N N N TSS nucleotide identity nucleotide present 97% 97% 90% 90% 75% 75% 50% A stem loop always present compensatory mutations compatible mutations Watson-rick base pair other base interaction 5' P1 P2 RBS Start A A U U U U U A A U A U U 3' U U U A A A U A A U A U A A U U U U B. subtilis L19 mrna leader U U U U U L19 U U U A A A A U U U U U A U A 3' U A A U U U A 5' A A A A? 22

18 Mutual Information M ij = f f xi,xj xi,xj log xi,xj 2 f xi f xj ; 0 M ij 2 Max when no seq conservation but perfect pairing MI = expected score gain from using a pair state (below) Finding optimal MI, (i.e. opt pairing of cols) is hard(?) Finding optimal MI without pseudoknots can be done by dynamic programming 23

19 27

20 omputational Problems How to predict secondary structure How to model an RNA motif (I.e., sequence/structure pattern) iven a motif, how to search for instances iven (unaligned) sequences, find motifs How to score discovered motifs How to leverage prior knowledge 29

21 Motif Description

22 RNA Motif Models ovariance Models (Eddy & Durbin 1994) aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences Pro: accurate on: model building hard, search slow 32

23 Eddy & Durbin 1994: What A probabilistic model for RNA families The ovariance Model A Stochastic ontext-free rammar A generalization of a profile HMM Algorithms for Training From aligned or unaligned sequences Automates comparative analysis omplements Nusinov/Zucker RNA folding Algorithms for searching 33

24 Main Results Very accurate search for trna (Precursor to trnascanse - current favorite) iven sufficient data, model construction comparable to, but not quite as good as, human experts Some quantitative info on importance of pseudoknots and other tertiary features 34

25 Probabilistic Model Search As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model You set a score threshold Anything above threshold a hit Scoring: Forward / Inside algorithm - sum over all paths Viterbi approximation - find single best path (Bonus: alignment & structure prediction) 35

26 Example: searching for trnas 36

27 Profile Hmm Structure Mj: Match states (20 emission probabilities) Ij: Insert states (Background emission probabilities) Dj: Delete states (silent - no emission) 38

28 How to model an RNA Motif? onceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position given a new seq, estimate likelihood that it could be generated by the model, & align it to the model mostly all del ins 39

29 How to model an RNA Motif? Add column pairs and pair emission probabilities for base-paired regions <<<<<<< paired columns >>>>>>> 40

30 Profile Hmm Structure Mj: Match states (20 emission probabilities) Ij: Insert states (Background emission probabilities) Dj: Delete states (silent - no emission) 41

31 M Structure A: Sequence + structure B: the M guide tree : probabilities of letters/ pairs & of indels Think of each branch being an HMM emitting both sides of a helix (but 3 side emitted in reverse order) 47

32 Overall M Architecture One box ( node ) per node of guide tree BE/MATL/INS/DEL just like an HMM MATP & BIF are the key additions: MATP emits pairs of symbols, modeling basepairs; BIF allows multiple helices 48

33 M Viterbi Alignment (the inside algorithm) x i x ij = i th letter of input = substring i,..., j of input T yz = P(transition y z) y E xi,x j = P(emission of x i, x j from state y) S ij y = max π log P(x ij gen'd starting in state y via path π) 49

34 M Viterbi Alignment (the inside algorithm) S ij y = max π log P(x ij generated starting in state y via path π) S ij y = % z max z [S i+1, j 1 ' z max z [S i+1, j ' z & max z [S i, j 1 ' z ' max z [S i, j (' max i<k j [S i,k y left y + logt yz + log E xi,x j y + logt yz + log E xi ] match pair ] match/insert left + logt yz + log E x j ] match/insert right + logt yz ] delete y + S right k +1, j ] bifurcation Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 50

35 An Important Application: Rfam

36 Rfam an RNA family DB riffiths-jones, et al., NAR 03, 05, 08, 11, 12 Was biggest scientific comp user in Europe cpu cluster for a month per release Rapidly growing: Rel 1.0, 1/03: 25 families, 55k instances Rel 7.0, 3/05: 503 families, 363k instances Rel 9.0, 7/08: 603 families, 636k instances Rel 9.1, 1/09: 1372 families, 1148k instances Rel 10.0, 1/10: 1446 families, 3193k instances Rel 11.0, 8/12: 2208 families, 6125k instances DB size: ~8B ~160B ~320B 58

RF00037: Example Rfam Family Input (hand-curated): MSA seed alignment SS_cons Score Thresh T Window Len W Output: M scan results & full alignment phylogeny, etc. IRE (partial seed alignment): Hom.sap.

37 RF00037: Example Rfam Family Input (hand-curated): MSA seed alignment SS_cons Score Thresh T Window Len W Output: M scan results & full alignment phylogeny, etc. IRE (partial seed alignment): Hom.sap. UUUUUAAAUUUUAUAA Hom.sap. UUUUU.UUAAAUUUUAUAA Hom.sap. UUUUUUUAAAUUUA.AA Hom.sap. UUUAU..AUAAAUUAU.AUAAA Hom.sap. UUUUUUAAAUUUUAUAA Hom.sap. AUUAU..AAAUUUU.AUAAU Hom.sap. UUU..UUAAAUUUUAAA Hom.sap. UUAU..AAAUAUU.AUAU Hom.sap. AUUAU..AAAUUU.AUAAU av.por. UUUUUAAAUUUAA Mus.mus. UAUAU..AAAUAUU.AUAU Mus.mus. UUUUUUAAAUUUAAAA Mus.mus. UAUUUUAAAUUUUAAAA Rat.nor. UAUAU..AAAUAU.AUAU Rat.nor. UAUUUUUAAAUUUUAAA SS_cons <<<<<...<<<<<...>>>>>.>>>>> 62

38 An Important Need: Faster Search

39 Homology search Homolog similar by descent from common ancestor Sequence-based Smith-Waterman FASTA BLAST For RNA, sharp decline in sensitivity at ~60-70% identity So, use structure, too 73

40 Impact of RNA homology search B. subtilis! glycine riboswitch! (Barrick, et al., 2004)! operon! L. innocua! A. tumefaciens! V. cholera! M. tuberculosis! (and 19 more species)! 74

41 Impact of RNA homology search B. subtilis! L. innocua! glycine riboswitch! (Barrick, et al., 2004)! operon! (Mandal, et al., 2004)! A. tumefaciens! V. cholera! M. tuberculosis! (and 19 more species)! (and 42 more species)! BLAST-based M-based 75

42 Faster enome Annotation of Non-coding RNAs Without Loss of Accuracy Zasha Weinberg & W.L. Ruzzo Recomb 04, ISMB 04, Bioinfo 06

43 RaveNnA: enome Scale RNA Search Typically 100x speedup over raw M, w/ no loss in accuracy: Drop structure from M to create a (faster) HMM Use that to pre-filter sequence; Discard parts where, provably, M score < threshold; Actually run M on the rest (the promising parts) Assignment of HMM transition/emission scores is key (a large convex optimization problem) Weinberg & Ruzzo, Bioinformatics, 2004,

44 M s are good, but slow Rfam Reality EMBL Our Work EMBL Rfam oal EMBL BLAST Ravenna junk M hits 1 month, 1000 computers M hits ~2 months, 1000 computers M junk hits 10 years, 1000 computers 79

45 Oversimplified M (for pedagogical purposes only) A U A U A U A U 81

46 M to HMM A U A U A U A U A U A U A U A U M HMM emisions per state 5 emissions per state, 2x states

47 Key Issue: 25 scores 10 M HMM A U P A U A U L R A U Need: log Viterbi scores M HMM 83

48 Key Issue: 25 scores 10 M A U P A U Need: log Viterbi scores M HMM P AA L A + R A P A L A + R P A L A + R P AU L A + R U P A L A + R A U L P A L + R A P L + R P L + R P U L + R U P L + R R HMM A U NB: HMM not a prob. model 85

49 Rigorous Filtering Any scores satisfying the linear inequalities give rigorous filtering Proof: M Viterbi path score corresponding HMM path score Viterbi HMM path score (even if it does not correspond to any M path) P AA L A + R A P A L A + R P A L A + R P AU L A + R U P A L A + R 86

50 Minimizing E(L i, R i ) (subject to linear constraints) alculate E(L i, R i ) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm Forward: Viterbi: E(L 1, L 2,...) L i 90

51 Assignment of scores/ probabilities onvex optimization problem onstraints: enforce rigorous property Objective function: filter as aggressively as possible Problem sizes: variables inequality constraints 91

52 onvex Optimization onvex: local max = global max; simple hill climbing works Nonconvex: can be many local maxima, global max; hill-climbing fails 92

53 Estimated Filtering Efficiency (139 Rfam 4.0 families) Filtering fraction # families (compact) # families (expanded) break even < Averages 283 times faster than M! ~100x speedup 93

54 Motif Discovery

55 RNA Motif Discovery Would be great if: given 100 complete genomes from diverse species, we could automatically find all the RNAs. State of the art: that s hopeless Hope: can we exploit biological knowledge to narrow the search space? 130

56 RNA Motif Discovery More promising problem: given a unaligned sequences of a few kb, most of which contain instances of one RNA motif of bp -- find it. Example: 5 UTRs of orthologous glycine cleavage genes from γ-proteobacteria Example: corresponding introns of orthogolous vertebrate genes Orthologs = counterparts in different species 131

57 Approaches Align-First: Align sequences, then look for common structure Fold-First: Predict structures, then try to align them Joint: Do both together 132

58 Pitfall for sequence-alignment- first approach Structural conservation Sequence conservation Alignment without structure information is unreliable LUSTALW alignment of SEIS elements with flanking regions same-colored boxes should be aligned 134

59 Approaches Align-first: align sequences, then look for common structure Fold-first: Predict structures, then try to align them single-seq struct prediction only ~ 60% accurate; exacerbated by flanking seq; no biologicallyvalidated model for structural alignment Joint: Do both together Sankoff good but slow Heuristic 139

60 Our Approach: Mfinder Simultaneous local alignment, folding and Mbased motif description using an EM-style learning procedure Yao, Weinberg & Ruzzo, Bioinformatics,

61 MFinder Simultaneous alignment, folding & motif description Yao, Weinberg & Ruzzo, Bioinformatics, 2006 Folding predictions ombines folding & mutual information in a principled way. Smart heuristics andidate alignment M Mutual Information Realign EM 146

62 Mfinder Accuracy (on Rfam families with flanking sequence) /W /W 153

63 Discovery in Bacteria 156

64 Right Data: Why/How We can recognize, say, 5-10 good examples amidst 20 extraneous ones (but not 5 in 200 or 2000) of length 1k or 10k (but not 100k) Regulators often near regulatees (protein coding genes), which are usually recognizable cross-species So, look near similar genes ( homologs ) Many riboswitches, e.g., are present in ~5 copies per genome (Not strategy used in vertebrates x larger genomes) 164

65 Processing Times Identify DD group members 2946 DD groups Retrieve upstream sequences < 10 PU days Footprinter ranking < 10 PU days Input from ~70 complete Firmicute genomes available in late 2005-early 2006, totaling ~200 megabases Mfinder motifs Motif postprocessing 1740 motifs RaveNnA Mfinder refinement 1 ~ 2 PU months 10 PU months < 1 PU month Motif postprocessing 1466 motifs 179

66 Table 1: Motifs that correspond to Rfam families Rank Score # DD Rfam RAV MF FP RAV MF ID ene Description IlvB Thiamine pyrophosphate-requiring enzymes RF00230 T-box O3859 Predicted membrane protein RF00059 THI MetH Methionine synthase I specific DNA methylase RF00162 S_box O0116 Predicted N6-adenine-specific DNA methylase RF00011 RNaseP_bact_b DHBP 3,4-dihydroxy-2-butanone 4-phosphate synthase RF00050 RFN uaa MP synthase RF00167 Purine cvp lycine cleavage system protein P RF00504 lycine DUF149 Uncharacterised BR, YbaB family O0718 RF00169 SRP_bact bib obalamin biosynthesis protein obd/bib RF00174 obalamin LysA Diaminopimelate decarboxylase RF00168 Lysine Ter Membrane protein Ter RF00080 yybp-ykoy MgtE Mg/o/Ni transporter MgtE RF00380 ykok lms lucosamine 6-phosphate synthetase RF00234 glms OpuBB AB-type proline/glycine betaine transport systems RF00005 trna EmrE Membrane transporters of cations and cationic drug RF00442 ykk-yxkd O0398 Uncharacterized conserved protein RF00023 tmrna Table 1: Motifs that correspond to Rfam families. Rank : the three columns show ranks for refined motif clusters after genome scans ( RAV ), Mfinder motifs before genome scans ( MF ), and FootPrinter results ( FP ). We used the same ranking scheme for RAV and MF. Score 180

67 Rfam Membership Overlap Structure # Sn Sp nt Sn Sp bp Sn Sp RF00174 obalamin RF00504 lycine RF00234 glms RF00168 Lysine RF00167 Purine RF00050 RFN RF00011 RNaseP_bact_b RF00162 S_box RF00169 SRP_bact RF00230 T-box RF00059 THI RF00442 ykk-yxkd RF00380 ykok RF00080 yybp-ykoy mean median Tbl 2: Prediction accuracy compared to prokaryotic subset of Rfam full alignments. Membership: # of seqs in overlap between our predictions and Rfam s, the sensitivity (Sn) and specificity (Sp) of our membership predictions. Overlap: the avg len of overlap between our predictions and Rfam s (nt), the fractional lengths of the overlapped region in Rfam s predictions (Sn) and in ours (Sp). Structure: the avg # of correctly predicted canonical base pairs (in overlapped regions) in the secondary structure (bp), and sensitivity and specificity of our predictions. 1 After 2nd RaveNnA scan, membership Sn of lycine, obalamin increased to 76% and 98% resp., lycine Sp unchanged, but obalamin Sp dropped to 84%. 184

68 Example: Ribosomal Autoregulation: Excess L19 represses L19 (RF00556; similar) A B P2 L19 (rpls) mrna leader Y U A U A R R Y U R R U 5' Y 3' P1 N N N TSS nucleotide identity nucleotide present 97% 97% 90% 90% 75% 75% 50% A stem loop always present compensatory mutations compatible mutations Watson-rick base pair other base interaction 5' P1 P2 RBS Start A A U U U U U A A U A U U 3' U U U A A A U A A U A U A A U U U U B. subtilis L19 mrna leader U U U U U L19 U U U A A A A U U U U U A U A 3' U A A U U U A 5' A A A A? 186

69 Examples: 6 (of 22) Representative motifs Sudarsan, et al Science, 2008! EMM! Wang, et al Mol ell, 2008! SAH! suca! boxed = confirmed riboswitch SAM-IV! Weinberg et al RNA 08! Moo! Regulski et al Mol Microbiol 08! Legend! O4708! Weinberg, et al. Nucl. Acids Res., July : ! Meyer, et al RNA, 2008! 191

70 Vertebrate ncrnas Some Results

71 Human Predictions Evofold S Pedersen, Bejerano, A Siepel, K Rosenbloom, K Lindblad-Toh, ES Lander, J Kent, W Miller, D Haussler, "Identification and classification of conserved RNA secondary structures in the human genome." PLoS omput. Biol., 2, #4 (2006) e33. 48,479 candidates (~70% FDR?) RNAz S Washietl, IL Hofacker, M Lukasser, A Hutenhofer, PF Stadler, "Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome." Nat. Biotechnol., 23, #11 (2005) ,000 structured RNA elements 1,000 conserved across all vertebrates. ~1/3 in introns of known genes, ~1/6 in UTRs ~1/2 located far from any known gene FOLDALIN E Torarinsson, M Sawera, JH Havgaard, M Fredholm, J orodkin, "Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure." enome Res., 16, #7 (2006) candidates from (of 100,000) pairs Mfinder Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and orodkin. omparative genomics beyond sequence based alignments: RNA structures in the ENODE regions. enome Research, Feb 2008, 18(2): PMID: candidates in ENODE alone (better FDR, but still high) Some details below

72 Mfinder Search in Vertebrates Extract ENODE * Multiz alignments Remove exons, most conserved elements blocks, 8.7M bps. Apply Mfinder to both strands. 10,106 predictions, 6,587 clusters. High false positive rate, but still suggests 1000 s of RNAs. (We ve applied Mfinder to whole human genome: many 100 s of PU years. Analysis in progress.) Trust 17-way alignment for orthology, not for detailed alignment * ENODE: deeply annotated 1% of human genome 216

DNA Using a Neutral Indel Model erton Lunter, hris P.

73 Log 10 (ount) enomic Intergap Distances (Human-Mouse) Length enome-wide Identification of Human Functional DNA Using a Neutral Indel Model erton Lunter, hris P. Ponting, Jotun Hein, PLoS omput Biol 2006, 2(1): e5. 220

74 Overlap w/ Indel Purified Segments IPS presumed to signal purifying selection Majority (64%) of candidates have >45% + Strong P-value for their overlap w/ IPS + data P N Expected Observed P-value % 0-35 igs % igs % igs % igs E % igs E % all igs E % 221

75 Realignment % realigned Average pairwise sequence similarity 225

76 Alignment Matters The original MULTIZ alignment without flanking regions. RNAz Score: (no RNA) Human TATTAAAATT-TTTAAAAAAT----TTAAATATAAAAAATAATT himp AATTTAATT-ATTTAAAAAT----ATTAAATATAAAATAAATT ow TATTTAAAATT-ATAAA--AAAAT----TTAATTTAAAAATTAATT Dog TATTTAAAATTTTAATA--AAAAAT----TTAATTTAAAATATTAATT Rabbit ATATTTAAAATTT-TTTTAATAAAAT----TTAATTATAAAATTAAATT Rhesus TATTAAAATT-TTTAAAAAATATTTAAATATAAAAAATAATT Str ((((((...(((((((...(((...)))..))))...)))...))))))...( The local Mfinder re-alignment of the MULTIZ block. RNAz Score: (RNA) Human TATTAAAATT-TTTAAA-A-----AATTTAAATATAAAAAATAA himp AATTTAATT-ATTT-AAA-----AATATTAAATATAAAATAAA ow TATTTAAAATT-ATAAA--AAA------ATTTAATTTAAAAATTAA Dog TATTTAAAATTTTAATA--AAA-A-----ATTTAATTTAAAATATTAA Rabbit ATATTTAAAATTT-TTTT-AATA-----AAATTTAATTATAAAATTAAA Rhesus TATTAAAATT-TTTAAA-AAA-TATTTAAATATAAAAAATAA Str ((((((...((((((((..(((...)))...))))))))...))))))

77 10 of 11 top (differentially) expressed 228

78 ncrna Summary ncrna is a hot topic For family homology modeling: Ms Training & search like HMM (but slower) Dramatic acceleration possible Automated model construction possible New computational methods yield new discoveries Many open problems 247

79 ourse Wrap Up

80 What is DNA? RNA? How many Amino Acids are there? Did human beings, as we know them, develop from earlier species of animals? What are stem cells? What did Viterbi invent? What is dynamic programming? What is a likelihood ratio test? What is the EM algorithm? How would you find the maximum of f(x) = ax 3 + bx 2 + cx +d in the interval -10<x<25? 249

81 Sensors High-Throughput DNA sequencing Microarrays/ene expression Mass Spectrometry/Proteomics BioTech Protein/protein & DNA/protein interaction ontrols loning ene knock out/knock in RNAi Floods of data rand hallenge problems 250

82 Exciting Times Lots to do Highly multidisciplinary You ll be hearing a lot more about it I hope I ve given you a taste of it

83 Thanks!

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure