RNA Search and! Motif Discovery"

Size: px
Start display at page:

Download "RNA Search and! Motif Discovery""

Transcription

1 RN Search and! Motif Discovery" Lecture 9! SEP 590! utumn 2008" Outline" Whirlwind tour of ncrn search & discovery" RN motif description (ovariance Model Review)" lgorithms for searching" Rigorous & heuristic filtering" Motif discovery" pplications" 2 RN Motif Models" Motif Description" ovariance Models (Eddy & Durbin 1994)" aka profile stochastic context-free grammars" aka hidden Markov models on steroids" Model position-specific nucleotide preferences and base-pair preferences" Pro: accurate" on: model building hard, search sloooow" 7 8

2 Example: searching for trns Profile Hmm Structure" 13 Mj: "Match states (20 emission probabilities)" Ij: "Insert states (Background emission probabilities)" Dj: "Delete states (silent - no emission)" 17 M Structure" M Viterbi lignment" : Sequence + structure" B: the M guide tree " : probabilities of letters/ pairs & of indels" Think of each branch being an HMM emitting both sides of a helix (but 3 side emitted in reverse order)" 18 x i x ij = i th letter of input = substring i,..., j of input T yz = P(transition y " z) y E xi,x j S ij y = P(emission of x i, x j from state y) = max # log P(x ij gen'd starting in state y via path #) 20

3 mrn leader S ij y = max " log P(x ij generated starting in state y via path ") % z y max z [S i+1, j#1 + logt yz + log E xi,x j ] match pair ' z y max z [S i+1, j + logt yz + log E xi ] match/insert left ' S y z ij = & max z [S i, j#1 + logt yz + log E y x j ] match/insert right ' z ' max z [S i, j + logt yz ] delete y max i<k$ j [S left y i,k + S right (' k +1, j ] bifurcation mrn leader switch? Time O(qn 3 ), q states, seq len n Mutual Information" f M ij = " f xi,xj xi,xj log xi,xj 2 ; 0 # M ij # 2 f xi f xj Max when no seq conservation but perfect pairing" MI = expected score gain from using a pair state" Finding optimal MI, (i.e. opt pairing of cols) is hard(?)" Finding optimal MI without pseudoknots can be done by dynamic programming" 24 26

4 Pseudoknots disallowed allowed # n % & " max j M $ i=1 i, j (/2 ' Rfam an RN family DB! riffiths-jones, et al., NR 03, 05" Biggest scientific computing user in Europe cpu cluster for a month per release" Rapidly growing:" Rel 1.0, 1/03: 25 families, 55k instances" Rel 7.0, 3/05: 503 families, >300k instances" 33 Input (hand-curated):" MS seed alignment " SS_cons" Score Thresh T" Window Len W" Output:" M" scan results & full alignment " Rfam" IRE (partial seed alignment):! Hom.sap. Hom.sap.. Hom.sap.. Hom.sap.... Hom.sap. Hom.sap.... Hom.sap... Hom.sap.... Hom.sap.... av.por. Mus.mus.... Mus.mus. Mus.mus. Rat.nor.... Rat.nor. SS_cons <<<<<...<<<<<...>>>>>.>>>>> 34

5 Faster Search" Faster enome nnotation! of Non-coding RNs! Without Loss of ccuracy" Zasha Weinberg " & W.L. Ruzzo" Recomb 04, ISMB 04, Bioinfo 06" RaveNn: enome Scale! RN Search" Typically 100x speedup over raw M, w/ no loss in accuracy: " " "drop structure from M to create a (faster) HMM" "use that to pre-filter sequence; " "discard parts where, provably, M score < threshold;" "actually run M on the rest (the promising parts)" "assignment of HMM transition/emission scores is key " ""(large convex optimization problem)" Weinberg & Ruzzo, Bioinformatics, 2004, junk M s are good, but slow! Rfam Reality EMBL BLST M hits 1 month, 1000 computers Our Work EMBL Ravenna M hits ~2 months, 1000 computers junk Rfam oal EMBL M hits 10 years, computers

6 M M to HMM" HMM M Key Issue: 25 scores " 10" HMM P L R Need: log Viterbi scores M! HMM" 25 emisions per state 5 emissions per state, 2x states Viterbi/Forward Scoring" Path! defines transitions/emissions" Score(!) = product of probabilities on!" NB: ok if probs aren t, e.g.!"1! (e.g. in M, emissions are odds ratios vs! 0th-order background)" For any nucleotide sequence x:" Viterbi-score(x) = max{ score(!)! emits x}" Forward-score(x) =!{ score(!)! emits x}" 45 M Key Issue: 25 scores " 10" P Need: log Viterbi scores M! HMM" P! L + R P! L + R P! L + R P! L + R P! L + R L P! L + R P! L + R P! L + R P! L + R P! L + R R HMM 46 NB: HMM not a prob. model

7 Rigorous Filtering" ny scores satisfying the linear inequalities give rigorous filtering! Proof:! M Viterbi path score!! corresponding HMM path score!! Viterbi HMM path score! (even if it does not correspond to any M path)" P! L + R P! L + R P! L + R P! L + R P! L + R 47 Some scores filter better" P = 1! L + R " P = 4! L + R " " " " " " " ssuming # 25% " Option 1: " " " "Opt 1:" L = R = R = 2 " " L + (R + R )/2 = 4 " Option 2: " " " "Opt 2:" L = 0, R = 1, R = 4 " L + (R + R )/2 = 2.5" 48 Optimizing filtering" For any nucleotide sequence x:" Viterbi-score(x) = max{ score(!)! emits x }" Forward-score(x) =!{ score(!)! emits x }" Expected Forward Score" E(L i, R i ) =! all sequences x Forward-score(x)*Pr(x)" NB: E is a function of L i, R i only" nder 0th-order background model Optimization:! Minimize E(L i, R i ) subject to score Lin.Ineq.s" This is heuristic ( forward$ % Viterbi$ % filter$ )" But still rigorous because subject to score Lin.Ineq.s " 49 alculating E(L i, R i )" E(L i, R i ) =! x Forward-score(x)*Pr(x)" Forward-like: for every state, calculate expected score for all paths ending there; easily calculated from expected scores of predecessors & transition/emission probabilities/scores" 50

8 Minimizing E(L i, R i )" onvex Optimization" alculate E(L i, R i ) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm" Forward: Viterbi: onvex:! local max = global max;" simple hill climbing works" Nonconvex:! can be many local maxima, << global max;! hill-climbing fails" "E(L 1, L 2,...) "L i Estimated Filtering Efficiency! (139 Rfam 4.0 families)" Filtering fraction" # families (compact)" # families (expanded)" < 10-4 " 105" 110" " 8" 17" " 11" 3" " 2" 2" " 6" 4" " 7" 3" ~100x speedup 53 Name" Results: New ncrn s?" # found" BLST! + M" # found rigorous filter + M" # new" Pyrococcus snorn" 57" 180" 123" Iron response element" 201" 322" 121" Histone 3 element" 1004" 1106" 102" Purine riboswitch" 69" 123" 54" Retron msr" 11" 59" 48" Hammerhead I" 167" 193" 26" Hammerhead III" 251" 264" 13" 4 snrn" 283" 290" 7" S-box" 128" 131" 3" 6 snrn" 1462" 1464" 2" 5 snrn" 199" 200" 1" 7 snrn" 312" 313" 1" 54

9 RN Motif Discovery" Motif Discovery" Typical problem: given a ~10-20 unaligned sequences of ~1kb, most of which contain instances of one RN motif of, say, 150bp -- find it." Example: 5 TRs of orthologous glycine cleavage genes from &-proteobacteria" Searching for noncoding RNs" M s are great, but where do they come from?" n approach: comparative genomics" Search for motifs with common secondary structure in a set of functionally related sequences." hallenges" Three related tasks" Locate the motif regions." lign the motif instances." Predict the consensus secondary structure." Motif search space is huge!" Motif location space, alignment space, structure space." 62 mfinder-- ovariance! Model Based RN Motif! Finding lgorithm Bioinformatics, 2006, 22(4): Zizhen Yao" Zasha Weinberg" Walter L. Ruzzo" niversity of Washington, Seattle" 63

10 pproaches" Obvious pproach I: lign First,! Predict from Multiple Sequence lignment" lign sequences, then look for common structure" Predict structures, then try to align them" Do both together" " " " " " " ompensatory mutations reveal structure, (core of comparative sequence analysis ) but usual alignment algorithms penalize them (twice) Pfold (KH03) Test Set D Trusted alignment Pitfall for sequence-alignmentfirst approach" Structural conservation " Sequence conservation" lignment without structure information is unreliable" LSTLW alignment of SEIS elements with flanking regions lustalw lignment Evolutionary Distance Knudsen & Hein, Pfold: RN secondary structure prediction using stochastic context-free grammars, Nucleic cids Research, 2003, v 31, same-colored boxes should be aligned 67

11 pproaches" lign sequences, then look for common structure" Predict structures, then try to align them" single-seq struct prediction only ~ 60% accurate; exacerbated by flanking seq; no biologicallyvalidated model for structural alignment" Do both together" Sankoff good but slow" Heuristic" 68 Our pproach: Mfinder" Simultaneous alignment, folding and Mbased motif description using an EM-style learning procedure" Yao, Weinberg & Ruzzo, Bioinformatics, Design oals" Find RN motifs in unaligned sequences" Seq conservation exploited, but not required" Robust to inclusion of unrelated sequences" Robust to inclusion of flanking sequence" Reasonably fast and scalable" Produce a probabilistic model of the motif that can be directly used for homolog search" lignment " M " lignment" Similar to HMM, but slower" Builds on Eddy & Durbin, 94" But new way to infer which columns to pair, via a principled combination of mutual information and predicted folding energy" nd, it s local, not global, alignment (harder)" 70 71

12 Folding predictions Mfinder Outline" M step Heuristics andidate alignment M E step Realign Search Initial lignment Heuristics" fold sequences separately" candidates: regions with low folding energy" compare candidates via tree edit algorithm" find best central candidates & align to them" BLST anchors" EM M-step uses M.I. + folding energy for structure prediction Structure Inference" Part of M-step is to pick a structure that maximizes data likelihood" We combine:" mutual information" position-specific priors for paired/unpaired! (based on single sequence thermodynamic folding predictions)" intuition: for similar seqs, little MI; fall back on singlesequence folding predictions" data-dependent, so not strictly Bayesian " Mfinder ccuracy! (on Rfam families with flanking sequence)" /W /W 74 78

13 Searching for noncoding RNs" pplication I! omputational Pipeline for High Throughput Discovery of cis-regulatory Noncoding RN in Prokaryotes. Yao, Barrick, Weinberg, Neph, Breaker, Tompa and Ruzzo. PLoS omputational Biology. 3(7): e126, July 6, M s are great, but where do they come from?" n approach: comparative genomics" Search for motifs with common secondary structure in a set of functionally related sequences." hallenges" Three related tasks" Locate the motif regions." lign the motif instances." Predict the consensus secondary structure." Motif search space is huge!" Motif location space, alignment space, structure space." 81 Predicting New cis-regulatory RN Elements" oal: " iven unaligned TRs of coexpressed or orthologous genes, find common structural motifs" Difficulties: " Low sequence similarity: alignment difficult" Varying flanking sequence " Motif missing from some input genes" Right Data: Why/How" We can recognize, say, 5-10 good examples amidst 20 extraneous ones (but not 5 in 200 or 2000) of length 1k or 10k (but not 100k)" Regulators often near regulatees (protein coding genes), which are usually recognizable cross-species" So, find similar genes ( homologs ), look at adjacent DN " (Not strategy used in vertebrates x larger genomes)" 82 83

14 et bacterial genomes" pproach" For each gene, get close orthologs (DD)" Find most promising genes, based on conserved sequence motifs (Footprinter)" From those, find structural motifs (Mfinder)" enome-wide search for more instances "(Ravenna)" Expert analyses (Breaker Lab, Yale)" 84 pipeline for RN motif genome scans" DD Ortholgous genes pstream sequences Top datasets Footprinter Rank datasets Mfinder Motifs Homologs Search enome database Yao, Barrick, Weinberg, Neph, Breaker, Tompa and Ruzzo. omputational Pipeline for High Throughput Discovery of cis-regulatory Noncoding RN in 85 Prokaryotes. PLoS omputational Biology. 3(7): e126, July 6, 2007.! enome Scale Search: Why" enome Scale Search" Many riboswitches, e.g., are present in ~5 copies per genome" In most close relatives " More examples give better model, hence even more examples, fewer errors " More examples give more clues to function - critical for wet lab verification" But inclusion of non-examples can degrade motif " 86 Mfinder is directly usable for/with search" Folding predictions Smart heuristics andidate alignment Realign M Search 88

15 Results" Have analyzed most sequenced bacteria (~2005) " bacillus/clostridia" gamma proteobacteria" cyanobacteria" actinobacteria" firmicutes" Identify DD group members 2946 DD groups Retrieve upstream sequences Footprinter ranking Mfinder motifs Motif postprocessing 1740 motifs RaveNn Mfinder refinement < 10 P days < 10 P days 1 ~ 2 P months 10 P months < 1 P month Motif postprocessing 1466 motifs Rank Score # DD Rfam RV MF FP RV MF ID ene Description IlvB Thiamine pyrophosphate-requiring enzymes RF00230 T-box O3859 Predicted membrane protein RF00059 THI MetH Methionine synthase I specific DN methylase RF00162 S_box O0116 Predicted N6-adenine-specific DN methylase RF00011 RNaseP_bact_b DHBP 3,4-dihydroxy-2-butanone 4-phosphate synthase RF00050 RFN ua MP synthase RF00167 Purine cvp lycine cleavage system protein P RF00504 lycine DF149 ncharacterised BR, YbaB family O0718 RF00169 SRP_bact bib obalamin biosynthesis protein obd/bib RF00174 obalamin Lys Diaminopimelate decarboxylase RF00168 Lysine Ter Membrane protein Ter RF00080 yybp-ykoy MgtE Mg/o/Ni transporter MgtE RF00380 ykok lms lucosamine 6-phosphate synthetase RF00234 glms OpuBB B-type proline/glycine betaine transport systems RF00005 trn EmrE Membrane transporters of cations and cationic drug RF00442 ykk-yxkd O0398 ncharacterized conserved protein RF00023 tmrn Table 1: Motifs that correspond to Rfam families. Rank : the three columns show ranks for refined motif clusters after genome scans ( RV ), Mfinder motifs before genome scans ( MF ), and FootPrinter results ( FP ). We used the same ranking scheme for RV and MF. Score 91 Rfam Membership Overlap Structure # Sn Sp nt Sn Sp bp Sn Sp RF00174 obalamin RF00504 lycine RF00234 glms RF00168 Lysine RF00167 Purine RF00050 RFN RF00011 RNaseP_bact_b RF00162 S_box RF00169 SRP_bact RF00230 T-box RF00059 THI RF00442 ykk-yxkd RF00380 ykok RF00080 yybp-ykoy mean median Table 2: Motif prediction accuracy vs prokaryotic subset of Rfam full alignments. Membership : the number of sequences in the overlap between our predictions and Rfam s ( # ), the sensitivity ( Sn ) and specificity ( Sp ) of our membership predictions. Overlap : avg length of overlap between our predictions and Rfam s ( nt ), the fractional lengths of the overlapped region in Rfam s predictions ( Sn ) and in ours ( Sp ). Structure : avg number of correctly predicted canonical base pairs (in overlapped regions) and the sensivity ( Sn ) and specificity ( Sp ) of our predictions. 1 fter another iteration of RaveNn scan and refinement, the membership sensitivities of lycine and obalamin increased to 76% and 98% respectively, while the specificity of 92 lycine remained the same, and specificity of obalamin dropped to 84%.

16 Rank # DD ene: Description nnotation DHOase IIa: Dihydroorotase PyrR attenuator [22] RplL: Ribosomal protein L7/L1 L10 r-protein leader; see Supp RpsF: Ribosomal protein S6 S6 r-protein leader O1179: Dinucleotide-utilizing enzymes 6S RN [25] RpsJ: Ribosomal protein S10 S10 r-protein leader; see Supp Resolvase: N terminal domain Inf: Translation initiation factor 3 IF-3 r-protein leader; see Supp RpsD: Ribosomal protein S4 and related proteins S4 r-protein leader; see Supp [30] rol: haperonin roel Hrc DN binding site [46] Ribosomal L21p: Ribosomal prokaryotic L21 protein L21 r-protein leader; see Supp ad: admium resistance transporter [47] RplB: Ribosomal protein L2 S10 r-protein leader RN pol Rpb2 1: RN polymerase beta subunit O3830: T domain-containing protein Ribosomal S2: Ribosomal protein S2 S2 r-protein leader Rps: Ribosomal protein S7 S12 r-protein leader O2984: B-type uncharacterized transport system tsr: Firmicutes transcriptional repressor of class III tsr DN binding site [48] Formyl trans N: Formyl transferase PurE: Phosphoribosylcarboxyaminoimidazole O4129: Predicted membrane protein RplO: Ribosomal protein L15 L15 r-protein leader RpmJ: Ribosomal protein L36 IF-1 r-protein leader na B: na protein B-type domain Ribosomal S12: Ribosomal protein S12 S12 r-protein leader Ribosomal L4: Ribosomal protein L4/L1 family L3 r-protein leader O0742: N6-adenine-specific methylase ylbh putative RN motif [4] Pencillinase R: Penicillinase repressor BlaI, MecI DN binding site [49] Ribosomal S9: Ribosomal protein S9/S16 L13 r-protein leader; Fig Ribosomal L19: Ribosomal protein L19 L19 r-protein leader; Fig ap: lyceraldehyde-3-phosphate dehydrogenase/erythrose O4708: Predicted membrane protein O0325: Predicted enzyme with a TIM-barrel fold RpmF: Ribosomal protein L32 L32 r-protein leader LDH: L-lactate dehydrogenases spr: Predicted rrn methylase Fus: Translation elongation factors EF- r-protein leader mrn leader mrn leader switch? 94 pplication II" boxed = confirmed riboswitch (+2 more) Identification of 22 candidate structured RNs in bacteria using the Mfinder comparative genomics pipeline. " Weinberg, Barrick, Yao, Roth, Kim, ore, Wang, Lee, Block, Sudarsan, Neph, Tompa, Ruzzo and Breaker. " Nucl. cids Res., July : " Weinberg, et al. Nucl. cids Res., July : !

17 New Riboswitches" EMM regulated genes" SM IV SH MOO PreQ1 II EMM "(S-adenosyl methionine)" "(S-adenosyl homocystein)" "(Molybdenum ofactor)" "(queuosine precursor)" "(cyclic di-mp)" Pili and flagella! Secretion " hemotaxis " Signal transduction " hitin" Membrane Peptide " Other - tfox, cytochrome c" EMM sense a metabolite (cyclic di-mp) produced for signal transduction or for cell-cell communication tility?" ncrn discovery in Vertebrates" nknown" BT" omparative genomics beyond sequence based alignments: RN structures in the ENODE regions E.g., there are no known human riboswitches, so potentially fewer side effects from drugs that might target them" Some such drugs (w/ previously unknown targets) have been known for decades!" E. Torarinsson, Z. Yao, E. D. Wiklund, J. B. Bramsen,. Hansen, J. Kjems, N. Tommerup, W. L. Ruzzo and J. orodkin enome Research, Jan

18 ncrn discovery in Vertebrates" Previous studies focus on highly conserved regions (Washietl, Pedersen et al. 2007)" Evofold (Pedersen et al. 2006)" RNz (Washietl et al. 2005)" We explore regions with weak sequence conservation " pproach" Extract ENODE Multiz alignments " Remove exons, most conserved elements. " blocks, 8.7M bps." pply Mfinder to both strands." 10,106 predictions, 6,587 clusters. " False positive rate: 50% based on a heuristic ranking function. " Search in Vertebrates" Extract ENODE Multiz alignments " Remove exons, most conserved elements. " blocks, 8.7M bps." pply Mfinder to both strands." 10,106 predictions, 6,587 clusters. " High false positive rate, but still suggests 1000 s of RNs. " (We ve applied Mfinder to whole human genome:! O(1000) P years. nalysis in progress.)" Trust 17-way alignment for orthology, not for detailed alignment ssoc w/ coding genes" Many known human ncrns lie in introns" Several of our candidates do, too, including some of the tested ones" #6: SYN3 (Synapsin 3)" #10: TIMP3, antisense within SYN3 intron" #9: RM8 (glutamate receptor metabotropic 8)"

19 Overlap with known transcripts" Input regions include only one known ncrn hasmir-483, and we found it." 40% intergenetic, 60% overlap with protein coding gene" 114 Overlap w/ Indel Purified Segments" IPS presumed to signal purifying selection" Majority (64%) of candidates have >45% +" Strong P-value for their overlap w/ IPS " + data P N Expected Observed P-value % 0-35 igs % igs % igs % igs E % igs E % all igs E % 115 omparison with Evofold, RNz" lignment Matters" Mfinder Evofold RNz 1781 Small overlap (w/ highly significant p-values) emphasizes complementarity

20 Realignment" New scoring scheme " % realigned oal: improve false discovery rate for top ranking motifs " urrent methods can not improve beyond 50% FDR by using higher score threshold. " Neither RNz nor Evofold are robust on poorly conserved and gappy regions. (Of course, they weren t designed to be.)" verage pairwise sequence similarity Test on Mfinder motifs in ENODE regions" 10 of 11 top (differentially) expressed" False Discovery Rate FDR vs score ranks in the original alignments

21 Summary" ncrn - apparently widespread, much interest" ovariance Models - powerful but expensive tool for ncrn motif representation, search, discovery" Rigorous/Heuristic filtering - typically 100x speedup in search with no/little loss in accuracy" Mfinder - good M-based motif discovery in unaligned sequences " Pipeline integrating comp and bio for ribowitch discovery" Potentially many ncrns with weak sequence conservation in vertebrates. " Summary" Lots of structurally conserved ncrn" Functional significance often unclear" But high rate of confirmed tissue-specific expression in (small) set of top candidates in humans" BI P demands " Still need for further methods development & application" Thanks!" Discovering ncrns in prokaryotes through genome-wide clustering Elizabeth Tseng W SE 135

22 Our work! oal! lustering for homologous ncrn prediction! Our pproach! luster genomic sequences by homology! Incorporate secondary structure information! hallenges! Input: large search space! Homology inference: what tools to use?! How to evaluate? Overview! Motivation! pproach!lustering based on homology!incorporating secondary structure information! Evaluation! onclusion Overview of approach full genomic sequences full genomic sequence for species X 5 3 enbank annotated DS / trn / rrn / repeat regions for X strand 5 - strand 3 5 Intergenic Region (IR) extraction remove annotated regions < 15bps discard IRs < 15 bps discard IRs adjacent to rrn extracted IR for species X

23 Overview of approach Homology search programs full genomic sequences Query Subject Database Hits Intergenic Region (IR) extraction pool of IRs TTTTTTT TTTTTTTTTT T.. DB of sequences Query segment : x1-x2 Subject segment: y1-y2 Matching score: S Homology search Homology search programs! Popular homology search programs:! NBI-BLST! W-BLST! FST! SSERH Overview of approach full genomic sequences Intergenic Region (IR) extraction pool of IRs ses dynamic programming to find matching regions between two sequences SSERH 10 times slower than the rest homology hits Homology search Hierarchical clustering

24 Hierarchical clustering What if segments overlap?! Homology program produces a list of hits between IR segments! IR segments! nodes! hit between two segments! connecting edge! Similarity score! edge weight Query segment : x1-x2 Subject segment: y1-y2 Matching score: S1 Query segment : x3-x4 Subject segment: z1-z2 Matching score: S2 What if (x1,x2) and (x3,x4) overlap by a significant portion? Query segment : x1-x2 Subject segment: y1-y2 Matching score: S x1-x2 S y1-y2 y1 y2 z1 z2 x1 x3 x2 x4 highly conserved in sequence What if segments overlap? WPM (Weighted Pair roup Method using arithmetic veraging) y1 y2 x1 x3 x2 x4 highly conserved in sequence z1 z2! While exists a connecting edge! Select the edge with highest weight! Replace the connected two nodes with an new internal node! pdate edge associations hit y1-y2 F 6 x1-x4 hit z1-z2 infer 4 B 2 3 E 10 D 5

25 4 B 2 10 F E D 6 4 E B (,D) (F,) 4 B E F 6 B 1 4 E 0.3 (F,) (,D) ((,D),) se size threshold to cut down tree size Overview (((,D),), (B,E)) (F,) luster too big? D B E F! Motivation! pproach!lustering based on homology!incorporating secondary structure information! Evaluation! onclusion

26 Secondary structure info Secondary structure info!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! <<<<<<... >>>>>> <<<<.... >>>>! More conserved in structure than sequence! an we include secondary structure when searching for homologs?!!! convert 4-alphabet (,,,) to 12-alphabet {<, >, _} x {,,,} allow for mismatches between alphabets that are from different nucleotides, but the same structure DIY scoring matrix (<, <)! great (<, <)! good (<, _)! bad Predicting structures on IR iven an IR: 1.! Break into overlapping pieces (prev. slide) 2.! Feed each piece to RNfold! obtain structure 3.! onvert pieces from 4-alphabet to 12-alphabet 4.! se homology program with DIY scoring matrix 5.! Same clustering process INPT: enomic sequences IR extraction Incorporate structure info. Homology search program Hierarchical clustering Tree cutting OTPT: lusters of IR segments

27 Example of a good scan ncrn motif IR segment Example of a bad scan M scan recovered 95% of glms with NO false hits! M scan returned ~5000 false hits with 6 ylbh positive hits

Searching for Noncoding RNA

Searching for Noncoding RNA Searching for Noncoding RN Larry Ruzzo omputer Science & Engineering enome Sciences niversity of Washington http://www.cs.washington.edu/homes/ruzzo Bio 2006, Seattle, 8/4/2006 1 Outline Noncoding RN Why

More information

Genome 559 Wi RNA Function, Search, Discovery

Genome 559 Wi RNA Function, Search, Discovery Genome 559 Wi 2009 RN Function, Search, Discovery The Message Cells make lots of RN noncoding RN Functionally important, functionally diverse Structurally complex New tools required alignment, discovery,

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

Modeling and Searching for Non-Coding RNA

Modeling and Searching for Non-Coding RNA Modeling and Searching for Non-oding RN W.L. Ruzzo http://www.cs.washington.edu/homes/ruzzo http://www.cs.washington.edu/homes/ruzzo/ courses/gs541/09sp ENOME 541 protein and DN sequence analysis to determine

More information

RNA Search and Motif Discovery. Lectures CSE 527 Autumn 2007

RNA Search and Motif Discovery. Lectures CSE 527 Autumn 2007 RNA Search and Motif Discovery Lectures 18-19 CSE 527 Autumn 2007 The Human Parts List, circa 2001 1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca 61 gggcgcagcg gcggccgcag accgagcccc

More information

Modeling and Searching for Non-Coding RNA

Modeling and Searching for Non-Coding RNA Modeling and Searching for Non-oding RN W.L. Ruzzo http://www.cs.washington.edu/homes/ruzzo Outline Why RN? Examples of RN biology omputational hallenges Modeling Search Inference The entral Dogma DN RN

More information

ROC TPR FPR

ROC TPR FPR HW5 2 3 TPR 0.0 0.2 0.4 0.6 0.8 1.0 291 411 537 669 807 957 1098 1269 1506 1971 8703 171 51 RO TPR = True Positive Rate = Sensitivity = Recall = TP/(TP+FN) FPR = False Positive Rate = 1 Specificity = FP/(FP+TN)

More information

Modeling and Searching for Non-Coding RNA

Modeling and Searching for Non-Coding RNA Modeling and Searching for Non-oding RN W.L. Ruzzo http://www.cs.washington.edu/homes/ruzzo The entral Dogma DN RN Protein gene Protein DN (chromosome) cell RN (messenger) lassical RNs mrn trn rrn snrn

More information

A Computational Pipeline for High- Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes

A Computational Pipeline for High- Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes A Computational Pipeline for High- Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes Zizhen Yao 1*, Jeffrey Barrick 2, Zasha Weinberg 3, Shane Neph 1,4, Ronald Breaker 2,3,5, Martin Tompa

More information

The Double Helix. CSE 417: Algorithms and Computational Complexity! The Central Dogma of Molecular Biology! DNA! RNA! Protein! Protein!

The Double Helix. CSE 417: Algorithms and Computational Complexity! The Central Dogma of Molecular Biology! DNA! RNA! Protein! Protein! The Double Helix SE 417: lgorithms and omputational omplexity! Winter 29! W. L. Ruzzo! Dynamic Programming, II" RN Folding! http://www.rcsb.org/pdb/explore.do?structureid=1t! Los lamos Science The entral

More information

A Method for Aligning RNA Secondary Structures

A Method for Aligning RNA Secondary Structures Method for ligning RN Secondary Structures Jason T. L. Wang New Jersey Institute of Technology J Liu, JTL Wang, J Hu and B Tian, BM Bioinformatics, 2005 1 Outline Introduction Structural alignment of RN

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

RNA Secondary Structure. CSE 417 W.L. Ruzzo

RNA Secondary Structure. CSE 417 W.L. Ruzzo RN Secondary Structure SE 417 W.L. Ruzzo The Double Helix Los lamos Science The entral Dogma of Molecular Biology DN RN Protein gene Protein DN (chromosome) cell RN (messenger) Non-coding RN Messenger

More information

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence

More information

Conserved RNA Structures. Ivo L. Hofacker. Institut for Theoretical Chemistry, University Vienna.

Conserved RNA Structures. Ivo L. Hofacker. Institut for Theoretical Chemistry, University Vienna. onserved RN Structures Ivo L. Hofacker Institut for Theoretical hemistry, University Vienna http://www.tbi.univie.ac.at/~ivo/ Bled, January 2002 Energy Directed Folding Predict structures from sequence

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) Hidden Markov Models Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) 1 The occasionally dishonest casino A P A (1) = P A (2) = = 1/6 P A->B = P B->A = 1/10 B P B (1)=0.1... P

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

A graph kernel approach to the identification and characterisation of structured non-coding RNAs using multiple sequence alignment information

A graph kernel approach to the identification and characterisation of structured non-coding RNAs using multiple sequence alignment information graph kernel approach to the identification and characterisation of structured noncoding RNs using multiple sequence alignment information Mariam lshaikh lbert Ludwigs niversity Freiburg, Department of

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

RNA Abstract Shape Analysis

RNA Abstract Shape Analysis ourse: iegerich RN bstract nalysis omplete shape iegerich enter of Biotechnology Bielefeld niversity robert@techfak.ni-bielefeld.de ourse on omputational RN Biology, Tübingen, March 2006 iegerich ourse:

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information

RNAscClust clustering RNAs using structure conservation and graph-based motifs

RNAscClust clustering RNAs using structure conservation and graph-based motifs RNsclust clustering RNs using structure conservation and graph-based motifs Milad Miladi 1 lexander Junge 2 miladim@informatikuni-freiburgde ajunge@rthdk 1 Bioinformatics roup, niversity of Freiburg 2

More information

Markov Models & DNA Sequence Evolution

Markov Models & DNA Sequence Evolution 7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under

More information

CSEP 527 Computational Biology. RNA: Function, Secondary Structure Prediction, Search, Discovery

CSEP 527 Computational Biology. RNA: Function, Secondary Structure Prediction, Search, Discovery SEP 527 omputational Biology RN: Function, Secondary Structure Prediction, Search, Discovery The Message ells make lots of RN noncoding RN Functionally important, functionally diverse Structurally complex

More information

Multiple Sequence Alignment

Multiple Sequence Alignment Multiple equence lignment Four ami Khuri Dept of omputer cience an José tate University Multiple equence lignment v Progressive lignment v Guide Tree v lustalw v Toffee v Muscle v MFFT * 20 * 0 * 60 *

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding Example: The Dishonest Casino Hidden Markov Models Durbin and Eddy, chapter 3 Game:. You bet $. You roll 3. Casino player rolls 4. Highest number wins $ The casino has two dice: Fair die P() = P() = P(3)

More information

DUPLICATED RNA GENES IN TELEOST FISH GENOMES

DUPLICATED RNA GENES IN TELEOST FISH GENOMES DPLITED RN ENES IN TELEOST FISH ENOMES Dominic Rose, Julian Jöris, Jörg Hackermüller, Kristin Reiche, Qiang LI, Peter F. Stadler Bioinformatics roup, Department of omputer Science, and Interdisciplinary

More information

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island

More information

HMMs and biological sequence analysis

HMMs and biological sequence analysis HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the

More information

RNA Protein Interaction

RNA Protein Interaction Introduction RN binding in detail structural analysis Examples RN Protein Interaction xel Wintsche February 16, 2009 xel Wintsche RN Protein Interaction Introduction RN binding in detail structural analysis

More information

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically

More information

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail

More information

Towards a Comprehensive Annotation of Structured RNAs in Drosophila

Towards a Comprehensive Annotation of Structured RNAs in Drosophila Towards a Comprehensive Annotation of Structured RNAs in Drosophila Rebecca Kirsch 31st TBI Winterseminar, Bled 20/02/2016 Studying Non-Coding RNAs in Drosophila Why Drosophila? especially for novel molecules

More information

Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Alignment Algorithms. Alignment Algorithms

Alignment Algorithms. Alignment Algorithms Midterm Results Big improvement over scores from the previous two years. Since this class grade is based on the previous years curve, that means this class will get higher grades than the previous years.

More information

Outline. Sequence-comparison methods. Buzzzzzzzz. Why compare sequences? Gerard Kleywegt Uppsala University

Outline. Sequence-comparison methods. Buzzzzzzzz. Why compare sequences? Gerard Kleywegt Uppsala University MB330 - January, 2006 Sequence-comparison methods erard Kleywegt Uppsala University Outline! Why compare sequences?! Dotplots! airwise sequence alignments &! Multiple sequence alignments! rofile methods!

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Phylogeny Tree Algorithms

Phylogeny Tree Algorithms Phylogeny Tree lgorithms Jianlin heng, PhD School of Electrical Engineering and omputer Science University of entral Florida 2006 Free for academic use. opyright @ Jianlin heng & original sources for some

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

BMI/CS 576 Fall 2016 Final Exam

BMI/CS 576 Fall 2016 Final Exam BMI/CS 576 all 2016 inal Exam Prof. Colin Dewey Saturday, December 17th, 2016 10:05am-12:05pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas n Introduction to Bioinformatics lgorithms Multiple lignment Slides revised and adapted to Bioinformática IS 2005 na eresa Freitas n Introduction to Bioinformatics lgorithms Outline Dynamic Programming

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

More information

Eukaryotic vs. Prokaryotic genes

Eukaryotic vs. Prokaryotic genes BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 18: Eukaryotic genes http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Eukaryotic vs. Prokaryotic genes Like in prokaryotes,

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis he universe of biological sequence analysis Word/pattern recognition- Identification of restriction enzyme cleavage sites Sequence alignment methods PstI he universe of biological sequence analysis - prediction

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

More information

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment Molecular Modeling 2018-- Lecture 7 Homology modeling insertions/deletions manual realignment Homology modeling also called comparative modeling Sequences that have similar sequence have similar structure.

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. x 1 x 2 x 3 x K Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)

More information

In Genomes, Two Types of Genes

In Genomes, Two Types of Genes In Genomes, Two Types of Genes Protein-coding: [Start codon] [codon 1] [codon 2] [ ] [Stop codon] + DNA codons translated to amino acids to form a protein Non-coding RNAs (NcRNAs) No consistent patterns

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. x 1 x 2 x 3 x K Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K HiSeq X & NextSeq Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization:

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9: Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative sscott@cse.unl.edu 1 / 27 2

More information

Matrix-based pattern discovery algorithms

Matrix-based pattern discovery algorithms Regulatory Sequence Analysis Matrix-based pattern discovery algorithms Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

More information

Probabilistic Arithmetic Automata

Probabilistic Arithmetic Automata Probabilistic Arithmetic Automata Applications of a Stochastic Computational Framework in Biological Sequence Analysis Inke Herms PhD thesis defense Overview 1 Probabilistic Arithmetic Automata 2 Application

More information

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 582 589 July 2008

More information

Genome 373: Hidden Markov Models II. Doug Fowler

Genome 373: Hidden Markov Models II. Doug Fowler Genome 373: Hidden Markov Models II Doug Fowler Review From Hidden Markov Models I What does a Markov model describe? Review From Hidden Markov Models I A T A Markov model describes a random process of

More information

Lecture 7 Sequence analysis. Hidden Markov Models

Lecture 7 Sequence analysis. Hidden Markov Models Lecture 7 Sequence analysis. Hidden Markov Models Nicolas Lartillot may 2012 Nicolas Lartillot (Universite de Montréal) BIN6009 may 2012 1 / 60 1 Motivation 2 Examples of Hidden Markov models 3 Hidden

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Multiple Sequence Alignment: HMMs and Other Approaches

Multiple Sequence Alignment: HMMs and Other Approaches Multiple Sequence Alignment: HMMs and Other Approaches Background Readings: Durbin et. al. Section 3.1, Ewens and Grant, Ch4. Wing-Kin Sung, Ch 6 Beerenwinkel N, Siebourg J. Statistics, probability, and

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models A selection of slides taken from the following: Chris Bystroff Protein Folding Initiation Site Motifs Iosif Vaisman Bioinformatics and Gene Discovery Colin Cherry Hidden Markov Models

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions

More information

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators CSEP 59A Summer 26 Lecture 4 MLE, EM, RE, Expression FYI, re HW #2: Hemoglobin History 1 Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm

More information

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression CSEP 590A Summer 2006 Lecture 4 MLE, EM, RE, Expression 1 FYI, re HW #2: Hemoglobin History Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

HIDDEN MARKOV MODEL-BASED HOMOLOGY SEARCH AND GENE PREDICTION IN NGS ERA

HIDDEN MARKOV MODEL-BASED HOMOLOGY SEARCH AND GENE PREDICTION IN NGS ERA HIDDEN MRKOV MODEL-BSED HOMOLOY SERCH ND ENE PREDICTION IN NS ER By Prapaporn Techa-angkoon DISSERTTION Submitted to Michigan State University in partial fulfillment of the requirements for the degree

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information