Medical genomics and bioinformatics, 2009 RNA bioinformatics Marcela Davila-Lopez Department of Medical Biochemistry and Cell Biology Institute of Biomedicine
Figure (from DNA to protein): DNA, transcription, RNA, alternative splicing, mRNA with 5' cap and poly(A) tail, modification/export, translation, protein A and protein B.
Overview
- RNA: ncRNA (importance, disease-related), structure (types)
- RNA regulatory elements: riboswitches, SECIS, IRE, miRNA
- How to predict ncRNA secondary structure: Mfold, mutual information
- How to identify ncRNA genes: pattern matching (PatScan), SCFG (cmsearch), phylogenetic analysis
General concepts
Types and roles of ncRNAs
mRNA codes for proteins. A non-coding RNA (ncRNA) is any RNA molecule that is not translated into a protein.
- Genomic stability: telomerase RNA
- RNA processing and modification: spliceosomal snRNA, U7 snRNA, RNase P, RNase MRP
- Transcription: 7SK RNA, 6S RNA
- Translation: tRNA, tmRNA, rRNA
- Protein trafficking: SRP RNA
(Gisela Storz, Shoshy Altuvia and Karen M. Wasserman, 2005; Matera, A.G., R.M. Terns, and M.P. Terns, Nat Rev Mol Cell Biol, 2007)
ncRNA content
Are ncRNAs responsible for the complexity in different organisms?
(Huttenhofer, A., P. Schattner, and N. Polacek, Trends Genet, 2005)

Disease
- miR: diabetes
- MRP RNA: cartilage-hair hypoplasia
(Prasanth, K.V. and D.L. Spector, Genes Dev, 2007; Costa, F.F., Drug Discov Today, 2009; Pandey, A.K., P. Agarwal, K. Kaur, and M. Datta, Cell Physiol Biochem, 2009)

Disease: MRP RNA
- Function: processing of pre-rRNA
- Disease: cartilage-hair hypoplasia
(Thiel, C.T., G. Mortier, I. Kaitila, A. Reis, and A. Rauch, Am J Hum Genet, 2007)
Protein: primary sequence
ClustalW: sequence similarity implies a biological relation, which implies the same function.

ncRNA: primary sequence
The sequence is often not conserved, but the structure is.
Covariation: consistent and compensatory mutations that (often) conserve the structure.

A single mutation can radically change the structure
- Canonical pairs
- Non-canonical pairs: GU wobble
(http://prion.bchs.uh.edu/bp_type/bp_structure.html)
Secondary structure
RNA functionality depends on structure.
Figure labels: pseudoknot, loop, stem, multibranched loop, bulge, hairpin, external base, internal loop.

Tertiary structure
RNA tertiary structure comprises interactions between secondary structure (SS) elements:
- two helices
- two unpaired regions
- one unpaired region and a double-stranded helix
Prediction of RNA 3D structure is very difficult, and RNA bioinformatics is therefore dominated by the prediction and analysis of secondary structure.

Family structure
Each family typically adopts a characteristic secondary structure.
Figure labels: tRNA, P RNA, telomerase RNA.

However...
Figure labels: U1 snRNA; Dictyostelium discoideum, Candida albicans, Trypanosoma brucei; MRP RNA.
Examples: RNA regulatory elements
- Riboswitches
- SECIS
- IRE
- miRNA

RNA regulatory elements
A cis-regulatory element (cis-element) is a region of RNA that regulates the expression of genes located on the same strand.
Trans-regulatory elements are RNAs that may modify the expression of genes distant from the gene that was transcribed to create them.
Figure: an mRNA (5' m7G cap, CDS, AAUAA, poly(A) tail, 3') with a miRNA paired to it.

Cis and trans regulatory elements: histones
Stem-loop motif of histone pre-mRNA.
Figure labels: DNA, histone pre-mRNA, SLBP, ZFP-100, CPSF-100, CPSF-73, Symplekin, Lsm10, Lsm11, F, E, B, G, D3, U7 snRNA.
(Dominski, Z. and W.F. Marzluff, Gene, 2007)
Riboswitch (2002)
- Part of an mRNA molecule that can directly bind a small target molecule, affecting the gene's activity (auto-regulation)
- Typically found in the 5' UTR
- Controls biosynthesis, catabolism and transport of various cellular metabolites (amino acids [K, G], cofactors, nucleotides and metal ions)
- Most known riboswitches occur in Bacteria
(Tucker, B.J. and R.R. Curr Opin Struct Biol, 2005)

Riboswitch examples
Figure labels: transcription, translation, Shine-Dalgarno.
(Serganov A, Patel DJ. Biochim Biophys Acta. 2009)

Riboswitch identification
Comparative analysis of the upstream regions of several genes:
- BLAST to find UTRs homologous to all UTRs in, e.g., Bacillus subtilis
- Inspection for conserved structure (RNA-like motifs)
- Experimental confirmation
Example: guanine riboswitch.
(Henkin TM. Genes Dev. 2008; Mandal M, et al, Cell. 2003)
Selenoproteins
Selenium:
- Antioxidant activity; chemopreventive, anti-inflammatory and antiviral properties
- Moderate selenium deficiency has been linked to increased cancer and infection risk, male infertility, decreased immune and thyroid function, and several neurologic conditions, including Alzheimer's and Parkinson's disease
- Not a cofactor: incorporated into the polypeptide chain as selenocysteine [Sec] (21st amino acid)
Selenoproteins:
- At least 25 selenoproteins
- Present in all lineages of life (bacteria, archaea and eukarya)
- Function of most selenoproteins is currently unknown
- Prevention of some forms of cancer (?); therapeutic targets (?)
(Papp, LV, et al. Antioxidants & Redox Signaling 2007)

SECIS
- Overall low sequence similarity
- Secondary structures are highly conserved and contain consensus sequences that are indispensable for Sec incorporation
- Eukaryotic SECIS: non-canonical A-G base pairs, K-turn motif
(Kryukov, G.V., et al., Science, 2003)
IRE: iron-responsive element
Iron is essential for oxygen transport, cellular respiration and DNA synthesis.
- Low iron: cellular growth arrest and death; anemia, retardation in children
- High iron: generation of hydroxyl or lipid radicals that damage lipid membranes, proteins and nucleic acids; hemochromatosis, liver/heart failure
Balance: iron-responsive element / iron regulatory protein regulatory system.
The IRE is a 26-30 nt long hairpin with the apical loop sequence CAGUGN, found in the 5' UTR or 3' UTR.
(Muckenthaler MU, Galy B, Hentze MW. Annu Rev Nutr. 2008; Piccinelli P, Samuelsson T, RNA, 2007)

IRE regulation
(Muckenthaler MU, Galy B, Hentze MW. Annu Rev Nutr. 2008)
Gene identification and secondary structure (SS) prediction

Protein vs RNA identification
Protein:
- Conserved primary sequence
- Promoters (Pol II)
- Sequence-similarity based
RNA:
- Primary sequence not conserved
- Promoters (Pol II, Pol III)
- Sequence-similarity based
- Secondary structure based
- Comparative genomics
Methods
- Nussinov algorithm
- Mfold (prediction of secondary structure)
- Analysis of mutual information
- Pattern matching
- SCFG (stochastic context-free grammar models)
- Phylogenetic analysis

Nussinov algorithm: find the structure with the most base pairs (dynamic programming).
Drawback: the structure is not unique.
Testing all possible structures is numerically impossible.
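The base-pair maximization idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any particular tool: the function name `nussinov`, the `min_loop` parameter (minimum number of unpaired bases in a hairpin loop) and the decision to allow G-U wobble pairs are assumptions made for this example. The traceback returns only one of possibly several optimal structures, which is exactly the "not unique" drawback noted above.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max_pairs, dot_bracket) for an RNA sequence.

    dp[i][j] holds the maximum number of base pairs in seq[i..j].
    """
    n = len(seq)
    if n == 0:
        return 0, ""
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one optimal structure (others may exist).
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)


print(nussinov("GGGAAAUCC"))
```

The table has O(n^2) cells and each cell scans O(n) split points, so the search that would be impossible by enumeration takes O(n^3) time.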
Zuker folding algorithm (1981)
The correct structure is assumed to be the one with the lowest equilibrium free energy (ΔG), which is the sum of individual contributions from loops, base pairs and other secondary structure elements.
Every system seeks to achieve a minimum of free energy (MFE).
However, the MFE structure is not always the biologically relevant one.

Mutual information
A quantity that measures the mutual dependence of two variables (here, two alignment positions). The unit of measurement is the bit.
Covarying positions: consistent and compensatory mutations that conserve the structure.
Mutual information - example
f_{x_i} = frequency of one of the 4 bases in column i
f_{x_i x_j} = frequency of one of the 16 base pairs in columns i and j
M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}
M_{ij} = 2: maximum value, informative
M_{ij} = 0: conserved positions, not informative

Alignment:
  1 2 3 4
  G G C C
  G C C G
  G A C U
  G U C A

Columns 2-4 (pairs GC, CG, AU, UA):
  GC: fG = 1/4, fC = 1/4, fGC = 1/4, term fGC*log2(fGC/(fG*fC)) = 1/4*log2(0.25/(0.25*0.25)) = 0.5
  CG: fC = 1/4, fG = 1/4, fCG = 1/4, term 1/4*log2(0.25/(0.25*0.25)) = 0.5
  AU: fA = 1/4, fU = 1/4, fAU = 1/4, term 1/4*log2(0.25/(0.25*0.25)) = 0.5
  UA: fU = 1/4, fA = 1/4, fUA = 1/4, term 1/4*log2(0.25/(0.25*0.25)) = 0.5
  MI = 2

Columns 1-3 (pair GC):
  GC: fG = 4/4, fC = 4/4, fGC = 4/4, term 4/4*log2(1/(1*1)) = 0
  MI = 0
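The worked example can be reproduced with a short Python sketch. The function name `mutual_information` and the 0-based column indices are choices made for this illustration; the alignment is the four-sequence example from the slide. Gaps and small-sample corrections are not handled.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j (0-based)."""
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)             # base counts in column i
    f_j = Counter(seq[j] for seq in alignment)             # base counts in column j
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)  # pair counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


alignment = ["GGCC", "GCCG", "GACU", "GUCA"]
print(mutual_information(alignment, 1, 3))  # columns 2 and 4 of the slide
print(mutual_information(alignment, 0, 2))  # columns 1 and 3 of the slide
```

Only pairs that are actually observed contribute to the sum, which is why the loop runs over the pair counts rather than over all 16 possible pairs.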
Mutual information exercise

Mutual information plot
Diagonals of covarying positions correspond to the four stems of the tRNA. Dashed lines indicate some of the additional tertiary contacts observed in the yeast tRNA-Phe crystal structure.
Pattern matching: PatScan
PatScan is a pattern matcher (deterministic motifs as well as secondary structure constraints) that searches protein or nucleotide sequence archives.
Example pattern: p1=5...7 GGAA ~p1
Drawback: yes/no answer.

PatScan - example
  r1={au,ua,gc,cg,gu,ug}
  p1=6...7
  GGG[1,0,0]
  p2=8...9
  4...4
  r1~p2[1,0,1]
  3...4
  ~p1
The bracketed triple gives the number of allowed [mismatches, deletions, insertions].
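To make the semantics of the simple pattern `p1=5...7 GGAA ~p1` concrete, here is a small Python sketch that searches for the same motif: any 5 to 7 nucleotides, the literal loop GGAA, then the reverse complement of the first segment. It is an independent illustration, not PatScan itself; the names `reverse_complement` and `find_hairpins` are invented for this example, only strict Watson-Crick complementarity is used, and mismatches, insertions and deletions are not supported.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def find_hairpins(seq, stem_min=5, stem_max=7, loop="GGAA"):
    """Return (start, end, stem_length) for every stem-loop-stem match."""
    hits = []
    for start in range(len(seq)):
        for stem_len in range(stem_min, stem_max + 1):
            loop_start = start + stem_len
            stem3_start = loop_start + len(loop)
            end = stem3_start + stem_len
            if end > len(seq):
                break
            stem5 = seq[start:loop_start]
            if (seq[loop_start:stem3_start] == loop
                    and seq[stem3_start:end] == reverse_complement(stem5)):
                hits.append((start, end, stem_len))
    return hits


# Toy sequence (hypothetical): a 5-bp stem around a GGAA loop, flanked by AA.
print(find_hairpins("AAGCGCAGGAAUGCGCAA"))
```

Like the pattern it mimics, the function only reports whether and where the motif matches; it gives no score for how good a match is.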
Regular grammar: primary sequence models
  S → aT | bS
  T → aS | bT | ε
Derivation: S ⇒ aT ⇒ aaS ⇒ aabS ⇒ aabaT ⇒ aabaε ⇒ aaba

Regular grammars can model repeat regions (e.g. the FMR-1 triplet repeat region):
  S  → gW1
  W1 → cW2
  W2 → gW3
  W3 → cW4
  W4 → gW5
  W5 → gW6
  W6 → cW7 | aW4 | cW4
  W7 → tW8
  W8 → g
Example sequences: gcg cgg ctg gcg cgg agg cgg ctg gag agg ctg gcg agg cgg ctg gcg agg cgg cgg
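A regular grammar corresponds to a finite automaton, which a short Python sketch can make explicit for the first grammar on this slide, read as S → aT | bS, T → aS | bT | ε. The table and function names (`TRANSITIONS`, `accepts`) are made up for the illustration. Each nonterminal becomes a state, each rule "X → cY" becomes a transition, and the nonterminal with an ε rule becomes the accepting state.

```python
# One transition per production "X -> cY": (state X, symbol c) -> state Y.
TRANSITIONS = {
    ("S", "a"): "T",
    ("S", "b"): "S",
    ("T", "a"): "S",
    ("T", "b"): "T",
}
ACCEPTING = {"T"}  # T -> epsilon lets a derivation stop in T


def accepts(string):
    """Return True if the grammar can derive the string."""
    state = "S"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # symbol outside the alphabet {a, b}
            return False
    return state in ACCEPTING


print(accepts("aaba"))  # the derivation shown on the slide
print(accepts("aa"))
```

Because the automaton reads one symbol at a time and remembers only its current state, it cannot enforce dependencies between distant positions, which is why base-paired stems need the context-free grammars discussed next.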
Context-free grammar: primary sequence models
Palindromes:
  S → aSa | bSb | aa | bb
Derivation: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabaabaa

RNA secondary structure
Sequences: CAGGAAACUG, GCUGCAAAGC, GCUGCAACUG
  S  → aW1u | cW1g | gW1c | uW1a
  W1 → aW2u | cW2g | gW2c | uW2a
  W2 → aW3u | cW3g | gW3c | uW3a
  W3 → gaaa | gcaa
Figure: CAGGAAACUG and GCUGCAAAGC each fold into a three-base-pair stem closed by a four-nucleotide loop (pairs G·C, A·U, C·G and U·A, C·G, G·C, listed from the loop outward); GCUGCAACUG cannot form the stem (U×C, C×U, G×G).
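The stem-loop grammar can be turned into a tiny recognizer. The Python sketch below is an illustration under one stated assumption: the terminal loop alternatives are taken to be gaaa and gcaa, which is what the two folding example sequences require. The function name `derives` and its `depth` parameter (number of nested base pairs, three for S, W1 and W2) are invented for the example.

```python
# Watson-Crick pairs generated by the rules X -> a Y u | c Y g | g Y c | u Y a.
WC_PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}
# Terminal loops produced by W3 (assumed to be gaaa and gcaa, see the lead-in).
LOOPS = {"gaaa", "gcaa"}


def derives(seq, depth=3):
    """Return True if the grammar generates seq: `depth` nested pairs around a loop."""
    s = seq.lower()
    if depth == 0:
        return s in LOOPS
    if len(s) < 2:
        return False
    # The outermost bases must pair; the rest must derive from the next nonterminal.
    return (s[0], s[-1]) in WC_PAIRS and derives(s[1:-1], depth - 1)


for rna in ("CAGGAAACUG", "GCUGCAAAGC", "GCUGCAACUG"):
    print(rna, derives(rna))
```

Each production emits a base on both sides of the nonterminal at once, so the grammar ties the first position to the last, something a regular grammar cannot express.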
Stochastic regular grammar: weighted (probabilistic) primary sequence models
  S → rW1 (0.45)
  S → kW1 (0.45)
  S → nW1 (0.10)
Hidden Markov models.
Figure labels: A, T, C, G, β, ε.

Stochastic context-free grammar
Covariance models: probabilistic models that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence family.
Infernal package
Search for additional, family-related sequences in sequence databases.

CM example
Build a model (automatically) from an existing sequence alignment.

CM example

Database containing information about ncRNA families and other structured RNA elements.

Phylogenetic distribution
Structural alignments
Phylogenetic analysis: EvoFold
- Conserved elements alignment
- SCFG secondary structure fold
- Phylogenetic evaluation
miRNA
miRNA
- Single-stranded RNA, ~22 nucleotides
- Inhibits the translation of mRNAs into their protein products by binding to specific regions in the 3' UTR
- Accounts for ~1% of all transcripts in humans and potentially regulates 10%-30% of all genes
- Expressed ubiquitously and highly conserved in metazoans (animal kingdom)
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)

miRNA: biological processes and diseases
Biological processes:
- Apoptosis
- Cell proliferation
- Cell differentiation
- Development
- Organism defense against infections
- Tissue morphogenesis
- Regulation of metabolism
Diseases:
- Cancer
- Viral infections
- Neurodegenerative disorders
- Cardiac pathologies
- Muscle disorders
- Diabetes
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)

miRNA: multiple binding sites
lin-4 is partially complementary to 7 sites in the lin-14 3' UTR.
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009; He, L. and G.J. Hannon, Nat Rev Genet 2004)
miRNA genes
- Exonic miRNAs in non-coding transcripts (single or clustered)
- Intronic miRNAs in non-coding transcripts
- Intronic miRNAs in protein-coding transcripts
(Kim VN Nat Rev Mol Cell Biol. 2005; Winter J et al Nat Cell Biol. 2009)

miRNA biogenesis
Canonical and non-canonical pathways.
(Winter, J., S. Jung, S. Keller, R.I. Gregory, and S. Diederichs. Nat Cell Biol 2009; Paul S. Meltzer, Nature, 2005)

miRNA structure
- Hairpin structure: miRNA, intervening loop, miRNA*
- High conservation: mature miRNA
- Lower conservation: loop
- Human genome: ~11 million hairpins
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)
miRNA computational identification
- Homology search based
  - BLAST
  - Tools: miRAlign, ProMiR, microHarvester
- Gene finding
  - Identification of conserved genomic regions
  - Folding of the identified regions (Mfold, RNAfold)
  - Evaluation of hairpins
  - Tools: miRseeker, MiRscan
- Neighbour stem loop (~42% of human miRNA genes are clustered together)
  - Check the surroundings of a known miRNA for candidate secondary structures
- Comparative genomics
  - BLAST intergenic sequences of two genomes against each other
  - Filter using rules inferred from known miRNAs
  - Tool: miRFinder
- Intragenomic matching (a functional miRNA should have at least one target)
  - miRNAs show perfect complementarity to their targets (?)
  - Simultaneously predicts miRNAs and their targets
  - Tool: miMatcher

miRNA experimental validation through sequencing
Experimental approach:
- Purify small RNAs (15-35 nt)
- Deep sequencing of the RNA library
- Map sequence traces to the genome
(Ruby JG. et al. Genome Res., 2007)
miRNA target prediction
- Predicting miRNA targets in plants is easier, owing to their perfect complementarity to the miRNAs
- In animals, perfect complementarity is not common: miRNA seed complementarity (6 to 9 nt), high false-positive rate
Common approach:
- Experimental evidence: validated miRNA/target pairs (TarBase, miRecords)
- Computational methods: base-pairing rules and binding-site sequence features, conservation, thermodynamics
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)

Base-pairing rules
5' dominant sites (canonical sites):
- 6-9 nt, usually starting at P2
- P1 is typically unpaired or starts with U
- Often flanked by A
- Usually no G:U wobbles (vs regulation)
3' compensatory sites (atypical sites):
- May compensate for insufficient base pairing in the seed
Figure: lsy-6/cog-1 3' UTR.
(Bartel, D.P. Cell 2009)
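The seed rule can be illustrated with a minimal Python sketch that scans a 3' UTR for perfect Watson-Crick matches to the miRNA seed. Several choices here are assumptions of the example rather than part of the slides: the seed is taken as positions 2-8 of the miRNA (one option within the 6-9 nt range), no G:U wobbles or mismatches are allowed, and the miRNA and UTR strings are toy sequences. The names `reverse_complement` and `seed_sites` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, utr, start=2, length=7):
    """Return (seed, target_site, positions) for perfect seed matches in the UTR.

    `start` is the 1-based position in the miRNA where the seed begins.
    Positions are 0-based offsets in the UTR, written 5' to 3'.
    """
    seed = mirna[start - 1:start - 1 + length]
    site = reverse_complement(seed)  # sequence the seed would pair with
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(site, pos + 1)
    return seed, site, positions


# Toy inputs (hypothetical sequences chosen for the illustration).
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGGCUACCUCAUU"
print(seed_sites(mirna, utr))
```

A seed match alone produces many false positives, so a scan like this is normally only the first filter before conservation and thermodynamic checks.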
More methods...
- Conservation: search for conserved seeds in the UTRs across different species
- Thermodynamics: evaluation of the ΔG of predicted duplexes, usually < -20 kcal/mol; this discards false positives, but favorable interactions do not always correspond to actual duplexes
- Structural accessibility: the target site on the mRNA should not be involved in any intramolecular base pairing; any existing secondary structure must first be removed
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)
miRNA
(Bartel, D.P. Cell 2009)

miRNA gene expression in cancer
(Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009)

miRNA in cancer
(Lu, J., et al., Nature, 2005)
miRNAs as tumor suppressors (Carlo Croce 2009)
Figure: miR-29b inhibits leukemic growth in vivo.
(A) Diagram illustrating the experimental design of the mouse xenograft experiment. Panel labels: K562 cells injected SC; tumor size; miR-29b or scrambled oligos injection (5 µg); stop; days 0, 3, 7, 10, 14.
(B) Tumor volume (mm^3) at the indicated days (0, +3, +7, +10, +14) for the three groups: mock (n=6), scrambled (n=12) and synthetic miR-29b (n=12). * P<0.003.
(C) Average tumor weight (grams) of scrambled and synthetic miR-29b treated mice at the end of the experiment (day +14). P<0.001. P-values were obtained using a t-test. Bars represent ±S.D.
(D) Photographs of two mice injected with miR-29b (left flank) or scrambled oligos (right flank).
miR DBs
- Published miRNAs
- Prediction of miRNA targets
- miRNA-disease relationships reported in the literature
- Experimentally supported targets