Genome 559 Wi RNA Function, Search, Discovery

Genome 559 Wi 2009 RNA: Function, Search, Discovery

The Message: Cells make lots of RNA. Noncoding RNA is functionally important, functionally diverse, and structurally complex. New tools required: alignment, discovery, search, scoring, etc.

RNA Secondary Structure: RNA makes helices too. Base pairs: A-U, C-G. Usually single stranded. [Figure: an RNA strand, 5' to 3', folded into a base-paired helix]

Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.

Proteins catalyze & regulate biochemistry. [Figure: The Met Repressor, with SAM]

Alberts, et al, 3e. The protein way. Riboswitch alternative: SAM. Grundy & Henkin, Mol. Microbiol 1998; Epshtein, et al., PNAS 2003; Winkler et al., Nat. Struct. Biol. 2003.

Alberts, et al, 3e. The protein way. Riboswitch alternatives: SAM-I, SAM-II. Grundy, Epshtein, Winkler et al., 1998, 2003; Corbino et al., Genome Biol. 2005.

Alberts, et al, 3e. The protein way. Riboswitch alternatives: SAM-I, SAM-II, SAM-III. Grundy, Epshtein, Winkler et al., 1998, 2003; Corbino et al., Genome Biol. 2005; Fuchs et al., NSMB 2006.

Alberts, et al, 3e. The protein way. Riboswitch alternatives: SAM-I, SAM-II, SAM-III, SAM-IV. Grundy, Epshtein, Winkler et al., 1998, 2003; Corbino et al., Genome Biol. 2005; Fuchs et al., NSMB 2006; Weinberg et al., RNA 2008.

boxed = confirmed riboswitch (+2 more). Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout the prokaryotic world. Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

Why is RNA hard to deal with? Similar structures, but only 29% seq id. [Figure: secondary structures of the compared RNAs] A: Structure often more important than sequence.
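The "% seq id" figure quoted on this slide can be illustrated with a short sketch. The function name `percent_identity`, the gap character, and the choice to count only columns where neither sequence has a gap are assumptions made for illustration, and the two toy hairpins are invented, not the sequences from the slide.

```python
def percent_identity(a, b, gap="-"):
    """Percent of aligned, gap-free columns where the two sequences agree.

    Assumption: `a` and `b` are rows of the same alignment (equal length).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Two invented hairpins: both have a 4-base-pair stem and a 4-base loop,
# yet they share only the three loop positions.
hairpin_1 = "GGGGAAAACCCC"
hairpin_2 = "CCUCGAAAGAGG"
print(percent_identity(hairpin_1, hairpin_2))
```

Both toy sequences can fold into the same stem-loop shape even though most of their letters differ, which is the slide's point: a sequence-only comparison would miss the relationship.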

Motif Description & Inference

RNA Motif Models: Covariance Models (Eddy & Durbin 1994), aka profile stochastic context-free grammars, aka hidden Markov models on steroids. Model position-specific nucleotide preferences and base-pair preferences. Pro: accurate. Con: model building hard, search sloooow.

[Figure: mRNA leader; mRNA leader switch?]

Mutual Information:

$$M_{ij} = \sum_{x_i, x_j} f_{x_i,x_j} \log_2 \frac{f_{x_i,x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2$$

Max when no seq conservation but perfect pairing. M_ij is the expected score gain from modeling columns i & j as paired. Given columns, finding the optimal pairing without pseudoknots can be done by dynamic programming.
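A minimal sketch of the two ideas on this slide, assuming the alignment is a list of equal-length strings and that gaps are simply treated as one more symbol. The function names (`mutual_information`, `mi_matrix`, `best_pairing_score`) are illustrative, and the dynamic program is a generic Nussinov-style recurrence that maximizes summed mutual information over non-crossing pairs; it is not claimed to be the exact procedure used in the lecture or in any tool.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_i)
    f_i = Counter(col_i)                 # single-column counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


def mi_matrix(alignment):
    """All-pairs mutual information for the columns of an alignment."""
    cols = ["".join(c) for c in zip(*alignment)]
    n = len(cols)
    return [[mutual_information(cols[i], cols[j]) for j in range(n)]
            for i in range(n)]


def best_pairing_score(M):
    """Max total score over sets of non-crossing (pseudoknot-free) pairs."""
    n = len(M)
    if n == 0:
        return 0.0
    best = [[0.0] * n for _ in range(n)]     # best[i][j]: columns i..j
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]           # case: column i unpaired
            for k in range(i + 1, j + 1):    # case: column i pairs with k
                inside = best[i + 1][k - 1] if k - 1 >= i + 1 else 0.0
                outside = best[k + 1][j] if k + 1 <= j else 0.0
                score = max(score, M[i][k] + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


# Toy alignment: columns 0 and 3 covary perfectly with no conservation,
# columns 1 and 2 are fully conserved.
toy = ["GAAC", "CAAG", "AAAU", "UAAA"]
M = mi_matrix(toy)
print(M[0][3], M[1][2], best_pairing_score(M))
```

In the toy alignment the outer columns reach the 2-bit maximum while the conserved inner columns score zero, matching the slide's remark that mutual information is largest when there is no sequence conservation but perfect pairing.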

Fast Motif Search: "Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy". Weinberg & Ruzzo, Recomb 04, ISMB 04, Bioinformatics 06.

CMs are good, but slow
Rfam Reality: EMBL -> BLAST (discards junk) -> CM -> hits; 1 month, 1000 computers
Our Work: EMBL -> Ravenna (discards junk) -> CM -> hits; ~2 months, 1000 computers
Rfam Goal: EMBL -> CM -> hits (and junk); 10 years, 1000 computers
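The filter-then-CM arrangement on this slide can be sketched generically. This is a toy under stated assumptions: `expensive_score` stands in for a slow CM scan and `cheap_upper_bound` for a fast filter, and neither is Ravenna's or Rfam's actual scoring. The property illustrated is that when the cheap score can never be lower than the expensive one, discarding windows whose cheap score falls below the threshold cannot lose a true hit.

```python
from collections import Counter

COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def expensive_score(window):
    """Toy 'slow' score: complementary pairs when the window folds in half."""
    half = len(window) // 2
    return sum((window[k], window[-1 - k]) in COMPLEMENT for k in range(half))


def cheap_upper_bound(window):
    """Toy filter using composition only.

    Each A-U pair uses one A and one U, and each G-C pair one G and one C,
    so this value is never below expensive_score(window).
    """
    c = Counter(window)
    return min(c["A"], c["U"]) + min(c["G"], c["C"])


def filtered_search(windows, threshold):
    """Run the expensive score only on windows the filter cannot rule out."""
    hits, n_expensive = [], 0
    for w in windows:
        if cheap_upper_bound(w) < threshold:
            continue                      # provably below threshold: skip
        n_expensive += 1
        s = expensive_score(w)
        if s >= threshold:
            hits.append((w, s))
    return hits, n_expensive


windows = ["GGGGCCCC", "AAAAAAAA", "GCCGGCCG", "AUAUGGGG"]
hits, n_expensive = filtered_search(windows, threshold=4)
brute_force = [(w, expensive_score(w)) for w in windows
               if expensive_score(w) >= 4]
print(hits == brute_force, n_expensive, "of", len(windows), "windows scored")
```

A heuristic filter such as BLAST gives no such guarantee, which is why the next slide compares hit counts for BLAST + CM against rigorous filter + CM.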

Results: New ncRNAs?

Name                     # found (BLAST + CM)   # found (rigorous filter + CM)   # new
Pyrococcus snoRNA                          57                              180     123
Iron response element                     201                              322     121
Histone 3' element                       1004                             1106     102
Purine riboswitch                          69                              123      54
Retron msr                                 11                               59      48
Hammerhead I                              167                              193      26
Hammerhead III                            251                              264      13
U4 snRNA                                  283                              290       7
S-box                                     128                              131       3
U6 snRNA                                 1462                             1464       2
U5 snRNA                                  199                              200       1
U7 snRNA                                  312                              313       1

Motif Discovery in Prokaryotes (Vertebrates too, but no time today; see, e.g., Torarinsson, et al. Genome Research, Jan 2008)

A pipeline for RNA motif genome scans. [Diagram labels: CDD; Orthologous genes; Genome database; Upstream sequences; Footprinter; Rank datasets; Top datasets; CMfinder; Motifs; Search; Homologs] Yao, Barrick, Weinberg, Neph, Breaker, Tompa and Ruzzo. A Computational Pipeline for High Throughput Discovery of cis-regulatory Noncoding RNA in Prokaryotes. PLoS Computational Biology. 3(7): e126, July 6, 2007.

Pipeline steps and cost:
Identify CDD group members: 2946 CDD groups; < 10 CPU days
Retrieve upstream sequences
Footprinter ranking: < 10 CPU days
CMfinder: 35975 motifs; 1 ~ 2 CPU months
Motif postprocessing: 1740 motifs
RaveNnA: 10 CPU months
CMfinder refinement: < 1 CPU month
Motif postprocessing: 1466 motifs

boxed = confirmed riboswitch (+2 more). Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

Summary
ncRNA: apparently widespread, much interest
Covariance Models: powerful but expensive
RaveNnA filtering: search ~100x faster with no/little loss
CMfinder: CM-based motif discovery in unaligned sequences
Pipelines integrating comp and bio for ncRNA discovery
Many vertebrate ncRNAs? Structural, not seq conservation; functional significance unclear
BIG CPU demands
Still need for further methods development & application

Course Wrap Up
Modern biology is suddenly very data-rich
Mathematical & computational tools needed
We showed: sequence modeling, alignment & search, phylogeny, linkage/association mapping, some databases
Python is a good tool for doing much of this
There's lots more! Check out, e.g., GENOME 540/1, CSE 527

We hope you enjoyed it. Thanks!