A graph kernel approach to the identification and characterisation of structured non-coding RNAs using multiple sequence alignment information

Mariam Alshaikh, Albert Ludwigs University Freiburg, Department of Computer Science. Feb 18th, 2016.

Motivation
Explosion in the discovery of unidentified ncRNAs: efficient automated approaches are needed.
Lack of automated classification tools: classification is done manually.

Approach
Conservation is important for unpaired regions.
Covariation is important for base pairs.
In this work: consider both conservation and covariation.

Multiple Alignment Graph Generator (MAGG)
What is MAGG: MAGG is a graph encoder tool. MAGG can encode the evolutionary conservation of sequences and structures.
Why MAGG: The graph formalism allows flexible encoding. Graphs give access to powerful machine learning techniques (graph kernels).
MAGG aim: Simulate experts in identifying interesting alignments for further investigation.

Neighbourhood Subgraph Pairwise Distance Kernel (EDeN)
EDeN: a graph kernel tool.
EDeN extends the notion of k-mers from strings to graphs.
It counts the fraction of identical pairs of neighbourhood subgraphs.
Figure: Pairs of neighbourhood graphs for radius = 1, 2, 3 and distance = 5 (panels: r=1, d=5; r=2, d=5; r=3, d=5).
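The counting idea can be illustrated with a toy sketch. This is not EDeN's implementation: the graph representation (an adjacency dict plus a label dict with integer node ids), the helper names, and the crude neighbourhood signature are all assumptions made for illustration, and only a single (radius, distance) pair is used.

```python
from collections import Counter, deque
from math import sqrt


def bfs_distances(adj, root, limit):
    """Shortest-path distances from root, explored up to `limit` hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def neighbourhood_signature(adj, labels, root, radius):
    """Order-independent signature of the radius-r neighbourhood of root.

    Assumption: (distance, label) pairs plus labelled edges stand in for a
    real canonical subgraph encoding, so distinct subgraphs may collide.
    """
    dist = bfs_distances(adj, root, radius)
    nodes = sorted((d, labels[n]) for n, d in dist.items())
    edges = sorted(
        tuple(sorted([(dist[u], labels[u]), (dist[v], labels[v])]))
        for u in dist for v in adj[u] if v in dist and u < v
    )
    return (tuple(nodes), tuple(edges))


def pair_features(adj, labels, radius, distance):
    """Count pairs of radius-r neighbourhoods whose roots are `distance` apart."""
    sig = {n: neighbourhood_signature(adj, labels, n, radius) for n in adj}
    feats = Counter()
    for u in adj:
        for v, d in bfs_distances(adj, u, distance).items():
            if d == distance and u < v:
                feats[(distance, *sorted([sig[u], sig[v]]))] += 1
    return feats


def kernel(g1, g2, radius, distance):
    """Cosine-normalised dot product of the two feature count vectors."""
    f1 = pair_features(*g1, radius, distance)
    f2 = pair_features(*g2, radius, distance)
    dot = sum(c * f2[k] for k, c in f1.items())
    norm = sqrt(sum(c * c for c in f1.values()) * sum(c * c for c in f2.values()))
    return dot / norm if norm else 0.0


# Two made-up path graphs of five nucleotides that differ only in the last label.
path_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
g_a = (path_adj, {0: "A", 1: "C", 2: "G", 3: "U", 4: "A"})
g_b = (path_adj, {0: "A", 1: "C", 2: "G", 3: "U", 4: "C"})

self_similarity = kernel(g_a, g_a, radius=1, distance=2)
cross_similarity = kernel(g_a, g_b, radius=1, distance=2)
```

With radius 1 and distance 2, each path graph yields three neighbourhood pairs. Only the pair rooted at nodes 0 and 2 is unaffected by the changed label, so the two graphs share one of three features.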

Information in the input alignment files
The alignments are generated using CMfinder.
CMfinder is an alignment tool that produces alignments of sequences sharing a consensus structure.
Every alignment contains information about:

Information in the input alignment files
The alignments are generated using CMfinder. Information contained in alignment files:
1. Secondary structure prediction.
2. Nucleotide conservation.
3. Strength of conservation.
4. Entropy of the nucleotides.
5. Covariation of the secondary structure.
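Two of the listed quantities have standard textbook definitions that a short sketch can make concrete: the Shannon entropy of an alignment column (low entropy means strong conservation) and the mutual information between two columns (a common covariation score for base pairs). The slides do not say which formulas CMfinder or MAGG actually use, so the function names, the gap handling, and the toy alignment below are assumptions.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the nucleotides in one column; gaps are skipped."""
    symbols = [c for c in column if c != "-"]
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((n / total) * log2(total / n) for n in counts.values())


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two columns; rows with a gap are skipped."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (n / total) * log2(n * total / (left[a] * right[b]))
        for (a, b), n in joint.items()
    )


# Made-up alignment: column 0 is fully conserved, columns 1 and 4 covary
# as complementary base pairs (C-G, G-C, A-U, U-A).
alignment = ["GCAAG", "GGAAC", "GAAAU", "GUAAA"]
columns = list(zip(*alignment))

conserved_entropy = column_entropy(columns[0])
variable_entropy = column_entropy(columns[1])
paired_mi = mutual_information(columns[1], columns[4])
unpaired_mi = mutual_information(columns[0], columns[1])
```

In this toy alignment the conserved column has entropy 0 bits and the variable column 2 bits. The covarying pair reaches the maximum of 2 bits of mutual information, while a conserved column carries none with any other column.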

Graphs created by MAGG
MAGG produces two different graph representations:
1. Node based graphs: N.
2. Summary based graphs: S.
Each representation can encode the information in:
1. One node.
2. A list of nodes: L.

Node based graphs
1. N encodes the information in one node.
2. N L encodes the information in a set of nodes forming a list.
Figure: N: Conservation information encoded in single nodes.
Figure: N L: Conservation and covariation information in multiple nodes forming a list.

Summary based graphs
1. S: same as N, but summary information about the structure is encoded.
2. S L: same as N L, but more summary information about the structure is encoded.
Figure: S: Conservation.
Figure: S L: Conservation and covariation.
This extra information can be the Avg, Max, or Min of the occurrence of a specific nucleotide, or of the conservation information of the alignment.
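As a rough illustration of such summary labels, the sketch below computes the average, maximum, and minimum relative frequency of one nucleotide over a group of alignment columns. How MAGG groups columns into structural elements and attaches the values to graph nodes is not described in the slides, so the function name and the example region are hypothetical.

```python
def nucleotide_summary(columns, nucleotide):
    """Avg, max and min relative frequency of `nucleotide` over the given columns."""
    freqs = [col.count(nucleotide) / len(col) for col in columns]
    return {
        "avg": sum(freqs) / len(freqs),
        "max": max(freqs),
        "min": min(freqs),
    }


# Hypothetical structural region: three alignment columns, four sequences each.
region = ["GGGG", "GGAU", "ACGU"]
summary = nucleotide_summary(region, "G")
```

Here G occurs with frequencies 1.0, 0.5, and 0.25 in the three columns, so the summary holds a maximum of 1.0 and a minimum of 0.25.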

Data description
The motif sequences are from bacteria and archaea [Weinberg 2010]. Z. Weinberg manually annotated the alignments as functional or nonfunctional, so the data are binary classified.

Data      Num. files  Num. classes  Avg. num. seqs  Avg. seq. length
Positive  308                       70              150 nucleotides
Negative  160         10            70              130 nucleotides

Evaluation
The experimental data sets were balanced: the same number of files in the positive and negative sets.
Each positive class is tested against the 10 negative classes.
In total we have 0 experiments.
The Receiver Operating Characteristic (ROC) is the performance measure.
ROC relates the true positive rate to the false positive rate.
The final ROC score is averaged over the different experiments.
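A minimal sketch of this evaluation follows, assuming that "ROC score" means the area under the ROC curve (AUC); the slides do not state this explicitly. The AUC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score (ties count half), then averaged over experiments. The function names and the scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(experiments):
    """Average the AUC over a list of (labels, scores) experiments."""
    aucs = [roc_auc(labels, scores) for labels, scores in experiments]
    return sum(aucs) / len(aucs)


# Two made-up balanced experiments (two positives, two negatives each).
experiments = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),  # every positive outranks every negative
    ([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]),  # one positive falls below one negative
]
final_score = mean_auc(experiments)
```

The first experiment gives an AUC of 1.0 and the second 0.75, so the averaged score is 0.875.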

Results (ROC)

MAGG graph  Node labels             ROC
S           N1 = cons               0.86
N           N1 = cons               0.85
S L         N1 = cov, N2 = cons     0.69
N L         N1 = cons, N2 = sscons  0.68

Conclusion
MAGG can identify interesting ncRNAs with a ROC score of up to 86%.
Take home message: the best graph representation is
1. Summary based: S.
2. Labelled with the conservation information.
The tool can be used as a powerful prefiltering method for large amounts of alignments.

Current work
Integrating MAGG into the IPython environment.
Integrating automated alignment of input sequences into MAGG.
Encoding finer structural information such as hairpins, bulges, and loops to improve the classification.

Acknowledgement
Prof. Dr. Rolf Backofen
Dr. Fabrizio Costa
Dr. Zasha Weinberg

Questions
Thank you for your attention.