RNAscClust: clustering RNAs using structure conservation and graph-based motifs

1 RNAscClust: clustering RNAs using structure conservation and graph-based motifs
Milad Miladi (1), Alexander Junge (2)
miladim@informatik.uni-freiburg.de, ajunge@rth.dk
(1) Bioinformatics Group, University of Freiburg
(2) Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen
31st TBI Winterseminar, Bled, Slovenia, February

2 Clustering ncRNA sequences using structure conservation
GraphClust [Heyne et al., Bioinformatics, 2012]:
- Clusters ncRNA sequences
- Can find paralogs belonging to the same ncRNA class
- Features based on local sequence and structure
RNAscClust:
- Clusters paralogous RNA sequences structurally aligned to their orthologs
- Extends the GraphClust approach: derives evolutionarily conserved sequence and secondary structure
[Figure: a single human sequence next to an alignment of human, chimp, pig, and mouse sequences showing covariation and a consensus structure]

5 Single sequence vs. alignment clustering
[Figure, two paths:
- Structure prediction from a single sequence (human) -> wrong structure -> single sequence clustering -> wrong clustering (cluster 1, cluster 2)
- Structure prediction based on a multiple alignment (human, chimp, pig, mouse; covariation, consensus structure) -> set of correct structures -> multiple sequence alignment clustering -> correct clustering (cluster 1, cluster 2, cluster 3)]

7 Identifying similarities of secondary structures
- Neighborhood Subgraph Pairwise Distance (NSPD) Graph Kernel [Costa and De Grave, Proceedings ICML '10, 2010]
- Extract substructures; intuitively: structure k-mers
- ncRNAs are highly similar if they share many substructures

8 Measuring structure similarity of multiple alignments
- PETfold [Seemann et al., NAR, 2008] gives base pair reliabilities for each alignment (human, chimp, pig, mouse)
- A threshold t on the base pair reliabilities yields structure constraints
- Constrained folding with RNAfold [Lorenz et al., Alg. for Mol. Biol., 2011] for each sequence
- Use the NSPD Graph Kernel to compare alignments
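The thresholding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PETfold's or RNAfold's interface: the function names, the dictionary input format, the `>= t` comparison, and the greedy handling of conflicting pairs are all hypothetical choices made for the example.

```python
def constraint_from_reliabilities(length, reliabilities, t):
    """Build a dot-bracket constraint from base pairs with reliability >= t.

    reliabilities maps 0-based alignment column pairs (i, j) to a value in [0, 1].
    Assumption: conflicts are resolved greedily, most reliable pair first.
    """
    chosen = []
    used = set()
    for (i, j), r in sorted(reliabilities.items(), key=lambda kv: -kv[1]):
        if r < t:
            break  # all remaining pairs are below the threshold
        if i > j:
            i, j = j, i
        if i in used or j in used:
            continue  # each column may pair at most once
        if any(a < i < b < j or i < a < j < b for a, b in chosen):
            continue  # skip pairs that would cross a chosen pair
        chosen.append((i, j))
        used.update((i, j))
    constraint = ["."] * length
    for i, j in chosen:
        constraint[i] = "("
        constraint[j] = ")"
    return "".join(constraint)


def project_constraint(aligned_seq, constraint, gap="-"):
    """Map an alignment-level constraint onto one gapped sequence.

    Gap columns are removed; a pair with a gapped partner becomes unpaired.
    """
    stack, partner = [], {}
    for idx, ch in enumerate(constraint):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            j = stack.pop()
            partner[idx], partner[j] = j, idx
    out = []
    for idx, (nt, ch) in enumerate(zip(aligned_seq, constraint)):
        if nt == gap:
            continue
        if ch in "()" and aligned_seq[partner[idx]] == gap:
            ch = "."
        out.append(ch)
    return "".join(out)
```

With reliabilities `{(0, 7): 0.95, (1, 6): 0.9, (2, 5): 0.4}` on 8 columns and `t = 0.8`, the constraint is `((....))`. The projected string for each sequence could then be passed to a constrained folding tool.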

10 RNAscClust pipeline: from input alignments to clustering
Set of multiple alignments -> Reliable base pairs -> Constrained folding -> NSPD Graph Kernel -> Pairwise similarities -> GraphClust postprocessing steps -> Clustering of multiple alignments

11 Constructing a benchmark data set
Split Rfam 12 family seed alignments into subalignments:
1. Each subalignment contains one human sequence
2. Similar sequences from different species form a subalignment, so that subalignments have maximal sequence identity
Ideal clustering groups only subalignments from the same Rfam family.
[Figure: an Rfam family seed alignment with human, chimp, mouse, and pig sequences is split into family subalignments]

14 Comparing sequence to alignment clustering
- Subalignments in benchmark set -> cluster with RNAscClust
- Extract human sequences -> human sequences in benchmark set -> cluster with GraphClust
- Compare clustering performance

15 Low covariation in the benchmark data set
- Benchmark set has high Average Pairwise Sequence Identity (APSI) in alignments
- High APSI set: 48 families, 234 alignments
- Limit APSI to study the effect of covariation on clustering performance
[Figure: histogram of APSI per alignment (based on Rfam alignments); y-axis: count; mean APSI marked]
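For reference, an average pairwise sequence identity can be computed from the rows of an alignment as sketched below. The slides do not state which identity definition was used, so the normalization here (matching columns divided by columns where at least one of the two sequences has a residue) is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns between two aligned sequences.

    Assumption: columns that are gaps in both sequences are ignored.
    """
    cols = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not cols:
        return 0.0
    matches = sum(1 for x, y in cols if x == y)
    return matches / len(cols)


def average_pairwise_identity(alignment):
    """Mean pairwise identity over all sequence pairs of an alignment."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

For the toy alignment `["GCAU", "GCAU", "GCAA"]` the three pairwise identities are 1.0, 0.75, and 0.75, so the average is about 0.83.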

17 Benchmark sets with different degrees of covariation
Create 2 additional benchmark data sets with controlled APSI in alignments:
- Medium APSI: 26 families, 166 alignments
- Low APSI: 10 families, 92 alignments
[Figure: one histogram per set of APSI per alignment (based on Rfam alignments); y-axis: count; mean APSI marked]

18 Clustering performance metrics - V-measure
- Homogeneity H: each cluster contains only members of a single family
- Completeness C: all members of a given family are in the same cluster
- V-measure [Rosenberg and Hirschberg, 2007] is the harmonic mean of H and C:
\[ V = \frac{2 \cdot H \cdot C}{H + C} \]

19 Clustering performance metrics - Adjusted Rand Index
- a = number of object pairs from the same family assigned to the same cluster
- b = number of object pairs from different families assigned to different clusters
- n = number of alignments
\[ \text{Rand Index} = \frac{a + b}{\binom{n}{2}} \]
- The Adjusted Rand Index [Hubert and Arabie, 1985] is the Rand Index [Rand, 1971] adjusted for chance
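Both indices can be computed directly from two label lists, as in the following sketch. The function names are hypothetical; the adjusted index uses the standard contingency-table form of Hubert and Arabie, and the code assumes at least two objects.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(families, clusters):
    """(a + b) / C(n, 2): fraction of object pairs treated consistently."""
    n = len(families)
    agree = 0
    for i, j in combinations(range(n), 2):
        same_family = families[i] == families[j]
        same_cluster = clusters[i] == clusters[j]
        if same_family == same_cluster:
            agree += 1  # counts both a (same/same) and b (different/different)
    return agree / comb(n, 2)


def adjusted_rand_index(families, clusters):
    """Rand Index corrected for the agreement expected by chance."""
    n = len(families)
    joint = Counter(zip(families, clusters))
    sum_joint = sum(comb(v, 2) for v in joint.values())
    sum_fam = sum(comb(v, 2) for v in Counter(families).values())
    sum_clu = sum(comb(v, 2) for v in Counter(clusters).values())
    expected = sum_fam * sum_clu / comb(n, 2)
    max_index = (sum_fam + sum_clu) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every object is its own group
    return (sum_joint - expected) / (max_index - expected)
```

For families `["a", "a", "b", "b"]` and clusters `[0, 1, 0, 1]`, only two of the six pairs agree, so the Rand Index is 1/3 while the Adjusted Rand Index is -0.5, which shows how the adjustment penalizes a clustering that is worse than chance.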

20 More covariation improves RNAscClust performance
[Figure: V-measure and Adjusted Rand Index (axes up to 1.0) of RNAscClust and GraphClust on the Low APSI (0.49), Medium APSI (0.64), and High APSI (0.81) benchmark sets]

21 RNAscClust Conclusion
- Leverage conserved secondary structure derived from multiple alignments
- NSPD Graph Kernel as similarity measure
- Improved clustering compared to GraphClust
- Next step: genome-scale clustering of potential ncRNAs

23 Acknowledgements
- Bioinformatics Group, University of Freiburg: Milad Miladi, Fabrizio Costa, Rolf Backofen
- RTH, University of Copenhagen: Stefan Seemann, Jakob Hull Havgaard, Jan Gorodkin
- Funding: DFG, Danish Center for Scientific Computing, Innovation Fund Denmark, Danish Cancer Society
Thank you for your attention!

26 Identifying similarities of secondary structures
- Neighborhood Subgraph Pairwise Distance (NSPD) Kernel, used in GraphClust [Heyne et al., Bioinformatics, 2012]
- Extract substructures (structure k-mers); the graph kernel maps them to a sparse feature vector
- ncRNAs are highly similar if they share many substructures
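The idea of comparing sparse feature vectors of shared substructures can be illustrated with a deliberately simplified stand-in. The sketch below is not the NSPD kernel, which works on pairs of neighborhood subgraphs of the structure graph; it only counts linear k-mers over (nucleotide, structure symbol) pairs and compares two count vectors with a normalized dot product. All names and the choice k = 3 are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_features(sequence, structure, k=3):
    """Sparse count vector of k-mers over (nucleotide, dot-bracket) symbols."""
    symbols = [nt + st for nt, st in zip(sequence, structure)]
    return Counter(tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1))


def normalized_kernel(f1, f2):
    """Normalized dot product of two sparse feature vectors (0 = nothing shared)."""
    dot = sum(count * f2[feat] for feat, count in f1.items())
    norm = sqrt(sum(v * v for v in f1.values())) * sqrt(sum(v * v for v in f2.values()))
    return dot / norm if norm else 0.0
```

Two RNAs with identical sequence and structure score 1, and two RNAs without any shared feature score 0. A graph kernel follows the same pattern but takes its features from the structure graph, so that base-paired positions far apart in sequence can contribute to the same feature.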

27 RNAscClust full pipeline
Steps, in the order shown:
- Input multiple alignments
- Predicting sets of secondary structures
- Sparse feature vectorization
- Locality-sensitive hashing
- Candidate clusters of multiple alignments
- Candidate clusters of representatives
- Cluster refinement and extension
- Report clusters
Iterate until no candidate is left.
Legend: RNAscClust-specific input and methodology; GraphClust methodology. Steps executed in parallel are shown as stacks.

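28 V-measure
Clusters K = {K_1, ..., K_m}; true classes C = {C_1, ..., C_n}
Homogeneity h is defined as:
\[ h = \begin{cases} 1 & \text{if } H(C \mid K) = 0 \\ 1 - \dfrac{H(C \mid K)}{H(C)} & \text{else} \end{cases} \]
H(C|K) is the conditional entropy of the classes given the clustering and H(C) is the entropy of the classes, i.e.,
\[ H(C \mid K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c'=1}^{|C|} a_{c'k}} \]
\[ H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \]
Here a_{ck} is the number of objects of class c assigned to cluster k and N is the total number of objects.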

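29 V-measure
Clusters K = {K_1, ..., K_m}; true classes C = {C_1, ..., C_n}
Completeness c, on the other hand, is defined as:
\[ c = \begin{cases} 1 & \text{if } H(K \mid C) = 0 \\ 1 - \dfrac{H(K \mid C)}{H(K)} & \text{else} \end{cases} \]
where H(K|C) is the conditional entropy of the clustering given the classes and H(K) is the entropy of the clustering, i.e.,
\[ H(K \mid C) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{k'=1}^{|K|} a_{ck'}} \]
\[ H(K) = -\sum_{k=1}^{|K|} \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \log \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \]
Here a_{ck} is the number of objects of class c assigned to cluster k and N is the total number of objects.
V-measure is the harmonic mean of homogeneity and completeness and is not normalized with respect to random labeling. 0.0 is as bad as it can be, 1.0 is perfect.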
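The entropy-based definitions translate into a short Python sketch. The function names are hypothetical, and the counts a_ck are derived from two parallel label lists (true class and assigned cluster per object) rather than from a precomputed table.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """H(X) of a label list."""
    n = len(labels)
    return -sum((v / n) * log(v / n) for v in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), computed from the joint counts a_ck."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((a / n) * log(a / given_counts[g]) for (_, g), a in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure)."""
    h_c_given_k = _conditional_entropy(classes, clusters)
    h_k_given_c = _conditional_entropy(clusters, classes)
    homogeneity = 1.0 if h_c_given_k == 0 else 1.0 - h_c_given_k / _entropy(classes)
    completeness = 1.0 if h_k_given_c == 0 else 1.0 - h_k_given_c / _entropy(clusters)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v
```

Putting all objects of two classes into one cluster gives completeness 1 but homogeneity 0, so the V-measure is 0. This illustrates why the harmonic mean is used: a clustering cannot score well by satisfying only one of the two criteria.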

30 Constructing a benchmark data set
Split each Rfam 12 family seed alignment into subalignments. Similar sequences from different species form a subalignment.
1) Each sequence in the alignment is represented as a node in a graph
2) Remove sequences with pairwise sequence identity (PSI) > 0.95
3) Add an edge between sequences from different species with PSI in (0.9, 0.95]
4) Search for cliques in the graph
5) Add the clique with maximal PSI as a subalignment to the benchmark data set
6) Add an edge between sequences from different species with PSI in (0.8, 0.9]
7) Add the clique as a subalignment to the benchmark data set
8) Add an edge between sequences from different species with PSI in (0.7, 0.8]
[Figure: graph of human, chimp, mouse, and pig sequences of an Rfam family seed alignment; the cliques found become the family subalignments]
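The tiered clique search can be sketched as follows. This is one possible reading of the steps, not the authors' implementation: the slides do not say how near-identical sequences are removed, how ties are broken, or how the search ends, so the sketch drops the second sequence of every pair with PSI > 0.95, keeps only cliques that contain a human sequence, scores a clique by its mean PSI, and stops after the last tier. All function names, the input format, and the species label "human" are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj, nodes):
    """Enumerate maximal cliques among `nodes` with plain Bron-Kerbosch."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in sorted(p):
            neighbors = adj[v] & nodes
            expand(r | {v}, p & neighbors, x & neighbors)
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return found


def mean_psi(clique, psi):
    """Mean pairwise identity inside a clique of at least two sequences."""
    pairs = list(combinations(sorted(clique), 2))
    return sum(psi.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def build_subalignments(species, psi, tiers=((0.9, 0.95), (0.8, 0.9), (0.7, 0.8))):
    """species: id -> species name; psi: frozenset({id1, id2}) -> identity."""
    alive = set(species)
    # Assumed reading of step 2: keep one sequence of each near-identical pair.
    for a, b in combinations(sorted(species), 2):
        if a in alive and b in alive and psi.get(frozenset((a, b)), 0.0) > 0.95:
            alive.discard(b)
    adj = {s: set() for s in alive}
    subalignments = []
    for low, high in tiers:
        # Add edges between sequences of different species in this PSI tier.
        for a, b in combinations(sorted(alive), 2):
            if species[a] != species[b] and low < psi.get(frozenset((a, b)), 0.0) <= high:
                adj[a].add(b)
                adj[b].add(a)
        # Repeatedly take the best remaining clique that contains a human sequence.
        while True:
            cliques = [c for c in maximal_cliques(adj, alive)
                       if len(c) > 1 and any(species[s] == "human" for s in c)]
            if not cliques:
                break
            best = max(cliques, key=lambda c: mean_psi(c, psi))
            subalignments.append(sorted(best))
            alive = alive - best
    return subalignments
```

Because edges only connect sequences from different species, every clique contains at most one sequence per species, which matches the requirement that a subalignment holds exactly one human sequence.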


More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

Phylogenetic trees 07/10/13

Phylogenetic trees 07/10/13 Phylogenetic trees 07/10/13 A tree is the only figure to occur in On the Origin of Species by Charles Darwin. It is a graphical representation of the evolutionary relationships among entities that share

More information

Towards firm foundations for the rational design of RNA molecules

Towards firm foundations for the rational design of RNA molecules 39 28 4 8 282 28 98 32 3 36 68 3 82 : (405)-43-> 43 42 4 40 8 25 37 46 26 36 24 27 E23_9 34 22 3 33 32 E23_2 2 48 o 9 Towards firm foundations for the rational design of RN molecules 47 2 E23_6 E23_ 45

More information

Markov Random Fields

Markov Random Fields Markov Random Fields Umamahesh Srinivas ipal Group Meeting February 25, 2011 Outline 1 Basic graph-theoretic concepts 2 Markov chain 3 Markov random field (MRF) 4 Gauss-Markov random field (GMRF), and

More information

December 2, :4 WSPC/INSTRUCTION FILE jbcb-profile-kernel. Profile-based string kernels for remote homology detection and motif extraction

December 2, :4 WSPC/INSTRUCTION FILE jbcb-profile-kernel. Profile-based string kernels for remote homology detection and motif extraction Journal of Bioinformatics and Computational Biology c Imperial College Press Profile-based string kernels for remote homology detection and motif extraction Rui Kuang 1, Eugene Ie 1,3, Ke Wang 1, Kai Wang

More information

Protein Complex Identification by Supervised Graph Clustering

Protein Complex Identification by Supervised Graph Clustering Protein Complex Identification by Supervised Graph Clustering Yanjun Qi 1, Fernanda Balem 2, Christos Faloutsos 1, Judith Klein- Seetharaman 1,2, Ziv Bar-Joseph 1 1 School of Computer Science, Carnegie

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

Computational Design of New and Recombinant Selenoproteins

Computational Design of New and Recombinant Selenoproteins Computational Design of ew and Recombinant Selenoproteins Rolf Backofen and Friedrich-Schiller-University Jena Institute of Computer Science Chair for Bioinformatics 1 Computational Design of ew and Recombinant

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Multiple Whole Genome Alignment

Multiple Whole Genome Alignment Multiple Whole Genome Alignment BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 206 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by

More information

EVALUATION OF RNA SECONDARY STRUCTURE MOTIFS USING REGRESSION ANALYSIS

EVALUATION OF RNA SECONDARY STRUCTURE MOTIFS USING REGRESSION ANALYSIS EVLTION OF RN SEONDRY STRTRE MOTIFS SIN RERESSION NLYSIS Mohammad nwar School of Information Technology and Engineering, niversity of Ottawa e-mail: manwar@site.uottawa.ca bstract Recent experimental evidences

More information

Example: multivariate Gaussian Distribution

Example: multivariate Gaussian Distribution School of omputer Science Probabilistic Graphical Models Representation of undirected GM (continued) Eric Xing Lecture 3, September 16, 2009 Reading: KF-chap4 Eric Xing @ MU, 2005-2009 1 Example: multivariate

More information

Comparative Network Analysis

Comparative Network Analysis Comparative Network Analysis BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by

More information

RNA Protein Interaction

RNA Protein Interaction Introduction RN binding in detail structural analysis Examples RN Protein Interaction xel Wintsche February 16, 2009 xel Wintsche RN Protein Interaction Introduction RN binding in detail structural analysis

More information

GEOMETRY OF PROTEIN STRUCTURES. I. WHY HYPERBOLIC SURFACES ARE A GOOD APPROXIMATION FOR BETA-SHEETS

GEOMETRY OF PROTEIN STRUCTURES. I. WHY HYPERBOLIC SURFACES ARE A GOOD APPROXIMATION FOR BETA-SHEETS GEOMETRY OF PROTEIN STRUCTURES. I. WHY HYPERBOLIC SURFACES ARE A GOOD APPROXIMATION FOR BETA-SHEETS Boguslaw Stec 1;2 and Vladik Kreinovich 1;3 1 Bioinformatics Program 2 Chemistry Department 3 Department

More information

Bearing fault diagnosis based on EMD-KPCA and ELM

Bearing fault diagnosis based on EMD-KPCA and ELM Bearing fault diagnosis based on EMD-KPCA and ELM Zihan Chen, Hang Yuan 2 School of Reliability and Systems Engineering, Beihang University, Beijing 9, China Science and Technology on Reliability & Environmental

More information

Conditional Graphical Models

Conditional Graphical Models PhD Thesis Proposal Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing

More information

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas n Introduction to Bioinformatics lgorithms Multiple lignment Slides revised and adapted to Bioinformática IS 2005 na eresa Freitas n Introduction to Bioinformatics lgorithms Outline Dynamic Programming

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

BarMap & SundialsWrapper

BarMap & SundialsWrapper BarMap & SundialsWrapper advanced RN folding kinetics Stefan Badelt Institute for Theoretical hemistry Theoretical Biochemistry roup stef@tbi.univie.ac.at February 17, 215 1 Outline BarMap.pm & SundialsWrapper.pm

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 1: Introduction to Graphical Models Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ epartment of ngineering University of ambridge, UK

More information

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity CS322 Project Writeup Semih Salihoglu Stanford University 353 Serra Street Stanford, CA semih@stanford.edu

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

4th IMPRS Astronomy Summer School Drawing Astrophysical Inferences from Data Sets

4th IMPRS Astronomy Summer School Drawing Astrophysical Inferences from Data Sets 4th IMPRS Astronomy Summer School Drawing Astrophysical Inferences from Data Sets William H. Press The University of Texas at Austin Lecture 6 IMPRS Summer School 2009, Prof. William H. Press 1 Mixture

More information