Università della Calabria

Università della Calabria, Facoltà di Ingegneria
BIOINFORMATICS TECHNIQUES AND METHODOLOGIES
Research group coordinated by Prof. Luigi Palopoli
Lecturer: Simona Rombo

OUTLINE
1. Introduction to Bioinformatics
2. Pattern discovery
   - Strings
   - Images
3. Biological Networks Analysis
   - Network alignment
   - Network clustering

Introduction to Bioinformatics
Donald Knuth, 1993: "It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at people's fingertips, that it won't be pretty much working on refinement of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on."

Introduction to Bioinformatics
There are several facts about biology that are important to keep in mind:
- In biology there are no rules without exceptions
- In reasoning with biological structures, looking for generalizations may often be misleading
- It is often impossible to look at a biological phenomenon in isolation, because it may take place only as long as other related phenomena take place as well, and these need to be taken into account too
- Reasoning with incomplete information is the rule rather than the exception
- In reasoning about biological structures and functions it is important to bear in mind the pervasive role of evolution

Introduction to Bioinformatics
A definition: "Bioinformatics is the combination of biology and information technology. It is the branch of science that deals with computer-based analysis of large biological data sets. Bioinformatics incorporates the development of databases to store and search data, and statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways." (R.M. Twyman)
In most cases, the computer-based tools developed in bioinformatics require expert human intervention to solve the problems they address.

Introduction to Bioinformatics
Generally speaking, the aim of bioinformatics is to help biologists in gathering and processing biological data and to aid in studying protein structures and interactions in order to allow optimal drug design.

Introduction to Bioinformatics
Here is a summary of CS methods and techniques relevant to bioinformatics:
- String algorithms, grammars and automata
- Indexing methods and query optimization
- Integration techniques
- Optimization techniques
- Dynamic programming and heuristics
- Data mining and machine learning techniques
- Probability and statistics-based methods
- Computational geometry methods
- Text mining

Introduction to Bioinformatics
Two main points of view:
1. Cellular components (e.g., DNA, RNA, proteins)
2. Interaction of cellular components (e.g., metabolic pathways, protein-protein interactions)

Introduction to Bioinformatics
Cellular Components
[Figure: DNA]

Introduction to Bioinformatics
Cellular Components: AMINO ACIDS
- Proteins are the core structures determining the cell lifecycle
- They are made up of elementary units called amino acids, or residues (a few exceptions exist)
- There are 20 standard amino acids in nature

Introduction to Bioinformatics
Interactions of components
- Another perspective is the analysis of mutual interactions among proteins
- Proteins are involved in complexes performing specific biological functions
[Figure label: Saccharomyces cerevisiae]

Pattern Discovery

Pattern discovery
Efficient data structures: Trie
- A tree data structure used to store strings
- Each edge has a label representing a symbol
- Two edges out of the same node have distinct labels
- Each node, except the root, is associated with a string
- Concatenating all the symbols on the path from the root to a node n gives the string corresponding to n
- All the descendants of the same node n are associated with strings having a common prefix, i.e., the string corresponding to n

Pattern discovery
Example: a trie storing the words {to, te, tea, ten, hi, he, her}:

(root)
  t
    o        -> to
    e        -> te
      a      -> tea
      n      -> ten
  h
    i        -> hi
    e        -> he
      r      -> her
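The trie of the example can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `Trie` and the method `words_with_prefix` are hypothetical names chosen here, and nested dictionaries are just one of several possible representations of the edges.

```python
class Trie:
    """Trie node: one outgoing edge per symbol, stored in a dictionary."""

    def __init__(self):
        self.children = {}    # symbol -> child node (edge labels are distinct)
        self.is_word = False  # True if the path from the root spells a stored word

    def insert(self, word):
        node = self
        for symbol in word:
            node = node.children.setdefault(symbol, Trie())
        node.is_word = True

    def words_with_prefix(self, prefix):
        """Return, in sorted order, the stored words that start with `prefix`."""
        node = self
        for symbol in prefix:
            if symbol not in node.children:
                return []
            node = node.children[symbol]
        found = []
        stack = [(node, prefix)]
        while stack:  # visit every descendant of the node spelling `prefix`
            current, spelled = stack.pop()
            if current.is_word:
                found.append(spelled)
            for symbol, child in current.children.items():
                stack.append((child, spelled + symbol))
        return sorted(found)


trie = Trie()
for word in ["to", "te", "tea", "ten", "hi", "he", "her"]:
    trie.insert(word)

print(trie.words_with_prefix("te"))  # ['te', 'tea', 'ten']
```

The query illustrates the last property of the definition: all the words below the node for "te" share the prefix "te".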

Pattern discovery
Efficient data structures: Suffix Tree
- Given a string s of n characters on the alphabet Σ, a suffix tree T associated with s can be defined as a trie containing all the n suffixes of s
- For each leaf i of T, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix si of s
- For any pair of suffixes of s, the path associated with their longest common prefix is shared in T
(Example on the string abbababbab)
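Taking the definition literally (a trie containing all the suffixes), a naive construction for the example string abbababbab might look as follows. This is a sketch under stated assumptions: it builds an uncompressed suffix trie in quadratic time and space, not a compacted suffix tree built in linear time; the terminator symbol "$" and the function names are choices made here for illustration.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s (followed by a terminator) into a nested-dict trie."""
    root = {}
    for start in range(len(s)):
        node = root
        for symbol in s[start:] + terminator:
            node = node.setdefault(symbol, {})
    return root


def contains(root, pattern):
    """A pattern occurs in s exactly when it is a prefix of some suffix of s."""
    node = root
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


def count_leaves(node):
    """Each leaf corresponds to one suffix, thanks to the terminator."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


suffix_trie = build_suffix_trie("abbababbab")
print(contains(suffix_trie, "bab"))   # True
print(contains(suffix_trie, "aa"))    # False
print(count_leaves(suffix_trie))      # 10, one leaf per suffix
```

The terminator guarantees that no suffix ends in the middle of another one, so the number of leaves equals the number n of suffixes.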

Pattern Discovery
Problem: the size of the output is often exponential in the input size.

Pattern Discovery
2D Array

Definition of maximal motif
[Figure labels: MAXIMAL; not maximal in composition; not maximal in length]

BASIS
- A basis of an image I is a set of irredundant motifs able to generate all the other motifs of I
- It is possible to prove that each image has ONLY ONE basis: the basis is unique
- The size of the basis is linear in the size of the image: if I has size N, the number of motifs in the basis is O(N)
- In general, the number of motifs with don't cares in I is exponential in N
- An important problem is the extraction of the basis from I

A key concept: autocorrelation
Autocorrelations: the meet between I and all its bites
[Figure: an example image with two bites P and Q and the meet between P and Q; the array contents are not recoverable from the extracted text]
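Since the figure is lost, a small sketch may help fix the idea of a meet. It assumes, following the usual definitions in the cited 2D motif papers and restated here as an assumption, that the consensus of two equal-size arrays keeps a symbol where the arrays agree and writes a don't care elsewhere, and that the meet is the consensus stripped of border rows and columns made only of don't cares. The arrays `P` and `Q` below are hypothetical, not those of the slide, and "." stands for the don't care.

```python
DONT_CARE = "."


def consensus(p, q):
    """Position-wise agreement of two equal-size arrays (lists of strings)."""
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("arrays must have the same size")
    return ["".join(a if a == b else DONT_CARE for a, b in zip(row_p, row_q))
            for row_p, row_q in zip(p, q)]


def meet(p, q):
    """Consensus with all-don't-care border rows and columns trimmed away."""
    c = consensus(p, q)
    rows = [i for i, row in enumerate(c) if any(ch != DONT_CARE for ch in row)]
    if not rows:
        return []  # the two arrays disagree everywhere
    c = c[rows[0]:rows[-1] + 1]
    cols = [j for j in range(len(c[0])) if any(row[j] != DONT_CARE for row in c)]
    return [row[cols[0]:cols[-1] + 1] for row in c]


P = ["abab",
     "bbba",
     "abaa"]
Q = ["bbab",
     "abba",
     "bbab"]

print(consensus(P, Q))  # ['.bab', '.bba', '.ba.']
print(meet(P, Q))       # ['bab', 'bba', 'ba.']
```

Only border don't cares are removed; the don't care inside the last row of the result is part of the motif.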

Consensus, Meet, Autocorrelation
Projection at (i1, j1) and (i2, j2)

Basic Approach
Theorem: the basis is a subset of the set of autocorrelations
Three steps:
1. Generate all the autocorrelations of the input image I: O(N^2)
2. Compute the lists of occurrences of the autocorrelations: ?
3. Discard redundant motifs: O(N^2)

Second step
1) Fischer & Paterson: O(N^2 log n log log n)
2) Incremental building of the set B of irredundant motifs: O(N^3)
   [Figure: the example image scanned by row i and column j, with the sets Bij and Bij+1; array contents not recoverable from the extracted text]
3) Exploit some properties of don't cares: O(N^2), but only for binary alphabets

Optimal Approach
Exploit some properties holding for |Σ| = 2 (e.g., Σ = {a, b})

Optimal Approach - Example
d1 = 2
Is (2, 2) an occurrence of A34? d2 = 0, d3 = 2
Is (2, 4) an occurrence of A34? d2 = 1, d3 = 1

Optimal Approach
Three steps:
1. Generate all the autocorrelations of the input image I: O(N^2)
2. Compute the lists of occurrences of the autocorrelations: O(N^2), only for black-and-white images
3. Discard redundant motifs: O(N^2)
Overall cost: O(N^2)

Image Compression
Main idea: exploit the motif basis as 2D patches

Pattern discovery
References:
- A. Amelio, A. Apostolico and S. E. Rombo. Image Compression by 2D Motif Basis. In Proceedings of IEEE Data Compression Conference (DCC 2011), IEEE CS Press, Snowbird, UT, USA, 2011 (forthcoming).
- A. Apostolico, L. Parida and S. E. Rombo. Motif Patterns in 2D. Theoretical Computer Science, 2008.
- S. E. Rombo. Optimal extraction of motif patterns in 2D. Inf. Process. Lett. 109(17): 1015-1020 (2009).
- A. Apostolico and L. Parida. Incremental Paradigms of Motif Discovery. J. of Comp. Biol. 11:1 (2004) 15-25.
- A. Amir and M. Farach. Two-dimensional dictionary matching. Inf. Process. Lett. 44:5 (1992) 233-239.
- M. J. Fischer and M. S. Paterson. String Matching and Other Products. In: R. M. Karp (Ed.), Complexity of Computation (SIAM-AMS Proceedings, v. 7), 1974, pp. 113-125.

Pattern discovery
Further topics (from 2009 onward):
- Image compression
- Analysis of biological images
- Pattern discovery/matching on images with rotations, scaling and other variants
- Techniques applied to similarity search between images
- Pattern discovery (motif extraction) on biological strings

Biological Networks Analysis
PPI networks similarity search
- Evolution influences protein-protein interactions
- Proteins cannot be analyzed independently
- Both high-throughput and computational methods contribute to discovering and predicting protein-protein interactions

Biological Networks Analysis
The Interaction Network of an organism: nodes = proteins, edges = interactions

Biological Networks Analysis
Why search for similarity between proteins belonging to different PPI networks?
To identify functional conservation across species.

Biological Networks Analysis
Our basic idea
Two proteins p1 and p2 in two different PPI networks may be considered similar if:
- p1 and p2 have similar sequences
- the proteins that p1 and p2 are connected with, i.e., their neighborhoods, have similar sequences

Biological Networks Analysis
Refining protein similarities
S = sequence similarity
S' = refined similarity

Biological Networks Analysis
The Graph Network
- P = a set of nodes labeled by protein ids
- I = a set of undirected edges labeled <w, c>, with w, c ∈ [0,1]
  w = weakness
  c = confidence
Graph Network: GN = <P, I>

Biological Networks Analysis
Interaction Path_i (I-Path_i)
A path such that: F(i-1) < Σu wu ≤ F(i), i ≥ 1, F(0) = 0
Example, with F(x) = x^2 and i = 1:
- <p2, p1, p4>: satisfied
- <p3, p4, p5, p6>: satisfied
- <p4, p5, p9>: not satisfied
[Figure: a graph on nodes p1, ..., p9 with edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>; the assignment of labels to edges is not recoverable from the extracted text]

Biological Networks Analysis
Cumulative Confidence
Given an I-Path_i: C = Πu cu
Example, on the same graph, with F(x) = x^2 and i = 1:
For the path <p2, p1, p4>: C = 0.4 * 0.7 = 0.28
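The two path measures can be checked with a short sketch. Several things are assumptions made here: the relational symbols of the I-Path_i condition are lost in the extracted text and are read as F(i-1) < Σu wu ≤ F(i); the two edges of the path <p2, p1, p4> are taken to carry the labels <0.8, 0.4> and <0.2, 0.7>, which agrees with the stated product 0.4 * 0.7 but cannot be verified without the figure; the function names and the tolerance `eps` are hypothetical.

```python
import math


def is_i_path(edge_labels, i, F, eps=1e-9):
    """Check F(i-1) < sum of weaknesses <= F(i) for a path given as (w, c) pairs."""
    total_weakness = sum(w for w, _ in edge_labels)
    return F(i - 1) + eps < total_weakness <= F(i) + eps


def cumulative_confidence(edge_labels):
    """Product of the confidences along the path."""
    return math.prod(c for _, c in edge_labels)


def square(x):
    return x * x  # F(x) = x^2, as in the example


# Assumed labels <w, c> of the edges (p2, p1) and (p1, p4).
path_p2_p1_p4 = [(0.8, 0.4), (0.2, 0.7)]

print(is_i_path(path_p2_p1_p4, 1, square))             # True: 0 < 1.0 <= 1
print(round(cumulative_confidence(path_p2_p1_p4), 2))  # 0.28
```

Under these assumptions the total weakness of the path is exactly F(1) = 1, which suggests that the upper bound of the condition is not strict.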

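Biological Networks Analysis
i-th Neighborhood
Given a node p in GN = <P, I>:
N(p, i) = {q | q ∈ P, q ≠ p, <p, q> is an I-Path_i in GN with minimum Σu wu}
Example, with F(x) = x^2 and i = 1: N(p3, i) = {p1, p2, p4, p6}
[Figure: a graph on nodes p1, ..., p6 with edge labels <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>; the assignment of labels to edges is not recoverable from the extracted text]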
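One possible reading of this definition is that q belongs to N(p, i) when the minimum-weakness path from p to q has total weakness in the interval (F(i-1), F(i)]. Under that assumption the neighborhoods can be computed with Dijkstra's algorithm, as in the following sketch. The toy graph, the node names and the function names are hypothetical, and F is taken to be the identity.

```python
import heapq


def min_weakness(graph, source):
    """Dijkstra: minimum total weakness from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weakness in graph.get(node, {}).items():
            candidate = d + weakness
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def neighborhood(graph, p, i, F, eps=1e-9):
    """Nodes whose minimum-weakness path from p satisfies F(i-1) < total <= F(i)."""
    dist = min_weakness(graph, p)
    return {q for q, d in dist.items()
            if q != p and F(i - 1) + eps < d <= F(i) + eps}


# Hypothetical undirected graph: node -> {neighbor: weakness}.
toy_graph = {
    "a": {"b": 0.5, "c": 1.0},
    "b": {"a": 0.5, "d": 0.5, "e": 1.0},
    "c": {"a": 1.0},
    "d": {"b": 0.5},
    "e": {"b": 1.0},
}


def identity(x):
    return x


print(sorted(neighborhood(toy_graph, "a", 1, identity)))  # ['b', 'c', 'd']
print(sorted(neighborhood(toy_graph, "a", 2, identity)))  # ['e']
```

With this reading the neighborhoods of a node for different values of i are disjoint, so each reachable protein is examined at exactly one level.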

Biological Networks Analysis
The Bi-GRAPPIN Algorithm
Let GN1 and GN2 be the graph networks of two different organisms, with n1 and n2 nodes, respectively.
Align each pair of proteins (p, p'), with p ∈ GN1 and p' ∈ GN2 (e.g., by the BLAST 2 Sequences algorithm).

Biological Networks Analysis
The Bi-GRAPPIN Algorithm
INPUT: a sequence similarity dictionary SSD storing all the triplets <p, p', f0>, with p ∈ GN1, p' ∈ GN2, f0 ∈ [0,1]
  f0: obtained from sequence alignment parameters
OUTPUT: a dictionary FSD storing the triplets <p, p', fp>, with p ∈ GN1, p' ∈ GN2, fp ∈ [0,1]
  fp: functional similarity

Biological Networks Analysis
The Bi-GRAPPIN Algorithm

    FSD = SSD
    for each <p, p', f0> ∈ SSD
        if (f0 > fcut-off)
            set i = 1
            while i < imax
                generate N(p, i) and N(p', i)
                compute a bipartite graph maximum weight matching between N(p, i) and N(p', i)
                refine f0, obtaining a new value fp, according to the objective function of the maximum weight matching
                i = i + 1
    return FSD

imax: a fixed threshold value corresponding to the maximum network percentage to be analyzed
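The core step of the loop is the bipartite maximum weight matching between the two neighborhoods. The sketch below illustrates that step only, and it swaps in a brute-force search over all assignments in place of an efficient matching algorithm such as the Hungarian method, so it is suitable only for very small neighborhoods. The protein names, the weights and the function name are hypothetical; missing pairs are assumed to have weight 0.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Brute-force bipartite maximum weight matching.

    weight maps (l, r) pairs to a similarity; missing pairs count as 0.
    Returns (total weight, list of matched pairs with positive weight).
    """
    if len(left) > len(right):
        # Solve the mirrored problem so that the smaller side is enumerated.
        flipped = {(r, l): w for (l, r), w in weight.items()}
        score, pairs = max_weight_matching(right, left, flipped)
        return score, [(l, r) for r, l in pairs]
    best_score, best_pairs = 0.0, []
    for assignment in permutations(right, len(left)):
        pairs = list(zip(left, assignment))
        score = sum(weight.get(pair, 0.0) for pair in pairs)
        if score > best_score:
            best_score = score
            best_pairs = [pair for pair in pairs if weight.get(pair, 0.0) > 0.0]
    return best_score, best_pairs


# Hypothetical neighborhoods N(p, 1) (yeast) and N(p', 1) (fly).
yeast_neighbors = ["y1", "y2"]
fly_neighbors = ["f1", "f2", "f3"]
similarities = {
    ("y1", "f1"): 0.75, ("y1", "f2"): 0.25,
    ("y2", "f1"): 0.5, ("y2", "f3"): 0.25,
}

print(max_weight_matching(yeast_neighbors, fly_neighbors, similarities))
# (1.0, [('y1', 'f1'), ('y2', 'f3')])
```

Note that the greedy choice of pairing y2 with f1 would not pay off here: the best total is obtained by giving f1 to y1.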

Biological Networks Analysis
Example (1/3)
Settings: imax = 4, f0(p, p') > fcut-off, F(x) = identity, <w, c> = <1, 1>
[Figure: the yeast and fly networks with the target proteins p and p' and their neighborhoods N(·, 1)]

Biological Networks Analysis
Example (2/3)
Bipartite graph maximum weight matching between N(p, 1) and N(p', 1)
[Figure: bipartite graph between the yeast and fly neighborhoods, with edge weights 0.75, 0.22, 0.83, 0.34, 0.89, 0.85, 0.73, 0.82, 0.33, 0.65; the assignment of weights to edges is not recoverable from the extracted text]
fp(1) = δ(1) * µ(N(p, 1), N(p', 1), FSD, α) + [1 - δ(1)] * f0(p, p')
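The update rule is a weighted combination of the neighborhood score µ and the starting similarity f0. The slides do not define δ or µ, so the sketch below simply treats both as given numbers, and it assumes that δ lies in [0, 1]; the function name and the sample values are hypothetical.

```python
def refine_similarity(f0, mu, delta):
    """fp = delta * mu + (1 - delta) * f0, with delta assumed to lie in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta is assumed to lie in [0, 1]")
    return delta * mu + (1.0 - delta) * f0


# Hypothetical values: sequence similarity 0.5, neighborhood score 0.75.
print(refine_similarity(f0=0.5, mu=0.75, delta=0.5))  # 0.625
```

Under this assumption fp always lies between f0 and µ: similar neighborhoods (µ > f0) raise the similarity, dissimilar ones (µ < f0) lower it.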

Biological Networks Analysis
Example (3/3)
Settings: imax = 4, f0(p, p') > fcut-off, F(x) = identity, <w, c> = <1, 1>
The neighborhoods N(·, 1), N(·, 2) and N(·, 3) of the target proteins p (yeast) and p' (fly) are considered in turn; the resulting triplet <p, p', fp(3)> is stored in FSD.

Biological Networks Analysis
Synthetic data
- (1/3) Very similar neighborhoods: final fp greater than f0
- (2/3) High f0 but very dissimilar neighborhoods: final fp lower than f0
- (3/3) High f0, not very similar N(·, 1) but very similar N(·, 2): final fp greater than f0

Functional Orthologs
- S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428-435, 2006.
- R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB 2007. LNB, 2007.

Biological Networks Analysis
Further experiments
- Query the D. melanogaster PPI network with Abp1, for which no evident homolog has been detected
- The most similar protein based on sequence homology: CG10083 (a drebrin-like protein)
- Abp1: an actin-binding protein regulating actin nucleation
- Is it possible to find other proteins involved in actin reorganization by comparing the sub-network comprising Abp1 together with its first two neighborhoods against the entire Drosophila network?

Biological Networks Analysis
Further experiments
- Best match according to our refined similarity: CG10083 (this confirms the pairwise sequence similarity)
- Abp1 and CG10083 are both actin-binding proteins
- Other proteins of unknown function, showing low sequence similarity with Abp1, may share a similar function
- CG6873-PA: a cofilin-like protein possibly involved in cytoskeleton shaping
  SSD: <Abp1, CG6873-PA, 0.287>
  FSD: <Abp1, CG6873-PA, 0.442>

Biological Networks Analysis
Asymmetric Alignment
- Master network: guides the alignment process
- Slave network: it is aligned to the master
Some organisms are well characterized (e.g., Saccharomyces cerevisiae); this is not the case for many other organisms.
Advantage: the results retain the structural characteristics of the master network (so they are "sound").

Biological Networks Analysis
Asymmetric Alignment
Linearization of the slave network: translation of the network into a sequence of symbols
Given a linearization of the slave, find the portion of the master that can be associated with it
Motivations:
- Only the slave network is linearized; all the structural information about the master network is kept
- The approximation allows us to find similar groups of proteins, not just isomorphic structures
- The resulting algorithm has polynomial time complexity

Biological Networks Analysis
Asymmetric Alignment
Master network Alignment Model
- Weighted finite-state automaton
- The states of the model correspond to proteins
- (p1, 0), (p2, 1), ..., (p3, 0): score 1
- (p1, 0), (*, 1), ..., (*, 0): score 2
- Find the maximum scoring path (among the states of the master) for the linearization of the slave network: Viterbi algorithm
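The maximum scoring path can be found by dynamic programming in the style of the Viterbi algorithm. The sketch below is a strongly simplified illustration, not the model of the slides: it assumes that each state is a master protein, that consecutive symbols of the linearized slave must map to adjacent master proteins, and that the score of a path is the sum of hypothetical protein similarities. The (*, ...) states and the 0/1 flags shown above are not modeled, and all names and values are invented for the example.

```python
NEG_INF = float("-inf")


def viterbi_align(master_edges, similarity, slave_sequence):
    """Best-scoring walk in the master network for a linearized slave network.

    master_edges: dict mapping each master protein to the set of its neighbors.
    similarity:   dict mapping (master protein, slave protein) to a score.
    """
    if not slave_sequence:
        raise ValueError("empty linearization")
    states = sorted(master_edges)
    score = {m: similarity.get((m, slave_sequence[0]), 0.0) for m in states}
    back_pointers = []
    for symbol in slave_sequence[1:]:
        new_score, pointers = {}, {}
        for m in states:
            # Best predecessor among the master proteins adjacent to m.
            prev = max((p for p in states if m in master_edges[p]),
                       key=lambda p: score[p], default=None)
            if prev is None or score[prev] == NEG_INF:
                new_score[m], pointers[m] = NEG_INF, None
            else:
                new_score[m] = score[prev] + similarity.get((m, symbol), 0.0)
                pointers[m] = prev
        back_pointers.append(pointers)
        score = new_score
    last = max(states, key=lambda m: score[m])
    if score[last] == NEG_INF:
        raise ValueError("no admissible path in the master network")
    path = [last]
    for pointers in reversed(back_pointers):  # follow the pointers backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Hypothetical master network (a chain m1 - m2 - m3) and slave linearization.
master = {"m1": {"m2"}, "m2": {"m1", "m3"}, "m3": {"m2"}}
slave = ["s1", "s2", "s3"]
scores = {("m1", "s1"): 0.75, ("m2", "s1"): 0.25,
          ("m2", "s2"): 0.5, ("m3", "s3"): 0.75}

print(viterbi_align(master, scores, slave))  # (2.0, ['m1', 'm2', 'm3'])
```

The running time is polynomial: for each symbol of the linearization, every pair of master states is examined at most once.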

Biological Networks Analysis
Asymmetric Alignment
Global Alignment of Yeast (Master) and Fly (Slave)

Biological Networks Analysis
Asymmetric Alignment
- Yeast (as the master) vs. Fly: 945 protein pairings
- Fly (as the master) vs. Yeast: 707 protein pairings
Possible explanations:
- The Yeast network is better characterized than the Fly network; with Yeast as the slave, much structural information gets lost
- There are more regions of the Yeast that have been conserved in the Fly than vice versa, since the Fly is more complex

Biological Networks Analysis
PPI networks clustering
Aim: clustering dense regions of a given PPI network, since biologists have observed that groups of highly interacting proteins could be involved in common biological processes.

Biological Networks Analysis
Search of functional modules in PPI networks
The network is modeled by a matrix representing the interactions. The algorithm introduces the concept of quality of a sub-matrix and applies a greedy technique to discover compact regions of the network.

Biological Networks Analysis
Validation

Biological Networks Analysis
References
1. N. Ferraro, L. Palopoli, S. Panni and S. E. Rombo. Master-Slave Biological Network Alignment. In Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA 2010), 215-229, Connecticut, USA, 2010.
2. F. Bruno, L. Palopoli and S. E. Rombo. New trends in graph mining: Structural and Node-colored network motifs. International Journal of Knowledge Discovery in Bioinformatics, 1(1), 81-99, 2010.
3. C. Pizzuti and S. E. Rombo. Multi-functional Protein Clustering in PPI Networks. BIRD 2008.
4. V. Fionda, S. Panni, L. Palopoli and S. E. Rombo. Bi-GRAPPIN: Bipartite graph based protein-protein interaction networks similarity search. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07). Silicon Valley, USA, 2007.
5. C. Pizzuti and S. E. Rombo. PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07). Birmingham, UK, 16th-19th December, 2007.
6. S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428-435, 2006.

Biological Networks Analysis
Further topics (from 2009 onward):
- Alignment of biological networks
- Integration and cleaning of biological networks
- Querying of biological databases/networks
- Biological networks clustering
- RNA structure prediction
- RNA sequence/structure alignment