Università della Calabria, Facoltà di Ingegneria
BIOINFORMATICS TECHNIQUES AND METHODOLOGIES
Research group coordinated by Prof. Luigi Palopoli
Lecturer: Simona Rombo
OUTLINE
1. Introduction to Bioinformatics
2. Pattern discovery
   - Strings
   - Images
3. Biological Networks Analysis
   - Network alignment
   - Network clustering
Introduction to Bioinformatics
Donald Knuth, 1993: "It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at people's fingertips, that it won't be pretty much working on refinement of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on."
There are several facts about biology that are important to keep in mind:
- In biology there are no rules without exceptions
- In reasoning about biological structures, looking for generalizations may often be misleading
- It is often impossible to look at a biological phenomenon in isolation: it may take place only as long as other related phenomena take place as well, and these need to be taken into account too
- Reasoning with incomplete information is the rule rather than the exception
- In reasoning about biological structures and functions, it is important to bear in mind the pervasive role of evolution
A definition:
"Bioinformatics is the combination of biology and information technology. It is the branch of science that deals with computer-based analysis of large biological data sets. Bioinformatics incorporates the development of databases to store and search data, and statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways." (R.M. Twyman)
In most cases, the computer-based tools developed in bioinformatics still require expert human intervention to solve the problems they address.
Generally speaking, the aim of bioinformatics is to help biologists gather and process biological data, and to aid the study of protein structures and interactions in order to allow optimal drug design.
Here is a summary of CS methods and techniques relevant to bioinformatics:
- String algorithms, grammars and automata
- Indexing methods and query optimization
- Integration techniques
- Optimization techniques
- Dynamic programming and heuristics
- Data mining and machine learning techniques
- Probabilistic and statistical methods
- Computational geometry methods
- Text mining
Two main points of view:
1. Cellular components (e.g., DNA, RNA, proteins)
2. Interactions of cellular components (e.g., metabolic pathways, protein-protein interactions)
Cellular Components
Cellular Components: DNA
Cellular Components: Amino Acids
- Proteins are the core structures determining the cell life cycle
- They are made up of elementary units called amino acids (few exceptions exist), or residues
- There are 20 standard amino acids in nature
Interactions of components
- Another perspective is the analysis of the mutual interactions among proteins
- Proteins are involved in complexes performing specific biological functions
[Figure: protein interaction network of Saccharomyces cerevisiae]
Pattern Discovery
Efficient data structures: Trie
- A tree data structure used to store strings
- Each edge has a label representing a symbol
- Two edges out of the same node have distinct labels
- Each node, except the root, is associated with a string
- Concatenating all the symbols on the path from the root to a node n gives the string corresponding to n
- All the descendants of a node n are associated with strings having a common prefix, i.e., the string corresponding to n
Example
A trie storing the words {to, te, tea, ten, hi, he, her}:
root
  t
    o  (to)
    e  (te)
      a  (tea)
      n  (ten)
  h
    i  (hi)
    e  (he)
      r  (her)
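The trie described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names `TrieNode` and `Trie`, the `is_word` flag used to mark stored words, and the method names are assumptions introduced here, not part of the slides.

```python
class TrieNode:
    """A trie node: one outgoing edge per distinct symbol."""

    def __init__(self):
        self.children = {}    # symbol -> child node (distinct labels per node)
        self.is_word = False  # assumption: flag marking the end of a stored word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        """Follow the path spelled by prefix; return its node or None."""
        node = self.root
        for symbol in prefix:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def words_with_prefix(self, prefix):
        """All stored words below the node of prefix share that prefix."""
        node = self._find(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for symbol, child in current.children.items():
                stack.append((child, text + symbol))
        return sorted(found)


trie = Trie(["to", "te", "tea", "ten", "hi", "he", "her"])
print(trie.words_with_prefix("te"))
```

The `words_with_prefix` method relies on the last property listed for tries: every descendant of the node reached by a prefix corresponds to a string starting with that prefix.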
Efficient data structures: Suffix Tree
- Given a string s of n characters over the alphabet Σ, a suffix tree T associated with s can be defined as a trie containing all the n suffixes of s
- For each leaf i of T, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix $s_i$ of s
- For any pair of suffixes of s, the path associated with their longest common prefix is shared in T
(Example on the string abbababbab)
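As a rough illustration of the definition above, the following sketch builds an uncompacted suffix trie for the example string abbababbab and uses it to list the occurrences of a pattern. It is a naive quadratic construction chosen for clarity, not one of the linear-time suffix tree algorithms. The function names, the `$` terminator and the `LEAF` key are assumptions made for this example.

```python
LEAF = "<leaf>"  # assumption: special key storing the suffix start position


def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s + terminator into a trie of nested dicts."""
    text = s + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node[LEAF] = start  # the path from the root spells the suffix at start
    return root


def occurrences(root, pattern):
    """Start positions of pattern: the leaves below the node spelled by pattern."""
    node = root
    for symbol in pattern:
        if symbol not in node:
            return []
        node = node[symbol]
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


suffix_trie = build_suffix_trie("abbababbab")
print(occurrences(suffix_trie, "bab"))
```

Every occurrence of a pattern is a prefix of some suffix, so all occurrences hang below a single node. This is the shared-prefix property stated in the slide.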
Problem: the size of the output is often exponential in the size of the input
2D Array
Definition of maximal motif
[Figure: example motifs labeled "maximal", "not maximal in composition" and "not maximal in length"]
BASIS
- A basis of an image I is a set of irredundant motifs able to generate all the other motifs of I
- It is possible to prove that each image has only one basis, i.e., the basis is unique
- The size of the basis is linear in the size of the image: if I has size N, the number of motifs in the basis is $O(N)$
- In general, the number of motifs with don't cares in I is exponential in N
- An important problem is the extraction of the basis from I
A key concept: autocorrelation
Autocorrelations: the meets between I and all its bites
[Figure: an example image, two arrays P and Q, and the meet between P and Q]
Consensus, Meet, Autocorrelation
[Figure: projection at $(i_1, j_1)$ and $(i_2, j_2)$]
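The notions of consensus, meet and autocorrelation can be illustrated with a small sketch. It rests on assumptions that are not spelled out in the slides: the consensus of two equally sized arrays keeps a character where the arrays agree and puts a don't care (written `.` here) elsewhere; the meet is the consensus with border rows and columns made only of don't cares removed; and an autocorrelation is the meet between the image and a copy of itself shifted by a non-negative offset. The cited papers give the exact definitions. All function names are hypothetical.

```python
DONT_CARE = "."  # assumption: symbol used to print a don't care


def consensus(a, b):
    """Position-wise consensus of two equally sized arrays (lists of strings)."""
    return [
        "".join(x if x == y else DONT_CARE for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    ]


def trim(array):
    """Drop border rows and columns consisting only of don't cares."""
    rows = list(array)
    while rows and set(rows[0]) <= {DONT_CARE}:
        rows.pop(0)
    while rows and set(rows[-1]) <= {DONT_CARE}:
        rows.pop()
    if not rows:
        return []
    while all(row[0] == DONT_CARE for row in rows):
        rows = [row[1:] for row in rows]
    while all(row[-1] == DONT_CARE for row in rows):
        rows = [row[:-1] for row in rows]
    return rows


def meet(a, b):
    return trim(consensus(a, b))


def autocorrelation(image, di, dj):
    """Meet between the image and its copy shifted by (di, dj), with di, dj >= 0."""
    height, width = len(image), len(image[0])
    fixed = [row[: width - dj] for row in image[: height - di]]
    shifted = [row[dj:] for row in image[di:]]
    return meet(fixed, shifted)


image = ["aba", "bab", "aba"]
print(autocorrelation(image, 1, 1))
```

Enumerating all shifts of an image of size N in this way yields the candidate set from which, according to the theorem in the slides, the basis can be extracted.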
Basic Approach
Theorem: the basis is a subset of the set of autocorrelations
Three steps:
1. Generate all the autocorrelations of the input image I: $O(N^2)$
2. Compute the lists of occurrences of the autocorrelations: ?
3. Discard redundant motifs: $O(N^2)$
Second step
1) Fischer & Paterson: $O(N^2 \log n \log\log n)$
2) Incremental building of the set of irredundant motifs: $O(N^3)$
   [Figure: example image with a position (i, j), a region R and the sets $B_{ij}$ and $B_{ij+1}$]
3) Exploit some properties of don't cares: $O(N^2)$, but only for binary alphabets
Optimal Approach
Exploit some properties holding for $|\Sigma| = 2$ (e.g., $\Sigma = \{a, b\}$)
Optimal Approach - Example
$d_1 = 2$
- Is (2, 2) an occurrence of $A_{34}$? $d_2 = 0$, $d_3 = 2$
- Is (2, 4) an occurrence of $A_{34}$? $d_2 = 1$, $d_3 = 1$
Optimal Approach
Three steps:
1. Generate all the autocorrelations of the input image I: $O(N^2)$
2. Compute the lists of occurrences of the autocorrelations: $O(N^2)$ (only for black-and-white images)
3. Discard redundant motifs: $O(N^2)$
Overall cost: $O(N^2)$
Image Compression
Main idea: exploit the motif basis as 2D patches
References:
- A. Amelio, A. Apostolico and S. E. Rombo. Image Compression by 2D Motif Basis. In Proceedings of IEEE Data Compression Conference (DCC 2011), IEEE CS Press, Snowbird, UT, USA, 2011 (Forthcoming).
- A. Apostolico, L. Parida and S. E. Rombo. Motif Patterns in 2D. Theoretical Computer Science, 2008.
- S. E. Rombo. Optimal extraction of motif patterns in 2D. Inf. Process. Lett. 109(17): 1015-1020, 2009.
- A. Apostolico and L. Parida. Incremental Paradigms of Motif Discovery. J. of Comp. Biol. 11(1): 15-25, 2004.
- A. Amir and M. Farach. Two-dimensional dictionary matching. Inf. Process. Lett. 44(5): 233-239, 1992.
- M.J. Fischer and M.S. Paterson. String Matching and Other Products. In: R.M. Karp (Ed.), Complexity of Computation (SIAM-AMS Proceedings, v.7), 1974, pp. 113-125.
Further topics (from 2009 onward):
- Image compression
- Analysis of biological images
- Pattern discovery/matching on images with rotations, scaling and other variants
- Techniques applied to similarity search between images
- Pattern discovery (motif extraction) on biological strings
Biological Networks Analysis
PPI networks similarity search
- Evolution influences protein-protein interactions
- Proteins cannot be analyzed independently
- Both high-throughput and computational methods contribute to discovering and predicting protein-protein interactions
The Interaction Network of an organism:
- nodes = proteins
- edges = interactions
Why search for similarity between proteins belonging to different PPI networks?
To identify functional conservation across species
Our basic idea
Two proteins p1 and p2 in two different PPI networks may be considered similar if:
- p1 and p2 have similar sequences
- the proteins that p1 and p2 are connected with, i.e., their neighborhoods, have similar sequences
Refining protein similarities
- S = sequence similarity
- S' = refined similarity
The Graph Network
- P = a set of nodes labeled by protein ids
- I = a set of undirected edges labeled by pairs <w,c>, with $w, c \in [0,1]$
  - w = weakness
  - c = confidence
- Graph Network: GN = <P,I>
Interaction Path$_i$ (I-Path$_i$)
A path such that: $F(i-1) < \sum_u w_u \le F(i)$, $i \ge 1$, $F(0) = 0$
Example, with $F(x) = x^2$ and $i = 1$:
[Figure: graph network on nodes p1 to p9; edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]
- <p2, p1, p4>: satisfied
- <p3, p4, p5, p6>: satisfied
- <p4, p5, p9>: not satisfied
Cumulative Confidence
Given an I-Path$_i$: $C = \prod_u c_u$
Example, with $F(x) = x^2$ and $i = 1$:
[Figure: graph network on nodes p1 to p9; edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]
For the path <p2, p1, p4>: $C = 0.4 \cdot 0.7 = 0.28$
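The two definitions above lend themselves to a short sketch. It assumes that the I-Path condition reads $F(i-1) < \sum_u w_u \le F(i)$ and that a path is given as a list of edge labels (w, c). The function names and the small tolerance `eps`, added to absorb floating-point rounding, are illustrative choices. The first path uses the labels <0.8,0.4> and <0.2,0.7>, which match the confidences used in the slide for <p2, p1, p4>; the second path is a hypothetical example.

```python
import math


def is_i_path(edges, i, F, eps=1e-12):
    """Check F(i-1) < sum of weaknesses <= F(i) for a path given as (w, c) labels."""
    total_weakness = sum(w for w, _ in edges)
    return F(i - 1) + eps < total_weakness <= F(i) + eps


def cumulative_confidence(edges):
    """Product of the confidences c_u along the path."""
    return math.prod(c for _, c in edges)


def square(x):
    return x * x


path_p2_p1_p4 = [(0.8, 0.4), (0.2, 0.7)]   # labels consistent with C = 0.4 * 0.7
hypothetical_path = [(0.3, 0.4), (0.9, 0.4)]  # assumption: illustrative labels only

print(is_i_path(path_p2_p1_p4, 1, square))
print(cumulative_confidence(path_p2_p1_p4))
print(is_i_path(hypothetical_path, 1, square))
```

With $F(x) = x^2$ the first path has total weakness 1.0, which lies in the interval (F(0), F(1)], while the hypothetical path has total weakness 1.2 and therefore only qualifies for i = 2.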
i-th Neighborhood
Given a node p in GN = <P,I>:
$N(p,i) = \{ q \mid q \in P,\ q \ne p,\ \langle p, q \rangle \text{ is an I-Path}_i \text{ in } GN \text{ with minimum } \sum_u w_u \}$
Example, with $F(x) = x^2$ and $i = 1$:
[Figure: graph network on nodes p1 to p6; edge labels <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]
$N(p_3, 1) = \{p_1, p_2, p_4, p_6\}$
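One possible reading of this definition is that q belongs to N(p, i) when the minimum total weakness of a path from p to q falls in the interval (F(i-1), F(i)]. Under that assumption, the neighborhoods can be computed with Dijkstra's algorithm using the weakness w as edge cost, as sketched below. The graph representation, the function names and the toy network are hypothetical, not taken from the slides.

```python
import heapq


def add_edge(graph, a, b, weakness, confidence):
    """Store an undirected edge labeled <w,c> in an adjacency dictionary."""
    graph.setdefault(a, {})[b] = (weakness, confidence)
    graph.setdefault(b, {})[a] = (weakness, confidence)


def min_weakness(graph, source):
    """Dijkstra: minimum sum of weaknesses from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, (w, _c) in graph.get(node, {}).items():
            candidate = d + w
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def neighborhood(graph, p, i, F, eps=1e-12):
    """Nodes whose minimum-weakness path from p is an I-Path_i."""
    dist = min_weakness(graph, p)
    return {
        q for q, d in dist.items()
        if q != p and F(i - 1) + eps < d <= F(i) + eps
    }


# Hypothetical toy network, used only to exercise the functions.
toy = {}
add_edge(toy, "a", "b", 0.5, 0.9)
add_edge(toy, "b", "c", 0.5, 0.8)
add_edge(toy, "c", "d", 0.5, 0.7)
add_edge(toy, "a", "e", 2.0, 0.5)

print(neighborhood(toy, "a", 1, lambda x: x * x))
print(neighborhood(toy, "a", 2, lambda x: x * x))
```

Because the intervals (F(i-1), F(i)] are disjoint for increasing F, each reachable node with positive distance lands in exactly one neighborhood.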
The Bi-GRAPPIN Algorithm
- Let $GN_1$ and $GN_2$ be the graph networks of two different organisms, with $n_1$ and $n_2$ nodes, respectively
- Align each pair of proteins $(p, p')$, with $p \in GN_1$ and $p' \in GN_2$ (e.g., by the BLAST 2 Sequences algorithm)
The Bi-GRAPPIN Algorithm
INPUT: a sequence similarity dictionary SSD storing all the triplets $\langle p, p', f_0 \rangle$, with $p \in GN_1$, $p' \in GN_2$, $f_0 \in [0,1]$
- $f_0$: obtained from sequence alignment parameters
OUTPUT: a dictionary FSD storing the triplets $\langle p, p', f_p \rangle$, with $p \in GN_1$, $p' \in GN_2$, $f_p \in [0,1]$
- $f_p$: functional similarity
The Bi-GRAPPIN Algorithm
    FSD = SSD
    for each <p, p', f0> in SSD
        if (f0 > f_cut-off)
            set i = 1
            while i < imax    (imax: a fixed threshold value corresponding to the maximum network percentage to be analyzed)
                generate N(p, i) and N(p', i)
                compute a bipartite graph maximum weight matching between N(p, i) and N(p', i)
                refine f0, obtaining a new value fp, according to the objective function of the maximum weight matching
                i = i + 1
    return FSD
Example (1/3)
[Figure: yeast and fly networks with the target proteins p and p' and the neighborhoods N(·, 1)]
Settings: imax = 4, $f_0(p, p') > f_{cut\text{-}off}$, F(x) = identity, <w,c> = <1,1>
Example (2/3)
Bipartite graph maximum weight matching between N(p,1) and N(p',1)
[Figure: bipartite graph between yeast and fly proteins; edge weights 0.75, 0.22, 0.83, 0.34, 0.89, 0.85, 0.73, 0.82, 0.33, 0.65]
$f_p(1) = \delta(1) \cdot \mu(N(p,1), N(p',1), FSD, \alpha) + [1 - \delta(1)] \cdot f_0(p,p')$
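The refinement step can be sketched as follows, with strong caveats. The slides do not define the objective function µ, the parameter α or the weighting δ(i). In the sketch below µ is assumed to be the total weight of the maximum weight matching divided by the size of the larger neighborhood, α is ignored, and δ is passed as a plain number. The matching is found by brute force over permutations, which is only reasonable for very small neighborhoods. None of these choices should be read as the Bi-GRAPPIN implementation.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Brute-force maximum weight bipartite matching.

    weights[i][j] is the similarity between the i-th protein of one
    neighborhood and the j-th protein of the other. Returns (total, pairs).
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    if n_left > n_right:
        # Make the left side the smaller one, then swap the pairs back.
        transposed = [list(column) for column in zip(*weights)]
        total, pairs = max_weight_matching(transposed)
        return total, [(i, j) for (j, i) in pairs]
    best_total, best_pairs = 0.0, []
    for assignment in permutations(range(n_right), n_left):
        total = sum(weights[i][j] for i, j in enumerate(assignment))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(assignment))
    return best_total, best_pairs


def refine(f0, weights, delta):
    """f_p = delta * mu + (1 - delta) * f0, with mu as assumed in the lead-in."""
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    larger = max(n_left, n_right)
    total, _pairs = max_weight_matching(weights)
    mu = total / larger if larger else 0.0
    return delta * mu + (1.0 - delta) * f0


# Hypothetical similarities between two neighborhoods of two proteins each.
example_weights = [[0.9, 0.1], [0.2, 0.8]]
print(refine(0.5, example_weights, delta=0.5))
```

With very similar neighborhoods the matching term pulls the refined value above f0, and with dissimilar ones it pulls the value below f0, which is the qualitative behavior reported for the synthetic data.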
Example (3/3)
[Figure: yeast and fly networks with the target proteins p and p' and the neighborhoods N(·, 1), N(·, 2) and N(·, 3)]
Result: $\langle p, p', f_p(3) \rangle \in FSD$
Settings: imax = 4, $f_0(p, p') > f_{cut\text{-}off}$, F(x) = identity, <w,c> = <1,1>
Synthetic data (1/3): very similar neighborhoods, final $f_p$ greater than $f_0$
Synthetic data (2/3): high $f_0$ but very dissimilar neighborhoods, final $f_p$ lower than $f_0$
Synthetic data (3/3): high $f_0$, not very similar N(·, 1) but very similar N(·, 2), final $f_p$ greater than $f_0$
Functional Orthologs
- S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428-435, 2006.
- R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB 2007. LNB, 2007.
Further experiments
- Query the D. melanogaster PPI network with Abp1, for which no evident homolog has been detected
- The most similar protein based on sequence homology: CG10083 (a drebrin-like protein)
- Abp1: an actin-binding protein regulating actin nucleation
- Is it possible to find other proteins involved in actin reorganization by comparing the sub-network made of Abp1 and its first two neighborhoods against the entire Drosophila network?
Further experiments
- Best match according to our refined similarity: CG10083 (confirming the pairwise sequence similarity)
- Abp1 and CG10083 are both actin-binding proteins
- Other proteins of unknown function, showing low sequence similarity with Abp1, may share a similar function
- CG6873-PA: a cofilin-like protein possibly involved in cytoskeleton shaping
  - SSD: <Abp1, CG6873-PA, 0.287>
  - FSD: <Abp1, CG6873-PA, 0.442>
Asymmetric Alignment
- Master network: guides the alignment process
- Slave network: it is aligned to the master
- Some organisms are well characterized, e.g., Saccharomyces cerevisiae; this is not the case for many other organisms
- Advantage: results retain the structural characteristics of the master network (so they are "sound")
Asymmetric Alignment
- Linearization of the slave network: translation of the network into a sequence of symbols
- Given a linearization of the slave, find the portion of the master that can be associated with it
Motivations:
- Only the slave network is linearized; all the structural information about the master network is kept
- The approximation allows us to find similar groups of proteins, not just isomorphic structures
- The resulting algorithm has polynomial time complexity
Asymmetric Alignment
- Master network: the alignment model, a weighted finite-state automaton
- The states of the model correspond to proteins
  - (p1, 0), (p2, 1), ..., (p3, 0): score 1
  - (p1, 0), (*, 1), ..., (*, 0): score 2
- Find the maximum scoring path (among the states of the master) for the linearization of the slave network: Viterbi algorithm
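To make the last step concrete, here is a generic Viterbi sketch that finds the maximum scoring state path for a sequence of symbols. It is not the master-slave alignment model itself: the additive scoring, the function names, the transition rule (0 between adjacent master proteins, a penalty otherwise) and the similarity values are all assumptions chosen for illustration.

```python
def viterbi(observations, states, emit_score, trans_score):
    """Return (best_score, best_path) maximizing the sum of emission
    and transition scores over all state sequences."""
    if not observations:
        return 0.0, []
    best = {s: emit_score(s, observations[0]) for s in states}
    back_pointers = []
    for symbol in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + trans_score(r, s))
            new_best[s] = best[prev] + trans_score(prev, s) + emit_score(s, symbol)
            pointers[s] = prev
        best = new_best
        back_pointers.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Hypothetical master network (states) and linearized slave network (symbols).
master_states = ["m1", "m2", "m3"]
master_edges = {("m1", "m2"), ("m2", "m3")}
similarity = {("m1", "s1"): 0.9, ("m2", "s2"): 0.8,
              ("m3", "s3"): 0.7, ("m3", "s1"): 0.2}


def emit(state, symbol):
    # Assumed emission score: similarity between master and slave protein.
    return similarity.get((state, symbol), 0.0)


def trans(prev, cur):
    # Assumed transition score: no penalty between adjacent master proteins.
    adjacent = (prev, cur) in master_edges or (cur, prev) in master_edges
    return 0.0 if adjacent else -1.0


score, path = viterbi(["s1", "s2", "s3"], master_states, emit, trans)
print(score, path)
```

The dynamic program runs in time proportional to the length of the linearization times the square of the number of states, which is in line with the polynomial complexity claimed for the approach.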
Asymmetric Alignment
[Figure: global alignment of Yeast (master) and Fly (slave)]
Asymmetric Alignment
- Yeast (as the master) vs. Fly: 945 protein pairings
- Fly (as the master) vs. Yeast: 707 protein pairings
Possible explanations:
- The Yeast network is better characterized than the Fly network, so with Yeast as the slave much structural information gets lost
- More regions of the Yeast network have been conserved in the Fly than vice versa, since the Fly is more complex
PPI networks clustering
Aim: clustering the dense regions of a given PPI network, since biologists have observed that groups of highly interacting proteins could be involved in common biological processes
Search of functional modules in PPI networks
- The network is modeled by a matrix representing the interactions
- The algorithm introduces the concept of quality of a sub-matrix and applies a greedy technique to discover compact regions of the network
Validation
References
1. N. Ferraro, L. Palopoli, S. Panni and S. E. Rombo. Master-Slave Biological Network Alignment. In Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA 2010), 215-229, Connecticut, USA, 2010.
2. F. Bruno, L. Palopoli and S. E. Rombo. New trends in graph mining: Structural and Node-colored network motifs. International Journal of Knowledge Discovery in Bioinformatics, 1(1), 81-99, 2010.
3. C. Pizzuti and S. E. Rombo. Multi-functional Protein Clustering in PPI Networks. BIRD 2008.
4. V. Fionda, S. Panni, L. Palopoli and S. E. Rombo. Bi-GRAPPIN: Bipartite graph based protein-protein interaction networks similarity search. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07). Silicon Valley, USA, 2007.
5. C. Pizzuti and S. E. Rombo. PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07). Birmingham, UK, 16th-19th December, 2007.
6. S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428-435, 2006.
Further topics (from 2009 onward):
- Alignment of biological networks
- Integration and cleaning of biological networks
- Querying of biological databases/networks
- Biological networks clustering
- RNA structure prediction
- RNA sequence/structure alignment