Local graph alignment and motif search in biological networks

Johannes Berg and Michael Lässig

Institut für Theoretische Physik, Universität zu Köln, Zülpicherstrasse 77, Cologne, Germany

Edited by Richard M. Karp, International Computer Science Institute, Berkeley, CA, and approved July 8, 2004 (received for review August 13, 2003)

Interaction networks are of central importance in postgenomic molecular biology, with increasing amounts of data becoming available by high-throughput methods. Examples are gene regulatory networks or protein interaction maps. The main challenge in the analysis of these data is to read off biological functions from the topology of the network. Topological motifs, i.e., patterns occurring repeatedly at different positions in the network, have recently been identified as basic modules of molecular information processing. In this article, we discuss motifs derived from families of mutually similar but not necessarily identical patterns. We establish a statistical model for the occurrence of such motifs, from which we derive a scoring function for their statistical significance. Based on this scoring function, we develop a search algorithm for topological motifs called graph alignment, a procedure with some analogies to sequence alignment. The algorithm is applied to the gene regulation network of Escherichia coli.

The vast amount of sequence data collected over the past two decades is at the heart of quantitative molecular biology. Biological information is extracted from these data mainly by analyzing similarities between sequences. This approach is based on efficient sequence alignment algorithms and a statistical theory to assess the significance of the results (see ref. 1). Its ultimate goal is to infer functional relationships from correlations between sequences. Over the last few years, however, it has become clear that functions in many cases cannot be identified at the level of single genes.
A given function may require the cooperative action of several genes, and conversely, a given gene may play a role in quite different functional contexts. The genome is thus a highly interactive system and the expression of a gene depends on the activity of other genes. The pathways of these interactions are encoded in so-called regulatory networks. Similarly complex networks govern signal transduction, that is, the influence of external signals on gene expression, or protein interactions, that is, the ability of two or more proteins to enter a bound state in a living cell.

A few exemplary cases of gene networks have been studied in much detail, such as the regulation of early development in the sea urchin Strongylocentrotus purpuratus (2) or in Drosophila (3). In some approximation, these structures can be understood as logical networks: the expression level of a gene is reduced to a binary variable (on or off) and is specified in terms of binary input data, i.e., the expression levels of its upstream genes. On the other hand, a large amount of data on molecular interaction networks is now obtained by high-throughput experiments, for example protein interaction maps in yeast (4) or gene expression arrays (5). In these arrays, one probes the activity of an entire genome, rather than of just a few genes. However, the detailed logical connection of interaction pathways is typically lost. The information is reduced to a topological network, that is, a set of nodes (representing, e.g., genes or proteins) and links representing their pairwise interactions. These links can be directed as in the case of regulatory interactions or undirected as for protein-protein binding. The amount of topological data on molecular networks is expected to increase rapidly in the next few years, paralleling the earlier explosion of sequence data. What can be learned from these data? Using the network topology alone, can we distinguish patterns of biological function from random background?
The purpose of this article is to develop a bioinformatics approach to the search for local modules in networks. We discuss a heuristic search algorithm and its statistical grounding in a stochastic model of network evolution. This approach is designed to complement experiments in specific organisms by large-scale database searches.

Two seminal studies (6, 7) recently have shown that topological networks indeed contain statistically significant patterns indicative of biological functions. These motifs are patterns that occur more frequently in the observed network than expected in a suitable null ensemble. The motifs found so far have been identified because they occur identically at different positions in a network. If network evolution is a stochastic process, however, functionally related motifs do not need to be topologically identical. Hence, the notion of a motif has to be generalized to a stochastic one as well. Variations arise because of uncertainties in the network data, or, more importantly, because some of the interactions can change without affecting the functionality of the motif. This noise is an important characteristic of biological systems, familiar from sequence analysis, where one searches for local sequence similarities blurred by mutations and insertions/deletions, rather than for identical subsequences. It leads us to the notion of a probabilistic motif in which each link occurs with a certain likelihood. Probabilistic motifs arise as consensus from finding a family of sufficiently similar subgraphs in a network. The search for mutually similar subgraphs and their probabilistic motifs is the central issue of this article.

The motifs of interest here are nonrandom in two ways: they have an enhanced number of internal links, associated, e.g., with feedback, and they appear in a significant number of subgraphs. Identifying these local deviations from randomness in networks requires a statistical theory of local graph structure, which we establish in this article.
This is a complementary approach to the global statistics measured by the connectivity distribution (8) or connectivity correlations (9, 10) of a network. Our approach leads to an algorithmic procedure termed local graph alignment, which is conceptually similar to sequence alignment. It is based on a scoring function measuring the statistical significance for families of mutually similar subgraphs. This scoring involves quantifying the significance of the individual subgraphs as well as their mutual similarity, and is thus considerably more complicated than for families of identical motifs.

Our scoring function is derived from a stochastic model for network evolution. There is indeed evidence that network evolution can be described as a stochastic process. For example, the comparison of the regulatory networks for early development in several Drosophila species has revealed the continuous buildup and loss of gene interactions following an approximate molecular clock (11). Yet little is known about the specific pathways of network evolution. Our scoring function is compatible with divergent evolution of subgraphs but also with convergent evolution toward a common functional motif. These pathways can be illustrated by a comparison with sequence evolution. An example of convergent evolution is the formation of sequence motifs serving as binding sites of specific enzymes (12, 13). An example of divergent evolution is a set of sequences stemming from a common ancestor undergoing mutations independently. The probabilistic grounding of graph alignment allows us to infer optimal scoring parameters by a maximum-likelihood procedure (14).

This paper was submitted directly (Track II) to the PNAS office. To whom correspondence should be addressed. E-mail: berg@thp.uni-koeln.de. © by The National Academy of Sciences of the USA. PHYSICS, GENETICS. PNAS, October 12, 2004, vol. 101.

As a computational problem, graph alignment is more challenging than sequence alignment. Sequences can be aligned in polynomial time by using dynamic programming algorithms. For graph alignment, a polynomial-time algorithm probably does not exist. Already simpler graph matching problems such as the subgraph isomorphism problem (deciding whether a graph contains a given subgraph) (15, 16) or finding the largest common subgraph of two graphs (17) are NP-complete and NP-hard, respectively. Thus, an important issue for graph alignment is the construction of efficient heuristic search algorithms. Here we solve this problem by mapping graph alignment onto a spin model familiar in statistical physics, which can be treated by simulated annealing.

This article is structured as follows. In the first part, we discuss the statistics of local subgraphs based on a probabilistic model. This is done in three steps: (i) an individual subgraph with an enhanced number of internal links, (ii) a subgraph in the presence of a template motif specifying the functional importance of each link, and (iii) correlated subgraphs, whose common pattern is to be inferred from the data instead of being given as a template. We then construct a scoring function designed to distinguish sets of statistically significant network motifs with an enhanced number of links from a background of other patterns. High-scoring motifs are found by an alignment algorithm, details of which are described in Supporting Text, which is published as supporting information on the PNAS web site.
In the second part of the paper, we apply this method to the regulatory network of Escherichia coli and discuss the probabilistic motifs found. The statistics of these motifs is used to test the assumptions of our probabilistic model.

Graphs and Patterns

A topological network or graph is a set of nodes and links. Labeling the nodes by an index $r = 1,\dots,N$, the network is described by the adjacency matrix $C$, which has entries $C_{rr'} = 1$ if there is a directed link from node $r$ to node $r'$ and $C_{rr'} = 0$ otherwise. Graphs with a generic adjacency matrix are called directed. The special case of a symmetric adjacency matrix can be used to describe undirected graphs. The in and out connectivities of a node, $k_r^{\rm in} = \sum_{r'} C_{r'r}$ and $k_r^{\rm out} = \sum_{r'} C_{rr'}$, are defined as the number of in- and outgoing links, respectively. The total number of links is denoted by $K = \sum_{r,r'} C_{rr'}$. The networks considered here are sparse, i.e., their average connectivity $K/N$ is of order 1.

A subgraph $G$ is given by a subset of $n$ vertices $\{r_1,\dots,r_n\}$ and the resulting restriction of the adjacency matrix. More precisely, we define the matrix $c(G, A)$ with the entries $c_{ij} = C_{r_i r_j}$ ($i, j = 1,\dots,n$) specifying the internal links of the subgraph for a given order $A$ of the nodes. This matrix $c$ is called a pattern, which is contained in the subgraph. The definition of a pattern used here implies that two patterns are counted as separate if the matrices $c$ and $c'$ are different. This assumes that nodes are distinguishable by their biochemical identity and their functional role even if they are at symmetric positions, i.e., if $c$ and $c'$ differ only by the labeling of the nodes. An alternative definition would count two matrices $c$ and $c'$ related by a relabeling as defining an identical pattern. Which definition is more appropriate depends on the particular biological application.

The most important characteristic of patterns for what follows is their number of internal links,

$$L(c) = \sum_{i,j} c_{ij}. \tag{1}$$

Fig. 1 shows two subgraphs that differ in the values of $L$.

Fig. 1. Motifs and alignment in topological networks. (a) A randomly chosen connected subgraph is likely to be a tree; i.e., it has the number of internal links equal to its number of nodes minus 1. (b) Putatively functional subgraphs are distinguished by internal loops, i.e., by a higher number of internal links. (c) An alignment of three subgraphs with four nodes each. Each node carries an index $\alpha = 1, 2, 3$ labeling its subgraph and an index $i = 1, 2, 3, 4$ given by the order of nodes within the subgraph. Nodes with the same index $i$ are joined by dashed lines, defining a one-to-one mapping between any two subgraphs. Network links are shown as solid lines (with their arrows suppressed for clarity). (d) The consensus pattern of this alignment. Each link occurs with a likelihood $\bar c_{ij}$ indicated by the gray scale.

Graph Alignments and Motifs

A graph alignment is defined by a set of several subgraphs $G^\alpha$ ($\alpha = 1,\dots,p$) and a specific order of the nodes $\{r^\alpha_1,\dots,r^\alpha_n\}$ in each subgraph; this joint order is again denoted by $A$. For simplicity, we assume here that the subgraphs are of the same size $n$, but it is not difficult to generalize our approach to include subgraphs of different size. For a given set of mutually disjoint subgraphs, there are $(n!)^p$ different alignments. An alignment associates each node in a subgraph with exactly one node in each of the other subgraphs. This association can be visualized by $n$ strings, each connecting the nodes with the same index $i$ as shown in Fig. 1c.

A given alignment $A$ specifies a pattern in each subgraph; we write $c^\alpha \equiv c(G^\alpha, A)$. The consensus pattern of this alignment is given by the matrix

$$\bar c = \frac{1}{p} \sum_{\alpha=1}^{p} c^\alpha. \tag{2}$$

This is a probabilistic pattern, the entry $\bar c_{ij}$ denoting the likelihood that a given link is present in the aligned subgraphs. For any two aligned subgraphs $G^\alpha$ and $G^\beta$, we can define the pairwise mismatch

$$M(c^\alpha, c^\beta) = \sum_{i,j=1}^{n} \left[ c^\alpha_{ij} \left(1 - c^\beta_{ij}\right) + \left(1 - c^\alpha_{ij}\right) c^\beta_{ij} \right]. \tag{3}$$

The mismatch is 0 if and only if the matrices $c^\alpha$ and $c^\beta$ are equal, and is positive otherwise.
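The definitions of Eqs. 1-3 translate almost directly into code. The following Python sketch is only an illustration: the function names and the two example patterns are assumptions made here, not part of the original work. It computes the link number of a pattern, the consensus pattern of an alignment, and the pairwise mismatch of two aligned patterns.

```python
from typing import List

Pattern = List[List[int]]  # n x n matrix of 0/1 entries; c[i][j] = 1 means a link i -> j


def link_count(c: Pattern) -> int:
    """Number of internal links L(c), Eq. 1."""
    return sum(sum(row) for row in c)


def consensus(patterns: List[Pattern]) -> List[List[float]]:
    """Consensus pattern of Eq. 2: entrywise average over the aligned patterns."""
    p = len(patterns)
    n = len(patterns[0])
    return [[sum(c[i][j] for c in patterns) / p for j in range(n)] for i in range(n)]


def mismatch(c_a: Pattern, c_b: Pattern) -> int:
    """Pairwise mismatch of Eq. 3: number of entries in which two 0/1 patterns differ."""
    n = len(c_a)
    return sum(
        c_a[i][j] * (1 - c_b[i][j]) + (1 - c_a[i][j]) * c_b[i][j]
        for i in range(n)
        for j in range(n)
    )


# Two hypothetical aligned 3-node patterns (illustrative data only).
feed_forward = [[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]]
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]

c_bar = consensus([feed_forward, chain])
```

In this toy alignment the two patterns differ in a single entry, the link from the first to the third node, so the corresponding consensus entry takes a value strictly between 0 and 1 while all other entries are exactly 0 or 1.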
The mismatch can be considered as a Hamming distance for aligned subgraphs. The average of the mismatch $M(c^\alpha, c^\beta)$ over all pairs of aligned subgraphs, denoted by $\bar M$, is termed the fuzziness of the consensus pattern $\bar c$. Analogously, the average number of internal links $L(c^\alpha)$ is denoted by $\bar L$. We now define network motifs as statistically significant consensus patterns of graph alignments, which are distinguished by a high number of internal links and low fuzziness. Clearly, this definition is mathematically loose before we quantify the statistical significance. This quantification will be done in the next three sections.

Guided by the results of refs. 6 and 7, we take an enhanced number of internal links as a topological indicator of possible functional modules in networks. The additional links beyond a treelike topology can be associated with feedback or feed-forward loops in transcription networks, or clusters in protein interaction networks. For example, the triangle shown in Fig. 1b can be interpreted (6) as a low-frequency bandpass filter: the central node is activated if both top nodes are active. However, the right-hand node is activated by that on the left with a small delay, so the central node is activated provided the left node is active for a time longer than this delay. The nontreelike nature of this motif is crucial for its function. On the other hand, most randomly chosen connected subgraphs would be treelike. Clearly, an enhanced number of internal links is but the simplest topological indicator of putative functionality, and more detailed ways of identifying network motifs are likely to emerge in the future.

Statistics of Individual Subgraphs

To quantify the statistical significance of a given number of internal links, we first compute the relevant probability distribution in a suitable random graph ensemble, which is generated by an unbiased sum over all graphs with the same number of nodes and the same connectivities $k_r^{\rm in}, k_r^{\rm out}$ ($r = 1,\dots,N$) as in the data set but randomly chosen links (10, 18). This null ensemble is appropriate for biological networks, whose connectivity distribution generally differs markedly from that of a random graph with uniformly distributed links. In the null ensemble, the probability of finding a directed link from node $r_i$ to node $r_j$ is in good approximation given by $w_{ij} = k^{\rm out}_{r_i} k^{\rm in}_{r_j} / K$ (19). Hence, a given subset of nodes $\{r_1,\dots,r_n\}$ forms a subgraph $G$ with probability

$$P_0(G) = \prod_{i,j=1}^{n} \left(1 - w_{ij}\right)^{1 - c_{ij}} w_{ij}^{c_{ij}}. \tag{4}$$

This expression neglects double links, which can be included as in ref. 19. The probability $P_0(G)$ depends on the pattern $c(G)$ of the subgraph, as well as on its environment given by the connectivities $k_i^{\rm in}, k_i^{\rm out}$ ($i = 1,\dots,n$). In this ensemble, the expected number of internal links per node is small, $\langle L \rangle_0 / n \sim n/N$, where we denote the average over a given ensemble by $\langle \cdot \rangle$. Hence, most random subgraphs in a large and sparse graph are disconnected. Within the subset of connected subgraphs, most are treelike. (Later we will be interested in the subset of nontreelike subgraphs, and this will require a modification of the null ensemble.)

We now assume that subgraphs containing network motifs are generated by a different ensemble $P_\sigma(G)$. The probability that a given pair of nodes carries a link is enhanced by a factor $e^\sigma$ relative to the null ensemble 4, leading to

$$P_\sigma(G) / P_0(G) = Z_\sigma^{-1} \exp[\sigma L(c)]. \tag{5}$$

Again the probability $P_\sigma(G)$ that a given subset of nodes $\{r_1,\dots,r_n\}$ forms a subgraph $G$ depends on the matrix $c(G)$. We have introduced the normalization factor $Z_\sigma = \sum_{\{c_{ij} = 0,1\}} \exp[\sigma L(c)] P_0(G)$, which ensures that $P_\sigma(G)$ summed over all matrices $c$ gives unity. The quantity $\sigma$, called the link reward, is multiplied by the total number $L$ of internal links given by Eq. 1. The ensemble 5 is a statistically unbiased way to describe that functional motifs are distinguished by a large number of internal links. (Technically, it is the ensemble of maximal information entropy with a given average link number $\langle L \rangle$, which is determined by the value of $\sigma$.) This ensemble may be thought of as resulting from an evolutionary process favoring the formation of links due to selection pressure; such a process has recently been studied for regulatory networks (20). Here we focus on the detection of evolved motifs rather than on the reconstruction of evolutionary histories. Hence, we do not need to make assumptions on dynamical details of motif formation but only on its outcome, which is described by the ensemble $P_\sigma(G)$.
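To make the two single-subgraph ensembles concrete, the Python sketch below evaluates the null link probabilities $w_{ij}$, the logarithm of $P_0(G)$ from Eq. 4, and the link probability of a single node pair under the link-reward ensemble of Eq. 5. Because both ensembles factorize over node pairs, a pair with null probability $w$ carries a link with probability $e^\sigma w / (1 - w + e^\sigma w)$ under Eq. 5. All function names and the toy network are assumptions introduced for illustration; the sketch also assumes a nonempty network and raises an error where the approximation $w_{ij} \le 1$ breaks down, which can happen for hubs in very small graphs.

```python
import math
from typing import List, Sequence, Tuple

Adjacency = List[List[int]]  # C[r][s] = 1 for a directed link r -> s


def connectivities(C: Adjacency) -> Tuple[List[int], List[int], int]:
    """Return in-connectivities, out-connectivities and the total link number K."""
    N = len(C)
    k_out = [sum(C[r]) for r in range(N)]
    k_in = [sum(C[r][s] for r in range(N)) for s in range(N)]
    return k_in, k_out, sum(k_out)


def null_link_probs(C: Adjacency, nodes: Sequence[int]) -> List[List[float]]:
    """w_ij = k_out(r_i) * k_in(r_j) / K for the ordered node subset `nodes`."""
    k_in, k_out, K = connectivities(C)
    return [[k_out[ri] * k_in[rj] / K for rj in nodes] for ri in nodes]


def log_p0(C: Adjacency, nodes: Sequence[int]) -> float:
    """log P_0(G) of Eq. 4 for the subgraph spanned by `nodes`."""
    w = null_link_probs(C, nodes)
    total = 0.0
    for i, ri in enumerate(nodes):
        for j, rj in enumerate(nodes):
            if not 0.0 <= w[i][j] <= 1.0:
                raise ValueError("w_ij is not a probability; Eq. 4 does not apply")
            prob = w[i][j] if C[ri][rj] else 1.0 - w[i][j]
            if prob == 0.0:
                raise ValueError("pattern has zero probability in the null ensemble")
            total += math.log(prob)
    return total


def rewarded_link_prob(w: float, sigma: float) -> float:
    """Link probability of one node pair under Eq. 5: e^sigma w / (1 - w + e^sigma w)."""
    boost = math.exp(sigma) * w
    return boost / (1.0 - w + boost)


# Hypothetical toy network with links 0->1, 0->2, 1->2, 2->3 (illustrative only).
toy = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
```

For $\sigma = 0$ the rewarded probability reduces to $w$, and for $\sigma > 0$ it is larger than $w$ for every pair with $0 < w < 1$, which is the sense in which Eq. 5 favors subgraphs with many internal links.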
We have tested the form of this ensemble for the regulatory network of E. coli as discussed in Results and Discussion below. Moreover, the value of the link reward $\sigma$ can be inferred from the data. One finds $e^\sigma \sim N/n$, which results in a finite expected number $\langle L \rangle_\sigma / n$ of internal links per node within a motif.

Statistics in the Presence of a Template

The distribution 5 describes an ensemble with an enhanced number of links, which is appropriate for scoring individual subgraphs in the absence of further knowledge. Consider now an evolutionary process directed toward a given network motif represented by a template adjacency matrix $t$. An alignment $A$ between the motif $t$ and the subgraph $G$ is specified by a given ordering of the nodes $\{r_1,\dots,r_n\}$ in $G$. The outcome of this evolutionary process can be modeled by an ensemble $Q_t(G, A)$ with a bias against links that do not occur in the template,

$$Q_t(G, A) / P_0(G) = Z_t^{-1} \exp\left[\sigma L(c) - \frac{\mu}{2} M(c, t)\right]. \tag{6}$$

This expression denotes the probability that a given subset of nodes $\{r_1,\dots,r_n\}$ forms an aligned subgraph $(G, A)$, with the definition 3 of the pairwise mismatch of the subgraph $G$ and the template $t$ in a given alignment $A$. Again, $Z_t$ is given by normalization. This is a hidden Markov model: the outcome of the stochastic process is an aligned subgraph $(G, A)$, whereas only $G$ is observed. The likelihood of observing $G$ is then a sum over all alignments,

$$Q_t(G) = \sum_{A} Q_t(G, A). \tag{7}$$

This ensemble has two free parameters, the link reward $\sigma$ and the mismatch penalty $\mu$ (with a factor 1/2 introduced for later convenience). It is conceptually similar to hidden Markov models for the alignment of sequences with gaps.

Statistics of Correlated Subgraphs

Now we turn to the case where a network motif is not given as a template but has to be inferred from a family of suitably aligned subgraphs. The underlying evolutionary process can be regarded as a biased link formation as in the previous section, with the consensus pattern $\bar c$ as template.
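Assuming the link formation is independent for each subgraph, we obtain an ensemble given by

$$Q_{\mu,\sigma}(G^1,\dots,G^p, A) = Z_{\mu,\sigma}^{-1} \prod_{\alpha=1}^{p} P_0(G^\alpha) \exp\left[\sigma L(c^\alpha) - \frac{\mu}{2} M(c^\alpha, \bar c)\right] = Z_{\mu,\sigma}^{-1} \left[\prod_{\alpha=1}^{p} P_0(G^\alpha)\right] \exp\left[\sigma \sum_{\alpha=1}^{p} L(c^\alpha) - \frac{\mu}{2p} \sum_{\alpha,\beta=1}^{p} M(c^\alpha, c^\beta)\right], \tag{8}$$

where $A$ specifies an alignment of all subgraphs and we have used the definition 2 of the consensus pattern. The normalization is given by

$$Z_{\mu,\sigma} = \sum_{A} \sum_{c^1,\dots,c^p} \exp\left[\sigma \sum_{\alpha=1}^{p} L(c^\alpha) - \frac{\mu}{2p} \sum_{\alpha,\beta=1}^{p} M(c^\alpha, c^\beta)\right] \prod_{\alpha=1}^{p} P_0(G^\alpha). \tag{9}$$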
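The step from a mismatch with the consensus pattern to a sum over pairwise mismatches in Eq. 8 can be spelled out as follows. The short derivation assumes that the mismatch of Eq. 3 is applied unchanged to the real-valued consensus pattern $\bar c$ of Eq. 2, with $p$ denoting the number of aligned subgraphs; since Eq. 3 is linear in its second argument, averaging over that argument commutes with the sum over node pairs.

```latex
\begin{align*}
M(c^\alpha,\bar c)
 &= \sum_{i,j=1}^{n}\Big[c^\alpha_{ij}\,(1-\bar c_{ij}) + (1-c^\alpha_{ij})\,\bar c_{ij}\Big] \\
 &= \frac{1}{p}\sum_{\beta=1}^{p}\sum_{i,j=1}^{n}
    \Big[c^\alpha_{ij}\,(1-c^\beta_{ij}) + (1-c^\alpha_{ij})\,c^\beta_{ij}\Big]
  = \frac{1}{p}\sum_{\beta=1}^{p} M(c^\alpha,c^\beta), \\[4pt]
\frac{\mu}{2}\sum_{\alpha=1}^{p} M(c^\alpha,\bar c)
 &= \frac{\mu}{2p}\sum_{\alpha,\beta=1}^{p} M(c^\alpha,c^\beta).
\end{align*}
```

The mismatch penalty of each subgraph against the consensus is therefore equivalent to a penalty on all pairs of aligned subgraphs, which is the form that enters the normalization in Eq. 9.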

The Scoring Function

We now construct a scoring function designed to select a set of (putatively) functional subgraphs, characterized by a consensus motif with a high number of internal links and low fuzziness, from the background of random subgraphs in a large network. Based on the preceding discussion, we assume that the statistics of functional motifs is described by an ensemble $Q(G^1,\dots,G^p, A) \equiv Q_{\mu,\sigma}(G^1,\dots,G^p, A)$, where the scoring parameters $\sigma$ and $\mu$ remain to be determined from the data.

For the biological applications described above, where internal links are associated with feedback loops, it is clearly useful to restrict the motif search to the set of all connected subgraphs that contain internal loops, i.e., that are nontreelike. For connected subgraphs of size $n$, this set is given by the constraint $L \geq n$ on the internal link number. A large random graph typically contains a number of order one of such subgraphs, and these define the relevant null ensemble for motif search. We model these subgraphs by using the ensemble $P_{\sigma_0}$ with an enhanced number of links defined in Eq. 5. The parameter $\sigma_0$ will be adjusted such that the average number of internal links in the null ensemble equals that found in the nontreelike subgraphs of a suitable randomized graph. Comparing with the ensemble $P_0$ of random subgraphs introduced earlier, it is clear that the constraint $L \geq n$ corresponds to a link reward $\sigma_0 > 0$.

Given these two ensembles, we define the log-likelihood score

$$S(G^1,\dots,G^p, A) \equiv \log \frac{Q_{\mu,\sigma}(G^1,\dots,G^p, A)}{P_{\sigma_0}(G^1,\dots,G^p, A)} = (\sigma - \sigma_0) \sum_{\alpha=1}^{p} L(c^\alpha) - \frac{\mu}{2p} \sum_{\alpha,\beta=1}^{p} M(c^\alpha, c^\beta) - \log\left(Z_{\mu,\sigma} / Z_{\sigma_0}\right), \tag{10}$$

which is positive if a set of subgraphs $G^1,\dots,G^p$ and an alignment $A$ between them is more likely to occur in the ensemble $Q_{\mu,\sigma}$ than in the null ensemble $P_{\sigma_0}$. The term $\log(Z_{\mu,\sigma} / Z_{\sigma_0})$ acts as a threshold assigning a negative score to alignments with too large fuzziness or a too small number of internal links.
As is clear from the form of the scoring function, graph alignment is a nontrivial optimization problem, the statistical weight of each subgraph $G^\alpha$ depending on the scoring parameters as well as on the other subgraphs included in the alignment. We address this problem in two steps. First we find the maximum-score alignment(s) for given score parameters, which is essentially an algorithmic search problem. Then we discuss the parameter dependence of high-scoring alignments and obtain the optimal values of $\sigma$ and $\mu$ for a given data set from a maximum-likelihood procedure.

Maximum Score Alignments and Parametric Optimization

Finding the maximum score alignments involves a huge search space of possible alignments. The number of alignments is of order $\binom{N}{n}^{p}$ for given $p$ and the computational expense grows further when the optimization over $p$ is performed. Here we use a heuristic algorithm, which can be described by a mapping to a discrete spin model. First we enumerate all nontreelike subgraphs of $n$ nodes, which is feasible for modest values of $n$, and label them by the index $\alpha = 1,\dots,\alpha_{\max}$. Next we evaluate the internal link numbers $L^\alpha \equiv L(c^\alpha)$ and the pairwise mismatches $M^{\alpha\beta}$, defined as the minimum of $M(c^\alpha, c^\beta)$ over all pairwise alignments of the subgraphs $G^\alpha$ and $G^\beta$. High-scoring multiple alignments are then found by a simulated annealing algorithm in the space $(s^1,\dots,s^{\alpha_{\max}})$, where each spin $s^\alpha$ takes the value 1 if $G^\alpha$ is included in the alignment and 0 otherwise. The resulting Hamiltonian $H$ is

$$H = -(\sigma - \sigma_0) \sum_{\alpha=1}^{\alpha_{\max}} L^\alpha s^\alpha + \frac{\mu}{2p} \sum_{\alpha,\beta=1}^{\alpha_{\max}} \tilde M^{\alpha\beta} s^\alpha s^\beta + \log\left(Z_{\mu,\sigma} / Z_{\sigma_0}\right), \tag{11}$$

where $p = \sum_\alpha s^\alpha$. The coupling between $s^\alpha$ and $s^\beta$ is given by $\tilde M^{\alpha\beta}$, which is equal to the pairwise mismatch $M^{\alpha\beta}$ if subgraphs $\alpha$ and $\beta$ do not overlap, and a large positive constant if they do. (Two subgraphs overlap if they have more than one node in common. According to this definition, links in nonoverlapping subgraphs form independently as assumed in Eq. 8.) The threshold term $\log(Z_{\mu,\sigma} / Z_{\sigma_0})$ is evaluated by saddle-point integration; details are given in Supporting Text.
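The spin-model search can be sketched in a few lines of Python. The sketch below is not the authors' implementation: it uses plain single-spin-flip Metropolis updates with a geometric cooling schedule, takes the link numbers and the coupling matrix as given inputs, and replaces the threshold term, whose saddle-point evaluation is described only in Supporting Text, by a user-supplied function `log_z_ratio` that defaults to zero. The energy is taken as minus the score of Eq. 10, so that low energy means high score. All names, the cooling parameters, and the example data are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def energy(s: Sequence[int], L: Sequence[float], M: Sequence[Sequence[float]],
           sigma: float, sigma0: float, mu: float,
           log_z_ratio: Callable[[int], float] = lambda p: 0.0) -> float:
    """Energy of a spin configuration: minus the score, with a placeholder threshold."""
    chosen = [a for a, s_a in enumerate(s) if s_a]
    p = len(chosen)
    if p == 0:
        return 0.0  # convention for the empty alignment (an assumption of this sketch)
    link_term = (sigma - sigma0) * sum(L[a] for a in chosen)
    mismatch_term = mu / (2 * p) * sum(M[a][b] for a in chosen for b in chosen)
    return -link_term + mismatch_term + log_z_ratio(p)


def anneal(L: Sequence[float], M: Sequence[Sequence[float]],
           sigma: float, sigma0: float, mu: float,
           steps: int = 5000, t_start: float = 5.0, t_end: float = 0.05,
           seed: int = 0) -> Tuple[List[int], float]:
    """Return the lowest-energy configuration visited during one annealing run."""
    rng = random.Random(seed)
    s = [0] * len(L)
    e = energy(s, L, M, sigma, sigma0, mu)
    best_s, best_e = s[:], e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        a = rng.randrange(len(L))
        s[a] ^= 1  # propose flipping one spin
        e_new = energy(s, L, M, sigma, sigma0, mu)
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            e = e_new
            if e < best_e:
                best_s, best_e = s[:], e
        else:
            s[a] ^= 1  # reject the move and restore the spin
    return best_s, best_e


# Hypothetical input: four candidate subgraphs; 0 and 3 overlap (large coupling).
BIG = 1e6
L_example = [6, 6, 5, 6]
M_example = [[0, 1, 2, BIG],
             [1, 0, 3, 2],
             [2, 3, 0, 1],
             [BIG, 2, 1, 0]]
best_s, best_e = anneal(L_example, M_example, sigma=3.8, sigma0=2.8, mu=2.0)
```

Recomputing the full energy after every flip keeps the sketch short; for the thousands of candidate subgraphs in a real network one would instead update the energy incrementally, and the threshold term would have to be supplied for scores to be comparable across different numbers of subgraphs.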
Simulated annealing using the Hamiltonian 11 will then yield high-scoring alignments of nonoverlapping subgraphs (22).

For fixed values of the scoring parameters, the algorithm is expected to produce well defined maximum-score alignments. This can be understood as follows. For a (hypothetical) alignment of subgraphs with equal number of internal links and equal pairwise mismatches, the score 10 scales linearly with $p$, the number of aligned subgraphs. This behavior is consistent with the interpretation of Eq. 10 as a log-likelihood score, because the aligned subgraphs occur independently. A high-scoring alignment in a realistic network may consist of a limited number of identical or very similar motifs. As we extend this alignment to include more subgraphs, subgraphs with increasing mutual mismatches are included. Hence, we expect the total mismatch to increase faster than linearly with $p$, leading to a maximum $S^*(\sigma, \mu)$ of the total score at some intermediate value $p^*(\sigma, \mu)$.

The properties of the maximum-score alignments depend strongly on the parameters $\sigma$ and $\mu$. With increasing $\sigma$, the number of internal links $\bar L^*(\sigma, \mu)$ per subgraph is expected to increase. With increasing $\mu$, both the number of graphs $p^*(\sigma, \mu)$ and the fuzziness $\bar M^*(\sigma, \mu)$ decrease. In this way, the maximum-score alignment varies between a set of independent subgraphs for $\mu = 0$ and a set of identical subgraphs with identical motifs for $\mu \to \infty$. A maximum-likelihood approach can be used to infer the optimal scoring parameters $\sigma^*, \mu^*$ for a given data set, which we obtain as the point of the global score maximum $S^* \equiv \max_{\sigma,\mu} S^*(\sigma, \mu)$.

Results and Discussion

In this section, we discuss the application of local graph alignment to motif search in the gene regulatory network of E. coli, taken from ColiNet-1.1, containing 424 nodes and 577 directed links. Each labeled node in this network represents a gene. A directed link between two nodes signifies that the product of the gene represented by the first node acts as a transcription factor on the gene represented by the second.
Throughout we consider motifs with a fixed number of nodes $n = 5$. First we show that our algorithm indeed produces well defined alignments of maximal score (i.e., of maximal relative likelihood). For fixed parameters $\sigma = 3.8$ and $\mu = 4.0$, this is illustrated by Fig. 2a, which shows the score $S$ and the fuzziness $\bar M$ for the highest-scoring alignment with a prescribed number $p$ of subgraphs, plotted against $p$. As expected, the fuzziness increases with increasing $p$, and the total score reaches its global maximum $S^*(\sigma, \mu)$ at an intermediate value $p^*(\sigma, \mu)$. It is lower for $p < p^*(\sigma, \mu)$ because the alignment contains fewer subgraphs and for $p > p^*(\sigma, \mu)$ because the subgraphs have higher mutual mismatches. Fig. 2b shows the score $S^*(\sigma, \mu)$ as a function of $\sigma$ and $\mu$. This function has a unique global maximum $S^*$, which defines the maximum-likelihood point ($\sigma^* = 3.8$, $\mu^* = 2.25$, $p^* = 24$).

Fig. 2. Maximum score alignment and parametric optimization. (a) Score optimization at fixed scoring parameters $\sigma = 3.8$ and $\mu = 4.0$. The total score $S$ (thick line) and the fuzziness $\bar M$ (thin line) are shown for the highest-scoring alignment of $p$ subgraphs, plotted as a function of $p$. (b) The score $S^*(\sigma, \mu)$ plotted against the parameters $\sigma$ and $\mu$. The unique maximum $S^*$ defines the maximum-likelihood parameters $\sigma^* = 3.8$ and $\mu^* = 2.25$.

The scoring parameter $\sigma_0$ of the null ensemble $P_{\sigma_0}$ is determined as follows: the data set is randomized by generating a network with the same connectivities as in the data set but randomly chosen links (10, 18). Again, the nontreelike subgraphs are extracted and their average number of internal links is determined. The value of $\sigma_0$ is uniquely determined by the condition that the expected number of links in the null ensemble 5 equals the average number of internal links found in the nontreelike subgraphs. One obtains $\sigma_0 = $ […]. As expected, $\sigma_0 < \sigma^*$, which shows that the data set has an enhanced number of internal links relative to the randomized network.

At the maximum-likelihood scoring parameters, we can, moreover, verify the functional form of the ensembles used to construct the score function 10. To test the model 5 for individual subgraphs, we enumerate all subgraphs with $n = 5$ that have nontreelike patterns (i.e., a link number $L \geq 5$). All ordered pairs of nodes $i, j$ are then binned according to the probability $w_{ij}$ of a directed link existing between them in the ensemble $P_0(G)$. In Fig. 3a the fraction of these pairs $i, j$ carrying a link is plotted against $w$ (data points). The expectation value of this fraction is given by Eq. 5 as $e^\sigma w / (1 - w + e^\sigma w)$, shown as a solid line with a fit value $\sigma^* = 3.8$.

Our model (Eq. 8) for generic alignments can be tested in a similar way. From this ensemble, the marginal probability that a given ordered pair of nodes specified by $\alpha, i, j$ is linked can be computed. We group all such pairs with the same expectation value $\langle c^\alpha_{ij} \rangle_{\mu^*,\sigma^*}$ according to Eq. 8 to build a histogram. For each group, the average of $c^\alpha_{ij}$ over node pairs in the actual maximum-likelihood alignment is computed and plotted against the model prediction (see Fig. 3b).
The same procedure is repeated for the two-point correlations $\langle c^\alpha_{ij} c^\beta_{ij} \rangle_{\mu^*,\sigma^*}$ between associated nodes in different subgraphs $\alpha$ and $\beta$, as also shown in Fig. 3b. In both histograms, the data points cluster well around the straight line equating expectation values in the model 8 and averages in the actual alignment. The fluctuations seen reflect the limited size of the data set and the small number of fitting parameters in the model. For such data, more detailed models can hardly be tested because they would lead to overfitting.

Fig. 3. Statistics of motif ensembles. (a) Testing the statistical model for single subgraphs (Eq. 5). Nontreelike subgraphs are enumerated and node pairs $i, j$ are binned according to $w_{ij}$. The fraction of such pairs carrying a link is shown against $w_{ij}$. The solid line results from fitting the model with enhanced number of links (Eq. 5) to these data, giving $\sigma = 3.8$. (b) Testing the statistical model for alignments (Eq. 8). (Upper) The average value of $c^\alpha_{ij}$ over all $\alpha, i, j$ with a given expectation value of $c^\alpha_{ij}$ according to Eq. 8 at $\sigma^* = 3.8$ and $\mu^* = 2.25$ against the corresponding expectation value (data points). For a perfect fit between model and data a straight line is expected (shown solid). (Lower) The same procedure is used averaging the two-point function $c^\alpha_{ij} c^\beta_{ij}$ over all $\alpha, \beta, i, j$ with a given expectation value $\langle c^\alpha_{ij} c^\beta_{ij} \rangle$.

We now turn to the probabilistic consensus motifs found in the data for different number of nodes $n = 4$ and $n = 5$. Fig. 4a shows the $n = 4$ consensus motif $\bar c_{ij}$ at consecutive values of $\mu = \mu^* = 3.6$, 8, and 15. The gray scale encodes the average number of links $\bar c_{ij}$ between a given pair of nodes. As expected, the fuzziness decreases with increasing values of the mismatch penalty $\mu$ and $\bar c_{ij}$ tends either to zero (no link present) or one (link present with certainty) as $\mu \to \infty$. The consensus motif is a layered structure, in this case with two input and two output nodes. A similar motif is found for $n = 5$. Fig. 4b shows the $n = 5$ consensus motif at consecutive values of $\mu = 2.25$, 5, and 12.
As in the case of n = 4, a layered structure is clearly discernible: the motif consists of 2 + 3 nodes forming an input and an output layer, with links largely going from the input to the output layer. The left node of the input layer has an average number of about 30 outgoing links. These connectivities are exceptional because the average outconnectivity of the network is […].

Comparing the alignments of subgraphs of n = 4 nodes with those of n = 5 nodes in Fig. 4, one finds that many of the subgraphs found in the n = 4 alignments are also part of the subgraphs found in the n = 5 alignments. This finding immediately leads to the question of how to identify larger patterns in the network from which the subgraphs at a given value of n are taken. Obviously, any scoring scheme operating at a fixed number of nodes n will be blind to the combinatorial possibilities of selecting subgraphs from a larger pattern. The phenomenon is exemplified in Fig. 4c. From the 3-by-4 pattern, two nonoverlapping layered subgraphs with n = 4 and n = 5 can be generated (nonoverlapping subgraphs have at most one node in common, see above). Larger patterns generate correspondingly more nonoverlapping subgraphs. In the supporting information, we discuss a simple scheme that allows one to identify larger patterns, as in Fig. 4c, from smaller subgraphs. The pattern of Fig. 4c is found twice in the data, contributing in total four nonoverlapping subgraphs to the alignments with n = 4 and n = 5. The statistics of these patterns at the level of identical patterns has recently been analyzed in ref. 23; their treatment using probabilistic patterns remains a future development.

PHYSICS, GENETICS. Berg and Lässig, PNAS, October 12, 2004, vol. 101, no. […]

Fig. 4. Probabilistic motifs in the E. coli transcription network. (a) Consensus motifs with n = 4 nodes at different values of $\mu$. From left to right, $\mu = \mu^* = 3.6$, 8, and 15. The gray scale of the links indicates the likelihood that a given link is present in the aligned subgraphs; the five gray values correspond to $\bar{c}$ in the ranges […], and links with $\bar{c} > 0.9$ are shown black. The link reward is kept fixed at $\sigma = \sigma^* = 3.6$ and $\sigma_0$ takes on the value […]. (b) Consensus motifs with n = 5 for different $\mu = \mu^* = 2.25$, 5, and 12 (left to right) at $\sigma = \sigma^* = 3.8$. (c) This pattern with n = 7 is found twice in the data set. From each such subgraph, two nonoverlapping layered subgraphs with n = 4 and n = 5 can be generated. (d) The number $p^*(\sigma^*, \mu)$ of subgraphs in the maximum score alignment (thick line) and the fuzziness $M^*(\sigma^*, \mu)$ (thin line) as a function of $\mu$ for n = 5.

Fig. 4d shows details of the alignments producing the consensus motifs at n = 5, namely, the number of subgraphs $p^*(\sigma^*, \mu)$ and the fuzziness $M^*(\sigma^*, \mu)$ plotted as a function of $\mu$. For $\mu = 12$ the fuzziness reaches zero and the alignment contains 10 identical nonoverlapping motifs. This layered pattern has been found by the approach of ref. 6, which is based on counting identical motifs. However, the maximum-likelihood alignment occurs at $\mu^* = 2.25$ and contains a much larger number, $p^* = 24$, of nonoverlapping subgraphs, leading to the probabilistic consensus motif shown on the left in Fig. 4b. The same effect is found in the consensus motif of size n = 4. Furthermore, at arbitrary nonzero fuzziness the probability that a given pair of subgraphs have identical motifs decreases with subgraph size. As a result, counting identical motifs, rather than following a probabilistic approach such as the one presented here, will miss a fraction of the relevant subgraphs present in the data, and this fraction increases with the size of the subgraph.
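The contrast between a probabilistic consensus motif and a count of identical motifs can be illustrated with a minimal Python sketch. It assumes that each aligned subgraph is given as an n-by-n 0/1 adjacency matrix whose nodes are already listed in aligned order, and that the pairwise mismatch is simply the number of node pairs whose link entries differ (a Hamming-type stand-in for Eq. 3, which is not reproduced in this section). The function names, the toy data, and the normalization of the fuzziness as a mean over pairs are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def consensus_motif(aligned):
    """Average link matrix over aligned subgraphs (entries lie in [0, 1])."""
    p = len(aligned)          # number of aligned subgraphs
    n = len(aligned[0])       # number of nodes per subgraph
    return [[sum(a[i][j] for a in aligned) / p for j in range(n)]
            for i in range(n)]


def pairwise_mismatch(a, b):
    """Assumed mismatch: number of ordered node pairs whose links differ."""
    n = len(a)
    return sum(a[i][j] != b[i][j] for i in range(n) for j in range(n))


def mean_fuzziness(aligned):
    """Assumed fuzziness: mismatch averaged over all pairs of subgraphs."""
    pairs = list(combinations(aligned, 2))
    return sum(pairwise_mismatch(a, b) for a, b in pairs) / len(pairs)


def count_identical_pairs(aligned):
    """Number of subgraph pairs that match exactly (zero mismatch)."""
    return sum(pairwise_mismatch(a, b) == 0
               for a, b in combinations(aligned, 2))


# Toy data (hypothetical): nodes 0 and 1 form the input layer,
# nodes 2 and 3 the output layer.
full = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
partial = [[0, 0, 1, 1],
           [0, 0, 1, 0],   # the link 1 -> 3 is missing in this subgraph
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

aligned = [full, partial, full]
c_bar = consensus_motif(aligned)
```

In this toy alignment the subgraph with one missing link still contributes to the consensus motif, lowering a single entry of `c_bar` below one, whereas a count of identical motifs would discard it.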
The probabilistic grounding of motif search is also indispensable for estimating the quantitative significance of the results obtained. Here we compare the maximum-likelihood alignment in the E. coli data set with suitable random graph ensembles. We do this in two steps, to disentangle the significance of the number of internal links from that of the mutual similarity of the patterns found in the data.

(i) To assess the significance of the number of internal links, we consider the ensemble of graphs with the same in- and outconnectivities as the data set but randomly chosen neighbors (18, 10) and compute the distribution of the score with scoring parameters $\sigma = \sigma^*$, $\mu = 0$. The null distribution of scores from the randomized graph has the average and standard deviation given by $S^* = $ […]. The score $S^* = 73.1$ found from the data is thus significantly higher, indicating an enhanced link number with respect to the random graph ensemble.

(ii) The significance of the mutual similarity of the aligned patterns is assessed by comparing the data to mutually independent random subgraphs with the same average density of links. (This null ensemble is generated by randomizing the internal links of each subgraph independently.) We then compute the score with parameters $\sigma_0 = \sigma^*$ and $\mu = \mu^*$ (thereby focusing only on the fuzziness of the data relative to that found in the ensemble of uncorrelated subgraphs). This null distribution of scores has average and standard deviation given by $S^* = $ […]; the corresponding score $S^* = 50.1$ found from the data is thus significantly higher.

We note that the assessment of subgraph similarity is quite subtle. Subgraphs taken from a large but finite random graph may show a spurious mutual similarity with respect to independent random subgraphs because of a prevalence of internal loops. The statistical significance of the results can be formulated more precisely by using so-called p values, which involve the tail of the score distribution in the random graph ensemble.
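The comparison with a degree-preserving random graph ensemble in step (i) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the null ensemble is generated by repeated edge swaps that keep every in- and outconnectivity fixed, and the statistic being compared is a simple count of feed-forward loops used as a stand-in, because the alignment score itself requires the full search algorithm. All function names and the toy edge list are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev


def degree_sequences(edges):
    """Return (out-degree, in-degree) counters of a directed edge list."""
    return Counter(a for a, _ in edges), Counter(b for _, b in edges)


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomize a simple directed graph by swaps (a->b, c->d) => (a->d, c->b).

    Each swap keeps all in- and out-degrees; swaps that would create a
    self-loop or a duplicate link are rejected.
    """
    rng = random.Random(seed)
    edge_list = list(dict.fromkeys(edges))   # drop duplicate links
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue                          # would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue                          # would create a duplicate link
        edge_set.difference_update([(a, b), (c, d)])
        edge_set.update([(a, d), (c, b)])
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def count_feed_forward_loops(edges):
    """Stand-in statistic: number of triples with a->b, b->c, and a->c."""
    succ = {}
    for a, b in edges:
        if a != b:
            succ.setdefault(a, set()).add(b)
    return sum(1
               for a, outs in succ.items()
               for b in outs
               for c in succ.get(b, ())
               if c != a and c in outs)


def z_score_and_p_value(observed, null_scores):
    """z score and empirical upper-tail p value against null samples."""
    mu, sd = mean(null_scores), stdev(null_scores)
    z = (observed - mu) / sd if sd > 0 else float("nan")
    p = (1 + sum(s >= observed for s in null_scores)) / (1 + len(null_scores))
    return z, p


# Toy network (hypothetical), not the E. coli data set.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 0), (6, 3), (2, 7), (5, 7)]
observed = count_feed_forward_loops(edges)
null = [count_feed_forward_loops(rewire_preserving_degrees(edges, 200, seed=s))
        for s in range(200)]
z, p = z_score_and_p_value(observed, null)
```

The empirical p value used here only counts null samples in the tail, so it cannot resolve values below roughly one over the number of samples.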
Fast and reliable p-value estimates are crucial for searching large databases, as is well known for sequence alignment (21). This approach can be carried over to the graph alignments discussed here.

The statistical framework presented is very flexible. For example, as large-scale data on the logic of gene regulation become available, the definition of the pairwise mismatch, Eq. 3, can be extended to reward aligning sets of nodes performing the same logical function. In this way, features of motifs going beyond their topology can be explored. Similarly, simple modifications of the mismatch score allow the analysis of undirected networks, networks whose links have a specific function (repressive or enhancing), or networks whose interaction strength is quantified by a real number. The prospect of a sizable amount of new data on biological networks becoming available over the next few years through high-throughput methods opens exciting opportunities to identify the building blocks of molecular information processing in a wide range of organisms, and even to build phylogenetic histories of regulation from transcription network data.

We thank T. Hwa for fruitful discussions and U. Alon and P. Arndt for comments on the manuscript. We thank the Kavli Institute for Theoretical Physics at Santa Barbara, the Abdus Salam International Centre for Theoretical Physics, Trieste, the Centro di Ricerca Matematica Ennio De Giorgi, Pisa, and the Aspen Center for Physics for hospitality during various stages of this work. This work has been supported through Deutsche Forschungsgemeinschaft Grant LA […].

1. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis (Cambridge Univ. Press, Cambridge, U.K.).
2. Davidson, E. H. (2001) Genomic Regulatory Systems: Development and Evolution (Academic, San Diego).
3. Tautz, D. (2000) Curr. Opin. Genet. Dev. 1.
4. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) Nature 403.
5. Lockhart, D. J. & Winzeler, E. A. (2000) Nature 405.
6. Shen-Orr, S., Milo, R., Mangan, S. & Alon, U. (2002) Nat. Genet. 31.
7. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. & Alon, U. (2002) Science 298.
8. Newman, M. E. J., Strogatz, S. H. & Watts, D. J. (2001) Phys. Rev. E 64.
9. Newman, M. (2002) Phys. Rev. Lett. 89.
10. Berg, J. & Lässig, M. (2002) Phys. Rev. Lett. 89.
11. Costas, J., Casares, F. & Viera, J. (2003) Gene 310.
12. Stormo, G. D. & Hartzell, G. W. (1989) Proc. Natl. Acad. Sci. USA 86.
13. Bussemaker, H. J., Li, H. & Siggia, E. D. (2000) Proc. Natl. Acad. Sci. USA 97.
14. Felsenstein, J. (1981) J. Mol. Evol. 17.
15. Ullmann, J. R. (1976) J. Assoc. Comput. Mach. 23.
16. Messmer, B. T. & Bunke, H. (1998) IEEE Trans. Pattern Anal. Mach. Intell. 20.
17. Garey, M. R. & Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman, NY).
18. Maslov, S. & Sneppen, K. (2002) Science 296.
19. Itzkovitz, S., Milo, R., Kashtan, N., Ziv, G. & Alon, U. (2003) Phys. Rev. E 68.
20. Berg, J., Lässig, M. & Radic, S. (2003) http://arxiv.org/abs/cond-mat/[…]
21. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87.
22. Blat, M., Wiseman, S. & Domany, E. (1996) Phys. Rev. Lett. 76.
23. Kashtan, N., Itzkovitz, S., Milo, R. & Alon, U. (2003) http://arxiv.org/abs/q-bio/[…]


More information

Finding Shortest Hamiltonian Path is in P. Abstract

Finding Shortest Hamiltonian Path is in P. Abstract Finding Shortest Hamiltonian Path is in P Dhananay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune, India bstract The roblem of finding shortest Hamiltonian ath in a eighted comlete grah belongs

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

PHYS 301 HOMEWORK #9-- SOLUTIONS

PHYS 301 HOMEWORK #9-- SOLUTIONS PHYS 0 HOMEWORK #9-- SOLUTIONS. We are asked to use Dirichlet' s theorem to determine the value of f (x) as defined below at x = 0, ± /, ± f(x) = 0, - < x

More information

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H.

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H. Iranian Journal of Fuzzy Systems Vol. 5, No. 2, (2008). 21-33 GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H. SADREDDINI

More information

SIMULATED ANNEALING AND JOINT MANUFACTURING BATCH-SIZING. Ruhul SARKER. Xin YAO

SIMULATED ANNEALING AND JOINT MANUFACTURING BATCH-SIZING. Ruhul SARKER. Xin YAO Yugoslav Journal of Oerations Research 13 (003), Number, 45-59 SIMULATED ANNEALING AND JOINT MANUFACTURING BATCH-SIZING Ruhul SARKER School of Comuter Science, The University of New South Wales, ADFA,

More information

Network Configuration Control Via Connectivity Graph Processes

Network Configuration Control Via Connectivity Graph Processes Network Configuration Control Via Connectivity Grah Processes Abubakr Muhammad Deartment of Electrical and Systems Engineering University of Pennsylvania Philadelhia, PA 90 abubakr@seas.uenn.edu Magnus

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R.

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R. 1 Corresondence Between Fractal-Wavelet Transforms and Iterated Function Systems With Grey Level Mas F. Mendivil and E.R. Vrscay Deartment of Alied Mathematics Faculty of Mathematics University of Waterloo

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

Bayesian Networks Practice

Bayesian Networks Practice Bayesian Networks Practice Part 2 2016-03-17 Byoung-Hee Kim, Seong-Ho Son Biointelligence Lab, CSE, Seoul National University Agenda Probabilistic Inference in Bayesian networks Probability basics D-searation

More information

Robust hamiltonicity of random directed graphs

Robust hamiltonicity of random directed graphs Robust hamiltonicity of random directed grahs Asaf Ferber Rajko Nenadov Andreas Noever asaf.ferber@inf.ethz.ch rnenadov@inf.ethz.ch anoever@inf.ethz.ch Ueli Peter ueter@inf.ethz.ch Nemanja Škorić nskoric@inf.ethz.ch

More information

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test) Chater 225 Tests for Two Proortions in a Stratified Design (Cochran/Mantel-Haenszel Test) Introduction In a stratified design, the subects are selected from two or more strata which are formed from imortant

More information

One-way ANOVA Inference for one-way ANOVA

One-way ANOVA Inference for one-way ANOVA One-way ANOVA Inference for one-way ANOVA IPS Chater 12.1 2009 W.H. Freeman and Comany Objectives (IPS Chater 12.1) Inference for one-way ANOVA Comaring means The two-samle t statistic An overview of ANOVA

More information

Casimir Force Between the Two Moving Conductive Plates.

Casimir Force Between the Two Moving Conductive Plates. Casimir Force Between the Two Moving Conductive Plates. Jaroslav Hynecek 1 Isetex, Inc., 95 Pama Drive, Allen, TX 751 ABSTRACT This article resents the derivation of the Casimir force for the two moving

More information

Lower bound solutions for bearing capacity of jointed rock

Lower bound solutions for bearing capacity of jointed rock Comuters and Geotechnics 31 (2004) 23 36 www.elsevier.com/locate/comgeo Lower bound solutions for bearing caacity of jointed rock D.J. Sutcliffe a, H.S. Yu b, *, S.W. Sloan c a Deartment of Civil, Surveying

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

Characteristics of Beam-Based Flexure Modules

Characteristics of Beam-Based Flexure Modules Shorya Awtar e-mail: shorya@mit.edu Alexander H. Slocum e-mail: slocum@mit.edu Precision Engineering Research Grou, Massachusetts Institute of Technology, Cambridge, MA 039 Edi Sevincer Omega Advanced

More information

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

EXACTLY PERIODIC SUBSPACE DECOMPOSITION BASED APPROACH FOR IDENTIFYING TANDEM REPEATS IN DNA SEQUENCES

EXACTLY PERIODIC SUBSPACE DECOMPOSITION BASED APPROACH FOR IDENTIFYING TANDEM REPEATS IN DNA SEQUENCES EXACTLY ERIODIC SUBSACE DECOMOSITION BASED AROACH FOR IDENTIFYING TANDEM REEATS IN DNA SEUENCES Ravi Guta, Divya Sarthi, Ankush Mittal, and Kuldi Singh Deartment of Electronics & Comuter Engineering, Indian

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem An Ant Colony Otimization Aroach to the Probabilistic Traveling Salesman Problem Leonora Bianchi 1, Luca Maria Gambardella 1, and Marco Dorigo 2 1 IDSIA, Strada Cantonale Galleria 2, CH-6928 Manno, Switzerland

More information

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators S. K. Mallik, Student Member, IEEE, S. Chakrabarti, Senior Member, IEEE, S. N. Singh, Senior Member, IEEE Deartment of Electrical

More information

Research of PMU Optimal Placement in Power Systems

Research of PMU Optimal Placement in Power Systems Proceedings of the 5th WSEAS/IASME Int. Conf. on SYSTEMS THEORY and SCIENTIFIC COMPUTATION, Malta, Setember 15-17, 2005 (38-43) Research of PMU Otimal Placement in Power Systems TIAN-TIAN CAI, QIAN AI

More information

Pretest (Optional) Use as an additional pacing tool to guide instruction. August 21

Pretest (Optional) Use as an additional pacing tool to guide instruction. August 21 Trimester 1 Pretest (Otional) Use as an additional acing tool to guide instruction. August 21 Beyond the Basic Facts In Trimester 1, Grade 8 focus on multilication. Daily Unit 1: Rational vs. Irrational

More information

Estimation of Separable Representations in Psychophysical Experiments

Estimation of Separable Representations in Psychophysical Experiments Estimation of Searable Reresentations in Psychohysical Exeriments Michele Bernasconi (mbernasconi@eco.uninsubria.it) Christine Choirat (cchoirat@eco.uninsubria.it) Raffaello Seri (rseri@eco.uninsubria.it)

More information