arxiv: v1 [cs.db] 30 Jun 2012

Mining Statistically Significant Substrings using the Chi-Square Statistic

Mayank Sachan
Dept. of Computer Science and Engineering, Indian Institute of Technology, Kanpur, INDIA

Arnab Bhattacharya

arXiv:1207.0144v1 [cs.DB] 30 Jun 2012

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 10. Copyright 2012 VLDB Endowment.

ABSTRACT

The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. We use the chi-square statistic as a quantitative measure of statistical significance. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substring for which the empirical distribution of single letters deviates the most from the distribution expected from the generative Bernoulli model. This deviation is captured using the chi-square measure. The most significant substring (MSS) of a string is thus defined as the substring having the highest chi-square value. To date, to the best of our knowledge, no algorithm finds the MSS in better than $O(n^2)$ time, where $n$ denotes the length of the string. In this paper, we propose an algorithm to find the most significant substring, whose running time is $O(n^{3/2})$ with high probability. We also study some variants of this problem such as finding the top-$t$ set, finding all substrings having chi-square greater than a fixed threshold, and finding the MSS among substrings greater than a given length. We experimentally demonstrate the asymptotic behavior of the MSS on varying the string size and alphabet size. We also describe some applications of our algorithm on cryptology and real world data from finance and sports. Finally, we compare our technique with the existing heuristics for finding the MSS.

1. MOTIVATION

Statistical significance is used to ascertain whether the outcome of a given experiment can be ascribed to some extraneous factors or is solely due to chance. Given a string composed of characters from an alphabet $\Sigma = \{a_1, a_2, \ldots, a_k\}$ of constant size $k$, the null hypothesis assumes that the letters of the string are generated from a memoryless Bernoulli model. Each letter of the string is drawn randomly and independently from a fixed multinomial probability distribution $P = \{p_1, p_2, \ldots, p_k\}$ where $p_i$ denotes the probability of occurrence of character $a_i$ in the alphabet ($\sum_{i=1}^{k} p_i = 1$). The objective is to find the connected subregion of the string (i.e., a substring) for which the empirical distribution of single letters deviates the most from the distribution given by the Bernoulli model.

Detection of statistically relevant patterns in a sequence of events has drawn significant interest in the computer science community and has been diversely applied in many fields including molecular biology, cryptology, telecommunications, intrusion detection, automated monitoring, text mining, and financial modeling. The applications in computational biology include assessing the over-representation of exceptional patterns [7] and studying the mutation characteristics in the protein sequence of an organism by identifying the sudden changes in their mutation rates [18].
Different studies suggest detecting intrusions in various information systems by searching for hidden patterns that are unlikely to occur [6, 7]. In telecommunication, it has been applied to detect periods of heavy traffic [13]. It has also been used in analyzing financial time series to reveal hidden temporal patterns that are characteristic and predictive of time series events [] and to predict stock prices [17].

Quantifying a substring as statistically significant depends on the statistical model used to calculate the deviation of the empirical distribution of single letters from its expected nature. The exact formulation of statistical significance depends on the metric used; the p-value and the z-score [3, 5] are the two most commonly used ones (some of the other ones are reviewed in [, 4]). Research indicates that in most practical cases, the p-value provides more precise and accurate results than the z-score [7].

The p-value is defined as the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming the null hypothesis to be true. For example, in an experiment to determine whether a coin is fair, suppose it turns up heads on 19 out of 20 tosses. Assuming the null hypothesis, i.e., the coin is fair, to be true, the p-value is equal to the probability of observing 19 or more heads in 20 flips of a fair coin (see footnote 1):

$$p\text{-value} = \Pr(19H) + \Pr(20H) = \frac{\binom{20}{19} + \binom{20}{20}}{2^{20}} \approx 0.002\%$$

Traditionally, the decision to reject or fail to reject the null hypothesis is based on a pre-defined significance level $\alpha$. If the p-value is low, the result is less likely assuming the null hypothesis to be true. Consequently, the observation is statistically more significant.

Footnote 1. This definition of p-value is part of a one-sided test; however, we can also calculate the probability of getting at least 19 heads or at least 19 tails, which is part of a two-sided test. The p-value is just double in this case due to symmetry.
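The coin-toss p-value can be checked numerically. The following is a minimal sketch; the function name `binomial_tail_p_value` is hypothetical and not part of the paper.

```python
from math import comb

def binomial_tail_p_value(heads, tosses, p=0.5):
    """One-sided p-value: probability of at least `heads` heads in `tosses` flips."""
    return sum(
        comb(tosses, h) * p**h * (1 - p) ** (tosses - h)
        for h in range(heads, tosses + 1)
    )

# 19 or more heads in 20 flips of a fair coin.
one_sided = binomial_tail_p_value(19, 20)
# For a fair coin the two-sided value is double, by symmetry.
two_sided = 2 * one_sided
print(f"one-sided: {one_sided:.4%}, two-sided: {two_sided:.4%}")
```

Enumerating the tail like this is only feasible for a single binomial count; for multinomial configurations the number of outcomes grows too quickly, which is what motivates the asymptotic statistics discussed next.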

In a memoryless Bernoulli multinomial model, the probability of observing a configuration $\beta_0$, given by a count vector $C = \{Y_1, Y_2, \ldots, Y_k\}$ with $\sum_{i=1}^{k} Y_i = l$ (where $l$ is the length of the substring) denoting the set of observed frequencies of each character in the alphabet, is defined as

$$\Pr(C \mid \beta_0) = l! \prod_{i=1}^{k} \frac{p_i^{Y_i}}{Y_i!} \qquad (1)$$

The p-value for this model then is

$$p\text{-value} = \sum_{\beta \text{ more extreme than } \beta_0} \Pr(\beta) \qquad (2)$$

However, computing the p-value exactly requires analyzing all possible outcomes of the experiment, which are potentially exponential in number, thereby rendering the computation impractical. Moreover, it has been shown that for large samples, asymptotic approximations are accurate enough and easier to calculate [4]. The two broadly used approximations are the likelihood ratio statistic and Pearson's chi-square statistic [4]. In the case of the likelihood ratio test, an alternative hypothesis is set up under which each $p_i$ is replaced by its maximum likelihood estimate $\pi_i = x_i/n$, with the exact probability of a configuration under the null hypothesis defined similarly as in the previous case. The natural logarithm of the ratio between these two probabilities multiplied by $-2$ is then the statistic for the likelihood ratio test:

$$-2\ln(LR) = 2\sum_{i=1}^{k} x_i \ln\left(\frac{\pi_i}{p_i}\right) \qquad (3)$$

Alternatively, Pearson's chi-square statistic, denoted by $X^2$, measures the deviation of the observed frequency distribution from the theoretical distribution [5]:

$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(Y_i - l p_i)^2}{l p_i} \qquad (4)$$

where $O_i$ and $E_i$ are the observed and theoretical (expected) frequencies of the characters in the substring. Since each letter of the substring is drawn from a fixed probability distribution, the expected frequency $E_i$ of a character in the substring is obtained by multiplying the length of the substring $l$ with the probability of occurrence of that character. Hence, the expected frequency vector is given by $E = lP$, where $P = \{p_1, p_2, \ldots, p_k\}$.
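As an illustration of Eq. (4), the sketch below computes the chi-square value of a string directly from its character counts and compares it with the algebraically equivalent form $\sum_i Y_i^2/(l p_i) - l$. The function names are hypothetical.

```python
from collections import Counter

def chi_square(substring, probs):
    """Pearson's X^2 from Eq. (4): sum of (observed - expected)^2 / expected."""
    l = len(substring)
    counts = Counter(substring)
    return sum((counts.get(ch, 0) - l * p) ** 2 / (l * p) for ch, p in probs.items())

def chi_square_simplified(substring, probs):
    """Equivalent form: sum of Y_i^2 / (l * p_i), minus l."""
    l = len(substring)
    counts = Counter(substring)
    return sum(counts.get(ch, 0) ** 2 / (l * p) for ch, p in probs.items()) - l

# Coin-toss example: 19 heads and 1 tail under a fair-coin model.
fair_coin = {"H": 0.5, "T": 0.5}
outcome = "H" * 19 + "T"
print(chi_square(outcome, fair_coin), chi_square_simplified(outcome, fair_coin))
```

Both forms depend only on the counts, not on the order of the characters, so any permutation of `outcome` gives the same value.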
The chi-square ($X^2$) definition in (4) can be further simplified as:

$$X^2 = \sum_{i=1}^{k} \frac{(Y_i - l p_i)^2}{l p_i} = \sum_{i=1}^{k} \frac{Y_i^2}{l p_i} - 2\sum_{i=1}^{k} Y_i + l\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} \frac{Y_i^2}{l p_i} - l \qquad (5)$$

using $\sum_{i=1}^{k} Y_i = l$ and $\sum_{i=1}^{k} p_i = 1$.

Note that the chi-square value for a substring depends only on the count of the characters in it, and not on the order in which they appear. It can be seen in the coin toss example that all the outcomes that are less likely to occur have higher $X^2$ values than the observed outcome. For multinomial models, under the null hypothesis, both the $X^2$ statistic and the $-2\ln(LR)$ statistic converge to the $\chi^2$ distribution with $k - 1$ degrees of freedom [1, 4]. Hence, the p-value of the outcome can be computed using the cumulative distribution function (cdf) $F(x)$ of the $\chi^2(k-1)$ distribution. If $z_0$ is the $X^2$ value of the observed outcome, then its p-value is $1 - F(z_0)$. Moreover, it has also been shown that the $X^2$ statistic converges to the $\chi^2$ distribution from below, as opposed to the $-2\ln(LR)$ statistic which converges from above [1, 4]. Thus, the chi-square statistic diminishes the probability of type-I errors (false positives). Considering these significant advantages, we adopt Pearson's $X^2$ statistic as the estimate to quantify the statistical significance in our study.

In this paper, we focus on the problem where only portions of the string, instead of the whole string, may deviate from the expected behavior. As discussed in the experimental section, this problem is particularly useful in the analysis of temporal strings where an external event occurring in the middle of a string may cause a particular substring to deviate significantly from the expected behavior by inflating or deflating the probabilities of occurrence of some characters in the alphabet. Our work focuses on the problem of identification of such statistically significant substrings in large strings.

Before venturing forward, we formally define the different problem statements handled in this paper for a string $S$ of length $n$.

PROBLEM 1 (MOST SIGNIFICANT SUBSTRING).
Find the most significant substring (MSS) of $S$, which is the substring having the highest chi-square value ($X^2$) among all possible substrings.

PROBLEM 2 (TOP-T SUBSTRINGS). Find the top-$t$ set $T$ of $t$ substrings such that $|T| = t$ and for any two arbitrary substrings $S_1 \in T$ and $S_2 \notin T$, $X^2_{S_1} \ge X^2_{S_2}$.

PROBLEM 3 (SIGNIFICANCE GREATER THAN THRESHOLD). Find all substrings having chi-square value ($X^2$) greater than a given threshold $\alpha_0$.

PROBLEM 4 (MSS GREATER THAN GIVEN LENGTH). Find the substring having the highest chi-square value ($X^2$) among all substrings of length greater than $\gamma_0$.

The rest of the paper is organized as follows. Section 2 provides an overview of the related work. Section 3 formulates some important definitions and observations used by our algorithm. Section 4 describes the algorithm for finding the MSS of a string. Section 5 presents the analysis of the algorithm. Section 6 extends the MSS finding algorithm to the more general problems. Section 7 shows the experimental analysis and some applications of the algorithm on real datasets. Finally, Section 8 discusses possible future work.

2. RELATED WORK

The problem of identifying frequent and statistically relevant subsequences (not necessarily contiguous) in a sequence has been an active area of research over the past decade [19]. The problem of finding statistically significant subsequences within a window of size $w$ has also been addressed [3, 15]. Since the number of subsequences grows exponentially with $w$, the task of computing subsequences within a large window is practically infeasible. We address a different version of the problem where the window size can be arbitrarily large but statistically significant patterns are constrained to be contiguous, thus forming substrings of the given string. The problem has many relevant applications in places where the extraneous factor that triggers such unexpected patterns occurs continuously over an arbitrarily large period in the course of a sequence, as in the case of temporal strings.
As the possible number of substrings reduces to $O(n^2)$, the problem of computing statistically significant patterns becomes much more scalable. However, it is still computationally intensive for large data. The trivial algorithm proceeds by checking all $O(n^2)$ possible substrings. Some improvements such as a blocking technique and a heap strategy were proposed, but they showed no asymptotic improvement in the time complexity []. Two algorithms, namely, ARLM and AGMM, were proposed which use local maxima to find the MSS [9]. It was claimed (only through a conjecture and not a proof) that ARLM would find the MSS. However, the time complexity is still $O(n^2)$ with only constant time improvements. AGMM was an $O(n)$ time heuristic that found a substring whose $X^2$ value was roughly close to the $X^2$ value of the MSS, but no theoretical guarantees were provided on the bound of the approximation ratio. The comparative analysis of our algorithms with them is shown in detail in Section 7. To the best of our knowledge, no algorithm exists to date that exactly finds the MSS or solves the other variants of the problem in better than $O(n^2)$ time.

It may seem that a fast algorithm can be obtained using the suffix tree [14] (see footnote 2). However, the problem at hand is different. To compute the $X^2$ value of any substring we need not traverse the whole substring; rather, we just need the number of occurrences of each character in that substring. This can be easily computed in $O(1)$ time by maintaining $k$ count arrays, one for each character of the alphabet, where the $i$-th element of the array stores the number of occurrences of the character up to the $i$-th position in the string. Each array can be preprocessed in $O(n)$ time. Furthermore, due to the complex non-linear nature of the $X^2$ function we assume that no obvious properties of suffix trees or their invariants can be utilized.

The trivial algorithm checks all possible substrings: there are $O(n)$ starting positions and, for each starting position, $O(n)$ ending positions, thus requiring $O(n^2)$ time. Our algorithm also considers all the $O(n)$ starting positions, but for a particular starting position, it does not check all possible ending positions. Rather, it skips ending positions that cannot generate candidates for the MSS or the top-$t$ set. We show that for a particular starting position, we check only $O(\sqrt{n})$ different ending positions, thereby scanning a total of only $O(n^{3/2})$ substrings. We formally show that the running time of our algorithm is $O(n^{3/2})$.
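The count arrays and the trivial $O(n^2)$ scan described above can be sketched as follows. This is only a baseline illustration, not the paper's algorithm; all function names are hypothetical, and indices are 0-based with an exclusive end, unlike the 1-based convention of the text.

```python
def build_prefix_counts(s, alphabet):
    """prefix[ch][i] = number of occurrences of ch in s[:i]; O(n) per character."""
    prefix = {ch: [0] * (len(s) + 1) for ch in alphabet}
    for i, current in enumerate(s):
        for ch in alphabet:
            prefix[ch][i + 1] = prefix[ch][i] + (1 if current == ch else 0)
    return prefix

def chi_square_range(prefix, probs, i, j):
    """X^2 of s[i:j] from the prefix counts, using sum(Y^2 / (l p)) - l."""
    l = j - i
    return sum((prefix[ch][j] - prefix[ch][i]) ** 2 / (l * p) for ch, p in probs.items()) - l

def mss_brute_force(s, probs):
    """Check every substring; returns (best X^2, start, end) with end exclusive."""
    prefix = build_prefix_counts(s, probs)
    best = (0.0, 0, 0)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            x2 = chi_square_range(prefix, probs, i, j)
            if x2 > best[0]:
                best = (x2, i, j)
    return best

print(mss_brute_force("abbb", {"a": 0.5, "b": 0.5}))
```

Each substring costs $O(k)$ once the prefix counts exist, so the whole scan is $O(kn^2)$; the pruning strategy in the following sections reduces the number of end positions examined per start position.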
We also extend the algorithm for finding the top-$t$ substrings and other variants, all of which, again, run in $O(n^{3/2})$ time.

Footnote 2. A suffix tree is a data structure that can be built in $\Theta(n)$ time. The power of suffix trees lies in quickly finding a particular substring of the string. It provides a fast implementation of many important string operations.

3. DEFINITIONS AND OBSERVATIONS

In the rest of the paper, any string $S$ over a multinomial alphabet $\Sigma = \{a_1, a_2, \ldots, a_k\}$ and drawn from a fixed probability distribution $P = \{p_1, p_2, \ldots, p_k\}$ is phrased as "$S$ over $(\Sigma, P)$". For a given string $S$ of length $n$, $S[i]$ ($1 \le i \le n$) denotes the $i$-th letter of the string $S$ and $S[i \ldots j]$ denotes the substring of $S$ from index $i$ to index $j$, both included. So, the complete string $S$ can also be denoted by $S[1 \ldots n]$.

DEFINITION 1 (CHAIN COVER). For any string $S$ of length $l$, a string $\lambda(S, a_i, l_1)$ of length $l + l_1$ is said to be the chain cover of $S$ over $l_1$ symbols of character $a_i$ if $S$ is the prefix of $\lambda(S, a_i, l_1)$ and the last $l_1$ positions of $\lambda(S, a_i, l_1)$ are occupied by the character $a_i$. Alternatively, $\lambda(S, a_i, l_1)$ is of the form $S$ followed by $l_1$ occurrences of character $a_i$.

For example, if $S$ = dbb then $\lambda(S, d, 3)$ = dbbddd, and if $S$ = baad then $\lambda(S, a, 2)$ = baadaa.

We first prove that for any string $S$ of length $l$, the $X^2$ value of any string $S'$ of length less than or equal to $l + l_1$ and having $S$ as its prefix is upper bounded by the $X^2$ value of a chain cover of $S$ over $l_1$ symbols of some character $a_i \in \Sigma$.

LEMMA 1. Let $S$ be any given string of length $l$ over $(\Sigma, P)$ with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$ and $\sum_{i=1}^{k} Y_i = l$. Let $S'$ be any string which has $S$ as its prefix and is of length $l + l_1$. Then there exists some character $a_j \in \Sigma$ such that the $X^2$ value of $S'$ is upper bounded by the $X^2$ value of the cover string $\lambda(S, a_j, l_1)$. The character $a_j$ is the one with the maximum value of $\frac{2Y_j + l_1}{p_j}$ among all $j \in \{1, 2, \ldots, k\}$.

PROOF. Let the $X^2$ values of strings $S$, $S'$ and $\lambda(S, a_j, l_1)$ be denoted by $X^2_S$, $X^2_{S'}$ and $X^2_\lambda$ respectively. We need to prove that $X^2_{S'} \le X^2_\lambda$.
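By definition, the count vector of $\lambda(S, a_j, l_1)$ is $\{Y_1, Y_2, \ldots, Y_j + l_1, \ldots, Y_k\}$. Further, let $Y'_i$ denote the frequency of occurrences of character $a_i$ in $S'$ that are not present in $S$ (i.e., the frequency of $a_i$ in the $l_1$-length suffix of $S'$). So, the count vector of $S'$ is $\{Y_1 + Y'_1, Y_2 + Y'_2, \ldots, Y_k + Y'_k\}$ where each $Y'_m \ge 0$ and $\sum_{m=1}^{k} Y'_m = l_1$. From the definition of the $X^2$ statistic given in (5), we have

$$X^2_S = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (6)$$

$$X^2_\lambda = \sum_{m \ne j} \frac{Y_m^2}{(l + l_1) p_m} + \frac{(Y_j + l_1)^2}{(l + l_1) p_j} - (l + l_1) = \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \frac{2 Y_j l_1 + l_1^2}{(l + l_1) p_j} - (l + l_1) \qquad (7)$$

$$X^2_{S'} = \sum_{m=1}^{k} \frac{(Y_m + Y'_m)^2}{(l + l_1) p_m} - (l + l_1) = \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \sum_{m=1}^{k} \frac{2 Y_m Y'_m + Y_m'^2}{(l + l_1) p_m} - (l + l_1) \qquad (8)$$

The character $a_j$ is chosen such that it maximizes the quantity $\frac{2Y_j + l_1}{p_j}$ over all characters of the alphabet. So for any other character $a_m$ where $m \in \{1, 2, \ldots, k\}$, since $Y'_m \le l_1$, we have

$$\frac{2Y_m + Y'_m}{p_m} \le \frac{2Y_m + l_1}{p_m} \le \frac{2Y_j + l_1}{p_j} \qquad (9)$$

Multiplying (9) by $Y'_m$ and summing it over $m$ we get

$$\sum_{m=1}^{k} \frac{2 Y_m Y'_m + Y_m'^2}{p_m} \le \frac{2Y_j + l_1}{p_j} \sum_{m=1}^{k} Y'_m = \frac{2 Y_j l_1 + l_1^2}{p_j} \qquad (10)$$

From (7), (8) and (10) we have

$$X^2_{S'} \le \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \frac{2 Y_j l_1 + l_1^2}{(l + l_1) p_j} - (l + l_1) = X^2_\lambda.$$

The next lemma states that the $X^2$ value of a string can always be increased by adding a particular character to it.

LEMMA 2. Let $S$ be any given string of length $l$ over $(\Sigma, P)$ with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$ and $\sum_{i=1}^{k} Y_i = l$. There always exists some character $a_j$ such that by appending it to $S$, the $X^2$ value of the resultant string $S'$ becomes greater than that of $S$. The character $a_j$ is the one with the maximum value of $\frac{Y_j}{p_j}$ among all $j \in \{1, 2, \ldots, k\}$.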
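Lemma 1 can be sanity-checked by brute force on small inputs. The sketch below enumerates every string $S$ of length `l` and every suffix of length `l1` over a small alphabet and confirms that the chain cover chosen by the rule of Lemma 1 bounds the $X^2$ value of every extension. The helper names, the alphabet size and the probabilities are illustrative assumptions; this is an empirical check, not a proof.

```python
from itertools import product

def chi_square_counts(counts, probs):
    """X^2 from a count vector, using sum(Y^2 / (l p)) - l."""
    l = sum(counts)
    return sum(y * y / (l * p) for y, p in zip(counts, probs)) - l

def lemma1_holds(probs, l, l1, tol=1e-9):
    """Exhaustively test Lemma 1 for all prefixes of length l and suffixes of length l1."""
    k = len(probs)
    for s in product(range(k), repeat=l):
        y = [s.count(m) for m in range(k)]
        # Character maximizing (2 Y_j + l1) / p_j, as in the lemma.
        j = max(range(k), key=lambda m: (2 * y[m] + l1) / probs[m])
        cover = list(y)
        cover[j] += l1
        bound = chi_square_counts(cover, probs)
        for suffix in product(range(k), repeat=l1):
            extended = [y[m] + suffix.count(m) for m in range(k)]
            if chi_square_counts(extended, probs) > bound + tol:
                return False
    return True

print(lemma1_holds([0.5, 0.3, 0.2], l=4, l1=3))
```

The check runs over $k^l \cdot k^{l_1}$ combinations, so it is only practical for very short strings.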

PROOF. Let the $X^2$ values of strings $S$ and $S'$ be denoted by $X^2_S$ and $X^2_{S'}$ respectively. We need to prove that $X^2_{S'} > X^2_S$. The string $S'$ is the resultant string obtained by appending character $a_j$ to the string $S$, so the count vector of $S'$ is $\{Y_1, \ldots, Y_j + 1, \ldots, Y_k\}$. From (5), we have

$$X^2_S = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (11)$$

$$X^2_{S'} = \sum_{m=1}^{k} \frac{Y_m^2}{(l + 1) p_m} + \frac{2Y_j + 1}{(l + 1) p_j} - (l + 1) \qquad (12)$$

From (11) and (12) we have

$$X^2_{S'} - X^2_S = \frac{2Y_j + 1}{(l + 1) p_j} - 1 - \sum_{m=1}^{k} \frac{Y_m^2}{l (l + 1) p_m} = \frac{1}{l(l + 1)} \left[ \frac{(2Y_j + 1) l}{p_j} - \sum_{m=1}^{k} \frac{Y_m^2}{p_m} - l(l + 1) \right] \qquad (13)$$

The character $a_j$ is chosen such that it maximizes $\frac{Y_j}{p_j}$ over all $j$. So we have

$$\frac{Y_m}{p_m} \le \frac{Y_j}{p_j} \quad \forall m \in \{1, 2, \ldots, k\} \qquad (14)$$

Multiplying (14) by $Y_m$ and summing it over $m$ we get

$$\sum_{m=1}^{k} \frac{Y_m^2}{p_m} \le \frac{Y_j}{p_j} \sum_{m=1}^{k} Y_m = \frac{l Y_j}{p_j} \qquad (15)$$

Putting (15) into (13) we get

$$X^2_{S'} - X^2_S \ge \frac{1}{l(l + 1)} \left[ \frac{(2Y_j + 1) l}{p_j} - \frac{l Y_j}{p_j} - l(l + 1) \right] = \frac{1}{l(l + 1)} \left[ l \left( \frac{Y_j}{p_j} - l \right) + l \left( \frac{1}{p_j} - 1 \right) \right] \qquad (16)$$

Again, from (14) we have:

$$Y_m \le \frac{Y_j}{p_j} p_m \;\Rightarrow\; \sum_{m=1}^{k} Y_m \le \frac{Y_j}{p_j} \sum_{m=1}^{k} p_m \;\Rightarrow\; l \le \frac{Y_j}{p_j} \qquad (17)$$

Putting (17) into (16) and using $p_j$ strictly less than 1, we get $X^2_{S'} - X^2_S > 0$.

In the next result, we show that the $X^2$ value of any string $S'$ having $S$ as its prefix is upper bounded by the $X^2$ value of the chain cover of $S$.

THEOREM 1. Let $S$ be any given string of length $l$ over $(\Sigma, P)$ with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$ and $\sum_{i=1}^{k} Y_i = l$. Further, let $S'$ be any string which has $S$ as its prefix and is of length less than or equal to $l + l_1$. Then there exists some character $a_j \in \Sigma$ such that the $X^2$ value of $S'$ is less than or equal to the $X^2$ value of $\lambda(S, a_j, l_1)$. The character $a_j$ is the one with the maximum value of $\frac{2Y_j + l_1}{p_j}$ among all $j \in \{1, \ldots, k\}$.

PROOF. The proof follows directly from the results stated in Lemma 1 and Lemma 2. From Lemma 2, there always exists a character such that appending it increases the $X^2$ value of $S'$. Hence, we keep appending such characters to the string $S'$ till its length becomes $l + l_1$. We call the resultant string $S''$. Clearly, $S''$ has $S$ as its prefix and is of length $l + l_1$, and the $X^2$ value of $S'$ is less than or equal to the $X^2$ value of $S''$. The character $a_j$ is the one that maximizes $\frac{2Y_j + l_1}{p_j}$ over all $j \in \{1, 2, \ldots, k\}$; so using Lemma 1, the $X^2$ value of $S''$ is less than or equal to the $X^2$ value of $\lambda(S, a_j, l_1)$. This further implies that the $X^2$ value of $S'$ is less than or equal to the $X^2$ value of $\lambda(S, a_j, l_1)$.

We next formally describe our algorithm for finding the most significant substring (MSS).

4. THE MSS ALGORITHM

The algorithm looks for the possible candidates of the MSS in an ordered fashion. The pseudocode is shown in Algorithm 1.

Algorithm 1 Algorithm for finding the most significant substring (MSS)
 1: $X^2_{max} \leftarrow 0$
 2: for $i \leftarrow n$ to $1$ do
 3:   for $l \leftarrow 0$ to $n - i$ do
 4:     $l' \leftarrow i + l$
 5:     $X^2_l \leftarrow$ $X^2$ value of $S[i \ldots l']$
 6:     if $X^2_l > X^2_{max}$ then
 7:       $X^2_{max} \leftarrow X^2_l$
 8:     end if
 9:     $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
10:     $a \leftarrow 1 - p_t$
11:     $b \leftarrow 2Y_t - 2 l p_t - p_t X^2_{max}$
12:     $c \leftarrow (X^2_l - X^2_{max})\, l\, p_t$
13:     $x \leftarrow \left\lceil \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right\rceil$
14:     Increment $l$ by $x$
15:   end for
16:   $i \leftarrow i - 1$
17: end for
18: return $X^2_{max}$

The loop in line 2 iterates over the start positions of the substrings while the loop in line 3 iterates over all the possible lengths of the substrings from a particular start position. We keep track of the maximum $X^2$ value of any substring computed by our algorithm by storing it in a variable $X^2_{max}$. For a given substring $S[i \ldots l']$, we calculate its $X^2$ value, which is stored in $X^2_l$ (line 5). If $X^2_l$ turns out to be greater than $X^2_{max}$ then $X^2_{max}$ is updated accordingly (line 7).

The character $a_t$ is chosen such that it maximizes the value of $\frac{2Y_j + x}{p_j}$ over all $j$ (line 9 of the pseudocode). This property is necessary for the application of the result stated in Theorem 1. Denoting the $X^2$ value of a chain cover of $S[i \ldots l']$ over $x$ symbols of character $a_t$ by $X^2_\lambda$, Theorem 1 states that the $X^2$ value of any substring of the form $S[i \ldots (l' + m)]$ for $m \in \{0, 1, \ldots, x\}$ is upper bounded by $X^2_\lambda$. We choose $x$ such that it is maximized within the constraint that $X^2_\lambda$ is guaranteed to be less than or equal to $X^2_{max}$. Then, under the given constraint, we can skip checking all substrings of the form $S[i \ldots (l' + m)]$ for $m \in \{0, 1, \ldots, x\}$ as their $X^2$ values are not greater than $X^2_{max}$.
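The skipping strategy can be sketched in Python as below. This is a conservative variant under stated assumptions, not a transcription of Algorithm 1: instead of selecting a single character $a_t$ as in line 9, it takes the smallest positive root of the quadratic constraint $a x^2 + b x + c \le 0$ (with $a$, $b$, $c$ as in lines 10 to 12) over all characters, and it advances by $\lfloor x \rfloor + 1$ positions so that progress is always made. The function name and the small floating-point guard are assumptions of this sketch.

```python
import math

def mss_with_skipping(s, probs):
    """Largest chi-square value over all substrings of s, pruning end positions."""
    chars = list(probs)
    n = len(s)
    # prefix[m][i] = number of occurrences of chars[m] in s[:i]
    prefix = [[0] * (n + 1) for _ in chars]
    for i, ch in enumerate(s):
        for m, sym in enumerate(chars):
            prefix[m][i + 1] = prefix[m][i] + (1 if sym == ch else 0)

    x2_max = 0.0
    for i in range(n - 1, -1, -1):            # start positions, right to left
        j = i + 1                             # current substring is s[i:j]
        while j <= n:
            l = j - i
            y = [prefix[m][j] - prefix[m][i] for m in range(len(chars))]
            x2 = sum(ym * ym / (l * probs[sym]) for ym, sym in zip(y, chars)) - l
            x2_max = max(x2_max, x2)
            # Positive root of the quadratic constraint for every character;
            # the smallest root is a skip length that is safe for all of them.
            root = math.inf
            for ym, sym in zip(y, chars):
                p = probs[sym]
                qa = 1 - p
                qb = 2 * ym - 2 * l * p - p * x2_max
                qc = (x2 - x2_max) * l * p    # never positive, since x2 is at most x2_max
                root = min(root, (-qb + math.sqrt(qb * qb - 4 * qa * qc)) / (2 * qa))
            # Extensions by 0..floor(root) characters cannot beat x2_max.
            j += max(1, math.floor(root - 1e-9) + 1)
    return x2_max

print(mss_with_skipping("abbbbbbab", {"a": 0.5, "b": 0.5}))
```

Taking the minimum root over all characters costs $O(k)$ per step, the same order as finding $a_t$ in line 9, and it avoids having to know $x$ before choosing the character.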
So, we directly increment $l$ by $x$ (line 14).

Next, we find out what the ideal choice of $x$ is. We denote the count vector of substring $S[i \ldots l']$ of length $l$ by $\{Y_1, Y_2, \ldots, Y_k\}$. The count vector of the cover chain is given by $\{Y_1, Y_2, \ldots, Y_t + x, \ldots, Y_k\}$ where $Y_t$ denotes the frequency of character $a_t$ in the algorithm. By definition of $X^2$ from (5),

$$X^2_l = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (18)$$

and

$$X^2_\lambda = \sum_{m=1}^{k} \frac{Y_m^2}{(l + x) p_m} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) = \frac{l (X^2_l + l)}{l + x} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) \qquad (19)$$

We want to maximize $x$ with the constraint that $X^2_\lambda \le X^2_{max}$. From (19) we have,

$$\frac{l (X^2_l + l)}{l + x} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) \le X^2_{max} \qquad (20)$$

On multiplying (20) by $(l + x) p_t$ and rearranging, the constraint simplifies to

$$(1 - p_t) x^2 + (2Y_t - 2 l p_t - p_t X^2_{max}) x + (X^2_l - X^2_{max})\, l\, p_t \le 0 \qquad (21)$$

Eq. (21) is a quadratic inequality in $x$ with $a = 1 - p_t > 0$, $b = 2Y_t - 2 l p_t - p_t X^2_{max}$ and $c = (X^2_l - X^2_{max})\, l\, p_t \le 0$ (since $X^2_l \le X^2_{max}$). We need to maximize $x$ with the constraint that $a x^2 + b x + c \le 0$. Thus, we choose $x$ as the positive root of the quadratic equation:

$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad (22)$$

Since $a > 0$ and $c \le 0$ we have $x \ge 0$. Further, since $x$ has to be an integer, we choose $x$ as the smallest integer greater than or equal to the above value, i.e., its ceiling (line 13 of the algorithm).

5. ANALYSIS OF THE MSS ALGORITHM

We first show that the running time of the algorithm on an input string generated from a memoryless Bernoulli model is $O(kn^{3/2})$ with high probability, where $n$ and $k$ denote the string and alphabet size respectively. For a string not generated from the null model, we will argue that the time taken by our algorithm on that string is less than the time taken by our algorithm on an equivalent string of the same size generated from the null model. Hence, the time complexity of our algorithm for any input string is $O(kn^{3/2})$ with high probability.

Let $S$ be any string drawn from a memoryless Bernoulli model. Let $T_{ij}$ denote the random variable that takes value 1 if $a_i$ occurs at position $S[j]$ and 0 otherwise. Each character of the string $S$ is independently drawn from a fixed probability distribution $P$, so the probability that $T_{ij} = 1$ is $p_i$. The frequency of character $a_i$ in the string $S$, denoted by the random variable $Y_i$, is the sum of $n$ Bernoulli random variables $T_{ij}$ where $j$ ranges from 1 to $n$. Since $Y_i$ is the sum of $n$ i.i.d.
(independent and identically distributed) Bernoulli random variables, each having a success probability $p_i$, $Y_i$ follows a binomial distribution with parameters $n$ and $p_i$.

$$T_{ij} \sim \text{Bernoulli}(p_i), \qquad Y_i = \sum_{j=1}^{n} T_{ij}, \qquad Y_i \sim \text{Binomial}(n, p_i) \qquad (23)$$

We state the following two standard results from the domain of probability distributions.

THEOREM 2. For large values of $n$, the Binomial($n$, $p$) distribution converges to the Normal($\mu$, $\sigma^2$) distribution with the same mean and variance, i.e., $\mu = np$ and $\sigma^2 = np(1 - p)$.

PROOF. The proof uses the result of the Central Limit Theorem. Please refer to [4] for the detailed proof (see footnote 3).

It has been shown in [1] that for both $np$ and $n(1 - p)$ greater than a constant (see footnote 4), the binomial distribution can be approximated by the normal distribution. Since all the probabilities $p_i$ in our setting are fixed, we can always find a constant (say $c_0$) such that for all $n$ greater than $c_0$, every $Y_i$ approximately follows the $N(n p_i, n p_i (1 - p_i))$ distribution.

We use the following result to obtain the distribution of the $X^2$ statistic of any substring from a string generated using the null model.

THEOREM 3. Let the random variables $Y_i$, $i \in \{1, \ldots, k\}$, follow the $N(n p_i, n p_i (1 - p_i))$ distribution with $\sum_{i=1}^{k} p_i = 1$ and the additional constraint that $\sum_{i=1}^{k} Y_i = n$. The random variable

$$X^2 = \sum_{i=1}^{k} \frac{(Y_i - n p_i)^2}{n p_i} \qquad (24)$$

then follows the chi-square distribution with $(k - 1)$ degrees of freedom, denoted by $\chi^2(k - 1)$.

PROOF. It has to be noted that the $Y_i$'s in the theorem are not all independent but have the added constraint that $\sum_{i=1}^{k} Y_i = n$. This is precisely the reason why the degrees of freedom of the chi-square distribution is $k - 1$ instead of $k$. A well known result is that the sum of squares of $k$ independent standard normal random variables follows a $\chi^2(k)$ distribution. The proof (which is slightly complicated) follows directly from this well known result. Please refer to [0] for the detailed proof.

We will next prove that with high probability, the $X^2$ value of the MSS of $S$ generated using the null model is greater than $\ln n$. However, before that, we prove another useful result using elementary probability theory.

LEMMA 3.
Let $Z_{max}$ denote the maximum of $m$ i.i.d. random variables following the $\chi^2(k)$ distribution. Then with probability at least $1 - O(1/m^c)$, for sufficiently large $m$ and for any constant $c > 0$, $\ln m \le Z_{max}$.

PROOF. We first show this for $k = 2$. Let $f(x)$ and $F(x)$ denote the pdf and cdf of the $\chi^2(2)$ distribution:

$$f(x; 2) = \frac{1}{2} e^{-x/2}, \qquad F(x; 2) = 1 - e^{-x/2} \qquad (25)$$

We have

$$Z_{max} = \max\{Z_1, Z_2, \ldots, Z_m\}, \qquad \forall i,\; Z_i \sim \chi^2(k) \qquad (26)$$

For any constant $c > 0$ we have:

$$\Pr\{Z_{max} > \ln m\} = \Pr\{\exists i \text{ s.t. } Z_i > \ln m\} = 1 - \Pr\{\forall i,\; Z_i \le \ln m\} = 1 - \left(\Pr\{Z_i \le \ln m\}\right)^m = 1 - \left(1 - e^{-\frac{1}{2}\ln m}\right)^m = 1 - \left(1 - \frac{1}{\sqrt{m}}\right)^m \ge 1 - e^{-\sqrt{m}} \ge 1 - O(1/m^c) \qquad (27)$$

In the above proof we only utilized the asymptotic behavior of the pdf and cdf of the $\chi^2(k)$ distribution. Since for any general $k$, the asymptotic behavior of the pdf and cdf of the $\chi^2(k)$ distribution has the same dominating term $e^{-x/2}$, the above result is valid for any given $k$ (see footnote 5).

Footnote 3. In the above approximation, we can think of the binomial distribution as the discrete version of the normal distribution having the same mean and variance. So we do not need to account for the approximation error using the Berry-Esseen theorem [8].

Footnote 4. In general, the value of this constant is taken as 5 [1].

LEMMA 4. In the MSS algorithm, at any iteration in the loop over $i$, $X^2_{max} > \ln n'$ with probability at least $1 - O(1/n'^c)$ where $n' = n - i$.

PROOF. We can verify from the pseudocode (Algorithm 1) that before we begin the loop in line 2 for $i = i_0$, we have checked all the substrings that are potential candidates for the MSS of $S$ starting at $i > i_0$. So, at this instance, the variable $X^2_{max}$ stores the maximum $X^2$ value of any substring of the string $S[(i_0 + 1) \ldots n]$. In other words, the variable $X^2_{max}$ would store the maximum of $\binom{n'}{2} = O(n'^2)$ (where $n' = n - i_0$) random variables, each following the same $\chi^2(k - 1)$ distribution. However, since these $O(n'^2)$ substrings are not mutually independent, the result of Lemma 3 cannot be directly applied in this case. However, we can still say that a subset of at least $O(n')$ substrings are independent, with each substring following a $\chi^2(k - 1)$ distribution.

One way of constructing a mutually independent subset of size $O(n')$ is by choosing $n'/c_0$ substrings, each of length $c_0$, such that they do not share any character among them, i.e., the substrings are consecutive non-overlapping blocks of $c_0$ characters, where $c_0$ is a constant such that the binomial distribution can be approximated by the normal distribution for all strings of length greater than or equal to $c_0$. Since all characters of the string $S$ are drawn independently from a fixed probability distribution, all the substrings in the subset are mutually independent, and since the lengths of all these substrings are at least $c_0$, the $X^2$ statistics of these substrings follow the $\chi^2(k - 1)$ distribution. Consequently, the value of $X^2_{max}$ in our algorithm is greater than the maximum of at least $O(n')$ i.i.d. $\chi^2(k - 1)$ random variables. Putting the value of $m = n'/c_0$ in the result of Lemma 3, we can prove the above result.

LEMMA 5.
On an input string generated from the null model, with high probability ($> 1 - \epsilon$ for any constant $\epsilon > 0$) the number of substrings skipped (denoted by $x$) in any iteration of the loop on $l$ in the MSS algorithm is $\omega(\sqrt{l})$ for sufficiently large values of $l$. Hence, $\epsilon$ can be set so close to 0 that with probability practically equal to 1, the number of substrings skipped $x$ in any iteration is at least $\omega(\sqrt{l})$.

PROOF. As stated in (22), the number of substrings skipped in any iteration of the loop on $l$ is

$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad (28)$$

We will prove that in the string generated from the null model, with high probability $b \le \frac{1}{2}\sqrt{l p_t \ln l}$ and $c \le -\frac{1}{2} l p_t \ln l$. These bounds help us in guaranteeing that $x = \omega(\sqrt{l})$ with high probability. In order to prove the bounds on $b$ and $c$, we first prove that the following conditions hold with high probability.

(i) From the result stated in Lemma 4, for any constant $\epsilon_1 > 0$, we have with probability at least $1 - O(1/n'^c) > 1 - \epsilon_1$ that $X^2_{max} > \ln n'$ where $n' = n - i$. In the algorithm, $l$ in the loop iterates from 0 to $n - i$, so we have $l \le n'$. Hence, $X^2_{max} > \ln l$ with probability at least $1 - \epsilon_1$.

(ii) Suppose $Y_t$ denotes the frequency of character $a_t$ in the string $S[i \ldots l']$ of length $l$. As denoted in (23) it is the sum of $l$ independent Bernoulli random variables $T_{tj}$, each with expectation $p_t$; so, $E[Y_t] = l p_t$. Also, we have $\Pr(T_{tj} \in [0, 1]) = 1$. Now, using Hoeffding's inequality [16] with deviation $\delta$ and bounds $a_i \le T_{ti} \le b_i$, we get

$$\Pr\{Y_t - E[Y_t] < \delta\} \ge 1 - \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{l} (b_i - a_i)^2}\right) \qquad (29)$$

Substituting $E[Y_t] = l p_t$, $\delta = \frac{1}{4}\sqrt{l p_t \ln l}$, $a_i = 0$ and $b_i = 1$, we have for any constant $\epsilon_2 > 0$

$$\Pr\left\{Y_t - l p_t < \tfrac{1}{4}\sqrt{l p_t \ln l}\right\} \ge 1 - e^{-\frac{2 l p_t \ln l}{16 l}} = 1 - \frac{1}{l^{p_t/8}} \ge 1 - \epsilon_2 \qquad (30)$$

(iii) As stated in Theorem 3, the $X^2$ value of substring $S[i \ldots l']$ of length $l$, denoted by $X^2_l$, follows the $\chi^2$ distribution. Further, using the definition of the cdf of the $\chi^2$ distribution denoted by $F_x$, we have for any constant $\epsilon_3 > 0$

$$\Pr\left\{X^2_l < \frac{\ln l}{2}\right\} = F_x\left(\frac{\ln l}{2}\right) = 1 - e^{-\frac{\ln l}{4}} \ge 1 - \epsilon_3 \qquad (31)$$

We choose constants $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ small enough such that for any constant $\epsilon > 0$, $1 - \epsilon_1 - \epsilon_2 - \epsilon_3 > 1 - \epsilon$. Thus, combining the above three conditions, the following results hold with probability $1 - \epsilon$:

$$b = 2(Y_t - l p_t) - p_t X^2_{max} \le 2(Y_t - l p_t) \le \frac{1}{2}\sqrt{l p_t \ln l} \qquad (32)$$

$$c = l p_t (X^2_l - X^2_{max}) \le l p_t \left(\frac{1}{2}\ln l - \ln l\right) = -\frac{1}{2} l p_t \ln l \qquad (33)$$

$$a = 1 - p_t \le 1 \qquad (34)$$

We use the fact that if any positive $x$ satisfies the inequality $a' x^2 + b' x + c' \le 0$ then it also satisfies the inequality $a x^2 + b x + c \le 0$ if $a \le a'$, $b \le b'$ and $c \le c'$. So, substituting the upper bounds of $a$, $b$ and $c$ in (28) and maximizing $x$ in (28), we have with probability $1 - \epsilon$

$$x \ge \frac{1}{2}\left(\sqrt{\tfrac{1}{4} l p_t \ln l + 2 l p_t \ln l} - \tfrac{1}{2}\sqrt{l p_t \ln l}\right) = \frac{1}{2}\left(\sqrt{\tfrac{9}{4} l p_t \ln l} - \tfrac{1}{2}\sqrt{l p_t \ln l}\right) = \frac{1}{2}\sqrt{l p_t \ln l} = \Omega(\sqrt{l \ln l}) = \omega(\sqrt{l}) \qquad (35)$$

Footnote 5. The term $x^{k/2 - 1} e^{-x/2}$ occurring in the pdf of a general $k$ is asymptotically less than $e^{-x/(2+\epsilon)}$ and greater than $e^{-x/(2-\epsilon)}$ for any $\epsilon > 0$, which is independent of $k$.

Further, in Algorithm 1, except line 9, all the steps inside the loop over $l$ in line 3 can be performed in constant time. However, if we can determine the frequencies of all of the characters in the substring $S[i \ldots l']$ in $O(1)$ time, then we can find the character $a_t$ (line 9) in $O(k)$ time. For this purpose, we maintain one count array for each character $a_t$, $t = 1, \ldots, k$, where the $i$-th element of the count array stores the number of occurrences of $a_t$ up to the $i$-th position in the string. Each count array can be preprocessed in $O(n)$ time. Consequently, each iteration of the loop over $l$ in line 3 takes $O(k)$ time. Further, the loop over $i$ in line 2 iterates $n$ times. Now, we only need to compute the number of iterations of the loop over $l$, for which we use the next lemma.

LEMMA 6. The expected number of iterations of the loop on $l$ (in line 3 of the MSS algorithm) for each value of $i$ is $O(\sqrt{n})$.
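The claim of Lemma 6 can be illustrated with a toy model. The sketch below assumes that each iteration skips exactly $\lceil \sqrt{l} \rceil$ positions, which is a simplification of the $\omega(\sqrt{l})$ guarantee of Lemma 5 (the $\ln l$ factor is dropped), and counts how many iterations are needed for $l$ to grow from 1 to $n$. The function name is hypothetical.

```python
import math

def iterations_to_reach(n):
    """Number of steps l -> l + ceil(sqrt(l)) needed to grow l from 1 to at least n."""
    l, steps = 1, 0
    while l < n:
        r = math.isqrt(l)
        if r * r < l:          # round the integer square root up to the ceiling
            r += 1
        l += r
        steps += 1
    return steps

for n in (10**2, 10**4, 10**6):
    print(n, iterations_to_reach(n), 2 * math.sqrt(n))
```

In this model the length passes from one perfect square to the next in two steps, so the count stays below $2\sqrt{n}$, consistent with the $O(\sqrt{n})$ bound.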

Algorithm 2 Algorithm for finding the top-$t$ substrings
 1: $T \leftarrow$ Min-Heap on $t$ elements all initialized to 0
 2: for $i \leftarrow n$ to $1$ do
 3:   for $l \leftarrow 0$ to $n - i$ do
 4:     $l' \leftarrow i + l$
 5:     $X^2_{max_t} \leftarrow$ Find-Min($T$)
 6:     $X^2_l \leftarrow$ $X^2$ value of $S[i \ldots l']$
 7:     if $X^2_l > X^2_{max_t}$ then
 8:       Extract-Min($T$)
 9:       Insert $X^2_l$ in $T$
10:     end if
11:     $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
12:     $a \leftarrow 1 - p_t$
13:     $b \leftarrow 2Y_t - 2 l p_t - p_t X^2_{max_t}$
14:     $c \leftarrow (X^2_l - X^2_{max_t})\, l\, p_t$
15:     $x \leftarrow \left\lceil \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right\rceil$
16:     Increment $l$ by $x$
17:   end for
18:   $i \leftarrow i - 1$
19: end for
20: return $T$

Algorithm 3 Algorithm for finding all substrings having $X^2$ value greater than $\alpha_0$
 1: $S_{\alpha_0} \leftarrow \emptyset$
 2: for $i \leftarrow n$ to $1$ do
 3:   for $l \leftarrow 0$ to $n - i$ do
 4:     $l' \leftarrow i + l$
 5:     $X^2_l \leftarrow$ $X^2$ value of $S[i \ldots l']$
 6:     if $X^2_l > \alpha_0$ then
 7:       $S_{\alpha_0} \leftarrow S_{\alpha_0} \cup S[i \ldots l']$
 8:     end if
 9:     $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
10:     $a \leftarrow 1 - p_t$
11:     $b \leftarrow 2Y_t - 2 l p_t - p_t \alpha_0$
12:     $c \leftarrow (X^2_l - \alpha_0)\, l\, p_t$
13:     $x \leftarrow \max\left\{ \left\lceil \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right\rceil, 1 \right\}$
14:     Increment $l$ by $x$
15:   end for
16:   $i \leftarrow i - 1$
17: end for
18: return $S_{\alpha_0}$

PROOF (of Lemma 6). Let $T(r)$ be the number of iterations of the loop over $l$ required for $l$ to reach $r$. We have shown in Lemma 5 that in each iteration, the number of substrings skipped $x$ is $\omega(\sqrt{l})$. Thus, $l$ in the next iteration will reach from $r$ to $r + \omega(\sqrt{r})$. This gives us the following recursive relation:

$$T(r + \sqrt{r}) = T(r) + O(1) = T(r) + q \qquad (36)$$

where $q$ is a constant. It can be shown that the solution to the above relation is $O(\sqrt{n})$. Please refer to Lemma 7 in the appendix for the detailed proof.

Since the loop over $l$ in line 3 runs for $O(\sqrt{n})$ iterations, each taking $O(k)$ time, and the loop over $i$ in line 2 runs $n$ times, the time taken by the algorithm on an input string generated by the null model is $O(kn^{3/2})$, which is $O(n^{3/2})$ since $k$ is taken as a constant in our problem setting. Thus, we have shown that the running time of the algorithm on an input string generated from a memoryless Bernoulli model is $O(kn^{3/2})$ with high probability.

5.1 Nature of the String

As can be verified from the definition, the $X^2$ value of a substring increases when the expected and observed frequencies begin to diverge. Thus, the individual substrings of a string not generated from the null model are expected to have higher $X^2$ values which, in turn, increases $X^2_{max}$.
Further, it can be verified from (22) that the number of substrings skipped, $x$, increases on increasing $X^2_{\max}$, as we have to maximize $x$ such that the constraint $X^2_\lambda \le X^2_{\max}$ is satisfied. If $X^2_{\max}$ is large, it gives a larger window for $X^2_\lambda$, which allows the choice of a larger $x$. Hence, the time taken by our algorithm on an input string not generated from the null model is less than the time taken on an equivalent string of the same size generated from the null model. So, the time complexity of our algorithm remains $O(n^{3/2})$ and is independent of the nature of the input string. Section 7.1.2 gives the details on how our algorithms perform on different types of strings.

6. OTHER VARIANTS OF THE PROBLEM

6.1 Top-t Substrings

The algorithm for finding the top-t statistically significant substrings (Algorithm 2) is the same as the algorithm for finding the MSS except that $X^2_{\max_t}$ stores the $t$-th largest $X^2$ value among all substrings seen by the algorithm up to that instant. We maintain a min-heap $T$ of size $t$ for storing the top-t $X^2$ values seen by the algorithm. The heap $T$ is initially empty and $X^2_{\max_t}$ always stores the top (minimum) element of the heap. If $X^2_l$ is computed to be greater than $X^2_{\max_t}$, then we extract the minimum element of $T$ (which is no longer among the top-t substrings) and insert the new $X^2_l$ value into the heap. Now, $X^2_{\max_t}$ points to the new minimum of the heap. Finally, at the end of the algorithm, we return the heap $T$, which contains the top-t $X^2$ values among all the substrings of string $S$.

The analysis of this algorithm is the same as that of the algorithm for the MSS except that we now need to show that $X^2_{\max_t}$ is greater than $\ln n$ with probability greater than any constant. This still holds true for any $t < \omega(n)$ (please refer to Lemma 8 in the appendix for the detailed proof). Moreover, inside the for loop on $l$, we now perform insertion and extract-min operations on a heap $T$ of size $t$; so each iteration of the loop over $l$ now requires $O(k + \log t)$ time.
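The heap maintenance described above can be sketched with Python's standard `heapq` module. This is only an illustration of the bookkeeping, fed with a plain list of values rather than the substrings visited by the skipping loop; the function name `top_t_values` is hypothetical.

```python
import heapq


def top_t_values(values, t):
    """Keep the t largest values seen so far in a min-heap of size t."""
    heap = [0.0] * t  # all entries initialised to 0, as in Algorithm 2
    for v in values:
        # heap[0] is the current minimum, i.e. the t-th largest value so far.
        if v > heap[0]:
            # Extract-min followed by insert, in O(log t) time.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


print(top_t_values([3.2, 0.5, 7.1, 4.4, 1.0], 3))  # [7.1, 4.4, 3.2]
```

Reading `heap[0]` costs constant time, so the threshold used for the skip is always available without extra work; only an actual replacement pays the logarithmic cost.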
Thus, the total time complexity of the algorithm for finding the top-t substrings is $O((k + \log t)\, n^{3/2})$ for $t < \omega(n)$.

6.2 Significance Greater Than a Threshold

The algorithm for finding all substrings having $X^2$ value greater than a threshold $\alpha_0$ (Algorithm 3) is again essentially the same as the MSS algorithm except that $X^2_{\max}$ constantly remains $\alpha_0$ at every iteration. We maintain $S_{\alpha_0}$ as the set of all substrings having $X^2$ value greater than $\alpha_0$. We skip all substrings that cannot be a part of $S_{\alpha_0}$, i.e., whose cover strings have $X^2$ value not greater than $\alpha_0$.

Next, we analyze the time complexity of the algorithm on varying $\alpha_0$. We again revert to (22):

$$x \leftarrow \max\left\{\left\lfloor \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right\rfloor, 1\right\} \qquad (37)$$

where $a = 1 - p_t > 0$, $b = 2Y_t - 2 l p_t - p_t \alpha_0$ and $c = (X^2_l - \alpha_0)\, l p_t$. If $\alpha_0 < X^2_l$, then $c$ in the above equation is positive. Consequently, as $x$ takes the value 1, the number of iterations of the loop on $l$ is $O(n)$. Hence, the time complexity of the algorithm is $O(kn^2)$. However, the time complexity decreases sharply on

increasing $\alpha_0$. Once $\alpha_0$ becomes sufficiently greater than $X^2_l$, the term $\alpha_0 l p_t$ starts predominating over $b$, and $x$ in each step is effectively $\sqrt{-c/a}$, which is $O(\sqrt{\alpha_0 l})$. [6] Hence, the recurrence relation for the number of iterations of the loop on $l$ in this case is

$$T\left(l + O\left(\sqrt{\alpha_0 l}\right)\right) = T(l) + 1 \qquad (38)$$

It can again be shown with the help of Lemma 7 in the appendix that the solution to the recursive relation is $O(\sqrt{l/\alpha_0})$. So the total time complexity of the algorithm is $O(kn\sqrt{n/\alpha_0})$.

[Figure 1: Analysis of time complexity for finding the MSS. (a) Number of iterations with string length n (k = 2); axes: ln n versus ln(iterations); series: our algorithm, trivial algorithm, O(n^1.5). (b) Number of iterations with alphabet size k.]

[Figure 2: Variation of $X^2_{\max}$ with string length n (k = 2); axes: ln n versus $X^2_{\max}$.]

6.3 MSS Greater Than a Given Length

The algorithm for finding the most significant substring among all substrings having length greater than a given length $\Gamma_0$ is exactly the same as the MSS algorithm except that now we ignore any substring whose length is not greater than $\Gamma_0$. This means the loop on $l$ starts with $\Gamma_0$ instead of 0 and the loop on $i$ goes on till $n - \Gamma_0$ instead of $n$. The time complexity of the algorithm decreases not just because fewer substrings are evaluated in this case but also because the skip $x$ in our algorithm is a function of $l$ and it increases with increasing values of $l$. Hence, the recursive relation for the loop over $l$ in this case is the same, with only the base case different: $T(\Gamma_0) = 1$ instead of $T(1) = 1$. The solution to this recurrence relation is $O(\sqrt{n} - \sqrt{\Gamma_0})$. Since there are $n - \Gamma_0$ iterations of the loop on $i$, the total time complexity of the algorithm is $O(k(n - \Gamma_0)(\sqrt{n} - \sqrt{\Gamma_0}))$, which is effectively $O(kn^{3/2})$.

[6] In a substring generated from a memoryless Bernoulli distribution, $X^2$ follows a $\chi^2$ distribution with constant mean and variance.
Hence, it can be shown with high probability that $X^2_l$ is a small constant.

[Figure 3: $X^2_{\max}$ and number of iterations for different multinomial strings. $S_1$: $n = 10^4$, $k = 3$, $P = \{p_0,\ 0.5 - p_0,\ 0.5\}$; $S_2$: $n = 10^4$, $k = 5$, $P = \{p_0,\ 0.5 - p_0,\ 0.1,\ 0.2,\ 0.2\}$.]

7. EXPERIMENTAL ANALYSES AND APPLICATIONS

The experimental results shown in this section are for C codes run on a Macintosh platform on a machine with a 2.3 GHz Intel dual core processor and 4 GB, 1333 MHz RAM. Each character of a synthetic string was generated independently from the underlying distribution assumed, using the standard uniform (0, 1) random number generator in C.

7.1 Synthetic Datasets

7.1.1 Time Complexity of Finding MSS

The first experiment is on the time complexity of our algorithm for finding the most significant substring. Figure 1a depicts the comparison of the number of iterations required by our algorithm vis-à-vis the trivial algorithm for input strings of different lengths ($n$) generated from the null model for an alphabet of size 2. The number of iterations of our algorithm, when plotted on a logarithmic scale, increases linearly with the logarithm of the string size with a slope close to 1.5. Hence, we can claim that the empirical time complexity of our algorithm for an input string generated by the null model is also $O(n^{1.5})$.

The effect of varying the alphabet size is shown in Figure 1b for different string lengths. It can be observed that, as expected, varying the alphabet size has no significant effect on the number of iterations of the algorithm. Figure 2 shows that the expected $X^2_{\max}$ increases linearly with $\ln n$ with slope 2, which supports our claim in Lemma 4 that for sufficiently large $n$, $X^2_{\max}$ is greater than $\ln n$ with high probability.

Finally, Figure 3 plots the variation of $X^2_{\max}$ and iterations of the loop over $l$ for different heterogeneous multinomial distributions

and different alphabet sizes. It is evident that a change in the probability $p_0$ of occurrence of character $a_0$ only changes $X^2_{\max}$ but has no significant effect on the number of iterations taken by our algorithm. It can be intuitively seen that the change in $p_0$ is effectively canceled out by the change in $X^2_{\max}$, so the number of characters skipped ($x$ in Eq. (22)) roughly remains the same.

[Figure 4: Comparison of time taken by our algorithm on strings not generated by the null model. (a) Varying n (k = 5). (b) Varying k. Axes: string length or alphabet size versus iterations (in millions); series: Null, Geometric, Zipfian, Markov.]

7.1.2 Strings Not Generated Using the Null Model

We now investigate the results for input strings not generated from the null model, in addition to an equivalent length input string generated from the null model, which is a memoryless Bernoulli source where the multinomial probabilities of all the characters are equal. The different types of other strings that we compare are:

(a) Geometric string: A string generated from a memoryless multinomial Bernoulli source but the multinomial probabilities of all the characters are different. The probability of occurrence of a character decreases geometrically. Hence, the probability of occurrence of character $a_i$ is proportional to $1/2^i$.

(b) Harmonic string: A string generated from a memoryless multinomial Bernoulli source but the multinomial probabilities of all the characters are different. The probability of occurrence of a character decreases harmonically. Hence, the probability of occurrence of character $a_i$ is proportional to $1/i$.

(c) Markov string: A string generated by a Markov process, i.e., the occurrence of a character depends on the previous character. The state transition probability of character $a_j$ following character $a_i$ is proportional to $1/2^{(i-j) \bmod k}$.

The number of iterations for our algorithm on different values of string length ($n$) and alphabet size ($k$) are plotted in Figure 4.
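The three kinds of test strings listed above can be generated along the following lines. This is a hedged sketch rather than the authors' C generator: characters are represented by the integers 0 to k-1, the function names are hypothetical, and the Markov transition weights assume the reading $1/2^{(i-j) \bmod k}$ of the garbled exponent.

```python
import random


def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def geometric_probs(k):
    """P(a_i) proportional to 1 / 2**i for i = 1..k."""
    return normalise([1.0 / 2 ** i for i in range(1, k + 1)])


def harmonic_probs(k):
    """P(a_i) proportional to 1 / i for i = 1..k."""
    return normalise([1.0 / i for i in range(1, k + 1)])


def memoryless_string(probs, n, rng):
    """Draw n characters independently from the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def markov_string(k, n, rng):
    """First-order Markov chain; P(j | i) proportional to 1 / 2**((i - j) % k)."""
    rows = [normalise([1.0 / 2 ** ((i - j) % k) for j in range(k)])
            for i in range(k)]
    state = rng.randrange(k)
    out = [state]
    for _ in range(n - 1):
        state = rng.choices(range(k), weights=rows[state], k=1)[0]
        out.append(state)
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
null_string = memoryless_string([0.2] * 5, 1000, rng)
geo_string = memoryless_string(geometric_probs(5), 1000, rng)
mkv_string = markov_string(5, 1000, rng)
```

Python's `%` operator returns a non-negative result for a positive modulus, so `(i - j) % k` matches the mathematical "mod k" even when `i < j`.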
It can be verified that in all the cases, the string generated using the null model requires the maximum number of iterations, which is in accordance with our theoretical claim in Section 5. The time taken by our algorithm on an input string not generated from the null model is upper bounded by the time taken on an equivalent size input string generated from the null model. This verifies that the time complexity of our algorithm is $O(kn^{3/2})$, independent of the type of the input string.

[Figure 5: Analysis of time complexity for finding the top-t set. (a) Number of iterations with string length n; axes: ln n versus ln(time in µs). (b) Number of iterations with t; axes: ln t versus ln(time in µs).]

7.2 Other Variants

7.2.1 Top-t Significant Substrings

The time taken by the algorithm for finding the top-t set on varying string lengths for different values of $t$ is shown in Figure 5a. The linear increment in logarithmic scale with slope 1.5 verifies that for any constant $t$ the time taken by our algorithm to find the top-t set is again $O((k + \log t)\, n^{1.5})$. The time taken for different $t$ is shown in Figure 5b. The plot shows that till $t < \omega(n)$, the running time increases with slope 1.5, but once $t$ crosses the limit, the slope starts increasing towards 2. This is in agreement with our theoretical analysis in Section 6.1.

7.2.2 Significance Greater Than a Threshold

Figure 6 depicts the number of iterations taken by the algorithm for finding all substrings greater than a threshold $\alpha_0$. As discussed in Section 6.2, the iterations decrease very sharply from $O(n^2)$ until $\alpha_0 = O(X^2_{\max})$, after which they gradually decrease (as a function of $1/\sqrt{\alpha_0}$).

7.2.3 Substrings Greater Than a Given Length

The number of iterations taken by the algorithm for finding the MSS among all strings of length greater than $\Gamma_0$ is shown in Figure 7. As discussed in Section 6.3, the number of iterations slowly decreases as $\Gamma_0$ tends to $n$ before rapidly approaching 0.
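The $1/\sqrt{\alpha_0}$ and $\sqrt{n} - \sqrt{\Gamma_0}$ behaviours can be motivated by a heuristic continuous approximation of the recurrences from Sections 6.2 and 6.3. The derivation below is only a sketch that treats $T$ as a smooth function and the skip as a small step; it is not the argument of Lemma 7 in the appendix, which is not reproduced here.

```latex
\[
\begin{aligned}
T\bigl(l + \sqrt{\alpha_0 l}\bigr) - T(l) = 1
  \;&\Longrightarrow\; T'(l)\,\sqrt{\alpha_0 l} \approx 1
  \;\Longrightarrow\; T'(l) \approx \frac{1}{\sqrt{\alpha_0 l}}, \\[4pt]
T(n) \;&\approx\; \int_{1}^{n} \frac{dl}{\sqrt{\alpha_0 l}}
  \;=\; \frac{2\bigl(\sqrt{n} - 1\bigr)}{\sqrt{\alpha_0}}
  \;=\; O\!\left(\sqrt{n/\alpha_0}\right), \\[4pt]
\text{and, starting from } \Gamma_0 \text{ with step } \sqrt{l}:\quad
T(n) \;&\approx\; \int_{\Gamma_0}^{n} \frac{dl}{\sqrt{l}}
  \;=\; 2\bigl(\sqrt{n} - \sqrt{\Gamma_0}\bigr).
\end{aligned}
\]
```

Under this approximation the iteration count for the threshold variant falls off as $1/\sqrt{\alpha_0}$, and the length-restricted variant loses only the $\sqrt{\Gamma_0}$ term, which stays small until $\Gamma_0$ is close to $n$.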

[Figure 6: Number of iterations with $\alpha_0$ ($n = 10^5$, $k = 2$); axes: $\alpha_0$ versus ln(iterations); series: trivial algorithm, our algorithm.]

[Figure 7: Number of iterations with $\Gamma_0$ ($n = 10^5$, $k = 2$); axes: ln $\Gamma_0$ versus ln(iterations); series: trivial algorithm, our algorithm.]

7.3 Comparison with Existing Techniques

Table 1 presents the comparative results of our algorithm with the existing algorithms [13] for two different values of string size (averaged over different runs). As expected, the results indicate that ARLM [13], being $O(n^2)$, does not scale well for larger strings, as opposed to our algorithm. AGMM [13], being $O(n)$ time, is very fast and outperforms all the algorithms in terms of time taken. However, being just a heuristic with no theoretical guarantee, it does not always lead to a solution that is close to the optimal. As can be verified from Table 1, the average $X^2_{\max}$ value found by AGMM is significantly lower than the average $X^2_{\max}$ value found by the other algorithms. Further, since there are no guarantees on the lower bound of the $X^2_{\max}$ value found by it relative to the optimal $X^2_{\max}$ value, AGMM can lead to quite poor solutions in some real datasets which are not as well behaved as the synthetic ones (Section 7.5). Finally, our algorithm requires only 3 seconds for a string as large as length […], which signifies that for real life scenarios, the algorithm is practical.

7.4 Application in Cryptology

The correlation between adjacent symbols is of central importance in many cryptology applications [1]. The objective of a random number generator is to draw symbols from the null model. The independence of consecutive symbols is an important criterion for the efficiency of a random number generator [1]. We define correlation between adjacent symbols in terms of the state transition probability. An ideal random binary string generator should generate the same symbol in the next step with probability exactly 0.5. However, some random number generators which are inefficient might be biased towards generating the same symbol again with probability more than 0.5.
Table 2 shows the comparison of $X^2_{\max}$ for different lengths $n$ of the string and different probabilities $p$ of generation of the same symbol in the next iteration.

[Table 1: Comparison with other techniques for synthetic datasets. Columns: Algo, String Size, Avg $X^2_{\max}$, Avg Time (s). Rows: Trivial, Our, ARLM, AGMM for each of the two string sizes.]

[Table 2: Variation of $X^2_{\max}$ with $n$ and $p$.]

It can be verified from the data that $X^2_{\max}$ is minimum for a string generated with $p = 0.5$ and increases with increasing $p$. Further, Figure 2 plots the variation of $X^2_{\max}$ of a string generated using the null model with (the logarithm of) the string length ($\ln n$). We observe a clear linear convergence with slope 2. This $X^2_{\max}$ value can be used as a benchmark for a string of any length to measure the deviation from the null model. If the observed $X^2_{\max}$ value of a string deviates significantly from the benchmark, it means that the string generated is not completely random but contains some kind of hidden correlation among the symbols. One of the major advantages of using the algorithm is in a scenario where only a substring of a string might deviate from the random behavior. Our algorithm will be able to capture such a substring without having to examine all the possible substrings.

7.5 Real Datasets

7.5.1 Analysis of Sports Data

The chi-square statistic can be used to find the best or worst career patches of sports teams or professionals. Boston Red Sox versus New York Yankees is one of the most famous and fiercest rivalries in professional sports [11]. They have competed against each other in over two thousand Major League Baseball games over a period of […]0 years. Yankees have won 113 (54.7%) of those games. However, we would like to analyze the time periods in which either the Yankees or the Red Sox were particularly dominant against the other. The dominant periods should have a large win ratio for a team over a sufficiently long stretch of games.
If we encode the results in the form of a binary string whose letters denote a win or loss for a team, then these sufficiently long periods will contain results that significantly differ from the expected or average. Consequently, the $X^2$ value for the dominant periods will significantly differ from 0.

We use the dataset obtained from […]. The top five most significant patches found by our algorithm have been summarized in Table 3. The best period for the Yankees was from the mid 1920s to the early 1930s, in which they won more than 75% of the games. It was clearly the era of Yankees dominance in which they won 26 World Series championships and 39 pennants, compared to only 4 pennants for the Red Sox [11]. Alternatively, the best patch for the Red Sox was a two-year period around 1912 in which they had close to a 90% winning record; this is also referred to as the glory period in Red Sox history [11].

[7] Such substrings will tend to exhibit large $X^2$ values and, hence, will be captured by our algorithm.
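The encoding step above can be tried on a toy win/loss string with the trivial quadratic scan, which serves as the baseline that the skipping algorithm is compared against. The sequence below is made up for illustration and is not the Yankees and Red Sox data; the function name `most_significant_substring` and the equal win/loss probabilities are assumptions of the example.

```python
def most_significant_substring(s, probs):
    """Trivial O(k * n^2) scan: return (X^2, start, end) with s[start:end] best."""
    n = len(s)
    best = (0.0, 0, 0)
    for i in range(n):
        counts = dict.fromkeys(probs, 0)
        for j in range(i, n):
            counts[s[j]] += 1          # extend the substring by one character
            length = j - i + 1
            x2 = sum((counts[c] - length * p) ** 2 / (length * p)
                     for c, p in probs.items())
            if x2 > best[0]:
                best = (x2, i, j + 1)
    return best


# Hypothetical results: W = win, L = loss for one team.
results = "LWLWWWWWWWLWLW"
x2, start, end = most_significant_substring(results, {"W": 0.5, "L": 0.5})
print(x2, results[start:end])
```

On this toy input the scan picks out the unbroken run of seven wins, whose value equals its length when wins and losses are equally likely; extending the window to include a neighbouring loss lowers the statistic.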

[Table 3: Performance of Yankees against Red Sox. Columns: Start, End, $X^2$ val, Games, Wins, Win%.]

[Table 4: Comparison with other techniques for sports data. Columns: Algorithm, $X^2$ val, Start, End, Time (s). Rows: Trivial, Our, ARLM, AGMM.]

The comparative results of our algorithm with existing techniques are summarized in Table 4. As expected, our algorithm and ARLM find the optimal solution, but our algorithm outperforms the trivial algorithm and is almost as good as ARLM in terms of time (due to the relatively small string size). Moreover, though AGMM is faster, it does not find the optimal solution. The best period found by AGMM was the second best (see Table 3) and has a significantly lower $X^2$ value.

7.5.2 Analysis of Stock Returns

Most financial models are based on the random walk hypothesis, which is consistent with the efficient-market hypothesis [6]. They assume that stock market prices evolve according to a random walk with a constant drift and, thus, the prices of the stock market cannot be predicted. [8] We analyze the returns of three generic financial securities for which long historical data is available. The Dow Jones Industrial Average is one of the oldest stock market indices and shows the performance of 30 large publicly owned companies in the United States. Similarly, S&P 500 is another large capitalization-weighted index that captures the performance of 500 large-cap common stocks actively traded. Finally, the IBM common stock is representative of one of the oldest and largest publicly owned firms. We run the algorithms on the Dow Jones prices obtained from the year 1928 onwards (20906 days), S&P 500 from 1950 onwards (15600 days) and IBM from 1962 onwards (12517 days). The daily price data are obtained from finance.yahoo.com. Given the randomness in the stock prices, we assume that the prices can increase (or decrease) each day with a fixed probability. The fixed probability is calculated as the ratio of days on which the price went up (or down) to the total number of trading days.
We find the statistically significant substrings of the binary string encoded with 1 for the day if the price of the security went up and 0 otherwise. These substrings correspond to significantly long periods that contain a large ratio of days in which the stock price changed. The results are summarized in Table 5. A lot of bad periods occurred during the Great Depression of the 1930s, the recent dot-com bubble burst and the mortgage recession periods of the last decade, whereas a number of good periods occurred during the economic boom of the 1950s and 1960s. These observations verify that these statistically significant periods do not occur just due to randomness or chance alone, but are consequences of external factors as well. The identification of such significant patterns can help in identifying the relevant external factors. Finally, the $X^2$ values of these substrings can also be used in quantifying the historical risk of the securities, which is one of the most important parameters that investment managers like to control.

[8] If the stock prices can be predicted then there is an arbitrage in the market, which violates the efficient market hypothesis.

[Table 5: Significant periods for the securities. Columns: Periods (Good/Bad), Security, Start, End, Change (%). Two good and two bad periods are listed for each of Dow Jones, S&P 500 and IBM.]

Table 6: Comparison with other techniques for stock returns.
Algo | Sec. | X² | Start | End | Change | Time
Trivial | Dow | … | … | … | …% | 14.…s
Our | Dow | … | … | … | …% | 0.89s
ARLM | Dow | … | … | … | …% | 4.15s
AGMM | Dow | … | … | … | …% | 0.03s
Trivial | S&P | … | … | … | …% | 9.36s
Our | S&P | … | … | … | …% | 0.63s
ARLM | S&P | … | … | … | …% | ….87s
AGMM | S&P | … | … | … | …% | 0.03s

The comparative performance of our algorithm vis-à-vis the other techniques in finding the period with the highest $X^2$ value is summarized in Table 6. Again, as expected, our algorithm, the trivial algorithm and ARLM find the same period for which the $X^2$ value is maximized. However, in this case, the time performance advantage of our algorithm over ARLM is quite apparent. AGMM, though having the time advantage, does poorly in terms of identifying the maximum $X^2$ substring.
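The encoding of a price series into a binary string, together with the fixed up-probability estimated from the same data, can be sketched as below. The prices are invented for the example and the function name `encode_price_moves` is hypothetical; a day with no change is encoded as 0, following the "1 if the price went up and 0 otherwise" rule stated above.

```python
def encode_price_moves(prices):
    """Return (bits, probs): one bit per day-to-day move and the null model.

    bits[d] is "1" if the price rose from day d to day d + 1, else "0".
    probs gives the empirical probability of each symbol over the series.
    """
    bits = "".join("1" if later > earlier else "0"
                   for earlier, later in zip(prices, prices[1:]))
    p_up = bits.count("1") / len(bits)
    return bits, {"1": p_up, "0": 1.0 - p_up}


# Hypothetical closing prices for six consecutive trading days.
bits, probs = encode_price_moves([10.0, 11.0, 10.5, 10.7, 10.7, 11.2])
print(bits, probs)
```

The resulting string and probability map can be passed directly to any routine that scores substrings with the chi-square statistic under a memoryless model.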
Especially for S&P 500, it returns a substring that is not even close to the top few substrings.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we chose to analyze the $X^2$ statistic in the context of a memoryless Bernoulli model. We experimentally saw that for a string drawn from such a model, the chi-square value of the most significant substring increases asymptotically as $(2 \ln n)$, where $n$ is the length of the string. However, the rigorous mathematical proof remains an interesting open problem. Such analysis of asymptotic behavior has significant applications in deciding the confidence interval with which the null hypothesis is rejected. Further, the analysis can be extended to strings generated from Markov models, the most basic of which is the case when there is a correlation between adjacent characters. The single dimensional problem of identification of the most significant substring can be extended to two-dimensional grid networks as well as general graphs. One potentially interesting application is in financial time series analysis of two securities that might not be very correlated in general, but might point to significant correlations during certain specific events such as recession. Such correlations are essential to most risk analysis techniques.

9. REFERENCES

[1] M. Abramowitz and I. Stegun. Handbook of Mathematical Functions. Wiley.
[2] S. Agarwal. On finding the most statistically significant substring using the chi-square measure. Master's thesis, Indian Institute of Technology, Kanpur, 2009.
[3] M. Atallah, R. Gwadera, and W. Szpankowski. Detection of significant sets of episodes in event sequences. In ICDM, pages 3,
