Data Discovery and Anomaly Detection Using Atypicality: Theory


Anders Høst-Madsen, Fellow, IEEE, Elyas Sabeti, Member, IEEE, Chad Walton

A. Høst-Madsen and E. Sabeti are with the Department of Electrical Engineering, University of Hawaii Manoa, Honolulu, HI 96822 (e-mail: {ahm,sabeti}@hawaii.edu). C. Walton is with the Department of Surgery, University of Hawaii, Honolulu, HI (e-mail: cwalton@hawaii.edu). This work was supported in part by NSF grants CCF 0783, 07775, and [...]. The paper was presented in part at the IEEE Information Theory Workshop 2013, Seville.

Abstract: A central question in the era of big data is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that most of the value in the information in some applications is in the parts that deviate from the average, that are unusual, atypical. We define what we mean by atypical in an axiomatic way as data that can be encoded with fewer bits in itself rather than using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real world data sets.

Index Terms: Big Data, atypicality, minimum description length, data discovery, anomaly.

I. INTRODUCTION

One characteristic of the information age is the exponential growth of information, and the ready availability of this information through networks, including the internet: Big Data. The question is what to do with this enormous amount of information. One possibility is to characterize it through statistics (think averages). The perspective in this paper is the opposite, namely that most of the value in the information is in the parts that deviate from the average, that are unusual, atypical. The rest is just background noise. Take art: the truly valuable paintings are those that are rare and atypical. The same could be true for scientific research and entrepreneurship. Take online collections of photos, such as Flickr.com. Most of the photos are rather pedestrian snapshots and not of interest to a wider audience. The photos that are of interest are those that are unique. Flickr has a collection of photos rated for "interestingness", and one can notice that those photos are indeed very different from typical photos. They are atypical.

The aim of our approach is to extract such rare interesting data out of big data sets. The central question is what "interesting" means. A first thought is to focus on the "rare" part. That is, interesting data is something that

is unlikely based on prior knowledge of typical data or examples of typical data, i.e., training. This is the way an outlier is usually defined. Unlikeliness could be measured in terms of likelihood, in terms of codelength [], [] (called "surprise" in [3]), or according to some distance measure. This is also the most common principle in anomaly detection [4]. However, perhaps being unlikely is not sufficient for something to be "interesting". In many cases, outliers are junk that is eliminated so as not to contaminate the typical data. What makes something interesting is maybe that it has a new unusual structure in itself that is quite different from the structure of the data we have already seen.

Return to the example of paintings: what makes masterworks interesting is not just that they are different from other paintings, but that they have some structure that is intriguing. Or take another example. Many scientific discoveries, like the theory of relativity and quantum mechanics, began with experiments that did not fit with prevailing theories. The experiments were outliers or anomalies. What made them truly interesting was that it was possible to find a new theory to explain the data, be it relativity or quantum mechanics. This is the principle we pursue: finding data that have better alternative explanations than those that fit the typical data.

Something being unlikely is not even necessary for the data to be interesting. Suppose the typical data is iid uniform $\{0, 1\}$. Then all sequences of bits of a given length are equally likely. Therefore, a sequence consisting purely of $1, 1, 1, \ldots$ is in no way "surprising". Yet, it should catch our interest.

When we look for new interesting data, a characteristic is that we do not know what we are looking for. We are looking for "unknown unknowns" [5]. Instead of looking at specific statistics of data, we need to use a universal approach. This is provided by information theory.
This idea of finding alternative explanations for data, rather than measuring some kind of difference from typical data, is what separates our method from usual approaches in outlier detection and anomaly detection. As far as we can determine from reading hundreds of papers, our approach has not been explored previously. Obviously, information theory and coding have been used in anomaly detection, data mining, and knowledge discovery before, and we will discuss how this compares to our approach later. Our methodology also has connections to tests for randomness, e.g., the run length test and [6], [7], but our aim is different.

A. Applications

Atypicality is relevant in a large number of applications. We will list a few here.

ECG. For electrocardiogram (ECG) recordings there are patterns in heart rate variability that are known to indicate possible heart disease [8], [9], [0], []. With modern technology it is possible for an individual to wear an unobtrusive heart rate monitor 24/7. If atypical patterns occur, it could be indicative of disease, and the individual or a doctor could be notified. But perhaps a more important application is to medical research. One can analyze a large collection of ECG recordings and look for individuals with atypical patterns. This can then potentially be used to develop new diagnostic tools.

Genomics. Another example of application is interpretation of large collections of genomics data. Given that all mammals have essentially the same set of genes, there must exist some significant differences that distinguish the obvious distinct attributes between species, as well as more subtle differences within a species. Although the

genome has been mined by exhaustive studies applying a panoply of approaches, regions once thought to be uninteresting have recently come under increased study for their potential role in defined morphological and physiological differences between individuals []. Applying an atypical evaluation tool to genomic data from individuals of known pathophysiological/morphological irregularities may provide valuable insight into the genetic mechanisms underlying the condition.

Ocean Monitoring. In passive acoustic monitoring (PAM) [3] of oceans, one or more hydrophones are towed behind a ship or deployed in a fixed bottom-mounted or suspended array in order to record vocalizations of marine mammals. One major focus is to detect, and perhaps count, rare or endangered species. It would be highly interesting to scan the data for any unusual patterns, which can then be further examined by a researcher.

Plant Monitoring. In, for example, nuclear plants, atypical monitoring data may be indicative of something about to go wrong.

Computer Networks. Atypical network traffic could be indicative of a cyberattack. This is already being used through anomaly detection [4]. However, an abstract atypicality approach can be used to find more subtle attacks: the unknown unknowns.

Airport Security. Software is already being used to flag suspicious flyers, likely based on past attacks. Atypical detection could be used to find innovative attackers.

Stock Market. Atypicality could be used to detect insider trading. It could also be used by investors to find unusual stocks to invest in, promising outstanding returns or ruin.

Astronomy. Atypicality can be used to scan huge databases for new kinds of cosmological phenomena.

Credit Card Fraud. Unusual spending patterns could be indicative of fraud. This is already used by credit card companies, but obviously in a simple and annoying way, as anyone whose credit card has been blocked on an overseas trip can testify.

Gambling. Casinos are constantly fighting fraudsters. This is a game of cat and mouse.
Fraudsters constantly find new ways to trick the casinos (one such inventor was Shannon himself). Therefore, an abstract atypicality approach may be the best solution to catch new ways of fraud.

B. Notation

We use $x$ to denote a sequence in general, and $x^l$ when we need to make the length explicit; $x_i$ denotes a single sample of the sequence. We use capital letters $X_i$ to denote random variables rather than specific outcomes. Finally $\mathcal{X}$ denotes a subsequence. All logarithms are to base 2 unless otherwise indicated.

II. ATYPICALITY

Our starting point is the theory of randomness developed by Kolmogorov and Martin-Löf [5], [7], [6]. Kolmogorov divides (infinite) sequences into "typical" and "special". The typical sequences are those that we can call random, that is, they satisfy all laws of probability. They can be characterized through Kolmogorov complexity. A sequence of bits $\{x_n, n = 1, \ldots, \infty\}$ is random (i.e., iid uniform) if the Kolmogorov complexity of

the sequence satisfies $K(x_1, \ldots, x_n) \geq n - c$ for some constant $c$ and for all $n$ [5]. The sequence is incompressible if $K(x_1, \ldots, x_n \mid n) \geq n$ for all $n$, and a finite sequence is algorithmically random if $K(x_1, \ldots, x_n \mid n) \geq n$ [6]. In terms of coding, an iid random sequence is also incompressible, or, put another way, the best coder is the identity function. Let us assume we draw sequences $x^n$ from an iid uniform distribution. The optimum coder is the identity function, and the code length is $n$. Now suppose that for one of these sequences we can find a (universal) coder so that the code length is less than $n$; while not directly equivalent, one could state this as $K(x_1, \ldots, x_n \mid n) < n$. With an interpretation of Kolmogorov's terms, this would not be a "typical" sequence, but a "special" sequence. We will instead call such sequences "atypical". Considering general distributions and general (finite) alphabets instead of iid uniform distributions, we can state this in the following general principle.

Definition 1. A sequence is atypical if it can be described (coded) with fewer bits in itself rather than using the (optimum) code for typical sequences.

This definition is central to our approach to the atypicality problem. In the definition, "the (optimum) code for typical sequences" is quite specific, following the principles in, for example, [6]. We assume prefix-free codes. Within that class the coding could be done using Huffman codes, Shannon codes, Shannon-Fano-Elias codes, arithmetic coding, etc. We care only about the code length, and among these the variation in length is within a few bits, so that the code length for typical encoding can be quite accurately calculated. On the other hand, "described (coded) with fewer bits in itself" is less precise. In principle one could use Kolmogorov complexity, but Kolmogorov complexity is not computable and is only given up to a constant, so a comparison with code length is not an apples-to-apples comparison. Rather, some type of universal source coder should be used.
This can be given a quite precise meaning in the class of finite state machine sources, [7] and following work, and is strongly related to minimum description length (MDL) [8], [9], [0], [7]. What is essential is that we adhere to strict decodability at the decoder. The decoder only sees a stream of bits, and from this it should be able to accurately reconstruct the source sequence. So, for example, if a sequence is atypical, there must be a type of "header" telling the decoder to use a universal decoder rather than the typical decoder. Or, if atypical sequences can be encoded in multiple ways, the decoder must be informed through the sequence of bits which encoder was used. One could argue that such things are irrelevant for, for example, anomaly detection, since we are not actually encoding sequences. The problem is that if such terms are omitted, it is far too easy to encode a sequence "in itself". This is like choosing a more complex model to fit data without accounting for the model complexity in itself, which is exactly what MDL sets out to solve, although also in this case actual encoding is not done. We therefore try to account for all factors needed to describe data, and we believe this is one of the key strengths of the approach.

A major difference between atypical data and anomalous data is that atypicality is an axiomatic property of data, defined by Definition 1 based on Kolmogorov-Martin-Löf randomness. On the other hand, as far as we know, an

anomaly is not something that can be strictly defined. Usually, we think of an anomaly as something caused by an outside phenomenon: an intruder in a computer network, a heart failure, a gambler playing tricks. This influences how we think of performance. If a detector fails to give an indication of an anomaly, we have a miss (or type II error), but if it gives an indication when such things are not happening we have a false alarm (type I error). Atypicality, on the other hand, is purely a property of data. Ideally, there are therefore no misses or false alarms: data is atypical or not. Here is what we mean. If there is an anomaly that expresses itself through the observed data, that must mean that there is some structure in the data, and in theory a source coder would discover and exploit such structure and reduce code length. Thus, if the data is not atypical, that means there is simply no way to detect the anomaly through the observations (again in theory). We therefore cannot really call that a miss. On the other hand, suppose that in a casino a gambler has a long sequence of wins. This could be due to fraud, but it could also be simply due to randomness. But casino security would be interested in either case for further scrutiny. Thus, the reason for the atypicality does not really matter; the atypicality itself matters. Still, to distinguish the two cases we call a sequence intrinsically atypical if it is atypical according to Definition 1 while being generated from the typical probability model, while it is extrinsically atypical if it is in fact generated by any other probability law.

Definition 1 has two parts that work in concert, and we can write it simplified as

$$C_t(x) - C_a(x) > 0, \tag{1}$$

where $C_t$ is the typical codelength and $C_a$ the atypical codelength. The typical code length $C_t(x)$ is simply an expression of the likelihood of seeing a particular sequence. If $C_t(x)$ is large it means that the given sequence is unlikely to happen, and detecting sequences by $C_t(x) > \tau$ would catch many outliers.
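
The two code lengths in the criterion $C_t(x) - C_a(x) > 0$ can be illustrated with a small sketch. It is not the coder analyzed in this paper: the typical model is assumed to be iid with a known $P(1) = p$, and `zlib` is used purely as a convenient stand-in for a universal source coder (so its container overhead is counted, and no header for switching decoders is modeled). The function names are hypothetical.

```python
import math
import zlib


def typical_codelength(bits, p):
    """Ideal code length -log2 P(x), in bits, under an iid model with P(1) = p."""
    ones = sum(bits)
    zeros = len(bits) - ones
    if (ones and p == 0.0) or (zeros and p == 1.0):
        return math.inf  # the sequence is impossible under the typical model
    total = 0.0
    if ones:
        total -= ones * math.log2(p)
    if zeros:
        total -= zeros * math.log2(1.0 - p)
    return total


def standin_atypical_codelength(bits):
    """Bits used by zlib on the packed sequence (an assumed stand-in for a
    universal source coder, not the coder used in the paper)."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    packed = bytes(
        int("".join(str(b) for b in padded[i:i + 8]), 2)
        for i in range(0, len(padded), 8)
    )
    return 8 * len(zlib.compress(packed, 9))


# Under an iid uniform typical model every sequence of length n costs n bits,
# so thresholding C_t alone cannot single out the all-ones sequence.
all_ones = [1] * 4096
c_t = typical_codelength(all_ones, 0.5)
c_a = standin_atypical_codelength(all_ones)
is_atypical = c_t - c_a > 0
```

In this sketch the all-ones sequence has the same typical code length as any other sequence of that length, and it is only the comparison with the second code length that flags it.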
As an extreme example, if a sequence is impossible according to the typical distribution, $C_t(x) = \infty$, and it would always be caught. But it would not work universally. If, as we started out with, typical sequences are iid uniform, any sequence is equally likely and $C_t(x) > \tau$ would not catch any sequences. In this case, if a test sequence has some structure, it is possible that $C_a(x) < C_t(x)$, and such sequences would be caught by atypicality; thus calculating $C_a(x)$ is essential.

Calculating $C_t(x)$ is also essential. Suppose that we instead use $E[C_t] - C_a(x)$, where $E[C_t]$ is the code length used to encode typical sequences on average, essentially the entropy rate. Again, this will catch some sequences: if a test sequence has more or less structure than typical sequences, $E[C_t] - C_a(x) \neq 0$. But again, it will omit very obvious examples: if as test sequence we use a typical sequence with 0 and 1 swapped, $E[C_t] - C_a(x) \approx 0$, while on the other hand $C_t(x) > C_a(x)$. And impossible sequences with $C_t(x) = \infty$ would not be caught with absolute certainty.

Now, to declare something an outlier, we have to find a coder with $C_a(x) < C_t(x)$. It is not sufficient that $C_t(x)$ is large, i.e., that the sequence is unlikely to happen. However, we can always use the trivial coder that transmits data uncoded. If the sequence is unlikely to happen according to the typical distribution, then it is likely that $C_t(x)$ exceeds the length of $x$. Thus, it can be seen that the two parts work in concert to catch sequences. Each part might catch some sequences, but to catch all "anomalies", both parts have to be used.

Another point of view is the following. Suppose again the typical model is binary uniform iid. We look at a collection of sequences, and now we want to find the most atypical sequences, i.e., the most "interesting" sequences. Without a specification of what "interesting" is, it seems reasonable to choose those sequences that have the most

structure, and again this can reasonably be measured by how much the sequence can be compressed. This is what Rissanen [7] calls useful information, $U(x) = n - C_a(x)$. But again, we need to take into account the typical model if it is not uniform iid. For example, if typical sequences have much structure, then sequences with little structure might be more interesting. We therefore end up with $C_t(x) - C_a(x)$ as a reasonable measure of how interesting sequences might be.

A. Alternative approaches

While, as argued in the introduction and outlined above, what we are aiming for is not anomaly detection in the traditional sense, there are still many similarities. And certainly information theory and universal source coding have been used previously in anomaly detection, e.g., [4], [], [], [3], [4], [5], [6], [7], [8], [9], [30]. The approaches have mostly been heuristic. A more fundamental and systematic approach is Information Distance defined in [3]. Without being able to claim that this applies to all of the perhaps hundreds of papers, we think the various approaches can be summarized as using universal source coding as a type of distance measure, whether it satisfies strict mathematical metric properties as in [3] or is more heuristic. On the other hand, our methodology in Definition 1 cannot be classified as a distance measure in a traditional sense. We are instead trying to find alternative explanations for data. We will comment on how our approach contrasts with a few other approaches.

While the similarity distance developed in [3] is not directly applicable to the problem we consider, we can to some extent adapt it, which is useful for contrast. The similarity distance is

$$d = \frac{\max\{K(y \mid x), K(x \mid y)\}}{\max\{K(x), K(y)\}}.$$

Instead of being given the typical distribution, we can imagine that we are given a very long typical sequence $x$ which is used for training. In that case

$$d = \frac{K(x \mid y)}{K(x)} = \frac{K(x, y) - K(y)}{K(x)}$$

within a certain approximation. Suppose, as was our starting point above, that the typical distribution is binary iid uniform.
If $y$ is also binary iid uniform, within a constant $K(x, y) = K(x) + K(y)$, and $d = 1$. But if $y$ is drawn from some other distribution, $y$ cannot help describing $x$ either, and still $d = 1$. That makes sense: two completely random sequences are not similar, whether they are from the same distribution or not. Thus, similarity distance cannot be used for anomaly detection as we have defined it: looking for "special" sequences in the words of Kolmogorov. This is not a problem of the similarity metric; it does exactly what it is designed for, which is really (deterministic) similarity between sequences, appropriate for classification. The reason similarity distance still gives results for anomaly detection [3] is actually that universal source coders approximate Kolmogorov complexity poorly.

Heuristic methods for anomaly detection using universal source coding [4], [], [], [3], [4], [5], [6], [7], [8], [9], [30], [], [] are mostly based on comparing code length. Let $C(x)$ be the code length to encode the sequence $x$ with a universal source coder. Let $x$ be a training string and $y$ a test sequence. We can

then compare $C(x)/|x|$ with $C(y)/|y|$ (which could be seen as a measure of entropy rate) or compare $C(xy)$ with $C(x)$ to detect change. The issue with this is that there are many completely dissimilar sources that have the same entropy rate. As an example, let the data be binary iid with the original source having $P(X = 1) = \frac{1}{3}$ and the new source $P(X = 1) = \frac{2}{3}$. Then the optimum code for the original source and the optimum code for the new source have the same length. On the other hand, atypicality will immediately distinguish such sequences.

III. BINARY IID CASE

In order to clarify ideas, at first we consider a very simple model. The typical model is iid binary with $P(X_n = 1) = p$. The alternative model class is also binary iid but with $P(X_n = 1) = \theta$, where $\theta$ is unknown. We want to decide if a given sequence $x^l$ is typical or atypical. This can be stated as the hypothesis test problem

$$H_0: \theta = p, \qquad H_1: \theta \neq p.$$

This problem does not have a UMP (uniformly most powerful) test. However, a common approach to solving this type of problem is the GLRT (generalized likelihood ratio test) [33]. Let

$$P(b) = P(X_n = b), \qquad \hat{P}(b) = \frac{N(b \mid x)}{l},$$

where $l$ is the sequence length and $N(b \mid x)$ is the number of $x_n = b \in \{0, 1\}$. The GLRT is

$$\begin{aligned}
L &= \log \frac{\prod_{b=0}^{1} \hat{P}(b)^{N(b \mid x)}}{\prod_{b=0}^{1} P(b)^{N(b \mid x)}}
= \sum_{b=0}^{1} N(b \mid x) \log \frac{N(b \mid x)}{l} - \sum_{b=0}^{1} N(b \mid x) \log P(b) \\
&= l \left( \sum_{b=0}^{1} \hat{P}(b) \log \hat{P}(b) - \sum_{b=0}^{1} \hat{P}(b) \log P(b) \right)
= l\, D(\hat{p} \,\|\, p), \\
\phi(x) &= \begin{cases} 1 & L > t \\ 0 & L \leq t \end{cases}
\end{aligned} \tag{2}$$

where $D(\hat{p} \,\|\, p) = \sum_{b=0}^{1} \hat{P}(b) \log \frac{\hat{P}(b)}{P(b)}$ is the relative entropy [6] and $t$ some threshold. While the GLRT is a heuristic principle, it satisfies some optimality properties, and in this case it is equal to the invariant UMP test [34], which can be considered an optimum solution under certain constraints. Thus, it is reasonable to take this as the optimum solution for this problem, and we do not need to appeal to Kolmogorov or information theory to solve the problem.

The complications start if we consider sequences of variable length. The test depends on the sequence length. We need to choose a threshold $t$ as a function of $l$, which will then result in a false alarm probability $P_{FA}(l)$

and detection probability $P_D(l)$. There is no obvious argument for how to choose $t(l)$ from a hypothesis testing point of view; we could choose $t$ independent of $l$, but that is just another arbitrary choice.

We will consider this problem in the context of Definition 1. In order to do so, we need to model the problem from a coding point of view. We assume we have an infinite sequence of sequences of variable length $l_i$, and these need to be encoded. We need to encode each bit, and also to encode whenever a new sequence starts. For typical encoding of the bits we can use a Shannon code, Huffman code, arithmetic coding, etc. The code length for a sequence of length $l$ is

$$L_t(l) = -N(1 \mid x) \log p - N(0 \mid x) \log (1 - p) = l \left( -\hat{p} \log p - (1 - \hat{p}) \log (1 - p) \right)$$

except for a small constant; here $\hat{p} = \hat{P}(1) = \frac{1}{l} \sum_{i=1}^{l} x_i$. We also need to encode where a sequence ends and a new one starts. For simplicity let us for now assume lengths are geometrically distributed. We can then model the problem as one with three source symbols '0', '1', and ',', with an iid distribution with $P(\text{','}) = \epsilon$, $P(0) = (1 - p)(1 - \epsilon)$, $P(1) = p(1 - \epsilon)$. If we assume $\epsilon$ is small, the expression for $L_t(l)$ is still valid for the content part, and to each sequence is added a constant $-\log \epsilon$ to encode separators.

To decide if a sequence is atypical according to Definition 1, we can use the universal source coder from [6]: the source encodes first the number of ones $k$; then it enumerates the sequences with $k$ ones, and transmits the index of the given sequence. For analysis it is important to have a simple expression for the code length. We can therefore use $L_a(l) = l H(\hat{p}) + \frac{1}{2} \log l$. This is an approximation which is good for reasonably large $l$, and it also reaches the lower bound in [7], [35].

The source coder also needs to inform the decoder that the following is an atypical sequence, so that it knows to use the atypical decoder rather than the typical decoder, and where it ends. For the former we can use a '.' to indicate the start of an atypical sequence rather than the ',' for typical sequences. If the probability that a sequence is atypical is $\delta$, $P(\text{'.'}) = \delta \epsilon$ and $P(\text{','}) = (1 - \delta) \epsilon \approx \epsilon$.
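
The enumerative coder described above (send the number of ones $k$, then the index of the sequence among all sequences with $k$ ones) can be sketched as follows. This is a minimal illustration under assumptions, not the implementation from [6]: lexicographic ranking is one possible enumeration order, and all function names are hypothetical. The last two functions compare the resulting code length with the approximation $l H(\hat{p}) + \frac{1}{2} \log l$.

```python
from math import comb, log2


def enum_encode(bits):
    """Return (k, index): the number of ones and the lexicographic rank of
    `bits` among all sequences of the same length with exactly k ones."""
    n, k = len(bits), sum(bits)
    index, ones_left = 0, k
    for i, b in enumerate(bits):
        if b:
            # All sequences with a 0 here and ones_left ones later rank lower.
            index += comb(n - i - 1, ones_left)
            ones_left -= 1
    return k, index


def enum_decode(n, k, index):
    """Invert enum_encode for a sequence of known length n."""
    bits = []
    for i in range(n):
        zeros_first = comb(n - i - 1, k)
        if index >= zeros_first:
            bits.append(1)
            index -= zeros_first
            k -= 1
        else:
            bits.append(0)
    return bits


def enum_codelength(bits):
    """Idealized (non-integer) length: log2(n + 1) bits for k, log2 C(n, k) for the index."""
    n, k = len(bits), sum(bits)
    return log2(n + 1) + log2(comb(n, k))


def approx_codelength(bits):
    """The approximation l * H(p_hat) + 0.5 * log2(l) used in the analysis."""
    n, k = len(bits), sum(bits)
    p_hat = k / n
    entropy = 0.0
    for q in (p_hat, 1.0 - p_hat):
        if q > 0.0:
            entropy -= q * log2(q)
    return n * entropy + 0.5 * log2(n)
```

For example, `enum_decode(len(x), *enum_encode(x))` returns `x` again, and for long sequences with $\hat{p}$ away from 0 and 1 the two code lengths differ only by a small constant.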
The code length for a '.' now is $-\log \epsilon - \log \delta$. To mark the end of the atypical sequence we could again insert a '.' or a ','. But the code for either is based on the distribution of lengths of typical sequences, which we assume known, whereas we would have no knowledge of the length of atypical sequences. Instead it seems more reasonable to encode the length of the specific atypical sequence. As argued in [8], [36] this can be done with $\log^* l + \log c$, where $c$ is a constant and

$$\log^* l = \log l + \log \log l + \log \log \log l + \cdots \tag{3}$$

where the sum continues as long as the argument to the log is positive. To summarize we have

$$\begin{aligned}
L_t(l) &= l \left( -\hat{p} \log p - (1 - \hat{p}) \log (1 - p) \right) - \log \epsilon \\
L_a(l) &= l H(\hat{p}) + \tfrac{1}{2} \log l + \log^* l + \log c - \log \epsilon - \log \delta \\
&\approx l H(\hat{p}) + \tfrac{3}{2} \log l - \log \epsilon + \tau \\
\tau &= -\log \delta + \log c
\end{aligned} \tag{4}$$
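
The two code lengths summarized above can be evaluated directly. The following is a sketch under assumptions: the numerical value used for the constant $c$ is the one commonly quoted for Rissanen's universal prior on the integers and is not taken from this paper, the values of $\epsilon$ and $\delta$ in the example are arbitrary, and the function names are hypothetical.

```python
from math import log2

RISSANEN_C = 2.865064  # assumed value of the constant c (not from this paper)


def log_star(l):
    """log* l = log l + log log l + ..., keeping only the positive terms."""
    total, term = 0.0, log2(l)
    while term > 0.0:
        total += term
        term = log2(term)
    return total


def binary_entropy(p_hat):
    return -sum(q * log2(q) for q in (p_hat, 1.0 - p_hat) if q > 0.0)


def typical_length(bits, p, eps):
    """L_t(l): ideal code length under the typical model (0 < p < 1) plus a separator."""
    l, ones = len(bits), sum(bits)
    return -ones * log2(p) - (l - ones) * log2(1.0 - p) - log2(eps)


def atypical_length(bits, eps, delta, c=RISSANEN_C):
    """L_a(l): content, parameter cost, length code, and the '.' header."""
    l = len(bits)
    p_hat = sum(bits) / l
    return (l * binary_entropy(p_hat) + 0.5 * log2(l)
            + log_star(l) + log2(c) - log2(eps) - log2(delta))


def is_atypical(bits, p, eps, delta):
    return atypical_length(bits, eps, delta) < typical_length(bits, p, eps)


# A run of 64 ones is cheaper to describe "in itself" than with a fair-coin model.
run_flagged = is_atypical([1] * 64, p=0.5, eps=0.01, delta=0.01)
# A balanced sequence is not flagged: this model class only sees the fraction of ones.
balanced_flagged = is_atypical([0, 1] * 32, p=0.5, eps=0.01, delta=0.01)
```

The second example is a reminder that the binary iid alternative class cannot exploit the obvious periodicity of 0101...; a richer universal coder would be needed for that.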

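The criterion for a sequence to be atypical is $L_a(l) < L_t(l)$, which is easily seen to be equivalent to

$$D(\hat{p} \,\|\, p) > \frac{\tau + \frac{3}{2} \log l}{l}. \tag{5}$$

If the lengths are fixed, this reduces to (2). But if the lengths are variable, (5) provides a threshold as a function of $l$. The term $\frac{3}{2} \log l$ ensures that $\lim_{l \to \infty} P_{FA}(l) = 0$, which seems reasonable. If instead $D(\hat{p} \,\|\, p) > \frac{\tau}{l}$ is used, it is easy to see that $\lim_{l \to \infty} P_{FA}(l) > 0$. Except for this property, the term $\frac{3}{2} \log l$ might seem arbitrary, e.g., why $\frac{3}{2}$? But it is based on solid theory, and as will be seen later it has several important theoretical properties.

We will examine the criterion (5) in more detail. The inequality (5) gives two thresholds for $\hat{p}$,

$$\hat{p} > p_+, \qquad \hat{p} < p_-,$$

where $0 < p_- < p < p_+ < 1$. It is impossible to find explicit expressions for $p_\pm$, but it is clear that $p_\pm \to p$ as $l \to \infty$. Therefore, for large $l$, we can replace $D(\hat{p} \,\|\, p)$ with a series expansion. With $q = 1 - p$, we then end up with the more explicit criterion

$$\frac{(p - \hat{p})^2}{pq \ln 4} > \frac{\tau + \frac{3}{2} \log l}{l}
\quad \Longleftrightarrow \quad
|\hat{p} - p| > \sqrt{\frac{pq \ln 4 \left( \tau + \frac{3}{2} \log l \right)}{l}}. \tag{6}$$

In the following we will use this as it is considerably simpler to analyze. We can also write this as

$$\frac{\left| \sum_{i=1}^{l} (x_i - p) \right|}{\sqrt{l\, pq}} > \sqrt{2 \tau \ln 2 + 3 \ln l}. \tag{7}$$

Now, if not for the term $3 \ln l$, this would be a central limit type of statement, and the probability that a sequence is classified as (intrinsically) atypical would be $P_A(l) \approx 2 Q\left( \sqrt{2 \tau \ln 2} \right)$, independent of $l$. Our main interest is exactly the dependency on $l$, which is given by the following theorem.

Theorem 2. Consider an iid $\{0, 1\}$-sequence. Let $P_A(l)$ be the probability that a sequence of length $l$ is classified as intrinsically atypical according to (6). Then $P_A(l)$ is bounded by

$$P_A(l) \leq 2^{-\tau + 1} \frac{K(l, \tau)}{l^{3/2}}, \qquad \forall \tau: \lim_{l \to \infty} K(l, \tau) = 1. \tag{9}$$

For $p = \frac{1}{2}$ this can be strengthened to

$$P_A(l) \leq \frac{2^{-\tau + 1}}{l^{3/2}}. \tag{10}$$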

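These bounds are tight in the sense that

$$\lim_{l \to \infty} \frac{\ln P_A(l)}{-\frac{3}{2} \ln l} = 1. \tag{11}$$

Proof: By definition $P_A(l) = P\left( |\hat{p} - p| > b \right)$, where, as usual, $q = 1 - p$ and

$$b = \sqrt{\frac{pq \ln 4 \left( \tau + \frac{3}{2} \log l \right)}{l}}.$$

We bound the upper tail; the lower tail is treated in the same way. The Chernoff bound (e.g., [37]) states

$$P\left( \sum_{i=1}^{l} X_i \geq l (p + b) \right) \leq \inf_{s > 0} \left\{ e^{-s(p + b)} M_X(s) \right\}^l, \tag{12}$$

where $M_X(s)$ is the moment generating function of $X_i$, which for a Bernoulli random variable is

$$M_X(s) = p e^s + q.$$

Then the bound is $\inf_{s > 0} \left\{ \exp(-s(p + b)) (p e^s + q) \right\}^l$. Minimizing over $s$ gives the bound

$$\left( \frac{p (q - b)}{q (p + b)} \right)^{l (p + b)} \left( \frac{q}{q - b} \right)^{l},$$

or, for the logarithm of the bound divided by $l$,

$$(p + b) \ln \frac{p (q - b)}{q (p + b)} + \ln \frac{q}{q - b}
= -\frac{b^2}{2 pq} + O(b^3)
= -\frac{\tau \ln 2 + \frac{3}{2} \ln l}{l} + O\left( \left( \frac{\ln l}{l} \right)^{3/2} \right),$$

where we have used $x - \frac{x^2}{2} \leq \ln(1 + x) \leq x - \frac{x^2}{2} + \frac{x^3}{3}$ for $x \geq 0$. Combined over the two tails, the equation (12) directly leads to (9).

For $p = \frac{1}{2}$ Hoeffding's inequality [38] gives the bound

$$P_A(l) \leq 2 \exp\left( -2 l b^2 \right) = 2 \exp\left( -\ln 2 \left( \tau + \tfrac{3}{2} \log l \right) \right) = \frac{2^{-\tau + 1}}{l^{3/2}}; \tag{13}$$

for $p = \frac{1}{2}$ this is tighter than (12).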
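
The $p = \frac{1}{2}$ bound $2^{-\tau + 1} / l^{3/2}$ can be checked numerically against the exact probability of the event $|\hat{p} - p| > b$, which for a fair coin is a sum of binomial terms. The sketch below is only a sanity check under that assumption; the function names and the chosen values of $l$ and $\tau$ are hypothetical.

```python
from math import comb, log, log2, sqrt


def deviation_threshold(l, tau, p=0.5):
    """b = sqrt(p q ln4 (tau + 1.5 log2 l) / l), the threshold on |p_hat - p|."""
    q = 1.0 - p
    return sqrt(p * q * log(4.0) * (tau + 1.5 * log2(l)) / l)


def exact_p_atypical(l, tau):
    """Exact P(|p_hat - 1/2| > b) for l fair coin flips."""
    b = deviation_threshold(l, tau)
    count = sum(comb(l, k) for k in range(l + 1) if abs(k / l - 0.5) > b)
    return count / 2 ** l


def hoeffding_bound(l, tau):
    """The bound 2^(-tau + 1) / l^(3/2) for p = 1/2."""
    return 2.0 ** (-tau + 1) / l ** 1.5


# Exact probability next to the bound for a few lengths (tau = 1 is arbitrary).
table = [(l, exact_p_atypical(l, 1.0), hoeffding_bound(l, 1.0))
         for l in (10, 50, 200, 1000)]
```

Each row of `table` holds the length, the exact probability, and the bound; the exact value never exceeds the bound, since the bound is Hoeffding's inequality evaluated at $b$.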

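For the lower bound we use moderate deviations from [39]. Define $\tilde{X}_i = \frac{X_i - p}{\sqrt{pq}}$. We can then rewrite (7) as

$$\left| \frac{1}{\sqrt{l}} \sum_{i=1}^{l} \tilde{X}_i \right| > \sqrt{2 \tau \ln 2 + 3 \ln l}.$$

We define $a_l = \frac{1}{2 \tau \ln 2 + 3 \ln l}$, which satisfies $\lim_{l \to \infty} a_l = 0$, $\lim_{l \to \infty} l a_l = \infty$. Using this as $a_l$ in [39, Theorem 3.7.1] gives

$$\liminf_{l \to \infty} \frac{1}{2 \tau \ln 2 + 3 \ln l} \ln P\left( \left| \frac{1}{\sqrt{l}} \sum_{i=1}^{l} \tilde{X}_i \right| > \sqrt{2 \tau \ln 2 + 3 \ln l} \right)
= \liminf_{l \to \infty} \frac{1}{3 \ln l} \ln P\left( \left| \frac{1}{\sqrt{l}} \sum_{i=1}^{l} \tilde{X}_i \right| > \sqrt{2 \tau \ln 2 + 3 \ln l} \right)
\geq -\frac{1}{2}.$$

Together with the upper bound, this gives (11).

Figure 1 compares the upper bound with simulations.

Fig. 1. Simulated $P_A(l)$ and the upper bound for $\tau$ = [...], $p = 0.3$.

We can also bound the miss probability for extrinsically atypical sequences as follows.

Theorem 3. Suppose that the typical sequence is an iid $\{0, 1\}$-sequence with $P(X_n = 1) = p$. Let the test sequence be iid with $P(X_n = 1) = p_a$. The probability that the test sequence is missed according to criterion (6) is upper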

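bounded by

$$P_M(l) \leq \frac{2^{-\tau}}{l^{3/2}} \left( \frac{q_a p}{p_a q} \right)^{\sqrt{pq\, l \left( 2 \tau \ln 2 + 3 \ln l \right)}} \left[ \left( \frac{p_a}{p} \right)^{p} \left( \frac{q_a}{q} \right)^{q} \right]^{l} K(l, \tau), \qquad \forall \tau: \lim_{l \to \infty} K(l, \tau) = 1, \tag{14}$$

where $q_a = 1 - p_a$.

Proof: We may assume that $p_a < p$. Similarly to the proof of Theorem 2, the Chernoff bound is

$$P_M \leq \inf_{s > 0} \left\{ \exp(-s(p - b)) \left( p_a e^s + q_a \right) \right\}^l.$$

Minimizing over $s$ gives

$$P_M \leq \left( \frac{p_a (q + b)}{q_a (p - b)} \right)^{l (p - b)} \left( \frac{q_a}{q + b} \right)^{l}$$

or

$$\frac{\ln P_M}{l} \leq \ln \frac{q_a}{q + b} + (p - b) \ln \frac{p_a (q + b)}{q_a (p - b)}
= \ln \frac{q_a}{q} + p \ln \frac{p_a q}{q_a p} - b \ln \frac{p_a q}{q_a p} - \frac{b^2}{2 pq} + O(b^3)$$

using series expansions.

A. Hypothesis testing interpretation

The solution (5) may seem arbitrary, but it has a nice interpretation in terms of hypothesis testing [40]. Return to the solution (2). That solution gives a test for a given $l$. However, the problem is that it does not reconcile tests for different $l$. One way to solve that issue is to consider $l$ a random variable, i.e., introducing a prior distribution in the Bayesian sense. Let the prior distribution of $l$ under hypothesis $H_i$ be $P_L(l \mid i)$. The equation (2) now becomes

$$\begin{aligned}
L &= \log \frac{\prod_{b=0}^{1} \hat{P}(b)^{N(b \mid x)} P_L(l \mid 1)}{\prod_{b=0}^{1} P(b)^{N(b \mid x)} P_L(l \mid 0)} \\
&= l \left( \sum_{b=0}^{1} \hat{P}(b) \log \hat{P}(b) - \sum_{b=0}^{1} \hat{P}(b) \log P(b) \right) + \log P_L(l \mid 1) - \log P_L(l \mid 0) \\
&= l\, D(\hat{p} \,\|\, p) + \log P_L(l \mid 1) - \log P_L(l \mid 0).
\end{aligned}$$

The hypothesis test now is

$$D(\hat{p} \,\|\, p) > \frac{\tau + \log P_L(l \mid 0) - \log P_L(l \mid 1)}{l}. \tag{15}$$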

Of course, the problem is that we do not know $P_L$. Still, compare that with (5) without the approximations,

$$D(\hat{p} \,\|\, p) > \frac{\tau + \log^* l + c + \frac{1}{2} \log l}{l}. \tag{16}$$

To the term $c + \log^* l$ corresponds a distribution on the integers, namely $Q(l)$ in [8, (3.6)]. Except for the term $\frac{1}{2} \log l$, the equations (15) and (16) are identical if we use the prior distribution $P_L(l) = Q(l)$. Rissanen [8] argues that the distribution $Q(l)$ is the most reasonable distribution on the integers when we have really no prior knowledge, mainly from a coding point of view. This therefore seems a reasonable distribution for $P_L(l)$.

What about the term $\frac{1}{2} \log l$? The model for the non-null hypothesis has one unknown parameter, $p$, so that it is more complex than the null hypothesis. We have to account for this additional complexity. Our goal is to find an explanation for atypical sequences among a large class of explanations, not just the distribution of zeros and ones. If there is no penalty for finding a complex explanation, any data can be explained, and all data will be atypical. This is Occam's razor [6]. The penalty for one unknown parameter as argued by Rissanen is exactly $\frac{1}{2} \log l$. We therefore have the following explanation for (5).

Fact 4. The criterion (5) can be understood as a hypothesis test with prior distribution $Q(l)$ [8] and penalty $\frac{1}{2} \log l$ for the unknown parameter.

Seen in this light, Theorem 2 is not surprising. In (5) we have replaced $\log^* l + \frac{1}{2} \log l$ with $\frac{3}{2} \log l$, which implicitly corresponds to the prior distribution $P_L(l) \propto \frac{1}{l^{3/2}}$, which is exactly the distribution seen in (9).

B. Atypical subsequences

One problem where we believe our approach excels is in finding atypical subsequences of long sequences. The difficulty in finding atypical subsequences is that we may have short subsequences that deviate much from the typical model, and long subsequences that deviate little. How do we choose among these? Definition 1 gives a precise answer.

For the formal problem statement, consider a sequence $\{x_n, n = -\infty, \ldots, \infty\}$ from a finite alphabet $\mathcal{A}$, where in this section $\mathcal{A} = \{0, 1\}$. The sequence is generated according to a probability law $P$, which is known.
In this sequence are embedded (infrequent) finite subsequences $\mathcal{X}_i = \{x_n, n = n_i, \ldots, n_i + l_i - 1\}$ from the finite alphabet $\mathcal{A}$, which are generated by an alternative probability law $P_\theta$. The probability law $P_\theta$ is unknown, but it might be known to be from a certain class of probability distributions, for example parametrized by the parameter $\theta$. Each subsequence $\mathcal{X}_i$ may be drawn from a different probability law. The problem we consider is to isolate these subsequences, which we call atypical subsequences. In this section, as above, we will assume both $P$ and $P_\theta$ are binary iid.

The solution is very similar to the one for variable length sequences above. The atypical subsequences are encoded with the universal source coder from [6] with a code length $L_a(l) = l H(\hat{p}) + \frac{1}{2} \log l$. The start of the sequence is encoded with an extra symbol '.', which has a code length $-\log P(\text{'.'})$, and the length is encoded in $\log^* l$ bits. In conclusion we end up with exactly the same criterion as (5), repeated here:

$$D(\hat{p} \,\|\, p) > \frac{\tau + \frac{3}{2} \log l}{l}. \tag{17}$$
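
To make the trade-off between short, strongly deviating subsequences and long, weakly deviating ones concrete, the criterion $D(\hat{p} \,\|\, p) > (\tau + \frac{3}{2} \log l)/l$ can be applied to every window of a sequence. The brute-force $O(n^2)$ scan below is only an illustrative sketch and not the search procedure of the paper; ranking windows by the quantity $l D(\hat{p} \,\|\, p) - \tau - \frac{3}{2} \log l$ (the code length saved) is an assumption made here for the example, and the function names are hypothetical.

```python
from math import log2


def kl_bits(p_hat, p):
    """Relative entropy D(p_hat || p) in bits for binary distributions, 0 < p < 1."""
    d = 0.0
    if p_hat > 0.0:
        d += p_hat * log2(p_hat / p)
    if p_hat < 1.0:
        d += (1.0 - p_hat) * log2((1.0 - p_hat) / (1.0 - p))
    return d


def atypical_windows(bits, p, tau):
    """Return (saving, start, length) for every window meeting the criterion."""
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)  # prefix sums give window counts in O(1)
    hits = []
    n = len(bits)
    for start in range(n):
        for length in range(1, n - start + 1):
            ones = prefix[start + length] - prefix[start]
            saving = length * kl_bits(ones / length, p) - tau - 1.5 * log2(length)
            if saving > 0.0:
                hits.append((saving, start, length))
    return hits


# A run of 40 ones embedded in an alternating background, fair-coin typical model.
sequence = [1, 0] * 40 + [1] * 40 + [0, 1] * 40
hits = atypical_windows(sequence, p=0.5, tau=5.0)
best_saving, best_start, best_length = max(hits)
```

In this example the window with the largest saving is exactly the embedded run (start 80, length 40), and every flagged window overlaps it. Note that the alternating background is not flagged, because the binary iid alternative class only measures the fraction of ones.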

The only difference is that $\tau$ has a slightly different meaning.

For the subsequence problem, a central question is what the probability is that a given sample $x_n$ is part of an (intrinsically) atypical subsequence. Notice that there are infinitely many subsequences that can contain $x_n$, and each of these has a probability of being atypical given by Theorem 2. We can obtain an upper bound as follows. Let us say that $X_n$ has been determined to be part of an atypical sequence $\mathcal{X}_i$. It is clear that the sequence $\mathcal{X}_i$ must also be atypical according to (17). Therefore, we can upper bound the probability $P_A(X_n)$ that $X_n$ is part of an atypical sequence with the probability of the event (17), using the approximate criterion (6),

$$\exists\, n_1 \leq n < n_1 + l: \quad \frac{\left| \sum_{i=n_1}^{n_1 + l - 1} (X_i - p) \right|}{\sqrt{l\, pq}} > \sqrt{2 \tau \ln 2 + 3 \ln l}.$$

We can rewrite this as

$$\exists\, n_1 \leq n < n_1 + l: \quad \left| \sum_{i=n_1}^{n_1 + l - 1} (X_i - p) \right| > \sqrt{l\, pq \ln 2 \left( 2 \tau + 3 \log l \right)}.$$

We could upper bound this with a union bound using Theorem 2. However, it is quickly seen that this does not converge. The problem is that the events in the union bound are highly dependent, so we need a slightly more refined approach; this results in the following theorem.

Theorem 5. Consider the case $p = \frac{1}{2}$. The probability $P_A(X_n)$ that a given sample $X_n$ is part of an atypical subsequence is upper bounded by

$$P_A(X_n) \leq \left( K_1 \sqrt{\tau} + K_2 \right) 2^{-\tau} \tag{18}$$

for some constants $K_1$, $K_2$.

Proof: Without loss of generality we can assume $n = 0$. For some $l_0 > 0$ let $I_0$ be the set of subsequences containing $X_0$ of length at most $l_0$. For $i \in I_0$ let $l_i$ be the length of the subinterval. From Theorem 2 we know that $P_A(l_i) \leq 2^{-\tau + 1} \frac{K(l_i, \tau)}{l_i^{3/2}}$ and therefore

$$\sum_{i \in I_0} P_A(l_i) \leq K 2^{-\tau}$$

for some constant $K$. This argument does not work if we allow arbitrarily long subsequences, because the sum is divergent. However, we can write

$$P_A(X_0) \leq \sum_{i \in I_0} P_A(l_i) + P_{A, l_0}(X_0) \leq K 2^{-\tau} + P_{A, l_0}(X_0),$$

where $P_{A, l_0}(X_0)$ is the probability that $X_0$ is in an atypical subsequence of at least length $l_0$. The proof will be to bound $P_{A, l_0}(X_0)$.

15 5 Define the foowing events An, = An, = { n+ } X i p > pq n τ + 3 og i=n { n+ } X i p < pq n τ + 3 og For p = we can rewrite n+ For ease of notation define i=n X i p > pq n τ + 3 og i=n n + = X i > n τ + 3 og 9 i=n υ = n τ + 3 og Then using the union bound we can write P A X 0 = = n = + P n = n = n = n = P P = n + = n + P = n + An, = n + An, An, A c n, P A c n, A c n, = n + =n n + where we have excuded the ength one sequence consisting of X 0 itsef. Now consider P A c n, = n + Ac n, = P An, = n + Ac n,. We can think of S = n + i=n X i as a simpe random [4], and we wi use this to upper bound the probabiity P An, = n + Ac n,. This probabiity can be interpreted as the probabiity that the random wak passes υ given that it was beow υ at times n < <. But since the random wak can increase by at most one, and since the threshod is increasing with, that means that at time we must have S = υ. Furthermore, it is easy to see that the probabiity is upper bounded by the probabiity that S = υ

16 6 given that the random wak is beow υ at times n < <. Thus P An, A c n, = n + P S = υ S < υ, n < < = P S = υ, S < υ, n < < P S < υ, n < < P S = υ, S < υ, n < < P S < υ, 0 < The denominator can be interpreted as the probabiity that the maximum of the random wak stays beow υ, which by Theorem can be expressed by P D = P S < υ, 0 < = P S υ + P S = υ τ+c 3/ 0 for τ and sufficienty arge, and where c is some constant. Since, as discussed at the start of the proof, we can assume that 0, we can choose 0 arge enough that this is satisfied; furthermore, since P N is increasing in τ, we can choose 0 independent of τ as ong as τ is sufficienty arge. We wi next upper bound the numerator in 0. This is the probabiity that we have a path that has stayed beow υ at steps n < <, but then at step hits υ. We wi count such paths. We divide them into two groups that we count separatey. The first group are a paths that start at zero and hit υ first time after steps. The second group is more easiy described in reverse time. Those are paths that start at υ at step, then stay beow υ unti time ñ < 0, when they hit υ again, and finay hit 0 at time n. According to [4, Section 3.0] we can count a these paths by N = υ n + N 0, υ + t N t, 0N t 0, υ t=υ Where N n a, b are the number of ength n paths between a and b. We need to upper bound the probabiity P S n = k that a path starting a 0 hits k after n steps. We use [4,

17 7 Section 3.0] and [6, 3.] to get P S n = k =N n 0, k n n = n n + k n n + kn knh = π 4 4n πn k nh We can bound the power of the exponent to as foows Thus, where e x = x. n + k nh n =n H + k n n n k n = k n n n+k n n n+k n n n P S n = k πn e n k n We wi use this to bound the probabiity of set of paths in the second term in. We can bound P n, = n + t=υ n + t=υ t N t, 0N t 0, υ t π t e n t N t 0, υ t n 4 + P S π + n 3/ t = υ Here the sum n + t=υ P S t = υ when ooked at in reverse time can be interpreted as the probabiity of a path t=υ starting at υ hits zero before time n +. We can the write this as See [4, Section 3.0] n + t=υ P S t = υ = P M n+ υ P S n+ υ

18 8 We can use the proof of Theorem, specificay 3 to bound this by P M n+ υ exp υ n + Then P n, K exp υ n + + n 3/ n + We wi next bound the probabiity of the paths in the first term in. We have 4 P S = υ π υ e υ n 4 e υ π υ n = π υ τ and Thus P = υ P S = υ n τ + 3 og π 8τ n τ 5/ π υ n + π υ υ 4 π τ 5/ τ n K = n + = n + n = n +. =S n, τ P A c n, = n + P P n, P D P P n, A c n,

19 9 and where K > 0 is some constant. P A X 0 K n = e S n,τ n = = n + + K First we evauate the sum of P. The term υ P n = = n + P n, 3 is decreasing in, so for sufficienty arge, υ. We can evauate the sum separatey for 0 and for >. Convergence depends ony on the atter tai. The threshod is increasing with τ. If for exampe we put = 8τ n, i.e., proportiona to τ, we have υ for τ > 0. Therefore For > we can write Then for n + > = 0 P Kτ τ 6τ n 4 n 4 P = τ 5/ + π π π τ 5/ =k K = n + P k τ nn n n k τ erfc n n where k i > 0 are some constants and where we have used 5/ x 5/ dx = 3k 3/ + k 3 τ τ n 3 4 k n 5/ n xx 5/ dx =k = 9 k 6πerfc [ 3 n k ] + 6 n k k 3/ as it can be verified that a three sums, when 4 is inserted in 3, are convergent, using fxdx. k= fk f +

We bound the second sum in 3, n = = n + = n = = n + We can ignore the small constants and write P = = P n, n = = n + n = = n + t 8 n + exp υ π + n 3/ n + 8 exp υ n π + n 3/ n 8 n π + n 3/ τ n 3 n 8 t τ t π t 3/ 3 t ddt 8 t = πt 3/ 8 = π = t 3 dt 8 π 8 = 3 π = τ 8 3 π 3/ τ 3 t 3 td dt 3/ τ 3 t 3 d dt 3/ τ 3 d 3/ τ 3 d 3/ τ 3 d The remaining integral is clearly convergent, and decreasing in $\tau$. Therefore $P_2 \le K 2^{-\tau}$.

There are two important implications of Theorem 5. The first is that for $\tau$ sufficiently large, $P_A(X_n) < 1$, and in fact $P_A(X_n)$ can be made arbitrarily small for large enough $\tau$. This is an important theoretical validation of Definition 1 and the resulting criteria (5) and (6). If the theory had resulted in $P_A(X_n) = 1$, then everything would be atypical, and atypicality would be meaningless. That this is not trivially satisfied is shown by Proposition 6 just below. What that Proposition says is that if in the above equation instead of $\frac{3}{2}\log l$ we had had $\frac{1}{2}\log l$, then everything would have been atypical. Now, $\frac{1}{2}\log l$ corresponds to forgetting that the length of an atypical sequence also needs to be encoded for the resulting sequence to be decodable. Thus, it is the strict adherence to decodability that has led to a meaningful criterion. So, although decodability at first seems unrelated to detection, it turns out to be of crucial importance. Similarly, at first the term $\frac{3}{2}\log l$ may have seemed arbitrary. However, it is just within a margin sufficient to ensure that not everything becomes atypical.

The second important implication of Theorem 5 is that it validates the meaning of $\tau$. The way we introduced $\tau$ was as the number of bits needed to encode the fact that an atypical sequence starts, and therefore we should put $\tau = -\log P(\text{atypical sequence starts})$. Theorem 5 confirms that $\tau$ has the desired meaning for purely random sequences. The reason this is not trivial is that $\tau$ was chosen from the probability of an atypical sequence, while Theorem 5 gives the probability of a sample being atypical.
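The role of the coefficient on $\log l$ can be explored numerically. The following is a minimal brute-force sketch in Python, assuming $p = \frac{1}{2}$ so that $D(\hat p \| \frac{1}{2}) = 1 - H(\hat p)$, and assuming base-2 logarithms. The function names, the parameter `alpha` (the coefficient on $\log l$, with `alpha = 1.5` corresponding to (7)), and the default sizes are hypothetical choices for illustration; windows of length one are excluded, as in the proof above. This is not the implementation used for the figures in the paper.

```python
import math
import random


def binary_entropy(q: float) -> float:
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def in_atypical_subsequence(bits, n, tau, alpha=1.5):
    """True if some window bits[a:b] with a <= n < b and length >= 2
    satisfies 1 - H(p_hat) > (tau + alpha * log2(l)) / l  (case p = 1/2)."""
    prefix = [0]  # prefix[i] = number of ones in bits[:i]
    for b in bits:
        prefix.append(prefix[-1] + b)
    for a in range(n + 1):
        for b in range(max(n + 1, a + 2), len(bits) + 1):
            l = b - a
            p_hat = (prefix[b] - prefix[a]) / l
            if 1.0 - binary_entropy(p_hat) > (tau + alpha * math.log2(l)) / l:
                return True
    return False


def estimate_probability(tau, alpha, length=101, trials=200, seed=0):
    """Monte Carlo estimate of the probability that the middle sample of a
    uniform iid binary sequence lies in some atypical window."""
    rng = random.Random(seed)
    n = length // 2
    hits = 0
    for _ in range(trials):
        bits = [rng.getrandbits(1) for _ in range(length)]
        hits += in_atypical_subsequence(bits, n, tau, alpha)
    return hits / trials
```

Because lowering `alpha` only lowers the threshold, the estimate for a fixed seed cannot decrease when `alpha` is reduced. A finite sequence can of course only hint at the infinite-length behaviour that the theorem and proposition describe.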

21 Proposition 6. Consider the case p =. Suppose instead of 7 we use the criterion i= X i p 4 > τ n + α n 5 with α = 3 giving 7. Then if α, the probabiity that a given sampe X n is part of an atypica subsequence is P A X n =. Proof: We can assume that n = 0. We wi continue with the random wak framework from the proof of Theorem 5. Define the event and Then Ā = A = { 0 i= + { 0 i= + X i > n τ + α og X i < n τ + α og υ = n τ + α og P A X 0 P Ā A =0 Namey, we decare that X 0 is atypica if it is the endpoint of an atypica sequence {x[ ], x[ + ],..., x[0]} for some. Ceary, X 0 coud be the start or midpoint of an atypica sequence, so this a rather oose ower bound. Now we can write Consider the probabiity P P Ā A =0 =0 = P Ā c =0 =0 A c =0 = P Ā c A c Ā c k =0 k=0 k=0 [ = P Ā A Ā c k =0 k=0 A c k Ā A k=0 Āc k k=0 Ac k. The ony way the conditiona event can happen is if k=0 A c k } } ]

22 S = υ and X = or S = υ + and X =. Here we have Here P S = υ = N 0, υ + = + + υ H + υ υ = H υ H +υ + H = H n + υ + +υ + +υ + + υ υ 3 υ + o υ 3 + ɛ υ + υ + + υ3 ɛ = υ n n = υ n + υ3 ɛ υ υ3 n Where the ast inequaity is ony true for sufficienty arge, as for some 0 we have > 0 : ɛ <. Then P S = υ τ α υ 3 = τ α+ υ3 And n P Ā A n = =0 =0 τ α+ υ3 τ α+ υ3 =0

Here im υ 3 = 0. So, for example, for $l$ sufficiently large, υ3. Then n P Ā A = 0 =0 τ α+ This is divergent for $\alpha \le \frac{1}{2}$, proving that $P\left(\bigcup_{l=0}^{\infty} \bar A_l \cup A_l\right) = 1$.

Theorem 5 states that for $\alpha = \frac{3}{2}$, $P_A(X_n) < 1$ (convergence), while Proposition 6 shows that for $\alpha = \frac{1}{2}$, $P_A(X_n) = 1$ (divergence). There is a gap between those values of $\alpha$ that is hard to fill in theoretically. We have therefore tested it numerically, see Fig. 2. Of course, testing convergence numerically is not quite well-posed. Still, the figure indicates that the phase transition between divergence and convergence happens right around $\alpha = 1$.

[Figure 2: probability of atypicality of the same sample $X_n$ for different values of $\tau$ (including $\tau = 0.5$), when length $= 10^5 + 1$ and $n = \text{length}/2$ (at the middle). Horizontal axis: $\alpha$; vertical axis: probability.]

Fig. 2. Transition between divergence and convergence as a function of $\alpha$.

C. Recursive coding

Instead of using Definition 1 directly, we could approach the problem as follows. First the sequence is encoded with the typical code. Now, if the distribution of the sequence is in agreement with the typical code, the result should be a sequence of iid binary bits with $P(X_i = 1) = \frac{1}{2}$ [6], i.e., a purely random sequence; and this sequence cannot be further encoded. We can now check whether we can further encode the sequence with a universal code. If so, we categorize the sequence as atypical. Let $\tilde l$ be the length of the sequence after typical coding. In (4) the typical and atypical codelengths are therefore

$$L_t = \tilde l - \log(1 - \epsilon)$$
$$L_a = \tilde l\, H(\hat{\tilde p}) + \frac{3}{2}\log\tilde l - \log\epsilon + \tau \qquad (26)$$

Here $\hat{\tilde p}$ is the estimated $p$ for the encoded sequence. Now

$$\tilde l = -l\left(\hat p\log p + (1 - \hat p)\log(1 - p)\right) \qquad (27)$$
$$\tilde l\, H(\hat{\tilde p}) \approx l\, H(\hat p) \qquad (28)$$
$$\log\tilde l = \log l + \log\left(-\hat p\log p - (1 - \hat p)\log(1 - p)\right) \qquad (29)$$
$$\approx \log l \qquad (30)$$

The argument for (28) is as follows, without doing detailed calculations: if we encode a sequence with a wrong code and then later re-encode with the correct code for the induced statistic, the result is the same as originally encoding with the correct code. Thus the criteria (4) and (26) are approximately equivalent. We can state this as follows.

Proposition 7. Definition 1 can be applied to encoded sequences instead of the original data.

This of course ignores all integer constraints, block boundaries, etc. But the importance of this statement is that it is sometimes easier to operate on partially encoded sequences, simply because the amount of data has already been reduced and the problem has been standardized. We do not need to know the typical codebook or even the model of typical data, since everything under the typical model has been reduced to a stream of iid binary digits, and atypicality algorithms can therefore be applied to data streams without knowledge of what the original data is. It also means that theoretical results such as Theorem 5, where we assume typical data is iid uniform, have general applicability.

However, first encoding the sequence and then doing atypicality detection also has disadvantages in a practical, finite length setting. Atypical subsequences become embedded in typical sequences in unpredictable ways. For example, it could be difficult to determine where exactly an atypical sequence starts and ends. Our practical implementation therefore uses Definition 1 directly.

IV. GENERAL CASE

Return to the problem considered at the start of Section III, where we are given a sequence $x$ of fixed length $l$ and we need to determine if it is atypical. In the iid case this is a simple hypothesis test problem, and the solution is given there. In the general case we would like to find alternative explanations from a large abstract class of models.
The issue is that it is often possible to fit an alternative model very well to the data if we just allow complex enough models (the well-known Occam's razor problem [6]). Rissanen's MDL [4], [8], [9] is a solution to this problem. Therefore, in the general case, even for fixed length sequences, the problem is not a straightforward hypothesis test problem, and we have to resort to information theory.
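The trade-off between fit and model complexity can be made concrete with a small Python sketch. It is an assumption-laden illustration rather than the paper's method: the model class is restricted to fixed-order binary context models (a special case of finite state machines), probabilities are estimated with the sequential KT estimator, the first `order` bits are assumed to be sent uncoded at one bit each, and the model-index cost `log2(order + 1) + c` is a hypothetical stand-in for a description-length penalty. All function names are illustrative.

```python
import math
from collections import defaultdict


def kt_code_length(bits, order):
    """Code length in bits of `bits` under a sequential KT estimator whose
    state is the previous `order` bits.

    Assumption: the first `order` bits are sent uncoded (1 bit each).
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [zeros seen, ones seen]
    total = float(order)
    for i in range(order, len(bits)):
        ctx = tuple(bits[i - order:i])
        c = counts[ctx]
        b = bits[i]
        # KT estimate: P(next = b) = (count_b + 1/2) / (count_0 + count_1 + 1)
        total += -math.log2((c[b] + 0.5) / (c[0] + c[1] + 1.0))
        c[b] += 1
    return total


def mdl_best_order(bits, max_order, c=1.0):
    """Pick the context order minimizing data cost plus a model-index cost.

    The index cost log2(order + 1) + c is a hypothetical penalty.
    """
    costs = {
        k: kt_code_length(bits, k) + math.log2(k + 1) + c
        for k in range(max_order + 1)
    }
    best = min(costs, key=costs.get)
    return best, costs
```

On a strictly alternating sequence such as `[0, 1] * 100`, the order-0 (iid) model needs roughly one bit per symbol, whereas an order-1 model predicts each bit almost for free; higher orders fit equally well but pay a larger penalty, so `mdl_best_order` selects order 1.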

A. Finite State Machines

One possible class of models in the general case is the class of finite state machines (FSM). Rissanen [7] defines the complexity of a sequence $x$ in the class of FSM sources by

$$I(x) = \min_j\left\{-\log\hat P(x|f_j) + \log j + c\right\} \qquad (31)$$

where $f_1, f_2, \ldots$ is a sequence of state machines, and where we have used $\hat P(x|f_j)$ to emphasize that the probability is estimated. Rissanen uses Laplace's estimator, but the KT-estimator [43], [6] could also be used. Except for integer constraints, this is a valid descriptive length, and can therefore be used in Definition 1. This is a natural extension of the iid case considered in Section III. As opposed to Kolmogorov complexity, this complexity could actually be calculated, although at high computational cost. Because of that cost, it is mostly useful for theoretical considerations, and one result is the following generalization of Theorem 2.

Theorem 8. Assume that the typical distribution is iid uniform. If the atypical descriptive length is given by (31) with a maximum number of states independent of $l$, the probability of an intrinsically atypical sequence $P_A(l)$ satisfies

$$\lim_{l\to\infty}\frac{\ln P_A(l)}{-\frac{3}{2}\ln l} = 1 \qquad (32)$$

Proof: Since we consider all state machines with the number of states up to a certain maximum, this must also include the state machine with a single state. This is equivalent to the iid model in Section III, and we therefore get the lower bound in (32). The proof will be to upper bound the probability. As in Section III we use $\log l + \tau$ bits to indicate beginning and end of atypical sequences. The probability that a sequence $x$ is atypical therefore is

$$P_A(l) = P\left(I(x) + \log l + \tau < l\right) = P\left(\bigcup_{f_j}\left\{-\log\hat P(x|f_j) + \log j + c + \log l + \tau < l\right\}\right) \le \sum_{f_j} P\left(-\log\hat P(x|f_j) + \log l + \tau < l\right)$$

We will prove that $P\left(-\log\hat P(x|f_j) + \log l + \tau < l\right) \le K_j\, l^{-(k+2)/2}$ for constants $K_j$, with $k$ the number of states in the state machine. Since the slowest decay dominates, we get the upper bound for (32).
For a fixed state machine $f$ the code length according to [7, 3.6] is

$$L(x|f) = \sum_s \left[\log\binom{n_s(x)}{n_s^0(x)} + \log\left(n_s(x) + 1\right)\right]$$

where $n_s(x)$ denotes the number of occurrences of state $s$ in $x$ and $n_s^0(x)$ the number of times the next symbol

26 6 is 0 at this state. Further, from [6, 3.] Lx f n s x n0 s x H n s s x og n sx og 8 n 0 sx n s x n s x n s x + ogn s x + 33 We want to upper bound the probabiity of the event Lx f + og + τ <. We can write ogn s x + ogn sx = og + og ns x + n s x. Let rx = s n s x n0 s x H n s x and et Rx be the remaining sma terms in 33 dependent on x, Rx = og 8 n 0 sx n s x n s s x n s x ns x + + og n s x. Then we have to upper bound notice that s n sx =, P rx Rx τ + k + og The Chernoff bound is or where P rx Rx τ + k + og exp tτ + k + og Mt n P rx Rx τ + k + og tτ + k + og + n Mt Mt = E [ exp trx + Rx ] In order to get a vaid bound, we need to show that Mt < K < independent of for t < n. Now it s easy to see that exp trx K < for a t and. So, we have to show E [ exp trx ] K <

27 7 We have to show that this is true for a state machines in the cass of finite state machines with k states, which can be done by showing max FSM with k states E [ exp trx ] K It turns out it easier to prove this if we expand the cass over which we take the maximum, and ceary expanding the cass does not decrease the maximum. A FSM with k states is a function fx {,..., k} that satisfies that if fx m = f x m = s then fx m b = f x m b for any bit b [7], i.e., if the FSM is in state s after m steps, the next state transition is ony dependent on the next bit, not how it got to state s. We extend the cass by dispensing with this requirement. We can then describe the program we run as foows. Based on x m we choose a state s m {,..., k} without having any knowedge about x m+, except that it is independent and uniformy distributed by the assumption on the typica distribution. We can think of this sighty differenty. The program puts x m+ into bucket s {,..., k} and updates n s m and n i s m, in order to maximize E [ exp trx ]. It does so based on past data x m. Now, as opposed to the state machine setup, the choice of s m in no way restricts the choices of states or buckets s n, n > m. Since the program has no knowedge of x m+ the program cannot optimize s m based on the vaues of x m. Rather, it is sufficient to ook at n s m. It is now easy to see that the worst case is obtained if the bits are distributed eveny in the states. Thus, the worst case of rx is rx = n0 s x H k /k s where the n 0 s x are independent of s. Thus, the probem is reduced to the case of a singe state, which is showing that Here we have [ E exp t H [ E exp t H = k=0 = + ] k K < 34 ] k t n H k k / k= t n H k k / + t n H k H k πk k k= / = + t n H k πk k k= / + t n k n πk k k= where we have used [6, 3.] and. The sum is actuay decreasing as a function of, but this seems hard to prove. Instead we upper bound the sum by

/ k= / t n n k 4k Here we can upper bound πk k k + π / t n n k πk k πk k dk for k. Then t n k n πk k dk / / / = K + K t n n x 4x t n n x 4x dx π + dx π for some constants $K_1, K_2$, using Gaussian moments. This proves the bound.

[Figure 3: plot titled "The probability of intrinsically atypical sequence", with curves "Probability" and "Upper Bound". Vertical axis: $P_A(l)$ (logarithmic scale); horizontal axis: Length.]

Fig. 3. Probability of an intrinsically atypical sequence. The typical distribution is iid uniform, and for detection of atypical sequences the CTW algorithm has been used (Section IV-B).

While the Theorem is for the typical model iid uniform, as outlined in Section III-C in principle it also applies to general sources, since we can first encode and then look for atypical sequences. The theorem shows that looking for more complex explanations for data does not essentially increase the probability of intrinsically atypical sequences. Fig. 3 (compare with Fig. 1) confirms this experimentally. The

atypical detection is based on CTW, which, as explained in Section IV-B below, is a good approximation of FSM modeling. On the other hand, if one of the FSM models does in fact fit the data, the chance of detecting the sequence is greatly increased, although this is hard to quantify. If we think of intrinsically atypical sequences as false alarms, this shows the power of the methodology.

Since FSM sources have the same $P_A(l)$ as in the iid case, it seems reasonable to conjecture that Theorem 5 is still valid, that is, $P_A(X_n) < 1$ for sufficiently large $\tau$, which is clearly an essential theoretical property of atypicality. However, as Theorem 5 does not follow directly from Theorem 2, verifying the conjecture requires a formal proof, which we do not have at present.

B. Atypical Encoding

In terms of coding, Definition 1 can be stated in the following form:

$$C(x|P) - C(x) > 0$$

Here $C(x|P)$ is the code length of $x$ encoded with the optimum coder according to the typical law, and $C(x)$ is the code length of $x$ encoded "in itself." As argued in Section III, we need to put a "header" in atypical sequences to inform the decoder that an atypical encoder is used. We can therefore write $C(x) = \tau + \tilde C(x)$, where $\tau$ is the number of bits for the header, and $\tilde C(x)$ is the number of bits used for encoding the data itself.

For encoding the data itself an obvious solution is to use a universal source coder. There are many approaches to universal source coding: Lempel-Ziv [6], [44], [45], Burrows-Wheeler transform [46], prediction by partial matching (PPM) [47], [48], or T-complexity [49], [6], [50], [51], [52], [53], [54], and any one of them could be applied to the problem considered in this paper. The idea of atypicality is not linked to any particular coding strategy. In fact, a coding strategy does not need to be decided. We could try several source coders and choose the one giving the shortest code length; or they could even be combined as in [55]. However, to control complexity, we choose a single source coder. The most popular and simplest approach to source coding is perhaps Lempel-Ziv [6], [44], [45].
The issue with this is that while it is optimum in the sense that $\limsup_{l\to\infty}\frac{C(x)}{l} = H(X)$ with probability one, the convergence is very slow. According to [56], $E\left[\frac{C(x)}{l}\right] - H(X) \sim \frac{1}{\log l}$ while $\mathrm{var}\left[\frac{C(x)}{l}\right] \sim \frac{1}{l}$. Thus, Lempel-Ziv is poor for short sequences, which is exactly what we are interested in for atypicality. We have therefore chosen to use the Context Tree Weighting (CTW) algorithm [43]. The CTW approach has some advantages in our setup: it is a natural extension of the simple example considered in Section III, it allows estimation of code length without actually encoding, and there is flexibility in how to estimate probabilities. Importantly, it can be seen as a practical implementation of the FSM based descriptive length used in Section IV-A.

C. Typical Encoding and Training

In Definition 1 and the example in Section III we have assumed that the typical model of data is exactly known. If that is the case, typical encoding is straightforward, using for example arithmetic coding (notice that we just need the codelength, which can be calculated for arithmetic coding without actually encoding). However, in many cases
