Draft, February 1999

Novelty Detection using Extreme Value Statistics

Stephen J. Roberts
Department of Electrical & Electronic Engineering
Imperial College of Science, Technology & Medicine
London, UK
s.j.roberts@ic.ac.uk

February 23, 1999

Abstract

Extreme value theory (EVT) is a branch of statistics which concerns the distributions of data of unusually low or high value, i.e. in the tails of some distribution. These extremal points are important in many applications as they represent the outlying regions of normal events against which we may wish to define abnormal events. In the context of density modelling, novelty detection, or radial-basis function (RBF) systems, points which lie outside the range of expected extreme values may be flagged as outliers. Whilst there has been interest in the area of novelty detection, decisions as to whether a point is an outlier or not tend to be made on the basis of exceeding some (heuristic) threshold. In this paper it is shown that a more principled approach may be taken using extreme value statistics.

1 Introduction

Extreme value theory (EVT) concerns the distributions of data of abnormally low or high value in the tails of some data-generating distribution. As we observe more data points, the statistics of the largest (or smallest) values we observe change. Knowledge of these statistics may be useful in a number of areas, for example in detecting events which lie outside our expected data extremes (outlier detection), or in rejecting classification or regression on data which lies far from the expected statistics of some training data set.

Novelty, outlier or abnormality detection is a method of choice in the case where the usual assumption of supervised classification breaks down. This assumption implies that the data set contains viable information regarding all classification hypotheses. It will break down under two major conditions:

1. There are very few examples of an important class within the data set. This is often the case in medical data or in condition monitoring, when data from `normal' system states are far easier and less costly to acquire than abnormal exemplars (patients with rare conditions, for example). In a classification system it becomes increasingly difficult to make decisions into this class as the number of examples reduces. It is argued that, in such a case, it is better to formulate a model of the `normal' data and test for `novel' data against this model [1, 2].

2. The data set contains information regarding all classes in the classification set, but the classification set itself is incomplete, i.e. data from some extra class is observed at a future point. Such a classification system will classify this `novel' data erroneously into one of the classes in the classification set.

For the first of these conditions, furthermore, this paper distinguishes between two sub-cases:

1a. A data set exists which is known to be representative of `normal' data. No occurrences of `abnormalities' are believed to be present.

1b. The data set is believed to consist of a number of examples of `normal' data together with a smaller, but unknown, number of `abnormal' data.

The first of these cases is, of course, simpler. A representation of the `normal' data may have any desired complexity (one can imagine a crude representation based on a single normal distribution, or a Parzen-windows approach in which the representation is based upon each datum in the set). If the same approach is applied to case 1b, the representation runs the risk of fitting abnormal data and hence making its subsequent detection more difficult. It is desirable, therefore, to impose some form of model-complexity penalty. This scheme may be used to form representations for case 1a as well. This issue has been addressed previously in the context of novelty detection by using a cross-validation heuristic [2]. More principled methods exist, however, and these are detailed in subsection 2.3.

These generic issues of novelty detection have been dealt with previously in several ways. Typical methods develop a semi-parametric representation using, for example, a Gaussian mixture model (GMM) against which data are assessed. Some methods rely upon heuristic approaches, using the maximum (un-normalised) Gaussian response as a measure of novelty [1, 2], creating an artificial `novelty' class [3], or utilising thresholds on a density estimate of the `normal' data [4].

2 Theory

Extreme-value theory forms representations for the tails of distributions. When discussing the properties of the tails of a distribution we will, for convenience, discuss the right-hand tail.

The same theory applies, with minor modification, to the left-hand tail.

Consider a set Z_m = {z_1, z_2, ..., z_m} of m i.i.d. (independent & identically distributed) random variables z_i ∈ R drawn from a distribution function D, and define x_m = max(Z_m), i.e. the largest element observed in m samples of z_i. In pioneering work on classical extreme-value theory, Fisher & Tippett (1928) [5] (also see [6, 7, 8, 9, 10]) showed that if the distribution function over x_m, H, is to be stable in the limit m → ∞ then it must weakly converge under a positive affine transform of the form

    x_m ≐ σ_m x + μ_m                                            (1)

for appropriate norming parameters σ_m ∈ R+ (the scale parameter) and μ_m ∈ R (the location parameter). The notation ≐ denotes a (weak) convergence of distributions rather than a strict equality. Fisher & Tippett proved furthermore that the normalised form

    H(x_m ≤ x) ≐ H(σ_m^{-1}(x - μ_m))                            (2)

is the only form for the limit distribution of maxima from any D. The second key theorem in extreme-value statistics is referred to as the Fisher-Tippett theorem [5] and states that if H is a non-degenerate limit distribution for normalised maxima of the form σ_m^{-1}(x_m - μ_m), then H takes one of only three forms. Denoting the reduced variate y_m = σ_m^{-1}(x_m - μ_m), these three forms are:

    H_1(y_m)    = exp(-exp(-y_m))

    H_2(y_m; α) = 0                    if y_m ≤ 0
                  exp(-y_m^{-α})       if y_m > 0                (3)

    H_3(y_m; α) = exp(-(-y_m)^α)       if y_m ≤ 0
                  1                    if y_m > 0

for α ∈ R+. These three forms are respectively referred to as the type I or Gumbel, type II or Fréchet and type III or Weibull distributions. They may also be regarded as special cases of the generalised extreme-value (GEV) distribution,

    H_GEV(y_m; ξ) = exp{-[1 + ξ y_m]^{-1/ξ}}                     (4)

for some ξ ∈ R (the shape parameter). This remarkable result is the equivalent, for maxima of random variables, of the central limit theorem for sums of random variables. As m → ∞ the distribution of maxima from any D lies in the domain of attraction of one of the three limit forms.
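As an editorial aside (not part of the original paper), the relationship between the GEV form of Equation 4 and the Gumbel limit can be checked numerically. The short Python sketch below evaluates H_GEV for a small shape parameter ξ and compares it with H_1; the function names and values are illustrative only.

```python
import numpy as np

def gumbel_cdf(y):
    """Type I (Gumbel) limit form, H_1(y) = exp(-exp(-y))."""
    return np.exp(-np.exp(-y))

def gev_cdf(y, xi):
    """Generalised extreme-value CDF of Equation 4 for the reduced variate y.

    For xi -> 0 this converges to the Gumbel form; the support is restricted
    to 1 + xi * y > 0, with H = 0 (xi > 0) or H = 1 (xi < 0) outside it.
    """
    if abs(xi) < 1e-12:
        return gumbel_cdf(y)
    t = 1.0 + xi * np.asarray(y, dtype=float)
    inside = np.exp(-np.power(np.maximum(t, 1e-300), -1.0 / xi))
    return np.where(t > 0, inside, 0.0 if xi > 0 else 1.0)

if __name__ == "__main__":
    y = np.linspace(-2.0, 5.0, 8)
    # For small xi the GEV and Gumbel forms should agree closely.
    print(np.max(np.abs(gev_cdf(y, 1e-6) - gumbel_cdf(y))))
```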

This paper is ultimately concerned with samples assumed drawn from D = |N(0, 1)| (i.e. a one-sided normal distribution), whose maxima distribution converges to the Gumbel form [9, 6]. The other two limit forms will not be considered further; the interested reader is pointed to (e.g.) Embrechts et al. [6] for more detailed discussions.

2.1 The Gumbel distribution

The GEV distribution of Equation 4 gives the Gumbel distribution in the limit case of ξ → 0 [8, 9, 10]. The probability of observing an extremum x_m ≤ x is hence given as:

    P(x_m ≤ x | μ_m, σ_m) = exp{-exp(-y_m)}                      (5)

where, as before, y_m is the reduced variate

    y_m = σ_m^{-1}(x_m - μ_m)                                    (6)

For notational simplicity we will henceforth denote P(x_m ≤ x | μ_m, σ_m) = P(x | m). The appropriate norming parameters μ_m, σ_m require estimation, however. In classical extreme-value theory these parameters depend only on the sample size m. In [11] it is proposed that the parameters σ_m and μ_m are coupled via:

    σ_m = σ = c
    μ_m = σ ln(m) + r                                            (7)

in which c and r are constants, so that a plot of μ_m against ln(m) is a straight line. The parameter μ_m may also be experimentally determined from Monte-Carlo simulation, as it represents the expected largest element in m samples drawn from D, i.e. if {z_i} represent samples from the distribution, then:

    μ_m^{expt} = ⟨max{z_1, ..., z_m}⟩                            (8)

A plot of μ_m^{expt} against ln(m) for random variables drawn from |N(0, 1)| (i.e. a one-sided Gaussian) is shown in Figure 1; results were obtained from repeated Monte-Carlo simulations over a wide range of m. Note that the relationship is roughly linear, but that a linear approximation is not appropriate if we wish to describe the relationship over a wide range of m.
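The Monte-Carlo estimate of Equation 8 is straightforward to reproduce. The Python sketch below is illustrative and not taken from the paper; the sample sizes and number of trials are arbitrary choices. It draws samples from |N(0, 1)| and averages the observed maxima for a few values of m.

```python
import numpy as np

def mu_expt(m, n_trials=1000, rng=None):
    """Monte-Carlo estimate of the expected largest element in m samples
    drawn from the one-sided standard normal |N(0,1)| (Equation 8)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.abs(rng.standard_normal((n_trials, m)))
    return samples.max(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for m in (1, 10, 100, 1000):
        # The estimates grow roughly linearly in ln(m), as in Figure 1.
        print(m, mu_expt(m, rng=rng))
```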

Whilst maximum-likelihood estimation of the norming parameters is required for arbitrary distributions, if the analytic form of the observation distribution D is known then μ_m and σ_m may be estimated in closed form. These forms are asymptotically correct as m → ∞. As will be shown, however, they parameterise extreme-value distributions which fit experimental data well even for small m. The calculation of the norming parameters is a standard problem in EVT and the results are quoted in this paper; considerably more detail is available in [6, 8]. The asymptotic form of the location parameter for D = |N(0, 1)| is given as (following, for example, the results for the normal distribution [6, p.145]):

    μ_m = (2 ln m)^{1/2} - (ln ln m + ln 2π) / (2 (2 ln m)^{1/2})          (9)

and the scale parameter as:

    σ_m = (2 ln m)^{-1/2}                                                  (10)

Figure 2 plots the experimental (Monte-Carlo) and theoretical values of σ_m^{-1} and μ_m over a range of m. One standard deviation errors are also shown for the experimental data, estimated over the Monte-Carlo trials. Note that the fit between theoretical and experimental results is good.

If the cumulative distribution is defined via Equation 5 then the associated density function may be written as:

    p(x_m = x | μ_m, σ_m) = (1/σ_m) exp{-y_m - exp(-y_m)}                  (11)

where y_m is defined via Equation 6. Once more, for notational simplicity, p(x_m = x | μ_m, σ_m) is henceforth written simply as p(x | m). Figures 3 and 4 respectively show the experimental (Monte-Carlo) and theoretical P(x | m) and p(x | m) against x for differing values of m. Note that the original distribution was, once more, one-sided normal with zero mean and unit variance. Note also the movement and form of the extreme-value distribution with increasing m. The distributions obtained from Equations 5, 9, 10 and 11 are in good agreement with the experimental results. For the case of m = 1 in Figure 5, however, note that there is an indication that the predicted curve is an underestimate as P(x | m) → 1. It is noted that in the limit m = 1 the resultant probabilities are obtained directly from D = |N(0, 1)|.
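A minimal sketch of the closed-form parameterisation of Equations 5, 6, 9, 10 and 11 is given below (Python; written for this edit rather than taken from the paper, and using the ln 2π form of Equation 9 as reconstructed above). It returns the Gumbel probability P(x | m) and density p(x | m) for maxima of m draws from |N(0, 1)|, and can be checked against Monte-Carlo maxima as in Figures 3 and 4.

```python
import numpy as np

def norming_params(m):
    """Asymptotic location and scale parameters of Equations 9 and 10 (m > 1)."""
    lm = np.log(m)
    mu = np.sqrt(2.0 * lm) - (np.log(lm) + np.log(2.0 * np.pi)) / (2.0 * np.sqrt(2.0 * lm))
    sigma = 1.0 / np.sqrt(2.0 * lm)
    return mu, sigma

def gumbel_P(x, m):
    """Equation 5: probability that the maximum of m draws does not exceed x."""
    mu, sigma = norming_params(m)
    y = (x - mu) / sigma                      # reduced variate, Equation 6
    return np.exp(-np.exp(-y))

def gumbel_p(x, m):
    """Equation 11: the associated extreme-value density."""
    mu, sigma = norming_params(m)
    y = (x - mu) / sigma
    return np.exp(-y - np.exp(-y)) / sigma

if __name__ == "__main__":
    x = np.linspace(0.0, 6.0, 7)
    print(gumbel_P(x, m=100))
    print(gumbel_p(x, m=100))
```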

2.2 The Gaussian mixture model

The Gaussian mixture model (GMM) is widely used in data density estimation and to form the latent (hidden) space of radial-basis function networks. The choice of the Gaussian kernel is not an arbitrary one: Gaussian mixture methods and radial-basis systems using Gaussian bases both have the property of universal approximation, are analytically attractive and are hence favoured in the literature. The form of the GMM is simple, consisting of a set of k = 1, ..., K components, each of which has standard Gaussian form and is uniquely parameterised by its mean and covariance matrix. For such a mixture, Bayes' theorem implies that the data likelihood p(x) may be written as a mixture of component (conditional) likelihoods of the form

    p(x) = Σ_k π(k) p(x | k)                                     (12)

where π(k) are the component priors. If we assume that the statistics of outlying events are dominated by the component closest (in the Mahalanobis sense), then we are interested in the resultant probability of data lying interior to some distance h from the component mean. Integrating the (one-sided) density function gives

    P(h(x)) = ∫_{c_k}^{h(x)} p(w | k) dw                         (13)

where c_k is the mean of the closest component and w is a dummy variable of integration. The more complex case in which x is multidimensional is easily dealt with by evaluating the above equation using the Mahalanobis distance, defined in its standard form:

    h(x)_k = sqrt( (x - c_k)^T C_k^{-1} (x - c_k) )              (14)

where C_k is the covariance matrix and c_k is the centroid. The Mahalanobis distance, evaluated with respect to a Gaussian component, has a density function given by:

    p(h(x)) = sqrt(2/π) exp( -h(x)^2 / 2 )                       (15)

Combining Equations 13 and 15 and using the standard form for the Gaussian integral gives:

    P(h(x)) = erf( h(x) / sqrt(2) )                              (16)

where erf(·) is the error function.
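The error-function novelty measure of Equations 14 and 16 is easy to implement. The Python sketch below is an illustrative reconstruction, not code from the paper: it computes the Mahalanobis distance of a point to each mixture component and the corresponding cumulative probability. The component parameters are placeholder values.

```python
import numpy as np
from scipy.special import erf

def mahalanobis(x, centre, cov):
    """Equation 14: Mahalanobis distance of x from a Gaussian component."""
    d = x - centre
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def erf_probability(x, centres, covs):
    """Equation 16 evaluated for the closest component (in the Mahalanobis sense)."""
    h = min(mahalanobis(x, c, S) for c, S in zip(centres, covs))
    return erf(h / np.sqrt(2.0))

if __name__ == "__main__":
    # Two illustrative 2-D components with identity covariance.
    centres = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
    covs = [np.eye(2), np.eye(2)]
    print(erf_probability(np.array([3.0, 0.5]), centres, covs))
```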

To evaluate the resultant EVT probability at some point x, it is noted that the theory (and parametric form) of the EVT distribution derived for the |N(0, 1)| distribution (Equation 5) may be used directly, as h(x) is expected (with respect to a Gaussian component) to be distributed as |N(0, 1)|. Furthermore, if a total of m points are observed and used to fit the model, an estimate of the effective number of points `seen' by each component is simply m_k = m π(k). The EVT probability measure is hence written as:

    P(x | m) = P(h(x)_k | m_k)                                   (17)

where the index k denotes the fact that we deal only with the component closest (in the Mahalanobis sense) to x, as this dominates the EVT statistics. Figure 5 shows the evolution of P(x | m) with m for the case of a one-dimensional, two-component Gaussian mixture model. The components are centred at ±1, with σ_1 = 1 and σ_2 = 1/2, and have equal priors. As the number of observed data points (m) increases, the resultant extremal probabilities become less conservative. The value of m increases linearly from m = 2.
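Putting Equations 14, 9, 10, 5 and 17 together requires only a few lines of code. The self-contained Python sketch below is illustrative only (it is not the paper's implementation and uses the equation forms as reconstructed above): it evaluates the EVT novelty probability of a point under a fitted mixture, using the closest component and its effective sample count m_k = m π(k).

```python
import numpy as np

def evt_probability(x, centres, covs, priors, m):
    """Equation 17: Gumbel probability of the Mahalanobis distance to the
    closest mixture component, with effective sample size m_k = m * prior_k."""
    # Mahalanobis distance (Equation 14) to every component; the closest
    # component dominates the EVT statistics.
    h = np.array([np.sqrt((x - c) @ np.linalg.solve(S, x - c))
                  for c, S in zip(centres, covs)])
    k = int(np.argmin(h))
    m_k = max(2.0, m * priors[k])                 # effective points seen by component k
    lm = np.log(m_k)
    mu = np.sqrt(2 * lm) - (np.log(lm) + np.log(2 * np.pi)) / (2 * np.sqrt(2 * lm))  # Eq. 9
    sigma = 1.0 / np.sqrt(2 * lm)                 # Eq. 10
    y = (h[k] - mu) / sigma                       # reduced variate, Eq. 6
    return float(np.exp(-np.exp(-y)))             # Gumbel CDF, Eq. 5

if __name__ == "__main__":
    # A one-dimensional two-component mixture, loosely following Figure 5.
    centres = [np.array([-1.0]), np.array([1.0])]
    covs = [np.eye(1), np.eye(1) * 0.25]
    priors = [0.5, 0.5]
    for m in (10, 100, 1000):
        # The extremal probability at a fixed point becomes less conservative as m grows.
        print(m, evt_probability(np.array([3.0]), centres, covs, priors, m))
```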

2.3 Model fitting and complexity penalisation

All the results in this paper are obtained using the Expectation-Maximisation (EM) algorithm [12] to estimate a maximum-likelihood parameter set for the means, covariance matrices and component priors of the mixture model. Details of the standard algorithmic approach taken to implement the EM procedure may be found in [13], for example.

The likelihood function evaluated from maximum-likelihood methods, such as EM, is monotone increasing with the complexity of the model (i.e. with increasing numbers of components). If we wish, however, to form a parsimonious representation of the data set, this likelihood function is modified via a function which penalises models with larger complexity (more parameters). Such penalty measures may take several forms. One of the simplest is minimum description length (MDL) coding, first derived by Rissanen in 1978 [14]. Although other elegant paradigms exist for obtaining a penalised likelihood function, in particular Bayesian schemes, MDL is observed in practice to give results similar to those of the more (mathematically) complex Bayesian methods (see [15] for a description of the Bayesian approach and comparisons with MDL and other methods). The results in this paper were all obtained using a (log-)likelihood function penalised using MDL, of the form:

    L_pen(K) = L(K) - (1/2) N_p(K) ln N                          (18)

where L(K) is the unpenalised likelihood solution for the model with K components, N_p(K) is the number of free parameters in that model, and N is the total number of data in the data set in question. We choose, therefore, the model with the highest L_pen.

3 Results

Results in this section are from two medical data sets and a noise removal problem in image processing. The first data set comprises auto-regressive model coefficients from a study of hand tremor in patient and non-patient groups. The second data set comes from recordings of electrical activity of the brain (the electroencephalogram, or EEG) in which epileptic seizure events occur.

3.1 Tremor data set

The tremor data set consists of two-dimensional data points representing hand tremors of a set of healthy volunteers, together with a set of ten points randomly drawn from the patient population.¹ A five-component Gaussian mixture model was trained on the non-patient data using the EM algorithm. Figure 6 shows a surface mesh plot of the EVT probability for this data set. Figure 7 shows the data along with P = 0.95 boundaries from (thin line) integrating the mixture model and (thick line) the extreme-value distribution. The latter gives rise to fewer false positives and, it is argued here, a qualitatively more reasonable novelty measure. It should be noted that the use of thresholded probability measures is for presentation purposes only; novelty measures should ideally remain continuous.

¹ The final numbers of patient and non-patient data were arranged to be equal, as in the original study a supervised classification method was used [16].

3.2 Epilepsy data set

The second data set consists of a single channel of EEG from an epileptic subject. The signal was parameterised over one-second segments using information from the singular spectrum of a phase-space embedding (see [17] for details). A four-component Gaussian mixture model was adapted to the parameter space over a 1-minute section of the data in which no epileptic activity was observed. Figure 8 shows the resultant novelty probabilities from the error-function approach (upper trace) and the EVT approach (middle trace).

It is clear that both methods highlight, in particular, the four regions of epileptic activity in this record. The EVT method, however, gives a considerably more interpretable trace, as background (non-epileptic) activity has uniformly low probability. The lower plot of the figure shows the activity highlighted by the first region of novelty in the EVT trace (between 63 and 71 seconds into the recording). The characteristic 3 Hz `spike and wave' activity is very clear.

3.3 Image noise removal

The final example consists of the removal of `salt and pepper' noise from an image (an image of a house). A pixel-by-pixel EVT probability was calculated using grey-scale information from a 7×7 neighbourhood around each pixel, to which a single Gaussian was fitted. A pixel-by-pixel probability value was also calculated for the standard Gaussian-integral approach using the same information. The image was also median filtered using a 3×3 neighbourhood. Denoting by P[r, c] either the EVT or Gaussian-integral outlier probability for pixel [r, c], by M[r, c] the median-filtered response and by O[r, c] the observed (noisy) image, the filtered image F[r, c] is obtained from a convex combination of the noisy image and the median-filtered version:

    F[r, c] = (1 - P[r, c]) O[r, c] + P[r, c] M[r, c]            (19)

i.e. pixels are replaced by their median-filtered values only if it is believed that noise corruption has occurred. This is in contrast, of course, to the `traditional' median-filtering paradigm in which all pixels are replaced by the neighbourhood median value. It is noted that the latter approach gave results worse than either probabilistic approach. Furthermore, the convex combination defined in Equation 19 gives better results for both probability measures (EVT and Gaussian integral) than threshold-based pixel replacement (i.e. filtering a pixel if P[r, c] > threshold), even if the threshold value is optimised. Figure 9 shows the original image I (top left), the corrupted image O (top right), the filtered image obtained via the Gaussian-integral outlier probability (bottom left) and the filtered image obtained via the EVT outlier probability (bottom right). Also indicated are the normalised mean-square-error (NMSE) values (averaged per pixel) between I and each other image I', defined as

    NMSE = ⟨(I[r, c] - I'[r, c])^2⟩ / ⟨var(I[r, c])⟩

and summarised in the following table:
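The convex-combination filter of Equation 19 is simple to apply once a per-pixel outlier probability is available. The Python sketch below is an illustrative reading of Section 3.3 rather than the paper's code: it forms a per-pixel probability from a Gaussian fitted to each 7×7 neighbourhood using the Gumbel CDF as above, median-filters the noisy image, and blends the two. The synthetic test image and noise level are arbitrary.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def evt_pixel_probability(img, size=7):
    """Per-pixel EVT outlier probability from a Gaussian fitted to each
    size x size neighbourhood (an illustrative reading of Section 3.3)."""
    m = float(size * size)
    local_mean = uniform_filter(img, size)
    local_var = np.maximum(uniform_filter(img ** 2, size) - local_mean ** 2, 1e-8)
    h = np.abs(img - local_mean) / np.sqrt(local_var)   # one-sided Mahalanobis distance
    lm = np.log(m)
    mu = np.sqrt(2 * lm) - (np.log(lm) + np.log(2 * np.pi)) / (2 * np.sqrt(2 * lm))
    sigma = 1.0 / np.sqrt(2 * lm)
    return np.exp(-np.exp(-(h - mu) / sigma))           # Gumbel CDF, Equation 5

def evt_filter(noisy, size=7):
    """Equation 19: blend the noisy image with its median-filtered version."""
    noisy = noisy.astype(float)
    P = evt_pixel_probability(noisy, size)
    M = median_filter(noisy, size=3)
    return (1.0 - P) * noisy + P * M

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.uniform(0.3, 0.7, (64, 64))
    noisy = img.copy()
    salt = rng.random(img.shape) < 0.05                 # add salt-and-pepper corruption
    noisy[salt] = rng.choice([0.0, 1.0], size=int(salt.sum()))
    print(np.mean((evt_filter(noisy) - img) ** 2) / img.var())   # NMSE-style score
```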

    Method                       NMSE
    Median                       0.534
    Gaussian-integral filter     0.432
    EVT filter                   0.14

It is clear that a considerable reduction in reconstruction error is obtained using the EVT method.

4 Conclusions

The issue of determining whether some datum is indicative of abnormal or novel system behaviour is paramount in areas such as medical data analysis, where the acquisition of large quantities of abnormal data is difficult. Several approaches to screening data for such novel segments have been proposed previously. Extreme-value theory and the associated statistics provide, however, a complete and principled approach to the problem which avoids the need for cross-validation or the setting of (possibly arbitrary) novelty thresholds. EVT forms an elegant methodology for any problem in which the chance of finding extremes of some quantity is required. It is shown that a simple parametric model for the reduced variate may be used which enables the EVT distributions from a Gaussian mixture model to be estimated. These estimates fit the experimental data well. A variety of novelty problems are analysed using this method and the results are shown to be superior to those obtained using more conventional approaches.

5 Acknowledgements

Thanks are first due to Dr. Stephen Walker for help with asymptotic forms, and to the anonymous referees of earlier drafts of this paper whose patient comments have helped improve the work considerably. The author would also like to thank Dirk Husmeier and Peter Sykacek for helpful comments. The author is most grateful to Julia Spyers-Ashby of the Royal Hospital for Neurodisability, Putney, UK, for the tremor data. This work was researched whilst the author was on sabbatical leave from Imperial College at the Newton Institute for Mathematical Sciences in Cambridge, whose support is gratefully acknowledged.

References

[1] S. Roberts and L. Tarassenko. A Probabilistic Resource Allocating Network for Novelty Detection. Neural Computation, 6:270-284, 1994.

[2] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the detection of masses in mammograms. In Proceedings of the Fourth IEE International Conference on Artificial Neural Networks, pages 442-447. IEE, June 1995.

[3] S. Roberts, L. Tarassenko, J. Pardey, and D. Siegwart. A Validation Index for Artificial Neural Networks. In Proceedings of the International Conference on Neural Networks and Expert Systems in Medicine and Healthcare, Plymouth, August 1994, pages 23-30.

[4] C.M. Bishop. Novelty Detection and Neural Network Validation. IEE Proc. Vision, Image and Signal Processing, 141:217-222, 1994.

[5] R.A. Fisher and L.H.C. Tippett. Limiting Forms of the Frequency Distributions of the Largest or Smallest Members of a Sample. Proc. Cambridge Phil. Soc., 24:180-190, 1928.

[6] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events. Springer-Verlag, 1997.

[7] W. Weibull. A Statistical Distribution of Wide Applicability. J. Appl. Mechanics, 18:293-297, 1951.

[8] R.L. Smith. Handbook of Applicable Mathematics: Supplement, chapter 14, Extreme Value Theory, pages 437-472. John Wiley, 1990.

[9] E.J. Gumbel. Statistics of Extremes. Columbia University Press, New York, 1958.

[10] N. Balakrishnan and A.C. Cohen. Order Statistics and Inference. Academic Press, New York, 1991.

[11] S.L. Miller, W.M. Miller, and P.J. McWhorter. Extremal Dynamics: A unifying physical explanation of fractals, 1/f noise, and activated processes. J. Appl. Phys., 73(6):2617-2628, 1993.

[12] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Stat. Soc., 39(1):1-38, 1977.

[13] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[14] J. Rissanen. Modelling by shortest data description. Automatica, 14:465-471, 1978.

[15] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to mixture modelling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1133-1142, 1998.

[16] J.M. Spyers-Ashby. The Recording and Analysis of Tremor in Neurological Disorders. PhD thesis, Imperial College, University of London.

[17] I.A. Rezek and S.J. Roberts. Stochastic complexity measures for physiological signal analysis. IEEE Transactions on Biomedical Engineering, 44(9).

6 Figures

Figure 1: Plot of μ_m^{expt} against (log) m, evaluated over Monte-Carlo simulations. Note that whilst the relationship is roughly linear, a straight-line fit would be inappropriate if we wish to model over a large range of m.

Figure 2: Plot of experimental (solid line) and theoretical (dashed line) values of σ_m^{-1} and μ_m. One standard deviation error intervals are also shown for the experimental data (the thinner solid lines).

Figure 3: The form of the cumulative distribution P(x | m) for increasing values of m. Note that as m increases the resultant distributions become more extreme. Circles denote experimental results from Monte-Carlo simulations and the solid line the theoretical result of Equation 5.

Figure 4: The form of the density function p(x | m) for increasing values of m. Note that as m increases the resultant distributions become more extreme. Circles denote experimental results from Monte-Carlo simulations and the solid line the theoretical result of Equation 11.

Figure 5: Evolution of P(x | m) with m for a one-dimensional, two-component Gaussian mixture model. The components are centred at ±1, with σ_1 = 1 and σ_2 = 1/2, and have equal priors. As the number of observed data increases, the resultant extremal probabilities become less conservative. The value of m increases linearly from m = 2.

Figure 6: Mesh plot of the EVT probability (Equation 17) over the two-dimensional space of the hand tremor data.

Figure 7: Contour plot of the P > 0.95 boundary on the EVT probability (Equation 17) [thick curve] along with the P > 0.95 boundary from the error-function approach (Equation 16) [thin curve]. Data from the non-patient class are shown as open circles and data from the ten patients are shown as filled circles. Note the larger number of false positives using the integrated mixture approach and the one unavoidable false negative present in both approaches.

Figure 8: Novelty probability for epileptic seizure activity using the error-function method (upper plot) and EVT (middle plot). The lower trace shows the first period of activity flagged as novel by the EVT approach. All x-axis times are in seconds from the start of the recording.

Figure 9: Original image (top left), corrupted image (top right), median-filtered image (bottom left) and EVT-filtered image (bottom right). Normalised mean-square errors of distortion from the original are also given (panel NMSE values: 0.5312, 0.432 and 0.14).
