EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS


International Journal of Advances in Engineering & Technology, May 2013.

Nada Badr, Noureldien A. Noureldien
Department of Computer Science, University of Science and Technology, Omdurman, Sudan

ABSTRACT

Intrusion detection has attracted the attention of both commercial institutions and the academic research community. In this paper PCA (Principal Components Analysis) is used as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of traffic. PCA is sensitive to outliers since it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The results obtained from the experiments show that PCA generates a high rate of false alarms due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method suffers from.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is, PCA analyzes data sets that represent observations described by several dependent variables that are intercorrelated. PCA is one of the best known and most used multivariate exploratory analysis techniques [5].

Several robust competitors to classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of PCA's sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. The minimum covariance determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the estimate of scatter is their covariance matrix. Another robust method for principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data is maximized.

In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP, by applying PCA to the Abilene dataset and comparing its detection performance on the dataset's outliers to that of MCD and PP.

The rest of this paper is organized as follows. Section 2 is an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. In Section 5 the experimental results are shown; conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of studies have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced

by Lakhina [13], whereby principal components analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise. Labib [2] utilized PCA for reducing the dimension of the traffic data and for visualizing and identifying attacks. Bouzida et al. [7] presented a performance study of two machine learning algorithms, namely nearest neighbors and decision trees, when used on traffic data with or without PCA. They found that when PCA is applied to the KDD99 dataset to reduce the dimension of the data, the algorithms' learning speed improved while accuracy remained the same. Terrell [9] used principal components analysis on features of aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry [10] proposed the use of singular value decomposition and wavelet transforms for detecting anomalies in self-similar network traffic data. Wong [12] proposed an anomaly intrusion detection model based on PCA for monitoring network behaviors. The model utilizes PCA for reducing the dimensions of historical data and for building the normal profile, as represented by the first few principal components. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold. Mei-Ling [4] proposed an anomaly detection scheme based on robust principal components analysis. Two classifiers were implemented to detect anomalies: one based on the major components that capture most of the variation in the data, and the second based on the minor components, or residuals. A new observation is considered an outlier, or anomalous, when the sum of squares of the weighted principal components exceeds the threshold in either of the two classifiers. Lakhina [6] applied principal components analysis to Origin-Destination (OD) flow traffic; the traffic is separated into normal and anomalous subspaces by projecting the data onto the resulting principal components one at a time, ordered from high to low. Principal components (PCs) are added to the normal subspace as long as a predefined threshold is not exceeded. When the threshold is exceeded, that PC and the subsequent PCs are added to the anomalous subspace. New OD flow traffic is projected onto the anomalous subspace, and an anomaly is flagged if the value of the squared prediction error, or Q-statistic, exceeds a predefined limit.

PCA is thus widely used to identify lower-dimensional structure in data, and is commonly applied to high-dimensional data. PCA represents data by a small number of components that account for the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis, cluster analysis, etc. In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sensitive to outlying observations and render PCA unreliable when outliers are encountered.

III. CLASSICAL PCA MODEL

The PCA detection model detects outliers by projecting the observations of the dataset on the newly computed axes, known as PCs. The outliers detected by the PCA method are of two types: outliers detected by major PCs, and outliers detected by minor PCs.
The basic goals of PCA [5] are to extract the important information from the data set, to compress the size of the data set by keeping only this important information, and to simplify the description of the data and analyze the structure of the observations and variables (finding patterns of similarity and difference). To achieve these goals PCA calculates new variables from the original variables, called Principal Components (PCs). The computed variables are linear combinations of the original variables (chosen to maximize the variance of the projected observations) and are uncorrelated. The first computed PCs, called major PCs, have the largest inertia (total variance in the data set), while the subsequently computed PCs, called minor PCs, have the greatest residual inertia and are orthogonal to the preceding principal components. The Principal Components define orthogonal directions in the space of observations. In other words, PCA just makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.
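To make this change of reference frame concrete, the following is a minimal sketch (not from the paper) using scikit-learn's PCA on synthetic data; the data and variable names are assumptions for illustration only. It shows that the computed components are uncorrelated and ordered by the share of total variance they carry.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data standing in for a small traffic matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])

pca = PCA()                       # classical PCA: based on mean and covariance
scores = pca.fit_transform(X)     # data expressed in the new PC reference frame

print(pca.explained_variance_ratio_)        # major PCs carry most of the inertia
print(np.round(np.corrcoef(scores.T), 3))   # PCs are (nearly) uncorrelated
```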

3.1 PCA Advantages

The common advantages of PCA are:

3.1.1 Exploratory Data Analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones). From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data). According to most research [8][], PCA detects two types of outliers: type (1), outliers that inflate variance, which are detected by the major PCs, and type (2), outliers that violate structure, which are detected by the minor PCs.

3.1.2 Data Reduction Technique

All multivariate techniques are prone to the bias-variance tradeoff, which states that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. PCA is better than other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

3.1.3 Low Computational Requirement

PCA needs low computational effort since its algorithm consists of simple calculations.

3.2 PCA Disadvantages

It may be noted that PCA is based on the assumptions that the dimensionality of data can be efficiently reduced by a linear transformation and that most information is contained in those directions where the variance of the input data is maximal. As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task). From the above, the following disadvantages of PCA are concluded.

3.2.1 Dependence on Linear Algebra

PCA relies on simple linear algebra as its main mathematical engine, and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a better data description.

3.2.2 Smallest Principal Components Receive No Attention in Statistical Techniques

The lack of interest is due to the fact that, compared with the largest principal components that contain most of the total variance in the data, the smallest principal components only contain the noise of the data and, therefore, appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection.

3.2.3 High False Alarms

Principal components are sensitive to outliers, since the principal components are determined by their directions and calculated from classical estimators such as the classical mean and classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. However, the classical PCA method can be

affected by outliers, so that the PCA model cannot detect all the actually deviating observations; this is known as the masking effect. In addition, some good data points might even appear to be outliers; this is known as the swamping effect. Masking and swamping cause PCA to generate a high rate of false alarms. To reduce these false alarms the use of robust estimators has been proposed, since outlying points are less likely to enter into the calculation of the robust estimators. The well-known PCA robustification methods are the minimum covariance determinant (MCD) and the Projection-Pursuit (PP) principle.

The objective of the raw MCD is to find h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = (n - h + 1)/n$, hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data is maximized. PP is applied where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires that the number of dimensions of the dataset does not exceed 50. Principal Component Analysis (PCA) is an example of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; PP, however, instead of using the variance as the measure of dispersion, uses a robust scale estimator [4].

V. EXPERIMENTS AND RESULTS

In this section we show how we test PCA and its robustification methods MCD and PP on a dataset. The data used consists of OD (Origin-Destination) flows which are collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix on 24-3-, which is the traffic matrix Yin Zhang built from the Abilene network. The dataset is available in offline mode, where it is extracted from the offline traffic matrix.

5.1 PCA on Dataset

At first, the dataset (the traffic matrix) is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$$X_{(144 \times 12)} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,12} \\ \vdots & \ddots & \vdots \\ x_{144,1} & \cdots & x_{144,12} \end{bmatrix}$$

The following steps are considered in applying the PCA method to the dataset.

Centering the dataset to have zero mean, where the mean vector is calculated from the following equation:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (1)$$

and the mean is subtracted off for each dimension. The product of this step is a centered data matrix Y, which has the same size as the original dataset:

$$Y_{(n,p)} = (x_{i,j} - \mu(x_j)) \qquad (2)$$

The covariance matrix is calculated from the following equation:

$$C(X) \;\text{or}\; \Sigma(X) = \frac{1}{n}\,(X - T(X))^{T} (X - T(X)) \qquad (3)$$

Finding the eigenvectors and eigenvalues from the covariance matrix, where the eigenvalues are the diagonal elements of the resulting matrix, by using the eigen-decomposition technique in equation (4):

$$E^{T} \Sigma_Y E = \Lambda \qquad (4)$$

where E is the matrix of eigenvectors and Λ the diagonal matrix of eigenvalues.

Ordering the eigenvalues in decreasing order and sorting the eigenvectors according to the ordered eigenvalues; the sorted eigenvectors matrix is the loadings matrix.

Calculating the scores matrix (the dataset projected on the principal components), which declares the relations between the principal components and the observations. The scores matrix is calculated from the following equation:

$$\text{scores}_{(n,p)} = Y_{(n,p)} \cdot \text{loadings}_{(p,p)} \qquad (5)$$

Applying the 97.5% tolerance ellipse to the bivariate dataset (data projected on the first PCs, and data projected on the minor PCs) to reveal outliers automatically. The ellipse is defined by the data points whose distance equals the square root of the 97.5% quantile of the chi-square distribution with 2 degrees of freedom. The form of the cutoff is

$$\text{dist} \leq \sqrt{\chi^{2}_{p,\,0.975}} \qquad (6)$$
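As a concrete companion to equations (1)-(5), here is a minimal NumPy sketch of the classical PCA steps; the 144 x 12 shape and all variable names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def classical_pca(X):
    """Classical PCA following equations (1)-(5): center, covariance,
    eigen-decomposition, sort by decreasing eigenvalue, project."""
    n, p = X.shape
    mu = X.mean(axis=0)                      # eq. (1): sample mean vector
    Y = X - mu                               # eq. (2): centered data matrix
    C = (Y.T @ Y) / n                        # eq. (3): sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eq. (4): eigen-decomposition
    order = np.argsort(eigvals)[::-1]        # decreasing order of variance
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Y @ loadings                    # eq. (5): data projected on the PCs
    return eigvals, loadings, scores

# Illustrative use on a random stand-in for the 144 x 12 traffic matrix
X = np.random.rand(144, 12)
eigvals, loadings, scores = classical_pca(X)
print(eigvals[:2].sum() / eigvals.sum())     # variance share of the first two PCs
```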

The scree plot was studied, and the first and second principal components accounted for 98% of the total variance of the dataset, so the first two principal components were retained to represent the dataset as a whole. Figure (1) shows the scree plot, and the plot of the data projected onto the first two principal components, which reveals the outliers in the dataset visually, is shown in Figure (2).

Figure 1: PCA Scree Plot. Figure 2: PCA Visual Outliers (data projected on the major PCs).

Figure (3) shows the tolerance ellipse on the major PCs, and Figures (4) and (5) show, respectively, the visual recording of outliers from the scatter plot of the data projected on the minor principal components, and the outliers detected by the minor principal components tuned by the tolerance ellipse.

Figure 3: PCA Tolerance Ellipse. Figure 4: PCA Type-2 Outliers (data projected on the minor PCs).
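The tolerance-ellipse flagging of equation (6) can be sketched as follows; this is an assumed implementation (SciPy for the chi-square quantile, scores taken from the classical_pca sketch above), not code from the paper.

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(scores2d, quantile=0.975):
    """Flag points outside the tolerance ellipse of a bivariate score plot.
    The boundary corresponds to a squared distance of chi2_{2, 0.975},
    matching the cutoff in equation (6)."""
    center = scores2d.mean(axis=0)
    cov = np.cov(scores2d, rowvar=False)
    diff = scores2d - center
    # Squared Mahalanobis distance of every observation to the center
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    cutoff = chi2.ppf(quantile, df=scores2d.shape[1])
    return np.where(d2 > cutoff)[0]          # indices of flagged observations

# e.g. outliers on the first two PCs:   flag_outliers(scores[:, :2])
# and on the last two (minor) PCs:      flag_outliers(scores[:, -2:])
```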

Figure 5: PCA Tuned Minor PCs.

5.2 MCD on Dataset

Testing the robust MCD (Minimum Covariance Determinant) estimator yields a robust location measure T_mcd and a robust dispersion Σ_mcd. The following steps are applied to test MCD on the dataset in order to reach the robust principal components.

The MCD robust distance is calculated from the formula:

$$R_i = (x_i - T_{mcd}(X))^{T}\, \Sigma_{mcd}(X)^{-1}\, (x_i - T_{mcd}(X)), \quad i = 1, \dots, n \qquad (7)$$

From the robust location estimate $T_{mcd}$ (or $\mu_{mcd}$) and the robust covariance matrix $C(X)_{mcd}$ (or $\Sigma(X)_{mcd}$) the following are calculated:

* find the robust eigenvalues, as a diagonal matrix, as in equation (4), replacing n with h;
* find the robust eigenvectors, as the loadings matrix, as in equation (5).

The robust scores matrix is calculated in the following form:

$$\text{robustscores}_{(n,p)} = Y_{(n,p)} \cdot \text{loadings}_{(p,p)} \qquad (8)$$

The robust scree plot, retaining the first two robust principal components, which accounted for above 98% of the total variance, is shown in Figure (6). Figures (7) and (8) show, respectively, the visual recording of outliers from the scatter plot of the data projected on the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse; Figures (9) and (10) show, respectively, the visual recording of outliers from the scatter plot of the data projected on the robust minor principal components, and the outliers detected by the robust minor principal components tuned by the tolerance ellipse.

Figure 6: MCD Scree Plot. Figure 7: MCD Visual Outliers (data projected on the robust major PCs).
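A sketch of the MCD-robustified PCA described above, using scikit-learn's MinCovDet estimator as an assumed stand-in for the paper's MCD computation: the eigenvectors of the MCD scatter matrix replace those of the classical covariance matrix.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def mcd_pca(X, support_fraction=None):
    """Robust PCA via MCD: eigen-decompose the MCD scatter matrix instead of
    the classical covariance matrix, then project the centered data (eq. (8))."""
    mcd = MinCovDet(support_fraction=support_fraction).fit(X)
    T_mcd, S_mcd = mcd.location_, mcd.covariance_    # robust location and scatter
    eigvals, eigvecs = np.linalg.eigh(S_mcd)
    order = np.argsort(eigvals)[::-1]
    loadings = eigvecs[:, order]
    robust_scores = (X - T_mcd) @ loadings
    return eigvals[order], loadings, robust_scores

# The squared robust distances of equation (7) are also available directly:
# R2 = MinCovDet().fit(X).mahalanobis(X)
```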

Figure 8: MCD Tolerance Ellipse. Figure 9: MCD Type-2 Outliers (data projected on the robust minor PCs). Figure 10: MCD Tuned Minor PCs.

5.3 Projection Pursuit on Dataset

Testing the projection pursuit method on the dataset includes the following steps.

Center the data matrix $X_{(n,p)}$ around the L1-median to reach the centered data matrix $Y_{(n,p)}$:

$$Y_{(n,p)} = X_{(n,p)} - L(X) \qquad (9)$$

where L(X) is a highly robust estimator of multivariate data location with 50% resistance to outliers [].

Construct the directions $P_i$ as normalized rows of the centered matrix; this process includes the following:

$$PY_i = Y[i, :] \quad \text{for } i = 1, \dots, n \qquad (10)$$

$$NPY_i = \max(\mathrm{svd}(PY_i)) \qquad (11)$$

where SVD stands for singular value decomposition.

$$P_i = \frac{PY_i}{NPY_i} \qquad (12)$$

Project the dataset on all possible directions:

$$T_i = Y \, P_i^{\,t} \qquad (13)$$

Calculate the robust scale estimator for all the projections and find the direction that maximizes the Qn estimator:

$$q = \max_i \big(Q_n(T_i)\big) \qquad (14)$$

Qn is a scale estimator; essentially it is the first quartile of all pairwise distances between two data points [5]. The results of these steps yield the robust eigenvectors (PCs), and the squared values of the robust scale estimator are the eigenvalues.

Project all data on the selected direction q to obtain the robust principal components:

$$T = Y_{(n,p)} \, P_q^{\,t} \qquad (15)$$

Update the data matrix by its orthogonal complement:

$$Y = Y - Y\,(P_q\, P_q^{\,t}) \qquad (16)$$

Project all data on the orthogonal complement:

$$\text{scores} = Y \, P_i \qquad (17)$$
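The projection-pursuit steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the Qn-style scale is approximated by the first quartile of pairwise absolute differences, the centering uses a coordinate-wise median rather than the L1-median, and all names are assumptions.

```python
import numpy as np

def qn_like_scale(x):
    """Simplified Qn-type robust scale: first quartile of all pairwise
    absolute differences (a stand-in for the exact Qn estimator)."""
    diffs = np.abs(x[:, None] - x[None, :])
    return np.quantile(diffs[np.triu_indices(len(x), k=1)], 0.25)

def pp_pca(X, n_components=2):
    """Projection-pursuit PCA sketch: candidate directions are the normalized
    centered observations (eq. (12)); keep the direction maximizing the robust
    scale (eq. (14)), then deflate and repeat (eq. (16))."""
    center = np.median(X, axis=0)            # coordinate-wise median (the paper uses the L1-median)
    Y = X - center
    eigvals, loadings = [], []
    for _ in range(n_components):
        P = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
        scales = np.array([qn_like_scale(Y @ p) for p in P])   # robust spread per direction
        best = P[np.argmax(scales)]
        loadings.append(best)
        eigvals.append(scales.max() ** 2)    # squared robust scale plays the role of eigenvalue
        Y = Y - np.outer(Y @ best, best)     # project out the chosen direction
    loadings = np.array(loadings).T
    return np.array(eigvals), loadings, (X - center) @ loadings
```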

The plot of the data projected on the first two robust principal components, used to detect outliers visually, is shown in Figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in Figure (12). Figures (13) and (14) show, respectively, the plot of the data projected on the minor robust principal components to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse.

Figure 11: PP Visual Outliers (data projected on the robust major PCs). Figure 12: PP Tolerance Ellipse. Figure 13: PP Type-2 Outliers (data projected on the robust minor PCs). Figure 14: PP Tuned Minor PCs.

5.4 Results

Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The MCD and PP results reveal the effects of masking and swamping in the PCA method. The PP results are similar to those of MCD, with slight differences, since we use 12 dimensions of the dataset.

Table 1: Outliers Detection

PCA (outliers detected by major and minor PCs) | MCD (outliers detected by major and minor PCs) | PP (outliers detected by major and minor PCs) | Masking | Swamping
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
— | — | — | No | No

— | — | — | No | No
— | — | — | No | No
— | — | — | No | No
Normal | Normal | 69 | Yes | No
Normal | Normal | 7 | Yes | No
7 | Normal | Normal | No | Yes
76 | Normal | Normal | No | Yes
8 | Normal | Normal | No | Yes
— | Normal | Normal | No | Yes
4 | Normal | Normal | No | Yes
— | Normal | Normal | No | Yes
44 | Normal | Normal | No | Yes
Normal | Normal | — | Yes | No
Normal | Normal | — | Yes | No
Normal | — | — | Yes | No
Normal | — | — | Yes | No

VI. CONCLUSION AND FUTURE WORK

The study has examined the performance of PCA and its robustification methods (MCD and PP) for intrusion detection by presenting the bi-plots and extracting the outlying observations that are very different from the bulk of the data. The study showed that the tuned (tolerance-ellipse) results are identical to the visualized ones. The study attributes the PCA false-alarm shortcoming to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with slight differences for type 2 outliers, since these are considered a source of noise. Our future work will go into applying the hybrid method (ROBPCA), which takes PP as a reduction technique and MCD as a robust measure, for further performance, and applying a dynamic robust PCA model with regard to online intrusion detection.

REFERENCES

[1]. Abilene TMs, collected by Y. Zhang, visited on 3/7/2012.
[2]. Khalid Labib and V. Rao Vemuri, "An application of principal components analysis to the detection and visualization of computer network attacks", Annals of Telecommunications, pages 28-34, 2005.
[3]. C. Croux and A. Ruiz-Gazen, "A fast algorithm for robust principal components based on projection pursuit", COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996.
[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A novel anomaly detection scheme based on principal component classifier", in Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03), 2003.
[5]. J. Edward Jackson, "A User's Guide to Principal Components", Wiley-Interscience, 1st edition, 2003.
[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot, "Diagnosing network-wide traffic anomalies", in Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ACM, 2004.
[7]. Yacine Bouzida, Frederic Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault, "Efficient Intrusion Detection Using Principal Component Analysis", La Londe, France, June 2004.
[8]. R. Gnanadesikan, "Methods for Statistical Data Analysis of Multivariate Observations", Wiley-Interscience, New York, 2nd edition, 1997.
[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Zhu, and A. Nobel, "Multivariate SVD analysis for network anomaly detection", in Proceedings of the ACM SIGCOMM Conference, 2005.
[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati, "Network traffic analysis using singular value decomposition and multiscale transforms", Information Sciences: An International Journal.

[11]. I. T. Jolliffe, "Principal Component Analysis", Springer Series in Statistics, Springer, New York, 2nd edition, 2007.
[12]. Wei Wong, Xiaohong Guan, and Xiangliang Zhang, "Processing of massive audit data streams for real-time anomaly intrusion detection", Computer Communications, Elsevier, 2008.
[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural Analysis of Network Traffic Flows", in Proceedings of ACM SIGMETRICS, New York, NY, USA, 2004.

AUTHORS BIOGRAPHIES

Nada Badr earned her BSc in Mathematical and Computer Science at the University of Gezira, Sudan. She received her MSc in Computer Science at the University of Science and Technology. She is pursuing her PhD in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor in Computer Science, Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his PhD in Computer Science from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.
