Unsupervised Anomaly Detection for High Dimensional Data
Department of Mathematics, Rowan University
International Workshop in Sequential Methodologies (IWSM-2013), July 19th, 2013
Outline of Talk
- Motivation: Biometrics
- SVM (Supervised Learning) Approach
- Unsupervised L2E Estimation Approach
- Experimental Results
- Concluding Remarks
Introduction
- We are drowning in the deluge of data being collected worldwide, while starving for knowledge at the same time.
- Anomalous events occur relatively infrequently.
- When they do occur, however, their consequences can be quite dramatic, and quite often in a negative sense.
Need for Accurate Speaker Recognition
- Speaker recognition: identifying a person based on his or her voice.
- One form of biometric identification.
- Need for accurate and scalable speaker recognition in VoIP applications.
- Applications in diverse areas: telephone and internet banking, online trading, forensics.
- Security enforcement in corporate and government sectors.
What is Intrusion Detection?
- Intrusions are activities that violate the security policy of a system.
- Intrusion detection is the process used to identify malicious behavior that targets a network and its resources.
Intrusion Detection Systems
- Intrusion Detection Systems (IDSs) play a key role as a defense mechanism against malicious attacks in network security.
- Monitor traffic between users and networks for abnormal activity.
- Analyze patterns/signatures based on data packets.
Intrusion Detection Techniques
- Misuse intrusion detection: intrusion signatures.
- Statistical/anomaly intrusion detection.
Misuse Intrusion Detection
- Catches intrusions in terms of the characteristics of known attacks or system vulnerabilities.
- Built with knowledge of bad behaviors.
- Collection of signatures: signature analysis.
- Examines the event stream for a signature match: pattern matching.
- Cannot detect novel or unknown attacks.
Anomaly Detection
- An anomaly is a pattern in the data that does not conform to expected behavior.
- Also referred to as outliers, exceptions, peculiarities, surprises, etc.
- Detect any action that deviates significantly from normal behavior.
- Built with knowledge of normal behavior.
- Examines the event stream for deviations from normal.
Applications of Anomaly Detection
- Network intrusion detection
- Insurance / credit card fraud detection
- Healthcare informatics / medical diagnostics
- Industrial damage detection
- Image processing / video surveillance
- Novel topic detection in text mining
Real-World Anomalies
Key Challenges
- Defining a representative normal region is challenging.
- The boundary between normal and outlying behavior is often imprecise.
- The exact notion of an outlier differs across application domains.
- Availability of labeled data for training/validation.
- Data can be extremely large, noisy, and complex.
- Normal behavior keeps evolving.
- Fast and accurate real-time detection is required.
Novelty Detection
- Identification of new or unknown data or signals that a machine learning system was not aware of during training.
- A fundamental requirement of a good classification or identification system.
- Abnormalities are very rare, and there may be no data describing the faulty conditions.
Techniques/Approaches to Detect Anomalies
- Supervised: the data (observations, measurements, etc.) are labeled with pre-defined classes.
- Unsupervised: class labels of the data are unknown. Given a set of data, the task is to establish the existence of classes or clusters in the data.
Support Vector Machine (SVM)
- A popular supervised anomaly detection technique.
- SVMs are linear classifiers that find a hyperplane separating two classes of data, positive and negative.
- The common features of the normal and adversary groups need to be learned and differentiated.
- By discovering the key characteristics of network traffic patterns, a decision boundary is superimposed in the space of feature representations.
SVM for Network Traffic Classification
- Effectively learns the patterns of network traffic and detects measurements deemed untrustworthy, originating from malicious targets.
- Eliminates the need for arbitrary assumptions about the underlying network topology, parameters, or thresholds in favor of direct training data.
- Discovers the key characteristics of network traffic patterns by superimposing a boundary in the space of measurements.
SVM Framework
- Cast the problem of detecting malicious nodes in an SVM classification framework.
- Labeled training examples: (x_i, y_i), where x_i is the representation of the i-th example in the feature space and y_i ∈ {-1, +1} is the corresponding label.
- Decision boundary function: y(x) = w · x + w_0, where w is the weight vector and w_0 is the bias.
SVM Framework
- Network traffic features: x
- Optimization yields: w and w_0
- Prediction of a label: sign(w · x + w_0)
SVM Optimization Problem

min_{W, W_0, ε}  (1/2) ||W||² + γ Σ_{i=1}^N ε_i
subject to  y_i (W · Φ(x_i) + W_0) ≥ 1 − ε_i,  ε_i ≥ 0, for all i,

where
- N: number of training examples.
- ε_i: non-negative slack variables that account for possible misclassifications.
- γ: trade-off factor between the slack variables and the regularization of the norm of the weight vector W.

The constraint in this minimization implies that we want our predictions, W · Φ(x_i) + W_0, to be similar to the labels.
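As a small illustration of the objective above, the sketch below evaluates (1/2)||W||² + γ Σ ε_i for a hand-picked weight vector, with each slack ε_i computed as the hinge loss max(0, 1 − y_i(w · x_i + w_0)). All numbers here are arbitrary toy values, not parameters from the talk.

```python
import numpy as np

# Toy weight vector, bias, and trade-off factor (arbitrary illustration).
w = np.array([1.0, -0.5])
w0 = 0.2
gamma = 1.0

# Three toy examples with labels in {-1, +1}.
X = np.array([[2.0, 0.0], [-1.0, 1.0], [0.5, 0.5]])
y = np.array([1, -1, 1])

# Margin y_i (w . x_i + w0); slack is zero when the margin is at least 1.
margins = y * (X @ w + w0)
slacks = np.maximum(0.0, 1.0 - margins)

# Primal objective: regularization term plus weighted total slack.
objective = 0.5 * np.dot(w, w) + gamma * slacks.sum()
```

Only the third example sits inside the margin here, so it alone contributes slack to the objective.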
Solution to the SVM Optimization Problem
- Solve the optimization by quadratic programming in the dual.
- Parameter estimation by cross-validation on the training set.
- Given W and W_0, predict whether a node is an adversary or not by the sign of W · Φ(x) + W_0.
- The LibSVM package is used to implement the SVM-based anomaly detection model.
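The pipeline on the slides above can be sketched with scikit-learn's `SVC`, which wraps the LibSVM library named on the slide. The traffic features below are synthetic stand-ins (normal nodes near 0, adversarial nodes near 2), not the data set from the talk.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical 8-dimensional traffic features for normal and adversarial nodes.
X_normal = rng.normal(0.0, 1.0, size=(200, 8))
X_attack = rng.normal(2.0, 1.0, size=(200, 8))
X = np.vstack([X_normal, X_attack])
y = np.array([-1] * 200 + [1] * 200)          # labels y_i in {-1, +1}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C plays the role of the trade-off factor gamma in the primal objective.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_tr, y_tr)

# A node is flagged as adversarial when sign(W . Phi(x) + W_0) is positive.
pred = clf.predict(X_te)
accuracy = (pred == y_te).mean()
```

In practice C (and any kernel parameters) would be chosen by cross-validation on the training set, as the slide describes.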
Key Challenges in Supervised Learning
- Defining a representative normal region is challenging.
- The boundary between normal and outlying behavior is often imprecise.
- The exact notion of an outlier differs across application domains.
- Availability of labeled data for training/validation.
- Data can be extremely large, noisy, and complex.
- Normal behavior keeps evolving.
- Fast and accurate real-time detection is required.
What is a Mixture Model?
Let f_{θ_m}(x) denote the general mixture probability density function with m components:

f_{θ_m}(x) = Σ_{i=1}^m π_i f(x | φ_i),

where π_i ≥ 0 and Σ_{i=1}^m π_i = 1 for i = 1, ..., m, and

θ_m = (π_1, ..., π_{m−1}, π_m, φ_1^T, ..., φ_m^T)^T.

In theory, the f(x | φ_i) could be any parametric densities, although in practice they are often from the same parametric family (usually Gaussian).
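The mixture density above can be made concrete with a small two-component univariate Gaussian example; the weights and component parameters below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.7, 0.3])     # pi_i >= 0, summing to 1
means = np.array([0.0, 3.0])
sds = np.array([1.0, 0.5])

def mixture_pdf(x):
    """f_theta(x) = sum_i pi_i * phi(x | mu_i, sigma_i^2)."""
    x = np.atleast_1d(x)
    comps = np.array([w * norm.pdf(x, m, s)
                      for w, m, s in zip(weights, means, sds)])
    return comps.sum(axis=0)

# Like any density, the mixture integrates to 1 (checked by a Riemann sum).
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
total_mass = mixture_pdf(grid).sum() * dx
```

Each point's density is simply the weight-averaged height of the component densities, which is the quantity the L2E criterion later evaluates at the sample observations.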
Estimation Approach with Built-in Robustness using L2E
When m is known, we want f_{θ_m}(x) to be close to the true density g(x) in L2 distance. That is,

L2(f_{θ_m}, g) = ∫ [f_{θ_m}(x) − g(x)]² dx.

The aim is to derive an estimate of θ_m that minimizes this L2 distance.
Estimation Approach with Built-in Robustness
Expanding the square,

L2(f_{θ_m}, g) = ∫ f_{θ_m}(x)² dx − 2 ∫ f_{θ_m}(x) g(x) dx + ∫ g(x)² dx.
Estimation Approach with Built-in Robustness
- The last integral is constant with respect to θ_m and can be ignored.
- The first integral is often available as a closed-form expression.
- The second integral is simply the average height of the density estimate over the data, and may be estimated by n^{-1} Σ_{i=1}^n f_{θ_m}(X_i), where the X_i are sample observations; with the factor of 2 from the expansion, this yields the 2n^{-1} Σ_i f_{θ_m}(X_i) term of the criterion.
Computational Algorithm
The L2E estimator of θ_m is given by

θ̂_m^{L2E} = arg min_{θ_m} [ ∫ f_{θ_m}(x)² dx − 2n^{-1} Σ_{i=1}^n f_{θ_m}(X_i) ].
Computational Algorithm
Normal identity:

∫ φ(x | µ_1, σ_1²) φ(x | µ_2, σ_2²) dx = φ(µ_1 − µ_2 | 0, σ_1² + σ_2²),

where φ(x | µ, σ²) is the normal density function with mean µ and variance σ². For multivariate Gaussian mixtures (GMM), f(x | φ_i) = φ(x | µ_i, Σ_i), and the use of the above identity reduces the key integral to
Computational Algorithm

∫ f_{θ_m}(x)² dx = Σ_{k=1}^m Σ_{l=1}^m π_k π_l φ(µ_k − µ_l | 0, Σ_k + Σ_l).

This makes the integral tractable, significantly reducing the computations involved in minimizing the L2E criterion. Thus, the L2E estimation may be performed by any standard optimization algorithm.
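The two slides above can be combined into a minimal univariate sketch: the closed form gives ∫f², the data average gives the second term, and a standard optimizer minimizes the criterion. This is a simplified illustration, not the talk's implementation: the weights and variances are held fixed and only the two component means are estimated, and all settings (sample size, contamination, starting values) are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
pi = np.array([0.8, 0.2])          # mixing weights (held fixed here)
sd = np.array([1.0, 1.0])          # component standard deviations (held fixed)
true_means = np.array([0.0, 5.0])

# Sample from the mixture, then add a few gross outliers as contamination.
comp = rng.choice(2, size=1000, p=pi)
data = rng.normal(true_means[comp], sd[comp])
data = np.concatenate([data, np.full(20, 30.0)])

def l2e_criterion(means):
    # Closed form: int f^2 = sum_{k,l} pi_k pi_l phi(mu_k - mu_l | 0, s_k^2 + s_l^2).
    int_f2 = sum(
        pi[k] * pi[l] * norm.pdf(means[k] - means[l], 0.0,
                                 np.sqrt(sd[k] ** 2 + sd[l] ** 2))
        for k in range(2) for l in range(2)
    )
    # Data term: 2 n^{-1} sum_i f_theta(X_i).
    f_at_data = sum(pi[k] * norm.pdf(data, means[k], sd[k]) for k in range(2))
    return int_f2 - 2.0 * f_at_data.mean()

fit = minimize(l2e_criterion, x0=np.array([-1.0, 6.0]), method="Nelder-Mead")
est_means = np.sort(fit.x)
```

The gross outliers at 30 contribute almost nothing to the data term, so the estimated means stay near the true values, illustrating the built-in robustness of the L2E criterion.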
Data Analysis
- The effective detection and identification of anomalies in traffic requires the ability to separate them from normal network traffic.
- Network traffic data set from the University of New Mexico.
- Trace files contained 13,831 sample observations with process IDs and their respective system calls.
- We apply our L2E (unsupervised) method and compare its performance with SVM (supervised).
Results: Accuracy with Increasing Dimensions (70%-30% train-test partition of the data)

Dimensions   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
2            0.774                      1.000                     0.9926
3            0.663                      1.000                     0.9926
4            0.561                      1.000                     0.9924
5            0.390                      0.989                     0.9924
6            0.322                      1.000                     0.9924
7            0.189                      1.000                     0.9924
8            0.000                      0.980                     0.9924
Results: Accuracy with Varied Train-Test Partitions (using 8 dimensions of the data)

Train-Test   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
50-50        0.0003                     0.9884                    0.9914
60-40        0.0001                     0.9786                    0.9919
70-30        0.0002                     0.9836                    0.9920
80-20        0.0002                     0.9781                    0.9898
90-10        0.0001                     0.9814                    0.9884
Results: Accuracy with Increasing Testing Sample Size (70%-30% train-test partition, using 8 dimensions of the data)

Testing Sample Size   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
500                   0.0000                     0.9792                    0.9960
1000                  0.0000                     0.9744                    0.9960
1500                  0.0000                     0.9686                    0.9920
2000                  0.0000                     0.9814                    0.9935
2500                  0.0004                     0.9844                    0.9912
3000                  0.0000                     0.9876                    0.9907
3500                  0.0003                     0.9840                    0.9925
4000                  0.0003                     0.9836                    0.9927
Observations
- The false detection rate for SVM is zero in all scenarios on this data set.
- Despite the lack of labeled training data, the true detection rate of the L2E algorithm is comparable to that of the SVM in all scenarios.
Analysis of Simulated Data
- Case: 5 dimensions, n = 10,000, 80/20 random split.
- Data set: µ_1 = (2, 2, 2, 2), µ_2 = (2.5, 2.5, 2.5, 2.5), σ_1 = diag(0.1), σ_2 = diag(0.4), π_1 = 0.8, π_2 = 0.2.
- We apply our L2E (unsupervised) method and compare its performance with SVM (supervised) and some other machine learning algorithms.
- Classification accuracy for L2E is better than the alternatives.
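A sketch of generating the simulated setting described above, using the stated weights, means, and an 80/20 random split. Two assumptions are made explicit in the comments: the diag(·) terms are read as diagonal covariances (so the standard deviations are their square roots), and the data are generated with the four-dimensional means as written on the slide.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
pi = np.array([0.8, 0.2])
mu = np.array([[2.0, 2.0, 2.0, 2.0],          # means as stated on the slide
               [2.5, 2.5, 2.5, 2.5]])
# Assumption: diag(0.1) / diag(0.4) denote diagonal covariance matrices.
sd = np.array([np.sqrt(0.1), np.sqrt(0.4)])

# Draw component labels, then observations; label 1 marks the minority
# ("anomalous") component.
comp = rng.choice(2, size=n, p=pi)
X = rng.normal(mu[comp], sd[comp, None])       # broadcast per-component sd
y = comp

# 80/20 random train-test split.
idx = rng.permutation(n)
split = int(0.8 * n)
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]
```

Any of the compared classifiers (L2E, EM, trees, SVM, NN) could then be trained on `X_train` and scored on the held-out 20%.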
Results: Comparing Machine Learning Algorithms on the Simulated Data

Classifier   Time   False -ve   False +ve
L2E          2.1    0.0345      0.0055
EM           16     0.0315      0.006
Trees        0.31   0.186       0.011
SVM          1.95   0.167       0.007
NN           5.2    0.214       0.01
Conclusion: Significance of Our L2E
- Does not require labeled training data or special configuration.
- Ease of use.
- Achieves accuracy efficiently, without computational overhead.
- Results are comparable to SVM and other machine learning algorithms.
Current and Future Work
- Evaluating performance using multiple network traffic data sets for speaker recognition.
- Applying to real data sets with higher dimensions and a large number of components.
- Estimating the number of components.
- Data mining: random forests/boosting.
Some Reference Articles
- L2E Estimation of Mixture Complexity for Count Data, CSDA (October 2009).
- Simultaneous Robust Estimation in Finite Mixtures: The Continuous Case, JISA (Special Golden Jubilee issue, 2012).
- Detection of Anomalies in Network Traffic using L2E for Accurate Speaker Recognition, IEEE Midwest, August 2012.
- The Elements of Statistical Learning (book): http://www-stat.stanford.edu/~tibs/elemstatlearn/