Unsupervised Anomaly Detection for High Dimensional Data

1 Unsupervised Anomaly Detection for High Dimensional Data. Department of Mathematics, Rowan University. July 19th, 2013, International Workshop in Sequential Methodologies (IWSM-2013).

2 Outline of Talk: Motivation (biometrics); SVM (supervised learning) approach; unsupervised L2E estimation approach; experimental results; concluding remarks.

3 Introduction We are drowning in the deluge of data being collected world-wide while starving for knowledge at the same time. Anomalous events occur relatively infrequently; however, when they do occur, their consequences can be quite dramatic, and quite often in a negative sense.

4 Need for Accurate Speaker Recognition Speaker recognition is a method of recognizing a person based on his or her voice, and is one of the forms of biometric identification. There is a need for accurate and scalable speaker recognition, for example in VoIP applications. Applications arise in diverse areas: telephone and internet banking, online trading, forensics, and security enforcement in the corporate and government sectors.

5 What is intrusion detection? Intrusions are activities that violate the security policy of a system. Intrusion detection is the process used to identify malicious behavior that targets a network and its resources.

6 Intrusion Detection System Intrusion detection systems (IDSs) play a key role as a defense mechanism against malicious attacks in network security. An IDS monitors traffic between users and networks for abnormal activity and analyzes patterns/signatures based on data packets.

7 Intrusion Detection Techniques Misuse intrusion detection (intrusion signatures); statistical/anomaly intrusion detection.

8 Misuse intrusion detection Catches intrusions in terms of the characteristics of known attacks or system vulnerabilities. Built with knowledge of bad behaviors: a collection of signatures (signature analysis). Examines the event stream for signature matches (pattern matching). Cannot detect novel or unknown attacks.

9 Anomaly detection? An anomaly is a pattern in the data that does not conform to the expected behavior; also referred to as an outlier, exception, peculiarity, surprise, etc. The goal is to detect any action that significantly deviates from normal behavior. Built with knowledge of normal behaviors; examines the event stream for deviations from normal.

10 Applications of Anomaly Detection Network intrusion detection; insurance/credit card fraud detection; healthcare informatics/medical diagnostics; industrial damage detection; image processing/video surveillance; novel topic detection in text mining.

11 Real-world Anomalies

12 Key Challenges Defining a representative normal region is challenging. The boundary between normal and outlying behavior is often not precise. The exact notion of an outlier differs across application domains. Labeled data for training/validation may not be available. Data can be extremely large, noisy, and complex. Normal behavior keeps evolving. Fast and accurate real-time detection is required.

13 Novelty detection Identification of new or unknown data or signals that a machine learning system was not aware of during training; a fundamental requirement of a good classification or identification system. Abnormalities are very rare, or there may be no data describing the faulty conditions.

14 Techniques/approaches to detect anomalies Supervised: the data (observations, measurements, etc.) are labeled with pre-defined classes. Unsupervised: class labels of the data are unknown; given a set of data, the task is to establish the existence of classes or clusters in the data.

15 Support Vector Machine (SVM) A popular supervised anomaly detection technique. SVMs are linear classifiers that find a hyperplane separating two classes of data, positive and negative. The features that distinguish the normal and adversary groups need to be learned and differentiated. By discovering the key characteristics of network traffic patterns, a decision boundary is superimposed in the space of feature representations.

16 SVM for Network Traffic Classification Effectively learns the patterns of network traffic and detects measurements deemed untrustworthy, originating from malicious targets. Eliminates the need for arbitrary assumptions about the underlying network topology, parameters, or thresholds in favor of direct training data. Discovers key characteristics of network traffic patterns by superimposing a boundary in the space of measurements.

17 SVM Framework Cast the problem of detecting malicious nodes in an SVM classification framework. Labeled training examples: $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the representation of the $i$-th example in the feature space and $y_i \in \{-1, 1\}$ is the corresponding label. Decision boundary function: $y(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0$, where $\mathbf{w}$ is the weight vector and $w_0$ is the bias.

18 SVM Framework Network traffic features: $\mathbf{x}$. Quantities optimized: $\mathbf{w}$ and $w_0$. Prediction of a training-set label: the sign of $\mathbf{w} \cdot \mathbf{x} + w_0$.
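
For concreteness, here is a minimal sketch (Python/NumPy) of how a label is read off from the linear decision function; the weight vector, bias, and feature vectors are made-up placeholders, not values estimated from the talk's data.

```python
import numpy as np

# Hypothetical weight vector and bias (placeholders, not fitted values).
w = np.array([0.8, -1.2, 0.5])
w0 = 0.1

def svm_predict(X, w, w0):
    """Label each row of X by the sign of the linear decision function w.x + w0."""
    scores = X @ w + w0          # decision values y(x) = w.x + w0
    return np.where(scores >= 0, 1, -1)

# Three example feature vectors (one per row).
X = np.array([[1.0, 0.2, 0.0],
              [0.1, 1.5, 0.3],
              [2.0, 0.0, 1.0]])
print(svm_predict(X, w, w0))     # e.g. array([ 1, -1,  1])
```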

19 SVM Optimization Problem $\min_{\mathbf{W}, W_0, \varepsilon} \; \frac{1}{2}\|\mathbf{W}\|^2 + \gamma \sum_{i=1}^{N} \varepsilon_i$ subject to $y_i\left(\mathbf{W} \cdot \Phi(\mathbf{x}_i) + W_0\right) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$, for all $i$, where $N$ is the number of training examples, the $\varepsilon_i$ are non-negative slack variables that account for possible misclassifications, and $\gamma$ is the trade-off factor between the slack variables and the regularization on the norm of the weight vector $\mathbf{W}$. The constraint in this minimization implies that we want our predictions $\mathbf{W} \cdot \Phi(\mathbf{x}_i) + W_0$ to agree with the labels.
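
The objective above can be written down directly. Below is a small illustrative sketch that evaluates the soft-margin criterion for a candidate $(\mathbf{W}, W_0)$, assuming the identity feature map $\Phi(\mathbf{x}) = \mathbf{x}$ and using toy data; it is only meant to make the roles of the slacks and of $\gamma$ concrete.

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, gamma):
    """Value of 0.5*||W||^2 + gamma * sum of slacks for a linear SVM.

    The slack for example i is eps_i = max(0, 1 - y_i * (w.x_i + w0)),
    i.e. how far the prediction falls short of the margin constraint
    y_i * (w.x_i + w0) >= 1 - eps_i.
    """
    margins = y * (X @ w + w0)
    slacks = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + gamma * slacks.sum()

# Toy labeled data (placeholders).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, gamma=1.0))
```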

20 Solution to the SVM Optimization Problem Solve the optimization by quadratic programming in the dual. Parameters are estimated by cross-validation on the training set. Given $\mathbf{W}$ and $W_0$, predict whether a node is an adversary or not by looking at the sign of $\mathbf{W} \cdot \Phi(\mathbf{x}) + W_0$. The LibSVM package is used to implement the SVM-based anomaly detection model.
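
The talk used the LibSVM package; a rough stand-in in Python is scikit-learn's SVC, which wraps LIBSVM. The sketch below trains on synthetic labeled data (not the talk's network traces) and selects the trade-off parameter by cross-validation on the training set, mirroring the procedure described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for labeled network-traffic features: class +1 vs class -1.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 8)),
               rng.normal(2.0, 1.0, size=(200, 8))])
y = np.array([1] * 200 + [-1] * 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Choose the regularization trade-off C (gamma in the slide's notation) by cross-validation.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Predict adversary vs normal from the sign of the decision function and report test accuracy.
print(search.best_params_, search.score(X_test, y_test))
```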

21 Key Challenges in Supervised Learning Defining a representative normal region is challenging. The boundary between normal and outlying behavior is often not precise. The exact notion of an outlier differs across application domains. Labeled data for training/validation may not be available. Data can be extremely large, noisy, and complex. Normal behavior keeps evolving. Fast and accurate real-time detection is required.

22 What is a Mixture Model? Let $f_{\theta_m}(\mathbf{x})$ denote the general mixture probability density function with $m$ components, $f_{\theta_m}(\mathbf{x}) = \sum_{i=1}^{m} \pi_i f(\mathbf{x} \mid \phi_i)$, where $\pi_i \ge 0$ for $i = 1, \ldots, m$, $\sum_{i=1}^{m} \pi_i = 1$, and $\theta_m = (\pi_1, \ldots, \pi_{m-1}, \pi_m, \phi_1^T, \ldots, \phi_m^T)^T$. In theory, the $f(\mathbf{x} \mid \phi_i)$ could be any parametric density, although in practice they are often from the same parametric family (usually Gaussian).
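
To make the mixture density concrete, here is a short sketch (SciPy, with arbitrary illustrative parameters) that evaluates $f_{\theta_m}(\mathbf{x}) = \sum_i \pi_i \phi(\mathbf{x} \mid \mu_i, \Sigma_i)$ for a two-component Gaussian mixture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """Evaluate f_theta(x) = sum_i pi_i * phi(x | mu_i, Sigma_i)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=c)
               for p, m, c in zip(weights, means, covs))

# Illustrative two-component Gaussian mixture in 2-D (made-up parameters).
weights = [0.7, 0.3]                        # pi_i >= 0, summing to 1
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

print(mixture_density(np.array([0.5, 0.5]), weights, means, covs))
```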

23 Estimation Approach with Built-in Robustness using L2E When $m$ is known, we want to find $f_{\theta_m}(\mathbf{x})$ that is close to the true density $g(\mathbf{x})$ in $L_2$ distance, that is, $L_2(f_{\theta_m}, g) = \int [f_{\theta_m}(\mathbf{x}) - g(\mathbf{x})]^2 \, d\mathbf{x}$. The aim is to derive an estimate of $\theta_m$ that minimizes this $L_2$ distance.
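
In one dimension the $L_2$ distance can be approximated directly by numerical quadrature, which is a useful sanity check. The sketch below uses stand-in densities (a single normal as the model, a two-component mixture as the target), not anything from the talk's data.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Candidate model f: a single normal; target g: a two-component normal mixture.
f = lambda x: norm.pdf(x, loc=0.0, scale=1.2)
g = lambda x: 0.8 * norm.pdf(x, 0.0, 1.0) + 0.2 * norm.pdf(x, 4.0, 1.0)

# L2 distance: integral of (f - g)^2 over the real line.
l2_dist, _ = quad(lambda x: (f(x) - g(x)) ** 2, -np.inf, np.inf)
print(l2_dist)
```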

24 Estimation Approach with Built-in Robustness $L_2(f_{\theta_m}, g) = \int f_{\theta_m}^2(\mathbf{x}) \, d\mathbf{x} - 2 \int f_{\theta_m}(\mathbf{x}) g(\mathbf{x}) \, d\mathbf{x} + \int g(\mathbf{x})^2 \, d\mathbf{x}$

25 Estimation Approach with Built-in Robustness The last integral is constant with respect to $\theta_m$. The first integral is often available as a closed-form expression. The second integral is essentially the average height of the density estimate under the data-generating distribution, and the corresponding term may be estimated as $2n^{-1} \sum_{i=1}^{n} f_{\theta_m}(\mathbf{X}_i)$, where $\mathbf{X}_i$ is a sample observation.

26 Computational Algorithm The L2E estimator of $\theta_m$ is given by $\hat{\theta}_m^{L_2E} = \arg\min_{\theta_m} \left[ \int f_{\theta_m}^2(\mathbf{x}) \, d\mathbf{x} - 2n^{-1} \sum_{i=1}^{n} f_{\theta_m}(\mathbf{X}_i) \right]$.

27 Computational Algorithm Normal identity: $\int \phi(x \mid \mu_1, \sigma_1^2)\, \phi(x \mid \mu_2, \sigma_2^2)\, dx = \phi(\mu_1 - \mu_2 \mid 0, \sigma_1^2 + \sigma_2^2)$, where $\phi(x \mid \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. For multivariate Gaussian mixtures (GMM), $f(\mathbf{x} \mid \phi_i) = \phi(\mathbf{x} \mid \mu_i, \Sigma_i)$, and the use of the above identity reduces the key integral to the closed form shown on the next slide.
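
The normal identity is easy to verify numerically in one dimension. The sketch below compares the quadrature value of $\int \phi(x \mid \mu_1, \sigma_1^2)\, \phi(x \mid \mu_2, \sigma_2^2)\, dx$ with the closed form $\phi(\mu_1 - \mu_2 \mid 0, \sigma_1^2 + \sigma_2^2)$ for arbitrary illustrative parameter values.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, sig1 = 1.0, 0.8
mu2, sig2 = -0.5, 1.5

# Left-hand side: numerical integral of the product of the two normal densities.
lhs, _ = quad(lambda x: norm.pdf(x, mu1, sig1) * norm.pdf(x, mu2, sig2),
              -np.inf, np.inf)

# Right-hand side: phi(mu1 - mu2 | 0, sig1^2 + sig2^2); norm.pdf takes a standard deviation.
rhs = norm.pdf(mu1 - mu2, loc=0.0, scale=np.sqrt(sig1**2 + sig2**2))

print(lhs, rhs)   # the two values agree up to quadrature error
```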

28 Computational Algorithm $\int f_{\theta_m}^2(\mathbf{x}) \, d\mathbf{x} = \sum_{k=1}^{m} \sum_{l=1}^{m} \pi_k \pi_l \, \phi(\mu_k - \mu_l \mid 0, \Sigma_k + \Sigma_l)$, making the integral tractable and thereby significantly reducing the computations involved in minimizing the L2E criterion. Thus, the L2E estimate may be obtained by any standard optimization algorithm.
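
Putting the pieces together, here is a compact, illustrative sketch of the L2E criterion for a univariate Gaussian mixture with a fixed number of components $m$, minimized with a general-purpose optimizer. It is only a sketch under simplifying assumptions (one-dimensional data, softmax/log reparameterization to enforce the weight and variance constraints, Nelder-Mead optimization), not the talk's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def unpack(theta, m):
    """theta -> (weights, means, sds); weights via softmax, sds via exp (keeps constraints)."""
    w = np.exp(theta[:m]); w /= w.sum()
    mu = theta[m:2 * m]
    sd = np.exp(theta[2 * m:3 * m])
    return w, mu, sd

def l2e_criterion(theta, x, m):
    """L2E objective: sum_kl pi_k pi_l phi(mu_k - mu_l | 0, s_k^2 + s_l^2) - (2/n) sum_i f_theta(x_i)."""
    w, mu, sd = unpack(theta, m)
    # Closed-form integral of f_theta^2 via the pairwise normal convolution identity.
    quad_term = sum(w[k] * w[l] * norm.pdf(mu[k] - mu[l], 0.0, np.sqrt(sd[k]**2 + sd[l]**2))
                    for k in range(m) for l in range(m))
    # Sample average estimating 2 * integral of f_theta * g, using the data themselves.
    fvals = sum(w[k] * norm.pdf(x, mu[k], sd[k]) for k in range(m))
    return quad_term - 2.0 * fvals.mean()

# Simulated data from a two-component mixture (illustrative only).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 800), rng.normal(5.0, 0.5, 200)])

m = 2
theta0 = np.array([0.0, 0.0, np.quantile(x, 0.25), np.quantile(x, 0.75), 0.0, 0.0])
fit = minimize(l2e_criterion, theta0, args=(x, m), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-9})
print(unpack(fit.x, m))   # estimated (weights, means, standard deviations)
```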

29 Data Analysis The effective detection and identification of anomalies in traffic requires the ability to separate them from normal network traffic. We use a network traffic data set from the University of New Mexico; the trace files contained sample observations with process IDs and their respective system calls. We apply our L2E (unsupervised) approach and compare its performance with SVM (supervised).

30 Results: Accuracy with increasing dimensions (70%-30% train-test partition of the data). Table columns: Dimensions; L2E false detection rate; L2E true detection rate; SVM true detection rate.

31 Results: Accuracy with varied training-testing partitions (using 8 dimensions of the data). Table columns: Train-test split; L2E false detection rate; L2E true detection rate; SVM true detection rate.

32 Results: Accuracy with increasing testing sample size (70%-30% train-test partition, using 8 dimensions of the data). Table columns: Testing sample size; L2E false detection rate; L2E true detection rate; SVM true detection rate.

33 Observations The false detection rate for SVM is zero for all scenarios on this data set. Despite the lack of labeled training data, the true detection rate of the L2E algorithm is comparable to that of the SVM for all scenarios.

34 Analysis for Simulated Data Case: 5 dimensions, n = 10000, with an 80/20 random split. Dataset: $\mu_1 = (2, 2, 2, 2)$, $\mu_2 = (2.5, 2.5, 2.5, 2.5)$, $\sigma_1 = \mathrm{diag}(0.1)$, $\sigma_2 = \mathrm{diag}(0.4)$, $\pi_1 = 0.8$, $\pi_2 = 0.2$. We apply our L2E (unsupervised) approach and compare its performance with SVM (supervised) and some other machine learning algorithms. Classification accuracy for L2E is better than that of the alternatives.
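
Here is a sketch of how data along the lines of this simulation might be generated (NumPy). The means given on the slide are four-dimensional, so four dimensions are used here, and diag(.1)/diag(.4) are assumed to list variances; these are assumptions, not details confirmed by the talk.

```python
import numpy as np

rng = np.random.default_rng(2013)
n = 10_000

# Mixture parameters from the slide: mu1=(2,2,2,2), mu2=(2.5,2.5,2.5,2.5),
# sigma1 = diag(0.1), sigma2 = diag(0.4), pi1 = 0.8, pi2 = 0.2.
mu1, mu2 = np.full(4, 2.0), np.full(4, 2.5)
sd1, sd2 = np.sqrt(0.1), np.sqrt(0.4)   # assuming diag(.) lists variances
pi1 = 0.8

labels = rng.random(n) < pi1            # True -> component 1 (the dominant "normal" bulk)
X = np.where(labels[:, None],
             rng.normal(mu1, sd1, size=(n, 4)),
             rng.normal(mu2, sd2, size=(n, 4)))

# 80/20 random train-test split, as in the slide.
perm = rng.permutation(n)
train_idx, test_idx = perm[:8000], perm[8000:]
X_train, X_test = X[train_idx], X[test_idx]
```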

35 Results: Comparing machine learning algorithms for the simulated data (testing sample size). Table columns: Classifier; Time; False negatives; False positives. Classifiers compared: L2E, EM, Trees, SVM, NN.

36 Conclusion: Significance of our L2E approach Does not require labeled training data or any special configuration. Easy to use. Achieves accuracy efficiently, without computational overhead. Results are comparable to SVM and other machine learning algorithms.

37 Current and Future Work Evaluating the performance using multiple network traffic data sets and for speaker recognition. Applying the method to real data sets with higher dimensions and a large number of components. Estimating the number of components. Data mining: random forests/boosting.

38 Some References L2E Estimation of Mixture Complexity for Count Data, CSDA (Oct. 2009). Simultaneous Robust Estimation in Finite Mixtures: The Continuous Case, JISA (Special Golden Jubilee issue, 2012). Detection of Anomalies in Network Traffic Using L2E for Accurate Speaker Recognition, IEEE Midwest, August. The Elements of Statistical Learning (book), http://www-stat.stanford.edu/~tibs/elemstatlearn/
