Unsupervised Anomaly Detection for High Dimensional Data


Unsupervised Anomaly Detection for High Dimensional Data
Department of Mathematics, Rowan University
July 19th, 2013. International Workshop in Sequential Methodologies (IWSM-2013)

Outline of Talk
- Motivation: biometrics
- SVM (supervised learning) approach
- Unsupervised L2E estimation approach
- Experimental results
- Concluding remarks

Introduction
- We are drowning in the deluge of data being collected world-wide, while starving for knowledge at the same time.
- Anomalous events occur relatively infrequently.
- When they do occur, however, their consequences can be quite dramatic, and quite often negative.

Need for Accurate Speaker Recognition
- Speaker recognition: recognizing a person based on his or her voice, one form of biometric identification.
- Accurate and scalable speaker recognition is needed for VoIP applications.
- Applications in diverse areas: telephone, internet banking, online trading, forensics.
- Security enforcement in corporate and government sectors.

What is intrusion detection? Intrusions are activities that violate the security policy of a system. Intrusion detection is the process used to identify malicious behavior that targets a network and its resources.

Intrusion Detection Systems
- Intrusion detection systems (IDSs) play a key role as a defense mechanism against malicious attacks in network security.
- They monitor traffic between users and networks for abnormal activity.
- They analyze patterns and signatures in the data packets.

Intrusion Detection Techniques
- Misuse intrusion detection: intrusion signatures
- Statistical/anomaly intrusion detection

Misuse Intrusion Detection
- Catches intrusions in terms of the characteristics of known attacks or system vulnerabilities.
- Built with knowledge of bad behaviors.
- Collection of signatures: signature analysis.
- Examines the event stream for signature matches: pattern matching.
- Cannot detect novel or unknown attacks.

Anomaly Detection?
- An anomaly is a pattern in the data that does not conform to the expected behavior.
- Also referred to as outliers, exceptions, peculiarities, surprises, etc.
- Detects any action that significantly deviates from the normal behavior.
- Built with knowledge of normal behaviors.
- Examines the event stream for deviations from normal.

Applications of Anomaly Detection
- Network intrusion detection
- Insurance / credit card fraud detection
- Healthcare informatics / medical diagnostics
- Industrial damage detection
- Image processing / video surveillance
- Novel topic detection in text mining

Real-World Anomalies

Key Challenges
- Defining a representative normal region is challenging.
- The boundary between normal and outlying behavior is often not precise.
- The exact notion of an outlier differs across application domains.
- Availability of labeled data for training/validation.
- Data are extremely large and noisy, and can be complex.
- Normal behavior keeps evolving.
- Fast and accurate real-time detection is required.

Novelty Detection
- Identification of new or unknown data or signals that a machine learning system was not aware of during training.
- A fundamental requirement of a good classification or identification system.
- Abnormalities are very rare, or there may be no data describing the faulty conditions.

Techniques/Approaches to Detect Anomalies
- Supervised: the data (observations, measurements, etc.) are labeled with pre-defined classes.
- Unsupervised: class labels of the data are unknown; given a set of data, the task is to establish the existence of classes or clusters in the data.

Support Vector Machine (SVM)
- A popular supervised anomaly detection technique.
- SVMs are linear classifiers that find a hyperplane separating two classes of data, positive and negative.
- The features common to the normal and adversary groups must be learned and differentiated.
- By discovering the key characteristics of network traffic patterns, a decision boundary is superimposed on the space of feature representations.

SVM for Network Traffic Classification
- Effectively learn the patterns of network traffic and detect measurements deemed untrustworthy, coming from malicious targets.
- Eliminates the need for arbitrary assumptions about the underlying network topology, parameters, or thresholds in favor of direct training data.
- Discovers the key characteristics of network traffic patterns by superimposing a boundary in the space of measurements.

SVM Framework
- Cast the problem of detecting malicious nodes as an SVM classification problem.
- Labeled training examples: $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the representation of the $i$-th example in the feature space and $y_i \in \{1, -1\}$ is the corresponding label.
- Decision boundary function: $y(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0$, where $\mathbf{w}$ is the weight vector and $w_0$ is the bias.

SVM Framework
- Network traffic features: $\mathbf{x}$
- Quantities optimized: $\mathbf{w}$ and $w_0$
- Prediction of a training-set label: $\mathbf{w} \cdot \mathbf{x} + w_0$

SVM Optimization Problem
$$\min_{\mathbf{w}, w_0, \boldsymbol{\varepsilon}} \; \frac{1}{2}\|\mathbf{w}\|^2 + \gamma \sum_{i=1}^{N} \varepsilon_i \quad \text{subject to} \quad y_i\left(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + w_0\right) \ge 1 - \varepsilon_i, \;\; \varepsilon_i \ge 0 \;\; \forall i,$$
where $N$ is the number of training examples, the $\varepsilon_i$ are non-negative slack variables that account for possible misclassifications, and $\gamma$ is the trade-off factor between the slack variables and the regularization on the norm of the weight vector $\mathbf{w}$. The constraint in this minimization implies that we want our predictions, $\mathbf{w} \cdot \Phi(\mathbf{x}_i) + w_0$, to be similar to the labels.

Solution to the SVM Optimization Problem
- Solve the optimization by quadratic programming in the dual.
- Estimate the parameters by cross-validation on the training set.
- Given $\mathbf{w}$ and $w_0$, predict whether a node is an adversary by the sign of $\mathbf{w} \cdot \Phi(\mathbf{x}) + w_0$.
- The LibSVM package is used to implement the SVM-based anomaly detection.
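To make the pipeline above concrete, here is a minimal sketch in Python using scikit-learn's SVC, which wraps the LibSVM library mentioned on the slide. The feature matrix and labels are synthetic placeholders rather than the talk's network traffic data, and note that the slide's trade-off factor $\gamma$ corresponds to the parameter C in this interface.

    # Minimal sketch: SVM-based anomaly detection with scikit-learn's SVC,
    # a wrapper around LibSVM. X and y below are synthetic stand-ins.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))               # stand-in traffic features
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # stand-in labels (+1 / -1)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    # Cross-validate the slack/regularization trade-off on the training set
    # (the slide's gamma, called C here), as described above.
    grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0, 10.0]}, cv=5)
    grid.fit(X_tr, y_tr)

    # Predict by the sign of the decision function w.Phi(x) + w0.
    predictions = np.sign(grid.decision_function(X_te))
    print("test accuracy:", (predictions == y_te).mean())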

Key Challenges in Supervised Learning
- Defining a representative normal region is challenging.
- The boundary between normal and outlying behavior is often not precise.
- The exact notion of an outlier differs across application domains.
- Availability of labeled data for training/validation.
- Data are extremely large and noisy, and can be complex.
- Normal behavior keeps evolving.
- Fast and accurate real-time detection is required.

What is a Mixture Model?
Let $f_{\theta_m}(\mathbf{x})$ denote the general mixture probability density function with $m$ components:
$$f_{\theta_m}(\mathbf{x}) = \sum_{i=1}^{m} \pi_i\, f(\mathbf{x} \mid \phi_i), \qquad \pi_i \ge 0, \quad \sum_{i=1}^{m} \pi_i = 1,$$
where $\theta_m = (\pi_1, \ldots, \pi_{m-1}, \pi_m, \phi_1^T, \ldots, \phi_m^T)^T$. In theory, the $f(\mathbf{x} \mid \phi_i)$ could be any parametric densities, although in practice they are often taken from the same parametric family (usually Gaussian).
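As a quick illustration, the following sketch evaluates such a mixture density at a few points with SciPy, assuming Gaussian components and purely illustrative parameter values:

    # Evaluate a Gaussian mixture density f(x) = sum_i pi_i * phi(x | mu_i, Sigma_i).
    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_pdf(x, weights, means, covs):
        """Density of a Gaussian mixture at the points in x."""
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                   for w, m, c in zip(weights, means, covs))

    # Two-component example in two dimensions (illustrative values only).
    weights = [0.8, 0.2]
    means = [np.zeros(2), np.ones(2)]
    covs = [np.eye(2), 0.5 * np.eye(2)]
    print(mixture_pdf(np.array([[0.0, 0.0], [1.0, 1.0]]), weights, means, covs))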

Estimation Approach with Built-in Robustness using L2E
When $m$ is known, we want $f_{\theta_m}(\mathbf{x})$ to be close to the true density $g(\mathbf{x})$ in $L_2$ distance. That is,
$$L_2(f_{\theta_m}, g) = \int \left[ f_{\theta_m}(\mathbf{x}) - g(\mathbf{x}) \right]^2 d\mathbf{x}.$$
The aim is to derive an estimate of $\theta_m$ that minimizes this $L_2$ distance.

Estimation Approach with Built-in Robustness
$$L_2(f_{\theta_m}, g) = \int f_{\theta_m}^2(\mathbf{x})\, d\mathbf{x} - 2 \int f_{\theta_m}(\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x} + \int g^2(\mathbf{x})\, d\mathbf{x}.$$

Estimation Approach with Built-in Robustness
- The last integral is constant with respect to $\theta_m$ and can be ignored.
- The first integral is often available as a closed-form expression.
- The second integral is simply the average height of the density estimate over the sample, so the middle term may be estimated by $2n^{-1} \sum_{i=1}^{n} f_{\theta_m}(X_i)$, where the $X_i$ are the sample observations.

Computational Algorithm
The L2E estimator of $\theta_m$ is given by
$$\hat{\theta}_m^{L_2E} = \arg\min_{\theta_m} \left[ \int f_{\theta_m}^2(\mathbf{x})\, d\mathbf{x} - 2n^{-1} \sum_{i=1}^{n} f_{\theta_m}(X_i) \right].$$

Computational Algorithm
Normal identity:
$$\int \phi(x \mid \mu_1, \sigma_1^2)\, \phi(x \mid \mu_2, \sigma_2^2)\, dx = \phi(\mu_1 - \mu_2 \mid 0, \sigma_1^2 + \sigma_2^2),$$
where $\phi(x \mid \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. For multivariate Gaussian mixtures (GMMs), with $f(\mathbf{x} \mid \phi_i) = \phi(\mathbf{x} \mid \mu_i, \Sigma_i)$, the use of the above identity reduces the key integral to
$$\int f_{\theta_m}^2(\mathbf{x})\, d\mathbf{x} = \sum_{k=1}^{m} \sum_{l=1}^{m} \pi_k \pi_l\, \phi(\mu_k - \mu_l \mid 0, \Sigma_k + \Sigma_l),$$
making the integral tractable and thereby significantly reducing the computations involved in minimizing the L2E criterion. Thus, the L2E estimation may be performed by any standard optimization algorithm.
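Putting the pieces together, here is a minimal sketch of the L2E criterion for a two-component univariate Gaussian mixture, minimized with a general-purpose optimizer. The parameterization, starting values, and data are illustrative assumptions, not the exact setup used in the talk; the closed-form term uses the normal identity above, and the identity itself is spot-checked numerically at the end.

    # Sketch: L2E estimation of a two-component 1-D Gaussian mixture.
    # Criterion: integral of f^2 (closed form via the normal identity)
    # minus 2/n * sum_i f(X_i). Parameterization and data are illustrative.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def l2e_criterion(params, x):
        # params: logit of pi_1, two means, two log standard deviations
        pi1 = 1.0 / (1.0 + np.exp(-params[0]))
        w = np.array([pi1, 1.0 - pi1])
        mu = params[1:3]
        sd = np.exp(params[3:5])
        # Closed-form integral of f^2 via the normal identity:
        # sum_{k,l} pi_k pi_l phi(mu_k - mu_l | 0, var_k + var_l)
        quad_term = sum(w[k] * w[l] *
                        norm.pdf(mu[k] - mu[l], 0.0,
                                 np.sqrt(sd[k]**2 + sd[l]**2))
                        for k in range(2) for l in range(2))
        # Sample estimate of 2 * (average height of f at the observations)
        fx = w[0] * norm.pdf(x, mu[0], sd[0]) + w[1] * norm.pdf(x, mu[1], sd[1])
        return quad_term - 2.0 * fx.mean()

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 800), rng.normal(4.0, 0.5, 200)])
    theta0 = np.array([0.0, -0.5, 3.0, 0.0, 0.0])  # rough starting values
    res = minimize(l2e_criterion, theta0, args=(x,), method="Nelder-Mead")
    print("fitted parameters:", res.x)

    # Spot-check of the normal identity used for the closed-form term.
    from scipy.integrate import quad
    lhs, _ = quad(lambda t: norm.pdf(t, 0.3, 0.7) * norm.pdf(t, 1.1, 0.5),
                  -np.inf, np.inf)
    rhs = norm.pdf(0.3 - 1.1, 0.0, np.sqrt(0.7**2 + 0.5**2))
    print(lhs, rhs)  # the two values agree to numerical precision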

Data Analysis
- The effective detection and identification of anomalies in traffic requires the ability to separate them from normal network traffic.
- Network traffic data set from the University of New Mexico: trace files containing 13831 sample observations with process IDs and their respective system calls.
- We apply our (unsupervised) L2E approach and compare its performance with the (supervised) SVM.

Results: accuracy with increasing dimensions (70%-30% train-test partition of the data)

Dimensions | L2E False Detection Rate | L2E True Detection Rate | SVM True Detection Rate
2 | 0.774 | 1.000 | 0.9926
3 | 0.663 | 1.000 | 0.9926
4 | 0.561 | 1.000 | 0.9924
5 | 0.390 | 0.989 | 0.9924
6 | 0.322 | 1.000 | 0.9924
7 | 0.189 | 1.000 | 0.9924
8 | 0.000 | 0.980 | 0.9924

Results: accuracy with varied train-test partitions (using 8 dimensions of the data)

Train-Test | L2E False Detection Rate | L2E True Detection Rate | SVM True Detection Rate
50-50 | 0.0003 | 0.9884 | 0.9914
60-40 | 0.0001 | 0.9786 | 0.9919
70-30 | 0.0002 | 0.9836 | 0.9920
80-20 | 0.0002 | 0.9781 | 0.9898
90-10 | 0.0001 | 0.9814 | 0.9884

Results: accuracy with increasing testing sample size (70%-30% train-test partition, using 8 dimensions of the data)

Testing Sample Size | L2E False Detection Rate | L2E True Detection Rate | SVM True Detection Rate
500 | 0.0000 | 0.9792 | 0.9960
1000 | 0.0000 | 0.9744 | 0.9960
1500 | 0.0000 | 0.9686 | 0.9920
2000 | 0.0000 | 0.9814 | 0.9935
2500 | 0.0004 | 0.9844 | 0.9912
3000 | 0.0000 | 0.9876 | 0.9907
3500 | 0.0003 | 0.9840 | 0.9925
4000 | 0.0003 | 0.9836 | 0.9927

Observations
- The false detection rate of the SVM is zero in all scenarios on this data set.
- Despite the lack of labeled training data, the true detection rate of the L2E algorithm is comparable to that of the SVM in all scenarios.

Analysis of Simulated Data
- Case: 5 dimensions, n = 10000, with an 80/20 random train-test split.
- Data set: $\mu_1 = (2, 2, 2, 2)$, $\mu_2 = (2.5, 2.5, 2.5, 2.5)$, $\Sigma_1 = \mathrm{diag}(0.1)$, $\Sigma_2 = \mathrm{diag}(0.4)$, $\pi_1 = 0.8$, $\pi_2 = 0.2$.
- We apply our (unsupervised) L2E approach and compare its performance with the (supervised) SVM and some other machine learning algorithms; see the sketch after this list.
- Classification accuracy for L2E is better than the alternatives.
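For reproducibility, a sketch of the data-generating step under the stated parameters follows. Two assumptions: the slide says five dimensions while the listed mean vectors have four components, so the sketch follows the listed vectors, and the listed diagonal entries are treated as variances.

    # Generate the simulated mixture data as described on the slide.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 10000
    mu = np.array([[2.0, 2.0, 2.0, 2.0],
                   [2.5, 2.5, 2.5, 2.5]])
    sd = np.sqrt([0.1, 0.4])            # Sigma_1 = diag(.1), Sigma_2 = diag(.4)
    labels = rng.choice([0, 1], size=n, p=[0.8, 0.2])  # pi_1 = 0.8, pi_2 = 0.2
    X = rng.normal(loc=mu[labels], scale=sd[labels, None])

    # 80/20 random train-test split, as on the slide.
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              random_state=0)
    print(X_tr.shape, X_te.shape)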

Results: comparing machine learning algorithms on the simulated data

Classifier | Time | False -ve | False +ve
L2E | 2.1 | 0.0345 | 0.0055
EM | 16 | 0.0315 | 0.006
Trees | 0.31 | 0.186 | 0.011
SVM | 1.95 | 0.167 | 0.007
NN | 5.2 | 0.214 | 0.01

Conclusion: Significance of our L2E Approach
- Does not require labeled training data or special configuration.
- Ease of use.
- Achieves accuracy efficiently, without computational overhead.
- Results are comparable to SVM and other machine learning algorithms.

Current and Future Work
- Evaluating the performance on multiple network traffic data sets for speaker recognition.
- Applying the method to real data sets with higher dimensions and a large number of components.
- Estimating the number of components.
- Data mining: random forests / boosting.

Some References
- L2E Estimation of Mixture Complexity for Count Data. CSDA, October 2009.
- Simultaneous Robust Estimation in Finite Mixtures: The Continuous Case. JISA (Special Golden Jubilee issue, 2012).
- Detection of Anomalies in Network Traffic using L2E for Accurate Speaker Recognition. IEEE Midwest, August 2012.
- The Elements of Statistical Learning (book): http://www-stat.stanford.edu/~tibs/elemstatlearn/