Proximity-Based Anomaly Detection using Sparse Structure Learning

Proximity-Based Anomaly Detection using Sparse Structure Learning. Tsuyoshi Idé (IBM Tokyo Research Lab), Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center). SDM 2009.

Goal: compute an anomaly score for each variable so as to capture anomalous behavior in the dependencies between variables. Some anomalies cannot be detected by looking at individual variables in isolation (e.g., no increase in RPM when accelerating): in the example figure, something is wrong between x2 and x4 even though each variable looks normal on its own. In practice, the pairwise dependency information must be reduced to a single anomaly score per variable, computed by comparison with reference data.

Difficulty: correlation values are extremely unstable (1/2). Example from econometric data: daily spot prices of foreign currencies in dollars, taken over two different terms. There is no evidence that the international relationships changed between the two terms; nevertheless, most of the correlation coefficients are completely different between term 1 and term 2. Data source: http://www.stat.duke.edu/data-sets/mw/ts_data/all_exrates.html

Difficulty: correlation values are extremely unstable (2/2). We can still make meaningful comparisons by focusing on neighborhoods. Important observation: highly correlated pairs are stable. Therefore, look only at the neighborhood of each variable to obtain robust comparisons.

We want to remove spurious dependencies caused by noise and keep only the essential ones. Input: multivariate (time-series) data. Output: a weighted graph representing the essential dependencies between variables; the graph will be sparse. Node = variable; edge = dependency between two variables; no edge = the two variables are conditionally independent given the rest.

Approach: (1) select the neighbors of each variable using a sparse structure-learning method, then (2) compute an anomaly score for each variable based on the selected neighbors. Our problem: compute an anomaly (or change) score for each variable by comparing the target data with reference data.

We use the Graphical Gaussian Model (GGM) for structure learning, where each graph corresponds uniquely to a precision matrix. The precision matrix Λ is the inverse of the covariance matrix S. General rule: there is no edge between two variables if the corresponding element of Λ is zero. Example 1: if Λ12 = 0, there is no edge between x1 and x2, implying they are statistically independent given the rest of the variables. Why? Because Λ12 = 0 means the joint Gaussian density contains no term coupling x1 and x2, so the distribution factorizes accordingly. Example 2: a six-variable case, in which the zero pattern of Λ maps directly onto the graph.
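As a concrete illustration of the rule (a hypothetical three-variable example, not from the slides), the following numpy sketch builds a chain x1 - x3 - x2: the precision matrix has Λ12 = 0, so the partial correlation between x1 and x2 is zero (no edge), even though their marginal covariance is nonzero.

```python
import numpy as np

# Hypothetical 3-variable chain x1 -- x3 -- x2: x1 and x2 are conditionally
# independent given x3, so the precision matrix has a zero at position (x1, x2).
Lambda = np.array([[ 1.0,  0.0, -0.4],
                   [ 0.0,  1.0, -0.4],
                   [-0.4, -0.4,  1.0]])   # precision matrix (inverse covariance)
Sigma = np.linalg.inv(Lambda)             # marginal covariance: Sigma[0, 1] != 0

# Partial correlation from the precision matrix: rho_ij = -L_ij / sqrt(L_ii * L_jj)
d = np.diag(Lambda)
partial_corr = -Lambda / np.sqrt(np.outer(d, d))
np.fill_diagonal(partial_corr, 1.0)
print(partial_corr[0, 1], Sigma[0, 1])    # 0.0 (no edge) vs. a nonzero covariance
```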

Recent trends in GGM: classical methods are being replaced by modern sparse algorithms. Covariance selection (the classical approach): Dempster [1972] sequentially prunes the smallest elements of the precision matrix; Drton and Perlman [2008] improve the statistical tests used for pruning. Serious limitation in practice: these methods break down when the covariance matrix is not invertible. L1-regularization-based methods (a very active area): Meinshausen and Bühlmann [Ann. Stat. 2006] use Lasso regression for neighborhood selection; Banerjee et al. [JMLR 2008] give a block sub-gradient algorithm for finding the precision matrix; Friedman et al. [Biostatistics 2008] derive efficient fixed-point equations based on a sub-gradient algorithm. With these methods, structure learning is possible even when the number of variables exceeds the number of samples.

One-page summary of the Meinshausen-Bühlmann (MB) algorithm: solve a separate Lasso problem for every single variable. Step 1: pick one variable. Step 2: treat it as the target y and the remaining variables as the predictors z. Step 3: solve the Lasso regression problem of y on z. Step 4: connect the k-th node to the nodes whose weights in w are nonzero.
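A minimal sketch of this recipe using scikit-learn's Lasso; the penalty value and the OR rule used to symmetrize the graph are illustrative choices, not details fixed by the slide.

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_neighborhood_selection(X, alpha=0.1):
    """Meinshausen-Buhlmann-style neighborhood selection (illustrative sketch).

    X     : (n_samples, n_vars) standardized data matrix
    alpha : L1 penalty for each per-variable Lasso (hypothetical value)
    Returns a boolean adjacency matrix, symmetrized with the OR rule.
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for k in range(p):                      # Step 1: pick one variable
        y = X[:, k]                         # Step 2: it becomes the target y ...
        Z = np.delete(X, k, axis=1)         # ... and the rest become predictors z
        w = Lasso(alpha=alpha, fit_intercept=False).fit(Z, y).coef_   # Step 3
        others = np.delete(np.arange(p), k)
        adj[k, others[w != 0]] = True       # Step 4: connect to nonzero weights
    return adj | adj.T
```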

Instead, we solve an L1-regularized maximum-likelihood problem for structure learning. Input: the covariance matrix S, computed from standardized data (mean 0, variance 1). S is generally rank-deficient, so its inverse does not exist. Output: a sparse precision matrix Λ. Λ is nominally defined as the inverse of S, but S cannot be inverted directly, so we need to find a sparse matrix that can be thought of as an inverse of S. Approach: maximize the L1-regularized log-likelihood Λ* = arg max_Λ [ log det Λ − tr(SΛ) − ρ‖Λ‖₁ ], where the first two terms are the Gaussian log-likelihood and the last term is the sparsity-inducing regularizer.
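For comparison with current tooling: scikit-learn's GraphicalLasso solves essentially this objective, with its alpha playing the role of ρ. A sketch on made-up standardized data, not the implementation used in the paper:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Made-up standardized data X (n_samples x n_vars); in the talk, S is the
# covariance (correlation) matrix of standardized data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# GraphicalLasso maximizes roughly  log det(L) - tr(S L) - alpha * ||L||_1;
# alpha = 0.1 is an illustrative value only.
Lambda_hat = GraphicalLasso(alpha=0.1).fit(X).precision_
edges = (np.abs(Lambda_hat) > 1e-8) & ~np.eye(10, dtype=bool)   # sparse graph
```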

From matrix optimization to vector optimization: solve a coupled Lasso problem for every single variable. Focus on one row (equivalently, one column) of the precision matrix, keeping the others fixed; the optimization problem for that row can be shown to be a Lasso problem, i.e. an L1-regularized quadratic program (see the paper for the derivation). Difference from MB's approach: the resulting Lasso problems are coupled. The "fixed" part is not actually constant; it changes after each Lasso sub-problem is solved. This coupling is essential for stability under noise, as discussed later.
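For reference, the row sub-problem in the standard graphical-lasso formulation of Friedman et al. [2008] takes the following Lasso form; the paper's own derivation may use different notation, so treat this as background rather than the paper's exact equations.

```latex
% Row sub-problem of the graphical lasso (Friedman et al., 2008), with
% W = current estimate of Sigma = Lambda^{-1}, partitioned so that W_{11} is the
% block held fixed and w_{12} is the row/column currently being updated:
\[
  \hat{\beta} \;=\; \arg\min_{\beta}\;
    \tfrac{1}{2}\,\bigl\lVert W_{11}^{1/2}\,\beta - W_{11}^{-1/2}\,s_{12} \bigr\rVert_2^{2}
    \;+\; \rho\,\lVert \beta \rVert_{1},
  \qquad
  w_{12} \;=\; W_{11}\,\hat{\beta}.
\]
```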

Defining the anomaly score using the sparse graphical models. We now have two Gaussians, one fitted to the reference data (A) and one to the target data (B). We use the Kullback-Leibler divergence between them as the discrepancy metric, restricted to each variable and the neighbors selected for it in data sets A and B. The resulting anomaly score d_i^{AB} of the i-th variable decomposes into (change in the degree of node x_i) + (change in the tightness of the coupling between x_i and its neighbors) + (change in the variance of x_i itself).
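For reference, the closed-form KL divergence between two zero-mean Gaussians written in terms of their precision matrices; the paper's score d_i^{AB} restricts this idea to x_i and its selected neighbors, so the sketch below is the generic building block, not the paper's final formula.

```python
import numpy as np

def kl_zero_mean_gaussians(Lambda_A, Lambda_B):
    """KL( N(0, Lambda_A^{-1}) || N(0, Lambda_B^{-1}) ), parameterized by precisions."""
    d = Lambda_A.shape[0]
    Sigma_A = np.linalg.inv(Lambda_A)
    _, logdet_A = np.linalg.slogdet(Lambda_A)
    _, logdet_B = np.linalg.slogdet(Lambda_B)
    # 0.5 * [ tr(Lambda_B Sigma_A) - d + log det(Lambda_A) - log det(Lambda_B) ]
    return 0.5 * (np.trace(Lambda_B @ Sigma_A) - d + logdet_A - logdet_B)
```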

Experiment (1/4) -- Structure learning under collinearity: experimental settings. Data: the daily spot-price data, in which strong collinearity exists (see the earlier slides); we focus on a single term. We observe how the learned structure changes after introducing noise: perform structure learning on the original data, then learn again after adding Gaussian noise whose standard deviation is 10% of that of the original data. Three structure-learning methods are compared: Glasso (Friedman, Hastie & Tibshirani, Biostatistics, 2008); Lasso (Meinshausen & Bühlmann, Ann. Stat., 2006); and AdaLasso, an improved version of MB's algorithm in which each regression uses the Adaptive Lasso [H. Zou, JASA, 2006] rather than the plain Lasso.
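A sketch of how this perturbation protocol might look in code, reusing the hypothetical GraphicalLasso idea from the earlier sketch; the alpha value and threshold are illustrative and the spot-price data are not included.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def fit_adjacency(X, alpha=0.1, tol=1e-8):
    """Fit Glasso and return the boolean edge pattern of the precision matrix."""
    P = GraphicalLasso(alpha=alpha).fit(X).precision_
    return (np.abs(P) > tol) & ~np.eye(X.shape[1], dtype=bool)

def add_noise(X, rng, frac=0.10):
    """Add Gaussian noise with sigma = 10% of each variable's standard deviation."""
    return X + rng.standard_normal(X.shape) * (frac * X.std(axis=0))

# Usage (X = standardized spot-price matrix, not included here):
#   rng = np.random.default_rng(0)
#   A_clean = fit_adjacency(X)
#   A_noisy = fit_adjacency(add_noise(X, rng))
```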

Experiment (2/4) -- Structure learning under collinearity: only the graphical lasso was stable. MB's algorithm does not work under collinearity, whereas Glasso shows reasonable stability. This reflects the general tendency of the Lasso to select one of a group of correlated features almost at random (cf. Bolasso [Bach 08], Stability Selection [MB 08]); Glasso avoids the problem by solving the coupled version of the Lasso. Lesson: do not reduce structure learning to separate regression problems for individual variables; treat the precision matrix as a matrix. Metrics reported: sparsity = ratio of disconnected edges to all possible edges; flip probability = proportion of edges that change after introducing the noise.
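One plausible way to compute the two metrics from boolean adjacency matrices; the exact normalizations are my reading of the slide, not definitions quoted from the paper.

```python
import numpy as np

def sparsity(adj):
    """Ratio of disconnected (absent) edges to all possible edges."""
    p = adj.shape[0]
    n_possible = p * (p - 1) // 2
    return 1.0 - np.triu(adj, k=1).sum() / n_possible

def flip_probability(adj_before, adj_after):
    """Fraction of possible edges whose presence changes after adding noise."""
    p = adj_before.shape[0]
    n_possible = p * (p - 1) // 2
    return np.triu(adj_before ^ adj_after, k=1).sum() / n_possible
```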

Experiment (3/4) -- Anomaly detection using automobile sensor data: experimental settings. Data: automobile sensor data with 44 variables, 79 reference data sets and 20 faulty data sets. In the faulty data, two variables, x24 and x25 (not shown), exhibit a correlation anomaly (a loss of correlation). We compute a set of anomaly scores for each of the 79 x 20 pairs of reference and faulty data sets, and summarize the results as an ROC curve. The area under the curve (AUC) equals 1 if the top two variables by anomaly score are always the truly faulty ones.
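One plausible realization of this AUC criterion with scikit-learn; the 0-based indices of the faulty variables and the per-pair averaging are my assumptions, not details given on the slide.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_pair_auc(score_vectors, faulty_idx=(23, 24), n_vars=44):
    """Average per-pair ROC AUC over all (reference, faulty) data pairs.

    Equals 1 only if, in every pair, the two truly faulty variables receive
    the two highest anomaly scores.  score_vectors: list of length-44 arrays.
    """
    labels = np.zeros(n_vars, dtype=int)
    labels[list(faulty_idx)] = 1          # assumed 0-based positions of x24, x25
    aucs = [roc_auc_score(labels, s) for s in score_vectors]
    return float(np.mean(aucs))
```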

Experiment (4/4) -- Anomaly detection using automobile sensor data: our method substantially reduced false positives. Methods compared: a likelihood-based score (conventional); a k-NN method for neighborhood selection; a stochastic neighborhood-selection method [Idé et al., ICDM 07]; and our approach. Our KL-divergence-based method gives the best results.

Thank you!