Department of Computer Science, University of Pittsburgh. Brigham and Women's Hospital and Harvard Medical School

Similar documents
Statistical Methods for Forecasting

A Flexible Forecasting Framework for Hierarchical Time Series with Seasonal Patterns: A Case Study of Web Traffic

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017

Predicting freeway traffic in the Bay Area

Time Series 4. Robert Almgren. Oct. 5, 2009

Lecture 2: Univariate Time Series

Time Series Analysis -- An Introduction -- AMS 586

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

LOAD FORECASTING APPLICATIONS for THE ENERGY SECTOR

Graph-Based Anomaly Detection with Soft Harmonic Functions

Section 6-5 THE CENTRAL LIMIT THEOREM AND THE SAMPLING DISTRIBUTION OF. The Central Limit Theorem. Central Limit Theorem: For all samples of

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall.

WRF Webcast. Improving the Accuracy of Short-Term Water Demand Forecasts

Forecasting: principles and practice. Rob J Hyndman 1.1 Introduction to Forecasting

A SEASONAL TIME SERIES MODEL FOR NIGERIAN MONTHLY AIR TRAFFIC DATA

1. Fundamental concepts

Journal of Asian Scientific Research PREDICTION OF FATAL ROAD TRAFFIC CRASHES IN IRAN USING THE BOX-JENKINS TIME SERIES MODEL. Ayad Bahadori Monfared

Time Series Outlier Detection

Learning Classification with Auxiliary Probabilistic Information Quang Nguyen Hamed Valizadegan Milos Hauskrecht

Chapter 3 Probability Distribution

Forecasting: Principles and Practice. Rob J Hyndman. 1. Introduction to forecasting OTexts.org/fpp/1/ OTexts.org/fpp/2/3

Traffic Profile Prediction

The ARIMA Procedure: The ARIMA Procedure

Solar irradiance forecasting for Chulalongkorn University location using time series models

Evaluating Electricity Theft Detectors in Smart Grid Networks. Group 5:

Using Temporal Hierarchies to Predict Tourism Demand

Anomaly Detection for the CERN Large Hadron Collider injection magnets

Chapter (4) Discrete Probability Distributions Examples

Capacity Planning and Headroom Analysis for Taming Database Replication Latency - Experiences with LinkedIn Internet Traffic

at least 50 and preferably 100 observations should be available to build a proper model

Study of Time Series and Development of System Identification Model for Agarwada Raingauge Station

TRANSFER FUNCTION MODEL FOR GLOSS PREDICTION OF COATED ALUMINUM USING THE ARIMA PROCEDURE

Functional time series

Chapter 12: An introduction to Time Series Analysis. Chapter 12: An introduction to Time Series Analysis

Seasonal Adjustment using X-13ARIMA-SEATS

TIME SERIES ANALYSIS. Forecasting and Control. Wiley. Fifth Edition GWILYM M. JENKINS GEORGE E. P. BOX GREGORY C. REINSEL GRETA M.

Modelling Monthly Rainfall Data of Port Harcourt, Nigeria by Seasonal Box-Jenkins Methods

Some Time-Series Models

Optimization of Short-Term Traffic Count Plan to Improve AADT Estimation Error

Anomaly Detection. Jing Gao. SUNY Buffalo

FORECASTING THE INVENTORY LEVEL OF MAGNETIC CARDS IN TOLLING SYSTEM

MATH 250 / SPRING 2011 SAMPLE QUESTIONS / SET 3

TIME SERIES DATA PREDICTION OF NATURAL GAS CONSUMPTION USING ARIMA MODEL

Forecasting. Simon Shaw 2005/06 Semester II

A Comparison of the Forecast Performance of. Double Seasonal ARIMA and Double Seasonal. ARFIMA Models of Electricity Load Demand

Analyzing the effect of Weather on Uber Ridership

Empirical Approach to Modelling and Forecasting Inflation in Ghana

C.6 Normal Distributions

Integrated Electricity Demand and Price Forecasting

Prompt Network Anomaly Detection using SSA-Based Change-Point Detection. Hao Chen 3/7/2014

Development of the Probe Information Based System PROROUTE

Forecasting Using Time Series Models

Statistical Methods. for Forecasting

Bus 216: Business Statistics II Introduction Business statistics II is purely inferential or applied statistics.

A Robust Graph-based Algorithm for Detection and Characterization of Anomalies in Noisy Multivariate Time Series

Short-term traffic volume prediction using neural networks

TIME SERIES ANOMALY DETECTION. A practical guide to detecting anomalies in time series using artificial intelligence concepts

Discrete probability distributions

A Course in Time Series Analysis

Mean, Median, Mode, and Range

ECON 427: ECONOMIC FORECASTING. Ch1. Getting started OTexts.org/fpp2/

Probability and Probability Distributions. Dr. Mohammed Alahmed

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Time Series and Forecasting

Expectation propagation for signal detection in flat-fading channels

Inference for Distributions Inference for the Mean of a Population

An Overview of Outlier Detection Techniques and Applications

Land Use Modeling at ABAG. Mike Reilly October 3, 2011

Section 2.3: One Quantitative Variable: Measures of Spread

Parking Occupancy Prediction and Pattern Analysis

Automatic Forecasting

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN

SHORT TERM LOAD FORECASTING

Analysis. Components of a Time Series

Early Detection of Important Animal Health Events

Statistical Models and Algorithms for Real-Time Anomaly Detection Using Multi-Modal Data

Applied Statistics in Business & Economics, 5 th edition

Short-term electricity demand forecasting in the time domain and in the frequency domain

Forecasting using R. Rob J Hyndman. 1.3 Seasonality and trends. Forecasting using R 1

arxiv: v1 [stat.co] 11 Dec 2012

Basic Statistics and Probability Chapter 3: Probability

Dengue Forecasting Project

Estimation and application of best ARIMA model for forecasting the uranium price.

Stat 565. (S)Arima & Forecasting. Charlotte Wickham. stat565.cwick.co.nz. Feb

Practice problems from chapters 2 and 3

Elements of Multivariate Time Series Analysis

Absolute Value Equations and Inequalities

Anomaly Detection in Logged Sensor Data. Master s thesis in Complex Adaptive Systems JOHAN FLORBÄCK

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS

Time Series and Forecasting

Prediction of Annual National Coconut Production - A Stochastic Approach. T.S.G. PEIRIS Coconut Research Institute, Lunuwila, Sri Lanka

Chart types and when to use them

Firstly, the dataset is cleaned and the years and months are separated to provide better distinction (sample below).

AN OS METHOD FOR DECOMPOSITION OF CYCLIC COMPONENT SIGNALS

A Hybrid Method of Forecasting in the Case of the Average Daily Number of Patients

REVIEW OF SHORT-TERM TRAFFIC FLOW PREDICTION TECHNIQUES

Dennis Bricker Dept of Mechanical & Industrial Engineering The University of Iowa. Forecasting demand 02/06/03 page 1 of 34

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

Applied Time Series Analysis FISH 507. Eric Ward Mark Scheuerell Eli Holmes

Describing and Forecasting Video Access Patterns

Transcription:

Siqi Liu 1, Adam Wright 2, and Milos Hauskrecht 1 1 Department of Computer Science, University of Pittsburgh 2 Brigham and Women's Hospital and Harvard Medical School

Introduction Method Experiments and Results Conclusion

A time series is a sequence of data points indexed by (discrete) time. For example, a univariate time series y t R: t = 1, 2,. Generally, the points are not independent of each other.

Daily prices of stocks Monthly usage of electricity Daily temperature, humidity,... Patient s heart rate, blood pressure,... Number of items sold every month Number of cars passed through a highway every hour

By monitoring some attribute of a target (e.g., the heart rate of a patient), we naturally get a time series. Analyzing the time series gives us insights about the target. In this work, we are interested in finding outliers in the time series in real time.

Outliers are the points that do not follow the pattern of the majority of the data. More strictly, they are points that do not follow the probability distribution generating the majority of the data. Outliers provide useful insights, because they indicate anomaly or novelty, i.e., events requiring attention. extremely low volume on a highway traffic accident unusually frequent access to a server server being attacked increasing use of a rare word on a social network new trending topic

Detecting outliers in time series is challenging because of the nonstationarity (i.e., the distribution of the data changes over time) Specifically, the changes could be long-term changes periodic changes (a.k.a. seasonality) These hinder outlier detection, because they result in false positives and false negatives

An extreme value in the past could be normal now A normal value in the past could be extreme now outlier normal value outlier time

outlier value normal normal time

By considering the context, some outliers become normal; some normal points become outliers.

Existing work in outlier detection in time series usually assumes a model like autoregressive-moving-average (ARMA). (e.g., Tsay 1988; Yamanishi and Takeuchi 2002) These models cannot deal with nonstationary (seasonal) time series directly. A solution is to difference the time series, resulting in: autoregressiveintegrated-moving-average (ARIMA). We use it as a baseline in our experiments.

Introduction Method Experiments and Results Conclusion

A time t = 1, 2, we sequentially receive the observations of the target time series y = {y t R: t = 1,2, }, and the associated context variables x = x t R p : t = 1,2,. Our model consists of two layers: First layer uses a sliding window to compute a local score; Second layer combines the local score with the context variables to compute a global score (which is the final outlier score).

First, we decompose the time series (within a sliding window) into 3 components using a nonparametric decomposition algorithm called STL (Cleveland et al. 1990).

Then, we compute a local deviation score z t = y (R) (R) t μ t (R) σ t remainder. on the

At each time t, given (z t, x t ), where z t is the local score (first-layer output) and x t is the context variables, keep updating a Bayesian linear model z t w, β, x t N x t T w, β 1, with the conjugate prior w, β N w m 0, β 1 S 0 Gam β a 0, b 0. The model is built globally (aggregating all the information from the beginning), because contextual variables may correspond to rare events (e.g., holidays), but we need enough examples to have a good model; local scores are normalized locally, so no need to worry about nonstationarity.

The final outlier score is calculated based on the marginal distribution of z t given x t and the history z t D t 1, x t St(z t μ t, σ t 2, ν t ), where D t 1 = z u, x u u = 1, 2,, t 1}.

Introduction Method Experiments and Results Conclusion

Bike data consists of the time series (of length 733) that records the daily bike trip counts taken in San Francisco Bay Area through the bike share system from August 2013 to August 2015. CDS data consists of daily rule firing counts of a clinical decision support (CDS) system in a large teaching hospital. (111 time series of length 1187) Traffic data consists of time series of vehicular traffic volume measurements collected by sensors placed on major highways. (2 time series of length 365)

Outliers are injected into the time series by randomly sampling a small proportion p of points and changing their value by a specified size δ as y i = y i δ for each y i in the sample. We vary p and δ to see the effects.

RND - detects outliers randomly. SARI - ARIMA(1,1,0) (1,1,0) 7, ARIMA with a weekly (7 day) period, (seasonal) differencing, and (seasonal) order-1 autoregressive term. SIMA - ARIMA(0,1,1) (0,1,1) 7, ARIMA with a weekly period, (seasonal) differencing, and (seasonal) order-1 moving-average term. SARIMA - ARIMA(1,1,1) (1,1,1) 7, ARIMA combining the above two. ND - our first-layer STL-based model, using absolute value of the output as outlier scores. TL1 - our two-layer model using holiday information as a contextual variable. TL2 - our two-layer model using holiday and additional information (if available) as context variables.

Alert rate: the proportion of alerts raised out of all points. Precision: the proportion of true outliers out of alerts raised. We calculate AUC to compare the overall performance. Notice we focus on low-alertrate region for practicality.

By comparing the AUC, we have the following observations: When the size of the outliers are small, all methods perform similarly to random. In the other cases, our two-layer method is almost always the best method. Even using only the first-layer can achieve similar or better results as the ARIMA-based methods. Using additional information (e.g., using weather besides holiday info) improves the performance of the two-layer method.

Introduction Method Experiments and Results Conclusion

We have proposed a two-layer method to detect outliers in time series in real time. Our method takes account of the nonstationarity and the context of the data to detect outliers more accurately. Experiments on data sets from different domains have shown the advantages of our method.