Time Series Mining and Forecasting. Duen Horng (Polo) Chau, Assistant Professor, Associate Director, MS Analytics, Georgia Tech.


http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Time Series Mining and Forecasting Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

Outline: Motivation; Similarity search and distance functions; Linear forecasting; Non-linear forecasting; Conclusions

Problem definition. Given: one or more sequences x_1, x_2, …, x_t, … (and y_1, y_2, …, y_t, …). Find: similar sequences; forecasts; patterns; clusters; outliers.

Motivation - Applications: financial, sales, and economic series. Medical: ECGs, blood pressure, etc.; monitoring reactions to new drugs; elderly care.

Motivation - Applications (cont'd): smart house: sensors monitor temperature, humidity, air quality; video surveillance.

Motivation - Applications (cont'd): weather, environment/anti-pollution: volcano monitoring; air/water pollutant monitoring.

Motivation - Applications (cont'd): computer systems: "Active Disks" (buffering, prefetching); web servers (ditto); network traffic monitoring; ...

Stream Data: Disk accesses. (Plot: #bytes vs. time.)

Problem #1: Goal: given a signal (e.g., #packets over time), find patterns, periodicities, and/or compress. (Plot: count of lynx caught per year; similarly, packets per day or temperature per day.)

Problem #2: Forecast. Given x_t, x_{t-1}, …, forecast x_{t+1}. (Plot: number of packets sent vs. time tick, with the next value marked "??".)

Problem #2': Similarity search. E.g., find a 3-tick pattern similar to the last one. (Plot: number of packets sent vs. time tick.)

Problem #3: Given a set of correlated time sequences, forecast "Sent(t)". (Plot: number of packets sent, lost, and repeated vs. time tick.)

Important observations: patterns, rules, forecasting, and similarity indexing are closely related. To do forecasting, we need to find patterns/rules and to find similar settings in the past; to find outliers, we need forecasts (outlier = too far away from our forecast).

Outline Motivation Similarity search and distance functions Euclidean Time-warping...

Importance of distance functions: subtle, but absolutely necessary. A must for similarity indexing (-> forecasting); a must for clustering. Two major families: Euclidean and Lp norms; time warping and variations.

Euclidean and Lp distances between sequences x(t), y(t):
$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
L_1: city-block = Manhattan; L_2: Euclidean; L_∞.
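A minimal sketch of these two distance families in Python (plain NumPy; the example sequences are made up for illustration):

```python
import numpy as np

def euclidean(x, y):
    """L2 (Euclidean) distance between two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def lp(x, y, p=1):
    """General Lp distance: p=1 is city-block/Manhattan, p=2 is Euclidean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Example: two short time sequences of equal length
x = [1, 2, 3, 5, 8]
y = [1, 3, 3, 4, 9]
print(euclidean(x, y), lp(x, y, p=1))
```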

Observation #1: a time sequence of length n can be treated as a point/vector in n-dimensional space (axes: Day-1, Day-2, …, Day-n).

Observation #2: Euclidean distance is closely related to the cosine similarity, the dot product, and the cross-correlation function.

Time Warping: allow accelerations and decelerations (with or without penalty), THEN compute the (Euclidean) distance (+ penalty). Related to the string-editing distance.

Time Warping: "stutters" (one value in a sequence may be matched against several consecutive values in the other).

Time warping. Q: how to compute it? A: dynamic programming. D(i, j) = cost to match the prefix of length i of the first sequence x with the prefix of length j of the second sequence y.

Time warping: thus, with no penalty for stutter, for sequences x_1, x_2, …, x_i, …; y_1, y_2, …, y_j, …:
$$D(i, j) = \|x[i] - y[j]\| + \min \begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}$$
http://www.psb.ugent.be/cbd/papers/gentxwarper/dtwalgorithm.htm

Time warping: VERY SIMILAR to the string-editing distance:
$$D(i, j) = \|x[i] - y[j]\| + \min \begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}$$

Time warping complexity: O(M*N), quadratic in the length of the sequences. Many variations (penalty for stutters; limit on the number/percentage of stutters; …). Popular in voice processing [Rabiner + Juang].
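A compact dynamic-programming sketch of the recurrence above, with no stutter penalty and none of the window/penalty variations (plain Python/NumPy; illustrative only):

```python
import numpy as np

def dtw(x, y):
    """Time-warping distance via dynamic programming.
    D[i, j] = cost of matching the length-i prefix of x with the length-j prefix of y."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1],  # no stutter
                                 D[i, j - 1],      # x-stutter
                                 D[i - 1, j])      # y-stutter
    return D[m, n]

print(dtw([1, 2, 3, 3, 4], [1, 2, 3, 4]))  # 0.0: the repeated 3 is absorbed by a stutter
```

As the slide notes, the two nested loops make this O(M*N) in time (and, as written, in space).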

Other distance functions: piece-wise linear/flat approximation, then compare the pieces [Keogh+01] [Faloutsos+97]. Cepstrum (for voice [Rabiner + Juang]): do a DFT, take the log of the amplitude, do a DFT again. Allow for small gaps [Agrawal+95]. See the tutorial by [Gunopulos + Das, SIGMOD01].

Other Distance functions In [Keogh+, KDD 04]: parameter-free, MDL based

Conclusions: the prevailing distances are Euclidean and time-warping.

Outline Motivation Similarity search and distance functions Linear Forecasting Non-linear forecasting Conclusions

Linear Forecasting

Outline Motivation... Linear Forecasting Auto-regression: Least Squares; RLS Co-evolving time sequences Examples Conclusions

Problem #2: Forecast. Example: given x_{t-1}, x_{t-2}, …, forecast x_t. (Plot: number of packets sent vs. time tick, with the next value marked "??".)

Forecasting: Preprocessing. MANUALLY: remove trends; spot periodicities (e.g., a 7-day period). (Plots: a trending series and a periodic series vs. time.)
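One hedged way to do that preprocessing programmatically (NumPy only; the synthetic series and its 7-day period below are made-up illustrations): fit and subtract a linear trend, then look for a dominant period with the discrete Fourier transform.

```python
import numpy as np

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(200)  # trend + weekly period

# Remove the linear trend: fit a line by least squares and subtract it
slope, intercept = np.polyfit(t, x, 1)
detrended = x - (slope * t + intercept)

# Spot periodicities: find the peak of the amplitude spectrum (skipping the DC term)
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(len(detrended), d=1.0)
dominant = freqs[1:][np.argmax(spectrum[1:])]
print("dominant period is roughly", 1.0 / dominant, "ticks")  # about 7
```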

Problem #2: Forecast. Solution: try to express x_t as a linear function of the past x_{t-1}, x_{t-2}, … (up to a window of w). Formally: x_t ≈ a_1 x_{t-1} + a_2 x_{t-2} + … + a_w x_{t-w}. (Plot: number of packets sent vs. time tick, with the next value marked "??".)

(Problem: back-cast; interpolate.) Solution - interpolate: try to express x_t as a linear function of the past AND the future: x_{t+1}, x_{t+2}, …, x_{t+w_future}; x_{t-1}, …, x_{t-w_past} (up to windows of w_past, w_future). EXACTLY the same algorithms apply. (Plot: number of packets sent vs. time tick.)

Refresher: Linear Regression. Express what we don't know (the "dependent variable") as a linear function of what we know (the "independent variable(s)").
patient | weight | height
1       | 27     | 43
2       | 43     | 54
3       | 54     | 72
…       | …      | …
N       | 25     | ??
(Plot: body height vs. body weight.)
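A quick sketch of this refresher using the three complete rows of the table above (NumPy least squares; the predicted value is whatever these toy data give, not a number from the slides):

```python
import numpy as np

weight = np.array([27.0, 43.0, 54.0])   # independent variable
height = np.array([43.0, 54.0, 72.0])   # dependent variable

# Fit height ~ b0 + b1 * weight by least squares
A = np.column_stack([np.ones_like(weight), weight])
(b0, b1), *_ = np.linalg.lstsq(A, height, rcond=None)

print("predicted height for weight 25:", b0 + b1 * 25)
```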

Linear Auto-Regression.
Time | Packets Sent(t-1) | Packets Sent(t)
1    | -                 | 43
2    | 43                | 54
3    | 54                | 72
…    | …                 | …
N    | 25                | ??

Linear Auto-Regression (cont'd): lag-plot of #packets sent at time t vs. #packets sent at time t-1. Lag w = 1. Dependent variable = # of packets sent (S[t]); independent variable = # of packets sent (S[t-1]).

More details: Q1: Can it work with a window w > 1? A1: YES (we'll fit a hyper-plane, then). (3-d scatter plot: x_t vs. x_{t-1} and x_{t-2}.)

More details: Q1: Can it work with a window w > 1? A1: YES. The problem becomes: X[N×w] · a[w×1] = y[N×1] (OVER-CONSTRAINED). a is the vector of the regression coefficients; X holds the N values of the w independent variables; y holds the N values of the dependent variable.
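A sketch of how X and y can be assembled from a single series for a window w (the helper name lag_matrix and the sample values are made up for illustration):

```python
import numpy as np

def lag_matrix(series, w):
    """Build X (N x w) of lagged values and y (length N) of targets for auto-regression.
    Row t of X holds [x[t-1], x[t-2], ..., x[t-w]] and y[t] holds x[t]."""
    x = np.asarray(series, float)
    N = len(x) - w
    X = np.column_stack([x[w - k - 1 : w - k - 1 + N] for k in range(w)])
    y = x[w:]
    return X, y

X, y = lag_matrix([43, 54, 72, 65, 80, 77, 90], w=2)
print(X.shape, y.shape)   # (5, 2) (5,)
```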

More details: X[N×w] · a[w×1] = y[N×1], i.e.
$$\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1w} \\ X_{21} & X_{22} & \cdots & X_{2w} \\ \vdots & & & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Nw} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_w \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
Rows are indexed by time; columns are the independent variables Ind-var-1 … Ind-var-w.

More details: Q2: How to estimate a_1, a_2, …, a_w = a? A2: with a Least Squares fit: a = (XᵀX)⁻¹ (Xᵀ y) (Moore-Penrose pseudo-inverse). a is the vector that minimizes the RMSE from y.

More details: Straightforward solution: a = (XᵀX)⁻¹ (Xᵀ y), where a is the regression coefficient vector (w×1) and X is the N×w sample matrix. Observations: the sample matrix X grows over time; needs matrix inversion; O(N×w²) computation; O(N×w) storage.
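Continuing the sketch: the straightforward batch solution, here via np.linalg.lstsq, which computes the same minimizer as (XᵀX)⁻¹ (Xᵀ y) but is numerically better behaved (the toy lag matrix reuses the made-up packet counts from the lag_matrix example above):

```python
import numpy as np

# Toy lag matrix (N=5 rows, w=2 lags) and targets, as built by lag_matrix above
X = np.array([[54., 43.], [72., 54.], [65., 72.], [80., 65.], [77., 80.]])
y = np.array([72., 65., 80., 77., 90.])

# Batch least squares: same minimizer as a = (X^T X)^{-1} (X^T y)
a, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast from the most recent window [x_t, x_{t-1}] = [90, 77]
print(np.array([90., 77.]) @ a)
```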

Even more details: Q3: Can we estimate a incrementally? A3: Yes, with the brilliant, classic method of Recursive Least Squares (RLS) (see, e.g., [Yi+00], for details). We can perform the matrix "inversion" WITHOUT an actual inversion. (How is that possible?) A: our matrix has a special form: (XᵀX).

[SKIP] More details: at the (N+1)-st time tick, the sample matrix X_{N+1} is X_N (N×w) with one new row appended.

[SKIP] More details - key ideas: let G_N = (X_Nᵀ X_N)⁻¹ (the w×w "gain matrix"). G_{N+1} can be computed recursively from G_N without matrix inversion.

Comparison: Straightforward Least Squares: needs a huge matrix (growing in size), O(N×w); costly matrix operation, O(N×w²). Recursive LS: needs a much smaller, fixed-size matrix, O(w×w); fast, incremental computation, O(1×w²) per time tick; no matrix inversion. (E.g., N = 10⁶, w = 1-100.)

[SKIP] EVEN more details:
$$G_{N+1} = G_N - [c]^{-1} \, [G_N x_{N+1}^T] \, [x_{N+1} G_N], \qquad c = 1 + x_{N+1} G_N x_{N+1}^T$$
where x_{N+1} is the 1×w row vector of the newest sample. Let's elaborate (VERY IMPORTANT, VERY VALUABLE).

[SKIP] EVEN more details:
$$a_{N+1} = [X_{N+1}^T X_{N+1}]^{-1} \, [X_{N+1}^T y_{N+1}]$$

[SKIP] EVEN more details:
$$a_{N+1} = [X_{N+1}^T X_{N+1}]^{-1} \, [X_{N+1}^T y_{N+1}]$$
with dimensions: a_{N+1} is w×1; X_{N+1}ᵀ is w×(N+1); X_{N+1} is (N+1)×w; y_{N+1} is (N+1)×1.

[SKIP] EVEN more details:
$$a_{N+1} = [X_{N+1}^T X_{N+1}]^{-1} \, [X_{N+1}^T y_{N+1}]$$
The gain matrix is G_{N+1} = [X_{N+1}ᵀ X_{N+1}]⁻¹ (w×w), updated without inversion as
$$G_{N+1} = G_N - [c]^{-1} \, [G_N x_{N+1}^T] \, [x_{N+1} G_N], \qquad c = 1 + x_{N+1} G_N x_{N+1}^T \ \text{(a SCALAR)}$$

[SKIP] Altogether: initialize G_0 = δ·I, where I is the w×w identity matrix and δ is a large positive number.
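A minimal sketch of the RLS recursion above (a Sherman-Morrison-style update of the gain matrix, initialized with G_0 = δ·I; variable names are illustrative, and this is generic RLS rather than the exact formulation of [Yi+00]):

```python
import numpy as np

def rls_init(w, delta=1e4):
    """Start with G0 = delta * I and zero coefficients."""
    return np.eye(w) * delta, np.zeros(w)

def rls_update(G, a, x_new, y_new):
    """Fold one new sample (row vector x_new, target y_new) into G and a,
    with no matrix inversion: G_{N+1} = G_N - (G_N x)(x^T G_N) / (1 + x^T G_N x)."""
    Gx = G @ x_new
    c = 1.0 + x_new @ Gx                       # the scalar c from the slides
    G = G - np.outer(Gx, Gx) / c               # rank-1 update of the gain matrix
    a = a + (G @ x_new) * (y_new - x_new @ a)  # correct coefficients toward the new sample
    return G, a

# Example: stream the toy AR(2) samples from the earlier sketch one at a time
G, a = rls_init(w=2)
rows = [([54., 43.], 72.), ([72., 54.], 65.), ([65., 72.], 80.),
        ([80., 65.], 77.), ([77., 80.], 90.)]
for x_row, y_t in rows:
    G, a = rls_update(G, a, np.array(x_row), y_t)
print(a)   # close to the batch least-squares coefficients (the delta*I start adds only a tiny ridge term)
```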

Pictorially: given a set of samples. (Scatter plot: dependent variable vs. independent variable.)

Pictorially: a new point arrives. (Scatter plot: dependent variable vs. independent variable, with the new point highlighted.)

Pictorially: RLS quickly computes the new best fit. (Scatter plot: dependent variable vs. independent variable, with the new point and the updated fit.)

Even more details Q4: can we forget the older samples? A4: Yes - RLS can easily handle that [Yi+00]:

Adaptability - forgetting. (Scatter plot: dependent variable, e.g., #bytes sent, vs. independent variable, e.g., #packets sent.)

Adaptability - forgetting: after a trend change, (R)LS with no forgetting keeps tracking the old trend. (Scatter plot: dependent variable, e.g., #bytes sent, vs. independent variable, e.g., #packets sent.)

Adaptability - forgetting: after a trend change, (R)LS with forgetting adapts to the new trend, while (R)LS with no forgetting does not. RLS can *trivially* handle forgetting. (Scatter plot: dependent variable vs. independent variable, showing both fits.)
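One standard way to realize this forgetting is an exponential forgetting factor λ with 0 < λ ≤ 1 (a common RLS extension, not necessarily the exact scheme of [Yi+00]); a sketch:

```python
import numpy as np

def rls_update_forgetting(G, a, x_new, y_new, lam=0.98):
    """RLS step that down-weights old samples by a factor lam per tick;
    lam = 1.0 recovers plain RLS (no forgetting)."""
    Gx = G @ x_new
    c = lam + x_new @ Gx
    G = (G - np.outer(Gx, Gx) / c) / lam
    a = a + (G @ x_new) * (y_new - x_new @ a)
    return G, a
```

With λ slightly below 1, older samples fade geometrically, so the fitted line swings toward the new trend shortly after the change instead of averaging the two regimes.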