BIRTE 17 Detection of Highly Correlated Live Data Streams R. Alseghayer, Daniel Petrov, P.K. Chrysanthis, M. Sharaf, A. Labrinidis University of Pittsburgh The University of Queensland
Motivation [Figure: groupings of the data streams CPU, Mem, Net1, Net2, Net3] 2
System Model tuple t = (timestamp, value). [Figure labels: interval, deadline, micro-batch, sliding window, set of data streams] 3
Problem Definition For each micro-batch B of a set of data streams DS with an arrival interval I and deadline d, detect, by the deadline d, the pairs from DS that each have at least A correlated sliding windows, where a window counts as correlated when its Pearson Correlation Coefficient (PCC) meets the threshold τ. Challenges: deadline d = interval I; number of pairs ~ |DS|^2; data produced at high velocity, in real time. 4
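The quadratic growth in the number of pairs can be made concrete with a small sketch. The helper name `stream_pairs` is hypothetical and the stream labels are borrowed from the Motivation slide; the sketch only counts unordered pairs and is not part of the DCS algorithms.

```python
from itertools import combinations


def stream_pairs(stream_ids):
    """Return every unordered pair of distinct streams (hypothetical helper)."""
    return list(combinations(stream_ids, 2))


# Five streams give 5 * 4 / 2 = 10 candidate pairs.
pairs = stream_pairs(["CPU", "Mem", "Net1", "Net2", "Net3"])
print(len(pairs))

# With n streams there are n * (n - 1) / 2 pairs, which grows as n^2.
print(len(stream_pairs(range(72))))
```

For n streams the count is n(n-1)/2, so doubling the number of streams roughly quadruples the work per micro-batch.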
Goal and Approach Goal: maximize the number of correlated pairs identified within a deadline. DCS_Precision = (# identified correlated pairs) / (total # correlated pairs). Approach: scheduling principles, early termination, pruning, caching. 5
Outline Background Detection of Correlated Streams (DCS) mode Experimental Evaluation Conclusions 6
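Pearson Correlation Coefficient (PCC) 2 pass: $corr_{x,y} = \frac{\sum_{i=1}^{m} (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$. 1 pass (5 sufficient statistics): $corr_{x,y} = \frac{cov}{\sqrt{var_x \cdot var_y}}$, where $cov = \mathrm{sumprodxy} - \frac{\mathrm{sumx} \cdot \mathrm{sumy}}{m}$, $var_x = \mathrm{sumxx} - \frac{(\mathrm{sumx})^2}{m}$, $var_y = \mathrm{sumyy} - \frac{(\mathrm{sumy})^2}{m}$. 7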
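A minimal sketch of the two formulations, assuming plain Python lists of equal length m and windows that are not constant (a constant window has zero variance and no defined PCC). The function names are hypothetical. In the two-pass version the denominator uses sums of squared deviations, so any normalization by m cancels between numerator and denominator.

```python
import math


def pcc_two_pass(x, y):
    """First pass computes the means, second pass the deviations."""
    m = len(x)
    mu_x = sum(x) / m
    mu_y = sum(y) / m
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    ss_x = sum((a - mu_x) ** 2 for a in x)
    ss_y = sum((b - mu_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def pcc_one_pass(x, y):
    """Single pass that accumulates the five sufficient statistics."""
    m = len(x)
    sumx = sumy = sumxx = sumyy = sumprodxy = 0.0
    for a, b in zip(x, y):
        sumx += a
        sumy += b
        sumxx += a * a
        sumyy += b * b
        sumprodxy += a * b
    cov = sumprodxy - sumx * sumy / m
    varx = sumxx - sumx ** 2 / m
    vary = sumyy - sumy ** 2 / m
    return cov / math.sqrt(varx * vary)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pcc_two_pass(x, y), pcc_one_pass(x, y))
```

Both functions should agree up to floating-point rounding; the one-pass form is the one that lends itself to incremental updates.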
Basic Algorithms Caching: 1 data pass with incremental computation of PCC. iBRAID*: round-robin scheduler; fair. PriCe*: priority scheduler; informed priority function Pr = corr(M) / C_totalexp. * D. Petrov et al., Interactive Exploration of Correlated Time Series, ExploreDB 17 8
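One way to picture the incremental computation of PCC is a sliding window that caches the five sufficient statistics and updates them as tuples arrive and expire. The class below is an illustrative sketch under that assumption, not the authors' implementation; `SlidingPCC` and its methods are hypothetical names, and the window advances one tuple at a time.

```python
import math
from collections import deque


class SlidingPCC:
    """Keeps the five sufficient statistics for the last w (x, y) values."""

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.sumx = self.sumy = 0.0
        self.sumxx = self.sumyy = self.sumprodxy = 0.0

    def _update(self, x, y, sign):
        # sign = +1 adds a tuple's contribution, sign = -1 removes it.
        self.sumx += sign * x
        self.sumy += sign * y
        self.sumxx += sign * x * x
        self.sumyy += sign * y * y
        self.sumprodxy += sign * x * y

    def push(self, x, y):
        """Add the newest values and evict the oldest once the window is full."""
        self.window.append((x, y))
        self._update(x, y, +1)
        if len(self.window) > self.w:
            old_x, old_y = self.window.popleft()
            self._update(old_x, old_y, -1)

    def corr(self):
        """Return the PCC of the current window, or None if it is undefined."""
        if len(self.window) < self.w:
            return None
        cov = self.sumprodxy - self.sumx * self.sumy / self.w
        varx = self.sumxx - self.sumx ** 2 / self.w
        vary = self.sumyy - self.sumy ** 2 / self.w
        if varx <= 0 or vary <= 0:
            return None  # constant window: correlation is undefined
        return cov / math.sqrt(varx * vary)
```

Each `push` costs a constant number of operations regardless of the window length, which is the saving over recomputing every window from scratch.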
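DCS Mode Early termination: stop processing a pair once it has A correlated windows. Pruning: drop a pair when A - correlatedwindows > (I - slidingwindowposition), that is, when the windows left in the interval can no longer bring it up to A. 10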
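The two rules can be sketched as a single check per pair. This is an illustration only: the function name `pair_status` and its return labels are hypothetical, and it assumes that the interval I and the sliding window position are measured in the same unit, so that I minus the position is the number of windows still to be examined.

```python
def pair_status(correlated_windows, sliding_window_position, A, I):
    """Classify a pair as 'terminate', 'prune', or 'continue'.

    correlated_windows: correlated windows found so far for this pair
    sliding_window_position: how far the window has advanced in the interval
    A: target number of correlated windows
    I: interval length (same unit as sliding_window_position; an assumption)
    """
    if correlated_windows >= A:
        return "terminate"  # early termination: the pair already qualifies
    remaining = I - sliding_window_position
    if A - correlated_windows > remaining:
        return "prune"  # the pair cannot reach A in what is left
    return "continue"


print(pair_status(112, 300, A=112, I=900))  # terminate
print(pair_status(0, 800, A=112, I=900))    # prune
print(pair_status(50, 800, A=112, I=900))   # continue
```

Both outcomes free processing time for pairs whose status is still undecided.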
Start Phase (S) 11
Cold Start Phase S 12
Warm Start Phase 13
Warm High Phase [Figure labels: Scheduler; Promising; Non-Promising] 14
Warm Low Phase [Figure labels: Scheduler; Promising; Non-Promising] 15
Evaluation Metrics Execution cost: # of operations to produce a result. Precision (optimization criterion): DCS_Precision = (# identified correlated pairs) / (total # correlated pairs). 17
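As a sketch, DCS_Precision can be computed from two sets of pairs. The function name and the use of frozensets for unordered pairs are assumptions made for illustration; raising an error when no correlated pairs exist is likewise an arbitrary choice, since the slides do not define that case.

```python
def dcs_precision(identified_pairs, all_correlated_pairs):
    """Fraction of the truly correlated pairs that were identified in time."""
    truth = {frozenset(p) for p in all_correlated_pairs}
    if not truth:
        raise ValueError("no correlated pairs: DCS_Precision is undefined")
    found = {frozenset(p) for p in identified_pairs}
    return len(found & truth) / len(truth)


truth = [("CPU", "Mem"), ("CPU", "Net1"), ("Mem", "Net2"), ("Net1", "Net3")]
found = [("Mem", "CPU"), ("Net3", "Net1")]
print(dcs_precision(found, truth))  # 2 of 4 pairs identified -> 0.5
```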
Dataset Yahoo Financial Historical Data: 53 companies on the NYSE for the last 28 years; each company has 6 data streams (318 in total), each of length 7100 tuples; data granularity is a day. Values for each company: opening price, closing price, highest price, lowest price, number of shares traded, and the adjusted close for that day. The slide labels these six streams CPU, Memory, Net1, Net2, Net3, and Net4, respectively. 18
Parameters
Parameter                               Value(s)
PCC threshold (τ)                       [0.75, 0.90]
Target # of correlated windows (A)      [112, 225, 450]
Interval (I)                            900 tuples (180 seconds)
Deadline (d)                            [25%, 50%, 75%] × Interval
Window length (w)                       8
# of data streams                       72
# of micro-batches                      4
19
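The arithmetic implied by these settings can be worked through in a few lines. The constant and function names are hypothetical; the only inputs are the interval (900 tuples in 180 seconds) and the deadline fractions listed above.

```python
INTERVAL_TUPLES = 900
INTERVAL_SECONDS = 180
DEADLINE_FRACTIONS = (0.25, 0.50, 0.75)


def deadline_seconds(fraction, interval_seconds=INTERVAL_SECONDS):
    """Deadline expressed in seconds for a given fraction of the interval."""
    return fraction * interval_seconds


# Arrival rate per stream implied by the interval definition.
tuples_per_second = INTERVAL_TUPLES / INTERVAL_SECONDS
print(tuples_per_second)  # 5.0

for f in DEADLINE_FRACTIONS:
    print(f, deadline_seconds(f))  # 45.0, 90.0 and 135.0 seconds
```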
Exp1 (A=450, d=, τ = 0.75) [Figure: bar chart with the annotation "35% decrease"; bar labels 1812, 1822, 1822, 1787, 429, 580, 236, 234; one series marked "Baseline"] 20
Exp2 (A=112, d=25%, τ = 0.9) [Figure annotation: 5x] 21
Exp2 (A=450, d=25%, τ = 0.9) 22
Exp3 (A=112, d=50%, τ = 0.9) [Figure annotation: 30%] 23
Exp3 (A=112, d=50%, τ = 0.75) 24
Conclusions We proposed the DCS mode of operation, which combines scheduling, early termination, pruning, and caching; avoids unnecessary computations; and produces at least twice as many results at reduced cost. Future work: investigate new methods (exploitation vs. diversity), perform sensitivity analysis, and experiment with more datasets. 25
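5 Sufficient Statistics $\mathrm{sumx} = \sum_{i=1}^{w} x_i$, $\mathrm{sumxx} = \sum_{i=1}^{w} x_i^2$, $\mathrm{sumy} = \sum_{i=1}^{w} y_i$, $\mathrm{sumyy} = \sum_{i=1}^{w} y_i^2$, $\mathrm{sumprodxy} = \sum_{i=1}^{w} x_i y_i$ 26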