High-dimensional Time Series Clustering via Cross-Predictability


Dezhi Hong, Quanquan Gu, Kamin Whitehouse
University of Virginia

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

Abstract

The key to time series clustering is how to characterize the similarity between any two time series. In this paper, we explore a new similarity metric called cross-predictability: the degree to which a future value in each time series is predicted by past values of the others. However, it is challenging to estimate such cross-predictability among time series in the high-dimensional regime, where the number of time series is much larger than the length of each time series. We address this challenge with a sparsity assumption: only time series in the same cluster have significant cross-predictability with each other. We demonstrate that this approach is computationally attractive, and provide a theoretical proof that the proposed algorithm will identify the correct clustering structure with high probability under certain conditions. To the best of our knowledge, this is the first practical high-dimensional time series clustering algorithm with a provable guarantee. We evaluate with experiments on both synthetic and real-world data, and the results indicate that our method achieves more than 80% clustering accuracy on real-world data, which is 20% higher than the state-of-the-art baselines.

1 INTRODUCTION

The proliferation of cheap, ubiquitous sensing infrastructure has enabled continuous monitoring of the world, and many expect the Internet of Things to have over 25 billion devices by 2020 [12]. In this paradigm, time series data will often be high-dimensional: the number of time series d (i.e., the number of sensors) will be much larger than the length of each time series T. Time series clustering often serves as an important first step for many applications and poses long-standing challenges. In this paper, we explore the challenge of time series clustering in the high-dimensional regime.

The key to time series clustering is how to characterize the similarity between any two time series. In the past several decades, various metrics for measuring the similarity/distance between time series have been investigated [8, 13, 6, 11, 18, 26, 10, 20, 34, 3]. Hidden Markov Models [29, 24] have also been utilized to derive distances between time series for clustering. Recently, a few new metrics [27, 19] to measure the similarity between time series have been proposed and applied to cluster brain-computer interface data and motion capture data. However, all the aforementioned work either did not provide theoretical guarantees for their methods, or only considered scenarios where the number of observations per time series T far exceeds the number of time series d.

In this paper, we explore a new similarity metric called cross-predictability: the degree to which a future value in each time series is predicted by past values of the others. This metric captures causal relationships between time series, such as seasonal or diurnal effects on multiple environmental sensors, market effects on multiple stock prices, and so on. However, it is challenging to estimate such cross-predictability among time series in the high-dimensional setting where d > T: a conventional regression task, for example, would have d variables and T equations, which is under-constrained.
Intuitively, only time series in the same cluster would have significant cross-predictability for each other, thus yielding sparse relationships that are indicative of the cluster structure. Consequently, we propose to estimate cross-predictability by imposing a sparsity assumption on the cross-predictability matrix, i.e., that only time series in the same cluster have significant cross-predictability with each other. To do this, we propose a new regularized Dantzig selector, a variant of the standard Dantzig selector [7], to estimate the similarity among the time series. We demonstrate that this approach is computationally attractive because it involves solving d regularized Dantzig selectors that can be optimized by the alternating direction method of multipliers (ADMM) [4] in parallel.

Additionally, we provide a theoretical proof that the proposed algorithm will identify the correct clustering structure with high probability if two conditions hold: 1) the individual time series themselves can be modeled with an autoregressive model [14], and 2) the transition matrix of the vector autoregressive model is block diagonal, i.e., it is actually possible to create clusters such that time series in the same cluster are cross-predictive while those in different clusters are not. To the best of our knowledge, this is the first practical high-dimensional time series clustering algorithm with a provable guarantee. It is worth noting that the proposed algorithm can be generally applied to cluster any high-dimensional time series, regardless of the underlying data distribution; we make the autoregressive model assumption solely for the purpose of providing the theoretical guarantees for our method. To demonstrate the effectiveness of our method, we conduct experiments on a real-world data set of sensor time series as well as simulations with synthetic data. Our method achieves more than 80% clustering accuracy on the real-world data set, which is 20% higher than the state-of-the-art baselines.

Notations. We compile here some standard notations used throughout the paper. We use lowercase letters x, y, ... to denote scalars, bold lowercase letters x, y, ... for vectors, and bold uppercase letters X, Y, ... for matrices. We denote random vectors by X, Y. We denote the (i, j) entry of a matrix as $M_{ij}$, and use $M_{i*}$ to index the i-th row of a matrix (likewise, $M_{*j}$ for the j-th column). We also use $M_{S,T}$ to represent a submatrix of M with its rows indexed by the indices in set S and its columns indexed by T. In addition, we write $S^c$ to denote the complement of a set S. For any matrix M, $\mathcal{P}(M)$ represents the symmetric convex hull of its columns, i.e., $\mathcal{P}(M) = \mathrm{conv}(\pm M_{*1}, \ldots, \pm M_{*d})$. For any matrices $M_1, M_2, \ldots, M_k$, we denote by $\mathrm{diag}(M_1, M_2, \ldots, M_k)$ the block diagonal matrix whose k-th diagonal block is $M_k$. Throughout the paper, we use the vector norms $\ell_q$ for $0 < q < \infty$ and $\ell_\infty$ of v, defined as $\|v\|_q = (\sum_i |v_i|^q)^{1/q}$ and $\|v\|_\infty = \max_i |v_i|$, and the matrix norms $\ell_q$, element-wise $\ell_\infty$, and Frobenius of M, defined as $\|M\|_q = \max_{\|v\|_q = 1}\|Mv\|_q$, $\|M\|_{\infty,\infty} = \max_{ij}|M_{ij}|$, and $\|M\|_F = (\sum_{i,j} M_{ij}^2)^{1/2}$.

2 RELATED WORK

There has been a substantial body of work on time series clustering, and in this section we briefly overview two related categories: clustering based on similarity and subspace clustering.

Similarity/Distance-based Time Series Clustering. A wide range of classical similarity/distance metrics have been developed and studied [8], including Pearson's correlation coefficient [13], cosine similarity [6], autocorrelation [11], dynamic time warping [18, 26], Euclidean distance [10], edit distance [20], distance metric learning [34, 3], and so on. Studies have also shown that time series can be modeled as generated from Hidden Markov Models [29, 24], and the estimated weight for each mixture can be used to cluster the time series. Recently, Ryabko et al. [27] considered brain-computer interface data for which independence assumptions do not hold, and proposed a new distance metric to measure the similarity between two time series distributions for clustering. Khaleghi et al.
[19] formulated a novel metric to quantify the distance between time series and proved the consistency of k-means for clustering processes according only to their distributions. However, the aforementioned studies either did not provide theoretical analysis of the performance or only handled settings where the number of time series d is smaller than the number of observations T. Different from the above similarity or distance metrics, we define the similarity between time series from a new perspective: time series are clustered based on how much they can be predicted by each other.

Subspace Clustering (SC). Another relevant line of research on high-dimensional data analysis is subspace clustering [9], where the assumption is that data lie on the union of multiple lower-dimensional linear subspaces and data points can be clustered into the subspaces they belong to. SC has been widely applied to face image clustering [1], social graphs [17], and so on. Recently, extensions to handling noisy data [21, 31, 33] and data with irrelevant features [25] have been studied as well. SC achieves state-of-the-art performance while enjoying rigorous theoretical guarantees. The key difference between SC and our method is two-fold: first, SC assumes data lie on different subspaces and even data in the same subspace are independent and identically distributed (i.i.d.), while we assume the time series follow a VAR model and those in the same cluster are dependent; second, SC mathematically solves, for each data point, a linear regression problem with all the other data points as candidates, whereas our study solves a regression problem to estimate the prediction weights between observations from different time stamps using all the time series.

3 METHODOLOGY

3.1 The VAR Model

Our algorithm is motivated by the autoregressive model, and the theoretical guarantee presented later also relies on the autoregressive model assumption, so we briefly review the stationary first-order vector autoregressive model with Gaussian noise here. Let the random vectors $X_1, \ldots, X_T$ be from a stationary process $(X_t)_{t=-\infty}^{\infty}$, and further define $X = [X_1, \ldots, X_t, \ldots, X_T]^\top \in \mathbb{R}^{T\times d}$, where $X_t = (x_1, \ldots, x_d)^\top \in \mathbb{R}^d$ is a d-dimensional vector and each column of X is a one-dimensional time series with T samples. In particular, we assume each $X_t$ can be modeled by a first-order vector autoregressive model:

$$X_{t+1} = A X_t + Z_t, \quad \text{for } t = 1, 2, \ldots, T-1. \qquad (3.1)$$

To ensure that the above process is stationary, the transition matrix A must have bounded spectral norm, i.e., $\|A\|_2 < 1$. We also assume $Z_t \sim N(0, \Psi)$ is i.i.d. additive noise independent of $X_t$, and $X_t$ has zero mean and covariance matrix $\Sigma$, i.e., $X_t \sim N(0, \Sigma)$, where $\Sigma = \mathbb{E}[X_t X_t^\top]$ is the autocovariance matrix. In addition, we have the lag-1 autocovariance matrix $\Sigma_1 = \mathbb{E}[X_t X_{t+1}^\top]$. Since $(X_t)_{t=-\infty}^{\infty}$ is stationary, it is easy to observe that the covariance matrix depends on A and $\Psi$, i.e., $\Sigma = A\Sigma A^\top + \Psi$, and we further have:

$$\Sigma A^\top = \Sigma_1. \qquad (3.2)$$

Essentially, the zero and nonzero entries in the transition matrix A directly reflect the Granger non-causalities and causalities among the stochastic time series. In other words, a nonzero entry $A_{ij}$ implies that the j-th time series is predictive of the i-th time series, with the magnitude $|A_{ij}|$ indicating how much predictive power there is. The new similarity metric in our clustering algorithm is built upon such cross-predictive relationships between time series. We now introduce the clustering algorithm.

3.2 The Proposed Clustering Algorithm

Our algorithm first estimates the cross-predictability among the time series, and then identifies the clustering structure based on the estimated relationships. To introduce our proposed algorithm, we need the following notations: $X_S = [X_1, \ldots, X_{T-1}]^\top \in \mathbb{R}^{(T-1)\times d}$, $X_T = [X_2, \ldots, X_T]^\top \in \mathbb{R}^{(T-1)\times d}$, $\hat\Sigma = X_S^\top X_S/(T-1)$, and $\hat\Sigma_1 = X_S^\top X_T/(T-1)$. Inspired by the relationship in Eq. (3.2), our main idea is to estimate A based on the relationship between A and the autocovariance and lag-1 autocovariance matrices. This motivates the following Dantzig selector type estimator [15]:

$$\hat A = \arg\min_A \|A\|_1 \quad \text{subject to} \quad \|\hat\Sigma A^\top - \hat\Sigma_1\|_{\infty,\infty} \le \mu, \qquad (3.3)$$

where $\mu > 0$ is a tuning parameter. Since the optimization over each row of A is independent, the above optimization problem can be decomposed into d independent sub-problems and solved individually as follows:

$$\hat\beta_i = \arg\min_{\beta_i} \|\beta_i\|_1 \quad \text{subject to} \quad \|\hat\Sigma \beta_i - \hat\gamma_i\|_\infty \le \mu, \qquad (3.4)$$

where $\hat\gamma_i = (\hat\Sigma_1)_{*i} = X_S^\top (X_T)_{*i}/(T-1)$, i.e., $\hat\gamma_i$ is the i-th column of $\hat\Sigma_1$, and $\hat A = [\hat\beta_1, \ldots, \hat\beta_d]^\top \in \mathbb{R}^{d\times d}$ with each $\hat\beta_i \in \mathbb{R}^d$. Therefore, $\hat\beta_i$ in (3.4) is an estimate of the i-th row of the transition matrix A. Furthermore, for each $\mu > 0$, there always exists a $\lambda > 0$ such that (3.4) is equivalent to the following regularized Dantzig selector type estimator:

$$\hat\beta_i = \arg\min_{\beta_i} \lambda\,\|\hat\Sigma \beta_i - \hat\gamma_i\|_\infty + \|\beta_i\|_1, \qquad (3.5)$$

where $\lambda$ is a regularization parameter that determines the sparsity of the estimate. Problem (3.5) can be solved by the alternating direction method of multipliers (ADMM) [4]. All d optimization problems can be solved in parallel, which makes the procedure computationally efficient.
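For concreteness, the per-row problem (3.5) can be written out in a few lines. The sketch below is purely illustrative: as an assumption of the sketch rather than part of our method, it hands each subproblem to the generic convex solver CVXPY instead of ADMM, and all names are chosen only for exposition.

    # Minimal sketch of the per-row regularized Dantzig selector (3.5).
    # Assumption: CVXPY is used as a generic convex solver purely for
    # illustration; this is not the ADMM implementation described in the text.
    import numpy as np
    import cvxpy as cp

    def estimate_row(Sigma_hat, gamma_i, lam):
        """Solve beta_i = argmin  lam * ||Sigma_hat beta - gamma_i||_inf + ||beta||_1."""
        d = Sigma_hat.shape[1]
        beta = cp.Variable(d)
        objective = cp.Minimize(lam * cp.norm(Sigma_hat @ beta - gamma_i, "inf")
                                + cp.norm1(beta))
        cp.Problem(objective).solve()
        return beta.value

    def estimate_transition(X, lam):
        """Estimate A row by row from the lag-0 and lag-1 sample covariances of X (T x d)."""
        T, d = X.shape
        X_S, X_T = X[:-1], X[1:]                 # past block X_S and future block X_T
        Sigma_hat = X_S.T @ X_S / (T - 1)        # lag-0 sample covariance
        Sigma1_hat = X_S.T @ X_T / (T - 1)       # lag-1 sample covariance
        # The d subproblems are independent and could be dispatched to parallel workers.
        return np.vstack([estimate_row(Sigma_hat, Sigma1_hat[:, i], lam)
                          for i in range(d)])

With the estimate of the transition matrix in hand, the clustering proceeds as described next.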
After solving the problem in (3.5), we construct an affinity matrix W based on $\hat A = [\hat\beta_1, \ldots, \hat\beta_d]^\top$ by symmetrization, and compute the corresponding Laplacian to perform standard spectral clustering [23, 28] to recover the clusters in the input time series. The procedure is summarized in Algorithm 1.

3.3 Discussion

At first glance, the regularized Dantzig selector in (3.5) and the Lasso appear similar. However, different from the Lasso, the input of the regression problem in (3.5) is the lag-0 covariance matrix and the response is the lag-1 covariance matrix. The lag-1 covariance matrix encodes, and brings into consideration, the first-order temporal information, which is missing in conventional similarity metrics such as correlation. Additionally, different from the Lasso-based estimation procedure [2], which penalizes the squared loss, the regularized Dantzig selector estimator penalizes the $\ell_\infty$ loss. It is also worth noting that Algorithm 1 shares a similar high-level idea with the subspace clustering (SC) algorithms [30, 33, 31], but the key difference is that SC considers the relationship between each data point and all the other points, whereas our estimator solves a regression problem to estimate the predictive relationship between the observations at time t+1 and the observations at time t, considering all the time series, as illustrated in Figure 1. Another fundamental difference is that SC assumes data are i.i.d. and lie on different subspaces, while here the time series data are clearly dependent. This poses a significant challenge for the theoretical analysis of our algorithm.
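The remaining steps of the procedure (symmetrization, Laplacian, spectral embedding, and k-means, i.e., steps 2 through 6 of Algorithm 1 below) admit an equally short illustrative sketch, assuming numpy and scikit-learn; taking absolute values during symmetrization is an assumption made here only so that edge weights are nonnegative.

    # Minimal sketch of the symmetrization and spectral clustering steps of
    # Algorithm 1 (below), assuming numpy and scikit-learn; illustrative only.
    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_cluster(A_hat, k):
        """Cluster d time series given the estimated d x d coefficient matrix A_hat."""
        W = np.abs(A_hat) + np.abs(A_hat).T   # symmetrized affinity (abs is an assumption
                                              # of this sketch, to keep weights nonnegative)
        M = np.diag(W.sum(axis=1))            # degree matrix
        L = M - W                             # unnormalized graph Laplacian
        _, eigvecs = np.linalg.eigh(L)        # eigenvectors sorted by ascending eigenvalue
        V = eigvecs[:, :k]                    # first k eigenvectors as columns
        return KMeans(n_clusters=k, n_init=10).fit_predict(V)   # label for each series

Combined with the row-wise estimator sketched above, spectral_cluster(estimate_transition(X, lam), k) returns a cluster label for each of the d time series.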

Algorithm 1: Time Series Clustering Algorithm

Input: Time series $X = [X_1, \ldots, X_T]^\top \in \mathbb{R}^{T\times d}$, $X_S = [X_1, \ldots, X_{T-1}]^\top \in \mathbb{R}^{(T-1)\times d}$, $X_T = [X_2, \ldots, X_T]^\top \in \mathbb{R}^{(T-1)\times d}$, $\hat\Sigma = X_S^\top X_S/(T-1)$, and $\hat\gamma_i = X_S^\top (X_T)_{*i}/(T-1)$
Output: Cluster membership of each time series, Y

1. Solve, for each $i = 1, \ldots, d$: $\hat\beta_i = \arg\min_{\beta_i} \lambda\,\|\hat\Sigma\beta_i - \hat\gamma_i\|_\infty + \|\beta_i\|_1$;
2. Set $\hat A = [\hat\beta_1, \ldots, \hat\beta_d]^\top$;
3. Construct the affinity graph G with nodes being the d time series in X, and edge weights given by the matrix $W = \hat A + \hat A^\top$;
4. Compute the unnormalized Laplacian $L = M - W$ of graph G, with $M = \mathrm{diag}(m_1, m_2, \ldots, m_d)$ and $m_i = \sum_{j=1}^d W_{ij}$;
5. Compute the first k eigenvectors $u_1, \ldots, u_k$ of L and let $V \in \mathbb{R}^{d\times k}$ be the matrix containing the first k eigenvectors as columns;
6. Cluster the points $x'_i \in \mathbb{R}^k$, given by the i-th row of V, with the k-means algorithm into clusters $C_1, \ldots, C_\ell, \ldots, C_k$.

Figure 1: Illustration of the proposed regularized Dantzig selector: it solves the regression problem to estimate the predictive relationship between the observations at time t+1 and those at time t, considering all the time series.

4 MAIN RESULTS

In this section, we state our main theory: a provable guarantee for successfully recovering the underlying clustering structure of the input time series. We first introduce some necessary definitions for understanding our main theorem.

4.1 Preliminaries

To define the clusters among the time series X in the context of the VAR model, we assume $A = \mathrm{diag}(A_1, \ldots, A_\ell, \ldots, A_k)$ to be block diagonal, where $A_\ell \in \mathbb{R}^{d_\ell\times d_\ell}$ and the number of time series d satisfies $d = \sum_{\ell=1}^k d_\ell$. Consequently, we can rewrite X as $X = [X^1, \ldots, X^\ell, \ldots, X^k]$ with each $X^\ell \in \mathbb{R}^{T\times d_\ell}$ obeying

$$X^\ell_{t+1} = A_\ell X^\ell_t + Z^\ell_t, \quad t = 1, 2, \ldots, T-1,$$

which essentially defines the clustering structure in the time series, such that the data $X^\ell_{t+1} \in \mathbb{R}^{d_\ell}$ at time point t+1 depend only on the data $X^\ell_t$ from the previous time point t in the same block indexed by $A_\ell$. In other words, as an effect of $A_\ell$, data are more predictive of each other within the same block than of data in the other blocks. The block diagonal transition matrix A gives rise to the fact that the time series in $X \in \mathbb{R}^{T\times d}$ form k clusters $C_1, \ldots, C_\ell, \ldots, C_k$, and each $C_\ell$ contains $d_\ell$ one-dimensional time series of $\mathbb{R}^T$, denoted as $X^\ell$. Without loss of generality, let $X = [X^1, \ldots, X^\ell, \ldots, X^k]$ be ordered. We further write $S_\ell$ to denote the set of indices corresponding to the columns of X that belong to cluster $C_\ell$.

Definition 4.1 (Cluster Recovery Property). The clusters $\{C_\ell\}_{\ell=1}^k$ and the time series X from these clusters obey the cluster recovery property (CRP) with a parameter $\lambda$ if and only if, for all i, the optimal solution $\hat\beta_i$ to (3.5) satisfies: (1) $\hat\beta_i$ is nonzero; (2) the indices of the nonzero entries in $\hat\beta_i$ correspond only to the columns of X that are in the same cluster as $X_{*i}$.

This property ensures that the output coefficient matrix $\hat A$ and affinity matrix W are exactly block diagonal, with each cluster represented in a disjoint block. In particular, recall that we assume the transition matrix A in the VAR model to be block diagonal, and therefore the CRP is guaranteed to hold for data generated from such a model.
For convenience, we will refer to the second requirement as the Self-Reconstruction Property (SRP) from now on.

Definition 4.2 (Inradius [30]). The inradius of a convex body P, denoted by r(P), is defined as the radius of the largest Euclidean ball inscribed in P.

By this definition, the inradius of $\mathcal{P}(X^\ell)$ measures the dispersion of the time series in $X^\ell$. Naturally, well-dispersed data will yield a large inradius, while data with a skewed distribution will have a small inradius.

4.2 Theoretical Guarantees

One of our major contributions in this paper is to provide theoretical guarantees for successfully recovering the clustering structure in the data.

Theorem 4.3. Under the assumption of a VAR model with a block diagonal transition matrix, compactly denote $P_0^\ell = \mathcal{P}(\Sigma_{S_\ell,S_\ell})$, $P_1^\ell = \mathcal{P}((\Sigma_1)_{S_\ell,S_\ell})$, $r_0^\ell = r(P_0^\ell)$, $r_1^\ell = r(P_1^\ell)$, and $r_0 r_1 = \min_{\ell} r_0^\ell r_1^\ell$ for $\ell = 1, 2, \ldots, k$, and let

$$\zeta = 16\,\|\Sigma\|_2 \sqrt{\frac{\max_j \Sigma_{jj}\,(6\log d + 4)}{\min_j \Sigma_{jj}\,(1 - \|A\|_2)\,T}}. \qquad (4.1)$$

Furthermore, if

$$r_0 r_1 > \frac{\|\Sigma_{S^c,S}\|_{\infty,\infty} + 2\zeta + \|\gamma_S\|_{\infty}}{1 - 2\zeta}, \qquad (4.2)$$

where $\gamma_S \in \mathbb{R}^d$ is a column of $\Sigma_1$, then with probability at least $1 - 6d^{-1}$ the cluster recovery property holds for all values of the regularization parameter $\lambda$ in the range

$$\frac{1}{r_0 r_1 - (\|\gamma_S\|_{\infty} + 2\zeta)} < \lambda < \frac{1}{\zeta\,(1 + \|\Sigma_{S^c,S}\|_{\infty,\infty})}, \qquad (4.3)$$

which is guaranteed to be non-empty.

We defer the full proof of the theorem to the supplementary material. The theorem provides an upper bound and a lower bound on the regularization parameter $\lambda$ for successfully recovering the underlying clustering structure in the time series: on the one hand, $\lambda$ cannot be too large, otherwise $\hat A$ will be too dense to perform clustering on; on the other hand, as $\lambda$ approaches 0 the connectivity among time series decreases because the optimal solution to (3.5) becomes more sparse. To guarantee that the obtained solution is nontrivial (i.e., $\hat\beta_i$ is nonzero), $\lambda$ must be larger than a certain value. In addition, a lower bound on $r_0 r_1$ is established, which imposes a requirement on the dispersion of the covariance between time series within the same cluster. We further make the following remarks.

Remark 4.4 (Tolerance of Noise across Clusters). From (4.2) we see that, for the CRP to hold, the dispersion of the columns of $\Sigma$ (each column taken as a data point in $\mathbb{R}^{d_\ell}$) needs to be sufficiently large: $r_0^\ell$ is the dispersion of the covariance between time series in a cluster $S_\ell$, and $r_1^\ell$ is the dispersion of the lag-1 covariance between time series in a cluster $S_\ell$. The RHS of (4.2) depends on the scale of $\Sigma_{S^c,S}$ and reflects the maximum correlation between the time series in one cluster S and any time series from all the other clusters.

Remark 4.5 (Sample Complexity). The factor before the square root in (4.1) is bounded by the largest and smallest eigenvalues of $\Sigma$, and therefore we can rewrite $\zeta = \kappa\sqrt{(6\log d + 4)/T}$, where $\kappa$ is a constant depending on $\Sigma$. We can further derive from (4.2) that

$$T > \frac{4\kappa^2 (6\log d + 4)(r_0 r_1 + 1)^2}{\bigl(r_0 r_1 - \|\gamma_S\|_{\infty} - \|\Sigma_{S^c,S}\|_{\infty,\infty}\bigr)^2},$$

which essentially indicates that the sample complexity for the CRP to hold is $O(\log d)$. In other words, for the algorithm to succeed, the number of time series d is allowed to grow exponentially with the length of the time series T; as long as $\log d$ is smaller than the length of the time series T, our theory holds. Such a property is indeed desirable for high-dimensional data, as d can often far exceed the number of samples T.

Remark 4.6 (A Uniform Parameter $\lambda$). Another direct observation from the main theorem is that we can find a uniform value of $\lambda$, within the range specified in (4.3), that works for the regression task in (3.5) for all $i = 1, \ldots, d$. In other words, problem (3.5) is solvable with a single $\lambda$ and can be solved in parallel for each $i = 1, \ldots, d$.

5 EXPERIMENTS

In this section, we demonstrate the correctness of our theoretical findings and the effectiveness of our proposed clustering algorithm on both synthetic and real-world data. We first experiment with different sets of parameters, including the number of time series d, the length of the time series T, and the number of clusters k in the data. The experimental results confirm that, when the required condition in Theorem 4.3 is satisfied, the clusters in the data can be recovered perfectly.
We further apply the algorithm to a real-world data set where the task is to group sensor time series by their type of measurement (e.g., a temperature sensor vs. a humidity sensor). Our algorithm is able to outperform the state-of-the-art baselines by more than 20% as measured by the adjusted Rand index.

5.1 Baselines

In our proposed clustering algorithm, we estimate the similarity matrix with the regularized Dantzig selector (referred to as CP). As baselines, instead of using our estimator, we consider the following methods for obtaining the similarity matrix; the rest of the clustering procedure remains the same as ours.

Correlation Coefficient (CC): In this baseline, we compute the Pearson correlation coefficient between all pairs of time series, and use these coefficients to construct the similarity matrix.

Cosine Similarity (Cosine): In the second baseline, we compute the pairwise cosine similarity for all the time series, and preserve only the similarity scores of the top-k nearest neighbors of each time series as the corresponding row of the similarity matrix. Our experiments show that the results are not sensitive to k, and we set k = 5.

Autocorrelation (ACF): This baseline first computes the autocorrelation vectors (with a lag of up to 50) for each time series, and then calculates the Euclidean distance between each pair of time series based on the autocorrelation vectors. We use the implementation in [22] to obtain the distance matrix first, and then convert the distances into similarity scores with a Gaussian kernel function (smaller distances should map to larger similarity scores).

Dynamic Time Warping (DTW): DTW is a popular method to compute the similarity between time series. Here we compute the pairwise DTW similarity score for all the time series, and then normalize the similarity scores to between 0 and 1.

We also implement a baseline that does not rely on the similarity between time series:

Principal Component Analysis (PCA): In this method, PCA is first applied to reduce the dimensionality of each original time series by preserving four principal components, and then k-means is applied to these PCA scores for clustering.
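As a concrete illustration of how a baseline similarity matrix plugs into the same clustering pipeline, the sketch below builds the CC similarity matrix with numpy; using absolute correlations as edge weights is an assumption of this sketch, and the remaining steps are identical to those used for CP.

    # Minimal sketch of the CC baseline's similarity matrix, assuming numpy.
    import numpy as np

    def correlation_affinity(X):
        """X is T x d; return a d x d matrix of pairwise Pearson correlations."""
        C = np.corrcoef(X, rowvar=False)   # correlation between columns (time series)
        return np.abs(C)                   # absolute value is an assumption of this sketch

    # The rest of the pipeline is unchanged, e.g.:
    # labels = spectral_cluster(correlation_affinity(X), k)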

5.2 Synthetic Data

In this section, we show the effectiveness of our proposed clustering algorithm via numerical simulations. In particular, the data $X_t$ at each time point t are generated from a VAR model as defined in (3.1), and we generate the input time series X as follows: (1) We first generate the block diagonal transition matrix A with k clusters, where the values within each block are generated from a Bernoulli distribution; (2) Since we assume $(X_t)_{t=-\infty}^{\infty}$ to be stationary, we then rescale A such that its spectral norm satisfies $\|A\|_2 = \alpha < 1$; (3) Given A, $\Sigma$ is generated such that the elements on the diagonal equal 1 and the off-diagonal elements are set to the same small value, e.g., 0.1; we then rescale $\Sigma$ so that its spectral norm satisfies $\|\Sigma\|_2 = 2\|A\|_2$; (4) Next, according to the stationarity property, the covariance matrix of the additive noise $Z_t$ follows $\Psi = \Sigma - A\Sigma A^\top$, where $\Psi$ must be a positive definite matrix; (5) We can then generate $X_1$ from the multivariate normal distribution with the parameters generated in the previous steps, and obtain the following $X_t$ with the VAR model.

We fix the number of time series d at 100, and choose the length of the time series T from a grid of $\{1, 3, 5, 7, 9\}\times\log(d)$ (rounded to the closest integer), i.e., the ratio $T/\log(d)$ varies from 1 to 9. For each value of T, we repeat the data generation process 100 times and report the average of the experimental results.

We first experiment with different values of the regularization parameter $\lambda$ and examine whether the two requirements stated in Definition 4.1 are satisfied. We scan through an exponential space of $\lambda$ from $(\log(d)/T)^{-1}\times 10^{-1}$ to $(\log(d)/T)^{-1}\times 10^{3}$ and define the metric Self-Reconstruction Property Violation Rate (VioRate) of the estimated transition matrix $\hat A$ as

$$\mathrm{VioRate} = \frac{\sum_{(i,j)\notin \mathcal{C}_\ell} |\hat A_{ij}|}{\sum_{(i,j)\in \mathcal{C}_\ell} |\hat A_{ij}|},$$

where $(i,j)\in\mathcal{C}_\ell$ denotes that the i-th time series $X_{*i}$ and the j-th time series $X_{*j}$ are in the same cluster $\mathcal{C}_\ell$ for some $\ell$ (likewise for $(i,j)\notin\mathcal{C}_\ell$). By definition, VioRate measures how significant the predictive weights are for pairs of time series across different clusters, relative to the weights for pairs in the same cluster. For a trivial solution, i.e., $\hat A = 0$, the VioRate is defined to be 1, while for a solution satisfying the self-reconstruction property the VioRate should be exactly 0.
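The generation steps (1) through (5) and the VioRate metric can be summarized in the following illustrative sketch, assuming numpy; the Bernoulli block values, the 0.1 off-diagonal entries, and the rescaling follow the description above, while the equal-sized blocks, default arguments, and names are choices made only for this sketch.

    # Minimal sketch of the synthetic VAR data generation (steps (1)-(5)) and
    # the VioRate metric, assuming numpy; illustrative only.
    import numpy as np

    def generate_var_data(d=100, T=50, k=25, alpha=0.4, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [d // k + (1 if i < d % k else 0) for i in range(k)]   # block sizes
        labels = np.repeat(np.arange(k), sizes)
        # (1) block diagonal transition matrix, Bernoulli entries within each block
        A = np.zeros((d, d))
        start = 0
        for s in sizes:
            A[start:start + s, start:start + s] = rng.binomial(1, 0.5, size=(s, s))
            start += s
        # (2) rescale A so that ||A||_2 = alpha < 1 (stationarity)
        A *= alpha / np.linalg.norm(A, 2)
        # (3) unit diagonal, 0.1 off-diagonal covariance, rescaled so ||Sigma||_2 = 2||A||_2
        Sigma = np.full((d, d), 0.1) + 0.9 * np.eye(d)
        Sigma *= 2 * alpha / np.linalg.norm(Sigma, 2)
        # (4) noise covariance implied by stationarity; it must be positive definite
        Psi = Sigma - A @ Sigma @ A.T
        assert np.all(np.linalg.eigvalsh(Psi) > 0), "Psi is not positive definite"
        # (5) draw X_1 from N(0, Sigma) and unroll the VAR(1) recursion
        X = np.zeros((T, d))
        X[0] = rng.multivariate_normal(np.zeros(d), Sigma)
        for t in range(T - 1):
            X[t + 1] = A @ X[t] + rng.multivariate_normal(np.zeros(d), Psi)
        return X, A, labels

    def vio_rate(A_hat, labels):
        """Self-Reconstruction Property violation rate of an estimated transition matrix."""
        if not np.any(A_hat):
            return 1.0                                 # trivial solution A_hat = 0
        same = labels[:, None] == labels[None, :]      # True where i and j share a cluster
        return np.abs(A_hat[~same]).sum() / np.abs(A_hat[same]).sum()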
The violation rates for different values of $T/\log(d)$ and $\lambda$ with k = 25 are illustrated in Figure 2a; the results confirm our theoretical findings. We observe that when $\lambda$ is small, the solution violates the nonzero requirement, and the VioRate is thus 1 (the two rows at the bottom). When $\lambda$ is sufficiently large within a range, the violation rates are zero, indicating that all the entries in the off-diagonal blocks of the estimated $\hat A$ are zero, which satisfies the SRP. In Figure 2b, we show the quality of the time series clustering (measured by the adjusted Rand index; higher is better) with the corresponding $\hat A$ obtained in Figure 2a. We notice that cases perfectly satisfying the nonzero and SRP requirements produce perfect clustering results. Furthermore, it is also clear that the exact self-reconstruction condition is not necessary for perfect clustering.

We next investigate how the number of clusters k affects the clustering performance, where we vary the value of k from 5 to 25. For the regularization parameter $\lambda$, we scan through the same exponential space as in the above experiment with 5-fold cross-validation, and choose the value with the minimal cross-validation error. We fix $\|A\|_2$ at 0.4 and report the average results of the 100 runs for each set of parameters, as illustrated in Figure 3. We clearly see that a larger k leads to better clustering results, which makes sense: the larger the number of clusters, the sparser A is, and therefore the more accurate the estimation of A.

We also examine the effect of the transition matrix's spectral norm $\|A\|_2$ on the clustering quality. To this end, we set $\|A\|_2 = \alpha$ and vary $\alpha$ from 0.1 to 0.9; the covariance matrices $\Sigma$ and $\Psi$ are generated in the same way as described earlier. For the parameter $\lambda$, we use the same cross-validation procedure as above. We fix the number of clusters k at 25 and report the average results of the 100 runs for each set of parameters, as shown in Figure 4. We observe that, for a given value of $T/\log(d)$, the clustering quality increases as the spectral norm of the transition matrix decreases. This indicates that the spectral norm of the transition matrix is a critical factor and verifies the theoretical findings in (4.1).
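The clustering quality reported in Figures 2b through 4 and in Table 1 is the adjusted Rand index between the ground-truth and estimated labels; a minimal sketch, assuming scikit-learn:

    # Minimal sketch of the evaluation metric, assuming scikit-learn.
    from sklearn.metrics import adjusted_rand_score

    labels_true = [0, 0, 0, 1, 1, 1]          # toy ground-truth cluster labels
    labels_pred = [0, 0, 1, 1, 1, 0]          # toy estimated labels (two series misassigned)
    print(adjusted_rand_score(labels_true, labels_pred))   # equals 1.0 only for a perfect clustering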

Figure 2: Self-Reconstruction Property violation rate and the corresponding clustering quality (measured by the adjusted Rand index) for different $T/\log(d)$ against different $\lambda$, with k = 25. (a) SRP violation rate: too small a $\lambda$ produces trivial solutions ($\hat A = 0$, thus VioRate = 1), while a sufficiently large $\lambda$ gives a solution satisfying both the nonzero and SRP requirements (VioRate = 0). (b) Clustering quality: cases satisfying the nonzero and SRP conditions yield perfect clustering results; the exact self-reconstruction condition (VioRate = 0) is not necessary for perfect clustering.

Figure 3: Clustering quality for different $T/\log(d)$ against different numbers of clusters k (k = 5, 10, 15, 20, 25): a larger k is better.

Figure 4: Clustering quality for different $T/\log(d)$ against different values of the spectral norm $\|A\|_2 = \alpha$ ($\alpha$ = 0.1, 0.3, 0.5, 0.7, 0.9): a smaller $\alpha$ is better.

To compare our method with the baselines described in Section 5.1, we further conduct two sets of experiments on synthetic data with different parameters. To generate the synthetic data in the first experiment (referred to as Synthetic Data-1 in Table 1), we set the number of time series d = 50, the length of the time series T = 50, the number of clusters k = 5, and the transition matrix's spectral norm $\|A\|_2 = 0.5$. For the second experiment (Synthetic Data-2 in Table 1), we change T to 100, and the rest of the parameters remain the same. We see that, when d is comparable to T in the first experiment, our method (CP) performs significantly better than the baselines. When the number of samples T is increased to 100, all the baselines see a performance boost, while our method produces perfect clustering results. One should note the better performance of PCA; our understanding is that PCA extracts good explanatory components from the sample covariance matrix, which still captures the underlying causal relationship between variables, though it does not consider the first-order temporal information as our proposed method does. The other baselines simply compute similarity directly between variables, which is not sufficiently effective at characterizing the relationship between time series in the high-dimensional setting.

Table 1: Experimental comparisons with baselines (columns: CC, Cosine, ACF, DTW, PCA, CP; rows: Synthetic Data-1 with d = 50 and T = 50, Synthetic Data-2 with d = 50 and T = 100, and Real Data). The results on synthetic and real data demonstrate the advantage of our proposed algorithm (CP); each cell includes the average clustering performance (adjusted Rand index) of 10 runs with its standard deviation.

5.3 Real-world Data

To further examine how effective our proposed algorithm is in practice, we also apply it to a real-world data set, where the assumption of a VAR model with a block diagonal transition matrix might not be perfectly satisfied. The data set [16] contains data collected from 204 sensor time series in 51 rooms on 4 different floors of a large office building on a university campus. Each room is instrumented with 4 different types of sensors: a CO2 sensor, a temperature sensor, a humidity sensor, and a light sensor. The data from each sensor is recorded every 15 minutes, and the data set contains one week's worth of data. There are missing values in the one-week period, so the total number of observations T is smaller than the number of sensor time series d. Our goal is to assign each sensor time series to the correct type cluster, e.g., a temperature cluster or a CO2 cluster. Recognizing the type of sensors is often an important step for many useful applications. For instance, when applying analytics stacks comprised of a bundle of analytics jobs to a building for energy savings, every particular analytics job requires some specific types of sensors as input.

Figure 5: Clustering quality of our regularized Dantzig selector-based spectral clustering algorithm against the regularization parameter 1/λ: the algorithm works with a wide range of λ.

In this case, we do not know the values of the parameters in the sufficient condition of Theorem 4.3, so we cannot fine-tune λ. We roughly scan through the entire range of [0, 1] for 1/λ, and the results are shown in Figure 5 (the data points beyond 0.15 all drop to zero and are thus omitted from the figure). This again confirms our theoretical findings, in the sense that the proposed clustering algorithm can work when λ is sufficiently large, even when the assumptions are not perfectly satisfied. We also examine how well the baselines (detailed in Section 5.1) perform on the real data set, and the results are summarized in Table 1. Our method achieves more than 80% accuracy and outperforms the best baseline by more than 20%, indicating that our method can still be effective when the assumption of a VAR model with a block diagonal transition matrix might not be satisfied.

6 CONCLUSIONS

In this paper, we study the time series clustering problem with a new similarity metric in the high-dimensional regime, where the number of time series is much larger than the length of each time series. Different from existing metrics, our similarity metric measures the cross-predictability between time series, i.e., the degree to which a future value in each time series is predicted by past values of the others. We impose a sparsity assumption and propose a regularized Dantzig selector estimator to learn the cross-predictability among time series for clustering. We further provide a theoretical proof that the proposed algorithm will successfully recover the clustering structure in the data with high probability under certain conditions. Experiments on both synthetic and real-world data verify the correctness of our findings and demonstrate the effectiveness of the algorithm.
For the real-world task of sensor type clustering, our method is able to outperform the state-of-the-art baselines by more than 20% with regard to clustering quality.

7 Acknowledgments

We would like to thank the anonymous reviewers for their invaluable comments. This work was funded by NSF awards CNS , IIS , and IIS . The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(2).
[2] S. Basu, G. Michailidis, et al. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4).
[3] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning, page 11. ACM.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
[5] R. Brandenberg, A. Dattasharma, P. Gritzmann, and D. Larman. Isoradial bodies. Discrete & Computational Geometry, 32(4).
[6] D. Cai, X. He, and J. Han. Document clustering using locality preserving indexing. Knowledge and Data Engineering, IEEE Transactions on, 17(12).
[7] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics.
[8] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2).
[9] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE.
[10] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases, volume 23. ACM.
[11] P. Galeano and D. Peña. Multivariate analysis in vector time series.
[12] Gartner Inc. newsroom/id/
[13] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, and P. Boesiger. A new correlation-based fuzzy logic clustering algorithm for fMRI. Magnetic Resonance in Medicine, 40(2).
[14] J. D. Hamilton. Time series analysis, volume 2. Princeton University Press, Princeton.
[15] F. Han, H. Lu, and H. Liu. A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research, 16.
[16] D. Hong, J. Ortiz, K. Whitehouse, and D. Culler. Towards automatic spatial verification of sensor placement in buildings. In Proceedings of the 5th ACM Workshop on Embedded Systems for Energy-Efficient Buildings, pages 1–8. ACM.
[17] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. In Proceedings of the 28th International Conference on Machine Learning.
[18] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment.
[19] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux. Consistent algorithms for clustering time series. Journal of Machine Learning Research, 17(3):1–32.
[20] J.-G. Lee, J. Han, and K.-Y. Whang. Trajectory clustering: a partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM.
[21] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning.
[22] P. Montero and J. A. Vilar. TSclust: An R package for time series clustering.
[23] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14.
[24] A. Panuccio, M. Bicego, and V. Murino. A hidden Markov model-based approach to sequential data clustering. In Structural, Syntactic, and Statistical Pattern Recognition. Springer.
[25] C. Qu and H. Xu. Subspace clustering with irrelevant features via robust Dantzig selector. In Advances in Neural Information Processing Systems, 2015.
[26] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[27] D. Ryabko and J. Mary. Reducing statistical time-series problems to binary classification. In Advances in Neural Information Processing Systems.
[28] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8).
[29] P. Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems.
[30] M. Soltanolkotabi and E. J. Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics.
[31] M. Soltanolkotabi, E. Elhamifar, E. J. Candes, et al. Robust subspace clustering. The Annals of Statistics, 42(2).
[32] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5).
[33] Y.-X. Wang and H. Xu. Noisy sparse subspace clustering. Journal of Machine Learning Research, 17(12):1–41.
[34] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, 15, 2003.


More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

Fitting Algorithms for MMPP ATM Traffic Models

Fitting Algorithms for MMPP ATM Traffic Models Fitting Agorithms for PP AT Traffic odes A. Nogueira, P. Savador, R. Vaadas University of Aveiro / Institute of Teecommunications, 38-93 Aveiro, Portuga; e-mai: (nogueira, savador, rv)@av.it.pt ABSTRACT

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/P10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of temperature on the rate of photosynthesis

More information

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne 17th European Signa Processing Conference (EUSIPCO 2009) Gasgow, Scotand, August 24-28, 2009 PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR Pierric Bruneau, Marc Gegon and Fabien

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Learning Fully Observed Undirected Graphical Models

Learning Fully Observed Undirected Graphical Models Learning Fuy Observed Undirected Graphica Modes Sides Credit: Matt Gormey (2016) Kayhan Batmangheich 1 Machine Learning The data inspires the structures we want to predict Inference finds {best structure,

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

Worst Case Analysis of the Analog Circuits

Worst Case Analysis of the Analog Circuits Proceedings of the 11th WSEAS Internationa Conference on CIRCUITS, Agios Nikoaos, Crete Isand, Greece, Juy 3-5, 7 9 Worst Case Anaysis of the Anaog Circuits ELENA NICULESCU*, DORINA-MIOARA PURCARU* and

More information

Reichenbachian Common Cause Systems

Reichenbachian Common Cause Systems Reichenbachian Common Cause Systems G. Hofer-Szabó Department of Phiosophy Technica University of Budapest e-mai: gszabo@hps.ete.hu Mikós Rédei Department of History and Phiosophy of Science Eötvös University,

More information

Primal and dual active-set methods for convex quadratic programming

Primal and dual active-set methods for convex quadratic programming Math. Program., Ser. A 216) 159:469 58 DOI 1.17/s117-15-966-2 FULL LENGTH PAPER Prima and dua active-set methods for convex quadratic programming Anders Forsgren 1 Phiip E. Gi 2 Eizabeth Wong 2 Received:

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

From Margins to Probabilities in Multiclass Learning Problems

From Margins to Probabilities in Multiclass Learning Problems From Margins to Probabiities in Muticass Learning Probems Andrea Passerini and Massimiiano Ponti 2 and Paoo Frasconi 3 Abstract. We study the probem of muticass cassification within the framework of error

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

Converting Z-number to Fuzzy Number using. Fuzzy Expected Value

Converting Z-number to Fuzzy Number using. Fuzzy Expected Value ISSN 1746-7659, Engand, UK Journa of Information and Computing Science Vo. 1, No. 4, 017, pp.91-303 Converting Z-number to Fuzzy Number using Fuzzy Expected Vaue Mahdieh Akhbari * Department of Industria

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

A Robust Voice Activity Detection based on Noise Eigenspace Projection

A Robust Voice Activity Detection based on Noise Eigenspace Projection A Robust Voice Activity Detection based on Noise Eigenspace Projection Dongwen Ying 1, Yu Shi 2, Frank Soong 2, Jianwu Dang 1, and Xugang Lu 1 1 Japan Advanced Institute of Science and Technoogy, Nomi

More information

TUNING PARAMETER SELECTION FOR PENALIZED LIKELIHOOD ESTIMATION OF GAUSSIAN GRAPHICAL MODEL

TUNING PARAMETER SELECTION FOR PENALIZED LIKELIHOOD ESTIMATION OF GAUSSIAN GRAPHICAL MODEL Statistica Sinica 22 (2012), 1123-1146 doi:http://dx.doi.org/10.5705/ss.2009.210 TUNING PARAMETER SELECTION FOR PENALIZED LIKELIHOOD ESTIMATION OF GAUSSIAN GRAPHICAL MODEL Xin Gao, Danie Q. Pu, Yuehua

More information

MONTE CARLO SIMULATIONS

MONTE CARLO SIMULATIONS MONTE CARLO SIMULATIONS Current physics research 1) Theoretica 2) Experimenta 3) Computationa Monte Caro (MC) Method (1953) used to study 1) Discrete spin systems 2) Fuids 3) Poymers, membranes, soft matter

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008 Random Booean Networks Barbara Drosse Institute of Condensed Matter Physics, Darmstadt University of Technoogy, Hochschustraße 6, 64289 Darmstadt, Germany (Dated: June 27) arxiv:76.335v2 [cond-mat.stat-mech]

More information

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems Componentwise Determination of the Interva Hu Soution for Linear Interva Parameter Systems L. V. Koev Dept. of Theoretica Eectrotechnics, Facuty of Automatics, Technica University of Sofia, 1000 Sofia,

More information

The Group Structure on a Smooth Tropical Cubic

The Group Structure on a Smooth Tropical Cubic The Group Structure on a Smooth Tropica Cubic Ethan Lake Apri 20, 2015 Abstract Just as in in cassica agebraic geometry, it is possibe to define a group aw on a smooth tropica cubic curve. In this note,

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection FFTs in Graphics and Vision Spherica Convoution and Axia Symmetry Detection Outine Math Review Symmetry Genera Convoution Spherica Convoution Axia Symmetry Detection Math Review Symmetry: Given a unitary

More information

Available online at ScienceDirect. Procedia Computer Science 96 (2016 )

Available online at  ScienceDirect. Procedia Computer Science 96 (2016 ) Avaiabe onine at www.sciencedirect.com ScienceDirect Procedia Computer Science 96 (206 92 99 20th Internationa Conference on Knowedge Based and Inteigent Information and Engineering Systems Connected categorica

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l Investigation on spectrum of the adjacency matrix and Lapacian matrix of graph G SHUHUA YIN Computer Science and Information Technoogy Coege Zhejiang Wani University Ningbo 3500 PEOPLE S REPUBLIC OF CHINA

More information

Statistical Inference, Econometric Analysis and Matrix Algebra

Statistical Inference, Econometric Analysis and Matrix Algebra Statistica Inference, Econometric Anaysis and Matrix Agebra Bernhard Schipp Water Krämer Editors Statistica Inference, Econometric Anaysis and Matrix Agebra Festschrift in Honour of Götz Trenker Physica-Verag

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

Higher dimensional PDEs and multidimensional eigenvalue problems

Higher dimensional PDEs and multidimensional eigenvalue problems Higher dimensiona PEs and mutidimensiona eigenvaue probems 1 Probems with three independent variabes Consider the prototypica equations u t = u (iffusion) u tt = u (W ave) u zz = u (Lapace) where u = u

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

BDD-Based Analysis of Gapped q-gram Filters

BDD-Based Analysis of Gapped q-gram Filters BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de

More information