Higher-Order Partial Least Squares (HOPLS): A Generalized Multi-Linear Regression Method

Qibin Zhao, Cesar F. Caiafa, Danilo P. Mandic, Zenas C. Chao, Yasuo Nagasaka, Naotaka Fujii, Liqing Zhang and Andrzej Cichocki

Q. Zhao is with the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, Saitama, Japan, and the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. C. F. Caiafa is with Instituto Argentino de Radioastronomía (IAR), CCT La Plata - CONICET, Buenos Aires, Argentina. D. P. Mandic is with the Communication and Signal Processing Research Group, Department of Electrical and Electronic Engineering, Imperial College London, London, UK. Z. C. Chao, Y. Nagasaka and N. Fujii are with the Laboratory for Adaptive Intelligence, Brain Science Institute, RIKEN, Saitama, Japan. L. Zhang is with the MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems and the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. A. Cichocki is with the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, Japan, and the Systems Research Institute, Polish Academy of Sciences.

Abstract: A new generalized multilinear regression model, termed the Higher-Order Partial Least Squares (HOPLS), is introduced with the aim to predict a tensor (multiway array) Y from a tensor X through projecting the data onto the latent space and performing regression on the corresponding latent variables. HOPLS differs substantially from other regression models in that it explains the data by a sum of orthogonal Tucker tensors, while the number of orthogonal loadings serves as a parameter to control model complexity and prevent overfitting. The low-dimensional latent space is optimized sequentially via a deflation operation, yielding the best joint subspace approximation for both X and Y. Instead of decomposing X and Y individually, a higher-order singular value decomposition on a newly defined generalized cross-covariance tensor is employed to optimize the orthogonal loadings. A systematic comparison on both synthetic data and real-world decoding of 3D movement trajectories from electrocorticogram (ECoG) signals demonstrates the advantages of HOPLS over the existing methods in terms of better predictive ability, suitability for small sample sizes, and robustness to noise.

Index Terms: Multilinear regression, Partial least squares (PLS), Higher-order singular value decomposition (HOSVD), Constrained block Tucker decomposition, Electrocorticogram (ECoG), Fusion of behavioral and neural data.

1 INTRODUCTION

Partial Least Squares (PLS) is a well-established framework for estimation, regression and classification, whose objective is to predict a set of dependent variables (responses) from a set of independent variables (predictors) through the extraction of a small number of latent variables. One member of the PLS family is Partial Least Squares Regression (PLSR), a multivariate method which, in contrast to Multiple Linear Regression (MLR) and Principal Component Regression (PCR), is proven to be particularly suited to highly collinear data [], []. In order to predict response variables Y from independent variables X, PLS finds a set of latent variables (also called latent vectors, score vectors or components) by projecting both X and Y onto a new subspace, while at the same time maximizing the pairwise covariance between the latent variables of X and Y. A standard way to optimize the model parameters is the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm []; for an overview of PLS and its applications in neuroimaging see [4], [5], [6]. There are many variations of the PLS model, including the orthogonal projection on latent structures (O-PLS) [7], Biorthogonal PLS (BPLS) [8], recursive partial least squares (RPLS) [9], and nonlinear PLS [], []. PLS regression is known to exhibit high sensitivity to noise, a problem that can be attributed to redundant latent variables [], whose selection still remains an open problem []. Penalized regression methods are also popular for simultaneous variable selection and coefficient estimation; they impose, e.g., L1 or L2 constraints on the regression coefficients. Algorithms of this kind are Ridge regression and Lasso [4].

The recent progress in sensor technology, biomedicine, and biochemistry has highlighted the necessity to consider multiple data streams as multi-way data structures [5], for which the corresponding analysis methods are very naturally based on tensor decompositions [6], [7], [8]. Although matricization of a tensor is an alternative way to express such data, this would result in the "Large p, Small n" problem and also make it difficult to interpret the results, as the physical meaning and multi-way data structures would be lost due to the unfolding operation. The N-way PLS (N-PLS) decomposes the independent and dependent data into rank-one tensors, subject to maximum pairwise covariance of the latent vectors. This promises enhanced stability, resilience to noise, and intuitive interpretation of the results [9], []. Owing to these desirable properties, N-PLS has found applications in areas ranging from chemometrics [], [], [] to neuroscience [4], [5].

A modification of the N-PLS and the multi-way covariates regression were studied in [6], [7], [8], where the weight vectors yielding the latent variables are optimized with the same strategy as in N-PLS, resulting in better fitness to the independent data X while maintaining no difference in predictive performance. The tensor decomposition used within N-PLS is Canonical Decomposition/Parallel Factor Analysis (CANDECOMP/PARAFAC, or CP) [9], which makes N-PLS inherit both the advantages and limitations of CP []. These limitations are related to poor fitting ability, computational complexity and slow convergence when handling multivariate dependent data and higher-order (N > 3) independent data, so that N-PLS is not guaranteed to outperform standard PLS [], [].

In this paper, we propose a new generalized multilinear regression model, called Higher-Order Partial Least Squares (HOPLS), which makes it possible to predict an Mth-order tensor Y (M >= 3) (or, as a particular case, a two-way matrix Y) from an Nth-order tensor X (N >= 3) by projecting the tensor X onto a low-dimensional common latent subspace. The latent subspaces are optimized sequentially through simultaneous rank-$(1, L_2, \ldots, L_N)$ approximation of X and rank-$(1, K_2, \ldots, K_M)$ approximation of Y (or rank-one approximation in the particular case of a two-way matrix Y). Owing to the better fitting ability of the orthogonal Tucker model as compared to CP [6] and the flexibility of the block Tucker model [], the analysis and simulations show that HOPLS proves to be a promising multilinear subspace regression framework that provides not only an optimal tradeoff between fitness and model complexity but also enhanced predictive ability in general. In addition, we develop a new strategy to find a closed-form solution by employing the higher-order singular value decomposition (HOSVD) [], which makes the computation more efficient than the classical iterative procedure.

The article is structured as follows. In Section 2, an overview of two-way PLS is presented, and the notation and notions related to multi-way data analysis are introduced. In Section 3, the new multilinear regression model is proposed, together with the corresponding solutions and algorithms. Extensive simulations on synthetic data and a real-world case study on the fusion of behavioral and neural data are presented in Section 4, followed by conclusions in Section 5.

2 BACKGROUND AND NOTATION

2.1 Notation and definitions

Nth-order tensors (multi-way arrays) are denoted by underlined boldface capital letters, matrices (two-way arrays) by boldface capital letters, and vectors by boldface lower-case letters. The $i$th entry of a vector $\mathbf{x}$ is denoted by $x_i$, element $(i, j)$ of a matrix $\mathbf{X}$ by $x_{ij}$, and element $(i_1, i_2, \ldots, i_N)$ of an Nth-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ by $x_{i_1 i_2 \ldots i_N}$ or $(\underline{\mathbf{X}})_{i_1 i_2 \ldots i_N}$. Indices typically range from 1 to their capital version, e.g., $i_N = 1, \ldots, I_N$. The mode-$n$ matricization of a tensor is denoted by $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$. The $n$th factor matrix in a sequence is denoted by $\mathbf{A}^{(n)}$.

The $n$-mode product of a tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ and a matrix $\mathbf{A} \in \mathbb{R}^{J_n \times I_n}$ is denoted by $\underline{\mathbf{Y}} = \underline{\mathbf{X}} \times_n \mathbf{A} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}$ and is defined as:

$$y_{i_1 i_2 \ldots i_{n-1}\, j_n\, i_{n+1} \ldots i_N} = \sum_{i_n} x_{i_1 i_2 \ldots i_n \ldots i_N}\, a_{j_n i_n}. \quad (1)$$

The rank-$(R_1, R_2, \ldots, R_N)$ Tucker model [4] is a tensor decomposition defined and denoted as follows:

$$\underline{\mathbf{Y}} \approx \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)} = [\![\underline{\mathbf{G}};\, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}]\!], \quad (2)$$

where $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ $(R_n \leq I_n)$ is the core tensor and $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ are the factor matrices. The last term is the simplified notation, introduced in [5], for the Tucker operator.
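The two building blocks above, the $n$-mode product in (1) and the Tucker reconstruction in (2), can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming conventions (the functions `mode_n_product` and `tucker_reconstruct` are not part of any library referenced in the paper):

```python
import numpy as np

def mode_n_product(X, A, n):
    """Mode-n product X x_n A for a matrix A of shape (J_n, I_n); the result
    has shape (I_1, ..., I_{n-1}, J_n, I_{n+1}, ..., I_N), as in Eq. (1)."""
    # tensordot contracts X's axis n with A's axis 1; the new axis is appended
    # last by tensordot, so it is moved back into position n.
    return np.moveaxis(np.tensordot(X, A, axes=(n, 1)), -1, n)

def tucker_reconstruct(G, factors):
    """Rebuild [[G; A^(1), ..., A^(N)]] from a core tensor and factor matrices, Eq. (2)."""
    Y = G
    for n, A in enumerate(factors):
        Y = mode_n_product(Y, A, n)
    return Y

# Small usage example with random data.
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))
factors = [rng.standard_normal((I, R)) for I, R in zip((5, 6, 4), G.shape)]
Y = tucker_reconstruct(G, factors)
print(Y.shape)  # (5, 6, 4)
```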
When the factor matrices are orthonormal and the core tensor is all-orthogonal, this model is called HOSVD [], [5]. The CP model [6], [9], [6], [7], [8] became prominent in chemistry [8] and is defined as a sum of rank-one tensors:

$$\underline{\mathbf{Y}} \approx \sum_{r=1}^{R} \lambda_r\, \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r, \quad (3)$$

where the symbol $\circ$ denotes the outer product of vectors, $\mathbf{a}^{(n)}_r$ is the $r$th column vector of the matrix $\mathbf{A}^{(n)}$, and $\lambda_r$ are scalars. The CP model can also be represented by (2), under the condition that the core tensor is superdiagonal, i.e., $R_1 = \cdots = R_N$ and $g_{i_1 i_2 \ldots i_N} = 0$ if $i_n \neq i_m$ for all $n \neq m$.

The 1-mode product between $\underline{\mathbf{G}} \in \mathbb{R}^{1 \times I_2 \times \cdots \times I_N}$ and $\mathbf{t} \in \mathbb{R}^{I_1}$ is of size $I_1 \times I_2 \times \cdots \times I_N$, and is defined as

$$(\underline{\mathbf{G}} \times_1 \mathbf{t})_{i_1 i_2 \ldots i_N} = g_{1 i_2 \ldots i_N}\, t_{i_1}. \quad (4)$$

The inner product of two tensors $\underline{\mathbf{A}}, \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is defined by $\langle \underline{\mathbf{A}}, \underline{\mathbf{B}} \rangle = \sum_{i_1 i_2 \ldots i_N} a_{i_1 i_2 \ldots i_N}\, b_{i_1 i_2 \ldots i_N}$, and the squared Frobenius norm by $\|\underline{\mathbf{A}}\|_F^2 = \langle \underline{\mathbf{A}}, \underline{\mathbf{A}} \rangle$.

The $n$-mode cross-covariance between an Nth-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ and an Mth-order tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{J_1 \times \cdots \times I_n \times \cdots \times J_M}$ with the same size $I_n$ on the $n$th mode, denoted by $\mathrm{COV}_{\{n;n\}}(\underline{\mathbf{X}}, \underline{\mathbf{Y}}) \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{n-1} \times J_{n+1} \times \cdots \times J_M}$, is defined as

$$\underline{\mathbf{C}} = \mathrm{COV}_{\{n;n\}}(\underline{\mathbf{X}}, \underline{\mathbf{Y}}) = \langle \underline{\mathbf{X}}, \underline{\mathbf{Y}} \rangle_{\{n;n\}}, \quad (5)$$

where the symbol $\langle \cdot, \cdot \rangle_{\{n;n\}}$ represents an $n$-mode multiplication between two tensors, defined as

$$c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N, j_1, \ldots, j_{n-1}, j_{n+1}, \ldots, j_M} = \sum_{i_n = 1}^{I_n} x_{i_1, \ldots, i_n, \ldots, i_N}\; y_{j_1, \ldots, i_n, \ldots, j_M}. \quad (6)$$

As a special case, for a matrix $\mathbf{Y} \in \mathbb{R}^{I_n \times M}$, the $n$-mode cross-covariance between $\underline{\mathbf{X}}$ and $\mathbf{Y}$ simplifies to

$$\mathrm{COV}_{\{n;1\}}(\underline{\mathbf{X}}, \mathbf{Y}) = \underline{\mathbf{X}} \times_n \mathbf{Y}^T, \quad (7)$$

under the assumption that the $n$-mode column vectors of $\underline{\mathbf{X}}$ and the columns of $\mathbf{Y}$ are mean-centered.
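Because the cross-covariance tensor in (5)-(6) is central to HOPLS, a short NumPy illustration may help; it is a sketch under our own naming, not code from the paper:

```python
import numpy as np

def mode_n_cross_cov(X, Y, n):
    """<X, Y>_{n;n}: contract mode n of X with mode n of Y, Eqs. (5)-(6).
    Both tensors must have the same size I_n on mode n; the result stacks the
    remaining modes of X followed by the remaining modes of Y."""
    return np.tensordot(X, Y, axes=(n, n))

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 8, 6))   # I_1 x I_2 x I_3
Y = rng.standard_normal((10, 4))      # I_1 x M, shares mode 1 with X
C = mode_n_cross_cov(X, Y, 0)         # shape (8, 6, 4)
# Special case (7): for a matrix Y this coincides, up to axis ordering,
# with the mode-1 product of X with Y^T.
C_alt = np.moveaxis(np.tensordot(X, Y.T, axes=(0, 1)), -1, 0)
print(C.shape, C_alt.shape)           # (8, 6, 4) (4, 8, 6)
```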

2.2 Standard PLS (two-way PLS)

Fig. 1: The PLS model: data decomposition as a sum of rank-one matrices.

The PLS regression was originally developed for econometrics by H. Wold [], [9] in order to deal with collinear predictor variables. The usefulness of PLS in chemical applications was illuminated by the group of S. Wold [4], [4], after some initial work by Kowalski et al. [4]. Currently, PLS regression is widely applied in chemometrics, sensory evaluation, industrial process control and, more recently, in the analysis of functional brain imaging data [4], [44], [45], [46], [47].

The principle behind PLS is to search for a set of latent vectors by performing a simultaneous decomposition of $\mathbf{X} \in \mathbb{R}^{I \times J}$ and $\mathbf{Y} \in \mathbb{R}^{I \times M}$ with the constraint that these components explain as much as possible of the covariance between $\mathbf{X}$ and $\mathbf{Y}$. It can be formulated as

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E} = \sum_{r=1}^{R} \mathbf{t}_r \mathbf{p}_r^T + \mathbf{E}, \quad (8)$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{F} = \sum_{r=1}^{R} \mathbf{u}_r \mathbf{q}_r^T + \mathbf{F}, \quad (9)$$

where $\mathbf{T} = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_R] \in \mathbb{R}^{I \times R}$ consists of $R$ orthonormal latent variables from $\mathbf{X}$, i.e., $\mathbf{T}^T\mathbf{T} = \mathbf{I}$, and $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_R] \in \mathbb{R}^{I \times R}$ are latent variables from $\mathbf{Y}$ having maximum covariance with $\mathbf{T}$ column-wise. The matrices $\mathbf{P}$ and $\mathbf{Q}$ represent loadings, and $\mathbf{E}$, $\mathbf{F}$ are respectively the residuals for $\mathbf{X}$ and $\mathbf{Y}$. In order to find the first set of components, the classical PLS algorithm optimizes the two sets of weights $\mathbf{w}$, $\mathbf{q}$ so as to satisfy

$$\max_{\{\mathbf{w}, \mathbf{q}\}} \big[\mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{q}\big], \quad \text{s.t.}\;\; \mathbf{w}^T\mathbf{w} = 1,\; \mathbf{q}^T\mathbf{q} = 1. \quad (10)$$

The latent variables are then given by $\mathbf{t} = \mathbf{X}\mathbf{w}/\|\mathbf{X}\mathbf{w}\|$ and $\mathbf{u} = \mathbf{Y}\mathbf{q}$. Here, two assumptions are made: i) the latent variables $\{\mathbf{t}_r\}_{r=1}^R$ are good predictors of $\mathbf{Y}$; ii) a linear inner relation between the latent variables $\mathbf{t}$ and $\mathbf{u}$ exists, i.e., $\mathbf{U} = \mathbf{T}\mathbf{D} + \mathbf{Z}$, where $\mathbf{D}$ is a diagonal matrix describing the linear relation and $\mathbf{Z}$ denotes the matrix of Gaussian i.i.d. residuals. The maximum likelihood estimate of the parameters $\mathbf{D}$ is $\mathbf{D} = (\mathbf{T}^T\mathbf{T})^{-1}\mathbf{T}^T\mathbf{U}$. Upon combining it with the decomposition of the $\mathbf{Y}$ matrix, (9) can be written as

$$\mathbf{Y} = \mathbf{T}\mathbf{D}\mathbf{Q}^T + (\mathbf{Z}\mathbf{Q}^T + \mathbf{F}) \approx \mathbf{T}\mathbf{D}\mathbf{Q}^T, \quad (11)$$

where $\mathbf{Z}\mathbf{Q}^T + \mathbf{F}$ is the residual matrix. Thus (11) indicates that the problem boils down to finding common latent variables $\mathbf{T}$ that explain the variance of both $\mathbf{X}$ and $\mathbf{Y}$, as illustrated in Fig. 1.
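The first pair of weights maximizing criterion (10) is the dominant singular pair of the cross-covariance matrix $\mathbf{X}^T\mathbf{Y}$, which the NIPALS iterations converge to. The following NumPy sketch (names are ours; deflation of subsequent components is omitted) shows this one step:

```python
import numpy as np

def pls_first_component(X, Y):
    """Weights (w, q) maximizing w^T X^T Y q under unit-norm constraints, i.e.
    criterion (10): the leading singular pair of X^T Y. The latent scores then
    follow as t = Xw/||Xw|| and u = Yq."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    w, q = U[:, 0], Vt[0, :]
    t = X @ w
    t = t / np.linalg.norm(t)
    u = Y @ q
    return w, q, t, u

# Usage: X and Y driven by shared latent scores T.
rng = np.random.default_rng(2)
T = rng.standard_normal((50, 3))
X = T @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((50, 10))
Y = T @ rng.standard_normal((3, 4)) + 0.1 * rng.standard_normal((50, 4))
w, q, t, u = pls_first_component(X, Y)
print(abs(np.corrcoef(t, u)[0, 1]))  # close to 1 when structure is strongly shared
```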
3 HIGHER-ORDER PLS (HOPLS)

Fig. 2: Schematic diagram of the HOPLS model: approximating $\underline{\mathbf{X}}$ as a sum of rank-$(1, L_2, L_3)$ tensors. Approximation for $\underline{\mathbf{Y}}$ follows a similar principle with shared common latent components $\mathbf{T}$.

In the case of a two-way matrix, column rank and row rank are equal and correspond to the minimal number of rank-one terms into which the matrix can be decomposed; thus, the low-rank approximation is equivalent to subspace approximation. However, for a higher-order tensor, these criteria lead to two typical tensor decompositions (i.e., the CP and Tucker models). CP decomposes a tensor as a sum of rank-one tensors defined in (3), which can be considered a low-rank approximation, while the Tucker model decomposes a tensor into a core tensor multiplied by a matrix along each mode (2), which is a rank-$(R_1, R_2, R_3)$ subspace approximation. Both the CP and Tucker decompositions can be regarded as higher-order generalizations of the matrix singular value decomposition (SVD) and principal component analysis (PCA). In fact, CP can be viewed as a special case of Tucker where the core tensor is superdiagonal. Block term decompositions unify the Tucker and CP decompositions into one framework by decomposing a tensor into a sum of rank-$(R_1, R_2, R_3)$ terms [].

Consider an Nth-order independent tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and an Mth-order dependent tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$, having the same size on the first mode, i.e., $I_1 = J_1$. Our objective is to find an optimal subspace approximation of $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$, in which the latent vectors from $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ have maximum pairwise covariance. Considering a linear relation between the latent vectors, the problem boils down to finding a common latent subspace which can approximate both $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ simultaneously.

We first address the general case of a tensor $\underline{\mathbf{X}}$ $(N \geq 3)$ and a tensor $\underline{\mathbf{Y}}$ $(M \geq 3)$. A particular case with a tensor $\underline{\mathbf{X}}$ $(N \geq 3)$ and a matrix $\mathbf{Y}$ $(M = 2)$ is presented separately in Section 3.3, using a slightly different approach.

3.1 Proposed model

Applying the Tucker decomposition within a PLS framework is not straightforward, and to that end we propose a novel block-wise orthogonal Tucker approach to model the data. More specifically, we assume $\underline{\mathbf{X}}$ is decomposed as a sum of rank-$(1, L_2, \ldots, L_N)$ Tucker blocks, while $\underline{\mathbf{Y}}$ is decomposed as a sum of rank-$(1, K_2, \ldots, K_M)$ Tucker blocks (see Fig. 2), which can be expressed as

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \underline{\mathbf{G}}_r \times_1 \mathbf{t}_r \times_2 \mathbf{P}^{(1)}_r \cdots \times_N \mathbf{P}^{(N-1)}_r + \underline{\mathbf{E}}_R, \qquad \underline{\mathbf{Y}} = \sum_{r=1}^{R} \underline{\mathbf{D}}_r \times_1 \mathbf{t}_r \times_2 \mathbf{Q}^{(1)}_r \cdots \times_M \mathbf{Q}^{(M-1)}_r + \underline{\mathbf{F}}_R, \quad (12)$$

where $R$ is the number of latent vectors, $\mathbf{t}_r \in \mathbb{R}^{I_1}$ is the $r$th latent vector, $\{\mathbf{P}^{(n)}_r \in \mathbb{R}^{I_{n+1} \times L_{n+1}}\}_{n=1}^{N-1}$ and $\{\mathbf{Q}^{(m)}_r \in \mathbb{R}^{J_{m+1} \times K_{m+1}}\}_{m=1}^{M-1}$ are loading matrices on mode-$n$ and mode-$m$, respectively, and $\underline{\mathbf{G}}_r \in \mathbb{R}^{1 \times L_2 \times \cdots \times L_N}$ and $\underline{\mathbf{D}}_r \in \mathbb{R}^{1 \times K_2 \times \cdots \times K_M}$ are core tensors.

However, the Tucker decompositions in (12) are not unique [6] due to permutation, rotation, and scaling indeterminacies. To alleviate this problem, additional constraints should be imposed: the core tensors $\underline{\mathbf{G}}_r$ and $\underline{\mathbf{D}}_r$ are all-orthogonal, the loading matrices are column-wise orthonormal, i.e., $\mathbf{P}^{(n)T}_r \mathbf{P}^{(n)}_r = \mathbf{I}$ and $\mathbf{Q}^{(m)T}_r \mathbf{Q}^{(m)}_r = \mathbf{I}$, and the latent vector is of unit length, i.e., $\|\mathbf{t}_r\|_F = 1$. Thus, each term in (12) is represented as an orthogonal Tucker model, implying essential uniqueness as it is subject only to trivial indeterminacies [].

By defining a latent matrix $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_R]$, mode-$n$ loading matrix $\overline{\mathbf{P}}^{(n)} = [\mathbf{P}^{(n)}_1, \ldots, \mathbf{P}^{(n)}_R]$, mode-$m$ loading matrix $\overline{\mathbf{Q}}^{(m)} = [\mathbf{Q}^{(m)}_1, \ldots, \mathbf{Q}^{(m)}_R]$, and core tensors $\underline{\mathbf{G}} = \mathrm{blockdiag}(\underline{\mathbf{G}}_1, \ldots, \underline{\mathbf{G}}_R) \in \mathbb{R}^{R \times RL_2 \times \cdots \times RL_N}$, $\underline{\mathbf{D}} = \mathrm{blockdiag}(\underline{\mathbf{D}}_1, \ldots, \underline{\mathbf{D}}_R) \in \mathbb{R}^{R \times RK_2 \times \cdots \times RK_M}$, the HOPLS model in (12) can be rewritten as

$$\underline{\mathbf{X}} = \underline{\mathbf{G}} \times_1 \mathbf{T} \times_2 \overline{\mathbf{P}}^{(1)} \cdots \times_N \overline{\mathbf{P}}^{(N-1)} + \underline{\mathbf{E}}_R, \qquad \underline{\mathbf{Y}} = \underline{\mathbf{D}} \times_1 \mathbf{T} \times_2 \overline{\mathbf{Q}}^{(1)} \cdots \times_M \overline{\mathbf{Q}}^{(M-1)} + \underline{\mathbf{F}}_R, \quad (13)$$

where $\underline{\mathbf{E}}_R$ and $\underline{\mathbf{F}}_R$ are the residuals after extracting $R$ components. The core tensors $\underline{\mathbf{G}}$ and $\underline{\mathbf{D}}$ have a special block-diagonal structure (see Fig. 2) and their elements indicate the level of local interactions between the corresponding latent vectors and loading matrices.

Note that the tensor decomposition in (13) is similar to the block term decomposition discussed in [], which aims at the decomposition of only one tensor. However, HOPLS attempts to find block Tucker decompositions of two tensors with block-wise orthogonality constraints, which at the same time satisfy the criterion of sharing common latent components on a specific mode. Benefiting from the advantages of the Tucker decomposition over the CP model [6], HOPLS promises to approximate the data better than N-PLS. Specifically, HOPLS differs substantially from the N-PLS model in the sense that extraction of latent components in HOPLS is based on subspace approximation rather than on low-rank approximation, and the size of the loading matrices is controlled by a hyperparameter, providing a tradeoff between fitness and model complexity. Note that HOPLS simplifies into N-PLS if we define $\forall n: L_n = 1$ and $\forall m: K_m = 1$.

3.2 Optimization criteria and algorithm

There are two different approaches to extracting the latent components: sequential and simultaneous methods. A sequential method extracts one latent component at a time, deflates the proper tensors and calculates the next component from the residuals. In a simultaneous method, all components are calculated simultaneously by minimizing a certain criterion. In the following, we employ a sequential method.
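Before turning to the optimization, the structure of the data model in (12) can be illustrated with a short NumPy sketch that builds an $\underline{\mathbf{X}}$ from $R$ rank-$(1, L_2, \ldots, L_N)$ blocks with column-orthonormal loadings. All names and the chosen dimensions are ours, for illustration only:

```python
import numpy as np

def mode_n_product(T, A, n):
    """Mode-n product T x_n A for a matrix A of shape (J_n, I_n)."""
    return np.moveaxis(np.tensordot(T, A, axes=(n, 1)), -1, n)

def hopls_reconstruct_X(t_list, P_list, G_list):
    """Sum of rank-(1, L_2, ..., L_N) Tucker blocks, as in Eq. (12):
    each block is G_r x_1 t_r x_2 P_r^(1) ... x_N P_r^(N-1)."""
    X = 0.0
    for t, Ps, G in zip(t_list, P_list, G_list):
        block = mode_n_product(G, t[:, None], 0)       # expand the singleton mode with t_r
        for n, P in enumerate(Ps, start=1):
            block = mode_n_product(block, P, n)
        X = X + block
    return X

# Three synthetic components of a third-order X with I = (30, 10, 8), L = (3, 2).
rng = np.random.default_rng(0)
R, I, L = 3, (30, 10, 8), (3, 2)
t_list = [rng.standard_normal(I[0]) for _ in range(R)]
P_list = [[np.linalg.qr(rng.standard_normal((I[n + 1], L[n])))[0] for n in range(2)]
          for _ in range(R)]                            # column-orthonormal loadings
G_list = [rng.standard_normal((1,) + L) for _ in range(R)]
X = hopls_reconstruct_X(t_list, P_list, G_list)
print(X.shape)  # (30, 10, 8)
```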
The tensor decompositions in (12) can be represented as an optimization problem of approximating $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ by orthogonal Tucker models, while having a common latent component on a specific mode. If we apply HOSVD individually to $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$, the best rank-$(1, L_2, \ldots, L_N)$ approximation for $\underline{\mathbf{X}}$ and the best rank-$(1, K_2, \ldots, K_M)$ approximation for $\underline{\mathbf{Y}}$ can be obtained, but a common latent vector $\mathbf{t}_r$ cannot be ensured. Another way is to find the best approximation of $\underline{\mathbf{X}}$ by HOSVD first, and subsequently approximate $\underline{\mathbf{Y}}$ with $\mathbf{t}_r$ fixed. However, this procedure, which resembles multi-way principal component regression [8], has the drawback that the common latent components are not necessarily predictive for $\underline{\mathbf{Y}}$.

The optimization of the subspace transformation according to (12) will be formulated as a problem of determining a set of orthonormal loadings $\mathbf{P}^{(n)}_r$, $\mathbf{Q}^{(m)}_r$, $r = 1, 2, \ldots, R$, and latent vectors $\mathbf{t}_r$ that satisfy a certain criterion. Since each term can be optimized sequentially with the same criterion based on deflation, in the following we shall simplify the problem to that of finding the first latent vector $\mathbf{t}$ and two sequences of loading matrices $\mathbf{P}^{(n)}$ and $\mathbf{Q}^{(m)}$.

In order to develop a strategy for the simultaneous minimization of the Frobenius norms of the residuals $\underline{\mathbf{E}}$ and $\underline{\mathbf{F}}$, while keeping a common latent vector $\mathbf{t}$, we first need to introduce the following basic results:

Proposition 3.1. Given a tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, column-orthonormal matrices $\mathbf{P}^{(n)} \in \mathbb{R}^{I_{n+1} \times L_{n+1}}$, $n = 1, \ldots, N-1$, and $\mathbf{t} \in \mathbb{R}^{I_1}$ with $\|\mathbf{t}\|_F = 1$, the least-squares (LS) solution to $\min_{\underline{\mathbf{G}}} \|\underline{\mathbf{X}} - \underline{\mathbf{G}} \times_1 \mathbf{t} \times_2 \mathbf{P}^{(1)} \cdots \times_N \mathbf{P}^{(N-1)}\|_F$ is given by $\underline{\mathbf{G}} = \underline{\mathbf{X}} \times_1 \mathbf{t}^T \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T}$.

Proof: This result is well known and is widely used in the literature [6], [].

A simple proof is based on writing the mode-1 matricization of the tensor $\underline{\mathbf{X}}$ as

$$\mathbf{X}_{(1)} = \mathbf{t}\,\mathbf{G}_{(1)}\big(\mathbf{P}^{(N-1)} \otimes \cdots \otimes \mathbf{P}^{(1)}\big)^T + \mathbf{E}_{(1)}, \quad (14)$$

where $\mathbf{E}_{(1)}$ is the residual and the symbol $\otimes$ denotes the Kronecker product. Since $\mathbf{t}^T\mathbf{t} = 1$ and $(\mathbf{P}^{(N-1)} \otimes \cdots \otimes \mathbf{P}^{(1)})$ is column-orthonormal, the LS solution for $\mathbf{G}_{(1)}$ with fixed $\mathbf{t}$ and $\mathbf{P}^{(n)}$ is given by $\mathbf{G}_{(1)} = \mathbf{t}^T\mathbf{X}_{(1)}(\mathbf{P}^{(N-1)} \otimes \cdots \otimes \mathbf{P}^{(1)})$; writing it in tensor form, we obtain the desired result.

Proposition 3.2. Given a tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, the following two constrained optimization problems are equivalent:
1) $\min_{\{\mathbf{P}^{(n)}, \mathbf{t}, \underline{\mathbf{G}}\}} \|\underline{\mathbf{X}} - \underline{\mathbf{G}} \times_1 \mathbf{t} \times_2 \mathbf{P}^{(1)} \cdots \times_N \mathbf{P}^{(N-1)}\|_F$, s.t. the matrices $\mathbf{P}^{(n)}$ are column-orthonormal and $\|\mathbf{t}\|_F = 1$;
2) $\max_{\{\mathbf{P}^{(n)}, \mathbf{t}\}} \|\underline{\mathbf{X}} \times_1 \mathbf{t}^T \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T}\|_F$, s.t. the matrices $\mathbf{P}^{(n)}$ are column-orthonormal and $\|\mathbf{t}\|_F = 1$.
The proof is available in [6].

Assume that the orthonormal matrices $\mathbf{P}^{(n)}$, $\mathbf{Q}^{(m)}$, $\mathbf{t}$ are given; then, from Proposition 3.1, the core tensors in (12) can be computed as

$$\underline{\mathbf{G}} = \underline{\mathbf{X}} \times_1 \mathbf{t}^T \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T}, \qquad \underline{\mathbf{D}} = \underline{\mathbf{Y}} \times_1 \mathbf{t}^T \times_2 \mathbf{Q}^{(1)T} \cdots \times_M \mathbf{Q}^{(M-1)T}. \quad (15)$$

According to Proposition 3.2, minimization of $\|\underline{\mathbf{E}}\|_F$ and $\|\underline{\mathbf{F}}\|_F$ under the orthonormality constraint is equivalent to maximization of $\|\underline{\mathbf{G}}\|_F$ and $\|\underline{\mathbf{D}}\|_F$. However, there is no straightforward tensor decomposition method to maximize $\|\underline{\mathbf{G}}\|_F$ and $\|\underline{\mathbf{D}}\|_F$ simultaneously over the factor matrices $\{\mathbf{P}^{(n)}\}_{n=1}^{N-1}$, $\{\mathbf{Q}^{(m)}\}_{m=1}^{M-1}$ and a common latent vector $\mathbf{t}$. To this end, we propose to maximize a product of the norms of the two core tensors, i.e., $\max\{\|\underline{\mathbf{G}}\|_F^2\,\|\underline{\mathbf{D}}\|_F^2\}$. Since the latent vector $\mathbf{t}$ is determined by $\mathbf{P}^{(n)}$, $\mathbf{Q}^{(m)}$, the first step is to optimize the orthonormal loadings; the common latent vectors can then be computed from the fixed loadings.

Proposition 3.3. Let $\underline{\mathbf{G}} \in \mathbb{R}^{1 \times L_2 \times \cdots \times L_N}$ and $\underline{\mathbf{D}} \in \mathbb{R}^{1 \times K_2 \times \cdots \times K_M}$; then $\|\langle \underline{\mathbf{G}}, \underline{\mathbf{D}} \rangle_{\{1;1\}}\|_F^2 = \|\underline{\mathbf{G}}\|_F^2\,\|\underline{\mathbf{D}}\|_F^2$.

Proof:

$$\|\langle \underline{\mathbf{G}}, \underline{\mathbf{D}} \rangle_{\{1;1\}}\|_F^2 = \|\mathrm{vec}(\underline{\mathbf{G}})\,\mathrm{vec}^T(\underline{\mathbf{D}})\|_F^2 = \mathrm{trace}\big(\mathrm{vec}(\underline{\mathbf{D}})\mathrm{vec}^T(\underline{\mathbf{G}})\mathrm{vec}(\underline{\mathbf{G}})\mathrm{vec}^T(\underline{\mathbf{D}})\big) = \|\mathrm{vec}(\underline{\mathbf{G}})\|_F^2\,\|\mathrm{vec}(\underline{\mathbf{D}})\|_F^2, \quad (16)$$

where $\mathrm{vec}(\underline{\mathbf{G}}) \in \mathbb{R}^{L_2 L_3 \cdots L_N}$ is the vectorization of the tensor $\underline{\mathbf{G}}$.

From Proposition 3.3, observe that maximizing $\|\underline{\mathbf{G}}\|_F^2\,\|\underline{\mathbf{D}}\|_F^2$ is equivalent to maximizing $\|\langle \underline{\mathbf{G}}, \underline{\mathbf{D}} \rangle_{\{1;1\}}\|_F^2$. According to (15) and $\mathbf{t}^T\mathbf{t} = 1$, $\|\langle \underline{\mathbf{G}}, \underline{\mathbf{D}} \rangle_{\{1;1\}}\|_F^2$ can be expressed as

$$\big\| [\![\langle \underline{\mathbf{X}}, \underline{\mathbf{Y}} \rangle_{\{1;1\}};\; \mathbf{P}^{(1)T}, \ldots, \mathbf{P}^{(N-1)T}, \mathbf{Q}^{(1)T}, \ldots, \mathbf{Q}^{(M-1)T}]\!] \big\|_F^2. \quad (17)$$

Note that this form is quite similar to the optimization problem for two-way PLS in (10), where the cross-covariance matrix $\mathbf{X}^T\mathbf{Y}$ is replaced by $\langle \underline{\mathbf{X}}, \underline{\mathbf{Y}} \rangle_{\{1;1\}}$. In addition, the optimization term becomes the norm of a small tensor, in contrast to a scalar in (10). Thus, if we define $\langle \underline{\mathbf{X}}, \underline{\mathbf{Y}} \rangle_{\{1;1\}}$ as a mode-1 cross-covariance tensor $\underline{\mathbf{C}} = \mathrm{COV}_{\{1;1\}}(\underline{\mathbf{X}}, \underline{\mathbf{Y}}) \in \mathbb{R}^{I_2 \times \cdots \times I_N \times J_2 \times \cdots \times J_M}$, the optimization problem can finally be formulated as

$$\max_{\{\mathbf{P}^{(n)}, \mathbf{Q}^{(m)}\}} \big\| [\![\underline{\mathbf{C}};\; \mathbf{P}^{(1)T}, \ldots, \mathbf{P}^{(N-1)T}, \mathbf{Q}^{(1)T}, \ldots, \mathbf{Q}^{(M-1)T}]\!] \big\|_F^2, \quad \text{s.t.}\;\; \mathbf{P}^{(n)T}\mathbf{P}^{(n)} = \mathbf{I}_{L_{n+1}},\; \mathbf{Q}^{(m)T}\mathbf{Q}^{(m)} = \mathbf{I}_{K_{m+1}}, \quad (18)$$

where $\mathbf{P}^{(n)}$, $n = 1, \ldots, N-1$, and $\mathbf{Q}^{(m)}$, $m = 1, \ldots, M-1$, are the parameters to optimize. Based on Proposition 3.2 and the orthogonality of $\mathbf{P}^{(n)}$, $\mathbf{Q}^{(m)}$, the optimization problem in (18) is equivalent to finding the best subspace approximation of $\underline{\mathbf{C}}$ as

$$\underline{\mathbf{C}} \approx [\![\underline{\mathbf{G}}^{(C)};\; \mathbf{P}^{(1)}, \ldots, \mathbf{P}^{(N-1)}, \mathbf{Q}^{(1)}, \ldots, \mathbf{Q}^{(M-1)}]\!], \quad (19)$$

which can be obtained by a rank-$(L_2, \ldots, L_N, K_2, \ldots, K_M)$ HOSVD on the tensor $\underline{\mathbf{C}}$. Based on Proposition 3.1, the optimization term in (18) is equivalent to the norm of the core tensor $\underline{\mathbf{G}}^{(C)}$. To achieve this goal, the higher-order orthogonal iteration (HOOI) algorithm [6], [7], which is known to converge fast, is employed to find the parameters $\mathbf{P}^{(n)}$ and $\mathbf{Q}^{(m)}$ by orthogonal Tucker decomposition of $\underline{\mathbf{C}}$. Subsequently, based on the estimates of the loadings $\mathbf{P}^{(n)}$ and $\mathbf{Q}^{(m)}$, we can compute the common latent vector $\mathbf{t}$. Since our goal is to predict $\underline{\mathbf{Y}}$ from $\underline{\mathbf{X}}$, similarly to the PLS method, we need to estimate $\mathbf{t}$ from the predictors $\underline{\mathbf{X}}$ and also to estimate the regression coefficient $\underline{\mathbf{D}}$ in order to predict the responses $\underline{\mathbf{Y}}$.
For a given set of loading matrices $\{\mathbf{P}^{(n)}\}$, the latent vector $\mathbf{t}$ should explain the variance of $\underline{\mathbf{X}}$ as much as possible, that is,

$$\mathbf{t} = \arg\min_{\mathbf{t}} \big\|\underline{\mathbf{X}} - [\![\underline{\mathbf{G}};\; \mathbf{t}, \mathbf{P}^{(1)}, \ldots, \mathbf{P}^{(N-1)}]\!]\big\|_F, \quad (20)$$

which can be easily achieved by choosing $\mathbf{t}$ as the first leading left singular vector of the matrix $\big(\underline{\mathbf{X}} \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T}\big)_{(1)}$, as used in the HOOI algorithm (see [6], [5]). The core tensors $\underline{\mathbf{G}}$ and $\underline{\mathbf{D}}$ are then computed by (15). The above procedure should be carried out repeatedly using the deflation operation, until an appropriate number of components (i.e., $R$) is obtained, or the norms of the residuals are smaller than a certain threshold. The deflation is performed by subtracting from $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ the information explained by a rank-$(1, L_2, \ldots, L_N)$ tensor $\widetilde{\underline{\mathbf{X}}}$ and a rank-$(1, K_2, \ldots, K_M)$ tensor $\widehat{\underline{\mathbf{Y}}}$, respectively. Note that the latent vectors are not orthogonal in the HOPLS algorithm, which is related to the deflation; the theoretical explanation and proof are given in the supplementary material. The HOPLS algorithm is outlined in Algorithm 1.
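One pass of this sequential procedure can be sketched in NumPy as follows. This is an illustration under simplifying assumptions rather than the authors' implementation: the loadings of the cross-covariance tensor are obtained by a plain truncated HOSVD instead of the full HOOI iterations, and all names (`hopls_one_component`, `mode_n_product`, etc.) are ours.

```python
import numpy as np

def mode_n_product(T, A, n):
    """Mode-n product T x_n A for a matrix A of shape (J_n, I_n)."""
    return np.moveaxis(np.tensordot(T, A, axes=(n, 1)), -1, n)

def project(T, factors, start_mode=1):
    """Apply A.T on each mode of T, starting from `start_mode`."""
    for k, A in enumerate(factors, start=start_mode):
        T = mode_n_product(T, A.T, k)
    return T

def hosvd_factors(C, ranks):
    """Truncated HOSVD: leading left singular vectors of each mode-n unfolding."""
    factors = []
    for n, rank in enumerate(ranks):
        Cn = np.moveaxis(C, n, 0).reshape(C.shape[n], -1)
        factors.append(np.linalg.svd(Cn, full_matrices=False)[0][:, :rank])
    return factors

def hopls_one_component(E, F, L, K):
    """One HOPLS step for tensor residuals E (from X) and F (from Y):
    loadings from C = <E, F>_{1;1}, latent vector from E, cores by
    projection (Prop. 3.1), then deflation."""
    N = E.ndim
    C = np.tensordot(E, F, axes=(0, 0))              # mode-1 cross-covariance tensor
    facs = hosvd_factors(C, list(L) + list(K))
    P, Q = facs[:N - 1], facs[N - 1:]
    Ep = project(E, P)                               # shape (I_1, L_2, ..., L_N)
    t = np.linalg.svd(Ep.reshape(E.shape[0], -1), full_matrices=False)[0][:, 0]
    G = project(mode_n_product(E, t[None, :], 0), P) # core tensors, Eq. (15)
    D = project(mode_n_product(F, t[None, :], 0), Q)
    Xhat, Yhat = mode_n_product(G, t[:, None], 0), mode_n_product(D, t[:, None], 0)
    for k, A in enumerate(P, start=1):
        Xhat = mode_n_product(Xhat, A, k)
    for k, A in enumerate(Q, start=1):
        Yhat = mode_n_product(Yhat, A, k)
    return t, P, Q, G, D, E - Xhat, F - Yhat         # deflated residuals last

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((20, 8, 7)), rng.standard_normal((20, 6, 5))
t, P, Q, G, D, E1, F1 = hopls_one_component(X, Y, L=(3, 3), K=(2, 2))
print(E1.shape, np.linalg.norm(E1) < np.linalg.norm(X))  # residual norm decreases
```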

Algorithm 1: The Higher-Order Partial Least Squares (HOPLS) algorithm for a tensor X and a tensor Y

Input: $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, $\underline{\mathbf{Y}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$, $N \geq 3$, $M \geq 3$ and $I_1 = J_1$. The number of latent vectors is $R$ and the numbers of loading vectors are $\{L_n\}_{n=2}^{N}$ and $\{K_m\}_{m=2}^{M}$.
Output: $\{\mathbf{P}^{(n)}_r\}$; $\{\mathbf{Q}^{(m)}_r\}$; $\{\underline{\mathbf{G}}_r\}$; $\{\underline{\mathbf{D}}_r\}$; $\mathbf{T}$; $r = 1, \ldots, R$; $n = 1, \ldots, N-1$; $m = 1, \ldots, M-1$.
Initialization: $\underline{\mathbf{E}}_1 \leftarrow \underline{\mathbf{X}}$, $\underline{\mathbf{F}}_1 \leftarrow \underline{\mathbf{Y}}$.
for $r = 1$ to $R$ do
  if $\|\underline{\mathbf{E}}_r\|_F > \varepsilon$ and $\|\underline{\mathbf{F}}_r\|_F > \varepsilon$ then
    $\underline{\mathbf{C}}_r \leftarrow \langle \underline{\mathbf{E}}_r, \underline{\mathbf{F}}_r \rangle_{\{1;1\}}$;
    rank-$(L_2, \ldots, L_N, K_2, \ldots, K_M)$ orthogonal Tucker decomposition of $\underline{\mathbf{C}}_r$ by HOOI [6] as $\underline{\mathbf{C}}_r \approx [\![\underline{\mathbf{G}}^{(C_r)}_r;\; \mathbf{P}^{(1)}_r, \ldots, \mathbf{P}^{(N-1)}_r, \mathbf{Q}^{(1)}_r, \ldots, \mathbf{Q}^{(M-1)}_r]\!]$;
    $\mathbf{t}_r \leftarrow$ the first leading left singular vector from the SVD of $\big(\underline{\mathbf{E}}_r \times_2 \mathbf{P}^{(1)T}_r \cdots \times_N \mathbf{P}^{(N-1)T}_r\big)_{(1)}$;
    $\underline{\mathbf{G}}_r \leftarrow [\![\underline{\mathbf{E}}_r;\; \mathbf{t}^T_r, \mathbf{P}^{(1)T}_r, \ldots, \mathbf{P}^{(N-1)T}_r]\!]$;
    $\underline{\mathbf{D}}_r \leftarrow [\![\underline{\mathbf{F}}_r;\; \mathbf{t}^T_r, \mathbf{Q}^{(1)T}_r, \ldots, \mathbf{Q}^{(M-1)T}_r]\!]$;
    Deflation: $\underline{\mathbf{E}}_{r+1} \leftarrow \underline{\mathbf{E}}_r - [\![\underline{\mathbf{G}}_r;\; \mathbf{t}_r, \mathbf{P}^{(1)}_r, \ldots, \mathbf{P}^{(N-1)}_r]\!]$; $\underline{\mathbf{F}}_{r+1} \leftarrow \underline{\mathbf{F}}_r - [\![\underline{\mathbf{D}}_r;\; \mathbf{t}_r, \mathbf{Q}^{(1)}_r, \ldots, \mathbf{Q}^{(M-1)}_r]\!]$;
  else break;
  end if
end for

3.3 The case of a tensor X and a matrix Y

Consider an Nth-order independent tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ $(N \geq 3)$ and a two-way dependent data matrix $\mathbf{Y} \in \mathbb{R}^{I_1 \times M}$, with the same sample size $I_1$. HOPLS models the independent data $\underline{\mathbf{X}}$ as a sum of rank-$(1, L_2, \ldots, L_N)$ tensors, while the dependent data $\mathbf{Y}$ is represented by a sum of rank-one matrices as

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \underline{\mathbf{G}}_r \times_1 \mathbf{t}_r \times_2 \mathbf{P}^{(1)}_r \cdots \times_N \mathbf{P}^{(N-1)}_r + \underline{\mathbf{E}}_R, \qquad \mathbf{Y} = \sum_{r=1}^{R} \mathbf{u}_r \mathbf{q}^T_r + \mathbf{F}_R, \quad (21)$$

where $\|\mathbf{q}_r\| = 1$, $\|\mathbf{t}_r\| = 1$ and $\mathbf{P}^{(n)T}_r \mathbf{P}^{(n)}_r = \mathbf{I}$, $\forall n$. Similarly to the standard PLS method, the assumption employed is that of a linear inner relationship between the latent variables $\mathbf{t}_r$ and $\mathbf{u}_r$, that is, $\mathbf{U} = \mathbf{T}\mathbf{D} + \mathbf{Z}$, where $\mathbf{D}$ is a diagonal matrix and $\mathbf{Z}$ denotes the Gaussian residuals. A combination with (21) yields the decomposition of $\mathbf{Y}$ in the form

$$\mathbf{Y} = \mathbf{T}\mathbf{D}\mathbf{Q}^T + \mathbf{F}_R = \sum_{r=1}^{R} d_r \mathbf{t}_r \mathbf{q}^T_r + \mathbf{F}_R. \quad (22)$$

The following proposition states the condition for the optimal selection of $\mathbf{u}$ given $\mathbf{Y}$ and $\mathbf{q}$.

Proposition 3.4. Let $\mathbf{Y} \in \mathbb{R}^{I \times M}$ and let $\mathbf{q} \in \mathbb{R}^{M}$ be of length one; then $\min_{\mathbf{u}} \|\mathbf{Y} - \mathbf{u}\mathbf{q}^T\|_F$ is solved by $\mathbf{u} = \mathbf{Y}\mathbf{q}$. In other words, a linear combination of the columns of $\mathbf{Y}$ using a weighting vector $\mathbf{q}$ of length one is second-order optimal for the approximation of $\mathbf{Y}$ in the least-squares sense.

Proof: Since $\mathbf{q}$ is given and $\|\mathbf{q}\| = 1$, the ordinary least-squares solution to the problem is $\mathbf{u} = \mathbf{Y}\mathbf{q}(\mathbf{q}^T\mathbf{q})^{-1}$, and hence $\mathbf{u} = \mathbf{Y}\mathbf{q}$ gives the best fit of $\mathbf{Y}$ for that $\mathbf{q}$.

The sequential deflation-type optimization is employed for the model parameters. Observe that, according to Proposition 3.4, the optimal fit is achieved for $\mathbf{u} = \mathbf{Y}\mathbf{q}$. Thus, combining the linear relation between $\mathbf{t}$ and $\mathbf{u}$ with the expression for the core tensor $\underline{\mathbf{G}}$ in (15), we can optimize the parameters of the X-loading matrices $\mathbf{P}^{(n)}$ and the Y-loading vector $\mathbf{q}$ via

$$\max_{\{\mathbf{P}^{(n)}, \mathbf{q}\}} \big\| (\underline{\mathbf{X}} \times_1 \mathbf{Y}^T) \times_1 \mathbf{q}^T \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T} \big\|_F^2, \quad \text{s.t.}\;\; \mathbf{P}^{(n)T}\mathbf{P}^{(n)} = \mathbf{I},\; \|\mathbf{q}\|_F = 1. \quad (23)$$

This form is similar to (18), but has a different cross-covariance tensor $\underline{\mathbf{C}} = \underline{\mathbf{X}} \times_1 \mathbf{Y}^T$, defined between a tensor and a matrix, implying that the problem can be solved by performing a rank-$(1, L_2, \ldots, L_N)$ HOSVD on $\underline{\mathbf{C}}$, which also makes it possible to compute the core tensor $\underline{\mathbf{G}}^{(C)}$ corresponding to $\underline{\mathbf{C}}$. Subsequently, the latent vector $\mathbf{t}$ is estimated from $\underline{\mathbf{X}}$ using the given loading matrices $\mathbf{P}^{(n)}$. The mode-1 matricization of the model for $\underline{\mathbf{X}}$ allows us to write

$$\mathbf{X}_{(1)} = \mathbf{t}\,\mathbf{G}_{(1)}\big(\mathbf{P}^{(N-1)} \otimes \cdots \otimes \mathbf{P}^{(1)}\big)^T + \mathbf{E}_{(1)}, \quad (24)$$

where $\mathbf{G}_{(1)} \in \mathbb{R}^{1 \times L_2 L_3 \cdots L_N}$ still remains unknown. However, owing to the linear relationship between the core tensor $\underline{\mathbf{G}}$ (i.e., $[\![\underline{\mathbf{X}};\; \mathbf{t}^T, \mathbf{P}^{(1)T}, \ldots, \mathbf{P}^{(N-1)T}]\!]$) and the core tensor $\underline{\mathbf{G}}^{(C)}$ (i.e., $[\![\underline{\mathbf{C}};\; \mathbf{q}^T, \mathbf{P}^{(1)T}, \ldots, \mathbf{P}^{(N-1)T}]\!]$), such that $\underline{\mathbf{G}}^{(C)} = d\,\underline{\mathbf{G}}$, a least-squares solution for the normalized $\mathbf{t}$, which minimizes the squared norm of the residual $\|\mathbf{E}_{(1)}\|_F^2$, can be obtained from

$$\mathbf{t} \leftarrow \big(\underline{\mathbf{X}} \times_2 \mathbf{P}^{(1)T} \cdots \times_N \mathbf{P}^{(N-1)T}\big)_{(1)}\, \mathbf{G}^{(C)+}_{(1)}, \qquad \mathbf{t} \leftarrow \mathbf{t}/\|\mathbf{t}\|_F, \quad (25)$$

where we used the property that the $\mathbf{P}^{(n)}$ are column-wise orthonormal, and the symbol $+$ denotes the Moore-Penrose pseudoinverse. Based on the estimated latent vector $\mathbf{t}$ of length one and the loading $\mathbf{q}$, the regression coefficient used to predict $\mathbf{Y}$ is computed as

$$d = \mathbf{t}^T\mathbf{u} = \mathbf{t}^T\mathbf{Y}\mathbf{q}. \quad (26)$$
The procedure for a two-way response matrix is summarized in Algorithm 2. In this case, the HOPLS model is shown to unify both the standard PLS and N-PLS when appropriate parameters $L_n$ are selected (the explanation and proof are given in the supplementary material).

3.4 Prediction of the Response Variables

Predictions from new observations $\underline{\mathbf{X}}^{new}$ are performed in two steps: projecting the data onto the low-dimensional latent space based on the model parameters $\underline{\mathbf{G}}_r$, $\mathbf{P}^{(n)}_r$, and predicting the response data based on the latent vectors $\mathbf{T}^{new}$ and the model parameters $\underline{\mathbf{D}}_r$, $\mathbf{Q}^{(m)}_r$. For simplicity, we use a matricized form to express the prediction procedure as

$$\hat{\mathbf{Y}}^{new}_{(1)} \approx \mathbf{T}^{new}\mathbf{Q}^{T} = \mathbf{X}^{new}_{(1)}\mathbf{W}\mathbf{Q}^{T}, \quad (27)$$

where $\mathbf{W}$ and $\mathbf{Q}$ have $R$ columns, represented by

$$\mathbf{w}_r = \big(\mathbf{P}^{(N-1)}_r \otimes \cdots \otimes \mathbf{P}^{(1)}_r\big)\mathbf{G}^{+}_{r(1)}, \qquad \mathbf{q}_r = \Big(\mathbf{D}_{r(1)}\big(\mathbf{Q}^{(M-1)}_r \otimes \cdots \otimes \mathbf{Q}^{(1)}_r\big)^T\Big)^T. \quad (28)$$

In the particular case of a two-way matrix $\mathbf{Y}$, the prediction is performed by

$$\hat{\mathbf{Y}}^{new} \approx \mathbf{X}^{new}_{(1)}\mathbf{W}\mathbf{D}\mathbf{Q}^T, \quad (29)$$

where $\mathbf{D}$ is a diagonal matrix whose entries are $d_r$, and the $r$th column of $\mathbf{Q}$ is $\mathbf{q}_r$, $r = 1, \ldots, R$.
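The following NumPy sketch ties together the matrix-Y case: fitting one component via Eqs. (23)-(26) and predicting via (29) restricted to that component. It is an illustration of the equations as we read them, not the authors' code: a plain truncated HOSVD replaces HOOI, only a single component is extracted (no deflation loop), and all names are ours.

```python
import numpy as np

def mode_n_product(T, A, n):
    """Mode-n product T x_n A for a matrix A of shape (J_n, I_n)."""
    return np.moveaxis(np.tensordot(T, A, axes=(n, 1)), -1, n)

def hopls_fit_matrix_y(X, Y, L):
    """One-component HOPLS for tensor X and matrix Y: rank-(1, L_2, ..., L_N)
    HOSVD of C = X x_1 Y^T gives q and the loadings P^(n); t and d follow
    Eqs. (25)-(26)."""
    C = mode_n_product(X, Y.T, 0)                     # cross-covariance tensor
    ranks, factors = (1,) + tuple(L), []
    for n, rank in enumerate(ranks):                  # truncated HOSVD of C
        Cn = np.moveaxis(C, n, 0).reshape(C.shape[n], -1)
        factors.append(np.linalg.svd(Cn, full_matrices=False)[0][:, :rank])
    q, P = factors[0][:, 0], factors[1:]
    Gc = C
    for n, A in enumerate(factors):
        Gc = mode_n_product(Gc, A.T, n)               # core of C
    Xp = X
    for n, A in enumerate(P, start=1):
        Xp = mode_n_product(Xp, A.T, n)               # X projected on the loadings
    t = Xp.reshape(X.shape[0], -1) @ np.linalg.pinv(Gc.reshape(1, -1))
    t = (t / np.linalg.norm(t)).ravel()
    G = Xp                                            # core of X by projection...
    G = mode_n_product(X, t[None, :], 0)
    for n, A in enumerate(P, start=1):
        G = mode_n_product(G, A.T, n)
    d = t @ Y @ q                                     # Eq. (26)
    return P, q, t, d, G

def hopls_predict_matrix_y(X_new, P, q, d, G):
    """Prediction for new data, Eq. (29) for a single component."""
    Xp = X_new
    for n, A in enumerate(P, start=1):
        Xp = mode_n_product(Xp, A.T, n)
    t_new = Xp.reshape(X_new.shape[0], -1) @ np.linalg.pinv(G.reshape(1, -1))
    return d * t_new @ q[None, :]

# Usage on nearly rank-one data sharing a latent score vector.
rng = np.random.default_rng(4)
t_true = rng.standard_normal(50)
X = np.einsum('i,j,k->ijk', t_true, rng.standard_normal(8), rng.standard_normal(6))
X += 0.1 * rng.standard_normal(X.shape)
Y = np.outer(t_true, rng.standard_normal(4)) + 0.1 * rng.standard_normal((50, 4))
P, q, t, d, G = hopls_fit_matrix_y(X, Y, L=(2, 2))
Y_hat = hopls_predict_matrix_y(X, P, q, d, G)
print(1 - np.linalg.norm(Y - Y_hat) ** 2 / np.linalg.norm(Y) ** 2)  # in-sample Q^2, high here
```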

Algorithm 2: Higher-Order Partial Least Squares (HOPLS) for a tensor X and a matrix Y

Input: $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, $N \geq 3$, and $\mathbf{Y} \in \mathbb{R}^{I_1 \times M}$. The number of latent vectors is $R$ and the numbers of loadings are $\{L_n\}_{n=2}^{N}$.
Output: $\{\mathbf{P}^{(n)}_r\}$; $\mathbf{Q}$; $\{\underline{\mathbf{G}}_r\}$; $\mathbf{D}$; $\mathbf{T}$; $r = 1, \ldots, R$, $n = 1, \ldots, N-1$.
Initialization: $\underline{\mathbf{E}}_1 \leftarrow \underline{\mathbf{X}}$, $\mathbf{F}_1 \leftarrow \mathbf{Y}$.
for $r = 1$ to $R$ do
  if $\|\underline{\mathbf{E}}_r\|_F > \varepsilon$ and $\|\mathbf{F}_r\|_F > \varepsilon$ then
    $\underline{\mathbf{C}}_r \leftarrow \underline{\mathbf{E}}_r \times_1 \mathbf{F}^T_r$;
    perform a rank-$(1, L_2, \ldots, L_N)$ HOOI on $\underline{\mathbf{C}}_r$ as $\underline{\mathbf{C}}_r \approx \underline{\mathbf{G}}^{(C)}_r \times_1 \mathbf{q}_r \times_2 \mathbf{P}^{(1)}_r \cdots \times_N \mathbf{P}^{(N-1)}_r$;
    $\mathbf{t}_r \leftarrow \big(\underline{\mathbf{E}}_r \times_2 \mathbf{P}^{(1)T}_r \cdots \times_N \mathbf{P}^{(N-1)T}_r\big)_{(1)}\big(\mathrm{vec}^T(\underline{\mathbf{G}}^{(C)}_r)\big)^{+}$; $\mathbf{t}_r \leftarrow \mathbf{t}_r/\|\mathbf{t}_r\|_F$;
    $\underline{\mathbf{G}}_r \leftarrow [\![\underline{\mathbf{E}}_r;\; \mathbf{t}^T_r, \mathbf{P}^{(1)T}_r, \ldots, \mathbf{P}^{(N-1)T}_r]\!]$;
    $\mathbf{u}_r \leftarrow \mathbf{F}_r\mathbf{q}_r$; $d_r \leftarrow \mathbf{u}^T_r\mathbf{t}_r$;
    Deflation: $\underline{\mathbf{E}}_{r+1} \leftarrow \underline{\mathbf{E}}_r - [\![\underline{\mathbf{G}}_r;\; \mathbf{t}_r, \mathbf{P}^{(1)}_r, \ldots, \mathbf{P}^{(N-1)}_r]\!]$; $\mathbf{F}_{r+1} \leftarrow \mathbf{F}_r - d_r\mathbf{t}_r\mathbf{q}^T_r$;
  end if
end for

3.5 Properties of HOPLS

Robustness to noise. An additional constraint of keeping only the largest $\{L_n\}_{n=2}^{N}$ loading vectors on each mode is imposed in HOPLS, resulting in a flexible model that balances the two objectives of fitness and the significance of the associated latent variables. For instance, a larger $L_n$ may fit $\underline{\mathbf{X}}$ better but introduces more noise into each latent vector. In contrast, N-PLS is more robust due to the strong constraint of the rank-one tensor structure, while lacking good fit to the data. The flexibility of HOPLS allows us to adapt the model complexity to the dataset at hand, providing considerable prediction ability (see Figs. 4 and 6).

Large p, Small n problem. This is particularly important when the dimension of the independent variables is high. In contrast to PLS, the dimension of the model parameters that need to be optimized in HOPLS is relatively low. For a third-order tensor $\underline{\mathbf{X}}$ with few samples and many features, applying PLS to the unfolding $\mathbf{X}_{(1)}$ leaves only those few samples available to optimize a loading vector $\mathbf{p}$ whose dimension equals the total number of features, resulting in an unreliable estimate of the model parameters. In contrast, HOPLS optimizes relatively low-dimensional loading vectors on each mode alternately; each mode-$n$ loading vector has only $I_{n+1}$ entries while the remaining modes act as repeated observations, so the effective number of samples per parameter is significantly elevated. Thus, a more robust estimate of the low-dimensional loading vectors can be obtained, which is also less prone to overfitting and more suitable for the Large p, Small n problem (see Fig. 4).

Ease of interpretation. The loading vectors in $\mathbf{P}^{(n)}$ reveal new subspace patterns corresponding to the mode-$n$ features. In contrast, the loadings from Unfold-PLS are difficult to interpret, since the data structure is destroyed by the unfolding operation and the dimension of the loadings is relatively high.

Computation. N-PLS is implemented by combining a NIPALS-like algorithm with the CP decomposition. Instead of using an iterative algorithm, HOPLS can find the model parameters using a closed-form solution, i.e., applying HOSVD to the cross-covariance tensor, resulting in enhanced computational efficiency.

Due to the flexibility of HOPLS, the tuning parameters $L_n$ and $K_m$, which control the model complexity, need to be selected based on calibration data. Similarly to the parameter $R$, the tuning parameters can be chosen by cross-validation.
For simplicity, two alternative assumptions will be utilized: a) $\forall n, m$: $L_n = K_m = \lambda$; b) $L_n = \eta R_n$, $K_m = \eta R_m$, $0 < \eta \leq 1$, i.e., explaining the same percentage of the mode-$n$ variance.

4 EXPERIMENTAL RESULTS

In the simulations, HOPLS and N-PLS were used to model the data in tensor form, whereas PLS was performed on a mode-1 matricization of the same tensors. To quantify the predictability, the index $Q^2$ was defined as $Q^2 = 1 - \|\mathbf{Y} - \hat{\mathbf{Y}}\|_F^2 / \|\mathbf{Y}\|_F^2$, where $\hat{\mathbf{Y}}$ denotes the prediction of $\mathbf{Y}$ using a model created from a calibration dataset. The root mean square error of prediction (RMSEP) was also used for evaluation [48].

4.1 Synthetic data

In order to quantitatively benchmark our algorithm against the state of the art, an extensive comparative exploration has been performed on synthetic datasets to evaluate the prediction performance under varying conditions with respect to data structure, noise levels and the ratio of variable dimension to sample size. For parameter selection, the number of latent vectors ($R$) and the number of loadings ($L_n = K_m = \lambda$) were chosen based on five-fold cross-validation on the calibration dataset. To reduce random fluctuations, evaluations were performed over 5 validation datasets generated repeatedly according to the same criteria.
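The two evaluation metrics are straightforward to compute; the following sketch assumes the $Q^2$ definition above and takes RMSEP as the root-mean-square error over all entries (the exact convention in [48] may differ slightly):

```python
import numpy as np

def q_squared(Y, Y_hat):
    """Predictive index Q^2 = 1 - ||Y - Y_hat||_F^2 / ||Y||_F^2."""
    return 1.0 - np.linalg.norm(Y - Y_hat) ** 2 / np.linalg.norm(Y) ** 2

def rmsep(Y, Y_hat):
    """Root mean square error of prediction over all entries."""
    return np.sqrt(np.mean((Y - Y_hat) ** 2))

# Example: a prediction that explains most of the variance.
rng = np.random.default_rng(5)
Y = rng.standard_normal((100, 3))
Y_hat = Y + 0.1 * rng.standard_normal(Y.shape)
print(q_squared(Y, Y_hat), rmsep(Y, Y_hat))  # Q^2 near 0.99, RMSEP near 0.1
```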

4.1.1 Datasets with matrix structure

The independent data $\mathbf{X}$ and dependent data $\mathbf{Y}$ were generated as:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \xi\mathbf{E}, \qquad \mathbf{Y} = \mathbf{T}\mathbf{Q}^T + \xi\mathbf{F}, \quad (30)$$

where the latent variables $\{\mathbf{t}, \mathbf{p}, \mathbf{q}\} \sim \mathcal{N}(0, 1)$ and $\mathbf{E}$, $\mathbf{F}$ are Gaussian noises whose level is controlled by the parameter $\xi$. Both the calibration and the validation datasets were generated according to (30), with the same loadings $\mathbf{P}$, $\mathbf{Q}$ but a different latent matrix $\mathbf{T}$ following the same distribution $\mathcal{N}(0, 1)$. Subsequently, the datasets were reorganized as Nth-order tensors. To investigate how the prediction performance is affected by noise levels and small sample size, $\{\underline{\mathbf{X}}, \underline{\mathbf{Y}}\}$ (Case 1) and a smaller-sample counterpart (Case 2) were generated under varying noise levels of 10 dB, 5 dB, 0 dB and -5 dB. In Case 3, $\{\underline{\mathbf{X}}, \underline{\mathbf{Y}}\}$ were generated with the loadings $\mathbf{P}$, $\mathbf{Q}$ drawn from a uniform distribution $\mathcal{U}(0, 1)$. The datasets were generated from five latent variables (i.e., $\mathbf{T}$ has five columns) for all three cases.

Fig. 3: Five-fold cross-validation performance of HOPLS at different noise levels versus the number of latent variables (R) and loadings (λ). The optimal values for these two parameters are marked by green squares.

TABLE 1: The selection of the parameters R and λ for PLS, N-PLS and HOPLS at different noise levels in Case 1.

There are two tuning parameters for HOPLS, i.e., the number of latent variables $R$ and the number of loadings $\lambda$, and only one parameter ($R$) for PLS and N-PLS, that need to be selected appropriately. The number of latent variables $R$ is crucial to prediction performance, resulting in undermodelling when $R$ is too small and easy overfitting when $R$ is too large. The cross-validations were performed with $R$ and $\lambda$ varied over a grid with a fixed step length. In order to alleviate the computational burden, the procedure was stopped when the performance started to decrease with increasing $\lambda$. Fig. 3 shows the grid of cross-validation performance of HOPLS in Case 1, with the optimal parameters marked by green squares. Observe that the optimal $\lambda$ for HOPLS is related to the noise level: for increasing noise levels, the best performance is obtained with smaller $\lambda$, implying that only a few significant loadings on each mode are kept in the latent space. This is expected, due to the fact that the model complexity is controlled by $\lambda$ to suppress noise. The optimal $R$ and $\lambda$ for all three methods at different noise levels are shown in Table 1.

Fig. 4: The prediction performance comparison among HOPLS, N-PLS and PLS at different noise levels for the three cases. Case 1: $\{\mathbf{P}, \mathbf{Q}\} \sim \mathcal{N}(0, 1)$; Case 2: smaller sample size and $\{\mathbf{P}, \mathbf{Q}\} \sim \mathcal{N}(0, 1)$; Case 3: $\{\mathbf{P}, \mathbf{Q}\} \sim \mathcal{U}(0, 1)$.

After the selection of the parameters, HOPLS, N-PLS and PLS were re-trained on the whole calibration dataset using the optimal $R$ and $\lambda$, and were applied to the validation datasets for evaluation. Fig. 4 illustrates the predictive performance over the 5 validation datasets for the three cases at four different noise levels. In Case 1, where a relatively larger sample size was available, at SNR = 10 dB HOPLS achieved a prediction performance similar to PLS while outperforming N-PLS. With increasing noise level in both the calibration and validation datasets, HOPLS showed a relatively stable performance, whereas the performance of PLS decreased significantly.
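The benchmark data of Eq. (30) can be generated with the short NumPy sketch below. It is an illustration only: the dimensions, the tensor reshaping and the way the noise level $\xi$ is tied to a target SNR are our assumptions, not the authors' exact settings.

```python
import numpy as np

def synth_matrix_structure(I, J, M, R=5, snr_db=10.0, shape_X=None, rng=None):
    """Generate {X, Y} as in Eq. (30): X = T P^T + xi*E, Y = T Q^T + xi*F,
    with N(0,1) latent variables and loadings, then reorganize X as a tensor."""
    rng = np.random.default_rng() if rng is None else rng
    T = rng.standard_normal((I, R))
    P, Q = rng.standard_normal((J, R)), rng.standard_normal((M, R))
    X, Y = T @ P.T, T @ Q.T

    def add_noise(S):
        # scale the noise so that the signal-to-noise ratio equals snr_db
        noise = rng.standard_normal(S.shape)
        xi = np.linalg.norm(S) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        return S + xi * noise

    X, Y = add_noise(X), add_noise(Y)
    if shape_X is not None:
        X = X.reshape((I,) + tuple(shape_X))   # reorganize as an Nth-order tensor
    return X, Y

X, Y = synth_matrix_structure(I=20, J=64, M=10, snr_db=5.0, shape_X=(8, 8),
                              rng=np.random.default_rng(6))
print(X.shape, Y.shape)  # (20, 8, 8) (20, 10)
```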

The superiority of HOPLS was shown clearly with increasing noise level. In Case 2, where a smaller sample size was available, HOPLS exhibited better performance than the other two models, and the superiority of HOPLS was more pronounced at high noise levels, especially for SNR ≤ 5 dB. These results demonstrate that HOPLS is more robust to noise in comparison with N-PLS and PLS. Comparing Case 1 with Case 2 at different noise levels, the results reveal that the superiority of HOPLS over the other two methods was enhanced in Case 2, illustrating the advantage of HOPLS in modeling datasets with small sample size. Note that N-PLS also showed better performance than PLS at low SNR in Case 2, demonstrating the advantage of modeling the dataset in tensor form for small sample sizes. In Case 3, N-PLS showed much better performance than in Cases 1 and 2, implying sensitivity of N-PLS to the data distribution. With increasing noise level, both HOPLS and N-PLS showed enhanced predictive abilities over PLS.

4.1.2 Datasets with tensor structure

Note that the datasets generated by (30) do not originally possess multi-way structure, although they were organized in tensor form; thus the structural information of the data was not important for prediction. We assume that HOPLS is more suitable for datasets which originally have multi-way structure, i.e., where information carried by interactions among the modes is useful for the regression problem. In order to verify this assumption, the independent data $\underline{\mathbf{X}}$ and dependent data $\underline{\mathbf{Y}}$ were generated according to the Tucker model, which can be regarded as a general model for tensors. The latent variables $\mathbf{t}$ were generated in the same way as described in Section 4.1.1. A sequence of loadings $\mathbf{P}^{(n)}$, $\mathbf{Q}^{(m)}$ and the core tensors were drawn from $\mathcal{N}(0, 1)$. For the validation dataset, the latent matrix $\mathbf{T}$ was generated from the same distribution as for the calibration dataset, while the core tensors and loadings were fixed. Similarly to the study in Section 4.1.1, to investigate how the prediction performance is affected by noise levels and sample size, $\{\underline{\mathbf{X}}, \underline{\mathbf{Y}}\}$ (Case 1) and a smaller-sample counterpart (Case 2) were generated under noise levels of 10 dB, 5 dB, 0 dB and -5 dB. The datasets for both cases were generated from five latent variables.

TABLE 2: The selection of the parameters R and λ for PLS, N-PLS and HOPLS at different noise levels in Case 1.

The optimal parameters $R$ and $\lambda$ are shown in Table 2. Observe that the optimal $R$ is smaller with increasing noise level for all three methods. The parameter $\lambda$ in HOPLS was shown to have similar behavior.

Fig. 5: Five-fold cross-validation performance of HOPLS at different noise levels versus the number of latent variables (R) and loadings (λ). The optimal values for these two parameters are marked by green squares.

In more detail, Fig. 5 exhibits the cross-validation performance grid of HOPLS with respect to $R$ and $\lambda$. When the SNR was 10 dB, the optimal $\lambda$ was 4, while smaller values were selected for 5 dB, 0 dB and -5 dB, respectively. This indicates that the model complexity can be adapted to provide a better model for a given dataset, demonstrating the flexibility of the HOPLS model.

Fig. 6: The prediction performance comparison among HOPLS, N-PLS and PLS at different noise levels for the two cases (i.e., Case 1 and Case 2) with different sample sizes.

The prediction performance evaluated over the 5 validation datasets using HOPLS, N-PLS and PLS with individually selected parameters was compared for different noise levels and different sample sizes (i.e., the two cases). As shown in Fig. 6, for both cases the prediction performance of HOPLS was better than that of both N-PLS and PLS even at the mildest noise level, and the discrepancy among them was enhanced as the SNR decreased toward -5 dB.

The performance of PLS decreased significantly with increasing noise levels, while HOPLS and N-PLS showed relative robustness to noise and outperformed PLS for SNR ≤ 5 dB, illustrating the advantages of tensor-based methods with respect to noisy data. Regarding the small sample size problem, we found that the performance of all three methods decreased when comparing Case 2 with Case 1. Observe that the superiority of HOPLS over N-PLS and PLS was enhanced in Case 2 as compared to Case 1 at all noise levels. A comparison of Fig. 6 and Fig. 4 shows that the performances are significantly improved when datasets having tensor structure are handled by tensor-based methods (e.g., HOPLS and N-PLS). As for N-PLS, it outperformed PLS when the datasets have tensor structure and in the presence of high noise, but it may not perform well when the datasets have no tensor structure. By contrast, HOPLS performed well in both cases; in particular, it outperformed both N-PLS and PLS in the critical cases of high noise and small sample size.

4.1.3 Comparison on matrix response data

In this simulation, the response data was a two-way matrix, so Algorithm 2 of HOPLS was used to evaluate the performance. $\underline{\mathbf{X}}$ and $\mathbf{Y}$ were generated from a full-rank normal distribution $\mathcal{N}(0, 1)$, such that $\mathbf{Y} = \mathbf{X}_{(1)}\mathbf{W}$, where $\mathbf{W}$ was also generated from $\mathcal{N}(0, 1)$. Fig. 7(A) visualizes the predicted and original data, with the red line indicating the ideal prediction. Observe that HOPLS was able to predict the validation dataset with smaller error than PLS and N-PLS. The independent and dependent data are visualized in the latent space in Fig. 7(B).

Fig. 7: (A) The scatter plot of predicted against actual data for each model. (B) Data distribution in the latent spaces. Each blue point denotes one sample of the independent variable, while the red points denote samples of the response variables. (C) The distribution of the squared error of prediction on the validation dataset.

4.2 Decoding of ECoG signals

In [46], ECoG-based decoding of 3D hand trajectories was demonstrated by means of classical PLS regression [49]. The movement of the monkeys was captured by an optical motion capture system (Vicon Motion Systems, USA). In all experiments, each monkey wore a custom-made jacket with reflective markers for motion capture affixed to the left shoulder, elbows, wrists and hand; thus the response data was naturally represented as a third-order tensor (i.e., time × 3D positions × markers). Although PLS can be applied to predict the trajectories corresponding to each marker individually, the structural information among the four markers would then be unused. For every point in time of the movement trajectory, a corresponding ECoG epoch (one second before the current time point) was extracted and transformed to the time-frequency domain in order to construct the predictors. Hence, the independent data is also naturally represented as a higher-order tensor (i.e., epoch × channel × time × frequency). (The Matlab code and one simulation dataset, as well as the ECoG datasets and a more detailed description, are freely available online.) In this study, the proposed HOPLS regression model was applied for decoding movement trajectories based on ECoG signals to verify its effectiveness in real-world applications.
Fig. 8: The scheme for decoding of 3D hand movement trajectories from ECoG signals.

The overall scheme of ECoG decoding is illustrated in Fig. 8. Specifically, the ECoG signals were preprocessed by a band-pass filter and a spatial filter with a common average reference, and the motion marker positions were down-sampled. In order to represent features related to the movement trajectory from the ECoG signals, the Morlet wavelet transformation at different center frequencies (arranged on a logarithmic scale) was used to obtain the time-frequency representation. Epochs of one-second ECoG signals before every time point of the movement data were extracted and down-sampled in order to construct the predictors, which can be represented as channel × time × frequency. We first applied the HOPLS algorithm to predict only the hand movement trajectory, represented as a matrix $\mathbf{Y} \in \mathbb{R}^{I_1 \times 3}$ (time points × 3D positions), from a third-order tensor of ECoG predictors $\underline{\mathbf{X}}$ (epoch × channel × time-frequency) constructed by combining the time and frequency modes into one mode. The ECoG data was divided into a calibration dataset (10 minutes) and a validation dataset (5 minutes).
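The construction of such a time-frequency predictor tensor can be sketched as follows in NumPy. This is an illustration only and not the authors' preprocessing pipeline: the sampling rate, channel count, number of wavelet cycles, center frequencies and epoch spacing are placeholder values.

```python
import numpy as np

def morlet_power(signal, fs, freqs, n_cycles=5):
    """Amplitude envelope of `signal` at each center frequency, obtained by
    convolution with complex Morlet wavelets (Gaussian-windowed exponentials)."""
    out = np.empty((len(freqs), len(signal)))
    for i, f in enumerate(freqs):
        sigma = n_cycles / (2 * np.pi * f)               # Gaussian width in seconds
        t = np.arange(-3 * sigma, 3 * sigma, 1 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma**2))
        wavelet /= np.sum(np.abs(wavelet))
        out[i] = np.abs(np.convolve(signal, wavelet, mode='same'))
    return out

# Placeholder setup: 4 channels, 10 s of surrogate ECoG at fs = 250 Hz,
# one-second epochs before each movement time point, log-spaced frequencies.
rng = np.random.default_rng(7)
fs, n_ch, dur = 250, 4, 10.0
ecog = rng.standard_normal((n_ch, int(fs * dur)))
freqs = np.geomspace(4, 100, num=12)                     # center frequencies (Hz)
epoch_len = int(fs * 1.0)
time_points = np.arange(epoch_len, ecog.shape[1], int(fs * 0.5))

tf = np.stack([morlet_power(ch, fs, freqs) for ch in ecog])   # (channel, freq, time)
# predictor tensor: epoch x channel x time x frequency
X = np.stack([tf[:, :, tp - epoch_len:tp].transpose(0, 2, 1) for tp in time_points])
print(X.shape)  # (n_epochs, n_channels, epoch_len, n_freqs)
```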

Fig. 9: Panels (A) and (B) depict data distributions in the X-latent space T and the Y-latent space U, respectively. (C) presents the joint distribution between the X- and Y-latent vectors.

To select the optimal parameters $L_n$ and $R$, cross-validation was applied on the calibration dataset, and the selected $L_n$ and $R$ were used for the HOPLS model; likewise, the best values of $R$ for PLS and N-PLS were 9 and 6, respectively. The X-latent space is visualized in Fig. 9(A), where each point represents one sample of the independent variables, while the Y-latent space is presented in Fig. 9(B), with each point representing one dependent sample. Observe that the distributions of these two latent variable spaces are quite similar, and two dominant clusters are clearly distinguished. The joint distributions between each $\mathbf{t}_r$ and $\mathbf{u}_r$ are depicted in Fig. 9(C). Two clusters can be observed from the first component, which might be related to movement and non-movement behaviors.

Fig. 10: (A) Spatial loadings $\mathbf{P}^{(1)}_r$ corresponding to the first two latent components. Each row shows 5 significant loading vectors. Likewise, (B) depicts the time-frequency loadings $\mathbf{P}^{(2)}_r$, with the β- and γ-bands exhibiting significant contributions.

Another advantage of HOPLS is better physical interpretation of the model. To investigate how the spatial, spectral and temporal structure of the ECoG data was used to create the regression model, the loading vectors can be regarded as a subspace basis in the spatial and time-frequency domains, as shown in Fig. 10. With regard to the time-frequency loadings, the β- and γ-band activities were most significant, implying the importance of β- and γ-band activities for the encoding of movements; the duration of the β-band was longer than that of the γ-band, which indicates that hand movements were related to a long history of β-band oscillations and a short history of γ-band oscillations. These findings also demonstrate that high gamma-band activity in the premotor cortex is associated with movement preparation, initiation and maintenance [5].

From Table 3, observe that improved prediction performance was achieved by HOPLS for all the performance metrics; in particular, the results showed improvements by HOPLS over N-PLS and over PLS in the correlation coefficient of the X-position, the averaged RMSEP and the averaged $Q^2$.

Since HOPLS enables us to create a regression model between two higher-order tensors, all trajectories recorded from the shoulder, elbow, wrist and hand were constructed as a tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times 3 \times 4}$ (samples × 3D positions × markers). In order to verify the superiority of HOPLS for small sample sizes, we used a short segment of data for calibration and another short segment for validation. The resolution of the time-frequency representations was improved to provide more detailed features; thus we have a fourth-order tensor $\underline{\mathbf{X}}$ (samples × channels × time × frequency). The prediction performances of HOPLS, N-PLS and PLS are shown in Fig. 11, illustrating the effectiveness of HOPLS when the response data originally has tensor structure.

Time-frequency features of ECoG epochs for adjacent samples are heavily overlapped, resulting in considerable information redundancy and a high computational burden. In addition, it is generally not necessary to predict behaviors with a high time resolution.
Hence, an additional analysis was performed by down-sampling the motion marker positions, to ensure that non-overlapping features were used in any adjacent samples. The cross-validation performance was evaluated for all the markers on the ten-minute calibration dataset; the best performance for PLS of $Q^2 = .9$ was obtained using $R = $, for N-PLS it was $Q^2 = .$ obtained with $R = 5$, and for HOPLS it was $Q^2 = .8$ obtained with $R = 4$, $\lambda = 5$. The prediction performances on the five-minute validation dataset are shown in Fig. 12, implying significant improvements obtained by HOPLS over N-PLS and PLS for all four markers. For visualization, Fig. 13 exhibits the observed and predicted 3D hand trajectories in a 5 s time window. A video is also available for visualization of the results.

TABLE 3: Comprehensive comparison of HOPLS, N-PLS and Unfold-PLS for the prediction of 3D hand trajectories on the four datasets recorded from the same monkey brain, in terms of $Q^2$ (ECoG), $Q^2$ (3D hand positions), RMSEP (3D hand positions) and the correlation coefficients for the X, Y and Z positions. The numbers of latent vectors for HOPLS, N-PLS and Unfold-PLS were , 6, and 9, respectively.

Fig. 11: The prediction performance of 3D trajectories recorded from the shoulder, elbow, wrist and hand. The optimal $R$ are 6, 8, 49 for PLS, N-PLS and HOPLS, respectively, and $\lambda = 5$ for HOPLS.

Fig. 12: The prediction performance of 3D trajectories for the shoulder, elbow, wrist and hand using non-overlapping ECoG features.

Fig. 13: Visualization of the observed trajectories (5 s time window) and the trajectories predicted by HOPLS, N-PLS and PLS.

5 CONCLUSIONS

The higher-order partial least squares (HOPLS) has been proposed as a generalized multilinear regression model. The analysis and simulations have shown that the advantages of the proposed model include its robustness to noise and enhanced performance for small sample sizes. In addition, HOPLS provides an optimal tradeoff between fitness and overfitting, due to the fact that the model complexity can be adapted by a hyperparameter. The proposed strategy of finding a closed-form solution for HOPLS makes the computation more efficient than the existing algorithms. The results for a real-world application, decoding 3D movement trajectories from ECoG signals, have also demonstrated that HOPLS is a promising multilinear subspace regression method.

REFERENCES

[1] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, 3rd ed. New York: John Wiley & Sons.
[2] C. Dhanjal, S. Gunn, and J. Shawe-Taylor, Efficient sparse kernel feature extraction based on partial least squares, IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 8, pp. 47-6, 9.
[3] H. Wold, Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach, Perspectives in Probability and Statistics, pp. 7-4, 975.
[4] A. Krishnan, L. Williams, A. McIntosh, and H. Abdi, Partial least squares (PLS) methods for neuroimaging: A tutorial and review, NeuroImage, vol. 56.
[5] H. Abdi, Partial least squares regression and projection on latent structure regression (PLS Regression), Wiley Interdisciplinary Reviews: Computational Statistics, pp. 97-6.
[6] R. Rosipal and N. Krämer, Overview and recent advances in partial least squares, Subspace, Latent Structure and Feature Selection, pp. 4-5, 6.
[7] J. Trygg and S. Wold, Orthogonal projections to latent structures (O-PLS), Journal of Chemometrics, vol. 6, pp. 9-8.
[8] R. Ergon, PLS score-loading correspondence and a bi-orthogonal factorization, Journal of Chemometrics, vol. 6, no. 7, pp. 68-7.
[9] S. Vijayakumar and S. Schaal, Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space, in Proceedings of the Seventeenth International Conference on Machine Learning.
[10] R. Rosipal and L. Trejo, Kernel partial least squares regression in reproducing kernel Hilbert space, The Journal of Machine Learning Research.
[11] J. Ham, D. Lee, S. Mika, and B. Schölkopf, A kernel view of the dimensionality reduction of manifolds, in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 4.
[12] R. Bro, A. Rinnan, and N. Faber, Standard error of prediction for multilinear PLS. Practical implementation in fluorescence spectroscopy, Chemometrics and Intelligent Laboratory Systems, vol. 75, 5.
[13] B. Li, J. Morris, and E. Martin, Model selection for partial least squares regression, Chemometrics and Intelligent Laboratory Systems, vol. 64.
[14] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B (Methodological), 996.
[15] R. Bro, Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. Academisch proefschrift, Denmark, 998.
[16] T. Kolda and B. Bader, Tensor decompositions and applications, SIAM Review, vol. 5, 9.
[17] A. Cichocki, R. Zdunek, A. H. Phan, and S. I. Amari, Nonnegative Matrix and Tensor Factorizations. John Wiley & Sons, 9.
[18] E. Acar, D. Dunlavy, T. Kolda, and M. Mørup, Scalable tensor factorizations for incomplete data, Chemometrics and Intelligent Laboratory Systems.
[19] R. Bro, Multiway calibration. Multilinear PLS, Journal of Chemometrics, pp. 47-6, 996.
[20] R. Bro, Review on multiway analysis in chemistry, Critical Reviews in Analytical Chemistry, vol. 6, pp. 79-9, 6.
[21] K. Hasegawa, M. Arakawa, and K. Funatsu, Rational choice of bioactive conformations through use of conformation analysis and 3-way partial least squares modeling, Chemometrics and Intelligent Laboratory Systems, vol. 5, pp. 5-6.
[22] J. Nilsson, S. de Jong, and A. Smilde, Multiway calibration in 3D QSAR, Journal of Chemometrics, no. 6, pp. 5-54, 997.
[23] K. Zissis, R. Brereton, S. Dunkerley, and R. Escott, Two-way, unfolded three-way and three-mode partial least squares calibration of diode array HPLC chromatograms for the quantitation of low-level pharmaceutical impurities, Analytica Chimica Acta, vol. 84, pp. 7-8, 999.
[24] E. Martinez-Montes, P. Valdés-Sosa, F. Miwakeichi, R. Goldman, and M. Cohen, Concurrent EEG/fMRI analysis by multiway partial least squares, NeuroImage, pp. 4, 4.
[25] E. Acar, C. Bingol, H. Bingol, R. Bro, and B. Yener, Seizure recognition on epilepsy feature tensor, in 9th Annual International Conference of the IEEE EMBS, 7.
[26] R. Bro, A. Smilde, and S. de Jong, On the difference between low-rank and subspace approximation: improved model for multilinear PLS regression, Chemometrics and Intelligent Laboratory Systems, vol. 58.
[27] A. Smilde and H. Kiers, Multiway covariates regression models, Journal of Chemometrics, pp. 48, 999.
[28] A. Smilde, R. Bro, and P. Geladi, Multi-way Analysis with Applications in the Chemical Sciences. Wiley, 4.
[29] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis, UCLA Working Papers in Phonetics, vol. 6, pp. 84, 97.
[30] A. Smilde, Comments on multilinear PLS, Journal of Chemometrics, no. 5, 997.
[31] M. D. Borraccetti, P. C. Damiani, and A. C. Olivieri, When unfolding is better: unique success of unfolded partial least-squares regression with residual bilinearization for the processing of spectral-pH data with strong spectral overlapping. Analysis of fluoroquinolones in human urine based on flow-injection pH-modulated synchronous fluorescence data matrices, Analyst, vol. 4, 9.
[32] L. De Lathauwer, Decompositions of a higher-order tensor in block terms - Part II: Definitions and uniqueness, SIAM J. Matrix Anal. Appl., pp. 66, 8.
[33] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM Journal on Matrix Analysis and Applications, no. 4, pp. 5-78.
[34] L. R. Tucker, Implications of factor analysis of three-way matrices for measurement of change, in Problems in Measuring Change, C. W. Harris, Ed. University of Wisconsin Press, 96, pp. 7.
[35] T. Kolda, Multilinear operators for higher-order decompositions, Tech. Report SAND6-8, Sandia National Laboratories, Albuquerque, NM, and Livermore, CA, 6.
[36] J. D. Carroll and J. J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition, Psychometrika, vol. 5, pp. 8-9, 97.
[37] L. De Lathauwer, B. De Moor, and J. Vandewalle, On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors, SIAM Journal on Matrix Analysis and Applications, no. 4, pp. 4-4.
[38] L.-H. Lim and V. D. Silva, Tensor rank and the ill-posedness of the best low-rank approximation problem, SIAM Journal of Matrix Analysis and Applications, pp. 84-7, 8.
[39] H. Wold, Soft modeling: the basic design and some extensions, Systems Under Indirect Observation, pp. 5, 98.
[40] S. Wold, M. Sjöström, and L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometrics and Intelligent Laboratory Systems, vol. 58, pp. 9.
[41] S. Wold, A. Ruhe, H. Wold, and W. Dunn III, The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM Journal on Scientific and Statistical Computing, vol. 5, p. 75, 984.
[42] B. Kowalski, R. Gerlach, and H. Wold, Chemical systems under indirect observation, in Systems Under Indirect Observation, pp. 9-9, 98.
[43] A. McIntosh and N. Lobaugh, Partial least squares analysis of neuroimaging data: applications and advances, NeuroImage, pp. S5-S6, 4.
[44] A. McIntosh, W. Chau, and A. Protzner, Spatiotemporal analysis of event-related fMRI data using partial least squares, NeuroImage, 4.
[45] N. Kovacevic and A. McIntosh, Groupwise independent component decomposition of EEG data and partial least square analysis, NeuroImage, vol. 5, 7.
[46] Z. Chao, Y. Nagasaka, and N. Fujii, Long-term asynchronous decoding of arm motion using electrocorticographic signals in monkeys, Frontiers in Neuroengineering.
[47] L. Trejo, R. Rosipal, and B. Matthews, Brain-computer interfaces for 1-D and 2-D cursor control: designs using volitional control of the EEG spectrum or steady-state visual evoked potentials, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 4, pp. 5-9, 6.

Zenas C. Chao received the PhD degree in biomedical engineering from the Georgia Institute of Technology. He is currently a research scientist in the Laboratory for Adaptive Intelligence at the RIKEN Brain Science Institute (BSI), Japan, where his research focuses on primate neurophysiological recording and analysis.

Qibin Zhao was born in 1979. He received the Ph.D. degree in engineering from the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. He is currently a research scientist at the Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Japan, and a visiting research scientist at the Non-invasive BMI Unit of the BSI-TOYOTA Collaboration Center, RIKEN. His research interests include multi-way data analysis, brain-computer interfaces, and machine learning.

Cesar F. Caiafa was born in Buenos Aires, Argentina. He received the Ph.D. degree in engineering from the Faculty of Engineering, University of Buenos Aires. He is currently an Adjunct Researcher with the Argentinean Radioastronomy Institute (IAR) - CONICET and an Assistant Professor with the Faculty of Engineering, University of Buenos Aires. He is also a Visiting Scientist at the Laboratory for Advanced Brain Signal Processing, BSI-RIKEN, Japan.

Danilo P. Mandic (M'99, SM) is a Professor of signal processing at Imperial College London, London, U.K., where he has been working in the areas of nonlinear and multidimensional adaptive signal processing and time-frequency analysis. His publication record includes two research monographs, Recurrent Neural Networks for Prediction (West Sussex, U.K.: Wiley) and Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models (West Sussex, U.K.: Wiley), an edited book, Signal Processing for Information Fusion (New York: Springer), and numerous publications on signal and image processing. He has been a member of the IEEE Technical Committee on Machine Learning for Signal Processing and an Associate Editor for the IEEE Transactions on Circuits and Systems II, the IEEE Transactions on Signal Processing, the IEEE Transactions on Neural Networks, and the International Journal of Mathematical Modelling and Algorithms.

Yasuo Nagasaka received the PhD degree in psychology from Rikkyo University, Japan. He was a postdoctoral fellow at the University of Iowa and is currently a postdoctoral fellow at the RIKEN Brain Science Institute, Japan. His research interests broadly span the areas of animal cognition, including illusory perception, motor control behavior, and social brain function.

Naotaka Fujii received the MD and PhD degrees from Tohoku University School of Medicine, Sendai, Japan, the latter in 1997. After receiving the PhD, he worked as a postdoctoral fellow at BCS, MIT. He is currently the head of the Laboratory for Adaptive Intelligence at the RIKEN Brain Science Institute, Japan. His research interests span social neuroscience in primates and humans and the development of brain-machine interfaces and substitute reality systems.

Liqing Zhang received the Ph.D. degree from Zhongshan University, Guangzhou, China, in 1988. He was promoted to full professor at the South China University of Technology in 1995 and worked as a research scientist at the RIKEN Brain Science Institute, Japan, from 1997. He is now a Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include computational theory of cortical networks, brain-computer interfaces, computational models of perception and cognition, and statistical learning and inference. He has published numerous papers in international journals and conferences.

Andrzej Cichocki received the Ph.D. and Dr.Sc. (Habilitation) degrees, both in electrical engineering, from the Warsaw University of Technology, Poland. He is currently the senior team leader and head of the Laboratory for Advanced Brain Signal Processing at the RIKEN Brain Science Institute, Japan. He is co-author of numerous technical papers and 4 monographs (two of them translated into Chinese). According to an analysis, he is an author or coauthor of some of the most highly cited papers in his field worldwide.