arxiv: v1 [stat.ml] 31 Jan 2018


Incremental kernel PCA and the Nyström method

Fredrik Hallgren
Department of Statistical Science, University College London, London WC1E 6BT, United Kingdom
fredrik.hallgren@ucl.ac.uk

Paul Northrop
Department of Statistical Science, University College London, London WC1E 6BT, United Kingdom
p.northrop@ucl.ac.uk

Abstract

Incremental versions of batch algorithms are often desired, for increased time efficiency in the streaming data setting, or increased memory efficiency in general. In this paper we present a novel algorithm for incremental kernel PCA, based on rank one updates to the eigendecomposition of the kernel matrix, which is more computationally efficient than comparable existing algorithms. We extend our algorithm to incremental calculation of the Nyström approximation to the kernel matrix, the first such algorithm proposed. Incremental calculation of the Nyström approximation leads to further gains in memory efficiency, and allows for empirical evaluation of when a subset of sufficient size has been obtained.

1 INTRODUCTION

Kernel methods make use of non-linear patterns in data whilst being able to use linear solution methods, through a non-linear transformation of data examples into a feature space where inner products correspond to the application of a kernel function between data examples (Hofmann et al., 2008). Many kernel methods have been conceived as the direct application of well-known linear methods in this feature space, occasionally reformulated to be expressed entirely in the form of inner products. This is the case for kernel PCA, obtained through the application of linear PCA in feature space (Schölkopf et al., 1998) and involving an eigendecomposition of the kernel matrix. It has been shown to outperform linear PCA in a number of applications (Chin and Suter, 2007).

Incremental algorithms, where a solution is updated for additional data examples, are often desirable.
If data arrives sequentially in time and a solution is required for each additional data example, more efficient incremental algorithms are often available than repeated application of a batch procedure. Furthermore, incremental algorithms often have a lower memory footprint than their batch counterparts.

In this paper, we propose a novel algorithm for incremental kernel PCA, which accounts for the change in mean in the covariance matrix from each additional data example. It works by writing the expanded mean-adjusted kernel matrix from an additional data point in terms of a number of rank one updates, to which a rank one update algorithm for the eigendecomposition can be applied. We use a rank one update algorithm based on work in Golub (1973) and Bunch et al. (1978).

A few previous exact incremental algorithms for kernel PCA have been proposed, some of which are based on the application of an incremental linear PCA method in feature space (Kim et al., 2005; Chin and Suter, 2007; Hoegaerts et al., 2007). Rank one update algorithms for the eigendecomposition have not previously been applied to kernel PCA, to the best of our knowledge. If the mean of the feature vectors is not adjusted, our algorithm corresponds to an incremental procedure for the eigendecomposition of the kernel matrix, which can be more widely applied.

Our algorithm has the same time and memory complexities as existing algorithms for incremental kernel PCA and it is more computationally efficient than the comparable algorithm in Chin and Suter (2007), which also allows for a change in mean. Furthermore, it can be considered more flexible, since it is straightforward to apply a different rank one update algorithm to the one we have used, for potentially improved efficiency. Approximate algorithms could also be applied, for example from randomized linear algebra (Mahoney, 2011).

The usefulness of kernel methods is limited by their large computational requirements in time and memory, which scale in the number of data points, since the dimension of the transformed variables often is very large, or they are not explicitly available, and one therefore must express a solution in terms of transformed data examples. This is particularly true for kernel PCA since it requires an eigendecomposition of the kernel matrix, an expensive operation. As a remedy, various approximate methods have been introduced, such as the Nyström method (Williams and Seeger, 2001), which creates a low-rank approximation to the kernel matrix based on a randomly sampled subset of data examples.

We also extend our algorithm for incremental kernel PCA to incremental calculation of the Nyström approximation to the kernel matrix. We incrementally add data examples to the subset used to create the Nyström approximation to kernel PCA. This allows one to evaluate empirically the accuracy of the Nyström approximation for each added data example. Rudi et al. (2015) presented an incremental updating procedure for the Nyström approximation to kernel ridge regression, based on rank one updates to the Cholesky decomposition. Our proposed incremental procedure can be applied to any kernel method requiring the eigendecomposition or inverse of the kernel matrix. Combining an incremental algorithm with the Nyström method also leads to further improvements in memory efficiency, compared with either method on its own.

2 BACKGROUND

2.1 KERNEL METHODS

Kernel methods allow for the application of linear methods to discover non-linear patterns between variables, through a non-linear transformation of data points $\varphi(x)$ into a feature space where linear algorithms can be applied (Hofmann et al., 2008). They rely on two things. First, the calculation of inner products between transformed data examples through a symmetric positive definite kernel $k(x, y)$; second, the expression of a solution linearly in the space of transformed data examples, rather than in the space of transformed variables. We have a set of $n$ observations $\{x_i\}_{i=1}^n$.
Linear methods generally scale in the dimension of the observations. For example, if each $x_i$ is a real vector $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)})$, a linear method will scale as the number of variables $d$. Let each $x_i$ be an element from a set $X$. In general, no further restrictions need to be placed on the set $X$, which is a great benefit of kernel methods. For example, $X$ can be a collection of text strings or graphs (Lodhi et al., 2002; Vishwanathan et al., 2010).

Let $H$ be a Hilbert space of real-valued functions on $X$, with inner product $\langle \cdot, \cdot \rangle_H$. If $X$ is a vector space, then $H$ is a closed subspace of $X^*$, the dual space of bounded linear functionals on $X$. Consider $H^*$, the dual space of linear functionals on $H$. For each $x \in X$ there is an element $\delta_x \in H^*$ such that $\delta_x(f) = f(x)$, termed the evaluation functional. If $\delta_x$ is bounded (i.e. continuous), then by the Riesz representation theorem there is a unique element $g_x \in H$ such that $\delta_x(f) = \langle g_x, f \rangle_H$ (Bollobás, 1999). If we consider $g_x$ as a function of $x$, say $k(x, \cdot)$, then $k(x, \cdot)$ has the reproducing property, i.e. $\langle k(x, \cdot), f(\cdot) \rangle_H = f(x)$. Furthermore, by the reproducing property, we have $\langle k(x, \cdot), k(y, \cdot) \rangle_H = k(x, y)$. Then $k(x, y)$ is a symmetric positive definite function by the symmetric positive definite property of the inner product. The function $k(x, \cdot)$ is also often denoted by $\varphi(x)$, termed a feature map.

The space $H$ has uncountable dimension, but since every (separable) Hilbert space is isometrically isomorphic to $\ell^2$, the space of square-summable sequences (Bollobás, 1999), each element $\varphi(x_i)$ has a representation as a vector $\varphi(x_i) = (\varphi_1(x_i), \varphi_2(x_i), \ldots, \varphi_d(x_i))$ over $\mathbb{R}$ with $\langle \varphi(x_i), \varphi(x_j) \rangle_H = \sum_{k=1}^{d} \varphi_k(x_i) \varphi_k(x_j)$. We call these feature vectors. However, this representation is often not known, or $d$ is very large, so it might not be possible to apply a linear method directly on the variables $\varphi_1(x), \varphi_2(x), \ldots, \varphi_d(x)$.
Thanks to the representer theorem (Schölkopf et al., 2001), a solution can instead often be expressed in terms of elements in $H$, as $f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)$ with coefficients $\alpha_i$. We arrange the feature vectors along the rows of a data matrix $\Phi$. The kernel matrix is given by $K := (k(x_i, x_j)) \in \mathbb{R}^{n \times n} = \Phi \Phi^T$.

2.2 KERNEL PCA

PCA finds the set of orthogonal linear combinations of variables that maximizes the variance of each linear combination in turn. PCA can be used for dimensionality reduction, in regression and classification problems, and to detect outliers, among other applications (Jolliffe, 2002). The principal components are obtained by calculating the eigendecomposition of the sample covariance matrix $C = \frac{1}{n} X^T X$, for a data matrix of (centred) observations $X$, where each observation occupies a row. This gives the decomposition $C = V \Lambda V^T$ where the columns of $V$ are the directions of maximum variance. The principal components can also be obtained through the related singular value decomposition (SVD).

Assuming centred data, kernel PCA performs the eigendecomposition of the covariance matrix in feature space through (Schölkopf et al., 1998)

$$\frac{1}{n} \Phi^T \Phi v = \lambda v$$

resulting in the decomposition $\frac{1}{n} \Phi^T \Phi = V \Sigma V^T$. Henceforth we will ignore the factor $\frac{1}{n}$ and only be concerned with the eigendecomposition of $\Phi^T \Phi$. Noting that $\mathrm{span}\{\Phi^T\} = \mathrm{span}\{V\}$, we can write $v$ in terms of an $n$-dimensional vector $u$ as $v = \Phi^T u$. Left-multiplying the eigenvalue equation by $\Phi$ we obtain $K u = \lambda u$ and the decomposition $K = U \Lambda U^T$.

If the data vectors in feature space are not assumed to be centred, we need to subtract the mean of each variable from $\Phi$ and instead calculate the eigendecomposition of

$$K' = (\Phi - \mathbf{1}_n \Phi)(\Phi - \mathbf{1}_n \Phi)^T = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n \tag{1}$$

where $\mathbf{1}_n$ is an $n \times n$ matrix for which $(\mathbf{1}_n)_{i,j} = \frac{1}{n}$, i.e. with every element equal to $\frac{1}{n}$.

2.3 INCREMENTAL KERNEL PCA

Incremental algorithms update an existing solution for one or several additional data examples, also referred to as online learning. The goal is that specialized algorithms will achieve greater time or memory performance than repeated application of batch procedures. There are many use cases for incremental versions of batch algorithms, for example when memory capacity is constrained, or when data examples arrive sequentially in time, termed streaming data, and a solution is desired for each additional data example.

A few algorithms for exact incremental kernel PCA have been proposed. The algorithm in Chin and Suter (2007) is based on the incremental linear PCA algorithm from Lim et al. (2004). The time complexity is $O(n^3)$ and the memory complexity $O(n^2)$. Hoegaerts et al. (2007) write the kernel matrix expanded with an additional data example in terms of two rank one updates, without adjusting for a change in mean, and hence propose an algorithm to update a subset of dominant eigenvalues and corresponding eigenvectors. If the algorithm is applied to update all eigenpairs, the complexities in time and memory are $O(n^3)$ and $O(n^2)$, respectively.
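As a small illustration of the mean adjustment in equation (1) of Section 2.2, the sketch below centres a kernel matrix and compares the result with explicitly centred feature vectors. It assumes NumPy and uses a linear kernel so that $\Phi$ is available explicitly; the function name `centre_kernel` and the random test data are illustrative choices, not part of the paper.

```python
import numpy as np

def centre_kernel(K):
    """Mean-adjust a kernel matrix following equation (1)."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)  # the matrix 1_n, every entry equal to 1/n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Illustrative data: with a linear kernel the feature matrix Phi is X itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T

K_centred = centre_kernel(K)

# Reference: subtract the mean of each variable, then form the Gram matrix.
X_centred = X - X.mean(axis=0)
K_reference = X_centred @ X_centred.T

# Batch kernel PCA then amounts to an eigendecomposition of K_centred.
eigenvalues, eigenvectors = np.linalg.eigh(K_centred)
```

For a non-linear kernel the same `centre_kernel` call applies unchanged, since equation (1) only involves the kernel matrix.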
Iterative algorithms produce a sequence of improving approximate solutions that converges to the exact solution as the number of steps increases (Golub and Van Loan, 2013). An iterative algorithm can often be made to operate efficiently in an incremental fashion, by expanding the data set with additional data examples and restarting the iterative procedure. An example of an iterative method for kernel PCA that can be made to operate incrementally is the kernel Hebbian algorithm (Kim et al., 2005), based on the generalized Hebbian algorithm (Oja, 1982) applied in feature space.

Various approximations to incremental kernel PCA have also been proposed. See for example Tokumoto and Ozawa (2011) or Sheikholeslami et al. (2015). Since we present an exact algorithm for incremental kernel PCA, we will not describe these or similar works further.

2.4 THE NYSTRÖM METHOD

The Nyström method (Williams and Seeger, 2001) randomly samples $m$ data examples from the full dataset, often uniformly, and calculates a low-rank approximation $\tilde{K}$ to the full kernel matrix through

$$\tilde{K} = K_{n,m} K_{m,m}^{-1} K_{m,n}$$

where $K_{n,m}$ is an $n \times m$ matrix obtained by choosing $m$ columns from the original matrix $K$, $K_{m,n}$ is its transpose and $K_{m,m}$ contains the intersection of the same $m$ columns and rows.

3 KERNEL PCA THROUGH RANK ONE UPDATES

In this section we present an algorithm for incremental kernel PCA based on rank one updates to the eigendecomposition of the kernel matrix $K$, or the mean-adjusted kernel matrix $K'$.

Any incremental algorithm for the eigendecomposition of the kernel matrix $K$ can be applied where the explicit or implicit inverse of the same is required, such as kernel regression and kernel SVM. Various methods other than kernel PCA are also based on the eigendecomposition of the kernel matrix, such as kernel FDA (Mika et al., 1999). Even when more efficient solution methods are available, access to the eigendecomposition can be highly useful for statistical regularization or controlling numerical stability.
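The batch Nyström approximation of Section 2.4 can be sketched in a few lines. This is a minimal illustration assuming NumPy, uniform sampling without replacement and a radial basis functions kernel of the form $\exp(-\|x-y\|_2^2/\sigma)$; the helper names `rbf_kernel` and `nystrom`, the data and the value of `sigma` are hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Kernel matrix with entries exp(-||x - y||^2 / sigma)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma)

def nystrom(X, idx, sigma):
    """Low-rank approximation K_nm K_mm^{-1} K_mn from the columns in idx."""
    K_nm = rbf_kernel(X, X[idx], sigma)   # n x m block of chosen columns
    K_mm = K_nm[idx, :]                   # intersection of chosen rows and columns
    # Solve a linear system rather than forming the inverse explicitly.
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)

rng = np.random.default_rng(0)
n, m, sigma = 30, 10, 2.0
X = rng.normal(size=(n, 3))
idx = rng.choice(n, size=m, replace=False)  # uniform sampling of m examples

K = rbf_kernel(X, X, sigma)
K_tilde = nystrom(X, idx, sigma)
frobenius_error = np.linalg.norm(K - K_tilde, "fro")
```

The approximation reproduces the sampled block $K_{m,m}$ exactly and has rank at most $m$; the error on the remaining entries depends on the data and on $m$.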
In contrast to the covariance matrix in linear PCA, the kernel matrix expands in size for each additional data point, which needs to be taken into account, and the effect on the eigensystem determined. We write the kernel matrix $K_{m+1,m+1}$ created with $m+1$ data examples in terms of an expansion and a sequence of symmetric rank one updates to the kernel matrix $K_{m,m}$, and apply a rank one update algorithm to the eigendecomposition of $K_{m,m}$ to obtain the eigendecomposition of $K_{m+1,m+1}$.

A number of algorithms have been suggested to perform rank one modification to the symmetric eigenproblem. Golub (1973) presented a procedure to determine the eigenvalues of a diagonal matrix updated through a rank one perturbation. Bunch et al. (1978) extended the results to the determination of both eigenvalues and eigenvectors of an arbitrary perturbed matrix, including an improved procedure to determine the eigenvalues. Stability issues in the calculation of the eigenvectors, including loss of numerical orthogonality, later motivated several improvements (Dongarra and Sorensen, 1987; Sorensen and Tang, 1991; Gu and Eisenstat, 1994). Alternatively, one could potentially employ update algorithms for the singular value decomposition, such as the algorithm suggested in Brand (2006) for the thin singular value decomposition.

We use the rank one update algorithm for eigenvalues from Golub (1973) and then determine the eigenvectors according to Bunch et al. (1978). In the experiments our approach seems to be sufficiently stable and accurate for most use cases. We assume throughout that the kernel matrix remains non-singular after each update.

Our algorithm has the same time and memory complexities as competing methods. The algorithm most comparable to ours is the one in Chin and Suter (2007), which also accounts for a change in mean. If one additional data example is added incrementally, and all eigenpairs are retained, it requires the eigendecomposition of an $m \times m$ matrix, the eigendecomposition of the unadjusted kernel matrix, and a multiplication of two $m \times m$ matrices at each step. Since a multiplication of two $m \times m$ matrices requires $2m^3$ flops, and the state-of-the-art QR algorithm for the symmetric eigenproblem about $9m^3$ flops (Golub and Van Loan, 2013), the algorithm thus requires $20m^3$ flops to the $O(m^3)$ factor. Our proposed algorithm requires $8m^3$ flops to the $O(m^3)$ factor if the mean is adjusted, and $4m^3$ flops otherwise, from one multiplication of two $(m+1) \times (m+1)$ matrices for each rank one update. Our algorithm is thus more than twice as efficient.

3.1 RANK ONE UPDATE PROCEDURE

If we know the eigendecomposition of $K_{m,m} = U_m \Lambda_m U_m^T$ and write $K_{m+1,m+1}$ in terms of an expansion and a number of symmetric rank one updates to $K_{m,m}$, we can then apply a rank one update algorithm to obtain the eigendecomposition of $K_{m+1,m+1} = U_{m+1} \Lambda_{m+1} U_{m+1}^T$.

3.1.1 Zero-mean data

If we assume that the data examples have zero mean in feature space, then the mean does not need to be updated for previous data points and $K_{m,m}$ only needs to be expanded with an additional row and column. In this case we can devise a rank one update procedure from $K_{m,m}$ to $K_{m+1,m+1}$ in two steps. We denote $k_{i,j} = k(x_i, x_j)$ and $a = [k_{1,m+1} \; k_{2,m+1} \; \cdots \; k_{m,m+1}]^T$, i.e. a column vector with elements $k_{1,m+1}, k_{2,m+1}, \ldots, k_{m,m+1}$, and let

$$v_1 = \left[a^T \;\; \tfrac{1}{2} k_{m+1,m+1}\right]^T, \qquad v_2 = \left[a^T \;\; \tfrac{1}{4} k_{m+1,m+1}\right]^T, \qquad \sigma = 4 / k_{m+1,m+1}.$$

Then we have

$$K_{m+1,m+1} = \begin{bmatrix} K_{m,m} & 0 \\ 0^T & \tfrac{1}{4} k_{m+1,m+1} \end{bmatrix} + \sigma v_1 v_1^T - \sigma v_2 v_2^T =: K^0_{m,m} + \sigma v_1 v_1^T - \sigma v_2 v_2^T \tag{2}$$

corresponding to an expansion of $K_{m,m}$ to $K^0_{m,m}$ and two rank one updates, where $0$ is a column vector of zeros. Compared to the eigensystem of $K_{m,m}$, $K^0_{m,m}$ will have an additional eigenvalue $\lambda_{m+1} = \tfrac{1}{4} k_{m+1,m+1}$ and corresponding eigenvector $u_{m+1} = [0 \; 0 \; \cdots \; 0 \; 1]^T$.

The matrix $K^0_{m,m}$ is symmetric positive definite (SPSD), since all eigenvalues are positive. It will remain SPSD after the first update, since it is a sum of two SPSD matrices, as $v_1 v_1^T$ is a Gram matrix, if each element is instead seen as a separate vector. The resulting matrix after the second update will be SPSD since this holds for $K_{m+1,m+1}$.

The algorithm for one updating iteration is described in Algorithm 1, given a function rankoneupdate($\sigma$, $v$, $L$, $U$) that updates the eigenvalues $L$ and eigenvectors $U$ from a rank one additive perturbation $\sigma v v^T$.

Algorithm 1 Incremental eigendecomposition of kernel matrix
Input: Dataset $\{x_i\}_{i=1}^{m+1}$; row vector of eigenvalues $L$ and matrix of eigenvectors $U$ of $K_{m,m}$; kernel function $k(\cdot, \cdot)$
Output: Eigenvalues $L$ and eigenvectors $U$ of $K_{m+1,m+1}$
  1: $L \leftarrow [L \;\; k_{m+1,m+1}/4]$
  2: $U \leftarrow \begin{bmatrix} U & 0 \\ 0^T & 1 \end{bmatrix}$
  3: sigma $\leftarrow 4 / k_{m+1,m+1}$
  4: k1 $\leftarrow [k_{1,m+1} \;\; k_{2,m+1} \;\; \ldots \;\; k_{m+1,m+1}/2]$
  5: k0 $\leftarrow [k_{1,m+1} \;\; k_{2,m+1} \;\; \ldots \;\; k_{m+1,m+1}/4]$
  6: $L, U \leftarrow$ rankoneupdate(sigma, k1, $L$, $U$)
  7: $L, U \leftarrow$ rankoneupdate($-$sigma, k0, $L$, $U$)
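The steps of Algorithm 1 can be sketched as follows, assuming NumPy. The function `rank_one_update` below is only a stand-in that re-decomposes the perturbed matrix with a batch solver, so the example checks the expansion in equation (2) rather than the efficiency of the method; the names `expand_eigensystem`, `rank_one_update` and the test data are hypothetical.

```python
import numpy as np

def rank_one_update(sigma, v, L, U):
    """Stand-in for rankoneupdate: batch eigendecomposition of U L U^T + sigma v v^T."""
    B = (U * L) @ U.T + sigma * np.outer(v, v)
    L_new, U_new = np.linalg.eigh(B)
    return L_new, U_new

def expand_eigensystem(L, U, a, k_new):
    """One iteration of Algorithm 1 for zero-mean data."""
    m = L.shape[0]
    # Steps 1-2: expand with eigenvalue k_new / 4 and eigenvector [0 ... 0 1]^T.
    L = np.append(L, k_new / 4.0)
    U = np.block([[U, np.zeros((m, 1))],
                  [np.zeros((1, m)), np.ones((1, 1))]])
    # Steps 3-5: scaling factor and the two update vectors.
    sigma = 4.0 / k_new
    k1 = np.append(a, k_new / 2.0)
    k0 = np.append(a, k_new / 4.0)
    # Steps 6-7: two symmetric rank one updates.
    L, U = rank_one_update(sigma, k1, L, U)
    L, U = rank_one_update(-sigma, k0, L, U)
    return L, U

# Illustrative data with an RBF kernel, so that k(x, x) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

m0 = 5
L, U = np.linalg.eigh(K[:m0, :m0])
for m in range(m0, X.shape[0]):
    L, U = expand_eigensystem(L, U, a=K[:m, m], k_new=K[m, m])

K_rebuilt = (U * L) @ U.T
```

Replacing the stand-in with a genuine rank one eigendecomposition update, such as the one described in Section 3.2, leaves `expand_eigensystem` unchanged.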

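If we limit ourselves to kernel functions for which $k(x, x)$ is constant, without loss of generality we can set $k(x, x) = 1$ and the expressions in (2) simplify.

3.1.2 Mean-adjusted data

To construct a rank one update procedure from $K'_{m,m}$ to $K'_{m+1,m+1}$, all the elements of $K'_{m,m}$ need to be adjusted in addition to the expansion with another row and column. We first devise two rank one updates that adjust the mean of $K'_{m,m}$ to account for the additional data example. We then expand the resulting matrix and perform symmetric updates to set the last row and column to the required values, similarly to (2).

Recall that when taking the mean into account, one performs an eigendecomposition of the adjusted kernel matrix $K' = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$. The elements of $K'_{m,m}$ can thus be adjusted through the following formula

$$K''_{m,m} := (K'_{m+1,m+1})_{1:m,1:m} = K'_{m,m} + \mathbf{1}_m K_{m,m} + K_{m,m} \mathbf{1}_m - \mathbf{1}_m K_{m,m} \mathbf{1}_m + \left(-\mathbf{1}_{m+1} K_{m+1,m+1} - K_{m+1,m+1} \mathbf{1}_{m+1} + \mathbf{1}_{m+1} K_{m+1,m+1} \mathbf{1}_{m+1}\right)_{1:m,1:m}$$

where $(\cdot)_{1:m,1:m}$ denotes the first $m$ rows and columns of a matrix. The latter six terms are all rank one matrices. The matrices $\mathbf{1}_m K_{m,m}$ and $(\mathbf{1}_{m+1} K_{m+1,m+1})_{1:m,1:m}$ are constant along the columns, and hence so is their difference, and similarly for the rows of $K_{m,m} \mathbf{1}_m - (K_{m+1,m+1} \mathbf{1}_{m+1})_{1:m,1:m}$. The matrix $\mathbf{1}_m K_{m,m} \mathbf{1}_m$ has constant entries, equal to the sum of all elements of $K_{m,m}$ multiplied by a factor $1/m^2$, and similarly for $(\mathbf{1}_{m+1} K_{m+1,m+1} \mathbf{1}_{m+1})_{1:m,1:m}$. Consequently, all terms can be written as two rank one updates. We have

$$\mathbf{1}_m K_{m,m} - (\mathbf{1}_{m+1} K_{m+1,m+1})_{1:m,1:m} = \frac{1}{m+1} \mathbb{1} \left(\frac{1}{m} \mathbb{1}^T K_{m,m} - a^T\right)$$

$$K_{m,m} \mathbf{1}_m - (K_{m+1,m+1} \mathbf{1}_{m+1})_{1:m,1:m} = \frac{1}{m+1} \left(\frac{1}{m} K_{m,m} \mathbb{1} - a\right) \mathbb{1}^T$$

with $a$ as in section 3.1.1 above and where $\mathbb{1}$ is a column vector of ones. Since $K_{m,m}$ is symmetric for all $m$, we have $\mathbf{1}_m K_{m,m} = (K_{m,m} \mathbf{1}_m)^T$ and $(\mathbf{1}_{m+1} K_{m+1,m+1})_{1:m,1:m} = (K_{m+1,m+1} \mathbf{1}_{m+1})^T_{1:m,1:m}$, and can set

$$u = \frac{1}{m(m+1)} K_{m,m} \mathbb{1} - \frac{1}{m+1} a + \frac{1}{2} C \mathbb{1}, \qquad C = -\frac{1}{m^2} \Sigma_m + \frac{1}{(m+1)^2} \Sigma_{m+1}$$

where we have denoted $\Sigma_m = \mathbb{1}^T K_{m,m} \mathbb{1}$, the sum of all elements of $K_{m,m}$, to obtain

$$K''_{m,m} = K'_{m,m} + \mathbb{1} u^T + u \mathbb{1}^T = K'_{m,m} + \tfrac{1}{2} (\mathbb{1} + u)(\mathbb{1} + u)^T - \tfrac{1}{2} (\mathbb{1} - u)(\mathbb{1} - u)^T$$

which is two symmetric rank one updates to $K'_{m,m}$. $\Sigma_m$ and $K_{m,m} \mathbb{1}$ can easily be updated between iterations like so

$$\Sigma_{m+1} = \Sigma_m + 2 a^T \mathbb{1} + k_{m+1,m+1}, \qquad K_{m+1,m+1} \mathbb{1}_{m+1} = \left[K_{m,m} \mathbb{1}_m + a; \; a^T \mathbb{1}_m + k_{m+1,m+1}\right]$$

where $[b; c]$ denotes a column vector $b$ expanded with an additional element $c$, and the subscript on $\mathbb{1}$ indicates its length.

We now expand $K''_{m,m}$ to $K'_{m+1,m+1}$, analogously to (2), but taking the adjusted mean into account.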
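Before the expansion step, the two mean-adjusting rank one updates derived above can be checked numerically. The sketch below assumes NumPy and an RBF kernel on random data; the helper `centre` and all variable names are illustrative. It forms $u$ and $C$ from the running quantities $\Sigma_m$ and $K_{m,m}\mathbb{1}$ and compares the updated matrix with the leading block of the batch-centred $K'_{m+1,m+1}$.

```python
import numpy as np

def centre(K):
    """Batch mean adjustment of a kernel matrix, as in equation (1)."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Illustrative data: RBF kernel on random points.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K_full = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

m = 5
K_m = K_full[:m, :m]        # kernel matrix of the first m examples
a = K_full[:m, m]           # kernel values between old examples and the new one
k_new = K_full[m, m]

ones = np.ones(m)
K1 = K_m @ ones             # row sums of K_m
S = K_m.sum()               # Sigma_m
S2 = S + 2.0 * a.sum() + k_new          # Sigma_{m+1}
C = -S / m**2 + S2 / (m + 1) ** 2
u = K1 / (m * (m + 1)) - a / (m + 1) + 0.5 * C * ones

# Two symmetric rank one updates applied to the centred matrix K'_m.
K_adjusted = (centre(K_m)
              + 0.5 * np.outer(ones + u, ones + u)
              - 0.5 * np.outer(ones - u, ones - u))

# Reference: first m rows and columns of the batch-centred K'_{m+1}.
K_target = centre(K_full)[:m, :m]

# Running quantities for the next iteration.
K1_next = np.append(K1 + a, a.sum() + k_new)
```

Only the vectors `K1`, `a` and the scalars `S`, `k_new` enter the update, which is why the uncentred kernel matrix itself does not need to be stored.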
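The required last row and column is given by

$$v := k - \frac{1}{m+1} \left(\mathbb{1}_{m+1} \mathbb{1}_{m+1}^T k + K_{m+1,m+1} \mathbb{1}_{m+1} - \frac{1}{m+1} \Sigma_{m+1} \mathbb{1}_{m+1}\right)$$

with $k = [a^T \;\; k(x_{m+1}, x_{m+1})]^T$. If we let

$$v_1 = \left[(v)_{1:m}; \; \tfrac{1}{2} (v)_{m+1}\right], \qquad v_2 = \left[(v)_{1:m}; \; \tfrac{1}{4} (v)_{m+1}\right], \qquad \sigma = 4 / (v)_{m+1}$$

where $(v)_{1:m}$ is a vector of the first $m$ elements of $v$, and $(v)_{m+1}$ is its last element, we have

$$K'_{m+1,m+1} = \begin{bmatrix} K''_{m,m} & 0 \\ 0^T & \tfrac{1}{4} (v)_{m+1} \end{bmatrix} + \sigma v_1 v_1^T - \sigma v_2 v_2^T =: K'^{0}_{m,m} + \sigma v_1 v_1^T - \sigma v_2 v_2^T \tag{3}$$

We have thus devised a procedure to update $K'_{m,m}$ to $K'_{m+1,m+1}$ using four symmetric rank one updates, for which a rank one eigendecomposition update algorithm can be applied. The full procedure is described in Algorithm 2. Note that the matrix $K'_{m,m}$ or its expansion do not need to be kept in memory. The procedure is linear in time and memory, since all constituent quantities are updated incrementally.

Algorithm 2 Incremental eigendecomposition of adjusted kernel matrix
Input: Dataset $\{x_i\}_{i=1}^{m+1}$; row vector of eigenvalues $L$ and matrix of eigenvectors $U$ of $K'_{m,m}$; kernel function $k(\cdot, \cdot)$; sum of all elements of $K_{m,m}$, denoted $S$; sum of rows of $K_{m,m}$, i.e. $K_{m,m} \mathbb{1}$, denoted K1
Output: Eigenvalues $L$ and eigenvectors $U$ of $K'_{m+1,m+1}$
  1: $a \leftarrow [k_{1,m+1} \;\; k_{2,m+1} \;\; \ldots \;\; k_{m,m+1}]$
  2: S2 $\leftarrow S + 2\,$sum($a$) $+ \, k_{m+1,m+1}$
  3: $C \leftarrow -S/m^2 + $ S2$/(m+1)^2$
  4: $u \leftarrow$ K1$/(m(m+1)) - a/(m+1) + 0.5 \, C \,$ones($m$)
  5: $L, U \leftarrow$ rankoneupdate(0.5, $1 + u$, $L$, $U$)
  6: $L, U \leftarrow$ rankoneupdate($-0.5$, $1 - u$, $L$, $U$)
  7: K1 $\leftarrow [$K1 $+ \, a \;\;$ sum($a$) $+ \, k_{m+1,m+1}]$
  8: $S \leftarrow$ S2
  9: $m \leftarrow m + 1$
 10: $v \leftarrow k - ($ones($m$)$\,$(sum($a$) $+ \, k_{m,m}$) $+$ K1 $-$ ones($m$)$\, S/m)/m$, with $k = [a \;\; k_{m,m}]$
 11: v0 $\leftarrow v[m]$
 12: $v \leftarrow v[1 : m-1]$
 13: $L \leftarrow [L \;\;$ v0$/4]$
 14: $U \leftarrow \begin{bmatrix} U & 0 \\ 0^T & 1 \end{bmatrix}$
 15: sigma $\leftarrow 4/$v0
 16: v1 $\leftarrow [v \;\;$ v0$/2]$
 17: v2 $\leftarrow [v \;\;$ v0$/4]$
 18: $L, U \leftarrow$ rankoneupdate(sigma, v1, $L$, $U$)
 19: $L, U \leftarrow$ rankoneupdate($-$sigma, v2, $L$, $U$)

3.2 UPDATE ALGORITHM FOR THE EIGENDECOMPOSITION

Here we describe an algorithm for updating the eigendecomposition after a rank one perturbation. Suppose we know the eigendecomposition of a symmetric matrix $A = U \Lambda U^T$. Let

$$B = U \Lambda U^T + \sigma v v^T = U (\Lambda + \sigma z z^T) U^T$$

where $z = U^T v$, and look for the eigendecomposition of $\tilde{B} := \Lambda + \sigma z z^T = \tilde{U} \tilde{\Lambda} \tilde{U}^T$ (Bunch et al., 1978). Then the eigendecomposition of $B$ is given by $U \tilde{U} \tilde{\Lambda} \tilde{U}^T U^T$, with the same eigenvalues $\tilde{\Lambda}$ as $\tilde{B}$ and eigenvectors $U^B := U \tilde{U}$, since the product of two orthogonal matrices is orthogonal and since the eigendecomposition is unique, provided all eigenvalues are distinct.

The eigenvalues of $\tilde{B}$ can be calculated in $O(n^2)$ time by finding the roots of the secular equation (Golub, 1973)

$$\omega(\tilde{\lambda}) := 1 + \sigma \sum_{i=1}^{n} \frac{z_i^2}{\lambda_i - \tilde{\lambda}} \tag{4}$$

The eigenvalues of the modified system are subject to the following bounds

$$\begin{aligned}
\lambda_i \le \tilde{\lambda}_i &\le \lambda_{i+1}, & i &= 1, 2, \ldots, n-1, & \sigma &> 0 \\
\lambda_n \le \tilde{\lambda}_n &\le \lambda_n + \sigma z^T z, & & & \sigma &> 0 \\
\lambda_{i-1} \le \tilde{\lambda}_i &\le \lambda_i, & i &= 2, 3, \ldots, n, & \sigma &< 0 \\
\lambda_1 + \sigma z^T z \le \tilde{\lambda}_1 &\le \lambda_1, & & & \sigma &< 0
\end{aligned} \tag{5}$$

which can be used to supply initial guesses for the root finding algorithm. Note that after expanding the eigensystem, as described above, the eigenpairs need to be reordered for the bounds to be valid.

Once the updated eigenvalues have been calculated, the eigenvectors of the perturbed matrix $B$ are given by (Bunch et al., 1978)

$$u_i^B = \frac{U D_i^{-1} z}{\lVert D_i^{-1} z \rVert} \tag{6}$$

where $D_i := \Lambda - \tilde{\lambda}_i I$. Since $U$ and $D_i$ are $m \times m$ and $D_i$ is diagonal, the denominator is $O(m)$ and the numerator is $O(m^2)$, leading to $O(m^3)$ time complexity to update all eigenvectors.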
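A rank one update along the lines of equations (4) to (6) can be sketched as below, assuming NumPy. The root finder is a plain bisection on the intervals given by the bounds (5), chosen here for simplicity and not the procedure of Golub (1973) or Bunch et al. (1978); the sketch also assumes distinct eigenvalues and non-zero components of $z$, and does no deflation. The names `secular_roots` and `rank_one_update` are illustrative.

```python
import numpy as np

def secular_roots(lam, z, sigma, iters=100):
    """Roots of the secular equation (4) by bisection, using the bounds (5).

    lam must be sorted in ascending order.
    """
    n = lam.size
    z_norm2 = float(z @ z)

    def h(t):
        # sign(sigma) * omega(t); this is increasing between consecutive poles.
        return np.sign(sigma) * (1.0 + sigma * np.sum(z**2 / (lam - t)))

    roots = np.empty(n)
    for i in range(n):
        if sigma > 0:
            lo = lam[i]
            hi = lam[i + 1] if i + 1 < n else lam[-1] + sigma * z_norm2
        else:
            lo = lam[i - 1] if i > 0 else lam[0] + sigma * z_norm2
            hi = lam[i]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:   # interval has collapsed to machine precision
                break
            if h(mid) < 0:
                lo = mid
            else:
                hi = mid
        roots[i] = 0.5 * (lo + hi)
    return roots

def rank_one_update(sigma, v, L, U):
    """Eigenpairs of U diag(L) U^T + sigma v v^T from those of U diag(L) U^T."""
    order = np.argsort(L)                 # reorder so that the bounds (5) are valid
    L, U = L[order], U[:, order]
    z = U.T @ v
    L_new = secular_roots(L, z, sigma)
    # Column i holds D_i^{-1} z with D_i = diag(L) - L_new[i] I, as in equation (6).
    Q = z[:, None] / (L[:, None] - L_new[None, :])
    Q /= np.linalg.norm(Q, axis=0)
    return L_new, U @ Q

# Illustrative check on a random symmetric matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
A = A + A.T
L0, U0 = np.linalg.eigh(A)
v = rng.normal(size=6)
L_pos, U_pos = rank_one_update(0.7, v, L0, U0)
L_neg, U_neg = rank_one_update(-0.7, v, L0, U0)
```

The matrix product `U @ Q` is the single multiplication of two square matrices per rank one update that dominates the flop count. When a root lies very close to one of the old eigenvalues, this naive form of (6) loses accuracy, which is the kind of stability issue addressed by the later refinements cited in Section 3.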
The number of flops for the full procedure is $2n^3 + O(n^2)$. Equation (6) for the eigenvectors requires the creation of an additional $n \times n$ matrix, hence the full procedure is quadratic in memory.

4 INCREMENTAL NYSTRÖM

In this section we extend our proposed algorithm to incremental calculation of the Nyström approximation to the kernel matrix. Having access to an incremental procedure for the Nyström method can be highly useful. Different sizes of subsets used in the approximation can efficiently be evaluated, to determine a suitable size for the problem at hand or for empirical investigation of the characteristics of the Nyström method for subsets of different sizes. For very large datasets, the combination of the Nyström method with incremental calculation results in further gains in memory efficiency.

Rudi et al. (2015) previously proposed an incremental algorithm for the Nyström approximation applied to kernel ridge regression, based on rank one updates to the Cholesky decomposition. Our proposed procedure can be seen as a generalization of their work. To the best of our knowledge, it is the first incremental algorithm for calculation of the full Nyström approximation to the kernel matrix.

Given the eigenvalues $\Lambda$ and eigenvectors $U$ of the matrix $K_{m,m}$, the corresponding approximate eigenvalues and eigenvectors of $K$ are given by (Williams and Seeger, 2001)

$$\Lambda^{nys} := \frac{n}{m} \Lambda, \qquad U^{nys} := \sqrt{\frac{m}{n}} \, K_{n,m} U \Lambda^{-1} \tag{7}$$

To obtain an incremental procedure for $\tilde{K} = U^{nys} \Lambda^{nys} U^{nys\,T}$, calculate $U$ and $\Lambda$ incrementally using Algorithm 2, then at each iteration add an extra column to $K_{n,m}$ corresponding to the additional data example, and calculate the rescaling (7). The rescaling has $O(m^2 n)$ time complexity from the matrix product in (7).

Note that the proposed incremental calculation of the Nyström approximation exactly reproduces batch computation at each $m$, save for numerical differences. The accuracy of the Nyström approximation has been extensively studied, including comparisons with other methods (Gittens and Mahoney, 2016; Yang et al., 2012).

[Figure 1: Difference between batch and incremental calculation of $K'$ of size $20+m$ for the two datasets. Panels: magic, yeast. Vertical axis: norm. Series: Frobenius, trace, spectral, Frobenius mean, trace mean, spectral mean.]

5 EXPERIMENTAL ANALYSIS

In this section we present the results of a number of experiments. We run the experiments on two different datasets from the UCI Machine Learning Repository (Lichman, 2013), the simulated Magic gamma telescope dataset and the Yeast dataset, containing cellular protein location sites. Where applicable, we remove the target variable when this is categorical and not continuous. Throughout the experiments we use the radial basis functions kernel

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert_2^2}{\sigma}\right)$$

where $\sigma$ is a parameter. For each dataset, we set $\sigma$ to be the median of the distances between all pairs of data examples (in a subset of the full dataset), a common heuristic. Source code in Python is available.

5.1 INCREMENTAL KERNEL PCA

We implement and evaluate our algorithm for incremental kernel PCA both with and without adjustment of the mean of the feature vectors. Numerical accuracy is generally good, whether adjusting the mean or not. A slight loss of orthogonality is discovered in the eigenvectors, as measured by how close $U U^T$ is to the identity, particularly for mean-adjusted data that requires four updates at each step and involves more numerical operations.

We have previously assumed that the kernel matrix remains of full rank after each added data example. This will always be the case in theory if data contains noise, however near numerical rank deficiency can cause issues in practice.
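Returning to Section 4, the rescaling in equation (7) can be sketched as follows, assuming NumPy. To keep the example short, the eigenpairs of $K_{m,m}$ are recomputed with a batch solver at each step as a stand-in for the incremental update, and the first $m$ rows of an already random dataset play the role of the sampled subset; the function name `nystrom_from_eig`, the data and the value of `sigma` are hypothetical.

```python
import numpy as np

def nystrom_from_eig(K_nm, L, U):
    """Approximate eigenpairs of K from those of K_mm, following equation (7)."""
    n, m = K_nm.shape
    L_nys = (n / m) * L
    U_nys = np.sqrt(m / n) * (K_nm @ U) / L   # K_nm U Lambda^{-1}
    return L_nys, U_nys

# Illustrative data and RBF kernel exp(-||x - y||^2 / sigma).
rng = np.random.default_rng(0)
n, sigma = 40, 1.0
X = rng.normal(size=(n, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) / sigma)

errors = []
for m in range(5, 13):
    K_nm = K[:, :m]                       # one extra column per added example
    L, U = np.linalg.eigh(K_nm[:m, :])    # stand-in for the incremental update
    L_nys, U_nys = nystrom_from_eig(K_nm, L, U)
    K_tilde = (U_nys * L_nys) @ U_nys.T
    errors.append(np.linalg.norm(K - K_tilde, "fro"))

# Batch Nystrom formula for the final subset size, for comparison.
K_batch = K_nm @ np.linalg.solve(K_nm[:m, :], K_nm.T)
```

Tracking `errors` as the subset grows is the kind of empirical evaluation of subset size that an incremental procedure makes cheap.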
When the kernel matrix is close to rank deficient, equation (4) may lack the required number of roots. In this instance one can deflate the matrix (see e.g. Bunch et al. (1978) for details), but for the purposes of our experiments we have contented ourselves with excluding the specific data example from the algorithm. An excluded data point does not add any time overhead to the $O(n^3)$ factor.

Every numerical operation leads to a small loss in accuracy, due to the finite representation of floating-point numbers, which is propagated, with varying severity, over subsequent operations. An incremental procedure involves substantially more operations than a batch procedure, which leads to worse accuracy in comparison, often termed drift. We illustrate this by plotting the Frobenius, spectral and trace norms of the difference between the adjusted kernel matrix $K'_{m,m}$ and the reconstruction using the incrementally calculated eigendecomposition, for different numbers of data points, i.e. $K'_{m,m} - U_m \Lambda_m U_m^T$. We plot the difference for one run of the algorithm as well as the mean difference for each value of $m$ over 50 runs. Please see Figure 1. The drift for reconstruction of the unadjusted matrix is smaller and is not plotted. Our results show that the drift is small.

[Figure 2: Difference between $\tilde{K}$ and $K$ of size $20+m$ for the two datasets. Panels: magic, yeast. Vertical axis: norm. Series: Frobenius, trace, spectral, Frobenius mean, trace mean, spectral mean.]

5.2 INCREMENTAL NYSTRÖM

We implement the proposed incremental calculation of the Nyström approximation, using the first 1000 observations from each dataset. Having access to an incremental algorithm for calculating the Nyström approximation lets us investigate explicitly how the approximation improves with each additional data point for a specific data set.

We calculate the Frobenius norm, spectral norm and trace norm of the difference between the Nyström approximation and the full kernel matrix at each step of the algorithm. All three norms can be of interest to a downstream machine learning practitioner (Gittens and Mahoney, 2016). Again, we plot the results for one run of the algorithm and for an average of 50 runs. Please see Figure 2. As seen in the plots, the Nyström approximation seems to provide a high degree of accuracy in approximating the matrix $K$, even for a fairly small number of basis points.

6 CONCLUSION

We have in this paper presented an algorithm for incremental kernel PCA based on rank one updates to the eigendecomposition of the kernel matrix $K$ or the mean-adjusted kernel matrix $K'$, which we extended to incremental calculation of the Nyström approximation to the kernel matrix. Rank one update algorithms for the eigendecomposition other than the one chosen in this paper could also be applied to the kernel PCA problem, for potentially improved accuracy and efficiency, including algorithms potentially not yet conceived.
Furthermore, it could be straightforward to adapt the proposed algorithm for incremental kernel PCA to only maintain a subset of the eigenvectors and eigenvalues.

An incremental procedure for the Nyström method can aid in determining a suitable size of the subset used for the approximation through empirical evaluation. A fairly limited amount of work has been dedicated to the determination of this hyperparameter or equivalent hyperparameters for other approximate kernel methods. Various bounds on the statistical accuracy of the Nyström method and related approximations have been derived, which could guide the choice of this hyperparameter, but this might not be the most suitable strategy.

Acknowledgements

We would like to thank Ricardo Silva at the Department of Statistical Science at UCL for helpful comments and guidance.

References

Bollobás, B. (1999). Linear analysis. Cambridge University Press, Cambridge, UK, 2nd edition.

Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1).

Bunch, J. R., Nielsen, C. P., and Sorensen, D. C. (1978). Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48.

Chin, T.-J. and Suter, D. (2007). Incremental kernel principal component analysis. IEEE Transactions on Image Processing, 16(6).

Dongarra, J. J. and Sorensen, D. C. (1987). A fully parallel algorithm for the symmetric eigenvalue problem. SIAM Journal on Scientific and Statistical Computing, 8(2).

Gittens, A. and Mahoney, M. W. (2016). Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17(Dec):1–65.

Golub, G. H. (1973). Some modified matrix eigenvalue problems. SIAM Review, 15(2).

Golub, G. H. and Van Loan, C. F. (2013). Matrix computations. Johns Hopkins University Press, Baltimore, MD, 4th edition.

Gu, M. and Eisenstat, S. C. (1994). A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM Journal on Matrix Analysis and Applications, 15(4).

Hoegaerts, L., De Lathauwer, L., Goethals, I., Suykens, J. A., Vandewalle, J., and De Moor, B. (2007). Efficiently updating and tracking the dominant kernel principal components. Neural Networks, 20(2).

Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3).

Jolliffe, I. (2002). Principal component analysis. Springer, New York, NY, 2nd edition.

Kim, K. I., Franz, M. O., and Schölkopf, B. (2005). Iterative kernel principal component analysis for image modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9).

Lichman, M. (2013). UCI machine learning repository.

Lim, J., Ross, D. A., Lin, R.-S., and Yang, M.-H. (2004). Incremental learning for visual tracking. In Advances in Neural Information Processing Systems.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2(Feb).

Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2).

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3).

Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems.

Schölkopf, B., Herbrich, R., and Smola, A. (2001). A generalized representer theorem. In Computational Learning Theory (COLT). Springer.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5).

Sheikholeslami, F., Berberidis, D., and Giannakis, G. B. (2015). Kernel-based low-rank feature extraction on a budget for big data streams. In IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE.

Sorensen, D. C. and Tang, P. T. P. (1991). On the orthogonality of eigenvectors computed by divide-and-conquer techniques. SIAM Journal on Numerical Analysis, 28(6).

Tokumoto, T. and Ozawa, S. (2011). A fast incremental kernel principal component analysis for learning stream of data chunks. In International Joint Conference on Neural Networks (IJCNN). IEEE.

Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, 11(Apr).

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems.

Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. (2012). Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems.
