Robust Low Rank Kernel Embeddings of Multivariate Distributions


Robust Low Rank Kernel Embeddings of Multivariate Distributions
Le Song, Bo Dai
College of Computing, Georgia Institute of Technology

Abstract

Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in the kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernel embeddings which can exploit such low rank structures in the data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation.

1 Introduction

Many applications of machine learning, ranging from computer vision to computational biology, require the analysis of large volumes of high-dimensional continuous-valued measurements. Complex statistical features are commonplace, including multi-modality, skewness, and rich dependency structures. Kernel embedding of distributions is an effective framework for addressing challenging problems in this regime [1, 2]. Its key idea is to implicitly map distributions into potentially infinite dimensional feature spaces using kernels, such that subsequent comparison and manipulation of these distributions can be achieved via feature space operations (e.g., inner product, distance, projection and spectral analysis). This framework has led to many recent advances in machine learning, such as kernel independence tests [3] and kernel belief propagation [4].

However, algorithms designed with kernel embeddings have rarely taken into account the latent and low rank structures prevalent in high dimensional data arising from applications such as gene expression analysis. While such information has been extensively exploited in other learning contexts, such as graphical models and collaborative filtering, its use in kernel embeddings remains scarce and challenging. Intuitively, these intrinsically low dimensional structures of the data should reduce the effective number of parameters in kernel embeddings, and allow us to obtain better estimators when facing high dimensional problems.

As a demonstration of the above intuition, we illustrate the behavior of low rank kernel embeddings (which we will explain in more detail later) when applied to density estimation (Figure 1). 100 data points are sampled i.i.d. from a mixture of 2 spherical Gaussians, where the latent variable is the cluster indicator. The fitted density based on an ordinary kernel density estimator has quite different contours from the ground truth (Figure 1(b)), while those provided by the low rank embedding appear to be much closer to the ground truth (Figure 1(c)). Essentially, the low rank approximation step endows kernel embeddings with an additional mechanism to smooth the estimator, which can be beneficial when the number of data points is small and there are clusters in the data. In our later, more systematic experiments, we show that low rank embeddings lead to density estimators which can significantly improve over alternative approaches in terms of held-out likelihood.

While there are a handful of exceptions [5, 6] in the kernel embedding literature which have exploited latent and low rank information, these algorithms are not robust in the sense that, when such information is misspecified, no performance guarantee can be provided and the algorithms can fail drastically.
The hierarchical low rank kernel embeddings we propose in this paper can be considered as a kernel generalization of the discrete-valued tree-structured latent variable models studied in [7].

Figure 1: We draw 100 samples from a mixture of 2 spherical Gaussians with equal mixing weights. (a) The contour plot of the ground truth density, (b) of the ordinary kernel density estimator (KDE), (c) of the low rank KDE. We use cross-validation to find the best kernel bandwidth for both the KDE and the low rank KDE. The latter produces a density which is visibly closer to the ground truth, and its integrated squared error is smaller than that of the KDE (vs. 0.02).

The objective of the current paper is to address the previous limitations of kernel embeddings as applied to graphical models and make them more practically useful. Furthermore, we provide both theoretical and empirical support for the new approach. Another key contribution of the paper is a novel view of kernel embeddings of multivariate distributions as infinite dimensional higher order tensors, and of the low rank structure of these tensors in the presence of latent variables. This view allows us to introduce modern multi-linear algebra and tensor decomposition tools to address challenging problems at the interface between kernel methods and latent variable models. We believe our work will play a synergistic role in bridging together largely separate areas in machine learning research, including kernel methods, latent variable models, and tensor data analysis.

In the remainder of the paper, we first present the tensor view of kernel embeddings of multivariate distributions and their low rank structure in the presence of latent variables. Then we present our algorithm for hierarchical low rank decomposition of kernel embeddings, which makes use of a sequence of nested kernel singular value decompositions. Last, we provide both theoretical and empirical support for our proposed approach.

2 Kernel Embeddings of Distributions

We focus on continuous domains, and denote by X a random variable with domain Ω and density p(X). The instantiations of X are denoted by lower case characters, x. A reproducing kernel Hilbert space (RKHS) F on Ω with a kernel k(x, x′) is a Hilbert space of functions f : Ω → ℝ with inner product ⟨·, ·⟩_F. Its element k(x, ·) satisfies the reproducing property ⟨f(·), k(x, ·)⟩_F = f(x), and consequently ⟨k(x, ·), k(x′, ·)⟩_F = k(x, x′), meaning that we can view the evaluation of a function f at any point x ∈ Ω as an inner product. Alternatively, k(x, ·) can be viewed as an implicit feature map φ(x), where k(x, x′) = ⟨φ(x), φ(x′)⟩_F. For simplicity of notation, we assume that the domains of all variables are the same and that the same kernel function is applied to all variables.

A kernel embedding represents a density by its expected features, i.e., μ_X := E_X[φ(X)] = ∫_Ω φ(x) p(x) dx, which is a point in a potentially infinite-dimensional and implicit feature space of a kernel [8, 1, 2]. The embedding μ_X has the property that the expectation of any RKHS function f ∈ F can be evaluated as an inner product in F, ⟨μ_X, f⟩_F := E_X[f(X)].

Kernel embeddings can be readily generalized to the joint density of d variables, X_1, ..., X_d, using the d-th order tensor product feature space F^d. In this feature space, the feature map is defined as ⊗_{i=1}^d φ(x_i) := φ(x_1) ⊗ φ(x_2) ⊗ ... ⊗ φ(x_d), and the inner product in this space satisfies ⟨⊗_{i=1}^d φ(x_i), ⊗_{i=1}^d φ(x′_i)⟩_{F^d} = ∏_{i=1}^d ⟨φ(x_i), φ(x′_i)⟩_F = ∏_{i=1}^d k(x_i, x′_i). Then we can embed a joint density p(X_1, ..., X_d) into the tensor product feature space F^d by

C_{X_{1:d}} := E_{X_{1:d}}[⊗_{i=1}^d φ(X_i)] = ∫_{Ω^d} (⊗_{i=1}^d φ(x_i)) p(x_1, ..., x_d) ∏_{i=1}^d dx_i,   (1)

where we use X_{1:d} to denote the set of variables {X_1, ..., X_d}.
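In the finite-sample setting used later in the paper, the joint embedding is estimated by averaging the tensor product features over a sample, and applying it to the feature map of a query point reduces, by the inner product formula above, to an average of products of kernel evaluations, (1/n) Σ_i ∏_j k(x_j^i, x_j). The following is a minimal numpy sketch of this evaluation; the Gaussian kernel, the bandwidth s and the synthetic data are illustrative assumptions, not part of the paper.

import numpy as np

def gauss_k(a, b, s):
    # un-normalized Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 s^2)), elementwise
    return np.exp(-(a - b) ** 2 / (2.0 * s ** 2))

def joint_embedding_eval(X, x_query, s):
    # evaluate the empirical joint embedding (1/n) sum_i ⊗_j φ(x_j^i) applied to
    # ⊗_j φ(x_j), i.e. (1/n) sum_i prod_j k(x_j^i, x_j)
    vals = np.prod(gauss_k(X, x_query[None, :], s), axis=1)
    return vals.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # n = 100 samples, d = 2 variables
print(joint_embedding_eval(X, np.zeros(2), s=0.5))

With the normalized Gaussian kernel used in the experiments (Section 8), this quantity is exactly the ordinary product-kernel density estimate at the query point.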

The kernel embedding can also be generalized to conditional densities p(x|z) [9]:

μ_{X|z} := E_{X|z}[φ(X)] = ∫_Ω φ(x) p(x|z) dx.   (2)

Given this embedding, the conditional expectation of a function f ∈ F can be computed as E_{X|z}[f(X)] = ⟨f, μ_{X|z}⟩_F. Unlike the ordinary embeddings, an embedding of a conditional distribution is not a single element of the RKHS, but instead sweeps out a family of points in the RKHS, each indexed by a fixed value z of the conditioning variable Z. It is only by fixing Z to a particular value z that we obtain a single RKHS element, μ_{X|z} ∈ F. In other words, the conditional embedding is an operator, denoted C_{X|Z}, which takes as input z and outputs an embedding, i.e., μ_{X|z} = C_{X|Z} φ(z). Likewise, kernel embeddings of conditional distributions can also be generalized to joint distributions of d variables.

We will represent an observation from a discrete variable Z taking r possible values using the standard basis in ℝ^r (the one-of-r representation). That is, when Z takes the i-th value, the i-th dimension of the vector z is 1 and the other dimensions are 0. For instance, when r = 3, Z can take three possible values: (1, 0, 0)^⊤, (0, 1, 0)^⊤ and (0, 0, 1)^⊤. In this case, we let φ(Z) = Z and use the linear kernel k(Z, Z′) = Z^⊤ Z′. Then the conditional embedding operator reduces to a separate embedding μ_{X|z} for each conditional density p(x|z). Conceptually, we can concatenate these μ_{X|z} for different values of z in columns, C_{X|Z} := (μ_{X|z=(1,0,0)^⊤}, μ_{X|z=(0,1,0)^⊤}, μ_{X|z=(0,0,1)^⊤}). The operation C_{X|Z} φ(z) essentially picks out the corresponding embedding (or column).

3 Kernel Embeddings as Infinite Dimensional Higher Order Tensors

The above kernel embedding C_{X_{1:d}} can also be viewed as a multi-linear operator (tensor) of order d mapping from F × ... × F to ℝ. (For a generic introduction to tensors and tensor notation, please see [10].) The operator is linear in each argument (mode) when the other arguments are fixed. Furthermore, the application of the operator to a set of elements {f_i ∈ F}_{i=1}^d can be defined using the inner product from the tensor product feature space, i.e.,

C_{X_{1:d}} •_1 f_1 •_2 f_2 ... •_d f_d := ⟨C_{X_{1:d}}, ⊗_{i=1}^d f_i⟩_{F^d} = E_{X_{1:d}}[∏_{i=1}^d ⟨φ(X_i), f_i⟩_F],   (3)

where •_i means applying f_i to the i-th argument of C_{X_{1:d}}. Furthermore, we can define the generalized Frobenius norm ‖·‖ of C_{X_{1:d}} as ‖C_{X_{1:d}}‖² = Σ_{i_1=1}^∞ ... Σ_{i_d=1}^∞ (C_{X_{1:d}} •_1 e_{i_1} •_2 ... •_d e_{i_d})², using an orthonormal basis {e_i}_{i=1}^∞ of F. We can also define an inner product on the space of such operators with ‖C_{X_{1:d}}‖ < ∞ as

⟨C_{X_{1:d}}, C̃_{X_{1:d}}⟩ = Σ_{i_1=1}^∞ ... Σ_{i_d=1}^∞ (C_{X_{1:d}} •_1 e_{i_1} ... •_d e_{i_d}) (C̃_{X_{1:d}} •_1 e_{i_1} ... •_d e_{i_d}).   (4)

When C̃_{X_{1:d}} has the form E_{X_{1:d}}[⊗_{i=1}^d φ(X_i)], the above inner product reduces to E_{X_{1:d}}[C_{X_{1:d}} •_1 φ(X_1) •_2 ... •_d φ(X_d)]. In this paper, the ordering of the tensor modes is not essential, so we simply label them using the corresponding random variables.

We can reshape a higher order tensor into a lower order tensor by partitioning its modes into several disjoint groups. For instance, let I_1 = {X_1, ..., X_s} be the set of modes corresponding to the first s variables and I_2 = {X_{s+1}, ..., X_d}. Similarly to the Matlab function, we can obtain a 2nd order tensor by

C_{I_1;I_2} = reshape(C_{X_{1:d}}, I_1, I_2) : F^s × F^{d−s} → ℝ.   (5)

In the reverse direction, we can also reshape a lower order tensor into a higher order one by further partitioning certain modes of the tensor. For instance, we can partition I_1 into I′_1 = {X_1, ..., X_t} and I″_1 = {X_{t+1}, ..., X_s}, and turn C_{I_1;I_2} into a 3rd order tensor by

C_{I′_1;I″_1;I_2} = reshape(C_{I_1;I_2}, I′_1, I″_1, I_2) : F^t × F^{s−t} × F^{d−s} → ℝ.   (6)
Note that given an orthonormal basis {e_i}_{i=1}^∞ of F, we can readily obtain an orthonormal basis for, e.g., F^t, as {e_{i_1} ⊗ ... ⊗ e_{i_t}}_{i_1,...,i_t=1}^∞, and hence define the generalized Frobenius norm for C_{I_1;I_2} and C_{I′_1;I″_1;I_2}. This also implies that the generalized Frobenius norms are the same for all these reshaped tensors, i.e., ‖C_{X_{1:d}}‖ = ‖C_{I_1;I_2}‖ = ‖C_{I′_1;I″_1;I_2}‖.
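In finite dimensions, the mode-grouping reshapings in (5) and (6) and the invariance of the generalized Frobenius norm are exactly what ordinary array reshaping gives. The sketch below illustrates this on a random order-4 tensor; the feature dimension p is a stand-in for the (infinite-dimensional) RKHS and is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
p = 5                                      # stand-in feature dimension
C = rng.normal(size=(p, p, p, p))          # order-4 tensor C_{X_{1:4}}

C_I1_I2 = C.reshape(p * p, p * p)          # I_1 = {X_1, X_2}, I_2 = {X_3, X_4}: 2nd order
C_I1a_I1b_I2 = C.reshape(p, p, p * p)      # I'_1 = {X_1}, I''_1 = {X_2}, I_2 = {X_3, X_4}: 3rd order

# the (generalized) Frobenius norm is unchanged by reshaping
print(np.linalg.norm(C), np.linalg.norm(C_I1_I2), np.linalg.norm(C_I1a_I1b_I2))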

Figure 2: Three latent variable models with different tree topologies: (a) two observed variables X_1, X_2 with a single latent variable Z_1; (b) four observed variables X_{1:2}, X_{3:4} with two latent variables Z_{1:2}; (c) a caterpillar tree (hidden Markov model).

The 2nd order tensor C_{I_1;I_2} can also be viewed as the cross-covariance operator between the two sets of variables in I_1 and I_2. In this case, we can essentially use the notation and operations of matrices. For instance, we can perform a singular value decomposition C_{I_1;I_2} = Σ_{i=1}^∞ s_i (u_i ⊗ v_i), where the s_i ∈ ℝ are ordered in nonincreasing manner, and {u_i}_{i=1}^∞ ⊂ F^s and {v_i}_{i=1}^∞ ⊂ F^{d−s} are singular vectors. The rank of C_{I_1;I_2} is the smallest r such that s_i = 0 for i ≥ r. In this case, we also define U_r = (u_1, u_2, ..., u_r), V_r = (v_1, v_2, ..., v_r) and S_r = diag(s_1, s_2, ..., s_r), and denote the low rank approximation by C_{I_1;I_2} ≈ U_r S_r V_r^⊤. Finally, a 1st order tensor, reshape(C_{X_{1:d}}, {X_{1:d}}, ∅), is simply a vector, for which we use vector notation.

4 Low Rank Kernel Embeddings Induced by Latent Variables

In the presence of latent variables, the kernel embedding C_{X_{1:d}} will be low rank. For example, the two observed variables X_1 and X_2 in the example of Figure 1 are conditionally independent given the latent cluster indicator variable Z_1. That is, the joint density factorizes as p(X_1, X_2) = Σ_{z_1} p(z_1) p(X_1|z_1) p(X_2|z_1) (see Figure 2(a) for the graphical model). Throughout the paper, we assume that z_1 is discrete and takes r possible values. Then the embedding C_{X_1 X_2} of p(X_1, X_2) has rank at most r. Let z_1 be represented as the standard basis in ℝ^r. Then

C_{X_1 X_2} = E_{Z_1}[(E_{X_1|Z_1}[φ(X_1)] Z_1) ⊗ (E_{X_2|Z_1}[φ(X_2)] Z_1)] = C_{X_1|Z_1} E_{Z_1}[Z_1 ⊗ Z_1] (C_{X_2|Z_1})^⊤,   (7)

where E_{Z_1}[Z_1 ⊗ Z_1] is an r × r matrix, which restricts the rank of C_{X_1 X_2} to be at most r.

In our second example, four observed variables are connected via two latent variables Z_1 and Z_2, each taking r possible values. The conditional independence structure implies that the density p(X_1, X_2, X_3, X_4) factorizes as Σ_{z_1, z_2} p(X_1|z_1) p(X_2|z_1) p(z_1, z_2) p(X_3|z_2) p(X_4|z_2) (see Figure 2(b) for the graphical model). Reshaping its kernel embedding C_{X_{1:4}}, we obtain C_{X_{1:2};X_{3:4}} = reshape(C_{X_{1:4}}, {X_{1:2}}, {X_{3:4}}), which factorizes as

C_{X_{1:2};X_{3:4}} = E_{X_{1:2}|Z_1}[φ(X_1) ⊗ φ(X_2)] E_{Z_1 Z_2}[Z_1 ⊗ Z_2] (E_{X_{3:4}|Z_2}[φ(X_3) ⊗ φ(X_4)])^⊤,   (8)

where E_{Z_1 Z_2}[Z_1 ⊗ Z_2] is an r × r matrix. Hence the intrinsic rank of the reshaped embedding is only r, although the original kernel embedding C_{X_{1:4}} is a 4th order tensor with infinite dimensions.

In general, for a latent variable model p(X_1, ..., X_d) whose conditional independence structure is a tree T, various reshapings of its kernel embedding C_{X_{1:d}} according to edges of the tree will be low rank. More specifically, each edge in the latent tree corresponds to a pair of latent variables (Z_s, Z_t) (or an observed and a hidden variable (X_s, Z_t)), which induces a partition of the observed variables into two groups, I_1 and I_2. One can imagine splitting the latent tree into two subtrees by cutting the edge: one group of variables resides in the first subtree, and the other group in the second subtree. If we reshape the tensor according to this partitioning, then

Theorem 1 Assume that all observed variables are leaves in the latent tree structure, and that all latent variables take r possible values. Then rank(C_{I_1;I_2}) ≤ r.

Proof Due to the conditional independence structure induced by the latent tree, p(X_1, ..., X_d) = Σ_{z_s} Σ_{z_t} p(I_1|z_s) p(z_s, z_t) p(I_2|z_t). Then its embedding can be written as

C_{I_1;I_2} = C_{I_1|Z_s} E_{Z_s Z_t}[Z_s ⊗ Z_t] (C_{I_2|Z_t})^⊤,   (9)

where C_{I_1|Z_s} and C_{I_2|Z_t} are the conditional embedding operators for p(I_1|z_s) and p(I_2|z_t), respectively.
Since E_{Z_s Z_t}[Z_s ⊗ Z_t] is an r × r matrix, rank(C_{I_1;I_2}) ≤ r.

Theorem 1 implies that, given a latent tree model, we obtain a collection of low rank reshapings {C_{I_1;I_2}} of the kernel embedding C_{X_{1:d}}, each corresponding to an edge (Z_s, Z_t) of the tree.
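The rank bound of Theorem 1 is easy to check numerically in a finite-dimensional analogue of eq. (7), where the conditional embedding operators become p × r matrices and E[Z_1 ⊗ Z_1] is the diagonal matrix of state probabilities under the one-of-r coding. The factors, the prior and the dimensions in the sketch below are synthetic.

import numpy as np

rng = np.random.default_rng(0)
p, r = 50, 3                                   # feature dimension, number of latent states
C_x1_given_z = rng.normal(size=(p, r))         # columns: conditional embeddings, one per state
C_x2_given_z = rng.normal(size=(p, r))
prior = rng.dirichlet(np.ones(r))              # p(z_1)
E_zz = np.diag(prior)                          # E[Z_1 Z_1^T] under one-of-r coding

C_x1x2 = C_x1_given_z @ E_zz @ C_x2_given_z.T  # p x p matrix, cf. eq. (7)
print(np.linalg.matrix_rank(C_x1x2))           # at most r = 3, regardless of p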

We will denote by H(T, r) the class of kernel embeddings C_{X_{1:d}} whose various reshapings according to the latent tree T have rank at most r. We will also write C_{X_{1:d}} ∈ H(T, r) to indicate such a relation.

In practice, the latent tree model assumption may be misspecified for a joint density p(X_1, ..., X_d), and consequently the various reshapings of its kernel embedding C_{X_{1:d}} are only approximately low rank. In this case, we instead impose a (potentially misspecified) latent structure T and a fixed rank r on the data and obtain an approximate low rank decomposition of the kernel embedding. The goal is to obtain a low rank embedding C̃_{X_{1:d}} ∈ H(T, r), while at the same time ensuring that ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖ is small. In the following, we present such a decomposition algorithm.

5 Low Rank Decomposition of Kernel Embeddings

For simplicity of exposition, we focus on the case where the latent tree structure T has a caterpillar shape (Figure 2(c)). This decomposition can be viewed as a kernel generalization of the hierarchical tensor decompositions in [11, 12, 7]. The decomposition proceeds by reshaping the kernel embedding C_{X_{1:d}} according to the first edge (Z_1, Z_2), resulting in A_1 := C_{X_1;X_{2:d}}. We then perform a rank r approximation of it, resulting in A_1 ≈ U_r S_r V_r^⊤. This leads to the first intermediate tensor G_1 = U_r, and we reshape S_r V_r^⊤ and recursively decompose it. We note that Algorithm 1 contains only pseudo code and is not implementable in practice, since the kernel embedding to be decomposed can have infinite dimensions. We will design a practical kernel algorithm in the next section.

Algorithm 1 Low Rank Decomposition of Kernel Embeddings
In: A kernel embedding C_{X_{1:d}}, the caterpillar tree T and the desired rank r
Out: A low rank embedding C̃_{X_{1:d}} ∈ H(T, r), given as intermediate tensors {G_1, ..., G_d}
1: A_1 = reshape(C_{X_{1:d}}, {X_1}, {X_{2:d}}) according to tree T.
2: A_1 ≈ U_r S_r V_r^⊤, approximating A_1 using its r leading singular vectors.
3: G_1 = U_r and B_1 = S_r V_r^⊤. G_1 can be viewed as a model with two variables, X_1 and Z_1; B_1 as a new caterpillar tree model T_2 with variable X_1 removed from T.
4: for j = 2, ..., d − 1 do
5:   A_j = reshape(B_{j−1}, {Z_{j−1}, X_j}, {X_{j+1:d}}) according to tree T_j.
6:   A_j ≈ U_r S_r V_r^⊤, approximating A_j using its r leading singular vectors.
7:   G_j = reshape(U_r, {Z_{j−1}}, {X_j}, {Z_j}) and B_j = S_r V_r^⊤. G_j can be viewed as a model with three variables, X_j, Z_{j−1} and Z_j; B_j as a new caterpillar tree model T_{j+1} with variables Z_{j−1} and X_j removed from T_j.
8: end for
9: G_d = B_{d−1}

Once we finish the decomposition, we obtain the low rank representation of the kernel embedding as a set of intermediate tensors {G_1, ..., G_d}. In particular, we can think of G_1 as a second order tensor of dimension ∞ × r, G_d as a second order tensor of dimension r × ∞, and G_j for 2 ≤ j ≤ d − 1 as third order tensors of dimension r × ∞ × r. We can then apply the low rank kernel embedding C̃_{X_{1:d}} to a set of elements {f_i ∈ F}_{i=1}^d as

C̃_{X_{1:d}} •_1 f_1 •_2 ... •_d f_d = (G_1 •_1 f_1) (G_2 •_2 f_2) ... (G_{d−1} •_2 f_{d−1}) (G_d •_2 f_d).

Based on the above decomposition, one can obtain a low rank density estimate by p̃(x_1, ..., x_d) = C̃_{X_{1:d}} •_1 φ(x_1) •_2 ... •_d φ(x_d). We can also compute the difference between C_{X_{1:d}} and the operator C̃_{X_{1:d}} using the generalized Frobenius norm ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖.

6 Kernel Algorithm

In practice, we are only provided with a finite number of samples {(x_1^i, ..., x_d^i)}_{i=1}^n drawn i.i.d. from p(X_1, ..., X_d), and we want to obtain an empirical low rank decomposition of the kernel embedding. In this case, we perform a low rank decomposition of the empirical kernel embedding C̄_{X_{1:d}} = (1/n) Σ_{i=1}^n (⊗_{j=1}^d φ(x_j^i)).
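If the feature map were finite dimensional, Algorithm 1 could be carried out directly with dense linear algebra. The following numpy sketch runs the sequential reshape-and-truncate procedure on a small synthetic tensor and evaluates the result with the product formula above; the feature dimension p = 4 and rank r = 2 are arbitrary illustrative choices, not part of the paper. The kernel construction developed in this section removes the finite-dimensionality restriction.

import numpy as np

def low_rank_decompose(C, r):
    # Finite-dimensional sketch of Algorithm 1: sequentially reshape the order-d
    # tensor C (shape p x ... x p) along the caterpillar tree and truncate each
    # SVD to rank r; returns the intermediate tensors G_1, ..., G_d.
    d, p = C.ndim, C.shape[0]
    G = []
    B = C.reshape(p, -1)                        # A_1 = reshape(C, {X_1}, {X_{2:d}})
    for j in range(1, d):
        rows = p if j == 1 else r * p           # mode group {X_1} or {Z_{j-1}, X_j}
        A = B.reshape(rows, -1)
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        U, S, Vt = U[:, :r], S[:r], Vt[:r]      # rank-r truncation
        G.append(U if j == 1 else U.reshape(r, p, r))
        B = S[:, None] * Vt                     # B_j = S_r V_r^T, decomposed further
    G.append(B)                                 # G_d = B_{d-1}, an r x p matrix
    return G

def apply_decomposition(G, fs):
    # (G_1 ._1 f_1)(G_2 ._2 f_2)...(G_d ._2 f_d): a chain of vector/matrix products
    out = G[0].T @ fs[0]
    for Gj, f in zip(G[1:-1], fs[1:-1]):
        out = out @ np.einsum('apb,p->ab', Gj, f)
    return float(out @ (G[-1] @ fs[-1]))

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4, 4, 4))               # synthetic order-4 "embedding"
G = low_rank_decompose(C, r=2)
fs = [rng.normal(size=4) for _ in range(4)]
print(apply_decomposition(G, fs))               # approximates C ._1 f_1 ... ._4 f_4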
Although the empirical kernel embedding still has infinite dimensions, we will show that we can carry out the decomposition using just the kernel matrices. Let us denote the kernel matrix for dimension j of the data by K_j, where j ∈ {1, ..., d}. The (i, i′)-th entry of K_j can be computed as K_j^{ii′} = k(x_j^i, x_j^{i′}). (One can readily generalize this notation to decompositions where different reshapings have different ranks.)

Alternatively, one can think of implicitly forming the feature matrix Φ_j = (φ(x_j^1), ..., φ(x_j^n)), so that the corresponding kernel matrix is K_j = Φ_j^⊤ Φ_j. Furthermore, we denote the tensor feature matrix formed from dimensions j + 1 to d of the data by Ψ_j = (⊗_{j′=j+1}^d φ(x_{j′}^1), ..., ⊗_{j′=j+1}^d φ(x_{j′}^n)). The corresponding kernel matrix is L_j = Ψ_j^⊤ Ψ_j, with the (i, i′)-th entry of L_j defined as L_j^{ii′} = ∏_{j′=j+1}^d k(x_{j′}^i, x_{j′}^{i′}).

Steps 1-3 in Algorithm 1. The key building block of the algorithm is a kernel singular value decomposition (Algorithm 2), which we explain in more detail using the example in step 2 of Algorithm 1. Using the implicitly defined feature matrices, A_1 can be expressed as A_1 = (1/n) Φ_1 Ψ_1^⊤. For the low rank approximation A_1 ≈ U_r S_r V_r^⊤ using singular value decomposition, the r leading singular vectors U_r = (u_1, ..., u_r) lie in the span of Φ_1, i.e., U_r = Φ_1 (β_1, ..., β_r) with β ∈ ℝ^n. We can therefore transform the singular value decomposition problem for an infinite dimensional matrix into a generalized eigenvalue problem involving kernel matrices:

A_1 A_1^⊤ u = λ u  ⟺  (1/n²) Φ_1 Ψ_1^⊤ Ψ_1 Φ_1^⊤ Φ_1 β = λ Φ_1 β  ⟺  (1/n²) K_1 L_1 K_1 β = λ K_1 β.

Let the Cholesky decomposition of K_1 be R^⊤ R. Then the generalized eigenvalue problem can be solved by redefining β̃ = R β, solving the ordinary eigenvalue problem

(1/n²) R L_1 R^⊤ β̃ = λ β̃,  and obtaining  β = R^† β̃.   (10)

The resulting singular vectors satisfy ⟨u_l, u_{l′}⟩ = β_l^⊤ Φ_1^⊤ Φ_1 β_{l′} = β_l^⊤ K_1 β_{l′} = β̃_l^⊤ β̃_{l′} = δ_{ll′}. We can then obtain B_1 := S_r V_r^⊤ = U_r^⊤ A_1 by projecting the columns of A_1 onto the singular vectors U_r,

B_1 = (1/n) (β_1, ..., β_r)^⊤ Φ_1^⊤ Φ_1 Ψ_1^⊤ = (1/n) (β_1, ..., β_r)^⊤ K_1 Ψ_1^⊤ =: (γ_1, ..., γ_n) Ψ_1^⊤,   (11)

where γ_i ∈ ℝ^r can be treated as a reduced r-dimensional feature representation for each feature mapped data point φ(x_1^i). The first intermediate tensor is then G_1 = U_r = Φ_1 (β_1, ..., β_r) =: Φ_1 (θ_1, ..., θ_n)^⊤, where θ_i ∈ ℝ^r. The kernel singular value decomposition can then be carried out recursively on the reshaped tensor B_1.

Steps 5-7 in Algorithm 1. When j = 2, we first reshape B_1 = S_r V_r^⊤ to obtain A_2 = (1/n) Φ̃_2 Ψ_2^⊤, where Φ̃_2 = (γ_1 ⊗ φ(x_2^1), ..., γ_n ⊗ φ(x_2^n)). We then carry out a similar singular value decomposition as before, obtaining U_r = Φ̃_2 (β_1, ..., β_r) =: Φ̃_2 (θ_1, ..., θ_n)^⊤. The second intermediate tensor is G_2 = Σ_{i=1}^n γ_i ⊗ φ(x_2^i) ⊗ θ_i. Last, we define B_2 := S_r V_r^⊤ = U_r^⊤ A_2 as

B_2 = (1/n) (β_1, ..., β_r)^⊤ Φ̃_2^⊤ Φ̃_2 Ψ_2^⊤ = (1/n) (β_1, ..., β_r)^⊤ (Γ ⊙ K_2) Ψ_2^⊤ =: (γ_1, ..., γ_n) Ψ_2^⊤,   (12)

where Γ is the matrix with entries Γ^{ii′} = γ_i^⊤ γ_{i′} and ⊙ denotes the element-wise (Hadamard) product, and we carry out the recursive decomposition further. The result of the algorithm is an empirical low rank kernel embedding, Ĉ_{X_{1:d}}, represented as a collection of intermediate tensors {G_1, ..., G_d}. The overall algorithm is summarized in Algorithm 3. More details about the derivation can be found in Appendix A.

The application of the set of intermediate tensors {G_1, ..., G_d} to a set of elements {f_i ∈ F}_{i=1}^d can be expressed via kernel operations. For instance, we can obtain a density estimate by

p̂(x_1, ..., x_d) = Ĉ_{X_{1:d}} •_1 φ(x_1) •_2 ... •_d φ(x_d) = Σ_{z_1, ..., z_{d−1}} g_1(x_1, z_1) g_2(z_1, x_2, z_2) ... g_d(z_{d−1}, x_d),

where (see Appendix A for more details)

g_1(x_1, z_1) = G_1 •_1 φ(x_1) •_2 z_1 = Σ_{i=1}^n (z_1^⊤ θ_i) k(x_1^i, x_1),   (13)
g_j(z_{j−1}, x_j, z_j) = G_j •_1 z_{j−1} •_2 φ(x_j) •_3 z_j = Σ_{i=1}^n (z_{j−1}^⊤ γ_i) k(x_j^i, x_j) (z_j^⊤ θ_i),   (14)
g_d(z_{d−1}, x_d) = G_d •_1 z_{d−1} •_2 φ(x_d) = Σ_{i=1}^n (z_{d−1}^⊤ γ_i) k(x_d^i, x_d).   (15)

In the above formulas, each term is a weighted combination of kernel functions, and the weights are determined by the kernel singular value decomposition and the values of the latent variables {z_j}.
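The kernel singular value decomposition used in these steps (Algorithm 2 below) amounts to dense linear algebra on n × n kernel matrices. The following is a minimal numpy sketch of the transformation in eq. (10); the Gaussian kernel and the small jitter term (added so that K_1 is strictly positive definite and its Cholesky factor is invertible) are illustrative choices rather than part of the paper.

import numpy as np

def kernel_svd(K, L, r):
    # Sketch of the kernel SVD step, cf. eq. (10): returns an n x r matrix of
    # coefficients (beta_1, ..., beta_r) such that U_r = Phi (beta_1, ..., beta_r).
    n = K.shape[0]
    R = np.linalg.cholesky(K).T                 # K = R^T R with R upper triangular
    M = (R @ L @ R.T) / n ** 2                  # ordinary symmetric eigenproblem
    _, evecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    beta_tilde = evecs[:, ::-1][:, :r]          # leading r eigenvectors
    return np.linalg.solve(R, beta_tilde)       # beta = R^{-1} beta_tilde

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # n = 50 samples, d = 3 variables
sqdist = lambda a: (a[:, None] - a[None, :]) ** 2
K1 = np.exp(-sqdist(X[:, 0]) / 2.0)             # kernel matrix for dimension 1
L1 = np.exp(-sqdist(X[:, 1]) / 2.0) * np.exp(-sqdist(X[:, 2]) / 2.0)  # dims 2..d
beta = kernel_svd(K1 + 1e-8 * np.eye(50), L1, r=2)
print(beta.shape)                               # (50, 2)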
7 Performance Guarantees

As we mentioned in the introduction, the latent structure imposed in the low rank decomposition of kernel embeddings may be misspecified, and the decomposition of empirical embeddings may suffer from sampling error. In this section, we provide finite sample guarantees for Algorithm 3 even when the latent structures are misspecified. More specifically, we bound, in terms of the generalized Frobenius norm ‖C_{X_{1:d}} − Ĉ_{X_{1:d}}‖, the difference between the true kernel embedding and the low rank kernel embedding estimated from a set of n i.i.d. samples {(x_1^i, ..., x_d^i)}_{i=1}^n.

Algorithm 2 KernelSVD(K, L, r)
Out: A collection of vectors (θ_1, ..., θ_n)
1: Perform the Cholesky decomposition K = R^⊤ R
2: Solve the eigendecomposition (1/n²) R L R^⊤ β̃ = λ β̃, and keep the r leading eigenvectors (β̃_1, ..., β̃_r)
3: Compute β_1 = R^† β̃_1, ..., β_r = R^† β̃_r, and reorganize (θ_1, ..., θ_n)^⊤ = (β_1, ..., β_r)

Algorithm 3 Kernel Low Rank Decomposition of the Empirical Embedding C̄_{X_{1:d}}
In: A sample {(x_1^i, ..., x_d^i)}_{i=1}^n, the desired rank r, a query point (x_1, ..., x_d)
Out: A low rank embedding Ĉ_{X_{1:d}} ∈ H(T, r), given as intermediate operators {G_1, ..., G_d}
1: L_d = 1 1^⊤ (the all-ones matrix)
2: for j = d, d − 1, ..., 1 do
3:   Compute the matrix K_j with K_j^{ii′} = k(x_j^i, x_j^{i′}); furthermore, if j < d, then L_j = L_{j+1} ⊙ K_{j+1}
4: end for
5: (θ_1, ..., θ_n) = KernelSVD(K_1, L_1, r)
6: G_1 = Φ_1 (θ_1, ..., θ_n)^⊤, and compute (γ_1, ..., γ_n) = (θ_1, ..., θ_n) K_1
7: for j = 2, ..., d − 1 do
8:   Γ = (γ_1, ..., γ_n)^⊤ (γ_1, ..., γ_n), and compute (θ_1, ..., θ_n) = KernelSVD(K_j ⊙ Γ, L_j, r)
9:   G_j = Σ_{i=1}^n γ_i ⊗ φ(x_j^i) ⊗ θ_i, and compute (γ_1, ..., γ_n) = (θ_1, ..., θ_n) (K_j ⊙ Γ)
10: end for
11: G_d = (γ_1, ..., γ_n) Φ_d^⊤

First, we observe that the difference can be decomposed into two terms,

‖C_{X_{1:d}} − Ĉ_{X_{1:d}}‖ ≤ ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖ + ‖C̃_{X_{1:d}} − Ĉ_{X_{1:d}}‖,   (16)

where the first term (E_1: model error) is due to the fact that the latent structures may be misspecified, while the second term (E_2: estimation error) is due to estimation from a finite number of data points. We bound these two sources of error separately (the proof is deferred to Appendix B).

Theorem 2 Suppose each reshaping C_{I_1;I_2} of C_{X_{1:d}} according to an edge in the latent tree structure has a rank r approximation U_r S_r V_r^⊤ with error ‖C_{I_1;I_2} − U_r S_r V_r^⊤‖ ≤ ε. Then the low rank decomposition C̃_{X_{1:d}} from Algorithm 1 satisfies ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖ ≤ c_1 ε, where c_1 is a constant depending only on the number of variables d.

Although previous works [5, 6] have also used hierarchical decompositions of kernel embeddings, their decompositions make the strong assumption that the latent tree model is correctly specified. When the model is misspecified, these algorithms have no guarantees whatsoever, and may fail drastically, as we show in later experiments. In contrast, the decomposition we propose here is robust in the sense that, even when the latent tree structure is misspecified, we can still provide an approximation guarantee for the algorithm. Furthermore, when the latent tree structure is correctly specified and the rank r is also correct, then C_{I_1;I_2} has rank r, hence ε = 0 and our decomposition algorithm does not incur any modeling error.

Next, we provide a bound for the estimation error. The estimation error arises from decomposing the empirical estimate C̄_{X_{1:d}} of the kernel embedding, and the error can accumulate as we combine the intermediate tensors {G_1, ..., G_d} to form the final low rank kernel embedding. More specifically, we have the following bound (the proof is deferred to Appendix C).

Theorem 3 Suppose the r-th singular value of each reshaping C_{I_1;I_2} of C_{X_{1:d}} according to an edge in the latent tree structure is lower bounded by λ. Then, with probability at least 1 − δ,

‖C_{X_{1:d}} − Ĉ_{X_{1:d}}‖ ≤ c_2 ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖ + c_2 c / √n,

where c_2 grows with d as a power of (1 + λ)²/λ², and c is a constant associated with the kernel and the probability δ.

From the above theorem, we can see that the smaller the r-th singular value, the more difficult it is to estimate the low rank kernel embedding. Although in the bound the error can grow exponentially with d through the factor involving (1 + λ)²/λ², in our experiments we did not observe such exponential degradation of performance, even on relatively high dimensional datasets.

8 Experiments

Besides the synthetic dataset shown in Figure 1, where the low rank kernel embedding leads to a significant improvement in estimating the density, we also experimented with real world datasets from the UCI data repository. We took datasets with varying dimensions and numbers of data points, whose attributes are continuous-valued. We whitened the data and compared the low rank kernel embeddings (Low Rank) obtained from Algorithm 3 to 3 alternatives for continuous density estimation, namely, a mixture of Gaussians with full covariance matrices, the ordinary kernel density estimator (KDE), and the kernel spectral algorithm for latent trees (Spectral) [6]. We used the Gaussian kernel k(x, x′) = 1/(√(2π) s) exp(−‖x − x′‖²/(2s²)) for KDE, Spectral and our method (Low Rank). We split each dataset into 10 subsets, and used nested cross-validation based on held-out likelihood to choose hyperparameters: the kernel parameter s for KDE, Spectral and Low Rank ({2^{−3}, 2^{−2}, 2^{−1}, 1, 2, 4, 8} times the median pairwise distance), the rank parameter r for Spectral and Low Rank (ranging from 2 to 30), and the number of components in the Gaussian mixture (ranging from 2 to #Sample/30). For both Spectral and Low Rank, we used the caterpillar tree in Figure 2(c) as the structure for the latent variable model.

From Table 1, we can see that low rank kernel embeddings provide the best or comparable held-out negative log-likelihood across the datasets we experimented with. On some datasets, low rank kernel embeddings lead to drastic improvements over the alternatives. For instance, on the datasets sonar and yeast, the improvement is dramatic. The Spectral approach sometimes performs even worse than the other alternatives. This makes sense, since the caterpillar tree supplied to the algorithm may be far from the true structure, and Spectral is not robust to model misspecification. The Spectral algorithm also caused numerical problems in practice. In contrast, our method Low Rank uses the same latent structure, but achieves much more robust results.

Table 1: Negative log-likelihood on held-out data (the lower the better). Columns: Data Set, # Sample, Dim., Gaussian mixture, KDE, Spectral, Low rank; rows: australian, bupa, german, heart, ionosphere, pima, parkinsons, sonar, wpbc, wine, yeast (results reported as mean ± standard deviation).

9 Discussion and Conclusion

In this paper, we presented a robust kernel embedding algorithm which can make use of the low rank structure of the data, and provided both theoretical and empirical support for it. However, there are still a number of issues which deserve further research. First, the algorithm requires a sequence of kernel singular value decompositions, which can be computationally intensive for high dimensional and large datasets. Developing efficient algorithms that retain the theoretical guarantees will be interesting future research. Second, the statistical analysis could be sharpened. For the moment, the analysis does not seem to suggest that the estimator obtained by our algorithm is better than the ordinary KDE. Third, it will be interesting empirical work to explore other applications of low rank kernel embeddings, such as kernel two-sample tests, kernel independence tests and kernel belief propagation.

References
[1] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13-31. Springer, 2007.
[2] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proc. Annual Conf. Computational Learning Theory, 2008.
[3] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
[4] L. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin. Kernel belief propagation. In Proc. Intl. Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011.
[5] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning, 2010.
[6] L. Song, A. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems, volume 25, 2011.
[7] L. Song, M. Ishteva, H. Park, A. Parikh, and E. Xing. Hierarchical tensor decomposition of latent tree graphical models. In International Conference on Machine Learning (ICML), 2013.
[8] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[9] L. Song, J. Huang, A. J. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions. In Proceedings of the International Conference on Machine Learning, 2009.
[10] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[11] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029-2054, 2010.
[12] I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295-2317, 2011.
[13] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905-934, 2010.


More information

Entanglement is not very useful for estimating multiple phases

Entanglement is not very useful for estimating multiple phases PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The

More information

inflow outflow Part I. Regular tasks for MAE598/494 Task 1

inflow outflow Part I. Regular tasks for MAE598/494 Task 1 MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation Relative Entropy an Score Function: New Information Estimation Relationships through Arbitrary Aitive Perturbation Dongning Guo Department of Electrical Engineering & Computer Science Northwestern University

More information

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,

More information

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Course Project for CDS 05 - Geometric Mechanics John M. Carson III California Institute of Technology June

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Sturm-Liouville Theory

Sturm-Liouville Theory LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

d-dimensional Arrangement Revisited

d-dimensional Arrangement Revisited -Dimensional Arrangement Revisite Daniel Rotter Jens Vygen Research Institute for Discrete Mathematics University of Bonn Revise version: April 5, 013 Abstract We revisit the -imensional arrangement problem

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Jointly continuous distributions and the multivariate Normal

Jointly continuous distributions and the multivariate Normal Jointly continuous istributions an the multivariate Normal Márton alázs an álint Tóth October 3, 04 This little write-up is part of important founations of probability that were left out of the unit Probability

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics

More information