Invertible Nonlinear Dimensionality Reduction via Joint Dictionary Learning
Xian Wei, Martin Kleinsteuber, and Hao Shen
Department of Electrical and Computer Engineering, Technische Universität München, Germany
{xian.wei, kleinsteuber,

Abstract. This paper proposes an invertible nonlinear dimensionality reduction method that jointly learns dictionaries in both the original high-dimensional data space and its low-dimensional representation space. We construct a cost function that preserves inner products of data representations in the low-dimensional space, and we minimize it with a conjugate gradient algorithm on a smooth manifold. In numerical experiments on image processing tasks, the proposed method provides competitive and robust performance in image compression and recovery, even on heavily corrupted data, so it can also be considered an alternative approach to compressed sensing. Moreover, our approach can outperform compressed sensing in task-driven learning problems such as data visualization.

Keywords: Invertible nonlinear dimensionality reduction, joint dictionary learning, inner product preservation, compressed sensing.

1 Introduction

Dimensionality reduction (DR) is a powerful instrument for tackling large-scale signal processing problems. It often serves as a preprocessing step that transforms the original high-dimensional data into a low-dimensional space, where specific tasks such as filtering or 2D visualization can be performed directly on the low-dimensional representations, cf. [1]. Most classic DR algorithms focus on finding a low-dimensional embedding of the original data and are not reversible; in other words, there is no reliable reconstruction from the low-dimensional space back to the original high-dimensional space.
However, many applications, such as communication, image down-sampling and super-resolution, and modeling of time-varying data (dynamic textures), require the DR process to be reversible. Finding an invertible nonlinear DR mapping is a long-standing problem in the community. Recently, the technique of compressed sensing (CS) [2] has shown that high-dimensional signals and images can be reconstructed from measurements in a far lower-dimensional space than is usually considered necessary. Formally, it assumes that a signal $x \in \mathbb{R}^m$ admits a factorization $x = D\alpha$ under a set of atoms $D$, also called a dictionary, where $\alpha \in \mathbb{R}^k$ is sparse. Then the CS problem can
be formulated as recovering $x$ from its low-dimensional representation $y = Ax$, or $y = AD\alpha$, with $y \in \mathbb{R}^d$ and $d \ll m$, where $A \in \mathbb{R}^{d \times m}$ is called a projection matrix and may be chosen as a random Gaussian matrix. This paper considers an alternative DR process associated with dictionary learning (DL) models [3, 4]; that is, $D$ is not a given standard orthonormal basis but is learned from training samples. This problem has been studied within the CS framework, known as blind CS [5], as well as in models where $D$ and $A$ are learned from data simultaneously via joint optimization [6, 7]. However, one challenge of the CS model is that it has to guarantee incoherence between the projection matrix $A$ and the dictionary $D$, as well as incoherence between pairs of atoms within $D$ and $A$ themselves [2], which is often difficult to achieve when $D$ is redundant. In addition, learning tasks (such as 2D visualization) in the compressed domain are often difficult [8]. In contrast to CS methods based on an optimized projection matrix [6, 7], we propose an alternative approach that models the DR process with a pair of coupled dictionaries $(D \in \mathbb{R}^{m \times k}, P \in \mathbb{R}^{d \times k})$, $d \ll m$, called DRCDL. DRCDL can successfully achieve the task of interest while avoiding learning the projection matrix directly. Finally, we cast the joint learning problem as an optimization on a product manifold, which is efficiently solved via a geometric conjugate gradient (CG) method.

2 Joint dictionary learning under inner product preservation

Let us denote by $X := [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$ the data matrix containing $n$ data samples $x_i \in \mathbb{R}^m$, and by $Y := [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$ with $d < m$ its corresponding low-dimensional representation via some DR mapping $g \colon x_i \mapsto y_i$ for all $i = 1, \ldots, n$. In this work, we assume that the original data and its low-dimensional representation share the same, or a very similar, sparse structure.
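As background, the classical CS pipeline described above (a signal sparse in a dictionary $D$, measured through a random Gaussian $A$, and recovered from $y = AD\alpha$) can be sketched as follows. The dimensions, seed, and the simple orthogonal matching pursuit (OMP) solver are illustrative choices for this sketch, not the setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, d, s = 128, 256, 64, 3   # signal dim, atoms, measurements, sparsity

# Dictionary with unit-norm atoms and an s-sparse ground-truth code
D = rng.standard_normal((m, k))
D /= np.linalg.norm(D, axis=0)
alpha = np.zeros(k)
alpha[rng.choice(k, s, replace=False)] = rng.uniform(1.0, 2.0, s) * rng.choice([-1, 1], s)
x = D @ alpha

# Random Gaussian projection matrix A and measurements y = A D alpha
A = rng.standard_normal((d, m)) / np.sqrt(d)
y = A @ x
B = A @ D  # effective dictionary in the measurement domain

def omp(B, y, s):
    """Greedy orthogonal matching pursuit for y ~ B a with at most s nonzeros."""
    residual, idx = y.copy(), []
    for _ in range(s):
        idx.append(int(np.argmax(np.abs(B.T @ residual))))
        coef, *_ = np.linalg.lstsq(B[:, idx], y, rcond=None)
        residual = y - B[:, idx] @ coef
    a = np.zeros(B.shape[1])
    a[idx] = coef
    return a

# Recover x from d << m measurements
x_hat = D @ omp(B, y, s)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

When the support is identified correctly, the least-squares step makes the recovery exact up to rounding.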
Such an assumption is commonly shared by coupled sparse representation models [9]. We assume that all data points $x_i \in \mathbb{R}^m$ admit sparse representations with respect to a common dictionary $D := [d_1, \ldots, d_k] \in \mathbb{R}^{m \times k}$, i.e.

$$x_i = D\phi_i, \quad \text{for all } i = 1, \ldots, n, \qquad (1)$$

where $\phi_i \in \mathbb{R}^k$ is the corresponding sparse representation of $x_i$. In this work, we further assume that all columns of the dictionary $D$ have unit norm. We then define the set

$$S(m, k) := \{D \in \mathbb{R}^{m \times k} \mid \operatorname{ddiag}(D^\top D) = I_k\}, \qquad (2)$$

where $\operatorname{ddiag}(Z)$ is the diagonal matrix whose diagonal entries are those of $Z$, and $I_k$ denotes the identity matrix. We assume that the low-dimensional representations $Y$ share the same sparse structure with respect to a low-dimensional dictionary $P := [p_1, \ldots, p_k] \in \mathbb{R}^{d \times k}$, i.e. $y_i = P\phi_i$ with $P \in S(d, k)$. By a slight abuse of notation, we denote by $\phi_D \colon x_i \mapsto \phi_i$ and $\phi_P \colon y_i \mapsto \phi_i$ the sparse
coding in the original data space and the low-dimensional representation space, respectively. We propose a nonlinear DR mapping

$$g \colon x_i \mapsto P\phi_D(x_i), \qquad (3)$$

and, reversely,

$$g^{-1} \colon y_i \mapsto D\phi_P(y_i). \qquad (4)$$

The aim of DR is to find a mapping $g \colon x_i \mapsto y_i$ that is stable and preserves as much useful structure as possible. According to the well-known Johnson-Lindenstrauss (JL) lemma, cf. [10], every $n$-point subset of Euclidean space can be embedded in dimension $O(\epsilon^{-2} \log n)$ with $1 + \epsilon$ distortion, where $0 < \epsilon < 1/2$. In particular, the distance or inner product information of the high-dimensional data is preserved in the low-dimensional representation space when $\epsilon$ is close to zero [11, 12]. Specifically, the loss introduced by the DR mapping $g$ can be measured by the function

$$G(X; Y) := \sum_{i,j=1}^{n} \left( x_i^\top x_j - y_i^\top y_j \right)^2. \qquad (5)$$

Recall the assumption that both the original data point $x_i$ and its low-dimensional representation $y_i := g(x_i)$ share the same sparse structure, i.e. $x_i = D\phi_i$ and $y_i = P\phi_i$. We adapt the loss function (5) to the coupled sparse representation setting as

$$G_{(D,P)}(X; Y) = \sum_{i,j=1}^{n} \left( \phi_i^\top \left( D^\top D - P^\top P \right) \phi_j \right)^2. \qquad (6)$$

Roughly speaking, the loss $G_{(D,P)}$ is small if either the sparse representations are pairwise conjugate with respect to $D^\top D - P^\top P$, or the difference $D^\top D - P^\top P$ is essentially small. In this work, we pursue the second option. As both dictionaries $D$ and $P$ are usually assumed to be of full rank, $P$ can also be considered a low-rank approximation of $D$. To ensure stability of the proposed nonlinear DR mapping $g$, we need to guarantee moderate mutual incoherence of both the high- and low-dimensional dictionaries $D \in \mathbb{R}^{m \times k}$ and $P \in \mathbb{R}^{d \times k}$, according to the theory of sparse representation, cf. [13]. However, when the difference $D^\top D - P^\top P$ is sufficiently small, the mutual coherence of $D$ is ensured to be close to the mutual coherence of $P$.
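The equivalence between (5) and (6) under the shared-code assumption is easy to verify numerically. The following sketch (the dimensions and random data are illustrative, not from the paper) builds unit-norm dictionaries as in (2) and checks that both loss expressions agree.

```python
import numpy as np

def gram_loss(X, Y):
    """G(X; Y) of Eq. (5): squared differences of all pairwise inner products."""
    GX, GY = X.T @ X, Y.T @ Y
    return float(np.sum((GX - GY) ** 2))

def coupled_gram_loss(Phi, D, P):
    """G_{(D,P)} of Eq. (6): the same loss expressed through the shared codes."""
    M = D.T @ D - P.T @ P
    return float(np.sum((Phi.T @ M @ Phi) ** 2))

rng = np.random.default_rng(2)
m, d, k, n = 20, 6, 10, 15
D = rng.standard_normal((m, k)); D /= np.linalg.norm(D, axis=0)  # D in S(m, k)
P = rng.standard_normal((d, k)); P /= np.linalg.norm(P, axis=0)  # P in S(d, k)
Phi = rng.standard_normal((k, n))                                # shared codes

X, Y = D @ Phi, P @ Phi   # x_i = D phi_i, y_i = P phi_i
g1, g2 = gram_loss(X, Y), coupled_gram_loss(Phi, D, P)
```

Since $X^\top X - Y^\top Y = \Phi^\top (D^\top D - P^\top P)\, \Phi$, the two values coincide up to rounding.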
Hence, instead of penalizing both $D$ and $P$, we propose to control the mutual coherence of $P$ alone via the logarithmic barrier function

$$r(P) = -\sum_{1 \le i < j \le k} \log\left( 1 - (p_i^\top p_j)^2 \right). \qquad (7)$$

Finally, let us denote $\Phi(Y, P) := [\phi_P(y_1), \ldots, \phi_P(y_n)] \in \mathbb{R}^{k \times n}$. Then, also taking the reconstruction error in the original data space into account, we propose the following
cost function

$$f \colon S(m, k) \times S(d, k) \times \mathbb{R}^{d \times n} \to \mathbb{R}, \quad (D, P, Y) \mapsto \frac{1}{2n} \|X - D\Phi(Y, P)\|_F^2 + \frac{\mu_1}{2k^2} \|D^\top D - P^\top P\|_F^2 + \mu_2\, r(P), \qquad (8)$$

where $\mu_1 > 0$ weighs the distance-preservation loss of the DR against how accurately $D\Phi(Y, P)$ reconstructs the training samples, and $\mu_2 > 0$ controls the mutual coherence of the learned dictionary. As an extension, if we assume that the relationship between $D$ and $P$ is linear, it reads $P = U^\top D$, and therefore $y$ can be obtained directly via $y = U^\top D\phi = U^\top x$. Here, $U \in \mathbb{R}^{m \times d}$ is chosen as the $d$ leading left singular vectors of $D$. We call this model compressed coupled dictionary learning (CCDL) in this paper.

3 A conjugate gradient DR algorithm

Recall that the set $S(m, k)$ is the product of $k$ unit spheres, i.e. a $k(m-1)$-dimensional smooth manifold. In what follows, we adopt the conjugate gradient algorithm on smooth manifolds, which has demonstrated competitive performance in (co-)sparse dictionary learning, cf. [3, 4], to minimize the cost function $f$ on the product manifold $S(m, k) \times S(d, k) \times \mathbb{R}^{d \times n}$. In this work, we employ the sparse solution given by the elastic-net problem, cf. [14],

$$\phi^* := \operatorname*{argmin}_{\phi \in \mathbb{R}^k} \tfrac{1}{2} \|y - P\phi\|_2^2 + \lambda_1 \|\phi\|_1 + \tfrac{\lambda_2}{2} \|\phi\|_2^2, \qquad (9)$$

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are regularization parameters that ensure stability and uniqueness of the solution. Let us define the set of indices of the nonzero entries of the solution $\phi^* = [\varphi_1, \ldots, \varphi_k]^\top \in \mathbb{R}^k$ as $\Lambda := \{i \in \{1, \ldots, k\} \mid \varphi_i \neq 0\}$. Then the solution of the elastic net (9) has the closed-form expression

$$\phi_P(y) := \left( P_\Lambda^\top P_\Lambda + \lambda_2 I_{|\Lambda|} \right)^{-1} \left( P_\Lambda^\top y - \lambda_1 s_\Lambda \right), \qquad (10)$$

where $s_\Lambda \in \{\pm 1\}^{|\Lambda|}$ carries the signs of $\phi^*_\Lambda$, and $P_\Lambda \in \mathbb{R}^{d \times |\Lambda|}$ is the sub-matrix of $P$ whose atoms (columns) fall into the support $\Lambda$. The solution $\phi_P(y)$ has the algorithmically convenient property of being locally twice differentiable with respect to both $P$ and $y$, cf. [15, 16].
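The closed form (10) can be checked numerically. In the sketch below we solve the elastic net (9) with a plain ISTA (proximal gradient) iteration, an illustrative solver choice rather than anything prescribed by the paper, and confirm that, once the support and signs are fixed, (10) reproduces the iterative solution.

```python
import numpy as np

def elastic_net_ista(P, y, lam1, lam2, n_iter=5000):
    """ISTA for 0.5||y - P phi||^2 + lam1 ||phi||_1 + 0.5 lam2 ||phi||^2, Eq. (9)."""
    L = np.linalg.norm(P, 2) ** 2 + lam2        # Lipschitz constant of the smooth part
    phi = np.zeros(P.shape[1])
    for _ in range(n_iter):
        g = P.T @ (P @ phi - y) + lam2 * phi    # gradient of the smooth part
        z = phi - g / L
        phi = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # soft threshold
    return phi

def closed_form_on_support(P, y, lam1, lam2, phi):
    """Eq. (10): exact solution restricted to the active set Lambda of phi."""
    Lam = np.flatnonzero(phi)
    s = np.sign(phi[Lam])                        # sign vector s_Lambda
    PL = P[:, Lam]                               # sub-dictionary P_Lambda
    phi_L = np.linalg.solve(PL.T @ PL + lam2 * np.eye(Lam.size), PL.T @ y - lam1 * s)
    out = np.zeros_like(phi)
    out[Lam] = phi_L
    return out

rng = np.random.default_rng(4)
d, k = 12, 20
P = rng.standard_normal((d, k)); P /= np.linalg.norm(P, axis=0)
y = rng.standard_normal(d)

phi = elastic_net_ista(P, y, lam1=0.3, lam2=0.03)
phi_cf = closed_form_on_support(P, y, 0.3, 0.03, phi)
```

The agreement follows from the KKT conditions of (9): on the active set, $P_\Lambda^\top(P\phi - y) + \lambda_2 \phi_\Lambda + \lambda_1 s_\Lambda = 0$, which rearranges exactly to (10).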
Recall the tangent space $T_D S(m, k)$ of $S(m, k)$ at $D \in S(m, k)$,

$$T_D S(m, k) := \{\Xi \in \mathbb{R}^{m \times k} \mid \operatorname{ddiag}(\Xi^\top D) = 0\}, \qquad (11)$$

and the orthogonal projection of a matrix $Z \in \mathbb{R}^{m \times k}$ onto the tangent space $T_D S(m, k)$ with respect to the inner product $\langle \Xi, \Psi \rangle = \operatorname{tr}(\Xi^\top \Psi)$,

$$\Pi_D(Z) := Z - D \operatorname{ddiag}(D^\top Z). \qquad (12)$$
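The projection (12) and the tangency condition (11) translate directly into code; the dimensions below are illustrative.

```python
import numpy as np

def ddiag(Z):
    """Diagonal matrix keeping only the diagonal entries of Z."""
    return np.diag(np.diag(Z))

def project_tangent(D, Z):
    """Orthogonal projection of Z onto T_D S(m, k), Eq. (12)."""
    return Z - D @ ddiag(D.T @ Z)

rng = np.random.default_rng(5)
m, k = 10, 4
D = rng.standard_normal((m, k))
D /= np.linalg.norm(D, axis=0)   # D in S(m, k): unit-norm atoms
Z = rng.standard_normal((m, k))
Xi = project_tangent(D, Z)
```

Since each atom has unit norm, $d_i^\top \xi_i = d_i^\top z_i - (d_i^\top z_i)(d_i^\top d_i) = 0$, so $\Xi$ satisfies (11), and applying the projection twice changes nothing.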
Then, by computing the first derivative of $f$ at $(D, P, Y)$ in a tangent direction $(H_D, H_P, H_Y) \in T_{(D,P,Y)} \left( S(m, k) \times S(d, k) \times \mathbb{R}^{d \times n} \right)$, we obtain the Riemannian gradient of $f$ at $(D, P, Y)$ as

$$\operatorname{grad} f(D, P, Y) = \left( \Pi_D(f'(D)), \; \Pi_P(f'(P)), \; f'(Y) \right), \qquad (13)$$

where $f'(D)$, $f'(P)$, and $f'(Y)$ are the Euclidean gradients of $f$ with respect to the three arguments, respectively. Firstly, the Euclidean gradient $f'(D)$ of $f$ with respect to $D$ is computed as

$$f'(D) = \frac{1}{n} \sum_{i=1}^{n} \left( D\phi_P(y_i) - x_i \right) \phi_P(y_i)^\top + \frac{2\mu_1}{k^2} D\left( D^\top D - P^\top P \right). \qquad (14)$$

Using some shorthand notation, let $\Lambda_i$ be the support of the nonzero entries of $\phi_P(y_i)$, and denote $K_i := P_{\Lambda_i}^\top P_{\Lambda_i} + \lambda_2 I_{|\Lambda_i|}$, $r_i := P_{\Lambda_i}^\top y_i - \lambda_1 s_{\Lambda_i}$, $\tilde{x}_i := x_i - D\phi_P(y_i)$, and $q_i := \tilde{x}_i r_i^\top$. Then the Euclidean gradient $f'(P)$ of $f$ is computed as

$$f'(P) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{V}\left\{ -y_i \tilde{x}_i^\top D_{\Lambda_i} K_i^{-1} + P_{\Lambda_i} K_i^{-1} D_{\Lambda_i}^\top q_i K_i^{-1} + P_{\Lambda_i} K_i^{-1} q_i^\top D_{\Lambda_i} K_i^{-1} \right\} + \frac{2\mu_1}{k^2} P\left( P^\top P - D^\top D \right) + \mu_2\, r'(P), \qquad (15)$$

with

$$r'(P) = P \sum_{1 \le i < j \le k} \frac{2\, p_i^\top p_j}{1 - (p_i^\top p_j)^2} \left( E_{ij} + E_{ji} \right) \qquad (16)$$

being the gradient of the logarithmic barrier function (7). Here, $\mathcal{V}\{\cdot\}$ denotes the embedding of a matrix supported on the columns indexed by $\Lambda_i$ into the full $d \times k$ matrix, with zeros elsewhere. By $E_{ij}$, we denote the matrix whose entry in the $i$-th row and $j$-th column is equal to one, and all others are zero. Finally, the Euclidean gradient $f'(Y)$ is computed as

$$f'(Y) = -\frac{1}{n} \left[ P_{\Lambda_1} K_1^{-1} D_{\Lambda_1}^\top \tilde{x}_1, \; \ldots, \; P_{\Lambda_n} K_n^{-1} D_{\Lambda_n}^\top \tilde{x}_n \right]. \qquad (17)$$

By assembling the Riemannian gradients, geodesics, and parallel transports on the underlying manifolds, a conjugate gradient algorithm on $S(m, k) \times S(d, k) \times \mathbb{R}^{d \times n}$ is straightforward. Due to the page limit, we omit the presentation of the algorithm and refer to [4] for further technical details.

4 Numerical experiments

In this section, we investigate the performance of our proposed DR framework via coupled dictionary learning (DRCDL) and its linear extension, compressed CDL (CCDL), for signal compression, reconstruction, and visualization.
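As a sanity check on the gradient expressions of Section 3, the barrier (7) and its Euclidean gradient (16) can be verified against finite differences. This is an illustrative sketch with arbitrary small dimensions, not the paper's implementation.

```python
import numpy as np

def barrier(P):
    """r(P) of Eq. (7) for a dictionary with unit-norm columns."""
    G = P.T @ P
    iu = np.triu_indices(G.shape[0], 1)
    return -np.sum(np.log(1.0 - G[iu] ** 2))

def barrier_grad(P):
    """Euclidean gradient r'(P) of Eq. (16)."""
    k = P.shape[1]
    G = P.T @ P
    W = np.zeros((k, k))
    iu = np.triu_indices(k, 1)
    W[iu] = 2.0 * G[iu] / (1.0 - G[iu] ** 2)
    W = W + W.T           # encodes the (E_ij + E_ji) terms of Eq. (16)
    return P @ W

rng = np.random.default_rng(6)
d, k = 7, 5
P = rng.standard_normal((d, k))
P /= np.linalg.norm(P, axis=0)

# Central finite difference along a random direction E vs. the analytic gradient
eps = 1e-6
E = rng.standard_normal((d, k))
fd = (barrier(P + eps * E) - barrier(P - eps * E)) / (2 * eps)
an = np.sum(barrier_grad(P) * E)
```

The directional derivatives agree to high precision, which confirms that the sign in (7) and the weights in (16) are consistent.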
Before presenting our experiments, we briefly discuss the choice of the parameters in our formulation. Considering the high coherence among the images or image patches, we prefer a dictionary with low redundancy, that is, $k \le 2m$ for $D \in \mathbb{R}^{m \times k}$. For the parameters $(\lambda_1, \lambda_2)$ in (9), we put an emphasis on sparse solutions and choose $\lambda_2 \in (0, \frac{\lambda_1}{10})$, as proposed in [14]. The parameters $\mu_1, \mu_2$ in (8) can be tuned via cross-validation.

The CMU Multi-PIE faces [17] and MNIST handwritten digit databases are used as the benchmark datasets for image compression, reconstruction, and 2D visualization in our experiments. In order to evaluate the proposed method on DR and reconstruction, we compare it with CS using a random Gaussian sensing matrix [2] (Gaussian CS) and with robust principal component analysis (RPCA) [18]. In Figures 1 and 2, 5000 images are randomly chosen for training $D$, and 500 images are randomly taken from the remaining database for testing. We first reduce the dimensionality from $m = 1024$ (PIE) and $m = 784$ (MNIST) to $d = 16$, and then recover the images using Gaussian CS, RPCA, and the proposed DRCDL and CCDL, respectively. Figures 1 and 2(e) demonstrate that the proposed methods perform much better on signal reconstruction than Gaussian CS and RPCA.

Fig. 1. (a) PIE data; (b) to (d): recovery of the reduced data from d = 16 using RPCA, Gaussian CS, and DRCDL, respectively. The PSNR is 23.01 dB, 25.28 dB, and 31.12 dB.

In Figure 2, we apply PCA to the original data and to the reduced data, respectively, to obtain a 2D visualization. Figures 2(b), 2(c), and 2(d) show that learning directly in the compressed domain is feasible. Compared to Gaussian CS, our proposed methods (DRCDL and CCDL) exhibit more stable and competitive PCA results, even in a very low-dimensional compressed domain, i.e. $d = 10$. Figure 3 shows the results of image compression and recovery on a single image, Lena.
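The PSNR values reported in Figures 1 to 3 follow the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. A minimal helper is sketched below; the image size, peak value, and noise level are illustrative choices, not the paper's data.

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(7)
ref = rng.integers(0, 256, size=(64, 64)).astype(float)
noisy = ref + rng.normal(0.0, 8.0, size=ref.shape)  # additive Gaussian noise
val = psnr(ref, noisy)  # roughly 30 dB for sigma = 8 on an 8-bit scale
```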
Compared to Bayesian CS (BCS) [19] and JPEG2000, our proposed methods DRCDL and CCDL exhibit strong performance when the input data is heavily corrupted.

Fig. 2. (a) PCA on the original data; (b) PCA on the compressed data (DRCDL, d = 10); (c) PCA on the compressed data (CCDL, d = 10); (d) Gaussian CS on the original data, d = 10. (e) Reconstruction of MNIST images from d = 16 to m = 784. From top to bottom: original data, and recovered images using Gaussian CS, Gaussian CS with a K-SVD dictionary, DRCDL, and CCDL.

Fig. 3. Recovery performance on Lena with a compression rate η = 32, with 33 training images used for learning the dictionary. (b) is the corrupted image with PSNR = 16.12 dB; (c) to (f) are the recovered images using DRCDL, CCDL, JPEG2000, and BCS. The PSNR (dB) is 26.52, 26.44, and 22.28, respectively.

5 Conclusions

This paper proposed a coupled dictionary learning approach, called DRCDL, to achieve the task of invertible nonlinear DR. Following the Johnson-Lindenstrauss (JL) lemma in the process of DR, we developed a joint dictionary learning method that preserves the distance information of the high-dimensional data. Our experiments on a single image, digits, and facial images verified our idea. The proposed model is flexible and can be extended to other settings and applications.
References

1. Van der Maaten, L.J., Postma, E.O., van den Herik, H.J.: Dimensionality reduction: A comparative review. Journal of Machine Learning Research 10 (2009)
2. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4) (2006)
3. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11) (2006)
4. Hawe, S., Seibert, M., Kleinsteuber, M.: Separable dictionary learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2013)
5. Gleichman, S., Eldar, Y.C.: Blind compressed sensing. IEEE Transactions on Information Theory 57(10) (2011)
6. Duarte-Carvajalino, J.M., Sapiro, G.: Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization. IEEE Transactions on Image Processing 18(7) (2009)
7. Elad, M.: Optimized projections for compressed sensing. IEEE Transactions on Signal Processing 55(12) (2007)
8. Calderbank, R., Jafarpour, S., Schapire, R.: Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical report, Computer Science, Princeton University (2009)
9. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse representations. In: Curves and Surfaces. Volume 6920, Springer (2010)
10. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984)
11. Kim, H., Park, H., Zha, H.: Distance preserving dimension reduction for manifold learning. In: SDM, SIAM (2007)
12. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry property for random matrices. Constructive Approximation 28(3) (2008)
13. Elad, M.: Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer (2010)
14. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2) (2005)
15. Mairal, J., Bach, F., Ponce, J.: Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4) (2012)
16. Wei, X., Shen, H., Kleinsteuber, M.: An adaptive dictionary learning approach for modeling dynamical textures. In: Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2014)
17. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression (PIE) database. In: Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FG), IEEE (2002)
18. De la Torre, F., Black, M.J.: Robust principal component analysis for computer vision. In: Eighth IEEE International Conference on Computer Vision (ICCV). Volume 1, IEEE (2001)
19. Ji, S., Xue, Y., Carin, L.: Bayesian compressive sensing. IEEE Transactions on Signal Processing 56(6) (2008)
More informationWIRELESS sensor networks (WSNs) have attracted
On the Benefit of using ight Frames for Robust Data ransmission and Compressive Data Gathering in Wireless Sensor Networks Wei Chen, Miguel R. D. Rodrigues and Ian J. Wassell Computer Laboratory, University
More informationLinear Algebra & Geometry why is linear algebra useful in computer vision?
Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia
More informationc Springer, Reprinted with permission.
Zhijian Yuan and Erkki Oja. A FastICA Algorithm for Non-negative Independent Component Analysis. In Puntonet, Carlos G.; Prieto, Alberto (Eds.), Proceedings of the Fifth International Symposium on Independent
More informationA tutorial on sparse modeling. Outline:
A tutorial on sparse modeling. Outline: 1. Why? 2. What? 3. How. 4. no really, why? Sparse modeling is a component in many state of the art signal processing and machine learning tasks. image processing
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationFast Hard Thresholding with Nesterov s Gradient Method
Fast Hard Thresholding with Nesterov s Gradient Method Volkan Cevher Idiap Research Institute Ecole Polytechnique Federale de ausanne volkan.cevher@epfl.ch Sina Jafarpour Department of Computer Science
More informationConvolutional Dictionary Learning and Feature Design
1 Convolutional Dictionary Learning and Feature Design Lawrence Carin Duke University 16 September 214 1 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional
More informationLearning Bound for Parameter Transfer Learning
Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationData dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
More informationStrengthened Sobolev inequalities for a random subspace of functions
Strengthened Sobolev inequalities for a random subspace of functions Rachel Ward University of Texas at Austin April 2013 2 Discrete Sobolev inequalities Proposition (Sobolev inequality for discrete images)
More informationImproving the Incoherence of a Learned Dictionary via Rank Shrinkage
Improving the Incoherence of a Learned Dictionary via Rank Shrinkage Shashanka Ubaru, Abd-Krim Seghouane 2 and Yousef Saad Department of Computer Science and Engineering, University of Minnesota, Twin
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed
More informationL26: Advanced dimensionality reduction
L26: Advanced dimensionality reduction The snapshot CA approach Oriented rincipal Components Analysis Non-linear dimensionality reduction (manifold learning) ISOMA Locally Linear Embedding CSCE 666 attern
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationMaximum variance formulation
12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal
More informationSparse molecular image representation
Sparse molecular image representation Sofia Karygianni a, Pascal Frossard a a Ecole Polytechnique Fédérale de Lausanne (EPFL), Signal Processing Laboratory (LTS4), CH-115, Lausanne, Switzerland Abstract
More informationThe Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will
More informationCompressed Sensing and Related Learning Problems
Compressed Sensing and Related Learning Problems Yingzhen Li Dept. of Mathematics, Sun Yat-sen University Advisor: Prof. Haizhang Zhang Advisor: Prof. Haizhang Zhang 1 / Overview Overview Background Compressed
More informationNon-linear Dimensionality Reduction
Non-linear Dimensionality Reduction CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Laplacian Eigenmaps Locally Linear Embedding (LLE)
More informationAutomatic Subspace Learning via Principal Coefficients Embedding
IEEE TRANSACTIONS ON CYBERNETICS 1 Automatic Subspace Learning via Principal Coefficients Embedding Xi Peng, Jiwen Lu, Senior Member, IEEE, Zhang Yi, Fellow, IEEE and Rui Yan, Member, IEEE, arxiv:1411.4419v5
More informationTHe linear decomposition of data using a few elements
1 Task-Driven Dictionary Learning Julien Mairal, Francis Bach, and Jean Ponce arxiv:1009.5358v2 [stat.ml] 9 Sep 2013 Abstract Modeling data with linear combinations of a few elements from a learned dictionary
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationAdaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries
Adaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries Jarvis Haupt University of Minnesota Department of Electrical and Computer Engineering Supported by Motivation New Agile Sensing
More informationCompressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles
Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional
More informationarxiv: v3 [cs.lg] 6 Sep 2017
An Efficient Method for Robust Projection Matrix Design Tao Hong a, Zhihui Zhu b a Department of Computer Science, Technion - Israel Institute of Technology, Haifa, 32000, Israel. b Department of Electrical
More informationCS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)
CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis
More informationRui ZHANG Song LI. Department of Mathematics, Zhejiang University, Hangzhou , P. R. China
Acta Mathematica Sinica, English Series May, 015, Vol. 31, No. 5, pp. 755 766 Published online: April 15, 015 DOI: 10.1007/s10114-015-434-4 Http://www.ActaMath.com Acta Mathematica Sinica, English Series
More informationMachine Learning for Signal Processing Sparse and Overcomplete Representations
Machine Learning for Signal Processing Sparse and Overcomplete Representations Abelino Jimenez (slides from Bhiksha Raj and Sourish Chaudhuri) Oct 1, 217 1 So far Weights Data Basis Data Independent ICA
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationUniqueness Conditions for A Class of l 0 -Minimization Problems
Uniqueness Conditions for A Class of l 0 -Minimization Problems Chunlei Xu and Yun-Bin Zhao October, 03, Revised January 04 Abstract. We consider a class of l 0 -minimization problems, which is to search
More informationClassification of handwritten digits using supervised locally linear embedding algorithm and support vector machine
Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the
More informationNonnegative Matrix Factorization Clustering on Multiple Manifolds
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Nonnegative Matrix Factorization Clustering on Multiple Manifolds Bin Shen, Luo Si Department of Computer Science,
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationPHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN
PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION A Thesis by MELTEM APAYDIN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the
More informationON THE STABILITY OF DEEP NETWORKS
ON THE STABILITY OF DEEP NETWORKS AND THEIR RELATIONSHIP TO COMPRESSED SENSING AND METRIC LEARNING RAJA GIRYES AND GUILLERMO SAPIRO DUKE UNIVERSITY Mathematics of Deep Learning International Conference
More information235 Final exam review questions
5 Final exam review questions Paul Hacking December 4, 0 () Let A be an n n matrix and T : R n R n, T (x) = Ax the linear transformation with matrix A. What does it mean to say that a vector v R n is an
More informationIndependent Component Analysis (ICA)
Independent Component Analysis (ICA) Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationNonlinear Dimensionality Reduction
Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011 Recap: Linear Dimensionality Reduction Linear Dimensionality Reduction: Based on a linear projection of the
More informationSignal Recovery from Permuted Observations
EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,
More information