Sample Complexity of Bayesian Optimal Dictionary Learning


Ayaka Sakata and Yoshiyuki Kabashima
Dept. of Computational Intelligence & Systems Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan
Email: ayaka@sp.dis.titech.ac.jp, kaba@dis.titech.ac.jp

Abstract: We consider the learning problem of identifying a dictionary matrix D ∈ R^{M×N} from a sample set of M-dimensional vectors Y ∈ R^{M×P} = N^{-1/2} DX, where X ∈ R^{N×P} is a sparse matrix whose density of non-zero entries is 0 < ρ < 1. In particular, we focus on the minimum sample size P_c (sample complexity) necessary for perfectly identifying D under the optimal learning scheme when D and X are independently generated from certain distributions. By using the replica method of statistical mechanics, we show that P_c ∼ O(N) holds as long as α = M/N > ρ is satisfied in the limit N → ∞. Our analysis also implies that the posterior distribution given Y is condensed only at the correct dictionary D when the compression rate α is greater than a certain critical value α_M(ρ). This suggests that belief propagation may allow us to learn D with a low computational complexity using O(N) samples.

I. INTRODUCTION

The concept of sparse representations has recently attracted considerable attention from various fields in which the number of measurements is limited. Many real-world signals such as natural images are represented sparsely in Fourier/wavelet domains; in other words, many components vanish or are negligibly small in amplitude when the signals are represented by Fourier/wavelet bases. This empirical property is exploited in the signal recovery paradigm of compressed sensing (CS), thereby enabling the recovery of sparse signals from far fewer measurements than the number required by the Nyquist–Shannon sampling theorem [1], [2], [3], [4].

In signal processing techniques that exploit sparsity, signals are generally assumed to be described as linear combinations of a few dictionary atoms. Therefore, the effectiveness of this approach depends strongly on the choice of the dictionary, under which the objective signals appear sparse. A method for choosing an appropriate dictionary for sparse representation is dictionary learning (DL), whereby the dictionary is constructed through a learning process from an available set of P training samples [5], [6], [7], [8]. Ambiguity of the learned dictionary is fatal in signal/data analysis after learning. Therefore, an important issue is the estimation of the sample complexity, i.e., the sample size P_c necessary for correct identification of the dictionary.

In a seminal work, Aharon et al. showed that when the training set Y ∈ R^{M×P} is generated by a dictionary D ∈ R^{M×N} and a sparse matrix X ∈ R^{N×P} (planted solution) as Y = DX, one can perfectly learn these if P > P_c = (k+1)(N choose k) and k is sufficiently small, where k is the number of non-zero elements in each column of X [9]. Unfortunately, this bound becomes exponentially large in N for k ∼ O(N), which motivates us to improve the estimation. A recent study has shown that almost all dictionaries under the uniform measure are learnable with P_c ∼ O(M) samples when k ln M ∼ O(√N) [10]. However, the fact that the number of unknown variables, NM + NP, and the number of known variables, MP, are balanced with each other at P ∼ O(N) when M ∼ O(N) implies the possibility of DL with only O(N) training samples. To examine this possibility, in this study we evaluate the sample complexity of the optimal learning scheme defined for a given probabilistic model of dictionary learning.
In a previous study, the authors assessed the sample complexity of a naive learning scheme: min_{D,X} ‖Y − DX‖² subj. to ‖X‖_0 ≤ NPρ (0 < ρ < 1), where ‖A‖ indicates the Frobenius norm of A, ‖X‖_0 is the number of nonzero elements in X, and D is enforced to be normalized appropriately. They used the replica method of statistical mechanics and found that P_c ∼ O(N) holds when α = M/N is greater than a certain critical value α_naive(ρ) > ρ [11]. However, the smallest possible P_c that can be obtained for α < α_naive(ρ) has not been clarified thus far. In this study, we show that P_c ∼ O(N) holds in the entire region of α > ρ for the optimal learning scheme.

II. PROBLEM SETUP

Let us suppose the following scenario of dictionary learning. Planted solutions, an M × N dictionary matrix D ∈ R^{M×N} and an N × P sparse matrix X ∈ R^{N×P}, are independently generated from prior distributions P(D) and P_ρ(X), respectively, where P(D) is the uniform distribution over an appropriate support and

P_ρ(X) = ∏_{i,l} P_ρ(x_{il}) = ∏_{i,l} { (1 − ρ) δ(x_{il}) + ρ f(x_{il}) }.   (1)

The rate of non-zero elements in X is given by ρ ∈ [0, 1], and the distribution function f(x) does not have a finite probability mass at the origin. The set of training samples Y ∈ R^{M×P}, whose column vectors correspond to the training samples, is assumed to be given by the planted solutions as

Y = (1/√N) DX,   (2)

where the factor 1/√N is introduced for convenience in taking the large-system limit. A learner is required to infer D and X from Y. We impose the normalization constraint Σ_{μ=1}^M D_{μi}² = ‖(D)_i‖² = M for each column i = 1, …, N to avoid the ambiguity of the product DX = (DA)(A^{-1}X) for diagonal matrices A with positive diagonal entries. In addition, we introduce two other constraints: i) the values of Σ_{μ=1}^M D_{μi} are set to be positive, and ii) the columns of D are lined up in descending order of the absolute value of Σ_{μ=1}^M D_{μi}, so that the ambiguities of simultaneous permutation and/or multiplication of the same signs for columns in D and rows in X are removed. (One could choose different constraints as long as these trivial ambiguities of the column order and the signs are resolved.) In the following, the uniform prior P(D) is assumed to be defined on the support that satisfies all of these constraints. Our aim is to evaluate the minimum value of the sample size P required for perfectly identifying D and X.
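As a concrete illustration of this generative setup, the following Python sketch (ours, not part of the paper) draws a planted pair (D, X) satisfying the constraints above, with f taken to be a Gaussian density as in Sec. IV below, and forms the training set Y of (2); the routine names, parameter values, and the use of NumPy are our own choices.

# Illustrative sketch (not from the paper): generate planted D, X and Y = D X / sqrt(N)
# for the model of Sec. II, with a Bernoulli-Gaussian sparse matrix X.
import numpy as np

def generate_instance(N=200, alpha=0.5, gamma=3.0, rho=0.2, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    M, P = int(alpha * N), int(gamma * N)

    # Dictionary: each column scaled so that ||(D)_i||^2 = M, with positive column
    # sums, and columns sorted in descending order of |sum_mu D_mu,i| (Sec. II).
    D = rng.standard_normal((M, N))
    D *= np.sqrt(M) / np.linalg.norm(D, axis=0)        # column normalization
    D *= np.sign(np.sum(D, axis=0))                    # make column sums positive
    D = D[:, np.argsort(-np.abs(np.sum(D, axis=0)))]   # order by |column sum|

    # Sparse matrix: entries are zero w.p. 1-rho, Gaussian N(0, sigma^2) w.p. rho.
    X = rng.standard_normal((N, P)) * sigma
    X *= rng.random((N, P)) < rho

    Y = D @ X / np.sqrt(N)                             # training samples, Eq. (2)
    return D, X, Y

D, X, Y = generate_instance()
print(Y.shape)  # (M, P) = (alpha*N, gamma*N)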

III. BAYESIAN OPTIMAL LEARNING

For a mathematical formulation of our problem, let us denote the estimates of D and X yielded by an arbitrary learning scheme as D̂(Y) and X̂(Y). We evaluate the efficiency of the scheme using the mean squared errors (per element)

MSE_D(D̂(·)) = Σ_{Y,D,X} P_ρ(D, X, Y) (1/(NM)) ‖D − D̂(Y)‖²,   (3)
MSE_X(X̂(·)) = Σ_{Y,D,X} P_ρ(D, X, Y) (1/(NP)) ‖X − X̂(Y)‖²,   (4)

where A·B = Σ_{i,j} A_{ij} B_{ij} represents the inner product between two matrices A and B of the same dimensions. We impose the normalization constraint Σ_{μ=1}^M (D̂(Y))_{μi}² = M for each column index i = 1, …, N in order to avoid the ambiguity of the product D̂(Y)X̂(Y) = (D̂(Y)A)(A^{-1}X̂(Y)) for an arbitrary invertible diagonal matrix A. The joint distribution of D, X, and Y is given by

P_ρ(D, X, Y) = δ(Y − (1/√N) DX) P(D) P_ρ(X).   (5)

Perfect identification of D and X is characterized by MSE_D = MSE_X = 0. The following theorem offers a useful basis for answering our question.

Theorem 1. For an arbitrary learning scheme, (3) and (4) are bounded from below as

MSE_D(D̂(·)) ≥ 2 ( 1 − (1/(N√M)) Σ_Y P_ρ(Y) Σ_{i=1}^N ‖(⟨D⟩_ρ)_i‖ ),   (6)
MSE_X(X̂(·)) ≥ (1/(NP)) Σ_Y P_ρ(Y) ( ⟨X·X⟩_ρ − ⟨X⟩_ρ·⟨X⟩_ρ ),   (7)

where P_ρ(Y) = Σ_{D,X} P_ρ(D, X, Y), and ⟨·⟩_ρ denotes the average over D and X according to the posterior distribution of D and X under a given Y, P_ρ(D, X|Y) = P_ρ(D, X, Y)/P_ρ(Y). The equalities hold when the estimates satisfy

(D̂_opt(Y))_i = √M (⟨D⟩_ρ)_i / ‖(⟨D⟩_ρ)_i‖,   X̂_opt(Y) = ⟨X⟩_ρ,   (8)

where (A)_i denotes the i-th column vector of matrix A. We refer to (8) as the Bayesian optimal learning scheme [12].

Proof: By applying the Cauchy–Schwarz inequality and the minimization of the quadratic function to MSE_D and MSE_X, respectively, one can obtain (6)–(8) after inserting the expression Σ_{D,X} x P_ρ(D, X, Y) = P_ρ(Y) Σ_{D,X} x P_ρ(D, X|Y) = P_ρ(Y) ⟨x⟩_ρ for x = D and X into (3) and (4).

This theorem guarantees that when the setup of dictionary learning is characterized by P(D) and P_ρ(X), the estimates of (8) offer the best possible learning performance in the sense that (3) and (4) are minimized. As the perfect identification of D and X is characterized by MSE_D = MSE_X = 0, our purpose is fulfilled by analyzing the performance of the Bayesian optimal learning scheme of (8).
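The estimators of (8) require only the posterior means ⟨D⟩_ρ and ⟨X⟩_ρ. The following sketch (again ours; the paper specifies no sampler) indicates how D̂_opt, X̂_opt and the per-element MSEs would be formed from a set of posterior samples, which are assumed to be supplied by some external Monte Carlo scheme that is not shown here.

# Illustrative sketch (not from the paper): Bayesian optimal estimates of Eq. (8)
# computed from posterior samples {(D^(s), X^(s))}, assumed to come from an
# external sampler targeting P_rho(D, X | Y); the sampler itself is not shown.
import numpy as np

def bayes_optimal_estimates(D_samples, X_samples):
    """D_samples: array of shape (S, M, N); X_samples: array of shape (S, N, P)."""
    D_mean = D_samples.mean(axis=0)        # <D>_rho
    X_mean = X_samples.mean(axis=0)        # <X>_rho
    M = D_mean.shape[0]
    # Rescale each column of <D>_rho to squared norm M, as in Eq. (8).
    D_hat = D_mean * np.sqrt(M) / np.linalg.norm(D_mean, axis=0)
    return D_hat, X_mean                   # (D_hat_opt, X_hat_opt)

def mse_per_element(A, A_hat):
    return np.mean((A - A_hat) ** 2)       # MSE_D or MSE_X per element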

IV. ANALYSIS

For simplicity of calculation, let us set f(x_{il}) to be the Gaussian distribution with mean 0 and variance σ²; σ² is set to unity in all numerical calculations shown later. For generality, we consider cases in which the sparsity assumed by the learner, denoted by θ, can differ from the actual value ρ. When θ ≠ ρ, the estimates are given by (D̂(Y))_i = √M (⟨D⟩_θ)_i / ‖(⟨D⟩_θ)_i‖ and X̂(Y) = ⟨X⟩_θ instead of (8). To evaluate MSE_D and MSE_X, we need to evaluate the macroscopic quantities

q_D = (1/(NM)) [ ⟨D⟩_θ·⟨D⟩_θ ]_Y,   m_D = (1/(NM)) [ ⟨D⟩_θ·⟨D⟩_ρ ]_Y,   (9)
Q_X = (1/(NP)) [ ⟨X·X⟩_θ ]_Y,   (10)
q_X = (1/(NP)) [ ⟨X⟩_θ·⟨X⟩_θ ]_Y,   m_X = (1/(NP)) [ ⟨X⟩_θ·⟨X⟩_ρ ]_Y,   (11)

where [·]_Y = Σ_Y P_ρ(Y)(·). Note that (9)–(11) yield MSE_D = 2(1 − m_D/√q_D) and MSE_X = ρσ² + q_X − 2m_X. (A naive computation would require assessing the column-wise overlap C_{D,i} = M^{-1/2} [ (⟨D⟩_ρ)_i·(⟨D⟩_θ)_i / ‖(⟨D⟩_θ)_i‖ ]_Y for each column index i = 1, …, N; however, the law of large numbers and statistical uniformity allow the simplification C_{D,i} ≃ m_D/√q_D as M = αN tends to infinity.) Unfortunately, evaluating these quantities is intrinsically difficult because it generally requires averaging, with respect to Y, the quantity

( Σ_{D¹,X¹,D²,X²} P_θ(Y, D¹, X¹) P_θ(Y, D², X²) (D¹·D²) ) / ( Σ_{D¹,X¹,D²,X²} P_θ(Y, D¹, X¹) P_θ(Y, D², X²) )   ( = ⟨D⟩_θ·⟨D⟩_θ ),   (12)

which includes summations over exponentially many terms in the denominator. One promising approach for avoiding this difficulty involves multiplying P_θⁿ(Y) = ( Σ_{D,X} P_θ(Y, D, X) )ⁿ (n = 1, 2, 3, …) inside the operation [·]_Y to cancel the denominator of (12), which makes the evaluation of a modified average

q_D(n) = [ P_θⁿ(Y) ⟨D⟩_θ·⟨D⟩_θ ]_Y / ( NM [ P_θⁿ(Y) ]_Y )   (13)

feasible via the saddle point assessment of [P_θⁿ(Y)]_Y for N, M, P → ∞, keeping α = M/N and γ = P/N of O(1). Furthermore, the resulting expression is likely to hold for n ∈ R as well. Therefore, we evaluate q_D using the formula q_D = lim_{n→0} q_D(n) with this expression, and similarly for m_D, Q_X, q_X, and m_X. This procedure is often termed the replica method [13], [14].

Under the replica symmetric ansatz, which assumes that the dominant saddle point in the evaluation is invariant under any permutation of the replica indices a = 1, 2, …, n, the assessment is reduced to evaluating the extremum of the free entropy (density) function

φ = γ [ (1/2)(Q̂_X Q_X + q̂_X q_X) − m̂_X m_X + ⟨ln Ξ⟩ ]
  + α [ (1/2)(Q̂_D + q̂_D q_D) − m̂_D m_D − (1/2) ln(1 + Q̂_D + q̂_D) + (q̂_D + m̂_D²)/(2(1 + Q̂_D + q̂_D)) ]
  − (αγ/2) [ (q_D q_X − 2 m_D m_X + ρσ²)/(Q_X − q_D q_X) + ln(Q_X − q_D q_X) ],   (14)

where σ̂² = 1 + (Q̂_X + q̂_X)σ²,

Ξ = (1 − θ) + (θ/σ̂) exp( σ² (√q̂_X z + m̂_X x)² / (2σ̂²) ),   (15)

and ⟨·⟩ denotes the average over x and z, which are distributed according to P_ρ(x) and a Gaussian distribution with mean zero and variance unity, respectively. The extremized value of φ, φ* (when multiple extrema exist, the maximum value among them should be chosen as long as no consistency condition is violated), is related to the average log-likelihood (density) of Y as (1/N²) Σ_Y P_ρ(Y) ln P_θ(Y) = lim_{n→0} (∂/∂n) { (1/N²) ln [P_θⁿ(Y)]_Y } = φ* + constant.

[Fig. 1. γ-dependence of (a) MSE_D and (b) MSE_X at α = 0.5, ρ = 0.2. Plots for θ = ρ, 0.8ρ, and 0.5ρ show the optimality of the correct parameter choice θ = ρ. Broken curves represent locally unstable branches, which are thermodynamically irrelevant.]

In Fig. 1, (a) MSE_D and (b) MSE_X for θ = ρ = 0.2 are plotted versus γ together with those for θ = 0.8ρ and θ = 0.5ρ. At θ = ρ, MSE_D and MSE_X of the thermodynamically relevant branches take the minimum values in the entire γ region, while a branch of solution characterized by MSE_D = MSE_X = 0 is shared by the three parameter sets. This supports the optimality of the correct parameter choice θ = ρ, and therefore we hereafter focus our analysis on this case to estimate the minimum value of γ for perfect learning, MSE_D = MSE_X = 0.

At θ = ρ, the relationships m_D = q_D, m_X = q_X, and Q_X = ρσ² hold from (9)–(11), and the extremum problem is reduced to

q_D = q̂_D / (1 + q̂_D),   q_X = ⟨ ( ∂ ln Ξ / ∂h )² ⟩,   (16)

where h = q̂_X x + √q̂_X z denotes the argument combination appearing in (15) at θ = ρ (so that ∂ ln Ξ/∂h is the posterior mean of an element of X under the corresponding effective scalar channel), and q̂_D and q̂_X are given by

q̂_X = α q_D / (ρσ² − q_D q_X),   q̂_D = γ q_X / (ρσ² − q_D q_X).   (17)

The other variables are provided as Q̂_D = 0, Q̂_X = 0, m̂_X = q̂_X, and m̂_D = q̂_D.
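For the matched case θ = ρ, (16) and (17) can be solved numerically by a damped fixed-point iteration. The sketch below is our own illustration of such an iteration, not the authors' code: the scalar average in (16) is rewritten in the equivalent form of a Gaussian channel r = x + z/√(q̂_X) of noise variance 1/q̂_X, the average over (x, z) is estimated by Monte Carlo, and a small regularizer τ is kept in the denominators of (17) in the spirit of the δ-function width introduced in (18) below; all parameter values are illustrative.

# Illustrative sketch (ours, not the authors' code): fixed-point iteration of the
# theta = rho saddle-point equations (16)-(17). The effective scalar channel for X
# is written as r = x + z/sqrt(qhat_X), i.e. a Gaussian channel of noise variance
# 1/qhat_X, and the average over (x, z) is estimated by Monte Carlo. A small tau
# regularizes the denominators of (17), playing the role of the delta-function
# width of (18) below; its value is our own choice.
import numpy as np

def bg_posterior_mean(r, delta, rho, sigma2):
    # Posterior mean of x under the prior (1-rho)*delta(x) + rho*N(0, sigma2)
    # for the observation r = x + w, w ~ N(0, delta).
    s = sigma2 + delta
    w0 = (1 - rho) * np.exp(-r**2 / (2 * delta)) / np.sqrt(delta)
    w1 = rho * np.exp(-r**2 / (2 * s)) / np.sqrt(s)
    return (w1 / (w0 + w1)) * (sigma2 / s) * r

def solve_saddle_point(alpha, gamma, rho, sigma2=1.0, tau=1e-8,
                       n_mc=100000, n_iter=1000, damp=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_mc) * np.sqrt(sigma2)
    x *= rng.random(n_mc) < rho                      # Bernoulli-Gaussian samples of x
    z = rng.standard_normal(n_mc)
    qD, qX = 1e-3, 1e-3 * rho * sigma2               # small but non-zero initialization
    for _ in range(n_iter):
        den = tau + max(rho * sigma2 - qD * qX, 0.0)
        qhatX = alpha * qD / den                     # Eq. (17)
        qhatD = gamma * qX / den
        r = x + z / np.sqrt(qhatX)                   # effective Gaussian channel for X
        qX_new = np.mean(bg_posterior_mean(r, 1.0 / qhatX, rho, sigma2) ** 2)
        qD_new = qhatD / (1.0 + qhatD)               # Eq. (16)
        qD = damp * qD_new + (1 - damp) * qD
        qX = damp * qX_new + (1 - damp) * qX
    return qD, qX

# Depending on gamma, the iteration approaches the failure, middle, or success
# solution discussed in Sec. V (success: qD -> 1, qX -> rho*sigma2).
for gam in (1.5, 3.0, 4.5):
    print(gam, solve_saddle_point(alpha=0.5, gamma=gam, rho=0.2))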
V. RESULTS

A. Actual solutions

[Fig. 2. γ-dependence of q_D (left axis) and q_X (right axis) for α = 0.5 and ρ = 0.2.]

Fig. 2 plots q_D and q_X versus γ for α = 0.5 and ρ = 0.2. As shown in the figure, the solutions of q_D and q_X given by (16) are classified into three types: {q_D = 1, q_X = ρσ²}, {q_D = q_X = 0}, and {0 < q_D < 1, 0 < q_X < ρσ²}. The first one yields MSE_D = MSE_X = 0, indicating the correct identification of D and X, and hence we name it the success solution. The second one is referred to as the failure solution because it yields MSE_D = 2 and MSE_X = ρσ², which indicates complete failure of the learning of D and X. The third one yields finite MSE_D and MSE_X, 0 < MSE_D < 2 and 0 < MSE_X < ρσ², and we term it the middle solution.

1) Success solution: When the expression

δ(Y − (1/√N) DX) = lim_{τ→0+} (2πτ)^{−MP/2} exp( −‖Y − (1/√N) DX‖² / (2τ) )   (18)

is used, the success solution of q_D and q_X behaves as (ρσ² − q_X)/τ = χ_X and (1 − q_D)/τ = χ_D, while q̂_X and q̂_D scale as q̂_X = θ̂_X/τ and q̂_D = θ̂_D/τ. By substituting these into the equations of q_D and q_X, they are given by

χ_X = ργ/g,   χ_D = α/(ρσ²g),   θ̂_X = ρ/χ_X,   θ̂_D = 1/χ_D,   (19)

where g = (α − ρ)γ − α. χ_X and χ_D must be positive by definition, and hence the success solution exists for

γ > α/(α − ρ) ≡ γ_S   (20)

only when α > ρ.

2) Failure solution: The failure solution q_D = q_X = 0 appears at γ < γ_F as a locally stable solution. When q_D and q_X are sufficiently small, they are expressed as

q_X = ρσ² α q_D + O(q²),   q_D = (γ/(ρσ²)) q_X + O(q²),   (21)

where O(q²) denotes the terms of second and higher order with respect to q_D and q_X. These expressions indicate that when

γ > α^{-1} ≡ γ_F,   (22)

the local stability of q_D = q_X = 0 is lost. As shown in Fig. 2, the failure solution vanishes at γ_F = 2.0 for α = 0.5.

3) Middle solution: We define γ_M as the value above which the middle solution with 0 < q_D < 1 and 0 < q_X < ρσ² disappears, denoted by a vertical line in Fig. 2; it is given as γ_M = 3.84… for the parameter choice (α, ρ) = (0.5, 0.2). The value of γ_M depends on (α, ρ), as shown in Fig. 3. This figure indicates that γ_M diverges at ρ_M = 0.37… for α = 0.5. The relation between ρ_M and α, denoted as ρ_M(α) (or α_M(ρ)), generally accords with the critical condition at which belief propagation (BP)-based signal recovery using the correct prior starts to be involved with multiple fixed points for the signal reconstruction problem of compressed sensing [16], in which the correct dictionary D is provided in advance.

[Fig. 3. γ_M versus ρ for α = 0.5.]

[Fig. 4. γ-dependence of φ for α = 0.5 and ρ = 0.2. φ_S diverges positively for γ > γ_S = α/(α − ρ) = 1.666….]

BP is also a potential algorithm for practically achieving the learning performance predicted by the current analysis, because it is known that the macroscopic behavior theoretically analyzed by the replica method can be confirmed experimentally for single instances by BP in many other systems [16], [17], [18]. The fact that only the success solution exists for γ > γ_M implies that one may be able to perfectly identify the correct dictionary D with a computational cost of polynomial order in N by utilizing BP, without being trapped by other locally stable solutions, for α > α_M(ρ).

B. Free entropy density

There are three extrema of the free entropy (density), φ_S, φ_F, and φ_M, corresponding to the success solution, the failure solution, and the middle solution, respectively. Among them, the thermodynamically dominant solution, which provides the correct evaluations of q_D and q_X, is the one for which the value of the free entropy is the largest. Fig. 4 plots φ_S, φ_F, and φ_M versus γ for α = 0.5 and ρ = 0.2, where γ_S = 1.666… and γ_F = 2.0.
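Since γ_S and γ_F are elementary functions of (α, ρ), the thresholds quoted here, and the resulting O(N) sample complexity discussed below, can be evaluated directly; the following minimal helper (ours, not the authors') does so.

# Illustrative helper (not from the paper): thresholds of Sec. V and the resulting
# sample complexity P_c = gamma_S * N of the Bayesian optimal scheme (see Sec. V-B).
def gamma_S(alpha, rho):
    assert alpha > rho, "success solution requires alpha > rho"
    return alpha / (alpha - rho)        # Eq. (20)

def gamma_F(alpha):
    return 1.0 / alpha                  # Eq. (22)

def sample_complexity(N, alpha, rho):
    return gamma_S(alpha, rho) * N      # P_c = gamma_S * N, which is O(N) for alpha > rho

print(gamma_S(0.5, 0.2), gamma_F(0.5), sample_complexity(1000, 0.5, 0.2))
# -> 1.666..., 2.0, 1666.6...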

In particular, the functional forms of φ_S and φ_F are given by

φ_S = lim_{τ→0+} [ (g/2){ ln( g/(ρσ²τ) ) − 1 } − (αγ/2) ln(αγ) + (α/2) ln α ] + (γρ/2)(ln γ − ln σ²) − γ H(ρ),   (23)
φ_F = −(1/2) { αγ (1 + ln(ρσ²)) + α },   (24)

where τ → 0+ originates from the expression (18) and H(ρ) = −(1 − ρ) ln(1 − ρ) − ρ ln ρ. Further, (23) shows that φ_S diverges positively for g = (α − ρ)γ − α > 0, which guarantees that the success solution is always thermodynamically dominant for γ > γ_S = α/(α − ρ), as φ of the other solutions remains finite. This leads to the conclusion that the sample complexity of the Bayesian optimal learning is P_c = γ_S N, which is guaranteed to be O(N) as long as α > ρ. This is the main consequence of the present study.

[Fig. 5. Phase diagram on the α–ρ plane. The dashed curve in the area of (I) is the result of [11].]

Fig. 5 plots the phase diagram in the α–ρ plane. The union of the regions (I) and (II) represents the condition under which the sample complexity P_c is O(N), while the full curve forming the upper boundary of (II) denotes α_M(ρ), above which BP is expected to work as an efficient learning algorithm. Dictionary learning is impossible in the region (III). The critical condition α_naive(ρ), above which the naive learning scheme of [11] can perfectly identify the planted solution from O(N) samples, is drawn as the dashed curve for comparison. The considerable difference between α_naive(ρ) and ρ (or even α_M(ρ)) indicates the significance of using adequate knowledge of probabilistic models in dictionary learning.

VI. SUMMARY

In summary, we assessed the minimum sample size required for perfectly identifying a planted solution in dictionary learning (DL). For this assessment, we derived the optimal learning scheme defined for a given probabilistic model of DL following the framework of Bayesian inference. Unfortunately, actually evaluating the performance of the Bayesian optimal learning scheme involves an intrinsic technical difficulty. To resolve this difficulty, we resorted to the replica method of statistical mechanics, and we showed that the sample complexity can be reduced to O(N) as long as the compression rate α is greater than the density ρ of non-zero elements of the sparse matrix. This indicates that the performance of the naive learning scheme examined in a previous study [11] can be improved significantly by utilizing the knowledge of adequate probabilistic models in DL. It was also shown that when α is greater than a certain critical value α_M(ρ), the macroscopic state corresponding to perfect identification of the planted solution becomes the unique candidate for the thermodynamically dominant state. This suggests that one may be able to learn the planted solution with a computational complexity of polynomial order in N by utilizing belief propagation for α > α_M(ρ).

Note added: After completing this study, the authors became aware that [19] presents results similar to those presented in this paper, where an algorithm for dictionary learning/calibration is independently developed on the basis of belief propagation.

ACKNOWLEDGMENT

This work was partially supported by a Grant-in-Aid for JSPS Fellows No. 3 4665 (AS), KAKENHI Nos. 33 and 398 (YK), and the JSPS Core-to-Core Program "Nonequilibrium dynamics of soft matter and information."

REFERENCES

[1] J.-L. Starck, F. Murtagh, and J. M. Fadili, Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity (Cambridge Univ. Press, New York, 2010).
[2] H. Nyquist, "Certain topics in telegraph transmission theory," Trans. AIEE 47 (2), pp. 617–644 (1928).
[3] D. L. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory 52 (4), pp. 1289–1306 (2006).
[4] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Trans. Inform. Theory 51 (12), pp. 4203–4215 (2005).
[5] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Res. 37 (23), pp. 3311–3325 (1997).
[6] R. Rubinstein, A. M. Bruckstein, and M. Elad, "Dictionaries for sparse representation modeling," Proc. IEEE 98 (6), pp. 1045–1057 (2010).
[7] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing (Springer-Verlag, New York, 2010).
[8] S. Gleichman and Y. C. Eldar, "Blind compressed sensing," IEEE Trans. Inform. Theory 57, pp. 6958–6975 (2011).
[9] M. Aharon, M. Elad, and A. M. Bruckstein, "On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them," Linear Algebra and its Applications 416 (1), pp. 48–67 (2006).
[10] D. Vainsencher, S. Mannor, and A. M. Bruckstein, "The sample complexity of dictionary learning," Journal of Machine Learning Research 12, pp. 3259–3281 (2011).
[11] A. Sakata and Y. Kabashima, "Statistical mechanics of dictionary learning," arXiv:1203.6178.
[12] Y. Iba, "The Nishimori line and Bayesian statistics," J. Phys. A: Math. Gen. 32, pp. 3875–3888 (1999).
[13] M. Mézard, G. Parisi, and M. A. Virasoro, Spin Glass Theory and Beyond (World Scientific, 1987).
[14] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction (Oxford Univ. Press, 2001).
[15] M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford Univ. Press, Oxford, UK, 2009).

[16] F. Krzakala, M. Mézard, F. Sausset, Y. F. Sun, and L. Zdeborová, "Statistical-physics-based reconstruction in compressed sensing," Phys. Rev. X 2, 021005 (2012).
[17] D. J. Thouless, P. W. Anderson, and R. G. Palmer, "Solution of 'Solvable model of a spin glass'," Phil. Mag. 35 (3), pp. 593–601 (1977).
[18] Y. Kabashima, "A CDMA multiuser detection algorithm on the basis of belief propagation," J. Phys. A 36 (43), pp. 11111–11121 (2003).
[19] F. Krzakala, M. Mézard, and L. Zdeborová, "Phase diagram and approximate message passing for blind calibration and dictionary learning." Preprint received directly from the authors via private communication.