An iterative hard thresholding estimator for low rank matrix recovery. Alexandra Carpentier, based on joint work with Arlene K.Y. Kim. Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge. Workshop HDP-QPh, June 10th 2015.
Introduction Low rank matrix recovery in high dimension: relevant applications (in particular quantum tomography), and interesting theoretical challenges. This talk is based on joint work with Arlene K.Y. Kim, "An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution", available on arXiv (arXiv:1502.04654).
Outline The matrix recovery setting Setting Results In high dimension The problem Discussion Hard thresholding estimator The estimator Results Simulations
Setting Background and notations
Vector notations. Let p > 0.
⟨·,·⟩ : classical scalar product on C^p.
‖·‖_q, q ≥ 0 : l_q norm (semi-norm for q = 0) on C^p.
Matrix notations. Let d > 0.
⟨·,·⟩_tr : scalar product for d × d Hermitian matrices (A, B) : ⟨A, B⟩_tr = tr(A* B).
‖·‖_F : Frobenius norm, ‖A‖_F^2 = ⟨A, A⟩_tr = Σ_i λ_i^2, where the (λ_i^2)_i are the eigenvalues of A* A.
‖·‖_* : trace (or Schatten 1) norm, ‖A‖_* = Σ_i λ_i.
‖·‖_S : spectral (or Schatten ∞) norm, ‖A‖_S = sup_i λ_i.
Setting The matrix regression setting
For a parameter Θ and sensing matrices X^i of dimension d × d, one observes noisy data: for i ≤ n,
Y_i = tr((X^i)* Θ) + ε_i = ⟨X^i, Θ⟩_tr + ε_i,
where ε ∈ R^n is an i.i.d. noise vector.
Setting The matrix regression setting: objective
Let X be the linear operator such that
X(A) = (tr((X^i)* A))_{i ≤ n} = (⟨X^i, A⟩_tr)_{i ≤ n}.
Model: Y = X(Θ) + ε.
Problem: given (Y, X), reconstruct Θ. This is the matrix version of the linear regression problem.
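The model above can be sketched numerically. A minimal NumPy sketch, with hypothetical dimensions, representing the sensing operator X as a stack of d × d matrices (Θ is built as in the simulation section later):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 8, 2  # hypothetical sizes

# Rank-k parameter Theta (the construction used in the simulation section)
N = rng.standard_normal((d, k))
Theta = N @ N.T

# Sensing matrices X^i with i.i.d. N(0, 1) entries
Xs = rng.standard_normal((n, d, d))

def X_op(A):
    """The operator X : A -> (<X^i, A>_tr)_{i <= n} (real case)."""
    return np.tensordot(Xs, A, axes=([1, 2], [0, 1]))

eps = rng.standard_normal(n)  # i.i.d. noise vector
Y = X_op(Theta) + eps         # observed data: Y = X(Theta) + eps
```
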
Setting Application to quantum tomography
Source corresponds to Θ, measurement to X. See Gross (2011); Flammia et al. (2012); Kahn and Guta (2009); Guta et al. (2012); Barndorff-Nielsen et al. (2003); Butucea, Guta, Kypraios (2015); Alquier et al. (2013); Liu (2011); Gross et al. (2010) etc.
Results The least squares estimator
Model: Y = X(Θ) + ε. Most natural idea: least squares, i.e.
ˆΘ = arg min_T ‖X(T) − Y‖_2^2,
whose solution is the least squares estimator
ˆΘ = (X* X)^{-1} X*(Y),
where X*(y) = Σ_{i=1}^n y_i X^i is the adjoint of X.
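In the complete regime n ≥ d^2 this is just ordinary least squares on the vectorized design. A minimal sketch (hypothetical sizes and noise level), using a numerically stable solver rather than forming (X*X)^{-1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 6                        # complete regime: n >= d^2
Theta = rng.standard_normal((d, d))
Xs = rng.standard_normal((n, d, d))

# Vectorize: row i of A is vec(X^i), so X(T) = A @ vec(T)
A = Xs.reshape(n, d * d)
Y = A @ Theta.ravel() + 0.1 * rng.standard_normal(n)

# Least squares estimator (X*X)^{-1} X*(Y), computed via lstsq
Theta_hat = np.linalg.lstsq(A, Y, rcond=None)[0].reshape(d, d)
```
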
Results Properties of the least squares estimator
Under standard assumptions on the noise, and invertibility of the design (i.e. existence of (X* X)^{-1}), one has asymptotically
ˆΘ − Θ ≈ N(0, (X* X)^{-1}),
and, for sub-Gaussian (e.g. bounded) noise, with probability 1 − δ,
‖ˆΘ − Θ‖_F^2 ≤ C d^2 log(1/δ) / n,
which is minimax optimal over d × d matrices. These two results enable inference (estimation + confidence statements).
Outline The matrix recovery setting Setting Results In high dimension The problem Discussion Hard thresholding estimator The estimator Results Simulations
The problem Main problem
Crucial assumption: X* X invertible, i.e. the measurement system must be complete. We thus require n ≥ d^2.
Question: what if this is not the case, i.e. in the high dimensional setting where n ≪ d^2?
Answer: restrict the set of parameters and impose a condition on the design.
The problem Restriction on the set of parameters
Problem: if n < d^2, some parameters have the same image even in the absence of noise, so uniform reconstruction over all d × d matrices is impossible.
Solution: restrict the space of parameters. Natural restriction: low rank matrices. We write M(k) for the set of matrices of rank at most k.
The problem Design assumption
A good sampling scheme X satisfies the matrix Restricted Isometry Property (RIP):
sup_{T ∈ M(k)} | (1/n) ‖X(T)‖_2^2 − ‖T‖_F^2 | / ‖T‖_F^2 ≤ ε.
See e.g. Candès and Recht (2009); Candès and Tao (2010); Candès and Plan (2011); Gross (2011); Liu (2011) etc. Examples: random sub-Gaussian design, or random sampling from an incoherent basis (e.g. Pauli).
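For intuition, the RIP quantity can be probed numerically for a Gaussian design. The sketch below evaluates the deviation on a handful of random rank-k directions only, which is merely a Monte Carlo proxy for the supremum over M(k); sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 2000, 10, 2
Xs = rng.standard_normal((n, d, d))  # Gaussian design

# Deviation |(1/n)||X(T)||_2^2 - ||T||_F^2| on random rank-k directions,
# normalized so that ||T||_F = 1 (a proxy for the sup over M(k))
worst = 0.0
for _ in range(50):
    T = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
    T /= np.linalg.norm(T)
    XT = np.tensordot(Xs, T, axes=([1, 2], [0, 1]))
    worst = max(worst, abs(XT @ XT / n - 1.0))
```

For this design the deviation concentrates at the ~sqrt(1/n) scale, in line with the RIP bounds cited above.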
The problem Remark on the following pictures
M(k) is not easy to draw, so to illustrate the intuitions I will resort to the sparse linear regression model
Y = Xθ + ε,
where θ is k-sparse (the analogue of M(k) is the set of sparse vectors).
The problem First non-convex solution
If X satisfies the matrix RIP, the measurement system is complete for M(k). The estimator
ˆΘ_0 = arg min_{T ∈ M(k)} ‖X(T) − Y‖_2
satisfies, with probability larger than 1 − δ,
‖ˆΘ_0 − Θ‖_F^2 ≤ C kd log(1/δ) / n.
Problem: non-convex, horrible program.
The problem Convex relaxation
Current estimator: ˆΘ_0 = arg min_{T ∈ M(k)} ‖X(T) − Y‖_2. Problem: non-convex, horrible program.
Idea = convex relaxation:
ˆΘ = arg min_{T : ‖T‖_* ≤ b} ‖X(T) − Y‖_2,
or rather
ˆΘ = arg min_T ‖X(T) − Y‖_2^2 + λ ‖T‖_*.
The problem Convex relaxation
Matrix lasso: ˆΘ = arg min_T ‖X(T) − Y‖_2^2 + λ ‖T‖_*.
The problem Convex relaxation
Theorem. If the design satisfies the matrix RIP, and if
λ ≥ C √(d log(1/δ) / n),
then the matrix lasso satisfies, with probability larger than 1 − δ,
‖ˆΘ − Θ‖_F^2 ≤ C kd log(1/δ) / n.
See e.g. Recht, Fazel and Parrilo (2010); Candès and Plan (2011); Gross et al. (2010); Flammia et al. (2012); Koltchinskii et al. (2011) etc.
Discussion Questions
From a minimax perspective, the problem is solved. But:
How to implement the matrix lasso efficiently? Or rather, is there a computationally efficient estimator?
What is the precision of the estimate? Entry-wise results? Limiting distribution?
Discussion Implementation
ˆΘ is defined by an optimisation program, so it is computationally solvable in theory. But in practice? Projected gradient descent in the noisy case: Agarwal et al. (2012). Many works on this in the regression setting: Agarwal et al. (2012); Goldfarb and Ma (2011); Blumensath and Davies (2009); Tanner and Wei (2012), in particular hard thresholding in the noiseless case.
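As one illustration (not the algorithm analysed in this talk), the matrix lasso can be approximated by proximal gradient descent, where the proximal operator of λ‖·‖_* soft-thresholds singular values. A sketch with hypothetical constants, assuming a Gaussian design so that (1/n) X*X is close to the identity and a fixed small step size is valid:

```python
import numpy as np

def svt(M, tau):
    """Soft-threshold the singular values of M at level tau
    (the proximal operator of tau * ||.||_*)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def matrix_lasso(Xs, Y, lam, n_iter=300, step=0.2):
    """Proximal gradient descent on (1/n)||X(T) - Y||_2^2 + lam * ||T||_*.
    The default step assumes (1/n) X*X is close to the identity."""
    n, d, _ = Xs.shape
    T = np.zeros((d, d))
    for _ in range(n_iter):
        resid = np.tensordot(Xs, T, axes=([1, 2], [0, 1])) - Y
        grad = (2.0 / n) * np.tensordot(resid, Xs, axes=(0, 0))
        T = svt(T - step * grad, step * lam)
    return T
```

Because the last operation is a singular value soft-thresholding, the iterate is exactly low rank, which is one practical appeal of this scheme.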
Discussion Implementation
[Figure: mean squared error landscapes for projected gradient descent (on the actual constraint) vs. hard thresholding (on the convex relaxation of the constraint).]
Discussion Uncertainty quantification
Uncertainty quantification? Results only for the linear regression model... Global vs. local results (‖·‖_F vs. ‖·‖_S). De-biased estimators with explicit limiting distribution: van de Geer et al. (2014); Javanmard and Montanari (2014); Zhang and Zhang (2014).
Remark: a minimax confidence set depending on the sparsity does not exist in the linear regression model (Nickl and van de Geer, 2014)... but exists in the matrix recovery model (Carpentier, Eisert, Gross and Nickl, 2015)!
Discussion Uncertainty quantification
Constrained solution ˆΘ: no obvious limiting distribution. Projected solution ˆΘ + (1/n) X*(Y − X(ˆΘ)): Gaussian limiting distribution.
Outline The matrix recovery setting Setting Results In high dimension The problem Discussion Hard thresholding estimator The estimator Results Simulations
The estimator Prerequisites
Let K be an upper bound on the rank of the parameter, i.e. Θ ∈ M(K). Assume that X satisfies the matrix RIP:
sup_{T ∈ M(2K)} | (1/n) ‖X(T)‖_2^2 − ‖T‖_F^2 | / ‖T‖_F^2 ≤ c_n(2K).
We will need: c_n(2K) √K < 1/4. For e.g. Gaussian design, or random Pauli design, we have (up to a log(d) factor for Pauli)
c_n(2K) ≲ √(Kd / n),
and so the condition is satisfied whenever d K^2 / n ≲ 1.
The estimator The hard thresholding estimator
Initial values for the estimator ˆΘ_0 and the threshold T_0:
ˆΘ_0 = 0 ∈ R^{d×d}, T_0 = B ∈ R_+.
Set now recursively, for r ∈ N, r ≥ 1,
T_r = 4 c_n(2K) √K T_{r−1} + C √(d log(1/δ) / n),
and
ˆΘ_r = [ ˆΘ_{r−1} + (1/n) X*( Y − X(ˆΘ_{r−1}) ) ]_{T_r},
where [M]_T is the matrix where all singular values of M smaller than T are thresholded (set to 0).
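The recursion above translates directly into NumPy. A sketch where the constants B, C and the RIP constant c_n(2K) are taken as inputs, since their theoretical values involve unspecified constants:

```python
import numpy as np

def hard_threshold(M, T):
    """[M]_T: set to 0 all singular values of M smaller than T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.where(s >= T, s, 0.0)) @ Vt

def iht(Xs, Y, K, B, c_n, delta=0.05, C=2.0, n_iter=50):
    """Iterative hard thresholding as on this slide.
    K: rank upper bound; B: initial threshold; c_n: RIP constant c_n(2K)."""
    n, d, _ = Xs.shape
    Theta_hat = np.zeros((d, d))
    T = B
    for _ in range(n_iter):
        # Threshold recursion: contracts since 4 c_n(2K) sqrt(K) < 1
        T = 4.0 * c_n * np.sqrt(K) * T + C * np.sqrt(d * np.log(1 / delta) / n)
        resid = Y - np.tensordot(Xs, Theta_hat, axes=([1, 2], [0, 1]))
        update = Theta_hat + np.tensordot(resid, Xs, axes=(0, 0)) / n
        Theta_hat = hard_threshold(update, T)
    return Theta_hat
```

Note the gradient step of 1 in the update, as emphasised in the interpretation that follows.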
The estimator Interpretations
Low rank projected gradient descent: a projected gradient descent on the set of low rank matrices. But the gradient step is 1 (so not really a gradient descent...). The gradient step of 1 is very important.
The estimator Interpretations
Application of a contracting operator. We have:
ˆΘ_r = [ ˆΘ_{r−1} + (1/n) X*( X(Θ − ˆΘ_{r−1}) + ε ) ]_{T_r}.
By the condition c_n(2K) √K < 1/4, on low rank matrices
‖ (1/n) X* X − Id ‖_S ≤ 1/4.
So the estimator is a multiple application of a spectral contraction (with thresholdings).
The estimator Interpretations
Taylor expansion of the inverse function. Least squares: (X* X)^{-1} X* Y. Problem: (X* X)^{-1} does not exist here. Taylor expansion at order r of (Id_d − (Id_d − (1/n) X* X))^{-1}:
L(r) = Σ_{m=0}^r (Id_d − (1/n) X* X)^m.
If the thresholding step is suppressed, the estimator we constructed is of the form (1/n) L(r) X* Y. Thresholding between each step controls the small eigenvalues.
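The claimed equivalence (without thresholding) can be checked numerically in the complete regime, where the Neumann series converges to the least squares solution. A sketch with hypothetical sizes, on the vectorized design:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 25                    # complete regime: here p = d^2 < n
A = rng.standard_normal((n, p))   # vectorized design, rows = vec(X^i)
Y = rng.standard_normal(n)

# L(r) = sum_{m=0}^{r} (Id - (1/n) X*X)^m, truncated Neumann series
M = np.eye(p) - A.T @ A / n
L, term = np.eye(p), np.eye(p)
for _ in range(60):
    term = term @ M
    L = L + term

# (1/n) L(r) X* Y approaches the least squares solution as r grows
neumann = L @ A.T @ Y / n
least_squares = np.linalg.lstsq(A, Y, rcond=None)[0]
```

In the high dimensional regime the series no longer converges globally, which is exactly why the thresholding between steps is needed.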
Results General result
Theorem. Let r ≥ O(log(n)). With probability larger than 1 − δ, for any k ≤ K/2,
sup_{Θ ∈ M(k), ‖Θ‖_F ≤ B} ‖Θ − ˆΘ_r‖_S ≤ C_1 √(d log(1/δ) / n),
and also
sup_{Θ ∈ M(k), ‖Θ‖_F ≤ B} rank(ˆΘ_r) ≤ Ck.
Note: minimax optimal results in Frobenius and trace norm follow immediately.
Results Discussion
Bounds are minimax-optimal. The spectral norm bound yields bounds on the entry-wise risk. Adaptive estimator: no need to know k. Results also hold in the linear regression setting, with an estimator in the same vein.
Results Result in Gaussian design
Theorem. Assume that the entries of the design matrices X^i are i.i.d. Gaussian with mean 0 and variance 1. Then, writing
Z := (1/n) X*(ε) and Δ := √n (ˆΘ_r − Θ) − (1/√n) X* X (ˆΘ_r − Θ),
the projected estimator satisfies
√n ( ˆΘ_r + (1/n) X*(Y − X(ˆΘ_r)) − Θ ) = Δ + √n Z,
where √n Z | X ~ N(0, (1/n) X* X). Assuming that max(K^2 d, Kd log(d)) = o(n), we have Δ = o_P(1).
Results Discussion
Limiting distribution ⇒ entrywise confidence sets. Bound on the risk of each entry of order 1/n (Gaussian concentration). Results also hold in the linear regression setting.
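A sketch of how the theorem yields entrywise confidence intervals: de-bias a pilot estimate with the projection step, then use that each entry of the corrected estimator is approximately N(Θ_ab, σ^2/n). The noise level σ is assumed known here for simplicity (an assumption, not part of the theorem statement):

```python
import numpy as np

def debias(Xs, Y, pilot):
    """One-step correction: pilot + (1/n) X*(Y - X(pilot))."""
    n = Xs.shape[0]
    resid = Y - np.tensordot(Xs, pilot, axes=([1, 2], [0, 1]))
    return pilot + np.tensordot(resid, Xs, axes=(0, 0)) / n

def entrywise_ci(Xs, Y, pilot, sigma, z=1.96):
    """~95% CI per entry, using sqrt(n)(debiased - Theta)_ab ~ N(0, sigma^2)."""
    n = Xs.shape[0]
    center = debias(Xs, Y, pilot)
    half = z * sigma / np.sqrt(n)
    return center - half, center + half
```

In practice the pilot would be the hard thresholding estimator ˆΘ_r; the CI width shrinks at the parametric rate 1/√n per entry.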
Simulations Simulations
Simulations for Gaussian design. Gaussian uncorrelated noise ε ~ N(0, I_n). Parameter Θ of rank k:
Θ = Σ_{l=1}^k N_l N_l^T, where N_l ~ N(0, I_d).
Computing the estimator, and entrywise confidence intervals.
Simulations
Figure: Logarithm of the rescaled Frobenius risk of the estimate, for (p, k) ∈ {64, 100} × {3, 10}, as a function of n.
Simulations
Figure: Logarithm of the rescaled CI length, for (p, k) ∈ {64, 100} × {3, 10}, as a function of n.
Simulations
Figure: Coverage probability of the CIs, for (p, k) ∈ {64, 100} × {3, 10}, as a function of n.
Conclusion
We have: minimax-optimal bounds, in particular for the spectral norm; an estimator that is very fast to compute; a limiting distribution in the case of a Gaussian design.
We want: a limiting distribution for non-Gaussian designs? Sharper bounds on entries for non-Gaussian designs? Results with the true quantum model?
Thank you!
References I Agarwal, A., S. Negahban, and M. J. Wainwright (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist. 40(5), 2452 2482. P. Alquier, C. Butucea, M. Hebiri, K. Meziani, T. Morimae (2013). Rank penalized estimation of a quantum system. Physical Reviews A 88, 032113. O. E. Barndorff-Nielsen, R. D. Gill, P. E. Jupp (2003). On quantum statistical inference (with discussion). J. R. Statist. Soc. B. 65(5), 775 816. Blumensath, T. and M. E. Davies (2009). Iterative hard thresholding for compressed sensing. Appl. Computat. Har. Analysis 27(3), 265 274.
References II Butucea, Guta, Kypraios (2015). Spectral thresholding quantum tomography for low rank states. arxiv:1504.08295. Candès, E. and T. Tao (2010). The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory 56, 2053 2080. Candès, E. J. and Y. Plan (2011). Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. Inform. Theory 57(4), 2342 2359. Candès, E. and B. Recht (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717 772.
References III Carpentier, A., Eisert, J., Gross, D. and Nickl, R. (2015). Uncertainty quantification for matrix compressed sensing and quantum tomography problems. arXiv:1504.03234. Flammia, S. T., D. Gross, Y.-K. Liu, and J. Eisert (2012). Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New J. Phys. 14(9), 095022. van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42(3), 1166–1202. Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3), 471–501.
References IV Goldfarb, D. and S. Ma (2011). Convergence of fixed-point continuation algorithms for matrix rank minimization. Found. Comput. Math. 11, 183 210. Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory 57(3), 1548 1566. Gross, D., Y.-K. Liu, S. T Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Rev. letters, 105(15):150401, 2010. M. Guta, T. Kypraios and I. Dryden. Rank based model selection for multiple ions quantum tomography. New Journal of Physics, 14:105002, 2012.
References V H. Haffner et al. Scalable multiparticle entanglement of trapped ions. Nature, 438:643-646, 2005. Javanmard, A. and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15(1):2869 2909, 2014. J. Kahn and M. Guta. Local asymptotic normality for finite dimensional quantum systems. Commun. Math. Phys., 289 597-652 (2009). Koltchinskii, V., K. Lounici, and A. B. Tsybakov (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39(5), 2302 2329. Liu, Y.K. (2011). Universal low-rank matrix recovery from Pauli measurements. Adv. Neural Inf. Process. Syst., 1638 1646.
References VI Nickl, R. and van de Geer, S. (2014). Confidence sets in sparse regression. Ann. Statist. 41(6), 2852–2876. Tanner, J. and K. Wei (2012). Normalized iterative hard thresholding for matrix completion. SIAM J. Sci. Comput. 35, S104–S125. Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 76, 217–242.