Phase transitions in low-rank matrix estimation
Florent Krzakala
http://krzakala.github.io/lowramp/
arXiv:1503.00338 & 1507.03857
Thibault Lesieur, Lenka Zdeborová, Caterina De Bacco, Cris Moore, Jess Banks, Jiaming Xu, Jean Barbier, Nicolas Macris, Mohamad Dia
Low-rank matrix estimation
W = \frac{1}{\sqrt{n}} XX^T \quad\text{or}\quad W = \frac{1}{\sqrt{n}} UV^T, with X \in \mathbb{R}^{n\times r} (or U \in \mathbb{R}^{m\times r}, V \in \mathbb{R}^{n\times r}) unknown.
The matrix W has low (finite) rank r \ll n, m.
W is observed element-wise through a noisy channel P_{out}(y_{ij} \mid w_{ij}).
Goal: estimate the unknown X (or U & V) from the known Y.
Some examples
x_i^T = (0, \dots, 0, 1, 0, \dots, 0): an r-dimensional indicator variable (r = rank), with w_{ij} = x_i^T x_j / \sqrt{n}.
(Goal: estimate the unknown X from the known Y.)
Additive white Gaussian noise (sub-matrix localization):
P_{out}(y \mid w) = \frac{1}{\sqrt{2\pi\Delta}}\, e^{-\frac{(y-w)^2}{2\Delta}}
Dense stochastic block model (community detection):
P_{out}(y_{ij} = 1 \mid w_{ij}) = p_{out} + \mu w_{ij}, \qquad P_{out}(y_{ij} = 0 \mid w_{ij}) = 1 - p_{out} - \mu w_{ij}
so that p_{in} = p_{out} + \mu/\sqrt{n}, with effective noise \Delta = \frac{p_{out}(1 - p_{out})}{\mu^2}.
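A quick numerical illustration of this mapping (a minimal sketch; the values of n, p_out and p_in are made up, and only the two formulas above come from the slide):

    import numpy as np

    n = 20000                                   # assumed example values
    p_out = 0.5
    p_in = p_out + 2.0 / np.sqrt(n)             # p_in = p_out + mu/sqrt(n), here mu = 2

    mu = (p_in - p_out) * np.sqrt(n)            # strength of the rank-one perturbation
    delta = p_out * (1.0 - p_out) / mu**2       # effective Gaussian noise level
    print(f"mu = {mu:.2f}, effective Delta = {delta:.4f}")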
Clustering a mixture of Gaussians (additive white Gaussian noise):
u_i^T = (u_i^1, \dots, u_i^R) \in \mathbb{R}^R (the centers), \qquad v_i^T = (0, \dots, 0, 1, 0, \dots, 0) (the labels)
Y = W + \mathcal{N}(0, \Delta), \qquad W = \frac{1}{\sqrt{n}} UV^T
(Goal: estimate the unknown U and V from the known Y.)
Even more examples
Sparse PCA, robust PCA
Collaborative filtering (low-rank matrix completion)
1-bit collaborative filtering (like/dislike)
Bi-clustering
Planted clique (cf. Andrea Montanari yesterday)
etc.
QUESTIONS
Many interesting problems can be formulated this way.
Q1: When is it possible to perform a good factorization?
Q2: When is it algorithmically tractable?
Q3: How good are spectral methods (the main tool)?
Yesterday: Andrea taught us how to analyze AMP for such problems with the state-evolution approach.
Today: we continue in this direction and answer these questions in a probabilistic setting, with instances randomly generated from a model.
Probabilistic setting
Assume X is generated from P(X) = \prod_{i=1}^n P_X(x_i). The posterior distribution reads
P(X \mid Y) = \frac{1}{Z(Y)} \prod_{i=1}^n P_X(x_i) \prod_{i<j} P_{out}\!\left(y_{ij} \,\middle|\, \frac{x_i^T x_j}{\sqrt{n}}\right)
This is a graphical model in which the y_{ij} are pair-wise observations of the variables x_i, x_j.
MMSE estimator (minimal error): the marginal probabilities of the posterior.
EXACTLY solvable? When P_X and P_{out} are known, n \to \infty, r = O(1):
Approximate message passing = dense-factor-graph simplification of belief propagation, giving hopefully asymptotically exact marginals of the posterior distribution.
Exact analysis is possible with statistical-physics style methods.
Many rigorous proofs are possible when one works a bit harder.
OUTLINE
1) Message passing, state evolution, mutual information
2) Universality property
3) Main results
4) Sketch of proof
OUTLINE: 1) Message passing, state evolution, mutual information
AMP FOR THE GAUSSIAN ADDITIVE CHANNEL
w_{ij} = x_i^T x_j / \sqrt{n} \;\to\; \text{AWGN} \;\to\; y_{ij} = w_{ij} + \sqrt{\Delta}\,\xi_{ij}, \qquad S = Y / \Delta
B^t = \frac{1}{\sqrt{n}} S a^t - \frac{1}{n\Delta} \Big( \sum_i v_i^t \Big) a^{t-1}, \qquad A^t = \frac{1}{n\Delta} \sum_i a_i^t (a_i^t)^T
Mean and variance of the marginals:
a_i^{t+1} = f(A^t, B_i^t), \qquad v_i^{t+1} = \frac{\partial f}{\partial B}(A^t, B_i^t)
First written for rank r = 1 in Rangan, Fletcher '12.
The dependence on the prior enters only via a thresholding function f(A, B), given by the expectation of
P(x) = \frac{1}{Z(A, B)} P_X(x) \exp\!\left( B^T x - \frac{x^T A x}{2} \right)
Note: for x = \pm 1 these are nothing but the TAP equations for the Ising Sherrington-Kirkpatrick model (on the Nishimori line):
a_i^{t+1} = \tanh(B_i^t), \qquad v_i^{t+1} = 1 - (a_i^{t+1})^2
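A minimal sketch of these iterations in Python, for rank r = 1, the Gaussian channel, and the \pm 1 prior of the note above (so f(A, B) = tanh(B)). The size, noise level and initialization are assumptions of this sketch, not the reference lowramp implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    n, delta, n_iter = 2000, 0.5, 50            # assumed parameters

    # Planted +-1 signal, noisy symmetric observation Y = xx^T/sqrt(n) + sqrt(Delta)*noise.
    x = rng.choice([-1.0, 1.0], size=n)
    xi = rng.normal(size=(n, n))
    Y = np.outer(x, x) / np.sqrt(n) + np.sqrt(delta) * (xi + xi.T) / np.sqrt(2)

    S = Y / delta                               # Fisher-score matrix of the Gaussian channel
    a = 0.01 * rng.normal(size=n)               # small random initialization of the means
    a_old = np.zeros(n)
    v = np.ones(n)                              # initial variances, 1 - a^2 ~ 1

    for _ in range(n_iter):
        # B^t = S a^t/sqrt(n) - (sum_i v_i^t/(n Delta)) a^{t-1}  (Onsager correction)
        B = S @ a / np.sqrt(n) - (v.sum() / (n * delta)) * a_old
        a_old = a
        a = np.tanh(B)                          # f(A, B) = tanh(B) for the +-1 prior
        v = 1.0 - a**2

    print(f"overlap |M| = {abs(a @ x) / n:.3f}")  # up to the global sign flip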
State evolution
Single-letter characterization of AMP: with M^t = \frac{1}{n}\, \hat{x}^t_{AMP} \cdot x,
M^{t+1} = \mathbb{E}_{x,\xi}\!\left[ f\!\left( \frac{M^t}{\Delta}, \frac{M^t x}{\Delta} + \sqrt{\frac{M^t}{\Delta}}\,\xi \right) x \right]
It depends on the channel only through its Fisher information: a one-parameter family of channels has the same MMSE.
cf. yesterday's talk: rigorously proven in Montanari, Bayati '10 and Montanari, Deshpande '14.
Note: for x = \pm 1 this is nothing but the replica-symmetric equation for the Ising Sherrington-Kirkpatrick model (on the Nishimori line):
M^{t+1} = \int \mathcal{D}z \, \tanh\!\left( \frac{M^t}{\Delta} + \sqrt{\frac{M^t}{\Delta}}\, z \right)
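A sketch of this recursion, iterated with Gauss-Hermite quadrature; the seed M_0, the node count and the \Delta values are assumptions made for illustration:

    import numpy as np

    z, w = np.polynomial.hermite_e.hermegauss(101)   # nodes/weights for z ~ N(0, 1)
    w = w / w.sum()

    def se_fixed_point(delta, m0=1e-4, n_iter=500):
        # Iterate M <- int Dz tanh(M/Delta + sqrt(M/Delta) z) from a small seed.
        m = m0
        for _ in range(n_iter):
            snr = m / delta
            m = np.sum(w * np.tanh(snr + np.sqrt(snr) * z))
        return m

    for delta in [0.5, 0.9, 1.1]:
        print(delta, se_fixed_point(delta))          # M -> 0 above Delta_c = (E[x^2])^2 = 1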
Replica mutual information
Most quantities of interest can be computed from the mutual information (the free energy, for physicists):
I(X; Y) = \int dx\, dy\, P(x, y) \log \frac{P(x, y)}{P(x) P(y)}
The replica method predicts an asymptotic formula for i = I/n:
i_{RS}(m) = \frac{m^2 + (\mathbb{E}_x[x^2])^2}{4\Delta} - \mathbb{E}_{x,z}\!\left[ J\!\left( \frac{m}{\Delta}, \frac{m x}{\Delta} + \sqrt{\frac{m}{\Delta}}\, z \right) \right]
with
J(A, B) = \log \int dx\, p(x)\, e^{B x - A x^2 / 2}
FACT: the fixed points of the AMP state-evolution recursion are stationary points of i_{RS}(m).
CONJECTURE: \lim_{n\to\infty} I(X; Y)/n = \min_m i_{RS}(m)
Replica mutual information
[Figure: i_{RS}(E) as a function of E for four noise levels, \Delta = 0.0008, 0.0012, 0.00125 and 0.0015. Depending on \Delta, the local minimum reached by AMP is or is not the global one; the panels are labeled "good for AMP" or "bad for AMP" accordingly. Here E = MSE of the AMP vector estimate = \mathbb{E}[x^2] - m.]
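Curves of this kind can be scanned numerically. A sketch for the sparse prior P(x) = \rho \delta_{x,1} + (1-\rho) \delta_{x,0} appearing later in the talk, for which J(A, B) = \log((1-\rho) + \rho\, e^{B - A/2}) is available in closed form; the value \rho = 0.05 and the grid over E are assumptions of this sketch:

    import numpy as np

    z, w = np.polynomial.hermite_e.hermegauss(201)   # quadrature for z ~ N(0, 1)
    w = w / w.sum()

    def i_rs(E, rho, delta):
        # i_RS as above, with E[x^2] = rho and m = rho - E.
        m = rho - E
        A = m / delta
        h = np.sqrt(m / delta)
        J1 = np.sum(w * np.logaddexp(np.log(1 - rho), np.log(rho) + A + h * z - A / 2))
        J0 = np.sum(w * np.logaddexp(np.log(1 - rho), np.log(rho) + h * z - A / 2))
        return (m**2 + rho**2) / (4 * delta) - rho * J1 - (1 - rho) * J0

    rho = 0.05
    Es = np.linspace(1e-6, 0.02, 201)
    for delta in [0.0008, 0.0012, 0.00125, 0.0015]:
        vals = np.array([i_rs(E, rho, delta) for E in Es])
        k = vals.argmin()
        mins = [j for j in range(1, 200) if vals[j] < vals[j - 1] and vals[j] < vals[j + 1]]
        print(f"Delta = {delta}: global min at E = {Es[k]:.4f}, "
              f"interior local minima at {[round(Es[j], 4) for j in mins]}")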
Replica mutual information
Can we prove \lim_{n\to\infty} I(X; Y)/n = \min_m i_{RS}(m)? Yes!
* Proven for not-too-sparse PCA (Montanari & Deshpande '14) and for symmetric community detection (Montanari, Abbe & Deshpande '16).
* Proven for the planted SK model by Korada & Macris '10.
* FK, Xu & Zdeborová '16 (Guerra interpolation): upper bound \frac{I}{n} \le i_{RS}(m).
* Barbier, Dia, Macris, FK & Zdeborová '16 (spatial coupling + thermodynamic integration / I-MMSE): \lim_{n\to\infty} \frac{I}{n} \ge \min_m i_{RS}(m).
FORMULAS FOR UV^T
AMP:
B_u^{t+1} = \frac{1}{\sqrt{n}}\, S\, f_v(A_v^t, B_v^t) - \frac{1}{n\Delta} \big\langle f_v'(A_v^t, B_v^t) \big\rangle\, f_u(A_u^{t-1}, B_u^{t-1}), \qquad A_u^{t+1} = \frac{1}{n\Delta}\, f_v(A_v^t, B_v^t)\, f_v(A_v^t, B_v^t)^T
B_v^{t+1} = \frac{1}{\sqrt{n}}\, S^T f_u(A_u^t, B_u^t) - \frac{1}{n\Delta} \big\langle f_u'(A_u^t, B_u^t) \big\rangle\, f_v(A_v^{t-1}, B_v^{t-1}), \qquad A_v^{t+1} = \frac{1}{n\Delta}\, f_u(A_u^t, B_u^t)\, f_u(A_u^t, B_u^t)^T
State evolution: with M_u^t = \frac{1}{n}\, \hat{u}_{AMP} \cdot u and M_v^t = \frac{1}{m}\, \hat{v}_{AMP} \cdot v,
M_u^{t+1} = \mathbb{E}_{u,\xi}\!\left[ f_u\!\left( \frac{M_v^t}{\Delta}, \frac{M_v^t u}{\Delta} + \sqrt{\frac{M_v^t}{\Delta}}\,\xi \right) u \right], \qquad M_v^{t+1} = \mathbb{E}_{v,\xi}\!\left[ f_v\!\left( \frac{M_u^t}{\Delta}, \frac{M_u^t v}{\Delta} + \sqrt{\frac{M_u^t}{\Delta}}\,\xi \right) v \right]
Mutual information:
i_{RS}(m_u, m_v) = \frac{m_u m_v + \mathbb{E}[u^2]\,\mathbb{E}[v^2]}{2\Delta} - \mathbb{E}_{u,z}\!\left[ J_u\!\left( \frac{m_v}{\Delta}, \frac{m_v u}{\Delta} + \sqrt{\frac{m_v}{\Delta}}\, z \right) \right] - \mathbb{E}_{v,z}\!\left[ J_v\!\left( \frac{m_u}{\Delta}, \frac{m_u v}{\Delta} + \sqrt{\frac{m_u}{\Delta}}\, z \right) \right]
OUTLINE: 2) Universality property
WHAT ABOUT OTHER CHANNELS?
w_{ij} = x_i^T x_j / \sqrt{n} \;\to\; P_{out}(y_{ij} \mid w_{ij}) \;\to\; y_{ij}
CHANNEL UNIVERSALITY
w_{ij} = x_i^T x_j / \sqrt{n} \;\to\; P_{out}(y_{ij} \mid w_{ij}) \;\to\; y_{ij}
The dependence on the channel enters only via:
the Fisher-score matrix S_{ij} \equiv \left. \frac{\partial \log P_{out}(y_{ij} \mid w)}{\partial w} \right|_{y_{ij},\, w=0}
and the Fisher information \frac{1}{\Delta} \equiv \mathbb{E}_{P_{out}(y \mid w=0)}\!\left[ \left( \left. \frac{\partial \log P_{out}(y \mid w)}{\partial w} \right|_{y,\, w=0} \right)^2 \right]
Effective Gaussian channel with \tilde{y} = \Delta S:
w_{ij} = x_i^T x_j / \sqrt{n} \;\to\; \text{AWGN} \;\to\; \tilde{y}_{ij} = w_{ij} + \sqrt{\Delta}\,\xi_{ij}
CHANNEL UNIVERSALITY: a physicist's argument
W = \frac{1}{\sqrt{n}} XX^T, so each W_{ij} is a small quantity. Expanding the log-likelihood,
P_{out}(Y_{ij} \mid W_{ij}) = e^{\log P_{out}(Y_{ij} \mid W_{ij})} = P_{out}(Y_{ij} \mid 0)\, e^{W_{ij} S_{ij} + \frac{1}{2} W_{ij}^2 S'_{ij} + O(n^{-3/2})}
There are n^2 such terms:
P_{out}(Y \mid W) = P_{out}(Y \mid 0)\, e^{\sum_{i \le j} \left( W_{ij} S_{ij} + \frac{1}{2} W_{ij}^2 S'_{ij} \right) + O(\sqrt{n})}
By concentration (S'_{ij} \to -1/\Delta):
P_{out}(Y \mid W) \simeq P_{out}(Y \mid 0)\, e^{\sum_{i \le j} \left( W_{ij} S_{ij} - \frac{W_{ij}^2}{2\Delta} \right) + O(\sqrt{n})}
Effective Gaussian posterior probability:
P(W \mid Y) \propto P(W)\, e^{-\frac{1}{2\Delta} \sum_{i \le j} (\Delta S_{ij} - W_{ij})^2 + O(\sqrt{n})}
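A small sketch of this reduction for the dense SBM channel introduced earlier; the values of p_out and \mu are made up:

    import numpy as np

    p_out, mu = 0.3, 1.5                          # assumed example values

    def score(y):
        # d/dw log P_out(y | w) at w = 0 for the SBM channel.
        return mu / p_out if y == 1 else -mu / (1.0 - p_out)

    fisher = p_out * score(1)**2 + (1.0 - p_out) * score(0)**2
    delta = 1.0 / fisher
    print(delta, p_out * (1.0 - p_out) / mu**2)   # both give Delta = p(1-p)/mu^2

    # Given an observed adjacency matrix Y, the effective Gaussian data is Delta*S:
    Y = (np.random.default_rng(1).random((5, 5)) < p_out).astype(int)   # toy matrix
    S = np.where(Y == 1, mu / p_out, -mu / (1.0 - p_out))
    y_eff = delta * S    # feed this to the Gaussian-channel AMP of the earlier slide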
CHANNEL UNIVERSALITY
arXiv:1603.08447, FK, Xu & Zdeborová '16.
Conjectured in Lesieur, FK & Zdeborová '15.
For the rank-1 SBM, used & proven in Abbe, Deshpande, Montanari '16.
OUTLINE: 3) Main results
Minimum mean-square error
x_i^T = (0, \dots, 0, 1, 0, \dots, 0), \qquad W = \frac{1}{\sqrt{n}} XX^T, \qquad Y = P_{out}(W)
[Figure: MMSE as a function of the effective noise \Delta (inverse Fisher information), rank r = 2, size n = 20000.]
LARGER RANK?
Same setting, rank r \ge 4.
[Figure: MMSE as a function of the effective noise \Delta (inverse Fisher information) for rank r \ge 4.]
Two phase transitions for r \ge 4
EASY: \Delta < \Delta_c
HARD: \Delta_c < \Delta < \Delta_s
IMPOSSIBLE: \Delta > \Delta_s
Two phase transitions for r \ge 4
Easy phase: \Delta < \frac{1}{r^2} \equiv \Delta_c, i.e. p_{in} - p_{out} > \frac{r}{\sqrt{n}} \sqrt{p_{out}(1 - p_{out})}.
Same transition as for spectral methods (~BBP '05).
Hard phase: \frac{1}{r^2} < \Delta < \frac{1}{4 r \ln r} [1 + o_r(1)] \equiv \Delta_s (Decelle, Moore, FK, Zdeborová '11).
Conjectured to be hard for all polynomial algorithms.
Impossible phase: \Delta > \frac{1}{4 r \ln r} [1 + o_r(1)] = \Delta_s, i.e. p_{in} - p_{out} < \frac{2}{\sqrt{n}} \sqrt{r \log r}\, \sqrt{p_{out}(1 - p_{out})}.
Related to the Potts glass temperature (Kanter, Gross, Sompolinsky '85).
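The two thresholds, evaluated numerically at leading order (a minimal sketch; the 1 + o_r(1) correction is dropped, so small r is indicative at best):

    import numpy as np

    for r in [4, 8, 16, 64]:
        delta_c = 1.0 / r**2                      # algorithmic / spectral threshold
        delta_s = 1.0 / (4.0 * r * np.log(r))     # information-theoretic threshold
        print(f"r = {r:3d}   Delta_c = {delta_c:.5f}   Delta_s = {delta_s:.5f}")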
Non-symmetric community detection
Two communities, + of size \rho n and - of size (1 - \rho) n, with connection probabilities
p_{++} = p + \frac{\mu}{\sqrt{n}} \frac{1 - \rho}{\rho}, \qquad p_{--} = p + \frac{\mu}{\sqrt{n}} \frac{\rho}{1 - \rho}, \qquad p_{+-} = p - \frac{\mu}{\sqrt{n}}
P(x) = \rho\, \delta\!\left( x - \sqrt{\tfrac{1-\rho}{\rho}} \right) + (1 - \rho)\, \delta\!\left( x + \sqrt{\tfrac{\rho}{1-\rho}} \right), \qquad \Delta = \frac{p(1 - p)}{\mu^2}
[Figure, left: detectability transitions of AMP, the optimal estimator, and spectral methods as a function of \rho. Right: MSE(\Delta) at \rho = 0.05 for AMP and the optimal estimator.]
CLUSTERING MIXTURES OF GAUSSIANS IN HIGH DIMENSIONS
u_i^T = (u_i^1, \dots, u_i^R) \in \mathbb{R}^R (the centers), \qquad v_i^T = (0, \dots, 0, 1, 0, \dots, 0) (the labels)
Y = W + \mathcal{N}(0, \Delta), \qquad W = \frac{1}{\sqrt{n}} UV^T, \qquad \alpha = m/n = 1
Algorithm first proposed by Matsushita, Tanaka '13.
[Figure: clustering performance for 2 groups and for 20 groups.]
What about RANK-1 sparse PCA?
x_i \in \{0, 1\}, \qquad W = \frac{1}{\sqrt{n}} XX^T, \qquad P(x) = \rho\, \delta_{x,1} + (1 - \rho)\, \delta_{x,0}, \qquad Y = W + \mathcal{N}(0, \Delta)
Rank r = 1. For \rho > 0.05, Montanari and Deshpande showed that AMP achieves the optimal MMSE.
[Figure: phase diagram in the (\Delta, \rho) plane, with the line \rho = 0.05 marked.]
What about RANK-1 sparse PCA?
Same setting, rank r = 1: AMP achieves the optimal MMSE everywhere EXCEPT between the blue and red curves.
[Figure: phase diagram in the (\Delta, \rho) plane showing the AMP spinodal, the information-theoretic transition, and the second spinodal, together with the detectability line of spectral methods.]
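A state-evolution sketch locating these branches for the same sparse prior: iterate the recursion from the uninformative start M_0 = \rho^2 (the overlap of the prior-mean initialization, an assumption of this sketch) and from the informative start M_0 = \rho. The parameter values are illustrative, and whether the two branches differ depends on where (\rho, \Delta) falls in the diagram:

    import numpy as np

    z, w = np.polynomial.hermite_e.hermegauss(201)   # quadrature for z ~ N(0, 1)
    w = w / w.sum()

    def f(A, B, rho):
        # Posterior mean for P(x) = rho*delta_{x,1} + (1-rho)*delta_{x,0}.
        t = np.clip(np.log(rho / (1.0 - rho)) + B - A / 2.0, -500.0, 500.0)
        return 1.0 / (1.0 + np.exp(-t))

    def se(delta, rho, m0, n_iter=2000):
        # M <- rho * E_z f(M/Delta, M/Delta + sqrt(M/Delta) z): only x = 1 contributes.
        m = m0
        for _ in range(n_iter):
            snr = m / delta
            m = rho * np.sum(w * f(snr, snr + np.sqrt(snr) * z, rho))
        return m

    rho, delta = 0.05, 0.0012                        # assumed illustrative values
    m_amp = se(delta, rho, m0=rho**2)                # branch reached by AMP from scratch
    m_inf = se(delta, rho, m0=rho)                   # informative branch
    print(f"MSE, AMP branch: {rho - m_amp:.5f}   informative branch: {rho - m_inf:.5f}")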
Easy, hard and impossible inference
A very common phenomenon: information is hidden by metastability (a 1st-order transition, in physics terms).
[Diagram: as the amount of information in the measurements grows, the problem goes from impossible, to hard (between the information-theoretic threshold and the algorithmic one), to easy; known bounds and known algorithms delimit the two thresholds.]
The hard phase is also quantified in: planted constraint satisfaction, compressed sensing, the stochastic block model, dictionary learning, blind source separation, sparse PCA, error-correcting codes, the hidden clique problem, and others.
Conjecture: hard for all polynomial algorithms.
OUTLINE: 4) Sketch of proof
SKETCH OF THE PROOF
Result 1 [FK, Xu & Zdeborová]: Upper bound
\frac{I}{n} \le i_{RS}(m) = \frac{m^2 + (\mathbb{E}_x[x^2])^2}{4\Delta} - \mathbb{E}_{x,z}\!\left[ J\!\left( \frac{m}{\Delta}, \frac{m x}{\Delta} + \sqrt{\frac{m}{\Delta}}\, z \right) \right]
(note that this is true at any value of n, not only asymptotically)
Method: Guerra's interpolation + Nishimori identities.
SKETCH OF THE PROOF
Result 1 [FK, Xu & Zdeborová]: Upper bound
Interpolate between the factorization problem at t = 1 and a denoising problem at t = 0:
P_t(x \mid Y, D) = \frac{1}{Z_t}\, p_0(x)\, \exp\!\left[ -t \sum_{i \le j} \frac{\left( \frac{x_i x_j}{\sqrt{n}} - Y_{ij} \right)^2}{2\Delta} - (1 - t) \sum_i \frac{(D_i - x_i)^2}{2\Delta_D} \right]
Factorization: w_{ij} = x_i^T x_j / \sqrt{n} \to \mathcal{N}(0, \Delta/t) \to y_{ij}.
Denoising: x_i \to \mathcal{N}(0, \Delta_D/(1-t)) \to D_i, with \Delta_D = \Delta/m.
I_{t=1}(x; Y) = I_{t=0}(x; D) + \int_0^1 dt\, \partial_t I_t(x; Y, D)
Using Stein's lemma and the Nishimori identities, one can show that
\frac{I(x; Y)}{n} = i_{RS}(m) - \int_0^1 dt\, \frac{\left( m - \mathbb{E}_t[\langle x x' \rangle] \right)^2}{4\Delta} \le i_{RS}(m)
Remark: for estimation problems, Guerra's interpolation yields an upper bound, while in the usual case (e.g. CSPs) it yields a lower bound.
SKETCH OF THE PROOF
Result 2 [Barbier, Dia, Macris, FK & Zdeborová]: Converse (lower bound)
[Diagram: the \Delta axis from 0 up through \Delta_{AMP} to \Delta_{RS}.]
Step 1: for \Delta < \Delta_{AMP}, compare MSE_{AMP}(\Delta) with MMSE(\Delta).
I-MMSE theorem (Guo, Verdú; for physicists, this is just thermodynamics):
\frac{d\, i(\Delta)}{d\Delta^{-1}} = \frac{1}{4}\, MMSE(\Delta)
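A sketch of where the factor 1/4 comes from, under one convenient normalization of the matrix MMSE (the normalization below is an assumption made here for illustration):

    % Guo--Verdu for the Gaussian channel y_{ij} = w_{ij} + \sqrt{\Delta}\,\xi_{ij},
    % summed over the observed pairs (the signal-to-noise ratio is 1/\Delta):
    \[
      \frac{dI(X;Y)}{d\Delta^{-1}}
        = \frac{1}{2} \sum_{i<j} \mathbb{E}\!\left[ (w_{ij} - \mathbb{E}[w_{ij} \mid Y])^2 \right]
    \]
    % Dividing by n, and defining the per-variable matrix MMSE as
    % \mathrm{MMSE}(\Delta) := \frac{2}{n} \sum_{i<j} \mathbb{E}[(w_{ij} - \mathbb{E}[w_{ij} \mid Y])^2],
    % this is exactly the relation of the slide:
    \[
      \frac{d\, i(\Delta)}{d\Delta^{-1}} = \frac{1}{4}\, \mathrm{MMSE}(\Delta)
    \]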
Since MSE_{AMP}(\Delta) \ge MMSE(\Delta), and since for \Delta < \Delta_{AMP} the state evolution reaches the minimizer of i_{RS}:
\frac{d}{d\Delta^{-1}} \left[ \min_E i_{RS}(E; \Delta) \right] \ge \frac{d}{d\Delta^{-1}}\, i(\Delta)
Integrating from \Delta = 0:
\min_E i_{RS}(E; \Delta) - \min_E i_{RS}(E; 0) \le i(\Delta) - i(0), and both quantities at \Delta = 0 are equal to H(x), so
\min_E i_{RS}(E; \Delta) \le i(\Delta)
SKETCH OF THE PROOF
Result 2 [Barbier, Dia, Macris, FK & Zdeborová]: Converse (lower bound)
Step 2: prove that the true thermodynamic potential is analytic until \Delta_{RS}. This is the hardest part. Sketch:
1) Map the problem to its spatially coupled version.
2) Prove that both models have the same mutual information (with a Guerra-type interpolation).
3) Prove that the spatially coupled version is analytic until \Delta_{RS} (threshold saturation).
More in Nicolas Macris's talk this afternoon.
Step 3: for \Delta > \Delta_{RS}, AMP provides back the MMSE.
Use the I-MMSE theorem again, from \Delta_{RS} up to \Delta > \Delta_{RS}.
Conclusions
The MMSE of low-rank matrix estimation in a random setting is evaluated (proven rigorously for symmetric problems, open for the generic UV^T factorization).
A promising proof strategy for the replica formula in estimation problems (possible generalizations to larger ranks, tensors, etc.).
Dependence on the output channel only through its Fisher information: universality.
Two phase transitions: an algorithmic one and an information-theoretic one. Sometimes equal, but not always; AMP reaches the MMSE in a large region.
When the problem is not balanced (i.e. when the prior has a non-zero mean), AMP has a better detectability transition than spectral methods.
A promising algorithm with good convergence properties for applications beyond the random setting (example: Restricted Boltzmann Machines).
Open-source implementation available: http://krzakala.github.io/lowramp/