Sum-of-Squares Method, Tensor Decomposition, and Dictionary Learning
David Steurer (Cornell)
Approximation Algorithms and Hardness, Banff, August 2014
- original problem (hard to solve directly) vs. possible relaxations (tractable problems)
- for many problems (e.g., all UG-hard ones): better guarantees require stronger relaxations and more complex proxy solutions
- typical proxy solutions that stand in for actual solutions: flows (multicommodity, expander, electrical), metrics (negative-type, simplex), distributions (Gaussian, random spanning tree)
- sum-of-squares allows us to avoid proxy solutions and focus on actual solutions
- small price: need to use the low-degree lens
example: max cut
- combinatorial viewpoint: given an undirected graph G, bipartition the vertex set so as to cut as many edges as possible
- polynomial viewpoint: given the polynomial L_G(x) = Σ_{ij ∈ E(G)} (1/4)(x_i − x_j)^2, find its maximum over x ∈ {±1}^{V(G)} (the hypercube)
- how to certify an upper bound c on the maximum? decompose c − L_G = R_1^2 + ⋯ + R_t^2 + α_1(x_1^2 − 1) + ⋯ + α_n(x_n^2 − 1), i.e., as a sum of squares of polynomials plus a quadratic polynomial vanishing over the hypercube
- a polynomial is a sum of squares iff its coefficient (Gram) matrix can be chosen positive semidefinite, so searching for such a decomposition is an n^2-size semidefinite program
- spectral methods: add the restriction α_1 = ⋯ = α_n; the resulting bound is the largest-Laplacian-eigenvalue bound (see the sketch below)
- Goemans-Williamson: either the decomposition exists or the maximum is at least 0.878·c (still the best known approximation guarantee)
- degree-k version: allow the square polynomials and the multipliers of the x_i^2 − 1 to have degree up to k; this gives an n^{O(k)}-size semidefinite program
- the degree-n bound is exact (interpolate c − L_G as a degree-n polynomial over the hypercube)
- does the degree-n^{o(1)} bound improve over GW in the worst case? (it would refute the UGC)
- for every* candidate graph construction, the degree-16 bound improves over GW [Barak-Brandão-Harrow-Kelner-S.-Zhou]
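To make the spectral special case concrete, here is a minimal numpy sketch (my illustration, not from the slides). Since L_G(x) = (1/4) x^T L x for the graph Laplacian L and ‖x‖^2 = n on the hypercube, the restriction α_1 = ⋯ = α_n certifies max-cut(G) ≤ (n/4)·λ_max(L).

```python
# Minimal sketch (not from the slides): the degree-2 SOS bound with all
# multipliers alpha_i equal reduces to the eigenvalue bound
#   max-cut(G) <= (n/4) * lambda_max(L),
# since L_G(x) = (1/4) x^T L x and ||x||^2 = n on the hypercube {+-1}^n.
import itertools
import numpy as np

def laplacian(n, edges):
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return L

def spectral_bound(n, edges):
    L = laplacian(n, edges)
    return n / 4 * np.linalg.eigvalsh(L)[-1]   # largest Laplacian eigenvalue

def max_cut_brute_force(n, edges):
    best = 0
    for signs in itertools.product([-1, 1], repeat=n):
        best = max(best, sum((signs[i] - signs[j]) ** 2 / 4 for i, j in edges))
    return best

# 5-cycle: true max cut is 4, spectral bound is (5/4) * lambda_max ~ 4.52
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(max_cut_brute_force(5, edges), spectral_bound(5, edges))
```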
sum-of-squares (SOS) refutations
- multivariate polynomials P_1, …, P_m ∈ R[x_1, …, x_n]; system of equations E = {P_1 = 0, …, P_m = 0}
- when is E unsatisfiable over R^n? idea: derive "−1 ≥ 0" from E, an obviously unsatisfiable constraint
- SOS refutation of E: an identity Q_1·P_1 + ⋯ + Q_m·P_m + R_1^2 + ⋯ + R_t^2 = −1; the left-hand side is non-negative over E (a derived constraint), while the right-hand side is always negative
- intuitive proof system: many common inequalities have proofs of this form, e.g., Cauchy-Schwarz, Hölder, ℓ_p-triangle inequalities; the linear case corresponds to Gaussian elimination / Farkas' lemma
- Real Nullstellensatz [Artin, Krivine, Stengle]: every polynomial system is either satisfiable over R^n or SOS refutable
- SOS method [Shor, Nesterov, Parrilo, Lasserre]: an n^{O(k)}-time algorithm to find an SOS refutation with degrees ≤ k, if one exists (uses a semidefinite program; see the sketch below)
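The "uses SDP" step can be illustrated on a tiny case: deciding whether a univariate quartic is a sum of squares amounts to finding a positive-semidefinite Gram matrix for it. A minimal sketch with cvxpy (my example, not from the slides):

```python
# Minimal sketch (not from the slides): certify that
#   p(x) = x^4 + 3x^2 - 2x + 2
# is a sum of squares by finding a PSD Gram matrix Q with
#   p(x) = [1, x, x^2] Q [1, x, x^2]^T.
import cvxpy as cp
import numpy as np

coeffs = {0: 2.0, 1: -2.0, 2: 3.0, 3: 0.0, 4: 1.0}  # coefficient of x^k

Q = cp.Variable((3, 3), symmetric=True)
constraints = [Q >> 0]
# Match coefficients: entry (i, j) of Q contributes to the x^(i+j) term.
for k, c in coeffs.items():
    constraints.append(sum(Q[i, k - i] for i in range(3) if 0 <= k - i <= 2) == c)

cp.Problem(cp.Minimize(0), constraints).solve()   # pure feasibility SDP

# A factorization Q = V^T V gives the squares: p = sum_r (V_r . [1, x, x^2])^2.
w, U = np.linalg.eigh(Q.value)
V = np.diag(np.sqrt(np.clip(w, 0, None))) @ U.T
print(np.round(V, 3))   # rows are coefficient vectors of the square polynomials R_i
```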
what if no degree-d SOS refutation exists?
- degree-d SOS implication, written E ⊢_d P ≥ 0: an identity P = Q_1·P_1 + ⋯ + Q_m·P_m + R_1^2 + ⋯ + R_t^2 with deg(Q_i·P_i), deg(R_j^2) ≤ d
- if no degree-d refutation exists, there is a linear functional Ẽ on degree-d polynomials with Ẽ 1 = 1 and Ẽ P ≥ 0 whenever E ⊢_d P ≥ 0
- through the degree-d lens there is no appreciable difference between such an Ẽ and the degree-d moments of some distribution over solutions to E
- such an Ẽ is called a degree-d pseudo-distribution for E
SOS implication: hypercube triangle inequality
- suppose E = {x^2 = 1, y^2 = 1, z^2 = 1} and P = (x − y)^2 + (y − z)^2 − (x − z)^2
- claim: E ⊢_4 P ≥ 0
- as a function on {±1}^3, P equals 8 when x = z ≠ y and 0 otherwise
- therefore P = (1/2)(x + z)^2(x − y)^2 + (x^2 − 1)·Q_x + (y^2 − 1)·Q_y + (z^2 − 1)·Q_z for suitable polynomials Q_x, Q_y, Q_z; the first term is the square of the polynomial (1/√2)(x + z)(x − y) (checked numerically below)
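A quick numerical sanity check of this certificate (my sketch, not from the slides): two polynomials agree modulo the ideal generated by x^2 − 1, y^2 − 1, z^2 − 1 exactly when they agree on all of {±1}^3, so it suffices to compare P with the square term at the 8 hypercube points.

```python
# Minimal sketch (not from the slides): verify that
#   P = (x-y)^2 + (y-z)^2 - (x-z)^2   and   (1/2)(x+z)^2 (x-y)^2
# agree on {+-1}^3, i.e. differ by a polynomial vanishing on the hypercube.
import itertools

def P(x, y, z):
    return (x - y) ** 2 + (y - z) ** 2 - (x - z) ** 2

def square_term(x, y, z):
    return 0.5 * (x + z) ** 2 * (x - y) ** 2

for x, y, z in itertools.product([-1, 1], repeat=3):
    assert P(x, y, z) == square_term(x, y, z), (x, y, z)
    assert P(x, y, z) >= 0   # the claimed implication E |-_4 P >= 0
print("certificate checks out on all 8 hypercube points")
```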
SOS implication: univariate inequalities
- suppose P is univariate and P ≥ 0 over R; claim: ⊢_{deg P} P ≥ 0
- proof by induction on deg P: choose a minimizer α of P over R; then P = P(α) + (x − α)^2·P̃ for some polynomial P̃ with deg P̃ < deg P; here the constant P(α) ≥ 0 and (x − α)^2 are squares, P̃ ≥ 0 over R, and P̃ is a sum of squares by the induction hypothesis (worked example below)
- useful consequence: ⊢_{deg P · deg Q} P(Q(x_1, …, x_n)) ≥ 0 for any polynomial Q ∈ R[x_1, …, x_n]
- a concrete infinite family of global constraints (unclear how to obtain these with other methods)
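For concreteness, one small instance of this recursion (my example, not from the slides), shown as a worked equation:

```latex
% P(x) = x^4 - 2x^3 + 2x^2 - 2x + 1 is nonnegative over R.
% Its minimizer is \alpha = 1 with P(1) = 0, so the recursion gives
P(x) \;=\; \underbrace{P(1)}_{=\,0} \;+\; (x-1)^2 \,\underbrace{(x^2+1)}_{\tilde P(x)}
      \;=\; (x^2 - x)^2 + (x-1)^2 ,
% exhibiting P explicitly as a sum of squares.
```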
optimization (e.g., MAX CUT): maximize P_0 over P_1 = 0, …, P_m = 0
- v-vs-v approximation: given a satisfiable system {P_0 = v, P_1 = 0, …, P_m = 0}, find a solution to {P_0 ≥ v, P_1 = 0, …, P_m = 0}
- claim [Barak-Kelner-S. 14]: SOS reduces this approximation task, in time n^{O(k)}, to designing a degree-k combiner: given a subset / distribution {x} of solutions to {P_0 = v, P_1 = 0, …, P_m = 0}, represented by all of its degree-k moments (e.g., Ẽ_{x} x_1⋯x_k), output a single solution x to {P_0 ≥ v, P_1 = 0, …, P_m = 0}, using only properties of the moments / solutions that have degree-k SOS proofs
- proof: a degree-k combiner cannot distinguish between actual distributions and degree-k pseudo-distributions, so it works just as well on the pseudo-distribution computed by the SOS method
degree-2 combiner for MAX CUT
- input: a distribution {x} over cuts, represented by its degree-2 moments
- sample a Gaussian ξ with the same degree-2 moments as {x}
- output the single solution x' = sign(ξ)
- analysis: show that if (ξ_i, ξ_j) and (x_i, x_j) have the same degree-2 moments, then Pr[x'_i ≠ x'_j] ≥ 0.878·Pr[x_i ≠ x_j]; this statement involves only the two variables x_i, x_j and has a low-degree SOS proof
- the sampling step uses only that Cov(x) is positive semidefinite, which has a degree-2 SOS proof: v^T (Ẽ xx^T) v = Ẽ ⟨v, x⟩^2 ≥ 0 (end-to-end sketch below)
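A minimal end-to-end sketch of this combiner (my illustration with cvxpy/numpy, not from the slides): solve for degree-2 pseudo-moments Ẽ xx^T maximizing the expected cut, sample a Gaussian ξ with those second moments, and output sign(ξ).

```python
# Minimal sketch (not from the slides): the degree-2 combiner for MAX CUT.
# 1. Find second moments M = E[x x^T] maximizing the expected cut (an SDP).
# 2. Sample Gaussians xi with the same second moments and output sign(xi).
import cvxpy as cp
import numpy as np

def gw_max_cut(n, edges, trials=50):
    M = cp.Variable((n, n), PSD=True)
    cut_value = sum((M[i, i] + M[j, j] - 2 * M[i, j]) / 4 for i, j in edges)
    cp.Problem(cp.Maximize(cut_value), [cp.diag(M) == 1]).solve()

    # Sample Gaussians with covariance E[x x^T] (tiny ridge for stability).
    C = np.linalg.cholesky(M.value + 1e-6 * np.eye(n))
    best, best_cut = -1, None
    for _ in range(trials):
        x = np.sign(C @ np.random.randn(n))
        val = sum((x[i] - x[j]) ** 2 / 4 for i, j in edges)
        if val > best:
            best, best_cut = val, x
    return best, best_cut

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 5-cycle, max cut = 4
print(gw_max_cut(5, edges)[0])
```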
dictionary learning (aka sparse coding)
- applications: machine learning (feature extraction), neuroscience (model for the visual cortex); example: dictionaries for natural images [Olshausen-Field 96]
- model: data vectors y_t = A·x_t, where the dictionary A = (a_1 | ⋯ | a_m) is a linear transformation applied to sparse vectors x_t
- goal: given data vectors y_1, …, y_T, reconstruct A
- assumptions: a_1, …, a_m are unknown unit vectors in isotropic position; x_1, …, x_T are i.i.d. samples from an unknown "nice" distribution over sparse vectors (only small correlations between coordinates)
- previous works [Arora-Ge-Moitra, Agarwal-Anandkumar-Jain-Netrapalli-Tandon] assume incoherence; previous methods (local search) handle only very sparse vectors, up to about √n non-zeros
- sum-of-squares method [Barak-Kelner-S. 14]: full sparsity range, up to a constant fraction of non-zeros (quasipolynomial time for sparsity o(1); polynomial time for n^ε)
- theorem [Barak-Kelner-S. 14]: suppose m = O(n) and the correlations between coordinates are small enough; then degree-O(log n) SOS can recover a set A' ≈ {±a_1, …, ±a_m}, close in Hausdorff distance
algorithm behind the theorem:
1. construct the polynomial P_0(u) = (1/T) Σ_t ⟨y_t, u⟩^4 from the data vectors; one can show, with a low-degree SOS proof, that the global optima of P_0 correspond to ±a_1, …, ±a_m (but there is no control over the local optima of P_0)
2. compute the global optima of P_0; in general this is an NP-hard problem (even approximately); approach: use the SOS method with a degree-O(log m) combiner, which works because every solution set is clustered around the points ±a_1, …, ±a_m
- connection to robust tensor decomposition: M = (1/T) Σ_t y_t^{⊗4} is close to Σ_i a_i^{⊗4} in spectral norm; claim: degree-O(log m) SOS finds the components {±a_i} (see the sketch below)
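A small numpy sketch of step 1 on synthetic data (my illustration, not from the slides; the sparse-coefficient model below is only a caricature of the paper's "nice distribution" assumptions):

```python
# Minimal sketch (not from the slides): form P_0(u) = (1/T) sum_t <y_t, u>^4
# (equivalently <M, u^{x4}> for the empirical 4-tensor M = (1/T) sum_t y_t^{x4})
# and check that it is much larger at the dictionary columns +-a_i than at
# generic unit vectors, i.e. the global optima single out the a_i.
import numpy as np

n, m, T, k = 32, 32, 50_000, 2        # k = non-zeros per coefficient vector
rng = np.random.default_rng(1)

A = np.linalg.qr(rng.standard_normal((n, m)))[0]     # orthonormal columns a_1..a_m
X = np.zeros((m, T))
for t in range(T):                                    # crude sparse +-1 coefficients
    support = rng.choice(m, size=k, replace=False)
    X[support, t] = rng.choice([-1.0, 1.0], size=k)
Y = A @ X                                             # data vectors y_t = A x_t

def P0(u):
    return np.mean((Y.T @ u) ** 4)

at_columns = [P0(A[:, i]) for i in range(m)]
random_us = rng.standard_normal((n, 200))
random_us /= np.linalg.norm(random_us, axis=0)
at_random = [P0(random_us[:, j]) for j in range(200)]
print(min(at_columns), max(at_random))   # columns score visibly higher
```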
robust tensor decomposition
- given: a 4-tensor M ∈ R^{n^4} that is ε-close to Σ_i a_i^{⊗4} in spectral norm, for orthonormal vectors a_1, …, a_n ∈ R^n
- goal: find a_1, …, a_n
- polynomial system: E = { ⟨M, x^{⊗4}⟩ ≥ 1 − ε, ‖x‖_2^2 = 1 }; the solutions {x} to E are clustered around the vectors ±a_i
- degree-O(k) combiner for tensor decomposition: given the distribution {x} of solutions to E (via its low-degree moments), output a single solution as follows:
  - choose a random unit vector w
  - reweigh the distribution {x} by ⟨w, x⟩^{2k}, so that Pr'[x] ∝ ⟨w, x⟩^{2k}·Pr[x]
  - output the top eigenvector of Ẽ xx^T under the reweighted distribution
- analysis (has a low-degree SOS proof): with probability ≥ 1/n^{O(1)} over w, ⟨w, a_1⟩^2 ≥ 2·max_{i>1} ⟨w, a_i⟩^2; in that case reweighing increases the probability of the a_1-cluster by a factor ≈ 2^k relative to the rest, so for k = O(log n) the reweighted distribution is concentrated along ±a_1 (see the sketch below)
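An idealized numpy sketch of this combiner (my illustration, not from the slides): it cheats by reweighing actual sample vectors clustered around the ±a_i rather than a pseudo-distribution, but it shows how the ⟨w, x⟩^{2k} reweighing isolates one component.

```python
# Minimal sketch (not from the slides): the reweighing combiner, run on actual
# samples clustered around the components +-a_i (a pseudo-distribution would be
# handled the same way, using only its low-degree moments).
import numpy as np

n, k, per_cluster, noise = 10, 8, 200, 0.05
rng = np.random.default_rng(2)

a = np.linalg.qr(rng.standard_normal((n, n)))[0]      # orthonormal a_1..a_n
points = []
for i in range(n):                                     # solutions cluster near +-a_i
    for s in (+1, -1):
        pts = s * a[:, [i]] + noise * rng.standard_normal((n, per_cluster))
        points.append(pts / np.linalg.norm(pts, axis=0))
X = np.hstack(points)                                  # columns are "solutions" x

w = rng.standard_normal(n)
w /= np.linalg.norm(w)                                 # random unit vector
weights = (X.T @ w) ** (2 * k)                         # reweigh by <w, x>^{2k}
M2 = (X * weights) @ X.T / weights.sum()               # reweighted E[x x^T]
top = np.linalg.eigh(M2)[1][:, -1]                     # its top eigenvector

recovered = max(abs(a.T @ top))                        # correlation with best +-a_i
print(round(float(recovered), 3))                      # close to 1 for most w
```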
conclusions
- polynomial optimization: often easy when global optima are unique (occurs naturally for recovery problems)
- unsupervised learning: higher-degree SOS gives better guarantees for recovering hidden structures
- low-degree combiners: a general way to make proofs into algorithms

thank you!