Sparse and Low-Rank Tensor Recovery via Cubic-Sketching

Guang Cheng
Department of Statistics, Purdue University
www.science.purdue.edu/bigdata

CCAM@Purdue Math, Oct. 27, 2017

Joint work with Botao Hao and Anru Zhang
Tensor: Multi-dimensional Array
Tensor Data Example
Motivation: Compressed Image Transmission
Motivation: Interaction Effect Model
(Source: Contraceptive Method Choice dataset from UCI)
Motivation: Interaction Effect Model
Sparse and Low-Rank Tensor Recovery
Noisy Cubic Sketching Model

Observe $\{y_i, X_i\}$ from the noisy cubic sketching model,
$$\underbrace{y_i}_{\text{scalar}} = \underbrace{\langle T, X_i \rangle}_{\text{tensor inner product}} + \underbrace{\epsilon_i}_{\text{noise}}, \quad i = 1, \ldots, n.$$
For two tensors $A, B \in \mathbb{R}^{p_1 \times p_2 \times p_3}$, the tensor inner product is defined as
$$\langle A, B \rangle = \sum_{i,j,k} A_{ijk} B_{ijk}.$$
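As an aside not on the original slide, a minimal numpy sketch of one such cubic-sketching measurement (all names, sizes, and the noise level are illustrative):

```python
import numpy as np

p, sigma = 10, 0.1
rng = np.random.default_rng(0)

T = rng.standard_normal((p, p, p))        # unknown tensor parameter T
x = rng.standard_normal(p)                # Gaussian sketching vector x_i
X = np.einsum('a,b,c->abc', x, x, x)      # cubic sketch X_i = x_i ⊗ x_i ⊗ x_i

# <A, B> = sum over (i, j, k) of A_ijk * B_ijk
y = np.sum(T * X) + sigma * rng.standard_normal()
```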
Noisy Cubic Sketching Model

A general model that includes tensor regression (Zhou, Li, Zhu (2013)) and tensor completion (Yuan and Zhang (2014)).
Goal: recover the unknown third-order tensor parameter $T$.
High-dimensional problem: $n \ll \dim(T) = p^3$.
Key Assumptions on Tensor Parameter

When $T \in \mathbb{R}^{p \times p \times p}$ is a symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: represented as a sum of rank-one tensors, $T = \sum_{k=1}^{K} \eta_k \, \beta_k \otimes \beta_k \otimes \beta_k$, where $K \ll p$.
2. Sparse components: $\|\beta_k\|_0 \le s$ for $k \in [K]$.
The cubic sketching tensor $X_i$ in the symmetric case is $X_i = x_i \otimes x_i \otimes x_i$, where $\{x_i\}_{i=1}^n$ are Gaussian random vectors.
The factors $\beta_k$ and $\beta_{k'}$ are not orthogonal: different from the singular-value decomposition in the matrix case.
Key Assumptions on Tensor Parameter

When $T \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is a non-symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: $T = \sum_{k=1}^{K} \eta_k \, \beta_{1k} \otimes \beta_{2k} \otimes \beta_{3k}$.
2. Sparse components: $\|\beta_{1k}\|_0 \le s_1$, $\|\beta_{2k}\|_0 \le s_2$, $\|\beta_{3k}\|_0 \le s_3$ for $k \in [K]$.
The cubic sketching tensor $X_i$ in the non-symmetric case is $X_i = u_i \otimes v_i \otimes w_i$, where $\{u_i, v_i, w_i\}_{i=1}^n$ are Gaussian random vectors.
Reduced Symmetric Tensor Recovery Model

For the symmetric tensor recovery model,
$$y_i = \Big\langle \sum_{k=1}^{K} \eta_k \beta_k \otimes \beta_k \otimes \beta_k, \; x_i \otimes x_i \otimes x_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^{K} \eta_k (x_i^\top \beta_k)^3}_{\text{non-linear}} + \epsilon_i.$$
Connects with the interaction effect model.
New Goal: recover $\{\eta_k, \beta_k\}_{k=1}^K$.
Reduced Non-symmetric Tensor Recovery Model

For the non-symmetric tensor recovery model,
$$y_i = \Big\langle \sum_{k=1}^{K} \eta_k \beta_{1k} \otimes \beta_{2k} \otimes \beta_{3k}, \; u_i \otimes v_i \otimes w_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^{K} \eta_k (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k})}_{\text{non-linear}} + \epsilon_i.$$
Connects with the compressed image transmission model.
New Goal: recover $\{\eta_k, \beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^K$.
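A quick numerical check of this reduction, added here as a hedged sketch (numpy; the dimensions and seed are arbitrary): for a CP tensor, the inner product with a rank-one cubic sketch collapses to a sum of cubed linear forms.

```python
import numpy as np

p, K = 8, 3
rng = np.random.default_rng(1)
eta = rng.standard_normal(K)                  # weights η_k
B = rng.standard_normal((K, p))               # rows are factors β_k

# T = Σ_k η_k β_k ⊗ β_k ⊗ β_k
T = sum(eta[k] * np.einsum('a,b,c->abc', B[k], B[k], B[k]) for k in range(K))
x = rng.standard_normal(p)

lhs = np.einsum('abc,a,b,c->', T, x, x, x)    # <T, x ⊗ x ⊗ x>
rhs = np.sum(eta * (B @ x) ** 3)              # Σ_k η_k (x'β_k)^3
assert np.isclose(lhs, rhs)
```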
Empirical Risk Minimization

Consider empirical risk minimization:
$$\hat{T} = \operatorname*{argmin}_{\{\eta_k, \beta_k\}} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \eta_k (x_i^\top \beta_k)^3 \Big)^2}_{L_1(\eta_k, \beta_k)}$$
$$\hat{T} = \operatorname*{argmin}_{\{\eta_k, \beta_{jk}\}} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \eta_k (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k}) \Big)^2}_{L_2(\eta_k, \beta_{jk})}$$
Difficulty: non-convex optimization! The non-convexity comes from the cubic structure or tri-convexity.
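For concreteness, a minimal sketch of evaluating the symmetric objective $L_1$ (numpy; the function name and argument shapes are illustrative, not from the talk):

```python
import numpy as np

def L1(y, X, eta, B):
    """L1(η, β) = Σ_i (y_i - Σ_k η_k (x_i' β_k)^3)^2.
    y: (n,) responses; X: (n, p) sketch vectors; eta: (K,); B: (K, p) factors."""
    Z = (X @ B.T) ** 3            # Z[i, k] = (x_i' β_k)^3
    return np.sum((y - Z @ eta) ** 2)
```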
Our Contributions
1. An efficient two-stage implementation for the non-convex optimization problem.
2. A non-asymptotic analysis that provides the optimal estimation rate.
Two-stage Implementation
Main Algorithm
Initial Step: Non-symmetric Unbiased Estimator

Construct an unbiased empirical moment-based tensor $\tilde{T}(y_i, X_i) \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ as follows:
$$\tilde{T} := \underbrace{\frac{1}{n} \sum_{i=1}^{n} y_i \, u_i \otimes v_i \otimes w_i}_{\text{only depends on observations}}.$$
This is used for the non-symmetric case, where $u_i, v_i, w_i$ are independent standard Gaussian random vectors.
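A one-line numpy sketch of this estimator (function name and shapes are illustrative):

```python
import numpy as np

def nonsymmetric_moment_tensor(y, U, V, W):
    """T_tilde = (1/n) Σ_i y_i u_i ⊗ v_i ⊗ w_i.
    y: (n,); U, V, W: (n, p1), (n, p2), (n, p3) independent Gaussian sketches."""
    return np.einsum('i,ia,ib,ic->abc', y, U, V, W) / len(y)
```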
Initial Step: Symmetric Unbiased Estimator

Construct an unbiased empirical moment-based tensor $\tilde{T}_s(y_i, X_i) \in \mathbb{R}^{p \times p \times p}$ as follows:
$$\tilde{T}_s := \frac{1}{6} \Big[ \underbrace{\frac{1}{n} \sum_{i=1}^{n} y_i \, x_i \otimes x_i \otimes x_i}_{\text{only depends on observations}} - U \Big],$$
where the bias term is $U = \sum_{j=1}^{p} \big( m_1 \otimes e_j \otimes e_j + e_j \otimes m_1 \otimes e_j + e_j \otimes e_j \otimes m_1 \big)$ and $m_1 = \frac{1}{n} \sum_{i=1}^{n} y_i x_i$. Here $\{e_j\}_{j=1}^p$ are the canonical basis vectors in $\mathbb{R}^p$.
The bias term $U$ is due to the correlation among the three identical Gaussian random vectors.
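A sketch of the bias-corrected construction under the same notation (numpy; the function name is illustrative). It uses $\sum_j e_j \otimes e_j = I$ to form $U$ without an explicit loop:

```python
import numpy as np

def symmetric_moment_tensor(y, X):
    """T_s = (1/6) [ (1/n) Σ_i y_i x_i⊗x_i⊗x_i - U ].
    y: (n,) responses; X: (n, p) Gaussian sketch vectors x_i."""
    n, p = X.shape
    M = np.einsum('i,ia,ib,ic->abc', y, X, X, X) / n   # (1/n) Σ y_i x_i⊗x_i⊗x_i
    m1 = X.T @ y / n                                   # m1 = (1/n) Σ y_i x_i
    I = np.eye(p)
    # U = Σ_j (m1⊗e_j⊗e_j + e_j⊗m1⊗e_j + e_j⊗e_j⊗m1); Σ_j e_j e_j' = I
    U = (np.einsum('a,bc->abc', m1, I)
         + np.einsum('b,ac->abc', m1, I)
         + np.einsum('c,ab->abc', m1, I))
    return (M - U) / 6.0
```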
Initial Step: Decompose the Unbiased Estimator

Intuition: $E[\tilde{T}_s] = T$.
Observation: $\tilde{T}_s$. Noise: $E = \tilde{T}_s - E(\tilde{T}_s)$, the approximation error.
Decompose $\tilde{T}_s$ to obtain $\{\eta_k^{(0)}, \beta_k^{(0)}\}$, or decompose $\tilde{T}$ to obtain $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}$, through sparse tensor decomposition. See the next slide for details.
Far from the optimal estimate, but good enough as a warm start.
Sparse Tensor Decomposition for a Symmetric Tensor

Generate $L$ starting points $\{\beta_l^{\text{start}}\}_{l=1}^L$.
For each starting point, compute a non-sparse factor of the moment-based $\tilde{T}_s$ via the symmetric tensor power update:
$$\beta_l^{(t+1)} = \frac{\tilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)}}{\big\| \tilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)} \big\|_2},$$
where for $\tilde{T}_s \in \mathbb{R}^{p \times p \times p}$ and $x \in \mathbb{R}^p$ we define $\tilde{T}_s \times_2 x \times_3 x := \sum_{j,l} x_j x_l [\tilde{T}_s]_{:,j,l}$.
Get a sparse solution $\bar{\beta}_l^{(t+1)}$ via thresholding or truncation.
Cluster the $L$ single components $\{\bar{\beta}_l^{(T)}\}_{l=1}^L$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_k^{(0)}\}_{k=1}^K$.
Different from matrix SVD due to non-orthogonality.
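A minimal numpy sketch of one power-update step followed by hard truncation (the sparsity level `s_hat` is an assumed tuning parameter, not specified on the slide):

```python
import numpy as np

def power_update(Ts, beta, s_hat):
    """One symmetric tensor power step on T_s, then keep the s_hat
    largest-magnitude coordinates and renormalize."""
    v = np.einsum('ajl,j,l->a', Ts, beta, beta)   # T_s ×2 β ×3 β
    v /= np.linalg.norm(v)
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s_hat:]         # truncation step
    out[keep] = v[keep]
    return out / np.linalg.norm(out)
```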
Sparse Tensor Decomposition for a Non-symmetric Tensor

Generate $L$ starting points $\{\beta_{1l}^{\text{start}}, \beta_{2l}^{\text{start}}, \beta_{3l}^{\text{start}}\}_{l=1}^L$.
For each starting point, compute a non-sparse factor of the moment-based $\tilde{T}$ via the alternating tensor power update:
$$\beta_{1l}^{(t+1)} = \frac{\tilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)}}{\big\| \tilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)} \big\|_2}.$$
The updates for $\beta_{2l}^{(t+1)}$ and $\beta_{3l}^{(t+1)}$ are similar.
Get a sparse solution $\{\bar{\beta}_{1l}^{(t+1)}, \bar{\beta}_{2l}^{(t+1)}, \bar{\beta}_{3l}^{(t+1)}\}$ via thresholding or truncation.
Cluster the $L$ single components $\{\bar{\beta}_{1l}^{(T)}, \bar{\beta}_{2l}^{(T)}, \bar{\beta}_{3l}^{(T)}\}_{l=1}^L$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^K$.
Gradient Update: Thresholded Gradient Descent

Input the initial estimator $\{\eta_k^{(0)}, \beta_k^{(0)}\}_{k=1}^K$. In each iteration, update $\{\beta_k\}_{k=1}^K$ as
$$\tilde{\beta}_k^{(t+1)} = \beta_k^{(t)} - \frac{\mu_t}{\phi} \nabla_{\beta_k} L_1(\eta_k^{(0)}, \beta_k^{(t)}),$$
where $\phi = \frac{1}{n} \sum_{i=1}^{n} y_i^2$ and $\mu_t$ is the step size.
Sparsify the current update by thresholding: $\beta_k^{(t+1)} = \varphi_\rho(\tilde{\beta}_k^{(t+1)})$.
Normalize the final update $\hat{\beta}_k = \beta_k^{(T)} / \|\beta_k^{(T)}\|_2$ and update the weight $\hat{\eta}_k = \eta_k^{(0)} \|\beta_k^{(T)}\|_2^3$.
¹ Alternating update for non-symmetric tensor recovery.
Gradient Update: Thresholded Gradient Descent

Input the initial estimator $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^K$. In each iteration, alternately update $\{\beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^K$ as
$$\tilde{\beta}_{1k}^{(t+1)} = \beta_{1k}^{(t)} - \frac{\mu_t}{\phi} \nabla_{\beta_{1k}} L_2(\eta_k^{(0)}, \beta_{1k}^{(t)}, \beta_{2k}^{(t)}, \beta_{3k}^{(t)}),$$
where $\phi = \frac{1}{n} \sum_{i=1}^{n} y_i^2$ and $\mu_t$ is the step size. The updates for $\tilde{\beta}_{2k}^{(t+1)}$ and $\tilde{\beta}_{3k}^{(t+1)}$ are similar.
Sparsify the current update by thresholding: $\beta_{jk}^{(t+1)} = \varphi_\rho(\tilde{\beta}_{jk}^{(t+1)})$ for $j = 1, 2, 3$.
Normalize the final update $\hat{\beta}_{jk} = \beta_{jk}^{(T)} / \|\beta_{jk}^{(T)}\|_2$ and update the weight $\hat{\eta}_k = \eta_k^{(0)} \|\beta_{1k}^{(T)}\|_2 \|\beta_{2k}^{(T)}\|_2 \|\beta_{3k}^{(T)}\|_2$.
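A hedged numpy sketch of one thresholded gradient step for the symmetric case (updating all factors simultaneously with the residual from the start of the iteration; `mu` and `rho` are assumed tuning parameters):

```python
import numpy as np

def thresholded_gradient_step(y, X, eta0, B, mu, rho):
    """One thresholded gradient step on L1 for the symmetric model.
    y: (n,), X: (n, p), eta0: (K,) initial weights, B: (K, p) current factors."""
    n = len(y)
    Z = X @ B.T                                  # Z[i, k] = x_i' β_k
    resid = y - (Z ** 3) @ eta0                  # y_i - Σ_k η_k (x_i'β_k)^3
    phi = np.mean(y ** 2)                        # scaling φ = (1/n) Σ y_i^2
    # ∇_{β_k} L1 = -6 η_k Σ_i resid_i (x_i'β_k)^2 x_i  (here scaled by 1/n)
    grad = -6 * (eta0[None, :] * Z ** 2 * resid[:, None]).T @ X / n
    B_new = B - (mu / phi) * grad
    B_new[np.abs(B_new) < rho] = 0.0             # hard-threshold ϕ_ρ
    return B_new
```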
Non-asymptotic Analysis
Non-asymptotic Upper Bound

Theorem. Suppose some regularity conditions on the true tensor parameter hold, and assume $n \geq C_0 s^2 \log p$ for some large constant $C_0$. Denote
$$Z^{(t)} = \sum_{k=1}^{K} \big\| \sqrt[3]{\eta_k}\, \beta_k^{(t)} - \sqrt[3]{\eta_k}\, \beta_k \big\|_2^2.$$
For any $t = 0, 1, 2, \ldots$, the factor-wise estimator satisfies
$$Z^{(t+1)} \leq \underbrace{\kappa\, Z^{(t)}}_{\text{computational error}} + \underbrace{C_1 \eta_{\min}^{-4/3} \frac{\sigma^2 s \log p}{n}}_{\text{statistical error}}$$
with high probability, where $\kappa$ is the contraction parameter between 0 and 1, $\eta_{\min} = \min_k \{\eta_k\}$, $\sigma$ is the noise level, and $C_0, C_1$ are absolute constants.
Remarks

An interesting characterization of the computational error versus the statistical error.
Geometric convergence to the truth in the noiseless case, and the minimax optimal statistical rate shown later.
The error bound is dominated by the computational error in the first several iterations and then by the statistical error: a useful guideline for choosing the stopping rule.
Remarks

When $t \geq T$ for some large enough $T$, the final estimator is bounded by
$$\big\| \hat{T}^{(T)} - T \big\|_F^2 \leq C \sigma^2 \frac{K s \log p}{n}$$
with high probability. Minimax optimal rate!
Class of Sparse and Low-rank Tensors

Sparse CP decomposition: the true tensor
$$T = \sum_{k=1}^{K} \beta_k \otimes \beta_k \otimes \beta_k, \quad \|\beta_k\|_0 \leq s \text{ for } k \in [K].$$
Incoherence condition (nearly orthogonal): the components are incoherent such that
$$\max_{i \neq j \in [K]} |\langle \beta_i, \beta_j \rangle| \leq \frac{C}{\sqrt{s}}.$$
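A small numpy sketch of checking this incoherence condition for a set of factors (function name illustrative):

```python
import numpy as np

def max_incoherence(B):
    """max_{i≠j} |<β_i, β_j>| for unit-normalized rows β_k of B: (K, p)."""
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    G = np.abs(Bn @ Bn.T)        # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 0.0)     # ignore the diagonal <β_k, β_k> = 1
    return G.max()
```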
Minimax Lower Bound

Theorem. Consider the class of tensors satisfying the sparse CP decomposition and incoherence condition. Suppose we sample via cubic measurements with i.i.d. standard normal sketches and i.i.d. $N(0, \sigma^2)$ noise. Then we have the following lower bound on the recovery loss over this class of low-rank tensors:
$$\inf_{\hat{T}} \sup_{T \in \mathcal{F}} E \big\| \hat{T} - T \big\|_F^2 \geq c \sigma^2 \frac{K s \log(e p / s)}{n}.$$
Optimal Estimation Rate

Theorem. Consider the class of tensors $\mathcal{F}_{p,K,s}$ satisfying the sparse CP decomposition and incoherence condition. Suppose we observe $n$ samples $\{y_i, X_i\}_{i=1}^n$ from the symmetric cubic sketching model, where $n \geq C s^2 \log p$ for some large constant $C$. Then the estimator $\hat{T}$ achieves
$$\inf_{\hat{T}} \sup_{T \in \mathcal{F}_{p,K,s}} E \big\| \hat{T} - T \big\|_F^2 \asymp \underbrace{\sigma^2 \frac{K s \log(p/s)}{n}}_{R}$$
when $\log p \asymp \log(p/s)$. Here $\sigma$ is the noise level.
Remarks

Our analysis is non-asymptotic and our estimator is rate-optimal.
In general, $R$ is the outcome of a trade-off between statistical error and optimization error.
A similar argument holds for the non-symmetric case, though different technical tools are used.
To overcome the obstacle of high-order Gaussian random variables, we develop a novel high-order concentration inequality using a truncation argument and the $\psi_\alpha$-norm.
Application to Interaction Effect Model

Given the response $y \in \mathbb{R}^n$ and covariates $X \in \mathbb{R}^{n \times p}$, the regression model with three-way interactions can be formulated as
$$y_l = \beta_0 + \sum_{i=1}^{p} X_{li} \beta_i + \sum_{i,j=1}^{p} \gamma_{ij} X_{li} X_{lj} + \sum_{i,j,k=1}^{p} \eta_{ijk} X_{li} X_{lj} X_{lk} + \epsilon_l = \big\langle B, \tilde{X}_l \otimes \tilde{X}_l \otimes \tilde{X}_l \big\rangle + \epsilon_l,$$
where $\tilde{X}_l = (1, X_l) \in \mathbb{R}^{p+1}$.
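To see why a single inner product collects the intercept, main-effect, two-way, and three-way terms, here is a numerical check, added as a sketch (numpy; $B$ is symmetrized so the index-splitting identity below holds):

```python
import numpy as np
from itertools import permutations

p = 5
rng = np.random.default_rng(2)
B = rng.standard_normal((p + 1,) * 3)
B = sum(B.transpose(perm) for perm in permutations(range(3))) / 6  # symmetrize
x = rng.standard_normal(p)
xt = np.concatenate(([1.0], x))                # augmented covariate (1, X_l)

lhs = np.einsum('abc,a,b,c->', B, xt, xt, xt)  # <B, X̃ ⊗ X̃ ⊗ X̃>
rhs = (B[0, 0, 0]                              # intercept
       + 3 * B[0, 0, 1:] @ x                   # main effects
       + 3 * x @ B[0, 1:, 1:] @ x              # two-way interactions
       + np.einsum('abc,a,b,c->', B[1:, 1:, 1:], x, x, x))  # three-way
assert np.isclose(lhs, rhs)
```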
Application to Interaction Effect Model

It is reasonable to assume that $B$ possesses low-rank and/or sparsity structures in some biometrics studies (Hung et al., 2016):
$$y_l = \Big\langle \sum_{k=1}^{K} \eta_k \beta_k \otimes \beta_k \otimes \beta_k, \; \tilde{X}_l \otimes \tilde{X}_l \otimes \tilde{X}_l \Big\rangle + \epsilon_l.$$
The symmetric tensor recovery model can thus be treated as a high-order interaction effect model.
Some Changes to the Algorithm

The unbiased empirical moment-based tensor $A \in \mathbb{R}^{(p+1) \times (p+1) \times (p+1)}$ for the interaction effect model is constructed as follows. Define three quantities:
$$a = \frac{1}{n} \sum_{l=1}^{n} y_l X_l, \quad \tilde{A} = \frac{1}{n} \sum_{l=1}^{n} y_l \tilde{X}_l \otimes \tilde{X}_l \otimes \tilde{X}_l, \quad \bar{A} = \frac{1}{6} \Big( \tilde{A} - \sum_{j=1}^{p} \big( a \otimes e_j \otimes e_j + e_j \otimes a \otimes e_j + e_j \otimes e_j \otimes a \big) \Big).$$
For $i, j, k \neq 0$, set $A_{ijk} = \bar{A}_{ijk}$. For $i \neq 0$, set $A_{0,0,i} = \frac{1}{3} \tilde{A}_{0,0,i} - \frac{1}{6} \big( \sum_{k=1}^{p} \tilde{A}_{k,k,i} - (p+2) a_i \big)$. And $A_{0,0,0} = \frac{1}{2p-2} \big( \sum_{k=1}^{p} \tilde{A}_{0,k,k} - (p+2) \tilde{A}_{0,0,0} \big)$.
The additional intercept term changes the model structure dramatically.
Numerical Study

Recover a low-rank and sparse symmetric tensor $T \in \mathbb{R}^{p \times p \times p}$ from $y_i = \langle T, X_i \rangle + \epsilon_i$, $i = 1, \ldots, n$.
Proportion of non-zero elements in each factor: $s = 0.3$.
Tensor CP-rank $K = 3$. Replications: 200.
Stopping rule for initialization: $\|\beta_m^{(l+1)} - \beta_m^{(l)}\|_2 \leq 10^{-6}$.
Stopping rule for gradient update: $\|B^{(T+1)} - B^{(T)}\|_F \leq 10^{-6}$.
The dimension, sample size, and noise level vary across scenarios.
Left panel: relative error for initialization. Right panel: percent of successful recovery with varying sample size. Both are for the noiseless case with p = 30.
Noisy case. Left panel: absolute error for recovering a rank-three tensor with varying noise level and sample size. Right panel: absolute error for recovering a rank-five tensor with varying noise level and sample size.
Thanks! And Questions?