Sparse and Low-Rank Tensor Recovery via Cubic-Sketching

1 Sparse and Low-Rank Tensor Recovery via Cubic-Sketching. Guang Cheng, Department of Statistics, Purdue University. Math, Oct. 27, 2017. Joint work with Botao Hao and Anru Zhang.

2 Tensor: Multi-dimensional Array

3 Tensor Data Example

4 Motivation: Compressed Image Transmission

5 Motivation: Interaction Effect Model source: Contraceptive Method Choice dataset from UCI

6 Motivation: Interaction Effect Model

7 Sparse and Low-Rank Tensor Recovery

8 Noisy Cubic Sketching Model
Observe $\{y_i, X_i\}$ from the noisy cubic sketching model
$$y_i = \langle T, X_i \rangle + \epsilon_i, \quad i = 1, \dots, n,$$
where $y_i$ is a scalar, $\langle T, X_i \rangle$ is a tensor inner product, and $\epsilon_i$ is noise.
For two tensors $A \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ and $B \in \mathbb{R}^{p_1 \times p_2 \times p_3}$, the tensor inner product is defined as $\langle A, B \rangle = \sum_{ijk} A_{ijk} B_{ijk}$.
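
To make the sketching model concrete, here is a minimal numpy sketch of the observation process; the function names, dimensions, and noise level below are illustrative assumptions, not from the slides.

```python
import numpy as np

def tensor_inner(A, B):
    # <A, B> = sum_{ijk} A_{ijk} B_{ijk}
    return np.sum(A * B)

def cubic_sketch_observations(T, n, sigma=0.1, rng=None):
    """Generate {y_i, X_i} with X_i = x_i o x_i o x_i and y_i = <T, X_i> + eps_i."""
    rng = rng or np.random.default_rng(0)
    p = T.shape[0]
    ys, Xs = [], []
    for _ in range(n):
        x = rng.standard_normal(p)
        X = np.einsum('a,b,c->abc', x, x, x)  # rank-one cubic sketch
        ys.append(tensor_inner(T, X) + sigma * rng.standard_normal())
        Xs.append(X)
    return np.array(ys), Xs
```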

9 Noisy Cubic Sketching Model
This is a general model including tensor regression (Zhou, Li, Zhu (2013)) and tensor completion (Yuan and Zhang (2014)).
Goal: recover the unknown third-order tensor parameter $T$.
High-dimensional problem: $n \ll \dim(T) = p^3$.

10 Key Assumptions on Tensor Parameter
When $T \in \mathbb{R}^{p \times p \times p}$ is a symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: $T$ is represented as a sum of $K$ rank-one tensors, $T = \sum_{k=1}^K \eta_k\, \beta_k \circ \beta_k \circ \beta_k$, where $K \ll p$.
2. Sparse components: $\|\beta_k\|_0 \le s$ for $k \in [K]$.
The cubic sketching tensor $X_i$ for the symmetric case is $X_i = x_i \circ x_i \circ x_i$, where $\{x_i\}_{i=1}^n$ are Gaussian random vectors.
$\beta_k$ and $\beta_{k'}$ are not orthogonal: different from the singular-value decomposition in the matrix case.

11 Key Assumptions on Tensor Parameter
When $T \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is a non-symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: $T = \sum_{k=1}^K \eta_k\, \beta_{1k} \circ \beta_{2k} \circ \beta_{3k}$.
2. Sparse components: $\|\beta_{1k}\|_0 \le s_1$, $\|\beta_{2k}\|_0 \le s_2$, $\|\beta_{3k}\|_0 \le s_3$ for $k \in [K]$.
The cubic sketching tensor $X_i$ for the non-symmetric case is $X_i = u_i \circ v_i \circ w_i$, where $\{u_i, v_i, w_i\}_{i=1}^n$ are Gaussian random vectors.

12 Reduced Symmetric Tensor Recovery Model
For the symmetric tensor recovery model,
$$y_i = \Big\langle \sum_{k=1}^K \eta_k\, \beta_k \circ \beta_k \circ \beta_k,\; x_i \circ x_i \circ x_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^K \eta_k (x_i^\top \beta_k)^3}_{\text{non-linear}} + \epsilon_i.$$
Connect with the interaction effect model.
New Goal: recover $\{\eta_k, \beta_k\}_{k=1}^K$.
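
Since $\langle \beta \circ \beta \circ \beta, x \circ x \circ x \rangle = (x^\top \beta)^3$, the reduced model can be simulated without ever forming a $p \times p \times p$ array. A small sketch (hypothetical names; B stacks the factors column-wise):

```python
import numpy as np

def simulate_symmetric(eta, B, n, sigma=0.1, rng=None):
    """y_i = sum_k eta_k (x_i' beta_k)^3 + eps_i, with B = [beta_1, ..., beta_K] of shape (p, K)."""
    rng = rng or np.random.default_rng(0)
    p, K = B.shape
    X = rng.standard_normal((n, p))      # rows are the sketch vectors x_i
    y = (X @ B) ** 3 @ eta + sigma * rng.standard_normal(n)
    return y, X
```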

13 Reduced Non-symmetric Tensor Recovery Model
For the non-symmetric tensor recovery model,
$$y_i = \Big\langle \sum_{k=1}^K \eta_k\, \beta_{1k} \circ \beta_{2k} \circ \beta_{3k},\; u_i \circ v_i \circ w_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^K \eta_k (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k})}_{\text{non-linear}} + \epsilon_i.$$
Connect with the compressed image transmission model.
New Goal: recover $\{\eta_k, \beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^K$.

14 Empirical Risk Minimization
Consider empirical risk minimization:
$$\widehat{T} = \operatorname*{argmin}_{\{\eta_k, \beta_k\}} \underbrace{\sum_{i=1}^n \Big(y_i - \sum_{k=1}^K \eta_k (x_i^\top \beta_k)^3\Big)^2}_{L_1(\eta_k, \beta_k)} \quad \text{(symmetric case)},$$
$$\widehat{T} = \operatorname*{argmin}_{\{\eta_k, \beta_{jk}\}} \underbrace{\sum_{i=1}^n \Big(y_i - \sum_{k=1}^K \eta_k (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k})\Big)^2}_{L_2(\eta_k, \beta_{jk})} \quad \text{(non-symmetric case)}.$$
Difficulties: non-convex optimization! Non-convexity arises from the cube structure or tri-convexity.
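
For reference, the symmetric objective $L_1$ is an ordinary residual sum of squares and is cheap to evaluate; a sketch using the same hypothetical array layout as above:

```python
import numpy as np

def loss_L1(eta, B, y, X):
    """L_1(eta, beta) = sum_i (y_i - sum_k eta_k (x_i' beta_k)^3)^2."""
    fitted = (X @ B) ** 3 @ eta
    return np.sum((y - fitted) ** 2)
```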

15 Our Contributions
1. An efficient two-stage implementation for the non-convex optimization problem.
2. A non-asymptotic analysis providing the optimal estimation rate.

16 Two-stage Implementation

17 Main Algorithm

18 Initial Step: Non-symmetric Unbiased Estimator
Construct an unbiased empirical moment-based tensor $\widetilde{T}(y_i, X_i) \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ as follows:
$$\widetilde{T} := \frac{1}{n} \sum_{i=1}^n y_i\, u_i \circ v_i \circ w_i,$$
which depends only on the observations. This is used for the non-symmetric case, where $u_i, v_i, w_i$ are independent standard Gaussian random vectors.
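
A sketch of this moment estimator with einsum, assuming the sketch vectors are stored row-wise in arrays U, V, W (names illustrative):

```python
import numpy as np

def moment_tensor_nonsym(y, U, V, W):
    """T_tilde = (1/n) sum_i y_i * u_i o v_i o w_i, with U, V, W of shapes (n, p1), (n, p2), (n, p3)."""
    n = y.shape[0]
    return np.einsum('i,ia,ib,ic->abc', y, U, V, W) / n
```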

19 Initial Step: Symmetric Unbiased Estimator
Construct an unbiased empirical moment-based tensor $\widetilde{T}_s(y_i, X_i) \in \mathbb{R}^{p \times p \times p}$ as follows:
$$\widetilde{T}_s := \frac{1}{6}\Big[\frac{1}{n} \sum_{i=1}^n y_i\, x_i \circ x_i \circ x_i - U\Big],$$
which depends only on the observations, where the bias term is
$$U = \sum_{j=1}^p \big(m_1 \circ e_j \circ e_j + e_j \circ m_1 \circ e_j + e_j \circ e_j \circ m_1\big), \qquad m_1 = \frac{1}{n}\sum_{i=1}^n y_i x_i.$$
Here $\{e_j\}_{j=1}^p$ are the canonical basis vectors of $\mathbb{R}^p$. The bias term $U$ is due to the correlation among the three identical Gaussian random vectors.
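
The symmetric estimator and its bias correction can be sketched the same way (variable names are illustrative; the three einsum terms are the three sums in $U$):

```python
import numpy as np

def moment_tensor_sym(y, X):
    """T_tilde_s = (1/6) * [ (1/n) sum_i y_i x_i o x_i o x_i - U ]."""
    n, p = X.shape
    M3 = np.einsum('i,ia,ib,ic->abc', y, X, X, X) / n   # (1/n) sum_i y_i x_i o x_i o x_i
    m1 = X.T @ y / n                                    # m_1 = (1/n) sum_i y_i x_i
    I = np.eye(p)
    U = (np.einsum('a,jb,jc->abc', m1, I, I)            # sum_j m_1 o e_j o e_j
         + np.einsum('ja,b,jc->abc', I, m1, I)          # sum_j e_j o m_1 o e_j
         + np.einsum('ja,jb,c->abc', I, I, m1))         # sum_j e_j o e_j o m_1
    return (M3 - U) / 6
```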

20 Initial Step: Decompose the Unbiased Estimator
Intuition: $\mathbb{E}[\widetilde{T}_s] = T$. Observation: $\widetilde{T}_s$. Noise $E = \widetilde{T}_s - \mathbb{E}(\widetilde{T}_s)$: approximation error.
Decompose $\widetilde{T}_s$ to obtain $\{\eta_k^{(0)}, \beta_k^{(0)}\}$, or decompose $\widetilde{T}$ to obtain $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}$, through sparse tensor decomposition (see the next slide for details).
This is far from the optimal estimate, but good enough as a warm start.

23 Sparse Tensor Decomposition for Symmetric Tensor
Generate $L$ starting points $\{\beta_l^{\mathrm{start}}\}_{l=1}^L$.
For each starting point, compute a non-sparse factor of the moment-based $\widetilde{T}_s$ via the symmetric tensor power update:
$$\beta_l^{(t+1)} = \frac{\widetilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)}}{\big\|\widetilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)}\big\|_2},$$
where for $\widetilde{T}_s \in \mathbb{R}^{p \times p \times p}$ and $x \in \mathbb{R}^p$ we define $\widetilde{T}_s \times_2 x \times_3 x := \sum_{j,l} x_j x_l [\widetilde{T}_s]_{:,j,l}$.
Get a sparse solution $\bar{\beta}_l^{(t+1)}$ via thresholding or truncation.
Cluster the $L$ single-component sets $\{\beta_l^{(T)}, \beta_l^{(T)}, \beta_l^{(T)}\}_{l=1}^L$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_k^{(0)}, \beta_k^{(0)}, \beta_k^{(0)}\}_{k=1}^K$.
Different from matrix SVD due to non-orthogonality.
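
One iteration of this symmetric power update, with hard truncation used as the sparsification step, might look as follows (keeping the $s$ largest-magnitude entries and renormalizing is one common choice, stated here as an assumption):

```python
import numpy as np

def sparse_power_update_sym(T_s, beta, s):
    """beta^{(t+1)} = T_s x_2 beta x_3 beta / ||.||_2, followed by hard truncation to s entries."""
    v = np.einsum('ijk,j,k->i', T_s, beta, beta)   # T_s x_2 beta x_3 beta
    v /= np.linalg.norm(v)
    keep = np.argsort(np.abs(v))[-s:]              # indices of the s largest |entries|
    v_sparse = np.zeros_like(v)
    v_sparse[keep] = v[keep]
    return v_sparse / np.linalg.norm(v_sparse)
```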

24 Sparse Tensor Decomposition for Non-symmetric Tensor
Generate $L$ starting points $\{\beta_{1l}^{\mathrm{start}}, \beta_{2l}^{\mathrm{start}}, \beta_{3l}^{\mathrm{start}}\}_{l=1}^L$.
For each starting point, compute a non-sparse factor of the moment-based $\widetilde{T}$ via the alternating tensor power update:
$$\beta_{1l}^{(t+1)} = \frac{\widetilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)}}{\big\|\widetilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)}\big\|_2};$$
the updates for $\beta_{2l}^{(t+1)}$ and $\beta_{3l}^{(t+1)}$ are similar.
Get a sparse solution $\{\bar{\beta}_{1l}^{(t+1)}, \bar{\beta}_{2l}^{(t+1)}, \bar{\beta}_{3l}^{(t+1)}\}$ via thresholding or truncation.
Cluster the $L$ single-component sets $\{\beta_{1l}^{(T)}, \beta_{2l}^{(T)}, \beta_{3l}^{(T)}\}_{l=1}^L$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^K$.
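
A sketch of one round of the alternating power updates (the sparsification step from the slide is omitted here for brevity; names are illustrative):

```python
import numpy as np

def alternating_power_update(T, b1, b2, b3):
    """One round of alternating power updates for a non-symmetric tensor T of shape (p1, p2, p3)."""
    b1 = np.einsum('ijk,j,k->i', T, b2, b3); b1 /= np.linalg.norm(b1)
    b2 = np.einsum('ijk,i,k->j', T, b1, b3); b2 /= np.linalg.norm(b2)
    b3 = np.einsum('ijk,i,j->k', T, b1, b2); b3 /= np.linalg.norm(b3)
    return b1, b2, b3
```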

25 Gradient Update: Thresholded Gradient Descent
Input the initial estimator $\{\eta_k^{(0)}, \beta_k^{(0)}\}_{k=1}^K$. In each iteration, update $\{\beta_k\}_{k=1}^K$ as
$$\tilde{\beta}_k^{(t+1)} = \beta_k^{(t)} - \frac{\mu_t}{\varphi}\, \nabla_{\beta_k} L_1\big(\eta_k^{(0)}, \beta_k^{(t)}\big),$$
where $\varphi = \frac{1}{n}\sum_{i=1}^n y_i^2$ and $\mu_t$ is the step size.
Sparsify the current update by thresholding: $\beta_k^{(t+1)} = \phi_\rho\big(\tilde{\beta}_k^{(t+1)}\big)$.
Normalize the final update $\widehat{\beta}_k^{(T)} = \beta_k^{(T)} / \|\beta_k^{(T)}\|_2$ and update the weight $\widehat{\eta}_k = \eta_k^{(0)} \|\beta_k^{(T)}\|_2^3$.
Use an alternating update for non-symmetric tensor recovery.
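
A hedged sketch of one thresholded gradient step for the symmetric loss $L_1$; the gradient expression below follows by differentiating $L_1$, while the step size mu, threshold rho, and array layout are illustrative assumptions:

```python
import numpy as np

def thresholded_grad_step(eta0, B, y, X, mu, rho):
    """One step: beta_k <- phi_rho( beta_k - (mu/phi) * grad_{beta_k} L_1 ), for all k at once."""
    phi = np.mean(y ** 2)                           # phi = (1/n) sum_i y_i^2
    proj = X @ B                                    # (n, K) matrix of x_i' beta_k
    resid = y - (proj ** 3) @ eta0                  # r_i = y_i - sum_k eta_k (x_i' beta_k)^3
    # grad_{beta_k} L_1 = -6 * eta_k * sum_i r_i (x_i' beta_k)^2 x_i
    grad = -6 * (X.T @ (resid[:, None] * proj ** 2)) * eta0[None, :]
    B_new = B - (mu / phi) * grad
    B_new[np.abs(B_new) < rho] = 0.0                # hard thresholding phi_rho
    return B_new
```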

26 Gradient Update: Thresholded Gradient Descent
Input the initial estimator $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^K$. In each iteration, alternately update $\{\beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^K$ as
$$\tilde{\beta}_{1k}^{(t+1)} = \beta_{1k}^{(t)} - \frac{\mu_t}{\varphi}\, \nabla_{\beta_{1k}} L_2\big(\eta_k^{(0)}, \beta_{1k}^{(t)}, \beta_{2k}^{(t)}, \beta_{3k}^{(t)}\big),$$
where $\varphi = \frac{1}{n}\sum_{i=1}^n y_i^2$ and $\mu_t$ is the step size. The updates for $\beta_{2k}^{(t+1)}$ and $\beta_{3k}^{(t+1)}$ are similar.
Sparsify the current update by thresholding: $\beta_{jk}^{(t+1)} = \phi_\rho\big(\tilde{\beta}_{jk}^{(t+1)}\big)$ for $j = 1, 2, 3$.
Normalize the final update $\widehat{\beta}_{jk}^{(T)} = \beta_{jk}^{(T)} / \|\beta_{jk}^{(T)}\|_2$ and update the weight $\widehat{\eta}_k = \eta_k^{(0)} \|\beta_{1k}^{(T)}\|_2 \|\beta_{2k}^{(T)}\|_2 \|\beta_{3k}^{(T)}\|_2$.
This is the alternating update for non-symmetric tensor recovery.

27 Non-asymptotic Analysis

28 Non-asymptotic Upper Bound
Theorem. Suppose some regularity conditions on the true tensor parameter hold, and assume $n \ge C_0 s^2 \log p$ for some large constant $C_0$. Denote
$$Z^{(t)} = \sum_{k=1}^K \big\| \eta_k^{1/3} \beta_k^{(t)} - \eta_k^{1/3} \beta_k \big\|_2^2.$$
For any $t = 0, 1, 2, \dots$, the factor-wise estimator satisfies, with high probability,
$$Z^{(t+1)} \le \underbrace{\kappa\, Z^{(t)}}_{\text{computational error}} + \underbrace{\frac{C_1}{\eta_{\min}^{4/3}} \cdot \frac{\sigma^2 s \log p}{n}}_{\text{statistical error}},$$
where $\kappa$ is a contraction parameter between 0 and 1, $\eta_{\min} = \min_k \{\eta_k\}$, $\sigma$ is the noise level, and $C_0, C_1$ are absolute constants.

29 Remarks
Interesting characterization of the computational error and the statistical error.
Geometric convergence to the truth in the noiseless case; minimax optimal statistical rate (shown later).
The error bound is dominated by the computational error in the first several iterations and then by the statistical error. This gives a useful guideline for choosing the stopping rule.
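
Unrolling the contraction above (a standard step, spelled out here to make the remark explicit) gives
$$Z^{(T)} \le \kappa^{T} Z^{(0)} + \frac{1}{1-\kappa}\,(\text{statistical error}),$$
so the geometric term dominates in the first iterations and the statistical term takes over once $\kappa^{T} Z^{(0)}$ has shrunk below it.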

30 Remarks
When $t \ge T$ for some large enough $T$, the final estimator is bounded by
$$\big\| T^{(T)} - T \big\|_F^2 \le \frac{C \sigma^2 K s \log p}{n}$$
with high probability. Minimax optimal rate!

31 Class of Sparse and Low-rank Tensors
Sparse CP decomposition:
$$T = \sum_{k=1}^K \beta_k \circ \beta_k \circ \beta_k, \qquad \|\beta_k\|_0 \le s \ \text{for } k \in [K].$$
Incoherence condition (nearly orthogonal): the components of the true tensor are incoherent such that
$$\max_{i \ne j \in [K]} |\langle \beta_i, \beta_j \rangle| \le \frac{C}{\sqrt{s}}.$$

32 Minimax Lower Bound
Theorem. Consider the class of tensors satisfying the sparse CP decomposition and the incoherence condition. Suppose we sample via cubic measurements with i.i.d. standard normal sketches and i.i.d. $N(0, \sigma^2)$ noise. Then we have the following lower bound on the recovery loss over this class of low-rank tensors:
$$\inf_{\widehat{T}} \sup_{T \in \mathcal{F}} \mathbb{E}\, \| \widehat{T} - T \|_F^2 \ge \frac{c \sigma^2 K s \log(ep/s)}{n}.$$

33 Optimal Estimation Rate
Theorem. Consider the class of tensors $\mathcal{F}_{p,K,s}$ satisfying the sparse CP decomposition and the incoherence condition. Suppose we observe $n$ samples $\{y_i, X_i\}_{i=1}^n$ from the symmetric tensor cubic sketching model, where $n \ge C s^2 \log p$ for some large constant $C$. Then the estimator $\widehat{T}$ achieves
$$\inf_{\widehat{T}} \sup_{T \in \mathcal{F}_{p,K,s}} \mathbb{E}\, \| \widehat{T} - T \|_F^2 \asymp \underbrace{\frac{\sigma^2 K s \log(p/s)}{n}}_{R}$$
when $\log p \asymp \log(p/s)$. Here $\sigma$ is the noise level.

34 Remarks
Our analysis is non-asymptotic and our estimator is rate-optimal.
In general, the rate $R$ is the outcome of a trade-off between statistical error and optimization error.
A similar argument holds for the non-symmetric case, though different technical tools are used.
To overcome the obstacle posed by high-order Gaussian random variables, we develop a novel high-order concentration inequality using a truncation argument and the $\psi_\alpha$-norm.

35 Application to Interaction Effect Model
Given the response $y \in \mathbb{R}^n$ and covariates $X \in \mathbb{R}^{n \times p}$, the regression model with three-way interactions can be formulated as
$$y_l = \beta_0 + \sum_{i=1}^p X_{li} \beta_i + \sum_{i,j=1}^p \gamma_{ij} X_{li} X_{lj} + \sum_{i,j,k=1}^p \eta_{ijk} X_{li} X_{lj} X_{lk} + \epsilon_l = \langle B, \widetilde{X}_l \circ \widetilde{X}_l \circ \widetilde{X}_l \rangle + \epsilon_l,$$
where $\widetilde{X}_l = (1, X_l) \in \mathbb{R}^{p+1}$.
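
A small sketch of how the augmented covariate $\widetilde{X}_l = (1, X_l)$ turns the whole interaction regression into a single tensor inner product, given a coefficient tensor B (names are illustrative):

```python
import numpy as np

def interaction_predictions(B, X):
    """y_hat_l = <B, x_tilde_l o x_tilde_l o x_tilde_l> with x_tilde_l = (1, X_l)."""
    n = X.shape[0]
    X_tilde = np.hstack([np.ones((n, 1)), X])       # prepend the intercept coordinate
    return np.einsum('abc,la,lb,lc->l', B, X_tilde, X_tilde, X_tilde)
```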

36 Application to Interaction Effect Model
It is reasonable to assume that $B$ possesses low-rank and/or sparsity structures in some biometrics studies (Hung et al., 2016):
$$y_l = \Big\langle \sum_{k=1}^K \eta_k\, \beta_k \circ \beta_k \circ \beta_k,\; \widetilde{X}_l \circ \widetilde{X}_l \circ \widetilde{X}_l \Big\rangle + \varepsilon_l.$$
The symmetric tensor recovery model can thus be treated as a high-order interaction effect model.

37 Some Changes to the Algorithm
The unbiased empirical moment-based tensor $A \in \mathbb{R}^{(p+1) \times (p+1) \times (p+1)}$ for the interaction effect model is constructed as follows. Define three quantities
$$a = \frac{1}{n} \sum_{l=1}^n y_l \widetilde{X}_l, \qquad \widetilde{A} = \frac{1}{n} \sum_{l=1}^n y_l\, \widetilde{X}_l \circ \widetilde{X}_l \circ \widetilde{X}_l, \qquad \bar{A} = \frac{1}{6}\Big(\widetilde{A} - \sum_{j=1}^p \big(a \circ e_j \circ e_j + e_j \circ a \circ e_j + e_j \circ e_j \circ a\big)\Big).$$
For $i, j, k \ne 0$, $A_{ijk} = \bar{A}_{ijk}$. For $i \ne 0$, $A_{0,0,i} = \frac{1}{3}\widetilde{A}_{0,0,i} - \frac{1}{6}\big(\sum_{k=1}^p \widetilde{A}_{k,k,i} - (p+2) a_i\big)$. And $A_{0,0,0} = \frac{1}{2p-2}\big(\sum_{k=1}^p \widetilde{A}_{0,k,k} - (p+2) \widetilde{A}_{0,0,0}\big)$.
The additional intercept term changes the model structure dramatically.

38 Numerical Study
Recover a low-rank and sparse symmetric tensor $T \in \mathbb{R}^{p \times p \times p}$ from $y_i = \langle T, X_i \rangle + \epsilon_i$, $i = 1, \dots, n$.
Proportion of non-zero elements in each factor: $s = 0.3$. Tensor CP-rank $K = 3$. Replications: 200.
Stopping rule for initialization: the change $\|\beta_m^{(l+1)} - \beta_m^{(l)}\|$ is small enough. Stopping rule for the gradient update: the change $\|B^{(T+1)} - B^{(T)}\|_F$ is small enough.
The dimension, sample size, and noise level vary across scenarios.

39 [Figure] Left panel: relative error of the initialization. Right panel: percentage of successful recovery with varying sample size. Both are for the noiseless case with $p = 30$.

40 [Figure] Noisy case. Left panel: absolute error for recovering a rank-three tensor with varying noise level and sample size. Right panel: absolute error for recovering a rank-five tensor with varying noise level and sample size.

41 Thanks! And questions?
