Sparse and Low-Rank Tensor Recovery via Cubic-Sketching

Sparse and Low-Rank Tensor Recovery via Cubic-Sketching. Guang Cheng, Department of Statistics, Purdue University. www.science.purdue.edu/bigdata. CCAM@Purdue Math, Oct. 27, 2017. Joint work with Botao Hao and Anru Zhang.

Tensor: Multi-dimensional Array

Tensor Data Example

Motivation: Compressed Image Transmission

Motivation: Interaction Effect Model (source: Contraceptive Method Choice dataset from UCI)

Motivation: Interaction Effect Model

Sparse and Low-Rank Tensor Recovery

Noisy Cubic Sketching Model
Observe $\{y_i, X_i\}$ from the noisy cubic sketching model
$$ \underbrace{y_i}_{\text{scalar}} = \underbrace{\langle T, X_i \rangle}_{\text{tensor inner product}} + \underbrace{\epsilon_i}_{\text{noise}}, \qquad i = 1, \ldots, n. $$
For two tensors $A, B \in \mathbb{R}^{p_1 \times p_2 \times p_3}$, the tensor inner product is defined as $\langle A, B \rangle = \sum_{ijk} A_{ijk} B_{ijk}$.
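
To make the notation concrete, here is a tiny NumPy sketch of the tensor inner product $\langle A, B \rangle = \sum_{ijk} A_{ijk} B_{ijk}$. This is illustrative only (not code from the talk), and the dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, p3 = 4, 5, 6
A = rng.standard_normal((p1, p2, p3))
B = rng.standard_normal((p1, p2, p3))

inner = np.sum(A * B)                         # elementwise product summed over all indices
inner_einsum = np.einsum("ijk,ijk->", A, B)   # the same inner product written via einsum
assert np.isclose(inner, inner_einsum)
```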

Noisy Cubic Sketching Model
A general model that includes tensor regression (Zhou, Li, Zhu (2013)) and tensor completion (Yuan and Zhang (2014)). Goal: recover the unknown third-order tensor parameter $T$. High-dimensional problem: $n \ll \dim(T) = p^3$.

Key Assumptions on Tensor Parameter
When $T \in \mathbb{R}^{p \times p \times p}$ is a symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: $T$ is represented as a sum of rank-one tensors, $T = \sum_{k=1}^{K} \eta_k\, \beta_k \otimes \beta_k \otimes \beta_k$, where $K \ll p$.
2. Sparse components: $\|\beta_k\|_0 \le s$ for $k \in [K]$.
The cubic sketching tensor $X_i$ for the symmetric case is $X_i = x_i \otimes x_i \otimes x_i$, where $\{x_i\}_{i=1}^{n}$ are Gaussian random vectors.
The factors $\beta_k$ and $\beta_{k'}$ are not orthogonal, which is different from the singular-value decomposition in the matrix case.

Key Assumptions on Tensor Parameter
When $T \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is a non-symmetric tensor...
1. CANDECOMP/PARAFAC (CP) low-rank: $T = \sum_{k=1}^{K} \eta_k\, \beta_{1k} \otimes \beta_{2k} \otimes \beta_{3k}$.
2. Sparse components: $\|\beta_{1k}\|_0 \le s_1$, $\|\beta_{2k}\|_0 \le s_2$, $\|\beta_{3k}\|_0 \le s_3$ for $k \in [K]$.
The cubic sketching tensor $X_i$ for the non-symmetric case is $X_i = u_i \otimes v_i \otimes w_i$, where $\{u_i, v_i, w_i\}_{i=1}^{n}$ are Gaussian random vectors.

Reduced Symmetric Tensor Recovery Model
For the symmetric tensor recovery model,
$$ y_i = \Big\langle \sum_{k=1}^{K} \eta_k\, \beta_k \otimes \beta_k \otimes \beta_k,\; x_i \otimes x_i \otimes x_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^{K} \eta_k\, (x_i^\top \beta_k)^3}_{\text{non-linear}} + \epsilon_i. $$
Connection with the interaction effect model. New goal: recover $\{\eta_k, \beta_k\}_{k=1}^{K}$.
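
The reduction above is just the algebraic identity $\langle \beta^{\otimes 3}, x^{\otimes 3} \rangle = (x^\top \beta)^3$ applied term by term. Below is a small self-contained NumPy check of this identity; the dimensions $p$, $K$, $s$ are made up for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, s = 20, 3, 5

# sparse unit-norm factors beta_k and positive weights eta_k
betas = np.zeros((K, p))
for k in range(K):
    support = rng.choice(p, size=s, replace=False)
    betas[k, support] = rng.standard_normal(s)
    betas[k] /= np.linalg.norm(betas[k])
etas = rng.uniform(1.0, 2.0, size=K)

# rank-K symmetric tensor T = sum_k eta_k beta_k ⊗ beta_k ⊗ beta_k
T = sum(etas[k] * np.einsum("a,b,c->abc", betas[k], betas[k], betas[k]) for k in range(K))

x = rng.standard_normal(p)
X = np.einsum("a,b,c->abc", x, x, x)          # cubic sketch x ⊗ x ⊗ x

lhs = np.sum(T * X)                           # <T, X>
rhs = np.sum(etas * (betas @ x) ** 3)         # sum_k eta_k (x' beta_k)^3
assert np.isclose(lhs, rhs)
```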

Reduced Non-symmetric Tensor Recovery Model
For the non-symmetric tensor recovery model,
$$ y_i = \Big\langle \sum_{k=1}^{K} \eta_k\, \beta_{1k} \otimes \beta_{2k} \otimes \beta_{3k},\; u_i \otimes v_i \otimes w_i \Big\rangle + \epsilon_i = \underbrace{\sum_{k=1}^{K} \eta_k\, (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k})}_{\text{non-linear}} + \epsilon_i. $$
Connection with the compressed image transmission model. New goal: recover $\{\eta_k, \beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^{K}$.

Empirical Risk Minimization
Consider empirical risk minimization
$$ \widehat{T} = \arg\min_{\{\eta_k, \beta_k\}} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \eta_k (x_i^\top \beta_k)^3 \Big)^2}_{L_1(\eta, \beta)} $$
$$ \widehat{T} = \arg\min_{\{\eta_k, \beta_{jk}\}} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \eta_k (u_i^\top \beta_{1k})(v_i^\top \beta_{2k})(w_i^\top \beta_{3k}) \Big)^2}_{L_2(\eta, \beta)} $$
Difficulties: non-convex optimization! The non-convexity comes from the cube structure or from tri-convexity.
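
A hedged NumPy sketch of the symmetric loss $L_1$; the function name and array shapes are our own assumptions, not the authors' code. Here `y` has shape `(n,)`, `x_obs` shape `(n, p)`, `etas` shape `(K,)`, and `betas` shape `(K, p)`.

```python
import numpy as np

def empirical_risk_symmetric(y, x_obs, etas, betas):
    """L_1(eta, beta) = sum_i ( y_i - sum_k eta_k (x_i' beta_k)^3 )^2."""
    cubes = (x_obs @ betas.T) ** 3     # shape (n, K): entries (x_i' beta_k)^3
    residuals = y - cubes @ etas       # shape (n,)
    return float(np.sum(residuals ** 2))
```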

Our Contributions
1. An efficient two-stage implementation for the non-convex optimization problem.
2. A non-asymptotic analysis providing the optimal estimation rate.

Two-stage Implementation

Main Algorithm

Initial Step: Non-symmetric unbiased estimator
Construct an unbiased empirical-moment-based tensor $\widetilde{T}(y_i, X_i) \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ as follows:
$$ \widetilde{T} := \frac{1}{n} \sum_{i=1}^{n} y_i\, u_i \otimes v_i \otimes w_i, $$
which depends only on the observations. This is used for the non-symmetric case, and $u_i, v_i, w_i$ are independent standard Gaussian random vectors.

Initial Step: Symmetric unbiased estimator
Construct an unbiased empirical-moment-based tensor $\widetilde{T}_s(y_i, X_i) \in \mathbb{R}^{p \times p \times p}$ as follows:
$$ \widetilde{T}_s := \frac{1}{6}\Big[ \frac{1}{n} \sum_{i=1}^{n} y_i\, x_i \otimes x_i \otimes x_i - U \Big], $$
which depends only on the observations, where the bias term is $U = \sum_{j=1}^{p} \big( m_1 \otimes e_j \otimes e_j + e_j \otimes m_1 \otimes e_j + e_j \otimes e_j \otimes m_1 \big)$ and $m_1 = \frac{1}{n} \sum_{i=1}^{n} y_i x_i$. Here $\{e_j\}_{j=1}^{p}$ are the canonical basis vectors in $\mathbb{R}^p$. The bias term $U$ is due to the correlation among the three identical Gaussian random vectors.
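
A minimal NumPy sketch of both moment estimators under assumed array shapes (our own illustrative implementation, not the authors' code): the non-symmetric $\widetilde{T}$ is a plain weighted average of $u_i \otimes v_i \otimes w_i$, while the symmetric $\widetilde{T}_s$ subtracts the bias term $U$ before dividing by 6.

```python
import numpy as np

def nonsymmetric_moment_estimator(y, u, v, w):
    """T_tilde = (1/n) sum_i y_i u_i ⊗ v_i ⊗ w_i, with u, v, w of shapes (n, p1), (n, p2), (n, p3)."""
    n = y.shape[0]
    return np.einsum("i,ia,ib,ic->abc", y, u, v, w) / n

def symmetric_moment_estimator(y, x_obs):
    """T_tilde_s = (1/6) * [ (1/n) sum_i y_i x_i ⊗ x_i ⊗ x_i  -  U ]."""
    n, p = x_obs.shape
    m3 = np.einsum("i,ia,ib,ic->abc", y, x_obs, x_obs, x_obs) / n   # third empirical moment
    m1 = (y[:, None] * x_obs).mean(axis=0)                           # m_1 = (1/n) sum_i y_i x_i
    eye = np.eye(p)
    # U = sum_j ( m1 ⊗ e_j ⊗ e_j + e_j ⊗ m1 ⊗ e_j + e_j ⊗ e_j ⊗ m1 )
    U = (np.einsum("a,bc->abc", m1, eye)
         + np.einsum("b,ac->abc", m1, eye)
         + np.einsum("c,ab->abc", m1, eye))
    return (m3 - U) / 6.0
```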

Initial Step: Decompose unbiased estimator
Intuition: $\mathbb{E}[\widetilde{T}_s] = T$; the observation is $\widetilde{T}_s$; the noise $E = \widetilde{T}_s - \mathbb{E}(\widetilde{T}_s)$ is the approximation error. Decompose $\widetilde{T}_s$ to obtain $\{\eta_k^{(0)}, \beta_k^{(0)}\}$, or decompose $\widetilde{T}$ to obtain $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}$, through sparse tensor decomposition (see the next slide for details). The result is far from the optimal estimator, but good enough as a warm start.

Sparse Tensor Decomposition for Symmetric Tensor
Generate $L$ starting points $\{\beta_l^{\mathrm{start}}\}_{l=1}^{L}$. For each starting point, compute a non-sparse factor of the moment-based $\widetilde{T}_s$ via the symmetric tensor power update
$$ \beta_l^{(t+1)} = \frac{\widetilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)}}{\big\| \widetilde{T}_s \times_2 \beta_l^{(t)} \times_3 \beta_l^{(t)} \big\|_2}, $$
where for $T_s \in \mathbb{R}^{p \times p \times p}$ and $x \in \mathbb{R}^{p}$ we define $T_s \times_2 x \times_3 x := \sum_{j,l} x_j x_l\, [T_s]_{:,j,l}$. Get a sparse solution $\bar{\beta}_l^{(t+1)}$ via thresholding or truncation. Cluster the $L$ single-component solutions $\{\bar{\beta}_l^{(T)}\}_{l=1}^{L}$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_k^{(0)}\}_{k=1}^{K}$. Different from matrix SVD due to non-orthogonality.
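
A hedged sketch of one symmetric power-update step with hard truncation. Function and variable names are ours, and the re-normalization after truncation is an assumption about how the sparsification is applied; treat this as an illustration of the recipe, not the authors' implementation.

```python
import numpy as np

def truncate_top_s(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def symmetric_power_step(T_s, beta, s):
    """One update: contract T_s along modes 2 and 3 with beta, normalize, then truncate."""
    contracted = np.einsum("ajl,j,l->a", T_s, beta, beta)   # T_s x_2 beta x_3 beta
    beta_new = contracted / np.linalg.norm(contracted)
    beta_new = truncate_top_s(beta_new, s)
    return beta_new / np.linalg.norm(beta_new)              # re-normalize after truncation (assumption)
```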

Sparse Tensor Decomposition for Non-symmetric Tensor
Generate $L$ starting points $\{\beta_{1l}^{\mathrm{start}}, \beta_{2l}^{\mathrm{start}}, \beta_{3l}^{\mathrm{start}}\}_{l=1}^{L}$. For each starting point, compute a non-sparse factor of the moment-based $\widetilde{T}$ via the alternating tensor power update
$$ \beta_{1l}^{(t+1)} = \frac{\widetilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)}}{\big\| \widetilde{T} \times_2 \beta_{2l}^{(t)} \times_3 \beta_{3l}^{(t)} \big\|_2}; $$
the updates for $\beta_{2l}^{(t+1)}$ and $\beta_{3l}^{(t+1)}$ are similar. Get a sparse solution $\{\bar{\beta}_{1l}^{(t+1)}, \bar{\beta}_{2l}^{(t+1)}, \bar{\beta}_{3l}^{(t+1)}\}$ via thresholding or truncation. Cluster the $L$ single-component solutions $\{\bar{\beta}_{1l}^{(T)}, \bar{\beta}_{2l}^{(T)}, \bar{\beta}_{3l}^{(T)}\}_{l=1}^{L}$ into $K$ clusters to obtain a rank-$K$ decomposition $\{\beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^{K}$.

Gradient Update: Thresholded Gradient Descent
Input the initial estimator $\{\eta_k^{(0)}, \beta_k^{(0)}\}_{k=1}^{K}$. In each iteration, update $\{\beta_k\}_{k=1}^{K}$ as
$$ \tilde{\beta}_k^{(t+1)} = \beta_k^{(t)} - \frac{\mu_t}{\phi}\, \nabla_{\beta_k} L_1\big( \eta_k^{(0)}, \beta_k^{(t)} \big), $$
where $\phi = \frac{1}{n} \sum_{i=1}^{n} y_i^2$ and $\mu_t$ is the step size. Sparsify the current update by thresholding: $\beta_k^{(t+1)} = \varphi_\rho\big( \tilde{\beta}_k^{(t+1)} \big)$. Normalize the final update, $\hat{\beta}_k^{(T)} = \beta_k^{(T)} / \|\beta_k^{(T)}\|_2$, and update the weight $\hat{\eta}_k = \eta_k^{(0)} \|\beta_k^{(T)}\|_2^3$. (An alternating update is used for non-symmetric tensor recovery.)
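
A hedged one-iteration sketch for the symmetric case. The gradient formula $\nabla_{\beta_k} L_1 = -6\, \eta_k^{(0)} \sum_i r_i\, (x_i^\top \beta_k)^2\, x_i$ (with residual $r_i$) is our own differentiation of $L_1$ as written on the previous slide; the names `mu_t`, `rho`, `phi` follow the slide, while $\varphi_\rho$ is implemented here as simple hard thresholding, which is an assumption.

```python
import numpy as np

def hard_threshold(v, rho):
    """phi_rho: zero out entries whose magnitude is below rho (one possible choice)."""
    return np.where(np.abs(v) >= rho, v, 0.0)

def thresholded_gradient_iteration(y, x_obs, etas0, betas, mu_t, rho):
    """One gradient + thresholding pass over all K factors, with eta fixed at eta^(0)."""
    phi = np.mean(y ** 2)                      # phi = (1/n) sum_i y_i^2
    proj = x_obs @ betas.T                     # (n, K): x_i' beta_k
    residual = y - (proj ** 3) @ etas0         # (n,): r_i = y_i - sum_k eta_k (x_i' beta_k)^3
    new_betas = betas.copy()
    for k in range(len(etas0)):
        # gradient of L_1 with respect to beta_k at the current point
        grad_k = -6.0 * etas0[k] * (x_obs.T @ (residual * proj[:, k] ** 2))
        new_betas[k] = hard_threshold(betas[k] - (mu_t / phi) * grad_k, rho)
    return new_betas
```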

Gradient Update: Thresholded Gradient Descent
Input the initial estimator $\{\eta_k^{(0)}, \beta_{1k}^{(0)}, \beta_{2k}^{(0)}, \beta_{3k}^{(0)}\}_{k=1}^{K}$. In each iteration, alternately update $\{\beta_{1k}, \beta_{2k}, \beta_{3k}\}_{k=1}^{K}$ as
$$ \tilde{\beta}_{1k}^{(t+1)} = \beta_{1k}^{(t)} - \frac{\mu_t}{\phi}\, \nabla_{\beta_{1k}} L_2\big( \eta_k^{(0)}, \beta_{1k}^{(t)}, \beta_{2k}^{(t)}, \beta_{3k}^{(t)} \big), $$
where $\phi = \frac{1}{n} \sum_{i=1}^{n} y_i^2$ and $\mu_t$ is the step size; the updates for $\beta_{2k}^{(t+1)}$ and $\beta_{3k}^{(t+1)}$ are similar. Sparsify the current update by thresholding: $\beta_{jk}^{(t+1)} = \varphi_\rho\big( \tilde{\beta}_{jk}^{(t+1)} \big)$ for $j = 1, 2, 3$. Normalize the final update, $\hat{\beta}_{jk}^{(T)} = \beta_{jk}^{(T)} / \|\beta_{jk}^{(T)}\|_2$, and update the weight $\hat{\eta}_k = \eta_k^{(0)} \|\beta_{1k}^{(T)}\|_2 \|\beta_{2k}^{(T)}\|_2 \|\beta_{3k}^{(T)}\|_2$. (This is the alternating update for non-symmetric tensor recovery.)

Non-asymptotic Analysis

Non-asymptotic Upper Bound
Theorem. Suppose some regularity conditions on the true tensor parameter hold, and assume $n \ge C_0 s^2 \log p$ for some large constant $C_0$. Denote
$$ Z^{(t)} = \sum_{k=1}^{K} \big\| \sqrt[3]{\hat{\eta}_k}\, \hat{\beta}_k^{(t)} - \sqrt[3]{\eta_k}\, \beta_k \big\|_2^2. $$
For any $t = 0, 1, 2, \ldots$, the factor-wise estimator satisfies
$$ Z^{(t+1)} \le \underbrace{\kappa\, Z^{(t)}}_{\text{computational error}} + \underbrace{C_1\, \eta_{\min}^{-4/3}\, \sigma^2\, \frac{s \log p}{n}}_{\text{statistical error}} $$
with high probability, where $\kappa$ is a contraction parameter between 0 and 1, $\eta_{\min} = \min_k \{\eta_k\}$, $\sigma$ is the noise level, and $C_0, C_1$ are absolute constants.

Remarks
An interesting characterization of computational error versus statistical error. Geometric convergence to the truth in the noiseless case, and a minimax-optimal statistical rate (shown later). The error bound is dominated by the computational error in the first several iterations and then by the statistical error, which gives a useful guideline for choosing the stopping rule.

Remarks
When $t \ge T$ for some large enough $T$, the final estimator satisfies
$$ \big\| \widehat{T}^{(T)} - T \big\|_F^2 \le C \sigma^2\, \frac{K s \log p}{n} $$
with high probability: the minimax optimal rate!

Class of Sparse and Low-rank Tensors
Sparse CP decomposition:
$$ T = \sum_{k=1}^{K} \beta_k \otimes \beta_k \otimes \beta_k, \qquad \|\beta_k\|_0 \le s \ \text{ for } k \in [K]. $$
Incoherence condition (nearly orthogonal): the components of the true tensor are incoherent in the sense that
$$ \max_{i \ne j \in [K]} |\langle \beta_i, \beta_j \rangle| \le \frac{C}{\sqrt{s}}. $$

Minimax Lower Bound
Theorem. Consider the class of tensors satisfying the sparse CP decomposition and the incoherence condition. Suppose we sample via cubic measurements with i.i.d. standard normal sketches and i.i.d. $N(0, \sigma^2)$ noise. Then we have the following lower bound on the recovery loss for this class of low-rank tensors:
$$ \inf_{\widehat{T}} \sup_{T \in \mathcal{F}} \mathbb{E}\, \big\| \widehat{T} - T \big\|_F^2 \ge c\, \sigma^2\, \frac{K s \log(e p / s)}{n}. $$

Optimal Estimation Rate
Theorem. Consider the class of tensors $\mathcal{F}_{p,K,s}$ satisfying the sparse CP decomposition and the incoherence condition. Suppose we observe $n$ samples $\{y_i, X_i\}_{i=1}^{n}$ from the symmetric cubic sketching model, where $n \ge C s^2 \log p$ for some large constant $C$. Then the estimator $\widehat{T}$ achieves
$$ \inf_{\widehat{T}} \sup_{T \in \mathcal{F}_{p,K,s}} \mathbb{E}\, \big\| \widehat{T} - T \big\|_F^2 \asymp \underbrace{\sigma^2\, \frac{K s \log(p/s)}{n}}_{R} $$
when $\log p \asymp \log(p/s)$. Here $\sigma$ is the noise level.

Remarks
Our analysis is non-asymptotic and our estimator is rate-optimal. In general there is a trade-off: $R$ is the outcome of a statistical-error versus optimization-error trade-off. A similar argument holds for the non-symmetric case, though different technical tools are used. To overcome the obstacle posed by high-order Gaussian random variables, we develop a novel high-order concentration inequality using a truncation argument and the $\psi_\alpha$-norm.

Application to Interaction Effect Model
Given the response $y \in \mathbb{R}^{n}$ and covariates $X \in \mathbb{R}^{n \times p}$, the regression model with three-way interactions can be formulated as
$$ y_l = \beta_0 + \sum_{i=1}^{p} X_{li} \beta_i + \sum_{i,j=1}^{p} \gamma_{ij} X_{li} X_{lj} + \sum_{i,j,k=1}^{p} \eta_{ijk} X_{li} X_{lj} X_{lk} + \epsilon_l = \big\langle B, \widetilde{X}_l \otimes \widetilde{X}_l \otimes \widetilde{X}_l \big\rangle + \epsilon_l, $$
where $\widetilde{X}_l = (1, X_l) \in \mathbb{R}^{p+1}$.
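
A tiny illustrative check (made-up data, our own index choices) that the cubic tensor of the augmented covariate $\widetilde{X}_l = (1, X_l)$ contains the intercept, main-effect, two-way, and three-way interaction terms among its entries, which is what makes the tensor form above possible.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
X_l = rng.standard_normal(p)
X_tilde = np.concatenate(([1.0], X_l))                        # (1, X_l), length p + 1
design = np.einsum("a,b,c->abc", X_tilde, X_tilde, X_tilde)   # (p+1, p+1, p+1) design tensor

assert np.isclose(design[0, 0, 0], 1.0)                        # intercept slot
assert np.isclose(design[0, 0, 1], X_l[0])                     # main effect of X_l1
assert np.isclose(design[0, 1, 2], X_l[0] * X_l[1])            # two-way interaction
assert np.isclose(design[1, 2, 3], X_l[0] * X_l[1] * X_l[2])   # three-way interaction
```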

Application to Interaction Effect Model
It is reasonable to assume that $B$ possesses low-rank and/or sparsity structures in some biometrics studies (Hung et al., 2016):
$$ y_l = \Big\langle \sum_{k=1}^{K} \eta_k\, \beta_k \otimes \beta_k \otimes \beta_k,\; \widetilde{X}_l \otimes \widetilde{X}_l \otimes \widetilde{X}_l \Big\rangle + \varepsilon_l. $$
The symmetric tensor recovery model can thus be treated as a high-order interaction effect model.

Some Changes for the Algorithm
The unbiased empirical-moment-based tensor $A \in \mathbb{R}^{(p+1) \times (p+1) \times (p+1)}$ for the interaction effect model is constructed as follows. Define three quantities
$$ a = \frac{1}{n} \sum_{l=1}^{n} y_l \widetilde{X}_l, \quad \widetilde{A} = \frac{1}{n} \sum_{l=1}^{n} y_l\, \widetilde{X}_l \otimes \widetilde{X}_l \otimes \widetilde{X}_l, \quad \bar{A} = \frac{1}{6}\Big( \widetilde{A} - \sum_{j=1}^{p} \big( a \otimes e_j \otimes e_j + e_j \otimes a \otimes e_j + e_j \otimes e_j \otimes a \big) \Big). $$
For $i, j, k \ne 0$, set $A_{ijk} = \bar{A}_{ijk}$. For $i \ne 0$, $A_{0,0,i} = \frac{1}{3} \widetilde{A}_{0,0,i} - \frac{1}{6}\big( \sum_{k=1}^{p} \widetilde{A}_{k,k,i} - (p+2)\, a_i \big)$. And $A_{0,0,0} = \frac{1}{2p-2}\big( \sum_{k=1}^{p} \widetilde{A}_{0,k,k} - (p+2)\, \widetilde{A}_{0,0,0} \big)$. The additional intercept term changes the model structure dramatically.

Numerical Study
Recover a low-rank and sparse symmetric tensor $T \in \mathbb{R}^{p \times p \times p}$ from $y_i = \langle T, X_i \rangle + \epsilon_i$, $i = 1, \ldots, n$. Proportion of non-zero elements in each factor: $s = 0.3$. Tensor CP rank: $K = 3$. Replications: 200. Stopping rule for initialization: $\|\beta_m^{(l+1)} - \beta_m^{(l)}\|_2 \le 10^{-6}$. Stopping rule for the gradient update: $\|B^{(T+1)} - B^{(T)}\|_F \le 10^{-6}$. The dimension, sample size, and noise level vary across scenarios.

Left panel: relative error of the initialization. Right panel: percentage of successful recovery with varying sample size. Both panels are for the noiseless case with $p = 30$.

Noisy case. Left panel: absolute error for recovering a rank-three tensor with varying noise level and sample size. Right panel: absolute error for recovering a rank-five tensor with varying noise level and sample size.

Thanks! Any questions?