Sparse Learning and Distributed PCA. Jianqing Fan
1 Sparse Learning and Distributed PCA, with control of statistical errors and computing resources. Jianqing Fan, Princeton University
2 Coauthors: Han Liu, Qiang Sun, Tong Zhang, Dong Wang, Kaizheng Wang, Ziwei Zhu
3 Outline. Computational resources and statistical errors. Sparse learning: complexity and stat error. Distributed PCA: communication and stat error. Conclusion.
4 Introduction
5 Data Tsunami. Large and complex data arise routinely. Biological Sci.: genomics, genetics, neuroscience, medicine. Engineering: machine learning, surveillance videos, social media, networks. Natural Sci.: astronomy, earth sciences, meteorology. Social Sci.: economics, finance, business, and digital humanities.
6 Big Data are ubiquitous. [Figure: Big Data at the center of Medicine, Internet, Business, Science, Finance, Biological Science, Digital Humanities, Government, and Engineering, with data volumes in exabytes (EB) and zettabytes (ZB)]
7 Deep Impact. System: storage, communication, computation architectures. Analysis: statistics, computation, optimization, privacy.
8 Data Science. [Figure: data science pipeline: Acquisition & Storage, Computing, Analysis, Applications]
9 Statisticians' Dream: efficient methods, computable in polynomial time and communicable.
10 About this talk. Control: statistical error + computing resources. Sparse learning technique: folded concave penalty (complexity). Distributed PCA approach: divide-and-conquer (communication).
11 Sparse Learning, with control of computational complexity and statistical error. Han Liu, Qiang Sun, Tong Zhang.
12 A General Framework for Sparsity Learning. Penalized M-estimator (Fan & Li, 01; Negahban et al., 12): $\hat\beta = \arg\min_{\beta \in \mathbb{R}^d} \{ L(\beta) + P_\lambda(\beta) \}$. $L(\beta)$: quadratic, GLIM, Gaussian pseudo-likelihood, Huber loss, etc. $P_\lambda(\cdot)$: Lasso, folded concave: SCAD or MCP. [Figure: penalty functions $P_\lambda(\beta_j)$ versus $\beta_j$ for Lasso, SCAD, and MCP]
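These penalties enter the algorithms below only through their derivatives $p_\lambda'$. A minimal numpy sketch of the three derivatives, which is all the LLA scheme later in the talk needs; the function names and the default values of $a$ are our choices, not the slides':

```python
import numpy as np

# Derivatives p'_lambda(t) for t >= 0; these become the LLA weights below.

def dlasso(t, lam):
    """Lasso: p_lambda(t) = lam * t, so p'_lambda(t) = lam everywhere."""
    return lam * np.ones_like(t)

def dscad(t, lam, a=3.7):
    """SCAD derivative (Fan & Li, 2001): constant lam near 0, then linearly
    decaying, and exactly 0 for t >= a*lam (no bias on large coefficients)."""
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def dmcp(t, lam, a=3.0):
    """MCP derivative (Zhang, 2010): lam - t/a, clipped at 0 for t >= a*lam."""
    return np.maximum(lam - t / a, 0.0)
```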
13 Convex and Folded-Concave Regularization. Lasso is fast to compute, but introduces estimation bias. Inferior estimation rate: $\|\hat\beta - \beta^*\|_2 \asymp \sqrt{s\log d / n}$ (Bickel et al., 09; Negahban et al., 12; ...). Folded-concave estimators have oracle properties. Oracle estimation rate: $\|\hat\beta - \beta^*\|_2 \asymp \sqrt{s/n}$, with beta-min condition. Obtain the oracle estimator with high probability, but require heavy computation.
15 Nonconvex Penalized M-estimator. Theoretical properties hold for the global optimum (or a specific local optimum), not obtainable in practice. Local min: Fan & Li (2001), Kim et al. (2008), Fan & Lv (2011), Wang, Liu, Zhang (2014), Loh & Wainwright (2014), ... Global min: Zhang & Zhang (2012), Kim & Kwon (2012), Negahban et al. (2012), Agarwal et al. (2012), ... Practical algorithms without theoretical guarantees: coordinate descent (Friedman et al., 08; Breheny & Huang, 11). What is computed is not what is proved.
17 Algorithmic Approach to Sparse Learning
18 Folded Concave Penalized Regression: $\arg\min L(\beta) + \sum_{j=1}^d p_\lambda(|\beta_j|)$. Given a good initial estimator $\hat\beta^{(0)}$, LQA and LLA: LQA: $p_\lambda(|\beta_j|) \approx p_\lambda(|\hat\beta_j^{(0)}|) + 0.5\,\{p_\lambda'(|\hat\beta_j^{(0)}|)/|\hat\beta_j^{(0)}|\}\,(\beta_j^2 - (\hat\beta_j^{(0)})^2)$ (Fan & Li, 01). LLA: $p_\lambda(|\beta_j|) \approx p_\lambda(|\hat\beta_j^{(0)}|) + p_\lambda'(|\hat\beta_j^{(0)}|)(|\beta_j| - |\hat\beta_j^{(0)}|)$ (Zou & Li, 2008). Iterative use is an MM algorithm (Hunter & Li, 05). [Figure: LQA and LLA majorizations of the penalty]
19 Local Linear Approximation. Given $\hat\beta^{(0)} = 0$, compute $\hat\beta^{(1)} = \arg\min \{ L(\beta) + \sum_{j=1}^d \lambda_j^{(0)} |\beta_j| \}$ with $\lambda_j^{(0)} = p_\lambda'(|\hat\beta_j^{(0)}|)$, then $\hat\beta^{(2)} = \arg\min \{ L(\beta) + \sum_{j=1}^d \lambda_j^{(1)} |\beta_j| \}$ with $\lambda_j^{(1)} = p_\lambda'(|\hat\beta_j^{(1)}|)$, and so on. How to solve each step? When to stop?
20 A Scalable Solution. Isotropic LQA: $L(\hat\beta^{(0)}) + \langle \nabla L(\hat\beta^{(0)}),\, \beta - \hat\beta^{(0)} \rangle + \frac{\phi}{2}\|\beta - \hat\beta^{(0)}\|^2$. Avoids storage and computation of the Hessian. Analytic solution with the LLA penalty $\sum_{j=1}^d w_j |\beta_j|$. Related to (proximal) gradient methods (Nesterov, 83, 13; Tseng, 08; Boyd & Vandenberghe, 09; Agarwal et al.); used by Loh and Wainwright, 14; Wang, Liu, Zhang, 14. Global majorization requires $\phi \ge \|\nabla^2 L(\hat\beta^{(0)})\|$, but majorization is required only locally, which is solved by line search.
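A minimal sketch of one LAMM step, assuming the quadratic loss $L(\beta) = \|y - X\beta\|^2/(2n)$ as a concrete case (the slides allow general $L$). The isotropic quadratic plus weighted $\ell_1$ subproblem separates across coordinates and is solved by soft-thresholding, and $\phi$ is doubled until the local majorization check passes; function names are ours:

```python
import numpy as np

def soft_threshold(z, w):
    """Coordinatewise solution of min_b 0.5*(b - z)^2 + w*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

def lamm_step(beta, grad_fn, loss_fn, weights, phi=1.0, max_doublings=50):
    """One LAMM update: minimize the isotropic LQA plus weighted l1 penalty,
    doubling phi (line search) until majorization holds at the new point."""
    g = grad_fn(beta)
    for _ in range(max_doublings):
        new = soft_threshold(beta - g / phi, weights / phi)
        # Local majorization check: loss at new point <= isotropic quadratic.
        quad = (loss_fn(beta) + g @ (new - beta)
                + 0.5 * phi * np.sum((new - beta) ** 2))
        if loss_fn(new) <= quad + 1e-12:
            return new, phi
        phi *= 2.0
    return new, phi
```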
22 (Local Adaptive) MM Algorithm. MM Principle: to minimize a general function $f(\beta)$, majorize $f(\beta)$ by $g_k(\beta)$ at the point $\beta^{(k)}$, i.e. $f(\beta^{(k)}) = g_k(\beta^{(k)})$ and $f(\beta) \le g_k(\beta)$, then compute $\beta^{(k+1)} = \arg\min_\beta g_k(\beta)$. [Figure: LQA and LLA as majorizers] Property: the target values decrease: $f(\beta^{(k+1)}) \le g_k(\beta^{(k+1)}) \le g_k(\beta^{(k)}) = f(\beta^{(k)})$; the first inequality is only a local requirement, needed at $\beta^{(k+1)}$ alone.
23 Solution to General Sparse Learning. Idea: iteratively refine the estimator via convex programs: $\hat\beta^{(0)} \to \hat\beta^{(1)} \to \hat\beta^{(2)} \to \cdots \to \hat\beta^{(T)}$, where $\hat\beta^{(l)} = \arg\min_{\beta \in \mathbb{R}^d} \{ L(\beta) + \|\lambda^{(l-1)} \circ \beta\|_1 \}$, $l = 1, \ldots, T$. LLA applied to the penalty: $\lambda^{(l-1)} = (p_\lambda'(|\hat\beta_1^{(l-1)}|), \ldots, p_\lambda'(|\hat\beta_d^{(l-1)}|))^T$. Each step is solved approximately by iterative use of LAMM.
24 Algorithm with Controlled Complexity. Contraction stage: $l = 1$, with optimization error $\varepsilon_c$. Tightening stage: $2 \le l \le T$, with optimization error $\varepsilon_t$. $\varepsilon_c \gg \varepsilon_t$: for Gaussian noise, $\varepsilon_c \asymp \sqrt{n^{-1}\log d}$ and $\varepsilon_t \asymp n^{-1}$. [Diagram: within each stage $l$, LAMM iterates $\hat\beta^{(l,0)} = \hat\beta^{(l-1)} \to \hat\beta^{(l,1)} \to \cdots \to \hat\beta^{(l,k_l)} = \hat\beta^{(l)}$, with $k_1 \asymp \varepsilon_c^{-2}$ and $k_l \asymp \log(1/\varepsilon_t)$ for $l \ge 2$.] $T \asymp \log(\lambda\sqrt{n}) \asymp \log(\log d)$.
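Combining the pieces, a sketch of the two-stage I-LAMM loop under the quadratic-loss assumption, reusing `dscad` and `lamm_step` from the sketches above; fixed iteration budgets stand in for the exact $\varepsilon_c$/$\varepsilon_t$ stopping rules:

```python
import numpy as np

def ilamm(X, y, lam, T=10, k_contract=200, k_tighten=30):
    """Two-stage I-LAMM sketch: one contraction stage (l = 1, crude accuracy
    eps_c), then T-1 tightening stages (accuracy eps_t). Each stage is a
    weighted-l1 program solved by repeated LAMM updates."""
    n, d = X.shape
    loss_fn = lambda b: np.sum((y - X @ b) ** 2) / (2 * n)
    grad_fn = lambda b: X.T @ (X @ b - y) / n

    beta = np.zeros(d)
    weights = dscad(np.abs(beta), lam)   # p'_lam(0) = lam: plain Lasso first
    for l in range(T):
        k = k_contract if l == 0 else k_tighten
        phi = 1.0
        for _ in range(k):
            beta, phi = lamm_step(beta, grad_fn, loss_fn, weights, phi)
        weights = dscad(np.abs(beta), lam)   # LLA: refresh penalty weights
    return beta
```

Because $p_\lambda'(0) = \lambda$ for SCAD, the contraction stage is exactly a Lasso fit, matching the LLA scheme initialized at $\hat\beta^{(0)} = 0$.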
25 Algorithmic Convergence: Phase Transition. Contraction stage: sub-linear convergence $1/\sqrt{k}$; not a strongly convex problem, but it gives a sparse solution. [Figure: computational rate for constant correlation design; log error versus iteration count for the contraction stage ($l = 1$) and tightening stages ($l = 2, 3, 4$)] Tightening stage: linear convergence $\rho^k$; strongly convex.
26 Effects on Statistical Errors. Contraction region: sparse strong convexity and smoothness. [Figure: iterates $\hat\beta^{(0)}, \hat\beta^{(1)}, \ldots, \hat\beta^{(T)}$ contracting toward $\beta^*$ within a radius of order $\sqrt{s/n}$] $\|\hat\beta^{(l)} - \beta^*\|_2 \le \underbrace{C\sqrt{s/n}}_{\text{stat error}} + \delta\,\|\hat\beta^{(l-1)} - \beta^*\|_2$, for a $\delta \in (0,1)$.
27 Theoretical Properties
28 Localized Sparse Eigenvalue. Definition (Localized Sparse Eigenvalue): $\rho_+(m,r) = \sup_{u,\beta}\{ u_J^T \nabla^2 L(\beta)\, u_J : \|u_J\|_2^2 = 1,\ |J| \le m,\ \|\beta - \beta^*\|_2 \le r \}$; $\rho_-(m,r)$ is defined by taking the inf. Assumption 1 (LSE Condition): let $s = \|\beta^*\|_0$. There exists $r \gtrsim \lambda\sqrt{s}$ such that $0 < \rho_* \le \rho_-(6s,r) < \rho_+(6s,r) \le \rho^* < +\infty$. Other assumptions: $\min_{j \in S} |\beta_j^*| \gtrsim \lambda$, and $p_\lambda'(\alpha\lambda) = 0$ for large $\alpha$.
30 Statistical Theory. Theorem (Optimal Statistical Rate). Contracting property: there exists a $\delta \in (0,1)$ such that $\|\hat\beta^{(l)} - \beta^*\|_2 \le C\sqrt{s/n} + \delta\,\|\hat\beta^{(l-1)} - \beta^*\|_2$ for $l \ge 2$. If we take $T \gtrsim \log(\lambda\sqrt{n})$, then $\|\hat\beta^{(T)} - \beta^*\|_2 \lesssim \sqrt{s/n}$. The result is deterministic: $(\|\nabla L(\beta^*)\|_\infty + \varepsilon_c) \lesssim \lambda \lesssim r/\sqrt{s}$. For $l = 1$, $\|\hat\beta^{(1)} - \beta^*\|_2 \lesssim \lambda\sqrt{s} \asymp \sqrt{s\log(d)/n}$, not enough for post-Lasso.
32 Computational Theory. Theorem (Algorithmic Complexity). The total number of I-LAMM iterations needed is $C_1\,\varepsilon_c^{-2} + (T-1)\,C_1'\log(\varepsilon_t^{-1})$; with the choices above, the number of LAMM iterations is $O(n/\log d + \log(\log d)\log n)$. The initial (contraction) estimator is cruder, but dominates the computation time.
34 Conclusion. I-LAMM provides a statistical optimization algorithm for solving nonconvex optimization problems. Examines the effects of iteration and optimization error. Proposes weaker conditions, which require a localized analysis. All approximate solutions of I-LAMM enjoy the optimal statistical convergence rate with controlled algorithmic complexity.
35 Distributed PCA. Dong Wang, Kaizheng Wang, Ziwei Zhu.
36 Principal Component Analysis. Fundamental tool for statistical machine learning: dimension reduction, data summarization, latent factors. Rate of convergence: $\|\hat v_1 - v_1\|_2 = O_P\big(\sqrt{d/(n\lambda)}\big)$, where $d$ is the dimension, $n$ the sample size, and $\lambda$ the eigengap (Paul, 07; Jung et al., 09; Johnstone and Lu, 12; Shen et al., 16; Wang and Fan, 17). What about distributed PCA?
38 Problem Setup. $N$ samples stored on $m$ servers, i.i.d. sub-Gaussian with covariance $\Sigma$ (generalized to $\{\Sigma^{(l)}\}_{l=1}^m$). Spectral decomposition: $\Sigma = V\Lambda V^T$. Interest: the column space of $V_K$. Error: $\rho(\hat V_K, V_K) = \|\hat V_K \hat V_K^T - V_K V_K^T\|_F$.
39 Distributed PCA. Communication: $K$ $d$-vectors from each machine. Aggregation: $\tilde V_K = \arg\min_{U_K:\, d \times K \text{ orthonormal}} \frac{1}{m}\sum_{l=1}^m \rho^2(U_K, \hat V_K^{(l)})$. Spill-over: send more than $K$ eigenvectors.
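This aggregation has a closed form: the minimizer $\tilde V_K$ consists of the top-$K$ eigenvectors of the averaged projection matrix $\frac{1}{m}\sum_l \hat V_K^{(l)}(\hat V_K^{(l)})^T$, since $\|U_K U_K^T\|_F^2 = K$ is constant and the objective reduces to maximizing a trace. A minimal numpy sketch (function names ours):

```python
import numpy as np

def local_pca(X, K):
    """Top-K eigenvectors of the local sample covariance (one machine)."""
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return vecs[:, -K:]              # d x K, only K d-vectors communicated

def aggregate(V_list):
    """Minimizer of (1/m) sum_l rho^2(U, V^(l)): top-K eigenvectors of
    the averaged projection matrix (1/m) sum_l V^(l) V^(l)^T."""
    K = V_list[0].shape[1]
    P = sum(V @ V.T for V in V_list) / len(V_list)
    vals, vecs = np.linalg.eigh(P)
    return vecs[:, -K:]

def rho(U, V):
    """Projection distance ||U U^T - V V^T||_F."""
    return np.linalg.norm(U @ U.T - V @ V.T, 'fro')
```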
40 Statistical Error Analysis Bias and Sample Variance
41 Bias and Variance. $V_K^*$: top-$K$ eigenvectors of $\Sigma^* = E[\hat V_K^{(l)}(\hat V_K^{(l)})^T]$. Then $\rho(\tilde V_K, V_K) \le \underbrace{\rho(\tilde V_K, V_K^*)}_{\text{sample variance term}} + \underbrace{\rho(V_K^*, V_K)}_{\text{bias term}}$. Sample variance term: $\frac{1}{m}\sum_{l=1}^m \hat V_K^{(l)} \hat V_K^{(l)T} \to \Sigma^*$ implies $\rho(\tilde V_K, V_K^*) \to 0$.
42 Analysis of Sample Variance. Theorem 1 (variance). Let $r(\Sigma) = (E\|X\|_2)^2/\|\Sigma\|$ (effective rank). Then $\big\| \|\hat V_K^{(l)} \hat V_K^{(l)T} - \Sigma^*\|_F \big\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{n}}\Big)$ and $\|\rho(\tilde V_K, V_K^*)\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{mn}}\Big)$. Here $\|X\|_{\psi_1} \lesssim a_n$ implies $X = O_P(a_n)$ with exponential concentration.
43 Analysis of Bias. $\rho(V_K^*, V_K) = \|V_K^*(V_K^*)^T - V_K V_K^T\|_F$. Comparisons: (1) $\hat\Sigma^{(l)} \xrightarrow{\text{PCA}} \hat V_K^{(l)}(\hat V_K^{(l)})^T \xrightarrow{\text{expectation}} \Sigma^* \xrightarrow{\text{PCA}} V_K^*(V_K^*)^T$; (2) $\hat\Sigma^{(l)} \xrightarrow{\text{expectation}} \Sigma \xrightarrow{\text{PCA}} V_K V_K^T$. When are they the same?
45 Unbiasedness. Definition: $Z$ is symmetric if $Z \stackrel{d}{=} I_j Z$ for all $j$, where $I_j$ negates the $j$-th coordinate. $X$ with covariance $\Sigma = V\Lambda V^T$ has symmetric innovation if $V^T X$ is symmetric. Theorem 2 (unbiasedness). If $X$ has symmetric innovation, then $\Sigma^*$ and $\Sigma$ share the same set of eigenvectors. If $\|\rho(\hat V_K^{(l)}, V_K)\|_{\psi_1} \le 1/4$, then $\rho(V_K^*, V_K) = 0$.
46 Analysis of Bias: General Case. Use Taylor expansion and perturbation: if $\hat V_K^{(l)}$ and $V_K$ are the top-$K$ eigenvectors of $\hat\Sigma^{(l)}$ and $\Sigma$, then $\hat V_K^{(l)}(\hat V_K^{(l)})^T = V_K V_K^T + \text{linear in } (\hat\Sigma^{(l)} - \Sigma) + O_P(\|\hat\Sigma^{(l)} - \Sigma\|^2)$, so $\Sigma^* = V_K V_K^T + O(\|\hat\Sigma^{(l)} - \Sigma\|^2)$. Theorem 3 (general case bias): $\rho(V_K^*, V_K) = O\Big(\frac{\|\Sigma\|^2 K r(\Sigma)}{n(\lambda_K - \lambda_{K+1})^2}\Big)$.
47 Statistical Error Analysis. Theorem 4 (MSE). Assume that $n \gtrsim K r(\Sigma)\big(\|\Sigma\|/(\lambda_K - \lambda_{K+1})\big)^2$. (1) Symmetric dist.: $\|\rho(\tilde V_K, V_K)\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{N}}\Big)$. (2) General dist.: for universal constants $C_1$ and $C_2$, $\|\rho(\tilde V_K, V_K)\|_{\psi_1} \le \underbrace{C_1 \frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{N}}}_{\text{sample variance}} + \underbrace{C_2 \frac{\|\Sigma\|^2 K r(\Sigma)}{n(\lambda_K - \lambda_{K+1})^2}}_{\text{bias}}$. If $m \le C_3/\mathrm{Bias}$, the bias is negligible. Same rate as PCA on the whole data. Extends to heterogeneous data. The required $n$ is sharp.
48 Simulation Study
49 Simulation Setting. $\{x_i \in \mathbb{R}^d\}_{i=1}^{mn}$ i.i.d. $\sim N(0, \Sigma)$, with $\Sigma = \mathrm{diag}(\lambda, \frac{\lambda}{2}, \frac{\lambda}{4}, 1, \ldots, 1)$, $K = 3$, $V_K = (e_1, e_2, e_3)$, and eigengap $\delta = \frac{\lambda}{4} - 1$. When $\lambda = O(d)$, we have $\|\rho(\tilde V_K, V_K)\|_{\psi_1} = O\big(\sqrt{d/(mn\delta)}\big)$. Errors are computed over 100 simulations.
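A sketch of this simulation, reusing `local_pca`, `aggregate`, and `rho` from the earlier sketch; the specific sizes and the 20 replications below are our choices for a quick run (the slides use 100 replications over a grid of $d$, $m$, $n$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam, m, n = 100, 3, 30.0, 10, 500
# Sigma = diag(lam, lam/2, lam/4, 1, ..., 1); eigengap delta = lam/4 - 1
Sigma_diag = np.concatenate([[lam, lam / 2, lam / 4], np.ones(d - 3)])
V_true = np.eye(d)[:, :K]                       # (e1, e2, e3)

errors = []
for _ in range(20):
    # m machines with n samples each, x ~ N(0, diag(Sigma_diag))
    machines = [rng.normal(size=(n, d)) * np.sqrt(Sigma_diag)
                for _ in range(m)]
    V_tilde = aggregate([local_pca(X, K) for X in machines])
    errors.append(rho(V_tilde, V_true))         # invariant to sign/order
print(np.mean(errors))                          # ~ sqrt(d / (m * n * delta))
```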
50 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $m$. [Figure: log(error) versus log(m) for $n = 500, 1000, 2000, 4000$ and $d = 100, 200, 400, 800, \ldots$] $\rho(\tilde V_K, V_K) \propto m^{-1/2}$; also verified: $\rho(\tilde V_K, V_K) \propto n^{-1/2}, d^{1/2}, \delta^{-1/2}$.
51 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $\{d, m, n, \delta\}$. Multiple regression: $\log(\rho(\tilde V_K, V_K)) = \beta_0 + \beta_1\log(d) + \beta_2\log(m) + \beta_3\log(n) + \beta_4\log(\delta) + \varepsilon$. [Table: estimated coefficients $\hat\beta_1$ through $\hat\beta_4$ and multiple $R^2$]
52 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $\{d, m, n, \delta\}$. [Figure: observed and fitted values of $\log(\rho(\tilde V_K, V_K))$]
53 Comparison of Three Approaches 1 DP: Distributed PCA 2 FP: Full sample PCA 3 DP5: Distributed PCA with 5 extra eigenvectors Purpose: Examine spill-over effect
54 Comparison of Three Approaches: $\lambda = 30$. [Figure: log(error) versus log(n) for DP, FP, and DP5 at $m = 5, 10, 20, 50$, comparing the three approaches when $\lambda = 30$] Little spill-over!
55 Conclusion. Distributed PCA is unbiased when the innovation is symmetric. In general, the bias is controllable when $m$ is not too large. Distributed PCA has an aggregated variance effect: it enjoys the same performance as PCA on the full data.
56 The End. Thank You!