Variations on Nonparametric Additive Models: Computational and Statistical Aspects


1 Variations on Nonparametric Additive Models: Computational and Statistical Aspects. John Lafferty, Department of Statistics & Department of Computer Science, University of Chicago.

2 Collaborators: Sivaraman Balakrishnan (CMU), Mathias Drton (Chicago), Rina Foygel (Chicago), Michael Horrell (Chicago), Han Liu (Princeton), Kriti Puniyani (CMU), Pradeep Ravikumar (Univ. Texas, Austin), Larry Wasserman (CMU).

3 Perspective. Even the simplest models can be interesting, challenging, and useful for large, high-dimensional data.

4 Motivation. Great progress has been made on understanding sparsity for high-dimensional linear models. Many problems have clear nonlinear structure. We are interested in purely functional methods for high-dimensional, nonparametric inference, with no basis expansions.

5 Additive Models. [Figures: additive model fits for the bone mineral density data, and for the diabetes data (component functions for Age, BMI, MAP, TC).]

6 Additive Models. Fully nonparametric methods appear hopeless: logarithmic scaling, $p = \log n$ (e.g., Rodeo, Lafferty and Wasserman (2008)). Additive models are a useful compromise: exponential scaling, $p = \exp(n^c)$ (e.g., SpAM, Ravikumar et al. (2009)).

7 Themes of this talk. Variations on additive models enjoy most of the good statistical and computational properties of sparse linear models. Thresholded backfitting algorithms, via subdifferential calculus. RKHS formulations are problematic. A little nonparametricity goes a long way.

8 Outline. Sparse additive models. Nonparametric reduced rank regression. Functional sparse CCA. The nonparanormal. Conclusions.

9 Sparse Additive Models. Ravikumar, Lafferty, Liu and Wasserman, JRSS B (2009). Additive model: $Y_i = \sum_{j=1}^p m_j(X_{ij}) + \varepsilon_i$, $i = 1, \ldots, n$. High dimensional: $n \ll p$, with most $m_j = 0$. Optimization: minimize $\mathbb{E}\bigl(Y - \sum_j m_j(X_j)\bigr)^2$ subject to $\sum_{j=1}^p \sqrt{\mathbb{E}(m_j^2)} \le L_n$ and $\mathbb{E}(m_j) = 0$. Related work by Bühlmann and van de Geer (2009), Koltchinskii and Yuan (2010), Raskutti, Wainwright and Yu (2011).

10 Sparse Additive Models: Geometry. $C = \bigl\{ m \in \mathbb{R}^4 : \sqrt{m_{11}^2 + m_{21}^2} + \sqrt{m_{12}^2 + m_{22}^2} \le L \bigr\}$. [Figure: projections $\pi_{12}C$ and $\pi_{13}C$ of the constraint set, shown for $p = 0.5, 1, 1.5, 2$.]

11 Stationary Conditions. Lagrangian: $\mathcal{L}(f, \lambda, \mu) = \frac{1}{2}\mathbb{E}\bigl(Y - \sum_{j=1}^p m_j(X_j)\bigr)^2 + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}(m_j^2(X_j))}$. Let $R_j = Y - \sum_{k \ne j} m_k(X_k)$ be the $j$th residual. Stationary condition: $m_j - \mathbb{E}(R_j \mid X_j) + \lambda v_j = 0$ a.e., where $v_j \in \partial\sqrt{\mathbb{E}(m_j^2)}$ satisfies $v_j = m_j / \sqrt{\mathbb{E}(m_j^2)}$ if $\mathbb{E}(m_j^2) \ne 0$, and $\mathbb{E}(v_j^2) \le 1$ otherwise.

12 Stationary Conditions. Rewriting, $m_j + \lambda v_j = \mathbb{E}(R_j \mid X_j) \equiv P_j$, so $\bigl(1 + \lambda/\sqrt{\mathbb{E}(m_j^2)}\bigr) m_j = P_j$ if $\sqrt{\mathbb{E}(P_j^2)} > \lambda$, and $m_j = 0$ otherwise. This implies the soft-thresholding solution $m_j = \bigl[1 - \lambda/\sqrt{\mathbb{E}(P_j^2)}\bigr]_+ P_j$.

13 SpAM Backfitting Algorithm. Input: data $(X_i, Y_i)$, regularization parameter $\lambda$. Iterate until convergence; for each $j = 1, \ldots, p$: compute the residual $R_j = Y - \sum_{k \ne j} \hat m_k(X_k)$; estimate the projection $P_j = \mathbb{E}(R_j \mid X_j)$ by smoothing, $\hat P_j = S_j R_j$; estimate the norm $\hat s_j = \sqrt{\mathbb{E}[\hat P_j^2]}$; soft-threshold, $\hat m_j = [1 - \lambda/\hat s_j]_+ \hat P_j$. Output: estimator $\hat m(X_i) = \sum_j \hat m_j(X_{ij})$.
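
A minimal Python sketch of this loop, assuming a Nadaraya-Watson kernel smoother for $S_j$ (any linear smoother would do) and illustrative choices of bandwidth and iteration count; it illustrates the update rule above, not the authors' implementation.

```python
import numpy as np

def kernel_smooth(x, r, h=0.2):
    """Nadaraya-Watson smoother: estimate E(r | x) at the sample points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam, h=0.2, n_iter=50):
    """SpAM backfitting sketch: X is n x p, y has length n."""
    n, p = X.shape
    m = np.zeros((n, p))                            # fitted components at the data points
    for _ in range(n_iter):
        for j in range(p):
            R_j = y - m.sum(axis=1) + m[:, j]       # residual R_j = Y - sum_{k != j} m_k
            P_j = kernel_smooth(X[:, j], R_j, h)    # P_j = S_j R_j
            s_j = np.sqrt(np.mean(P_j ** 2)) + 1e-12  # empirical norm (guard against 0)
            m[:, j] = max(0.0, 1.0 - lam / s_j) * P_j # soft-threshold
            m[:, j] -= m[:, j].mean()               # enforce E(m_j) = 0
    return m
```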

14 Example: Boston Housing Data. Predict house value $Y$ from 10 covariates; we added 20 irrelevant (random) covariates to test the method. $Y$ = house value; $n = 506$, $p = 30$. $Y = \beta_0 + m_1(\text{crime}) + m_2(\text{tax}) + \cdots + m_{30}(X_{30}) + \epsilon$. Note that $m_{11} = \cdots = m_{30} = 0$. We choose $\lambda$ by minimizing the estimated risk. SpAM yields 6 nonzero functions, and correctly reports that $\hat m_{11} = \cdots = \hat m_{30} = 0$.

15 Example Fits. [Figure: fitted component functions for the Boston housing data.]

16 Component Norms. [Figure: $L_2$ norms of the fitted functions versus $1/\lambda$.]

17 RKHS Version. Raskutti, Wainwright and Yu (2011). Sample optimization: $\min_f \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \sum_{j=1}^p m_j(x_{ij})\bigr)^2 + \lambda \sum_j \|m_j\|_{\mathcal{H}_j} + \mu \sum_j \|m_j\|_{L_2(P_n)}$, where $\|m_j\|_{L_2(P_n)} = \sqrt{\frac{1}{n}\sum_{i=1}^n m_j^2(x_{ij})}$. By the representer theorem, with $m_j(\cdot) = K_j \alpha_j$, this becomes $\min_\alpha \frac{1}{n}\bigl\|y - \sum_{j=1}^p K_j \alpha_j\bigr\|^2 + \lambda \sum_j \sqrt{\alpha_j^T K_j \alpha_j} + \mu \sum_j \sqrt{\alpha_j^T K_j^2 \alpha_j}$. A finite-dimensional SOCP, but no scalable algorithms are known.
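
The finite-dimensional problem can at least be written down directly with a generic conic solver. A hedged cvxpy sketch (the function name is invented here, and a generic solver like this is exactly the kind of approach that does not scale):

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def rkhs_additive_socp(K, y, lam, mu):
    """K: list of p (n x n) Gram matrices K_j; y: response vector of length n."""
    n, p = len(y), len(K)
    alpha = [cp.Variable(n) for _ in range(p)]
    fit = cp.sum_squares(y - sum(Kj @ a for Kj, a in zip(K, alpha))) / n
    # ||m_j||_{H_j} = sqrt(alpha_j' K_j alpha_j) = ||K_j^{1/2} alpha_j||_2
    hilbert = sum(cp.norm(np.real(sqrtm(Kj)) @ a, 2) for Kj, a in zip(K, alpha))
    # sqrt(alpha_j' K_j^2 alpha_j) = ||K_j alpha_j||_2
    empirical = sum(cp.norm(Kj @ a, 2) for Kj, a in zip(K, alpha))
    cp.Problem(cp.Minimize(fit + lam * hilbert + mu * empirical)).solve()
    return [a.value for a in alpha]
```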

18 Open Problems. Under what conditions do the backfitting algorithms converge? What guarantees can be given on the solution to the infinite-dimensional optimization? Is it possible to simultaneously adapt to unknown smoothness and sparsity?

19 Multivariate Regression. $Y \in \mathbb{R}^q$ and $X \in \mathbb{R}^p$; regression function $M(X) = \mathbb{E}(Y \mid X)$. Linear model: $M(X) = BX$ where $B \in \mathbb{R}^{q \times p}$. Reduced rank regression: $\operatorname{rank}(B) \le C$. Recent work has studied properties and high-dimensional scaling of reduced rank regression with the nuclear norm $\|B\|_* := \sum_{j=1}^{\min(p,q)} \sigma_j(B)$ as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011).
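
For illustration, the nuclear-norm-penalized linear fit can be written in a few lines of cvxpy; a hedged sketch only, where the function name and the scaling of the loss are choices made here, not prescribed by the cited papers.

```python
import cvxpy as cp

def reduced_rank_fit(X, Y, lam):
    """Nuclear-norm surrogate for reduced rank regression: X is n x p, Y is n x q."""
    n, p, q = X.shape[0], X.shape[1], Y.shape[1]
    B = cp.Variable((q, p))
    obj = cp.sum_squares(Y - X @ B.T) / (2 * n) + lam * cp.normNuc(B)
    cp.Problem(cp.Minimize(obj)).solve()
    return B.value   # approximately low rank for large enough lam
```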

20 Nonparametric Reduced Rank Regression. Foygel, Horrell, Drton and Lafferty (2012). Nonparametric multivariate regression: $M(X) = (m^1(X), \ldots, m^q(X))^T$, with each component an additive model, $m^k(X) = \sum_{j=1}^p m_j^k(X_j)$. What is the nonparametric analogue of the $\|B\|_*$ penalty?

21 Recall: Sparse Vectors and $\ell_1$ Relaxation. [Figure: the set of sparse vectors $\{\|x\|_0 \le t\}$ and its convex hull $\{\|x\|_1 \le t\}$.]

22 Low-Rank Matrices and Convex Relaxation. [Figure: the set of low-rank matrices $\{\operatorname{rank}(X) \le t\}$ and its convex hull $\{\|X\|_* \le t\}$.]

23 Nuclear Norm Regularization. Algorithms for nuclear norm minimization are a lot like iterative soft-thresholding for lasso problems. To project a matrix $B$ onto the nuclear norm ball $\{\|X\|_* \le t\}$: compute the SVD, $B = U \operatorname{diag}(\sigma) V^T$, then soft-threshold the singular values: $B \leftarrow U \operatorname{diag}(\operatorname{soft}_\lambda(\sigma)) V^T$.
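
A minimal numpy sketch of the soft-thresholding step, i.e., the proximal operator of the nuclear norm at level $\lambda$:

```python
import numpy as np

def svd_soft_threshold(B, lam):
    """Shrink the singular values of B by lam and truncate at zero."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt
```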

24 Low Rank Functions. What does it mean for a set of functions $m^1(x), \ldots, m^q(x)$ to be low rank? Let $x_1, \ldots, x_n$ be a collection of points; we require that the $n \times q$ matrix $\mathbb{M}(x_{1:n}) = [m^k(x_i)]$ is low rank. Stochastic setting: $\mathbb{M} = [m^k(X_i)]$. The natural penalty is $\|\mathbb{M}\|_* = \sum_{s=1}^q \sigma_s(\mathbb{M}) = \sum_{s=1}^q \sqrt{\lambda_s(\mathbb{M}^T \mathbb{M})}$. Population version: $|||M||| := \|\Sigma(M)^{1/2}\|_*$, where $\Sigma(M) = \operatorname{Cov}(M(X))$.

25 Constrained Rank Additive Models (CRAM). Let $\Sigma_j = \operatorname{Cov}(M_j(X_j))$. Two natural penalties: $\|\Sigma_1^{1/2}\|_* + \|\Sigma_2^{1/2}\|_* + \cdots + \|\Sigma_p^{1/2}\|_*$, and $\bigl\|(\Sigma_1^{1/2}\; \Sigma_2^{1/2}\; \cdots\; \Sigma_p^{1/2})\bigr\|_*$. Population risk functional (first penalty): $\frac{1}{2}\mathbb{E}\bigl\|Y - \sum_j M_j(X_j)\bigr\|_2^2 + \lambda \sum_j |||M_j|||$.

26 Stationary Conditions. The subdifferential of the penalty is $\partial|||F||| = \bigl\{ \bigl(\mathbb{E}(FF^T)\bigr)^{-1/2} F + H : \|H\|_{\mathrm{sp}} \le 1,\ \mathbb{E}(FH^T) = 0,\ \mathbb{E}(FF^T)H = 0 \bigr\}$. Let $P(X) := \mathbb{E}(Y \mid X)$ and consider the optimization $\frac{1}{2}\mathbb{E}\|Y - M(X)\|_2^2 + \lambda|||M|||$. Let $\mathbb{E}(PP^T) = U \operatorname{diag}(\tau) U^T$ be the SVD. Define $M = U \operatorname{diag}\bigl([1 - \lambda/\sqrt{\tau}]_+\bigr) U^T P$. Then $M$ is a stationary point of the optimization, satisfying $\mathbb{E}(Y \mid X) = M(X) + \lambda V(X)$ a.e., for some $V \in \partial|||M|||$.

27 CRAM Backfitting Algorithm (Penalty 1). Input: data $(X_i, Y_i)$, regularization parameter $\lambda$. Iterate until convergence; for each $j = 1, \ldots, p$: compute the residual $R_j = Y - \sum_{k \ne j} \hat M_k(X_k)$; estimate the projection $P_j = \mathbb{E}(R_j \mid X_j)$ by smoothing, $\hat P_j = S_j R_j$; compute the SVD $\frac{1}{n} \hat P_j \hat P_j^T = U \operatorname{diag}(\tau) U^T$; soft-threshold, $\hat M_j = U \operatorname{diag}\bigl([1 - \lambda/\sqrt{\tau}]_+\bigr) U^T \hat P_j$. Output: estimator $\hat M(X_i) = \sum_j \hat M_j(X_{ij})$.
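
The step that distinguishes CRAM from SpAM is the matrix soft-threshold. A minimal numpy sketch, assuming the smoothed residual $\hat P_j$ is stored as a $q \times n$ array (rows are response components at the $n$ sample points); the small eigenvalue floor is a numerical guard, not part of the algorithm.

```python
import numpy as np

def cram_soft_threshold(P_j, lam):
    """Shrink the smoothed residual matrix P_j (q x n) toward low rank."""
    n = P_j.shape[1]
    tau, U = np.linalg.eigh(P_j @ P_j.T / n)   # (1/n) P_j P_j' = U diag(tau) U'
    shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
    return U @ np.diag(shrink) @ U.T @ P_j     # M_j = U diag([1 - lam/sqrt(tau)]_+) U' P_j
```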

28 Example. Data of Smith et al. (1962): chemical measurements on 33 individual urine specimens. $q = 5$ response variables: pigment creatinine, and the concentrations (in mg/ml) of phosphate, phosphorus, creatinine and choline. $p = 3$ covariates: weight of subject, volume and specific gravity of specimen. We use Penalty 2 with local linear smoothing, taking $\lambda = 1$ and bandwidth $h = 0.3$.

29 [Figure: grid of fitted component functions, with rows $X_j \in \{$weight, volume, specific gravity$\}$ and columns $Y_k \in \{$pigment creatinine, phosphate, phosphorus, creatinine, choline$\}$.]

30 Statistical Scaling for Prediction. Let $\mathcal{F}$ be the class of matrices of functions that have a functional SVD $M(X) = U D V(X)$, where $\mathbb{E}(V V^T) = I$ and $V(X) = [v_{sj}(X_j)]$ with each $v_{sj}$ in a second-order Sobolev space. Define $\mathcal{M}_n = \Bigl\{ M : M \in \mathcal{F},\ \|D\| = o\Bigl( \bigl( \tfrac{n}{q + \log(pq)} \bigr)^{1/4} \Bigr) \Bigr\}$. Let $\hat M$ minimize the empirical risk $\frac{1}{n}\sum_i \bigl\|Y_i - \sum_j M_j(X_{ij})\bigr\|_2^2$ over the class $\mathcal{M}_n$. Then $R(\hat M) - \inf_{M \in \mathcal{M}_n} R(M) \xrightarrow{P} 0$.

31 Nonparametric CCA. Canonical correlation analysis (CCA; Hotelling, 1936) is a classical method for finding correlations between the components of two random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. Sparse versions have been proposed for high-dimensional data (Witten & Tibshirani, 2009). Sparse additive models can be extended to this setting.

32 Sparse Additive Functional CCA. Balakrishnan, Puniyani and Lafferty (2012). Population version of the optimization: $\max_{f \in \mathcal{F},\, g \in \mathcal{G}} \mathbb{E}\bigl(f(X) g(Y)\bigr)$ subject to $\max_j \mathbb{E}(f_j^2) \le 1$, $\max_k \mathbb{E}(g_k^2) \le 1$, $\sum_{j=1}^p \sqrt{\mathbb{E}(f_j^2)} \le C_f$, and $\sum_{k=1}^q \sqrt{\mathbb{E}(g_k^2)} \le C_g$. Estimated with analogues of SpAM backfitting, together with screening procedures. See the ICML paper.

33 Regression vs. Graphical Models.
assumptions      regression              graphical models
parametric       lasso                   graphical lasso
nonparametric    sparse additive model   nonparanormal

34 The Nonparanormal. Liu, Lafferty and Wasserman, JMLR 2009. A random vector $X = (X_1, \ldots, X_p)^T$ has a nonparanormal distribution, $X \sim \mathrm{NPN}(\mu, \Sigma, f)$, in case $Z = f(X) \sim N(\mu, \Sigma)$, where $f(X) = (f_1(X_1), \ldots, f_p(X_p))$. Joint density: $p_X(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\Bigl\{ -\frac{1}{2}\bigl(f(x) - \mu\bigr)^T \Sigma^{-1} \bigl(f(x) - \mu\bigr) \Bigr\} \prod_{j=1}^p |f_j'(x_j)|$. A semiparametric Gaussian copula.
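
A quick illustration of the definition: draw Gaussian vectors and push each coordinate through a monotone map. The cubic transformation here is an arbitrary choice, used only to produce non-Gaussian marginals with a Gaussian copula.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=1000)  # Z ~ N(0, Sigma)
X = Z ** 3   # X_j = f_j^{-1}(Z_j) with f_j(x) = x^{1/3}, so f(X) ~ N(0, Sigma)
# X has markedly non-normal marginals, but the same copula (and hence the
# same conditional independence structure) as the Gaussian vector Z.
```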

35 Examples. [Figure: example nonparanormal densities.]

36 The Nonparanormal. Define $h_j(x) = \Phi^{-1}(F_j(x))$, where $F_j(x) = P(X_j \le x)$, and let $\Lambda$ be the covariance matrix of $Z = h(X)$. Then $X_j \perp X_k \mid \text{rest}$ if and only if $\Lambda^{-1}_{jk} = 0$. Hence we need to: 1. Estimate $\hat h_j(x) = \Phi^{-1}(\hat F_j(x))$. 2. Estimate the covariance matrix of $Z = \hat h(X)$ using the glasso.

37 Winsorizing the CDF. Truncation to estimate $F_j$ for $n > p$: Winsorize the empirical CDF, clipping $\hat F_j$ into $[\delta_n, 1 - \delta_n]$ with truncation level $\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}$.
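
Combining the last two slides, a hedged sketch of the two-step estimator in Python: Winsorized empirical CDFs, Gaussianization, then the graphical lasso. The use of scikit-learn's GraphicalLasso and the value of alpha are illustrative choices, not the paper's code.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLasso

def nonparanormal_glasso(X, alpha=0.1):
    """X: n x p data matrix; returns the estimated precision matrix of f(X)."""
    n = X.shape[0]
    delta = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))  # truncation level
    F = rankdata(X, axis=0) / n              # empirical CDFs F_j evaluated at the data
    F = np.clip(F, delta, 1.0 - delta)       # Winsorize into [delta, 1 - delta]
    Z = norm.ppf(F)                          # h_j(x) = Phi^{-1}(F_j(x))
    return GraphicalLasso(alpha=alpha).fit(Z).precision_
```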

39 Gene-Gene Interactions for Arabidopsis thaliana. Dataset from Affymetrix microarrays; sample size $n = 118$, $p = 40$ genes (isoprenoid pathway). [Image source: wikipedia.org]

40 Example Results. [Figure: graphs estimated by the nonparanormal (NPN) and the glasso, and their difference.]

41 Transformations for 3 Genes. [Figure: estimated transformations $\hat f_j$ for three genes.] These genes have highly non-normal marginal distributions. The graphs are different at these genes.

42 Graphs on the S&P 500. Data from Yahoo! Finance (finance.yahoo.com): daily closing prices for 452 stocks in the S&P 500 between 2003 and 2008 (before onset of the financial crisis). Log returns $X_{tj} = \log(S_{t,j} / S_{t-1,j})$, Winsorized to trim outliers. In the following graphs, each node is a stock, and color indicates GICS industry: Consumer Discretionary, Energy, Health Care, Information Technology, Telecommunications Services, Consumer Staples, Financials, Industrials, Materials, Utilities.

43 S&P Data: Glasso vs. Nonparanormal. [Figure: edges that differ between the two estimated graphs, and edges they have in common.]

44 The Nonparanormal SKEPTIC. Liu, Han, Yuan, Lafferty & Wasserman, 2012. Assuming $X \sim \mathrm{NPN}(f, \Sigma^0)$, we have $\Sigma^0_{jk} = 2 \sin\bigl( \frac{\pi}{6} \rho_{jk} \bigr)$, where $\rho$ is Spearman's rho: $\rho_{jk} := \operatorname{Corr}\bigl( F_j(X_j), F_k(X_k) \bigr)$. Empirical estimate, from the ranks $r_i^j$: $\hat\rho_{jk} = \frac{\sum_{i=1}^n (r_i^j - \bar r^j)(r_i^k - \bar r^k)}{\sqrt{\sum_{i=1}^n (r_i^j - \bar r^j)^2 \sum_{i=1}^n (r_i^k - \bar r^k)^2}}$. A similar relation holds for Kendall's tau.
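
A minimal sketch of the resulting plug-in correlation estimate (scipy's spearmanr returns a full correlation matrix only when X has three or more columns, which is assumed here); the output can replace the Pearson correlation matrix as the input to the glasso.

```python
import numpy as np
from scipy.stats import spearmanr

def skeptic_correlation(X):
    """X: n x d data matrix with d >= 3; returns the SKEPTIC estimate of Sigma^0."""
    rho, _ = spearmanr(X)                  # matrix of Spearman rank correlations
    S = 2.0 * np.sin(np.pi / 6.0 * rho)    # Sigma^0_jk = 2 sin(pi rho_jk / 6)
    np.fill_diagonal(S, 1.0)               # exact on the diagonal
    return S
```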

45 The Nonparanormal SKEPTIC. Using a Hoeffding inequality for U-statistics, we get $\max_{jk} \bigl| \hat\Sigma^\rho_{jk} - \Sigma^0_{jk} \bigr| \le 3\sqrt{2\pi}\, \sqrt{\frac{\log d + \log n}{2n}}$, with probability at least $1 - 1/n^2$. Can thus estimate the covariance at the parametric rate. Punch line: for graph and covariance estimation, no loss in statistical or computational efficiency comes from using the nonparanormal rather than the normal graphical model.

46 Conclusions. Thresholded backfitting algorithms derived from subdifferential calculus. RKHS formulations are problematic. Theory for infinite-dimensional optimizations is still incomplete. Many extensions are possible: nonparanormal component analysis, etc. Variations on additive models enjoy most of the good statistical and computational properties of sparse linear models, with relaxed assumptions. We're building a toolbox for large-scale, high-dimensional nonparametric inference.
