Statistical Machine Learning for Structured and High Dimensional Data


1 Statistical Machine Learning for Structured and High Dimensional Data (FA ). PI: Larry Wasserman (CMU), Co-PI: John Lafferty (UChicago and CMU). AFOSR Program Review (Jan 28-31, 2013, Washington, DC). Cognition, Decision, and Computational Intelligence.

2 Statistical Machine Learning (Lafferty)
Objective: (1) Rigorously explore conditions under which it is possible to overcome the curse of dimensionality by exploiting low-dimensional structure. (2) Investigate the inherent tension between predictive accuracy and computational resources.
Technical Approach:
- Nonparametric graphical models
- Theory that incorporates computational costs into statistical risk
- Links with channel coding and learning
- Methods for manifold-structured data
- Theoretically sound greedy methods
DoD Benefit: Statistical learning offers significant potential to form a principled, analytic framework for automatic target detection, recognition and tracking, complementing physics-based or knowledge-based approaches.
Budget (Actual/Planned, $K): $180/$180 for 1 Dec 2011 - 30 Nov 2012. Annual Progress Report Submitted? Y. Project End Date: February 28, 2014.

3 List of Project Goals
1. Develop methods for nonparametric graphical models and non-iid data.
2. Develop theory that incorporates computational costs into statistical risk.
3. Investigate links between learning and channel coding.
4. Develop methods for data with manifold structure and low dimensionality.
5. Develop theoretically sound greedy methods for nonparametric models.

4 Main Theme Exploit structure in high dimensional data using nonparametric methods that make weak assumptions.

5 Progress Towards Goals
- Nonparametric graphical models
- Online density estimation and kernel regression
- Optimal mutual information estimation
- Low-rank nonparametric regression
- Conditional sparse coding
- Computation and risk tradeoffs

6 Multivariate Regression
$Y \in \mathbb{R}^q$ and $X \in \mathbb{R}^p$. Regression function $m(x) = \mathbb{E}(Y \mid X = x)$.
Linear model $Y = BX + \epsilon$ where $B \in \mathbb{R}^{q \times p}$.
Reduced rank regression: $r = \mathrm{rank}(B) \le C$.
Recent work has studied properties and high dimensional scaling of reduced rank regression where the nuclear norm $\|B\|_*$ is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,
$$\|\hat B_n - B\|_F = O_P\!\left( \sqrt{\frac{\mathrm{Var}(\epsilon)\, r\, (p+q)}{n}} \right)$$
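
For reference, here is a minimal NumPy sketch (not from the slides; the function and variable names are illustrative assumptions) of classical reduced rank regression, which enforces the rank constraint directly by truncating the SVD of the least-squares fit rather than relaxing it to a nuclear norm penalty:

    import numpy as np

    def reduced_rank_regression(X, Y, r):
        # X: (n, p) design, Y: (n, q) responses; returns C = B^T with rank(C) <= r.
        C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # unconstrained least squares
        Y_hat = X @ C_ols                               # fitted values
        _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
        P = Vt[:r].T @ Vt[:r]                           # rank-r projection in response space
        return C_ols @ P                                # best rank-r fit (Eckart-Young)

The nuclear norm approach replaces this hard rank constraint with a convex penalty, which is what makes the high dimensional scaling analysis above tractable.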

7 Low-Rank Matrices and Convex Relaxation
[Figure: the set of low-rank matrices, $\mathrm{rank}(X) \le t$, and its convex relaxation, the nuclear norm ball $\|X\|_* \le t$.]

8 Nuclear Norm Regularization
Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix $B$ onto the nuclear norm ball $\|X\|_* \le t$:
- Compute the SVD: $B = U \,\mathrm{diag}(\sigma)\, V^T$
- Soft threshold the singular values: $B \leftarrow U \,\mathrm{diag}(\mathrm{soft}_\lambda(\sigma))\, V^T$
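
A minimal NumPy sketch of this step (an illustration under assumed names, not the authors' code): soft-thresholding the singular values at a fixed level $\lambda$ is the proximal operator of $\lambda\|\cdot\|_*$, while an exact projection onto the ball $\|X\|_* \le t$ additionally requires searching for the right threshold, as in the second function.

    import numpy as np

    def svd_soft_threshold(B, lam):
        # Prox of lam * nuclear norm: soft threshold the singular values of B.
        U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
        return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt

    def project_nuclear_ball(B, t, tol=1e-8):
        # Euclidean projection onto {X : ||X||_* <= t}, by bisection on the threshold.
        U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
        if sigma.sum() <= t:
            return B
        lo, hi = 0.0, sigma.max()
        while hi - lo > tol:
            lam = 0.5 * (lo + hi)
            lo, hi = (lam, hi) if np.maximum(sigma - lam, 0.0).sum() > t else (lo, lam)
        return U @ np.diag(np.maximum(sigma - hi, 0.0)) @ Vt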

9 Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (NIPS 2012, arXiv 2013)
Nonparametric multivariate regression $m(X) = (m^1(X), \ldots, m^q(X))^T$, with each component an additive model
$$m^k(X) = \sum_{j=1}^p m_j^k(X_j)$$
What is the nonparametric analogue of the $\|B\|_*$ penalty?

10 Low Rank Functions
What does it mean for a set of functions $m^1(x), \ldots, m^q(x)$ to be low rank? Let $x_1, \ldots, x_n$ be a collection of points. We require that the $n \times q$ matrix $M(x_{1:n}) = [m^k(x_i)]$ is low rank.
Stochastic setting: $M = [m^k(X_i)]$. A natural penalty is
$$\frac{1}{\sqrt{n}} \|M\|_* = \frac{1}{\sqrt{n}} \sum_{s=1}^q \sigma_s(M) = \sum_{s=1}^q \sqrt{\lambda_s\!\left(\tfrac{1}{n} M^T M\right)}$$
Population version: $\|M\| := \|\Sigma(M)^{1/2}\|_*$ where $\Sigma(M) = \mathrm{Cov}(M(X))$.
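
In code, the empirical penalty is just a scaled sum of singular values. A small sketch (names assumed), taking the component functions already evaluated on the sample:

    import numpy as np

    def lowrank_penalty(M_vals):
        # M_vals: (n, q) matrix [m^k(x_i)].
        # Returns (1/sqrt(n)) * nuclear norm, i.e. the sum over s of
        # sqrt of the eigenvalues of (1/n) M^T M.
        n = M_vals.shape[0]
        return np.linalg.svd(M_vals, compute_uv=False).sum() / np.sqrt(n)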

11 Constrained Rank Additive Models (CRAM)
Let $\Sigma_j = \mathrm{Cov}(M_j)$. Two natural penalties:
$$\big\|\Sigma_1^{1/2}\big\|_* + \big\|\Sigma_2^{1/2}\big\|_* + \cdots + \big\|\Sigma_p^{1/2}\big\|_* \qquad \text{and} \qquad \Big\|\big(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\big)\Big\|_*$$
Population risk (first penalty):
$$\frac{1}{2}\, \mathbb{E}\Big\|Y - \sum_j M_j(X_j)\Big\|_2^2 + \lambda \sum_j \big\|\Sigma_j^{1/2}\big\|_*$$
Linear case:
$$\sum_{j=1}^p \big\|\Sigma_j^{1/2}\big\|_* = \sum_{j=1}^p \|B_j\|_2 \qquad \text{and} \qquad \Big\|\big(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\big)\Big\|_* = \|B\|_*$$

12 CRAM Backfitting Algorithm (Penalty 1)
Input: Data $(X_i, Y_i)$, regularization parameter $\lambda$.
Iterate until convergence. For each $j = 1, \ldots, p$:
- Compute the residual: $R_j = Y - \sum_{k \ne j} M_k(X_k)$
- Estimate the projection $P_j = \mathbb{E}(R_j \mid X_j)$ by smoothing: $\hat P_j = S_j R_j$
- Compute the SVD: $\frac{1}{n} \hat P_j \hat P_j^T = U \,\mathrm{diag}(\tau)\, U^T$
- Soft-threshold: $\hat M_j = U \,\mathrm{diag}\big([1 - \lambda/\sqrt{\tau}]_+\big)\, U^T \hat P_j$
Output: Estimator $\hat M(X_i) = \sum_j \hat M_j(X_{ij})$.
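
A simplified NumPy sketch of this backfitting loop (illustrative only: the smoother matrices $S_j$ are supplied by the user, fitted values are stored as $n \times q$ arrays with rows indexed by samples so the slide's $\tfrac{1}{n}\hat P_j \hat P_j^T$ becomes $\tfrac{1}{n}\hat P_j^T \hat P_j$ here, and convergence checks are omitted):

    import numpy as np

    def cram_backfit(Y, M, smoothers, lam, n_iters=50):
        # Y         : (n, q) response matrix
        # M         : list of p arrays, each (n, q), current fitted values M_j(X_ij)
        # smoothers : list of p (n, n) smoother matrices S_j acting on columns
        # lam       : regularization parameter
        n, q = Y.shape
        p = len(M)
        for _ in range(n_iters):
            for j in range(p):
                # Residual with the j-th component removed
                R_j = Y - sum(M[k] for k in range(p) if k != j)
                # Smooth the residual against X_j: estimate of E(R_j | X_j)
                P_j = smoothers[j] @ R_j
                # Eigendecomposition of (1/n) P_j^T P_j = U diag(tau) U^T
                tau, U = np.linalg.eigh(P_j.T @ P_j / n)
                # Multiplicative soft-thresholding of the functional singular values sqrt(tau)
                shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
                M[j] = P_j @ U @ np.diag(shrink) @ U.T
        return M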

13 Scaling of Estimation Error
The population risk of a $q \times p$ regression matrix $M(X)$ is
$$R(M) = \mathbb{E}\,\big\|Y - M(X)\,1_p\big\|_2^2.$$
Consider all models with functional SVD $M(X) = U D V(X)^T$, where $U$ is an orthogonal $q \times r$ matrix, $D$ is a positive diagonal matrix, and $V(X) = [v_{js}(X_j)]$ satisfies $\mathbb{E}(V^T V) = I_r$, with each $v_{js}$ in a second-order Sobolev space.
The population risk can be re-expressed as
$$R(M) = \mathrm{tr}\left\{ \begin{pmatrix} I_q \\ -DU^T \end{pmatrix}^{\!T} \begin{pmatrix} \Sigma_{YY} & \Sigma_{YV} \\ \Sigma_{YV}^T & \Sigma_{VV} \end{pmatrix} \begin{pmatrix} I_q \\ -DU^T \end{pmatrix} \right\}$$
and similarly for the empirical risk $\hat R(M)$.

14 Scaling of Estimation Error
The controllable risk satisfies, using von Neumann's inequality,
$$\big| R_c(M) - \hat R_c(M) \big| \le C\, \|D\|^2\, \big\| \Sigma(V) - \Sigma_n(V) \big\|$$
For the last factor,
$$\sup_V \big\| \Sigma(V) - \Sigma_n(V) \big\|_{\mathrm{sp}} \le C \sup_V \sup_{w \in N} \Big| w^T \big( \Sigma(V) - \Sigma_n(V) \big)\, w \Big|$$
where $N$ is a $1/2$-covering of the unit $(q+r)$-sphere, which has size $|N| \le 6^{q+r} \le 36^q$ (Vershynin, 2010).

15 Scaling of Estimation Error
Let $\hat M$ minimize the empirical risk $\frac{1}{n} \sum_i \big\| Y_i - \sum_j M_j(X_{ij}) \big\|_2^2$ over the class
$$\mathcal{M}_n = \left\{ M : \|M\|_F,\ \|D\| = o\!\left( \Big( \frac{n}{q + \log(pq)} \Big)^{1/4} \right) \right\}$$
Then the empirical estimator is persistent over this class:
$$R(\hat M) - \inf_{M \in \mathcal{M}_n} R(M) \stackrel{P}{\longrightarrow} 0.$$

16 Example
E. coli data from the DREAM 5 Network Inference Challenge.
$X = (X_1, \ldots, X_6)$: transcription factors (TFs)
$Y = (Y_1, \ldots, Y_{27})$: target genes (TGs)
In the gold standard, two intermediate genes d-separate $X$ and $Y$. Regression function: $m(X) = h(g_1(X), g_2(X))$. If $h$ is linear, then $m$ has rank at most 2, since each component $m^k(X)$ is then a linear combination of the two functions $g_1(X)$ and $g_2(X)$.

17 Penalty 1
[Figure: estimated component functions under Penalty 1 at the chosen $\lambda$; $L = 20$.]

18 Penalty 2
[Figure: estimated component functions under Penalty 2 at the chosen $\lambda$; $L = 5$.]

19 Summary
Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models. We're building a toolbox for large scale, high dimensional nonparametric inference.

20 Computation-Risk Tradeoffs
- In traditional computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time (Valiant's PAC model).
- Mostly negative results: it is not possible to efficiently learn in natural settings.
- Claim: distinctions in polynomial time matter most.

21 Analogy: Numerical Optimization
In numerical optimization, it is understood how to trade off computation for speed of convergence:
- First order methods: linear cost, linear convergence
- Quasi-Newton methods: quadratic cost, superlinear convergence
- Newton's method: cubic cost, quadratic convergence
Are similar tradeoffs possible in statistical learning?

22 Hints of a Computation-Risk Tradeoff
Graph estimation: our method for estimating the graph of an Ising model requires $n = \Omega(d^3 \log p)$ samples and $T = O(p^4)$ time for graphs with $p$ nodes and maximum degree $d$.
Information-theoretic lower bound: $n = \Omega(d \log p)$.

23 Statistical vs. Computational Efficiency
Challenge: understand how families of estimators with different computational efficiencies can yield different statistical efficiencies.
$$\mathrm{Rate}_{\mathcal{H},\mathcal{F}}(n) = \inf_{\hat m_n \in \mathcal{H}}\ \sup_{m \in \mathcal{F}}\ \mathrm{Risk}(\hat m_n, m)$$
$\mathcal{H}$: computationally constrained hypothesis class. $\mathcal{F}$: smoothness constraints on the true model.

24 Computation-Risk Tradeoffs for Linear Regression
Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.

25 Computation-Risk Tradeoffs for Linear Regression
The standard ridge estimator solves
$$\Big( \tfrac{1}{n} X^T X + \lambda_n I \Big)\, \hat\beta_\lambda = \tfrac{1}{n} X^T Y$$
Sparsify the sample covariance to get the estimator
$$\Big( T_t[\hat\Sigma] + \lambda_n I \Big)\, \hat\beta_{t,\lambda} = \tfrac{1}{n} X^T Y$$
where $T_t[\hat\Sigma]$ is the hard-thresholded sample covariance: $T_t([m_{ij}]) = \big[\, m_{ij}\, 1(|m_{ij}| > t) \,\big]$.
Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with $m$ nonzero matrix entries can be done in time $\tilde O(m \log^2 p)$.
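
A sketch of the sparsified ridge estimator in NumPy/SciPy (illustrative; it uses a generic sparse direct solver rather than the specialized SDD solver cited above, and all names are assumptions):

    import numpy as np
    from scipy.sparse import csr_matrix, identity
    from scipy.sparse.linalg import spsolve

    def hard_threshold(S, t):
        # T_t: keep entries with |s_ij| > t, zero out the rest.
        S_t = S.copy()
        S_t[np.abs(S_t) <= t] = 0.0
        return S_t

    def sparsified_ridge(X, y, lam, t):
        # Solve (T_t[Sigma_hat] + lam I) beta = (1/n) X^T y with a sparse solver.
        n, p = X.shape
        Sigma_hat = X.T @ X / n
        A = csr_matrix(hard_threshold(Sigma_hat, t)) + lam * identity(p, format="csr")
        return spsolve(A, X.T @ y / n)

Larger t makes the linear system sparser and cheaper to solve at the price of a larger approximation error, which is exactly the risk/computation knob analyzed on the next slide.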

26 Computation-Risk Tradeoffs for Linear Regression
We have recently proved that the statistical error scales as
$$\frac{\|\hat\beta_{t,\lambda} - \beta\|}{\|\beta\|} = O_P\big( \| T_t(\Sigma) - \Sigma \|_2 \big) = O\big( t^{1-q} \big)$$
for the class of covariance matrices with rows in sparse $\ell_q$ balls (as studied by Bickel and Levina). Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff.

27 Simulation
[Figure: simulated risk as a function of the regularization parameter lambda.]

28 Progress on Related Projects
- Minhua Chen: Nonparametric log-concave graph estimation
- Eric Janofsky: Nonparanormal component analysis
- Min Xu: High dimensional conditional density and graph estimation

29 Recent Publications
- Sparse nonparametric graphical models. John Lafferty, Han Liu, and Larry Wasserman. Statistical Science, 2013.
- Sequential nonparametric regression. Haijie Gu and John Lafferty. ICML 2012.
- Matrix sparse coding. Min Xu and John Lafferty. ICML 2012.
- High dimensional semiparametric Gaussian copula graphical models. Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. The Annals of Statistics (to appear).
- HUGE: High dimensional undirected graph estimation. Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. Journal of Machine Learning Research (JMLR), Vol. 13.
- Exponential concentration for mutual information estimation. Han Liu, John Lafferty, and Larry Wasserman. Neural Information Processing Systems (NIPS).
- Nonparametric reduced rank regression. Rina Foygel, Michael Horrell, Mathias Drton, and John Lafferty. Neural Information Processing Systems (NIPS) 2012.
