Computational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center

1 Computational and Statistical Aspects of Statistical Machine Learning John Lafferty Department of Statistics Retreat Gleacher Center

2 Outline. Modern nonparametric inference for high dimensional data; nonparametric reduced rank regression; risk-computation tradeoffs; covariance-constrained linear regression; other research and teaching activities.

3 Context for High Dimensional Nonparametrics. There has been great progress in recent years on high dimensional linear models, but many problems have important nonlinear structure. We've been studying purely functional methods for high dimensional, nonparametric inference: no basis expansions, no Mercer kernels.

4 Additive Models. Fully nonparametric models appear hopeless: logarithmic scaling, $p = \log n$ (e.g., Rodeo; Lafferty and Wasserman (2008)). Additive models are a useful compromise: exponential scaling, $p = \exp(n^c)$ (e.g., SpAM; Ravikumar, Lafferty, Liu and Wasserman (2009)).

5 Additive Models. [Figure: Bone Mineral Density Data.] [Figure: Diabetes Data, with panels Age, Bmi, Map, Tc.]

6 Multivariate Regression. $Y \in \mathbb{R}^q$ and $X \in \mathbb{R}^p$. Regression function $m(X) = E(Y \mid X)$. Linear model $Y = BX + \epsilon$ where $B \in \mathbb{R}^{q \times p}$. Reduced rank regression: $r = \mathrm{rank}(B) \le C$. Recent work has studied properties and high dimensional scaling of reduced rank regression where the nuclear norm $\|B\|_*$ is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g., $\|\hat B_n - B\|_F = O_P\left(\sqrt{\frac{\mathrm{Var}(\epsilon)\, r\, (p + q)}{n}}\right)$.

7 Low-Rank Matrices and Convex Relaxation. [Figure: the set of low rank matrices, $\mathrm{rank}(X) \le t$, and its convex hull, $\|X\|_* \le t$.]

8 Nuclear Norm Regularization. Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix $B$ onto the nuclear norm ball $\|X\|_* \le t$: compute the SVD, $B = U\,\mathrm{diag}(\sigma)\,V^T$; then soft threshold the singular values, $B \leftarrow U\,\mathrm{diag}(\mathrm{soft}_\lambda(\sigma))\,V^T$.
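The two steps on this slide can be sketched in NumPy. This is a minimal illustration, not the speaker's code: the function name `soft_threshold_singular_values` and the test matrix are assumptions, and $\lambda$ is treated as given. For an exact projection onto the ball of radius $t$, $\lambda$ would have to be chosen so that the shrunken singular values sum to $t$; that search is not shown here.

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Shrink each singular value of B by lam, flooring at zero (hypothetical helper)."""
    # Step 1: compute the SVD, B = U diag(sigma) V^T.
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    # Step 2: soft threshold the singular values.
    sigma_shrunk = np.maximum(sigma - lam, 0.0)
    # Reassemble U diag(soft_lam(sigma)) V^T.
    return (U * sigma_shrunk) @ Vt

# Illustrative data only.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 8))
B_shrunk = soft_threshold_singular_values(B, lam=1.0)
```

Singular values below $\lambda$ are set exactly to zero, which is how the operation reduces rank in the same way that scalar soft thresholding produces sparsity in the lasso.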

9 Nonparametric Reduced Rank Regression. Foygel, Horrell, Drton and Lafferty (NIPS 2012). Nonparametric multivariate regression: $m(X) = (m^1(X), \ldots, m^q(X))^T$. Each component is an additive model, $m^k(X) = \sum_{j=1}^p m_j^k(X_j)$. What is the nonparametric analogue of the $\|B\|_*$ penalty?

10 Low Rank Functions. What does it mean for a set of functions $m^1(x), \ldots, m^q(x)$ to be low rank? Let $x_1, \ldots, x_n$ be a collection of points. We require that the $n \times q$ matrix $M(x_{1:n}) = [m^k(x_i)]$ is low rank. Stochastic setting: $M = [m^k(X_i)]$. The natural penalty is $\frac{1}{\sqrt{n}} \|M\|_* = \frac{1}{\sqrt{n}} \sum_{s=1}^q \sigma_s(M) = \sum_{s=1}^q \sqrt{\lambda_s\left(\frac{1}{n} M^T M\right)}$. Population version: $\|M\|_* := \left\|\sqrt{\mathrm{Cov}(M(X))}\right\|_* = \|\Sigma(M)^{1/2}\|_*$.
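As a numerical check of the penalty identity on this slide, the sketch below builds a rank-2 matrix of function evaluations and computes the penalty both from the singular values of $M$ and from the eigenvalues of $\frac{1}{n} M^T M$. The sizes and the random construction of `M` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, r = 200, 6, 2

# Assumed toy example: an n x q matrix of rank r, standing in for [m^k(X_i)].
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, q))

# Penalty via singular values: (1 / sqrt(n)) * sum_s sigma_s(M).
singular_values = np.linalg.svd(M, compute_uv=False)
penalty_svd = singular_values.sum() / np.sqrt(n)

# Penalty via eigenvalues: sum_s sqrt(lambda_s(M^T M / n)).
eigenvalues = np.linalg.eigvalsh(M.T @ M / n)
penalty_eig = np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()

rank = np.linalg.matrix_rank(M)
```

The two values agree up to floating point error because $\sigma_s(M)^2 = \lambda_s(M^T M)$, and only `r` of the `q` terms in either sum are non-negligible.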

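11 Constrained Rank Additive Models (CRAM). Let $\Sigma_j = \mathrm{Cov}(M_j)$. Two natural penalties: $\|\Sigma_1^{1/2}\|_* + \|\Sigma_2^{1/2}\|_* + \cdots + \|\Sigma_p^{1/2}\|_*$ and $\left\|\left(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\right)\right\|_*$. Population risk (first penalty): $\frac{1}{2} E\left\|Y - \sum_j M_j(X_j)\right\|_2^2 + \lambda \sum_j \|M_j\|_*$. Linear case: $\sum_{j=1}^p \|\Sigma_j^{1/2}\|_* = \sum_{j=1}^p \|B_j\|_2$ and $\left\|\left(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\right)\right\|_* = \|B\|_*$.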
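The linear-case identities on this slide can be checked in a few lines. The derivation below is a sketch under assumptions not stated on the slide: each component is linear, $M_j(X_j) = B_j X_j$ with $B_j \in \mathbb{R}^q$ the $j$-th column of $B$ and $B_j \ne 0$, and each covariate is standardized so that $\mathrm{Var}(X_j) = 1$.

```latex
% Assumptions: M_j(X_j) = B_j X_j, B_j \neq 0, Var(X_j) = 1.
\begin{align*}
\Sigma_j &= \operatorname{Cov}(B_j X_j) = B_j B_j^T,
  \qquad \Sigma_j^{1/2} = \frac{B_j B_j^T}{\|B_j\|_2}, \\
\|\Sigma_j^{1/2}\|_* &= \operatorname{tr}\bigl(\Sigma_j^{1/2}\bigr) = \|B_j\|_2, \\
A &:= \bigl(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\bigr),
  \qquad A A^T = \sum_{j=1}^p \Sigma_j = \sum_{j=1}^p B_j B_j^T = B B^T, \\
\|A\|_* &= \sum_s \sqrt{\lambda_s(B B^T)} = \|B\|_*.
\end{align*}
```

Under these assumptions the first penalty reduces to a group-lasso-type sum of column norms, and the second reduces to the nuclear norm of $B$.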

12 CRAM Backfitting Algorithm (Penalty 1).
Input: data $(X_i, Y_i)$, regularization parameter $\lambda$.
Iterate until convergence; for each $j = 1, \ldots, p$:
  Compute residual: $R_j = Y - \sum_{k \ne j} \hat M_k(X_k)$.
  Estimate the projection $P_j = E(R_j \mid X_j)$ by smoothing: $\hat P_j = S_j R_j$.
  Compute SVD: $\frac{1}{n} \hat P_j \hat P_j^T = U\,\mathrm{diag}(\tau)\,U^T$.
  Soft-threshold: $\hat M_j = U\,\mathrm{diag}([1 - \lambda/\sqrt{\tau}]_+)\,U^T \hat P_j$.
Output: estimator $\hat M(X_i) = \sum_j \hat M_j(X_{ij})$.
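The backfitting loop can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: the slide does not specify the smoother $S_j$, so a Nadaraya-Watson smoother with a Gaussian kernel is swapped in; the names `kernel_smoother_matrix` and `cram_backfit`, the bandwidth, the centering steps, and the simulated data are all hypothetical choices.

```python
import numpy as np

def kernel_smoother_matrix(x, bandwidth):
    """n x n Nadaraya-Watson smoother matrix, Gaussian kernel (assumed choice of S_j)."""
    diffs = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return weights / weights.sum(axis=1, keepdims=True)

def cram_backfit(X, Y, lam, bandwidth=0.1, n_iter=50, tol=1e-8):
    """X: (n, p), Y: (n, q). Returns component fits M of shape (p, q, n)."""
    n, p = X.shape
    q = Y.shape[1]
    smoothers = [kernel_smoother_matrix(X[:, j], bandwidth) for j in range(p)]
    Yc = (Y - Y.mean(axis=0)).T            # q x n, centered (assumption)
    M = np.zeros((p, q, n))
    for _ in range(n_iter):
        M_old = M.copy()
        for j in range(p):
            # Residual: R_j = Y - sum_{k != j} M_k(X_k).
            R = Yc - (M.sum(axis=0) - M[j])
            # Smooth each response coordinate over X_j: P_j = S_j R_j.
            P = R @ smoothers[j].T
            P -= P.mean(axis=1, keepdims=True)   # keep components centered (assumption)
            # Eigendecomposition of the q x q matrix (1/n) P_j P_j^T = U diag(tau) U^T.
            tau, U = np.linalg.eigh(P @ P.T / n)
            tau = np.clip(tau, 0.0, None)
            # Soft-threshold factor [1 - lam / sqrt(tau)]_+, zero where tau is zero.
            shrink = np.zeros_like(tau)
            pos = tau > 0
            shrink[pos] = np.maximum(0.0, 1.0 - lam / np.sqrt(tau[pos]))
            M[j] = (U * shrink) @ U.T @ P
        if np.max(np.abs(M - M_old)) < tol:
            break
    return M

# Illustrative data: a rank-one signal that depends only on the first covariate.
rng = np.random.default_rng(0)
n, p, q = 200, 3, 3
X = rng.uniform(0.0, 1.0, size=(n, p))
direction = np.array([1.0, -1.0, 0.5])
Y = np.outer(np.sin(2 * np.pi * X[:, 0]), direction) + 0.1 * rng.standard_normal((n, q))

M_hat = cram_backfit(X, Y, lam=0.05)
fitted = M_hat.sum(axis=0).T               # n x q fit to the centered responses
```

Because the thresholding acts on the eigenvalues of the $q \times q$ matrix $\frac{1}{n} \hat P_j \hat P_j^T$, a large enough $\lambda$ zeroes a whole component function, while a moderate $\lambda$ removes only its weakest directions.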

13 Scaling of Estimation Error. Using a double covering technique ($\frac{1}{2}$-parametric, $\frac{1}{2}$-nonparametric), we bound the deviation between empirical and population functional covariance matrices in spectral norm: $\sup_V \left\|\Sigma(V) - \hat\Sigma_n(V)\right\|_{\mathrm{sp}} = O_P\left(\sqrt{\frac{q + \log(pq)}{n}}\right)$. This allows us to bound the excess risk of the empirical estimator relative to an oracle.

14 Summary. Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models. We're building a toolbox for large scale, high dimensional nonparametric inference.

15 Computation-Risk Tradeoffs. In traditional computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time (Valiant's PAC model). The results are mostly negative: it is not possible to learn efficiently in natural settings. Claim: distinctions within polynomial time matter most.

16 Analogy: Numerical Optimization. In numerical optimization, it is understood how to trade off computation for speed of convergence. First order methods: linear cost, linear convergence. Quasi-Newton methods: quadratic cost, superlinear convergence. Newton's method: cubic cost, quadratic convergence. Are similar tradeoffs possible in statistical learning?

17 Hints of a Computation-Risk Tradeoff. Graph estimation: our method for estimating the graph of an Ising model requires $n = \Omega(d^3 \log p)$ samples and time $T = O(p^4)$ for graphs with $p$ nodes and maximum degree $d$. Information-theoretic lower bound: $n = \Omega(d \log p)$.

18 Statistical vs. Computational Efficiency. Challenge: understand how families of estimators with different computational efficiencies can yield different statistical efficiencies. $\mathrm{Rate}_{H,F}(n) = \inf_{\hat m_n \in H} \sup_{m \in F} \mathrm{Risk}(\hat m_n, m)$, where $H$ is a computationally constrained hypothesis class and $F$ encodes smoothness constraints on the true model.

19 Computation-Risk Tradeoffs for Linear Regression. Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.

20 Computation-Risk Tradeoffs for Linear Regression. The standard ridge estimator solves $\left(\frac{1}{n} X^T X + \lambda_n I\right) \hat\beta_\lambda = \frac{1}{n} X^T Y$. Sparsify the sample covariance to get the estimator $\left(T_t[\hat\Sigma] + \lambda_n I\right) \hat\beta_{t,\lambda} = \frac{1}{n} X^T Y$, where $T_t[\hat\Sigma]$ is the hard-thresholded sample covariance: $T_t([m_{ij}]) = \left[m_{ij}\, 1(|m_{ij}| > t)\right]$. Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with $m$ nonzero matrix entries can be done in time $\tilde{O}(m \log^2 p)$.
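The two estimators on this slide can be compared with a short NumPy sketch. The function names, the simulated data, and the threshold value are assumptions for illustration. The sketch uses a dense solver and only counts nonzero entries; it does not implement the near-linear-time solver mentioned on the slide, and it makes no attempt to ensure the thresholded matrix is diagonally dominant or positive definite.

```python
import numpy as np

def ridge(X, Y, lam):
    """Solve (X^T X / n + lam I) beta = X^T Y / n."""
    n, p = X.shape
    sigma_hat = X.T @ X / n
    return np.linalg.solve(sigma_hat + lam * np.eye(p), X.T @ Y / n)

def hard_threshold(S, t):
    """T_t([m_ij]) = [m_ij * 1(|m_ij| > t)]."""
    return S * (np.abs(S) > t)

def thresholded_ridge(X, Y, lam, t):
    """Ridge with the sample covariance replaced by its hard-thresholded version."""
    n, p = X.shape
    sigma_t = hard_threshold(X.T @ X / n, t)
    beta = np.linalg.solve(sigma_t + lam * np.eye(p), X.T @ Y / n)
    return beta, np.count_nonzero(sigma_t)

# Illustrative data only.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_ridge = ridge(X, Y, lam=0.1)
beta_t0, nnz_t0 = thresholded_ridge(X, Y, lam=0.1, t=0.0)
beta_t, nnz_t = thresholded_ridge(X, Y, lam=0.1, t=0.2)
```

With `t=0.0` the thresholded estimator coincides with ridge. Raising `t` lowers the nonzero count, which is the quantity $m$ that drives the solver cost quoted on the slide.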

21 Computation-Risk Tradeoffs for Linear Regression. Dinah has recently proved that the statistical error scales as $\frac{\|\hat\beta_{t,\lambda} - \beta\|}{\|\beta\|} = O_P\left(\|T_t(\hat\Sigma) - \Sigma\|_2\right) = O(t^{1-q})$ for the class of covariance matrices with rows in sparse $\ell_q$ balls (as studied by Bickel and Levina). Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff.

22 Simulation. [Figure: risk versus lambda.]

23 Some Other Projects. Minhua Chen: convex optimization for dictionary learning. Eric Janofsky: nonparanormal component analysis. Min Xu: high dimensional conditional density and graph estimation.

24 Courses in the Works. Winter 2013: Nonparametric Inference (Undergraduate and Masters). Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science). Charles Cary is developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.
