Nearest Neighbor based Coordinate Descent

Size: px

Start display at page:

Download "Nearest Neighbor based Coordinate Descent"

Lora Ward
5 years ago
Views:

1 Nearest Neighbor based Coordinate Descent Pradeep Ravikumar, UT Austin Joint with Inderjit Dhillon, Ambuj Tewari Simons Workshop on Information Theory, Learning and Big Data, 2015

2 Modern Big Data Across modern applications {fmri images, gene expression profiles, social networks} many^many variables in system, not enough observations Curse of dimensionality To train the system or model, number of observations have to be much larger than variables in system (scaling exponentially in non-parametric models, polynomially in parametric models)

3 Modern Big Data Research over the last decade and a half: Finesse curse of dimensionality when there is some intrinsic low-dimensional structure such as (group) sparsity, low rank, etc.

4 Modern Big Data Research over the last decade and a half: Finesse curse of dimensionality when there is some intrinsic low-dimensional structure such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012; Chandrasekharan, Recht, Willsky, 2012 for a general linear-algebraic notion of structure

5 Modern Big Data Research over the last decade and a half: Finesse curse of dimensionality when there is some intrinsic low-dimensional structure such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012; Chandrasekharan, Recht, Willsky, 2012 for a general linear-algebraic notion of structure Under such structure, we know how to obtain estimators whose statistical or sample complexity depends weakly on problem dimension p typically scaling as log(p)

6 Modern Big Data Research over the last decade and a half: Finesse curse of dimensionality when there is some intrinsic low-dimensional structure such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012; Chandrasekharan, Recht, Willsky, 2012 for a general linear-algebraic notion of structure Under such structure, we know how to obtain estimators whose statistical or sample complexity depends weakly on problem dimension p typically scaling as log(p) Can we achieve similar weak dependence on p in computational complexity?

7 Convex Optimization Optimization Problem: min L(w). w2r p Loss L is convex and smooth: L(w) L(v) κ 1 w v 1 Sparse minimizer w : w 0 = s, w B

8 Coordinate Descent Optimization Problem: min L(w). w2r p Algorithm 1 Cyclic coordinate Descent Initialize: Set the initial value of w 0. for n =1,... do j = t mod p. w j t = arg min L(w t L(w t 1 t + 1 αe + j te ) α j ). wl t = wt 1 l, for l 6= j. end for

9 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step

10 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step Small computation per step; well suited for high-dimensional problems

11 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step Small computation per step; well suited for high-dimensional problems Recently shown to enjoy good empirical performance

12 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step Small computation per step; well suited for high-dimensional problems Recently shown to enjoy good empirical performance But at least linear (or worse) dependence of comp. complexity on p!

13 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step Small computation per step; well suited for high-dimensional problems Recently shown to enjoy good empirical performance But at least linear (or worse) dependence of comp. complexity on p! Suppose the optimal solution is sparse (very few coordinates are non-zero)

14 Coordinate Descent (CD) Coordinate Descent Optimize only a single coordinate per step Small computation per step; well suited for high-dimensional problems Recently shown to enjoy good empirical performance But at least linear (or worse) dependence of comp. complexity on p! Suppose the optimal solution is sparse (very few coordinates are non-zero) If CD judiciously chooses coordinate to optimize at each step, can it be expected to leverage potential sparsity of optimum?

15 Greedy Coordinate Descent (GCD) Optimization Problem: min L(w). w2r p Algorithm 3 Greedy Coordinate Gradient Descent Initialize: Set the initial value of w 0. for t =1,... do j = arg max l r l L(w t ). w t = w t 1 1 apple 1 r j L(w t )e j. end for

16 Greedy Coordinate Descent: Analysis Loss L is convex and smooth: L(w) L(v) κ 1 w v 1 Sparse minimizer w : w 0 = s, w B Initialize: w 0 0. for t =1,... do j =argmax l l L(w t ). w t = w t 1 1 κ 1 j L(w t )e j. end for Greedy Coordinate Descent Guarantee: L(w t ) L(w ) κ 1 2 w 0 w 2 1 t = κ 1 2 s 2 B 2 t

17 Greedy CD PRO: No. of iterations avoids costly dependence on dimension p CON: Each GCD iteration (naively implemented) takes Ω(p) time

18 Greedy CD PRO: No. of iterations avoids costly dependence on dimension p CON: Each GCD iteration (naively implemented) takes Ω(p) time Solution: Perform approximate greedy steps via reduction to Approximate Nearest Neighbor (ANN)

19 Greedy CD PRO: No. of iterations avoids costly dependence on dimension p CON: Each GCD iteration (naively implemented) takes Ω(p) time Solution: Perform approximate greedy steps via reduction to Approximate Nearest Neighbor (ANN) allows us to use recent advances in sublinear time ANN search: e.g. locality sensitive hashing (LSH)

20 Fast Greedy and Nearest Neighbor Common objective in statistical learning: L(w) = n l(w T x i,y i ) i=1

21 Fast Greedy and Nearest Neighbor Common objective in statistical learning: L(w) = n i=1 l(w T x i,y i ) j L(w) = x j,r(w) is an inner product between feature j and residual r(w) =(l (w T x i,y i )) n i=1

22 Fast Greedy and Nearest Neighbor Common objective in statistical learning: L(w) = n i=1 l(w T x i,y i ) j L(w) = x j,r(w) is an inner product between feature j and residual r(w) =(l (w T x i,y i )) n i=1 Greedy step needs to compute (assuming x j 2 =1) arg max j [p] x j,r(w t ) arg min j [2p] x j r(w t ) 2 2

23 Fast Greedy and Nearest Neighbor Common objective in statistical learning: L(w) = n i=1 l(w T x i,y i ) j L(w) = x j,r(w) is an inner product between feature j and residual r(w) =(l (w T x i,y i )) n i=1 Greedy step needs to compute (assuming x j 2 =1) arg max j [p] x j,r(w t ) arg min j [2p] x j r(w t ) 2 2 Leverage state-of-the-art in NN search to do this in o(p) time

24 Approximate Greedy CD: Analysis If greedy step has multiplicative approximation factor (1 + ϵ nn )then:either L(w t ) L(w t ) L(w ) ϵ + ϵ nn (1 + ϵ nn ) r(wt ) 2, or 1+ϵ nn ϵ nn (1/ϵ)+1 κ1 w 0 w 2 1 t In summary, convergence rate is K κ1s 2 t

25 Fast Greedy: Computational Complexity If each greedy step costs C t (n, p, ϵ nn ), overall cost C G to accuracy ϵ is: C G = C t (n, p, ϵ nn ) Kκ 1s 2 Preprocessing time C (n, p, ϵ nn )canbe amortized. ϵ

26 Fast Greedy: Computational Complexity Locality Sensitive Hashing: Uses random projections to hash data points such that distant points are unlikely to collide. It gives: (ρ =1/(1 + ϵ nn ) < 1) C t = O (np ρ ) C = O ( ) np 1+ρ ϵ 2 nn

27 Fast Greedy: Computational Complexity Locality Sensitive Hashing: Uses random projections to hash data points such that distant points are unlikely to collide. It gives: (ρ =1/(1 + ϵ nn ) < 1) C t = O (np ρ ) C = O ( ) np 1+ρ ϵ 2 nn Ailon & Chazelle (2006) s method: Uses multiple lookup tables after random projections. It gives: C t = O ( n log n + ϵ 3 nn log 2 p ) ( ) C = O p ϵ 2 nn

28 Fast Greedy: Computational Complexity Locality Sensitive Hashing: Uses random projections to hash data points such that distant points are unlikely to collide. It gives: (ρ =1/(1 + ϵ nn ) < 1) C t = O (np ρ ) C = O ( ) np 1+ρ ϵ 2 nn Ailon & Chazelle (2006) s method: Uses multiple lookup tables after random projections. It gives: C t = O ( n log n + ϵ 3 nn log 2 p ) ( ) C = O p ϵ 2 nn Quad Trees+Random Projections: Under mutual incoherence, using simple quad tree with random Gaussian projections, ( ) we obtain: C t = O p ϵ 2 nn C = O ( ) np log p ϵ 2 nn Mutual incoherence (µ =max i j x i,x j < 1) plays important role in statistical complexity for sparse parameter recovery. Here it is also related to computational complexity.

29 Non-smooth Objectives Smooth plus separable composite objective: (think of R = λ 1 ) min w R p L(w)+R(w) Separable regularizer: R(w) = j R j(w j )

30 Non-smooth Objectives Smooth plus separable composite objective: (think of R = λ 1 ) min w R p L(w)+R(w) Separable regularizer: R(w) = j R j(w j ) If we updated coordinate j: w t+1 j =argmin w gt j(w w t j)+ κ 1 2 (w wt j) 2 +R j (w) Guaranteed descent in objective is κ 1 2 ηj t 2 where ηj t = w t+1 j wj t

31 Modified Greedy for non-smooth objectives Modified greedy algorithm (chooses j with maximum guaranteed descent) Initialize: w 0 0. for t =1,... do j t arg max j [p] ηj t w t+1 w t + η t j t e jt end for Guarantee: L(w t )+R(w t ) L(w ) R(w ) κ 1 2 w 0 w 2 1 t

32 nces w w + η w t+1 j =argmin w gt j(w wj)+ t κ jt e jt end for 1 2 (w wt j) 2 +R j (w) κ 1 Gu ality 2 (w Experiments wt sen- j) 2 +R j (w) Guarantee: Guaranteed descent in objective is κ 1 2 ηj t 2 where L(w tive is κ 1 ηj t = w t+1 j wj t 2 ηj t 2 where L(w t )+R(w t ) L(w ) R(w ) κ 1 w 0 w 2 t ally Three algos. (cyclic CD, greedy CD, greedy CD+LSH), Two loss functions (logistic, squared), l 1 regularization (up to putational LSH 3000 n putation 2000 ry simple 500 orithm for Experiments Three algos. (cyclic CD, greedy CD, greedy CD+LSH), Two los D, greedy CD+LSH), X: standard Two loss Gaussian functions with(logistic, normalized squared), columns, l 1 regularization Y = Xw tr with n=3684, p=10000, k=100 (logistic loss) n=4605, p=100000, k=100 (lo alized columns, Y = 3000 Xw tr with w tr 100-sparse, n = 400 log(p) 3500 Cyclic n=4605, p=100000, k=100 (logistic loss) Greedy.LSH n=5526, p= , k=100 (logistic loss) 2500 Greedy Cyclic Cyclic Greedy.LSH Greedy.LSH 2000 Greedy Greedy Objective Objective Objective n=5526, 10 p= , 15 k= (squared 25500loss) 30 CPU Time (in seconds) Cyclic Objective CPU Time (in second

33 Experiments: Logistic Loss Objective n=3684, p=10000, k=100 (logistic loss) Cyclic Greedy.LSH Greedy Objective n=4605, p=100000, k=100 (logistic loss) Cyclic Greedy.LSH Greedy Objective n=5526, p= , k=100 (logistic loss) Cyclic Greedy.LSH Greedy CPU Time (in seconds) CPU Time (in seconds) CPU Time (in seconds)

34 Experiments: Squared Loss n=3684, p=10000, k=100 (squared loss) Cyclic Greedy.LSH Greedy n=4605, p=100000, k=100 (squared loss) Cyclic Greedy.LSH Greedy n=5526, p= , k=100 (squared loss) Cyclic Greedy.LSH Greedy Objective Objective Objective CPU Time (in seconds) CPU Time (in seconds) CPU Time (in seconds)

35 Summary Optimization Method with sub-linear dependence on p! New connections between computational geometry and first order optimization Interplay between statistical and computational efficiency: mutual incoherence very simple data structure works for ANN New greedy algorithm for composite objectives

High dimensional Ising model selection

High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse