A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

1 A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis. The University of Iowa, Nanjing University, Alibaba Group. February 3, 2017.

2 1 Problem and Challenges 2 The Two-Stage Approach 3 Experimental Results 4 Conclusion

3 Outline 1 Problem and Challenges 2 The Two-Stage Approach 3 Experimental Results 4 Conclusion

4 Problem. Let $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ denote an input-output pair. Let $w^*$ be an optimal model that minimizes the expected error: $w^* = \arg\min_{\|w\|_1 \le B} \frac{1}{2}\,\mathbb{E}_P[(w^T x - y)^2]$. Key problem: $w^*$ is not necessarily sparse. The goal: learn a sparse model $w$ that achieves a small excess risk, $ER(w, w^*) = \mathbb{E}_P[(w^T x - y)^2] - \mathbb{E}_P[(w^{*T} x - y)^2] \le \epsilon$.

5 The challenges. $L(w) = \mathbb{E}_P[(w^T x - y)^2]$ is not necessarily strongly convex. Stochastic optimization: $O(1/\epsilon^2)$ sample complexity and no sparsity guarantee. Empirical risk minimization + $\ell_1$ penalty: $O(1/\epsilon^2)$ sample complexity and no sparsity guarantee. Challenges: Can we reduce the sample complexity (e.g., to $O(1/\epsilon)$)? Can we also guarantee the sparsity of the model? Our solution: a two-stage approach.

6 Outline 1 Problem and Challenges 2 The Two-Stage Approach 3 Experimental Results 4 Conclusion

7 The first stage. Our first-stage algorithm is motivated by the EPOCH-GD algorithm [Hazan and Kale, 2011], which targets the strongly convex setting. How do we avoid the strong convexity assumption? Note that $L(w) = \mathbb{E}_P[(w^T x - y)^2] = h(Aw) + b^T w + c$, where $h(\cdot)$ is a strongly convex function, and the optimal solution set is a polyhedron. By Hoffman's bound we have $2(L(w) - L_*) \ge \frac{1}{\kappa}\|w - w_+\|_2^2$, where $w_+$ is the closest solution to $w$ in the optimal solution set. [1] Elad Hazan, Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization.
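
One concrete way to see this decomposition (my own expansion, not taken from the slides): write $C = \mathbb{E}_P[xx^T]$, $b = -2\,\mathbb{E}_P[y\,x]$, and $c = \mathbb{E}_P[y^2]$. Then the least-squares risk is a strongly convex function composed with a linear map, plus linear and constant terms:

```latex
\begin{aligned}
L(w) &= \mathbb{E}_P\big[(w^T x - y)^2\big]
      = w^T \mathbb{E}_P[x x^T]\, w \;-\; 2\,\mathbb{E}_P[y\,x]^T w \;+\; \mathbb{E}_P[y^2] \\
     &= \underbrace{\big\|C^{1/2} w\big\|_2^2}_{h(Aw),\ \ h(u)=\|u\|_2^2,\ \ A = C^{1/2}}
        \;+\; b^T w \;+\; c .
\end{aligned}
```

Here $h$ is strongly convex in $u$ even when $L$ itself is not strongly convex in $w$ (for instance, when $C$ is rank-deficient).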

8 The first stage (algorithm): Stochastic Optimization for Sparse Learning
Input: the total number of iterations $T$ and $\eta_1, \rho_1, T_1$. Initialization: $w_1^1 = 0$ and $k = 1$.
While $\sum_{i=1}^{k} T_i \le T$:
  For $t = 1, \ldots, T_k$: obtain a sample $(x_t^k, y_t^k)$ and compute
    $w_{t+1}^k = \Pi_{\|w\|_1 \le B,\ \|w - w_1^k\|_2 \le \rho_k}\big[w_t^k - \eta_k \nabla \ell(w_t^k; x_t^k, y_t^k)\big]$
  Update $T_{k+1} = 2T_k$, $\eta_{k+1} = \eta_k/2$, $\rho_{k+1} = \rho_k/\sqrt{2}$, and $w_1^{k+1} = \sum_{t=1}^{T_k} w_t^k / T_k$; set $k = k + 1$.
Output: $\hat{w} = w_1^{m+1}$, where $m$ is the index of the last completed epoch.
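
The sketch below is a minimal Python rendering of this first stage, not the authors' implementation. It assumes the squared loss, draws samples from an iterator, and approximates the projection onto $\{\|w\|_1 \le B\} \cap \{\|w - w_1^k\|_2 \le \rho_k\}$ by first clipping to the $\ell_2$ ball and then projecting onto the $\ell_1$ ball; all names (`first_stage`, `project_feasible`, etc.) are illustrative.

```python
import numpy as np

def project_l1(v, B):
    # Euclidean projection onto the l1 ball of radius B (standard sort-based method).
    if np.abs(v).sum() <= B:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    last = np.nonzero(u * np.arange(1, len(v) + 1) > css - B)[0][-1]
    theta = (css[last] - B) / (last + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_feasible(w, center, rho, B):
    # Heuristic projection onto {||w||_1 <= B} ∩ {||w - center||_2 <= rho}:
    # clip to the l2 ball around the epoch center, then project onto the l1 ball.
    diff = w - center
    nrm = np.linalg.norm(diff)
    if nrm > rho:
        w = center + diff * (rho / nrm)
    return project_l1(w, B)

def first_stage(samples, d, T, B, eta1, rho1, T1):
    # Epoch-style projected SGD for the expected squared loss (sketch).
    w1 = np.zeros(d)                      # epoch center w_1^k
    eta, rho, Tk, used = eta1, rho1, T1, 0
    while used + Tk <= T:
        w, acc = w1.copy(), np.zeros(d)
        for _ in range(Tk):
            x, y = next(samples)
            acc += w                                  # accumulate w_t^k for the epoch average
            grad = 2.0 * (w @ x - y) * x              # gradient of (w^T x - y)^2
            w = project_feasible(w - eta * grad, w1, rho, B)
        w1 = acc / Tk                                 # averaged iterate starts the next epoch
        used += Tk
        Tk, eta, rho = 2 * Tk, eta / 2.0, rho / np.sqrt(2.0)
    return w1                                         # \hat{w}
```

For instance, `samples = iter(zip(X_train, y_train))` with $(\eta_1, \rho_1, T_1)$ set as in the guarantee on the next slide.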

9 The first stage (theoretical guarantee). Theorem. Assume $\|x\|_2^2 \le R^2$. Run the previous algorithm with $\rho_1 = B$, $\eta_1 = 1/(2R\sqrt{T_1})$, and $T_1 \ge (8cR + 64R^2\log(1/\tilde{\delta}))^2$. In order to have $ER(\hat{w}, w^*) \le \epsilon$ with high probability $1 - \delta$ over $\{(x_t^k, y_t^k)\}$, it suffices to have $T = cB^2 T_1/\epsilon$, where $\tilde{\delta} = \delta/m$, $m = \log_2(cB^2/(2\epsilon) + 1)$, and $c = \max(\kappa, 1)$. No strong convexity assumption. No sparsity assumption.
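
To see how the theorem's quantities fit together, here is a tiny illustrative helper; the constants mirror the statement above as I read it, so treat them as approximate rather than authoritative.

```python
import numpy as np

def first_stage_budget(eps, delta, B, R, kappa):
    # Epoch count and sample budget from the first-stage guarantee (approximate sketch).
    c = max(kappa, 1.0)
    m = int(np.ceil(np.log2(c * B ** 2 / (2 * eps) + 1)))   # number of epochs
    delta_tilde = delta / m                                  # per-epoch failure probability
    T1 = int(np.ceil((8 * c * R + 64 * R ** 2 * np.log(1 / delta_tilde)) ** 2))
    T = int(np.ceil(c * B ** 2 * T1 / eps))                  # total number of samples
    eta1 = 1.0 / (2 * R * np.sqrt(T1))
    return m, T1, eta1, T
```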

10 The second stage (algorithm). Our second-stage algorithm: Randomized Sparsification.
For $k = 1, \ldots, K$:
  Sample $i_k \in [d]$ according to $\Pr(i_k = j) = p_j$
  Compute $[\widetilde{w}_k]_{i_k} = [\widetilde{w}_{k-1}]_{i_k} + \hat{w}_{i_k}/(K\,p_{i_k})$
End For
We use $p_j = \dfrac{|\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}}{\sum_{j=1}^d |\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}}$ instead of $p_j = |\hat{w}_j|/\|\hat{w}\|_1$ [Shalev-Shwartz et al., 2010], which reduces the constant in the $O(1/\epsilon)$ bound for sparsity.
[2] Shalev-Shwartz, Srebro, Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints.
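
Below is a minimal Python sketch of this randomized sparsification, not the authors' code. It builds the distribution-dependent probabilities $p_j \propto |\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}$ using empirical second moments from a data matrix and returns an unbiased sparse estimate of $\hat{w}$; the function names and the explicit normalization by $K$ are my own choices.

```python
import numpy as np

def randomized_sparsification(w_hat, X, K, rng=None):
    # Sparsify w_hat by sampling K coordinates with probabilities
    # p_j proportional to |w_hat_j| * sqrt(E[x_j^2])  (distribution-dependent, "DD").
    rng = rng or np.random.default_rng(0)
    second_moment = (X ** 2).mean(axis=0)            # empirical E[x_j^2]
    scores = np.abs(w_hat) * np.sqrt(second_moment)
    p = scores / scores.sum()                        # sampling distribution over [d]
    w_tilde = np.zeros_like(w_hat)
    for i in rng.choice(len(w_hat), size=K, p=p):    # i_1, ..., i_K
        w_tilde[i] += w_hat[i] / (K * p[i])          # unbiased: E[w_tilde] = w_hat
    return w_tilde

def magnitude_sparsification(w_hat, K, rng=None):
    # Magnitude-only baseline ("MG"): p_j = |w_hat_j| / ||w_hat||_1.
    rng = rng or np.random.default_rng(0)
    p = np.abs(w_hat) / np.abs(w_hat).sum()
    w_tilde = np.zeros_like(w_hat)
    for i in rng.choice(len(w_hat), size=K, p=p):
        w_tilde[i] += w_hat[i] / (K * p[i])
    return w_tilde
```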

11 The second stage (theoretical guarantee). Theorem. Given the samples in the first-stage algorithm, let $p_j = \dfrac{|\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}}{\sum_{j=1}^d |\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}}$ for $j \in [d]$ in the second-stage algorithm. In order to have $ER(\widetilde{w}, w^*) \le ER(\hat{w}, w^*) + \epsilon$ with probability $1 - \delta$ over $i_1, \ldots, i_K$, it suffices to have $K = \dfrac{\big(\sum_{j=1}^d |\hat{w}_j|\sqrt{\mathbb{E}[x_j^2]}\big)^2}{\epsilon\delta}$.
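
As a quick illustration of how this bound sets the sampling budget, here is a short hypothetical helper that uses the same empirical second moments as the sketch above.

```python
import numpy as np

def sparsification_budget(w_hat, X, eps, delta):
    # K = (sum_j |w_hat_j| * sqrt(E[x_j^2]))^2 / (eps * delta), per the theorem above.
    scores = np.abs(w_hat) * np.sqrt((X ** 2).mean(axis=0))
    return int(np.ceil(scores.sum() ** 2 / (eps * delta)))
```

For example, `K = sparsification_budget(w_hat, X_train, eps=0.01, delta=0.05)` before calling `randomized_sparsification`.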

12 Outline 1 Problem and Challenges 2 The Two-Stage Approach 3 Experimental Results 4 Conclusion

13 Experimental Results: the first stage. [Figure: RMSE vs. epoch index k on E2006-tfidf and E2006-log1p. Comparison of RMSE between SGD and EPOCH-SGD.]

14 Experimental Results: the second stage. [Figure: RMSE vs. K on E2006-tfidf and E2006-log1p, with the full model as a baseline. Comparison of RMSE between MG-Sparsification and DD-Sparsification.]

15 Experimental Results: overall. [Figure: RMSE vs. sparsity (%) on E2006-tfidf (SpT with K = 500; SpS with B = 1, ..., 5) and E2006-log1p (SpT with K = 5000; SpS with B = 1, ..., 5).]

16 Outline 1 Problem and Challenges 2 The Two-Stage Approach 3 Experimental Results 4 Conclusion

17 Conclusion. We proposed a two-stage approach for learning a sparse model. We reduced the sample complexity from $O(1/\epsilon^2)$ to $O(1/\epsilon)$ without a strong convexity assumption. We reduced the constant in the $O(1/\epsilon)$ sparsity bound by exploiting distribution-dependent sampling. We empirically verified that the proposed approach achieves better performance.
