Exploiting Primal and Dual Sparsity for Extreme Classification 1

Size: px

Start display at page:

Download "Exploiting Primal and Dual Sparsity for Extreme Classification 1"

Christina Bryant
5 years ago
Views:

1 Exploiting Primal and Dual Sparsity for Extreme Classification 1 Ian E.H. Yen Joint work with Xiangru Huang, Kai Zhong, Pradeep Ravikumar and Inderjit Dhillon Machine Learning Department Carnegie Mellon University Department of Computer Science University of Texas at Austin 1 PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. ICML / 19

2 Outline 1 Extreme Multiclass & Multilabel Classification 2 Joint Primal and Dual Sparsity 3 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) 4 Experimental Results 2 / 19

3 Extreme Multiclass & Multilabel Classification Goal Learn a function h(x) : R D R K from D input features to K output scores consistent with the ground-truth labels y {0, 1} K. We consider problems with large K (e.g ). Let P(y) = {k y k = 1} and N (y) = {k y k = 0}. Multiclass: P(y) = 1. Multilabel: 0 < P(y) K typically. Linear Classification: h(x) := W T x where W : D K. Easily extended to nonlinear setting via Random Features φ(x). 3 / 19

4 Extreme Multiclass & Multilabel Classification Challenge Standard approaches (1-vs-All, 1-vs-1, Multiclass SVM) require prohibitive Ω(NDK) cost for training/prediction. Existing Approach 1: Low-Rank Embedding h(x) = W T x = (UV T )x. Existing Approach 2: Tree group {w 1, w 2,..., w K } via hierarchy. - Searching for the best tree could be difficult. When structural assumption does not hold lower accuracy than one-vs-all approach. Question If nnz(x) D, low-rank approach could have nnz(v T x) > nnz(x). Can we make model h(x) compact without sacrificing accuracy? 4 / 19

5 Outline 1 Extreme Multiclass & Multilabel Classification 2 Joint Primal and Dual Sparsity 3 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) 4 Experimental Results 5 / 19

6 Dual Sparsity Binary SVM Prediction score z R and label y { 1, 1}. L(z, y) = max(1 yz, 0) Dual solution α 0 1 yz attains the maximum. In practice, L(z, y) > 0 for a significant fraction of samples dual solution α R N has nnz(α) N. Multiclass SVM (Crammer & Singer, 2001) Prediction score z R K and label y [K]. L(z, y) = max (1 + z k z y ) + k [K]\y Dual solution α k z k z y > 0 attains the maximum (k is a Support Label). In practice, dual solution α R N K has nnz(α) NK for large K. 6 / 19

7 Joint Primal & Dual Sparsity Max-Margin Loss for Multilabel (Crammer & Singer, 2003) ( ) L(z, y) = 1 + zkn z kp max k n N (y), k p P(y) + The W R D K obtained from minimization of Max-Margin Loss min W N L(W T x i, y i ) i=1 is determined by the scores of Support Labels: that attain the maximum. (k n, k p ) argmax (1 + z kn z kp ) + k n N (y), k p P(y) 7 / 19

8 Joint Primal & Dual Sparsity Max-Margin Loss (Crammer & Singer, 2003) ( ) L(z, y) = 1 + zkn z kp max k n N (y), k p P(y) + Optimal W should satisfy Nk A constraints where k A is the average #Support Labels per sample (k A K). DK Nk A W is under-determined Can we find a sparse W with minimum loss? 8 / 19

9 Joint Primal & Dual Sparsity Theorem (Joint Primal & Dual Sparsity) For any λ > 0 and {x i } N i=1 drawn from continuous probability distribution, W argmin W λ K w k 1 + k=1 N L(W T x i, y i ) satisfies Dk W = nnz(w ) nnz(a ) = Nk A, where A : N K is the optimal solution of the dual problem. i=1 9 / 19

10 Joint Primal & Dual Sparsity l 1 -l 2 Regularization For ease of optimization, we solve the l 1 -l 2 -regularized objective min W K k=1 1 2 w k 2 + λ w k 1 + C N L(W T x i, y i ), i=1 which gives sparsity pattern similar to l 1 -regularized objective empirically. Sparsity Level when λ is tuned for best accuracy: Data sets k A : #support labels/sample k W : #nonzero/feature EUR-Lex (K=3,956) LSHTC-wiki (K=320,338) LSHTC (K=12,294) aloi.bin (K=1,000) bibtex (K=159) Dmoz (K=11,947) / 19

11 Outline 1 Extreme Multiclass & Multilabel Classification 2 Joint Primal and Dual Sparsity 3 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) 4 Experimental Results 11 / 19

12 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) Dual Form of l 1 -l 2 -regularized problem min α i C i, i [N] G(α) := 1 2 K w k (α k ) 2 + k=1 N e T i α i where w k (α k ) := prox λ. 1 (X T α k ) and C i is a (P i, N i )-bi-simplex. i=1 Smooth objective G(α) + Block-separable constraints α i C i. Minimize w.r.t. one block α i at a time. Challenge: How to identify Support labels and Active Features? Solution: Leverage primal sparsity to search active dual variables. Leverage dual sparsity to update active primal variables. 12 / 19

13 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) Search active set A i via sparse W ; maintain W (α) via sparse α i. O(nnz(x i )nnz(w j ) + nnz(x }{{} i )nnz(α i )) cost per iteration. }{{} search maintain 13 / 19

14 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) Costs O(nnz(X)k W + nnz(x)k A ) per pass of data; O(1/ɛ) passes needed to have 1 N (G(α) G ) ɛ. We know Dk W Nk A so the search time could dominate if D N. In such case, we use sampling techniques to speed up the search step. 14 / 19

15 Dual-BCFW: Efficient Implementation Fast prediction by w k, x i = j nz(x i ) x ijw j to exploit sparsity of both W and x i O(nnz(x i )k W ) (compared to O(nnz(x i )r + rk) in low-rank approach of rank r). Space-efficiency: DK space is not acceptable while maintaining W = prox(x T A) requires random access. We use D hash tables sharing a hash function Evaluate once for nnz(x i ) updates. 15 / 19

16 Outline 1 Extreme Multiclass & Multilabel Classification 2 Joint Primal and Dual Sparsity 3 Dual Block-Coordinate Frank-Wolfe (Dual-BCFW) 4 Experimental Results 16 / 19

17 Experimental Result: Multiclass 1-vs-All takes long time to train; Multi-SVM could be out of memory (> 300G). Low-rank and Tree assumption could hurt accuracy (FastXML ensembles 50 trees to re-gain accuracy). Low-rank could even lead to slower prediction for sparse feature vectors. PD-Sparse reduces training, prediction time by orders of magnitude without sacrificing accuracy. 17 / 19

PD-Sparse takes 1 day to train, and has 10% higher accuracy than tree-based

18 Experimental Result: Multilabel 1-vs-All takes > 3 months to train. Even storing models is a problem ( 870G). PD-Sparse takes 1 day to train, and has 10% higher accuracy than tree-based and low-rank approaches, with orders of magnitude faster prediction. (Please see more results in the paper) 18 / 19

19 Conclusion Extreme Classification is inherently Primal and Dual sparse when N, D, K are large. A Dual BCFW algorithm can identify active dual variables by leveraging sparsity in the primal, and vice versa, thus resulting in complexity sublinear to K. Experiment shows the PD-Sparse approach reduces training and prediction time by orders of magnitude without sacrificing accuracy. 19 / 19

PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification

PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification Ian E. H. Yen 1 * Xiangru Huang 1 * Kai Zhong Pradeep Ravikumar 1, Inderjit S. Dhillon 1, * Both authors