ECS289: Scalable Machine Learning


1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016

2 Outline
One versus all / one versus one
Ranking loss for multiclass/multilabel classification
Scaling to millions of labels

3 Multiclass Learning
n data points, L labels, d features
Input: training data {(x_i, y_i)}_{i=1}^n:
  Each x_i is a d-dimensional feature vector
  Each y_i ∈ {1, ..., L} is the corresponding label
  Each training point belongs to exactly one category
Goal: find a function f such that f(x) ≈ y predicts the correct label

4 Multi-label Problems
n data points, L labels, d features
Input: training data {(x_i, y_i)}_{i=1}^n:
  Each x_i is a d-dimensional feature vector
  Each y_i ∈ {0, 1}^L is a label vector (equivalently a label set Y_i ⊆ {1, 2, ..., L})
  Example: y_i = [0, 0, 1, 0, 0, 1, 1] (i.e., Y_i = {3, 6, 7})
  Each training point can belong to multiple categories
Goal: given a test sample x, predict its correct labels

5 Illustration
Multiclass: each row of the label matrix has exactly one 1
Multilabel: each row of the label matrix can have multiple 1s

6 Measure the accuracy (multi-class)
Let {(x_i, y_i)}_{i=1}^m be a set of testing data for multi-class problems
Let z_1, ..., z_m be the predictions for the testing data
Accuracy (I(·) is the indicator function):
    (1/m) Σ_{i=1}^m I(y_i = z_i)

7 Measure the accuracy (multi-class)
Let {(x_i, y_i)}_{i=1}^m be a set of testing data for multi-class problems
Let z_1, ..., z_m be the predictions for the testing data
Accuracy (I(·) is the indicator function):
    (1/m) Σ_{i=1}^m I(y_i = z_i)
If the algorithm outputs a set Z_i of k potential labels for each sample, Precision@k:
    (1/m) Σ_{i=1}^m I(y_i ∈ Z_i)
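
A minimal NumPy sketch of these two metrics (the function names and toy data are illustrative, not from the slides):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of test points whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def precision_at_k_multiclass(y_true, Z):
    """y_true[i] is the true label; Z[i] is a set of k candidate labels."""
    return float(np.mean([y in z for y, z in zip(y_true, Z)]))

# Toy check: 2 of 4 exact predictions, 2 of 4 true labels inside the top-2 sets.
print(accuracy([2, 0, 1, 1], [2, 1, 1, 0]))                                      # 0.5
print(precision_at_k_multiclass([2, 0, 1, 1], [{2, 0}, {1, 2}, {1, 0}, {0, 2}])) # 0.5
```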

8-11 Example (multiclass): worked examples shown as figures on the original slides

12 Measure the accuracy (multi-label)
Let {(x_i, y_i)}_{i=1}^m be a set of testing data for multi-label problems
For each testing point, the classifier outputs z_i ∈ {0, 1}^L
We use Y_i = {t : (y_i)_t ≠ 0} and Z_i = {t : (z_i)_t ≠ 0} to denote the sets of true and predicted labels for the i-th testing point.
If each Z_i has k elements, then Precision@k is defined by:
    (1/m) Σ_{i=1}^m |Z_i ∩ Y_i| / k
Hamming loss (overall classification performance):
    (1/m) Σ_{i=1}^m d(y_i, z_i),
where d(y_i, z_i) is the number of positions where y_i and z_i differ.
ROC curve (assuming the classifier predicts a ranking of labels for each data point)
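
A small NumPy sketch of Precision@k and the Hamming loss for the multi-label case (names and toy data are illustrative):

```python
import numpy as np

def precision_at_k(Y, scores, k):
    """Y: (m, L) 0/1 true label matrix; scores: (m, L) predicted label scores.
    Take the k highest-scoring labels per row and count how many are truly positive."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(Y, topk, axis=1).sum(axis=1)
    return float(np.mean(hits / k))

def hamming_loss(Y, Z):
    """Average number of label positions where prediction and truth differ."""
    return float(np.mean(np.sum(Y != Z, axis=1)))

Y = np.array([[0, 0, 1, 0, 0, 1, 1],
              [1, 0, 0, 1, 0, 0, 0]])
scores = np.array([[0.1, 0.2, 0.9, 0.0, 0.3, 0.8, 0.4],
                   [0.7, 0.1, 0.2, 0.6, 0.0, 0.1, 0.3]])
print(precision_at_k(Y, scores, k=3))               # (3/3 + 2/3) / 2 = 0.833...
print(hamming_loss(Y, (scores > 0.5).astype(int)))  # (1 + 0) / 2 = 0.5
```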

13-15 Example (multilabel): worked examples shown as figures on the original slides

16 Traditional Approach
Many algorithms for binary classification
Idea: transform multi-class or multi-label problems to multiple binary classification problems
Two approaches: One versus All (OVA), One versus One (OVO)

17 One Versus All (OVA)
Multi-class/multi-label problems with L categories
Build L different binary classifiers
For the t-th classifier:
  Positive samples: all the points in class t ({x_i : t ∈ y_i})
  Negative samples: all the points not in class t ({x_i : t ∉ y_i})
f_t(x): the decision value of the t-th classifier (larger f_t(x) means higher probability that x is in class t)
Prediction: f(x) = argmax_t f_t(x)
Example: using SVM to train each binary classifier.
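
A sketch of OVA with scikit-learn's LinearSVC as the base binary classifier (assuming scikit-learn is available; the helper names are mine):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, labels, C=1.0):
    """One binary linear SVM per label t: class-t points are positives, the rest negatives."""
    classifiers = {}
    for t in labels:
        clf = LinearSVC(C=C)
        clf.fit(X, (y == t).astype(int))      # 1 for class t, 0 otherwise
        classifiers[t] = clf
    return classifiers

def predict_ova(classifiers, X):
    """f(x) = argmax_t f_t(x) over the per-label decision values."""
    labels = list(classifiers)
    scores = np.column_stack([classifiers[t].decision_function(X) for t in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```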

18 One Versus One (OVO)
Multi-class/multi-label problems with L categories
Build L(L-1)/2 different binary classifiers, one per pair of labels
For the (s, t)-th classifier:
  Positive samples: all the points in class s ({x_i : s ∈ y_i})
  Negative samples: all the points in class t ({x_i : t ∈ y_i})
f_{s,t}(x): the decision value of this classifier (larger f_{s,t}(x) means label s is more likely than label t)
f_{t,s}(x) = -f_{s,t}(x)
Prediction: f(x) = argmax_s Σ_{t≠s} f_{s,t}(x)
Example: using SVM to train each binary classifier. Mainly used for multi-class problems.
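
A corresponding OVO sketch (kernel SVMs via scikit-learn's SVC; the summation-based prediction follows the slide, helper names are mine):

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def train_ovo(X, y, labels, C=1.0):
    """One binary classifier per label pair (s, t), trained only on points from class s or t."""
    classifiers = {}
    for s, t in itertools.combinations(labels, 2):
        mask = (y == s) | (y == t)
        clf = SVC(kernel="rbf", C=C)
        clf.fit(X[mask], (y[mask] == s).astype(int))   # 1 means "looks like class s"
        classifiers[(s, t)] = clf
    return classifiers

def predict_ovo(classifiers, X, labels):
    """Sum the pairwise decision values f_{s,t}(x) for each label and take the argmax."""
    votes = {l: np.zeros(X.shape[0]) for l in labels}
    for (s, t), clf in classifiers.items():
        f = clf.decision_function(X)    # positive favors s, negative favors t (f_{t,s} = -f_{s,t})
        votes[s] += f
        votes[t] -= f
    scores = np.column_stack([votes[l] for l in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```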

19 OVA vs OVO
Prediction accuracy: depends on the dataset
Computational time:
  OVA needs to train L classifiers
  OVO needs to train L(L-1)/2 classifiers
Is OVA always faster than OVO?

20 OVA vs OVO
Prediction accuracy: depends on the dataset
Computational time:
  OVA needs to train L classifiers
  OVO needs to train L(L-1)/2 classifiers
Is OVA always faster than OVO?
NO, it depends on the time complexity of the binary classifier
  If the binary classifier requires O(n) time for n samples: OVA and OVO have similar time complexity
  If the binary classifier requires O(n^{1.xx}) time: OVO is faster than OVA
  (each OVO subproblem only uses the data from two classes, so many small problems can be cheaper than L full-size ones)

21 OVA vs OVO
Prediction accuracy: depends on the dataset
Computational time:
  OVA needs to train L classifiers
  OVO needs to train L(L-1)/2 classifiers
Is OVA always faster than OVO?
NO, it depends on the time complexity of the binary classifier
  If the binary classifier requires O(n) time for n samples: OVA and OVO have similar time complexity
  If the binary classifier requires O(n^{1.xx}) time: OVO is faster than OVA
LIBSVM (kernel SVM solver): OVO
LIBLINEAR (linear SVM solver): OVA

22 Comparisons (accuracy) For kernel SVM with RBF kernel (See A comparison of methods for multiclass support vector machines, 2002)

23 Comparisons (training time) For kernel SVM (with RBF kernel) (See A comparison of methods for multiclass support vector machines, 2002)

24 Methods Using Instance-wise Ranking Loss

25 Main idea
OVA and OVO decompose the problem by labels
But good binary classifiers may not imply good multi-class prediction:
  both OVA and OVO decompose the problem into individual labels, so they cannot capture the ranking information
Ranking-based approaches directly minimize the ranking loss:
  For multiclass classification, the score of y_i should be larger than the scores of all other labels
  For multilabel classification, the scores of the labels in Y_i should be larger than the scores of all other labels
Solve one combined optimization problem that minimizes the ranking loss

26 Main idea
For simplicity, we assume a linear model
Model parameters: w_1, ..., w_L
For each data point x, compute the decision values for all labels: w_1^T x, w_2^T x, ..., w_L^T x
Prediction: y = argmax_t w_t^T x
For training data x_i with true label y_i, we want y_i = argmax_t w_t^T x_i

27 Main idea
Question: how to define the distance between XW and Y?

28 Weston-Watkins Formulation
Proposed in Weston and Watkins, "Multi-class support vector machines", ESANN.
    min_{w_t, ξ_i^t}  (1/2) Σ_{t=1}^L ||w_t||^2 + C Σ_{i=1}^n Σ_{t≠y_i} ξ_i^t
    s.t.  w_{y_i}^T x_i - w_t^T x_i ≥ 1 - ξ_i^t,  ξ_i^t ≥ 0,  ∀ t ≠ y_i, i = 1, ..., n
If point i is in class y_i, then for every other label t ≠ y_i we want w_{y_i}^T x_i - w_t^T x_i ≥ 1, or we pay a penalty ξ_i^t
Prediction: f(x) = argmax_t w_t^T x
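
Since ξ_i^t = max(0, 1 - (w_{y_i}^T x_i - w_t^T x_i)) at the optimum, the objective can be evaluated directly from the unconstrained form; a NumPy sketch (names are illustrative):

```python
import numpy as np

def weston_watkins_objective(W, X, y, C=1.0):
    """W: (L, d) with row t = w_t; X: (n, d); y: (n,) integer labels in {0, ..., L-1}."""
    scores = X @ W.T                                           # (n, L), entry (i, t) = w_t^T x_i
    margins = scores[np.arange(len(y)), y][:, None] - scores   # w_{y_i}^T x_i - w_t^T x_i
    xi = np.maximum(0.0, 1.0 - margins)                        # one slack per (i, t)
    xi[np.arange(len(y)), y] = 0.0                             # no slack for t = y_i
    return 0.5 * np.sum(W ** 2) + C * np.sum(xi)
```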

29 Weston-Watkins Formulation (dual)
The dual problem of the Weston-Watkins formulation:
    min_{α}  (1/2) Σ_{t=1}^L ||w_t(α)||^2 + Σ_{i=1}^n Σ_{t≠y_i} α_i^t
    s.t.  -C ≤ α_i^t ≤ 0,  ∀ t ≠ y_i, i = 1, ..., n
α^1, ..., α^L ∈ R^n: dual variables for each label
w_t(α) = Σ_i α_i^t x_i, and α_i^{y_i} = -Σ_{t≠y_i} α_i^t
Can be solved by dual (block) coordinate descent

30 Crammer-Singer Formulation
Proposed in Crammer and Singer, "On the algorithmic implementation of multiclass kernel-based vector machines", JMLR.
    min_{w_t, ξ_i}  (1/2) Σ_{t=1}^L ||w_t||^2 + C Σ_{i=1}^n ξ_i
    s.t.  w_{y_i}^T x_i - w_t^T x_i ≥ 1 - ξ_i,  ∀ t ≠ y_i, i = 1, ..., n
          ξ_i ≥ 0,  i = 1, ..., n
If point i is in class y_i, then for every other label t ≠ y_i we want w_{y_i}^T x_i - w_t^T x_i ≥ 1
For each point i, we only pay the largest penalty: ξ_i = max_{t≠y_i} max(0, 1 - (w_{y_i}^T x_i - w_t^T x_i))
Prediction: f(x) = argmax_t w_t^T x
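
The only change relative to the Weston-Watkins sketch above is that each point pays just its single largest violation; a sketch under the same conventions:

```python
import numpy as np

def crammer_singer_objective(W, X, y, C=1.0):
    """Same layout as the Weston-Watkins sketch, but only the worst violation per point counts."""
    scores = X @ W.T
    margins = scores[np.arange(len(y)), y][:, None] - scores
    xi = np.maximum(0.0, 1.0 - margins)
    xi[np.arange(len(y)), y] = 0.0
    return 0.5 * np.sum(W ** 2) + C * np.sum(xi.max(axis=1))   # xi_i = max_{t != y_i} (...)
```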

31 Crammer-Singer Formulation (dual)
The dual problem of the Crammer-Singer formulation:
    min_{α}  (1/2) Σ_{t=1}^L ||w_t(α)||^2 + Σ_{i=1}^n Σ_{t≠y_i} α_i^t
    s.t.  α_i^t ≤ C_{y_i}^t ∀ t,   Σ_{t=1}^L α_i^t = 0,  i = 1, 2, ..., n
          (where C_{y_i}^t = C if t = y_i and 0 otherwise)
α^1, ..., α^L ∈ R^n: dual variables for each label
w_t(α) = Σ_i α_i^t x_i, and α_i^{y_i} = -Σ_{t≠y_i} α_i^t (from the sum-to-zero constraint)
Can be solved by dual (block) coordinate descent

32 Comparisons (See A Sequential Dual Method for Large Scale Multi-Class Linear SVMs, KDD 2008)

33 Scaling to a huge number of labels: Extreme Classification

34 Challenges: large number of categories
Multi-label (or multi-class) classification with a large number of labels
Image classification: a very large number of possible labels
Recommending tags for articles: millions of labels (tags)

35 Challenges: large number of categories
Consider a problem with 1 million labels (L = 1,000,000)
One versus all: reduces to 1 million binary problems
Training: 1 million binary classification problems. Need about 694 days if each binary problem can be solved in 1 minute.
Model size: 1 million models. Need about 1 TB if each model requires 1 MB.
Prediction for one test sample: 1 million binary predictions. Need 1000 seconds if each binary prediction takes 10^{-3} seconds.
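
The numbers on this slide follow directly from the assumed per-classifier costs; a quick check:

```python
L = 1_000_000
print(L * 1 / (60 * 24))     # 1 minute per binary problem  -> ~694 training days
print(L * 1 / 1_000_000)     # 1 MB per model               -> ~1 TB total model size
print(L * 1e-3)              # 1 ms per binary prediction   -> 1000 seconds per test point
```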

36 Extreme Classification Datasets
See downloads/xc/xmlrepository.html

37 Several Approaches
Today we will discuss:
  Label space reduction by Compressed Sensing (CS)
  Feature space reduction (PCA)
  Supervised latent feature model by matrix factorization
Other approaches:
  Tree-based algorithms (random forest)
  Neural networks?

38 Compressed Sensing
If A ∈ R^{n×d}, w ∈ R^d, y ∈ R^n, and Aw = y:
Given A and y, can we recover w?

39 Compressed Sensing
If A ∈ R^{n×d}, w ∈ R^d, y ∈ R^n, and Aw = y:
Given A and y, can we recover w?
When n ≥ d: usually yes, because the number of constraints ≥ the number of variables (over-determined linear system)

40 Compressed Sensing
If A ∈ R^{n×d}, w ∈ R^d, y ∈ R^n, and Aw = y:
Given A and y, can we recover w?
When n ≪ d: no in general, because the number of constraints ≪ the number of variables (high-dimensional regression, under-determined linear system, ...)

41 Compressed Sensing
However, if we know w is a sparse vector (with s nonzero elements), and A satisfies certain conditions, then we can recover w even in the high-dimensional setting.

42 Compressed Sensing
Conditions: w is s-sparse and A satisfies the Restricted Isometry Property (RIP)
  Example: all entries of A are generated i.i.d. from a Gaussian N(0, σ)
Then w can be recovered from O(s log d) samples, e.g., by
  Lasso (ℓ1-regularized linear regression)
  Orthogonal Matching Pursuit (OMP)
  ...
How is this related to multilabel classification?
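
A small sanity check of sparse recovery with a Gaussian sensing matrix, using Lasso and OMP from scikit-learn (dimensions and parameters are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, d, s = 60, 200, 5                                 # far fewer measurements than dimensions
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))   # i.i.d. Gaussian sensing matrix
w = np.zeros(d)
w[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)   # s-sparse ground truth
y = A @ w

# Recover w from (A, y) even though the linear system is under-determined (n << d).
w_omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s).fit(A, y).coef_
w_lasso = Lasso(alpha=1e-3, max_iter=50000).fit(A, y).coef_
print(np.linalg.norm(w - w_omp), np.linalg.norm(w - w_lasso))  # both errors should be small
```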

43 Label Space Reduction by Compressed Sensing
Proposed in Hsu et al., "Multi-Label Prediction via Compressed Sensing", NIPS.
Main idea: reduce the label space from R^L to R^k, where k ≪ L:
    z_i = M y_i,  where M ∈ R^{k×L}
If we can recover y_i by knowing z_i and M, then we only need to learn a function that maps x_i to z_i: f(x_i) ≈ z_i

44 Label Space Reduction by Compressed Sensing
Proposed in Hsu et al., "Multi-Label Prediction via Compressed Sensing", NIPS.
Main idea: reduce the label space from R^L to R^k, where k ≪ L:
    z_i = M y_i,  where M ∈ R^{k×L}
If we can recover y_i by knowing z_i and M, then we only need to learn a function that maps x_i to z_i: f(x_i) ≈ z_i
By compressed sensing, y_i can be recovered from z_i and M even when k = O(s log L)
  s is the number of nonzeros in y_i: usually very small in practice

45 Label Space Reduction by Compressed Sensing
Training:
  Step 1: Construct M with i.i.d. Gaussian entries
  Step 2: Compute z_i = M y_i for all i = 1, 2, ..., n
  Step 3: Learn a function f such that f(x_i) ≈ z_i
Prediction for a test sample x:
  Step 1: Compute z = f(x)
  Step 2: Compute y ∈ R^L by solving a Lasso problem (recover a sparse y with M y ≈ z)
  Step 3: Threshold y to give the predicted labels
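
An end-to-end sketch of this training/prediction recipe; Ridge as the regressor f and OMP for the recovery step are my own choices, the slides only require some regressor and some sparse recovery method:

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

def cs_train(X, Y, k, seed=0):
    """Compress labels: z_i = M y_i with a Gaussian M (k << L), then learn f(x) ~ z."""
    M = np.random.default_rng(seed).normal(size=(k, Y.shape[1])) / np.sqrt(k)
    Z = Y @ M.T                              # (n, k) compressed label vectors
    f = Ridge(alpha=1.0).fit(X, Z)           # any multi-output regressor works here
    return M, f

def cs_predict(M, f, x, s):
    """Predict z = f(x), recover a sparse y with M y ~ z, then threshold."""
    z = f.predict(x.reshape(1, -1)).ravel()
    y_hat = OrthogonalMatchingPursuit(n_nonzero_coefs=s).fit(M, z).coef_
    return (y_hat > 0.5).astype(int)
```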

46 Label Space Reduction by Compressed Sensing
Reduces the label size from L to O(s log L)
Drawbacks:
  1. Slow prediction time (need to solve a Lasso problem for every prediction)
  2. Large error in practice

47 Another way to understand this algorithm
X ∈ R^{n×d}: data matrix, each row is a data point x_i
Y ∈ R^{n×L}: label matrix, each row is y_i
M ∈ R^{L×k}: Gaussian random matrix
Goal: find a matrix U such that XU ≈ YM

48 Another way to understand this algorithm
X ∈ R^{n×d}: data matrix, each row is a data point x_i
Y ∈ R^{n×L}: label matrix, each row is y_i
M ∈ R^{L×k}: Gaussian random matrix
M†: pseudo-inverse of M
Goal: find a matrix U such that XUM† ≈ YMM†

49 Feature space reduction
Another way to improve speed: reduce the feature space
Dimension reduction from the d-dimensional space to a k-dimensional space:
    x_i → x̃_i = N x_i,  where N ∈ R^{k×d} and k ≪ d
Matrix form: X̃ = X N^T
Multilabel learning on the reduced feature space: learn V such that X̃ V ≈ Y
Time complexity: reduced by a factor of d/k
Several ways to choose N: PCA, random projection, ...
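
A sketch of the feature-space-reduction pipeline with PCA playing the role of N (Ridge is an arbitrary stand-in for the multilabel learner; the helper names are mine):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

def train_reduced(X, Y, k):
    """Project x_i from d to k dimensions, then fit the multilabel model in the reduced space."""
    pca = PCA(n_components=k).fit(X)
    X_red = pca.transform(X)                # plays the role of X N^T
    model = Ridge(alpha=1.0).fit(X_red, Y)  # learn V such that X_red V ~ Y
    return pca, model

def predict_reduced(pca, model, X_test):
    return model.predict(pca.transform(X_test))
```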

50 Another way to understand this algorithm
X ∈ R^{n×d}: data matrix, each row is a data point x_i
Y ∈ R^{n×L}: label matrix, each row is y_i
N ∈ R^{k×d}: matrix for dimension reduction
Goal: find a matrix V such that X N^T V ≈ Y

51 Supervised latent space model (by matrix factorization)
Label space reduction: find U such that X U V^T ≈ Y (with V fixed by the random projection)
Feature space reduction: find V such that X U V^T ≈ Y (with U fixed by the dimension reduction)
How to improve? Find the best U and V simultaneously
Proposed in:
  Chen and Lin, "Feature-aware label space dimension reduction for multi-label classification", NIPS.
  Xu et al., "Speedup Matrix Completion with Side Information: Application to Multi-Label Learning", NIPS.
  Yu et al., "Large-scale Multi-label Learning with Missing Labels", ICML 2014.

52 Low Rank Modeling for Multilabel Classification
Let X ∈ R^{n×d} be the data matrix and Y ∈ R^{n×L} be the 0-1 label matrix
Find U ∈ R^{d×k} and V ∈ R^{L×k} such that Y ≈ X U V^T
Obtain U, V by solving the following optimization problem:
    min_{U ∈ R^{d×k}, V ∈ R^{L×k}}  ||Y - X U V^T||_F^2 + λ1 ||U||_F^2 + λ2 ||V||_F^2
where λ1, λ2 are the regularization parameters
Solve by alternating minimization.
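
A compact sketch of fitting Y ≈ XUV^T; the slides use alternating minimization, but plain gradient steps on 0.5||Y - XUV^T||_F^2 + 0.5 λ1||U||_F^2 + 0.5 λ2||V||_F^2 keep the example short (learning rate and iteration count are arbitrary):

```python
import numpy as np

def low_rank_fit(X, Y, k, lam1=0.1, lam2=0.1, step=1e-3, iters=500, seed=0):
    """Gradient descent on 0.5*||Y - X U V^T||_F^2 + 0.5*lam1*||U||_F^2 + 0.5*lam2*||V||_F^2."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.01, size=(X.shape[1], k))
    V = rng.normal(scale=0.01, size=(Y.shape[1], k))
    for _ in range(iters):
        R = X @ U @ V.T - Y                 # residual, shape (n, L)
        grad_U = X.T @ (R @ V) + lam1 * U   # exact gradient of the objective w.r.t. U
        grad_V = R.T @ (X @ U) + lam2 * V   # exact gradient w.r.t. V
        U -= step * grad_U
        V -= step * grad_V
    return U, V
```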

53 Low Rank Modeling for Multilabel Classification
Time complexity: O(nkL + ndk) per iteration overall
Original time complexity: O(ndL) per iteration

54 Extensions
Extend to a general loss function:
    min_{U ∈ R^{d×k}, V ∈ R^{L×k}}  Σ_{i,j} dis(Y_{ij}, (X U V^T)_{ij}) + λ1 ||U||_F^2 + λ2 ||V||_F^2
Can handle problems with missing data (Ω is the set of observed entries):
    min_{U ∈ R^{d×k}, V ∈ R^{L×k}}  Σ_{(i,j)∈Ω} dis(Y_{ij}, (X U V^T)_{ij}) + λ1 ||U||_F^2 + λ2 ||V||_F^2

55 Model size and Prediction time
Model size:
  Low-rank model: O(dk + kL)
  Classical multi-label (OVA): O(dL)
Prediction time:
  Low-rank model: O(dk + kL) per sample
  Classical multi-label (OVA): O(dL) per sample

56 More on prediction time: MIPS
Both the OVA model and the low-rank model can be reduced to a MIPS (Maximum Inner Product Search) problem:
  Given a collection of n candidate vectors {h_j ∈ R^k : j = 1, ..., n} and a query point h, find argmax_j h_j^T h
OVA: L candidate vectors, each one a d-dimensional vector
Low-rank extreme classification: L candidate vectors, each one a k-dimensional vector
Matrix completion: n (number of items) candidate vectors, each one a k-dimensional vector
(Will be discussed in detail later)
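
The exhaustive baseline for MIPS is a single matrix-vector product over all candidates; fast approximate MIPS methods exist to avoid this O(Lk) scan (a brute-force sketch):

```python
import numpy as np

def mips_brute_force(H, q):
    """Return the index j maximizing the inner product h_j^T q over all candidate rows of H."""
    return int(np.argmax(H @ q))

H = np.random.randn(1000, 16)   # L candidate vectors, e.g., the rows of V
q = np.random.randn(16)         # query vector, e.g., U^T x for a test point
print(mips_brute_force(H, q))
```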

57 Other extreme classification solvers
Random forest:
  Prabhu and Varma, "FastXML: A Fast, Accurate and Stable Tree-classifier for Extreme Multi-label Learning", KDD.
  Jain et al., "Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications", KDD.
Local embedding learning:
  Bhatia et al., "Sparse Local Embeddings for Extreme Multi-label Classification", NIPS.
Low-rank modeling:
  Weston et al., "Label Partitioning for Sublinear Ranking", ICML.
  Yu et al., "Large-scale Multi-label Learning with Missing Labels", ICML.
Crammer-Singer form:
  Yen et al., "PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification", ICML 2016.

58 Coming up
Questions?
