ECS289: Scalable Machine Learning


1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

2 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels

3 Multiclass Learning n data points, L labels, d features. Input: training data $\{x_i, y_i\}_{i=1}^n$, where each $x_i$ is a d-dimensional feature vector and each $y_i \in \{1, \dots, L\}$ is the corresponding label. Each training point belongs to exactly one category. Goal: find a function to predict the correct label, $f(x) \approx y$.

4 Multi-label Problems n data points, L labels, d features. Input: training data $\{x_i, y_i\}_{i=1}^n$, where each $x_i$ is a d-dimensional feature vector and each $y_i \in \{0, 1\}^L$ is a label vector (equivalently, a label set $Y_i \subseteq \{1, 2, \dots, L\}$). Example: $y_i = [0, 0, 1, 0, 0, 1, 1]$ (or $Y_i = \{3, 6, 7\}$). Each training point can belong to multiple categories. Goal: given a testing sample x, predict the correct labels.

5 Illustration Multiclass: each row of the label matrix has exactly one 1. Multilabel: each row of the label matrix can have multiple ones.

6 Measure the accuracy (multi-class) Let $\{x_i, y_i\}_{i=1}^m$ be a set of testing data for multi-class problems, and let $z_1, \dots, z_m$ be the predictions for each testing point. Accuracy: $\frac{1}{m}\sum_{i=1}^m I(y_i = z_i)$.

7 Measure the accuracy (multi-class) Let $\{x_i, y_i\}_{i=1}^m$ be a set of testing data for multi-class problems, and let $z_1, \dots, z_m$ be the predictions for each testing point. Accuracy: $\frac{1}{m}\sum_{i=1}^m I(y_i = z_i)$. If the algorithm instead outputs a set of k potential labels $Z_1, Z_2, \dots, Z_m$ (each $Z_i$ is a set of k labels), Precision@k: $\frac{1}{m}\sum_{i=1}^m I(y_i \in Z_i)$.
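Both multiclass metrics are easy to compute. Below is a minimal NumPy sketch (the function names and the score-matrix layout are my own, not from the slides): accuracy compares a single predicted label per point, while Precision@k checks whether the true label appears among the k highest-scoring labels.

```python
import numpy as np

def multiclass_accuracy(y_true, y_pred):
    # 1/m * sum_i I(y_i = z_i)
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def multiclass_precision_at_k(y_true, scores, k=5):
    # scores[i, t]: decision value of label t for sample i
    # 1/m * sum_i I(y_i in Z_i), where Z_i = top-k labels by score
    topk = np.argsort(-np.asarray(scores), axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))
```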

8-11 Example (multiclass): worked examples (figures only)

12 Measure the accuracy (multi-label) Let $\{x_i, y_i\}_{i=1}^m$ be a set of testing data for multi-label problems. For each testing point, the classifier outputs $z_i \in \{0, 1\}^L$. We use $Y_i = \{t : (y_i)_t \neq 0\}$ and $Z_i = \{t : (z_i)_t \neq 0\}$ to denote the sets of true and predicted labels for the i-th testing point. If each $Z_i$ has k elements, then Precision@k is defined by $\frac{1}{m}\sum_{i=1}^m \frac{|Z_i \cap Y_i|}{k}$. Hamming loss (overall classification performance): $\frac{1}{m}\sum_{i=1}^m d(y_i, z_i)$, where $d(y_i, z_i)$ is the number of positions where $y_i$ and $z_i$ differ. ROC curve (assumes the classifier predicts a ranking of labels for each data point).
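A small sketch of the multilabel metrics along the same lines (again with hypothetical function names; the slides do not prescribe an implementation): Precision@k averages $|Z_i \cap Y_i|/k$ over test points, and the Hamming loss counts disagreeing label positions.

```python
import numpy as np

def multilabel_precision_at_k(Y_true, scores, k=5):
    # Y_true: m x L binary matrix; scores[i, t]: predicted score of label t for sample i
    Y_true = np.asarray(Y_true)
    topk = np.argsort(-np.asarray(scores), axis=1)[:, :k]              # Z_i = top-k labels
    hits = [Y_true[i, topk[i]].sum() for i in range(Y_true.shape[0])]  # |Z_i intersect Y_i|
    return float(np.mean(hits) / k)

def hamming_loss(Y_true, Y_pred):
    # 1/m * sum_i d(y_i, z_i), with d counting positions where y_i and z_i differ
    return float(np.mean(np.sum(np.asarray(Y_true) != np.asarray(Y_pred), axis=1)))
```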

13-15 Example (multilabel): worked examples (figures only)

16 Traditional Approach Many algorithms exist for binary classification. Idea: transform a multi-class or multi-label problem into multiple binary classification problems. Two approaches: One versus All (OVA) and One versus One (OVO).

17 One Versus All (OVA) Multi-class/multi-label problems with L categories. Build L different binary classifiers. For the t-th classifier: positive samples are all the points in class t ($\{x_i : t \in y_i\}$); negative samples are all the points not in class t ($\{x_i : t \notin y_i\}$). $f_t(x)$: the decision value of the t-th classifier (larger $f_t(x)$ means higher probability that x is in class t). Prediction: $f(x) = \arg\max_t f_t(x)$. Example: use SVM to train each binary classifier.
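A minimal OVA sketch for the multiclass case, assuming scikit-learn's LinearSVC as the base binary learner (any binary classifier exposing decision values would work; the function names are mine):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, L, C=1.0):
    y = np.asarray(y)
    models = []
    for t in range(L):
        clf = LinearSVC(C=C)
        clf.fit(X, (y == t).astype(int))   # positives: class t; negatives: all other points
        models.append(clf)
    return models

def predict_ova(models, X):
    # f_t(x) = decision value of the t-th classifier; predict argmax_t f_t(x)
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```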

18 One Versus One (OVO) Multi-class/multi-label problems with L categories. Build L(L-1)/2 different binary classifiers, one for each pair of labels. For the (s, t)-th classifier: positive samples are all the points in class s ($\{x_i : s \in y_i\}$); negative samples are all the points in class t ($\{x_i : t \in y_i\}$). $f_{s,t}(x)$: the decision value of this classifier (larger $f_{s,t}(x)$ means label s is more likely than label t); $f_{t,s}(x) = -f_{s,t}(x)$. Prediction: $f(x) = \arg\max_s \big(\sum_{t \neq s} f_{s,t}(x)\big)$. Example: use SVM to train each binary classifier. Not suitable for multilabel problems.
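A corresponding OVO sketch for the multiclass case, under the same assumptions (one binary SVM per pair of labels, decision values accumulated with $f_{t,s} = -f_{s,t}$):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_ovo(X, y, L, C=1.0):
    y = np.asarray(y)
    models = {}
    for s, t in combinations(range(L), 2):
        mask = (y == s) | (y == t)
        clf = LinearSVC(C=C)
        clf.fit(X[mask], (y[mask] == s).astype(int))  # positives: class s; negatives: class t
        models[(s, t)] = clf
    return models

def predict_ovo(models, X, L):
    scores = np.zeros((X.shape[0], L))
    for (s, t), clf in models.items():
        f = clf.decision_function(X)   # f_{s,t}(x)
        scores[:, s] += f
        scores[:, t] -= f              # f_{t,s}(x) = -f_{s,t}(x)
    return np.argmax(scores, axis=1)
```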

19 OVA vs OVO Prediction accuracy: depends on the dataset. Computational time: OVA needs to train L classifiers; OVO needs to train L(L-1)/2 classifiers. Is OVA always faster than OVO?

20 OVA vs OVO Prediction accuracy: depends on the dataset. Computational time: OVA needs to train L classifiers; OVO needs to train L(L-1)/2 classifiers. Is OVA always faster than OVO? No, it depends on the time complexity of the binary classifier. If the binary classifier requires O(n) time for n samples, OVA and OVO have similar time complexity. If the binary classifier requires $O(n^{1.xx})$ (superlinear) time, OVO is faster than OVA.

21 OVA vs OVO Prediction accuracy: depends on the dataset. Computational time: OVA needs to train L classifiers; OVO needs to train L(L-1)/2 classifiers. Is OVA always faster than OVO? No, it depends on the time complexity of the binary classifier. If the binary classifier requires O(n) time for n samples, OVA and OVO have similar time complexity. If the binary classifier requires $O(n^{1.xx})$ (superlinear) time, OVO is faster than OVA. LIBSVM (kernel SVM solver): OVO. LIBLINEAR (linear SVM solver): OVA.
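A rough back-of-the-envelope calculation behind this claim, assuming roughly balanced classes (about n/L points per class), so each OVO subproblem sees about 2n/L points while each OVA subproblem sees all n points. With a binary solver that costs $O(n^c)$ on n samples:

```latex
\text{OVA: } L \cdot O(n^{c}), \qquad
\text{OVO: } \frac{L(L-1)}{2} \cdot O\!\Big(\big(\tfrac{2n}{L}\big)^{c}\Big)
           \approx O\!\big(2^{\,c-1} L^{\,2-c} n^{c}\big).
% c = 1 (linear-time solver): both are O(Ln).
% c = 2 (quadratic-time solver): OVA is O(Ln^2) while OVO is only O(n^2).
```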

22 Comparisons (accuracy) For kernel SVM with RBF kernel (See A comparison of methods for multiclass support vector machines, 2002)

23 Comparisons (training time) For kernel SVM (with RBF kernel) (See A comparison of methods for multiclass support vector machines, 2002)

24 Methods Using Instance-wise Ranking Loss

25 Main idea OVA and OVO decompose the problem by labels. However, the ranking of the labels for a testing sample is what matters: for multiclass classification, the score of $y_i$ should be larger than the scores of all other labels; for multilabel classification, the scores of the labels in $Y_i$ should be larger than those of the other labels. Because OVA and OVO decompose the problem into individual labels, they cannot capture this ranking information. Instead, solve one combined optimization problem by minimizing a ranking loss.

26 Main idea For simplicity, assume a linear model with parameters $w_1, \dots, w_L$. For each data point x, compute the decision value for each label: $w_1^T x, w_2^T x, \dots, w_L^T x$. Prediction: $y = \arg\max_t w_t^T x$. For training data $x_i$, $y_i$ is the true label, so we want $y_i = \arg\max_t w_t^T x_i$ for all i.

27 Main idea Question: how to define the distance between XW and Y

28 Weston-Watkins Formulation Proposed in Weston and Watkins, Multi-class support vector machines. In ESANN, 1999.
$$\min_{\{w_t\},\{\xi_i^t\}} \ \frac{1}{2}\sum_{t=1}^{L}\|w_t\|^2 + C\sum_{i=1}^{n}\sum_{t\neq y_i}\xi_i^t \quad \text{s.t. } w_{y_i}^T x_i - w_t^T x_i \ge 1 - \xi_i^t,\ \ \xi_i^t \ge 0,\ \ \forall t\neq y_i,\ i=1,\dots,n$$
If point i is in class $y_i$, then for every other label $t \neq y_i$ we want $w_{y_i}^T x_i - w_t^T x_i \ge 1$, or we pay a penalty $\xi_i^t$. Prediction: $f(x) = \arg\max_t w_t^T x$.

29 Weston-Watkins Formulation (dual) The dual problem of the Weston-Watkins formulation:
$$\min_{\{\alpha^t\}} \ \frac{1}{2}\sum_{t=1}^{L}\|w_t(\alpha)\|^2 + \sum_{i=1}^{n}\sum_{t\neq y_i}\alpha_i^t \quad \text{s.t. } -C \le \alpha_i^t \le 0,\ \ \forall t\neq y_i,\ i=1,\dots,n$$
$\alpha^1, \dots, \alpha^L \in R^n$: dual variables for each label, with $w_t(\alpha) = \sum_i \alpha_i^t x_i$ and $\alpha_i^{y_i} = -\sum_{t\neq y_i}\alpha_i^t$. Can be solved by dual (block) coordinate descent.

30 Crammer-Singer Formulation Proposed in Crammer and Singer, On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2001.
$$\min_{\{w_t\},\{\xi_i\}} \ \frac{1}{2}\sum_{t=1}^{L}\|w_t\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t. } w_{y_i}^T x_i - w_t^T x_i \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ \forall t\neq y_i,\ i=1,\dots,n$$
If point i is in class $y_i$, then for every other label $t \neq y_i$ we want $w_{y_i}^T x_i - w_t^T x_i \ge 1$. Each point i pays only its largest violation as the penalty $\xi_i$. Prediction: $f(x) = \arg\max_t w_t^T x$.
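The "pay only the largest penalty" structure is easy to see in code. A minimal NumPy sketch of the primal Crammer-Singer objective for a linear model (W stacks the $w_t$ as columns; this is only an illustration, not the solver used in practice):

```python
import numpy as np

def crammer_singer_objective(W, X, y, C=1.0):
    # W: d x L, column t is w_t;  X: n x d;  y: length-n integer labels
    scores = X @ W                                   # scores[i, t] = w_t^T x_i
    n = X.shape[0]
    true_scores = scores[np.arange(n), y]            # w_{y_i}^T x_i
    margins = 1.0 - true_scores[:, None] + scores    # 1 - w_{y_i}^T x_i + w_t^T x_i
    margins[np.arange(n), y] = 0.0                   # exclude t = y_i
    xi = np.maximum(margins.max(axis=1), 0.0)        # each point pays only its largest violation
    return 0.5 * np.sum(W * W) + C * np.sum(xi)
```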

31 Crammer-Singer Formulation (dual) The dual problem of the Crammer-Singer formulation:
$$\min_{\{\alpha^t\}} \ \frac{1}{2}\sum_{t=1}^{L}\|w_t(\alpha)\|^2 + \sum_{i=1}^{n}\sum_{t\neq y_i}\alpha_i^t \quad \text{s.t. } \alpha_i^{y_i} \le C,\ \ \alpha_i^t \le 0 \ (t\neq y_i),\ \ \sum_{t=1}^{L}\alpha_i^t = 0,\ \ i=1,2,\dots,n$$
$\alpha^1, \dots, \alpha^L \in R^n$: dual variables for each label, with $w_t(\alpha) = \sum_i \alpha_i^t x_i$ (the equality constraint gives $\alpha_i^{y_i} = -\sum_{t\neq y_i}\alpha_i^t$). Can be solved by dual (block) coordinate descent.

32 Comparisons (See A Sequential Dual Method for Large Scale Multi-Class Linear SVMs, KDD 2008)

33 Scaling to huge number of labels

34 Challenges: large number of categories Multi-label (or multi-class) classification with a large number of labels. Image classification: a very large number of labels. Recommending tags for articles: millions of labels (tags).

35 Challenges: large number of categories Consider a problem with 1 million labels (L = 1,000,000). One versus all reduces it to 1 million binary problems. Training: 1 million binary classification problems; needs 694 days if each binary problem can be solved in 1 minute. Model size: 1 million models; needs 1 TB if each model requires 1 MB. Prediction for one testing point: 1 million binary predictions; needs 1000 secs if each binary prediction takes $10^{-3}$ secs.

36 Several Approaches Label space reduction by Compressed Sensing (CS) Feature space reduction (PCA) Supervised latent feature model by matrix factorization

37 Compressed Sensing If $A \in R^{n\times d}$, $w \in R^d$, $y \in R^n$, and $Aw = y$: given A and y, can we recover w?

38 Compressed Sensing If $A \in R^{n\times d}$, $w \in R^d$, $y \in R^n$, and $Aw = y$: given A and y, can we recover w? When $n \ge d$: usually yes, because the number of constraints $\ge$ the number of variables (over-determined linear system).

39 Compressed Sensing If $A \in R^{n\times d}$, $w \in R^d$, $y \in R^n$, and $Aw = y$: given A and y, can we recover w? When $n \ll d$: no, because the number of constraints $<$ the number of variables (high-dimensional regression, under-determined linear system, ...).

40 Compressed Sensing However, if we know that w is a sparse vector (with s nonzero elements) and A satisfies certain conditions, then we can recover w even in this high-dimensional setting.

41 Compressed Sensing Conditions: w is s-sparse and A satisfies the Restricted Isometry Property (RIP). Example: all entries of A are generated i.i.d. from a Gaussian $N(0, \sigma)$. Then w can be recovered from $O(s \log d)$ samples, for example by Lasso ($\ell_1$-regularized linear regression) or Orthogonal Matching Pursuit (OMP). How is this related to multilabel classification?

42 Label Space Reduction by Compressed Sensing Proposed in Hsu et al., Multi-Label Prediction via Compressed Sensing. In NIPS 2009. Main idea: reduce the label space from $R^L$ to $R^k$, where $k \ll L$: $z_i = M y_i$, where $M \in R^{k\times L}$. If we can recover $y_i$ from $z_i$ and M, then we only need to learn a function that maps $x_i$ to $z_i$: $f(x_i) \approx z_i$.

43 Label Space Reduction by Compressed Sensing Proposed in Hsu et al., Multi-Label Prediction via Compressed Sensing. In NIPS 2009. Main idea: reduce the label space from $R^L$ to $R^k$, where $k \ll L$: $z_i = M y_i$, where $M \in R^{k\times L}$. If we can recover $y_i$ from $z_i$ and M, then we only need to learn a function that maps $x_i$ to $z_i$: $f(x_i) \approx z_i$. By compressed sensing, $y_i$ can be recovered from $z_i$ and M even when $k = O(s \log L)$, where s is the number of nonzeros in $y_i$ (usually very small in practice).

44 Label Space Reduction by Compressed Sensing Training: Step 1: construct M with i.i.d. Gaussian entries. Step 2: compute $z_i = M y_i$ for all $i = 1, 2, \dots, n$. Step 3: learn a function f such that $f(x_i) \approx z_i$. Prediction for a test sample x: Step 1: compute $z = f(x)$. Step 2: compute $y \in R^L$ by solving a Lasso problem. Step 3: threshold y to give the prediction.
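A minimal sketch of this pipeline, assuming ridge regression for the map $f(x_i) \approx z_i$ and scikit-learn's Lasso for the decoding step (the slides do not fix these choices; any multi-output regressor and any sparse recovery method would do):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

def cs_train(X, Y, k, seed=0):
    # Step 1: random Gaussian projection M (k x L); Step 2: z_i = M y_i; Step 3: learn f(x_i) ~ z_i
    L = Y.shape[1]
    M = np.random.default_rng(seed).normal(size=(k, L)) / np.sqrt(k)
    Z = Y @ M.T
    f = Ridge(alpha=1.0).fit(X, Z)
    return M, f

def cs_predict(M, f, x, alpha=0.01, threshold=0.5):
    # Step 1: z = f(x); Step 2: sparse recovery of y from M y ~ z (Lasso); Step 3: threshold
    z = f.predict(x.reshape(1, -1)).ravel()
    y_hat = Lasso(alpha=alpha, max_iter=10000).fit(M, z).coef_
    return (y_hat > threshold).astype(int)
```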

45 Label Space Reduction by Compressed Sensing Reduces the label size from L to $O(s \log L)$. Drawbacks: (1) slow prediction time (a Lasso problem must be solved for every prediction); (2) large error in practice.

46 Another way to understand this algorithm $X \in R^{n\times d}$: data matrix, each row is a data point $x_i$. $Y \in R^{n\times L}$: label matrix, each row is $y_i$. $M \in R^{L\times k}$: Gaussian random matrix. Goal: find a matrix U such that $XU \approx YM$.

47 Another way to understand this algorithm $X \in R^{n\times d}$: data matrix, each row is a data point $x_i$. $Y \in R^{n\times L}$: label matrix, each row is $y_i$. $M \in R^{L\times k}$: Gaussian random matrix; $M^{\dagger}$: pseudo-inverse of M. Goal: find a matrix U such that $XUM^{\dagger} \approx YMM^{\dagger}$.

48 Feature space reduction Another way to improve speed: reduce the feature space. Dimension reduction from a d-dimensional space to a k-dimensional space: $x_i \rightarrow N x_i =: \tilde{x}_i$, where $N \in R^{k\times d}$ and $k \ll d$. Matrix form: $\tilde{X} = X N^T$. Multilabel learning on the reduced feature space: learn V such that $\tilde{X} V \approx Y$. Time complexity: reduced by a factor of d/k. Several ways to choose N: PCA, random projection, ...
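A short sketch of this feature-reduction variant, assuming PCA (or a Gaussian random projection) for N and ridge regression for V (parameter and function names are mine):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

def feature_reduction_train(X, Y, k, method="pca", seed=0):
    # Reduce features from d to k, then learn V with (X N^T) V ~ Y
    if method == "pca":
        reducer = PCA(n_components=k).fit(X)
        X_red = reducer.transform(X)
    else:  # Gaussian random projection
        N = np.random.default_rng(seed).normal(size=(k, X.shape[1])) / np.sqrt(k)
        reducer = N
        X_red = X @ N.T
    V = Ridge(alpha=1.0).fit(X_red, Y)   # multi-output ridge: one linear model per label
    return reducer, V
```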

49 Another way to understand this algorithm $X \in R^{n\times d}$: data matrix, each row is a data point $x_i$. $Y \in R^{n\times L}$: label matrix, each row is $y_i$. $N \in R^{k\times d}$: matrix for dimension reduction. Goal: find a matrix V such that $X N^T V \approx Y$.

50 Supervised latent space model (by matrix factorization) Label space reduction: V is fixed (by the random matrix M), and we find U such that $XUV^T \approx Y$. Feature space reduction: U is fixed (by the projection N), and we find V such that $XUV^T \approx Y$. How to improve? Find the best U and V simultaneously. Proposed in Chen and Lin, Feature-aware label space dimension reduction for multi-label classification. In NIPS 2012. Xu et al., Speedup Matrix Completion with Side Information: Application to Multi-Label Learning. In NIPS 2013. Yu et al., Large-scale Multi-label Learning with Missing Labels. In ICML 2014.

51 Low Rank Modeling for Multilabel Classification Let $X \in R^{n\times d}$ be the data matrix and $Y \in R^{n\times L}$ be the 0-1 label matrix. Find $U \in R^{d\times k}$ and $V \in R^{L\times k}$ such that $Y \approx XUV^T$. Obtain U, V by solving the following optimization problem:
$$\min_{U \in R^{d\times k},\, V \in R^{L\times k}} \ \|Y - XUV^T\|_F^2 + \lambda_1\|U\|_F^2 + \lambda_2\|V\|_F^2$$
where $\lambda_1, \lambda_2$ are the regularization parameters. Solve by alternating minimization.
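A minimal NumPy sketch of the alternating minimization (closed-form ridge updates for V and U; the eigendecomposition trick in the U-step is one standard way to solve its normal equations, not necessarily how the cited papers implement it):

```python
import numpy as np

def low_rank_multilabel(X, Y, k, lam1=0.1, lam2=0.1, iters=20, seed=0):
    # min_{U,V} ||Y - X U V^T||_F^2 + lam1 ||U||_F^2 + lam2 ||V||_F^2
    n, d = X.shape
    U = np.random.default_rng(seed).normal(scale=0.01, size=(d, k))
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(iters):
        # V-step: ridge regression of Y on the latent features H = X U
        H = X @ U
        V = np.linalg.solve(H.T @ H + lam2 * np.eye(k), H.T @ Y).T          # L x k
        # U-step: normal equations  X^T X U (V^T V) + lam1 U = X^T Y V,
        # solved column-by-column after diagonalizing V^T V = Q diag(evals) Q^T
        evals, Q = np.linalg.eigh(V.T @ V)
        C = (XtY @ V) @ Q
        U = np.column_stack([np.linalg.solve(ev * XtX + lam1 * np.eye(d), C[:, j])
                             for j, ev in enumerate(evals)]) @ Q.T
    return U, V
```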

52 Low Rank Modeling for Multilabel Classification Time complexity: O(nkL) for updating V; O(ndk) for updating U; overall O(nkL + ndk) per iteration. Original time complexity: O(ndL) per iteration.

53 Extensions Extend to a general loss function:
$$\min_{U \in R^{d\times k},\, V \in R^{L\times k}} \ \sum_{i,j} \mathrm{dis}(Y_{ij}, (XUV^T)_{ij}) + \lambda_1\|U\|_F^2 + \lambda_2\|V\|_F^2$$
Can handle problems with missing labels by summing only over the observed set $\Omega$:
$$\min_{U \in R^{d\times k},\, V \in R^{L\times k}} \ \sum_{(i,j)\in\Omega} \mathrm{dis}(Y_{ij}, (XUV^T)_{ij}) + \lambda_1\|U\|_F^2 + \lambda_2\|V\|_F^2$$
Is it similar to inductive matrix completion? What is the time complexity for prediction?

54 Coming up Next class: paper presentations on multilabel classification and matrix completion. Questions?
