Extreme Multi Class Classification

Anna Choromanska, Department of Electrical Engineering, Columbia University
Alekh Agarwal, Microsoft Research, New York, NY, USA
John Langford, Microsoft Research, New York, NY, USA

Abstract

We consider the multi class classification problem under the setting where the number of labels is very large, and hence it is desirable to efficiently achieve train and test running times which are logarithmic in the label complexity. Additionally, the labels are feature dependent in our setting. We propose a reduction of this problem to a set of binary regression problems organized in a tree structure, and we introduce a simple top-down criterion for purification of labels that allows for gradient descent style optimization. Furthermore, we prove that maximizing the proposed objective function (splitting criterion) leads simultaneously to pure and balanced splits. We use the entropy of the tree leaves, a standard measure used in decision trees, to measure the quality of the obtained tree, and we show an upper bound on the number of splits required to reduce this measure below a threshold $\epsilon$. Finally, we empirically show that the splits recovered by our algorithm lead to significantly smaller error than random splits.

1 Introduction

The central problem of this paper is computational complexity in a setting where the number of classes $k$ for multiclass prediction is very large. Almost all machine learning algorithms (with the notable exception of decision trees) have running times for multiclass classification which are $O(k)$, with a canonical example being one-against-all classifiers Rifkin and Klautau (2004). In this setting, the most efficient approach that could be imagined has a running time of $O(\log k)$ for both training and testing, while effectively using online learning algorithms to minimize passes over the data. A number of prior works have addressed aspects of this problem, which we review next.

The Filter Tree Beygelzimer et al. (2009b) addresses consistent (and robust) multiclass classification, showing that it is possible in the statistical limit. A critical problem not addressed there is the choice of partition. In effect, it can only succeed when the chosen partition happens to be easy. The partition finding problem is addressed in the conditional probability tree Beygelzimer et al. (2009a), but that paper addresses conditional probability estimation. Conditional probability estimation can be converted into multiclass prediction, but doing so is not a logarithmic time operation. Further work Bengio et al. (2010) addresses the partitioning problem by recursively applying spectral clustering on a confusion graph. This approach is $O(k)$ or worse at training time, making it intractable by our standards. Empirically, this approach has been found to sometimes lead to badly imbalanced splits Deng et al. (2011). Another work by Weston et al. (2013) makes use of k-means hierarchical clustering (other distortion-minimizing methods could also be used) to recover the label sets for a given partition, though this work primarily addresses the problem of ranking a very large set of labels rather than the multiclass classification problem.

Decision trees are naturally structured to allow logarithmic time prediction. Traditional decision trees often have difficulties with a large number of classes because their splitting criteria are not well-suited to the large class setting. However, newer approaches Agarwal et al. (2013) have addressed this effectively. The approach we discuss here can be thought of as a decision tree algorithm which differs in the objective optimized, and in how that optimization is done. The different objective allows us to prove certain useful properties, while the optimization uses an online learning algorithm which is queried and trained simultaneously as we pass over the data. This offers a tradeoff: our nodes are more powerful, potentially exponentially reducing the state of the model, while being somewhat more expensive to compute since they depend on more than one feature.

2 Splitting criterion

2.1 Formulation of the objective function

In this section we introduce the splitting criterion that we use in every node of the tree to decide whether an example should be sent to the left or right child node. For notational simplicity consider the split done in the root. Let $X$ denote the input dataset. The objective function that we aim to maximize is given as follows:
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{n_l}{n}\cdot\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\cdot\frac{k_l(x)}{k(x)}\right|,$$
where $n$ is the total number of examples, and $n_r$ (resp. $n_l$) is the number of examples going to the right (resp. left) child. Notice that $n = n_r + n_l$. $k(x)$ is the total number of examples labeled in the same way as $x$, and $k_r(x)$ (resp. $k_l(x)$) is the number of examples labeled in the same way as $x$ that are going to the right (resp. left) child. Notice that $\forall_{x \in X}\; k(x) = k_r(x) + k_l(x)$. Thus the objective function can be rewritten as
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|.$$
By a maximally balanced split we understand one for which $n_r = n_l$, and thus the same number of examples is directed to the left and right child nodes. By a maximally pure split we understand one for which, for every $x \in X$, $k_r(x) = 0$ or $k_r(x) = k(x)$, and thus there exist no two distinct data points with the same label that are in different child nodes. The proposed objective function has certain desirable properties, which are captured in Lemma 1 and Lemma 2 and in the entire Section 3.

Lemma 1. If there exists a maximally pure and balanced split, this split maximizes the objective.

Proof. Let $X_R$ be the set of data points in the right child node such that no data point in the left child node has the same label as any data point in $X_R$ ($X_L$ is defined analogously). Let $X_B$ be the set of data points whose labels appear in both child nodes. Then
$$\Omega(n_r, k_r) = \frac{2}{n}\left[\sum_{x \in X_R}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right| + \sum_{x \in X_L}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right| + \sum_{x \in X_B}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|\right] = \frac{2}{n}\left[\sum_{x \in X_R}\frac{n_l}{n} + \sum_{x \in X_L}\frac{n_r}{n}\right].$$
The last step comes from the fact that we consider a maximally pure split, thus $\forall_{x \in X_R}\; k_r(x) = k(x)$, $\forall_{x \in X_L}\; k_r(x) = 0$, and furthermore $X_B$ must be an empty set (that eliminates the third term). We can further simplify as follows:
$$\Omega(n_r, k_r) = \frac{2}{n}\left[n_r\,\frac{n_l}{n} + n_l\,\frac{n_r}{n}\right] = 4\left(\frac{n_r}{n}\right)\left(\frac{n_l}{n}\right).$$
Since we are maximizing the objective, we set $\frac{n_r}{n} = \frac{n_l}{n} = \frac{1}{2}$. That shows the split is also balanced.
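As a concrete illustration (this sketch and its names, e.g. split_objective and the toy labels, are ours and not part of the paper), the following Python snippet evaluates the empirical objective $\Omega$ on a small dataset and confirms that a maximally pure and balanced split attains the value 1 while a label-mixing split scores lower.

```python
import numpy as np

def split_objective(labels, goes_right):
    """Empirical splitting objective
    Omega = (2/n) * sum_x | k_r(x)/k(x) - n_r/n |
    labels:     array of class labels, one per example
    goes_right: boolean array, True if the example is routed to the right child
    """
    labels = np.asarray(labels)
    goes_right = np.asarray(goes_right, dtype=bool)
    n = len(labels)
    n_r = goes_right.sum()
    total = 0.0
    for y, r in zip(labels, goes_right):
        k = np.sum(labels == y)                  # examples sharing x's label
        k_r = np.sum(goes_right[labels == y])    # ... among them, those going right
        total += abs(k_r / k - n_r / n)
    return 2.0 * total / n

# toy example: 4 labels, 3 examples each
labels = np.repeat([0, 1, 2, 3], 3)
pure_balanced = np.isin(labels, [0, 1])          # labels {0,1} right, {2,3} left
mixed = np.arange(len(labels)) % 2 == 0          # alternate examples left/right
print(split_objective(labels, pure_balanced))    # 1.0, the maximum
print(split_objective(labels, mixed))            # strictly smaller (1/3 here)
```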

Lemma 2. In isolation, i.e. when $n_r$ is fixed, and under the condition that a maximally pure and balanced split exists, maximizing the objective recovers this split.

Proof. The objective function is
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|.$$
If $n_r$ is fixed and one optimizes over $k_r$ (notice that $\forall_{x \in X}\; k_r(x) \in \langle 0, k(x)\rangle$), then, since each summand is a convex function of $k_r(x)$, it is maximized at one of the endpoints: either $k_r(x) = 0$ or $k_r(x) = k(x)$, i.e. the recovered split is maximally pure.

3 Purity and balancing factors

In order to show some more interesting properties of the objective function we need to introduce more formal notation. Let $k$ be the number of labels and let $H$ be the hypothesis class. Let $\pi_i$ be the probability that a randomly chosen data point from the dataset has label $i$. Consider a hypothesis $h \in H$ and let $Pr(h(x) > 0 \mid i)$ denote the probability that $h(x) > 0$ given that $x$ has label $i$. We can then define pure and balanced splits as follows.

Definition 1. The hypothesis $h \in H$ induces a pure split if
$$\sum_{i=1}^{k}\pi_i \min\big(Pr(h(x) > 0 \mid i),\, Pr(h(x) < 0 \mid i)\big) \le \delta.$$

Definition 2. The hypothesis $h \in H$ induces a balanced split if
$$\exists_{c \in (0, 1/2]}\quad c \le Pr(h(x) > 0) \le 1 - c.$$

We will refer to $\sum_{i=1}^{k}\pi_i \min(Pr(h(x) > 0 \mid i), Pr(h(x) < 0 \mid i))$ as the purity factor, as it determines how pure the split is, and we will refer to $Pr(h(x) > 0)$ as the balancing factor, as it determines how balanced the split is. One can then express the objective function in the equivalent form
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i \left|Pr(h(x) > 0) - Pr(h(x) > 0 \mid x \in X_i)\right|,$$
where $X_i$ denotes the set of data points with label $i$. We will now show that increasing the value of the objective function leads to recovering a hypothesis that induces a more balanced and also a more pure split.

3.1 Balancing factor

We want to show the relation between the balancing factor and the value of the objective function. In order to do that we start by deriving an upper bound on $\Omega(h)$, where $h \in H$. For ease of notation let $P_i = Pr(h(x) > 0 \mid x \in X_i)$. Thus
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|Pr(h(x) > 0 \mid x \in X_i) - Pr(h(x) > 0)\right| = 2\sum_{i=1}^{k}\pi_i\Big|P_i - \sum_{j=1}^{k}\pi_j P_j\Big|,$$
where $\forall_{i \in \{1,2,\dots,k\}}\; 0 \le P_i \le 1$. The objective $\Omega(h)$ is maximized on the extremes of the $[0,1]$ interval. An upper bound on $\Omega(h)$ can thus be obtained by setting some of the $P_i$'s to $1$ and the remaining ones to $0$. To be more precise, let
$$L_1 = \{i : i \in \{1,2,\dots,k\},\; P_i = 1\} \quad\text{and}\quad L_2 = \{i : i \in \{1,2,\dots,k\},\; P_i = 0\}.$$
We can then write
$$\Omega(h) \le 2\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{j \in L_1}\pi_j\Big) + \sum_{i \in L_2}\pi_i\sum_{j \in L_1}\pi_j\Big] = 2\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{i \in L_1}\pi_i\Big) + \Big(1 - \sum_{i \in L_1}\pi_i\Big)\sum_{i \in L_1}\pi_i\Big] = 4\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{i \in L_1}\pi_i\Big)\Big].$$
For ease of notation let $p = Pr(h(x) > 0)$. Notice that $p = \sum_{i \in L_1}\pi_i$, thus
$$\Omega(h) \le 4p(1-p).$$
Thus $4p^2 - 4p + \Omega(h) \le 0$ and
$$p \in \left[\frac{1 - \sqrt{1 - \Omega(h)}}{2},\; \frac{1 + \sqrt{1 - \Omega(h)}}{2}\right].$$
That yields the balancing factor $c$ to be $c = \frac{1 - \sqrt{1 - \Omega(h)}}{2}$. Maximizing $\Omega(h)$ leads to narrowing the $[c, 1-c]$ interval around the value $\frac{1}{2}$, which corresponds to the maximally balanced split. In particular, for the objective-maximizing hypothesis $h^*$ (then $\Omega(h^*) = 1$) we obtain $c = \frac{1}{2}$, and the split is maximally balanced.
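The bound $\Omega(h) \le 4p(1-p)$ and the resulting interval $[c, 1-c]$ can be spot-checked numerically; the sketch below is ours, with random $(\pi, P)$ pairs standing in for arbitrary hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(pi, P):
    """Omega(h) = 2 * sum_i pi_i * |P_i - p|  with  p = sum_i pi_i * P_i."""
    p = float(np.dot(pi, P))
    return 2.0 * float(np.dot(pi, np.abs(P - p))), p

for _ in range(10000):
    k = int(rng.integers(2, 10))
    pi = rng.dirichlet(np.ones(k))          # class prior
    P = rng.random(k)                       # P_i = Pr(h(x) > 0 | label i)
    omega, p = objective(pi, P)
    assert omega <= 4 * p * (1 - p) + 1e-12
    c = (1 - np.sqrt(1 - min(omega, 1.0))) / 2
    assert c - 1e-12 <= p <= 1 - c + 1e-12  # p lies in [c, 1 - c]
```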

3.2 Purity factor

In analogy to what was shown before, we now want to show the relation between the purity factor and the value of the objective function. As before, we start by deriving an upper bound on $\Omega(h)$, where $h \in H$. For ease of notation let $P_i = Pr(h(x) > 0 \mid x \in X_i)$. Thus
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\Big|P_i - \sum_{j=1}^{k}\pi_j P_j\Big|,$$
where $\forall_{i \in \{1,2,\dots,k\}}\; 0 \le P_i \le 1$. Let $\epsilon_i = \min(P_i, 1 - P_i)$ and $\epsilon = \sum_{i=1}^{k}\pi_i\epsilon_i$; note that $\epsilon$ is exactly the purity factor. Let $p = Pr(h(x) > 0)$. Without loss of generality let $p \le \frac{1}{2}$. Let
$$L_1 = \{i : i \in \{1,2,\dots,k\},\; P_i \ge \tfrac{1}{2}\},\quad L_2 = \{i : i \in \{1,2,\dots,k\},\; P_i \in [p, \tfrac{1}{2})\},\quad L_3 = \{i : i \in \{1,2,\dots,k\},\; P_i < p\}.$$
First notice that
$$p = \sum_{i=1}^{k}\pi_i P_i = \sum_{i \in L_1}\pi_i(1 - \epsilon_i) + \sum_{i \in L_2 \cup L_3}\pi_i\epsilon_i = \sum_{i \in L_1}\pi_i - 2\sum_{i \in L_1}\pi_i\epsilon_i + \epsilon.$$
We can then write
$$\frac{\Omega(h)}{2} = \sum_{i=1}^{k}\pi_i|P_i - p| = \sum_{i \in L_1}\pi_i(1 - \epsilon_i - p) + \sum_{i \in L_2}\pi_i(\epsilon_i - p) + \sum_{i \in L_3}\pi_i(p - \epsilon_i).$$
On $L_1$ the summand equals $1 - p - \epsilon_i$; on $L_2$ and $L_3$ it is at most $1 - p - \epsilon_i$ as well, since $\epsilon_i - p \le 1 - p - \epsilon_i$ (because $\epsilon_i \le \frac{1}{2}$ there) and $p - \epsilon_i \le 1 - p - \epsilon_i$ (because $p \le \frac{1}{2}$). Therefore
$$\Omega(h) \le 2\sum_{i=1}^{k}\pi_i(1 - p - \epsilon_i) = 2 - 2p - 2\epsilon \le 2 - 4p^2 - 4p\epsilon,$$
where the last inequality uses $(1 - 2p)(2p + 2\epsilon) \ge 0$, which holds since $p \le \frac{1}{2}$. Thus
$$\epsilon \le \frac{2 - \Omega(h)}{4p} - p.$$
That yields the purity factor bound $\delta = \frac{2 - \Omega(h)}{4p} - p$. We already know that maximizing $\Omega(h)$ leads to narrowing the $[c, 1-c]$ interval around the value $\frac{1}{2}$, which corresponds to the maximally balanced split. Thus $p$ will be pushed closer to the value $\frac{1}{2}$, and that will result in a decrease of $\delta$. In particular, for the objective-maximizing hypothesis $h^*$ (then $\Omega(h^*) = 1$) we obtain the maximally balanced split, so $p = \frac{1}{2}$, and simultaneously we obtain $\delta = 0$; thus this split is also maximally pure.
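Analogously, the purity bound $\epsilon \le \frac{2-\Omega(h)}{4p} - p$ can be checked with random instances; the snippet below mirrors the previous check and is again only an illustration of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

for _ in range(10000):
    k = int(rng.integers(2, 10))
    pi = rng.dirichlet(np.ones(k))
    P = rng.random(k)                           # P_i = Pr(h(x) > 0 | label i)
    p = float(np.dot(pi, P))
    if p > 0.5:                                 # w.l.o.g. p <= 1/2 (swap the children otherwise)
        P, p = 1 - P, 1 - p
    omega = 2 * float(np.dot(pi, np.abs(P - p)))
    eps = float(np.dot(pi, np.minimum(P, 1 - P)))   # purity factor
    delta = (2 - omega) / (4 * p) - p               # claimed upper bound
    assert eps <= delta + 1e-12
```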

4 Boosting statement

We will now use the entropy of the tree leaves, a standard measure used in decision trees, to measure the quality of the obtained tree, and we will show an upper bound on the number of splits required to reduce this measure below a threshold $\epsilon$. We borrow from the theoretical analysis of decision tree algorithms in Kearns and Mansour (1995), originally developed to show the boosting properties of decision trees for binary classification problems. Our analysis generalizes the analysis there to the multi class classification setting.

Consider the tree $T$, where every node except for the leaves (we refer to the set of the tree leaves as $L$) is characterized by the splitting hypothesis $h \in H$ recovered by maximizing the objective function introduced before. We consider the entropy function $G$ as the measure of the quality of the tree $T$:
$$G(T) = -\sum_{n \in L} w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni}),$$
where $\pi_{ni}$ is the probability that a randomly chosen $x$ drawn from the underlying target distribution $P$ has label $i$ given that $x$ reaches node $n$, and $w(n)$ is the weight of leaf $n$, defined as the probability that a randomly chosen $x$ drawn from $P$ reaches leaf $n$ (note that $\sum_{n \in L} w(n) = 1$).
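For concreteness, a direct computation of $G(T)$ on a toy tree (the helper tree_entropy and the toy leaf distributions are ours) shows how a pure split of a uniform 4-label problem reduces the entropy measure from $\ln 4$ to $\ln 2$.

```python
import numpy as np

def tree_entropy(leaves):
    """G(T) = - sum_n w(n) * sum_i pi_ni * ln(pi_ni)
    leaves: list of (w, pi) pairs, where w is the probability of reaching the
    leaf and pi is the label distribution observed at that leaf."""
    g = 0.0
    for w, pi in leaves:
        pi = np.asarray(pi, dtype=float)
        nz = pi[pi > 0]                      # 0 * ln(0) is taken to be 0
        g -= w * np.sum(nz * np.log(nz))
    return g

# single root leaf over k = 4 uniform labels vs. a perfect 2-leaf split
print(tree_entropy([(1.0, [0.25, 0.25, 0.25, 0.25])]))      # ln(4) ~ 1.386
print(tree_entropy([(0.5, [0.5, 0.5, 0.0, 0.0]),
                    (0.5, [0.0, 0.0, 0.5, 0.5])]))          # ln(2) ~ 0.693
```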

Let us fix a leaf node $n$. For ease of notation let $w = w_n$. We consider splitting this leaf into two children $n_0$ and $n_1$, and let $w_0 = w_{n_0}$ and $w_1 = w_{n_1}$. Also for ease of notation let $p = P(h_n(x) > 0)$ and $P_i = P(h_n(x) > 0 \mid i)$. Let $\pi_i$ be the probability that a randomly chosen $x$ drawn from $P$ has label $i$ given that $x$ reaches node $n$. Recall that $p = \sum_{i=1}^{k}\pi_i P_i$ and $\sum_{i=1}^{k}\pi_i = 1$. Also notice that $w_0 = w(1-p)$ and $w_1 = wp$. Let $\pi$ be the $k$-element vector whose $i$-th entry equals $\pi_i$, and let $G(\pi) = -\sum_{i=1}^{k}\pi_i\ln(\pi_i)$. Before the split, the contribution of node $n$ to the total loss of the tree $T$, which we refer to as $G_t$ ($t$ is the index of the current iteration), was $wG(\pi)$.

Let $\pi_i(n_0) = \frac{\pi_i(1-P_i)}{1-p}$ and $\pi_i(n_1) = \frac{\pi_i P_i}{p}$ be the probabilities that a randomly chosen $x$ drawn from $P$ has label $i$ given that $x$ reaches node $n_0$ or $n_1$, respectively. Furthermore, let $\pi(n_0)$ and $\pi(n_1)$ be the $k$-element vectors whose $i$-th entries equal $\pi_i(n_0)$ and $\pi_i(n_1)$. Notice that $\pi = (1-p)\pi(n_0) + p\pi(n_1)$. After the split, the contribution of the same, now internal, node $n$ changes to $w\big((1-p)G(\pi(n_0)) + pG(\pi(n_1))\big)$. We denote the difference between them as $\Delta_t$, thus
$$\Delta_t = w\left[G(\pi) - (1-p)G(\pi(n_0)) - pG(\pi(n_1))\right].$$
We aim to lower-bound $\Delta_t$, but before that we need two more useful results. Without loss of generality assume that $P_1 \le P_2 \le \dots \le P_k$. For ease of notation let $\Omega = \Omega(h_n)$. Recall that $\Omega = 2\sum_{i=1}^{k}\pi_i|P_i - p|$. From strong concavity we know that
$$\Delta_t \ge wp(1-p)\,\|\pi(n_0) - \pi(n_1)\|_1^2 = wp(1-p)\Big(\sum_{i=1}^{k}|\pi_i(n_0) - \pi_i(n_1)|\Big)^2 = wp(1-p)\Big(\sum_{i=1}^{k}\pi_i\frac{|P_i - p|}{p(1-p)}\Big)^2 = \frac{w}{p(1-p)}\Big(\sum_{i=1}^{k}\pi_i|P_i - p|\Big)^2 = \frac{w\Omega^2}{4p(1-p)}.$$
Furthermore, notice that at round $t$ there must be a leaf $n$ such that $w(n) \ge \frac{G_t}{2t\ln k}$ (we assume we selected this leaf for the currently considered split), where $G_t = -\sum_{n \in L}w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni})$. That is because
$$G_t = -\sum_{n \in L}w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni}) \le \sum_{n \in L}w(n)\ln k \le 2t\,w_{\max}\ln k,$$
where $w_{\max} = \max_n w(n)$ and we use the fact that after $t$ splits the tree has at most $2t$ leaves. Thus $w_{\max} \ge \frac{G_t}{2t\ln k}$, and therefore
$$\Delta_t \ge \frac{\Omega^2 G_t}{8p(1-p)\,t\ln k}.$$

Definition 3. (Weak Hypothesis Assumption) Let $\gamma \in (0, \min(p_n, 1-p_n)]$. Assume that for any distribution $P$ over $X$, at each non-leaf node $n$ of the tree $T$ there exists a hypothesis $h \in H$ such that $\forall_{i \in \{1,2,\dots,k\}}\; |P_{ni} - p_n| \ge \gamma$.

Note that the Weak Hypothesis Assumption in fact requires that each non-leaf node of the tree $T$ has a hypothesis $h$ in its hypothesis class $H$ which guarantees a certain weak purity of the split on any distribution $P$ over $X$. Also note that the condition $\gamma \in (0, \min(p_n, 1-p_n)]$ implies that $\gamma \le \frac{1}{2}$. From the Weak Hypothesis Assumption it follows that for any $n$, $p_n$ cannot be too near $0$ or $1$, since $1 - \gamma \ge p_n \ge \gamma$.

We now proceed to further lower-bound $\Delta_t$. Note that
$$\Omega = 2\sum_{i=1}^{k}\pi_i|p - P_i| \ge 2\gamma\sum_{i=1}^{k}\pi_i = 2\gamma.$$
Thus
$$\Delta_t \ge \frac{(2\gamma)^2 G_t}{8p(1-p)\,t\ln k} = \frac{\gamma^2 G_t}{2p(1-p)\,t\ln k} \ge \frac{\gamma^2 G_t}{2(1-\gamma)^2\,t\ln k},$$
and finally, letting $\eta = \frac{8\gamma^2}{(1-\gamma)^2\ln k}$,
$$\Delta_t \ge \frac{\eta\, G_t}{16t}.$$
Thus we obtain the recurrence inequality
$$G_{t+1} \le G_t - \Delta_t \le G_t - \frac{\eta\, G_t}{16t} = G_t\left[1 - \frac{\eta}{16t}\right].$$
We can now compute the minimum number of splits required to reduce $G_t$ below $\epsilon$, where $\epsilon \in [0,1]$. We use the result from Kearns and Mansour (1995) (see the proof of Theorem 10) and obtain the following theorem.

Theorem 1. Under the Weak Hypothesis Assumption, for any $\epsilon \in [0,1]$, to obtain $G_t \le \epsilon$ it suffices to make
$$\left(\frac{1}{\epsilon}\right)^{\frac{4(1-\gamma)^2\ln k}{\gamma^2}}$$
splits.
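The following small helper (ours) simply evaluates the sufficient number of splits given by Theorem 1 for concrete values of $\epsilon$, $k$ and $\gamma$; it illustrates how strongly the worst-case guarantee depends on the weak-hypothesis advantage $\gamma$.

```python
import math

def sufficient_splits(eps, k, gamma):
    """Sufficient number of splits from Theorem 1:
    (1/eps) ** (4 * (1 - gamma)**2 * ln(k) / gamma**2)."""
    exponent = 4 * (1 - gamma) ** 2 * math.log(k) / gamma ** 2
    return (1 / eps) ** exponent

for gamma in (0.5, 0.4, 0.25):
    # stronger weak hypotheses (larger gamma) yield drastically smaller bounds
    print(gamma, sufficient_splits(eps=0.1, k=8, gamma=gamma))
```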

We will now show one more interesting result for the special case of $k = 2$ classes. We will show that the worst-case value of $p$ with respect to the change in the potential is a balanced $p$; in particular, we will prove a lemma which states that in the worst-case setting, when $\Delta_t$ is minimized, the value of $p$ has to lie in the interval $[0.4, 0.6]$. We start by showing a useful result. Recall that $\pi(n_0)(1-p) + \pi(n_1)p = \pi_1$ and thus $\pi(n_0) = \frac{\pi_1 - \pi(n_1)p}{1-p}$. Therefore we can write
$$\pi(n_0) - \pi(n_1) = \frac{\pi_1 - \pi(n_1)p}{1-p} - \pi(n_1) = \frac{\pi_1}{1-p} - \frac{\pi(n_1)}{1-p} = \frac{\pi_1}{1-p} - \frac{\pi_1 P_1}{p(1-p)},$$
where the last equality comes from the fact that $\pi(n_1) = \frac{\pi_1 P_1}{p}$. Let $\delta = \pi(n_0) - \pi(n_1)$; thus
$$p(1-p)\delta = p\pi_1 - \pi_1 P_1 = \pi_1(p - P_1) \ge \gamma\pi_1 \ge \gamma\pi_1(1 - \pi_1),$$
where the first inequality comes from the Weak Hypothesis Assumption (recall that $P_1 \le p$). Thus we obtained that $p(1-p)\delta \ge \gamma\pi_1(1-\pi_1)$. One can compare this result with Lemma 2 from Kearns and Mansour (1995) and observe that, as expected, we obtain a similar condition.

Now recall that $\Delta_t = G(\pi_1) - (1-p)G(\pi(n_0)) - pG(\pi(n_1))$, where, with a slight abuse of notation, $G(q)$ for a scalar $q$ denotes the binary entropy and without loss of generality the weight $w$ is assumed to be $w = 1$. Notice that $\pi(n_0) = \pi_1 + p\delta$ and $\pi(n_1) = \pi_1 - (1-p)\delta$ (this can be easily verified by substituting $\pi_1 = \pi(n_0)(1-p) + \pi(n_1)p$). We can therefore express $\Delta_t$ as a function of $\pi_1$, $p$, $\delta$ as follows:
$$\Delta_t(\pi_1, p, \delta) = G(\pi_1) - (1-p)G(\pi_1 + p\delta) - pG(\pi_1 - (1-p)\delta).$$
For any fixed values $\pi_1, p \in [0,1]$, $\Delta_t(\pi_1, p, \delta)$ is minimized by choosing $\delta$ as small as possible. Thus let us set $\delta = \frac{\gamma\pi_1(1-\pi_1)}{p(1-p)}$ and define
$$\Delta_t(\pi_1, p) = \Delta_t\Big(\pi_1, p, \frac{\gamma\pi_1(1-\pi_1)}{p(1-p)}\Big) = G(\pi_1) - (1-p)G\Big(\pi_1 + \frac{\gamma\pi_1(1-\pi_1)}{1-p}\Big) - pG\Big(\pi_1 - \frac{\gamma\pi_1(1-\pi_1)}{p}\Big). \qquad (1)$$
The next lemma is a direct application of Lemma 4 from Kearns and Mansour (1995).

Lemma 3. Let $\gamma \in [0,1]$ be any fixed value, and let $\Delta_t(\pi_1, p)$ be as defined in Equation 1. Then for any fixed $\pi_1 \in [0,1]$, $\Delta_t(\pi_1, p)$ is minimized by a value of $p$ falling in the interval $[0.4, 0.6]$. That implies that in the worst-case setting, when $\Delta_t$ is minimized, the split has to be close to balanced.

5 Algorithm

Based on the previous formulation of the objective function, one can show one more equivalent form of this objective function, which yields a simple online algorithm for tree construction and training. Notice that
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|E_x[\mathbf{1}(h(x) > 0)] - E_x[\mathbf{1}(h(x) > 0) \mid i]\right|.$$
This is a discrete optimization problem that we relax: replacing the indicator with the raw score, we instead consider the following objective to maximize,
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|E_x[h(x)] - E_x[h(x) \mid i]\right|.$$
We can store the empirical estimates of these expectations and very easily update them online. The sign of the difference decides whether the currently seen example should be sent to the left or right child node.
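The per-node bookkeeping behind this relaxation can be sketched as follows; the Python class NodeRouter below is a schematic of ours (not the paper's implementation), with the regressor scores supplied externally, and it only illustrates the running means and the sign test that routes an example left or right.

```python
import numpy as np

class NodeRouter:
    """Keep running means of the regressor score, overall and per label;
    the sign of their difference decides whether an example of that label
    is sent to the right (>) or to the left child."""
    def __init__(self, num_labels):
        self.mean_all = 0.0
        self.count_all = 0
        self.mean_label = np.zeros(num_labels)
        self.count_label = np.zeros(num_labels, dtype=int)

    def update(self, label, score):
        # incremental mean updates
        self.count_all += 1
        self.mean_all += (score - self.mean_all) / self.count_all
        self.count_label[label] += 1
        self.mean_label[label] += (score - self.mean_label[label]) / self.count_label[label]

    def direction(self, label):
        # 1: right child, 0: left child
        return int(self.mean_label[label] > self.mean_all)

router = NodeRouter(num_labels=3)
for label, score in [(0, 0.9), (1, -0.4), (0, 0.7), (2, 0.1), (1, -0.6)]:
    router.update(label, score)
print([router.direction(y) for y in range(3)])   # [1, 0, 0] for this toy stream
```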

Algorithm 1 Online tree training (regression algorithm R, threshold p_thr)
  create the root node r
  initialize q_1^r = zeros(k, 1), q_2^r = 0, n_t^r = 0, and n_i^r = 0 for all i in {1, 2, ..., k}
  foreach example s(x, y) do
    set j = r
    while j is not a leaf do
      update the running means of the regressor score, per label and overall:
        q_1^j(y) = (n_y^j * q_1^j(y) + f_j(s(x, y))) / (n_y^j + 1)
        q_2^j    = (n_t^j * q_2^j    + f_j(s(x, y))) / (n_t^j + 1)
      if q_1^j(y) > q_2^j then Y_R^j = Y_R^j ∪ {y}, c = 1
      else Y_L^j = Y_L^j ∪ {y}, c = 0
      train f_j with example (s(x, y), c) (use the absolute-value loss and take a step in the subgradient direction)
      set j to the child of j corresponding to c
      n_t^j += 1; n_y^j += 1
    create children of the leaf j: the left child is a copy of j (including f_j), the right child is associated with label y and is trained on (s(x, y), 0); train f_j with (s(x, y), 1)

6 Experiments

We empirically compare the quality of the split at the root obtained by our algorithm with twenty random partitions. In order to do so we build only a single layer of the tree (the root and two children) and train the root by sending the examples to the children nodes following the sign of the difference of the expectations. Each label is then assigned to the child node that has seen more data points with this particular label. We report the error rate of this approach and compare against the error rate of random partitions. We performed experiments on two publicly available datasets, MNIST (Lecun and Cortes (2009), 10 labels) and Reuters RCV1 (Lewis et al. (2004), 99 labels). Additionally we also performed experiments on an artificially generated dataset (we call it data100), a mixture of 100 well-separated spherical Gaussians with means placed on a grid. Each dataset was split into a training and a test set. All methods were implemented in the open source system Vowpal Wabbit (Langford et al. (2007)).

For the MNIST dataset, our algorithm recovered a balanced partition, i.e. half of the labels were assigned to each child node. For the data100 dataset the recovered partition was almost balanced, i.e. 51 labels were assigned to one child and 49 to the other. For the RCV1 dataset, our algorithm recovered a partition where 33 labels were assigned to one child node and the remaining ones were assigned to the other. Thus in this case we compare our algorithm with random partitions that also divide the labels in proportion 33/66 and also with random partitions that are balanced (proportion 49/50). The results are captured in Table 1, where clearly our algorithm is able to recover significantly better partitions.

  Dataset   Algorithm                   Training error   Testing error
  data100   Our algorithm
            Random partitions           ±                ±
  MNIST     Our algorithm
            Random partitions           ±                ±
  RCV1      Our algorithm
            Random partitions (33/66)   ±                ±
            Random partitions (49/50)   ±                ±

Table 1: Error rates at the root for the data100 dataset for 49/51 partitions, for the MNIST dataset for 50/50 partitions, and for the RCV1 dataset for 33/66 and 49/50 partitions.
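The single-layer evaluation protocol of this section can be sketched as follows (the helper root_split_error and the toy data are ours): each label is assigned to the child that received more of its examples, and the reported error is the fraction of examples that end up on the other side. For simplicity the sketch evaluates on a single sample, whereas the paper reports training and test error separately.

```python
import numpy as np

def root_split_error(labels, goes_right):
    """Assign each label to the child that received more of its examples,
    then report the fraction of examples routed to the other child."""
    labels = np.asarray(labels)
    goes_right = np.asarray(goes_right, dtype=bool)
    errors = 0
    for y in np.unique(labels):
        mask = labels == y
        right = goes_right[mask].sum()
        left = mask.sum() - right
        errors += min(left, right)           # examples on the "wrong" side
    return errors / len(labels)

labels = np.repeat([0, 1, 2, 3], 25)
rng = np.random.default_rng(0)
print(root_split_error(labels, np.isin(labels, [0, 1])))        # 0.0 for a label-consistent split
print(root_split_error(labels, rng.random(len(labels)) < 0.5))  # substantially larger for a random assignment
```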

Acknowledgments

We would like to thank Dean Foster, Robert Schapire and Matus Telgarsky for valuable discussions.

References

Agarwal, R., Gupta, A., Prabhu, Y., and Varma, M. (2013). Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW.

Bengio, S., Weston, J., and Grangier, D. (2010). Label embedding trees for large multi-class tasks. In NIPS.

Beygelzimer, A., Langford, J., Lifshits, Y., Sorkin, G. B., and Strehl, A. L. (2009a). Conditional probability tree estimation analysis and algorithms. In UAI.

Beygelzimer, A., Langford, J., and Ravikumar, P. D. (2009b). Error-correcting tournaments. In ALT.

Deng, J., Satheesh, S., Berg, A. C., and Li, F.-F. (2011). Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS.

Kearns, M. and Mansour, Y. (1995). On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing. ACM Press.

Langford, J., Li, L., and Strehl, A. (2007). Vowpal Wabbit program (vw).

Lecun, Y. and Cortes, C. (2009). The MNIST database of handwritten digits.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5.

Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. J. Mach. Learn. Res., 5.

Weston, J., Makadia, A., and Yee, H. (2013). Label partitioning for sublinear ranking. In ICML.
