Extreme Multi Class Classification

Anna Choromanska, Department of Electrical Engineering, Columbia University
Alekh Agarwal, Microsoft Research, New York, NY, USA
John Langford, Microsoft Research, New York, NY, USA

Abstract

We consider the multi class classification problem under the setting where the number of labels is very large, and hence it is desirable to efficiently achieve train and test running times which are logarithmic in the label complexity. Additionally, the labels are feature dependent in our setting. We propose a reduction of this problem to a set of binary regression problems organized in a tree structure, and we introduce a simple top-down criterion for purification of labels that allows for gradient descent style optimization. Furthermore, we prove that maximizing the proposed objective function (splitting criterion) leads simultaneously to pure and balanced splits. We use the entropy of the tree leaves, a standard measure used in decision trees, to measure the quality of the obtained tree, and we show an upper bound on the number of splits required to reduce this measure below a threshold $\epsilon$. Finally, we empirically show that the splits recovered by our algorithm lead to significantly smaller error than random splits.

1 Introduction

The central problem of this paper is computational complexity in a setting where the number of classes $k$ for multiclass prediction is very large. Almost all machine learning algorithms (with the notable exception of decision trees) have running times for multiclass classification which are $O(k)$, with a canonical example being one-against-all classifiers Rifkin and Klautau (2004). In this setting, the most efficient approach that could be imagined has a running time of $O(\log k)$ for both training and testing, while effectively using online learning algorithms to minimize passes over the data. A number of prior works have addressed aspects of this problem, which we review next.

The Filter Tree Beygelzimer et al. (2009b) addresses consistent (and robust) multiclass classification, showing that it is possible in the statistical limit. A critical problem not addressed there is the choice of partition. In effect, it can only succeed when the chosen partition happens to be easy. The partition finding problem is addressed in the conditional probability tree Beygelzimer et al. (2009a), but that paper addresses conditional probability estimation. Conditional probability estimation can be converted into multiclass prediction, but doing so is not a logarithmic time operation. Further work Bengio et al. (2010) addresses the partitioning problem by recursively applying spectral clustering on a confusion graph. This approach is $O(k)$ or worse at training time, making it intractable by our standards. Empirically, this approach has been found to sometimes lead to badly imbalanced splits Deng et al. (2011). Another work by Weston et al. (2013) makes use of k-means hierarchical clustering (other distortion-minimizing methods could also be used) to recover the label sets for a given partition, though this work primarily addresses the problem of ranking a very large set of labels rather than the multiclass classification problem.

Decision trees are naturally structured to allow logarithmic time prediction. Traditional decision trees often have difficulties with a large number of classes because their splitting criteria are not well-suited to the large class setting. However, newer approaches Agarwal et al. (2013) have addressed this effectively. The approach we discuss here can be thought of as a decision tree algorithm which differs in the objective optimized, and in how that optimization is done. The different objective allows us to prove certain useful properties, while the optimization uses an online learning algorithm which is queried and trained simultaneously as we pass over the data. This offers a tradeoff: our nodes are more powerful, potentially exponentially reducing the state of the model, while being somewhat more expensive to compute since they depend on more than one feature.

2 Splitting criterion

2.1 Formulation of the objective function

In this section we introduce the splitting criterion that we use in every node of the tree to decide whether an example should be sent to the left or right child node. For notational simplicity consider the split done in the root. Let $X$ denote the input dataset. The objective function that we aim to maximize is given as follows:
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{n_l}{n}\cdot\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\cdot\frac{k_l(x)}{k(x)}\right|,$$
where $n$ is the total number of examples, and $n_r$ (resp. $n_l$) is the number of examples going to the right (resp. left) child. Notice that $n = n_r + n_l$. $k(x)$ is the total number of examples labeled in the same way as $x$, and $k_r(x)$ (resp. $k_l(x)$) is the number of examples labeled in the same way as $x$ that are going to the right (resp. left) child. Notice that $\forall_{x \in X}\; k(x) = k_r(x) + k_l(x)$. Thus the objective function can be rewritten as
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|.$$
By a maximally balanced split we understand one for which $n_r = n_l$, and thus the same number of examples is directed to the left and right child nodes. By a maximally pure split we understand one for which, for every $x \in X$, $k_r(x) = 0$ or $k_r(x) = k(x)$, and thus there exist no two distinct data points with the same label that are in different child nodes. The proposed objective function has certain desirable properties, which are captured in Lemma 1 and Lemma 2 and in the entire Section 3.

Lemma 1. If there exists a maximally pure and balanced split, this split maximizes the objective.

Proof. Let $X_R$ be the set of data points in the right child node such that no data point in the left child node has the same label as any data point in $X_R$ ($X_L$ is defined analogously). Let $X_B$ be the set of data points whose labels appear in both child nodes. Then
$$\Omega(n_r, k_r) = \frac{2}{n}\left[\sum_{x \in X_R}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right| + \sum_{x \in X_L}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right| + \sum_{x \in X_B}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|\right] = \frac{2}{n}\left[\sum_{x \in X_R}\frac{n_l}{n} + \sum_{x \in X_L}\frac{n_r}{n}\right].$$
The last step comes from the fact that we consider a maximally pure split, thus $\forall_{x \in X_R}\; k_r(x) = k(x)$, $\forall_{x \in X_L}\; k_r(x) = 0$, and furthermore $X_B$ must be an empty set (that eliminates the third term). We can further simplify as follows:
$$\Omega(n_r, k_r) = \frac{2}{n}\left[n_r\,\frac{n_l}{n} + n_l\,\frac{n_r}{n}\right] = 4\left(\frac{n_r}{n}\right)\left(\frac{n_l}{n}\right).$$
Since we are maximizing the objective, we set $\frac{n_r}{n} = \frac{n_l}{n} = \frac{1}{2}$. That shows the split is also balanced.
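As a concrete illustration (this sketch and its names, e.g. split_objective and the toy labels, are ours and not part of the paper), the following Python snippet evaluates the empirical objective $\Omega$ on a small dataset and confirms that a maximally pure and balanced split attains the value 1 while a label-mixing split scores lower.

```python
import numpy as np

def split_objective(labels, goes_right):
    """Empirical splitting objective
    Omega = (2/n) * sum_x | k_r(x)/k(x) - n_r/n |
    labels:     array of class labels, one per example
    goes_right: boolean array, True if the example is routed to the right child
    """
    labels = np.asarray(labels)
    goes_right = np.asarray(goes_right, dtype=bool)
    n = len(labels)
    n_r = goes_right.sum()
    total = 0.0
    for y, r in zip(labels, goes_right):
        k = np.sum(labels == y)                  # examples sharing x's label
        k_r = np.sum(goes_right[labels == y])    # ... among them, those going right
        total += abs(k_r / k - n_r / n)
    return 2.0 * total / n

# toy example: 4 labels, 3 examples each
labels = np.repeat([0, 1, 2, 3], 3)
pure_balanced = np.isin(labels, [0, 1])          # labels {0,1} right, {2,3} left
mixed = np.arange(len(labels)) % 2 == 0          # alternate examples left/right
print(split_objective(labels, pure_balanced))    # 1.0, the maximum
print(split_objective(labels, mixed))            # strictly smaller (1/3 here)
```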

Lemma 2. In isolation, i.e. when $n_r$ is fixed, and under the condition that a maximally pure and balanced split exists, maximizing the objective recovers this split.

Proof. The objective function is
$$\Omega(n_r, k_r) = \frac{2}{n}\sum_{x \in X}\left|\frac{k_r(x)}{k(x)} - \frac{n_r}{n}\right|.$$
If $n_r$ is fixed and one optimizes over $k_r$ (notice that $\forall_{x \in X}\; k_r(x) \in \langle 0, k(x)\rangle$), then, since each summand is a convex function of $k_r(x)$, it is maximized at one of the endpoints: either $k_r(x) = 0$ or $k_r(x) = k(x)$, i.e. the recovered split is maximally pure.

3 Purity and balancing factors

In order to show some more interesting properties of the objective function we need to introduce more formal notation. Let $k$ be the number of labels and let $H$ be the hypothesis class. Let $\pi_i$ be the probability that a randomly chosen data point from the dataset has label $i$. Consider a hypothesis $h \in H$ and let $Pr(h(x) > 0 \mid i)$ denote the probability that $h(x) > 0$ given that $x$ has label $i$. We can then define pure and balanced splits as follows.

Definition 1. The hypothesis $h \in H$ induces a pure split if
$$\sum_{i=1}^{k}\pi_i \min\big(Pr(h(x) > 0 \mid i),\, Pr(h(x) < 0 \mid i)\big) \le \delta.$$

Definition 2. The hypothesis $h \in H$ induces a balanced split if
$$\exists_{c \in (0, 1/2]}\quad c \le Pr(h(x) > 0) \le 1 - c.$$

We will refer to $\sum_{i=1}^{k}\pi_i \min(Pr(h(x) > 0 \mid i), Pr(h(x) < 0 \mid i))$ as the purity factor, as it determines how pure the split is, and we will refer to $Pr(h(x) > 0)$ as the balancing factor, as it determines how balanced the split is. One can then express the objective function in the equivalent form
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i \left|Pr(h(x) > 0) - Pr(h(x) > 0 \mid x \in X_i)\right|,$$
where $X_i$ denotes the set of data points with label $i$. We will now show that increasing the value of the objective function leads to recovering a hypothesis that induces a more balanced and also a more pure split.

3.1 Balancing factor

We want to show the relation between the balancing factor and the value of the objective function. In order to do that we start by deriving an upper bound on $\Omega(h)$, where $h \in H$. For ease of notation let $P_i = Pr(h(x) > 0 \mid x \in X_i)$. Thus
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|Pr(h(x) > 0 \mid x \in X_i) - Pr(h(x) > 0)\right| = 2\sum_{i=1}^{k}\pi_i\Big|P_i - \sum_{j=1}^{k}\pi_j P_j\Big|,$$
where $\forall_{i \in \{1,2,\dots,k\}}\; 0 \le P_i \le 1$. The objective $\Omega(h)$ is maximized on the extremes of the $[0,1]$ interval. An upper bound on $\Omega(h)$ can thus be obtained by setting some of the $P_i$'s to $1$ and the remaining ones to $0$. To be more precise, let
$$L_1 = \{i : i \in \{1,2,\dots,k\},\; P_i = 1\} \quad\text{and}\quad L_2 = \{i : i \in \{1,2,\dots,k\},\; P_i = 0\}.$$
We can then write
$$\Omega(h) \le 2\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{j \in L_1}\pi_j\Big) + \sum_{i \in L_2}\pi_i\sum_{j \in L_1}\pi_j\Big] = 2\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{i \in L_1}\pi_i\Big) + \Big(1 - \sum_{i \in L_1}\pi_i\Big)\sum_{i \in L_1}\pi_i\Big] = 4\Big[\sum_{i \in L_1}\pi_i\Big(1 - \sum_{i \in L_1}\pi_i\Big)\Big].$$
For ease of notation let $p = Pr(h(x) > 0)$. Notice that $p = \sum_{i \in L_1}\pi_i$, thus
$$\Omega(h) \le 4p(1-p).$$
Thus $4p^2 - 4p + \Omega(h) \le 0$ and
$$p \in \left[\frac{1 - \sqrt{1 - \Omega(h)}}{2},\; \frac{1 + \sqrt{1 - \Omega(h)}}{2}\right].$$
That yields the balancing factor $c$ to be $c = \frac{1 - \sqrt{1 - \Omega(h)}}{2}$. Maximizing $\Omega(h)$ leads to narrowing the $[c, 1-c]$ interval around the value $\frac{1}{2}$, which corresponds to the maximally balanced split. In particular, for the objective-maximizing hypothesis $h^*$ (then $\Omega(h^*) = 1$) we obtain $c = \frac{1}{2}$, and the split is maximally balanced.
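The bound $\Omega(h) \le 4p(1-p)$ and the resulting interval $[c, 1-c]$ can be spot-checked numerically; the sketch below is ours, with random $(\pi, P)$ pairs standing in for arbitrary hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(pi, P):
    """Omega(h) = 2 * sum_i pi_i * |P_i - p|  with  p = sum_i pi_i * P_i."""
    p = float(np.dot(pi, P))
    return 2.0 * float(np.dot(pi, np.abs(P - p))), p

for _ in range(10000):
    k = int(rng.integers(2, 10))
    pi = rng.dirichlet(np.ones(k))          # class prior
    P = rng.random(k)                       # P_i = Pr(h(x) > 0 | label i)
    omega, p = objective(pi, P)
    assert omega <= 4 * p * (1 - p) + 1e-12
    c = (1 - np.sqrt(1 - min(omega, 1.0))) / 2
    assert c - 1e-12 <= p <= 1 - c + 1e-12  # p lies in [c, 1 - c]
```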

3.2 Purity factor

In analogy to what was shown before, we now want to show the relation between the purity factor and the value of the objective function. As before, we start by deriving an upper bound on $\Omega(h)$, where $h \in H$. For ease of notation let $P_i = Pr(h(x) > 0 \mid x \in X_i)$. Thus
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\Big|P_i - \sum_{j=1}^{k}\pi_j P_j\Big|,$$
where $\forall_{i \in \{1,2,\dots,k\}}\; 0 \le P_i \le 1$. Let $\epsilon_i = \min(P_i, 1 - P_i)$ and $\epsilon = \sum_{i=1}^{k}\pi_i\epsilon_i$; note that $\epsilon$ is exactly the purity factor. Let $p = Pr(h(x) > 0)$. Without loss of generality let $p \le \frac{1}{2}$. Let
$$L_1 = \{i : i \in \{1,2,\dots,k\},\; P_i \ge \tfrac{1}{2}\},\quad L_2 = \{i : i \in \{1,2,\dots,k\},\; P_i \in [p, \tfrac{1}{2})\},\quad L_3 = \{i : i \in \{1,2,\dots,k\},\; P_i < p\}.$$
First notice that
$$p = \sum_{i=1}^{k}\pi_i P_i = \sum_{i \in L_1}\pi_i(1 - \epsilon_i) + \sum_{i \in L_2 \cup L_3}\pi_i\epsilon_i = \sum_{i \in L_1}\pi_i - 2\sum_{i \in L_1}\pi_i\epsilon_i + \epsilon.$$
We can then write
$$\frac{\Omega(h)}{2} = \sum_{i=1}^{k}\pi_i|P_i - p| = \sum_{i \in L_1}\pi_i(1 - \epsilon_i - p) + \sum_{i \in L_2}\pi_i(\epsilon_i - p) + \sum_{i \in L_3}\pi_i(p - \epsilon_i).$$
On $L_1$ the summand equals $1 - p - \epsilon_i$; on $L_2$ and $L_3$ it is at most $1 - p - \epsilon_i$ as well, since $\epsilon_i - p \le 1 - p - \epsilon_i$ (because $\epsilon_i \le \frac{1}{2}$ there) and $p - \epsilon_i \le 1 - p - \epsilon_i$ (because $p \le \frac{1}{2}$). Therefore
$$\Omega(h) \le 2\sum_{i=1}^{k}\pi_i(1 - p - \epsilon_i) = 2 - 2p - 2\epsilon \le 2 - 4p^2 - 4p\epsilon,$$
where the last inequality uses $(1 - 2p)(2p + 2\epsilon) \ge 0$, which holds since $p \le \frac{1}{2}$. Thus
$$\epsilon \le \frac{2 - \Omega(h)}{4p} - p.$$
That yields the purity factor bound $\delta = \frac{2 - \Omega(h)}{4p} - p$. We already know that maximizing $\Omega(h)$ leads to narrowing the $[c, 1-c]$ interval around the value $\frac{1}{2}$, which corresponds to the maximally balanced split. Thus $p$ will be pushed closer to the value $\frac{1}{2}$, and that will result in a decrease of $\delta$. In particular, for the objective-maximizing hypothesis $h^*$ (then $\Omega(h^*) = 1$) we obtain the maximally balanced split, so $p = \frac{1}{2}$, and simultaneously we obtain $\delta = 0$; thus this split is also maximally pure.
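Analogously, the purity bound $\epsilon \le \frac{2-\Omega(h)}{4p} - p$ can be checked with random instances; the snippet below mirrors the previous check and is again only an illustration of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

for _ in range(10000):
    k = int(rng.integers(2, 10))
    pi = rng.dirichlet(np.ones(k))
    P = rng.random(k)                           # P_i = Pr(h(x) > 0 | label i)
    p = float(np.dot(pi, P))
    if p > 0.5:                                 # w.l.o.g. p <= 1/2 (swap the children otherwise)
        P, p = 1 - P, 1 - p
    omega = 2 * float(np.dot(pi, np.abs(P - p)))
    eps = float(np.dot(pi, np.minimum(P, 1 - P)))   # purity factor
    delta = (2 - omega) / (4 * p) - p               # claimed upper bound
    assert eps <= delta + 1e-12
```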

4 Boosting statement

We will now use the entropy of the tree leaves, a standard measure used in decision trees, to measure the quality of the obtained tree, and we will show an upper bound on the number of splits required to reduce this measure below a threshold $\epsilon$. We borrow from the theoretical analysis of decision tree algorithms in Kearns and Mansour (1995), originally developed to show the boosting properties of decision trees for binary classification problems. Our analysis generalizes the analysis there to the multi class classification setting.

Consider the tree $T$, where every node except for the leaves (we refer to the set of the tree leaves as $L$) is characterized by the splitting hypothesis $h \in H$ recovered by maximizing the objective function introduced before. We consider the entropy function $G$ as the measure of the quality of the tree $T$:
$$G(T) = -\sum_{n \in L} w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni}),$$
where $\pi_{ni}$ is the probability that a randomly chosen $x$ drawn from the underlying target distribution $P$ has label $i$ given that $x$ reaches node $n$, and $w(n)$ is the weight of leaf $n$, defined as the probability that a randomly chosen $x$ drawn from $P$ reaches leaf $n$ (note that $\sum_{n \in L} w(n) = 1$).
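For concreteness, a direct computation of $G(T)$ on a toy tree (the helper tree_entropy and the toy leaf distributions are ours) shows how a pure split of a uniform 4-label problem reduces the entropy measure from $\ln 4$ to $\ln 2$.

```python
import numpy as np

def tree_entropy(leaves):
    """G(T) = - sum_n w(n) * sum_i pi_ni * ln(pi_ni)
    leaves: list of (w, pi) pairs, where w is the probability of reaching the
    leaf and pi is the label distribution observed at that leaf."""
    g = 0.0
    for w, pi in leaves:
        pi = np.asarray(pi, dtype=float)
        nz = pi[pi > 0]                      # 0 * ln(0) is taken to be 0
        g -= w * np.sum(nz * np.log(nz))
    return g

# single root leaf over k = 4 uniform labels vs. a perfect 2-leaf split
print(tree_entropy([(1.0, [0.25, 0.25, 0.25, 0.25])]))      # ln(4) ~ 1.386
print(tree_entropy([(0.5, [0.5, 0.5, 0.0, 0.0]),
                    (0.5, [0.0, 0.0, 0.5, 0.5])]))          # ln(2) ~ 0.693
```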

Let us fix a leaf node $n$. For ease of notation let $w = w_n$. We consider splitting this leaf into two children $n_0$ and $n_1$, and let $w_0 = w_{n_0}$ and $w_1 = w_{n_1}$. Also for ease of notation let $p = P(h_n(x) > 0)$ and $P_i = P(h_n(x) > 0 \mid i)$. Let $\pi_i$ be the probability that a randomly chosen $x$ drawn from $P$ has label $i$ given that $x$ reaches node $n$. Recall that $p = \sum_{i=1}^{k}\pi_i P_i$ and $\sum_{i=1}^{k}\pi_i = 1$. Also notice that $w_0 = w(1-p)$ and $w_1 = wp$. Let $\pi$ be the $k$-element vector whose $i$-th entry equals $\pi_i$, and let $G(\pi) = -\sum_{i=1}^{k}\pi_i\ln(\pi_i)$. Before the split, the contribution of node $n$ to the total loss of the tree $T$, which we refer to as $G_t$ ($t$ is the index of the current iteration), was $wG(\pi)$.

Let $\pi_i(n_0) = \frac{\pi_i(1-P_i)}{1-p}$ and $\pi_i(n_1) = \frac{\pi_i P_i}{p}$ be the probabilities that a randomly chosen $x$ drawn from $P$ has label $i$ given that $x$ reaches node $n_0$ or $n_1$, respectively. Furthermore, let $\pi(n_0)$ and $\pi(n_1)$ be the $k$-element vectors whose $i$-th entries equal $\pi_i(n_0)$ and $\pi_i(n_1)$. Notice that $\pi = (1-p)\pi(n_0) + p\pi(n_1)$. After the split, the contribution of the same, now internal, node $n$ changes to $w\big((1-p)G(\pi(n_0)) + pG(\pi(n_1))\big)$. We denote the difference between them as $\Delta_t$, thus
$$\Delta_t = w\left[G(\pi) - (1-p)G(\pi(n_0)) - pG(\pi(n_1))\right].$$
We aim to lower-bound $\Delta_t$, but before that we need two more useful results. Without loss of generality assume that $P_1 \le P_2 \le \dots \le P_k$. For ease of notation let $\Omega = \Omega(h_n)$. Recall that $\Omega = 2\sum_{i=1}^{k}\pi_i|P_i - p|$. From strong concavity we know that
$$\Delta_t \ge wp(1-p)\,\|\pi(n_0) - \pi(n_1)\|_1^2 = wp(1-p)\Big(\sum_{i=1}^{k}|\pi_i(n_0) - \pi_i(n_1)|\Big)^2 = wp(1-p)\Big(\sum_{i=1}^{k}\pi_i\frac{|P_i - p|}{p(1-p)}\Big)^2 = \frac{w}{p(1-p)}\Big(\sum_{i=1}^{k}\pi_i|P_i - p|\Big)^2 = \frac{w\Omega^2}{4p(1-p)}.$$
Furthermore, notice that at round $t$ there must be a leaf $n$ such that $w(n) \ge \frac{G_t}{2t\ln k}$ (we assume we selected this leaf for the currently considered split), where $G_t = -\sum_{n \in L}w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni})$. That is because
$$G_t = -\sum_{n \in L}w(n)\sum_{i=1}^{k}\pi_{ni}\ln(\pi_{ni}) \le \sum_{n \in L}w(n)\ln k \le 2t\,w_{\max}\ln k,$$
where $w_{\max} = \max_n w(n)$ and we use the fact that after $t$ splits the tree has at most $2t$ leaves. Thus $w_{\max} \ge \frac{G_t}{2t\ln k}$, and therefore
$$\Delta_t \ge \frac{\Omega^2 G_t}{8p(1-p)\,t\ln k}.$$

Definition 3. (Weak Hypothesis Assumption) Let $\gamma \in (0, \min(p_n, 1-p_n)]$. Assume that for any distribution $P$ over $X$, at each non-leaf node $n$ of the tree $T$ there exists a hypothesis $h \in H$ such that $\forall_{i \in \{1,2,\dots,k\}}\; |P_{ni} - p_n| \ge \gamma$.

Note that the Weak Hypothesis Assumption in fact requires that each non-leaf node of the tree $T$ has a hypothesis $h$ in its hypothesis class $H$ which guarantees a certain weak purity of the split on any distribution $P$ over $X$. Also note that the condition $\gamma \in (0, \min(p_n, 1-p_n)]$ implies that $\gamma \le \frac{1}{2}$. From the Weak Hypothesis Assumption it follows that for any $n$, $p_n$ cannot be too near $0$ or $1$, since $1 - \gamma \ge p_n \ge \gamma$.

We now proceed to further lower-bound $\Delta_t$. Note that
$$\Omega = 2\sum_{i=1}^{k}\pi_i|p - P_i| \ge 2\gamma\sum_{i=1}^{k}\pi_i = 2\gamma.$$
Thus
$$\Delta_t \ge \frac{(2\gamma)^2 G_t}{8p(1-p)\,t\ln k} = \frac{\gamma^2 G_t}{2p(1-p)\,t\ln k} \ge \frac{\gamma^2 G_t}{2(1-\gamma)^2\,t\ln k},$$
and finally, letting $\eta = \frac{8\gamma^2}{(1-\gamma)^2\ln k}$,
$$\Delta_t \ge \frac{\eta\, G_t}{16t}.$$
Thus we obtain the recurrence inequality
$$G_{t+1} \le G_t - \Delta_t \le G_t - \frac{\eta\, G_t}{16t} = G_t\left[1 - \frac{\eta}{16t}\right].$$
We can now compute the minimum number of splits required to reduce $G_t$ below $\epsilon$, where $\epsilon \in [0,1]$. We use the result from Kearns and Mansour (1995) (see the proof of Theorem 10) and obtain the following theorem.

Theorem 1. Under the Weak Hypothesis Assumption, for any $\epsilon \in [0,1]$, to obtain $G_t \le \epsilon$ it suffices to make
$$\left(\frac{1}{\epsilon}\right)^{\frac{4(1-\gamma)^2\ln k}{\gamma^2}}$$
splits.
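The following small helper (ours) simply evaluates the sufficient number of splits given by Theorem 1 for concrete values of $\epsilon$, $k$ and $\gamma$; it illustrates how strongly the worst-case guarantee depends on the weak-hypothesis advantage $\gamma$.

```python
import math

def sufficient_splits(eps, k, gamma):
    """Sufficient number of splits from Theorem 1:
    (1/eps) ** (4 * (1 - gamma)**2 * ln(k) / gamma**2)."""
    exponent = 4 * (1 - gamma) ** 2 * math.log(k) / gamma ** 2
    return (1 / eps) ** exponent

for gamma in (0.5, 0.4, 0.25):
    # stronger weak hypotheses (larger gamma) yield drastically smaller bounds
    print(gamma, sufficient_splits(eps=0.1, k=8, gamma=gamma))
```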

We will now show one more interesting result for the special case of $k = 2$ classes. We will show that the worst-case value of $p$ with respect to the change in the potential is a balanced $p$; in particular, we will prove a lemma which states that in the worst-case setting, when $\Delta_t$ is minimized, the value of $p$ has to lie in the interval $[0.4, 0.6]$. We start by showing a useful result. Recall that $\pi(n_0)(1-p) + \pi(n_1)p = \pi_1$ and thus $\pi(n_0) = \frac{\pi_1 - \pi(n_1)p}{1-p}$. Therefore we can write
$$\pi(n_0) - \pi(n_1) = \frac{\pi_1 - \pi(n_1)p}{1-p} - \pi(n_1) = \frac{\pi_1}{1-p} - \frac{\pi(n_1)}{1-p} = \frac{\pi_1}{1-p} - \frac{\pi_1 P_1}{p(1-p)},$$
where the last equality comes from the fact that $\pi(n_1) = \frac{\pi_1 P_1}{p}$. Let $\delta = \pi(n_0) - \pi(n_1)$; thus
$$p(1-p)\delta = p\pi_1 - \pi_1 P_1 = \pi_1(p - P_1) \ge \gamma\pi_1 \ge \gamma\pi_1(1 - \pi_1),$$
where the first inequality comes from the Weak Hypothesis Assumption (recall that $P_1 \le p$). Thus we obtained that $p(1-p)\delta \ge \gamma\pi_1(1-\pi_1)$. One can compare this result with Lemma 2 from Kearns and Mansour (1995) and observe that, as expected, we obtain a similar condition.

Now recall that $\Delta_t = G(\pi_1) - (1-p)G(\pi(n_0)) - pG(\pi(n_1))$, where, with a slight abuse of notation, $G(q)$ for a scalar $q$ denotes the binary entropy and without loss of generality the weight $w$ is assumed to be $w = 1$. Notice that $\pi(n_0) = \pi_1 + p\delta$ and $\pi(n_1) = \pi_1 - (1-p)\delta$ (this can be easily verified by substituting $\pi_1 = \pi(n_0)(1-p) + \pi(n_1)p$). We can therefore express $\Delta_t$ as a function of $\pi_1$, $p$, $\delta$ as follows:
$$\Delta_t(\pi_1, p, \delta) = G(\pi_1) - (1-p)G(\pi_1 + p\delta) - pG(\pi_1 - (1-p)\delta).$$
For any fixed values $\pi_1, p \in [0,1]$, $\Delta_t(\pi_1, p, \delta)$ is minimized by choosing $\delta$ as small as possible. Thus let us set $\delta = \frac{\gamma\pi_1(1-\pi_1)}{p(1-p)}$ and define
$$\Delta_t(\pi_1, p) = \Delta_t\Big(\pi_1, p, \frac{\gamma\pi_1(1-\pi_1)}{p(1-p)}\Big) = G(\pi_1) - (1-p)G\Big(\pi_1 + \frac{\gamma\pi_1(1-\pi_1)}{1-p}\Big) - pG\Big(\pi_1 - \frac{\gamma\pi_1(1-\pi_1)}{p}\Big). \qquad (1)$$
The next lemma is a direct application of Lemma 4 from Kearns and Mansour (1995).

Lemma 3. Let $\gamma \in [0,1]$ be any fixed value, and let $\Delta_t(\pi_1, p)$ be as defined in Equation 1. Then for any fixed $\pi_1 \in [0,1]$, $\Delta_t(\pi_1, p)$ is minimized by a value of $p$ falling in the interval $[0.4, 0.6]$. That implies that in the worst-case setting, when $\Delta_t$ is minimized, the split has to be close to balanced.

5 Algorithm

Based on the previous formulation of the objective function, one can show one more equivalent form of this objective function, which yields a simple online algorithm for tree construction and training. Notice that
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|E_x[\mathbf{1}(h(x) > 0)] - E_x[\mathbf{1}(h(x) > 0) \mid i]\right|.$$
This is a discrete optimization problem that we relax: replacing the indicator with the raw score, we instead consider the following objective to maximize,
$$\Omega(h) = 2\sum_{i=1}^{k}\pi_i\left|E_x[h(x)] - E_x[h(x) \mid i]\right|.$$
We can store the empirical estimates of these expectations and very easily update them online. The sign of the difference decides whether the currently seen example should be sent to the left or right child node.
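The per-node bookkeeping behind this relaxation can be sketched as follows; the Python class NodeRouter below is a schematic of ours (not the paper's implementation), with the regressor scores supplied externally, and it only illustrates the running means and the sign test that routes an example left or right.

```python
import numpy as np

class NodeRouter:
    """Keep running means of the regressor score, overall and per label;
    the sign of their difference decides whether an example of that label
    is sent to the right (>) or to the left child."""
    def __init__(self, num_labels):
        self.mean_all = 0.0
        self.count_all = 0
        self.mean_label = np.zeros(num_labels)
        self.count_label = np.zeros(num_labels, dtype=int)

    def update(self, label, score):
        # incremental mean updates
        self.count_all += 1
        self.mean_all += (score - self.mean_all) / self.count_all
        self.count_label[label] += 1
        self.mean_label[label] += (score - self.mean_label[label]) / self.count_label[label]

    def direction(self, label):
        # 1: right child, 0: left child
        return int(self.mean_label[label] > self.mean_all)

router = NodeRouter(num_labels=3)
for label, score in [(0, 0.9), (1, -0.4), (0, 0.7), (2, 0.1), (1, -0.6)]:
    router.update(label, score)
print([router.direction(y) for y in range(3)])   # [1, 0, 0] for this toy stream
```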

Algorithm 1 Online tree training (regression algorithm R, threshold p_thr)
  create the root node r
  initialize q_1^r = zeros(k, 1), q_2^r = 0, n_t^r = 0, and n_i^r = 0 for all i in {1, 2, ..., k}
  foreach example s(x, y) do
    set j = r
    while j is not a leaf do
      update the running means of the regressor score, per label and overall:
        q_1^j(y) = (n_y^j * q_1^j(y) + f_j(s(x, y))) / (n_y^j + 1)
        q_2^j    = (n_t^j * q_2^j    + f_j(s(x, y))) / (n_t^j + 1)
      if q_1^j(y) > q_2^j then Y_R^j = Y_R^j ∪ {y}, c = 1
      else Y_L^j = Y_L^j ∪ {y}, c = 0
      train f_j with example (s(x, y), c) (use the absolute-value loss and take a step in the subgradient direction)
      set j to the child of j corresponding to c
      n_t^j += 1; n_y^j += 1
    create children of the leaf j: the left child is a copy of j (including f_j), the right child is associated with label y and is trained on (s(x, y), 0); train f_j with (s(x, y), 1)

6 Experiments

We empirically compare the quality of the split at the root obtained by our algorithm with twenty random partitions. In order to do so we build only a single layer of the tree (the root and two children) and train the root by sending the examples to the children nodes following the sign of the difference of the expectations. Each label is then assigned to the child node that has seen more data points with this particular label. We report the error rate of this approach and compare against the error rate of random partitions. We performed experiments on two publicly available datasets, MNIST (Lecun and Cortes (2009), 10 labels) and Reuters RCV1 (Lewis et al. (2004), 99 labels). Additionally we also performed experiments on an artificially generated dataset (we call it data100), a mixture of 100 well-separated spherical Gaussians with means placed on a grid. Each dataset was split into a training and a test set. All methods were implemented in the open source system Vowpal Wabbit (Langford et al. (2007)).

For the MNIST dataset, our algorithm recovered a balanced partition, i.e. half of the labels were assigned to each child node. For the data100 dataset the recovered partition was almost balanced, i.e. 51 labels were assigned to one child and 49 to the other. For the RCV1 dataset, our algorithm recovered a partition where 33 labels were assigned to one child node and the remaining ones were assigned to the other. Thus in this case we compare our algorithm with random partitions that also divide the labels in proportion 33/66 and also with random partitions that are balanced (proportion 49/50). The results are captured in Table 1, where clearly our algorithm is able to recover significantly better partitions.

  Dataset   Algorithm                   Training error   Testing error
  data100   Our algorithm
            Random partitions           ±                ±
  MNIST     Our algorithm
            Random partitions           ±                ±
  RCV1      Our algorithm
            Random partitions (33/66)   ±                ±
            Random partitions (49/50)   ±                ±

Table 1: Error rates at the root for the data100 dataset for 49/51 partitions, for the MNIST dataset for 50/50 partitions, and for the RCV1 dataset for 33/66 and 49/50 partitions.
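The single-layer evaluation protocol of this section can be sketched as follows (the helper root_split_error and the toy data are ours): each label is assigned to the child that received more of its examples, and the reported error is the fraction of examples that end up on the other side. For simplicity the sketch evaluates on a single sample, whereas the paper reports training and test error separately.

```python
import numpy as np

def root_split_error(labels, goes_right):
    """Assign each label to the child that received more of its examples,
    then report the fraction of examples routed to the other child."""
    labels = np.asarray(labels)
    goes_right = np.asarray(goes_right, dtype=bool)
    errors = 0
    for y in np.unique(labels):
        mask = labels == y
        right = goes_right[mask].sum()
        left = mask.sum() - right
        errors += min(left, right)           # examples on the "wrong" side
    return errors / len(labels)

labels = np.repeat([0, 1, 2, 3], 25)
rng = np.random.default_rng(0)
print(root_split_error(labels, np.isin(labels, [0, 1])))        # 0.0 for a label-consistent split
print(root_split_error(labels, rng.random(len(labels)) < 0.5))  # substantially larger for a random assignment
```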

Acknowledgments

We would like to thank Dean Foster, Robert Schapire and Matus Telgarsky for valuable discussions.

References

Agarwal, R., Gupta, A., Prabhu, Y., and Varma, M. (2013). Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW.

Bengio, S., Weston, J., and Grangier, D. (2010). Label embedding trees for large multi-class tasks. In NIPS.

Beygelzimer, A., Langford, J., Lifshits, Y., Sorkin, G. B., and Strehl, A. L. (2009a). Conditional probability tree estimation analysis and algorithms. In UAI.

Beygelzimer, A., Langford, J., and Ravikumar, P. D. (2009b). Error-correcting tournaments. In ALT.

Deng, J., Satheesh, S., Berg, A. C., and Li, F.-F. (2011). Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS.

Kearns, M. and Mansour, Y. (1995). On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing. ACM Press.

Langford, J., Li, L., and Strehl, A. (2007). Vowpal Wabbit program (vw).

Lecun, Y. and Cortes, C. (2009). The MNIST database of handwritten digits.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5.

Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. J. Mach. Learn. Res., 5.

Weston, J., Makadia, A., and Yee, H. (2013). Label partitioning for sublinear ranking. In ICML.
