arxiv: v2 [stat.ml] 16 Oct 2017

Size: px
Start display at page:

Download "arxiv: v2 [stat.ml] 16 Oct 2017"

Transcription

1 arxiv: v2 [stat.ml] 6 Oct 207 Semi-Supervised AUC Optimization based on Positive-Unlabeled Learning Tomoya Sakai,2, Gang Niu,2, and Masashi Sugiyama 2, Graduate School of Frontier Sciences, The University of Tokyo, Japan 2 Center for Advanced Intelligence Project, RIKEN, Japan Abstract Maximizing the area under the receiver operating characteristic curve AUC is a standard approach to imbalanced classification. So far, various supervised AUC optimization methods have been developed and they are also extended to semi-supervised scenarios to cope with small sample problems. However, existing semi-supervised AUC optimization methods rely on strong distributional assumptions, which are rarely satisfied in real-world problems. In this paper, we propose a novel semi-supervised AUC optimization method that does not require such restrictive assumptions. We first develop an AUC optimization method based only on positive and unlabeled data PU-AUC and then extend it to semi-supervised learning by combining it with a supervised AUC optimization method. We theoretically prove that, without the restrictive distributional assumptions, unlabeled data contribute to improving the generalization performance in PU and semi-supervised AUC optimization methods. Finally, we demonstrate the practical usefulness of the proposed methods through experiments. Introduction Maximizing the area under the receiver operating characteristic curve AUC Hanley and McNeil, 982 is a standard approach to imbalanced classification Cortes and Mohri, While the misclassification rate relies on the sign of the score of a single sample, AUC is governed by the ranking of the scores of two samples. Based on this principle, various supervised methods for directly optimizing AUC have been developed so far and demonstrated to be useful Herschtal and Raskutti, 2004; Zhao et al., 20; Rakhlin et al., 202; Kotlowski et al., 20; Ying et al., 206. However, collecting labeled samples is often expensive and laborious in practice. To mitigate this problem, semi-supervised AUC optimization methods have been developed that can utilize unlabeled samples Amini et al., 2008; Fujino and Ueda, 206. These semi-supervised methods solely rely on the assumption that an unlabeled sample that is similar to a labeled sample shares the same label. However, such a restrictive distributional assumption which is often referred to as the cluster or the entropy minimization principle is rarely satisfied in practice and thus the practical usefulness of these semi-supervised methods is limited Cozman et al., 2003; Sokolovska et al., 2008; Li and Zhou, 205; Krijthe and Loog, 207.

2 On the other hand, it has been recently shown that unlabeled data can be effectively utilized without such restrictive distributional assumptions in the context of classification from positive and unlabeled data PU classification du Plessis et al., 204. Furthermore, based on recent advances in PU classification du Plessis et al., 204, 205; Niu et al., 206, a novel semisupervised classification approach has been developed that combines supervised classification with PU classification Sakai et al., 207. This approach inherits the advances of PU classification that the restrictive distributional assumptions are not necessary and is demonstrated to perform excellently in experiments. Following this line of research, we first develop an AUC optimization method from positive and unlabeled data PU-AUC in this paper. Previously, a pairwise ranking method for PU data has been developed Sundararajan et al., 20, which can be regarded as an AUC optimization method for PU data. However, it merely regards unlabeled data as negative data and thus the obtained classifier is biased. On the other hand, our PU-AUC method is unbiased and we theoretically prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions. Then we extend our PU-AUC method to the semi-supervised setup by combining it with a supervised AUC optimization method. Theoretically, we again prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions, and further we prove that the variance of the empirical risk of our semi-supervised ACU optimization method can be smaller than that of the plain supervised counterpart. The latter claim suggests that the proposed semi-supervised empirical risk is also useful in the cross-validation phase. Finally, we experimentally demonstrate the usefulness of the proposed PU and semi-supervised AUC optimization methods. 2 Preliminary We first describe our problem setting and review an existing supervised AUC optimization method. Let covariate x R d and its corresponding label y {±} be equipped with probability density px, y, where d is a positive integer. Suppose we have sets of positive and negative samples: X P := {x P i } np i.i.d. i= p P x := px y = +, and X N := {x N j } nn i.i.d. j= p N x := px y =. Furthermore, let g : R d R be a decision function and classification is carried out based on its sign: ŷ = signgx. The goal is to train a classifier g by maximizing the AUC Hanley and McNeil, 982; Cortes and Mohri, 2004 defined and expressed as AUCg := E P [E N [Igx P gx N ]] = E P [E N [Igx P < gx N ]] = E P [E N [l 0- gx P gx N ]], where E P and E N be the expectations over p P x and p N x, respectively. I is the indicator function, which is replaced with the zero-one loss, l 0- m = signm/2, to obtain the 2

3 last equation. Let fx, x := gx gx be a composite classifier. Maximizing the AUC corresponds to minimizing the second term in Eq.. Practically, to avoid the discrete nature of the zero-one loss, we replace the zero-one loss with a surrogate loss lm and consider the following PN-AUC risk Herschtal and Raskutti, 2004; Kotlowski et al., 20; Rakhlin et al., 202: R PN f := E P [E N [lfx P, x N ]]. 2 In practice, we train a classifier by minimizing the empirical PN-AUC risk defined as R PN f := n P n N n N i= j= lfx P i, x N j. Similarly to the classification-calibrated loss Bartlett et al., 2006 in misclassification rate minimization, the consistency of AUC optimization in terms of loss functions has been studied recently Gao and Zhou, 205; Gao et al., 206. They showed that minimization of the AUC risk with a consistent loss function is asymptotically equivalent to that with the zero-one loss function. The squared loss l S m := m 2, the exponential loss l E m := exp m, and the logistic loss l L m := log + exp m are shown to be consistent, while the hinge loss l H m := max0, m and the absolute loss l A m := m are not consistent. 3 Proposed Method In this section, we first propose an AUC optimization method from positive and unlabeled data and then extend it to a semi-supervised AUC optimization method. 3. PU-AUC Optimization In PU learning, we do not have negative data while we can use unlabeled data drawn from marginal density px in addition to positive data: where X U := {x U k } nu i.i.d. k= px = θ P p P x + p N x, 3 θ P := py = + and := py =. We derive an equivalent expression to the PN-AUC risk that depends only on positive and unlabeled data distributions without the negative data distribution. In our derivation and theoretical analysis, we assume that θ P and are known. In practice, they are replaced by their estimate obtained, e.g., by du Plessis et al. 207, Kawakubo et al. 206, and references therein. From the definition of the marginal density in Eq. 3, we have E P [E U [lfx P, x U ]] = θ P E P [E P [lfx P, x P ]] + E P [E N [lfx P, x N ]] = θ P E P [E P [lfx P, x P ]] + R PN f, 3

4 where E P denotes the expectation over p P x P. Dividing the above equation by and rearranging it, we can express the PN-AUC risk in Eq. 2 based on PU data the PU-AUC risk as R PN f = E P [E U [lfx P, x U ]] θ P E P [E P [lfx P, x P ]] := R PU f. 4 We refer to the method minimizing the PU-AUC risk as PU-AUC optimization. We will theoretically investigate the superiority of R PU in Section 4.. To develop a semi-supervised AUC optimization method later, we also consider AUC optimization from negative and unlabeled data, which can be regarded as a mirror of PU-AUC optimization. From the definition of the marginal density in Eq. 3, we have E U [E N [lfx U, x N ]] = θ P E P [E N [lfx P, x N ]] + E N [E N [lfx N, x N ]] = θ P R PN f + E N [E N [lfx N, x N ]], where E N denotes the expectation over p N x N. Rearranging the above equation, we can obtain the PN-AUC risk in Eq. 2 based on negative and unlabeled data the NU-AUC risk: R PN f = θ P E U [E N [lfx U, x N ]] θ P E N [E N [lfx N, x N ]] := R NU f. 5 We refer to the method minimizing the NU-AUC risk as NU-AUC optimization. 3.2 Semi-Supervised AUC Optimization Next, we propose a novel semi-supervised AUC optimization method based on positiveunlabeled learning. The idea is to combine the PN-AUC risk with the PU-AUC/NU-AUC risks, similarly to Sakai et al First of all, let us define the PNPU-AUC and PNNU-AUC risks as R γ PNPU f := γr PNf + γr PU f, R γ PNNU f := γr PNf + γr NU f, where γ [0, ] is the combination parameter. We then define the PNU-AUC risk as { R η PNU f := R η PNPU f η 0, R η PNNU f η < 0, 6 where η [, ] is the combination parameter. We refer to the method minimizing the PNU- AUC risk as PNU-AUC optimization. We will theoretically discuss the superiority of R γ PNPU and R γ PNNU in Section Discussion about Related Work Sundararajan et al. 20 proposed a pairwise ranking method for PU data, which can be regarded as an AUC optimization method for PU data. Their approach simply regards unlabeled data as negative data and the ranking SVM Joachims, 2002 is applied to PU data so that the score of positive data tends to be higher than that of unlabeled data. Although this approach In Sakai et al. 207, the combination of the PU and NU risks has also considered and found to be less favorable than the combination of the PN and PU/NU risks. For this reason, we focus on the latter in this paper. 4

5 is simple and shown computationally efficient in experiments, the obtained classifier is biased. From the mathematical viewpoint, the existing method ignores the second term in Eq. 4 and maximizes only the first term with the hinge loss function. However, the effect of ignoring the second term is not negligible when the class prior, θ P, is not sufficiently small. In contrast, our proposed PU-AUC risk includes the second term so that the PU-AUC risk is equivalent to the PN-AUC risk. Our semi-supervised AUC optimization method can be regarded as an extension of the work by Sakai et al They considered the misclassification rate as a measure to train a classifier and proposed a semi-supervised classification method based on the recently proposed PU classification method du Plessis et al., 204, 205. On the other hand, we train a classifier by maximizing the AUC, which is a standard approach for imbalanced classification. To this end, we first developed an AUC optimization method for PU data, and then extended it to a semi-supervised AUC optimization method. Thanks to the AUC maximization formulation, our proposed method is expected to perform better than the method proposed by Sakai et al. 207 for imbalanced data sets. 4 Theoretical Analyses In this section, we theoretically analyze the proposed risk functions. We first derive generalization error bounds of our methods and then discuss variance reduction. 4. Generalization Error Bounds Recall the composite classifier fx, x = gx gx. As the classifier g, we assume the linear-in-parameter model given by gx = b w l φx = w φx, l= where denotes the transpose of vectors and matrices, b is the number of basis functions, w = w,..., w b is a parameter vector, and φx = φ x,..., φ b x is a basis function vector. Let F be a function class of bounded hyperplanes: F := {fx, x = w φx φx w C w ; x: φx C φ }, where C w > 0 and C φ > 0 are certain positive constants. This assumption is reasonable because the l 2 -regularizer included in training and the use of bounded basis functions, e.g., the Gaussian kernel basis, ensure that the minimizer of the empirical AUC risk belongs to such the function class F. We assume that a surrogate loss is bounded from above by C l and denote the Lipschitz constant by L. For simplicity, 2 we focus on a surrogate loss satisfying l 0- m lm. For example, the squared loss and the exponential loss satisfy the condition. 3 Let If = E P [E N [l 0- fx P, x N ]] be the generalization error of f in AUC optimization. For convenience, we define hδ := 2 2LC l C w C φ log2/δ. 2 Our theoretical analysis can be easily extended to the loss satisfying l 0- m Mlm with a certain M > 0. 3 These losses are bounded in our setting, since the input to lm, i.e., f is bounded. 5

6 In the following, we prove the generalization error bounds of both PU and semi-supervised AUC optimization methods. For the PU-AUC/NU-AUC risks, we prove the following generalization error bounds its proof is available in Appendix B: Theorem. For any δ > 0, the following inequalities hold separately with probability at least δ for all f F: If R PU f + hδ/2 minnp, n U + θ P, np If R NU f + hδ/2 θ P minnn, n U + θ N, θ P nn where R PU and R NU are unbiased empirical risk estimators corresponding to R PU and R NU, respectively. Theorem guarantees that If can be bounded from above by the empirical risk, Rf, plus the confidence terms of order O p np + nu and O p nn + nu. Since n P n N and n U can increase independently in our setting, this is the optimal convergence rate without any additional assumptions Vapnik, 998; Mendelson, For the PNPU-AUC and PNNU-AUC risks, we prove the following generalization error bounds its proof is also available in Appendix B: Theorem 2. For any δ > 0, the following inequalities hold separately with probability at least δ for all f F: If R γ PNPU f + hδ/3 γ minnp, n N + γ minnp, n U + γθ P, np If R γ PNNU f + hδ/3 γ minnp, n N + γ θ P minnn, n U + γθ N. θ P nn where R γ PNPU and R γ PNNU are unbiased empirical risk estimators corresponding to Rγ PNPU and R γ PNNU, respectively. Theorem 2 guarantees that If can be bounded from above by the empirical risk, Rf, plus the confidence terms of order O p np + nn + nu. Again, since n P, n N, and n U can increase independently in our setting, this is the optimal convergence rate without any additional assumptions. 4.2 Variance Reduction In the existing semi-supervised classification method based on PU learning, the variance of the empirical risk was proved to be smaller than the supervised counterpart under certain conditions Sakai et al., 207. Similarly, we here investigate if the proposed semi-supervised risk estimators have smaller variance than its supervised counterpart. 6

7 Let us introduce the following variances and covariances: 4 σ 2 PNf = Var PN [lfx P, x N ], σ 2 PPf = Var PP [lfx P, x P ], σ 2 NNf = Var NN [lfx N, x N ], τ PN,PP f = Cov PN,PP [lfx P, x N, lfx P, x P ], τ PN,NN f = Cov PN,NN [lfx P, x N, lfx N, x N ], τ PU,PP f = Cov PU,PP [lfx P, x U, lfx P, x P ], τ NU,NN f = Cov NU,NN [lfx P, x U, lfx N, x N ]. Then, we have the following theorem its proof is available in Appendix C: Theorem 3. Assume n U. For any fixed f, the minimizers of the variance of the empirical PNPU-AUC and PNNU-AUC risks are respectively obtained by γ PNPU = argmin Var[ R γ PNPU f] = ψ PN ψ PP /2, γ ψ PN + ψ PU ψ PP 7 γ PNNU = argmin Var[ R γ PNNU f] = ψ PN ψ NN /2, γ ψ PN + ψ NU ψ NN 8 where ψ PN = n P n N σ 2 PNf, ψ PU = θ2 P θ 2 N n P 2 σ2 PPf ψ PP = n P τ PN,PU f ψ NU = θ2 N θ 2 P n N 2 σ2 NNf θ P θn 2 n τ PU,PP f, P θ P n P τ PN,PP f, θp 2 n τ NU,NN f, N ψ NN = τ PN,NU f τ PN,NN f. θ P n N θ P n N Additionally, we have Var[ R γ PNPU f] < Var[ R PN f] for any γ 0, 2γ PNPU if ψ PN + ψ PU > ψ PP and 2ψ PN > ψ PP. Similarly, we have Var[ R γ PNNU f] < Var[ R PN f] for any γ 0, 2γ PNNU if ψ PN + ψ NU > ψ NN and 2ψ PN > ψ NN. This theorem means that, if γ is chosen appropriately, our proposed risk estimators, R γ PNPU and R γ PNNU, have smaller variance than the standard supervised risk estimator R PN. A practical consequence of Theorem 3 is that when we conduct cross-validation for hyperparameter selection, we may use our proposed risk estimators R γ PNPU and R γ PNNU instead of the standard supervised risk estimator R PN since they are more stable see Section 5.3 for details. 4 Var PN, Var PP, and Var NN are the variances over p P x P p N x N, p P x P p P x P, and p N x N p N x N, respectively. Cov PN,PP, Cov PN,NN, Cov PU,PP, and Cov NU,NN are the covariances over p P x P p N x N p P x P, p P x P p N x N p N x N, p P x P px U p P x P, and p N x N px U p N x N, respectively. 7

8 5 Practical Implementation In this section, we explain the implementation details of our proposed methods. 5. General Case In practice, the AUC risks R introduced above are replaced with their empirical version R, where the expectations in R are replaced with the corresponding sample averages. Here, we focus on the linear-in-parameter model given by gx = b w l φx = w φx, l= where denotes the transpose of vectors and matrices, b is the number of basis functions, w = w,..., w b is a parameter vector, and φx = φ x,..., φ b x is a basis function vector. The linear-in-parameter model allows us to express the composite classifier as where fx, x = w φx, x, φx, x := φx φx is a composite basis function vector. We train the classifier by minimizing the l 2 -regularized empirical AUC risk: min w where λ 0 is the regularization parameter. Rf + λ w 2, 5.2 Analytical Solution for Squared Loss For the squared loss l S m := m 2, the empirical PU-AUC risk 5 can be expressed as R PU f = n P n U n U θ P n P n P l S fx P i, x U k i= k= i= i = l S fx P i, x P i + = 2w ĥ PU + w Ĥ PU w w Ĥ PP w, 5 We discuss the way of estimating the PU-AUC risk in Appendix A. θ P n P 8

9 where ĥ PU := Φ P np Φ n P n U nu, U Ĥ PU := Φ P Φ P Φ n P n P n U nu n P Φ P U Φ P np n n P n U Φ U + Φ U n UΦ U, U 2θ P 2θ P Ĥ PP := n P Φ P Φ P n P n P Φ P np n P Φ P, Φ P := φx P,..., φx P n P, Φ U := φx U,..., φx U n U, and b is the b-dimensional vector whose elements are all one. With the l 2 -regularizer, we can analytically obtain the solution by ŵ PU := ĤPU ĤPP + λi b ĥ PU, where I b is the b-dimensional identity matrix. The computational complexity of computing ĥpu, ĤPU, and ĤPP are On P + n U b, On P + n U b 2, and On P b 2, respectively. Then, solving a system of linear equations to obtain the solution ŵ PU requires the computational complexity of Ob 3. In total, the computational complexity of this PU-AUC optimization method is On P + n U b 2 + b 3. As given by Eq. 6, our PNU-AUC optimization method consists of the PNPU-AUC risk and the PNNU-AUC risk. For the squared loss l S m := m 2, the empirical PNPU-AUC risk can be expressed as R γ PNPU f = γ n P n N n N i= j= γθ P n P n P l S fx P i, x N γ j + n P n U n P n P i= i = l S fx P i, x P i + n U i= k= = γ 2 γw ĥ PN + γw Ĥ PN w + γ 2γw ĥ PU + γw Ĥ PU w γw Ĥ PP w, γθ P n P l S fx P i, x U k where ĥ PN := n P Φ P np n N Φ N nn, Ĥ PN := n P Φ P Φ P n P n N Φ P np n N Φ N n P n N Φ N nn n P Φ P + n N Φ NΦ N, Φ N := φx N,..., φx N n N. The solution for the l 2 -regularized PNPU-AUC optimization can be analytically obtained by ŵ γ PNPU := γĥpn + γĥpu γĥpp + λi b γĥpn + γĥpu. 9

10 Similarly, the solution for the l 2 -regularized PNNU-AUC optimization can be obtained by ŵ γ PNNU := γĥpn + γĥnu γĥnn + λi b γĥpn + γĥnu. where ĥ NU := Φ θ P n U nu Φ U θ P n N nn, N Ĥ NU := Φ θ P n NΦ N Φ N θ P n N n U nu n N Φ N U Φ θ P n N n N nn n U Φ U + Φ U θ P n UΦ U, U 2 2 Ĥ NN := θ P n N Φ NΦ N θ P n N n N Φ N nn n N Φ N. The computational complexity of computing ĥpn and ĤPN are On P + n N b and On P + n N b 2, respectively. Then, obtaining the solution ŵ PNPU ŵ PNNU requires the computational complexity of Ob 3. Including the computational complexity of computing ĥ PU, Ĥ PU, and ĤPP, the total computational complexity of the PNPU-AUC optimization method is On P + n N + n U b 2 + b 3. Similarly, the total computational complexity of the PNNU-AUC optimization method is On P + n N + n U b 2 + b 3. Thus, the computational complexity of the PNU-AUC optimization method is On P + n N + n U b 2 + b 3. From the viewpoint of computational complexity, the squared loss and the exponential loss are more efficient than the logistic loss because these loss functions reduce the nested summations to individual ones. More specifically, for example, in the PU-AUC optimization method, the logistic loss requires On P n U operations for evaluating the first term in the PU-AUC risk, i.e., the loss over positive and unlabeled samples. In contrast, the squared loss and exponential loss reduce the number of operations for loss evaluation to On P + n U. 6 This property is beneficial especially when we handle large scale data sets. 5.3 Cross-Validation To tune the hyperparameters such as the regularization parameter λ, we use the cross-validation. For the PU-AUC optimization method, we use the PU-AUC risk in Eq. 4 with the zero-one loss as the cross-validation score. For the PNU-AUC optimization method, we use the PNU-AUC risk in Eq. 6 with the zero-one loss as the score. To this end, however, we need to fix the combination parameter η in the cross-validation score in advance and then, we tune the hyperparameters including the combination parameter. More specifically, let η [, ] be the predefined combination parameter. We conduct cross-validation with respect to R η PNU f for tuning the hyperparameters. 6 For example, the exponential loss over positive and unlabeled data can be computed as follows: n P n U i= k= l E fx P i, xu k = n P n U exp gx P i + gxu k i= k= n P n U = exp gx P i expgx U k. Thus, the number of operations for loss evaluation is reduced to n P + n U + rather than n P n U. i= k= 0

11 Since the PNU-AUC risk is equivalent to the PN-AUC risk for any η, we can choose any η in principle. However, when the empirical PNU-AUC risk is used in practice, choice of η may affect the performance of cross-validation. Here, based on the theoretical result of variance reduction given in Section 4.2, we give a practical method to determine η. Assuming the covariances, e.g., τ PN,PP f, are small enough to be neglected and σ PN f = σ PP f = σ NN f, we can obtain a simpler form of Eqs. 7 and 8 as γ PNPU = + θp 2 n N/θN 2 n P, γ PNNU = + θn 2 n P/θP 2 n N. They can be computed simply from the number of samples and the estimated class-prior. γpnpu γpnnu Finally, to select the combination parameter η, we use R PNPU for η 0, and R PNNU for η < 0. 6 Experiments In this section, we numerically investigate the behavior of the proposed methods and evaluate their performance on various data sets. All experiments were carried out using a PC equipped with two 2.60GHz Intel Xeon E v3 CPUs. As the classifier, we used the linear-in-parameter model. In all experiments except text classification tasks, we used the Gaussian kernel basis function expressed as φ l x = exp x x l 2 2σ 2, where σ > 0 is the Gaussian bandwidth, {x l } b l= are the samples randomly selected from training samples {x i } n i= and n is the number of training samples. In text classification tasks, we used the linear kernel basis function: φ l x = x x l. The number of basis functions was set at b = minn, 200. The candidates of the Gaussian bandwidth were median{ x i x j } n i,j= {/8, /4, /2,, 2} and that of the regularization parameter were {0 3, 0 2, 0, 0 0, 0 }. All hyper-parameters were determined by fivefold cross-validation. As the loss function, we used the squared loss function l S m = m Effect of Variance Reduction First, we numerically confirm the effect of variance reduction. We compare the variance of the empirical PNU-AUC risk against the variance of the empirical PN-AUC risk, Var[ R η PNU f] vs. Var[ R PN f], under a fixed classifier f. As the fixed classifier, we used the minimizer of the empirical PN-AUC risk, denoted by f PN. The number of positive and negative samples for training varied as n P, n N = 2, 8, 0, 0, and 8, 2. We then computed the variance of the empirical PN-AUC and PNU-AUC risks with additional 0 positive, 0 negative, and 300 unlabeled samples. As the data set, we used the Banana data set Rätsch et al., 200. In this experiment, the class-prior was set at θ P = 0. and assumed to be known.

12 r r 0 2 n P = 2; n N = 8 n P = 0; n N = 0 n P = 8; n N = a η b η > 0 Figure : Average with standard error of the ratio between the variance of the empirical PNU risk and that of the PN risk, r = Var[ R η PNU f PN ]/ Var[ R PN f PN ], as a function of the combination parameter η over 00 trials on the Banana data set. The class-prior is θ P = 0. and the number of positive and negative samples varies as n P, n N = 2, 8, 0, 0, and 8, 2. Left: values of r as a function of η. Right: values for η > 0 are magnified. Figure plots the value of the variance of the empirical PNU-AUC risk divided by that of the PN-AUC risk, r := Var[ R η PNU f PN ] Var[ R PN f PN ], as a function of the combination parameter η under different numbers of positive and negative samples. The results show that r < can be achieved by an appropriate choice of η, meaning that the variance of the empirical PNU-AUC risk can be smaller than that of the PN-AUC risk. We then investigate how the class-prior affects the variance reduction. In this experiment, the number of positive and negative samples for f PN are n P = 0 and n N = 0, respectively. Figure 2 showed the values of r as a function of the combination parameter η under different class-priors. When the class-prior, θ P, is 0. and 0.2, the variance can be reduced for η > 0. When the class-prior is 0.3, the range of the value of η that yields variance reduction becomes smaller. However, this may not be that problematic in practice, because AUC optimization is effective when two classes are highly imbalanced, i.e., the class-prior is far from 0.5; when the class-prior is close to 0.5, we may simply use the standard misclassification rate minimization approach. 6.2 Benchmark Data Sets Next, we report the classification performance of the proposed PU-AUC and PNU-AUC optimization methods, respectively. We used 5 benchmark data sets from the IDA Benchmark Repository Rätsch et al., 200, the Semi-Supervised Learning Book Chapelle et al., 2006, the LIBSVM Chang and Lin, 20, and the UCI Machine Learning Repository Lichman, 203. The detailed statistics of the data sets are summarized in Appendix D. 2

13 r r P = 0: 3 P = 0:2 3 P = 0: a η b η > 0 Figure 2: Average with standard error of the ratio between the variance of the PNU-AUC risk and that of the PN-AUC risk, r = Var[ R η PNU f PN ]/ Var[ R PN f PN ], as a function of the combination parameter η over 00 trials on the Banana data set. A class-prior varies as θ P = 0., 0.2, and 0.3. Left: values of r as a function of η. Right: values for η > 0 are magnified. When θ P = 0., 0.2, the variance of the empirical PNU-AUC risk is smaller than that of the PN risk for η > 0. Table : Average and standard error of the estimated class-prior over 50 trials on benchmark data sets in PU learning setting. Data set d θ P = 0. θ P = 0.2 Banana skin_nonskin cod-rna Magic Image Twonorm Waveform mushrooms AUC Optimization from Positive and Unlabeled Data We compared the proposed PU-AUC optimization method against the existing AUC optimization method based on the ranking SVM PU-RSVM Sundararajan et al., 20. We trained a classifier with samples of size n P = 00 and n U = 000 under the different class-priors θ P = 0. and 0.2. For the PU-AUC optimization method, the squared loss function was used and the class-prior was estimated by the distribution matching method du Plessis et al., 207. The results of the estimated class-prior are summarized in Table. Table 2 lists the average with standard error of the AUC over 50 trials, showing that the proposed PU-AUC optimization method achieves better performance than the existing method. 3

14 Computation Time [sec.] Table 2: Average and standard error of the AUC over 50 trials on benchmark data sets. The boldface denotes the best and comparable methods in terms of the average AUC according to the t-test at the significance level 5%. The last row shows the number of best/comparable cases of each method. θ Data set d P = 0. θ P = 0.2 PU-AUC PU-RSVM PU-AUC PU-RSVM Banana skin_nonskin cod-rna Magic Image Twonorm Waveform mushrooms #Best/Comp PU-AUC PU-RSVM Banana skin nonskin cod-rna Magic Image Twonorm Waveform mushrooms Figure 3: Average computation time on benchmark data sets when θ P = 0. over 50 trials. The computation time of the PU-AUC optimization method includes the class-prior estimation and the empirical risk minimization. In particular, when θ P = 0.2, the difference between PU-RSVM and our method becomes larger compared with the difference when θ P = 0.. Since PU-RSVM can be interpreted as regarding unlabeled data as negative, the bias caused by this becomes larger when θ P = 0.2. Figure 3 summarizes the average computation time over 50 trials. The computation time of the PU-AUC optimization method includes both the class-prior estimation and the empirical risk minimization. The results show that the PU-AUC optimization method requires almost twice computation time as that of PU-RSVM, but it would be acceptable in practice to obtain better performance. 4

15 6.2.2 Semi-Supervised AUC Optimization Here, we compare the proposed PNU-AUC optimization method against existing AUC optimization approaches: the semi-supervised rankboost SSRankboost Amini et al., 2008, 7 the semi-supervised AUC-optimized logistic sigmoid sauc-ls Fujino and Ueda, 206, 8 and the optimum AUC with a generative model OptAG Fujino and Ueda, 206. We trained the classifier with samples of size n P = θ P n L, n N = n L n P, and n U = 000, where n L is the number of labeled samples. For the PNU-AUC optimization method, the squared loss function was used and the candidates of the combination parameter η were { 0.9, 0.8,..., 0.9}. For the class-prior estimation, we used the energy distance minimization method Kawakubo et al., 206. The results of the estimated class-prior are summarized in Table 3. For SSRankboost, the discount factor and the number of neighbors were chosen from {0 3, 0 2, 0 } and {2, 3,..., 7}, respectively. For sauc-ls and OptAG, the regularization parameter for the entropy regularizer was chosen from {, 0}. Furthermore, as the generative model of OptAG, we adapted the Gaussian distribution for the data distribution and the Gaussian and Gamma distributions for the prior of the data distribution. 9 Table 4 lists the average with standard error of the AUC over 50 trials, showing that the proposed PNU-AUC optimization method achieves better performance than or comparable performance to the existing methods on many data sets. Figure 4 summarizes the average computation time over 50 trials. The computation time of the PNU-AUC optimization method includes both the class-prior estimation and the empirical risk minimization. The results show that even though our proposed method involves the class-prior estimation, the computation time is relatively faster than SSRankboost and much faster than sauc-ls and OptAG. The reason for longer computation time of sauc-ls and OptAG is that their implementation is based on the logistic loss in which the number of operations for loss evaluation is On P n N +n P n U +n N n U, unlike the PNU-AUC optimization method with the squared loss in which the number of operations for loss evaluation is On P + n N + n U cf. the discussion about the computational complexity in Section 5. 7 We used the code available at 8 This method is equivalent to OptAG without a generative model, which only employs a discriminative model with the entropy minimization principle. To eliminate the adverse effect of the wrongly chosen generative model, we added this method for comparison. 9 As the generative model, we used the Gaussian distributions for positive and negative classes: p gx P ; µ P τ d 2 P exp τ P 2 xp µ P 2, p gx P ; µ N τ d 2 N exp τ N 2 xn µ N 2, where τ P and τ N denote the precisions and µ P and µ N are the means. As the prior of µ P, µ N, τ P, and τ N, we used the Gaussian and Gamma distributions: pµ P ; µ 0 P τ d 2 P exp ρ0 P τ P µ 2 P µ 0 P 2, pµ N ; µ 0 N τ d 2 N exp pτ P ; a 0 P, b0 P τ a0 P P exp b 0 P τ P, pτ N ; a 0 N, b0 N τ a0 N N exp b 0 N τ N, ρ0 N τ N µ 2 N µ 0 N 2, where µ 0 P, µ0, a 0 P, b0 P, a0 N, b0 N, ρ0 P, and ρ0 N are the hyperparameters. 5

16 Table 3: Average and standard error of the estimated class-prior over 50 trials on benchmark data sets in semi-supervised learning setting. Data set n L θ P = 0. θ P = 0.2 Banana d = skin_nonskin d = cod-rna d = Magic d = Image d = SUSY d = Ringnorm d = Twonorm d = Waveform d = covtype d = phishing d = a9a d = mushrooms d = USPS d = w8a d = Text Classification Next, we apply our proposed PNU-AUC optimization method to a text classification task. We used the Reuters Corpus Volume I data set Lewis et al., 2004, the Amazon Review data set Dredze et al., 2008, and the 20 Newsgroups data set Lang, 995. More specifically, we used the data set processed for a binary classification task: the rcv, amazon2, and news20 data sets. The rcv and news20 data sets are available at the website of LIBSVM Chang and Lin, 20, and the amazon2 is designed by ourselves, which consists of the product reviews of books and 6

17 Table 4: Average and standard error of the AUC over 50 trials on benchmark data sets. The boldface denotes the best and comparable methods in terms of the average AUC according to the t-test at the significance level 5%. The last row shows the number of best/comparable cases of each method. SSRboost is an abbreviation for SSRankboost. Data set n L θ P = 0. θ P = 0.2 PNU-AUC SSRboost sauc-ls OptAG PNU-AUC SSRboost sauc-ls OptAG Banana d = skin_nonskin d = cod-rna d = Magic d = Image d = SUSY d = Ringnorm d = Twonorm d = Waveform d = covtype d = phishing d = a9a d = mushrooms d = USPS d = w8a d = #Best/Comp music from the Amazon7 data set Blondel et al., 203. The dimension of a feature vector of the rcv data set is 47, 236, that of the amazon2 data set is 262, 44, and that of the news20 data set is, 355, 9. We trained a classifier with samples of size n P = 20, n N = 80, and n U = 0, 000. The true class-prior was set at θ P = 0.2 and estimated by the method based on energy distance minimization Kawakubo et al., 206. For the generative model of OptAG, we employed naive Bayes NB multinomial models and a Dirichlet prior for the prior distribution of the NB model as described in Fujino and Ueda 206. Table 5 lists the average with standard error of the AUC over 20 trials, showing that the proposed method outperforms the existing methods. Figure 5 summarizes the average computation time of each method. These results show that the proposed method achieves better performance 7

18 Computation Time [sec.] 0 4 PNU-AUC SSRankboost sauc-ls OptAG Banana skin nonskin cod-rna Magic Image SUSY Ringnorm Twonorm Waveform covtype phishing a9a mushrooms USPS w8a Figure 4: Average computation time of each method on benchmark data sets when n L = 00 and θ P = 0. over 50 trials. Table 5: Average with standard error of the AUC over 20 trials on the text classification data sets. The boldface denotes the best and comparable methods in terms of the average AUC according to the t-test at the significance level 5%. Data set d θp PNU-AUC SSRankboost sauc-ls OptAG rcv amazon news PNU-AUC SSRankboost sauc-ls OptAG Computation Time [sec.]04 rcv amazon2 news20 Figure 5: Average computation time of each method on the text classification data sets. with short computation time. 6.4 Sensitivity Analysis Here, we investigate the effect of the estimation accuracy of the class-prior for the PNU-AUC optimization method. Specifically, we added noise ρ { 0.09, 0.08,..., 0.09} to the true class-prior θ P and used θ P = θ P +ρ as the estimated class-prior for the PNU-AUC optimization method. Under the different values of the class-prior θ P = 0., 0.2, and 0.3, we trained a classifier with samples of size n P = θ P 50, n N = 50, and n U = 000. Figure 6 summarizes the average with standard error of the AUC as a function of the noise. The plots show that when θ P = 0.2 and 0.3, the performance of the PNU-AUC optimization 8

19 AUC AUC AUC P = 0: 3P = 0:2 3P = 0: ; a Banana d = ; b cod-rna d = ; c w8a d = 300 Figure 6: Average with standard error of the AUC as a function of the noise ρ over 00 trials. The PNU-AUC optimization method used the noisy class-prior θ P = θ P + ρ in training. The plots show that when θ P = 0.2 and 0.3, the performance of the PNU-AUC optimization method is stable even when the estimated class-prior has some noise. However, when θ P = 0., as the noise is close to ρ = 0.09, the performance largely decreases. Since the true class-prior is small, it is sensitive to the negative bias. method is stable even when the estimated class-prior has some noise. On the other hand, when θ P = 0., as the noise is close to ρ = 0.09, the performance largely decreases. Since the true class-prior is small, it is sensitive to the negative bias. In particular, when ρ = 0.09, the gap between the estimated and true class-priors is larger than other values. For instance, when ρ = 0.09 and θ P = 0.2, θ P / θ P.8, but when ρ = 0.09 and θ P = 0., θ P / θ P 0. In contrast, the positive bias does not heavily affect the performance even when θ P = Scalability Finally, we report the scalability of our proposed PNU-AUC optimization method. Specifically, we evaluated the AUC and computation time while increasing the number of unlabeled samples. We picked two large data sets: the SUSY and amazon2 data sets. The number of positive and negative samples were n P = 40 and n N = 60, respectively. Figure 7 summarizes the average with standard error of the AUC and computation time as a function of the number of unlabeled samples. The AUC on the SUSY data set slightly increased at n U =, 000, 000, but the improvement on the amazon2 data set was not noticeable or the performance decreased slightly. In this experiment, the increase of the size of unlabeled data did not significantly improve the performance of the classifier, but it did not affect adversely, i.e., it did not cause significant performance degeneration. The result of computation time shows that the proposed PNU-AUC optimization method can handle approximately, 000, 000 samples within reasonable computation time in this experiment. The longer computation time on the SUSY data set before n U = 0, 000 is because we need to choose one additional hyperparameter, i.e., the bandwidth of the Gaussian kernel basis function, compared with the linear kernel basis function. However, the effect gradually decreases; after n U = 0, 000, the matrix multiplication of the high dimensional matrix on the amazon2 data set d = 262, 44 requires more computation time than the SUSY data set d = 8. 9

20 AUC Computation Time [sec.] amazon2 SUSY n U a AUC n U b Computation time Figure 7: Average with standard error of the AUC as a function of the number of unlabeled data n U over 20 trials. 7 Conclusions In this paper, we proposed a novel AUC optimization method from positive and unlabeled data and extend it to a novel semi-supervised AUC optimization method. Unlike the existing approach, our approach does not rely on strong distributional assumptions on the data distributions such as the cluster and the entropy minimization principle. Without the distributional assumptions, we theoretically derived the generalization error bounds of our PU and semi-supervised AUC optimization methods. Moreover, for our semi-supervised AUC optimization method, we showed that the variance of the empirical risk can be smaller than that of the supervised counterpart. Through numerical experiments, we demonstrated the practical usefulness of the proposed PU and semi-supervised AUC optimization methods. Acknowledgements TS was supported by KAKNEHI 5J09. GN was supported by the JST CREST program and Microsoft Research Asia. MS was supported by JST CREST JPMJCR403. We thank Han Bao for his comments. A PU-AUC Risk Estimator In this section, we discuss the way of estimating the proposed PU-AUC risk. Recall that the PU-AUC risk in Eq. 4 is defined as R PU f = E P [E U [lfx P, x U ]] θ P E P [E P [lfx P, x P ]]. If one additional set of positive samples {x P i } np i= is available, we obtain the unbiased PU-AUC risk estimator by R PU f = n U i= k= lfx P i, x U k θ P n P 2 n P n P i= i = lfx P i, x P i. 20

21 We used this estimator in our theoretical analyses because learning is not involved. However, obtaining one additional set of samples is not always possible in practice. Thus, instead of the above risk estimator, we use the following risk estimator in our implementation: R PU f = n U i= k= lfx P i, x U k θ P n P n P n P n P i= i = lfx P i, x P i l0. n P This estimator is also unbiased. To show unbiasedness of this estimator, let us rewrite the second term of the PU-AUC risk without coefficient in Eq. 4 as The unbiased estimator can be expressed as E P [E P [lfx P, x P ]] = E x P,x P[lfxP, x P ]. n P n P n P n P i= i = lfx P i, x P i l0 n P, because the expectation of the above estimator can be computed as follows: [ E x P,...,x P n P n P n P n P n P i= i = lfx P i, x P i [ = E x P,...,x lfx P P n i, x P i + P n P n P i= [ = E x P,...,x l0 + P n P n P n P = = n P n P n P n P i= i i i= i i ] = E x P,x [lfx P P, x P, i= i= i i E x P i,x P i [ lfx P i, x P i ] ] E xp,x [lfx P P, x P l0 ] n P i= i i lfx P i, x P i lfx P i, x P i l0 ] n P l0 ] n P where we used fx, x = w φx φx = 0 from the second to third lines. If the squared loss function lm = m 2 is used, l0 = cf. the implementation with the squared loss in Section 5. Therefore, the proposed PU-AUC risk estimator is unbiased. B Proof of Generalization Error Bounds Here, we give the proofs of generalization error bounds in Section 4.. The proofs are based on Usunier et al Let {x i } m i= and {x j }n j= be two sets of samples drawn from the distribution equipped with densities qx and q x, respectively. Recall F be a function class of bounded hyperplanes: F := {fx = w, φx φx w C w ; x: φx C φ }, 2

22 where C w > 0 and C φ > 0 are certain positive constants. Then, the AUC risk over distributions q and q and its empirical version can be expressed as For convenience, we define Rf := E x q [E x q [lfx, x ]], Rf := m n lfx i, x mn j. i= j= We first have the following theorem: hδ := 2 2LC l C w C φ log2/δ. Theorem 4. For any δ > 0, the following inequality holds with probability at least δ for any f F: Rf Rf hδ minn, n. Proof. By slightly modifying Theorem 7 in Usunier et al to fit our setting, for any δ > 0, with probability at least δ for any f F, we have Rf Rf 2LC lc w maxn, n n n nn φx i φx j 2 i= j= log2/δ minn, n. 9 Applying the inequality n n φx i φx j 2 n i= j= n i= 2nn C 2 φ, to the first term in Eq. 9, we obtain the theorem. φx i 2 + n φx j 2 By using Theorem 4, we prove the risk bounds of the PU-AUC and NU-AUC risks: Lemma 5. For any δ > 0, the following inequalities hold separately with probability at least δ for any f F: R PU f R PU f hδ/2 minnp, n U + θ P, np R NU f R NU f hδ/2 θ P minnn, n U + θ N. θ P nn Proof. Recall that the PU-AUC and NU-AUC risks are expressed as R PU f = E P [E U [lfx P, x U ]] θ P E P [E P [lfx P, x P ]], R NU f = θ P E U [E N [lfx U, x N ]] θ P E N [E N [lfx N, x N ]]. n j= 22

23 Based on Theorem 4, for any δ > 0, we have these uniform deviation bounds with probability at least δ/2: sup E P [E U [lfx P, x U ]] f F n P n U sup E U [E N [lfx U, x N ]] f F n N n U sup E P [E P [lfx P, x P ]] f F n 2 P sup E N [E N [lfx N, x N ]] f F n 2 N n U i= k= n U n N k= j= i= i = n N n N j= j = lfx P i, x U k lfx U k, x N j hδ/2 minnp, n U, hδ/2 minnn, n U, lfx P i, x P i hδ/2, np lfx N j, x N j hδ/2, nn Simple calculation showed that for any δ > 0, with probability δ, we have sup R PU f R PU f sup E P [E U [lfx P, x U ]] f F f F n P n U where we used + θ P sup E P [E θ P [lfx P, x P ]] N f F hδ/2 supx + y supx + supy, minnp, n U + n U i= k= n 2 P i= i = lfx P i, x U k lfx P i, x P i θ P, 0 np R PU f E P [E U [lfx P, x U ]] + θ P E P [E P [lfx P, x P ]]. Similarly, for the NU-AUC risk, we have R NU f R NU f hδ/2 sup f F Eqs. 0 and conclude the lemma. Finally, we give the proof of Theorem. θ P minnn, n U + θ N. θ P nn Proof. Assume the loss satisfying l 0- m Mlm. We have If MRf. When M = such as l S m and l E m, If Rf holds. This observation yields Theorem. Next, we prove the generalization error bounds of the PNPU-AUC and PNNU-AUC risks in Theorem 2. We first prove the following risk bounds: Lemma 6. For any δ > 0, the following inequalities hold separately with probability at least δ for all f F: R γ PNPU f R γ PNPU f hδ/3 γ minnp, n N + γ minnp, n U + θ Pγ, np R γ PNNU f R γ PNNU f hδ/3 γ minnp, n N + 23 γ θ P minnn, n U + θ Nγ. θ P nn

24 Proof. Recall the PNPU-AUC and PNNU-AUC risks: R γ PNPU f := γr PNf + γr PU f, R γ PNNU f := γr PNf + γr NU f. Based on Theorem 4, for any δ > 0, we have these uniform deviation bounds with probability at least δ/3: sup E P [E N [lfx P, x N ]] f F n P n N sup E P [E U [lfx P, x U ]] f F n P n U sup E U [E N [lfx U, x N ]] f F n N n U sup E P [E P [lfx P, x P ]] f F n 2 P sup E N [E N [lfx N, x N ]] f F n 2 N n N i= j= n U i= k= n U n N k= j= i= i = n N n N j= j = lfx P i, x N j lfx P i, x U k lfx U k, x N j hδ/3 minnp, n N, hδ/3 minnp, n U, hδ/3 minnn, n U, lfx P i, x P i hδ/3, np lfx N j, x N j hδ/3. nn Combining three bounds from the above, for any δ > 0, with probability δ, we have R γpnpu f R γpnpu f sup f F γ sup f F E P [E N [lfx P, x N ]] n P n N + γ sup E P [E U [lfx P, x U ]] f F n P n U + γθ P hδ/3 sup E P [E P [lfx P, x P ]] f F γ minnp, n N + n N i= j= n U i= k= n 2 P i= i = γ minnp, n U + This concludes the risk bounds of the PNPU-AUC risk. Similarly, we prove the risk bounds of the PNNU-AUC risk. lfx P i, x N j lfx P i, x U k lfx P i, x P i γθ P. np Again, If Rf holds in our setting. This leads to Theorem 2. C Proof of Variance Reduction Here, we give the proof of Theorem 3. 24

arxiv: v2 [cs.lg] 14 Oct 2016

arxiv: v2 [cs.lg] 14 Oct 2016 Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Tomoya Sakai Marthinus Christoffel du Plessis Gang Niu Masashi Sugiyama The University of Tokyo The University of

More information

Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data

Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Tomoya Sakai 1 2 Marthinus Christoffel du Plessis Gang Niu 1 Masashi Sugiyama 2 1 Abstract Most of the semi-supervised

More information

Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data

Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Tomoya Sakai 2 Marthinus Christoffel du Plessis Gang Niu Masashi Sugiyama 2 Abstract Most of the semi-supervised

More information

Class Prior Estimation from Positive and Unlabeled Data

Class Prior Estimation from Positive and Unlabeled Data IEICE Transactions on Information and Systems, vol.e97-d, no.5, pp.1358 1362, 2014. 1 Class Prior Estimation from Positive and Unlabeled Data Marthinus Christoffel du Plessis Tokyo Institute of Technology,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Input-Dependent Estimation of Generalization Error under Covariate Shift

Input-Dependent Estimation of Generalization Error under Covariate Shift Statistics & Decisions, vol.23, no.4, pp.249 279, 25. 1 Input-Dependent Estimation of Generalization Error under Covariate Shift Masashi Sugiyama (sugi@cs.titech.ac.jp) Department of Computer Science,

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Support Vector Machine via Nonlinear Rescaling Method

Support Vector Machine via Nonlinear Rescaling Method Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University

More information

Nearest Neighbors Methods for Support Vector Machines

Nearest Neighbors Methods for Support Vector Machines Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María González-Lima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning

Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning Gang Niu 1 Marthis C. du Plessis 1 Tomoya Sakai 1 Yao Ma 3 Masashi Sugiyama 2,1 1 The University of Tokyo, Japan

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Support vector comparison machines

Support vector comparison machines THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE Abstract Support vector comparison machines Toby HOCKING, Supaporn SPANURATTANA, and Masashi SUGIYAMA Department

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Classification from Pairwise Similarity and Unlabeled Data

Classification from Pairwise Similarity and Unlabeled Data Han Bao Gang Niu Masashi Sugiyama Abstract Supervised learning needs a huge amount of labeled data, which can be a big bottleneck under the situation where there is a privacy concern or labeling cost is

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Sparse Support Vector Machines by Kernel Discriminant Analysis

Sparse Support Vector Machines by Kernel Discriminant Analysis Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Instance-Dependent PU Learning by Bayesian Optimal Relabeling

Instance-Dependent PU Learning by Bayesian Optimal Relabeling Instance-Dependent PU Learning by Bayesian Optimal Relabeling arxiv:1808.02180v1 [cs.lg] 7 Aug 2018 Fengxiang He Tongliang Liu Geoffrey I Webb Dacheng Tao Abstract When learning from positive and unlabelled

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

What is semi-supervised learning?

What is semi-supervised learning? What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,

More information

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A. Lozano Contents 1 Appendix A.

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent

Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Journal of Computational Information Systems 9: 15 (2013) 6251 6258 Available at http://www.jofcis.com Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Xin ZHOU, Conghui ZHU, Sheng

More information

Logistic regression and linear classifiers COMS 4771

Logistic regression and linear classifiers COMS 4771 Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random

More information

Learning Kernel Parameters by using Class Separability Measure

Learning Kernel Parameters by using Class Separability Measure Learning Kernel Parameters by using Class Separability Measure Lei Wang, Kap Luk Chan School of Electrical and Electronic Engineering Nanyang Technological University Singapore, 3979 E-mail: P 3733@ntu.edu.sg,eklchan@ntu.edu.sg

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers

Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers Hiva Ghanbari Joint work with Prof. Katya Scheinberg Industrial and Systems Engineering Department US & Mexico Workshop

More information

Training algorithms for fuzzy support vector machines with nois

Training algorithms for fuzzy support vector machines with nois Training algorithms for fuzzy support vector machines with noisy data Presented by Josh Hoak Chun-fu Lin 1 Sheng-de Wang 1 1 National Taiwan University 13 April 2010 Prelude Problem: SVMs are particularly

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

Tackling the Poor Assumptions of Naive Bayes Text Classifiers

Tackling the Poor Assumptions of Naive Bayes Text Classifiers Tackling the Poor Assumptions of Naive Bayes Text Classifiers Jason Rennie MIT Computer Science and Artificial Intelligence Laboratory jrennie@ai.mit.edu Joint work with Lawrence Shih, Jaime Teevan and

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Hsuan-Tien Lin Learning Systems Group, California Institute of Technology Talk in NTU EE/CS Speech Lab, November 16, 2005 H.-T. Lin (Learning Systems Group) Introduction

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Large Scale Machine Learning with Stochastic Gradient Descent

Large Scale Machine Learning with Stochastic Gradient Descent Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June) Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning.

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Generalization error bounds for classifiers trained with interdependent data

Generalization error bounds for classifiers trained with interdependent data Generalization error bounds for classifiers trained with interdependent data icolas Usunier, Massih-Reza Amini, Patrick Gallinari Department of Computer Science, University of Paris VI 8, rue du Capitaine

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

On the Interpretability of Conditional Probability Estimates in the Agnostic Setting

On the Interpretability of Conditional Probability Estimates in the Agnostic Setting On the Interpretability of Conditional Probability Estimates in the Agnostic Setting Yihan Gao Aditya Parameswaran Jian Peng University of Illinois at Urbana-Champaign Abstract We study the interpretability

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

This is an author-deposited version published in : Eprints ID : 17710

This is an author-deposited version published in :   Eprints ID : 17710 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Algorithms for Predicting Structured Data

Algorithms for Predicting Structured Data 1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure

More information

Linear Classification

Linear Classification Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Trading Convexity for Scalability

Trading Convexity for Scalability Trading Convexity for Scalability Léon Bottou leon@bottou.org Ronan Collobert, Fabian Sinz, Jason Weston ronan@collobert.com, fabee@tuebingen.mpg.de, jasonw@nec-labs.com NEC Laboratories of America A Word

More information