Class-prior Estimation for Learning from Positive and Unlabeled Data


JMLR: Workshop and Conference Proceedings 45:221–236, 2015. ACML 2015

Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama
The University of Tokyo, Tokyo 113-0033, Japan.

Editors: Geoffrey Holmes and Tie-Yan Liu

© 2015 M.C. du Plessis, G. Niu & M. Sugiyama.

Abstract

We consider the problem of estimating the class prior in an unlabeled dataset. Under the assumption that an additional labeled dataset is available, the class prior can be estimated by fitting a mixture of class-wise data distributions to the unlabeled data distribution. However, in practice, such an additional labeled dataset is often not available. In this paper, we show that, with additional samples coming only from the positive class, the class prior of the unlabeled dataset can be estimated correctly. Our key idea is to use properly penalized divergences for model fitting to cancel the error caused by the absence of negative samples. We further show that the use of the penalized $L_1$-distance gives a computationally efficient algorithm with an analytic solution, and establish its uniform deviation bound and estimation error bound. Finally, we experimentally demonstrate the usefulness of the proposed method.

Keywords: Class-prior estimation, positive and unlabeled data.

1. Introduction

Suppose that we have two datasets $\mathcal{X}$ and $\mathcal{X}'$, which are i.i.d. samples from probability distributions with densities $p(x\mid y=1)$ and $p(x)$, respectively:
$$\mathcal{X} = \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} p(x\mid y=1), \qquad \mathcal{X}' = \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} p(x).$$
That is, $\mathcal{X}$ is a set of samples from the positive class and $\mathcal{X}'$ is a set of unlabeled samples (consisting of both positive and negative samples). Our goal is to estimate the class prior $\pi = p(y=1)$ in the unlabeled dataset $\mathcal{X}'$.

Estimation of the class prior from positive and unlabeled data is of great practical importance, since it allows a classifier to be trained only from these datasets (Scott and Blanchard, 2009; du Plessis et al., 2014), in the absence of negative data.
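As a concrete illustration of this data-generating setup, the following minimal sketch (in Python; not part of the original paper, and with illustrative Gaussian parameters and sample sizes) draws a positive dataset from $p(x\mid y=1)$ and an unlabeled dataset from the marginal $p(x) = \pi\, p(x\mid y=1) + (1-\pi)\, p(x\mid y=-1)$:

```python
# Minimal sketch of the positive/unlabeled sampling scheme described above.
# All numeric values (class-conditional Gaussians, prior, sample sizes) are
# illustrative assumptions, not taken from the paper.
import numpy as np

def sample_pu_data(pi=0.2, n=300, n_prime=300, seed=0):
    rng = np.random.default_rng(seed)
    X_pos = rng.normal(loc=0.0, scale=1.0, size=n)          # X  ~ p(x | y = +1)
    is_pos = rng.random(n_prime) < pi                        # latent labels of unlabeled points
    X_unl = np.where(is_pos,
                     rng.normal(0.0, 1.0, size=n_prime),     # positive component of p(x)
                     rng.normal(2.0, 1.0, size=n_prime))     # negative component of p(x)
    return X_pos, X_unl

X_pos, X_unl = sample_pu_data()                              # only X_pos and X_unl are observed
```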

Figure 1: Class-prior estimation by matching the model $q(x;\theta)$ to the unlabeled input data density $p(x)$. (a) Full matching with $q(x;\theta) = \theta p(x\mid y=1) + (1-\theta)p(x\mid y=-1)$. (b) Partial matching with $q(x;\theta) = \theta p(x\mid y=1)$.

If a mixture of class-wise input data densities,
$$q(x;\theta) = \theta\, p(x\mid y=1) + (1-\theta)\, p(x\mid y=-1),$$
is fitted to the unlabeled input data density $p(x)$, the true class prior $\pi$ can be obtained (Saerens et al., 2002; du Plessis and Sugiyama, 2012), as illustrated in Figure 1(a). In practice, fitting may be performed under an $f$-divergence (Ali and Silvey, 1966; Csiszár, 1967):
$$\theta^* := \arg\min_{\theta} \int p(x)\, f\!\left(\frac{q(x;\theta)}{p(x)}\right) \mathrm{d}x, \qquad (1)$$
where $f(t)$ is a convex function with $f(1) = 0$. So far, class-prior estimation methods based on the Kullback-Leibler divergence (Saerens et al., 2002) and the Pearson divergence (du Plessis and Sugiyama, 2012) have been developed (Table 1). Additionally, class-prior estimation has been performed by $L_2$-distance minimization (Sugiyama et al., 2012). However, since these methods require labeled samples from both the positive and negative classes, they cannot be directly employed in the current setup.

To cope with this problem, a partial model, $q(x;\theta) = \theta\, p(x\mid y=1)$, was used in Elkan and Noto (2008) and du Plessis and Sugiyama (2014) to estimate the class prior in the absence of negative samples (Figure 1(b)):
$$\widehat{\theta} := \arg\min_{\theta} \mathrm{Div}_f(\theta), \qquad (2)$$
where
$$\mathrm{Div}_f(\theta) := \int p(x)\, f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \mathrm{d}x.$$

In this paper, we first show that the above partial matching approach consistently overestimates the true class prior. We then show that, by appropriately penalizing $f$-divergences, the class prior can be correctly obtained. We further show that the use of the penalized $L_1$-distance drastically simplifies the estimation procedure, resulting in an analytic estimator that can be computed efficiently. We also establish a uniform deviation bound and an estimation error bound for the penalized $L_1$-distance estimator. Finally, through experiments, we demonstrate the usefulness of the proposed method in classification from positive and unlabeled data.
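To make the partial-matching objective (2) concrete, the sketch below (an illustration, not the paper's code) evaluates $\mathrm{Div}_f(\theta)$ for the Pearson divergence $f(t) = \frac{1}{2}(t-1)^2$ by numerical integration on a one-dimensional Gaussian example; the densities, the true prior, and the integration grid are assumptions chosen for illustration. As discussed in Section 2.1 below, the minimizer is expected to exceed the true prior whenever the classes overlap.

```python
# Sketch: the partial-matching objective (2) under the Pearson divergence,
# evaluated by numerical integration for an illustrative 1-D Gaussian example.
import numpy as np
from scipy.stats import norm

pi_true = 0.2
p_pos = lambda x: norm.pdf(x, loc=0.0, scale=1.0)                  # p(x | y = +1)
p_neg = lambda x: norm.pdf(x, loc=2.0, scale=1.0)                  # p(x | y = -1)
p_mix = lambda x: pi_true * p_pos(x) + (1.0 - pi_true) * p_neg(x)  # p(x)

grid = np.linspace(-8.0, 10.0, 4001)
dx = grid[1] - grid[0]

def div_f_partial(theta):
    """Div_f(theta) = int p(x) f(theta * p(x|y=1) / p(x)) dx with f(t) = (t-1)^2 / 2."""
    t = theta * p_pos(grid) / p_mix(grid)
    return np.sum(p_mix(grid) * 0.5 * (t - 1.0) ** 2) * dx

thetas = np.linspace(0.0, 1.0, 201)
theta_hat = thetas[np.argmin([div_f_partial(th) for th in thetas])]
print(theta_hat)   # expected to be larger than pi_true = 0.2 here, since the classes overlap
```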

Table 1: Common f-divergences. $f^*(z)$ is the conjugate of $f(t)$, and $\tilde{f}^*(z)$ is the conjugate of the penalized function $\tilde{f}(t) = f(t)$ for $0 \le t \le 1$ and $\infty$ otherwise.

Kullback-Leibler divergence (Kullback and Leibler, 1951): $f(t) = -\log(t)$; conjugate $f^*(z) = -1 - \log(-z)$; penalized conjugate $\tilde{f}^*(z) = -1 - \log(-z)$ for $z \le -1$, and $z$ for $z > -1$.

Pearson divergence (Pearson, 1900): $f(t) = \frac{1}{2}(t-1)^2$; conjugate $f^*(z) = \frac{1}{2}z^2 + z$; penalized conjugate $\tilde{f}^*(z) = -\frac{1}{2}$ for $z < -1$, $\frac{1}{2}z^2 + z$ for $-1 \le z \le 0$, and $z$ for $z > 0$.

$L_1$-distance: $f(t) = |t-1|$; conjugate $f^*(z) = z$ for $|z| \le 1$ and $\infty$ otherwise; penalized conjugate $\tilde{f}^*(z) = \max(z, -1)$.

2. Class-prior estimation via penalized f-divergences

First, we investigate the behavior of the partial matching method (2), which can be regarded as an extension of the existing analysis for the Pearson divergence (du Plessis and Sugiyama, 2014) to more general divergences. We show that naively using general divergences may result in an overestimate of the class prior, and show how this situation can be avoided by penalization.

2.1. Over-estimation of the class prior

For f-divergences, we focus on $f(t)$ whose minimum is attained at $t \ge 1$. We also assume that $f$ is differentiable and that its derivative satisfies $\partial f(t) < 0$ when $t < 1$ and $\partial f(t) \le 0$ when $t = 1$. This condition is satisfied for divergences such as the Kullback-Leibler divergence and the Pearson divergence.

Because of the divergence matching formulation, we expect that the objective function (2) is minimized at $\theta = \pi$. That is, based on the first-order optimality condition, we expect that the derivative of $\mathrm{Div}_f(\theta)$ with respect to $\theta$, given by
$$\partial \mathrm{Div}_f(\theta) = \int \partial f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) p(x\mid y=1)\, \mathrm{d}x,$$
satisfies $\partial \mathrm{Div}_f(\pi) = 0$. Since
$$\frac{\pi\, p(x\mid y=1)}{p(x)} = p(y=1\mid x),$$
we have
$$\partial \mathrm{Div}_f(\pi) = \int \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x.$$
The domain of the above integral, where $p(x\mid y=1) > 0$, can be expressed as
$$D_1 = \{x : p(y=1\mid x) = 1,\ p(x\mid y=1) > 0\}, \qquad D_2 = \{x : p(y=1\mid x) < 1,\ p(x\mid y=1) > 0\}.$$
The derivative is then expressed as
$$\partial \mathrm{Div}_f(\pi) = \int_{D_1} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x + \int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x. \qquad (3)$$
The posterior is $p(y=1\mid x) = \pi p(x\mid y=1)/p(x)$, where $p(x) = \pi p(x\mid y=1) + (1-\pi)p(x\mid y=-1)$. $D_1$ is the part of the domain where the two classes do not overlap, because $p(y=1\mid x) = 1$ implies that $\pi p(x\mid y=1) = p(x)$ and $(1-\pi)p(x\mid y=-1) = 0$. Conversely, $D_2$ is the part of the domain where the classes overlap, because $p(y=1\mid x) < 1$ implies that $\pi p(x\mid y=1) < p(x)$ and $(1-\pi)p(x\mid y=-1) > 0$.

Since the first term in (3) is non-positive, the derivative can be zero only if $D_2$ is empty (i.e., there is no class overlap). If $D_2$ is not empty (i.e., there is class overlap), the derivative will be negative. Since the objective function $\mathrm{Div}_f(\theta)$ is convex, the derivative $\partial \mathrm{Div}_f(\theta)$ is a monotone non-decreasing function. Therefore, if the function $\mathrm{Div}_f(\theta)$ has a minimizer, it will be larger than the true class prior $\pi$.
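For instance, for the Pearson divergence $f(t) = \frac{1}{2}(t-1)^2$ we have $\partial f(t) = t - 1$, so
$$\partial \mathrm{Div}_f(\pi) = \int \bigl(p(y=1\mid x) - 1\bigr)\, p(x\mid y=1)\, \mathrm{d}x = -\int_{D_2} \bigl(1 - p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x,$$
which is strictly negative whenever $D_2$ is non-empty, so the minimizer of $\mathrm{Div}_f(\theta)$ indeed lies above $\pi$ in this case.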

2.2. Partial distribution matching via penalized f-divergences

In this section, we consider the function
$$f(t) = \begin{cases} -(t-1) & t < 1,\\ c\,(t-1) & t \ge 1. \end{cases}$$
This function coincides with the $L_1$-distance when $c = 1$. The analysis here is slightly more involved, since the subderivative should be taken at $t = 1$. This gives the following:
$$\partial \mathrm{Div}_f(\pi) = \int_{D_1} \partial f(1)\, p(x\mid y=1)\, \mathrm{d}x + \int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x, \qquad (4)$$
where the subderivative at $t = 1$ is $\partial f(1) = [-1, c]$ and the derivative for $t < 1$ is $\partial f(t) = -1$. We can therefore write the subderivative of the first term in (4) as
$$\int_{D_1} \partial f(1)\, p(x\mid y=1)\, \mathrm{d}x = \left[ -\int_{D_1} p(x\mid y=1)\, \mathrm{d}x,\ \ c \int_{D_1} p(x\mid y=1)\, \mathrm{d}x \right].$$
The derivative for the second term in (4) is
$$\int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x = -\int_{D_2} p(x\mid y=1)\, \mathrm{d}x.$$
To achieve a minimum at $\pi$, we should have
$$0 \in \left[ -\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x,\ \ c\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x \right].$$
However, depending on $D_1$ and $D_2$, we may have
$$c\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x < 0,$$
which means that $0 \notin \partial \mathrm{Div}_f(\pi)$. The solution is to take $c = \infty$ to ensure that $0 \in \partial \mathrm{Div}_f(\pi)$ is always satisfied.

Using the same reasoning as above, we can also rectify the overestimation problem for other divergences specified by $f(t)$, by replacing $f(t)$ with a penalized function $\tilde{f}(t)$:
$$\tilde{f}(t) = \begin{cases} f(t) & 0 \le t \le 1,\\ \infty & \text{otherwise}. \end{cases}$$
In the next section, estimation of penalized f-divergences is discussed.
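The penalized conjugates that result from this construction are listed in Table 1; the short sketch below (an illustrative implementation, not from the paper) writes them out as functions, which can be plugged into the estimation procedure of the next section.

```python
# Sketch: penalized conjugates from Table 1, i.e., conjugates of
# f~(t) = f(t) on [0, 1] and +infinity otherwise.
import numpy as np

def conj_pen_kl(z):
    """Penalized conjugate of the KL function f(t) = -log(t)."""
    zc = np.minimum(z, -1.0)                      # clipped copy so the log argument stays positive
    return np.where(z <= -1.0, -1.0 - np.log(-zc), z)

def conj_pen_pearson(z):
    """Penalized conjugate of the Pearson function f(t) = (t-1)^2 / 2."""
    return np.where(z < -1.0, -0.5, np.where(z <= 0.0, 0.5 * z**2 + z, z))

def conj_pen_l1(z):
    """Penalized conjugate of the L1 function f(t) = |t-1|: simply max(z, -1)."""
    return np.maximum(z, -1.0)
```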

2.3. Direct evaluation of penalized f-divergences

Here, we show how distribution matching can be performed without density estimation under penalized f-divergences. We use the Fenchel duality bounding technique for f-divergences (Keziou, 2003), which is based on Fenchel's inequality:
$$f(t) \ge tz - f^*(z), \qquad (5)$$
where $f^*(z)$ is the Fenchel dual or convex conjugate, defined as
$$f^*(z) = \sup_{t'}\ \bigl[ t'z - f(t') \bigr].$$
Applying the bound (5) in a pointwise manner, we obtain
$$f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \ge \frac{\theta\, p(x\mid y=1)}{p(x)}\, r(x) - f^*\bigl(r(x)\bigr),$$
where $r(x)$ fulfills the role of $z$ in (5). Multiplying both sides with $p(x)$ gives
$$p(x)\, f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \ge \theta\, r(x)\, p(x\mid y=1) - p(x)\, f^*\bigl(r(x)\bigr). \qquad (6)$$
Integrating and then selecting the tightest bound gives
$$\mathrm{Div}_f(\theta) \ge \sup_{r} \left[ \theta \int r(x)\, p(x\mid y=1)\, \mathrm{d}x - \int f^*\bigl(r(x)\bigr)\, p(x)\, \mathrm{d}x \right]. \qquad (7)$$
Replacing expectations with sample averages gives
$$\widehat{\mathrm{Div}}_f(\theta) \ge \sup_{r} \left[ \frac{\theta}{n}\sum_{i=1}^{n} r(x_i) - \frac{1}{n'}\sum_{j=1}^{n'} f^*\bigl(r(x'_j)\bigr) \right], \qquad (8)$$
where $\widehat{\mathrm{Div}}_f(\theta)$ denotes $\mathrm{Div}_f(\theta)$ estimated from sample averages. Note that the conjugate $f^*(z)$ of any function $f(t)$ is convex. Therefore, if $r(x)$ is linear in parameters, the above maximization problem is convex and thus can be easily solved. The conjugates for selected penalized f-divergences are given in Table 1.
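As a small sketch (an illustration, not the paper's code), the empirical lower bound in (8) can be written directly as a function of a candidate $r$ and a (penalized) conjugate $f^*$:

```python
# Sketch: the empirical Fenchel lower bound (8),
#   theta/n * sum_i r(x_i) - 1/n' * sum_j f*(r(x'_j)),
# which is to be maximized over r (e.g., over a linear-in-parameter model).
import numpy as np

def empirical_lower_bound(theta, r, f_conj, X_pos, X_unl):
    return theta * np.mean(r(X_pos)) - np.mean(f_conj(r(X_unl)))

# Sanity check: with the penalized L1 conjugate max(z, -1) and the constant
# function r(x) = -1, the bound evaluates to -theta + 1.
```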

2.4. Penalized L1-distance estimation

Here we focus on the penalized $L_1$-distance, $f(t) = |t-1|$, as a specific example of penalized f-divergences, and show that an analytic and computationally efficient solution can be obtained. The conjugate for the penalized $L_1$-distance is
$$\tilde{f}^*(z) = \max(z, -1).$$
Then we can see that the lower bound in (7) will be met with equality when
$$r(x) = \begin{cases} -1 & \text{if } \theta\, p(x\mid y=1) < p(x),\\ \ge -1 & \text{otherwise}. \end{cases} \qquad (9)$$
For this reason, we use the following linear model as $r(x)$:
$$r(x) = \sum_{l=1}^{b} \alpha_l\, \varphi_l(x) - 1, \qquad (10)$$
where $\{\varphi_l(x)\}_{l=1}^{b}$ is a set of non-negative basis functions.¹ Then, maximizing the empirical estimate in the right-hand side of (8) under this model, with a regularizer added, can be expressed as
$$(\widehat{\alpha}_1, \ldots, \widehat{\alpha}_b) = \arg\min_{(\alpha_1,\ldots,\alpha_b)} \left[ \frac{1}{n'}\sum_{j=1}^{n'} \max\!\left( \sum_{l=1}^{b}\alpha_l\varphi_l(x'_j) - 1,\ -1 \right) - \frac{\theta}{n}\sum_{i=1}^{n}\sum_{l=1}^{b}\alpha_l\varphi_l(x_i) + \theta + \frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2 \right], \qquad (11)$$
where the regularization term $\frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2$ for $\lambda > 0$ is included to avoid overfitting.

The optimal value of the lower bound (7) occurs when $r(x) \ge -1$ (see (9)). This fact can be incorporated by constraining the parameters of the model (10) so that $\alpha_l \ge 0$, $l = 1,\ldots,b$. If the basis functions $\varphi_l(x)$, $l = 1,\ldots,b$, are non-negative, the first argument of the max in (11) is then always at least $-1$, and the max operation becomes superfluous. This allows us to obtain the parameter vector $(\widehat{\alpha}_1,\ldots,\widehat{\alpha}_b)$ as the solution to the following constrained optimization problem:
$$(\widehat{\alpha}_1,\ldots,\widehat{\alpha}_b) = \arg\min_{(\alpha_1,\ldots,\alpha_b)} \sum_{l=1}^{b}\left( \frac{\lambda}{2}\alpha_l^2 - \alpha_l\beta_l \right) \quad \text{s.t. } \alpha_l \ge 0,\ l = 1,\ldots,b,$$
where
$$\beta_l = \frac{\theta}{n}\sum_{i=1}^{n}\varphi_l(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\varphi_l(x'_j).$$
The above optimization problem decouples over the $\alpha_l$ values, and each can be solved separately as
$$\widehat{\alpha}_l = \frac{1}{\lambda}\max(0, \beta_l).$$

Since $\widehat{\alpha}$ can be calculated with just a max operation, the above solution is extremely fast to compute. All hyper-parameters, including the Gaussian width $\sigma$ and the regularization parameter $\lambda$, are selected for each $\theta$ via straightforward cross-validation. Finally, our estimate of the penalized $L_1$-distance (i.e., the maximum of the empirical estimate in the right-hand side of (8)) is obtained as
$$\widehat{\mathrm{penL}_1}(\theta) = \frac{1}{\lambda}\sum_{l=1}^{b}\max(0, \beta_l)\,\beta_l - \theta + 1.$$
The class prior is then selected so as to minimize the above estimator.

1. In practice, we use Gaussian kernels centered at all sample points as the basis functions: $\varphi_l(x) = \exp\bigl(-\|x - c_l\|^2/(2\sigma^2)\bigr)$, where $\sigma > 0$ and $(c_1,\ldots,c_n, c_{n+1},\ldots,c_{n+n'}) = (x_1,\ldots,x_n, x'_1,\ldots,x'_{n'})$.
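The whole procedure of this subsection can be summarized in a few lines; the sketch below (a schematic implementation under the stated choices, with the bandwidth $\sigma$, the regularization parameter $\lambda$, and the candidate grid of $\theta$ fixed rather than cross-validated) computes $\beta_l$, the analytic solution $\widehat{\alpha}_l = \max(0,\beta_l)/\lambda$, the estimate $\widehat{\mathrm{penL}_1}(\theta)$, and the class-prior estimate as its minimizer over $\theta$:

```python
# Sketch of the analytic pen-L1 estimator with Gaussian basis functions centred
# at all samples. sigma, lam and the theta grid are fixed here for simplicity;
# the paper selects the hyper-parameters by cross-validation.
import numpy as np

def gaussian_basis(X, centers, sigma):
    """phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for 1-D inputs; returns an (N, b) matrix."""
    d2 = (np.asarray(X)[:, None] - np.asarray(centers)[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pen_l1(theta, X_pos, X_unl, sigma=1.0, lam=0.1):
    centers = np.concatenate([X_pos, X_unl])                 # b = n + n' basis functions
    Phi_pos = gaussian_basis(X_pos, centers, sigma)          # (n, b)
    Phi_unl = gaussian_basis(X_unl, centers, sigma)          # (n', b)
    beta = theta * Phi_pos.mean(axis=0) - Phi_unl.mean(axis=0)
    return np.sum(np.maximum(0.0, beta) * beta) / lam - theta + 1.0

def estimate_prior(X_pos, X_unl, thetas=np.linspace(0.01, 0.99, 99), **kwargs):
    values = [pen_l1(th, X_pos, X_unl, **kwargs) for th in thetas]
    return float(thetas[int(np.argmin(values))])             # theta minimizing penL1(theta)
```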

3. Stability analysis

Regarding the estimation stability of $\widehat{\mathrm{penL}_1}(\theta)$ for fixed $\theta$, we have the following deviation bound. Without loss of generality, we assume that the basis functions are upper bounded by one, i.e., $\varphi_l(x) \le 1$ for all $x$ and $l = 1,\ldots,b$.

Theorem 1 (Deviation bound) Fix $\theta$. Then, for any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D} = \mathcal{X} \cup \mathcal{X}'$ for estimating $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$, we have
$$\left| \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \right| \le \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof We prove the theorem based on a technique known as the method of bounded differences. Let
$$f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'}) = \max(0, \beta_l)\,\beta_l,$$
so that
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) = \frac{1}{\lambda}\sum_{l=1}^{b} f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'}) - \theta + 1.$$
Next, we replace $x_i$ with $\tilde{x}_i$ and bound the difference between $f_l(x_1,\ldots,x_i,\ldots,x_n, x'_1,\ldots,x'_{n'})$ and $f_l(x_1,\ldots,\tilde{x}_i,\ldots,x_n, x'_1,\ldots,x'_{n'})$. Let $t = \varphi_l(x_i)$, $\tilde{t} = \varphi_l(\tilde{x}_i)$, and $\xi_l = \beta_l - \frac{\theta}{n}t$. Since the basis functions are bounded, we know that $t, \tilde{t} \in [0,1]$, as well as $|\xi_l| \le 1$. Then the maximum difference is
$$c_l = \sup_{t, \tilde{t}} \left| \max\Bigl(0,\ \frac{\theta}{n}t + \xi_l\Bigr)\Bigl(\frac{\theta}{n}t + \xi_l\Bigr) - \max\Bigl(0,\ \frac{\theta}{n}\tilde{t} + \xi_l\Bigr)\Bigl(\frac{\theta}{n}\tilde{t} + \xi_l\Bigr) \right|.$$
By analyzing the above for the different cases in which the constraints are active, the maximum difference for replacing $x_i$ with $\tilde{x}_i$ is $\frac{1}{n}(2 + \frac{1}{n})$. We can use the same argument for replacing $x'_j$ with $\tilde{x}'_j$ in $f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'})$, resulting in a maximum difference of $\frac{1}{n'}(2 + \frac{1}{n'})$. Note that this holds for all $f_l(\cdot)$ simultaneously, and thus the change of $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$ is no more than $\frac{b}{\lambda n}(2 + \frac{1}{n})$ if $x_i$ is replaced with $\tilde{x}_i$, or $\frac{b}{\lambda n'}(2 + \frac{1}{n'})$ if $x'_j$ is replaced with $\tilde{x}'_j$. We can therefore apply McDiarmid's inequality to obtain, with probability at least $1 - \delta/2$,
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \le \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
Applying McDiarmid's inequality again for $\mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$ proves the result.
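For reference, McDiarmid's inequality in the form used above states that if replacing the $k$-th sample point changes the value of a function $F$ of the sample by at most $c_k$, then
$$\Pr\bigl(F - \mathbb{E}[F] \ge \varepsilon\bigr) \le \exp\!\left(-\frac{2\varepsilon^2}{\sum_k c_k^2}\right);$$
setting $c_k = \frac{b}{\lambda n}\bigl(2 + \frac{1}{n}\bigr)$ for the $n$ positive samples and $c_k = \frac{b}{\lambda n'}\bigl(2 + \frac{1}{n'}\bigr)$ for the $n'$ unlabeled samples, and solving $\exp(-2\varepsilon^2/\sum_k c_k^2) = \delta/2$ for $\varepsilon$, gives the one-sided bound used in the proof.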

Theorem 1 shows that the deviation of our estimate from its expectation is small with high probability. Nevertheless, $\theta$ must be fixed before seeing the data, so we cannot use the estimate to choose $\theta$. This motivates us to derive a uniform deviation bound. Let us define the constants
$$C_x = \sup_x \Bigl( \sum_{l=1}^{b} \varphi_l^2(x) \Bigr)^{1/2} \le \sqrt{b}, \qquad C_\alpha = \sup_{\theta}\ \sup_{\mathcal{X} \sim p^n(x\mid y=1),\ \mathcal{X}' \sim p^{n'}(x)} \Bigl( \sum_{l=1}^{b} \widehat{\alpha}_l^2(\theta, \mathcal{D}) \Bigr)^{1/2},$$
where we write $\widehat{\alpha}_l(\theta, \mathcal{D})$ to emphasize that $\widehat{\alpha}_l$ depends on $\theta$ and $\mathcal{D}$.

Theorem 2 (Uniform deviation bound) For any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D}$, the following holds for all $\theta$:
$$\left| \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \right| \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof We denote $g(\theta;\mathcal{D}) = \sum_{l=1}^{b}\widehat{\alpha}_l\beta_l$ and $\bar{g}(\theta) = \mathbb{E}_{\mathcal{D}}[g(\theta;\mathcal{D})]$, so that
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) = g(\theta;\mathcal{D}) - \theta + 1, \qquad \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = \bar{g}(\theta) - \theta + 1,$$
and
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = g(\theta;\mathcal{D}) - \bar{g}(\theta).$$

Step 1: We first consider one direction. By definition, for all $\theta$,
$$g(\theta;\mathcal{D}) - \bar{g}(\theta) \le \sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\}.$$
We cannot simply apply Theorem 1 to bound the right-hand side, since $\theta$ in the right-hand side is not fixed. Nevertheless, according to the proof of Theorem 1, if we replace a single point $x_i$ or $x'_j$ in $\mathcal{D}$, the change of $\sup_{\theta}\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\}$ is also bounded by $\frac{b}{\lambda n}(2 + \frac{1}{n})$ or $\frac{b}{\lambda n'}(2 + \frac{1}{n'})$, respectively. We then have, with probability at least $1 - \delta/2$,
$$\sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\} \le \mathbb{E}_{\mathcal{D}}\Bigl[ \sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\} \Bigr] + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Step 2: Next we bound $\mathbb{E}_{\mathcal{D}}[\sup_\theta\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\}]$ based on a technique known as symmetrization. Note that the function $g(\theta;\mathcal{D}) = \sum_{l=1}^{b}\widehat{\alpha}_l\beta_l$ can be rewritten in a point-wise manner rather than a basis-wise manner:
$$g(\theta;\mathcal{D}) = \sum_{l=1}^{b} \widehat{\alpha}_l \left( \frac{\theta}{n}\sum_{i=1}^{n}\varphi_l(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\varphi_l(x'_j) \right) = \frac{1}{n}\sum_{i=1}^{n}\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\omega'(x'_j),$$
where for simplicity we define
$$\omega(x) = \theta\sum_{l=1}^{b}\widehat{\alpha}_l\varphi_l(x), \qquad \omega'(x) = \sum_{l=1}^{b}\widehat{\alpha}_l\varphi_l(x).$$
Let $\tilde{\mathcal{D}} = \{\tilde{x}_1,\ldots,\tilde{x}_n, \tilde{x}'_1,\ldots,\tilde{x}'_{n'}\}$ be a ghost sample. Then
$$\mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - \mathbb{E}_{\tilde{\mathcal{D}}}[g(\theta;\tilde{\mathcal{D}})]\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\ \mathbb{E}_{\tilde{\mathcal{D}}}\bigl[g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr]\Bigr] \le \mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr\}\Bigr],$$
where we apply Jensen's inequality together with the fact that the supremum is a convex function. Moreover, let $\sigma = \{\sigma_1,\ldots,\sigma_n, \sigma'_1,\ldots,\sigma'_{n'}\}$ be a set of Rademacher variables of size $n + n'$. Then
$$\mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr) - \frac{1}{n'}\sum_{j=1}^{n'}\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr) \right\}\right]$$
$$= \mathbb{E}_{\sigma,\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr) \right\}\right],$$
since the original and ghost samples are symmetric, and each $\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr)$ shares the same distribution with $\sigma_i\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr)$ and each $\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr)$ shares the same distribution with $\sigma'_j\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr)$.

Subsequently,
$$\mathbb{E}_{\sigma,\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) + \frac{1}{n}\sum_{i=1}^{n}(-\sigma_i)\omega(\tilde{x}_i) - \frac{1}{n'}\sum_{j=1}^{n'}(-\sigma'_j)\omega'(\tilde{x}'_j) \right\}\right]$$
$$\le \mathbb{E}_{\sigma,\mathcal{D}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] + \mathbb{E}_{\sigma,\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}(-\sigma_i)\omega(\tilde{x}_i) - \frac{1}{n'}\sum_{j=1}^{n'}(-\sigma'_j)\omega'(\tilde{x}'_j) \right\}\right]$$
$$= 2\,\mathbb{E}_{\sigma,\mathcal{D}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right],$$
where we first apply the triangle inequality, and then make use of the fact that the original and ghost samples have the same distribution and all Rademacher variables have the same distribution.

Step 3: The Rademacher complexity still remains to be bounded. To this end, we decompose it into two terms:
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] = \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i) - \frac{1}{n'}\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j) \right\}\right]$$
$$\le \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] + \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n'}\left|\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j)\right|\right].$$
Applying the Cauchy-Schwarz inequality followed by Jensen's inequality to the first Rademacher average gives
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] \le \frac{C_\alpha}{n}\,\mathbb{E}_{\mathcal{D},\sigma}\left[\left( \sum_{l=1}^{b}\Bigl(\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\Bigr)^2 \right)^{1/2}\right] \le \frac{C_\alpha}{n}\left( \mathbb{E}_{\mathcal{D},\sigma}\left[ \sum_{l=1}^{b}\sum_{i,i'=1}^{n}\sigma_i\sigma_{i'}\varphi_l(x_i)\varphi_l(x_{i'}) \right] \right)^{1/2}.$$
Since $\sigma_1,\ldots,\sigma_n$ are Rademacher variables,
$$\mathbb{E}_{\mathcal{D},\sigma}\left[ \sum_{l=1}^{b}\sum_{i,i'=1}^{n}\sigma_i\sigma_{i'}\varphi_l(x_i)\varphi_l(x_{i'}) \right] = \mathbb{E}_{\mathcal{D}}\left[ \sum_{l=1}^{b}\sum_{i=1}^{n}\varphi_l^2(x_i) \right] \le n\,C_x^2.$$

Consequently, we have
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] \le \frac{C_\alpha C_x}{\sqrt{n}}, \qquad \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n'}\left|\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j)\right|\right] \le \frac{C_\alpha C_x}{\sqrt{n'}},$$
and therefore
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] \le \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{n'}} \right) C_\alpha C_x.$$

Step 4: Combining the three steps together, we obtain that with probability at least $1-\delta/2$, for all $\theta$,
$$g(\theta;\mathcal{D}) - \bar{g}(\theta) \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
The same argument can be used to bound $\bar{g}(\theta) - g(\theta;\mathcal{D})$. Combining these two tail inequalities proves the theorem.

Theorem 2 shows that the uniform deviation bound is of order $O(1/\sqrt{n} + 1/\sqrt{n'})$, whereas the deviation bound for fixed $\theta$ is of order $O(\sqrt{1/n + 1/n'})$, as shown in Theorem 1. For this special estimation problem, the convergence rate of the uniform deviation bound is clearly worse than the convergence rate of the deviation bound for fixed $\theta$. However, after obtaining the uniform deviation bound, we are able to bound the estimation error, that is, the gap between the expectation of our estimate and the best possible estimate within the model.

To do so, we need to constrain the parameters of the best estimate via $C_\alpha$, since (10) takes (9) as the target, while (9) is an unbounded function. Furthermore, we assume that the regularization is weak enough so that it does not affect the solution $\widehat{\alpha}$ too much. Specifically, given $\mathcal{D}$, let $\tilde{\alpha}$ be the minimizer of the objective function in (11) but without the regularization term $\frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2$, subject to $\|\alpha\|_2 \le C_\alpha$. Let $\widetilde{\mathrm{penL}_1}(\theta;\mathcal{D})$ be the estimator of the penalized $L_1$-distance corresponding to $\tilde{\alpha}$, and assume that there exists $\kappa_\alpha > 0$ such that, for all $\theta$ and $\mathcal{D}$,
$$\bigl| \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) \bigr| \le \kappa_\alpha.$$
Then we have the following theorem.

Theorem 3 (Estimation error bound) Let $\mathrm{penL}_1^*(\theta)$ be the value of the lower bound in the right-hand side of (7) based on the model (10) with the best possible parameter $\alpha^*$, where $\|\alpha^*\|_2 \le C_\alpha$. For any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D}$, the following holds for all $\theta$:
$$\mathrm{penL}_1^*(\theta) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \le \kappa_\alpha + \left( \frac{4}{\sqrt{n}} + \frac{4}{\sqrt{n'}} \right) C_\alpha C_x + 2\sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof Since $\alpha^*$ is fixed, we have $\mathbb{E}_{\mathcal{D}}[\mathrm{penL}_1^*(\theta;\mathcal{D})] = \mathrm{penL}_1^*(\theta)$. Then,
$$\mathrm{penL}_1^*(\theta) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = \mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr]$$
$$= \Bigl( \mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathrm{penL}_1^*(\theta;\mathcal{D}) \Bigr) + \Bigl( \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \Bigr) + \Bigl( \mathrm{penL}_1^*(\theta;\mathcal{D}) - \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) \Bigr) + \Bigl( \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) \Bigr).$$
We bound each of the four terms separately. According to the proof of Theorem 2, with probability at least $1-\delta/2$, for all $\theta$,
$$\mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathrm{penL}_1^*(\theta;\mathcal{D}) \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
The same can be proven for $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr]$. The third term must be non-positive, since $\widetilde{\mathrm{penL}_1}(\theta;\mathcal{D})$ is the maximizer of the empirical estimate over $\|\alpha\|_2 \le C_\alpha$. Finally, the fourth term is upper bounded by $\kappa_\alpha$, which completes the proof.

Theorem 3 shows that the gap between the expectation of our estimate and the optimal value within the model is small with high probability.

4. Related work

In Scott and Blanchard (2009) and Blanchard et al. (2010), it was proposed to reduce the problem of estimating the class prior to Neyman-Pearson classification.² A Neyman-Pearson classifier $f$ minimizes the false-negative rate $R_{-1}(f)$ while keeping the false-positive rate $R_{1}(f)$ constrained under a user-specified threshold $\alpha$ (Scott and Nowak, 2005):
$$R_{1}(f) = P_{1}\bigl(f(X) = -1\bigr), \qquad R_{-1}(f) = P_{-1}\bigl(f(X) = 1\bigr),$$
where $P_{1}$ and $P_{-1}$ denote the probabilities for the positive-class and negative-class conditional densities, respectively.

2. The papers (Scott and Blanchard, 2009; Blanchard et al., 2010) considered the nominal class as $y = 0$ and the novel class as $y = 1$, and aimed to estimate $p(y = 1)$. We use a different notation, with the nominal class as $y = 1$ and the novel class as $y = -1$, and estimate $\pi = p(y = 1)$. To simplify the exposition, we use the same notation here as in the rest of the paper.

The false-negative rate on the unlabeled dataset is defined and expressed as
$$R_X(f) = P_X\bigl(f(X) = 1\bigr) = \pi\bigl(1 - R_{1}(f)\bigr) + (1-\pi)R_{-1}(f),$$
where $P_X$ denotes the probability for the unlabeled input data density. The Neyman-Pearson classifier between $P_{1}$ and $P_X$ is defined via
$$R_{X,\alpha} = \inf_f R_X(f) \quad \text{s.t.} \quad R_{1}(f) \le \alpha.$$
Then the minimum false-negative rate on the unlabeled dataset for a given false-positive rate $\alpha$ is expressed as
$$R_{X,\alpha} = \pi(1-\alpha) + (1-\pi)R_{-1,\alpha}. \qquad (12)$$
Theorem 1 in Scott and Blanchard (2009) says that if the supports of $P_{1}$ and $P_{-1}$ are different, there exists $\alpha$ such that $R_{-1,\alpha} = 0$. Therefore, the class prior can be determined as
$$\pi = -\left.\frac{\mathrm{d}R_{X,\alpha}}{\mathrm{d}\alpha}\right|_{\alpha \to 1^-}, \qquad (13)$$
where $\alpha \to 1^-$ denotes the limit from the left-hand side. Note that this limit is necessary since the first term in (12) will be zero when $\alpha = 1$.

However, estimating the derivative as $\alpha \to 1$ is not straightforward in practice. The curve of $R_{X,\alpha}$ versus $\alpha$ can be interpreted as an ROC curve (with a suitable change in class notation), but the empirical ROC curve is often unstable at the right endpoint when the input dimensionality is high (Sanderson and Scott, 2014). One approach to overcome this problem is to fit a curve to the right endpoint of the ROC curve in order to enable the estimation (as in Sanderson and Scott (2014)). However, it is not clear how the estimated class prior is affected by this curve-fitting.
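As a rough sketch of this ROC-based strategy (this is an illustration only, not the implementation of Blanchard et al. (2010), which uses a thresholded ratio of kernel density estimates; the classifier score, the endpoint fraction, and the simple least-squares line fit are assumptions):

```python
# Sketch: estimate the class prior from the slope of the empirical R_{X,alpha} curve
# near alpha = 1, cf. (13). 'score_pos' and 'score_unl' are classifier scores for the
# positive and unlabeled samples (larger = more positive-looking).
import numpy as np

def prior_from_roc_slope(score_pos, score_unl, endpoint_fraction=0.1):
    thresholds = np.sort(np.concatenate([score_pos, score_unl]))
    alpha = np.array([np.mean(score_pos < t) for t in thresholds])   # empirical R_1(f_t)
    r_x = np.array([np.mean(score_unl >= t) for t in thresholds])    # empirical R_X(f_t)
    right = alpha >= 1.0 - endpoint_fraction                         # points near the right endpoint
    slope = np.polyfit(alpha[right], r_x[right], 1)[0]               # fit R_X ~ a * alpha + b
    return float(np.clip(-slope, 0.0, 1.0))                          # pi = -dR_X/dalpha at alpha -> 1
```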

5. Experiments

In this section, we experimentally compare the performance of the proposed method and alternative methods for estimating the class prior. We compared the following methods:

EN: The method of Elkan and Noto (2008), with the classifier implemented as a squared-loss variant of the logistic regression classifier (Sugiyama, 2010).

PE: The direct Pearson-divergence matching method proposed in du Plessis and Sugiyama (2014).

SB: The method of Blanchard et al. (2010). The Neyman-Pearson classifier was implemented as the thresholded ratio of two kernel density estimates, each with a bandwidth parameter. As in Blanchard et al. (2010), the bandwidth parameters were jointly optimized by maximizing the cross-validated estimate of the AUC. The prior was obtained by estimating (13) from the empirical ROC curve.

pen-L1 (proposed): The penalized $L_1$-distance method with an analytic solution. The basis functions were selected as Gaussians centered at all training samples. All hyper-parameters were determined by cross-validation.

First, we illustrate the systematic overestimation of the class prior by two previously proposed methods, EN (Elkan and Noto, 2008) and PE (du Plessis and Sugiyama, 2014), when the classes significantly overlap. The class-conditional densities are
$$p(x\mid y=1) = N_x(0, 1^2) \quad \text{and} \quad p(x\mid y=-1) = N_x(2, 1^2),$$
where $N_x(\mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$ with respect to $x$. The true class prior is set at $\pi = p(y=1) = 0.2$. The sizes of the unlabeled dataset and the labeled dataset were set to be equal.
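Under this toy setup, the accuracy of any class-prior estimator can be summarized by its squared error over repeated draws; the sketch below (illustrative only, reusing the estimate_prior sketch from Section 2.4 and assumed sample sizes) shows one way to do so:

```python
# Sketch: squared error of a class-prior estimator over repeated draws of the
# toy Gaussian PU data. The 'estimator' argument is any function mapping
# (X_pos, X_unl) to an estimate of the class prior, e.g. the estimate_prior
# sketch given earlier.
import numpy as np

def squared_error_over_trials(estimator, pi_true=0.2, n=300, n_prime=300, trials=100, seed=1):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        X_pos = rng.normal(0.0, 1.0, size=n)                         # p(x | y = +1)
        is_pos = rng.random(n_prime) < pi_true
        X_unl = np.where(is_pos, rng.normal(0.0, 1.0, n_prime),
                                 rng.normal(2.0, 1.0, n_prime))      # p(x)
        errors.append((estimator(X_pos, X_unl) - pi_true) ** 2)
    return float(np.mean(errors))

# e.g. squared_error_over_trials(estimate_prior)
```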

Figure 2: Histograms of class-prior estimates when the true class prior is $p(y=1) = 0.2$. The EN method and the PE method have an intrinsic bias that does not decrease even when the number of samples is increased, while the SB method and the proposed pen-L1 method work reasonably well. (Panels: (a) EN, (b) PE, (c) SB, (d) pen-L1.)

The histograms of class-prior estimates are plotted in Figure 2, showing that the EN and PE methods clearly overestimate the true class prior. We also confirmed that this overestimation does not decrease even when the number of samples is increased. On the other hand, the SB and pen-L1 methods work reasonably well.

Finally, we use the MNIST hand-written digit dataset. For each digit, all the other digits were assumed to be in the opposite class (i.e., one-versus-rest). The dataset was reduced to 4 dimensions using principal component analysis. The squared error of class-prior estimation is given in Figure 3, showing that the proposed pen-L1 method overall gives accurate estimates of the class prior, while the EN and PE methods tend to give less accurate estimates for low class priors and more accurate estimates for higher class priors, which agrees with the observation in du Plessis and Sugiyama (2014). On the other hand, the SB method tends to perform poorly, which is caused by the instability of the empirical ROC curve at the right endpoint when the input dimensionality is larger, as pointed out in Section 4.

6. Conclusion

In this paper, we discussed the problem of class-prior estimation from positive and unlabeled data. We first showed that class-prior estimation from positive and unlabeled data by partial distribution matching under f-divergences yields systematic overestimation of the class prior. We then proposed to use penalized f-divergences to rectify this problem. We further showed that the use of the $L_1$-distance as an example of f-divergences yields a computationally efficient algorithm with an analytic solution. We provided its uniform deviation bound and estimation error bound, which theoretically support the usefulness of the proposed method. Finally, through experiments, we demonstrated that the proposed method compares favorably with existing approaches.

Figure 3: Class-prior estimation accuracy for the MNIST dataset (panels (a)–(i): Digit 1 vs. others through Digit 9 vs. others). The EN method and the PE method behave similarly, and the proposed pen-L1 method consistently outperforms them. The SB method performed poorly due to the instability of the empirical ROC curve at the right endpoint.

Acknowledgments

MCdP and GN were supported by the JST CREST program, and MS was supported by KAKENHI.

References

S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.

I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

M. C. du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. In ICML 2012, Jun. 26–Jul. 1, 2012.

M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5):1358–1362, 2014.

M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 703–711, 2014.

C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In ACM SIGKDD 14, pages 213–220, 2008.

A. Keziou. Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10):857–862, 2003.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.

M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.

T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.

C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), Clearwater Beach, Florida, USA, Apr. 2009.

C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005.

M. Sugiyama. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690–2701, 2010.

M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi. Density-difference estimation. In Advances in Neural Information Processing Systems 25, pages 692–700, 2012.


More information

arxiv: v2 [cs.lg] 14 Oct 2016

arxiv: v2 [cs.lg] 14 Oct 2016 Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Tomoya Sakai Marthinus Christoffel du Plessis Gang Niu Masashi Sugiyama The University of Tokyo The University of

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding APPEARS IN THE IEEE TRANSACTIONS ON INFORMATION THEORY, FEBRUARY 015 1 Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding Igal Sason Abstract Tight bounds for

More information

Adaptive Multi-Modal Sensing of General Concealed Targets

Adaptive Multi-Modal Sensing of General Concealed Targets Adaptive Multi-Modal Sensing of General Concealed argets Lawrence Carin Balaji Krishnapuram, David Williams, Xuejun Liao and Ya Xue Department of Electrical & Computer Engineering Duke University Durham,

More information

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure The Information Bottleneck Revisited or How to Choose a Good Distortion Measure Peter Harremoës Centrum voor Wiskunde en Informatica PO 94079, 1090 GB Amsterdam The Nederlands PHarremoes@cwinl Naftali

More information

Decentralized decision making with spatially distributed data

Decentralized decision making with spatially distributed data Decentralized decision making with spatially distributed data XuanLong Nguyen Department of Statistics University of Michigan Acknowledgement: Michael Jordan, Martin Wainwright, Ram Rajagopal, Pravin Varaiya

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Multi-Positive and Unlabeled Learning

Multi-Positive and Unlabeled Learning Multi-Positive and Unlabeled Learning Yixing Xu, Chang Xu, Chao Xu, Dacheng Tao Key Laboratory of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of Electronics Engineering and

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information