Class-prior Estimation for Learning from Positive and Unlabeled Data


JMLR: Workshop and Conference Proceedings 45:221–236, 2015. ACML 2015

Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama
The University of Tokyo, Tokyo 113-0033, Japan.

Editors: Geoffrey Holmes and Tie-Yan Liu

© 2015 M.C. du Plessis, G. Niu & M. Sugiyama.

Abstract

We consider the problem of estimating the class prior in an unlabeled dataset. Under the assumption that an additional labeled dataset is available, the class prior can be estimated by fitting a mixture of class-wise data distributions to the unlabeled data distribution. However, in practice, such an additional labeled dataset is often not available. In this paper, we show that, with additional samples coming only from the positive class, the class prior of the unlabeled dataset can be estimated correctly. Our key idea is to use properly penalized divergences for model fitting to cancel the error caused by the absence of negative samples. We further show that the use of the penalized $L_1$-distance gives a computationally efficient algorithm with an analytic solution, and establish its uniform deviation bound and estimation error bound. Finally, we experimentally demonstrate the usefulness of the proposed method.

Keywords: Class-prior estimation, positive and unlabeled data.

1. Introduction

Suppose that we have two datasets $\mathcal{X}$ and $\mathcal{X}'$, which are i.i.d. samples from probability distributions with densities $p(x\mid y=1)$ and $p(x)$, respectively:
$$\mathcal{X} = \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} p(x\mid y=1), \qquad \mathcal{X}' = \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} p(x).$$
That is, $\mathcal{X}$ is a set of samples from the positive class and $\mathcal{X}'$ is a set of unlabeled samples (consisting of both positive and negative samples). Our goal is to estimate the class prior $\pi = p(y=1)$ in the unlabeled dataset $\mathcal{X}'$.

Estimation of the class prior from positive and unlabeled data is of great practical importance, since it allows a classifier to be trained only from these datasets (Scott and Blanchard, 2009; du Plessis et al., 2014), in the absence of negative data.
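As a concrete illustration of this data-generating setup, the following minimal sketch (in Python; not part of the original paper, and with illustrative Gaussian parameters and sample sizes) draws a positive dataset from $p(x\mid y=1)$ and an unlabeled dataset from the marginal $p(x) = \pi\, p(x\mid y=1) + (1-\pi)\, p(x\mid y=-1)$:

```python
# Minimal sketch of the positive/unlabeled sampling scheme described above.
# All numeric values (class-conditional Gaussians, prior, sample sizes) are
# illustrative assumptions, not taken from the paper.
import numpy as np

def sample_pu_data(pi=0.2, n=300, n_prime=300, seed=0):
    rng = np.random.default_rng(seed)
    X_pos = rng.normal(loc=0.0, scale=1.0, size=n)          # X  ~ p(x | y = +1)
    is_pos = rng.random(n_prime) < pi                        # latent labels of unlabeled points
    X_unl = np.where(is_pos,
                     rng.normal(0.0, 1.0, size=n_prime),     # positive component of p(x)
                     rng.normal(2.0, 1.0, size=n_prime))     # negative component of p(x)
    return X_pos, X_unl

X_pos, X_unl = sample_pu_data()                              # only X_pos and X_unl are observed
```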

Figure 1: Class-prior estimation by matching the model $q(x;\theta)$ to the unlabeled input data density $p(x)$. (a) Full matching with $q(x;\theta) = \theta p(x\mid y=1) + (1-\theta)p(x\mid y=-1)$. (b) Partial matching with $q(x;\theta) = \theta p(x\mid y=1)$.

If a mixture of class-wise input data densities,
$$q(x;\theta) = \theta\, p(x\mid y=1) + (1-\theta)\, p(x\mid y=-1),$$
is fitted to the unlabeled input data density $p(x)$, the true class prior $\pi$ can be obtained (Saerens et al., 2002; du Plessis and Sugiyama, 2012), as illustrated in Figure 1(a). In practice, fitting may be performed under an $f$-divergence (Ali and Silvey, 1966; Csiszár, 1967):
$$\theta^* := \arg\min_{\theta} \int p(x)\, f\!\left(\frac{q(x;\theta)}{p(x)}\right) \mathrm{d}x, \qquad (1)$$
where $f(t)$ is a convex function with $f(1) = 0$. So far, class-prior estimation methods based on the Kullback-Leibler divergence (Saerens et al., 2002) and the Pearson divergence (du Plessis and Sugiyama, 2012) have been developed (Table 1). Additionally, class-prior estimation has been performed by $L_2$-distance minimization (Sugiyama et al., 2012). However, since these methods require labeled samples from both the positive and negative classes, they cannot be directly employed in the current setup.

To cope with this problem, a partial model, $q(x;\theta) = \theta\, p(x\mid y=1)$, was used in Elkan and Noto (2008) and du Plessis and Sugiyama (2014) to estimate the class prior in the absence of negative samples (Figure 1(b)):
$$\widehat{\theta} := \arg\min_{\theta} \mathrm{Div}_f(\theta), \qquad (2)$$
where
$$\mathrm{Div}_f(\theta) := \int p(x)\, f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \mathrm{d}x.$$

In this paper, we first show that the above partial matching approach consistently overestimates the true class prior. We then show that, by appropriately penalizing $f$-divergences, the class prior can be correctly obtained. We further show that the use of the penalized $L_1$-distance drastically simplifies the estimation procedure, resulting in an analytic estimator that can be computed efficiently. We also establish a uniform deviation bound and an estimation error bound for the penalized $L_1$-distance estimator. Finally, through experiments, we demonstrate the usefulness of the proposed method in classification from positive and unlabeled data.
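To make the partial-matching objective (2) concrete, the sketch below (an illustration, not the paper's code) evaluates $\mathrm{Div}_f(\theta)$ for the Pearson divergence $f(t) = \frac{1}{2}(t-1)^2$ by numerical integration on a one-dimensional Gaussian example; the densities, the true prior, and the integration grid are assumptions chosen for illustration. As discussed in Section 2.1 below, the minimizer is expected to exceed the true prior whenever the classes overlap.

```python
# Sketch: the partial-matching objective (2) under the Pearson divergence,
# evaluated by numerical integration for an illustrative 1-D Gaussian example.
import numpy as np
from scipy.stats import norm

pi_true = 0.2
p_pos = lambda x: norm.pdf(x, loc=0.0, scale=1.0)                  # p(x | y = +1)
p_neg = lambda x: norm.pdf(x, loc=2.0, scale=1.0)                  # p(x | y = -1)
p_mix = lambda x: pi_true * p_pos(x) + (1.0 - pi_true) * p_neg(x)  # p(x)

grid = np.linspace(-8.0, 10.0, 4001)
dx = grid[1] - grid[0]

def div_f_partial(theta):
    """Div_f(theta) = int p(x) f(theta * p(x|y=1) / p(x)) dx with f(t) = (t-1)^2 / 2."""
    t = theta * p_pos(grid) / p_mix(grid)
    return np.sum(p_mix(grid) * 0.5 * (t - 1.0) ** 2) * dx

thetas = np.linspace(0.0, 1.0, 201)
theta_hat = thetas[np.argmin([div_f_partial(th) for th in thetas])]
print(theta_hat)   # expected to be larger than pi_true = 0.2 here, since the classes overlap
```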

Table 1: Common f-divergences. $f^*(z)$ is the conjugate of $f(t)$, and $\tilde{f}^*(z)$ is the conjugate of the penalized function $\tilde{f}(t) = f(t)$ for $0 \le t \le 1$ and $\infty$ otherwise.

Kullback-Leibler divergence (Kullback and Leibler, 1951): $f(t) = -\log(t)$; conjugate $f^*(z) = -1 - \log(-z)$; penalized conjugate $\tilde{f}^*(z) = -1 - \log(-z)$ for $z \le -1$, and $z$ for $z > -1$.

Pearson divergence (Pearson, 1900): $f(t) = \frac{1}{2}(t-1)^2$; conjugate $f^*(z) = \frac{1}{2}z^2 + z$; penalized conjugate $\tilde{f}^*(z) = -\frac{1}{2}$ for $z < -1$, $\frac{1}{2}z^2 + z$ for $-1 \le z \le 0$, and $z$ for $z > 0$.

$L_1$-distance: $f(t) = |t-1|$; conjugate $f^*(z) = z$ for $|z| \le 1$ and $\infty$ otherwise; penalized conjugate $\tilde{f}^*(z) = \max(z, -1)$.

2. Class-prior estimation via penalized f-divergences

First, we investigate the behavior of the partial matching method (2), which can be regarded as an extension of the existing analysis for the Pearson divergence (du Plessis and Sugiyama, 2014) to more general divergences. We show that naively using general divergences may result in an overestimate of the class prior, and show how this situation can be avoided by penalization.

2.1. Over-estimation of the class prior

For f-divergences, we focus on $f(t)$ whose minimum is attained at $t \ge 1$. We also assume that $f$ is differentiable and that its derivative satisfies $\partial f(t) < 0$ when $t < 1$ and $\partial f(t) \le 0$ when $t = 1$. This condition is satisfied for divergences such as the Kullback-Leibler divergence and the Pearson divergence.

Because of the divergence matching formulation, we expect that the objective function (2) is minimized at $\theta = \pi$. That is, based on the first-order optimality condition, we expect that the derivative of $\mathrm{Div}_f(\theta)$ with respect to $\theta$, given by
$$\partial \mathrm{Div}_f(\theta) = \int \partial f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) p(x\mid y=1)\, \mathrm{d}x,$$
satisfies $\partial \mathrm{Div}_f(\pi) = 0$. Since
$$\frac{\pi\, p(x\mid y=1)}{p(x)} = p(y=1\mid x),$$
we have
$$\partial \mathrm{Div}_f(\pi) = \int \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x.$$
The domain of the above integral, where $p(x\mid y=1) > 0$, can be expressed as
$$D_1 = \{x : p(y=1\mid x) = 1,\ p(x\mid y=1) > 0\}, \qquad D_2 = \{x : p(y=1\mid x) < 1,\ p(x\mid y=1) > 0\}.$$
The derivative is then expressed as
$$\partial \mathrm{Div}_f(\pi) = \int_{D_1} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x + \int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x. \qquad (3)$$
The posterior is $p(y=1\mid x) = \pi p(x\mid y=1)/p(x)$, where $p(x) = \pi p(x\mid y=1) + (1-\pi)p(x\mid y=-1)$. $D_1$ is the part of the domain where the two classes do not overlap, because $p(y=1\mid x) = 1$ implies that $\pi p(x\mid y=1) = p(x)$ and $(1-\pi)p(x\mid y=-1) = 0$. Conversely, $D_2$ is the part of the domain where the classes overlap, because $p(y=1\mid x) < 1$ implies that $\pi p(x\mid y=1) < p(x)$ and $(1-\pi)p(x\mid y=-1) > 0$.

Since the first term in (3) is non-positive, the derivative can be zero only if $D_2$ is empty (i.e., there is no class overlap). If $D_2$ is not empty (i.e., there is class overlap), the derivative will be negative. Since the objective function $\mathrm{Div}_f(\theta)$ is convex, the derivative $\partial \mathrm{Div}_f(\theta)$ is a monotone non-decreasing function. Therefore, if the function $\mathrm{Div}_f(\theta)$ has a minimizer, it will be larger than the true class prior $\pi$.
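For instance, for the Pearson divergence $f(t) = \frac{1}{2}(t-1)^2$ we have $\partial f(t) = t - 1$, so
$$\partial \mathrm{Div}_f(\pi) = \int \bigl(p(y=1\mid x) - 1\bigr)\, p(x\mid y=1)\, \mathrm{d}x = -\int_{D_2} \bigl(1 - p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x,$$
which is strictly negative whenever $D_2$ is non-empty, so the minimizer of $\mathrm{Div}_f(\theta)$ indeed lies above $\pi$ in this case.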

2.2. Partial distribution matching via penalized f-divergences

In this section, we consider the function
$$f(t) = \begin{cases} -(t-1) & t < 1,\\ c\,(t-1) & t \ge 1. \end{cases}$$
This function coincides with the $L_1$-distance when $c = 1$. The analysis here is slightly more involved, since the subderivative should be taken at $t = 1$. This gives the following:
$$\partial \mathrm{Div}_f(\pi) = \int_{D_1} \partial f(1)\, p(x\mid y=1)\, \mathrm{d}x + \int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x, \qquad (4)$$
where the subderivative at $t = 1$ is $\partial f(1) = [-1, c]$ and the derivative for $t < 1$ is $\partial f(t) = -1$. We can therefore write the subderivative of the first term in (4) as
$$\int_{D_1} \partial f(1)\, p(x\mid y=1)\, \mathrm{d}x = \left[ -\int_{D_1} p(x\mid y=1)\, \mathrm{d}x,\ \ c \int_{D_1} p(x\mid y=1)\, \mathrm{d}x \right].$$
The derivative for the second term in (4) is
$$\int_{D_2} \partial f\bigl(p(y=1\mid x)\bigr)\, p(x\mid y=1)\, \mathrm{d}x = -\int_{D_2} p(x\mid y=1)\, \mathrm{d}x.$$
To achieve a minimum at $\pi$, we should have
$$0 \in \left[ -\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x,\ \ c\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x \right].$$
However, depending on $D_1$ and $D_2$, we may have
$$c\int_{D_1} p(x\mid y=1)\, \mathrm{d}x - \int_{D_2} p(x\mid y=1)\, \mathrm{d}x < 0,$$
which means that $0 \notin \partial \mathrm{Div}_f(\pi)$. The solution is to take $c = \infty$ to ensure that $0 \in \partial \mathrm{Div}_f(\pi)$ is always satisfied.

Using the same reasoning as above, we can also rectify the overestimation problem for other divergences specified by $f(t)$, by replacing $f(t)$ with a penalized function $\tilde{f}(t)$:
$$\tilde{f}(t) = \begin{cases} f(t) & 0 \le t \le 1,\\ \infty & \text{otherwise}. \end{cases}$$
In the next section, estimation of penalized f-divergences is discussed.
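The penalized conjugates that result from this construction are listed in Table 1; the short sketch below (an illustrative implementation, not from the paper) writes them out as functions, which can be plugged into the estimation procedure of the next section.

```python
# Sketch: penalized conjugates from Table 1, i.e., conjugates of
# f~(t) = f(t) on [0, 1] and +infinity otherwise.
import numpy as np

def conj_pen_kl(z):
    """Penalized conjugate of the KL function f(t) = -log(t)."""
    zc = np.minimum(z, -1.0)                      # clipped copy so the log argument stays positive
    return np.where(z <= -1.0, -1.0 - np.log(-zc), z)

def conj_pen_pearson(z):
    """Penalized conjugate of the Pearson function f(t) = (t-1)^2 / 2."""
    return np.where(z < -1.0, -0.5, np.where(z <= 0.0, 0.5 * z**2 + z, z))

def conj_pen_l1(z):
    """Penalized conjugate of the L1 function f(t) = |t-1|: simply max(z, -1)."""
    return np.maximum(z, -1.0)
```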

2.3. Direct evaluation of penalized f-divergences

Here, we show how distribution matching can be performed without density estimation under penalized f-divergences. We use the Fenchel duality bounding technique for f-divergences (Keziou, 2003), which is based on Fenchel's inequality:
$$f(t) \ge tz - f^*(z), \qquad (5)$$
where $f^*(z)$ is the Fenchel dual or convex conjugate, defined as
$$f^*(z) = \sup_{t'}\ \bigl[ t'z - f(t') \bigr].$$
Applying the bound (5) in a pointwise manner, we obtain
$$f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \ge \frac{\theta\, p(x\mid y=1)}{p(x)}\, r(x) - f^*\bigl(r(x)\bigr),$$
where $r(x)$ fulfills the role of $z$ in (5). Multiplying both sides with $p(x)$ gives
$$p(x)\, f\!\left(\frac{\theta\, p(x\mid y=1)}{p(x)}\right) \ge \theta\, r(x)\, p(x\mid y=1) - p(x)\, f^*\bigl(r(x)\bigr). \qquad (6)$$
Integrating and then selecting the tightest bound gives
$$\mathrm{Div}_f(\theta) \ge \sup_{r} \left[ \theta \int r(x)\, p(x\mid y=1)\, \mathrm{d}x - \int f^*\bigl(r(x)\bigr)\, p(x)\, \mathrm{d}x \right]. \qquad (7)$$
Replacing expectations with sample averages gives
$$\widehat{\mathrm{Div}}_f(\theta) \ge \sup_{r} \left[ \frac{\theta}{n}\sum_{i=1}^{n} r(x_i) - \frac{1}{n'}\sum_{j=1}^{n'} f^*\bigl(r(x'_j)\bigr) \right], \qquad (8)$$
where $\widehat{\mathrm{Div}}_f(\theta)$ denotes $\mathrm{Div}_f(\theta)$ estimated from sample averages. Note that the conjugate $f^*(z)$ of any function $f(t)$ is convex. Therefore, if $r(x)$ is linear in parameters, the above maximization problem is convex and thus can be easily solved. The conjugates for selected penalized f-divergences are given in Table 1.
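As a small sketch (an illustration, not the paper's code), the empirical lower bound in (8) can be written directly as a function of a candidate $r$ and a (penalized) conjugate $f^*$:

```python
# Sketch: the empirical Fenchel lower bound (8),
#   theta/n * sum_i r(x_i) - 1/n' * sum_j f*(r(x'_j)),
# which is to be maximized over r (e.g., over a linear-in-parameter model).
import numpy as np

def empirical_lower_bound(theta, r, f_conj, X_pos, X_unl):
    return theta * np.mean(r(X_pos)) - np.mean(f_conj(r(X_unl)))

# Sanity check: with the penalized L1 conjugate max(z, -1) and the constant
# function r(x) = -1, the bound evaluates to -theta + 1.
```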

2.4. Penalized L1-distance estimation

Here we focus on the penalized $L_1$-distance, $f(t) = |t-1|$, as a specific example of penalized f-divergences, and show that an analytic and computationally efficient solution can be obtained. The conjugate for the penalized $L_1$-distance is
$$\tilde{f}^*(z) = \max(z, -1).$$
Then we can see that the lower bound in (7) will be met with equality when
$$r(x) = \begin{cases} -1 & \text{if } \theta\, p(x\mid y=1) < p(x),\\ \ge -1 & \text{otherwise}. \end{cases} \qquad (9)$$
For this reason, we use the following linear model as $r(x)$:
$$r(x) = \sum_{l=1}^{b} \alpha_l\, \varphi_l(x) - 1, \qquad (10)$$
where $\{\varphi_l(x)\}_{l=1}^{b}$ is a set of non-negative basis functions.¹ Then, maximizing the empirical estimate in the right-hand side of (8) under this model, with a regularizer added, can be expressed as
$$(\widehat{\alpha}_1, \ldots, \widehat{\alpha}_b) = \arg\min_{(\alpha_1,\ldots,\alpha_b)} \left[ \frac{1}{n'}\sum_{j=1}^{n'} \max\!\left( \sum_{l=1}^{b}\alpha_l\varphi_l(x'_j) - 1,\ -1 \right) - \frac{\theta}{n}\sum_{i=1}^{n}\sum_{l=1}^{b}\alpha_l\varphi_l(x_i) + \theta + \frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2 \right], \qquad (11)$$
where the regularization term $\frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2$ for $\lambda > 0$ is included to avoid overfitting.

The optimal value of the lower bound (7) occurs when $r(x) \ge -1$ (see (9)). This fact can be incorporated by constraining the parameters of the model (10) so that $\alpha_l \ge 0$, $l = 1,\ldots,b$. If the basis functions $\varphi_l(x)$, $l = 1,\ldots,b$, are non-negative, the first argument of the max in (11) is then always at least $-1$, and the max operation becomes superfluous. This allows us to obtain the parameter vector $(\widehat{\alpha}_1,\ldots,\widehat{\alpha}_b)$ as the solution to the following constrained optimization problem:
$$(\widehat{\alpha}_1,\ldots,\widehat{\alpha}_b) = \arg\min_{(\alpha_1,\ldots,\alpha_b)} \sum_{l=1}^{b}\left( \frac{\lambda}{2}\alpha_l^2 - \alpha_l\beta_l \right) \quad \text{s.t. } \alpha_l \ge 0,\ l = 1,\ldots,b,$$
where
$$\beta_l = \frac{\theta}{n}\sum_{i=1}^{n}\varphi_l(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\varphi_l(x'_j).$$
The above optimization problem decouples over the $\alpha_l$ values, and each can be solved separately as
$$\widehat{\alpha}_l = \frac{1}{\lambda}\max(0, \beta_l).$$

Since $\widehat{\alpha}$ can be calculated with just a max operation, the above solution is extremely fast to compute. All hyper-parameters, including the Gaussian width $\sigma$ and the regularization parameter $\lambda$, are selected for each $\theta$ via straightforward cross-validation. Finally, our estimate of the penalized $L_1$-distance (i.e., the maximum of the empirical estimate in the right-hand side of (8)) is obtained as
$$\widehat{\mathrm{penL}_1}(\theta) = \frac{1}{\lambda}\sum_{l=1}^{b}\max(0, \beta_l)\,\beta_l - \theta + 1.$$
The class prior is then selected so as to minimize the above estimator.

1. In practice, we use Gaussian kernels centered at all sample points as the basis functions: $\varphi_l(x) = \exp\bigl(-\|x - c_l\|^2/(2\sigma^2)\bigr)$, where $\sigma > 0$ and $(c_1,\ldots,c_n, c_{n+1},\ldots,c_{n+n'}) = (x_1,\ldots,x_n, x'_1,\ldots,x'_{n'})$.
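The whole procedure of this subsection can be summarized in a few lines; the sketch below (a schematic implementation under the stated choices, with the bandwidth $\sigma$, the regularization parameter $\lambda$, and the candidate grid of $\theta$ fixed rather than cross-validated) computes $\beta_l$, the analytic solution $\widehat{\alpha}_l = \max(0,\beta_l)/\lambda$, the estimate $\widehat{\mathrm{penL}_1}(\theta)$, and the class-prior estimate as its minimizer over $\theta$:

```python
# Sketch of the analytic pen-L1 estimator with Gaussian basis functions centred
# at all samples. sigma, lam and the theta grid are fixed here for simplicity;
# the paper selects the hyper-parameters by cross-validation.
import numpy as np

def gaussian_basis(X, centers, sigma):
    """phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for 1-D inputs; returns an (N, b) matrix."""
    d2 = (np.asarray(X)[:, None] - np.asarray(centers)[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pen_l1(theta, X_pos, X_unl, sigma=1.0, lam=0.1):
    centers = np.concatenate([X_pos, X_unl])                 # b = n + n' basis functions
    Phi_pos = gaussian_basis(X_pos, centers, sigma)          # (n, b)
    Phi_unl = gaussian_basis(X_unl, centers, sigma)          # (n', b)
    beta = theta * Phi_pos.mean(axis=0) - Phi_unl.mean(axis=0)
    return np.sum(np.maximum(0.0, beta) * beta) / lam - theta + 1.0

def estimate_prior(X_pos, X_unl, thetas=np.linspace(0.01, 0.99, 99), **kwargs):
    values = [pen_l1(th, X_pos, X_unl, **kwargs) for th in thetas]
    return float(thetas[int(np.argmin(values))])             # theta minimizing penL1(theta)
```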

3. Stability analysis

Regarding the estimation stability of $\widehat{\mathrm{penL}_1}(\theta)$ for fixed $\theta$, we have the following deviation bound. Without loss of generality, we assume that the basis functions are upper bounded by one, i.e., $\varphi_l(x) \le 1$ for all $x$ and $l = 1,\ldots,b$.

Theorem 1 (Deviation bound) Fix $\theta$. Then, for any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D} = \mathcal{X} \cup \mathcal{X}'$ for estimating $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$, we have
$$\left| \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \right| \le \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof We prove the theorem based on a technique known as the method of bounded differences. Let
$$f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'}) = \max(0, \beta_l)\,\beta_l,$$
so that
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) = \frac{1}{\lambda}\sum_{l=1}^{b} f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'}) - \theta + 1.$$
Next, we replace $x_i$ with $\tilde{x}_i$ and bound the difference between $f_l(x_1,\ldots,x_i,\ldots,x_n, x'_1,\ldots,x'_{n'})$ and $f_l(x_1,\ldots,\tilde{x}_i,\ldots,x_n, x'_1,\ldots,x'_{n'})$. Let $t = \varphi_l(x_i)$, $\tilde{t} = \varphi_l(\tilde{x}_i)$, and $\xi_l = \beta_l - \frac{\theta}{n}t$. Since the basis functions are bounded, we know that $t, \tilde{t} \in [0,1]$, as well as $|\xi_l| \le 1$. Then the maximum difference is
$$c_l = \sup_{t, \tilde{t}} \left| \max\Bigl(0,\ \frac{\theta}{n}t + \xi_l\Bigr)\Bigl(\frac{\theta}{n}t + \xi_l\Bigr) - \max\Bigl(0,\ \frac{\theta}{n}\tilde{t} + \xi_l\Bigr)\Bigl(\frac{\theta}{n}\tilde{t} + \xi_l\Bigr) \right|.$$
By analyzing the above for the different cases in which the constraints are active, the maximum difference for replacing $x_i$ with $\tilde{x}_i$ is $\frac{1}{n}(2 + \frac{1}{n})$. We can use the same argument for replacing $x'_j$ with $\tilde{x}'_j$ in $f_l(x_1,\ldots,x_n, x'_1,\ldots,x'_{n'})$, resulting in a maximum difference of $\frac{1}{n'}(2 + \frac{1}{n'})$. Note that this holds for all $f_l(\cdot)$ simultaneously, and thus the change of $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$ is no more than $\frac{b}{\lambda n}(2 + \frac{1}{n})$ if $x_i$ is replaced with $\tilde{x}_i$, or $\frac{b}{\lambda n'}(2 + \frac{1}{n'})$ if $x'_j$ is replaced with $\tilde{x}'_j$. We can therefore apply McDiarmid's inequality to obtain, with probability at least $1 - \delta/2$,
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \le \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
Applying McDiarmid's inequality again for $\mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D})$ proves the result.
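For reference, McDiarmid's inequality in the form used above states that if replacing the $k$-th sample point changes the value of a function $F$ of the sample by at most $c_k$, then
$$\Pr\bigl(F - \mathbb{E}[F] \ge \varepsilon\bigr) \le \exp\!\left(-\frac{2\varepsilon^2}{\sum_k c_k^2}\right);$$
setting $c_k = \frac{b}{\lambda n}\bigl(2 + \frac{1}{n}\bigr)$ for the $n$ positive samples and $c_k = \frac{b}{\lambda n'}\bigl(2 + \frac{1}{n'}\bigr)$ for the $n'$ unlabeled samples, and solving $\exp(-2\varepsilon^2/\sum_k c_k^2) = \delta/2$ for $\varepsilon$, gives the one-sided bound used in the proof.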

Theorem 1 shows that the deviation of our estimate from its expectation is small with high probability. Nevertheless, $\theta$ must be fixed before seeing the data, so we cannot use the estimate to choose $\theta$. This motivates us to derive a uniform deviation bound. Let us define the constants
$$C_x = \sup_x \Bigl( \sum_{l=1}^{b} \varphi_l^2(x) \Bigr)^{1/2} \le \sqrt{b}, \qquad C_\alpha = \sup_{\theta}\ \sup_{\mathcal{X} \sim p^n(x\mid y=1),\ \mathcal{X}' \sim p^{n'}(x)} \Bigl( \sum_{l=1}^{b} \widehat{\alpha}_l^2(\theta, \mathcal{D}) \Bigr)^{1/2},$$
where we write $\widehat{\alpha}_l(\theta, \mathcal{D})$ to emphasize that $\widehat{\alpha}_l$ depends on $\theta$ and $\mathcal{D}$.

Theorem 2 (Uniform deviation bound) For any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D}$, the following holds for all $\theta$:
$$\left| \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \right| \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof We denote $g(\theta;\mathcal{D}) = \sum_{l=1}^{b}\widehat{\alpha}_l\beta_l$ and $\bar{g}(\theta) = \mathbb{E}_{\mathcal{D}}[g(\theta;\mathcal{D})]$, so that
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) = g(\theta;\mathcal{D}) - \theta + 1, \qquad \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = \bar{g}(\theta) - \theta + 1,$$
and
$$\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = g(\theta;\mathcal{D}) - \bar{g}(\theta).$$

Step 1: We first consider one direction. By definition, for all $\theta$,
$$g(\theta;\mathcal{D}) - \bar{g}(\theta) \le \sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\}.$$
We cannot simply apply Theorem 1 to bound the right-hand side, since $\theta$ in the right-hand side is not fixed. Nevertheless, according to the proof of Theorem 1, if we replace a single point $x_i$ or $x'_j$ in $\mathcal{D}$, the change of $\sup_{\theta}\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\}$ is also bounded by $\frac{b}{\lambda n}(2 + \frac{1}{n})$ or $\frac{b}{\lambda n'}(2 + \frac{1}{n'})$, respectively. We then have, with probability at least $1 - \delta/2$,
$$\sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\} \le \mathbb{E}_{\mathcal{D}}\Bigl[ \sup_{\theta}\bigl\{ g(\theta;\mathcal{D}) - \bar{g}(\theta) \bigr\} \Bigr] + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Step 2: Next we bound $\mathbb{E}_{\mathcal{D}}[\sup_\theta\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\}]$ based on a technique known as symmetrization. Note that the function $g(\theta;\mathcal{D}) = \sum_{l=1}^{b}\widehat{\alpha}_l\beta_l$ can be rewritten in a point-wise manner rather than a basis-wise manner:
$$g(\theta;\mathcal{D}) = \sum_{l=1}^{b} \widehat{\alpha}_l \left( \frac{\theta}{n}\sum_{i=1}^{n}\varphi_l(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\varphi_l(x'_j) \right) = \frac{1}{n}\sum_{i=1}^{n}\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\omega'(x'_j),$$
where for simplicity we define
$$\omega(x) = \theta\sum_{l=1}^{b}\widehat{\alpha}_l\varphi_l(x), \qquad \omega'(x) = \sum_{l=1}^{b}\widehat{\alpha}_l\varphi_l(x).$$
Let $\tilde{\mathcal{D}} = \{\tilde{x}_1,\ldots,\tilde{x}_n, \tilde{x}'_1,\ldots,\tilde{x}'_{n'}\}$ be a ghost sample. Then
$$\mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - \bar{g}(\theta)\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - \mathbb{E}_{\tilde{\mathcal{D}}}[g(\theta;\tilde{\mathcal{D}})]\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\sup_\theta\ \mathbb{E}_{\tilde{\mathcal{D}}}\bigl[g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr]\Bigr] \le \mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr\}\Bigr],$$
where we apply Jensen's inequality together with the fact that the supremum is a convex function. Moreover, let $\sigma = \{\sigma_1,\ldots,\sigma_n, \sigma'_1,\ldots,\sigma'_{n'}\}$ be a set of Rademacher variables of size $n + n'$. Then
$$\mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\Bigl[\sup_\theta\bigl\{g(\theta;\mathcal{D}) - g(\theta;\tilde{\mathcal{D}})\bigr\}\Bigr] = \mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr) - \frac{1}{n'}\sum_{j=1}^{n'}\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr) \right\}\right]$$
$$= \mathbb{E}_{\sigma,\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr) \right\}\right],$$
since the original and ghost samples are symmetric, and each $\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr)$ shares the same distribution with $\sigma_i\bigl(\omega(x_i) - \omega(\tilde{x}_i)\bigr)$ and each $\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr)$ shares the same distribution with $\sigma'_j\bigl(\omega'(x'_j) - \omega'(\tilde{x}'_j)\bigr)$.

Subsequently,
$$\mathbb{E}_{\sigma,\mathcal{D},\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) + \frac{1}{n}\sum_{i=1}^{n}(-\sigma_i)\omega(\tilde{x}_i) - \frac{1}{n'}\sum_{j=1}^{n'}(-\sigma'_j)\omega'(\tilde{x}'_j) \right\}\right]$$
$$\le \mathbb{E}_{\sigma,\mathcal{D}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] + \mathbb{E}_{\sigma,\tilde{\mathcal{D}}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}(-\sigma_i)\omega(\tilde{x}_i) - \frac{1}{n'}\sum_{j=1}^{n'}(-\sigma'_j)\omega'(\tilde{x}'_j) \right\}\right]$$
$$= 2\,\mathbb{E}_{\sigma,\mathcal{D}}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right],$$
where we first apply the triangle inequality, and then make use of the fact that the original and ghost samples have the same distribution and all Rademacher variables have the same distribution.

Step 3: The Rademacher complexity still remains to be bounded. To this end, we decompose it into two terms:
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] = \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i) - \frac{1}{n'}\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j) \right\}\right]$$
$$\le \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] + \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n'}\left|\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j)\right|\right].$$
Applying the Cauchy-Schwarz inequality followed by Jensen's inequality to the first Rademacher average gives
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] \le \frac{C_\alpha}{n}\,\mathbb{E}_{\mathcal{D},\sigma}\left[\left( \sum_{l=1}^{b}\Bigl(\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\Bigr)^2 \right)^{1/2}\right] \le \frac{C_\alpha}{n}\left( \mathbb{E}_{\mathcal{D},\sigma}\left[ \sum_{l=1}^{b}\sum_{i,i'=1}^{n}\sigma_i\sigma_{i'}\varphi_l(x_i)\varphi_l(x_{i'}) \right] \right)^{1/2}.$$
Since $\sigma_1,\ldots,\sigma_n$ are Rademacher variables,
$$\mathbb{E}_{\mathcal{D},\sigma}\left[ \sum_{l=1}^{b}\sum_{i,i'=1}^{n}\sigma_i\sigma_{i'}\varphi_l(x_i)\varphi_l(x_{i'}) \right] = \mathbb{E}_{\mathcal{D}}\left[ \sum_{l=1}^{b}\sum_{i=1}^{n}\varphi_l^2(x_i) \right] \le n\,C_x^2.$$

Consequently, we have
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n}\left|\sum_{l=1}^{b}(\theta\widehat{\alpha}_l)\sum_{i=1}^{n}\sigma_i\varphi_l(x_i)\right|\right] \le \frac{C_\alpha C_x}{\sqrt{n}}, \qquad \mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\ \frac{1}{n'}\left|\sum_{l=1}^{b}\widehat{\alpha}_l\sum_{j=1}^{n'}\sigma'_j\varphi_l(x'_j)\right|\right] \le \frac{C_\alpha C_x}{\sqrt{n'}},$$
and therefore
$$\mathbb{E}_{\mathcal{D},\sigma}\left[\sup_\theta\left\{ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\omega(x_i) - \frac{1}{n'}\sum_{j=1}^{n'}\sigma'_j\omega'(x'_j) \right\}\right] \le \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{n'}} \right) C_\alpha C_x.$$

Step 4: Combining the three steps together, we obtain that with probability at least $1-\delta/2$, for all $\theta$,
$$g(\theta;\mathcal{D}) - \bar{g}(\theta) \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
The same argument can be used to bound $\bar{g}(\theta) - g(\theta;\mathcal{D})$. Combining these two tail inequalities proves the theorem.

Theorem 2 shows that the uniform deviation bound is of order $O(1/\sqrt{n} + 1/\sqrt{n'})$, whereas the deviation bound for fixed $\theta$ is of order $O(\sqrt{1/n + 1/n'})$, as shown in Theorem 1. For this special estimation problem, the convergence rate of the uniform deviation bound is clearly worse than the convergence rate of the deviation bound for fixed $\theta$. However, after obtaining the uniform deviation bound, we are able to bound the estimation error, that is, the gap between the expectation of our estimate and the best possible estimate within the model.

To do so, we need to constrain the parameters of the best estimate via $C_\alpha$, since (10) takes (9) as the target, while (9) is an unbounded function. Furthermore, we assume that the regularization is weak enough so that it does not affect the solution $\widehat{\alpha}$ too much. Specifically, given $\mathcal{D}$, let $\tilde{\alpha}$ be the minimizer of the objective function in (11) but without the regularization term $\frac{\lambda}{2}\sum_{l=1}^{b}\alpha_l^2$, subject to $\|\alpha\|_2 \le C_\alpha$. Let $\widetilde{\mathrm{penL}_1}(\theta;\mathcal{D})$ be the estimator of the penalized $L_1$-distance corresponding to $\tilde{\alpha}$, and assume that there exists $\kappa_\alpha > 0$ such that, for all $\theta$ and $\mathcal{D}$,
$$\bigl| \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) \bigr| \le \kappa_\alpha.$$
Then we have the following theorem.

Theorem 3 (Estimation error bound) Let $\mathrm{penL}_1^*(\theta)$ be the value of the lower bound in the right-hand side of (7) based on the model (10) with the best possible parameter $\alpha^*$, where $\|\alpha^*\|_2 \le C_\alpha$. For any $0 < \delta < 1$, with probability at least $1-\delta$ over the repeated sampling of $\mathcal{D}$, the following holds for all $\theta$:
$$\mathrm{penL}_1^*(\theta) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \le \kappa_\alpha + \left( \frac{4}{\sqrt{n}} + \frac{4}{\sqrt{n'}} \right) C_\alpha C_x + 2\sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$

Proof Since $\alpha^*$ is fixed, we have $\mathbb{E}_{\mathcal{D}}[\mathrm{penL}_1^*(\theta;\mathcal{D})] = \mathrm{penL}_1^*(\theta)$. Then,
$$\mathrm{penL}_1^*(\theta) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] = \mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr]$$
$$= \Bigl( \mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathrm{penL}_1^*(\theta;\mathcal{D}) \Bigr) + \Bigl( \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr] \Bigr) + \Bigl( \mathrm{penL}_1^*(\theta;\mathcal{D}) - \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) \Bigr) + \Bigl( \widetilde{\mathrm{penL}_1}(\theta;\mathcal{D}) - \widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) \Bigr).$$
We bound each of the four terms separately. According to the proof of Theorem 2, with probability at least $1-\delta/2$, for all $\theta$,
$$\mathbb{E}_{\mathcal{D}}\bigl[\mathrm{penL}_1^*(\theta;\mathcal{D})\bigr] - \mathrm{penL}_1^*(\theta;\mathcal{D}) \le \left( \frac{2}{\sqrt{n}} + \frac{2}{\sqrt{n'}} \right) C_\alpha C_x + \sqrt{ \frac{\ln(2/\delta)}{2\lambda^2/b^2} \left( \frac{1}{n}\Bigl(2 + \frac{1}{n}\Bigr)^2 + \frac{1}{n'}\Bigl(2 + \frac{1}{n'}\Bigr)^2 \right) }.$$
The same can be proven for $\widehat{\mathrm{penL}_1}(\theta;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\bigl[\widehat{\mathrm{penL}_1}(\theta;\mathcal{D})\bigr]$. The third term must be non-positive, since $\widetilde{\mathrm{penL}_1}(\theta;\mathcal{D})$ is the maximizer of the empirical estimate over $\|\alpha\|_2 \le C_\alpha$. Finally, the fourth term is upper bounded by $\kappa_\alpha$, which completes the proof.

Theorem 3 shows that the gap between the expectation of our estimate and the optimal value within the model is small with high probability.

4. Related work

In Scott and Blanchard (2009) and Blanchard et al. (2010), it was proposed to reduce the problem of estimating the class prior to Neyman-Pearson classification.² A Neyman-Pearson classifier $f$ minimizes the false-negative rate $R_{-1}(f)$ while keeping the false-positive rate $R_{1}(f)$ constrained under a user-specified threshold $\alpha$ (Scott and Nowak, 2005):
$$R_{1}(f) = P_{1}\bigl(f(X) = -1\bigr), \qquad R_{-1}(f) = P_{-1}\bigl(f(X) = 1\bigr),$$
where $P_{1}$ and $P_{-1}$ denote the probabilities for the positive-class and negative-class conditional densities, respectively.

2. The papers (Scott and Blanchard, 2009; Blanchard et al., 2010) considered the nominal class as $y = 0$ and the novel class as $y = 1$, and aimed to estimate $p(y = 1)$. We use a different notation, with the nominal class as $y = 1$ and the novel class as $y = -1$, and estimate $\pi = p(y = 1)$. To simplify the exposition, we use the same notation here as in the rest of the paper.

The false-negative rate on the unlabeled dataset is defined and expressed as
$$R_X(f) = P_X\bigl(f(X) = 1\bigr) = \pi\bigl(1 - R_{1}(f)\bigr) + (1-\pi)R_{-1}(f),$$
where $P_X$ denotes the probability for the unlabeled input data density. The Neyman-Pearson classifier between $P_{1}$ and $P_X$ is defined via
$$R_{X,\alpha} = \inf_f R_X(f) \quad \text{s.t.} \quad R_{1}(f) \le \alpha.$$
Then the minimum false-negative rate on the unlabeled dataset for a given false-positive rate $\alpha$ is expressed as
$$R_{X,\alpha} = \pi(1-\alpha) + (1-\pi)R_{-1,\alpha}. \qquad (12)$$
Theorem 1 in Scott and Blanchard (2009) says that if the supports of $P_{1}$ and $P_{-1}$ are different, there exists $\alpha$ such that $R_{-1,\alpha} = 0$. Therefore, the class prior can be determined as
$$\pi = -\left.\frac{\mathrm{d}R_{X,\alpha}}{\mathrm{d}\alpha}\right|_{\alpha \to 1^-}, \qquad (13)$$
where $\alpha \to 1^-$ denotes the limit from the left-hand side. Note that this limit is necessary since the first term in (12) will be zero when $\alpha = 1$.

However, estimating the derivative as $\alpha \to 1$ is not straightforward in practice. The curve of $R_{X,\alpha}$ versus $\alpha$ can be interpreted as an ROC curve (with a suitable change in class notation), but the empirical ROC curve is often unstable at the right endpoint when the input dimensionality is high (Sanderson and Scott, 2014). One approach to overcome this problem is to fit a curve to the right endpoint of the ROC curve in order to enable the estimation (as in Sanderson and Scott (2014)). However, it is not clear how the estimated class prior is affected by this curve-fitting.
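As a rough sketch of this ROC-based strategy (this is an illustration only, not the implementation of Blanchard et al. (2010), which uses a thresholded ratio of kernel density estimates; the classifier score, the endpoint fraction, and the simple least-squares line fit are assumptions):

```python
# Sketch: estimate the class prior from the slope of the empirical R_{X,alpha} curve
# near alpha = 1, cf. (13). 'score_pos' and 'score_unl' are classifier scores for the
# positive and unlabeled samples (larger = more positive-looking).
import numpy as np

def prior_from_roc_slope(score_pos, score_unl, endpoint_fraction=0.1):
    thresholds = np.sort(np.concatenate([score_pos, score_unl]))
    alpha = np.array([np.mean(score_pos < t) for t in thresholds])   # empirical R_1(f_t)
    r_x = np.array([np.mean(score_unl >= t) for t in thresholds])    # empirical R_X(f_t)
    right = alpha >= 1.0 - endpoint_fraction                         # points near the right endpoint
    slope = np.polyfit(alpha[right], r_x[right], 1)[0]               # fit R_X ~ a * alpha + b
    return float(np.clip(-slope, 0.0, 1.0))                          # pi = -dR_X/dalpha at alpha -> 1
```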

5. Experiments

In this section, we experimentally compare the performance of the proposed method and alternative methods for estimating the class prior. We compared the following methods:

EN: The method of Elkan and Noto (2008), with the classifier implemented as a squared-loss variant of the logistic regression classifier (Sugiyama, 2010).

PE: The direct Pearson-divergence matching method proposed in du Plessis and Sugiyama (2014).

SB: The method of Blanchard et al. (2010). The Neyman-Pearson classifier was implemented as the thresholded ratio of two kernel density estimates, each with a bandwidth parameter. As in Blanchard et al. (2010), the bandwidth parameters were jointly optimized by maximizing the cross-validated estimate of the AUC. The prior was obtained by estimating (13) from the empirical ROC curve.

pen-L1 (proposed): The penalized $L_1$-distance method with an analytic solution. The basis functions were selected as Gaussians centered at all training samples. All hyper-parameters were determined by cross-validation.

First, we illustrate the systematic overestimation of the class prior by two previously proposed methods, EN (Elkan and Noto, 2008) and PE (du Plessis and Sugiyama, 2014), when the classes significantly overlap. The class-conditional densities are
$$p(x\mid y=1) = N_x(0, 1^2) \quad \text{and} \quad p(x\mid y=-1) = N_x(2, 1^2),$$
where $N_x(\mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$ with respect to $x$. The true class prior is set at $\pi = p(y=1) = 0.2$. The sizes of the unlabeled dataset and the labeled dataset were set to be equal.
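Under this toy setup, the accuracy of any class-prior estimator can be summarized by its squared error over repeated draws; the sketch below (illustrative only, reusing the estimate_prior sketch from Section 2.4 and assumed sample sizes) shows one way to do so:

```python
# Sketch: squared error of a class-prior estimator over repeated draws of the
# toy Gaussian PU data. The 'estimator' argument is any function mapping
# (X_pos, X_unl) to an estimate of the class prior, e.g. the estimate_prior
# sketch given earlier.
import numpy as np

def squared_error_over_trials(estimator, pi_true=0.2, n=300, n_prime=300, trials=100, seed=1):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        X_pos = rng.normal(0.0, 1.0, size=n)                         # p(x | y = +1)
        is_pos = rng.random(n_prime) < pi_true
        X_unl = np.where(is_pos, rng.normal(0.0, 1.0, n_prime),
                                 rng.normal(2.0, 1.0, n_prime))      # p(x)
        errors.append((estimator(X_pos, X_unl) - pi_true) ** 2)
    return float(np.mean(errors))

# e.g. squared_error_over_trials(estimate_prior)
```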

Figure 2: Histograms of class-prior estimates when the true class prior is $p(y=1) = 0.2$. The EN method and the PE method have an intrinsic bias that does not decrease even when the number of samples is increased, while the SB method and the proposed pen-L1 method work reasonably well. (Panels: (a) EN, (b) PE, (c) SB, (d) pen-L1.)

The histograms of class-prior estimates are plotted in Figure 2, showing that the EN and PE methods clearly overestimate the true class prior. We also confirmed that this overestimation does not decrease even when the number of samples is increased. On the other hand, the SB and pen-L1 methods work reasonably well.

Finally, we use the MNIST hand-written digit dataset. For each digit, all the other digits were assumed to be in the opposite class (i.e., one-versus-rest). The dataset was reduced to 4 dimensions using principal component analysis. The squared error of class-prior estimation is given in Figure 3, showing that the proposed pen-L1 method overall gives accurate estimates of the class prior, while the EN and PE methods tend to give less accurate estimates for low class priors and more accurate estimates for higher class priors, which agrees with the observation in du Plessis and Sugiyama (2014). On the other hand, the SB method tends to perform poorly, which is caused by the instability of the empirical ROC curve at the right endpoint when the input dimensionality is larger, as pointed out in Section 4.

6. Conclusion

In this paper, we discussed the problem of class-prior estimation from positive and unlabeled data. We first showed that class-prior estimation from positive and unlabeled data by partial distribution matching under f-divergences yields systematic overestimation of the class prior. We then proposed to use penalized f-divergences to rectify this problem. We further showed that the use of the $L_1$-distance as an example of f-divergences yields a computationally efficient algorithm with an analytic solution. We provided its uniform deviation bound and estimation error bound, which theoretically support the usefulness of the proposed method. Finally, through experiments, we demonstrated that the proposed method compares favorably with existing approaches.

Figure 3: Class-prior estimation accuracy for the MNIST dataset (panels (a)–(i): Digit 1 vs. others through Digit 9 vs. others). The EN method and the PE method behave similarly, and the proposed pen-L1 method consistently outperforms them. The SB method performed poorly due to the instability of the empirical ROC curve at the right endpoint.

Acknowledgments

MCdP and GN were supported by the JST CREST program, and MS was supported by KAKENHI.

References

S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.

I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

M. C. du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. In ICML 2012, Jun. 26–Jul. 1, 2012.

M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5):1358–1362, 2014.

M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 703–711, 2014.

C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In ACM SIGKDD 14, pages 213–220, 2008.

A. Keziou. Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10):857–862, 2003.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.

M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.

T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.

C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), Clearwater Beach, Florida, USA, Apr. 2009.

C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005.

M. Sugiyama. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690–2701, 2010.

M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi. Density-difference estimation. In Advances in Neural Information Processing Systems 25, pages 692–700, 2012.


More information

arxiv: v2 [cs.lg] 14 Oct 2016

arxiv: v2 [cs.lg] 14 Oct 2016 Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data Tomoya Sakai Marthinus Christoffel du Plessis Gang Niu Masashi Sugiyama The University of Tokyo The University of

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding APPEARS IN THE IEEE TRANSACTIONS ON INFORMATION THEORY, FEBRUARY 015 1 Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding Igal Sason Abstract Tight bounds for

More information

Adaptive Multi-Modal Sensing of General Concealed Targets

Adaptive Multi-Modal Sensing of General Concealed Targets Adaptive Multi-Modal Sensing of General Concealed argets Lawrence Carin Balaji Krishnapuram, David Williams, Xuejun Liao and Ya Xue Department of Electrical & Computer Engineering Duke University Durham,

More information

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure The Information Bottleneck Revisited or How to Choose a Good Distortion Measure Peter Harremoës Centrum voor Wiskunde en Informatica PO 94079, 1090 GB Amsterdam The Nederlands PHarremoes@cwinl Naftali

More information

Decentralized decision making with spatially distributed data

Decentralized decision making with spatially distributed data Decentralized decision making with spatially distributed data XuanLong Nguyen Department of Statistics University of Michigan Acknowledgement: Michael Jordan, Martin Wainwright, Ram Rajagopal, Pravin Varaiya

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Multi-Positive and Unlabeled Learning

Multi-Positive and Unlabeled Learning Multi-Positive and Unlabeled Learning Yixing Xu, Chang Xu, Chao Xu, Dacheng Tao Key Laboratory of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of Electronics Engineering and

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information