Class-prior Estimation for Learning from Positive and Unlabeled Data
JMLR: Workshop and Conference Proceedings 45:221–236, 2015. ACML 2015

Class-prior Estimation for Learning from Positive and Unlabeled Data

Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama
The University of Tokyo, Tokyo 113-0033, Japan.

Editors: Geoffrey Holmes and Tie-Yan Liu

Abstract
We consider the problem of estimating the class prior in an unlabeled dataset. Under the assumption that an additional labeled dataset is available, the class prior can be estimated by fitting a mixture of class-wise data distributions to the unlabeled data distribution. However, in practice, such an additional labeled dataset is often not available. In this paper, we show that, with additional samples coming only from the positive class, the class prior of the unlabeled dataset can be estimated correctly. Our key idea is to use properly penalized divergences for model fitting to cancel the error caused by the absence of negative samples. We further show that the use of the penalized L1-distance gives a computationally efficient algorithm with an analytic solution, and establish its uniform deviation bound and estimation error bound. Finally, we experimentally demonstrate the usefulness of the proposed method.

Keywords: Class-prior estimation, positive and unlabeled data.

© 2015 M.C. du Plessis, G. Niu & M. Sugiyama.

1. Introduction

Suppose that we have two datasets X and X′, which are i.i.d. samples from probability distributions with densities p(x|y = 1) and p(x), respectively:

X = {x_i}_{i=1}^n  i.i.d. ~  p(x|y = 1),    X′ = {x′_j}_{j=1}^{n′}  i.i.d. ~  p(x).

That is, X is a set of samples from the positive class and X′ is a set of unlabeled samples (consisting of both positive and negative samples). Our goal is to estimate the class prior π = p(y = 1) in the unlabeled dataset X′.

Estimation of the class prior from positive and unlabeled data is of great practical importance, since it allows a classifier to be trained only from these datasets (Scott and Blanchard, 2009; du Plessis et al., 2014), in the absence of negative data.
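To make this setup concrete, the following sketch generates a synthetic positive sample X and unlabeled sample X′ of the above form. All specific numbers in it (the class-conditional densities, the prior, and the sample sizes) are illustrative choices and are not taken from this paper.

```python
# Sketch: generating positive and unlabeled (PU) data of the form
#   X  ~ p(x|y = 1)   and   X' ~ p(x) = pi * p(x|y = 1) + (1 - pi) * p(x|y = -1).
# All concrete numbers (means, prior, sample sizes) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def sample_positive(n):
    # class-conditional density of the positive class: a 1-D Gaussian (illustrative)
    return rng.normal(loc=0.0, scale=1.0, size=(n, 1))

def sample_negative(n):
    # class-conditional density of the negative class (illustrative)
    return rng.normal(loc=-2.0, scale=1.0, size=(n, 1))

def sample_unlabeled(n, prior):
    # each unlabeled point is positive with probability `prior`
    is_pos = rng.random(n) < prior
    return np.where(is_pos[:, None], sample_positive(n), sample_negative(n))

pi_true = 0.2                              # class prior p(y = 1) in the unlabeled data
X_pos = sample_positive(300)               # labeled positive sample X
X_unl = sample_unlabeled(300, pi_true)     # unlabeled sample X'
```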
If a mixture of class-wise input data densities,

q(x; θ) = θ p(x|y = 1) + (1 − θ) p(x|y = −1),

is fitted to the unlabeled input data density p(x), the true class prior π can be obtained (Saerens et al., 2002; du Plessis and Sugiyama, 2012), as illustrated in Figure 1(a).

Figure 1: Class-prior estimation by matching the model q(x; θ) to the unlabeled input data density p(x). (a) Full matching with q(x; θ) = θ p(x|y = 1) + (1 − θ) p(x|y = −1). (b) Partial matching with q(x; θ) = θ p(x|y = 1).

In practice, fitting may be performed under an f-divergence (Ali and Silvey, 1966; Csiszár, 1967):

θ̂ := argmin_θ ∫ p(x) f( q(x; θ) / p(x) ) dx,    (1)

where f(t) is a convex function with f(1) = 0. So far, class-prior estimation methods based on the Kullback-Leibler divergence (Saerens et al., 2002) and the Pearson divergence (du Plessis and Sugiyama, 2012) have been developed (Table 1). Additionally, class-prior estimation has been performed by L2-distance minimization (Sugiyama et al., 2012). However, since these methods require labeled samples from both the positive and negative classes, they cannot be directly employed in the current setup.

To cope with this problem, a partial model, q(x; θ) = θ p(x|y = 1), was used in Elkan and Noto (2008) and du Plessis and Sugiyama (2014) to estimate the class prior in the absence of negative samples (Figure 1(b)):

θ̂ := argmin_θ Div_f(θ),    (2)

where

Div_f(θ) := ∫ p(x) f( θ p(x|y = 1) / p(x) ) dx.

In this paper, we first show that the above partial matching approach consistently overestimates the true class prior. We then show that, by appropriately penalizing f-divergences, the class prior can be correctly obtained. We further show that the use of the penalized L1-distance drastically simplifies the estimation procedure, resulting in an analytic estimator that can be computed efficiently. We also establish a uniform deviation bound and an estimation error bound for the penalized L1-distance estimator. Finally, through experiments, we demonstrate the usefulness of the proposed method in classification from positive and unlabeled data.
Table 1: Common f-divergences. f*(z) is the conjugate of f(t), and f̃*(z) is the conjugate of the penalized function f̃(t) = f(t) for t ≤ 1 and ∞ otherwise.

  Kullback-Leibler divergence (Kullback and Leibler, 1951):
    f(t) = −log(t);   f*(z) = −1 − log(−z);   f̃*(z) = −1 − log(−z) for z ≤ −1, and z for z > −1.

  Pearson divergence (Pearson, 1900):
    f(t) = (1/2)(t − 1)²;   f*(z) = (1/2)z² + z;   f̃*(z) = (1/2)z² + z for z ≤ 0, and z for z > 0.

  L1-distance:
    f(t) = |t − 1|;   f*(z) = z for −1 ≤ z ≤ 1, and ∞ otherwise;   f̃*(z) = max(z, −1).

2. Class-prior estimation via penalized f-divergences

First, we investigate the behavior of the partial matching method (2), which can be regarded as an extension of the existing analysis for the Pearson divergence (du Plessis and Sugiyama, 2014) to more general divergences. We show that naively using general divergences may result in an overestimate of the class prior, and we show how this situation can be avoided by penalization.

2.1. Over-estimation of the class prior

For f-divergences, we focus on f(t) such that its minimum is attained at t = 1. We also assume that it is differentiable and that the derivative of f(t) satisfies f′(t) < 0 when t < 1 and f′(t) = 0 when t = 1. This condition is satisfied for divergences such as the Kullback-Leibler divergence and the Pearson divergence.

Because of the divergence matching formulation, we expect that the objective function (2) is minimized at θ = π. That is, based on the first-order optimality condition, we expect that the derivative of Div_f(θ) with respect to θ, given by

∂Div_f(θ) = ∫ f′( θ p(x|y = 1) / p(x) ) p(x|y = 1) dx,

satisfies ∂Div_f(π) = 0.
Since π p(x|y = 1) / p(x) = p(y = 1|x), we have

∂Div_f(π) = ∫ f′( p(y = 1|x) ) p(x|y = 1) dx.

The domain of the above integral, where p(x|y = 1) > 0, can be expressed as

D1 = {x : p(y = 1|x) = 1 and p(x|y = 1) > 0},
D2 = {x : p(y = 1|x) < 1 and p(x|y = 1) > 0}.

The derivative is then expressed as

∂Div_f(π) = ∫_{D1} f′( p(y = 1|x) ) p(x|y = 1) dx + ∫_{D2} f′( p(y = 1|x) ) p(x|y = 1) dx.    (3)

The posterior is p(y = 1|x) = π p(x|y = 1) / p(x), where p(x) = π p(x|y = 1) + (1 − π) p(x|y = −1). D1 is the part of the domain where the two classes do not overlap, because p(y = 1|x) = 1 implies that π p(x|y = 1) = p(x) and (1 − π) p(x|y = −1) = 0. Conversely, D2 is the part of the domain where the classes overlap, because p(y = 1|x) < 1 implies that π p(x|y = 1) < p(x) and (1 − π) p(x|y = −1) > 0.

Since the first term in (3) is non-positive, the derivative can be zero only if D2 is empty (i.e., there is no class overlap). If D2 is not empty (i.e., there is class overlap), the derivative will be negative. Since the objective function Div_f(θ) is convex, the derivative ∂Div_f(θ) is a monotone non-decreasing function. Therefore, if the function Div_f(θ) has a minimizer, it will be larger than the true class prior π.
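The over-estimation described above can also be checked numerically. The sketch below evaluates Div_f(θ) for the (unpenalized) Pearson divergence on a one-dimensional Gaussian example with overlapping classes and locates its minimizer on a grid; the densities and the prior are illustrative choices rather than the paper's experimental setting.

```python
# Sketch: numerical check of the over-estimation in Section 2.1 for the unpenalized
# Pearson divergence f(t) = (1/2)(t - 1)^2.  The densities and prior are illustrative.
import numpy as np
from scipy.stats import norm

pi_true = 0.2
p_pos = lambda x: norm.pdf(x, loc=0.0, scale=1.0)                  # p(x|y = 1)
p_neg = lambda x: norm.pdf(x, loc=-2.0, scale=1.0)                 # p(x|y = -1)
p_mix = lambda x: pi_true * p_pos(x) + (1 - pi_true) * p_neg(x)    # p(x)

grid = np.linspace(-10.0, 10.0, 20001)                             # integration grid
f = lambda t: 0.5 * (t - 1.0) ** 2                                 # Pearson divergence

def div_f(theta):
    # Div_f(theta) = integral of p(x) f(theta * p(x|y=1) / p(x)) dx, by quadrature
    t = theta * p_pos(grid) / p_mix(grid)
    return np.trapz(p_mix(grid) * f(t), grid)

thetas = np.linspace(0.0, 1.0, 201)
theta_hat = thetas[int(np.argmin([div_f(th) for th in thetas]))]
print(theta_hat, pi_true)   # with overlapping classes, the minimizer exceeds the true prior
```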
2.2. Partial distribution matching via penalized f-divergences

In this section, we consider a function

f(t) = −(t − 1) for t < 1,  and  f(t) = c(t − 1) for t ≥ 1.

This function coincides with the L1-distance when c = 1. The analysis here is slightly more involved, since the subderivative should be taken at t = 1. This gives the following:

∂Div_f(π) = ∫_{D1} ∂f(1) p(x|y = 1) dx + ∫_{D2} f′( p(y = 1|x) ) p(x|y = 1) dx,    (4)

where the subderivative at t = 1 is ∂f(1) = [−1, c] and the derivative for t < 1 is f′(t) = −1. We can therefore write the subderivative of the first term in (4) as

∫_{D1} ∂f(1) p(x|y = 1) dx = [ −∫_{D1} p(x|y = 1) dx,  c ∫_{D1} p(x|y = 1) dx ].

The derivative of the second term in (4) is

∫_{D2} f′( p(y = 1|x) ) p(x|y = 1) dx = −∫_{D2} p(x|y = 1) dx.

To achieve a minimum at π, we should have

0 ∈ [ −∫_{D1} p(x|y = 1) dx − ∫_{D2} p(x|y = 1) dx,  c ∫_{D1} p(x|y = 1) dx − ∫_{D2} p(x|y = 1) dx ].

However, depending on D1 and D2, we may have

c ∫_{D1} p(x|y = 1) dx − ∫_{D2} p(x|y = 1) dx < 0,

which means that ∂Div_f(π) does not contain zero. The solution is to take c = ∞ to ensure that 0 ∈ ∂Div_f(π) is always satisfied. Using the same reasoning as above, we can also rectify the overestimation problem for other divergences specified by f(t), by replacing f(t) with a penalized function f̃(t):

f̃(t) = f(t) for t ≤ 1,  and  f̃(t) = ∞ otherwise.

In the next section, estimation of penalized f-divergences is discussed.

2.3. Direct evaluation of penalized f-divergences

Here, we show how distribution matching can be performed without density estimation under penalized f-divergences. We use the Fenchel duality bounding technique for f-divergences (Keziou, 2003), which is based on Fenchel's inequality:

f(t) ≥ t z − f*(z),    (5)

where f*(z) is the Fenchel dual or convex conjugate defined as

f*(z) = sup_{t′} [ t′ z − f(t′) ].

Applying the bound (5) in a pointwise manner, we obtain

f( θ p(x|y = 1) / p(x) ) ≥ ( θ p(x|y = 1) / p(x) ) r(x) − f*( r(x) ),

where r(x) fulfills the role of z in (5). Multiplying both sides with p(x) gives

p(x) f( θ p(x|y = 1) / p(x) ) ≥ θ r(x) p(x|y = 1) − p(x) f*( r(x) ).    (6)

Integrating and then selecting the tightest bound gives

Div_f(θ) ≥ sup_r [ θ ∫ r(x) p(x|y = 1) dx − ∫ p(x) f*( r(x) ) dx ].    (7)

Replacing expectations with sample averages gives

D̂iv_f(θ) ≥ sup_r [ (θ/n) Σ_{i=1}^n r(x_i) − (1/n′) Σ_{j=1}^{n′} f*( r(x′_j) ) ],    (8)

where D̂iv_f(θ) denotes Div_f(θ) estimated from sample averages. Note that the conjugate f*(z) of any function f(t) is convex. Therefore, if r(x) is linear in its parameters, the above maximization problem is convex and thus can be easily solved. The conjugates for selected penalized f-divergences are given in Table 1.
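To make the construction concrete, the sketch below implements the penalized conjugates of Table 1 and the empirical lower bound in (8) for a generic model r; it follows the formulas as written above and is an illustration, not the authors' implementation.

```python
# Sketch: penalized conjugates f~*(z) from Table 1 and the empirical lower bound (8).
import numpy as np

def penalized_conj_kl(z):
    # f(t) = -log(t):  f~*(z) = -1 - log(-z) for z <= -1, and z for z > -1
    z = np.asarray(z, dtype=float)
    return np.where(z <= -1.0, -1.0 - np.log(-np.minimum(z, -1.0)), z)

def penalized_conj_pearson(z):
    # f(t) = (1/2)(t - 1)^2:  f~*(z) = (1/2)z^2 + z for z <= 0, and z for z > 0
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0.0, 0.5 * z ** 2 + z, z)

def penalized_conj_l1(z):
    # f(t) = |t - 1|:  f~*(z) = max(z, -1)
    return np.maximum(np.asarray(z, dtype=float), -1.0)

def empirical_lower_bound(r, theta, X_pos, X_unl, penalized_conj):
    # (theta/n) sum_i r(x_i) - (1/n') sum_j f~*(r(x'_j)),  cf. (8); r is a callable model
    return theta * np.mean(r(X_pos)) - np.mean(penalized_conj(r(X_unl)))
```

Maximizing this lower bound over a model r that is linear in its parameters is a convex problem, as noted above; the next section shows that for the penalized L1-distance the maximizer is available in closed form.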
2.4. Penalized L1-distance estimation

Here we focus on the penalized L1-distance, f(t) = |t − 1|, as a specific example of penalized f-divergences, and show that an analytic and computationally efficient solution can be obtained. The conjugate for the penalized L1-distance is

f̃*(z) = max(z, −1).

Then we can see that the lower bound in (7) will be met with equality when

r(x) = 1 if θ p(x|y = 1) ≥ p(x),  and  r(x) = −1 otherwise.    (9)

For this reason, we use the following linear model as r(x):

r(x) = Σ_{l=1}^b α_l φ_l(x) − 1,    (10)

where {φ_l(x)}_{l=1}^b is a set of non-negative basis functions.¹ Then the empirical estimate in the right-hand side of (8) can be expressed through

(α̂_1, ..., α̂_b) = argmin_{(α_1,...,α_b)} [ (1/n′) Σ_{j=1}^{n′} max( Σ_{l=1}^b α_l φ_l(x′_j), 0 ) − (θ/n) Σ_{i=1}^n Σ_{l=1}^b α_l φ_l(x_i) + θ − 1 + (λ/2) Σ_{l=1}^b α_l² ],

where the regularization term (λ/2) Σ_{l=1}^b α_l² for λ > 0 is included to avoid overfitting.

The optimal value of the lower bound (7) occurs when r(x) ≥ −1 (see (9)). This fact can be incorporated by constraining the parameters of the model (10) so that α_l ≥ 0, l = 1, ..., b. If the basis functions φ_l(x), l = 1, ..., b, are non-negative, the term inside the max operation above is always non-negative and the max operation becomes superfluous. This allows us to obtain the parameter vector (α̂_1, ..., α̂_b) as the solution to the following constrained optimization problem:

(α̂_1, ..., α̂_b) = argmin_{(α_1,...,α_b)} Σ_{l=1}^b [ (λ/2) α_l² − α_l β_l ]   s.t.  α_l ≥ 0,  l = 1, ..., b,    (11)

where

β_l = (θ/n) Σ_{i=1}^n φ_l(x_i) − (1/n′) Σ_{j=1}^{n′} φ_l(x′_j).

The above optimization problem decouples over the α_l values and can be solved separately as

α̂_l = (1/λ) max(0, β_l).

1. In practice, we use Gaussian kernels centered at all sample points as the basis functions: φ_l(x) = exp( −‖x − c_l‖² / (2σ²) ), where σ > 0 and (c_1, ..., c_n, c_{n+1}, ..., c_{n+n′}) = (x_1, ..., x_n, x′_1, ..., x′_{n′}).
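Since α̂_l is given by a single max operation, the whole estimator fits in a few lines. The sketch below follows the formulas of this section together with the final expression for penL1(θ) and the choice of the class prior given on the next page; the fixed kernel width, regularization parameter, and grid of candidate priors are placeholders for the cross-validated choices used in the paper.

```python
# Sketch: the analytic pen-L1 class-prior estimator, following the formulas of this section.
# sigma, lam, and the theta grid are placeholders; the paper selects hyper-parameters by
# cross-validation.
import numpy as np

def gaussian_basis(X, centers, sigma):
    # phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)), one column per center c_l (footnote 1)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pen_l1_prior(X_pos, X_unl, sigma=1.0, lam=1e-3,
                 thetas=np.linspace(0.01, 1.0, 100)):
    centers = np.vstack([X_pos, X_unl])                  # kernels centered at all samples
    mean_phi_pos = gaussian_basis(X_pos, centers, sigma).mean(axis=0)  # (1/n)  sum_i phi_l(x_i)
    mean_phi_unl = gaussian_basis(X_unl, centers, sigma).mean(axis=0)  # (1/n') sum_j phi_l(x'_j)
    estimates = []
    for theta in thetas:
        beta = theta * mean_phi_pos - mean_phi_unl                     # beta_l
        pen_l1 = np.maximum(0.0, beta) @ beta / lam + 1.0 - theta      # penL1(theta)
        estimates.append(pen_l1)
    return thetas[int(np.argmin(estimates))]             # class prior = argmin_theta penL1(theta)
```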
Since α̂_l can be calculated with just a max operation, the above solution is extremely fast to compute. All hyper-parameters, including the Gaussian width σ and the regularization parameter λ, are selected for each θ via straightforward cross-validation. Finally, our estimate of the penalized L1-distance (i.e., the maximizer of the empirical estimate in the right-hand side of (8)) is obtained as

penL1(θ) = (1/λ) Σ_{l=1}^b max(0, β_l) β_l + 1 − θ.

The class prior is then selected so as to minimize the above estimator.

3. Stability analysis

Regarding the estimation stability of penL1(θ) for a fixed θ, we have the following deviation bound. Without loss of generality, we assume that the basis functions are upper bounded by one, i.e., 0 ≤ φ_l(x) ≤ 1 for all x and l = 1, ..., b.

Theorem 1 (Deviation bound) Fix θ. Then, for any 0 < δ < 1, with probability at least 1 − δ over the repeated sampling of D = X ∪ X′ for estimating penL1(θ; D), we have

| penL1(θ; D) − E_D[ penL1(θ; D) ] | ≤ √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).

Proof 1 We prove the theorem based on a technique known as the method of bounded differences. Let

f_l(x_1, ..., x_n, x′_1, ..., x′_{n′}) = max(0, β_l) β_l,

so that penL1(θ; D) = (1/λ) Σ_{l=1}^b f_l(x_1, ..., x_n, x′_1, ..., x′_{n′}) + 1 − θ. Next, we replace x_i with x̃_i and bound the difference between f_l(x_1, ..., x_i, ..., x_n, x′_1, ..., x′_{n′}) and f_l(x_1, ..., x̃_i, ..., x_n, x′_1, ..., x′_{n′}). Let t = φ_l(x_i), t̃ = φ_l(x̃_i), and ξ_l = β_l − (θ/n) t. Since the basis functions are bounded, we know that 0 ≤ t, t̃ ≤ 1, as well as |ξ_l| ≤ 1. Then the maximum difference is

c_l = sup_{t, t̃} | max( 0, (θ/n) t + ξ_l ) ( (θ/n) t + ξ_l ) − max( 0, (θ/n) t̃ + ξ_l ) ( (θ/n) t̃ + ξ_l ) |.

By analyzing the above for the different cases where the constraints are active, the maximum difference for replacing x_i with x̃_i is (1/n)(2 + 1/n). We can use the same argument for replacing x′_j with x̃′_j in f_l(x_1, ..., x_n, x′_1, ..., x′_{n′}), resulting in a maximum difference of (1/n′)(2 + 1/n′). Note that this holds for all f_l(·) simultaneously, and thus the change of penL1(θ; D) is no more than (b/λ)(1/n)(2 + 1/n) if x_i is replaced with x̃_i, or (b/λ)(1/n′)(2 + 1/n′) if x′_j is replaced with x̃′_j. We can therefore apply McDiarmid's inequality to obtain, with probability at least 1 − δ/2,

penL1(θ; D) − E_D[ penL1(θ; D) ] ≤ √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).
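The constant in this last step is exactly the deviation given by McDiarmid's inequality with the bounded differences computed above; a worked form of that computation (a reconstruction based on the standard statement of the inequality) is as follows.

```latex
% McDiarmid's inequality: if changing any single argument of F changes its value by at most
% c_i, then  P( F - E[F] >= eps ) <= exp( -2 eps^2 / sum_i c_i^2 ).
% Here c_i = (b/lambda)(1/n)(2 + 1/n) for the n positive points and
%      c_j = (b/lambda)(1/n')(2 + 1/n') for the n' unlabeled points, so
\[
\sum_i c_i^2
  = \frac{b^2}{\lambda^2}
    \left[\frac{1}{n}\Big(2+\frac{1}{n}\Big)^{2}
        + \frac{1}{n'}\Big(2+\frac{1}{n'}\Big)^{2}\right],
\]
% and setting the right-hand side of McDiarmid's inequality to delta/2 and solving for eps
% yields the deviation term appearing in Theorem 1:
\[
\varepsilon
  = \sqrt{\frac{\ln(2/\delta)}{2\lambda^2/b^2}
    \left[\frac{1}{n}\Big(2+\frac{1}{n}\Big)^{2}
        + \frac{1}{n'}\Big(2+\frac{1}{n'}\Big)^{2}\right]}.
\]
```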
Applying McDiarmid's inequality again to E_D[ penL1(θ; D) ] − penL1(θ; D) proves the result.

Theorem 1 shows that the deviation of our estimate from its expectation is small with high probability. Nevertheless, θ must be fixed before seeing the data, so we cannot use the estimate to choose θ. This motivates us to derive a uniform deviation bound. Let us define the constants

C_x = sup_x ( Σ_{l=1}^b φ_l²(x) )^{1/2} ≤ √b,
C_α = sup_θ sup_{X ~ p^n(x|y=1), X′ ~ p^{n′}(x)} ( Σ_{l=1}^b α̂_l²(θ, D) )^{1/2},

where we write α̂_l(θ, D) to emphasize that α̂_l depends on θ and D.

Theorem 2 (Uniform deviation bound) For any 0 < δ < 1, with probability at least 1 − δ over the repeated sampling of D, the following holds for all θ:

| penL1(θ; D) − E_D[ penL1(θ; D) ] | ≤ ( 2/√n + 2/√n′ ) C_α C_x + √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).

Proof 2 We denote g(θ; D) = Σ_{l=1}^b α̂_l β_l and g(θ) = E_D[ g(θ; D) ], so that penL1(θ; D) = g(θ; D) + 1 − θ, E_D[ penL1(θ; D) ] = g(θ) + 1 − θ, and

penL1(θ; D) − E_D[ penL1(θ; D) ] = g(θ; D) − g(θ).

Step 1: We first consider one direction. By definition, for all θ,

g(θ; D) − g(θ) ≤ sup_θ { g(θ; D) − g(θ) }.

We cannot simply apply Theorem 1 to bound the right-hand side, since θ in the right-hand side is not fixed. Nevertheless, according to the proof of Theorem 1, if we replace a single point x_i or x′_j in D, the change of sup_θ { g(θ; D) − g(θ) } is also bounded by (b/(λn))(2 + 1/n) or (b/(λn′))(2 + 1/n′), respectively. We then have, with probability at least 1 − δ/2,

sup_θ { g(θ; D) − g(θ) } ≤ E_D[ sup_θ { g(θ; D) − g(θ) } ] + √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).
Step 2: Next we bound E_D[ sup_θ { g(θ; D) − g(θ) } ] based on a technique known as symmetrization. Note that the function g(θ; D) = Σ_{l=1}^b α̂_l β_l can be rewritten in a point-wise manner rather than a basis-wise manner:

g(θ; D) = Σ_{l=1}^b α̂_l ( (θ/n) Σ_{i=1}^n φ_l(x_i) − (1/n′) Σ_{j=1}^{n′} φ_l(x′_j) )
        = (1/n) Σ_{i=1}^n ω(x_i) − (1/n′) Σ_{j=1}^{n′} ω′(x′_j),

where for simplicity we define

ω(x) = θ Σ_{l=1}^b α̂_l φ_l(x)   and   ω′(x) = Σ_{l=1}^b α̂_l φ_l(x).

Let D̃ = {x̃_1, ..., x̃_n, x̃′_1, ..., x̃′_{n′}} be a ghost sample. Then

E_D[ sup_θ { g(θ; D) − g(θ) } ] = E_D[ sup_θ { g(θ; D) − E_D̃[ g(θ; D̃) ] } ]
  = E_D[ sup_θ E_D̃[ g(θ; D) − g(θ; D̃) ] ]
  ≤ E_{D,D̃}[ sup_θ { g(θ; D) − g(θ; D̃) } ],

where we apply Jensen's inequality together with the fact that the supremum is a convex function. Moreover, let σ = {σ_1, ..., σ_n, σ′_1, ..., σ′_{n′}} be a set of Rademacher variables of size n + n′. Then

E_{D,D̃}[ sup_θ { g(θ; D) − g(θ; D̃) } ]
  = E_{D,D̃}[ sup_θ { (1/n) Σ_i ω(x_i) − (1/n′) Σ_j ω′(x′_j) − (1/n) Σ_i ω(x̃_i) + (1/n′) Σ_j ω′(x̃′_j) } ]
  = E_{D,D̃}[ sup_θ { (1/n) Σ_i ( ω(x_i) − ω(x̃_i) ) − (1/n′) Σ_j ( ω′(x′_j) − ω′(x̃′_j) ) } ]
  = E_{σ,D,D̃}[ sup_θ { (1/n) Σ_i σ_i ( ω(x_i) − ω(x̃_i) ) − (1/n′) Σ_j σ′_j ( ω′(x′_j) − ω′(x̃′_j) ) } ],

since the original and ghost samples are symmetric, each ω(x_i) − ω(x̃_i) shares the same distribution with σ_i ( ω(x_i) − ω(x̃_i) ), and each ω′(x′_j) − ω′(x̃′_j) shares the same distribution with σ′_j ( ω′(x′_j) − ω′(x̃′_j) ).
Subsequently,

E_{σ,D,D̃}[ sup_θ { (1/n) Σ_i σ_i ( ω(x_i) − ω(x̃_i) ) − (1/n′) Σ_j σ′_j ( ω′(x′_j) − ω′(x̃′_j) ) } ]
  ≤ E_{σ,D,D̃}[ sup_θ { (1/n) Σ_i σ_i ω(x_i) − (1/n′) Σ_j σ′_j ω′(x′_j) } + sup_θ { (1/n) Σ_i (−σ_i) ω(x̃_i) − (1/n′) Σ_j (−σ′_j) ω′(x̃′_j) } ]
  ≤ E_{σ,D}[ sup_θ { (1/n) Σ_i σ_i ω(x_i) − (1/n′) Σ_j σ′_j ω′(x′_j) } ] + E_{σ,D̃}[ sup_θ { (1/n) Σ_i (−σ_i) ω(x̃_i) − (1/n′) Σ_j (−σ′_j) ω′(x̃′_j) } ]
  = 2 E_{σ,D}[ sup_θ { (1/n) Σ_i σ_i ω(x_i) − (1/n′) Σ_j σ′_j ω′(x′_j) } ],

where we first apply the triangle inequality, and then make use of the fact that the original and ghost samples have the same distribution and that all Rademacher variables have the same distribution.

Step 3: The Rademacher complexity still remains to be bounded. To this end, we decompose it into two parts:

E_{D,σ}[ sup_θ { (1/n) Σ_i σ_i ω(x_i) − (1/n′) Σ_j σ′_j ω′(x′_j) } ]
  = E_{D,σ}[ sup_θ { (1/n) Σ_{l=1}^b (θ α̂_l) Σ_i σ_i φ_l(x_i) − (1/n′) Σ_{l=1}^b α̂_l Σ_j σ′_j φ_l(x′_j) } ]
  ≤ E_{D,σ}[ sup_θ (1/n) | Σ_{l=1}^b (θ α̂_l) Σ_i σ_i φ_l(x_i) | ] + E_{D,σ}[ sup_θ (1/n′) | Σ_{l=1}^b α̂_l Σ_j σ′_j φ_l(x′_j) | ].

Applying the Cauchy-Schwarz inequality followed by Jensen's inequality to the first Rademacher average gives

E_{D,σ}[ sup_θ (1/n) | Σ_{l=1}^b (θ α̂_l) Σ_i σ_i φ_l(x_i) | ]
  ≤ (C_α/n) E_{D,σ}[ ( Σ_{l=1}^b ( Σ_i σ_i φ_l(x_i) )² )^{1/2} ]
  ≤ (C_α/n) ( E_{D,σ}[ Σ_{l=1}^b ( Σ_i σ_i φ_l(x_i) )² ] )^{1/2}.

Since σ_1, ..., σ_n are Rademacher variables,

E_{D,σ}[ Σ_{i,i′=1}^n σ_i σ_{i′} φ_l(x_i) φ_l(x_{i′}) ] = E_D[ Σ_{i=1}^n φ_l²(x_i) ],

and therefore E_{D,σ}[ Σ_{l=1}^b ( Σ_i σ_i φ_l(x_i) )² ] ≤ n C_x².
Consequently, we have

E_{D,σ}[ sup_θ (1/n) | Σ_{l=1}^b (θ α̂_l) Σ_i σ_i φ_l(x_i) | ] ≤ C_α C_x / √n

and

E_{D,σ}[ sup_θ (1/n′) | Σ_{l=1}^b α̂_l Σ_j σ′_j φ_l(x′_j) | ] ≤ C_α C_x / √n′,

so that

E_{D,σ}[ sup_θ { (1/n) Σ_i σ_i ω(x_i) − (1/n′) Σ_j σ′_j ω′(x′_j) } ] ≤ ( 1/√n + 1/√n′ ) C_α C_x.

Step 4: Combining the three steps together, we obtain that, with probability at least 1 − δ/2, for all θ,

g(θ; D) − g(θ) ≤ ( 2/√n + 2/√n′ ) C_α C_x + √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).

The same argument can be used to bound g(θ) − g(θ; D). Combining these two tail inequalities proves the theorem.

Theorem 2 shows that the uniform deviation bound is of order O(1/√n + 1/√n′), whereas the deviation bound for a fixed θ is of order O(√(1/n + 1/n′)) as shown in Theorem 1. For this special estimation problem, the convergence rate of the uniform deviation bound is clearly worse than the convergence rate of the deviation bound for a fixed θ. However, after obtaining the uniform deviation bound, we are able to bound the estimation error, that is, the gap between the expectation of our estimate and the best possible estimate within the model.

To do so, we need to constrain the parameters of the best estimate via C_α, since (10) takes (9) as the target, while (9) is an unbounded function. Furthermore, we assume that the regularization is weak enough that it does not affect the solution α̂ too much. Specifically, given D, let α° be the minimizer of the objective function in (11) but without the regularization term (λ/2) Σ_{l=1}^b α_l², subject to ‖α°‖₂ ≤ C_α. Let penL1°(θ; D) be the estimator of the penalized L1-distance corresponding to α°, and assume that there exists Δ_α > 0 such that | penL1°(θ; D) − penL1(θ; D) | ≤ Δ_α for all θ and D. Then we have the following theorem.

Theorem 3 (Estimation error bound) Let penL1*(θ) be the maximizer of the estimate in the right-hand side of (7) based on (10) with the best possible parameter α*, where ‖α*‖₂ ≤ C_α. For any 0 < δ < 1, with probability at least 1 − δ over the repeated sampling of D, the following holds for all θ:

| penL1*(θ) − E_D[ penL1(θ; D) ] | ≤ Δ_α + ( 4/√n + 4/√n′ ) C_α C_x + √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).
Proof 3 Since α* is fixed, we have E_D[ penL1*(θ; D) ] = penL1*(θ). Then

penL1*(θ) − E_D[ penL1(θ; D) ]
  = E_D[ penL1*(θ; D) ] − E_D[ penL1(θ; D) ]
  = ( E_D[ penL1*(θ; D) ] − penL1*(θ; D) ) + ( penL1(θ; D) − E_D[ penL1(θ; D) ] )
    + ( penL1*(θ; D) − penL1°(θ; D) ) + ( penL1°(θ; D) − penL1(θ; D) ).

We bound each of the four terms separately. According to the proof of Theorem 2, with probability at least 1 − δ/2, for all θ,

E_D[ penL1*(θ; D) ] − penL1*(θ; D) ≤ ( 2/√n + 2/√n′ ) C_α C_x + √( ln(2/δ) / (2λ²/b²) · [ (1/n)(2 + 1/n)² + (1/n′)(2 + 1/n′)² ] ).

The same can be proven for penL1(θ; D) − E_D[ penL1(θ; D) ]. The third term must be non-positive, since penL1°(θ; D) is the maximizer of the empirical estimate. Finally, the fourth term is upper bounded by Δ_α, which completes the proof.

Theorem 3 shows that the deviation of the expectation of our estimate from the optimal value within the model is small with high probability.

4. Related work

In Scott and Blanchard (2009) and Blanchard et al. (2010), it was proposed to reduce the problem of estimating the class prior to Neyman-Pearson classification.² A Neyman-Pearson classifier f minimizes the false-negative rate R_1(f) while keeping the false-positive rate R_{−1}(f) constrained under a user-specified threshold (Scott and Nowak, 2005):

R_1(f) = P_1( f(X) ≤ 0 ),    R_{−1}(f) = P_{−1}( f(X) ≥ 0 ),

where P_1 and P_{−1} denote the probabilities for the positive-class and negative-class conditional densities, respectively. The false-negative rate on the unlabeled dataset is defined and expressed as

R_X(f) = P_X( f(X) ≥ 0 ) = π ( 1 − R_1(f) ) + ( 1 − π ) R_{−1}(f),

where P_X denotes the probability for the unlabeled input data density. The Neyman-Pearson classifier between P_1 and P_X is defined as

R*_{X,α} = inf_f R_X(f)   s.t.   R_1(f) ≤ α.

2. The papers (Scott and Blanchard, 2009; Blanchard et al., 2010) treat the class from which labeled samples are available as the nominal class and the remaining class as the novel class, and aim at estimating the proportion of the novel class. We use a different notation, regarding the nominal class as the positive class y = 1 and the novel class as y = −1, and estimate π = p(y = 1). To simplify the exposition, we use the same notation here as in the rest of the paper.
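As an illustration of the quantity just defined, the sketch below traces an empirical version of the curve α ↦ R*_{X,α} by thresholding an arbitrary real-valued score function, and reads off its slope near the right endpoint, which is how the class prior is recovered in the equations that follow. The score function and the simple least-squares line fit are illustrative choices; they are not the tuned kernel-density-ratio classifier used by Blanchard et al. (2010).

```python
# Sketch: empirical curve alpha -> R_X used by the ROC-based (SB) approach, obtained by
# thresholding a score function f (higher score = more "positive").  Illustrative only.
import numpy as np

def empirical_np_curve(scores_pos, scores_unl):
    # R_1(f_t) = P_1(f <= t) on the positive sample, R_X(f_t) = P_X(f > t) on the unlabeled one
    thresholds = np.unique(np.concatenate([scores_pos, scores_unl]))
    alpha = np.array([(scores_pos <= t).mean() for t in thresholds])
    r_x = np.array([(scores_unl > t).mean() for t in thresholds])
    order = np.argsort(alpha)
    return alpha[order], r_x[order]

def prior_from_right_endpoint(alpha, r_x, window=0.1):
    # crude estimate of -dR_X/dalpha near alpha = 1 via a least-squares line fit
    mask = alpha >= 1.0 - window
    slope = np.polyfit(alpha[mask], r_x[mask], deg=1)[0]
    return -slope
```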
The minimum false-negative rate for the unlabeled dataset given the false-positive rate α is then expressed as

R*_{X,α} = π ( 1 − α ) + ( 1 − π ) R*_{−1,α}.    (12)

Theorem 1 in Scott and Blanchard (2009) says that if the supports of P_1 and P_{−1} are different, there exists α such that R*_{−1,α} = 0. Therefore, the class prior can be determined as

π = − dR*_{X,α}/dα  evaluated at α → 1⁻,    (13)

where α → 1⁻ denotes the limit from the left-hand side. Note that this limit is necessary, since the first term in (12) will be zero when α = 1.

However, estimating the derivative when α → 1 is not straightforward in practice. The curve of R_X vs. R_1 can be interpreted as an ROC curve (with a suitable change in class notation), but the empirical ROC curve is often unstable at the right endpoint when the input dimensionality is high (Sanderson and Scott, 2014). One approach to overcome this problem is to fit a curve to the right endpoint of the ROC curve in order to enable the estimation (as in Sanderson and Scott (2014)). However, it is not clear how the estimated class prior is affected by this curve fitting.

5. Experiments

In this section, we experimentally compare the performance of the proposed method and alternative methods for estimating the class prior. We compared the following methods:

EN: The method of Elkan and Noto (2008), with a squared-loss variant of logistic regression as the classifier (Sugiyama, 2010).

PE: The direct Pearson-divergence matching method proposed in du Plessis and Sugiyama (2014).

SB: The method of Blanchard et al. (2010). The Neyman-Pearson classifier was implemented as the thresholded ratio of two kernel density estimates, each with a bandwidth parameter. As in Blanchard et al. (2010), the bandwidth parameters were jointly optimized by maximizing the cross-validated estimate of the AUC. The prior was obtained by estimating (13) from the empirical ROC curve.

pen-L1 (proposed): The penalized L1-distance method with an analytic solution. The basis functions were selected as Gaussians centered at all training samples. All hyper-parameters were determined by cross-validation.

First, we illustrate the systematic overestimation of the class prior by two previously proposed methods, EN (Elkan and Noto, 2008) and PE (du Plessis and Sugiyama, 2014), when the classes significantly overlap.
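To give a feel for how such an illustration can be scripted with the proposed estimator, the sketch below repeatedly draws a synthetic PU dataset and records the pen-L1 estimate, in the spirit of the histograms of Figure 2. The generating densities, sample sizes, hyper-parameters, and number of trials in the sketch are illustrative stand-ins; the actual setup of the experiment is described next.

```python
# Sketch: spread of pen-L1 estimates over repeated synthetic PU draws (histogram-style check).
# All generating distributions, sample sizes, hyper-parameters, and trial counts are illustrative.
import numpy as np

rng = np.random.default_rng(1)
pi_true = 0.2

def gaussian_basis(X, centers, sigma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pen_l1_prior(X_pos, X_unl, lam=1e-3, thetas=np.linspace(0.01, 1.0, 100)):
    centers = np.vstack([X_pos, X_unl])
    mp = gaussian_basis(X_pos, centers).mean(axis=0)
    mu = gaussian_basis(X_unl, centers).mean(axis=0)
    scores = [np.maximum(0.0, t * mp - mu) @ (t * mp - mu) / lam + 1.0 - t for t in thetas]
    return thetas[int(np.argmin(scores))]

def draw_pu_data(n=300, n_unl=300):
    X_pos = rng.normal(0.0, 1.0, size=(n, 1))
    is_pos = rng.random(n_unl) < pi_true
    X_unl = np.where(is_pos[:, None],
                     rng.normal(0.0, 1.0, size=(n_unl, 1)),
                     rng.normal(-2.0, 1.0, size=(n_unl, 1)))
    return X_pos, X_unl

estimates = [pen_l1_prior(*draw_pu_data()) for _ in range(100)]
print(np.mean(estimates), np.std(estimates))   # compare the spread against pi_true = 0.2
```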
The class-conditional densities are

p(x|y = 1) = N_x(0, 1²)  and  p(x|y = −1) = N_x(−2, 1²),

where N_x(µ, σ²) denotes the Gaussian density with mean µ and variance σ² with respect to x. The true class prior is set at π = p(y = 1) = 0.2, and the sizes of the unlabeled dataset and the labeled dataset were set to be equal, n = n′.

Figure 2: Histograms of class-prior estimates by (a) EN, (b) PE, (c) SB, and (d) pen-L1 when the true class prior is p(y = 1) = 0.2. The EN method and the PE method have an intrinsic bias that does not decrease even when the number of samples is increased, while the SB method and the proposed pen-L1 method work reasonably well.

The histograms of class-prior estimates are plotted in Figure 2, showing that the EN and PE methods clearly overestimate the true class prior. We also confirmed that this overestimation does not decrease even when the number of samples is increased. On the other hand, the SB and pen-L1 methods work reasonably well.

Finally, we use the MNIST hand-written digit dataset. For each digit, all the other digits were assumed to be in the opposite class (i.e., one-versus-rest). The dimensionality of the dataset was reduced using principal component analysis. The squared error of class-prior estimation is given in Figure 3, showing that the proposed pen-L1 method overall gives accurate estimates of the class prior, while the EN and PE methods tend to give less accurate estimates for low class priors and more accurate estimates for higher class priors, which agrees with the observation in du Plessis and Sugiyama (2014). On the other hand, the SB method tends to perform poorly, which is caused by the instability of the empirical ROC curve at the right endpoint when the input dimensionality is large, as pointed out in Section 4.

6. Conclusion

In this paper, we discussed the problem of class-prior estimation from positive and unlabeled data. We first showed that class-prior estimation from positive and unlabeled data by partial distribution matching under f-divergences yields systematic overestimation of the class prior. We then proposed the use of penalized f-divergences to rectify this problem. We further showed that the use of the L1-distance as an example of f-divergences yields a computationally efficient algorithm with an analytic solution. We provided its uniform deviation bound and estimation error bound, which theoretically support the usefulness of the proposed method. Finally, through experiments, we demonstrated that the proposed method compares favorably with existing approaches.
Figure 3: Class-prior estimation accuracy for the MNIST dataset (methods: EN, PE, SB, pen-L1; panels (a)–(i): digit 1 vs. others through digit 9 vs. others). The EN method and the PE method behave similarly, and the proposed pen-L1 method consistently outperforms them. The SB method performed poorly due to the instability of the empirical ROC curve at the right endpoint.

Acknowledgments

MCdP and GN were supported by the JST CREST program, and MS was supported by KAKENHI.

References

S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

M. C. du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Jun. 26–Jul. 1, 2012.

M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5):1358–1362, 2014.

M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 703–711, 2014.

C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In ACM SIGKDD 14, pages 213–220, 2008.

A. Keziou. Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10), 2003.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.

M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.

T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.

C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), Clearwater Beach, Florida, USA, Apr. 2009.

C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005.

M. Sugiyama. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690–2701, 2010.

M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi. Density-difference estimation. In Advances in Neural Information Processing Systems 25, pages 692–700, 2012.