Convex Multiple-Instance Learning by Estimating Likelihood Ratio


Fuxin Li and Cristian Sminchisescu
Institute for Numerical Simulation, University of Bonn

Abstract

We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.

1 Introduction

Multiple Instance Learning (MIL) was proposed over 10 years ago as a methodology to learn models under weak labeling constraints [1]. Unlike traditional binary classification problems, the positive items are represented as bags, which are sets of instances. A feature vector is used to represent each instance in the bag. There is an OR relationship in a bag: if one of the feature vectors is classified as positive, the entire bag is considered positive. A simple intuition is: one has a number of keys and faces a locked door. To enter the door, we only need one matching key. MIL is a natural weak labeling formulation for text categorization [2] and computer vision problems [3]. In document classification, one is given files made of many sentences, and often only a few are useful.
In computer vision, an image can be decomposed into different regions, and only some delineate objects. Therefore, MIL can be used in sophisticated tasks, such as identifying the location of object parts from bounding box information in images [4]. Although efforts have been made to provide datasets with increasingly detailed supervisory information [5], without automation such a minute level of detail becomes prohibitive for large datasets, or for more complicated data like video [6, 7]. In this case, one necessarily needs to resort to multiple-instance learning.

MIL is interesting mainly because of its potential to provide instance-level labels from weak supervisory information. However, the state of the art in MIL is often obtained by simply using a weighted sum of kernel values between all instance pairs within the bags, while ignoring the prediction of instance labels [8, 9, 10]. It is intriguing that MIL algorithms that exploit instance-level information cannot achieve better performance, as constraints at instance level seem abundant: none of the negative instances is positive. This should provide additional constraints in defining the region of positive instances and should help classification in input space. A major challenge is the non-convexity of many instance-level MIL algorithms [2, 11, 12, 13, 14]. Most of these algorithms perform alternating minimization on the classifier and the instance weights.

This procedure usually gives only a local optimum since the objective is non-convex. The benchmark performance of MIL methods is overall quite similar, although techniques differ significantly: some assign binary weights to instances [2], some assign real weights [12, 13], yet others use probabilistic formulations [14]; some optimize using conventional alternating minimization, others use convex-concave procedures [11]. Gehler and Chapelle [15] have recently performed an interesting analysis of the MIL costs, where deterministic annealing (DA) was used to compute better local optima for several formulations. In the case of a previous mi-SVM formulation [2], annealing methods did not improve the performance significantly. A newly proposed algorithm, ALP-SVM, was also introduced, which used a preset parameter defining the fixed ratio of witnesses, the true positive instances in a positive bag. Excellent results were obtained with this witness rate parameter set to the correct value. However, in practice it is unclear whether this can be known beforehand and whether it is stationary across different bags. In principle, the witness rate should also be estimated, and this learning stage partially accounts for the non-convexity of the MIL problem. It remains unclear, however, whether the observed performance variations are caused by non-convexity or by other modeling aspects.

Although performance considerations have hindered the application of MIL to practical problems, the methodology has started to gain momentum recently [4, 6, 7, 16]. The success of the Latent SVM for person detection [4] shows that a standard MIL procedure (the reformulation of the alternating minimization MI-SVM algorithm in [2]) can achieve good results if properly initialized. However, proper initialization of MIL remains elusive in general, as it often requires engineering experience with the individual problem structure.
Therefore, it is still of broad interest to develop an initialization-independent formulation for MIL. Recently Li et al. [17] proposed a convex instance-level MIL algorithm based on multiple kernel learning, where one kernel was used for each possible combination of instances. This creates an exponential number of constraints and requires a cutting-plane solver. Although the formulation is convex, its scalability drops significantly for bags with many instances.

In this paper we make an alternative attempt towards a convex formulation: we establish that non-convex MIL constraints can be recast reliably into convex constraints on the likelihood ratio between the positive and negative classes for each instance. We transform the multiple-instance learning problem into a convex joint estimation of the likelihood ratio function and the likelihood ratio values on training instances. The choice of jointly convex loss functions is rich and includes, notably, a family of f-divergences. Theoretically, we prove consistency results for likelihood ratio estimation, thus showing that f-divergence loss functions upper bound the classification 0-1 loss tightly, unless the likelihood is very large. A support vector regression scheme is implemented to estimate the likelihood ratio, and it is shown to separate positive and negative instances well. However, determining the correct threshold for instance classification from the training set remains non-trivial. To address this problem, we propose a post-processing step based on a bag classifier computed as a linear combination of likelihood ratios. While this is shown to be suboptimal in synthetic experiments, it still achieves state-of-the-art results on practical datasets, demonstrating the potential of the proposed approach.

2 Convex Reformulation of the Multiple Instance Constraint

Let us consider a learning problem with $n$ training instances in total, $n^+$ positive and $n^-$ negative.
In negative bags, every instance is negative, hence we do not separately define such bags; instead we work directly with the instances. Let $B = \{B_1, B_2, \ldots, B_k\}$ be positive bags and $X^+ = \{x_1^+, x_2^+, \ldots, x_{n^+}^+\}$, $X^- = \{x_1^-, x_2^-, \ldots, x_{n^-}^-\}$ be the training input, where each $x_i^+$ belongs to a positive bag $B_j$ and each $x_i^-$ is a negative instance. The goal of multiple instance learning is, given $\{X^+, X^-, B\}$, to learn a decision rule, $\mathrm{sign}(f(x))$, to predict the label $\{+1, -1\}$ for the test instance $x$.

The MIL problem can be characterized by two properties. 1) negative-exclusion: if none of the instances in a bag is positive, the bag is not positive. 2) positive-identifiability: if one of the instances in the bag is positive, the bag is positive. These properties are equivalent to a constraint $\max_{x_i \in B_j} f(x_i) \ge 0$ on positive bags. This constraint is not convex since the negative max function is concave. Reformulation into a sum constraint such as $\sum_{x_i \in B_j} f(x_i) \ge 0$ would be convex, when

$f(x) = w^T x$ is linear [6]. However, this hardly retains positive-identifiability, since if there is only one $x_i$ with $f(x_i) > 0$, this can be superseded by other instances with $f(x_i) < 0$. Apparently, the distinction between the sum and the max operations is significant and difficult to ignore in this context. However, in this paper we show that if MIL conditions are formulated as constraints on the likelihood ratio, convexity can be achieved. For example, the constraint:

$$\sum_{x_i \in B_j} \frac{\Pr(y = 1 \mid x_i)}{\Pr(y = -1 \mid x_i)} > |B_j| \qquad (1)$$

can ensure both of the MIL properties. Positive-identifiability is satisfied when $\Pr(y = 1 \mid x_i) > \frac{|B_j|}{|B_j| + 1}$, or equivalently, when the positive examples all have very large margin.

When the size of the bag is large, the assumption $\Pr(y = 1 \mid x_i) > \frac{|B_j|}{|B_j| + 1}$ can be too strong. Therefore, we exploit large deviation bounds to reduce the quantity $|B_j|$, such that $\Pr(y = 1 \mid x_i)$ does not have to be very large to satisfy the constraint. Intuitively, if the examples are not very ambiguous, i.e. $\Pr(y = 1 \mid x_i)$ is not close to $1/2$, then likelihood ratio sums on negative examples can become much smaller, hence we can adopt a significantly lower threshold at some degree of violation of the negative-exclusion property. To this end, a common assumption is the low label noise [18, 19]:

$$M_\beta: \exists c > 0, \forall \epsilon, \ \Pr\left(0 < \left|\Pr(y = 1 \mid x_i) - \tfrac{1}{2}\right| \le \epsilon\right) \le c\,\epsilon^{\beta}.$$

This assumes that the posterior $\Pr(y = 1 \mid x_i)$ is usually not very close to $1/2$, meaning that most examples are not very ambiguous, which is reasonable in many practical problems. In [18, 19, 20], a number of results have been obtained implying that classifiers learned under this assumption converge to the Bayes error much faster than the conventional empirical process rate $O(n^{-1/2})$ of most standard classifiers, and can be as fast as $O(n^{-1})$. These theoretical results show that low label noise assumptions indeed support learning with fewer observations.
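The way constraint (1) encodes both MIL properties can be illustrated numerically. The following is a minimal sketch, not code from the paper: the function names and the example posteriors are hypothetical, chosen only to show that a single instance with posterior just above $|B_j|/(|B_j|+1)$ satisfies the constraint on its own, while a bag of confident negatives does not.

```python
def likelihood_ratio(p_pos):
    """Likelihood ratio Pr(y=1|x) / Pr(y=-1|x) for a posterior p_pos in [0, 1)."""
    return p_pos / (1.0 - p_pos)


def satisfies_constraint_1(posteriors):
    """Check constraint (1): the ratios in a bag must sum to more than |B_j|."""
    return sum(likelihood_ratio(p) for p in posteriors) > len(posteriors)


bag_size = 5
# Hypothetical posterior just above |B_j| / (|B_j| + 1) for one witness instance.
witness = bag_size / (bag_size + 1) + 0.01
positive_bag = [witness] + [0.05] * (bag_size - 1)
confident_negatives = [0.05] * bag_size
ambiguous_bag = [0.5] * bag_size  # every ratio is exactly 1, the sum equals |B_j|

print(satisfies_constraint_1(positive_bag))         # one witness is enough
print(satisfies_constraint_1(confident_negatives))  # negatives stay far below |B_j|
print(satisfies_constraint_1(ambiguous_bag))        # the boundary case fails the strict test
```

The ambiguous bag shows why the low label noise assumption matters: posteriors near 1/2 push the ratio sum of a negative bag up towards $|B_j|$.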
Assuming $M_\beta$ holds, we prove the following result, which allows us to relax the hard constraint (1):

Theorem 1 $\forall \delta > 0$, for each $x_i$ in a bag $B_j$, assume $y_i$ is drawn i.i.d. from the distribution $\Pr_{B_j}(y_i \mid x_i)$ that satisfies $M_\beta$. If all instances $x_i \in B_j$ are negative, then the probability that

$$\sum_{x_i \in B_j} \frac{\Pr(y = 1 \mid x_i)}{\Pr(y = -1 \mid x_i)} \ge \frac{\beta + 4}{2(\beta + 1)(\beta + 2)}\,|B_j| + \sqrt{\frac{4\beta + 1}{2(\beta + 1)^2 (2\beta + 3)}\,|B_j| \log\frac{1}{\delta}} + \frac{\log\frac{1}{\delta}}{3} \qquad (2)$$

is at most $\delta$.

The proof is given in an accompanying technical report [21]. From Theorem 1, we could weaken the constraint (1) to obtain constraint (2) and still ensure negative-exclusion with probability $1 - \delta$. When $\beta$ is large, the reduction is significant. For example, for $\beta = 2$ and $\delta = 0.05$, the right-hand side of (2) is approximately $\frac{|B_i|}{4} + \sqrt{\frac{3}{14}|B_i|} + 1$, which is an important decrease over $|B_i|$, whenever $|B_i| \ge 3$. Note that the i.i.d. assumption in Theorem 1 applies to each bag. Different bags can have different label distributions. This is often a significantly weaker assumption than the ones based on global i.i.d. of labels [18].

3 Likelihood Ratio Estimation

To estimate the likelihood ratio, one possibility would be to use kernel methods as nonparametric estimators over an RKHS. This approach was taken in [22], where predictions of the ratio provided a variational estimate of an f-divergence (or Ali-Silvey divergence) between two distributions. The formulation is powerful, yet not immediately applicable here. In our case, because of the uncertainty in the positive examples, $\Pr(y = 1 \mid x)$ is not observed but has to be estimated. Therefore we need to optimize jointly as $\min_{f, \Pr(y = 1 \mid x)} D(f, \Pr(y = 1 \mid x)) + \lambda \|f\|^2$ with loss function $D(f, g)$. This optimization would not be convex if the framework in [22] were adopted.
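The size of the relaxation offered by Theorem 1 can be checked numerically. The sketch below assumes that the right-hand side of (2) reads as a mean term, a square-root deviation term and a $\log(1/\delta)/3$ term, which is our reading of the extracted formula; the function names are hypothetical, and capping the threshold at the bag size with `min` is simply one way to express that the relaxation only helps for larger bags.

```python
import math


def negative_bag_bound(bag_size, beta=2.0, delta=0.05):
    """Right-hand side of (2), under the reading of the formula stated above."""
    mean_term = (beta + 4.0) / (2.0 * (beta + 1.0) * (beta + 2.0)) * bag_size
    var_coeff = (4.0 * beta + 1.0) / (2.0 * (beta + 1.0) ** 2 * (2.0 * beta + 3.0))
    log_term = math.log(1.0 / delta)
    return mean_term + math.sqrt(var_coeff * bag_size * log_term) + log_term / 3.0


def bag_threshold(bag_size, beta=2.0, delta=0.05):
    """Use the relaxed bound only when it is smaller than the bag size itself."""
    return min(float(bag_size), negative_bag_bound(bag_size, beta, delta))


for size in (1, 2, 3, 10, 50):
    print(size, round(negative_bag_bound(size), 3), round(bag_threshold(size), 3))
```

With $\beta = 2$ and $\delta = 0.05$ the bound drops below the bag size from three instances onward, and the gap widens quickly for larger bags.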

The requirement to estimate two sets of variables simultaneously (e.g. $f$ and $\Pr(y = 1 \mid x)$ here) is one of the major difficulties in turning multiple-instance learning into a convex problem. Approaches based on classification-style loss functions lead to non-convex optimization [2, 3]. However, since we are outside a classification setting, we can optimize over divergence measures $D_\phi(f, g)$ that are convex w.r.t. both $f$ and $g$. These measures are common. For example, the f-divergence family, which includes many statistical distances, satisfies these properties [23]:

$$\begin{aligned}
L_1&: \; D(x, y) = \sum_i |x_i - y_i|; \qquad \chi^2: \; D(x, y) = \sum_i \frac{(x_i - y_i)^2}{x_i}; \\
\text{Kullback-Leibler}&: \; D(x, y) = \sum_i x_i \log x_i - x_i \log y_i - x_i + y_i; \\
\text{Symmetric Kullback-Leibler}&: \; D(x, y) = \sum_i (y_i - x_i) \log y_i + (x_i - y_i) \log x_i - x_i + y_i
\end{aligned} \qquad (3)$$

In principle, any of the measures given above can be used to estimate the likelihood ratio.

An important issue is the relationship between the likelihood ratio estimation and our final goal: binary classification. In [20], the authors give necessary and sufficient conditions for Bayes consistent learners obtained by minimizing the mean of a surrogate loss function of the data. In this paper we extend these results to loss functions for likelihood ratio estimation.

Let $R(f) = P(\mathrm{sign}(y) \neq \mathrm{sign}(f(x) - 1))$ be the 0-1 risk of a likelihood estimator $f$, with classification rule given by $\mathrm{sign}(f(x) - 1)$. The Bayes risk is then $R^* = \inf_f R(f)$. For a generic loss function $C(\alpha, \eta)$ with $\eta = \Pr(y = 1 \mid x)$, we can define the C-risk as $R_C(f) = E_{xy}(C(f, \eta))$ and $R_C^* = \inf_f R_C(f)$. Our goal is to bound the excess 0-1 risk $R(f) - R^*$ by the excess C-risk $R_C(f) - R_C^*$, so that minimizing the classification loss can be converted into minimizing the excess C-risk. Let us further define the optimal conditional risk as $H(\eta) = \inf_{\alpha \in \mathbb{R}} C(\alpha, \eta)$, and $H^-(\eta) = \inf_{\alpha: (\alpha - 1)(2\eta - 1) \le 0} C(\alpha, \eta)$. We say $C(\alpha, \eta)$ is classification-calibrated if for any $\eta \neq 1/2$, $H^-(\eta) > H(\eta)$.
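The divergences in (3) are straightforward to evaluate, and their joint convexity can be spot-checked on a pair of points. The sketch below is only an illustration under our own naming: it implements the $L_1$, $\chi^2$ and Kullback-Leibler measures as written in (3) for positive vectors and tests the midpoint inequality $D(\text{midpoint}) \le \text{average of } D$ on one hypothetical example, which is a sanity check rather than a proof.

```python
import math


def l1(x, y):
    """L1 divergence from (3)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chi2(x, y):
    """Chi-square divergence from (3); entries of x must be positive."""
    return sum((a - b) ** 2 / a for a, b in zip(x, y))


def kl(x, y):
    """Generalized Kullback-Leibler divergence from (3); entries must be positive."""
    return sum(a * math.log(a) - a * math.log(b) - a + b for a, b in zip(x, y))


def midpoint(u, v):
    return [(a + b) / 2.0 for a, b in zip(u, v)]


def midpoint_convex(div, x1, y1, x2, y2):
    """Check D(mid(x1,x2), mid(y1,y2)) <= (D(x1,y1) + D(x2,y2)) / 2 on one pair of points."""
    lhs = div(midpoint(x1, x2), midpoint(y1, y2))
    rhs = 0.5 * (div(x1, y1) + div(x2, y2))
    return lhs <= rhs + 1e-12


# Hypothetical points, used only for the spot check.
x1, y1 = [1.0, 2.0], [0.5, 3.0]
x2, y2 = [3.0, 0.5], [1.0, 1.0]
for name, div in (("L1", l1), ("chi2", chi2), ("KL", kl)):
    print(name, div(x1, y1), midpoint_convex(div, x1, y1, x2, y2))
```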
Then, we define the $\psi$-transform of $C(\alpha, \eta)$ as $\psi(\theta) = \tilde{\psi}^{**}(\theta)$, where $\tilde{\psi}(\theta) = H^-\left(\frac{1 + \theta}{2}\right) - H\left(\frac{1 + \theta}{2}\right)$, $\theta \in [-1, 1]$, and $g^{**}$ is the Fenchel-Legendre biconjugate of $g$, which is essentially the largest convex lower bound of $g$ [20].

The difference between likelihood ratio estimation and the classification setting is in the asymmetric scaling of the loss function for positive and negative examples. Let $\psi^-(x) = \psi(-x)$, $R^-(f) = \Pr(y = -1, f(x) > 1)$, $R^{-*} = \inf_f R^-(f)$, $R^+(f) = \Pr(y = 1, f(x) < 1)$ and $R^{+*} = \inf_f R^+(f)$ be the risk and Bayes risks on negative and positive examples, respectively. It is easy to prove that $R(f) - R^* = R^-(f) - R^{-*} + R^+(f) - R^{+*}$. We derived the following theorem:

Theorem 2 a) For any nonnegative loss function $C(\alpha, \eta)$, any measurable $f: X \to \mathbb{R}$, and any probability distribution on $X \times \{\pm 1\}$, $\psi^-(R^-(f) - R^{-*}) + \psi(R^+(f) - R^{+*}) \le R_C(f) - R_C^*$. b) The following conditions are equivalent: (1) $C$ is classification-calibrated; (2) for any sequence $(\theta_i)$ in $[0, 1]$, $\psi(\theta_i) \to 0$ if and only if $\theta_i \to 0$; (3) for every sequence of measurable functions $f_i: X \to \mathbb{R}$ and every probability distribution on $X \times \{\pm 1\}$, $R_C(f_i) \to R_C^*$ implies $R(f_i) \to R^*$.

The proof is given in an accompanying technical report [21]. This suggests that if $\psi$ is well-behaved, minimizing $R_C(f)$ still gives a reasonable surrogate for the classification risk. Compared against Theorem 3 in [20], which has the form $\psi(R(f) - R^*) \le R_C(f) - R_C^*$, the difference here stems from the different loss transforms used for the positive and the negative examples.

We consider an f-divergence of the likelihood as the loss function, i.e., $C(\alpha, \eta) = D\left(\alpha, \frac{\eta}{1 - \eta}\right)$, where $\frac{\eta}{1 - \eta}$ is the likelihood ratio when $\Pr(y = 1 \mid x) = \eta$. From convexity arguments, it can be easily seen that $H(\eta) = C\left(\frac{\eta}{1 - \eta}, \eta\right) = 0$ and $H^-(\eta) = D\left(1, \frac{\eta}{1 - \eta}\right)$, therefore $\psi(\theta) = D\left(1, \frac{1 + \theta}{1 - \theta}\right)$. The $\psi$ for all the loss functions listed in (3) can be computed accordingly. In fig. 1 (a) we show the $\psi(\theta)$ of $L_1$ and the KL-divergence from (3) and compare it against the hinge loss (where $\psi(\theta) = |\theta|$ [20]) used for SVM classification.
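The expression $\psi(\theta) = D(1, \frac{1+\theta}{1-\theta})$ can be tabulated directly for scalar arguments. The following sketch evaluates it for the $L_1$ and Kullback-Leibler measures and prints the hinge value $|\theta|$ alongside; the function names are ours, and the code evaluates the expression exactly as stated in the text, without the biconjugate step of the general definition.

```python
import math


def l1_divergence(x, y):
    return abs(x - y)


def kl_divergence(x, y):
    # Scalar form of the Kullback-Leibler measure; both arguments must be positive.
    return x * math.log(x) - x * math.log(y) - x + y


def psi(divergence, theta):
    """psi(theta) = D(1, (1 + theta) / (1 - theta)) for theta in (-1, 1)."""
    ratio = (1.0 + theta) / (1.0 - theta)
    return divergence(1.0, ratio)


def psi_hinge(theta):
    return abs(theta)


for theta in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(theta,
          round(psi(l1_divergence, theta), 4),
          round(psi(kl_divergence, theta), 4),
          psi_hinge(theta))
```

The table makes the asymmetry visible: for the $L_1$ measure the value stays bounded for negative $\theta$ and grows without bound as $\theta$ approaches 1, where the likelihood ratio itself diverges.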
It can be seen that our approximation of the classification loss is accurate when $\Pr(y_i = 1 \mid x_i)$ is small. However, likelihood estimation would severely penalize the misclassified positive examples with large $\Pr(y_i = 1 \mid x_i)$. This suggests that for the joint estimation of $f$ and $\Pr(y_i = 1 \mid x_i)$, the optimizer would tend to make $\Pr(y_i = 1 \mid x_i)$ smaller, in order to avoid high penalties, as shown in fig. 1 (b).

In fig. 1 (a) we plot $\psi$ functions for different losses. We prefer an $L_1$ measure as it is closer to the classification hinge loss, at least for the negative examples. It is easier to solve the nonparametric function estimation in RKHS using an epsilon-insensitive $L_1$ loss, since it can be reformulated as

support vector regression on the conditional likelihood, with the additional MIL constraints in (2):

$$\begin{aligned}
\min_{f, \eta^+} \quad & \sum_{x_j^+} \max\left(|f(x_j^+) - \eta_j^+| - \epsilon, 0\right) + \sum_{x_j^-} \max\left(|f(x_j^-)| - \epsilon, 0\right) + \lambda \|f\|^2 \\
\text{s.t.} \quad & \sum_{x_j^+ \in B_i} \eta_j^+ \ge D_i, \quad \eta_j^+ \ge 0
\end{aligned} \qquad (4)$$

where $\|f\|^2$ is the RKHS norm; $D_i$ is a constant for each bag and can be determined from Theorem 1, with appropriately chosen values for constants $\beta$ and $\delta$; $\eta_i^+$ is an estimate of $\frac{\Pr(y = 1 \mid x_i^+)}{\Pr(y = -1 \mid x_i^+)}$ for the training set. In this paper we use $\beta = 2$ and $\delta = 0.05$, which gives the estimate of the bound for each bag as $D_i = \frac{|B_i|}{4} + \sqrt{\frac{3}{14}|B_i|} + 1$ when $|B_i| \ge 3$, and $D_i = |B_i|$ when $|B_i| < 3$.

[Figure 1, three panels: (a) $\psi(\theta)$ against $\theta$ for the $L_1$ divergence, the KL divergence and the hinge loss; (b) estimated likelihood per example for positive and negative examples; (c) estimated likelihood per example for positive, negative and undecided examples.]

Figure 1: Loss functions and their influence on the estimation bias. (a) The function $\psi$ appearing in the losses used for likelihood estimation ($L_1$, KL-divergence) is similar to the hinge loss when θ > ; however it goes to infinity as $\theta$ approaches 1. This deviation essentially means the surrogate loss is going to be extremely large if an example with very large $\Pr(y_i = 1 \mid x_i)$ is misclassified. (b) Example estimated likelihood for a synthetic example. The estimated likelihood is biased towards smaller values. However, with a fully labeled training set, the threshold can still be obtained. (c) If we only know the label of the negative examples (blue) and the maximal positive example (red), determining the optimal threshold becomes non-trivial.

It can be proved that optimization problem (4) is jointly convex in both arguments. A standard representer theorem [24] would convert it to an optimization on vectors, which we omit here. The problem can be solved by different methods. The easiest one to implement is alternating minimization between solving for the SVM and projecting on the constraint sets given by $\sum_{x_j^+ \in B_i} \eta_j^+ \ge D_i$ and $\eta_j^+ \ge 0$.
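The projection half of the alternating scheme can be sketched for a single bag. The text does not specify which projection is used, so the code below assumes a Euclidean projection onto the set $\{\eta \ge 0, \sum_j \eta_j \ge D\}$; the function name is hypothetical. When clipping at zero already satisfies the sum constraint the clipped vector is returned, otherwise the sum constraint is active and the standard sort-based projection onto a scaled simplex applies.

```python
def project_bag_targets(values, d):
    """Euclidean projection of `values` onto {eta >= 0, sum(eta) >= d} (assumed setting)."""
    clipped = [max(v, 0.0) for v in values]
    if sum(clipped) >= d:
        # The nonnegative projection is already feasible, so it is the answer.
        return clipped
    # Otherwise the sum constraint is active: project onto {eta >= 0, sum(eta) = d}.
    ordered = sorted(values, reverse=True)
    running = 0.0
    shift = 0.0
    for count, u in enumerate(ordered, start=1):
        running += u
        candidate = (running - d) / count
        if u - candidate > 0.0:
            shift = candidate  # keep the shift of the largest valid support size
    return [max(v - shift, 0.0) for v in values]


# Hypothetical regression outputs for a bag of three instances with D = 1.5.
print(project_bag_targets([0.2, -1.0, 0.1], 1.5))
# A bag whose clipped outputs already satisfy the constraint is left unchanged.
print(project_bag_targets([2.0, -0.5], 1.0))
```

In an alternating loop, `values` would be the current regression outputs $f(x_j^+)$ on one positive bag and the returned vector the updated targets $\eta_j^+$ for the next regression step.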
As this can turn out to be slow for large datasets, approaches such as the dual SMO or primal subgradient projection algorithms (in the case of linear SVM) can be used. In this paper we implement the alternating minimization approach, which is provably convergent since the optimization problem (4) is convex. In the accompanying technical report [21] we derive an SMO algorithm based on the dual of (4) and characterize the basic properties of the optimization problem.

4 Bag and Instance Classification

If the likelihood ratio is obtained using an unbiased estimator, a decision rule based on $\mathrm{sign}(f(x) - 1)$ should give the optimal classifier. However, as previously argued, the joint estimation on $f$ and $\eta^+$ introduces a bias which is not always easy to identify. In positive bags, it is unclear whether an instance should be labeled positive or negative, as long as it does not contribute significantly to the classification error of its bag (fig. 1 (b), (c)). In the synthetic experiments, we noticed that knowledge of the correct threshold would make the algorithm outperform competitors by a large margin (fig. 2). This means that based on the learned likelihood ratio, the positive examples are usually well separated from the negative ones. Developing a theory that would advance these aspects remains a promising avenue for future work. The main difficulty stems from the compound source of bias, which arises from both the estimation of $\eta^+$ and the loss minimization over $\eta^+$ and $f$.

Here we propose a partial solution. Instead of directly estimating the threshold, we learn a linear combination of instance likelihood ratios to classify the bag. First, we sort the instance likelihood ratios for each bag into a vector of length $\max_i |B_i|$. We append 0 to bags that do not have enough


instances. Under this representation, bag classification turns into a standard binary decision problem where a vector and a binary label are given for each bag, and a linear SVM is learned to solve the problem. If we were to classify only the likelihood ratio on the first instance, this procedure would reduce to simple thresholding. We instead leverage information in the entire bag, aiming to constrain the classifier to learn the correct threshold. In this linear SVM setting, regularization never helps in practice and we always fix C to very large values. Effectively no parameter tuning is needed.

[Figure 2, six panels: (a), (b), (c) scatter and surface plots; (d) "Bag Classification Error", error averaged over runs against percent of positives in bags; (e) "Pattern Classification Error", error averaged over runs against percent of positives in bags; (f) "Witness Rate Estimate by SVR-SVM", estimated witness rate against percent of positives in bags. Legend entries in (d) and (e): AL-SVM: T=C, mi-SVM, AW-SVM, BEST THRESHOLD, SVR-SVM.]

Figure 2: Synthetic dataset (best viewed in color). (a) The true decision boundary. (b) Training points at 40% witness rate. (c) The learned regression function. (d) Bag misclassification rate of different algorithms. (e) Instance misclassification rate of different algorithms. (f) Estimated witness rate and true witness rate.

To classify instances, a threshold is still necessary. In the current system, we follow a simple approach and take the mean between two instances: the one with the highest likelihood among training bags that are predicted negative by the bag classifier, and the lowest scored one among instances in positive bags with a score higher than the previous one. This approach is derived from the basic MIL assumption that all instances in a negative bag are negative. Based on instance classification we could also estimate the witness rate of the dataset.
This is computed as the ratio between the number of positively classified instances and the total number of instances in the positive bags of the training set. Since our algorithm automatically adjusts to different witness rates, this estimate offers quantitative insight as to whether MIL should be used. For instance, if the witness rate is 100%, it may be more effective to use a conventional learning approach.

5 Experiments

5.1 Synthetic Data

We start with an experiment on the synthetic dataset of [15], where the controlled setting helps in understanding the behavior of the proposed algorithm. This is a 2-D dataset with the actual decision boundary shown in fig. 2 (a). The positive bags have a fraction of points sampled uniformly from the white region and the rest sampled uniformly from the black region. An example of the sample at 40% witness rate is shown in fig. 2 (b). In this figure, the plotted instance labels are the ones of their bags; indeed, one could notice many positive (blue) instances in the negative (red) region.

We have also experimented with a uniform threshold based on probabilistic estimates, as well as with predicting an instance-level threshold. While the former tends to underfit, the latter overfits. Our bag-level classifier targets an intermediate level of granularity and turns out to be the most robust in our experiments.
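The post-processing described in Section 4 (sorted bag feature vectors, the instance threshold and the witness rate estimate) can be sketched in a few helper functions. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sort is taken to be descending (so that the first entry is the largest ratio, in line with the remark about simple thresholding), bags are padded with 0, and the bag classifier itself is left out, so the set of bags predicted negative is passed in directly.

```python
def bag_feature_vectors(bags, pad_value=0.0):
    """Sort each bag's likelihood ratios (descending, assumed) and pad to the longest bag."""
    length = max(len(bag) for bag in bags)
    return [sorted(bag, reverse=True) + [pad_value] * (length - len(bag)) for bag in bags]


def instance_threshold(predicted_negative_bags, positive_bags):
    """Mean of the highest score in predicted-negative bags and the lowest
    positive-bag score that exceeds it."""
    neg_max = max(s for bag in predicted_negative_bags for s in bag)
    higher = [s for bag in positive_bags for s in bag if s > neg_max]
    if not higher:
        return neg_max  # fallback chosen for this sketch only
    return 0.5 * (neg_max + min(higher))


def witness_rate(positive_bags, threshold):
    """Fraction of instances in positive bags classified as positive."""
    scores = [s for bag in positive_bags for s in bag]
    return sum(1 for s in scores if s > threshold) / len(scores)


# Hypothetical likelihood-ratio estimates.
negative_bags = [[0.1, 0.4], [0.3]]
positive_bags = [[0.2, 1.0], [0.6, 0.05, 2.0]]
print(bag_feature_vectors(positive_bags))
thr = instance_threshold(negative_bags, positive_bags)
print(thr, witness_rate(positive_bags, thr))
```

The padded vectors would be the inputs of the linear SVM on bags; a low estimated witness rate is the case in which the MIL formulation is expected to matter most.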

Table 1: Performance of various MIL algorithms on weak labeling benchmarks. The best result on each dataset is shown in bold. The second group of algorithms either do not provide instance labels (MI-Kernel and miGraph) or require a parameter that can be difficult to tune (ALP-SVM). SVR-SVM appears to give consistent results among algorithms that provide instance labels. The row denoted Est. WR gives the estimated witness rates of our method.

Algorithm Musk- Musk-2 Elephant Tiger Fox CH-FD EMDD mi-svm MI-SVM MICA AW-SVM Ins-KI-SVM MI-Kernel 88.± ± ± ±.9 6.3±. migraph 88.9± ± ±.7 86.± ±.6 ALP-SVM SVR-SVM 87.9± ± ± ± ±3.5 Est. WR % 89.5 % 37.8 % 42.7 % %

In order to test the effect of witness rates, different types of datasets are created by varying the rates over the range 0.1, 0.2, ..., 1. In this experiment we fix the hyperparameters C = 5 and use a Gaussian kernel with σ =. We show a trained likelihood ratio function in fig. 2 (c), estimated on the dataset shown in fig. 2 (b). Under the likelihood ratio, the positive examples are well separated from the negatives. This illustrates how our proposed approach converts multiple-instance learning into the problem of deciding a one-dimensional threshold.

Complete results on datasets with different witness rates are shown in fig. 2 (d) and (e). We give both bag classification and instance classification results. Our approach is referred to as SVR-SVM. BEST THRESHOLD refers to a method where the best threshold was chosen based on the full knowledge of training/test instance labels, i.e., the optimal performance our likelihood ratio estimator can achieve. Comparison is done with two other approaches, the mi-SVM in [2] and the AW-SVM from [15]. SVR-SVM generally works well when the witness rate is not very low. From instance classification, one can see that the original mi-SVM is only competitive when the witness rate is near 1; this situation is close to a supervised SVM.
With the deterministic annealing approach in [15], AW-SVM and mi-SVM behave in quite the opposite way: they are competitive when the witness rate is small but degrade when it is large. Presumably this is because deterministic annealing is initialized with the a priori assumption that datasets are multiple-instance, i.e. have a small witness rate [15]. When the witness rate is large, annealing does not improve performance. In contrast, the proposed SVR-SVM does not appear to be affected by the witness rate. With the same parameters used across all the experiments, the method self-adjusts to different witness rates. One can see the effect especially in fig. 2 (e): regardless of the witness rate, the instance error rate remains roughly the same. However, this is still inferior to our model based on the best threshold, which indicates that important room for improvement exists.

5.2 MIL Datasets

The algorithm is evaluated on a number of popular MIL benchmarks. We use the common experimental setting, based on 10-fold cross-validation for parameter selection, and we report the test results averaged over 10 trials. The results are shown in Table 1, together with other competitive methods from the literature [2, 15, 10] (for some of these methods standard deviation estimates are not available).

In our tests, the proposed SVR-SVM gives consistently good results among algorithms that provide instance-level labels. The only atypical case is Tiger, where the algorithm underperforms other methods. Overall, the performance of SVR-SVM is slightly worse than miGraph and ALP-SVM. But we note that results in ALP-SVM are obtained by tuning the witness rate to the optimal value, which may be difficult in practical settings. The slightly lower performance compared to miGraph suggests that we may be inferior in the bag classification step, which we already know is suboptimal.

Table 2: Results from 20 Newsgroups. The best result on each dataset is shown in bold; pairwise t-tests are performed to determine if the differences are statistically significant. miGraph is dominating in datasets, whereas SVR-SVM is dominating in 4.

Dataset | MI-Kernel | miGraph [10] | miGraph (web) | SVR-SVM | Est. WR
alt.atheism 6.2 ± ± ± ±.7.83 %
comp.graphics 47. ± ± ± ± %
comp.windows.misc 5. ± ±.5 7. ± ± %
comp.ibm.pc.hardware 46.9 ± ± ± ± %
comp.sys.mac.hardware 44.5 ± ± ± 78. ± %
comp.window.x 5.8 ± ± ± ± %
misc.forsale 5.8 ± ± ± 72.3 ± %
rec.autos 52.9 ± ± ± ± %
rec.motorcycles 5.6 ± ± ± ± %
rec.sport.baseball 5.7 ± ± ± ± %
rec.sport.hockey 5.3 ± ± ± 89.3 ± %
sci.crypt 56.3 ± ± ± ± %
sci.electronics 5.6 ± ± ± 9.5 ± %
sci.med 5.6±.9 62.± ± ± %
sci.space 54.7 ± ± ± ± %
soc.religion.christian 49.2 ± ± ± ± %
talk.politics.guns 47.7 ± ± ± ± %
talk.politics.mideast 55.9 ± ± ±. 8.5 ± %
talk.politics.misc 5.5 ± ± ± ± %
talk.religion.misc 55.4 ± ± ±. 7.9 ± %

5.3 Text Categorization

The text datasets are taken from [10]. These data have the benefit of being designed to have a small witness rate. Thus they serve as a better MIL benchmark compared to the previous ones. They are derived from the 20 Newsgroups corpus, with 50 positive and 50 negative bags for each of the 20 news categories. Each positive bag has around 3% witness rate. We run 10-fold cross-validation 10 times on each dataset and compute the average accuracy and standard deviations; C is fixed to, ε to .2. The authors of [10] reported recent results for this dataset on their website, which are vastly superior to the ones reported in the paper. Therefore, in Table 2 we included both results in the comparison, identified as miGraph (paper) and miGraph (website), respectively. Our SVR-SVM performs significantly better than MI-Kernel and miGraph (paper). It is comparable with miGraph (web), and offers a marginal improvement.
It is interesting that even though we use a suboptimal second step, SVR-SVM fares well against the state of the art. This shows the potential of methods based on likelihood ratio estimators for multiple instance learning.

6 Conclusion

We have proposed an approach to multiple-instance learning based on estimating the likelihood ratio between the positive and the negative classes on instances. The MIL constraint is reformulated into a convex constraint on the likelihood ratio, where a joint estimation of both the function and the target ratios on the training set is performed. Theoretically we justify that learning the likelihood ratio is Bayes-consistent and has desirable excess loss transform properties. Although we are not able to find the optimal classification threshold on the estimated ratio function, our proposed bag classifier based on such ratios obtains state-of-the-art results on a number of difficult datasets. In future work, we plan to explore transductive learning techniques in order to leverage the information in the learned ratio function and identify better threshold estimation procedures.

Acknowledgements

This work is supported, in part, by the European Commission, under a Marie Curie Excellence Grant MCEXT

References

[1] Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89 (1997) 31-71
[2] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS. (2003)
[3] Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In: NIPS. (1998)
[4] Felzenszwalb, P.F., McAllester, D.A., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. (2008)
[5] Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database and web-based tool for image annotation. IJCV 77(1-3) (2008)
[6] Cour, T., Sapp, B., Nagle, A., Taskar, B.: Talking pictures: Temporal grouping and dialog-supervised person recognition. In: CVPR. (2010)
[7] Zeisl, B., Leistner, C., Saffari, A., Bischof, H.: On-line semi-supervised multiple-instance boosting. In: CVPR. (2010)
[8] Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: ICML. (2002)
[9] Tao, Q., Scott, S., Vinodchandran, N.V., Osugi, T.T.: SVM-based generalized multiple-instance learning via approximate box counting. In: ICML. (2004)
[10] Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: ICML. (2009)
[11] Cheung, P.M., Kwok, J.T.: A regularization framework for multiple-instance learning. In: ICML. (2006) 193-200
[12] Fung, G., Dundar, M., Krishnapuram, B., Rao, R.B.: Multiple instance learning for computer aided diagnosis. In: NIPS. (2007)
[13] Mangasarian, O., Wild, E.: Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 137 (2008)
[14] Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: ICML. (2002)
[15] Gehler, P., Chapelle, O.: Deterministic annealing for multiple-instance learning. In: AISTATS. (2007)
[16] Dollár, P., Babenko, B., Belongie, S., Perona, P., Tu, Z.: Multiple component learning for object detection. In: ECCV. (2008)
[17] Li, Y.F., Kwok, J.T., Tsang, I.W., Zhou, Z.H.: A convex method for locating regions of interest with multi-instance learning. In: ECML. (2009)
[18] Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Annals of Statistics 27 (1999)
[19] Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (2004)
[20] Bartlett, P., Jordan, M.I., McAuliffe, J.: Convexity, classification and risk bounds. Journal of the American Statistical Association 101 (2006)
[21] Li, F., Sminchisescu, C.: Convex multiple instance learning by estimating likelihood ratio. Technical report, Institute for Numerical Simulation, University of Bonn (November 2010)
[22] Nguyen, X., Wainwright, M., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: NIPS. (2007)
[23] Liese, F., Vajda, I.: Convex Statistical Distances. Teubner VG (1987)
[24] Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. The Annals of Statistics 36 (2008)

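Supplementary material to the paper
Convex Multiple-Instance Learning by Estimating Likelihood Ratio

October 5, 2

1 Proof of Theorem 1

Lemma 1 If $2x$ assumes a beta distribution with parameters $(1, \beta)$, the expectations of $x, x^2, x^3, x^4$ are

\[
E(x) = \frac{1}{2(\beta+1)}, \qquad
E(x^2) = \frac{1}{2(\beta+2)(\beta+1)},
\]
\[
E(x^3) = \frac{3}{4(\beta+1)(\beta+2)(\beta+3)}, \qquad
E(x^4) = \frac{3}{2(\beta+1)(\beta+2)(\beta+3)(\beta+4)}.
\]

Proof: It is known that the expectation of the beta distribution $B(1, \beta)$ is $E(x) = \frac{1}{\beta+1}$, therefore the first equation is immediate. For $k > 1$, the expectations $E(x^k)$ of a beta distribution $B(\alpha, \beta)$ can be computed as:

\begin{align*}
E(x^k) &= \int_0^1 \frac{1}{B(\alpha,\beta)}\, x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx \\
&= \frac{\Gamma(\alpha+\beta)\Gamma(\alpha+k-1)}{\Gamma(\alpha+k-1+\beta)\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+k-1+\beta)}{\Gamma(\alpha+k-1)\Gamma(\beta)} \int_0^1 x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx \\
&= \frac{\Gamma(\alpha+\beta)\Gamma(\alpha+k-1)}{\Gamma(\alpha+k-1+\beta)\Gamma(\alpha)} \int_0^1 x \cdot \frac{1}{B(\alpha+k-1,\beta)}\, x^{\alpha+k-2}(1-x)^{\beta-1}\,dx \\
&= \frac{\prod_{j=0}^{k-2}(\alpha+j)}{\prod_{j=0}^{k-2}(\alpha+\beta+j)} \cdot \frac{\alpha+k-1}{\alpha+k-1+\beta} \qquad (k \ge 2). \tag{1}
\end{align*}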

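The last step uses the formulae $B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ and $\Gamma(x+1) = x\Gamma(x)$, as well as the expectation of the beta distribution $E(x) = \frac{\alpha}{\alpha+\beta}$. If $2x$ instead of $x$ assumes the beta distribution, then

\[
E(x^k) = \frac{1}{2^k} E((2x)^k) = \frac{1}{2^k} \cdot \frac{\prod_{j=0}^{k-2}(\alpha+j)}{\prod_{j=0}^{k-2}(\alpha+\beta+j)} \cdot \frac{\alpha+k-1}{\alpha+k-1+\beta}. \tag{2}
\]

Substituting $\alpha = 1$ and $k = 2, 3, 4$, we obtain the Lemma.

Proof of Theorem 1:

Proof: Define $\eta(x_i) = \Pr(y = 1 \mid x_i, \Pr(y = -1 \mid x_i) > \Pr(y = 1 \mid x_i))$. The most adversarial scenario is $\Pr(1 - 2\eta(x_i) \le 2\epsilon) = c\epsilon^\beta$, because this maximizes uniformly the chance that the class-conditional probability is close to $1/2$. This means $2\eta(x_i)$ assumes a beta distribution $B(1, \beta)$. It can be proved that the likelihood ratio satisfies

\[
\frac{9}{4}\eta^2 + \frac{3}{4}\eta \le \frac{\eta}{1-\eta} \le 2\eta^2 + \eta .
\]

From Lemma 1, we know that $E(\eta^2) = \frac{1}{2(\beta+2)(\beta+1)}$ and $E(\eta) = \frac{1}{2(\beta+1)}$, therefore we have

\[
\frac{3(\beta+5)}{8(\beta+1)(\beta+2)} \le E\left(\frac{\eta}{1-\eta}\right) \le \frac{\beta+4}{2(\beta+1)(\beta+2)} .
\]

Furthermore, a bound on the variance can be computed as

\begin{align*}
V\left(\frac{\eta}{1-\eta}\right) &= E\left[\left(\frac{\eta}{1-\eta}\right)^2\right] - E\left[\frac{\eta}{1-\eta}\right]^2 \\
&\le 4E(\eta^4) + 4E(\eta^3) + E(\eta^2) - \frac{9(\beta+5)^2}{64(\beta+1)^2(\beta+2)^2} \\
&= \frac{1}{(\beta+1)(\beta+2)} \left( \frac{\beta^2+13\beta+48}{2(\beta+3)(\beta+4)} - \frac{9(\beta+5)^2}{64(\beta+1)(\beta+2)} \right),
\end{align*}

in which $E(\eta^4) = \frac{3}{2(\beta+1)(\beta+2)(\beta+3)(\beta+4)}$ and $E(\eta^3) = \frac{3}{4(\beta+1)(\beta+2)(\beta+3)}$. The last step involves some quite complicated arithmetic and simplification. A simpler bound can be expressed as:

\[
V\left(\frac{\eta}{1-\eta}\right) \le \frac{4\beta+1}{2(\beta+1)^2(2\beta+3)},
\]

which is an upper bound of the above bound. By Bennett's inequality [1],

\[
\frac{1}{n}\sum_{i=1}^{n} Z_i \le E(Z) + \sqrt{\frac{2V(Z)\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{3n} .
\]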
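The closed forms in Lemma 1 and the bounds on $\eta/(1-\eta)$ used in this proof can be checked with exact rational arithmetic. The following is a minimal sketch in Python. All function names are hypothetical, and the constants in the pointwise, mean, and variance bounds follow our reading of the inequalities in this section, so they are assumptions to be verified rather than code from the paper.

```python
from fractions import Fraction

def half_beta_moment(beta, k):
    """E(x^k) when 2x ~ Beta(1, beta), via the product formula (2) with alpha = 1."""
    beta = Fraction(beta)
    m = Fraction(1)
    for j in range(k):
        m *= Fraction(1 + j) / (1 + beta + j)  # E((2x)^k) = prod_j (1+j)/(1+beta+j)
    return m / 2**k

def lemma_closed_form(beta, k):
    """Closed forms stated in Lemma 1 for k = 1, 2, 3, 4."""
    b = Fraction(beta)
    forms = {
        1: 1 / (2 * (b + 1)),
        2: 1 / (2 * (b + 1) * (b + 2)),
        3: Fraction(3) / (4 * (b + 1) * (b + 2) * (b + 3)),
        4: Fraction(3) / (2 * (b + 1) * (b + 2) * (b + 3) * (b + 4)),
    }
    return forms[k]

def ratio_bounds_hold(eta):
    """Pointwise bounds on eta/(1-eta) for 0 <= eta <= 1/2 (assumed reading)."""
    eta = Fraction(eta)
    ratio = eta / (1 - eta)
    lower = Fraction(9, 4) * eta**2 + Fraction(3, 4) * eta
    upper = 2 * eta**2 + eta
    return lower <= ratio <= upper

def mean_bounds(beta):
    """Lower and upper bounds on E(eta/(1-eta))."""
    b = Fraction(beta)
    lower = 3 * (b + 5) / (8 * (b + 1) * (b + 2))
    upper = (b + 4) / (2 * (b + 1) * (b + 2))
    return lower, upper

def variance_bounds(beta):
    """Explicit variance bound and the simpler closed-form bound."""
    b = Fraction(beta)
    m2, m3, m4 = (half_beta_moment(b, k) for k in (2, 3, 4))
    lower_mean, _ = mean_bounds(b)
    explicit = 4 * m4 + 4 * m3 + m2 - lower_mean**2
    simple = (4 * b + 1) / (2 * (b + 1) ** 2 * (2 * b + 3))
    return explicit, simple
```

For a handful of values of $\beta$, the explicit variance bound stays below the simpler one, which is what the proof requires. The check is a sanity test on sampled values only, not a proof for all $\beta$.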

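Therefore, with probability $1 - \delta$,

\[
\sum_{i=1}^{n} Z_i \le nE(Z) + \sqrt{2nV(Z)\log(1/\delta)} + \frac{\log(1/\delta)}{3} .
\]

Taking $Z = \frac{\eta}{1-\eta}$ and plugging in the mean and variance bounds computed previously, we obtain the theorem.

2 Proof of Theorem 2

(a) It is straightforward to show that

\begin{align*}
R(f) - R^* &= R(f) - R(\eta - 1/2) \\
&= E([f(X) > 1, \eta(X) < 1/2](1 - 2\eta(X))) + E([f(X) < 1, \eta(X) > 1/2](2\eta(X) - 1)) \\
&= (R^-(f) - R^{-*}) + (R^+(f) - R^{+*}) .
\end{align*}

We apply Jensen's inequality to both summands and obtain

\begin{align*}
& \psi(-[\eta(X) < 1/2](R(f) - R^*)) + \psi([\eta(X) \ge 1/2](R(f) - R^*)) \\
\le{}& E\big(\psi(-[\eta(X) < 1/2, f(X) > 1](1 - 2\eta(X))) + \psi([\eta(X) \ge 1/2, f(X) < 1](2\eta(X) - 1))\big) \\
={}& E\big([\mathrm{sign}(f(X) - 1) \ne \mathrm{sign}(\eta(X) - 1/2)]\, \psi(2\eta(X) - 1)\big) \\
\le{}& E\big([\mathrm{sign}(f(X) - 1) \ne \mathrm{sign}(\eta(X) - 1/2)]\, \tilde{\psi}(2\eta(X) - 1)\big) \\
={}& E\big([\mathrm{sign}(f(X) - 1) \ne \mathrm{sign}(\eta(X) - 1/2)]\, (H^-(\eta(X)) - H(\eta(X)))\big) \\
={}& E\Big([\mathrm{sign}(f(X) - 1) \ne \mathrm{sign}(\eta(X) - 1/2)] \Big( \inf_{\alpha : (\alpha - 1)(2\eta(X) - 1) \le 0} C(\alpha, \eta(X)) - H(\eta(X)) \Big)\Big) \\
\le{}& E\big(C(f(X), \eta(X)) - H(\eta(X))\big) \\
={}& R_C(f) - R_C^* .
\end{align*}

(b) First note that since $\psi(0) = 0$ and $\psi$ is continuous, $\theta_i \to 0$ implies $\psi(\theta_i) \to 0$. Thus we can replace condition (2) by

(2') For any sequence $(\theta_i)$ in $[0, 1]$, $\psi(\theta_i) \to 0$ implies $\theta_i \to 0$.

To see that (1) implies (2'), let $C$ be classification-calibrated, and let $(\theta_i)$ be a sequence that does not converge to 0. Define $c = \limsup \theta_i > 0$, and pass to a subsequence with $\lim \theta_i = c$. Then by continuity $\lim \psi(\theta_i) = \psi(c)$, and $\psi(c) > 0$ by classification-calibration. Thus for the original sequence $(\theta_i)$, we see $\limsup \psi(\theta_i) > 0$, so we cannot have $\psi(\theta_i) \to 0$.

To see that (2') implies (3), suppose that $R_C(f_i) \to R_C^*$. By part (a) of the theorem, $\psi(-(R^-(f_i) - R^{-*})) + \psi(R^+(f_i) - R^{+*}) \to 0$. Since $\psi$ is convex and nonnegative, both terms converge to 0, and (2') implies $R(f_i) \to R^*$.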

Finally, to see (3) implies (1), suppose $C$ is not classification-calibrated. By definition, we can choose $\eta \ne 1/2$ and a sequence $\alpha_1, \alpha_2, \ldots$ such that $\mathrm{sign}((\alpha_i - 1)(\eta - 1/2)) = -1$ but $C_\eta(\alpha_i) \to H(\eta)$. Fix $x$ and choose the probability distribution $P$ so that $P_X\{x\} = 1$ and $P(Y = 1 \mid X = x) = \eta$. Define a sequence of functions $f_i$ for which $f_i(x) = \alpha_i$. Then $\lim R(f_i) > R^*$, and this is true for any infinite subsequence. But $C_\eta(\alpha_i) \to H(\eta)$ implies $R_C(f_i) \to R_C^*$. The contradiction proves the final part of the theorem.

3 Equivalent Constant Transforms

Proposition 1 For loss functions that satisfy a scaling equality $L(k_1\hat{y}, k_1 y) = k_2 L(\hat{y}, y)$, solving (4) with $C = C_0$ and $D_i$ for each bag is equivalent with solving (3) with $C = \frac{k_1^2}{k_2} C_0$ and $k_1 D_i$.

Proof: A variable substitution of $z = k_1 y$ and $v = k_1 w$ in (3) gives the conclusion.

The support vector regression (SVR) we use satisfies $L(k_1\hat{y}, k_1 y; \epsilon) = k_1 L(\hat{y}, y; \frac{\epsilon}{k_1})$. In this case, a uniform scaling of all the $D_i$ is equivalent to corresponding changes in both $C$ and $\epsilon$.

4 Projection to the Bag Constraint

We use the following procedure to project $y^+$. Denote the sum of scores in a bag as $s_i = \sum_{x_j^+ \in B_i} y_j^+$. For each bag $B_i$, we check whether $s_i < D_i$. If not, we simply set all the negative $y_j^+$ to 0. If $s_i < D_i$, we add $\frac{D_i - s_i}{|B_i|}$ to each $y_j^+$, which makes $s_i = D_i$. Then we set all the negative $y_j^+$ to 0. If this makes $s_i > D_i$, we subtract an equal amount from all the positive $y_j^+$ to make $s_i = D_i$. If this creates additional negative $y_j^+$, the alternating projection process goes on until both conditions, $s_i = D_i$ and $y^+ \ge 0$, are fulfilled.

5 Derivation of the SVM dual problem

First rewrite the optimization in the canonical form:

\begin{align*}
\min_{w, b, y^+} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n^+ + n^-} (\xi_i + \hat{\xi}_i) \\
\text{s.t.} \quad & (\langle w, \phi(x_i) \rangle + b) - y_i \le \epsilon + \xi_i, \quad \text{for each } x_i^+, x_i^- \\
& y_i - (\langle w, \phi(x_i) \rangle + b) \le \epsilon + \hat{\xi}_i, \quad \text{for each } x_i^+, x_i^- \\
& \sum_{x_j^+ \in B_i} y_j^+ \ge D_i |B_i| \\
& y_j^+ \ge 0
\end{align*}

where $y_i = 0$ for $x_i^-$ and $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
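The projection procedure of Section 4 can be sketched for a single bag as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation: the function name `project_bag` and the tolerance `tol` are hypothetical, and the sketch assumes a non-empty bag with $D_i > 0$, so that at least one score stays positive after clipping.

```python
import numpy as np

def project_bag(y, D, tol=1e-12):
    """Project the scores y of one bag so that sum(y) >= D and y >= 0.

    Follows the alternating procedure described in the text; assumes D > 0
    and a non-empty bag (both are assumptions of this sketch).
    """
    y = np.asarray(y, dtype=float).copy()
    if D <= 0:
        raise ValueError("this sketch assumes D > 0")
    s = y.sum()
    if s >= D:
        # Bag constraint already holds: only clip the negative scores.
        return np.maximum(y, 0.0)
    # Shift every score equally so that the bag sum equals D.
    y += (D - s) / y.size
    # Each pass either finishes or permanently zeroes at least one score,
    # so at most y.size passes are needed.
    for _ in range(y.size + 1):
        y = np.maximum(y, 0.0)           # enforce y >= 0
        excess = y.sum() - D             # clipping can only raise the sum
        if excess <= tol:
            break
        positive = y > 0
        y[positive] -= excess / positive.sum()  # restore sum(y) = D
    return y
```

For example, `project_bag([-1.0, 0.5, 0.2], 1.0)` first shifts the scores so they sum to 1, clips the single negative entry, and then removes the resulting excess equally from the two remaining positive entries.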

Setting $\alpha_i^-, \alpha_i^+$ as the Lagrange multipliers of the first and second set of constraints, and $\eta_j$ as the Lagrange multipliers of the bag constraints, we obtain the following KKT conditions:

\begin{align*}
& w = \sum_i (\alpha_i^+ - \alpha_i^-)\phi(x_i) \\
& \alpha_i^+, \alpha_i^- \le C \\
& \eta_j \le \alpha_i^+ - \alpha_i^- \\
& \sum_i (\alpha_i^+ - \alpha_i^-) = 0 \\
& \alpha_i^+, \alpha_i^-, \eta_j \ge 0
\end{align*}

From the KKT conditions one can see that to make the equality constraint $\sum_{x_j^+ \in B_i} y_j^+ = D_i |B_i|$ hold, we need $\eta_j > 0$; essentially, each $\alpha_i^+$ in the bag must be positive. This means that $y_i - (\langle w, \phi(x_i) \rangle + b) \ge \epsilon$ for all the items in the bag. Thus, to make the equality hold, all items in the positive bag need to be predicted smaller than their real value $y_i$.

Another issue is that, since $\eta \ge 0$ must hold, we have $\alpha_i^+ \ge \alpha_i^-$. Since only one of $\alpha_i^+$ and $\alpha_i^-$ is nonzero, this means $\alpha_i^- = 0$ for instances in positive bags. Therefore, the situation $\alpha_i^+ = 0$, $\alpha_i^- > 0$ can never occur. Back in the original optimization problem, this means $(\langle w, \phi(x_i) \rangle + b) - y_i \le \epsilon + \xi_i$ never holds with equality. Therefore, $(\langle w, \phi(x_i) \rangle + b) \le y_i + \epsilon$.

The dual problem is very similar to that of the original SVM, with only minor differences introduced by $\eta$:

\begin{align*}
\min_{\alpha^+, \alpha^-, \eta} \quad & \frac{1}{2}(\alpha^+ - \alpha^-)^T K (\alpha^+ - \alpha^-) + \epsilon \sum_{i=1}^{n} (\alpha_i^+ + \alpha_i^-) - \sum_{i=1}^{k} D_i |B_i| \eta_i \\
\text{s.t.} \quad & \alpha_i^+, \alpha_i^- \le C \\
& \sum_{i=1}^{n} (\alpha_i^+ - \alpha_i^-) = 0 \\
& \eta_j \le \alpha_i^+ - \alpha_i^-, \quad \forall x_i \in B_j \\
& \alpha_i^+, \alpha_i^-, \eta_j \ge 0
\end{align*}

From the dual problem, we can see that $\eta_j$ needs to be made larger to improve the dual objective. Therefore, the algorithm would always prefer to choose $\eta_j > 0$. From our previous analysis this essentially means that the equality constraint $\sum_{x_j^+ \in B_i} y_j^+ = D_i |B_i|$ is desirable. Thus the algorithm tends to make more vectors from positive bags support vectors, and tends to make predictions smaller than the estimated value $y_i$. It can be seen that the best solution for $\eta_j$ is $\eta_j = \max(\min_{x_i \in B_j} (\alpha_i^+ - \alpha_i^-), 0)$. With this in mind, the problem can still be solved using an SMO-type active set approach that selects two $\alpha_i$ per iteration.
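The closed-form choice of $\eta_j$ stated above can be sketched in a few lines. This is an illustrative Python/NumPy fragment under stated assumptions: the function name `bag_multipliers` and the encoding of bag membership as an integer array `bag_index` (entry $i$ gives the bag of instance $x_i$) are hypothetical and not taken from the paper.

```python
import numpy as np

def bag_multipliers(alpha_plus, alpha_minus, bag_index, n_bags):
    """eta_j = max(min over x_i in B_j of (alpha_i^+ - alpha_i^-), 0).

    bag_index[i] is the (hypothetical) integer id of the positive bag that
    contains instance i; bags with no instances keep eta_j = 0.
    """
    diff = np.asarray(alpha_plus, dtype=float) - np.asarray(alpha_minus, dtype=float)
    bag_index = np.asarray(bag_index)
    eta = np.zeros(n_bags)
    for j in range(n_bags):
        members = diff[bag_index == j]
        if members.size:
            # Largest eta_j satisfying eta_j <= alpha_i^+ - alpha_i^- for every
            # instance in the bag, together with eta_j >= 0.
            eta[j] = max(members.min(), 0.0)
    return eta
```

In an SMO-style solver, a helper like this would be re-evaluated for the bags touched by the two updated multipliers, so that the gradient used for working set selection reflects the current $\eta$.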

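The SMO subproblem is thus:

\begin{align*}
\min_{\alpha_i, \alpha_j} \quad & \frac{1}{2}
\begin{bmatrix} s_i\alpha_i & s_j\alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} s_i\alpha_i \\ s_j\alpha_j \end{bmatrix}
+ \epsilon(\alpha_i + \alpha_j) - D_{k_1}|B_{k_1}|\eta_{k_1} - D_{k_2}|B_{k_2}|\eta_{k_2} \\
\text{s.t.} \quad & \alpha_i, \alpha_j \le C \\
& s_i\alpha_i + s_j\alpha_j \text{ doesn't change}
\end{align*}

where $s_i$ and $s_j$ are the signs of $\alpha_i$ and $\alpha_j$, respectively. The SMO working set selection would be the same as in the original SVM, i.e., find the maximal violating pair, except that the gradient now needs to take $\eta$ into consideration.

References

[1] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13-30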


Convex Multiple-Instance Learning by Estimating Likelihood Ratio Fuxin Li and Cristian Sminchisescu Institute for Numerical Simulation, University of Bonn {fuxin.li,cristian.sminchisescu}@ins.uni-bonn.de