Convex Multiple-Instance Learning by Estimating Likelihood Ratio


Convex Multiple-Instance Learning by Estimating Likelihood Ratio

Fuxin Li and Cristian Sminchisescu
Institute for Numerical Simulation, University of Bonn

Abstract

We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as a joint estimation of both a likelihood ratio predictor and the target (the likelihood ratio variable) for each instance. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss and the risk under a loss function for likelihood ratio estimation. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.

1 Introduction

Multiple Instance Learning (MIL) was proposed more than a decade ago as a methodology for learning models under weak labeling constraints [1]. Unlike traditional binary classification problems, the positive items are represented as bags, which are sets of instances; a feature vector represents each instance in the bag. There is an OR relationship within a bag: if one of the feature vectors is classified as positive, the entire bag is considered positive. A simple intuition is the following: one has a number of keys and faces a locked door. To open the door, only one matching key is needed. MIL is a natural weak labeling formulation for text categorization [2] and computer vision problems [3]. In document classification, one is given files made of many sentences, and often only a few are relevant. In computer vision, an image can be decomposed into different regions, and only some delineate objects. Therefore, MIL can be used in sophisticated tasks, such as identifying the location of object parts from bounding box information in images [4]. Although efforts have been made to provide datasets with increasingly detailed supervisory information [5], without automation such a level of detail becomes prohibitive for large datasets or for more complicated data such as video [6, 7]. In these cases one necessarily needs to resort to multiple-instance learning.

MIL is interesting mainly because of its potential to provide instance-level labels from weak supervisory information. However, the state of the art in MIL is often obtained by simply using a weighted sum of kernel values between all instance pairs within the bags, while ignoring the prediction of instance labels [8, 9, 10]. It is intriguing why MIL algorithms that exploit instance-level information cannot achieve better performance, as constraints at the instance level seem abundant: none of the negative instances is positive. This should provide additional constraints in defining the region of positive instances and should help classification in input space. A major challenge is the non-convexity of many instance-level MIL algorithms [2, 11, 12, 13, 14]. Most of these algorithms perform alternating minimization on the classifier and the instance weights.
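To make the bag-level labeling rule concrete, the following minimal sketch (our own illustration, not the authors' code) shows how bag labels relate to instance predictions under the OR assumption, i.e. the max constraint discussed in Section 2:

```python
import numpy as np

def bag_prediction(f, bag):
    """A bag is predicted positive iff at least one instance is predicted positive,
    i.e. sign(max_i f(x_i)); this is the (non-convex) max constraint."""
    scores = np.array([f(x) for x in bag])
    return 1 if scores.max() > 0 else -1

# toy example with a linear scoring function f(x) = w.x
w = np.array([1.0, -1.0])
f = lambda x: float(np.dot(w, x))

positive_bag = [np.array([0.2, 0.9]), np.array([1.5, 0.1])]   # second instance is a witness
negative_bag = [np.array([-0.5, 0.3]), np.array([0.1, 0.8])]  # no instance scores positive

print(bag_prediction(f, positive_bag))   # 1
print(bag_prediction(f, negative_bag))   # -1
```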

This procedure usually gives only a local optimum, since the objective is non-convex. The benchmark performance of MIL methods is overall quite similar, although the techniques differ significantly: some assign binary weights to instances [2], some assign real weights [12, 13], yet others use probabilistic formulations [14]; some optimize using conventional alternating minimization, others use convex-concave procedures [11]. Gehler and Chapelle [15] have recently performed an interesting analysis of the MIL costs, where deterministic annealing (DA) was used to compute better local optima for several formulations. In the case of a previous mi-SVM formulation [2], annealing methods did not improve the performance significantly. A new algorithm, ALP-SVM, was also introduced, which uses a preset parameter defining the fixed ratio of witnesses (the true positive instances in a positive bag). Excellent results were obtained with this witness rate parameter set to the correct value. However, in practice it is unclear whether this value can be known beforehand and whether it is stationary across different bags. In principle, the witness rate should also be estimated, and this learning stage partially accounts for the non-convexity of the MIL problem. It remains unclear, however, whether the observed performance variations are caused by non-convexity or by other modeling aspects.

Although performance considerations have hindered the application of MIL to practical problems, the methodology has started to gain momentum recently [4, 6, 7, 16]. The success of the Latent SVM for person detection [4] shows that a standard MIL procedure (the reformulation of the alternating minimization MI-SVM algorithm in [2]) can achieve good results if properly initialized. However, proper initialization of MIL remains elusive in general, as it often requires engineering experience with the individual problem structure. Therefore, it is still of broad interest to develop an initialization-independent formulation for MIL. Recently, Li et al. [17] proposed a convex instance-level MIL algorithm based on multiple kernel learning, where one kernel is used for each possible combination of instances. This creates an exponential number of constraints and requires a cutting-plane solver. Although the formulation is convex, its scalability drops significantly for bags with many instances.

In this paper we make an alternative attempt towards a convex formulation: we establish that the non-convex MIL constraints can be recast reliably into convex constraints on the likelihood ratio between the positive and the negative class for each instance. We transform the multiple-instance learning problem into a convex joint estimation of the likelihood ratio function and the likelihood ratio values on the training instances. The choice of jointly convex loss functions is rich, notably including a family of f-divergences. Theoretically, we prove consistency results for likelihood ratio estimation, showing that f-divergence loss functions upper bound the classification 0-1 loss tightly, unless the likelihood is very large. A support vector regression scheme is implemented to estimate the likelihood ratio, and it is shown to separate positive and negative instances well. However, determining the correct threshold for instance classification from the training set remains non-trivial. To address this problem, we propose a post-processing step based on a bag classifier computed as a linear combination of likelihood ratios.
While this is shown to be suboptimal in synthetic experiments, it still achieves state-of-the-art results on practical datasets, demonstrating the potential of the proposed approach.

2 Convex Reformulation of the Multiple Instance Constraint

Let us consider a learning problem with $n$ training instances in total, $n_+$ positive and $n_-$ negative. In negative bags every instance is negative, hence we do not separately define such bags; instead we work directly with the instances. Let $\mathcal{B} = \{B_1, B_2, \ldots, B_k\}$ be the positive bags and $X^+ = \{x_1^+, x_2^+, \ldots, x_{n_+}^+\}$, $X^- = \{x_1^-, x_2^-, \ldots, x_{n_-}^-\}$ be the training input, where each $x_i^+$ belongs to a positive bag $B_j$ and each $x_i^-$ is a negative instance. The goal of multiple instance learning is, given $\{X^+, X^-, \mathcal{B}\}$, to learn a decision rule, $\mathrm{sign}(f(x))$, that predicts the label in $\{+1, -1\}$ for a test instance $x$.

The MIL problem can be characterized by two properties: 1) negative-exclusion: if none of the instances in a bag is positive, the bag is not positive; 2) positive-identifiability: if one of the instances in the bag is positive, the bag is positive. These properties are equivalent to a constraint $\max_{x_i\in B_j} f(x_i) \ge 0$ on positive bags. This constraint is not convex, since the negative max function is concave. Reformulation into a sum constraint such as $\sum_{x_i\in B_j} f(x_i) \ge 0$ would be convex when $f(x) = w^T x$ is linear [6]. However, this hardly retains positive-identifiability, since if there is only one $x_i$ with $f(x_i) > 0$, it can be outvoted by other instances with $f(x_i) < 0$. The distinction between the sum and the max operations is thus significant and difficult to ignore in this context. However, in this paper we show that if the MIL conditions are formulated as constraints on the likelihood ratio, convexity can be achieved. For example, the constraint
$$\sum_{x_i\in B_j}\frac{\Pr(y=1\mid x_i)}{\Pr(y=-1\mid x_i)} > |B_j| \qquad (1)$$
can ensure both MIL properties. Positive-identifiability is satisfied when $\Pr(y=1\mid x_i) \ge \frac{|B_j|}{|B_j|+1}$ for some instance or, equivalently, when the positive examples all have a very large margin.

When the size of the bag is large, the assumption $\Pr(y=1\mid x_i) > \frac{|B_j|}{|B_j|+1}$ can be too strong. Therefore, we exploit large deviation bounds to reduce the threshold $|B_j|$, so that $\Pr(y=1\mid x_i)$ does not have to be very large to satisfy the constraint. Intuitively, if the examples are not very ambiguous, i.e., $\Pr(y=1\mid x_i)$ is not close to $1/2$, then the likelihood ratio sum over negative examples is typically much smaller than $|B_j|$, hence we can adopt a significantly lower threshold at the cost of a small probability of violating the negative-exclusion property. To this end, a common assumption is low label noise [18, 19]:
$$M_\beta:\quad \exists c > 0,\ \forall\epsilon,\quad \Pr\Big(0 < \Big|\Pr(y=1\mid x_i) - \tfrac{1}{2}\Big| \le \epsilon\Big) \le c\epsilon^\beta.$$
This assumes that the posterior $\Pr(y=1\mid x_i)$ is usually not very close to $1/2$, meaning that most examples are not very ambiguous, which is reasonable in many practical problems. In [18, 19, 20], a number of results have been obtained implying that classifiers learned under this assumption converge to the Bayes error much faster than the conventional empirical process rate $O(n^{-1/2})$ of most standard classifiers, and can be as fast as $O(n^{-1})$. These theoretical results show that low label noise assumptions indeed support learning with fewer observations. Assuming $M_\beta$ holds, we prove the following result, which allows us to relax the hard constraint (1):

Theorem 1 $\forall\delta > 0$, for each $x_i$ in a bag $B_j$, assume $y_i$ is drawn i.i.d. from a distribution $\Pr_{B_j}(y_i\mid x_i)$ that satisfies $M_\beta$. If all instances $x_i\in B_j$ are negative, then the probability that
$$\sum_{x_i\in B_j}\frac{\Pr(y=1\mid x_i)}{\Pr(y=-1\mid x_i)} \ge \frac{\beta+4}{2(\beta+1)(\beta+2)}|B_j| + \sqrt{\frac{4\beta+1}{2(\beta+1)^2(2\beta+3)}|B_j|\log\frac{1}{\delta}} + \frac{\log\frac{1}{\delta}}{3} \qquad (2)$$
is at most $\delta$.

The proof is given in an accompanying technical report [21]. Using Theorem 1, we can weaken constraint (1) to constraint (2) and still ensure negative-exclusion with probability $1-\delta$. When $\beta$ is large, the reduction is significant. For example, for $\beta = 2$ and $\delta = 0.05$, the right-hand side of (2) is approximately $\frac{1}{4}|B_i| + \sqrt{|B_i|} + 1$, an important decrease over $|B_i|$ whenever $|B_i| \ge 3$. Note that the i.i.d. assumption in Theorem 1 applies to each bag separately; different bags can have different label distributions. This is often a significantly weaker assumption than ones based on global i.i.d. sampling of labels [8].

3 Likelihood Ratio Estimation

To estimate the likelihood ratio, one possibility would be to use kernel methods as nonparametric estimators over an RKHS. This approach was taken in [22], where predictions of the ratio provided a variational estimate of an f-divergence (or Ali-Silvey divergence) between two distributions. The formulation is powerful, yet not immediately applicable here. In our case, because of the uncertainty in the positive examples, $\Pr(y=1\mid x)$ is not observed but has to be estimated. Therefore we need to optimize jointly, as $\min_{f,\Pr(y=1\mid x)} D(f, \Pr(y=1\mid x)) + \lambda\|f\|^2$, with a loss function $D(f,g)$.
This optimization would not be convex if the framework of [22] were adopted directly.
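Before continuing, note that the relaxed bag threshold on the right-hand side of (2), which enters the regression problem (4) below as the bag constant $D_i$, is simple to compute. The sketch below is our own illustration (the function name and the default choices of $\beta$ and $\delta$ are assumptions, not part of the paper):

```python
import math

def bag_threshold(bag_size, beta=2.0, delta=0.05):
    """Right-hand side of the relaxed MIL constraint (2) from Theorem 1.

    Under the low label noise assumption M_beta, the sum of likelihood
    ratios over an all-negative bag exceeds this value with probability
    at most delta."""
    mean_term = (beta + 4) / (2 * (beta + 1) * (beta + 2)) * bag_size
    var_term = math.sqrt(
        (4 * beta + 1) / (2 * (beta + 1) ** 2 * (2 * beta + 3))
        * bag_size * math.log(1 / delta)
    )
    return mean_term + var_term + math.log(1 / delta) / 3

for size in (3, 10, 50):
    # the relaxed threshold stays well below the hard threshold |B_j|
    print(size, round(bag_threshold(size), 2))
```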

The requirement to estimate two sets of variables simultaneously (e.g., $f$ and $\Pr(y=1\mid x)$ here) is one of the major difficulties in turning multiple-instance learning into a convex problem. Approaches based on classification-style loss functions lead to non-convex optimization [2, 3]. However, since we are outside a classification setting, we can optimize over divergence measures $D_\phi(f,g)$ that are jointly convex in both $f$ and $g$. Such measures are common. For example, the f-divergence family, which includes many statistical distances, satisfies this property [23]:
$$L_1:\ D(x,y) = \sum_i |x_i - y_i|;\qquad \chi^2:\ D(x,y) = \sum_i \frac{(x_i - y_i)^2}{x_i};$$
$$\text{Kullback-Leibler}:\ D(x,y) = \sum_i \big(x_i\log x_i - x_i\log y_i - x_i + y_i\big); \qquad (3)$$
$$\text{Symmetric Kullback-Leibler}:\ D(x,y) = \sum_i \big((y_i - x_i)\log y_i + (x_i - y_i)\log x_i\big).$$
In principle, any of the measures given above can be used to estimate the likelihood ratio. An important issue is the relationship between likelihood ratio estimation and our final goal: binary classification. In [20], the authors give necessary and sufficient conditions for Bayes-consistent learners obtained by minimizing the mean of a surrogate loss function of the data. In this paper we extend these results to loss functions for likelihood ratio estimation. Let $R(f) = P(\mathrm{sign}(y) \ne \mathrm{sign}(f(x)-1))$ be the 0-1 risk of a likelihood ratio estimator $f$, with classification rule given by $\mathrm{sign}(f(x)-1)$. The Bayes risk is then $R^* = \inf_f R(f)$. For a generic loss function $C(\alpha,\eta)$, with $\eta = \Pr(y=1\mid x)$, we define the C-risk as $R_C(f) = E_{xy}(C(f,\eta))$ and $R_C^* = \inf_f R_C(f)$. Our goal is to bound the excess 0-1 risk $R(f) - R^*$ by the excess C-risk $R_C(f) - R_C^*$, so that minimizing the classification loss can be converted into minimizing the excess C-risk. Let us further define the optimal conditional risk as $H(\eta) = \inf_\alpha C(\alpha,\eta)$, and $H^-(\eta) = \inf_{\alpha:\,(\alpha-1)(2\eta-1)\le 0} C(\alpha,\eta)$. We say $C(\alpha,\eta)$ is classification-calibrated if for any $\eta \ne 1/2$, $H^-(\eta) > H(\eta)$. Then we define the $\psi$-transform of $C(\alpha,\eta)$ as $\psi(\theta) = \tilde\psi^{**}(\theta)$, where $\tilde\psi(\theta) = H^-\!\big(\frac{1+\theta}{2}\big) - H\!\big(\frac{1+\theta}{2}\big)$, $\theta\in[-1,1]$, and $g^{**}$ is the Fenchel-Legendre biconjugate of $g$, which is essentially the largest convex lower bound of $g$ [20].

The difference between likelihood ratio estimation and the classification setting is the asymmetric scaling of the loss function for positive and negative examples. Let $\bar\psi(x) = \psi(-x)$, and let $R_-(f) = \Pr(y=-1, f(x) > 1)$, $R_-^* = \inf_f R_-(f)$, $R_+(f) = \Pr(y=1, f(x) < 1)$ and $R_+^* = \inf_f R_+(f)$ be the risks and Bayes risks on negative and positive examples, respectively. It is easy to prove that $R(f) - R^* = R_-(f) - R_-^* + R_+(f) - R_+^*$. We derive the following theorem:

Theorem 2 a) For any nonnegative loss function $C(\alpha,\eta)$, any measurable $f:\mathcal{X}\to\mathbb{R}$, and any probability distribution on $\mathcal{X}\times\{\pm 1\}$, $\bar\psi(R_-(f) - R_-^*) + \psi(R_+(f) - R_+^*) \le R_C(f) - R_C^*$. b) The following conditions are equivalent: (1) $C$ is classification-calibrated; (2) for any sequence $(\theta_i)$ in $[0,1]$, $\psi(\theta_i)\to 0$ if and only if $\theta_i\to 0$; (3) for every sequence of measurable functions $f_i:\mathcal{X}\to\mathbb{R}$ and every probability distribution on $\mathcal{X}\times\{\pm1\}$, $R_C(f_i)\to R_C^*$ implies $R(f_i)\to R^*$.

The proof is given in an accompanying technical report [21]. This suggests that if $\psi$ is well-behaved, minimizing $R_C(f)$ still gives a reasonable surrogate for the classification risk. Compared to Theorem 3 in [20], which has the form $\psi(R(f) - R^*) \le R_C(f) - R_C^*$, the difference here stems from the different loss transforms used for the positive and the negative examples. We consider an f-divergence of the likelihood ratio as the loss function, i.e., $C(\alpha,\eta) = D(\alpha, \frac{\eta}{1-\eta})$, where $\frac{\eta}{1-\eta}$ is the likelihood ratio when $\Pr(y=1\mid x) = \eta$.
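As a quick check of these definitions, the $\psi$-transform of the $L_1$ divergence can be worked out in closed form (this derivation is our own illustration, restricted to $\theta\in[0,1]$):
$$C(\alpha,\eta) = \Big|\alpha - \tfrac{\eta}{1-\eta}\Big|,\qquad H(\eta) = \inf_\alpha C(\alpha,\eta) = 0,$$
$$H^-(\eta) = \inf_{\alpha:\,(\alpha-1)(2\eta-1)\le 0}\Big|\alpha - \tfrac{\eta}{1-\eta}\Big| = \frac{\eta}{1-\eta} - 1 = \frac{2\eta-1}{1-\eta}\quad (\eta\ge\tfrac12),$$
$$\tilde\psi(\theta) = H^-\!\Big(\tfrac{1+\theta}{2}\Big) - H\!\Big(\tfrac{1+\theta}{2}\Big) = \frac{2\theta}{1-\theta},$$
which is already convex, so $\psi(\theta) = \frac{2\theta}{1-\theta}$. For small $\theta$ this behaves like $2\theta$, comparable to the hinge loss, while it diverges as $\theta\to 1$; this is the asymmetry discussed next and illustrated in fig. 1 (a).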
From convexity arguments it can easily be seen that, in general, $H(\eta) = C(\frac{\eta}{1-\eta},\eta) = 0$ and $H^-(\eta) = D(1, \frac{\eta}{1-\eta})$, therefore $\psi(\theta) = D(1, \frac{1+\theta}{1-\theta})$. The $\psi$ for all the loss functions listed in (3) can be computed accordingly.

Figure 1: Loss functions and their influence on the estimation bias. (a) The function $\psi$ appearing in the losses used for likelihood estimation ($L_1$, KL-divergence) is similar to the hinge loss when $\theta > 0$; however, it goes to infinity as $\theta$ approaches 1. This essentially means the surrogate loss becomes extremely large if an example with very large $\Pr(y_i=1\mid x_i)$ is misclassified. (b) Example estimated likelihood for a synthetic dataset. The estimated likelihood is biased towards smaller values; however, with a fully labeled training set the threshold can still be obtained. (c) If we only know the labels of the negative examples (blue) and of the maximal positive example (red), determining the optimal threshold becomes non-trivial.

In fig. 1 (a) we show the $\psi(\theta)$ of the $L_1$ and KL divergences from (3) and compare them against the hinge loss (where $\psi(\theta) = \theta$ [20]) used for SVM classification. It can be seen that our approximation of the classification loss is accurate when $\Pr(y_i=1\mid x_i)$ is small. However, likelihood ratio estimation severely penalizes misclassified positive examples with large $\Pr(y_i=1\mid x_i)$. This suggests that in the joint estimation of $f$ and $\Pr(y_i=1\mid x_i)$, the optimizer will tend to make $\Pr(y_i=1\mid x_i)$ smaller in order to avoid high penalties, as shown in fig. 1 (b). Among the $\psi$ functions plotted in fig. 1 (a), we prefer the $L_1$ measure, as it is closest to the classification hinge loss, at least for the negative examples. It is also easier to solve the nonparametric function estimation in an RKHS using an $\epsilon$-insensitive $L_1$ loss, since it can be reformulated as a support vector regression on the conditional likelihood, with the additional MIL constraints in (2):
$$\min_{f,\eta^+}\ \sum_{x_j^+}\max\big(|f(x_j^+) - \eta_j^+| - \epsilon,\,0\big) + \sum_{x_j^-}\max\big(|f(x_j^-)| - \epsilon,\,0\big) + \lambda\|f\|^2 \qquad (4)$$
$$\text{s.t.}\quad \sum_{x_j^+\in B_i}\eta_j^+ \ge D_i,\qquad \eta_j^+ \ge 0,$$
where $\|f\|^2$ is the RKHS norm, $D_i$ is a constant for each bag that can be determined from Theorem 1 with appropriately chosen values for the constants $\beta$ and $\delta$, and $\eta_i^+$ is an estimate of $\frac{\Pr(y=1\mid x_i^+)}{\Pr(y=-1\mid x_i^+)}$ for the training set. In this paper we use $\beta = 2$ and $\delta = 0.05$, which gives the estimate of the bound for each bag as $D_i = \frac{1}{4}|B_i| + \frac{3}{4}\sqrt{|B_i|} + 1$ when $|B_i| \ge 3$, and $D_i = |B_i|$ when $|B_i| < 3$.

It can be proved that optimization problem (4) is jointly convex in both arguments. A standard representer theorem [24] converts it to an optimization over vectors, which we omit here. The problem can be solved by different methods. The easiest to implement is alternating minimization between solving the support vector regression and projecting onto the constraint sets given by $\sum_{x_j^+\in B_i}\eta_j^+ \ge D_i$ and $\eta_j^+ \ge 0$. As this can turn out to be slow for large datasets, approaches such as a dual SMO or primal subgradient projection algorithms (in the case of a linear SVM) can be used. In this paper we implement the alternating minimization approach, which is provably convergent since the optimization problem (4) is convex. In the accompanying technical report [21] we derive an SMO algorithm based on the dual of (4) and characterize the basic properties of the optimization problem.

4 Bag and Instance Classification

If the likelihood ratio were obtained using an unbiased estimator, a decision rule based on $\mathrm{sign}(f(x)-1)$ would give the optimal classifier. However, as previously argued, the joint estimation of $f$ and $\eta^+$ introduces a bias which is not always easy to identify. In positive bags, it is unclear whether an instance should be labeled positive or negative, as long as it does not contribute significantly to the classification error of its bag (fig. 1 (b), (c)). In the synthetic experiments, we noticed that knowledge of the correct threshold would make the algorithm outperform competitors by a large margin (fig. 2). This means that, based on the learned likelihood ratio, the positive examples are usually well separated from the negative ones. Developing a theory that would advance these aspects remains a promising avenue for future work. The main difficulty stems from the compound source of bias, which arises from both the estimation of $\eta^+$ and the loss minimization over $\eta^+$ and $f$. Here we propose a partial solution: instead of directly estimating the threshold, we learn a linear combination of instance likelihood ratios to classify the bag.
First, we sort the instance likelihood ratios for each bag into a vector of length $\max_i|B_i|$. We append 0 to bags that do not have enough instances. Under this representation, bag classification turns into a standard binary decision problem where a vector and a binary label are given for each bag, and a linear SVM is learned to solve it. If we were to classify only the likelihood ratio of the first instance, this procedure would reduce to simple thresholding. We instead leverage information from the entire bag, aiming to constrain the classifier to learn the correct threshold. In this linear SVM setting, regularization never helps in practice and we always fix C to a very large value, so effectively no parameter tuning is needed.

To classify instances, a threshold is still necessary. In the current system we follow a simple approach and take the mean of two instance scores: the highest likelihood ratio among training bags that are predicted negative by the bag classifier, and the lowest likelihood ratio, among instances in positive bags, that is higher than the previous one. This approach is derived from the basic MIL assumption that all instances in a negative bag are negative. Based on instance classification we can also estimate the witness rate of the dataset, computed as the ratio between the number of positively classified instances and the total number of instances in the positive bags of the training set. Since our algorithm automatically adjusts to different witness rates, this estimate offers quantitative insight as to whether MIL should be used at all. For instance, if the estimated witness rate is 100%, it may be more effective to use a conventional supervised learning approach.

5 Experiments

5.1 Synthetic Data

We start with an experiment on the synthetic dataset of [15], where the controlled setting helps in understanding the behavior of the proposed algorithm. This is a 2-D dataset with the actual decision boundary shown in fig. 2 (a). The positive bags have a fraction of points sampled uniformly from the white region and the rest sampled uniformly from the black region. An example of the sample at 40% witness rate is shown in fig. 2 (b). In this figure, the plotted instance labels are those of their bags; indeed, one can notice many positive (blue) instances in the negative (red) region.

Figure 2: Synthetic dataset (best viewed in color). (a) The true decision boundary. (b) Training points at 40% witness rate. (c) The learned regression function. (d) Bag misclassification rate of different algorithms. (e) Instance misclassification rate of different algorithms. (f) Estimated witness rate and true witness rate.

We have also experimented with a uniform threshold based on probabilistic estimates, as well as with predicting an instance-level threshold. While the former tends to underfit, the latter overfits. Our bag-level classifier targets an intermediate level of granularity and turns out to be the most robust in our experiments.
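To make the bag-level classifier and the derived instance threshold of Section 4 concrete, here is a minimal sketch. It is our own illustration (the function names, the use of scikit-learn's LinearSVC, and the assumption that bag labels are in $\{-1,+1\}$ are ours, not the authors' code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def bag_features(ratio_lists, max_size):
    """Sort each bag's instance likelihood ratios (descending) and pad with 0."""
    feats = np.zeros((len(ratio_lists), max_size))
    for i, ratios in enumerate(ratio_lists):
        r = np.sort(np.asarray(ratios))[::-1]
        feats[i, :len(r)] = r
    return feats

def train_bag_classifier(train_ratio_lists, bag_labels):
    # bag_labels assumed in {-1, +1}
    max_size = max(len(r) for r in train_ratio_lists)
    X = bag_features(train_ratio_lists, max_size)
    clf = LinearSVC(C=1e6)              # regularization effectively disabled
    clf.fit(X, bag_labels)
    return clf, max_size

def instance_threshold(clf, max_size, train_ratio_lists, bag_labels):
    """Mean of: the highest ratio in training bags predicted negative, and the
    lowest ratio above it among instances of (ground-truth) positive bags."""
    X = bag_features(train_ratio_lists, max_size)
    pred = clf.predict(X)
    neg_scores = [s for r, p in zip(train_ratio_lists, pred) if p < 0 for s in r]
    hi_neg = max(neg_scores) if neg_scores else 0.0
    pos_scores = [s for r, y in zip(train_ratio_lists, bag_labels)
                  if y > 0 for s in r if s > hi_neg]
    lo_pos = min(pos_scores) if pos_scores else hi_neg
    return 0.5 * (hi_neg + lo_pos)
```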

Table 1: Performance of various MIL algorithms on weak labeling benchmarks. The best result on each dataset is shown in bold. The algorithms in the second group either do not provide instance labels (MI-Kernel and migraph) or require a parameter that can be difficult to tune (ALP-SVM). SVR-SVM gives consistent results among the algorithms that provide instance labels. The row denoted Est. WR gives the estimated witness rates of our method.

Algorithm  Musk-1  Musk-2  Elephant  Tiger  Fox
CH-FD  EMDD  mi-SVM  MI-SVM  MICA  AW-SVM  Ins-KI-SVM
MI-Kernel  88.±  ±  ±  ±.9  6.3±.
migraph  88.9±  ±  ±.7  86.±  ±.6
ALP-SVM
SVR-SVM  87.9±  ±  ±  ±  ±3.5
Est. WR  %  89.5%  37.8%  42.7%  %

In order to test the effect of the witness rate, different datasets are created by varying the rate over the range 0.1, 0.2, ..., 1. In this experiment we fix the hyperparameters to C = 5 and use a Gaussian kernel with a fixed bandwidth σ. We show a trained likelihood ratio function in fig. 2 (c), estimated on the dataset shown in fig. 2 (b). Under the likelihood ratio, the positive examples are well separated from the negatives. This illustrates how our proposed approach converts multiple-instance learning into the problem of deciding a one-dimensional threshold. Complete results on datasets with different witness rates are shown in fig. 2 (d) and (e). We give both bag classification and instance classification results. Our approach is referred to as SVR-SVM. BEST THRESHOLD refers to a method where the best threshold is chosen based on full knowledge of the training/test instance labels, i.e., the optimal performance our likelihood ratio estimator can achieve. Comparisons are made with two other approaches, the mi-SVM of [2] and the AW-SVM of [15].

SVR-SVM generally works well when the witness rate is not very low. From instance classification, one can see that the original mi-SVM is only competitive when the witness rate is near 1, a situation close to a supervised SVM. With the deterministic annealing approach of [15], AW-SVM and mi-SVM behave in quite the opposite way: competitive when the witness rate is small, but degrading when it is large. Presumably this is because deterministic annealing is initialized with the a priori assumption that datasets are multiple-instance, i.e., have a small witness rate [15]. When the witness rate is large, annealing does not improve performance. In contrast, the proposed SVR-SVM does not appear to be affected by the witness rate. With the same parameters used across all the experiments, the method self-adjusts to different witness rates. One can see the effect especially in fig. 2 (e): regardless of the witness rate, the instance error rate remains roughly the same. However, this is still inferior to our model based on the best threshold, which indicates that significant room for improvement remains.

5.2 MIL Datasets

The algorithm is evaluated on a number of popular MIL benchmarks. We use the common experimental setting, based on 10-fold cross-validation for parameter selection, and we report test results averaged over 10 trials. The results are shown in Table 1, together with other competitive methods from the literature [2, 15, 10] (for some of these methods standard deviation estimates are not available). In our tests, the proposed SVR-SVM gives consistently good results among algorithms that provide instance-level labels. The only atypical case is Tiger, where the algorithm underperforms other methods. Overall, the performance of SVR-SVM is slightly worse than migraph and ALP-SVM.
However, we note that the ALP-SVM results are obtained by tuning the witness rate to its optimal value, which may be difficult in practical settings. The slightly lower performance compared to migraph suggests that we may be inferior in the bag classification step, which we already know is suboptimal.

Table 2: Results on 20 Newsgroups. The best result on each dataset is shown in bold; pairwise t-tests are performed to determine whether the differences are statistically significant. migraph dominates on a number of the datasets, whereas SVR-SVM dominates on 4.

Dataset  MI-Kernel  migraph (paper) [10]  migraph (web)  SVR-SVM  Est. WR
alt.atheism  6.2 ±  ±  ±  ±.7  .83 %
comp.graphics  47. ±  ±  ±  ±  %
comp.windows.misc  5. ±  ±.5  7. ±  ±  %
comp.ibm.pc.hardware  46.9 ±  ±  ±  ±  %
comp.sys.mac.hardware  44.5 ±  ±  ±  78. ±  %
comp.window.x  5.8 ±  ±  ±  ±  %
misc.forsale  5.8 ±  ±  ±  72.3 ±  %
rec.autos  52.9 ±  ±  ±  ±  %
rec.motorcycles  5.6 ±  ±  ±  ±  %
rec.sport.baseball  5.7 ±  ±  ±  ±  %
rec.sport.hockey  5.3 ±  ±  ±  89.3 ±  %
sci.crypt  56.3 ±  ±  ±  ±  %
sci.electronics  5.6 ±  ±  ±  9.5 ±  %
sci.med  5.6±.9  62.±  ±  ±  %
sci.space  54.7 ±  ±  ±  ±  %
soc.religion.christian  49.2 ±  ±  ±  ±  %
talk.politics.guns  47.7 ±  ±  ±  ±  %
talk.politics.mideast  55.9 ±  ±  ±.  8.5 ±  %
talk.politics.misc  5.5 ±  ±  ±  ±  %
talk.religion.misc  55.4 ±  ±  ±.  7.9 ±  %

5.3 Text Categorization

The text datasets are taken from [10]. These data have the benefit of being designed to have a small witness rate, and thus serve as a better MIL benchmark than the previous ones. They are derived from the 20 Newsgroups corpus, with 50 positive and 50 negative bags for each of the 20 news categories. Each positive bag has a witness rate of around 3%. We run 10-fold cross-validation 10 times on each dataset and compute the average accuracy and standard deviation; C is fixed to a constant value and ε to 0.2 across all datasets. The authors of [10] reported more recent results for these datasets on their website, which are vastly superior to the ones reported in the paper. Therefore, in Table 2 we include both sets of results in the comparison, identified as migraph (paper) and migraph (web), respectively. Our SVR-SVM performs significantly better than MI-Kernel and migraph (paper). It is comparable with migraph (web) and offers a marginal improvement. It is interesting that, even though we use a suboptimal second step, SVR-SVM fares well against the state of the art. This shows the potential of methods based on likelihood ratio estimation for multiple instance learning.

6 Conclusion

We have proposed an approach to multiple-instance learning based on estimating the likelihood ratio between the positive and the negative class for each instance. The MIL constraint is reformulated into a convex constraint on the likelihood ratio, and a joint estimation of both the ratio function and the target ratios on the training set is performed. Theoretically, we show that learning the likelihood ratio is Bayes-consistent and has desirable excess loss transform properties. Although we are not yet able to find the optimal classification threshold on the estimated ratio function, our proposed bag classifier based on such ratios obtains state-of-the-art results on a number of difficult datasets. In future work, we plan to explore transductive learning techniques in order to leverage the information in the learned ratio function and to identify better threshold estimation procedures.

Acknowledgements

This work is supported, in part, by the European Commission, under a Marie Curie Excellence Grant MCEXT.

References

[1] Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89 (1997) 31-71
[2] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS. (2003)
[3] Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In: NIPS. (1998)
[4] Felzenszwalb, P.F., McAllester, D.A., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. (2008)
[5] Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database and web-based tool for image annotation. IJCV 77(1-3) (2008)
[6] Cour, T., Sapp, B., Nagle, A., Taskar, B.: Talking pictures: Temporal grouping and dialog-supervised person recognition. In: CVPR. (2010)
[7] Zeisl, B., Leistner, C., Saffari, A., Bischof, H.: On-line semi-supervised multiple-instance boosting. In: CVPR. (2010)
[8] Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: ICML. (2002)
[9] Tao, Q., Scott, S., Vinodchandran, N.V., Osugi, T.T.: SVM-based generalized multiple-instance learning via approximate box counting. In: ICML. (2004)
[10] Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: ICML. (2009)
[11] Cheung, P.M., Kwok, J.T.: A regularization framework for multiple-instance learning. In: ICML. (2006)
[12] Fung, G., Dundar, M., Krishnapuram, B., Rao, R.B.: Multiple instance learning for computer aided diagnosis. In: NIPS. (2007)
[13] Mangasarian, O., Wild, E.: Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 137 (2008)
[14] Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: ICML. (2002)
[15] Gehler, P., Chapelle, O.: Deterministic annealing for multiple-instance learning. In: AISTATS. (2007)
[16] Dollár, P., Babenko, B., Belongie, S., Perona, P., Tu, Z.: Multiple component learning for object detection. In: ECCV. (2008)
[17] Li, Y.F., Kwok, J.T., Tsang, I.W., Zhou, Z.H.: A convex method for locating regions of interest with multi-instance learning. In: ECML. (2009)
[18] Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Annals of Statistics 27 (1999)
[19] Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (2004)
[20] Bartlett, P., Jordan, M.I., McAuliffe, J.: Convexity, classification and risk bounds. Journal of the American Statistical Association 101 (2006)
[21] Li, F., Sminchisescu, C.: Convex multiple instance learning by estimating likelihood ratio. Technical report, Institute for Numerical Simulation, University of Bonn (November 2010)
[22] Nguyen, X., Wainwright, M., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: NIPS. (2007)
[23] Liese, F., Vajda, I.: Convex Statistical Distances. Teubner VG (1987)
[24] Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. The Annals of Statistics 36 (2008)

Supplementary material to the paper "Convex Multiple-Instance Learning by Estimating Likelihood Ratio"

October 5, 2010

1 Proof of Theorem 1

Lemma 1 If $2x$ follows a beta distribution with parameters $(1,\beta)$, the expectations of $x$, $x^2$, $x^3$, $x^4$ are
$$E(x) = \frac{1}{2(\beta+1)},\quad E(x^2) = \frac{1}{2(\beta+1)(\beta+2)},\quad E(x^3) = \frac{3}{4(\beta+1)(\beta+2)(\beta+3)},\quad E(x^4) = \frac{3}{2(\beta+1)(\beta+2)(\beta+3)(\beta+4)}.$$

Proof: It is known that the expectation of the beta distribution $B(1,\beta)$ is $E(x) = \frac{1}{\beta+1}$, therefore the first equation is immediate. For $k > 1$, the expectations $E(x^k)$ of a beta distribution $B(\alpha,\beta)$ can be computed as
$$E(x^k) = \frac{1}{B(\alpha,\beta)}\int_0^1 x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx$$
$$= \frac{\Gamma(\alpha+\beta)\Gamma(\alpha+k-1)}{\Gamma(\alpha+k-1+\beta)\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha+k-1+\beta)}{\Gamma(\alpha+k-1)\Gamma(\beta)}\int_0^1 x^k x^{\alpha-1}(1-x)^{\beta-1}\,dx$$
$$= \frac{\Gamma(\alpha+\beta)\Gamma(\alpha+k-1)}{\Gamma(\alpha+k-1+\beta)\Gamma(\alpha)}\cdot\frac{1}{B(\alpha+k-1,\beta)}\int_0^1 x\,x^{\alpha+k-2}(1-x)^{\beta-1}\,dx$$
$$= \frac{\prod_{j=0}^{k-2}(\alpha+j)}{\prod_{j=0}^{k-2}(\alpha+\beta+j)}\cdot\frac{\alpha+k-1}{\alpha+k-1+\beta}\qquad (k\ge 2).\qquad (1)$$
The last step uses the formulae $B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ and $\Gamma(x+1) = x\Gamma(x)$, as well as the expectation of the beta distribution, $E(x) = \frac{\alpha}{\alpha+\beta}$. If $2x$ instead of $x$ follows the beta distribution, then
$$E(x^k) = \frac{1}{2^k}E\big((2x)^k\big) = \frac{\prod_{j=0}^{k-2}(\alpha+j)}{2^k\prod_{j=0}^{k-2}(\alpha+\beta+j)}\cdot\frac{\alpha+k-1}{\alpha+k-1+\beta}.\qquad (2)$$
Substituting $\alpha = 1$ and $k = 2, 3, 4$ we obtain the Lemma. □

Proof of Theorem 1: Define $\eta(x_i) = \Pr(y=1\mid x_i,\ \Pr(y=-1\mid x_i) > \Pr(y=1\mid x_i))$. The most adversarial scenario is $\Pr(1 - 2\eta(x_i) \le 2\epsilon) = c\epsilon^\beta$, because this maximizes uniformly the chance that the class-conditional probability is close to $1/2$. This means $2\eta(x_i)$ follows a beta distribution $B(1,\beta)$. It can be shown that, for $\eta\le 1/2$, the likelihood ratio satisfies
$$\frac{3}{4}\eta + \frac{9}{4}\eta^2 \;\le\; \frac{\eta}{1-\eta} \;\le\; 2\eta^2 + \eta.$$
From Lemma 1 we know that $E(\eta^2) = \frac{1}{2(\beta+2)(\beta+1)}$ and $E(\eta) = \frac{1}{2(\beta+1)}$, therefore we have
$$\frac{3(\beta+5)}{8(\beta+1)(\beta+2)} \;\le\; E\Big(\frac{\eta}{1-\eta}\Big) \;\le\; \frac{\beta+4}{2(\beta+1)(\beta+2)}.$$
Furthermore, a bound on the variance can be computed as
$$V\Big(\frac{\eta}{1-\eta}\Big) = E\Big[\Big(\frac{\eta}{1-\eta}\Big)^2\Big] - E\Big[\frac{\eta}{1-\eta}\Big]^2 \;\le\; 4E(\eta^4) + 4E(\eta^3) + E(\eta^2) - \frac{9(\beta+5)^2}{64(\beta+1)^2(\beta+2)^2},$$
in which $E(\eta^4) = \frac{3}{2(\beta+1)(\beta+2)(\beta+3)(\beta+4)}$ and $E(\eta^3) = \frac{3}{4(\beta+1)(\beta+2)(\beta+3)}$; the last step involves some rather complicated arithmetic and simplification. A simpler bound can be expressed as
$$V\Big(\frac{\eta}{1-\eta}\Big) \;\le\; \frac{4\beta+1}{2(\beta+1)^2(2\beta+3)},$$
which is an upper bound of the expression above. By Bennett's inequality [1],
$$\frac{1}{n}\sum_{i=1}^n Z_i \;\le\; E(Z) + \sqrt{\frac{2V(Z)\log 1/\delta}{n}} + \frac{\log 1/\delta}{3n},$$

therefore with probability $1-\delta$,
$$\sum_{i=1}^n Z_i \;\le\; nE(Z) + \sqrt{2nV(Z)\log 1/\delta} + \frac{\log 1/\delta}{3}.$$
Taking $Z = \frac{\eta}{1-\eta}$ and plugging in the mean and variance bounds computed previously, we obtain the theorem. □

2 Proof of Theorem 2

(a) It is straightforward to show that
$$R(f) - R^* = E\big([f(X) > 1,\ \eta(X) < 1/2](1 - 2\eta(X))\big) + E\big([f(X) < 1,\ \eta(X) > 1/2](2\eta(X) - 1)\big) = (R_-(f) - R_-^*) + (R_+(f) - R_+^*).$$
We apply Jensen's inequality to both summands and obtain
$$\bar\psi(R_-(f) - R_-^*) + \psi(R_+(f) - R_+^*) \;\le\; E\big(\bar\psi([\eta(X) < 1/2,\ f(X) > 1](1 - 2\eta(X)))\big) + E\big(\psi([\eta(X) \ge 1/2,\ f(X) < 1](2\eta(X) - 1))\big)$$
$$= E\big([\mathrm{sign}(f(X)-1) \ne \mathrm{sign}(\eta(X)-1/2)]\,\psi(2\eta(X) - 1)\big) \;\le\; E\big([\mathrm{sign}(f(X)-1) \ne \mathrm{sign}(\eta(X)-1/2)]\,\tilde\psi(2\eta(X) - 1)\big)$$
$$= E\big([\mathrm{sign}(f(X)-1) \ne \mathrm{sign}(\eta(X)-1/2)]\,(H^-(\eta(X)) - H(\eta(X)))\big)$$
$$= E\Big([\mathrm{sign}(f(X)-1) \ne \mathrm{sign}(\eta(X)-1/2)]\,\Big(\inf_{\alpha:(\alpha-1)(2\eta(X)-1)\le 0} C(\alpha,\eta(X)) - H(\eta(X))\Big)\Big)$$
$$\le\; E\big(C(f(X),\eta(X)) - H(\eta(X))\big) = R_C(f) - R_C^*.$$

(b) First note that since $\psi(0) = 0$ and $\psi$ is continuous, $\theta_i\to 0$ implies $\psi(\theta_i)\to 0$. Thus we can replace condition (2) by (2') For any sequence $(\theta_i)$ in $[0,1]$, $\psi(\theta_i)\to 0$ implies $\theta_i\to 0$.

To see that (1) implies (2'), let $C$ be classification-calibrated, and let $(\theta_i)$ be a sequence that does not converge to 0. Define $c = \limsup\theta_i > 0$, and pass to a subsequence with $\lim\theta_i = c$. Then by continuity $\lim\psi(\theta_i) = \psi(c)$, and $\psi(c) > 0$ by classification-calibration. Thus, for the original sequence $(\theta_i)$, $\limsup\psi(\theta_i) > 0$, so we cannot have $\psi(\theta_i)\to 0$.

To see that (2') implies (3), suppose that $R_C(f_i)\to R_C^*$. By part (a) of the theorem, $\bar\psi(R_-(f_i) - R_-^*) + \psi(R_+(f_i) - R_+^*) \to 0$; since both $\psi$ and $\bar\psi$ are convex and nonnegative, (2') implies $R(f_i)\to R^*$.

Finally, to see that (3) implies (1), suppose $C$ is not classification-calibrated. By definition, we can choose $\eta \ne 1/2$ and a sequence $\alpha_1, \alpha_2, \ldots$ such that $\mathrm{sign}((\alpha_i - 1)(\eta - 1/2)) = -1$ but $C(\alpha_i,\eta)\to H(\eta)$. Fix $x$ and choose the probability distribution $P$ so that $P_X\{x\} = 1$ and $P(Y = 1\mid X = x) = \eta$. Define a sequence of functions $f_i$ for which $f_i(x) = \alpha_i$. Then $\lim R(f_i) > R^*$, and this is true for any infinite subsequence. But $C(\alpha_i,\eta)\to H(\eta)$ implies $R_C(f_i)\to R_C^*$. The contradiction proves the final part of the theorem. □
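As an informal numerical sanity check of Theorem 1 (our own experiment, not part of the supplement), one can simulate the adversarial scenario used in the proof, where $2\eta \sim \mathrm{Beta}(1,\beta)$ for each negative instance, and verify that the sum of likelihood ratios rarely exceeds the bound (2):

```python
import numpy as np

def bound(bag_size, beta, delta):
    """Right-hand side of (2) in Theorem 1."""
    return ((beta + 4) / (2 * (beta + 1) * (beta + 2)) * bag_size
            + np.sqrt((4 * beta + 1) / (2 * (beta + 1) ** 2 * (2 * beta + 3))
                      * bag_size * np.log(1 / delta))
            + np.log(1 / delta) / 3)

def violation_rate(bag_size=20, beta=2.0, delta=0.05, trials=20000, seed=0):
    rng = np.random.default_rng(seed)
    # adversarial case from the proof: 2*eta ~ Beta(1, beta), so eta <= 1/2
    eta = 0.5 * rng.beta(1.0, beta, size=(trials, bag_size))
    ratio_sums = (eta / (1.0 - eta)).sum(axis=1)
    return (ratio_sums >= bound(bag_size, beta, delta)).mean()

print(violation_rate())   # empirically well below delta = 0.05
```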

13 Finally, to see (3) implies (), suppose C is not classification-calibrated. By definition, we can choose η /2 and a sequence α, α 2,... such that sign(α i (η /2)) = but c η (α i ) H(η). Fix x and choose the probability distribution P so that P X {x} = and P (Y = X = x) = η. Define a sequence of functions f i for which f i (x) = α i. Then lim R(f i ) > R, and this is true for any infinite subsequence. But C η (α i ) H(η) implies R C (f i ) RC. The contradiction proves the final part of the theorem. 3 Equivalent Constant Transforms Proposition For loss functions that satisfy a scaling equality L(k ŷ, k y) = k 2 L(ŷ, y), solving (4) with C = C and D i for each bag is equivalent with solving (3) with C = k2 k 2 C and k D i. Proof: A variable substitution of z = k y and v = k w in (3) would obtain the conclusion. The support vector regression (SVR) we use satisfies L(k ŷ, k y; ɛ) = k L(ŷ, y; ɛ k ). In this case, the uniform scaling on all the D i s is equivalent to subsequent changes in both C and ɛ. 4 Projection to the Bag Constraint We use the following procedure to project y + : Denote the sum of scores in a bag as s i = x + j B i y+ j. For each bag B i, we check if s i < D i, if not, we simply set all the negative y + j to. If s i < D i, we add D i s i B i to each y + j. This makes s i = D i. Then we set all the negative y + j to. If this makes s i > D i, we subtract equal amount on all the positive y + j, to make s i = D i. If this created additional negative y + j, the alternating projection process goes on until both conditions: s i = D i and y + are fulfilled. 5 Derivation of the SVM dual problem First rewrite the optimization in the canonical form: min w,b,y + 2 w 2 + C( n + +n i= (ξ i + ˆξ i )) s.t. ( w, φ(x i ) + b) y i ɛ + ξ i, for each x + i, x i y i ( w, φ(x i ) + b) ɛ + ˆξ i, for each x + i, x i x + j Bi y+ j D i B i y + j where y i = for x i and K(x, y) = φ(x), φ(y). 4

Setting $\alpha_i^-$ and $\alpha_i^+$ as the Lagrange multipliers of the first and second set of constraints, and $\eta_j$ as the Lagrange multipliers of the bag constraints, we obtain the following KKT conditions:
$$w = \sum_i(\alpha_i^+ - \alpha_i^-)\phi(x_i),\qquad 0\le \alpha_i^+, \alpha_i^- \le C,\qquad \eta_j \le \alpha_i^+ - \alpha_i^-\ \ (x_i\in B_j),$$
$$\sum_i(\alpha_i^+ - \alpha_i^-) = 0,\qquad \alpha_i^+, \alpha_i^- \ge 0,\qquad \eta_j \ge 0.$$
From the KKT conditions one can see that to make the equality $\sum_{x_j^+\in B_i} y_j^+ = D_i|B_i|$ hold, we need $\eta_j > 0$; essentially, each $\alpha_i^+$ in the bag must be positive. This means that $y_i - (\langle w,\phi(x_i)\rangle + b) \ge \epsilon$ for all the items in the bag. Thus, to make the equality hold, all items in the positive bag need to be predicted smaller than their target value $y_i$. Another issue is that, since $\eta \ge 0$ must hold, $\alpha_i^+ \ge \alpha_i^-$; since only one of $\alpha_i^+$ and $\alpha_i^-$ can be nonzero, this means $\alpha_i^- = 0$ for instances in positive bags. Therefore, the situation $\alpha_i^+ = 0$, $\alpha_i^- > 0$ can never occur. Back in the original optimization problem, this means that $(\langle w,\phi(x_i)\rangle + b) - y_i \le \epsilon + \xi_i$ never holds with equality. Therefore, $(\langle w,\phi(x_i)\rangle + b) \le y_i + \epsilon$.

The dual problem is very similar to that of the original SVM, with only minor differences introduced by $\eta$:
$$\min_{\alpha^+,\alpha^-,\eta}\ \frac{1}{2}(\alpha^+ - \alpha^-)^T K(\alpha^+ - \alpha^-) + \epsilon\sum_{i=1}^n(\alpha_i^+ + \alpha_i^-) - \sum_{i=1}^k D_i|B_i|\eta_i$$
$$\text{s.t.}\quad 0\le\alpha_i^+, \alpha_i^-\le C,\qquad \sum_{i=1}^n(\alpha_i^+ - \alpha_i^-) = 0,$$
$$\qquad\ \eta_j \le \alpha_i^+ - \alpha_i^-,\ \forall x_i\in B_j,\qquad \alpha_i^+, \alpha_i^-\ge 0,\ \eta_j\ge 0.$$
From the dual problem one can see that making $\eta_j$ larger improves the dual objective; therefore the algorithm will always prefer $\eta_j > 0$. From the previous analysis, this essentially means that the equality $\sum_{x_j^+\in B_i} y_j^+ = D_i|B_i|$ is desirable. Thus the algorithm tends to turn more vectors from positive bags into support vectors, and tends to make predictions smaller than the estimated target value $y_i$. It can be seen that the best solution for $\eta_j$ is $\eta_j = \max(\min_{x_i\in B_j}(\alpha_i^+ - \alpha_i^-), 0)$; with this in mind, the problem can still be solved using an SMO-type active set approach that selects two $\alpha_i$ per iteration.

The SMO subproblem is thus:
$$\min_{\alpha_i,\alpha_j}\ \frac{1}{2}\begin{bmatrix} s_i\alpha_i \\ s_j\alpha_j\end{bmatrix}^T\begin{bmatrix} Q_{ii} & Q_{ij}\\ Q_{ij} & Q_{jj}\end{bmatrix}\begin{bmatrix} s_i\alpha_i\\ s_j\alpha_j\end{bmatrix} + \epsilon(\alpha_i + \alpha_j) - D_{k_1}|B_{k_1}|\eta_{k_1} - D_{k_2}|B_{k_2}|\eta_{k_2}$$
$$\text{s.t.}\quad 0\le\alpha_i,\alpha_j\le C,\qquad s_i\alpha_i + s_j\alpha_j\ \text{does not change},$$
where $s_i$ and $s_j$ are the signs of $\alpha_i$ and $\alpha_j$, respectively. The SMO working set selection is the same as in the original SVM, i.e., find the maximal violating pair, except that the gradient now needs to take $\eta$ into consideration.

References

[1] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13-30


More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it

More information

arxiv: v1 [stat.ml] 3 Sep 2014

arxiv: v1 [stat.ml] 3 Sep 2014 Breakdown Point of Robust Support Vector Machine Takafumi Kanamori, Shuhei Fujiwara 2, and Akiko Takeda 2 Nagoya University 2 The University of Tokyo arxiv:409.0934v [stat.ml] 3 Sep 204 Abstract The support

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

SMO Algorithms for Support Vector Machines without Bias Term

SMO Algorithms for Support Vector Machines without Bias Term Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002

More information

A convex relaxation for weakly supervised classifiers

A convex relaxation for weakly supervised classifiers A convex relaxation for weakly supervised classifiers Armand Joulin and Francis Bach SIERRA group INRIA -Ecole Normale Supérieure ICML 2012 Weakly supervised classification We adress the problem of weakly

More information

How Good is a Kernel When Used as a Similarity Measure?

How Good is a Kernel When Used as a Similarity Measure? How Good is a Kernel When Used as a Similarity Measure? Nathan Srebro Toyota Technological Institute-Chicago IL, USA IBM Haifa Research Lab, ISRAEL nati@uchicago.edu Abstract. Recently, Balcan and Blum

More information

Sparse Kernel Machines - SVM

Sparse Kernel Machines - SVM Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support

More information

Support Vector Machines on General Confidence Functions

Support Vector Machines on General Confidence Functions Support Vector Machines on General Confidence Functions Yuhong Guo University of Alberta yuhong@cs.ualberta.ca Dale Schuurmans University of Alberta dale@cs.ualberta.ca Abstract We present a generalized

More information

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department

More information

Class Prior Estimation from Positive and Unlabeled Data

Class Prior Estimation from Positive and Unlabeled Data IEICE Transactions on Information and Systems, vol.e97-d, no.5, pp.1358 1362, 2014. 1 Class Prior Estimation from Positive and Unlabeled Data Marthinus Christoffel du Plessis Tokyo Institute of Technology,

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

Dynamic Time-Alignment Kernel in Support Vector Machine

Dynamic Time-Alignment Kernel in Support Vector Machine Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Convex Optimization and Support Vector Machine

Convex Optimization and Support Vector Machine Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We

More information

Basis Expansion and Nonlinear SVM. Kai Yu

Basis Expansion and Nonlinear SVM. Kai Yu Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Surrogate regret bounds for generalized classification performance metrics

Surrogate regret bounds for generalized classification performance metrics Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36 Motivation 2

More information

A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes

A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Classification objectives COMS 4771

Classification objectives COMS 4771 Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification

More information

Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang

More information

Support Vector Regression with Automatic Accuracy Control B. Scholkopf y, P. Bartlett, A. Smola y,r.williamson FEIT/RSISE, Australian National University, Canberra, Australia y GMD FIRST, Rudower Chaussee

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods Foundation of Intelligent Systems, Part I SVM s & Kernel Methods mcuturi@i.kyoto-u.ac.jp FIS - 2013 1 Support Vector Machines The linearly-separable case FIS - 2013 2 A criterion to select a linear classifier:

More information

Distributionally robust optimization techniques in batch bayesian optimisation

Distributionally robust optimization techniques in batch bayesian optimisation Distributionally robust optimization techniques in batch bayesian optimisation Nikitas Rontsis June 13, 2016 1 Introduction This report is concerned with performing batch bayesian optimization of an unknown

More information