Convex Multiple-Instance Learning by Estimating Likelihood Ratio
|
|
- Eric Long
- 5 years ago
- Views:
Transcription
1 Convex Multiple-Instance Learning by Estimating Likelihood Ratio Fuxin Li and Cristian Sminchisescu Institute for Numerical Simulation, University of Bonn Abstract We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the - classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the - loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach. Introduction Multiple Instance Learning (MIL) has been proposed over years ago as a methodology to learn models under weak labeling constraints []. Unlike traditional binary classification problems, the positive items are represented as bags, which are sets of instances. A feature vector is used to represent each instance in the bag. There is an OR relationship in a bag: if one of the feature vectors is classified as positive, the entire bag is considered positive. A simple intuition is: one has a number of keys and faces a locked door. To enter the door, we only need one matching keys. MIL is a natural weak labeling formulation for text categorization [2] and computer vision problems [3]. In document classification, one is given files made of many sentences, and often only a few are useful. In computer vision, an image can be decomposed into different regions, and only some delineate objects. Therefore, MIL can be used in sophisticated tasks, such as identifying the location of object parts from bounding box information in images [4]. Although efforts have been made to provide datasets with increasingly more detailed supervisory information [5], without automation such a minutiae level of detail becomes prohibitive for large datasets, or more complicated data like video [6, 7]. In this case, one necessarily needs to resort to multiple-instance learning. MIL is interesting mainly because of its potential to provide instance-level labels from weak supervisory information. However the state-of-the-art in MIL is often obtained by simply using a weighted sum of kernel values between all instance pairs within the bags, while ignoring the prediction of instance labels [8, 9, ]. It is intriguing why MIL algorithms that exploit instance level information cannot achieve better performance, as constraints at instance level seems abundant none of the negative instances is positive. This should provide additional constraints in defining the region of positive instances and should help classification in input space. A major challenge is the non-convexity of many instance-level MIL algorithms [2,, 2, 3, 4]. Most of these algorithms perform alternating minimization on the classifier and the instance weights.
2 This procedure usually gives only a local optimum since the objective is non-convex. The benchmark performance of MIL methods is overall quite similar, although techniques differ significantly: some assign binary weights to instances [2], some assign real weights [2, 3], yet others use probabilistic formulations [4]; some optimize using conventional alternating minimization, others use convexconcave procedures []. Gehler and Chapelle [5] have recently performed an interesting analysis of the MIL costs, where deterministic annealing (DA) was used to compute better local optima for several formulations. In the case of a previous mi-svm formulation [2], annealing methods did not improve the performance significantly. A newly proposed algorithm, ALP-SVM, was also introduced, which used a preset parameter defining the fixed ratio of witnesses the true positive instances in a positive bag. Excellent results were obtained with this witness rate parameter set to the correct value. However, in practice it is unclear whether this can be known beforehand and whether it is stationary across different bags. In principle, the witness rate should also be estimated, and this learning stage partially accounts for the non-convexity of the MIL problem. It remains however unclear whether the observed performance variations are caused by non-convexity or by other modeling aspects. Although performance considerations have hindered the application of MIL to practical problems, the methodology has started to gain momentum recently [4, 6, 7, 6]. The success of the Latent SVM for person detection [4] shows that a standard MIL procedure (the reformulation of the alternating minimization MI-SVM algorithm in [2]) can achieve good results if properly initialized. However, proper initialization of MIL remains elusive in general, as it often requires engineering experience with the individual problem structure. Therefore, it is still of broad interest to develop an initialization-independent formulation for MIL. Recently Li et al. [7] proposed a convex instancelevel MIL algorithm based on multiple kernel learning, where one kernel was used for each possible combination of instances. This creates an exponential number of constraints and requires a cuttingplane solver. Although the formulation is convex, its scalability drops significantly for bags with many instances. In this paper we make an alternative attempt towards a convex formulation: we establish that nonconvex MIL constraints can be recast reliably into convex constraints on the likelihood ratio between the positive and negative classes for each instance. We transform the multiple-instance learning problem into a convex joint estimation of the likelihood ratio function and the likelihood ratio values on training instances. The choice of the jointly convex loss function is rich, remarkably at least from a family of f-divergences. Theoretically, we prove consistency results for likelihood ratio estimation, thus showing that f-divergence loss functions upper bound the classification - loss tightly, unless the likelihood is very large. A support vector regression scheme is implemented to estimate the likelihood ratio, and it is shown to separate positive and negative instances well. However, determining the correct threshold for instance classification from the training set remains non-trivial. To address this problem, we propose a post-processing step based on a bag classifier computed as a linear combination of likelihood ratios. While this is shown to be suboptimal in synthetic experiments, it still achieves state-of-theart results in practical datasets, demonstrating the vast potential of the proposed approach. 2 Convex Reformulation of the Multiple Instance Constraint Let us consider a learning problem with n training instances in total, n + positive and n negative. In negative bags, every instance is negative, hence we do not separately define such bags instead we directly work with the instances. Let B = {B,B 2,...,B k } be positive bags and X + = {x +,x+ 2,...,x+ n + }, X = {x,x 2,...,x n } be the training input, where each x i belongs to a positive bagb j and eachx i is a negative instance. The goal of multiple instance learning is, given {X +,X,B}, to learn a decision rule,sign(f(x)), to predict the label{+, } for the test instance x. The MIL problem can be characterized by two properties. ) negative-exclusion: if none of the instances in a bag is positive, the bag is not positive. 2) positive-identifiability: if one of the instances in the bag is positive, the bag is positive. These properties are equivalent to a constraint max xi B j f(x i ) on positive bags. This constraint is not convex since the negativemax function is concave. Reformulation into a sum constraint such as f(x) would be convex, when 2
3 f(x) = w T x is linear [6]. However, this hardly retains positive-identifiability, since if there is only onex i with f(x i ) >, this can be superseded by other instances with f(x i ) <. Apparently, the distinction between the sum and the max operations is significant and difficult to ignore in this context. However, in this paper we show that if MIL conditions are formulated as constraints on the likelihood ratio, convexity can be achieved. For example, the constraint: Pr(y = x i ) Pr(y = x i ) > B j () x i B j can ensure both of the MIL properties. Positive-identifiability is satisfied when Pr(y = x i ) or equivalently, when the positive examples all have very large margin. B i B i + When the size of the bag is large, the assumption Pr(y = x i ) > Bj B j + can be too strong. Therefore, we exploit large deviation bounds to reduce the quantity B j, such that Pr(y = x i ) does not have to be very large to satisfy the constraint. Intuitively, if the examples are not very ambiguous, i.e. Pr(y = x i ) is not close to /2, then likelihood ratio sums on negative examples can become much smaller, hence we can adopt a significantly lower threshold at some degree of violation of the negative-exclusion property. To this end, a common assumption is the low label noise [8, 9]: M β : c >, ǫ,pr( < Pr(y = x i ) 2 ǫ) cǫβ. This assumes that the posterior Pr(y = x i ) is usually not very close to /2, meaning that most examples are not very ambiguous, which is reasonable in many practical problems. In [8, 9, 2], a number of results have been obtained implying that classifiers learned under this assumption converge to the Bayes error much faster than the conventional empirical process rate O(n /2 ) of most standard classifiers, and can be as fast aso(n ). These theoretical results show that low label noise assumptions indeed supports learning with fewer observations. AssumingM β holds, we prove the following result which allows us to relax the hard constraint (): Theorem δ >, for each x i in a bag B j, assume y i is drawn i.i.d. from the distribution Pr Bj (y i x i ) that satisfiesm β. If all instancesx i B j are negative, then the probability that Pr(y = x i ) Pr(y = x i ) β +4 2(β +)(β +2) B 4β + j + 2(β +) 2 (2β +3) B j log/δ+ log/δ (2) 3 x i B j is at most δ. The proof is given in an accompanying technical report [2]. From Theorem, we could weaken the constraint () to obtain constraint (2) and still ensure negative-exclusion with probability δ. When β is large, the reduction is significant. For example, for β = 2 and δ =.5, the right- hand side of (2) is approximately 4 B i B i +, which is an important decrease over B i, whenever B i 3. Note that the i.i.d. assumption in Theorem applies to each bag. Different bags can have different label distributions. This is often a significantly weaker assumption than the ones based on global i.i.d. of labels [8]. 3 Likelihood Ratio Estimation To estimate the likelihood ratio, one possibility would be to use kernel methods as nonparametric estimators over a RKHS. This approach was taken in [22], where predictions of the ratio provided a variational estimate of an f-divergence (or Ali-Silvey divergence) between two distributions. The formulation is powerful, yet not immediately applicable here. In our case, because of the uncertainty in the positive examples, Pr(y = x) is not observed but has to be estimated. Therefore we need to optimize jointly asmin f,pr(y= x) D(f,Pr(y = x))+λ f 2 with loss functiond(f,g). This optimization would not be convex if a framework in [22] were taken. 3
4 The requirement to estimate two sets of variables simultaneously (e.g. f and Pr(y = x) here), is one of the major difficulties in turning multiple-instance learning into a convex problem. Approaches based on classification-style loss functions lead to non-convex optimization [2, 3]. However, since we are outside a classification setting, we can optimize over divergence measuresd φ (f,g) that are convex w.r.t. both f and g. These measures are common. For example, the f-divergence family that includes many statistical distances, satisfies the following properties [23]: (x i y i) 2 x i ; L : D(x,y) = i x i y i ;χ 2 : D(x,y) = i Kullback-Leibler: D(x,y) = i x ilogx i x i logy i x i +y i ; (3) Symmetric Kullback-Leibler: D(x,y) = i (y i x i )logy i +(x i y i )logx i x i +y i In principle, any of the measures given above can be used to estimate the likelihood ratio. An important issue is the relationship between the likelihood ratio estimation and our final goal: binary classification. In [2], the authors give necessary and sufficient conditions for Bayes consistent learners by minimizing the mean of a surrogate loss function of the data. In this paper we extend these results to loss functions for likelihood ratio estimation. Let R(f) = P(sign(y) sign(f(x) )) be the - risk of a likelihood estimator f, with classification rule given by sign(f(x) ). The Bayes risk is thenr = inf f R(f). For a generic loss function C(α,η), let η = Pr(y = x), we can define the C-risk as R C (f) = E xy (C(f,η)) and RC = inf f R C (f). Our goal is to bound the excess - risk R(f) R by the excess-c riskr C (f) RC, so that minimizing the classification loss can be converted into minimizing the excess-c risk. Let us further define the optimal conditional risk ash(η) = inf α R C(α,η), and H (η) = inf α,(α )(2η ) C(α,η). We say C(α,η) is classification-calibrated if for any η /2, H (η) > H(η). Then, we define the ψ-transform of C(α,η) as ψ(θ) = ψ (θ), where ψ(θ) = H ( +θ 2 ) H(+θ 2 ),θ [,], andg is the Fenchel-Legendre biconjugate ofg, which is essentially the largest convex lower bound ofg [2]. The difference between likelihood ratio estimation and the classification setting is in the asymmetric scaling of the loss function for positive and negative examples. Let ψ = ψ( x), R (f) = Pr(y =,f(x) > ),R = inf f R (f), R + (f) = Pr(y =,f(x) < ) andr + = inf f R + (f) be the risk and Bayes risks on negative and positive examples, respectively. It is easy to prove that R(f) R = R (f) R +R + (f) R +. We derived the following theorem: Theorem 2 a) For any nonnegative loss function C(α,η), any measurable f : X R, and any probability distribution onx {±},ψ (R (f) R )+ψ(r +(f) R+ ) R C(f) RC. b) The following conditions are equivalent: ()C is classification-calibrated; (2) For any sequence(θ i ) in [,],ψ(θ i ) if and only ifθ i ; (3) For every sequence of measurable functionsf i : X R and every probability distribution onx {±},R C (f i ) RC impliesr(f i) R. The proof is given in an accompanying technical report [2]. This suggests that if ψ is well-behaved, minimizing R C (f) still gives a reasonable surrogate for the classification risk. Compared against Theorem 3 in [2] which has the form ψ(r(f) R ) R C (f) RC, the difference here stems from the different loss transforms used for the positive and the negative examples. We consider an f-divergence of the likelihood as the loss function, i.e., C(α, η) = D(α, η ), where η η is the likelihood ratio when the Pr(y = x) = η. From convexity arguments, it can be easily seen that H(η) = C( η η,η) = and H η (η) = D(, η ), therefore ψ(θ) = D(, +θ θ ). The ψ for all the loss functions listed in (4) can be computed accordingly. In fig. 3 (a) we show the ψ(θ) ofl and the KL-divergence from (4) and compare it against the hinge loss (where ψ(θ) = θ [2]) used for SVM classification. It could be seen that our approximation of the classification loss is accurate when Pr(y i = x i ) is small. However, likelihood estimation would severely penalize the misclassified positive examples with largepr(y i = x i ). This suggests that for the joint estimation of f and Pr(y i = x i ), the optimizer would tend to make Pr(y i = x i ) smaller, in order to avoid high penalties, as shown in fig. (b). In fig. (a) we plot ψ functions for different losses. We prefer an L measure as it is closer to the classification hinge loss, at least for the negative examples. It is easier to solve the nonparametric function estimation in RKHS using an epsilon-insensitive L loss, since it can be reformulated as η 4
5 .2 Positive Negative.2 Positive Negative Undecided ψ(θ) 5 L divergence KL divergence Hinge Loss Estimated Likelihood θ Example (a) (b) (c) Figure : Loss functions and their influence on the estimation bias. (a) The function ψ appearing in the losses used for likelihood estimation (L, KL-divergence) is similar to the hinge loss when θ > ; however it goes to infinity as θ approaches. This deviation essentially means the surrogate loss is going to be extremely large if an example with very largepr(y i = x i ) is misclassified. (b) Example estimated likelihood for a synthetic example. The estimated likelihood is biased towards smaller values. However, with a fully labeled training set, the threshold can still be obtained. (c) If we only know the label of the negative examples (blue) and the maximal positive example (red), determining the optimal threshold becomes non-trivial. support vector regression on the conditional likelihood, with the additional MIL constraints in (2): min f,η + x + j s.t. max( f(x + j ) η+ j ǫ,)+ x max( f(x j ) ǫ,)+λ f 2 j x + η+ j Bi j D i,η + j (4) where f 2 is the RKHS norm;d i is a constant for each bag and can be determined from Theorem Pr(y= x + i ), with appropriately chosen values for constantsβ andδ; η i + is an estimate of for the Pr(y= x + i ) training set. In this paper we use β = 2 and δ =.5, which gives the estimate of the bound for each bag asd i = 4 B 3 i + 4 B i +, whenb i 3 andd i = B i when B i < 3. It can be proved that optimization problem (4) is jointly convex in both arguments. A standard representer theorem [24] would convert it to an optimization on vectors, which we omit here. The problem can be solved by different methods. The one easiest to implement is the alternating minimization between solving for the SVM and projecting on the constraint sets given by x + j Bi y+ j D i and y + j. As this can turn out to be slow for large datasets, approaches such as the dual SMO or primal subgradient projection algorithms (in the case of linear SVM) can be used. In this paper we implement the alternating minimization approach, which is provably convergent since the optimization problem (4) is convex. In the accompanying technical report [2] we derive an SMO algorithm based on the dual of (4) and characterize the basic properties of the optimization problem. 4 Bag and Instance Classification If the likelihood ratio is obtained using an unbiased estimator, a decision rule based on sign(f(x) ) should give the optimal classifier. However as previously argued, the joint estimation on f and η + introduces a bias which is not always easy to identify. In positive bags, it is unclear whether an instance should be labeled positive or negative, as long as it does not contribute significantly to the classification error of its bag (fig. 3(b),(c)). In the synthetic experiments, we noticed that knowledge of the correct threshold would make the algorithm outperform competitors by a large margin (fig. 2). This means that based on the learned likelihood ratio, the positive examples are usually well separated from the negative ones. Developing a theory that would advance these aspects remains a promising avenue for future work. The main difficulty stems from the compound source of bias which arises from both the estimation ofη + and the loss minimization overη + andf. Here we propose a partial solution. Instead of directly estimating the threshold, we learn a linear combination of instance likelihood ratios to classify the bag. First, we sort the instance likelihood ratios for each bag into a vector of lengthmax i B i. We appendto bags that do not have enough 5
6 Error averaged over 5 runs (a) (b) (c) Bag Classification Error AL SVM : T=C mi SVM AW SVM BEST THRESHOLD SVR SVM percent of positive labeled points in bags Error averaged over 5 runs Pattern Classification Error AL SVM : T=C mi SVM AW SVM BEST THRESHOLD SVR SVM percent of positives in bags (d) (e) (f) Estimated witness rate Witness Rate Estimate by SVR SVM Percent of positives in bags Figure 2: Synthetic dataset (best viewed in color). (a) The true decision boundary. (b) Training points at 4% witness rate. (c) The learned regression function. (d) Bag misclassification rate of different algorithms. (e) Instance misclassification rate of different algorithms. (f) Estimated witness rate and true witness rate. instances. Under this representation, bag classification turns into a standard binary decision problem where a vector and a binary label is given for each bag, and a linear SVM is learned to solve the problem. If we were to classify only the likelihood ratio on the first instance, this procedure would reduce to simple thresholding. We instead leverage information in the entire bag, aiming to constrain the classifier to learn the correct threshold. In this linear SVM setting, regularization never helps in practice and we always fix C to very large values. Effectively no parameter tuning is needed. To classify instances, a threshold is still necessary. In the current system, we follow a simple approach and take the mean between two instances: the one with the highest likelihood among training bags that are predicted negative by the bag classifier, and the lowest scored one among instances in positive bags with a score higher than the previous one. This approach is derived from the basic MIL assumption that all instances in a negative bag are negative. Based on instance classification we could also estimate the witness rate of the dataset. This is computed as the ratio of positively classified instances and the total number of instances in the positive bags of the training set. Since our algorithm automatically adjusts to different witness rates, this estimate offers quantitative insight as to whether MIL should be used. For instance, if the witness rate is%, it may be more effective to use a conventional learning approach. 5 Experiments 5. Synthetic Data We start with an experiment on the synthetic dataset of [5], where the controlled setting helps understanding the behavior of the proposed algorithm. This is a 2-D dataset with the actual decision boundary shown in fig. 2 (a). The positive bags have a fraction of points sampled uniformly from the white region and the rest sampled uniformly from the black region. An example of the sample at 4% witness rate is shown in fig. 2 (b). In this figure, the plotted instance labels are the ones of their bags indeed, one could notice many positive (blue) instances in the negative (red) region. We have also experimented with a uniform threshold based on probabilistic estimates, as well as with predicting an instance-level threshold. While the former tends to underfit, the latter overfits. Our bag-level classifier targets an intermediate level of granularity and turns out to be the most robust in our experiments. 6
7 Table : Performance of various MIL algorithms on weak labeling benchmarks. The best result on each dataset is shown in bold. The second group of algorithms either not provide instance labels (MI-Kernel and migraph) or require a parameter that can be difficult to tune (ALP-SVM). SVR- SVM appears to give consistent results among algorithms that provide instance labels. The row denoted Est. WR gives the estimated witness rates of our method. Algorithm Musk- Musk-2 Elephant Tiger Fox CH-FD EMDD mi-svm MI-SVM MICA AW-SVM Ins-KI-SVM MI-Kernel 88.± ± ± ±.9 6.3±. migraph 88.9± ± ±.7 86.± ±.6 ALP-SVM SVR-SVM 87.9± ± ± ± ±3.5 Est. WR % 89.5 % 37.8 % 42.7 % % In order to test the effect of witness rates, different types of datasets are created by varying the rates over the range.,.2,...,. In this experiment we fix the hyperparameters C = 5 and use a Gaussian kernel with σ =.We show a trained likelihood ratio function in fig. 2 (c), estimated on the dataset shown in fig. 2 (b). Under the likelihood ratio, the positive examples are well separated from negatives. This illustrates how our proposed approach converts multiple-instance learning into the problem of deciding a one-dimensional threshold. Complete results on datasets with different witness rates are shown in fig. 2 (d) and (e). We give both bag classification and instance classification results. Our approach is referred to as SVR- SVM. BEST THRESHOLD refers to a method where the best threshold was chosen based on the full knowledge of training/test instance labels, i.e., the optimal performance our likelihood ratio estimator can achieve. Comparison is done with two other approaches, the mi-svm in [2] and the AW-SVM from [5]. SVR-SVM generally works well when the witness rate is not very low. From instance classification, one can see that the original mi-svm is only competitive when the witness rate is near this situation is close to a supervised SVM. With a deterministic annealing approach in [5], AW-SVM and mi-svm perform quite the opposite competitive when the witness rate is small but degrade when this is large. Presumably this is because deterministic annealing is initialized with the apriori assumption that datasets are multiple-instance i.e. has a small witness rate [5]. When the witness rate is large, annealing does not improve performance. On the contrary, the proposed SVR-SVM does not appear to be affected by the witness rate. With the same parameters used across all the experiments, the method self-adjusts to different witness rates. One could see the effect especially in fig. 2 (e): regardless of the witness rate, the instance error rate remains roughly the same. However, this is still inferior to our model based on the best threshold, which indicates that important room for improvement exists. 5.2 MIL Datasets The algorithm is evaluated on a number of popular MIL benchmarks. We use the common experimental setting, based on -fold cross-validation for parameter selection and we report the test results averaged over trials. The results are shown in Table, together with other competitive methods in from the literature [2, 5, ] (for some of these methods standard deviation estimates are not available). In our tests, the proposed SVR-SVM gives consistently good results among algorithms that provide instance-level labels. The only atypical case is Tiger, where the algorithm underperforms other methods. Overall, the performance of SVR-SVM is slightly worse than migraph and ALP-SVM. But we note that results in ALP-SVM are obtained by tuning the witness rate to the optimal value, which may be difficult in practical settings. The slightly lower performance compared to migraph suggests that we may be inferior in the bag classification step, which we already know is suboptimal. 7
8 Table 2: Results from 2 Newsgroups. The best result on each dataset is shown in bold, pairwise t-tests are performed to determine if the differences are statistically significantly. migraph is dominating in datasets, whereas SVR-SVM is dominating in 4. Dataset MI-Kernel migraph [] migraph (web) SVR-SVM Est. WR alt.atheism 6.2 ± ± ± ±.7.83 % comp.graphics 47. ± ± ± ± % comp.windows.misc 5. ± ±.5 7. ± ± % comp.ibm.pc.hardware 46.9 ± ± ± ± % comp.sys.mac.hardware 44.5 ± ± ± 78. ± % comp.window.x 5.8 ± ± ± ± % misc.forsale 5.8 ± ± ± 72.3 ± % rec.autos 52.9 ± ± ± ± % rec.motorcycles 5.6 ± ± ± ± % rec.sport.baseball 5.7 ± ± ± ± % rec.sport.hockey 5.3 ± ± ± 89.3 ± % sci.crypt 56.3 ± ± ± ± % sci.electronics 5.6 ± ± ± 9.5 ± % sci.med 5.6±.9 62.± ± ± % sci.space 54.7 ± ± ± ± % soc.religion.christian 49.2 ± ± ± ± % talk.politics.guns 47.7 ± ± ± ± % talk.politics.mideast 55.9 ± ± ±. 8.5 ± % talk.politics.misc 5.5 ± ± ± ± % talk.religion.misc 55.4 ± ± ±. 7.9 ± % 5.3 Text Categorization The text datasets are taken from []. These data have the benefit of being designated to have a small witness rate. Thus they serve as a better MIL benchmark compared to the previous ones. These are derived from the 2 Newsgroups corpus, with 5 positive and 5 negative bags for each of the 2 news categories. Each positive bag has around3% witness rate. We run -fold cross validation times on each dataset and compute the average accuracy and standard deviations,c is fixed to, ǫ to.2. Authors of [] reported recent results for this dataset on their website, which are vastly superior than the ones reported in the paper. Therefore, in Table 2 we included both results in the comparison, identified as migraph (paper) and migraph (website), respectively. Our SVR-SVM performs significantly better than MI-Kernel and migraph (paper). It is comparable with migraph (web), and offers a marginal improvement. It is interesting that even though we use a suboptimal second step, SVR-SVM fares well with the state-of-the-art. This shows the potential of methods based on likelihood ratio estimators for multiple instance learning. 6 Conclusion We have proposed an approach to multiple-instance learning based on estimating the likelihood ratio between the positive and the negative classes on instances. The MIL constraint is reformulated into a convex constraint on the likelihood ratio where a joint estimation of both the function and the target ratios on the training set is performed. Theoretically we justify that learning the likelihood ratio is Bayes-consistent and has desirable excess loss transform properties. Although we are not able to find the optimal classification threshold on the estimated ratio function, our proposed bag classifier based on such ratios obtains state-of-the-art results in a number of difficult datasets. In future work, we plan to explore transductive learning techniques in order to leverage the information in the learned ratio function and identify better threshold estimation procedures. Acknowledgements This work is supported, in part, by the European Commission, under a Marie Curie Excellence Grant MCEXT
9 References [] Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89 (997) 3 7 [2] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS. (23) [3] Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In: NIPS. (998) [4] Felzenszwalb, P.F., McAllester, D.A., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. (28) [5] Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: Labelme: A database and webbased tool for image annotation. IJCV 77(-3) (28) [6] Cour, T., Sapp, B., Nagle, A., Taskar, B.: Talking pictures: Temporal grouping and dialogsupervised person recognition. In: CVPR. (2) [7] Zeisl, B., Leistner, C., Saffari, A., Bischof, H.: On-line semi-supervised multiple-instance boosting. In: CVPR. (2) [8] Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: ICML. (22) [9] Tao, Q., Scott, S., Vinodchandran, N.V., Osugi, T.T.: Svm-based generalized multiple-instance learning via approximate box counting. In: ICML. (24) [] Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: ICML. (29) [] Cheung, P.M., Kwok, J.T.: A regularization framework for multiple-instance learning. In: ICML. (26) 93 2 [2] Fung, G., Dandar, M., Krishnapuram, B., Rao, R.B.: Multiple instance learning for computer aided diagnosis. In: NIPS. (27) [3] Mangasarian, O., Wild, E.: Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 37 (28) [4] Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multipleinstance learning. In: ICML. (22) [5] Gehler, P., Chapelle, O.: Deterministic annealing for multiple-instance learning. In: AISTATS. (27) [6] Dollár, P., Babenko, B., Belongie, S., Perona, P., Tu, Z.: Multiple component learning for object detection. In: ECCV. (28) [7] Li, Y.F., Kwok, J.T., Tsang, I.W., Zhou, Z.H.: A convex method for locating regions of interest with multi-instance learning. In: ECML. (29) [8] Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Annals of Statistics 27 (999) [9] Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (24) [2] Bartlett, P., Jordan, M.I., McAulliffe, J.: Convexity, classification and risk bounds. Journal of American Statistical Association (26) [2] Li, F., Sminchisescu, C.: Convex multiple instance learning by estimating likelihood ratio. Technical report, Institute for Numerical Simulation, University of Bonn (November 2) [22] Nguyen, X., Wainwright, M., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: NIPS. (27) [23] Liese, F., Vajda, I.: Convex Statistical Distances. Teubner VG (987) [24] Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. The Annals of Statistics 36 (28)
10 Supplementary material to the paper Convex Multiple-Instance Learning by Estimating Likelihood Ratio October 5, 2 Proof of Theorem Lemma If 2x assumes a beta distribution with parameter (, β), the expection of x, x 2, x 3, x 4 is E(x) = 2 β + E(x 2 ) = 2 (β + 2)(β + ) E(x 3 ) = 3 4 (β + )(β + 2)(β + 3) E(x 4 ) = 3 2 (β + )(β + 2)(β + 3)(β + 4) Proof: It is known that the expectation of the beta distribution B(, β) is E(x) = β+, therefore the first equation is immediate. For k >, the expectations E(x k ) of a beta distribution can be computed as: = = = = B(α, β) xk x α ( x) β dx Γ(α + β) Γ(α)Γ(β) xk x α ( x) β dx Γ(α + β)γ(α + k ) Γ(α + k + β)γ(α) Γ(α + β)γ(α + k ) Γ(α + k + β)γ(α) k 2 j= k 2 Γ(α + k + β) Γ(α + k )Γ(β) xk x α ( x) β dx x B(α + k, β) xα+k 2 ( x) β dx (α + j) α + k j= (α + β + j) (k 2) () α + k + β
11 The last step use the formulae B(x+y) = Γ(x)Γ(y) Γ(x+y) and Γ(x+) = xγ(x), as well as the expectation of the beta distribution E(x) = α α+β. If 2x instead of x assumes the beta distribution, then E(x k ) = 2 k E((2x)k ) = k 2 j= (α + j) α + k 2 k k 2 j= (α + β + j) α + k + β. (2) Substite α =, k = 2, 3, 4 we obtain the Lemma. Proof of Theorem : Proof: Define η(x i ) = Pr(y = x i, Pr(y = x i ) > Pr(y = x i )). The most adversarial scenario is Pr( 2η(x i ) 2ɛ) = cɛ β, because this maximizes uniformly the chance that the class-conditional probability is close to /2. This means 2η(x i ) assumes a beta distribution B(, β). It could be proved that the likelihood ratio satisfies 9 4 η η η η 2η2 + η From Lemma, we know that E(η 2 ) = 2 (β+2)(β+), E(η) = 2 have 3 β (β + )(β + 2) E( η η ) β + 4 2(β + )(β + 2) Furthermore, a bound on the variance can be computed as η V ( η ) = E[( η η η )2 ] E[( η )]2 β+, therefore we 4E(η 4 ) + 4E(η 3 ) + E(η 2 ) 9 (β + 5) 2 64 (β + ) 2 (β + 2) 2 ( (β + ) + 5β 83 8(β + )(β + 3)(β + 4) ) (β + )(β + 2) (β+)(β+2)(β+3) in which E(η 4 ) = 3 2 (β+)(β+2)(β+3)(β+4), E(η3 ) = 3 4 involves some quite complicated arithmetics and simplification. A simpler bound can be expressed as: η V ( η ) (4β + ) 2(β + ) 2 (2β + 3), which is an upper bound of the above bound. By Bennett s inequality [], n 2V (Z) log /δ Z i E(Z) + n n i= log /δ 3n. The last step 2
12 therefore with probability δ, η n Z i ne(z) + 2nV (Z) log /δ + i= log /δ. 3 Take Z = η and plug in the mean and variance bounds computed previously, we obtain the theorem. 2 Proof of Theorem 2 (a) It is straightforward to show that R(f) R = R(f) R(η /2) = E([f(X) >, η(x) < /2]( 2η(X)) +E([f(X) <, η(x) > /2](2η(X) )) = (R (f) R ) + R + (f) R + We apply Jensen s inequality to both summands and obtain ψ( [η(x) < /2](R(f) R ))) + ψ([η(x) /2](R(f) R )) E(ψ( [η(x) < /2, f(x) > ]( 2η(X))) +ψ([η(x) /2, f(x) < ](2η(X) ))) = E([sign(f(X) ) sign(η(x) /2)]ψ((2η(X) ))) E([sign(f(X) ) sign(η(x) /2)] ψ((2η(x) ))) = E([sign(f(X) ) sign(η(x) /2)](H (η(x)) H(η(X)))) ( ( )) = E [sign(f(x) ) sign(η(x) /2)] inf C(α, η(x)) H(η(X)) α,(α )(2η(X) ) E(C(α, η(x)) H(η(X))) = R C (f) R C (b) First note that since ψ() = and ψ is continuous, θ i implies ψ(θ i ). Thus we can replace condition (2) by (2 ) For any sequence (θ i ) in [, ], ψ(θ i ) implies θ i To see that () implies (2 ), let C be classification-calibrated, and let (θ i ) be a sequence that does not converge to. Define c = lim sup θ i >, and pass to a subsequence with lim θ i = c. Then by continuity lim ψ(θ i ) = ψ(c), and ψ(c) > by classification-calibration. Thus for the original seequence (θ i ), we see lim sup ψ(θ i ) >, so we cannot have ψ(θ i ). To see that (2 ) implies (3), suppose that R C (f i ) RC. By part (a) of the theorem ψ (R (f i ) R ) + ψ(r + (f i ) R+), since both ψ and ψ are convex and positive, (2 ) implies R(f i ) R. 3
13 Finally, to see (3) implies (), suppose C is not classification-calibrated. By definition, we can choose η /2 and a sequence α, α 2,... such that sign(α i (η /2)) = but c η (α i ) H(η). Fix x and choose the probability distribution P so that P X {x} = and P (Y = X = x) = η. Define a sequence of functions f i for which f i (x) = α i. Then lim R(f i ) > R, and this is true for any infinite subsequence. But C η (α i ) H(η) implies R C (f i ) RC. The contradiction proves the final part of the theorem. 3 Equivalent Constant Transforms Proposition For loss functions that satisfy a scaling equality L(k ŷ, k y) = k 2 L(ŷ, y), solving (4) with C = C and D i for each bag is equivalent with solving (3) with C = k2 k 2 C and k D i. Proof: A variable substitution of z = k y and v = k w in (3) would obtain the conclusion. The support vector regression (SVR) we use satisfies L(k ŷ, k y; ɛ) = k L(ŷ, y; ɛ k ). In this case, the uniform scaling on all the D i s is equivalent to subsequent changes in both C and ɛ. 4 Projection to the Bag Constraint We use the following procedure to project y + : Denote the sum of scores in a bag as s i = x + j B i y+ j. For each bag B i, we check if s i < D i, if not, we simply set all the negative y + j to. If s i < D i, we add D i s i B i to each y + j. This makes s i = D i. Then we set all the negative y + j to. If this makes s i > D i, we subtract equal amount on all the positive y + j, to make s i = D i. If this created additional negative y + j, the alternating projection process goes on until both conditions: s i = D i and y + are fulfilled. 5 Derivation of the SVM dual problem First rewrite the optimization in the canonical form: min w,b,y + 2 w 2 + C( n + +n i= (ξ i + ˆξ i )) s.t. ( w, φ(x i ) + b) y i ɛ + ξ i, for each x + i, x i y i ( w, φ(x i ) + b) ɛ + ˆξ i, for each x + i, x i x + j Bi y+ j D i B i y + j where y i = for x i and K(x, y) = φ(x), φ(y). 4
14 Setting α i, α+ i as the Lagrange multipliers of the first and second set of constraints, and η j the Lagrange multipliers of the bag constraints, we obtain the following KKT condition: w = i (α+ i α i )φ(x i) α + i, α i C η j α + i α i i (α+ i α i ) = α + i, α i, η j From the KKT conditions one could see that to make the equality constraint x + j Bi y+ j = D i B i hold, we need η j >, essentially, each α + i in the bag must be positive. This means that y i ( w, φ(x i ) + b) ɛ for all the items in the bag. Thus, to make the equality hold, all items in the positive bag need to be predicted smaller than their real value y i. Another issue is, since η must hold, α + i α i, since only one of α+ i and α i is nonzero, this means α i = for instances in positive bags. Therefore, the situation that α + i =, α i > could never exist. Back to the original optimization problem, this means ( w, φ(x i ) + b) y i ɛ + ξ i never holds in equality. Therefore, ( w, φ(x i ) + b) y i + ɛ. The dual problem is very similar with the original SVM, with only minor differences introduced from η: min α +,α,η 2 (α+ α ) T K(α + α ) +ɛ n i= (α+ i + α i ) k i= D i B i η i s.t. α + i, α i C n i= (α+ i α i ) = η j α + i α i, x i B j α + i, α i, η j From the dual problem, we could see that η j needs to be made larger to improve the result of the dual. Therefore, the algorithm would always prefer to choose η j >. From our previous analysis this essentially means that the equality constraint x + j B y+ i j = D i B i is desirable. Thus the algorithm would tend to make more vectors from positive bags as support vectors. And tend to make predictions smaller than the estimated value y i. It could be seen that the best solution of η j is η j = max(min xi B j (α + i α i ), ), with this in mind, the problem can still be solved using an SMO-type active set approach of selecting two α i for one iteration. 5
15 The SMO subproblem is thus: min α i,α 2 j [ ] [ ] [ ] Q si α i s j α ii Q ij si α i j + ɛ(α Q ij Q jj s j α i + α j ) j s.t. D k B k η k D k2 B k2 η k2 α i, α j C s i α i + s j α j doesn t change where s i and s j are the signs of α i and α j, respectively. The SMO working set selection would be the same as in the original SVM, i.e., find the maximal violating pair, except that the gradient now needs to take η into consideration. References [] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (963) 3 3 6
Convex Multiple-Instance Learning by Estimating Likelihood Ratio
Convex Multiple-Instance Learning by Estimating Likelihood Ratio Fuxin Li and Cristian Sminchisescu Institute for Numerical Simulation, University of Bonn {fuxin.li,cristian.sminchisescu}@ins.uni-bonn.de
More informationCo-training and Learning with Noise
Co-training and Learning with Noise Wee Sun Lee LEEWS@COMP.NUS.EDU.SG Department of Computer Science and Singapore-MIT Alliance, National University of Singapore, Singapore 117543, Republic of Singapore
More informationAdaBoost and other Large Margin Classifiers: Convexity in Classification
AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at
More informationCOMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification
COMP 55 Applied Machine Learning Lecture 5: Generative models for linear classification Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More informationLogistic Regression and Boosting for Labeled Bags of Instances
Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In
More informationMachine Learning. Support Vector Machines. Manfred Huber
Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data
More informationLecture 10 February 23
EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only
More informationStatistical Properties of Large Margin Classifiers
Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides
More informationMultiple Instance Learning via Gaussian Processes
Noname manuscript No. (will be inserted by the editor) Multiple Instance Learning via Gaussian Processes Minyoung Kim and Fernando De la Torre Received: date / Accepted: date Abstract Multiple instance
More informationCalibrated Surrogate Losses
EECS 598: Statistical Learning Theory, Winter 2014 Topic 14 Calibrated Surrogate Losses Lecturer: Clayton Scott Scribe: Efrén Cruz Cortés Disclaimer: These notes have not been subjected to the usual scrutiny
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationSecond Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs
Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More informationSupport Vector Machine for Classification and Regression
Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan
More informationPerceptron Revisited: Linear Separators. Support Vector Machines
Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationLecture 9: Large Margin Classifiers. Linear Support Vector Machines
Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationAdaptive Sampling Under Low Noise Conditions 1
Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationSurrogate Risk Consistency: the Classification Case
Chapter 11 Surrogate Risk Consistency: the Classification Case I. The setting: supervised prediction problem (a) Have data coming in pairs (X,Y) and a loss L : R Y R (can have more general losses) (b)
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationLINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning
LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationA Tutorial on Support Vector Machine
A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationCutting Plane Training of Structural SVM
Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationCS6375: Machine Learning Gautam Kunapuli. Support Vector Machines
Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this
More informationSupport Vector Machines, Kernel SVM
Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationL5 Support Vector Classification
L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationSUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS
SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it
More informationarxiv: v1 [stat.ml] 3 Sep 2014
Breakdown Point of Robust Support Vector Machine Takafumi Kanamori, Shuhei Fujiwara 2, and Akiko Takeda 2 Nagoya University 2 The University of Tokyo arxiv:409.0934v [stat.ml] 3 Sep 204 Abstract The support
More informationThe exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.
CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please
More informationSMO Algorithms for Support Vector Machines without Bias Term
Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002
More informationA convex relaxation for weakly supervised classifiers
A convex relaxation for weakly supervised classifiers Armand Joulin and Francis Bach SIERRA group INRIA -Ecole Normale Supérieure ICML 2012 Weakly supervised classification We adress the problem of weakly
More informationHow Good is a Kernel When Used as a Similarity Measure?
How Good is a Kernel When Used as a Similarity Measure? Nathan Srebro Toyota Technological Institute-Chicago IL, USA IBM Haifa Research Lab, ISRAEL nati@uchicago.edu Abstract. Recently, Balcan and Blum
More informationSparse Kernel Machines - SVM
Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support
More informationSupport Vector Machines on General Confidence Functions
Support Vector Machines on General Confidence Functions Yuhong Guo University of Alberta yuhong@cs.ualberta.ca Dale Schuurmans University of Alberta dale@cs.ualberta.ca Abstract We present a generalized
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationClass Prior Estimation from Positive and Unlabeled Data
IEICE Transactions on Information and Systems, vol.e97-d, no.5, pp.1358 1362, 2014. 1 Class Prior Estimation from Positive and Unlabeled Data Marthinus Christoffel du Plessis Tokyo Institute of Technology,
More informationSupport Vector Machines
Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationDynamic Time-Alignment Kernel in Support Vector Machine
Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationConvex Optimization and Support Vector Machine
Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationSurrogate regret bounds for generalized classification performance metrics
Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36 Motivation 2
More informationA Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes
A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationReview: Support vector machines. Machine learning techniques and image analysis
Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization
More informationMinimax risk bounds for linear threshold functions
CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability
More informationClassification objectives COMS 4771
Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationSupport Vector Regression with Automatic Accuracy Control B. Scholkopf y, P. Bartlett, A. Smola y,r.williamson FEIT/RSISE, Australian National University, Canberra, Australia y GMD FIRST, Rudower Chaussee
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationScale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract
Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationFoundation of Intelligent Systems, Part I. SVM s & Kernel Methods
Foundation of Intelligent Systems, Part I SVM s & Kernel Methods mcuturi@i.kyoto-u.ac.jp FIS - 2013 1 Support Vector Machines The linearly-separable case FIS - 2013 2 A criterion to select a linear classifier:
More informationDistributionally robust optimization techniques in batch bayesian optimisation
Distributionally robust optimization techniques in batch bayesian optimisation Nikitas Rontsis June 13, 2016 1 Introduction This report is concerned with performing batch bayesian optimization of an unknown
More information