A convex relaxation for weakly supervised classifiers


1 A convex relaxation for weakly supervised classifiers
Armand Joulin and Francis Bach
SIERRA group, INRIA - École Normale Supérieure
ICML 2012

2 Weakly supervised classification
We address the problem of weak supervision:
Instances are grouped into bags, and each bag is associated with an observable partial labelling.
We suppose that each instance possesses its own true latent label.

3 Example
Bags = images
Instances = pixels
Set of true labels = {horse, human, background}
Set of partial labellings = $2^{\{\text{horse},\ \text{human},\ \text{background}\}}$
Figure: partially labeled data, with y = {horse, background}, y = {background}, y = {human, background}, y = {horse, background}, next to fully labeled data.

4 Weakly supervised classification: Examples
Problems covered: semi-supervised learning, multiple instance learning, unsupervised learning.
Figure: examples of partial labellings for different weakly supervised problems, showing for each problem the set of partial labellings and the set of latent true labels (label symbols not recoverable).

5 Inferring the labels and learning the model
Figure: latent true labelling and classifier.
The goal is to jointly estimate these true latent labels and learn a classifier based on them.
This usually leads to non-convex formulations.
They are typically optimized with an EM procedure, which converges to local minima.

6 Our approach
We propose:
a general weakly supervised framework based on the likelihood of a probabilistic model,
a convex relaxation of the related cost function,
a dedicated optimization scheme.

7 Notations
Partial labelling set = {, }
Bags = $I$ bags of instances.
Each instance $n$ is associated with:
a feature $x_n \in \mathcal{X}$,
a weight $\pi_n$,
a partial label $y_n$, common to a bag,
a latent label $z_n$ depending on $y_n$.


9 Discriminative classifier
We consider a regularized discriminative classifier:
$$L(z, w^\top \phi(x) + b) + \frac{\lambda}{2} \|w\|_F^2$$
where the loss function $L(z, w, b)$ is the reweighted soft-max loss function:
$$L(z, w, b) = -\sum_{n=1}^{N} \pi_n \sum_{l \in \mathcal{L}} y_{nl} \sum_{p \in \mathcal{P}_l} z_{np} \log\left( \frac{\exp(w_p^\top \phi(x_n) + b_p)}{\sum_{k \in \mathcal{P}_l} \exp(w_k^\top \phi(x_n) + b_k)} \right)$$
Our cost function is equivalent to the log-likelihood of a multinomial model.
Legend: $(w, b)$ = parameters, $x$ = feature, $\phi$ = feature map, $y_n$ = label, $z_n$ = latent label, $\pi_n$ = weight.
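The reweighted soft-max loss can be evaluated numerically as in the sketch below. This is an illustration, not the authors' code. It assumes the partial label of each instance has already been turned into a boolean mask `allowed` (a hypothetical name) marking the classes in $\mathcal{P}_l$ for that instance, that `z` is a one-hot matrix supported on the allowed classes, and that `Phi` stacks the feature vectors $\phi(x_n)$ as rows.

```python
import numpy as np

def reweighted_softmax_loss(W, b, Phi, z, pi, allowed):
    """Weighted soft-max loss restricted to the classes allowed by each partial label.

    W: (d, P) weights, b: (P,) intercepts, Phi: (N, d) features,
    z: (N, P) one-hot latent labels, pi: (N,) instance weights,
    allowed: (N, P) boolean mask; every row needs at least one True.
    """
    scores = Phi @ W + b                                  # (N, P) class scores
    masked = np.where(allowed, scores, -np.inf)           # drop forbidden classes
    m = masked.max(axis=1, keepdims=True)                 # stabilise the exponentials
    lse = m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True))
    log_prob = np.where(allowed, scores - lse, 0.0)       # log soft-max over allowed classes
    return -np.sum(pi[:, None] * z * log_prob)

# Tiny example: zero parameters give uniform probabilities over allowed classes.
Phi = np.ones((2, 4))
W = np.zeros((4, 3))
b = np.zeros(3)
z = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pi = np.array([1.0, 1.0])
allowed = np.array([[True, True, True], [True, True, False]])
loss = reweighted_softmax_loss(W, b, Phi, z, pi, allowed)
```

With all parameters at zero, the first instance contributes log 3 (three allowed classes) and the second log 2 (two allowed classes).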


11 Cluster size balancing term
In unsupervised learning or MIL, assigning the same latent label to all the instances => perfect separation.

12 Cluster size balancing term
We penalize by the entropy of the proportion of instances per class and per bag (Joulin et al., 2010):
$$H(z) = -\sum_{i \in I} \sum_{k \in \mathcal{P}} \left( \sum_{n \in N_i} \pi_n z_{nk} \right) \log\left( \sum_{n \in N_i} \pi_n z_{nk} \right)$$
This penalization is related to a graphical model over x, z and y.
No additional parameter.
Legend: $i$ = bag, $n$ = instance, $x_n$ = feature, $I$ = set of bags, $z_n$ = latent label, $\pi_n$ = weight.
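A minimal sketch of the balancing term, assuming bags are encoded by an integer array `bag` (a hypothetical name) giving the bag index of each instance and that `z` is a one-hot label matrix. It shows why the term discourages the trivial solution: a collapsed labelling has lower entropy than a balanced one.

```python
import numpy as np

def cluster_balance_entropy(z, pi, bag):
    """H(z) = -sum_i sum_k m_ik log m_ik, with m_ik = sum_{n in N_i} pi_n z_nk."""
    H = 0.0
    for i in np.unique(bag):
        in_bag = bag == i
        mass = (pi[in_bag][:, None] * z[in_bag]).sum(axis=0)   # weighted class proportions in bag i
        mass = mass[mass > 0]                                   # convention: 0 log 0 = 0
        H -= np.sum(mass * np.log(mass))
    return H

pi = np.full(4, 0.25)
bag = np.zeros(4, dtype=int)                                    # a single bag
balanced = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
collapsed = np.array([[1, 0]] * 4, dtype=float)
H_balanced = cluster_balance_entropy(balanced, pi, bag)
H_collapsed = cluster_balance_entropy(collapsed, pi, bag)
```

Since the overall objective subtracts $H(z)$, the balanced labelling is favoured over the collapsed one when the classification loss is equal.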

13 Overall problem
Our overall problem is formulated as:
$$\min_{z \in \mathcal{P}} \min_{w, b} f(z, w, b) = L(z, w, b) - H(z) + \frac{\lambda}{2} \|w\|_F^2$$
Not jointly convex in $z$ and $(w, b)$.
Legend: $H$ = cluster size balancing term, $(w, b)$ = classifier parameters, $\mathcal{P}$ = set of latent labels, $L$ = cost function, $\lambda$ = regularization parameter, $z$ = latent label.

14 Overview
$$\min_{z \in \mathcal{P}} \min_{w, b} f(z, w, b) = L(z, w, b) - H(z) + \frac{\lambda}{2} \|w\|_F^2$$
We use a dual formulation based on Fenchel duality.
We reparametrize the problem following Guo and Schuurmans (2008).
Finally, we relax it to a semi-definite program (SDP).

15 Duality with Fenchel conjugate
The Fenchel conjugate of the log-partition is:
$$\log\Big( \sum_k \exp(t_k) \Big) = \max_{q \in S_P} \sum_k q_k t_k - \sum_k q_k \log(q_k)$$
The minimization in $(w, b)$ leading to the dual formulation is in closed form:
$$\min_{z \in \mathcal{P}} \max_{\substack{q \in S_P^N \\ (q - z)^\top \pi = 0}} -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(q_n) - \frac{1}{\lambda} \operatorname{tr}\big( (q - z)(q - z)^\top K \big)$$
where $K_{nm} = \langle \pi_n \phi(x_n), \pi_m \phi(x_m) \rangle$.
Legend: $i$ = bag, $n$ = instance, $x_n$ = feature, $\phi$ = feature map, $\pi_n$ = weight, $z_n$ = latent label, $S_P$ = simplex, $h$ = entropy, $(w, b)$ = classifier parameters.
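The variational form of the log-partition function can be checked numerically. The sketch below uses an arbitrary score vector `t` (values chosen only for illustration) and confirms that the maximum over the simplex is attained at the soft-max of `t`, while any other point of the simplex gives a smaller value.

```python
import numpy as np

def log_partition(t):
    """Numerically stable log(sum_k exp(t_k))."""
    m = t.max()
    return m + np.log(np.exp(t - m).sum())

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def variational_value(q, t):
    """q^T t + h(q), where h(q) = -sum_k q_k log q_k is the entropy."""
    q_pos = q[q > 0]                       # convention: 0 log 0 = 0
    return q @ t - np.sum(q_pos * np.log(q_pos))

t = np.array([0.5, -1.0, 2.0])             # arbitrary scores
q_star = softmax(t)                        # maximiser over the simplex
q_uniform = np.full(3, 1.0 / 3.0)          # another feasible point
gap_at_optimum = log_partition(t) - variational_value(q_star, t)
gap_at_uniform = log_partition(t) - variational_value(q_uniform, t)
```

Here `gap_at_optimum` is zero up to rounding and `gap_at_uniform` is strictly positive.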

16 Sources of non-convexity
$$\min_{z \in \mathcal{P}} \max_{\substack{q \in S_P^N \\ (q - z)^\top \pi = 0}} -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(q_n) - \frac{1}{\lambda} \operatorname{tr}\big( (q - z)(q - z)^\top K \big)$$
Two sources of non-convexity:
a constraint coupling a variable of a convex problem with a variable of a concave problem,
a function which is not jointly convex/concave in z/q.
Proposed solution (Guo and Schuurmans, 2008):
reparametrization in q,
SDP relaxation in z.

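17 Reparametrization in q
We reparametrize the problem by introducing an $N \times N$ matrix $\Omega$ such that: $q = \Omega z$
The constraints on $q$ become convex constraints over $\Omega$.
The problem becomes:
$$\min_{z \in \mathcal{P}} \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi \\ \Omega 1_N = 1_N}} -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(\omega_n z) - \frac{1}{\lambda} \operatorname{tr}\big( (I - \Omega) z z^\top (I - \Omega)^\top K \big)$$
Legend: $q$ = dual variables, $z$ = latent label, $\pi$ = weights.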
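The sketch below illustrates why the constraints on $\Omega$ are enough: any nonnegative $\Omega$ with $\Omega 1_N = 1_N$ and $\Omega^\top \pi = \pi$ yields a $q = \Omega z$ whose rows lie on the simplex and which satisfies $(q - z)^\top \pi = 0$, whatever the one-hot labelling $z$ is. The particular feasible $\Omega$ used here (a mix of the identity and $1_N \pi^\top$, valid when the weights sum to one) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 6, 3
pi = np.full(N, 1.0 / N)                        # weights, assumed to sum to one
z = np.eye(P)[rng.integers(0, P, size=N)]       # random one-hot labels, shape (N, P)

# A feasible Omega: convex combination of the identity and 1 pi^T.
Omega = 0.3 * np.eye(N) + 0.7 * np.outer(np.ones(N), pi)

q = Omega @ z                                   # reparametrised dual variable

rows_sum_to_one = np.allclose(Omega @ np.ones(N), 1.0)
weights_preserved = np.allclose(Omega.T @ pi, pi)
q_on_simplex = bool(np.all(q >= 0)) and np.allclose(q.sum(axis=1), 1.0)
coupling_holds = np.allclose((q - z).T @ pi, 0.0)
```

The coupling constraint between `q` and `z` no longer has to be imposed separately: it follows from the linear constraints on `Omega` alone.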

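18 Tight upper-bound on the entropy
The entropy in $\Omega$ and $z$ is not convex.
We use the following upper bound:
$$\sum_{i \in I} \sum_{n \in N_i} \pi_n h(\omega_n z) \le \sum_{n} \pi_n h(\omega_n) + H(z) + C_0.$$
This upper bound is tight for discrete values of $z$.
This leads to:
$$\min_{z \in \mathcal{P}} \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi \\ \Omega 1_N = 1_N}} \sum_{n=1}^{N} \pi_n h(\omega_n) - \frac{1}{\lambda} \operatorname{tr}\big( (I - \Omega) z z^\top (I - \Omega)^\top K \big)$$
Legend: $h$ = entropy, $z$ = latent label, $\pi$ = weights.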

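19 Reparametrization in z
The cost function depends on $z$ only through $zz^\top$ => $Z = zz^\top$
$$\min_{Z \in \mathcal{Z}} \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi \\ \Omega 1_N = 1_N}} \Big[ \sum_{n=1}^{N} \pi_n h(\omega_n) - \frac{1}{\lambda} \operatorname{tr}\big( (I - \Omega) Z (I - \Omega)^\top K \big) \Big]$$
The cost function is the maximum of linear functions of $Z$, which is convex.
We relax the set $\mathcal{Z}$ of feasible $Z$ to a convex set that contains it, the elliptope:
$$\mathcal{E}_N = \{ Z \in \mathbb{R}^{N \times N} \mid \operatorname{diag}(Z) = 1_N,\ Z \succeq 0 \}$$
Legend: $h$ = entropy, $z$ = latent label, $\pi$ = weights.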
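A small membership test for the elliptope, as a sketch. It assumes $z$ is a one-hot label matrix, so that $Z = zz^\top$ has a unit diagonal. The example shows that every such $Z$ lies in the elliptope, and that mixtures of two labellings do too even though they are no longer of the form $zz^\top$.

```python
import numpy as np

def in_elliptope(Z, tol=1e-8):
    """Check symmetry, unit diagonal and positive semi-definiteness."""
    symmetric = np.allclose(Z, Z.T, atol=tol)
    unit_diag = np.allclose(np.diag(Z), 1.0, atol=tol)
    psd = np.linalg.eigvalsh((Z + Z.T) / 2.0).min() >= -tol
    return bool(symmetric and unit_diag and psd)

z_a = np.eye(2)[[0, 0, 1, 1]]          # one labelling of four instances
z_b = np.eye(2)[[0, 1, 0, 1]]          # a different labelling
Z_a = z_a @ z_a.T
Z_b = z_b @ z_b.T
Z_mix = 0.5 * Z_a + 0.5 * Z_b          # convex combination: feasible for the relaxation

not_psd = np.array([[1.0, 2.0], [2.0, 1.0]])   # unit diagonal but eigenvalue -1
```

`in_elliptope` accepts `Z_a`, `Z_b` and `Z_mix`, and rejects `not_psd`.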

20 Overview
$$\min_{Z \in \mathcal{Z}} \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi \\ \Omega 1_N = 1_N}} \Big[ \sum_{n=1}^{N} \pi_n h(\omega_n) - \frac{1}{\lambda} \operatorname{tr}\big( (I - \Omega) Z (I - \Omega)^\top K \big) \Big]$$
It is a saddle-point problem where there is no explicit form for either the primal or the dual.
In a nutshell:
We use a proximal method with the Kullback-Leibler divergence for both the maximization and the minimization.
The overall complexity is $O(N^3)$.
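To give an idea of what a proximal step with the Kullback-Leibler divergence looks like, here is a deliberately simplified sketch on a single probability vector. It is not the scheme of the slides, which operates on $\Omega$ and $Z$ with their own constraints. The toy objective $g(q) = q^\top t + h(q)$, the step size `eta` and the function name are assumptions made for the illustration. The KL-proximal ascent step has the closed form $q^{+} \propto q \odot \exp(\eta \nabla g(q))$.

```python
import numpy as np

def kl_prox_ascent_step(q, grad, eta):
    """argmax over the simplex of <grad, p> - KL(p || q) / eta, in closed form."""
    logits = np.log(q) + eta * grad
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

t = np.array([0.5, -1.0, 2.0])             # arbitrary scores
q = np.full(3, 1.0 / 3.0)                  # start from the uniform distribution
for _ in range(60):
    grad = t - np.log(q) - 1.0             # gradient of q^T t - sum_k q_k log q_k
    q = kl_prox_ascent_step(q, grad, eta=0.5)

target = np.exp(t - t.max())
target /= target.sum()                     # soft-max of t, the known maximiser
```

The multiplicative update keeps the iterate strictly positive and normalised without any explicit projection, which is the usual reason for pairing simplex-like constraints with a KL proximal term.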

21 Rounding
We use k-means clustering on the eigenvectors of the solution $Z$ associated with the k highest eigenvalues.
We then use an EM procedure on the original problem to obtain a better solution.
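The rounding step can be sketched as follows. The eigenvector extraction follows the description above. The small k-means routine, its deterministic farthest-point initialisation and the function names are assumptions made to keep the example self-contained, and the EM refinement is not shown.

```python
import numpy as np

def top_k_eigenvectors(Z, k):
    """Eigenvectors of the symmetrised Z for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh((Z + Z.T) / 2.0)    # eigenvalues in ascending order
    return vecs[:, -k:]

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])         # farthest point from current centers
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)               # assign each row to its nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def round_relaxed_solution(Z, k):
    return kmeans(top_k_eigenvectors(Z, k), k)

# Example: a Z built from a known two-cluster labelling is recovered up to label order.
z_true = np.eye(2)[[0, 0, 0, 1, 1, 1]]
Z = z_true @ z_true.T
labels = round_relaxed_solution(Z, 2)
```

The recovered labels identify the two groups of instances. Which group is called 0 or 1 is arbitrary.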

22 Overview
Toy example on an unsupervised problem
MIL and SSL results on classical datasets

23 Unsupervised classification
Figure, panels (a) to (e): (b) $K = xx^\top$, (c) the matrix $Z$ obtained with (Bach and Harchaoui, 2007), (d) with no intercept and (e) with our method.
Bach and Harchaoui (2007) use the square loss instead of the logistic loss.
Our method with no intercept is similar to Guo and Schuurmans (2008).

24 Multiple-instance learning
Algorithm | Musk1 | Tiger | Elephant | Fox | Trec1
C. k-nn (Wang and Zucker, 2000) | … | … | … | … | …
EM-DD (Zhang and Goldman, 2001) | … | … | … | … | …
mi-svm (Andrews et al., 2003) | … | … | … | … | …
MI-SVM (Andrews et al., 2003) | … | … | … | … | …
PPMM Kernel (Wang et al., 2008) | … | … | … | … | …
Rand. init / Uniform | … | … | … | … | …
Rand. init / Weight | … | … | … | … | …
No inter. / Uniform | 75.0 ± … | … ± … | … ± … | … ± … | … ± 5.2
No inter. / Weight | 77.8 ± … | … ± … | … ± … | … ± … | … ± 5.6
Ours / Uniform | 84.4 ± … | … ± … | … ± … | … ± … | … ± 4.7
Ours / Weight | 87.7 ± … | … ± … | … ± … | … ± … | … ± 6.2
Figure: Accuracy of our approach and of standard methods for MIL. We evaluate our method with and without the intercept and with two types of weights. In bold, the significantly best performances.

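25 Semi-supervised learning
l=10
Dataset | Lin. | Nonlin. | Ent-Reg. | Ours (Linear) | Ours (Nonlinear)
Digit | … | … | … | … ± … | … ± 2.9
BCI | … | … | … | … ± … | … ± 1.1
g241c | … | … | … | … ± … | … ± 0.4
g241d | … | … | … | … ± … | … ± 10.1
USPS | … | … | … | … ± … | … ± 0.5
l=100
Dataset | Lin. | Nonlin. | Ent-Reg. | Ours (Linear) | Ours (Nonlinear)
Digit | … | … | … | … ± … | … ± 1.0
BCI | … | … | … | … ± … | … ± 0.8
g241c | … | … | … | … ± … | … ± 0.7
g241d | … | … | … | … ± … | … ± 3.0
USPS | … | … | … | … ± … | … ± 0.2
Figure: Comparison in accuracy on SSL databases with methods proposed in Chapelle et al. (2006). In bold, the significantly best performances.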

26 Conclusion
A general weakly supervised framework
A convex formulation
Competitive performance on different problems
Limitation: cannot scale to more than 10,000 instances.
How it should be used:
on a subset of the data, as a robust initialization of an EM procedure,
after a dimension reduction algorithm (such as k-means).

27 Thank you.
