Active Learning for Sparse Bayesian Multilabel Classification


1 Active Learning for Sparse Bayesian Multilabel Classification. Deepak Vasisht, MIT & IIT Delhi; Andreas Damianou, University of Sheffield; Manik Varma, MSR India; Ashish Kapoor, MSR Redmond.

2-3 Multilabel Classification: Given a set of datapoints, the goal is to annotate them with a set of labels.

4 Multilabel Classification: Each datapoint is a feature vector $x_i \in \mathbb{R}^d$, where $d$ is the dimension of the feature space.

5-6 Multilabel Classification: [Figure: an example image annotated with the labels Iraq, Flowers, Human, Brick, Sea, Sun, Sky.]

7-8 Training

9 Training: WikiLSHTC has 325k labels. Good luck with that!!

10-11 Training Is Expensive: Training data can also be very expensive to obtain, e.g., genomic or chemical data, and getting each label incurs additional cost. We need to reduce the required training data.

12-14 Active Learning: [Figure: an N x L matrix with datapoints as rows and labels (Iraq, Flowers, Sun, Sky, ...) as columns; a few entries are observed as 1s and 0s, the rest are unknown.]

15 Active Learning: Which datapoints should I label?

16 Active Learning: For a particular datapoint, which labels should I reveal?

17 Active Learning: Can I choose datapoint-label pairs to annotate?

18 In this talk: An active learner for multilabel classification that answers all three questions above, is computationally cheap, is non-myopic and near-optimal, incorporates label sparsity, and achieves higher accuracy than the state of the art.

19 Classification

20-23 Classification Model (*Kapoor et al, NIPS 2012): [Figure: graphical model, built up in stages. The feature vector $x_i$ is mapped through a weight matrix $W$ to a compressed space $z_i^1, \ldots, z_i^k$, which is decoded into the labels $y_i^1, \ldots, y_i^L$; sparsity parameters $\alpha_i^1, \ldots, \alpha_i^L$ sit above the labels.]

24 Classification Model: Potentials. The regression potential ties the features to the compressed space: $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2}$.

25 Classification Model: Potentials. The decoding potential ties the labels to the compressed space: $g(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\sigma^2}}$, where $\Phi$ is the compression matrix.
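As a quick sanity check, here is a minimal numerical sketch of the two potentials; the function names and the default $\sigma$ are ours, and $\Phi$ is treated as a fixed random compression matrix.

```python
import numpy as np

def f_potential(W, x, z):
    """Regression potential f_x(W, z) = exp(-||W^T x - z||^2)."""
    r = W.T @ x - z
    return np.exp(-r @ r)

def g_potential(Phi, y, z, sigma=1.0):
    """Decoding potential g(y, z) = exp(-||Phi y - z||^2 / (2 sigma^2))."""
    r = Phi @ y - z
    return np.exp(-(r @ r) / (2 * sigma ** 2))

# Toy dimensions: d features, k compressed dimensions, L labels.
d, k, L = 8, 3, 12
rng = np.random.default_rng(0)
W, Phi = rng.normal(size=(d, k)), rng.normal(size=(k, L))
x, y = rng.normal(size=d), rng.normal(size=L)
z = W.T @ x          # at z = W^T x the regression potential is maximal (= 1)
print(f_potential(W, x, z), g_potential(Phi, y, z))
```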

26 Classification Model: Priors. Each label score gets a zero-mean Gaussian prior: $y_i^j \sim \mathcal{N}(0, (\alpha_i^j)^{-1})$.

27 Classification Model: Priors. Each precision gets a Gamma prior: $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$.

28 Sparsity Priors: $a_0 = 10^{-6}$, $b_0 = 10^{-6}$, i.e., a broad Gamma prior whose induced marginal on $y_i^j$ is sharply peaked at zero.
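To see why these hyperparameters induce sparsity, marginalize out the precision (a standard Gaussian scale-mixture step, not spelled out on the slide):

$$p(y_i^j) = \int \mathcal{N}(y_i^j;\, 0,\, \alpha^{-1})\, \Gamma(\alpha;\, a_0, b_0)\, d\alpha \;\propto\; \Big(b_0 + \tfrac{(y_i^j)^2}{2}\Big)^{-(a_0 + 1/2)}$$

This is a Student-t density: as $a_0, b_0 \to 0$ it becomes sharply peaked at zero while keeping heavy tails, so most label scores are driven to zero and a few can stay large.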

29 Classification Model (summary): $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2}$; $g(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\sigma^2}}$; $y_i^j \sim \mathcal{N}(0, (\alpha_i^j)^{-1})$; $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$.

30 Classification Model: Problem: Exact inference is intractable.

31-34 Inference: Variational Bayes. [Figure: the posterior is approximated factor by factor: Gaussian approximations over $W$ and the compressed variables $z_i$, a Gaussian approximation over the labels $y_i$, and a Gamma approximation over the sparsity parameters $\alpha_i$.]

35 Inference: Variational Bayes. [Figure: the same model, highlighting the approximate Gaussian over the labels.]

36 Active Learning Criteria. Entropy is a measure of uncertainty. For a random variable $X$, $H(X) = -\sum_i P(x_i) \log P(x_i)$. Entropy-based selection picks points far apart from each other. For a Gaussian process, $H = \frac{1}{2}\log(|\Sigma|) + \text{const}$.

37 Active Learning Criteria. Mutual information measures the reduction in uncertainty over the unlabeled space: $MI(A, B) = H(A) - H(A \mid B)$. It has been used successfully in past work on regression.
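Since everything here is (approximately) Gaussian, both criteria reduce to log-determinants. A small sketch of how they might be computed (function names are ours):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """H(N(mu, Sigma)) = 0.5 * log det(2*pi*e*Sigma)."""
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

def gaussian_mi(Sigma, A, B):
    """MI(A; B) = H(A) - H(A | B) for jointly Gaussian index sets A and B.
    The conditional covariance of A given B is the Schur complement."""
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    S_A_given_B = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)
    return gaussian_entropy(S_AA) - gaussian_entropy(S_A_given_B)
```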

38 Active Learning: Mutual Information. We have already modeled the distribution over the labels $Y$ as a Gaussian process. The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$.

39 Active Learning: Mutual Information. Same objective as above. Problem: Variance is not preserved across the layers of the model, so the layered variational approximation does not yield reliable variances over $Y$.

40 Idea: Collapsed Variational Bayes

41-42 Idea: Collapsed Variational Bayes. [Figure: the full model with its potentials and priors, as on slide 29.]

43-44 Idea: Collapsed Variational Bayes. Integrate out the intermediate variables to get a Gaussian distribution over $Y$, and use Variational Bayes for the sparsity terms.

45 Active Learning: Mutual Information. With the collapsed model, the distribution over the labels $Y$ is again Gaussian, and the objective is as before: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$.

46 Active Learning: Mutual Information. Problem: Computing the mutual information still needs exponential time.

47 Solution: Approximate Mutual Information. Approximate the final distribution over $Y$ by a Gaussian, and use the Gaussian to estimate the mutual information. Theorem 1: $\hat{MI} \to MI$ as $a_0, b_0 \to 0$.

48 Active Learning: Mutual Information. Problem: The subset selection problem is NP-complete.

49 Solution: Use Submodularity. Under some weak conditions, the objective is submodular. Submodularity ensures that the greedy solution is within a constant factor $(1 - 1/e)$ of the optimal solution.

50 Algorithm. Input: feature vectors for a set of unlabeled instances $U$ and a budget $n$. Iteratively add to the labeled set $A$ the datapoint $x$ that leads to the maximum increase in mutual information: $x^* \leftarrow \arg\max_{x \in U \setminus A} \hat{MI}(A \cup \{x\}) - \hat{MI}(A)$.
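Putting the pieces together, here is a minimal sketch of the greedy loop, assuming the collapsed Gaussian over $Y$ has already been computed and its covariance `Sigma` is available. It uses the variance-ratio form of the Gaussian MI gain from Krause et al.; all function names are ours.

```python
import numpy as np

def conditional_variance(Sigma, x, S):
    """Var(y_x | y_S) under a joint Gaussian with covariance Sigma."""
    if not S:
        return Sigma[x, x]
    K_SS = Sigma[np.ix_(S, S)]
    k_xS = Sigma[x, S]
    return Sigma[x, x] - k_xS @ np.linalg.solve(K_SS, k_xS)

def greedy_mi_selection(Sigma, budget):
    """Greedily add the point with maximum MI gain MI(A + x) - MI(A).
    For Gaussians this gain is monotone in the ratio of the variance of x
    conditioned on A to its variance conditioned on the rest of the pool."""
    n = Sigma.shape[0]
    A = []
    for _ in range(budget):
        candidates = [x for x in range(n) if x not in A]
        best_x, best_score = None, -np.inf
        for x in candidates:
            rest = [r for r in candidates if r != x]
            score = (conditional_variance(Sigma, x, A)
                     / conditional_variance(Sigma, x, rest))
            if score > best_score:
                best_x, best_score = x, score
        A.append(best_x)
    return A
```

Because the objective is submodular, a lazy-greedy variant that caches marginal gains would return the same selection at a fraction of the cost.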

51 Performance Evaluation

52 Datasets:
Dataset    Type
Yeast      Biology
MSRC       Image
Medical    Text
Enron      Text
Mediamill  Video
RCV1       Text
Bookmarks  Text
Delicious  Text
[The original table also listed instance, feature, and label counts for each dataset.]

53 Setup: Unlabeled pool size: 4000 points; test size: 2000 points. For smaller datasets, the entire dataset was used as the unlabeled pool, with testing on all unlabeled data. Initial seed size: 500 points.

54 Compared Algorithms. MIML: mutual information for multilabel classification (the proposed method). Uncert: uncertainty sampling (entropy-based). Rand: random sampling. Li-Adaptive: SVM-based adaptive active learning (Li et al, IJCAI 2013).

55 Traditional Active Learning: [Figure: the datapoint-label matrix.] Which datapoints should I label?

56-59 Traditional Active Learning: [Plots: mean precision vs. number of labeled points on Delicious (983 labels) and Yeast (14 labels), comparing Rand, Uncert, Li-Adaptive, and MIML.]

60 Active Learning: [Figure: the datapoint-label matrix.] For a particular datapoint, which labels should I reveal?

61 Active Diagnosis

62-63 Active Diagnosis: [Plots: F1 score vs. number of labels revealed on RCV1, comparing Rand, Uncert, and MIML.]

64 Generalized Active Learning: [Figure: the datapoint-label matrix.] Can I choose datapoint-label pairs to annotate?

65-67 Generalized Active Learning: [Plots: F1 score vs. number of annotated points on RCV1, comparing Rand, Uncert, and MIML.]

68 Time Complexity:
Dataset    Labels  MIML    Li-Adaptive
Yeast      14      3m 25s  1m 54s
Mediamill  101     m 29s   54m 35s
RCV1       103     m 45s   37m 35s
Bookmarks  208     m 58s   3h 57m
Delicious  983     1h 11m  20h 15m

69 Related Work. SVM-based active learning: Li et al [IJCAI 2013], Yang et al [KDD 2009], Esuli et al [ECIR 2009], Li et al [ICIP 2004]. Mutual information: Krause et al [UAI 2005], Krause et al [JMLR 2008], Singh et al [JAIR 2009].

70 Conclusion. Proposed mutual-information-based active learning for multilabel classification. Used Collapsed Variational Bayes to infer variances. Gave a theoretical analysis of the mutual information approximation and a near-optimality guarantee for the greedy selection. Showed significant empirical improvements over the state of the art.
