Multi-label learning from batch and streaming data


1 Multi-label learning from batch and streaming data
Jesse Read, Télécom ParisTech / École Polytechnique
Summer School on Mining Big and Complex Data, 5 September 2016, Ohrid, Macedonia

2 Introduction. Binary classification: given an input x (an image), predict y ∈ {sunset, non-sunset}

3 Introduction. Multi-class classification: given an input x (an image), predict y ∈ {sunset, people, foliage, beach, urban}

4 Introduction. Multi-label classification: given an input x (an image), predict y ⊆ {sunset, people, foliage, beach, urban}; equivalently, y ∈ {0, 1}^5 is a binary indicator vector, i.e., multiple labels per instance instead of a single label.

5 Introduction
              K = 2         K > 2
  L = 1       binary        multi-class
  L > 1       multi-label   multi-output
Figure: For L target variables (labels), each taking one of K values. Multi-output is also known as multi-target, multi-dimensional.
multi-output can be cast to multi-label, just as multi-class can be cast to binary
the set of labels (L) is predefined (in contrast to tagging/keyword assignment)

6 Introduction. Table: Academic articles containing the phrase "multi-label classification" (Google Scholar, 2016), by year: occurrences in text and in title.

7 Single-label vs. Multi-label
Table: Single-label, Y ∈ {0, 1}: each instance (X1, ..., X5) carries a single binary target Y; the final (test) instance's Y is unknown (?).
Table: Multi-label, Y ⊆ {λ1, ..., λL}: each instance carries a set of labels, e.g., {λ2, λ3}, {λ1}, {λ2}, {λ1, λ4}, {λ4}; the test instance's labelset is unknown (?).

8 Single-label vs. Multi-label
Table: Single-label, Y ∈ {0, 1} (as before).
Table: Multi-label, [Y1, ..., YL] ∈ {0, 1}^L: the labelset is encoded as L binary columns Y1, Y2, Y3, Y4; for the test instance all four values are unknown (????).

9 Outline
1 Introduction
2 Applications
3 Methods
4 Label Dependence
5 Multi-label Classification in Data Streams

10 Text Categorization and Tag Recommendation For example, the IMDb dataset: Textual movie plot summaries associated with genres (labels). Also: Bookmarks, Bibtex, del.icio.us datasets.

11 Text Categorization and Tag Recommendation. For example, the IMDb dataset: textual movie plot summaries associated with genres (labels). Also: Bookmarks, Bibtex, del.icio.us datasets.
Binary representation: features X1, X2, ..., XD indicate word occurrences (abandoned, accident, ..., violent, wedding); labels Y1, Y2, ..., YL indicate genres (horror, romance, ..., comedy, action).

12 Labelling Images. Images are labelled to indicate multiple concepts, multiple objects, multiple people. E.g., associating scenes with the concepts {beach, sunset, foliage, field, mountain, urban}.

13 Labelling Audio. For example, labelling music with emotions, concepts, etc.: {amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, angry-aggressive}.

14 Related Tasks
multi-output classification: outputs are nominal, e.g., targets rank, gender, group over inputs X1, ..., X5
multi-output regression: outputs are real-valued, e.g., targets price, age, percent
label ranking, i.e., preference learning: λ3 ≻ λ1 ≻ λ4 ≻ ... ≻ λ2
aka multi-target, multi-dimensional

15 Related Areas
multi-task learning: multiple tasks, shared representation
sequential learning: predict across time indices instead of across label indices (y1, y2, y3, y4 over x1, x2, x3, x4); each label may have a different input
structured output prediction: assume a particular structure among outputs, e.g., a pixel grid or a hierarchy

16 Streaming Multi-label Data. Many advanced applications must deal with data streams:
data arrives continuously, potentially infinitely
predictions must be made immediately
expect concept drift
For example: demand prediction, intrusion detection, pollution detection.

17 Demand Prediction. Outputs (labels) represent the demand at multiple points.
Figure: Stops in the greater Helsinki region; the Kutsuplus taxi service could be called to any of these.
We are interested in predicting, for each label in [y1, ..., yL],
  p(y_j = 1 | x): the probability of demand at the j-th node

18 Localization and Tracking. Outputs represent points in space which may contain an object (y_j = 1) or not (y_j = 0). Observations are given as x.
Figure: Modelled on a real-world scenario; a room with a single light source and a number of light sensors.
We are interested in predicting, for each label in [y1, ..., yL],
  y_j = 1 if the j-th tile is occupied

19 Route/Destination Forecasting. Figure: personal nodes of a traveller and the predicted trajectory.
L: the number of geographic points of interest
x: observed data (e.g., GPS, sensor activity, time of day)
p(y_j = 1 | x): the probability an object is present at the j-th node
{x_i, y_i}_{i=1}^N: training data

20 Outline
1 Introduction
2 Applications
3 Methods
4 Label Dependence
5 Multi-label Classification in Data Streams

21 Multi-label Classification. Model each label separately (x → y1, y2, y3, y4), and then
  ŷ_j = h_j(x) = argmax_{y_j ∈ {0,1}} p(y_j | x), for each index j = 1, ..., L
so that
  ŷ = h(x) = [ŷ1, ..., ŷ4]
           = [argmax_{y1 ∈ {0,1}} p(y1 | x), ..., argmax_{y4 ∈ {0,1}} p(y4 | x)]
           = [f1(x), ..., f4(x)]
           = f(Wx)
This is the Binary Relevance method (BR).

22 BR Transformation. Transform the dataset (inputs X with labels Y1, Y2, Y3, Y4 over instances x(1), ..., x(5)) into L separate binary problems, one for each label: (X → Y1), (X → Y2), (X → Y3), (X → Y4), and train with any off-the-shelf binary base classifier.
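As a concrete illustration, a minimal Python sketch of the BR transformation, assuming scikit-learn conventions (the class name and the choice of logistic regression as base learner are illustrative, not prescribed by the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class BinaryRelevance:
        def __init__(self, base=LogisticRegression):
            self.base = base          # any off-the-shelf binary classifier
            self.models = []

        def fit(self, X, Y):
            # Y is an (N, L) binary matrix: train one model per label column
            self.models = [self.base().fit(X, Y[:, j]) for j in range(Y.shape[1])]
            return self

        def predict(self, X):
            # Stack the L independent binary predictions into label vectors
            return np.column_stack([h.predict(X) for h in self.models])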

23 Why Not Binary Relevance? BR ignores label dependence, i.e., it assumes
  p(y | x) = ∏_{j=1}^{L} p(y_j | x)
which may not always hold!
Example (Film Genre Classification): p(y_romance | x) ≠ p(y_romance | x, y_horror)

24 Why Not Binary Relevance? BR ignores label dependence, i.e., it assumes p(y | x) = ∏_{j=1}^{L} p(y_j | x), which may not always hold!
Table: Average predictive performance (5-fold CV, EXACT MATCH) on Music, Scene, Yeast, Genbase, Medical, Enron and Reuters: a chain-based method (MCC) outperforms BR, e.g., 0.37 (MCC) vs. 0.29 (BR) on Reuters.

25 Modelling Label Dependence: Classifier Chains. Cascade the labels (x → y1 → y2 → y3 → y4), and
  p(y | x) = p(y1 | x) ∏_{j=2}^{L} p(y_j | x, y1, ..., y_{j-1})
  ŷ = argmax_{y ∈ {0,1}^L} p(y | x)

26 Bayes Optimal CC. Example: evaluate the probability p(y = [·, ·, ·] | x) of each complete label path, for all 2^L combinations, and return argmax_y p(y | x). But the search space of {0, 1}^L paths is too large.

27 CC Transformation. Similar to BR: make L binary problems, but include the previous labels as feature attributes: (X → Y1), (X, Y1 → Y2), (X, Y1, Y2 → Y3), (X, Y1, Y2, Y3 → Y4), and, again, apply any classifier (not necessarily a probabilistic one)!

28 Greedy CC. L classifiers for L labels. For a test instance x̃, classify
1 ŷ1 = h1(x̃)
2 ŷ2 = h2(x̃, ŷ1)
3 ŷ3 = h3(x̃, ŷ1, ŷ2)
4 ŷ4 = h4(x̃, ŷ1, ŷ2, ŷ3)
and return ŷ = [ŷ1, ..., ŷL]
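In the same spirit, a minimal sketch of a greedy classifier chain, again assuming scikit-learn conventions; training conditions on the true earlier labels, while inference conditions on the predicted ones:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class ClassifierChain:
        def __init__(self, base=LogisticRegression):
            self.base = base
            self.models = []

        def fit(self, X, Y):
            # Augment the input with the true values of the earlier labels
            self.models = [self.base().fit(np.hstack([X, Y[:, :j]]), Y[:, j])
                           for j in range(Y.shape[1])]
            return self

        def predict(self, X):
            Y_hat = np.zeros((X.shape[0], len(self.models)), dtype=int)
            for j, h_j in enumerate(self.models):
                # Greedy inference: condition on the labels predicted so far
                Y_hat[:, j] = h_j.predict(np.hstack([X, Y_hat[:, :j]]))
            return Y_hat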

29 Example. For a test instance x̃ (chain x → y1 → y2 → y3): ŷ = h(x̃) = [?, ?, ?]

30 Example.
1 ŷ1 = h1(x̃) = argmax_{y1 ∈ {0,1}} p(y1 | x̃)  (here the two probabilities are 0.6 and 0.4)
ŷ = h(x̃) = [ŷ1, ?, ?]

31 Example.
1 ŷ1 = h1(x̃) = argmax_{y1 ∈ {0,1}} p(y1 | x̃)
2 ŷ2 = h2(x̃, ŷ1)
ŷ = h(x̃) = [ŷ1, ŷ2, ?]

32 Example. ŷ = h(x̃) = [ŷ1, ŷ2, ŷ3], via
1 ŷ1 = h1(x̃) = argmax_{y1 ∈ {0,1}} p(y1 | x̃)
2 ŷ2 = h2(x̃, ŷ1)
3 ŷ3 = h3(x̃, ŷ1, ŷ2)

33 Example. As above: ŷ1 = h1(x̃), ŷ2 = h2(x̃, ŷ1), ŷ3 = h3(x̃, ŷ1, ŷ2), giving ŷ = h(x̃) = [ŷ1, ŷ2, ŷ3].
Improves over BR; similar build time (if L < D); able to use any off-the-shelf classifier for h_j; parallelizable.
But errors may be propagated down the chain.

34 Example: Monte-Carlo Search for CC. Sample T label vectors along the chain, computing each probability as a product of the per-label probabilities (e.g., 0.252 and 0.288 for two sampled paths), and return argmax_{y_t} p(y_t | x).
Tractable, with similar accuracy to (Bayes Optimal) PCC.
Can use other search algorithms, e.g., beam search.
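A sketch of such Monte-Carlo inference over an already-trained chain (models as in the ClassifierChain sketch above); it assumes each base model exposes predict_proba with two classes, as scikit-learn probabilistic classifiers do:

    import numpy as np

    def mc_inference(models, x, T=50, rng=np.random.default_rng(0)):
        best_y, best_p = None, -1.0
        for _ in range(T):
            y, p = [], 1.0
            for h_j in models:
                xj = np.concatenate([x, y]).reshape(1, -1)
                p1 = h_j.predict_proba(xj)[0, 1]   # p(y_j = 1 | x, y_1..j-1)
                y_j = int(rng.random() < p1)       # sample the j-th label
                p *= p1 if y_j else (1.0 - p1)
                y.append(y_j)
            if p > best_p:                         # keep the most probable path
                best_y, best_p = y, p
        return np.array(best_y), best_p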

35 Does Label-order Matter? Consider the chains x → y1 → y2 → y3 → y4 vs. x → y4 → y2 → y3 → y1. In theory, the models are equivalent, since
  p(y | x) = p(y1 | x) p(y2 | y1, x) = p(y2 | x) p(y1 | y2, x)
but we are estimating p from finite and noisy data; thus
  p̂(y1 | x) p̂(y2 | y1, x) ≠ p̂(y2 | x) p̂(y1 | y2, x)
and in the greedy case,
  p̂(y2 | y1, x) ≠ p̂(y2 | ŷ1, x) = p̂(y2 | y1 = argmax_{y1} p̂(y1 | x), x)

36 Does Label-order Matter? In theory, the models are equivalent, since p(y | x) = p(y1 | x) p(y2 | y1, x) = p(y2 | x) p(y1 | y2, x), but we are estimating p from finite and noisy data; thus p̂(y1 | x) p̂(y2 | y1, x) ≠ p̂(y2 | x) p̂(y1 | y2, x), and in the greedy case, p̂(y2 | y1, x) ≠ p̂(y2 | ŷ1, x).
The approximations cause high variance on account of error propagation. We can
1 reduce variance with an ensemble of classifier chains
2 search the space of chain orders (a huge space, but a little search makes a difference)

37 Label Powerset Method (LP). One multi-class problem (taking many values): x → (y1, y2, y3, y4), and
  ŷ = argmax_{y ∈ Y} p(y | x), where Y ⊆ {0, 1}^L
Each value is a label vector, 2^L in total, but typically only those occurring in the training set are used (in practice, |Y| ≤ the size of the training set, and |Y| ≪ 2^L).

38 Label Powerset Method (LP). Transform the dataset (inputs X with labels Y1, Y2, Y3, Y4) into a single multi-class problem whose target takes up to 2^L possible values (one per distinct label vector), and train any off-the-shelf multi-class classifier.
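A minimal sketch of the LP transformation, again assuming scikit-learn conventions; each distinct label vector observed in training becomes one multi-class value:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class LabelPowerset:
        def __init__(self, base=LogisticRegression):
            self.base = base

        def fit(self, X, Y):
            # Map each distinct row of Y to a single class index
            self.classes_, y_mc = np.unique(Y, axis=0, return_inverse=True)
            self.model = self.base().fit(X, y_mc)
            return self

        def predict(self, X):
            # Predict a class index, then map it back to its label vector
            return self.classes_[self.model.predict(X)]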

39 Issues with LP
complexity (up to 2^L combinations)
imbalance: few examples per class label
overfitting: how to predict a new value?
Example: in the Enron dataset, 44% of labelsets are unique (to a single training example or test instance). In the del.icio.us dataset, 98% are unique.

40 Meta Labels. Improving the label-powerset approach:
decomposition of the label set into M subsets of size k (k < L)
pruning, such that, e.g., Y_{1,2} takes only three of the four possible value combinations
combine together with the random subspace method and a voting scheme
(Labels Y1, Y2, Y3 are modelled via meta-labels such as Y_{1,2} and Y_{2,3} over inputs X1, X2, X3.)

41 Meta Labels. Improving the label-powerset approach: decomposition of the label set into M subsets of size k (k < L); pruning; combination with the random subspace method and a voting scheme.
  Method                   Inference Complexity
  Label Powerset           O(2^L · D)
  Pruned Sets              O(P · D)
  Decomposition / RAkEL    O(M · 2^k · D)
  Meta Labels              O(M · P' · D')
where P < 2^L, P' < 2^k, and D' < D.

42 Summary of Methods. Two views of a multi-label problem of L labels:
1 L binary problems
2 a multi-class problem with up to 2^L classes
Problem Transformation:
1 transform the data into subproblems (binary or multi-class)
2 apply some off-the-shelf base classifier
or, Algorithm Adaptation:
1 take a suitable single-label classifier (kNN, neural networks, decision trees, ...)
2 adapt it (if necessary) for multi-label classification

43 Outline
1 Introduction
2 Applications
3 Methods
4 Label Dependence
5 Multi-label Classification in Data Streams

44 Label Dependence in MLC. Common approach:
1 present methods to measure label dependence
2 find a structure that best represents this
and then apply classifiers, and compare the results to BR.
Example: link particular labels (nodes) together (CC-based methods), or form particular label subsets (LP-based methods).

45 Label Dependence in MLC. Common approach:
1 present methods to measure label dependence
2 find a structure that best represents this
and then apply classifiers, and compare the results to BR.
Problem: measuring label dependence is expensive, and models built on it often do not improve over models built on random dependence!
Problem: for some metrics (such as Hamming loss / label accuracy), knowledge of label dependence is theoretically unnecessary!

46 Marginal Label Dependence. Marginal dependence: when the joint is not the product of the marginals, i.e.,
  p(y2) ≠ p(y2 | y1), equivalently p(y1) p(y2) ≠ p(y1, y2)
(graphically, a link Y1 — Y2). Estimate from co-occurrence frequencies in the training data.

47 Example: Marginal Label Dependence. Figure: Music dataset, mutual information between the labels (amazed, happy, relaxing, quiet, sad, angry).

48 Example: Marginal Label Dependence. Figure: Scene dataset, mutual information between the labels (beach, sunset, foliage, field, mountain, urban).

49 Marginal Label Dependence. Marginal dependence: when the joint is not the product of the marginals, i.e., p(y1) p(y2) ≠ p(y1, y2). Estimate from co-occurrence frequencies in the training data.
Used for regularization/constraints:
1 ŷ = h(x) makes a prediction
2 ŷ' = g(ŷ) regularizes the prediction

50 Conditional Label Dependence. But at classification time, we condition on the input! Conditional dependence, conditioned on the input observation x:
  p(y2 | y1, x) ≠ p(y2 | x)
Have to build and measure models. Indications of conditional dependence: the performance of LP/CC exceeds that of BR; errors among the binary models are correlated. But what does this mean?

51 Conditional Label Independence. But at classification time, we condition on the input! Conditional independence, conditioned on the input observation x: for example,
  p(y2) ≠ p(y2 | y1) but p(y2 | x) = p(y2 | y1, x)
Have to build and measure models. Indications of conditional dependence: the performance of LP/CC exceeds that of BR; errors among the binary models are correlated. But what does this mean?

52 The LOGICAL Problem. Example (the LOGICAL toy problem): inputs X1, X2 ∈ {0, 1} and labels
  Y1 = X1 OR X2,  Y2 = X1 AND X2,  Y3 = X1 XOR X2
Each label is a logical operation (independent of the others!).
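The toy data itself is easy to generate; a small sketch (the sample size is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 2))        # two binary inputs
    Y = np.column_stack([X[:, 0] | X[:, 1],       # Y1 = OR
                         X[:, 0] & X[:, 1],       # Y2 = AND
                         X[:, 0] ^ X[:, 1]])      # Y3 = XOR

Running BR with a linear base classifier on (X, Y) reproduces the failure on the XOR label, since no linear boundary separates XOR.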

53 The LOGICAL Problem. Figure: decision boundaries of BR (left), CC (middle), LP (right) on the or, and, xor labels.
Table: The LOGICAL problem, base classifier logistic regression.
  Metric          BR     CC     LP
  HAMMING SCORE   0.83   1.00   1.00
  EXACT MATCH     0.50   1.00   1.00
Dependence is introduced by an inadequate model! Dependence depends on the model.

54 The LOGICAL Problem. Figure: Binary Relevance (BR), with per-label models for Y_OR, Y_AND, Y_XOR over (x1, x2): a linear decision boundary (solid line, estimated with logistic regression) is not viable for the Y_XOR label.

55 Solution via Structure. Figure: Classifier chains (CC): conditioning Y_XOR on (x1, x2, y_OR) makes a linear model applicable to Y_XOR.

56 Solution via Multi-class Decomposition. Figure: Label Powerset (LP), with classes Y_[OR,AND,XOR] over (x1, x2): solvable with one-vs-one multi-class decomposition for any (e.g., linear) base classifier.

57 Solution via Conditional Independence. Figure: solution via a latent structure (e.g., a random RBF layer) mapping φ(x) to a new input space z, creating independence: p(y_XOR | z, y_OR, y_AND) ≈ p(y_XOR | z).

58 Solution via a Suitable Base-classifier. Figure: solution via a non-linear classifier (e.g., a decision tree): split on x1 at the root, then on x2; the leaves hold examples, where y = [y_AND, y_OR, y_XOR].

59 Detecting Dependence. Conditional label dependence and the choice of base model are inseparable:
  y_j = h_j(x) + ε_j
  y_k = h_k(x) + ε_k
Figure: the joint distribution of errors ε = [ε_OR, ε_XOR] from logistic regression (left) and a decision tree (right).

60 A Fresh Look at Problem Transformation. Figure: standard methods can be viewed as ("deep"/cascaded) basis functions on the label space, e.g., labels Y1, Y2, Y3 modelled via an inner layer of meta-labels Y_{1,2}, Y_{2,3} over inputs X1, X2, X3.

61 Label Dependence: Summary
Marginal dependence: useful for regularization
Conditional dependence: depends on the model; may be introduced by the model
Should be considered together: base classifier, label structure, inner-layer structure
An open problem; much existing research is relevant (latent-variable models, neural networks, deep learning, ...)

62 Outline
1 Introduction
2 Applications
3 Methods
4 Label Dependence
5 Multi-label Classification in Data Streams

63 Classification in Data Streams. Setting:
the sequence is potentially infinite
high speed of arrival
the stream is one-way
Implications:
work in limited memory
adapt to concept drift

64 Multi-label Streams: Methods
1 Batch-incremental ensemble
2 Problem transformation with an incremental base learner
3 Multi-label kNN
4 Multi-label incremental decision trees
5 Neural networks

65 Batch-Incremental Ensemble. Build regular multi-label models h1, h2, ... on batches/windows of instances (x, y), typically in a [weighted] ensemble, and combine them to predict ŷ for a new x.
A common approach in the literature, and can be surprisingly effective; free choice of base classifier (e.g., C4.5, SVM).
What batch size to use?
Too small: models are insufficient
Too large: slow to adapt
Too many batches: too slow

66 Problem Transformation with an Incremental Base Learner. Use an incremental learner (naive Bayes, SGD, Hoeffding trees) with any problem transformation method (BR, LP, CC, ...).
Simple implementation
Risk of overfitting (e.g., with classifier chains)
Concept drift may invalidate the structure
Limited choice of base learner (must be incremental)

67 Multi-label kNN. Maintain a dynamic buffer of instances, and compare each test instance x̃ to its k nearest neighbours Ne(x̃):
  ŷ_j = 1 if (1/k) Σ_{i : x(i) ∈ Ne(x̃)} y_j^(i) > 0.5, and 0 otherwise
Efficient w.r.t. L, but not w.r.t. D; limited buffer size, so not suitable for all problems.
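A minimal sketch of this prediction rule over a sliding buffer (Euclidean distance is an illustrative choice):

    import numpy as np

    def mlknn_predict(X_buf, Y_buf, x, k=5):
        # Distances from the query to every instance in the buffer
        d = np.linalg.norm(X_buf - x, axis=1)
        nn = np.argsort(d)[:k]                    # indices of the k nearest
        # Label j is predicted 1 if more than half of the neighbours have it
        return (Y_buf[nn].mean(axis=0) > 0.5).astype(int)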

68 ML Incremental Decision Trees. A small sample can suffice to choose a splitting attribute (the Hoeffding bound gives guarantees). As in a regular tree, but with a modified splitting criterion, e.g., the multi-label entropy
  H_ML(S) = − Σ_{j=1}^{L} Σ_{c ∈ {0,1}} P(y_j = c) log2 P(y_j = c)
Examples with multiple labels collect at the leaves (e.g., a tree splitting on x1, then on x2, with label vectors held at the leaves).
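A small sketch of this criterion: the multi-label entropy is simply a sum of per-label binary entropies:

    import numpy as np

    def multilabel_entropy(Y):
        # Y: (N, L) binary label matrix of the examples at a node
        p = Y.mean(axis=0)                    # P(y_j = 1) for each label j
        p = np.clip(p, 1e-12, 1 - 1e-12)      # guard against log(0)
        return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))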

69 ML Incremental Decision Trees. Fast, and usually competitive. But the tree may grow very conservatively, and we need to replace it (or part of it) when the concept changes. Remedy: place multi-label classifiers at the leaves of the tree and wrap it in an ensemble.

70 Neural Networks. Each label is a node: ŷ = f(Wx) over inputs x1, ..., x5 and output nodes y1, ..., y4. Trained with SGD:
  W_{j,k}^{(t+1)} = W_{j,k}^{(t)} + λ g_{j,k}, where g = −∇E(W)
Can be applied natively to streams. One layer = BR; should use hidden layers to model label dependence / improve performance. Hyper-parameter tuning can be tedious. Relatively poor performance in empirical comparisons on standard data streams (improving now with recent advances in SGD and more common use of basis expansion).
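A minimal sketch of one SGD step for a single-layer (BR-equivalent) network with one sigmoid output per label; the cross-entropy error is an assumption of this sketch, not spelled out on the slide:

    import numpy as np

    def sgd_step(W, x, y, lr=0.1):
        # Forward pass: one sigmoid output per label
        y_hat = 1.0 / (1.0 + np.exp(-W @ x))
        # For cross-entropy error, dE/dW = (y_hat - y) x^T; step downhill
        W -= lr * np.outer(y_hat - y, x)
        return W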

71 Multi-label Data Streams: Issues
Overfitting
Class imbalance
Multi-dimensional concept drift
Labelled examples difficult to obtain (semi-supervised setting)
Dynamic label set
Time dependence

72 Multi-label Concept Drift. Consider the relative frequencies of the j-th and k-th labels,
  [ p_j      p_{j,k} ]
  [ p_{j,k}  p_k     ]
(if marginal independence holds, then p_{j,k} = p_j p_k). Possible drift:
p_j increases (the j-th label becomes relatively more frequent)
p_j and p_k both decrease (label cardinality decreasing)
p_{j,k} changes relative to p_j p_k (change in the marginal dependence relation between the labels)
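A small sketch of monitoring these pairwise statistics over a sliding window:

    import numpy as np

    def pairwise_stats(Y_win, j, k):
        # Y_win: (W, L) binary labels in the current window
        p_j = Y_win[:, j].mean()
        p_k = Y_win[:, k].mean()
        p_jk = (Y_win[:, j] * Y_win[:, k]).mean()
        # Under marginal independence p_jk ~= p_j * p_k; a drifting gap
        # signals a change in the dependence relation between j and k
        return p_j, p_k, p_jk - p_j * p_k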

73 Multi-label Concept Drift. And when conditioned on the input x, we consider the relative frequencies/values of the j-th and k-th errors,
  [ ε_j      ε_{j,k} ]
  [ ε_{j,k}  ε_k     ]
(if conditional independence holds, then ε_{j,k} ≈ ε_j ε_k). Possible drift:
ε_j increases (more errors on the j-th label)
ε_j and ε_k both increase (more errors overall)
ε_{j,k} changes relative to ε_j, ε_k (change in the conditional dependence relation)

74 Example. Recall the distribution of errors ε = [ε_OR, ε_XOR]. This shape may change over time, and structures may need to be adjusted to cope.

75 Dealing with Concept Drift. Possible approaches:
Just ignore it: batch models must be replaced anyway, and kNN and SGD adapt; in other cases, use weighted ensembles / a fading factor
Monitor a predictive-performance statistic with a change detector (e.g., window-based detection, ADWIN) and reset the models
Monitor the distribution with a change detector (e.g., window-based, KL divergence) and reset/recalibrate the models (similar to single-labelled data, except the measurement is more complex)

76 Dealing with Unlabelled Instances
Ignore instances with no label
Use active learning to get good labels
Use predicted labels (self-training)
Use an unsupervised process, for example clustering or latent-variable representations

77 Dealing with Unlabelled Instances. Use an unsupervised process, for example clustering or latent-variable representations:
1 z_t = g(x_t)  (propagate the input to a latent representation)
2 ŷ_t = h(z_t)  (predict)
3 update g with (x_t, z_t)
4 update h with (z_t, y_t), if y_t is available
Can also be combined into one single model.
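A schematic sketch of this loop, with a fixed random projection standing in for the unsupervised map g (cf. the random-RBF idea earlier) and one incremental linear model per label as h; these concrete choices, and all dimensions, are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    L, D, d = 4, 20, 10
    R = rng.normal(size=(D, d))                  # fixed random map: z = g(x)
    h = [SGDClassifier(loss="log_loss") for _ in range(L)]
    seen = [False] * L

    def process(x_t, y_t=None):
        z_t = np.tanh(x_t @ R).reshape(1, -1)    # 1. z_t = g(x_t)
        # 2. predict each label (default 0 until that model has seen data)
        y_hat = np.array([h[j].predict(z_t)[0] if seen[j] else 0
                          for j in range(L)])
        # 3./4. g is fixed here; h is updated only when the true labels arrive
        if y_t is not None:
            for j in range(L):
                h[j].partial_fit(z_t, [int(y_t[j])], classes=[0, 1])
                seen[j] = True
        return y_hat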

78 Summary
Multi-label classification is an active area of research, relevant to many real-world problems
Methods that deal appropriately with label dependence can achieve significant gains over a naive approach
Many multi-label problems come in the form of a data stream, incurring particular challenges

79 Multi-label learning from batch and streaming data. Jesse Read, Télécom ParisTech / École Polytechnique. Summer School on Mining Big and Complex Data, 5 September 2016, Ohrid, Macedonia.
